# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [12]:
tokenize_text <- function(text) {
  tokenizers::tokenize_words(text, lowercase = TRUE, strip_punct = TRUE)[[1]]
}

#### b) Make a function to generate keys for ngrams.

In [13]:
keys_from <- function(ngram, sep="\x1f") {
  paste(ngram, collapse=sep)
}

#### c) Make a function to build an ngram table.

In [14]:
build_ngram_table <- function(tokens, n, sep = "\x1f") {
  if(length(tokens) < n) return(new.env(parent=emptyenv()))
  tbl <- new.env(parent = emptyenv())
  for (i in seq_len(length(tokens) - n + 1L)) {
    ngram <- tokens[i:(i + n - 2L)]
    next_word <- tokens[i + n - 1L]
    key <- keys_from(ngram, sep = sep)
    counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (next_word %in% names(counts)) {
      counts[[next_word]] <- counts[[next_word]] + 1L
    } else {
      counts[[next_word]] <- 1L
    }
    tbl[[key]] <- counts
  }
  tbl
}
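
To make the structure of the table concrete, here is a small self-contained sketch on a toy token vector (the sentence is made up for illustration). It uses base R's `split()` and `table()` instead of an environment, and a space as the separator for readability, so it is an equivalent view of the counts rather than the function above.

```r
tokens <- c("the", "king", "said", "the", "king", "rode", "the", "queen", "said")
n <- 3L

# Start position of every complete n-gram
starts <- seq_len(length(tokens) - n + 1L)

# Context = first n - 1 words of each n-gram; follower = its last word
contexts <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 2L)], collapse = " "),
  character(1)
)
followers <- tokens[starts + n - 1L]

# For each context, count how often each follower occurs
counts <- lapply(split(followers, contexts), table)

counts[["the king"]]   # "rode" and "said" each follow "the king" once
counts[["the queen"]]  # only "said" follows "the queen"
```

Each context maps to a small named vector of follower counts, which is the same information `build_ngram_table` stores under each key.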

#### d) Function to digest the text.

In [15]:
digest_text <- function(text, n) {
  tokens <- tokenize_text(text)
  build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [16]:
digest_url <- function(url, n) {
  res <- httr::GET(url)
  httr::stop_for_status(res)  # fail early on an HTTP error instead of digesting an error page
  txt <- httr::content(res, as = "text", encoding = "UTF-8")
  digest_text(txt, n)
}

#### f) Function that gives random start.

In [17]:
random_start <- function(tbl, sep = "\x1f") {
  keys <- ls(envir = tbl, all.names = TRUE)
  if (length(keys)==0) stop("No n-grams available. Digest text first.")
  picked <- sample(keys, 1)
  strsplit(picked, sep, fixed = TRUE)[[1]]
}

#### g) Function to predict the next word.

In [18]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
  key <- paste(ngram, collapse = sep)
  counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
  if(length(counts) == 0) return(NA_character_)
  sample(names(counts), size = 1, prob = as.numeric(counts))
}
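
The key step in this function is that `sample()` treats the raw counts as weights, so more frequent followers are proportionally more likely to be picked. A minimal sketch with hypothetical counts (the words and the seed are made up for illustration):

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical follower counts for one context
counts <- c(castle = 3L, forest = 1L)

# Draw many next words; prob is rescaled internally, so raw counts work as weights
draws <- sample(names(counts), size = 10000, replace = TRUE,
                prob = as.numeric(counts))

# Empirical frequencies should be close to 3/4 and 1/4
prop.table(table(draws))
```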

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [19]:
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
  force(tbl); n <- as.integer(n); force(sep)
  function(start_words = NULL, length = 10L) {
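    # Fall back to a random start when no start words are given or when
    # they do not supply exactly n - 1 words of context.
    if (is.null(start_words) || length(start_words) != n - 1L) {
      start_words <- random_start(tbl, sep = sep)
    }
    # Keys are built from lowercased tokens, so match that here.
    word_sequence <- tolower(start_words)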
    for (i in seq_len(max(0L, length - length(start_words)))) {
      ngram <- tail(word_sequence, n - 1L)
      next_word <- predict_next_word(tbl, ngram, sep = sep)
      if (is.na(next_word)) break
      word_sequence <- c(word_sequence, next_word)
    }
    paste(word_sequence, collapse = " ")
  }
}

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15.
#### ii) Using n=3, with no start word, with length=15.

In [9]:
library(httr)
install.packages("tokenizers")
library(tokenizers)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘SnowballC’




In [10]:
set.seed(2025)

grimm_url <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
grimm_tbl <- digest_url(grimm_url, n=3)
grimm_gen <- make_ngram_generator(grimm_tbl, n=3)

print(grimm_gen(start_words=c("the", "king"), length=15))
print(grimm_gen(length=15))

[1] "the king has forbidden me to marry another husband am not i shall ride upon"
[1] "song was over the lake and herself into her little daughter’s hand and was about"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15.
#### ii) Using n=3, with no start word, with length=15.

In [11]:
set.seed(2025)

aawe_url <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"
aawe_tbl <- digest_url(aawe_url, n=3)
aawe_gen <- make_ngram_generator(aawe_tbl, n=3)

print(aawe_gen(start_words=c("the", "king"), length=15))
print(aawe_gen(length=15))

[1] "the king he added to the entire exclusion of the swords were made prisoners the"
[1] "king was campaigning in france denmark germany switzerland and livonia figures 5 and the sword"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

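The Grimm text produces narrative, fairy-tale language about characters and events (a king forbidding a marriage, a little daughter's hand), while the armour text produces expository, historical language (campaigns in named countries, references to figures, swords). Each model can only reproduce word sequences seen in its own source, so the same start words "the king" lead to very different continuations.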

## Question 3
#### a) What is a language learning model?
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

a) A language model is a machine learning model that assigns probabilities to sequences of words (more generally, tokens), usually by predicting the next word given the words before it. Sampling repeatedly from that next-word distribution is what lets the model generate text.
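
For the n-gram models built in Question 1, this next-word distribution can be written out directly. As a sketch for n = 3, with $C(\cdot)$ denoting how often a word sequence occurs in the training text (notation introduced here for illustration, not taken from the course notes):

$$
P(w_t \mid w_{t-2}, w_{t-1}) = \frac{C(w_{t-2}\, w_{t-1}\, w_t)}{\sum_{w'} C(w_{t-2}\, w_{t-1}\, w')}
$$

This is the normalization that `sample(..., prob = counts)` applies implicitly in `predict_next_word`.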

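b) To run one locally, install Ollama, a tool that downloads open-weight models and runs them on your own laptop, with a Docker-like workflow of pulling a model and then running it (`ollama pull <model>`, then `ollama run <model>`). The model has to be downloaded while you still have a connection; after that it runs offline.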

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
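| **Shell** | The program (for example bash or zsh) that reads the line `mkdir project`, splits it into a command and its arguments, finds the `mkdir` program, and runs it. |
| **Terminal emulator** | The window application that displays the shell's prompt and output and passes your keystrokes to the shell. |
| **Process** | A running instance of a program. The shell starts a new process for `mkdir`, which creates the directory and exits. |
| **Signal** | A message sent to a process to tell it to do something, such as stop. `mkdir` finishes too quickly for this to matter, but pressing Ctrl+C during a long-running command sends it `SIGINT`. |
| **Standard input** | The default input stream of a process. `mkdir` does not read from it; a command like `grep` reads standard input when it is given no file. |
| **Standard output** | The default output stream of a process. `mkdir` prints nothing on success; a command like `ls` writes its listing to standard output. |
| **Command line argument** | Extra information given to a command. Here `project` is the argument that tells `mkdir` which directory to create. |
| **The environment** | The set of variables available to the shell and the processes it starts. For example, the shell uses `PATH` to locate the `mkdir` program. |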
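
Several of these terms can also be observed from R. The following is a minimal sketch, assuming a Unix-like system where `mkdir` is on the `PATH`; the target directory is a throwaway temporary path chosen for illustration, not the `project` directory from the question.

```r
# Where would the shell find the program? PATH (part of the environment) decides.
mkdir_path <- Sys.which("mkdir")

# A fresh path under the session temp directory (hypothetical target),
# so nothing in the working directory is touched.
target <- tempfile("project")

# Start a `mkdir` process with one command line argument and wait for it to exit.
# system2() returns the exit status of the process: 0 means success.
status <- system2("mkdir", args = shQuote(target))

dir.exists(target)  # TRUE when the process succeeded
```

Here `system2()` plays the role of the shell: it launches the process, hands it the argument, and reports how it exited.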

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

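a) The programs being used are `find`, `xargs`, and `grep`.

b) `find` starts in the current directory (`.`) and recursively searches it and all its subdirectories for files whose names match `*.R`; `-iname` makes the match case-insensitive, so `*.r` files are included too. The pipe (`|`) sends the list of file paths that `find` prints to the standard input of `xargs`, which reads those paths and passes them as arguments to `grep`. `grep` then searches the contents of each file for `read_csv` and prints the matching lines.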
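
The same search can be sketched in R, which may help map each stage of the pipeline to a familiar function. This is an illustrative equivalent under assumptions, not what the shell runs: the directory and file names below are made up, and the sketch works in a temporary directory.

```r
# Set up a throwaway directory with two scripts (hypothetical file names)
demo_dir <- file.path(tempdir(), "find_grep_demo")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("library(readr)", "d <- read_csv('a.csv')"),
           file.path(demo_dir, "load.R"))
writeLines("x <- 1", file.path(demo_dir, "sub", "other.r"))

# find . -iname "*.R": recursive, case-insensitive match on the file name
r_files <- list.files(demo_dir, pattern = "\\.r$", recursive = TRUE,
                      ignore.case = TRUE, full.names = TRUE)

# xargs grep read_csv: search the lines of each file for the pattern
hits <- lapply(r_files, function(f) grep("read_csv", readLines(f), value = TRUE))
names(hits) <- basename(r_files)

# Keep only files with at least one matching line
hits <- hits[lengths(hits) > 0]
hits
```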

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions.
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

a) Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/


b) docker run -d -p 8787:8787 -e PASSWORD=mypassword -v ~/Desktop/rprojects:/home/rstudio/projects rocker/rstudio

output: 3989f45b8d6421bc0e339d46345ab5ed97da5bc3cc1181d3e96d3933a0c07c6d

c) Go to `localhost:8787` in a web browser, enter `rstudio` as the username, and enter the password set with `-e PASSWORD=...` in part b (here, `mypassword`).