# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [2]:
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "\\s+"))
  tokens[tokens != ""]
}
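
Because the tokenizer above splits only on whitespace, punctuation stays attached to words (for example `said,` and `said` count as different tokens). The sketch below compares it with a hypothetical variant, `tokenize_clean`, which is not part of the assignment and strips punctuation before splitting. Note that this also removes apostrophes, so `don't` becomes `dont`.

```r
# Whitespace tokenizer as defined above, repeated so this block runs on its own.
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "\\s+"))
  tokens[tokens != ""]
}

# Hypothetical variant (an assumption, not the assignment's tokenizer):
# remove punctuation characters before splitting on whitespace.
tokenize_clean <- function(text) {
  no_punct <- gsub("[[:punct:]]+", "", tolower(text))
  tokens <- unlist(strsplit(no_punct, "\\s+"))
  tokens[tokens != ""]
}

example <- "The King said, \"Hello there!\""
raw_tokens <- tokenize(example)          # punctuation stays attached
clean_tokens <- tokenize_clean(example)  # punctuation removed
print(raw_tokens)
print(clean_tokens)
```

Whether to strip punctuation is a design choice: keeping it lets the model reproduce sentence boundaries, while removing it merges more contexts under the same key.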

#### b) Make a function to generate keys for ngrams.

In [3]:
generate_keys <- function(tokens, n) {
  keys <- list()
  L <- length(tokens)
  # if there aren't enough tokens to form any n-gram, return empty list
  if (L <= n) return(keys)

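  # Slide a window over the tokens: each key is n consecutive tokens and
  # next_word is the token that immediately follows that window.
  for (i in seq_len(L - n)) {
    key <- tokens[i:(i + n - 1)]
    next_word <- tokens[i + n]
    keys[[length(keys) + 1]] <- list(key = key, next_word = next_word)
  }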
  keys
}

#### c) Make a function to build an ngram table.

In [4]:
build_ngram_table <- function(tokens, n) {
  table <- list()
  keys <- generate_keys(tokens, n)
  if (length(keys) == 0) return(table)

  for (pair in keys) {
    key_str <- paste(pair$key, collapse = " ")
    if (!key_str %in% names(table)) {
      table[[key_str]] <- character(0)
    }
    table[[key_str]] <- c(table[[key_str]], pair$next_word)
  }
  table
}
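
The loop above checks `names(table)` for every pair, which can get slow on a book-length corpus. As a hedged alternative, the same table can be sketched with base R's `split()`, which groups all following words by key in one pass. The name `build_ngram_table_split` and the toy tokens are assumptions made for illustration.

```r
# Hypothetical alternative to build_ngram_table(): group followers with split().
build_ngram_table_split <- function(tokens, n) {
  L <- length(tokens)
  if (L <= n) return(list())
  starts <- seq_len(L - n)
  # Key i is tokens i .. i + n - 1 joined by spaces.
  keys <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # The follower of key i is the token right after its window.
  split(tokens[starts + n], keys)
}

toy_tokens <- c("the", "king", "was", "old", "and", "the", "king", "was", "wise")
toy_table <- build_ngram_table_split(toy_tokens, 2)
print(toy_table[["the king"]])  # followers seen after "the king"
print(toy_table[["king was"]])  # followers seen after "king was"
```

One difference to keep in mind: `split()` orders the keys alphabetically rather than by first appearance, so `random_start()` may pick a different key for the same seed.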

#### d) Function to digest the text.

In [5]:
digest_text <- function(text, n) {
  tokens <- tokenize(text)
  build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [6]:
digest_url <- function(url, n) {
  text <- paste(readLines(url, warn = FALSE), collapse = " ")
  digest_text(text, n)
}

#### f) Function that gives random start.

In [7]:
random_start <- function(table) {
  sample(names(table), 1)
}

#### g) Function to predict the next word.

In [8]:
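predict_next <- function(table, key) {
  if (key %in% names(table)) {
    # Followers are stored with repeats, so frequent words are drawn more often.
    sample(table[[key]], 1)
  } else {
    NA_character_  # unseen key: signals that generation should stop
  }
}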
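
Since the table stores every observed follower, including duplicates, `sample(..., 1)` picks words in proportion to how often they followed the key. The sketch below checks this empirically; the follower vector and the seed are made-up values used only for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical followers of a single key: "was" appears 3 times out of 4.
followers <- c("was", "was", "was", "said")

# Draw one follower many times and tabulate the observed proportions.
draws <- replicate(10000, sample(followers, 1))
observed <- prop.table(table(draws))
print(observed)  # "was" should come out close to 0.75
```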

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [9]:
generate_text <- function(table, start = NULL, length = 20) {
  # Use the supplied start only if it is an existing key; when no start is
  # given, or the start is not in the table, fall back to a random key.
  if (is.null(start) || !(start %in% names(table))) {
    key <- random_start(table)
  } else {
    key <- start
  }

  output <- unlist(strsplit(key, " "))
  n <- length(output)  # number of words per key

  for (i in seq_len(length)) {
    next_word <- predict_next(table, key)
    if (is.na(next_word)) break  # no known continuation for this key
    output <- c(output, next_word)
    # Slide the window: the new key is the last n words generated so far.
    key <- paste(tail(output, n), collapse = " ")
  }

  paste(output, collapse = " ")
}
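
Note that `generate_keys()` builds keys of exactly `n` words, so a start phrase is only found in the table when it also has `n` words; a two-word start with `n = 3` silently falls back to a random start. The sketch below makes that check explicit. The helper `check_start` and the toy table are hypothetical and exist only for illustration.

```r
# Toy table with 3-word keys, shaped like the output of build_ngram_table(tokens, 3).
toy_table <- list(
  "the king said" = c("to"),
  "king said to" = c("him")
)

# Hypothetical helper: explain whether a start phrase can be used as a key.
check_start <- function(table, start) {
  key_width <- length(strsplit(names(table)[1], " ")[[1]])
  start_width <- length(strsplit(start, " ")[[1]])
  if (start_width != key_width) {
    return(sprintf("start has %d word(s) but keys have %d", start_width, key_width))
  }
  if (!(start %in% names(table))) {
    return("start has the right width but is not in the table")
  }
  "ok"
}

too_short <- check_start(toy_table, "the king")
exact <- check_start(toy_table, "the king said")
print(too_short)
print(exact)
```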

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [12]:
set.seed(2025)
read_corpus <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = " ")
}
grimm_text <- read_corpus("grimms_fairy_tales.txt")
grimm_table_3 <- digest_text(grimm_text, n = 3)
gen1a <- generate_text(grimm_table_3, start = "the king", length = 15)
cat("Grimm, with 'the king':", gen1a, "\n")
gen1b <- generate_text(grimm_table_3, start = NULL, length = 15)
cat("Grimm, random start:", gen1b, "\n")

Grimm, with 'the king': the water, i will change your little hut into a splendid castle.” then the fisherman got up and 
Grimm, random start: of each other. the princess put the piece of cloth in her bosom, mounted her horse, and thought 


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [13]:
armour_text <- read_corpus("ancient_armour_weapons_europe.txt")
armour_table_3 <- digest_text(armour_text, n = 3)
gen2a <- generate_text(armour_table_3, start = "the king", length = 15)
cat("Armour, with 'the king':", gen2a, "\n")
gen2b <- generate_text(armour_table_3, start = NULL, length = 15)
cat("Armour, random start:", gen2b, "\n")

“cannot open file 'ancient_armour_weapons_europe.txt': No such file or directory”


ERROR: Error in file(con, "r"): cannot open the connection
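
The error above means the text file was not in the working directory. One possible guard, sketched here under the assumption that the Gutenberg URL from the question is reachable, is to download the file only when it is missing. The helper name `fetch_corpus` is hypothetical.

```r
# Hypothetical helper: download `url` to `path` only when the file is missing,
# then read the file into a single string, like read_corpus() does.
fetch_corpus <- function(path, url) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, quiet = TRUE)
  }
  paste(readLines(path, warn = FALSE), collapse = " ")
}

# Usage (needs network access the first time it runs):
# armour_text <- fetch_corpus(
#   "ancient_armour_weapons_europe.txt",
#   "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"
# )
```

Alternatively, `digest_url()` from Question 1 reads the URL directly and avoids the local file altogether.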


#### c) Explain in 1-2 sentences the difference in content generated from each source.
The text generated from Grimm’s Fairy Tales is narrative and fairy-tale-like, with common storytelling words (e.g. “once upon a time,” “forest,” “king”), while a model trained on Ancient Armour and Weapons would be expected to produce more technical, historical, and descriptive language (e.g. “warriors,” “iron,” “shields,” “swords”), reflecting the subject matter of that corpus.

## Question 3
#### a) What is a language learning model? 
A language learning model is a statistical or machine-learning system that learns patterns in text so it can predict the next word, generate text, classify language, or otherwise process natural language. It learns these patterns from training data, building probabilities of word sequences so that it can produce human-like outputs.
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?
Running a language model locally means having both the inference code and the model weights saved on your computer before the connection is lost, and then using offline software to execute it (for example, a local runner such as Ollama or llama.cpp).

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** | The shell is the program that interprets your command. When you type `mkdir project`, the shell (e.g., bash, zsh) reads the text, parses it, and runs the `mkdir` program with the argument `project`. |
| **Terminal emulator** | The terminal emulator is the window or application you type into. It passes the text of `mkdir project` to the shell and displays any output. |
| **Process** | A process is a running program. When you enter `mkdir project`, the shell creates a new process to run `mkdir`, which exits as soon as the directory is created. |
| **Signal** | A signal is a message sent to a process to control it; for example, pressing Ctrl+C sends SIGINT. `mkdir project` finishes almost instantly, so it usually doesn't involve signals unless you send one manually. |
| **Standard input** | Standard input (stdin) is the stream of data sent into a process. `mkdir` does not read from stdin. |
| **Standard output** | Standard output (stdout) is the stream a process prints its normal output to. `mkdir project` prints nothing on success; error messages such as "File exists" go to standard error (stderr) instead. |
| **Command line argument** | A command line argument is extra information passed to a program after the command name. Here `project` is the single argument passed to `mkdir`. |
| **The environment** | The environment is a set of variables the shell passes to processes. For example, the shell searches the directories listed in `PATH` to find the `mkdir` program. |
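
Several of these terms can be observed from R itself. The sketch below assumes a Unix-like system where `mkdir` is an executable on the `PATH`; it runs `mkdir` as a child process with one argument, inspects the exit status, and shows where the environment resolves the program. The temporary directory is used so nothing is created in your project.

```r
# Assumes a Unix-like system where `mkdir` is an executable on the PATH.
workdir <- tempfile("hw10_")
dir.create(workdir)
target <- file.path(workdir, "project")

# First call: R starts a child process running mkdir with one argument.
status_ok <- system2("mkdir", args = shQuote(target))

# Second call fails because the directory already exists; mkdir reports this
# on standard error (discarded here) and exits with a non-zero status.
status_err <- system2("mkdir", args = shQuote(target), stderr = FALSE)

print(dir.exists(target))         # the directory was created by the first call
print(c(status_ok, status_err))   # exit statuses of the two processes
print(Sys.which("mkdir"))         # path found by searching the PATH variable
```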

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
The programs being run are `find`, `xargs`, and `grep`.
#### b) Explain what this command is doing, part by part.
The entire command works as follows:
- `find . -iname "*.R"` finds every file whose name ends in `.R` (case-insensitive, so `.r` matches too) in the current directory and all subdirectories.
- The pipe (`|`) sends that list of file names to the standard input of `xargs`.
- `xargs` passes those file names as arguments to `grep read_csv`.
- `grep` searches each file for lines containing `read_csv`.

The result is a list of the lines in your R files where `read_csv` is used.
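
For comparison, roughly the same search can be sketched in R. This is an illustrative equivalent under stated assumptions, not what the shell command runs internally; the helper name `grep_r_files` is hypothetical, and it matches `read_csv` as a fixed string.

```r
# Hypothetical helper mirroring: find . -iname "*.R" | xargs grep read_csv
grep_r_files <- function(root = ".", pattern = "read_csv") {
  # Like find -iname "*.R": recurse and match the extension case-insensitively.
  files <- list.files(root, pattern = "\\.r$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)
  # Like grep: collect the matching lines of each file.
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    matched <- grep(pattern, lines, fixed = TRUE)
    if (length(matched) == 0) return(NULL)
    data.frame(file = f, line = matched, text = lines[matched],
               stringsAsFactors = FALSE)
  })
  hits <- Filter(Negate(is.null), hits)
  if (length(hits) == 0) return(NULL)
  do.call(rbind, hits)
}

# Small demonstration in a temporary directory.
demo_dir <- tempfile("grep_demo_")
dir.create(file.path(demo_dir, "sub"), recursive = TRUE)
writeLines("x <- read_csv('f.csv')", file.path(demo_dir, "a.R"))
writeLines(c("library(readr)", "y <- read_csv('g.csv')"),
           file.path(demo_dir, "sub", "b.r"))
writeLines("read_csv is mentioned here too", file.path(demo_dir, "notes.txt"))

result <- grep_r_files(demo_dir)
print(result[, c("line", "text")])
```

The `.txt` file is skipped because only names ending in `.R` or `.r` are searched, just as with `find -iname "*.R"`.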

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?