# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [1]:
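tokenize <- function(text) {
  text <- tolower(text)
  # Keep letters, digits, apostrophes, and whitespace. The POSIX class
  # [:space:] is used because "\\s" inside [...] is not treated as a
  # whitespace escape by R's default regex engine, which would strip spaces.
  text <- gsub("[^a-z0-9'[:space:]]", "", text)
  tokens <- unlist(strsplit(text, "[[:space:]]+"))
  tokens <- tokens[tokens != ""]  # remove empty tokens
  return(tokens)
}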
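
As a quick sanity check, the sketch below runs the same three steps (lowercase, strip punctuation, split on whitespace) on a made-up sentence. The sample text is an assumption chosen for illustration; it is not taken from the assignment data.

```r
# Hypothetical sample sentence, used only to illustrate the steps
text <- "The King's horse, and the QUEEN!"

# Lowercase, then drop everything except letters, digits, apostrophes, whitespace
clean <- gsub("[^a-z0-9'[:space:]]", "", tolower(text))

# Split on runs of whitespace and discard any empty strings
tokens <- strsplit(clean, "[[:space:]]+")[[1]]
tokens <- tokens[tokens != ""]

print(tokens)
```

Apostrophes are kept so that contractions and possessives such as "king's" stay a single token.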

#### b) Make a function to generate keys for ngrams.

In [None]:
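generate_keys <- function(tokens, n) {
  keys <- list()
  # seq_len() gives an empty loop when there are n or fewer tokens,
  # whereas 1:(length(tokens) - n) would count down from 1
  for (i in seq_len(max(length(tokens) - n, 0))) {
    key <- tokens[i:(i + n - 1)]   # the n words that form the key
    next_word <- tokens[i + n]     # the word that follows the key
    # `next` is a reserved word in R, so it must be backquoted as a name
    keys[[i]] <- list(key = key, `next` = next_word)
  }
  return(keys)
}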

#### c) Make a function to build an ngram table.

In [None]:
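build_ngram_table <- function(keys) {
  table <- list()
  
  for (pair in keys) {
    key_str <- paste(pair$key, collapse = " ")
    
    # Appending to a missing entry works because c(NULL, x) is x, so no
    # explicit initialization is needed. `next` is a reserved word in R,
    # hence the [["next"]] lookup instead of pair$next.
    table[[key_str]] <- c(table[[key_str]], pair[["next"]])
  }
  
  return(table)
}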

#### d) Function to digest the text.

In [None]:
digest_text <- function(text, n = 2) {
  tokens <- tokenize(text)
  keys <- generate_keys(tokens, n)
  table <- build_ngram_table(keys)
  
  list(table = table, tokens = tokens)
}
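
To make the table structure concrete, here is a minimal self-contained sketch on a toy token vector. It uses base R's `split()` rather than the loop-based functions above, purely as an illustration; the tokens and the name `toy_table` are made up.

```r
# Hypothetical toy tokens; n is the number of words in each key
tokens <- c("the", "cat", "sat", "on", "the", "cat", "nap")
n <- 2

# Starting positions of every key that still has a following word
starts <- seq_len(length(tokens) - n)

# Join each run of n tokens into one key string
keys <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
  character(1)
)

# The word that follows each key
followers <- tokens[starts + n]

# Group followers by key: each key maps to every word seen after it
toy_table <- split(followers, keys)

print(toy_table[["the cat"]])
```

Here "the cat" occurs twice, so its entry holds both followers ("sat" and "nap"). Repeated followers are kept as duplicates, which is what makes sampling from an entry proportional to observed frequency.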

#### e) Function to digest the url.

In [None]:
digest_url <- function(url, n = 2) {
  text <- paste(readLines(url, warn = FALSE), collapse = " ")
  digest_text(text, n)
}


#### f) Function that gives random start.

In [None]:
random_start <- function(model) {
  sample(names(model$table), 1)
}


#### g) Function to predict the next word.

In [None]:
predict_next_word <- function(model, key_str) {
  if (key_str %in% names(model$table)) {
    possible <- model$table[[key_str]]
    return(sample(possible, 1))
  } else {
    return(NULL)
  }
}

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [None]:
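generate_text <- function(model, start = NULL, length = 50, n = 2) {
  
  # If no start is specified, choose randomly
  if (is.null(start)) {
    start <- random_start(model)
  }
  
  words <- unlist(strsplit(start, " "))
  
  # A start with fewer than n words matches no key, so extend it to a
  # random key that begins with those words (if one exists)
  if (length(words) < n) {
    all_keys <- as.character(names(model$table))
    matches <- all_keys[startsWith(all_keys, paste0(start, " "))]
    if (length(matches) > 0) {
      words <- unlist(strsplit(sample(matches, 1), " "))
    }
  }
  
  # Generate up to `length` new words; seq_len() handles length = 0 safely
  for (i in seq_len(length)) {
    key_str <- paste(tail(words, n), collapse = " ")
    next_word <- predict_next_word(model, key_str)
    
    if (is.null(next_word)) {
      break  # stop if no continuation exists
    }
    
    words <- c(words, next_word)
  }
  
  paste(words, collapse = " ")
}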
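
The generation loop can also be seen in isolation. The sketch below is a simplified, self-contained version that walks a small hand-written table; `toy_table`, the seed value, and the six-word limit are all assumptions for illustration.

```r
# Hypothetical hand-written table: each 2-word key maps to its observed followers
toy_table <- list(
  "the cat" = c("sat", "nap"),
  "cat sat" = "on",
  "sat on"  = "the",
  "on the"  = "cat"
)

set.seed(1)  # arbitrary seed so the walk is repeatable
words <- c("the", "cat")

for (i in seq_len(6)) {
  # The key is always the last two words generated so far
  key <- paste(tail(words, 2), collapse = " ")
  candidates <- toy_table[[key]]

  # Stop early when the current key was never seen in the text
  if (is.null(candidates)) break

  words <- c(words, sample(candidates, 1))
}

print(paste(words, collapse = " "))
```

The walk ends early whenever it draws "nap", because "cat nap" is not a key in the table. The same thing happens in the full model when a key only appears at the very end of the source text.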

## Question 2
#### For this question, set the random seed to 2025 (`set.seed(2025)`).
#### a) Test your model using a text file of [Grimms' Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

#### c) Explain in 1-2 sentences the difference in content generated from each source.
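
The seed requirement matters because `sample()` drives both the random start and every next-word draw. A minimal sketch of how resetting the seed makes draws repeatable, using a made-up word vector (`vocab` is hypothetical):

```r
# Hypothetical vocabulary, used only to demonstrate reproducible sampling
vocab <- c("king", "queen", "wolf", "fox")

set.seed(2025)
first <- sample(vocab, 3, replace = TRUE)

# Resetting the same seed replays the same sequence of random draws
set.seed(2025)
second <- sample(vocab, 3, replace = TRUE)

print(identical(first, second))  # TRUE
```

Because the generator state advances with every draw, the seed would need to be set again immediately before each call for two runs to produce identical text.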

## Question 3
#### a) What is a large language model? 
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** |  |
| **Terminal emulator** |  |
| **Process** |  |
| **Signal** |  |
| **Standard input** |  |
| **Standard output** |  |
| **Command line argument** |  |
| **The environment** |  |

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access RStudio through a Docker container. Set your password and make sure your files show up on the RStudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?