# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [1]:
tokenize <- function(text) {
  text <- tolower(text)
  unlist(regmatches(text, gregexpr("[a-z']+|[.?!]", text, perl = TRUE)))
}
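
To see what the pattern in `tokenize` keeps and drops, the short sketch below applies the same regular expression to a made-up sentence. The sample string and variable names are assumptions for illustration only; the point is that words (with apostrophes) and sentence-ending punctuation survive, while digits and commas are discarded.

```r
# Hypothetical sample sentence; any short string works here.
sample_text <- "The King's horse ran 3 miles, then stopped. Why?"

# Same pattern as tokenize(): runs of letters/apostrophes, or one of . ? !
pattern <- "[a-z']+|[.?!]"
lowered <- tolower(sample_text)
tokens  <- unlist(regmatches(lowered, gregexpr(pattern, lowered, perl = TRUE)))

print(tokens)
# "the" "king's" "horse" "ran" "miles" "then" "stopped" "." "why" "?"
```

Because sentence-ending punctuation is kept as its own token, the model can learn where sentences tend to end.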


#### b) Make a function to generate keys for ngrams.

In [2]:
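# Keys join the n - 1 context tokens with the ASCII unit separator (\x1F),
# a character the tokenizer never emits, so keys split back unambiguously.
.join_key  <- function(x) paste(x, collapse = "\x1F")
.split_key <- function(k) strsplit(k, "\x1F", fixed = TRUE)[[1]]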

ngram_pairs <- function(tokens, n = 3) {
  if (n < 2) stop("n must be >= 2")
  if (length(tokens) < n)
    return(data.frame(key = character(0), next_token = character(0), stringsAsFactors = FALSE))
  idx  <- seq_len(length(tokens) - n + 1)
  keys <- vapply(idx, function(i) .join_key(tokens[i:(i + n - 2)]), character(1))
  nexts <- tokens[n:length(tokens)]
  data.frame(key = keys, next_token = nexts, stringsAsFactors = FALSE)
}
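
The sliding window in `ngram_pairs` is easier to follow on a tiny input. The sketch below repeats the same indexing on six made-up tokens with `n = 3`. As an assumption for readability, it joins keys with a space rather than the `\x1F` separator used above.

```r
# Hypothetical token vector for illustration.
tokens <- c("the", "king", "said", "to", "the", "king")
n <- 3

# One window per position that still has a following token.
idx   <- seq_len(length(tokens) - n + 1)
keys  <- vapply(idx, function(i) paste(tokens[i:(i + n - 2)], collapse = " "),
                character(1))
nexts <- tokens[n:length(tokens)]

pairs <- data.frame(key = keys, next_token = nexts, stringsAsFactors = FALSE)
print(pairs)
# keys:        "the king" "king said" "said to" "to the"
# next_token:  "said"     "to"        "the"     "king"
```

Each row pairs a context of `n - 1` tokens with the token that followed it, so six tokens yield four rows.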


#### c) Make a function to build an ngram table.

In [3]:
build_ngram_table <- function(pairs) {
  if (!nrow(pairs)) return(list())
  split(pairs$next_token, pairs$key)
}
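
A small sketch of what `split` produces may help here. The pairs below are invented for illustration. Note that repeated keys are grouped into one list entry and that every observed continuation is kept, duplicates included.

```r
# Hypothetical (key, next_token) pairs; keys use a space for readability.
pairs <- data.frame(
  key        = c("the king", "king said", "the king"),
  next_token = c("said", "to", "of"),
  stringsAsFactors = FALSE
)

# split() groups next_token by key and returns a named list.
tbl <- split(pairs$next_token, pairs$key)

print(names(tbl))        # "king said" "the king"
print(tbl[["the king"]]) # "said" "of"
```

Keeping the raw continuations rather than counts means a later `sample()` call draws each continuation in proportion to how often it was seen.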


#### d) Function to digest the text.

In [4]:
digest_text <- function(text, n = 3) {
  toks  <- tokenize(text)
  pairs <- ngram_pairs(toks, n)
  table <- build_ngram_table(pairs)
  list(table = table, n = n)
}


#### e) Function to digest the url.

In [5]:
digest_url <- function(u, n = 3) {
  text <- tryCatch({
    readr::read_file(u)
  }, error = function(e) {
    con <- url(u, open = "rb")
    on.exit(try(close(con), silent = TRUE), add = TRUE)
    paste(readLines(con, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
  })
  # Replace anything that looks like an HTML tag with a space.
  text <- gsub("<[^>]+>", " ", text)
  digest_text(text, n)
}
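
`digest_url` feeds the whole downloaded file to the model, so for Project Gutenberg texts the license header and footer become training text too. The sketch below shows one way this could be trimmed. The helper name `strip_gutenberg` is hypothetical, and the marker lines (`*** START OF ...` and `*** END OF ...`) are an assumption about how the files are laid out; it is not part of the solution above.

```r
# Hypothetical helper: keep only the text between the START and END markers.
# If the markers are not found, the text is returned unchanged.
strip_gutenberg <- function(text) {
  lines <- strsplit(text, "\r?\n")[[1]]
  start <- grep("^\\*\\*\\* ?START OF", lines)
  end   <- grep("^\\*\\*\\* ?END OF", lines)
  if (length(start) && length(end) && start[1] + 1 < end[1]) {
    lines <- lines[(start[1] + 1):(end[1] - 1)]
  }
  paste(lines, collapse = "\n")
}

# Made-up miniature file for illustration.
demo <- paste(
  "header text",
  "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***",
  "once upon a time",
  "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***",
  "license text",
  sep = "\n"
)
body <- strip_gutenberg(demo)
print(body)
```

If used, a call like this would sit between reading the file and `digest_text(text, n)`.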


#### f) Function that gives random start.

In [6]:
random_start <- function(table) {
  key <- sample(names(table), 1)
  .split_key(key)
}


#### g) Function to predict the next word.

In [7]:
predict_next_word <- function(table, history, n) {
  if (length(history) < (n - 1)) return(NA_character_)
  key <- .join_key(tail(history, n - 1))
  if (!key %in% names(table)) return(NA_character_)
  sample(table[[key]], 1)
}
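
Because each table entry stores every observed continuation, `sample(table[[key]], 1)` draws from the empirical distribution for that context. The sketch below makes those implied probabilities explicit for an invented list of continuations.

```r
# Hypothetical continuations observed after one key.
continuations <- c("said", "of", "said", "was", "said")

# Relative frequency of each continuation: the chance sample() picks it.
probs <- prop.table(table(continuations))
print(probs)
#   of said  was
#  0.2  0.6  0.2
```

Here "said" would be chosen about three times in five, matching its share of the observations.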


#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [8]:
.detokenize <- function(tokens) {
  if (!length(tokens)) return("")
  out <- tokens[1]
  if (length(tokens) > 1) {
    for (tok in tokens[-1]) {
      if (tok %in% c(".", "?", "!")) out <- paste0(out, tok) else out <- paste(out, tok)
    }
  }
  out
}

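generate_text <- function(table, n, max_words = 50, start_words = NULL) {
  # Fall back to a random start when no start words are given, when fewer than
  # n - 1 are given, or when their last n - 1 tokens never occur as a key.
  if (is.null(start_words) || length(start_words) < (n - 1) ||
      !.join_key(tail(start_words, n - 1)) %in% names(table)) {
    history <- random_start(table)
  } else {
    history <- tail(start_words, n - 1)
  }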
  generated <- history
  for (i in seq_len(max_words)) {
    nxt <- predict_next_word(table, history, n)
    if (is.na(nxt)) break
    generated <- c(generated, nxt)
    history <- tail(generated, n - 1)
  }
  .detokenize(generated)
}
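
One thing to watch: `generate_text` silently uses a random start when the start words are not found as a key, and keys are all lowercase because `tokenize` lowercases the text. A start such as `c("The", "King")` would therefore be ignored. The sketch below shows a hypothetical helper, `normalize_start`, that could be applied to user input first; the name and behavior are assumptions, not part of the solution above.

```r
# Hypothetical helper: lowercase the start words and split them on anything
# that is not a letter or apostrophe, mirroring the word part of tokenize().
normalize_start <- function(start) {
  if (is.null(start)) return(NULL)
  words <- unlist(strsplit(tolower(paste(start, collapse = " ")), "[^a-z']+"))
  words[nzchar(words)]
}

print(normalize_start("The King"))          # "the" "king"
print(normalize_start(c("  The", "King!"))) # "the" "king"
```

The result could then be passed as `start_words`, for example `start_words = normalize_start("The King")`.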


## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.
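
Setting the seed fixes the sequence of random draws, so results depend on both the seed and the order in which sampling calls run. A minimal sketch with made-up values:

```r
# Hypothetical choices to sample from.
choices <- c("a", "b", "c", "d", "e")

set.seed(2025)
first <- sample(choices, 3)

set.seed(2025)          # resetting the seed replays the same draws
second <- sample(choices, 3)

print(identical(first, second))  # TRUE
```

In the cells below the seed is set once, so each later `generate_text` call continues the same random stream; rerunning a single cell on its own may give different text.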

In [9]:
set.seed(2025)
n  <- 3L
L  <- 15L  # number of tokens to generate
u_grimm  <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
u_armour <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"

m_grimm  <- digest_url(u_grimm,  n = n)
m_armour <- digest_url(u_armour, n = n)
# i) n=3, start = "the king", length = 15
cat("a.i:\n")
out_ai  <- generate_text(m_grimm$table, m_grimm$n, max_words = L, start_words = c("the","king"))
cat(out_ai, "\n\n")

# ii) n=3, no start word, length = 15
cat("a.ii:\n")
out_aii <- generate_text(m_grimm$table, m_grimm$n, max_words = L)
cat(out_aii, "\n\n")


a.i:
the king said to the king and accused him of his friends and both studied law at 

a.ii:
horsemen would not rest until the liquid ran out of the wine ran out. then he 



#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [10]:
# i) n=3, start = "the king", length = 15
cat("b.i:\n")
out_bi  <- generate_text(m_armour$table, m_armour$n, max_words = L, start_words = c("the","king"))
cat(out_bi, "\n\n")

# ii) n=3, no start word, length = 15
cat("b.ii:\n")
out_bii <- generate_text(m_armour$table, m_armour$n, max_words = L)
cat(out_bii, "\n\n")


b.i:
the king of england have been found with the description left us by sidonius as forming part 

b.ii:
this incident gives it in the pourpointed chausson worked in the last of the weapons disclosed by 



#### c) Explain in 1-2 sentences the difference in content generated from each source.

In [None]:
# Grimm’s text produces narrative, character‑driven phrases (story events, dialogue, motifs), whereas the armour text yields technical, expository language about materials, weapons, and historical context; starting from “the king” fits Grimm naturally but in the armour corpus it is rarer and tends to drift quickly into descriptive prose.

## Question 3
#### a) What is a language model?
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

In [None]:
# A) A model that assigns probabilities to token sequences and predicts the next token from context; modern LLMs are neural (usually Transformers) trained via next‑token prediction on large corpora.
# B) Install Ollama and check the version with `ollama -v`. Pull a model locally with `ollama pull gemma3:1b`, then start the Ollama API server with `ollama serve`. If the server reports that the port is already in use, check what is listening with `lsof -i :11434`.
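
Once the server is running, it can be queried from R. The sketch below is a hedged example rather than a tested recipe: it assumes the `httr2` package is installed, that the server listens on the default `http://localhost:11434`, and that the `/api/generate` endpoint accepts `model`, `prompt`, and `stream` fields. The function name `ask_ollama` is hypothetical.

```r
# Hypothetical helper: send one prompt to a local Ollama server and return
# the generated text. Requires httr2 and a running `ollama serve`.
ask_ollama <- function(prompt, model = "gemma3:1b",
                       base_url = "http://localhost:11434") {
  req  <- httr2::request(paste0(base_url, "/api/generate"))
  req  <- httr2::req_body_json(
    req,
    list(model = model, prompt = prompt, stream = FALSE)
  )
  resp <- httr2::req_perform(req)
  httr2::resp_body_json(resp)$response
}

# Example call (commented out because it needs the server to be running):
# ask_ollama("Explain an n-gram model in one sentence.")
```

Setting `stream = FALSE` asks for a single JSON reply instead of a stream of partial responses, which keeps the parsing simple.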

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** |  |
| **Terminal emulator** |  |
| **Process** |  |
| **Signal** |  |
| **Standard input** |  |
| **Standard output** |  |
| **Command line argument** |  |
| **The environment** |  |

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access RStudio through a Docker container. Set your password and make sure your files show up on the RStudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?