# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [None]:
install.packages("tokenizers")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘SnowballC’




In [None]:

library(httr)
library(tokenizers)
library(stringr)

tokenize_text <- function(text) {
    tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}

#### b) Make a function to generate keys for ngrams.

In [None]:
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse=sep)
}
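
A minimal sketch of how such a key behaves, assuming the same `"\x1f"` (ASCII unit separator) default as above. The bigram `c("the", "king")` is a hypothetical example; the point is only that joining and splitting on a character that does not occur in the tokens is a lossless round trip.

```r
# Hypothetical toy check: join a two-word context into one key, then split it back.
sep <- "\x1f"                        # ASCII unit separator, the default used above
ngram <- c("the", "king")            # example context, not taken from any corpus
key <- paste(ngram, collapse = sep)  # the same operation key_from() performs
parts <- strsplit(key, sep, fixed = TRUE)[[1]]
print(identical(parts, ngram))
```

The split-back step is the one `random_start` relies on to turn a stored key into start words again.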

#### c) Make a function to build an ngram table.

In [None]:
build_ngram_table <- function(tokens, n, sep = "\x1f") {
    n <- as.integer(n)
    # A context of n - 1 words needs n >= 2; with n = 1, i:(i + n - 2L) would count backwards.
    if (n < 2L) stop("n must be at least 2.")
    tbl <- new.env(parent = emptyenv())
    if (length(tokens) < n) return(tbl)
    for (i in seq_len(length(tokens) - n + 1L)) {
        # Key: the first n - 1 words of the window. Value: counts of the word that follows.
        ngram <- tokens[i:(i + n - 2L)]
        next_word <- tokens[i + n - 1L]
        key <- key_from(ngram, sep = sep)
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}
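
To make the table's contents concrete, here is a small cross-check in base R. It is a sketch under assumptions, not the function above: the token vector is a made-up example, and `split()` plus `table()` stand in for the environment-based counting. The grouping it produces (context of n - 1 words mapped to next-word counts) is the same idea.

```r
# Toy token vector (hypothetical example text, already tokenized).
tokens <- c("the", "king", "said", "the", "king", "went", "the", "king", "said")
n <- 3L
sep <- "\x1f"

# Number of n-word windows that fit in the token vector.
n_windows <- length(tokens) - n + 1L

# Context = first n - 1 words of each window; next word = the last word of the window.
contexts <- vapply(
  seq_len(n_windows),
  function(i) paste(tokens[i:(i + n - 2L)], collapse = sep),
  character(1)
)
next_words <- tokens[seq_len(n_windows) + n - 1L]

# One table of next-word counts per context.
counts_by_context <- lapply(split(next_words, contexts), table)

the_king <- counts_by_context[[paste(c("the", "king"), collapse = sep)]]
print(the_king)
```

In this toy input, "the king" is followed by "said" twice and by "went" once, so "said" should be drawn about twice as often when generating.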

#### d) Function to digest the text.

In [None]:
digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [None]:
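digest_url <- function(url, n) {
    res <- httr::GET(url)
    # Fail early with a clear error if the download did not succeed (e.g. HTTP 404).
    httr::stop_for_status(res)
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt, n)
}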

#### f) Function that gives random start.

In [None]:
random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names=TRUE)
    if (length(keys)==0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed=TRUE)[[1]]
}

#### g) Function to predict the next word.

In [None]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}
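
The `prob` argument is what makes frequent continuations more likely. A minimal sketch of that behaviour, assuming hypothetical counts for a single context (the seed and the number of draws are arbitrary choices for illustration):

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical next-word counts for one context.
counts <- c(said = 2L, went = 1L)

# Draw many times with the counts as weights; sample() does not need them to sum to one.
draws <- replicate(3000, sample(names(counts), size = 1, prob = as.numeric(counts)))

observed <- prop.table(table(draws))
expected <- counts / sum(counts)
print(round(observed[names(expected)], 2))
print(round(expected, 2))
```

The observed shares should land close to 2/3 and 1/3, which is the sense in which the model is "trained": it replays the relative frequencies seen in the source text.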

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [None]:
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)
    function(start_words = NULL, length = 10L) {
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}
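
Two details of the loop are easy to miss: the requested length includes the start words, and generation stops early when the current context was never seen. The sketch below illustrates both with a self-contained toy version. It is not the generator above: the corpus is made up, a plain named list replaces the environment, and a space is used as the key separator for readability.

```r
# Toy corpus (hypothetical). Every two-word context has exactly one continuation,
# so the walk is deterministic and easy to check by hand.
tokens <- c("once", "upon", "a", "time", "there", "lived", "a", "king")
n <- 3L

# Map each (n - 1)-word context to the words observed after it.
n_windows <- length(tokens) - n + 1L
contexts <- vapply(seq_len(n_windows),
                   function(i) paste(tokens[i:(i + n - 2L)], collapse = " "),
                   character(1))
followers <- split(tokens[seq_len(n_windows) + n - 1L], contexts)

generate <- function(start_words, max_words) {
  out <- start_words
  while (length(out) < max_words) {
    key <- paste(tail(out, n - 1L), collapse = " ")
    candidates <- followers[[key]]
    if (is.null(candidates)) break  # unseen context: stop early
    # Picking uniformly from the raw follower list is equivalent to count-weighted sampling.
    out <- c(out, candidates[sample.int(length(candidates), 1L)])
  }
  paste(out, collapse = " ")
}

short <- generate(c("once", "upon"), max_words = 5L)
full  <- generate(c("once", "upon"), max_words = 20L)
print(short)
print(full)
```

Here `short` has exactly five words including the two start words, while `full` stops at eight words because the context "a king" has no recorded continuation.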

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tales](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15.
#### ii) Using n=3, with no start word, with length=15.

In [None]:
set.seed(seed = 2025)

url <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

In [None]:
print(gen3(start_words = c("the", "king"), length = 15))


[1] "the king their leader fawkes who now surrendered himself at coventry was banished from the"


In [None]:
print(gen3(length = 15))

[1] "a hillock where during the whole of the arbalest of this day had no uniform"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15.
#### ii) Using n=3, with no start word, with length=15.

In [None]:
url <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

In [None]:
print(gen3(start_words = c("the", "king"), length = 15))

[1] "the king qe nul c͂hr ne esquier qe serra de meisme l'estat de chevalerie car"


In [None]:
print(gen3(length = 15))

[1] "ardenti nimium prorumpere tandem _vix obstat ferro fabricata patena recocto_ qua bene munierat pectus sibi"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

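The most obvious difference is the language of the output: the text generated in part (a) is English, while the text generated in part (b) is not (it reads as Latin and what appears to be Old French). A trigram model can only reproduce word sequences that occur in its source, so each output mirrors the vocabulary and language of the text it was built from.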

## Question 3
#### a) What is a language learning model?
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

a) A language learning model (the abbreviation LLM usually stands for large language model) is a model trained on text data to generate human language, typically by predicting the next word or token from the words that precede it. More advanced LLMs can take a prompt as input and produce a response relevant to what was asked.

b) I would run a model locally on my computer with Ollama. This needs an internet connection once, to install Ollama and download the model weights (for example with `ollama pull`); after that, `ollama run` works offline.

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** | The command-line interpreter running inside the terminal (for example, bash or zsh). It reads the line `mkdir project`, splits it into a program name and arguments, and runs the `mkdir` program. |
| **Terminal emulator** | The window in which you type commands. It emulates a physical terminal, passing keystrokes to the shell and displaying the shell's output. |
| **Process** | A running instance of a program. Typing `mkdir project` makes the shell start a `mkdir` process, which creates the directory and then exits; the shell then waits for the next command. |
| **Signal** | An asynchronous notification sent to a process, often to interrupt or terminate it. `mkdir` finishes too quickly for this to matter here, but pressing Ctrl+C during a long-running command sends it SIGINT. |
| **Standard input** | The default source of input for a process, normally the keyboard. `mkdir` does not read from standard input. |
| **Standard output** | The default destination for a process's output, normally the terminal. `mkdir` prints nothing on success; error messages go to standard error instead. |
| **Command line argument** | Additional pieces of information given after the program name. Here, `project` is the command line argument that tells `mkdir` which directory to create. |
| **The environment** | A collection of variable-value pairs available to a process as additional context. When the shell starts the `mkdir` process, it passes along a copy of its environment, which includes variables such as \$PATH and \$HOME. |
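
Several of these terms can be observed from R itself. The sketch below is illustrative only and assumes a Unix-like system where `mkdir` is an executable on the PATH. The target directory under `tempdir()` is a hypothetical location, chosen so nothing is created in the working directory.

```r
# Hypothetical target so the example does not touch the working directory.
target <- file.path(tempdir(), "project")
unlink(target, recursive = TRUE)  # start clean so the example can be rerun

# Process + command line argument: R starts a mkdir process and passes the path to it.
status <- system2("mkdir", args = shQuote(target))

# Exit status 0 means the process succeeded; mkdir writes nothing to standard output.
print(status)
print(dir.exists(target))

# The environment: variables such as PATH are visible to R and are inherited by child processes.
print(nzchar(Sys.getenv("PATH")))
```

Running the `system2()` call a second time without the `unlink()` would return a non-zero status, because `mkdir` reports an error (on standard error) when the directory already exists.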

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

a) There are three programs: `find`, `xargs`, and `grep`.

b) `find . -iname "*.R"` searches the current directory and all of its subdirectories for files whose names end in `.R`; because `-iname` is case-insensitive, `.r` files match too. It prints the path of each matching file to standard output. The pipe (`|`) connects that standard output to the standard input of the next command. `xargs` reads the file paths from its standard input and appends them as command-line arguments to `grep read_csv`, so it effectively runs `grep read_csv file1 file2 ...`. `grep` then searches each of those files and prints every line that contains the string "read_csv".
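
The same two-stage idea (list matching files, then search inside them) can be sketched in R. This is an analogue for illustration, not what the shell pipeline executes: the sandbox directory and its three files are made up, and `basename()` is used for brevity where `grep` would print the full path.

```r
# Hypothetical sandbox so the sketch does not depend on the current directory.
root <- file.path(tempdir(), "find_demo")
unlink(root, recursive = TRUE)
dir.create(file.path(root, "sub"), recursive = TRUE)
writeLines(c("library(readr)", "d <- read_csv('a.csv')"), file.path(root, "load.R"))
writeLines("x <- 1", file.path(root, "sub", "other.r"))
writeLines("read_csv is mentioned here", file.path(root, "notes.txt"))

# Stage 1 (find . -iname "*.R"): case-insensitive match on the file name, recursing into subdirectories.
r_files <- list.files(root, pattern = "\\.r$", ignore.case = TRUE,
                      recursive = TRUE, full.names = TRUE)

# Stage 2 (xargs grep read_csv): search every matched file for the literal string.
hits <- unlist(lapply(r_files, function(f) {
  lines <- readLines(f, warn = FALSE)
  matched <- grep("read_csv", lines, fixed = TRUE, value = TRUE)
  if (length(matched) > 0) paste0(basename(f), ":", matched) else character(0)
}))
print(hits)
```

Both `load.R` and `sub/other.r` pass the first stage, but only `load.R` contains the string; `notes.txt` mentions it yet is never searched because it fails the file-name match.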

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions.
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

a) https://gyazo.com/2a1088666e17b616995e3f8fe1ec64b1

b) Command: `docker run -it -p 8787:8787 rocker/verse`

Output:
The password is set to poogoosh1sheeDik
If you want to set your own password, set the PASSWORD environment variable. e.g. run with:
docker run -e PASSWORD=<YOUR_PASS> -p 8787:8787 rocker/rstudio

Screenshot: https://gyazo.com/12cbf5eee2f21cfd0d1558e573c56633


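c) Open http://localhost:8787 in a browser (the port mapped by `-p 8787:8787`), then log in with the username `rstudio` and either the password printed when the container started or your own password if you set one with `-e PASSWORD=<YOUR_PASS>`.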