# Homework 10
#### Course Notes
**Language Models:** https://github.com/rjenki/BIOS512/tree/main/lecture17  
**Unix:** https://github.com/rjenki/BIOS512/tree/main/lecture18  
**Docker:** https://github.com/rjenki/BIOS512/tree/main/lecture19

## Question 1
#### Make a language model that uses ngrams and allows the user to specify start words, but uses a random start if one is not specified.

#### a) Make a function to tokenize the text.

In [14]:
install.packages("tokenizers")
install.packages("httr")
library(httr)
library(tokenizers)

tokenize_text <- function(text) {
    tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}


The downloaded binary packages are in
	/var/folders/x3/qc354__13zj86qpggg0h_4480000gn/T//RtmpARQDDU/downloaded_packages

The downloaded binary packages are in
	/var/folders/x3/qc354__13zj86qpggg0h_4480000gn/T//RtmpARQDDU/downloaded_packages


#### b) Make a function generate keys for ngrams.

In [2]:
key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse=sep)
}

#### c) Make a function to build an ngram table.

In [3]:
build_ngram_table <- function(tokens, n, sep = "\x1f") {
    if (length(tokens) < n) return(new.env(parent = emptyenv()))
    tbl <- new.env(parent = emptyenv())
    for (i in seq_len(length(tokens) - n + 1L)) {
        ngram <- tokens[i:(i + n - 2L)]
        next_word <- tokens[i + n - 1L]
        key <- paste(ngram, collapse = sep)
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}

#### d) Function to digest the text.

In [4]:
digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

#### e) Function to digest the url.

In [10]:
digest_url <- function(url, n) {
    res <- httr::GET(url)
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt,n)
}

#### f) Function that gives random start.

In [6]:
random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names=TRUE)
    if (length(keys)==0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed=TRUE)[[1]]
}

#### g) Function to predict the next word.

In [7]:
predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}

#### h) Function that puts everything together. Specify that if the user does not give a start word, then the random start will be used.

In [8]:
make_ngram_generator <- function(tbl, n, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)
    function(start_words = NULL, length = 10L) {
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}

## Question 2
#### For this question, set `seed=2025`.
#### a) Test your model using a text file of [Grimm's Fairy Tails](https://www.gutenberg.org/cache/epub/2591/pg2591.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [18]:
set.seed(2025)
suppressWarnings({
url <- "https://www.gutenberg.org/cache/epub/2591/pg2591.txt"
tbl <- digest_url(url, n = 3)
gen <- make_ngram_generator(tbl, n = 3)

# i
print(gen(start_words = c("the king"), length = 15))

# ii
print(gen(length = 15))
})


[1] "sprang so quickly that the person should be changed into a scrape in the sun"
[1] "occur a distribution of project gutenberg concept of a husband but whoever tasted it when"


#### b) Test your model using a text file of [Ancient Armour and Weapons in Europe](https://www.gutenberg.org/cache/epub/46342/pg46342.txt)
#### i) Using n=3, with the start word(s) "the king", with length=15. 
#### ii) Using n=3, with no start word, with length=15.

In [20]:
set.seed(2025)
suppressWarnings({
   url_anc <- "https://www.gutenberg.org/cache/epub/46342/pg46342.txt"
anc_tbl <- digest_url(url_anc, n = 3)
anc_gen <- make_ngram_generator(anc_tbl, n = 3)

# i
print(anc_gen(start_words = c("the king"), length = 15))

# ii
print(anc_gen(length = 15))
})

[1] "labour and exercise like a dog or a mixture of both figures are armed with"
[1] "furnishes an example in the lives of the aforesaid town and went forward to terminate"


#### c) Explain in 1-2 sentences the difference in content generated from each source.

The content generated from The King in Yellow is more dramatic, narrative, and emotional (fiction sounding) because the source material is a work of fiction with descriptive storytelling. In contrast, the content generated from Ancient Armour and Weapons in Europe is more factual and instructional (non-fiction sounding), reflecting the historical and academic nature of the source text.

## Question 3
#### a) What is a language learning model? 
#### b) Imagine the internet goes down and you can't run to your favorite language model for help. How do you run one locally?

a) A language model is a system that learns patterns in text so it can guess what word is likely to come next. It looks at the words you have already given it, figures out the most probable continuation based on what it has learned, and then generates the next token in the sequence. It is a probability distribution over words. 

b) You could run a model locally by installing Ollama on your computer, use it to download a model using ollama pull followed by the model name, and then talking to that model through a local HTTP endpoint localhost using httr to send requests and jsonlite to parse the JSON responses. 

## Question 4
#### Explain what the following vocab words mean in the context of typing `mkdir project` into the command line. If the term doesn't apply to this command, give the definition and/or an example.
| Term | Meaning |  
|------|---------|
| **Shell** |The shell is the program that lets you interact with the OS. When you enter mkdir project, the shell reads your characters, parses the command, and starts the mkdir process.|
| **Terminal emulator** |The terminal emulator is the application (like macOS Terminal) that hosts the shell and provides the interface where you type mkdir project. It displays the input you type and whatever the shell sends back.  |
| **Process** |A process is something running on your computer. When the shell executes mkdir project, it starts a new process that runs the mkdir program.  |
| **Signal** |A signal is something we send to a process to tell it to do something. For example, pressing Ctrl-C while a command is running sends a signal asking the process to stop.  |
| **Standard input** |Standard input is the stream a process can read characters from. mkdir project does not read anything from input but other commands like cat do.  |
| **Standard output** |Standard output is the stream a process can write characters to. mkdir project does not usually print anything if it is successful, but errors like "directory already exists" appear through standard output if there is one.  |
| **Command line argument** |A command line argument is a value you pass to a command when you run it. For example, when you run mkdir project, "project" is an argument passed to mkdir.  |
| **The environment** |The environment is all the stuff available to the process when it is running. For example, mkdir might use a PWD environment variable to know which directory you are currently in. |

## Question 5
#### Consider the following command `find . -iname "*.R" | xargs grep read_csv`.
#### a) What are the programs?
#### b) Explain what this command is doing, part by part.

a) The programs are find (search directories and list files), grep (searches inside files for matching text), and xargs (takes input from standard input and turns it into command line arguments).

b) First the program find starts searching in the current directory (.). -iname "*.R" means it will find all files whose names end in .R (case-insensitive). Second, the pipe operator (|) takes the output of find and passes it as input to the next program, xargs. xargs then reads the file paths coming from find and turns them into arguments for the grep program. grep read_csv searches inside those files for lines that contain the text read_csv. 

## Question 6
#### Install Docker on your machine. See [here](https://github.com/rjenki/BIOS512/blob/main/lecture18/docker_install.md) for instructions. 
#### a) Show the response when you run `docker run hello-world`.
#### b) Access Rstudio through a Docker container. Set your password and make sure your files show up on the Rstudio server. Type the command and the output you get below.
#### c) How do you log in to the RStudio server?

a) I see the following response when I run the command "docker run hello-world":

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (arm64v8)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/


b) 

Command:
docker run --platform linux/amd64 -it -p 8787:8787 rocker/verse

Output:
The password is set to iew5eezaith7Quei

c) 

1. Go to http://localhost:8787
2. Type in the username as rstudio and the password as iew5eezaith7Quei