# Language Models
**NEW** required R packages for this code (*already installed*): `httr`, `jsonlite`, `tokenizers`, `stringr`, `R6`, `digest`, and `viridis`

## Why do we need `httr` and `jsonlite`?
#### `httr`
Your **web browser** is a program that requests content (pages, documents, and so on) over HTTP. It converts what you type in the address bar into an HTTP request.  
A **language model** is a type of machine learning model that assigns probabilities to sequences of words, which lets it process and generate human language; ChatGPT is built on one. Most major hosted language models expose an HTTP API, which lets you send HTTP requests to the provider (for example, OpenAI) from a program (*not a browser*) and get responses from the language model.  
`httr` is a library to **make HTTP requests and look at the results**.
#### `jsonlite`
Your web browser runs **JavaScript**, which is a programming language. Like most programming languages, it has ways of representing data, such as lists, arrays, and tables with named elements. **JSON** is essentially a subset of JavaScript syntax for writing down data. If you're communicating with an *HTTP endpoint* to exchange data rather than to load a web page, that endpoint will almost always exchange the *data in the form of JSON*.  
`jsonlite` is a library to **parse JSON into R objects** (and to convert R objects back into JSON).
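
As a small illustration of what that parsing looks like, here is a sketch using `jsonlite::fromJSON` and `jsonlite::toJSON`. The JSON text and its field names (`model`, `choices`, `text`, `prompt`, `max_tokens`) are made up for this example and do not describe any particular provider's API.

```r
library(jsonlite)

# A made-up JSON response (hypothetical field names, not a real API's format)
json_text <- '{"model": "example-model", "choices": [{"text": "hello"}, {"text": "world"}]}'

# Parse the JSON text into R objects:
# JSON objects become named lists, arrays of objects become data frames
parsed <- jsonlite::fromJSON(json_text)
print(parsed$model)
print(parsed$choices$text)

# Going the other way: turn an R list into JSON text for a request body
# (auto_unbox = TRUE writes length-one vectors as scalars instead of arrays)
request_body <- jsonlite::toJSON(
    list(prompt = "Once upon a time", max_tokens = 16),
    auto_unbox = TRUE
)
print(request_body)
```

The parsed result is an ordinary R list, so the usual `$` and `[[` accessors work on it.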

## Back To Language Models
A **language model** is a **probability distribution over words**. You give it some input, which conditions the distribution, and you sample an output from it. In practice, it predicts the next word (*token*) given the previous words.
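
One way to write this down, assuming a text is a sequence of tokens $w_1, \dots, w_T$ (notation introduced here only for illustration), is the chain rule of probability, which splits the probability of the whole sequence into a product of next-token predictions:

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
$$

Each factor on the right is "the next token given the previous tokens", which is the quantity a language model estimates.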

### Markov Model
A **Markov model** is a mathematical tool that predicts future states based only on the current state, without considering the earlier history. In the code below, the state is the current word.
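
For text, taking the state to be the current word means approximating $P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1})$, where $w_t$ is the token at position $t$. A minimal sketch of that idea, using a made-up toy corpus rather than the book used in the next cell, is to count word pairs and normalise the counts into probabilities:

```r
# Toy corpus, made up for illustration
toy_tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")

# Pair each word with the word that follows it
current_words <- head(toy_tokens, -1)
next_words <- tail(toy_tokens, -1)

# Count how often each next word follows each current word
bigram_counts <- table(current_words, next_words)

# Normalise each row so it sums to 1: an estimate of P(next word | current word)
transition_probs <- prop.table(bigram_counts, margin = 1)

# The estimated distribution over words that follow "the"
print(round(transition_probs["the", ], 2))
```

Sampling the next word from one row of this table is the same operation that `sample(..., prob = ...)` performs on the raw counts in the cell below.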

In [1]:
# required libraries
library(httr)
library(tokenizers)
library(stringr)

# Step 1. Fetch text from URL
url <- "https://www.gutenberg.org/cache/epub/10662/pg10662.txt" # Text file of The Night Land by William Hope Hodgson
resp <- httr::GET(url)
text <- httr::content(resp, as = "text", encoding= "UTF-8")

# Step 2. Tokenize the text
tokens <- tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]

# Step 3. Build the Markov model (bigram-based)
markov_model <- new.env(parent = emptyenv())

get_counts <- function(env, key) {
    if (!is.null(env[[key]])) env[[key]] else {
        x <- integer(0); names(x) <- character(0); x
    }
}

if (length(tokens) > 1) {
    for (i in seq_len(length(tokens)-1)) {
        current_word <- tokens[i]
        next_word <- tokens[i+1]
        counts <- get_counts(markov_model, current_word)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        markov_model[[current_word]] <- counts
    }
}

# Step 4. Function to predict the next word
predict_next_word <- function(word, model=markov_model) {
    counts <- get_counts(model, word)
    if (length(counts)==0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}

# Step 5. Generate a sequence
generate_text <- function(start_word, length=10) {
    word_sequence <- c(start_word)
    for (i in seq_len(length - 1)) {
        next_word <- predict_next_word(tail(word_sequence, 1))
        if (is.na(next_word)) break # stop if the current word was never followed by anything
        word_sequence <- c(word_sequence, next_word)
    }
    paste(word_sequence, collapse= " ")
}

# Example usage
print(generate_text("love", 45))
print(generate_text("she", 10))

[1] "love and now this purpose and brutish men as i set out mine arms a dull glowing of the end to a proper to make an illusion of my will of my head from out of my head piece of her quiet for after that"
[1] "she to have read the diskos that we travelled over"


Looking at these outputs, each pair of adjacent words makes sense, but the sentences as a whole do not. We are only using one word of context, so **our model will improve if we give it more context**.  

The next cell uses **n-grams**: sequences of n consecutive tokens. The model looks at the previous n - 1 tokens and uses them to predict the nth.
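
Before the full implementation, here is a toy sketch of how a token sequence is cut into (context, next word) pairs for n = 3. The sentence is made up, and the variable names are only for illustration.

```r
# Made-up token sequence for illustration
toy_tokens <- c("the", "night", "land", "is", "dark")
n <- 3L

# Each window of n tokens starts at one of these positions
starts <- seq_len(length(toy_tokens) - n + 1L)

# The first n - 1 tokens of each window form the context ...
contexts <- vapply(
    starts,
    function(i) paste(toy_tokens[i:(i + n - 2L)], collapse = " "),
    character(1)
)

# ... and the last token of the window is the word to predict
next_words <- toy_tokens[starts + n - 1L]

print(data.frame(context = contexts, next_word = next_words))
```

The table built in the next cell stores the same pairs, but as counts of next words per context.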

In [2]:
library(httr)
library(tokenizers)

tokenize_text <- function(text) {
    tokenizers::tokenize_words(text, lowercase=TRUE, strip_punct=TRUE)[[1]]
}

key_from <- function(ngram, sep = "\x1f") {
    paste(ngram, collapse=sep)
}

build_ngram_table <- function(tokens, n, sep = "\x1f") {
    if (length(tokens) < n) return(new.env(parent = emptyenv()))
    tbl <- new.env(parent = emptyenv())
    for (i in seq_len(length(tokens) - n + 1L)) {
        ngram <- tokens[i:(i + n - 2L)]
        next_word <- tokens[i + n - 1L]
        key <- paste(ngram, collapse = sep)
        counts <- if (!is.null(tbl[[key]])) tbl[[key]] else integer(0)
        if (next_word %in% names(counts)) {
            counts[[next_word]] <- counts[[next_word]] + 1L
        } else {
            counts[[next_word]] <- 1L
        }
        tbl[[key]] <- counts
    }
    tbl
}

digest_text <- function(text, n) {
    tokens <- tokenize_text(text)
    build_ngram_table(tokens, n)
}

digest_url <- function(url, n) {
    res <- httr::GET(url)
    txt <- httr::content(res, as = "text", encoding = "UTF-8")
    digest_text(txt,n)
}

random_start <- function(tbl, sep = "\x1f") {
    keys <- ls(envir = tbl, all.names=TRUE)
    if (length(keys)==0) stop("No n-grams available. Digest text first.")
    picked <- sample(keys, 1)
    strsplit(picked, sep, fixed=TRUE)[[1]]
}

predict_next_word <- function(tbl, ngram, sep = "\x1f") {
    key <- paste(ngram, collapse = sep)
    counts <- if(!is.null(tbl[[key]])) tbl[[key]] else integer(0)
    if (length(counts) == 0) return(NA_character_)
    sample(names(counts), size=1, prob=as.numeric(counts))
}

make_ngram_generator <- function(tbl, n, sep = "\x1f") {
    force(tbl); n <- as.integer(n); force(sep)
    function(start_words = NULL, length = 10L) {
        if ((is.null(start_words)) || length(start_words) != n - 1L) {
            start_words <- random_start(tbl, sep=sep)
        }
        word_sequence <- start_words
        for (i in seq_len(max(0L, length - length(start_words)))) {
            ngram <- tail(word_sequence, n - 1L)
            next_word <- predict_next_word(tbl, ngram, sep=sep)
            if (is.na(next_word)) break
            word_sequence <- c(word_sequence, next_word)
        }
        paste(word_sequence, collapse= " ")
    }
}

# Using it (n=3)
url <- "https://www.gutenberg.org/cache/epub/10662/pg10662.txt"
tbl3 <- digest_url(url, n=3)
gen3 <- make_ngram_generator(tbl3, n=3)

print(gen3(length=128))

# Using it (n=5)
url <- "https://www.gutenberg.org/cache/epub/10662/pg10662.txt"
tbl5 <- digest_url(url, n=5)
gen5 <- make_ngram_generator(tbl5, n=5)

print(gen5(length=128))

[1] "the refuge of humanity and surely it did be very sedate outward and to set you the working of her as i have no knowledge this way as i stood utter still was that i be not to be a lack because that my tales concerning the olden sea bed did be quiet as that eternity and to this side there was a great caution and we only to the front and so was she so utter happy and she took odd leave with her and surely it alway now to shake the aether half across the abyss of the light from the terror of the world the deep valley with redness so that in that it had been that the dead the diskos made a smooth place"
[1] "that did go upward for ever into the everlasting night and so i did be an utter mystery and deathly dark beyond the shining of the morning sun and know it by name and the meaning of aught else and yet as i do strive to make plain unto you because that this thing must be and because that i went with a very wary hearing i heard the sound of it running very swiftly and coming nigh and

### Can I just use bigger and bigger lookbacks to get a better language model?
No! You need balance. The more words you use in each n-gram, the fewer times each n-gram appears in the training text. At a certain point, every context has only one observed continuation, and the model simply reproduces the input text. We don't have enough training data to use a substantial amount of context.
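
A rough way to see this effect is to measure, for each n, the share of contexts that were only ever followed by a single word. The sketch below does this on a made-up toy text; `share_deterministic` is a hypothetical helper written for this example, not part of the code above.

```r
# Made-up toy text for illustration
toy_text <- "the cat sat on the mat and the dog sat on the rug and the cat saw the dog on the mat"
toy_tokens <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# Share of contexts (the previous n - 1 tokens) with exactly one observed next word
share_deterministic <- function(tokens, n) {
    starts <- seq_len(length(tokens) - n + 1L)
    contexts <- vapply(
        starts,
        function(i) paste(tokens[i:(i + n - 2L)], collapse = " "),
        character(1)
    )
    next_words <- tokens[starts + n - 1L]
    # Number of distinct next words seen after each context
    n_options <- tapply(next_words, contexts, function(w) length(unique(w)))
    mean(n_options == 1)
}

shares <- vapply(2:5, function(n) share_deterministic(toy_tokens, n), numeric(1))
names(shares) <- paste0("n=", 2:5)
print(round(shares, 2))
```

On this toy text the share rises with n. Once it reaches 1, generation has no choices left to make and can only replay the original text.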

### Ways To Think of A Solution
1. Track concepts instead of words. Assume that tokens further in the past are less interesting, so you only need to know their parts of speech or what type of object they are. But if we're looking at Alice in Wonderland, we want to generate something about the White Rabbit, not a Colored Thing.
2. Choose which details to pay attention to in the history of the text. Create a function that takes the current state of the model and returns the parts that are relevant to generating the next token. We need a **function that can pay attention to the text contextually**.

With enough training and enough degrees of freedom, neural networks can do an adequate job of looking at an input text and generating a new token. This tracks the rough meaning of words but doesn't depend on specific words.

### Attention Mechanism
The idea of how to solve our issue with language models lies in the attention mechanism of neural networks.  
**Transformers** take our entire text as input and learn how to pull out the words that are important to the current word. With this extra information, we're able to produce a uniform representation that we can pass to a neural network.  
Then, when we're training the neural network, we have an objective function and a gradient, and the network learns how to do all of this from data.  
The general idea from the Markov model of taking previous words in your corpus, and calculating a probability distribution for the next word still applies. 
![](attn_mech.png)
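
To make "pulling out the words that are important to this word" a little more concrete, here is a minimal sketch of scaled dot-product attention, the weighting step commonly used inside transformers. Everything numeric here is an assumption for illustration: the embeddings and the matrices `W_q`, `W_k`, and `W_v` are random placeholders, whereas in a real model they are learned during training. The sketch also omits details such as multiple heads and masking of future tokens.

```r
set.seed(1)

toy_tokens <- c("the", "white", "rabbit")
d <- 4L # size of each token's vector (arbitrary choice)

# Placeholder embeddings: one row per token (random here, learned in a real model)
X <- matrix(rnorm(length(toy_tokens) * d), nrow = length(toy_tokens),
            dimnames = list(toy_tokens, NULL))

# Placeholder projection matrices (learned in a real model)
W_q <- matrix(rnorm(d * d), d, d)
W_k <- matrix(rnorm(d * d), d, d)
W_v <- matrix(rnorm(d * d), d, d)

Q <- X %*% W_q # queries: what each token is looking for
K <- X %*% W_k # keys: what each token offers
V <- X %*% W_v # values: the information that gets mixed together

# Turn each row of scores into weights that are positive and sum to 1
softmax_rows <- function(m) {
    e <- exp(m - apply(m, 1, max)) # subtract the row maximum for numerical stability
    e / rowSums(e)
}

# Row i holds how much token i attends to every token in the text
attn_weights <- softmax_rows(Q %*% t(K) / sqrt(d))

# Each token's new representation is a weighted mix of all the value vectors
context_vectors <- attn_weights %*% V

print(round(attn_weights, 2))
```

Every token ends up with a vector of the same size regardless of how long the text is, which is the uniform representation that gets passed on to the rest of the network.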

When no training data reflects what you're asking the model to do, it cannot do it well, because it has nothing to draw on outside of the training distribution.

## Next Section - Running Language Models Locally
Please see the file `running_lm_locally.Rmd`, download it, and follow the instructions to get Ollama installed. This section doesn't run on Binder (because it runs locally). If you just want to view it, you can see the HTML version!