In [1]:
%load_ext autoreload
%autoreload 2

from project import *

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import re
import requests
import time

# Preparing/Scraping Corpus

We will use The Great Gatsby, downloaded as plain text from Project Gutenberg, to train our models.

In [3]:
gatsby = get_book('https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
# gatsby

# Tokenizing Corpus

In [4]:
tokens = tokenize(gatsby)
# tokens

# Uniform Language Model

A uniform language model is one in which each unique token is equally likely to appear in any position, regardless of any other information.

In [5]:
unif = UniformLM(tokens)
unif.sample(100)

'cornets correctness stored wait eddies arms fingers against Wild cupboard series Fixed won lumps powdered Slenderly roller slipping Indies finding gloomily flannels always lounged stronger maybe association play rejected Blues warn conscious awkward faintest kitchen remarking detail vitality Works ties possible ghastly keys be bursts sail ferryboat violence Together temporarily damned brooding thoroughly positively ejaculated Lemme dyed Croirier Miss ineffable countenance Beast ecstatically motor Avenue convinced Know rapidly brothel transported telephoned band included extraordinarily spent bleak Both behind visitors hedge afterwards matter policeman saucer could several hilarity quieter cottages forever unjustly apartments Par contrast tears youth wrote headed ravages traffic'

The uniform model has an equal chance of choosing every unique token, so this output makes no sense. Its purpose is to demonstrate the simplest possible language model and to get used to the code needed for more complicated models.
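As a rough illustration of the idea, here is a minimal sketch of a uniform model. The class name `UniformSketch` and its methods are hypothetical and are not the `UniformLM` class from `project`; the sketch only assumes that the model stores the set of unique tokens and assigns each one probability $1/V$, where $V$ is the vocabulary size.

```python
import random


class UniformSketch:
    """Hypothetical stand-in for a uniform LM: each unique token has probability 1/V."""

    def __init__(self, tokens):
        # Sort the unique tokens so the vocabulary order is deterministic.
        self.vocab = sorted(set(tokens))
        self._vocab_set = set(self.vocab)
        self.p = 1 / len(self.vocab)

    def probability(self, words):
        # A sequence containing an unseen token gets probability zero.
        if any(w not in self._vocab_set for w in words):
            return 0.0
        # Tokens are independent, so the probabilities multiply.
        return self.p ** len(words)

    def sample(self, m, seed=None):
        # Draw m tokens independently and uniformly from the vocabulary.
        rng = random.Random(seed)
        return ' '.join(rng.choice(self.vocab) for _ in range(m))


toy = UniformSketch(['one', 'one', 'two', 'three'])
```

Note that the repeated `'one'` in the toy corpus has no effect: only the set of unique tokens matters, which is why rare words show up so often in the sample above.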

# Unigram Language Model

A unigram language model is one in which the probability assigned to a token is equal to the proportion of tokens in the corpus that are equal to said token. That is, the probability distribution associated with a unigram language model is just the empirical distribution of tokens in the corpus.

In [6]:
unigram = UnigramLM(tokens)
unigram.sample(100)

's rushed morning less its on carriages \x03 to too you ” Thomas his hour , that I , his could they of to ’ moment the ” I outside , rang of said \x02 , yawned “ out and the \x03 I end — - woman friendly sophisticated at . step his left ’ wherever at that . some are intended \x02 \x02 Tom . and indiscreet him water of a ’ on \x02 , took around suddenly : had speech Gatsby , held blue see you understand ran I , that I out people just wondering ’ in'

Again, this output doesn't make much sense, but it is an improvement over the uniform model because it takes token frequency into account: frequent tokens such as "," and "the" now appear about as often as they do in the corpus.
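The empirical distribution can be sketched in a few lines. The helper names `unigram_probs` and `sample_unigram` are hypothetical and are not taken from `project`; the sketch assumes only that probabilities are relative frequencies and that tokens are drawn independently.

```python
from collections import Counter
import random


def unigram_probs(tokens):
    """Map each token to its relative frequency in the corpus."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: c / total for tok, c in counts.items()}


def sample_unigram(probs, m, seed=None):
    """Draw m tokens independently, weighted by their unigram probability."""
    rng = random.Random(seed)
    toks = list(probs)
    weights = [probs[t] for t in toks]
    return ' '.join(rng.choices(toks, weights=weights, k=m))


probs = unigram_probs(['one', 'one', 'two', 'three'])
```

In this toy corpus `'one'` makes up half of the tokens, so it receives probability 0.5, whereas the uniform model would give it 1/3.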

# N-Gram Language Model

Now we will build an N-Gram language model, in which the probability of a token appearing in a sentence does depend on the tokens that come before it.

The N-Gram language model relies on the assumption that only nearby tokens matter. Specifically, it assumes that the probability that a token occurs depends only on the previous $N-1$ tokens, rather than all previous tokens. That is:

$$P(w_n|w_1,\ldots,w_{n-1}) = P(w_n|w_{n-(N-1)},\ldots,w_{n-1})$$

When $N=3$, we have a "trigram" model. Such a model looks at the previous $N-1 = 2$ tokens when computing probabilities.

Consider the tuple `('when', 'I', 'drink', 'Coke', 'I', 'smile')`, corresponding to the sentence `'when I drink Coke I smile'`. Under the trigram model, the probability of this sentence is computed as follows:

$$P(\text{when I drink Coke I smile}) = P(\text{when}) \cdot P(\text{I | when}) \cdot P(\text{drink | when I}) \cdot P(\text{Coke | I drink}) \cdot P(\text{I | drink Coke}) \cdot P(\text{smile | Coke I})$$
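To make the factorization concrete, the following sketch lists, for each token, the context it is conditioned on. The function name `ngram_factors` is hypothetical; it only illustrates which previous tokens enter each factor and does not compute any probabilities.

```python
def ngram_factors(sentence, N=3):
    """Return (token, context) pairs, where context holds at most N-1 previous tokens."""
    factors = []
    for i, tok in enumerate(sentence):
        # Early tokens have fewer than N-1 predecessors, so their context is shorter.
        start = max(0, i - (N - 1))
        factors.append((tok, tuple(sentence[start:i])))
    return factors


factors = ngram_factors(('when', 'I', 'drink', 'Coke', 'I', 'smile'), N=3)
```

Each pair corresponds to one factor of the product: `('when', ())` is $P(\text{when})$, `('I', ('when',))` is $P(\text{I | when})$, and every later pair has a two-token context.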

The main issue is figuring out how to implement the hyperparameter $N$ in this model. For example, in the 3-gram model, we must also store a unigram model to determine the first token and a bigram model to determine the second token (given the first token). How can we implement this without repeating code?
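One way to avoid repeating code is recursion: a model of order $N$ holds a model of order $N-1$, down to a unigram base case. The sketch below shows only this structure. The class `NGramSketch` and the methods `orders` and `model_for` are hypothetical; the attribute name `prev_mdl` mirrors the one used on `llm` later in this notebook, but the actual `NGramLM` may be organized differently (for instance, its base case may be a separate unigram class).

```python
class NGramSketch:
    """Hypothetical skeleton: an order-N model that owns an order-(N-1) model."""

    def __init__(self, N, tokens):
        self.N = N
        self.tokens = tokens
        # Recursive step: build the lower-order model unless this is the unigram base case.
        self.prev_mdl = NGramSketch(N - 1, tokens) if N > 1 else None

    def orders(self):
        """List the orders available in the chain, highest first."""
        return [self.N] + (self.prev_mdl.orders() if self.prev_mdl else [])

    def model_for(self, n_previous):
        """Return the model to use when `n_previous` earlier tokens are available."""
        if n_previous >= self.N - 1:
            return self
        # Not enough context for this order, so delegate to the lower-order model.
        return self.prev_mdl.model_for(n_previous)


sketch = NGramSketch(3, ['when', 'I', 'drink', 'Coke', 'I', 'smile'])
```

With this layout the trigram model handles every position that has two predecessors, and the first and second tokens are delegated to the unigram and bigram models respectively.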

## Creating N-Grams

In [7]:
llm = NGramLM(3, tokens)
# llm.create_ngrams(tokens)
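An N-gram list can be produced with a sliding window over the token list. The function below is a hypothetical sketch and is not the `create_ngrams` method of `NGramLM`; the toy tokens include `'\x02'` and `'\x03'` only because those characters appear as tokens in the samples in this notebook.

```python
def create_ngrams_sketch(tokens, N):
    """Return every run of N consecutive tokens as a tuple, in corpus order."""
    return [tuple(tokens[i:i + N]) for i in range(len(tokens) - N + 1)]


toy_tokens = ['\x02', 'one', 'two', 'three', '\x03']
toy_trigrams = create_ngrams_sketch(toy_tokens, 3)
```

A corpus of $T$ tokens yields $T - N + 1$ N-grams, and a corpus shorter than $N$ yields none.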

## Training the N-Gram Language Model

The N-Gram LM consists of probabilities of the form

$$P(w_n|w_{n-(N-1)},\ldots,w_{n-1})$$

which we estimate by  

$$\frac{C(w_{n-(N-1)}, w_{n-(N-2)}, \ldots, w_{n-1}, w_n)}{C(w_{n-(N-1)}, w_{n-(N-2)}, \ldots, w_{n-1})}$$

for every N-Gram that occurs in the corpus.
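The ratio of counts can be sketched as follows. The function name `train_ngram_sketch` is hypothetical and this is not the training code in `project`. One assumption is made explicit in the code: contexts are counted only where they are followed by a token, so that the conditional probabilities for each context sum to one. This differs from counting the $(N-1)$-gram everywhere only for the very last $(N-1)$-gram of the corpus.

```python
from collections import Counter


def train_ngram_sketch(tokens, N):
    """Estimate P(w_n | previous N-1 tokens) as C(N-gram) / C(context)."""
    ngrams = [tuple(tokens[i:i + N]) for i in range(len(tokens) - N + 1)]
    ngram_counts = Counter(ngrams)
    # Assumption: count each context only where a next token follows it.
    context_counts = Counter(ng[:-1] for ng in ngrams)
    return {ng: c / context_counts[ng[:-1]] for ng, c in ngram_counts.items()}


bigram_probs = train_ngram_sketch(['one', 'two', 'one', 'three', 'one', 'two'], 2)
```

In the toy corpus `'one'` is followed by `'two'` twice and by `'three'` once, so $P(\text{two | one}) = 2/3$ and $P(\text{three | one}) = 1/3$.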

In [8]:
llm.mdl # for 3-gram

Unnamed: 0,ngram,n1gram,prob
0,"(-, -, -)","(-, -)",0.985915
1,"(, , “)","(, )",0.587022
2,"(”, , )","(”, )",1.000000
3,"(., , )","(., )",1.000000
4,"(., ”, )","(., ”)",0.900000
...,...,...,...
48866,"(Jackson, Abrams, of)","(Jackson, Abrams)",1.000000
48867,"(Abrams, of, Georgia)","(Abrams, of)",1.000000
48868,"(of, Georgia, ,)","(of, Georgia)",1.000000
48869,"(Georgia, ,, and)","(Georgia, ,)",1.000000


In [9]:
llm.prev_mdl.mdl # for bigram

Unnamed: 0,ngram,n1gram,prob
0,"(-, -)","(-,)",0.867123
1,"(, )","(,)",1.000000
2,"(, “)","(,)",0.586667
3,"(”, )","(”,)",0.539202
4,"(., )","(.,)",0.251771
...,...,...,...
29621,"(wrote, down)","(wrote,)",0.333333
29622,"(Once, I)","(Once,)",0.125000
29623,"(crystal, glass)","(crystal,)",1.000000
29624,"(there, crystal)","(there,)",0.005780


In [10]:
llm.prev_mdl.prev_mdl.mdl # for unigram

.            0.046488
,            0.044437
the          0.033287
-            0.030638
            0.024696
               ...   
deeply       0.000015
flicked      0.000015
protested    0.000015
Either       0.000015
borne        0.000015
Name: proportion, Length: 6279, dtype: float64

## Sampling from the N-Gram Model

In [11]:
llm.sample(100)

'\x02 A dim background started to town ? ” \x03 \x02 “ I wouldn ’ t happy , and probably transported complete from some ruin overseas . She had it coming to tea . ” \x03 \x02 “ You wait here till Daisy goes to bed . Good night , old sport , ” said Tom intently . “ Mrs . Wilson ’ s going whether she wants to be looked at my front door opened nervously , and simultaneously there was one of the corners , as though she were balancing something on now that I had stayed so \x03'

This output is much more coherent than the previous ones, because each token is chosen based on the two tokens before it. To improve the model, we can increase the size of the training corpus and increase $N$.
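For reference, sampling from an N-gram model can be sketched as repeatedly drawing the next token given the last $N-1$ tokens generated so far. The functions `conditional_tables` and `sample_sketch` are hypothetical and are not the `sample` method of `NGramLM`. Two assumptions are made: the first tokens are drawn from lower-order tables because they have fewer predecessors, and the sampler falls back to a shorter context whenever the current one was never followed by anything in the corpus. The real implementation may instead seed the text with a start marker, as the leading `'\x02'` in the output above suggests.

```python
from collections import Counter, defaultdict
import random


def conditional_tables(tokens, N):
    """tables[k] maps each context of length k to a Counter of the tokens that follow it."""
    tables = []
    for k in range(N):
        table = defaultdict(Counter)
        for i in range(k, len(tokens)):
            table[tuple(tokens[i - k:i])][tokens[i]] += 1
        tables.append(table)
    return tables


def sample_sketch(tokens, N, m, seed=None):
    """Generate m tokens from a non-empty corpus with an order-N model."""
    rng = random.Random(seed)
    tables = conditional_tables(tokens, N)
    out = []
    while len(out) < m:
        # Use at most N-1 previous tokens; fewer are available at the start.
        k = min(len(out), N - 1)
        context = tuple(out[len(out) - k:])
        # Assumption: drop the oldest token until the context has been seen.
        while context not in tables[len(context)]:
            context = context[1:]
        followers = tables[len(context)][context]
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return ' '.join(out)


toy_sample = sample_sketch('a b a b a b'.split(), N=2, m=6, seed=0)
```

In the toy corpus `'a'` is always followed by `'b'` and `'b'` by `'a'`, so any bigram sample alternates between the two tokens regardless of the seed.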