# NLP Seminar 1: N-gram Language Models

In this NLP seminar, we will learn to estimate and use n-gram language models (LMs). They are a classical approach to sentence modelling and text autocompletion.

We will use the `nltk` (Natural Language Toolkit) Python package.
If you want to learn more about this popular module, refer to the [official website](https://www.nltk.org/) ([API reference](https://www.nltk.org/api/nltk.html), [installation guide](https://www.nltk.org/install.html)).

In particular, the `nltk.lm` submodule provides optimized implementations of classical n-gram language models such as the maximum likelihood estimator (MLE) and its smoothing variants (Laplace, Lidstone, ...).

To illustrate the n-gram approach, we will use n-gram LMs to model script lines (what the characters say) from The Simpsons TV show. The goal will then be to generate new script lines, or do autocompletion, in the writing style of The Simpsons.

Before that, we will begin with the basics on how to preprocess the text data into tokens and ngrams, which are a prerequisite step for fitting those LMs.

In [None]:
import numpy as np
import nltk

In [None]:
#!pip install nltk

In [None]:
# First download some nltk resources
# (By default 'pip install nltk' does not actually download every resource in the module,
# as for example some language models are heavy.)
# The following commands should download every resource needed for this practical:
nltk.download('popular', quiet=True)
nltk.download('punkt_tab', quiet=True)

## 0. Introduction: preprocessing and n-grams with dummy data

For simplicity, we consider the dummy corpus `corp` with two tokenized documents (sequences of tokens). The tokens are here simple letters, but we can think of them as representing words in our vocabulary. (A raw corpus would need to be tokenized first.)

In [None]:
corp = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a bigram model, we need to turn this tokenized text into bigrams. NLTK's `bigrams` and `ngrams` helper functions turn each tokenized document into a list of n-grams (e.g. with $n=2$ and $n=3$).

In [None]:
from nltk.util import bigrams, ngrams

In [None]:
list(bigrams(?))

In [None]:
list(ngrams(?, n=?))

*Remark:* The `list()` is here just used to display the results, as `bigrams`, `ngrams` and other `nltk` functions return python lazy generators, for efficiency.
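
To make this remark concrete, here is a minimal pure-Python sketch of a lazy bigram generator. The helper `lazy_bigrams` is hypothetical and is not NLTK's implementation; it only illustrates that a generator yields its items once and is then exhausted.

```python
def lazy_bigrams(tokens):
    """Yield consecutive token pairs one at a time (hypothetical helper, not NLTK code)."""
    for first, second in zip(tokens, tokens[1:]):
        yield (first, second)

gen = lazy_bigrams(['a', 'b', 'c'])
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

This is why a generator that has already been displayed with `list()` must be re-created before being passed to another function.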

Notice how "b" occurs both as the first and second member of different bigrams but "a" and "c" don't? 

It would be nice to indicate to the model how often sentences start with "a" and end with "c", for example, when we count those n-grams later on.


A standard way to deal with this is to add special "padding" symbols to the document/sequence before splitting it into ngrams. Fortunately, NLTK also has a `pad_sequence` function for that. We use `"<s>"` and `"</s>"` by convention in `nltk` to pad before and after the sequence, respectively.

Let's add the relevant padding and construct the bigrams and 3-grams for the first text sequence. Note the `n` argument, which tells the function that we need padding for `n`-grams.

In [None]:
from nltk.util import pad_sequence

In [None]:
#n=2
padded_seq2 = list(pad_sequence(?, n=?, # n: order of n-grams, if it's 2-grams, you pad once, 3-grams pad twice, etc. 
                                pad_left=?, left_pad_symbol=?,
                                pad_right=?, right_pad_symbol=?)
                   )
padded_seq2

In [None]:
list(ngrams(?, n=?))

In [None]:
#n=3
padded_seq3 = list(pad_sequence(corp[0], n=3, # n: order of n-grams, if it's 2-grams, you pad once, 3-grams pad twice, etc. 
                                pad_left=True, left_pad_symbol="<s>",
                                pad_right=True, right_pad_symbol="</s>")
                   )
padded_seq3

In [None]:
list(ngrams(padded_seq3, n=3))

Passing all these parameters every time can be tedious and in most cases one uses the same defaults anyway.

Thus, the `nltk.lm` module provides a convenience function that has all these arguments already set while the other arguments remain the same as for `pad_sequence`.

In [None]:
from nltk.lm.preprocessing import pad_both_ends

In [None]:
list(pad_both_ends(corp[0], n=2))

Combining the two parts discussed so far we get the following preparation steps for one sentence.

In [None]:
list(bigrams(pad_both_ends(corp[0], n=2)))

For versatility and conditional probability computations, the `nltk.lm` n-gram models that we will use typically rely on counting everygrams of order n. 
For example, LMs of order 2 are trained by counting unigrams (single words) as well as bigrams (word pairs). For LMs of order 3, they usually rely on counting unigrams, bigrams and 3-grams. And so on... 
That way, an `nltk` LM model of order $n$ can output word probabilities for contexts (i.e. previous words/tokens in the conditioning) of size $0, 1, 2, ..., n-1$ tokens.

To construct those everygrams, that will serve as training data for the LM model to count, NLTK once again helpfully provides a function called `everygrams`.

In [None]:
from nltk.util import everygrams

In [None]:
padded_seq2 = list(pad_both_ends(corp[0], n=2))

list(everygrams(padded_seq2, max_len=2))
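
What an everygram collection contains can also be sketched in pure Python. The helpers `pad` and `all_grams_up_to` below are hypothetical illustrations, not NLTK code, and the ordering of their output may differ from that of `everygrams`.

```python
def pad(tokens, n):
    """Add n-1 start and end symbols, following the padding convention described above."""
    return ['<s>'] * (n - 1) + list(tokens) + ['</s>'] * (n - 1)

def all_grams_up_to(tokens, max_len):
    """Return every k-gram for k = 1..max_len (hypothetical helper)."""
    grams = []
    for k in range(1, max_len + 1):
        for i in range(len(tokens) - k + 1):
            grams.append(tuple(tokens[i:i + k]))
    return grams

padded = pad(['a', 'b', 'c'], n=2)
grams = all_grams_up_to(padded, max_len=2)
unigram_items = [g for g in grams if len(g) == 1]
bigram_items = [g for g in grams if len(g) == 2]

print(unigram_items)
print(bigram_items)
```

For an order-2 model, the counted data thus contains both the single tokens (contexts of size 0) and the token pairs (contexts of size 1).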

We are almost ready to start counting ngrams, just one more step left.

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model, to efficiently perform and store the word and ngram counts.

To create this vocabulary, we need to pad our sentences (just like for counting n-grams) and then combine the sentences into one flat stream of words.


In [None]:
from nltk.lm.preprocessing import flatten

In [None]:
list(flatten(pad_both_ends(sent, n=2) for sent in corp)) #vocab

Now that we have covered the necessary preprocessing steps, note that in most cases one wants to use the same text as the source for both the vocabulary and the n-gram counts.

To this end, the `padded_everygram_pipeline` function does everything above (padding, everygrams, vocabulary stream) for the whole tokenized corpus, in a single function call.

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

In [None]:
?? = padded_everygram_pipeline(??)

To avoid re-creating the text in memory, both `training_neverygrams` and `padded_vocab_stream` are lazy iterators. They are evaluated on demand at training time.

For the sake of understanding the outputs of `padded_everygram_pipeline`, we "materialize" the lazy iterators by casting them into a list.

In [None]:
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(2, corp)

print('==== n-everygram data (n=2) for each sequence in "corp": ====')
for ngramlize_sent in training_neverygrams:
    print(list(ngramlize_sent))
    print()
print('==== Vocabulary data: ====')
print(list(padded_vocab_stream))

# Generating Simpsons Episodes with N-Gram Models

Let's try some text generation with "The Simpsons" TV show episodes!

**Dataset source:** https://www.kaggle.com/datasets/prashant111/the-simpsons-dataset

## 1. Import, inspect, preprocess and tokenize the text data

We start by importing the provided dataset, `simpsons_script_lines.csv`. The `"spoken_words"` column gives the desired script lines.

In [None]:
import pandas as pd

In [None]:
simpsons = pd.read_csv("../data/simpsons_script_lines.csv",
                       usecols=["raw_character_text", "raw_location_text", "spoken_words", "normalized_text"],
                       dtype={'raw_character_text':'string', 'raw_location_text':'string',
                              'spoken_words':'string', 'normalized_text':'string'})
simpsons.head()

In [None]:
simpsons.info()

Be aware that typical textual datasets are rarely this clean, and that manual text cleaning is usually required prior to tokenization and modelling.
Typical cleaning steps include: normalizing special characters, like the different types of apostrophes and quotes (e.g. `` ’, ”, ` ``) to the corresponding ` ' ` or ` " `, removing line breaks `\n` (careful not to "merge" words), and removing multiple spaces. Making sure URLs (e.g. `https://www.website.com/`) are not split into too many meaningless tokens is also quite common for social media data.
Other types of textual pre-processing/cleaning are typically specific to the dataset and task at hand (some examples in future seminars).
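
As an illustration, the cleaning steps listed above could be sketched as follows. The helper `clean_line` and the example string are hypothetical; a real dataset would likely need its own, more specific rules.

```python
import re

def clean_line(text):
    """Hypothetical cleaning helper illustrating the steps listed above."""
    # Normalize curly/backtick apostrophes and quotes to plain ASCII ones.
    text = text.replace('\u2019', "'").replace('\u2018', "'").replace('`', "'")
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    # Replace line breaks with a space so that neighbouring words are not merged.
    text = text.replace('\n', ' ')
    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text).strip()
    return text

example = "Marge,  I\u2019m \u201chome\u201d!\nWhere\u2019s   dinner?"
cleaned = clean_line(example)
print(cleaned)
```

Replacing the line break with a space (rather than deleting it) is what prevents `home!` and `Where's` from being glued together.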

(Optional) Feel free to perform cleaning steps that you believe will improve the tokens or the downstream LMs.

In [None]:
simpsons = simpsons.dropna().drop_duplicates()

We will train the model to "talk like" Homer Simpson. We thus restrict the data to his lines only.

In [None]:
simpsons = simpsons[simpsons['raw_character_text']=="Homer Simpson"].sample(frac=1, random_state=1).reset_index(drop=True)
simpsons.head()

Then, we tokenize the text corpus into a list of tokenized script lines (documents) by splitting each script line into word tokens. 
We consider `"spoken_words"` and not `"normalized_text"`, as we are interested in keeping punctuation and capitalization. 
The result should be a list of lists containing word-level tokens (e.g. words, punctuation, and other "special words"). 

We use `nltk.word_tokenize`, which is the recommended English tokenizer in `nltk` (model-based).
(Alternatives include `wordpunct_tokenize`, which is a simpler rule-based tokenizer.) 
You can also use a custom procedure to deal with other data format specifics. 
We then show the result for the first five script lines of the corpus.

In [None]:
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize

print(simpsons['spoken_words'][0])
print(word_tokenize(simpsons['spoken_words'][0]))

In [None]:
simpsons_tok = simpsons['spoken_words'].apply(??).to_list()

for i in range(5):
    print(simpsons_tok[i])

## 2. Fitting and Accessing the language model

The `nltk.lm` submodule has implementations of the language models (LM) you have seen in class, and several others. In particular, you will find implementations of: The simple Maximum Likelihood Estimator (MLE) (`nltk.lm.MLE`), Laplace smoothing (`nltk.lm.Laplace`), and Lidstone smoothing (`nltk.lm.Lidstone`). 
Lidstone is a simple generalization of the other two (more details later).

In this section, you will find the very basics on how to use these language model implementations. For more details, you are encouraged to look into the nltk documentation.

### 2.1 Fitting an n-gram Language model in NLTK and vocabulary

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE).

We first prepare the iterators for the everygrams and vocabulary.

In [None]:
# Preprocess the tokenized text for 3-grams language modelling
n = 3
training_neverygrams, padded_vocab_stream = padded_everygram_pipeline(n, ??)

The LM model usage is quite similar to scikit-learn, with an object-oriented implementation. Using the simple MLE as an example, it first has to be instantiated. 

We only need to specify the highest ngram order to instantiate the MLE (there might be some other hyperparameters for other models).

In [None]:
from nltk.lm import MLE
mle_model = MLE(?) # n is the desired (max) order of the MLE LM

Instantiating the MLE model creates an empty vocabulary. The vocabulary object is accessible as the `vocab` attribute.

In [None]:
len(mle_model.vocab)

We now fit the LM to the training corpus, that has been properly preprocessed into everygrams and a vocabulary stream, for the correct order $n$:

In [None]:
# model.fit(training_neverygrams, padded_vocab_stream)
mle_model.fit(?, ?)

The vocabulary gets filled as the model is fit.

In [None]:
print(mle_model.vocab)

In [None]:
len(mle_model.vocab)

The vocabulary object stores all "known" words, and can help handle words that have not occurred during training.
One can "lookup" a list of tokens in the vocabulary:

In [None]:
print(mle_model.vocab.lookup(simpsons_tok[0]))

We lookup the words of the sentence 'I love UNIGE students!' in the model vocabulary.

In [None]:
# If we look up unseen sentences (not from the training data) in the vocab,
# it automatically replaces words that are not in the vocabulary with `<UNK>`.
print(mle_model.vocab.lookup(word_tokenize("I love UNIGE students!")))

Looking up the token 'UNIGE' in the model's vocabulary results in the `'<UNK>'` token. This means that this word does not exist in the training corpus. Thus, Homer Simpson sadly never talked about 'UNIGE'...

In [None]:
for tok in ['day', 'food', 'qwertz', 'UNIGE', '<s>', '</s>', '<UNK>']:
    print(('Vocabulary contains \"' + tok + '\": '), (tok in mle_model.vocab))

The special token `'<UNK>'` does not appear in the original corpus, and neither do the special padding tokens `'<s>'` and `'</s>'`; they are nevertheless part of the vocabulary. Apart from those, the vocabulary should contain exactly the tokens encountered in the training corpus.


### 2.2. LM fitting function
As `padded_everygram_pipeline` returns iterators (that can only be used once), it is good practice to have the full pipeline in a single function.

We thus create a function that takes as arguments (at least) the desired order $n$ of the model and a tokenized training corpus, and that returns the "simple" Maximum Likelihood Estimator (MLE) language model, fitted on the given training corpus.

In [None]:
def fit_ngram_MLE(order, train_corpus_tokens):
    """
    :param order: integer setting the maximum order of the n-grams.
    :param train_corpus_tokens: list of tokenized text sequences.
    """
    training_neverygrams, padded_vocab = padded_everygram_pipeline(order=order, text=train_corpus_tokens)
    model = nltk.lm.MLE(order=order)
    model.fit(training_neverygrams, padded_vocab)
    return model

In [None]:
n=3
mle_model = fit_ngram_MLE(order=n, train_corpus_tokens=simpsons_tok)

### 2.3. Accessing the fitted model

Apart from the vocabulary, fitting n-gram LMs basically boils down to counting the number of word/token and n-gram occurrences in the training data. To access token counts, and conditional token counts (in a context of one or several preceding tokens), try:
```python
    model.counts
    model.counts['word']
    model.counts[('context_word1', "context_word2", ...)]["word"]
```

In [None]:
print(mle_model.counts)

This provides a convenient interface to access counts for unigrams...

In [None]:
mle_model.counts['Marge'] # i.e. Count('Marge') (Marge is Homer's wife)

In [None]:
mle_model.counts['want'] # i.e. Count('want')

...and bigrams for the phrase bit "I want"

In [None]:
mle_model.counts[['I']]['want'] # i.e. Count('I want')

... and trigrams for the phrase bit "I want to ..."

In [None]:
mle_model.counts[('I', 'want')]['to'] # i.e. Count('I want to')

However, the real purpose of training a language model is to have it score how probable words are in certain contexts. 
For the MLE, the model returns the item's relative frequency as its score, i.e. (conditional) occurrence probability.
```python
    model.score('word')                                             # P('word')
    model.score('word', ('context_word1', "context_word2", ...))    # P('word'|'context_word1 context_word2 ...')
```

In [None]:
mle_model.score('Marge') # P('Marge')

In [None]:
mle_model.score('want') # P('want')

In [None]:
mle_model.score('want', ('I',))  # P('want'|'I')

In [None]:
mle_model.score('to', ('I', 'want')) # P('to'|'I want')

In [None]:
#e.g. P('Marge') = Counts['Marge'] / (total number of counted tokens, i.e. mle_model.counts[1].N())
#e.g. P('to'|'I want') = Counts[('I', 'want', 'to')]/Counts[('I', 'want')]
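
The relative-frequency computation behind these scores can be sketched in pure Python on a toy corpus. This is a minimal illustration under the assumption of a bigram model; the data and the helpers `mle_unigram` and `mle_bigram` are made up and are not part of NLTK. Note that the unigram denominator is the total number of counted tokens, not the number of distinct words.

```python
from collections import Counter

# Toy corpus, already padded for a bigram model (illustrative data only).
toy = [['<s>', 'a', 'b', '</s>'], ['<s>', 'a', 'c', '</s>']]

unigram_counts = Counter(tok for sent in toy for tok in sent)
bigram_counts = Counter(pair for sent in toy for pair in zip(sent, sent[1:]))
total_tokens = sum(unigram_counts.values())

def mle_unigram(word):
    """P(word) = Count(word) / total number of counted tokens."""
    return unigram_counts[word] / total_tokens

def mle_bigram(word, context):
    """P(word | context) = Count(context word) / Count(context followed by any word)."""
    context_total = sum(c for (prev, _), c in bigram_counts.items() if prev == context)
    return bigram_counts[(context, word)] / context_total if context_total else 0.0

print(mle_unigram('a'))        # 2 occurrences out of 8 tokens
print(mle_bigram('b', 'a'))    # 'a' is followed by 'b' once out of 2 times
```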

Remark: Items that are not seen during training are mapped to the vocabulary's "unknown label" token (`'<UNK>'`). With the MLE, the score of such items is 0.


In [None]:
print(mle_model.score("<UNK>"))
print(mle_model.score("<UNK>") == mle_model.score("UNIGE"))

In [None]:
mle_model.score("<UNK>") == mle_model.score("erer")

To avoid underflow when working with many small score values it makes sense to take their logarithm. 
For convenience this can be done by using the `logscore` method instead of the `score`.
```python
    model.logscore('word')
    model.logscore('word', ('context_word1', "context_word2", ...))
```
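
The underflow issue can be reproduced with a few lines of plain Python. The probabilities below are made-up values chosen only to trigger the effect.

```python
import math

probs = [1e-5] * 100          # 100 made-up small word probabilities

product = 1.0
for p in probs:
    product *= p              # 1e-500 is below the float range, so this underflows

log2_sum = sum(math.log2(p) for p in probs)

print(product)
print(log2_sum)
```

The product collapses to zero, whereas the sum of base-2 logarithms stays a finite (negative) number that can still be compared across sequences.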

In [None]:
mle_model.logscore('to', ('I', 'want')) # log2(P('to'|'I want'))

## 3. Generation using N-gram Language Model

### 3.1. Generation with NLTK LMs

One cool feature of fitted n-gram models is that they can be used to generate text that resembles the training data. The `nltk.lm` model classes have a `.generate()` method to sample sequentially from the estimated (conditional) probabilities. This can be achieved using:
```python
    model.generate(num_words = num_words, text_seed = initial_context_tokens, random_seed = None)
```

In [None]:
print(mle_model.generate(20))

Keep in mind that this will generate `num_words` new words according to the model's fitted scores, as a list of vocabulary tokens. For a realistic output text, it might thus need some post-processing. `nltk.tokenize.treebank.TreebankWordDetokenizer()` and its `.detokenize()` method provides a general-purpose **sentence** detokenizer, but might need some additional post-processing for specific tasks.

Furthermore, are generations without initial context (or text seed) complete examples? Do they look like complete examples similar to the training documents? If not, what is missing?

In [None]:
print(mle_model.generate(30, text_seed=['<s>']*(n-1)))

The first words of a script line should be generated conditionally on the fact that they are the first words of the line. Otherwise, if the unconditional probability is used, a generation could begin with any word from the vocabulary, e.g. in the middle of a sentence. The context (previous tokens) when using an LM of order $n$ should thus be a sequence of $n-1$ start-of-document padding tokens (`'<s>'`, if you did not change the padding default in 2.1).
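
The idea of sequential sampling from a start-of-line context can be sketched with a toy bigram model in pure Python. This is only an illustration under simplifying assumptions (order 2, made-up data); it is not how NLTK's `.generate()` is implemented, and `sample_line` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

# Toy padded corpus for a bigram model (illustrative data only).
toy = [['<s>', 'a', 'b', '</s>'], ['<s>', 'a', 'c', '</s>']]

# Count which tokens follow each one-token context.
followers = defaultdict(Counter)
for sent in toy:
    for prev, nxt in zip(sent, sent[1:]):
        followers[prev][nxt] += 1

def sample_line(max_words, seed=None):
    """Sample tokens one by one, starting from the start-of-line padding token."""
    rng = random.Random(seed)
    context, line = '<s>', []
    for _ in range(max_words):
        options = followers[context]
        token = rng.choices(list(options), weights=list(options.values()))[0]
        if token == '</s>':   # end-of-line token: the line is complete
            break
        line.append(token)
        context = token       # the new token becomes the next context
    return line

line = sample_line(max_words=10, seed=0)
print(line)
```

Because sampling starts from `'<s>'`, every generated line begins with a token that actually starts a line in the toy corpus, and it stops as soon as `'</s>'` is drawn.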

We can do some cleaning and detokenization in a function to make the generated tokens more human-like. In particular it should:
- take as input arguments: a fitted `nltk.lm.model`, a maximum number of words (integer), a text seed (initial context tokens), and a random "RNG" seed for generation,
- have the padding tokens as text seed default, as discussed above,
- output a newly generated Simpsons script line, according to the input arguments, post-processed as a single text string that is formatted like a script line from the original dataset.

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

def simpson_detokenizer(token_list: list[str]) -> str:
    TbDetok = TreebankWordDetokenizer()
    tb_string = TbDetok.detokenize(token_list)
    # As it's a sentence detokenizer, it will add spaces before non-ending punctuation marks:
    detokenized_line = tb_string.replace(' .','.').replace(' ,',',').replace(' !','!').replace(' ?','?').replace(' :',':').replace(' ;',';')
    # (Possibly more steps depending on pre-processing...)
    return detokenized_line

def generate_line(model, max_words, text_seed=None, random_seed=None):
    """
    :param model: An ngram language model from `nltk.lm.model`.
    :param max_words: Max no. of words to generate.
    :param text_seed: Generation can be conditioned on preceding context tokens.
    :param random_seed: Seed value for random.
    """
    if text_seed is None:
        text_seed = ['<s>']*(model.order-1)
    
    content = [tok for tok in text_seed if tok!='<s>']
    
    for token in model.generate(num_words=max_words, text_seed=text_seed, random_seed=random_seed):
        if token == '</s>':
            break
        if token != '<s>':
            content.append(token)
        
    line = simpson_detokenizer(content)
    return line

We can now generate some more realistic Simpsons script lines.

In [None]:
print(mle_model.generate(28, random_seed=5))

In [None]:
generate_line(mle_model, max_words=28, text_seed=[], random_seed=5)

In [None]:
generate_line(mle_model, max_words=1000, random_seed=2)

In [None]:
generate_line(mle_model, max_words=1000, random_seed=30)

In [None]:
generate_line(mle_model, max_words=1000, random_seed=42)

In [None]:
generate_line(mle_model, max_words=1000, random_seed=0)

In [None]:
generate_line(mle_model, max_words=1000, random_seed=100)

In [None]:
print(generate_line(mle_model, max_words=1000, random_seed=52))

In [None]:
print(generate_line(mle_model, max_words=1000, random_seed=17))

**To go further:** Especially with "less clean" data, generations may sometimes contain odd or very specific tokens that rarely occurred in the training data and that we might want to ignore.

For a more advanced usage, the vocabulary can be constructed separately and given to the model, instead of letting it infer it from the vocabulary stream during the model fit. 
This allows for example cutting-off infrequent words from the vocabulary. 
You can tell the vocabulary to ignore such words using the `unk_cutoff` argument for the vocabulary lookup, which will turn them to `'<UNK>'`.
If you are interested in the implementation and going a bit further, you can check out the documentation for the `nltk.lm.vocabulary.Vocabulary` class [here](https://www.nltk.org/api/nltk.lm.vocabulary.html) or the source code: [`nltk.lm.vocabulary.Vocabulary`](https://github.com/nltk/nltk/blob/develop/nltk/lm/vocabulary.py).

In [None]:
from nltk.lm import Vocabulary

In [None]:
voc = nltk.lm.Vocabulary(unk_cutoff=2)
voc.update(["a","b","a"])
voc.lookup(["a","b","c"])

In [None]:
voc["a"], voc["b"], voc["c"]

## 4. Smoothing and model comparison

### 4.1. Smoothing

As discussed in the lecture, the issue of the simple MLE is that it gives 0 probability to any sequence for which even a single n-gram (e.g. a trigram for $n=3$) has never been seen during training. To avoid this issue, several smoothing techniques exist. A few implementations are available in the `nltk.lm` submodule, for example:

 - `Lidstone`: Provides Lidstone-smoothed scores, with hyperparameter $\gamma$. It avoids the 0 probability issue by adding $\gamma$ to all counts. A value $\gamma=0$ corresponds to the simple MLE, and a value $\gamma=1$ corresponds to Laplace smoothing.
 - `Laplace`: Implements Laplace (add one) smoothing. It avoids the 0 probability issue by adding $1$ to all counts. Equivalent to Lidstone with $\gamma=1$.
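
The additive smoothing described in the two items above can be written as a one-line formula. The sketch below assumes the usual normalization, in which $\gamma$ is added to the count of each of the $|V|$ vocabulary words, so the denominator grows by $\gamma |V|$. The function `lidstone_score` and the counts are hypothetical illustrations, not NLTK code.

```python
def lidstone_score(word_count, context_count, vocab_size, gamma):
    """(Count(context, word) + gamma) / (Count(context) + gamma * |V|)."""
    return (word_count + gamma) / (context_count + gamma * vocab_size)

# Made-up counts: a context seen 10 times, vocabulary of 5 words, word never seen after it.
unseen_mle = lidstone_score(0, 10, 5, gamma=0.0)       # gamma=0 recovers the MLE
unseen_lidstone = lidstone_score(0, 10, 5, gamma=0.1)
unseen_laplace = lidstone_score(0, 10, 5, gamma=1.0)   # gamma=1 is Laplace smoothing

# The smoothed scores over the whole vocabulary still sum to one.
counts = [6, 4, 0, 0, 0]
total = sum(lidstone_score(c, sum(counts), len(counts), 0.1) for c in counts)

print(unseen_mle, unseen_lidstone, unseen_laplace)
print(total)
```

The unseen word gets zero probability under the MLE, a small probability with $\gamma=0.1$, and a larger one with $\gamma=1$.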
 
 If you want to go further, there are additional language models available in `nltk.lm`.
 
Let's fit the Laplace model introduced in the lecture, as well as its Lidstone generalization, which performs smoothing by adding an arbitrary value `gamma` instead of `1` to the word counts.
We can modify the function defined in 2.2. to make it compatible with other LMs (and accept additional hyperparameters).

In [None]:
def fit_ngram_language_model(order, train_corpus_tokens, LM_Class=nltk.lm.MLE, *args, **kwargs):
    """
    :param order: integer setting the maximum order of the n-grams.
    :param train_corpus_tokens: list of tokenized text sequences.
    :param LM_Class: a language model class, i.e. a sub-class of `nltk.lm.api.LanguageModel`.
    Additional arguments are passed to `LM_Class`.
    """
    training_neverygrams, padded_vocab = padded_everygram_pipeline(order=order, text=train_corpus_tokens)
    model = LM_Class(order=order, *args, **kwargs)
    model.fit(training_neverygrams, padded_vocab)
    return model

In [None]:
from nltk.lm import Laplace, Lidstone

laplace = fit_ngram_language_model(??)
lidstone = fit_ngram_language_model(??)

In [None]:
print(generate_line(laplace, max_words=1000, text_seed=["Marge", ","], random_seed=None))

In [None]:
print(generate_line(lidstone, max_words=1000, text_seed=["Marge", ","], random_seed=None))

### 4.2. Qualitative model comparison 

To try to observe the impact of the n-gram order on the realism of the generated lines, we can fit and generate new text from the simple MLE and from the Laplace LM of different orders (for ex. $n=1,2,3,4$).
- We then compare the results between the different $n$ values and between the two models. 
- What are the main differences for generation? Which model(s) do you think might be the best options for generating new realistic Homer script lines?
- Do you see hints of those differences in the generated text?

In [None]:
mle_models = [None]
lapace_models = [None]
max_n = 4
for n in range(1,max_n+1):
    print(n)
    mle_models.append(fit_ngram_language_model(order=n, train_corpus_tokens=simpsons_tok, LM_Class=MLE))
    lapace_models.append(fit_ngram_language_model(order=n, train_corpus_tokens=simpsons_tok, LM_Class=Laplace))

In [None]:
seed = None
prior_tokens = None # ["Marge", ","]

MLE:

In [None]:
print("==== n=1: ====")
print(generate_line(mle_models[1], 30, text_seed=prior_tokens, random_seed=seed))

In [None]:
print("==== n=2: ====")
print(generate_line(mle_models[2], 100, text_seed=prior_tokens, random_seed=seed))

In [None]:
print("==== n=3: ====")
print(generate_line(mle_models[3], 1000, text_seed=prior_tokens, random_seed=seed))

In [None]:
print("==== n=4: ====")
print(generate_line(mle_models[4], 1000, text_seed=prior_tokens, random_seed=seed))

Laplace:

In [None]:
print("==== n=1: ====")
print(generate_line(lapace_models[1], 30, text_seed=prior_tokens, random_seed=seed))

In [None]:
print("==== n=2: ====")
print(generate_line(lapace_models[2], 100, text_seed=prior_tokens, random_seed=seed))

In [None]:
print("==== n=3: ====")
print(generate_line(lapace_models[3], 1000, text_seed=prior_tokens, random_seed=seed))

In [None]:
print("==== n=4: ====")
print(generate_line(lapace_models[4], 1000, text_seed=prior_tokens, random_seed=seed))

Larger $n$ values lead to greater sentence coherence, as the model has more context. For even larger $n$, it might also lead to overfitting, and to the model always generating the same sentences from the training set, with little "novelty". $n=1$ leads to just random words independently sampled from the dictionary, according to their train corpus frequencies.

Laplace may also occasionally generate less representative sentences, or sentences with less coherence, as there is always a small probability that the model will generate any word from the vocabulary, regardless of the observed training contexts, due to smoothing.

### 4.3. Quantitative model comparison using perplexity scores

The model perplexity is a normalized form of the sequence probability, as seen in the lecture. It can be used on a held-out test dataset to evaluate the performance of an n-gram probability model. 
The `nltk.lm` model classes have a `.perplexity()` method to compute the perplexity on a given list or corpus of n-grams.
```python
    model.perplexity(test_ngrams)
```
To compute the perplexity correctly with this method, one needs to preprocess the relevant corpus documents into a list of padded $n$-grams.
We can use it to compare the MLE, Laplace and Lidstone (e.g. with $\gamma=0.1$) models. 
To do so, we perform the following steps:

- Split the tokenized Simpsons lines corpus into a (reproducible) training set (80%) and a test set (20%). 
- Compute the train and test 3-gram perplexity scores of a simple MLE LM, a Laplace LM, and a Lidstone LM with $\gamma=0.1$. Use model order $n=3$ for each.
- Compare and discuss the obtained train and test perplexity scores of the three models. Argue which model might represent the Homer Simpson script lines data best.
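
As a reminder of what is being compared, perplexity can be sketched as two raised to the negative average base-2 log-probability of the scored n-grams. The function below is a minimal pure-Python illustration under that assumption, with made-up probabilities; it is not NLTK's code.

```python
import math

def perplexity(probabilities):
    """2 ** (-(1/N) * sum(log2 p_i)) over the N scored n-grams."""
    if any(p == 0.0 for p in probabilities):
        return math.inf      # one zero-probability n-gram makes the perplexity infinite
    mean_log2 = sum(math.log2(p) for p in probabilities) / len(probabilities)
    return 2.0 ** (-mean_log2)

uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # a uniform choice among 4 words
with_zero = perplexity([0.5, 0.0, 0.5])          # contains one unseen n-gram

print(uniform)
print(with_zero)
```

A model that hesitates uniformly between 4 words at each step has perplexity 4, and a single n-gram with zero probability is enough to make the score infinite.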

In [None]:
def evaluate_perplexity(lm_model, corpus_tokens, order=None):
    if order is None: # Optional: lets you evaluate lm_model at a lower order than lm_model.order
        order=lm_model.order
    
    test_ngrams = []
    for s in corpus_tokens:
        test_ngrams += list(ngrams(pad_both_ends(s, n=order), n=order)) #Padded n-grams
        
    return lm_model.perplexity(test_ngrams)

In [None]:
from sklearn.model_selection import train_test_split
train_corp, valid_corp = train_test_split(?, test_size=0.2, shuffle=True, random_state=1)

In [None]:
n = 3
gamma = 0.1
mle3t = fit_ngram_language_model(order=n, train_corpus_tokens=?, LM_Class=?, ?)
lapl3t = fit_ngram_language_model(order=n, train_corpus_tokens=?, LM_Class=?, ?)
lid3t = fit_ngram_language_model(order=n, train_corpus_tokens=?, LM_Class=?, ?)

print("Train:")
print("MLE:", evaluate_perplexity(mle3t, ?, n))
print("Laplace:", evaluate_perplexity(lapl3t, ?, n))
print(f"Lidstone (gamma={gamma}):", evaluate_perplexity(lid3t, ?, n))
print("")
print("Test:")
print("MLE:", evaluate_perplexity(mle3t, ?, n))
print("Laplace:", evaluate_perplexity(lapl3t, ?, n))
print(f"Lidstone (gamma={gamma}):", evaluate_perplexity(lid3t, ?, n))

On the train set: the MLE has by far the lowest (i.e. best) perplexity and Laplace the largest, with Lidstone in between. This makes sense, as the MLE probabilities are exactly the relative frequencies observed in the train set, whereas Laplace and Lidstone smooth the estimated probabilities with "artificially" inflated counts, Lidstone applying the lighter smoothing (i.e. its counts lie between those of the two others).

On the test set: the MLE's perplexity score is infinite. This is generally expected on any set-aside test set, as a single three-word combination (as $n=3$) that was not present in the training set suffices for the model to give zero probability to the test text. Laplace and Lidstone solve this issue by adding $1$ or $\gamma=0.1$ to all counts, including those of (context-conditionally) unobserved words. The Lidstone model has the lowest test perplexity in this case.

One can argue that Lidstone offers the best tradeoff in the estimated probabilities: the simple MLE strongly overfits the training data, and Laplace has a significantly larger perplexity, indicating that $\gamma=0.1$ might be more suitable than $\gamma=1$.

### 4.4. Hyper-parameter tuning

Having the perplexity score as a comparison metric, we can perform a grid-search to select the best values for the hyperparameters $n$ and $\gamma$ of the Lidstone class of LMs. (Remember that the simple MLE and Laplace are special cases of Lidstone with $\gamma=0$ and $\gamma=1$, respectively.)

- Perform a grid-search to select the best hyperparameter values for $n$ and $\gamma$, for the Lidstone LM. You want to select the model that generalizes best to new data.
- What do you observe in the obtained perplexity scores? Was it expected? Explain it in statistical terms.

(One can generally try a few values for $n$ and $\gamma$ by hand to identify the general hyperparameter region of interest before defining a more thorough hyperparameter value grid.)

In [None]:
gamma_list = [0.0001, 0.001, 0.01, 0.1, 0.2]
n_list = [1,2,3,4,5]

for gamma in gamma_list:
    for n in n_list:
        lidnt = fit_ngram_language_model(order=?, train_corpus_tokens=?, LM_Class=Lidstone, gamma=?)
        print(f"Lidstone (gamma={gamma}, n={n}):", evaluate_perplexity(lidnt, ?, n))
        # Or cleaner: store the values in a dataframe.

For our grid and data split, the best hyperparameter combination in terms of validation perplexity score seems to be $\gamma=0.01$ and $n=2$.

From that optimum, we observe a "U-shape" in both hyperparameter directions. There is a bias-variance tradeoff between low n (bias) and large n (variance) values. Same for the $\gamma$ values.