# NLP Seminar 1 - N-gram Language Models (N-gram LMs)

In this first NLP seminar, we will focus on n-gram language models, a classical approach to sentence modelling and text autocompletion.

We will use the `nltk` (natural language toolkit) python package. 
If you want to learn more about this popular module, refer to the [official website](https://www.nltk.org/) ([API reference](https://www.nltk.org/api/nltk.html), [installation guide](https://www.nltk.org/install.html)).

In particular, the `nltk.lm` submodule provides optimized implementations of classical n-gram language models such as the maximum likelihood estimator (MLE) and its smoothed variants (Laplace, Lidstone, ...).

To illustrate the n-gram approach, we will apply it to the Trump Tweets dataset and try to generate new tweets!

Before that, we will begin with the basics, by understanding how to preprocess the text data into tokens and ngrams. 

In [None]:
import numpy as np
import nltk

In [None]:
# First download some nltk resources
# (By default '!pip install nltk' does not actually download every resource in the module,
# as for example some language models are heavy.)
# The following command should download every resource needed for this practical:
nltk.download('popular', quiet=True)

# Introduction: dummy data

For simplicity, we first consider the dummy corpus `text` with two documents/sequences of tokens. The tokens are here simple letters, but we can think of them as representing words in our vocabulary.

In [None]:
text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]

If we want to train a bigram model, we first need to turn this text into n-grams. NLTK provides the `bigrams` and `ngrams` helper functions, which turn each document (a list of tokens) into a list of n-grams (e.g. with $n=2$ and $n=3$).
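
As a small illustration of these two helpers, here is a sketch on a separate toy sequence (`toy_tokens` is a made-up name, deliberately different from the `text` corpus used in the exercises):

```python
from nltk.util import bigrams, ngrams

# Toy sequence, chosen only for illustration (not the exercise corpus).
toy_tokens = ["w", "x", "y", "z"]

# Both helpers return lazy iterators, so we wrap them in list() to see the result.
toy_bigrams = list(bigrams(toy_tokens))       # sliding window of size 2
toy_trigrams = list(ngrams(toy_tokens, n=3))  # sliding window of size 3

print(toy_bigrams)
print(toy_trigrams)
```

A sequence of length $L$ yields $L-n+1$ n-grams, so higher orders produce fewer tuples for the same sequence.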

In [None]:
from nltk.util import bigrams, ngrams

In [None]:
list(bigrams(?))

In [None]:
list(ngrams(??))

Notice how "b" occurs both as the first and as the second member of different bigrams, but "a" and "c" don't?

When we count these n-grams later on, it would be nice to tell the model how often sentences start with "a" and end with "c", for example.


A standard way to deal with this is to add special "padding" symbols to the document/sequence before splitting it into ngrams. Fortunately, NLTK also has a `pad_sequence` function for that. We use `"<s>"` and `"</s>"` by convention in `nltk` to pad before and after the sequence, respectively.

Let's add the relevant padding and construct the bigrams and 3-grams for the first text sequence. Note the `n` argument, which tells the function that we need padding for `n`-grams.
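
To make the role of `n` concrete, the following sketch pads a toy sequence for two different orders. The wrapper `pad_for_order` and the sequence `toy_tokens` are hypothetical names introduced only for this illustration:

```python
from nltk.util import pad_sequence

toy_tokens = ["w", "x", "y"]  # illustrative data, not the exercise corpus

def pad_for_order(tokens, n):
    """Pad both sides with <s> / </s>, as needed for n-grams of order n."""
    return list(
        pad_sequence(
            tokens,
            n=n,
            pad_left=True,
            left_pad_symbol="<s>",
            pad_right=True,
            right_pad_symbol="</s>",
        )
    )

padded_for_bigrams = pad_for_order(toy_tokens, 2)   # n-1 = 1 symbol per side
padded_for_trigrams = pad_for_order(toy_tokens, 3)  # n-1 = 2 symbols per side

print(padded_for_bigrams)
print(padded_for_trigrams)
```

The function adds $n-1$ padding symbols on each side, so that every real token appears in every position of some n-gram.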

In [None]:
from nltk.util import pad_sequence

In [None]:
padded_seq = list(pad_sequence(??))
list(ngrams(padded_seq, n=?))

In [None]:
padded_seq = list(pad_sequence(??))
list(ngrams(padded_seq, n=?))

Passing all these parameters every time is tedious, and in most cases one uses the same defaults anyway.

The `nltk.lm` module therefore provides a convenience function, `pad_both_ends`, that has the padding arguments already set; its remaining arguments are the same as for `pad_sequence`.

In [None]:
from nltk.lm.preprocessing import pad_both_ends

In [None]:
list(pad_both_ends(?, n=?))

Combining the two parts discussed so far we get the following preparation steps for one sentence.
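
For illustration, here is how padding and n-gram extraction chain together on a toy sentence (the names `toy_sentence` and `order` are assumptions made for this sketch, not part of the exercise):

```python
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import ngrams

toy_sentence = ["w", "x", "y"]  # illustrative data
order = 2

# Step 1: pad the sentence for the chosen order.
padded = list(pad_both_ends(toy_sentence, n=order))
# Step 2: slide a window of the same order over the padded sentence.
padded_ngrams = list(ngrams(padded, n=order))

print(padded_ngrams)
```

Using the same `order` in both steps matters: the padding width is chosen to match the window size.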

In [None]:
???

For versatility and conditional probabilities, the `nltk.lm` n-gram models usually need everygrams of order n. For bigrams, they are trained using unigrams (single words) as well as bigrams. For 3-grams, they usually rely on unigrams, bigrams and 3-grams. And so on... 
NLTK once again helpfully provides a function called `everygrams`.
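
As a sketch of what `everygrams` produces, the example below applies it to an already padded toy sequence (`toy_padded` is a made-up name). The order in which the n-grams are yielded has differed between NLTK releases, so the example groups them by length instead of relying on it:

```python
from nltk.util import everygrams

toy_padded = ["<s>", "w", "x", "</s>"]  # illustrative, already padded for n=2

grams = list(everygrams(toy_padded, max_len=2))

# Group by n-gram length; the yield order may depend on the NLTK version.
by_order = {k: [g for g in grams if len(g) == k] for k in (1, 2)}

print(by_order[1])  # all unigrams, as 1-tuples
print(by_order[2])  # all bigrams
```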

In [None]:
from nltk.util import everygrams

In [None]:
#with n=2:
padded_bigrams = list(pad_both_ends(??))
list(everygrams(padded_bigrams, max_len=?)) #train[0]

We are almost ready to start counting ngrams, just one more step left.

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model, to efficiently perform the counting.

To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.


In [None]:
from nltk.lm.preprocessing import flatten

In [None]:
list(flatten(pad_both_ends(sent, n=2) for sent in text)) #vocab

In most cases we want to use the same text as the source for both vocabulary and ngram counts.

Now that we understand what this means for our preprocessing, we can simply import the `padded_everygram_pipeline` function that does exactly everything above for us for the whole corpus, in a single function call.

In [None]:
from nltk.lm.preprocessing import padded_everygram_pipeline

In [None]:
?? = padded_everygram_pipeline(??)

So as to avoid re-creating the text in memory, both `train` and `vocab` are lazy iterators. They are evaluated on demand at training time.
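
A practical consequence of this laziness is that each iterator can be consumed only once. The sketch below shows this on a toy corpus (`toy_corpus`, `toy_train` and `toy_vocab` are names assumed for the illustration):

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_corpus = [["w", "x"], ["x", "y"]]  # illustrative data
toy_train, toy_vocab = padded_everygram_pipeline(2, toy_corpus)

first_pass = list(toy_vocab)   # consumes the iterator
second_pass = list(toy_vocab)  # nothing is left to read

print(first_pass)
print(second_pass)
```

If the data is needed again (for instance to fit a second model), the pipeline has to be called again.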

For the sake of understanding the output of `padded_everygram_pipeline`, we "materialize" the lazy iterators by casting them into a list.

In [None]:
training_ngrams, padded_sequences = padded_everygram_pipeline(??)

print('==== n-gram data (n=1,2) for each sequence in "text": ====')
for ngramlize_sent in training_ngrams:
    print(list(ngramlize_sent))
    print()
print('==== Vocabulary data: ====')
print(list(padded_sequences))

# Tokenizing real data

Let's try some text generation with Donald Trump tweets!


**Dataset source:** https://www.kaggle.com/kingburrito666/better-donald-trump-tweets#Donald-Tweets!.csv


In [None]:
import pandas as pd

First import, inspect and preprocess the text data:

In [None]:
df = pd.read_csv('./data/Trump_tweets.csv')
df.head()

In [None]:
df['Tweet_Text'].values[0]

In [None]:
df['Tweet_Text'].values[1]

In [None]:
# Optional preprocessing and text wrangling ...


Then tokenize the text corpus (split around words and punctuation). `nltk.word_tokenize` is the recommended tokenizer in `nltk`.

In [None]:
from nltk import word_tokenize, sent_tokenize

In [None]:
trump_corpus = list(df['Tweet_Text'].apply(??))
print(trump_corpus[0])
print(trump_corpus[1])

## Training an N-gram Model

Having prepared our data we are ready to start training a model. As a simple example, let us train a Maximum Likelihood Estimator (MLE).

We first prepare the iterators for the everygrams and the vocabulary.

In [None]:
# Preprocess the tokenized text for 3-grams language modelling
n = ?
train_data, padded_seqs = padded_everygram_pipeline(??)

We only need to specify the highest ngram order to instantiate the MLE.

In [None]:
from nltk.lm import MLE
model = MLE(?) # Let's train a 3-gram model

Initializing the MLE model creates an empty vocabulary

In [None]:
len(model.vocab)

... which gets filled as we fit the model.

In [None]:
model.fit(?, ?)
print(model.vocab)

In [None]:
len(model.vocab)

The vocabulary helps us handle words that have not occurred during training.

In [None]:
print(model.vocab.lookup(trump_corpus[1]))

In [None]:
# If we look up unseen sentences (not from the training data) in the vocab,
# it automatically replaces words that are not in the vocabulary with `<UNK>`.
print(model.vocab.lookup('Busy day government erer .'.split()))
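
The same behaviour can be checked end to end on a tiny corpus. This is a minimal sketch with made-up data (`toy_corpus`, `toy_model`) and a bigram model, not the tweet model trained above:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_corpus = [["x", "y", "z"], ["x", "z"]]  # illustrative data
toy_train, toy_vocab = padded_everygram_pipeline(2, toy_corpus)

toy_model = MLE(2)
size_before = len(toy_model.vocab)   # empty before fitting
toy_model.fit(toy_train, toy_vocab)
size_after = len(toy_model.vocab)    # distinct tokens, padding symbols and "<UNK>"

# "q" never occurred in toy_corpus, so it is mapped to the unknown label.
looked_up = toy_model.vocab.lookup(["x", "q"])

print(size_before, size_after)
print(looked_up)
```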

As `padded_everygram_pipeline` returns iterators (which can only be consumed once), it is a good idea to wrap the full pipeline in a single function:

In [None]:
def fit_ngram_language_model(order, train_corpus_tokens, LM_Class=nltk.lm.MLE, *args, **kwargs):
    """
    :param order: integer setting the maximum order of the n-grams.
    :param train_corpus_tokens: list of tokenized text sequences.
    :param LM_Class: a language model as a nltk.lm.LanguageModel sub-class.
    additional arguments are passed to `LM_Class`.
    """
    ???
    ???
    return ???

In [None]:
mle_model = fit_ngram_language_model(order=?, train_corpus_tokens=?, LM_Class=?)

## Using the N-gram Language Model

When it comes to ngram models the training boils down to counting up the ngrams from the training corpus.

In [None]:
print(mle_model.counts)

This provides a convenient interface to access counts for unigrams...

In [None]:
mle_model.counts['America'] # i.e. Count('America')

In [None]:
mle_model.counts['Trump'] # i.e. Count('Trump')

... and bigrams, here for the fragment "I will"

In [None]:
mle_model.counts[['I']]["will"]

... and trigrams, here for the fragment "will never forget"

In [None]:
mle_model.counts[('will', "never")]["forget"]

And so on. However, the real purpose of training a language model is to have it score how probable words are in certain contexts.

This being MLE, the model returns the item's relative frequency as its score.
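
To make "relative frequency" concrete, here is a pure-Python sketch of the estimate on a tiny, already padded corpus. It is an illustration of the formula with hypothetical helper names (`mle_unigram`, `mle_bigram`), not NLTK's implementation:

```python
from collections import Counter, defaultdict

# Illustrative corpus, already padded for bigrams.
padded_corpus = [["<s>", "x", "y", "z", "</s>"], ["<s>", "x", "z", "</s>"]]

unigram_counts = Counter(tok for sent in padded_corpus for tok in sent)
bigram_counts = defaultdict(Counter)
for sent in padded_corpus:
    for prev, nxt in zip(sent, sent[1:]):
        bigram_counts[prev][nxt] += 1

def mle_unigram(word):
    """P(word) = Count(word) / total number of tokens."""
    return unigram_counts[word] / sum(unigram_counts.values())

def mle_bigram(word, prev):
    """P(word | prev) = Count(prev, word) / Count(prev, anything)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(mle_unigram("x"))      # 2 occurrences out of 9 tokens
print(mle_bigram("y", "x"))  # "x" is followed by "y" once and by "z" once
print(mle_bigram("w", "x"))  # unseen bigram: zero probability under MLE
```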

In [None]:
mle_model.score('America') # P('America')

In [None]:
mle_model.score('Trump') # P('Trump')

In [None]:
mle_model.score('will', ('I',))  # P('will'|'I')

In [None]:
mle_model.score(??) # P('forget'|'will never')

Items that are not seen during training are mapped to a specific vocabulary "unknown label" token.


In [None]:
print(mle_model.score("<UNK>"))
print(mle_model.score("<UNK>") == mle_model.score("erer"))

In [None]:
mle_model.score("<UNK>") == mle_model.score("vava")

To avoid underflow when working with many small score values it makes sense to take their logarithm. 

For convenience this can be done with the `logscore` method.
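
The underflow problem is easy to reproduce with plain floats. In this sketch the probabilities are made-up values, and base-2 logarithms are used, which is also the base `logscore` works in:

```python
import math

# 100 illustrative n-gram probabilities, each equal to 1e-5.
probs = [1e-5] * 100

# Multiplying them directly underflows: the true value 1e-500 is not representable.
product = math.prod(probs)

# Summing log-probabilities keeps the same information in a safe numeric range.
log_total = sum(math.log2(p) for p in probs)

print(product)
print(log_total)
```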


In [None]:
mle_model.logscore('forget', ('will', 'never'))

## Generation using N-gram Language Model

One cool feature of n-gram models is that they can be used to generate text. The `nltk.lm.models` classes have a `.generate()` method that samples tokens sequentially from the estimated (conditional) probabilities.
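
As a minimal sketch of sequential sampling, the example below fits a bigram MLE on a toy corpus and draws a few tokens. The corpus and the names `toy_model` and `sample` are assumptions for illustration; the sampled tokens depend on the seed, so no particular output is implied:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

toy_corpus = [["x", "y", "z"], ["x", "z"], ["y", "z", "x"]]  # illustrative data
toy_train, toy_vocab = padded_everygram_pipeline(2, toy_corpus)

toy_model = MLE(2)
toy_model.fit(toy_train, toy_vocab)

# Start from the sentence-start symbol; each new token is sampled
# conditionally on the previously generated one (bigram model).
sample = toy_model.generate(num_words=5, text_seed=["<s>"], random_seed=0)

print(sample)
```

The output can contain padding symbols such as `</s>`, which is why some post-processing of the generated tokens is useful.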

In [None]:
print(mle_model.generate(??))

We can do some cleaning and detokenization in a function to make the generated tokens more human-like.

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

def tweet_detokenizer(token_list: list[str]) -> str:
    TbDetok = TreebankWordDetokenizer()
    tb_string = TbDetok.detokenize(token_list)
    detokenized_tweet = tb_string.replace(' .','.').replace('@ ', '@')
    return detokenized_tweet

def generate_tweet(model, max_words, text_seed=None, random_seed=None):
    """
    :param model: An n-gram language model from `nltk.lm.models`.
    :param max_words: Max no. of words to generate.
    :param text_seed: Generation can be conditioned on preceding context tokens.
    :param random_seed: Seed value for random.
    """
    if text_seed is None:
        text_seed = ['<s>']*(model.order-1)
    
    content = [tok for tok in text_seed if tok!='<s>']
    
    for token in model.generate(num_words=max_words, text_seed=text_seed, random_seed=random_seed):
        if token == '</s>':
            break
        if token != '<s>':
            content.append(token)
        
    tweet = tweet_detokenizer(content)
    return tweet

In [None]:
generate_tweet(??)

In [None]:
generate_tweet(??)

In [None]:
generate_tweet(??)

**To go further:** Some generations contain odd typos or tokens that probably occurred only rarely in the training data, and that we might want to ignore.

You can tell the vocabulary to ignore such words using the `unk_cutoff` argument for the vocabulary lookup, which will turn them to `'<UNK>'`.
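
Here is a small sketch of the cutoff on made-up data: with `unk_cutoff=2`, a word must have been counted at least twice to be considered part of the vocabulary (`toy_vocab` and the chosen cutoff are illustrative assumptions):

```python
from nltk.lm import Vocabulary

# "x" is seen twice, "y" once, "z" never.
toy_vocab = Vocabulary(["x", "x", "y"], unk_cutoff=2)

looked_up = toy_vocab.lookup(["x", "y", "z"])

print(looked_up)
print(toy_vocab["y"])     # the raw count is still stored
print("y" in toy_vocab)   # but the word is below the cutoff
```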

In [None]:
from nltk.lm import Vocabulary

In [None]:
voc = nltk.lm.Vocabulary(unk_cutoff=?)
voc.update(["a","b","a"])
voc.lookup(["a","b","c"])

In [None]:
voc["a"], voc["b"], voc["c"]

If you are interested in the implementation and want to go a bit further, you can check out the documentation for the `nltk.lm.vocabulary.Vocabulary` class [here](https://www.nltk.org/api/nltk.lm.vocabulary.html) or the source code: [`nltk.lm.vocabulary.Vocabulary`](https://github.com/nltk/nltk/blob/develop/nltk/lm/vocabulary.py).

## Smoothing

As discussed in the lecture, the issue with the simple MLE is that it assigns zero probability to any sequence containing even a single trigram that was never seen during training. Several smoothing techniques exist to avoid this issue. A few implementations are available in the `nltk.lm` submodule, for example:

 - `Lidstone`: Provides Lidstone-smoothed scores.
 - `Laplace`: Implements Laplace (add one) smoothing. Equivalent to Lidstone with gamma=1.
 - `InterpolatedLanguageModel`: Logic common to all interpolated language models (Chen & Goodman 1995).
 - `WittenBellInterpolated`: Interpolated version of Witten-Bell smoothing.
 
Let's fit the Laplace model introduced in the lecture, as well as its Lidstone generalization, which performs smoothing by adding an arbitrary value `gamma` instead of `1` to the word counts.
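
The additive smoothing formula itself fits in one line. The sketch below is a pure-Python illustration with a hypothetical helper `lidstone_score` and made-up counts; it assumes the usual form $(c + \gamma) / (N + \gamma V)$, where $c$ is the word count in the context, $N$ the total count for that context and $V$ the vocabulary size:

```python
def lidstone_score(word_count, context_total, vocab_size, gamma):
    """Additive (Lidstone) smoothing; gamma=1 gives Laplace, gamma=0 gives MLE."""
    return (word_count + gamma) / (context_total + gamma * vocab_size)

# Illustrative counts for a 6-word vocabulary; the last word was never seen.
counts = [2, 2, 1, 2, 2, 0]
total, vocab_size = sum(counts), len(counts)

unseen_mle = lidstone_score(0, total, vocab_size, gamma=0.0)
unseen_laplace = lidstone_score(0, total, vocab_size, gamma=1.0)
unseen_lidstone = lidstone_score(0, total, vocab_size, gamma=0.1)

# The smoothed scores still form a probability distribution.
smoothed_sum = sum(lidstone_score(c, total, vocab_size, gamma=0.1) for c in counts)

print(unseen_mle, unseen_laplace, unseen_lidstone)
print(smoothed_sum)
```

A smaller `gamma` moves less probability mass from seen to unseen words.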

In [None]:
from nltk.lm import Laplace, Lidstone
laplace = ??
lidstone = ??

In [None]:
print(generate_tweet(laplace, max_words=100, text_seed=["Donald", "Trump"], random_seed=None))

In [None]:
print(generate_tweet(lidstone, max_words=100, text_seed=["Donald", "Trump"], random_seed=None))

## Qualitative effects of n

To try to visualize the impact of the n-gram order on the realism of the generated tweets, we can fit and generate from MLE and Laplace models with different orders (for ex. $n=1,2,3,4$).

In [None]:
???

## Quantitative evaluation

The model perplexity is a normalized form of the sequence probability, as seen in the lecture. It can be used on a held-out test dataset to evaluate the performance of an n-gram probability model. The `nltk.lm.models` classes have a `.perplexity()` method to compute the perplexity on a list of n-grams.

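We can use it to compare the MLE, Laplace and Lidstone models. Note that Lidstone with $\gamma=1$ coincides with Laplace, so a smaller value (e.g. $\gamma=0.1$) gives a more informative comparison.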
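
For intuition, perplexity can be computed by hand from per-n-gram probabilities. This is a pure-Python sketch with a hypothetical `perplexity` helper and made-up probabilities, assuming the definition as 2 raised to the average negative log2-probability:

```python
import math

def perplexity(ngram_probs):
    """2 ** cross-entropy, where cross-entropy is the mean negative log2-probability."""
    if any(p == 0.0 for p in ngram_probs):
        # A single zero-probability n-gram makes the perplexity infinite.
        return math.inf
    cross_entropy = -sum(math.log2(p) for p in ngram_probs) / len(ngram_probs)
    return 2.0 ** cross_entropy

uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])  # like a uniform choice among 4 words
mixed_ppl = perplexity([0.5, 0.25, 0.25, 0.5])
unseen_ppl = perplexity([0.5, 0.0, 0.25])           # contains an unseen n-gram

print(uniform_ppl, mixed_ppl, unseen_ppl)
```

The last case is why an unsmoothed MLE often yields an infinite perplexity on held-out text, while smoothed models give finite values.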

In [None]:
def evaluate_perplexity(lm_model, test_corpus_tokens, order=None):
    if order is None:
        order=lm_model.order
    
    test_ngrams = []
    for s in test_corpus_tokens:
        test_ngrams += list(ngrams(pad_both_ends(s, n=order), n=order))
    return lm_model.perplexity(test_ngrams)

In [None]:
???
???