<a href="https://colab.research.google.com/github/rahiakela/transformer-research-and-practice/blob/main/mastering-transformers/01-from-bag-of-words-to-transformer/2_language_modeling_and_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Language modeling and generation

For language-generation problems, traditional approaches rely on n-gram language models. An n-gram model is a Markov process: a stochastic model in which each word (event) depends only on a fixed number of preceding words. The common variants are the unigram, bigram, and general n-gram models:

- **Unigram** (all words are independent; no chain): Estimates the probability of each word in the vocabulary as its frequency divided by the total word count.
- **Bigram** (first-order Markov process): Estimates `P(word_i | word_i-1)`, the probability of `word_i` given `word_i-1`, computed as the ratio of `P(word_i, word_i-1)` to `P(word_i-1)`.
- **N-gram** ((n-1)th-order Markov process): Estimates `P(word_i | word_i-n+1, ..., word_i-1)`, the probability of `word_i` given the previous `n-1` words.
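The count-based estimates in the list above can be sketched in plain Python. This is a minimal illustration on a made-up toy corpus, not the NLTK implementation used later; the names `unigram_prob` and `bigram_prob` are hypothetical helpers introduced only for this example.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
# How often each word appears as a context, i.e. is followed by another word
context_counts = Counter(prev for prev, _ in zip(tokens, tokens[1:]))
total = len(tokens)

def unigram_prob(word):
    """P(word): relative frequency of the word in the corpus."""
    return unigram_counts[word] / total

def bigram_prob(word, prev):
    """P(word | prev): count(prev, word) / count(prev as a context)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate available
    return bigram_counts[(prev, word)] / context_counts[prev]

print(unigram_prob("the"))
print(bigram_prob("cat", "the"))
```

Dividing the joint count by the context count is equivalent to the ratio of probabilities in the bigram definition, because the normalizing totals cancel.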

Let's implement a simple language model with the Natural Language Toolkit (NLTK) library. In the following implementation, we train a maximum likelihood estimator (MLE) of order `n=2`. Any n-gram order can be selected: `n=1` for unigrams, `n=2` for bigrams, `n=3` for trigrams, and so forth:

In [5]:
!pip -q install -U nltk==3.4

In [6]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import gutenberg
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

nltk.download('gutenberg')
nltk.download('punkt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [7]:
macbeth = gutenberg.sents("shakespeare-macbeth.txt")
macbeth[:5]

[['[',
  'The',
  'Tragedie',
  'of',
  'Macbeth',
  'by',
  'William',
  'Shakespeare',
  '1603',
  ']'],
 ['Actus', 'Primus', '.'],
 ['Scoena', 'Prima', '.'],
 ['Thunder', 'and', 'Lightning', '.'],
 ['Enter', 'three', 'Witches', '.']]

In [8]:
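# Build padded unigram and bigram training data, plus the token stream for the vocabulary
train_data, vocab = padded_everygram_pipeline(2, macbeth)
lm = MLE(2)  # maximum likelihood estimator of order 2 (bigram model)
lm.fit(train_data, vocab)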

print(list(lm.vocab)[:10])
print(f"The number of words is {len(lm.vocab)}")

['<s>', '[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603']
The number of words is 4020
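To make the preprocessing step more concrete, here is a rough sketch of the idea behind padding and "everygrams" for a single sentence. It is plain Python rather than NLTK's own code; `pad_sentence` and `ngrams_up_to` are hypothetical helpers, and the ordering of the n-grams NLTK yields may differ.

```python
def pad_sentence(sentence, order):
    """Add order-1 boundary markers on each side (assumed symbols: <s>, </s>)."""
    pad = order - 1
    return ["<s>"] * pad + list(sentence) + ["</s>"] * pad

def ngrams_up_to(tokens, order):
    """Return every n-gram of length 1..order as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, order + 1)
        for i in range(len(tokens) - n + 1)
    ]

padded = pad_sentence(["Enter", "three", "Witches", "."], order=2)
grams = ngrams_up_to(padded, order=2)
print(padded)
print(grams)
```

The padding explains why `<s>` shows up as the first vocabulary entry above: sentence boundaries are treated as ordinary tokens, so the model can learn which words tend to start or end a sentence.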


The following code inspects what the language model has learned, printing raw counts and the probability scores derived from them:

In [10]:
print(f"The frequency of the term 'Macbeth' is {lm.counts['Macbeth']}")
print(f"The language model probability score of 'Macbeth' is {lm.score('Macbeth')}")
print(f"The number of times 'Macbeth' follows 'Enter' is {lm.counts[['Enter']]['Macbeth']}")
print(f"P(Macbeth | Enter) is {lm.score('Macbeth', ['Enter'])}")
print(f"P(shaking | for) is {lm.score('shaking', ['for'])}")

The frequency of the term 'Macbeth' is 61
The language model probability score of 'Macbeth' is 0.0022631149365585812
The number of times 'Macbeth' follows 'Enter' is 15
P(Macbeth | Enter) is 0.1875
P(shaking | for) is 0.012195121951219513
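As a quick sanity check on these numbers, and assuming the score is the plain count ratio `count(context, word) / count(context)`, the context count for `Enter` can be recovered from the two values printed above. The variable names are introduced only for this sketch.

```python
# Values copied from the output above
count_enter_macbeth = 15         # times 'Macbeth' follows 'Enter'
p_macbeth_given_enter = 0.1875   # P(Macbeth | Enter)

# Under MLE, P(w | c) = count(c, w) / count(c), so count(c) = count(c, w) / P(w | c)
implied_enter_count = count_enter_macbeth / p_macbeth_given_enter
print(implied_enter_count)
```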


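The n-gram language model stores n-gram counts and derives conditional probabilities from them for sentence generation. In `lm = MLE(2)`, MLE stands for maximum likelihood estimation: each probability is the relative frequency observed in the training text, with no smoothing, so unseen n-grams score zero. During generation, each next token is sampled from the conditional distribution given its context.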
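The generation idea can be sketched without NLTK: keep bigram counts, then repeatedly sample the next token in proportion to how often it followed the current one. This is a simplified illustration under stated assumptions, not NLTK's implementation; `train_bigram_counts`, `generate`, and the toy sentences are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Count how often each word follows each context word."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + list(sent) + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            counts[prev][word] += 1
    return counts

def generate(counts, num_words, seed_word="<s>", random_seed=None):
    """Sample up to num_words tokens, each conditioned on the previous token."""
    rng = random.Random(random_seed)
    context, output = seed_word, []
    for _ in range(num_words):
        followers = counts.get(context)
        if not followers:
            break  # unseen context or end of sentence: nothing to sample from
        words = list(followers)
        weights = [followers[w] for w in words]
        context = rng.choices(words, weights=weights, k=1)[0]
        output.append(context)
    return output

# Toy training data (hypothetical)
toy = [["Enter", "Macbeth", "."], ["Enter", "three", "Witches", "."]]
counts = train_bigram_counts(toy)
print(generate(counts, 5, seed_word="Enter", random_seed=42))
```

Fixing `random_seed` makes the sampled sequence reproducible, which is the same reason the cells below pass `random_seed=42`.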

The following code generates a random sequence of 10 tokens, using `<s>` as the starting context:

In [11]:
lm.generate(10, text_seed=["<s>"], random_seed=42)

['My', 'first', 'i', "'", 's', 'not', 'put', 'that', 'most', 'may']

We can set a specific starting condition through the `text_seed` parameter, which conditions the generation on the preceding context.

In the previous example, that context was `<s>`, a special token marking the beginning of a sentence. The next example seeds the generation with the word `love` instead:

In [12]:
lm.generate(10, text_seed=["love"], random_seed=42)

['done', 'double', 'sence', ',', 'as', 'palpable', ',', 'as', 'palpable', ',']