Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.
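
For reference, the estimates computed in this notebook are relative frequencies (the maximum-likelihood estimate). The notation $C(\cdot)$ for a count over the training corpus is introduced here for illustration and does not appear in the original text:

$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1} \ldots w_{i-1}\, w_i)}{\sum_{w} C(w_{i-n+1} \ldots w_{i-1}\, w)}$$

For a unigram model ($n = 1$) the context is empty, so the denominator is simply the total number of tokens in the corpus.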

To "train" such models, we will make use of the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [1]:
from nltk.corpus import reuters

In [3]:
import nltk
nltk.download('reuters')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\ineso\AppData\Roaming\nltk_data...


True

## Unigram model

For starters, let's build a unigram language model.

In [4]:
from collections import defaultdict

# Create a placeholder for the model
model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [5]:
total_count = float(sum(model.values()))
for w in model:
    model[w] /= total_count
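
As a quick sanity check of the count-then-normalize step, here is a minimal sketch on a tiny hand-made corpus. The names `toy_sents`, `counts` and `unigram` are assumptions made for this illustration; they stand in for `reuters.sents()` and `model`.

```python
from collections import Counter

# Toy corpus standing in for reuters.sents() (an assumption for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Count every token across all sentences
counts = Counter(w for sent in toy_sents for w in sent)

# Divide each count by the total number of tokens
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}

print(unigram["the"])         # 2 of the 7 tokens are "the"
print(sum(unigram.values()))  # should be 1, up to floating-point rounding
```

Building a new dictionary instead of dividing in place keeps the raw counts available, which can be convenient if you want to inspect them later.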

#### Likely words

How likely is the word 'the'?

In [6]:
# your code here
model['the']

0.03384881432399122

What is the most likely word in the corpus?

In [9]:
# your code here
print(max(model, key=model.get), "->", max(model.values()))

. -> 0.05503054476189148


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [10]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # select word above the probability threshold
    accumulator = .0
    for word in model.keys():
        accumulator += model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

years 808 by > ISSUES gains agreement cannot Opr necessary where in CANADIAN ban the 491p reached from Grains for also His said related French biggest Qtly upward 1 s S company . mln and made shrs year De and SELL a Net It - speculation balance from budget in companies 326 recognised total of Net . badly products 608 Indonesia The mln primarily mln . market Knox cts he Bell Atlantic prices was ., " 91 May Sandra said North the 841 Net about , - July the ban U dlrs state Canada forces breakdown cut . it is
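
The loop above is a form of inverse-CDF sampling: draw a uniform number and walk the cumulative probabilities until they pass it. The sketch below wraps that idea in a function and compares it with the standard library's `random.choices`, which should produce the same distribution. The names `toy_model` and `sample_word` are hypothetical, and the probabilities are made up for the example.

```python
import random

# Hypothetical toy distribution (values invented for this example)
toy_model = {"the": 0.5, "cat": 0.3, "sat": 0.2}

def sample_word(dist, rng=random):
    """Walk the cumulative sum of probabilities until it reaches r."""
    r = rng.random()
    accumulator = 0.0
    for word, p in dist.items():
        accumulator += p
        if accumulator >= r:
            return word
    # Fallback: rounding can leave the final sum slightly below r
    return word

rng = random.Random(0)  # seeded so that runs are repeatable
manual = [sample_word(toy_model, rng) for _ in range(10000)]
builtin = rng.choices(list(toy_model), weights=list(toy_model.values()), k=10000)

# Both relative frequencies should be close to 0.5
print(manual.count("the") / len(manual))
print(builtin.count("the") / len(builtin))
```

The explicit fallback matters because floating-point probabilities may sum to slightly less than 1, in which case the loop can finish without selecting a word.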


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and right and define our own sequence-start and sequence-end symbols.

We first need to obtain the counts:

In [11]:
from nltk import bigrams

# Create a placeholder for the model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        model[w1][w2] += 1
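
To see what the padding contributes, the following sketch reproduces it in plain Python. `padded_ngrams` is a hypothetical helper written for this illustration, not an NLTK function; it assumes, as NLTK's padding does, that *n-1* symbols are added on each side.

```python
def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Return the n-grams of tokens after adding n-1 pad symbols on each side."""
    padded = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

pairs = padded_ngrams(["oil", "prices", "rose"], 2)
print(pairs)
# [('<s>', 'oil'), ('oil', 'prices'), ('prices', 'rose'), ('rose', '</s>')]
```

The first pair lets the model learn which words start a sentence, and the last pair lets it learn which words end one.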

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [12]:
# your code here
for w1 in model:
    total_count = float(sum(model[w1].values()))
    for w2 in model[w1]:
        model[w1][w2] /= total_count
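
The same normalization can be checked by hand on a toy corpus. This is a minimal sketch; `toy_sents`, `counts` and `bigram` are names assumed for the illustration, and the padding is done manually rather than through NLTK.

```python
from collections import defaultdict

# Toy corpus standing in for reuters.sents() (an assumption for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# Count each (previous word, word) pair, with one pad symbol on each side
counts = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        counts[w1][w2] += 1

# Divide each count by the total count of pairs that start with w1
bigram = {
    w1: {w2: c / sum(followers.values()) for w2, c in followers.items()}
    for w1, followers in counts.items()
}

print(bigram["the"])  # {'cat': 0.5, 'dog': 0.5}
print(bigram["<s>"])  # {'the': 1.0}
```

Storing the result in plain dictionaries has one practical side effect: looking up an unseen word raises a `KeyError`, whereas a `defaultdict` would silently insert an empty entry for it.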

#### Likely pairs

What are the probabilities of each word following 'today'?

In [13]:
# your code here
model['today']

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'.': 0.18636363636363637,
             'to': 0.0659090909090909,
             "'": 0.10681818181818181,
             'and': 0.025,
             'as': 0.013636363636363636,
             ',': 0.16363636363636364,
             'with': 0.007575757575757576,
             'by': 0.020454545454545454,
             'when': 0.0030303030303030303,
             'on': 0.011363636363636364,
             'recommended': 0.0007575757575757576,
             'he': 0.005303030303030303,
             'its': 0.0022727272727272726,
             'for': 0.01893939393939394,
             'De': 0.0007575757575757576,
             'European': 0.0007575757575757576,
             'described': 0.0007575757575757576,
             'the': 0.013636363636363636,
             ',"': 0.007575757575757576,
             'they': 0.0015151515151515152,
             'issued': 0.0015151515151515152,
             'being': 0.0007575757575757576,
            

What are the probabilities for sentence-starting words? What do most of them have in common?

In [14]:
# your code here
model['<s>'] 
# most have capitalized first letter 

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'ASIAN': 7.31047591198187e-05,
             'They': 0.008151180641859785,
             'But': 0.019263104028072228,
             'The': 0.16154324146501936,
             'Unofficial': 1.8276189779954676e-05,
             '"': 0.06559324512025733,
             'In': 0.02522114189633745,
             'Threat': 3.655237955990935e-05,
             'Taiwan': 0.0006944952116382777,
             'Retaliation': 5.482856933986402e-05,
             'A': 0.013963008991885371,
             'Last': 0.0036917903355508444,
             'Much': 0.0001462095182396374,
             'He': 0.028986036991008116,
             'Meanwhile': 0.0007493237809781417,
             'Japan': 0.0020286570655749687,
             'Deputy': 0.0001462095182396374,
             'CHINA': 0.0009138094889977337,
             'It': 0.03231230353095987,
             'JAPAN': 0.002997295123912567,
             'MITI': 0.0002193142773594561,
             

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [15]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous word text[-1]
    # your code here
    accumulator = .0
    for word in model[text[-1]].keys():
        accumulator += model[text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break
    
print (' '.join([t for t in text if t]))

<s> The transaction is expected to recover its protected ," he said the stockpile management board of credit of life of Oregon , 328 billion vs 4 pct in special shares to a smaller - SCALE CO - Iraq has offered five days but his policies could be a significant current marketing and flats in the buffer stock split in his bid by strikes , a package announced he said it is no assurance that much as well short of 50 pct . </s>


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [30]:
# your code here
from nltk import trigrams

# Create a placeholder for the model
model = {}

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        if w1 not in model:
            model[w1] = {}
        if w2 not in model[w1]:
            model[w1][w2] = {}
        if w3 not in model[w1][w2]:
            model[w1][w2][w3] = 0
        model[w1][w2][w3] += 1

for w1 in model:
    for w2 in model[w1]:
        total_count = float(sum(model[w1][w2].values()))
        for w3 in model[w1][w2]:
            model[w1][w2][w3] /= total_count

#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [31]:
# your code here
# note: sorted() is ascending here, so the most likely words are at the END of the list
sorted(model['today']['the'], key=model['today']['the'].get)

['public',
 'European',
 'Bank',
 'emirate',
 'overseas',
 'newspaper',
 'Turkish',
 'increase',
 'options',
 'Higher',
 'pound',
 'Italian',
 'time',
 'price',
 'company']
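
When only the few most likely continuations matter, it can be easier to sort in descending order and keep the probabilities next to the words. The sketch below uses a hypothetical helper `top_k` and a made-up distribution `toy_dist`; the numbers are not taken from the Reuters model.

```python
def top_k(dist, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    return sorted(dist.items(), key=lambda item: item[1], reverse=True)[:k]

# Invented probabilities, for illustration only
toy_dist = {"company": 0.4, "price": 0.3, "time": 0.2, "pound": 0.1}

print(top_k(toy_dist, 2))  # [('company', 0.4), ('price', 0.3)]
```

With the trigram model above, a call such as `top_k(model['today']['the'])` would be the analogous query.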

#### Generating text

Create your text generator based on the trigram model. Does the generated text start to feel a bit more sound?

In [50]:
# your code here
import random

# sequence start symbol
text = ["<s>","<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous two words text[-2], text[-1]
    # your code here
    accumulator = .0
    for word in model[text[-2]][text[-1]].keys():
        accumulator += model[text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break
    
print (' '.join([t for t in text if t]))

<s> <s> He urged creation of a reorganisation , employee training and education , transport and communications activities , with the company intends to sell its shares in the second highest ever reported , ETL moved into the markets will increase its presence in the U . K . Ltd , a retail closeout merchandiser . </s>


## N-gram models

For larger *n*, we can use NLTK's [n-grams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which allows us to choose an arbitrary *n*.

Create your own 4-gram model.

In [51]:
# your code here
from nltk import ngrams

# Create a placeholder for the model
model = {}

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(sentence, 4, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        if w1 not in model:
            model[w1] = {}
        if w2 not in model[w1]:
            model[w1][w2] = {}
        if w3 not in model[w1][w2]:
            model[w1][w2][w3] = {}
        if w4 not in model[w1][w2][w3]:
            model[w1][w2][w3][w4] = 0
        model[w1][w2][w3][w4] += 1

for w1 in model:
    for w2 in model[w1]:
        for w3 in model[w1][w2]:
            total_count = float(sum(model[w1][w2][w3].values()))
            for w4 in model[w1][w2][w3]:
                model[w1][w2][w3][w4] /= total_count
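
The nested dictionaries above need one more level for every increase in *n*. One possible alternative, sketched here under stated assumptions, is to key the model by a tuple holding the *n-1* context words, so that the same code works for any *n*. The function `train_ngram_model` and the toy data are hypothetical, and the padding is done in plain Python rather than with NLTK.

```python
from collections import Counter, defaultdict

def train_ngram_model(sentences, n, left="<s>", right="</s>"):
    """Map each (n-1)-word context tuple to a {word: probability} dictionary."""
    counts = defaultdict(Counter)
    for sent in sentences:
        # Add n-1 pad symbols on each side of the sentence
        padded = [left] * (n - 1) + list(sent) + [right] * (n - 1)
        for i in range(len(padded) - n + 1):
            *context, word = padded[i:i + n]
            counts[tuple(context)][word] += 1
    # Normalize the counts of each context into probabilities
    return {
        ctx: {w: c / sum(followers.values()) for w, c in followers.items()}
        for ctx, followers in counts.items()
    }

# Toy corpus standing in for reuters.sents() (an assumption for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model4 = train_ngram_model(toy_sents, 4)

print(model4[("<s>", "<s>", "<s>")])  # {'the': 1.0}
print(model4[("<s>", "<s>", "the")])  # {'cat': 0.5, 'dog': 0.5}
```

The trade-off is that a lookup such as `model4[("today", "the", "public")]` takes a tuple instead of chained indexing.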

#### Likely tuples

Check the most likely words following "today the public".

In [52]:
# your code here
model['today']['the']['public']

{'is': 1.0}

#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [58]:
# your code here
import random

# sequence start symbol
text = ["<s>","<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # select word above the probability threshold, conditioned on the previous three words text[-3], text[-2], text[-1]
    # your code here
    accumulator = .0
    for word in model[text[-3]][text[-2]][text[-1]].keys():
        accumulator += model[text[-3]][text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break
    
print (' '.join([t for t in text if t]))

<s> <s> <s> Economists estimate that the Fed was firming policy , but said its doubtful lending to Japan will occur because that country doesn ' t decide prices by itself but certainly desires price stability ," he said . </s>
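
The three generators in this notebook differ only in how many previous words they condition on. As a rough sketch, they could be folded into one function that works for any *n* of at least 2. The helpers `train` and `generate`, the `max_len` cap, and the toy corpus are all assumptions made for this example; sampling uses `random.choices` instead of the explicit accumulator loop.

```python
import random
from collections import Counter, defaultdict

def train(sentences, n, left="<s>", right="</s>"):
    """Build {context tuple: {word: probability}} from tokenized sentences."""
    counts = defaultdict(Counter)
    for sent in sentences:
        padded = [left] * (n - 1) + list(sent) + [right] * (n - 1)
        for i in range(len(padded) - n + 1):
            *context, word = padded[i:i + n]
            counts[tuple(context)][word] += 1
    return {
        ctx: {w: c / sum(f.values()) for w, c in f.items()}
        for ctx, f in counts.items()
    }

def generate(model, n, rng=random, max_len=50):
    """Sample words until the end symbol appears or max_len words are produced."""
    text = ["<s>"] * (n - 1)
    while len(text) < max_len + n - 1:
        # The context is the last n-1 words generated so far
        context = tuple(text[len(text) - (n - 1):])
        dist = model[context]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word == "</s>":
            break
        text.append(word)
    return text[n - 1:]  # drop the start symbols

# Toy corpus standing in for reuters.sents() (an assumption for illustration)
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model3 = train(toy_sents, 3)
sentence = generate(model3, 3, random.Random(0))
print(' '.join(sentence))
```

On a corpus this small the generator can only reproduce one of the two training sentences, which also illustrates why text from higher-order models trained on Reuters reads so fluently: long contexts often have a single observed continuation.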
