Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt), based on [A Comprehensive Guide to Build your own Language Model in Python](https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-language-model-nlp-python-code/) by Mohd Sanad Zaki Rizvi.

# N-GRAM LANGUAGE MODELS

N-gram language models are based on computing probabilities for the occurrence of each word given *n-1* previous words.
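
In symbols, the count-based (maximum-likelihood) estimate used throughout this notebook divides the count of the full n-gram by the count of its context. The notation $C(\cdot)$ for corpus counts is an assumption introduced here for illustration:

$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \dots w_{i-1}\, w_i)}{C(w_{i-n+1} \dots w_{i-1})}$$

For a unigram model (*n* = 1) the context is empty, so the denominator is simply the total number of tokens.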

To "train" such models, we will use the [Reuters](https://www.nltk.org/book/ch02.html) corpus, which contains 10,788 news documents totaling 1.3 million words.

In [1]:
from nltk.corpus import reuters

We can check how many sentences the corpus contains. Each sentence is a list of tokens (words and punctuation marks).

In [2]:
print(len(reuters.sents()))

print(reuters.sents()[0])
for w in reuters.sents()[0]:
    print(w, end=' ')

54716
['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . 

## Unigram model

For starters, let's build a unigram language model.

In [3]:
from collections import defaultdict

# Create a placeholder for the model
uni_model = defaultdict(int)

# Count the frequency of each token
for sentence in reuters.sents():
    for w in sentence:
        uni_model[w] += 1

Now that we have the counts, we need to transform them into probabilities:

In [4]:
total_count = float(sum(uni_model.values()))
for w in uni_model:
    uni_model[w] /= total_count
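
To see the normalization on something small enough to verify by hand, here is a minimal sketch on a toy corpus. The sentences and the names `toy_sents`, `toy_counts` and `toy_uni` are made up for illustration and are not part of the Reuters data.

```python
from collections import Counter

# Toy corpus (hypothetical): two already-tokenized sentences
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]

# Count every token across all sentences
toy_counts = Counter(w for sent in toy_sents for w in sent)

# Divide each count by the total number of tokens
toy_total = sum(toy_counts.values())
toy_uni = {w: c / toy_total for w, c in toy_counts.items()}

print(toy_uni["the"])         # 'the' accounts for 2 of the 7 tokens
print(sum(toy_uni.values()))  # the probabilities should add up to (about) 1
```

Building a new dictionary instead of overwriting the counts keeps both the raw counts and the probabilities available, which can be handy when checking results.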

#### Likely words

How likely is the word 'the'?

In [5]:
print(uni_model['the'])

0.03384881432399122


What are the five most likely tokens in the corpus?

In [6]:
sorted_words = list(uni_model.items())
sorted_words.sort(reverse=True, key=lambda x: x[1])

print(sorted_words[:5])

[('.', 0.05503054476189148), (',', 0.04204735033705867), ('the', 0.03384881432399122), ('of', 0.02090687697314862), ('to', 0.01977724666558585)]


#### Generating text

Based on this unigram language model, we can try generating some text. It will not be pretty, though...

In [7]:
import random

# number of words to generate
total_words = 100
text = []

for i in range(total_words):
    # select a random probability threshold
    r = random.random()

    # pick the first word whose cumulative probability reaches the threshold
    accumulator = 0.0
    for word in uni_model.keys():
        accumulator += uni_model[word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

reserves shareholders danger of undertaken announced improve 5 2 the per 635 had . S MARCH within CONSUMERS . year GDP AMUSEMENTS In 646 a the it which fell Reuters currency and 70 vs end news gain market dlrs , announced July said / now an of of . excludes the . total this PROBLEMS Insurance doesn of . , and banks It depends to Express RADIO banks its new interest be vs ENDO cycle & vs 8 ." " , mln electronic 1 a 22 McCarthy CORP 4 for of six ICCO revises S Harpers But . cts that
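
The loop above walks through the cumulative distribution by hand. As a possible alternative, the standard library's `random.choices` performs the same kind of weighted sampling. The helper `sample_words` and the toy model below are hypothetical names used only for this sketch.

```python
import random

def sample_words(model, k, seed=None):
    """Draw k words from a {word: probability} dict by weighted sampling."""
    rng = random.Random(seed)  # a local generator, so a seed makes runs repeatable
    words = list(model.keys())
    weights = list(model.values())
    return rng.choices(words, weights=weights, k=k)

# Toy unigram model (made-up probabilities)
toy_model = {"the": 0.5, "cat": 0.25, "sat": 0.25}
print(' '.join(sample_words(toy_model, 10, seed=0)))
```

With the model from this notebook, the call would be `sample_words(uni_model, 100)`.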


## Bigram model

In a bigram model, we'll compute the probability of each word given the previous word as context. To obtain bigrams, we can use NLTK's [bigrams](https://www.nltk.org/_modules/nltk/util.html#bigrams). When doing so, we can pad the input on the left and on the right, and define our own sequence-start and sequence-end symbols.
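
To make the padding concrete, here is a standard-library sketch that mimics what padded bigrams look like for a short sentence. `padded_bigrams` is a hypothetical helper written for this illustration, not an NLTK function; its output should correspond to what NLTK's `bigrams` yields with the padding arguments used in this notebook.

```python
def padded_bigrams(tokens, start="<s>", end="</s>"):
    """Hypothetical helper: pad the sentence, then pair each token with its successor."""
    padded = [start] + list(tokens) + [end]
    return list(zip(padded, padded[1:]))

print(padded_bigrams(["trade", "friction"]))
# [('<s>', 'trade'), ('trade', 'friction'), ('friction', '</s>')]
```

The start symbol lets the model learn which words tend to open a sentence, and the end symbol lets it learn where sentences tend to stop.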

We first need to obtain the counts:

In [8]:
from nltk import bigrams

# Create a placeholder for the model
bi_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count the frequency of each bigram
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        bi_model[w1][w2] += 1

As before, we need to transform counts into probabilities. For that, we divide each count by the total number of occurrences of the first word in the bigram.

In [10]:
for word1, words in bi_model.items():
    count = sum(bi_model[word1].values())
    for word2, value in words.items():
        bi_model[word1][word2] = value / count

print(bi_model['FEAR']['DAMAGE'])

0.5
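
The same normalization can be checked by hand on a toy corpus. The sketch below uses made-up sentences and pads them manually, so it does not depend on NLTK; the names `toy_sents`, `toy_bi` and `toy_probs` are assumptions for this example only.

```python
from collections import defaultdict

# Toy corpus (hypothetical): 'the' is followed by 'cat' twice and by 'dog' once
toy_sents = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

# Count bigrams, adding one start and one end symbol per sentence
toy_bi = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = ["<s>"] + sent + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        toy_bi[w1][w2] += 1

# Divide each count by the total count of bigrams that start with w1
toy_probs = {
    w1: {w2: c / sum(followers.values()) for w2, c in followers.items()}
    for w1, followers in toy_bi.items()
}

print(toy_probs["the"])  # 'cat' gets 2/3 and 'dog' gets 1/3
```

Each inner dictionary is a probability distribution over the next word, so its values add up to 1.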


#### Likely pairs

What are the probabilities of each word following 'today'?

In [None]:
for item in bi_model['today'].items():
    print(item)

('.', 0.18636363636363637)
('to', 0.08099153270339919)
("'", 0.14281845668426268)
('and', 0.03898818921166616)
('as', 0.022127996247286685)
(',', 0.27153714127289674)
('with', 0.017249027560317914)
('by', 0.04738836773278181)
('when', 0.007369095420981912)
('on', 0.027838877233845354)
('recommended', 0.0019089701027416616)
('he', 0.013388299794026642)
('its', 0.005815554235563076)
('for', 0.048745886771773383)
('De', 0.002049547688160887)
('European', 0.002053748316205343)
('described', 0.0020579661805245923)
('the', 0.037119624972313185)
(',"', 0.02141530257202476)
('they', 0.0043765857046362015)
('issued', 0.004395782122323787)
('being', 0.002207573746004408)
('that', 0.09734767257803173)
('quoted', 0.014702799275565621)
('it', 0.05222578101178188)
('."', 0.013118109464209864)
('show', 0.005316807256942274)
('of', 0.03474348048732474)
('at', 0.10520322358541373)
('through', 0.0061859968726243365)
('reported', 0.062243817860557354)
('(', 0.00331807797378481)
('said', 0.069910837365821

What are the probabilities for sentence-starting words? What do most of them have in common? (Hint: check the *left_pad_symbol* defined above for collecting bigrams.)

In [None]:
for item in bi_model['<s>'].items():
    print(item)

# Common feature: most of them start with an uppercase character

('ASIAN', 7.31047591198187e-05)
('They', 0.008151776564630543)
('But', 0.01942284008862723)
('The', 0.16610906200599357)
('Unofficial', 2.2536128579816908e-05)
('"', 0.08088398824384287)
('In', 0.03383740053717582)
('Threat', 5.075715913459519e-05)
('Taiwan', 0.0009644349742944421)
('Retaliation', 7.621310395123603e-05)
('A', 0.01941041643008517)
('Last', 0.005233658361310965)
('Much', 0.00020836407660533536)
('He', 0.04131678689693373)
('Meanwhile', 0.0011141190586176728)
('Japan', 0.0030196376934263753)
('Deputy', 0.00021829068632500394)
('CHINA', 0.0013646146640747382)
('It', 0.04831870914116985)
('JAPAN', 0.004709608173737834)
('MITI', 0.0003462360655041191)
('Nuclear', 2.8862998581404126e-05)
('THAI', 0.0005484128014277204)
('Thailand', 0.0003754356993039059)
('Export', 0.00034668618473795775)
('Products', 0.00011560213808565536)
('INDONESIA', 0.0006069813912683967)
('Prices', 0.0012436214915964214)
('Harahap', 5.791488218337605e-05)
('Indonesia', 0.0009846100190393852)
('Indonesi

#### Generating text

Now that we have a bigram model, we can generate text based on it.

In [None]:
import random

# sequence start symbol
text = ["<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # pick the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous word text[-1]
    accumulator = 0.0
    for word in bi_model[text[-1]].keys():
        accumulator += bi_model[text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

<s> It ' s stock , Shearson Lehman Brothers Stores Inc said . </s>


## Trigram model

In a trigram model, we'll compute the probability of each word given the previous two words as context. To obtain trigrams, we can use NLTK's [trigrams](https://www.nltk.org/_modules/nltk/util.html#trigrams).

In [12]:
from nltk import trigrams

# Create a placeholder for the model
tri_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0)))

# Count the frequency of each trigram
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        tri_model[w1][w2][w3] += 1

for word1, words1 in tri_model.items():
    for word2, words2 in words1.items():
        count = sum(tri_model[word1][word2].values())
        for word3, value in words2.items():
            tri_model[word1][word2][word3] = value / count

#### Likely triplets

What are the most likely words following "today the"?
What about "England has"?

In [13]:
print(tri_model['today']['the'])
print(tri_model['England']['has'])

defaultdict(<function <lambda>.<locals>.<lambda>.<locals>.<lambda> at 0x7f2e84bb3b50>, {'public': 0.05555555555555555, 'European': 0.05555555555555555, 'Bank': 0.05555555555555555, 'price': 0.1111111111111111, 'emirate': 0.05555555555555555, 'overseas': 0.05555555555555555, 'newspaper': 0.05555555555555555, 'company': 0.16666666666666666, 'Turkish': 0.05555555555555555, 'increase': 0.05555555555555555, 'options': 0.05555555555555555, 'Higher': 0.05555555555555555, 'pound': 0.05555555555555555, 'Italian': 0.05555555555555555, 'time': 0.05555555555555555})
defaultdict(<function <lambda>.<locals>.<lambda>.<locals>.<lambda> at 0x7f2e82617640>, {'carried': 0.25, 'been': 0.5, 'recently': 0.25})
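
Printing the raw dictionary shows the candidates in insertion order, which makes the most likely ones hard to spot. A small helper can sort them first; `top_k` is a hypothetical name, and the toy values below are copied from the "England has" output above.

```python
def top_k(followers, k=5):
    """Return the k (word, probability) pairs with the highest probability."""
    return sorted(followers.items(), key=lambda item: item[1], reverse=True)[:k]

toy_followers = {"carried": 0.25, "been": 0.5, "recently": 0.25}
print(top_k(toy_followers, k=1))
# [('been', 0.5)]
```

With the model from this notebook, the call would be `top_k(tri_model['today']['the'])`.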


#### Generating text

Create your text generator based on the trigram model. Does the generated text start to sound a bit more coherent?

In [27]:
import random

# sequence start symbol
text = ["<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # pick the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous two words text[-2] and text[-1]
    accumulator = 0.0
    for word in tri_model[text[-2]][text[-1]]:
        accumulator += tri_model[text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

<s> <s> The present agreement if we find this afternoon , buying bank bills outright comprising 25 mln stg . </s>


## N-gram models

For larger *n*, we can use NLTK's [ngrams](https://www.nltk.org/_modules/nltk/util.html#ngrams), which lets us choose an arbitrary *n*.

Create your own 4-gram model.

In [29]:
from nltk import ngrams

# Create a placeholder for the model
quad_model = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: 0))))

# Count the frequency of each 4-gram
for sentence in reuters.sents():
    for w1, w2, w3, w4 in ngrams(sentence, 4, pad_right=True, pad_left=True, left_pad_symbol='<s>', right_pad_symbol='</s>'):
        quad_model[w1][w2][w3][w4] += 1

for word1, words1 in quad_model.items():
    for word2, words2 in words1.items():
        for word3, words3 in words2.items():
            count = sum(quad_model[word1][word2][word3].values())
            for word4, value in words3.items():
                quad_model[word1][word2][word3][word4] = value / count
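
The nested dictionaries above need one extra level, and one extra loop, for every increase in *n*. One possible alternative, sketched here under the assumption that sentences are lists of tokens, is to key the model on a tuple holding the *n-1* context words. `build_ngram_model` is a hypothetical helper that pads manually rather than calling NLTK.

```python
from collections import defaultdict

def build_ngram_model(sentences, n, start="<s>", end="</s>"):
    """Hypothetical helper: map each (n-1)-word context tuple to {next_word: probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        # n-1 padding symbols on each side, mirroring pad_left/pad_right
        padded = [start] * (n - 1) + list(sent) + [end] * (n - 1)
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    # Normalize the counts of each context into probabilities
    return {
        context: {w: c / sum(followers.values()) for w, c in followers.items()}
        for context, followers in counts.items()
    }

toy_sents = [["a", "b", "c"], ["a", "b", "d"]]
toy_model = build_ngram_model(toy_sents, 3)
print(toy_model[("a", "b")])
# {'c': 0.5, 'd': 0.5}
```

The same function covers any *n*: `build_ngram_model(reuters.sents(), 4)` would give a 4-gram model whose entries are looked up as `model[('today', 'the', 'public')]`.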

#### Likely tuples

Check the most likely words following "today the public".

In [30]:
print(quad_model['today']['the']['public'])

defaultdict(<function <lambda>.<locals>.<lambda>.<locals>.<lambda>.<locals>.<lambda> at 0x7f2e74568550>, {'is': 1.0})


#### Generating text

Create your text generator based on the 4-gram model. Even better, huh?

In [33]:
import random

# sequence start symbol
text = ["<s>", "<s>", "<s>"]

# generate text until we find the end of sequence symbol
while text[-1] != "</s>":
    # select a random probability threshold
    r = random.random()
    
    # pick the first word whose cumulative probability reaches the threshold,
    # conditioned on the previous three words text[-3], text[-2] and text[-1]
    accumulator = 0.0
    for word in quad_model[text[-3]][text[-2]][text[-1]]:
        accumulator += quad_model[text[-3]][text[-2]][text[-1]][word]
        if accumulator >= r:
            text.append(word)
            break

print(' '.join(text))

<s> <s> <s> Another move Herrington said he may buy up to 800 , 000 more than a simple disaster payment and feedgrains should be treated much more seriously by the commercial banks rose by 1 . 0 MM , FRANCA 28 . 0 mln Nine mths Shr 1 . 54 mln Extraordinary credit 78 mln vs 16 . 5 pct , the National Statistics Office said producer prices for liquefied gas fell 10 pct in 1987 . </s>
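
The three generators in this notebook differ only in how many previous words they look at. As a rough sketch, they could be folded into one function, assuming the model is stored as a plain dict that maps a tuple of *n-1* context words to a `{word: probability}` dict (a different layout from the nested dictionaries used above). The name `generate` and the `max_words` safety limit are assumptions made for this example.

```python
import random

def generate(model, n, start="<s>", end="</s>", max_words=50, seed=None):
    """Hypothetical helper: sample words until the end symbol or max_words is reached."""
    rng = random.Random(seed)
    text = [start] * (n - 1)
    for _ in range(max_words):
        # The context is the last n-1 words (an empty tuple when n == 1)
        context = tuple(text[len(text) - (n - 1):])
        followers = model.get(context)
        if not followers:
            break  # unseen context: nothing to sample from
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == end:
            break
        text.append(word)
    return text[n - 1:]  # drop the start symbols

# Toy bigram model with a single possible path (made-up values)
toy_model = {("<s>",): {"a": 1.0}, ("a",): {"b": 1.0}, ("b",): {"</s>": 1.0}}
print(' '.join(generate(toy_model, 2)))
# a b
```

The `max_words` limit guards against models that rarely produce the end symbol, which would otherwise keep the loop running for a long time.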
