## Exercise | Building your own language model

---

### A quick recap of language models

> A language model assigns a probability to each candidate next word, given the words that came before it.

1. n-gram language model - Predicts the next word by looking at the previous (n-1) words.
    - bi-gram/2-gram language model - Predicts the next word by looking at the previous (2-1) = 1 word.
    - tri-gram/3-gram language model - Predicts the next word by looking at the previous (3-1) = 2 words.
    
    
2. How to assign probabilities?

Suppose the two previous words for a tri-gram language model are $w_1$ and $w_2$, and we need to assign a probability to the next word $w$.

Count the number of times $w_1$ and $w_2$ occur together in that order [C1], and count the number of times $w_1$, $w_2$, and $w$ occur together in that order [C2].

Divide C2 by C1.

$P(w \mid w_1, w_2) = \frac{C2}{C1}$


For example,

Imagine you have a corpus, and you want to assign probability to the word 'Sam' after seeing two words 'I am'.

Count the number of times 'I am Sam' occurs [C2], and count the number of times 'I am' occurs [C1]. Dividing C2 by C1 gives the probability of seeing 'Sam' after 'I am'. Easy maths, right?
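
The counting recipe can be tried on a few sentences before moving to a real corpus. The sketch below is only an illustration: the toy corpus and the helper name `trigram_probability` are made up here and are not part of the notebook.

```python
from collections import Counter

# Hypothetical toy corpus, invented for this illustration.
corpus = [
    ["I", "am", "Sam"],
    ["I", "am", "happy"],
    ["I", "am", "Sam", "again"],
    ["Sam", "is", "here"],
]

pair_counts = Counter()    # C1: how often two words occur together
triple_counts = Counter()  # C2: how often three words occur together
for sentence in corpus:
    pair_counts.update(zip(sentence, sentence[1:]))
    triple_counts.update(zip(sentence, sentence[1:], sentence[2:]))


def trigram_probability(w1, w2, w):
    """Return C2 / C1, or 0.0 if the pair (w1, w2) was never seen."""
    c1 = pair_counts[(w1, w2)]
    c2 = triple_counts[(w1, w2, w)]
    return c2 / c1 if c1 else 0.0


# 'I am' occurs 3 times and 'I am Sam' occurs 2 times, so this is 2/3.
print(trigram_probability("I", "am", "Sam"))
```

In this toy corpus 'I am' never ends a sentence, so every occurrence of the pair has a following word and the probabilities for that pair add up to 1.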

To generate a more coherent sentence, we repeat this simple calculation word by word, following the chain rule of probability.
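
As a rough sketch of what the chain rule means for a tri-gram model, the probability of a whole sentence can be approximated by multiplying the conditional probability of each word given its two predecessors. The probability table and the function name `sentence_probability` below are hypothetical, and `None` is assumed to mark padding at the sentence boundaries.

```python
# Hypothetical conditional probabilities P(w | w1, w2), invented for illustration.
# None stands for padding before the first word and after the last word.
probs = {
    (None, None): {"I": 0.5, "Sam": 0.5},
    (None, "I"): {"am": 1.0},
    ("I", "am"): {"Sam": 0.5, "happy": 0.5},
    ("am", "Sam"): {None: 1.0},
    ("am", "happy"): {None: 1.0},
}


def sentence_probability(words, probs):
    """Multiply P(word | two previous words) over the padded sentence."""
    padded = [None, None] + list(words) + [None]
    probability = 1.0
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        # An unseen context or unseen next word contributes probability 0.
        probability *= probs.get(context, {}).get(padded[i], 0.0)
    return probability


# P(I | start) * P(am | start, I) * P(Sam | I, am) * P(end | am, Sam)
print(sentence_probability(["I", "am", "Sam"], probs))
```

A single unseen tri-gram makes the whole product zero, which is one reason real systems usually add some form of smoothing.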

---


## Python code

---

In [23]:
# Get a dataset (if the corpus is missing, download it once with nltk.download('reuters'))

from nltk.corpus import reuters

# Import APIs for generating trigrams

from nltk import trigrams

# Import Counter and defaultdict to store frequencies

from collections import Counter, defaultdict


### Let's name our state-of-the-art model gptFree



In [24]:
gptFree = defaultdict(lambda: defaultdict(lambda: 0))

# We don't want a KeyError, hence the nested defaultdict.
# A trigram that was never seen gets a default value of zero.

print(gptFree)

defaultdict(<function <lambda> at 0x1a1dc5ae60>, {})


### Let's count and store the frequency of trigrams in the Reuters dataset.

In [25]:
for sentence in reuters.sents():
    for word_1, word_2, word_3 in trigrams(sentence, pad_right=True, pad_left=True):
        gptFree[(word_1, word_2)][word_3] += 1 # Store the frequency, adding 1 each time we see this trigram
        
        

### Convert frequencies to probabilities

In [26]:
for word1_word2 in gptFree:
    freq = float(sum(gptFree[word1_word2].values())) # Total count of trigrams that start with this pair of words
    for word_3 in gptFree[word1_word2]:
        gptFree[word1_word2][word_3] /= freq
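
A quick way to check a normalisation step like this is to confirm that the probabilities for every pair of words add up to 1. The sketch below does that on a handful of made-up trigrams rather than on the Reuters data; the names `counts`, `model`, and `toy_trigrams` are hypothetical, and the probabilities are written to a new dict instead of being updated in place.

```python
import math
from collections import defaultdict

# Hypothetical trigrams, invented for illustration.
toy_trigrams = [
    ("I", "am", "Sam"),
    ("I", "am", "happy"),
    ("I", "am", "Sam"),
    ("am", "Sam", "again"),
]

# Count how often each third word follows each pair of words.
counts = defaultdict(lambda: defaultdict(int))
for word_1, word_2, word_3 in toy_trigrams:
    counts[(word_1, word_2)][word_3] += 1

# Divide each count by the total for its pair to get probabilities.
model = {}
for pair, followers in counts.items():
    total = sum(followers.values())
    model[pair] = {word: count / total for word, count in followers.items()}

# Sanity check: each pair's distribution should sum to 1.
for pair, distribution in model.items():
    assert math.isclose(sum(distribution.values()), 1.0), pair

print(model[("I", "am")])
```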

### Predict next word



> Provide two starter words to predict the next word

In [31]:
dict(gptFree['What','is'])

{'obvious': 0.25, 'needed': 0.25, 'important': 0.25, 'happening': 0.25}
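
One possible way to turn such a distribution into text is to sample a next word, slide the two-word window forward, and repeat until an end marker appears. This is only a sketch: the `('What', 'is')` entry copies the output above, while the `None` end-of-sentence entries, the `toy_model` dict, and the `generate` helper are assumptions made for illustration.

```python
import random

# The ('What', 'is') entry mirrors the output above; every other entry is
# hypothetical and simply ends the sentence (None is the assumed end marker).
toy_model = {
    ("What", "is"): {"obvious": 0.25, "needed": 0.25,
                     "important": 0.25, "happening": 0.25},
    ("is", "obvious"): {None: 1.0},
    ("is", "needed"): {None: 1.0},
    ("is", "important"): {None: 1.0},
    ("is", "happening"): {None: 1.0},
}


def generate(model, start, max_words=20, seed=0):
    """Sample words one at a time from P(next word | two previous words)."""
    rng = random.Random(seed)
    words = list(start)
    while len(words) < max_words:
        distribution = model.get((words[-2], words[-1]))
        if not distribution:
            break  # unseen context: nothing to sample from
        candidates = list(distribution)
        weights = list(distribution.values())
        next_word = rng.choices(candidates, weights=weights)[0]
        if next_word is None:
            break  # end-of-sentence marker
        words.append(next_word)
    return words


print(" ".join(generate(toy_model, ["What", "is"])))
```

Sampling by weight gives varied sentences across seeds; always taking the most probable word instead would make the output deterministic.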

In [38]:
dict(gptFree['The','bank'])

{'said': 0.34328358208955223,
 'reiterated': 0.007462686567164179,
 'holding': 0.014925373134328358,
 "'": 0.11194029850746269,
 'stepped': 0.007462686567164179,
 'is': 0.05223880597014925,
 'expects': 0.007462686567164179,
 'last': 0.007462686567164179,
 'rate': 0.007462686567164179,
 'dealers': 0.007462686567164179,
 'also': 0.07462686567164178,
 ',': 0.05223880597014925,
 'has': 0.04477611940298507,
 'earlier': 0.022388059701492536,
 'more': 0.007462686567164179,
 'would': 0.007462686567164179,
 'had': 0.014925373134328358,
 'added': 0.007462686567164179,
 'board': 0.014925373134328358,
 'continued': 0.007462686567164179,
 'previously': 0.007462686567164179,
 'gave': 0.007462686567164179,
 'moved': 0.007462686567164179,
 'did': 0.007462686567164179,
 'will': 0.04477611940298507,
 'bought': 0.014925373134328358,
 'estimated': 0.007462686567164179,
 'was': 0.007462686567164179,
 'transferred': 0.007462686567164179,
 'official': 0.03731343283582089,
 'of': 0.007462686567164179,
 'rarel