## Intro to Language Models and NLG
##### By Ruben Seoane, all credit to nlpforhackers.io
Based on http://nlpforhackers.io/language-models/ and adapted to work on Python 3.6.
        
### What is a model?
A model is a mathematical representation of a process, and usually only an approximation of it.
A language model is built by observing some text.

## Bag of Words (BOW)
The simplest way to model human language.
1. A BOW model is an oversimplified view of the language.
2. It takes into account only how often each word occurs in the language, not word order or position.
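
To make the second point concrete, here is a minimal sketch on two made-up sentences (toy data, not taken from the Reuters corpus). It shows that two sentences with the same words in a different order get the same bag-of-words representation:

```python
from collections import Counter

# Two toy sentences with the same words in a different order (made-up data).
sentence_a = "the dog bit the man".split()
sentence_b = "the man bit the dog".split()

bow_a = Counter(sentence_a)
bow_b = Counter(sentence_b)

print(bow_a)           # word frequencies only, no positions
print(bow_a == bow_b)  # True: a BOW model cannot tell the two sentences apart
```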

In [2]:
# Let's start building the model
from nltk.corpus import reuters
from collections import Counter

counts = Counter(reuters.words())
total_count = len(reuters.words())

# Most common 20 words:
print(counts.most_common(n=20))

[('.', 94687), (',', 72360), ('the', 58251), ('of', 35979), ('to', 34035), ('in', 26478), ('said', 25224), ('and', 25043), ('a', 23492), ('mln', 18037), ('vs', 14120), ('-', 13705), ('for', 12785), ('dlrs', 11730), ("'", 11272), ('The', 10968), ('000', 10277), ('1', 9977), ('s', 9298), ('pct', 9093)]


In [3]:
# Compute the relative frequencies
for word in counts:
    counts[word] /= total_count

# Frequencies should add up to 1 (up to floating-point rounding)
print(sum(counts.values()))

1.0000000000006808


In [6]:
import random

# Generate 100 words of language
text = []

for _ in range(100):
    r = random.random()
    accumulator = .0
    
    for word, freq in counts.items():
        accumulator += freq
        
        if accumulator >= r:
            text.append(word)
            break
            
print(' '.join(text))

later 430 from time , political of Intermagnetics projects approved led TO CORP Exxon , definitive company . BANKS - average results stabbed . towards April QTR new The . 1 as the agency gain tonnages again cts open S three , week , Corp reflects . Shirt from 5 need House . Louisiana because loss Texaco 4 6503 begin INC most the in vegetable pinch , CTWL / one Trail & 897 of owned . be told - completed bonds firm decline chief only on , . rules of investigating or 1976 inflation 33 downs not value 8 have
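
The accumulator loop above is a hand-written weighted random draw. As a possible alternative, the standard library's `random.choices` (available since Python 3.6) does the same job in one call. The sketch below uses a small made-up `toy_counts` table in place of the Reuters counts, so the names and numbers are assumptions for illustration only:

```python
import random
from collections import Counter

# Toy word counts standing in for the Reuters counts (made-up numbers).
toy_counts = Counter({"the": 5, "of": 3, "said": 2})

words = list(toy_counts.keys())
weights = list(toy_counts.values())  # relative weights; they need not sum to 1

random.seed(0)  # fixed seed only to make the sketch repeatable
sample = random.choices(words, weights=weights, k=10)  # 10 weighted draws
print(' '.join(sample))
```

Because `random.choices` accepts relative weights, this version would work with either raw counts or normalized frequencies.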


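Since we now know the probability of every word, we can compute the probability of the generated text. Because a BOW model treats words as independent, this is simply the product of the individual word probabilities: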

In [8]:
from operator import mul
from functools import reduce

print(reduce(mul, [counts[w] for w in text], 1.0))

3.13e-321
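
A product of many small probabilities quickly approaches the smallest number a float can hold, as the result above suggests. A common workaround, not used in the original notebook, is to add log-probabilities instead of multiplying probabilities. The sketch below uses made-up toy probabilities (`toy_probs` is an assumption, not the Reuters model):

```python
import math

# Toy unigram probabilities (made-up values that sum to 1).
toy_probs = {"the": 0.5, "of": 0.3, "said": 0.2}
toy_text = ["the", "of", "the", "said"]

# Sum of logs instead of product of probabilities.
log_prob = sum(math.log(toy_probs[w]) for w in toy_text)

# For a short text both views agree: exp(log_prob) equals the plain product.
product = 1.0
for w in toy_text:
    product *= toy_probs[w]
print(math.isclose(math.exp(log_prob), product))  # True

# For a long text the plain product underflows, the log-probability does not.
long_text = ["said"] * 1000
long_product = 1.0
for w in long_text:
    long_product *= toy_probs[w]
long_log_prob = sum(math.log(toy_probs[w]) for w in long_text)
print(long_product)   # 0.0 because 0.2**1000 is too small for a float
print(long_log_prob)  # a finite negative number
```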


### Bigrams and Trigrams
One way to generate better text is to make sure each new word fits the previous word (a bigram model, built from pairs of words) or the previous two words (a trigram model, built from triples).

In [12]:
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

first_sentence = reuters.sents()[0]
print(first_sentence)

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


In [13]:
# Get Bigrams
print(list(bigrams(first_sentence)))

[('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.')]


In [15]:
# Get padded Bigrams
print(list(bigrams(first_sentence, pad_left=True, pad_right=True)))

[(None, 'ASIAN'), ('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.'), ('.', None)]
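
To see what the padding does, here is a minimal pure-Python sketch that mimics the padded output above. `padded_ngrams` is a hypothetical helper written for illustration, not an NLTK function, and it assumes `None` as the padding symbol on both sides:

```python
def padded_ngrams(tokens, n):
    """Hypothetical helper: n-grams with n-1 None symbols added on each side."""
    padding = [None] * (n - 1)
    padded = padding + list(tokens) + padding
    # Slide a window of size n over the padded token list.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = ["trade", "friction", "rises"]  # made-up toy sentence
print(padded_ngrams(tokens, 2))
print(padded_ngrams(tokens, 3))
```

The padding lets the model learn which words tend to start and end a sentence, which the generation code later relies on.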


In [14]:
# Get Trigrams
print(list(trigrams(first_sentence)))

[('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict', 'far'), ('inflict', 'far', '-'), ('far', '-', 'rea

In [16]:
# Get padded Trigrams
print(list(trigrams(first_sentence, pad_left=True, pad_right=True)))

[(None, None, 'ASIAN'), (None, 'ASIAN', 'EXPORTERS'), ('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict

We'll build a trigram model from the Reuters corpus:

In [18]:
model = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
        
print(model["what", "the"]["economists"]) # 'economists' follows 'what the'
print(model["what", "the"]["bambawamba"])
print(model[None, None]["The"])

2
0
8839


In [20]:
# Transforming the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
        
print(model["what", "the"]["economists"]) 
print(model["what", "the"]["bambawamba"])
print(model[None, None]["The"])

0.043478260869565216
0.0
0.16154324146501936
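
The same count-then-normalize steps can be followed by hand on a tiny example. The sketch below uses four made-up sentences (`toy_sents` is an assumption standing in for `reuters.sents()`) and pads them manually, so it runs without NLTK:

```python
from collections import defaultdict

# Made-up tokenized sentences standing in for reuters.sents().
toy_sents = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
]

# Count how often each word follows each two-word context.
toy_model = defaultdict(lambda: defaultdict(int))
for sent in toy_sents:
    padded = [None, None] + sent + [None, None]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        toy_model[(w1, w2)][w3] += 1

# Divide each count by the total for its two-word context.
for context, followers in toy_model.items():
    total = sum(followers.values())
    for w3 in followers:
        followers[w3] /= total

# 'sat' follows 'the cat' in 2 of 3 cases, 'ran' in 1 of 3.
print(dict(toy_model[("the", "cat")]))
```

After normalization, the probabilities for each context sum to 1, which is what the sampling loop in the next cell assumes.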


Let's generate some text now:

In [24]:
import random

text = [None, None]
sentence_finished = False

while not sentence_finished:
    r = random.random()
    accumulator = .0
    
    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]
        
        if accumulator >= r:
            text.append(word)
            break
        
    if text[-2:] == [None, None]:
        sentence_finished = True
        
print(' '.join([t for t in text if t]))

He also said & lt ; PKD > SUSPENDS PREFERRED PAYMENTS Texstyrene Corp said it raised its west coast .


**Conditional probabilities** are used to compute the probability of a sequence.
The probability of _word[i]_ given _word[i-2]_ and _word[i-1]_ is _P(word[i] | word[i-2], word[i-1])_, which in our model is _model[(word[i-2], word[i-1])][word[i]]_. The probability of the whole text is the product of these conditional probabilities.
Let's add this computation to the text-generating script:

In [25]:
import random

text = [None, None]
prob = 1.0 # Init probability
sentence_finished = False

while not sentence_finished:
    r = random.random()
    accumulator = .0
    
    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]
        
        if accumulator >= r:
            prob *= model[tuple(text[-2:])][word]  # Update prob with the conditional probability of the new word
            text.append(word)
            break
            
    if text[-2:] == [None, None]:
        sentence_finished = True
        
print("Probability of text=" , prob)
print(' '.join([t for t in text if t]))

Probability of text= 3.576299122959625e-79
South African Mutual Life Insurance Co of New Zealand ' s December agreement by the strong results of an unfriendly cash offer for Bell stock to a low of 143 , 000 Sales 66 . 3 mln 12 mths Shr profit 10 . 2 pct U . S . A as the distant months are expected to bear the burden of their proposal at 36 . 5 mln nine mths net includes losses from discontinued operations 1986 loss of export promotion abroad .
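
Written out as a formula, the value accumulated in `prob` can be read as a chain-rule product of trigram conditionals. This is a sketch of the notation rather than something stated in the original article; it assumes a sentence of n words with the two `None` padding symbols on each side, as used when the model was built:

```latex
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n+2} P(w_i \mid w_{i-2}, w_{i-1}),
\qquad w_{-1} = w_0 = w_{n+1} = w_{n+2} = \text{None}
```

The last two factors are the probabilities of the end-of-sentence padding, which is why the loop stops once the text ends in two `None` symbols.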


### Conclusions
- We have implemented a basic Natural Language Generation model.
- In theory, the larger the n-grams, the more fluent and accurate the generated language, provided there is enough training text to estimate them.
- The larger the n-grams, the bigger the model gets.
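
One rough way to see the model grow is to count the distinct n-grams for increasing n. The sketch below does this on a made-up toy sentence; `distinct_ngrams` is a hypothetical helper, and on a real corpus the growth would typically be much steeper:

```python
# Made-up toy text; any tokenized corpus would do.
tokens = "the cat sat on the mat and the cat ran to the mat".split()

def distinct_ngrams(tokens, n):
    """Hypothetical helper: the set of distinct n-grams (no padding) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Each distinct n-gram needs its own entry in the model.
for n in (1, 2, 3):
    print(n, len(distinct_ngrams(tokens, n)))
```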