In [2]:
import nltk
from nltk.corpus import gutenberg
import matplotlib.pyplot as plt
import numpy as np
import math
%matplotlib inline

### Language modeling 2

In the previous exercises we saw how to calculate a language model out of raw text. In this set of exercises you will learn how to deal with zero probabilities and how to evaluate LMs. Let's get started!

To keep things simple and avoid repeating code (even though rewriting it can help you internalize it), here is the part of the previous exercise that builds bigram language models for Carroll's and Shakespeare's books.

In [3]:
def prod (ite):
    if len(ite)==0:
        return 1
    else:
        return ite[0]*prod(ite[1:])
        
alice = gutenberg.sents("carroll-alice.txt")
hamlet = gutenberg.sents("shakespeare-hamlet.txt")

#Create a LM for Alice's Adventures in Wonderland
alice_wbounds = [["<s>"]+sentence+["</s>"] for sentence in alice]
alice_words = [word for sentence in alice_wbounds for word in sentence]
cfreq_alice_bigrams= nltk.ConditionalFreqDist(nltk.bigrams(alice_words))
cprob_alice_bigrams = nltk.ConditionalProbDist(cfreq_alice_bigrams, nltk.MLEProbDist)

#Create a LM for Shakespeare's Hamlet
hamlet_wbounds = [["<s>"]+sentence+["</s>"] for sentence in hamlet]
hamlet_words = [word for sentence in hamlet_wbounds for word in sentence]
cfreq_hamlet_bigrams= nltk.ConditionalFreqDist(nltk.bigrams(hamlet_words))
cprob_hamlet_bigrams = nltk.ConditionalProbDist(cfreq_hamlet_bigrams, nltk.MLEProbDist)

def prob_bigram_lm(sent, lm):
    sent_wb = "<s> " + sent + " </s>"
    return prod([lm[bigram[0]].prob(bigram[1]) for bigram in nltk.bigrams(sent_wb.split())])


We have now seen in class how to smooth a language model in a fairly simple way. Smoothing matters whenever a word or word sequence we want to score never appears in the training corpus.

For example, the conditional probability $P(w_{n}=screen | w_{n-1}=The)$ is $0.0$.

In [4]:
cprob_alice_bigrams["The"].prob('screen')

0.0
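
To see why a zero like this is a problem, here is a minimal sketch in plain Python. It assumes a tiny made-up corpus and a hypothetical helper `mle_prob` (neither is part of the exercise), and it shows that a single unseen bigram drives the probability of the whole sentence to zero.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
tokens = "<s> the cat sat </s> <s> the dog sat </s>".split()

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # every token that starts a bigram

def mle_prob(prev, word):
    """Maximum-likelihood estimate: C(prev word) / C(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

seen = mle_prob("the", "cat")       # this bigram occurs in the toy corpus
unseen = mle_prob("the", "screen")  # this bigram never occurs

# Multiply the bigram probabilities of a sentence containing the unseen bigram.
sentence = "<s> the screen sat </s>".split()
sentence_prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    sentence_prob *= mle_prob(prev, word)

print(seen, unseen, sentence_prob)
```

With a zero sentence probability the perplexity is undefined, which is the motivation for the smoothing exercises that follow.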

### Exercise 1

Use the variable `cfreq_alice_bigrams` to calculate $P_{Laplace}(w_{n}=screen | w_{n-1}=The)$.
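
For reference, the add-one (Laplace) estimate for a bigram can be written as below. The symbols are notation assumed here: $C(\cdot)$ is a count in the training corpus and $V$ is the vocabulary size.

$$P_{Laplace}(w_{n} | w_{n-1}) = \frac{C(w_{n-1}\, w_{n}) + 1}{C(w_{n-1}) + V}$$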

In [5]:
#YOUR CODE HERE
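# Vocabulary size: number of distinct words that start at least one bigram
V_alice = len(cfreq_alice_bigrams.conditions())

# Add-one estimate: (C(The screen) + 1) / (C(The) + V)
(cfreq_alice_bigrams['The'].get("screen",0) + 1)/(alice_words.count('The')+V_alice)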

0.00032

### Exercise 2
As you may have guessed, NLTK can estimate smoothed probabilities for us. It is quite simple: we only need to change the probability distribution class used when building the conditional probability distribution.

Check the NLTK documentation, mainly the part on the `nltk.probability` module, to find out how to estimate smoothed probability distributions. Once you know how to do it, create a variable holding the smoothed model.

In [6]:
#YOUR CODE HERE
cprob_alice_bigrams_smooth = nltk.ConditionalProbDist(cfreq_alice_bigrams, nltk.LaplaceProbDist)

Once you have created the smoothed language model, you can use it just like a regular language model. Now calculate $P_{Laplace}(w_{n}=screen | w_{n-1}=The)$ again using the new language model.

In [7]:
#YOUR CODE HERE
cprob_alice_bigrams_smooth["The"].prob('screen')

0.005988023952095809

Do you get the same result? Why? Think about it.
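
One thing worth checking when comparing the two results is the number of bins in the denominator. Exercise 1 uses the vocabulary size `V_alice`, whereas `nltk.LaplaceProbDist`, when no `bins` argument is given, falls back to the number of distinct samples observed in the frequency distribution it is built from. The sketch below uses a hypothetical helper `laplace_prob` and made-up counts (not values from the corpus) to show how much the choice of bins changes the estimate.

```python
def laplace_prob(count, context_total, bins):
    """Add-one estimate: (count + 1) / (context_total + bins)."""
    return (count + 1) / (context_total + bins)

# Hypothetical numbers: a context seen 100 times, followed by an unseen word.
count, context_total = 0, 100

# bins = size of the whole vocabulary
p_vocab = laplace_prob(count, context_total, bins=3000)
# bins = number of distinct words observed after this context
p_observed = laplace_prob(count, context_total, bins=60)

print(p_vocab, p_observed)
```

The fewer bins the estimator assumes, the more probability mass each unseen word receives.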

### Exercise 3

Now, following the instructions from the previous class, build three smoothed language models, based on bigrams, trigrams and quadrigrams (4-grams), from Carroll's book.

#### Hint: This doesn't work:

`cfreq_corpus1_trigrams= nltk.ConditionalFreqDist(nltk.trigrams(words))`

#### Hint: Recall the word boundaries.
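
Both hints can be illustrated with a small sketch in plain Python. `nltk.ConditionalFreqDist` expects `(condition, sample)` pairs, while `nltk.trigrams` yields 3-tuples, so each n-gram first has to be split into a context and a final word. The helpers `pad` and `ngram_pairs` below are hypothetical names introduced only for this illustration.

```python
def pad(sentence, n):
    """Add n-1 start and end markers around a tokenized sentence."""
    return ["<s>"] * (n - 1) + sentence + ["</s>"] * (n - 1)

def ngram_pairs(tokens, n):
    """Turn a token list into (context, word) pairs, one per n-gram."""
    pairs = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        # The first n-1 tokens form the context; the last token is the sample.
        pairs.append((" ".join(gram[:-1]), gram[-1]))
    return pairs

padded = pad(["the", "cat", "sat"], 3)
pairs = ngram_pairs(padded, 3)
print(pairs[:2])
```

Pairs in this shape can be passed to a conditional frequency distribution, with the joined context string acting as the condition.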

In [8]:
#YOUR CODE HERE
alice_wbounds = [["<s>"]+sentence+["</s>"] for sentence in alice]
alice_wbounds_2 = [["<s>","<s>"]+sentence+["</s>","</s>"] for sentence in alice]
alice_wbounds_3 = [["<s>","<s>","<s>"]+sentence+["</s>","</s>","</s>"] for sentence in alice]
alice_words = [word for sentence in alice_wbounds for word in sentence]+["<s>"]
alice_words_2 = [word for sentence in alice_wbounds_2 for word in sentence]+["<s>","<s>"]
alice_words_3 = [word for sentence in alice_wbounds_3 for word in sentence]+["<s>","<s>","<s>"]

cfreq_alice_trigrams = nltk.ConditionalFreqDist([(' '.join(i[:-1]), i[-1]) for i in nltk.trigrams(alice_words_2)])
cfreq_alice_quadrigrams = nltk.ConditionalFreqDist([(' '.join(i[:-1]), i[-1]) for i in nltk.ngrams(alice_words_3,4)])

#If we don't set the bins explicitly, then looking up the conditional probability distribution of a context
#that doesn't appear in the training corpus is going to raise an exception
cprob_alice_bigrams_smooth = nltk.ConditionalProbDist(cfreq_alice_bigrams, nltk.LaplaceProbDist, bins=V_alice)
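#Caution: bins counts the possible next words, so V_alice would be the standard value for every order;
#the larger values used here shrink every smoothed probability and therefore raise the perplexities below
cprob_alice_trigrams_smooth = nltk.ConditionalProbDist(cfreq_alice_trigrams, nltk.LaplaceProbDist, bins=(V_alice**2))
cprob_alice_quadrigrams_smooth = nltk.ConditionalProbDist(cfreq_alice_quadrigrams, nltk.LaplaceProbDist, bins=(V_alice**3))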

### Exercise 4

To conclude, calculate the perplexity of the three language models, using the string below as a test set.
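
As a reminder, perplexity is the inverse probability of the test set, normalized by its length. In the notation assumed here, $W = w_{1} w_{2} \dots w_{N}$ is the test set and $N$ is its length in tokens:

$$PP(W) = P(w_{1} w_{2} \dots w_{N})^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_{1} w_{2} \dots w_{N})}}$$

A lower perplexity means the model assigns a higher probability to the test set.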

In [9]:
test_string = 'The cat was brown. the brownie was really good'
test_string2 = '''People were going home'''
test_string

'The cat was brown. the brownie was really good'

In [10]:
def addbounds (list_sentences, n):
    list_sent_wb = [''.join(["<s> "]*(n-1)) + sent + ''.join([" </s>"]*(n-1)) for sent in list_sentences]
    return ' '.join(list_sent_wb) + " " + ''.join([" <s>"]*(n-1))

In [11]:
sentence_splited = [sentence.lower().strip() for sentence in test_string.replace("\n", " ").split(".")]
sentence_splited2 = [sentence.lower().strip() for sentence in test_string2.replace("\n", " ").split(".")]
addbounds(sentence_splited,2)

'<s> the cat was brown </s> <s> the brownie was really good </s>  <s>'

In [12]:
def prob_ngram_lm(sent, lm, n):
    ngrams = [(' '.join(i[:-1]), i[-1]) for i in nltk.ngrams(sent.split(),n)]
    return prod([lm[ngram[0]].prob(ngram[1]) for ngram in ngrams])

prob_ngram_lm(addbounds(sentence_splited,2), cprob_alice_bigrams_smooth, 2)

2.261502301937878e-37

In [13]:
def perplexity (sentence, lm, n):
    N = len(sentence.split()) + (n-1)*2
    print (N, sentence)
    return (prob_ngram_lm(sentence,lm, n))**(-1/N)

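#Redefine prod iteratively: summing logs avoids the recursion limit that the first version hits on long inputs
def prod (ite):
    res = 0.0
    for el in ite:
        res=res + np.log(el)
    return np.exp(res)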

In [14]:
perplexity(addbounds(sentence_splited,2), cprob_alice_bigrams_smooth, 2)
#perplexity(addbounds(sentence_splited2,2), cprob_alice_bigrams_smooth, 2)

16 <s> the cat was brown </s> <s> the brownie was really good </s>  <s>


195.14174551177027

In [15]:
perplexity(addbounds(sentence_splited,3), cprob_alice_trigrams_smooth, 3)
#perplexity(addbounds(sentence_splited2,3), cprob_alice_trigrams_smooth, 3)

23 <s> <s> the cat was brown </s> </s> <s> <s> the brownie was really good </s> </s>  <s> <s>


28955.31127061816

In [16]:
perplexity(addbounds(sentence_splited,4), cprob_alice_quadrigrams_smooth, 4)
#perplexity(addbounds(sentence_splited2,4), cprob_alice_quadrigrams_smooth, 4)

30 <s> <s> <s> the cat was brown </s> </s> </s> <s> <s> <s> the brownie was really good </s> </s> </s>  <s> <s> <s>


3705101.9084645724
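
For longer test sets the product of many small probabilities can underflow to `0.0` in floating point, so perplexity is often computed directly from the sum of log-probabilities. The sketch below assumes a hypothetical helper `perplexity_from_probs` that receives the probability of each n-gram and normalizes by the number of n-grams, which differs slightly from the `N` used in `perplexity` above.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity from per-n-gram probabilities, computed in log space."""
    # Every probability must be > 0, which smoothing guarantees.
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))

# Four equally likely outcomes at each step give a perplexity of about 4.
uniform = perplexity_from_probs([0.25] * 4)

# One hundred tiny probabilities: the plain product underflows,
# while the log-space computation stays finite.
tiny = [1e-5] * 100
naive_product = math.prod(tiny)
stable = perplexity_from_probs(tiny)

print(uniform, naive_product, stable)
```

Because the exponent is divided by the length before calling `exp`, the intermediate values stay in a comfortable numeric range even when the raw product does not.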