In [1]:
import nltk
from nltk.corpus import gutenberg
import matplotlib.pyplot as plt
import numpy as np
import math
%matplotlib inline

### Language modeling 2

In the previous exercises we saw how to calculate a language model out of raw text. In this set of exercises you will learn how to deal with zero probabilities and how to evaluate LMs. Let's get started!

In order to make everything easier and to avoid repeating code (even though it can be good to internalize doing some things), here you have a part of the previous exercise to calculate language models for Shakespeare's and Carrol's book.

In [2]:
def prod (ite):
    if len(ite)==0:
        return 1
    else:
        return ite[0]*prod(ite[1:])
    
alice = gutenberg.sents("carroll-alice.txt")
hamlet = gutenberg.sents("shakespeare-hamlet.txt")

#Create a LM for Alice's Adventures in Wonderland
alice_wbounds = [["<s>"]+sentence+["</s>"] for sentence in alice]
alice_words = [word for sentence in alice_wbounds for word in sentence]
cfreq_alice_bigrams= nltk.ConditionalFreqDist(nltk.bigrams(alice_words))
cprob_alice_bigrams = nltk.ConditionalProbDist(cfreq_alice_bigrams, nltk.MLEProbDist)

#Create a LM for Shakespeare's Hamlet
hamlet_wbounds = [["<s>"]+sentence+["</s>"] for sentence in hamlet]
hamlet_words = [word for sentence in hamlet_wbounds for word in sentence]
cfreq_hamlet_bigrams= nltk.ConditionalFreqDist(nltk.bigrams(hamlet_words))
cprob_hamlet_bigrams = nltk.ConditionalProbDist(cfreq_hamlet_bigrams, nltk.MLEProbDist)

We now learned in class how to smooth a language model in a pretty simple way. This was interesting for the cases in which some words don't appear in a corpus.

For example, the conditional probability $P(w_{n}=screen | w_{n-1}=The)$ is $0.0$.

In [12]:
cprob_alice_bigrams["The"]

<MLEProbDist based on 108 samples>

In [3]:
cprob_alice_bigrams["The"].prob('screen')

0.0

### Exercise 1

Use the variable `cfreq_alice_bigrams` to calculate $P_{Laplace}(w_{n}=screen | w_{n-1}=The)$.

In [6]:
from nltk import LaplaceProbDist

In [13]:
LaplaceProbDist(cfreq_alice_bigrams["The"])

<LaplaceProbDist based on 108 samples>

In [11]:
#YOUR CODE HERE
LaplaceProbDist(cfreq_alice_bigrams["The"]).prob('screen')

0.005988023952095809

### Exercise 2
As you may have imagined, we can estimate the smoothed probabilities in NLTK. It is actually pretty simple, as we only need to change the probability distribution when calculating a conditional probaility distribution.

Please, check the documentation of NLTK, mainly the one about probability, and check how you can estimate the smoothed probability distributions. Once you know how to do it, create a variable with it.

In [59]:
#YOUR CODE HERE
cprob_alice_bigrams_laplace = nltk.ConditionalProbDist(cfreq_alice_bigrams, LaplaceProbDist, bins=len(alice))

Once you create the smoothed language model, you can use it as if it was a regular language model. Then, calculate again $P_{Laplace}(w_{n}=screen | w_{n-1}=The)$ using the new language model.

In [15]:
#YOUR CODE HERE
cprob_alice_bigrams_laplace["The"].prob('screen')

0.005988023952095809

Do you get the same result? Why? Think about it.

### Exercise 3

Now, following the instructions from the previous class, please calculate three smoothed language models, based on bigrams, trigrams and quadrigrams, from Carrol's book.

#### Hint: This doesn't work:

`cfreq_corpus1_trigrams= nltk.ConditionalFreqDist(nltk.trigrams(words))`

#### Hint: Recall the word boundaries.

In [56]:
#Create a LM for Alice's Adventures in Wonderland
alice_wbounds = [["<s>"]+sentence+["</s>"] for sentence in alice]
alice_words = [word for sentence in alice_wbounds for word in sentence]
alice_trigrams = nltk.trigrams(alice_words)
condition_pairs = (((w0, w1), w2) for w0, w1, w2 in alice_trigrams)
# https://stackoverflow.com/questions/38068539/finding-conditional-probability-of-trigram-in-python-nltk/38098159
cfreq_alice_trigrams = nltk.ConditionalFreqDist(condition_pairs)
cprob_alice_trigrams = nltk.ConditionalProbDist(cfreq_alice_trigrams, LaplaceProbDist, bins=len(alice))

In [36]:
#Create a LM for Alice's Adventures in Wonderland
alice_wbounds = [["<s>"]+sentence+["</s>"] for sentence in alice]
alice_words = [word for sentence in alice_wbounds for word in sentence]
alice_ngrams = nltk.ngrams(alice_words, n=4)
condition_pairs = (((w0, w1, w2), w3) for w0, w1, w2, w3 in alice_ngrams)
# https://stackoverflow.com/questions/38068539/finding-conditional-probability-of-trigram-in-python-nltk/38098159
cfreq_alice_4grams = nltk.ConditionalFreqDist(condition_pairs)
cprob_alice_4grams = nltk.ConditionalProbDist(cfreq_alice_4grams, LaplaceProbDist)

### Exercise 4

To conclude, calculate the perplexity of the three language models using this string as a test set. 

In [37]:
test_string = 'The cat was brown. the brownie was really good'

In [52]:
example = "<s> " + test_string + " </s> <s>"
example_list = example.replace('.', ' </s> <s>').split()
example_list

['<s>',
 'The',
 'cat',
 'was',
 'brown',
 '</s>',
 '<s>',
 'the',
 'brownie',
 'was',
 'really',
 'good',
 '</s>',
 '<s>']

In [53]:
bigrams = [bigram for bigram in nltk.bigrams(example_list)]
bigrams

[('<s>', 'The'),
 ('The', 'cat'),
 ('cat', 'was'),
 ('was', 'brown'),
 ('brown', '</s>'),
 ('</s>', '<s>'),
 ('<s>', 'the'),
 ('the', 'brownie'),
 ('brownie', 'was'),
 ('was', 'really'),
 ('really', 'good'),
 ('good', '</s>'),
 ('</s>', '<s>')]

In [55]:
for bigram in bigrams:
    print (bigram)
    try:
        print(cprob_alice_bigrams_laplace[bigram[0]].prob(bigram[1]))
    except:
        print( 'no: ', bigram)

('<s>', 'The')
0.04543053354463814
('The', 'cat')
0.005988023952095809
('cat', 'was')
0.045454545454545456
('was', 'brown')
0.0020242914979757085
('brown', '</s>')
0.25
('</s>', '<s>')
1.0
('<s>', 'the')
0.012678288431061807
('the', 'brownie')
0.000510986203372509
('brownie', 'was')
no:  ('brownie', 'was')
('was', 'really')
0.0020242914979757085
('really', 'good')
0.09090909090909091
('good', '</s>')
0.023809523809523808
('</s>', '<s>')
1.0


In [61]:
#YOUR CODE HERE

p = prod([cprob_alice_bigrams_laplace[bigram[0]].prob(bigram[1]) for bigram in nltk.bigrams(example_list)])
print(np.power(2, -np.log2(p).sum() / count))


3.3676024796704636e-40