# Exercise Sheet 4 - Language Modelling - Solutions

## Learning Objectives

In this lab we are going to:

- Play around with text corpora
- Learn some statistics tricks in Python and NLTK
- Learn about language modelling
- Learn about n-grams
- Use Naive Bayes as a language model
- Get hands-on experience with data sparsity and smoothing techniques


In [None]:
# setting the stage ;)
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

--------------------
## Text Corpus: Statistics and Probability

### Accessing the corpus
Open a Python session and obtain the <a href="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip">Brown corpus</a> using NLTK.

**Exercise 1:**

What is the frequency of the word &lsquo;world&rsquo; (ignoring case) in the news category of the Brown corpus?

In [None]:
# You will need to import 'Brown' as follows:
from nltk.corpus import brown


In [None]:
def count_freq(category, given_word):
    count = 0
    for word in brown.words(categories=category):
        word = word.lower()
        if word == given_word:
            count += 1
    return count

print(count_freq('news', 'world'))

46


### Frequency of Words

We can easily get the frequency distribution of the words in a corpus as follows:
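
A minimal sketch of such a frequency distribution, using `collections.Counter` on a short hand-made token list. The tokens are illustrative only and are not taken from the Brown corpus. In NLTK 3, `nltk.FreqDist` is a subclass of `Counter`, so the same calls should work on a `FreqDist` built from `brown.words(categories='news')`.

```python
from collections import Counter

# Hypothetical toy tokens standing in for brown.words(categories='news').
tokens = "The world is small and the World is round in the end".split()

# Lower-case every token before counting so 'World' and 'world' are merged.
freq = Counter(token.lower() for token in tokens)

print(freq['world'])        # count of a single word: 2
print(freq.most_common(1))  # most frequent word with its count: [('the', 3)]
```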

**Exercise 2:**
In the Brown corpus, in which of the news, government and editorial categories does the word &lsquo;world&rsquo; (ignoring case) have the highest total frequency?
* news
* government
* editorial

In [None]:
print(count_freq('news', 'world'))
print(count_freq('government', 'world'))
print(count_freq('editorial', 'world'))

46
51
77


### Probabilities

**Exercise 3:**
Calculate the probabilities (relative frequencies) of all words in the __news__ category of the Brown corpus only.
What is the probability of the words &lsquo;jury&rsquo; and &lsquo;government&rsquo;?

In [None]:
def prob(category, given_word):
    count = 0
    total = 0

    for word in brown.words(categories=category):
        word = word.lower()
        if word == given_word:
            count += 1
        total += 1
            
    return float(count)/total

print(prob('news','jury'))
print(prob('news','government'))

0.00045746564035244744
0.000725978081428884


--------------------
## N-Grams

Probabilistic language models (a.k.a. n-gram LMs) are designed to construct the joint probability distribution of a sequence of words. Based on the Markov assumption, the process of predicting a word sequence is broken up into predicting one word at a time.
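
Written out for a sequence $w_1 \dots w_n$, this is the chain rule followed by the first-order Markov (bigram) approximation, in which each word is assumed to depend only on its predecessor:

$P(w_1, \dots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i | w_1, \dots, w_{i-1}) \approx P(w_1) \prod_{i=2}^{n} P(w_i | w_{i-1})$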

We can extract unigrams and bigrams from a corpus with NLTK, for example from the novel *Emma* by Jane Austen in the Gutenberg Corpus.
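
As a rough illustration of what n-gram extraction does, here is a pure-Python sketch on one sentence from this sheet's toy corpus rather than on *Emma*. The helper name `ngrams` is our own choice. `nltk.bigrams(tokens)` and `nltk.ngrams(tokens, n)` yield the same tuples, so with NLTK the token list could simply be replaced by a corpus word list.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of n consecutive tokens."""
    # Zip the token list with itself shifted by 0, 1, ..., n-1 positions.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "she read a book by cher".split()

unigrams = ngrams(tokens, 1)  # one-element tuples
bigrams = ngrams(tokens, 2)   # pairs of adjacent words

print(unigrams)
print(bigrams)
```

A sentence of 6 tokens gives 6 unigrams and 5 bigrams; in general there are `len(tokens) - n + 1` n-grams.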

**Exercise 4:**
Write a function to find the most common phrases (trigrams) in the __fiction__ category of the Brown corpus.

In [None]:
import nltk
from nltk.corpus import brown

def most_common_trigrams(category, n=20):
    """Return the n most frequent trigrams in a Brown category."""
    words = brown.words(categories=category)
    # FreqDist counts plain items; ConditionalFreqDist counts (condition, item) pairs.
    freq = nltk.FreqDist(nltk.trigrams(words))
    return freq.most_common(n)

most_common_trigrams('fiction')


[(("''", '?', '?'), 128),
 (("''", '.', '``'), 117),
 (('.', 'He', 'was'), 76),
 (('.', 'It', 'was'), 73),
 (("''", '!', '!'), 64),
 (('.', 'He', 'had'), 53),
 (('.', '``', 'I'), 53),
 (('?', '?', '``'), 45),
 ((',', 'and', 'the'), 42),
 (('.', 'There', 'was'), 40),
 (("''", ',', 'he'), 37),
 (('said', '.', '``'), 33),
 ((',', 'and', 'he'), 33),
 (('said', ',', '``'), 30),
 (('.', 'She', 'was'), 30),
 (("''", ',', 'she'), 29),
 (("''", '.', 'He'), 28),
 ((',', 'he', 'said'), 26),
 (('one', 'of', 'the'), 24),
 (('.', 'In', 'the'), 24)]

--------------
## Probabilistic modeling

## Naïve Bayes as a Language Model
Based on the probabilities of words in only the news and fiction categories of the Brown corpus, classify the phrase 'mysterious murder case' into one of these categories.

You should implement a Naive Bayes classifier using the probability of each word:

$P(fiction|mysterious\ murder\ case) \propto P(mysterious|fiction) \times P(murder|fiction) \times P(case|fiction) \times P(fiction)$
where $P(news) = 0.5$ and $P(fiction) = 0.5$

**Exercise 5:**
Write a general-purpose Naive Bayes classifier as follows:

In [None]:
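def calculate_probability(phrase, category):
    # P(category) * product of P(word | category), with a uniform prior of 0.5
    p = 1.0
    for word in phrase.split():
        word = word.lower()
        p *= prob(category, word)
    # Return only after every word has been multiplied in.
    return p * 0.5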

def naive_bayes(phrase):
    news_prob = calculate_probability(phrase, 'news')
    fiction_prob = calculate_probability(phrase, 'fiction')
    if news_prob > fiction_prob:
        return 0 #news
    else:
        return 1 #fiction

print(naive_bayes("mysterious murder case"))

1


### Smoothing

A simple n-gram model gives zero probability to every combination that was not encountered in the training corpus, so it will most likely give zero probability to most out-of-sample test cases. This problem is known as data sparsity, and the traditional solution is to use smoothing techniques.

#### Example: bigram model

(pen and paper exercises)

Given Corpus:

$JOHN\ READ\ MOBY\ DICK$
<br>
$MARY\ READ\ A\ DIFFERENT\ BOOK$
<br>
$SHE\ READ\ A\ BOOK\ BY\ CHER$


**Exercise 6:**
Calculate the probability of the sentence "JOHN READ A BOOK".

p(john read a book)

= p(john) × p(read|john) × p(a|read) × p(book|a) 

= 1/15 × 1/1 × 2/3 × 1/2

= 0.022222222



**Exercise 7:**
What is $p(CHER\ READ\ A\ BOOK)$?

p(cher read a book)

= p(cher) × p(read|cher) × p(a|read) × p(book|a) 

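= 1/15 × 0/0 × 2/3 × 1/2

= 0

(CHER occurs only at the end of a sentence, so no bigram starts with it: both c(cher read) and the denominator are 0, and the undefined estimate 0/0 is taken as 0.)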
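
The two hand calculations above can be checked with a small sketch. It assumes the conventions used in this sheet: the first word gets its unigram relative frequency, bigrams are counted within sentences only, and a 0/0 estimate is treated as 0. The names `starts` and `mle_sentence_prob` are our own.

```python
from collections import Counter

sentences = [s.lower().split() for s in (
    "JOHN READ MOBY DICK",
    "MARY READ A DIFFERENT BOOK",
    "SHE READ A BOOK BY CHER",
)]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
# Number of bigrams starting with each word (the denominator of the estimate).
starts = Counter(prev for s in sentences for prev in s[:-1])
total_tokens = sum(unigrams.values())  # 15

def mle_sentence_prob(sentence):
    """Unsmoothed bigram probability of a sentence."""
    words = sentence.lower().split()
    p = unigrams[words[0]] / total_tokens
    for prev, word in zip(words, words[1:]):
        if starts[prev] == 0:  # 0/0: no bigram starts with prev
            return 0.0
        p *= bigrams[(prev, word)] / starts[prev]
    return p

print(mle_sentence_prob("JOHN READ A BOOK"))  # 1/45, about 0.0222
print(mle_sentence_prob("CHER READ A BOOK"))  # 0.0
```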

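### Add-one smoothing

Add-one (Laplace) smoothing adds 1 to every bigram count, which adds the vocabulary size $|V|$ to the denominator:

$p(w_i|w_{i-1}) = \frac{1 + c(w_{i-1} w_i)}{\sum_{w'_i} [1 + c(w_{i-1} w'_i)]} = \frac{1 + c(w_{i-1} w_i)}{|V| + \sum_{w'_i} c(w_{i-1} w'_i)}$

For the corpus above, $|V| = 11$.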

**Exercise 8:**
Re-calculate the $p(JOHN\ READ\ A\ BOOK)$ and $p(CHER\ READ\ A\ BOOK)$ using add-one smoothing

p(john read a book)

= p(john) × p(read|john) × p(a|read) × p(book|a) 

= (1+1)/(11+15) × (1+1)/(11+1) × (1+2)/(11+3) × (1+1)/(11+2)

= 0.000422654




p(cher read a book)

= p(cher) × p(read|cher) × p(a|read) × p(book|a) 

= (1+1)/(11+15) × (1+0)/(11+0) × (1+2)/(11+3) × (1+1)/(11+2)

= 0.000230539
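
The smoothed values can be reproduced with the following sketch. It assumes the same conventions as the hand calculation: the first word is smoothed as (1 + c(w)) / (|V| + N), with N the total number of tokens, and each later word as (1 + c(prev w)) / (|V| + number of bigrams starting with prev). The function name `add_one_sentence_prob` is our own.

```python
from collections import Counter

sentences = [s.lower().split() for s in (
    "JOHN READ MOBY DICK",
    "MARY READ A DIFFERENT BOOK",
    "SHE READ A BOOK BY CHER",
)]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
# Number of bigrams starting with each word.
starts = Counter(prev for s in sentences for prev in s[:-1])
N = sum(unigrams.values())  # total tokens: 15
V = len(unigrams)           # vocabulary size: 11

def add_one_sentence_prob(sentence):
    """Bigram probability of a sentence with add-one smoothing."""
    words = sentence.lower().split()
    p = (1 + unigrams[words[0]]) / (V + N)
    for prev, word in zip(words, words[1:]):
        p *= (1 + bigrams[(prev, word)]) / (V + starts[prev])
    return p

print(add_one_sentence_prob("JOHN READ A BOOK"))  # about 0.000422654
print(add_one_sentence_prob("CHER READ A BOOK"))  # about 0.000230539
```

After smoothing, the sentence starting with CHER no longer has zero probability, although it is still less likely than the one starting with JOHN.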


