# Exercise Sheet 4 - Language Modelling

## Learning Objectives

In this lab we are going to:

- Play around with text corpora
- Learn some statistics tricks in Python and NLTK
- Learn about language modelling
- Learn about n-grams
- Use Naive Bayes as a language model
- Get hands-on experience with data sparsity and smoothing techniques


In [None]:
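# setting the stage ;)
import nltk
nltk.download('brown')      # used throughout this sheet
nltk.download('gutenberg')  # used in the N-Grams section
nltk.download('punkt')      # needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab')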

--------------------
## Text Corpus: Statistics and Probability

### Accessing the corpus
Open a Python session and obtain the <a href="https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip">Brown corpus</a> using NLTK.

In [None]:
import nltk

# You will need to import 'Brown' as follows:
from nltk.corpus import brown

# read a list of the words in the Brown Corpus
list_words = brown.words()

# print the first 20 words
print(list_words[0:20])

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself just a list of words). 


In [None]:
brown.sents()

The Brown corpus consists of different categories. We can list the available categories as follows:

In [None]:
brown.categories()

We can access the text of a certain category as follows:

In [None]:
brown.words(categories='fiction')

**Exercise 1:**

What is the frequency of the word &lsquo;world&rsquo; (ignoring case) in the news category of the Brown corpus?

In [None]:
#your code goes here

### Frequency of Words

We can easily get the frequency distribution of the words in a corpus as follows:

In [None]:
from nltk.probability import FreqDist

news_text = brown.words(categories='news')

# the frequency of each vocabulary item in the text
fd = FreqDist(news_text)

# total number of samples (tokens)
print(fd.N())

# number of unique words in this corpus
print(fd.B())

# Get a list of the top 10 words sorted by frequency
print(fd.most_common(10))
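
When case should be ignored, one option is to lowercase every token before counting. The sketch below uses `collections.Counter` (the class that NLTK's `FreqDist` builds on) and a made-up token list named `toy_tokens`, so it runs without downloading a corpus. Both the data and the variable names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy token list standing in for brown.words(categories=...)
toy_tokens = ["The", "world", "is", "big", ".", "World", "news",
              "covers", "the", "WORLD", "."]

# Lowercase every token so 'World' and 'WORLD' fall into the same bin
counts = Counter(token.lower() for token in toy_tokens)

print(counts["world"])        # frequency of 'world', ignoring case
print(sum(counts.values()))   # total number of tokens (the role of fd.N())
print(len(counts))            # number of distinct tokens (the role of fd.B())
print(counts.most_common(3))  # the three most frequent tokens
```

The same generator expression can be passed to `FreqDist` in place of `Counter` when working with a real corpus.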


**Exercise 2:**
In the Brown corpus, in which of the news, government and editorial categories does the word &lsquo;world&rsquo; (ignoring case) have the highest total frequency?
* news
* government
* editorial

In [None]:
# your code goes here

### Probabilities

**Exercise 3:**
Calculate the probability (relative frequency) of every word in the __news__ category of the Brown corpus.
What are the probabilities of the words &lsquo;jury&rsquo; and &lsquo;government&rsquo;?
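
As a reminder of what relative frequency means, the probability of a word is its count divided by the total number of tokens. The following is a minimal sketch on a made-up token list (`toy_tokens` is an assumption, not corpus data); with NLTK, `FreqDist.freq(word)` returns the same quantity.

```python
from collections import Counter

# Hypothetical toy tokens, not taken from the Brown corpus
toy_tokens = ["the", "jury", "said", "the", "vote", "was", "fair", "."]

counts = Counter(toy_tokens)
total = sum(counts.values())

# Relative frequency: count of the word divided by the total number of tokens
probs = {word: count / total for word, count in counts.items()}

print(probs["the"])
print(probs["jury"])
print(sum(probs.values()))  # the probabilities of all words add up to 1
```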

--------------------
## N-Grams

Probabilistic language models (a.k.a. n-gram LMs) estimate the joint probability distribution of a sequence of words. The probability of a sequence is broken up into predicting one word at a time and, under the Markov assumption, each word is conditioned only on the previous n-1 words.
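
Written out for a bigram model, the decomposition looks as follows. The notation $w_1, \dots, w_n$ for the words and $c(\cdot)$ for corpus counts is an assumption chosen to match the smoothing formula later in this sheet, and $w_0$ stands for a sentence-start symbol.

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

The first equality is the chain rule, and the approximation is the first-order Markov assumption. Each factor is commonly estimated by relative frequency:

$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{c(w_{i-1})}$$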

We can extract unigrams and bigrams from a corpus as follows.
In this example, we generate unigrams and bigrams from the novel *Emma* by Jane Austen, taken from the Gutenberg corpus.

In [None]:
#explore the gutenberg corpus
nltk.corpus.gutenberg.fileids()


In [None]:
# get the text of the novel Emma by Jane Austen 
emma_words = nltk.corpus.gutenberg.words('austen-emma.txt')
emma = " ".join(emma_words) 
emma


In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(emma)
tokens

In [None]:
from nltk.util import ngrams

# unigrams (only the first 20, to keep the output readable)
print(list(ngrams(word_tokenize(emma), 1))[:20])


In [None]:
# bigrams of the first 50 words
print(list(ngrams(emma_words[:50], 2)))

# or simply
print(list(nltk.bigrams(emma_words[:50])))
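
To see what n-gram extraction does, it can help to write it by hand. The helper `extract_ngrams` below is a hypothetical name, not part of NLTK. It is a sketch that slides a window of size `n` over a short made-up token list.

```python
def extract_ngrams(tokens, n):
    """Return all n-grams of `tokens` as a list of tuples (hypothetical helper)."""
    # zip over n shifted copies of the token list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up token list for illustration
toy_tokens = ["Emma", "Woodhouse", ",", "handsome", ",", "clever"]

print(extract_ngrams(toy_tokens, 1))  # unigrams
print(extract_ngrams(toy_tokens, 2))  # bigrams
print(extract_ngrams(toy_tokens, 3))  # trigrams
```

A list of k tokens yields k - n + 1 n-grams, and none at all when n is larger than k.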

In [None]:
from nltk.probability import ConditionalFreqDist

# Make a conditional frequency distribution of all the bigrams in the novel Emma by Jane Austen from the Gutenberg corpus
bigrams = nltk.bigrams(emma_words)

cfd = ConditionalFreqDist(bigrams)

# frequency distribution of the words that follow 'fully'
# (cfd['fully'].max() would return only the most frequent one)
cfd['fully']


In [None]:
#same with 'good' but sort by freq
cfd['good'].most_common(20) 
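
Conceptually, a conditional frequency distribution over bigrams is a mapping from each word to a counter of the words that follow it. The following standard-library sketch mimics that idea on a made-up sentence. The names `toy_tokens` and `following` are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Made-up tokens for illustration
toy_tokens = "a good man and a good friend and a kind man".split()

# Map each word (the condition) to a Counter of the words that follow it
following = defaultdict(Counter)
for previous, current in zip(toy_tokens, toy_tokens[1:]):
    following[previous][current] += 1

print(following["good"].most_common())  # every word seen after 'good'
print(following["a"].most_common(1))    # the most frequent word after 'a'
```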

**Exercise 4:**
Write a function to find the most common phrases (trigrams) in the __fiction__ category of the Brown corpus.

In [None]:
# your code goes here

--------------
## Probabilistic modeling


### Naïve Bayes as a Language Model
Based on the probabilities of words in only the news and fiction categories of the Brown corpus, classify the phrase 'mysterious murder case' into one of these categories.

Implement a Naive Bayes classifier using the probability of each word given the category:

$P(fiction|mysterious\ murder\ case) \propto P(mysterious|fiction) \times P(murder|fiction) \times P(case|fiction) \times P(fiction)$
where $P(news) = 0.5$ and $P(fiction) = 0.5$
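
The product above can be evaluated as in the sketch below. The word probabilities in `word_probs` are invented for illustration and are not estimated from the Brown corpus, and `log_score` is a hypothetical helper name. Summing logarithms instead of multiplying is a common choice that avoids numerical underflow on longer phrases.

```python
import math

# Invented per-category word probabilities, NOT estimated from the Brown corpus
word_probs = {
    "news":    {"mysterious": 0.001, "murder": 0.004, "case": 0.010},
    "fiction": {"mysterious": 0.004, "murder": 0.002, "case": 0.006},
}
priors = {"news": 0.5, "fiction": 0.5}

def log_score(phrase, category):
    """Log of P(category) times the product of P(word | category)."""
    score = math.log(priors[category])
    for word in phrase.split():
        score += math.log(word_probs[category][word])
    return score

phrase = "mysterious murder case"
scores = {category: log_score(phrase, category) for category in priors}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Note that `math.log` raises a `ValueError` for a probability of zero, so words never seen in a category need special handling. That is the problem smoothing addresses.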

**Exercise 5:**
Write a general-purpose Naive Bayes classifier, starting from the following template:

In [None]:
# template code to be updated
from random import random
def calculate_probability(phrase, category):
    return random() # TODO: change this

def naive_bayes(phrase):
    news_prob = calculate_probability(phrase, 'news')
    fiction_prob = calculate_probability(phrase, 'fiction')
    if news_prob > fiction_prob:
        return 0  # 0 stands for 'news'
    else:
        return 1  # 1 stands for 'fiction'

### Smoothing

A simple n-gram model gives zero probability to every word combination that was not encountered in the training corpus, so it will most likely give zero probability to most out-of-sample test cases. This problem is known as data sparsity, and the traditional solution is to use smoothing techniques.
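
The effect is easy to reproduce. The sketch below estimates unsmoothed bigram probabilities from a tiny made-up corpus (`toy_sentences` is an assumption, and sentence-boundary symbols are left out for brevity) and shows that a bigram that never occurred gets probability zero.

```python
from collections import Counter

# Tiny made-up training corpus
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

bigram_counts = Counter()   # c(previous current)
history_counts = Counter()  # how often `previous` starts a bigram
for sentence in toy_sentences:
    for previous, current in zip(sentence, sentence[1:]):
        bigram_counts[(previous, current)] += 1
        history_counts[previous] += 1

def mle_bigram_prob(previous, current):
    """Unsmoothed estimate c(previous current) / c(previous)."""
    if history_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, current)] / history_counts[previous]

print(mle_bigram_prob("the", "cat"))  # seen bigram
print(mle_bigram_prob("dog", "ran"))  # unseen bigram
```

Because a sentence probability is a product of such factors, a single unseen bigram makes the whole sentence, for example "the dog ran", come out as zero.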

#### Example: bigram model

Given Corpus:

$JOHN\ READ\ MOBY\ DICK$
<br>
$MARY\ READ\ A\ DIFFERENT\ BOOK$
<br>
$SHE\ READ\ A\ BOOK\ BY\ CHER$


**Exercise 6 (Pen and Paper):**
Calculate the probability of the sentence "JOHN READ A BOOK".

**Exercise 7 (Pen and Paper):**
What is $p(CHER\ READ\ A\ BOOK)$?

### Add-one smoothing

$p(w_i|w_{i-1}) = \frac{1 + c(w_{i-1} w_i)} {\sum_{w'_i} [1 + c(w_{i-1} w'_i)] }$
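
The formula translates into code almost term by term. This is a minimal sketch on a made-up corpus, assuming the vocabulary is the set of word types in the training data with no sentence-boundary symbols. The names `toy_sentences`, `vocabulary` and `add_one_prob` are hypothetical.

```python
from collections import Counter

# Tiny made-up training corpus
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "ran"],
]

# Assumed vocabulary: every word type that occurs in the toy corpus
vocabulary = sorted({word for sentence in toy_sentences for word in sentence})

bigram_counts = Counter(
    (previous, current)
    for sentence in toy_sentences
    for previous, current in zip(sentence, sentence[1:])
)

def add_one_prob(previous, current):
    """(1 + c(previous current)) / sum over w in the vocabulary of (1 + c(previous w))."""
    numerator = 1 + bigram_counts[(previous, current)]
    denominator = sum(1 + bigram_counts[(previous, word)] for word in vocabulary)
    return numerator / denominator

print(add_one_prob("the", "cat"))  # seen bigram
print(add_one_prob("dog", "ran"))  # unseen bigram, no longer zero
```

The denominator is the vocabulary size plus the number of bigrams that start with the previous word, so the smoothed probabilities for a given history still sum to one.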

**Exercise 8 (Pen and Paper):**
Re-calculate $p(JOHN\ READ\ A\ BOOK)$ and $p(CHER\ READ\ A\ BOOK)$ using add-one smoothing.

#### Other smoothing methods include:
- Additive smoothing
- Good-Turing estimate
- Jelinek-Mercer smoothing (interpolation)
- Katz smoothing (backoff)
- Witten-Bell smoothing
- Absolute discounting
- Kneser-Ney smoothing