<pre style="float: right">version 1.0.1</pre>
# FNLP: Lab Session 2: Smoothing and Authorship Identification

## Aim

The aims of this lab session are to:
- explore Laplace, Lidstone and backoff smoothing methods for language models
- use language models for authorship identification.

Successful completion of this lab will help you solidify your understanding of smoothing (important not just for LMs but all over NLP), cross-entropy (important also for assignment 1), and one type of text classification (authorship identification). By the end of this lab session, you should be able to:
- Compute smoothed bigram probabilities by hand for simple smoothing methods.
- Train an ``NgramModel``  with smoothing for unseen n-grams.
- Make use of language models to identify the author of a text.

## Introduction

For this lab, we will continue to use ``nltk`` and the ``nltk_model`` package from the previous lab. Moreover, we will only work with the Gutenberg corpus. Execute the cell below to import all libraries for this lab.

In [None]:
import nltk
from nltk.corpus import gutenberg

# The NgramModel from NLTK version 2 has been removed from NLTK 3.
# So we're using a ported version from a local directory.
try:
    from nltk_model import *  # See the README inside the nltk_model folder for more information
except ImportError:
    from .nltk_model import *  # Compatibility depending on how this script was run

## Smoothing

In the final exercise of Lab 1, you were asked to calculate the probability of a word given its context, using a bigram language model with no smoothing. For the first two word-context pairs, these bigrams had been seen in the data used to train the language model. For the third word-context pair, the bigram had not been seen in the training data, which led to an estimated probability of 0.0.

Zero probabilities for unseen n-grams cause problems. Suppose for example you take a bigram language model and use it to score an automatically generated sentence of 10 tokens (say the output of a machine translation system). If one of the bigrams in that sentence is unseen, the probability of the sentence will be zero.
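
To see why a single unseen bigram is enough to zero out the whole sentence, here is a minimal sketch. The probability values are made up for illustration and do not come from any trained model, and `sentence_prob` is a hypothetical helper, not part of `nltk_model`.

```python
import math

def sentence_prob(bigram_probs):
    """Probability of a sentence as the product of its bigram probabilities."""
    return math.prod(bigram_probs)

# Made-up bigram probabilities for a short sentence (illustrative values only).
seen_only = [0.2, 0.05, 0.1, 0.3]
one_unseen = [0.2, 0.05, 0.0, 0.3]  # third bigram unseen: MLE gives 0.0

print(sentence_prob(seen_only))   # small but non-zero
print(sentence_prob(one_unseen))  # 0.0
```

However plausible the rest of the sentence is, the single zero factor removes all information about it.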

Smoothing is a method of assigning probabilities to unseen n-grams. As language models are typically trained using large amounts of data, any n-gram not seen in the training data is unlikely to be seen in other (test) data. A good smoothing method is, therefore, one that assigns a fairly small probability to unseen n-grams.

We’ll explore two different smoothing methods: Laplace (add-one) and Lidstone (add-alpha), and we will also consider the effects of backoff.

## Maximum-Likelihood estimation

Before implementing any smoothing, you should make sure you understand how to implement maximum likelihood estimation. In last week’s lab, we used NLTK to do this for us by training a bigram language model with an MLE estimator. We could then use the language model to find the MLE probability of any word given its context. Here, you’ll do the same thing but without using NLTK, just to make sure you understand how. We will also compare the smoothed probabilities you compute later to these MLE probabilities.
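
As a reminder, the MLE estimate for a bigram is the relative frequency of that bigram among all bigrams that start with the context word. The notation is our own choice here: $C(\cdot)$ stands for a count in the training data, $w_{i-1}$ for the context and $w_i$ for the word.

$$ P_{MLE}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})} $$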

### Exercise 0
The function below extracts all the words from the specified document in the Gutenberg corpus and then computes a list of bigram tuples by pairing up each word in the corpus with the following word. Using the resulting lists of unigrams and bigrams, complete the code block so it returns the MLE probability of a word given a single word of context.

In [None]:
def myMLE(doc_name, word, context):
    """
    :type doc_name: str
    :param doc_name: name of the document to use for estimation
    :type word: str
    :param word: The input word
    :type context: str
    :param context: The preceding word
    :rtype: float
    :return: The MLE probability of word given context
    """
    # Preprocessing all words to be lowercased
    words = [w.lower() for w in gutenberg.words(doc_name)]
    # list of bigrams as tuples (doesn't include begin/end of corpus: but basically this is fine)
    bigrams = list(zip(words[:-1], words[1:])) 
    # Compute probability of word given context
    prob = 0

    return prob

Test your estimates using Jane Austen’s “Sense and Sensibility” from the Gutenberg corpus by computing the probabilities:

1. $P_{MLE}(\text{“end”} \mid \text{“the”})$
2. $P_{MLE}(\text{“the”} \mid \text{“end”})$

Make sure your answers match the MLE probability estimates from Exercise 5 of Lab 1, where we used NLTK to compute these estimates.

In [None]:
doc_name = 'austen-sense.txt'
print("MLE probability of 'end' given 'the': {:.5f}".format(myMLE(doc_name, 'end', 'the')))
print("MLE probability of 'the' given 'end': {:.5f}".format(myMLE(doc_name, 'the', 'end')))

## Laplace (add-1)

Laplace smoothing adds a value of 1 to the sample count for each “bin” (possible observation, in this case, each possible bigram), and then takes the maximum likelihood estimate of the resulting frequency distribution.
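
Written out for bigrams, this could be expressed as follows. The notation is assumed here rather than taken from NLTK: $C(\cdot)$ is a count in the training data, $V$ is the vocabulary size (the number of possible words that can follow any context), $w_{i-1}$ is the context and $w_i$ is the word.

$$ P_{+1}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V} $$

For instance, with the made-up values $C(w_{i-1}, w_i) = 0$, $C(w_{i-1}) = 10$ and $V = 1000$, the unseen bigram receives $1/1010 \approx 0.00099$ instead of 0.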

### Exercise 1

Assume that the size of the vocabulary is just the number of different words observed in the training data (that is, we will not deal with unseen words). Complete the function ``myLaplace`` to compute Laplace smoothed probabilities, again without using NLTK. Hint: if you have trouble, study the equations and example in the lectures.

In [None]:
def myLaplace(doc_name, word, context):
    """
    :type doc_name: str
    :param doc_name: name of the document to use for estimation
    :type word: str
    :param word: The input word
    :type context: str
    :param context: The preceding word
    :rtype: float
    :return: The Laplace-smoothed probability of word given context
    """
    # Preprocessing all words to be lowercased
    words = [w.lower() for w in gutenberg.words(doc_name)]
    # list of bigrams as tuples (doesn't include begin/end of corpus: but basically this is fine)
    bigrams = list(zip(words[:-1], words[1:]))
    # Estimate the size of the vocabulary
    V = 0
    # Compute probability of word given context
    prob = 0

    return prob

Now test your code and look at the estimates for:

1. $P_{+1}(\text{“end”} \mid \text{“the”})$
2. $P_{+1}(\text{“the”} \mid \text{“end”})$

using Jane Austen’s “Sense and Sensibility” as training data. How do these probabilities differ from the MLE estimates performed previously?

In [None]:
doc_name = 'austen-sense.txt'
print("LAPLACE probability of 'end' given 'the': {:.5f}".format(myLaplace(doc_name, 'end', 'the')))
print("LAPLACE probability of 'the' given 'end': {:.5f}".format(myLaplace(doc_name, 'the', 'end')))

## Lidstone (add-alpha)

In practice, Laplace smoothing assigns too much mass to unseen n-grams. The Lidstone method works in a similar way, but instead of adding 1, it adds a value between 0 and 1 to the sample count for each bin (in class we called this value alpha, NLTK calls it gamma).
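
For bigrams, the add-alpha estimate could be written as follows. The notation is assumed here: $C(\cdot)$ is a count in the training data, $V$ is the vocabulary size, $\alpha$ is the smoothing constant, $w_{i-1}$ is the context and $w_i$ is the word.

$$ P_{+\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + \alpha}{C(w_{i-1}) + \alpha V} $$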

### Exercise 2

Complete the function ``myLidstone`` to compute Lidstone smoothed probabilities.

In [None]:
def myLidstone(doc_name, word, context, alpha=1.0):
    """
    :type doc_name: str
    :param doc_name: name of the document to use for estimation
    :type word: str
    :param word: The input word
    :type context: str
    :param context: The preceding word
    :type alpha: float 
    :param alpha: smoothing constant
    :rtype: float
    :return: The Lidstone-smoothed probability of word given context
    """
    # Preprocessing all words to be lowercased
    words = [w.lower() for w in gutenberg.words(doc_name)]
    # list of bigrams as tuples (doesn't include begin/end of corpus: but basically this is fine)
    bigrams = list(zip(words[:-1], words[1:]))
    # Estimate the size of the vocabulary
    V = 0
    # Compute probability of word given context
    prob = 0

    return prob

Test your code again using the Jane Austen novel. Look at the probability estimates that are computed for the same bigrams as before using various values of alpha.

What do you notice about using `alpha = 0` and `alpha = 1`? (Compare to the probabilities computed by the previous methods.) What about when `alpha = 0.1`? Are the estimated probabilities more similar to MLE or Laplace smoothing in this case?

In [None]:
doc_name = 'austen-sense.txt'
print("alpha=0")
print("LIDSTONE probability of 'end' given 'the': {:.5f}".format(myLidstone(doc_name, 'end', 'the', 0)))
print("LIDSTONE probability of 'the' given 'end': {:.5f}".format(myLidstone(doc_name, 'the', 'end', 0)))
print("alpha=1")
print("LIDSTONE probability of 'end' given 'the': {:.5f}".format(myLidstone(doc_name, 'end', 'the', 1)))
print("LIDSTONE probability of 'the' given 'end': {:.5f}".format(myLidstone(doc_name, 'the', 'end', 1)))
print("alpha=0.1")
print("LIDSTONE probability of 'end' given 'the': {:.5f}".format(myLidstone(doc_name, 'end', 'the', 0.1)))
print("LIDSTONE probability of 'the' given 'end': {:.5f}".format(myLidstone(doc_name, 'the', 'end', 0.1)))

## Backoff

Now we will look at the effects of incorporating backoff in addition to some of these simple smoothing methods. In a bigram language model with backoff, the probability of an unseen bigram is computed by “backing off”: that is, if a word has never been seen in a particular context, then we compute its probability by using one fewer context words. Backing off from a bigram model (one word of context) therefore means we’d get estimates based on unigram frequencies (no context).

The mathematical details of backoff are a bit complex to ensure all the probabilities sum to 1. You needn’t understand all the details of backoff but you should understand these basic principles:

- Bigram probabilities for seen bigrams will be slightly lower than MLE to allocate some probability mass to unseen bigrams.
- The unigram probabilities inside the backoff (i.e. the ones we use if we didn’t see the bigram) are similar in their relative sizes to the unigram probabilities we would get if we just estimated a unigram model directly.

That is, a word with high corpus frequency will have a higher unigram backoff probability than a word with a low corpus frequency. Look back at the initialization method for NgramModel earlier in the lab. If you pass in MLEProbDist as the estimator (which we did in the last lab), then no backoff is used. However, with any other estimator (i.e., smoothing), the NgramModel does use backoff.
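
The two principles above can be illustrated with a toy sketch. Note that this is not how `NgramModel` implements backoff: for simplicity it uses absolute discounting (subtracting a fixed `discount` from every seen bigram count), and the function name `backoff_prob`, the toy corpus and the discount value are all assumptions made for this example. It also assumes that `word` occurs somewhere in the corpus.

```python
from collections import Counter

def backoff_prob(word, context, words, discount=0.5):
    """Toy bigram probability with backoff to unigrams (absolute discounting)."""
    unigrams = Counter(words)
    bigrams = Counter(zip(words[:-1], words[1:]))
    total = len(words)
    # Words seen after `context`, with their bigram counts
    followers = {w2: c for (w1, w2), c in bigrams.items() if w1 == context}
    context_count = sum(followers.values())
    if context_count == 0:
        # Context never seen with a successor: use the unigram estimate
        return unigrams[word] / total
    if word in followers:
        # Seen bigram: slightly less than the MLE estimate
        return (followers[word] - discount) / context_count
    # Probability mass freed up by discounting the seen bigrams
    reserved = discount * len(followers) / context_count
    # Share it among unseen words in proportion to their unigram frequency
    unseen_mass = sum(c for w, c in unigrams.items() if w not in followers) / total
    return reserved * (unigrams[word] / total) / unseen_mass

words = "the cat sat on the mat the cat ate the fish".split()
vocab = set(words)
print(sum(backoff_prob(w, "the", words) for w in vocab))  # sums to 1 (up to rounding)
print(backoff_prob("cat", "the", words))  # below the MLE estimate of 2/4
print(backoff_prob("the", "the", words) > backoff_prob("sat", "the", words))
```

In this toy corpus neither "the the" nor "the sat" occurs, but the frequent word "the" receives a larger share of the reserved mass than the rare word "sat".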

### Exercise 3

Complete the function ``myLaplaceBackoff`` to estimate a Laplace-smoothed language model with backoff for the given document in the Gutenberg corpus, using ``NgramModel``.

In [None]:
def myLaplaceBackoff(doc_name, word, context):
    """
    :type doc_name: str
    :param doc_name: name of the document to use for estimation
    :type word: str
    :param word: The input word
    :type context: str
    :param context: The preceding word
    :rtype: float
    :return: The Laplace-smoothed probability of word given context
    """
    words = [w.lower() for w in gutenberg.words(doc_name)]
    est = lambda fdist,bins: nltk.probability.LaplaceProbDist(fdist,bins+1)
    # Train a bigram language model using a LAPLACE estimator AND BACKOFF
    lm = NgramModel(<order>,<word_list>,estimator=<estimator>)
    # Compute probability of word given context (note lm requires a list context)
    prob = 0

    return prob

Test your function again with the Jane Austen novel. How different are the estimated probabilities from those computed by the previous methods?

In [None]:
doc_name = 'austen-sense.txt'
print("LAPLACE(backoff) probability of 'end' given 'the': {:.5f}".format(myLaplaceBackoff(doc_name, 'end', 'the')))
print("LAPLACE(backoff) probability of 'the' given 'end': {:.5f}".format(myLaplaceBackoff(doc_name, 'the', 'end')))

# Authorship Identification

## Cross-entropy

In language modelling, a model is trained on a set of data (i.e. the training data). The cross-entropy of this model may then be measured on a test set (i.e. another set of data that is different from the training data) to assess how accurate the model is in predicting the test data.

Another way to look at this is: if we used the trained model to generate new sentences by sampling words from its probability distribution, how similar would those new sentences be to the sentences in the test data? This interpretation allows us to use cross-entropy for authorship detection.
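
As a rough sketch of what such a score measures, the snippet below computes a total and a per-word cross-entropy from a list of per-word probabilities. The probabilities are made up, the helper `cross_entropy` is hypothetical, and base-2 logarithms are an assumption; check the `NgramModel` source for the exact conventions it uses.

```python
import math

def cross_entropy(word_probs, per_item=False):
    """Negative sum of log2 probabilities; averaged over words if per_item is True."""
    total = -sum(math.log2(p) for p in word_probs)
    return total / len(word_probs) if per_item else total

# Made-up probabilities that a model assigns to each word of a 3-word text
probs = [0.5, 0.25, 0.25]
print(cross_entropy(probs))                 # 5.0 bits in total
print(cross_entropy(probs, per_item=True))  # 5/3 bits per word
```

The better the model predicts each word, the closer the probabilities are to 1 and the lower both values become.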

`NgramModel` contains the following cross-entropy method:
```python
def entropy(self, text, pad_left=False, pad_right=False,
    verbose=False, perItem=False):
    """
    Calculate the approximate cross-entropy of the n-gram model for a
    given evaluation text.
    This is the average log probability of each item in the text.
    :param text: items to use for evaluation
    :type text: iterable(str)
    :param pad_left: whether to pad the left of each text with an (n-1)-gram\
    of <s> markers
    :type pad_left: bool
    :param pad_right: whether to pad the right of each sentence with an </s>\
    marker
    :type pad_right: bool
    :param perItem: normalise for length if True
    :type perItem: bool
    """
```

### Exercise 4

We can use cross-entropy in authorship detection. For example, suppose we have a language model trained on Jane Austen’s “Sense and Sensibility” (training data) plus the texts for two other novels (test data), one by Jane Austen and one by another author, but we don’t know which is which. We can work out the cross-entropy of the model on each of the texts and from the scores, determine which of the two test texts was more likely written by Jane Austen. For testing, use:

- text a: ``austen-emma.txt`` (Jane Austen’s “Emma”)
- text b: ``chesterton-ball.txt`` (G.K. Chesterton’s “The Ball and the Cross”)

and complete the functions below, in which you will:

- Estimate a trigram language model with a Lidstone probability distribution.
- Compute total document cross-entropy for each text
- Compute per word cross-entropy for each text

Note: The `fdist.B()+1` argument (already provided for you in the code) means that we lump together all the unseen n-grams as a single “unknown” token.
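
Once both scores are available, the decision step could look like the sketch below. It assumes that a lower per-word cross-entropy indicates a better fit to the training author; the per-word value is used because total cross-entropy grows with document length, so totals for texts of different lengths are not directly comparable. The function name and the numbers are hypothetical and are not results from this lab.

```python
def most_likely_same_author(perword_xents):
    """Return the document name with the lowest per-word cross-entropy."""
    return min(perword_xents, key=perword_xents.get)

# Made-up scores for illustration only; these are NOT results from the lab.
scores = {"text_a": 7.9, "text_b": 9.4}
print(most_likely_same_author(scores))  # text_a
```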

In [None]:
def estimateLM(doc_name):
    """
    :type doc_name: str
    :param doc_name: name of the document in the Gutenberg corpus
    :rtype: NgramModel
    :return: Lidstone-smoothed language model with backoff
    """
    # Construct a list of lowercase words from the document (training data for lm)
    doc_words = [w.lower() for w in gutenberg.words(doc_name)]
    # a Lidstone probability distribution with +0.01 added to the sample count for each bin
    est = lambda fdist,bins:nltk.LidstoneProbDist(fdist,0.01,fdist.B()+1)
    # Train a trigram language model with backoff using doc_words and the estimator est
    lm = NgramModel(<order>,<word_list>,estimator=<estimator>)
    # Return the language model
    return lm

In [None]:
def document_xent(lm, doc_name):
    """
    Use a language model to compute the total word-level cross-entropy of a document
    
    :type lm: NgramModel
    :param lm: a language model
    :type doc_name: str
    :param doc_name: A gutenberg document name
    :rtype: float
    :return: The total entropy of the named document per the model
    """
    # Construct a list of lowercase words from the document (test document)
    doc_words = [w.lower() for w in gutenberg.words(doc_name)]
    # Compute the total cross entropy of the text in doc_name
    xent = 0
    
    return xent

In [None]:
def perword_xent(lm, doc_name):
    """
    Use a language model to compute the average (per-word) word-level cross-entropy of a document
    
    :type lm: NgramModel
    :param lm: a language model
    :type doc_name: str
    :param doc_name: A gutenberg document name
    :rtype: float
    :return: The per-word cross-entropy of the named document per the model
    """
    # Construct a list of lowercase words from the document (test document)
    doc_words = [w.lower() for w in gutenberg.words(doc_name)]
    # Compute the per-word cross entropy of the text in doc_name
    xent = 0
    
    return xent

In [None]:
train_doc = 'austen-sense.txt'
test_a = 'austen-emma.txt'
test_b = 'chesterton-ball.txt'
lm = estimateLM(train_doc)

print('Document {}:'.format(test_a))
print('document xent: {} perword xent {}'.format(document_xent(lm, test_a), perword_xent(lm, test_a)))

print('Document {}:'.format(test_b))
print('document xent: {} perword xent {}'.format(document_xent(lm, test_b), perword_xent(lm, test_b)))

## Going further

### Padding

Redo exercise 4 setting `pad_left` and `pad_right` to `True` both when initialising
the n-gram model and when computing entropy. What difference does this
make?
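
To get a feel for what padding changes, the sketch below builds the n-grams of a single short sentence with and without padding. It follows the description in the `entropy` docstring (n-1 `<s>` markers on the left, one `</s>` marker on the right); `padded_ngrams` is a hypothetical helper written for this illustration, not a function from `nltk_model`.

```python
def padded_ngrams(sentence, n, pad_left=True, pad_right=True):
    """List the n-grams of a token list, optionally padded with <s> and </s>."""
    tokens = list(sentence)
    if pad_left:
        tokens = ["<s>"] * (n - 1) + tokens
    if pad_right:
        tokens = tokens + ["</s>"]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ["she", "smiled"]
print(padded_ngrams(sentence, 3))
# [('<s>', '<s>', 'she'), ('<s>', 'she', 'smiled'), ('she', 'smiled', '</s>')]
print(padded_ngrams(sentence, 3, pad_left=False, pad_right=False))
# []
```

Without padding, a two-word text contributes no trigrams at all; with padding, the model also sees which words tend to start and end a text.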

### Sentences

Using one enormous string of words as the training and test data is less than optimal, as it trains/tests across sentence boundaries. Look back at the description of the `train` argument to `NgramModel` and see that it will actually train on an input which is a list of lists of words, that is, a list of *sentences*, padding each sentence appropriately. Redo Exercise 4, training and testing on the sentences in the specified documents.
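
The effect of sentence boundaries can be seen on a toy example. The two sentences below are made up; in the lab itself the sentence lists would presumably come from the corpus reader (NLTK corpus readers provide a `sents` method alongside `words`, which we assume is what you would use).

```python
# Two made-up sentences, each a list of lowercase tokens
sentences = [["she", "smiled", "."], ["he", "left", "."]]

# Flat word list: bigrams are extracted across the sentence boundary
flat = [w for sent in sentences for w in sent]
flat_bigrams = list(zip(flat[:-1], flat[1:]))

# Sentence-by-sentence: bigrams never span two sentences
per_sentence_bigrams = [bg for sent in sentences for bg in zip(sent[:-1], sent[1:])]

crossing = [bg for bg in flat_bigrams if bg not in per_sentence_bigrams]
print(crossing)  # [('.', 'he')]
```

The bigram `('.', 'he')` only exists because the two sentences were concatenated; training on a list of sentences avoids counting it.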

### Case

If we're training on sentences, maybe we shouldn't be down-casing?  Give it a try.