Lab 3: Language Modeling
=============
Yucheng Jin (yucheng9)
=============
In this problem set, you will train a language model. You will:

- Train an n-gram language model.
- Use that language model to generate representative sentences.
- Study the effect of training data size, and language model complexity (n-gram size), on the modeling capacity of a language model.

- **To submit this assignment, compress the whole directory using tar, and submit the tarball ```lab2-submission.tgz``` on Compass.**

Total points: 100 points

# 0. Setup

In order to develop this assignment, you will need [python 3.6](https://www.python.org/downloads/) and the following libraries. Most if not all of these are part of [anaconda](https://www.continuum.io/downloads), so a good starting point would be to install that.

- [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
- [nosetests](https://nose.readthedocs.org/en/latest/)
- [nltk](https://www.nltk.org)

Here is some help on installing packages in python: https://packaging.python.org/installing/. You can use ```pip install --user``` to install locally without sudo.

In [27]:
import sys
from importlib import reload
from collections import defaultdict

In [28]:
print('My Python version')

print('python: {}'.format(sys.version))

My Python version
python: 3.7.3 (default, Mar 27 2019, 16:54:48) 
[Clang 4.0.1 (tags/RELEASE_401/final)]


In [29]:
import nose
import nltk

In [30]:
print('My library versions')

print('nose: {}'.format(nose.__version__))
print('nltk: {}'.format(nltk.__version__))

My library versions
nose: 1.3.7
nltk: 3.4.4


To test whether your libraries are the right version, run:

`nosetests tests/test_environment.py`

In [31]:
! nosetests tests/test_environment.py

.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


# 1. Training a language model

Let us first train a 3-gram language model using the language modeling toolkit: kenlm. To train a language model, all we need is a monolingual corpus, which we will get using nltk.

Total: 40 points

Let us first extract two corpora from nltk's reuters corpus, one for each of two domains: the food industry and the natural resources industry.

In [32]:
import nltk

food = ['barley', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copra-cake''grain', 'groundnut', 'groundnut-oil', 'potato''soy-meal', 'soy-oil', 'soybean', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'veg-oil', 'wheat']
natural_resources = ['alum', 'fuel', 'gas', 'gold', 'iron-steel', 'lead', 'nat-gas', 'palladium', 'propane', 'tin', 'zinc']
corpus = nltk.corpus.reuters
food_corpus = corpus.raw(categories=food)
natr_corpus = corpus.raw(categories=natural_resources)

## Tokenization

Your first task is to tokenize the raw text into a list of sentences, each of which is in turn a list of words. No other preprocessing, such as lowercasing, is needed.

- **Deliverable 1.1**: Complete the function `ece365lib.train.tokenize_corpus`. (5 points)
- **Test**: `nosetests tests/test_train.py:test_d1_1_tk`

In [33]:
from ece365lib import train
reload(train);

In [34]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/yuchengjin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
food_corpus_tk = train.tokenize_corpus(food_corpus)
natr_corpus_tk = train.tokenize_corpus(natr_corpus)

In [36]:
! nosetests tests/test_train.py:test_d1_1_tk

.
----------------------------------------------------------------------
Ran 1 test in 2.601s

OK


## Padding

Your second task is to pad your sentences with the start-of-sentence symbol and end-of-sentence symbol. These symbols are necessary to model the probability of words that usually start a sentence and those that usually end a sentence.

- **Deliverable 1.2**: Complete the function `ece365lib.train.pad_corpus`. (5 points)
- **Test**: `nosetests tests/test_train.py:test_d1_2_pad`
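
As a minimal sketch of the padding step, assuming one `<s>` and one `</s>` per sentence (the same symbols used to build the test data later in this lab), each sentence can be wrapped as follows. The name `pad_sentences` is hypothetical and is not the graded `ece365lib` function.

```python
def pad_sentences(sentences, bos="<s>", eos="</s>"):
    """Return a new list in which every sentence is wrapped in boundary symbols.

    Assumption: a single start symbol and a single end symbol per sentence.
    """
    return [[bos] + list(sentence) + [eos] for sentence in sentences]


# Toy input, not taken from the reuters corpus.
example = [["Wheat", "prices", "rose", "."], ["Gold", "fell", "."]]
padded = pad_sentences(example)
print(padded[0])  # ['<s>', 'Wheat', 'prices', 'rose', '.', '</s>']
```

The input list is left unmodified, which makes it safe to reuse the unpadded corpus in later cells.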

In [37]:
reload(train);

In [39]:
food_corpus_tk_pd = train.pad_corpus(food_corpus_tk)
natr_corpus_tk_pd = train.pad_corpus(natr_corpus_tk)

In [41]:
! nosetests tests/test_train.py:test_d1_2_pad

.
----------------------------------------------------------------------
Ran 1 test in 2.563s

OK


## Train-Test Split

Your third task is to split each corpus into a training set, used to train the language model, and a test set, used to evaluate it. We will go with the traditional 80% (train), 20% (test) split. The first `floor(0.8*nsents)` sentences should constitute the training corpus, and the rest should constitute the test corpus.

- **Deliverable 1.3**: Complete the function `ece365lib.train.split_corpus`. (10 points)
- **Test**: `nosetests tests/test_train.py:test_d1_3_spc`
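
The split rule above can be sketched in a few lines. This is an illustration only: the name `split_train_test` and the `train_fraction` parameter are assumptions, not the signature the tests expect.

```python
import math


def split_train_test(sentences, train_fraction=0.8):
    """Put the first floor(train_fraction * nsents) sentences in train, the rest in test."""
    cut = math.floor(train_fraction * len(sentences))
    return sentences[:cut], sentences[cut:]


# Seven toy "sentences": floor(0.8 * 7) = 5 go to train, 2 to test.
toy_sentences = [[str(i)] for i in range(7)]
toy_train, toy_test = split_train_test(toy_sentences)
print(len(toy_train), len(toy_test))  # 5 2
```

Because the split is positional rather than random, the result is deterministic and the test set always comes from the end of the corpus.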

In [50]:
reload(train);

In [51]:
food_corpus_tr, food_corpus_te = train.split_corpus(food_corpus_tk_pd)
natr_corpus_tr, natr_corpus_te = train.split_corpus(natr_corpus_tk_pd)

In [53]:
! nosetests tests/test_train.py:test_d1_3_spc

.
----------------------------------------------------------------------
Ran 1 test in 2.817s

OK


## Splitting into n-grams

Your fourth task is to count n-grams in the text up to a specific order.

- **Deliverable 1.4**: Complete the function `ece365lib.train.count_ngrams`. (10 points)
- **Test**: `nosetests tests/test_train.py:test_d1_4_cn`
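
One possible way to count every n-gram up to a given order is a sliding window over each padded sentence. The sketch below stores the counts in a `Counter` keyed by tuples and returns the vocabulary as a set; both choices, and the name `count_ngrams_upto`, are assumptions, and the layout expected by the tests may differ.

```python
from collections import Counter


def count_ngrams_upto(sentences, max_order):
    """Count all n-grams of order 1..max_order and collect the vocabulary."""
    counts = Counter()
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence)
        for order in range(1, max_order + 1):
            # Slide a window of width `order` across the sentence.
            for start in range(len(sentence) - order + 1):
                counts[tuple(sentence[start:start + order])] += 1
    return counts, vocab


# Toy padded corpus, not taken from the reuters data.
toy_corpus = [["<s>", "a", "b", "</s>"], ["<s>", "a", "c", "</s>"]]
toy_counts, toy_vocab = count_ngrams_upto(toy_corpus, 3)
print(toy_counts[("<s>", "a")])  # 2
```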

In [132]:
reload(train);

In [133]:
food_ngrams, food_vocab = train.count_ngrams(food_corpus_tr, 3)
natr_ngrams, natr_vocab = train.count_ngrams(natr_corpus_tr, 3)

In [134]:
! nosetests tests/test_train.py:test_d1_4_cn

.
----------------------------------------------------------------------
Ran 1 test in 6.443s

OK


## Estimating n-gram probability

Your last task in this part of the problem set is to estimate the n-gram probabilities $p(w_i \mid w_{i-n+1}, w_{i-n+2}, \ldots, w_{i-1})$. For the purposes of this exercise, we use the maximum likelihood estimate and perform no smoothing.

- **Deliverable 1.5**: Complete the function `ece365lib.train.estimate`. (10 points)
- **Test**: `nosetests tests/test_train.py:test_d1_5_es`
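
Without smoothing, the maximum likelihood estimate is the ratio count(context, word) / count(context). A minimal sketch, assuming the counts live in a `Counter` keyed by tuples (a hypothetical layout; the name `mle_probability` is also an assumption):

```python
from collections import Counter


def mle_probability(counts, word, context):
    """Unsmoothed estimate: count(context + word) / count(context)."""
    context = tuple(context)
    context_count = counts[context]
    if context_count == 0:
        # Unseen context: the MLE is undefined, so this sketch returns 0.0.
        return 0.0
    return counts[context + (word,)] / context_count


# Toy counts chosen for illustration only.
toy_counts = Counter({("the", "price"): 3, ("the", "price", "of"): 2})
p_seen = mle_probability(toy_counts, "of", ["the", "price"])      # 2 / 3
p_unseen = mle_probability(toy_counts, "rose", ["the", "price"])  # 0 / 3
print(p_seen, p_unseen)
```

Any n-gram that never occurred in training gets probability zero under this estimate, which is the usual motivation for smoothing.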

In [151]:
reload(train);

In [152]:
print(train.estimate(food_ngrams, ['palm'], ['producer', 'of']))
print(train.estimate(natr_ngrams, ['basis'], ['tested', 'the']))

0.25
0.5


In [153]:
!nosetests tests/test_train.py:test_d1_5_es

.
----------------------------------------------------------------------
Ran 1 test in 6.407s

OK


## Training a language model

For the sake of simplicity, and for the purposes of later parts in this problem set, we use nltk's lm module to train a language model.

In [154]:
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

food_train, food_vocab = padded_everygram_pipeline(3, food_corpus_tk[:int(0.8*len(food_corpus_tk))])
natr_train, natr_vocab = padded_everygram_pipeline(3, natr_corpus_tk[:int(0.8*len(natr_corpus_tk))])

food_test = sum([['<s>'] + x + ['</s>'] for x in food_corpus_tk[int(0.8*len(food_corpus_tk)):]],[])
natr_test = sum([['<s>'] + x + ['</s>'] for x in natr_corpus_tk[int(0.8*len(natr_corpus_tk)):]],[])

food_lm = Laplace(3)
natr_lm = Laplace(3)

food_lm.fit(food_train, food_vocab)
natr_lm.fit(natr_train, natr_vocab)
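
The `Laplace` model used above applies add-one smoothing, in contrast to the unsmoothed estimate from Deliverable 1.5. The general idea can be sketched in plain Python as (count(context, word) + 1) / (count(context) + V), where V is the vocabulary size. This is not nltk's implementation; the function name and the toy counts are assumptions.

```python
from collections import Counter


def laplace_probability(counts, word, context, vocab_size):
    """Add-one smoothed estimate of p(word | context)."""
    context = tuple(context)
    return (counts[context + (word,)] + 1) / (counts[context] + vocab_size)


# Toy counts and vocabulary size, for illustration only.
toy_counts = Counter({("the", "price"): 3, ("the", "price", "of"): 2})
toy_vocab_size = 5
p_seen = laplace_probability(toy_counts, "of", ["the", "price"], toy_vocab_size)      # (2 + 1) / (3 + 5)
p_unseen = laplace_probability(toy_counts, "rose", ["the", "price"], toy_vocab_size)  # (0 + 1) / (3 + 5)
print(p_seen, p_unseen)  # 0.375 0.125
```

Unseen n-grams now receive a small nonzero probability, so the perplexity of a test corpus stays finite.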

One way to see what the language model learned is to look at the sentences it can generate.

In [155]:
n_words = 10
print(food_lm.generate(n_words, random_seed=3))
print(natr_lm.generate(n_words, random_seed=3))

['<s>', 'Commenting', 'on', 'the', 'new', 'Congress', 'convenes', 'after', 'the', 'May']
['<s>', 'Currently', ',', 'there', 'was', 'a', '15.3', 'pct', 'increase', ',']


# 2. Evaluating a language model

Next, we evaluate our language models using the perplexity measure and draw conclusions on how a change of domain can affect the performance of a language model. Perplexity measures how well a language model predicts the sentences in a test corpus; lower is better.

Total: 20 points

- **Deliverable 2.1**: Complete the function `ece365lib.evaluate.get_perplexity`. (10 points)
- **Test**: `nosetests tests/test_train.py:test_d2_1_gp`
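
Perplexity is commonly computed as 2 raised to the negative average log2 probability of the test tokens. The sketch below assumes a hypothetical `score(word, context)` callable that returns a probability, and a flat list of padded tokens as input; neither is guaranteed to match what `get_perplexity` receives in the tests.

```python
import math


def perplexity(score, tokens, order=3):
    """2 ** (-(1/N) * sum(log2 p(w_i | previous order-1 tokens)))."""
    log_prob_sum = 0.0
    n_predictions = 0
    for i in range(order - 1, len(tokens)):
        context = tuple(tokens[i - order + 1:i])
        # math.log2 raises ValueError on a zero probability,
        # so the scoring function should be smoothed.
        log_prob_sum += math.log2(score(tokens[i], context))
        n_predictions += 1
    return 2 ** (-log_prob_sum / n_predictions)


def uniform_score(word, context):
    """Toy model that spreads probability evenly over four words."""
    return 0.25


toy_tokens = ["<s>", "a", "b", "c", "d", "</s>"]
toy_perplexity = perplexity(uniform_score, toy_tokens, order=3)
print(toy_perplexity)  # 4.0
```

A model that is uniform over four words has perplexity 4, which matches the reading of perplexity as an average branching factor.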

In [157]:
from ece365lib import evaluate
reload(evaluate);

In [158]:
# This might take some time
print(evaluate.get_perplexity(food_lm, food_test[:2500]))
print(evaluate.get_perplexity(food_lm, natr_test[:2500]))
print(evaluate.get_perplexity(natr_lm, natr_test[:2500]))
print(evaluate.get_perplexity(natr_lm, food_test[:2500]))

7318.4416424924275
7309.175157015321
5222.470200914429
5354.495948590062


In [159]:
! nosetests tests/test_train.py:test_d2_1_gp

.
----------------------------------------------------------------------
Ran 1 test in 50.581s

OK


- **Deliverable 2.2**: What observations can you make on the results? Is the domain shift affecting the performance of the language model? Is that always the case? What are possible explanations? (10 points)

**My Answer**

* My observations: the language model trained on the food industry corpus has very close perplexities on the food industry test set and the natural resources industry test set (7318 and 7309, respectively); similarly, the language model trained on the natural resources industry corpus has very close perplexities on the food industry test set and the natural resources industry test set (5354 and 5222, respectively).

* The domain shift does not significantly affect the performance of the language model.

* This is not always the case. Perplexity can be viewed as the weighted average branching factor of a language, which is the number of possible next words that can follow any word (J & M, Chapter 3, p. 8). Because the language model is trained on the training data, it is biased towards that data. If the test data are similar to the training data, the number of possible next words that can follow any word stays limited, since the training and test data share many characteristics. In this case, both corpora focus on industry (one on food, the other on natural resources), so their contexts may have a lot in common.

# 3. Data size and model complexity

Let us now see how the size of the training data and the complexity of the model we choose affect the quality of our language model.

Total: 40 points

For this part we'd like to see the difference between a 2-gram model and a 3-gram model, when trained on 25%, 50%, 75%, and 100% of the training corpus.

- **Deliverable 3.1**: Complete the function `ece365lib.train.vary_ngram`. (40 points)
- **Test**: `nosetests tests/test_train.py:test_d3_1_vary`

In [188]:
from ece365lib import train
reload(train);

In [189]:
n_gram_orders = [2, 3]

train_corpus = food_corpus_tk[:int(0.8*len(food_corpus_tk))]
test_corpus = food_corpus_tk[int(0.8*len(food_corpus_tk)): int(0.85*len(food_corpus_tk))]

results = train.vary_ngram(train_corpus, test_corpus, n_gram_orders)

print(results)

defaultdict(<class 'float'>, {2: 7387.09139428148, 3: 7428.409564676149})


In [190]:
! nosetests tests/test_train.py:test_d3_1_vary

.
----------------------------------------------------------------------
Ran 1 test in 89.953s

OK


We notice that the 3-gram language model actually performs worse than the 2-gram language model. This is due to the small size of the training corpus: a 3-gram language model is too complex a model for so little training data. If our training data were larger, we would expect to see the opposite.
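
One way to make the data-sparsity argument concrete is to measure what fraction of the test n-grams never occurred in training. The sketch below does this on a toy corpus; the function names and sentences are hypothetical and are not part of `ece365lib`.

```python
def ngrams(sentence, order):
    """All n-grams of the given order in one sentence, as tuples."""
    return [tuple(sentence[i:i + order]) for i in range(len(sentence) - order + 1)]


def unseen_rate(train_sentences, test_sentences, order):
    """Fraction of test n-grams that never appear in the training sentences."""
    seen = {gram for sentence in train_sentences for gram in ngrams(sentence, order)}
    test_grams = [gram for sentence in test_sentences for gram in ngrams(sentence, order)]
    unseen = sum(1 for gram in test_grams if gram not in seen)
    return unseen / len(test_grams)


# Toy padded sentences, for illustration only.
toy_train = [
    ["<s>", "wheat", "prices", "rose", "</s>"],
    ["<s>", "sugar", "prices", "fell", "</s>"],
]
toy_test = [["<s>", "wheat", "prices", "fell", "</s>"]]
rate_2 = unseen_rate(toy_train, toy_test, 2)
rate_3 = unseen_rate(toy_train, toy_test, 3)
print(rate_2, rate_3)
```

In this toy case every test bigram was seen in training, while one of the three test trigrams was not. Higher-order models depend more heavily on smoothed estimates when the training corpus is small.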