# Language Modeling: KenLM

We are going to learn how to use KenLM, a toolkit for n-gram language modeling.
First, install KenLM:
- Download and unpack the source: http://kheafield.com/code/kenlm.tar.gz
- You need:
    - CMake: download it from https://cmake.org/download/ and unpack it.
      - Then build and install it from source:
        ```bash
        cd cmake
        ./bootstrap
        make
        make install
        ```
    - Boost >= 1.42.0 and bjam
        - Ubuntu: `sudo apt-get install libboost-all-dev`
        - Mac: `brew install boost; brew install bjam`
- `cd` into the kenlm folder and compile with the following commands:
```bash
mkdir -p build
cd build
cmake ..
make -j 4
```
- Install the Python module for KenLM: `pip install https://github.com/kpu/kenlm/archive/master.zip`
- See the KenLM website for more information: http://kheafield.com/code/kenlm/

In [1]:
import kenlm
import os
import random

## Training a language model with KenLM
Let's train a bigram language model and a 4-gram language model.  
First, download the preprocessed Penn Treebank (Wall Street Journal) dataset from here: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data.
KenLM's `lmplz` treats `<unk>` as a reserved special token and does not accept it in the training text, so let's remove it.

In [None]:
# Remove all occurrences of the <unk> token.
# sed is a very handy command for quick text processing,
# and I strongly recommend learning how to use it:
# https://www.tutorialspoint.com/sed/sed_overview.htm
!sed -e 's/<unk>//g' data/ptb.train.txt > data/ptb.train.nounk.txt
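
If `sed` is not available, the same cleanup can be sketched in Python. The helper names `strip_unk` and `remove_unk_from_file` are made up for this illustration. Unlike the `sed` command, this version removes only whole `<unk>` tokens and also collapses the extra whitespace left behind; for the PTB files, where `<unk>` always appears as a separate token, the resulting token sequence should be the same.

```python
def strip_unk(line):
    """Drop every <unk> token from one line and normalize the spacing."""
    # split() with no arguments splits on any run of whitespace
    return " ".join(tok for tok in line.split() if tok != "<unk>")


def remove_unk_from_file(src_path, dst_path):
    """Write a copy of src_path to dst_path with all <unk> tokens removed."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_unk(line) + "\n")


print(strip_unk(" the <unk> fell sharply after the <unk> report "))

# Usage with the notebook's paths (uncomment once the data is downloaded):
# remove_unk_from_file("data/ptb.train.txt", "data/ptb.train.nounk.txt")
```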

In [None]:
# bigram
!./kenlm/build/bin/lmplz -o 2 < data/ptb.train.nounk.txt > ptb_lm_2gram.arpa

In [None]:
# 4-gram
!./kenlm/build/bin/lmplz -o 4 < data/ptb.train.nounk.txt > ptb_lm_4gram.arpa


## Scoring using KenLM
Let's score a sentence using the language model we just trained.  
**Note that the score KenLM returns is a log probability (base 10), not a perplexity!**  
Perplexity is defined as follows: $$ PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)} $$  

All probabilities here are in log base 10, so to convert to perplexity we do the following:  
$$PPL = 10^{-\log_{10}(P) / N} $$
where $-\log_{10}(P)$ is the total negative log-likelihood (NLL) of the whole sentence and $N$ is the word count.
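
As a small worked example of the conversion above, the sketch below turns a total log10 probability into a perplexity and checks that the result does not depend on the base $b$. The function name `perplexity_from_log10` and the three word probabilities are invented for illustration; no KenLM model is needed. One caveat: by default `model.score` also scores the end-of-sentence token `</s>`, and the `perplexity` method of the Python module divides by the word count plus one for that reason, whereas this notebook divides by the plain word count.

```python
import math


def perplexity_from_log10(total_log10_prob, n_tokens):
    """Convert a total log10 probability into a per-token perplexity."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 10.0 ** (-total_log10_prob / n_tokens)


# Toy sentence with made-up per-word probabilities q(x_i)
word_probs = [0.1, 0.01, 0.001]

# Base 10: sum the log10 probabilities, then apply the formula
total_log10 = sum(math.log10(p) for p in word_probs)
ppl_10 = perplexity_from_log10(total_log10, len(word_probs))

# Base e: the same definition with natural logs gives the same value
total_ln = sum(math.log(p) for p in word_probs)
ppl_e = math.exp(-total_ln / len(word_probs))

print(ppl_10, ppl_e)
```

The geometric mean of the three probabilities is 0.01, so both values come out at about 100.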


In [2]:
# load the language models trained above
bigram_model = kenlm.LanguageModel('ptb_lm_2gram.arpa')
# NOTE: despite its name, trigram_model holds the 4-gram model
trigram_model = kenlm.LanguageModel('ptb_lm_4gram.arpa')


In [3]:
# function for calculating the per-word perplexity of a sentence
def get_ppl(model, sent):
    # model.score returns the total log10 probability of the sentence
    return 10**(-model.score(sent)/len(sent.split()))


In [4]:
sentence = "dividend yields have been bolstered by stock declines "

Perplexity (PPL) of a sentence from the PTB test set.

In [5]:
print(get_ppl(bigram_model, sentence))
print(get_ppl(trigram_model, sentence))

749.9773725405043
733.1557213309632


PPL of an out-of-domain sentence.

In [6]:
ood_sentence = 'artificial neural networks are computing systems vaguely inspired by the biological neural networks'
print(get_ppl(bigram_model, ood_sentence))
print(get_ppl(trigram_model, ood_sentence))


13349.78268920608
13699.961190363858


Let's shuffle the test-set sentence to create novel n-grams and see how the models perform.

In [7]:
random.seed(555)
tmp = sentence.split()
random.shuffle(tmp)
tmp_sent_2 = ' '.join(tmp)
print(tmp_sent_2)
print(get_ppl(bigram_model, tmp_sent_2))
print(get_ppl(trigram_model, tmp_sent_2))

stock bolstered declines dividend by yields have been
3207.5970808942507
3302.2615231292616


Notice that the perplexity gets higher, but not as high as for the out-of-domain sentence.
Why?
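
One way to start investigating is to look at which n-gram order the model actually matched for each token. The sketch below assumes the `full_scores` method of the Python module, which yields one `(log10 probability, matched n-gram length, is_oov)` triple per token, including the end-of-sentence token `</s>` by default. `describe_tokens` is a hypothetical helper, and the triples in the demo are invented so that the block runs without a trained model.

```python
def describe_tokens(sentence, full_scores, eos=True):
    """Pair each token with its (log10 prob, n-gram length, is_oov) triple."""
    tokens = sentence.split()
    if eos:
        # full_scores also reports the end-of-sentence token by default
        tokens.append("</s>")
    return [
        (tok, logp, ngram_len, oov)
        for tok, (logp, ngram_len, oov) in zip(tokens, full_scores)
    ]


# Made-up triples standing in for list(model.full_scores("stock declines"))
fake_scores = [(-2.5, 2, False), (-4.1, 1, False), (-1.0, 2, False)]
for row in describe_tokens("stock declines", fake_scores):
    print(row)

# With a real model (requires kenlm and the ARPA file trained above):
# rows = describe_tokens(tmp_sent_2, bigram_model.full_scores(tmp_sent_2))
```

A matched n-gram length of 1 means the model backed off all the way to a unigram for that token. Comparing these lengths and the OOV flags across the original, shuffled, and out-of-domain sentences may help explain the gap.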

How can we know if a word is OOV?

In [8]:
random_word='wioruqoeruq4r'
if random_word not in bigram_model:
    print('OOV word!')
else:
    print('not OOV word!')

OOV word!
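
The same membership test can be applied to a whole list of sentences. The sketch below is a minimal illustration with a hypothetical helper `count_oov`. It accepts anything that supports `in`, so a plain Python set stands in for the model here; passing `bigram_model` instead should work the same way, given the check shown above.

```python
from collections import Counter


def count_oov(sentences, vocab):
    """Count out-of-vocabulary tokens; vocab only needs to support `in`."""
    oov = Counter()
    total = 0
    for sent in sentences:
        for tok in sent.split():
            total += 1
            if tok not in vocab:
                oov[tok] += 1
    return oov, total


toy_vocab = {"stock", "prices", "fell"}  # stand-in for a kenlm model
toy_sents = ["stock prices fell", "crypto prices fell", "crypto crypto"]
oov, total = count_oov(toy_sents, toy_vocab)
print(sum(oov.values()), "OOV tokens out of", total)
print(oov.most_common(1))
```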


# Explore and experiment

Pick one or two corpora to test the language models on. Feel free to use one of the following or any other appropriate corpus that appeals to you:

Billion word dataset: http://www.statmt.org/lm-benchmark/  
Quaker historical corpus: https://www.woodbrooke.org.uk/resource-library/quaker-historical-corpus/  
All of Shakespeare: http://norvig.com/ngrams/  
IMDB: http://ai.stanford.edu/~amaas/data/sentiment/  
SNLI test set and MultiNLI dev set (hypothesis sentences only) in the data folder.

**Exercise 1**: Load the data and get the perplexity.

**Exercise 2**: How many OOV words are there?

**Exercise 3**: Find the sentence with the highest and lowest perplexity.

**Exercise 4**: Do you think vocabulary size affects the perplexity of a language model? Explore this by removing some words from the training corpus.

**Exercise 5**: If you want to train a model on a larger dataset, follow the directions on the KenLM website, and see how this new model fares. 