# 08. Language Modeling with KenLM

We are going to learn how to use KenLM, a toolkit for language modeling.

First of all, install KenLM as follows:
- Download and unzip: http://kheafield.com/code/kenlm.tar.gz
- You need:
    - cmake: download it from https://cmake.org/download/ and unzip.
      - Then build and install it:
           ```bash
           cd cmake
           ./bootstrap
           make
           make install
           ```
    - Boost >= 1.42.0 and bjam
        - Ubuntu: `sudo apt-get install libboost-all-dev`
        - Mac OS: `brew install boost; brew install bjam`
- cd into the kenlm folder and compile using the following commands:
    ```bash
    mkdir -p build
    cd build
    cmake ..
    make -j 4
    ```
- Install the Python KenLM module:
    ```bash
    pip install https://github.com/kpu/kenlm/archive/master.zip
    ```
- Check out the KenLM website for more info: http://kheafield.com/code/kenlm/

In [1]:
import kenlm
import os
import random

## Training a language model with KenLM
Let's train a bigram language model and a 4-gram language model.  
First, download the preprocessed Penn Treebank (Wall Street Journal) dataset from here: https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data.  
KenLM's `lmplz` does not accept the `<unk>` token in training text, so let's remove it first.

In [2]:
# This removes all occurrences of the <unk> token.
# sed is a very handy command for quick text processing.
# https://www.tutorialspoint.com/sed/sed_overview.htm
!sed -e 's/<unk>//g' data/ptb.train.txt > data/ptb.train.nounk.txt
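If `sed` is not available (for example on Windows), the same preprocessing can be sketched in plain Python. The helper names `strip_unk` and `strip_unk_file` are hypothetical and not part of KenLM; the file paths are the ones used in the `sed` command above.

```python
def strip_unk(line):
    """Remove every literal <unk> token from a line, like sed 's/<unk>//g'."""
    return line.replace("<unk>", "")


def strip_unk_file(src_path, dst_path):
    """Copy src_path to dst_path with all <unk> tokens removed."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_unk(line))


# Usage, assuming the same paths as the sed command above:
# strip_unk_file("data/ptb.train.txt", "data/ptb.train.nounk.txt")
```

Like the `sed` command, this leaves the surrounding whitespace untouched, so a removed token leaves two adjacent spaces behind.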

In [3]:
# bigram
# !./kenlm/build/bin/lmplz -o 2 < data/ptb.train.nounk.txt > data/ptb_lm_2gram.arpa
!<path where you unzipped kenlm>/kenlm/build/bin/lmplz -o 2 < data/ptb.train.nounk.txt > data/ptb_lm_2gram.arpa

=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 842501 types 10001
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:120012 2:6871827456
Statistics:
1 10001 D1=0.269406 D2=0.72032 D3+=1.15669
2 271372 D1=0.717012 D2=1.07895 D3+=1.44702
Memory estimate for binary LM:
type      kB
probing 5024 assuming -p 1.5
probing 5063 assuming -r models -p 1.5
trie    1725 without quantization
trie     964 assuming -q 8 -b 8 quantization 
trie    1725 assuming -a 22 array pointer compression
trie     964 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:120012 2:4341952
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#####

In [4]:
# 4-gram
# !./kenlm/build/bin/lmplz -o 4 < data/ptb.train.nounk.txt > data/ptb_lm_4gram.arpa
!<path where you unzipped kenlm>/kenlm/build/bin/lmplz -o 4 < data/ptb.train.nounk.txt > data/ptb_lm_4gram.arpa

=== 1/5 Counting and sorting n-grams ===
Reading stdin
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 842501 types 10001
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:120012 2:1169672704 3:2193136384 4:3509018368
Statistics:
1 10001 D1=0.269406 D2=0.72032 D3+=1.15669
2 271372 D1=0.736147 D2=1.10173 D3+=1.46771
3 578317 D1=0.878891 D2=1.26107 D3+=1.45765
4 685219 D1=0.930799 D2=1.34496 D3+=1.30068
Memory estimate for binary LM:
type       kB
probing 32213 assuming -p 1.5
probing 37231 assuming -r models -p 1.5
trie    14059 without quantization
trie     7265 assuming -q 8 -b 8 quantization 
trie    12759 assuming -a 22 array pointer compression
trie     5965 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:12

## Scoring using KenLM
Let's score a sentence using the language model we just trained.  
**Note that the score KenLM returns is a log10 probability (log-likelihood), not perplexity!**  
Perplexity is defined as follows: $$ PPL = b^{- \frac{1}{N} \sum_{i=1}^N \log_b q(x_i)} $$  

All probabilities here are in log base 10, so to convert a score to perplexity we do the following:  
$$PPL = 10^{-\log_{10}(P) / N} $$
where $-\log_{10}(P)$ is the total negative log-likelihood (NLL) of the whole sentence, and $N$ is the word count.
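As a quick sanity check of the conversion, here is a minimal worked example that uses no model at all. The function name `ppl_from_log10` and the numbers are made up for illustration: a 4-token sentence with a total log10 probability of -8 has an average of -2 per token.

```python
def ppl_from_log10(total_log10_prob, n_tokens):
    """Convert a total log10 probability over n_tokens into perplexity."""
    return 10 ** (-total_log10_prob / n_tokens)


# Made-up numbers: total log10 probability -8.0 over 4 tokens,
# so the exponent is -(-8.0) / 4 = 2.0.
print(ppl_from_log10(-8.0, 4))  # 100.0
```

A perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words at each position.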


In [5]:
# load the LMs we just trained
bigram_model = kenlm.LanguageModel('data/ptb_lm_2gram.arpa')
fourgram_model = kenlm.LanguageModel('data/ptb_lm_4gram.arpa')


In [6]:
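# function for calculating perplexity from KenLM's log10 sentence score
# Note: model.score() also scores the end-of-sentence token </s> by default,
# while N here counts only the words in the sentence.
def get_ppl(model, sent):
    return 10**(-model.score(sent)/len(sent.split()))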


In [7]:
sentence = "dividend yields have been bolstered by stock declines "

**PPL of a sentence from PTB test set:**

In [8]:
print(get_ppl(bigram_model, sentence))
print(get_ppl(fourgram_model, sentence))

749.9773725405043
733.1557213309632


**PPL of an out-of-domain sentence:**

In [11]:
ood_sentence = 'artificial neural networks are computing systems vaguely inspired by the biological neural networks'
print(get_ppl(bigram_model, ood_sentence))
print(get_ppl(fourgram_model, ood_sentence))

13349.78268920608
13699.961190363858


**Why is the perplexity so high?**

In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample. (from [Wikipedia](https://en.wikipedia.org/wiki/Perplexity))

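The high perplexity therefore indicates that the model predicts out-of-domain sentences poorly, which is expected when a sentence contains words and word sequences that are rare or absent in the training data.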
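To see where the probability mass is lost, it can help to look at per-token scores. The sketch below assumes rows of the form `(word, log10_prob, ngram_length, is_oov)`; the helper names are hypothetical, and the numbers in `example_rows` are invented for illustration, not real model output.

```python
def find_oov(rows):
    """Return the words that were flagged as out-of-vocabulary."""
    return [word for word, _log10_prob, _ngram_len, is_oov in rows if is_oov]


def mean_ngram_length(rows):
    """Average length of the n-gram matched for each token."""
    lengths = [ngram_len for _word, _log10_prob, ngram_len, _is_oov in rows]
    return sum(lengths) / len(lengths)


# Invented rows for illustration only (NOT real model output):
# (word, log10 probability, matched n-gram length, is_oov)
example_rows = [
    ("artificial", -4.1, 1, False),
    ("neural", -6.3, 1, True),
    ("networks", -3.9, 1, False),
    ("</s>", -1.2, 2, False),
]

print(find_oov(example_rows))           # ['neural']
print(mean_ngram_length(example_rows))  # 1.25

# With the kenlm Python module, real rows could be built like this:
# words = ood_sentence.split() + ['</s>']
# rows = [(w, lp, n, oov) for w, (lp, n, oov)
#         in zip(words, fourgram_model.full_scores(ood_sentence))]
```

A short average matched n-gram length means the model keeps backing off to lower-order estimates, and each OOV word is scored with the small probability reserved for unknown words.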



Let's shuffle the in-domain PTB sentence to get novel n-grams from the same words and see how the models perform.

In [12]:
random.seed(555)
tmp = sentence.split()
random.shuffle(tmp)
tmp_sent_2 = ' '.join(tmp)
print(tmp_sent_2)
print(get_ppl(bigram_model, tmp_sent_2))
print(get_ppl(fourgram_model, tmp_sent_2))

stock bolstered declines dividend by yields have been
3207.5970808942507
3302.2615231292616


Notice that the perplexity is higher than for the original sentence, but not as high as for the out-of-domain sentence.

**Why?**

Shuffling only introduces new n-grams with n > 1, while the unigrams remain the same. A shuffled in-domain sentence, which contains few or no out-of-vocabulary (OOV) words, therefore has a lower perplexity than the out-of-domain sentence.
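This can be checked directly on the two word orders printed above, without any language model. The sketch below is a minimal illustration; the helper `ngrams` is a hypothetical name, not a KenLM function.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


original = "dividend yields have been bolstered by stock declines".split()
shuffled = "stock bolstered declines dividend by yields have been".split()

# Unigram counts are identical: shuffling only reorders the words.
same_unigrams = Counter(original) == Counter(shuffled)

# Bigrams that survive the shuffle.
shared_bigrams = set(ngrams(original, 2)) & set(ngrams(shuffled, 2))

print(same_unigrams)           # True
print(sorted(shared_bigrams))  # [('have', 'been'), ('yields', 'have')]
```

Only two of the seven original bigrams survive this particular shuffle, so the models have to back off to lower-order estimates for most positions, while every word is still in the vocabulary.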