# Exploring n-gram LMs

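This Jupyter Notebook lets you explore several n-gram language models (LMs). Each model scores a set of sentences along with a shuffled version of each sentence; a plausible model should not score the shuffled version higher than the original.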

In [None]:
import kenlm
from random import shuffle
from lm.lm_util import process_sentence

def load_lm(lm_path):
    model = kenlm.LanguageModel(lm_path)
    print(f'loaded {model.order}-gram model from {lm_path}')
    return model

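def create_test_pair(sentence):
    words = sentence.lower().split()
    sentence_original = ' '.join(words)
    # a different ordering only exists if there are at least two distinct words;
    # without this guard the loop below would never terminate
    if len(set(words)) < 2:
        return sentence_original, sentence_original
    sentence_shuffled = sentence_original
    while sentence_shuffled == sentence_original:
        shuffle(words)
        sentence_shuffled = ' '.join(words)
    return sentence_original, sentence_shuffled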

def score_sentence(model, sentence):
    # total log10 probability of the sentence, including <s> and </s>
    score = model.score(sentence)
    print(f"score for '{sentence}':", score)
    # per-token details: log10 probability, length of the matched n-gram, OOV flag
    for prob, ngram_length, oov in model.full_scores(sentence):
        print({'log10 probability': prob, 'n-gram length': ngram_length, 'oov?': oov})
    print('perplexity:', model.perplexity(sentence))
    print()
    return score
    
def check_lm(model, sentences, language):
    # the model passes if no shuffled sentence scores higher than its original
    ok = True
    for sentence in sentences:
        print('original sentence:', sentence)
        sentence = process_sentence(sentence, language=language)
        print('normalized sentence:', sentence)
        original, shuffled = create_test_pair(sentence)
        print()
        print('scoring original sentence: ')
        score_original = score_sentence(model, original)
        print('scoring shuffled sentence: ')
        score_shuffled = score_sentence(model, shuffled)
        if score_original < score_shuffled:
            ok = False
    if ok:
        print('model seems to be OK')
    else:
        print('model scored at least one shuffled sentence higher than its original')
        
english_sentences = [
    'Language modeling is fun',
    'New York'
]
german_sentences = [
    'Seine Pressebeauftragte ist ratlos.',
    'Fünf Minuten später steht er im Eingang des Kulturcafés an der Zürcher Europaallee.',
    'Den Leuten wird bewusst, dass das System des Neoliberalismus nicht länger tragfähig ist.',
    'Doch daneben gibt es die beeindruckende Zahl von 30\'000 Bienenarten, die man unter dem Begriff «Wildbienen» zusammenfasst.',
    'Bereits 1964 plante die US-Airline Pan American touristische Weltraumflüge für das Jahr 2000.',
]
german_sayings = [
    'Ich bin ein Berliner',
    'Man soll den Tag nicht vor dem Abend loben',
    'Was ich nicht weiss macht mich nicht heiss',
    'Ein Unglück kommt selten allein',
    'New York'
]
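
The score printed by `score_sentence` is a base-10 log probability, and the perplexity is derived from it. As a minimal sketch of that relationship, the hypothetical helper `perplexity_from_score` below reproduces the calculation without needing a model file. It assumes whitespace tokenization and counts the end-of-sentence token as one additional word, which is how the KenLM Python wrapper computes perplexity.

```python
def perplexity_from_score(log10_score, sentence):
    """Convert a total log10 sentence probability into a per-word perplexity.

    Hypothetical helper for illustration only. Assumes whitespace
    tokenization and counts the end-of-sentence token </s> as one word.
    """
    n_words = len(sentence.split()) + 1  # +1 for </s>
    return 10.0 ** (-log10_score / n_words)

# a two-word sentence with a total log10 probability of -6.0
print(perplexity_from_score(-6.0, 'new york'))  # prints 100.0
```

A lower perplexity means the model found the sentence less surprising, so the original sentence should usually have a lower perplexity than its shuffled counterpart.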

## English models

### Custom model (4-gram, details unknown)

The following model was trained on the TIMIT corpus and downloaded from https://www.dropbox.com/s/2n897gu5p3o2391/libri-timit-lm.kl. Details such as the vocabulary or the data structure are not known.

In [None]:
model = load_lm('../lm/timit_en/libri-timit-lm.klm')
check_lm(model, english_sentences, 'english')

### LibriSpeech (4-gram)

The following model was trained on the LibriSpeech corpus with a vocabulary of 200k words. The ARPA file was downloaded from http://www.openslr.org/11 and lowercased for the sake of consistency. No other preprocessing was done.

A KenLM binary model was built from the lowercased ARPA model using the _Trie_ data structure. The same data structure was used for the custom German 4-gram model (see below).
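
The lowercasing step could be sketched as follows. This is an assumption about how such a conversion might look, not the procedure actually used for this model. The hypothetical `lowercase_arpa` function lowercases whole lines, which leaves the numeric columns unchanged, and it does not merge n-grams that become identical after lowercasing. The file names in the trailing comment are placeholders.

```python
def lowercase_arpa(lines):
    """Yield the lines of an ARPA file with all tokens lowercased.

    Sketch only: whole lines are lowercased, so log probabilities and
    backoff weights are unaffected. Entries that collide after
    lowercasing are NOT merged.
    """
    for line in lines:
        yield line.lower()

# a few tab-separated ARPA entries: log10 prob, n-gram, optional backoff
sample = [
    '-0.8\tNEW\t-0.3\n',
    '-1.1\tYORK\t-0.2\n',
    '-0.4\tNEW YORK\n',
]
for line in lowercase_arpa(sample):
    print(line, end='')

# Applied to files (placeholder paths, not from the notebook):
# with open('model.arpa') as src, open('model.lower.arpa', 'w') as dst:
#     dst.writelines(lowercase_arpa(src))
```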

In [None]:
model = load_lm('../lm/libri_en/librispeech-4-gram.klm')
check_lm(model, english_sentences, 'english')

## German models

### SRI model (3-gram, CMUSphinx)

The following is a 3-gram LM that has been trained with CMUSphinx. The ARPA file was downloaded from https://cmusphinx.github.io/wiki/download/ and converted to a binary KenLM model.

In [None]:
model = load_lm('../lm/srilm_de/srilm-voxforge-de-r20171217.klm')
check_lm(model, german_sentences, 'german')

### Custom KenLM (2-gram, probing, all words)

The following 2-gram model was trained on sentences from articles and pages in a Wikipedia dump. The dump was downloaded on 2018-09-21 and contains the state from 2018-09-01. The current dump of the German Wikipedia can be downloaded at http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2.

The model was not pruned. _Probing_ was used as the data structure. The following command was used to create the model:

```bash
lmplz -o 2 -T /home/daniel/tmp -S 40% <wiki_de.txt.bz2 | build_binary /dev/stdin wiki_de_2_gram.klm
```

In [None]:
model = load_lm('../lm/wiki_de/wiki_de_2_gram.klm')
check_lm(model, german_sentences, 'german')
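
To illustrate why a 2-gram model should prefer the original word order, here is a toy bigram model in plain Python. It is only a sketch: it uses add-one smoothing on a two-sentence corpus, whereas `lmplz` estimates models with modified Kneser-Ney smoothing. The function names and the training sentences are made up for this example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    histories, bigrams, vocab = Counter(), Counter(), {'</s>'}
    for sentence in sentences:
        words = sentence.lower().split()
        tokens = ['<s>'] + words + ['</s>']
        histories.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        vocab.update(words)
    return histories, bigrams, vocab

def score_bigram(model, sentence):
    """Return the log10 probability of a sentence with add-one smoothing."""
    histories, bigrams, vocab = model
    tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    size = len(vocab) + 1  # +1 reserves probability mass for unknown words
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, word)] + 1) / (histories[prev] + size)
        total += math.log10(prob)
    return total

toy_model = train_bigram(['new york is big', 'i love new york'])
original = score_bigram(toy_model, 'new york')
shuffled = score_bigram(toy_model, 'york new')
print(original > shuffled)  # prints True
```

The pair `new york` was seen in training while `york new` was not, so the original order receives the higher score. This is the same property that `check_lm` tests on the real models.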

### Custom KenLM (4-gram, trie, 500k words)

The following 4-gram model was trained on the same dump as the 2-gram model above, but with a vocabulary limited to the 500k most frequent words in the corpus. Additionally, a _Trie_ was used as the data structure instead of the hash table used by _Probing_. The model was built with the following command:

```bash
lmplz -o 4 -T /home/daniel/tmp -S 40% <wiki_de.txt.bz2 | build_binary /dev/stdin wiki_de_4_gram_500k_trie.klm
```
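
A list of the most frequent words, as described above, could be derived with a simple frequency count. The sketch below is an assumption about how such a list might be produced, not the procedure used for this model, and `most_frequent_words` is a hypothetical helper. KenLM's `lmplz` has a `--limit_vocab_file` option that accepts a whitespace-separated word list, though it is not known whether that option was used here.

```python
from collections import Counter

def most_frequent_words(sentences, max_size):
    """Return the max_size most frequent whitespace-separated words.

    Hypothetical helper for illustration; for a large corpus, pass a
    generator that reads the file line by line.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return [word for word, _ in counts.most_common(max_size)]

# tiny stand-in corpus; the real model used 500k words from the Wikipedia dump
corpus = ['der hund bellt', 'der hund schläft', 'der vogel singt']
print(most_frequent_words(corpus, 2))  # prints ['der', 'hund']
```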

In [None]:
model = load_lm('../lm/wiki_de/wiki_de_4_gram_500k_trie.klm')
check_lm(model, german_sentences, 'german')