# Exploring n-gram LMs

This Jupyter Notebook lets you explore several n-gram language models (LMs).

In [None]:
import kenlm
import random
import langdetect
from random import shuffle
from util.lm_corpus_util import process_sentence
from util.lm_util import load_lm, load_vocab, load_lm_and_vocab

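def create_test_pair(sentence):
    words = sentence.lower().split()
    sentence_original = ' '.join(words)
    sentence_shuffled = sentence_original
    # a different word order only exists if there are at least two distinct words
    # (otherwise the loop below would never terminate)
    if len(set(words)) > 1:
        while sentence_shuffled == sentence_original:
            shuffle(words)
            sentence_shuffled = ' '.join(words)
    return sentence_original, sentence_shuffled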

def score_sentence(model, sentence):
    score = model.score(sentence)
    print(f'score for \'{sentence}\': ', score)
    for prob, ngram_length, oov in model.full_scores(sentence):
        print({'probability': prob, "n-gram length": ngram_length, "oov?": oov})
    print("perplexity:", model.perplexity(sentence))
    print()
    return score
    
def check_lm(model, sentences, language=None):
    ok = True
    for sentence in sentences:
        # detect the language for each sentence unless it was given explicitly
        sentence_language = language if language else {'en': 'english', 'de': 'german'}[langdetect.detect(sentence)]
        print(f'original sentence ({sentence_language}):', sentence)
        sentence = process_sentence(sentence, language=sentence_language)
        print('normalized sentence:', sentence)
        original, shuffled = create_test_pair(sentence)
        print()
        print('scoring original sentence: ')
        score_original = score_sentence(model, original)
        print('scoring shuffled sentence: ')
        score_shuffled = score_sentence(model, shuffled)
        # the original word order should never be less likely than the shuffled one
        if score_original < score_shuffled:
            ok = False
    if ok:
        print('model seems to be OK')
    else:
        print('model scored at least one shuffled sentence higher than the original')
               
english_sentences = [
    'Language modeling is fun', # normal sentence
    'New York', # only one shuffled variant (York New), which should have a lower probability
    'adasfasf askjh aksf' # some OOV words
]
german_sentences = [
    'Seine Pressebeauftragte ist ratlos.',
    'Fünf Minuten später steht er im Eingang des Kulturcafés an der Zürcher Europaallee.',
    'Den Leuten wird bewusst, dass das System des Neoliberalismus nicht länger tragfähig ist.',
    'Doch daneben gibt es die beeindruckende Zahl von 30\'000 Bienenarten, die man unter dem Begriff «Wildbienen» zusammenfasst.',
    'Bereits 1964 plante die US-Airline Pan American touristische Weltraumflüge für das Jahr 2000.',
]
german_sayings = [
    'Ich bin ein Berliner',
    'Man soll den Tag nicht vor dem Abend loben',
    'Was ich nicht weiss macht mich nicht heiss',
    'Ein Unglück kommt selten allein',
    'New York'
]
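
The `score_sentence` helper above prints both the log10 score and the perplexity of a sentence. As far as I know, the KenLM Python wrapper derives the perplexity from the total log10 score and the number of tokens (the words plus the end-of-sentence token). The following sketch reproduces that relationship on made-up per-token scores; the function name and the numbers are illustrative and are not output of any model in this notebook.

```python
def perplexity_from_log10(log10_probs):
    """Perplexity from per-token log10 probabilities (one entry per word plus </s>)."""
    total = sum(log10_probs)
    return 10.0 ** (-total / len(log10_probs))

# made-up log10 probabilities for a 3-word sentence plus the </s> token
example_scores = [-2.0, -1.0, -3.0, -2.0]
ppl = perplexity_from_log10(example_scores)
print('perplexity:', ppl)
```

A lower perplexity means that the model was less "surprised" by the sentence, which is why the original word order should yield a lower perplexity than the shuffled one.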

## English models

### DeepSpeech (5-gram, 250k words)

The following model was trained for the Mozilla implementation of DeepSpeech and is included in the [download of the pre-trained model](https://github.com/mozilla/DeepSpeech#getting-the-pre-trained-model). The file `vocab.txt` contains the vocabulary of the model (one word per line). It also comprises very exotic words and probably spelling errors and is therefore very big (973,673 words). To train the LM, $n$-grams of order 4 and 5 were pruned with a threshold value of 1, meaning that only 4- and 5-grams with a count of at least 2 are estimated ([see the details about how Mozilla trained the LM](https://github.com/mozilla/DeepSpeech/tree/master/data/lm)). Because spelling errors are probably unique within the training corpus, 4- or 5-grams containing a misspelled word are unique too and are therefore pruned.
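
The effect of count-based pruning can be illustrated with a minimal sketch. It assumes that an n-gram is dropped when its count is less than or equal to the threshold; the helper names and the toy corpus are made up for illustration and have nothing to do with the actual training pipeline.

```python
from collections import Counter

def count_ngrams(tokens, order):
    """Count all n-grams of the given order in a list of tokens."""
    return Counter(tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1))

def prune(counts, threshold):
    """Keep only n-grams whose count is strictly greater than the threshold."""
    return {ngram: count for ngram, count in counts.items() if count > threshold}

# toy corpus: the misspelling 'catt' occurs only once
tokens = 'the cat sat on the mat . the cat sat on the rug . the catt sat on the mat .'.split()
five_grams = count_ngrams(tokens, order=5)
kept = prune(five_grams, threshold=1)

for ngram, count in kept.items():
    print(count, ' '.join(ngram))
```

In this toy corpus every 5-gram containing `catt` occurs exactly once and therefore does not survive pruning with a threshold of 1.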

Such a large vocabulary is counterproductive in a spell checker because it raises the probability that minor misspellings are "corrected" to the wrong word or that a very rare or misspelled word is used. Unfortunately, `vocab.txt` does not contain any information about how often each word appears in the corpus. Therefore, a vocabulary of the 250,000 most frequent words in standard format (one line, words separated by a single space) is created using the following commands:

```bash
n=250000 # use 250k most frequent words

# download file
wget http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz

# decompress file
gunzip librispeech-lm-norm.txt.gz

# count word occurrences and keep n most frequent words
cat librispeech-lm-norm.txt |
    pv -s $(stat --printf="%s" librispeech-lm-norm.txt) | # show a progress bar
    tr '[:upper:]' '[:lower:]' | # lowercase everything
    tr -s '[:space:]' '\n' | # replace spaces with one newline
    sort | # sort alphabetically
    uniq -c | # count occurrences
    sort -bnr | # numeric sort
    tr -d '[:digit:] ' | # remove counts from lines
    head -${n} | # keep n most frequent words
    tr '\n' ' ' > lm.vocab # replace line breaks with spaces and write to lm.vocab
```
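
For readers who prefer Python over a shell pipeline, the same counting idea can be sketched as follows. This is only an illustration on a tiny in-memory corpus; `most_frequent_words` is a hypothetical helper and not part of the `util` modules used in this notebook.

```python
from collections import Counter

def most_frequent_words(lines, n):
    """Return the n most frequent lowercased words over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(n)]

# tiny stand-in for the normalized LibriSpeech text
corpus = ['THE CAT SAT', 'THE DOG SAT', 'THE END']
top_words = most_frequent_words(corpus, n=2)
print(' '.join(top_words))  # same format as lm.vocab: one line, space-separated
```

For the real corpus, an open file handle could be passed instead of the list, so that the text is processed line by line without loading it into memory.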

In [None]:
model = load_lm('/media/daniel/IP9/lm/ds_en/lm.binary')
check_lm(model, english_sentences, 'english')

### Custom model (4-gram, details unknown)

The following model was trained on the TIMIT corpus and downloaded from https://www.dropbox.com/s/2n897gu5p3o2391/libri-timit-lm.klm. Details such as the vocabulary or the data structure are not known.

In [None]:
model = load_lm('/media/daniel/IP9/lm/timit_en/libri-timit-lm.klm')
check_lm(model, english_sentences, 'english')

### LibriSpeech (4-gram)

The following model has been trained on the LibriSpeech corpus. The ARPA file was downloaded from http://www.openslr.org/11. The ARPA model has been lowercased for the sake of consistency. Apart from that, no other preprocessing was done. The model was trained using a vocabulary of 200k words.

A KenLM binary model was built from the lowercased ARPA model using the _Trie_ data structure. The same data structure was also used for the German model (see below).

In [None]:
model = load_lm('/media/daniel/IP9/lm/libri_en/librispeech-4-gram.klm')
check_lm(model, english_sentences, 'english')

## German models

### SRI model (3-gram, CMUSphinx)

The following is a 3-gram LM that has been trained with CMUSphinx. The ARPA file was downloaded from https://cmusphinx.github.io/wiki/download/ and converted to a binary KenLM model.

In [None]:
model = load_lm('/media/daniel/IP9/lm/srilm_de/srilm-voxforge-de-r20171217.klm')
check_lm(model, german_sentences, 'german')

### Custom KenLM (2-gram, probing, all words)

The following 2-gram model was trained on sentences from articles and pages in a Wikipedia dump. The dump was downloaded on 2018-09-21 and contains the state from 2018-09-01. The current dump of the German Wikipedia can be downloaded at http://download.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2.

The model was not pruned. _Probing_ was used as the data structure. The following command was used to create the model:

```bash
lmplz -o 2 -T /home/daniel/tmp -S 40% <wiki_de.txt.bz2 | build_binary /dev/stdin wiki_de_2_gram.klm
```

In [None]:
model = load_lm('/media/daniel/IP9/lm/wiki_de/wiki_de_2_gram.klm')
check_lm(model, german_sentences, 'german')

### Custom KenLM (4-gram, trie, 500k words)

The following 4-gram model was trained on the same dump as the 2-gram model above, but with a vocabulary limited to the 500k most frequent words in the corpus. Additionally, a _Trie_ was used as the data structure instead of the hash table used by _Probing_. The model was built with the following commands:

```bash
lmplz --order 4 \
      --temp_prefix /tmp/ \
      --memory 40% \
      --limit_vocab_file wiki_de_500k.vocab \
      --text wiki_de.txt.bz2 \
      --arpa wiki_de_trie_4_gram_500k.arpa
      
build_binary trie wiki_de_trie_4_gram_500k.arpa wiki_de_trie_4_gram_500k.klm
```

Here, `wiki_de.txt.bz2` is the training corpus and `wiki_de_500k.vocab` is a text file containing the 500k most frequent words from the training corpus.
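
The difference between a flat hash table and a trie can be illustrated conceptually. The sketch below is not how KenLM implements _Probing_ or _Trie_ internally (KenLM uses compact, bit-packed structures); it only shows the general idea that a hash table stores one entry per complete n-gram, whereas a trie lets n-grams with a common prefix share nodes. All names and scores are made up.

```python
# made-up n-grams with made-up log10 probabilities
ngrams = {
    ('ich', 'bin'): -1.2,
    ('ich', 'bin', 'ein'): -0.7,
    ('ich', 'war'): -1.9,
}

# flat hash table: one entry per complete n-gram
flat = dict(ngrams)

def build_trie(entries):
    """Build a prefix tree in which n-grams with a common prefix share nodes."""
    root = {'logprob': None, 'children': {}}
    for ngram, logprob in entries.items():
        node = root
        for word in ngram:
            node = node['children'].setdefault(word, {'logprob': None, 'children': {}})
        node['logprob'] = logprob
    return root

def trie_lookup(root, ngram):
    """Walk the trie word by word; return None if the n-gram is not stored."""
    node = root
    for word in ngram:
        node = node['children'].get(word)
        if node is None:
            return None
    return node['logprob']

trie = build_trie(ngrams)
print(flat[('ich', 'bin', 'ein')], trie_lookup(trie, ('ich', 'bin', 'ein')))
```

Both lookups return the same value. The trie needs one step per word but stores the shared prefix `ich` only once, which hints at why trie-based models tend to be smaller.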

In [None]:
model = load_lm('/media/daniel/IP9/lm/wiki_de/wiki_de_4_gram_500k_trie.klm')
check_lm(model, german_sentences, 'german')

### Custom KenLM (5-gram, trie, pruned)

The following model was trained like the 4-gram model above, but with a higher order (5-gram instead of 4-gram). Additionally, the vocabulary was not limited. The model was quantized with 8 bits and pointers were compressed to save memory.

```bash
lmplz --order 5 \
      --temp_prefix /tmp/ \
      --memory 40% \
      --text wiki_de.txt.bz2 \
      --arpa wiki_de_5_gram_pruned.arpa
      
build_binary -a 255 \
             -q 8 \
             trie wiki_de_5_gram_pruned.arpa \
             wiki_de_5_gram_pruned.klm
```
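
To give an intuition for what quantizing with 8 bits means, the following sketch maps floating point log probabilities to one of 256 codebook entries, so that each value can be stored in a single byte. It uses evenly spaced codebook entries purely for illustration, which is an assumption and not necessarily how KenLM chooses its bins; the values are made up.

```python
def build_codebook(values, bits=8):
    """Evenly spaced codebook with 2**bits entries between the smallest and largest value."""
    levels = 2 ** bits
    low, high = min(values), max(values)
    step = (high - low) / (levels - 1)
    return [low + i * step for i in range(levels)]

def quantize(value, codebook):
    """Index of the codebook entry that is closest to the value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

log_probs = [-0.5, -1.25, -2.0, -3.75, -6.0]  # made-up log10 probabilities
codebook = build_codebook(log_probs, bits=8)
codes = [quantize(p, codebook) for p in log_probs]
restored = [codebook[c] for c in codes]

for p, c, r in zip(log_probs, codes, restored):
    print(f'{p:7.3f} -> code {c:3d} -> {r:7.3f}')
```

The restored values differ slightly from the originals. This loss of precision is the price paid for the smaller memory footprint.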

The file `wiki_de_5_gram_pruned.klm` is the binary KenLM model that was used to implement a simple spell checker in this project. The spell checker uses a truncated vocabulary of the 250k most frequent words and the model is then used to calculate the likelihood (score) for each sentence. Note that although the spell checker uses a truncated vocabulary, the model was trained on the full text corpus without limiting the vocabulary.

In [None]:
model = load_lm('/media/daniel/IP9/lm/wiki_de/wiki_de_5_gram_pruned.klm')
check_lm(model, german_sentences, 'german')

# A simple word predictor

The trained model can be used together with its vocabulary to create a simple word predictor that lets you start a sentence and will propose possible continuations:

In [None]:
from tabulate import tabulate

def predict_next_word(model, vocab, language):
    inp = input('Your turn now! Enter a word or the beginning of a sentence and the LM will predict a continuation. Enter nothing to quit.\n')
    sentence = process_sentence(inp, language)
    while inp:
        score = model.score(sentence, bos=False, eos=False)
        print(f'score for \'{sentence}\': {score}')
        # eos=False: do not assume that the sentence ends right after the candidate word
        top_5 = sorted(((word, model.score(sentence.lower() + ' ' + word, eos=False)) for word in vocab), key=lambda t: t[1], reverse=True)[:5]
        print('top 5 words:')
        print(tabulate(top_5, headers=['word', 'log10-probability']))
        inp = input('Enter continuation:\n')
        # only extend the sentence if a continuation was entered
        if inp:
            sentence += ' ' + process_sentence(inp, language)
    print('Done!')

## English

In [None]:
from util.lm_util import load_lm, load_vocab
from util.lm_corpus_util import process_sentence

model = load_lm('/media/daniel/IP9/lm/ds_en/lm.binary')
vocab = load_vocab('/media/daniel/IP9/lm/ds_en/lm_80k.vocab')

predict_next_word(model, vocab, 'english')

## German

In [None]:
from util.lm_util import load_lm, load_vocab
from util.lm_corpus_util import process_sentence

model = load_lm('/media/daniel/IP9/lm/wiki_de/wiki_de_5_gram.klm')
vocab = load_vocab('/media/daniel/IP9/lm/wiki_de/wiki_de_80k.vocab')

predict_next_word(model, vocab, 'german')

# A simple spell checker

The trained model together with its vocabulary can be used to implement a simple spell checker. For each word of a sentence, the spell checker checks whether it appears in the vocabulary. If it does, the word is not changed. If it does not, the vocabulary is searched for all words with edit distance 1. If there are none, the vocabulary is searched for all words with edit distance 2. If there are still none, the original word is kept. The spell checker then calculates the probabilities for the combinations of candidate words using beam search with a beam width of 1024. The most probable combination is used as the corrected sentence. The following sections illustrate examples for English and German.
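
The procedure described above can be sketched in a few lines. This is a minimal illustration and not the `correction` function from `util.lm_util`: the edit operations (including transpositions), the alphabet, the toy vocabulary and the toy scoring function are all assumptions made for this example.

```python
import string

ALPHABET = string.ascii_lowercase + 'äöüß'  # assumed alphabet for this sketch

def edits1(word):
    """All strings at edit distance 1 (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocab):
    """The word itself if known, else known words at distance 1, else at distance 2, else the word."""
    if word in vocab:
        return [word]
    distance_1 = edits1(word)
    known_1 = distance_1 & vocab
    if known_1:
        return sorted(known_1)
    known_2 = {e2 for e1 in distance_1 for e2 in edits1(e1)} & vocab
    return sorted(known_2) if known_2 else [word]

def beam_search(words, vocab, score, beam_width=1024):
    """Extend the hypotheses word by word and keep only the beam_width best-scoring ones."""
    beam = [[]]
    for word in words:
        extended = [hyp + [cand] for hyp in beam for cand in candidates(word, vocab)]
        beam = sorted(extended, key=lambda hyp: score(' '.join(hyp)), reverse=True)[:beam_width]
    return ' '.join(beam[0])

# toy vocabulary and toy bigram scores standing in for a real LM
toy_vocab = {'the', 'blind', 'bland', 'man'}
toy_bigrams = {('the', 'blind'): -0.5, ('blind', 'man'): -0.3,
               ('the', 'bland'): -2.0, ('bland', 'man'): -3.0}

def toy_score(sentence):
    """Sum of made-up bigram scores; unseen bigrams get a fixed penalty."""
    tokens = sentence.split()
    return sum(toy_bigrams.get(pair, -5.0) for pair in zip(tokens, tokens[1:]))

corrected = beam_search('the blnd man'.split(), toy_vocab, toy_score)
print(corrected)
```

With a loaded KenLM model, `model.score` could be passed as the `score` argument instead of `toy_score`, and the vocabulary loaded with `load_vocab` could be passed as a set.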

## English

In [None]:
from util.lm_util import load_lm, load_vocab, correction

model = load_lm('/media/daniel/IP9/lm/ds_en/lm.binary')
vocab = load_vocab('/media/daniel/IP9/lm/ds_en/lm_80k.vocab')

sentence = 'i seee i sey saind the blnd manp to his deaf dauhgter'
sentence_corr = correction(sentence, language='en', lm=model, lm_vocab=vocab)

print(f'original sentence:  {sentence}')
print(f'corrected sentence: {sentence_corr}')

## German

In [None]:
from util.lm_util import load_lm, load_vocab, correction

model = load_lm('/media/daniel/IP9/lm/wiki_de/wiki_de_5_gram.klm')
vocab = load_vocab('/media/daniel/IP9/lm/wiki_de/wiki_de_80k.vocab')
print('superheld' in vocab)

sentence = 'man isd nur dannn ein supeerheld wenn man sihc selbsd fur supehr häält'
sentence_corr = correction(sentence, language='de', lm=model, lm_vocab=vocab)

print(f'original sentence:  {sentence}')
print(f'corrected sentence: {sentence_corr}')