An implementation of Kneser-Ney language modeling in Python 3. This is not a particularly optimized implementation, but it is hopefully helpful for learning and works fine for corpora that aren't too large.
The KneserNeyLM class does language model estimation when given a sequence of ngrams.
```python
class KneserNeyLM:

    def __init__(self, highest_order, ngrams, start_pad_symbol='<s>',
            end_pad_symbol='</s>'):
        """
        Constructor for KneserNeyLM.

        Params:
            highest_order [int] The order of the language model.
            ngrams [list->tuple->string] Ngrams of the highest_order specified.
                Ngrams at the beginning / end of sentences should be padded.
            start_pad_symbol [string] The symbol used to pad the beginning of
                sentences.
            end_pad_symbol [string] The symbol used to pad the end of
                sentences.
        """
```
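To give a sense of what the estimator computes, here is a minimal sketch of interpolated Kneser-Ney for the bigram case. This is not this library's internals: the function name, the fixed discount of 0.75, and the restriction to seen bigrams are choices made for illustration only.

```python
from collections import Counter

def kn_bigram_probs(bigrams, discount=0.75):
    """Interpolated Kneser-Ney P(w2 | w1) for bigrams seen in the data.

    Illustrative sketch only; real implementations also handle unseen
    bigrams and higher orders recursively.
    """
    bigram_counts = Counter(bigrams)
    context_counts = Counter(w1 for w1, w2 in bigrams)
    bigram_types = set(bigrams)
    # Continuation count: in how many distinct contexts each word appears.
    continuation = Counter(w2 for w1, w2 in bigram_types)
    # Number of distinct words following each context (sets the backoff mass).
    followers = Counter(w1 for w1, w2 in bigram_types)
    total_types = len(bigram_types)

    probs = {}
    for (w1, w2), count in bigram_counts.items():
        # Discounted bigram estimate plus backoff to continuation probability.
        p_continuation = continuation[w2] / total_types
        backoff_weight = discount * followers[w1] / context_counts[w1]
        probs[(w1, w2)] = (
            max(count - discount, 0) / context_counts[w1]
            + backoff_weight * p_continuation)
    return probs
```

The key idea of Kneser-Ney is visible in `p_continuation`: the lower-order estimate counts how many distinct contexts a word follows, rather than its raw frequency.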
It is easy to create a KneserNeyLM out of an NLTK corpus (see example.py).
```python
from nltk.corpus import gutenberg
from nltk.util import ngrams

from kneser_ney import KneserNeyLM

gut_ngrams = (
    ngram for sent in gutenberg.sents()
    for ngram in ngrams(sent, 3, pad_left=True, pad_right=True,
        left_pad_symbol='<s>', right_pad_symbol='<s>'))
lm = KneserNeyLM(3, gut_ngrams, end_pad_symbol='<s>')
```
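For clarity on the expected input format, here is a pure-Python illustration (no NLTK needed) of the padded trigrams such a generator yields; the example sentence is made up:

```python
# For a trigram model, padding adds order - 1 symbols on each side,
# so every real token appears in a full complement of trigrams.
sent = ['The', 'dog', 'barked', '.']
padded = ['<s>', '<s>'] + sent + ['<s>', '<s>']
trigrams = [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]
# First trigram: ('<s>', '<s>', 'The')
```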
The language model can then be used to score sentences or generate sentences.
```python
lm.score_sent(('This', 'is', 'a', 'sample', 'sentence', '.'))
lm.generate_sentence()
```