# Statistical Language Modeling

Computational linguistics is a growing field that is widely applied in analytics, software applications, and any setting where people communicate with machines. It may be regarded as a subfield of artificial intelligence. Its applications include machine translation, speech recognition, intelligent web search, information retrieval, and intelligent spelling checkers. To build such applications, it is important to understand the preprocessing tasks and computations that can be performed on natural language text. In this chapter, we will discuss ways to calculate word frequencies, the Maximum Likelihood Estimation (MLE) model, interpolation on data, and so on. The topics covered are as follows:



•Calculating word frequencies (1-gram, 2-gram, 3-gram)

•Developing MLE for a given text

•Applying smoothing on the MLE model

•Developing a back-off mechanism for MLE

•Applying interpolation on data to get a mix and match

•Evaluating a language model through perplexity

•Applying Metropolis-Hastings in modeling languages

•Applying Gibbs sampling in language processing

# Understanding word frequency

A collocation is a sequence of two or more tokens that tend to occur together. Examples include the United States, the United Kingdom, and the Union of Soviet Socialist Republics.
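To make "tend to occur together" concrete, the following minimal sketch ranks adjacent word pairs by pointwise mutual information (PMI), which compares how often a pair is observed with how often it would be expected if the two words were independent. This is an illustration in plain Python, not the NLTK implementation: the function name `bigram_pmi`, the `min_count` threshold, and the sample sentence are all assumptions made for this example, and PMI is a simpler association measure than the likelihood ratio used in the NLTK cells of this section.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (hypothetical helper)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # pairs seen only once give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_tokens
        p_w2 = unigram_counts[w2] / n_tokens
        # PMI > 0 means the pair occurs more often than independence predicts
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up sample text, used only for illustration
sample = ("the united states and the united kingdom signed the treaty "
          "while the united states delegation left").split()
ranked = bigram_pmi(sample)
for pair, score in ranked:
    print(pair, round(score, 3))
```

On this sample, only the pairs that repeat survive the `min_count` filter, and `('united', 'states')` ranks above `('the', 'united')` because "the" is frequent everywhere and is therefore less informative.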

In [6]:
import nltk
from nltk.util import ngrams
from nltk.corpus import alpino
# List the words of the Alpino corpus (a Dutch treebank)
alpino.words()

['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
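Once a corpus is available as a list of words, 1-gram, 2-gram, and 3-gram frequencies can be obtained by sliding a window of the required length over that list and counting the results. The sketch below does this with the standard library only; `ngram_counts` is a hypothetical helper that stands in for `nltk.util.ngrams` combined with a frequency count, and the short English sentence replaces the corpus so that the example runs without any downloaded data.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # zip over n shifted copies of the list yields each window as a tuple
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)

# Made-up sample sentence, used only for illustration
tokens = "to be or not to be that is the question".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams.most_common(2))
print(bigrams.most_common(1))
print(len(trigrams))
```

Each key is a tuple, so a unigram is stored as `('to',)` rather than `'to'`; this keeps all three tables in the same shape.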

In [18]:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import webtext
from nltk.metrics import BigramAssocMeasures
# Lowercase every token so that counts ignore case
tokens=[t.lower() for t in webtext.words('grail.txt')]
words=BigramCollocationFinder.from_words(tokens)
# Return the 10 bigrams with the highest likelihood-ratio scores
words.nbest(BigramAssocMeasures.likelihood_ratio, 10)

[("'", 's'),
 ('arthur', ':'),
 ('#', '1'),
 ("'", 't'),
 ('villager', '#'),
 ('#', '2'),
 (']', '['),
 ('1', ':'),
 ('oh', ','),
 ('black', 'knight')]

In [26]:
from nltk.corpus import stopwords
from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
# A set gives fast membership tests
stop = set(stopwords.words('english'))
# Reject words shorter than three characters and English stop words
stops_filter = lambda w: len(w) < 3 or w in stop
tokens=[t.lower() for t in webtext.words('grail.txt')]
words=BigramCollocationFinder.from_words(tokens)
# Drop every bigram in which either word is rejected by the filter
words.apply_word_filter(stops_filter)
words.nbest(BigramAssocMeasures.likelihood_ratio, 15)

[('black', 'knight'),
 ('clop', 'clop'),
 ('head', 'knight'),
 ('mumble', 'mumble'),
 ('squeak', 'squeak'),
 ('saw', 'saw'),
 ('holy', 'grail'),
 ('run', 'away'),
 ('french', 'guard'),
 ('cartoon', 'character'),
 ('iesu', 'domine'),
 ('pie', 'iesu'),
 ('round', 'table'),
 ('sir', 'robin'),
 ('clap', 'clap')]
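The effect of the word filter can be reproduced in a few lines of plain Python. In the sketch below, a bigram is discarded when either of its words is shorter than three characters or appears in a stop list. The names `STOP_WORDS`, `is_noise`, and `filtered_bigram_counts` are hypothetical, the stop list is a tiny stand-in for NLTK's English list, and the sentence is made up for illustration.

```python
from collections import Counter

# Tiny illustrative stop list (assumption); the cell above uses NLTK's English list
STOP_WORDS = {"the", "of", "and", "a", "to", "is"}

def is_noise(word):
    """Mirror the filter above: short tokens and stop words are rejected."""
    return len(word) < 3 or word in STOP_WORDS

def filtered_bigram_counts(tokens):
    """Count adjacent pairs, skipping any pair that contains a rejected word."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs if not any(is_noise(w) for w in p))

# Made-up sample sentence, used only for illustration
tokens = "the black knight guards the bridge , and the black knight always wins".split()
counts = filtered_bigram_counts(tokens)
print(counts.most_common())
```

The filter is applied to pairs that were adjacent in the original text. Stop words are not deleted from the token list beforehand, so removing them never creates new pairs such as `('guards', 'bridge')`.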

In [27]:
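text1="Hardwork is the key to success. Never give up!"
# Split the text into word tokens and punctuation tokens
word = nltk.wordpunct_tokenize(text1)
finder = BigramCollocationFinder.from_words(word)
bigram_measures = nltk.collocations.BigramAssocMeasures()
# Score every bigram by its raw frequency
value = finder.score_ngrams(bigram_measures.raw_freq)
# Keep only the bigrams, in sorted order, and discard the scores
sorted(bigram for bigram, score in value)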

[('.', 'Never'),
 ('Hardwork', 'is'),
 ('Never', 'give'),
 ('give', 'up'),
 ('is', 'the'),
 ('key', 'to'),
 ('success', '.'),
 ('the', 'key'),
 ('to', 'success'),
 ('up', '!')]
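The scores are dropped in the output above, so it may help to see what a frequency-based score looks like. The following sketch tokenizes the same sentence with a regular expression that separates words from punctuation and then computes each bigram's relative frequency. The helper names are hypothetical, and the denominator used here is the number of bigrams, which is an assumption of this sketch; NLTK's own normalization may differ, so treat the exact values as illustrative.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

def bigram_relative_freq(tokens):
    """Map each adjacent pair to its count divided by the total number of pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

tokens = simple_tokenize("Hardwork is the key to success. Never give up!")
freqs = bigram_relative_freq(tokens)
for pair in sorted(freqs):
    print(pair, freqs[pair])
```

Because every bigram in this sentence occurs exactly once, all scores are equal, which is why sorting the bigrams themselves is the only ordering that carries information here.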

In [31]:
text="Hello how are you doing ? I hope you find the book interesting Hello how are you"
tokens=nltk.wordpunct_tokenize(text)
fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
# ngram_fd maps each four-gram to the number of times it occurs
for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram,freq)

('Hello', 'how', 'are', 'you') 2
('how', 'are', 'you', 'doing') 1
('are', 'you', 'doing', '?') 1
('you', 'doing', '?', 'I') 1
('doing', '?', 'I', 'hope') 1
('?', 'I', 'hope', 'you') 1
('I', 'hope', 'you', 'find') 1
('hope', 'you', 'find', 'the') 1
('you', 'find', 'the', 'book') 1
('find', 'the', 'book', 'interesting') 1
('the', 'book', 'interesting', 'Hello') 1
('book', 'interesting', 'Hello', 'how') 1
('interesting', 'Hello', 'how', 'are') 1


In [51]:
sent=" Hello , please read the book thoroughly . If you have any queries , then don't hesitate to ask . There is no shortcut to success."
n = 5
# Split on whitespace and slide a window of n tokens over the result
n_grams=ngrams(sent.split(),n)
for grams in n_grams:
    print(grams)

('Hello', ',', 'please', 'read', 'the')
(',', 'please', 'read', 'the', 'book')
('please', 'read', 'the', 'book', 'thoroughly')
('read', 'the', 'book', 'thoroughly', '.')
('the', 'book', 'thoroughly', '.', 'If')
('book', 'thoroughly', '.', 'If', 'you')
('thoroughly', '.', 'If', 'you', 'have')
('.', 'If', 'you', 'have', 'any')
('If', 'you', 'have', 'any', 'queries')
('you', 'have', 'any', 'queries', ',')
('have', 'any', 'queries', ',', 'then')
('any', 'queries', ',', 'then', "don't")
('queries', ',', 'then', "don't", 'hesitate')
(',', 'then', "don't", 'hesitate', 'to')
('then', "don't", 'hesitate', 'to', 'ask')
("don't", 'hesitate', 'to', 'ask', '.')
('hesitate', 'to', 'ask', '.', 'There')
('to', 'ask', '.', 'There', 'is')
('ask', '.', 'There', 'is', 'no')
('.', 'There', 'is', 'no', 'shortcut')
('There', 'is', 'no', 'shortcut', 'to')
('is', 'no', 'shortcut', 'to', 'success.')


# Develop MLE for a given text