# Language Modeling

Week 2 - Corpus Statistics and Language Modeling

Given a set of bigrams, compute their predictive probabilities: for a bigram "word1 word2", the probability that word2 follows word1 in the corpus.
For a bigram "w1 w2", P(w2|w1) = (count of "w1 w2") / (count of w1)

Example:
bigram = "we bear"
count("we bear") = 1
count("we") = 21
P(bear|we) = 1/21
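
The formula above can be sketched without NLTK, using only the standard library. This is a minimal illustration, not the notebook's own code: the function name `bigram_probability` and the toy token list are assumptions made for this example, and returning 0.0 for an unseen first word is one possible convention.

```python
from collections import Counter

def bigram_probability(tokens, w1, w2):
    """Estimate P(w2 | w1) as count("w1 w2") / count(w1)."""
    unigram_counts = Counter(tokens)
    # pair each token with its successor to form the bigrams
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[w1] == 0:
        # w1 never occurs; returning 0.0 is a choice made for this sketch
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

# toy corpus (hypothetical, not the notebook's sample text)
toy_tokens = "we bear the load and we share the load".split()
print(bigram_probability(toy_tokens, "we", "bear"))   # "we" occurs twice, "we bear" once
print(bigram_probability(toy_tokens, "the", "load"))  # "the" is always followed by "load"
```

Here `count(w1)` is taken over all tokens, matching the formula, so a word that only appears as the last token still contributes to the denominator.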

## Libraries

In [1]:
import nltk
from nltk import FreqDist
from nltk import word_tokenize
from nltk import bigrams

# function to convert raw input bigram into tuple
# :param raw: a string representing a bigram (e.g. "we will")
# :return: a tuple of split bigram (e.g. ('we', 'will'))
def processDat(raw):
    return tuple(raw.split())

# function to compute the predictive probability of a bigram
# requires frequency distributions of unigrams ("fdist1") and bigrams ("fdist2")
# :param x: a tuple of split bigram (e.g. ('we', 'will'))
# :return: None; prints the bigram probability as a fraction and a decimal
def getProb(x):
    # string of bigram
    myStr = ' '.join(x)
    # frequency of bigram
    a = fdist2[x]
    # frequency of first word in bigram
    b = fdist1[x[0]]
    # a first word that never occurs would make the ratio undefined (0/0)
    if b == 0:
        print("Probability of '{:s}' is undefined ('{:s}' not in corpus)".format(myStr, x[0]))
        return
    # probability string (fraction)
    frac = str(a) + '/' + str(b)
    # probability string (decimal)
    dec = str(round(a/b, 3))
    # print
    print("Probability of '{:s}' is {:s} ({:s})".format(myStr, frac, dec))

## Process Data

In [2]:
# read local file
with open("data/sample_text1.txt") as f:
    raw = f.read()

In [3]:
# tokenize the text into unigrams; list of words
tokens = word_tokenize(raw)
# convert words to lower case
tokens = [w.lower() for w in tokens]
print(tokens[:10])

['each', 'time', 'we', 'gather', 'to', 'inaugurate', 'a', 'president', ',', 'we']


In [4]:
# get frequency distribution of unigrams
fdist1 = FreqDist(tokens)
# top 10 most common unigrams
print(fdist1.most_common(10))

[('the', 28), ('our', 23), ('we', 21), (',', 21), ('of', 18), ('.', 18), ('to', 17), ('and', 15), ('that', 14), ('a', 11)]


In [5]:
# get bigrams from tokens
bgrams = list(bigrams(tokens))
# get frequency distribution of bigrams
fdist2 = FreqDist(bgrams)
# top 10 most common bigrams
print(fdist2.most_common(10))

[(('.', 'we'), 10), (('of', 'our'), 6), (('is', 'not'), 5), (('we', 'will'), 4), (('.', 'our'), 4), (('our', 'journey'), 4), (('journey', 'is'), 4), (('not', 'complete'), 4), (('complete', 'until'), 4), ((',', 'that'), 3)]


## Analysis - Compute Predictive Probabilities of Bigrams
P(w2|w1) = (count of "w1 w2") / (count of w1)

In [7]:
# Bigram set A
bgramA1 = processDat("we ,")
bgramA2 = processDat("we will")
bgramA3 = processDat("we know")

# Bigram set B
bgramB1 = processDat("our people")
bgramB2 = processDat("our journey")

# Bigram set C
bgramC1 = processDat("believe that")

In [8]:
# Compute bigram predictive probabilities for Bigram set A
getProb(bgramA1)
getProb(bgramA2)
getProb(bgramA3)

Probability of 'we ,' is 3/21 (0.143)
Probability of 'we will' is 4/21 (0.19)
Probability of 'we know' is 1/21 (0.048)


In [9]:
# Compute bigram predictive probabilities for Bigram set B
getProb(bgramB1)
getProb(bgramB2)

Probability of 'our people' is 1/23 (0.043)
Probability of 'our journey' is 4/23 (0.174)


In [10]:
# Compute bigram predictive probabilities for Bigram set C
getProb(bgramC1)

Probability of 'believe that' is 3/3 (1.0)
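
A result of 3/3 means that every occurrence of "believe" in the corpus is followed by "that". One way to see this more generally is to list the full next-word distribution for a given first word. The sketch below uses only the standard library; the function name `next_word_distribution` and the toy token list are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def next_word_distribution(tokens, w1):
    """Map each word w2 that follows w1 to P(w2 | w1)."""
    # count every word that appears immediately after w1
    followers = Counter(b for a, b in zip(tokens, tokens[1:]) if a == w1)
    # denominator is the unigram count of w1, as in the formula
    total = tokens.count(w1)
    if total == 0:
        return {}
    return {w2: n / total for w2, n in followers.items()}

# toy corpus (hypothetical, not the notebook's sample text)
toy_tokens = "we believe that we can and we believe that we will".split()
print(next_word_distribution(toy_tokens, "believe"))
print(next_word_distribution(toy_tokens, "we"))
```

The probabilities for a given first word sum to 1 as long as that word is never the last token of the corpus; if it is, the final occurrence has no follower and the sum falls short of 1.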
