# Add-one smoothing

In the 18th century, Laplace introduced add-one smoothing. In add-one smoothing, 1
is added to the count of every word, so that words never seen in the training data
(unknown words) still receive a non-zero probability. A value other than 1 can
also be added. The value that is added to the counts, whether 1 or some other
positive number, is called the pseudo-count.
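The idea of a pseudo-count can be sketched in plain Python before turning to NLTK. This is a minimal illustration, not the NLTK implementation: the function name `add_k_probability`, the toy sentence, and the choice to reserve one extra vocabulary slot for unknown words are all assumptions made for the example.

```python
from collections import Counter

def add_k_probability(word, counts, vocab_size, k=1.0):
    """Smoothed unigram probability with pseudo-count k (k=1 gives add-one)."""
    total = sum(counts.values())
    # Every vocabulary entry receives k extra counts, so the denominator
    # grows by k * vocab_size to keep the distribution normalised.
    return (counts[word] + k) / (total + k * vocab_size)

# Toy data (hypothetical, not from the text)
counts = Counter("the cat sat on the mat".split())
vocab_size = len(counts) + 1  # assumption: one extra slot for unknown words

p_the = add_k_probability("the", counts, vocab_size)              # seen twice
p_dog = add_k_probability("dog", counts, vocab_size)              # never seen
p_dog_half = add_k_probability("dog", counts, vocab_size, k=0.5)  # smaller pseudo-count

print(p_the, p_dog, p_dog_half)
```

With a smaller pseudo-count, less probability mass is moved from seen words to unseen ones, which is why `p_dog_half` comes out lower than `p_dog`.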

In [5]:
import nltk
corpus=u"<s> hello how are you doing ? Hope you find the book interesting. </s>".split()

In [6]:
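# Note: "<s>" and "</s>" are not separated by spaces here, so split() yields
# the tokens '<s>how' and 'doing</s>', neither of which occurs in the corpus
sentence=u"<s>how are you doing</s>".split()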

In [3]:
vocabulary = set(corpus)
len(vocabulary)

13

In [7]:
cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus))

In [13]:
cfd.keys()

dict_keys(['<s>', 'hello', 'how', 'are', 'you', 'doing', '?', 'Hope', 'find', 'the', 'book', 'interesting.', '<s>how'])

In [9]:
[cfd[a][b] for (a,b) in nltk.bigrams(sentence)]   # The corpus counts of each bigram in the sentence

[0, 1, 0]

In [10]:
[cfd[a].N() for (a,b) in nltk.bigrams(sentence)]   # How often the first word of each bigram occurs as a context in the corpus

[0, 1, 2]

In [11]:
[cfd[a].freq(b) for (a,b) in nltk.bigrams(sentence)]  # There is already a FreqDist method for MLE probability

[0, 1.0, 0.0]

In [14]:
[1 + cfd[a][b] for (a,b) in nltk.bigrams(sentence)]   # Laplace smoothing of each bigram count:

[1, 2, 1]

In [15]:
[len(vocabulary) + cfd[a].N() for (a,b) in nltk.bigrams(sentence)]   # The denominator: each context count plus the vocabulary size, so the smoothed counts normalise

[13, 14, 15]

In [16]:
# The smoothed Laplace probability for each bigram:
[1.0 * (1+cfd[a][b]) / (len(vocabulary)+cfd[a].N()) for (a,b) in nltk.bigrams(sentence)]

[0.07692307692307693, 0.14285714285714285, 0.06666666666666667]
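The computation in the cells above corresponds to the usual add-one estimate for bigrams. Written out as a formula, with $c(\cdot)$ denoting a corpus count and $V$ the vocabulary size (symbols introduced here for illustration; $V = 13$ in this example), the three values follow directly:

```latex
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}

\frac{0 + 1}{0 + 13} = \frac{1}{13} \approx 0.0769, \qquad
\frac{1 + 1}{1 + 13} = \frac{2}{14} \approx 0.1429, \qquad
\frac{0 + 1}{2 + 13} = \frac{1}{15} \approx 0.0667
```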

Consider another way of performing add-one smoothing, that is, of generating a Laplace probability distribution:

In [18]:
# MLEProbDist is the unsmoothed probability distribution
cpd_mle = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist,bins=len(vocabulary))

In [19]:
[cpd_mle[a].prob(b) for (a,b) in nltk.bigrams(sentence)] # Now we can get the MLE probabilities by using the .prob method

[0, 1.0, 0.0]

In [22]:
# LaplaceProbDist is the add-one smoothed ProbDist
cpd_laplace = nltk.ConditionalProbDist(cfd, nltk.LaplaceProbDist, bins=len(vocabulary))

In [23]:
# Getting the Laplace probabilities is the same as for MLE
[cpd_laplace[a].prob(b) for (a,b) in nltk.bigrams(sentence)]

[0.07692307692307693, 0.14285714285714285, 0.06666666666666667]
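As a sanity check, the same Laplace estimate can be reproduced without NLTK, and the smoothed probabilities for a given context can be summed over the vocabulary to confirm that they form a proper distribution. This is a sketch under the assumption that the vocabulary is exactly the set of corpus tokens; the helper name `laplace_prob` is hypothetical.

```python
from collections import Counter, defaultdict

corpus = "<s> hello how are you doing ? Hope you find the book interesting. </s>".split()
vocabulary = set(corpus)

# Count how often each word follows each context word
bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def laplace_prob(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    context_total = sum(bigram_counts[prev].values())
    return (bigram_counts[prev][word] + 1) / (context_total + len(vocabulary))

# Summing over the whole vocabulary for one context should give 1
total_after_you = sum(laplace_prob("you", w) for w in vocabulary)
print(laplace_prob("are", "you"), total_after_you)
```

The sum is taken over the 13 vocabulary entries only. Tokens outside the vocabulary, such as `'doing</s>'` in the example sentence, still receive a non-zero probability from this formula, so the bin count needs to cover every token that may be queried if the distribution is to remain normalised.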
