#### We will address two issues with our vanilla bigram model implementation: out-of-vocabulary tokens, and bigrams that are never observed in the training data. To handle out-of-vocabulary words, we add a special `<UNK>` token to the vocabulary. For the zero bigram count problem, we explore some smoothing techniques.
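
As a minimal sketch of the `<UNK>` idea on a toy corpus: every word seen fewer than `count_threshold` times is replaced by `<UNK>`. The corpus and the helper name `replace_rare` are hypothetical and only for illustration; the model class in this notebook does the equivalent inside `get_counts`.

```python
from collections import Counter

def replace_rare(sentences, count_threshold=2, unk_token="<UNK>"):
    """Map every word seen fewer than count_threshold times to unk_token."""
    counts = Counter(word for s in sentences for word in s)
    vocab = {w for w, c in counts.items() if c >= count_threshold}
    return [[w if w in vocab else unk_token for w in s] for s in sentences]

# Toy corpus (made up): "not" occurs only once, so it becomes <UNK>
toy = [["<s>", "to", "be", "</s>"], ["<s>", "not", "to", "be", "</s>"]]
print(replace_rare(toy))
```

At test time the same mapping is applied to any word that is not in the training vocabulary, so the model always has a count to fall back on.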

In [40]:
import re
import random
from collections import defaultdict, Counter
import numpy as np

#### The first type of smoothing we explore is `add-k smoothing`, for which the bigram probability estimate is modified as follows:

#### $P(w_k|w_{k-1}) = \frac{C(w_k, w_{k-1}) + k}{C(w_{k-1}) + k|V|}$ where $k$ is a positive constant and $|V|$ is the vocabulary size. (The smoothing constant $k$ is unrelated to the position index $k$ in $w_k$.)

This redistributes probability mass so that bigrams with zero count receive a non-zero probability. Note that the $k|V|$ term in the denominator can substantially decrease the probabilities that were already non-zero before smoothing, depending on how large $k$ is.
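
As a small illustration of the formula above, the sketch below applies add-k smoothing to a made-up three-word vocabulary. The counts and the helper name `addk_prob` are assumptions for illustration only and are not part of the model class.

```python
from collections import Counter

# Toy counts (made up for illustration)
vocab = ["a", "b", "c"]
unigram_counts = Counter({"a": 4, "b": 3, "c": 1})
bigram_counts = Counter({("a", "b"): 3, ("a", "a"): 1})  # ("a", "c") is never observed

def addk_prob(w_prev, w, k):
    """Add-k estimate of P(w | w_prev)."""
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * len(vocab))

for k in (0.0, 0.1, 1.0):
    probs = {w: addk_prob("a", w, k) for w in vocab}
    rounded = {w: round(p, 3) for w, p in probs.items()}
    print(f"k = {k}: {rounded}, sum = {sum(probs.values()):.3f}")
```

The unseen bigram `("a", "c")` gets a non-zero probability as soon as k > 0, while the probability of the frequent bigram `("a", "b")` drops from 3/4 at k = 0 to 4/7 at k = 1. The probabilities still sum to 1 for every k.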


In [49]:
class bigram_LM_addk():

    def __init__(self, count_threshold=2, k=0.1):
        self.count_threshold = count_threshold 
        self.k = k
        self.bigram_counts = None
        self.unigram_counts = None
        self.vocab = None
        self.word2idx = None
        self.bigram_probs = None
        self.num_sentences = None
        self.unk_token = '<UNK>'

    def train(self, sentences):
        self.num_sentences = len(sentences)
        self.vocab, self.unigram_counts, self.bigram_counts = self.get_counts(sentences)
        self.vocab = list(self.unigram_counts.keys())
        self.word2idx = {word:i for i,word in enumerate(self.vocab)}
        self.compute_probs()
        print("Training complete!")         

    def get_counts(self, sentences):
        # collect unigram counts 
        print("Collecting unigram counts...")
        unigram_counts = Counter()
        for s in sentences:
            for word in s:
                unigram_counts[word] += 1 
        
        # remove all words that have count below the threshold    
        print("Constructing vocab...")     
        for w in list(unigram_counts.keys()):
            if unigram_counts[w] < self.count_threshold:
                unigram_counts.pop(w)
        # construct vocab 
        vocab = [self.unk_token] + sorted(list(unigram_counts.keys()))            
        
        # replace all oov tokens in training sentences with <UNK>
        print("Replacing with oov tokens in training data...")
        sentences_unk = []
        for s in sentences:
            sent = []
            for word in s:
                # unigram_counts holds only the words kept in vocab, so this is an O(1) membership test
                if word in unigram_counts:
                    sent.append(word)
                else:
                    sent.append(self.unk_token)
            sentences_unk.append(sent)            

        # re-collect unigram counts 
        print("Re-collecting unigram counts...")
        unigram_counts = Counter()
        for s in sentences_unk:
            for word in s:
                unigram_counts[word] += 1 

        # collect bigram counts
        print("Collecting bigram counts...")
        bigram_counts = Counter()
        for s in sentences_unk:
            for bigram in zip(s[:-1], s[1:]):
                bigram_counts[bigram] += 1     

        return vocab, unigram_counts, bigram_counts
    
    def compute_probs(self):
        print("Computing bigram probabilities...")
        bigram_probs = Counter()
        for word1 in self.vocab:
            probs = []
            for word2 in self.vocab:
                # compute P(word2|word1)
                p = self.bg_prob(word1, word2)
                probs.append(p)
            bigram_probs[word1] = probs 
        self.bigram_probs = bigram_probs   

    def bg_prob(self, word1, word2):
        # addk probability
        p = (self.bigram_counts[(word1, word2)] + self.k) / (self.unigram_counts[word1] + self.k*len(self.vocab)) 
        return p        

In [4]:
# prep the training data
with open('shakespeare.txt', 'r') as file:
    lines = file.readlines()

# remove all punctuation (except apostrophes) from the lines, strip surrounding whitespace and lowercase all characters
sentences_clean = []
for line in lines:
    cleaned = re.sub(r"[^\w\s']",'',line).strip().lower()
    if len(cleaned) > 0:
        sentences_clean.append(cleaned)

# tokenize the sentences (split on whitespaces) and add start and end sentence tokens
start_token = '<s>'        
end_token = '</s>'        
sentences_tokenized = [[start_token]+s.split()+[end_token] for s in sentences_clean]
print(f"Num sentences: {len(sentences_tokenized)}")    

# now we split the data into train and test sentences
num_sent = len(sentences_tokenized)
num_test = int(0.1 * num_sent)
test_idx = set(random.sample(range(num_sent), num_test))  # set gives O(1) membership tests in the loop below

sentences_train = []
sentences_test = []
for i in range(num_sent):
    if i not in test_idx:
        sentences_train.append(sentences_tokenized[i])
    else:
        sentences_test.append(sentences_tokenized[i])    

print(f"Number of training sentences: {len(sentences_train)}")        
print(f"Number of test sentences: {len(sentences_test)}")        


Num sentences: 32777
Number of training sentences: 29500
Number of test sentences: 3277


In [50]:
model = bigram_LM_addk()
model.train(sentences_train)

Collecting unigram counts...
Constructing vocab...
Replacing with oov tokens in training data...
Re-collecting unigram counts...
Collecting bigram counts...
Computing bigram probabilities...
Training complete!


In [62]:
def generate_text(model, n=10):
    """Sample up to n sentences from the model, one word at a time."""
    sentences = []
    for _ in range(n):
        current_word = '<s>'
        words = []    
        while True:
            # get probabilities of next word given current context, i.e. P(w|w_current)
            probs = model.bigram_probs[current_word]
            # now sample from the vocabulary according to this distribution
            next_word = random.choices(model.vocab, weights=probs, k=1)[0]
            if next_word == '</s>':
                break
            # never emit a start token mid-sentence; resample instead
            if next_word == '<s>':
                continue    
            words.append(next_word)
            current_word = next_word
        # skip empty sentences
        if len(words) > 0:    
            sentences.append(" ".join(words))

    return "\n".join(sentences)   

In [53]:
model.k = 0.0001
model.compute_probs()

Computing bigram probabilities...


In [54]:
text = generate_text(model, n=100)
print(text)

under your life before our henry of day
why then you did for then be thus i love as
o let this breathing world
the dear'st <UNK> well graced before we pray you give good morrow kate neither care keeps from heaven forbid but gentle princes there
to the gods will
provost
is to do no settled hate
if two deep as they are up thy lawful hangman must reach them nor i <UNK> and but with a map of the field
king richard iii
shrift come you most
whiles thy years
cominius
what ugly sights
why should i crave the man and titus indictment parlous boy
<UNK>
throw up they dead
cuts off
my babes for his sleep and i' the lute
might better <UNK> but did we are these three daughters the fray at the king richard ii
you'll stay
be mother cast off
king is adrian
this
first senator
to do remain alike will't please you have hands and cousins indeed
or how i revolt to berkeley to piece
first as i have you
thou lovest me all the cause to melt the
find
shall ne'er speak taught thee <UNK> on't
abhorson
can you comp

#### Note that increasing the smoothing factor k tends to produce longer generated sentences. This is because, for larger k, the smoothed distribution moves closer to uniform over the vocabulary, so the probability of the `</s>` token becomes smaller.
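
To make the effect on `</s>` concrete, the sketch below evaluates the add-k estimate of P(`</s>` | context) for a single made-up context word. The counts, the vocabulary size and the helper name `p_end` are assumptions for illustration, not values taken from the trained model.

```python
# Toy setting (made-up numbers): a context word seen 10 times, 4 of them
# followed by '</s>', in a vocabulary of 10,000 word types.
count_context = 10
count_context_end = 4
vocab_size = 10_000

def p_end(k):
    """Add-k estimate of P('</s>' | context)."""
    return (count_context_end + k) / (count_context + k * vocab_size)

for k in (1e-5, 1e-3, 1e-1, 1.0):
    print(f"k = {k:g}: P(</s> | context) = {p_end(k):.5f}")
```

The estimate decreases with k whenever the unsmoothed estimate (here 4/10) is above the uniform value 1/|V|, which is the typical case for `</s>` after words that often end a line.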

In [41]:
def compute_perplexity(model, test_sentences):
    # perplexity = exp of the average negative log-probability per bigram
    vocab = set(model.vocab)  # O(1) membership tests
    sum_log_probs = 0.0
    n = 0
    for s in test_sentences:
        for w1,w2 in zip(s[:-1], s[1:]):
            # replace any oov token with <UNK>
            if w1 not in vocab:
                w1 = model.unk_token    
            if w2 not in vocab:
                w2 = model.unk_token
            sum_log_probs += np.log(model.bg_prob(w1, w2))
            n += 1
    avg_neg_log_prob = -sum_log_probs / n
    perplexity = np.exp(avg_neg_log_prob)
    return perplexity  
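
The function above computes perplexity as the exponential of the average negative log-probability per bigram, $PP = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log \hat{P}(w_i|w_{i-1})\right)$. As a quick sanity check of that definition, a model that assigns probability $1/50$ to every token should have a perplexity of 50. The helper name `perplexity` below is an assumption for illustration.

```python
import numpy as np

def perplexity(probs):
    """Perplexity of a sequence, given the model probability of each token."""
    return float(np.exp(-np.mean(np.log(probs))))

# A uniform model over 50 outcomes is exactly as "confused" as a 50-way guess
print(perplexity([1 / 50] * 20))
```

Lower perplexity means the model assigns higher probability, on average, to the tokens that actually occur.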

In [44]:
# now let's compute perplexity on both the training and test data for different k values
kvals = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
for k in kvals:
    model.k = k
    model.compute_probs()
    pp_train = compute_perplexity(model, sentences_train)
    pp_test = compute_perplexity(model, sentences_test)

    print(f"\nk = {k}")
    print(f"Perplexity computed on training set: {pp_train:.3f}")
    print(f"Perplexity computed on test set: {pp_test:.3f}")


Computing bigram probabilities...

k = 1.0
Perplexity computed on training set: 743.252
Perplexity computed on test set: 855.281
Computing bigram probabilities...

k = 0.1
Perplexity computed on training set: 213.299
Perplexity computed on test set: 383.594
Computing bigram probabilities...

k = 0.01
Perplexity computed on training set: 92.112
Perplexity computed on test set: 291.047
Computing bigram probabilities...

k = 0.001
Perplexity computed on training set: 62.757
Perplexity computed on test set: 351.426
Computing bigram probabilities...

k = 0.0001
Perplexity computed on training set: 56.286
Perplexity computed on test set: 550.344
Computing bigram probabilities...

k = 1e-05
Perplexity computed on training set: 55.361
Perplexity computed on test set: 934.920


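#### Note that the best perplexity on the test set is ~291, obtained at k = 0.01. For smaller k the training perplexity keeps falling while the test perplexity rises, which indicates overfitting to the training bigrams.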

#### Now we will try a different type of smoothing which interpolates between bigram, unigram and zerogram probabilities (zerogram probability is defined as just 1/|V|) in the following way:

$\hat{P}(w_k|w_{k-1}) = \lambda_2 P(w_k|w_{k-1}) + \lambda_1 P(w_k) + \lambda_0 P(0)$

where $P(w_k|w_{k-1}) = \frac{C(w_k, w_{k-1})}{C(w_{k-1})}$, $P(w_k) = \frac{C(w_k)}{\sum_{w \in V} C(w)}$ and $P(0) = \frac{1}{|V|}$

and $\lambda_0$, $\lambda_1$, $\lambda_2$ are constant interpolation weights that sum to 1. Their values should be chosen to maximise performance (i.e. minimise perplexity) on held-out data.
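
As a small illustration of the interpolation formula, the sketch below evaluates it on a made-up three-word vocabulary. The counts, the weights and the helper name `interp_prob` are assumptions for illustration only.

```python
from collections import Counter

# Toy counts (made up for illustration)
vocab = ["a", "b", "c"]
unigram_counts = Counter({"a": 4, "b": 3, "c": 1})
bigram_counts = Counter({("a", "b"): 3, ("a", "a"): 1})  # ("a", "c") is never observed
total_tokens = sum(unigram_counts.values())
lambdas = (0.01, 0.4, 0.59)  # (lambda_0, lambda_1, lambda_2), must sum to 1

def interp_prob(w_prev, w):
    """Interpolated estimate of P(w | w_prev)."""
    p_zerogram = 1 / len(vocab)
    p_unigram = unigram_counts[w] / total_tokens
    p_bigram = bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
    return lambdas[0] * p_zerogram + lambdas[1] * p_unigram + lambdas[2] * p_bigram

probs = {w: interp_prob("a", w) for w in vocab}
print(probs, sum(probs.values()))
```

The unseen bigram `("a", "c")` still gets probability mass from the unigram and zerogram terms, and because each component distribution sums to 1 and the weights sum to 1, the interpolated distribution sums to 1 as well.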




In [46]:
class bigram_LM_interp():

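    def __init__(self, count_threshold=2, lmda=(0.01, 0.4, 0.59)):
        self.count_threshold = count_threshold 
        # interpolation weights [lambda_0, lambda_1, lambda_2] for the zerogram, unigram and bigram terms
        self.lmda = list(lmda)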
        self.bigram_counts = None
        self.unigram_counts = None
        self.vocab = None
        self.word2idx = None
        self.bigram_probs = None
        self.total_tokens = None
        self.unk_token = '<UNK>'

    def train(self, sentences):
        self.vocab, self.unigram_counts, self.bigram_counts, self.total_tokens = self.get_counts(sentences)
        self.vocab = list(self.unigram_counts.keys())
        self.word2idx = {word:i for i,word in enumerate(self.vocab)}
        self.compute_probs()
        print("Training complete!")         

    def get_counts(self, sentences):
        # collect unigram counts 
        print("Collecting unigram counts...")
        unigram_counts = Counter()
        for s in sentences:
            for word in s:
                unigram_counts[word] += 1 
        
        # remove all words that have count below the threshold    
        print("Constructing vocab...")     
        for w in list(unigram_counts.keys()):
            if unigram_counts[w] < self.count_threshold:
                unigram_counts.pop(w)
        # construct vocab 
        vocab = [self.unk_token] + sorted(list(unigram_counts.keys()))            
        
        # replace all oov tokens in training sentences with <UNK>
        print("Replacing with oov tokens in training data...")
        sentences_unk = []
        for s in sentences:
            sent = []
            for word in s:
                # unigram_counts holds only the words kept in vocab, so this is an O(1) membership test
                if word in unigram_counts:
                    sent.append(word)
                else:
                    sent.append(self.unk_token)
            sentences_unk.append(sent)            

        # re-collect unigram counts 
        print("Re-collecting unigram counts...")
        unigram_counts = Counter()
        total_tokens = 0
        for s in sentences_unk:
            for word in s:
                unigram_counts[word] += 1 
                total_tokens += 1

        # collect bigram counts
        print("Collecting bigram counts...")
        bigram_counts = Counter()
        for s in sentences_unk:
            for bigram in zip(s[:-1], s[1:]):
                bigram_counts[bigram] += 1     

        return vocab, unigram_counts, bigram_counts, total_tokens
    
    def compute_probs(self):
        print("Computing bigram probabilities...")
        bigram_probs = Counter()
        for word1 in self.vocab:
            probs = []
            for word2 in self.vocab:
                # compute P(word2|word1)
                p = self.bg_prob(word1, word2)
                probs.append(p)
            bigram_probs[word1] = probs 
        self.bigram_probs = bigram_probs   

    def bg_prob(self, word1, word2):
        # linearly interpolated probability
        p_zerogram = self.lmda[0] * 1 / len(self.vocab)
        p_unigram =  self.lmda[1] * self.unigram_counts[word2] / self.total_tokens 
        p_bigram = self.lmda[2] * self.bigram_counts[(word1, word2)] / self.unigram_counts[word1] 
        p = p_zerogram + p_unigram + p_bigram
        return p        

In [55]:
model = bigram_LM_interp()
model.train(sentences_train)

Collecting unigram counts...
Constructing vocab...
Replacing with oov tokens in training data...
Re-collecting unigram counts...
Collecting bigram counts...
Computing bigram probabilities...
Training complete!


In [56]:
text = generate_text(model, n=100)
print(text)

and yet <UNK> dear
let and will take it good of a second <UNK> <s> i will <UNK> by the <UNK>
dukedom
soul that i of your fancy if sorrow woman issued to gaunt
autolycus
henry bolingbroke rode
i pray think richard is his word she
and still as art a for thou and never find me at once you are come cursed be full myself ranks
be brief for thou wilt <s> what must not thou been
nay soft
most forward <UNK> so great anchors on <s> then if it <s> he it advanced your ages
to thyself for for ere foul sin our services dead
<s> he comfort <s> petruchio
no some half <UNK> to make <s> he's on
wretches so sailors thou aufidius can kiss your itself <s> wife this <UNK> greek latin books she speaks drunk all
ferdinand
good
here this peace
<s> sirrah fetch written to the
fortune
his him angelo it o preposterous estate
death
there changed it you <s>
belly answer <s> <UNK> to <s>
and my the <UNK> do look thy stout tybalt somerset
their spite gremio falsely shame it 'twas to hear bodes henceforward
urge it o

In [58]:
# now let's compute perplexity on both the training and test data for different lambda values (lambda_0 will be held fixed at 0.01)
lambda2_vals = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
for l2 in lambda2_vals:
    model.lmda = [0.01, 0.99-l2 ,l2]
    model.compute_probs()
    pp_train = compute_perplexity(model, sentences_train)
    pp_test = compute_perplexity(model, sentences_test)

    print(f"\nlambdas = {model.lmda}")
    print(f"Perplexity computed on training set: {pp_train:.3f}")
    print(f"Perplexity computed on test set: {pp_test:.3f}")


Computing bigram probabilities...

lambdas = [0.01, 0.49, 0.5]
Perplexity computed on training set: 85.047
Perplexity computed on test set: 198.653
Computing bigram probabilities...

lambdas = [0.01, 0.39, 0.6]
Perplexity computed on training set: 76.158
Perplexity computed on test set: 195.270
Computing bigram probabilities...

lambdas = [0.01, 0.29000000000000004, 0.7]
Perplexity computed on training set: 69.225
Perplexity computed on test set: 196.535
Computing bigram probabilities...

lambdas = [0.01, 0.18999999999999995, 0.8]
Perplexity computed on training set: 63.667
Perplexity computed on test set: 204.578
Computing bigram probabilities...

lambdas = [0.01, 0.08999999999999997, 0.9]
Perplexity computed on training set: 59.133
Perplexity computed on test set: 228.348
Computing bigram probabilities...

lambdas = [0.01, 0.040000000000000036, 0.95]
Perplexity computed on training set: 57.180
Perplexity computed on test set: 261.265


#### Note that with interpolation, we get a much lower perplexity on the test set than with add-k smoothing: the best value is ~195 (at $\lambda_2 = 0.6$), versus ~291 for add-k. The quality of the generated text also seems slightly better, but that is hard to tell for sure.

In [64]:
model.lmda = [0.01, 0.99-0.8 ,0.8]
model.compute_probs()

Computing bigram probabilities...


In [65]:
text = generate_text(model, n=100)
print(text)

the precious jewel strong purpose not exempt in thy rocky bosom of the maid hath banish'd haughty mind
and all
that princely knee rise we marry i think but a man life
your subject made disgraced <UNK> <UNK> will of virtue
it will by this while a <UNK> <UNK> then shepherd
sir there
escalus
servant
my memory of prompt my have forgot
prince
thy loss well he you
he shall i are a concealment
find a consul
to thrust myself
these other home
and
you
whose feeling but we have knowledge find love
petruchio
nay good brother i shall be there to us
some pretty i' faith the maid's mild entreaty shall wear the high'st my <UNK>
no is well that moving
sir
what say'st thou take this
broke off send tybalt's doomsday is
and to god on and sir king richard moe
beg starve
where's barnardine partial to <UNK>
would <UNK> night
but till he that warwick's daughter is but
northumberland
corioli wear their king usurping him but this there brother die to pass
all kneel for exile him mistress and your hand that i kn