# CS 584 Assignment 3 -- Language Model

#### Name: Varun Doddipalli

## In this assignment, you are required to follow the steps below:
1. Review the lecture slides.
2. Implement N-gram language modeling.
3. Implement RNN language modeling.

*** Please read the code and comments very carefully and install these packages (NumPy, sklearn, and tqdm) before you start ***

In [None]:
!pip install numpy scikit-learn tqdm matplotlib
!pip install -U spacy
!python -m spacy download en_core_web_sm

## 0. Data Process
Run the following cells to preprocess training data, validation data, and test data.

### Load Data

In [1]:
train_texts = []
with open('./data/train.txt', 'r') as fp:
    for line in fp:
        train_texts.append(line)
        
valid_texts = []
with open('./data/valid.txt', 'r') as fp:
    for line in fp:
        valid_texts.append(line)
        
test_texts = []
with open('./data/input.txt', 'r') as fp:
    for line in fp:
        test_texts.append(line)


### Preprocessing

In [2]:
import re
import string

class Preprocesser(object):
    def __init__(self, punctuation=True, url=True, number=True):
        self.punctuation = punctuation
        self.url = url
        self.number = number
    
    def apply(self, text):
        
        text = self._lowercase(text)
        # text = text.replace('<unk>', '')
        
        if self.url:
            text = self._remove_url(text)
            
        if self.punctuation:
            text = self._remove_punctuation(text)
            
        if self.number:
            text = self._remove_number(text)
        
        text = '<s> ' + text + ' </s>'
        text = re.sub(r'\s+', ' ', text)
            
        return text
    
        
    def _remove_punctuation(self, text):
        ''' Please fill this function to remove all the punctuations in the text
        '''
        ### Start your code
        puncs = string.punctuation
        puncs = puncs.replace('<', '')
        puncs = puncs.replace('>', '')
        puncs = puncs.replace('/', '')
        text = ''.join(c for c in text if c not in puncs)
        
        ### End
        
        return text
    
    def _remove_url(self, text):
        ''' Please fill this function to remove all the urls in the text
        '''
        ### Start your code
        
        text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text)
        
        ### End
        
        return text
    
    def _remove_number(self, text):
        ''' Please fill this function to remove all the numbers in the text
        '''
        
        ### Start your code
        text = ''.join([i for i in text if not i.isdigit()])
        
        ### End
        
        return text
    
    def _lowercase(self, text):
        ''' Please fill this function to lower the text
        '''
        
        ### Start your code
        text = text.lower()
        ### End
        
        return text
    
    
preprocesser = Preprocesser()

### Tokenization

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'ner'])

prefixes = [ss for ss in nlp.Defaults.prefixes if '<' not in ss]
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

suffixes = [ss for ss in nlp.Defaults.suffixes if '>' not in ss]
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

def tokenize(text):
    ''' Since it is a language model, we don't need to remove the stop words.
    '''
    
    doc = nlp(text)
    tokens = [token.text for token in doc]
    
    return tokens
    

## 1. N-gram (50 points)
In this section, you are required to implement an N-gram model for language modeling and two smoothing methods.
1. Implement N-gram (Bigram).
2. Implement Good Turing smoothing.
3. Implement Kneser-Ney smoothing.

### 1.1 Implement a bigram for language modeling (fill the code, 10 points)

In [4]:
from collections import defaultdict, Counter
from tqdm.notebook import tqdm

class BiGram(object):
    
    def __init__(self):
        ''' Construction function of BiGram.
            Params:
                uni_count: a dictionary with default value 0
                bi_count: a dictionary that each value is a dictionary with default value 0
        '''

        self.uni_count = defaultdict(lambda: 0)
        self.bi_count = defaultdict(lambda: defaultdict(lambda: 0))
        
        
    def fit(self, texts):
        self._unigram_count(texts)
        self._bigram_count(texts)
        
    
    def _unigram_count(self, texts):
        ''' Count tokens, and store in self.uni_count
            Input
                texts: a list of text
        '''
        
        ### Start you code
        freq_count = None
        for text in tqdm(texts, total=len(texts)):
            freq_count = Counter(tokenize(preprocesser.apply(text.strip()))) if freq_count is None else freq_count + Counter(tokenize(preprocesser.apply(text.strip())))
        self.uni_count = defaultdict(lambda: 0, freq_count)
        ### End
            
    
    
    def _bigram_count(self, texts):
        ''' Count tokens in bigram way, and store in self.bi_count
            Input
                texts: a list of text
        '''
        
        ### Start you code
        for text in tqdm(texts, total=len(texts)):
            tokens = tokenize(preprocesser.apply(text.strip()))
            for i in range(len(tokens[:-1])):
                self.bi_count[tokens[i]][tokens[i+1]] += 1
        ### End
    
    
    def probability(self, w1, w2):
        ''' Given two tokens, calculate the bigram probability
            Input
                w1: the first token of bigram
                w2: the second token of bigram
        '''
        prob = 0.
        
        ### Start you code
        try:
            prob = self.bi_count[w1][w2]/self.uni_count[w1]
        except ZeroDivisionError:
            prob = 1E-2
        ### End
        
        return prob
    
    
    def predict(self, w):
        ''' Given a word, find a word with the highest probability
            Input
                w: a word
                
            Hint: utilize self.probability(w, w2) to find which w2 has the highest probability
        '''
        
        w_next = None
        
        ### Start your code
        w_next = sorted(self.bi_count[w].items(), key=lambda x:x[1], reverse=True)[0][0]

        
        ### End
        
        return w_next

### 1.2 Implement Good Turing smoothing (fill the code, 15 points)

In [5]:
class GoodTuring(object):
    
    def __init__(self, bigram):
        ''' Construction function of Good Turing.
            Input
                bigram: Bigram model
            Params:
                uni_count: a dictionary with default value 0
                bi_count: a dictionary that each value is a dictionary with default value 0
                -----------------
                For bigram
                bi_nc: a dictionary with default value 0, the count of things we've seen c times.
                bi_c_star: (c+1)*N_c+1 / N_c, page 64 of slides (lecture 5).
                bi_N: \sum c*N_c, page 64 of slides (lecture 5).
                
                For unigram
                uni_nc: a dictionary with default value 0, the count of things we've seen c times.
                uni_c_star: (c+1)*N_c+1 / N_c, page 64 of slides (lecture 5).
                uni_N: \sum c*N_c, page 64 of slides (lecture 5).
            
        '''
        self.uni_count = bigram.uni_count
        self.bi_count = bigram.bi_count
        
        self.uni_nc = defaultdict(lambda: 0)
        self.bi_nc = defaultdict(lambda: 0)
        
        self.uni_c_star = defaultdict(lambda: 0)
        self.bi_c_star = defaultdict(lambda: 0)
        
        self.uni_N = 0
        self.bi_N = 0
        
        
    def fit(self, texts):
        self._calc_N_c()
        self._calc_c_star_and_N()
        
    
    def _calc_N_c(self):
        ''' Count the frequency of frequency c, and store to self.nc.
            Page 64 of slides (lecture 5)
            Hint: You could directly utililze self.bi_count and self.uni_count to calculate N_c
        '''
        
        ### Start you code
        self.uni_nc = defaultdict(lambda: 0, Counter(self.uni_count.values()))
        temp = Counter()
        for i in self.bi_count.values():
            temp += Counter(i.values())
        self.bi_nc = defaultdict(lambda: 0, temp)
        ### End
        
        
    def _calc_c_star_and_N(self):
        ''' Calculate c_star and N. (page 65 of slides (lecture 5))
        '''
        
        ### Start your code
        # c_list = list(self.uni_nc.keys())
        for c in range(max(self.uni_nc.keys())): 
            try:
                self.uni_c_star[c] = (c + 1)*(self.uni_nc[c + 1]) / self.uni_nc[c]
                self.uni_N += c*self.uni_nc[c]
            except:
                continue
        for c in range(max(self.bi_nc.keys())):
            try:
                self.bi_c_star[c] = (c + 1)*(self.bi_nc[c + 1]) / self.bi_nc[c]
                self.bi_N += c*self.bi_nc[c]
            except:
                continue

        ### End
        
        
    def probability(self, w1, w2):
        ''' Given two words, calculate the GT probability
                p_GT = c_star / N, if c != 0
                p_GT = N_1 / N, if c = 0
                
                p = p_GT(w1, w2) / p_GT(w1)
                
            Input
                w1: the first word
                w2: the second word
                
        '''
        prob = 0.
        
        ### Start you code
        c_bi = self.bi_count[w1][w2]
        bi_p_GT = self.bi_c_star[c_bi]/self.bi_N if c_bi == 0 else self.bi_nc[c_bi]/self.bi_N

        c_uni = self.uni_count[w1]
        uni_p_GT = self.uni_c_star[c_uni]/self.uni_N if c_uni == 0 else self.uni_nc[1]/self.uni_N
  
        prob = bi_p_GT/uni_p_GT
        ### End
        
        return prob

    
    def predict(self, w):
        ''' Given a word, find a word with the highest probability
            Input
                w: a word
                
            Hint: utilize self.probability(w, w2) to find which w2 has the highest probability
        '''
        
        w_next = None
        
        ### Start your code
        w_next_list = list(self.bi_count[w].keys())
        prob_list = []
        for w2 in w_next_list:
            prob_list.append(self.probability(w, w2))
        w_next = w_next_list[np.argmax(prob_list)]
        ### End
        
        return w_next

### 1.3 Implement Kneser-Ney smoothing (fill the code, 15 points)

In [6]:
class KneserNey(object):
    
    def __init__(self, bigram, d=0.75):
        ''' Construction function of KneserNey.
            Params:
                uni_count: a dictionary with default value 0
                bi_count: a dictionary that each value is a dictionary with default value 0
                -----------------
                num_bigram_types: page 73 of slides (lecture 5)
                novel_continuation: \{ w_{i-1}: c(w_{i-1}, w) \}, page 73 of slides (lecture 5)
                p_continuation: page 73 of slides (lecture 5)
                novel_previous: \{ w: c(w_{i-1}, w) \}, page 75 of slides (lecture 5)
                lam: page 75 of slides (lecture 5)
                d: 0.75
            
        '''
        
        self.uni_count = bigram.uni_count
        self.bi_count = bigram.bi_count
        
        self.num_bigram_types = 0
        self.novel_continuation = defaultdict(lambda: 0)
        self.novel_previous = defaultdict(lambda: 0)
        self.p_continuation = defaultdict(lambda: 0)
        self.lam = defaultdict(lambda: 0)
        
        self.d = d
        
    
    def fit(self, texts):
        self._calc_num_bigram_types()
        self._calc_novel_continuation_and_novel_previous()
        self._calc_P_continuation()
        self._calc_lambda()
        
    
    def _calc_num_bigram_types(self):
        ''' Calculate the number of bigram types, and store in self.num_bigram_types
            page 73 of slides (lecture 5)
            
            Hint: you could utilize the bigram count (self.bi_count) which is obtained from Bigram model.
        '''
        
        ### Start your code                    
        for w1 in tqdm(self.bi_count.keys(), desc = 'calculating bigram_types',total = len(self.bi_count.keys())):
            for w2,freq in self.bi_count[w1].items():
                if freq != 0:
                    self.num_bigram_types += 1
        ### End
      
    
    def _calc_novel_continuation_and_novel_previous(self):
        ''' Calculate novel continuation, and novel previous, 
            and store them in self.novel_continuation and self.novel_previous
            
            novel_continuation = \{ w_{i-1}: c(w_{i-1}, w) \}, page 73 of slides (lecture 5)
            novel_previous = \{ w: c(w_{i-1}, w) \}, page 75 of slides (lecture 5)
            
            Hint: you could utilize the bigram count (self.bi_count) which obtained from Bigram model.
        '''
        
        ### Start your code
        w1_list = list(self.bi_count.keys())
        
        for w1 in tqdm(w1_list, desc='novel_continuation', total=len(w1_list)):
            self.novel_continuation[w1] = len([i for i in self.bi_count[w1].values() if i != 0])
            
        w2_list = list(self.uni_count.keys())
        
        for w2 in tqdm(w2_list, desc='novel_previous', total=len(w2_list)):
            for w1 in w1_list:
                if self.bi_count[w1][w2] != 0:
                    self.novel_previous[w1] += 1
        ### End
    
    
    def _calc_P_continuation(self):
        ''' Calculate p continuation, and store in self.p_continuation.
            page 73 of slides (lecture 5)
            
            Hint: you could utilize the novel continuation (self.novel_continuation).
        '''
        
        ### Start your code 
        w_list = list(self.novel_continuation.keys())
        for w in tqdm(w_list,desc='p_continuation',total=len(w_list)):
            try:
                self.p_continuation[w] = self.novel_continuation[w] / self.num_bigram_types
            except:
                self.p_continuation[w] = self.novel_continuation[w]        
        ### End
    
    
    def _calc_lambda(self):
        ''' Calculate lambda, and store in self.lam.
            page 75 of slides (lecture 5)
            
            Hint: you could utilize the novel previous (self.novel_previous) and unigram (self.uni_count).
        '''
        
        ### Start your code
        w1_list = list(self.bi_count.keys())
        for w1 in w1_list:
            try:
                self.lam[w1] = (self.d / self.uni_count[w1]) * (len(self.bi_count[w1]))
            except:
                self.lam[w1] = (self.d) * (len(self.bi_count[w1]))
        ### End
        
        
    def probability(self, w1, w2):
        ''' Given two words, calculate the KN probability
            Page 74 of slides (lecture 5)
                
            Input
                w1: the first word
                w2: the second word
        '''
        
        prob = 0.
        
        # Start your code
        try:
            prob = (max(self.bi_count[w1][w2] - self.d, 0)/self.uni_count(w1)) + (self.lam[w1] * self.p_continuation[w2])
        except:
            prob = (max(self.bi_count[w1][w2] - self.d, 0)) + (self.lam[w1] * self.p_continuation[w2])
        # End
            
        return prob
    
    
    def predict(self, w):
        ''' Given a word, find a word with the highest probability
            Input
                w: a word
                
            Hint: utilize self.probability(w, w2) to find which w2 has the highest probability
        '''
        
        pred = ''
        
        ### Start your code
        w_next_list = list(self.bi_count[w].keys())
        prob_list = []
        for w2 in w_next_list:
            prob_list.append(self.probability(w, w2))
        pred = w_next_list[np.argmax(prob_list)]
        ### End
                
        return pred

### 1.4 Implement Perplexity (fill the code, 10 point)
**Hint:** Multiplication of probabilities may lead to an overflow problem. One trick is to move the computation to the logarithm space. Therefore, you could use summation instead of multiplication to calculate perplexity.

In [7]:
import math
import numpy as np
def perplexity(model, texts):
    ''' Calculate the perplexity score.
        Inputs
            model: the model you want to evaluate (BiGram, GoodTuring, or KneserNey)
            texts: a list of validation text
        Output
            perp: the perplexity of the model on texts
    '''
    perp = 1.
    
    ### Start your code

    prob_array = []
    for text in texts:
        tokens = tokenize(preprocesser.apply(text.strip()))
        for i in range(len(tokens)-1):
            prob = model.probability(tokens[i],tokens[i+1])
            prob = 1E-1 if prob == 0 else prob
            prob_array.append(prob)
    perp = np.exp(-np.mean(np.log(prob_array)))

    ### End
    
    return perp


### 1.5 Calculate the perplexity of three models

Run the following cell to obtian the perplexity of BiGram, Good Turing, and Kneser-Ney.

**Note that, the perlexity should be less than 100.**

In [8]:
# Train Bigram
bigram = BiGram()
bigram.fit(train_texts)

# Perplexity
bigram_perplexity = perplexity(bigram, valid_texts)
print(f'The perplexity of Bigram is: {bigram_perplexity:.4f}')

  0%|          | 0/42068 [00:00<?, ?it/s]



  0%|          | 0/42068 [00:00<?, ?it/s]

The perplexity of Bigram is: 43.6457


In [9]:
# Train Good Turing
gt = GoodTuring(bigram)
gt.fit(train_texts)

# Perplexity
gt_perplexity = perplexity(gt, valid_texts)
print(f'The perplexity of Good Turing is: {gt_perplexity:.4f}')

The perplexity of Good Turing is: 0.2939


In [10]:
# For Kneser-Ney
kn = KneserNey(bigram, d=0.75)
kn.fit(train_texts)

# Perplexity
kn_perplexity = perplexity(kn, valid_texts)
print(f'The perplexity of Kneser-Ney is: {kn_perplexity:.4f}')

calculating bigram_types:   0%|          | 0/9892 [00:00<?, ?it/s]

novel_continuation:   0%|          | 0/9892 [00:00<?, ?it/s]

novel_previous:   0%|          | 0/9893 [00:00<?, ?it/s]

p_continuation:   0%|          | 0/9892 [00:00<?, ?it/s]

The perplexity of Kneser-Ney is: 0.1660


### 1.6 Use N-gram model make predictions

Run the following cells to see how your models work

#### 1.6.1 Bigram

In [11]:
import random

sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    clean_text = preprocesser.apply(text)
    tokens = tokenize(clean_text)
    pred = bigram.predict(tokens[-2])
    print(f'{i} ==> {text.strip()}, prediction: {pred}')

0 ==> investors taking this as a sign that a broad industry ___, prediction: </s>
1 ==> the subcommittee congress and the american public have every right ___, prediction: to
2 ==> they hope the foreign deals will divide the hollywood opposition ___, prediction: to
3 ==> robert h. <unk> an economist for lloyd 's bank in ___, prediction: the
4 ==> but mr. butcher 's comments make one thing clear some ___, prediction: of
5 ==> canada 's current oil exports to the u.s. total about ___, prediction: n
6 ==> a second <unk> plant at uncle sam <unk> that produces ___, prediction: <unk>
7 ==> the monthly increase is the highest recorded in the past ___, prediction: n
8 ==> but senate supporters of the <unk> legislation said that other ___, prediction: <unk>
9 ==> giant has n't ever disclosed the proposed price although <unk> ___, prediction: <unk>
10 ==> lexus sales were n't available the cars are imported and ___, prediction: <unk>
11 ==> once again the specialists were not able to handle the 

#### 1.6.2 Good Turing

In [12]:
sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    clean_text = preprocesser.apply(text)
    tokens = tokenize(clean_text)
    pred = gt.predict(tokens[-2])
    print(f'{i} ==> {text.strip()}, prediction: {pred}')

0 ==> in a related development the <unk> for the fourth year ___, prediction: casting
1 ==> this is n't a change in government policy this provision ___, prediction: initially
2 ==> mrs. <unk> said her <unk> investment club 's portfolio lost ___, prediction: more
3 ==> to fend off sir james 's advances b.a.t has proposed ___, prediction: lower
4 ==> the chinese leaders have to decide whether they want control ___, prediction: point
5 ==> see related story fed ready to <unk> big funds wsj ___, prediction: business
6 ==> but even among those aged N and older the share ___, prediction: data
7 ==> brazil and venezuela are the only two countries that have ___, prediction: memory
8 ==> japanese stocks dropped early monday but by late morning were ___, prediction: exposed
9 ==> and even if a <unk> would wear flowers in her ___, prediction: eyes
10 ==> seven big board stocks ual amr bankamerica walt disney capital ___, prediction: may
11 ==> alan <unk> executive vice president of investment ca

#### 1.6.3 Kneser-Ney

In [13]:
sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    clean_text = preprocesser.apply(text)
    tokens = tokenize(clean_text)
    pred = kn.predict(tokens[-2])
    print(f'{i} ==> {text.strip()}, prediction: {pred}')

0 ==> among classes for which details were available yields ranged from ___, prediction: n
1 ==> some lagging competitors even may leave the personal computer business ___, prediction: </s>
2 ==> <unk> <unk> of anc and <unk> we <unk> shook the ___, prediction: <unk>
3 ==> the company was officially merged with bristol-myers co. earlier this ___, prediction: year
4 ==> richard p. <unk> formerly eastern airlines ' top lawyer joined ___, prediction: the
5 ==> the competitive spirit is clearly <unk> radio free europe which ___, prediction: is
6 ==> <unk> percent do n't even feel they are financially well ___, prediction: as
7 ==> its strategy in the past has been to serve as ___, prediction: a
8 ==> since then life has changed a lot for <unk> leonard ___, prediction: <unk>
9 ==> we could n't get dealers to answer their phones said ___, prediction: </s>
10 ==> in these four for instance the rtc is stuck with ___, prediction: the
11 ==> the <unk> radio reporters seem better informed and more

## 2. RNN (50 points)
In this section, you are required to implement an RNN-based language model. **Libraries are allowed in this section, such as PyTorch or TensorFlow**. And, of course, you could implement the model from scratch which will get extra credits. 

I divided the whole process into several steps.
1. Initialize parameters
2. Prepare Data
3. Implement the model
4. Train your model
5. Evaluate your model

Please note that you could change those steps by your needs. As long as you correctly implement the model and have reasonable results you will get full points.

### 2.1 Initialize parameters for the model

In [14]:
#######################################################
#                                                     #
#        Change the default values accordingly        #
#                                                     #
#######################################################

learning_rate = 1E-3
batch_size = 50
hidden_size = 100
embedding_size = 200
num_epochs = 30
window_size = 20

### 2.2 Data preparation (Fill the code: 5 points)

Here is what do you might need to do in this section:
1. Build a vocabulary.
2. Prepare the training data.
3. Prepare the validation data.

In [15]:
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import CategoricalCrossentropy
import numpy as np

In [16]:
class Tokenizer(object):
    def __init__(self):
        self.preprocessor = Preprocesser()
        self.word2freq = Counter()
        self.word2idx = defaultdict(lambda: 0)

    def making_idx(self):
        self.unique_tokens = self.word2freq.keys()
        self.vocab_size = len(self.unique_tokens)
        for i,j in enumerate(self.unique_tokens):
            self.word2idx[j] = i+1
        
    def fit_on_texts(self, texts):
        for doc in tqdm(texts):
            self.word2freq += Counter(tokenize(self.preprocessor.apply(doc.strip())))
        self.making_idx()

    def texts_to_sequences(self, sentence):
        tokens = tokenize(self.preprocessor.apply(sentence.strip()))
        return list(map(lambda x: self.word2idx[x] if x in self.unique_tokens else self.word2idx['<unk>'], tokens))

In [17]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)

train_input_sequences = []
for line in tqdm(train_texts, total=len(train_texts)):
    token_list = tokenizer.texts_to_sequences(line)
    for i in range(1, len(token_list)):
        train_input_sequences.append(token_list[:i+1][-(window_size+1):])

train_input_sequences = pad_sequences(train_input_sequences, maxlen=window_size+1)

val_input_sequences = []
for line in tqdm(valid_texts, total=len(valid_texts)):
    token_list = tokenizer.texts_to_sequences(line)
    for i in range(1, len(token_list)):
        val_input_sequences.append(token_list[:i+1][-(window_size+1):])

val_input_sequences = pad_sequences(val_input_sequences, maxlen=window_size+1)

X_train = train_input_sequences[:,:-1]
labels_train = train_input_sequences[:,-1]

Y_train = tf.keras.utils.to_categorical(labels_train, num_classes=tokenizer.vocab_size+1, dtype='float32')

X_val = val_input_sequences[:,:-1]
labels_val = val_input_sequences[:,-1]
Y_val = tf.keras.utils.to_categorical(labels_val, num_classes=tokenizer.vocab_size+1, dtype='float32')

  0%|          | 0/42068 [00:00<?, ?it/s]

  0%|          | 0/42068 [00:00<?, ?it/s]

  0%|          | 0/3370 [00:00<?, ?it/s]

### 2.3 Build your model (Fill the code: 10 points)


Here is what do you might need to do in this section:
1. Create a model.
2. Add an embedding layer as the first layer.
3. Add a RNN cell (GRU or LSTM) as the next layer.
4. Add a output layer.
5. Given a sequence words, for each word, predict the next word.

In [18]:
model = Sequential()
model.add(Embedding(tokenizer.vocab_size+1, embedding_size, input_length=window_size))
model.add(Bidirectional(LSTM(hidden_size)))
model.add(Dense(tokenizer.vocab_size+1, activation='softmax'))

adam = Adam(learning_rate=learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 20, 200)           1978800   
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               240800    
_________________________________________________________________
dense (Dense)                (None, 9894)              1988694   
Total params: 4,208,294
Trainable params: 4,208,294
Non-trainable params: 0
_________________________________________________________________


### 2.4 Setup the training step and train the model (Fill the code: 10 points)
Based on your implementation, setup your training process.

In [19]:
history = model.fit(X_train, Y_train, epochs=num_epochs, verbose=1, batch_size=batch_size, validation_data=(X_val, Y_val))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30

KeyboardInterrupt: 

In [None]:
model.save('Bi LSTM Model')

### 2.5 Evaluate the model (15 points)
Calculate the model's perplexity on the valid set.

#### 2.5.1 Deliverable (5 points)
Prove
<center>$perplexity = exp(\frac{total\ loss}{number\ of\ predictions})$
    
*You can either list the steps in the notebook or submit a pdf with all the steps in the submission.*

#### 2.5.2 Implement the algorithm to calculate the perplexity of the model. (10 points)

In [None]:
perp = 0.

### Start your code
val_data_sequences = []
for line in tqdm(valid_texts, total=len(valid_texts)):
    token_list = tokenizer.texts_to_sequences(line)
    val_data_sequences.append(token_list[-(window_size+1):])

val_data_sequences = pad_sequences(val_data_sequences, maxlen=window_size+1)

X_val = val_data_sequences[:,:-1]
labels_val = val_data_sequences[:,-1]
Y_val = tf.keras.utils.to_categorical(labels_val, num_classes=tokenizer.vocab_size+1, dtype='float32')

crossentropy_loss = CategoricalCrossentropy()

loss = []

for i,j in zip(X_val, Y_val):
    prediction = model.predict(i.reshape(1,-1))
    loss.append(float(crossentropy_loss(j,prediction.reshape(-1))))

perp = np.exp(sum(loss)/len(loss))


### End

print(f'The perplexity of of RNN based model is: {perp:.4f}')


The perplexity of RNN based model is: 11.1432


### 2.6 Use RNN language modeling make predictions (10 points)
Print the predictions of next words using the RNN model for the same 30 lines of input.txt as in section 1.6

In [78]:
### Start your code

test_data_sequences = []
sampled_texts = random.sample(test_texts, 30)
for i, text in enumerate(sampled_texts):
    
    token_list = tokenizer.texts_to_sequences(text)
    test_data_sequences.append(token_list[1:-1][-window_size:])

X_test = pad_sequences(test_data_sequences, maxlen=window_size)

Y_test_pred = model.predict_classes(X_test)

for i,data in enumerate(zip(sampled_texts, Y_test_pred)):
    clean_text = preprocesser.apply(data[0])
    tokens = tokenize(clean_text)
    pred = list(tokenizer.unique_tokens)[data[1]-1]
    print(f'{i} ==> {data[0].strip()}, prediction: {pred}')

### End


0 ==> at stake is what mike <unk> compaq 's president of ___, prediction: the
1 ==> it said the N rail cars are in addition to ___, prediction: a
2 ==> if the dollar stays weak he says that will add ___, prediction: n
3 ==> on a recent saturday night in the midst of west ___, prediction: germany
4 ==> it is bigger faster and more profitable than the news ___, prediction: could
5 ==> that will translate into sharply higher production profits particularly compared ___, prediction: with
6 ==> the sales drop for the no. N car maker may ___, prediction: <unk>
7 ==> grumman corp. received an $ N million navy contract to ___, prediction: build
8 ==> those that pulled out of stocks <unk> it he said ___, prediction: </s>
9 ==> they say investors should sell stocks but not necessarily right ___, prediction: </s>
10 ==> the name <unk> in rumors is british petroleum co. which ___, prediction: has
11 ==> the <unk> is believed to be the first such cross-border ___, prediction: in
12 ==> but recent d