## <span style="color:#0b486b">Set random seeds</span>

We start with importing tensorflow and numpy and setting random seeds for TF and numpy. You can use any seeds you prefer.

In [159]:
import numpy as np
import tensorflow as tf

tf.random.set_seed(6789)
np.random.seed(6789)

## <span style="color:#0b486b">Part 1: Download and preprocess the data</span>

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

The dataset we use for this assignment is a question classification dataset for which the train set consists of $5,500$ questions belonging to 6 coarse question categories including:
- abbreviation (ABBR), 
- entity (ENTY), 
- description (DESC), 
- human (HUM), 
- location (LOC) and 
- numeric (NUM).


Preprocessing data is an inital and important step in any machine learning or deep learning projects. The following *DataManager* class helps you to download data and preprocess data for the later steps of a deep learning project. 

In [160]:
import os
import zipfile
import collections
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
plt.style.use('ggplot')

class DataManager:
    def __init__(self, verbose=True, maxlen= 50, random_state=6789):
        self.verbose = verbose
        self.max_sentence_len = 0
        self.str_questions = list()
        self.str_labels = list()
        self.numeral_labels = list()
        self.maxlen = maxlen
        self.numeral_data = list()
        self.random_state = random_state
        self.random = np.random.RandomState(random_state)
        
    @staticmethod
    def maybe_download(dir_name, file_name, url, verbose= True):
        if not os.path.exists(dir_name):
            os.mkdir(dir_name)
        if not os.path.exists(os.path.join(dir_name, file_name)):
            urlretrieve(url + file_name, os.path.join(dir_name, file_name))
        if verbose:
            print("Downloaded successfully {}".format(file_name))
    
    def read_data(self, dir_name, file_names):
        for file_name in file_names:
            file_path= os.path.join(dir_name, file_name)
            self.str_questions= list(); self.str_labels= list()
            with open(file_path, "r", encoding="latin-1") as f:
                for row in f:
                    row_str= row.split(":")
                    label, question= row_str[0], row_str[1]
                    question= question.lower()
                    self.str_labels.append(label)
                    self.str_questions.append(question[0:-1])
                    if self.max_sentence_len < len(self.str_questions[-1]):
                        self.max_sentence_len= len(self.str_questions[-1])
         
        # turns labels into numbers
        le= preprocessing.LabelEncoder()
        le.fit(self.str_labels)
        self.numeral_labels = np.array(le.transform(self.str_labels))
        self.str_classes= le.classes_
        self.num_classes= len(self.str_classes)
        if self.verbose:
            print("\nSample questions... \n")
            print(self.str_questions[0:5])
            print("Labels {}\n\n".format(self.str_classes))
    
    def manipulate_data(self):
        tokenizer = tf.keras.preprocessing.text.Tokenizer()
        tokenizer.fit_on_texts(self.str_questions)
        self.numeral_data = tokenizer.texts_to_sequences(self.str_questions)
        self.numeral_data = tf.keras.preprocessing.sequence.pad_sequences(self.numeral_data, padding='post', truncating= 'post', maxlen= self.maxlen)
        self.word2idx = tokenizer.word_index
        self.word2idx = {k:v for k,v in self.word2idx.items()}
        self.idx2word = {v:k for k,v in self.word2idx.items()}
        self.vocab_size = len(self.word2idx)
    
    def train_valid_split(self, train_ratio=0.9):
        idxs = np.random.permutation(np.arange(len(self.str_questions)))
        train_size = int(train_ratio*len(idxs)) +1
        self.train_str_questions, self.valid_str_questions = self.str_questions[0:train_size], self.str_questions[train_size:]
        self.train_numeral_data, self.valid_numeral_data = self.numeral_data[0:train_size], self.numeral_data[train_size:]
        self.train_numeral_labels, self.valid_numeral_labels = self.numeral_labels[0:train_size], self.numeral_labels[train_size:]
        self.tf_train_set = tf.data.Dataset.from_tensor_slices((self.train_numeral_data, self.train_numeral_labels))
        self.tf_valid_set = tf.data.Dataset.from_tensor_slices((self.valid_numeral_data, self.valid_numeral_labels))

In [161]:
print('Loading data...')

dm = DataManager(maxlen=100)
dm.read_data("Data/", ["train_set.label"])   # read data

Loading data...

Sample questions... 

['manner how did serfdom develop in and then leave russia ?', 'cremat what films featured the character popeye doyle ?', "manner how can i find a list of celebrities ' real names ?", 'animal what fowl grabs the spotlight after the chinese year of the monkey ?', 'exp what is the full form of .com ?']
Labels ['ABBR' 'DESC' 'ENTY' 'HUM' 'LOC' 'NUM']




In [162]:
dm.manipulate_data()
dm.train_valid_split(train_ratio=0.6)

You now have a data manager, named *dm* containing the training and validiation sets in both text and numeric forms. Your task is to play around and read this code to figure out the meanings of some important attributes that will be used in the next parts.

#### <span style="color:red"></span> 
**What is the purpose of `self.train_str_questions` and `self.train_numeral_labels`? Write your code to print out the first five questions with labels in the training set.**

<div style="text-align: right"><span style="color:red"></span></div> 

The purpose of self.train_str_questions is to store all the quentions present in the training dataset by reading the data as above and there are 5,500 questions in the dataset belonging to 6 coarse question categories. Therefore, self.train_str_questions is the extraction of questions from the training data into the list.


The purpose of self.train_numeral_labels is to store the corresponding labels for the above questions in train_str_questions list. The benifit of self.train_numeral_labels list is that it would have the labels converted into numerical form from the original string format which would make it easier in the next stages to classify the questions values into corresponding labels. 

Below, top-5 string questions in the list of training dataset with their corresponding labels are printed in the form of item.  


In [163]:
for i in range(len(dm.train_str_questions)):
    for j in range(len(dm.train_numeral_labels)):
        if i==j:
            print("({:d}: {:s})".format(dm.train_numeral_labels[j],dm.train_str_questions[i]))
        
        if i>=5 or j>=5:
            break


(1: manner how did serfdom develop in and then leave russia ?)
(2: cremat what films featured the character popeye doyle ?)
(1: manner how can i find a list of celebrities ' real names ?)
(2: animal what fowl grabs the spotlight after the chinese year of the monkey ?)
(0: exp what is the full form of .com ?)


#### <span style="color:red"></span> 
**What is the purpose of `self.train_numeral_data`? Write your code to print out the first five questions in the numeric format with labels in the training set.**

<div style="text-align: right"><span style="color:red"></span></div> 

The purpose of self.train_numeral_data is to storing the questions in from the training dataset into numerical form of 100 dimension array. This would help converting the texts (string) of the questions into numerical array form. Therefore, self.train_numeral_data is useful to make text analysis of the given text (questions) with numerical values allocated for each question.  



In [164]:
for i in range(len(dm.train_numeral_data)):
    for j in range(len(dm.train_numeral_labels)):
        if i==j:
            print("({:d}: {})".format(dm.train_numeral_labels[j],dm.train_numeral_data[i]))
        
        if i>=5 or j>=5:
            break


(1: [  29    8   19 3497 2219    5   16  433  814  990    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0])
(2: [  32    2  815  619    1  148 1255 3498    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0 

#### <span style="color:red"></span> 
**What is the purpose of two dictionaries: `self.word2idx` and `self.idx2word`? Write your code to print out the first five key-value pairs of those dictionaries.**

<div style="text-align: right"><span style="color:red"></span></div> 

The dictionary self.word2idx defines the words and their corresponding labels into key-vlue form. The dictionary contains one pair for each word and the pair consisting of word and its corresponding key. So, based on the key, dictionary can find its corresponding word which is the pair element.

The dictionary self.idx2word contains the same structure as self.word2idx dictionary but it would store labels for each word and based on word as the key, the labels can be retained. 

This will be useful for analysing the sentence or string of words which can be classified into words and dictionary would find the labels based on corresponding words which can be used to calculate the features of the sentence or string overall.



In [165]:
for i, (key,word) in enumerate(dm.idx2word.items()):
        print("({:d}, {:s})".format(key,word))
        if i >= 4:
            break
            
print("-----------")
    
for i, (word,key) in enumerate(dm.word2idx.items()):
        print("({:s}, {:d})".format(word,key))
        if i >= 4:
            break

(1, the)
(2, what)
(3, is)
(4, of)
(5, in)
-----------
(the, 1)
(what, 2)
(is, 3)
(of, 4)
(in, 5)


#### <span style="color:red"></span> 
**What is the purpose of `self.tf_train_set`? Write your code to print out the first five items of `self.tf_train_set`.**

<div style="text-align: right"><span style="color:red"></span></div> 

The self.tf_train_set takes the slices of an array of self.train_numeral_data (which is the numerical conversion of questions in string form).

With the help of tf.data.Dataset.from_tensor_slices() method, we can get the slices of an array in the form of objects and we can allocate the labels into another tensor object. The method from_tensor_slices() accepts individual (or multiple) Numpy (or Tensors) objects. In case multiple objects are inputted, it passes them as tuple and make sure that all the objects have same size in zeroth dimension. Therefore, self.tf_train_set would have the numerical form of question in the training dataset and the corresponding labels for each question in the form of object which would be retained in the tensor form. 

The training dataset in such form would be useful while fitting the model as it becomes easy to enter the training dataset containing both questions and labels in the form of objects in numeric datatype. 


In [166]:
for i, (x,y) in enumerate(dm.tf_train_set):
    print(x,y)
    
    if i>=4:
        break


tf.Tensor(
[  29    8   19 3497 2219    5   16  433  814  990    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0], shape=(100,), dtype=int32) tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(
[  32    2  815  619    1  148 1255 3498    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0

#### <span style="color:red"></span> 
**What is the purpose of `self.tf_valid_set`? Write your code to print out the first five items of `self.tf_valid_set`.**

<div style="text-align: right"><span style="color:red"></span></div> 

The self.tf_valid_set takes the slices of an array of self.valid_numeral_data (which is the numerical conversion of questions in string form).

In the same format as the training dataset, tf.data.Dataset.from_tensor_slices() method converts the validation dataset into form of objects carrying the numerical form of question and corresponding labels for each question. 

While fitting the model, the argument for validation dataset can be given as tensor form as produced here which would make it in the compatible form to access the data along with their labels.

In [167]:
for i, (x,y) in enumerate(dm.tf_valid_set):
    print(x,y)
    
    if i>=4:
        break


tf.Tensor(
[  38   12  279    1   33 2178    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0], shape=(100,), dtype=int32) tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(
[  27    2 6443  584   27   55    1 6444  158 6445    1 6446  158   69
 6447    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0

## <span style="color:#0b486b">Part 2: Using Word2Vect to transform texts to vectors </span>

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

In this part, you will be assessed on how to use a pretrained Word2Vect model for realizing a machine learning task. Basically, you will use this pretrained Word2Vect to transform the questions in the above dataset stored in the *data manager object dm* to numeric form for training a Support Vector Machine in sckit-learn.  

In [168]:
import gensim.downloader as api
from gensim.models import Word2Vec
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

#### <span style="color:red"></span> 
**Write code to download the pretrained model *glove-wiki-gigaword-100*. Note that this model transforms a word in its dictionary to a $100$ dimensional vector.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [169]:
word2vect = api.load("glove-wiki-gigaword-100") #Downloading the pre-trained model using "api"

#### <span style="color:red"></span> 
**Write code for the function get_word_vector(word, model) used to transform a word to a vector usingthe pretrained Word2Vect model model. Note that for a word not in the vocabulary of our word2vect,you need to return a vector $0$ with 100 dimensions.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [171]:
def get_word_vector(word, model):
    #Transforming a word to vector with Word2Vect model
    try:
        vector = model.get_vector(word) 
    except: #word not in the vocabulary
        vector = np.zeros([model.vector_size]) 
    return vector

#### <span style="color:red"></span> 

**Write the code for the function *get_sentence_vector(sentence, important_score=None, model= None)*. Note that this function will transform a sentence to a 100-dimensional vector using the pretrained model *model*. In addition, the list *important_score* which has the same length as the *sentence* specifies the important scores of the words in the sentence. In your code, you first need to apply *softmax* function over *important_score* to obtain the important weight *important_weight* which forms a probability over the words of the sentence. Furthermore, the final vector of the sentence will be weighted sum of the individual vectors for words and the weights in *important_weight*.**
- $final\_vector= important\_weight[1]\times v[1] + important\_weight[2]\times v[2] + ...+ important\_weight[L]\times v[L]$ where $L$ is the length of the sentence and $v[i]$ is the vector of the word $i-th$ in this sentence.

**Note that if *important_score=None* is set by default, your function should return the average of all representation vectors corresponding to set *important_score=[1,1,...,1]*.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [172]:
def get_sentence_vector(sentence, important_score=None, model= None):
    #Transforming a sentence to a 100 dimensional vector
    #Here the important_score is calculated using the index of the token 
    #or word into sentence where the index of the last word will be the 
    #important_score value of the first word in the sentence, second
    #last word's index to important_score value of second word and so on    
    
    token = sentence.split() #Splitting the sentence into tokens
    
    important_score_values = [token for token in range(len(token))]#Allocating the index values
    important_score_values = important_score_values[::-1] #Reversing the order 
    
    
    
    if important_score == None:
        important_score = np.ones([len(token)]) #In case none is received, it will be equal 
        
    else:
        important_score = important_score_values  
    
    imp_exp = np.exp(important_score - np.max(important_score))
    important_weight = imp_exp / imp_exp.sum(axis=0) #Activation function weight allocation
    
    vec = [get_word_vector(t, model) for t in token] #Retrieving the word into numeric vector
    
    vector = []
    for i in range(len(important_weight)):
        vector += [important_weight[i] * (np.array(vec[i]))] #Calculation for the sentence
    
    if len(vector) > 0:
        vector = np.asarray(vector).sum(axis = 0) 
    return vector
    

#### <span style="color:red"></span> 

**Write code to transform the training questions in *dm.train_str_questions* to feature vectors. Note that after running the following cell, you must have $X\_train$ which is an numpy array of the feature vectors and $y\_train$ which is an array of numeric labels (*Hint: dm.train_numeral_labels*). You can add more lines to the following cell if necessary. In addition, you should decide the *important_score* by yourself. For example, you might reckon that the 1st score is 1, the 2nd score is decayed by 0.9, the 3rd is decayed by 0.9, and so on.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [174]:
print("Transform training set to feature vectors...")

X_train= [] 
y_train= dm.train_numeral_labels #Allocating the labels into training array

for line in dm.train_str_questions:
    vector = get_sentence_vector(line,100,word2vect) #Converting sentences into vec

    if len(vector)>0:
        X_train += [vector] #Building the training dataset with all sentences vectors



Transform training set to feature vectors...


#### <span style="color:red"></span> 

**Write code to transform the training questions in *dm.valid_str_questions* to feature vectors. Note that after running the following cell, you must have $X\_valid$ which is an numpy array of the feature vectors and $y\_valid$ which is an array of numeric labels (*Hint: dm.valid_numeral_labels*). You can add more lines to the following cell if necessary. In addition, you should decide the *important_score* by yourself. For example, you might reckon that the 1st score is 1, the 2nd score is decayed by 0.9, the 3rd is decayed by 0.9, and so on.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [175]:
print("Transform valid set to feature vectors...")

X_valid= []
y_valid= dm.valid_numeral_labels #Allocating the labels into validation array



for line in dm.valid_str_questions:
    vector = get_sentence_vector(line,important_score,word2vect) #Converting sentences into vec

    if len(vector)>0:
        X_valid += [vector] #Building the validation dataset with all sentences vectors


Transform valid set to feature vectors...


#### <span style="color:red"></span> 

**It is now to use *MinMaxScaler(feature_range=(-1,1))* in sckit-learn to scale both training and valid sets to the range $(-1,1)$.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [176]:
scaler = MinMaxScaler(feature_range=(-1,1))

scaler.fit(X_train)
X_train = scaler.transform(X_train) #Scalling the training dataset values into range (-1,1)
scaler.fit(X_valid)
X_valid = scaler.transform(X_valid) #Scalling the validation dataset values into range (-1,1)

#### <span style="color:red"></span> 

**Declare a support vector machine (the class *SVC*  in sckit-learn) with RBF kernel, $C=1$, $gamma= 2^{-3}$ and fit on the training set.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [177]:
svm = SVC(C=1, kernel = 'rbf',gamma = 2**(-3)) #the class SVC
svm.fit(X_train, y_train) #fitting svm on X_train

SVC(C=1, gamma=0.125)

#### <span style="color:red"></span> 

**Finally, we use the trained *svm* to evaluate on the valid set $X\_valid$.**

<div style="text-align: right"><span style="color:red"></span></div> 

In [178]:
y_valid_pred= svm.predict(X_valid) #predicting the values on validation dataset for testing
acc = accuracy_score(y_valid, y_valid_pred)#Computing the accuracy on validation actual values 
print(acc)

0.9660550458715597


## <span style="color:#0b486b">Part 3: Text CNN for sequence modeling and neural embedding </span>

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

#### <span style="color:red"></span> 

**In what follows, you are required to complete the code for Text CNN for sentence classification. The paper of Text CNN can be found at this [link](https://www.aclweb.org/anthology/D14-1181.pdf). Here is the description of the Text CNN you need to construct.**
- There are three attributes (properties or instance variables): *embed_size, state_size, data_manager*.
  - `embed_size`: the dimension of the vector space for which the words are embedded to using the embedding matrix.
  - `state_size`: the number of filters used in *Conv1D* (reference [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D)).
  - `data_manager`: the data manager to store information of the dataset.
- The detail of the computational process is as follows:
  - Given input $x$, we embed $x$ using the embedding matrix to obtain an $3D$ tensor $[batch\_size \times vocab\_size \times embed\_size]$ as $h$.
  - We feed $h$ to three Convd 1D layers, each of which has $state\_size$ filters, padding=same, activation= relu, and $kernel\_size= 3, 5, 7$ respectively to obtain $h1, h2, h3$. Note that each $h1, h2, h3$ is a 3D tensor with the shape $[batch\_size \times output\_size \times state\_size]$.
  - We then apply *GlobalMaxPool1D()* (reference [here](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalMaxPool1D)) over $h1, h2, h3$ to obtain 2D tensors stored in $h1, h2, h3$ again.
  - We then concatenate three 2D tensors $h1, h2, h3$ to obtain $h$. Note that you need to specify the axis to concatenate.
  - We finally build up one dense layer on the top of $h$ for classification.
  
  <div style="text-align: right"><span style="color:red"></span></div>
  

In [109]:
class TextCNN:
    def __init__(self, embed_size= 128, state_size=16, data_manager=None):
        self.data_manager = data_manager
        self.embed_size = embed_size
        self.state_size = state_size
    
    def build(self):
        x = tf.keras.layers.Input(shape=[None])
        h = tf.keras.layers.Embedding(self.data_manager.vocab_size +1, self.embed_size)(x)
        h1 = tf.keras.layers.Conv1D(filters = self.state_size, padding = 'same', kernel_size=3, activation= 'relu')(h)#1D convolutional layer
        h2 = tf.keras.layers.Conv1D(filters = self.state_size, padding = 'same', kernel_size=5, activation= 'relu')(h)#1D convolutional layer
        h3 = tf.keras.layers.Conv1D(filters = self.state_size, padding = 'same', kernel_size=7, activation= 'relu')(h)#1D convolutional layer
        h1 = tf.keras.layers.GlobalMaxPool1D()(h1)#1D Global max pooling layer
        h2 = tf.keras.layers.GlobalMaxPool1D()(h2)#1D Global max pooling layer
        h3 = tf.keras.layers.GlobalMaxPool1D()(h3)#1D Global max pooling layer
        h = tf.keras.layers.concatenate([h1,h2,h3],axis=-1,name="concatenate")#Concatenation of h1, h2 and h3 layers
        h = tf.keras.layers.Dense(self.data_manager.num_classes, activation='softmax')(h)
        self.model = tf.keras.Model(inputs=x, outputs=h) 
    
    def compile_model(self, *args, **kwargs):
        self.model.compile(*args, **kwargs)
    
    def fit(self, *args, **kwargs):
        return self.model.fit(*args, **kwargs)
    
    def evaluate(self, *args, **kwargs):
        self.model.evaluate(*args, **kwargs)


#### <span style="color:red"></span> 
**Here is the code to test TextCNN above. You can observe that TextCNN outperforms the traditional approach SVM + Word2Vect for this task. The reason is that TextCNN enables us to automatically learn the feature that fits to the task. This makes deep learning different from hand-crafted feature approaches. Complete the code to test the model. Note that when compiling the model, you can use the Adam optimizer.**

<div style="text-align: right"><span style="color:red"></span></div>

In [110]:
text_cnn = TextCNN(data_manager=dm)
text_cnn.build()
text_cnn.compile_model(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) #compile the model
text_cnn.fit(dm.tf_train_set.batch(64), validation_data=dm.tf_valid_set.batch(64), epochs=20) #train the model on 20 epochs
#Proving the better result than hand-crafted approch with validation accuracy of 97.94%
#as compared to 96.60% by hand-crafted SVM model 

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x22322b6b748>

## <span style="color:#0b486b">Part 4: RNNs for sequence modeling and neural embedding </span>

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

### <span style="color:#0b486b">4.1. One-directional RNNs for sequence modeling and neural embedding </span> ###

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

#### <span style="color:red"></span> 
**In this part, you need to construct an RNN to learn from the dataset of interest. Basically, you are required first to construct the class UniRNN (Uni-directional RNN) with the following requirements:**
- Attribute `data_manager (self.data_manager)`: specifies the data manager used to store data for the model.
- Attribute `cell_type (self.cell_type)`: can receive three values including `basic_rnn`, `gru`, and `lstm` which specifies the memory cells formed a hidden layer.
- `state_sizes (self.state_sizes)` indicates the list of the hidden sizes from the second hidden layers of memory cells. For example, $embed\_size =128$ and $state\_sizes = [64, 64]$ means that you have three hidden layers in your network with hidden sizes of $128, 64$ and $64$ respectively.

**Note that when declaring an embedding layer for the network, you need to set *mask_zero=True* so that the padding zeros in the sentences will be masked and ignored. This helps to have variable length RNNs. For more detail, you can refer to this [link](https://www.tensorflow.org/guide/keras/masking_and_padding).**

<div style="text-align: right"><span style="color:red"></span></div>

In [181]:
class UniRNN:
    def __init__(self, cell_type= 'gru', embed_size= 128, state_sizes= [128, 64], data_manager= None):
        self.cell_type = cell_type
        self.state_sizes = state_sizes
        self.embed_size = embed_size
        self.data_manager = data_manager
        self.vocab_size = self.data_manager.vocab_size +1 
        
    #return the correspoding memory cell
    @staticmethod
    def get_layer(cell_type= 'gru', state_size= 128, return_sequences= False, activation = 'tanh'):
        if cell_type=='gru':
            return tf.keras.layers.GRU(state_size, return_sequences=return_sequences) #GRU memory cell
        elif cell_type== 'lstm':
            return tf.keras.layers.LSTM(state_size, return_sequences=return_sequences)#LSTM memory cell
        else:
            return tf.keras.layers.SimpleRNN(state_size, return_sequences=return_sequences) #Basic RNN memory cell
    
    def build(self):
        x = tf.keras.layers.Input(shape=[None])
        h = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero= True)(x)#Embedding layer
        num_layers = len(self.state_sizes) #number of layers
        
        for i in range(num_layers):
            if i==num_layers-1:
                h = UniRNN.get_layer(self.cell_type,self.state_sizes[i],return_sequences= False, activation = 'tanh')(h) #calling get_layer method to run memory cell 
                                                                                                                         #when the current layer is the last one for the output shape
            else:
                h = UniRNN.get_layer(self.cell_type,self.state_sizes[i],return_sequences= True, activation = 'tanh')(h) #calling get_layer method to run memory cell 
                                                                                                                         #when the current layer is not the last one for the output shape
           
        h = tf.keras.layers.Dense(dm.num_classes, activation='softmax')(h)
        self.model = tf.keras.Model(inputs=x, outputs=h)
   
    def compile_model(self, *args, **kwargs):
        self.model.compile(*args, **kwargs)
    
    def fit(self, *args, **kwargs):
        return self.model.fit(*args, **kwargs)
    
    def evaluate(self, *args, **kwargs):
        self.model.evaluate(*args, **kwargs)       


#### <span style="color:red"></span> 
**Run with basic RNN ('basic_rnn') cell with $embed\_size= 128, state\_sizes= [128, 128], data\_manager= dm$.**

<div style="text-align: right"><span style="color:red"></span></div>

In [112]:
uni_rnn = UniRNN(None, embed_size=128, state_sizes=[128,128], data_manager=dm)#Basic RNN network
uni_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
uni_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
uni_rnn.fit(dm.tf_train_set.batch(64), epochs=20, validation_data = dm.tf_valid_set.batch(64))

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x2232300af98>

#### <span style="color:red"></span> 
**Run with GRU ('gru') cell with $embed\_size= 128, state\_sizes= [128, 128], data\_manager= dm$.**

<div style="text-align: right"><span style="color:red"></span></div>

In [113]:
uni_rnn = UniRNN('gru', embed_size=128, state_sizes=[128,128], data_manager=dm)#GRU cell
uni_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
uni_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
uni_rnn.fit(dm.tf_train_set.batch(64), epochs=20, validation_data = dm.tf_valid_set.batch(64))

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x22338733630>

#### <span style="color:red"></span> 
**Run with LSTM ('lstm') cell with $embed\_size= 128, state\_sizes= [128, 128], data\_manager= dm$.**

<div style="text-align: right"><span style="color:red"></span></div>

In [114]:
uni_rnn = UniRNN('lstm', embed_size=128, state_sizes=[128,128], data_manager=dm)#LSTM cell
uni_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
uni_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
uni_rnn.fit(dm.tf_train_set.batch(64), epochs=20, validation_data = dm.tf_valid_set.batch(64))

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x223509f7438>

#### <span style="color:red"></span> 
**Give your own comments about the performance of three memory cells for the dataset of interest as well as what happening during the training process of each cell. Note that there are not right or wrong comments and your comments rely on the status of your training. In addition, some comments and hypothesized assessments of what and why are occurring are useful to obtain the highest score for this question.**

<div style="text-align: right"><span style="color:red"></span></div>

The UniRNN() is one directional network built with three memory cell types which are Simple RNN, GRU and LSTM. In terms of validation accuracy, all three models for their memory cells are performing quite similar. Applying both basic and LSTM memory cells for RNN gives the same validation accuracy of 97.43%, while GRU memory cell for RNN has a validation accuracy of 97.75% which is slight better performance than the other two memory cells.  

Simple RNN memory is only capable of saving the information on previous data in the hidden state. When words get transformed into machine-readable vectors using simple RNN, it gets processed one by one sequence of vectors and this way, it passes the previous hidden state to the next step of the sequence. And hidden state acts as the neural network memory which is the only information RNN can store which gets overwritten at each step. Hence, not being able to store information further makes it model decreasing its capability to learn previous aspects. But the computation process of simple RNN includes the tanh activation function.

The tanh activation is used to help regulate the values flowing through the network which makes sure the values stay between -1 and 1. Activation function helps saving the model from running into gradient exploding issues which is the reason simple RNN succeeds in providing decent results in the end. Finally, simple RNN has very few operations internally but works pretty well given the right circumstances and here given the short sequences, it performs good.

On the other hand, LSTM can choose which information is relevant to remember or forget during sequence processing because of its memory cell structure. LSTM also has access to forget layer additionally which can remove the information if not relevant to remember.

Finally, the reason behind GRU performing better is that GRU is the newer generation of Recurrent Neural networks with upgraded fetures. Instead of using the cell state, it has the hidden state to transfer information. It also only has two gates, a reset gate and update gate where update gate gets to decides what information to throw away and what new information to add and reset gate gets to decide how much past information to forget. Therefore, GRU is having features which allows to access past information in more flexible and speedier way. So, GRU's fewer tensor operations makes them a little speedier to train than LSTM’s which makes the main difference between their performances.




### <span style="color:#0b486b">4.2. Bi-directional RNNs for sequence modeling and neural embedding </span> ###

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

#### <span style="color:red"></span> 
**In what follow, you will investigate BiRNN. The task is similar to Part 4.1 but you need to write the code for an BiRNN. Note that the function *get_layer(cell_type= 'gru', state_size= 128, return_sequences= False, activation = 'tanh')* has to return the hidden layer with bidirectional memory cells (e.g., Basic RNN, GRU, and LSTM cells).**

**Complete the code of the class *BiRNN*. Note that for the embedding layer you need to set *mask_zero=True*.**

<div style="text-align: right"><span style="color:red"></span></div>

In [180]:
class BiRNN:
    def __init__(self, cell_type= 'gru', embed_size= 128, state_sizes= [128, 64], data_manager= None):
        self.cell_type = cell_type
        self.state_sizes = state_sizes
        self.embed_size = embed_size
        self.data_manager = data_manager
        self.vocab_size = self.data_manager.vocab_size +1
        
    @staticmethod
    def get_layer(cell_type= 'gru', state_size= 128, return_sequences= False, activation = 'tanh'):
        if cell_type=='gru':
            return tf.keras.layers.Bidirectional(tf.keras.layers.GRU(state_size, return_sequences=return_sequences)) #GRU cell execution in bi-direction
        elif cell_type== 'lstm':
            return tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(state_size, return_sequences=return_sequences)) #LSTM cell execution in bi-direction
        else:
            return tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(state_size, return_sequences=return_sequences)) #Simple RNN cell execution in bi-direction
    
    def build(self):
        x = tf.keras.layers.Input(shape=[None])
        
        h = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero= True)(x)#Embedding layer
        num_layers = len(self.state_sizes) #Number of layers
        
        for i in range(num_layers):
            if i==num_layers-1:
                h = BiRNN.get_layer(self.cell_type,self.state_sizes[i],return_sequences= False, activation = 'tanh')(h) #calling get_layer method of bi-directional network to run memory cell 
                                                                                                                         #when the current layer is the last one for the output shape
            else:
                h = BiRNN.get_layer(self.cell_type,self.state_sizes[i],return_sequences= True, activation = 'tanh')(h) #calling get_layer method of bi-directional network to run memory cell 
                                                                                                                         #when the current layer is not the last one for the output shape
        
        h = tf.keras.layers.Dense(dm.num_classes, activation='softmax')(h)
        self.model = tf.keras.Model(inputs=x, outputs=h)
        
    
    def compile_model(self, *args, **kwargs):
        self.model.compile(*args, **kwargs)
    
    def fit(self, *args, **kwargs):
        return self.model.fit(*args, **kwargs)
    
    def evaluate(self, *args, **kwargs):
        self.model.evaluate(*args, **kwargs)       


#### <span style="color:red"></span> 
**Run BiRNN for basic RNN ('basic_rnn') cell with $embed\_size= 128, state\_sizes= [128, 128], data\_manager= dm$.**

<div style="text-align: right"><span style="color:red"></span></div>

In [116]:
bi_rnn = BiRNN(None, embed_size=128, state_sizes=[128,128], data_manager=dm)#Simple RNN in bi-direction
bi_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
bi_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
bi_rnn.fit(dm.tf_train_set.batch(64), epochs=20, validation_data = dm.tf_valid_set.batch(64))

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x2235bd62c88>

#### <span style="color:red"></span> 
**Run BiRNN for GRU ('gru') cell with $embed\_size= 128, state\_sizes= [128, 128], data\_manager= dm$.**

<div style="text-align: right"><span style="color:red"></span></div>

In [117]:
bi_rnn = BiRNN('gru', embed_size=128, state_sizes=[128,128], data_manager=dm)#GRU in bi-direction
bi_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
bi_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
bi_rnn.fit(dm.tf_train_set.batch(64), epochs=20, validation_data = dm.tf_valid_set.batch(64))

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x2236c895668>

#### <span style="color:red"></span> 
**Run BiRNN for LSTM ('lstm') cell with $embed\_size= 128, state\_sizes= [128, 128], data\_manager= dm$.**

<div style="text-align: right"><span style="color:red"></span></div>

In [118]:
bi_rnn = BiRNN('lstm', embed_size=128, state_sizes=[128,128], data_manager=dm)#LSTM in bi-direction
bi_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
bi_rnn.compile_model(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
bi_rnn.fit(dm.tf_train_set.batch(64), epochs=20, validation_data = dm.tf_valid_set.batch(64))

Train for 52 steps, validate for 35 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x2238e0ea2b0>

#### <span style="color:red"></span> 

**Give your own comments about the performance of three memory cells for the dataset of interest as well as comparing BiRNN to UniRNN in Part 1.**

<div style="text-align: right"><span style="color:red"></span></div>

BiRNN() is the bidirectional network built with three memory cells such as basic, GRU and LSTM. The three memory cells of BiRNN perform very similar to each other just as for uni direction RNN network. Basic RNN memory cell reports the highest validation accuracy of 97.71% for BiRNN network while GRU and LSTM records 97.52% and 97.25% correspondingly. 

Overall, UniRNN and BiRNN have memory cells performing similar to each other (in the range between 97 and 98%). However, GRU cell of UniRNN network performs the best of all combination between network and memory cells. LSTM seemingly performs slight better in UniRNN network than in BiRNN. Additionally, Basic RNN cell gives better result for BiRNN as compared to what it gives for UniRNN.  


Here, it seems the bi-directional approach in this case have not made too much difference in the performance of the various memory cells. RNN preserves information from embedding layer that has already passed through it using the hidden state. Unidirectional RNN only preserves information of the past because the only word embedding it has seen are from the past.

Using bidirectional will run the embedding layer inputs in two ways, one from past to future and one from future to past and here, in the RNN that runs backwards you preserve information from the future and using the two hidden states combined you are able in any point in time to preserve information from both past and future. But in this case, the words given are short sequence which might be the reason bi-directional approach is not being too effective but it can be useful for longer sequences. However, the improvement using b-directional RNN can be seen for simple RNN memory cell which is giving better performances as compared to uni-directional. 


### <span style="color:#0b486b">4.3. RNNs with various types, cells, and fine-tuning embedding matrix for sequence modeling and neural embedding </span> ###

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

#### <span style="color:red"></span> 

**In what follows, you are required to combine the code in Part 1 and Part 2 to gain a general RNN which can be either Uni-directional RNN or Bi-directional RNN and the embedding matrix can be initialized using a pretrained Word2Vect.**

**Below are the descriptions of the attributes of the class *RNN*:**
- `run_mode (self.run_mode)` has three values (scratch, init-only, and init-fine-tune).
  - `scratch` means training the embedding matrix from scratch.
  - `init-only` means only initialzing the embedding matrix with a pretrained Word2Vect but not further doing fine-tuning that matrix.
  - `init-fine-tune` means both initialzing the embedding matrix with a pretrained Word2Vect and further doing fine-tuning that matrix.
- `network_type (self.network_type)` has two values (uni-directional and bi-directional) which correspond to either Uni-directional RNN or Bi-directional RNN.
- `cell_type (self.cell_type)` has three values (simple-rnn, gru, and lstm) which specify the memory cell used in the network.
- `embed_model (self.embed_model)` specifes the pretrained Word2Vect model used.
-  `embed_size (self.embed_size)` specifes the embedding size. Note that when run_mode is either init-only' or 'init-fine-tune', this embedding size is extracted from embed_model for dimension compatability.
- `state_sizes (self.state_sizes)` indicates the list of the hidden sizes from the second hidden layers of memory cells. For example, $embed\_size =128$ and $state\_sizes = [64, 64]$ means that you have three hidden layers in your network with hidden sizes of $128, 64$ and $64$ respectively.

**Complete the code of the class *RNN*.**

<div style="text-align: right"><span style="color:red"></span></div>

In [182]:
class RNN:
    def __init__(self, run_mode = 'scratch', cell_type= 'gru', network_type = 'uni-directional', embed_model= 'glove-wiki-gigaword-100', 
                 embed_size= 128, state_sizes = [64, 64], data_manager = None):
        self.run_mode = run_mode
        self.data_manager = data_manager
        self.cell_type = cell_type
        self.network_type = network_type
        self.state_sizes = state_sizes
        self.embed_model = embed_model
        self.embed_size = embed_size
        if self.run_mode != 'scratch':
            self.embed_size = int(self.embed_model.split("-")[-1])
        self.data_manager = data_manager
        self.vocab_size = dm.vocab_size +1
        self.word2idx = dm.word2idx
        self.word2vect = None
        self.embed_matrix = np.zeros(shape= [self.vocab_size, self.embed_size])
    
    def build_embedding_matrix(self):
        if os.path.exists("E.npy"):  #if file exists
            self.embed_matrix = np.load("E.npy")           #Load the file for embedding matrix if existed
        else: #file not existed or first-time run
            self.word2vect = api.load(self.embed_model)   #load embedding model
            for word, idx in self.word2idx.items():
                try:
                    self.embed_matrix[idx] = self.word2vect.word_vec(word)    #assign weight for the corresponding word and index
                except KeyError: #word cannot be found
                    pass
            np.save("E.npy", self.embed_matrix)
        
    
    @staticmethod
    def get_layer(cell_type= 'gru', network_type= 'uni-directional', hidden_size= 128, return_sequences= False, activation = 'tanh'):        
        
        if network_type == 'bi-directional': #If selected network is bi-directional, call the method get_layer from bi_rnn to pass the given arguments
            return bi_rnn.get_layer(cell_type= cell_type, state_size= hidden_size, return_sequences= return_sequences, activation = activation)
        
        else: #call the method get_layer from Uni rnn to pass the given arguments
            return UniRNN.get_layer(cell_type= cell_type, state_size= hidden_size, return_sequences= return_sequences, activation = activation)
        
        
    def build(self):
        x = tf.keras.layers.Input(shape=[None])
        if self.run_mode == "scratch": #Build the embedding layer from scratch
            self.embedding_layer = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero= True, trainable= True)
        
        elif self.run_mode == "init-only": #Initialise embedding matrix but without fine-tuning
            self.build_embedding_matrix()
            self.embedding_layer = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero= True, weights= [self.embed_matrix], trainable= False)
                    
        else: #fine-tuned after embeding_matrix
            self.build_embedding_matrix()
            self.embedding_layer = tf.keras.layers.Embedding(self.vocab_size, self.embed_size, mask_zero= True, weights= [self.embed_matrix], trainable= True)
            
            
        h = self.embedding_layer(x)
                
        num_layers = len(self.state_sizes) #number of layers
        
        for i in range(num_layers):
            if i==num_layers-1:
                h = RNN.get_layer(self.cell_type, self.network_type, self.state_sizes[i],return_sequences= False, activation = 'tanh')(h) #calling get_layer method with arguments to run memory cell 
                                                                                                                                          #when the current layer is the last one for the output shape

            else:
                h = RNN.get_layer(self.cell_type, self.network_type, self.state_sizes[i],return_sequences= True, activation = 'tanh')(h) #calling get_layer method with arguments to run memory cell 
                                                                                                                                         #when the current layer is not the last one for the output shape
                 
        h = tf.keras.layers.Dense(dm.num_classes, activation='softmax')(h)
        self.model = tf.keras.Model(inputs=x, outputs=h)
        
    
    def compile_model(self, *args, **kwargs):
        self.model.compile(*args, **kwargs)
    
    def fit(self, *args, **kwargs):
        return self.model.fit(*args, **kwargs)
    
    def evaluate(self, *args, **kwargs):
        self.model.evaluate(*args, **kwargs)       


#### <span style="color:red"></span> 

**Design the experiment to compare three running modes. Note that you should stick with fixed values for other attributes and only vary *run_mode*. Give your comments for the results.**

<div style="text-align: right"><span style="color:red"></span></div>

In [183]:
tf.random.set_seed(6789)
np.random.seed(6789)

In [136]:
rnn1 = RNN(data_manager=dm, run_mode= "scratch") #Running from scratch
rnn1.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
rnn1.compile_model(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

In [137]:
rnn1.fit(dm.tf_train_set.batch(64), epochs=20, validation_data= dm.tf_valid_set.batch(64))

Train for 77 steps, validate for 9 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x223e54ef710>

In [138]:
rnn1.evaluate(dm.tf_valid_set.batch(64))



In [139]:
rnn2 = RNN(data_manager=dm, run_mode= "init-only") #Running with init embedding matrix only
rnn2.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
rnn2.compile_model(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

In [140]:
rnn2.model.fit(dm.tf_train_set.batch(64), epochs=20, validation_data= dm.tf_valid_set.batch(64))
rnn2.evaluate(dm.tf_valid_set.batch(64))

Train for 77 steps, validate for 9 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [141]:
rnn3 = RNN(data_manager=dm, run_mode= "init-fine-tune") #Running with init embedding matrix
                                                        #and fine-tuning
rnn3.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
rnn3.compile_model(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

In [142]:
rnn3.model.fit(dm.tf_train_set.batch(64), epochs=20, validation_data= dm.tf_valid_set.batch(64))
rnn3.evaluate(dm.tf_valid_set.batch(64))

Train for 77 steps, validate for 9 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


The above experiment is conducted to compare the performance for the three running models for the RNN model. For the better comparison, the best network types and memory cells achieved according to the performance of the corresponding model are used in this experiment. So, GRU memory cell, Uni directional RNN model and other arguments stay constant (such as embed size=128, state size = [64,64], data_manager=dm as above) for the experiment and it will compare for the run model of three different types which are:

i) "scratch" means training the embedding matrix from scratch

ii) "init-only" means only initialzing the embedding matrix with a pretrained Word2Vect but not further doing fine-tuning that matrix

iii) "init-fine-tune" means both initialzing the embedding matrix with a pretrained Word2Vect and further doing fine-tuning that matrix

Run mode type 'scratch' has the validation accuracy 98.9% which is pretty good performance considering the model training the embedding matrix from scratch. Initializing the embedding matrix here without fine-tuning provided the joint best results along with the embedding matrix and fine-tuning. Both of them are achieving the validation accuracy of 99.63% which is the best accuracy of any model in the experiment. However, fine-tuning the model allows us to learn the data in more depth which uses the trainable attribute. Therefore, it can be seen that initialising the embeding matrix and then fine-tuning the model provides more stability while training the model as compared to only initialising the matrix. Still, the final validation accuracy suggests that both results are qually best here.

In the end, Run mode 'init-fine-tune' (which is initializing the embedding matrix and fine-tuning) provides the best result along with 'init-only' among all run mode type. Moreover, fine-tuning the embeding matrix does help in the RNN model learning various dimension of data which would provide best result. Here, the model with these two run type provides the accuracy rate of 99.63% which is impressive.




#### <span style="color:red"></span> 

**Run the above general RNN with at least five parameter sets and try to obtain the best performance. You can stick with the running mode *init-fine-tune* and use grid search to tune other parameters. Record your best model which will be used in the next part.**

<div style="text-align: right"><span style="color:red"></span></div>

The experiment is conducted to achieve the best model using the parameter given the best results in previous RNN construction. 

The results reported along with the model parameters are as following:

Model 1:
(data_manager=dm, run_mode= "scratch", cell_type= 'basic_rnn', network_type = 'bi-directional', embed_model= 'glove-wiki-gigaword-100', embed_size= 128, state_sizes = [64, 64])
: accuracy = 98.87%

Model 2: 
(data_manager=dm, run_mode= "init-only", cell_type= 'lstm', network_type = 'uni-directional', embed_model= 'glove-wiki-gigaword-100', embed_size= 128, state_sizes = [64, 64])
: accuracy = 99.45%

Model 3: 
(data_manager=dm, run_mode= "init-only", cell_type= 'basic_rnn', network_type = 'bi-directional', embed_model= 'glove-wiki-gigaword-100', embed_size= 128, state_sizes = [64, 64])
: accuracy = 99.33%

Model 4: 
(data_manager=dm, run_mode= "init-fine-tune", cell_type= 'gru', network_type = 'uni-directional', embed_model= 'glove-wiki-gigaword-100', embed_size= 128, state_sizes = [64, 64])
: accuracy = 99.63%

Model 5: 
(data_manager=dm, run_mode= "init-fine-tune", cell_type= 'lstm', network_type = 'bi-directional', embed_model= 'glove-wiki-gigaword-100', embed_size= 128, state_sizes = [128, 64])
: accuracy = 99.40%

Model 6: 
(data_manager=dm, run_mode="init-fine-tune", cell_type= 'basic_rnn', network_type= 'bi-directional', embed_model='glove-wiki-gigaword-100', embed_size= 128, state_sizes = [128, 64])
: accuracy = 98.29%

Finally, we can conclude that the combinations of parameters used in model 4 provides the highest accuracy among all models, which is 99.63%. 

So, model 4 will be used as part of problem solving model in the next part.


In [141]:
#The run of the best RNN model
my_best_rnn = RNN(data_manager=dm, run_mode= "init-fine-tune", cell_type= 'gru', network_type = 'uni-directional', 
                  embed_model= 'glove-wiki-gigaword-100', embed_size= 128, state_sizes = [64, 64]) 

my_best_rnn.build()
opt = tf.keras.optimizers.RMSprop(learning_rate=0.001)
my_best_rnn.compile_model(optimizer=opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])

In [142]:
my_best_rnn.model.fit(dm.tf_train_set.batch(64), epochs=20, validation_data= dm.tf_valid_set.batch(64))
my_best_rnn.evaluate(dm.tf_valid_set.batch(64))

Train for 77 steps, validate for 9 steps
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### <span style="color:#0b486b">4.4. Investigating the embedding vectors from the embedding matrix</span> ###

<div style="text-align: right"><span style="color:red; font-weight:bold"><span></div>

**As you know, the embedding matrix is a collection of embedding vectors, each is for one word. In this part, you will base on the cosine similarity of the embedding vectors for the words to find the top-k most relevant words for a given word.**

**Good embeddings should have words close in meaning near each other by some similarity metrics. The similarity metric we'll use is the `consine` distance, which is defined for two vector $\mathbf{u}$ and $\mathbf{v}$ as $\cos(\mathbf{u}, \mathbf{v})=\frac{\mathbf{u} \cdot \mathbf{v}}{\left\Vert{\mathbf{u}}\right\Vert\left\Vert{\mathbf{v}}\right\Vert}$ where $\cdot$ means dot product and $\left\Vert\cdot\right\Vert$ means the $\mathcal{L}_2$ norm.**

In [188]:
def cosine_similarity(u,v):
    return np.dot(u,v)/(np.linalg.norm(u)*np.linalg.norm(u))

#### <span style="color:red"></span> 

**You are required to write the code for the function *find_most_similar(word= None, k=5, model= None)*. As its name, this function returns the top-k most relevant word for a given word based on the cosine similarity of the embedding vectors.**

<div style="text-align: right"><span style="color:red"></span></div>

In [189]:
def find_most_similar(word= None, k=5, model= None):
    
    try:
        #List for collecting all words, collecting the cosine similarity between words, Ranked words       
        words_collect=[]
        most_common =[]
        top_common_words = []  
        
        for words, idx in model.word2idx.items():
            words_collect+= [words] #Copying words into list
        
        for i in range(len(words_collect)):
            if words_collect[i] == word: #Extractingt the index of the entered word to process
                
                for j in range(len(model.embed_matrix)):
                    #Storing the cosine value between each words with given word
                    most_common += [cosine_similarity(model.embed_matrix[i],model.embed_matrix[j])]
        
        top_n = sorted(range(len(most_common)), key=lambda n: most_common[n], reverse=True)
        
        top_index = top_n[1:k+1]#taking the index of first k words of the list for output
        
        for i in top_index:
            top_common_words += [words_collect[i]] #First k words in the list
        
        if top_common_words == []:
            raise Exception
                    
    except: #word not in the vocabulary
        print("Word is not in the dictionary!")
        
    return top_common_words

Here is the example of the above function. As you can observe, the result makes sense which demonstrates that we obtain a good model with the meaningful embedding matrix.

In [190]:
find_most_similar(word='poland',k=10,model=my_best_rnn)

['sinatra',
 'mountains',
 'spock',
 'firm',
 'ford',
 'seaport',
 'arcadia',
 'exp',
 'went',
 'driven']