<a href="https://colab.research.google.com/github/sorobedio/KAIST-AI502/blob/main/KAIST_AI605_ass1_20205677.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KAIST AI605 Assignment 1: Text Classification
TA in charge: Miyoung Ko (miyoungko@kaist.ac.kr)

**Due Date:** September 29 (Wed) 11:00pm, 2021

## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/aGZZ86YpCdv2zEVt9). 

You need to submit both (1) a PDF of this notebook, and (2) a link to CoLab for execution (.ipynb file is also allowed).

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points. You can obtain up to 5 bonus points (i.e. max score is 25 points). For every late day, your grade will be deducted by 2 points (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will only use Python 3.7 and PyTorch 1.9, which is already available on Colab:

In [None]:
from platform import python_version

import os
import numpy as np
import pandas as pd
from tqdm import tqdm


import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F

print("python", python_version())
print("torch", torch.__version__)

python 3.7.12
torch 1.9.0+cu102


## 1. Limitations of Vanilla RNNs
In Lecture 02, we saw that a multi-layer perceptron (MLP) without activation function is equivalent to a single linear transformation with respect to the inputs. One can define a vanilla recurrent neural network without activation as follows: given inputs $\textbf{x}_1 \dots \textbf{x}_T$, the output $\textbf{h}_t$ is obtained by
$$\textbf{h}_t = \textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b},$$
where $\textbf{V}, \textbf{U}, \textbf{b}$ are trainable weights. 

>**Problem 1.1** *(2 points)* Show that such a recurrent neural network (RNN) without activation function is equivalent to a single linear transformation with respect to the inputs, which means each $\textbf{h}_t$ is a linear combination of the inputs.
#### $\color{red}{\text{Solution 1.1}}$
<font color='red'> **Solution:** Let $W_{x2h}$ denote the weight matrix between the input layer and the hidden layer (i.e. $\textbf{U}$), $W_{h2h}$ the weight matrix between consecutive hidden states (i.e. $\textbf{V}$), and $b_h$ the bias (i.e. $\textbf{b}$), so that $$ h_t= W_{x2h} x_t+ W_{h2h}h_{t-1}+b_h.$$ We want to prove the statement $P_t$: $h_t$ is a linear combination of the inputs $x_1, \dots, x_t$ (plus a constant term that depends only on the bias).

<font color='red'>**Proof by induction.** Base case: with $h_0=0$, $$ h_1= W_{x2h} x_1+ b_h,$$ so the statement is true for $t=1$. For $t=2$ we have $$ h_2= W_{x2h} x_2+ W_{h2h}h_{1}+b_h$$
$$ h_2= W_{x2h} x_2+ W_{h2h}(W_{x2h} x_1+ b_h)+b_h$$
$$ h_2= W_{x2h} x_2+ W_{h2h}W_{x2h} x_1+ W_{h2h}b_h+b_h,$$ so the statement is true for $t \leq 2$. Inductive step: assume that the statement is true for $t=n-1 \geq 2$, that is, $h_{n-1}$ is a linear combination of the inputs $x_1, \dots, x_{n-1}$. Then $$ h_n= W_{x2h} x_n+ W_{h2h}h_{n-1}+b_h.$$ Since $h_{n-1}$ is already a linear combination of the inputs, multiplying it by $W_{h2h}$ preserves linearity, and adding $W_{x2h} x_n + b_h$ contributes one more linear term. Thus $h_n$ is also a linear combination of the inputs. Unrolling the recurrence gives the explicit form $$h_t = \sum_{k=1}^{t} W_{h2h}^{\,t-k}\left(W_{x2h} x_k + b_h\right).$$ Note that the combination is autoregressive: $h_t$ depends only on the first $t$ inputs of the sequence. We conclude that such a recurrent neural network without activation function is equivalent to a single linear transformation with respect to the inputs, which means each $\textbf{h}_t$ is a linear combination of the inputs.
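
The unrolled form of the recurrence can be checked numerically. The sketch below is only an illustration under assumed sizes and random weights (the names `d_h`, `d_x`, and `T` are hypothetical), and it uses NumPy rather than PyTorch to stay lightweight. It compares the step-by-step recurrence with the sum $\sum_k \textbf{V}^{T-k}(\textbf{U}\textbf{x}_k + \textbf{b})$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 3, 2, 5                      # hypothetical sizes for illustration
V = rng.normal(size=(d_h, d_h)) * 0.5      # hidden-to-hidden weights
U = rng.normal(size=(d_h, d_x))            # input-to-hidden weights
b = rng.normal(size=d_h)                   # bias
xs = rng.normal(size=(T, d_x))             # inputs x_1 ... x_T

# Recurrence without activation: h_t = V h_{t-1} + U x_t + b, with h_0 = 0.
h = np.zeros(d_h)
for x in xs:
    h = V @ h + U @ x + b

# Unrolled form: h_T = sum_{k=1}^{T} V^{T-k} (U x_k + b).
h_closed = np.zeros(d_h)
for k in range(1, T + 1):
    h_closed += np.linalg.matrix_power(V, T - k) @ (U @ xs[k - 1] + b)

print(np.allclose(h, h_closed))
```

Because of the bias, the map is strictly speaking affine in the inputs; with $\textbf{b} = 0$ it is exactly linear.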



In Lecture 05 and 06, we will see how RNNs can model non-linearity via activation function, but they still suffer from exploding or vanishing gradients. We can mathematically show that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and can become very large if it is larger than 1.
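
To get a feeling for how quickly $\textbf{V}^{T-1}$ shrinks or grows, here is a small numerical sketch. It assumes a special case chosen for illustration only: $\textbf{V}$ is a scaled orthogonal matrix, so the spectral norm of $\textbf{V}^{k}$ equals the scale raised to the power $k$. The helper `power_norm` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix

def power_norm(scale, steps):
    """Spectral norm of (scale * Q)^steps."""
    V = scale * Q
    return np.linalg.norm(np.linalg.matrix_power(V, steps), 2)

small = power_norm(0.9, 50)   # equals 0.9**50, close to zero
large = power_norm(1.1, 50)   # equals 1.1**50, above one hundred
print(small, large)
```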




> **Problem 1.2** *(2 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.
#### $\color{red}{\text{Solution 1.2}}$
<font color='red'> **Solution:** 
The exploding gradient problem occurs when the norm of the gradient grows very large during training, so that a single update can move the parameters far away from a good region.
One option is to clip the gradient of each parameter element-wise to a fixed range, just before the parameter update. Another is to clip the norm $||g||$ of the gradient $g$ just before the parameter update:
$$ \text{if} \quad ||g||>v: \quad g \leftarrow v \cdot \frac{g}{||g||}$$
where $v$ is the norm threshold. Gradient clipping ensures that the norm of the gradient is bounded by $v$; with norm clipping the direction of the gradient is also preserved. The bounded
gradient avoids performing a detrimental step when the gradient explodes.
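
The norm-clipping rule can be sketched in a few lines. This is a minimal NumPy illustration with a hypothetical helper `clip_by_norm`, not the code used for training in this notebook. In PyTorch the same idea is available as `torch.nn.utils.clip_grad_norm_`, which would be called between `loss.backward()` and `optimizer.step()`.

```python
import numpy as np

def clip_by_norm(g, v):
    """Rescale g so that its L2 norm is at most v; leave it unchanged otherwise."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g

g = np.array([30.0, -40.0])                         # norm 50, an "exploded" gradient
clipped = clip_by_norm(g, v=5.0)                    # same direction, norm 5
small = clip_by_norm(np.array([0.3, 0.4]), v=5.0)   # norm 0.5, left untouched
print(clipped, small)
```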



> **Problem 1.3** *(2 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 05 and 06 slides for the definition of LSTM.
#### $\color{red}{\text{Solution 1.3}}$

<font color='red'>The error gradient is the sum of the gradients over the $T$ time steps:
\begin{equation}
\frac{\partial E}{\partial W}=\sum_{t=1}^{T}\frac{\partial E_t}{\partial W}
\end{equation}

<font color='red'>The overall gradient vanishes when the terms of this sum vanish, so it is enough to keep the individual terms from converging to zero. The gradient of the error at some time step $k$ has the form:
\begin{equation}
\frac{\partial E_k}{\partial W}=\frac{\partial E_k}{\partial h_k}\frac{\partial h_k}{\partial c_k}\left( \prod_{t=2}^{k}\frac{\partial c_t}{\partial c_{t-1}}\right)\frac{\partial c_1}{\partial W}
\end{equation}
The vanishing gradient is caused by the product $\prod_{t=2}^{k}\frac{\partial c_t}{\partial c_{t-1}}$. In an LSTM the cell state vector $c_t$ has the form:
\begin{align}
c_t &= c_{t-1}\otimes \sigma(W_f \cdot [h_{t-1},x_t])\oplus \tanh(W_c \cdot [h_{t-1},x_t])\otimes\sigma(W_i \cdot [h_{t-1},x_t]) \\
&=c_{t-1}\otimes f_t\oplus \tilde{c}_t\otimes i_t
\end{align}

<font color='red'>From the above equation, and since $h_{t-1} = o_{t-1} \otimes \tanh(c_{t-1})$, the gradient of the cell state is the sum of four terms:
 \begin{aligned} \frac{\partial c_{t}}{\partial c_{t-1}}=& \sigma^{\prime}\left(W_{f} \cdot\left[h_{t-1}, x_{t}\right]\right) \cdot W_{f} \cdot o_{t-1} \otimes \tanh ^{\prime}\left(c_{t-1}\right) \cdot c_{t-1} \\ &+f_{t} \\ &+\sigma^{\prime}\left(W_{i} \cdot\left[h_{t-1}, x_{t}\right]\right) \cdot W_{i} \cdot o_{t-1} \otimes \tanh ^{\prime}\left(c_{t-1}\right) \cdot \tilde{c}_{t} \\ &+\tanh^{\prime}\left(W_{c} \cdot\left[h_{t-1}, x_{t}\right]\right) \cdot W_{c} \cdot o_{t-1} \otimes \tanh ^{\prime}\left(c_{t-1}\right) \cdot i_{t} \end{aligned}
In an LSTM, the forget gate $f_t$ enters this gradient as an additive term. The network can therefore learn to keep $f_t$ close to 1 where long-range information matters, so that the product of these gradients over time does not necessarily converge to zero.
<font>
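
A rough scalar illustration of this argument is sketched below. It is a simplification under stated assumptions: only the additive $f_t$ term of $\partial c_t / \partial c_{t-1}$ is kept for the LSTM, the recurrent weight `v` and the gate pre-activation are hypothetical constants, and the vanilla RNN uses the sigmoid activation from the recurrence above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

T = 50

# Vanilla RNN path: each factor is sigma'(z) * v, and sigma'(z) is at most 0.25.
v = 1.5                                                  # hypothetical recurrent weight
rnn_factor = sigmoid(0.0) * (1.0 - sigmoid(0.0)) * v     # 0.25 * 1.5 = 0.375
rnn_product = rnn_factor ** (T - 1)

# LSTM path, keeping only the additive f_t term of dc_t/dc_{t-1}.
f = sigmoid(4.0)                                         # forget gate close to 1
lstm_product = f ** (T - 1)

print(rnn_product, lstm_product)
```

With these values the vanilla product is numerically negligible after 50 steps, while the forget-gate product stays at a usable magnitude.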



## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank (SST), a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST via Hugging Face
We will use `datasets` package offered by Hugging Face, which allows us to easily download various language datasets, including Stanford Sentiment Treebank.

First, install the package:

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.12.1-py3-none-any.whl (270 kB)

Then download SST and print the first example:

In [None]:
from datasets import load_dataset
from pprint import pprint

sst_dataset = load_dataset('sst')
# train_df=pd.DataFrame.from_dict(sst_dataset['train'])
pprint(sst_dataset['train'][0])

No config specified, defaulting to: sst/default


Downloading and preparing dataset sst/default (download: 6.83 MiB, generated: 3.73 MiB, post-processed: Unknown size, total: 10.56 MiB) to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff...


Dataset sst downloaded and prepared to /root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff. Subsequent calls will reuse this data.


{'label': 0.6944400072097778,
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' "
             "and that he 's going to make a splash even greater than Arnold "
             'Schwarzenegger , Jean-Claud Van Damme or Steven Segal .',
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}


In [None]:
sst_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'tokens', 'tree'],
        num_rows: 8544
    })
    validation: Dataset({
        features: ['sentence', 'label', 'tokens', 'tree'],
        num_rows: 1101
    })
    test: Dataset({
        features: ['sentence', 'label', 'tokens', 'tree'],
        num_rows: 2210
    })
})

Note that each `label` is a score between 0 and 1. You will round it to either 0 or 1 for binary classification (positive for 1, negative for 0).
In this first example, the label is rounded to 1, meaning that the sentence is a positive review.
You will only use `sentence` as the input; please ignore other values.
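
One detail worth knowing when rounding the labels: Python's built-in `round` rounds halves to the nearest even integer, so a score of exactly 0.5 becomes 0 rather than 1. The snippet below is a small sketch using scores of the kind displayed in the training data.

```python
labels = [0.6944400072097778, 0.5, 0.72222]   # scores of the kind shown in the training data
binary = [round(x) for x in labels]           # round-half-to-even: 0.5 -> 0
print(binary)   # [1, 0, 1]
```

If a score of 0.5 should count as positive instead, a threshold such as `int(x >= 0.5)` would be one alternative.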

In [None]:
train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df.head(5)

Unnamed: 0,sentence,label,tokens,tree
0,The Rock is destined to be the 21st Century 's...,0.69444,The|Rock|is|destined|to|be|the|21st|Century|'s...,70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|5...
1,The gorgeously elaborate continuation of `` Th...,0.83333,The|gorgeously|elaborate|continuation|of|``|Th...,71|70|69|69|67|67|66|64|63|62|62|61|61|58|57|5...
2,Singer\/composer Bryan Adams contributes a sle...,0.625,Singer\/composer|Bryan|Adams|contributes|a|sle...,72|71|71|70|68|68|67|67|66|63|62|62|60|60|58|5...
3,You 'd think by now America would have had eno...,0.5,You|'d|think|by|now|America|would|have|had|eno...,36|35|34|33|33|32|30|29|27|26|25|24|23|23|22|2...
4,Yet the act is still charming here .,0.72222,Yet|the|act|is|still|charming|here|.,15|13|13|10|9|9|11|12|10|11|12|14|14|15|0


In [None]:
train_df.shape

(8544, 4)

In [None]:
train_df['label'] = train_df['label'].apply(lambda x: round(x))
train_df.head(5)

Unnamed: 0,sentence,label,tokens,tree
0,The Rock is destined to be the 21st Century 's...,1,The|Rock|is|destined|to|be|the|21st|Century|'s...,70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|5...
1,The gorgeously elaborate continuation of `` Th...,1,The|gorgeously|elaborate|continuation|of|``|Th...,71|70|69|69|67|67|66|64|63|62|62|61|61|58|57|5...
2,Singer\/composer Bryan Adams contributes a sle...,1,Singer\/composer|Bryan|Adams|contributes|a|sle...,72|71|71|70|68|68|67|67|66|63|62|62|60|60|58|5...
3,You 'd think by now America would have had eno...,0,You|'d|think|by|now|America|would|have|had|eno...,36|35|34|33|33|32|30|29|27|26|25|24|23|23|22|2...
4,Yet the act is still charming here .,1,Yet|the|act|is|still|charming|here|.,15|13|13|10|9|9|11|12|10|11|12|14|14|15|0


> **Problem 2.1** *(2 points)* Using space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.


In [None]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [None]:
# Constructing vocabulary with `UNK`
vocab = ['PAD', 'UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

['PAD', 'UNK', 'world!', 'Hello']
3


In [None]:
# alltokens=[]
# for i in range(len(train_df)):
#     alltokens+=list(list(filter(('').__ne__,train_df.tokens.iloc[i].split("|"))))

In [None]:
corpus =" "
texts = list(train_df.sentence)
for line in texts:
    corpus+= line.lower()

In [None]:
# corpus=''
# for i in range(len(df)):
#     corpus+=df.sentence.iloc[i]

#### $\color{red}{\text{Solution 2.1}}$
<font color='red'> **Solution** The cells around this answer show how the vocabulary is created using the space-split tokenizer. The vocabulary size is **18466** words.</font>
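
One thing to keep in mind about the `corpus` construction in this section: the lower-cased sentences are appended to one string without a separator, so the last token of a sentence and the first token of the next one are fused into a single string. The toy sketch below (with made-up sentences) shows the effect and a per-sentence alternative. The reported size of 18466 comes from the original cells and may include such fused tokens.

```python
sentences = ["Yet the act is still charming here .",
             "The Rock is destined ."]         # toy examples

# Concatenating without a separator fuses the tokens at sentence boundaries.
fused = "".join(s.lower() for s in sentences).split(" ")

# Tokenizing each sentence on its own keeps the boundary tokens separate.
per_sentence = [tok for s in sentences for tok in s.lower().split(" ") if tok]

print(".the" in fused)           # True
print(".the" in per_sentence)    # False
```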

In [None]:
# Constructing vocabulary with `UNK`
vocab = ['PAD', 'UNK'] + list(set(filter(('').__ne__, corpus.split(' '))))
word2id = {word: id_ for id_, word in enumerate(vocab)}

In [None]:
len(vocab)

18466

In [None]:
vocab[:10]

['PAD',
 'UNK',
 'operates',
 'noyce',
 'catching',
 'pen',
 'reward',
 'barrel',
 "''is",
 'horrified']

> **Problem 2.2** *(1 point)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?
#### $\color{red}{\text{Solution 2.2}}$
<font color='red'> **Solution:** In the cells below we create the vocabulary from only the words that appear at least twice in the corpus. The resulting vocabulary size is 8572, which is less than half of the previous vocabulary size (18466). In total, 9894 words have been removed. </font>
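
The frequency filter can also be wrapped in a small reusable function. The following is a sketch on toy data, assuming a hypothetical helper `build_vocab` that tokenizes each sentence separately and sorts the words so that the ids are reproducible across runs. It is not the exact code that produced the numbers above.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Return (vocab, word2id) with PAD at index 0 and UNK at index 1."""
    counts = Counter(tok for s in sentences for tok in s.lower().split(" ") if tok)
    words = sorted(w for w, c in counts.items() if c >= min_count)  # sorted for reproducible ids
    vocab = ["PAD", "UNK"] + words
    return vocab, {w: i for i, w in enumerate(vocab)}

toy = ["a good movie .", "a bad movie .", "good fun"]
vocab_all, _ = build_vocab(toy, min_count=1)
vocab_2, word2id_2 = build_vocab(toy, min_count=2)
print(len(vocab_all), len(vocab_2))   # 8 6
```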



In [None]:
corpus=" "
for i in range(len(train_df)):
    corpus+=train_df.sentence.iloc[i].lower()
texts= list(list(filter(('').__ne__, corpus.split(" "))))

In [None]:
# texts[]

In [None]:
import collections

In [None]:
occurrences = collections.Counter(texts)
tokens={i:occurrences[i] for i in occurrences if occurrences[i]>=2}
# Constructing vocabulary with `UNK`
vocab = ['PAD', 'UNK'] + list(set(tokens))
word2id = {word: id_ for id_, word in enumerate(vocab)}

In [None]:
len(vocab)

8572

## 3. Text Classification with Multi-Layer Perceptron and Recurrent Neural Network

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baselines is to fix the input length (with truncation or padding), flatten the word embeddings, apply a linear transformation followed by an activation, and finally classify the output into the two classes: 
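
The truncation and padding step can be sketched as a plain function. The helper `encode` and the toy vocabulary are hypothetical and only illustrate the convention used in this notebook (`PAD` has id 0, `UNK` has id 1, and padding is added at the end).

```python
def encode(sentence, word2id, length, pad_id=0, unk_id=1):
    """Map a sentence to exactly `length` ids: UNK for unseen words, PAD at the end."""
    ids = [word2id.get(tok, unk_id) for tok in sentence.lower().split(" ") if tok]
    ids = ids[:length]                           # truncate long inputs
    return ids + [pad_id] * (length - len(ids))  # pad short inputs

toy_word2id = {"PAD": 0, "UNK": 1, "hi": 2, "world": 3}   # hypothetical vocabulary
padded = encode("hi there world", toy_word2id, length=5)
truncated = encode("hi there world", toy_word2id, length=2)
print(padded)      # [2, 1, 3, 0, 0]
print(truncated)   # [2, 1]
```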

In [None]:
from torch import nn

# length = 8
# input_ = "hi world!"
# input_tokens = input_.split(' ')
# input_ids = [word2id[word] if word in word2id else 1 for word in input_tokens] # UNK if word not found
# if len(input_ids) < length:
#   input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
# else:
#   input_ids = input_ids[:length]

# input_tensor = torch.LongTensor([input_ids]) # the first dimension is minibatch size
# print(input_tensor)

In [None]:
# input_tensor.shape

In [None]:
# Two-layer MLP classification
class Baseline(nn.Module):

    def __init__(self, d, length=32):
        super(Baseline, self).__init__()
        self.embedding = nn.Embedding(len(vocab), d)
        self.layer = nn.Linear(d * length, d, bias=True)
        self.relu = nn.ReLU()
        self.class_layer = nn.Linear(d, 2, bias=True)

    def forward(self, input_tensor):
        emb = self.embedding(input_tensor) # [batch_size, length, d]
        emb_flat = emb.view(emb.size(0), -1) # [batch_size, length*d]
        hidden = self.relu(self.layer(emb_flat))
        logits = self.class_layer(hidden)
        return logits

# d = 3 # usually bigger, e.g. 128
# baseline = Baseline(d, length)
# logits = baseline(input_tensor)
# softmax = nn.Softmax(1)
# print(softmax(logits)) # probability for each class

Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`), which in fact turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the probability distribution and a one-hot distribution of the target label (note that we use `logits` instead of `softmax(logits)` as the input to the cross entropy, which allows us to avoid numerical instability). 
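
To make the remark about `logits` more concrete, here is a pure-Python sketch of the cross entropy computed directly from logits with the log-sum-exp trick. The helper names are hypothetical; the point is that the maximum logit is subtracted before exponentiating, so no large exponentials are ever formed.

```python
import math

def log_softmax(logits):
    """Log-probabilities via the log-sum-exp trick (shift by the max logit)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -log_softmax(logits)[target]

ce_uniform = cross_entropy([0.0, 0.0], 1)      # log(2), about 0.693
ce_extreme = cross_entropy([1000.0, 0.0], 1)   # 1000.0
print(ce_uniform, ce_extreme)
```

A naive `math.log(math.exp(z) / sum(...))` would fail on the second input, because `math.exp(1000.0)` overflows.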

In [None]:
# cel = nn.CrossEntropyLoss()
# label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
# loss = cel(logits, label) # Loss, a.k.a L
# print(loss)

Once we have the loss defined, only one step remains! We compute the gradients of the loss with respect to the parameters and update them. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically Stochastic Gradient Descent (SGD) with a minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [None]:
# optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
# optimizer.zero_grad() # reset process
# loss.backward() # compute gradients
# optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [None]:
# print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

In [None]:
def get_tokens_ids(word2id, df, length=32):
    """Tokenize, map to ids, and pad/truncate every sentence in `df` to `length`."""
    token_ids = []
    labels = []
    targets = list(df.label)
    # NOTE: despite its name, `max_len` ends up holding the length of the
    # shortest sentence (capped at 52), because it is only ever decreased.
    max_len = 52

    for i in range(len(df)):
        sentence = df.sentence.iloc[i].lower()
        # Space tokenization, dropping empty strings caused by repeated spaces
        input_tokens = [tok for tok in sentence.split(" ") if tok]
        if max_len > len(input_tokens):
            max_len = len(input_tokens)
        # Unknown words map to UNK (id 1)
        input_ids = [word2id[word] if word in word2id else 1 for word in input_tokens]

        if len(input_ids) < length:
            input_ids = input_ids + [0] * (length - len(input_ids))  # PAD tokens at the end
        else:
            input_ids = input_ids[:length]

        token_ids.append(input_ids)
        labels.append(targets[i])
    token_ids = torch.LongTensor(token_ids)
    labels = torch.LongTensor(labels)
    return max_len, token_ids, labels

> **Problem 3.1** *(2 points)* Properly train a MLP baseline model on SST and report the model's accuracy on the dev data.

> **Problem 3.2** *(2 points)* Implement a recurrent neural network (without using PyTorch's RNN module) with `tanh` activation, and use the output of the RNN at the final time step for the classification. Report the model's accuracy on the dev data.

> **Problem 3.3** *(2 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

> **Problem 3.4 (bonus)** *(1 point)* Why is it numerically unstable if you compute log on top of softmax?

In [None]:

from pprint import pprint

# sst_dataset = load_dataset('sst')
# train_df=pd.DataFrame.from_dict(sst_dataset['train'])
# train_df['label'] = train_df['label'].apply(lambda x: round(x))

In [None]:

# result_path = 'checkpoint/'
criterion = nn.CrossEntropyLoss()


def train(model, train_loader, val_loader, num_epochs = 12, file = None):    
    if file is not None:
        best_file = file  # path where the best checkpoint is saved

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    best_acc = 0
    best_epoch = 0  # epoch with the best validation accuracy so far
    
  

    for epoch in range(num_epochs):
        
        model.train()
        epoch_loss = 0
        train_acc =0
        total = 0
        epoch_acc = 0
        avg_loss=0
        labels =[]
        for batch in tqdm(train_loader):

            data, target = batch
            labels.extend(target)
            data, target = data.cuda(), target.cuda()
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)

            loss.backward()
            optimizer.step()
            total += len(target)
            avg_loss += loss.item()
            _, preds = torch.max(output.data, 1)

            train_acc += (preds == target).sum().item()
            
            # pred = output.data.max(1, keepdim=True)[1]
            # train_acc += pred.eq(target.data.view_as(pred)).cpu().sum()

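        epoch_acc = (100 * train_acc / total)
        # Sum of per-batch mean losses divided by the number of samples, so the
        # reported value is roughly the mean loss divided by the batch size.
        epoch_loss = avg_loss/total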


        val_loss,val_acc= validate(epoch, model, val_loader)
        print("epoch {}: Training_Loss- {:.3f}, Val_Loss- {:.3f}, Training_Acc- {:.2f}, Val_Acc- {:.2f}".format(epoch, epoch_loss, val_loss, epoch_acc, val_acc))

        if (val_acc > best_acc):
            best_acc = val_acc
            best_epoch = epoch
            if (file!=None):
                torch.save(model.state_dict(), best_file)         
        if (epoch == num_epochs-1):
            print("Best accuracy at epoch: {}".format(best_epoch))

      

In [None]:
def validate(epoch, model, valid_loader):
    
    with torch.no_grad():

        model.eval()
        epoch_loss = 0
        val_total = 0
        val_correct = 0
        epoch_val_acc = 0
        val_loss =0
        val_labels =[]
        for val_batch in tqdm(valid_loader):
            
            val_data, val_target =  val_batch
            val_labels.extend(val_target) 
            val_data, val_target = val_data.cuda(), val_target.cuda()
            val_output = model(val_data)
            val_loss += criterion(val_output, val_target).item()
            _, preds = torch.max(val_output.data, 1)
            val_correct += (preds == val_target).sum().item()

            # val_pred = val_output.data.max(1, keepdim=True)[1]
            # val_correct += val_pred.eq(val_target.data.view_as(val_pred)).cpu().sum()
            val_total+= len(val_target)

        epoch_loss=val_loss / len(val_labels)
        epoch_val_acc = (100 * val_correct / val_total)

    return epoch_loss, epoch_val_acc



In [None]:

def test(model, test_loader, file=None):
    
    if (file!=None):
        best_file = file #os.path.join(result_path, file)
        model.load_state_dict(torch.load(best_file))
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch in tqdm(test_loader):
            data, target = batch
            data, target = data.cuda(), target.cuda()
            output = model(data)
            test_loss += criterion(output, target).item()
            _, preds = torch.max(output.data, 1)

            correct += (preds == target).sum().item()

        # pred = output.data.max(1, keepdim=True)[1]
        # correct += pred.eq(target.data.view_as(pred)).cpu().sum()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(test_loss, correct,
                                                                                len(test_loader.dataset),
                                                                                100. * correct / len(
                                                                                    test_loader.dataset)))
    # return correct / float(len(test_loader.dataset))

In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 16)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 16)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 16)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


2


In [None]:
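input_dim = 100 # word-embedding dimension
num_epochs = 50
# `length` must match the padded sequence length of the tensors built in the previous cell
model = Baseline(d=input_dim, length=data.size(1)).cuda()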
train(model, train_loader, valid_loader, num_epochs, 'base_best_model.pth')

In [None]:
test(model, test_loader, file = 'base_best_model.pth')

In [None]:
test(model, valid_loader, file = 'base_best_model.pth')

In [None]:
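# Baseline results per sequence length L (accuracies in %)
# L=52: val acc = 56.49, test acc = 54.52, best epoch = 43
# L=40: val acc = 57.40, test acc = 55.57, best epoch = 41
# L=32: val acc = 57.22, test acc = 56.06, best epoch = 10
# L=16: val acc = 57.77, test acc = 57.51, best epoch = 17
# L=8:  val acc = 59.49, test acc = 55.57, best epoch = 4
# L=45: val acc = 56.58, test acc = 55.70, best epoch = 38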

#### $\color{red}{\text{Solution 3.1}}$
<font color='red'> The Baseline model, trained for 50 epochs, achieved its best performance in 2 epochs with sequence length 16. We believe that the model overfits the data and does not generalize: it may have started from a good initialization, but its validation performance did not improve while the training accuracy reached around 99%. Below we report the experiment results for five sequence lengths.<font>

| Sequence length | Validation accuracy | Test accuracy | Best epoch |
|---|---|---|---|
| 52 | 56.49 | 54.52 | 43 |
| 40 | 57.40 | 55.57 | 41 |
| 32 | 57.22 | 56.06 | 10 |
| 16 | 57.77 | 57.51 | 17 |
| 8 | 59.49 | 55.57 | 4 |

In [None]:

class RNN(nn.Module):
    """Vanilla RNN with tanh activation, written without PyTorch's RNN module."""

    def __init__(self, input_dim, hidden_dim, output_dim=2):
        super(RNN, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.embedding = nn.Embedding(len(vocab), self.input_dim)
        self.fc_x2h = nn.Linear(in_features=self.input_dim, out_features=self.hidden_dim)
        self.fc_h2h = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim)
        self.fc_h2y = nn.Linear(in_features=self.hidden_dim, out_features=self.output_dim)

    def forward(self, input):
        emb = self.embedding(input)  # [batch_size, length, input_dim]

        # Initial hidden state h_0 = 0, created on the same device as the input
        h = torch.zeros(input.size(0), self.hidden_dim, device=input.device)
        for t in range(emb.size(1)):
            # h_t = tanh(W_x2h x_t + W_h2h h_{t-1} + biases)
            h = torch.tanh(self.fc_x2h(emb[:, t, :]) + self.fc_h2h(h))
        # Classify from the hidden state at the final time step
        return self.fc_h2y(h)
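
Since the inputs are padded at the end, the hidden state at the final time step has also been updated on `PAD` tokens for short sentences. One possible variant, sketched below in NumPy as an assumption and not as part of the submitted model, keeps the hidden state from the last non-`PAD` position of each row. The function `rnn_last_valid` and its weight arguments are hypothetical.

```python
import numpy as np

def rnn_last_valid(emb, ids, W_x, W_h, b, pad_id=0):
    """Run h_t = tanh(x_t W_x + h_{t-1} W_h + b) and keep h at each row's last non-PAD step.

    Assumes PAD tokens appear only at the end of each row.
    """
    batch, length, _ = emb.shape
    lengths = (ids != pad_id).sum(axis=1)          # number of real tokens per row
    h = np.zeros((batch, W_h.shape[0]))
    for t in range(length):
        h_new = np.tanh(emb[:, t, :] @ W_x + h @ W_h + b)
        active = (t < lengths)[:, None]            # rows that still have a real token at step t
        h = np.where(active, h_new, h)             # freeze rows that have reached padding
    return h

rng = np.random.default_rng(0)
ids = np.array([[5, 7, 0, 0], [3, 4, 6, 2]])       # toy batch, 0 is PAD
emb = rng.normal(size=(2, 4, 3))                   # toy embeddings [batch, length, dim]
W_x, W_h, b = rng.normal(size=(3, 2)), rng.normal(size=(2, 2)), np.zeros(2)
h_last = rnn_last_valid(emb, ids, W_x, W_h, b)
print(h_last.shape)   # (2, 2)
```

For the first row, the result equals the hidden state after its two real tokens, regardless of the padding that follows.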



In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 32)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 32)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 32)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


2


In [None]:
# nn.utils.clip_grad_norm_(model.parameters(), clip)

num_epochs = 50
model = RNN(input_dim=100, hidden_dim=100).cuda()
train(model, train_loader, valid_loader, num_epochs, 'rnn_best_model.pth')
# train(model, train_loader, valid_loader, num_epochs, None)

100%|██████████| 267/267 [00:04<00:00, 58.92it/s]
100%|██████████| 35/35 [00:00<00:00, 207.21it/s]


epoch 0: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.88, Val_Acc- 50.23


100%|██████████| 267/267 [00:04<00:00, 59.69it/s]
100%|██████████| 35/35 [00:00<00:00, 196.42it/s]


epoch 1: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.34, Val_Acc- 50.41


100%|██████████| 267/267 [00:04<00:00, 59.52it/s]
100%|██████████| 35/35 [00:00<00:00, 131.70it/s]


epoch 2: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.21, Val_Acc- 48.68


100%|██████████| 267/267 [00:04<00:00, 59.92it/s]
100%|██████████| 35/35 [00:00<00:00, 209.77it/s]


epoch 3: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.76, Val_Acc- 48.05


100%|██████████| 267/267 [00:04<00:00, 59.26it/s]
100%|██████████| 35/35 [00:00<00:00, 190.94it/s]


epoch 4: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.66, Val_Acc- 51.04


100%|██████████| 267/267 [00:04<00:00, 60.01it/s]
100%|██████████| 35/35 [00:00<00:00, 192.00it/s]


epoch 5: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.85, Val_Acc- 50.86


100%|██████████| 267/267 [00:04<00:00, 58.88it/s]
100%|██████████| 35/35 [00:00<00:00, 207.33it/s]


epoch 6: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.59, Val_Acc- 52.13


100%|██████████| 267/267 [00:04<00:00, 59.58it/s]
100%|██████████| 35/35 [00:00<00:00, 195.99it/s]


epoch 7: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 52.67, Val_Acc- 50.95


100%|██████████| 267/267 [00:04<00:00, 59.33it/s]
100%|██████████| 35/35 [00:00<00:00, 208.67it/s]


epoch 8: Training_Loss- 0.021, Val_Loss- 0.023, Training_Acc- 53.55, Val_Acc- 48.23


100%|██████████| 267/267 [00:04<00:00, 59.91it/s]
100%|██████████| 35/35 [00:00<00:00, 207.41it/s]


epoch 9: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 53.60, Val_Acc- 51.32


100%|██████████| 267/267 [00:04<00:00, 60.30it/s]
100%|██████████| 35/35 [00:00<00:00, 192.66it/s]


epoch 10: Training_Loss- 0.021, Val_Loss- 0.023, Training_Acc- 54.58, Val_Acc- 49.59


100%|██████████| 267/267 [00:04<00:00, 58.97it/s]
100%|██████████| 35/35 [00:00<00:00, 205.51it/s]


epoch 11: Training_Loss- 0.021, Val_Loss- 0.023, Training_Acc- 53.87, Val_Acc- 50.50


100%|██████████| 267/267 [00:04<00:00, 59.54it/s]
100%|██████████| 35/35 [00:00<00:00, 213.83it/s]


epoch 12: Training_Loss- 0.021, Val_Loss- 0.023, Training_Acc- 55.33, Val_Acc- 51.59


100%|██████████| 267/267 [00:04<00:00, 60.10it/s]
100%|██████████| 35/35 [00:00<00:00, 207.96it/s]


epoch 13: Training_Loss- 0.021, Val_Loss- 0.023, Training_Acc- 55.79, Val_Acc- 48.32


100%|██████████| 267/267 [00:04<00:00, 60.03it/s]
100%|██████████| 35/35 [00:00<00:00, 200.60it/s]


epoch 14: Training_Loss- 0.021, Val_Loss- 0.023, Training_Acc- 57.10, Val_Acc- 50.41


100%|██████████| 267/267 [00:04<00:00, 58.47it/s]
100%|██████████| 35/35 [00:00<00:00, 200.34it/s]


epoch 15: Training_Loss- 0.021, Val_Loss- 0.024, Training_Acc- 57.30, Val_Acc- 49.14


100%|██████████| 267/267 [00:04<00:00, 59.96it/s]
100%|██████████| 35/35 [00:00<00:00, 213.08it/s]


epoch 16: Training_Loss- 0.020, Val_Loss- 0.024, Training_Acc- 56.98, Val_Acc- 48.77


100%|██████████| 267/267 [00:04<00:00, 59.43it/s]
100%|██████████| 35/35 [00:00<00:00, 198.36it/s]


epoch 17: Training_Loss- 0.020, Val_Loss- 0.025, Training_Acc- 57.98, Val_Acc- 48.14


100%|██████████| 267/267 [00:04<00:00, 59.68it/s]
100%|██████████| 35/35 [00:00<00:00, 216.33it/s]


epoch 18: Training_Loss- 0.020, Val_Loss- 0.025, Training_Acc- 59.13, Val_Acc- 49.05


100%|██████████| 267/267 [00:04<00:00, 57.98it/s]
100%|██████████| 35/35 [00:00<00:00, 201.52it/s]


epoch 19: Training_Loss- 0.020, Val_Loss- 0.025, Training_Acc- 59.42, Val_Acc- 49.50


100%|██████████| 267/267 [00:04<00:00, 60.01it/s]
100%|██████████| 35/35 [00:00<00:00, 208.38it/s]


epoch 20: Training_Loss- 0.020, Val_Loss- 0.027, Training_Acc- 58.88, Val_Acc- 51.59


100%|██████████| 267/267 [00:04<00:00, 59.61it/s]
100%|██████████| 35/35 [00:00<00:00, 189.76it/s]


epoch 21: Training_Loss- 0.020, Val_Loss- 0.026, Training_Acc- 60.88, Val_Acc- 48.68


100%|██████████| 267/267 [00:04<00:00, 59.07it/s]
100%|██████████| 35/35 [00:00<00:00, 204.30it/s]


epoch 22: Training_Loss- 0.019, Val_Loss- 0.025, Training_Acc- 60.93, Val_Acc- 47.96


100%|██████████| 267/267 [00:04<00:00, 59.98it/s]
100%|██████████| 35/35 [00:00<00:00, 193.77it/s]


epoch 23: Training_Loss- 0.019, Val_Loss- 0.027, Training_Acc- 62.31, Val_Acc- 52.32


100%|██████████| 267/267 [00:04<00:00, 59.05it/s]
100%|██████████| 35/35 [00:00<00:00, 199.22it/s]


epoch 24: Training_Loss- 0.019, Val_Loss- 0.026, Training_Acc- 61.83, Val_Acc- 49.50


100%|██████████| 267/267 [00:04<00:00, 60.13it/s]
100%|██████████| 35/35 [00:00<00:00, 207.76it/s]


epoch 25: Training_Loss- 0.019, Val_Loss- 0.027, Training_Acc- 63.01, Val_Acc- 52.32


100%|██████████| 267/267 [00:04<00:00, 60.11it/s]
100%|██████████| 35/35 [00:00<00:00, 206.52it/s]


epoch 26: Training_Loss- 0.018, Val_Loss- 0.028, Training_Acc- 63.83, Val_Acc- 49.77


100%|██████████| 267/267 [00:04<00:00, 59.11it/s]
100%|██████████| 35/35 [00:00<00:00, 136.45it/s]


epoch 27: Training_Loss- 0.019, Val_Loss- 0.027, Training_Acc- 61.54, Val_Acc- 51.68


100%|██████████| 267/267 [00:04<00:00, 60.10it/s]
100%|██████████| 35/35 [00:00<00:00, 190.21it/s]


epoch 28: Training_Loss- 0.019, Val_Loss- 0.028, Training_Acc- 62.75, Val_Acc- 51.41


100%|██████████| 267/267 [00:04<00:00, 59.51it/s]
100%|██████████| 35/35 [00:00<00:00, 199.50it/s]


epoch 29: Training_Loss- 0.018, Val_Loss- 0.032, Training_Acc- 63.73, Val_Acc- 51.50


100%|██████████| 267/267 [00:04<00:00, 59.33it/s]
100%|██████████| 35/35 [00:00<00:00, 203.04it/s]


epoch 30: Training_Loss- 0.018, Val_Loss- 0.032, Training_Acc- 65.87, Val_Acc- 50.41


100%|██████████| 267/267 [00:04<00:00, 58.18it/s]
100%|██████████| 35/35 [00:00<00:00, 189.70it/s]


epoch 31: Training_Loss- 0.019, Val_Loss- 0.035, Training_Acc- 63.26, Val_Acc- 51.59


100%|██████████| 267/267 [00:04<00:00, 58.87it/s]
100%|██████████| 35/35 [00:00<00:00, 185.69it/s]


epoch 32: Training_Loss- 0.018, Val_Loss- 0.028, Training_Acc- 65.54, Val_Acc- 48.32


100%|██████████| 267/267 [00:04<00:00, 59.29it/s]
100%|██████████| 35/35 [00:00<00:00, 201.58it/s]


epoch 33: Training_Loss- 0.018, Val_Loss- 0.030, Training_Acc- 65.16, Val_Acc- 53.22


100%|██████████| 267/267 [00:04<00:00, 59.47it/s]
100%|██████████| 35/35 [00:00<00:00, 198.68it/s]


epoch 34: Training_Loss- 0.017, Val_Loss- 0.034, Training_Acc- 66.30, Val_Acc- 52.13


100%|██████████| 267/267 [00:04<00:00, 59.91it/s]
100%|██████████| 35/35 [00:00<00:00, 133.24it/s]


epoch 35: Training_Loss- 0.017, Val_Loss- 0.035, Training_Acc- 67.22, Val_Acc- 51.95


100%|██████████| 267/267 [00:04<00:00, 59.02it/s]
100%|██████████| 35/35 [00:00<00:00, 212.54it/s]


epoch 36: Training_Loss- 0.020, Val_Loss- 0.031, Training_Acc- 63.17, Val_Acc- 49.86


100%|██████████| 267/267 [00:04<00:00, 59.61it/s]
100%|██████████| 35/35 [00:00<00:00, 200.43it/s]


epoch 37: Training_Loss- 0.018, Val_Loss- 0.031, Training_Acc- 65.57, Val_Acc- 51.23


100%|██████████| 267/267 [00:04<00:00, 59.05it/s]
100%|██████████| 35/35 [00:00<00:00, 205.06it/s]


epoch 38: Training_Loss- 0.017, Val_Loss- 0.030, Training_Acc- 67.40, Val_Acc- 52.68


100%|██████████| 267/267 [00:04<00:00, 59.74it/s]
100%|██████████| 35/35 [00:00<00:00, 199.75it/s]


epoch 39: Training_Loss- 0.017, Val_Loss- 0.029, Training_Acc- 67.80, Val_Acc- 52.41


100%|██████████| 267/267 [00:04<00:00, 59.14it/s]
100%|██████████| 35/35 [00:00<00:00, 196.79it/s]


epoch 40: Training_Loss- 0.017, Val_Loss- 0.032, Training_Acc- 67.78, Val_Acc- 49.05


100%|██████████| 267/267 [00:04<00:00, 58.98it/s]
100%|██████████| 35/35 [00:00<00:00, 205.76it/s]


epoch 41: Training_Loss- 0.017, Val_Loss- 0.035, Training_Acc- 67.04, Val_Acc- 49.77


100%|██████████| 267/267 [00:04<00:00, 59.55it/s]
100%|██████████| 35/35 [00:00<00:00, 194.03it/s]


epoch 42: Training_Loss- 0.016, Val_Loss- 0.034, Training_Acc- 68.83, Val_Acc- 49.14


100%|██████████| 267/267 [00:04<00:00, 58.77it/s]
100%|██████████| 35/35 [00:00<00:00, 193.44it/s]


epoch 43: Training_Loss- 0.017, Val_Loss- 0.031, Training_Acc- 67.56, Val_Acc- 51.68


100%|██████████| 267/267 [00:04<00:00, 58.88it/s]
100%|██████████| 35/35 [00:00<00:00, 179.66it/s]


epoch 44: Training_Loss- 0.017, Val_Loss- 0.034, Training_Acc- 67.73, Val_Acc- 52.59


100%|██████████| 267/267 [00:04<00:00, 59.76it/s]
100%|██████████| 35/35 [00:00<00:00, 189.96it/s]


epoch 45: Training_Loss- 0.016, Val_Loss- 0.036, Training_Acc- 68.91, Val_Acc- 49.59


100%|██████████| 267/267 [00:04<00:00, 59.72it/s]
100%|██████████| 35/35 [00:00<00:00, 203.07it/s]


epoch 46: Training_Loss- 0.016, Val_Loss- 0.033, Training_Acc- 68.48, Val_Acc- 48.86


100%|██████████| 267/267 [00:04<00:00, 59.66it/s]
100%|██████████| 35/35 [00:00<00:00, 191.52it/s]


epoch 47: Training_Loss- 0.018, Val_Loss- 0.036, Training_Acc- 66.05, Val_Acc- 51.23


100%|██████████| 267/267 [00:04<00:00, 59.89it/s]
100%|██████████| 35/35 [00:00<00:00, 211.99it/s]


epoch 48: Training_Loss- 0.017, Val_Loss- 0.036, Training_Acc- 68.87, Val_Acc- 48.50


100%|██████████| 267/267 [00:04<00:00, 58.76it/s]
100%|██████████| 35/35 [00:00<00:00, 208.03it/s]

epoch 49: Training_Loss- 0.016, Val_Loss- 0.034, Training_Acc- 69.83, Val_Acc- 48.68
Best accuracy at epoch: 33
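
The log ends by reporting the epoch with the best validation accuracy. The `train` function itself is not shown in this section, so the following is only a minimal sketch of how such a selection could work, assuming the best epoch is the first one that reaches the highest validation accuracy. The helper name `best_epoch` and the values in `history` are hypothetical and are not taken from the run above.

```python
def best_epoch(val_accuracies):
    """Return (epoch, accuracy) for the first epoch with the highest validation accuracy."""
    best_idx, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:  # strict '>' keeps the earliest epoch on ties
            best_idx, best_acc = epoch, acc
    return best_idx, best_acc


# Illustrative values only (hypothetical, not from the log above).
history = [50.2, 51.0, 53.2, 52.7, 53.2]
print(best_epoch(history))
```

In a real training loop, the point where `acc > best_acc` holds is where one would typically call `torch.save(model.state_dict(), path)` so that the best checkpoint can be reloaded for testing.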





In [None]:
test(model, test_loader, file = 'rnn_best_model.pth')

100%|██████████| 70/70 [00:00<00:00, 170.05it/s]


Test set: Average loss: 0.0296, Accuracy: 1112/2210 (50.32%)






In [None]:
test(model, valid_loader, file = 'rnn_best_model.pth')

100%|██████████| 35/35 [00:00<00:00, 159.37it/s]


Test set: Average loss: 0.0296, Accuracy: 586/1101 (53.22%)






In [None]:
# L = sequence length, vac = validation accuracy, tac = test accuracy, epoch = best epoch
# L=52 vac=50.68 tac=51.72 epoch=2
# L=40 vac=52.13 tac=51.22 epoch=16
# L=32 vac=54.41 tac=51.27 epoch=17  training learns slowly
# L=16 vac=57.86 tac=57.24 epoch=15  training accuracy up90
# L=8  vac=60.94 tac=56.56 epoch=19  overfitting
# L=24 vac=53.13 tac=52.08 epoch=8

#### $\color{red}{\text{Solution 3.2}}$
<font color='red'>The RNN model is trained for 50 epochs. The model fails to learn for larger sequence lengths and overfits for smaller sequence lengths. The best validation accuracy is obtained with the sequence length set to 8: the validation and test accuracies are 60.94% and 56.56%, respectively. The model achieves better performance than the baseline.</font>

| Sequence length | Validation accuracy (%) | Test accuracy (%) | Best epoch |
|---|---|---|---|
| 52 | 50.68 | 51.72 | 2 |
| 40 | 53.68 | 50.45 | 17 |
| 32 | 54.41 | 51.27 | 18 |
| 16 | 57.86 | 57.24 | 15 |
| 8 | 60.94 | 56.56 | 19 |
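
The sequence length compared above is the fixed number of token ids fed to the model per sentence. The notebook's own helper `get_tokens_ids` is not shown in this section, so the function below is a hypothetical sketch of the usual approach, assuming long sentences are cut on the right and short ones are right-padded with a padding id of 0.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return a list of exactly max_len ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


print(pad_or_truncate([7, 3, 9, 4, 1], 3))  # [7, 3, 9]
print(pad_or_truncate([7, 3], 4))           # [7, 3, 0, 0]
```

With a small `max_len`, most sentences are truncated, which shortens the path the gradient has to travel through the recurrence at the cost of discarding later words.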

#### $\color{red}{\text{Solution 3.3}}$
**Theoretically**

<font color='red'> The cross-entropy of a distribution $q$ relative to a distribution $p$ over a given set is defined as
$$H(p,q) = - \sum_{i} p_i \log q_i,$$
where $q_{i}$ is the estimated probability of outcome $i$ and $p_{i}$ is the empirical probability of outcome $i$ in the training set.</font>

<font color='red'> **Relation to log-likelihood**</font>

<font color='red'> Let the estimated probability of outcome $i$ be $q_{\theta}(x=i) = q_i$, and let the frequency (empirical probability) of outcome $i$ in the training set be $p(x=i) = p_i$.</font>

<font color='red'> Given $N$ conditionally independent samples in the training set, outcome $i$ occurs $N p_i$ times, so the **likelihood** is given by
$$\text{likelihood} = \prod_{i} q_{i}^{Np_{i}}.$$

Taking the **logarithm of the likelihood** and dividing it by $N$, we get
$$\frac{1}{N} \log \prod_{i} q_{i}^{Np_{i}}  = \sum_{i} p_{i} \log q_{i} = - H(p,q),$$
so that maximizing the likelihood with respect to the parameters $\theta$ is the same as minimizing the cross-entropy.</font>
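
The identity between the averaged log-likelihood and the negative cross-entropy can be checked numerically. The snippet below is a small sanity check, not part of the assignment solution; the toy `samples` and the model probabilities `q` are made-up values chosen only for illustration.

```python
import math
from collections import Counter

samples = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]  # toy training outcomes (hypothetical)
q = [0.2, 0.3, 0.5]                       # hypothetical model probabilities q_i

N = len(samples)
counts = Counter(samples)
p = [counts[i] / N for i in range(len(q))]  # empirical frequencies p_i

# Cross-entropy H(p, q) = -sum_i p_i log q_i
cross_entropy = -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))

# Log-likelihood of the samples under q, divided by N
avg_log_likelihood = sum(math.log(q[x]) for x in samples) / N

print(cross_entropy, -avg_log_likelihood)  # the two values agree up to rounding
```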



#### $\color{red}{\text{Solution 3.4}}$

<font color='red'> If some input values are very large compared to the others, the softmax outputs for the smaller inputs underflow to exactly zero. Taking the logarithm of these outputs then evaluates $\log(0)$, which is $-\infty$. Example: for $x = [5.0, 1.0, 10.0, 1000.0]$ we have $\mathrm{softmax}(x) = [0., 0., 0., 1.]$, and taking the log gives $\log(\mathrm{softmax}(x)) = [-\infty, -\infty, -\infty, 0.]$.
Therefore, **computing log on top of softmax is numerically unstable.**</font>

In [None]:
import torch
import torch.nn as nn
x = torch.tensor([5, 1, 10.0, 500])
x = x.reshape((1, -1))
print(x)

tensor([[  5.,   1.,  10., 500.]])


In [None]:
m=nn.Softmax(dim=1)
m(x)

tensor([[0., 0., 0., 1.]])

In [None]:
torch.log(m(x))

tensor([[-inf, -inf, -inf, 0.]])
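
A common remedy is to compute the log-softmax directly with the log-sum-exp trick, subtracting the maximum input before exponentiating so that no intermediate value overflows or collapses to $\log(0)$. The plain-Python function below is a minimal sketch of that idea; the name `log_softmax` is chosen here for illustration. In PyTorch, `torch.nn.functional.log_softmax` serves the same purpose, and `nn.CrossEntropyLoss` takes raw logits and applies a log-softmax internally.

```python
import math


def log_softmax(xs):
    """Log-softmax via the log-sum-exp trick: subtract the max before exponentiating."""
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]


x = [5.0, 1.0, 10.0, 1000.0]
print(log_softmax(x))  # [-995.0, -999.0, -990.0, 0.0]
```

All four outputs are finite, whereas the naive `log(softmax(x))` shown above returns `-inf` for the three smaller inputs.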

## 4. Text Classification with LSTM and Dropout

Replace your RNN module with an LSTM module. See Lecture slides 05 and 06 for the formal definition of LSTMs. 
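
For reference, the standard LSTM update can be written as follows. This is the common textbook formulation, written with row-vector inputs and the gate order $i, f, g, o$; the notation in the lecture slides may differ, and the symbols $W_\ast$, $U_\ast$, $b_\ast$ are chosen here for illustration.

$$
\begin{aligned}
i_t &= \sigma(x_t W_i + h_{t-1} U_i + b_i) \\
f_t &= \sigma(x_t W_f + h_{t-1} U_f + b_f) \\
g_t &= \tanh(x_t W_g + h_{t-1} U_g + b_g) \\
o_t &= \sigma(x_t W_o + h_{t-1} U_o + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $h_0 = c_0 = 0$.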

You will also use Dropout, which during training randomly sets each dimension to zero with probability `p` and scales the remaining dimensions by `1/(1-p)`. Put it at either the input or the output of the LSTM to prevent overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.2000, 0.6000, 0.0000, 0.0000, 0.0000])
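
The scaling by `1/(1-p)` only happens in training mode; in evaluation mode `nn.Dropout` passes its input through unchanged. The short check below illustrates both modes on a vector of ones. It is a sketch for illustration, and the seed and vector length are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
dropout = nn.Dropout(p)
a = torch.ones(1000)

dropout.train()
train_out = dropout(a)  # each entry is either 0 or 1 / (1 - p) = 2

dropout.eval()
eval_out = dropout(a)   # identity in eval mode

print(train_out.unique())        # only the values 0 and 2 can appear
print(train_out.mean())          # close to 1 on average
print(torch.equal(eval_out, a))  # eval mode leaves the input unchanged
```

This is why `model.train()` and `model.eval()` need to be called at the right points of the training and evaluation loops.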


> **Problem 4.1** *(3 points)* Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN. Report the accuracy on the dev data.



In [None]:
import torch
import torch.nn as nn
import numpy as np

class LSTMCell(nn.Module):
    def __init__(self, vocab, input_size, hidden_size, output_size=2, bias=True, drop=0):
        super(LSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        self.drop = drop
        self.output_size = output_size

        # Each linear layer produces the pre-activations of all four gates at once.
        self.fc_x2h = nn.Linear(input_size, hidden_size * 4, bias=bias)
        self.fc_h2h = nn.Linear(hidden_size, hidden_size * 4, bias=bias)
        self.embedding = nn.Embedding(len(vocab), self.input_size)
        self.dropout = nn.Dropout(drop)
        self.classifier = nn.Linear(self.hidden_size, self.output_size, bias=True)
        self.reset_parameters()

    def reset_parameters(self):
        # Uniform initialization in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)] for every parameter.
        std = 1.0 / np.sqrt(self.hidden_size)
        for w in self.parameters():
            w.data.uniform_(-std, std)

    def forward(self, input, hx=None):
        emb = self.embedding(input)  # (batch, seq_len, input_size)

        if hx is None:
            # Start from zero hidden and cell states.
            batch_size = emb.shape[0]
            hx = torch.zeros(batch_size, self.hidden_size, device=emb.device)
            cx = torch.zeros(batch_size, self.hidden_size, device=emb.device)
        else:
            hx, cx = hx

        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]

            gates = self.fc_x2h(emb_t) + self.fc_h2h(hx)
            # Get gates (i_t, f_t, g_t, o_t)
            input_gate, forget_gate, cell_gate, output_gate = gates.chunk(4, 1)

            i_t = torch.sigmoid(input_gate)
            f_t = torch.sigmoid(forget_gate)
            g_t = torch.tanh(cell_gate)
            o_t = torch.sigmoid(output_gate)

            # Carry the updated cell and hidden states into the next time step.
            cx = cx * f_t + i_t * g_t
            hx = o_t * torch.tanh(cx)

        # nn.Dropout is the identity when p == 0 or in eval mode.
        out = self.classifier(self.dropout(hx))
        return out

In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 16)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 16)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 16)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


  0%|          | 0/3 [00:00<?, ?it/s]

2


In [None]:
d = 100 # size of word-embedding
num_epochs = 50
model = LSTMCell(vocab, input_size=100, hidden_size=100).cuda()
train(model, train_loader, valid_loader, num_epochs, 'lstm1_best_model.pth')

100%|██████████| 267/267 [00:02<00:00, 116.30it/s]
100%|██████████| 35/35 [00:00<00:00, 179.76it/s]


epoch 0: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.81, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 117.39it/s]
100%|██████████| 35/35 [00:00<00:00, 184.13it/s]


epoch 1: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.95, Val_Acc- 50.68


100%|██████████| 267/267 [00:02<00:00, 120.99it/s]
100%|██████████| 35/35 [00:00<00:00, 169.68it/s]


epoch 2: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.11, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 118.74it/s]
100%|██████████| 35/35 [00:00<00:00, 182.07it/s]


epoch 3: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.12, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 122.88it/s]
100%|██████████| 35/35 [00:00<00:00, 183.49it/s]


epoch 4: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.30, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 121.43it/s]
100%|██████████| 35/35 [00:00<00:00, 178.03it/s]


epoch 5: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.22, Val_Acc- 50.05


100%|██████████| 267/267 [00:02<00:00, 122.39it/s]
100%|██████████| 35/35 [00:00<00:00, 171.14it/s]


epoch 6: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.50, Val_Acc- 52.77


100%|██████████| 267/267 [00:02<00:00, 122.24it/s]
100%|██████████| 35/35 [00:00<00:00, 171.73it/s]


epoch 7: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.69, Val_Acc- 50.68


100%|██████████| 267/267 [00:02<00:00, 118.68it/s]
100%|██████████| 35/35 [00:00<00:00, 140.37it/s]


epoch 8: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.54, Val_Acc- 53.04


100%|██████████| 267/267 [00:02<00:00, 122.36it/s]
100%|██████████| 35/35 [00:00<00:00, 178.05it/s]


epoch 9: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.26, Val_Acc- 49.50


100%|██████████| 267/267 [00:02<00:00, 118.18it/s]
100%|██████████| 35/35 [00:00<00:00, 171.36it/s]


epoch 10: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.15, Val_Acc- 50.68


100%|██████████| 267/267 [00:02<00:00, 121.54it/s]
100%|██████████| 35/35 [00:00<00:00, 179.49it/s]


epoch 11: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.95, Val_Acc- 51.86


100%|██████████| 267/267 [00:02<00:00, 118.60it/s]
100%|██████████| 35/35 [00:00<00:00, 155.77it/s]


epoch 12: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.19, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 120.80it/s]
100%|██████████| 35/35 [00:00<00:00, 156.67it/s]


epoch 13: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.02, Val_Acc- 50.68


100%|██████████| 267/267 [00:02<00:00, 118.01it/s]
100%|██████████| 35/35 [00:00<00:00, 179.22it/s]


epoch 14: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.00, Val_Acc- 48.59


100%|██████████| 267/267 [00:02<00:00, 120.25it/s]
100%|██████████| 35/35 [00:00<00:00, 179.82it/s]


epoch 15: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.51, Val_Acc- 52.04


100%|██████████| 267/267 [00:02<00:00, 119.15it/s]
100%|██████████| 35/35 [00:00<00:00, 168.33it/s]


epoch 16: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.43, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 113.09it/s]
100%|██████████| 35/35 [00:00<00:00, 163.97it/s]


epoch 17: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.69, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 118.41it/s]
100%|██████████| 35/35 [00:00<00:00, 172.06it/s]


epoch 18: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.19, Val_Acc- 50.68


100%|██████████| 267/267 [00:02<00:00, 119.38it/s]
100%|██████████| 35/35 [00:00<00:00, 160.89it/s]


epoch 19: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.19, Val_Acc- 52.23


100%|██████████| 267/267 [00:02<00:00, 117.70it/s]
100%|██████████| 35/35 [00:00<00:00, 167.96it/s]


epoch 20: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.56, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 115.39it/s]
100%|██████████| 35/35 [00:00<00:00, 160.16it/s]


epoch 21: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.57, Val_Acc- 49.14


100%|██████████| 267/267 [00:02<00:00, 117.17it/s]
100%|██████████| 35/35 [00:00<00:00, 175.82it/s]


epoch 22: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.69, Val_Acc- 49.23


100%|██████████| 267/267 [00:02<00:00, 120.21it/s]
100%|██████████| 35/35 [00:00<00:00, 173.88it/s]


epoch 23: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.73, Val_Acc- 48.59


100%|██████████| 267/267 [00:02<00:00, 119.27it/s]
100%|██████████| 35/35 [00:00<00:00, 181.86it/s]


epoch 24: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.24, Val_Acc- 51.68


100%|██████████| 267/267 [00:02<00:00, 116.53it/s]
100%|██████████| 35/35 [00:00<00:00, 172.39it/s]


epoch 25: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.19, Val_Acc- 52.41


100%|██████████| 267/267 [00:02<00:00, 118.14it/s]
100%|██████████| 35/35 [00:00<00:00, 173.58it/s]


epoch 26: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.85, Val_Acc- 51.86


100%|██████████| 267/267 [00:02<00:00, 119.01it/s]
100%|██████████| 35/35 [00:00<00:00, 186.12it/s]


epoch 27: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.44, Val_Acc- 50.77


100%|██████████| 267/267 [00:02<00:00, 123.07it/s]
100%|██████████| 35/35 [00:00<00:00, 158.42it/s]


epoch 28: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.61, Val_Acc- 48.41


100%|██████████| 267/267 [00:02<00:00, 117.49it/s]
100%|██████████| 35/35 [00:00<00:00, 185.56it/s]


epoch 29: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.83, Val_Acc- 51.77


100%|██████████| 267/267 [00:02<00:00, 118.45it/s]
100%|██████████| 35/35 [00:00<00:00, 180.61it/s]


epoch 30: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.35, Val_Acc- 48.50


100%|██████████| 267/267 [00:02<00:00, 122.00it/s]
100%|██████████| 35/35 [00:00<00:00, 172.22it/s]


epoch 31: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.48, Val_Acc- 52.41


100%|██████████| 267/267 [00:02<00:00, 118.90it/s]
100%|██████████| 35/35 [00:00<00:00, 178.91it/s]


epoch 32: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.99, Val_Acc- 51.86


100%|██████████| 267/267 [00:02<00:00, 114.46it/s]
100%|██████████| 35/35 [00:00<00:00, 136.61it/s]


epoch 33: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.55, Val_Acc- 49.14


100%|██████████| 267/267 [00:02<00:00, 121.53it/s]
100%|██████████| 35/35 [00:00<00:00, 185.14it/s]


epoch 34: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.62, Val_Acc- 48.96


100%|██████████| 267/267 [00:02<00:00, 118.96it/s]
100%|██████████| 35/35 [00:00<00:00, 175.90it/s]


epoch 35: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.55, Val_Acc- 49.23


100%|██████████| 267/267 [00:02<00:00, 119.34it/s]
100%|██████████| 35/35 [00:00<00:00, 175.74it/s]


epoch 36: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 54.15, Val_Acc- 49.41


100%|██████████| 267/267 [00:02<00:00, 121.40it/s]
100%|██████████| 35/35 [00:00<00:00, 177.02it/s]


epoch 37: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 54.34, Val_Acc- 49.23


100%|██████████| 267/267 [00:02<00:00, 117.48it/s]
100%|██████████| 35/35 [00:00<00:00, 186.26it/s]


epoch 38: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 54.30, Val_Acc- 51.95


100%|██████████| 267/267 [00:02<00:00, 119.90it/s]
100%|██████████| 35/35 [00:00<00:00, 179.61it/s]


epoch 39: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 55.40, Val_Acc- 51.77


100%|██████████| 267/267 [00:02<00:00, 121.68it/s]
100%|██████████| 35/35 [00:00<00:00, 158.33it/s]


epoch 40: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 55.03, Val_Acc- 52.23


100%|██████████| 267/267 [00:02<00:00, 118.08it/s]
100%|██████████| 35/35 [00:00<00:00, 123.45it/s]


epoch 41: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 56.58, Val_Acc- 51.14


100%|██████████| 267/267 [00:02<00:00, 117.76it/s]
100%|██████████| 35/35 [00:00<00:00, 170.01it/s]


epoch 42: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 56.30, Val_Acc- 48.50


100%|██████████| 267/267 [00:02<00:00, 117.59it/s]
100%|██████████| 35/35 [00:00<00:00, 177.19it/s]


epoch 43: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 56.04, Val_Acc- 52.13


100%|██████████| 267/267 [00:02<00:00, 114.67it/s]
100%|██████████| 35/35 [00:00<00:00, 152.73it/s]


epoch 44: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 56.46, Val_Acc- 48.96


100%|██████████| 267/267 [00:02<00:00, 117.70it/s]
100%|██████████| 35/35 [00:00<00:00, 181.02it/s]


epoch 45: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 57.58, Val_Acc- 49.59


100%|██████████| 267/267 [00:02<00:00, 121.12it/s]
100%|██████████| 35/35 [00:00<00:00, 180.97it/s]


epoch 46: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 57.08, Val_Acc- 52.13


100%|██████████| 267/267 [00:02<00:00, 115.99it/s]
100%|██████████| 35/35 [00:00<00:00, 166.45it/s]


epoch 47: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 58.02, Val_Acc- 52.50


100%|██████████| 267/267 [00:02<00:00, 117.55it/s]
100%|██████████| 35/35 [00:00<00:00, 185.43it/s]


epoch 48: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 57.78, Val_Acc- 52.04


100%|██████████| 267/267 [00:02<00:00, 116.41it/s]
100%|██████████| 35/35 [00:00<00:00, 180.20it/s]

epoch 49: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 59.12, Val_Acc- 49.68
Best accuracy at epoch: 8





In [None]:
test(model, test_loader, file = 'lstm1_best_model.pth')

100%|██████████| 70/70 [00:00<00:00, 138.00it/s]


Test set: Average loss: 0.0220, Accuracy: 1121/2210 (50.72%)






In [None]:
test(model, valid_loader, file = 'lstm1_best_model.pth')

100%|██████████| 35/35 [00:00<00:00, 141.31it/s]


Test set: Average loss: 0.0220, Accuracy: 584/1101 (53.04%)






In [None]:
# L = sequence length, vac = validation accuracy, tac = test accuracy, epoch = best epoch
# L=52 vac=50.68 tac=51.72 epoch=2
# L=40 vac=50.58 tac=51.72 epoch=9
# L=32 vac=50.95 tac=51.40 epoch=6  training learns slowly
# L=16 vac=53.86.86 tac=52.53 epoch=27
# L=8  vac=53.86.94 tac=54.16 epoch=47  underfitting
# L=24 vac=53.13 tac=52.08 epoch=8

In [None]:

# We built our LSTM after closely studying the work at
# https://github.com/piEsposito/pytorch-lstm-by-hand

In [None]:
import math
class LSTM(nn.Module):
    def __init__(self, d, dropout=None):
        super(LSTM, self).__init__()
        self.input_size = d
        self.hidden_size = d
        self.output_size = 2
        self.drop = dropout  # any non-None value enables dropout before the classifier
        self.embedding = nn.Embedding(len(vocab), self.input_size)
        # Gate parameters stacked in the order i_t, f_t, g_t, o_t
        self.W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.b = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.dropout = nn.Dropout(0.25)  # the dropout rate is fixed at 0.25
        self.linear = nn.Linear(self.hidden_size, self.output_size, bias=True)

        self.init_weights()

    def init_weights(self):
        # Uniform initialization in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)] for every parameter.
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)

    def forward(self, x):
        emb = self.embedding(x)  # (batch, seq_len, input_size)
        batch_size = emb.shape[0]

        # Zero initial hidden and cell states.
        h_t = torch.zeros(batch_size, self.hidden_size, device=emb.device)
        c_t = torch.zeros(batch_size, self.hidden_size, device=emb.device)

        ds = self.hidden_size
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            # Pre-activations of all four gates, shape (batch, 4 * hidden_size)
            gate = emb_t @ self.W + h_t @ self.U + self.b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]),
                torch.sigmoid(gate[:, ds:ds*2]),
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]),
            )
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * torch.tanh(c_t)

        # Classify from the last hidden state, with optional dropout.
        if self.drop is not None:
            h_t = self.dropout(h_t)
        out = self.linear(h_t)
        return out
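
One way to gain confidence in a hand-written LSTM step is to compare it against PyTorch's built-in `nn.LSTMCell` on random inputs. The check below is a sketch for verification only and is not a substitute for the hand-written module required by Problem 4.1. It reuses the weights of an `nn.LSTMCell`, which stacks its gates in the order i, f, g, o, and recomputes one step by hand; the tensor sizes and the seed are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, input_size, hidden_size = 3, 4, 5
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch, input_size)
h = torch.randn(batch, hidden_size)
c = torch.randn(batch, hidden_size)

with torch.no_grad():
    # Reference step from the built-in cell
    h_ref, c_ref = cell(x, (h, c))

    # Manual step with the same weights; gates are stacked as i, f, g, o
    gates = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
    i, f, g, o = gates.chunk(4, dim=1)
    c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
    h_new = torch.sigmoid(o) * torch.tanh(c_new)

print(torch.allclose(h_new, h_ref, atol=1e-5))
print(torch.allclose(c_new, c_ref, atol=1e-5))
```

The same comparison can be applied to a custom module by copying its parameters into the layout that `nn.LSTMCell` expects, keeping in mind that the built-in cell stores its weight matrices transposed relative to the `W` and `U` convention used in the class above.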

In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 16)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 16)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 16)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


  0%|          | 0/3 [00:00<?, ?it/s]

2


In [None]:
d = 100 # size of word-embedding
num_epochs = 50
model = LSTM(d=100).cuda()

train(model, train_loader, valid_loader, num_epochs, 'lstm20_best_model.pth')
# train_loss, train_acc, val_loss, val_acc = train(model, train_loader, valid_loader, num_epochs, 'lstm2_best_model.pth')

100%|██████████| 267/267 [00:05<00:00, 51.53it/s]
100%|██████████| 35/35 [00:00<00:00, 193.21it/s]


epoch 0: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.40, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.35it/s]
100%|██████████| 35/35 [00:00<00:00, 194.20it/s]


epoch 1: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.33, Val_Acc- 49.50


100%|██████████| 267/267 [00:05<00:00, 50.96it/s]
100%|██████████| 35/35 [00:00<00:00, 181.09it/s]


epoch 2: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.88, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.41it/s]
100%|██████████| 35/35 [00:00<00:00, 205.41it/s]


epoch 3: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.73, Val_Acc- 50.68


100%|██████████| 267/267 [00:05<00:00, 51.30it/s]
100%|██████████| 35/35 [00:00<00:00, 201.78it/s]


epoch 4: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.63, Val_Acc- 50.68


100%|██████████| 267/267 [00:05<00:00, 51.13it/s]
100%|██████████| 35/35 [00:00<00:00, 193.10it/s]


epoch 5: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.23, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.21it/s]
100%|██████████| 35/35 [00:00<00:00, 197.01it/s]


epoch 6: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.71, Val_Acc- 50.77


100%|██████████| 267/267 [00:05<00:00, 50.63it/s]
100%|██████████| 35/35 [00:00<00:00, 186.70it/s]


epoch 7: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.05, Val_Acc- 51.41


100%|██████████| 267/267 [00:05<00:00, 50.90it/s]
100%|██████████| 35/35 [00:00<00:00, 168.46it/s]


epoch 8: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.82, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.03it/s]
100%|██████████| 35/35 [00:00<00:00, 190.02it/s]


epoch 9: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.74, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 50.28it/s]
100%|██████████| 35/35 [00:00<00:00, 197.84it/s]


epoch 10: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.64, Val_Acc- 52.32


100%|██████████| 267/267 [00:05<00:00, 51.41it/s]
100%|██████████| 35/35 [00:00<00:00, 174.12it/s]


epoch 11: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.07, Val_Acc- 53.22


100%|██████████| 267/267 [00:05<00:00, 52.04it/s]
100%|██████████| 35/35 [00:00<00:00, 186.98it/s]


epoch 12: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.90, Val_Acc- 50.86


100%|██████████| 267/267 [00:05<00:00, 51.38it/s]
100%|██████████| 35/35 [00:00<00:00, 188.42it/s]


epoch 13: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.65, Val_Acc- 53.04


100%|██████████| 267/267 [00:05<00:00, 51.23it/s]
100%|██████████| 35/35 [00:00<00:00, 184.66it/s]


epoch 14: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.73, Val_Acc- 51.23


100%|██████████| 267/267 [00:05<00:00, 50.91it/s]
100%|██████████| 35/35 [00:00<00:00, 184.74it/s]


epoch 15: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.72, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 52.04it/s]
100%|██████████| 35/35 [00:00<00:00, 203.42it/s]


epoch 16: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.31, Val_Acc- 53.95


100%|██████████| 267/267 [00:05<00:00, 51.68it/s]
100%|██████████| 35/35 [00:00<00:00, 183.19it/s]


epoch 17: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.28, Val_Acc- 50.68


100%|██████████| 267/267 [00:05<00:00, 49.88it/s]
100%|██████████| 35/35 [00:00<00:00, 193.27it/s]


epoch 18: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.91, Val_Acc- 54.68


100%|██████████| 267/267 [00:05<00:00, 51.03it/s]
100%|██████████| 35/35 [00:00<00:00, 196.46it/s]


epoch 19: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.00, Val_Acc- 51.77


100%|██████████| 267/267 [00:05<00:00, 50.95it/s]
100%|██████████| 35/35 [00:00<00:00, 193.83it/s]


epoch 20: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.50, Val_Acc- 50.77


100%|██████████| 267/267 [00:05<00:00, 51.25it/s]
100%|██████████| 35/35 [00:00<00:00, 194.48it/s]


epoch 21: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 54.42, Val_Acc- 52.41


100%|██████████| 267/267 [00:05<00:00, 51.62it/s]
100%|██████████| 35/35 [00:00<00:00, 183.85it/s]


epoch 22: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 54.85, Val_Acc- 56.58


100%|██████████| 267/267 [00:05<00:00, 51.47it/s]
100%|██████████| 35/35 [00:00<00:00, 192.25it/s]


epoch 23: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 55.49, Val_Acc- 58.86


100%|██████████| 267/267 [00:05<00:00, 51.46it/s]
100%|██████████| 35/35 [00:00<00:00, 196.12it/s]


epoch 24: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 57.49, Val_Acc- 59.04


100%|██████████| 267/267 [00:05<00:00, 51.48it/s]
100%|██████████| 35/35 [00:00<00:00, 203.90it/s]


epoch 25: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 57.87, Val_Acc- 56.58


100%|██████████| 267/267 [00:05<00:00, 52.03it/s]
100%|██████████| 35/35 [00:00<00:00, 185.79it/s]


epoch 26: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 59.01, Val_Acc- 59.40


100%|██████████| 267/267 [00:05<00:00, 50.45it/s]
100%|██████████| 35/35 [00:00<00:00, 190.43it/s]


epoch 27: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 61.00, Val_Acc- 59.13


100%|██████████| 267/267 [00:05<00:00, 52.31it/s]
100%|██████████| 35/35 [00:00<00:00, 201.06it/s]


epoch 28: Training_Loss- 0.020, Val_Loss- 0.022, Training_Acc- 62.87, Val_Acc- 57.95


100%|██████████| 267/267 [00:05<00:00, 50.94it/s]
100%|██████████| 35/35 [00:00<00:00, 189.02it/s]


epoch 29: Training_Loss- 0.020, Val_Loss- 0.020, Training_Acc- 64.20, Val_Acc- 63.40


100%|██████████| 267/267 [00:05<00:00, 50.63it/s]
100%|██████████| 35/35 [00:00<00:00, 191.95it/s]


epoch 30: Training_Loss- 0.019, Val_Loss- 0.020, Training_Acc- 66.12, Val_Acc- 64.31


100%|██████████| 267/267 [00:05<00:00, 51.78it/s]
100%|██████████| 35/35 [00:00<00:00, 191.73it/s]


epoch 31: Training_Loss- 0.019, Val_Loss- 0.023, Training_Acc- 67.38, Val_Acc- 58.31


100%|██████████| 267/267 [00:05<00:00, 50.97it/s]
100%|██████████| 35/35 [00:00<00:00, 188.25it/s]


epoch 32: Training_Loss- 0.019, Val_Loss- 0.019, Training_Acc- 68.55, Val_Acc- 69.48


100%|██████████| 267/267 [00:05<00:00, 51.42it/s]
100%|██████████| 35/35 [00:00<00:00, 204.22it/s]


epoch 33: Training_Loss- 0.018, Val_Loss- 0.020, Training_Acc- 70.51, Val_Acc- 64.85


100%|██████████| 267/267 [00:05<00:00, 51.95it/s]
100%|██████████| 35/35 [00:00<00:00, 188.45it/s]


epoch 34: Training_Loss- 0.017, Val_Loss- 0.021, Training_Acc- 72.23, Val_Acc- 64.49


100%|██████████| 267/267 [00:05<00:00, 49.98it/s]
100%|██████████| 35/35 [00:00<00:00, 201.61it/s]


epoch 35: Training_Loss- 0.017, Val_Loss- 0.019, Training_Acc- 73.36, Val_Acc- 66.94


100%|██████████| 267/267 [00:05<00:00, 51.57it/s]
100%|██████████| 35/35 [00:00<00:00, 189.90it/s]


epoch 36: Training_Loss- 0.017, Val_Loss- 0.019, Training_Acc- 74.25, Val_Acc- 68.30


100%|██████████| 267/267 [00:05<00:00, 50.82it/s]
100%|██████████| 35/35 [00:00<00:00, 188.97it/s]


epoch 37: Training_Loss- 0.016, Val_Loss- 0.018, Training_Acc- 75.37, Val_Acc- 69.48


100%|██████████| 267/267 [00:05<00:00, 50.24it/s]
100%|██████████| 35/35 [00:00<00:00, 187.99it/s]


epoch 38: Training_Loss- 0.016, Val_Loss- 0.019, Training_Acc- 76.98, Val_Acc- 69.03


100%|██████████| 267/267 [00:05<00:00, 51.42it/s]
100%|██████████| 35/35 [00:00<00:00, 198.59it/s]


epoch 39: Training_Loss- 0.015, Val_Loss- 0.020, Training_Acc- 77.19, Val_Acc- 67.03


100%|██████████| 267/267 [00:05<00:00, 51.11it/s]
100%|██████████| 35/35 [00:00<00:00, 192.90it/s]


epoch 40: Training_Loss- 0.015, Val_Loss- 0.020, Training_Acc- 77.90, Val_Acc- 67.48


100%|██████████| 267/267 [00:05<00:00, 50.14it/s]
100%|██████████| 35/35 [00:00<00:00, 172.94it/s]


epoch 41: Training_Loss- 0.015, Val_Loss- 0.020, Training_Acc- 79.10, Val_Acc- 70.03


100%|██████████| 267/267 [00:05<00:00, 50.99it/s]
100%|██████████| 35/35 [00:00<00:00, 193.00it/s]


epoch 42: Training_Loss- 0.014, Val_Loss- 0.019, Training_Acc- 79.89, Val_Acc- 70.03


100%|██████████| 267/267 [00:05<00:00, 50.39it/s]
100%|██████████| 35/35 [00:00<00:00, 184.27it/s]


epoch 43: Training_Loss- 0.014, Val_Loss- 0.020, Training_Acc- 80.75, Val_Acc- 70.12


100%|██████████| 267/267 [00:05<00:00, 49.43it/s]
100%|██████████| 35/35 [00:00<00:00, 197.97it/s]


epoch 44: Training_Loss- 0.013, Val_Loss- 0.020, Training_Acc- 81.41, Val_Acc- 70.48


100%|██████████| 267/267 [00:05<00:00, 50.02it/s]
100%|██████████| 35/35 [00:00<00:00, 197.68it/s]


epoch 45: Training_Loss- 0.013, Val_Loss- 0.024, Training_Acc- 81.79, Val_Acc- 65.21


100%|██████████| 267/267 [00:05<00:00, 51.00it/s]
100%|██████████| 35/35 [00:00<00:00, 163.14it/s]


epoch 46: Training_Loss- 0.013, Val_Loss- 0.020, Training_Acc- 82.07, Val_Acc- 69.75


100%|██████████| 267/267 [00:05<00:00, 50.58it/s]
100%|██████████| 35/35 [00:00<00:00, 177.65it/s]


epoch 47: Training_Loss- 0.012, Val_Loss- 0.020, Training_Acc- 83.19, Val_Acc- 71.30


100%|██████████| 267/267 [00:05<00:00, 50.98it/s]
100%|██████████| 35/35 [00:00<00:00, 176.95it/s]


epoch 48: Training_Loss- 0.012, Val_Loss- 0.019, Training_Acc- 84.29, Val_Acc- 69.94


100%|██████████| 267/267 [00:05<00:00, 51.27it/s]
100%|██████████| 35/35 [00:00<00:00, 192.59it/s]


epoch 49: Training_Loss- 0.011, Val_Loss- 0.020, Training_Acc- 85.45, Val_Acc- 71.57
Best accuracy at epoch: 49


In [None]:
test(model, test_loader, file = 'lstm20_best_model.pth')

100%|██████████| 70/70 [00:00<00:00, 148.33it/s]


Test set: Average loss: 0.0184, Accuracy: 1587/2210 (71.81%)






In [None]:
test(model, valid_loader, file = 'lstm20_best_model.pth')

100%|██████████| 35/35 [00:00<00:00, 151.34it/s]


Test set: Average loss: 0.0197, Accuracy: 788/1101 (71.57%)






#### $\color{red}{\text{Solution 4.1}}$
<font color='red'> We trained the model for 50 epochs; the validation and test accuracies are reported below. We trained two LSTMs: the first uses PyTorch's linear layer (`nn.Linear`) and the second is built directly on `nn.Parameter`. Both models achieve similar results, but the `nn.Parameter`-based model achieves the best results, with the sequence length set to 16 and 8. Experiments on a local machine show that the model fails to learn properly for larger sequence lengths, as shown below. In our experiments, the LSTM performs better than the previous methods.</font>

| Sequence length | Validation accuracy (%) | Test accuracy (%) | Best epoch |
|---|---|---|---|
| 52 | 50.68 | 51.72 | 1 |
| 40 | 50.68 | 51.72 | 3 |
| 32 | 52.59 | 51.49 | 45 |
| 16 | 72.12 | 72.35 | 47 |
| 8 | 67.39 | 67.19 | 37 |
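
The "best epoch" values reported here are the epochs with the highest validation accuracy. The `train` function itself is not shown in this part of the notebook, so the following is only a minimal sketch of that selection rule, assuming a checkpoint is kept whenever validation accuracy strictly improves. The helper name `select_best_epoch` is hypothetical; the sample numbers are the validation accuracies logged for epochs 45 to 49 of the LSTM run above.

```python
def select_best_epoch(val_accuracies, start_epoch=0):
    """Return (epoch, accuracy) for the highest validation accuracy.

    Ties keep the earliest epoch, because only a strict improvement
    replaces the current best.
    """
    best_epoch, best_acc = None, float("-inf")
    for epoch, acc in enumerate(val_accuracies, start=start_epoch):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
            # Assumption: a real training loop would save the model here,
            # e.g. torch.save(model.state_dict(), path).
    return best_epoch, best_acc


# Validation accuracies logged for epochs 45-49 of the LSTM run above.
logged = [65.21, 69.75, 71.30, 69.94, 71.57]
best_epoch, best_acc = select_best_epoch(logged, start_epoch=45)
print(f"Best accuracy at epoch: {best_epoch} ({best_acc:.2f}%)")
```

Selecting on validation accuracy rather than training accuracy matters in these runs: training accuracy keeps rising to about 85% while validation accuracy plateaus near 70%.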



> **Problem 4.2** *(2 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data.



In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 16)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 16)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 16)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/home/soro/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


2


In [None]:
d = 100 # size of word-embedding
num_epochs = 50
model = LSTM(d,'dropout').cuda()
train(model, train_loader, valid_loader, num_epochs, 'lstm_drop_best_model.pth')

100%|██████████| 267/267 [00:05<00:00, 51.01it/s]
100%|██████████| 35/35 [00:00<00:00, 193.85it/s]


epoch 0: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.15, Val_Acc- 50.68


100%|██████████| 267/267 [00:05<00:00, 51.25it/s]
100%|██████████| 35/35 [00:00<00:00, 182.19it/s]


epoch 1: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.12, Val_Acc- 50.68


100%|██████████| 267/267 [00:05<00:00, 51.50it/s]
100%|██████████| 35/35 [00:00<00:00, 168.67it/s]


epoch 2: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.26, Val_Acc- 50.32


100%|██████████| 267/267 [00:05<00:00, 50.65it/s]
100%|██████████| 35/35 [00:00<00:00, 191.90it/s]


epoch 3: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.61, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.04it/s]
100%|██████████| 35/35 [00:00<00:00, 198.48it/s]


epoch 4: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.13, Val_Acc- 51.23


100%|██████████| 267/267 [00:05<00:00, 51.33it/s]
100%|██████████| 35/35 [00:00<00:00, 184.22it/s]


epoch 5: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.08, Val_Acc- 50.86


100%|██████████| 267/267 [00:05<00:00, 50.90it/s]
100%|██████████| 35/35 [00:00<00:00, 184.59it/s]


epoch 6: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.95, Val_Acc- 50.68


100%|██████████| 267/267 [00:05<00:00, 50.43it/s]
100%|██████████| 35/35 [00:00<00:00, 203.07it/s]


epoch 7: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.33, Val_Acc- 52.59


100%|██████████| 267/267 [00:05<00:00, 51.77it/s]
100%|██████████| 35/35 [00:00<00:00, 185.29it/s]


epoch 8: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.81, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.16it/s]
100%|██████████| 35/35 [00:00<00:00, 178.54it/s]


epoch 9: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.37, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.36it/s]
100%|██████████| 35/35 [00:00<00:00, 196.08it/s]


epoch 10: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.59, Val_Acc- 52.13


100%|██████████| 267/267 [00:05<00:00, 52.41it/s]
100%|██████████| 35/35 [00:00<00:00, 197.88it/s]


epoch 11: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.08, Val_Acc- 51.86


100%|██████████| 267/267 [00:05<00:00, 51.32it/s]
100%|██████████| 35/35 [00:00<00:00, 202.06it/s]


epoch 12: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.33, Val_Acc- 53.13


100%|██████████| 267/267 [00:05<00:00, 51.96it/s]
100%|██████████| 35/35 [00:00<00:00, 202.46it/s]


epoch 13: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.67, Val_Acc- 49.32


100%|██████████| 267/267 [00:05<00:00, 51.27it/s]
100%|██████████| 35/35 [00:00<00:00, 193.61it/s]


epoch 14: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.80, Val_Acc- 51.32


100%|██████████| 267/267 [00:05<00:00, 51.32it/s]
100%|██████████| 35/35 [00:00<00:00, 196.71it/s]


epoch 15: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.43, Val_Acc- 52.13


100%|██████████| 267/267 [00:05<00:00, 51.65it/s]
100%|██████████| 35/35 [00:00<00:00, 188.40it/s]


epoch 16: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.80, Val_Acc- 49.14


100%|██████████| 267/267 [00:05<00:00, 51.82it/s]
100%|██████████| 35/35 [00:00<00:00, 192.28it/s]


epoch 17: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.95, Val_Acc- 48.86


100%|██████████| 267/267 [00:05<00:00, 51.17it/s]
100%|██████████| 35/35 [00:00<00:00, 199.46it/s]


epoch 18: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.91, Val_Acc- 55.13


100%|██████████| 267/267 [00:05<00:00, 51.87it/s]
100%|██████████| 35/35 [00:00<00:00, 204.03it/s]


epoch 19: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.36, Val_Acc- 54.68


100%|██████████| 267/267 [00:05<00:00, 51.71it/s]
100%|██████████| 35/35 [00:00<00:00, 176.38it/s]


epoch 20: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.89, Val_Acc- 49.86


100%|██████████| 267/267 [00:05<00:00, 51.64it/s]
100%|██████████| 35/35 [00:00<00:00, 171.36it/s]


epoch 21: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.89, Val_Acc- 54.59


100%|██████████| 267/267 [00:05<00:00, 51.14it/s]
100%|██████████| 35/35 [00:00<00:00, 207.70it/s]


epoch 22: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.36, Val_Acc- 54.41


100%|██████████| 267/267 [00:05<00:00, 51.23it/s]
100%|██████████| 35/35 [00:00<00:00, 176.89it/s]


epoch 23: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 54.56, Val_Acc- 52.13


100%|██████████| 267/267 [00:05<00:00, 52.04it/s]
100%|██████████| 35/35 [00:00<00:00, 174.63it/s]


epoch 24: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 55.63, Val_Acc- 57.58


100%|██████████| 267/267 [00:05<00:00, 52.33it/s]
100%|██████████| 35/35 [00:00<00:00, 191.03it/s]


epoch 25: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 56.51, Val_Acc- 56.22


100%|██████████| 267/267 [00:05<00:00, 51.00it/s]
100%|██████████| 35/35 [00:00<00:00, 195.39it/s]


epoch 26: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 57.32, Val_Acc- 57.40


100%|██████████| 267/267 [00:05<00:00, 52.22it/s]
100%|██████████| 35/35 [00:00<00:00, 199.89it/s]


epoch 27: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 58.75, Val_Acc- 59.58


100%|██████████| 267/267 [00:05<00:00, 52.56it/s]
100%|██████████| 35/35 [00:00<00:00, 184.27it/s]


epoch 28: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 59.87, Val_Acc- 58.95


100%|██████████| 267/267 [00:05<00:00, 51.87it/s]
100%|██████████| 35/35 [00:00<00:00, 185.90it/s]


epoch 29: Training_Loss- 0.020, Val_Loss- 0.021, Training_Acc- 61.58, Val_Acc- 62.76


100%|██████████| 267/267 [00:05<00:00, 50.20it/s]
100%|██████████| 35/35 [00:00<00:00, 204.55it/s]


epoch 30: Training_Loss- 0.020, Val_Loss- 0.021, Training_Acc- 63.27, Val_Acc- 58.76


100%|██████████| 267/267 [00:05<00:00, 50.47it/s]
100%|██████████| 35/35 [00:00<00:00, 190.28it/s]


epoch 31: Training_Loss- 0.020, Val_Loss- 0.020, Training_Acc- 65.27, Val_Acc- 66.67


100%|██████████| 267/267 [00:05<00:00, 52.11it/s]
100%|██████████| 35/35 [00:00<00:00, 206.99it/s]


epoch 32: Training_Loss- 0.019, Val_Loss- 0.019, Training_Acc- 67.26, Val_Acc- 67.85


100%|██████████| 267/267 [00:05<00:00, 50.79it/s]
100%|██████████| 35/35 [00:00<00:00, 199.23it/s]


epoch 33: Training_Loss- 0.019, Val_Loss- 0.019, Training_Acc- 67.84, Val_Acc- 69.30


100%|██████████| 267/267 [00:05<00:00, 52.71it/s]
100%|██████████| 35/35 [00:00<00:00, 170.88it/s]


epoch 34: Training_Loss- 0.018, Val_Loss- 0.021, Training_Acc- 70.42, Val_Acc- 64.03


100%|██████████| 267/267 [00:05<00:00, 52.08it/s]
100%|██████████| 35/35 [00:00<00:00, 198.57it/s]


epoch 35: Training_Loss- 0.018, Val_Loss- 0.019, Training_Acc- 71.25, Val_Acc- 68.39


100%|██████████| 267/267 [00:05<00:00, 51.78it/s]
100%|██████████| 35/35 [00:00<00:00, 195.99it/s]


epoch 36: Training_Loss- 0.017, Val_Loss- 0.019, Training_Acc- 73.05, Val_Acc- 67.94


100%|██████████| 267/267 [00:05<00:00, 51.00it/s]
100%|██████████| 35/35 [00:00<00:00, 174.25it/s]


epoch 37: Training_Loss- 0.017, Val_Loss- 0.020, Training_Acc- 74.09, Val_Acc- 68.39


100%|██████████| 267/267 [00:05<00:00, 50.52it/s]
100%|██████████| 35/35 [00:00<00:00, 183.48it/s]


epoch 38: Training_Loss- 0.017, Val_Loss- 0.019, Training_Acc- 74.24, Val_Acc- 69.66


100%|██████████| 267/267 [00:05<00:00, 50.77it/s]
100%|██████████| 35/35 [00:00<00:00, 182.52it/s]


epoch 39: Training_Loss- 0.016, Val_Loss- 0.020, Training_Acc- 76.22, Val_Acc- 67.67


100%|██████████| 267/267 [00:05<00:00, 50.27it/s]
100%|██████████| 35/35 [00:00<00:00, 176.78it/s]


epoch 40: Training_Loss- 0.016, Val_Loss- 0.019, Training_Acc- 76.56, Val_Acc- 69.12


100%|██████████| 267/267 [00:05<00:00, 51.40it/s]
100%|██████████| 35/35 [00:00<00:00, 195.17it/s]


epoch 41: Training_Loss- 0.015, Val_Loss- 0.019, Training_Acc- 78.08, Val_Acc- 69.30


100%|██████████| 267/267 [00:05<00:00, 51.19it/s]
100%|██████████| 35/35 [00:00<00:00, 183.67it/s]


epoch 42: Training_Loss- 0.015, Val_Loss- 0.019, Training_Acc- 79.18, Val_Acc- 68.39


100%|██████████| 267/267 [00:05<00:00, 51.86it/s]
100%|██████████| 35/35 [00:00<00:00, 190.84it/s]


epoch 43: Training_Loss- 0.015, Val_Loss- 0.019, Training_Acc- 78.86, Val_Acc- 69.94


100%|██████████| 267/267 [00:05<00:00, 52.38it/s]
100%|██████████| 35/35 [00:00<00:00, 196.29it/s]


epoch 44: Training_Loss- 0.014, Val_Loss- 0.023, Training_Acc- 80.21, Val_Acc- 64.31


100%|██████████| 267/267 [00:05<00:00, 51.19it/s]
100%|██████████| 35/35 [00:00<00:00, 184.37it/s]


epoch 45: Training_Loss- 0.014, Val_Loss- 0.019, Training_Acc- 80.78, Val_Acc- 71.75


100%|██████████| 267/267 [00:05<00:00, 51.75it/s]
100%|██████████| 35/35 [00:00<00:00, 173.64it/s]


epoch 46: Training_Loss- 0.014, Val_Loss- 0.021, Training_Acc- 81.00, Val_Acc- 69.12


100%|██████████| 267/267 [00:05<00:00, 50.27it/s]
100%|██████████| 35/35 [00:00<00:00, 195.91it/s]


epoch 47: Training_Loss- 0.013, Val_Loss- 0.020, Training_Acc- 82.20, Val_Acc- 68.76


100%|██████████| 267/267 [00:05<00:00, 51.26it/s]
100%|██████████| 35/35 [00:00<00:00, 197.21it/s]


epoch 48: Training_Loss- 0.012, Val_Loss- 0.020, Training_Acc- 83.73, Val_Acc- 68.12


100%|██████████| 267/267 [00:05<00:00, 51.33it/s]
100%|██████████| 35/35 [00:00<00:00, 184.21it/s]

epoch 49: Training_Loss- 0.012, Val_Loss- 0.021, Training_Acc- 84.00, Val_Acc- 69.85
Best accuracy at epoch: 45





In [None]:
test(model, test_loader, file = 'lstm_drop_best_model.pth')

100%|██████████| 70/70 [00:00<00:00, 161.51it/s]


Test set: Average loss: 0.0178, Accuracy: 1597/2210 (72.26%)






In [None]:
test(model, valid_loader, file = 'lstm_drop_best_model.pth')

100%|██████████| 35/35 [00:00<00:00, 155.29it/s]


Test set: Average loss: 0.0187, Accuracy: 790/1101 (71.75%)






#### $\color{red}{\text{Solution 4.2}}$
<font color='red'> Dropout is applied to the last hidden state before it is fed to the output linear layer. The results below, obtained on a local machine with an RTX2018 GPU, show that the model learns too slowly for smaller sequence lengths and, like the plain LSTM, fails to learn for larger sequence lengths. Dropout does not improve performance significantly. That could be because the model is underfitting.</font>

| Sequence length | Validation accuracy (%) | Test accuracy (%) | Best epoch |
|---|---|---|---|
| 52 | 50.68 | 51.72 | 1 |
| 40 | 50.68 | 51.72 | 3 |
| 32 | 50.96 | 51.40 | 45 |
| 16 | 71.75 | 72.26 | 46 |
| 8 | 55.04 | 52.71 | 30 |
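
To make the dropout step concrete, here is a minimal pure-Python sketch of inverted dropout applied to a final hidden state, which is the scheme `nn.Dropout` follows: elements are zeroed with probability `p` during training and the survivors are scaled by `1 / (1 - p)`, while evaluation leaves the input unchanged. The function `dropout`, the rate `p=0.5`, and the vector `h_last` are illustrative assumptions; the dropout rate used in the `LSTM(d, 'dropout')` model is not stated in this section.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of floats.

    Training: zero each element with probability p and scale the
    survivors by 1 / (1 - p). Evaluation: return the input unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


h_last = [0.2, -0.7, 1.5, 0.4]   # hypothetical stand-in for the final hidden state
rng = random.Random(0)           # seeded so the mask is reproducible
h_train = dropout(h_last, p=0.5, training=True, rng=rng)
h_eval = dropout(h_last, p=0.5, training=False)
```

Because the scaling happens at training time, no rescaling is needed at evaluation time. With the PyTorch module this switch is controlled by `model.train()` and `model.eval()`.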
    


In [None]:
# class CustomLSTM(nn.Module):
#     def __init__(self, input_sz, hidden_sz):
#         super().__init__()
#         self.input_sz = input_sz
#         self.hidden_size = hidden_sz
#         self.W = nn.Parameter(torch.Tensor(input_sz, hidden_sz * 4))
#         self.U = nn.Parameter(torch.Tensor(hidden_sz, hidden_sz * 4))
#         self.bias = nn.Parameter(torch.Tensor(hidden_sz * 4))
#         self.init_weights()
                
#     def init_weights(self):
#         stdv = 1.0 / math.sqrt(self.hidden_size)
#         for weight in self.parameters():
#             weight.data.uniform_(-stdv, stdv)
         
#     def forward(self, x, 
#                 init_states=None):

#         bs, seq_sz, _ = x.size()
#         hidden_seq = []
#         if init_states is None:
#             h_t, c_t = (torch.zeros(bs, self.hidden_size).to(x.device), 
#                         torch.zeros(bs, self.hidden_size).to(x.device))
#         else:
#             h_t, c_t = init_states
         
#         HS = self.hidden_size
#         for t in range(seq_sz):
#             x_t = x[:, t, :]
#             # batch the computations into a single matrix multiplication
#             gates = x_t @ self.W + h_t @ self.U + self.bias
#             i_t, f_t, g_t, o_t = (
#                 torch.sigmoid(gates[:, :HS]), # input
#                 torch.sigmoid(gates[:, HS:HS*2]), # forget
#                 torch.tanh(gates[:, HS*2:HS*3]),
#                 torch.sigmoid(gates[:, HS*3:]), # output
#             )
#             c_t = f_t * c_t + i_t * g_t
#             h_t = o_t * torch.tanh(c_t)
#             hidden_seq.append(h_t.unsqueeze(0))
#         hidden_seq = torch.cat(hidden_seq, dim=0)
#         # reshape from shape (sequence, batch, feature) to (batch, sequence, feature)
#         hidden_seq = hidden_seq.transpose(0, 1).contiguous()
#         return hidden_seq, (h_t, c_t)

> **Problem 4.3 (bonus)** *(2 points)* Consider implementing bidirectional LSTM and two layers of LSTM. Concatenate the forward direction output at the final time step and the backward direction output at the first time step for the final classification. Report your accuracy on dev data.
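
The choice of time steps in this problem can be illustrated with a toy recurrence. The sketch below is not an LSTM: `toy_step` is a hypothetical stand-in for the cell update, and `run_direction` is an illustrative helper. It only shows which states get concatenated. The forward state at the final time step and the backward state at the first time step are the two states that have each consumed the whole sequence.

```python
def run_direction(inputs, step, h0, reverse=False):
    """Run a recurrence over inputs; return {time step: state after that step}."""
    order = range(len(inputs) - 1, -1, -1) if reverse else range(len(inputs))
    h, states = h0, {}
    for t in order:
        h = step(inputs[t], h)
        states[t] = h
    return states


def toy_step(x, h):
    # Hypothetical stand-in for an LSTM cell: an order-sensitive update.
    return 0.5 * h + x


xs = [1.0, 2.0, 3.0]
fwd = run_direction(xs, toy_step, h0=0.0)                 # processes t = 0, 1, 2
bwd = run_direction(xs, toy_step, h0=0.0, reverse=True)   # processes t = 2, 1, 0

T = len(xs)
# Forward output at the final step and backward output at the first step.
features = [fwd[T - 1], bwd[0]]
```

In a loop-based implementation, the backward state at time step 0 is simply the state held after the reversed loop finishes, so no indexing into stored states is required.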

In [None]:
import math  # needed by init_weights (math.sqrt)


class BiLSTM(nn.Module):
    def __init__(self, d, dropout = None):

        super(BiLSTM, self).__init__()
        self.input_size = d
        self.hidden_size = d
        self.output_size = 2

        self.embedding = nn.Embedding(len(vocab), self.input_size)
        # forward lstm parameters
        self.forward_W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.forward_U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.forward_b = nn.Parameter(torch.Tensor(self.hidden_size * 4))
        # backward lstm parameters
        self.backward_W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.backward_U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.backward_b = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.linear = nn.Linear(self.hidden_size*2, self.output_size, bias=True) 

        self.init_weights()
                
    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)
         
    def forward(self, input):
        """Assumes input holds token ids of shape (batch, sequence)."""
       
        emb = self.embedding(input)
        batch_size = emb.shape[0]

        forward_pass = []
        backward_pass = []
        h_t_forward, c_t_forward = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
        h_t_backward, c_t_backward = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
          
        ds = self.hidden_size 
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.forward_W + h_t_forward @ self.forward_U + self.forward_b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t_forward = f_t * c_t_forward + i_t * g_t
            h_t_forward = o_t * torch.tanh(c_t_forward)
            forward_pass.append(h_t_forward)

        for t in reversed(range(emb.shape[1])):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.backward_W + h_t_backward @ self.backward_U + self.backward_b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t_backward = f_t * c_t_backward + i_t * g_t
            h_t_backward = o_t * torch.tanh(c_t_backward)
            backward_pass.append(h_t_backward)

        # forward state after the last time step + backward state after reaching the first time step
        h_final = torch.cat((h_t_forward, h_t_backward), 1)
        out = self.linear(h_final)
        return out

In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 8)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 8)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 8)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/home/soro/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


2


In [None]:
d = 100
num_epochs = 50
model = BiLSTM(d).cuda()
train(model, train_loader, valid_loader, num_epochs, 'bilstm_best_model.pth')

100%|██████████| 267/267 [00:02<00:00, 113.48it/s]
100%|██████████| 35/35 [00:00<00:00, 462.20it/s]

epoch 0: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.99, Val_Acc- 49.32


100%|██████████| 267/267 [00:02<00:00, 111.45it/s]
100%|██████████| 35/35 [00:00<00:00, 452.27it/s]

epoch 1: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.98, Val_Acc- 50.68


100%|██████████| 267/267 [00:02<00:00, 111.81it/s]
100%|██████████| 35/35 [00:00<00:00, 452.69it/s]

epoch 2: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 51.74, Val_Acc- 53.04


100%|██████████| 267/267 [00:02<00:00, 111.21it/s]
100%|██████████| 35/35 [00:00<00:00, 453.80it/s]

epoch 3: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.98, Val_Acc- 57.67


100%|██████████| 267/267 [00:02<00:00, 112.90it/s]
100%|██████████| 35/35 [00:00<00:00, 439.10it/s]

epoch 4: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 53.45, Val_Acc- 56.68


100%|██████████| 267/267 [00:02<00:00, 113.25it/s]
100%|██████████| 35/35 [00:00<00:00, 431.77it/s]

epoch 5: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 54.62, Val_Acc- 57.86


100%|██████████| 267/267 [00:02<00:00, 114.75it/s]
100%|██████████| 35/35 [00:00<00:00, 465.17it/s]

epoch 6: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 54.65, Val_Acc- 56.68


100%|██████████| 267/267 [00:02<00:00, 115.11it/s]
100%|██████████| 35/35 [00:00<00:00, 458.31it/s]

epoch 7: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 55.62, Val_Acc- 53.68


100%|██████████| 267/267 [00:02<00:00, 115.34it/s]
100%|██████████| 35/35 [00:00<00:00, 436.43it/s]

epoch 8: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 56.51, Val_Acc- 58.04


100%|██████████| 267/267 [00:02<00:00, 114.57it/s]
100%|██████████| 35/35 [00:00<00:00, 446.54it/s]

epoch 9: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 56.91, Val_Acc- 58.67


100%|██████████| 267/267 [00:02<00:00, 116.28it/s]
100%|██████████| 35/35 [00:00<00:00, 450.40it/s]

epoch 10: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 57.60, Val_Acc- 58.58


100%|██████████| 267/267 [00:02<00:00, 116.10it/s]
100%|██████████| 35/35 [00:00<00:00, 446.30it/s]

epoch 11: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 57.24, Val_Acc- 58.04


100%|██████████| 267/267 [00:02<00:00, 115.40it/s]
100%|██████████| 35/35 [00:00<00:00, 448.53it/s]

epoch 12: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 58.80, Val_Acc- 58.76


100%|██████████| 267/267 [00:02<00:00, 113.68it/s]
100%|██████████| 35/35 [00:00<00:00, 449.20it/s]

epoch 13: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 60.23, Val_Acc- 59.67


100%|██████████| 267/267 [00:02<00:00, 112.95it/s]
100%|██████████| 35/35 [00:00<00:00, 445.36it/s]

epoch 14: Training_Loss- 0.021, Val_Loss- 0.021, Training_Acc- 60.84, Val_Acc- 60.49


100%|██████████| 267/267 [00:02<00:00, 112.01it/s]
100%|██████████| 35/35 [00:00<00:00, 446.88it/s]

epoch 15: Training_Loss- 0.020, Val_Loss- 0.020, Training_Acc- 62.49, Val_Acc- 61.85


100%|██████████| 267/267 [00:02<00:00, 112.01it/s]
100%|██████████| 35/35 [00:00<00:00, 451.72it/s]

epoch 16: Training_Loss- 0.020, Val_Loss- 0.020, Training_Acc- 63.83, Val_Acc- 60.85


100%|██████████| 267/267 [00:02<00:00, 110.87it/s]
100%|██████████| 35/35 [00:00<00:00, 448.52it/s]

epoch 17: Training_Loss- 0.019, Val_Loss- 0.021, Training_Acc- 66.25, Val_Acc- 58.76


100%|██████████| 267/267 [00:02<00:00, 112.59it/s]
100%|██████████| 35/35 [00:00<00:00, 439.21it/s]

epoch 18: Training_Loss- 0.019, Val_Loss- 0.020, Training_Acc- 67.64, Val_Acc- 63.22


100%|██████████| 267/267 [00:02<00:00, 112.57it/s]
100%|██████████| 35/35 [00:00<00:00, 458.65it/s]

epoch 19: Training_Loss- 0.018, Val_Loss- 0.019, Training_Acc- 70.19, Val_Acc- 65.03


100%|██████████| 267/267 [00:02<00:00, 113.94it/s]
100%|██████████| 35/35 [00:00<00:00, 463.58it/s]
  4%|▍         | 12/267 [00:00<00:02, 111.80it/s]

epoch 20: Training_Loss- 0.017, Val_Loss- 0.020, Training_Acc- 71.36, Val_Acc- 64.85


100%|██████████| 267/267 [00:02<00:00, 112.23it/s]
100%|██████████| 35/35 [00:00<00:00, 465.75it/s]
  4%|▍         | 12/267 [00:00<00:02, 115.02it/s]

epoch 21: Training_Loss- 0.017, Val_Loss- 0.019, Training_Acc- 72.98, Val_Acc- 66.12


100%|██████████| 267/267 [00:02<00:00, 113.60it/s]
100%|██████████| 35/35 [00:00<00:00, 449.27it/s]
  4%|▍         | 12/267 [00:00<00:02, 113.08it/s]

epoch 22: Training_Loss- 0.016, Val_Loss- 0.020, Training_Acc- 74.39, Val_Acc- 65.49


100%|██████████| 267/267 [00:02<00:00, 113.20it/s]
100%|██████████| 35/35 [00:00<00:00, 449.11it/s]
  4%|▍         | 12/267 [00:00<00:02, 112.04it/s]

epoch 23: Training_Loss- 0.016, Val_Loss- 0.019, Training_Acc- 75.95, Val_Acc- 66.58


100%|██████████| 267/267 [00:02<00:00, 113.70it/s]
100%|██████████| 35/35 [00:00<00:00, 463.87it/s]
  4%|▍         | 12/267 [00:00<00:02, 113.64it/s]

epoch 24: Training_Loss- 0.015, Val_Loss- 0.021, Training_Acc- 76.98, Val_Acc- 64.12


100%|██████████| 267/267 [00:02<00:00, 113.79it/s]
100%|██████████| 35/35 [00:00<00:00, 323.55it/s]
  4%|▍         | 12/267 [00:00<00:02, 113.05it/s]

epoch 25: Training_Loss- 0.014, Val_Loss- 0.021, Training_Acc- 78.58, Val_Acc- 66.12


100%|██████████| 267/267 [00:02<00:00, 112.61it/s]
100%|██████████| 35/35 [00:00<00:00, 452.79it/s]
  4%|▍         | 12/267 [00:00<00:02, 112.84it/s]

epoch 26: Training_Loss- 0.014, Val_Loss- 0.021, Training_Acc- 79.13, Val_Acc- 65.40


100%|██████████| 267/267 [00:02<00:00, 112.08it/s]
100%|██████████| 35/35 [00:00<00:00, 437.36it/s]
  4%|▍         | 12/267 [00:00<00:02, 113.00it/s]

epoch 27: Training_Loss- 0.013, Val_Loss- 0.024, Training_Acc- 79.95, Val_Acc- 63.67


100%|██████████| 267/267 [00:02<00:00, 113.59it/s]
100%|██████████| 35/35 [00:00<00:00, 449.86it/s]
  4%|▍         | 12/267 [00:00<00:02, 112.34it/s]

epoch 28: Training_Loss- 0.013, Val_Loss- 0.021, Training_Acc- 81.14, Val_Acc- 68.21


100%|██████████| 267/267 [00:02<00:00, 113.19it/s]
100%|██████████| 35/35 [00:00<00:00, 430.07it/s]
  4%|▍         | 12/267 [00:00<00:02, 111.03it/s]

epoch 29: Training_Loss- 0.013, Val_Loss- 0.022, Training_Acc- 82.08, Val_Acc- 66.49


100%|██████████| 267/267 [00:02<00:00, 114.05it/s]
100%|██████████| 35/35 [00:00<00:00, 442.56it/s]
  4%|▍         | 12/267 [00:00<00:02, 110.69it/s]

epoch 30: Training_Loss- 0.012, Val_Loss- 0.022, Training_Acc- 82.37, Val_Acc- 66.94


100%|██████████| 267/267 [00:02<00:00, 110.73it/s]
100%|██████████| 35/35 [00:00<00:00, 465.87it/s]
  4%|▍         | 11/267 [00:00<00:02, 109.80it/s]

epoch 31: Training_Loss- 0.012, Val_Loss- 0.025, Training_Acc- 83.27, Val_Acc- 64.49


100%|██████████| 267/267 [00:02<00:00, 110.21it/s]
100%|██████████| 35/35 [00:00<00:00, 429.15it/s]
  4%|▍         | 12/267 [00:00<00:02, 114.95it/s]

epoch 32: Training_Loss- 0.011, Val_Loss- 0.023, Training_Acc- 83.73, Val_Acc- 63.94


100%|██████████| 267/267 [00:02<00:00, 117.13it/s]
100%|██████████| 35/35 [00:00<00:00, 320.18it/s]
  4%|▍         | 12/267 [00:00<00:02, 117.96it/s]

epoch 33: Training_Loss- 0.011, Val_Loss- 0.028, Training_Acc- 85.15, Val_Acc- 63.58


100%|██████████| 267/267 [00:02<00:00, 109.14it/s]
100%|██████████| 35/35 [00:00<00:00, 429.24it/s]
  4%|▍         | 11/267 [00:00<00:02, 108.64it/s]

epoch 34: Training_Loss- 0.011, Val_Loss- 0.024, Training_Acc- 85.48, Val_Acc- 66.58


100%|██████████| 267/267 [00:02<00:00, 110.53it/s]
100%|██████████| 35/35 [00:00<00:00, 419.86it/s]
  4%|▍         | 11/267 [00:00<00:02, 105.55it/s]

epoch 35: Training_Loss- 0.010, Val_Loss- 0.024, Training_Acc- 85.83, Val_Acc- 65.76


100%|██████████| 267/267 [00:02<00:00, 113.85it/s]
100%|██████████| 35/35 [00:00<00:00, 468.29it/s]
  4%|▍         | 12/267 [00:00<00:02, 119.18it/s]

epoch 36: Training_Loss- 0.010, Val_Loss- 0.026, Training_Acc- 86.10, Val_Acc- 64.85


100%|██████████| 267/267 [00:02<00:00, 114.78it/s]
100%|██████████| 35/35 [00:00<00:00, 473.83it/s]
  4%|▍         | 12/267 [00:00<00:02, 119.35it/s]

epoch 37: Training_Loss- 0.010, Val_Loss- 0.025, Training_Acc- 86.79, Val_Acc- 66.03


100%|██████████| 267/267 [00:02<00:00, 118.03it/s]
100%|██████████| 35/35 [00:00<00:00, 468.12it/s]
  4%|▍         | 12/267 [00:00<00:02, 118.00it/s]

epoch 38: Training_Loss- 0.009, Val_Loss- 0.026, Training_Acc- 87.58, Val_Acc- 65.12


100%|██████████| 267/267 [00:02<00:00, 117.03it/s]
100%|██████████| 35/35 [00:00<00:00, 475.47it/s]
  4%|▍         | 12/267 [00:00<00:02, 116.43it/s]

epoch 39: Training_Loss- 0.009, Val_Loss- 0.026, Training_Acc- 87.34, Val_Acc- 66.12


100%|██████████| 267/267 [00:02<00:00, 116.77it/s]
100%|██████████| 35/35 [00:00<00:00, 468.02it/s]
  4%|▍         | 12/267 [00:00<00:02, 113.60it/s]

epoch 40: Training_Loss- 0.009, Val_Loss- 0.030, Training_Acc- 88.69, Val_Acc- 64.12


100%|██████████| 267/267 [00:02<00:00, 116.52it/s]
100%|██████████| 35/35 [00:00<00:00, 476.45it/s]
  4%|▍         | 12/267 [00:00<00:02, 119.01it/s]

epoch 41: Training_Loss- 0.008, Val_Loss- 0.030, Training_Acc- 88.99, Val_Acc- 65.21


100%|██████████| 267/267 [00:02<00:00, 115.96it/s]
100%|██████████| 35/35 [00:00<00:00, 474.87it/s]
  4%|▍         | 12/267 [00:00<00:02, 117.77it/s]

epoch 42: Training_Loss- 0.008, Val_Loss- 0.029, Training_Acc- 89.34, Val_Acc- 66.12


100%|██████████| 267/267 [00:02<00:00, 117.23it/s]
100%|██████████| 35/35 [00:00<00:00, 450.96it/s]
  4%|▍         | 12/267 [00:00<00:02, 114.87it/s]

epoch 43: Training_Loss- 0.008, Val_Loss- 0.034, Training_Acc- 89.41, Val_Acc- 61.94


100%|██████████| 267/267 [00:02<00:00, 116.87it/s]
100%|██████████| 35/35 [00:00<00:00, 467.40it/s]
  4%|▍         | 12/267 [00:00<00:02, 117.88it/s]

epoch 44: Training_Loss- 0.008, Val_Loss- 0.030, Training_Acc- 89.90, Val_Acc- 63.31


100%|██████████| 267/267 [00:02<00:00, 115.69it/s]
100%|██████████| 35/35 [00:00<00:00, 469.37it/s]
  4%|▍         | 12/267 [00:00<00:02, 113.88it/s]

epoch 45: Training_Loss- 0.007, Val_Loss- 0.032, Training_Acc- 90.93, Val_Acc- 63.31


100%|██████████| 267/267 [00:02<00:00, 112.36it/s]
100%|██████████| 35/35 [00:00<00:00, 451.94it/s]
  4%|▎         | 10/267 [00:00<00:02, 96.07it/s]

epoch 46: Training_Loss- 0.007, Val_Loss- 0.034, Training_Acc- 90.84, Val_Acc- 65.21


100%|██████████| 267/267 [00:02<00:00, 111.49it/s]
100%|██████████| 35/35 [00:00<00:00, 447.57it/s]
  4%|▍         | 12/267 [00:00<00:02, 116.12it/s]

epoch 47: Training_Loss- 0.007, Val_Loss- 0.034, Training_Acc- 91.14, Val_Acc- 64.85


100%|██████████| 267/267 [00:02<00:00, 113.69it/s]
100%|██████████| 35/35 [00:00<00:00, 443.59it/s]
  4%|▎         | 10/267 [00:00<00:02, 94.26it/s]

epoch 48: Training_Loss- 0.007, Val_Loss- 0.033, Training_Acc- 91.25, Val_Acc- 66.94


100%|██████████| 267/267 [00:02<00:00, 114.08it/s]
100%|██████████| 35/35 [00:00<00:00, 467.57it/s]

epoch 49: Training_Loss- 0.007, Val_Loss- 0.037, Training_Acc- 91.94, Val_Acc- 63.67
Best accuracy at epoch: 28





In [None]:
test(model, test_loader, file = 'bilstm_best_model.pth')

100%|██████████| 70/70 [00:00<00:00, 351.23it/s]


Test set: Average loss: 0.0201, Accuracy: 1513/2210 (68.46%)






In [None]:
test(model, valid_loader, file = 'bilstm_best_model.pth')

100%|██████████| 35/35 [00:00<00:00, 351.77it/s]


Test set: Average loss: 0.0205, Accuracy: 751/1101 (68.21%)






#### $\color{red}{\text{Solution 4.3}}$

<font color='red'>We trained the model for 50 epochs with several sequence lengths on a local machine. The bidirectional LSTM outperformed all the previous methods tested in this study across the different sequence lengths. The best performance was achieved with the sequence length set to 40: 73.48% on the validation set and 73.30% on the test set.</font>

| Sequence length | Validation accuracy (%) | Test accuracy (%) | Best epoch |
|---|---|---|---|
| 52 | 72.39 | 72.26 | 45 |
| 40 | 73.48 | 73.30 | 33 |
| 32 | 72.66 | 71.76 | 37 |
| 16 | 71.03 | 71.76 | 47 |
| 8 | 68.21 | 68.30 | 34 |
| 24 | 73.02 | 73.12 | 42 |
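
Each run above fixes the sequence length, which means long sentences are cut and short ones are padded. The following is a minimal sketch of that step, assuming PAD has id 0 and padding is appended at the end; the helper name `pad_or_truncate` is hypothetical and not part of the notebook.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` ids: cut long inputs, right-pad short ones."""
    if len(token_ids) >= length:
        return token_ids[:length]
    return token_ids + [pad_id] * (length - len(token_ids))


# Toy inputs: a 3-token sentence and a 59-token sentence, both mapped to length 8.
short = pad_or_truncate([5, 9, 2], length=8)
long = pad_or_truncate(list(range(1, 60)), length=8)
print(short)
print(long)
```

With a small length such as 8, most of a long sentence is discarded, which may be one reason the shortest setting scores lowest.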

## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supervised pretrained models.

> **Problem 5.1 (bonus)** *(2 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to replace word embeddings in your model from 4.2. Report the model's accuracy on the dev data.

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

--2021-09-26 09:06:29--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-09-26 09:06:29--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-09-26 09:06:29--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
import pandas as pd

In [None]:
glove = pd.read_csv('glove.6B.100d.txt', sep=" ", quoting=3, header=None, index_col=0)
glove_embedding = {key: val.values for key, val in glove.T.items()}
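
As an alternative to the pandas call above, the GloVe text file can be read line by line. The sketch below assumes the usual GloVe layout (a token followed by space-separated floats on each line); `load_glove` is a hypothetical helper, and a tiny in-memory sample stands in for `glove.6B.100d.txt`. One possible advantage is that tokens such as `null` stay plain strings, whereas pandas' default missing-value parsing may turn them into NaN.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe-style text lines ("word v1 v2 ...") into {word: vector}."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


# Tiny stand-in for the real file: two tokens with made-up 3-d vectors.
sample = io.StringIO("the 0.1 0.2 0.3\nnull -0.5 0.0 0.5\n")
toy_glove = load_glove(sample)
print(sorted(toy_glove), toy_glove["null"].shape)
```

For the real file, the same function could be called on an open file handle, for example `load_glove(open('glove.6B.100d.txt', encoding='utf-8'))`.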

In [None]:
glove_embedding['bedtime']
# word2id.items()

array([-0.038901 ,  0.14695  , -0.062946 ,  0.0076634, -1.4591   ,
        1.2302   ,  0.045659 ,  0.41254  , -0.074825 , -0.30942  ,
        0.69898  ,  0.38158  , -0.18333  ,  0.21672  ,  0.84012  ,
       -0.17574  , -0.028191 , -0.15103  ,  0.22295  , -0.18099  ,
        0.22464  , -0.6649   , -0.038821 ,  0.31431  ,  0.43477  ,
        0.54241  , -0.46119  ,  0.068432 , -1.0356   ,  0.48924  ,
       -0.09531  , -0.61189  ,  0.13242  , -0.36736  , -0.66299  ,
        0.55532  , -0.58866  , -0.83158  , -0.025662 , -0.31443  ,
        0.043428 ,  0.98309  , -0.48474  ,  0.042313 , -0.5515   ,
       -0.087464 , -0.77318  , -0.46762  , -0.03519  , -0.70757  ,
        0.35963  ,  0.11543  , -0.065643 ,  0.66271  , -0.34811  ,
       -0.67678  ,  0.38171  ,  0.36789  , -0.030991 , -0.075245 ,
       -0.26004  ,  0.45867  , -0.19097  , -0.82811  ,  0.18871  ,
       -0.5021   ,  0.031506 , -0.51581  , -0.030224 , -0.33641  ,
       -0.41555  , -0.23367  ,  1.0551   ,  0.49484  , -0.5600

In [None]:
glove_embedding['bedtime'].shape

(100,)

In [None]:
word2id.items()



In [None]:
def create_embedding_matrix(word2index, embedding_dict, dimension):
    """Build a (vocab_size, dimension) matrix whose row i holds the pretrained vector of word i.

    Words that are missing from embedding_dict keep an all-zero row.
    """
    embedding_matrix = np.zeros((len(word2index), dimension))

    for word, index in word2index.items():
        if word in embedding_dict:
            embedding_matrix[index] = embedding_dict[word]
    return embedding_matrix


# Align the 100-d GloVe vectors with the SST vocabulary indices.
embedding_matrix = create_embedding_matrix(word2index=word2id, embedding_dict=glove_embedding, dimension=100)
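
Because words without a GloVe vector are left as zero rows, it can help to check how much of the vocabulary is actually covered. The sketch below uses toy data; the helper `embedding_coverage` and the special-token names are hypothetical.

```python
import numpy as np


def embedding_coverage(word2index, embedding_dict):
    """Return (fraction of vocab words with a pretrained vector, missing words)."""
    missing = [w for w in word2index if w not in embedding_dict]
    covered = 1.0 - len(missing) / max(len(word2index), 1)
    return covered, missing


# Toy vocabulary: two special tokens (no pretrained vector) and two real words.
toy_word2id = {"[PAD]": 0, "[UNK]": 1, "movie": 2, "great": 3}
toy_vectors = {"movie": np.ones(3), "great": np.zeros(3)}
coverage, missing = embedding_coverage(toy_word2id, toy_vectors)
print(f"coverage={coverage:.2f}, missing={missing}")
```

On the notebook's data the call would be `embedding_coverage(word2id, glove_embedding)`.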

In [None]:
embedding_matrix.shape[0]

8572

In [None]:
vocab_size=embedding_matrix.shape[0]
vector_size=embedding_matrix.shape[1]

In [None]:
print(embedding_matrix.shape)

(8572, 100)


In [None]:

 
embedding=nn.Embedding(num_embeddings=vocab_size,embedding_dim=vector_size)

In [None]:
embedding.weight=nn.Parameter(torch.tensor(embedding_matrix,dtype=torch.float32))

In [None]:
embedding.weight.requires_grad=False
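
The three cells above (create the layer, overwrite its weight, freeze it) can also be written in one call with `nn.Embedding.from_pretrained`. This is a sketch on a small stand-in matrix, not the notebook's `embedding_matrix`.

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for embedding_matrix: 4 words, 3 dimensions.
toy_matrix = np.arange(12, dtype=np.float64).reshape(4, 3)

toy_embedding = nn.Embedding.from_pretrained(
    torch.tensor(toy_matrix, dtype=torch.float32),
    freeze=True,  # equivalent to weight.requires_grad = False
)

ids = torch.tensor([[0, 2], [3, 1]])  # batch of 2 sequences, 2 tokens each
print(toy_embedding(ids).shape)
print(toy_embedding.weight.requires_grad)
```

Setting `freeze=False` instead would let the pretrained vectors be fine-tuned during training.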

In [None]:
# def get_tokens_ids(word2id, df, length=32):
#     token_ids=[]
#     labels=[]
#     # corpus= list(df.sentence)
#     targets = list(df.label)
#     max_len=52

# #     sent_tokens_= []
#     for i in range(len(df)):
#       sentence=df.sentence.iloc[i].lower()
#       input_tokens = list(list(filter(('').__ne__, sentence.split(" "))))
#       if max_len> len(input_tokens):
#           max_len= len(input_tokens)
# #        sent_tokens.append(input_tokens)
#       input_ids=[word2id[word] if word in word2id else 1 for word in input_tokens]
  
#       if len(input_ids) < length:
#           input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
#       else:
#           input_ids = input_ids[:length]
          
#       token_ids.append(input_ids)
#       labels.append(targets[i])
#     token_ids =torch.LongTensor(token_ids)
#     labels = torch.LongTensor(labels)
#     return max_len, token_ids, labels

In [None]:
sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
max_len, data, labels =get_tokens_ids(word2id, train_df, 32)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 32)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 32)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)

2
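
The printed value `2` is probably the length of the shortest sentence rather than the longest: if the active `get_tokens_ids` matches the commented-out version above, `max_len` starts at 52 and is only ever lowered. The sketch below reports both statistics on toy sentences; `length_stats` is a hypothetical helper and whitespace tokenization is assumed.

```python
def length_stats(sentences):
    """Return (shortest, longest) token counts under whitespace tokenization."""
    lengths = [len(s.lower().split()) for s in sentences]
    return min(lengths), max(lengths)


toy_sentences = ["A gripping , well-acted drama .", "Not bad", "Dull ."]
shortest, longest = length_stats(toy_sentences)
print(shortest, longest)
```

The longest length is the more useful number when choosing a padding length.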


In [None]:
embedding_vec=embedding(torch.LongTensor(data))
print(embedding)
print(embedding_vec.shape)

Embedding(8572, 100)
torch.Size([8544, 32, 100])


In [None]:
embedding_vec[0]

tensor([[-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        [-0.6839,  0.3918,  0.5367,  ..., -0.1415,  1.3115,  0.3148],
        [-0.5426,  0.4148,  1.0322,  ..., -1.2969,  0.7622,  0.4635],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1381, -1.2345, -0.3397,  ..., -0.1517, -0.0621, -1.2900],
        [ 0.6328, -0.5065, -0.3616,  ...,  0.1686,  0.0148, -0.7369]])

In [None]:
# train_set = TensorDataset(data, labels)
# valid_set = TensorDataset(val_data, val_labels)
# test_set = TensorDataset(test_data, test_labels)

# batch_size = 32

# train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
# valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
# test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

In [None]:
import math


class LSTM(nn.Module):
    """Single-layer LSTM classifier over frozen GloVe embeddings.

    Relies on the notebook globals `vocab` and `embedding_matrix`.
    """

    def __init__(self, d, dropout=None):
        super().__init__()
        self.input_size = d
        self.hidden_size = d
        self.output_size = 2
        self.drop = dropout  # any non-None value enables dropout (rate fixed at 0.25)
        self.embedding = nn.Embedding(len(vocab), self.input_size)
        # Gate parameters stacked column-wise in the order i_t, f_t, g_t, o_t
        self.W = nn.Parameter(torch.empty(self.input_size, self.hidden_size * 4))
        self.U = nn.Parameter(torch.empty(self.hidden_size, self.hidden_size * 4))
        self.b = nn.Parameter(torch.empty(self.hidden_size * 4))

        self.dropout = nn.Dropout(0.25)
        self.linear = nn.Linear(self.hidden_size, self.output_size, bias=True)
        self.init_weights()
        # Load the GloVe vectors after init_weights() so they are not overwritten,
        # then freeze them.
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False

    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)

    def forward(self, x):
        emb = self.embedding(x)  # (batch, seq_len, d)
        batch_size = emb.shape[0]

        h_t = torch.zeros(batch_size, self.hidden_size, device=emb.device)
        c_t = torch.zeros(batch_size, self.hidden_size, device=emb.device)

        ds = self.hidden_size
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.W + h_t @ self.U + self.b
            i_t = torch.sigmoid(gate[:, :ds])            # input gate
            f_t = torch.sigmoid(gate[:, ds:ds * 2])      # forget gate
            g_t = torch.tanh(gate[:, ds * 2:ds * 3])     # candidate cell state
            o_t = torch.sigmoid(gate[:, ds * 3:])        # output gate
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * torch.tanh(c_t)

        # Classify from the state after the last position (trailing padding included).
        if self.drop is not None:
            h_t = self.dropout(h_t)
        return self.linear(h_t)
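
One way to sanity-check the hand-written gate computation is to compare a single step against `torch.nn.LSTMCell`, which uses the same gate order (input, forget, candidate, output). The sketch below uses random stand-in parameters rather than the trained model; `manual_step` is a hypothetical helper that mirrors the loop body above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, d = 3, 5
W = torch.randn(d, 4 * d)   # input-to-hidden, gates stacked as i, f, g, o
U = torch.randn(d, 4 * d)   # hidden-to-hidden
b = torch.randn(4 * d)
x, h0, c0 = torch.randn(batch, d), torch.randn(batch, d), torch.randn(batch, d)


def manual_step(x, h, c):
    """One LSTM step written the same way as the forward loop above."""
    gate = x @ W + h @ U + b
    i = torch.sigmoid(gate[:, :d])
    f = torch.sigmoid(gate[:, d:2 * d])
    g = torch.tanh(gate[:, 2 * d:3 * d])
    o = torch.sigmoid(gate[:, 3 * d:])
    c_new = f * c + i * g
    return o * torch.tanh(c_new), c_new


# LSTMCell stores transposed weights and two bias vectors; zero the second bias.
cell = nn.LSTMCell(d, d)
with torch.no_grad():
    cell.weight_ih.copy_(W.t())
    cell.weight_hh.copy_(U.t())
    cell.bias_ih.copy_(b)
    cell.bias_hh.zero_()
    h_ref, c_ref = cell(x, (h0, c0))

h_man, c_man = manual_step(x, h0, c0)
print(torch.allclose(h_man, h_ref, atol=1e-4), torch.allclose(c_man, c_ref, atol=1e-4))
```

If the two agree, the slicing of the stacked gate matrix is consistent with the reference cell.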

In [None]:
from datasets import load_dataset


sst_dataset = load_dataset('sst')


train_df=pd.DataFrame.from_dict(sst_dataset['train'])
train_df['label'] = train_df['label'].apply(lambda x: round(x))

val_df=pd.DataFrame.from_dict(sst_dataset['validation'])
val_df['label'] = val_df['label'].apply(lambda x: round(x))

test_df=pd.DataFrame.from_dict(sst_dataset['test'])
test_df['label'] = test_df['label'].apply(lambda x: round(x))

max_len, data, labels =get_tokens_ids(word2id, train_df, 16)
val_max_len, val_data, val_labels =get_tokens_ids(word2id, val_df, 16)
test_max_len, test_data, test_labels =get_tokens_ids(word2id, test_df, 16)
# print(test_data.shape)
# print(test_labels.shape)
# print(test_max_len)
# print(data.shape)
# print(labels.shape)
print(max_len)


train_set = TensorDataset(data, labels)
valid_set = TensorDataset(val_data, val_labels)
test_set = TensorDataset(test_data, test_labels)

batch_size = 32

train_loader = DataLoader(train_set, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_set, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_set, shuffle=False, batch_size=batch_size)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


  0%|          | 0/3 [00:00<?, ?it/s]

2


In [None]:
d = 100 # size of word-embedding
num_epochs = 50
model = LSTM(d).cuda()
train(model, train_loader, valid_loader, num_epochs, 'glove_lstm_best_model.pth')

100%|██████████| 267/267 [00:07<00:00, 38.07it/s]
100%|██████████| 35/35 [00:00<00:00, 128.65it/s]


epoch 0: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 50.70, Val_Acc- 50.68


100%|██████████| 267/267 [00:06<00:00, 38.69it/s]
100%|██████████| 35/35 [00:00<00:00, 133.40it/s]


epoch 1: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 49.58, Val_Acc- 52.04


100%|██████████| 267/267 [00:06<00:00, 38.99it/s]
100%|██████████| 35/35 [00:00<00:00, 137.66it/s]


epoch 2: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.10, Val_Acc- 49.41


100%|██████████| 267/267 [00:06<00:00, 39.24it/s]
100%|██████████| 35/35 [00:00<00:00, 135.72it/s]


epoch 3: Training_Loss- 0.022, Val_Loss- 0.022, Training_Acc- 52.24, Val_Acc- 53.13


100%|██████████| 267/267 [00:06<00:00, 38.43it/s]
100%|██████████| 35/35 [00:00<00:00, 136.93it/s]


epoch 4: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 54.99, Val_Acc- 57.49


100%|██████████| 267/267 [00:06<00:00, 39.08it/s]
100%|██████████| 35/35 [00:00<00:00, 135.68it/s]


epoch 5: Training_Loss- 0.021, Val_Loss- 0.022, Training_Acc- 57.78, Val_Acc- 57.04


100%|██████████| 267/267 [00:06<00:00, 39.24it/s]
100%|██████████| 35/35 [00:00<00:00, 139.09it/s]


epoch 6: Training_Loss- 0.020, Val_Loss- 0.018, Training_Acc- 63.32, Val_Acc- 71.21


100%|██████████| 267/267 [00:06<00:00, 38.58it/s]
100%|██████████| 35/35 [00:00<00:00, 129.69it/s]


epoch 7: Training_Loss- 0.019, Val_Loss- 0.019, Training_Acc- 67.56, Val_Acc- 68.30


100%|██████████| 267/267 [00:07<00:00, 37.98it/s]
100%|██████████| 35/35 [00:00<00:00, 134.90it/s]


epoch 8: Training_Loss- 0.018, Val_Loss- 0.017, Training_Acc- 70.87, Val_Acc- 73.12


100%|██████████| 267/267 [00:06<00:00, 40.12it/s]
100%|██████████| 35/35 [00:00<00:00, 130.94it/s]


epoch 9: Training_Loss- 0.017, Val_Loss- 0.017, Training_Acc- 71.66, Val_Acc- 73.66


100%|██████████| 267/267 [00:06<00:00, 39.47it/s]
100%|██████████| 35/35 [00:00<00:00, 127.84it/s]


epoch 10: Training_Loss- 0.017, Val_Loss- 0.017, Training_Acc- 71.91, Val_Acc- 74.11


100%|██████████| 267/267 [00:06<00:00, 39.90it/s]
100%|██████████| 35/35 [00:00<00:00, 135.35it/s]


epoch 11: Training_Loss- 0.017, Val_Loss- 0.017, Training_Acc- 73.00, Val_Acc- 74.57


100%|██████████| 267/267 [00:06<00:00, 40.39it/s]
100%|██████████| 35/35 [00:00<00:00, 138.53it/s]


epoch 12: Training_Loss- 0.017, Val_Loss- 0.016, Training_Acc- 73.30, Val_Acc- 75.39


100%|██████████| 267/267 [00:06<00:00, 40.45it/s]
100%|██████████| 35/35 [00:00<00:00, 136.58it/s]


epoch 13: Training_Loss- 0.016, Val_Loss- 0.018, Training_Acc- 74.34, Val_Acc- 71.48


100%|██████████| 267/267 [00:06<00:00, 40.64it/s]
100%|██████████| 35/35 [00:00<00:00, 133.29it/s]


epoch 14: Training_Loss- 0.016, Val_Loss- 0.016, Training_Acc- 74.74, Val_Acc- 75.39


100%|██████████| 267/267 [00:06<00:00, 40.44it/s]
100%|██████████| 35/35 [00:00<00:00, 144.51it/s]


epoch 15: Training_Loss- 0.016, Val_Loss- 0.017, Training_Acc- 74.78, Val_Acc- 75.02


100%|██████████| 267/267 [00:06<00:00, 39.90it/s]
100%|██████████| 35/35 [00:00<00:00, 104.58it/s]


epoch 16: Training_Loss- 0.016, Val_Loss- 0.017, Training_Acc- 75.00, Val_Acc- 72.57


100%|██████████| 267/267 [00:06<00:00, 40.26it/s]
100%|██████████| 35/35 [00:00<00:00, 137.54it/s]


epoch 17: Training_Loss- 0.016, Val_Loss- 0.016, Training_Acc- 76.16, Val_Acc- 76.29


100%|██████████| 267/267 [00:06<00:00, 40.03it/s]
100%|██████████| 35/35 [00:00<00:00, 123.55it/s]


epoch 18: Training_Loss- 0.015, Val_Loss- 0.016, Training_Acc- 76.26, Val_Acc- 75.84


100%|██████████| 267/267 [00:06<00:00, 39.39it/s]
100%|██████████| 35/35 [00:00<00:00, 137.61it/s]


epoch 19: Training_Loss- 0.015, Val_Loss- 0.016, Training_Acc- 76.23, Val_Acc- 76.02


100%|██████████| 267/267 [00:06<00:00, 39.49it/s]
100%|██████████| 35/35 [00:00<00:00, 130.07it/s]


epoch 20: Training_Loss- 0.015, Val_Loss- 0.016, Training_Acc- 76.28, Val_Acc- 75.66


100%|██████████| 267/267 [00:06<00:00, 39.59it/s]
100%|██████████| 35/35 [00:00<00:00, 137.45it/s]


epoch 21: Training_Loss- 0.015, Val_Loss- 0.017, Training_Acc- 76.07, Val_Acc- 72.21


100%|██████████| 267/267 [00:06<00:00, 39.09it/s]
100%|██████████| 35/35 [00:00<00:00, 139.31it/s]


epoch 22: Training_Loss- 0.015, Val_Loss- 0.017, Training_Acc- 77.25, Val_Acc- 74.21


100%|██████████| 267/267 [00:06<00:00, 39.84it/s]
100%|██████████| 35/35 [00:00<00:00, 126.46it/s]


epoch 23: Training_Loss- 0.015, Val_Loss- 0.016, Training_Acc- 76.84, Val_Acc- 76.29


100%|██████████| 267/267 [00:06<00:00, 41.01it/s]
100%|██████████| 35/35 [00:00<00:00, 108.78it/s]


epoch 24: Training_Loss- 0.015, Val_Loss- 0.016, Training_Acc- 77.18, Val_Acc- 75.02


100%|██████████| 267/267 [00:06<00:00, 40.45it/s]
100%|██████████| 35/35 [00:00<00:00, 143.63it/s]


epoch 25: Training_Loss- 0.015, Val_Loss- 0.016, Training_Acc- 77.08, Val_Acc- 74.66


100%|██████████| 267/267 [00:06<00:00, 40.05it/s]
100%|██████████| 35/35 [00:00<00:00, 148.18it/s]


epoch 26: Training_Loss- 0.015, Val_Loss- 0.015, Training_Acc- 77.98, Val_Acc- 76.48


100%|██████████| 267/267 [00:06<00:00, 40.28it/s]
100%|██████████| 35/35 [00:00<00:00, 140.38it/s]


epoch 27: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 78.07, Val_Acc- 74.93


100%|██████████| 267/267 [00:06<00:00, 40.56it/s]
100%|██████████| 35/35 [00:00<00:00, 136.46it/s]


epoch 28: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 77.77, Val_Acc- 76.39


100%|██████████| 267/267 [00:06<00:00, 39.45it/s]
100%|██████████| 35/35 [00:00<00:00, 123.39it/s]


epoch 29: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 78.50, Val_Acc- 75.20


100%|██████████| 267/267 [00:06<00:00, 39.22it/s]
100%|██████████| 35/35 [00:00<00:00, 128.49it/s]


epoch 30: Training_Loss- 0.014, Val_Loss- 0.018, Training_Acc- 78.48, Val_Acc- 70.21


100%|██████████| 267/267 [00:06<00:00, 38.55it/s]
100%|██████████| 35/35 [00:00<00:00, 120.68it/s]


epoch 31: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 78.80, Val_Acc- 75.30


100%|██████████| 267/267 [00:06<00:00, 38.35it/s]
100%|██████████| 35/35 [00:00<00:00, 139.92it/s]


epoch 32: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 78.55, Val_Acc- 75.84


100%|██████████| 267/267 [00:06<00:00, 39.14it/s]
100%|██████████| 35/35 [00:00<00:00, 127.19it/s]


epoch 33: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 79.03, Val_Acc- 76.29


100%|██████████| 267/267 [00:06<00:00, 40.13it/s]
100%|██████████| 35/35 [00:00<00:00, 131.81it/s]


epoch 34: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 79.79, Val_Acc- 75.30


100%|██████████| 267/267 [00:06<00:00, 39.76it/s]
100%|██████████| 35/35 [00:00<00:00, 136.11it/s]


epoch 35: Training_Loss- 0.014, Val_Loss- 0.016, Training_Acc- 79.61, Val_Acc- 76.20


100%|██████████| 267/267 [00:06<00:00, 40.34it/s]
100%|██████████| 35/35 [00:00<00:00, 133.84it/s]


epoch 36: Training_Loss- 0.013, Val_Loss- 0.016, Training_Acc- 79.58, Val_Acc- 75.84


100%|██████████| 267/267 [00:06<00:00, 39.70it/s]
100%|██████████| 35/35 [00:00<00:00, 128.48it/s]


epoch 37: Training_Loss- 0.013, Val_Loss- 0.016, Training_Acc- 80.27, Val_Acc- 76.20


100%|██████████| 267/267 [00:06<00:00, 40.37it/s]
100%|██████████| 35/35 [00:00<00:00, 135.69it/s]


epoch 38: Training_Loss- 0.013, Val_Loss- 0.016, Training_Acc- 80.13, Val_Acc- 75.66


100%|██████████| 267/267 [00:06<00:00, 40.62it/s]
100%|██████████| 35/35 [00:00<00:00, 136.02it/s]


epoch 39: Training_Loss- 0.013, Val_Loss- 0.017, Training_Acc- 80.63, Val_Acc- 75.66


100%|██████████| 267/267 [00:06<00:00, 39.18it/s]
100%|██████████| 35/35 [00:00<00:00, 133.46it/s]


epoch 40: Training_Loss- 0.013, Val_Loss- 0.016, Training_Acc- 80.81, Val_Acc- 75.84


100%|██████████| 267/267 [00:06<00:00, 39.05it/s]
100%|██████████| 35/35 [00:00<00:00, 130.33it/s]


epoch 41: Training_Loss- 0.013, Val_Loss- 0.016, Training_Acc- 81.09, Val_Acc- 76.75


100%|██████████| 267/267 [00:06<00:00, 38.77it/s]
100%|██████████| 35/35 [00:00<00:00, 132.12it/s]


epoch 42: Training_Loss- 0.013, Val_Loss- 0.018, Training_Acc- 81.37, Val_Acc- 75.30


100%|██████████| 267/267 [00:06<00:00, 39.08it/s]
100%|██████████| 35/35 [00:00<00:00, 130.09it/s]


epoch 43: Training_Loss- 0.013, Val_Loss- 0.017, Training_Acc- 81.37, Val_Acc- 75.48


100%|██████████| 267/267 [00:06<00:00, 38.56it/s]
100%|██████████| 35/35 [00:00<00:00, 142.05it/s]


epoch 44: Training_Loss- 0.012, Val_Loss- 0.016, Training_Acc- 82.07, Val_Acc- 76.39


100%|██████████| 267/267 [00:06<00:00, 38.99it/s]
100%|██████████| 35/35 [00:00<00:00, 136.46it/s]


epoch 45: Training_Loss- 0.012, Val_Loss- 0.016, Training_Acc- 82.14, Val_Acc- 75.48


100%|██████████| 267/267 [00:06<00:00, 38.76it/s]
100%|██████████| 35/35 [00:00<00:00, 128.87it/s]


epoch 46: Training_Loss- 0.012, Val_Loss- 0.017, Training_Acc- 82.33, Val_Acc- 75.30


100%|██████████| 267/267 [00:06<00:00, 39.22it/s]
100%|██████████| 35/35 [00:00<00:00, 126.78it/s]


epoch 47: Training_Loss- 0.012, Val_Loss- 0.017, Training_Acc- 82.27, Val_Acc- 74.48


100%|██████████| 267/267 [00:06<00:00, 38.51it/s]
100%|██████████| 35/35 [00:00<00:00, 133.13it/s]


epoch 48: Training_Loss- 0.012, Val_Loss- 0.018, Training_Acc- 83.12, Val_Acc- 74.21


100%|██████████| 267/267 [00:06<00:00, 38.47it/s]
100%|██████████| 35/35 [00:00<00:00, 133.16it/s]

epoch 49: Training_Loss- 0.012, Val_Loss- 0.018, Training_Acc- 83.31, Val_Acc- 76.02
Best accuracy at epoch: 41





In [None]:
test(model, test_loader, file = 'glove_lstm_best_model.pth')

100%|██████████| 70/70 [00:00<00:00, 113.14it/s]


Test set: Average loss: 0.0162, Accuracy: 1722/2210 (77.92%)






In [None]:
test(model, valid_loader, file = 'glove_lstm_best_model.pth')

100%|██████████| 35/35 [00:00<00:00, 106.15it/s]


Test set: Average loss: 0.0164, Accuracy: 845/1101 (76.75%)






#### $\color{red}{\text{Solution 5.1}}$

<font color='red'>We kept the same LSTM model as in the previous section, initialized the embedding matrix with the GloVe embeddings, and froze it during training. With this change the model significantly outperformed the previous methods on the validation and test sets. The best validation accuracy was obtained with the sequence length set to 24: 78.56% on the validation set and 76.88% on the test set. These results were obtained by training the model on a local machine. By only switching to the pretrained embedding representation, the LSTM went from the worst model to the best model.</font>
    
| Sequence length | Validation accuracy (%) | Test accuracy (%) | Best epoch |
|---|---|---|---|
| 52 | 50.86 | 51.67 | 45 |
| 40 | 60.67 | 62.53 | 33 |
| 32 | 78.29 | 76.24 | 50 |
| 16 | 76.75 | 77.92 | 42 |
| 8 | 69.75 | 70.27 | 34 |
| 24 | 78.56 | 76.88 | 30 |
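
The choice of sequence length can be made explicit by selecting on validation accuracy only and reading off the test accuracy afterwards. The sketch below copies the numbers reported above; the length-16 test accuracy is taken as 77.92, the value printed by the test cell earlier in this section.

```python
# sequence length -> (validation accuracy %, test accuracy %), copied from the results above
results = {
    52: (50.86, 51.67),
    40: (60.67, 62.53),
    32: (78.29, 76.24),
    16: (76.75, 77.92),  # test value as printed by the test cell for length 16
    8: (69.75, 70.27),
    24: (78.56, 76.88),
}

# Select on validation accuracy only, then report the matching test accuracy.
best_len = max(results, key=lambda n: results[n][0])
val_acc, test_acc = results[best_len]
print(best_len, val_acc, test_acc)
```

Selecting on the validation set keeps the test number an unbiased estimate, even though length 16 happens to score higher on the test set.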


