## Assignment 2.1: Text classification via RNN (50 points)

In this assignment you will perform sentiment analysis of IMDB reviews using an RNN.

In [None]:
!pip install torch==1.6.0
!pip install torchtext==0.7
!pip install numpy
!pip install pandas



In [None]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Preparing Data

In [None]:
TEXT = Field(sequential=True, lower=True)
LABEL = LabelField()



In [None]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()



In [None]:
%%time
TEXT.build_vocab(trn)

CPU times: user 1.19 s, sys: 36.5 ms, total: 1.23 s
Wall time: 1.23 s


In [None]:
LABEL.build_vocab(trn)

In [None]:
TEXT.vocab.freqs.most_common(10)

[('the', 225710),
 ('a', 111293),
 ('and', 111161),
 ('of', 101207),
 ('to', 93119),
 ('is', 72976),
 ('in', 63110),
 ('i', 49427),
 ('this', 48747),
 ('that', 46589)]

### Creating the Iterator

During training, we'll use a special kind of iterator called the **BucketIterator**. Before passing data into a neural network, we pad the sequences to the same length so that they can be processed as a batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding wastes a lot of memory and time. The BucketIterator groups sequences of similar length into each batch to minimize padding.
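To make the effect of bucketing concrete, here is a minimal pure-Python sketch (it does not use torchtext). It pads the example batch from the text and then counts pad tokens for batches formed with and without sorting by length. The helper names (`pad_batch`, `count_padding`, `make_batches`) and the sequence lengths are hypothetical and chosen only for illustration.

```python
def pad_batch(seqs, pad_id=0):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def count_padding(batches):
    """Total number of pad tokens needed when each batch is padded separately."""
    total = 0
    for batch in batches:
        max_len = max(len(s) for s in batch)
        total += sum(max_len - len(s) for s in batch)
    return total


def make_batches(seqs, batch_size):
    """Split a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


# The example from the text: three sequences padded to length 5.
padded = pad_batch([[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]])

# Hypothetical sequence lengths, chosen only to illustrate the effect.
lengths = [2, 50, 3, 48, 4, 49]
seqs = [[1] * n for n in lengths]

# Batches taken in the original order mix short and long sequences.
unsorted_padding = count_padding(make_batches(seqs, batch_size=3))
# Sorting by length first puts similar lengths into the same batch.
bucketed_padding = count_padding(make_batches(sorted(seqs, key=len), batch_size=3))
print(unsorted_padding, bucketed_padding)
```

With these toy lengths, grouping by length removes almost all of the padding, which is the saving the BucketIterator aims for.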

Complete the definition of the **BucketIterator** object

In [None]:
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(64, 64, 64),
        sort=True,
        sort_key=lambda x: len(x.text),
        sort_within_batch=False,
        device='cuda',
        repeat=False
)



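Let's take a look at what the output of the BucketIterator looks like. Do not be surprised by its shape: the `Field` above was created without **batch_first=True**, so the tensor is laid out as (sequence length, batch size).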

In [None]:
batch = next(iter(train_iter)); batch.text



tensor([[   10, 26916,  7270,  ...,    10,     9, 43905],
        [   20,     2, 24459,  ...,     7,   364,     7],
        [    7, 22208,     7,  ...,     3,     2,     3],
        ...,
        [    1,     1,     1,  ...,    25,   219,    89],
        [    1,     1,     1,  ...,    40,   531,   139],
        [    1,     1,     1,  ...,     9,   112,  5633]], device='cuda:0')

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [None]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

In [None]:
torch.transpose(batch.text, 0, 1)

tensor([[   10,    20,     7,  ...,     1,     1,     1],
        [26916,     2, 22208,  ...,     1,     1,     1],
        [ 7270, 24459,     7,  ...,     1,     1,     1],
        ...,
        [   10,     7,     3,  ...,    25,    40,     9],
        [    9,   364,     2,  ...,   219,   531,   112],
        [43905,     7,     3,  ...,    89,   139,  5633]], device='cuda:0')

### Define the RNN-based text classification model (20 points)

Start simple. Implement the model according to the schema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [None]:
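class RNNBaseline(nn.Module):
    def __init__(self, V, D, emb_dim, num_classes=1):
        super().__init__()
        # Index 1 is torchtext's <pad> token; its embedding stays at zero.
        self.embed = nn.Embedding(V+1, emb_dim, padding_idx=1)

        self.gru = nn.GRU(emb_dim, D, batch_first=True)
        self.linear = nn.Linear(D, num_classes)
        self.sm = nn.Sigmoid()
            
    def forward(self, seq):
        # The iterator yields (seq_len, batch); the GRU expects (batch, seq_len).
        seq = torch.transpose(seq, 0, 1)

        x = self.embed(seq)
        
        # h_n is the final hidden state, shape (1, batch, D).
        _, h_n = self.gru(x)
        x = self.linear(h_n)
        preds = self.sm(x)
        
        # Drop the singleton dimensions: (1, batch, 1) -> (batch,).
        preds = torch.squeeze(preds)
        return preds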

In [None]:
vocab_size = len(TEXT.vocab)
em_sz = 200
nh = 300
model = RNNBaseline(vocab_size, nh, emb_dim=em_sz); model

RNNBaseline(
  (embed): Embedding(202237, 200, padding_idx=1)
  (gru): GRU(200, 300, batch_first=True)
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (sm): Sigmoid()
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [None]:
model.cuda()

RNNBaseline(
  (embed): Embedding(202237, 200, padding_idx=1)
  (gru): GRU(200, 300, batch_first=True)
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (sm): Sigmoid()
)

### The training loop (10 points)

Define the optimization and the loss functions.

In [None]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCELoss()
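For reference, binary cross-entropy on sigmoid outputs can be computed by hand. The sketch below is a pure-Python illustration with mean reduction (the default of `nn.BCELoss`); the function name `binary_cross_entropy`, the `eps` clipping, and the toy probabilities are assumptions made for this example, not part of the assignment.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for p, y in zip(probs, targets):
        # Clip to avoid log(0); this clipping is a choice made for the sketch.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# Toy predictions for one positive and one negative example.
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
unsure = binary_cross_entropy([0.5, 0.5], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(confident_right, unsure, confident_wrong)
```

The loss grows as the predicted probability moves away from the true label, and an output of 0.5 for every example gives a loss of ln 2.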

Define the stopping criteria.

I tried training for 20 epochs, but the model visibly overfits after 5.
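A fixed epoch count is one stopping criterion; a common alternative is patience-based early stopping on the validation loss. The following is a minimal sketch of that idea, not what this notebook does: the function `early_stopping_epoch`, the `patience` value, and the loss values are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    # Never ran out of patience: stop at the last epoch.
    return len(val_losses), best_epoch


# Illustrative (made-up) validation losses that start rising after epoch 3.
illustrative_losses = [0.70, 0.55, 0.48, 0.52, 0.60, 0.75]
stop_epoch, best_epoch = early_stopping_epoch(illustrative_losses, patience=2)
print(stop_epoch, best_epoch)
```

In a real training loop one would typically also save the model weights at `best_epoch` and restore them after stopping.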


In [None]:
epochs = 5

In [None]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:
        x = batch.text
        y = batch.label.float()

        opt.zero_grad()
        preds = model(x)
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # Note: loss.item() is already a per-batch mean, so dividing the sum by the
    # number of examples reports values scaled down by roughly the batch size.
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for batch in val_iter:
            x = batch.text
            y = batch.label.float()

            preds = model(x)
            loss = loss_func(preds, y)
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))



Epoch: 1, Training Loss: 0.010040745798179082, Validation Loss: 0.0095876127799352
Epoch: 2, Training Loss: 0.00681613667522158, Validation Loss: 0.006512776052951813
Epoch: 3, Training Loss: 0.0035694457711918015, Validation Loss: 0.007702694930632909
Epoch: 4, Training Loss: 0.00240104661429567, Validation Loss: 0.006040231561660767
Epoch: 5, Training Loss: 0.001051394115208781, Validation Loss: 0.007511228881279627
CPU times: user 1min 3s, sys: 1.93 s, total: 1min 5s
Wall time: 1min 7s


In [None]:
def get_metrics(pred, gt):
    print(f'Accuracy: {accuracy_score(gt, pred):.2f}')
    print(f'Precision: {precision_score(gt, pred):.2f}')
    print(f'Recall: {recall_score(gt, pred):.2f}')
    print(f'F1: {f1_score(gt, pred):.2f}')
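The four metrics can also be derived directly from confusion counts, which makes it easy to see why argument order matters: swapping predictions and ground truth exchanges precision and recall, while accuracy and F1 stay the same. This is a pure-Python sketch that does not use sklearn; the function `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(pred, gt):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(gt),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, made up for this example.
toy_gt = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 1, 1, 0, 0]

correct = binary_metrics(toy_pred, toy_gt)
swapped = binary_metrics(toy_gt, toy_pred)  # arguments in the wrong order
print(correct)
print(swapped)
```

The helper above takes `pred` first and `gt` second, so it is worth checking each call site against that order when reading precision and recall.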

### Calculate performance of the trained model (10 points)

In [None]:
pred = []
gt = []

with torch.no_grad():  # inference only, no gradients needed
    for batch in test_iter:
        x = batch.text
        y = batch.label.float()

        # Round the sigmoid outputs to get 0/1 predictions.
        pred += model(x).round().tolist()
        gt += y.tolist()

# get_metrics is defined as get_metrics(pred, gt); called in this order, the
# printed precision and recall values are exchanged.
get_metrics(gt, pred)



Accuracy: 0.84
Precision: 0.79
Recall: 0.89
F1: 0.83


Write down the calculated performance

### Accuracy: 0.83
### Precision: 0.78
### Recall: 0.86
### F1: 0.82

### Experiments (10 points)

Experiment with the model and achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and mention what was helpful.

### 1. Simple GRU-based RNN, but with a larger embedding dimension and hidden layer size
`em_sz = 300`

`nh = 500`

#### Accuracy: 0.81
#### Precision: 0.95
#### Recall: 0.74
#### F1: 0.83

### 2. RNN with an LSTM module instead of the GRU

#### Accuracy: 0.83
#### Precision: 0.76
#### Recall: 0.88
#### F1: 0.82

### 3. LSTM from the previous experiment, with an added attention mechanism (Self-Attention)
The number of training epochs was doubled: `epochs = 10`
#### Accuracy: 0.81
#### Precision: 0.78
#### Recall: 0.84
#### F1: 0.81

In [None]:
class RNN(nn.Module):
    def __init__(self, V, D, emb_dim, num_classes=1):
        super().__init__()
        self.embed = nn.Embedding(V+1, emb_dim, padding_idx=1)

        self.gru = nn.GRU(emb_dim, D, batch_first=True)
        self.linear = nn.Linear(D, num_classes)
        self.sm = nn.Sigmoid()
            
    def forward(self, seq):
        seq = torch.transpose(seq, 0, 1)

        x = self.embed(seq)
        
        # h_n is the final hidden state, shape (1, batch, D).
        _, h_n = self.gru(x)
        x = self.linear(h_n)
        preds = self.sm(x)
        
        preds = torch.squeeze(preds)
        return preds

In [None]:
vocab_size = len(TEXT.vocab)
em_sz = 300
nh = 500
model = RNN(vocab_size, nh, emb_dim=em_sz); model

RNN(
  (embed): Embedding(202237, 300, padding_idx=1)
  (gru): GRU(300, 500, batch_first=True)
  (linear): Linear(in_features=500, out_features=1, bias=True)
  (sm): Sigmoid()
)

In [None]:
model.cuda()

RNN(
  (embed): Embedding(202237, 300, padding_idx=1)
  (gru): GRU(300, 500, batch_first=True)
  (linear): Linear(in_features=500, out_features=1, bias=True)
  (sm): Sigmoid()
)

In [None]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [None]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label.float()

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label.float()
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))



Epoch: 1, Training Loss: 0.009736107305118015, Validation Loss: 0.0075226970473925275
Epoch: 2, Training Loss: 0.0047743097926889145, Validation Loss: 0.005689547514915467
Epoch: 3, Training Loss: 0.0019570412627288275, Validation Loss: 0.005847744688888391
Epoch: 4, Training Loss: 0.0006602316673645483, Validation Loss: 0.008146367043256759
Epoch: 5, Training Loss: 0.0003243293537475568, Validation Loss: 0.008822253026564916
CPU times: user 1min 45s, sys: 5.85 s, total: 1min 51s
Wall time: 1min 51s


In [None]:
pred = []
gt = []

for batch in test_iter:
    x = batch.text
    y = batch.label.float()
    
    pred += model(x).round().tolist()
    gt += y.tolist()

get_metrics(gt, pred)



Accuracy: 0.85
Precision: 0.89
Recall: 0.82
F1: 0.86


In [None]:
class LSTM(nn.Module):
    def __init__(self, V, D, emb_dim, num_classes=1):
        super().__init__()
        self.hidden_dim = D
        self.embed = nn.Embedding(V+1, emb_dim, padding_idx=1)

        self.lstm = nn.LSTM(emb_dim, D, batch_first=True)

        self.linear = nn.Linear(D, num_classes)
        self.sm = nn.Sigmoid()
            
    def forward(self, seq):
        seq = torch.transpose(seq, 0, 1)

        x = self.embed(seq)
        
        # h_n is the final hidden state, shape (1, batch, D); the cell state is unused.
        _, (h_n, _) = self.lstm(x)

        x = self.linear(h_n)
        preds = self.sm(x)
        
        preds = torch.squeeze(preds)
        return preds

In [None]:
vocab_size = len(TEXT.vocab)
em_sz = 200
nh = 300
model = LSTM(vocab_size, nh, emb_dim=em_sz); model

LSTM(
  (embed): Embedding(202237, 200, padding_idx=1)
  (lstm): LSTM(200, 300, batch_first=True)
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (sm): Sigmoid()
)

In [None]:
model.cuda()

LSTM(
  (embed): Embedding(202237, 200, padding_idx=1)
  (lstm): LSTM(200, 300, batch_first=True)
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (sm): Sigmoid()
)

In [None]:
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [None]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label.float()

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label.float()
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))



Epoch: 1, Training Loss: 0.010147611509050641, Validation Loss: 0.009657022051016489
Epoch: 2, Training Loss: 0.008146413929121835, Validation Loss: 0.008766460768381754
Epoch: 3, Training Loss: 0.005623402918662344, Validation Loss: 0.009291505881150564
Epoch: 4, Training Loss: 0.0036178770635809216, Validation Loss: 0.008916429054737092
Epoch: 5, Training Loss: 0.002209438592089074, Validation Loss: 0.009596401166915893
CPU times: user 1min 10s, sys: 2.65 s, total: 1min 12s
Wall time: 1min 13s


In [None]:
pred = []
gt = []

for batch in test_iter:
    x = batch.text
    y = batch.label.float()
    
    pred += model(x).round().tolist()
    gt += y.tolist()

get_metrics(gt, pred)



Accuracy: 0.72
Precision: 0.77
Recall: 0.71
F1: 0.74


In [None]:
class LSTM_Attn(nn.Module):
    def __init__(self, V, D, emb_dim, num_classes=1):
        super().__init__()
        self.hidden_dim = D
        self.embed = nn.Embedding(V+1, emb_dim, padding_idx=1)

        self.lstm = nn.LSTM(emb_dim, D, batch_first=True)

        self.attn = nn.MultiheadAttention(D, 1)
        
        self.linear = nn.Linear(D, num_classes)
        self.sm = nn.Sigmoid()
            
    def forward(self, seq):
        seq = torch.transpose(seq, 0, 1)

        x = self.embed(seq)
        
        # h_n is the final hidden state, shape (1, batch, D).
        _, (h_n, _) = self.lstm(x)

        # Note: h_n holds a single time step, so attention here sees a sequence of length 1.
        x, _ = self.attn(h_n, h_n, h_n)

        x = self.linear(x)
        preds = self.sm(x)
        
        preds = torch.squeeze(preds)
        return preds
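One thing worth checking in the class above: the attention layer receives only the final hidden state, which has a single time step. The pure-Python sketch below shows scaled dot-product attention weights for a length-1 sequence versus several steps. It deliberately ignores the learned projections inside `nn.MultiheadAttention`; the functions `attention_weights` and `attend` and all vectors are hypothetical.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weighted sum of the value vectors using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# A made-up "final hidden state" of dimension 3.
h_n = [0.5, -1.0, 2.0]

# Attending over a single step: the softmax has one entry, so its weight is 1.
single = attention_weights(h_n, [h_n])

# Attending over several made-up per-step outputs gives a real weighting.
steps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, -1.0, 2.0]]
multi = attention_weights(h_n, steps)
pooled = attend(h_n, steps, steps)
print(single, multi, pooled)
```

Under these assumptions, attention over one time step cannot reweight anything. A variant that attends over the per-step LSTM outputs instead of the final hidden state might be a more informative experiment.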

In [None]:
vocab_size = len(TEXT.vocab)
em_sz = 200
nh = 300
model = LSTM_Attn(vocab_size, nh, emb_dim=em_sz); model

LSTM_Attn(
  (embed): Embedding(202237, 200, padding_idx=1)
  (lstm): LSTM(200, 300, batch_first=True)
  (attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=300, out_features=300, bias=True)
  )
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (sm): Sigmoid()
)

In [None]:
model.cuda()

LSTM_Attn(
  (embed): Embedding(202237, 200, padding_idx=1)
  (lstm): LSTM(200, 300, batch_first=True)
  (attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=300, out_features=300, bias=True)
  )
  (linear): Linear(in_features=300, out_features=1, bias=True)
  (sm): Sigmoid()
)

In [None]:
epochs = 10
opt = torch.optim.Adam(model.parameters())
loss_func = nn.BCELoss()

In [None]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train() 
    for batch in train_iter: 
        
        x = batch.text
        y = batch.label.float()

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    epoch_loss = running_loss / len(trn)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label.float()
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))



Epoch: 1, Training Loss: 0.010146206368718828, Validation Loss: 0.010841795555750528
Epoch: 2, Training Loss: 0.009751753304685865, Validation Loss: 0.008532008218765258
Epoch: 3, Training Loss: 0.00543042129107884, Validation Loss: 0.005889902116854986
Epoch: 4, Training Loss: 0.002367686459262456, Validation Loss: 0.007636993495623271
Epoch: 5, Training Loss: 0.0015292037265375257, Validation Loss: 0.01303813465833664
Epoch: 6, Training Loss: 0.0009754644131770224, Validation Loss: 0.00883774971763293
Epoch: 7, Training Loss: 0.000580570675364288, Validation Loss: 0.022201929569244386
Epoch: 8, Training Loss: 0.0005947379629218111, Validation Loss: 0.012605170675118764
Epoch: 9, Training Loss: 0.0003124555932439762, Validation Loss: 0.03167486324310303
Epoch: 10, Training Loss: 0.0003928777390600382, Validation Loss: 0.04041039793491363
CPU times: user 2min 26s, sys: 2.08 s, total: 2min 28s
Wall time: 2min 28s


In [None]:
pred = []
gt = []

for batch in test_iter:
    x = batch.text
    y = batch.label.float()
    
    pred += model(x).round().tolist()
    gt += y.tolist()

get_metrics(gt, pred)



Accuracy: 0.79
Precision: 0.94
Recall: 0.73
F1: 0.82
