# KAIST AI605 Assignment 1: Text Classification
TA in charge: Miyoung Ko (miyoungko@kaist.ac.kr)

**Due Date:** September 29 (Wed) 11:00pm, 2021

## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/aGZZ86YpCdv2zEVt9). 

You need to submit both (1) a PDF of this notebook, and (2) a link to CoLab for execution (.ipynb file is also allowed).

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 20 points. You can obtain up to 5 bonus points (i.e. max score is 25 points). For every late day, your grade will be deducted by 2 points (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will only use Python 3.7 and PyTorch 1.9, which is already available on Colab:

In [1]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.8.3
torch 1.7.0


## 1. Limitations of Vanilla RNNs
In Lecture 02, we saw that a multi-layer perceptron (MLP) without activation function is equivalent to a single linear transformation with respect to the inputs. One can define a vanilla recurrent neural network without activation as, given inputs $\textbf{x}_1 \dots \textbf{x}_T$, the outputs $\textbf{h}_t$ is obtained by
$$\textbf{h}_t = \textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b},$$
where $\textbf{V}, \textbf{U}, \textbf{b}$ are trainable weights. 

> **Problem 1.1** *(2 point)* Show that such recurrent neural network (RNN) without activation function is equivalent to a single linear transformation with respect to the inputs, which means each $\textbf{h}_t$ is a linear combination of the inputs.



In Lecture 05 and 06, we will see how RNNs can model non-linearity via activation function, but they still suffer from exploding or vanishing gradients. We can mathematically show that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and really big otherwise.

> **Problem 1.2** *(2 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

> **Problem 1.3** *(2 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 05 and 06 slides for the definition of LSTM.

## (Answer)

**Problem 1.1.**
For every $t$, $\mathbf{h}_t$ is recursively defined as
$$ \mathbf{h}_t = \mathbf{V}\mathbf{h}_{t-1} + \mathbf{U}\mathbf{x}_t + \mathbf{b} = \mathbf{V}(\mathbf{V}\mathbf{h}_{t-2} + \mathbf{U}\mathbf{x}_{t-1} + \mathbf{b}) + \mathbf{U}\mathbf{x}_t + \mathbf{b}$$
$$ =\cdots = \mathbf{V}^t\mathbf{h}_0 + \sum_{k=0}^{t-1} \mathbf{V}^{k}\mathbf{U}\mathbf{x}_{t-k} + \sum_{k=0}^{t-1} \mathbf{V}^k \mathbf{b} $$
which is a linear combination of the inputs $\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_t$. This holds for every $t$, therefore RNN without non-linear activation function is equiavlent to a single linear transformation of the inputs.

**Problem 1.2.**
If we use gradient clipping, the norm of $\mathbf{V}$ does not exceed some value, i.e., 1. Then, the gradient $\frac{\partial\mathbf{h}_T}{\partial\mathbf{h}_1}$ can be reduced, mitigating the exploding gradient issue.

**Problem 1.3.**
In LSTM, activation function helps avoiding vanishing gradient. In the recurrency of the LSTM, the activation function is the identity function with a derivative of 1.0. Specifically, the effective weight of the recurrency is equal to the forget gate activation. So, if the forget gate is on (activation close to 1.0), the gradient does not vanish. 

## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank (SST), a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST via Hugging Face
We will use `datasets` package offered by Hugging Face, which allows us to easily download various language datasets, including Stanford Sentiment Treebank.

First, install the package:

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.12.1-py3-none-any.whl (270 kB)
[K     |████████████████████████████████| 270 kB 12.2 MB/s 
Collecting huggingface-hub<0.1.0,>=0.0.14
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
[K     |████████████████████████████████| 52 kB 1.6 MB/s 
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 41.0 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.7.4.post0-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 46.5 MB/s 
[?25hCollecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.9.0-py3-none-any.whl (123 kB)
[K     |████████████████████████████████| 123 kB 52.2 MB/s 
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.6.3-cp37-cp37m-manylinux2014_x86_64.whl (294 kB)
[K     |████████████████████████████████| 294 kB 49.8 MB/s 
[?25hCollecting async-timeout<4.0,>=3.0
  Downloading async_timeout-3.0.1-

Then download SST and print the first example:

In [2]:
from datasets import load_dataset
from pprint import pprint

sst_dataset = load_dataset('sst')
pprint(sst_dataset['train'][0])

No config specified, defaulting to: sst/default
Reusing dataset sst (/home/sungnyun/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


  0%|          | 0/3 [00:00<?, ?it/s]

{'label': 0.6944400072097778,
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' "
             "and that he 's going to make a splash even greater than Arnold "
             'Schwarzenegger , Jean-Claud Van Damme or Steven Segal .',
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}


Note that each `label` is a score between 0 and 1. You will round it to either 0 or 1 for binary classification (positive for 1, negative for 0).
In this first example, the label is rounded to 1, meaning that the sentence is a positive review.
You will only use `sentence` as the input; please ignore other values.

> **Problem 2.1** *(2 points)* Using space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.

## (Answer)
**Problem 2.1.**
Vocabulary size is 18282, including 'PAD' and 'UNK' token. (see the code below)

In [3]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [4]:
# Constructing vocabulary with `UNK`
vocab = ['PAD', 'UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

['PAD', 'UNK', 'Hello', 'world!']
2


In [5]:
### Problem 2.1 ###
vocab = ['PAD', 'UNK']
for data in sst_dataset['train']:
  for word in data['sentence'].split(' '):
    if word not in vocab:
      vocab.append(word)
word2id = {word: id_ for id_, word in enumerate(vocab)}

print('Vocabulary size: {}'.format(len(vocab)))

Vocabulary size: 18282


> **Problem 2.2** *(1 point)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

## (Answer)
**Problem 2.2.**
Vocabulary size is now 8738, including 'PAD' and 'UNK' token. (see the code below)

In [6]:
### Problem 2.2 ###
vocab = ['PAD', 'UNK']
vocab_count = {}
for data in sst_dataset['train']:
  for word in data['sentence'].split(' '):
    if word not in vocab and word not in vocab_count.keys():
      vocab_count[word] = 1
    elif word not in vocab and word in vocab_count.keys():
      vocab_count[word] += 1
      if vocab_count[word] >= 2:
        vocab.append(word)
    else:
      continue
word2id = {word: id_ for id_, word in enumerate(vocab)}

print('Vocabulary size: {}'.format(len(vocab)))

Vocabulary size: 8738


## 3. Text Classification with Multi-Layer Perceptron and Recurrent Neural Network

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baseline is to fix the input length (with truncation or padding), flatten the word embeddings, apply a linear transformation followed by an activation, and finally classify the output into the two classes: 

In [7]:
from torch import nn

length = 8
input_ = "hi world!"
input_tokens = input_.split(' ')
input_ids = [word2id[word] if word in word2id else 1 for word in input_tokens] # UNK if word not found
if len(input_ids) < length:
  input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
else:
  input_ids = input_ids[:length]

input_tensor = torch.LongTensor([input_ids]) # the first dimension is minibatch size
print(input_tensor)

tensor([[1, 1, 0, 0, 0, 0, 0, 0]])


In [8]:
# Two-layer MLP classification
class Baseline(nn.Module):
  def __init__(self, d, length):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d * length, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor) # [batch_size, length, d]
    emb_flat = emb.view(emb.size(0), -1) # [batch_size, length*d]
    hidden = self.relu(self.layer(emb_flat))
    logits = self.class_layer(hidden)
    return logits

d = 3 # usually bigger, e.g. 128
baseline = Baseline(d, length).cuda()
logits = baseline(input_tensor.cuda())
softmax = nn.Softmax(1)
print(softmax(logits)) # probability for each class

tensor([[0.1654, 0.8346]], device='cuda:0', grad_fn=<SoftmaxBackward>)


Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`), which in fact turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the probability distribution and a one-hot distribution of the target label (note that we use `logits` instead of `softmax(logits)` as the input to the cross entropy, which allow us to avoid numerical instability). 

In [9]:
cel = nn.CrossEntropyLoss()
label = torch.LongTensor([1]).cuda() # The ground truth label for "hi world!" is positive.
loss = cel(logits, label) # Loss, a.k.a L
print(loss)

tensor(0.1809, device='cuda:0', grad_fn=<NllLossBackward>)


Once we have the loss defined, only one step remains! We compute the gradients of parameters with respective to the loss and update. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically a Stochastic Gradient Descent (SGD) with minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [10]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad() # reset process
loss.backward() # compute gradients
optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [11]:
print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

tensor([[ 3.7857e-04, -3.7395e-03, -6.8773e-05,  3.7857e-04, -3.7395e-03,
         -6.8773e-05,  9.9354e-04,  2.3825e-03,  4.2434e-03,  9.9354e-04,
          2.3825e-03,  4.2434e-03,  9.9354e-04,  2.3825e-03,  4.2434e-03,
          9.9354e-04,  2.3825e-03,  4.2434e-03,  9.9354e-04,  2.3825e-03,
          4.2434e-03,  9.9354e-04,  2.3825e-03,  4.2434e-03],
        [-2.7539e-02,  2.7203e-01,  5.0028e-03, -2.7539e-02,  2.7203e-01,
          5.0028e-03, -7.2274e-02, -1.7331e-01, -3.0868e-01, -7.2274e-02,
         -1.7331e-01, -3.0868e-01, -7.2274e-02, -1.7331e-01, -3.0868e-01,
         -7.2274e-02, -1.7331e-01, -3.0868e-01, -7.2274e-02, -1.7331e-01,
         -3.0868e-01, -7.2274e-02, -1.7331e-01, -3.0868e-01],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  

> **Problem 3.1** *(2 points)* Properly train a MLP baseline model on SST and report the model's accuracy on the dev data.

> **Problem 3.2** *(2 points)* Implement a recurrent neural network (without using PyTorch's RNN module) with `tanh` activation, and use the output of the RNN at the final time step for the classification. Report the model's accuracy on the dev data.

> **Problem 3.3** *(2 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

> **Problem 3.4 (bonus)** *(1 points)* Why is it numerically unstable if you compute log on top of softmax?

## (Answer)
**Problem 3.1.**
(See the code below.) Validation accuracy of MLP after 10 epochs is 55.40%.

**Problem 3.2.** 
(See the code below.) Validation accuracy of RNN  after 10 epochs is 57.49%.

**Problem 3.3.** 
The cross-entropy computed above is formulated as
$$ H(p,\hat{p}) = -\sum_x \sum_i p_i(x) \log \hat{p}_i(x) $$
where $p$ is the one-hot probability vector of the ground-truth label, and $\hat{p}$ is the predicted probability distribution. Since $p_i$ is 0 except for the ground-truth label dimension, the above form is equivalent to
$$ -\sum_x \log\hat{y}(x) $$
which is negative log likelihood (NLL).

**Problem 3.4.**
Softmax function computes the exponential term with logit values. However, if the logit values are too large, overflow can happen in our computer, i.e., it is only concentrated on the largest value. Likewise, if the logit values are too small, underflow happens and the values are equally assigned.

In [13]:
### Problem 3.1 ###

class SSTDataset(torch.utils.data.Dataset):
  def __init__(self, split, length=16):
    assert split in ['train', 'validation', 'test']
    self._data = sst_dataset[split]
    self.input_lst, self.label_lst = [], []
    for data in self._data:
      sentence = data['sentence']      
      tokens = sentence.split(' ')
      input_ids = [word2id[word] if word in word2id else 1 for word in tokens] # UNK if word not found
      if len(input_ids) < length:
        input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
      else:
        input_ids = input_ids[:length]
      self.input_lst.append(torch.LongTensor(input_ids))

      label = round(data['label'])
      self.label_lst.append(label)

  def __getitem__(self, idx):
    return self.input_lst[idx], self.label_lst[idx]

  def __len__(self):
    return len(self._data)


trainset = SSTDataset('train')
validset = SSTDataset('validation')
train_loader = torch.utils.data.DataLoader(trainset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(validset, batch_size=16, shuffle=False)

baseline = Baseline(d=128, length=16).cuda()
cel = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.01)


for epoch in range(10):
  baseline.train()
  avg_loss = 0
  for input, label in train_loader:
    input = input.cuda()
    label = label.cuda()
    logits = baseline(input)
    loss = cel(logits, label)
    avg_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f'Epoch {epoch+1}, Train Loss {avg_loss / len(train_loader)}')

  baseline.eval()
  acc, count = 0, 0
  for input, label in valid_loader:
    with torch.no_grad():
      input = input.cuda()
      logits = baseline(input)
      _, preds = torch.max(logits, dim=1)
      acc += (preds.cpu().data == label).sum().item()
      count += float(input.size(0))
  acc /= count
  print(f'Epoch {epoch+1}, Valid Acc {acc}')

print('='*50)
print(f'Last Validation Accuracy: {acc}')

Epoch 1, Train Loss 0.6926739388860567
Epoch 1, Valid Acc 0.5485921889191644
Epoch 2, Train Loss 0.6309732247269555
Epoch 2, Valid Acc 0.5358764759309719
Epoch 3, Train Loss 0.5576151426365313
Epoch 3, Valid Acc 0.5440508628519528
Epoch 4, Train Loss 0.4595080493541246
Epoch 4, Valid Acc 0.5286103542234333
Epoch 5, Train Loss 0.35082442009270415
Epoch 5, Valid Acc 0.5513169845594914
Epoch 6, Train Loss 0.2495291709397616
Epoch 6, Valid Acc 0.5395095367847411
Epoch 7, Train Loss 0.17458258945955318
Epoch 7, Valid Acc 0.5376930063578564
Epoch 8, Train Loss 0.12860913224839204
Epoch 8, Valid Acc 0.5522252497729337
Epoch 9, Train Loss 0.09566429610039746
Epoch 9, Valid Acc 0.5367847411444142
Epoch 10, Train Loss 0.07502119265468826
Epoch 10, Valid Acc 0.5504087193460491
Last Validation Accuracy: 0.5504087193460491


In [None]:
### Problem 3.2 ###

class RNN(nn.Module):
    def __init__(self, seq_length, hidden_dim, embed_dim):
        super(RNN, self).__init__()
        self.seq_length = seq_length
        self.hidden_dim = hidden_dim
        self.embed_dim = embed_dim

        self.embedding = nn.Embedding(len(vocab), embed_dim)
        self.linear = nn.Linear(embed_dim + hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 2)
        self.tanh  = nn.Tanh()

    def forward(self, input):
        assert self.seq_length == input.size(1)  # batch_first
        emb = self.embedding(input)
        h_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()

        for seq in range(self.seq_length):
          x = torch.cat([emb[:, seq, :], h_t], dim=-1)
          h_t = self.tanh(self.linear(x))

        out = self.classifier(h_t)
        return out

rnn = RNN(seq_length=16, hidden_dim=64, embed_dim=64).cuda()
cel = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.001)

for epoch in range(10):
  rnn.train()
  avg_loss = 0
  for input, label in train_loader:
    logits = rnn(input.cuda())
    loss = cel(logits, label.cuda())
    avg_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f'Epoch {epoch+1}, Train Loss {avg_loss / len(train_loader)}')

  rnn.eval()
  acc, count = 0, 0
  for input, label in valid_loader:
    with torch.no_grad():
      logits = rnn(input.cuda())
      _, preds = torch.max(logits, dim=1)
      acc += (preds.cpu().data == label).sum().item()
      count += float(input.size(0))
  acc /= count
  print(f'Epoch {epoch+1}, Valid Acc {acc}')

print('='*50)
print(f'Last Validation Accuracy: {acc}')

Epoch 1, Train Loss 0.7002752028154523
Epoch 1, Valid Acc 0.5049954586739328
Epoch 2, Train Loss 0.6841855224375422
Epoch 2, Valid Acc 0.5549500454132607
Epoch 3, Train Loss 0.6612656375442105
Epoch 3, Valid Acc 0.5776566757493188
Epoch 4, Train Loss 0.6143353563011362
Epoch 4, Valid Acc 0.6239782016348774
Epoch 5, Train Loss 0.5568217269601893
Epoch 5, Valid Acc 0.6021798365122616
Epoch 6, Train Loss 0.501285749037614
Epoch 6, Valid Acc 0.6584922797456857
Epoch 7, Train Loss 0.4489893540460965
Epoch 7, Valid Acc 0.6575840145322435
Epoch 8, Train Loss 0.38173778384421647
Epoch 8, Valid Acc 0.6657584014532243
Epoch 9, Train Loss 0.3311019776884313
Epoch 9, Valid Acc 0.6512261580381471
Epoch 10, Train Loss 0.3220137807676631
Epoch 10, Valid Acc 0.5749318801089919
Last Validation Accuracy: 0.5749318801089919


## 4. Text Classification with LSTM and Dropout

Replace your RNN module with an LSTM module. See Lecture slides 05 and 06 for the formal definition of LSTMs. 

You will also use Dropout, which randomly makes each dimension zero with the probability of `p` and scale it by `1/(1-p)` if it is not zero during training. Put it either at the input or the output of the LSTM to prevent it from overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.0000, 0.6000, 0.0000, 1.4000, 0.0000])


> **Problem 4.1** *(3 points)* Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN. Report the accuracy on the dev data.

> **Problem 4.2** *(2 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data.

> **Problem 4.3 (bonus)** *(2 points)* Consider implementing bidirectional LSTM and two layers of LSTM. Concatenate the forward direction output at the final time step and the backward direction output at the first time step for the final classificaiton. Report your accuracy on dev data.

## (Answer)

**Problem 4.1.**
(See the code below.) Validation accuracy of LSTM after 10 epochs is 66.94%.

**Problem 4.2.**
(See the code below.) Validation accuracy of LSTM with Dropout after 10 epochs is 68.76%.

**Problem 4.3.**
(See the code below.) Validation accuracy of bi-directional stacked (2 layers) LSTM after 10 epochs is 64.03%.

In [14]:
### Problem 4.1 ###

import time

class LSTM(nn.Module):
    def __init__(self, seq_length, hidden_dim, embed_dim, dropout=False, pretrained=False):
        super(LSTM, self).__init__()
        self.seq_length = seq_length
        self.hidden_dim = hidden_dim
        self.embed_dim = embed_dim
        self.pretrained = pretrained

        if not self.pretrained:
          self.embedding = nn.Embedding(len(vocab), embed_dim)
        self.linear_input = nn.Linear(embed_dim + hidden_dim, hidden_dim)
        self.linear_forget = nn.Linear(embed_dim + hidden_dim, hidden_dim)
        self.linear_cell = nn.Linear(embed_dim + hidden_dim, hidden_dim)
        self.linear_output = nn.Linear(embed_dim + hidden_dim, hidden_dim)

        self.classifier = nn.Linear(hidden_dim, 2)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()
        if dropout:
          self.dropout = nn.Dropout(p=0.5)
        else:
          self.dropout = None

    def forward(self, input):
        assert self.seq_length == input.size(1)  # batch_first
        if not self.pretrained:
          emb = self.embedding(input)
        else:
          emb = input
        h_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()
        c_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()

        for seq in range(self.seq_length):
          i_t = self.sigmoid(self.linear_input(torch.cat([emb[:, seq, :], h_t], dim=-1)))
          f_t = self.sigmoid(self.linear_forget(torch.cat([emb[:, seq, :], h_t], dim=-1)))
          g_t = self.tanh(self.linear_cell(torch.cat([emb[:, seq, :], h_t], dim=-1)))
          o_t = self.sigmoid(self.linear_output(torch.cat([emb[:, seq, :], h_t], dim=-1)))

          c_t = f_t * c_t + i_t * g_t
          h_t = o_t * self.tanh(c_t)

        if self.dropout is not None:
          h_t = self.dropout(h_t)
        out = self.classifier(h_t)
        return out

lstm = LSTM(seq_length=16, hidden_dim=64, embed_dim=64).cuda()
cel = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lstm.parameters(), lr=0.01)


for epoch in range(10):
  start = time.time()
  lstm.train()
  avg_loss = 0
  for input, label in train_loader:
    logits = lstm(input.cuda())
    loss = cel(logits, label.cuda())
    avg_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  end = time.time() - start
  print(f'Epoch {epoch+1}, Train Loss {avg_loss / len(train_loader)}, Time {end}s')

  lstm.eval()
  acc, count = 0, 0
  for input, label in valid_loader:
    with torch.no_grad():
      logits = lstm(input.cuda())
      _, preds = torch.max(logits, dim=1)
      acc += (preds.cpu().data == label).sum().item()
      count += float(input.size(0))
  acc /= count
  print(f'Epoch {epoch+1}, Valid Acc {acc}')

print('='*50)
print(f'Last Validation Accuracy: {acc}')

Epoch 1, Train Loss 0.700248077567597, Time 21.331552743911743s
Epoch 1, Valid Acc 0.5522252497729337
Epoch 2, Train Loss 0.6394619273670604, Time 21.157572746276855s
Epoch 2, Valid Acc 0.6267029972752044
Epoch 3, Train Loss 0.4642184291327937, Time 20.68288016319275s
Epoch 3, Valid Acc 0.6603088101725704
Epoch 4, Train Loss 0.32396950177876244, Time 20.515620708465576s
Epoch 4, Valid Acc 0.667574931880109
Epoch 5, Train Loss 0.25231142262736345, Time 20.70114278793335s
Epoch 5, Valid Acc 0.6757493188010899
Epoch 6, Train Loss 0.2162350971174988, Time 20.935175895690918s
Epoch 6, Valid Acc 0.6548592188919165
Epoch 7, Train Loss 0.16316376408049313, Time 21.018038034439087s
Epoch 7, Valid Acc 0.6821071752951862
Epoch 8, Train Loss 0.12651554092700323, Time 21.22074818611145s
Epoch 8, Valid Acc 0.662125340599455
Epoch 9, Train Loss 0.10965743735288741, Time 20.96744990348816s
Epoch 9, Valid Acc 0.6693914623069936
Epoch 10, Train Loss 0.11846032501298634, Time 21.01356315612793s
Epoch 10,

In [None]:
### Problem 4.2 ###

lstm = LSTM(seq_length=16, hidden_dim=64, embed_dim=64, dropout=True).cuda()
cel = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lstm.parameters(), lr=0.01)

for epoch in range(10):
  lstm.train()
  avg_loss = 0
  for input, label in train_loader:
    logits = lstm(input.cuda())
    loss = cel(logits, label.cuda())
    avg_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f'Epoch {epoch+1}, Train Loss {avg_loss / len(train_loader)}')

  lstm.eval()
  acc, count = 0, 0
  for input, label in valid_loader:
    with torch.no_grad():
      logits = lstm(input.cuda())
      _, preds = torch.max(logits, dim=1)
      acc += (preds.cpu().data == label).sum().item()
      count += float(input.size(0))
  acc /= count
  print(f'Epoch {epoch+1}, Valid Acc {acc}')

print('='*50)
print(f'Last Validation Accuracy: {acc}')

Epoch 1, Train Loss 0.7021140858028712
Epoch 1, Valid Acc 0.5894641235240691
Epoch 2, Train Loss 0.6200328616613753
Epoch 2, Valid Acc 0.6485013623978202
Epoch 3, Train Loss 0.4675403159767501
Epoch 3, Valid Acc 0.6966394187102634
Epoch 4, Train Loss 0.3609386019352893
Epoch 4, Valid Acc 0.6920980926430518
Epoch 5, Train Loss 0.28877523076835643
Epoch 5, Valid Acc 0.6948228882833788
Epoch 6, Train Loss 0.2440362356791503
Epoch 6, Valid Acc 0.6939146230699365
Epoch 7, Train Loss 0.20462624923315612
Epoch 7, Valid Acc 0.6884650317892824
Epoch 8, Train Loss 0.18639810670216042
Epoch 8, Valid Acc 0.6811989100817438
Epoch 9, Train Loss 0.1740493546571401
Epoch 9, Valid Acc 0.695731153496821
Epoch 10, Train Loss 0.15924992400948795
Epoch 10, Valid Acc 0.6875567665758402
Last Validation Accuracy: 0.6875567665758402


In [None]:
### Problem 4.3 ###

class BiStackedLSTM(nn.Module):
    def __init__(self, seq_length, hidden_dim, embed_dim, dropout=False):
        super(BiStackedLSTM, self).__init__()
        self.seq_length = seq_length
        self.hidden_dim = hidden_dim
        self.embed_dim = embed_dim

        self.embedding = nn.Embedding(len(vocab), embed_dim)
        self.layers = []
        assert embed_dim == hidden_dim  # I used the same dimension for convenience
        for layer in range(2):
          layer_dict = {}
          layer_dict['input_gate'] = nn.Linear(embed_dim + hidden_dim, hidden_dim).cuda()
          layer_dict['forget_gate'] = nn.Linear(embed_dim + hidden_dim, hidden_dim).cuda()
          layer_dict['cell_gate'] = nn.Linear(embed_dim + hidden_dim, hidden_dim).cuda()
          layer_dict['output_gate'] = nn.Linear(embed_dim + hidden_dim, hidden_dim).cuda()
          self.layers.append(layer_dict)
 
        self.classifier = nn.Linear(hidden_dim*2, 2)
        self.sigmoid = nn.Sigmoid()
        self.tanh = nn.Tanh()

        if dropout:
          self.dropout = nn.Dropout(p=0.5)
        else:
          self.dropout = None

    def forward(self, input):
        assert self.seq_length == input.size(1)  # batch_first
        emb = self.embedding(input)

        h_t_bi = []
        for direction in [range(self.seq_length), range(self.seq_length)[::-1]]:
          # 1st layer
          all_h_t = []
          h_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()
          c_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()
          for seq in direction:
            i_t = self.sigmoid(self.layers[0]['input_gate'](torch.cat([emb[:, seq, :], h_t], dim=-1)))
            f_t = self.sigmoid(self.layers[0]['forget_gate'](torch.cat([emb[:, seq, :], h_t], dim=-1)))
            g_t = self.tanh(self.layers[0]['cell_gate'](torch.cat([emb[:, seq, :], h_t], dim=-1)))
            o_t = self.sigmoid(self.layers[0]['output_gate'](torch.cat([emb[:, seq, :], h_t], dim=-1)))
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * self.tanh(c_t)
            if self.dropout is not None:
              h_t = self.dropout(h_t)
            all_h_t.append(h_t)
          # 2nd layer
          h_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()
          c_t = torch.autograd.Variable(torch.zeros(emb.size(0), self.hidden_dim)).cuda()
          for seq in direction:
            i_t = self.sigmoid(self.layers[1]['input_gate'](torch.cat([all_h_t[seq], h_t], dim=-1)))
            f_t = self.sigmoid(self.layers[1]['forget_gate'](torch.cat([all_h_t[seq], h_t], dim=-1)))
            g_t = self.tanh(self.layers[1]['cell_gate'](torch.cat([all_h_t[seq], h_t], dim=-1)))
            o_t = self.sigmoid(self.layers[1]['output_gate'](torch.cat([all_h_t[seq], h_t], dim=-1)))
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * self.tanh(c_t)
            if self.dropout is not None:
              h_t = self.dropout(h_t)
          h_t_bi.append(h_t)  # last (or first) hidden state

        h_t_concat = torch.cat(h_t_bi, dim=-1)  # [B, hidden_dim*2]
        out = self.classifier(h_t_concat)
        return out


model = BiStackedLSTM(seq_length=16, hidden_dim=64, embed_dim=64, dropout=True).cuda()
cel = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(10):
  model.train()
  avg_loss = 0
  for input, label in train_loader:
    logits = model(input.cuda())
    loss = cel(logits, label.cuda())
    avg_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f'Epoch {epoch+1}, Train Loss {avg_loss / len(train_loader)}')

  model.eval()
  acc, count = 0, 0
  for input, label in valid_loader:
    with torch.no_grad():
      logits = model(input.cuda())
      _, preds = torch.max(logits, dim=1)
      acc += (preds.cpu().data == label).sum().item()
      count += float(input.size(0))
  acc /= count
  print(f'Epoch {epoch+1}, Valid Acc {acc}')

print('='*50)
print(f'Last Validation Accuracy: {acc}')

Epoch 1, Train Loss 0.6937321649955006
Epoch 1, Valid Acc 0.5785649409627611
Epoch 2, Train Loss 0.6566966125246291
Epoch 2, Valid Acc 0.6012715712988193
Epoch 3, Train Loss 0.5758043701729078
Epoch 3, Valid Acc 0.6021798365122616
Epoch 4, Train Loss 0.5019804891129112
Epoch 4, Valid Acc 0.6121707538601272
Epoch 5, Train Loss 0.43898056347048686
Epoch 5, Valid Acc 0.6158038147138964
Epoch 6, Train Loss 0.399179231696361
Epoch 6, Valid Acc 0.6521344232515894
Epoch 7, Train Loss 0.34490792578860613
Epoch 7, Valid Acc 0.6503178928247049
Epoch 8, Train Loss 0.2993847786375646
Epoch 8, Valid Acc 0.6512261580381471
Epoch 9, Train Loss 0.2646490029872078
Epoch 9, Valid Acc 0.6457765667574932
Epoch 10, Train Loss 0.2349852400549342
Epoch 10, Valid Acc 0.6403269754768393
Last Validation Accuracy: 0.6403269754768393


## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supservised pretrained models.

> **Problem 5.1 (bonus)** *(2 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to replace word embeddings in your model from 4.2. Report the model's accuracy on the dev data.

## (Answer)

**Problem 5.1.**
(See the code below.) Validation accuracy of LSTM using the pretrained embeddings after 10 epochs training is 70.75%. Leveraging the self-supervised pretrained information helps the downstream classification task.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2021-09-26 13:59:12--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-09-26 13:59:13--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-09-26 14:01:55 (5.08 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [None]:
### Problem 5.1 - (1) ###
import numpy as np

vocab = ['PAD', 'UNK']
embedding = np.zeros((400002, 300), dtype=np.float32)
embedding[1,:] = np.random.randn(300)  # random number for UNK
with open('./glove.6B.300d.txt') as f:
  for i, line in enumerate(f.readlines()):
    vocab.append(line.split(' ')[0])
    embedding[i+2] = np.array(line.split(' ')[1:], dtype=np.float32)
word2id = {word: id_ for id_, word in enumerate(vocab)}
print('Vocabulary size: {}'.format(len(vocab)))

Vocabulary size: 400002


In [None]:
### Problem 5.1 - (2) ###

class SSTDataset(torch.utils.data.Dataset):
  def __init__(self, split, length=16):
    assert split in ['train', 'validation', 'test']
    self._data = sst_dataset[split]
    self.input_lst, self.label_lst = [], []
    for data in self._data:
      sentence = data['sentence']      
      tokens = sentence.split(' ')
      input_ids = [word2id[word] if word in word2id else 1 for word in tokens] # UNK if word not found
      if len(input_ids) < length:
        input_ids = input_ids + [0] * (length - len(input_ids)) # PAD tokens at the end
      else:
        input_ids = input_ids[:length]
      self.input_lst.append(torch.tensor(embedding[input_ids])) # use the pretrained embedding

      label = round(data['label'])
      self.label_lst.append(label)

  def __getitem__(self, idx):
    return self.input_lst[idx], self.label_lst[idx]

  def __len__(self):
    return len(self._data)

trainset = SSTDataset('train')
validset = SSTDataset('validation')
train_loader = torch.utils.data.DataLoader(trainset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(validset, batch_size=16, shuffle=False)

In [None]:
### Problem 5.1 - (3) ###

lstm = LSTM(seq_length=16, hidden_dim=128, embed_dim=300, dropout=True, pretrained=True).cuda()
cel = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lstm.parameters(), lr=0.01)

for epoch in range(10):
  lstm.train()
  avg_loss = 0
  for input, label in train_loader:
    logits = lstm(input.cuda())
    loss = cel(logits, label.cuda())
    avg_loss += loss.item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
  print(f'Epoch {epoch+1}, Train Loss {avg_loss / len(train_loader)}')

  lstm.eval()
  acc, count = 0, 0
  for input, label in valid_loader:
    with torch.no_grad():
      logits = lstm(input.cuda())
      _, preds = torch.max(logits, dim=1)
      acc += (preds.cpu().data == label).sum().item()
      count += float(input.size(0))
  acc /= count
  print(f'Epoch {epoch+1}, Valid Acc {acc}')

print('='*50)
print(f'Last Validation Accuracy: {acc}')

Epoch 1, Train Loss 0.6598808100161034
Epoch 1, Valid Acc 0.7129881925522252
Epoch 2, Train Loss 0.5778945932506622
Epoch 2, Valid Acc 0.706630336058129
Epoch 3, Train Loss 0.5275646816757734
Epoch 3, Valid Acc 0.7211625794732062
Epoch 4, Train Loss 0.489664972470271
Epoch 4, Valid Acc 0.7184377838328792
Epoch 5, Train Loss 0.46001328106564976
Epoch 5, Valid Acc 0.7157129881925522
Epoch 6, Train Loss 0.4412068678728873
Epoch 6, Valid Acc 0.7247956403269755
Epoch 7, Train Loss 0.4032142126744383
Epoch 7, Valid Acc 0.7293369663941871
Epoch 8, Train Loss 0.39443781687302537
Epoch 8, Valid Acc 0.7229791099000908
Epoch 9, Train Loss 0.3683058825994699
Epoch 9, Valid Acc 0.7120799273387829
Epoch 10, Train Loss 0.3465916405288914
Epoch 10, Valid Acc 0.7075386012715713
Last Validation Accuracy: 0.7075386012715713
