<a href="https://colab.research.google.com/github/woo-choi/kaist-ai605/blob/main/KAIST_AI605_Assignment_1_%EC%B5%9C%EC%9A%B0%EC%9A%A9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KAIST AI605 Assignment 1: Text Classification with RNNs
Authors: Hyeong-Gwon Hong (honggudrnjs@kaist.ac.kr) and Minjoon Seo (minjoon@kaist.ac.kr)

**Due Date:** March 31 (Wed) 11:00pm, 2021

## Assignment Objectives
- Verify theoretically and empirically why gating mechanism (LSTM, GRU) helps in Recurrent Neural Networks (RNNs)
- Design an LSTM-based text classification model from scratch using PyTorch.
- Apply the classification model to a popular classification task, Stanford Sentiment Treebank v2 (SST-2).
- Achieve higher accuracy by applying common machine learning strategies, including Dropout.
- Utilize pretrained word embedding (e.g. GloVe) to leverage self-supervision over a large text corpus.
- (Bonus) Use Hugging Face library (`transformers`) to leverage self-supervision via large language models.

## Your Submission
Your submission will be a link to a CoLab notebook that has all written answers and is fully executable. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own.

## Grading
The entire assignment is out of 100 points. There are two bonus questions with 10 points each. Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which is already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Limitations of Vanilla RNNs
In Lecture 04 and 05, we saw how RNNs suffer from exploding or vanishing gradients. We mathematically showed that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and really big otherwise.

**Problem 1.1** *(10 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

Exploding gradient happens when a large gradient propagates through unrolled RNN and get multiplied each step, resulting a very large(in magnitude) error derivative. So, limiting the gradient to a fixed range, i.e. gradient clipping, should prevent error derivative to grow and stabilize the network while training.

**Problem 1.2** *(10 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 04 and 05 slides for the definition of LSTM.

## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank v2, a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST-2 via GLUE
General Language Understanding Evaluation (GLUE) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing natural language understanding (NLU) tasks. See GLUE website (https://gluebenchmark.com/) and the GLUE paper (https://openreview.net/pdf?id=rJ4km2R5t7) for more details. GLUE provides an easy way to access the datasets, including SST-2.
You can download SST-2 dataset by following the steps below:

1. Clone GitHub repository:

In [7]:
!git clone https://github.com/nyu-mll/GLUE-baselines.git

Cloning into 'GLUE-baselines'...
remote: Enumerating objects: 5, done.[K
remote: Counting objects: 100% (5/5), done.[K
remote: Compressing objects: 100% (5/5), done.[K
remote: Total 891 (delta 1), reused 2 (delta 0), pack-reused 886[K
Receiving objects: 100% (891/891), 1.48 MiB | 20.17 MiB/s, done.
Resolving deltas: 100% (610/610), done.


In [None]:
%ls -al

total 20
drwxr-xr-x 1 root root 4096 Mar 30 18:14 [0m[01;34m.[0m/
drwxr-xr-x 1 root root 4096 Mar 30 18:11 [01;34m..[0m/
drwxr-xr-x 4 root root 4096 Mar 25 13:38 [01;34m.config[0m/
drwxr-xr-x 4 root root 4096 Mar 30 18:14 [01;34mGLUE-baselines[0m/
drwxr-xr-x 1 root root 4096 Mar 25 13:38 [01;34msample_data[0m/


2. Download SST-2 only:

In [8]:
%cd GLUE-baselines/
%ls -al
!python download_glue_data.py --data_dir glue_data --tasks SST
%ls -al

/content/GLUE-baselines
total 36
drwxr-xr-x 4 root root 4096 Mar 31 04:20 [0m[01;34m.[0m/
drwxr-xr-x 1 root root 4096 Mar 31 04:20 [01;34m..[0m/
-rw-r--r-- 1 root root 6743 Mar 31 04:20 download_glue_data.py
-rw-r--r-- 1 root root  176 Mar 31 04:20 environment.yml
drwxr-xr-x 8 root root 4096 Mar 31 04:20 [01;34m.git[0m/
-rw-r--r-- 1 root root   18 Mar 31 04:20 .gitignore
-rw-r--r-- 1 root root 3735 Mar 31 04:20 README.md
drwxr-xr-x 3 root root 4096 Mar 31 04:20 [01;34msrc[0m/
Downloading and extracting SST...
	Completed!
total 40
drwxr-xr-x 5 root root 4096 Mar 31 04:20 [0m[01;34m.[0m/
drwxr-xr-x 1 root root 4096 Mar 31 04:20 [01;34m..[0m/
-rw-r--r-- 1 root root 6743 Mar 31 04:20 download_glue_data.py
-rw-r--r-- 1 root root  176 Mar 31 04:20 environment.yml
drwxr-xr-x 8 root root 4096 Mar 31 04:20 [01;34m.git[0m/
-rw-r--r-- 1 root root   18 Mar 31 04:20 .gitignore
drwxr-xr-x 3 root root 4096 Mar 31 04:20 [01;34mglue_data[0m/
-rw-r--r-- 1 root root 3735 Mar 31 04:20 RE

In [None]:
%ls glue_data/SST-2/

dev.tsv  [0m[01;34moriginal[0m/  test.tsv  train.tsv


Your training, dev, and test data can be found at `glue_data/SST-2`. Note that each file is in a tsv format, where the first column is the sentence and te second column is the label (either 0 or 1, where 1 means positive review). 

In [None]:
!head -10 glue_data/SST-2/train.tsv

sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 	0
that 's far too tragic to merit such superficial treatment 	0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 	1
of saucy 	1
a depressed fifteen-year-old 's suicidal poetry 	0


**Problem 2.1** *(10 points)* Using space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.

In [None]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [None]:
# Constructing vocabulary with `UNK`
vocab = ['UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

['UNK', 'Hello', 'world!']
1


In [None]:
vocab_1 = set()
for fname in ['glue_data/SST-2/train.tsv', 'glue_data/SST-2/dev.tsv']:
  with open(fname) as f:
    f.readline()  # skip headeer
    for line in f:
      parts = line.split('\t')
      tokens = parts[0].strip().split()
      vocab_1.update(tokens)

vocab_1 = ['UNK'] + sorted(vocab_1)
word2id_1 = {word: id_ for id_, word in enumerate(vocab_1)}

print(vocab_1)
print(f"vocab_1 size = {len(vocab_1)}")

vocab_1 size = 15757


**Problem 2.2** *(10 points)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

In [None]:
token_counter = dict()
for fname in ['glue_data/SST-2/train.tsv', 'glue_data/SST-2/dev.tsv']:
  with open(fname) as f:
    print(f"{fname}: header {f.readline()}")  # skip headeer
    for line in f:
      parts = line.split('\t')
      tokens = parts[0].strip().split()
      for t in tokens:
        if t in token_counter:
          token_counter[t] += 1
        else:
          token_counter[t] = 1

vocab_2 = [ t for t, c in token_counter.items() if c > 1]

vocab_2 = ['UNK'] + sorted(vocab_2)
word2id_2 = {word: id_ for id_, word in enumerate(vocab_2)}

print(vocab_2)
print(f"vocab_2 size = {len(vocab_2)}")
print(f"size change = {len(vocab_1) - len(vocab_2)}")

glue_data/SST-2/train.tsv: header sentence	label

glue_data/SST-2/dev.tsv: header sentence	label

vocab_2 size = 14381
size change = 1376


## 3. Text Classification Baselines

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baseline is to go through one layer of neural network and then average the outputs, and finally classify the average embedding: 

In [None]:
from torch import nn

input_ = "hi world!"
input_tokens = input_.split(' ')
input_ids = [word2id[word] if word in word2id else 0 for word in input_tokens]
input_tensor = torch.LongTensor([input_ids]) # the first dimension is batch size
print(input_tensor)

tensor([[0, 2]])


In [None]:
# One layer, average pooling and classification
class Baseline(nn.Module):
  def __init__(self, d):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor)
    out = self.relu(self.layer(emb))
    avg = out.mean(1)
    logits = self.class_layer(avg)
    return logits

d = 3 # usually bigger, e.g. 128
baseline = Baseline(d)
logits = baseline(input_tensor)
softmax = nn.Softmax(1)
print(logits)
print(softmax(logits)) # probability for each class

tensor([[0.3106, 0.2626]], grad_fn=<AddmmBackward>)
tensor([[0.5120, 0.4880]], grad_fn=<SoftmaxBackward>)


Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`), which in fact turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the probability distribution and a one-hot distribution of the target label (note that we use `logits` instead of `softmax(logits)` as the input to the cross entropy, which allow us to avoid numerical instability). 

In [None]:
cel = nn.CrossEntropyLoss()
label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
loss = cel(logits, label) # Loss, a.k.a L
print(loss)

tensor(0.7174, grad_fn=<NllLossBackward>)


Once we have the loss defined, only one step remains! We compute the gradients of parameters with respective to the loss and update. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically a Stochastic Gradient Descent (SGD) with minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [None]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad() # reset process
loss.backward() # compute gradients
optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [None]:
print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

tensor([[-0.0170, -0.0322,  0.0076],
        [ 0.0000,  0.0000,  0.0000],
        [-0.0711, -0.0577,  0.1934]])


**Problem 3.1** *(10 points)* Properly train this average-pooling baseline model on SST-2 and report the model's accuracy on the dev data.

In [None]:
class SST2Dataset(torch.utils.data.Dataset):
  def __init__(self, fname, vocab):
    self.fname = fname
    self.vocab = vocab
    if vocab[0] != 'UNK':
      self.vocab = ['UNK'] + sorted(vocab)
    self.w2i = {w: i for i, w in enumerate(self.vocab)}
    
    self.X = list()
    self.y = list()
    with open(self.fname) as f:
      f.readline()  # skip headeer
      for line in f:
        parts = line.split('\t')
        tokens = parts[0].strip().split()
        self.X.append([self.w2i[w] if w in self.w2i else 0 for w in tokens])
        self.y.append(int(parts[1]))
      
      self.max_len = max(len(x) for x in self.X)
      self.X = [x + [0] * (self.max_len - len(x)) for x in self.X]

  def __len__(self):
    return len(self.y)

  def __getitem__(self, index):
    return torch.LongTensor(self.X[index]), self.y[index]


In [None]:
class Baseline(nn.Module):
  def __init__(self, d, vocab):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor)
    out = self.relu(self.layer(emb))
    avg = out.mean(1)
    logits = self.class_layer(avg)
    return logits

In [None]:
batch_size = 16
epoch = 10
lr = 0.1

train_dataset = SST2Dataset('glue_data/SST-2/train.tsv', vocab_2)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = SST2Dataset('glue_data/SST-2/dev.tsv', vocab_2)
test_size = len(test_dataset)

baseline_model = Baseline(128, vocab_2)
softmax = nn.Softmax(1)

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(baseline_model.parameters(), lr=lr)
for i in range(epoch):
  for X, y in train_dataloader:
    logits = baseline_model(X)
    y_pred = softmax(logits)
    loss = loss_func(y_pred, y)

    optimizer.zero_grad() # reset process
    loss.backward() # compute gradients
    optimizer.step() # update parameters

  accuracy = 0.0
  with torch.no_grad():
    X = torch.LongTensor(test_dataset.X)
    y = torch.LongTensor(test_dataset.y)
    logits = baseline_model(X)

    correct = (logits.argmax(1) == y).type(torch.float).sum().item()
    
    accuracy = correct / test_size
  
  print(f"epoch={i}, lr={lr}, loss={loss}, accuracy={(100 * accuracy):>0.2f}%")

epoch=0, lr=0.1, loss=0.733130931854248, accuracy=59.52%
epoch=1, lr=0.1, loss=0.6045098304748535, accuracy=64.11%
epoch=2, lr=0.1, loss=0.6751127243041992, accuracy=67.43%
epoch=3, lr=0.1, loss=0.7138670086860657, accuracy=70.18%
epoch=4, lr=0.1, loss=0.5974430441856384, accuracy=72.02%
epoch=5, lr=0.1, loss=0.3969469666481018, accuracy=73.62%
epoch=6, lr=0.1, loss=0.58131343126297, accuracy=75.11%
epoch=7, lr=0.1, loss=0.36984890699386597, accuracy=75.23%
epoch=8, lr=0.1, loss=0.45373639464378357, accuracy=77.06%
epoch=9, lr=0.1, loss=0.476972758769989, accuracy=77.06%


**Problem 3.2** *(10 points)* Implement a recurrent neural network (without using PyTorch's RNN module) where the output of the linear layer not only depends on the current input but also the previous output. Report the model's accuracy on the dev data. Is it better or worse than the baseline? Why?

In [None]:
class MyRNN(nn.Module):
  def __init__(self, d, vocab, h):
    super(MyRNN, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.h = h
    self.U = nn.Linear(d, h, bias=True)
    self.V = nn.Linear(h, h, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    hiddens = list()
    for words in input_tensor:
      h = torch.zeros(self.h)
      for w in words:
        if w == 0:
          # end-of-sequence
          break
        x = self.embedding(w)
        h = self.relu(self.U(x) + self.V(h))

      hiddens.append(h) # hidden state at last time step

    logits = self.class_layer(torch.stack(hiddens))
    return logits

In [None]:
batch_size = 16
epoch = 10
lr = 0.1

train_dataset = SST2Dataset('glue_data/SST-2/train.tsv', vocab_2)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = SST2Dataset('glue_data/SST-2/dev.tsv', vocab_2)
test_size = len(test_dataset)

rnn = MyRNN(64, vocab_2, 64)
softmax = nn.Softmax(1)

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
for i in range(epoch):
  for X, y in train_dataloader:
    logits = rnn(X)
    y_pred = softmax(logits)
    loss = loss_func(y_pred, y)

    optimizer.zero_grad() # reset process
    loss.backward() # compute gradients
    optimizer.step() # update parameters

  accuracy = 0.0
  with torch.no_grad():
    X = torch.LongTensor(test_dataset.X)
    y = torch.LongTensor(test_dataset.y)
    logits = rnn(X)

    correct = (logits.argmax(1) == y).type(torch.float).sum().item()
    
    accuracy = correct / test_size
  
  print(f"epoch={i}, lr={lr}, loss={loss}, accuracy={(100 * accuracy):>0.2f}%")

epoch=0, lr=0.1, loss=0.6140936017036438, accuracy=56.77%
epoch=1, lr=0.1, loss=0.6850536465644836, accuracy=59.17%
epoch=2, lr=0.1, loss=0.5253406763076782, accuracy=61.93%
epoch=3, lr=0.1, loss=0.8947893977165222, accuracy=63.99%
epoch=4, lr=0.1, loss=0.6624628901481628, accuracy=67.09%
epoch=5, lr=0.1, loss=0.5647495985031128, accuracy=67.66%
epoch=6, lr=0.1, loss=0.5241442918777466, accuracy=65.25%
epoch=7, lr=0.1, loss=0.6072881817817688, accuracy=63.53%
epoch=8, lr=0.1, loss=0.6026771068572998, accuracy=63.76%
epoch=9, lr=0.1, loss=nan, accuracy=49.08%


**Problem 3.3 (bonus)** *(10 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

## 4. Text Classification with LSTM and Dropout

Now it is time to improve your baselines! Replace your RNN module with an LSTM module. See Lecture slides 04 and 05 for the formal definition of LSTMs. 

You will also use Dropout, which randomly makes each dimension zero with the probability of `p` and scale it by `1/(1-p)` if it is not zero during training. Put it either at the input or the output of the LSTM to prevent it from overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.0000, 0.0000, 1.0000, 1.4000, 0.0000])


**Problem 4.1** *(20 points)* Use LSTM instead of vanilla RNN to improve your model. Report the accuracy on the dev data.


In [None]:
class TorchRNN(nn.Module):
  def __init__(self, d, vocab, h):
    super(TorchRNN, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.rnn = nn.LSTM(d, h, batch_first=True)
    self.class_layer = nn.Linear(h, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor)
    out, (h_n, c_n) = self.rnn(emb)
    # out.shape = (batch_size, seq_len, features)
    avg = out.mean(1) # avg.shape = (batch_size, features)
    logits = self.class_layer(avg)
    return logits

In [None]:
batch_size = 16
epoch = 10
lr = 0.1

train_dataset = SST2Dataset('glue_data/SST-2/train.tsv', vocab_2)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = SST2Dataset('glue_data/SST-2/dev.tsv', vocab_2)
test_size = len(test_dataset)

rnn = TorchRNN(128, vocab_2, 64)
softmax = nn.Softmax(1)

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
for i in range(epoch):
  for X, y in train_dataloader:
    logits = rnn(X)
    y_pred = softmax(logits)
    loss = loss_func(y_pred, y)

    optimizer.zero_grad() # reset process
    loss.backward() # compute gradients
    optimizer.step() # update parameters

  accuracy = 0.0
  with torch.no_grad():
    X = torch.LongTensor(test_dataset.X)
    y = torch.LongTensor(test_dataset.y)
    logits = rnn(X)

    correct = (logits.argmax(1) == y).type(torch.float).sum().item()
    
    accuracy = correct / test_size
  
  print(f"epoch={i}, lr={lr}, loss={loss}, accuracy={(100 * accuracy):>0.2f}%")

epoch=0, lr=0.1, loss=0.6736119389533997, accuracy=50.11%
epoch=1, lr=0.1, loss=0.5504904389381409, accuracy=63.88%
epoch=2, lr=0.1, loss=0.702921986579895, accuracy=67.89%
epoch=3, lr=0.1, loss=0.5247427225112915, accuracy=72.25%
epoch=4, lr=0.1, loss=0.3166922330856323, accuracy=76.15%
epoch=5, lr=0.1, loss=0.4449685215950012, accuracy=74.31%
epoch=6, lr=0.1, loss=0.313717782497406, accuracy=75.92%
epoch=7, lr=0.1, loss=0.5156047940254211, accuracy=75.46%
epoch=8, lr=0.1, loss=0.34010523557662964, accuracy=78.10%
epoch=9, lr=0.1, loss=0.3634919226169586, accuracy=78.10%


**Problem 4.2** *(10 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data and briefly describe how it differs from 4.1.

In [None]:
class TorchRNNDropout(nn.Module):
  def __init__(self, d, vocab, h, p):
    super(TorchRNNDropout, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.dropout = nn.Dropout(p)
    self.rnn = nn.LSTM(d, h, batch_first=True)
    self.class_layer = nn.Linear(h, 2, bias=True)

  def forward(self, input_tensor, in_training=False):
    emb = self.embedding(input_tensor)
    if in_training:
      emb = self.dropout(emb)
    out, (h_n, c_n) = self.rnn(emb)
    # out.shape = (batch_size, seq_len, features)
    avg = out.mean(1) # avg.shape = (batch_size, features)
    logits = self.class_layer(avg)
    return logits

In [None]:
%cd /content/GLUE-baselines/
batch_size = 16
epoch = 10
lr = 0.1
p = 0.5

train_dataset = SST2Dataset('glue_data/SST-2/train.tsv', vocab_2)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = SST2Dataset('glue_data/SST-2/dev.tsv', vocab_2)
test_size = len(test_dataset)

rnn = TorchRNNDropout(128, vocab_2, 64, p)
softmax = nn.Softmax(1)

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
for i in range(epoch):
  for X, y in train_dataloader:
    logits = rnn(X, True)
    y_pred = softmax(logits)
    loss = loss_func(y_pred, y)

    optimizer.zero_grad() # reset process
    loss.backward() # compute gradients
    optimizer.step() # update parameters

  accuracy = 0.0
  with torch.no_grad():
    X = torch.LongTensor(test_dataset.X)
    y = torch.LongTensor(test_dataset.y)
    logits = rnn(X)

    correct = (logits.argmax(1) == y).type(torch.float).sum().item()
    
    accuracy = correct / test_size
  
  print(f"epoch={i}, lr={lr}, loss={loss}, accuracy={(100 * accuracy):>0.2f}%")

/content/GLUE-baselines
epoch=0, lr=0.1, loss=0.6947087049484253, accuracy=51.03%
epoch=1, lr=0.1, loss=0.6960293054580688, accuracy=52.98%
epoch=2, lr=0.1, loss=0.5357664823532104, accuracy=62.39%
epoch=3, lr=0.1, loss=0.5676203370094299, accuracy=64.22%
epoch=4, lr=0.1, loss=0.7327960729598999, accuracy=66.40%
epoch=5, lr=0.1, loss=0.8588662147521973, accuracy=68.92%
epoch=6, lr=0.1, loss=0.7106691598892212, accuracy=70.76%
epoch=7, lr=0.1, loss=0.732146143913269, accuracy=70.30%
epoch=8, lr=0.1, loss=0.42834463715553284, accuracy=72.48%
epoch=9, lr=0.1, loss=0.4111449122428894, accuracy=72.82%


## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST-2 training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supservised pretrained models.

**Problem 5.1** *(10 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to further improve your model from 4.2. Report the model's accuracy on the dev data.

In [None]:
%mkdir -p /content/GLOVE
%cd /content/GLOVE
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

/content/GLOVE
--2021-03-31 01:13:22--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-03-31 01:13:23--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-03-31 01:13:23--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6

In [None]:
word2vec = dict()
with open('glove.6B.50d.txt') as f:
  for line in f:
    parts = line.strip().split()
    word = parts[0]
    vector = [float(elem) for elem in parts[1:]]
    word2vec[word] = vector

word2vec['UNK'] = [.0, ] * 50

print("word2vec", len(word2vec))

word2vec 400001


In [None]:
class SST2DatasetWord2Vec(torch.utils.data.Dataset):
  def __init__(self, fname, word2vec):
    self.fname = fname
    self.w2v = word2vec
    
    self.X = list()
    self.y = list()
    with open(self.fname) as f:
      f.readline()  # skip headeer
      for line in f:
        parts = line.split('\t')
        tokens = parts[0].strip().split()
        self.X.append([self.w2v[w] if w in self.w2v else self.w2v['UNK'] for w in tokens])
        self.y.append(int(parts[1]))
      
      self.max_len = max(len(x) for x in self.X)
      self.X = [x + [self.w2v['UNK']] * (self.max_len - len(x)) for x in self.X]

  def __len__(self):
    return len(self.y)

  def __getitem__(self, index):
    return torch.FloatTensor(self.X[index]), self.y[index]

In [10]:
from torch import nn
class TorchRNNDropoutW2V(nn.Module):
  def __init__(self, d, h, p):
    super(TorchRNNDropoutW2V, self).__init__()
    self.dropout = nn.Dropout(p)
    self.rnn = nn.LSTM(d, h, batch_first=True)
    self.class_layer = nn.Linear(h, 2, bias=True)

  def forward(self, input_tensor, in_training=False):
    emb = input_tensor
    if in_training:
      emb = self.dropout(input_tensor)
    out, (h_n, c_n) = self.rnn(emb)
    # out.shape = (batch_size, seq_len, features)
    avg = out.mean(1) # avg.shape = (batch_size, features)
    logits = self.class_layer(avg)
    return logits

In [None]:
%cd /content/GLUE-baselines/
batch_size = 16
epoch = 10
lr = 0.1
p = 0.5

train_dataset = SST2DatasetWord2Vec('glue_data/SST-2/train.tsv', word2vec)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = SST2DatasetWord2Vec('glue_data/SST-2/dev.tsv', word2vec)
test_size = len(test_dataset)

rnn = TorchRNNDropoutW2V(50, 64, p)
softmax = nn.Softmax(1)

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
for i in range(epoch):
  for X, y in train_dataloader:
    logits = rnn(X, True)
    y_pred = softmax(logits)
    loss = loss_func(y_pred, y)

    optimizer.zero_grad() # reset process
    loss.backward() # compute gradients
    optimizer.step() # update parameters

  accuracy = 0.0
  with torch.no_grad():
    X = torch.FloatTensor(test_dataset.X)
    y = torch.FloatTensor(test_dataset.y)
    logits = rnn(X)

    correct = (logits.argmax(1) == y).type(torch.float).sum().item()
    
    accuracy = correct / test_size
  
  print(f"epoch={i}, lr={lr}, loss={loss}, accuracy={(100 * accuracy):>0.2f}%")

/content/GLUE-baselines
epoch=0, lr=0.1, loss=0.635400652885437, accuracy=62.61%
epoch=1, lr=0.1, loss=0.8506923913955688, accuracy=57.68%
epoch=2, lr=0.1, loss=1.1131291389465332, accuracy=50.92%
epoch=3, lr=0.1, loss=0.7132390737533569, accuracy=50.92%
epoch=4, lr=0.1, loss=0.7131808996200562, accuracy=50.92%
epoch=5, lr=0.1, loss=0.6581298112869263, accuracy=68.23%
epoch=6, lr=0.1, loss=0.7804511785507202, accuracy=52.52%
epoch=7, lr=0.1, loss=0.4705461859703064, accuracy=65.37%
epoch=8, lr=0.1, loss=0.5670986175537109, accuracy=65.25%
epoch=9, lr=0.1, loss=0.46743375062942505, accuracy=68.92%


**Problem 5.2 (bonus)** *(10 points)* You can go one step further by using word vectors obtained from pretrained language models. Can you import the word embeddings from `bert-base-uncased` model (via Hugging Face's `transformers`: https://huggingface.co/transformers/pretrained_models.html) into your model and improve it further? Report the accuracy on the dev data here. If the score is now higher, explain where the improvement is coming from.

In [4]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
[K     |████████████████████████████████| 2.0MB 8.3MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
[K     |████████████████████████████████| 3.2MB 41.3MB/s 
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
[K     |████████████████████████████████| 890kB 58.3MB/s 
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=86d4bad054b

In [5]:
import torch
class SST2DatasetBERT(torch.utils.data.Dataset):
  def __init__(self, fname):
    from transformers import BertTokenizer, BertModel

    model_name = "bert-base-uncased"
    self.tokenizer = BertTokenizer.from_pretrained(model_name)
    self.model = BertModel.from_pretrained(model_name)
    self.fname = fname
    
    self.inputs = list()
    self.y = list()
    with open(self.fname) as f:
      f.readline()  # skip headeer
      for line in f:
        parts = line.strip().split('\t')
        self.inputs.append(parts[0])
        self.y.append(int(parts[1]))

  def __len__(self):
    return len(self.y)

  def __getitem__(self, index):
    input = self.inputs[index]
    input_ids = self.tokenizer.encode(input, truncation=True, padding='max_length', max_length=60, return_tensors="pt")
    outputs = self.model(input_ids)
    return outputs[0][0], self.y[index]

In [None]:
%cd /content/GLUE-baselines/
batch_size = 160
epoch = 10
lr = 0.1
p = 0.5

train_dataset = SST2DatasetBERT('glue_data/SST-2/train.tsv')
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = SST2DatasetBERT('glue_data/SST-2/dev.tsv')
test_size = len(test_dataset)

rnn = TorchRNNDropoutW2V(768, 64, p)
softmax = nn.Softmax(1)

loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=lr)
for i in range(epoch):
  for X, y in train_dataloader:
    logits = rnn(X, True)
    y_pred = softmax(logits)
    loss = loss_func(y_pred, y)

    optimizer.zero_grad() # reset process
    loss.backward() # compute gradients
    optimizer.step() # update parameters

  accuracy = 0.0
  with torch.no_grad():
    X = torch.FloatTensor(test_dataset.X)
    y = torch.FloatTensor(test_dataset.y)
    logits = rnn(X)

    correct = (logits.argmax(1) == y).type(torch.float).sum().item()
    
    accuracy = correct / test_size
  
  print(f"epoch={i}, lr={lr}, loss={loss}, accuracy={(100 * accuracy):>0.2f}%")

/content/GLUE-baselines
