In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import torch
print(torch.cuda.is_available())

True


# KAIST AI605 Assignment 1: Text Classification

## Environment
You will use only Python 3.7 and PyTorch 1.9, which are already available on Colab:

In [2]:
from platform import python_version
from pprint import pprint
from collections import Counter
import re

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(f"python {python_version()}")
print(f"torch {torch.__version__}")

python 3.7.11
torch 1.9.0


## 1. Limitations of Vanilla RNNs
In Lecture 02, we saw that a multi-layer perceptron (MLP) without an activation function is equivalent to a single linear transformation with respect to the inputs. One can define a vanilla recurrent neural network without activation as follows: given inputs $\textbf{x}_1 \dots \textbf{x}_T$, the outputs $\textbf{h}_t$ are obtained by
$$\textbf{h}_t = \textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b},$$
where $\textbf{V}, \textbf{U}, \textbf{b}$ are trainable weights.

> **Problem 1.1** *(2 points)* Show that such a recurrent neural network (RNN) without an activation function is equivalent to a single linear transformation with respect to the inputs, which means each $\textbf{h}_t$ is a linear combination of the inputs.
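
As an illustrative numerical check (a sketch, not a proof), the scalar case of the recurrence can be unrolled and compared with the closed form $h_T = \sum_{k=1}^{T} v^{T-k}(u x_k + b)$, assuming $h_0 = 0$. The helpers `run_linear_rnn` and `closed_form` and the toy numbers are hypothetical and are not part of the assignment.

```python
def run_linear_rnn(xs, v, u, b, h0=0.0):
    """Apply h_t = v * h_{t-1} + u * x_t + b step by step (scalar case)."""
    h = h0
    for x in xs:
        h = v * h + u * x + b
    return h


def closed_form(xs, v, u, b):
    """Unrolled form with h_0 = 0: h_T = sum_k v^(T-k) * (u * x_k + b)."""
    T = len(xs)
    return sum(v ** (T - k) * (u * x + b) for k, x in enumerate(xs, start=1))


xs = [0.5, -1.0, 2.0, 0.25]  # made-up inputs
v, u, b = 0.9, 1.5, 0.1      # made-up weights
print(run_linear_rnn(xs, v, u, b), closed_form(xs, v, u, b))
```

With a non-zero bias the map is affine in the inputs; setting `b = 0` makes it strictly linear, so scaling or adding input sequences scales or adds the outputs.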

In Lectures 05 and 06, we will see how RNNs can model non-linearity via an activation function, but they still suffer from exploding or vanishing gradients. We can mathematically show that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\textbf{V}$ is smaller than 1, and can become very large if the norm is larger than 1.
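
To get a feel for the scale of this effect, here is a minimal scalar sketch in which a single number `v` stands in for the norm of $\textbf{V}$ and the $\sigma'$ factor is ignored. The function `repeated_factor` and the chosen values are assumptions for illustration only.

```python
def repeated_factor(v, T):
    """Scalar stand-in for V^(T-1): the factor accumulated over T - 1 steps."""
    return v ** (T - 1)


T = 50  # made-up sequence length
for v in (0.9, 1.0, 1.1):
    print(f"v = {v}: v^(T-1) = {repeated_factor(v, T):.3e}")
```

For `T = 50`, the factor for `v = 0.9` falls well below 0.01, while the factor for `v = 1.1` grows beyond 100.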

> **Problem 1.2** *(2 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.
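
One common form of gradient clipping rescales the whole gradient vector when its L2 norm exceeds a threshold, keeping its direction. The sketch below shows that idea in plain Python; `clip_by_norm` is a hypothetical helper and the numbers are made up. In PyTorch, `torch.nn.utils.clip_grad_norm_` provides norm-based clipping and is typically called between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


exploded = [300.0, -400.0]  # made-up gradient with L2 norm 500
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)
```

Because only the length of the update is bounded, a single exploding step can no longer throw the parameters far away, but the update still points in the direction of the original gradient.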

> **Problem 1.3** *(2 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 05 and 06 slides for the definition of LSTM.
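
As a rough intuition, assume the standard LSTM cell update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. Along the direct cell-state path, and treating the gates as constants, the derivative of $c_t$ with respect to $c_{t-1}$ is just the forget gate $f_t$. The scalar sketch below multiplies these factors over time; `cell_path_gradient` is a hypothetical helper and the gate values are made up.

```python
def cell_path_gradient(forget_gates):
    """Product of f_t along the direct path c_1 -> c_T (other paths are ignored)."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad


T = 50  # made-up sequence length
print(cell_path_gradient([0.99] * (T - 1)))  # gates close to 1: the signal survives
print(cell_path_gradient([0.5] * (T - 1)))   # gates far from 1: the signal still decays
```

The point of the sketch is that the network can learn forget gates close to 1, in which case the factor stays above 0.6 even after 49 steps, whereas nothing comparable is available in the vanilla recurrence.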

## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use the Stanford Sentiment Treebank (SST), a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST via Hugging Face
We will use the `datasets` package offered by Hugging Face, which allows us to easily download various language datasets, including the Stanford Sentiment Treebank.

First, install the package:

In [3]:
try:
    import datasets
except ImportError:
    !pip install datasets

Then download SST and print the first example:

In [4]:
from datasets import load_dataset

sst = load_dataset("sst", "default")
pprint(sst["train"][0])

Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


{'label': 0.6944400072097778,
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' "
             "and that he 's going to make a splash even greater than Arnold "
             'Schwarzenegger , Jean-Claud Van Damme or Steven Segal .',
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}


Note that each `label` is a score between 0 and 1. You will round it to either 0 or 1 for binary classification (positive for 1, negative for 0).
In this first example, the label is rounded to 1, meaning that the sentence is a positive review.
You will only use `sentence` as the input; please ignore other values.

> **Problem 2.1** *(2 points)* Using a space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.
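
A minimal sketch of the idea on a short, made-up text: split on spaces, collect the unique tokens, reserve IDs for `PAD` and `UNK`, and map every unseen word to `UNK`. The `toy_` names and the `encode` helper are hypothetical and are used only to keep this example separate from the SST variables.

```python
toy_text = "the movie was good and the acting was good"  # made-up text

toy_tokens = sorted(set(toy_text.split(" ")))  # sorted for a reproducible ordering
toy_vocab = ["PAD", "UNK"] + toy_tokens
toy_word2id = {word: word_id for word_id, word in enumerate(toy_vocab)}
toy_unk_id = toy_word2id["UNK"]


def encode(sentence):
    """Map each space-separated word to its ID, falling back to UNK."""
    return [toy_word2id.get(word, toy_unk_id) for word in sentence.split(" ")]


print(len(toy_vocab))                     # 8: six distinct words plus PAD and UNK
print(encode("the acting was terrible"))  # [6, 2, 7, 1]; "terrible" is unseen
```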

In [5]:
# Space tokenization
words = [word for data in sst["train"] for word in data["sentence"].split(" ")]
tokens = sorted(list(set(words)))  # to ensure reproducibility (set doesn't guarantee order)
print(f"Number of tokens: {len(tokens)}")

# Constructing vocabulary with `UNK`
vocab = ["PAD", "UNK"] + tokens
word2id = {word: word_id for word_id, word in enumerate(vocab)}
print(f"Number of vocabs: {len(vocab)}")
print(f"ID of 'star': {word2id['star']}")

Number of tokens: 18280
Number of vocabs: 18282
ID of 'star': 15904


> **Problem 2.2** *(1 point)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

In [6]:
# Include words that occur at least 2 times
tokens = [token for token, count in Counter(words).items() if count >= 2]
print(f"Number of tokens: {len(tokens)}")

# Constructing vocabulary with `UNK`
vocab = ["PAD", "UNK"] + tokens
word2id = {word: word_id for word_id, word in enumerate(vocab)}
print(f"Number of vocabs: {len(vocab)}")
print(f"ID of 'star': {word2id['star']}")

Number of tokens: 8736
Number of vocabs: 8738
ID of 'star': 2308


## 3. Text Classification with Multi-Layer Perceptron and Recurrent Neural Network

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baselines is to fix the input length (with truncation or padding), flatten the word embeddings, apply a linear transformation followed by an activation, and finally classify the output into the two classes:

In [7]:
words = [word.lower() for data in sst["train"] for word in re.split(r"\W+", data["sentence"]) if word]
tokens = [token for token, count in Counter(words).items() if count >= 2]
vocab = ["pad", "unk"] + tokens
word2id = {word: word_id for word_id, word in enumerate(vocab)}

In [8]:
def tokenize(sentence, length):
    tokens = [word for word in re.split(r"\W+", sentence.lower()) if word]
    tokens = (tokens + ["pad"] * (length - len(tokens)))[:length]
    return tokens

def binarize(label):
    return 1 if label > 0.5 else 0

def preprocess(sentences, labels, length):
    x = [tokenize(sentence, length) for sentence in sentences]
    y = [binarize(label) for label in labels]
    return x, y

Let's see the examples:

In [9]:
length = 8
input_sentence = "What a nice day!"
input_tensor = torch.LongTensor([[
    word2id.get(token, 1) for token in tokenize(input_sentence, length)
]])  # the first dimension is minibatch size
print(f"{input_sentence} -> {input_tensor}")

What a nice day! -> tensor([[266,  18, 297, 591,   0,   0,   0,   0]])


In [10]:
class ModelBase(nn.Sequential):
    def __init__(self, hiddens, dim, embed=None):
        super().__init__(*(([] if embed is None else [nn.Embedding(len(embed), dim)])
                         + (hiddens if type(hiddens) is list else [hiddens])
                         + [nn.ReLU(), nn.Linear(dim, 2)]))

In [11]:
class MLPModel(ModelBase):
    def __init__(self, dim, length, embed=None):
        super().__init__([
            nn.Flatten(),
            nn.Linear(dim * length, dim),
        ], dim, embed)
        
torch.manual_seed(19)

baseline = MLPModel(dim=3, length=length, embed=vocab)  # dim is usually bigger, e.g. 128
logits = baseline(input_tensor)
print(f"softmax logits: {F.softmax(logits, dim=1).detach()}")  # probability for each class

softmax logits: tensor([[0.6985, 0.3015]])


Now we will compute the loss, which is the negative log probability that the model assigns to the target label (`1`) for the input text. This turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the predicted probability distribution and a one-hot distribution of the target label. Note that we pass `logits` instead of `softmax(logits)` as the input to the cross entropy, which allows us to avoid numerical instability.

In [12]:
ce = nn.CrossEntropyLoss()
label = torch.LongTensor([1])  # The ground truth label for "What a nice day!" is positive.
loss = ce(logits, label)  # Loss, a.k.a L
print(f"loss: {loss.detach():.6f}")

loss: 1.198835
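
As a quick sanity check of this equivalence, the sketch below computes the cross entropy against a one-hot target and the negative log probability of the target class, using the rounded probabilities `[0.6985, 0.3015]` printed above. The helper `cross_entropy` and the variable names are hypothetical and written in plain Python rather than PyTorch.

```python
import math


def cross_entropy(q, p):
    """H(q, p) = -sum_k q_k * log(p_k); terms with q_k = 0 contribute nothing."""
    return -sum(qk * math.log(pk) for qk, pk in zip(q, p) if qk > 0)


probs = [0.6985, 0.3015]  # rounded softmax output from the notebook
target = 1
one_hot = [1.0 if k == target else 0.0 for k in range(len(probs))]

ce_value = cross_entropy(one_hot, probs)
nll_value = -math.log(probs[target])
print(f"cross entropy: {ce_value:.4f}, negative log probability: {nll_value:.4f}")
```

Both values agree with each other and come out close to the loss of 1.198835 reported above; the small remaining difference is due to the probabilities being rounded to four digits.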


Once we have the loss defined, only one step remains! We compute the gradients of the loss with respect to the parameters and update them. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is essentially Stochastic Gradient Descent (SGD) with a minibatch size of 1. The recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [13]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad()  # reset accumulated gradients
loss.backward()  # compute gradients
optimizer.step()  # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain their gradients with respect to the loss.

In [14]:
print(baseline[2].weight.grad)

tensor([[-0.0000, -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.0000, -0.0000,  0.0000,
         -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.0000, -0.0000, -0.0000],
        [-0.0022, -0.0209, -0.0229,  0.0020, -0.0056, -0.0060,  0.0074,  0.0346,
          0.0035,  0.0151,  0.0065,  0.0169,  0.0153, -0.0066, -0.0034,  0.0153,
         -0.0066, -0.0034,  0.0153, -0.0066, -0.0034,  0.0153, -0.0066, -0.0034],
        [-0.0000, -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000, -0.0000, -0.0000,  0.0000,
         -0.0000, -0.0000,  0.0000, -0.0000, -0.0000,  0.0000, -0.0000, -0.0000]])
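
To make the update rule concrete, here is a plain-Python sketch of what a momentum-free SGD step does with those gradients: every parameter moves by `-lr * grad`, with the gradient averaged over the minibatch. The one-parameter model `y = w * x`, the toy data, and the helpers `sgd_step` and `minibatch_grad` are assumptions for illustration and are unrelated to the classifier above.

```python
def sgd_step(params, grads, lr):
    """One plain SGD update: p <- p - lr * g for every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]


def minibatch_grad(w, batch):
    """Gradient of the mean squared error of y = w * x, averaged over a minibatch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data generated from y = 2x
w = 0.0
for _ in range(50):
    (w,) = sgd_step([w], [minibatch_grad(w, batch)], lr=0.05)
print(round(w, 4))
```

After 50 updates `w` is very close to 2, the slope that generated the toy data.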


Now, define a helper class to train the model and a function to build the data loaders:

In [15]:
class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.criterion = nn.CrossEntropyLoss()
        
    def train_epoch(self, dataloader):
        self.model.train()
        moving_loss = None
        for x_batch, y_batch in dataloader:
            logits = self.model(x_batch)
            loss = self.criterion(logits, y_batch)
            moving_loss = (loss.detach() if moving_loss is None else
                           0.2 * moving_loss + 0.8 * loss.detach())
            self.optimizer.zero_grad()  # reset accumulated gradients
            loss.backward()  # compute gradients
            self.optimizer.step()  # update parameters
        return moving_loss
    
    def test_epoch(self, dataloader):
        with torch.no_grad():
            self.model.eval()
            correct, total = 0, 0
            for x_batch, y_batch in dataloader:
                logits = self.model(x_batch)
                y_pred = torch.argmax(logits, dim=1)
                correct += torch.sum(y_pred == y_batch).item()
                total += len(y_batch)
            return correct * 100 / total
        
    def train(self, train_loader, valid_loader, epochs=10, print_every=10):
        print(f"{self.model.__class__.__name__}")
        best_acc = 0.
        for i in range(epochs):
            train_loss = self.train_epoch(train_loader)
            if (i + 1) % print_every == 0:
                train_acc = self.test_epoch(train_loader)
                valid_acc = self.test_epoch(valid_loader)
                print(f"Epoch {i+1:3d}: Train Loss {train_loss:.6f}, "
                      f"Train Acc: {train_acc:.2f}, Valid Acc: {valid_acc:.2f}")
                best_acc = max(best_acc, valid_acc)
        print(f"Best Valid Acc: {best_acc:.2f}%")

In [16]:
def get_dataloader(x, y, x_trans, y_trans, batch_size, shuffle):
    dataset = TensorDataset(torch.vstack([x_trans(x_item) for x_item in x]),
                            torch.cat([y_trans(y_item) for y_item in y]))
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
    return dataloader

In [17]:
length = 32
batch_size = 128

unk_id = word2id["unk"]
x2long = lambda x_item: torch.LongTensor([word2id.get(token, unk_id) for token in x_item]).to(device)
y2long = lambda y_item: torch.LongTensor([y_item]).to(device)

sst_train, sst_valid = sst["train"], sst["validation"]

x_train, y_train = preprocess(sst_train["sentence"], sst_train["label"], length)
x_valid, y_valid = preprocess(sst_valid["sentence"], sst_valid["label"], length)

train_loader = get_dataloader(x_train, y_train, x2long, y2long, batch_size, shuffle=True)
valid_loader = get_dataloader(x_valid, y_valid, x2long, y2long, batch_size, shuffle=False)

> **Problem 3.1** *(2 points)* Properly train an MLP baseline model on SST and report the model's accuracy on the dev data.

In [18]:
torch.manual_seed(19)
model = MLPModel(dim=64, length=length, embed=vocab).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=7e-4),
).train(train_loader, valid_loader, epochs=100)

MLPModel
Epoch  10: Train Loss 0.329118, Train Acc: 89.95, Valid Acc: 56.86
Epoch  20: Train Loss 0.153379, Train Acc: 97.10, Valid Acc: 58.49
Epoch  30: Train Loss 0.083661, Train Acc: 99.18, Valid Acc: 59.22
Epoch  40: Train Loss 0.044256, Train Acc: 99.88, Valid Acc: 60.94
Epoch  50: Train Loss 0.043314, Train Acc: 99.95, Valid Acc: 63.40
Epoch  60: Train Loss 0.047403, Train Acc: 99.64, Valid Acc: 65.67
Epoch  70: Train Loss 0.032097, Train Acc: 99.82, Valid Acc: 64.58
Epoch  80: Train Loss 0.048647, Train Acc: 99.81, Valid Acc: 65.21
Epoch  90: Train Loss 0.014697, Train Acc: 99.96, Valid Acc: 67.39
Epoch 100: Train Loss 0.012125, Train Acc: 99.95, Valid Acc: 67.03
Best Valid Acc: 67.39%


> **Problem 3.2** *(2 points)* Implement a recurrent neural network (without using PyTorch's RNN module) where the output of the linear layer depends not only on the current input but also on the previous output. Report the model's accuracy on the dev data.

In [19]:
class RNNCell(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        self.V = nn.Linear(hidden_features, hidden_features)
        self.U = nn.Linear(in_features, hidden_features)
        self.tanh = nn.Tanh()
        
    def forward(self, h, x):
        h = self.tanh(self.V(h) + self.U(x))
        return h
    
class RNN(nn.Module):
    def __init__(self, in_features, hidden_features, num_layers):
        super().__init__()
        self.hidden_features = hidden_features
        self.layers = nn.ModuleList([
            RNNCell((in_features if i == 0 else hidden_features), hidden_features)
            for i in range(num_layers)
        ])
        self.W = nn.Linear(hidden_features, hidden_features)
            
    def forward(self, x):
        h = [torch.zeros((x.shape[0], self.hidden_features), device=x.device)
             for _ in range(len(self.layers))]
        for x in x.split(1, dim=1):
            x = x.squeeze(dim=1)
            for i, layer in enumerate(self.layers):
                h[i] = layer(h[i], h[i - 1] if i > 0 else x)
        o = self.W(h[-1])
        return o

class RNNModel(ModelBase):
    def __init__(self, dim, embed=None):
        super().__init__(RNN(dim, dim, 1), dim, embed)

In [20]:
torch.manual_seed(19)
model = RNNModel(dim=64, embed=vocab).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=7e-4),
).train(train_loader, valid_loader, epochs=100)

RNNModel
Epoch  10: Train Loss 0.676140, Train Acc: 52.39, Valid Acc: 49.68
Epoch  20: Train Loss 0.654097, Train Acc: 53.60, Valid Acc: 50.23
Epoch  30: Train Loss 0.643048, Train Acc: 54.05, Valid Acc: 49.14
Epoch  40: Train Loss 0.590394, Train Acc: 56.07, Valid Acc: 49.59
Epoch  50: Train Loss 0.584751, Train Acc: 61.19, Valid Acc: 51.50
Epoch  60: Train Loss 0.350945, Train Acc: 88.44, Valid Acc: 52.13
Epoch  70: Train Loss 0.097391, Train Acc: 97.93, Valid Acc: 53.59
Epoch  80: Train Loss 0.072805, Train Acc: 99.05, Valid Acc: 53.50
Epoch  90: Train Loss 0.013090, Train Acc: 99.65, Valid Acc: 54.59
Epoch 100: Train Loss 0.065586, Train Acc: 99.71, Valid Acc: 52.77
Best Valid Acc: 54.59%


> **Problem 3.3** *(2 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

> **Problem 3.4 (bonus)** *(1 point)* Why is it numerically unstable if you compute log on top of softmax?
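
The sketch below illustrates the two typical failure modes with extreme, made-up logits: `exp` overflows for large positive values, and a probability that underflows to zero makes `log` undefined. It also shows the usual log-sum-exp rewrite, which subtracts the maximum logit first. Both functions are hypothetical plain-Python helpers, not PyTorch's implementation.

```python
import math


def naive_log_softmax(zs):
    """log(softmax(z)) computed literally: exponentiate, normalize, then take the log."""
    exps = [math.exp(z) for z in zs]            # overflows for large z
    total = sum(exps)
    return [math.log(e / total) for e in exps]  # log(0) if a probability underflows


def stable_log_softmax(zs):
    """z_k - logsumexp(z), with the maximum subtracted before exponentiating."""
    m = max(zs)
    lse = m + math.log(sum(math.exp(z - m) for z in zs))
    return [z - lse for z in zs]


for example in ([1000.0, 0.0], [0.0, -1000.0]):
    try:
        naive = naive_log_softmax(example)
    except (OverflowError, ValueError) as err:
        naive = f"failed ({type(err).__name__})"
    print(example, "naive:", naive, "stable:", stable_log_softmax(example))
```

The naive version fails on both inputs (an `OverflowError` for the first, a `ValueError` for the second), while the stable version returns finite log probabilities in both cases.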

## 4. Text Classification with LSTM and Dropout

Replace your RNN module with an LSTM module. See Lecture slides 05 and 06 for the formal definition of LSTMs. 

You will also use Dropout, which during training randomly sets each dimension to zero with probability `p` and scales the remaining dimensions by `1/(1-p)`. Put it at either the input or the output of the LSTM to prevent overfitting.
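
A minimal plain-Python sketch of this behaviour, assuming `p < 1`: during training each entry is dropped with probability `p` and the survivors are scaled by `1/(1-p)`, while at evaluation time the input passes through unchanged. The `dropout` function here is a hypothetical stand-in, not the `nn.Dropout` implementation.

```python
import random


def dropout(values, p, training=True, rng=random):
    """Zero each entry with probability p and scale the survivors by 1 / (1 - p)."""
    if not training or p == 0:
        return list(values)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


ones = [1.0] * 10
print(dropout(ones, p=0.5, rng=random.Random(0)))  # entries are either 0.0 or 2.0
print(dropout(ones, p=0.5, training=False))        # unchanged at evaluation time
```

The rescaling keeps the expected value of each dimension the same in training and evaluation, which is why no extra correction is needed after calling `model.eval()`.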

In [21]:
class LSTMCell(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        self.Vf = nn.Linear(hidden_features, hidden_features)
        self.Vi = nn.Linear(hidden_features, hidden_features)
        self.Vo = nn.Linear(hidden_features, hidden_features)
        self.Vc = nn.Linear(hidden_features, hidden_features)
        self.Uf = nn.Linear(in_features, hidden_features)
        self.Ui = nn.Linear(in_features, hidden_features)
        self.Uo = nn.Linear(in_features, hidden_features)
        self.Uc = nn.Linear(in_features, hidden_features)
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, h, c, x):
        f = self.sigmoid(self.Vf(h) + self.Uf(x))
        i = self.sigmoid(self.Vi(h) + self.Ui(x))
        o = self.sigmoid(self.Vo(h) + self.Uo(x))
        c_ = self.tanh(self.Vc(h) + self.Uc(x))
        c = f * c + i * c_
        h = o * self.tanh(c)
        return h, c
    
class LSTM(nn.Module):
    def __init__(self, in_features, hidden_features, dropout=0, num_layers=1, bidirectional=False):
        super().__init__()
        self.bidirectional = bidirectional
        self.hidden_features = hidden_features
        self.layers = nn.ModuleList([
            LSTMCell((hidden_features if i > 0 else in_features), hidden_features)
            for i in range(num_layers)
        ])
        self.dropouts = nn.ModuleList([
            nn.Dropout(dropout) if dropout > 0 else nn.Identity()
            for _ in range(num_layers)
        ])
        self.W = nn.Linear(hidden_features * (2 if bidirectional else 1), hidden_features)
            
    def forward(self, x):
        x_seq = x.split(1, dim=1)
        hs = []
        for seq in [x_seq, x_seq[::-1]] if self.bidirectional else [x_seq]:
            h = [torch.zeros((x.shape[0], self.hidden_features), device=x.device)
                for _ in range(len(self.layers))]
            c = [torch.zeros((x.shape[0], self.hidden_features), device=x.device)
                for _ in range(len(self.layers))]
            for x in seq:
                x = x.squeeze(dim=1)
                for i in range(len(self.layers)):
                    h_ = (self.dropouts[i - 1](h[i - 1]) if i > 0 else x)
                    h[i], c[i] = self.layers[i](h[i], c[i], h_)
            hs.append(h[-1])
        h = torch.cat(hs, dim=1)
        o = self.W(self.dropouts[-1](h))
        return o

> **Problem 4.1** *(3 points)* Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN. Report the accuracy on the dev data.

In [22]:
class LSTMModel(ModelBase):
    def __init__(self, dim, embed=None):
        super().__init__(LSTM(dim, dim), dim, embed)

In [23]:
torch.manual_seed(19)
model = LSTMModel(dim=64, embed=vocab).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=7e-4),
).train(train_loader, valid_loader, epochs=100)

LSTMModel
Epoch  10: Train Loss 0.691812, Train Acc: 50.73, Valid Acc: 49.95
Epoch  20: Train Loss 0.682169, Train Acc: 51.36, Valid Acc: 50.59
Epoch  30: Train Loss 0.686595, Train Acc: 51.78, Valid Acc: 50.68
Epoch  40: Train Loss 0.441996, Train Acc: 88.45, Valid Acc: 73.12
Epoch  50: Train Loss 0.189206, Train Acc: 94.99, Valid Acc: 74.57
Epoch  60: Train Loss 0.265423, Train Acc: 96.71, Valid Acc: 72.30
Epoch  70: Train Loss 0.163695, Train Acc: 95.90, Valid Acc: 73.66
Epoch  80: Train Loss 0.079303, Train Acc: 97.52, Valid Acc: 71.84
Epoch  90: Train Loss 0.077685, Train Acc: 97.26, Valid Acc: 71.66
Epoch 100: Train Loss 0.133543, Train Acc: 97.71, Valid Acc: 73.12
Best Valid Acc: 74.57%


> **Problem 4.2** *(2 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data.

In [24]:
class LSTMDropoutModel(ModelBase):
    def __init__(self, dim, dropout, embed=None):
        super().__init__(LSTM(dim, dim, dropout), dim, embed)

In [25]:
torch.manual_seed(19)
model = LSTMDropoutModel(dim=64, dropout=0.5, embed=vocab).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4),
).train(train_loader, valid_loader, epochs=100)

LSTMDropoutModel
Epoch  10: Train Loss 0.696492, Train Acc: 52.12, Valid Acc: 51.14
Epoch  20: Train Loss 0.497213, Train Acc: 85.62, Valid Acc: 71.75
Epoch  30: Train Loss 0.079066, Train Acc: 97.48, Valid Acc: 72.84
Epoch  40: Train Loss 0.033284, Train Acc: 98.84, Valid Acc: 71.66
Epoch  50: Train Loss 0.057790, Train Acc: 96.65, Valid Acc: 70.57
Epoch  60: Train Loss 0.019081, Train Acc: 99.70, Valid Acc: 71.75
Epoch  70: Train Loss 0.003143, Train Acc: 99.91, Valid Acc: 73.30
Epoch  80: Train Loss 0.014404, Train Acc: 99.78, Valid Acc: 72.48
Epoch  90: Train Loss 0.004946, Train Acc: 99.86, Valid Acc: 71.57
Epoch 100: Train Loss 0.057146, Train Acc: 99.84, Valid Acc: 72.30
Best Valid Acc: 73.30%


> **Problem 4.3 (bonus)** *(2 points)* Consider implementing a bidirectional LSTM and a two-layer LSTM. Report your accuracy on the dev data.

In [26]:
class LSTMBidirectionalModel(ModelBase):
    def __init__(self, dim, embed=None):
        super().__init__(LSTM(dim, dim, bidirectional=True), dim, embed)

In [27]:
torch.manual_seed(19)
model = LSTMBidirectionalModel(dim=64, embed=vocab).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=7e-4),
).train(train_loader, valid_loader, epochs=100)

LSTMBidirectionalModel
Epoch  10: Train Loss 0.496354, Train Acc: 79.44, Valid Acc: 68.30
Epoch  20: Train Loss 0.237126, Train Acc: 93.42, Valid Acc: 70.75
Epoch  30: Train Loss 0.094455, Train Acc: 98.35, Valid Acc: 69.30
Epoch  40: Train Loss 0.040121, Train Acc: 98.96, Valid Acc: 70.03
Epoch  50: Train Loss 0.014624, Train Acc: 99.86, Valid Acc: 71.39
Epoch  60: Train Loss 0.008459, Train Acc: 99.95, Valid Acc: 71.30
Epoch  70: Train Loss 0.016680, Train Acc: 99.12, Valid Acc: 69.12
Epoch  80: Train Loss 0.046878, Train Acc: 99.45, Valid Acc: 70.03
Epoch  90: Train Loss 0.031936, Train Acc: 99.67, Valid Acc: 71.57
Epoch 100: Train Loss 0.290228, Train Acc: 97.34, Valid Acc: 70.39
Best Valid Acc: 71.57%


In [28]:
class LSTMTwoLayerModel(ModelBase):
    def __init__(self, dim, embed=None):
        super().__init__(LSTM(dim, dim, num_layers=2), dim, embed)

In [29]:
torch.manual_seed(19)
model = LSTMTwoLayerModel(dim=64, embed=vocab).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4),
).train(train_loader, valid_loader, epochs=100)

LSTMTwoLayerModel
Epoch  10: Train Loss 0.664806, Train Acc: 52.52, Valid Acc: 50.14
Epoch  20: Train Loss 0.433135, Train Acc: 84.49, Valid Acc: 69.48
Epoch  30: Train Loss 0.159966, Train Acc: 94.99, Valid Acc: 71.12
Epoch  40: Train Loss 0.120943, Train Acc: 97.19, Valid Acc: 70.57
Epoch  50: Train Loss 0.014681, Train Acc: 99.29, Valid Acc: 71.12
Epoch  60: Train Loss 0.004357, Train Acc: 99.73, Valid Acc: 71.75
Epoch  70: Train Loss 0.003934, Train Acc: 99.59, Valid Acc: 70.21
Epoch  80: Train Loss 0.007000, Train Acc: 99.82, Valid Acc: 71.39
Epoch  90: Train Loss 0.004084, Train Acc: 99.75, Valid Acc: 70.57
Epoch 100: Train Loss 0.005964, Train Acc: 99.92, Valid Acc: 69.75
Best Valid Acc: 71.75%


## 5. Pretrained Word Vectors
The last step is to use a pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supervised pretrained models.

> **Problem 5.1 (bonus)** *(2 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to replace word embeddings in your model from 4.2. Report the model's accuracy on the dev data.
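
Each line of a GloVe text file holds a word followed by its space-separated vector components. The sketch below parses that layout from an in-memory sample and looks up tokens with a fallback vector. The three-dimensional values, the `toy_word2vec` name, and the helpers `load_vectors` and `embed` are made up for illustration; the real files have 50 to 300 dimensions.

```python
import io


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [float, ...]} dictionary."""
    vectors = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        vectors[word] = [float(value) for value in values]
    return vectors


sample = io.StringIO(  # made-up stand-in for a GloVe file
    "good 0.1 0.2 0.3\n"
    "movie -0.4 0.5 0.0\n"
    "unk 0.0 0.0 0.0\n"
)
toy_word2vec = load_vectors(sample)
fallback = toy_word2vec["unk"]  # assumption: reuse the 'unk' entry for unseen words


def embed(tokens):
    """Look up each token, using the fallback vector for out-of-vocabulary words."""
    return [toy_word2vec.get(token, fallback) for token in tokens]


print(embed(["good", "film"]))  # [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]
```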

In [30]:
from urllib import request
import zipfile

if not os.path.exists("glove/glove.6B.100d.txt"):
    os.makedirs("glove", exist_ok=True)
    print("Downloading glove.6B.zip")
    request.urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", "glove/glove.6B.zip")
    with zipfile.ZipFile("glove/glove.6B.zip", "r") as zipf:  # same path the archive was downloaded to
        zipf.extractall("glove")

In [31]:
dim = 50

with open(f"glove/glove.6B.{dim}d.txt", "r") as glove:
    glove_map = map(lambda line: line.split(), glove.readlines())
    word2vec = {word: list(map(float, vec)) for word, *vec in glove_map}

unk_vec = word2vec["unk"]
x2vec = lambda x_item: torch.Tensor([[word2vec.get(token, unk_vec) for token in x_item]]).to(device)

glove_train_loader = get_dataloader(x_train, y_train, x2vec, y2long, batch_size, shuffle=True)
glove_valid_loader = get_dataloader(x_valid, y_valid, x2vec, y2long, batch_size, shuffle=False)

In [32]:
torch.manual_seed(19)
model = MLPModel(dim=dim, length=length).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=7e-4),
).train(glove_train_loader, glove_valid_loader, epochs=100)

MLPModel
Epoch  10: Train Loss 0.369379, Train Acc: 83.02, Valid Acc: 65.58
Epoch  20: Train Loss 0.271266, Train Acc: 90.94, Valid Acc: 66.58
Epoch  30: Train Loss 0.202376, Train Acc: 95.59, Valid Acc: 65.49
Epoch  40: Train Loss 0.086696, Train Acc: 95.94, Valid Acc: 66.39
Epoch  50: Train Loss 0.056413, Train Acc: 98.88, Valid Acc: 65.21
Epoch  60: Train Loss 0.029469, Train Acc: 97.81, Valid Acc: 66.67
Epoch  70: Train Loss 0.017755, Train Acc: 98.36, Valid Acc: 65.85
Epoch  80: Train Loss 0.025084, Train Acc: 99.56, Valid Acc: 64.31
Epoch  90: Train Loss 0.025480, Train Acc: 99.92, Valid Acc: 66.30
Epoch 100: Train Loss 0.023588, Train Acc: 99.53, Valid Acc: 65.40
Best Valid Acc: 66.67%


In [33]:
torch.manual_seed(19)
model = RNNModel(dim=dim).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4),
).train(glove_train_loader, glove_valid_loader, epochs=100)

RNNModel
Epoch  10: Train Loss 0.686666, Train Acc: 51.58, Valid Acc: 50.77
Epoch  20: Train Loss 0.578314, Train Acc: 68.79, Valid Acc: 70.75
Epoch  30: Train Loss 0.648527, Train Acc: 69.82, Valid Acc: 70.48
Epoch  40: Train Loss 0.564061, Train Acc: 70.68, Valid Acc: 71.03
Epoch  50: Train Loss 0.559427, Train Acc: 71.15, Valid Acc: 69.94
Epoch  60: Train Loss 0.518446, Train Acc: 71.61, Valid Acc: 69.57
Epoch  70: Train Loss 0.535677, Train Acc: 73.05, Valid Acc: 69.66
Epoch  80: Train Loss 0.502536, Train Acc: 73.46, Valid Acc: 71.48
Epoch  90: Train Loss 0.552313, Train Acc: 74.88, Valid Acc: 71.93
Epoch 100: Train Loss 0.513261, Train Acc: 74.60, Valid Acc: 72.03
Best Valid Acc: 72.03%


In [34]:
torch.manual_seed(19)
model = LSTMModel(dim=dim).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4),
).train(glove_train_loader, glove_valid_loader, epochs=100)

LSTMModel
Epoch  10: Train Loss 0.688367, Train Acc: 56.16, Valid Acc: 57.77
Epoch  20: Train Loss 0.535414, Train Acc: 71.00, Valid Acc: 69.85
Epoch  30: Train Loss 0.536799, Train Acc: 70.68, Valid Acc: 69.30
Epoch  40: Train Loss 0.506775, Train Acc: 75.91, Valid Acc: 73.12
Epoch  50: Train Loss 0.438820, Train Acc: 78.57, Valid Acc: 74.57
Epoch  60: Train Loss 0.487117, Train Acc: 80.20, Valid Acc: 74.84
Epoch  70: Train Loss 0.421967, Train Acc: 81.64, Valid Acc: 74.75
Epoch  80: Train Loss 0.463179, Train Acc: 80.10, Valid Acc: 71.66
Epoch  90: Train Loss 0.429875, Train Acc: 82.12, Valid Acc: 73.93
Epoch 100: Train Loss 0.366922, Train Acc: 85.56, Valid Acc: 74.84
Best Valid Acc: 74.84%


In [35]:
torch.manual_seed(19)
model = LSTMDropoutModel(dim=dim, dropout=0.5).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4),
).train(glove_train_loader, glove_valid_loader, epochs=100)

LSTMDropoutModel
Epoch  10: Train Loss 0.693591, Train Acc: 50.54, Valid Acc: 50.14
Epoch  20: Train Loss 0.568008, Train Acc: 68.88, Valid Acc: 69.85
Epoch  30: Train Loss 0.575666, Train Acc: 70.96, Valid Acc: 68.85
Epoch  40: Train Loss 0.529619, Train Acc: 75.41, Valid Acc: 73.75
Epoch  50: Train Loss 0.472025, Train Acc: 77.39, Valid Acc: 75.57
Epoch  60: Train Loss 0.511019, Train Acc: 78.52, Valid Acc: 74.93
Epoch  70: Train Loss 0.420726, Train Acc: 80.29, Valid Acc: 75.39
Epoch  80: Train Loss 0.402832, Train Acc: 80.69, Valid Acc: 75.66
Epoch  90: Train Loss 0.443642, Train Acc: 81.17, Valid Acc: 74.57
Epoch 100: Train Loss 0.381463, Train Acc: 83.92, Valid Acc: 76.20
Best Valid Acc: 76.20%


In [36]:
torch.manual_seed(19)
model = LSTMBidirectionalModel(dim=dim).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4),
).train(glove_train_loader, glove_valid_loader, epochs=100)

LSTMBidirectionalModel
Epoch  10: Train Loss 0.505551, Train Acc: 72.05, Valid Acc: 71.48
Epoch  20: Train Loss 0.429013, Train Acc: 75.19, Valid Acc: 72.57
Epoch  30: Train Loss 0.435818, Train Acc: 78.53, Valid Acc: 73.57
Epoch  40: Train Loss 0.432332, Train Acc: 79.89, Valid Acc: 71.30
Epoch  50: Train Loss 0.406536, Train Acc: 82.43, Valid Acc: 74.75
Epoch  60: Train Loss 0.389317, Train Acc: 83.73, Valid Acc: 73.84
Epoch  70: Train Loss 0.358165, Train Acc: 86.03, Valid Acc: 74.48
Epoch  80: Train Loss 0.414326, Train Acc: 85.39, Valid Acc: 72.48
Epoch  90: Train Loss 0.296984, Train Acc: 89.83, Valid Acc: 73.02
Epoch 100: Train Loss 0.209353, Train Acc: 91.20, Valid Acc: 73.12
Best Valid Acc: 74.75%


In [37]:
torch.manual_seed(19)
model = LSTMTwoLayerModel(dim=dim).to(device)
Trainer(
    model, torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4),
).train(glove_train_loader, glove_valid_loader, epochs=100)

LSTMTwoLayerModel
Epoch  10: Train Loss 0.586694, Train Acc: 68.42, Valid Acc: 71.21
Epoch  20: Train Loss 0.584138, Train Acc: 71.61, Valid Acc: 70.84
Epoch  30: Train Loss 0.495693, Train Acc: 74.50, Valid Acc: 70.75
Epoch  40: Train Loss 0.469064, Train Acc: 76.47, Valid Acc: 74.39
Epoch  50: Train Loss 0.517824, Train Acc: 77.72, Valid Acc: 74.11
Epoch  60: Train Loss 0.468988, Train Acc: 78.48, Valid Acc: 75.30
Epoch  70: Train Loss 0.471545, Train Acc: 79.97, Valid Acc: 74.93
Epoch  80: Train Loss 0.397451, Train Acc: 81.34, Valid Acc: 74.57
Epoch  90: Train Loss 0.413070, Train Acc: 81.92, Valid Acc: 74.84
Epoch 100: Train Loss 0.345543, Train Acc: 83.36, Valid Acc: 74.66
Best Valid Acc: 75.30%
