# HW2: Spam classification with LSTM

The deadline is **9:30 am Feb 16, 2022**.   
You should submit a `.ipynb` file with your solutions to NYU Brightspace.

---

In this homework, we will reuse the spam prediction dataset used in HW1.
We will encode each message with a word-level BiLSTM sentence encoder and classify it with a neural network classifier.

For reference, you may read [this paper](https://arxiv.org/abs/1705.02364).

Lab 3 is especially relevant to this homework.

## Points distribution

1. code `spam_collate_func`: 25 pts
2. code `LSTMClassifier.__init__`: 25 pts
3. code `LSTMClassifier.forward`: 20 pts
4. code `evaluate`: 10 pts
5. code for training loop: 10 pts
6. Question on early stopping: 10 pts

How we grade the code: 
- full points if code works and the underlying logic is correct;
- half points if code works but the underlying logic is incorrect;
- zero points if code does not work.

Therefore, **make sure your code works, i.e., no error is being produced when you execute the code.**


# Data Loading
First, reuse the code from HW1 to download and read the data.

In [1]:
# !wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


We will split the data into train, val, and test sets.  
`train_texts`, `val_texts`, and `test_texts` should contain a list of text examples in the dataset.
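
As a quick sanity check on the split arithmetic, the sketch below computes the three set sizes for a given number of rows. The helper name `split_sizes` is hypothetical and not part of the assignment; it only assumes the 0.15 / 0.15 / 0.7 split described here, with the train set taking whatever is left after truncation.

```python
def split_sizes(n_rows, val_frac=0.15, test_frac=0.15):
    """Return (train, val, test) sizes; train takes whatever is left over."""
    val_size = int(n_rows * val_frac)    # int() truncates toward zero
    test_size = int(n_rows * test_frac)
    train_size = n_rows - val_size - test_size
    return train_size, val_size, test_size

print(split_sizes(5572))  # (3902, 835, 835)
```

Because both fractions are truncated, the train set absorbs the rounding remainder, so the three sizes always add up to the full dataset.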


In [3]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

# Shuffle the data (sample all rows without replacement)
df = df.sample(frac=1)
# Split df to test/val/train
test_df = df[:test_size]
val_df = df[test_size:test_size+val_size]
train_df = df[test_size+val_size:]


train_texts, train_labels = list(train_df.v2), list(train_df.v1)
val_texts, val_labels     = list(val_df.v2), list(val_df.v1)
test_texts, test_labels   = list(test_df.v2), list(test_df.v1)


# Check that indices do not overlap
assert set(train_df.index).intersection(set(val_df.index)) == set({})
assert set(test_df.index).intersection(set(train_df.index)) == set({})
assert set(val_df.index).intersection(set(test_df.index)) == set({})
# Check that all indices are present
assert df.shape[0] == len(train_labels) + len(val_labels) + len(test_labels)

# Sizes
print(
    f"Size of initial data: {df.shape[0]}\n"
    f"Train size: {len(train_labels)}\n"
    f"Val size: {len(val_labels)}\n"
    f"Test size: {len(test_labels)}\n"
)

Size of initial data: 5572
Train size: 3902
Val size: 835
Test size: 835



In [4]:
# train_texts[:10]  # Just checking the examples in train_text

# Download and Load GloVe Embeddings
We will use GloVe embedding parameters to initialize our layer of word representations / embedding layer.
Let's download and load GloVe.


This is related to Lab 3 (Deep Learning); please watch the recording and check the notebook for details.
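
For orientation, a GloVe text file stores one token per line, followed by its vector components separated by spaces. The sketch below parses a single line of that form; `parse_glove_line` is a hypothetical helper and the numbers are made up for illustration, not real GloVe values.

```python
def parse_glove_line(line):
    """Split one GloVe text line into (token, list of floats)."""
    token, raw_embedding = line.rstrip("\n").split(maxsplit=1)
    return token, [float(x) for x in raw_embedding.split()]

# Toy 3-dimensional line (made-up numbers, not real GloVe values)
token, vector = parse_glove_line("the 0.1 -0.2 0.3\n")
print(token, vector)  # the [0.1, -0.2, 0.3]
```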


In [5]:
#@title Download GloVe word embeddings

# === Download GloVe word embeddings
# !wget http://nlp.stanford.edu/data/glove.6B.zip

# === Unzip word embeddings and use only the top 50000 word embeddings for speed
# !unzip glove.6B.zip
# !head -n 50000 glove.6B.300d.txt > glove.6B.300d__50k.txt

# === Download Preprocessed version
# !wget 'https://docs.google.com/uc?id=1KMJTagaVD9hFHXFTPtNk0u2JjvNlyCAu' -O glove_split.aa
# !wget 'https://docs.google.com/uc?id=1LF2yD2jToXriyD-lsYA5hj03f7J3ZKaY' -O glove_split.ab
# !wget 'https://docs.google.com/uc?id=1N1xnxkRyM5Gar7sv4d41alyTL92Iip3f' -O glove_split.ac
# !cat glove_split.?? > 'glove.6B.300d__50k.txt'

## Load GloVe Embeddings

In [6]:
def load_glove(glove_path, embedding_dim):
    with open(glove_path, encoding="utf-8") as f:  # GloVe tokens include non-ASCII characters
        token_ls = [PAD_TOKEN, UNK_TOKEN]
        embedding_ls = [np.zeros(embedding_dim), np.random.rand(embedding_dim)]
        for line in f:
            token, raw_embedding = line.split(maxsplit=1)
            token_ls.append(token)
            embedding = np.array([float(x) for x in raw_embedding.split()])
            embedding_ls.append(embedding)
        embeddings = np.array(embedding_ls)
        print(embedding_ls[-1].size)
    return token_ls, embeddings

PAD_TOKEN = '<PAD>'
UNK_TOKEN = '<UNK>'
EMBEDDING_DIM = 300 # dimension of Glove embeddings
glove_path = "glove.6B.300d__50k.txt"
vocab, embeddings = load_glove(glove_path, EMBEDDING_DIM)

300


In [7]:
len(vocab), embeddings.shape

(50002, (50002, 300))

In [8]:
len(vocab)

50002

In [9]:
embeddings.shape[0], embeddings.shape[1]

(50002, 300)

## Import packages

In [10]:
# !pip install sacremoses

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import os
import pandas as pd
import sacremoses
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm

# Tokenize text data.
We will use the `tokenize` function to convert text data into sequences of indices.
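
To make the mapping concrete, here is a minimal sketch of turning text into indices with an unknown-word fallback. The tiny vocabulary and the `to_indices` helper are hypothetical, and `str.split()` stands in for the Moses tokenizer used in the notebook.

```python
PAD_TOKEN, UNK_TOKEN = "<PAD>", "<UNK>"
toy_vocab = [PAD_TOKEN, UNK_TOKEN, "free", "entry", "win"]  # hypothetical tiny vocabulary
vocab_to_idx = {word: i for i, word in enumerate(toy_vocab)}

def to_indices(text):
    # str.split() is a stand-in for a real tokenizer; unknown words map to <UNK>
    unk_idx = vocab_to_idx[UNK_TOKEN]
    return [vocab_to_idx.get(tok, unk_idx) for tok in text.lower().split()]

print(to_indices("Free entry to WIN"))  # [2, 3, 1, 4]
```

Index 0 is reserved for padding and index 1 for out-of-vocabulary words, which matches the order in which `load_glove` builds its token list.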

In [11]:
def tokenize(data, labels, tokenizer, vocab, max_seq_length=128):
    # max_seq_length is unused here; sequences are trimmed later in SpamDataset
    vocab_to_idx = {word: i for i, word in enumerate(vocab)}
    unk_idx = vocab_to_idx[UNK_TOKEN]  # index 1, used for out-of-vocabulary tokens
    text_data = []
    for ex in tqdm(data):
        tokenized = tokenizer.tokenize(ex.lower())
        ids = [vocab_to_idx.get(token, unk_idx) for token in tokenized]
        text_data.append(ids)
    return text_data, labels

tokenizer = sacremoses.MosesTokenizer()

train_data_indices, train_labels = tokenize(train_texts, train_labels, tokenizer, vocab)
val_data_indices, val_labels = tokenize(val_texts, val_labels, tokenizer, vocab)
test_data_indices, test_labels = tokenize(test_texts, test_labels, tokenizer, vocab)


In [12]:
print("\nTrain text first 5 examples:\n", train_data_indices[:5])
print("\nTrain labels first 5 examples:\n", train_labels[:5])


Train text first 5 examples:
 [[1, 9600, 43, 17, 2712, 12, 9, 7301], [52, 2316, 47, 4021, 1892, 1970, 796, 3, 36, 7235, 2, 1, 102, 46770, 4119, 19, 10576, 1097, 122, 1, 4, 4, 4, 3363, 103, 725, 8459, 91, 205, 3, 43, 1, 692, 3954, 4, 807], [43, 3318, 1, 35, 4354, 1, 225, 3, 43, 151, 6016, 1, 806, 62, 9, 1, 181, 6, 1714, 3793], [102, 8237, 161, 22, 279, 48, 185, 21, 534, 9760, 4], [1, 1, 8609, 1, 1, 145, 1, 192, 759, 1, 1, 5561, 1, 1, 1]]

Train labels first 5 examples:
 [0, 0, 0, 0, 0]


# Create DataLoaders (25 pts)
Now, let's create PyTorch DataLoaders for our train, val, and test data.

The `SpamDataset` class is based on torch [`Dataset`](https://pytorch.org/docs/1.7.0/data.html?highlight=dataset#torch.utils.data.Dataset). It has an additional attribute called `self.max_sent_length` and a `spam_collate_func` method.

In order to use batch processing, all the examples in a batch need to be the same length. We'll do this by adding padding tokens. `spam_collate_func` is supposed to dynamically pad or trim the sentences in the batch based on `self.max_sent_length` and the length of the longest sequence in the batch. 
- If `self.max_sent_length` is less than the length of the longest sequence in the batch, use `self.max_sent_length`. Otherwise, use the length of the longest sequence in the batch.
- We do this because our input sentences in the batch may be much shorter than `self.max_sent_length`.  

Please check the comment block in the code near TODO for more details.


Example: 

* PAD token id = 0
* max_sent_length = 5

input list of sequences:
```
inp = [
    [1,4,5,3,5,6,7,4,4],
    [3,5,3,2],
    [2,5,3,5,6,7,4],
]
```
then padded minibatch looks like this:
```
padded_input = 
    [[1,4,5,3,5],
     [3,5,3,2,0],
     [2,5,3,5,6]]
```
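
The example above can be reproduced with a few lines of plain Python. This is only a sketch of the padding rule on lists, not the tensor-based collate function the assignment asks for; `pad_or_trim` is a hypothetical name.

```python
def pad_or_trim(batch, max_sent_length, pad_id=0):
    """Pad or trim every sequence to min(longest sequence, max_sent_length)."""
    target_len = min(max(len(seq) for seq in batch), max_sent_length)
    # Slicing trims long sequences; a negative repeat count yields an empty list
    return [seq[:target_len] + [pad_id] * (target_len - len(seq)) for seq in batch]

inp = [
    [1, 4, 5, 3, 5, 6, 7, 4, 4],
    [3, 5, 3, 2],
    [2, 5, 3, 5, 6, 7, 4],
]
print(pad_or_trim(inp, max_sent_length=5))
# [[1, 4, 5, 3, 5], [3, 5, 3, 2, 0], [2, 5, 3, 5, 6]]
```

With a larger limit such as `max_sent_length=128`, the same call pads everything to the longest sequence in the batch (length 9 here) instead of to 128.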

In [13]:
import numpy as np
import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    """
    Class that represents a train/validation/test dataset that's readable for PyTorch
    Note that this class inherits torch.utils.data.Dataset
    """
    
    def __init__(self, data_list, target_list, max_sent_length=128):
        """
        @param data_list: list of data tokens 
        @param target_list: list of data targets 

        """
        self.data_list = data_list
        self.target_list = target_list
        self.max_sent_length = max_sent_length
        assert (len(self.data_list) == len(self.target_list))

    def __len__(self):
        return len(self.data_list)
        
    def __getitem__(self, key, max_sent_length=None):
        """
        Triggered when you call dataset[i]
        """
        if max_sent_length is None:
            max_sent_length = self.max_sent_length
        token_idx = self.data_list[key][:max_sent_length]
        label = self.target_list[key]
        return [token_idx, label]

    def spam_collate_func(self, batch):
        """
        Customized function for DataLoader that dynamically pads the batch so that all 
        data have the same length
        # What is the input `batch`? That's for you to figure out!
        # You can read the Dataloader documentation, or you can use print
        # function to debug. 
        """ 
        data_list = [] # store padded sequences
        label_list = []
        
        # max_batch_seq_len is the length of the longest sequence in the batch
        # if that is less than self.max_sent_length;
        # otherwise max_batch_seq_len = self.max_sent_length
        
        data_indice_list = []
        
        for data_indice, label_index in batch:
            data_indice_list.append(len(data_indice))
        
        max_batch_seq_len = min(max(data_indice_list), self.max_sent_length)

        """
        Pad the sequences in your data 
        if their length is less than max_batch_seq_len
        or trim the sequences that are longer than self.max_sent_length  
        return padded data_list and label_list
        1. TODO: Your code here 
         """
    
        # Pad the sequences in your data 
        for data_index, label_index in batch:
            if len(data_index) < max_batch_seq_len:
                data_index = np.pad(data_index, (0, max_batch_seq_len - len(data_index)), 'constant', constant_values=0)
                
            else:
                data_index = data_index[:max_batch_seq_len]
            
            data_list.append(data_index)
            label_list.append(label_index)
    
        data_list = torch.tensor(np.array(data_list))
        label_list = torch.tensor(np.array(label_list))
        
        return [data_list, label_list]

BATCH_SIZE = 64
max_sent_length=128
train_dataset = SpamDataset(train_data_indices, train_labels, max_sent_length)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=train_dataset.spam_collate_func,
                                           shuffle=True)

val_dataset = SpamDataset(val_data_indices, val_labels, train_dataset.max_sent_length)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=train_dataset.spam_collate_func,
                                           shuffle=False)

test_dataset = SpamDataset(test_data_indices, test_labels, train_dataset.max_sent_length)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                           batch_size=BATCH_SIZE,
                                           collate_fn=train_dataset.spam_collate_func,
                                           shuffle=False)

In [14]:
data_batch, labels = next(iter(train_loader))
print(data_batch[0].shape, data_batch[1].shape, data_batch[2].shape)

torch.Size([68]) torch.Size([68]) torch.Size([68])


Let's try to print out a batch from train_loader.


In [15]:
data_batch, labels = next(iter(train_loader))
print("data batch dimension: ", data_batch.size())
print("data_batch: ", data_batch)
print("labels: ", labels)

data batch dimension:  torch.Size([64, 65])
data_batch:  tensor([[  366,   366,     3,  ...,     0,     0,     0],
        [ 3204,  8740,    43,  ...,     0,     0,     0],
        [    1,    62, 10576,  ...,     0,     0,     0],
        ...,
        [  199,    34,    83,  ...,     0,     0,     0],
        [  411, 42256,  3280,  ...,     0,     0,     0],
        [   88,     9,    38,  ...,     0,     0,     0]])
labels:  tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


In [16]:
print("data_batch: ", data_batch[0])

data_batch:  tensor([ 366,  366,    3,  255,   83, 4004,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0])


# Build a BiLSTM Classifier (20 + 25 + 10 pts)

Now we are going to build a BiLSTM classifier. Check this [blog post](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) and [`torch.nn.LSTM`](https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM) for reference. Recall that we've also seen LSTM in Lab 3. 

The hyperparameters for the LSTM are already given, but they are not necessarily optimal. They should give good accuracy, but you may tune them to get better performance.

* `__init__`: Class constructor. Here we define layers / parameters of LSTM.
* `forward`: This function is used whenever you call your object as `model()`. It takes the input minibatch and returns the output representation from LSTM.
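
An LSTM produces one output vector per token, while the classifier needs a single vector per sentence, so the forward pass typically pools over the time dimension. The sketch below illustrates max, mean, and sum pooling on plain Python lists; `pool_over_time` is a hypothetical helper with made-up numbers, and in the model itself the same reduction would be done on tensors.

```python
def pool_over_time(hidden_states, mode="max"):
    """Reduce a list of per-token vectors to a single vector, one value per dimension."""
    columns = list(zip(*hidden_states))  # group values by hidden dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")

# Three time steps, hidden size 2 (made-up numbers)
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
print(pool_over_time(states, "max"))   # [3.0, 5.0]
print(pool_over_time(states, "mean"))  # [2.0, 1.0]
```

Whichever pooling is chosen, the result has one value per hidden dimension, which is what fixes the input size of the first linear layer in the classifier.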

In [17]:
# class ModelBiLSTM(nn.Module):
#   def __init__(self, options):
#     # All the parameters to the model class & all the layers, go into __init__.
#     # We define the layers here, and we **call** the layers later, in the
#     # forward() function.
#     super(ModelBiLSTM, self).__init__()
# x(defined outside)    self.device = args['device']
# x    self.embed_dim = args['embed_dim']  # using 100-dim GloVe
# x    self.hidden_size = args['hidden_size']
# x    self.num_layers = args['num_layers']
# x    self.embedding = nn.Embedding.from_pretrained(
# x     torch.load('/content/drive/My Drive/colabs/prep-lab3-nli/.vector_cache/multinli_vectors.pt'))  # modify to your path 
# x    self.directions = 2

#     # Layers below: ReLU, dropout, projection (from embedding to input-to-LSTM),
#     # LSTM, linear layers.
# x    self.relu = nn.LeakyReLU()  # non-linear layer
# x    self.dropout = nn.Dropout(p=args['dropout'])  # prob of an elt to be zeroed

# x   self.lstm = nn.LSTM(  # LSTM layer; Q: why put it here, within __init__?
#       self.embed_dim, self.hidden_size, self.num_layers,
#       dropout=args['dropout'], bidirectional=True,
#       batch_first=True)  # BATCH FIRST!

#     self.linear_first = nn.Linear(
#       self.hidden_size * self.directions * 4, self.hidden_size)  # why 4?
#     self.linear_second = nn.Linear(self.hidden_size, self.hidden_size)
#     self.linear_third = nn.Linear(self.hidden_size, options['out_dim'])
#     for layer in [self.linear_first, self.linear_second, self.linear_third]:
#       nn.init.xavier_uniform_(layer.weight)
#       nn.init.zeros_(layer.bias)
        
#     # this is the linear classification layers
#     # with nonlinearity in the middle
# x    self.layers = nn.Sequential(  
#   	  self.linear_first, self.relu, self.dropout,
#       self.linear_second, self.relu, self.dropout,
#   	  self.linear_third,
#     )

In [18]:
#   # In this part, remove premise functions to only use hypothesis
# #   def forward(self, batch):
#     # This is the forward pass of the our model.

#     # ==========================================================================

#     # STEP 1 (part 2): batched inputs -> embedded input (which will be fed into LSTM)

#     # Q: shape of batch.premise (which is the token inputs)? batch_size * input_len_premise
#     # Q: what does batch.premise[i][j] mean?

#     # Q: shape of premise_embed (which will be fed into LSTM layers)?

#     # premise_embed = self.embedding(batch.premise)
#     hypothesis_embed = self.embedding(batch.hypothesis)

#     # premise_embed.shape: batch size * premise input len * embed dim

#     # ==========================================================================

#     # STEP 2: LSTM
#     # Q: what are the parameters to LSTM (when we're defining the LSTM)?
#     # Q: what are the inputs to the LSTM layers?
#     # Q: what are the outputs from the LSTM layers?
#     # https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html

#     # Q: what is self.lstm?
#     # If we write nn.LSTM(...)(premise_embed_proj) and
#     # nn.LSTM(...)(hypothesis_embed_proj), what will be different?
#     # premise_out, (premise_ht, _) = self.lstm(premise_embed, None)
#     hypothesis_out, (hypothesis_ht, _) = self.lstm(hypothesis_embed, None)
    
#     # Q: what is the shape of premise_out and premise_ht?

#     # Equivalently, the above code can be written as
#     # h0 = torch.zeros((self.num_layers * self.directions, batch.batch_size,
#     #                   self.hidden_size)).to(self.device)
#     # c0 = torch.zeros((self.num_layers * self.directions, batch.batch_size,
#     #                   self.hidden_size)).to(self.device)
#     # premise_out, (premise_ht, _) = self.lstm(premise_embed, (h0, c0))
#     # hypothesis_out, (hypothesis_ht, _) = self.lstm(hypothesis_embed, (h0, c0))    

#     # ==========================================================================

#     # STEP 3: linear layers for classification

#     # Q: how do we convert premise_out and hypothesis_out into the inputs to linear layers?
#     # Q: how do we design the linear layers?

#     # premise = premise_out[:, -1, :]
#     hypothesis = hypothesis_out[:, -1, :]

#     # in combination, remove effect of premise
#     combined = torch.cat(
#       (hypothesis,  # Q: shape?
#        torch.abs(hypothesis),
#        hypothesis),  # Q: shape?

#     # combined = torch.cat(
#     #   (premise,  # Q: shape? A: batch_size * (num_directions * hidden_size)
#     #    hypothesis,  # Q: shape?
#     #    torch.abs(premise - hypothesis),
#     #    premise * hypothesis),  # Q: shape?
#       1)  # Q: what are we doing here and what's 1?

#     # Q: what's the shape of combined? 

#     return self.layers(hypothesis)

In [1]:
# Testing
# First import torch related libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    """
    LSTMClassifier classification model
    """
    def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
        super().__init__()
        self.embedding_layer = self.load_pretrained_embeddings(embeddings)
        self.dropout = nn.Dropout(p=dropout_prob)

        self.embed_dim = embeddings.shape[1]  # 300 for the GloVe vectors loaded above
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.directions = 2 if bidirectional else 1

        self.non_linearity = nn.LeakyReLU()  # For example, ReLU

        # nn.LSTM applies dropout only between stacked layers, so use 0 for a single layer
        self.lstm = nn.LSTM(self.embed_dim, self.hidden_size, self.num_layers,
                            dropout=dropout_prob if num_layers > 1 else 0.0,
                            bidirectional=bidirectional, batch_first=True)

        # The pooled sentence vector has size hidden_size * directions
        self.linear_first = nn.Linear(self.hidden_size * self.directions, self.hidden_size)
        self.linear_second = nn.Linear(self.hidden_size, self.hidden_size)
        self.linear_third = nn.Linear(self.hidden_size, num_classes)

        for layer in [self.linear_first, self.linear_second, self.linear_third]:
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

        self.clf = nn.Sequential(
            self.linear_first, self.non_linearity, self.dropout,
            self.linear_second, self.non_linearity, self.dropout,
            self.linear_third,
        )  # classifier layer

        """
           Define the components of your BiLSTM Classifier model
           2. TODO: Your code here
        """


    def load_pretrained_embeddings(self, embeddings):
        """
           The code for loading embeddings from Lab 3 Deep Learning
           Unlike lab, we are not setting `embedding_layer.weight.requires_grad = False`
           because we want to finetune the embeddings on our data
        """
        embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
        embedding_layer.weight.data = torch.Tensor(embeddings).float()
        return embedding_layer


    def forward(self, inputs):
        """
           Write forward pass for LSTM. You must use dropout after embedding
           the inputs. 

           Example, forward := embedding -> bilstm -> pooling (sum?mean?max?) 
                              nonlinearity -> classifier
           Refer to: https://arxiv.org/abs/1705.02364

           Return logits

           3. TODO: Your code here
        """
        # STEP 1: batched token ids -> embeddings, followed by dropout
        # inputs_embed shape: (batch, seq_len, embed_dim)
        inputs_embed = self.dropout(self.embedding_layer(inputs))

        # STEP 2: BiLSTM over the sequence
        # inputs_out shape: (batch, seq_len, hidden_size * directions)
        inputs_out, _ = self.lstm(inputs_embed)

        # STEP 3: max pooling over time (padding positions are included for simplicity)
        pooled, _ = inputs_out.max(dim=1)

        # STEP 4: nonlinearity -> classifier
        logits = self.clf(self.non_linearity(pooled))

        return logits

In [2]:
# # First import torch related libraries
# import torch
# import torch.nn as nn
# import torch.nn.functional as F

# class LSTMClassifier(nn.Module):
#     """
#     LSTMClassifier classification model
#     """
#     def __init__(self, embeddings, hidden_size, num_layers, num_classes, bidirectional, dropout_prob=0.3):
#         super().__init__()
#         self.embedding_layer = self.load_pretrained_embeddings(embeddings)

#         self.dropout = nn.Dropout(p=dropout_prob)  # prob of an elt to be zeroed
        
#         self.lstm = nn.LSTM(300, hidden_size, num_layers, \
#                             dropout=dropout_prob, bidirectional=True, batch_first=True)
#         self.non_linearity = nn.LeakyReLU()  # For example, ReLU
        
#         self.directions = 2
#         self.linear_first = nn.Linear(hidden_size * self.directions * 3, hidden_size)  # why 4?
#         self.linear_second = nn.Linear(hidden_size, hidden_size)
#         self.linear_third = nn.Linear(hidden_size, 300)
                                      
#         for layer in [self.linear_first, self.linear_second, self.linear_third]:
#             nn.init.xavier_uniform_(layer.weight)
#             nn.init.zeros_(layer.bias)

#         self.clf = nn.Sequential(self.linear_first, self.non_linearity, self.dropout, \
#                                  self.linear_second, self.non_linearity, self.dropout, \
#                                  self.linear_third)         # classifier layer
        

#         """
#            Define the components of your BiLSTM Classifier model
#            2. TODO: Your code here
#         """
    
# #         raise NotImplementedError  # delete this line
    
#     def load_pretrained_embeddings(self, embeddings):
#         """
#            The code for loading embeddings from Lab 3 Deep Learning
#            Unlike lab, we are not setting `embedding_layer.weight.requires_grad = False`
#            because we want to finetune the embeddings on our data
#         """
#         embedding_layer = nn.Embedding(embeddings.shape[0], embeddings.shape[1], padding_idx=0)
#         embedding_layer.weight.data = torch.Tensor(embeddings).float()
#         return embedding_layer


#     def forward(self, inputs):
#         logits = None
#         """
#            Write forward pass for LSTM. You must use dropout after embedding
#            the inputs. 

#            Example, forward := embedding -> bilstm -> pooling (sum?mean?max?) 
#                               nonlinearity -> classifier
#            Refer to: https://arxiv.org/abs/1705.02364

#            Return logits

#            3. TODO: Your code here
#         """
#         inputs_embed = self.embedding_layer(inputs)
        
#         inputs_out, (inputs_ht, _) = self.lstm(inputs_embed, None)
        
#         inputs_f = inputs_out[:, -1, :]
        
#         logits = self.clf(inputs_f)
        
#         return logits

First, we will define an evaluation function that will return the accuracy of the model. We will use this to compute validation accuracy and test accuracy of the model given a dataloader.
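
As a reminder of what accuracy means here, the sketch below computes it from per-class scores in plain Python: the predicted class is the index of the largest score in each row, and accuracy is the fraction of rows where that index equals the label. `accuracy_from_logits` and the toy numbers are hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]  # argmax per row
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 1.0], [-0.5, 0.5]]
toy_labels = [0, 1, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

Note that the predictions must be compared against the labels that come from the same dataloader, in the same order.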

In [3]:
import numpy as np

def evaluate(model, dataloader, device):
    accuracy = None
    model.eval()
    """
        4. TODO: Your code here
        Calculate the accuracy of the model on the data in dataloader
        You may refer to `run_inference` function from Lab 3 part 1.
    """

    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch_text, batch_labels in dataloader:
            logits = model(batch_text.to(device))
            # Predicted class = index of the largest logit
            all_preds.append(logits.argmax(dim=-1).cpu().numpy())
            # Use the labels from this dataloader, not a global variable
            all_labels.append(batch_labels.cpu().numpy())
    preds = np.concatenate(all_preds, axis=0)
    labels = np.concatenate(all_labels, axis=0)

    accuracy = (preds == labels).mean()

    return accuracy 

# Initialize the BiLSTM classifier model, criterion and optimizer


In [4]:
# BiLSTM hyperparameters
hidden_size = 32
num_layers = 1
num_classes = 2
bidirectional=True
torch.manual_seed(1234)

# if cuda exists, use cuda, else run on cpu
if torch.cuda.is_available():
    device = torch.device("cuda:0")
else:
    device = torch.device('cpu')

model = LSTMClassifier(embeddings, hidden_size, num_layers, num_classes, bidirectional)
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

NameError: name 'embeddings' is not defined

In [5]:
model

NameError: name 'model' is not defined

In [6]:
# stop point 

In [None]:
model(data_batch.to(device))  # call the module directly rather than .forward()

# Train model with early stopping (10 pts)

Train the model for `NUM_EPOCHS`. 
Keep track of training loss.  
Compute the validation accuracy after each epoch. Keep track of the best validation accuracy and save the model with the best validation accuracy.  

If the validation accuracy does not improve for more than `early_stop_patience` number of epochs in a row, stop training. 
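
The stopping rule can be checked in isolation by replaying a list of validation accuracies. The sketch below is a plain-Python simulation under the rule stated above (stop once accuracy has failed to improve for more than `patience` epochs in a row); `epochs_run` and the accuracy values are made up for illustration.

```python
def epochs_run(val_accuracies, patience):
    """Return (number of epochs actually run, best accuracy seen)."""
    best, n_no_improve = 0.0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            best, n_no_improve = acc, 0  # a checkpoint would be saved here
        else:
            n_no_improve += 1
            if n_no_improve > patience:
                return epoch, best       # stop early
    return len(val_accuracies), best

print(epochs_run([0.90, 0.95, 0.94, 0.95, 0.93, 0.96], patience=2))  # (5, 0.95)
```

In this toy run, training stops after epoch 5 and never reaches the 0.96 in epoch 6, which illustrates the trade-off controlled by the patience value.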


In [None]:
train_loss_history = []
val_accuracy_history = []
best_val_accuracy = 0
n_no_improve = 0
early_stop_patience=2
NUM_EPOCHS=10
  
for epoch in tqdm(range(NUM_EPOCHS)):
    model.train()  # this enables dropout/regularization
    for i, (data_batch, batch_labels) in enumerate(train_loader):
        """
           Code for training lstm
           Keep track of the training loss for each batch using train_loss_history
        """
        optimizer.zero_grad()  # clear gradients from the previous step
        preds = model(data_batch.to(device))
        loss = criterion(preds, batch_labels.to(device))
        """
          5(1). TODO: Recall that pytorch training involves five critical
          components, as discussed in the Lab. Some of the components are
          still missing here. Your code here.
        """
        loss.backward()   # backpropagate
        optimizer.step()  # update parameters
        train_loss_history.append(loss.item())
        
    # The end of a training epoch 

    """
        Code for tracking best validation accuracy, saving the best model, and early stopping
        # Compute validation accuracy after each training epoch using `evaluate` function
        # Keep track of validation accuracy in `val_accuracy_history`
        # save model with best validation accuracy, hint: torch.save(model, 'best_model.pt')
        # Early stopping: 
        # stop training if the validation accuracy does not improve for more than `early_stop_patience` runs
        5(2). TODO: Your code here
    """
    val_accuracy = evaluate(model, val_loader, device)
    val_accuracy_history.append(val_accuracy)

    if val_accuracy > best_val_accuracy:
        # New best model: save it and reset the patience counter
        best_val_accuracy = val_accuracy
        n_no_improve = 0
        torch.save(model, 'best_model.pt')
    else:
        n_no_improve += 1
        if n_no_improve > early_stop_patience:
            print(f"Early stopping after epoch {epoch + 1}")
            break

print("Best validation accuracy is: ", best_val_accuracy)

# Question: Why do we want to use early stopping? Write the most important reason in a concise way. (10 pts)

Your answer:

# Draw training curve 
X-axis: training steps, Y-axis: training loss

Make sure to draw your own curves. 

In [None]:
pd.Series(train_loss_history).plot()

# Validation accuracy curve
X-axis: Epochs, Y-axis: validation accuracy

In [None]:
pd.Series(val_accuracy_history).plot()

## You should expect to get test accuracy > 0.95.

In [None]:
# Reload best model from saved checkpoint
# Compute test accuracy
model = torch.load('best_model.pt')
test_accuracy = evaluate(model, test_loader, device)
print(test_accuracy)