In this project, I will attempt to solve the following problem. Given a headline, predict whether or not it is from the Onion (a satire news source). I wil first attempt to do this using a Nueral Bag Of Words (NBOW model). Then, I'll attempt to use a LSTM and hopefully get better results.

### Part 1. Loading and Preprocessing Data 
The following cell loads the OnionOrNot dataset

In [None]:
!curl https://raw.githubusercontent.com/lukefeilberg/onion/master/OnionOrNot.csv > OnionOrNot.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 1903k  100 1903k    0     0  9019k      0 --:--:-- --:--:-- --:--:-- 9019k


In [None]:
# DO NOT MODIFY #
import torch
import random
import numpy as np

RANDOM_SEED = 42
torch.manual_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# this is how we select a GPU if it's avalible on your computer or in the Colab environment.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# DO NOT MODIFY THIS BLOCK
# example code taken from fast-bert

import re
import html

def spec_add_spaces(t: str) -> str:
    "Add spaces around / and # in `t`. \n"
    return re.sub(r"([/#\n])", r" \1 ", t)

def rm_useless_spaces(t: str) -> str:
    "Remove multiple spaces in `t`."
    return re.sub(" {2,}", " ", t)

def replace_multi_newline(t: str) -> str:
    return re.sub(r"(\n(\s)*){2,}", "\n", t)

def fix_html(x: str) -> str:
    "List of replacements from html strings in `x`."
    re1 = re.compile(r"  +")
    x = (
        x.replace("#39;", "'")
        .replace("amp;", "&")
        .replace("#146;", "'")
        .replace("nbsp;", " ")
        .replace("#36;", "$")
        .replace("\\n", "\n")
        .replace("quot;", "'")
        .replace("<br />", "\n")
        .replace('\\"', '"')
        .replace(" @.@ ", ".")
        .replace(" @-@ ", "-")
        .replace(" @,@ ", ",")
        .replace("\\", " \\ ")
    )
    return re1.sub(" ", html.unescape(x))

def clean_text(input_text):
    text = fix_html(input_text)
    text = replace_multi_newline(text)
    text = spec_add_spaces(text)
    text = rm_useless_spaces(text)
    text = text.strip()
    return text

In [None]:
import pandas as pd
import nltk
from tqdm import tqdm

nltk.download('punkt')
df              = pd.read_csv("OnionOrNot.csv")
df["tokenized"] = df["text"].apply(lambda x: nltk.word_tokenize(clean_text(x.lower())))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
df.head()

Unnamed: 0,text,label,tokenized
0,Entire Facebook Staff Laughs As Man Tightens P...,1,"[entire, facebook, staff, laughs, as, man, tig..."
1,Muslim Woman Denied Soda Can for Fear She Coul...,0,"[muslim, woman, denied, soda, can, for, fear, ..."
2,Bold Move: Hulu Has Announced That They’re Gon...,1,"[bold, move, :, hulu, has, announced, that, th..."
3,Despondent Jeff Bezos Realizes He’ll Have To W...,1,"[despondent, jeff, bezos, realizes, he, ’, ll,..."
4,"For men looking for great single women, online...",1,"[for, men, looking, for, great, single, women,..."


In [None]:
df.iloc[42]

text         Customers continued to wait at drive-thru even...
label                                                        0
tokenized    [customers, continued, to, wait, at, drive-thr...
Name: 42, dtype: object

#### Split the dataset into training, validation, and testing

Now that we've loaded this dataset, we need to split the data into train, validation, and test sets. A good explanation of why we need these different sets can be found in subsection 2.2.5 of [Eisenstein](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf) but at the end it comes down to having a trustworthy and generalized model. The validation (sometimes called a development or tuning) set is used to help choose hyperparameters for our model, whereas the training set is used to fit the learned parameters (weights and biases) to the task. The test set is used to provide a final unbiased evaluation of our trained model, hopefully providing some insight into how it would actually do in production. Each of these sets should be disjoint from the others, to prevent any "peeking" that could unfairly influence our understanding of the model's accuracy. 

In addition to these different sets of data, we also need to create a vocab map for words in our Onion dataset, which will map tokens to numbers. This will be useful later, since torch PyTorch use tensors of sequences of numbers as inputs. **Go to the following cell, and fill out split_train_val_test and generate_vocab_map.**

In [None]:
# BEGIN - DO NOT CHANGE THESE IMPORTS/CONSTANTS OR IMPORT ADDITIONAL PACKAGES.
from collections import Counter
PADDING_VALUE = 0
UNK_VALUE     = 1
# END - DO NOT CHANGE THESE IMPORTS/CONSTANTS OR IMPORT ADDITIONAL PACKAGES.


# split_train_val_test
# This method takes a dataframe and splits it into train/val/test splits.
# It uses the props argument to split the dataset appropriately.
#
# args:
# df - the entire dataset DataFrame 
# props - proportions for each split in the order of [train, validation, test]. 
#         the last value of the props array is repetitive, but we've kept it for clarity.
#
# returns: 
# train DataFrame, val DataFrame, test DataFrame
#
def split_train_val_test(df, props=[.8, .1, .1]):
    assert round(sum(props), 2) == 1 and len(props) >= 2
    train_df, test_df, val_df = None, None, None
    
    ## YOUR CODE STARTS HERE (~3-5 lines of code) ##
    prev = 0
    for i, prop in enumerate(props):
      props[i] = prev + prop
      prev = props[i]
    train_df, val_df, test_df  = df.iloc[0:int(props[0] * len(df))], df.iloc[int(props[0] * len(df)):int(props[1] * len(df))], df.iloc[int(props[1] * len(df)):]
    # hint: you can use df.iloc to slice into specific indexes or ranges.

  
    
    ## YOUR CODE ENDS HERE ##
    
    return train_df, val_df, test_df

# generate_vocab_map
# This method takes a dataframe and builds a vocabulary to unique number map.
# It uses the cutoff argument to remove rare words occuring <= cutoff times. 
# *NOTE*: "" and "UNK" are reserved tokens in our vocab that will be useful
# later. You'll also find the Counter imported for you to be useful as well.
# 
# args:
# df     - the entire dataset this mapping is built from 
# cutoff - we exclude words from the vocab that appear less than or
#          eq to cutoff
#
# returns: 
# vocab - dict[str] = int
#         In vocab, each str is a unique token, and each dict[str] is a 
#         unique integer ID. Only elements that appear > cutoff times appear
#         in vocab.
#
# reversed_vocab - dict[int] = str
#                  A reversed version of vocab, which allows us to retrieve 
#                  words given their unique integer ID. This map will 
#                  allow us to "decode" integer sequences we'll encode using
#                  vocab!
# 
def generate_vocab_map(df, cutoff=2):
    vocab          = {"": PADDING_VALUE, "UNK": UNK_VALUE}
    reversed_vocab = dict()
    
    ## YOUR CODE STARTS HERE (~5-15 lines of code) ##
    # hint: start by iterating over df["tokenized"]
    freqMap = nltk.FreqDist(sum(df["tokenized"].tolist(), []))
    uniqId = 2
    for word in freqMap.keys():
      if freqMap[word] > cutoff:
        vocab[word] = uniqId
        uniqId += 1
    for word in vocab.keys():
      reversed_vocab[vocab[word]] = word

    ## YOUR CODE ENDS HERE ##
    
    return vocab, reversed_vocab

With the methods you have implemented above, we can now split the dataset into training, validation, and testing sets and generate our dictionaries mapping from word tokens to IDs (and vice versa). 

Note: The props list currently being used splits the dataset so that 80% of samples are used to train, and the remaining 20% are evenly split between training and validation. How you split your dataset is itself a major choice and something you would need to consider in your own projects. Can you think of why?

In [None]:
df                         = df.sample(frac=1)
train_df, val_df, test_df  = split_train_val_test(df, props=[.8, .1, .1])
train_vocab, reverse_vocab = generate_vocab_map(train_df)

In [None]:
# This line of code will help test your implementation, the expected output is the same distribution used in 'props'
#   in the above cell. Try out some different values to ensure it works, but for submission ensure you use 
#   [.8, .1, .1] 

(len(train_df) / len(df)), (len(val_df) / len(df)), (len(test_df) / len(df))

(0.8, 0.1, 0.1)

In [None]:
print(type(df["tokenized"][0]))
print(torch.zeros([1], dtype=torch.int32))

<class 'list'>
tensor([0], dtype=torch.int32)


#### Building a Dataset Class

PyTorch has custom Dataset Classes that have very useful extentions, we want to turn our current pandas DataFrame into a subclass of Dataset so that we can iterate and sample through it for minibatch updates. **In the following cell, fill out the HeadlineDataset class.** Refer to PyTorch documentation on [Dataset Classes](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 
for help.

In [None]:
# BEGIN - DO NOT CHANGE THESE IMPORTS/CONSTANTS OR IMPORT ADDITIONAL PACKAGES.
from torch.utils.data import Dataset
# END - DO NOT CHANGE THESE IMPORTS/CONSTANTS OR IMPORT ADDITIONAL PACKAGES.

# HeadlineDataset
# This class takes a Pandas DataFrame and wraps in a Torch Dataset.
# Read more about Torch Datasets here: 
# https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
# 
class HeadlineDataset(Dataset):
    
    # initialize this class with appropriate instance variables
    def __init__(self, vocab, df, max_length=50):
        # For this method: We would *strongly* recommend storing the dataframe 
        #                  itself as an instance variable, and keeping this method
        #                  very simple. Leave processing to __getitem__. 
        #              
        #                  Sometimes, however, it does make sense to preprocess in 
        #                  __init__. If you are curious as to why, read the aside at the 
        #                  bottom of this cell.
        # 
        
        ## YOUR CODE STARTS HERE (~3 lines of code) ##
        self.vocab = vocab
        self.df = df 
        self.df.reset_index(inplace = True)
        self.max_length = max_length
        return 
        ## YOUR CODE ENDS HERE ##
    
    # return the length of the dataframe instance variable
    def __len__(self):

        df_len = None
        ## YOUR CODE STARTS HERE (1 line of code) ##
        df_len = len(self.df)
        ## YOUR CODE ENDS HERE ##
        return df_len

    # __getitem__
    # 
    # Converts a dataframe row (row["tokenized"]) to an encoded torch LongTensor,
    # using our vocab map created using generate_vocab_map. Restricts the encoded 
    # headline length to max_length.
    # 
    # The purpose of this method is to convert the row - a list of words - into
    # a corresponding list of numbers.
    #
    # i.e. using a map of {"hi": 2, "hello": 3, "UNK": 0}
    # this list ["hi", "hello", "NOT_IN_DICT"] will turn into [2, 3, 0]
    #
    # returns: 
    # tokenized_word_tensor - torch.LongTensor 
    #                         A 1D tensor of type Long, that has each
    #                         token in the dataframe mapped to a number.
    #                         These numbers are retrieved from the vocab_map
    #                         we created in generate_vocab_map. 
    # 
    #                         **IMPORTANT**: if we filtered out the word 
    #                         because it's infrequent (and it doesn't exist 
    #                         in the vocab) we need to replace it w/ the UNK 
    #                         token
    # 
    # curr_label            - int
    #                         Binary 0/1 label retrieved from the DataFrame.
    # 
    def __getitem__(self, index: int):
        tokenized_word_tensor = None
        curr_label            = None
        ## YOUR CODE STARTS HERE (~3-7 lines of code) ##
        l = self.df["tokenized"][index]
        tokenized_word_tensor = torch.empty([len(l)], dtype=torch.long)
        for i, word in enumerate(l):
          if word in self.vocab.keys():
            tokenized_word_tensor[i] = self.vocab[word]
          else:
            tokenized_word_tensor[i] = self.vocab["UNK"]
        curr_label = self.df["label"][index]
        ## YOUR CODE ENDS HERE ##
        return tokenized_word_tensor, curr_label



#
# Completely optional aside on preprocessing in __init__.
# 
# Sometimes the compute bottleneck actually ends up being in __getitem__.
# In this case, you'd loop over your dataset in __init__, passing data 
# to __getitem__ and storing it in another instance variable. Then,
# you can simply return the preprocessed data in __getitem__ instead of
# doing the preprocessing.
# 
# There is a tradeoff though: can you think of one?
# 

In [None]:
from torch.utils.data import RandomSampler

train_dataset = HeadlineDataset(train_vocab, train_df)
val_dataset   = HeadlineDataset(train_vocab, val_df)
test_dataset  = HeadlineDataset(train_vocab, test_df)

# Now that we're wrapping our dataframes in PyTorch datsets, we can make use of PyTorch Random Samplers, they'll
#   define how our DataLoaders sample elements from the HeadlineDatasets  
train_sampler = RandomSampler(train_dataset)
val_sampler   = RandomSampler(val_dataset)
test_sampler  = RandomSampler(test_dataset)

#### Finishing DataLoader

We can now use PyTorch DataLoaders to batch our data for us. **In the following cell fill out collate_fn.** Refer to PyTorch documentation on [DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) for help.

In [None]:
# BEGIN - DO NOT CHANGE THESE IMPORTS/CONSTANTS OR IMPORT ADDITIONAL PACKAGES.
from torch.nn.utils.rnn import pad_sequence
# END - DO NOT CHANGE THESE IMPORTS/CONSTANTS OR IMPORT ADDITIONAL PACKAGES.

# collate_fn
# This function is passed as a parameter to Torch DataSampler. collate_fn collects
# batched rows, in the form of tuples, from a DataLoader and applies some final 
# pre-processing.
#
# Objective:
# In our case, we need to take the batched input array of 1D tokenized_word_tensors, 
# and create a 2D tensor that's padded to be the max length from all our tokenized_word_tensors 
# in a batch. We're moving from a Python array of tuples, to a padded 2D tensor. 
#
# *HINT*: you're allowed to use torch.nn.utils.rnn.pad_sequence (ALREADY IMPORTED)
# 
# Finally, you can read more about collate_fn here: https://pytorch.org/docs/stable/data.html
#
# args: 
# batch - PythonArray[tuple(tokenized_word_tensor: 1D Torch.LongTensor, curr_label: int)]
#         len(batch) == BATCH_SIZE
# 
# returns:
# padded_tokens - 2D LongTensor of shape (BATCH_SIZE, max len of all tokenized_word_tensor))
# y_labels      - 1D FloatTensor of shape (BATCH_SIZE)
# 
def collate_fn(batch, padding_value=PADDING_VALUE):
    padded_tokens, y_labels = None, None
    ## YOUR CODE STARTS HERE (~4-8 lines of code) ##
    maxlen = max(batch, key=len)
    padded_tokens = pad_sequence([x[0] for x in batch], batch_first = True, padding_value = PADDING_VALUE)
    y_labels = torch.Tensor([x[1] for x in batch])
    ## YOUR CODE ENDS HERE ##
    return padded_tokens, y_labels

In [None]:
from torch.utils.data import DataLoader
BATCH_SIZE = 16

train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, collate_fn=collate_fn)
val_iterator   = DataLoader(val_dataset, batch_size=BATCH_SIZE, sampler=val_sampler, collate_fn=collate_fn)
test_iterator  = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)

In [None]:
# Use this to test your collate_fn implementation.
# You can look at the shapes of x and y or put print statements in collate_fn while running this snippet

for x, y in test_iterator:
    print(f'x: {x.shape}')
    print(f'y: {y.shape}')
    break
test_iterator = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)

x: torch.Size([16, 23])
y: torch.Size([16])


### Create NBOW Model
Architecture Reference: Section 2.1 in (https://www.aclweb.org/anthology/P15-1162.pdf). 

In [None]:
# BEGIN - DO NOT CHANGE THESE IMPORTS OR IMPORT ADDITIONAL PACKAGES.
import torch.nn as nn
# END - DO NOT CHANGE THESE IMPORTS OR IMPORT ADDITIONAL PACKAGES.

class NBOW(nn.Module):
    # Instantiate layers for your model-
    # 
    # Your model architecture will be a feed-forward neural network.
    #
    # You'll need 3 nn.Modules at minimum
    # 1. An embeddings layer (see nn.Embedding)
    # 2. A linear layer (see nn.Linear)
    # 3. A sigmoid output (see nn.Sigmoid)
    #
    # HINT: In the forward step, the BATCH_SIZE is the first dimension.
    # 
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        ## YOUR CODE STARTS HERE (~4 lines of code) ##
        
        self.embedLayer = nn.Embedding(vocab_size, embedding_dim)
        self.linearLayer = nn.Linear(embedding_dim, 1)
        self.sigmoidLayer = nn.Sigmoid()

        ## YOUR CODE ENDS HERE ##
        
    # Complete the forward pass of the model.
    #
    # Use the output of the embedding layer to create
    # the average vector, which will be input into the 
    # linear layer.
    # 
    # args:
    # x - 2D LongTensor of shape (BATCH_SIZE, max len of all tokenized_word_tensor))
    #     This is the same output that comes out of the collate_fn function you completed
    def forward(self, x):
        ## YOUR CODE STARTS HERE (~4-5 lines of code) ##
        EmbedOutput = self.embedLayer(x)
        LinearOutput = self.linearLayer(torch.mean(EmbedOutput, dim=1))
        return torch.squeeze(self.sigmoidLayer(LinearOutput))


        #return x
        ## YOUR CODE ENDS HERE ##
    

In [None]:
model = NBOW(vocab_size    = len(train_vocab.keys()),
             embedding_dim = 300).to(device)

Loss function and Optimizer

In [None]:
#while Adam is already imported, you can try other optimizers as well
from torch.optim import Adam

criterion, optimizer = None, None
### YOUR CODE GOES HERE ###
criterion = nn.BCELoss()
optimizer = Adam(model.parameters(), lr = .001)


### YOUR CODE ENDS HERE ###

### Part 3: Training and Evaluation


In [None]:
# returns the total loss calculated from criterion
def train_loop(model, criterion, optim, iterator):
    model.train()
    total_loss = 0
    for x, y in tqdm(iterator):
        ### YOUR CODE STARTS HERE (~6 lines of code) ###
        output = model(x.to(device))
        loss = criterion(output, y.to(device))
        optim.zero_grad()
        loss.backward()
        optim.step()
        total_loss += loss.item()
        ### YOUR CODE ENDS HERE ###
    return total_loss

# returns:
# - true: a Python boolean array of all the ground truth values 
#         taken from the dataset iterator
# - pred: a Python boolean array of all model predictions. 
def val_loop(model, iterator):
    true, pred = [], []
    ### YOUR CODE STARTS HERE (~8 lines of code) ###
    for x, y in tqdm(iterator):
      predictedVals = model(x.to(device))
      trueVals = y.to(device)
      true = [x == 1 for x in trueVals]
      pred = [x >= .5 for x in predictedVals]
    ### YOUR CODE ENDS HERE ###
    return true, pred

#Define and Use evaluation metrics

For the sake of learning, I chose to implement my own evaluation metrics.

In [None]:
# DO NOT IMPORT ANYTHING IN THIS CELL. You shouldn't need any external libraries.

# accuracy
#
# What percent of classifications are correct?
# 
# true: ground truth, Python list of booleans.
# pred: model predictions, Python list of booleans.
# return: percent accuracy bounded between [0, 1]
#

def accuracy(true, pred):
    acc = None
    ## YOUR CODE STARTS HERE (~2-5 lines of code) ##
    arr = [x[0] == x[1] for x in zip(true, pred)]
    acc = sum(arr) / len(arr)
    ## YOUR CODE ENDS HERE ##
    return acc

# binary_f1 
#
# A method to calculate F-1 scores for a binary classification task.
# 
# args -
# true: ground truth, Python list of booleans.
# pred: model predictions, Python list of booleans.
# selected_class: Boolean - the selected class the F-1 
#                 is being calculated for.
# 
# return: F-1 score between [0, 1]
#
def binary_f1(true, pred, selected_class=True):
    f1 = None
    ## YOUR CODE STARTS HERE (~10-15 lines of code) ##
    tup = zip(true, pred)
    tp = 0
    tn = 0 
    fp = 0
    fn = 0
    for t, p in tup:
      if t == p:
        if p == True:
          tp += 1
        else:
          tn += 1
      else:
        if t == True and p == False:
          fn += 1
        else:
          fp += 1

    if selected_class:
      tprecision = 0
      if tp + fp == 0:
        tprecision = 0
      else:
        tprecision = tp / (tp + fp)
      trecall = 0
      if tp + fn == 0:
        trecall = 0
      else:
        trecall = tp / (tp + fn)
      if tprecision + trecall == 0:
        return 0
      f1 = 2 * (tprecision * trecall) / (tprecision + trecall)
    else:
      fprecision = 0
      if tn + fn == 0:
        fprecision = 0
      else:
        fprecision = tn / (tn + fn)
      frecall = 0
      if tn + fp == 0:
        frecall = 0
      else:
        frecall = tn / (tn + fp)
      if fprecision + frecall == 0:
        return 0
      f1 = 2 * (fprecision * frecall) / (fprecision + frecall)
    ## YOUR CODE ENDS HERE ##
    return f1

# binary_macro_f1
# 
# Averaged F-1 for all selected (true/false) classes.
#
# args -
# true: ground truth, Python list of booleans.
# pred: model predictions, Python list of booleans.
#
#
def binary_macro_f1(true, pred):
    averaged_macro_f1 = None
    ## YOUR CODE STARTS HERE (1 line of code) ##
    averaged_macro_f1 = (binary_f1(true, pred, selected_class=True) + binary_f1(true, pred, selected_class=False)) / 2
    ## YOUR CODE ENDS HERE ##
    return averaged_macro_f1

In [None]:
# To test your eval implementation, let's see how well the untrained model does on our dev dataset.
# It should do pretty poorly, but this can be random because of the initialization of the parameters of the model.
true, pred = val_loop(model, val_iterator)
print()
print(f'Binary Macro F1: {binary_macro_f1(true, pred)}')
print(f'Accuracy: {accuracy(true, pred)}')

100%|██████████| 150/150 [00:00<00:00, 553.81it/s]


Binary Macro F1: 0.375
Accuracy: 0.375





### Part 4: Training the model 

In [None]:
TOTAL_EPOCHS = 10
for epoch in range(TOTAL_EPOCHS):
    train_loss = train_loop(model, criterion, optimizer, train_iterator)
    true, pred = val_loop(model, val_iterator)
    print(f"EPOCH: {epoch}")
    print(f"TRAIN LOSS: {train_loss}")
    print(f"VAL F-1: {binary_macro_f1(true, pred)}")
    print(f"VAL ACC: {accuracy(true, pred)}")

100%|██████████| 1200/1200 [00:03<00:00, 385.75it/s]
100%|██████████| 150/150 [00:00<00:00, 570.37it/s]


EPOCH: 0
TRAIN LOSS: 621.4540008604527
VAL F-1: 0.5897435897435898
VAL ACC: 0.75


100%|██████████| 1200/1200 [00:03<00:00, 384.94it/s]
100%|██████████| 150/150 [00:00<00:00, 558.99it/s]


EPOCH: 1
TRAIN LOSS: 404.4328829944134
VAL F-1: 0.8333333333333333
VAL ACC: 0.875


100%|██████████| 1200/1200 [00:03<00:00, 385.53it/s]
100%|██████████| 150/150 [00:00<00:00, 550.92it/s]


EPOCH: 2
TRAIN LOSS: 317.68711391836405
VAL F-1: 0.746031746031746
VAL ACC: 0.75


100%|██████████| 1200/1200 [00:03<00:00, 382.98it/s]
100%|██████████| 150/150 [00:00<00:00, 555.87it/s]


EPOCH: 3
TRAIN LOSS: 267.90556765161455
VAL F-1: 0.9352226720647774
VAL ACC: 0.9375


100%|██████████| 1200/1200 [00:03<00:00, 385.54it/s]
100%|██████████| 150/150 [00:00<00:00, 552.14it/s]


EPOCH: 4
TRAIN LOSS: 229.62556424643844
VAL F-1: 0.9372549019607843
VAL ACC: 0.9375


100%|██████████| 1200/1200 [00:03<00:00, 386.34it/s]
100%|██████████| 150/150 [00:00<00:00, 559.25it/s]


EPOCH: 5
TRAIN LOSS: 201.79788933508098
VAL F-1: 0.8117647058823529
VAL ACC: 0.8125


100%|██████████| 1200/1200 [00:03<00:00, 386.44it/s]
100%|██████████| 150/150 [00:00<00:00, 558.21it/s]


EPOCH: 6
TRAIN LOSS: 179.83632330223918
VAL F-1: 1.0
VAL ACC: 1.0


100%|██████████| 1200/1200 [00:03<00:00, 388.64it/s]
100%|██████████| 150/150 [00:00<00:00, 556.32it/s]


EPOCH: 7
TRAIN LOSS: 159.9260141660925
VAL F-1: 0.9352226720647774
VAL ACC: 0.9375


100%|██████████| 1200/1200 [00:03<00:00, 387.05it/s]
100%|██████████| 150/150 [00:00<00:00, 552.77it/s]


EPOCH: 8
TRAIN LOSS: 144.3743618351873
VAL F-1: 0.8545454545454546
VAL ACC: 0.875


100%|██████████| 1200/1200 [00:03<00:00, 387.43it/s]
100%|██████████| 150/150 [00:00<00:00, 562.06it/s]

EPOCH: 9
TRAIN LOSS: 131.1241482088808
VAL F-1: 0.7090909090909091
VAL ACC: 0.75





We can also look at the models performance on the held-out test set, using the same val_loop we wrote earlier.

In [None]:
true, pred = val_loop(model, test_iterator)
print()
print(f"TEST F-1: {binary_macro_f1(true, pred)}")
print(f"TEST ACC: {accuracy(true, pred)}")

100%|██████████| 150/150 [00:00<00:00, 559.35it/s]


TEST F-1: 0.8666666666666667
TEST ACC: 0.875





### Part 6: LSTM Model 

In [None]:
class RecurrentModel(nn.Module):
    # Instantiate layers for your model-
    # 
    # Your model architecture will be an optionally bidirectional LSTM,
    # followed by a linear + sigmoid layer.
    #
    # You'll need 4 nn.Modules
    # 1. An embeddings layer (see nn.Embedding)
    # 2. A bidirectional LSTM (see nn.LSTM)
    # 3. A Linear layer (see nn.Linear)
    # 4. A sigmoid output (see nn.Sigmoid)
    #
    # HINT: In the forward step, the BATCH_SIZE is the first dimension.
    # HINT: Think about what happens to the linear layer's hidden_dim size
    #       if bidirectional is True or False.
    # 
    def __init__(self, vocab_size, embedding_dim, hidden_dim, \
                 num_layers=1, bidirectional=True):
        super().__init__()
        ## YOUR CODE STARTS HERE (~4 lines of code) ##
        mult = 1
        if(bidirectional):
          mult = 2
        self.embedLayer = nn.Embedding(vocab_size, embedding_dim)
        self.LSTMLayer = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers, bidirectional=bidirectional, batch_first=True)
        self.linearLayer = nn.Linear(mult * embedding_dim, 1)
        self.sigmoidLayer = nn.Sigmoid()
        ## YOUR CODE ENDS HERE ##
        
    # Complete the forward pass of the model.
    #
    # Use the last timestep of the output of the LSTM as input
    # to the linear layer. This will only require some indexing 
    # into the correct return from the LSTM layer. 
    # 
    # args:
    # x - 2D LongTensor of shape (BATCH_SIZE, max len of all tokenized_word_tensor))
    #     This is the same output that comes out of the collate_fn function you completed-
    def forward(self, x):
        ## YOUR CODE STARTS HERE (~4-5 lines of code) ##
        EmbedOutput = self.embedLayer(x)
        LSTMOutput, (hidden, cell) = self.LSTMLayer(EmbedOutput)
        LinearOutput = self.linearLayer(LSTMOutput[:, -1, :])
        return torch.squeeze(self.sigmoidLayer(LinearOutput))

        #return x
        ## YOUR CODE ENDS HERE ##
    

In [None]:
train_iterator = DataLoader(train_dataset, batch_size=BATCH_SIZE, sampler=train_sampler, collate_fn=collate_fn)
val_iterator   = DataLoader(val_dataset, batch_size=BATCH_SIZE, sampler=val_sampler, collate_fn=collate_fn)
test_iterator  = DataLoader(test_dataset, batch_size=BATCH_SIZE, sampler=test_sampler, collate_fn=collate_fn)

In [None]:
lstm_model = RecurrentModel(vocab_size    = len(train_vocab.keys()),
                            embedding_dim = 300,
                            hidden_dim    = 300,
                            num_layers    = 5,
                            bidirectional = False).to(device)

In [None]:
from torch.optim.adagrad import Adagrad
from torch.optim import Adam

lstm_criterion, lstm_optimizer = None, None
### YOUR CODE STARTS HERE ###

lstm_criterion = nn.BCELoss()
lstm_optimizer = Adam(lstm_model.parameters(), lr = .001)

### YOUR CODE ENDS HERE ###

### Training and Evaluation



In [None]:
#Pre-training to see what accuracy we can get with random parameters
true, pred = val_loop(lstm_model, val_iterator)
print()
print(f'Binary Macro F1: {binary_macro_f1(true, pred)}')
print(f'Accuracy: {accuracy(true, pred)}')

100%|██████████| 150/150 [00:00<00:00, 218.41it/s]


Binary Macro F1: 0.36
Accuracy: 0.5625





In [None]:
#Watch the model train!
TOTAL_EPOCHS = 10
for epoch in range(TOTAL_EPOCHS):
    train_loss = train_loop(lstm_model, lstm_criterion, lstm_optimizer, train_iterator)
    true, pred = val_loop(lstm_model, val_iterator)
    print(f"EPOCH: {epoch}")
    print(f"TRAIN LOSS: {train_loss}")
    print(f"VAL F-1: {binary_macro_f1(true, pred)}")
    print(f"VAL ACC: {accuracy(true, pred)}")

100%|██████████| 1200/1200 [00:12<00:00, 94.15it/s]
100%|██████████| 150/150 [00:00<00:00, 219.89it/s]


EPOCH: 0
TRAIN LOSS: 711.4798891246319
VAL F-1: 0.805668016194332
VAL ACC: 0.8125


100%|██████████| 1200/1200 [00:12<00:00, 94.81it/s]
100%|██████████| 150/150 [00:00<00:00, 221.23it/s]


EPOCH: 1
TRAIN LOSS: 601.4297215938568
VAL F-1: 0.746031746031746
VAL ACC: 0.75


100%|██████████| 1200/1200 [00:12<00:00, 94.94it/s]
100%|██████████| 150/150 [00:00<00:00, 217.84it/s]


EPOCH: 2
TRAIN LOSS: 399.8543336354196
VAL F-1: 0.8666666666666667
VAL ACC: 0.875


100%|██████████| 1200/1200 [00:12<00:00, 94.85it/s]
100%|██████████| 150/150 [00:00<00:00, 223.04it/s]


EPOCH: 3
TRAIN LOSS: 289.0646998193115
VAL F-1: 0.9352226720647774
VAL ACC: 0.9375


100%|██████████| 1200/1200 [00:12<00:00, 94.63it/s]
100%|██████████| 150/150 [00:00<00:00, 219.63it/s]


EPOCH: 4
TRAIN LOSS: 221.1596870906651
VAL F-1: 0.8117647058823529
VAL ACC: 0.8125


100%|██████████| 1200/1200 [00:12<00:00, 94.53it/s]
100%|██████████| 150/150 [00:00<00:00, 218.60it/s]


EPOCH: 5
TRAIN LOSS: 162.88612027280033
VAL F-1: 0.7681159420289854
VAL ACC: 0.8125


100%|██████████| 1200/1200 [00:12<00:00, 94.00it/s]
100%|██████████| 150/150 [00:00<00:00, 221.00it/s]


EPOCH: 6
TRAIN LOSS: 110.52352964691818
VAL F-1: 1.0
VAL ACC: 1.0


100%|██████████| 1200/1200 [00:12<00:00, 93.93it/s]
100%|██████████| 1200/1200 [00:12<00:00, 94.34it/s]
100%|██████████| 150/150 [00:00<00:00, 218.57it/s]


EPOCH: 8
TRAIN LOSS: 65.77736117457971
VAL F-1: 0.8666666666666667
VAL ACC: 0.875


100%|██████████| 1200/1200 [00:12<00:00, 94.07it/s]
100%|██████████| 150/150 [00:00<00:00, 218.84it/s]

EPOCH: 9
TRAIN LOSS: 61.73469458904583
VAL F-1: 0.8545454545454546
VAL ACC: 0.875





In [None]:
#See how your model does on the held out data
true, pred = val_loop(lstm_model, test_iterator)
print()
print(f"TEST F-1: {binary_macro_f1(true, pred)}")
print(f"TEST ACC: {accuracy(true, pred)}")

100%|██████████| 150/150 [00:00<00:00, 214.51it/s]


TEST F-1: 0.873015873015873
TEST ACC: 0.875



