Task 4: Train an LSTM Model (40 points)
----
1. Using PyTorch, implement a neural network that uses one or more LSTM cells to do sentiment analysis. Use the nn.Embedding, nn.LSTM and nn.Linear layers to construct your model.
2. Note that sequence processing works differently with the PyTorch Embedding layer as compared to my sample code from class. The model input expects a padded tensor of token indices from the vocabulary, instead of one-hot encodings. For evaluation, use a vocabulary size of 10000 (max_features = 10000).
3. The model should have a single output with the sigmoid activation function for classification. The dimensions of the embedding layer and the hidden layer(s) are up to you, but please make sure your model does not take more than ~3 minutes to train.
4. Evaluate the model using PyTorch functions for average accuracy, area under the ROC curve and F1 scores (see [torchedev](https://pytorch.org/torcheval/stable/)).

In [1]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from nltk.corpus import stopwords
from collections import Counter
import string
import re
import seaborn as sns
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

In [2]:
from torcheval.metrics.functional import binary_f1_score
from torcheval.metrics import BinaryAUROC, BinaryAccuracy

In [3]:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("MPS is available")
else:
    device = torch.device("cpu")
    print("CPU used")

MPS is available


In [4]:
train_data_file = 'movie_reviews_train.txt'
train_df = pd.read_csv(train_data_file, sep='\t', header=None, names=['id', 'review', 'label'])[['review', 'label']]
X_train, y_train = train_df['review'].values, train_df['label'].values

dev_data_file = 'movie_reviews_dev.txt'
dev_df = pd.read_csv(dev_data_file, sep='\t', header=None, names=['id', 'review', 'label'])[['review', 'label']]
X_dev, y_dev = dev_df['review'].values, dev_df['label'].values

test_data_file = 'movie_reviews_test.txt'
test_df = pd.read_csv(test_data_file, sep='\t', header=None, names=['id', 'review', 'label'])[['review', 'label']]
X_test, y_test = test_df['review'].values, test_df['label'].values

In [5]:
def preprocess_token(s): 
    """
      This function is for pre-processing each token, not the entire sequence
      Retain only alphanumeric characters
      replace digits with no space
      Replace all whitespace sequences with no space
    """
    
    s = re.sub("[^\w]", "", s)
    return s

def clean_words(sentence: list[str], del_words: list[str], lowercase: bool = False):
    if lowercase:
        cleaned = list(map(lambda x: x.lower(), sentence))
    cleaned = list(filter(lambda word: word not in del_words, cleaned))
    
    return cleaned

def max_length_in_list_of_lists(list_of_lists):
    return max(len(sublist) for sublist in list_of_lists)

def encode(corpus: list[list[str]], vocab_map: dict[int]):
    processed_corp = []
    
    for document in corpus:
        enc_doc = []
        for token in document:
            if token in vocab_map.keys():
                enc_doc.append(vocab_map[token])
        
        processed_corp.append(enc_doc)
    
    return processed_corp

def pad(sentence:list, length:int):
    sentence.extend([0]*abs(len(sentence)-length))
    return sentence
    
def tokenize(x_train: list[str],
             x_dev: list[str],
             x_test: list[str],
             vocab_size: int,
             stopwords: list[str] = stopwords.words('english')):
    
    # spliting to tokens
    x_train = [x.split() for x in x_train]
    x_dev = [x.split() for x in x_dev]
    x_test = [x.split() for x in x_test]

    # Iterate through each document in the data and preprocess
    processed_data = []
    for doc in x_train:
        temp = list(map(lambda x: preprocess_token(x), doc))
        try:
            temp.remove('')
        except ValueError:
            ...
        processed_data.append(temp)
    x_train = list(processed_data)
    
    processed_data = []
    for doc in x_test:
        temp = list(map(lambda x: preprocess_token(x), doc))
        try:
            temp.remove('')
        except ValueError:
            ...
        processed_data.append(temp)
    x_test = list(processed_data)
    
    processed_data = []
    for doc in x_dev:
        temp = list(map(lambda x: preprocess_token(x), doc))
        try:
            temp.remove('')
        except ValueError:
            ...
        processed_data.append(temp)
    x_dev = list(processed_data)
    
    # process stop words to match with vocab
    stopwords = list(map(lambda word: preprocess_token(word), stopwords))
    
    # Remove stop words
    x_train = list(map(lambda x: clean_words(x, stopwords, lowercase=True), x_train))
    x_dev = list(map(lambda x: clean_words(x, stopwords, lowercase=True), x_dev))
    x_test = list(map(lambda x: clean_words(x, stopwords, lowercase=True), x_test))          
    
    # Creating a unified token list
    all_tok = []
    for doc in x_train:
        all_tok.extend(doc)
    for doc in x_test:
        all_tok.extend(doc)
    for doc in x_dev:
        all_tok.extend(doc)
    all_toks = Counter(all_tok)
    
    # Retain the 'vocab_size' most frequent words
    freq_vocab_map = {}
    for index, freq in enumerate(all_toks.most_common(vocab_size)):
        freq_vocab_map[freq[0]] = index + 1

    # Initialize empty lists to store padded sequences for training, development, and testing data
    padd_train = padd_test = padddev = []
    
    # convert tokens to their corresponding indices in the vocabulary if they exist
    padd_train = encode(x_train, freq_vocab_map)
    padd_test = encode(x_test, freq_vocab_map)
    padd_dev = encode(x_dev, freq_vocab_map)
    
        
    # Determine the maximum sequence size among all datasets (training, development, and testing)
    max_len = max(max_length_in_list_of_lists(padd_train),
                  max_length_in_list_of_lists(padd_test),
                  max_length_in_list_of_lists(padd_dev))
    
    padd_train = np.array(list(map(lambda x: pad(x, max_len), padd_train)))
    padd_test = np.array(list(map(lambda x: pad(x, max_len), padd_test)))
    padd_dev = np.array(list(map(lambda x: pad(x, max_len), padd_dev)))
    
    return (padd_train, padd_dev, padd_test, freq_vocab_map)

In [6]:
# Tokenize your train, test and development data
X_train, X_dev, X_test, vocab = tokenize(X_train, X_dev, X_test, 100000)


In [7]:
train_data = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train.astype(np.float32)))
dev_data = TensorDataset(torch.from_numpy(X_dev), torch.from_numpy(y_dev.astype(np.float32)))
test_data = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test.astype(np.float32)))

# dataloaders
batch_size = 25

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)
dev_loader = DataLoader(dev_data, shuffle=True, batch_size=batch_size)

In [8]:
class SentimentRNN(nn.Module):
    def __init__(self,num_layers,vocab_size,hidden_dim,embedding_dim,drop_prob=0.5):
        super(SentimentRNN,self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        
        # embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # LSTM layers
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, dropout=drop_prob, batch_first=True)

        # dropout layer
        self.dropout_layer = nn.Dropout(drop_prob)

        # linear and sigmoid layer
        self.linear_layer = nn.Linear(hidden_dim, 1)
        self.activation = nn.Sigmoid()



    def forward(self,x,hidden):

        # embeddings
        embedded = self.embedding(x)
        
        # LSTM
        lstm_out, hidden = self.lstm(embedded, hidden)

        # stack up LSTM outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # dropout and fully connected layer
        out = self.dropout_layer(lstm_out)
        out = self.linear_layer(out)

        # sigmoid function
        sig_out = self.activation(out)

        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)

        # get the last batch of labels
        sig_out = sig_out[:, -1]
        
        return sig_out, hidden


    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # initialize hidden state(s) and cell state(s) of LSTM to zero with appropriate dimensions
        h0 = torch.zeros((self.num_layers, batch_size, self.hidden_dim)).to(device)
        c0 = torch.zeros((self.num_layers, batch_size, self.hidden_dim)).to(device)
        hidden = (h0,c0)
        return hidden


In [9]:
num_layers = 4
vocab_size = len(vocab)
embedding_dim = int(32)
output_dim = 1
hidden_dim = 256

model = SentimentRNN(num_layers,vocab_size,hidden_dim,embedding_dim,drop_prob=0.3).to(device)

In [10]:
model

SentimentRNN(
  (embedding): Embedding(32202, 32)
  (lstm): LSTM(32, 256, num_layers=4, batch_first=True, dropout=0.3)
  (dropout_layer): Dropout(p=0.3, inplace=False)
  (linear_layer): Linear(in_features=256, out_features=1, bias=True)
  (activation): Sigmoid()
)

In [11]:
lr=0.0001

# you should use binary cross-entropy as your loss function and Adam optimizer for this task

optimizer = torch.optim.Adam(model.parameters(), lr)
loss_func = nn.BCELoss()

# function to predict accuracy
def acc(preds, labels, threshold=0.5):
    binary_predictions = (preds > threshold).float()
    return torch.tensor(torch.sum(binary_predictions == labels).item() / len(preds))

def mean(listt):
    return sum(listt)/len(listt)

In [12]:
clip = 5
epochs = 10
dev_loss_min = np.Inf
best_epoch = 0

epoch_tr_loss,epoch_dv_loss = [],[]
epoch_tr_acc,epoch_dv_acc = [],[]

for epoch in range(epochs): # Train your model
    train_loss = []
    train_acc = []
    dev_loss = []
    dev_acc = []
    
    for features, target in train_loader:
        features = features.to(device)
        target = target.to(device)
        hidden_state = model.init_hidden(batch_size)
        out, _ = model(features, hidden_state)
        loss = loss_func(out, target)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        optimizer.zero_grad()
        train_loss.append(loss.item())
        train_acc.append(acc(out, target))
    
    for features, target in dev_loader:
        features = features.to(device)
        target = target.to(device)
        hidden_state = model.init_hidden(batch_size)
        out, _ = model(features, hidden_state)
        loss = loss_func(out, target)
        dev_loss.append(loss.item())
        dev_acc.append(acc(out, target))
    
    mean_dev_loss = mean(dev_loss)
    mean_train_loss = mean(train_loss)
    mean_train_acc = mean(train_acc)
    mean_dev_acc = mean(dev_acc)
    
    if (epoch+1)%2==0 or epoch==0:
        print(f'Epoch {epoch+1}')
        print(f'train_loss : {mean_train_loss} dev_loss : {mean_dev_loss}')
        print(f'train_accuracy : {mean_train_acc} dev_accuracy : {mean_dev_acc}')
        print(25*'==')

    # if dev_loss goes less than or equal to dev_loss_min then save your model and update the dev_loss_min

    if mean_dev_loss<dev_loss_min:
        dev_loss_min = mean_dev_loss
        # save model here
        torch.save(model.state_dict(), f'best_model.pth')
        best_epoch = epoch+1

Epoch 1
train_loss : 0.6936050280928612 dev_loss : 0.6949827894568443
train_accuracy : 0.48875001072883606 dev_accuracy : 0.4650000333786011
Epoch 2
train_loss : 0.6941576525568962 dev_loss : 0.6935289204120636
train_accuracy : 0.49187496304512024 dev_accuracy : 0.5200000405311584
Epoch 4
train_loss : 0.6932710977271199 dev_loss : 0.6941909193992615
train_accuracy : 0.5037499666213989 dev_accuracy : 0.45499998331069946
Epoch 6
train_loss : 0.6944630108773708 dev_loss : 0.692916750907898
train_accuracy : 0.4818749725818634 dev_accuracy : 0.5049999952316284
Epoch 8
train_loss : 0.6936004627496004 dev_loss : 0.6921800076961517
train_accuracy : 0.4818749725818634 dev_accuracy : 0.5300000309944153
Epoch 10
train_loss : 0.6931778881698847 dev_loss : 0.6931089609861374
train_accuracy : 0.5081250071525574 dev_accuracy : 0.49500006437301636


In [13]:
print(f"Best Epoch Model = {best_epoch}")

Best Epoch Model = 3


NOTE: your train loss should be smaller than 1 and your train accuracy should be over 75%

In [14]:
# Load the best model
model.load_state_dict(torch.load('best_model.pth'))
model.eval()

test_acc = 0.0
predictions = []
true_labels = []

# Iterate over test data batches
for inputs, labels in test_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    test_h = model.init_hidden(batch_size)
    # Forward pass
    with torch.no_grad():
        outputs, _ = model(inputs, test_h)
    
    # Compute predictions
    predictions.extend(outputs.cpu().numpy())  # Assuming outputs are on CPU
    true_labels.extend(labels.cpu().numpy())  # Assuming labels are on CPU

# Convert predictions and true_labels to numpy arrays
predictions = torch.tensor(predictions)
true_labels = torch.tensor(true_labels)

In [15]:
print(f"F1 Score - {binary_f1_score(predictions, true_labels)}")
metric = BinaryAUROC()
metric.update(predictions, true_labels)
print(f"Area under ROC {metric.compute()}")
acc_metric = BinaryAccuracy()
acc_metric.update(predictions, true_labels)
print(f"Accuracy {metric.compute()}")

F1 Score - 0.6666666865348816
Area under ROC 0.4693013408609739
Accuracy 0.4693013408609739


NOTE: your eval accuracy should be of at least 60%.