# Bidirectional LSTM with word2vec

## Run on Google Colab
To run this notebook on Google Colab, prepare three things:
1. Mount Google Drive.
2. Upload dataManager.py and loadSST.py to the Colab runtime, and upload TempFiles/SST_DataManager to your Google Drive.
3. Update the two paths used by the notebook (search for the variable names dataManger_fp and model_path). Point dataManger_fp at the DataManager file on your Drive, and model_path at the location on Drive where the trained model should be saved.
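
The three steps above could be wired up roughly as follows. This is only a sketch: the folder /content/drive/MyDrive/TempFiles is an assumed location on Drive, not something the notebook prescribes, and the import is guarded so the cell also runs outside Colab.

```python
import os

try:
    # google.colab only exists inside a Colab runtime
    from google.colab import drive
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    # step 1: mount Google Drive (Colab asks for authorization)
    drive.mount("/content/drive")
    # assumed folder on Drive; change it to wherever you uploaded the files
    base_dir = "/content/drive/MyDrive/TempFiles"
else:
    base_dir = "TempFiles"

# step 3: the two paths the rest of the notebook searches for
dataManger_fp = os.path.join(base_dir, "SST_DataManager")
model_path = os.path.join(base_dir, "biLSTM")
print(dataManger_fp, model_path)
```

Step 2 (uploading dataManager.py and loadSST.py) can be done through the Colab file browser.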

## Data Representation
The input to our model is a sentence. Each word is mapped to its word2vec vector of shape (300,), and a whole sentence becomes a float64 array of shape (SEQ_LEN, 300) = (52, 300).   
Loading word2vec requires gensim. gensim 4.3.2 (the current version) runs "from scipy.linalg import get_blas_funcs, triu", but triu was removed from scipy 1.12, and I could not install scipy 1.11 in my original environment. I also tried to download the vector file and write my own load function for it, but there is little information on the internet because everyone uses gensim.  
The solution I settled on is gensim 4.3.2 with python 3.10. With python 3.10 we can install scipy 1.11.0, which solves the import problem. In addition, gensim.downloader.load("word2vec-google-news-300") seems to have stopped working, so we have to download the file manually from
 https://code.google.com/archive/p/word2vec/    
 (1.5 GB), unzip it (I am not sure unzipping is necessary), and then load it with KeyedVectors from gensim.models. 

In [2]:
import pickle

def save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)

In [3]:
import os

import gensim.downloader
from gensim.models import KeyedVectors
import numpy as np

W2V_EMBEDDING_DIM = 300
SEQ_LEN = 52

def load_word2vec():
    # word2vec_model = gensim.downloader.load("word2vec-google-news-300")
    # the call above no longer seems to work, so load the manually downloaded binary file instead
    word2vec_file = 'TempFiles/GoogleNews-vectors-negative300.bin'
    word2vec_model = KeyedVectors.load_word2vec_format(word2vec_file, binary=True)
    return word2vec_model

def create_or_load_slim_w2v(words_list, cache_w2v=True):
    """
    We are trying to get a smaller word2vec dictionary: word2vec dict only for words which appear in the training dataset.
    :param words_list: list of words to use for the w2v dict
    :param cache_w2v: whether to save locally the small w2v dictionary
    :return: dictionary which maps the known words to their vectors
    """
    w2v_path = "TempFiles/w2v_dict.pkl"
    if not os.path.exists(w2v_path):
        full_w2v = load_word2vec()
        w2v_emb_dict = {k: full_w2v[k] for k in words_list if k in full_w2v}
        if cache_w2v:
            save_pickle(w2v_emb_dict, w2v_path)
    else:
        w2v_emb_dict = load_pickle(w2v_path)
    return w2v_emb_dict


def sentence_to_embedding(sent, word_to_vec, seq_len=SEQ_LEN, embedding_dim=W2V_EMBEDDING_DIM):
    """
    Map a sentence to a fixed-size array holding the word embeddings of its tokens.
    Sentences longer than seq_len are truncated and shorter ones are padded with zero rows.
    Words that are missing from word_to_vec keep an all-zero row.
    :param sent: a list of words (strings)
    :param word_to_vec: a word to vector mapping.
    :param seq_len: the fixed length to which the sentence is mapped.
    :param embedding_dim: the dimension of the w2v embedding
    :return: numpy ndarray of shape (seq_len, embedding_dim) with the representation of the sentence
    """
    sentence_embedding = np.zeros((seq_len, embedding_dim))
    for i in range(min(len(sent), seq_len)):
        word = sent[i]
        try:
            sentence_embedding[i] = word_to_vec[word]
        except KeyError:
            # unknown word: leave its row as zeros
            pass
    return sentence_embedding
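
To make the padding and truncation concrete, here is a small sketch with made-up 3-dimensional vectors instead of the real 300-dimensional word2vec ones. The helper embed_sentence is a compact copy of sentence_to_embedding, repeated only so the snippet runs on its own; the toy dictionary and sentences are invented for illustration.

```python
import numpy as np


def embed_sentence(tokens, word_to_vec, seq_len, embedding_dim):
    """Compact stand-in for sentence_to_embedding (same padding/truncation rule)."""
    out = np.zeros((seq_len, embedding_dim))
    for i, token in enumerate(tokens[:seq_len]):
        if token in word_to_vec:
            out[i] = word_to_vec[token]
    return out


# toy "embeddings", made up for this example
toy_w2v = {
    "good": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "movie": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

# 3 tokens, one of them unknown, padded up to 5 rows
short = embed_sentence(["good", "unknownword", "movie"], toy_w2v, seq_len=5, embedding_dim=3)
# 8 tokens, truncated down to 5 rows
long = embed_sentence(["good"] * 8, toy_w2v, seq_len=5, embedding_dim=3)

print(short.shape, short.dtype)  # the array is float64 because np.zeros defaults to float64
print(short)
print(long.shape)
```

The unknown word and the padding positions are both all-zero rows, so the model cannot tell them apart from this representation alone.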

## Data Manager

In [4]:
from loadSST import SentimentTreeBank
from dataManager import DataManager, TRAIN, VAL, TEST
def load_data_manager(dataManger_fp):
    # reuse the cached DataManager if it already exists
    if os.path.exists(dataManger_fp):
        return load_pickle(dataManger_fp)
    # load the dataset
    dataset = SentimentTreeBank()
    # the function that maps a sentence to its representation is sentence_to_embedding
    sent_func = sentence_to_embedding
    # besides the sentence itself it takes: word_to_vec, embedding_dim, seq_len
    # initialize the dictionary that maps a word to its word2vec vector
    words_list = list(dataset.get_word_counts().keys())
    word2Vec_dic = create_or_load_slim_w2v(words_list)
    # the embedding size of word2vec is 300
    sent_func_kwargs = {"word_to_vec": word2Vec_dic, "embedding_dim": W2V_EMBEDDING_DIM, "seq_len": SEQ_LEN}
    # pass everything to the DataManager, with batch_size 50
    data_manager = DataManager(use_sub_phrases=False,
                               sentiment_dataset=dataset,
                               sent_func=sent_func, sent_func_kwargs=sent_func_kwargs,
                               batch_size=50)
    # cache it at the same path that is checked above
    save_pickle(data_manager, dataManger_fp)
    return data_manager

## Training
We will use a bidirectional LSTM architecture.

Define the model.  
Regarding the LSTM:  
If we pass in a sentence as a list of word vectors, each layer processes it recurrently, one word at a time, and h_n and c_n are the hidden state and cell state after the final word. This is easier to see in a diagram. I used to confuse "a layer of the LSTM" with "the LSTM cell applied to each word". One layer handles the entire sentence and ends up with a single h_n for that sentence. At every word the cell produces a hidden state, and that hidden state is fed back into the same layer at the next word's step.  
Also, bidirectional does not mean two LSTM layers stacked on top of each other (diagrams can make it look that way). It is a single layer that reads the sentence (the list of words) once in forward order and once in reverse order, with a separate set of weights for each direction. So each layer produces two (h_n, c_n) pairs.  
On the other hand, if we pass a sentence in as one single vector (the average word2vec), it goes through the LSTM only once, because it is effectively a one-word sentence. In that case the recurrence is not used at all.
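
The shapes involved are easy to check on a tiny example. The sketch below uses small made-up sizes (not the 52 x 300 inputs of this notebook) and the default zero initial state. It also checks how the per-word output lines up with h_n for the two directions of the last layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# small made-up sizes, chosen only to keep the example fast
batch_size, seq_len, input_dim, hidden_dim, n_layers = 4, 7, 10, 5, 2

lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, num_layers=n_layers,
               bidirectional=True, batch_first=True)
x = torch.randn(batch_size, seq_len, input_dim)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (batch_size, seq_len, 2 * hidden_dim): every time step, last layer only
print(h_n.shape)     # (n_layers * 2, batch_size, hidden_dim): every layer, final step only

# h_n is ordered as [layer0 forward, layer0 backward, layer1 forward, layer1 backward]
forward_final = h_n[-2]   # last layer, forward direction, after reading the whole sentence
backward_final = h_n[-1]  # last layer, backward direction, after reading the whole sentence

# the forward half finishes at the last position, the backward half finishes at position 0
fwd_matches = torch.allclose(output[:, -1, :hidden_dim], forward_final)
bwd_matches = torch.allclose(output[:, 0, hidden_dim:], backward_final)
# the backward half at the last position has only seen one word, so it does not match
naive_matches = torch.allclose(output[:, -1, hidden_dim:], backward_final)
print(fwd_matches, bwd_matches, naive_matches)
```

This is why taking the output at the last position alone is not a good sentence embedding for a bidirectional LSTM.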

In [5]:
import torch
import torch.nn as nn
class LSTM(nn.Module):
    """
    An LSTM for sentiment analysis with architecture as described in the exercise description.
    """

    def __init__(self, embedding_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.LSTM = nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=n_layers,
            bidirectional=True,
            batch_first=True,
            dropout=dropout,
            dtype=torch.float64
        )
        self.linear = nn.Linear(in_features=hidden_dim * 2, out_features=1, dtype=torch.float64)
        self.relu = nn.ReLU()

    def forward(self, text):
        """
        :param text: tensor of shape (batch_size, seq_len, embedding_dim), here (batch_size, 52, 300):
        52 words, each with a 300-dim embedding. (With avg_word2vec it would probably be (batch_size, 300).)
        :return: tensor of shape (batch_size, 1) with the raw (pre-sigmoid) score of each sentence

        Sam's note:
        Regarding the output of the LSTM:
        h_n and c_n (hidden state and cell state): both of shape (num_layers * num_directions, batch_size, hidden_size)
            num_layers: how many LSTM layers are stacked; each layer has its own final h_n/c_n
            num_directions: 2, because the LSTM is bidirectional; each direction has its own final h_n/c_n,
            and they are stacked along the first dimension.
        Note that batch_size is not the first dimension here, even with batch_first=True.

        output_of_lstm: shape (batch_size, seq_len, hidden_size * num_directions)
            seq_len: number of words, fixed to 52
            hidden_size: the size of h_n, we set it to 100
            num_directions: 2; the hidden states of the two directions are concatenated along the last dimension
        If a sentence is represented as one single vector, seq_len is effectively 1.

        output_of_lstm records the hidden state at every one of the 52 time steps, but only for the last layer.
        h_n records the hidden state of every layer, but only at the final time step of each direction.
        With two directions, output_of_lstm concatenates them and h_n stacks them on top of each other.
        (Three ideas are involved: layers, recurrence over words, and directions. A picture helps.)

        So it is tempting to use output_of_lstm[:, -1, :] as the embedding of the sentence. HOWEVER, this is wrong.
        output_of_lstm is aligned by word position: at position t it concatenates the forward state that has read
        words 0..t with the backward state that has read words seq_len-1 down to t. At the last position the
        forward half has seen the whole sentence, but the backward half has only seen the last word. The backward
        state that has seen the whole sentence sits at position 0. It only makes sense to concatenate
        whole-sentence embedding with whole-sentence embedding, which is what the final states in h_n give us.
        See this discussion:
        https://discuss.pytorch.org/t/bilstm-output-hidden-state-mismatch/49825/2
        So the idea is
        output[:, -1, :hidden_dim] == h_n[-2]  # final hidden state of the last layer, forward direction
        output[:, 0, hidden_dim:] == h_n[-1]   # final hidden state of the last layer, backward direction

        """
        # the last batch might not contain 50 samples
        batch_size = text.shape[0]
        # random initial h_0, c_0 (omitting them would make nn.LSTM start from zeros);
        # sizes and device are read from the layer and the input instead of hard-coded values and a global
        state_shape = (self.LSTM.num_layers * 2, batch_size, self.LSTM.hidden_size)
        h0 = torch.randn(state_shape, dtype=torch.float64, device=text.device)
        c0 = torch.randn(state_shape, dtype=torch.float64, device=text.device)
        output_of_lstm, (h_n, c_n) = self.LSTM(text, (h0, c0))
        # last_output = output_of_lstm[:, -1, :]  # wrong: the backward half has only seen the last word
        # last_output = torch.cat((output_of_lstm[:, -1, :100], output_of_lstm[:, 0, 100:]), dim=1)  # same as below
        last_output = torch.cat((h_n[-2, :, :], h_n[-1, :, :]), dim=1)
        # last_output = self.relu(last_output)
        return self.linear(last_output)
    
    def predict(self, text):
        """
        Sam's note: self(text) returns the raw score (logit) of the model. At prediction time we add a sigmoid
        on top of it to map the score to a probability in [0, 1].
        :param text: same input as forward
        :return: tensor of shape (batch_size, 1) with probabilities
        """
        return torch.sigmoid(self(text))
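
Since forward returns raw scores and predict returns their sigmoid, it helps to see how the two relate to the loss. The sketch below uses made-up logits and targets. It checks that the loss computed from logits equals the plain binary cross entropy of the sigmoid outputs, and that torch.logit maps probabilities back to logits.

```python
import torch
import torch.nn.functional as F

# made-up raw scores (logits) and binary targets, shaped like the model output: (batch_size, 1)
logits = torch.tensor([[2.0], [-1.0], [0.5]], dtype=torch.float64)
targets = torch.tensor([[1.0], [0.0], [0.0]], dtype=torch.float64)

# option 1: hand the raw scores to the "with logits" loss, which applies the sigmoid internally
loss_from_logits = F.binary_cross_entropy_with_logits(logits, targets)

# option 2: apply the sigmoid ourselves and use the plain binary cross entropy
probs = torch.sigmoid(logits)
loss_from_probs = F.binary_cross_entropy(probs, targets)

same_loss = torch.allclose(loss_from_logits, loss_from_probs)

# torch.logit is the inverse of the sigmoid, so it recovers the raw scores from probabilities
recovered = torch.logit(probs)
round_trip = torch.allclose(recovered, logits)
print(same_loss, round_trip)
```

Both routes give the same value for moderate scores like these. Passing the raw scores directly is the simpler choice and avoids the precision loss of a sigmoid that saturates near 0 or 1.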

Next, the functions for training a batch and an epoch.

In [6]:
def binary_accuracy(preds, y):
    """
    This method returns the accuracy of the predictions, relative to the labels.
    Both arguments are torch tensors and must have the same shape.
    :param preds: a tensor of predicted probabilities in [0, 1]
    :param y: a tensor of true labels
    :return: scalar value - (<number of accurate predictions> / <number of examples>)
    """
    number_of_accurate_predictions = (torch.round(preds) == y).sum()
    number_of_examples = y.shape[0]
    return (number_of_accurate_predictions / number_of_examples).item()
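
A quick check with made-up numbers shows why the shapes of preds and y have to agree, which is the reason train_batch reshapes y. The function is repeated here in compact form only so the snippet runs on its own.

```python
import torch


def binary_accuracy(preds, y):
    # compact copy of the function above
    return ((torch.round(preds) == y).sum() / y.shape[0]).item()


probs = torch.tensor([[0.9], [0.2], [0.6]])  # shape (3, 1), like the model output
labels = torch.tensor([1.0, 0.0, 0.0])       # shape (3,), like the labels in a batch

# with matching shapes: rounding gives [1, 0, 1], so 2 of the 3 predictions are correct
aligned = binary_accuracy(probs, labels.reshape(probs.shape))

# without the reshape, (3, 1) == (3,) broadcasts to a (3, 3) comparison matrix,
# so the count of "correct" entries no longer means anything
broadcast_shape = (torch.round(probs) == labels).shape
print(aligned, broadcast_shape)
```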



In [7]:
def train_batch(model, optimizer, criterion, batch, device):
    """
    Sam's note: 
    All parameters of nn.LSTM and nn.Linear are created with requires_grad=True, so loss.backward() computes their gradients and optimizer.step() updates them. We only use standard layers from torch.nn, so there is nothing to worry about here. 
    If we later add our own learnable tensor, we would need to set requires_grad=True on it (or wrap it in nn.Parameter). Here is the code to check that:
    
    # Check if the model parameters have requires_grad=True
    for name, param in model.named_parameters():
        print(f'{name}: requires_grad={param.requires_grad}')
        
    :param model: the model being trained
    :param optimizer: the optimizer that updates the model parameters
    :param criterion: loss function that takes raw scores (logits) and targets
    :param batch: a list of two tensors [X, y]; X has shape (batch_size, seq_len, embedding_dim), y has shape (batch_size,)
    :param device: the device to run on
    :return: (loss * batch_size, accuracy * batch_size) for this batch
    """
    # reset the gradients before every backward pass instead of accumulating them over the whole epoch
    optimizer.zero_grad()  # similar to model.zero_grad()
    # move the tensors to the device and cast them to the correct type
    X = batch[0].to(device).to(torch.float64)
    y = batch[1].to(device).to(torch.float64)
    # forward pass: the LSTM followed by the linear layer gives one raw score (logit) per sentence
    y_pred = model(X)
    # probabilities for the accuracy, taken from the same forward pass; calling model.predict(X) would
    # run a second forward pass with different random initial states
    y_pred_sigmoid = torch.sigmoid(y_pred.detach())
    # the prediction has shape (batch_size, 1) but y has shape (batch_size,), so reshape y to match
    y = torch.reshape(y, y_pred.shape)
    # binary_cross_entropy_with_logits applies the sigmoid internally, so it takes the raw scores
    loss = criterion(input=y_pred, target=y)
    # backpropagate the loss and update the parameters of the model
    loss.backward()
    optimizer.step()
    # the accuracy uses the sigmoid output, which lies in [0, 1]; the linear output does not guarantee that.
    # Loss measures how far the scores are from the labels; accuracy only counts correct predictions.
    accuracy_value = binary_accuracy(y_pred_sigmoid, y=y)
    loss_value = loss.item()
    batch_size = batch[0].shape[0]
    # Why multiply by batch size? See the explanation in train_epoch.
    return loss_value * batch_size, accuracy_value * batch_size

def train_epoch(model, data_iterator, optimizer, criterion, device):
    """
    This method operates one epoch (pass over the whole train set) of training of the given model,
    and returns the accuracy and loss for this epoch
    Assume model has method predict.
    :param model: the model we're currently training
    :param data_iterator: an iterator, iterating over the training data for the model.
    :param optimizer: the optimizer object for the training process.
    :param criterion: the criterion object for the training process.
    """
    total_loss, total_accuracy = 0, 0
    total_sample_size = 0
    for batch in data_iterator:
        batch_size = batch[0].shape[0]
        total_sample_size += batch_size
        loss, accuracy = train_batch(model, optimizer, criterion, batch, device)
        total_loss += loss
        total_accuracy += accuracy
    # we divide by the total sample size because the last batch might be smaller than the normal batch size:
    # train_batch multiplies loss and accuracy by the batch size, and here we divide by the total size.
    return total_loss / total_sample_size, total_accuracy / total_sample_size
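
The effect of weighting by batch size is easiest to see with numbers. The batch sizes and per-batch mean losses below are made up; the last batch is smaller than the others, as can happen at the end of an epoch.

```python
# (batch_size, mean loss of that batch), invented values
batches = [(50, 0.40), (50, 0.60), (10, 1.00)]

total_samples = sum(n for n, _ in batches)

# what train_epoch computes: sum of (mean loss * batch size), divided by the number of samples
weighted_mean = sum(n * loss for n, loss in batches) / total_samples

# naive alternative: average the per-batch means, which over-weights the small last batch
naive_mean = sum(loss for _, loss in batches) / len(batches)

print(weighted_mean)  # (20 + 30 + 10) / 110
print(naive_mean)     # (0.4 + 0.6 + 1.0) / 3
```

The weighted version equals the mean loss over all 110 samples, while the naive version lets the 10-sample batch count as much as a 50-sample one.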



## Evaluation

In [8]:
def eval_batch(model, criterion, batch, device):
    """
    Sam's note:
    Inside torch.no_grad(), pytorch does not record the operations applied to tensors with requires_grad=True
    (usually the parameter tensors of the model), so no computation graph is built during the forward pass.
    There is no need for loss.backward() either. backward() walks the graph that was recorded during the
    forward pass and stores a gradient on every tensor with requires_grad=True that took part in it;
    optimizer.step() then uses those gradients. Recording the graph costs memory and computation, which is
    why we use torch.no_grad() here.
    The only difference between eval_batch and train_batch is that we do not update the parameters. We could
    have put them into the same function with a flag argument to turn the update on or off.
    :param model: the model to evaluate
    :param criterion: loss function that takes raw scores (logits) and targets
    :param batch: a list of two tensors [X, y]
    :param device: the device to run on
    :return: (loss * batch_size, accuracy * batch_size) for this batch
    """
    with torch.no_grad():
        X = batch[0].to(device).to(torch.float64)
        y = batch[1].to(device).to(torch.float64)
        # Stay consistent with train_batch: model(X) gives the raw scores for the loss, and the sigmoid
        # of those scores (which is what model.predict adds) gives the probabilities for the accuracy.
        y_pred = model(X)
        y = torch.reshape(y, y_pred.shape)
        loss = criterion(input=y_pred, target=y)
        # compute loss and accuracy from the same forward pass
        accuracy_value = binary_accuracy(preds=torch.sigmoid(y_pred), y=y)
        loss_value = loss.item()
        batch_size = batch[0].shape[0]
        return loss_value * batch_size, accuracy_value * batch_size

def evaluate(model, data_iterator, criterion, device):
    """
    Evaluate the model performance on the given data.
    Sam's note: this mirrors train_epoch; the only difference is that there are no optimizer calls.
    :param model: one of our models.
    :param data_iterator: torch data iterator for the relevant subset
    :param criterion: the loss criterion used for evaluation
    :param device: the device to run on
    :return: tuple of (average loss over all examples, average accuracy over all examples)
    """
    # torch.no_grad() is already applied inside eval_batch, so it is not needed here.
    # Switch off dropout while evaluating, then restore whatever mode the model was in.
    was_training = model.training
    model.eval()
    total_loss, total_accuracy = 0, 0
    total_sample_size = 0
    for batch in data_iterator:
        batch_size = batch[0].shape[0]
        total_sample_size += batch_size
        loss, accuracy = eval_batch(model, criterion, batch, device)
        total_loss += loss
        total_accuracy += accuracy
    model.train(was_training)
    return total_loss / total_sample_size, total_accuracy / total_sample_size
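
Two separate switches matter at evaluation time: torch.no_grad() stops gradient tracking, and the train/eval mode of the model controls dropout. The sketch below shows both on a small stand-in network (a made-up nn.Sequential with a Dropout layer, not the LSTM from this notebook).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# stand-in network with dropout, invented for this example
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))
x = torch.randn(8, 4)

# normal forward pass: the result is attached to a graph so that backward() can run
tracked = net(x)

# forward pass under no_grad: same computation, but no graph is recorded
with torch.no_grad():
    untracked = net(x)

print(tracked.requires_grad, untracked.requires_grad)

# no_grad does not change dropout; that is controlled by the mode of the module
net.eval()
with torch.no_grad():
    eval_a = net(x)
    eval_b = net(x)
net.train()  # switch back before training continues

# in eval mode dropout is a no-op, so repeated passes give identical outputs
deterministic_in_eval = torch.equal(eval_a, eval_b)
print(deterministic_in_eval)
```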

## Put Everything Together

In [9]:
dataManger_fp = "TempFiles/SST_DataManager"
data_manager = load_data_manager(dataManger_fp)

In [10]:
import torch.nn.functional as F
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTM(embedding_dim=W2V_EMBEDDING_DIM, hidden_dim=100, n_layers=2, dropout=0.1)
# 5 things could be on the GPU: model, data, loss function, optimizer, other tensors. The model and the data
# must be on the same device. The rest is optional.
model.to(device)
# Train: data_manager is already initialized above; train_data_iterator is a pytorch DataLoader
train_data_iterator = data_manager.get_torch_iterator(data_subset=TRAIN)
# Define hyperparameters
n_epochs, lr, weight_decay = 20, 0.001, 0.0001
optimizer = torch.optim.Adam(params=model.parameters(), lr=lr, weight_decay=weight_decay)
criterion = F.binary_cross_entropy_with_logits # criterion = nn.BCEWithLogitsLoss()
# train, and evaluate on the validation set after every epoch
for i in range(n_epochs):
    loss, accuracy = train_epoch(model, train_data_iterator, optimizer, criterion, device)
    print(f"epoch {i}, loss: {loss}, accuracy: {accuracy}")
    # evaluate the model on the validation set in the same way
    evaluate_data_iterator = data_manager.get_torch_iterator(data_subset=VAL)
    loss, accuracy = evaluate(model, evaluate_data_iterator, criterion, device)
    print(f"Evaluate: loss: {loss}, accuracy: {accuracy}")

epoch 0, loss: 0.5442603087350474, accuracy: 0.7045365925977557
Evaluate: loss: 0.4468166540254727, accuracy: 0.8024948058405934
epoch 1, loss: 0.42242414088461655, accuracy: 0.8099570989407319
Evaluate: loss: 0.4423285600624856, accuracy: 0.7993762995002176
epoch 2, loss: 0.3978262923801597, accuracy: 0.8228259412718272
Evaluate: loss: 0.4463578518635698, accuracy: 0.7910602898211092
epoch 3, loss: 0.37624505978765116, accuracy: 0.8354348043045452
Evaluate: loss: 0.41714132312072527, accuracy: 0.800415798804864
epoch 4, loss: 0.35961548348482675, accuracy: 0.8421941999649639
Evaluate: loss: 0.42930942831994146, accuracy: 0.8035342989493308
epoch 5, loss: 0.3426957131449964, accuracy: 0.8548030630209256
Evaluate: loss: 0.41576262710269485, accuracy: 0.8128898055786402
epoch 6, loss: 0.3280577164105664, accuracy: 0.8628623346567743
Evaluate: loss: 0.41022538540494063, accuracy: 0.8139293087247503
epoch 7, loss: 0.29833801986561603, accuracy: 0.8730014256254588
Evaluate: loss: 0.40156864

In [13]:
# evaluate the model on the test set in the same way
evaluate_data_iterator = data_manager.get_torch_iterator(data_subset=TEST)
loss, accuracy = evaluate(model, evaluate_data_iterator, criterion, device)
print(f"Test: loss: {loss}, accuracy: {accuracy}")

Test: loss: 0.6678287383269238, accuracy: 0.8191268182593918


In [14]:
# save trained model
model_path = 'TempFiles/biLSTM'
torch.save({'epoch': n_epochs,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict()},
           model_path)
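
A checkpoint saved in this dictionary format can be restored along the following lines. This is a sketch only: a small nn.Linear stands in for the trained LSTM, and the checkpoint is written to a temporary path invented for the example. With the real model you would rebuild LSTM(...) with the same constructor arguments before loading.

```python
import os
import tempfile

import torch
import torch.nn as nn

# stand-in for the trained model and its optimizer
net = nn.Linear(3, 1)
opt = torch.optim.Adam(net.parameters(), lr=0.001)

# hypothetical path, used only for this example
ckpt_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint")
torch.save({"epoch": 20,
            "model_state_dict": net.state_dict(),
            "optimizer_state_dict": opt.state_dict()},
           ckpt_path)

# rebuild objects with the same architecture, then load the saved states into them
restored = nn.Linear(3, 1)
restored_opt = torch.optim.Adam(restored.parameters(), lr=0.001)
checkpoint = torch.load(ckpt_path, map_location="cpu")
restored.load_state_dict(checkpoint["model_state_dict"])
restored_opt.load_state_dict(checkpoint["optimizer_state_dict"])
restored.eval()  # eval mode for inference; call restored.train() to resume training

weights_match = torch.equal(net.weight, restored.weight)
print(checkpoint["epoch"], weights_match)
```

Passing map_location="cpu" lets a checkpoint that was saved on a GPU be loaded on a machine without one.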