1. Make sure you fill in all cells contain YOUR CODE HERE or YOUR ANSWER HERE.
2. After you finished, Restart the kernel & run all cell in order.

# Project II: Text Classification Using LSTM Network
## Deadline: Nov 14, 11:59 pm

You have learned about the basics of neural network training and testing during the class. Now let's move forward to the text classification tasks using simple LSTM networks! In this project, you need to implement two parts:

- **Part I: Building vocabulary for LSTM network**
    - Get familiar with discrete text data processing for neural networks. Building vocabulary by yourself.


- **Part II: Implementing your own LSTM Neural Network**
    - Learn to implement your own LSTM network and aims for a strong performance on the given text classification task.
    - Note that you need to implement the LSTM network manually, any kind of integrated package invoking will get 0 points.
    - Your LSTM network can be 2-4 layers.
    - Expected Accuracy: >=65%.
    ![](./LSTM.png)
    

Let's get started!

In [1]:
import torch
import pandas as pd

# nlp library of Pytorch
from torchtext import data
#from torchtext.legacy import data
#from torchtext.legacy import data
#from torchtext.legacy import datasets

import warnings as wrn
wrn.filterwarnings('ignore')
SEED = 2021

torch.manual_seed(SEED)
torch.backends.cuda.deterministic = True

In [2]:
data_ = pd.read_csv('./sms_spam.csv')
data_.head()
data_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    5574 non-null   object
 1   text    5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [3]:
# Field is a normal column 
# LabelField is the label column.

import spacy
nlp = spacy.load("en_core_web_lg")
def tokenizer(text): # create a tokenizer function
    return [tok.text for tok in nlp.tokenizer(text)]

TEXT = data.Field(tokenize=tokenizer,batch_first=True,include_lengths=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True)

In [4]:
fields = [("type",LABEL),('text',TEXT)]

In [5]:
training_data = data.TabularDataset(path="./sms_spam.csv",
                                    format="csv",
                                    fields=fields,
                                    skip_header=True
                                   )

print(vars(training_data.examples[0]))

{'type': 'ham', 'text': ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']}


In [6]:
import random
# train and validation splitting
train_data,valid_data = training_data.split(split_ratio=0.75,
                                            random_state=random.seed(SEED))

#### Question 1 (5 points)
Implement the vocabulary building and the text to label part for training.

In [7]:
#implement Question1 here:
#Building vocabularies => (Token to integer)
#you can use the data package built-in function to build the vocabulary, check the 'torchtext data' doc.

TEXT.build_vocab(train_data, min_freq = 2)
LABEL.build_vocab(train_data, min_freq = 2)

In [8]:
print("Size of text vocab:",len(TEXT.vocab))
print("Size of label vocab:",len(LABEL.vocab))
TEXT.vocab.freqs.most_common(10)


Size of text vocab: 4364
Size of label vocab: 2


[('.', 3658),
 ('to', 1615),
 ('I', 1478),
 (',', 1461),
 ('you', 1383),
 ('?', 1086),
 ('!', 1019),
 ('a', 1003),
 ('the', 882),
 ('...', 869)]

In [9]:
device = torch.device("cuda")

BATCH_SIZE = 64

# We'll create iterators to get batches of data when we want to use them
"""
This BucketIterator batches the similar length of samples and reduces the need of 
padding tokens. This makes our future model more stable

"""
train_iterator,validation_iterator = data.BucketIterator.splits(
    (train_data,valid_data),
    batch_size = BATCH_SIZE,
    # Sort key is how to sort the samples
    sort_key = lambda x:len(x.text),
    sort_within_batch = True,
    device = device
)

#### Question 2 (25 points)
You need to implement the embedding layer and the LSTM cell according to the given architecture, but you are not allowed to use any integrated package!
LSTM tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
![](./LSTM_CELL.png)

In [10]:
import torch.nn as nn
import math

class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size, bidirectional):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirectional = bidirectional
        self.setWeights(self.input_size, self.hidden_size)
        
    def setWeights(self, input_size, hidden_size):
        # Input Gate
        self.W_i = torch.rand(input_size, hidden_size).to(device)
        self.U_i = torch.rand(hidden_size, hidden_size).to(device)
        self.b_i = torch.rand(hidden_size).to(device)
        
        # Forget Gate
        self.W_f = torch.rand(input_size, hidden_size).to(device)
        self.U_f = torch.rand(hidden_size, hidden_size).to(device)
        self.b_f = torch.rand(hidden_size).to(device)
        
        # Cell Gate
        self.W_c = torch.rand(input_size, hidden_size).to(device)
        self.U_c = torch.rand(hidden_size, hidden_size).to(device)
        self.b_c = torch.rand(hidden_size).to(device)
        
        # Output Gate
        self.W_o = torch.rand(input_size, hidden_size).to(device)
        self.U_o = torch.rand(hidden_size, hidden_size).to(device)
        self.b_o = torch.rand(hidden_size).to(device)
        
    def forward(self, x):
        batch_size = x.size(0)
        sequence_length = x.size(1)
        hidden_sequence = []
        
        hx = torch.zeros(batch_size, self.hidden_size).to(device)
        cx = torch.zeros(batch_size, self.hidden_size).to(device)            
        
        for t in range(sequence_length):
            # Get sequence
            x_t = x[:, t, :]
        
            # Equations for each gate
            forget_gate = torch.sigmoid(torch.mm(x_t, self.W_f) + torch.mm(hx, self.U_f) + self.b_f)
            input_gate = torch.sigmoid(torch.mm(x_t, self.W_i) + torch.mm(hx, self.U_i) + self.b_i)
            cell_gate = torch.tanh(torch.mm(x_t, self.W_c) + torch.mm(hx, self.U_c) + self.b_c)
            output_gate = torch.sigmoid(torch.mm(x_t, self.W_o) + torch.mm(hx, self.U_o) + self.b_o)
            # Updated cell state and final output
            cy = forget_gate * cx + input_gate * cell_gate
            hy = output_gate * torch.tanh(cy)
            
            hy = hy.unsqueeze(0)
            hidden_sequence.append(hy)

        # If bidirectional reverse the sequence and continue to add new weights 
        if self.bidirectional:
            self.setWeights(self.input_size, self.hidden_size)
            hx = torch.zeros(batch_size, self.hidden_size).to(device)
            cx = torch.zeros(batch_size, self.hidden_size).to(device)  
            for t in range(sequence_length):
                # Reverse Sequence
                x_t = x[:, t, :].flip(1)

                # Equations for each gate
                forget_gate = torch.sigmoid(torch.mm(x_t, self.W_f) + torch.mm(hx, self.U_f) + self.b_f)
                input_gate = torch.sigmoid(torch.mm(x_t, self.W_i) + torch.mm(hx, self.U_i) + self.b_i)
                cell_gate = torch.tanh(torch.mm(x_t, self.W_c) + torch.mm(hx, self.U_c) + self.b_c)
                output_gate = torch.sigmoid(torch.mm(x_t, self.W_o) + torch.mm(hx, self.U_o) + self.b_o)
                # Updated cell state and final output
                cy = forget_gate * cx + input_gate * cell_gate
                hy = output_gate * torch.tanh(cy)
                
                hy = hy.unsqueeze(0)
                hidden_sequence.append(hy)

        hidden_sequence = torch.cat(hidden_sequence, 0)
        hidden_sequence = hidden_sequence.transpose(0, 1)
        return hidden_sequence

In [11]:
class LSTMNet(nn.Module):
    
    def __init__(self,vocab_size,embedding_dim,hidden_dim,output_dim,n_layers,bidirectional,dropout):
        super(LSTMNet,self).__init__()
        # In this class, you need to implement the architecture of an LSTM network, the architecture should include:
        
        # 1. Embedding layer converts integer sequences to vector sequences
        self.EmbeddedLayer = nn.Embedding(vocab_size, embedding_dim)

        # 2. LSTM layer process the vector sequences 
        self.LSTMLayers = []
        for i in range(n_layers):
            if(i == 0):
                if(bidirectional):
                    self.LSTMLayers.append(nn.Sequential(nn.Dropout(p = dropout),
                                                     LSTMCell(embedding_dim, 2*hidden_dim, bidirectional)))
                else:
                    self.LSTMLayers.append(nn.Sequential(nn.Dropout(p = dropout),
                                                     LSTMCell(embedding_dim, hidden_dim, bidirectional)))
            else:
                if(bidirectional):
                    self.LSTMLayers.append(nn.Sequential(nn.Dropout(p = dropout),
                                                     LSTMCell(2*hidden_dim, 2*hidden_dim, bidirectional)))
                else:
                    self.LSTMLayers.append(nn.Sequential(nn.Dropout(p = dropout),
                                                     LSTMCell(hidden_dim, hidden_dim, bidirectional)))

        # 3. Dense layer to predict
        if(bidirectional):
            self.DenseLayer = nn.Linear(2 * hidden_dim, output_dim)
        else:
            self.DenseLayer = nn.Linear(hidden_dim, output_dim)

        # 4. Prediction activation function (you can choose your own activate function e.g., ReLU, Sigmoid, Tanh)
        self.ActivationLayer = nn.Sigmoid()
        
    def forward(self,text,text_lengths):
        output = self.EmbeddedLayer(text)
        for cell in self.LSTMLayers:
            output = cell(output)
        output = (output[:, -1, :])
        output = self.DenseLayer(output)
        output = self.ActivationLayer(output)
        return output

In [12]:
SIZE_OF_VOCAB = len(TEXT.vocab)
EMBEDDING_DIM = 300
NUM_HIDDEN_NODES = 64
NUM_OUTPUT_NODES = 1
NUM_LAYERS = 2
BIDIRECTION = True
DROPOUT = 0.1

In [13]:
model = LSTMNet(SIZE_OF_VOCAB,
                EMBEDDING_DIM,
                NUM_HIDDEN_NODES,
                NUM_OUTPUT_NODES,
                NUM_LAYERS,
                BIDIRECTION,
                DROPOUT
               )

In [14]:
import torch.optim as optim
model = model.to(device)
optimizer = optim.Adam(model.parameters(),lr=1e-4)
criterion = nn.BCELoss()
criterion = criterion.to(device)

In [15]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(preds)
    
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc

In [16]:
def train(model,iterator,optimizer,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    model.train()
    
    for batch in iterator:
        
        # cleaning the cache of optimizer
        optimizer.zero_grad()
        
        text,text_lengths = batch.text
        
        # forward propagation and squeezing
        predictions = model(text,text_lengths).squeeze()
        
        # computing loss / backward propagation
        loss = criterion(predictions,batch.type)
        loss.backward()
        
        # accuracy
        acc = binary_accuracy(predictions,batch.type)
        
        # updating params
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    # It'll return the means of loss and accuracy
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
        

In [17]:
def evaluate(model,iterator,criterion):
    
    epoch_loss = 0.0
    epoch_acc = 0.0
    
    # deactivate the dropouts
    model.eval()
    
    # Sets require_grad flat False
    with torch.no_grad():
        for batch in iterator:
            text,text_lengths = batch.text
            
            predictions = model(text,text_lengths).squeeze()
              
            #compute loss and accuracy
            loss = criterion(predictions, batch.type)
            acc = binary_accuracy(predictions, batch.type)
            
            #keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [18]:
EPOCH_NUMBER = 15
for epoch in range(1,EPOCH_NUMBER+1):
    
    train_loss,train_acc = train(model,train_iterator,optimizer,criterion)
    
    valid_loss,valid_acc = evaluate(model,validation_iterator,criterion)
    
    # Showing statistics
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
    print()

	Train Loss: 0.757 | Train Acc: 32.77%
	 Val. Loss: 0.663 |  Val. Acc: 87.62%

	Train Loss: 0.602 | Train Acc: 85.39%
	 Val. Loss: 0.533 |  Val. Acc: 88.19%

	Train Loss: 0.517 | Train Acc: 85.87%
	 Val. Loss: 0.466 |  Val. Acc: 88.03%

	Train Loss: 0.472 | Train Acc: 85.94%
	 Val. Loss: 0.428 |  Val. Acc: 87.89%

	Train Loss: 0.447 | Train Acc: 85.80%
	 Val. Loss: 0.411 |  Val. Acc: 87.76%

	Train Loss: 0.432 | Train Acc: 85.49%
	 Val. Loss: 0.400 |  Val. Acc: 87.62%

	Train Loss: 0.429 | Train Acc: 85.13%
	 Val. Loss: 0.394 |  Val. Acc: 87.34%

	Train Loss: 0.425 | Train Acc: 85.37%
	 Val. Loss: 0.388 |  Val. Acc: 87.98%

	Train Loss: 0.422 | Train Acc: 85.39%
	 Val. Loss: 0.390 |  Val. Acc: 87.62%

	Train Loss: 0.426 | Train Acc: 85.32%
	 Val. Loss: 0.387 |  Val. Acc: 87.25%

	Train Loss: 0.417 | Train Acc: 85.35%
	 Val. Loss: 0.385 |  Val. Acc: 87.62%

	Train Loss: 0.413 | Train Acc: 85.20%
	 Val. Loss: 0.387 |  Val. Acc: 87.41%

	Train Loss: 0.413 | Train Acc: 85.18%
	 Val. Loss: 