# Assignment 6
Second Architecure

## Dataset
We will be using previous session tweet dataset. Tweets.csv file is uploaded here. A peek into the dataset is listed.

In [1]:
import pandas as pd
from google.colab import files

In [2]:
upoaded=files.upload()

Saving tweets.csv to tweets.csv


In [3]:
df=pd.read_csv('tweets.csv')

In [4]:
df.shape
df.labels.value_counts()

0    931
1    352
2     81
Name: labels, dtype: int64

## Defining Fields 
Fields of the dataset are defined as<br/>
LABEL as a LabelField<br/>
TWEET is a standard Field object, where we have decided to use the spaCy tokenizer and convert all the text to lower‐ case<br/>



In [5]:
# Import Library
import random
import torch, torchtext
from torchtext.legacy import data 

# Manual Seed
SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7f7a0bea6a10>

Here while defining the Tweet, below, inlcude_lengths is kept True is its needed to loop over the LSTMCells in the decoder. 

In [6]:
Tweet = data.Field(sequential = True, tokenize = 'spacy', batch_first =True, include_lengths=True)
Label = data.LabelField(tokenize ='spacy', is_target=True, batch_first =True, sequential =False)

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

In [7]:
fields = [('tweets', Tweet),('labels',Label)]

Here we convert list to torchtext. 

In [8]:
example = [data.Example.fromlist([df.tweets[i],df.labels[i]], fields) for i in range(df.shape[0])] 

In [9]:
# Creating dataset
twitterDataset = data.Dataset(example, fields)

Finally, we can split into training
 and validation sets by using the split() method:

In [10]:
(train_set, valid_set) = twitterDataset.split(split_ratio=[0.85, 0.15], random_state=random.seed(SEED))

In [11]:
(len(train_set), len(valid_set))

(1159, 205)

## Building Vocabulary

At this point we would have built a one-hot encoding of each word that is present in the dataset. torchtext can now change this one-hot encoding for us by allowing us to pass a max_size parameter to limit the vocabulary to the most common words. 

In [12]:
Tweet.build_vocab(train_set)
Label.build_vocab(train_set)

In [13]:
print('Size of input vocab : ', len(Tweet.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 words appreared repeatedly :', list(Tweet.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  4651
Size of label vocab :  3
Top 10 words appreared repeatedly : [('Obama', 1069), (':', 783), ('#', 780), ('.', 761), (',', 598), ('"', 550), ('the', 542), ('RT', 516), ('?', 419), ('to', 400)]
Labels :  defaultdict(None, {0: 0, 1: 1, 2: 2})


Now we need to create a data loader to feed into our training loop. Torchtext provides the BucketIterator method that will produce what it calls a Batch, which is almost, but not quite, like the data loader we used on images.

But at first declare the device we are using.

In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [15]:
train_iterator, valid_iterator = data.BucketIterator.splits((train_set, valid_set), batch_size = 32, 
                                                            sort_key = lambda x: len(x.tweets),
                                                            sort_within_batch=True, device = device)

Save the vocabulary for later use

In [16]:
import os, pickle
with open('tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Tweet.vocab.stoi, tokens)

## Defining Our Model

We use the Embedding, RNN and LSTMCell modules in PyTorch to build a coder, encoder model for classifying tweets.

In this model we create three layers. 
1. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding. 
2. That’s then fed into an RNN layer with 100 hidden features (again, we’re compressing down from the 300-dimensional ).
3. Finally, the output of the RNN (the final hidden state after processing the incoming tweet) is pushed through another an LSTMCell. This LSTMCell is then looped over the number of words in a sentence of the tweet. The input to the LSTMCell is single vector from the encoder RNN and its own previous hidden state. Final output, which is  a single vector is then pushed to  standard fully connected layer with three outputs to correspond to our three possible classes (negative, positive, or neutral).

In [17]:
import torch.nn as nn
import torch.nn.functional as F

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers=1):
      
      super().__init__()          
      self.hidden_dim=hidden_dim
      self.embedding = nn.Embedding(vocab_size, embedding_dim)
      # RNN layer for encoding
      self.encoder = nn.RNN(embedding_dim,hidden_dim, batch_first=True)
      # LSTMCelllayer for decoding
      self.decoder=nn.LSTMCell(hidden_dim,hidden_dim)
      # Dense layer
      self.fc = nn.Linear(hidden_dim, output_dim)
      
    def forward(self, text, debug = False):
      
      embedded = self.embedding(text)
      #hidden = [batchsize, 1,hidden_dim] as number of layers taken here is 1
      output1, hidden = self.encoder(embedded)    
      #The output1 of encoder is of shape [batchsize, sent_len, hiddenn_dim], contains hidden states from each time_step 
      # the input to the decoder LSTMCell is the final single vector from encoder (final hidden state) and its last hidden state and cell state.
      
      hidden1=hidden.squeeze(dim=0) 
      output2=[]
      for i in range(text.size(1)):
        hidden1,cellstate =self.decoder(hidden1)
        output2.append(hidden1)
      output2=torch.stack(output2, dim=1)
      #The output2 of decoder is of shape [batchsize, sent_len, hiddenn_dim], contains hidden states from each time_step 

      dense_outputs = self.fc(hidden1.squeeze(dim=0))  

      # Final activation function softmax
      output = F.softmax(dense_outputs, dim=0)       

      #Since we want to output at each time step of encoder and decoder, return output1 and output2 as well along with prediction(output)

      return output,output1,output2

In [18]:
# Define hyperparameters
size_of_vocab = len(Tweet.vocab)
embedding_dim = 300
num_hidden_nodes = 100
num_output_nodes = 3
num_layers = 1


# Instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers)

In [19]:
len(train_set)

1159

In [20]:
print(model)

#No. of trianable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

classifier(
  (embedding): Embedding(4651, 300)
  (encoder): RNN(300, 100, batch_first=True)
  (decoder): LSTMCell(100, 100)
  (fc): Linear(in_features=100, out_features=3, bias=True)
)
The model has 1,516,603 trainable parameters


## Model Training and Evaluation

First define the optimizer and loss functions

In [21]:
import torch.optim as optim

# define optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss()

# define metric
def binary_accuracy(preds, y):
    pred_class = preds.argmax(dim=1)
    correct = (pred_class == y).float() 
    acc = correct.sum() / len(correct)
    return acc

In [22]:
# push to cuda if available
model = model.to(device)
criterion = criterion.to(device)

## Training Loop

In [23]:
def model_train(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        # retrieve text
        tweet,tweet_length = batch.tweets   
        
        #Capture predictions and output of encode and decoder at every time step.
        #Although it is not needed while training it is captured to maintain consistency across.

        predictions,o1,o2= model(tweet)
        
        # compute the loss
        #predictions=predictions.squeeze()
        loss = criterion(predictions, batch.labels)   

        #As the model is overfitting to the training data, regularisation is introduced.     
        L2_lambda=0.001
        L2_norm=sum(p.pow(2.0).sum() for p in model.parameters())    
        loss=loss+L2_lambda*L2_norm    

        # compute the binary accuracy
        acc = binary_accuracy(predictions, batch.labels)   
        
        # backpropage the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Evaluation Loop**

In [24]:
def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            tweet,tweet_length  = batch.tweets
            
            predictions,o1,o2= model(tweet)
            # compute loss and accuracy
            loss = criterion(predictions, batch.labels)
            L2_lambda=0.001
            L2_norm=sum(p.pow(2.0).sum() for p in model.parameters())    
            loss=loss+L2_lambda*L2_norm    
        
            acc = binary_accuracy(predictions, batch.labels)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

**Let's Train and Evaluate**

In [25]:
N_EPOCHS = 20
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = model_train(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_valid_acc= valid_acc
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')
print(f'\t best Val. Loss: {best_valid_loss:.3f} |  best Val. Acc: {best_valid_acc*100:.2f}% \n')

	Train Loss: 1389.445 | Train Acc: 66.76%
	 Val. Loss: 1381.004 |  Val. Acc: 68.30% 

	Train Loss: 1373.081 | Train Acc: 68.96%
	 Val. Loss: 1364.757 |  Val. Acc: 68.30% 

	Train Loss: 1356.946 | Train Acc: 68.87%
	 Val. Loss: 1348.740 |  Val. Acc: 68.30% 

	Train Loss: 1341.038 | Train Acc: 68.96%
	 Val. Loss: 1332.945 |  Val. Acc: 68.30% 

	Train Loss: 1325.348 | Train Acc: 69.12%
	 Val. Loss: 1317.365 |  Val. Acc: 68.30% 

	Train Loss: 1309.869 | Train Acc: 69.12%
	 Val. Loss: 1301.993 |  Val. Acc: 68.30% 

	Train Loss: 1294.597 | Train Acc: 69.12%
	 Val. Loss: 1286.824 |  Val. Acc: 68.30% 

	Train Loss: 1279.524 | Train Acc: 69.12%
	 Val. Loss: 1271.852 |  Val. Acc: 68.30% 

	Train Loss: 1264.646 | Train Acc: 69.12%
	 Val. Loss: 1257.073 |  Val. Acc: 68.30% 

	Train Loss: 1249.959 | Train Acc: 69.12%
	 Val. Loss: 1242.481 |  Val. Acc: 68.30% 

	Train Loss: 1235.457 | Train Acc: 69.12%
	 Val. Loss: 1228.073 |  Val. Acc: 68.30% 

	Train Loss: 1221.137 | Train Acc: 69.12%
	 Val. Loss:

## Model Testing

In [26]:
len(train_set)

1159

In [27]:
#load weights and tokenizer

path='./saved_weights.pt'
model.load_state_dict(torch.load(path))
model.eval()
tokenizer_file = open('./tokenizer.pkl', 'rb')
tokenizer = pickle.load(tokenizer_file)
#inference 

import spacy
nlp = spacy.load('en')

def classify_tweet(tweet):
    
    categories = {0: "Negative", 1:"Positive", 2:"Neutral"}
    
    # tokenize the tweet 
    tokenized = [tok.text for tok in nlp.tokenizer(tweet)] 
    # convert to integer sequence using predefined tokenizer dictionary
    indexed = [tokenizer[t] for t in tokenized]     
    # compute no. of words        
    length = [len(indexed)]
    # convert to tensor                                    
    tensor = torch.LongTensor(indexed).to(device)   
    # reshape in form of batch, no. of words           
    tensor = tensor.unsqueeze(1).T  
    # convert to tensor                          
    length_tensor = torch.LongTensor(length)
    # Get the model prediction                  
    prediction,o1,o2,h1,h2= model(tensor, debug=True)
    #prediction=prediction.squeeze()
    pred = torch.argmax(prediction) 
    
    return categories[pred.item()]

In [29]:
r_n=random.choice(train_set)
print(r_n.tweets)
tweet=r_n.tweets
indexed = [tokenizer[t] for t in tweet]        
# compute no. of words        
length = [len(indexed)]
# convert to tensor                                    
tensor = torch.LongTensor(indexed).to(device)   
# reshape in form of batch, no. of words           
tensor = tensor.unsqueeze(1).T  
# convert to tensor                          
length_tensor = torch.LongTensor(length)
# Get the model prediction                  
prediction,o1,o2= model(tensor)


['Un', 'genio', 'Obama', 'cantando', '"', 'Sexy', 'and', 'i', 'know', 'it', '"', ' ', 'ajajajaj', 'http://t.co/LojgGtGT']


In [30]:
prediction.size()

torch.Size([3])

In [31]:
o1.size(), o2.size(), len(tweet)

(torch.Size([1, 14, 100]), torch.Size([1, 14, 100]), 14)

In [32]:
import numpy as np

In [33]:
encoded=o1.squeeze()
decoded=o2.squeeze()

In [34]:
encoded.size()

torch.Size([14, 100])

In [None]:
print(tweet, len(tweet))

In [35]:
for i in range(len(tweet)):
  enc=encoded[i]
  dec=decoded[i]
  print(f'word :{tweet[i]}  \n encoded : {enc}\n decoded : {dec}')

word :Un  
 encoded : tensor([-0.0277,  0.0705,  0.0005, -0.0142,  0.0917,  0.0717, -0.0249,  0.0175,
         0.0275, -0.0535,  0.0009, -0.0220, -0.0510, -0.0680, -0.0249, -0.0631,
         0.0658,  0.0241,  0.0179,  0.0109, -0.0296, -0.0184,  0.0163, -0.0529,
        -0.0287, -0.0283,  0.0117,  0.0605, -0.0456, -0.0482, -0.0512,  0.0247,
        -0.0315,  0.0692,  0.0215,  0.0830, -0.0117,  0.0840,  0.1183,  0.0892,
         0.0685, -0.0210,  0.0662, -0.0409,  0.0272,  0.0045,  0.0782, -0.0193,
        -0.0679,  0.0072, -0.0060, -0.0379,  0.0168,  0.0847,  0.0242, -0.1152,
         0.0194, -0.0058, -0.0663,  0.0081,  0.0554, -0.0812, -0.0618,  0.0098,
         0.0099,  0.0813, -0.0271,  0.0057,  0.0753, -0.0386, -0.0329,  0.1023,
        -0.0155, -0.0106,  0.0159, -0.0628,  0.0040,  0.0315, -0.0631, -0.0005,
         0.0630, -0.0851,  0.0816, -0.0635,  0.0041,  0.0132,  0.0749, -0.0164,
         0.0329, -0.0260, -0.0044,  0.0074, -0.0217, -0.0539, -0.0264,  0.0744,
         0.0597,  