<a href="https://colab.research.google.com/github/veeralakrishna/END/blob/main/END_S7_Assaignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Dataset Preview

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Load required packages
import pandas as pd
import numpy as np

import os
import time
import pickle
import random
import torch, torchtext
from torchtext import data 

In [3]:


df_phrases = pd.read_csv('https://raw.githubusercontent.com/veeralakrishna/END/main/Session%207/references/stanfordSentimentTreebank/dictionary.txt', sep='|', header=None)
df_labels = pd.read_csv('https://raw.githubusercontent.com/veeralakrishna/END/main/Session%207/references/stanfordSentimentTreebank/sentiment_labels.txt', sep='|')


In [4]:
df_phrases.head()
df_phrases.shape

Unnamed: 0,0,1
0,!,0
1,! ',22935
2,! '',18235
3,! Alas,179257
4,! Brilliant,22936


(239232, 2)

In [5]:
df_labels.head()
df_labels.shape

Unnamed: 0,phrase ids,sentiment values
0,0,0.5
1,1,0.5
2,2,0.44444
3,3,0.5
4,4,0.42708


(239232, 2)

In [6]:
# Merge the data
df = pd.merge(df_phrases, df_labels, how='inner', left_on=1, right_on='phrase ids')

In [7]:
df.head()
df.shape

Unnamed: 0,0,1,phrase ids,sentiment values
0,!,0,0,0.5
1,! ',22935,22935,0.52778
2,! '',18235,18235,0.5
3,! Alas,179257,179257,0.44444
4,! Brilliant,22936,22936,0.86111


(239232, 4)

In [8]:
def score_to_label(score):
  if score <= 0.2:
    return 0
  elif score <= 0.4:
    return 1
  elif score <= 0.6:
    return 2
  elif score <= 0.8:
    return 3
  else:
    return 4
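
The thresholds above are five equal-width bins that are closed on the right. As a minimal sketch of the same mapping, the standard-library bisect module can look up the bin directly; the names BIN_EDGES and score_to_label_bisect are illustrative and not part of the original notebook.

```python
from bisect import bisect_left

# Upper edges of the first four bins; anything above 0.8 falls into class 4.
BIN_EDGES = [0.2, 0.4, 0.6, 0.8]

def score_to_label_bisect(score):
    # bisect_left returns the index of the first edge >= score, which is
    # the class index when every bin is closed on the right.
    return bisect_left(BIN_EDGES, score)

# A few scores, including ones that appear in the preview tables above.
labels = [score_to_label_bisect(s) for s in (0.0, 0.2, 0.44444, 0.5, 0.86111, 1.0)]
print(labels)
```

Keeping the edges in one list makes it easy to try a different number of classes without rewriting the if/elif chain.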


In [9]:
df['label'] = df['sentiment values'].apply(score_to_label)

df.head()
print("Shape of the df :", df.shape)

print("Value counts of label")
df.label.value_counts()

Unnamed: 0,0,1,phrase ids,sentiment values,label
0,!,0,0,0.5,2
1,! ',22935,22935,0.52778,2
2,! '',18235,18235,0.5,2
3,! Alas,179257,179257,0.44444,2
4,! Brilliant,22936,22936,0.86111,4


Shape of the df : (239232, 5)
Value counts of label


2    119449
3     50148
1     43028
4     15255
0     11352
Name: label, dtype: int64

### Defining Fields
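Now we define Review as a Field that tokenizes the phrase text, and Label as a LabelField, which is a subclass of Field that sets sequential to False (as it holds our numerical category class). Before that, we seed every random number generator so that runs are repeatable.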

In [10]:
def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        
seed_everything(42)

In [11]:

Review = data.Field(sequential = True, tokenize = 'spacy', batch_first =True, include_lengths=True)
Label = data.LabelField(is_target=True, batch_first =True, dtype=torch.float, sequential=False)

In [12]:
fields = [('review', Review),('labels',Label)]

In [13]:
fields

[('review', <torchtext.data.field.Field at 0x7f6acb6b1ef0>),
 ('labels', <torchtext.data.field.LabelField at 0x7f6a60366c18>)]

In [15]:
example = [data.Example.fromlist([df[0][i],df['label'][i]], fields) for i in range(df.shape[0])]

In [16]:
dataset = data.Dataset(example, fields)

In [17]:
(train, valid) = dataset.split(split_ratio=[0.85, 0.15])

In [18]:
(len(train), len(valid))

(203347, 35885)

In [19]:
vars(train.examples[13])

{'labels': 0, 'review': ['it', 'actually', 'hurts', 'to', 'watch', '.']}

### Building Vocabulary
At this point we would normally have to build a one-hot encoding of each word that is present in the dataset, a rather tedious process. Thankfully, torchtext will do this for us, and it also accepts a max_size parameter that limits the vocabulary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don’t want our GPUs too overwhelmed, after all.

Here we call build_vocab on the training set without max_size, so the vocabulary keeps every token it sees (20,822 entries, as printed below):

In [20]:
Review.build_vocab(train)
Label.build_vocab(train)

In [21]:
print('Size of input vocab : ', len(Review.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 most frequent words :', list(Review.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  20822
Size of label vocab :  5
Top 10 most frequent words : [('the', 64907), (',', 60037), ('a', 46604), ('of', 44496), ('and', 44185), ('.', 32446), ('to', 31610), ('-', 31002), ("'s", 24002), ('is', 19339)]
Labels :  defaultdict(<function _default_unk_index at 0x7f6acb701158>, {2: 0, 3: 1, 1: 2, 4: 3, 0: 4})
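
For intuition about what build_vocab and a max_size cap do, here is a minimal plain-Python sketch. It is not torchtext's implementation: the function name, the tie-breaking order, and the choice that special tokens do not count toward the cap are assumptions made for this illustration.

```python
from collections import Counter

def build_vocab(token_lists, max_size=None, specials=("<unk>", "<pad>")):
    # Count every token across all tokenized examples.
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties broken alphabetically so the result is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        ordered = ordered[:max_size]  # keep only the most common tokens
    itos = list(specials) + [tok for tok, _ in ordered]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

toy_corpus = [["it", "actually", "hurts", "to", "watch", "."],
              ["it", "is", "good", "."]]
stoi, itos = build_vocab(toy_corpus, max_size=3)

# Tokens that fell outside the cap are mapped to the <unk> index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["it", "hurts", "."]]
print(itos, encoded)
```

With a cap in place, rare words collapse onto a single unknown index, which is what keeps the embedding table small.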


Now we need to create a data loader to feed into our training loop. Torchtext provides the BucketIterator class, which groups examples of similar length and yields what it calls a Batch, which is almost, but not quite, like the batches from the data loader we used on images.

First, though, declare the device we are using.

In [22]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [23]:
train_iterator, valid_iterator = data.BucketIterator.splits((train, valid), batch_size = 32, 
                                                            sort_key = lambda x: len(x.review),
                                                            sort_within_batch=True, device = device)
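
The idea behind bucketing can be sketched in a few lines of plain Python. This is a simplified illustration only: the real BucketIterator also shuffles and works on tensors, and the function name and pad token below are assumptions of this sketch.

```python
def bucket_batches(examples, batch_size, pad_token="<pad>"):
    # Sort by length so that neighbouring examples need little padding.
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        # Longest first inside the batch, the order that
        # pack_padded_sequence expects by default.
        chunk = sorted(chunk, key=len, reverse=True)
        lengths = [len(toks) for toks in chunk]
        max_len = lengths[0]
        padded = [toks + [pad_token] * (max_len - len(toks)) for toks in chunk]
        batches.append((padded, lengths))
    return batches

toy = [["a"], ["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"]]
batches = bucket_batches(toy, batch_size=2)
for padded, lengths in batches:
    print(lengths, padded)
```

Returning the original lengths next to the padded rows mirrors include_lengths=True on the Review field, which is what lets the model pack the sequences later.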

In [24]:
with open('tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Review.vocab.stoi, tokens)

### Defining Our Model
We use the Embedding and LSTM modules in PyTorch to build a simple model for classifying review phrases.

In this model we create three layers.

 1. First, the words in our reviews are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding.
 2. That’s then fed into 2 stacked LSTMs with 200 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). We stack 2 LSTMs because nn.LSTM applies dropout only between stacked layers, so dropout has no effect with a single layer.
 3. Finally, the output of the LSTM (the final hidden state of the top layer after processing the incoming review) is pushed through a standard fully connected layer with five outputs to correspond to our five possible classes (very negative, negative, neutral, positive, or very positive).

In [25]:
import torch.nn as nn
import torch.nn.functional as F

class classifier(nn.Module):
    
    # Define all the layers used in model
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, dropout):
        
        super().__init__()          
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        # LSTM layer
        self.encoder = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           dropout=dropout,
                           batch_first=True)
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances
        
        # Dense layer
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_lengths):
        
        # text = [batch size, sent_length]
        embedded = self.embedding(text)
        # embedded = [batch size, sent_len, emb dim]
      
        # packed sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), batch_first=True)
        
        packed_output, (hidden, cell) = self.encoder(packed_embedded)
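        # hidden = [num layers * num directions, batch size, hid dim]
        # cell = [num layers * num directions, batch size, hid dim]
        # (batch_first only affects the input and output tensors, not the states)
    
        # hidden[-1] = final hidden state of the top layer = [batch size, hid dim]
        dense_outputs = self.fc(hidden[-1])   
        
        # Final activation function softmax
        # Note: nn.CrossEntropyLoss applies log-softmax internally, so returning
        # dense_outputs (raw logits) is the usual practice; softmax is kept here
        # to stay consistent with the training log shown later in this notebook.
        output = F.softmax(dense_outputs, dim=-1)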
            
        return output

In [26]:
# Define hyperparameters
size_of_vocab = len(Review.vocab)
embedding_dim = 300
num_hidden_nodes = 200
num_output_nodes = 5
num_layers = 2
dropout = 0.2

# Instantiate the model
model = classifier(size_of_vocab, embedding_dim, num_hidden_nodes, num_output_nodes, num_layers, dropout = dropout)

In [27]:
model

classifier(
  (embedding): Embedding(20822, 300)
  (encoder): LSTM(300, 200, num_layers=2, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=200, out_features=5, bias=True)
)

In [28]:
# No. of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
    
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 6,970,805 trainable parameters
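
The reported total can be checked by hand. The sketch below uses the standard parameter layout of a PyTorch LSTM layer (four gates, each with input weights, recurrent weights, and two bias vectors); the helper name lstm_layer_params is illustrative.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input-to-hidden weights, hidden-to-hidden weights
    # and two bias vectors (bias_ih and bias_hh).
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

vocab_size, emb_dim, hid_dim, n_classes = 20822, 300, 200, 5

embedding_params = vocab_size * emb_dim                   # lookup table
lstm_params = (lstm_layer_params(emb_dim, hid_dim)        # layer 1 reads embeddings
               + lstm_layer_params(hid_dim, hid_dim))     # layer 2 reads layer 1
fc_params = hid_dim * n_classes + n_classes               # weights + bias

total_params = embedding_params + lstm_params + fc_params
print(f"{total_params:,}")
```

Almost 90% of the parameters sit in the embedding table, which is why capping the vocabulary is the most direct way to shrink this model.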


### Model Training and Evaluation
First, define the optimizer and the loss function:

In [29]:
import torch.optim as optim

# define optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss()

# define metric (multi-class accuracy, despite the function name)
def binary_accuracy(preds, y):
    # take the index of the highest score as the predicted class
    _, predictions = torch.max(preds, 1)
    
    correct = (predictions == y).float() 
    acc = correct.sum() / len(correct)
    return acc
    
# push to cuda if available
model = model.to(device)
criterion = criterion.to(device)
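
One detail is worth checking here: nn.CrossEntropyLoss applies log-softmax itself, while the model's forward pass already returns softmax probabilities. The plain-Python sketch below (no PyTorch; the helper name cross_entropy is illustrative) computes the loss that results when values restricted to the range 0 to 1 are treated as logits, which gives a rough idea of the band the loss can move in under that setup.

```python
import math

def cross_entropy(logits, target):
    # Log-softmax followed by negative log-likelihood for a single example,
    # which is what a cross-entropy criterion computes from raw scores.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Best case: the model puts probability 1.0 on the correct class.
perfect_probs = [1.0, 0.0, 0.0, 0.0, 0.0]
# Uninformed case: the model spreads probability evenly over 5 classes.
uniform_probs = [0.2] * 5

loss_floor = cross_entropy(perfect_probs, 0)
loss_uniform = cross_entropy(uniform_probs, 0)
print(round(loss_floor, 4), round(loss_uniform, 4))
```

Under these assumptions the loss cannot drop below roughly 0.90 even for a perfect prediction, and starts near 1.61 for a uniform guess. Passing raw logits to the criterion removes that floor.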

In [30]:
# Training Loop

def train_loop(model, iterator, optimizer, criterion):
    
    # initialize every epoch 
    epoch_loss = 0
    epoch_acc = 0
    
    # set the model in training phase
    model.train()  
    
    for batch in iterator:
        
        # resets the gradients after every batch
        optimizer.zero_grad()   
        
        # retrieve text and no. of words
        review, review_lengths = batch.review   
        
        # forward pass: predictions = [batch size, n classes]
        predictions = model(review, review_lengths).squeeze()  
        
        # compute the loss
        loss = criterion(predictions, batch.labels.long())
        
        # compute the accuracy
        acc = binary_accuracy(predictions, batch.labels)   
        
        # backpropagate the loss and compute the gradients
        loss.backward()       
        
        # update the weights
        optimizer.step()      
        
        # loss and accuracy
        epoch_loss += loss.item()  
        epoch_acc += acc.item()    
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [31]:
# Evaluation Loop

def evaluate(model, iterator, criterion):
    
    # initialize every epoch
    epoch_loss = 0
    epoch_acc = 0

    # deactivating dropout layers
    model.eval()
    
    # deactivates autograd
    with torch.no_grad():
    
        for batch in iterator:
        
            # retrieve text and no. of words
            review, review_lengths = batch.review
            
            # forward pass: predictions = [batch size, n classes]
            predictions = model(review, review_lengths).squeeze()
            
            # compute loss and accuracy
            loss = criterion(predictions, batch.labels.long())
            acc = binary_accuracy(predictions, batch.labels)
            
            # keep track of loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [33]:
from tqdm import tqdm

# Train and Evaluate
N_EPOCHS = 15
best_valid_loss = float('inf')

# for epoch in tqdm(range(N_EPOCHS)):

for epoch in range(N_EPOCHS):
     
    # train the model
    train_loss, train_acc = train_loop(model, train_iterator, optimizer, criterion)
    
    # evaluate the model
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}% \n')

	Train Loss: 1.319 | Train Acc: 57.92%
	 Val. Loss: 1.310 |  Val. Acc: 58.84% 

	Train Loss: 1.287 | Train Acc: 61.35%
	 Val. Loss: 1.291 |  Val. Acc: 60.84% 

	Train Loss: 1.262 | Train Acc: 63.95%
	 Val. Loss: 1.282 |  Val. Acc: 61.87% 

	Train Loss: 1.244 | Train Acc: 65.84%
	 Val. Loss: 1.271 |  Val. Acc: 62.93% 

	Train Loss: 1.229 | Train Acc: 67.38%
	 Val. Loss: 1.265 |  Val. Acc: 63.60% 

	Train Loss: 1.217 | Train Acc: 68.55%
	 Val. Loss: 1.259 |  Val. Acc: 64.20% 

	Train Loss: 1.208 | Train Acc: 69.58%
	 Val. Loss: 1.258 |  Val. Acc: 64.33% 

	Train Loss: 1.199 | Train Acc: 70.45%
	 Val. Loss: 1.254 |  Val. Acc: 64.81% 

	Train Loss: 1.191 | Train Acc: 71.21%
	 Val. Loss: 1.254 |  Val. Acc: 64.77% 

	Train Loss: 1.185 | Train Acc: 71.87%
	 Val. Loss: 1.253 |  Val. Acc: 64.91% 

	Train Loss: 1.179 | Train Acc: 72.45%
	 Val. Loss: 1.252 |  Val. Acc: 64.93% 

	Train Loss: 1.174 | Train Acc: 73.02%
	 Val. Loss: 1.251 |  Val. Acc: 65.11% 

	Train Loss: 1.169 | Train Acc: 73.47%
	

### Model Testing

In [34]:

#load weights and tokenizer

path='./saved_weights.pt'
model.load_state_dict(torch.load(path));
model.eval();
# the context manager closes the file once the dictionary is loaded
with open('./tokenizer.pkl', 'rb') as tokenizer_file:
    tokenizer = pickle.load(tokenizer_file)

#inference 

import spacy
nlp = spacy.load('en')

def classify_review(review):
    
    categories = {
        0: "Very Negative",
        1: "Negative",
        2: "Neutral",
        3: "Positive",
        4: "Very Positive"
      }
    
    # tokenize the review 
    tokenized = [tok.text for tok in nlp.tokenizer(review)] 
    # convert to integer sequence using the saved token-to-index dictionary
    indexed = [tokenizer[t] for t in tokenized]        
    # compute no. of words        
    length = [len(indexed)]
    # convert to tensor                                    
    tensor = torch.LongTensor(indexed).to(device)   
    # reshape to [batch size = 1, no. of words]           
    tensor = tensor.unsqueeze(1).T  
    # lengths stay on the CPU, as pack_padded_sequence expects                          
    length_tensor = torch.LongTensor(length)
    # get the model prediction                  
    prediction = model(tensor, length_tensor)

    # index of the highest-scoring class in the label vocabulary
    _, pred = torch.max(prediction, 1) 
    
    # itos maps the predicted index back to the original 0-4 label
    # (stoi goes the other way, from label to index)
    return categories[Label.vocab.itos[pred.item()]]

<All keys matched successfully>

classifier(
  (embedding): Embedding(20822, 300)
  (encoder): LSTM(300, 200, num_layers=2, batch_first=True, dropout=0.2)
  (fc): Linear(in_features=200, out_features=5, bias=True)
)
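
Because the label vocabulary is ordered by frequency, the index the model predicts is not the same number as the original 0 to 4 label. The sketch below uses the stoi mapping printed earlier in this notebook and inverts it by hand to show the index-to-label direction; label_itos and index_to_category are illustrative names.

```python
# Label vocabulary as printed earlier in this notebook: raw label -> index.
label_stoi = {2: 0, 3: 1, 1: 2, 4: 3, 0: 4}

# Invert it to get index -> raw label (the role itos plays in a vocabulary).
label_itos = [None] * len(label_stoi)
for label, idx in label_stoi.items():
    label_itos[idx] = label

categories = {
    0: "Very Negative",
    1: "Negative",
    2: "Neutral",
    3: "Positive",
    4: "Very Positive",
}

def index_to_category(pred_index):
    # Model output index -> raw label -> human-readable name.
    return categories[label_itos[pred_index]]

names = [index_to_category(i) for i in range(5)]
print(names)
```

Index 0 corresponds to label 2 (Neutral) because that is the most frequent class in the training split.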

## Discussion on Data Augmentation Techniques 

You might wonder exactly how you can augment text data. After all, you can’t really flip it horizontally as you can an image! :D 

In contrast to data augmentation in images, augmentation techniques for text are very specific to the final product you are building. Because applying them generically to any kind of text doesn't provide a significant performance boost, torchtext, unlike torchvision, doesn’t offer an augmentation pipeline. With powerful models such as transformers available, augmentation techniques are also less popular nowadays. But it's still worth knowing some text techniques that can give your model a little more information for training.

### Synonym Replacement

First, you could replace words in the sentence with synonyms, like so:

    The dog slept on the mat

could become

    The dog slept on the rug

Aside from the dog's insistence that a rug is much softer than a mat, the meaning of the sentence hasn’t changed. But mat and rug will be mapped to different indices in the vocabulary, so the model will learn that the two sentences map to the same label, and hopefully that there’s a connection between those two words, as everything else in the sentences is the same.
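
A minimal sketch of synonym replacement, assuming a small hand-written synonym table (TOY_SYNONYMS is made up for this example; a real project would draw synonyms from a thesaurus resource):

```python
import random

# Hypothetical synonym table, for illustration only.
TOY_SYNONYMS = {"mat": ["rug", "carpet"], "slept": ["dozed", "napped"]}

def synonym_replacement(words, n, synonyms=TOY_SYNONYMS, rng=random):
    new_words = list(words)  # copy so the caller's list is left untouched
    # Positions of words that have at least one known synonym.
    candidates = [i for i, w in enumerate(new_words) if w in synonyms]
    rng.shuffle(candidates)
    # Replace at most n of them with a randomly chosen synonym.
    for i in candidates[:n]:
        new_words[i] = rng.choice(synonyms[new_words[i]])
    return new_words

original = "The dog slept on the mat".split()
augmented = synonym_replacement(original, n=1, rng=random.Random(0))
print(" ".join(augmented))
```

The augmented sentence keeps the label of the original, so each replacement yields an extra training example at no labelling cost.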

### Random Insertion
A random insertion technique looks at a sentence and then randomly inserts synonyms of existing non-stopwords into the sentence n times. Assuming you have a way of getting a synonym of a word and a way of eliminating stopwords (common words such as and, it, the, etc.), which appear in this function as get_synonyms() and remove_stopwords() but are not implemented here, an implementation of this would be as follows:


In [35]:
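def random_insertion(sentence, n): 
    # sentence is a list of tokens; get_synonyms() and remove_stopwords()
    # are placeholders that are not implemented in this notebook
    words = remove_stopwords(sentence) 
    for _ in range(n):
        new_synonym = get_synonyms(random.choice(words))
        # random.randrange picks an insertion position, including the end
        sentence.insert(random.randrange(len(sentence)+1), new_synonym) 
    return sentence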
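
To try random insertion end to end, the two helpers need some implementation. The sketch below fills them in with tiny hand-written tables; TOY_STOPWORDS, TOY_SYNONYMS, and the helper bodies are assumptions made for illustration, not the resources a real pipeline would use.

```python
import random

# Hypothetical resources, for illustration only.
TOY_STOPWORDS = {"the", "on", "a", "and", "it"}
TOY_SYNONYMS = {"dog": ["hound"], "slept": ["dozed"], "mat": ["rug"]}

def remove_stopwords(words):
    return [w for w in words if w.lower() not in TOY_STOPWORDS]

def get_synonym(word, rng):
    # Fall back to the word itself when no synonym is known.
    return rng.choice(TOY_SYNONYMS.get(word, [word]))

def random_insertion_demo(words, n, rng=random):
    new_words = list(words)  # work on a copy
    content_words = remove_stopwords(new_words)
    if not content_words:
        return new_words  # nothing to draw synonyms from
    for _ in range(n):
        synonym = get_synonym(rng.choice(content_words), rng)
        # Insert at a random position, possibly at the very end.
        new_words.insert(rng.randrange(len(new_words) + 1), synonym)
    return new_words

original = "The dog slept on the mat".split()
augmented = random_insertion_demo(original, n=2, rng=random.Random(1))
print(" ".join(augmented))
```

Unlike the function above, this version copies the input list, so the original sentence can be reused for further augmentations.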

### Random Deletion
As the name suggests, random deletion deletes words from a sentence. Given a probability parameter p, it goes through the sentence and decides whether to delete each word based on that probability. Think of it as pixel dropout when working with images.

In [36]:
def random_deletion(words, p=0.5): 
    if len(words) == 1: # return if single word
        return words
    remaining = list(filter(lambda x: random.uniform(0,1) > p,words)) 
    if len(remaining) == 0: # if nothing is left, sample a random word
        return [random.choice(words)] 
    else:
        return remaining

### Random Swap
The random swap augmentation takes a sentence and then swaps words within it n times, with each iteration working on the previously swapped sentence. Here we sample two random numbers based on the length of the sentence, and then just keep swapping until we hit n.

In [37]:
def random_swap(sentence, n=5): 
    length = range(len(sentence)) 
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1] 
    return sentence

### Back Translation

Another popular approach for augmenting text datasets is back translation. This involves translating a sentence from our target language into one or more other languages and then translating all of them back to the original language. The cell below uses two Python libraries for this purpose: googletrans for its list of language codes and google_trans_new for the translation calls.

In [38]:
!pip install google_trans_new

Collecting google_trans_new
  Downloading https://files.pythonhosted.org/packages/f9/7b/9f136106dc5824dc98185c97991d3cd9b53e70a197154dd49f7b899128f6/google_trans_new-1.1.9-py3-none-any.whl
Installing collected packages: google-trans-new
Successfully installed google-trans-new-1.1.9


In [42]:
!pip install googletrans

Collecting googletrans
  Downloading https://files.pythonhosted.org/packages/71/3a/3b19effdd4c03958b90f40fe01c93de6d5280e03843cc5adf6956bfc9512/googletrans-3.0.0.tar.gz
Collecting httpx==0.13.3
  Downloading https://files.pythonhosted.org/packages/54/b4/698b284c6aed4d7c2b4fe3ba5df1fcf6093612423797e76fbb24890dd22f/httpx-0.13.3-py3-none-any.whl (55kB)
Collecting hstspreload
  Downloading https://files.pythonhosted.org/packages/d3/3c/cdeaf9ab0404853e77c45d9e8021d0d2c01f70a1bb26e460090926fe2a5e/hstspreload-2020.11.21-py3-none-any.whl (981kB)
Collecting httpcore==0.9.*
  Downloading https://files.pythonhosted.org/packages/dd/d5/e4ff9318693ac6101a2095e580908b591838c6f33df8d3ee8dd953ba96a8/httpcore-0.9.1-py3-none-any.whl (42kB)
Collecting rfc3986<2,>=1.3
  Downloading https://files.pythonhosted.org/

In [43]:
import random
import googletrans 
import google_trans_new

from google_trans_new import google_translator

translator = google_translator()
sentence = ['The dog slept on the rug', 'This is good coffee']

available_langs = list(googletrans.LANGUAGES.keys()) 
trans_lang = random.choice(available_langs) 
print(f"Translating to {googletrans.LANGUAGES[trans_lang]}")

translations = translator.translate(sentence, lang_tgt=trans_lang)
print(translations)
# t_text = [t for t in translations]
# print(t_text)

translations_en_random = translator.translate(translations, lang_src=trans_lang, lang_tgt='en') 
print(translations_en_random)

Translating to sesotho
['Ntja e robetse mpeng', 'Ke kofi e monate'] 
['The dog is sleeping on his stomach', 'It's a delicious coffee'] 
