# Sentiment Analysis
Embedding-LSTM-Fully Connected  

Dataset Preview
Your first step to deep learning in NLP. We will be mostly using PyTorch. Just like torchvision, PyTorch provides an official library, torchtext, for handling text-processing pipelines.

We will be using the tweet dataset from the previous session. Let's preview it first.

In [None]:
pip install pytreebank

Collecting pytreebank
  Downloading https://files.pythonhosted.org/packages/e0/12/626ead6f6c0a0a9617396796b965961e9dfa5e78b36c17a81ea4c43554b1/pytreebank-0.2.7.tar.gz
Building wheels for collected packages: pytreebank
  Building wheel for pytreebank (setup.py) ... done
  Created wheel for pytreebank: filename=pytreebank-0.2.7-cp37-none-any.whl size=37070 sha256=0ddc68a9dec81b8a3549ccd412b98e22b8d76198a593cd7e2be7ff0c4d71b413
  Stored in directory: /root/.cache/pip/wheels/e0/b6/91/e9edcdbf464f623628d5c3aa9de28888c726e270b9a29f2368
Successfully built pytreebank
Installing collected packages: pytreebank
Successfully installed pytreebank-0.2.7


In [None]:
import pytreebank

dataset = pytreebank.load_sst('./raw_data')

In [None]:
import os
import sys

print(sys.path[0])
filepath = os.path.join(sys.path[0],'{}.txt')

def find_label(label):
    if label in (1,2):
        return 0
    elif label == 3:
        return 2
    elif label in (4,5):
        return 1


types = ['train' , 'test', 'dev']

for t in types:
    with open(filepath.format(t), 'w') as f:
        for row in dataset[t]:
            label = find_label(row.to_labeled_lines()[0][0]+1)            
            f.write("{}\t{}\n".format(row.to_labeled_lines()[0][1], label))




In [None]:
for t in types:
    print(t,"  ",len(dataset[t]))

train    8544
test    2210
dev    1101


In [None]:
import pandas as pd

df = pd.read_csv('/content/train.txt', sep='\t', header=None, names = ['tweets', 'labels'])
df.head()

Unnamed: 0,tweets,labels
0,The Rock is destined to be the 21st Century 's...,1
1,The gorgeously elaborate continuation of `` Th...,1
2,Singer/composer Bryan Adams contributes a slew...,1
3,You 'd think by now America would have had eno...,2
4,Yet the act is still charming here .,1


Always look through your dataset to understand it more.

In [None]:
df.shape

(8544, 2)

In [None]:
df.labels.value_counts()

1    3610
0    3310
2    1624
Name: labels, dtype: int64

You can use df.labels and df.tweets (the column names in your dataset) to access the corresponding columns.

In [None]:
import random
import torch, torchtext
from torchtext.legacy import data

SEED = 43
torch.manual_seed(SEED)

<torch._C.Generator at 0x7f5648363030>

In [None]:
torch.__version__

'1.8.1+cu101'

## Defining Fields


Now we define Label as a LabelField, a subclass of Field that sets sequential to False (it holds our numerical category class). Tweet is a standard Field object that uses the spaCy tokenizer. Note that the Field below does not pass lower=True, so the original casing is kept; the vocabulary built later contains both 'The' and 'the'.

In [None]:
Tweet = data.Field(sequential=True, tokenize='spacy', batch_first = True, include_lengths=True)
Label = data.LabelField(tokenize='spacy', is_target=True, batch_first=True, sequential=False)

Having defined those fields, we now need to produce a list that maps them onto the list of rows that are in the CSV:

In [None]:
fields = [('tweets', Tweet),('labels', Label)]
fields

[('tweets', <torchtext.legacy.data.field.Field at 0x7f5647360510>),
 ('labels', <torchtext.legacy.data.field.LabelField at 0x7f55f1a8f090>)]

Armed with our declared fields, let's convert from pandas to a list of Example objects and then to a torchtext dataset. We could also use TabularDataset to apply that definition to the CSV directly, but this shows an alternative approach.

In [None]:
df.tweets[0], df.labels[0]

("The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 1)

In [None]:
df.shape[0]

8544

In [None]:
%%time
example = [data.Example.fromlist([df.tweets[i], df.labels[i]], fields) for i in range(df.shape[0])]

CPU times: user 6min 38s, sys: 5.06 s, total: 6min 43s
Wall time: 6min 43s


In [None]:
example[:5][0]

<torchtext.legacy.data.example.Example at 0x7f55f1a8cdd0>

## Creating dataset

In [None]:
## Approach 1
##twitterDataset = data.TabularDataset(path='/content/tweets.csv', format="CSV", fields = fields, skip_header=True)

## Approach 2
twitterDataset = data.Dataset(example, fields)

## Split the dataset

In [None]:
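# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
train, valid = twitterDataset.split(split_ratio=[0.85, 0.15], random_state=random.getstate())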

In [None]:
len(train), len(valid)

(7262, 1282)

In [None]:
vars(train.examples[0])

{'labels': 1,
 'tweets': ['Asks',
  'what',
  'truth',
  'can',
  'be',
  'discerned',
  'from',
  'non',
  '-',
  'firsthand',
  'experience',
  ',',
  'and',
  'specifically',
  'questions',
  'cinema',
  "'s",
  'capability',
  'for',
  'recording',
  'truth',
  '.']}

## Building Vocabulary

At this point we would otherwise have to build, by hand, an index (the basis of a one-hot encoding) for each word that is present in the dataset, a rather tedious process. Thankfully, torchtext will do this for us, and build_vocab also accepts a max_size parameter to limit the vocabulary to the most common words. This is normally done to prevent the construction of a huge, memory-hungry model. We don’t want our GPUs too overwhelmed, after all.

To limit the vocabulary to, say, the 5000 most common words in the training set, pass max_size=5000 to build_vocab. The call below leaves max_size unset, so every token in the training set is kept:

In [None]:
Tweet.build_vocab(train)
Label.build_vocab(train)

By default, torchtext will add two more special tokens, <unk> for unknown words and <pad>, a padding token that will be used to pad all our text to roughly the same size to help with efficient batching on the GPU.
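
As a rough illustration of what building a vocabulary involves, the sketch below counts tokens, reserves indices 0 and 1 for `<unk>` and `<pad>`, and optionally caps the vocabulary with `max_size`. It is plain Python, not torchtext's implementation; `build_vocab_sketch` and `numericalize` are hypothetical helper names and the sentences are made up.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, max_size=None):
    """Map tokens to integer ids, most frequent first (hypothetical helper)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Reserve the first two ids for the special tokens.
    stoi = {'<unk>': 0, '<pad>': 1}
    for token, _ in counts.most_common(max_size):
        stoi[token] = len(stoi)
    return stoi

def numericalize(tokens, stoi, length):
    """Convert tokens to ids and pad with <pad> up to `length`."""
    ids = [stoi.get(tok, stoi['<unk>']) for tok in tokens]
    return ids + [stoi['<pad>']] * (length - len(ids))

# Made-up, already tokenized sentences.
texts = [['the', 'film', 'is', 'good', '.'],
         ['the', 'movie', 'is', 'bad', '.'],
         ['the', 'end']]

stoi = build_vocab_sketch(texts, max_size=3)
encoded = numericalize(['the', 'plot', 'is', 'thin'], stoi, length=6)
print(stoi)
print(encoded)
```

Words that fall outside the capped vocabulary ('plot' and 'thin' here) all map to the `<unk>` index, and the trailing positions are filled with the `<pad>` index.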


In [None]:
print('Size of input vocab : ', len(Tweet.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 words appeared repeatedly :', list(Tweet.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  15741
Size of label vocab :  3
Top 10 words appeared repeatedly : [('.', 6832), (',', 6060), ('the', 5189), ('of', 3767), ('and', 3756), ('a', 3755), ('to', 2588), ('-', 2342), ("'s", 2158), ('is', 2151)]
Labels :  defaultdict(None, {1: 0, 0: 1, 2: 2})


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [None]:
BATCH_SIZE=32

train_iterator, valid_iterator = data.BucketIterator.splits((train, valid),
                           batch_size = BATCH_SIZE,
                           sort_key = lambda x: len(x.tweets),
                           sort_within_batch=True,
                           device = device
                           )
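
To see why grouping tweets of similar length helps, the sketch below compares how many pad tokens are needed when batches are formed in arbitrary order versus from length-sorted examples. It is plain Python under simplifying assumptions (a full sort rather than BucketIterator's own bucketing and shuffling); `padding_needed` is a hypothetical helper and the lengths are made up.

```python
def padding_needed(lengths, batch_size):
    """Total pad tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 40, 5, 38, 4, 41, 6, 39]  # made-up tweet lengths

unsorted_pad = padding_needed(lengths, batch_size=4)
sorted_pad = padding_needed(sorted(lengths), batch_size=4)
print('pad tokens without sorting:', unsorted_pad)
print('pad tokens with sorting   :', sorted_pad)
```

Fewer pad tokens means less wasted computation per batch, which is the motivation for passing a `sort_key` based on tweet length.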

Save the vocabulary

In [None]:
Tweet.vocab.stoi

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7f55ee856a50>>,
            {'<unk>': 0,
             '<pad>': 1,
             '.': 2,
             ',': 3,
             'the': 4,
             'of': 5,
             'and': 6,
             'a': 7,
             'to': 8,
             '-': 9,
             "'s": 10,
             'is': 11,
             'that': 12,
             'in': 13,
             'it': 14,
             'The': 15,
             'as': 16,
             'film': 17,
             'with': 18,
             'but': 19,
             'movie': 20,
             'for': 21,
             'its': 22,
             'A': 23,
             '`': 24,
             'an': 25,
             'you': 26,
             'this': 27,
             'be': 28,
             "n't": 29,
             'It': 30,
             '...': 31,
             'on': 32,
             "'": 33,
             'not': 34,
             '--': 35,
             'by': 36,
             'has': 37,
          

In [None]:
import os, pickle

with open('tokenizer.pkl', 'wb') as t:
    pickle.dump(Tweet.vocab.stoi, t)

## Define the model

We use the Embedding and LSTM modules in PyTorch to build a simple model for classifying tweets.

In this model we create three layers. 
1. First, the words in our tweets are pushed into an Embedding layer, which we have established as a 300-dimensional vector embedding. 
2. That’s then fed into a stack of 3 LSTM layers with 200 hidden features (again, we’re compressing down from the 300-dimensional input like we did with images). We stack more than one LSTM layer because nn.LSTM applies dropout only between layers, so dropout has no effect with a single layer.
3. Finally, the output of the LSTM (the final hidden state after processing the incoming tweet) is pushed through a standard fully connected layer with three outputs to correspond to our three possible classes (negative, positive, or neutral).

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class sentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers, dropout, output_dim):
        super().__init__()

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        #LSTM layer
        self.encoder = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers=n_layers,
                            dropout=dropout,
                            batch_first=True)
        # try using nn.GRU or nn.RNN here and compare their performances
        # try bidirectional and compare their performances

        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, text, text_lengths):
        # text = [batch_size, text_length] 
        embedded = self.embedding(text)

        # embedded = [batch_size, text_length, embedding_dim]

        # packed sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded,
                                                            text_lengths.cpu(),
                                                            batch_first=True)
        packed_output, (hidden, cell) = self.encoder(packed_embedded)
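        # hidden = [num_layers*num_directions, batch_size, hidden_dim]
        # cell = [num_layers*num_directions, batch_size, hidden_dim]
        # (batch_first only affects the input/output tensors, not hidden/cell)

        dense_outputs = self.fc(hidden)
        # dense_outputs = [num_layers*num_directions, batch_size, output_dim]

        # Activation function Softmax
        # NOTE: index 0 is the first LSTM layer's final hidden state; index -1 would be the top layer's.
        # NOTE: nn.CrossEntropyLoss applies log-softmax itself, so it normally expects raw scores here.
        output = F.softmax(dense_outputs[0], dim=1)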

        return output


In [None]:
# Define Hyperparameters

VOCAB_SIZE = len(Tweet.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 200
OUTPUT_DIM = 3
NUM_LAYERS = 3
DROPOUT = 0.15

model = sentimentClassifier(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, NUM_LAYERS, DROPOUT, OUTPUT_DIM)
model

sentimentClassifier(
  (embedding): Embedding(15741, 300)
  (encoder): LSTM(300, 200, num_layers=3, batch_first=True, dropout=0.15)
  (fc): Linear(in_features=200, out_features=3, bias=True)
)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(model)

5767703
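
The parameter count can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming the standard nn.LSTM parameterisation (four gates, each with input weights, recurrent weights, and two bias vectors); `lstm_layer_params` is a hypothetical helper.

```python
def lstm_layer_params(input_dim, hidden_dim):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

vocab_size, embedding_dim, hidden_dim, num_layers, output_dim = 15741, 300, 200, 3, 3

# Embedding table: one vector per vocabulary entry.
embedding = vocab_size * embedding_dim

# First LSTM layer reads the embeddings; the remaining layers read hidden states.
lstm = lstm_layer_params(embedding_dim, hidden_dim)
lstm += (num_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)

# Fully connected layer: weights plus biases.
fc = hidden_dim * output_dim + output_dim

total = embedding + lstm + fc
print(embedding, lstm, fc, total)
```

Most of the parameters sit in the embedding table, which is one reason capping the vocabulary size shrinks the model considerably.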

## Model Training and Evaluation

### Optimizer and Loss Function

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=2e-4)
criterion = nn.CrossEntropyLoss()

# accuracy metric
def binary_accuracy(pred,y):
    _, predictions = torch.max(pred, 1)
    
    correct = (predictions == y).float()
    accuracy = correct.sum()/len(correct)
    return accuracy
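
nn.CrossEntropyLoss combines a log-softmax with the negative log-likelihood, so it is normally given raw scores. The sketch below reimplements that computation in plain Python for a single example and compares feeding it raw scores with feeding it already-softmaxed probabilities, as the model above does. The helper names and the example scores are assumptions for illustration.

```python
import math

def cross_entropy(scores, target):
    """Cross-entropy for one example: -log(softmax(scores)[target])."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, -3.0, -5.0]  # a confident, correct prediction for class 0

loss_on_logits = cross_entropy(logits, target=0)
loss_on_probs = cross_entropy(softmax(logits), target=0)

# With three classes and inputs confined to [0, 1], the loss cannot go
# below this value, which is reached only for the input [1, 0, 0].
floor = math.log(1 + 2 / math.e)

print(loss_on_logits, loss_on_probs, floor)
```

With probabilities as input, even a perfect prediction leaves the loss at roughly 0.55 for three classes, which may be worth keeping in mind when reading the loss values logged during training.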

## Augmentations

### Random Deletion
As the name suggests, random deletion deletes words from a sentence. Given a probability parameter p, it goes through the sentence and decides, independently for each word, whether to delete it with probability p. Think of it as the text counterpart of pixel dropout for images.

In [None]:
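def random_deletion(words, p=0.5):
    # words is a 1-D tensor of token indices for one (padded) tweet
    if len(words) == 1:  # nothing to delete from a single-token tweet
        return words
    # keep each token with probability 1 - p
    remaining = list(filter(lambda x: random.uniform(0, 1) > p, words))
    if len(remaining) == 0:  # everything was deleted: fall back to the original tweet
        return words
    else:
        # pad back to the original length with index 1 (<pad>) so tweets can be stacked
        remaining_pad_len = len(words) - len(remaining)
        pads = [1] * remaining_pad_len
        remaining = torch.cat([torch.tensor(remaining), torch.tensor(pads, dtype=torch.long)], dim=0)
        return remaining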
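
The same idea is easier to inspect on a plain list of tokens. The sketch below is a list-based variant, not the tensor version used in training: it needs no padding, and it keeps one random word if everything would otherwise be deleted. `random_deletion_tokens` and its `rng` argument are hypothetical.

```python
import random

def random_deletion_tokens(tokens, p=0.5, rng=random):
    """Drop each token with probability p; never return an empty sentence."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [tok for tok in tokens if rng.random() > p]
    # If every token was dropped, keep one at random instead.
    return kept if kept else [rng.choice(tokens)]

rng = random.Random(43)  # local generator so the example is repeatable
tokens = ['Yet', 'the', 'act', 'is', 'still', 'charming', 'here', '.']
print(random_deletion_tokens(tokens, p=0.3, rng=rng))
```

Passing a dedicated `random.Random` instance keeps the augmentation reproducible without reseeding the global generator.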

### Random Swap
The random swap augmentation takes a sentence and then swaps words within it n times, with each iteration working on the previously swapped sentence. Here we sample two random numbers based on the length of the sentence, and then just keep swapping until we hit n.

In [None]:
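def random_swap(sentence, n=5):
    # sentence is a 1-D tensor of token indices
    if len(sentence) < 2:  # nothing to swap
        return sentence
    length = range(len(sentence))
    for _ in range(n):
        idx1, idx2 = random.sample(length, 2)
        # Index with lists so the right-hand side is a copy; a plain tuple swap on
        # tensor elements aliases views and duplicates one value instead of swapping.
        sentence[[idx1, idx2]] = sentence[[idx2, idx1]]
    return sentence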

### Random Insertion
A random insertion technique looks at a sentence and randomly inserts synonyms of existing non-stopwords into it n times. Assuming you have a way of getting a synonym of a word and a way of eliminating stopwords (common words such as and, it, the, etc.), shown but not implemented in this function via get_synonyms() and remove_stopwords(), an implementation would be as follows:

In [None]:
# not used here

def random_insertion(sentence, n): 
    # remove_stopwords() and get_synonyms() are placeholders that are not defined here
    words = remove_stopwords(sentence) 
    for _ in range(n):
        new_synonym = get_synonyms(random.choice(words))
        sentence.insert(random.randrange(len(sentence)+1), new_synonym) 
    return sentence
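
To make the function above runnable, the two placeholders need some implementation. The sketch below fills them in with a toy stopword set and a toy synonym table, both invented for illustration; a real setup would more likely draw on a lexical resource such as WordNet.

```python
import random

# Toy resources for illustration only.
STOPWORDS = {'the', 'is', 'a', 'and', 'it', '.'}
SYNONYMS = {'film': ['movie', 'picture'], 'charming': ['delightful', 'lovely']}

def remove_stopwords(sentence):
    return [w for w in sentence if w not in STOPWORDS]

def get_synonyms(word):
    # Fall back to the word itself when no synonym is known.
    return random.choice(SYNONYMS.get(word, [word]))

def random_insertion(sentence, n):
    sentence = list(sentence)  # work on a copy
    words = remove_stopwords(sentence)
    if not words:  # only stopwords: nothing to draw synonyms from
        return sentence
    for _ in range(n):
        new_synonym = get_synonyms(random.choice(words))
        sentence.insert(random.randrange(len(sentence) + 1), new_synonym)
    return sentence

random.seed(43)
original = ['the', 'film', 'is', 'charming', '.']
augmented = random_insertion(original, n=2)
print(augmented)
```

Unlike the version above, this one copies the sentence first, so the caller's list is left unchanged.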


### Back Translation

Another popular approach for augmenting text datasets is back translation. This involves translating a sentence from our target language into one or more other languages and then translating all of them back to the original language. We can use the Python library googletrans for this purpose. 

In [None]:
# not used here

import random
import googletrans
from googletrans import Translator

translator = Translator()
sentence = ['The dog slept on the rug']

available_langs = list(googletrans.LANGUAGES.keys()) 
trans_lang = random.choice(available_langs) 
print(f"Translating to {googletrans.LANGUAGES[trans_lang]}")

translations = translator.translate(sentence, dest=trans_lang) 
t_text = [t.text for t in translations]
print(t_text)

translations_en_random = translator.translate(t_text, src=trans_lang, dest='en') 
en_text = [t.text for t in translations_en_random]
print(en_text)

## Apply Augmentations

In [None]:
def augmentation(sentence, dropout=0.3):
    probability = random.random()
    if probability > 0.3:
        n = random.randint(5, 9)
        sentence = random_swap(sentence, n)
        sentence = random_deletion(sentence, dropout)
        return sentence
    else:
        return sentence

In [None]:
model = model.to(device)
criterion = criterion.to(device)

In [None]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_accuracy = 0

    model.train()

    for batch in iterator:
        optimizer.zero_grad()

        tweet, tweet_lengths = batch.tweets

        # apply augmentation
        list_of_tweets = [augmentation(t.cpu(), dropout=0.3) for t in tweet]
        tweet = torch.stack(list_of_tweets).long().to(device)

        prediction = model(tweet, tweet_lengths).squeeze()

        # compute the loss
        loss = criterion(prediction, batch.labels)

        # compute the binary accuracy
        accuracy = binary_accuracy(prediction, batch.labels)

        # backprops the loss and compute the gradients
        loss.backward()

        # update the weights
        optimizer.step()

        # store loss and accuracy
        epoch_loss += loss.item()
        epoch_accuracy += accuracy.item()

    len_iterator = len(iterator)
    return epoch_loss/len_iterator , epoch_accuracy/len_iterator

In [None]:
def evaluate(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_accuracy = 0

    model.eval()

    with torch.no_grad():

        for batch in iterator:

            tweet, tweet_lengths = batch.tweets
            prediction = model(tweet, tweet_lengths).squeeze()

            # compute the loss
            loss = criterion(prediction, batch.labels)

            # compute the binary accuracy
            accuracy = binary_accuracy(prediction, batch.labels)

            # store loss and accuracy
            epoch_loss += loss.item()
            epoch_accuracy += accuracy.item()

    len_iterator = len(iterator)
    return epoch_loss/len_iterator , epoch_accuracy/len_iterator

## Training!

In [None]:
NUM_EPOCHS = 50

best_valid_loss = float('inf')

for epoch in range(NUM_EPOCHS):
    train_loss, train_accuracy = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_accuracy = evaluate(model, valid_iterator, optimizer, criterion)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'save_weights.pt')

    print("Epoch: ",epoch)
    print(f'Train Loss: {train_loss:.3f} || Train Accuracy: {train_accuracy*100:.2f}%')
    print(f'Validation Loss: {valid_loss:3f} || Validation Accuracy: {valid_accuracy*100:.2f}% \n')

Epoch:  0
Train Loss: 1.057 || Train Accuracy: 42.07%
Validation Loss: 1.055970 || Validation Accuracy: 42.38% 

Epoch:  1
Train Loss: 1.048 || Train Accuracy: 43.83%
Validation Loss: 1.040945 || Validation Accuracy: 46.04% 

Epoch:  2
Train Loss: 1.019 || Train Accuracy: 50.19%
Validation Loss: 0.997919 || Validation Accuracy: 54.19% 

Epoch:  3
Train Loss: 0.984 || Train Accuracy: 54.68%
Validation Loss: 0.983854 || Validation Accuracy: 55.56% 

Epoch:  4
Train Loss: 0.956 || Train Accuracy: 58.18%
Validation Loss: 0.967743 || Validation Accuracy: 58.00% 

Epoch:  5
Train Loss: 0.936 || Train Accuracy: 60.79%
Validation Loss: 0.970905 || Validation Accuracy: 57.32% 

Epoch:  6
Train Loss: 0.922 || Train Accuracy: 61.90%
Validation Loss: 0.969928 || Validation Accuracy: 56.71% 

Epoch:  7
Train Loss: 0.910 || Train Accuracy: 63.06%
Validation Loss: 0.952413 || Validation Accuracy: 58.69% 

Epoch:  8
Train Loss: 0.902 || Train Accuracy: 63.85%
Validation Loss: 0.965344 || Validation Ac

### Result:
 Overfitting reduced!

 The model gave good results until around epoch 12-13, after which it started overfitting again, and validation accuracy was stuck near 59%.
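
One common response to this pattern is early stopping: stop once the validation loss has not improved for a few epochs and keep the weights from the best epoch. The sketch below applies a simple patience rule to the validation losses logged above for epochs 0-8. `early_stopping_epoch` and the patience values are assumptions, not part of the training loop used here.

```python
def early_stopping_epoch(valid_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a simple patience rule."""
    best_loss, best_epoch = float('inf'), -1
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # This is the point where the weights would be saved.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(valid_losses) - 1, best_epoch

# Validation losses for epochs 0-8 as logged above.
logged = [1.055970, 1.040945, 0.997919, 0.983854, 0.967743,
          0.970905, 0.969928, 0.952413, 0.965344]

print(early_stopping_epoch(logged, patience=3))
print(early_stopping_epoch(logged, patience=2))
```

A small patience stops sooner but can miss a later improvement, as the second call shows on these numbers.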

# Model Testing

In [None]:
path = './save_weights.pt'
model.load_state_dict(torch.load(path))
model.eval()

with open('./tokenizer.pkl', 'rb') as tokenizer_file:
    tokenizer = pickle.load(tokenizer_file)

In [None]:
import spacy

nlp = spacy.load('en')

def classify_tweet(tweet):
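    # NOTE: these keys are the original label ids. The model predicts Label.vocab indices,
    # and Label.vocab.stoi above is {1: 0, 0: 1, 2: 2}, so indices 0 and 1 are swapped here.
    labels = {0:'Negative', 1: 'Positive', 2:'Neutral'}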

    # tokenize the tweet
    tokenized = [t.text for t in nlp.tokenizer(tweet)]

    # convert to integer sequence using predefined tokenizer dictionary
    indexed = [tokenizer[t] for t in tokenized]

    # compute number of words
    length = [len(indexed)]

    # convert to tensor
    tensor = torch.LongTensor(indexed).to(device)

    # reshape in form of [batch, number of words]
    tensor = tensor.unsqueeze(1).T

    # convert to tensor
    length_tensor = torch.LongTensor(length)

    # get the prediction
    prediction  = model(tensor, length_tensor)

    _,prediction = torch.max(prediction, 1)

    return labels[prediction.item()]

In [None]:
classify_tweet("A valid explanation for why Trump won't let women on the golf course.")

'Positive'
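
Because the label vocabulary assigns indices by frequency, the index the model predicts need not equal the original label id. The sketch below traces a predicted index back to a sentiment name using the `Labels` mapping printed earlier and the label meanings set by `find_label`. `index_to_sentiment` is a hypothetical helper; in a live session `Label.vocab.itos` provides the same inverse mapping.

```python
# Label vocabulary as printed earlier: original label id -> model index.
label_stoi = {1: 0, 0: 1, 2: 2}

# Meaning of the original label ids written by find_label.
label_names = {0: 'Negative', 1: 'Positive', 2: 'Neutral'}

# Invert the vocabulary so a predicted index can be traced back to its label id.
label_itos = {index: label for label, index in label_stoi.items()}

def index_to_sentiment(predicted_index):
    return label_names[label_itos[predicted_index]]

for index in range(3):
    print(index, '->', index_to_sentiment(index))
```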