# Sentiment Analysis

Reference [Notebook](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb)

Implementation changes from the above notebook:
1. Three separate LSTM layers
2. A for loop in the forward function passes the input through each LSTM layer
3. Trained on reversed text (for example, "my name is Rohan" becomes "Rohan is name my")
4. Achieves 87% or more accuracy on the test set (87.17% in the run recorded below)


Build a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the IMDb dataset.

In [None]:
import torch
from torchtext import data
from torchtext import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields we defined previously. The IMDb dataset consists of 50,000 movie reviews, each labelled as either positive or negative.

In [None]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

Sample review

In [None]:
print(vars(train_data.examples[0]))

{'text': ['A', 'great', 'storyline', 'with', 'a', 'message', '.', 'Joan', 'Plowright', 'is', 'superb', 'as', '"', 'Phoebe', '"', ',', 'Mike', 'Kopsa', 'is', 'hilarious', 'as', '"', 'coach', '"', 'and', 'Richard', 'de', 'Klerk', 'plays', 'the', 'role', 'of', '"', 'Carmine', '"', 'superbly', '.', 'Mischa', 'Barton', 'as', '"', 'Frankie', '"', 'puts', 'in', 'a', 'good', 'performance', 'and', 'Ingrid', 'as', '"', 'Hazel', '"', 'plays', 'her', 'first', 'lead', 'extremely', 'well', '.', 'This', 'film', 'is', 'superbly', 'directed', 'by', 'Jo', '-', 'Beth', 'Williams', '.', 'The', 'editing', 'is', 'first', 'rate', '.'], 'label': 'pos'}


Reverse the text in Training Dataset

In [None]:
for i in range(len(train_data)):
  train_data.examples[i].text = train_data.examples[i].text[::-1]
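
As a minimal sketch of what the loop above does to a single example, the snippet below reverses a token list with the same [::-1] slice. The whitespace tokenizer is an assumption made for brevity; the notebook itself tokenizes with spaCy.

```python
# Toy illustration of token-level reversal.
# str.split() stands in for the spaCy tokenizer used in the notebook.
sentence = "my name is Rohan"
tokens = sentence.split()

# Slicing with a step of -1 returns a new list in reverse order.
reversed_tokens = tokens[::-1]

print(reversed_tokens)
```

Note that the loop runs on train_data before the validation split is created, so the validation examples are reversed as well, while test_data keeps its original token order.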

Same reversed sample review

In [None]:
print(vars(train_data.examples[0]))

{'text': ['.', 'rate', 'first', 'is', 'editing', 'The', '.', 'Williams', 'Beth', '-', 'Jo', 'by', 'directed', 'superbly', 'is', 'film', 'This', '.', 'well', 'extremely', 'lead', 'first', 'her', 'plays', '"', 'Hazel', '"', 'as', 'Ingrid', 'and', 'performance', 'good', 'a', 'in', 'puts', '"', 'Frankie', '"', 'as', 'Barton', 'Mischa', '.', 'superbly', '"', 'Carmine', '"', 'of', 'role', 'the', 'plays', 'Klerk', 'de', 'Richard', 'and', '"', 'coach', '"', 'as', 'hilarious', 'is', 'Kopsa', 'Mike', ',', '"', 'Phoebe', '"', 'as', 'superb', 'is', 'Plowright', 'Joan', '.', 'message', 'a', 'with', 'storyline', 'great', 'A'], 'label': 'pos'}


Create Validation Dataset

In [None]:
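import random

# random.seed(SEED) seeds Python's global RNG and returns None, so random_state is None here;
# split() then falls back to the freshly seeded global RNG state, which keeps the split reproducible
train_data, valid_data = train_data.split(random_state = random.seed(SEED))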

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
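
The 17,500 / 7,500 split above follows from the default split_ratio of 0.7. The sketch below mimics that behaviour in plain Python; split_examples is a hypothetical helper written for illustration, not TorchText's implementation, and its shuffle order will differ from the one TorchText produces.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` and cut it into two parts."""
    rng = random.Random(seed)            # local RNG so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * split_ratio)
    return shuffled[:n_train], shuffled[n_train:]

# 25,000 stand-in examples, matching the size of the IMDb training set.
train_part, valid_part = split_examples(range(25_000))

print(len(train_part), len(valid_part))
```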


The following builds the vocabulary, only keeping the most common max_size tokens.

Next is the use of pre-trained word embeddings. Instead of having our word embeddings initialized randomly, they are initialized with these pre-trained vectors. We get these vectors simply by specifying which vectors we want and passing them as an argument to build_vocab. TorchText handles downloading the vectors and associating them with the correct words in our vocabulary.

By default, TorchText will initialize words in your vocabulary but not in your pre-trained embeddings to zero. We don't want this, and instead initialize them randomly by setting unk_init to torch.Tensor.normal_. This will now initialize those words via a Gaussian distribution.

In [None]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)
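
To make the effect of max_size concrete, here is a small plain-Python sketch of a frequency-capped vocabulary. The build_vocab function below is a hypothetical stand-in, not TorchText's code; it only assumes that the two special tokens <unk> and <pad> are placed ahead of the most frequent tokens.

```python
from collections import Counter

def build_vocab(token_lists, max_size):
    """Keep the `max_size` most frequent tokens, after two special tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

corpus = [["a", "great", "film"], ["a", "dull", "film"], ["a", "film"]]
itos, stoi = build_vocab(corpus, max_size=2)

print(itos)

# Tokens that did not make the cut fall back to the <unk> index.
indices = [stoi.get(tok, stoi["<unk>"]) for tok in ["a", "great", "film"]]
print(indices)
```

With max_size = 25,000 the same scheme yields 25,002 entries, which is why the embedding matrix printed later has shape [25002, 100].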

As before, we create the iterators, placing the tensors on the GPU if one is available.

Packed padded sequences also require all of the tensors within a batch to be sorted by their lengths. This is handled in the iterator by setting sort_within_batch = True.

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
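
The ordering requirement can be sketched without TorchText. The helper below, pad_and_sort, is hypothetical and only illustrates the idea: sequences are ordered by decreasing length, padded to the longest one, and the true lengths are kept alongside so that padding can later be skipped.

```python
def pad_and_sort(batch, pad_token="<pad>"):
    """Sort sequences by decreasing length, then pad to the longest one."""
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0]
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths

batch = [["good"], ["not", "bad", "at", "all"], ["very", "dull"]]
padded, lengths = pad_and_sort(batch)

print(lengths)
for row in padded:
    print(row)
```

The lengths list plays the role of text_lengths in the model's forward function. By default, nn.utils.rnn.pack_padded_sequence expects these lengths in decreasing order.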

Build the Model

Three separate LSTM layers

For loop in the forward function to pass through each LSTM layer

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.layers = nn.ModuleList() # Definition of Layers as ModuleList()
        self.layers.append(nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)) # First layer is Embedding Layer
        self.layers.append(nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=1)) # 1st LSTM layer

        self.layers.append(nn.LSTM(hidden_dim, 
                    hidden_dim, 
                    num_layers=1)) # 2nd LSTM layer

        self.layers.append(nn.LSTM(hidden_dim, 
            hidden_dim, 
            num_layers=1)) # 3rd LSTM layer

        self.layers.append(nn.Linear(hidden_dim, output_dim)) # Linear Layer
        self.layers.append(nn.Dropout(dropout)) # Dropout 
        
    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.layers[-1](self.layers[0](text)) # Applying Dropout 
        
        #embedded = [sent len, batch size, emb dim]
        
        #pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
        hidden = packed_embedded
        for layer in range(1,(len(self.layers)-2)): # Iterate through the three LSTM layers (indices 1 to 3)
          # if layer != 1:
          #   hidden =  self.layers[-1](hidden)
          # The first LSTM receives the packed sequence; each later LSTM receives the
          # previous layer's final hidden state, which it treats as a sequence of length 1
          packed_output, (hidden, cell) = self.layers[layer](hidden)
        
        #packed_output is not used after the loop
        
        #hidden = [1, batch size, hid dim]
        #cell = [1, batch size, hid dim]
        
        hidden = self.layers[-1](hidden) # Applying Dropout 
        #hidden = [1, batch size, hid dim]
            
        return self.layers[-2](hidden) # Applying Linear function, output = [1, batch size, 1]
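
The tensor shapes in the forward pass can be checked in isolation. The sketch below repeats the same loop on random data with toy sizes (all sizes here are assumptions chosen for illustration) and records the shape of hidden after each LSTM.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
sent_len, batch_size, emb_dim, hid_dim = 5, 3, 8, 4

embedded = torch.randn(sent_len, batch_size, emb_dim)
lengths = torch.tensor([5, 3, 2])   # decreasing order, kept on the CPU

lstms = nn.ModuleList([
    nn.LSTM(emb_dim, hid_dim),
    nn.LSTM(hid_dim, hid_dim),
    nn.LSTM(hid_dim, hid_dim),
])
fc = nn.Linear(hid_dim, 1)

hidden = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
shapes = []
for lstm in lstms:
    # After the first layer, `hidden` is a plain tensor of shape
    # [1, batch size, hid dim], i.e. a sequence of length 1.
    _, (hidden, _) = lstm(hidden)
    shapes.append(tuple(hidden.shape))

logits = fc(hidden)               # [1, batch size, 1]
predictions = logits.squeeze(0)   # [batch size, 1]

print(shapes)
print(tuple(logits.shape), tuple(predictions.shape))
```

This is why the training and evaluation functions call squeeze(0) on the model output and unsqueeze(1) on the labels: both end up with shape [batch size, 1].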

To ensure the pre-trained vectors can be loaded into the model, the EMBEDDING_DIM must be equal to that of the pre-trained GloVe vectors loaded earlier.

We get our pad token index from the vocabulary, getting the actual string representing the pad token from the field's pad_token attribute, which is <pad> by default.

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 64
OUTPUT_DIM = 1
N_LAYERS = 3
BIDIRECTIONAL = False
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,609,321 trainable parameters
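
The parameter count can be reproduced by hand. The sketch below assumes the standard PyTorch LSTM layout (four gates, each with input weights, recurrent weights and two bias vectors) and uses the vocabulary size of 25,002 reported by the embedding shape; lstm_params is a helper introduced here for illustration.

```python
def lstm_params(input_dim, hidden_dim):
    """Parameters of a single-layer, unidirectional LSTM."""
    weights = 4 * hidden_dim * (input_dim + hidden_dim)  # input and recurrent weights
    biases = 2 * 4 * hidden_dim                          # bias_ih and bias_hh
    return weights + biases

vocab_size, emb_dim, hid_dim, out_dim = 25_002, 100, 64, 1

embedding = vocab_size * emb_dim
lstm_1 = lstm_params(emb_dim, hid_dim)
lstm_2 = lstm_params(hid_dim, hid_dim)
lstm_3 = lstm_params(hid_dim, hid_dim)
linear = hid_dim * out_dim + out_dim

total = embedding + lstm_1 + lstm_2 + lstm_3 + linear
print(f"{total:,}")
```

Most of the parameters (2,500,200) sit in the embedding layer; the three LSTM layers together contribute 109,056.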


In [None]:

pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


We then replace the initial weights of the embedding layer with the pre-trained embeddings.

In [None]:
model.layers[0].weight.data.copy_(pretrained_embeddings)

tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9597,  0.8905, -0.7076,  ...,  0.3940, -1.2075, -0.9683],
        [-0.3404,  0.2269,  0.0731,  ..., -0.4427,  0.6267,  0.2811],
        [ 0.7507, -1.9179,  2.2029,  ..., -1.5966,  0.8308, -0.1398]])

Because our <unk> and <pad> tokens aren't in the pre-trained vocabulary, they were initialized using unk_init (an $\mathcal{N}(0,1)$ distribution) when we built the vocab. It is preferable to initialize both to all zeros, which explicitly tells the model that, initially, they are irrelevant for determining sentiment.

We do this by manually setting their row in the embedding weights matrix to zeros. We get their row by finding the index of the tokens, which we have already done for the padding index.

In [None]:

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.layers[0].weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.layers[0].weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.layers[0].weight.data)


tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.9597,  0.8905, -0.7076,  ...,  0.3940, -1.2075, -0.9683],
        [-0.3404,  0.2269,  0.0731,  ..., -0.4427,  0.6267,  0.2811],
        [ 0.7507, -1.9179,  2.2029,  ..., -1.5966,  0.8308, -0.1398]])


### Train the Model

We use the Adam optimizer and BCEWithLogitsLoss, as this is a binary classification problem.

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [None]:
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)
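
For intuition, binary cross-entropy on raw logits can be written out in plain Python. The sketch below uses a commonly used numerically stable form and checks it against the textbook definition; both functions are illustrative helpers, not PyTorch's implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits."""
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)

def naive_bce(logits, targets):
    """Textbook form: apply the sigmoid first, then take logs."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 0.0]
stable = bce_with_logits(logits, targets)
naive = naive_bce(logits, targets)
print(abs(stable - naive) < 1e-12)

# A logit of 0 means "no idea": the loss is ln 2 regardless of the label.
chance_loss = bce_with_logits([0.0], [1.0])
print(round(chance_loss, 3))
```

A loss close to ln 2 (about 0.693) therefore indicates predictions no better than chance, which is where the first training epoch starts.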

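Function to calculate accuracy. The model outputs raw logits, so they are passed through a sigmoid and rounded to 0 or 1 before being compared with the labels.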

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
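
A worked example of the same calculation, in plain Python and on made-up numbers, may help. The function binary_accuracy_plain is a hypothetical list-based counterpart of binary_accuracy; it thresholds the sigmoid output at 0.5 instead of calling torch.round.

```python
import math

def binary_accuracy_plain(logits, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

logits = [2.0, -1.0, 0.5, -0.3]   # raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]     # 1.0 = positive, 0.0 = negative

acc = binary_accuracy_plain(logits, labels)
print(acc)  # 0.75
```

The third example has a positive logit but a negative label, so three of the four predictions are correct.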

Function for training our model.

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        text_lengths = text_lengths.cpu()
        
        predictions = model(text, text_lengths).squeeze(0)

        batch.label = batch.label.unsqueeze(1)
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Function to evaluate our model.

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            text_lengths = text_lengths.cpu()
            
            predictions = model(text, text_lengths).squeeze(0)
            batch.label = batch.label.unsqueeze(1)
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Start Training

In [None]:
N_EPOCHS = 9

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 7s
	Train Loss: 0.691 | Train Acc: 52.48%
	 Val. Loss: 0.693 |  Val. Acc: 50.79%
Epoch: 02 | Epoch Time: 0m 7s
	Train Loss: 0.676 | Train Acc: 56.16%
	 Val. Loss: 0.588 |  Val. Acc: 71.63%
Epoch: 03 | Epoch Time: 0m 7s
	Train Loss: 0.492 | Train Acc: 77.93%
	 Val. Loss: 0.361 |  Val. Acc: 85.46%
Epoch: 04 | Epoch Time: 0m 7s
	Train Loss: 0.345 | Train Acc: 86.39%
	 Val. Loss: 0.292 |  Val. Acc: 87.94%
Epoch: 05 | Epoch Time: 0m 7s
	Train Loss: 0.287 | Train Acc: 88.76%
	 Val. Loss: 0.282 |  Val. Acc: 88.71%
Epoch: 06 | Epoch Time: 0m 7s
	Train Loss: 0.249 | Train Acc: 90.66%
	 Val. Loss: 0.267 |  Val. Acc: 89.43%
Epoch: 07 | Epoch Time: 0m 7s
	Train Loss: 0.224 | Train Acc: 91.66%
	 Val. Loss: 0.291 |  Val. Acc: 89.32%
Epoch: 08 | Epoch Time: 0m 8s
	Train Loss: 0.196 | Train Acc: 92.87%
	 Val. Loss: 0.269 |  Val. Acc: 89.65%
Epoch: 09 | Epoch Time: 0m 7s
	Train Loss: 0.176 | Train Acc: 93.56%
	 Val. Loss: 0.280 |  Val. Acc: 89.64%
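
Because the checkpoint is overwritten only when the validation loss improves, the saved weights do not come from the last epoch. The sketch below replays the validation losses printed above through the same strict comparison to find which epoch was kept.

```python
# Validation losses copied from the training log above, epochs 1 to 9.
valid_losses = [0.693, 0.588, 0.361, 0.292, 0.282, 0.267, 0.291, 0.269, 0.280]

best_valid_loss = float("inf")
best_epoch = None

for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:   # same strict comparison as the training loop
        best_valid_loss = valid_loss
        best_epoch = epoch             # the notebook calls torch.save at this point

print(best_epoch, best_valid_loss)  # 6 0.267
```

The weights evaluated on the test set are therefore the ones saved after epoch 6.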


Check Test Accuracy

In [None]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.315 | Test Acc: 87.17%
