# Sentiment analysis on IMDb dataset with RNN

We will implement a multilayer recurrent neural network (RNN) with a many-to-one architecture to predict the sentiment of IMDb reviews. First, we will load the dataset.

In [4]:
import torch
import torchtext
from torchtext.datasets import IMDB

#silence deprecation warnings
torchtext.disable_torchtext_deprecation_warning()

#load in train and test sets
train_imdb = IMDB(split='train')
test_imdb = IMDB(split='test')

### Data preprocessing

Before we can feed the data into an RNN model, we need to apply several preprocessing steps:
 1. Split the training dataset into separate training and validation partitions.
 2. Identify the unique words in the training dataset.
 3. Map each unique word to a unique integer and encode each review as a sequence of these integers.
 4. Divide the dataset into mini-batches as input to the model.

In [5]:
from torch.utils.data.dataset import random_split

#separate out 5000 examples for validation set from training set
torch.manual_seed(1) #for reproducibility
train_imdb, valid_imdb = random_split(list(train_imdb), [20000, 5000])

The original training dataset contains 25,000 examples. 20,000 examples are randomly chosen for training, and 5,000 for validation.
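
As a quick illustration of what `random_split` does, the following sketch splits a small made-up list rather than the IMDb data. The toy list, the 8/2 split sizes, and the explicit `torch.Generator` are assumptions made for this example; the notebook itself seeds the global generator with `torch.manual_seed(1)`.

```python
import torch
from torch.utils.data import random_split

# toy stand-in for the list of (label, text) pairs
toy = list(range(10))

# a dedicated generator keeps the split reproducible
gen = torch.Generator().manual_seed(1)
train_part, valid_part = random_split(toy, [8, 2], generator=gen)

print(len(train_part), len(valid_part))
# 8 2

# the two partitions are disjoint and together cover the original data
print(sorted(list(train_part) + list(valid_part)) == toy)
# True
```

Each partition is a `Subset` that stores indices into the original data, so no examples are copied or lost.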

We will now find the unique words (tokens) in the training dataset. This can be done efficiently with the `Counter` class from the `collections` module, which is part of Python's standard library. To split the text into tokens, we first clean it with Python's `re` (regular expression) module and then apply the `.split()` string method.

In [7]:
import re
from collections import OrderedDict, Counter

#define tokenizer function
def tokenizer(text):
    #remove all the HTML markup from the text
    text = re.sub('<[^>]*>', '', text)
    
    #find and store emoticons
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    
    #lowercase the text, replace runs of non-word characters with a space, and append the emoticons (without the '-' nose)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    
    #split text into words
    tokens = text.split()
    
    return tokens

#create Counter object to track unique tokens
token_counts = Counter()
for label, line in train_imdb:
    tokens = tokenizer(line)
    token_counts.update(tokens)

#print number of unique tokens
print('Total unique tokens in vocabulary:', len(token_counts))

Total unique tokens in vocabulary: 69039
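
To make the tokenizer's behavior concrete, here is a small sketch that applies the same steps to two made-up review strings (they are not taken from the IMDb data). It also shows how `Counter` accumulates token frequencies across reviews.

```python
import re
from collections import Counter

def tokenizer(text):
    # same steps as the notebook's tokenizer, written with raw-string patterns
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

# invented example review with HTML markup and an emoticon
sample = "<br />This movie was great :-) loved it!"
tokens = tokenizer(sample)
print(tokens)
# ['this', 'movie', 'was', 'great', 'loved', 'it', ':)']

# count tokens over a tiny made-up corpus
counts = Counter()
for review in [sample, "Great cast, great plot."]:
    counts.update(tokenizer(review))
print(counts['great'])
# 3
```

Note that the emoticon is moved to the end of the token list and loses its `-`, so its position in the sentence is not preserved.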


Next, we are going to map each unique word to a unique integer. The `torchtext` package already provides a `Vocab` class for this purpose; we build one with the `vocab` factory function and then use it to encode the entire dataset.

In [8]:
from torchtext.vocab import vocab

#sort tokens by frequency
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)

#create OrderedDict mapping tokens to their frequencies
ordered_dict = OrderedDict(sorted_by_freq_tuples)

#create vocab object by passing in OrderedDict of tokens
vocab = vocab(ordered_dict)

#insert two special tokens at the start of the vocabulary
vocab.insert_token('<pad>', 0) #padding
vocab.insert_token('<unk>', 1) #unknown

#assign unknown token by default
vocab.set_default_index(1)

To demonstrate how the `vocab` object works, we apply it to a sequence of words below:

In [12]:
print([vocab[token] for token in ['here', 'is', 'an', 'example']])

[127, 7, 35, 457]
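
The mapping that the `vocab` object performs can be mimicked with a plain dictionary. The sketch below is not how `torchtext` is implemented; it uses a made-up ten-word corpus and hypothetical names (`toy_vocab`, `encode`) only to show how frequency-sorted tokens receive indices starting at 2, and how a token that was never seen falls back to the index of `<unk>`.

```python
from collections import Counter

# made-up corpus standing in for the tokenized training reviews
token_counts = Counter("the movie was good the plot was thin the end".split())

# most frequent tokens first; ties keep their first-seen order
sorted_tokens = [tok for tok, _ in
                 sorted(token_counts.items(), key=lambda x: x[1], reverse=True)]

# indices 0 and 1 are reserved for the special tokens
toy_vocab = {'<pad>': 0, '<unk>': 1}
for idx, tok in enumerate(sorted_tokens, start=2):
    toy_vocab[tok] = idx

def encode(tokens):
    # unseen tokens fall back to the index of '<unk>'
    return [toy_vocab.get(tok, toy_vocab['<unk>']) for tok in tokens]

print(encode(['the', 'plot', 'was', 'brilliant']))
# [2, 6, 3, 1]
```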


Observe that there will be some tokens in the validation and test sets that did not appear in the training set. As per the code above, these will be assigned the default index of $1$, and will therefore be mapped to the unknown token. Below, we define two functions to transform each text and label in the dataset to the desired encodings.

In [49]:
#define lambda function to tokenize and encode text
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]

#define lambda function to encode labels as 1 or 0
label_pipeline = lambda x: 1. if x==2 else 0.

Now, we want to create a `DataLoader` object for each of the training, validation, and test sets. Each loader uses a function `collate_batch` that applies the text and label encoding functions we wrote above and pads the sequences so that all sequences in a given mini-batch have the same length. We use a batch size of 32 for all three data loaders.

In [57]:
import torch.nn as nn
from torch.utils.data import DataLoader

#define function to pad and encode mini-batches
def collate_batch(batch):
    label_list, text_list, lengths = [], [], []
    
    #for each text and label in batch, append encoded label, encoded text, and length of encoded text
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        lengths.append(processed_text.size(0))
    
    #make label_list and lengths into tensors
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    
    #pad text_list
    padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    
    return padded_text_list, label_list, lengths

#create DataLoader with batch size 32 for the training, validation, and test set
batch_size = 32
train_dl = DataLoader(train_imdb, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
valid_dl = DataLoader(valid_imdb, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
test_dl = DataLoader(list(test_imdb), batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
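
The padding step inside `collate_batch` can be seen in isolation with a few short, made-up integer sequences. This is only a sketch of what `pad_sequence` returns; the sequence values are arbitrary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three encoded "reviews" of different lengths (values are arbitrary)
seqs = [torch.tensor([5, 8, 3]), torch.tensor([7]), torch.tensor([4, 2])]

# keep the true lengths before padding hides them
lengths = torch.tensor([s.size(0) for s in seqs])

# shorter sequences are filled with 0, the index of '<pad>'
padded = pad_sequence(seqs, batch_first=True)

print(padded.tolist())
# [[5, 8, 3], [7, 0, 0], [4, 2, 0]]
print(lengths.tolist())
# [3, 1, 2]
```

Because padding is done per mini-batch, each batch is only as wide as its longest review rather than the longest review in the whole dataset.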

We now want to perform feature embedding to reduce the dimensionality of the word vectors. We noted above that there are over $69000$ unique tokens in the vocabulary, which would be a very large number of input features to feed into an RNN. What's more, these features would be very sparse, as they act as a one-hot encoding for each token.

A more elegant approach is to map each word to a vector of a fixed size with real-valued elements. This not only helps decrease the effect of the curse of dimensionality, but it also extracts salient features since the embedding layer in a neural network has parameters that can be learned (similar to the convolutional layers in a CNN). We implement this with PyTorch using `nn.Embedding`.
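
As a minimal sketch of how `nn.Embedding` behaves, the example below uses a toy vocabulary of 10 indices and an embedding size of 3 (both chosen arbitrarily for illustration) and looks up a batch of two padded sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# 10 possible token indices, each mapped to a learnable 3-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)

# batch of two sequences of length 4; the last entry of the second is padding
batch = torch.tensor([[1, 2, 4, 5],
                      [4, 3, 2, 0]])
out = embedding(batch)

print(out.shape)
# torch.Size([2, 4, 3])

# the vector for padding_idx is all zeros and receives no gradient updates
print(out[1, 3].tolist())
# [0.0, 0.0, 0.0]
```

The layer is effectively a lookup table with one row per token index, so the input stays a compact tensor of integers instead of a sparse one-hot matrix.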

### Building an RNN model

We will create an RNN model for sentiment analysis, starting with an embedding layer that produces word embeddings of feature size 20 (`embed_dim=20`). Since the sequences are very long, the embedding layer is followed by an LSTM layer to account for long-range effects. Finally, we add a fully connected hidden layer and a fully connected output layer; the output layer uses a sigmoid activation function to predict the probability that the input sequence has positive sentiment.

In [51]:
#define RNN model class
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size):
        super().__init__()
        
        #embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        
        #LSTM layer
        self.rnn = nn.LSTM(embed_dim, rnn_hidden_size, batch_first=True)
        
        #fully connected hidden layer with ReLU activation
        self.fc1 = nn.Linear(rnn_hidden_size, fc_hidden_size)
        self.relu = nn.ReLU()
        
        #fully connected output layer with sigmoid activation
        self.fc2 = nn.Linear(fc_hidden_size, 1)
        self.sigmoid = nn.Sigmoid()
        
    #define forward pass
    def forward(self, text, lengths):
        out = self.embedding(text)
        out = nn.utils.rnn.pack_padded_sequence(out, 
                                                lengths.cpu().numpy(), 
                                                enforce_sorted=False, 
                                                batch_first=True)
        out, (hidden, cell) = self.rnn(out)
        out = hidden[-1, :, :]
        out = self.fc1(out)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.sigmoid(out)
        return out
    
#create instance of RNN class with correct parameters
vocab_size = len(vocab)
embed_dim = 20
rnn_hidden_size = 64
fc_hidden_size = 64
torch.manual_seed(1) #for reproducibility
model = RNN(vocab_size, embed_dim, rnn_hidden_size, fc_hidden_size)
model

RNN(
  (embedding): Embedding(69041, 20, padding_idx=0)
  (rnn): LSTM(20, 64, batch_first=True)
  (fc1): Linear(in_features=64, out_features=64, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=64, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
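
The `forward` method relies on `pack_padded_sequence` so that the LSTM stops at each sequence's true length and `hidden[-1]` holds the state after the last real token. The sketch below checks this on random toy inputs; the sizes (4 input features, 6 hidden units) and the tolerance are assumptions chosen for the example.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(1)
lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

# batch of 2 sequences padded to length 5; the second has only 3 real steps
x = torch.randn(2, 5, 4)
lengths = torch.tensor([5, 3])

# packing tells the LSTM to ignore everything past each sequence's length
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
_, (hidden, cell) = lstm(packed)
last = hidden[-1]  # final hidden state of the last layer, one row per sequence
print(last.shape)
# torch.Size([2, 6])

# running the second sequence alone, truncated to its 3 real steps, gives the same state
_, (h_alone, _) = lstm(x[1:2, :3, :])
same = torch.allclose(last[1], h_alone[-1, 0], atol=1e-5)
print(same)
```

Without packing, the final hidden state of a short review would also depend on the padding positions that follow it.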

Below, we define a `train` function that runs the training loop for one epoch and an `evaluate` function that measures the model's performance on a given dataset. Both return the accuracy and the average loss per example.

In [52]:
#define train function for one epoch
def train(dataloader):
    model.train()
    total_acc, total_loss = 0, 0
    
    for text_batch, label_batch, lengths in dataloader:
        
        #reset gradients to zero
        optimizer.zero_grad()
        
        #generate predictions on mini-batch 
        pred = model(text_batch, lengths)[:, 0]
        
        #calculate loss
        loss = loss_fn(pred, label_batch)
        
        #compute gradients
        loss.backward()
        
        #update parameters using gradients
        optimizer.step()
        
        #sum accuracy and loss on mini-batch
        total_acc += ((pred >= 0.5).float() == label_batch).float().sum().item()
        total_loss += loss.item()*label_batch.size(0)
        
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)

#define evaluate function for one epoch
def evaluate(dataloader):
    model.eval()
    total_acc, total_loss = 0, 0
    
    with torch.no_grad(): #don't compute gradients
        for text_batch, label_batch, lengths in dataloader:
            
            #generate predictions on mini-batch
            pred = model(text_batch, lengths)[:, 0]
            
            #calculate loss
            loss = loss_fn(pred, label_batch)
            
            #sum accuracy and loss on mini-batch
            total_acc += ((pred>=0.5).float() == label_batch).float().sum().item()
            total_loss += loss.item()*label_batch.size(0)
            
    return total_acc/len(dataloader.dataset), total_loss/len(dataloader.dataset)
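
The bookkeeping in `train` and `evaluate` multiplies each batch's mean loss by the batch size before summing. A small pure-Python sketch with made-up probabilities shows why: the result equals the per-example mean over the whole dataset even when the last batch is smaller. The helper `bce` is a hand-written binary cross-entropy used only for this illustration.

```python
import math

def bce(preds, labels):
    # mean binary cross-entropy over one batch
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, labels)) / len(preds)

# two made-up mini-batches of (predicted probabilities, labels); the last one is smaller
batches = [([0.9, 0.2, 0.6], [1.0, 0.0, 0.0]),
           ([0.4], [1.0])]

total_correct, total_loss, n_examples = 0, 0.0, 0
for preds, labels in batches:
    # a prediction counts as positive when its probability is at least 0.5
    total_correct += sum(float(p >= 0.5) == y for p, y in zip(preds, labels))
    # undo the per-batch averaging so every example carries equal weight
    total_loss += bce(preds, labels) * len(labels)
    n_examples += len(labels)

accuracy = total_correct / n_examples
avg_loss = total_loss / n_examples
print(accuracy)
# 0.5
```

Averaging the batch means directly would instead give the single example in the second batch as much weight as the three examples in the first.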

Next, we define a binary cross-entropy loss function and choose the Adam optimizer, then we train the model for $10$ epochs and display the training and validation performance.

In [54]:
#define binary cross-entropy loss function
loss_fn = nn.BCELoss()

#define Adam optimizer with learning rate 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

#train the model for 10 epochs
num_epochs = 10
torch.manual_seed(1) #for reproducibility
for epoch in range(num_epochs):
    acc_train, loss_train = train(train_dl)
    acc_valid, loss_valid = evaluate(valid_dl)
    print(f'Epoch {epoch} accuracy: {acc_train:.4f} val_accuracy: {acc_valid:.4f}')

Epoch 0 accuracy: 0.5860 val_accuracy: 0.5486
Epoch 1 accuracy: 0.6826 val_accuracy: 0.7746
Epoch 2 accuracy: 0.8293 val_accuracy: 0.8326
Epoch 3 accuracy: 0.8826 val_accuracy: 0.8534
Epoch 4 accuracy: 0.9124 val_accuracy: 0.8544
Epoch 5 accuracy: 0.9354 val_accuracy: 0.8610
Epoch 6 accuracy: 0.9508 val_accuracy: 0.8552
Epoch 7 accuracy: 0.9622 val_accuracy: 0.8460
Epoch 8 accuracy: 0.9713 val_accuracy: 0.8730
Epoch 9 accuracy: 0.9797 val_accuracy: 0.8660
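
The logged accuracies can also be inspected programmatically. The sketch below copies the numbers from the output above into two lists and reports the epoch with the best validation accuracy together with the train/validation gap after the final epoch; the variable names are chosen for this example only.

```python
# accuracies copied from the training log above
train_acc = [0.5860, 0.6826, 0.8293, 0.8826, 0.9124,
             0.9354, 0.9508, 0.9622, 0.9713, 0.9797]
valid_acc = [0.5486, 0.7746, 0.8326, 0.8534, 0.8544,
             0.8610, 0.8552, 0.8460, 0.8730, 0.8660]

# epoch (0-indexed, as in the log) with the highest validation accuracy
best_epoch = max(range(len(valid_acc)), key=lambda i: valid_acc[i])
print(best_epoch, valid_acc[best_epoch])
# 8 0.873

# difference between training and validation accuracy after the last epoch
final_gap = train_acc[-1] - valid_acc[-1]
print(round(final_gap, 4))
# 0.1137
```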


Looking at the training and validation performance above, we can see that the model is overfitting to the training set: training accuracy keeps climbing to about $98\%$, while validation accuracy stays between roughly $85\%$ and $87\%$ from epoch 3 onward. We now evaluate the model on the test set.

In [58]:
#evaluate model on the test set
acc_test, _ = evaluate(test_dl)
print(f'Test accuracy: {acc_test:.4f}')

Test accuracy: 0.8454


The model achieves an accuracy of $84.54\%$ on the test set, which is decent, but not as good as the other method we implemented using tf-idf and a logistic regression classifier.