# 2 - Updated Sentiment Analysis

In the previous notebook, we got the fundamentals down for sentiment analysis. In this notebook, we'll actually get decent results.

We will use:
- pre-trained word embeddings
- a different RNN architecture
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer

This will allow us to achieve ~85% test accuracy.

## Preparing Data

The same as before, we'll set the seed, define the `Fields` and get the train/valid/test splits.

In [1]:
import torch
from torchtext import data
from torchtext import datasets
import random

SEED = 1234


# set seed for both cpu and gpu computation
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

# Field models common text processing datatypes that can be represented by tensors
# use spacy to tokenize sentences
TEXT = data.Field(tokenize='spacy')
# LabelField is designed to hold labels for the classification
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

# load the IMDB dataset using its predefined train/test split
train, test = datasets.IMDB.splits(TEXT, LABEL)

# hold out part of the training set as a validation set
# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())

The first update is the addition of pre-trained word embeddings. These vectors have been trained on corpora of billions of tokens. Now, instead of having our word embeddings initialized randomly, they are initialized with these pre-trained vectors, where words that appear in similar contexts appear nearby in this vector space.

To use them, we pass the name of the vectors to `build_vocab` through the `vectors` argument, and torchtext downloads them if they are not already cached. `glove` is the algorithm used to calculate the vectors (go [here](https://nlp.stanford.edu/projects/glove/) for more), `6B` indicates these vectors were trained on 6 billion tokens, and `100d` indicates these vectors are 100-dimensional.

**Note**: these vectors are about 862MB, so watch out if you have a limited internet connection.

In [2]:
# build the vocabulary from the training set, keeping at most the 25,000 most frequent tokens
# attach 100-dimensional GloVe vectors (trained on 6 billion tokens) to the vocabulary
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)
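
To make "nearby in this vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The 3-dimensional vectors are hypothetical values invented for illustration; they are not real GloVe vectors, which are 100-dimensional here and learned from data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for illustration only.
toy_vectors = {
    "film":   [0.9, 0.8, 0.1],
    "movie":  [1.0, 0.7, 0.2],
    "banana": [-0.2, 0.1, 0.9],
}

sim_film_movie = cosine_similarity(toy_vectors["film"], toy_vectors["movie"])
sim_film_banana = cosine_similarity(toy_vectors["film"], toy_vectors["banana"])
print(f"film vs movie:  {sim_film_movie:.3f}")
print(f"film vs banana: {sim_film_banana:.3f}")
```

In this toy setup, the two words that would plausibly share contexts score higher than the unrelated pair, which is the property the pre-trained vectors are meant to provide.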

As before, we create the iterators.

In [3]:
# use a small batch size of 4 to keep memory usage low
BATCH_SIZE = 4

# create iterators for the train, valid and test sets; sort_key lets the
# iterator batch together sentences of similar length
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
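
The reason for batching sentences of similar length is that every sentence in a batch is padded to the longest one. The sketch below counts pad tokens for a naive ordering and for a length-sorted ordering. It only illustrates the idea and is not how `BucketIterator` is implemented; the review lengths and the helper `padding_needed` are hypothetical.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive groups of `batch_size` sequences
    are each padded to the longest sequence in their group."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Hypothetical review lengths (in tokens), invented for illustration.
lengths = [3, 50, 4, 48, 5, 52, 2, 49]

unsorted_padding = padding_needed(lengths, batch_size=2)
bucketed_padding = padding_needed(sorted(lengths), batch_size=2)
print("pad tokens, original order:", unsorted_padding)
print("pad tokens, sorted by length:", bucketed_padding)
```

Fewer pad tokens means less wasted computation per batch.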

## Build the Model

The model features the most drastic changes.

### Different RNN Architecture

We use a different RNN architecture called a Long Short-Term Memory (LSTM). Why is an LSTM better than a standard RNN? The hidden state can be thought of as a "memory" of the words seen by the model. It is difficult to train a standard RNN as the gradient decays exponentially along the sequence, causing the RNN to "forget" what has happened earlier in the sequence. LSTMs have an extra recurrent state called a _cell_, which can be thought of as the "memory" of the LSTM and can remember information for many time steps. LSTMs also use multiple _gates_, which control the flow of information into and out of the memory. For more information, go [here](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).
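
As a rough illustration of how gates let the cell hold on to information, here is a sketch of a single scalar LSTM cell using the standard LSTM equations. All parameter values are hypothetical, chosen by hand so that the forget gate stays open and the input gate stays closed; a trained LSTM learns its own values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. `p` maps parameter names to floats."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    c = f * c_prev + i * g   # keep part of the old cell, write part of the candidate
    h = o * math.tanh(c)     # expose part of the cell as the hidden state
    return h, c

# Hypothetical parameters: forget gate close to 1, input gate close to 0.
params = {
    "w_i": 0.0, "u_i": 0.0, "b_i": -10.0,
    "w_f": 0.0, "u_f": 0.0, "b_f": 10.0,
    "w_g": 1.0, "u_g": 1.0, "b_g": 0.0,
    "w_o": 0.0, "u_o": 0.0, "b_o": 0.0,
}

h, c = 0.0, 1.0  # start with the value 1.0 stored in the cell
for _ in range(50):
    h, c = lstm_step(x=0.5, h_prev=h, c_prev=c, p=params)

# For comparison: a value that is multiplied by 0.9 at every step.
decayed = 0.9 ** 50

print(f"LSTM cell after 50 steps: {c:.4f}")
print(f"0.9 ** 50:                {decayed:.4f}")
```

With these hand-picked gates the cell value is almost unchanged after 50 steps, while a quantity that shrinks by a constant factor each step has nearly vanished.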

### Bidirectional RNN

The concept behind a bidirectional RNN is simple. As well as having an RNN processing the words in the sentence from the first to the last, we have a second RNN processing the words in the sentence from the **last to the first**. PyTorch simplifies this by running both directions inside a single RNN module. We then form the final hidden state, `hidden`, by concatenating the hidden state from the last word of the sentence from the forward RNN with the hidden state of the first word of the sentence from the backward RNN, both of which are the final hidden states from their respective RNNs.

![](https://i.imgur.com/itmIIgx.png)

### Multi-layer RNN

Multi-layer RNNs (also called *deep RNNs*) are another simple concept. The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time step $t$ will be the input to the RNN above it at time step $t$. The prediction is then usually made from the final hidden state of the final (highest) layer. These are easily combined with bidirectional RNNs, where each extra layer adds an additional forward and backward RNN.

![](https://i.imgur.com/knsIzeh.png)
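
When layers and directions are combined, it helps to know where each final hidden state sits in the returned `hidden` tensor. The sketch below assumes the layout documented for PyTorch's recurrent modules, where the first axis of `hidden` has size `num_layers * num_directions` and is ordered by layer first, then by direction. The helper `hidden_index` is hypothetical and uses plain Python lists rather than tensors.

```python
def hidden_index(layer, direction, num_directions=2):
    """Position of (layer, direction) along the first axis of `hidden`.
    direction: 0 = forward, 1 = backward."""
    return layer * num_directions + direction

N_LAYERS = 2
labels = [None] * (N_LAYERS * 2)
for layer in range(N_LAYERS):
    for direction, name in enumerate(("forward", "backward")):
        labels[hidden_index(layer, direction)] = f"layer {layer} {name}"

print(labels)
print("hidden[-2] ->", labels[-2])
print("hidden[-1] ->", labels[-1])
```

Under that assumption, the last two entries are the forward and backward states of the top layer, which are the two states a classifier would concatenate.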

### Regularization

Although we've added improvements to our model, each one adds additional parameters. Without going into too much detail about overfitting, the more parameters you have in your model, the higher the probability that you'll overfit (have a low train error but high validation/test error). To combat this, we use regularization. More specifically, we use a method of regularization called *dropout*. Dropout works by randomly *dropping out* (setting to 0) neurons during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter, and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a "weaker" model (one with fewer parameters). The predictions from all these "weaker" models (one for each forward pass) get averaged together in the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.
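
Here is a minimal sketch of dropout on a plain Python list, in the "inverted" form that `nn.Dropout` uses during training: surviving values are scaled by `1 / (1 - p)` so the expected activation stays the same. The function name, the seed and the all-ones activations are hypothetical choices for illustration.

```python
import random

def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p and scale the
    survivors by 1 / (1 - p) so the expected value is unchanged."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(1234)      # fixed seed so the run is repeatable
activations = [1.0] * 10_000   # hypothetical activations, all equal to 1
dropped = dropout(activations, p=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed:    {zero_fraction:.3f}")
print(f"mean after dropout: {mean_after:.3f}")
```

Each call draws a fresh random mask, which is why every forward pass during training effectively uses a different "weaker" model.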

### Implementation Details

To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN` in `__init__`. Also note that, in `forward`, the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state.

As the final hidden state of our LSTM has both a forward and a backward component, which are concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.

Implementing bidirectionality and adding additional layers are done by passing values for the `bidirectional` and `num_layers` arguments of the RNN/LSTM.

Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropout for each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers (`x` or `fc` in this case); you only ever want to use dropout on intermediate layers. The LSTM has a `dropout` argument which adds dropout on the connections between hidden states in one layer and hidden states in the next layer.

In [4]:
import torch.nn as nn

class RNN(nn.Module):
    # the model inherits from nn.Module
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        # embedding layer: maps token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # multi-layer LSTM, optionally bidirectional, with dropout between layers
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        # output layer; hidden_dim*2 assumes bidirectional=True
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        # dropout layer applied in forward
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [sent len, batch size]
        # embed the input, then apply dropout: each element is zeroed with
        # probability p and the remaining elements are scaled by 1/(1-p)
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        output, (hidden, cell) = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        #cell = [num layers * num directions, batch size, hid. dim]
        
        # concatenate the final forward (hidden[-2]) and backward (hidden[-1])
        # hidden states of the top layer, then apply dropout
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden = [batch size, hid. dim * num directions]
        
        # no squeeze here: squeeze(0) would drop the batch dimension when batch size is 1
        return self.fc(hidden)

Like before, we'll create an instance of our RNN class, with the new parameters and arguments for the number of layers, bidirectionality and dropout probability.

To ensure the pre-trained vectors can be loaded into the model, the `EMBEDDING_DIM` must be equal to that of the pre-trained GloVe vectors loaded earlier.

In [5]:
# build a 2-layer bidirectional LSTM with a dropout probability of 0.5
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
# each LSTM direction has a hidden state of size 256
HIDDEN_DIM = 256
# a single output value: the logit for positive vs. negative sentiment
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

The final addition is copying the pre-trained word embeddings we loaded earlier into the `embedding` layer of our model.

We retrieve the embeddings from the field's vocab, and ensure they're the correct size, _**[vocab size, embedding dim]**_ 

In [6]:
# retrieve the pre-trained embedding vectors from the vocabulary
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])
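
The vocabulary has 25,002 rows rather than 25,000 because, with the default `Field` settings, two special tokens (`<unk>` and `<pad>`) are added on top of the `max_size` most frequent tokens. The sketch below reproduces that counting on a toy corpus. The function `build_vocab` here is a hypothetical stand-in written with the standard library, not torchtext's implementation.

```python
from collections import Counter

def build_vocab(tokens, max_size, specials=("<unk>", "<pad>")):
    """Map tokens to indices: special tokens first, then the `max_size`
    most frequent tokens."""
    itos = list(specials) + [tok for tok, _ in Counter(tokens).most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Hypothetical toy corpus, invented for illustration.
tokens = "the film was great the cast was great the end".split()
itos, stoi = build_vocab(tokens, max_size=3)
print(itos)
print(len(itos))  # max_size plus the two special tokens
```
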


We then replace the initial weights of the `embedding` layer with the pre-trained embeddings.

In [7]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1123,  0.3113,  0.3317,  ..., -0.4576,  0.6191,  0.5304],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

## Train the Model

Now to training the model.

The only change we'll make here is changing the optimizer from `SGD` to `Adam`. SGD updates all parameters with the same learning rate and choosing this learning rate can be tricky. Adam adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about Adam (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).

To change `SGD` to `Adam`, we simply change `optim.SGD` to `optim.Adam`. Note that we do not have to provide an initial learning rate for Adam, as PyTorch specifies a sensible default.

In [8]:
import torch.optim as optim

# use the Adam optimizer to minimize the loss function
# Adam has a default learning rate, so we do not need to set one manually
optimizer = optim.Adam(model.parameters())
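
To see what "adapts the learning rate for each parameter" means, here is a sketch of one Adam update for a single scalar parameter, following the standard Adam equations. The default values mirror the defaults of `optim.Adam`; the function `adam_step` and the two gradients are hypothetical.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two hypothetical parameters whose gradients differ by four orders of magnitude.
big, _, _ = adam_step(param=0.0, grad=100.0, m=0.0, v=0.0, t=1)
small, _, _ = adam_step(param=0.0, grad=0.01, m=0.0, v=0.0, t=1)
print(big, small)
```

Because the update divides by the running size of each parameter's own gradients, both parameters move by roughly the same amount on the first step, whereas plain SGD would move them by amounts four orders of magnitude apart.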

The rest of the steps for training the model are unchanged.

We define the criterion and place the model and criterion on the GPU (if available)...

In [9]:
# binary cross entropy loss; the sigmoid that maps model outputs into [0,1] is applied inside the loss
criterion = nn.BCEWithLogitsLoss()

# use the gpu if cuda is available, otherwise fall back to the cpu
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# move the model and the loss function to the gpu/cpu
model = model.to(device)
criterion = criterion.to(device)
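
A loss that takes raw logits can be computed in a numerically safer way than applying a sigmoid first and then taking logs. The sketch below compares the two for a single example, in plain Python. The stable form shown is a standard rearrangement of binary cross entropy and is an assumption about the general technique, not a claim about PyTorch's exact internals.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Binary cross entropy computed from the sigmoid probability."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """The same loss written directly in terms of the logit.
    This form never evaluates log(0), even for large-magnitude logits."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (-1.5, 0.0), (0.3, 0.0)]:
    print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))

# A very confident wrong prediction: the naive form would need log(0) here,
# because sigmoid(40.0) rounds to exactly 1.0 in floating point.
print(bce_with_logits(40.0, 0.0))
```
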

We implement the function to calculate accuracy...

In [10]:
import torch.nn.functional as F

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    # apply the sigmoid, then round to the nearest integer (0 or 1)
    # torch.sigmoid replaces the deprecated F.sigmoid
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() # convert to float for division
    acc = correct.sum()/len(correct)
    return acc

We define a function for training our model...

**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training.

In [11]:
def train(model, iterator, optimizer, criterion):
    """
    model is the RNN to train
    iterator is the training data iterator, which yields batches of data
    optimizer is the Adam optimizer
    criterion is the binary cross entropy loss (with built-in sigmoid)
    """
    # initialize the epoch loss and accuracy
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        # reset the gradients accumulated from the previous batch
        optimizer.zero_grad()
        
        # feed the batch through the model and remove the trailing output dimension
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        # calculate gradient for each parameter
        loss.backward()
        # update parameter by current gradient
        optimizer.step()
        # update loss and accuracy
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We define a function for testing our model...

**Note**: as we are now using dropout, we must remember to use `model.eval()` to ensure the dropout is "turned off" while evaluating.

In [12]:
def evaluate(model, iterator, criterion):
    """
    model is the rnn object
    iterator is validation/testing data iterator
    criterion is the binary cross entropy function defined above
    """
    epoch_loss = 0
    epoch_acc = 0
    # turn off the dropout
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            # make predictions and remove the trailing output dimension
            predictions = model(batch.text).squeeze(1)
            # evaluate loss and accuracy
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Finally, we train our model...

In [13]:
N_EPOCHS = 5
# train for 5 epochs, reporting training and validation performance after each one
for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.601, Train Acc: 63.70%, Val. Loss: 0.314, Val. Acc: 86.32%
Epoch: 02, Train Loss: 0.279, Train Acc: 88.95%, Val. Loss: 0.255, Val. Acc: 89.81%
Epoch: 03, Train Loss: 0.178, Train Acc: 93.40%, Val. Loss: 0.259, Val. Acc: 90.23%
Epoch: 04, Train Loss: 0.112, Train Acc: 96.13%, Val. Loss: 0.438, Val. Acc: 88.59%
Epoch: 05, Train Loss: 0.140, Train Acc: 94.54%, Val. Loss: 0.374, Val. Acc: 89.08%


...and get our new and vastly improved test accuracy!

In [14]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.491, Test Acc: 86.28%


## User Input

We can now use our model to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

Our `predict_sentiment` function does a few things:
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by `unsqueeze`ing
- squashes the output prediction from a real number to a value between 0 and 1 with the `sigmoid` function
- converts the tensor holding a single value into a Python number with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [15]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(sentence):
    # make sure dropout is turned off
    model.eval()
    # tokenize the sentence into a list of token strings
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # convert the tokens into their indices in the vocabulary
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    # convert the list into a tensor and move it to the gpu/cpu
    tensor = torch.LongTensor(indexed).to(device)
    # add a batch dimension of size 1: [sent len] -> [sent len, 1]
    tensor = tensor.unsqueeze(1)
    # run the model and squash the output logit into the interval [0,1]
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()
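
User input may contain words that never appeared in the training vocabulary. The sketch below shows the fallback behavior with a toy vocabulary, assuming (as in the torchtext version used here) that `stoi` acts like a dictionary whose default is the index of `<unk>`. The vocabulary contents and the helper `to_indices` are hypothetical.

```python
from collections import defaultdict

UNK_INDEX = 0  # index of '<unk>' in this toy vocabulary

# Hypothetical toy vocabulary, invented for illustration.
stoi = defaultdict(lambda: UNK_INDEX)
stoi.update({"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "is": 4, "great": 5})

def to_indices(tokens):
    """Convert tokens to indices; unknown tokens fall back to UNK_INDEX."""
    return [stoi[tok] for tok in tokens]

known = to_indices(["this", "film", "is", "great"])
with_unknown = to_indices(["this", "film", "is", "phantasmagorical"])
print(known)
print(with_unknown)
```

Every out-of-vocabulary word shares the same index, and therefore the same embedding, so the model cannot tell such words apart.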

An example negative review...

In [16]:
predict_sentiment("This film is terrible")



0.003378413151949644

An example positive review...

In [17]:
predict_sentiment("This film is great")



0.9955984354019165

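## GRU RNN Model

For comparison, we train the same architecture with a Gated Recurrent Unit (GRU) in place of the LSTM. A GRU has no separate cell state, so `nn.GRU` returns only the `output` and the final `hidden` state. All other hyperparameters are unchanged.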

In [18]:
class RNN_GRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden = [batch size, hid. dim * num directions]
        
        # no squeeze here: squeeze(0) would drop the batch dimension when batch size is 1
        return self.fc(hidden)

In [19]:
# GRU RNN network
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model_GRU = RNN_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

In [20]:
model_GRU.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1123,  0.3113,  0.3317,  ..., -0.4576,  0.6191,  0.5304],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [21]:
optimizer_gru = optim.Adam(model_GRU.parameters())

In [22]:
model_GRU = model_GRU.to(device)

In [23]:
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model_GRU, train_iterator, optimizer_gru, criterion)
    valid_loss, valid_acc = evaluate(model_GRU, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.548, Train Acc: 69.73%, Val. Loss: 0.292, Val. Acc: 88.60%
Epoch: 02, Train Loss: 0.262, Train Acc: 90.04%, Val. Loss: 0.243, Val. Acc: 90.23%
Epoch: 03, Train Loss: 0.170, Train Acc: 93.81%, Val. Loss: 0.276, Val. Acc: 90.75%
Epoch: 04, Train Loss: 0.113, Train Acc: 96.09%, Val. Loss: 0.265, Val. Acc: 90.33%
Epoch: 05, Train Loss: 0.084, Train Acc: 97.23%, Val. Loss: 0.337, Val. Acc: 89.67%


In [24]:
test_loss, test_acc = evaluate(model_GRU, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.420, Test Acc: 86.84%


In [25]:
def predict_sentiment_gru(sentence):
    # make sure dropout is turned off
    model_GRU.eval()
    # tokenize the sentence into a list of token strings
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    # convert the tokens into their indices in the vocabulary
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    # convert the list into a tensor and move it to the gpu/cpu
    tensor = torch.LongTensor(indexed).to(device)
    # add a batch dimension of size 1: [sent len] -> [sent len, 1]
    tensor = tensor.unsqueeze(1)
    # run the model and squash the output logit into the interval [0,1]
    prediction = torch.sigmoid(model_GRU(tensor))
    return prediction.item()

In [26]:
predict_sentiment_gru("This film is terrible")



0.1082150787115097

In [27]:
predict_sentiment_gru("This film is great")



0.9795071482658386

## Conclusion

Both models reach above 90% training accuracy and about 89% validation accuracy by the final epoch. Let's focus on the out-of-sample test results.

The LSTM model reaches 86.28% test accuracy, while the GRU model reaches 86.84%. The GRU's advantage of roughly 0.5 percentage points is too small to be considered significant, especially from a single training run.

For both models, the test accuracy is within about 3 percentage points of the validation accuracy, so there is no sign of severe overfitting. That said, for both models the validation loss is lowest at the second epoch and higher by the fifth, while training accuracy ends above 94%, so some overfitting does set in during the later epochs.

For the user inputs, the LSTM gives the more extreme outputs (0.003 vs 0.108 for the negative review, 0.996 vs 0.980 for the positive one). Both models identify the correct sentiment with high confidence.

What's more, since the GRU achieves slightly higher accuracy with fewer parameters, it would be the better choice for the IMDB dataset.
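
The "fewer parameters" claim can be checked with a short count. The sketch below assumes PyTorch's parameter layout for recurrent layers, in which every gate has an input weight matrix, a hidden weight matrix and two bias vectors; an LSTM has four such gate blocks and a GRU has three. The helper `recurrent_params` is hypothetical, and the dimensions are the ones used in this notebook.

```python
def recurrent_params(n_gates, input_dim, hidden_dim, n_layers, bidirectional):
    """Parameter count of a stacked LSTM/GRU under the assumed layout:
    per gate, one input weight, one hidden weight and two bias vectors."""
    num_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # layers above the first receive the concatenated directions as input
        layer_input = input_dim if layer == 0 else hidden_dim * num_directions
        per_direction = n_gates * (
            layer_input * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim
        )
        total += per_direction * num_directions
    return total

lstm_params = recurrent_params(4, input_dim=100, hidden_dim=256, n_layers=2, bidirectional=True)
gru_params = recurrent_params(3, input_dim=100, hidden_dim=256, n_layers=2, bidirectional=True)
print(f"LSTM recurrent parameters: {lstm_params:,}")
print(f"GRU recurrent parameters:  {gru_params:,}")
```

Under these assumptions the recurrent part of the GRU has three quarters of the LSTM's parameters. The embedding and linear layers are identical in both models, so they do not change the comparison.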

## Next Steps

We've now built a decent sentiment analysis model for movie reviews. However, not all of the steps we have added were necessary to achieve the test accuracy we've achieved. In the next notebook we'll implement a model that gets comparable accuracy with far fewer parameters and trains much, much faster.