# Homework 5 GRU

## Preparing Data

For code reference and LSTM result comparison, please refer to https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb.

1. Set seeds for random split.
2. Tokenize the text with spaCy, and store the labels as `FloatTensor` values.
3. Load the IMDB dataset with its predefined training and testing subsets.
4. Further split the training subset into training and validation subsets.

In [18]:
import torch
from torchtext import data
from torchtext import datasets
import random
import spacy

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

train, test = datasets.IMDB.splits(TEXT, LABEL)

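# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())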
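
As a rough illustration of what a seeded split does, the following plain-Python sketch shuffles stand-in example IDs with a private seeded generator and cuts them 70/30, the ratio that torchtext's `split` uses by default. The helper `split_examples` is hypothetical and is not part of torchtext.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` with a private seeded RNG and cut it in two."""
    rng = random.Random(seed)  # private generator, independent of the global state
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# 25,000 stand-in IDs, matching the size of the IMDB training set
train_part, valid_part = split_examples(range(25000))
print(len(train_part), len(valid_part))  # 17500 7500
```

Calling the helper twice with the same seed returns the same two subsets, which is the property the seed is meant to provide.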

1. The word embeddings are initialized from the pre-trained `glove.6B.100d` vectors. The vocabulary is capped at 25,000 words, so only the 25,000 most frequent words in the training text are kept for `TEXT`.
2. We build the vocabulary from the training set only. The test set must remain unseen during training; letting its words shape the vocabulary would leak information about the test data into the model.

In [2]:
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)

.vector_cache/glove.6B.zip: 862MB [00:43, 16.1MB/s]                               
100%|██████████| 400000/400000 [00:16<00:00, 24789.82it/s]
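
The effect of `max_size` can be sketched in plain Python. The helper `build_vocab` below is a simplified, hypothetical stand-in for what `build_vocab` does in torchtext: it keeps the most frequent tokens and prepends the two special tokens `<unk>` and `<pad>` that a torchtext `Field` reserves, so the vocabulary ends up with `max_size + 2` entries.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Return (itos, stoi), keeping the `max_size` most frequent tokens plus specials."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Toy corpus standing in for the tokenized training reviews
corpus = [["a", "great", "film"], ["a", "terrible", "film"], ["a", "film"]]
itos, stoi = build_vocab(corpus, max_size=3)
print(len(itos))  # 5 = 3 kept tokens + 2 special tokens
```

Words that fall outside the cap, such as `terrible` in this toy corpus, have no entry of their own and are later represented by `<unk>`.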


`BucketIterator` uses the length of each sentence as a sort key to group examples of similar length into buckets, which keeps the amount of padding per batch small. Each call to the iterator returns one batch of examples drawn from the same bucket.

In [3]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
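
To see why grouping by length helps, the sketch below counts the pad tokens needed when batches are padded to their longest member, with and without sorting first. It is a simplified model, not the actual `BucketIterator` algorithm, and the lengths are hypothetical.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive groups of `batch_size` are padded to their longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [5, 120, 7, 118, 6, 121, 8, 119]  # hypothetical review lengths in tokens
unsorted_pad = padding_needed(lengths, batch_size=4)
bucketed_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, bucketed_pad)  # 460 12
```

Less padding means fewer wasted time steps for the recurrent layer to process.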

### Implementation Details

Unlike the LSTM, which returns `output` together with a `(hidden, cell)` tuple, the GRU returns only `output` and the final `hidden` state.

As the final hidden state of our GRU has both a forward and a backward component, which are concatenated together, the size of the input to the `nn.Linear` layer is twice the hidden dimension size.

Bidirectionality and additional layers are added to the GRU through constructor parameters. Dropout is included to reduce overfitting; inside `nn.GRU` it is applied only between stacked layers, not after the last one. The parameters are documented in the docstring.

In [4]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        """
        param vocab_size: number of words in the vocabulary (rows of the embedding table)
        param embedding_dim: dimension of word embeddings
        param hidden_dim: dimension of the hidden state in each direction
        param output_dim: dimension of the output
        param n_layers: number of stacked GRU layers
        param bidirectional: if True, use a bidirectional GRU (the linear layer below assumes True)
        param dropout: dropout probability, applied to the embeddings, between GRU layers, and to the final hidden state
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        GRU forward process
        """
        #x = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden [batch size, hid. dim * num directions]
            
        return self.fc(hidden.squeeze(0))
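
The indexing `hidden[-2,:,:]` and `hidden[-1,:,:]` in `forward` relies on how PyTorch orders the first dimension of the final hidden state: layer by layer, with the forward direction before the backward direction within each layer. The plain-Python sketch below only labels those slices; `hidden_state_layout` is a hypothetical helper, not a PyTorch function.

```python
def hidden_state_layout(n_layers, bidirectional):
    """Label each slice along dim 0 of a GRU's final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, direction) for layer in range(n_layers) for direction in directions]

layout = hidden_state_layout(n_layers=2, bidirectional=True)
print(layout[-2], layout[-1])  # (1, 'forward') (1, 'backward')
```

With `bidirectional=False` the second-to-last slice would instead belong to the layer below, so the class as written assumes a bidirectional GRU.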

Here we create an instance of the `RNN` class (which wraps a GRU) from the hyperparameters below. `EMBEDDING_DIM` must equal the dimension of the GloVe vectors loaded earlier (100).

In [5]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

We retrieve the pre-trained word embeddings from the vocabulary and check that they have the expected shape: one 100-dimensional row per vocabulary entry.

In [6]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


We replace the initial weights of the `embedding` layer with `pretrained_embeddings`.

In [7]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-0.4096, -0.5753,  0.1126,  ...,  0.4092,  0.1856,  0.1066],
        [ 0.2110, -0.2472,  0.6508,  ..., -0.1627,  0.4507, -1.1627],
        [-0.2379, -0.1095,  0.4314,  ...,  0.6665,  0.3200,  0.8872]])

To train the GRU, we minimize the loss function with an optimizer. `Adam` adapts the step size of each parameter individually: parameters whose gradients have been consistently large take smaller effective steps, and vice versa. We do not pass a learning rate, so the optimizer falls back to its default of `1e-3`.

In [8]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
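
The per-parameter scaling can be illustrated with a scalar version of the Adam update. This is a minimal sketch of the update rule using the default hyperparameters of `optim.Adam` (`lr=1e-3`, `betas=(0.9, 0.999)`, `eps=1e-8`); `adam_step` is a hypothetical helper, and the real optimizer works on whole tensors.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for step t (1-based)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by a factor of 10,000
small, _, _ = adam_step(0.0, grad=0.01, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(0.0, grad=100.0, m=0.0, v=0.0, t=1)
print(small, large)  # both close to -0.001
```

On the first step both parameters move by roughly the learning rate, regardless of gradient magnitude, because the update divides by the square root of the squared-gradient estimate.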

1. We define the loss function with `BCEWithLogitsLoss()` (binary cross entropy with logits).
2. Place the model and the criterion on the GPU if one is available, otherwise on the CPU.

In [9]:
criterion = nn.BCEWithLogitsLoss()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)
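
`BCEWithLogitsLoss` takes raw logits rather than probabilities, which allows the sigmoid and the logarithm to be combined in a numerically stable way. The sketch below shows one common stable formulation for a single example next to the naive version; both functions are illustrative and are not PyTorch's internal code.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit (target is 0.0 or 1.0)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    """Same loss via an explicit sigmoid; loses precision for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

print(bce_with_logits(2.0, 1.0), bce_naive(2.0, 1.0))  # the two agree for moderate logits
print(bce_with_logits(50.0, 0.0))  # 50.0; the naive version would hit log(0) here
```

This is also why the model's final layer outputs an unbounded value and the sigmoid is applied only when computing accuracy or making predictions.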

The `binary_accuracy` function passes the predictions through a sigmoid so that the values lie between 0 and 1, then rounds them to the nearest integer. Finally, it computes the proportion of predictions that equal the actual labels in the batch.

In [10]:
import torch.nn.functional as F

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))  # F.sigmoid is deprecated in favor of torch.sigmoid
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
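
The same computation can be traced by hand on a few values. The sketch below uses plain Python lists and hypothetical logits; it thresholds the sigmoid at 0.5, which is equivalent to checking whether the logit is positive.

```python
import math

def binary_accuracy_from_logits(logits, labels):
    """Fraction of logits whose sigmoid, thresholded at 0.5, matches the 0/1 label."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / len(labels)

logits = [2.3, -1.1, 0.4, -0.2, 3.0]  # hypothetical raw model outputs
labels = [1.0, 0.0, 0.0, 0.0, 1.0]
print(binary_accuracy_from_logits(logits, labels))  # 0.8
```

Only the third example is misclassified: its logit of 0.4 maps to a probability above 0.5 while the label is 0.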

We define a function to train the model.

In [11]:
def train(model, iterator, optimizer, criterion):
    """
    param model: the model to be trained
    param iterator: the pre-defined iterator that returns a batch for every iteration
    param optimizer: pre-defined optimizer for minimizing the loss function (Adam here)
    param criterion: pre-defined criterion for the model
    """
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
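
Note that `epoch_acc / len(iterator)` averages the per-batch accuracies, so a short final batch counts as much as a full one. The sketch below contrasts this with an example-weighted average on hypothetical per-example results; the batch sizes are deliberately exaggerated to make the difference visible.

```python
def mean_of_batch_means(batch_scores):
    """What `epoch_acc / len(iterator)` computes: every batch weighs the same."""
    return sum(sum(b) / len(b) for b in batch_scores) / len(batch_scores)

def mean_over_examples(batch_scores):
    """Example-weighted average: every example weighs the same."""
    flat = [s for b in batch_scores for s in b]
    return sum(flat) / len(flat)

# Hypothetical per-example correctness: two full batches of 4 and a final batch of 1
batches = [[1, 1, 1, 0], [1, 0, 1, 1], [0]]
print(mean_of_batch_means(batches), mean_over_examples(batches))  # 0.5 vs. about 0.667
```

With hundreds of batches of 64 and at most one short batch, the two averages are nearly identical, so the simpler form is a reasonable approximation here.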

We define a function to evaluate the model on held-out data.

In [12]:
def evaluate(model, iterator, criterion):
    """
    param model: the trained model to be evaluated
    param iterator: the pre-defined iterator that returns a batch for every iteration
    param criterion: criterion for the model
    """
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We train the model.

In [13]:
N_EPOCHS = 5

for epoch in range(N_EPOCHS):

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')


Epoch: 01, Train Loss: 0.684, Train Acc: 56.02%, Val. Loss: 0.633, Val. Acc: 70.65%
Epoch: 02, Train Loss: 0.458, Train Acc: 77.43%, Val. Loss: 0.353, Val. Acc: 85.30%
Epoch: 03, Train Loss: 0.249, Train Acc: 90.10%, Val. Loss: 0.258, Val. Acc: 89.96%
Epoch: 04, Train Loss: 0.169, Train Acc: 93.58%, Val. Loss: 0.256, Val. Acc: 90.26%
Epoch: 05, Train Loss: 0.120, Train Acc: 95.58%, Val. Loss: 0.279, Val. Acc: 90.14%
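
In this log the validation loss is lowest at epoch 4 and rises at epoch 5 while the training loss keeps falling, a pattern often read as the onset of overfitting. As a small sketch, the epoch with the lowest validation loss can be picked out from the logged values; the notebook itself does not do this and evaluates the model as it stands after epoch 5.

```python
# (epoch, validation loss, validation accuracy in %) copied from the log above
history = [
    (1, 0.633, 70.65),
    (2, 0.353, 85.30),
    (3, 0.258, 89.96),
    (4, 0.256, 90.26),
    (5, 0.279, 90.14),
]

# Select the row with the smallest validation loss
best_epoch, best_loss, best_acc = min(history, key=lambda row: row[1])
print(best_epoch, best_loss, best_acc)  # 4 0.256 90.26
```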


We then evaluate the trained GRU model on the test set.

In [14]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.325, Test Acc: 87.96%


Setting seeds does not guarantee identical results across machines, so the comparison is approximate. With that caveat, the test accuracy is slightly better than the LSTM result in the notebook referenced at the top, so in this run the GRU slightly outperforms the LSTM.

## User Input

We expect reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [15]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(sentence):
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.sigmoid(model(tensor))  # F.sigmoid is deprecated in favor of torch.sigmoid
    return prediction.item()
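
Two details of `predict_sentiment` are easy to miss: words that are not in the vocabulary map to the `<unk>` index, and `unsqueeze(1)` turns the index list into a column of shape `[sent len, 1]`, that is, a batch of one. The sketch below mimics both with plain Python and a hypothetical miniature vocabulary.

```python
from collections import defaultdict

# Hypothetical miniature vocabulary; index 0 is reserved for unknown words
itos = ["<unk>", "<pad>", "this", "film", "is", "great"]
stoi = defaultdict(int, {tok: idx for idx, tok in enumerate(itos)})  # missing keys -> 0

def to_column(tokens):
    """Map tokens to indices and arrange them as [sent len, batch size = 1]."""
    return [[stoi[tok]] for tok in tokens]

print(to_column(["this", "film", "is", "stupendous"]))  # [[2], [3], [4], [0]]
```

A review made up mostly of out-of-vocabulary words therefore reaches the model as a sequence of identical `<unk>` embeddings, and its prediction carries little information.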

An example negative review...

In [16]:
predict_sentiment("This film is terrible")



0.18335498869419098

An example positive review...

In [17]:
predict_sentiment("This film is great")



0.9810460805892944