# Sentiment Analysis using RNNs

Please download this notebook onto your Google Drive. 
Insert your name and ISU ID Number here in the usual form:
SONG, SeokHwan   | 701520820

## 1 - Sentiment Analysis Overview

In this series we'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the IMDb dataset.
We'll start very simple to understand the general concepts and further notebooks will build on this knowledge to actually get good results.

**Introduction**
We'll start out with a recurrent neural network (RNN), as RNNs are commonly used for analysing sequences. An RNN takes in a sequence of words, $X=\{x_1,...,x_T\}$, one at a time, and produces a hidden state, $h$, for each word. We use the RNN recurrently by feeding in the current word, $x_t$, as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$.

$h_t=\text{RNN}(x_t,h_{t-1})$

Once we have our final hidden state, $h_T$ (from feeding in the last word in the sequence, $x_T$), we feed it through a linear layer, $f$ (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y}=f(h_T)$.
![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment1.png?raw=1)

Note: some layers and steps have been omitted from the diagram, but these will be explained later.
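
As a minimal sketch of the recurrence above (not the model built in this lab), the following unrolls the hidden-state update by hand and compares the final state with the one returned by PyTorch's `nn.RNN`. The sizes are illustrative assumptions, and the `tanh` update is the default nonlinearity of `nn.RNN` rather than something stated in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the values used later in this lab)
T, input_dim, hidden_dim = 5, 3, 4

rnn = nn.RNN(input_dim, hidden_dim)   # default nonlinearity is tanh
x = torch.randn(T, 1, input_dim)      # [sequence length, batch size, input dim]

with torch.no_grad():
    # Unroll h_t = RNN(x_t, h_{t-1}) by hand, starting from h_0 = 0
    h = torch.zeros(1, hidden_dim)
    for t in range(T):
        h = torch.tanh(x[t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)

    # Let PyTorch run the same recurrence
    output, hidden = rnn(x)

# The hand-unrolled final state should agree with nn.RNN's final hidden state
manual_matches = torch.allclose(h, hidden.squeeze(0), atol=1e-5)
print(manual_matches)
```

The same weights are reused at every time step; only the hidden state carries information from one word to the next.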

We will compare the effects of different changes to the system:


**Homework Outline:**
1. Standard RNN
2. Standard RNN with pre-trained word embeddings
3. RNN with LSTM blocks 
4. LSTM with regularization
5. Bidirectional RNN with regularization
6. Multi-layer RNN with regularization

This will allow us to achieve ~84% test accuracy.

This lab is based on very clearly written and explained examples from Ben Trevett of Heriot-Watt University in Scotland. https://github.com/bentrevett/pytorch-sentiment-analysis

##2. Preparing Data

One of the main concepts of TorchText is the Field. The parameters of a Field specify how the data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either "pos" or "neg".

We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment. We'll be using *packed padded sequences*, which will make our RNN only process the non-padded elements of our sequence, and for any padded element the output will be a zero tensor. To use packed padded sequences, we have to tell the RNN how long the actual sequences are. We do this by setting include_lengths = True for our TEXT field. This will cause batch.text to now be a tuple with the first element being our sentence (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences.

Our TEXT field has tokenize='spacy' as an argument. This defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the [spaCy](https://spacy.io) tokenizer. If no tokenize argument is passed, the default is simply splitting the string on spaces.

LABEL is defined by a LabelField, a special subclass of the Field class specifically used for handling labels. We will explain the dtype argument later.
For more on Fields, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

We also set the random seeds for reproducibility.

In [2]:
import torch
from torchtext import data
from torchtext import datasets

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)

A handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP). The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the .split() method. By default this splits 70/30, however by passing a split_ratio argument, we can change the ratio of the split, i.e. a split_ratio of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.

We also pass our random seed to the random_state argument, ensuring that we get the same train/validation split each time.
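
To make the effect of `split_ratio` and a fixed seed concrete, here is a small pure-Python sketch. The helper `split_examples` is hypothetical and only illustrates the idea; it is not how TorchText implements `.split()`.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Hypothetical helper: shuffle with a fixed seed, then cut at split_ratio."""
    rng = random.Random(seed)          # private generator, so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# 25,000 stand-in "reviews", the size of the IMDb training split
train, valid = split_examples(range(25000), split_ratio=0.8)
print(len(train), len(valid))
```

With a ratio of 0.8 this leaves 20,000 examples for training, which agrees with the count printed by the cell below.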

In [3]:
from torchtext import datasets
import random

train_data_in, test_data = datasets.IMDB.splits(TEXT, LABEL)
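# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
train_data, valid_data = train_data_in.split(split_ratio = 0.8, random_state = random.getstate())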

print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')
print(vars(train_data.examples[0]))

Number of training examples: 20000
Number of testing examples: 25000
{'text': ['The', 'opening', 'sequence', 'is', 'supposed', 'to', 'show', 'the', 'Legion', 'arriving', 'in', 'Paris', 'on', '13', 'Nov', '1918', '.', 'The', 'troops', 'pile', 'off', 'the', 'train', '--', 'wearing', 'the', 'uniform', 'in', 'which', 'the', 'French', 'Army', ',', 'including', 'the', 'Legion', ',', 'marched', 'off', 'to', 'war', 'in', '1914', '!', 'This', 'a', 'sure', 'sign', 'that', 'the', 'war', 'flick', 'you', 'are', 'about', 'to', 'see', 'will', 'be', 'a', 'turkey', '.', '(', 'The', 'French', 'Army', 'realized', 'by', '1915', 'that', 'going', 'to', 'war', 'in', 'red', 'trousers', 'and', 'dark', 'blue', 'overcoats', 'was', 'not', 'working', '.', 'Metropolitan', 'French', 'troops', 'were', 'put', 'into', '"', 'horizon', 'blue', '"', 'and', 'Colonial', 'troops', 'were', 'put', 'into', 'khaki', '.', ')', 'The', 'Claude', 'Van', '-', 'Damme', '(', 'sp', '?', ')', 'remake', 'at', 'least', 'got', 'the', 'unifo

##3 Building a Vocabulary and Bringing in an Embedding

###3.1 Vocabulary
Next, we have to build a _vocabulary_. This is effectively a lookup table where every unique word in your data set has a corresponding _index_ (an integer). Each index can be represented as a one-hot vector: a vector where all of the elements are 0 except one, which is 1, and whose dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment5.png?raw=1)

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow, and the model possibly won't fit onto your GPU (if you're using one). There are two ways to effectively cut down our vocabulary: we can either take only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, keeping only the top 25,000 words.

Words that appear in examples but we have cut from the vocabulary are replaced with a `<unk>` token. The following builds the vocabulary, only keeping the most common `max_size` tokens.
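
The idea can be sketched in a few lines of plain Python. The helpers `build_vocab` and `numericalize` below are hypothetical stand-ins written for illustration, not the TorchText implementation, and the toy sentences are made up.

```python
from collections import Counter

def build_vocab(tokenized_examples, max_size):
    """Keep the max_size most frequent tokens, plus the two special tokens."""
    counts = Counter(tok for example in tokenized_examples for tok in example)
    itos = ['<unk>', '<pad>'] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indexes; anything cut from the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

examples = [['the', 'film', 'was', 'great'],
            ['the', 'film', 'is', 'awful'],
            ['the', 'plot']]

itos, stoi = build_vocab(examples, max_size=2)
ids = numericalize(['the', 'plot', 'film'], stoi)
print(itos)
print(ids)
```

Note that the vocabulary ends up with `max_size + 2` entries because of the `<unk>` and `<pad>` tokens, which is consistent with the 25,002 rows of the embedding matrix printed later in this notebook.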

###3.2 Pre-Trained Word Embeddings
Next is the use of pre-trained word embeddings. Now, instead of having our word embeddings initialized randomly, they are initialized with pre-trained vectors.
We get these vectors simply by specifying which vectors we want and passing it as an argument to `build_vocab`. `TorchText` handles downloading the vectors and associating them with the correct words in our vocabulary.

Here, we'll be using the `"glove.6B.100d"` vectors. `glove` is the algorithm used to calculate the vectors; go [here](https://nlp.stanford.edu/projects/glove/) for more. `6B` indicates these vectors were trained on 6 billion tokens and `100d` indicates these vectors are 100-dimensional.

You can see the other available embeddings [here](https://github.com/pytorch/text/blob/master/torchtext/vocab.py#L113).

The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. "terrible", "awful", "dreadful" are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.

**Note**: these vectors are about 862MB, so watch out if you have a limited internet connection.

By default, TorchText will initialize words in your vocabulary but not in your pre-trained embeddings to zero. We don't want this, and instead initialize them randomly by setting `unk_init` to `torch.Tensor.normal_`. This will now initialize those words via a Gaussian distribution.

In [4]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)



In [5]:

print(TEXT.vocab.itos[:50])

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is', 'in', 'I', 'it', 'that', '"', "'s", 'this', '-', '/><br', 'was', 'as', 'movie', 'with', 'for', 'film', 'The', 'but', '(', ')', "n't", 'on', 'you', 'are', 'not', 'have', 'his', 'be', 'he', 'one', 'at', '!', 'by', 'all', 'an', 'who', 'they', 'from', 'like', 'so', 'her']


###3.3 Creating the Iterators

The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration. We'll use a BucketIterator which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`; we then pass this device to the iterator. For packed padded sequences, all of the tensors within a batch need to be sorted by their lengths. This is handled in the iterator by setting `sort_within_batch = True`.
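
The benefit of batching examples of similar length can be illustrated with a simplified, pure-Python sketch. It only sorts by length before batching; the real `BucketIterator` is more involved, and the review lengths and helper names here are made up for illustration.

```python
def make_batches(lengths, batch_size):
    """Cut a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def padding_needed(batches):
    """Count the pad tokens needed to make every batch rectangular."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

lengths = [2, 9, 3, 10, 2, 8]              # hypothetical review lengths in tokens

naive = make_batches(lengths, 2)           # batches in the original order
bucketed = make_batches(sorted(lengths), 2)  # batches of similar length

# Mimic sort_within_batch = True: longest example first inside each batch
bucketed = [sorted(batch, reverse=True) for batch in bucketed]

naive_pad = padding_needed(naive)
bucketed_pad = padding_needed(bucketed)
print(naive_pad, bucketed_pad)
```

In this toy case the naive batches need 20 pad tokens while the length-grouped batches need 6, so far less computation is spent on padding.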

In [6]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

##4 Build the Model

The model will be based on different RNN structures

###4.1 Vanilla RNN Architecture
The three layers of the Vanilla RNN are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.

The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$. Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment7.png?raw=1)

The `forward` method is called when we feed examples into our model.

Each batch, `text`, is a tensor of size _**[sentence length, batch size]**_. That is a batch of sentences, each having each word converted into a one-hot vector. 

You may notice that this tensor should have another dimension due to the one-hot vectors; however, PyTorch conveniently stores a one-hot vector as its index value, i.e. the tensor representing a sentence is just a tensor of the indexes for each token in that sentence. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*.
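
The following sketch shows why storing indexes is enough: multiplying a one-hot vector by the embedding weight matrix selects one row, which is exactly what an index lookup returns. The vocabulary size, embedding size and indexes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 6, 3            # tiny illustrative sizes
embedding = nn.Embedding(vocab_size, embedding_dim)

indexes = torch.tensor([4, 1, 4])           # a "sentence" of three token indexes

with torch.no_grad():
    # Explicit one-hot vectors: [3, vocab_size]
    one_hot = F.one_hot(indexes, num_classes=vocab_size).float()
    via_matmul = one_hot @ embedding.weight  # fully connected layer without bias
    via_lookup = embedding(indexes)          # direct row lookup by index

same_result = torch.allclose(via_matmul, via_lookup)
print(same_result)
```

The lookup avoids ever materializing the large, mostly-zero one-hot tensors.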

The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_. `embedded` is then fed into the RNN. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.

The RNN returns 2 tensors, `output` of size _**[sentence length, batch size, hidden dim]**_ and `hidden` of size _**[1, batch size, hidden dim]**_. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. We verify this using the `assert` statement. Note the `squeeze` method, which is used to remove a dimension of size 1. 

Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction.
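
The tensor shapes described above can be checked with a short, self-contained sketch. The sizes are small illustrative assumptions and the layers are freshly initialized, so this is only a shape check, not the lab's trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small illustrative sizes (assumptions)
sent_len, batch_size = 7, 3
vocab_size, embedding_dim, hidden_dim, output_dim = 50, 8, 16, 1

embedding = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_dim)
fc = nn.Linear(hidden_dim, output_dim)

text = torch.randint(0, vocab_size, (sent_len, batch_size))  # [sent len, batch size]

with torch.no_grad():
    embedded = embedding(text)                # [sent len, batch size, emb dim]
    output, hidden = rnn(embedded)            # h_0 defaults to zeros
    prediction = fc(hidden.squeeze(0))        # [batch size, output dim]

print(embedded.shape, output.shape, hidden.shape, prediction.shape)
```

The last time step of `output` and the squeezed `hidden` hold the same values, which is what the `assert` in the model's `forward` method verifies.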

**Note:** To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN`. Also, note that the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state. 

In [7]:
import torch.nn as nn

class Vanilla_RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, dropout, pad_idx):
        
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        # LSTM layer
        #self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, text_lengths):

        #text = [sent len, batch size]
        embedded = self.embedding(text)
        
        output, hidden = self.rnn(embedded)

        # Output for LSTM
        #output, (hidden, cell) = self.rnn(embedded)
        
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))

We now create an instance of our `Vanilla_RNN` class.
The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size. The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.
The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as on the vocabulary size, the size of the dense vectors and the complexity of the task.
The output dimension is usually the number of classes, however in the case of only 2 classes the output value is between 0 and 1 and thus can be 1-dimensional, i.e. a single scalar real number.

In [8]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = Vanilla_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX)

In [9]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

The final piece is copying the pre-trained word embeddings we loaded earlier into the embedding layer of our model. We retrieve the embeddings from the field's vocab, and check they're the correct size, [vocab size, embedding dim].
We then replace the initial weights of the embedding layer with the pre-trained embeddings. **Note: this should always be done on the weight.data and not the weight!**

In [10]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)
model.embedding.weight.data.copy_(pretrained_embeddings)

torch.Size([25002, 100])


tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7922, -0.1901, -0.0676,  ...,  1.0146,  0.2398,  0.0675],
        [ 0.4161, -0.1577, -0.0735,  ...,  0.3023,  0.2679,  0.6584],
        [ 0.9501, -0.7701,  0.1537,  ..., -2.0229,  0.4822, -1.0561]],
       device='cuda:0')

In the output above, the first two rows of the embedding weights matrix (the `<unk>` and `<pad>` tokens) are non-zero. The next cell sets both rows to zeros, again working on `weight.data`, and the printed matrix confirms the change. As we passed the index of the pad token to the `padding_idx` of the embedding layer, that row will remain zeros throughout training; however, the `<unk>` token embedding will be learned.

In [11]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7922, -0.1901, -0.0676,  ...,  1.0146,  0.2398,  0.0675],
        [ 0.4161, -0.1577, -0.0735,  ...,  0.3023,  0.2679,  0.6584],
        [ 0.9501, -0.7701,  0.1537,  ..., -2.0229,  0.4822, -1.0561]],
       device='cuda:0')


###4.2 Utility Routines for running the models
*count_parameters* is a function that tells us how many trainable parameters our model has, so we can compare the number of parameters across different models.

In [12]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')


The model has 2,592,105 trainable parameters
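
As a sanity check, the count above can be reproduced by hand from the layer sizes used in this notebook (a vocabulary of 25,002 entries, 100-dimensional embeddings, 256 hidden units and 1 output). The breakdown below assumes the standard parameter layout of an embedding table, a single-layer `nn.RNN` with two bias vectors, and a linear layer with a bias.

```python
vocab_size, embedding_dim, hidden_dim, output_dim = 25002, 100, 256, 1

# Embedding: one row of embedding_dim weights per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Single-layer RNN: input-to-hidden weights, hidden-to-hidden weights, two biases
rnn_params = (embedding_dim * hidden_dim
              + hidden_dim * hidden_dim
              + 2 * hidden_dim)

# Linear layer: weights plus bias
fc_params = hidden_dim * output_dim + output_dim

total_params = embedding_params + rnn_params + fc_params
print(f'{embedding_params:,} + {rnn_params:,} + {fc_params:,} = {total_params:,}')
```

Almost all of the parameters sit in the embedding layer, which is why limiting the vocabulary size matters so much.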


*binary_accuracy* calculates the accuracy of the sentiment predictions. It first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, and then rounds them to the nearest integer.
We then calculate how many rounded predictions equal the actual labels and average this across the batch.

In [13]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
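
As a quick worked example of the same calculation, the sketch below walks through the steps on four made-up predictions and labels (the values are assumptions chosen only for illustration).

```python
import torch

# Raw model outputs (logits) and true labels for a made-up batch of four reviews
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

probabilities = torch.sigmoid(logits)      # squash each logit into (0, 1)
rounded = torch.round(probabilities)       # threshold at 0.5
correct = (rounded == labels).float()      # 1.0 where the prediction is right
accuracy = correct.mean().item()

print(rounded.tolist(), accuracy)
```

Three of the four rounded predictions match their labels, so the batch accuracy is 0.75.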

In [14]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        text, text_lengths = batch.text
                
        predictions = model(text, text_lengths).squeeze(1)

        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

 

*evaluate* is similar to train, with a few modifications to avoid updating the parameters when evaluating.
`model.eval()` puts the model in "evaluation mode", which turns off dropout and makes batch normalization use its running statistics. We are not using either in this model, but it is good practice to include the call.
No gradients are calculated on PyTorch operations inside the `with torch.no_grad()` block. This causes less memory to be used and speeds up computation.
The rest of the function is the same as train, with the removal of optimizer.zero_grad(), loss.backward() and optimizer.step(), as we do not update the model's parameters when evaluating.

In [15]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


*epoch_time* is a function to tell us how long an epoch takes to compare training times between models.

In [16]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

###4.3 Training the Model
We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.
At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [17]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 6s
	Train Loss: 0.691 | Train Acc: 53.52%
	 Val. Loss: 0.689 |  Val. Acc: 51.92%
Epoch: 02 | Epoch Time: 0m 5s
	Train Loss: 0.693 | Train Acc: 52.68%
	 Val. Loss: 0.711 |  Val. Acc: 50.63%
Epoch: 03 | Epoch Time: 0m 5s
	Train Loss: 0.694 | Train Acc: 51.59%
	 Val. Loss: 0.684 |  Val. Acc: 53.68%
Epoch: 04 | Epoch Time: 0m 5s
	Train Loss: 0.692 | Train Acc: 52.36%
	 Val. Loss: 0.678 |  Val. Acc: 57.46%
Epoch: 05 | Epoch Time: 0m 5s
	Train Loss: 0.690 | Train Acc: 52.46%
	 Val. Loss: 0.695 |  Val. Acc: 50.45%
Epoch: 06 | Epoch Time: 0m 5s
	Train Loss: 0.694 | Train Acc: 51.29%
	 Val. Loss: 0.693 |  Val. Acc: 51.13%
Epoch: 07 | Epoch Time: 0m 5s
	Train Loss: 0.689 | Train Acc: 53.55%
	 Val. Loss: 0.687 |  Val. Acc: 54.41%
Epoch: 08 | Epoch Time: 0m 5s
	Train Loss: 0.675 | Train Acc: 57.65%
	 Val. Loss: 0.720 |  Val. Acc: 50.65%
Epoch: 09 | Epoch Time: 0m 5s
	Train Loss: 0.677 | Train Acc: 55.37%
	 Val. Loss: 0.768 |  Val. Acc: 54.75%
Epoch: 10 | Epoch Time: 0m 5

###4.4 Testing the Model
You may have noticed the loss is not uniformly decreasing and the validation accuracy stays between roughly 50% and 58%, only slightly better than chance. This is due to several issues with the model which we'll improve in the next part.
Finally, we compute the metrics we actually care about, the test loss and accuracy, using the parameters that gave us the best validation loss.

In [18]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.665 | Test Acc: 59.92%


This section provides a function to put in your own movie review quotes to predict the sentiment of any sentence we give it. As it has been trained on movie reviews, the sentences provided should also be movie reviews.

When using a model for inference it should always be in evaluation mode. If this tutorial is followed step-by-step then it should already be in evaluation mode (from doing `evaluate` on the test set), however we explicitly set it to avoid any risk.

Our `predict_sentiment` function does a few things:
- sets the model to evaluation mode
- tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
- indexes the tokens by converting them into their integer representation from our vocabulary
- gets the length of our sequence
- converts the indexes, which are a Python list, into a PyTorch tensor
- adds a batch dimension by `unsqueeze`ing
- converts the length into a tensor
- squashes the output prediction from a real number to a value between 0 and 1 with the `sigmoid` function
- converts the tensor holding a single value into a Python float with the `item()` method

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.

In [19]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()

Sample Reviews from Rotten Tomatoes for Inception and the Angry Birds 2 Movie.

Negative Reviews:

In [20]:

predict_sentiment(model, "I will pretend I loved it and worship it like all the the smartest people and didn't get kinda lost half way and stopped watching like 40 minutes before the end cause I was utterly uninterested in how it would end.")

0.541588306427002

In [21]:
predict_sentiment(model, "Children will sit through it happily enough, but they deserve better, don't they?")

0.5679482221603394

Medium Reviews: (ratings 3 and 3.5)

In [22]:
predict_sentiment(model,"Invention runs lower once we're on those snowy slopes, and the hard narrative punch keeps disintegrating into a floating cloud of pixels.")

0.5012223124504089

In [23]:
predict_sentiment(model, "The idea of the show, and the acting are amazing, but even as a lover of violence, some of the scenes involving animals and even some scenes involving people just felt.... baseless")

0.5329797863960266

Positive Reviews:

In [24]:
predict_sentiment(model, "A spectacular fantasy thriller based on Nolan's own original screenplay, Inception is the smartest CGI head-trip since The Matrix.")

0.5802634954452515


##5 Different RNN Architectures

We will now use an RNN based on Long Short-Term Memory (LSTM) modules. Recall, standard RNNs suffer from the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem). LSTMs overcome this by having an extra recurrent state called a _cell_, $c$ - which can be thought of as the "memory" of the LSTM - and by using multiple _gates_ which control the flow of information into and out of the memory.

Thus, the model using an LSTM looks something like (with the embedding layers omitted):

![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment2.png?raw=1)

The initial cell state, $c_0$, like the initial hidden state, is initialized to a tensor of all zeros. The sentiment prediction is still, however, only made using the final hidden state, not the final cell state, i.e. $\hat{y}=f(h_T)$.

### Bidirectional RNN

A bidirectional RNN is an RNN processing the words in the sentence from the first to the last (a forward RNN) with a second RNN processing the words in the sentence from the **last to the first** (a backward RNN). At time step $t$, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$. 

In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other in a single tensor. 

We make our sentiment prediction using a concatenation of the last hidden state from the forward RNN (obtained from final word of the sentence), $h_T^\rightarrow$, and the last hidden state from the backward RNN (obtained from the first word of the sentence), $h_T^\leftarrow$, i.e. $\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$   

The image below shows a bi-directional RNN, with the forward RNN in orange, the backward RNN in green and the linear layer in silver.  

![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment3.png?raw=1)

### Multi-layer RNN

Multi-layer RNNs (also called *deep RNNs*) are another simple concept. The idea is that we add additional RNNs on top of the initial standard RNN, where each RNN added is another *layer*. The hidden state output by the first (bottom) RNN at time-step $t$ will be the input to the RNN above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.

The image below shows a multi-layer unidirectional RNN, where the layer number is given as a superscript. Also note that each layer needs its own initial hidden state, $h_0^L$.

![](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/assets/sentiment4.png?raw=1)

### Regularization

Although we've added improvements to our model, each one adds additional parameters. The more parameters you have in your model, the higher the probability that your model will overfit (memorize the training data, causing a low training error but high validation/testing error, i.e. poor generalization to new, unseen examples). To combat this, we use a method of regularization called *dropout*. Dropout works by randomly *dropping out* (setting to 0) neurons in a layer during a forward pass. The probability that each neuron is dropped out is set by a hyperparameter, and each neuron with dropout applied is considered independently. One theory about why dropout works is that a model with parameters dropped out can be seen as a "weaker" (fewer parameters) model. The predictions from all these "weaker" models (one for each forward pass) get averaged together within the parameters of the model. Thus, your one model can be thought of as an ensemble of weaker models, none of which are over-parameterized and thus should not overfit.
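
A small sketch of this behaviour with PyTorch's `nn.Dropout` is shown below. The drop probability of 0.5 and the all-ones input are arbitrary choices for illustration. One detail not mentioned above: in training mode PyTorch also scales the surviving values by $1/(1-p)$, so the expected activation stays the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)     # each element is zeroed with probability 0.5
x = torch.ones(2, 8)

dropout.train()                 # training mode: dropout is active
train_out = dropout(x)          # entries are either 0 or 1 / (1 - 0.5) = 2

dropout.eval()                  # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

Each call in training mode draws a fresh random mask, which is what makes every forward pass behave like a different "weaker" model.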



###5.1 Implementation Details

To use an LSTM instead of the standard RNN, we use `nn.LSTM` instead of `nn.RNN`. Also, note that the LSTM returns the `output` and a tuple of the final `hidden` state and the final `cell` state, whereas the standard RNN only returned the `output` and final `hidden` state. 

As the final hidden state of our LSTM has both a forward and a backward component, which will be concatenated together, the size of the input to the `nn.Linear` layer is twice that of the hidden dimension size.

Implementing bidirectionality and adding additional layers are done by passing values for the `num_layers` and `bidirectional` arguments for the RNN/LSTM. 

Dropout is implemented by initializing an `nn.Dropout` layer (the argument is the probability of dropping out each neuron) and using it within the `forward` method after each layer we want to apply dropout to. **Note**: never use dropout on the input or output layers (`text` or `fc` in this case), you only ever want to use dropout on intermediate layers. The LSTM has a `dropout` argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer. Thus, dropout will not work for a single layer LSTM.

Before we pass our embeddings to the RNN, we need to pack them, which we do with `nn.utils.rnn.pack_padded_sequence`. This will cause our RNN to only process the non-padded elements of our sequence. The RNN will then return `packed_output` (a packed sequence) as well as the `hidden` and `cell` states (both of which are tensors). Without packed padded sequences, `hidden` and `cell` are tensors from the last element in the sequence, which will most probably be a pad token; however, when using packed padded sequences they are both from the last non-padded element in the sequence.

We then unpack the output sequence, with `nn.utils.rnn.pad_packed_sequence`, to transform it from a packed sequence to a tensor. The elements of `output` from padding tokens will be zero tensors (tensors where every element is zero). Usually, we only have to unpack output if we are going to use it later on in the model. Although we aren't in this case, we still unpack the sequence just to show how it is done.
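The effect of packing can be checked on a toy batch. The sketch below uses made-up sizes and random tensors in place of real embeddings (both are assumptions for illustration), and compares the final hidden state with and without packing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes: 2 sequences, max length 4, embedding dim 3, hidden dim 5.
embedded = torch.randn(4, 2, 3)          # [sent len, batch size, emb dim]
text_lengths = torch.tensor([4, 2])      # the second sequence has 2 padded steps
embedded[2:, 1, :] = 0.0                 # stand-in for the padded positions

lstm = nn.LSTM(input_size=3, hidden_size=5)

# Pack, run the LSTM, then unpack back to a padded tensor.
packed = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
packed_output, (hidden, cell) = lstm(packed)
output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

# Without packing, the LSTM also steps through the padded positions.
unpacked_output, (unpacked_hidden, _) = lstm(embedded)

print(output.shape, output_lengths)
```

With packing, `hidden[0, 1]` matches `output[1, 1]`, the last real step of the shorter sequence, and the padded steps of `output` are zero. Without packing, the final hidden state of the shorter sequence has been updated by the padded steps as well.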

The final hidden state, `hidden`, has a shape of _**[num layers * num directions, batch size, hid dim]**_. These are ordered: **[forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1, ..., forward_layer_n, backward_layer_n]**. As we want the final (top) layer forward and backward hidden states, we get the top two hidden layers from the first dimension, `hidden[-2,:,:]` and `hidden[-1,:,:]`, and concatenate them together before passing them to the linear layer (after applying dropout).
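The ordering of `hidden` can be verified against `output` on a small bidirectional LSTM. This is a sketch with arbitrary toy sizes and random inputs (assumptions for illustration), without packing or dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_layers, hid_dim, batch_size, sent_len, emb_dim = 2, 4, 3, 5, 6  # toy sizes
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers, bidirectional=True)

embedded = torch.randn(sent_len, batch_size, emb_dim)
output, (hidden, cell) = lstm(embedded)

# hidden = [num layers * num directions, batch size, hid dim]
forward_top = hidden[-2, :, :]    # top layer, forward direction
backward_top = hidden[-1, :, :]   # top layer, backward direction

# output = [sent len, batch size, hid dim * 2]; forward half first, backward half second.
# The forward direction finishes at the last time step, the backward one at the first.
final_hidden = torch.cat((forward_top, backward_top), dim=1)

print(hidden.shape, final_hidden.shape)
```

Here `forward_top` equals `output[-1, :, :hid_dim]` and `backward_top` equals `output[0, :, hid_dim:]`, which confirms that the last two entries of `hidden` belong to the top layer.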

In [25]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
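    def forward(self, text, text_lengths):
        
        #text = [sent len, batch size]
        
        embedded = self.dropout(self.embedding(text))
        
        #embedded = [sent len, batch size, emb dim]
         
        #pack sequence
        #lengths are moved to the CPU, which newer PyTorch versions require
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'))
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid dim]
        #cell = [num layers * num directions, batch size, hid dim]
        
        #concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        #and apply dropout
        
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        
        #hidden = [batch size, hid dim * 2]
            
        return self.fc(hidden)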

Like before, we'll create an instance of our RNN class, with the new parameters and arguments for the number of layers, bidirectionality and dropout probability.  To ensure the pre-trained vectors can be loaded into the model, the `EMBEDDING_DIM` must be equal to that of the pre-trained GloVe vectors loaded earlier.

We get our pad token index from the vocabulary, getting the actual string representing the pad token from the field's `pad_token` attribute, which is `<pad>` by default.

**Note:** we have switched from the Vanilla RNN to the new model defined above.

In [26]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)



We'll print out the number of parameters in our model. 

Notice how we have almost twice as many parameters as before!

In [54]:

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 4,810,857 trainable parameters
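The parameter count can be reproduced by hand. The sketch below is a plain-Python calculation, assuming PyTorch's layout of one input-to-hidden matrix, one hidden-to-hidden matrix and two bias vectors per gate block, and a vocabulary of 25,002 tokens (the size of the pre-trained embedding matrix printed in this notebook). The helper names are hypothetical.

```python
def recurrent_params(input_dim, hidden_dim, n_layers, bidirectional, gates):
    """Parameters of a stacked RNN (gates=1) or LSTM (gates=4)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers above the first receive the concatenated outputs of all directions.
        layer_input = input_dim if layer == 0 else hidden_dim * directions
        # Input-to-hidden weights, hidden-to-hidden weights and two bias vectors.
        per_direction = gates * hidden_dim * (layer_input + hidden_dim + 2)
        total += per_direction * directions
    return total

def model_params(vocab_size, embedding_dim, hidden_dim, output_dim,
                 n_layers, bidirectional, gates, fc_input_dim):
    embedding = vocab_size * embedding_dim
    recurrent = recurrent_params(embedding_dim, hidden_dim, n_layers, bidirectional, gates)
    linear = fc_input_dim * output_dim + output_dim
    return embedding + recurrent + linear

VOCAB_SIZE = 25002  # rows of the pre-trained embedding matrix

two_layer_bilstm = model_params(VOCAB_SIZE, 100, 256, 1, 2, True, 4, fc_input_dim=512)
print(f"{two_layer_bilstm:,}")  # 4,810,857
```

Most of the total (2,500,200 parameters) sits in the embedding layer. The recurrent layers account for the growth when layers or a second direction are added.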


Now we train the model.

We'll continue using the `Adam` optimizer. `Adam` adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. More information about `Adam` (and other optimizers) can be found [here](http://ruder.io/optimizing-gradient-descent/index.html).
We define the criterion and place the model and criterion on the GPU (if available).
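As a rough illustration of the per-parameter adaptation, here is a sketch of a single Adam update for one scalar parameter, written in plain Python. The function name `adam_step` is hypothetical; the default values are the defaults of `optim.Adam`.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two hypothetical parameters whose gradients differ by a factor of 1000.
big, _, _ = adam_step(param=0.0, grad=100.0, m=0.0, v=0.0, t=1)
small, _, _ = adam_step(param=0.0, grad=0.1, m=0.0, v=0.0, t=1)
print(big, small)
```

Both parameters move by about the same amount (close to `lr`) on the first step, even though their gradients differ by three orders of magnitude. Dividing by the running root-mean-square of the gradient is what gives each parameter its own effective learning rate.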


In [28]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

The final addition is copying the pre-trained word embeddings we loaded earlier into the `embedding` layer of our model. We replace the initial weights of the `embedding` layer with the pre-trained embeddings.

**Note**: this should always be done on the `weight.data` and not the `weight`!

In [29]:
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

We can now see the first two rows of the embedding weights matrix have been set to zeros. As we passed the index of the pad token to the `padding_idx` of the embedding layer, it will remain zeros throughout training. The `<unk>` token embedding, however, will be learned.
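The difference between the two zeroed rows shows up in the gradients. The following sketch uses a toy embedding layer and an arbitrary loss, assuming `<unk>` is at index 0 and `<pad>` at index 1 as in the weights matrix printed in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

UNK_IDX, PAD_IDX = 0, 1            # assumed to match the order in TEXT.vocab
embedding = nn.Embedding(5, 3, padding_idx=PAD_IDX)   # toy vocabulary of 5 tokens
embedding.weight.data[UNK_IDX] = torch.zeros(3)
embedding.weight.data[PAD_IDX] = torch.zeros(3)

tokens = torch.tensor([UNK_IDX, PAD_IDX, 2])
loss = (embedding(tokens) + 1.0).pow(2).sum()   # arbitrary scalar loss, just to get gradients
loss.backward()

print(embedding.weight.grad)
```

The gradient row for `PAD_IDX` is all zeros, so the optimizer never moves it. The row for `UNK_IDX` receives a non-zero gradient and will be updated like any other token.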

###5.2 Train the Model

We train the model with the same `train` and `evaluate` utility functions as before.
**Note**: as we are now using dropout, we must remember to use `model.train()` to ensure the dropout is "turned on" while training and `model.eval()` to turn dropout off while evaluating.
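The mode switch can be seen on a standalone dropout layer. In this sketch, calling `.train()` and `.eval()` on the layer stands in for `model.train()` and `model.eval()`, which set the same flag on every submodule.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(0.5)
x = torch.ones(1000)

dropout.train()                 # training mode: elements are randomly zeroed
train_out = dropout(x)

dropout.eval()                  # evaluation mode: dropout is the identity
eval_out = dropout(x)

print((train_out == 0).float().mean().item(), torch.equal(eval_out, x))
```

Forgetting `model.eval()` before validation or testing would leave dropout active and make the reported metrics noisy and pessimistic.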

Finally, we train our model...

In [30]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 43s
	Train Loss: 0.640 | Train Acc: 62.67%
	 Val. Loss: 0.529 |  Val. Acc: 75.69%
Epoch: 02 | Epoch Time: 0m 45s
	Train Loss: 0.640 | Train Acc: 62.45%
	 Val. Loss: 0.498 |  Val. Acc: 76.48%
Epoch: 03 | Epoch Time: 0m 44s
	Train Loss: 0.531 | Train Acc: 73.15%
	 Val. Loss: 0.403 |  Val. Acc: 82.58%
Epoch: 04 | Epoch Time: 0m 44s
	Train Loss: 0.359 | Train Acc: 84.50%
	 Val. Loss: 0.314 |  Val. Acc: 86.65%
Epoch: 05 | Epoch Time: 0m 45s
	Train Loss: 0.307 | Train Acc: 87.40%
	 Val. Loss: 0.277 |  Val. Acc: 88.81%
Epoch: 06 | Epoch Time: 0m 45s
	Train Loss: 0.265 | Train Acc: 89.28%
	 Val. Loss: 0.297 |  Val. Acc: 87.97%
Epoch: 07 | Epoch Time: 0m 45s
	Train Loss: 0.234 | Train Acc: 90.99%
	 Val. Loss: 0.263 |  Val. Acc: 89.22%
Epoch: 08 | Epoch Time: 0m 45s
	Train Loss: 0.213 | Train Acc: 91.93%
	 Val. Loss: 0.261 |  Val. Acc: 89.60%
Epoch: 09 | Epoch Time: 0m 45s
	Train Loss: 0.187 | Train Acc: 92.98%
	 Val. Loss: 0.262 |  Val. Acc: 90.90%
Epoch: 10 | Epoch Time: 0m 45s
	Train Loss: 0.169 | Train Acc: 93.62%
	 Val. Loss: 0.260 |  Val. Acc: 90.37%

###5.3 Test the Model
We will first assess the accuracy, then we will assess sentiment from the quotes used in part 4.4.

In [31]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.289 | Test Acc: 89.04%


#### User Input

We can now use our model to predict the sentiment of any sentence we give it just as before, using the `predict_sentiment` function.

We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.
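The model itself outputs an unbounded number (a logit). A minimal sketch of how such a value maps to a score between 0 and 1 is shown below, assuming `predict_sentiment` applies a sigmoid to the model output as in the earlier section. The `label` helper and the 0.5 threshold are hypothetical.

```python
import math

def sigmoid(logit):
    """Map a raw model output (logit) to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def label(score, threshold=0.5):
    """Hypothetical helper: turn a score into a label with a 0.5 cut-off."""
    return "pos" if score >= threshold else "neg"

for logit in (-3.0, 0.0, 3.0):   # made-up logits, not model outputs
    score = sigmoid(logit)
    print(f"logit {logit:+.1f} -> score {score:.3f} -> {label(score)}")
```

A score near 0.5 means the model is undecided, which is worth keeping in mind for the mixed reviews below.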

An example negative review for Inception (Rotten Tomatoes rating = 2), followed by one for The Angry Birds Movie 2.


In [32]:
predict_sentiment(model, "I will pretend I loved it and worship it like all the the smartest people and didn't get kinda lost half way and stopped watching like 40 minutes before the end cause I was utterly uninterested in how it would end.")

0.10907997936010361

In [33]:
predict_sentiment(model, "Children will sit through it happily enough, but they deserve better, don't they?")

0.0964517816901207

Reviews with Rotten Tomatoes ratings of 3 and 3.5:

In [34]:
predict_sentiment(model,"Invention runs lower once we're on those snowy slopes, and the hard narrative punch keeps disintegrating into a floating cloud of pixels.")

0.8352929949760437

In [35]:
predict_sentiment(model, "The idea of the show, and the acting are amazing, but even as a lover of violence, some of the scenes involving animals and even some scenes involving people just felt.... baseless")

0.18680378794670105

In [36]:
predict_sentiment(model, "Inception isn't a dud but nor is it a masterpiece. It's like a very ambitious, overlong potboiler: visually beautiful, ingenious in parts and dragging in others.")

0.9655977487564087

An example positive review...

In [37]:
predict_sentiment(model, "A spectacular fantasy thriller based on Nolan's own original screenplay, Inception is the smartest CGI head-trip since The Matrix.")

0.9953552484512329

##6 Questions to Answer for Homework

We've now built a decent sentiment analysis model for movie reviews! 

**Homework Questions**
1. Run the single layer Vanilla RNN and LSTM models in section 4. Write down the number of parameters and plot the training and validation errors over the epochs. Note the time per iteration and the final accuracy. Comment on the results that you got with the different test review examples.

2. Compare the computational cost for LSTMs and regular RNNs for a given hidden dimension. Pay special attention to the training and inference cost.

3. What happens to the gradient in an RNN and an LSTM if you backpropagate through a long sequence? Show the form of the derivative for both cases.

4. Run the more complex models in section 5 for at least two different options.  You can select 2 or more layers for a deep RNN and can select if you want to add in the bidirectional option. Write down the number of parameters and plot the training and validation errors over the epochs for the cases. You may need to add in extra training epochs as you create deeper networks. Note the time per iteration and the final accuracy. Comment on the results that you got with the different test review examples.

5. In a bidirectional RNN, if the different directions use a different number of hidden units, how will the shape of  $𝐇_𝑡$ change?

6. Suppose someone asked you to repurpose your system to analyse product reviews on Amazon.com. Would you be able to use the trained system directly? Would you want to combine sources from different types of reviews when modeling text? Is this a good idea? What could go wrong?

In [85]:
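import torch
import torch.nn as nn

class Vanilla_LSTM(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, dropout, pad_idx):
        
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        
        #self.rnn = nn.RNN(embedding_dim, hidden_dim)
        # LSTM layer (single layer, unidirectional)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
        # note: `dropout` is accepted to keep the constructor signature but is not used
        
    def forward(self, text, text_lengths):

        #text = [sent len, batch size]
        #text_lengths is not used here: the sequences are not packed
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        #output, hidden = self.rnn(embedded)

        # Output for LSTM: the final hidden and cell states are returned as a tuple
        output, (hidden, cell) = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        
        # for a single-layer, unidirectional LSTM the last output equals the final hidden state
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        
        return self.fc(hidden.squeeze(0))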

In [86]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = Vanilla_LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX)

In [87]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

In [88]:
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)
model.embedding.weight.data.copy_(pretrained_embeddings)

torch.Size([25002, 100])


tensor([[-0.1117, -0.4966,  0.1631,  ...,  1.2647, -0.2753, -0.1325],
        [-0.8555, -0.7208,  1.3755,  ...,  0.0825, -1.1314,  0.3997],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7922, -0.1901, -0.0676,  ...,  1.0146,  0.2398,  0.0675],
        [ 0.4161, -0.1577, -0.0735,  ...,  0.3023,  0.2679,  0.6584],
        [ 0.9501, -0.7701,  0.1537,  ..., -2.0229,  0.4822, -1.0561]],
       device='cuda:0')

In [89]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.7922, -0.1901, -0.0676,  ...,  1.0146,  0.2398,  0.0675],
        [ 0.4161, -0.1577, -0.0735,  ...,  0.3023,  0.2679,  0.6584],
        [ 0.9501, -0.7701,  0.1537,  ..., -2.0229,  0.4822, -1.0561]],
       device='cuda:0')


In [90]:
print(f'The model has {count_parameters(model):,} trainable parameters')


The model has 2,867,049 trainable parameters


In [91]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 9s
	Train Loss: 0.667 | Train Acc: 61.42%
	 Val. Loss: 0.633 |  Val. Acc: 66.75%
Epoch: 02 | Epoch Time: 0m 9s
	Train Loss: 0.631 | Train Acc: 66.02%
	 Val. Loss: 0.517 |  Val. Acc: 78.20%
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.514 | Train Acc: 76.84%
	 Val. Loss: 0.457 |  Val. Acc: 81.41%
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.415 | Train Acc: 81.40%
	 Val. Loss: 0.330 |  Val. Acc: 86.10%
Epoch: 05 | Epoch Time: 0m 9s
	Train Loss: 0.239 | Train Acc: 90.97%
	 Val. Loss: 0.323 |  Val. Acc: 86.79%
Epoch: 06 | Epoch Time: 0m 9s
	Train Loss: 0.165 | Train Acc: 94.48%
	 Val. Loss: 0.316 |  Val. Acc: 87.64%
Epoch: 07 | Epoch Time: 0m 9s
	Train Loss: 0.119 | Train Acc: 96.35%
	 Val. Loss: 0.350 |  Val. Acc: 88.27%
Epoch: 08 | Epoch Time: 0m 9s
	Train Loss: 0.075 | Train Acc: 97.93%
	 Val. Loss: 0.381 |  Val. Acc: 88.17%
Epoch: 09 | Epoch Time: 0m 9s
	Train Loss: 0.048 | Train Acc: 98.87%
	 Val. Loss: 0.383 |  Val. Acc: 87.86%
Epoch: 10 | Epoch Time: 0m 10s
	Train Loss: 0.035 | Train Acc: 99.24%
	 Val. Loss: 0.456 |  Val. Acc: 87.66%

In [92]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.339 | Test Acc: 86.69%


###6.1 Run the single layer Vanilla RNN and LSTM models in section 4. Write down the number of parameters and plot the training and validation errors over the epochs. Note the time per iteration and the final accuracy. Comment on the results that you got with the different test review examples.



> **RNN:**


The model has 2,592,105 trainable parameters

EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0

Epoch: 10 | Epoch Time: 0m 5s
	Train Loss: 0.673 | Train Acc: 56.69%
	 Val. Loss: 0.692 |  Val. Acc: 51.09%

Each epoch takes around 5 s.


Test Loss: 0.665 | Test Acc: 59.92%




> **LSTM:**


The model has 2,867,049 trainable parameters

EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
DROPOUT = 0


Epoch: 10 | Epoch Time: 0m 10s
	Train Loss: 0.035 | Train Acc: 99.24%
	 Val. Loss: 0.456 |  Val. Acc: 87.66%

Each epoch takes around 9 s.

Test Loss: 0.339 | Test Acc: 86.69%



> **Results**

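The RNN is faster per epoch (about 5 s versus 9 s), but the LSTM is far more accurate on both the training and the test data (test accuracy of 86.69% versus 59.92%). The LSTM's final training accuracy (99.24%) is well above its validation accuracy (87.66%), which indicates overfitting.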
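The training and validation curves can be inspected directly from the logged losses. The sketch below copies the single-layer LSTM losses from the training log in this notebook and finds the epoch with the lowest validation loss, which is the one the checkpointing logic keeps.

```python
# Losses copied from the single-layer LSTM training log in this notebook (epochs 1-10).
train_loss = [0.667, 0.631, 0.514, 0.415, 0.239, 0.165, 0.119, 0.075, 0.048, 0.035]
valid_loss = [0.633, 0.517, 0.457, 0.330, 0.323, 0.316, 0.350, 0.381, 0.383, 0.456]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__) + 1
# Gap between validation and training loss after the final epoch.
gap_at_end = valid_loss[-1] - train_loss[-1]

print(f"best validation loss {valid_loss[best_epoch - 1]:.3f} at epoch {best_epoch}")
print(f"validation minus training loss after epoch 10: {gap_at_end:.3f}")
```

The validation loss bottoms out at epoch 6 and rises afterwards while the training loss keeps falling, so the later epochs mostly memorize the training set. The same two lists can be passed to a plotting library to draw the curves the question asks for.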


###6.2 Compare the computational cost for LSTMs and regular RNNs for a given hidden dimension. Pay special attention to the training and inference cost.




> **RNN:**


The model has 2,592,105 trainable parameters

Each epoch takes around 5 s.



> **LSTM:**


The model has 2,867,049 trainable parameters

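Each epoch takes around 9 s.

The computational cost per time step is O(W) for both LSTMs and RNNs, where W is the number of weights in the recurrent layer. The LSTM has a larger W than the RNN for the same hidden dimension, because it computes four gate blocks instead of one. This is consistent with the LSTM taking roughly twice as long per epoch and having more trainable parameters than the RNN. For both models, training costs more than inference, since training also runs the backward pass and the optimizer update.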
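A rough operation count makes the comparison concrete. The sketch below counts multiply-accumulate operations for one time step of a single recurrent layer, using the embedding and hidden dimensions from this notebook. The helper name is hypothetical, and element-wise operations and biases are ignored.

```python
def macs_per_step(embedding_dim, hidden_dim, gates):
    """Multiply-accumulate operations for one time step of one layer, one example.

    gates=1 for a standard RNN, gates=4 for an LSTM (input, forget, cell, output).
    """
    return gates * hidden_dim * (embedding_dim + hidden_dim)

rnn_macs = macs_per_step(100, 256, gates=1)
lstm_macs = macs_per_step(100, 256, gates=4)
print(rnn_macs, lstm_macs, lstm_macs / rnn_macs)
```

By this count the LSTM layer does four times the arithmetic of the RNN layer. The measured epoch time grows by less than that, presumably because the embedding lookup, data loading and other fixed overheads are the same for both models.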

###6.3 What happens to the gradient in an RNN and an LSTM if you backpropagate through a long sequence? Show the form of the derivative for both cases.


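Backpropagating through a long sequence multiplies one Jacobian per time step, so the gradient tends to vanish or explode as the sequence gets longer. The form of the derivative differs between the two models.

For a standard RNN with $h_t=\tanh(W_{xh}x_t+W_{hh}h_{t-1}+b)$:

$\frac{\partial h_T}{\partial h_k}=\prod_{t=k+1}^{T}\frac{\partial h_t}{\partial h_{t-1}}=\prod_{t=k+1}^{T}\mathrm{diag}(1-h_t^2)\,W_{hh}$

The same matrix $W_{hh}$ appears in every factor, so the product shrinks towards zero or grows without bound, depending on the size of $W_{hh}$ and on how saturated the $\tanh$ units are.

For an LSTM, the cell state is updated additively, $c_t=f_t\odot c_{t-1}+i_t\odot g_t$, where $f_t$ and $i_t$ are the forget and input gates and $g_t$ is the candidate cell state. Along the cell-state path (ignoring the indirect dependence of the gates on $c_{t-1}$):

$\frac{\partial c_T}{\partial c_k}\approx\prod_{t=k+1}^{T}\mathrm{diag}(f_t)$

When the forget gates $f_t$ are close to 1, this product stays close to 1 and the gradient is largely preserved, which mitigates the vanishing gradient problem. LSTMs can still suffer from exploding gradients.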
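The effect of the repeated multiplication during backpropagation through time can be illustrated numerically. This is a scalar sketch with made-up values (a recurrent weight of 0.9, hidden activations of 0.5, forget gates of 0.99, sequence length 100), not a measurement from the models in this notebook.

```python
import math

def rnn_gradient_factor(w_hh, hidden_values):
    """Product of d h_t / d h_{t-1} for a scalar tanh RNN: w_hh * (1 - h_t^2) per step."""
    factor = 1.0
    for h in hidden_values:
        factor *= w_hh * (1.0 - h * h)
    return factor

def lstm_cell_gradient_factor(forget_gates):
    """Product of d c_t / d c_{t-1} along the cell-state path: one forget gate per step."""
    return math.prod(forget_gates)

T = 100                                    # hypothetical sequence length
rnn_factor = rnn_gradient_factor(w_hh=0.9, hidden_values=[0.5] * T)
lstm_factor = lstm_cell_gradient_factor([0.99] * T)
print(f"RNN: {rnn_factor:.3e}  LSTM cell path: {lstm_factor:.3e}")
```

After 100 steps the RNN factor is numerically negligible, while the LSTM cell-state factor is still of order one, so early words can still influence the loss.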

###6.4 Run the more complex models in section 5 for at least two different options.  You can select 2 or more layers for a deep RNN and can select if you want to add in the bidirectional option. Write down the number of parameters and plot the training and validation errors over the epochs for the cases. You may need to add in extra training epochs as you create deeper networks. Note the time per iteration and the final accuracy. Comment on the results that you got with the different test review examples.





> **The given one:**


The model has 4,810,857 trainable parameters

Epoch: 10 | Epoch Time: 0m 45s
	Train Loss: 0.169 | Train Acc: 93.62%
	 Val. Loss: 0.260 |  Val. Acc: 90.37%

Test Loss: 0.289 | Test Acc: 89.04%

Each epoch takes around 45 s.



> **With N_LAYERS = 3**


The model has 6,387,817 trainable parameters


Epoch: 10 | Epoch Time: 1m 16s
	Train Loss: 0.132 | Train Acc: 95.28%
	 Val. Loss: 0.264 |  Val. Acc: 90.98%

Test Loss: 0.273 | Test Acc: 89.06%


Each epoch takes around 76 s.


> **With Bidirectional = False**


The model has 3,393,641 trainable parameters

Epoch: 10 | Epoch Time: 0m 20s
	Train Loss: 0.118 | Train Acc: 95.64%
	 Val. Loss: 0.338 |  Val. Acc: 89.95%

Test Loss: 0.295 | Test Acc: 88.25%

Each epoch takes around 20 s.


> **Results**

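Without bidirectionality, each epoch is faster (about 20 s versus 45 s) and the test accuracy is slightly lower (88.25% versus 89.04%).

With three layers, each epoch takes longer (about 76 s versus 45 s) and the test accuracy is essentially unchanged (89.06% versus 89.04%), with a slightly higher final validation accuracy (90.98% versus 90.37%).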




####6.4.1
With N_LAYERS = 3

In [93]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 3
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [94]:

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 6,387,817 trainable parameters


In [95]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

In [96]:
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

In [97]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 1m 16s
	Train Loss: 0.666 | Train Acc: 58.97%
	 Val. Loss: 0.678 |  Val. Acc: 62.66%
Epoch: 02 | Epoch Time: 1m 16s
	Train Loss: 0.543 | Train Acc: 73.15%
	 Val. Loss: 0.405 |  Val. Acc: 83.13%
Epoch: 03 | Epoch Time: 1m 16s
	Train Loss: 0.389 | Train Acc: 83.19%
	 Val. Loss: 0.298 |  Val. Acc: 87.84%
Epoch: 04 | Epoch Time: 1m 16s
	Train Loss: 0.312 | Train Acc: 87.13%
	 Val. Loss: 0.266 |  Val. Acc: 89.20%
Epoch: 05 | Epoch Time: 1m 17s
	Train Loss: 0.253 | Train Acc: 90.09%
	 Val. Loss: 0.278 |  Val. Acc: 89.36%
Epoch: 06 | Epoch Time: 1m 16s
	Train Loss: 0.218 | Train Acc: 91.61%
	 Val. Loss: 0.250 |  Val. Acc: 90.05%
Epoch: 07 | Epoch Time: 1m 16s
	Train Loss: 0.194 | Train Acc: 92.67%
	 Val. Loss: 0.312 |  Val. Acc: 89.52%
Epoch: 08 | Epoch Time: 1m 16s
	Train Loss: 0.175 | Train Acc: 93.49%
	 Val. Loss: 0.253 |  Val. Acc: 90.59%
Epoch: 09 | Epoch Time: 1m 17s
	Train Loss: 0.155 | Train Acc: 94.37%
	 Val. Loss: 0.292 |  Val. Acc: 90.17%
Epoch: 10 | Epoch Time: 1m 16s
	Train Loss: 0.132 | Train Acc: 95.28%
	 Val. Loss: 0.264 |  Val. Acc: 90.98%

In [98]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.273 | Test Acc: 89.06%


####6.4.2
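With BIDIRECTIONAL = False

Note that the `RNN` class still concatenates `hidden[-2,:,:]` and `hidden[-1,:,:]`. With a unidirectional two-layer LSTM, these are the final hidden states of the first and second layers rather than forward and backward states.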

In [78]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = False
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)



In [79]:

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 3,393,641 trainable parameters


In [80]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

In [81]:
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

In [83]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 20s
	Train Loss: 0.450 | Train Acc: 80.20%
	 Val. Loss: 0.400 |  Val. Acc: 83.29%
Epoch: 02 | Epoch Time: 0m 20s
	Train Loss: 0.385 | Train Acc: 83.73%
	 Val. Loss: 0.339 |  Val. Acc: 86.41%
Epoch: 03 | Epoch Time: 0m 20s
	Train Loss: 0.313 | Train Acc: 87.62%
	 Val. Loss: 0.327 |  Val. Acc: 87.28%
Epoch: 04 | Epoch Time: 0m 20s
	Train Loss: 0.252 | Train Acc: 90.33%
	 Val. Loss: 0.363 |  Val. Acc: 87.06%
Epoch: 05 | Epoch Time: 0m 20s
	Train Loss: 0.218 | Train Acc: 91.29%
	 Val. Loss: 0.280 |  Val. Acc: 89.12%
Epoch: 06 | Epoch Time: 0m 20s
	Train Loss: 0.193 | Train Acc: 92.59%
	 Val. Loss: 0.290 |  Val. Acc: 88.45%
Epoch: 07 | Epoch Time: 0m 20s
	Train Loss: 0.177 | Train Acc: 93.61%
	 Val. Loss: 0.304 |  Val. Acc: 89.72%
Epoch: 08 | Epoch Time: 0m 20s
	Train Loss: 0.153 | Train Acc: 94.09%
	 Val. Loss: 0.297 |  Val. Acc: 89.85%
Epoch: 09 | Epoch Time: 0m 20s
	Train Loss: 0.132 | Train Acc: 95.19%
	 Val. Loss: 0.312 |  Val. Acc: 89.66%
Epoch: 10 | Epoch Time: 0m 20s
	Train Loss: 0.118 | Train Acc: 95.64%
	 Val. Loss: 0.338 |  Val. Acc: 89.95%

In [84]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.295 | Test Acc: 88.25%


###6.5 In a bidirectional RNN, if the different directions use a different number of hidden units, how will the shape of  $𝐇_𝑡$ change?


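It is possible to use a different number of hidden units for each direction, but the forward and backward hidden states then have different shapes. Since $𝐇_𝑡$ is the concatenation of the forward and backward hidden states, its shape changes from $n \times 2h$ to $n \times (h_f + h_b)$, where $n$ is the batch size and $h_f$ and $h_b$ are the numbers of forward and backward hidden units. The input size of the following layer must be changed to match.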
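PyTorch's built-in `bidirectional=True` option uses one `hidden_size` for both directions, so the sketch below assumes two separate `nn.RNN` modules, one of which reads the sequence in reverse. The sizes are arbitrary toy values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

sent_len, batch_size, emb_dim = 5, 3, 6       # toy sizes
fwd_hid, bwd_hid = 4, 7                       # different hidden sizes per direction

forward_rnn = nn.RNN(emb_dim, fwd_hid)
backward_rnn = nn.RNN(emb_dim, bwd_hid)

embedded = torch.randn(sent_len, batch_size, emb_dim)

fwd_out, _ = forward_rnn(embedded)
# Run the second RNN over the reversed sequence, then flip its outputs back into time order.
bwd_out, _ = backward_rnn(torch.flip(embedded, dims=[0]))
bwd_out = torch.flip(bwd_out, dims=[0])

H = torch.cat((fwd_out, bwd_out), dim=2)      # H[t] is the combined hidden state at step t
print(H.shape)
```

Each `H[t]` has shape [batch size, 4 + 7], so a linear layer placed on top would need 11 input features instead of twice a single hidden size.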

###6.6 Suppose someone asked you to repurpose your system to analyse product reviews on Amazon.com. Would you be able to use the trained system directly? Would you want to combine sources from different types of reviews when modeling text? Is this a good idea? What could go wrong?


I think the trained system could be used directly. However, for better accuracy, it would need to be retrained. The current model was trained on movie reviews, so training data that includes product reviews would suit the new task better.

Combining reviews from different fields could go wrong. For example, people use different words in different situations.