# Lab 04 - Simple sentiment analysis with an RNN
In this lab we will experiment with different architectures of Recurrent Neural Nets (RNN), and will also use pre-trained word embeddings and several experimental setups. The Python framework for this lab relies on PyTorch.

This lab is based on the [popular PyTorch sentiment analysis tutorial by bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/1%20-%20Simple%20Sentiment%20Analysis.ipynb).

We'll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative) using PyTorch and TorchText. This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

We'll be using a **recurrent neural network** (RNN) as they are commonly used in analysing sequences. An RNN takes in a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$.

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.

The figure below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer in silver. Note that we use the same RNN for every word, i.e. it has the same parameters. The initial hidden state, $h_0$, is a tensor initialized to all zeros.

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment1.png?raw=1)

**Note:** some layers and steps have been omitted from the diagram, but these will be explained later.
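The recurrence above can be sketched in plain Python without any framework. This is a minimal illustration, not the lab's model: the `tanh` non-linearity (the default in PyTorch's `nn.RNN`), the toy weights `W_x`, `W_h`, `b`, the linear layer weights `w_f`, `b_f`, and the helper names `rnn_step` and `run_rnn` are all assumptions made for the example.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    hidden = []
    for i in range(len(h_prev)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        hidden.append(math.tanh(total))
    return hidden

def run_rnn(sequence, W_x, W_h, b):
    """Feed the sequence one element at a time, reusing the same weights."""
    h = [0.0] * len(b)  # h_0 is initialised to all zeros
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h  # final hidden state h_T

# Toy (hypothetical) parameters: 2-dimensional inputs, 2-dimensional hidden state
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.0, 0.4], [-0.3, 0.2]]
b = [0.0, 0.1]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three "word" vectors
h_T = run_rnn(sequence, W_x, W_h, b)

# Linear layer f applied to the final hidden state: y_hat = f(h_T)
w_f, b_f = [1.0, -1.0], 0.0
y_hat = sum(w * h for w, h in zip(w_f, h_T)) + b_f
print(h_T, y_hat)
```

Note that `rnn_step` is called with the same weights at every position, which is what "using the same RNN for every word" means in practice.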

In [1]:
# Install dependencies
%pip install torch==2.0.0 torchdata==0.6.0 torchtext==0.15.1
%pip install spacy tqdm
!python -m spacy download en_core_web_sm

Collecting torch==2.0.0
  Downloading torch-2.0.0-cp311-cp311-manylinux1_x86_64.whl.metadata (24 kB)
Collecting torchdata==0.6.0
  Downloading torchdata-0.6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (892 bytes)
Collecting torchtext==0.15.1
  Downloading torchtext-0.15.1-cp311-cp311-manylinux1_x86_64.whl.metadata (7.4 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch==2.0.0)
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch==2.0.0)
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cuda-cupti-cu11==11.7.101 (from torch==2.0.0)
  Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu11==8.5.0.96 (from torch==2.0.0)
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-c

In [None]:
import torch
import torchtext

SEED = 1234
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

print("PyTorch Version: ", torch.__version__)
print("torchtext Version: ", torchtext.__version__)
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'}.")

PyTorch Version:  2.0.0+cu117
torchtext Version:  0.15.1+cpu
Using GPU.


## Initialising the dataset

A handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP).

In this cell we automatically download and load the IMDb dataset using TorchText's `datasets` package.

The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review. Processing on this data (such as tokenization) can be done later. Each movie review is a `(label, review)` tuple where the label is an integer (`1` for negative, `2` for positive in the torchtext version used here) and the `review` is a text string.

The dataset is loaded into its canonical train/test splits as iterable-style objects (DataPipes in the torchtext version used here; older versions returned `RawTextIterableDataset` objects). This means that this is an [iterable-style dataset](https://pytorch.org/docs/stable/data.html#iterable-style-datasets).

In [None]:
from torchtext.datasets import IMDB

train_data, test_data = IMDB(root="./", split=("train", "test"))

Unfortunately, that's about as far as we can go with iterable-style datasets. Iterable-style datasets in PyTorch use DataPipes to stream data from a source (in the case of the IMDB dataset and most torchtext datasets, text files), so we often cannot know the length of the data or sample data points at will. We also cannot split the data while it is in data stream form (although this is a feature that was being worked on at the time of writing, tracked in [this issue](https://github.com/pytorch/text/issues/1311)).

We can, however, convert the dataset into a "map-style dataset" using the `to_map_style_dataset` torchtext utility. This will essentially read all the stream's data into memory and allow us to inspect it, reason about its length, and split it to create a validation set.

Usually it would be best practice to keep it as a data stream instead of loading it all into memory, which might even be impossible for certain datasets. Thankfully, the IMDB dataset's size is not prohibitively large, and we do wish to have finer control over the data for the purposes of this lab.
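To make the difference between the two dataset styles concrete, here is a minimal plain-Python analogy. It is only a sketch: `stream_reviews` is a hypothetical stand-in for an iterable-style dataset, and the three toy samples are invented for illustration.

```python
def stream_reviews():
    """Hypothetical iterable-style source: yields (label, text) one at a time."""
    yield (1, "bad film")
    yield (2, "great film")
    yield (1, "boring")

# A stream has no length and cannot be indexed
stream = stream_reviews()
try:
    len(stream)
    stream_has_len = True
except TypeError:
    stream_has_len = False

# Reading the whole stream into memory gives a map-style collection
materialized = list(stream_reviews())

print(stream_has_len)      # False
print(len(materialized))   # 3
print(materialized[1])     # (2, 'great film')
```

The trade-off is the one described above: materialising costs memory, but gives us `len()`, indexing, and the ability to split.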

In [None]:
from torchtext.data.functional import to_map_style_dataset

# This might take a while
train_data = to_map_style_dataset(train_data)
test_data = to_map_style_dataset(test_data)

Now we can treat it as a normal dataset and do our usual data exploration.

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


We can also check an example:

In [None]:
train_data[0]

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

Note how samples in this data set have the label first and the data point second.

To split the train set into a train and validation set, we'll use PyTorch's `random_split` utility.

Note that when we initially imported PyTorch at the top of this notebook, we also set the seed using `manual_seed` to a constant, so the result of `random_split` will be reproducible in our case.
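The idea behind a seeded random split can be sketched with the standard library alone. This is not how `random_split` is implemented internally; `seeded_split` is a hypothetical helper that shuffles indices with a fixed seed and cuts them at the split point.

```python
import random

def seeded_split(num_samples, split_ratio, seed):
    """Shuffle sample indices with a fixed seed, then cut into two parts."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # private generator, fixed seed
    cut = int(split_ratio * num_samples)
    return indices[:cut], indices[cut:]

train_idx, valid_idx = seeded_split(25_000, 0.7, seed=1234)
again_train, again_valid = seeded_split(25_000, 0.7, seed=1234)

print(len(train_idx), len(valid_idx))
print(train_idx == again_train)  # same seed, same split
```

Because the generator is seeded, running the split twice yields exactly the same partition, which is the property we rely on for reproducible experiments.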

In [None]:
from torch.utils.data import random_split

split_ratio = 0.7  # 70/30 split
train_samples = int(split_ratio * len(train_data))
valid_samples = len(train_data) - train_samples
train_data, valid_data = random_split(train_data, [train_samples, valid_samples])

Again, we'll view how many examples are in each split.

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


## Processing our data
In the previous section we did some data exploration to understand our dataset's format. We noticed that each data sample consists of two parts: the sentiment label (an integer, `1` or `2`) and the text (a string).

As we can't just feed strings into a recurrent neural network, we'll need to do some processing. Specifically:
- We'll map the **labels** to the integers 0 (negative) and 1 (positive).
- For the texts we'll:
  1. Build a vocabulary.
  2. Tokenize the text using SpaCy.
  3. Convert each sentence into a vector of numerical vocabulary IDs.
  4. Pad vectors to an equal length using padding tokens.

### Building a vocabulary with torchtext
In this section we'll build a _vocabulary_. This is effectively a lookup table where every unique word in your data set has a corresponding _index_ (an integer).

We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment5.png?raw=1)

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow and possibly won't fit onto your GPU (if you're using one).

There are two ways to effectively cut down our vocabulary: we can either take only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, only keeping the top 25,000 words.

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
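The two ideas above (keeping only the most frequent words and mapping everything else to `<unk>`) can be sketched in plain Python. This is an illustration rather than torchtext's implementation: `build_vocab` and the toy corpus are hypothetical, and the sketch assumes, as torchtext's `max_tokens` does, that the special tokens count towards the vocabulary size.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_tokens, specials=("<unk>", "<pad>")):
    """Keep the specials plus the most frequent tokens, max_tokens in total."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    keep = max_tokens - len(specials)
    itos = list(specials) + [tok for tok, _ in counts.most_common(keep)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

# Toy training corpus, already tokenized (hypothetical data)
train_texts = [
    "this film is great".split(),
    "this film is bad".split(),
    "great film and great cast".split(),
]
stoi, itos = build_vocab(train_texts, max_tokens=6)
unk = stoi["<unk>"]

# Words outside the vocabulary fall back to the <unk> index
sentence = "this film is great and i love it".split()
ids = [stoi.get(tok, unk) for tok in sentence]
restored = [itos[i] for i in ids]
print(ids)
print(restored)
```

With only six slots, rare words such as "and" are cut along with words never seen in training such as "love", and both end up as `<unk>`.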

In order to build the vocab, however, we will need to tokenize each text. Let's define a tokenizer as a PyTorch module for convenience:

In [None]:
from torchtext.data.utils import get_tokenizer

class SpacyTokenizer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

    def forward(self, input):
        if isinstance(input, list):
            tokens = []
            for text in input:
                tokens.append(self.tokenizer(text))
            return tokens
        elif isinstance(input, str):
            return self.tokenizer(input)
        raise ValueError(f"Type {type(input)} is not supported.")

This tokenizer will work both over a single string as well as a list of strings. The main reason we define a PyTorch `Module` for a tokenizer is so we can reuse it in our actual data processing pipeline later.

Now that we have the tokenizer, we will use the `build_vocab_from_iterator` utility from torchtext to build our vocabulary. We will also use two auxiliary functions to process the train samples into tokenized texts only with `_process_texts_for_vocab()` and into just labels with `_get_labels_for_vocab()`.

Additionally, note that we need to define the special `<pad>` and `<unk>` tokens in the text vocabulary, and we also have to explicitly specify that any unknown words should be mapped to the `<unk>` token using the `Vocab` object's `set_default_index()` method.

In [None]:
from torchtext.vocab import build_vocab_from_iterator, vocab
from torchtext.data.utils import get_tokenizer
from collections import OrderedDict

tokenizer = SpacyTokenizer()
MAX_VOCAB_SIZE = 25_000

def _process_texts_for_vocab(data):
    for line in data:
        yield tokenizer(line[1])

def _get_labels_for_vocab(data):
    for line in data:
        yield [line[0]]

# This might take a while as we're tokenizing
text_vocab = build_vocab_from_iterator(_process_texts_for_vocab(train_data), specials=('<unk>', '<pad>'), max_tokens=MAX_VOCAB_SIZE)
label_vocab = vocab(OrderedDict([("neg", 1), ("pos", 1)]))

text_vocab.set_default_index(text_vocab["<unk>"])

Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way. We do not include the validation set as we want it to reflect the test set as much as possible.

In [None]:
print(f"Unique tokens in text vocabulary: {len(text_vocab)}")
print(f"Unique tokens in label vocabulary: {len(label_vocab)}")

Unique tokens in text vocabulary: 25000
Unique tokens in label vocabulary: 2


We can see the vocabulary directly using either of the `get_stoi()` (**s**tring **to** **i**nt) or `get_itos()` (**i**nt **to**  **s**tring) methods.

In [None]:
text_vocab.get_itos()[:10]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']

The `<unk>` and `<pad>` tokens at the front of the list are the special tokens we defined: `<unk>` is the unknown token and `<pad>` is the padding token.

When we feed sentences into our model, we feed a _batch_ of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. Thus, to ensure each sentence in the batch is the same size, any sentence shorter than the longest one in the batch is padded with the `<pad>` token.

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment6.png?raw=1)
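Padding itself is simple enough to sketch in a few lines of plain Python. The helper `pad_batch` and the toy vocabulary IDs are hypothetical; the padding index `1` is chosen to match the position of `<pad>` in the vocabulary shown above.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

# Three numericalized sentences of different lengths (toy IDs)
batch = [[12, 7, 9, 4], [5, 4], [8, 3, 4]]
padded, lengths = pad_batch(batch, pad_id=1)

print(padded)   # [[12, 7, 9, 4], [5, 4, 1, 1], [8, 3, 4, 1]]
print(lengths)  # [4, 2, 3]
```

Returning the original lengths alongside the padded batch is useful later, when the padding needs to be ignored again.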

The unknown token will be used to replace any of the words encountered that are not part of our vocabulary. We can also check the labels, ensuring 0 is for negative and 1 is for positive.


In [None]:
label_vocab.get_stoi()

{'pos': 1, 'neg': 0}

In [None]:
label_vocab.get_itos()

['neg', 'pos']

Unfortunately, TorchText's `Vocab` object does not give us an easy way to view the most frequent words in our vocabulary, but we can simply count them manually (although it might take a while):

In [None]:
from collections import Counter

counter = Counter()
for (label, line) in train_data:
    counter.update(tokenizer(line))

counter.most_common(20)

[('the', 202947),
 (',', 192475),
 ('.', 165243),
 ('and', 109737),
 ('a', 109247),
 ('of', 100803),
 ('to', 94018),
 ('is', 76226),
 ('in', 61399),
 ('I', 54329),
 ('it', 53601),
 ('that', 49344),
 ('"', 44197),
 ("'s", 43048),
 ('this', 42421),
 ('-', 37207),
 ('/><br', 35829),
 ('was', 35069),
 ('as', 30515),
 ('with', 29866)]

### Defining the rest of our data processing pipelines
We will use torchtext's `transforms` package in order to define the rest of our pipelines. These transforms are very similar to common PyTorch `nn` modules, except we can use them for NLP data processing.

We'll use `transforms.Sequential` to define our pipeline in each case. For the text processing, we'll run each text through our `SpacyTokenizer`, then use `VocabTransform` to convert each tokenized sentence into a list of vocabulary IDs, and then `ToTensor` to convert this into a PyTorch Tensor. Conveniently, `ToTensor` allows us to pad all vocabulary ID sequences to the same length; we just need to give it the index of the padding token, which we get by querying our vocabulary.

For the labels, we'll simply convert them to indices using `LabelToIndex` and then convert them to a PyTorch Tensor with `ToTensor`.

In [None]:
import torchtext.transforms as T

text_transform = T.Sequential(
    SpacyTokenizer(),  # Tokenize
    T.VocabTransform(text_vocab),  # Convert to vocab IDs
    T.ToTensor(padding_value=text_vocab["<pad>"]),  # Convert to tensor and pad
)

label_transform = T.Sequential(
    T.LabelToIndex(label_vocab.get_itos()),  # Convert to integer
    T.ToTensor(),  # Convert to tensor
)

We'll also define an additional transform for the texts that will transform each text into its length *after* it gets tokenized but *before* it gets padded. These lengths will be useful when we *pack the padded sequences* later.

Note that applying `text_transform` and `lengths_transform` to our texts will mean that we will tokenize them twice, which is somewhat inefficient, but we will do so anyway for the sake of simplicity. In your implementations, you might want to consider having a "shared" pipeline that handles tokenization / vocabulary transformation and then having two separate pipelines that do further processing over that (one to extract the lengths and one to pad each tensor, for example).

In [None]:
class ToLengths(torch.nn.Module):
    def forward(self, input):
        if isinstance(input[0], list):
            lengths = []
            for text in input:
                lengths.append(len(text))
            return lengths
        elif isinstance(input, list):
            return len(input)
        raise ValueError(f"Type {type(input)} is not supported.")

lengths_transform = T.Sequential(
    SpacyTokenizer(),
    ToLengths(),
    T.ToTensor(),
)

### Understanding the processing being done
Before moving on, let's examine what exactly each step in our processing pipelines will do to the data.

In [None]:
sample_label, sample_text = train_data[0]
mapping = {1: 'neg', 2: 'pos'}

print(f"Text before any processing: {sample_text}")
print(f"Label before any processing: {sample_label}\n")

# Text Processing Pipeline
tokenizer = SpacyTokenizer()
sample_text = tokenizer(sample_text)
print(f"Text after Tokenizer: {sample_text}\n")

vocab_transform = T.VocabTransform(text_vocab)
sample_text = vocab_transform(sample_text)
print(f"Text after Vocab Transform: {sample_text}\n")

tensor_transform = T.ToTensor(padding_value=text_vocab["<pad>"])
sample_text = tensor_transform(sample_text)
print(f"Text after Tensor Transform: {sample_text}\n")
# Label Processing Pipeline
sample_label = mapping[sample_label]  # Map the raw integer label (1 or 2) to 'neg' or 'pos'
print(f"Label after label transform: {label_transform([sample_label])}\n")

# Length Processing Pipeline
print(f"Text after length transform: {lengths_transform([train_data[0][1]])}")

Text before any processing: Along with "King of the Rocket Men", this was still being repeated on BBC TV in the early to mid eighties. If I was loading up a time capsule of this period both these series would definitely go in.<br /><br />Someone watching it for the first time will think it is silly but this is one of the best examples of the "Serials". Don Del Oro will make you laugh (When I was little my nickname for him was Mr Dustbin head) and it was funny upon being shot at he says "Your bullets can't harm me" then he stumbles back, seemingly less than happy. I also like the way he dispenses with Sebastian in the first episode.<br /><br />I watched this again because I had good memories of it from years back, there are some good stunts and good music, it has the ingredients you expect including water,rockfalls,runaway carts... Apart from the first episode(with Ralph Faulkner)the swordplay wasn't nearly as good as I remembered it, and yes it features the inevitable "flashback" episo

Do note that when we take two texts of differing sizes, they will be padded after going through the pipeline:

In [None]:
sample_labels, sample_texts = zip(train_data[0], train_data[1])

processed_sample_texts = text_transform(list(sample_texts))
lengths = lengths_transform(list(sample_texts))
diff = abs(lengths[0] - lengths[1]) + 5

print(f"Padding vocabulary index: {text_vocab['<pad>']}")

print("Respective text lengths after tokenization: ", lengths)
print("Tensor shape after text processing: ", processed_sample_texts.shape)
print(f"Last {diff} characters of text 0 after processing:\n", processed_sample_texts[0][-diff:])
print(f"Last {diff} characters of text 1 after processing:\n", processed_sample_texts[1][-diff:])

Padding vocabulary index: 1
Respective text lengths after tokenization:  tensor([265, 227])
Tensor shape after text processing:  torch.Size([2, 265])
Last 43 characters of text 0 after processing:
 tensor([   32,   919,    10,   169,  8602,    11,   392,     2,   298,    39,
         1062,     0,     3,    14,   159,     6, 14565,     3, 13385,     6,
         8267,     3,  3093,     6, 14565,     3, 13385,     6,  5050,    14,
          428,    67,     9,    73,     6,   403,  3396,    52,    16,   225,
           10,    12,     4])
Last 43 characters of text 1 after processing:
 tensor([  2, 269, 343, 576,   4,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
          1])


Notice how one text is shorter, but after the `ToTensor` transformation both texts end up with the same length, and how the shorter text was padded with the padding token from the vocabulary to fill the missing positions.

### The DataLoader
Now we've got all our originally planned processing set up and ready to go. The final step is to put our data into a PyTorch `DataLoader`. The `DataLoader` will help us iterate over the data in batches of examples at each iteration.

PyTorch's `DataLoader` provides quite a few features to let us iterate over data, but it doesn't do any processing on its own. We need to define a `collate_batch` function where we explicitly define what processing steps each batch of data will need to go through when it goes through the data loader. In essence, we'll just use that function to feed our data through the pipelines we created previously.

We also want to place the returned tensors on the GPU (if you're using one). We do this in `collate_batch` by moving the tensors to the `torch.device` we defined earlier (`DEVICE`) with `.to()`. Finally, note that we convert the labels to float values so we can compute the loss later.

In [None]:
from torch.utils.data import DataLoader

BATCH_SIZE = 64

def collate_batch(batch):
    labels, texts = zip(*batch)
    labels = [mapping[x] for x in labels]
    lengths = lengths_transform(list(texts))
    texts = text_transform(list(texts))
    labels = label_transform(list(labels))

    return labels.float().to(DEVICE), texts.to(DEVICE), lengths.cpu()

def _get_dataloader(data):
    return DataLoader(data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

train_dataloader = _get_dataloader(train_data)
valid_dataloader = _get_dataloader(valid_data)
test_dataloader = _get_dataloader(test_data)

### Summary
- We imported the `IMDB` dataset from torchtext and converted it from an iterable-style data stream data set to a map-style data set with `to_map_style_dataset`.
- We split it further to obtain a validation set using `random_split`.
- We defined a Tokenizer using SpaCy as a PyTorch Module.
- We used `build_vocab_from_iterator` and some auxiliary functions to create our vocabulary.
- We used `torchtext.transforms` to define processing pipelines.
- We used `torch.utils.data.DataLoader` as well as an auxiliary function to put our data through our data pipelines and into batches.

That was a lot, but we are *finally* ready to put our data through our model!

Unfortunately, torchtext is still a very young project and a lot of reference material still uses its old legacy API. When trying to look deeper into torchtext, it's recommended to make sure you're checking its [latest documentation](https://pytorch.org/text/stable/index.html).

## Build the Model

The next stage is building the model that we'll eventually train and evaluate.

There is a small amount of boilerplate code when creating models in PyTorch: note how our `RNN` class is a sub-class of `nn.Module` and how it calls `super().__init__()`.

Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.

The embedding layer is used to transform our sparse word representations (sparse as most of the elements are 0) into dense embedding vectors (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is equivalent to a single fully connected layer without a bias applied to the one-hot vectors, although in practice it is implemented as a lookup table indexed by vocabulary ID. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have a similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).
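The relationship between one-hot vectors and an embedding lookup can be checked with a small plain-Python sketch. The 4-word vocabulary, the 2-dimensional `embedding_matrix`, and the helpers `one_hot` and `vector_times_matrix` are all hypothetical values chosen for illustration.

```python
def one_hot(index, vocab_size):
    """Vector of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def vector_times_matrix(vector, matrix):
    """Row vector (length V) times matrix (V x D) gives a vector of length D."""
    num_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(num_cols)
    ]

# Toy embedding matrix: 4 words in the vocabulary, 2 embedding dimensions
embedding_matrix = [
    [0.1, -0.3],
    [0.0, 0.0],
    [0.7, 0.2],
    [-0.5, 0.9],
]

word_index = 2
via_one_hot = vector_times_matrix(one_hot(word_index, 4), embedding_matrix)
via_lookup = embedding_matrix[word_index]  # just pick the row

print(via_one_hot, via_lookup)
```

Multiplying by a one-hot vector selects a single row of the matrix, so the one-hot vectors never need to be materialised; picking the row by index gives the same result at a fraction of the cost.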

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

![](https://github.com/surrey-nlp/NLP-2025/blob/main/lab04/assets/sentiment7.png?raw=1)

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

The `forward` method is called when we feed examples into our model.

Each batch, `texts`, is a tensor of size **[batch_size, batch_sentence_length]**. That is a batch of sentences, each having each word converted into its vocabulary index in our previous processing steps. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*. The `Embedding` layer looks up the dense vector for each index directly, which is equivalent to multiplying the corresponding one-hot vector by the embedding matrix, so we never need to build the one-hot vectors explicitly. Note that `batch_sentence_length` is the length of the longest sentence in the batch.

The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size **[batch size, batch_sentence_length, embedding dim]**.

Each batch, we also have access to `lengths`, which is a tensor of the lengths of each sentence before it was padded to size `batch_sentence_length`. We will use these lengths to essentially remove the padding present in each sentence in the batch, and combine all sentences into one packed tensor that gets fed through the RNN. This is mainly a performance optimization but it is generally good practice. Note that even though we are combining all our input into one continuous tensor, the RNN will still output results for individual sentences; the main thing we gain is that we don't waste computation time processing the padding present in the original tensors. It also means that the final hidden state of each sentence corresponds to its last real token rather than to a padding token.

`embedded` is then fed into the RNN. In some frameworks you must feed the initial hidden state, $h_0$, into the RNN, however in PyTorch, if no initial hidden state is passed as an argument it defaults to a tensor of all zeros.

The RNN returns 2 tensors, `output` normally of size **[batch size, batch_sentence length, hidden dim]** and `hidden` of size **[1, batch size, hidden dim]**. `output` is the concatenation of the hidden state from every time step, whereas `hidden` is simply the final hidden state. Since we're using packed padded sequences, `output` will also be a packed sequence, so its size will be slightly different and overall smaller than the normal size.

Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction. Note the `squeeze` method, which is used to remove a dimension of size 1.
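To get an intuition for what packing saves, the sketch below counts how many sequences are still "active" at each time step, which is roughly the information a packed sequence carries in its `batch_sizes`. The helper `packed_batch_sizes` and the toy lengths are hypothetical; the actual packing is done by `pack_padded_sequence`.

```python
def packed_batch_sizes(lengths):
    """Number of sequences that still have a real token at each time step."""
    max_len = max(lengths)
    return [sum(1 for n in lengths if n > t) for t in range(max_len)]

lengths = [3, 1, 2]  # sentence lengths before padding (toy values)
batch_sizes = packed_batch_sizes(lengths)

padded_steps = len(lengths) * max(lengths)  # positions in the padded tensor
packed_steps = sum(batch_sizes)             # positions that hold real tokens

print(batch_sizes)                 # [3, 2, 1]
print(padded_steps, packed_steps)  # 9 6
```

In this toy batch a third of the padded positions are padding, and the gap grows when review lengths vary widely, as they do in IMDb.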

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, texts, lengths):
        embedded = self.embedding(texts)                          # VV note that lengths need to be on the CPU
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)

        output, hidden = self.rnn(embedded)

        return self.fc(hidden.squeeze(0))

We now create an instance of our RNN class.

The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size.

The embedding dimension is the size of the dense word vectors. This is usually around 50-250 dimensions, but depends on the size of the vocabulary.

The hidden dimension is the size of the hidden states. This is usually around 100-500 dimensions, but also depends on factors such as the vocabulary size, the size of the dense vectors and the complexity of the task.

The output dimension is usually the number of classes. However, with only 2 classes the prediction can be a single scalar real number (which the sigmoid function later maps to a value between 0 and 1), so the output can be 1-dimensional.

In [None]:
INPUT_DIM = len(text_vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Let's also create a function that will tell us how many trainable parameters our model has so we can compare the number of parameters across different models.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,591,905 trainable parameters
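The parameter count can be reproduced by hand, which is a useful sanity check on what each layer contains. The sketch below assumes a single-layer, unidirectional `nn.RNN` with biases (which stores two bias vectors, one for the input weights and one for the hidden weights); `rnn_model_parameter_count` is a hypothetical helper.

```python
def rnn_model_parameter_count(vocab_size, embedding_dim, hidden_dim, output_dim):
    """Count parameters of embedding + single-layer RNN + linear layer."""
    embedding = vocab_size * embedding_dim
    rnn = (
        hidden_dim * embedding_dim   # input-to-hidden weights
        + hidden_dim * hidden_dim    # hidden-to-hidden weights
        + 2 * hidden_dim             # two bias vectors
    )
    linear = hidden_dim * output_dim + output_dim  # weights + bias
    return embedding + rnn + linear

total = rnn_model_parameter_count(25_000, 100, 256, 1)
print(f"{total:,}")  # 2,591,905
```

Almost all of the parameters (2,500,000 of them) sit in the embedding layer, which is why the vocabulary size matters so much for model size.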


## Training the Model

Now we'll set up the training and then train the model.

First, we'll create an optimizer. This is the algorithm we use to update the parameters of the module. Here, we'll use _stochastic gradient descent_ (SGD). The first argument is the parameters that will be updated by the optimizer, the second is the learning rate, i.e. how much we'll change the parameters by when we do a parameter update.
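The update rule behind SGD (without momentum or weight decay) is simply "move each parameter against its gradient, scaled by the learning rate". Here is a minimal plain-Python sketch on a toy problem; `sgd_step` is a hypothetical helper, the function being minimised is invented for illustration, and the gradient is exact rather than estimated from a mini-batch.

```python
def sgd_step(params, grads, lr):
    """Plain gradient descent update: p <- p - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = [0.0]
for _ in range(1000):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, lr=0.1)

print(w[0])  # very close to 3.0
```

A larger learning rate takes bigger steps and converges faster on this toy problem, but on a real loss surface it can overshoot; a smaller one is safer but slower.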

In [None]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-3)

Next, we'll define our loss function. In PyTorch this is commonly called a criterion.

The loss function here is _binary cross entropy with logits_.

Our model currently outputs an unbounded real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1. We do this using the _sigmoid_ (also known as _logistic_) function.

We then use this bounded scalar to calculate the loss using binary cross entropy.

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.
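The two steps can be written out in plain Python to see what is being computed. This is a sketch of the mathematics, not PyTorch's implementation: `bce_naive` applies the sigmoid and then the cross entropy separately, while `bce_with_logits` uses an algebraically equivalent form that avoids overflow for large logits. Both average over the batch, which matches the default reduction of `BCEWithLogitsLoss`. The function names are hypothetical.

```python
import math

def sigmoid(z):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(logits, labels):
    """Sigmoid followed by binary cross entropy, averaged over the batch."""
    losses = []
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)

def bce_with_logits(logits, labels):
    """Same loss in a stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, labels)
    ]
    return sum(losses) / len(losses)

logits = [2.0, -1.0]
labels = [1.0, 0.0]
print(bce_naive(logits, labels), bce_with_logits(logits, labels))
```

For moderate logits the two functions agree; for extreme logits the naive version can end up taking `log(0)`, which is one reason to prefer a combined criterion over a separate sigmoid layer.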

In [None]:
criterion = nn.BCEWithLogitsLoss()

Using `.to`, we can place the model and the criterion on the GPU (if we have one).

In [None]:
model = model.to(DEVICE)
criterion = criterion.to(DEVICE)

Our criterion function calculates the loss, however we have to write our own function to calculate the accuracy.

This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1, and then rounds them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum() / len(correct)
    return acc

The `train` function iterates over all examples, one batch at a time.

`model.train()` is used to put the model in "training mode", which turns on _dropout_ and _batch normalization_. Although we aren't using them in this model, it's good practice to include it.

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient computed by `loss.backward()`. PyTorch does not automatically remove (or "zero") the gradients from the previous backward pass but accumulates into them, so they must be manually zeroed.

We then feed the batch of sentences and their original lengths, `texts` and `lengths` respectively, into the model. Note that you do not need to call `model.forward(texts, lengths)`; simply calling the model works. The `squeeze` is needed as the predictions are initially size _**[batch size, 1]**_, and we need to remove the dimension of size 1 as PyTorch expects the predictions input to our criterion function to be of size _**[batch size]**_.

The loss and accuracy are then calculated using our predictions and the labels, `labels`, with the loss being averaged over all examples in the batch.

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.
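
One complete update can be traced by hand on a toy problem. The sketch below assumes plain SGD with a learning rate of 0.1 purely to keep the arithmetic simple; the optimizer and learning rate used in this lab may differ, and `w` and the loss `w ** 2` are invented for the example.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)  # assumption: plain SGD for easy arithmetic

optimizer.zero_grad()
loss = w ** 2
loss.backward()    # d(w^2)/dw = 2w = 2.0 at w = 1.0
grad = w.grad.item()
optimizer.step()   # w <- w - lr * grad = 1.0 - 0.1 * 2.0

print(grad, round(w.item(), 4))  # 2.0 0.8
```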

The loss and accuracy are accumulated across the epoch; the `.item()` method extracts a Python scalar from a tensor that contains a single value.

Finally, we return the loss and accuracy, averaged across the epoch. The `len` of an iterator is the number of batches in the iterator.
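
As a quick check of what `len` returns, here is a sketch with a toy `DataLoader` (the dataset of ten integers and the batch size of 4 are arbitrary choices for illustration).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))   # ten toy examples
loader = DataLoader(dataset, batch_size=4)

print(len(loader))                           # 3 batches
sizes = [len(batch[0]) for batch in loader]
print(sizes)                                 # [4, 4, 2]
```

Note that the last batch is smaller than the others. Dividing the summed per-batch values by the number of batches gives that batch the same weight as a full one, so the epoch average is a close approximation of the per-example average rather than an exact match.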

You may recall that we converted the labels to float in `collate_batch()`. This is because `ToTensor` sets tensors to be `LongTensor`s by default, whereas our criterion expects both inputs to be `FloatTensor`s. An alternative would be to do the conversion inside the `train` function by passing `labels.float()` instead of `labels` to the criterion.
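
The alternative conversion can be sketched as follows. This assumes the criterion is `nn.BCEWithLogitsLoss`, as in the tutorial this lab is based on; the logits and labels are made-up values.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()   # assumption: the criterion used in this lab

logits = torch.tensor([1.5, -0.5])   # made-up model outputs
labels = torch.tensor([1, 0])        # integer labels are int64 (a LongTensor)

print(labels.dtype, labels.float().dtype)  # torch.int64 torch.float32

# Convert at the call site instead of in collate_batch()
loss = criterion(logits, labels.float())
print(loss.dtype)                          # torch.float32
```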

In [None]:
from tqdm import tqdm

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in tqdm(iterator, desc="\tTraining"):
        optimizer.zero_grad()

        labels, texts, lengths = batch  # Note that this has to match the order in collate_batch
        predictions = model(texts, lengths).squeeze(1)
        loss = criterion(predictions, labels)
        acc = binary_accuracy(predictions, labels)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

`evaluate` is similar to `train`, with a few modifications as you don't want to update the parameters when evaluating.

`model.eval()` puts the model in "evaluation mode", which turns off _dropout_ and makes _batch normalization_ layers use their running statistics. Again, we are not using them in this model, but it is good practice to include the call.

No gradients are calculated on PyTorch operations inside the `with torch.no_grad()` block. This causes less memory to be used and speeds up computation.
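
Although this model has no dropout, the difference between the two modes is easy to see on a standalone layer. The sketch below uses an `nn.Dropout` with `p=0.5` on a vector of ones; both choices are arbitrary and only for illustration.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # each element is either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)   # dropout is disabled: the input passes through unchanged

only_zero_or_two = bool(((train_out == 0) | (train_out == 2)).all())
print(only_zero_or_two, torch.equal(eval_out, x))  # True True
```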
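
The effect of the block can be checked directly on a toy tensor (`w` is made up for this sketch): the same arithmetic produces a result that is tracked by autograd outside the block and untracked inside it.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = (w * 2).sum()        # computed normally: autograd records a graph
with torch.no_grad():
    untracked = (w * 2).sum()  # same arithmetic, but no graph is recorded

print(tracked.requires_grad, untracked.requires_grad)  # True False
print(untracked.grad_fn)                               # None
```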

The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating.

In [None]:
from tqdm import tqdm

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():
        for batch in tqdm(iterator, desc="\tEvaluation"):
            labels, texts, lengths = batch  # Note that this has to match the order in collate_batch
            predictions = model(texts, lengths).squeeze(1)
            loss = criterion(predictions, labels)
            acc = binary_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
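
A quick usage check with a made-up duration of 125.7 seconds is shown below. The function is repeated so that the example runs on its own, and `divmod` is shown as one possible shorter way to get the same split for this input.

```python
def epoch_time(start_time, end_time):
    # Same logic as above: whole minutes, then the remaining whole seconds
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

print(epoch_time(0.0, 125.7))     # (2, 5)

# Alternative for the same input: truncate to whole seconds, then split
print(divmod(int(125.7), 60))     # (2, 5)
```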

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the training and validation sets.

At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')
print(f"Using {'GPU' if str(DEVICE) == 'cuda' else 'CPU'} for training.")

for epoch in range(N_EPOCHS):
    print(f'Epoch: {epoch+1:02}')
    start_time = time.time()

    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion)
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')

    valid_loss, valid_acc = evaluate(model, valid_dataloader, criterion)
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')

Using GPU for training.
Epoch: 01
	Training: 100%|██████████| 274/274 [00:34<00:00,  8.04it/s]
	Train Loss: 0.690 | Train Acc: 52.73%
	Evaluation: 100%|██████████| 118/118 [00:11<00:00, 10.00it/s]
	 Val. Loss: 0.692 |  Val. Acc: 51.56%
Epoch: 02
	Training: 100%|██████████| 274/274 [00:33<00:00,  8.08it/s]
	Train Loss: 0.690 | Train Acc: 53.05%
	Evaluation: 100%|██████████| 118/118 [00:11<00:00, 10.15it/s]
	 Val. Loss: 0.691 |  Val. Acc: 51.96%
Epoch: 03
	Training: 100%|██████████| 274/274 [00:32<00:00,  8.31it/s]
	Train Loss: 0.689 | Train Acc: 53.19%
	Evaluation: 100%|██████████| 118/118 [00:11<00:00, 10.10it/s]
	 Val. Loss: 0.691 |  Val. Acc: 51.70%
Epoch: 04
	Training: 100%|██████████| 274/274 [00:34<00:00,  7.95it/s]
	Train Loss: 0.688 | Train Acc: 53.67%
	Evaluation: 100%|██████████| 118/118 [00:12<00:00,  9.72it/s]
	 Val. Loss: 0.690 |  Val. Acc: 51.97%
Epoch: 05
	Training:  81%|████████▏ | 223/274 [00:27<00:05,  9.44it/s]

You may have noticed that the loss is barely decreasing and the accuracy is only slightly better than the 50% expected from random guessing. This is due to several issues with the model, which we'll address in the next notebook.

Finally, we compute the metric we actually care about: the test loss and accuracy, obtained using the parameters that gave us the best validation loss.
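
To see why a loss near 0.69 signals a model that has learned very little, consider a classifier that always outputs a probability of 0.5. The sketch below assumes a binary cross-entropy criterion (an assumption about this lab's `criterion`); `bce` is a helper written for this example only.

```python
import math

def bce(p, y):
    # Binary cross-entropy for one example with predicted probability p and label y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A model that always predicts 0.5 gets the same loss whatever the label is
chance_pos = bce(0.5, 1)
chance_neg = bce(0.5, 0)
print(round(chance_pos, 3), round(chance_neg, 3))  # 0.693 0.693
```

This value is $\ln 2$, so the losses reported above sit almost exactly at the level of uninformed guessing.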

In [None]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss, test_acc = evaluate(model, test_dataloader, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
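
The save and restore round trip can be tried in isolation with a toy layer. This is a sketch under stated assumptions: `demo-model.pt`, `model_a` and `model_b` are hypothetical names, and `map_location="cpu"` is an optional argument that lets a checkpoint saved on a GPU be loaded on a machine without one.

```python
import os
import tempfile

import torch
import torch.nn as nn

model_a = nn.Linear(3, 1)
path = os.path.join(tempfile.mkdtemp(), "demo-model.pt")  # hypothetical file name
torch.save(model_a.state_dict(), path)   # only the parameters are saved, not the class

model_b = nn.Linear(3, 1)                # must have the same architecture
state = torch.load(path, map_location="cpu")
model_b.load_state_dict(state)

print(torch.equal(model_a.weight, model_b.weight))  # True
```

Because only the `state_dict` is stored, the model object has to be constructed with the same architecture before `load_state_dict` is called.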

## Next Steps

In the next notebook, the improvements we will make are:
- pre-trained word embeddings
- different RNN architectures
- bidirectional RNN
- multi-layer RNN
- regularization
- a different optimizer

This will allow us to achieve ~84% accuracy.