## Sequence Classification

In [4]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try: 
    import torchbearer
except:
    !pip install torchbearer

try:
    import torchtext
except:
    !pip install torchtext
    
try:
    import spacy
except:
    !pip install spacy
    
try:
    spacy.load('en_core_web_sm')
except:
    !python -m spacy download en_core_web_sm

The problem we will use to demonstrate sequence classification in this lab is IMDB movie review sentiment classification. Each movie review is a variable-length sequence of words, and the sentiment of each review must be classified.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly polar movie reviews (good or bad) for training and the same number again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment. The data was collected by Stanford researchers and used in a 2011 paper, which split the data 50-50 between training and test and reported an accuracy of 88.89%.

We'll be using a **recurrent neural network** (RNN), as they are commonly used for analysing sequences. An RNN takes in a sequence of words, $X=\{x_1, ..., x_T\}$, one at a time, and produces a _hidden state_, $h$, for each word. We use the RNN _recurrently_ by feeding in the current word $x_t$ as well as the hidden state from the previous word, $h_{t-1}$, to produce the next hidden state, $h_t$. 

$$h_t = \text{RNN}(x_t, h_{t-1})$$

Once we have our final hidden state, $h_T$, (from feeding in the last word in the sequence, $x_T$) we feed it through a linear layer, $f$, (also known as a fully connected layer), to receive our predicted sentiment, $\hat{y} = f(h_T)$.

The figure below shows an example sentence, with the RNN predicting zero, which indicates a negative sentiment. The RNN is shown in orange and the linear layer in silver. Note that we use the same RNN for every word, i.e. it has the same parameters at every time step. The initial hidden state, $h_0$, is a tensor initialized to all zeros. 

![](http://comp6248.ecs.soton.ac.uk/labs/lab7/assets/sentiment1.png)

**Note:** some layers and steps have been omitted from the diagram, but these will be explained later.
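
As a rough illustration of the recurrence and the final linear layer, here is a minimal pure-Python sketch. It assumes the common form $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, which the description above does not specify; the names `rnn_step` and `classify` and all the weights are hypothetical, and in the lab itself this work is done by `nn.RNN` and `nn.Linear`.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step (assumed form): h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_x[i], x_t))
            + sum(w * h for w, h in zip(W_h[i], h_prev))
            + b[i]
        )
        for i in range(len(b))
    ]

def classify(sequence, W_x, W_h, b, w_out, b_out):
    """Run the same step over every input, then apply a linear layer to h_T."""
    h = [0.0] * len(b)  # h_0 is all zeros
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)  # same parameters at every step
    return sum(w * v for w, v in zip(w_out, h)) + b_out  # y_hat = f(h_T)

# One-dimensional inputs, two hidden units, hand-picked (hypothetical) weights
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
logit = classify([[1.0], [2.0], [3.0]], W_x, W_h, b, w_out=[1.0, -1.0], b_out=0.0)
print(logit)
```

Only the hidden state after the last input reaches the linear layer; the intermediate hidden states are used solely to compute the next one.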


The TorchText library provides easy access to the IMDB dataset. The `IMDB` class allows you to load the dataset in a format that is ready for use in neural network and deep learning models, and TorchText's utility methods allow us to easily create batches of data that are `padded` to the same length (we need to pad shorter sentences in the batch to the length of the longest sentence).

In [5]:
import torch
from torchtext.legacy import data

TEXT = data.Field(tokenize='spacy', lower=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.float)

In [6]:
from torchtext.legacy import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

In [7]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


In [8]:
print(vars(train_data.examples[0]))

{'text': ['bromwell', 'high', 'is', 'a', 'cartoon', 'comedy', '.', 'it', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '"', 'teachers', '"', '.', 'my', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'bromwell', 'high', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"', 'teachers', '"', '.', 'the', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'i', 'knew', 'and', 'their', 'students', '.', 'when', 'i', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'i', 'immediately', 'recalled', '.........', 'at', '..........', 'high', '.', 'a', 'classic', 'l

In [9]:
# split() uses a 70/30 train/validation ratio by default
train_data, valid_data = train_data.split()

In [10]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


Next, we have to build a _vocabulary_. This is effectively a lookup table where every unique word in your data set has a corresponding _index_ (an integer).

We do this because our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector in which every element is 0 except one, which is 1; its dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

![](http://comp6248.ecs.soton.ac.uk/labs/lab7/assets/sentiment5.png)
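
To make the index-to-vector step concrete, here is a small sketch using a hypothetical four-word vocabulary (so $V = 4$). The helper `one_hot` and the mapping `stoi` are illustrative names, not part of TorchText.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a single 1 at position index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

# Hypothetical four-word vocabulary: word -> index
stoi = {"i": 0, "hate": 1, "this": 2, "film": 3}
V = len(stoi)

for word in ["this", "film"]:
    print(word, one_hot(stoi[word], V))
# this [0, 0, 1, 0]
# film [0, 0, 0, 1]
```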

The number of unique words in our training set is over 100,000, which means that our one-hot vectors would have over 100,000 dimensions! This would make training slow, and the model might not fit onto your GPU (if you're using one). 

There are two ways to cut down the vocabulary effectively: keep only the top $n$ most common words, or ignore words that appear fewer than $m$ times. We'll do the former, keeping only the top 25,000 words.

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
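
The truncation and `<unk>` replacement can be sketched in a few lines of plain Python. This is not TorchText's implementation, only an illustration of the idea; `build_vocab` and `numericalize` here are hypothetical helpers, and the tiny corpus is made up.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens, plus the two special tokens."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(sentence, stoi):
    """Map tokens to indexes; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi[UNK]) for tok in sentence]

corpus = [
    "this film is great and i like it".split(),
    "this film is bad".split(),
]
itos, stoi = build_vocab(corpus, max_size=3)

ids = numericalize("this film is great".split(), stoi)
print(len(itos))               # 5: three kept words plus <unk> and <pad>
print([itos[i] for i in ids])  # ['this', 'film', 'is', '<unk>']
```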

The following builds the vocabulary, only keeping the most common `max_size` tokens.

In [11]:
TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)

In [12]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


Why is the vocab size 25002 and not 25000? One of the additional tokens is the `<unk>` token and the other is a `<pad>` token.

When we feed sentences into our model, we feed a _batch_ of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. Thus, to ensure each sentence in the batch is the same size, any sentences which are shorter than the longest within the batch are padded.

![](http://comp6248.ecs.soton.ac.uk/labs/lab7/assets/sentiment6.png)
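
As a minimal sketch of what padding does, the hypothetical helper `pad_batch` below pads lists of word indexes to a common length and also records the true lengths. It assumes the pad index is 1; the word indexes are made up. TorchText's iterators do this for us and return tensors rather than lists.

```python
def pad_batch(batch, pad_index=1):
    """Pad every sentence to the length of the longest one in the batch."""
    max_len = max(len(sentence) for sentence in batch)
    padded = [sentence + [pad_index] * (max_len - len(sentence)) for sentence in batch]
    lengths = [len(sentence) for sentence in batch]  # true lengths before padding
    return padded, lengths

batch = [[12, 7, 45, 3], [8, 19], [5, 6, 2]]
padded, lengths = pad_batch(batch)
print(padded)   # [[12, 7, 45, 3], [8, 19, 1, 1], [5, 6, 2, 1]]
print(lengths)  # [4, 2, 3]
```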

We can also view the most common words in the vocabulary and their frequencies.

In [13]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 230945), (',', 192771), ('.', 165867), ('and', 113777), ('a', 113201), ('of', 102043), ('to', 95116), ('is', 76998), ('it', 65491), ('in', 65057), ('i', 57930), ('this', 51418), ('that', 51180), ('"', 44094), ("'s", 43621), ('-', 37323), ('/><br', 35742), ('was', 35169), ('as', 32359), ('for', 30936)]


In [14]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']


In [15]:
print(LABEL.vocab.stoi)

defaultdict(None, {'neg': 0, 'pos': 1})


In [16]:
# create the iterators
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
        (train_data, valid_data, test_data),
        batch_size=BATCH_SIZE,
        device=device,
        sort_key=lambda x: len(x.text),
        sort_within_batch=True)

## Build the Model

The next stage is building the model that we'll eventually train and evaluate. 

There is a small amount of boilerplate code when creating models in PyTorch: note how our `RNN` class is a sub-class of `nn.Module`, and note the use of `super`.

Within the `__init__` we define the _layers_ of the module. Our three layers are an _embedding_ layer, our RNN, and a _linear_ layer. All layers have their parameters initialized to random values, unless explicitly specified.

The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is equivalent to a single fully connected layer, without a bias, applied to the one-hot vector, although in practice it is implemented as a table lookup: row $i$ of the weight matrix is the embedding of word $i$. As well as reducing the dimensionality of the input to the RNN, there is the theory that words which have a similar impact on the sentiment of the review are mapped close together in this dense vector space. For more information about word embeddings, see [here](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/).
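
The link between an embedding layer and a fully connected layer can be checked by hand. The sketch below uses a made-up embedding table with $V = 4$ words and an embedding dimension of 2, and shows that multiplying a one-hot vector by the table gives the same result as reading out one row. The helper names are hypothetical.

```python
def one_hot(index, vocab_size):
    """One-hot vector as a plain list."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

def vector_times_matrix(vector, matrix):
    """Row vector times matrix: result[j] = sum_i vector[i] * matrix[i][j]."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(len(matrix[0]))
    ]

# Hypothetical embedding table: one row per word, one column per embedding dim
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

index = 2
via_matmul = vector_times_matrix(one_hot(index, 4), embedding)
via_lookup = embedding[index]
print(via_matmul, via_lookup)  # [0.5, 0.6] [0.5, 0.6]
```

The lookup avoids ever building the $V$-dimensional one-hot vector, which is why word indexes, not one-hot vectors, are what the model receives.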

The RNN layer is our RNN which takes in our dense vector and the previous hidden state $h_{t-1}$, which it uses to calculate the next hidden state, $h_t$.

![](http://comp6248.ecs.soton.ac.uk/labs/lab7/assets/sentiment7.png)

Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.

The `forward` method is called when we feed examples into our model.

Because `TEXT` was created with `include_lengths=True`, the text of each batch is a pair: a tensor of size _**[max sentence length, batch size]**_ holding the padded sentences, and a tensor of size _**[batch size]**_ holding the true length of each sentence (remember, they won't necessarily be the same; some reviews are much longer than others). `forward` receives these as `text` and `lengths`. 

The first tensor in the tuple contains the ordered word indexes for each review in the batch. The act of converting a list of tokens into a list of indexes is commonly called *numericalizing*.

The input batch is then passed through the embedding layer to get `embedded`, which gives us a dense vector representation of our sentences. `embedded` is a tensor of size _**[sentence length, batch size, embedding dim]**_. 

`embedded` is then passed to a function called `pack_padded_sequence` before being fed into the RNN. `pack_padded_sequence` creates a data structure that lets the RNN skip the padding positions, so they take no part in the forward pass or in backpropagation through time (BPTT); we don't want to learn from the padding, as this could drastically influence the results! In some frameworks you must feed the initial hidden state, $h_0$, into the RNN; in PyTorch, if no initial hidden state is passed as an argument, it defaults to a tensor of all zeros.

The RNN returns two values: `output`, which holds the hidden state from every time step, and `hidden`, of size _**[1, batch size, hidden dim]**_, which is simply the final hidden state of each sentence. Because the input was packed, `output` comes back as a packed sequence too (hence the name `packed_output` in the code); unpacked, it would have size _**[sentence length, batch size, hidden dim]**_. 

Finally, we feed the last hidden state, `hidden`, through the linear layer, `fc`, to produce a prediction. Note the `squeeze` method, which is used to remove a dimension of size 1. 
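
The effect of packing can be seen on a toy batch. The sketch below uses random stand-in values for `embedded` and arbitrary small dimensions (both are assumptions for illustration); it checks that, for the shorter sentence, `hidden` is the state at its last real word rather than at a padding position.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.RNN(input_size=3, hidden_size=4)

# Stand-in for `embedded`: [max sentence length = 5, batch size = 2, embedding dim = 3]
embedded = torch.randn(5, 2, 3)
lengths = torch.tensor([5, 2])  # true lengths, longest first
embedded[2:, 1] = 0.0           # positions 2..4 of the second sentence are padding

packed = pack_padded_sequence(embedded, lengths)
packed_output, hidden = rnn(packed)

# Unpack only to inspect the per-step hidden states
output, _ = pad_packed_sequence(packed_output)

print(output.shape)  # [sentence length, batch size, hidden dim]
print(hidden.shape)  # [1, batch size, hidden dim]

# The final hidden state of each sentence is taken at its own last real step
print(torch.allclose(hidden[0, 0], output[4, 0]))
print(torch.allclose(hidden[0, 1], output[1, 1]))
```

After unpacking, the rows of `output` that correspond to padding are filled with zeros, which is another sign that the RNN never processed them.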

In [10]:
import torch.nn as nn
import torch.optim as optim
from torchbearer import Trial

In [27]:
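class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        # word index -> dense vector of size embedding_dim
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        # final hidden state -> prediction of size output_dim
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, lengths):
        # text: [sentence length, batch size]; lengths: [batch size]
        embedded = self.embedding(text)
        # pack so the RNN skips the padding; lengths must be on the CPU
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu())
        packed_output, hidden = self.rnn(embedded)
        # hidden: [1, batch size, hidden dim] -> [batch size, hidden dim]
        return self.fc(hidden.squeeze(0))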

In [28]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 50
HIDDEN_DIM = 100
OUTPUT_DIM = 1

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

## Model Training

In [29]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.001)

In [30]:
criterion = nn.BCEWithLogitsLoss()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [11]:
class MyIter:
    def __init__(self, it):
        self.it = it
    def __iter__(self):
        for batch in self.it:
            yield (batch.text, batch.label.unsqueeze(1))
    def __len__(self):
        return len(self.it)

In [32]:
from torchbearer import Trial

torchbearer_trial = Trial(model, optimizer, criterion, metrics=['acc', 'loss']).to(device)
torchbearer_trial.with_generators(train_generator=MyIter(train_iterator),
                                  val_generator=MyIter(valid_iterator),
                                  test_generator=MyIter(test_iterator))
torchbearer_trial.run(epochs=5)
torchbearer_trial.predict()

RuntimeError: CUDA error: unspecified launch failure

In [19]:
results = torchbearer_trial.evaluate(data_key=torchbearer.VALIDATION_DATA)
print(results)

{'val_binary_acc': 0.49293333292007446, 'val_loss': 0.6919684410095215}


In [12]:
class ImprovedRNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text, lengths):
        embedded = self.embedding(text)
        embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu())
        out, (h,c) = self.lstm(embedded)
        out = self.fc(h.squeeze(0))
        return out
        
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 50
HIDDEN_DIM = 100
OUTPUT_DIM = 1

imodel = ImprovedRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

AttributeError: 'Field' object has no attribute 'vocab'

In [13]:
optimizer = optim.Adam(imodel.parameters(), lr=0.01)
criterion = nn.BCEWithLogitsLoss()

torchbearer_trial = Trial(imodel, optimizer, criterion, metrics=['acc', 'loss']).to(device)
torchbearer_trial.with_generators(train_generator=MyIter(train_iterator),
                                  val_generator=MyIter(valid_iterator),
                                  test_generator=MyIter(test_iterator))
torchbearer_trial.run(epochs=5)
torchbearer_trial.predict()

NameError: name 'imodel' is not defined

In [34]:
import spacy
nlp = spacy.load('en_core_web_sm')

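def predict_sentiment(model, sentence):
    model.eval()
    # lower-case to match TEXT (lower=True); unseen tokens map to <unk>
    tokenized = [tok.text.lower() for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    # shape [sentence length, 1]: a batch containing one sentence
    tensor = torch.LongTensor(indexed).unsqueeze(1).to(device)
    lengths = torch.tensor([len(indexed)])
    with torch.no_grad():
        # forward() takes the text tensor and the lengths as separate arguments
        prediction = torch.sigmoid(model(tensor, lengths))
    return prediction.item()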

In [35]:
predict_sentiment(imodel, "This film is terrible")

RuntimeError: CUDA error: unspecified launch failure