[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb)

# Convolutional Neural Networks for Sentence Classification using Ignite

This tutorial shows how to use Ignite to train neural network models, set up experiments and validate models.

In this experiment, we'll be replicating [Convolutional Neural Networks for Sentence Classification by Yoon Kim](https://arxiv.org/abs/1408.5882)! This paper uses a CNN for text classification, a task typically reserved for RNNs, Logistic Regression or Naive Bayes.

We want to classify IMDB movie reviews and predict whether a review is positive or negative. The IMDB Movie Review dataset consists of 25,000 positive and 25,000 negative examples, each a pair of text and label, which makes this a binary classification problem. We'll be using PyTorch to create the model, torchtext to import data and Ignite to train and monitor the models!

Let's get started!

## Required Dependencies 

In this example we only need the `torchtext` and `spacy` packages, assuming that `torch` and `ignite` are already installed. We can install them using `pip`:

`pip install torchtext==0.9.1 spacy`

`python -m spacy download en_core_web_sm`

In [None]:
!pip install pytorch-ignite torchtext==0.9.1 spacy
!python -m spacy download en_core_web_sm

## Import Libraries

In [None]:
import random

`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:
* **datasets**: A module to download NLP datasets.
* **GloVe**: A class to download and use pretrained GloVe embeddings.

In [None]:
from torchtext import datasets
from torchtext.vocab import GloVe

We import torch, nn and functional modules to create our models! 

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

`Ignite` is a high-level library to help with training neural networks in PyTorch. It comes with an `Engine` to set up a training loop, various metrics, handlers and a helpful contrib section!

Below we import the following:
* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.
* **Events**: Allows users to attach functions to an `Engine` so that they fire at a specific event, e.g. `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.
* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, multilabel cases.
* **Loss**: General metric that takes a loss function as a parameter and calculates the loss over a dataset.
* **RunningAverage**: General metric to attach to an Engine during training to track a running average of a value.
* **ModelCheckpoint**: Handler to checkpoint models. 
* **EarlyStopping**: Handler to stop training based on a score function. 
* **ProgressBar**: Handler to create a tqdm progress bar.

In [None]:
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy, Loss, RunningAverage
from ignite.handlers import ModelCheckpoint, EarlyStopping
from ignite.contrib.handlers import ProgressBar
from ignite.utils import manual_seed

SEED = 1234
manual_seed(SEED)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Processing Data

We first set up a tokenizer using `torchtext.data.utils`.
The job of a tokenizer is to split a sentence into "tokens". You can read more about it on [Wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).
We will use the tokenizer from the "spacy" library, which is a popular choice. Feel free to switch to "basic_english" if you want to use the default one, or to any other tokenizer you prefer.

docs: https://pytorch.org/text/stable/data_utils.html

In [None]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")

In [None]:
tokenizer("Ignite is a high-level library for training and evaluating neural networks.")

Next, the IMDB training and test datasets are downloaded. The `torchtext.datasets` API returns the train/test dataset split directly without the preprocessing information. Each split is an iterator which yields the raw texts and labels line-by-line.

In [None]:
train_iter, test_iter = datasets.IMDB(split=('train','test'))

Now we set up the train, validation and test splits.  

In [None]:
# We are using only 1000 samples for faster training
# set to -1 to use full data
N = 1000 

# We will use 80% of the `train split` for training and the rest for validation
train_frac = 0.8
_temp = list(train_iter)


random.shuffle(_temp)
_temp = _temp[:(N if N > 0 else len(_temp) )]
n_train = int(len(_temp)*train_frac)

train_list = _temp[:n_train]
validation_list = _temp[n_train:]
test_list = list(test_iter)
test_list = test_list[:(N if N > 0 else len(test_list))]

Let's explore a data sample to see what it looks like.
Each data sample is a tuple of the format `(label, text)`.

The value of the label is either 'pos' or 'neg'.


In [None]:
random_sample = random.sample(train_list,1)[0]
print(' text:', random_sample[1])
print('label:', random_sample[0])

Now that we have the dataset splits, let's build our vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important that we build our vocabulary from the train dataset only, as validation and test are **unseen** in our experiment.

`Vocab` allows us to use pretrained **GloVe** 100-dimensional word vectors. This means each word is described by 100 floats! If you want to read more about this, here are a few resources.
* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)
* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)
* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)

Note that the GloVe download size is around 900MB, so it might take some time to download.

An instance of the `Vocab` class has the following attributes and methods:
* `extend` is a method used to extend the vocabulary
* `freqs` is a dictionary of the frequency of each word
* `itos` is a list of all the words in the vocabulary.
* `stoi` is a dictionary mapping every word to an index.
* `vectors` is a torch.Tensor of the downloaded embeddings


In [None]:
from collections import Counter
from torchtext.vocab import Vocab

counter = Counter()

for (label, line) in train_list:
    counter.update(tokenizer(line))

vocab = Vocab(
    counter,
    min_freq=10,
    vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')
)

In [None]:
print("The length of the new vocab is", len(vocab))
new_stoi = vocab.stoi
print("The index of '<BOS>' is", new_stoi['<BOS>'])
new_itos = vocab.itos
print("The token at index 2 is", new_itos[2])

We now create `text_transform` and `label_transform`, which are callable objects, such as a `lambda` func here, to process the raw text and label data from the dataset iterators (or iterables like a `list`). You can add the special symbols such as `<BOS>` and `<EOS>` to the sentence in `text_transform`.

In [None]:
text_transform = lambda x: [vocab[token] for token in tokenizer(x)]
label_transform = lambda x: 1 if x == 'pos' else 0

# Print out the output of text_transform
print("input to the text_transform:", "here is an example")
print("output of the text_transform:", text_transform("here is an example"))
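
To make the vocabulary lookup less of a black box, here is a dependency-free sketch of the idea. `build_toy_vocab` and `toy_text_transform` are hypothetical helpers written for illustration only; they are not torchtext's implementation. They assume the special tokens come first and that any token missing from the vocabulary maps to `<unk>`.

```python
from collections import Counter

def build_toy_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Hypothetical helper: map tokens to indices, special tokens first."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically.
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def toy_text_transform(tokens, stoi, unk_token="<unk>"):
    # Tokens that were filtered out (or never seen) fall back to <unk>.
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]

corpus = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "great", "film"]]
itos, stoi = build_toy_vocab(corpus, min_freq=2)
print(itos)
print(toy_text_transform(["a", "bad", "movie"], stoi))
```

With `min_freq=2` only "a" and "movie" survive, so "bad" is encoded with the `<unk>` index. The same effect applies to the real vocabulary above, where `min_freq=10` drops rare words.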

For generating the data batches we will use `torch.utils.data.DataLoader`. You can customize how a batch is assembled by passing a function as the `collate_fn` argument of the DataLoader. Here, in the `collate_batch` function, we process the raw text data and add padding to dynamically match the longest sentence in a batch.

In [None]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
    # Pad with the index of the '<pad>' token so padding is not mistaken for a real word
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=vocab['<pad>'])
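
As a rough picture of what dynamic padding does, here is a plain-Python sketch. `pad_batch` is a hypothetical helper, not a PyTorch function, and the token indices and padding value are made up. It also shows the layout `pad_sequence` produces with its default `batch_first=False`, namely (longest length, batch size).

```python
def pad_batch(sequences, padding_value):
    """Hypothetical helper: right-pad every list to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, padding_value=1)
print(padded)  # one row per example, every row as long as the longest

# With batch_first=False the tensor is laid out as (longest_length, batch_size),
# i.e. the transpose of the rows above.
time_major = [list(column) for column in zip(*padded)]
print(time_major)
```

Because the padded length depends only on the longest example in each batch, short batches stay short, which saves computation compared with padding everything to a global maximum.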


In [None]:
batch_size = 8  # A batch size of 8

def create_iterators(batch_size=8):
    """Helper function to create the iterators"""
    dataloaders = []
    for split in [train_list, validation_list, test_list]:
        dataloader = DataLoader(
            split, batch_size=batch_size,
            collate_fn=collate_batch
            )
        dataloaders.append(dataloader)
    return dataloaders


In [None]:
train_iterator, valid_iterator, test_iterator = create_iterators()

In [None]:
next(iter(train_iterator))

Let's explore what the iterator yields. This way we'll know what the input of the model is, how to compare the label to the output and how to set up our process functions for Ignite's `Engine`. Each batch is a tuple `(labels, texts)`. `texts` has shape `(L, N)`, where `L` is the length of the longest review in the batch and `N` is the batch size, because `pad_sequence` stacks sequences along the second dimension by default.
* `batch[0][0]` is the label of a single example. `label_transform` mapped the original text label to an integer: 1 for 'pos' and 0 otherwise.
* `batch[1][:, 0]` is the text of a single example. `text_transform` used the vocabulary to convert each token of the example's text into an index, and the tail is filled with padding.

Now let's print the sequence lengths of the first 10 batches of `train_iterator`. The lengths differ from batch to batch because each batch is padded only up to its own longest review, which shows that dynamic padding is working as expected.

In [None]:
batch = next(iter(train_iterator))
print('batch[0][0] : ', batch[0][0])
print('batch[1][:, 0] : ', batch[1][:, 0])

lengths = []
for i, batch in enumerate(train_iterator):
    x = batch[1]
    lengths.append(x.shape[0])  # x has shape (L, N)
    if i == 9:
        break

print('Lengths of first 10 batches : ', lengths)

## TextCNN Model

Here is our replication of the model. Its operations are:
* **Embedding**: Embeds a batch of text of shape (N, L) to (N, L, D), where N is the batch size, L is the maximum length in the batch and D is the embedding dimension. The data loader yields batches of shape (L, N), so `forward` transposes them first. The embedded batch is then transposed to (N, D, L), since `nn.Conv1d` expects channels before length.

* **Convolutions**: Runs parallel convolutions across the embedded words with kernel sizes of 3, 4, 5 to mimic trigrams, four-grams, five-grams. This results in outputs of (N, F, L - k + 1) per convolution, where k is the kernel_size and F is the number of filters (in this notebook F = D = 100).

* **Activation**: ReLU activation is applied to the output of each convolution.

* **Pooling**: Runs parallel max-pooling operations on the activated convolutions with window sizes of L - k + 1, resulting in 1 value per filter, i.e. a shape of (N, F, 1) per pooling.

* **Concat**: The pooling outputs are squeezed and concatenated to result in a shape of (N, 3F). This is a single embedding for a sentence.

* **Dropout**: Dropout is applied to the embedded sentence.

* **Fully Connected**: The dropout output is passed through a fully connected layer of shape (3F, 1) to give a single output for each example in the batch. Sigmoid is applied to the output of this layer.

* **load_embeddings**: This is a method defined for TextCNN to load embeddings based on user input. There are 3 modes: rand, which results in randomly initialized weights; static, which results in frozen pretrained weights; and nonstatic, which results in trainable pretrained weights.


Let's note that this model works for variable text lengths! The idea is to embed the words of a sentence, then use convolutions, max-pooling and concatenation to embed the sentence as a single vector! This single vector is passed through a fully connected layer with sigmoid to output a single value. This value can be interpreted as the probability that a sentence is positive (closer to 1) or negative (closer to 0).

The minimum length of text expected by the model is the largest kernel size of the model, since every convolution needs at least one full window.
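
The length arithmetic can be checked with a few lines of plain Python. This is a minimal sketch assuming unpadded convolutions with stride 1, as in this notebook; `conv_output_length` is a hypothetical helper, and the sequence length of 12 is an arbitrary example.

```python
def conv_output_length(sequence_length, kernel_size, stride=1):
    """Output length of an unpadded 1-D convolution."""
    if sequence_length < kernel_size:
        # No complete window fits, so the convolution cannot be applied.
        raise ValueError("sequence is shorter than the kernel")
    return (sequence_length - kernel_size) // stride + 1

kernel_sizes = [3, 4, 5]
sequence_length = 12
output_lengths = {k: conv_output_length(sequence_length, k) for k in kernel_sizes}
print(output_lengths)
```

Each branch produces a different length, which is why max-pooling over the whole length is needed before the branches can be concatenated.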

In [None]:
class TextCNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dim, 
        kernel_sizes, 
        num_filters, 
        num_classes, d_prob, mode):
        super(TextCNN, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.kernel_sizes = kernel_sizes
        self.num_filters = num_filters
        self.num_classes = num_classes
        self.d_prob = d_prob
        self.mode = mode
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=0)
        self.load_embeddings()
        self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,
                                             out_channels=num_filters,
                                             kernel_size=k, stride=1) for k in kernel_sizes])
        self.dropout = nn.Dropout(d_prob)
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

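    def forward(self, x):
        # x arrives from the data loader as (sequence_length, batch_size).
        # Embed as (N, L, D), then move channels first for Conv1d: (N, D, L)
        x = self.embedding(x.T).transpose(1, 2)
        x = [F.relu(conv(x)) for conv in self.conv]
        x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]
        x = torch.cat(x, dim=1)
        x = self.fc(self.dropout(x))
        # Squeeze only the class dimension so a batch of one keeps shape (1,)
        return torch.sigmoid(x).squeeze(dim=-1)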

    def load_embeddings(self):
        if 'static' in self.mode:
            # Relies on the global `vocab` built earlier in the notebook
            self.embedding.weight.data.copy_(vocab.vectors)
            if 'non' not in self.mode:
                # Freeze the parameter itself; setting the flag on `.data` has no effect
                self.embedding.weight.requires_grad = False
                print('Loaded pretrained embeddings, weights are not trainable.')
            else:
                self.embedding.weight.requires_grad = True
                print('Loaded pretrained embeddings, weights are trainable.')
        elif self.mode == 'rand':
            print('Randomly initialized embeddings are used.')
        else:
            raise ValueError('Unexpected value of mode. Please choose from static, nonstatic, rand.')

## Creating Model, Optimizer and Loss

Below we create an instance of the TextCNN model and load embeddings in **static** mode. The model is placed on a device, and then a Binary Cross Entropy loss function and an Adam optimizer are set up.

In [None]:
vocab_size, embedding_dim = vocab.vectors.shape

model = TextCNN(vocab_size=vocab_size,
                embedding_dim=embedding_dim,
                kernel_sizes=[3, 4, 5],
                num_filters=100,
                num_classes=1, 
                d_prob=0.5,
                mode='static')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
criterion = nn.BCELoss()
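
For intuition about the loss, here is a small worked example of binary cross entropy in plain Python, averaged over the batch as `nn.BCELoss` does by default. The function name and the probabilities are made up for illustration.

```python
import math

def binary_cross_entropy(probabilities, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)

# Same targets, but predictions that are confidently right vs. confidently wrong.
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
print(round(confident_right, 4))
print(round(confident_wrong, 4))
```

Confident mistakes are penalized far more heavily than confident correct predictions are rewarded, which is what pushes the sigmoid output toward the right label.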

## Training and Evaluating using Ignite

### Trainer Engine - process_function

Ignite's Engine allows users to define a process_function to process a given batch; it is applied to all the batches of the dataset. This is a general class that can be used to both train and validate models! A process_function has two parameters, engine and batch.


Let's walk through what the function of the trainer does:

* Sets the model in train mode.
* Sets the gradients of the optimizer to zero.
* Generates x and y from batch.
* Performs a forward pass to calculate y_pred using model and x.
* Calculates loss using y_pred and y.
* Performs a backward pass using loss to calculate gradients for the model parameters.
* Updates the model parameters using the gradients and the optimizer.
* Returns the scalar loss.

Below is a single operation during the training process. This process_function will be attached to the training engine.

In [None]:
def process_function(engine, batch):
    model.train()
    optimizer.zero_grad()
    y, x = batch
    x = x.to(device)
    y = y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y.float())
    loss.backward()
    optimizer.step()
    return loss.item()

### Evaluator Engine - process_function

Similar to the training process function, we set up a function to evaluate a single batch. Here is what the eval_function does:

* Sets the model in eval mode.
* Generates x and y from batch.
* With torch.no_grad(), no gradients are calculated for any succeeding steps.
* Performs a forward pass on the model to calculate y_pred based on model and x.
* Returns y_pred and y.

Ignite suggests attaching metrics to evaluators and not trainers, because the model parameters are constantly changing during training and it is best to evaluate a stationary model. This matters because the training and evaluation functions return different things: training returns a single scalar loss, while evaluation returns y_pred and y, which are used to calculate metrics per batch over the entire dataset.

By default, metrics in Ignite expect y_pred and y as the output of the function attached to the Engine.

In [None]:
def eval_function(engine, batch):
    model.eval()
    with torch.no_grad():
        y, x = batch
        y = y.to(device)
        x = x.to(device)
        y = y.float()
        y_pred = model(x)
        return y_pred, y

### Instantiating Training and Evaluating Engines

Below we create 3 engines, a trainer, a training evaluator and a validation evaluator. You'll notice that train_evaluator and validation_evaluator use the same function, we'll see later why this was done! 

In [None]:
trainer = Engine(process_function)
train_evaluator = Engine(eval_function)
validation_evaluator = Engine(eval_function)

### Metrics - RunningAverage, Accuracy and Loss

To start, we'll attach a metric of Running Average to track a running average of the scalar loss output for each batch. 

In [None]:
RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')
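
To show what a running average does to noisy per-batch losses, here is a minimal sketch of an exponential moving average in plain Python. It is an illustration under assumptions, not Ignite's code: the first value seeds the average, and the smoothing factor `alpha=0.5` is chosen only to keep the numbers easy to follow.

```python
def running_average(values, alpha=0.98):
    """Exponential moving average; the first value seeds the average."""
    averaged = []
    current = None
    for value in values:
        if current is None:
            current = value
        else:
            # Keep a fraction alpha of the history, blend in the new value.
            current = alpha * current + (1.0 - alpha) * value
        averaged.append(current)
    return averaged

batch_losses = [1.0, 0.5, 0.5, 0.5]
smoothed = running_average(batch_losses, alpha=0.5)
print(smoothed)
```

A larger `alpha` reacts more slowly to new batches, so the value shown in the progress bar changes smoothly instead of jumping with every batch.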

Now there are two metrics that we want to use for evaluation - accuracy and loss. This is a binary problem, so for Loss we can simply pass the Binary Cross Entropy function as the loss_function. 

For Accuracy, Ignite requires y_pred and y to consist of 0's and 1's only. Since our model's outputs come from a sigmoid layer, the values are between 0 and 1. We'll need to write a function that transforms `engine.state.output`, which consists of y_pred and y.

Below, `thresholded_output_transform` does just that: it rounds y_pred to convert it to 0's and 1's, and then returns the rounded y_pred and y. This function is passed as the output_transform that converts `engine.state.output` into the form `Accuracy` expects.

Now, we attach `Loss` and `Accuracy` (with `thresholded_output_transform`) to train_evaluator and validation_evaluator. 

To attach a metric to engine, the following format is used:
* `Metric(output_transform=output_transform, ...).attach(engine, 'metric_name')`


In [None]:
def thresholded_output_transform(output):
    y_pred, y = output
    y_pred = torch.round(y_pred)
    return y_pred, y

In [None]:
Accuracy(output_transform=thresholded_output_transform).attach(train_evaluator, 'accuracy')
Loss(criterion).attach(train_evaluator, 'bce')

In [None]:
Accuracy(output_transform=thresholded_output_transform).attach(validation_evaluator, 'accuracy')
Loss(criterion).attach(validation_evaluator, 'bce')
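
The effect of thresholding on accuracy can be traced by hand with a few made-up values. `threshold` and `accuracy` below are hypothetical plain-Python helpers for illustration; they are not the Ignite metric.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into hard 0/1 predictions."""
    # A value exactly at the cutoff maps to 0, matching round-half-to-even for 0.5.
    return [1 if p > cutoff else 0 for p in probabilities]

def accuracy(predictions, targets):
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)

y_pred = [0.91, 0.12, 0.55, 0.40]
y = [1, 0, 0, 0]
hard_predictions = threshold(y_pred)
print(hard_predictions)
print(accuracy(hard_predictions, y))
```

The third example is predicted positive with probability 0.55 but is actually negative, so three of four predictions are correct.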

### Progress Bar

Next we create an instance of Ignite's progress bar, attach it to the trainer and pass it a key of `engine.state.metrics` to track. In this case, the progress bar will be tracking `engine.state.metrics['loss']`.

In [None]:
pbar = ProgressBar(persist=True, bar_format="")
pbar.attach(trainer, ['loss'])

### EarlyStopping - Tracking Validation Loss

Now we'll set up an EarlyStopping handler for this training process. EarlyStopping requires a score_function that lets the user define the criterion for stopping training. Higher scores are treated as better, which is why the score_function below returns the negative validation loss. In this case, if the loss on the validation set does not decrease in 5 epochs, the training process will stop early.

In [None]:
def score_function(engine):
    val_loss = engine.state.metrics['bce']
    return -val_loss

handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)
validation_evaluator.add_event_handler(Events.COMPLETED, handler)
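
The patience rule can be sketched in a few lines of plain Python. `PatienceTracker` is a hypothetical toy class, not Ignite's implementation, and the validation losses are invented. It assumes that a higher score is better and that any strict improvement resets the counter.

```python
class PatienceTracker:
    """Toy stand-in for an early-stopping rule (higher score is better)."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = None
        self.counter = 0

    def should_stop(self, score):
        if self.best_score is None or score > self.best_score:
            # Improvement: remember it and start counting again.
            self.best_score = score
            self.counter = 0
            return False
        self.counter += 1
        return self.counter >= self.patience

validation_losses = [0.70, 0.65, 0.66, 0.67, 0.64, 0.68, 0.69, 0.70]
tracker = PatienceTracker(patience=3)
stopped_at = None
for epoch, loss in enumerate(validation_losses, start=1):
    # Negate the loss so that a lower loss means a higher score.
    if tracker.should_stop(-loss):
        stopped_at = epoch
        break
print(stopped_at)
```

The improvement at epoch 5 resets the counter, so training only stops after three further epochs without a new best loss.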

### Attaching Custom Functions to Engine at specific Events

Below you'll see two ways to define your own custom functions and attach them to various `Events` of the training process.

The functions below achieve similar tasks: they print the results of an evaluator run on a dataset. One function does that with the training evaluator and dataset, the other with the validation ones. The other difference is how these functions are attached to the trainer engine.

The first method uses a decorator. The syntax is simple: `@trainer.on(Events.EPOCH_COMPLETED)` means that the decorated function will be attached to the trainer and called at the end of each epoch.

The second method involves using the add_event_handler method of trainer - `trainer.add_event_handler(Events.EPOCH_COMPLETED, custom_function)`. This achieves the same result as the above. 

In [None]:
@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(engine):
    train_evaluator.run(train_iterator)
    metrics = train_evaluator.state.metrics
    avg_accuracy = metrics['accuracy']
    avg_bce = metrics['bce']
    pbar.log_message(
        "Training Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
        .format(engine.state.epoch, avg_accuracy, avg_bce))
    
def log_validation_results(engine):
    validation_evaluator.run(valid_iterator)
    metrics = validation_evaluator.state.metrics
    avg_accuracy = metrics['accuracy']
    avg_bce = metrics['bce']
    pbar.log_message(
        "Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
        .format(engine.state.epoch, avg_accuracy, avg_bce))
    pbar.n = pbar.last_print_n = 0

trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)
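
To see why the two attachment styles are equivalent, here is a toy engine in plain Python. `ToyEngine` is a hypothetical illustration only and is not how Ignite's `Engine` is implemented; events are plain strings and the process function just sums each batch.

```python
from collections import defaultdict

class ToyEngine:
    """Toy illustration only; not Ignite's Engine."""

    def __init__(self, process_function):
        self.process_function = process_function
        self.handlers = defaultdict(list)
        self.epoch = 0
        self.output = None

    def add_event_handler(self, event, handler):
        self.handlers[event].append(handler)

    def on(self, event):
        # Decorator form: registers the function, then returns it unchanged.
        def decorator(handler):
            self.add_event_handler(event, handler)
            return handler
        return decorator

    def run(self, data, max_epochs=1):
        for _ in range(max_epochs):
            self.epoch += 1
            for batch in data:
                self.output = self.process_function(self, batch)
            for handler in self.handlers["EPOCH_COMPLETED"]:
                handler(self)

log = []
toy_trainer = ToyEngine(lambda engine, batch: sum(batch))

@toy_trainer.on("EPOCH_COMPLETED")
def log_first(engine):
    log.append(("decorator", engine.epoch, engine.output))

def log_second(engine):
    log.append(("add_event_handler", engine.epoch, engine.output))

toy_trainer.add_event_handler("EPOCH_COMPLETED", log_second)
toy_trainer.run([[1, 2], [3, 4]], max_epochs=2)
print(log)
```

In this sketch the decorator is only a thin wrapper around `add_event_handler`, and handlers fire in the order they were registered.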

### ModelCheckpoint

Lastly, we want to checkpoint this model. It's important to do so, as training processes can be time consuming and if for some reason something goes wrong during training, a model checkpoint can be helpful to restart training from the point of failure.

Below we'll use Ignite's `ModelCheckpoint` handler to checkpoint models at the end of each epoch. 

In [None]:
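# State dicts are saved by default; require_empty=False lets the notebook be re-run
# when /tmp/models already holds checkpoints from an earlier run.
checkpointer = ModelCheckpoint('/tmp/models', 'textcnn', n_saved=2, create_dir=True, require_empty=False)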
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'textcnn': model})
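
The effect of `n_saved=2` can be pictured as a fixed-size queue of checkpoints. The sketch below is plain Python with hypothetical file names (the real handler chooses its own naming scheme); it only illustrates the retention rule of keeping the newest files.

```python
from collections import deque

def retained_checkpoints(epochs, n_saved):
    """Which epoch checkpoints survive if only the newest n_saved are kept."""
    kept = deque(maxlen=n_saved)  # the oldest entry is dropped automatically
    for epoch in epochs:
        kept.append(f"textcnn_epoch_{epoch}.pt")  # hypothetical file name
    return list(kept)

survivors = retained_checkpoints(range(1, 6), n_saved=2)
print(survivors)
```

After five epochs only the checkpoints from epochs 4 and 5 remain, which bounds disk usage while still allowing a restart from a recent state.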

### Run Engine

Next, we'll run the trainer for 20 epochs and monitor the results. Below we can see that the progress bar prints the loss per iteration, and that the results of training and validation are printed as specified in our custom functions.

In [None]:
trainer.run(train_iterator, max_epochs=20)

That's it! We have successfully trained and evaluated a Convolutional Neural Network for Text Classification.