From 3f832b6d7ea6905be1c49d610b46a6ad64346791 Mon Sep 17 00:00:00 2001 From: cozek Date: Fri, 9 Apr 2021 21:36:10 +0530 Subject: [PATCH 1/6] adding torchtext version and updating gitignore for jupyter --- .gitignore | 3 +++ examples/notebooks/TextCNN.ipynb | 11 +++++------ 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/.gitignore b/.gitignore index f7a11548b404..084a70204eb5 100644 --- a/.gitignore +++ b/.gitignore @@ -39,3 +39,6 @@ coverage.xml /docs/src/ /docs/build/ /docs/source/generated/ + +# Jupyter +examples/notebooks/.ipynb_checkpoints/ diff --git a/examples/notebooks/TextCNN.ipynb b/examples/notebooks/TextCNN.ipynb index 286f35ac6ab2..bb169fc42f01 100644 --- a/examples/notebooks/TextCNN.ipynb +++ b/examples/notebooks/TextCNN.ipynb @@ -48,8 +48,8 @@ }, "outputs": [], "source": [ - "!pip install pytorch-ignite torchtext spacy\n", - "!python -m spacy download en" + "!pip install pytorch-ignite torchtext==0.9.1 spacy\n", + "!python -m spacy download en_core_web_sm" ] }, { @@ -806,8 +806,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "id": "sPe46cQOznkX", - "scrolled": false + "id": "sPe46cQOznkX" }, "outputs": [], "source": [ @@ -845,9 +844,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.3" + "version": "3.8.8" } }, "nbformat": 4, - "nbformat_minor": 1 + "nbformat_minor": 4 } From 8f8153c8409d6d28df2ba96fafa510c77b8c5178 Mon Sep 17 00:00:00 2001 From: cozek Date: Sat, 10 Apr 2021 14:31:42 +0530 Subject: [PATCH 2/6] first updates --- .gitignore | 3 + examples/notebooks/TextCNN.ipynb | 1867 ++++++++++++++++-------------- 2 files changed, 1019 insertions(+), 851 deletions(-) diff --git a/.gitignore b/.gitignore index 084a70204eb5..83edf70bc5a8 100644 --- a/.gitignore +++ b/.gitignore @@ -42,3 +42,6 @@ coverage.xml # Jupyter examples/notebooks/.ipynb_checkpoints/ + +# System files +.DS_Store diff --git a/examples/notebooks/TextCNN.ipynb b/examples/notebooks/TextCNN.ipynb index bb169fc42f01..cf75d23b89c8 100644 --- a/examples/notebooks/TextCNN.ipynb +++ b/examples/notebooks/TextCNN.ipynb @@ -1,852 +1,1017 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "HCS-d1T3znj2" - }, - "source": [ - "# Convolutional Neural Networks for Sentence Classification using Ignite" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rjZMYxFoznj9" - }, - "source": [ - "This is a tutorial on using Ignite to train neural network models, setup experiments and validate models.\n", - "\n", - "In this experiment, we'll be replicating [\n", - "Convolutional Neural Networks for Sentence Classification by Yoon Kim](https://arxiv.org/abs/1408.5882)! This paper uses CNN for text classification, a task typically reserved for RNNs, Logistic Regression, Naive Bayes.\n", - "\n", - "We want to be able to classify IMDB movie reviews and predict whether the review is positive or negative. IMDB Movie Review dataset comprises of 25000 positive and 25000 negative examples. The dataset comprises of text and label pairs. This is binary classification problem. We'll be using PyTorch to create the model, torchtext to import data and Ignite to train and monitor the models!\n", - "\n", - "Lets get started! " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sovYyC0Zznj-" - }, - "source": [ - "## Required Dependencies \n", - "\n", - "In this example we only need torchtext and spacy package, assuming that `torch` and `ignite` are already installed. 
We can install it using `pip`:\n", - "\n", - "`pip install torchtext spacy`\n", - "\n", - "`python -m spacy download en`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "7XHAD9x7znj_" - }, - "outputs": [], - "source": [ - "!pip install pytorch-ignite torchtext==0.9.1 spacy\n", - "!python -m spacy download en_core_web_sm" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lZty7-RWznkA" - }, - "source": [ - "## Import Libraries" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VKTazeAkznkB" - }, - "outputs": [], - "source": [ - "import random" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wxThg0YTznkD" - }, - "source": [ - "`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:\n", - "* **data**: A module to setup the data in the form Fields and Labels.\n", - "* **datasets**: A module to download NLP datasets.\n", - "* **GloVe**: A module to download and use pretrained GloVe embedings." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XrXE-f7jznkD" - }, - "outputs": [], - "source": [ - "from torchtext import data, legacy\n", - "from torchtext import datasets\n", - "from torchtext.vocab import GloVe" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ivAnTyEfznkE" - }, - "source": [ - "We import torch, nn and functional modules to create our models! " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "gbEFAWr0znkE" - }, - "outputs": [], - "source": [ - "import torch\n", - "import torch.nn as nn\n", - "import torch.nn.functional as F\n", - "\n", - "SEED = 1234\n", - "torch.manual_seed(SEED)\n", - "torch.cuda.manual_seed(SEED)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Q22BGKi8znkF" - }, - "source": [ - "`Ignite` is a High-level library to help with training neural networks in PyTorch. It comes with an `Engine` to setup a training loop, various metrics, handlers and a helpful contrib section! \n", - "\n", - "Below we import the following:\n", - "* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.\n", - "* **Events**: Allows users to attach functions to an `Engine` to fire functions at a specific event. Eg: `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.\n", - "* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, multilabel cases. \n", - "* **Loss**: General metric that takes a loss function as a parameter, calculate loss over a dataset.\n", - "* **RunningAverage**: General metric to attach to Engine during training. \n", - "* **ModelCheckpoint**: Handler to checkpoint models. \n", - "* **EarlyStopping**: Handler to stop training based on a score function. \n", - "* **ProgressBar**: Handler to create a tqdm progress bar." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "enczLgLTznkH" - }, - "outputs": [], - "source": [ - "from ignite.engine import Engine, Events\n", - "from ignite.metrics import Accuracy, Loss, RunningAverage\n", - "from ignite.handlers import ModelCheckpoint, EarlyStopping\n", - "from ignite.contrib.handlers import ProgressBar" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WZYyXYB5znkH" - }, - "source": [ - "## Processing Data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tCadZAV_znkH" - }, - "source": [ - "The code below first sets up `TEXT` and `LABEL` as general data objects. \n", - "\n", - "* `TEXT` converts any text to lowercase and produces tensors with the batch dimension first. \n", - "* `LABEL` is a data object that will convert any labels to floats.\n", - "\n", - "Next IMDB training and test datasets are downloaded, the training data is split into training and validation datasets. It takes TEXT and LABEL as inputs so that the data is processed as specified. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "-BttNTcGznkH" - }, - "outputs": [], - "source": [ - "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", - "\n", - "TEXT = legacy.data.Field(lower=True, batch_first=True)\n", - "LABEL = legacy.data.LabelField(dtype=torch.float)\n", - "\n", - "train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')\n", - "train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.seed(SEED))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aAt_J_eDznkI" - }, - "source": [ - "Now we have three sets of the data - train, validation, test. Let's explore what these data objects are and how to extract data from them.\n", - "* `train_data` is **torchtext.data.dataset.Dataset**, this is similar to **torch.utils.data.Dataset**.\n", - "* `train_data[0]` is **torchtext.data.example.Example**, a Dataset is comprised of many Examples.\n", - "* Let's explore the attributes of an example. We see a few methods, but most importantly we see `label` and `text`.\n", - "* `example.text` is the text of that example and `example.label` is the label of the example. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "GZZxlIWVznkI" - }, - "outputs": [], - "source": [ - "print('type(train_data) : ', type(train_data))\n", - "print('type(train_data[0]) : ',type(train_data[0]))\n", - "example = train_data[0]\n", - "print('Attributes of Example : ', [attr for attr in dir(example) if '_' not in attr])\n", - "print('example.label : ', example.label)\n", - "print('example.text[:10] : ', example.text[:10])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "yCrZzrYaznkK" - }, - "source": [ - "Now that we have an idea of what are split datasets look like, lets dig further into `TEXT` and `LABEL`. It is important that we build our vocabulary based on the train dataset as validation and test are **unseen** in our experimenting. \n", - "\n", - "For `TEXT`, let's download the pretrained **GloVE** 100 dimensional word vectors. This means each word is described by 100 floats! 
If you want to read more about this, here are a few resources.\n", - "* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)\n", - "* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)\n", - "* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)\n", - "\n", - "We use `TEXT.build_vocab` to do this, let's explore the `TEXT` object more. \n", - "\n", - "Let's explore what `TEXT` object is and how to extract data from them. We see `TEXT` has a few attributes, let's explore vocab, since we just used the build_vocab function. \n", - "\n", - "`TEXT.vocab` has the following attributes:\n", - "* `extend` is used to extend the vocabulary\n", - "* `freqs` is a dictionary of the frequency of each word\n", - "* `itos` is a list of all the words in the vocabulary.\n", - "* `stoi` is a dictionary mapping every word to an index.\n", - "* `vectors` is a torch.Tensor of the downloaded embeddings" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "IfGqPAfEznkK" - }, - "outputs": [], - "source": [ - "TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/'))\n", - "print ('Attributes of TEXT : ', [attr for attr in dir(TEXT) if '_' not in attr])\n", - "print ('Attributes of TEXT.vocab : ', [attr for attr in dir(TEXT.vocab) if '_' not in attr])\n", - "print ('First 5 values TEXT.vocab.itos : ', TEXT.vocab.itos[0:5]) \n", - "print ('First 5 key, value pairs of TEXT.vocab.stoi : ', {key:value for key,value in list(TEXT.vocab.stoi.items())[0:5]}) \n", - "print ('Shape of TEXT.vocab.vectors.shape : ', TEXT.vocab.vectors.shape)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "PFfeEinqznkL" - }, - "source": [ - "Let's do the same with `LABEL`. We see that there are vectors related to `LABEL`, this is expected because `LABEL` provides the definition of a label of data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "5sU3UOXDznkL" - }, - "outputs": [], - "source": [ - "LABEL.build_vocab(train_data)\n", - "print ('Attributes of LABEL : ', [attr for attr in dir(LABEL) if '_' not in attr])\n", - "print ('Attributes of LABEL.vocab : ', [attr for attr in dir(LABEL.vocab) if '_' not in attr])\n", - "print ('First 5 values LABEL.vocab.itos : ', LABEL.vocab.itos) \n", - "print ('First 5 key, value pairs of LABEL.vocab.stoi : ', {key:value for key,value in list(LABEL.vocab.stoi.items())})\n", - "print ('Shape of LABEL.vocab.vectors : ', LABEL.vocab.vectors)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ojXZisxLznkM" - }, - "source": [ - "Now we must convert our split datasets into iterators, we'll take advantage of **torchtext.data.BucketIterator**! BucketIterator pads every element of a batch to the length of the longest element of the batch." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tM4yWV9YznkM" - }, - "outputs": [], - "source": [ - "train_iterator, valid_iterator, test_iterator = legacy.data.BucketIterator.splits((train_data, valid_data, test_data), \n", - " batch_size=32,\n", - " device=device)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d2azJGL6znkM" - }, - "source": [ - "Let's actually explore what the output of the iterator is, this way we'll know what the input of the model is, how to compare the label to the output and how to setup are process_functions for Ignite's `Engine`.\n", - "* `batch.label[0]` is the label of a single example. We can see that `LABEL.vocab.stoi` was used to map the label that originally text into a float.\n", - "* `batch.text[0]` is the text of a single example. Similar to label, `TEXT.vocab.stoi` was used to convert each token of the example's text into indices.\n", - "\n", - "Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, this means that the bucket iterator is doing exactly as we hoped!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Ga2xDXfyznkN" - }, - "outputs": [], - "source": [ - "batch = next(iter(train_iterator))\n", - "print('batch.label[0] : ', batch.label[0])\n", - "print('batch.text[0] : ', batch.text[0][batch.text[0] != 1])\n", - "\n", - "lengths = []\n", - "for i, batch in enumerate(train_iterator):\n", - " x = batch.text\n", - " lengths.append(x.shape[1])\n", - " if i == 10:\n", - " break\n", - "\n", - "print ('Lengths of first 10 batches : ', lengths)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KsUrKRr3znkO" - }, - "source": [ - "## TextCNN Model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pldMpTv-znkO" - }, - "source": [ - "Here is the replication of the model, here are the operations of the model:\n", - "* **Embedding**: Embeds a batch of text of shape (N, L) to (N, L, D), where N is batch size, L is maximum length of the batch, D is the embedding dimension. \n", - "\n", - "* **Convolutions**: Runs parallel convolutions across the embedded words with kernel sizes of 3, 4, 5 to mimic trigrams, four-grams, five-grams. This results in outputs of (N, L - k + 1, D) per convolution, where k is the kernel_size. \n", - "\n", - "* **Activation**: ReLu activation is applied to each convolution operation.\n", - "\n", - "* **Pooling**: Runs parallel maxpooling operations on the activated convolutions with window sizes of L - k + 1, resulting in 1 value per channel i.e. a shape of (N, 1, D) per pooling. \n", - "\n", - "* **Concat**: The pooling outputs are concatenated and squeezed to result in a shape of (N, 3D). This is a single embedding for a sentence.\n", - "\n", - "* **Dropout**: Dropout is applied to the embedded sentence. \n", - "\n", - "* **Fully Connected**: The dropout output is passed through a fully connected layer of shape (3D, 1) to give a single output for each example in the batch. sigmoid is applied to the output of this layer.\n", - "\n", - "* **load_embeddings**: This is a method defined for TextCNN to load embeddings based on user input. There are 3 modes - rand which results in randomly initialized weights, static which results in frozen pretrained weights, nonstatic which results in trainable pretrained weights. \n", - "\n", - "\n", - "Let's note that this model works for variable text lengths! 
The idea to embed the words of a sentence, use convolutions, maxpooling and concantenation to embed the sentence as a single vector! This single vector is passed through a fully connected layer with sigmoid to output a single value. This value can be interpreted as the probability a sentence is positive (closer to 1) or negative (closer to 0).\n", - "\n", - "The minimum length of text expected by the model is the size of the smallest kernel size of the model." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "63z1tffDznkO" - }, - "outputs": [], - "source": [ - "class TextCNN(nn.Module):\n", - " def __init__(self, vocab_size, embedding_dim, kernel_sizes, num_filters, num_classes, d_prob, mode):\n", - " super(TextCNN, self).__init__()\n", - " self.vocab_size = vocab_size\n", - " self.embedding_dim = embedding_dim\n", - " self.kernel_sizes = kernel_sizes\n", - " self.num_filters = num_filters\n", - " self.num_classes = num_classes\n", - " self.d_prob = d_prob\n", - " self.mode = mode\n", - " self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=1)\n", - " self.load_embeddings()\n", - " self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,\n", - " out_channels=num_filters,\n", - " kernel_size=k, stride=1) for k in kernel_sizes])\n", - " self.dropout = nn.Dropout(d_prob)\n", - " self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)\n", - "\n", - " def forward(self, x):\n", - " batch_size, sequence_length = x.shape\n", - " x = self.embedding(x).transpose(1, 2)\n", - " x = [F.relu(conv(x)) for conv in self.conv]\n", - " x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]\n", - " x = torch.cat(x, dim=1)\n", - " x = self.fc(self.dropout(x))\n", - " return torch.sigmoid(x).squeeze()\n", - "\n", - " def load_embeddings(self):\n", - " if 'static' in self.mode:\n", - " self.embedding.weight.data.copy_(TEXT.vocab.vectors)\n", - " if 'non' not in self.mode:\n", - " self.embedding.weight.data.requires_grad = False\n", - " print('Loaded pretrained embeddings, weights are not trainable.')\n", - " else:\n", - " self.embedding.weight.data.requires_grad = True\n", - " print('Loaded pretrained embeddings, weights are trainable.')\n", - " elif self.mode == 'rand':\n", - " print('Randomly initialized embeddings are used.')\n", - " else:\n", - " raise ValueError('Unexpected value of mode. Please choose from static, nonstatic, rand.')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6G3TW4c-znkO" - }, - "source": [ - "## Creating Model, Optimizer and Loss" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D7nH55oXznkP" - }, - "source": [ - "Below we create an instance of the TextCNN model and load embeddings in **static** mode. The model is placed on a device and then a loss function of Binary Cross Entropy and Adam optimizer are setup. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "HM_7LQE3znkP" - }, - "outputs": [], - "source": [ - "vocab_size, embedding_dim = TEXT.vocab.vectors.shape\n", - "\n", - "model = TextCNN(vocab_size=vocab_size,\n", - " embedding_dim=embedding_dim,\n", - " kernel_sizes=[3, 4, 5],\n", - " num_filters=100,\n", - " num_classes=1, \n", - " d_prob=0.5,\n", - " mode='static')\n", - "model.to(device)\n", - "optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)\n", - "criterion = nn.BCELoss()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xjxbAwvIznkP" - }, - "source": [ - "## Training and Evaluating using Ignite" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L8Rl7spqznkQ" - }, - "source": [ - "### Trainer Engine - process_function\n", - "\n", - "Ignite's Engine allows user to define a process_function to process a given batch, this is applied to all the batches of the dataset. This is a general class that can be applied to train and validate models! A process_function has two parameters engine and batch. \n", - "\n", - "\n", - "Let's walk through what the function of the trainer does:\n", - "\n", - "* Sets model in train mode. \n", - "* Sets the gradients of the optimizer to zero.\n", - "* Generate x and y from batch.\n", - "* Performs a forward pass to calculate y_pred using model and x.\n", - "* Calculates loss using y_pred and y.\n", - "* Performs a backward pass using loss to calculate gradients for the model parameters.\n", - "* model parameters are optimized using gradients and optimizer.\n", - "* Returns scalar loss. \n", - "\n", - "Below is a single operation during the trainig process. This process_function will be attached to the training engine." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "Q4ncIcYcznkQ" - }, - "outputs": [], - "source": [ - "def process_function(engine, batch):\n", - " model.train()\n", - " optimizer.zero_grad()\n", - " x, y = batch.text, batch.label\n", - " y_pred = model(x)\n", - " loss = criterion(y_pred, y)\n", - " loss.backward()\n", - " optimizer.step()\n", - " return loss.item()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HiiQr_GYznkQ" - }, - "source": [ - "### Evaluator Engine - process_function\n", - "\n", - "Similar to the training process function, we setup a function to evaluate a single batch. Here is what the eval_function does:\n", - "\n", - "* Sets model in eval mode.\n", - "* Generates x and y from batch.\n", - "* With torch.no_grad(), no gradients are calculated for any succeding steps.\n", - "* Performs a forward pass on the model to calculate y_pred based on model and x.\n", - "* Returns y_pred and y.\n", - "\n", - "Ignite suggests attaching metrics to evaluators and not trainers because during the training the model parameters are constantly changing and it is best to evaluate model on a stationary model. This information is important as there is a difference in the functions for training and evaluating. Training returns a single scalar loss. Evaluating returns y_pred and y as that output is used to calculate metrics per batch for the entire dataset.\n", - "\n", - "All metrics in Ignite require y_pred and y as outputs of the function attached to the Engine. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "b9-G-9iVznkR" - }, - "outputs": [], - "source": [ - "def eval_function(engine, batch):\n", - " model.eval()\n", - " with torch.no_grad():\n", - " x, y = batch.text, batch.label\n", - " y_pred = model(x)\n", - " return y_pred, y" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dcmIEZuNznkS" - }, - "source": [ - "### Instantiating Training and Evaluating Engines\n", - "\n", - "Below we create 3 engines, a trainer, a training evaluator and a validation evaluator. You'll notice that train_evaluator and validation_evaluator use the same function, we'll see later why this was done! " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "k1CxFQs_znkS" - }, - "outputs": [], - "source": [ - "trainer = Engine(process_function)\n", - "train_evaluator = Engine(eval_function)\n", - "validation_evaluator = Engine(eval_function)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZVu91uVtznkS" - }, - "source": [ - "### Metrics - RunningAverage, Accuracy and Loss\n", - "\n", - "To start, we'll attach a metric of Running Average to track a running average of the scalar loss output for each batch. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "P-98lPU9znkS" - }, - "outputs": [], - "source": [ - "RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mufkp6mnznkS" - }, - "source": [ - "Now there are two metrics that we want to use for evaluation - accuracy and loss. This is a binary problem, so for Loss we can simply pass the Binary Cross Entropy function as the loss_function. \n", - "\n", - "For Accuracy, Ignite requires y_pred and y to be comprised of 0's and 1's only. Since our model outputs from a sigmoid layer, values are between 0 and 1. We'll need to write a function that transforms `engine.state.output` which is comprised of y_pred and y. \n", - "\n", - "Below `thresholded_output_transform` does just that, it rounds y_pred to convert y_pred to 0's and 1's, and then returns rounded y_pred and y. This function is the output_transform function used to transform the `engine.state.output` to achieve `Accuracy`'s desired purpose.\n", - "\n", - "Now, we attach `Loss` and `Accuracy` (with `thresholded_output_transform`) to train_evaluator and validation_evaluator. 
\n", - "\n", - "To attach a metric to engine, the following format is used:\n", - "* `Metric(output_transform=output_transform, ...).attach(engine, 'metric_name')`\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "KAK6nXEbznkS" - }, - "outputs": [], - "source": [ - "def thresholded_output_transform(output):\n", - " y_pred, y = output\n", - " y_pred = torch.round(y_pred)\n", - " return y_pred, y" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QkcC2R4qznkT" - }, - "outputs": [], - "source": [ - "Accuracy(output_transform=thresholded_output_transform).attach(train_evaluator, 'accuracy')\n", - "Loss(criterion).attach(train_evaluator, 'bce')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tLtT5f11znkT" - }, - "outputs": [], - "source": [ - "Accuracy(output_transform=thresholded_output_transform).attach(validation_evaluator, 'accuracy')\n", - "Loss(criterion).attach(validation_evaluator, 'bce')" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FbS2h_2eznkU" - }, - "source": [ - "### Progress Bar\n", - "\n", - "Next we create an instance of Ignite's progess bar and attach it to the trainer and pass it a key of `engine.state.metrics` to track. In this case, the progress bar will be tracking `engine.state.metrics['loss']`" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qteztuB3znkU" - }, - "outputs": [], - "source": [ - "pbar = ProgressBar(persist=True, bar_format=\"\")\n", - "pbar.attach(trainer, ['loss'])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x4DxUwXfznkU" - }, - "source": [ - "### EarlyStopping - Tracking Validation Loss\n", - "\n", - "Now we'll setup a Early Stopping handler for this training process. EarlyStopping requires a score_function that allows the user to define whatever criteria to stop trainig. In this case, if the loss of the validation set does not decrease in 5 epochs, the training process will stop early. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wPM6-USgznkU" - }, - "outputs": [], - "source": [ - "def score_function(engine):\n", - " val_loss = engine.state.metrics['bce']\n", - " return -val_loss\n", - "\n", - "handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)\n", - "validation_evaluator.add_event_handler(Events.COMPLETED, handler)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LfeL6EkhznkU" - }, - "source": [ - "### Attaching Custom Functions to Engine at specific Events\n", - "\n", - "Below you'll see ways to define your own custom functions and attaching them to various `Events` of the training process.\n", - "\n", - "The functions below both achieve similar tasks, they print the results of the evaluator run on a dataset. One function does that on the training evaluator and dataset, while the other on the validation. Another difference is how these functions are attached in the trainer engine.\n", - "\n", - "The first method involves using a decorator, the syntax is simple - `@` `trainer.on(Events.EPOCH_COMPLETED)`, means that the decorated function will be attached to the trainer and called at the end of each epoch. \n", - "\n", - "The second method involves using the add_event_handler method of trainer - `trainer.add_event_handler(Events.EPOCH_COMPLETED, custom_function)`. This achieves the same result as the above. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "3XsmcAA2znkV" - }, - "outputs": [], - "source": [ - "@trainer.on(Events.EPOCH_COMPLETED)\n", - "def log_training_results(engine):\n", - " train_evaluator.run(train_iterator)\n", - " metrics = train_evaluator.state.metrics\n", - " avg_accuracy = metrics['accuracy']\n", - " avg_bce = metrics['bce']\n", - " pbar.log_message(\n", - " \"Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n", - " .format(engine.state.epoch, avg_accuracy, avg_bce))\n", - " \n", - "def log_validation_results(engine):\n", - " validation_evaluator.run(valid_iterator)\n", - " metrics = validation_evaluator.state.metrics\n", - " avg_accuracy = metrics['accuracy']\n", - " avg_bce = metrics['bce']\n", - " pbar.log_message(\n", - " \"Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n", - " .format(engine.state.epoch, avg_accuracy, avg_bce))\n", - " pbar.n = pbar.last_print_n = 0\n", - "\n", - "trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IAQEr88cznkW" - }, - "source": [ - "### ModelCheckpoint\n", - "\n", - "Lastly, we want to checkpoint this model. It's important to do so, as training processes can be time consuming and if for some reason something goes wrong during training, a model checkpoint can be helpful to restart training from the point of failure.\n", - "\n", - "Below we'll use Ignite's `ModelCheckpoint` handler to checkpoint models at the end of each epoch. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9Gl6WT0YznkW" - }, - "outputs": [], - "source": [ - "checkpointer = ModelCheckpoint('/tmp/models', 'textcnn', n_saved=2, create_dir=True, save_as_state_dict=True)\n", - "trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'textcnn': model})" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LxCIriIEznkW" - }, - "source": [ - "### Run Engine\n", - "\n", - "Next, we'll run the trainer for 20 epochs and monitor results. Below we can see that progess bar prints the loss per iteration, and prints the results of training and validation as we specified in our custom function. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "sPe46cQOznkX" - }, - "outputs": [], - "source": [ - "trainer.run(train_iterator, max_epochs=20)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OpqXiZUsznkY" - }, - "source": [ - "That's it! We have successfully trained and evaluated a Convolutational Neural Network for Text Classification. 
" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "name": "TextCNN.ipynb", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.8" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Copy of TextCNN PR.ipynb", + "private_outputs": true, + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.8" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "RfRKTxQO51bK" + }, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HCS-d1T3znj2" + }, + "source": [ + "# Convolutional Neural Networks for Sentence Classification using Ignite" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rjZMYxFoznj9" + }, + "source": [ + "This is a tutorial on using Ignite to train neural network models, setup experiments and validate models.\n", + "\n", + "In this experiment, we'll be replicating [\n", + "Convolutional Neural Networks for Sentence Classification by Yoon Kim](https://arxiv.org/abs/1408.5882)! This paper uses CNN for text classification, a task typically reserved for RNNs, Logistic Regression, Naive Bayes.\n", + "\n", + "We want to be able to classify IMDB movie reviews and predict whether the review is positive or negative. IMDB Movie Review dataset comprises of 25000 positive and 25000 negative examples. The dataset comprises of text and label pairs. This is binary classification problem. We'll be using PyTorch to create the model, torchtext to import data and Ignite to train and monitor the models!\n", + "\n", + "Lets get started! " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sovYyC0Zznj-" + }, + "source": [ + "## Required Dependencies \n", + "\n", + "In this example we only need torchtext and spacy package, assuming that `torch` and `ignite` are already installed. We can install it using `pip`:\n", + "\n", + "`pip install torchtext==0.9.1 spacy`\n", + "\n", + "`python -m spacy download en_core_web_sm`" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "7XHAD9x7znj_" + }, + "source": [ + "!pip install pytorch-ignite torchtext==0.9.1 # spacy\n", + "!python -m spacy download en_core_web_sm" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lZty7-RWznkA" + }, + "source": [ + "## Import Libraries" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VKTazeAkznkB" + }, + "source": [ + "import random" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wxThg0YTznkD" + }, + "source": [ + "`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. 
Below we import the following:\n", + "* **data**: A module to setup the data in the form Fields and Labels.\n", + "* **datasets**: A module to download NLP datasets.\n", + "* **GloVe**: A module to download and use pretrained GloVe embedings." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "XrXE-f7jznkD" + }, + "source": [ + "from torchtext import data\n", + "from torchtext import datasets\n", + "from torchtext.vocab import GloVe" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ivAnTyEfznkE" + }, + "source": [ + "We import torch, nn and functional modules to create our models! " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gbEFAWr0znkE" + }, + "source": [ + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "\n", + "SEED = 1234\n", + "random.seed(SEED)\n", + "torch.manual_seed(SEED)\n", + "torch.cuda.manual_seed(SEED)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q22BGKi8znkF" + }, + "source": [ + "`Ignite` is a High-level library to help with training neural networks in PyTorch. It comes with an `Engine` to setup a training loop, various metrics, handlers and a helpful contrib section! \n", + "\n", + "Below we import the following:\n", + "* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.\n", + "* **Events**: Allows users to attach functions to an `Engine` to fire functions at a specific event. Eg: `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.\n", + "* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, multilabel cases. \n", + "* **Loss**: General metric that takes a loss function as a parameter, calculate loss over a dataset.\n", + "* **RunningAverage**: General metric to attach to Engine during training. \n", + "* **ModelCheckpoint**: Handler to checkpoint models. \n", + "* **EarlyStopping**: Handler to stop training based on a score function. \n", + "* **ProgressBar**: Handler to create a tqdm progress bar." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "enczLgLTznkH" + }, + "source": [ + "from ignite.engine import Engine, Events\n", + "from ignite.metrics import Accuracy, Loss, RunningAverage\n", + "from ignite.handlers import ModelCheckpoint, EarlyStopping\n", + "from ignite.contrib.handlers import ProgressBar" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "-39hgxiUMCq9" + }, + "source": [ + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WZYyXYB5znkH" + }, + "source": [ + "## Processing Data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "irv_ebeDb8yV" + }, + "source": [ + "We first set up a tokenizer using `torchtext.data.utils`.\n", + "The job of a tokenizer to split a sentence into \"tokens\". You can read more about it at [wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).\n", + "We will use the tokenizer for the \"spacy\" library which is a popular choice. 
Feel free to switch to \"basic_english\" if you want to use the default one, or to any other tokenizer you prefer.\n",
 "\n",
 "docs: https://pytorch.org/text/stable/data_utils.html"
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "YNRd5Z_KMANB"
 },
 "source": [
 "from torchtext.data.utils import get_tokenizer\n",
 "tokenizer = get_tokenizer(\"spacy\")"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "ZknfGdqedSjN"
 },
 "source": [
 "tokenizer(\"Ignite is a high-level library for training and evaluating neural networks.\")"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "ZvAmyqHygcZg"
 },
 "source": [
 "Next, the IMDB training and test datasets are downloaded. The `torchtext.datasets` API returns the train/test dataset splits directly, without any preprocessing applied. Each split is an iterator which yields the raw texts and labels line-by-line."
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "E_jNgWXHhMBQ"
 },
 "source": [
 "from torchtext.datasets import IMDB\n",
 "train_iter, test_iter = IMDB(split=('train','test'))"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "xNKvG9b7jadd"
 },
 "source": [
 "Now we set up the train, validation and test splits. "
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "VzJG7Uh_L9q-"
 },
 "source": [
 "# We are using only 1000 samples for faster training\n",
 "# set to -1 to use the full data\n",
 "N = 1000 \n",
 "\n",
 "# We will use 80% of the `train split` for training and the rest for validation\n",
 "train_frac = 0.8\n",
 "_temp = list(train_iter)\n",
 "\n",
 "\n",
 "random.shuffle(_temp)\n",
 "_temp = _temp[:(N if N > 0 else len(_temp) )]\n",
 "n_train = int(len(_temp)*train_frac)\n",
 "\n",
 "train_list = _temp[:n_train]\n",
 "validation_list = _temp[n_train:]\n",
 "test_list = list(test_iter)\n",
 "test_list = test_list[:(N if N > 0 else len(test_list))]"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "X-qYvdeplMIs"
 },
 "source": [
 "Let's explore a data sample to see what it looks like.\n",
 "Each data sample is a tuple of the format `(label, text)`.\n",
 "\n",
 "The value of the label is either 'pos' or 'neg'.\n"
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "qrlLB7PxkIW_"
 },
 "source": [
 "random_sample = random.sample(train_list,1)[0]\n",
 "print('label:', random_sample[0])\n",
 "print(' text:', random_sample[1])"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "mN5cHrazmMDG"
 },
 "source": [
 "Now that we have the dataset splits, let's build our vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important that we build our vocabulary based on the train dataset only, as the validation and test sets are **unseen** in our experimenting. \n",
 "\n",
 "`Vocab` allows us to use pretrained **GloVe** 100 dimensional word vectors. This means each word is described by 100 floats! 
If you want to read more about this, here are a few resources.\n",
 "* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)\n",
 "* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)\n",
 "* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)\n",
 "\n",
 "Note that the GloVe download size is around 900MB, so it might take some time to download. \n",
 "\n",
 "An instance of the `Vocab` class has the following attributes:\n",
 "* `extend` is used to extend the vocabulary\n",
 "* `freqs` is a dictionary of the frequency of each word\n",
 "* `itos` is a list of all the words in the vocabulary.\n",
 "* `stoi` is a dictionary mapping every word to an index.\n",
 "* `vectors` is a torch.Tensor of the downloaded embeddings\n"
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "T_ukillQMKsh"
 },
 "source": [
 "from collections import Counter\n",
 "from torchtext.vocab import Vocab\n",
 "\n",
 "counter = Counter()\n",
 "\n",
 "for (label, line) in train_list:\n",
 " counter.update(tokenizer(line))\n",
 "\n",
 "vocab = Vocab(\n",
 " counter,\n",
 " min_freq=10,\n",
 " vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')\n",
 ")"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "VYYGwfYsM2Pr"
 },
 "source": [
 "print(\"The length of the new vocab is\", len(vocab))\n",
 "new_stoi = vocab.stoi\n",
 "print(\"The index of '<BOS>' is\", new_stoi['<BOS>'])\n",
 "new_itos = vocab.itos\n",
 "print(\"The token at index 2 is\", new_itos[2])"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "4Y72cqB6Qhqt"
 },
 "source": [
 "We now create `text_transform` and `label_transform`, which are callable objects, such as a `lambda` func here, to process the raw text and label data from the dataset iterators (or iterables like a `list`). You can add special symbols such as `<BOS>` and `<EOS>` to the sentence in `text_transform`."
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "z_9hw21lP1nG"
 },
 "source": [
 "text_transform = lambda x: [vocab[token] for token in tokenizer(x)]\n",
 "label_transform = lambda x: 1 if x == 'pos' else 0\n",
 "\n",
 "# Print out the output of text_transform\n",
 "print(\"input to the text_transform:\", \"here is an example\")\n",
 "print(\"output of the text_transform:\", text_transform(\"here is an example\"))"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "xtZSEqjJQPxM"
 },
 "source": [
 "For generating the data batches we will use `torch.utils.data.DataLoader`. You can customize the batching by passing a function to the `collate_fn` argument of the DataLoader. Here, in the `collate_batch` func, we process the raw text data and add padding to dynamically match the longest sentence in a batch."
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "NHHIEfpRP4TV"
 },
 "source": [
 "from torch.utils.data import DataLoader\n",
 "from torch.nn.utils.rnn import pad_sequence\n",
 "\n",
 "def collate_batch(batch):\n",
 " label_list, text_list = [], []\n",
 " for (_label, _text) in batch:\n",
 " label_list.append(label_transform(_label))\n",
 " processed_text = torch.tensor(text_transform(_text))\n",
 " text_list.append(processed_text)\n",
 " return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)\n"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "3IQd3EVbQvTo"
 },
 "source": [
 "batch_size = 8 # A batch size of 8\n",
 "\n",
 "def create_iterators(batch_size=8):\n",
 " \"\"\"Helper function to create the iterators\"\"\"\n",
 " dataloaders = []\n",
 " for split in [train_list, validation_list, test_list]:\n",
 " bucket_dataloader = DataLoader(\n",
 " split, batch_size=batch_size,\n",
 " collate_fn=collate_batch\n",
 " )\n",
 " dataloaders.append(bucket_dataloader)\n",
 " return dataloaders\n"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "CudYIZitUNgx"
 },
 "source": [
 "train_iterator, valid_iterator, test_iterator = create_iterators()"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "787zNPm6RtKE"
 },
 "source": [
 "next(iter(train_iterator))"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "d2azJGL6znkM"
 },
 "source": [
 "Let's actually explore what the output of the iterator is; this way we'll know what the input of the model is, how to compare the label to the output, and how to set up our process_functions for Ignite's `Engine`.\n",
 "* `batch[0][0]` is the label of a single example. We can see that `label_transform` was used to map the label, which was originally text, to an integer.\n",
 "* `batch[1][:, 0]` is the text of a single example (the padded text tensor has shape `(L, N)`, with the sequence dimension first). Similarly, `text_transform` (which uses `vocab`) was used to convert each token of the example's text into indices.\n",
 "\n",
 "Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, which means that the iterator is working as expected."
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "Ga2xDXfyznkN"
 },
 "source": [
 "batch = next(iter(train_iterator))\n",
 "print('batch[0][0] : ', batch[0][0])\n",
 "print('batch[1][:, 0] : ', batch[1][:, 0])\n",
 "\n",
 "lengths = []\n",
 "for i, batch in enumerate(train_iterator):\n",
 " x = batch[1]\n",
 " lengths.append(x.shape[0])\n",
 " if i == 10:\n",
 " break\n",
 "\n",
 "print ('Lengths of first 10 batches : ', lengths)"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "KsUrKRr3znkO"
 },
 "source": [
 "## TextCNN Model"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "pldMpTv-znkO"
 },
 "source": [
 "Here is the replication of the model; these are the operations of the model:\n",
 "* **Embedding**: Embeds a batch of text of shape (N, L) to (N, L, D), where N is batch size, L is maximum length of the batch, D is the embedding dimension. \n",
 "\n",
 "* **Convolutions**: Runs parallel convolutions across the embedded words with kernel sizes of 3, 4, 5 to mimic trigrams, four-grams, five-grams. 
This results in an output of shape (N, D, L - k + 1) per convolution, where k is the kernel_size. \n",
 "\n",
 "* **Activation**: ReLU activation is applied to each convolution operation.\n",
 "\n",
 "* **Pooling**: Runs parallel maxpooling operations on the activated convolutions with window sizes of L - k + 1, resulting in 1 value per channel, i.e. a shape of (N, D, 1) per pooling. \n",
 "\n",
 "* **Concat**: The pooling outputs are concatenated and squeezed to result in a shape of (N, 3D). This is a single embedding for a sentence.\n",
 "\n",
 "* **Dropout**: Dropout is applied to the embedded sentence. \n",
 "\n",
 "* **Fully Connected**: The dropout output is passed through a fully connected layer of shape (3D, 1) to give a single output for each example in the batch. Sigmoid is applied to the output of this layer.\n",
 "\n",
 "* **load_embeddings**: This is a method defined for TextCNN to load embeddings based on user input. There are 3 modes - rand, which results in randomly initialized weights; static, which results in frozen pretrained weights; and nonstatic, which results in trainable pretrained weights. \n",
 "\n",
 "\n",
 "Note that this model works for variable text lengths! The idea is to embed the words of a sentence, then use convolutions, maxpooling and concatenation to embed the sentence as a single vector! This single vector is passed through a fully connected layer with sigmoid to output a single value. This value can be interpreted as the probability that a sentence is positive (closer to 1) or negative (closer to 0).\n",
 "\n",
 "The minimum length of text expected by the model is the size of the smallest kernel size of the model."
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "63z1tffDznkO"
 },
 "source": [
 "class TextCNN(nn.Module):\n",
 " def __init__(\n",
 " self,\n",
 " vocab_size,\n",
 " embedding_dim, \n",
 " kernel_sizes, \n",
 " num_filters, \n",
 " num_classes, d_prob, mode):\n",
 " super(TextCNN, self).__init__()\n",
 " self.vocab_size = vocab_size\n",
 " self.embedding_dim = embedding_dim\n",
 " self.kernel_sizes = kernel_sizes\n",
 " self.num_filters = num_filters\n",
 " self.num_classes = num_classes\n",
 " self.d_prob = d_prob\n",
 " self.mode = mode\n",
 " self.embedding = nn.Embedding(\n",
 " vocab_size, embedding_dim, padding_idx=0)\n",
 " self.load_embeddings()\n",
 " self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,\n",
 " out_channels=num_filters,\n",
 " kernel_size=k, stride=1) for k in kernel_sizes])\n",
 " self.dropout = nn.Dropout(d_prob)\n",
 " self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)\n",
 "\n",
 " def forward(self, x):\n",
 " sequence_length, batch_size = x.shape\n",
 " x = self.embedding(x.T).transpose(1, 2)\n",
 " x = [F.relu(conv(x)) for conv in self.conv]\n",
 " x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]\n",
 " x = torch.cat(x, dim=1)\n",
 " x = self.fc(self.dropout(x))\n",
 " return torch.sigmoid(x).squeeze()\n",
 "\n",
 " def load_embeddings(self):\n",
 " if 'static' in self.mode:\n",
 " self.embedding.weight.data.copy_(vocab.vectors)\n",
 " if 'non' not in self.mode:\n",
 " self.embedding.weight.data.requires_grad = False\n",
 " print('Loaded pretrained embeddings, weights are not trainable.')\n",
 " else:\n",
 " self.embedding.weight.data.requires_grad = True\n",
 " print('Loaded pretrained embeddings, weights are trainable.')\n",
 " elif self.mode == 'rand':\n",
 " print('Randomly initialized embeddings are used.')\n",
" else:\n", + " raise ValueError('Unexpected value of mode. Please choose from static, nonstatic, rand.')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6G3TW4c-znkO" + }, + "source": [ + "## Creating Model, Optimizer and Loss" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D7nH55oXznkP" + }, + "source": [ + "Below we create an instance of the TextCNN model and load embeddings in **static** mode. The model is placed on a device and then a loss function of Binary Cross Entropy and Adam optimizer are setup. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "HM_7LQE3znkP" + }, + "source": [ + "vocab_size, embedding_dim = vocab.vectors.shape\n", + "\n", + "model = TextCNN(vocab_size=vocab_size,\n", + " embedding_dim=embedding_dim,\n", + " kernel_sizes=[3, 4, 5],\n", + " num_filters=100,\n", + " num_classes=1, \n", + " d_prob=0.5,\n", + " mode='static')\n", + "model.to(device)\n", + "optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)\n", + "criterion = nn.BCELoss()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xjxbAwvIznkP" + }, + "source": [ + "## Training and Evaluating using Ignite" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L8Rl7spqznkQ" + }, + "source": [ + "### Trainer Engine - process_function\n", + "\n", + "Ignite's Engine allows user to define a process_function to process a given batch, this is applied to all the batches of the dataset. This is a general class that can be applied to train and validate models! A process_function has two parameters engine and batch. \n", + "\n", + "\n", + "Let's walk through what the function of the trainer does:\n", + "\n", + "* Sets model in train mode. \n", + "* Sets the gradients of the optimizer to zero.\n", + "* Generate x and y from batch.\n", + "* Performs a forward pass to calculate y_pred using model and x.\n", + "* Calculates loss using y_pred and y.\n", + "* Performs a backward pass using loss to calculate gradients for the model parameters.\n", + "* model parameters are optimized using gradients and optimizer.\n", + "* Returns scalar loss. \n", + "\n", + "Below is a single operation during the trainig process. This process_function will be attached to the training engine." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "Q4ncIcYcznkQ" + }, + "source": [ + "def process_function(engine, batch):\n", + " model.train()\n", + " optimizer.zero_grad()\n", + " y, x = batch\n", + " x = x.to(device)\n", + " y = y.to(device)\n", + " y_pred = model(x)\n", + " loss = criterion(y_pred, y.float())\n", + " loss.backward()\n", + " optimizer.step()\n", + " return loss.item()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HiiQr_GYznkQ" + }, + "source": [ + "### Evaluator Engine - process_function\n", + "\n", + "Similar to the training process function, we setup a function to evaluate a single batch. 
Here is what the eval_function does:\n",
 "\n",
 "* Sets model in eval mode.\n",
 "* Generates x and y from batch.\n",
 "* With torch.no_grad(), no gradients are calculated for any succeeding steps.\n",
 "* Performs a forward pass on the model to calculate y_pred based on model and x.\n",
 "* Returns y_pred and y.\n",
 "\n",
 "Ignite suggests attaching metrics to evaluators and not trainers because during training the model parameters are constantly changing, and it is best to evaluate the model in a stationary state. This information is important as there is a difference in the functions for training and evaluating. Training returns a single scalar loss. Evaluating returns y_pred and y, as that output is used to calculate metrics per batch for the entire dataset.\n",
 "\n",
 "All metrics in Ignite require y_pred and y as outputs of the function attached to the Engine. "
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "b9-G-9iVznkR"
 },
 "source": [
 "def eval_function(engine, batch):\n",
 " model.eval()\n",
 " with torch.no_grad():\n",
 " y, x = batch\n",
 " y = y.to(device)\n",
 " x = x.to(device)\n",
 " y = y.float()\n",
 " y_pred = model(x)\n",
 " return y_pred, y"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "dcmIEZuNznkS"
 },
 "source": [
 "### Instantiating Training and Evaluating Engines\n",
 "\n",
 "Below we create 3 engines: a trainer, a training evaluator and a validation evaluator. You'll notice that train_evaluator and validation_evaluator use the same function; we'll see later why this was done! "
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "k1CxFQs_znkS"
 },
 "source": [
 "trainer = Engine(process_function)\n",
 "train_evaluator = Engine(eval_function)\n",
 "validation_evaluator = Engine(eval_function)"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "ZVu91uVtznkS"
 },
 "source": [
 "### Metrics - RunningAverage, Accuracy and Loss\n",
 "\n",
 "To start, we'll attach a RunningAverage metric to track a running average of the scalar loss output for each batch. "
 ]
 },
 {
 "cell_type": "code",
 "metadata": {
 "id": "P-98lPU9znkS"
 },
 "source": [
 "RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')"
 ],
 "execution_count": null,
 "outputs": []
 },
 {
 "cell_type": "markdown",
 "metadata": {
 "id": "Mufkp6mnznkS"
 },
 "source": [
 "Now there are two metrics that we want to use for evaluation - accuracy and loss. This is a binary problem, so for Loss we can simply pass the Binary Cross Entropy function as the loss_function. \n",
 "\n",
 "For Accuracy, Ignite requires y_pred and y to be comprised of 0's and 1's only. Since our model outputs from a sigmoid layer, values are between 0 and 1. We'll need to write a function that transforms `engine.state.output`, which is comprised of y_pred and y. \n",
 "\n",
 "Below, `thresholded_output_transform` does just that: it rounds y_pred to convert it to 0's and 1's, and then returns the rounded y_pred and y. This function is the output_transform function used to transform the `engine.state.output` to achieve `Accuracy`'s desired purpose.\n",
 "\n",
 "Now, we attach `Loss` and `Accuracy` (with `thresholded_output_transform`) to train_evaluator and validation_evaluator. 
\n", + "\n", + "To attach a metric to engine, the following format is used:\n", + "* `Metric(output_transform=output_transform, ...).attach(engine, 'metric_name')`\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "KAK6nXEbznkS" + }, + "source": [ + "def thresholded_output_transform(output):\n", + " y_pred, y = output\n", + " y_pred = torch.round(y_pred)\n", + " return y_pred, y" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "QkcC2R4qznkT" + }, + "source": [ + "Accuracy(output_transform=thresholded_output_transform).attach(train_evaluator, 'accuracy')\n", + "Loss(criterion).attach(train_evaluator, 'bce')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "tLtT5f11znkT" + }, + "source": [ + "Accuracy(output_transform=thresholded_output_transform).attach(validation_evaluator, 'accuracy')\n", + "Loss(criterion).attach(validation_evaluator, 'bce')" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FbS2h_2eznkU" + }, + "source": [ + "### Progress Bar\n", + "\n", + "Next we create an instance of Ignite's progess bar and attach it to the trainer and pass it a key of `engine.state.metrics` to track. In this case, the progress bar will be tracking `engine.state.metrics['loss']`" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "qteztuB3znkU" + }, + "source": [ + "pbar = ProgressBar(persist=True, bar_format=\"\")\n", + "pbar.attach(trainer, ['loss'])" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x4DxUwXfznkU" + }, + "source": [ + "### EarlyStopping - Tracking Validation Loss\n", + "\n", + "Now we'll setup a Early Stopping handler for this training process. EarlyStopping requires a score_function that allows the user to define whatever criteria to stop trainig. In this case, if the loss of the validation set does not decrease in 5 epochs, the training process will stop early. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "wPM6-USgznkU" + }, + "source": [ + "def score_function(engine):\n", + " val_loss = engine.state.metrics['bce']\n", + " return -val_loss\n", + "\n", + "handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)\n", + "validation_evaluator.add_event_handler(Events.COMPLETED, handler)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LfeL6EkhznkU" + }, + "source": [ + "### Attaching Custom Functions to Engine at specific Events\n", + "\n", + "Below you'll see ways to define your own custom functions and attaching them to various `Events` of the training process.\n", + "\n", + "The functions below both achieve similar tasks, they print the results of the evaluator run on a dataset. One function does that on the training evaluator and dataset, while the other on the validation. Another difference is how these functions are attached in the trainer engine.\n", + "\n", + "The first method involves using a decorator, the syntax is simple - `@` `trainer.on(Events.EPOCH_COMPLETED)`, means that the decorated function will be attached to the trainer and called at the end of each epoch. \n", + "\n", + "The second method involves using the add_event_handler method of trainer - `trainer.add_event_handler(Events.EPOCH_COMPLETED, custom_function)`. This achieves the same result as the above. 
" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "3XsmcAA2znkV" + }, + "source": [ + "@trainer.on(Events.EPOCH_COMPLETED)\n", + "def log_training_results(engine):\n", + " train_evaluator.run(train_iterator)\n", + " metrics = train_evaluator.state.metrics\n", + " avg_accuracy = metrics['accuracy']\n", + " avg_bce = metrics['bce']\n", + " pbar.log_message(\n", + " \"Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n", + " .format(engine.state.epoch, avg_accuracy, avg_bce))\n", + " \n", + "def log_validation_results(engine):\n", + " validation_evaluator.run(valid_iterator)\n", + " metrics = validation_evaluator.state.metrics\n", + " avg_accuracy = metrics['accuracy']\n", + " avg_bce = metrics['bce']\n", + " pbar.log_message(\n", + " \"Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n", + " .format(engine.state.epoch, avg_accuracy, avg_bce))\n", + " pbar.n = pbar.last_print_n = 0\n", + "\n", + "trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IAQEr88cznkW" + }, + "source": [ + "### ModelCheckpoint\n", + "\n", + "Lastly, we want to checkpoint this model. It's important to do so, as training processes can be time consuming and if for some reason something goes wrong during training, a model checkpoint can be helpful to restart training from the point of failure.\n", + "\n", + "Below we'll use Ignite's `ModelCheckpoint` handler to checkpoint models at the end of each epoch. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "9Gl6WT0YznkW" + }, + "source": [ + "checkpointer = ModelCheckpoint('/tmp/models', 'textcnn', n_saved=2, create_dir=True, save_as_state_dict=True)\n", + "trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'textcnn': model})" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LxCIriIEznkW" + }, + "source": [ + "### Run Engine\n", + "\n", + "Next, we'll run the trainer for 20 epochs and monitor results. Below we can see that progess bar prints the loss per iteration, and prints the results of training and validation as we specified in our custom function. " + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "sPe46cQOznkX" + }, + "source": [ + "trainer.run(train_iterator, max_epochs=20)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OpqXiZUsznkY" + }, + "source": [ + "That's it! We have successfully trained and evaluated a Convolutational Neural Network for Text Classification. 
" + ] + } + ] +} \ No newline at end of file From bd0c7f0ee337a7c4ba36f6135aff048ec507d625 Mon Sep 17 00:00:00 2001 From: cozek Date: Sat, 10 Apr 2021 16:15:02 +0530 Subject: [PATCH 3/6] undo gitignore changes --- .gitignore | 6 ------ 1 file changed, 6 deletions(-) diff --git a/.gitignore b/.gitignore index 83edf70bc5a8..f7a11548b404 100644 --- a/.gitignore +++ b/.gitignore @@ -39,9 +39,3 @@ coverage.xml /docs/src/ /docs/build/ /docs/source/generated/ - -# Jupyter -examples/notebooks/.ipynb_checkpoints/ - -# System files -.DS_Store From c36ea11d5cefe8df18b8141e5db5930b5f94c6c9 Mon Sep 17 00:00:00 2001 From: cozek Date: Sat, 10 Apr 2021 16:32:58 +0530 Subject: [PATCH 4/6] fix comment on install spacy --- examples/notebooks/TextCNN.ipynb | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/examples/notebooks/TextCNN.ipynb b/examples/notebooks/TextCNN.ipynb index cf75d23b89c8..4e3c111a50a0 100644 --- a/examples/notebooks/TextCNN.ipynb +++ b/examples/notebooks/TextCNN.ipynb @@ -9,9 +9,8 @@ "collapsed_sections": [] }, "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" + "name": "python388jvsc74a57bd0bb24fb798fa891713af3d36fbae541dd86145d8cb277c7e680316fd96a4b69ba", + "display_name": "Python 3.8.8 64-bit ('ingite': conda)" }, "language_info": { "codemirror_mode": { @@ -82,7 +81,7 @@ "id": "7XHAD9x7znj_" }, "source": [ - "!pip install pytorch-ignite torchtext==0.9.1 # spacy\n", + "!pip install pytorch-ignite torchtext==0.9.1 spacy\n", "!python -m spacy download en_core_web_sm" ], "execution_count": null, From f898d2984f602c9dafc31284280203f82a30c4f8 Mon Sep 17 00:00:00 2001 From: cozek Date: Mon, 19 Apr 2021 16:42:33 +0530 Subject: [PATCH 5/6] reviewer updates --- examples/notebooks/TextCNN.ipynb | 2022 +++++++++++++++--------------- 1 file changed, 1007 insertions(+), 1015 deletions(-) diff --git a/examples/notebooks/TextCNN.ipynb b/examples/notebooks/TextCNN.ipynb index 4e3c111a50a0..ab1119cc8c82 100644 --- a/examples/notebooks/TextCNN.ipynb +++ b/examples/notebooks/TextCNN.ipynb @@ -1,1016 +1,1008 @@ { - "nbformat": 4, - "nbformat_minor": 0, - "metadata": { - "colab": { - "name": "Copy of TextCNN PR.ipynb", - "private_outputs": true, - "provenance": [], - "collapsed_sections": [] - }, - "kernelspec": { - "name": "python388jvsc74a57bd0bb24fb798fa891713af3d36fbae541dd86145d8cb277c7e680316fd96a4b69ba", - "display_name": "Python 3.8.8 64-bit ('ingite': conda)" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.8.8" - } - }, - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "RfRKTxQO51bK" - }, - "source": [ - "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HCS-d1T3znj2" - }, - "source": [ - "# Convolutional Neural Networks for Sentence Classification using Ignite" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rjZMYxFoznj9" - }, - "source": [ - "This is a tutorial on using Ignite to train neural network models, setup experiments and validate models.\n", - "\n", - "In this experiment, we'll be replicating [\n", - "Convolutional Neural Networks for Sentence Classification by Yoon Kim](https://arxiv.org/abs/1408.5882)! 
This paper uses CNN for text classification, a task typically reserved for RNNs, Logistic Regression, Naive Bayes.\n", - "\n", - "We want to be able to classify IMDB movie reviews and predict whether the review is positive or negative. IMDB Movie Review dataset comprises of 25000 positive and 25000 negative examples. The dataset comprises of text and label pairs. This is binary classification problem. We'll be using PyTorch to create the model, torchtext to import data and Ignite to train and monitor the models!\n", - "\n", - "Lets get started! " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sovYyC0Zznj-" - }, - "source": [ - "## Required Dependencies \n", - "\n", - "In this example we only need torchtext and spacy package, assuming that `torch` and `ignite` are already installed. We can install it using `pip`:\n", - "\n", - "`pip install torchtext==0.9.1 spacy`\n", - "\n", - "`python -m spacy download en_core_web_sm`" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "7XHAD9x7znj_" - }, - "source": [ - "!pip install pytorch-ignite torchtext==0.9.1 spacy\n", - "!python -m spacy download en_core_web_sm" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lZty7-RWznkA" - }, - "source": [ - "## Import Libraries" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "VKTazeAkznkB" - }, - "source": [ - "import random" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wxThg0YTznkD" - }, - "source": [ - "`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:\n", - "* **data**: A module to setup the data in the form Fields and Labels.\n", - "* **datasets**: A module to download NLP datasets.\n", - "* **GloVe**: A module to download and use pretrained GloVe embedings." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "XrXE-f7jznkD" - }, - "source": [ - "from torchtext import data\n", - "from torchtext import datasets\n", - "from torchtext.vocab import GloVe" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ivAnTyEfznkE" - }, - "source": [ - "We import torch, nn and functional modules to create our models! " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "gbEFAWr0znkE" - }, - "source": [ - "import torch\n", - "import torch.nn as nn\n", - "import torch.nn.functional as F\n", - "\n", - "SEED = 1234\n", - "random.seed(SEED)\n", - "torch.manual_seed(SEED)\n", - "torch.cuda.manual_seed(SEED)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Q22BGKi8znkF" - }, - "source": [ - "`Ignite` is a High-level library to help with training neural networks in PyTorch. It comes with an `Engine` to setup a training loop, various metrics, handlers and a helpful contrib section! \n", - "\n", - "Below we import the following:\n", - "* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.\n", - "* **Events**: Allows users to attach functions to an `Engine` to fire functions at a specific event. Eg: `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.\n", - "* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, multilabel cases. 
\n", - "* **Loss**: General metric that takes a loss function as a parameter, calculate loss over a dataset.\n", - "* **RunningAverage**: General metric to attach to Engine during training. \n", - "* **ModelCheckpoint**: Handler to checkpoint models. \n", - "* **EarlyStopping**: Handler to stop training based on a score function. \n", - "* **ProgressBar**: Handler to create a tqdm progress bar." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "enczLgLTznkH" - }, - "source": [ - "from ignite.engine import Engine, Events\n", - "from ignite.metrics import Accuracy, Loss, RunningAverage\n", - "from ignite.handlers import ModelCheckpoint, EarlyStopping\n", - "from ignite.contrib.handlers import ProgressBar" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "-39hgxiUMCq9" - }, - "source": [ - "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WZYyXYB5znkH" - }, - "source": [ - "## Processing Data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "irv_ebeDb8yV" - }, - "source": [ - "We first set up a tokenizer using `torchtext.data.utils`.\n", - "The job of a tokenizer to split a sentence into \"tokens\". You can read more about it at [wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).\n", - "We will use the tokenizer for the \"spacy\" library which is a popular choice. Feel free to switch to \"basic_english\" if you want to use the default one or any other that you want.\n", - "\n", - "docs: https://pytorch.org/text/stable/data_utils.html" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "YNRd5Z_KMANB" - }, - "source": [ - "from torchtext.data.utils import get_tokenizer\n", - "tokenizer = get_tokenizer(\"spacy\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "ZknfGdqedSjN" - }, - "source": [ - "tokenizer(\"Ignite is a high-level library for training and evaluating neural networks.\")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZvAmyqHygcZg" - }, - "source": [ - "Next, the IMDB training and test datasets are downloaded. The `torchtext.datasets` API returns the train/test dataset split directly without the preprocessing information. Each split is an iterator which yields the raw texts and labels line-by-line." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "E_jNgWXHhMBQ" - }, - "source": [ - "from torchtext.datasets import IMDB\n", - "train_iter, test_iter = IMDB(split=('train','test'))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xNKvG9b7jadd" - }, - "source": [ - "Now we set up the train, validation and test splits. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "VzJG7Uh_L9q-" - }, - "source": [ - "# We are using only 1000 samples for faster training\n", - "# set to -1 to use full data\n", - "N = 1000 \n", - "\n", - "# We will use 80% of the `train split` for training and the rest for validation\n", - "train_frac = 0.8\n", - "_temp = list(train_iter)\n", - "\n", - "\n", - "random.shuffle(_temp)\n", - "_temp = _temp[:(N if N > 0 else len(_temp) )]\n", - "n_train = int(len(_temp)*train_frac)\n", - "\n", - "train_list = _temp[:n_train]\n", - "validation_list = _temp[n_train:]\n", - "test_list = list(test_iter)\n", - "test_list = test_list[:(N if N > 0 else len(test_list))]" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "X-qYvdeplMIs" - }, - "source": [ - "Let's explore a data sample to see what it looks like.\n", - "Each data sample is a tuple of the format `(label, text)`.\n", - "\n", - "The value of label can is either 'pos' or 'neg'.\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "qrlLB7PxkIW_" - }, - "source": [ - "random_sample = random.sample(train_list,1)[0]\n", - "print(' text:',random_sample[0])\n", - "print('label:', random_sample[1])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "mN5cHrazmMDG" - }, - "source": [ - "Now that we have the datasets splits, let's build our vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important that we build our vocabulary based on the train dataset as validation and test are **unseen** in our experimenting. \n", - "\n", - "`Vocab` allows us to use pretrained **GloVE** 100 dimensional word vectors. This means each word is described by 100 floats! If you want to read more about this, here are a few resources.\n", - "* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)\n", - "* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)\n", - "* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)\n", - "\n", - "Note than the GloVE download size is around 900MB, so it might take some time to download. 
\n", - "\n", - "An instance of the `Vocab` class has the following attributes:\n", - "* `extend` is used to extend the vocabulary\n", - "* `freqs` is a dictionary of the frequency of each word\n", - "* `itos` is a list of all the words in the vocabulary.\n", - "* `stoi` is a dictionary mapping every word to an index.\n", - "* `vectors` is a torch.Tensor of the downloaded embeddings\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "T_ukillQMKsh" - }, - "source": [ - "from collections import Counter\n", - "from torchtext.vocab import Vocab\n", - "\n", - "counter = Counter()\n", - "\n", - "for (label, line) in train_list:\n", - " counter.update(tokenizer(line))\n", - "\n", - "vocab = Vocab(\n", - " counter,\n", - " min_freq=10,\n", - " vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')\n", - ")" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "VYYGwfYsM2Pr" - }, - "source": [ - "print(\"The length of the new vocab is\", len(vocab))\n", - "new_stoi = vocab.stoi\n", - "print(\"The index of '' is\", new_stoi[''])\n", - "new_itos = vocab.itos\n", - "print(\"The token at index 2 is\", new_itos[2])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4Y72cqB6Qhqt" - }, - "source": [ - "We now create `text_transform` and `label_transform`, which are callable objects, such as a `lambda` func here, to process the raw text and label data from the dataset iterators (or iterables like a `list`). You can add the special symbols such as `` and `` to the sentence in `text_transform`." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "z_9hw21lP1nG" - }, - "source": [ - "text_transform = lambda x: [vocab[token] for token in tokenizer(x)]\n", - "label_transform = lambda x: 1 if x == 'pos' else 0\n", - "\n", - "# Print out the output of text_transform\n", - "print(\"input to the text_transform:\", \"here is an example\")\n", - "print(\"output of the text_transform:\", text_transform(\"here is an example\"))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xtZSEqjJQPxM" - }, - "source": [ - "For generating the data batches we will use `torch.utils.data.DataLoader`. You could customize the data batch by defining a function with the `collate_fn` argument in the DataLoader. Here, in the `collate_batch` func, we process the raw text data and add padding to dynamically match the longest sentence in a batch." 
- ] - }, - { - "cell_type": "code", - "metadata": { - "id": "NHHIEfpRP4TV" - }, - "source": [ - "from torch.utils.data import DataLoader\n", - "from torch.nn.utils.rnn import pad_sequence\n", - "\n", - "def collate_batch(batch):\n", - " label_list, text_list = [], []\n", - " for (_label, _text) in batch:\n", - " label_list.append(label_transform(_label))\n", - " processed_text = torch.tensor(text_transform(_text))\n", - " text_list.append(processed_text)\n", - " return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "3IQd3EVbQvTo" - }, - "source": [ - "batch_size = 8 # A batch size of 8\n", - "\n", - "def create_iterators(batch_size=8):\n", - " \"\"\"Heler function to create the iterators\"\"\"\n", - " dataloaders = []\n", - " for split in [train_list, validation_list, test_list]:\n", - " bucket_dataloader = DataLoader(\n", - " split, batch_size=batch_size,\n", - " collate_fn=collate_batch\n", - " )\n", - " dataloaders.append(bucket_dataloader)\n", - " return dataloaders\n" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "CudYIZitUNgx" - }, - "source": [ - "train_iterator, valid_iterator, test_iterator = create_iterators()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "787zNPm6RtKE" - }, - "source": [ - "next(iter(train_iterator))" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "d2azJGL6znkM" - }, - "source": [ - "Let's actually explore what the output of the iterator is, this way we'll know what the input of the model is, how to compare the label to the output and how to setup are process_functions for Ignite's `Engine`.\n", - "* `batch[0][0]` is the label of a single example. We can see that `vocab.stoi` was used to map the label that originally text into a float.\n", - "* `batch[1][0]` is the text of a single example. Similar to label, `vocab.stoi` was used to convert each token of the example's text into indices.\n", - "\n", - "Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, this means that the iterator is working as expected." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Ga2xDXfyznkN" - }, - "source": [ - "batch = next(iter(train_iterator))\n", - "print('batch[0][0] : ', batch[0][0])\n", - "print('batch[1][0] : ', batch[1][[0] != 1])\n", - "\n", - "lengths = []\n", - "for i, batch in enumerate(train_iterator):\n", - " x = batch[1]\n", - " lengths.append(x.shape[0])\n", - " if i == 10:\n", - " break\n", - "\n", - "print ('Lengths of first 10 batches : ', lengths)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KsUrKRr3znkO" - }, - "source": [ - "## TextCNN Model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pldMpTv-znkO" - }, - "source": [ - "Here is the replication of the model, here are the operations of the model:\n", - "* **Embedding**: Embeds a batch of text of shape (N, L) to (N, L, D), where N is batch size, L is maximum length of the batch, D is the embedding dimension. \n", - "\n", - "* **Convolutions**: Runs parallel convolutions across the embedded words with kernel sizes of 3, 4, 5 to mimic trigrams, four-grams, five-grams. 
This results in outputs of (N, L - k + 1, D) per convolution, where k is the kernel_size. \n", - "\n", - "* **Activation**: ReLu activation is applied to each convolution operation.\n", - "\n", - "* **Pooling**: Runs parallel maxpooling operations on the activated convolutions with window sizes of L - k + 1, resulting in 1 value per channel i.e. a shape of (N, 1, D) per pooling. \n", - "\n", - "* **Concat**: The pooling outputs are concatenated and squeezed to result in a shape of (N, 3D). This is a single embedding for a sentence.\n", - "\n", - "* **Dropout**: Dropout is applied to the embedded sentence. \n", - "\n", - "* **Fully Connected**: The dropout output is passed through a fully connected layer of shape (3D, 1) to give a single output for each example in the batch. sigmoid is applied to the output of this layer.\n", - "\n", - "* **load_embeddings**: This is a method defined for TextCNN to load embeddings based on user input. There are 3 modes - rand which results in randomly initialized weights, static which results in frozen pretrained weights, nonstatic which results in trainable pretrained weights. \n", - "\n", - "\n", - "Let's note that this model works for variable text lengths! The idea to embed the words of a sentence, use convolutions, maxpooling and concantenation to embed the sentence as a single vector! This single vector is passed through a fully connected layer with sigmoid to output a single value. This value can be interpreted as the probability a sentence is positive (closer to 1) or negative (closer to 0).\n", - "\n", - "The minimum length of text expected by the model is the size of the smallest kernel size of the model." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "63z1tffDznkO" - }, - "source": [ - "class TextCNN(nn.Module):\n", - " def __init__(\n", - " self,\n", - " vocab_size,\n", - " embedding_dim, \n", - " kernel_sizes, \n", - " num_filters, \n", - " num_classes, d_prob, mode):\n", - " super(TextCNN, self).__init__()\n", - " self.vocab_size = vocab_size\n", - " self.embedding_dim = embedding_dim\n", - " self.kernel_sizes = kernel_sizes\n", - " self.num_filters = num_filters\n", - " self.num_classes = num_classes\n", - " self.d_prob = d_prob\n", - " self.mode = mode\n", - " self.embedding = nn.Embedding(\n", - " vocab_size, embedding_dim, padding_idx=0)\n", - " self.load_embeddings()\n", - " self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,\n", - " out_channels=num_filters,\n", - " kernel_size=k, stride=1) for k in kernel_sizes])\n", - " self.dropout = nn.Dropout(d_prob)\n", - " self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)\n", - "\n", - " def forward(self, x):\n", - " batch_size, sequence_length = x.shape\n", - " x = self.embedding(x.T).transpose(1, 2)\n", - " x = [F.relu(conv(x)) for conv in self.conv]\n", - " x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]\n", - " x = torch.cat(x, dim=1)\n", - " x = self.fc(self.dropout(x))\n", - " return torch.sigmoid(x).squeeze()\n", - "\n", - " def load_embeddings(self):\n", - " if 'static' in self.mode:\n", - " self.embedding.weight.data.copy_(vocab.vectors)\n", - " if 'non' not in self.mode:\n", - " self.embedding.weight.data.requires_grad = False\n", - " print('Loaded pretrained embeddings, weights are not trainable.')\n", - " else:\n", - " self.embedding.weight.data.requires_grad = True\n", - " print('Loaded pretrained embeddings, weights are trainable.')\n", - " elif self.mode == 'rand':\n", - " print('Randomly initialized embeddings are used.')\n", - 
" else:\n", - " raise ValueError('Unexpected value of mode. Please choose from static, nonstatic, rand.')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6G3TW4c-znkO" - }, - "source": [ - "## Creating Model, Optimizer and Loss" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "D7nH55oXznkP" - }, - "source": [ - "Below we create an instance of the TextCNN model and load embeddings in **static** mode. The model is placed on a device and then a loss function of Binary Cross Entropy and Adam optimizer are setup. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "HM_7LQE3znkP" - }, - "source": [ - "vocab_size, embedding_dim = vocab.vectors.shape\n", - "\n", - "model = TextCNN(vocab_size=vocab_size,\n", - " embedding_dim=embedding_dim,\n", - " kernel_sizes=[3, 4, 5],\n", - " num_filters=100,\n", - " num_classes=1, \n", - " d_prob=0.5,\n", - " mode='static')\n", - "model.to(device)\n", - "optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)\n", - "criterion = nn.BCELoss()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xjxbAwvIznkP" - }, - "source": [ - "## Training and Evaluating using Ignite" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "L8Rl7spqznkQ" - }, - "source": [ - "### Trainer Engine - process_function\n", - "\n", - "Ignite's Engine allows user to define a process_function to process a given batch, this is applied to all the batches of the dataset. This is a general class that can be applied to train and validate models! A process_function has two parameters engine and batch. \n", - "\n", - "\n", - "Let's walk through what the function of the trainer does:\n", - "\n", - "* Sets model in train mode. \n", - "* Sets the gradients of the optimizer to zero.\n", - "* Generate x and y from batch.\n", - "* Performs a forward pass to calculate y_pred using model and x.\n", - "* Calculates loss using y_pred and y.\n", - "* Performs a backward pass using loss to calculate gradients for the model parameters.\n", - "* model parameters are optimized using gradients and optimizer.\n", - "* Returns scalar loss. \n", - "\n", - "Below is a single operation during the trainig process. This process_function will be attached to the training engine." - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "Q4ncIcYcznkQ" - }, - "source": [ - "def process_function(engine, batch):\n", - " model.train()\n", - " optimizer.zero_grad()\n", - " y, x = batch\n", - " x = x.to(device)\n", - " y = y.to(device)\n", - " y_pred = model(x)\n", - " loss = criterion(y_pred, y.float())\n", - " loss.backward()\n", - " optimizer.step()\n", - " return loss.item()" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HiiQr_GYznkQ" - }, - "source": [ - "### Evaluator Engine - process_function\n", - "\n", - "Similar to the training process function, we setup a function to evaluate a single batch. 
Here is what the eval_function does:\n", - "\n", - "* Sets model in eval mode.\n", - "* Generates x and y from batch.\n", - "* With torch.no_grad(), no gradients are calculated for any succeding steps.\n", - "* Performs a forward pass on the model to calculate y_pred based on model and x.\n", - "* Returns y_pred and y.\n", - "\n", - "Ignite suggests attaching metrics to evaluators and not trainers because during the training the model parameters are constantly changing and it is best to evaluate model on a stationary model. This information is important as there is a difference in the functions for training and evaluating. Training returns a single scalar loss. Evaluating returns y_pred and y as that output is used to calculate metrics per batch for the entire dataset.\n", - "\n", - "All metrics in Ignite require y_pred and y as outputs of the function attached to the Engine. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "b9-G-9iVznkR" - }, - "source": [ - "def eval_function(engine, batch):\n", - " model.eval()\n", - " with torch.no_grad():\n", - " y, x = batch\n", - " y = y.to(device)\n", - " x = x.to(device)\n", - " y = y.float()\n", - " y_pred = model(x)\n", - " print('y_pred',y_pred)\n", - " print('y',y)\n", - " return y_pred, y" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dcmIEZuNznkS" - }, - "source": [ - "### Instantiating Training and Evaluating Engines\n", - "\n", - "Below we create 3 engines, a trainer, a training evaluator and a validation evaluator. You'll notice that train_evaluator and validation_evaluator use the same function, we'll see later why this was done! " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "k1CxFQs_znkS" - }, - "source": [ - "trainer = Engine(process_function)\n", - "train_evaluator = Engine(eval_function)\n", - "validation_evaluator = Engine(eval_function)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ZVu91uVtznkS" - }, - "source": [ - "### Metrics - RunningAverage, Accuracy and Loss\n", - "\n", - "To start, we'll attach a metric of Running Average to track a running average of the scalar loss output for each batch. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "P-98lPU9znkS" - }, - "source": [ - "RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mufkp6mnznkS" - }, - "source": [ - "Now there are two metrics that we want to use for evaluation - accuracy and loss. This is a binary problem, so for Loss we can simply pass the Binary Cross Entropy function as the loss_function. \n", - "\n", - "For Accuracy, Ignite requires y_pred and y to be comprised of 0's and 1's only. Since our model outputs from a sigmoid layer, values are between 0 and 1. We'll need to write a function that transforms `engine.state.output` which is comprised of y_pred and y. \n", - "\n", - "Below `thresholded_output_transform` does just that, it rounds y_pred to convert y_pred to 0's and 1's, and then returns rounded y_pred and y. This function is the output_transform function used to transform the `engine.state.output` to achieve `Accuracy`'s desired purpose.\n", - "\n", - "Now, we attach `Loss` and `Accuracy` (with `thresholded_output_transform`) to train_evaluator and validation_evaluator. 
\n", - "\n", - "To attach a metric to engine, the following format is used:\n", - "* `Metric(output_transform=output_transform, ...).attach(engine, 'metric_name')`\n" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "KAK6nXEbznkS" - }, - "source": [ - "def thresholded_output_transform(output):\n", - " y_pred, y = output\n", - " y_pred = torch.round(y_pred)\n", - " return y_pred, y" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "QkcC2R4qznkT" - }, - "source": [ - "Accuracy(output_transform=thresholded_output_transform).attach(train_evaluator, 'accuracy')\n", - "Loss(criterion).attach(train_evaluator, 'bce')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "metadata": { - "id": "tLtT5f11znkT" - }, - "source": [ - "Accuracy(output_transform=thresholded_output_transform).attach(validation_evaluator, 'accuracy')\n", - "Loss(criterion).attach(validation_evaluator, 'bce')" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FbS2h_2eznkU" - }, - "source": [ - "### Progress Bar\n", - "\n", - "Next we create an instance of Ignite's progess bar and attach it to the trainer and pass it a key of `engine.state.metrics` to track. In this case, the progress bar will be tracking `engine.state.metrics['loss']`" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "qteztuB3znkU" - }, - "source": [ - "pbar = ProgressBar(persist=True, bar_format=\"\")\n", - "pbar.attach(trainer, ['loss'])" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x4DxUwXfznkU" - }, - "source": [ - "### EarlyStopping - Tracking Validation Loss\n", - "\n", - "Now we'll setup a Early Stopping handler for this training process. EarlyStopping requires a score_function that allows the user to define whatever criteria to stop trainig. In this case, if the loss of the validation set does not decrease in 5 epochs, the training process will stop early. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "wPM6-USgznkU" - }, - "source": [ - "def score_function(engine):\n", - " val_loss = engine.state.metrics['bce']\n", - " return -val_loss\n", - "\n", - "handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)\n", - "validation_evaluator.add_event_handler(Events.COMPLETED, handler)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LfeL6EkhznkU" - }, - "source": [ - "### Attaching Custom Functions to Engine at specific Events\n", - "\n", - "Below you'll see ways to define your own custom functions and attaching them to various `Events` of the training process.\n", - "\n", - "The functions below both achieve similar tasks, they print the results of the evaluator run on a dataset. One function does that on the training evaluator and dataset, while the other on the validation. Another difference is how these functions are attached in the trainer engine.\n", - "\n", - "The first method involves using a decorator, the syntax is simple - `@` `trainer.on(Events.EPOCH_COMPLETED)`, means that the decorated function will be attached to the trainer and called at the end of each epoch. \n", - "\n", - "The second method involves using the add_event_handler method of trainer - `trainer.add_event_handler(Events.EPOCH_COMPLETED, custom_function)`. This achieves the same result as the above. 
" - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "3XsmcAA2znkV" - }, - "source": [ - "@trainer.on(Events.EPOCH_COMPLETED)\n", - "def log_training_results(engine):\n", - " train_evaluator.run(train_iterator)\n", - " metrics = train_evaluator.state.metrics\n", - " avg_accuracy = metrics['accuracy']\n", - " avg_bce = metrics['bce']\n", - " pbar.log_message(\n", - " \"Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n", - " .format(engine.state.epoch, avg_accuracy, avg_bce))\n", - " \n", - "def log_validation_results(engine):\n", - " validation_evaluator.run(valid_iterator)\n", - " metrics = validation_evaluator.state.metrics\n", - " avg_accuracy = metrics['accuracy']\n", - " avg_bce = metrics['bce']\n", - " pbar.log_message(\n", - " \"Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n", - " .format(engine.state.epoch, avg_accuracy, avg_bce))\n", - " pbar.n = pbar.last_print_n = 0\n", - "\n", - "trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IAQEr88cznkW" - }, - "source": [ - "### ModelCheckpoint\n", - "\n", - "Lastly, we want to checkpoint this model. It's important to do so, as training processes can be time consuming and if for some reason something goes wrong during training, a model checkpoint can be helpful to restart training from the point of failure.\n", - "\n", - "Below we'll use Ignite's `ModelCheckpoint` handler to checkpoint models at the end of each epoch. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "9Gl6WT0YznkW" - }, - "source": [ - "checkpointer = ModelCheckpoint('/tmp/models', 'textcnn', n_saved=2, create_dir=True, save_as_state_dict=True)\n", - "trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'textcnn': model})" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "LxCIriIEznkW" - }, - "source": [ - "### Run Engine\n", - "\n", - "Next, we'll run the trainer for 20 epochs and monitor results. Below we can see that progess bar prints the loss per iteration, and prints the results of training and validation as we specified in our custom function. " - ] - }, - { - "cell_type": "code", - "metadata": { - "id": "sPe46cQOznkX" - }, - "source": [ - "trainer.run(train_iterator, max_epochs=20)" - ], - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OpqXiZUsznkY" - }, - "source": [ - "That's it! We have successfully trained and evaluated a Convolutational Neural Network for Text Classification. " - ] - } - ] -} \ No newline at end of file + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "RfRKTxQO51bK" + }, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HCS-d1T3znj2" + }, + "source": [ + "# Convolutional Neural Networks for Sentence Classification using Ignite" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rjZMYxFoznj9" + }, + "source": [ + "This is a tutorial on using Ignite to train neural network models, setup experiments and validate models.\n", + "\n", + "In this experiment, we'll be replicating [\n", + "Convolutional Neural Networks for Sentence Classification by Yoon Kim](https://arxiv.org/abs/1408.5882)! 
This paper uses CNN for text classification, a task typically reserved for RNNs, Logistic Regression, Naive Bayes.\n", + "\n", + "We want to be able to classify IMDB movie reviews and predict whether the review is positive or negative. IMDB Movie Review dataset comprises of 25000 positive and 25000 negative examples. The dataset comprises of text and label pairs. This is binary classification problem. We'll be using PyTorch to create the model, torchtext to import data and Ignite to train and monitor the models!\n", + "\n", + "Lets get started! " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sovYyC0Zznj-" + }, + "source": [ + "## Required Dependencies \n", + "\n", + "In this example we only need torchtext and spacy package, assuming that `torch` and `ignite` are already installed. We can install it using `pip`:\n", + "\n", + "`pip install torchtext==0.9.1 spacy`\n", + "\n", + "`python -m spacy download en_core_web_sm`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7XHAD9x7znj_" + }, + "outputs": [], + "source": [ + "!pip install pytorch-ignite torchtext==0.9.1 spacy\n", + "!python -m spacy download en_core_web_sm" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lZty7-RWznkA" + }, + "source": [ + "## Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VKTazeAkznkB" + }, + "outputs": [], + "source": [ + "import random" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wxThg0YTznkD" + }, + "source": [ + "`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:\n", + "* **data**: A module to setup the data in the form Fields and Labels.\n", + "* **datasets**: A module to download NLP datasets.\n", + "* **GloVe**: A module to download and use pretrained GloVe embedings." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XrXE-f7jznkD" + }, + "outputs": [], + "source": [ + "from torchtext import data\n", + "from torchtext import datasets\n", + "from torchtext.vocab import GloVe" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ivAnTyEfznkE" + }, + "source": [ + "We import torch, nn and functional modules to create our models! " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gbEFAWr0znkE" + }, + "outputs": [], + "source": [ + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q22BGKi8znkF" + }, + "source": [ + "`Ignite` is a High-level library to help with training neural networks in PyTorch. It comes with an `Engine` to setup a training loop, various metrics, handlers and a helpful contrib section! \n", + "\n", + "Below we import the following:\n", + "* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.\n", + "* **Events**: Allows users to attach functions to an `Engine` to fire functions at a specific event. Eg: `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.\n", + "* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, multilabel cases. \n", + "* **Loss**: General metric that takes a loss function as a parameter, calculate loss over a dataset.\n", + "* **RunningAverage**: General metric to attach to Engine during training. \n", + "* **ModelCheckpoint**: Handler to checkpoint models. 
\n", + "* **EarlyStopping**: Handler to stop training based on a score function. \n", + "* **ProgressBar**: Handler to create a tqdm progress bar." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "enczLgLTznkH" + }, + "outputs": [], + "source": [ + "from ignite.engine import Engine, Events\n", + "from ignite.metrics import Accuracy, Loss, RunningAverage\n", + "from ignite.handlers import ModelCheckpoint, EarlyStopping\n", + "from ignite.contrib.handlers import ProgressBar\n", + "from ignite.utils import manual_seed\n", + "\n", + "SEED = 1234\n", + "manual_seed(SEED)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-39hgxiUMCq9" + }, + "outputs": [], + "source": [ + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WZYyXYB5znkH" + }, + "source": [ + "## Processing Data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "irv_ebeDb8yV" + }, + "source": [ + "We first set up a tokenizer using `torchtext.data.utils`.\n", + "The job of a tokenizer to split a sentence into \"tokens\". You can read more about it at [wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).\n", + "We will use the tokenizer from the \"spacy\" library which is a popular choice. Feel free to switch to \"basic_english\" if you want to use the default one or any other that you want.\n", + "\n", + "docs: https://pytorch.org/text/stable/data_utils.html" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YNRd5Z_KMANB" + }, + "outputs": [], + "source": [ + "from torchtext.data.utils import get_tokenizer\n", + "tokenizer = get_tokenizer(\"spacy\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZknfGdqedSjN" + }, + "outputs": [], + "source": [ + "tokenizer(\"Ignite is a high-level library for training and evaluating neural networks.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZvAmyqHygcZg" + }, + "source": [ + "Next, the IMDB training and test datasets are downloaded. The `torchtext.datasets` API returns the train/test dataset split directly without the preprocessing information. Each split is an iterator which yields the raw texts and labels line-by-line." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "E_jNgWXHhMBQ" + }, + "outputs": [], + "source": [ + "from torchtext.datasets import IMDB\n", + "train_iter, test_iter = IMDB(split=('train','test'))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xNKvG9b7jadd" + }, + "source": [ + "Now we set up the train, validation and test splits. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VzJG7Uh_L9q-" + }, + "outputs": [], + "source": [ + "# We are using only 1000 samples for faster training\n", + "# set to -1 to use full data\n", + "N = 1000 \n", + "\n", + "# We will use 80% of the `train split` for training and the rest for validation\n", + "train_frac = 0.8\n", + "_temp = list(train_iter)\n", + "\n", + "\n", + "random.shuffle(_temp)\n", + "_temp = _temp[:(N if N > 0 else len(_temp) )]\n", + "n_train = int(len(_temp)*train_frac)\n", + "\n", + "train_list = _temp[:n_train]\n", + "validation_list = _temp[n_train:]\n", + "test_list = list(test_iter)\n", + "test_list = test_list[:(N if N > 0 else len(test_list))]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X-qYvdeplMIs" + }, + "source": [ + "Let's explore a data sample to see what it looks like.\n", + "Each data sample is a tuple of the format `(label, text)`.\n", + "\n", + "The value of label can is either 'pos' or 'neg'.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qrlLB7PxkIW_" + }, + "outputs": [], + "source": [ + "random_sample = random.sample(train_list,1)[0]\n", + "print(' text:', random_sample[1])\n", + "print('label:', random_sample[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mN5cHrazmMDG" + }, + "source": [ + "Now that we have the datasets splits, let's build our vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important that we build our vocabulary based on the train dataset as validation and test are **unseen** in our experimenting. \n", + "\n", + "`Vocab` allows us to use pretrained **GloVE** 100 dimensional word vectors. This means each word is described by 100 floats! If you want to read more about this, here are a few resources.\n", + "* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)\n", + "* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)\n", + "* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)\n", + "\n", + "Note than the GloVE download size is around 900MB, so it might take some time to download. 
\n", + "\n", + "An instance of the `Vocab` class has the following attributes:\n", + "* `extend` is used to extend the vocabulary\n", + "* `freqs` is a dictionary of the frequency of each word\n", + "* `itos` is a list of all the words in the vocabulary.\n", + "* `stoi` is a dictionary mapping every word to an index.\n", + "* `vectors` is a torch.Tensor of the downloaded embeddings\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T_ukillQMKsh" + }, + "outputs": [], + "source": [ + "from collections import Counter\n", + "from torchtext.vocab import Vocab\n", + "\n", + "counter = Counter()\n", + "\n", + "for (label, line) in train_list:\n", + " counter.update(tokenizer(line))\n", + "\n", + "vocab = Vocab(\n", + " counter,\n", + " min_freq=10,\n", + " vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VYYGwfYsM2Pr" + }, + "outputs": [], + "source": [ + "print(\"The length of the new vocab is\", len(vocab))\n", + "new_stoi = vocab.stoi\n", + "print(\"The index of '' is\", new_stoi[''])\n", + "new_itos = vocab.itos\n", + "print(\"The token at index 2 is\", new_itos[2])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Y72cqB6Qhqt" + }, + "source": [ + "We now create `text_transform` and `label_transform`, which are callable objects, such as a `lambda` func here, to process the raw text and label data from the dataset iterators (or iterables like a `list`). You can add the special symbols such as `` and `` to the sentence in `text_transform`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "z_9hw21lP1nG" + }, + "outputs": [], + "source": [ + "text_transform = lambda x: [vocab[token] for token in tokenizer(x)]\n", + "label_transform = lambda x: 1 if x == 'pos' else 0\n", + "\n", + "# Print out the output of text_transform\n", + "print(\"input to the text_transform:\", \"here is an example\")\n", + "print(\"output of the text_transform:\", text_transform(\"here is an example\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xtZSEqjJQPxM" + }, + "source": [ + "For generating the data batches we will use `torch.utils.data.DataLoader`. You could customize the data batch by defining a function with the `collate_fn` argument in the DataLoader. Here, in the `collate_batch` func, we process the raw text data and add padding to dynamically match the longest sentence in a batch." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NHHIEfpRP4TV" + }, + "outputs": [], + "source": [ + "from torch.utils.data import DataLoader\n", + "from torch.nn.utils.rnn import pad_sequence\n", + "\n", + "def collate_batch(batch):\n", + " label_list, text_list = [], []\n", + " for (_label, _text) in batch:\n", + " label_list.append(label_transform(_label))\n", + " processed_text = torch.tensor(text_transform(_text))\n", + " text_list.append(processed_text)\n", + " return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3IQd3EVbQvTo" + }, + "outputs": [], + "source": [ + "batch_size = 8 # A batch size of 8\n", + "\n", + "def create_iterators(batch_size=8):\n", + " \"\"\"Heler function to create the iterators\"\"\"\n", + " dataloaders = []\n", + " for split in [train_list, validation_list, test_list]:\n", + " bucket_dataloader = DataLoader(\n", + " split, batch_size=batch_size,\n", + " collate_fn=collate_batch\n", + " )\n", + " dataloaders.append(bucket_dataloader)\n", + " return dataloaders\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CudYIZitUNgx" + }, + "outputs": [], + "source": [ + "train_iterator, valid_iterator, test_iterator = create_iterators()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "787zNPm6RtKE" + }, + "outputs": [], + "source": [ + "next(iter(train_iterator))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d2azJGL6znkM" + }, + "source": [ + "Let's actually explore what the output of the iterator is, this way we'll know what the input of the model is, how to compare the label to the output and how to setup are process_functions for Ignite's `Engine`.\n", + "* `batch[0][0]` is the label of a single example. We can see that `vocab.stoi` was used to map the label that originally text into a float.\n", + "* `batch[1][0]` is the text of a single example. Similar to label, `vocab.stoi` was used to convert each token of the example's text into indices.\n", + "\n", + "Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, this means that the iterator is working as expected." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ga2xDXfyznkN" + }, + "outputs": [], + "source": [ + "batch = next(iter(train_iterator))\n", + "print('batch[0][0] : ', batch[0][0])\n", + "print('batch[1][0] : ', batch[1][[0] != 1])\n", + "\n", + "lengths = []\n", + "for i, batch in enumerate(train_iterator):\n", + " x = batch[1]\n", + " lengths.append(x.shape[0])\n", + " if i == 10:\n", + " break\n", + "\n", + "print ('Lengths of first 10 batches : ', lengths)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KsUrKRr3znkO" + }, + "source": [ + "## TextCNN Model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pldMpTv-znkO" + }, + "source": [ + "Here is the replication of the model, here are the operations of the model:\n", + "* **Embedding**: Embeds a batch of text of shape (N, L) to (N, L, D), where N is batch size, L is maximum length of the batch, D is the embedding dimension. \n", + "\n", + "* **Convolutions**: Runs parallel convolutions across the embedded words with kernel sizes of 3, 4, 5 to mimic trigrams, four-grams, five-grams. 
This results in outputs of (N, L - k + 1, D) per convolution, where k is the kernel_size. \n", + "\n", + "* **Activation**: ReLu activation is applied to each convolution operation.\n", + "\n", + "* **Pooling**: Runs parallel maxpooling operations on the activated convolutions with window sizes of L - k + 1, resulting in 1 value per channel i.e. a shape of (N, 1, D) per pooling. \n", + "\n", + "* **Concat**: The pooling outputs are concatenated and squeezed to result in a shape of (N, 3D). This is a single embedding for a sentence.\n", + "\n", + "* **Dropout**: Dropout is applied to the embedded sentence. \n", + "\n", + "* **Fully Connected**: The dropout output is passed through a fully connected layer of shape (3D, 1) to give a single output for each example in the batch. sigmoid is applied to the output of this layer.\n", + "\n", + "* **load_embeddings**: This is a method defined for TextCNN to load embeddings based on user input. There are 3 modes - rand which results in randomly initialized weights, static which results in frozen pretrained weights, nonstatic which results in trainable pretrained weights. \n", + "\n", + "\n", + "Let's note that this model works for variable text lengths! The idea to embed the words of a sentence, use convolutions, maxpooling and concantenation to embed the sentence as a single vector! This single vector is passed through a fully connected layer with sigmoid to output a single value. This value can be interpreted as the probability a sentence is positive (closer to 1) or negative (closer to 0).\n", + "\n", + "The minimum length of text expected by the model is the size of the smallest kernel size of the model." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "63z1tffDznkO" + }, + "outputs": [], + "source": [ + "class TextCNN(nn.Module):\n", + " def __init__(\n", + " self,\n", + " vocab_size,\n", + " embedding_dim, \n", + " kernel_sizes, \n", + " num_filters, \n", + " num_classes, d_prob, mode):\n", + " super(TextCNN, self).__init__()\n", + " self.vocab_size = vocab_size\n", + " self.embedding_dim = embedding_dim\n", + " self.kernel_sizes = kernel_sizes\n", + " self.num_filters = num_filters\n", + " self.num_classes = num_classes\n", + " self.d_prob = d_prob\n", + " self.mode = mode\n", + " self.embedding = nn.Embedding(\n", + " vocab_size, embedding_dim, padding_idx=0)\n", + " self.load_embeddings()\n", + " self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,\n", + " out_channels=num_filters,\n", + " kernel_size=k, stride=1) for k in kernel_sizes])\n", + " self.dropout = nn.Dropout(d_prob)\n", + " self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)\n", + "\n", + " def forward(self, x):\n", + " batch_size, sequence_length = x.shape\n", + " x = self.embedding(x.T).transpose(1, 2)\n", + " x = [F.relu(conv(x)) for conv in self.conv]\n", + " x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]\n", + " x = torch.cat(x, dim=1)\n", + " x = self.fc(self.dropout(x))\n", + " return torch.sigmoid(x).squeeze()\n", + "\n", + " def load_embeddings(self):\n", + " if 'static' in self.mode:\n", + " self.embedding.weight.data.copy_(vocab.vectors)\n", + " if 'non' not in self.mode:\n", + " self.embedding.weight.data.requires_grad = False\n", + " print('Loaded pretrained embeddings, weights are not trainable.')\n", + " else:\n", + " self.embedding.weight.data.requires_grad = True\n", + " print('Loaded pretrained embeddings, weights are trainable.')\n", + " elif self.mode == 'rand':\n", + " 
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "6G3TW4c-znkO"
+   },
+   "source": [
+    "## Creating Model, Optimizer and Loss"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "D7nH55oXznkP"
+   },
+   "source": [
+    "Below we create an instance of the TextCNN model and load embeddings in **static** mode. The model is placed on the device, and then a Binary Cross Entropy loss function and an Adam optimizer are set up."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "HM_7LQE3znkP"
+   },
+   "outputs": [],
+   "source": [
+    "vocab_size, embedding_dim = vocab.vectors.shape\n",
+    "\n",
+    "model = TextCNN(vocab_size=vocab_size,\n",
+    "                embedding_dim=embedding_dim,\n",
+    "                kernel_sizes=[3, 4, 5],\n",
+    "                num_filters=100,\n",
+    "                num_classes=1,\n",
+    "                d_prob=0.5,\n",
+    "                mode='static')\n",
+    "model.to(device)\n",
+    "optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)\n",
+    "criterion = nn.BCELoss()"
+   ]
+  },
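+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since we chose **static** mode, the embedding weights should be frozen. As a quick optional check (not part of the original write-up), we can list which parameters will actually receive gradient updates; only the convolution and fully connected parameters should be trainable."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional check: embedding.weight should report requires_grad=False in 'static' mode.\n",
+    "for name, param in model.named_parameters():\n",
+    "    print(name, param.requires_grad)"
+   ]
+  },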
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "xjxbAwvIznkP"
+   },
+   "source": [
+    "## Training and Evaluating using Ignite"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "L8Rl7spqznkQ"
+   },
+   "source": [
+    "### Trainer Engine - process_function\n",
+    "\n",
+    "Ignite's Engine allows users to define a process_function to process a given batch; it is applied to all the batches of the dataset. This is a general class that can be applied to train and validate models! A process_function has two parameters: engine and batch.\n",
+    "\n",
+    "\n",
+    "Let's walk through what the trainer's process function does:\n",
+    "\n",
+    "* Sets the model in train mode.\n",
+    "* Sets the gradients of the optimizer to zero.\n",
+    "* Generates x and y from the batch.\n",
+    "* Performs a forward pass to calculate y_pred using the model and x.\n",
+    "* Calculates the loss using y_pred and y.\n",
+    "* Performs a backward pass using the loss to calculate gradients for the model parameters.\n",
+    "* Model parameters are optimized using the gradients and the optimizer.\n",
+    "* Returns the scalar loss.\n",
+    "\n",
+    "Below is a single operation during the training process. This process_function will be attached to the training engine."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Q4ncIcYcznkQ"
+   },
+   "outputs": [],
+   "source": [
+    "def process_function(engine, batch):\n",
+    "    model.train()\n",
+    "    optimizer.zero_grad()\n",
+    "    y, x = batch\n",
+    "    x = x.to(device)\n",
+    "    y = y.to(device)\n",
+    "    y_pred = model(x)\n",
+    "    loss = criterion(y_pred, y.float())\n",
+    "    loss.backward()\n",
+    "    optimizer.step()\n",
+    "    return loss.item()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "HiiQr_GYznkQ"
+   },
+   "source": [
+    "### Evaluator Engine - process_function\n",
+    "\n",
+    "Similar to the training process function, we set up a function to evaluate a single batch. Here is what the eval_function does:\n",
+    "\n",
+    "* Sets the model in eval mode.\n",
+    "* Generates x and y from the batch.\n",
+    "* With torch.no_grad(), no gradients are calculated for any succeeding steps.\n",
+    "* Performs a forward pass on the model to calculate y_pred based on the model and x.\n",
+    "* Returns y_pred and y.\n",
+    "\n",
+    "Ignite suggests attaching metrics to evaluators and not trainers, because during training the model parameters are constantly changing and it is best to evaluate the model with fixed parameters. This information is important as there is a difference in the functions for training and evaluating. Training returns a single scalar loss. Evaluating returns y_pred and y, as that output is used to calculate metrics per batch for the entire dataset.\n",
+    "\n",
+    "All metrics in Ignite require y_pred and y as outputs of the function attached to the Engine."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "b9-G-9iVznkR"
+   },
+   "outputs": [],
+   "source": [
+    "def eval_function(engine, batch):\n",
+    "    model.eval()\n",
+    "    with torch.no_grad():\n",
+    "        y, x = batch\n",
+    "        y = y.to(device)\n",
+    "        x = x.to(device)\n",
+    "        y = y.float()\n",
+    "        y_pred = model(x)\n",
+    "        return y_pred, y"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "dcmIEZuNznkS"
+   },
+   "source": [
+    "### Instantiating Training and Evaluating Engines\n",
+    "\n",
+    "Below we create 3 engines: a trainer, a training evaluator and a validation evaluator. You'll notice that train_evaluator and validation_evaluator use the same function; we'll see later why this was done!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "k1CxFQs_znkS"
+   },
+   "outputs": [],
+   "source": [
+    "trainer = Engine(process_function)\n",
+    "train_evaluator = Engine(eval_function)\n",
+    "validation_evaluator = Engine(eval_function)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "ZVu91uVtznkS"
+   },
+   "source": [
+    "### Metrics - RunningAverage, Accuracy and Loss\n",
+    "\n",
+    "To start, we'll attach a RunningAverage metric to track a running average of the scalar loss output for each batch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "P-98lPU9znkS"
+   },
+   "outputs": [],
+   "source": [
+    "RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "Mufkp6mnznkS"
+   },
+   "source": [
+    "Now there are two metrics that we want to use for evaluation - accuracy and loss. This is a binary problem, so for Loss we can simply pass the Binary Cross Entropy function as the loss_function.\n",
+    "\n",
+    "For Accuracy, Ignite requires y_pred and y to consist of 0's and 1's only. Since our model's output comes from a sigmoid layer, its values are between 0 and 1. We'll need to write a function that transforms `engine.state.output`, which consists of y_pred and y.\n",
+    "\n",
+    "Below, `thresholded_output_transform` does just that: it rounds y_pred to 0's and 1's and then returns the rounded y_pred along with y. This is the output_transform function used to transform `engine.state.output` into the form `Accuracy` expects.\n",
+    "\n",
+    "Now, we attach `Loss` and `Accuracy` (with `thresholded_output_transform`) to train_evaluator and validation_evaluator.\n",
+    "\n",
+    "To attach a metric to an engine, the following format is used:\n",
+    "* `Metric(output_transform=output_transform, ...).attach(engine, 'metric_name')`\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "KAK6nXEbznkS"
+   },
+   "outputs": [],
+   "source": [
+    "def thresholded_output_transform(output):\n",
+    "    y_pred, y = output\n",
+    "    y_pred = torch.round(y_pred)\n",
+    "    return y_pred, y"
+   ]
+  },
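+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the rounding behaviour concrete, here is a tiny optional illustration with made-up numbers; this cell is only a sketch and is not needed for training."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Made-up (y_pred, y) pair: predictions are rounded to 0/1 before Accuracy sees them.\n",
+    "example_output = (torch.tensor([0.21, 0.78, 0.55]), torch.tensor([0., 1., 0.]))\n",
+    "print(thresholded_output_transform(example_output))\n",
+    "# expected: (tensor([0., 1., 1.]), tensor([0., 1., 0.]))"
+   ]
+  },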
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "QkcC2R4qznkT"
+   },
+   "outputs": [],
+   "source": [
+    "Accuracy(output_transform=thresholded_output_transform).attach(train_evaluator, 'accuracy')\n",
+    "Loss(criterion).attach(train_evaluator, 'bce')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "tLtT5f11znkT"
+   },
+   "outputs": [],
+   "source": [
+    "Accuracy(output_transform=thresholded_output_transform).attach(validation_evaluator, 'accuracy')\n",
+    "Loss(criterion).attach(validation_evaluator, 'bce')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "FbS2h_2eznkU"
+   },
+   "source": [
+    "### Progress Bar\n",
+    "\n",
+    "Next we create an instance of Ignite's progress bar, attach it to the trainer, and pass it a key of `engine.state.metrics` to track. In this case, the progress bar will be tracking `engine.state.metrics['loss']`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "qteztuB3znkU"
+   },
+   "outputs": [],
+   "source": [
+    "pbar = ProgressBar(persist=True, bar_format=\"\")\n",
+    "pbar.attach(trainer, ['loss'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "x4DxUwXfznkU"
+   },
+   "source": [
+    "### EarlyStopping - Tracking Validation Loss\n",
+    "\n",
+    "Now we'll set up an EarlyStopping handler for this training process. EarlyStopping requires a score_function that allows the user to define the criterion used to stop training. In this case, if the loss of the validation set does not decrease over 5 epochs, the training process will stop early."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "wPM6-USgznkU"
+   },
+   "outputs": [],
+   "source": [
+    "def score_function(engine):\n",
+    "    val_loss = engine.state.metrics['bce']\n",
+    "    return -val_loss\n",
+    "\n",
+    "handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)\n",
+    "validation_evaluator.add_event_handler(Events.COMPLETED, handler)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "LfeL6EkhznkU"
+   },
+   "source": [
+    "### Attaching Custom Functions to Engine at specific Events\n",
+    "\n",
+    "Below you'll see ways to define your own custom functions and attach them to various `Events` of the training process.\n",
+    "\n",
+    "The functions below achieve similar tasks: they print the results of an evaluator run on a dataset. One function does this for the training evaluator and dataset, while the other does it for the validation ones. Another difference is how these functions are attached to the trainer engine.\n",
+    "\n",
+    "The first method involves using a decorator; the syntax is simply `@trainer.on(Events.EPOCH_COMPLETED)`, which means that the decorated function will be attached to the trainer and called at the end of each epoch.\n",
+    "\n",
+    "The second method involves using the add_event_handler method of the trainer - `trainer.add_event_handler(Events.EPOCH_COMPLETED, custom_function)`. This achieves the same result as the above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "3XsmcAA2znkV"
+   },
+   "outputs": [],
+   "source": [
+    "@trainer.on(Events.EPOCH_COMPLETED)\n",
+    "def log_training_results(engine):\n",
+    "    train_evaluator.run(train_iterator)\n",
+    "    metrics = train_evaluator.state.metrics\n",
+    "    avg_accuracy = metrics['accuracy']\n",
+    "    avg_bce = metrics['bce']\n",
+    "    pbar.log_message(\n",
+    "        \"Training Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n",
+    "        .format(engine.state.epoch, avg_accuracy, avg_bce))\n",
+    "\n",
+    "def log_validation_results(engine):\n",
+    "    validation_evaluator.run(valid_iterator)\n",
+    "    metrics = validation_evaluator.state.metrics\n",
+    "    avg_accuracy = metrics['accuracy']\n",
+    "    avg_bce = metrics['bce']\n",
+    "    pbar.log_message(\n",
+    "        \"Validation Results - Epoch: {} Avg accuracy: {:.2f} Avg loss: {:.2f}\"\n",
+    "        .format(engine.state.epoch, avg_accuracy, avg_bce))\n",
+    "    pbar.n = pbar.last_print_n = 0\n",
+    "\n",
+    "trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "IAQEr88cznkW"
+   },
+   "source": [
+    "### ModelCheckpoint\n",
+    "\n",
+    "Lastly, we want to checkpoint this model. It's important to do so, as training processes can be time consuming, and if something goes wrong during training, a model checkpoint can be used to restart training from the point of failure.\n",
+    "\n",
+    "Below we'll use Ignite's `ModelCheckpoint` handler to checkpoint the model at the end of each epoch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "9Gl6WT0YznkW"
+   },
+   "outputs": [],
+   "source": [
+    "checkpointer = ModelCheckpoint('/tmp/models', 'textcnn', n_saved=2, create_dir=True)\n",
+    "trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'textcnn': model})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "LxCIriIEznkW"
+   },
+   "source": [
+    "### Run Engine\n",
+    "\n",
+    "Next, we'll run the trainer for 20 epochs and monitor the results. Below we can see that the progress bar prints the loss per iteration and prints the results of training and validation as we specified in our custom functions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "sPe46cQOznkX"
+   },
+   "outputs": [],
+   "source": [
+    "trainer.run(train_iterator, max_epochs=20)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "OpqXiZUsznkY"
+   },
+   "source": [
+    "That's it! We have successfully trained and evaluated a Convolutional Neural Network for Text Classification."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
From a4a37b1eb36a1ccc537dabe29d0f17e5788ee7bf Mon Sep 17 00:00:00 2001
From: cozek
Date: Tue, 20 Apr 2021 23:02:33 +0530
Subject: [PATCH 6/6] remove old imports and use proper variable name

---
 examples/notebooks/TextCNN.ipynb | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/examples/notebooks/TextCNN.ipynb b/examples/notebooks/TextCNN.ipynb
index ab1119cc8c82..82e1f5186089 100644
--- a/examples/notebooks/TextCNN.ipynb
+++ b/examples/notebooks/TextCNN.ipynb
@@ -88,7 +88,6 @@
    },
    "source": [
     "`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:\n",
-    "* **data**: A module to setup the data in the form Fields and Labels.\n",
     "* **datasets**: A module to download NLP datasets.\n",
     "* **GloVe**: A module to download and use pretrained GloVe embedings."
    ]
   },
@@ -101,7 +100,6 @@
    },
    "outputs": [],
    "source": [
-    "from torchtext import data\n",
     "from torchtext import datasets\n",
     "from torchtext.vocab import GloVe"
    ]
   },
@@ -238,8 +236,7 @@
    },
    "outputs": [],
    "source": [
-    "from torchtext.datasets import IMDB\n",
-    "train_iter, test_iter = IMDB(split=('train','test'))"
+    "train_iter, test_iter = datasets.IMDB(split=('train','test'))"
    ]
   },
@@ -432,11 +429,11 @@
     "    \"\"\"Helper function to create the iterators\"\"\"\n",
     "    dataloaders = []\n",
     "    for split in [train_list, validation_list, test_list]:\n",
-    "        bucket_dataloader = DataLoader(\n",
+    "        dataloader = DataLoader(\n",
     "            split, batch_size=batch_size,\n",
     "            collate_fn=collate_batch\n",
     "        )\n",
-    "        dataloaders.append(bucket_dataloader)\n",
+    "        dataloaders.append(dataloader)\n",
     "    return dataloaders\n"
    ]
   },