diff --git a/examples/notebooks/TextCNN.ipynb b/examples/notebooks/TextCNN.ipynb index 286f35ac6ab2..82e1f5186089 100644 --- a/examples/notebooks/TextCNN.ipynb +++ b/examples/notebooks/TextCNN.ipynb @@ -1,5 +1,14 @@ { "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "RfRKTxQO51bK" + }, + "source": [ + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb)" + ] + }, { "cell_type": "markdown", "metadata": { @@ -35,9 +44,9 @@ "\n", "In this example we only need torchtext and spacy package, assuming that `torch` and `ignite` are already installed. We can install it using `pip`:\n", "\n", - "`pip install torchtext spacy`\n", + "`pip install torchtext==0.9.1 spacy`\n", "\n", - "`python -m spacy download en`" + "`python -m spacy download en_core_web_sm`" ] }, { @@ -48,8 +57,8 @@ }, "outputs": [], "source": [ - "!pip install pytorch-ignite torchtext spacy\n", - "!python -m spacy download en" + "!pip install pytorch-ignite torchtext==0.9.1 spacy\n", + "!python -m spacy download en_core_web_sm" ] }, { @@ -79,7 +88,6 @@ }, "source": [ "`torchtext` is a library that provides multiple datasets for NLP tasks, similar to `torchvision`. Below we import the following:\n", - "* **data**: A module to setup the data in the form Fields and Labels.\n", "* **datasets**: A module to download NLP datasets.\n", "* **GloVe**: A module to download and use pretrained GloVe embedings." ] @@ -92,7 +100,6 @@ }, "outputs": [], "source": [ - "from torchtext import data, legacy\n", "from torchtext import datasets\n", "from torchtext.vocab import GloVe" ] @@ -116,11 +123,7 @@ "source": [ "import torch\n", "import torch.nn as nn\n", - "import torch.nn.functional as F\n", - "\n", - "SEED = 1234\n", - "torch.manual_seed(SEED)\n", - "torch.cuda.manual_seed(SEED)" + "import torch.nn.functional as F" ] }, { @@ -153,7 +156,22 @@ "from ignite.engine import Engine, Events\n", "from ignite.metrics import Accuracy, Loss, RunningAverage\n", "from ignite.handlers import ModelCheckpoint, EarlyStopping\n", - "from ignite.contrib.handlers import ProgressBar" + "from ignite.contrib.handlers import ProgressBar\n", + "from ignite.utils import manual_seed\n", + "\n", + "SEED = 1234\n", + "manual_seed(SEED)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-39hgxiUMCq9" + }, + "outputs": [], + "source": [ + "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')" ] }, { @@ -168,149 +186,277 @@ { "cell_type": "markdown", "metadata": { - "id": "tCadZAV_znkH" + "id": "irv_ebeDb8yV" }, "source": [ - "The code below first sets up `TEXT` and `LABEL` as general data objects. \n", - "\n", - "* `TEXT` converts any text to lowercase and produces tensors with the batch dimension first. \n", - "* `LABEL` is a data object that will convert any labels to floats.\n", + "We first set up a tokenizer using `torchtext.data.utils`.\n", + "The job of a tokenizer to split a sentence into \"tokens\". You can read more about it at [wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).\n", + "We will use the tokenizer from the \"spacy\" library which is a popular choice. Feel free to switch to \"basic_english\" if you want to use the default one or any other that you want.\n", "\n", - "Next IMDB training and test datasets are downloaded, the training data is split into training and validation datasets. It takes TEXT and LABEL as inputs so that the data is processed as specified. 
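Since the install cell now pins `torchtext==0.9.1` and downloads the renamed `en_core_web_sm` spaCy model, a quick environment check can catch version drift before any data cell runs. The sketch below is illustrative only; it is not part of the notebook, and the exact version strings to expect are assumptions:

```python
import torch
import torchtext
import spacy

# The notebook pins torchtext==0.9.1; a different version may break the
# Vocab / datasets.IMDB calls used further down.
print("torch:", torch.__version__)
print("torchtext:", torchtext.__version__)

# "en" was renamed to "en_core_web_sm"; loading it verifies the download worked.
nlp = spacy.load("en_core_web_sm")
print("spaCy pipeline:", nlp.pipe_names)
```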
" + "docs: https://pytorch.org/text/stable/data_utils.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "id": "-BttNTcGznkH" + "id": "YNRd5Z_KMANB" }, "outputs": [], "source": [ - "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n", + "from torchtext.data.utils import get_tokenizer\n", + "tokenizer = get_tokenizer(\"spacy\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZknfGdqedSjN" + }, + "outputs": [], + "source": [ + "tokenizer(\"Ignite is a high-level library for training and evaluating neural networks.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZvAmyqHygcZg" + }, + "source": [ + "Next, the IMDB training and test datasets are downloaded. The `torchtext.datasets` API returns the train/test dataset split directly without the preprocessing information. Each split is an iterator which yields the raw texts and labels line-by-line." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "E_jNgWXHhMBQ" + }, + "outputs": [], + "source": [ + "train_iter, test_iter = datasets.IMDB(split=('train','test'))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xNKvG9b7jadd" + }, + "source": [ + "Now we set up the train, validation and test splits. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VzJG7Uh_L9q-" + }, + "outputs": [], + "source": [ + "# We are using only 1000 samples for faster training\n", + "# set to -1 to use full data\n", + "N = 1000 \n", "\n", - "TEXT = legacy.data.Field(lower=True, batch_first=True)\n", - "LABEL = legacy.data.LabelField(dtype=torch.float)\n", + "# We will use 80% of the `train split` for training and the rest for validation\n", + "train_frac = 0.8\n", + "_temp = list(train_iter)\n", "\n", - "train_data, test_data = legacy.datasets.IMDB.splits(TEXT, LABEL, root='/tmp/imdb/')\n", - "train_data, valid_data = train_data.split(split_ratio=0.8, random_state=random.seed(SEED))" + "\n", + "random.shuffle(_temp)\n", + "_temp = _temp[:(N if N > 0 else len(_temp) )]\n", + "n_train = int(len(_temp)*train_frac)\n", + "\n", + "train_list = _temp[:n_train]\n", + "validation_list = _temp[n_train:]\n", + "test_list = list(test_iter)\n", + "test_list = test_list[:(N if N > 0 else len(test_list))]" ] }, { "cell_type": "markdown", "metadata": { - "id": "aAt_J_eDznkI" + "id": "X-qYvdeplMIs" }, "source": [ - "Now we have three sets of the data - train, validation, test. Let's explore what these data objects are and how to extract data from them.\n", - "* `train_data` is **torchtext.data.dataset.Dataset**, this is similar to **torch.utils.data.Dataset**.\n", - "* `train_data[0]` is **torchtext.data.example.Example**, a Dataset is comprised of many Examples.\n", - "* Let's explore the attributes of an example. We see a few methods, but most importantly we see `label` and `text`.\n", - "* `example.text` is the text of that example and `example.label` is the label of the example. 
" + "Let's explore a data sample to see what it looks like.\n", + "Each data sample is a tuple of the format `(label, text)`.\n", + "\n", + "The value of label can is either 'pos' or 'neg'.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "id": "GZZxlIWVznkI" + "id": "qrlLB7PxkIW_" }, "outputs": [], "source": [ - "print('type(train_data) : ', type(train_data))\n", - "print('type(train_data[0]) : ',type(train_data[0]))\n", - "example = train_data[0]\n", - "print('Attributes of Example : ', [attr for attr in dir(example) if '_' not in attr])\n", - "print('example.label : ', example.label)\n", - "print('example.text[:10] : ', example.text[:10])" + "random_sample = random.sample(train_list,1)[0]\n", + "print(' text:', random_sample[1])\n", + "print('label:', random_sample[0])" ] }, { "cell_type": "markdown", "metadata": { - "id": "yCrZzrYaznkK" + "id": "mN5cHrazmMDG" }, "source": [ - "Now that we have an idea of what are split datasets look like, lets dig further into `TEXT` and `LABEL`. It is important that we build our vocabulary based on the train dataset as validation and test are **unseen** in our experimenting. \n", + "Now that we have the datasets splits, let's build our vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important that we build our vocabulary based on the train dataset as validation and test are **unseen** in our experimenting. \n", "\n", - "For `TEXT`, let's download the pretrained **GloVE** 100 dimensional word vectors. This means each word is described by 100 floats! If you want to read more about this, here are a few resources.\n", + "`Vocab` allows us to use pretrained **GloVE** 100 dimensional word vectors. This means each word is described by 100 floats! If you want to read more about this, here are a few resources.\n", "* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)\n", "* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)\n", "* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)\n", "\n", - "We use `TEXT.build_vocab` to do this, let's explore the `TEXT` object more. \n", - "\n", - "Let's explore what `TEXT` object is and how to extract data from them. We see `TEXT` has a few attributes, let's explore vocab, since we just used the build_vocab function. \n", + "Note than the GloVE download size is around 900MB, so it might take some time to download. 
\n", "\n", - "`TEXT.vocab` has the following attributes:\n", + "An instance of the `Vocab` class has the following attributes:\n", "* `extend` is used to extend the vocabulary\n", "* `freqs` is a dictionary of the frequency of each word\n", "* `itos` is a list of all the words in the vocabulary.\n", "* `stoi` is a dictionary mapping every word to an index.\n", - "* `vectors` is a torch.Tensor of the downloaded embeddings" + "* `vectors` is a torch.Tensor of the downloaded embeddings\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "id": "IfGqPAfEznkK" + "id": "T_ukillQMKsh" }, "outputs": [], "source": [ - "TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/'))\n", - "print ('Attributes of TEXT : ', [attr for attr in dir(TEXT) if '_' not in attr])\n", - "print ('Attributes of TEXT.vocab : ', [attr for attr in dir(TEXT.vocab) if '_' not in attr])\n", - "print ('First 5 values TEXT.vocab.itos : ', TEXT.vocab.itos[0:5]) \n", - "print ('First 5 key, value pairs of TEXT.vocab.stoi : ', {key:value for key,value in list(TEXT.vocab.stoi.items())[0:5]}) \n", - "print ('Shape of TEXT.vocab.vectors.shape : ', TEXT.vocab.vectors.shape)" + "from collections import Counter\n", + "from torchtext.vocab import Vocab\n", + "\n", + "counter = Counter()\n", + "\n", + "for (label, line) in train_list:\n", + " counter.update(tokenizer(line))\n", + "\n", + "vocab = Vocab(\n", + " counter,\n", + " min_freq=10,\n", + " vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VYYGwfYsM2Pr" + }, + "outputs": [], + "source": [ + "print(\"The length of the new vocab is\", len(vocab))\n", + "new_stoi = vocab.stoi\n", + "print(\"The index of '' is\", new_stoi[''])\n", + "new_itos = vocab.itos\n", + "print(\"The token at index 2 is\", new_itos[2])" ] }, { "cell_type": "markdown", "metadata": { - "id": "PFfeEinqznkL" + "id": "4Y72cqB6Qhqt" }, "source": [ - "Let's do the same with `LABEL`. We see that there are vectors related to `LABEL`, this is expected because `LABEL` provides the definition of a label of data." + "We now create `text_transform` and `label_transform`, which are callable objects, such as a `lambda` func here, to process the raw text and label data from the dataset iterators (or iterables like a `list`). You can add the special symbols such as `` and `` to the sentence in `text_transform`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { - "id": "5sU3UOXDznkL" + "id": "z_9hw21lP1nG" }, "outputs": [], "source": [ - "LABEL.build_vocab(train_data)\n", - "print ('Attributes of LABEL : ', [attr for attr in dir(LABEL) if '_' not in attr])\n", - "print ('Attributes of LABEL.vocab : ', [attr for attr in dir(LABEL.vocab) if '_' not in attr])\n", - "print ('First 5 values LABEL.vocab.itos : ', LABEL.vocab.itos) \n", - "print ('First 5 key, value pairs of LABEL.vocab.stoi : ', {key:value for key,value in list(LABEL.vocab.stoi.items())})\n", - "print ('Shape of LABEL.vocab.vectors : ', LABEL.vocab.vectors)" + "text_transform = lambda x: [vocab[token] for token in tokenizer(x)]\n", + "label_transform = lambda x: 1 if x == 'pos' else 0\n", + "\n", + "# Print out the output of text_transform\n", + "print(\"input to the text_transform:\", \"here is an example\")\n", + "print(\"output of the text_transform:\", text_transform(\"here is an example\"))" ] }, { "cell_type": "markdown", "metadata": { - "id": "ojXZisxLznkM" + "id": "xtZSEqjJQPxM" + }, + "source": [ + "For generating the data batches we will use `torch.utils.data.DataLoader`. You could customize the data batch by defining a function with the `collate_fn` argument in the DataLoader. Here, in the `collate_batch` func, we process the raw text data and add padding to dynamically match the longest sentence in a batch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NHHIEfpRP4TV" }, + "outputs": [], "source": [ - "Now we must convert our split datasets into iterators, we'll take advantage of **torchtext.data.BucketIterator**! BucketIterator pads every element of a batch to the length of the longest element of the batch." + "from torch.utils.data import DataLoader\n", + "from torch.nn.utils.rnn import pad_sequence\n", + "\n", + "def collate_batch(batch):\n", + " label_list, text_list = [], []\n", + " for (_label, _text) in batch:\n", + " label_list.append(label_transform(_label))\n", + " processed_text = torch.tensor(text_transform(_text))\n", + " text_list.append(processed_text)\n", + " return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "id": "tM4yWV9YznkM" + "id": "3IQd3EVbQvTo" }, "outputs": [], "source": [ - "train_iterator, valid_iterator, test_iterator = legacy.data.BucketIterator.splits((train_data, valid_data, test_data), \n", - " batch_size=32,\n", - " device=device)" + "batch_size = 8 # A batch size of 8\n", + "\n", + "def create_iterators(batch_size=8):\n", + " \"\"\"Heler function to create the iterators\"\"\"\n", + " dataloaders = []\n", + " for split in [train_list, validation_list, test_list]:\n", + " dataloader = DataLoader(\n", + " split, batch_size=batch_size,\n", + " collate_fn=collate_batch\n", + " )\n", + " dataloaders.append(dataloader)\n", + " return dataloaders\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CudYIZitUNgx" + }, + "outputs": [], + "source": [ + "train_iterator, valid_iterator, test_iterator = create_iterators()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "787zNPm6RtKE" + }, + "outputs": [], + "source": [ + "next(iter(train_iterator))" ] }, { @@ -320,10 +466,10 @@ }, "source": [ "Let's actually explore what the output of the iterator is, this way we'll know what the input of the model is, how to compare the label to the output and how to setup are 
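`collate_batch` leans on `pad_sequence`, which with its default `batch_first=False` stacks the batch time-first: the returned text tensor has shape `(longest_sentence_length, batch_size)`, and shorter reviews are filled with `padding_value`. A self-contained toy sketch of that behaviour (the numbers are arbitrary):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "sentences" of different lengths, already converted to token indices.
toy = [torch.tensor([4, 9, 2]), torch.tensor([7, 5]), torch.tensor([11, 3, 8, 6])]

padded = pad_sequence(toy, padding_value=3.0)
print(padded.shape)   # torch.Size([4, 3]) -> (longest length, batch size)
print(padded)         # shorter sequences are filled with the padding value
```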
process_functions for Ignite's `Engine`.\n",
- "* `batch.label[0]` is the label of a single example. We can see that `LABEL.vocab.stoi` was used to map the label that originally text into a float.\n",
- "* `batch.text[0]` is the text of a single example. Similar to label, `TEXT.vocab.stoi` was used to convert each token of the example's text into indices.\n",
+ "* `batch[0][0]` is the label of a single example. We can see that `label_transform` was used to map the label, which was originally text, to an integer.\n",
+ "* `batch[1][:, 0]` is the text of a single example (the batch is sequence-first). Similar to the label, `vocab.stoi` was used to convert each token of the example's text into indices.\n",
  "\n",
- "Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, this means that the bucket iterator is doing exactly as we hoped!"
+ "Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, this means that the iterator is working as expected."
  ]
 },
 {
@@ -335,13 +481,13 @@
  "outputs": [],
  "source": [
  "batch = next(iter(train_iterator))\n",
- "print('batch.label[0] : ', batch.label[0])\n",
- "print('batch.text[0] : ', batch.text[0][batch.text[0] != 1])\n",
+ "print('batch[0][0] : ', batch[0][0])\n",
+ "print('batch[1][:, 0] : ', batch[1][:, 0])\n",
  "\n",
  "lengths = []\n",
  "for i, batch in enumerate(train_iterator):\n",
- " x = batch.text\n",
- " lengths.append(x.shape[1])\n",
+ " x = batch[1]\n",
+ " lengths.append(x.shape[0])\n",
  " if i == 10:\n",
  " break\n",
  "\n",
@@ -395,7 +541,13 @@
  "outputs": [],
  "source": [
  "class TextCNN(nn.Module):\n",
- " def __init__(self, vocab_size, embedding_dim, kernel_sizes, num_filters, num_classes, d_prob, mode):\n",
+ " def __init__(\n",
+ " self,\n",
+ " vocab_size,\n",
+ " embedding_dim, \n",
+ " kernel_sizes, \n",
+ " num_filters, \n",
+ " num_classes, d_prob, mode):\n",
  " super(TextCNN, self).__init__()\n",
  " self.vocab_size = vocab_size\n",
  " self.embedding_dim = embedding_dim\n",
@@ -404,7 +556,8 @@
  " self.num_classes = num_classes\n",
  " self.d_prob = d_prob\n",
  " self.mode = mode\n",
- " self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=1)\n",
+ " self.embedding = nn.Embedding(\n",
+ " vocab_size, embedding_dim, padding_idx=0)\n",
  " self.load_embeddings()\n",
  " self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,\n",
  " out_channels=num_filters,\n",
@@ -414,7 +567,7 @@
  "\n",
  " def forward(self, x):\n",
  " batch_size, sequence_length = x.shape\n",
- " x = self.embedding(x).transpose(1, 2)\n",
+ " x = self.embedding(x.T).transpose(1, 2)\n",
  " x = [F.relu(conv(x)) for conv in self.conv]\n",
  " x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]\n",
  " x = torch.cat(x, dim=1)\n",
@@ -423,7 +576,7 @@
  "\n",
  " def load_embeddings(self):\n",
  " if 'static' in self.mode:\n",
- " self.embedding.weight.data.copy_(TEXT.vocab.vectors)\n",
+ " self.embedding.weight.data.copy_(vocab.vectors)\n",
  " if 'non' not in self.mode:\n",
  " self.embedding.weight.data.requires_grad = False\n",
  " print('Loaded pretrained embeddings, weights are not trainable.')\n",
@@ -462,7 +615,7 @@
  },
  "outputs": [],
  "source": [
- "vocab_size, embedding_dim = TEXT.vocab.vectors.shape\n",
+ "vocab_size, embedding_dim = vocab.vectors.shape\n",
  "\n",
  "model = TextCNN(vocab_size=vocab_size,\n",
  " embedding_dim=embedding_dim,\n",
@@ -521,9 +674,11 @@
  "def process_function(engine, batch):\n",
  " model.train()\n",
  " 
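To connect the sequence-first batches with the `TextCNN` layers, it can help to push a dummy batch through the model and inspect the shapes at each stage. A hedged sketch: it assumes the model was built with the hyperparameters the notebook appears to use (`kernel_sizes=[3, 4, 5]`, `num_filters=100`, a single output class), which are not all visible in this diff:

```python
import torch

# Dummy batch shaped like the DataLoader output: (sequence_length, batch_size),
# created on whatever device the model currently lives on.
dev = next(model.parameters()).device
dummy = torch.randint(0, vocab_size, (50, 8), device=dev)

model.eval()
with torch.no_grad():
    emb = model.embedding(dummy.T).transpose(1, 2)    # (batch, embedding_dim, seq_len)
    conv_out = [torch.relu(conv(emb)) for conv in model.conv]
    print([c.shape for c in conv_out])                # one (batch, num_filters, L_k) per kernel size
    logits = model(dummy)
    print(logits.shape)                               # (batch,) -- one raw logit per review
```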
optimizer.zero_grad()\n", - " x, y = batch.text, batch.label\n", + " y, x = batch\n", + " x = x.to(device)\n", + " y = y.to(device)\n", " y_pred = model(x)\n", - " loss = criterion(y_pred, y)\n", + " loss = criterion(y_pred, y.float())\n", " loss.backward()\n", " optimizer.step()\n", " return loss.item()" @@ -561,7 +716,10 @@ "def eval_function(engine, batch):\n", " model.eval()\n", " with torch.no_grad():\n", - " x, y = batch.text, batch.label\n", + " y, x = batch\n", + " y = y.to(device)\n", + " x = x.to(device)\n", + " y = y.float()\n", " y_pred = model(x)\n", " return y_pred, y" ] @@ -806,8 +964,7 @@ "cell_type": "code", "execution_count": null, "metadata": { - "id": "sPe46cQOznkX", - "scrolled": false + "id": "sPe46cQOznkX" }, "outputs": [], "source": [ @@ -825,11 +982,6 @@ } ], "metadata": { - "accelerator": "GPU", - "colab": { - "name": "TextCNN.ipynb", - "provenance": [] - }, "kernelspec": { "display_name": "Python 3", "language": "python", @@ -845,9 +997,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.3" + "version": "3.8.8" } }, "nbformat": 4, - "nbformat_minor": 1 + "nbformat_minor": 4 }
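The cells that follow in the full notebook attach `process_function` and `eval_function` to Ignite engines; the sketch below condenses that wiring for orientation. It assumes `criterion` is a logit-based binary loss such as `nn.BCEWithLogitsLoss` (so predictions are thresholded with a sigmoid), and the handler arguments are illustrative rather than the notebook's exact values:

```python
import torch
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy, Loss, RunningAverage
from ignite.contrib.handlers import ProgressBar

trainer = Engine(process_function)
evaluator = Engine(eval_function)

# Binary accuracy needs hard 0/1 predictions, so sigmoid + round the raw logits.
def thresholded_output(output):
    y_pred, y = output
    return torch.round(torch.sigmoid(y_pred)), y

Accuracy(output_transform=thresholded_output).attach(evaluator, 'accuracy')
Loss(criterion).attach(evaluator, 'bce')
RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')
ProgressBar(persist=True).attach(trainer, ['loss'])

@trainer.on(Events.EPOCH_COMPLETED)
def log_validation(engine):
    # Run the evaluator on the validation DataLoader after every epoch.
    evaluator.run(valid_iterator)
    metrics = evaluator.state.metrics
    print(f"Epoch {engine.state.epoch}: "
          f"val accuracy={metrics['accuracy']:.3f}, val loss={metrics['bce']:.3f}")

trainer.run(train_iterator, max_epochs=5)
```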