<a href="https://colab.research.google.com/github/zhangguanheng66/text/blob/migration_tutorial/examples/legacy_tutorial/migration_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture

# [TODO] Update to torchtext 0.9.0 release
!pip install --pre torchtext==0.9.0.dev20210220 -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

# Reload environment
exit()

This tutorial shows how to migrate from the legacy torchtext API to the new API in the 0.9.0 release, using the IMDB dataset for sentiment analysis as the example. Both the legacy and new APIs preprocess the text input and prepare the data to train/validate a model with the following steps:

*   Train/val/test split: generate the train/validate/test datasets, if they are available
*   Tokenization: break a raw text string sentence into a list of words
*   Vocab: define a "contract" from tokens to indexes
*   Numericalize: convert a list of tokens to the corresponding indexes
*   Batch: generate batches of data samples and add padding if necessary
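
The steps above can be sketched end to end in plain Python, without torchtext. This is only an illustration: the toy corpus, the whitespace tokenizer, and names such as `toy_samples` and `PAD_INDEX` are assumptions and are not part of either API.

```python
from collections import Counter

# Hypothetical toy data: (label, text) pairs standing in for IMDB samples.
toy_samples = [("pos", "a great movie"), ("neg", "a dull and slow movie")]

# Tokenization: split each raw string into a list of words.
tokenized = [text.lower().split() for _, text in toy_samples]

# Vocab: map every token to an index; index 0 is reserved for padding here.
PAD_INDEX = 0
counter = Counter(token for tokens in tokenized for token in tokens)
stoi = {"<pad>": PAD_INDEX}
for token, _count in counter.most_common():
    stoi[token] = len(stoi)

# Numericalize: convert tokens to their indexes.
numericalized = [[stoi[token] for token in tokens] for tokens in tokenized]

# Batch: pad every sequence to the length of the longest one.
max_len = max(len(ids) for ids in numericalized)
batch = [ids + [PAD_INDEX] * (max_len - len(ids)) for ids in numericalized]
print(batch)
```

The rest of the tutorial shows how each of these steps is expressed in the legacy and in the new API.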

## Step 1: Create a dataset object
----------------------------

First of all, we create a dataset for sentiment analysis. Each data sample contains a label and a text string.

### *Legacy*
In the legacy code, the `Field` class is used for data processing, including tokenization and numericalization. To inspect the dataset, users first need to set up the TEXT/LABEL fields.

In [None]:
import torchtext
import torch
from torchtext.legacy import data
from torchtext.legacy import datasets

TEXT = data.Field()
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets

You can print out the raw data by inspecting `Dataset.examples`. The text of each example is stored as a list of tokens.

In [None]:
legacy_examples = legacy_train.examples
print(legacy_examples[0].text, legacy_examples[0].label)

### *New*
The new dataset API returns the train/test dataset split directly without the preprocessing information. Each split is an iterator which yields the raw texts and labels line-by-line.

In [None]:
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))

To print out the raw data, you can call the next() function on the IterableDataset.

In [None]:
next(train_iter)
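
Because each split is an iterator, samples are consumed as they are read, which is why later cells in this tutorial call `IMDB(split='train')` again before each new pass over the data. The following sketch shows the same behavior with a plain Python generator; `toy_split` is a hypothetical stand-in for a dataset split, not a torchtext function.

```python
def toy_split():
    """Hypothetical stand-in for a dataset split that yields (label, text) pairs."""
    yield ("pos", "a great movie")
    yield ("neg", "a dull movie")

toy_iter = toy_split()
first = next(toy_iter)        # consumes the first sample
remaining = list(toy_iter)    # consumes the rest of the samples
exhausted = list(toy_iter)    # nothing is left on a second pass
print(first, len(remaining), len(exhausted))
```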

## Step 2: Build the data processing pipeline
----------------------------

### *Legacy*

The default tokenizer implemented in the `Field` class is Python's built-in `split()` function. Users can choose a different tokenizer by calling `data.get_tokenizer()` and passing the result to the `Field` constructor. For sequence models, it's common to add begin-of-sentence and end-of-sentence tokens; these special tokens are passed to the `Field` constructor as `init_token` and `eos_token` (`<SOS>` and `<EOS>` in the cell below).

In [None]:
TEXT = data.Field(tokenize=data.get_tokenizer('basic_english'),
                  init_token='<SOS>', eos_token='<EOS>', lower=True)
LABEL = data.LabelField(dtype = torch.long)
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets

Now you can create a vocabulary of the words from the text stored in the predefined `Field` object, `TEXT`. You first have to build the vocabulary in your `Field` object by passing the dataset to the `build_vocab` function. The `Field` object builds the vocabulary (`TEXT.vocab`) on a specific data split.

In [None]:
TEXT.build_vocab(legacy_train)
LABEL.build_vocab(legacy_train)

Things you can do with a vocabulary object:


*   Total length of the vocabulary
*   String2Index (stoi) and Index2String (itos)
*   A purpose-specific vocabulary which contains only words appearing at least N times



In [None]:
legacy_vocab = TEXT.vocab
print("The length of the legacy vocab is", len(legacy_vocab))
legacy_stoi = legacy_vocab.stoi
print("The index of 'example' is", legacy_stoi['example'])
legacy_itos = legacy_vocab.itos
print("The token at index 686 is", legacy_itos[686])

# Set up the min_freq value in the Vocab class
TEXT.build_vocab(legacy_train, min_freq=10)
legacy_vocab2 = TEXT.vocab
print("The length of the legacy vocab is", len(legacy_vocab2))

The length of the legacy vocab is 100686
The index of 'example' is 467
The token at index 686 is knew
The length of the legacy vocab is 20439


### *New*

Users can access different kinds of tokenizers directly via the `get_tokenizer()` function in `torchtext.data.utils`.

In [None]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

For more flexibility, users can build the vocabulary directly with the `Vocab` class. For example, the `min_freq` argument sets the cutoff frequency below which tokens are left out of the vocabulary. Special tokens such as `<BOS>` and `<EOS>` can be passed through the `specials` argument of the `Vocab` constructor.

In [None]:
from collections import Counter
from torchtext.vocab import Vocab

train_iter = IMDB(split='train')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>', '<PAD>'))

In [None]:
print("The length of the new vocab is", len(vocab))
new_stoi = vocab.stoi
print("The index of '<BOS>' is", new_stoi['<BOS>'])
new_itos = vocab.itos
print("The token at index 2 is", new_itos[2])

The length of the new vocab is 20439
The index of '<BOS>' is 1
The token at index 2 is <EOS>
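
To make the index assignment concrete, here is a minimal pure-Python sketch of a frequency cutoff combined with leading special tokens. It is an illustration under stated assumptions, not torchtext's implementation: `build_toy_vocab` is a hypothetical helper, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter

def build_toy_vocab(counter, min_freq=1,
                    specials=("<unk>", "<BOS>", "<EOS>", "<PAD>")):
    """Hypothetical helper: specials first, then tokens with count >= min_freq."""
    itos = list(specials)
    # Sort by descending frequency, breaking ties alphabetically (an assumption).
    for token, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos

toy_counter = Counter("the movie was good the movie was long the end".split())
toy_stoi, toy_itos = build_toy_vocab(toy_counter, min_freq=2)
print(toy_itos)
```

With `min_freq=2`, the words that occur only once ("good", "long", "end") are dropped, and the special tokens occupy indexes 0 to 3, matching the `<BOS>` and `<EOS>` indexes printed above.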


Both `text_transform` and `label_transform` are callable objects (lambda functions here) that process the raw text and label data from the dataset iterators. Users can add the special symbols `<BOS>` and `<EOS>` to the sentence in `text_transform`.

In [None]:
text_transform = lambda x: [vocab['<BOS>']] + [vocab[token] for token in tokenizer(x)] + [vocab['<EOS>']]
label_transform = lambda x: 1 if x == 'pos' else 0

# Print out the output of text_transform
print("input to the text_transform:", "here is an example")
print("output of the text_transform:", text_transform("here is an example"))

input to the text_transform: here is an example
output of the text_transform: [1, 134, 12, 43, 467, 2]
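
The sketch below mirrors the two transforms with a plain dictionary so that the handling of tokens missing from the vocabulary is visible. It assumes that unseen tokens fall back to the `<unk>` index; `toy_stoi`, `toy_text_transform`, and `toy_label_transform` are hypothetical names, and the whitespace split only stands in for the `basic_english` tokenizer.

```python
# Hypothetical toy mapping; a real vocabulary would be built from the training data.
toy_stoi = {"<unk>": 0, "<BOS>": 1, "<EOS>": 2, "<PAD>": 3,
            "here": 4, "is": 5, "an": 6, "example": 7}

def toy_text_transform(text):
    """Wrap the token indexes of `text` with the <BOS> and <EOS> indexes."""
    tokens = text.lower().split()  # stand-in for the basic_english tokenizer
    # Tokens missing from the mapping fall back to the <unk> index (assumption).
    ids = [toy_stoi.get(token, toy_stoi["<unk>"]) for token in tokens]
    return [toy_stoi["<BOS>"]] + ids + [toy_stoi["<EOS>"]]

def toy_label_transform(label):
    """Map the string label to an integer class: 'pos' -> 1, anything else -> 0."""
    return 1 if label == "pos" else 0

print(toy_text_transform("here is an unseen example"))
print(toy_label_transform("pos"), toy_label_transform("neg"))
```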


## Step 3: Generate batch iterator
--------------------------------

To train a model efficiently, it's recommended to build an iterator that generates batches of data.

### *Legacy*
The legacy `Iterator` class is used to batch the dataset and send it to the target device, such as CPU or GPU.

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)  # datasets here refers to torchtext.legacy.datasets
legacy_train_iterator, legacy_test_iterator = data.Iterator.splits(
    (legacy_train, legacy_test),
    batch_size=8, device = device)

For an NLP workflow, it's also common to define an iterator that batches texts of similar lengths together. The legacy `BucketIterator` class in the torchtext library does this to minimize the amount of padding needed.

In [None]:
from torchtext.legacy.data import BucketIterator
legacy_train, legacy_test = datasets.IMDB.splits(TEXT, LABEL)
# Use the imported BucketIterator; sort_key groups examples by text length
legacy_train_bucketiterator, legacy_test_bucketiterator = BucketIterator.splits(
    (legacy_train, legacy_test),
    sort_key=lambda x: len(x.text),
    batch_size=8, device = device)

### *New*

`torch.utils.data.DataLoader` is used to generate batches of data. Users can customize the batches by passing a function to the `collate_fn` argument of `DataLoader`. Here, the `collate_batch` function processes the raw text data and pads each sample to the length of the longest sentence in the batch.

In [None]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
    # Pad with the index of '<PAD>' (3 in the vocab built above)
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=vocab['<PAD>'])

train_iter = IMDB(split='train')
train_dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=True, 
                              collate_fn=collate_batch)
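
With its default `batch_first=False`, `pad_sequence` returns a tensor of shape `(longest_length, batch_size)`, so each row holds one position across the whole batch. The following pure-Python sketch illustrates that layout with nested lists; `toy_pad_sequence` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def toy_pad_sequence(sequences, padding_value=0):
    """Pad lists to equal length and return them time-major: one row per position."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch, time) to (time, batch), mirroring batch_first=False.
    return [list(step) for step in zip(*padded)]

# Two numericalized sentences of lengths 3 and 5, padded with index 3 (<PAD>).
toy_batch = toy_pad_sequence([[1, 4, 2], [1, 5, 6, 7, 2]], padding_value=3)
print(len(toy_batch), len(toy_batch[0]))
```

A model that expects batch-major input would need `batch_first=True` in `pad_sequence` or a transpose of the padded tensor.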

To group texts of similar length together, as the legacy `BucketIterator` class does, we first randomly create multiple "pools", each of size `batch_size * 100`. We then sort the samples within each pool by length and pass the sorted samples to `DataLoader` to generate the batch iterator. Note that the `shuffle` argument in `DataLoader` is set to `False` to keep the sample order within the "pool".

In [None]:
# Sort the sentences within the "pool" of batch_size * 100.
def bucket_iter_func(pool, batch_size=64):
    for rand_item in pool:
        sorted_item = sorted(rand_item,
                             key=lambda x: len(tokenizer(x[1])))  # x is a tuple of (label, text)
        sorted_dataloader = DataLoader(sorted_item, batch_size=batch_size,
                                       shuffle=False,  # shuffle is set to False to keep the order
                                       collate_fn=collate_batch)
        for item in sorted_dataloader:
            yield item

train_iter = IMDB(split='train')
train_list = list(train_iter)
batch_size = 8 # A batch size of 8
rand_pools = DataLoader(train_list, batch_size=batch_size*100,
                        shuffle=True, collate_fn=lambda x: x)
sorted_train_dataloader = bucket_iter_func(rand_pools, batch_size=batch_size)
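
The pooling idea can also be checked on toy data without `DataLoader`. The sketch below is a simplified pure-Python version: `toy_bucket_batches` and its `pool_factor` and `seed` parameters are hypothetical, and lists of zeros stand in for numericalized sentences.

```python
import random

def toy_bucket_batches(samples, batch_size, pool_factor=100, seed=0):
    """Yield batches whose samples have similar lengths.

    Hypothetical helper: shuffle, cut into pools of batch_size * pool_factor,
    sort each pool by length, then slice the pool into batches.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    pool_size = batch_size * pool_factor
    for start in range(0, len(shuffled), pool_size):
        pool = sorted(shuffled[start:start + pool_size], key=len)
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]

toy_sequences = [[0] * n for n in range(1, 41)]  # 40 sequences of length 1..40
batches = list(toy_bucket_batches(toy_sequences, batch_size=4, pool_factor=5))
# Difference between the longest and shortest sequence in each batch.
spread = [len(batch[-1]) - len(batch[0]) for batch in batches]
print(len(batches), max(spread))
```

A larger pool gives tighter length grouping within each batch but less randomness in batch composition; a pool the size of one batch reduces to plain shuffling.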

## Step 4: Iterate batch to train a model
-------------------------------

Iterating over the batches while training and validating a model is almost the same in the legacy and new APIs.

### *Legacy*

The legacy batch iterator can be looped over directly, or a single batch can be fetched with the `next()` function.

In [None]:
# for item in legacy_train_iterator:
#   model(item)

# Or
next(iter(legacy_train_iterator))

### *New*

The `DataLoader` can be looped over directly, or a single batch can be fetched with the `next()` function.

In [None]:
# for idx, (label, text) in enumerate(train_dataloader):
#   model(text)

# Or
next(iter(train_dataloader))