# TorchText Dataset Loading and Processing Tutorial

As a deep learning framework, PyTorch provides many tools for data loading and processing. In the past, TorchText datasets were developed and maintained on top of TorchText's own utilities (such as [Iterator](https://github.com/pytorch/text/blob/284a51651dd9697f9afd76f2ceb23a8181ae7552/torchtext/data/iterator.py#L15)/[Batch](https://github.com/pytorch/text/blob/284a51651dd9697f9afd76f2ceb23a8181ae7552/torchtext/data/batch.py#L4)). This tutorial introduces [torch.utils.data.DataLoader](https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py) and shows how to use it to load a TorchText dataset. The PyTorch DataLoader is widely used by the community, especially in [TorchVision](https://github.com/pytorch/vision/tree/master/torchvision), and is actively maintained by the developers. We hope to use the PyTorch DataLoader for future TorchText datasets.

## Load TorchText Dataset

The AG_NEWS dataset has recently been added to TorchText for supervised learning. The original dataset contains training and testing examples, and each example contains a list of tokens and a label. When the "ngrams" parameter is set, the token list also includes n-grams, i.e. contiguous sequences of n items taken from the given text.

In [None]:
import torch
import torchtext

from text_classification import AG_NEWS
txt_cls = AG_NEWS(ngrams=2)
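
To make the effect of "ngrams" concrete, here is a minimal sketch in plain Python. The helper `add_ngrams` is hypothetical and is not the TorchText implementation; it assumes that the single tokens are kept and that each n-gram is stored as its tokens joined by a space.

```python
def add_ngrams(tokens, ngrams):
    """Return the tokens followed by every contiguous n-gram for n = 2..ngrams.

    Hypothetical helper for illustration only; the joining convention
    (a single space) is an assumption.
    """
    result = list(tokens)
    for n in range(2, ngrams + 1):
        # Slide a window of width n over the token list.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


example = add_ngrams(["here", "we", "are"], ngrams=2)
print(example)  # ['here', 'we', 'are', 'here we', 'we are']
```

With `ngrams=2`, a three-token text yields the three original tokens plus two bigrams, so the token list grows roughly in proportion to n.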

Split train_examples into a training set and a validation set with a ratio of 0.7 (train) to 0.3 (valid). The test ratio is set to 0.0 because the test examples come from the dataset itself.

In [None]:
rnd = torchtext.data.dataset.RandomShuffler(None)
train_examples, _test_examples, valid_examples = \
    torchtext.data.dataset.rationed_split(txt_cls.train_examples, 0.7, 0.0, 0.3, rnd)
test_examples = txt_cls.test_examples
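
The idea behind a ratio-based split can be sketched without TorchText: shuffle the examples, then cut the shuffled list at the requested fraction. The function `ratio_split` and its `seed` parameter are hypothetical names used only for this illustration; this is not how `rationed_split` is implemented.

```python
import random


def ratio_split(examples, train_ratio, seed=0):
    """Shuffle a copy of `examples` and cut it at `train_ratio`.

    Hypothetical helper; `seed` makes the shuffle reproducible.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train_part, valid_part = ratio_split(range(10), train_ratio=0.7)
print(len(train_part), len(valid_part))  # 7 3
```

Every example ends up in exactly one of the two parts, which is the property the training/validation split relies on.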

## Convert into dataset

Let's create a dataset for the text problem. The text data (i.e., "examples") and the corresponding processors (i.e., "fields") are stored in the instance.

In [None]:
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, examples, fields):
        self.examples = examples
        self.fields = fields

    def __getitem__(self, i):
        return self.examples[i]

    def __len__(self):
        try:
            return len(self.examples)
        except TypeError:
            return 2**32

    def __iter__(self):
        for x in self.examples:
            yield x

train_dataset = TextDataset(train_examples, txt_cls.fields)
test_dataset = TextDataset(test_examples, txt_cls.fields)
valid_dataset = TextDataset(valid_examples, txt_cls.fields)

## Pad and Numericalize Tokens

Up to now, the examples stored in TextDataset are still lists of tokens. To use the PyTorch DataLoader, we have to pad the token lists to the same length and convert them into PyTorch tensors. Note that the tensor produced by applying [Field.pad](https://github.com/pytorch/text/blob/284a51651dd9697f9afd76f2ceb23a8181ae7552/torchtext/data/field.py#L240) and then [Field.numericalize](https://github.com/pytorch/text/blob/284a51651dd9697f9afd76f2ceb23a8181ae7552/torchtext/data/field.py#L311) has the shape [seq_length * N], where seq_length is the length of the padded token lists and N is the number of examples. It has to be transposed to the shape [N * seq_length] so that each row holds one example.

In [None]:
def pad_and_numericalize(dataset, device=None):

    def convert_examples_to_dict(examples, attr_name_list):

        examples_dict = {att: [] for att in attr_name_list}

        for ex in examples:
            for name in attr_name_list:
                if hasattr(ex, name):
                    examples_dict[name].append(ex.__dict__[name])
                else:
                    print("no attribute found: ", name)

        return examples_dict
    
    examples, fields = convert_examples_to_dict(dataset.examples,
                                                ['text', 'label']), dataset.fields

    # The output of Field.numericalize is in the shape of
    # [seq_length * N]
    # Transpose it into the shape of
    # [N * seq_length]
    examples['text'] = fields['text'].numericalize(fields['text'].pad(examples['text']),
                                                   device=device).transpose(0, 1)
    examples['label'] = fields['label'].numericalize(fields['label'].
                                                     pad(examples['label']),
                                                     device=device)
    dataset.examples = [{'text': text, 'label': label} for (text, label)
                        in zip(examples['text'], examples['label'])]
    return dataset

# Pad/numericalize train/test/valid data
train_dataset = pad_and_numericalize(train_dataset, device=torch.device("cpu"))
test_dataset = pad_and_numericalize(test_dataset, device=torch.device("cpu"))
valid_dataset = pad_and_numericalize(valid_dataset, device=torch.device("cpu"))
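
The two steps, padding and numericalizing, together with the two layouts mentioned above, can be illustrated on a toy batch in plain Python. The helpers `pad_batch` and `numericalize`, the `"<pad>"` token, and the tiny `vocab` are all assumptions made for this sketch; the tutorial itself relies on the Field methods.

```python
def pad_batch(token_lists, pad_token="<pad>"):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_len - len(tokens))
            for tokens in token_lists]


def numericalize(padded, vocab):
    """Replace every token by its integer id from `vocab`."""
    return [[vocab[token] for token in tokens] for tokens in padded]


token_lists = [["good", "news"], ["bad"], ["no", "news", "today"]]
vocab = {"<pad>": 0, "good": 1, "news": 2, "bad": 3, "no": 4, "today": 5}

# N rows, each holding seq_length ids (one row per example).
batch_first = numericalize(pad_batch(token_lists), vocab)
# seq_length rows, each holding N ids (the transposed layout).
seq_first = [list(column) for column in zip(*batch_first)]

print(batch_first)  # [[1, 2, 0], [3, 0, 0], [4, 2, 5]]
print(seq_first)    # [[1, 3, 4], [2, 0, 2], [0, 0, 5]]
```

Indexing `batch_first[i]` returns one complete example, which is the layout a map-style dataset needs in order to hand out examples one at a time.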

## Add to DataLoader

Instead of the existing [Iterator](https://github.com/pytorch/text/blob/284a51651dd9697f9afd76f2ceb23a8181ae7552/torchtext/data/iterator.py#L15)/[Batch](https://github.com/pytorch/text/blob/284a51651dd9697f9afd76f2ceb23a8181ae7552/torchtext/data/batch.py#L4) classes in TorchText, this tutorial applies the PyTorch [DataLoader](https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py) to the AG_NEWS dataset. torch.utils.data.DataLoader is an iterable over a dataset that supports the following:
- Batching the data
- Shuffling the data
- Loading the data in parallel using multiprocessing workers
- Saving/loading the processed data for reuse (through the standard torch save/load methods)

In [None]:
from torch.utils.data import DataLoader

train_dataset_dataloader = DataLoader(train_dataset, batch_size=128,
                                      shuffle=True, num_workers=4)
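# Evaluation does not depend on the order of the examples,
# so the test and validation loaders are not shuffled.
test_dataset_dataloader = DataLoader(test_dataset, batch_size=128,
                                     shuffle=False, num_workers=4)
valid_dataset_dataloader = DataLoader(valid_dataset, batch_size=128,
                                      shuffle=False, num_workers=4)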
for i_batch, sample_batched in enumerate(train_dataset_dataloader):
    print(i_batch, sample_batched['text'].size(), sample_batched['label'].size())
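
To see what a batch looks like without downloading AG_NEWS, the following sketch feeds a toy dataset to the DataLoader. `ToyTextDataset` is a hypothetical stand-in for the padded TextDataset above: it holds 10 examples, each a dictionary with a length-6 `text` tensor and a scalar `label` tensor. `num_workers=0` is used here only to keep the sketch simple.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyTextDataset(Dataset):
    """Hypothetical stand-in for the padded TextDataset."""

    def __init__(self):
        # 10 examples; each "text" is a padded sequence of 6 token ids.
        self.examples = [{"text": torch.full((6,), i, dtype=torch.long),
                          "label": torch.tensor(i % 4)}
                         for i in range(10)]

    def __getitem__(self, i):
        return self.examples[i]

    def __len__(self):
        return len(self.examples)


loader = DataLoader(ToyTextDataset(), batch_size=4, shuffle=False, num_workers=0)

# The default collate function stacks the tensors key by key.
batch_shapes = [(tuple(batch["text"].shape), tuple(batch["label"].shape))
                for batch in loader]
print(batch_shapes)  # [((4, 6), (4,)), ((4, 6), (4,)), ((2, 6), (2,))]
```

Each batch is again a dictionary, with `text` of shape [batch_size * seq_length]; the last batch is smaller because 10 is not a multiple of 4.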

## Save/load dataloader

Saving post-processed text data has long been requested by the TorchText user community, because data processing is usually time-consuming. With the PyTorch DataLoader, we can apply the standard torch save/load methods and reuse the post-processed data.

In [None]:
torch.save([train_dataset_dataloader, test_dataset_dataloader, valid_dataset_dataloader],
           './dataloader_text_cls_example.pt')
dataloader_train, dataloader_test, dataloader_valid = torch.load('./dataloader_text_cls_example.pt')
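
Because torch.save relies on pickle, loading the saved DataLoader objects requires the TextDataset class to be importable in the loading session. A lighter-weight variant, sketched below under the assumption that the padded data can be held as plain tensors, is to save only a dictionary of tensors and rebuild the dataset and DataLoader after loading. The `processed` dictionary and the file name are made up for this illustration.

```python
import os
import tempfile

import torch

# Hypothetical post-processed data: 3 examples padded to length 4.
processed = {"text": torch.arange(12).reshape(3, 4),
             "label": torch.tensor([0, 1, 2])}

# Write to a temporary directory so the sketch leaves the working dir clean.
path = os.path.join(tempfile.mkdtemp(), "processed_text_cls.pt")
torch.save(processed, path)
restored = torch.load(path)

# Check that every tensor survived the round trip unchanged.
roundtrip_ok = all(torch.equal(processed[key], restored[key])
                   for key in processed)
print(roundtrip_ok)  # True
```

The trade-off is one extra step after loading (wrapping the tensors in a dataset again) in exchange for a file that does not depend on custom class definitions.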