# Data 101 with `gluon` and `gluonnlp` - the second

This notebook continues our demonstration of GluonNLP's tools for getting data into the shape required for training. In particular, we cover:

- Transforming data to batches
- DataStream abstractions in `gluonnlp`

In [1]:
import warnings
warnings.filterwarnings('ignore')

import itertools
from mxnet import gluon
import gluonnlp as nlp

## Creating batches - Batchify

To train our models, data alone is not sufficient.
It must be in the right format for the task and arranged for efficient computation.

This usually means grouping data into batches so that we can process multiple examples in parallel and make the best use of the available GPU resources.

### Example: Padding

For efficient training we need to transform the input data into batches.
Consider a model that takes independent sentences as inputs. To batchify the data,
each sentence in a batch must have the same length.
As it is unlikely that we find enough sentences of every possible length,
it is common to pad all sentences in a batch to the same length.

Such padded inputs are common in machine translation.

In [2]:
data = nlp.data.WikiText2(flatten=False)
vocab = nlp.Vocab(nlp.data.count_tokens(itertools.chain.from_iterable(data)))
coded = data.transform(vocab.to_indices)

For WikiText2, the first 100 paragraphs have the following lengths.

In [3]:
print([len(s) for s in data[:100]])

[6, 128, 92, 102, 6, 212, 297, 85, 6, 189, 222, 204, 6, 268, 249, 8, 305, 8, 267, 90, 6, 66, 125, 145, 93, 6, 65, 8, 196, 223, 120, 10, 100, 91, 6, 210, 7, 283, 91, 10, 179, 144, 45, 17, 22, 22, 76, 87, 58, 10, 10, 9, 8, 5, 10, 4, 5, 10, 5, 176, 86, 59, 195, 176, 117, 130, 80, 72, 48, 21, 10, 10, 4, 9, 9, 5, 3, 15, 11, 5, 17, 5, 4, 3, 9, 91, 118, 35, 21, 6, 122, 160, 7, 109, 56, 7, 185, 43, 118, 110]


<img src="no_bucket_strategy.png" style="width: 100%;"/>

#### `gluonnlp.data.batchify.Pad`

`gluonnlp.data.batchify.Pad` can be instantiated with the padding value.
The resulting function takes a list of variable length lists (or NDArrays or numpy arrays)
as input and returns a single NDArray where all shorter inputs are padded to the length
of the maximum length input element.
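
To make the padding step concrete, here is a minimal plain-Python sketch of the idea. The helper `pad_batch` is hypothetical and only illustrates the behavior described above; it is not how `gluonnlp.data.batchify.Pad` is implemented, and it returns nested lists rather than an NDArray.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one.

    Hypothetical helper for illustration only (not part of gluonnlp).
    """
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_val] * (max_len - len(seq)) for seq in sequences]


# Two toy "sentences" of different lengths, already mapped to indices
batch = pad_batch([[12, 3933, 4430], [21730, 129, 3933, 92, 45]], pad_val=1)
for row in batch:
    print(row)
```

Every row of the result has the length of the longest input, and the shorter rows are filled with `pad_val` on the right.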

In [4]:
print(len(coded[0]), len(coded[1]))

6 128


In [5]:
pad_val = vocab[vocab.padding_token]
pad = nlp.data.batchify.Pad(pad_val=pad_val)
print('pad_val:', pad_val)

pad_val: 1


In [6]:
print('Shape', pad([coded[0], coded[1]]).shape)
print(pad([coded[0], coded[1]]).astype(int)[:, :24])

Shape (2, 128)

[[   12  3933  4430   853    12     3     1     1     1     1     1     1
      1     1     1     1     1     1     1     1     1     1     1     1]
 [21730   129  3933    92    45     0  4430    25   754    45 33278     5
   7184     6  3933     7     4 23715    92    24     5  1912   988    10]]
<NDArray 2x24 @cpu(0)>


#### `gluon.data.DataLoader`

Manually sampling sentences from the dataset and applying the pad function may be bothersome.
`gluon.data.DataLoader` automates this process.

In [7]:
print(gluon.data.DataLoader.__doc__)

Loads data from a dataset and returns mini-batches of data.

    Parameters
    ----------
    dataset : Dataset
        Source dataset. Note that numpy and mxnet arrays can be directly used
        as a Dataset.
    batch_size : int
        Size of mini-batch.
    shuffle : bool
        Whether to shuffle the samples.
    sampler : Sampler
        The sampler to use. Either specify sampler or shuffle, not both.
    last_batch : {'keep', 'discard', 'rollover'}
        How to handle the last batch if batch_size does not evenly divide
        `len(dataset)`.

        keep - A batch with less samples than previous batches is returned.
        discard - The last batch is discarded if its incomplete.
        rollover - The remaining samples are rolled over to the next epoch.
    batch_sampler : Sampler
        A sampler that returns mini-batches. Do not specify batch_size,
        shuffle, sampler, and last_batch if batch_sampler is specified.
    batchify_fn : callable
        Callback fun

In [8]:
batch_size = 8
data_loader = gluon.data.DataLoader(coded, batchify_fn=pad, batch_size=batch_size)
print('Average length of batches is', sum(batch.shape[1] for batch in data_loader) / len(data_loader))

Average length of batches is 192.7940087512622


<img src="fixed_bucket_strategy_ratio0.7.png" style="width: 100%;"/>

##### Custom `Sampler` for `DataLoader`

Sampling random sentences from the Dataset and padding them is sub-optimal, as the number of padding elements is determined by the longest sentence in each batch.
`DataLoader` accepts a `Sampler` that specifies which sentences to select for a batch.

For example, `gluonnlp.data.FixedBucketSampler` assigns each data sample (sentence)
to a fixed bucket based on its length. Resulting batches will only contain sentences
from a single bucket.

This can significantly reduce the average number of elements per batch and consequently the amount of computation.

In [9]:
sampler = nlp.data.FixedBucketSampler(lengths=[len(c) for c in coded], batch_size=batch_size)

In [10]:
data_loader = gluon.data.DataLoader(coded, batchify_fn=pad, batch_sampler=sampler)

In [11]:
print('Average length of batches is', sum(batch.shape[1] for batch in data_loader) / len(data_loader))

Average length of batches is 117.92067226890757
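
The effect of bucketing can be illustrated without any library. The sketch below is a simplified stand-in, assuming equal-width length buckets; the helper names are hypothetical, and `FixedBucketSampler` uses its own bucketing scheme and options. It uses the first 16 paragraph lengths printed earlier and compares the average padded length per batch with and without bucketing.

```python
def sequential_batches(lengths, batch_size):
    """Group sample indices into batches in their original order."""
    idx = list(range(len(lengths)))
    return [idx[i:i + batch_size] for i in range(0, len(idx), batch_size)]


def bucketed_batches(lengths, batch_size, num_buckets=4):
    """Assign samples to equal-width length buckets, then batch per bucket.

    Simplified illustration; not the algorithm used by FixedBucketSampler.
    """
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / num_buckets or 1  # avoid zero width if all lengths match
    buckets = [[] for _ in range(num_buckets)]
    for i, n in enumerate(lengths):
        b = min(int((n - lo) / width), num_buckets - 1)
        buckets[b].append(i)
    batches = []
    for bucket in buckets:
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    return batches


def average_padded_length(lengths, batches):
    """Each batch is padded to its longest sample; average that over batches."""
    return sum(max(lengths[i] for i in batch) for batch in batches) / len(batches)


lengths = [6, 128, 92, 102, 6, 212, 297, 85, 6, 189, 222, 204, 6, 268, 249, 8]
batch_size = 4
print('sequential:', average_padded_length(lengths, sequential_batches(lengths, batch_size)))
print('bucketed:  ', average_padded_length(lengths, bucketed_batches(lengths, batch_size)))
```

Because short paragraphs no longer share a batch with long ones, fewer padding elements are needed on average.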


## `DataStream` for large corpora

``` python
class DataStream(object):
    # Iterate over stream
    # `for sample in stream: print(sample)`
    def __iter__(self):
        ...

    def transform(self, fn):
        # Returns a new DataStream
        # Each sample transformed by the function `fn`
        ...
```

Corpora for various NLP tasks can be very large, too large to load into memory at once.
Luckily there is seldom a need to load the whole corpus at once. Instead we can stream the chunks of data that we need at any point in time off the disk, dealing only with the batches required for the current training iterations.
In `gluonnlp`, the `DataStream` concept together with other utilities makes such data pipelines easy.
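
The streaming idea can be sketched in plain Python. The class `LazyStream` below is a hypothetical stand-in that mirrors the `__iter__` and `transform` interface shown above; it is not gluonnlp's implementation. An `io.StringIO` object stands in for a large file on disk.

```python
import io


class LazyStream:
    """Hypothetical stand-in for a DataStream: samples are produced on demand."""

    def __init__(self, make_iterator):
        # make_iterator is a zero-argument callable returning a fresh iterator,
        # so the stream can be iterated more than once.
        self._make_iterator = make_iterator

    def __iter__(self):
        return iter(self._make_iterator())

    def transform(self, fn):
        # Returns a new stream; fn is applied only when samples are consumed.
        return LazyStream(lambda: (fn(sample) for sample in self))


def read_lines():
    # io.StringIO stands in for a large corpus file; lines are read one at a time.
    with io.StringIO("the quick brown fox\njumps over the lazy dog\n") as f:
        for line in f:
            yield line


stream = LazyStream(read_lines)
tokens = stream.transform(str.split)

for sample in tokens:
    print(sample)
```

Nothing is read or tokenized until the loop asks for the next sample, so memory use stays bounded by the size of a single sample.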

Any Python iterable can be wrapped in a `DataStream` using `gluonnlp.data.SimpleDataStream`.

In [12]:
data = nlp.data.SimpleDataStream(
    [
        [1,2,3],
        [4,5,6]
    ]
)
transformed = data.transform(lambda x: [e + 1 for e in x])

for sample in transformed:
    print(sample)

[2, 3, 4]
[5, 6, 7]


## Example - Google Billion Words Corpus

As GBW is very large, `gluonnlp` contains a convenient `DataStream` representation of the corpus.
The corpus is split into 100 parts of around 40MB each (containing ~300k sentences each).
The stream iterates over `CorpusDataset` objects, one for each of the 100 files.

In [13]:
gbw = nlp.data.GBWStream(segment='train')

In [14]:
for gbw_element in itertools.islice(gbw, 2):
    print(gbw_element)

<gluonnlp.data.dataset.CorpusDataset object at 0x7fb5dd452f60>
<gluonnlp.data.dataset.CorpusDataset object at 0x7fb5dd4529b0>


In [15]:
for i in range(3):
    print(gbw_element[i][:10])

['As', 'well', 'as', 'funding', 'equipment', ',', 'the', 'foundation', 'has', 'paid']
['The', 'alert', 'among', 'you', 'will', 'have', 'noticed', 'by', 'now', 'that']
['Speaking', 'at', 'a', 'rally', 'at', 'a', 'Las', 'Vegas', 'casino', ',']


### `DatasetStream`

A `DataStream` that iterates over `Dataset`s is also called a `DatasetStream`.
`gluonnlp` contains the `SimpleDatasetStream`,
which allows you to easily construct a Stream similar to `GBWStream`.

In [16]:
print(nlp.data.SimpleDatasetStream.__doc__)

A simple stream of Datasets.

    The SimpleDatasetStream is created from multiple files based on provided
    `file_pattern`. One file is read at a time and a corresponding Dataset is
    returned. The Dataset is created based on the file and the kwargs passed to
    SimpleDatasetStream.

    Parameters
    ----------
    dataset : class
        The class for which to create an object for every file. kwargs are
        passed to this class.
    file_pattern: str
        Path to the input text files.
    file_sampler : str, {'sequential', 'random'}, defaults to 'random'
        The sampler used to sample a file.

        - 'sequential': SequentialSampler
        - 'random': RandomSampler
    kwargs
        All other keyword arguments are passed to the dataset constructor.
    


### Example: Language Modeling

In short, a language model tries to predict the next word given the previous ones.

<img src="Recurrent_neural_network_unfold.svg" style="width: 66%;"/>
Image from https://en.wikipedia.org/wiki/Recurrent_neural_network

#### Back-propagation-through-time (BPTT)

For language models no padding is required; batches are instead packed with samples as tightly as possible.
There is neither a guarantee nor a requirement that a sentence fits in a single batch.
Instead, subsequent batches must be chosen such that they contain valid continuations of the previous sentences.

We train the model using truncated back-propagation-through-time (BPTT).
We backpropagate for 35 time steps using stochastic gradient descent.
For efficient optimization, we need to obtain a batch of source and target arrays.

For truncated BPTT we usually do not reset the hidden states of our RNN,
so it is important that consecutive batches are
valid continuations of the previous batch.

With a sequence length of 5 and a batch size of 2 (kept small here for readability), the first two batches contain the following token indices:

In [17]:
batchify = nlp.data.batchify.StreamBPTTBatchify(gbw.vocab, seq_len=5, batch_size=2)

In [18]:
bptt_batches = batchify(gbw)

In [19]:
for batch in itertools.islice(bptt_batches, 2):
    print('data', batch[0].astype(int))
    print('target', batch[1].astype(int), '\n')

data 
[[   16   373]
 [35574    11]
 [  465  7484]
 [ 1427  5305]
 [ 1897  1129]]
<NDArray 5x2 @cpu(0)>
target 
[[35574    11]
 [  465  7484]
 [ 1427  5305]
 [ 1897  1129]
 [  713   122]]
<NDArray 5x2 @cpu(0)> 

data 
[[  713   122]
 [39131    18]
 [    7     8]
 [10503 42884]
 [   64  4548]]
<NDArray 5x2 @cpu(0)>
target 
[[39131    18]
 [    7     8]
 [10503 42884]
 [   64  4548]
 [   27    21]]
<NDArray 5x2 @cpu(0)> 
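
The layout of these batches can be reproduced in plain Python. The sketch below assumes the classic corpus-level scheme: a single token sequence is cut into `batch_size` contiguous columns, which are then sliced `seq_len` rows at a time. The function `bptt_batchify_sketch` is hypothetical; `StreamBPTTBatchify` works on a stream of datasets and differs in its details.

```python
def bptt_batchify_sketch(tokens, seq_len, batch_size):
    """Yield (data, target) pairs laid out as seq_len rows by batch_size columns.

    Hypothetical illustration of truncated-BPTT batching, not gluonnlp code.
    """
    col_len = len(tokens) // batch_size
    # Each column is one contiguous slice of the corpus.
    columns = [tokens[c * col_len:(c + 1) * col_len] for c in range(batch_size)]
    # Stop one step early so that every data token has a target.
    for start in range(0, col_len - 1, seq_len):
        end = min(start + seq_len, col_len - 1)
        data = [[col[t] for col in columns] for t in range(start, end)]
        # The target is the data shifted by one time step.
        target = [[col[t + 1] for col in columns] for t in range(start, end)]
        yield data, target


toy_tokens = list(range(20))  # integers stand in for token indices
toy_batches = list(bptt_batchify_sketch(toy_tokens, seq_len=3, batch_size=2))
for data, target in toy_batches[:2]:
    print('data  ', data)
    print('target', target, '\n')
```

As in the output above, the last row of one batch's target is the first row of the next batch's data, so each column continues where it left off and the hidden state can be carried over.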

