# Data 101 with `gluon` and `gluonnlp`

In order to train neural networks, we need data.
GluonNLP provides useful abstractions for getting textual data in the shape required to train typical deep nets.

In this notebook, we will cover the following:
- Dataset abstractions in `gluon`
- Using included Datasets as well as custom Datasets
- Common data transformations in NLP

In [1]:
import warnings
warnings.filterwarnings('ignore')

import itertools
from mxnet import gluon
import gluonnlp as nlp

## Dataset in `gluon`

Datasets in Gluon have the following basic structure:

``` python
class Dataset(object):
    def __getitem__(self, idx):
        ...
    
    def __len__(self):
        ...

    def transform(self, fn, lazy=True):
        # Returns a new dataset with each sample
        # transformed by the function `fn`.
```

We can turn any list-like Python object (i.e. one that implements `__getitem__` and `__len__`, so it can be subscripted like `x[0]`)
into a `gluon` `Dataset` by wrapping it in `gluon.data.SimpleDataset`, as follows:

In [2]:
simple_data = gluon.data.SimpleDataset([[1, 2, 3], [4, 5, 6]])

In [3]:
transformed = simple_data.transform(lambda x: [e + 1 for e in x])

In [4]:
print(transformed[0])
print(transformed[1])

[2, 3, 4]
[5, 6, 7]
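
To make the `Dataset` structure above more concrete, here is a minimal pure-Python sketch of a list-backed dataset with a lazy `transform`. The class names `ListDataset` and `LazyTransformed` are hypothetical and chosen for illustration; this is not the `gluon` implementation, only one way the same interface could behave.

```python
class ListDataset:
    """Hypothetical stand-in for a simple list-backed dataset."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)

    def transform(self, fn, lazy=True):
        # Lazy: wrap the dataset and apply `fn` only when a sample is accessed.
        if lazy:
            return LazyTransformed(self, fn)
        # Eager: apply `fn` to every sample right away.
        return ListDataset([fn(self[i]) for i in range(len(self))])


class LazyTransformed(ListDataset):
    """Applies `fn` to a sample each time it is requested."""

    def __init__(self, data, fn):
        super().__init__(data)
        self._fn = fn

    def __getitem__(self, idx):
        return self._fn(self._data[idx])


toy = ListDataset([[1, 2, 3], [4, 5, 6]])
toy_plus_one = toy.transform(lambda x: [e + 1 for e in x])
print(toy_plus_one[0])
print(len(toy_plus_one))
```

With a lazy transform, the function runs on access rather than up front, which avoids processing samples that are never read.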


## Provided datasets

- GluonNLP features a number of popular benchmark datasets out of the box in order to make it easy for you to get going.
  - http://gluon-nlp.mxnet.io/api/data.html
  
Using these datasets just requires instantiating a class. Here we take WikiText2 as an example. This dataset is distributed as text with one paragraph per line.

In [5]:
data_w2 = nlp.data.WikiText2(segment='train', flatten=False)

In [6]:
print('WikiText2 with flatten=False contains', len(data_w2), 'samples')

WikiText2 with flatten=False contains 23767 samples


In [7]:
print(data_w2[0][:10])
print(data_w2[1][:10])
print(data_w2[2][:10])

['=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>']
['Senjō', 'no', 'Valkyria', '3', ':', '<unk>', 'Chronicles', '(', 'Japanese', ':']
['The', 'game', 'began', 'development', 'in', '2010', ',', 'carrying', 'over', 'a']


Each sample in `data_w2` is a list of words, representing a paragraph in the WikiText2 dataset.

### Example 1 - flatten

Some tasks don't operate on paragraphs and thus ignore paragraph boundaries. With `flatten=True`, each sample is a single token rather than a paragraph.

In [8]:
data_w2_f = nlp.data.WikiText2(segment='train', flatten=True)

In [9]:
print('WikiText2 with flatten=True contains', len(data_w2_f), 'tokens')

WikiText2 with flatten=True contains 2075677 tokens


In [10]:
print('The first 10 words are: ', data_w2_f[:10])

The first 10 words are:  ['=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>', 'Senjō', 'no', 'Valkyria', '3']
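
The effect of flattening can be sketched in plain Python with `itertools.chain`. The two toy paragraphs below are shortened, hand-written samples for illustration, not the actual dataset contents.

```python
import itertools

# Two toy paragraphs, each already tokenized and terminated with '<eos>'.
paragraphs = [
    ['=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>'],
    ['Senjō', 'no', 'Valkyria', '3', ':', '<eos>'],
]

# Flattening concatenates all paragraphs into one long token sequence.
flat_tokens = list(itertools.chain.from_iterable(paragraphs))

print(len(paragraphs), 'samples ->', len(flat_tokens), 'tokens')
print(flat_tokens[:8])
```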


### Example 2 - BOS / EOS

Some tasks denote the beginning and end of a sentence with special BOS (beginning-of-sentence) and EOS (end-of-sentence) tokens. Here we pass `bos=None` to add no BOS token and `eos='<eos>'` to mark the end of each sample.

In [11]:
data_w2_eos = nlp.data.WikiText2(segment='train', flatten=True, bos=None, eos='<eos>')
print('WikiText2 with flatten=True and eos contains', len(data_w2_eos), 'tokens')
print('The first 10 words are: ', data_w2_eos[:10])

WikiText2 with flatten=True and eos contains 2075677 tokens
The first 10 words are:  ['=', 'Valkyria', 'Chronicles', 'III', '=', '<eos>', 'Senjō', 'no', 'Valkyria', '3']
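
As a rough illustration of what the `bos` and `eos` arguments do to a single sample, consider the sketch below. The helper `add_boundaries` is hypothetical and not part of `gluonnlp`; it only assumes that a `None` value means "add nothing".

```python
def add_boundaries(tokens, bos=None, eos=None):
    """Return a copy of `tokens` with optional boundary markers added."""
    out = list(tokens)
    if bos is not None:
        out.insert(0, bos)  # marker in front of the sample
    if eos is not None:
        out.append(eos)     # marker at the end of the sample
    return out


sample = ['The', 'game', 'began']
print(add_boundaries(sample, bos=None, eos='<eos>'))
print(add_boundaries(sample, bos='<bos>', eos='<eos>'))
```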


## `WikiText2` behind the scenes: `CorpusDataset`

- `gluonnlp` provides `CorpusDataset`, which makes it easy to read custom corpora given a sample splitter and a word tokenizer
- Let's use `CorpusDataset` to manually read WikiText2

First, "standard" Python:

In [12]:
with open('wiki.train.tokens', encoding='utf-8') as f:
    raw_data = f.read()

In [13]:
print(raw_data[:300])

 
 = Valkyria Chronicles III = 
 
 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStatio


Let's look at the GluonNLP functionality to read a file.

In [14]:
data_w2_raw = nlp.data.CorpusDataset(
    'wiki.train.tokens',
    encoding='utf8',
    flatten=False,
    skip_empty=True,
    sample_splitter=lambda x: x.splitlines(),
    tokenizer=lambda x: x.split(),
    bos=None,
    eos=None)

In [15]:
for i in range(3):
    print(data_w2_raw[i][:12])

['=', 'Valkyria', 'Chronicles', 'III', '=']
['Senjō', 'no', 'Valkyria', '3', ':', '<unk>', 'Chronicles', '(', 'Japanese', ':', '戦場のヴァルキュリア3', ',']
['The', 'game', 'began', 'development', 'in', '2010', ',', 'carrying', 'over', 'a', 'large', 'portion']
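
The roles of `sample_splitter`, `tokenizer`, and `skip_empty` can be sketched in plain Python. The function `read_corpus` below is a hypothetical helper, and the real `CorpusDataset` may order or implement these steps differently; the toy text is a shortened imitation of the raw file shown earlier.

```python
def read_corpus(text, sample_splitter, tokenizer, skip_empty=True):
    """Split `text` into samples, tokenize each, and optionally drop empty ones."""
    samples = []
    for line in sample_splitter(text):
        tokens = tokenizer(line)
        if skip_empty and not tokens:
            continue  # blank lines yield no tokens and are skipped
        samples.append(tokens)
    return samples


toy_text = ' \n = Valkyria Chronicles III = \n \n Senjō no Valkyria 3 : <unk> Chronicles \n'
toy_samples = read_corpus(
    toy_text,
    sample_splitter=lambda x: x.splitlines(),
    tokenizer=lambda x: x.split(),
)
for s in toy_samples:
    print(s)
```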


## Common transformations

### Vocabulary and coding

- A vocabulary provides a one-to-one mapping between tokens and integer indices

In [16]:
counter = nlp.data.count_tokens(itertools.chain.from_iterable(data_w2_raw))

In [17]:
vocab = nlp.Vocab(counter=counter,
          max_size=None,
          min_freq=1,
          unknown_token='<unk>',
          padding_token='<pad>',
          bos_token='<bos>',
          eos_token='<eos>',
          reserved_tokens=None)
print(vocab)

Vocab(size=33280, unk="<unk>", reserved="['<pad>', '<bos>', '<eos>']")


The vocabulary object enables us to replace a token or list of tokens `t`
with the corresponding indices via
`vocab[t]` or, equivalently, `vocab.to_indices(t)`.

In [18]:
for i in range(3):
    data_i = data_w2_raw[i][:5]
    print('{:<50}{:<13}{}'.format(str(data_i), 'becomes', str(vocab[data_i])))

['=', 'Valkyria', 'Chronicles', 'III', '=']       becomes      [12, 3933, 4430, 853, 12]
['Senjō', 'no', 'Valkyria', '3', ':']             becomes      [21730, 129, 3933, 92, 45]
['The', 'game', 'began', 'development', 'in']     becomes      [15, 79, 135, 443, 9]
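
To show the idea behind such a mapping, here is a minimal sketch built on `collections.Counter`. The functions `build_vocab` and `to_indices` are hypothetical helpers, and the index ordering (unknown token first, then reserved tokens, then tokens by descending frequency) is an assumption made for this sketch rather than a statement about how `nlp.Vocab` is implemented.

```python
from collections import Counter


def build_vocab(counter, unknown_token='<unk>', reserved=('<pad>', '<bos>', '<eos>')):
    """Build index<->token mappings; special tokens take the first indices."""
    idx_to_token = [unknown_token, *reserved]
    seen = set(idx_to_token)
    # More frequent tokens get smaller indices; ties are broken alphabetically.
    for token, _ in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if token not in seen:
            idx_to_token.append(token)
            seen.add(token)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx


def to_indices(tokens, token_to_idx, unknown_token='<unk>'):
    """Map tokens to indices, falling back to the unknown token's index."""
    unk = token_to_idx[unknown_token]
    return [token_to_idx.get(t, unk) for t in tokens]


toy_counter = Counter(['the', 'game', 'the', 'began', 'the', 'game'])
idx_to_token, token_to_idx = build_vocab(toy_counter)
print(to_indices(['the', 'game', 'began', 'Senjō'], token_to_idx))
```

Tokens that were never counted, such as `'Senjō'` in this toy example, fall back to the index of the unknown token.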


We can use the Dataset `transform` API to apply the `to_indices` method of the vocabulary.

In [19]:
coded = data_w2_raw.transform(vocab.to_indices)

for i in range(3):
    print(coded[i][:10])

[12, 3933, 4430, 853, 12]
[21730, 129, 3933, 92, 45, 0, 4430, 25, 754, 45]
[15, 79, 135, 443, 9, 283, 5, 3332, 73, 11]


## Practice - WikiText-2-Raw

WikiText-2 is distributed in two versions.
- Standard version
  - Preprocessed by replacing infrequent tokens with `<unk>`
- Raw version

Let's use `gluonnlp` to arrive at the pre-processed version given the raw version.

A few tips:
- GluonNLP provides `nlp.data.WikiText2Raw` to load the raw version.
- The default parameters of `nlp.data.WikiText2Raw` are for character-level language modeling
  - To reconstruct the pre-processed WikiText2, a different `tokenizer` argument needs to be passed
- When constructing the vocabulary, you must specify a maximum size based on the number of words in WikiText2
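
The last tip can be illustrated independently of `gluonnlp`. The sketch below keeps only the `max_size` most frequent tokens of a toy corpus and replaces everything else with `<unk>`. The helper `replace_rare` is hypothetical, and the exact rule used to produce the official pre-processed WikiText-2 may differ.

```python
from collections import Counter


def replace_rare(samples, max_size, unknown_token='<unk>'):
    """Keep the `max_size` most frequent tokens; map all others to `unknown_token`."""
    counter = Counter(tok for sample in samples for tok in sample)
    keep = {tok for tok, _ in counter.most_common(max_size)}
    return [[tok if tok in keep else unknown_token for tok in sample]
            for sample in samples]


toy_corpus = [['a', 'b', 'a', 'c'], ['a', 'b', 'd']]
print(replace_rare(toy_corpus, max_size=2))
```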

In [20]:
print(nlp.data.WikiText2Raw.__doc__)

WikiText-2 character-level dataset for language modeling

    WikiText2Raw is implemented as CorpusDataset with the default flatten=True.

    From Salesforce research:
    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset

    License: Creative Commons Attribution-ShareAlike

    Parameters
    ----------
    segment : {'train', 'val', 'test'}, default 'train'
        Dataset segment.
    flatten : bool, default True
        Whether to return all samples as flattened tokens. If True, each sample is a token.
    skip_empty : bool, default True
        Whether to skip the empty samples produced from sample_splitters. If False, `bos` and `eos`
        will be added in empty samples.
    tokenizer : function, default s.encode('utf-8')
        A function that splits each sample string into list of tokens.
        The tokenizer can also be used to convert everything to lowercase.
        E.g. with tokenizer=lambda s: s.lower().encode('utf-8')
    bos :

In [21]:
# data = nlp.data.WikiText2Raw(...)
# vocab = 
# coded = data.transform(vocab.to_indices)