## Using TorchText with Your Own Datasets

TorchText can read three data formats: `json`, `tsv` (tab-separated values) and `csv` (comma-separated values).

**In my opinion, the best formatting for TorchText is `json`, which I'll explain later on.**

## Reading JSON

Starting with `json`, your data must be in the `json lines` format, i.e. it must be something like:

```
{"name": "John", "location": "United Kingdom", "age": 42, "quote": ["i", "love", "the", "united kingdom"]}
{"name": "Mary", "location": "United States", "age": 36, "quote": ["i", "want", "more", "telescopes"]}
```

That is, each line is a `json` object.
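As a rough illustration, a file in this format can be written and read back with Python's standard `json` module, using one `json.dumps` call per line. The file name `example.jsonl` and the `records` list are made up for this sketch.

```python
import json
import os
import tempfile

# Hypothetical records, mirroring the two examples above.
records = [
    {"name": "John", "location": "United Kingdom", "age": 42,
     "quote": ["i", "love", "the", "united kingdom"]},
    {"name": "Mary", "location": "United States", "age": 36,
     "quote": ["i", "want", "more", "telescopes"]},
]

# Write one JSON object per line (the "json lines" layout).
path = os.path.join(tempfile.mkdtemp(), "example.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back: each line parses independently of the others.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded[0]["quote"])
```

Note that the file as a whole is not a single valid `json` document; only each individual line is.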

We then define the fields:

In [38]:
import random

from torchtext import data
from torchtext import datasets

NAME = data.Field()
SAYING = data.Field()
PLACE = data.Field()

SEED = 1234

Next, we must tell TorchText which fields apply to which elements of the `json` object. 

For `json` data, we must create a dictionary where:
- the key matches the key of the `json` object
- the value is a tuple where:
  - the first element becomes the batch object's attribute name
  - the second element is the name of the `Field`
  
What do we mean when we say "becomes the batch object's attribute name"? Recall that in the previous exercises we accessed the `TEXT` and `LABEL` fields in the train/evaluation loop using `batch.text` and `batch.label`. This works because TorchText gives the batch object a `text` and a `label` attribute, each a tensor containing either the text or the label.

A few notes:

- The order of the keys in the `fields` dictionary does not matter, as long as its keys match the `json` data keys.

- The `Field` name does not have to match the key in the `json` object, e.g. we use `PLACE` for the `"location"` field.

- When dealing with `json` data, not all of the keys have to be used, e.g. we did not use the `"age"` field.

- If the value of a `json` field is a string, then the `Field`'s tokenization is applied (the default is to split the string on spaces). If the value is a list, no tokenization is applied. It is usually a good idea for the data to already be tokenized into a list, as this saves time: you don't have to wait for TorchText to do it.

- The values of a `json` field do not all have to be the same type. Some examples can have their `"quote"` as a string, and some as a list. The tokenization is only applied to the ones with their `"quote"` as a string.

- If you are using a `json` field, every single example must have an instance of that field, e.g. in this example all examples must have a name, location and quote. However, as we are not using the age field, it does not matter if an example does not have it.

In [3]:
fields = {'name': ('n', NAME), 'location': ('p', PLACE), 'quote': ('s', SAYING)}

Now, in a training loop we can iterate over the data iterator and access the name via `batch.n`, the location via `batch.p`, and the quote via `batch.s`.
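The rules in the notes above can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour described here, not TorchText's implementation: `make_example` and `simple_tokenize` are hypothetical helpers, and a plain function stands in for each `Field` object.

```python
import json

def simple_tokenize(text):
    # Stand-in for the default Field tokenization: split on whitespace.
    return text.split()

def make_example(json_line, fields):
    """Map selected JSON keys to attribute names, tokenizing only strings."""
    obj = json.loads(json_line)
    example = {}
    for key, (attr_name, tokenize) in fields.items():
        if key not in obj:
            # Every key that is used must be present in every example.
            raise ValueError(f"key {key!r} was not found in the input data")
        value = obj[key]
        # Strings are tokenized; lists are assumed to be tokenized already.
        example[attr_name] = tokenize(value) if isinstance(value, str) else value
    return example

# Same shape as the `fields` dictionary above, with functions instead of Fields.
sketch_fields = {
    "name": ("n", simple_tokenize),
    "location": ("p", simple_tokenize),
    "quote": ("s", simple_tokenize),
}

line = ('{"name": "John", "location": "United Kingdom", "age": 42, '
        '"quote": ["i", "love", "the", "united kingdom"]}')
example = make_example(line, sketch_fields)
print(example)
```

The unused `"age"` key never reaches the example, and the `"quote"` list passes through untouched, so `"united kingdom"` stays a single token.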

We then create our datasets (`train_data` and `test_data`) with the `TabularDataset.splits` function. 

The `path` argument specifies the top-level folder common to both datasets, and the `train` and `test` arguments specify the filename of each dataset, e.g. here the train dataset is located at `data/train.json`.

We tell the function we are using `json` data, and pass in our `fields` dictionary defined previously.

In [10]:
train_data, test_data = data.TabularDataset.splits(
                            path = 'data',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields
)

In [11]:
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = 'data',
                                        train = 'train.json',
                                        validation = 'valid.json',
                                        test = 'test.json',
                                        format = 'json',
                                        fields = fields
)

We can then view an example to make sure it has worked correctly.

Notice how the field names (`n`, `p` and `s`) match up with what was defined in the `fields` dictionary.

Also notice how `"United Kingdom"` in `p` has been split by the tokenization, whereas `"united kingdom"` in `s` has not. This is due to what was mentioned previously: TorchText assumes that any `json` fields that are lists are already tokenized, so no further tokenization is applied.

In [12]:
print(vars(train_data[0]))

{'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united kingdom']}


In [15]:
print(vars(train_data[2]))

{'n': ['Suraj', 'Karki'], 'p': ['Nepal'], 's': ['I', 'want', 'to', 'be', 'data', 'scientist']}


We can now use `train_data`, `test_data` and `valid_data` to build a vocabulary and create iterators, as in the other notebooks. We can access all attributes by using `batch.n`, `batch.p` and `batch.s` for the names, places and sayings, respectively.

## Reading CSV/TSV

`csv` and `tsv` are very similar, except csv has elements separated by commas and tsv by tabs.

Using the same example above, our `tsv` data will be in the form of:

```
name	location	age	quote
John	United Kingdom	42	i love the united kingdom
Mary	United States	36	i want more telescopes
```

That is, on each row the elements are separated by tabs and we have one example per row. The first row is usually a header (i.e. the name of each of the columns), but your data could have no header.

You cannot have lists within `tsv` or `csv` data.
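Because a cell can only hold a string, a tokenized quote has to be joined back into one string before it is written. A minimal sketch using Python's standard `csv` module follows; the `rows` list is made up, and an in-memory buffer stands in for a file such as `train.tsv`.

```python
import csv
import io

rows = [
    {"name": "John", "location": "United Kingdom", "age": 42,
     "quote": ["i", "love", "the", "united kingdom"]},
    {"name": "Mary", "location": "United States", "age": 36,
     "quote": ["i", "want", "more", "telescopes"]},
]

buffer = io.StringIO()  # in-memory stand-in for a .tsv file
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["name", "location", "age", "quote"])  # header row
for row in rows:
    # Lists cannot be stored in a cell, so join the tokens with spaces.
    writer.writerow([row["name"], row["location"], row["age"], " ".join(row["quote"])])

tsv_text = buffer.getvalue()
print(tsv_text)
```

Joining loses the boundary of multi-word tokens such as `"united kingdom"`, which will be split into two tokens when the string is later tokenized on spaces.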

The way the fields are defined is a bit different to `json`. We now use a list of tuples, one per column. The first element of each tuple becomes the batch object's attribute name, and the second element is the `Field` name.

Unlike the `json` data, the tuples have to be in the same order as the columns in the `tsv` data. Because of this, a tuple of `None`s must be used to skip a column. Without it, our `SAYING` field would be applied to the `age` column of the `tsv` data and the `quote` column would not be used.

However, if you only wanted to use the `name` and `location` columns, you could just use two tuples, as they are the first two columns.

We change our `TabularDataset` to read the correct `.tsv` files, and change the `format` argument to `'tsv'`.

If your data has a header, which ours does, it must be skipped by passing `skip_header = True`. If not, TorchText will think the header is an example. By default, `skip_header` will be `False`.
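The positional mapping and the effect of `skip_header` can be sketched in plain Python. This mimics the behaviour described above rather than calling TorchText; `read_examples` is a hypothetical helper and `str.split` stands in for a `Field`'s default tokenization.

```python
import csv
import io

tsv_text = (
    "name\tlocation\tage\tquote\n"
    "John\tUnited Kingdom\t42\ti love the united kingdom\n"
    "Mary\tUnited States\t36\ti want more telescopes\n"
)

# One tuple per column, in column order; (None, None) skips a column.
sketch_fields = [("n", str.split), ("p", str.split), (None, None), ("s", str.split)]

def read_examples(text, fields, skip_header=False):
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    if skip_header:
        next(reader)  # discard the header row
    examples = []
    for row in reader:
        example = {}
        # zip stops at the shorter sequence, so trailing columns
        # without a tuple are simply ignored.
        for (attr_name, tokenize), cell in zip(fields, row):
            if attr_name is not None:
                example[attr_name] = tokenize(cell)
        examples.append(example)
    return examples

examples = read_examples(tsv_text, sketch_fields, skip_header=True)
print(examples[1])

with_header = read_examples(tsv_text, sketch_fields, skip_header=False)
print(with_header[0])  # the header row, wrongly treated as an example
```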

In [16]:
fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]

In [17]:
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = 'data',
                                        train = 'train.tsv',
                                        validation = 'valid.tsv',
                                        test = 'test.tsv',
                                        format = 'tsv',
                                        fields = fields,
                                        skip_header = True
)

In [19]:
print(vars(train_data[1]))

{'n': ['Mary'], 'p': ['United', 'States'], 's': ['i', 'want', 'more', 'telescopes']}


In [20]:
fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]

In [21]:
train_data, valid_data, test_data = data.TabularDataset.splits(
                                        path = 'data',
                                        train = 'train.csv',
                                        validation = 'valid.csv',
                                        test = 'test.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)

In [22]:
print(vars(train_data[0]))

{'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united', 'kingdom']}


## Iterators 

Using any of the above datasets, we can then build the vocab and create the iterators.

In [23]:
NAME.build_vocab(train_data)
SAYING.build_vocab(train_data)
PLACE.build_vocab(train_data)

In [26]:
print(vars(NAME.vocab))

{'freqs': Counter({'John': 1, 'Mary': 1}), 'itos': ['<unk>', '<pad>', 'John', 'Mary'], 'unk_index': 0, 'stoi': defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7f1a7c7b6430>>, {'<unk>': 0, '<pad>': 1, 'John': 2, 'Mary': 3}), 'vectors': None}
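The printed attributes amount to two lookup tables: `itos` (index to string) and `stoi` (string to index), with unknown tokens falling back to index 0. The sketch below rebuilds the same tables in plain Python under the assumption that tokens are ordered by frequency, then alphabetically; `build_vocab` and `lookup` are hypothetical helpers, not TorchText functions.

```python
from collections import Counter

def build_vocab(examples, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for a list of tokenized examples."""
    counts = Counter(token for tokens in examples for token in tokens)
    # Most frequent first; ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Tokens seen fewer than min_freq times are left out of the vocabulary.
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

names = [["John"], ["Mary"]]
itos, stoi = build_vocab(names)
print(itos)

def lookup(token):
    # Unknown tokens fall back to the index of "<unk>".
    return stoi.get(token, stoi["<unk>"])

print(lookup("Mary"), lookup("Alice"))
```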


Then, we can create the iterators after defining our batch size and device.

By default, the train data is shuffled each epoch, but the validation/test data is sorted. However, TorchText doesn't know what to use to sort our data, and it throws an error if we don't tell it.

There are two ways to handle this: you can either tell the iterator not to sort the validation/test data by passing `sort = False`, or tell it how to sort the data by passing a `sort_key`. A sort key is a function that returns the value on which to sort the examples. For example, `lambda x: x.s` sorts the examples by their `s` attribute, i.e. their quote; to sort by the number of tokens instead, a key such as `lambda x: len(x.s)` can be used. Ideally, you want to use a sort key, as the `BucketIterator` can then group examples of similar length and minimize the amount of padding within each batch.
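To see why sorting helps, the sketch below counts the pad tokens needed with and without sorting. It assumes each batch is padded to its longest example and that examples are sorted by token count (what a key like `lambda x: len(x.s)` would give); `total_padding` and the lengths are made up for illustration.

```python
def total_padding(lengths, batch_size):
    """Count pad tokens needed when each batch is padded to its longest example."""
    padding = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padding += sum(max(batch) - length for length in batch)
    return padding

# Hypothetical example lengths (number of tokens per quote).
lengths = [3, 12, 4, 11, 2, 13]

unsorted_padding = total_padding(lengths, batch_size=2)
sorted_padding = total_padding(sorted(lengths), batch_size=2)
print(unsorted_padding, sorted_padding)  # 27 9
```

Grouping short examples with short ones and long with long cuts the padding from 27 tokens to 9 in this toy case.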

We can then iterate over our iterator to get batches of data. Note how by default TorchText has the batch dimension second.

In [27]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 1

# Option 1: don't sort the validation/test data
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort = False,
    batch_size = BATCH_SIZE,
    device = device)

# Option 2: sort by the s attribute (quote); this reassignment replaces the iterators from option 1
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort_key = lambda x: x.s,
    batch_size = BATCH_SIZE,
    device = device)

print('Train:')
for batch in train_iterator:
    print(batch)
    
print('Valid:')
for batch in valid_iterator:
    print(batch)
    
print('Test:')
for batch in test_iterator:
    print(batch)

Train:

[torchtext.data.batch.Batch of size 1]
	[.n]:[torch.LongTensor of size 1x1]
	[.p]:[torch.LongTensor of size 2x1]
	[.s]:[torch.LongTensor of size 5x1]

[torchtext.data.batch.Batch of size 1]
	[.n]:[torch.LongTensor of size 1x1]
	[.p]:[torch.LongTensor of size 2x1]
	[.s]:[torch.LongTensor of size 4x1]
Valid:

[torchtext.data.batch.Batch of size 1]
	[.n]:[torch.LongTensor of size 1x1]
	[.p]:[torch.LongTensor of size 1x1]
	[.s]:[torch.LongTensor of size 2x1]

[torchtext.data.batch.Batch of size 1]
	[.n]:[torch.LongTensor of size 1x1]
	[.p]:[torch.LongTensor of size 1x1]
	[.s]:[torch.LongTensor of size 4x1]
Test:

[torchtext.data.batch.Batch of size 1]
	[.n]:[torch.LongTensor of size 1x1]
	[.p]:[torch.LongTensor of size 1x1]
	[.s]:[torch.LongTensor of size 3x1]

[torchtext.data.batch.Batch of size 1]
	[.n]:[torch.LongTensor of size 1x1]
	[.p]:[torch.LongTensor of size 2x1]
	[.s]:[torch.LongTensor of size 3x1]
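The sizes printed above, such as `5x1`, are [sequence length, batch size]. The plain-Python sketch below shows how a padded batch looks in both layouts; the token indices are invented and `pad_batch` is a hypothetical helper, with nested lists standing in for tensors.

```python
PAD = 1  # index of "<pad>" in the vocabulary printed earlier

def pad_batch(sequences, pad_index=PAD):
    """Pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (longest - len(seq)) for seq in sequences]

# Two hypothetical numericalized quotes, of length 5 and 4.
batch_first = pad_batch([[2, 3, 4, 5, 6], [2, 7, 8, 9]])

# Transpose to [sequence length, batch size], the default layout.
seq_first = [list(step) for step in zip(*batch_first)]

print(len(batch_first), len(batch_first[0]))  # 2 5
print(len(seq_first), len(seq_first[0]))      # 5 2
```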


## Loading a Large CSV File

In [28]:
TEXT = data.Field(tokenize='spacy',batch_first=True,include_lengths=True)
LABEL = data.LabelField(dtype = torch.float,batch_first=True)

In [29]:
fields = [(None, None), ('text', TEXT), ('label', LABEL)] # skip the first column; text => input, label => output

In [30]:
# load the custom dataset from a single csv file
training_data = data.TabularDataset(path = './data/quora.csv', format = 'csv',
                                    fields = fields, skip_header = True)

In [34]:
print(vars(training_data[10]))

{'text': ['Why', 'do', 'n’t', 'Arab', 'world', 'completely', 'destroy', 'Israel', '?'], 'label': '1'}


In [39]:
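random.seed(SEED) # random.seed() returns None, so seed first and pass the resulting state explicitly
train_data, valid_data = training_data.split(split_ratio = 0.7, random_state = random.getstate())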

In [40]:
print(vars(train_data[0]))

{'text': ['In', 'the', 'world', 'of', 'Grand', 'Theft', 'Auto', 'V', ',', 'a', 'full', 'earth', 'rotation', 'takes', '48', 'minutes', '.', 'If', 'this', 'occurred', 'in', 'real', 'life', ',', 'what', 'would', 'be', 'the', 'consequences', '?'], 'label': '0'}


In [41]:
print(vars(valid_data[0]))

{'text': ['What', 'happens', 'to', 'the', 'used', 'cricket', 'balls', '?'], 'label': '0'}
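The 70/30 split above can be pictured as a seeded shuffle followed by a cut. The following is a minimal sketch of that idea, not TorchText's implementation; `split_examples` is a hypothetical helper and integers stand in for dataset examples.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle with a private, seeded RNG, then cut at split_ratio."""
    rng = random.Random(seed)  # does not touch the global random state
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))  # stand-ins for dataset examples
train_part, valid_part = split_examples(examples)
print(len(train_part), len(valid_part))  # 7 3

# The same seed gives the same split on every run.
assert split_examples(examples) == (train_part, valid_part)
```

Fixing the seed is what makes the train/validation membership reproducible between runs.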


#### Building the Vocabulary

In [42]:
#build the vocabulary from the training data, keeping tokens that appear at least 3 times (no pretrained vectors are loaded)
TEXT.build_vocab(train_data,min_freq=3)  
LABEL.build_vocab(train_data)

In [46]:
print(TEXT.vocab.stoi)



In [47]:
#check whether cuda is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  

In [48]:
#set batch size
BATCH_SIZE = 64

In [49]:
#Load an iterator
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.text),
    sort_within_batch=True,
    device = device)

In [57]:
for batch in train_iterator:
    print(batch)
    break


[torchtext.data.batch.Batch of size 64]
	[.text]:('[torch.LongTensor of size 64x7]', '[torch.LongTensor of size 64]')
	[.label]:[torch.FloatTensor of size 64]


In [58]:
for batch in train_iterator:
    print(batch.text)
    break

(tensor([[  62,   53,   10,  ...,   17,  171,    2],
        [  11,   12,   34,  ..., 2634, 2430,    2],
        [  13,  323,   64,  ..., 8570,  391,    2],
        ...,
        [  11,   12,   34,  ...,  499, 3450,    2],
        [  11,   12,  128,  ..., 2628,  718,    2],
        [  11,   12,  153,  ..., 3467,  200,    2]]), tensor([16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
        16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
        16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
        16, 16, 16, 16, 16, 16, 16, 16, 16, 16]))


In [59]:
for batch in train_iterator:
    print(batch.text[0].shape)
    break

torch.Size([64, 18])
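Because `TEXT` was created with `include_lengths=True`, `batch.text` is a tuple of the padded token indices and the unpadded length of each example, and `batch_first=True` puts the batch dimension first. The padded width differs between the outputs above, presumably because each loop draws a fresh batch. The sketch below imitates that tuple in plain Python, assuming `sort_within_batch=True` puts the longest example first; `make_batch` is a hypothetical helper and the indices are invented.

```python
PAD = 1  # assumed index of "<pad>"

def make_batch(sequences, pad_index=PAD):
    """Return (padded, lengths) with the longest sequence first."""
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    # Pad every sequence up to the length of the longest one.
    padded = [seq + [pad_index] * (lengths[0] - len(seq)) for seq in ordered]
    return padded, lengths

# Three hypothetical numericalized questions.
padded, lengths = make_batch([[11, 12, 2], [62, 53, 10, 17, 2], [13, 323, 64, 2]])
print(padded)
print(lengths)
```

Keeping the lengths alongside the padded indices lets a model ignore the pad positions, for example when packing sequences for an RNN.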


In [61]:
for batch in train_iterator:
    print(batch.label)
    print(batch.label.shape)
    break

tensor([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 1., 0.,
        0., 0., 1., 1., 1., 0., 1., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0.,
        0., 1., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.,
        1., 1., 0., 1., 0., 0., 0., 1., 0., 0.])
torch.Size([64])
