# A - Using TorchText with Your Own Datasets

In this series we have used the IMDb dataset included in TorchText. TorchText includes many canonical datasets for classification, language modelling, sequence tagging, etc. However, you will frequently want to use your own datasets. Luckily, TorchText has functions to help you do this.

Recall that in the series, we:
- defined the `Field`s
- loaded the dataset
- created the splits

As a reminder, the code is shown below:

In [119]:
from torchtext import data
from torchtext import datasets

TEXT = data.Field()
LABEL = data.LabelField()

train, test = datasets.IMDB.splits(TEXT, LABEL)

train, valid = train.split()

There are three data formats TorchText can read: `json`, `tsv` (tab separated values) and `csv` (comma separated values).

**In my opinion, the best formatting for TorchText is `json`, which I'll explain later on.**

## Reading JSON

Starting with `json`, your data must be in the `json lines` format, i.e. it must be something like:

```
{"name": "John", "location": "United Kingdom", "age": 42, "quote": ["i", "love", "the", "united kingdom"]}
{"name": "Mary", "location": "United States", "age": 36, "quote": ["i", "want", "more", "telescopes"]}
```

That is, each line is a `json` object. See `data/train.json` for an example.
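If your data is not yet in this format, a minimal sketch of producing it with Python's standard `json` module might look like the following. The `examples` list and the helper names `to_json_lines` and `write_json_lines` are assumptions made for illustration; they are not part of TorchText.

```python
import json

# Hypothetical records mirroring the example above; the "quote" values are
# already tokenized into lists.
examples = [
    {"name": "John", "location": "United Kingdom", "age": 42,
     "quote": ["i", "love", "the", "united kingdom"]},
    {"name": "Mary", "location": "United States", "age": 36,
     "quote": ["i", "want", "more", "telescopes"]},
]


def to_json_lines(records):
    """Serialize each record as one json object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


def write_json_lines(records, path):
    """Write the records to `path` in the json lines format."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_json_lines(records))


json_lines_text = to_json_lines(examples)
print(json_lines_text)
```

Calling `write_json_lines(examples, 'data/train.json')` would then write the file, assuming the `data` folder already exists.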

We then define the fields:

In [88]:
NAME = data.Field()
SAYING = data.Field()
PLACE = data.Field()

Next, we must tell TorchText which fields apply to which elements of the `json` object. 

For `json` data, we must create a dictionary where:
- the key matches the key of the `json` object
- the value is a tuple where:
  - the first element becomes the batch object's attribute name
  - the second element is the name of the `Field`
  
What do we mean when we say "becomes the batch object's attribute name"? Recall in the previous exercises where we accessed the `TEXT` and `LABEL` fields in the train/evaluation loop by using `batch.text` and `batch.label`. Here, to access the name we use `batch.n`, to access the location we use `batch.p`, etc.

A few notes:

- The order of the keys in the `fields` dictionary does not matter, as long as its keys match the `json` data keys.

- The `Field` name does not have to match the key in the `json` object, i.e. we use `PLACE` for the `"location"` field.

- When dealing with `json` data, not all of the keys have to be used, i.e. we did not use the `"age"` field.

- Also, if the value of a `json` field is a string then the `Field`'s tokenization is applied, but if the value is a list then no tokenization is applied. It is usually a good idea to have the data already tokenized into lists, as this saves time because you don't have to wait for TorchText to do it.

- The values of a `json` field do not all have to be the same type. Some examples can have their `"quote"` as a string, and some as a list. The tokenization will only be applied to the ones with their `"quote"` as a string.

- If you are using a `json` field, every single example must have an instance of that field, i.e. in this example all examples must have a name, location and quote. However, as we are not using the age field, it does not matter if an example does not have it.

In [94]:
fields = {'name': ('n', NAME), 'location': ('p', PLACE), 'quote': ('s', SAYING)}

We then create our `train` and `test` datasets with the `TabularDataset.splits` function. 

The `path` argument specifies the top level folder common to both datasets, and the `train` and `test` arguments specify the filename of each dataset, i.e. here the train dataset is located at `data/train.json`.

We tell the function we are using `json` data, and pass in our `fields` dictionary defined previously.

In [95]:
train, test = data.TabularDataset.splits(
                path = 'data',
                train = 'train.json',
                test = 'test.json',
                format = 'json',
                fields = fields
)

If you already had a validation dataset, the location of this can be passed as the `validation` argument.

In [96]:
train, valid, test = data.TabularDataset.splits(
                path = 'data',
                train = 'train.json',
                validation = 'valid.json',
                test = 'test.json',
                format = 'json',
                fields = fields
)

We can then view an example to make sure it has worked correctly.

Notice how the field names (`n`, `p` and `s`) match up with what was defined in the `fields` dictionary.

Also notice how the word `"United Kingdom"` in `p` has been split by the tokenization, whereas the `"united kingdom"` in `s` has not. This is due to what was mentioned previously, where TorchText assumes that any `json` fields that are lists are already tokenized and no further tokenization is applied. 

In [97]:
print('vars(train[0]):', vars(train[0]))

vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united kingdom']}
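To make the mapping rules from the notes above concrete, here is a simplified stand-in written in plain Python. It is a sketch of the described behaviour, not TorchText's actual implementation: the function `apply_fields` is hypothetical, plain strings stand in for the `Field` objects, and `str.split` is assumed as the tokenizer.

```python
import json


def apply_fields(json_line, fields, tokenize=str.split):
    """Mimic, in plain Python, the mapping rules described in the notes.

    `fields` maps a json key to an (attribute_name, field) tuple. The field
    object is ignored here; only the attribute name is used.
    """
    record = json.loads(json_line)
    example = {}
    for key, (attr_name, _field) in fields.items():
        if key not in record:
            # Every key named in `fields` must be present in every example.
            raise KeyError(f"example is missing required key {key!r}")
        value = record[key]
        # Strings are tokenized; lists are assumed to be tokenized already.
        example[attr_name] = tokenize(value) if isinstance(value, str) else value
    return example


# Placeholder strings stand in for the NAME, PLACE and SAYING Field objects.
sketch_fields = {"name": ("n", "NAME"), "location": ("p", "PLACE"), "quote": ("s", "SAYING")}
line = ('{"name": "John", "location": "United Kingdom", "age": 42, '
        '"quote": ["i", "love", "the", "united kingdom"]}')

example = apply_fields(line, sketch_fields)
print(example)
```

The `"age"` key is never looked up, so it can be missing from an example without causing an error, whereas a missing `"name"`, `"location"` or `"quote"` raises a `KeyError` in this sketch.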


We can now use `train`, `test` and `valid` to build a vocabulary and create iterators, as in the other notebooks.

## Reading CSV/TSV

`csv` and `tsv` are very similar: one has elements separated by commas, the other by tabs.

Using the same example above, our `tsv` data will be in the form of:

```
name	location	age	quote
John	United Kingdom	42	i love the united kingdom
Mary	United States	36	i want more telescopes
```

That is, on each row the elements are separated by tabs and we have one example per row. The first row is usually a header (i.e. the name of each of the columns), but your data could have no header.

You cannot have lists within `tsv` or `csv` data.

The way the fields are defined is a bit different to `json`. We now use a list of tuples, where the elements of the tuples are the same as before, i.e. the first element is the batch object's attribute name and the second element is the `Field` name.

Unlike the `json` data, the tuples have to be in the same order as the columns in the `tsv` data. Because of this, a column is skipped by putting a tuple of `None`s in its position. Without it, our `SAYING` field would be applied to the `age` column of the `tsv` data and the `quote` column would not be used.

However, if you only wanted to use the `name` and `location` columns, you could just use two tuples as they are the first two columns.

We change our `TabularDataset` to read the correct `.tsv` files, and change the `format` argument to `'tsv'`.

If your data has a header, which ours does, it must be skipped by passing `skip_header = True`. If not, TorchText will think the header is an example. By default, `skip_header` will be `False`.

In [116]:
fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]

In [117]:
train, valid, test = data.TabularDataset.splits(
                path = 'data',
                train = 'train.tsv',
                validation = 'valid.tsv',
                test = 'test.tsv',
                format = 'tsv',
                fields = fields,
                skip_header = True
)

In [118]:
print('vars(train[0]):', vars(train[0]))

vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the', 'united', 'kingdom']}
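To see why the `(None, None)` placeholder and `skip_header` matter, the positional mapping can be imitated with Python's standard `csv` module. This is an illustrative sketch only, not TorchText's parser: `apply_positional_fields` is a hypothetical helper, plain strings stand in for the `Field` objects, and `str.split` is assumed as the tokenizer.

```python
import csv
import io


def apply_positional_fields(text, fields, delimiter="\t", skip_header=False,
                            tokenize=str.split):
    """Pair the i-th (attribute_name, field) tuple with the i-th column."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    if skip_header:
        next(rows)  # discard the header row so it is not read as an example
    examples = []
    for row in rows:
        example = {}
        # zip stops at the shorter sequence, so trailing columns are dropped.
        for (attr_name, _field), value in zip(fields, row):
            if attr_name is not None:  # a (None, None) tuple skips the column
                example[attr_name] = tokenize(value)
        examples.append(example)
    return examples


tsv_text = (
    "name\tlocation\tage\tquote\n"
    "John\tUnited Kingdom\t42\ti love the united kingdom\n"
    "Mary\tUnited States\t36\ti want more telescopes\n"
)

with_placeholder = [("n", "NAME"), ("p", "PLACE"), (None, None), ("s", "SAYING")]
without_placeholder = [("n", "NAME"), ("p", "PLACE"), ("s", "SAYING")]

good = apply_positional_fields(tsv_text, with_placeholder, skip_header=True)
bad = apply_positional_fields(tsv_text, without_placeholder, skip_header=True)
with_header = apply_positional_fields(tsv_text, with_placeholder)

print(good[0])         # 's' holds the tokenized quote
print(bad[0])          # 's' holds the age column instead
print(with_header[0])  # the header row has been read as an example
```

Passing `delimiter=","` to the same helper would imitate the `csv` case.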


Finally, we'll cover `csv` files. 

This is pretty much exactly the same as for the `tsv` files, except with the `format` argument set to `'csv'`.

In [123]:
fields = [('n', NAME), ('p', PLACE), (None, None), ('s', SAYING)]

In [124]:
train, valid, test = data.TabularDataset.splits(
                path = 'data',
                train = 'train.csv',
                validation = 'valid.csv',
                test = 'test.csv',
                format = 'csv',
                fields = fields,
                skip_header = True
)

In [125]:
print('vars(train[0]):', vars(train[0]))

vars(train[0]): {'n': ['John'], 'p': ['United', 'Kingdom'], 's': ['i', 'love', 'the']}


## Why JSON over CSV/TSV?

1. Your `csv` or `tsv` data cannot contain lists. This means the data cannot be tokenized during a pre-processing step, so every time you run a Python script that reads this data via TorchText, it has to be tokenized again. Using advanced tokenizers, such as the `spaCy` tokenizer, takes a non-negligible amount of time, especially if you are running your script multiple times. Thus, it is better to tokenize your data from a string into a list once and use the `json` format.

2. If tabs appear in your `tsv` data, or commas appear in your `csv` data, TorchText will think they are delimiters between columns. This will cause your data to be parsed incorrectly, and worst of all TorchText will not alert you to this as it cannot tell the difference between a tab/comma in a field and a tab/comma as a delimiter. As `json` data is essentially a dictionary, you access the data within the fields via its key, so you do not have to worry about surprise delimiters.
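Both points can be illustrated with a small standard-library sketch. The record below is hypothetical, the `tsv` line is deliberately written without any quoting, and `str.split` stands in for a real tokenizer such as `spaCy`.

```python
import csv
import io
import json

# A hypothetical record whose quote happens to contain a tab character.
record = {
    "name": "John",
    "location": "United Kingdom",
    "quote": "i love\tthe united kingdom",
}

# Written as an unquoted tsv line, the embedded tab creates a fourth column.
tsv_line = "\t".join([record["name"], record["location"], record["quote"]])
tsv_columns = next(csv.reader(io.StringIO(tsv_line), delimiter="\t"))
print(len(tsv_columns), tsv_columns)

# As a json line, the tab is escaped inside the string, and the value is
# recovered intact through its key.
json_line = json.dumps(record)
restored = json.loads(json_line)
print(restored["quote"] == record["quote"])

# Tokenizing once, before writing, stores the quote as a list so that it
# does not need to be tokenized again each time the data is loaded.
tokenized_line = json.dumps({**record, "quote": record["quote"].split()})
print(tokenized_line)
```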