<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/huggingface-transformers/huggingface-course/05-dataset-library/1-loading-custom-dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What if my dataset isn't on the Hub?

You know how to use the Hugging Face Hub to download datasets, but you'll often find yourself working with data that is stored either on your laptop or on a remote server. In this section we'll show you how 🤗 Datasets can be used to load datasets that aren't available on the Hugging Face Hub.

Let's install the Transformers and Datasets libraries to run this notebook.

In [None]:
!pip -q install datasets transformers[sentencepiece]

In [2]:
from datasets import load_dataset

## Working with local and remote datasets

🤗 Datasets provides loading scripts to handle the loading of local and remote datasets. It supports several common data formats, such as:


| Data format | Loading script | Example |
|--|--|--|
| CSV & TSV | csv | load_dataset("csv", data_files="my_file.csv") |
| Text files | text | load_dataset("text", data_files="my_file.txt") |
| JSON & JSON Lines | json | load_dataset("json", data_files="my_file.jsonl") |
| Pickled DataFrames | pandas | load_dataset("pandas", data_files="my_dataframe.pkl") |

As shown in the table, for each data format we just need to specify the type of loading script in the load_dataset() function, along with a data_files argument that specifies the path to one or more files. Let's start by loading a dataset from local files; later we'll see how to do the same with remote files.

## Loading a local dataset

For this example we'll use the [SQuAD-it dataset](https://github.com/crux82/squad-it/), which is a large-scale dataset for question answering in Italian.

The training and test splits are hosted on GitHub, so we can download them with a simple wget command:

In [3]:
!wget -q https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget -q https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

This will download two compressed files called SQuAD_it-train.json.gz and SQuAD_it-test.json.gz, which we can decompress with the Linux gzip command:

In [4]:
!gzip -dkv SQuAD_it-*.json.gz

SQuAD_it-test.json.gz:	 87.5% -- replaced with SQuAD_it-test.json
SQuAD_it-train.json.gz:	 82.3% -- replaced with SQuAD_it-train.json


We can see that the archives have been decompressed into SQuAD_it-train.json and SQuAD_it-test.json, and that the data is stored in the JSON format. Because we passed the -k flag, gzip also keeps the original .gz files.


To load a JSON file with the load_dataset() function, we just need to know if we're dealing with ordinary JSON (similar to a nested dictionary) or JSON Lines (line-separated JSON).
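To make the difference concrete, here is a small sketch that stores the same records in both layouts using only Python's standard json module. The records and the "version" key are toy values invented for illustration; they are not taken from SQuAD-it.

```python
import json

# Hypothetical toy records, used only to contrast the two layouts.
records = [
    {"title": "Doc A", "paragraphs": ["first paragraph"]},
    {"title": "Doc B", "paragraphs": ["second paragraph"]},
]

# Ordinary (nested) JSON: one document, with the records under a top-level key.
nested_json = json.dumps({"version": "toy", "data": records})

# JSON Lines: one self-contained JSON object per line, no enclosing structure.
json_lines = "\n".join(json.dumps(record) for record in records)

# Reading nested JSON: parse the whole document, then pick out the field.
from_nested = json.loads(nested_json)["data"]

# Reading JSON Lines: parse each non-empty line independently.
from_lines = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

print(from_nested == from_lines)
```

In the nested layout the reader has to be told which key holds the records, whereas JSON Lines needs no such hint.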

Like many question answering datasets, SQuAD-it uses the nested format, with all the text stored in a data field. This means we can load the dataset by specifying the field argument as follows:

In [None]:
squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")

By default, loading local files creates a DatasetDict object with a train split. We can see this by inspecting the squad_it_dataset object:

In [7]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

This shows us the number of rows and the column names associated with the training set. We can view one of the examples by indexing into the train split as follows:

In [17]:
# squad_it_dataset["train"][0]  # uncomment to display the first example (the output is long)

Great, we've loaded our first local dataset! But while this worked for the training set, what we really want is to include both the train and test splits in a single DatasetDict object so we can apply Dataset.map() functions across both splits at once.

To do this, we can provide a dictionary to the data_files argument that maps each split name to a file associated with that split:

In [None]:
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [22]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

In [16]:
#squad_it_dataset["train"][0]

In [15]:
#squad_it_dataset["test"][0]

This is exactly what we wanted. Now we can apply various preprocessing techniques to clean up the data, tokenize the text, and so on.
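When there are more than a couple of splits, the split-to-file mapping can be built programmatically. The helper below is a hypothetical sketch (build_data_files and its template parameter are not part of 🤗 Datasets) that assumes files follow the SQuAD_it-{split}.json naming used here and fails early if one is missing. The demo runs against empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def build_data_files(directory, splits=("train", "test"), template="SQuAD_it-{split}.json"):
    """Map each split name to its file path, failing early if a file is missing."""
    data_files = {}
    for split in splits:
        path = Path(directory) / template.format(split=split)
        if not path.is_file():
            raise FileNotFoundError(f"Missing file for split '{split}': {path}")
        data_files[split] = str(path)
    return data_files


# Demo with empty placeholder files in a temporary directory (illustration only).
demo_dir = Path(tempfile.mkdtemp())
for split in ("train", "test"):
    (demo_dir / f"SQuAD_it-{split}.json").touch()

data_files = build_data_files(demo_dir)
print(data_files)
```

With the real files in place, the returned dictionary can be passed to load_dataset() through data_files in the same way as the hand-written one above.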

>The data_files argument of the load_dataset() function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting data_files="*.json"). See the 🤗 [Datasets documentation](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files) for more details.
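To preview which files a shell-style pattern would pick up, Python's standard glob module can be used as a rough stand-in. This is only a sketch with made-up file names; the library resolves patterns itself and may differ from glob in edge cases.

```python
import glob
import tempfile
from pathlib import Path

# Create a scratch directory with a few empty files (hypothetical names).
scratch = Path(tempfile.mkdtemp())
for name in ["part-1.json", "part-2.json", "notes.txt"]:
    (scratch / name).touch()

# Shell-style pattern: "*" matches any run of characters within a file name.
matched = sorted(Path(p).name for p in glob.glob(str(scratch / "*.json")))
print(matched)
```

Only the two .json files match, so notes.txt would not end up in the split.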

The loading scripts in 🤗 Datasets actually support automatic decompression of the input files, so we could have skipped the use of gzip by pointing the data_files argument directly to the compressed files:

In [None]:
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [19]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

This can be useful if you don't want to manually decompress many GZIP files. The automatic decompression also applies to other common formats like ZIP and TAR, so you just need to point data_files to the compressed files and you're good to go!
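Conceptually, reading a compressed file without a manual gzip step means decompressing it on the fly while parsing. The sketch below shows that idea with Python's standard gzip and json modules on a made-up toy file; it illustrates the concept and is not a description of how 🤗 Datasets implements it.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical toy content, shaped like a nested JSON file with a "data" field.
payload = {"data": [{"title": "Doc A", "paragraphs": []}]}

# Write a gzip-compressed JSON file to a temporary directory.
gz_path = Path(tempfile.mkdtemp()) / "toy.json.gz"
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

# Read it back by decompressing on the fly; no .json copy is written to disk.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    restored = json.load(f)

print(restored == payload)
```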

Now that you know how to load local files on your laptop or desktop, let's take a look at loading remote files.

## Loading a remote dataset

If you're working as a data scientist or coder in a company, there's a good chance the datasets you want to analyze are stored on some remote server. Fortunately, loading remote files is just as simple as loading local ones! Instead of providing a path to local files, we point the data_files argument of load_dataset() to one or more URLs where the remote files are stored.

For example, for the SQuAD-it dataset hosted on GitHub, we can just point data_files to the SQuAD_it-*.json.gz URLs as follows:

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}

squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [24]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

This returns the same DatasetDict object obtained above, but saves us the step of manually downloading and decompressing the SQuAD_it-*.json.gz files. This wraps up our foray into the various ways to load datasets that aren't hosted on the Hugging Face Hub. Now that we've got a dataset to play with, let's get our hands dirty with various data-wrangling techniques!