# Exploring `datasets`

In [None]:
from datasets import load_dataset

### Loading a local dataset

In [None]:
sc_dataset = load_dataset("csv", data_files="./../../data/FiQA_and_Financial_PhraseBank_in_1/data.csv")

|Data format       |Loading script|Example                                                |
|------------------|:------------:|-------------------------------------------------------|
|CSV & TSV         |`csv`         |`load_dataset("csv", data_files="my_file.csv")`        |
|Text files        |`text`        |`load_dataset("text", data_files="my_file.txt")`       |
|JSON & JSON Lines |`json`        |`load_dataset("json", data_files="my_file.jsonl")`     |
|Pickled DataFrames|`pandas`      |`load_dataset("pandas", data_files="my_dataframe.pkl")`|

In [None]:
sc_dataset

This creates a `DatasetDict` object with a single `train` split. When the data is spread over several files, such as train, dev, and test, use the `data_files` argument of `load_dataset()`: it accepts a single file path, a list of file paths, or a dictionary that maps split names to file paths.

In [None]:
data_files = {'train':'./../../data/re-tacred/train.json', 
              'dev':'./../../data/re-tacred/dev.json', 
              'test':'./../../data/re-tacred/test.json'}

re_dataset = load_dataset("json", data_files=data_files)
re_dataset

**NOTE:** The `load_dataset()` function can also automatically decompress common archive formats such as ZIP and TAR.

## Dataset Wrangling

##### Selecting a random sample for data analysis.

`Dataset.select()` expects an iterable of indices.

In [None]:
sample = sc_dataset["train"].shuffle(seed=25).select(range(1000))

### Dataset Slicing

In [None]:
sample[0]

In [None]:
sample[:4]

### Important functions

##### unique()

In [None]:
sample.unique('Sentiment')

In [None]:
for split in re_dataset.keys():
    assert len(re_dataset[split].unique('id')) == len(re_dataset[split])

##### filter()

In [None]:
# Keep only samples whose sentence has more than 5 whitespace-separated words
sample = sample.filter(lambda x: len(x["Sentence"].split()) > 5)
print(len(sample))

##### map()

By default, `map()` applies a function to each example individually; with `batched=True` it processes batches of examples at once.

In [None]:
def add_sentences(examples):
    return {'sentence': ' '.join(examples["token"])}

re_dataset = re_dataset.map(add_sentences)
re_dataset['train'][0]

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./../../../hf_models/distilbert-base-uncased/")

def tokenize(examples):
    return tokenizer(examples["sentence"], truncation=True)

In [None]:
%%time
tokenized_dataset = re_dataset.map(tokenize, batched=True, num_proc=8)

Setting `batched=True` speeds up processing because the function receives a whole batch of examples per call, which fast tokenizers can process in parallel. For large datasets, multiprocessing can additionally be enabled with the `num_proc` parameter, which sets the number of worker processes.

In [None]:
tokenized_dataset['train'][0]

## Splitting the dataset

In [None]:
sc_dataset

In [None]:
splitted_sc_dataset = sc_dataset["train"].train_test_split(train_size=0.9, seed=25)
splitted_sc_dataset

In [None]:
splitted_sc_dataset["dev"] = splitted_sc_dataset.pop("test")
splitted_sc_dataset

In [None]:
final_sc_dataset = splitted_sc_dataset["train"].train_test_split(train_size=0.9, seed=25)

In [None]:
final_sc_dataset["dev"] = splitted_sc_dataset["dev"]
final_sc_dataset

## Saving the dataset

The default format is *Arrow*. The `save_to_disk()` function saves the dataset in *Arrow* format: each split is stored in its own directory containing an Arrow table (*dataset.arrow*; newer releases may use sharded file names) along with metadata in *dataset_info.json* and *state.json*.

|Data format|Function                 |
|-----------|-------------------------|
|Arrow      |`Dataset.save_to_disk()` |
|CSV        |`Dataset.to_csv()`       |
|JSON       |`Dataset.to_json()`      |
In [None]:
for split, dataset in final_sc_dataset.items():
    dataset.to_csv(f"./../../data/FiQA_and_Financial_PhraseBank_in_1/{split}.csv")