# Exploring `datasets`

In [1]:
from datasets import load_dataset

### Loading a local dataset

In [2]:
sc_dataset = load_dataset("csv", data_files="./../../data/FiQA_and_Financial_PhraseBank_in_1/data.csv")

|Data format       |Loading script|Example                                                  |
|------------------|:------------:|---------------------------------------------------------|
|CSV & TSV         |`csv`         |`load_dataset("csv", data_files="my_file.csv")`          |
|Text files        |`text`        |`load_dataset("text", data_files="my_file.txt")`         |
|JSON & JSON Lines |`json`        |`load_dataset("json", data_files="my_file.jsonl")`       |
|Pickled DataFrames|`pandas`      |`load_dataset("pandas", data_files="my_dataframe.pkl")`  |

In [3]:
sc_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 5842
    })
})

This creates a `DatasetDict` object with a single `train` split. When the data is spread over several files, such as train, dev, and test, the `data_files` argument of `load_dataset()` is flexible: it accepts a single file path, a list of file paths, or a dictionary that maps split names to file paths.

In [4]:
data_files = {'train':'./../../data/re-tacred/train.json', 
              'dev':'./../../data/re-tacred/dev.json', 
              'test':'./../../data/re-tacred/test.json'}

re_dataset = load_dataset("json", data_files=data_files)
re_dataset

DatasetDict({
    train: Dataset({
        features: ['relation', 'stanford_deprel', 'subj_start', 'stanford_ner', 'obj_start', 'token', 'stanford_head', 'subj_type', 'subj_end', 'stanford_pos', 'obj_type', 'docid', 'obj_end', 'id'],
        num_rows: 58465
    })
    dev: Dataset({
        features: ['relation', 'stanford_deprel', 'subj_start', 'stanford_ner', 'obj_start', 'token', 'stanford_head', 'subj_type', 'subj_end', 'stanford_pos', 'obj_type', 'docid', 'obj_end', 'id'],
        num_rows: 19584
    })
    test: Dataset({
        features: ['relation', 'stanford_deprel', 'subj_start', 'stanford_ner', 'obj_start', 'token', 'stanford_head', 'subj_type', 'subj_end', 'stanford_pos', 'obj_type', 'docid', 'obj_end', 'id'],
        num_rows: 13418
    })
})

**NOTE:** The `load_dataset()` function can also automatically decompress common archive formats such as ZIP and TAR.

## Dataset Wrangling

##### Selecting a random sample for data analysis.

`Dataset.select()` expects an iterable of indices.

In [5]:
sample = sc_dataset["train"].shuffle(seed=25).select(range(1000))

### Dataset Slicing

In [6]:
sample[0]

{'Sentence': "Mr Kivimeister said John Deer former Timberjack stands to win in the situation : it controls around 60 % of Estonia 's forest machinery market .",
 'Sentiment': 'positive'}

In [7]:
sample[:4]

{'Sentence': ["Mr Kivimeister said John Deer former Timberjack stands to win in the situation : it controls around 60 % of Estonia 's forest machinery market .",
  'The tightened competition situation in the production automation market has affected net sales during 2006 , Cencorp said .',
  '1 p.m. Central office of Nordea Bank 19 3-ya ulitsa Yamskogo Polya , Building 1 Telephone : 495 777-34-77 ext. 3932 , 3931 03.02.2011 Unimilk - EGM 03-04 .02.2011 XVI international business-summit Food Business Russia 2011 will take place .',
  "Before Kemira 's installation NordAlu was producing 3,500 tons of liquid and solid aluminum waste per year ."],
 'Sentiment': ['positive', 'neutral', 'neutral', 'neutral']}

### Important functions

##### unique()

In [8]:
sample.unique('Sentiment')

['positive', 'neutral', 'negative']

In [9]:
for split in re_dataset.keys():
    assert len(re_dataset[split].unique('id')) == len(re_dataset[split])

##### filter()

In [10]:
# Keep only samples whose sentence has more than 5 whitespace-separated tokens
sample = sample.filter(lambda x: len(x["Sentence"].split()) > 5)
print(len(sample))

981


##### map()

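The `map()` function applies a function to every example in a dataset. By default it processes one example at a time; with `batched=True` it processes batches of examples at once.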

In [11]:
def add_sentences(examples):
    return {'sentence': ' '.join(examples["token"])}

re_dataset = re_dataset.map(add_sentences)
re_dataset['train'][0]

{'relation': 'org:founded_by',
 'stanford_deprel': ['compound',
  'nsubj',
  'ROOT',
  'case',
  'nmod',
  'amod',
  'nmod:tmod',
  'mark',
  'xcomp',
  'det',
  'compound',
  'compound',
  'dobj',
  'punct',
  'appos',
  'punct',
  'punct',
  'xcomp',
  'det',
  'dobj',
  'case',
  'nummod',
  'nmod',
  'case',
  'nmod',
  'punct',
  'xcomp',
  'amod',
  'compound',
  'compound',
  'compound',
  'dobj',
  'mark',
  'xcomp',
  'dobj',
  'cc',
  'conj',
  'det',
  'compound',
  'dobj',
  'punct'],
 'subj_start': 10,
 'stanford_ner': ['PERSON',
  'PERSON',
  'O',
  'O',
  'DATE',
  'DATE',
  'DATE',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'ORGANIZATION',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'NUMBER',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'PERSON',
  'PERSON',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'],
 'obj_start': 0,
 'token': ['Tom',
  'Thabane',
  'resigned',
  'in',
  'October',
  'last',
  'year',
  'to',
  'form',
  'the',


In [12]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("./../../../hf_models/distilbert-base-uncased/")

def tokenize(examples):
    return tokenizer(examples["sentence"], truncation=True)

In [13]:
%%time
tokenized_dataset = re_dataset.map(tokenize, batched=True, num_proc=8)

CPU times: user 131 ms, sys: 7.8 ms, total: 139 ms
Wall time: 138 ms


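**Speeding up `map()`:** With `batched=True`, the function receives a whole batch of examples per call instead of a single example, which is usually much faster for tokenization because the tokenizer can process a list of sentences at once. For large datasets, multiprocessing can additionally be enabled with the `num_proc` parameter, which sets the number of worker processes.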

In [14]:
tokenized_dataset['train'][0]

{'relation': 'org:founded_by',
 'stanford_deprel': ['compound',
  'nsubj',
  'ROOT',
  'case',
  'nmod',
  'amod',
  'nmod:tmod',
  'mark',
  'xcomp',
  'det',
  'compound',
  'compound',
  'dobj',
  'punct',
  'appos',
  'punct',
  'punct',
  'xcomp',
  'det',
  'dobj',
  'case',
  'nummod',
  'nmod',
  'case',
  'nmod',
  'punct',
  'xcomp',
  'amod',
  'compound',
  'compound',
  'compound',
  'dobj',
  'mark',
  'xcomp',
  'dobj',
  'cc',
  'conj',
  'det',
  'compound',
  'dobj',
  'punct'],
 'subj_start': 10,
 'stanford_ner': ['PERSON',
  'PERSON',
  'O',
  'O',
  'DATE',
  'DATE',
  'DATE',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'ORGANIZATION',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'NUMBER',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'PERSON',
  'PERSON',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'],
 'obj_start': 0,
 'token': ['Tom',
  'Thabane',
  'resigned',
  'in',
  'October',
  'last',
  'year',
  'to',
  'form',
  'the',


## Splitting the dataset

In [15]:
sc_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 5842
    })
})

In [16]:
splitted_sc_dataset = sc_dataset["train"].train_test_split(train_size=0.9, seed=25)
splitted_sc_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 5257
    })
    test: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 585
    })
})

In [17]:
splitted_sc_dataset["dev"] = splitted_sc_dataset.pop("test")
splitted_sc_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 5257
    })
    dev: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 585
    })
})

In [18]:
final_sc_dataset = splitted_sc_dataset["train"].train_test_split(train_size=0.9, seed=25)

In [19]:
final_sc_dataset["dev"] = splitted_sc_dataset["dev"]
final_sc_dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 4731
    })
    test: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 526
    })
    dev: Dataset({
        features: ['Sentence', 'Sentiment'],
        num_rows: 585
    })
})

## Saving the dataset

The default format is *Arrow*. The `save_to_disk()` function saves the dataset in Arrow format: each split is stored in its own *dataset.arrow* table, alongside some metadata in *dataset_info.json* and *state.json*.

|Data format|Function                |
|-----------|------------------------|
|Arrow      |`Dataset.save_to_disk()`|
|CSV        |`Dataset.to_csv()`      |
|JSON       |`Dataset.to_json()`     |

In [None]:
for split, dataset in final_sc_dataset.items():
    dat