<a href="https://colab.research.google.com/github/rkarthiksub/MachineLearningNotebooks/blob/master/Day_4_LLM_Workbook_Workshop_April_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Datasets


* In the previous session, we used the HF ```datasets``` module to load a dataset from the hub and used it directly for the fine-tuning tasks.
* It is worth knowing what else the ```datasets``` module can do, along with a few more of its helpful methods.
* The ```datasets``` library gives us quick access to the thousands of existing datasets for the NLP, audio, and vision domains.
* Please visit the official [documentation](https://huggingface.co/docs/datasets/en/index) to learn more about the ```datasets``` module.

In [None]:
%%capture
# install the datasets library (needed on colab; skip this cell if it is already installed)
!pip install datasets

In [None]:
import datasets

* Let's see the list of attributes and functions of the datasets module

In [None]:
dir(datasets)

['Array2D',
 'Array3D',
 'Array4D',
 'Array5D',
 'ArrowBasedBuilder',
 'Audio',
 'BuilderConfig',
 'ClassLabel',
 'Dataset',
 'DatasetBuilder',
 'DatasetDict',
 'DatasetInfo',
 'DownloadConfig',
 'DownloadManager',
 'DownloadMode',
 'Features',
 'GeneratorBasedBuilder',
 'Image',
 'IterableDataset',
 'IterableDatasetDict',
 'LargeList',
 'NamedSplit',
 'NamedSplitAll',
 'Pdf',
 'ReadInstruction',
 'Sequence',
 'Split',
 'SplitBase',
 'SplitDict',
 'SplitGenerator',
 'SplitInfo',
 'StreamingDownloadManager',
 'SubSplitInfo',
 'Translation',
 'TranslationVariableLanguages',
 'Value',
 'VerificationMode',
 'Version',
 'Video',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'are_progress_bars_disabled',
 'arrow_dataset',
 'arrow_reader',
 'arrow_writer',
 'builder',
 'combine',
 'concatenate_datasets',
 'config',
 'data_files',
 'dataset_dict',
 'disable_caching',
 'disable_progress_bar',
 'dis

* We are already familiar with classes such as _Dataset_, _DatasetDict_, _Features_, _Value_ and with the function _load_dataset_
* When we load a dataset from the hub, the data is cached locally in a [memory-mapped](https://huggingface.co/docs/datasets/en/about_arrow) columnar format (therefore, it uses less RAM)
* The hub contains thousands of datasets
* Visit : https://huggingface.co/datasets
* Based on the task we can download a suitable dataset
* Let us download a dataset for language modelling: https://huggingface.co/datasets/wikipedia

In [None]:
# uncomment the line below if you are using colab and try to access wikipedia dump
# !pip install mwparserfromhell

# Creating a dataset from Local files

 * Say we have dataset files stored locally in CSV format
 * Suppose we have two files, namely "set-1.csv" and "set-2.csv" in a directory (create dummy sets if you are using this notebook on colab)
 * NOTE: All the files should have the **same number of columns** with the same column names.
 * Then we can simply load them as follows

In [None]:
from datasets import load_dataset

In [None]:
# load california housing dataset from sklearn
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)

In [None]:
print(housing)

{'data':        MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
...       ...       ...       ...        ...         ...       ...       ...   
20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48   
20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49   
20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43   
20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43   
20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37   

       Longitude  
0        -1

In [None]:
# save the dataset to a local 'csv' file
housing['data'].to_csv('./cali_housing.csv')#, index=False)

In [None]:
!head cali_housing.csv

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984126984126984,1.0238095238095237,322.0,2.5555555555555554,37.88,-122.23
1,8.3014,21.0,6.238137082601054,0.9718804920913884,2401.0,2.109841827768014,37.86,-122.22
2,7.2574,52.0,8.288135593220339,1.073446327683616,496.0,2.8022598870056497,37.85,-122.24
3,5.6431,52.0,5.8173515981735155,1.0730593607305936,558.0,2.547945205479452,37.85,-122.25
4,3.8462,52.0,6.281853281853282,1.0810810810810811,565.0,2.1814671814671813,37.85,-122.25
5,4.0368,52.0,4.761658031088083,1.1036269430051813,413.0,2.139896373056995,37.85,-122.25
6,3.6591,52.0,4.9319066147859925,0.9513618677042801,1094.0,2.1284046692607004,37.84,-122.25
7,3.12,52.0,4.797527047913447,1.061823802163833,1157.0,1.7882534775888717,37.84,-122.25
8,2.0804,42.0,4.294117647058823,1.1176470588235294,1206.0,2.026890756302521,37.84,-122.26


In [None]:
# a list of paths to the files containing the dataset
data_files=["cali_housing.csv",
            # '2nd_csv_file.csv'
            ]

# an example if there are multiple files
# data_files=["cali_housing.csv","cali_housing2.csv"]

# load the dataset in HF format from csv files
data_local = load_dataset("csv",data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# print the new dataset loaded with HF API
print(data_local)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 20640
    })
})


* We can split the samples into train, validation and test splits

In [None]:
# divide the 'train' split to create (train, test) sets
raw_dataset= data_local['train'].train_test_split(test_size=0.2)
print(raw_dataset)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 16512
    })
    test: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 4128
    })
})


In [None]:
help(data_local['train'].train_test_split)

In [None]:
# further divide 'test' split to create 'validation' set
test_val_dataset= raw_dataset['test'].train_test_split(test_size=0.5)
print(test_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 2064
    })
    test: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 2064
    })
})


In [None]:
# create the HF dataset with all 3 splits
new_dataset = datasets.DatasetDict({
    'train':raw_dataset['train'],
    'test':test_val_dataset['train'],
    'validation':test_val_dataset['test'],
})

In [None]:
new_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 16512
    })
    test: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 2064
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 2064
    })
})

In [None]:
# remove unnecessary columns (this returns a new DatasetDict; new_dataset itself is not modified)
new_dataset.remove_columns(['Unnamed: 0'])

DatasetDict({
    train: Dataset({
        features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 16512
    })
    test: Dataset({
        features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 2064
    })
    validation: Dataset({
        features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'],
        num_rows: 2064
    })
})

* For other formats, you can see [here](https://huggingface.co/docs/datasets/en/tabular_load#csv-files)
* We can create a validation split by splitting the dict further, as discussed [here](https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090/21)
* **What about data_files in other formats**?
    * For all file formats, internally HF loads a suitable dataset builder
    * You can see all the supported formats and their builder [here](https://huggingface.co/docs/datasets/en/about_dataset_load)

In [None]:
# save the dataset to local disk
new_dataset.save_to_disk('local_datasets/custom_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/16512 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2064 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2064 [00:00<?, ? examples/s]

In [None]:
# list the files of the saved HF dataset
!ls local_datasets/custom_dataset/validation -lh

total 156K
-rw-r--r-- 1 root root 148K Apr 24 14:55 data-00000-of-00001.arrow
-rw-r--r-- 1 root root 1.3K Apr 24 14:55 dataset_info.json
-rw-r--r-- 1 root root  250 Apr 24 14:55 state.json


In [None]:
# dataset.save_to_disk('local_datasets/temp_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/32863 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/520 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2507 [00:00<?, ? examples/s]

* It creates a folder 'custom_dataset' with a sub-folder for each split (namely train, test and validation here)
* Within each sub-folder, the split is stored in the ```arrow``` format, next to two JSON metadata files
* A folder written by ```save_to_disk``` is meant to be read back with ```datasets.load_from_disk```
* Passing the same path to ```load_dataset``` does not restore the data: as the output below shows, it treated the JSON metadata files as the data, giving one row per split with columns such as ```_fingerprint```

In [None]:
# NOTE: load_dataset is not the right reader for a save_to_disk folder (see the output)
raw_dataset_from_disk = load_dataset('local_datasets/custom_dataset/')
print(raw_dataset_from_disk)

DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    test: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})


## Create audio and image datasets

https://huggingface.co/docs/datasets/en/create_dataset


## Publish a dataset in HF Hub

https://huggingface.co/docs/datasets/en/upload_dataset

# Load dataset from Hub

 * The details of a dataset that we wish to use are available in the hub itself
 * However, you can import ```load_dataset_builder``` to get the info (before downloading)
 * Note: Loading the access token from a shell environment variable is safer than loading it from a file or entering it directly.

In [None]:
access_token = "" # copy your access token from HF
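
Reading the token from an environment variable, as the note above recommends, could look like the sketch below. `read_hf_token` is a hypothetical helper; `HF_TOKEN` is the variable name the Hugging Face libraries look for, but any name would work here.

```python
import os


def read_hf_token(var_name="HF_TOKEN"):
    """Return the token stored in an environment variable, or None if unset or empty."""
    token = os.environ.get(var_name, "").strip()
    return token or None


access_token = read_hf_token()
print("token found" if access_token else "no token set (public datasets can still be loaded)")
```

This keeps the secret out of the notebook itself, so the file can be shared without leaking it.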

In [None]:
from datasets import load_dataset, get_dataset_split_names, get_dataset_config_names, get_dataset_config_info

## Dataset from Hub: Translation

* Let's take a look at WMT-14 dataset: https://huggingface.co/datasets/wmt/wmt14
* The dataset is composed of 5 sub-datasets (called configurations in HF datasets): "cs-en", "de-en", "fr-en", "hi-en", "ru-en"
* We have to mention the subset we wish to download
* The ```load_dataset``` function accepts many optional arguments [Link](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset).
* Most commonly used: ```path, name, split, revision, streaming```
* It supports the following dataset formats: CSV, JSON, Arrow, SQL, WebDataset, Parquet

In [None]:
# print(get_dataset_config_info("wmt/wmt14","hi-en"))
print(get_dataset_config_names("wmt/wmt14"))
print(get_dataset_split_names("wmt/wmt14","hi-en")) # have to select a subset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

['cs-en', 'de-en', 'fr-en', 'hi-en', 'ru-en']
['train', 'validation', 'test']


In [None]:
temp_dataset = load_dataset(path="wmt/wmt14",name="hi-en", split='train', streaming=False)
print(temp_dataset)

train-00000-of-00001.parquet:   0%|          | 0.00/992k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/85.8k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/506k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/32863 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/520 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2507 [00:00<?, ? examples/s]

Dataset({
    features: ['translation'],
    num_rows: 32863
})


In [None]:
dataset = load_dataset(path="wmt/wmt14",name="hi-en",
                      #  token=access_token
                       )
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 32863
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})


In [None]:
dataset['train'][10000:10002]

{'translation': [{'en': 'Introduction to quantum mechanics',
   'hi': 'क्वांटम यांत्रिकी का परिचय'},
  {'en': 'Quantization', 'hi': 'क्वांटीकरण'}]}

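* We can also load the raw dataset as a whole (that is, the samples from all splits combined) by joining the split names with ```+```
* The result is a single ```Dataset```, not a ```DatasetDict``` (makes sense!)
* Now we can use the ```len()``` function to get the number of samples, as there is no ambiguity about which split is meant.

In [None]:
# download all the splits combined together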
raw_dataset = load_dataset(path="wmt/wmt14",name="hi-en",#token=access_token,
                           split="train+test+validation")

# print the data
print(raw_dataset)

# print number of samples in the dataset
print(len(raw_dataset))

Dataset({
    features: ['translation'],
    num_rows: 35890
})
35890


In [None]:
# get only train and validation splits
train_raw_dataset = load_dataset(path="wmt/wmt14",name="hi-en",#token=access_token,
                                 split="train+validation")

# get only test split
test_raw_dataset = load_dataset(path="wmt/wmt14",name="hi-en",#token=access_token,
                                split="test")

### Caching

* Now the dataset is cached locally at ```~/.cache/huggingface/datasets/wmt___wmt14```
* The leaf directory contains the following files ```dataset_info.json  wmt14-test.arrow  wmt14-train.arrow  wmt14-validation.arrow```
* Suppose, we want just the "train" split. Then we do not need to download the dataset again.

In [None]:
dataset = load_dataset(path="wmt/wmt14",name="hi-en",# token=access_token
                       )

print(f'Number of samples in each split:{dataset.num_rows}')
print(f'Number of columns in each split:{dataset.num_columns}')
print(f'Name of columns in each split:{dataset.column_names}')

Number of samples in each split:{'train': 32863, 'validation': 520, 'test': 2507}
Number of columns in each split:{'train': 1, 'validation': 1, 'test': 1}
Name of columns in each split:{'train': ['translation'], 'validation': ['translation'], 'test': ['translation']}


In [None]:
!ls /root/.cache/huggingface/datasets/ -lh

total 12K
drwxr-xr-x 3 root root 4.0K Apr 24 14:43 csv
drwxr-xr-x 3 root root 4.0K Apr 24 14:58 custom_dataset
-rw-r--r-- 1 root root    0 Apr 24 14:43 _root_.cache_huggingface_datasets_csv_default-2bbc2d5d94f3af17_0.0.0_a43390c7ecea6519ff2ce9d10005c8750601c9e456069be5efbd2747df45f420.lock
-rw-r--r-- 1 root root    0 Apr 24 14:59 _root_.cache_huggingface_datasets_custom_dataset_default_0.0.0_f10c1782347a9d6b.lock
-rw-r--r-- 1 root root    0 Apr 24 15:15 _root_.cache_huggingface_datasets_wmt___wmt14_hi-en_0.0.0_b199e406369ec1b7634206d3ded5ba45de2fe696.lock
drwxr-xr-x 3 root root 4.0K Apr 24 15:09 wmt___wmt14



* Note that all splits are of ```Dataset``` class
* Any transformation applied to a ```DatasetDict``` is applied to every split in the dictionary

In [None]:
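# Now load just the train split (it is read from the local cache, not downloaded again)
# stored under a separate name so that `dataset` stays the full DatasetDict used below
train_dataset = load_dataset(path="wmt/wmt14",name="hi-en",# token=access_token,
                             split='train')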

### Fingerprinting

Modifying dataset

* What if we apply a transformation on the downloaded dataset?
* Let's add a new word to all English sentences in the dataset

In [None]:
def add_prefix(sample):
    sample['translation']['en'] = 'hi '+sample['translation']['en']
    return sample

In [None]:
dataset_modified = dataset.map(add_prefix)

Map:   0%|          | 0/32863 [00:00<?, ? examples/s]

Map:   0%|          | 0/520 [00:00<?, ? examples/s]

Map:   0%|          | 0/2507 [00:00<?, ? examples/s]

In [None]:
print(dataset_modified['train']['translation'][10000:10004])

[{'en': 'hi Introduction to quantum mechanics', 'hi': 'क्वांटम यांत्रिकी का परिचय'}, {'en': 'hi Quantization', 'hi': 'क्वांटीकरण'}, {'en': 'hi Quick Gun Murugun', 'hi': 'क्विक गन मुरुगुन'}, {'en': 'hi Quiz', 'hi': 'क्विज़'}]


* Now the leaf cache directory gains one more file, named with a unique fingerprint for the transformation
* In this case, the newly added file is of the ```*.arrow``` type
* If we apply one more transformation, a new unique fingerprint is generated for that transformation

In [None]:
def add_prefix_hi(sample):
    # prepend a Hindi greeting to every Hindi sentence
    sample['translation']['hi'] = 'नमस्ते '+sample['translation']['hi']
    return sample

# modify the dataset
dataset_modified_hi = dataset.map(add_prefix_hi)

Map:   0%|          | 0/32863 [00:00<?, ? examples/s]

Map:   0%|          | 0/520 [00:00<?, ? examples/s]

Map:   0%|          | 0/2507 [00:00<?, ? examples/s]

In [None]:
print(dataset_modified_hi['test'][0:4])

{'translation': [{'en': 'A black box in your car?', 'hi': 'नमस्ते आपकी कार में ब्लैक बॉक्स?'}, {'en': "As America's road planners struggle to find the cash to mend a crumbling highway system, many are beginning to see a solution in a little black box that fits neatly by the dashboard of your car.", 'hi': 'नमस्ते जबकि अमेरिका के सड़क योजनाकार, ध्वस्त होते हुए हाईवे सिस्टम को सुधारने के लिए धन की कमी से जूझ रहे हैं, वहीं बहुत-से लोग इसका समाधान छोटे से ब्लैक बॉक्स में देख रहे हैं, जो आपकी कार के डैशबोर्ड पर सफ़ाई से फिट हो जाता है।'}, {'en': "The devices, which track every mile a motorist drives and transmit that information to bureaucrats, are at the center of a controversial attempt in Washington and state planning offices to overhaul the outdated system for funding America's major roads.", 'hi': 'नमस्ते यह डिवाइस, जो मोटर-चालक द्वारा वाहन चलाए गए प्रत्येक मील को ट्रैक करती है तथा उस सूचना को अधिकारियों को संचारित करती है, आजकल अमेरिका की प्रमुख सड़कों का वित्त-पोषण करने के लिए पुराने 

In this case, now there are two fingerprints.

## Features

In [None]:
print(dataset)
print(dataset['train'].features)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 32863
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})
{'translation': Translation(languages=['hi', 'en'], id=None)}


* Features defines the internal structure of a dataset. It is used to specify the underlying serialization format.
* The **features** of a dataset vary based on the type of dataset (i.e., task, domain..)
* In this particular case, the Feature is "Translation"
* Some other feature classes are: Value, ClassLabel, Image, Audio, Array2D, ArrayxD, Sequence
* Go [here](https://huggingface.co/docs/datasets/en/about_dataset_features) to learn more about handling features
* Let's take a look at one more example

### Microsoft Research Paraphrase Corpus (MRPC)

In [None]:
dataset_1 = load_dataset('glue', 'mrpc', split='train') # load MRPC from GLUE benchmark
dataset_1.features

README.md:   0%|          | 0.00/35.3k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/649k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/75.7k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/308k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [None]:
print(f'There are {len(dataset_1.column_names)} columns namely : {dataset_1.column_names}')
print(f'The features of the columns are respectively:\n {dataset_1.features}')

There are 4 columns namely : ['sentence1', 'sentence2', 'label', 'idx']
The features of the columns are respectively:
 {'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None)}


### Rotten Tomatoes

In [None]:
dataset_2 = load_dataset("rotten_tomatoes", split="train")
print(dataset_2)
print(dataset_2.features)

README.md:   0%|          | 0.00/7.46k [00:00<?, ?B/s]

train.parquet:   0%|          | 0.00/699k [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}


There are other ways of loading datasets
* If the dataset is really large we may specify the part of the dataset using ```data_files``` argument.
* Load JSON files directly from the local system
* It is always good to go through [this page](https://huggingface.co/docs/datasets/en/loading) before writing your own script for loading datasets

# Manipulating dataset(s)
* Now, we assume that the dataset is loaded (memory-mapped)
* Think of the Dataset class as a table with each sample stored in a row (indexed from 0 to num_rows - 1). For exact details, refer to: [doc](https://huggingface.co/docs/datasets/en/about_arrow)
* The train split of this dataset contains one column ('translation'). Accessing that column directly returns all the samples

In [None]:
# print(dataset['train']['translation']) # prints all the samples

## Take a single sample

In [None]:
# slow: materializes every sample in the column, then picks the first one
print(dataset['train']['translation'][0])
# fast: reads only the first row
print(dataset['train'][0]['translation'])

{'en': 'January 0', 'hi': '० जनवरी'}
{'en': 'January 0', 'hi': '० जनवरी'}


## Slicing

In [None]:
# slicing rows 0..3 of the train split
print(dataset['train'][0:4]['translation'])

[{'en': 'January 0', 'hi': '० जनवरी'}, {'en': 'March 0', 'hi': '० मार्च'}, {'en': '1000', 'hi': '१०००'}, {'en': '1001', 'hi': '१००१'}]


* Note that slicing returns the samples themselves as a dictionary of columns (whereas the `.select` method returns the chosen samples as a new Dataset)

## Filtering
* Suppose we want to keep only the pairs whose English sentence contains at least 10 words
* However, each row in the Dataset contains a dictionary with keys 'en' and 'hi'
* We can apply the filter after flattening the rows

In [None]:
print(dataset)
print(dataset['train'][:4])

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 32863
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 2507
    })
})
{'translation': [{'en': 'January 0', 'hi': '० जनवरी'}, {'en': 'March 0', 'hi': '० मार्च'}, {'en': '1000', 'hi': '१०००'}, {'en': '1001', 'hi': '१००१'}]}


In [None]:
flattened_dataset=dataset.flatten() # return new Dataset
print(flattened_dataset)

DatasetDict({
    train: Dataset({
        features: ['translation.en', 'translation.hi'],
        num_rows: 32863
    })
    validation: Dataset({
        features: ['translation.en', 'translation.hi'],
        num_rows: 520
    })
    test: Dataset({
        features: ['translation.en', 'translation.hi'],
        num_rows: 2507
    })
})


In [None]:
flattened_dataset['train'][:4]

{'translation.en': ['January 0', 'March 0', '1000', '1001'],
 'translation.hi': ['० जनवरी', '० मार्च', '१०००', '१००१']}

In [None]:
flattened_dataset['train'][0:3]

{'translation.en': ['January 0', 'March 0', '1000'],
 'translation.hi': ['० जनवरी', '० मार्च', '१०००']}

In [None]:
new_dataset = flattened_dataset.filter(lambda x:len(x['translation.en'].split(' '))>=10)
print(new_dataset)

Filter:   0%|          | 0/32863 [00:00<?, ? examples/s]

Filter:   0%|          | 0/520 [00:00<?, ? examples/s]

Filter:   0%|          | 0/2507 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['translation.en', 'translation.hi'],
        num_rows: 25
    })
    validation: Dataset({
        features: ['translation.en', 'translation.hi'],
        num_rows: 437
    })
    test: Dataset({
        features: ['translation.en', 'translation.hi'],
        num_rows: 2188
    })
})


In [None]:
new_dataset['train'][0:4]

{'translation.en': ['List of virtual communities with more than 100 million active users',
  '1 − 1 + 2 − 6 + 24 − 120 + ⋯',
  '1 − 2 + 3 − 4 + · · ·',
  "2007 Western & Southern Financial Group Masters and Women's Open"],
 'translation.hi': ['१० करोड़ से अधिक प्रयोक्ताओं वाले आभासी समुदाय',
  '१ − १ + २ − ६ + २४ − १२० + · · ·',
  '१ − २ + ३ − ४ + · · ·',
  '२००७ सिनसिनाटी मास्टर्स']}

## Concatenate datasets
* Suppose we want to retain only the English sentences (say, to scale up a language modelling dataset)
* Then, we can remove the column for Hindi


In [None]:
en_wmt_dataset = new_dataset.remove_columns('translation.hi')
print(en_wmt_dataset)
print(en_wmt_dataset['train'][0:4])

DatasetDict({
    train: Dataset({
        features: ['translation.en'],
        num_rows: 25
    })
    validation: Dataset({
        features: ['translation.en'],
        num_rows: 437
    })
    test: Dataset({
        features: ['translation.en'],
        num_rows: 2188
    })
})
{'translation.en': ['List of virtual communities with more than 100 million active users', '1 − 1 + 2 − 6 + 24 − 120 + ⋯', '1 − 2 + 3 − 4 + · · ·', "2007 Western & Southern Financial Group Masters and Women's Open"]}


* Now, we can combine the dataset with the rotten tomatoes
* However, we have to make sure that the schema of **datasets to be combined** is the same
* Therefore, I rename the column of WMT to "text" and remove the 'label' column from rotten tomatoes

In [None]:
from datasets import concatenate_datasets
ds1 = en_wmt_dataset.rename_column('translation.en','text')['train']
ds2 = dataset_2.remove_columns('label')
lm_dataset = concatenate_datasets([ds1,ds2],axis=0)
print(lm_dataset)

Dataset({
    features: ['text'],
    num_rows: 8555
})


In [None]:
ds2

Dataset({
    features: ['text'],
    num_rows: 8530
})

In [None]:
ds1

Dataset({
    features: ['text'],
    num_rows: 25
})

In [None]:
en_wmt_dataset

Dataset({
    features: ['translation.en'],
    num_rows: 25
})

In [None]:
lm_dataset[-5:]

{'text': ['any enjoyment will be hinge from a personal threshold of watching sad but endearing characters do extremely unconventional things .',
  "if legendary shlockmeister ed wood had ever made a movie about a vampire , it probably would look a lot like this alarming production , adapted from anne rice's novel the vampire chronicles .",
  "hardly a nuanced portrait of a young woman's breakdown , the film nevertheless works up a few scares .",
  'interminably bleak , to say nothing of boring .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .']}

In [None]:
lm_dataset[:5]


{'text': ['List of virtual communities with more than 100 million active users',
  '1 − 1 + 2 − 6 + 24 − 120 + ⋯',
  '1 − 2 + 3 − 4 + · · ·',
  "2007 Western & Southern Financial Group Masters and Women's Open",
  "2008 Western & Southern Financial Group Masters and Women's Open"]}

In [None]:
ds1

Dataset({
    features: ['text2'],
    num_rows: 25
})

In [None]:
ds2

Dataset({
    features: ['text'],
    num_rows: 8530
})

## Interleaving Datasets
* Oftentimes we have $n$ skewed datasets (for example, of very different sizes)
* So we want to build a new dataset by interleaving the samples from each dataset according to a sampling distribution

In [None]:
from datasets import interleave_datasets
inter_datasets = interleave_datasets([ds1,ds2],probabilities=[0.8,0.2])
print(inter_datasets[:10])

{'text': ['List of virtual communities with more than 100 million active users', '1 − 1 + 2 − 6 + 24 − 120 + ⋯', 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', '1 − 2 + 3 − 4 + · · ·', "2007 Western & Southern Financial Group Masters and Women's Open", 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', "2008 Western & Southern Financial Group Masters and Women's Open", '2012 Italian Navy Marines shooting incident in the Laccadive Sea', 'List of winners and shortlisted authors of the Booker Prize for Fiction', 'United Nations Economic and Social Commission for Asia and the Pacific']}


In [None]:
# ?interleave_datasets

* Create a new dataset with: ```.take,.skip, .shuffle, .select``` and so on

In [None]:
# take the first n elements of the train split
n = 3
small_ds = dataset['train'].take(n)
print(small_ds)

Dataset({
    features: ['translation'],
    num_rows: 3
})


In [None]:
# select specific rows by their indices
small_ds = inter_datasets.select([0,11,2])
print(small_ds)

Dataset({
    features: ['text2', 'text'],
    num_rows: 3
})


In [None]:
inter_datasets[2]

{'text2': '1 − 1 + 2 − 6 + 24 − 120 + ⋯', 'text': None}

In [None]:
small_ds[:]

{'text2': [None,
  'Department of Ayurveda, Yoga and Naturopathy, Unani, Siddha and Homoeopathy',
  '1 − 1 + 2 − 6 + 24 − 120 + ⋯'],
 'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  None,
  None]}

In [None]:
# help(inter_datasets.select)

## Iterable dataset
* Must read: [Map-style vs Iterable](https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable)
* An iterable dataset iterates over the data **one example** at a time by default
* It doesn't write anything to disk (for example, we can stream the ImageNet-1k dataset without downloading it to disk)

In [None]:
iter_dataset = inter_datasets.to_iterable_dataset()
print(iter_dataset)
for sample in iter_dataset:
    print(sample)
    break

IterableDataset({
    features: ['text2', 'text'],
    num_shards: 1
})
{'text2': None, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}


* Use .take method if you want a subset of samples

In [None]:
subset = iter_dataset.take(3)
print(subset)
print(list(subset))

IterableDataset({
    features: ['text2', 'text'],
    num_shards: 1
})
[{'text2': None, 'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}, {'text2': 'List of virtual communities with more than 100 million active users', 'text': None}, {'text2': '1 − 1 + 2 − 6 + 24 − 120 + ⋯', 'text': None}]


## Loading from external links

In [None]:
base_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/"
dataset = load_dataset("json", data_files={"train": base_url + "train-v1.1.json", "validation": base_url + "dev-v1.1.json"}, field="data")
print(dataset)

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    validation: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
