
# Introduction to the `datasets` library

This notebook serves as an introduction to the `datasets` Python library and in part as an introduction to the associated dataset repository of the same name. The datasets repository is located at <https://huggingface.co/datasets> and the library documentation is found at <https://huggingface.co/docs/datasets>.

---

## Setup

Install the `datasets` Python package on the system.

In [1]:
!pip install --quiet datasets

Import the `datasets` library.

In [2]:
import datasets

Optionally, disable the progress bar to make the output from dataset loading less verbose. The call below is commented out; uncomment it to apply the setting. (This only affects what is shown on screen.)

In [3]:
#datasets.disable_progress_bar()

---

## Loading a dataset

We can load a dataset from the repository by calling the `load_dataset` function with the name of the dataset. Similarly, the function `load_dataset_builder` gives general information about a dataset without loading the data itself.

* [Documentation for `load_dataset`](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset)
* [Documentation for `load_dataset_builder`](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset_builder)

It's worth exploring the search and filtering functions of the repository (<https://huggingface.co/datasets>) to get an idea of what datasets are available. For example, how many datasets are there in your native language?

In [4]:
DATASET_emotion = 'emotion'
DATASET_rotten_tomatoes = 'rotten_tomatoes'
DATASET_snli = 'snli'
DATASET_sst2 = 'sst2'
DATASET_emo = 'emo'

In [5]:
dataset_emotion = datasets.load_dataset(DATASET_emotion)
dataset_rotten_tomatoes = datasets.load_dataset(DATASET_rotten_tomatoes)
dataset_snli = datasets.load_dataset(DATASET_snli)
dataset_sst2 = datasets.load_dataset(DATASET_sst2)
dataset_emo = datasets.load_dataset(DATASET_emo)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [6]:
builder_emotion = datasets.load_dataset_builder(DATASET_emotion)
builder_rotten_tomatoes = datasets.load_dataset_builder(DATASET_rotten_tomatoes)
builder_snli = datasets.load_dataset_builder(DATASET_snli)
builder_sst2 = datasets.load_dataset_builder(DATASET_sst2)
builder_emo = datasets.load_dataset_builder(DATASET_emo)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


---

## General dataset information

We can find various pieces of information about the corpus in the `info` field of the object returned by `load_dataset_builder`. (Note that not all datasets in the repository will have useful information here.)

In [7]:
print('emotion:\n', builder_emotion.info.description)
print('\nrotten_tomatoes:\n', builder_rotten_tomatoes.info.description)
print('\nsnli:\n', builder_snli.info.description)
print('\nsst2:\n', builder_sst2.info.description)
print('\nemo:\n', builder_emo.info.description)

emotion:
 Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.


rotten_tomatoes:
 

snli:
 

sst2:
 

emo:
 In this dataset, given a textual dialogue i.e. an utterance along with two previous turns of context, the goal was to infer the underlying emotion of the utterance by choosing from four emotion classes - Happy, Sad, Angry and Others.



In [8]:
print('emotion:\n', builder_emotion.info.citation)
print('\nrotten_tomatoes:\n', builder_rotten_tomatoes.info.citation)
print('\nsnli:\n', builder_snli.info.citation)
print('\nsst2:\n', builder_sst2.info.citation)
print('\nemo:\n', builder_emo.info.citation)

emotion:
 @inproceedings{saravia-etal-2018-carer,
    title = "{CARER}: Contextualized Affect Representations for Emotion Recognition",
    author = "Saravia, Elvis  and
      Liu, Hsien-Chi Toby  and
      Huang, Yen-Hao  and
      Wu, Junlin  and
      Chen, Yi-Shin",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1404",
    doi = "10.18653/v1/D18-1404",
    pages = "3687--3697",
    abstract = "Emotions are expressed in nuanced ways, which varies by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. We propose a semi-supervised, graph-based algorithm to produce 

As the description suggests, you can refer to the paper <https://www.aclweb.org/anthology/D18-1404> for more information about the dataset.

---

## `Dataset` and `DatasetDict`

Let's have a look at the dataset itself next. This is the most important object for using datasets -- the builder is mostly useful for general information about a dataset.

In [9]:
print('emotion:\n', dataset_emotion)
print('\nrotten_tomatoes:\n', dataset_rotten_tomatoes)
print('\nsnli:\n', dataset_snli)
print('\nsst2:\n', dataset_sst2)
print('\nemo:\n', dataset_emo)

emotion:
 DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

rotten_tomatoes:
 DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

snli:
 DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 550152
    })
})

sst2:
 DatasetDict({
    train: Dataset({
        f

We see here `Dataset` objects keyed with `train`, `validation` and `test`. `Dataset` is the central class of the `datasets` library, representing a structured collection of data (e.g. a text corpus, or a part of a text corpus). For `emotion` and `rotten_tomatoes`, each of the three datasets has features `text` and `label`, as we would expect for a text classification dataset; `snli` instead pairs a `premise` and a `hypothesis` with each `label`. We can also see `num_rows` for each of the three datasets; this is the number of examples in each of the train, validation (development) and test subsets of the data.

Note that the top-level object here isn't a `Dataset` but rather a `DatasetDict`. This object is analogous to a Python dictionary: you can access the `Dataset` objects that it holds by the dictionary keys (here `train`, `validation` and `test`). For convenience, the `DatasetDict` object also implements `Dataset` functions (e.g. `map`) which, when called, invoke the same function on the `Dataset` objects.

* [Documentation for `Dataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset)
* [Documentation for `DatasetDict`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict)
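
The delegation described above (one call on the container, applied to every split) can be illustrated with a toy stand-in. This is only a sketch in plain Python, not the library's implementation: the class `SplitDict`, the helper `add_length` and the example rows are hypothetical, and the real `Dataset.map` works on `Dataset` objects rather than lists.

```python
class SplitDict(dict):
    """Toy stand-in for a DatasetDict: maps split names to lists of example dicts."""

    def map(self, function):
        # Apply the same per-example function to every split, keeping the keys.
        return SplitDict({split: [function(example) for example in examples]
                          for split, examples in self.items()})


# Hand-written examples; the split assignment here is made up for illustration.
splits = SplitDict({
    'train': [{'text': 'i feel romantic too', 'label': 2}],
    'test': [{'text': 'i am feeling grouchy', 'label': 3}],
})


def add_length(example):
    # Return a copy of the example with one extra field.
    return {**example, 'n_chars': len(example['text'])}


with_length = splits.map(add_length)
print(list(with_length))                    # split names are preserved
print(with_length['train'][0]['n_chars'])   # new field is present in every split
```

The point of the design is that preprocessing written once applies uniformly to the train, validation and test data, so the splits cannot drift apart.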

We can get any of the `Dataset` objects by indexing the `DatasetDict` with one of its keys:

In [10]:
print('emotion:\n', dataset_emotion['train'])
print('\nrotten_tomatoes:\n', dataset_rotten_tomatoes['train'])
print('\nsnli:\n', dataset_snli['train'])
print('\nsst2:\n', dataset_sst2['train'])
print('\nemo:\n', dataset_emo['train'])

emotion:
 Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

rotten_tomatoes:
 Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

snli:
 Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})

sst2:
 Dataset({
    features: ['idx', 'sentence', 'label'],
    num_rows: 67349
})

emo:
 Dataset({
    features: ['text', 'label'],
    num_rows: 30160
})


---

## `Dataset` contents

Let's work on the train `Dataset` for now.

In [11]:
train_dataset_emotion = dataset_emotion['train']
train_dataset_rotten_tomatoes = dataset_rotten_tomatoes['train']
train_dataset_snli = dataset_snli['train']
train_dataset_sst2 = dataset_sst2['train']
train_dataset_emo = dataset_emo['train']

As we saw above, the `emotion` train `Dataset` has the features `text` and `label`, and it contains 16,000 examples, i.e. 16,000 (`text`, `label`) pairs. The other corpora differ in size and, for `snli` and `sst2`, in feature names.

In [12]:
print('emotion:\n', train_dataset_emotion)
print('\nrotten_tomatoes:\n', train_dataset_rotten_tomatoes)
print('\nsnli:\n', train_dataset_snli)
print('\nsst2:\n', train_dataset_sst2)
print('\nemo:\n', train_dataset_emo)

emotion:
 Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

rotten_tomatoes:
 Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

snli:
 Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})

sst2:
 Dataset({
    features: ['idx', 'sentence', 'label'],
    num_rows: 67349
})

emo:
 Dataset({
    features: ['text', 'label'],
    num_rows: 30160
})


Data contained in the `Dataset` object can be accessed by indexing the object in one of two basic ways:

* by row (integer), giving the values of the features for a particular example
* by feature name (string), giving the value of that feature for all rows

(As with `list` objects, we can also slice the `Dataset` object by indexing with two integers `start:end`.)

Let's first look at an individual example (row):

In [13]:
print('emotion:\n', train_dataset_emotion[0])
print('\nrotten_tomatoes:\n', train_dataset_rotten_tomatoes[0])
print('\nsnli:\n', train_dataset_snli[0])
print('\nsst2:\n', train_dataset_sst2[0])
print('\nemo:\n', train_dataset_emo[0])

emotion:
 {'text': 'i didnt feel humiliated', 'label': 0}

rotten_tomatoes:
 {'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

snli:
 {'premise': 'A person on a horse jumps over a broken down airplane.', 'hypothesis': 'A person is training his horse for a competition.', 'label': 1}

sst2:
 {'idx': 0, 'sentence': 'hide new secretions from the parental units ', 'label': 0}

emo:
 {'text': "don't worry  i'm girl hmm how do i know if you are what's ur name", 'label': 0}


The first ten rows:

In [14]:
print('emotion:\n', train_dataset_emotion[0:10])
print('\nrotten_tomatoes:\n', train_dataset_rotten_tomatoes[0:10])
print('\nsnli:\n', train_dataset_snli[0:10])
print('\nsst2:\n', train_dataset_sst2[0:10])
print('\nemo:\n', train_dataset_emo[0:10])

emotion:
 {'text': ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy', 'ive been feeling a little burdened lately wasnt sure why that was', 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny', 'i feel as confused about life as a teenager or as jaded as a year old man', 'i have been with petronas for years i feel that petronas has performed well and made a huge profit', 'i feel romantic too'], 'label': [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]}

rotten_tomatoes:
 {'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'the gorgeously el

(Note above that slicing doesn't give a list of dictionaries, but rather a single dictionary where the values are lists.)
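
When per-example dictionaries are more convenient, the column-oriented result of a slice can be converted back with plain Python. The following is a minimal sketch on a hand-written batch of the same shape; the helper name `batch_to_rows` is hypothetical and not part of the `datasets` library.

```python
def batch_to_rows(batch):
    """Convert a dict of equal-length lists into a list of per-row dicts."""
    keys = list(batch)
    # zip(*columns) walks all columns in parallel, yielding one tuple per row.
    return [dict(zip(keys, values)) for values in zip(*(batch[k] for k in keys))]


# Hand-written batch shaped like a slice; values copied from rows 0 and 4 above.
batch = {
    'text': ['i didnt feel humiliated', 'i am feeling grouchy'],
    'label': [0, 3],
}
rows = batch_to_rows(batch)
print(rows[1])
```

The columnar layout is usually the better fit for batch processing, so a conversion like this is mainly useful for inspection and debugging.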

First ten texts:

In [15]:
print('emotion:\n', train_dataset_emotion['text'][0:10])
print('\nrotten_tomatoes:\n', train_dataset_rotten_tomatoes['text'][0:10])
print('\nsnli:\n', train_dataset_snli['premise'][0:10])
print('\nsst2:\n', train_dataset_sst2['sentence'][0:10])
print('\nemo:\n', train_dataset_emo['text'][0:10])

emotion:
 ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy', 'ive been feeling a little burdened lately wasnt sure why that was', 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny', 'i feel as confused about life as a teenager or as jaded as a year old man', 'i have been with petronas for years i feel that petronas has performed well and made a huge profit', 'i feel romantic too']

rotten_tomatoes:
 ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is

First ten labels:

In [16]:
print('emotion:\n', train_dataset_emotion['label'][0:10])
print('\nrotten_tomatoes:\n', train_dataset_rotten_tomatoes['label'][0:10])
print('\nsnli:\n', train_dataset_snli['label'][0:10])
print('\nsst2:\n', train_dataset_sst2['label'][0:10])
print('\nemo:\n', train_dataset_emo['label'][0:10])

emotion:
 [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]

rotten_tomatoes:
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

snli:
 [1, 2, 0, 1, 0, 2, 2, 0, 1, 1]

sst2:
 [0, 0, 1, 0, 0, 0, 1, 1, 0, 1]

emo:
 [0, 3, 0, 3, 0, 0, 0, 0, 0, 0]


---

## Feature names

Note above that the labels are integers. This is convenient for machine learning, but makes the data difficult to interpret. To make sense of the integer labels, we can look at the `features` for the dataset.

* [Documentation for dataset features](https://huggingface.co/docs/datasets/about_dataset_features)

In [17]:
print('emotion:\n', train_dataset_emotion.features)
print('\nrotten_tomatoes:\n', train_dataset_rotten_tomatoes.features)
print('\nsnli:\n', train_dataset_snli.features)
print('\nsst2:\n', train_dataset_sst2.features)
print('\nemo:\n', train_dataset_emo.features)

emotion:
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

rotten_tomatoes:
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}

snli:
 {'premise': Value(dtype='string', id=None), 'hypothesis': Value(dtype='string', id=None), 'label': ClassLabel(names=['entailment', 'neutral', 'contradiction'], id=None)}

sst2:
 {'idx': Value(dtype='int32', id=None), 'sentence': Value(dtype='string', id=None), 'label': ClassLabel(names=['negative', 'positive'], id=None)}

emo:
 {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['others', 'happy', 'sad', 'angry'], id=None)}


There's a `ClassLabel` object keyed by `'label'` that has a `names` attribute. Let's store that list in a variable for each corpus (`label_emotion`, `label_rotten_tomatoes`, and so on) for convenience:

In [18]:
label_emotion = train_dataset_emotion.features['label'].names

print('emotion:\n', label_emotion)

label_rotten_tomatoes = train_dataset_rotten_tomatoes.features['label'].names

print('\nrotten_tomatoes:\n', label_rotten_tomatoes)

label_snli = train_dataset_snli.features['label'].names

print('\nsnli:\n', label_snli)

label_sst2 = train_dataset_sst2.features['label'].names

print('\nsst2:\n', label_sst2)

label_emo = train_dataset_emo.features['label'].names

print('\nemo:\n', label_emo)

emotion:
 ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

rotten_tomatoes:
 ['neg', 'pos']

snli:
 ['entailment', 'neutral', 'contradiction']

sst2:
 ['negative', 'positive']

emo:
 ['others', 'happy', 'sad', 'angry']


We can now use this list to convert the integer labels into human-readable strings to interpret them.

In [19]:
print('emotion:\n', 'text :', train_dataset_emotion[3]['text'])
print('emotion:\n', 'label:', train_dataset_emotion[3]['label'])
print('---')
print('emotion:\n', 'text :', train_dataset_emotion[3]['text'])
print('emotion:\n', 'label:', label_emotion[train_dataset_emotion[3]['label']])

print('\nrotten_tomatoes:\n', 'text :', train_dataset_rotten_tomatoes[3]['text'])
print('\nrotten_tomatoes:\n', 'label:', train_dataset_rotten_tomatoes[3]['label'])
print('---')
print('\nrotten_tomatoes:\n', 'text :', train_dataset_rotten_tomatoes[3]['text'])
print('\nrotten_tomatoes:\n', 'label:', label_rotten_tomatoes[train_dataset_rotten_tomatoes[3]['label']])

print('\nsnli:\n', 'premise :', train_dataset_snli[3]['premise'])
print('\nsnli:\n', 'label:', train_dataset_snli[3]['label'])
print('---')
print('\nsnli:\n', 'premise :', train_dataset_snli[3]['premise'])
print('\nsnli:\n', 'label:', label_snli[train_dataset_snli[3]['label']])

print('\nsst2:\n', 'sentence :', train_dataset_sst2[3]['sentence'])
print('\nsst2:\n', 'label:', train_dataset_sst2[3]['label'])
print('---')
print('\nsst2:\n', 'sentence :', train_dataset_sst2[3]['sentence'])
print('\nsst2:\n', 'label:', label_sst2[train_dataset_sst2[3]['label']])

print('\nemo:\n', 'text :', train_dataset_emo[3]['text'])
print('\nemo:\n', 'label:', train_dataset_emo[3]['label'])
print('---')
print('\nemo:\n', 'text :', train_dataset_emo[3]['text'])
print('\nemo:\n', 'label:', label_emo[train_dataset_emo[3]['label']])

emotion:
 text : i am ever feeling nostalgic about the fireplace i will know that it is still on the property
emotion:
 label: 2
---
emotion:
 text : i am ever feeling nostalgic about the fireplace i will know that it is still on the property
emotion:
 label: love

rotten_tomatoes:
 text : if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

rotten_tomatoes:
 label: 1
---

rotten_tomatoes:
 text : if you sometimes like to go to the movies to have fun , wasabi is a good place to start .

rotten_tomatoes:
 label: pos

snli:
 premise : Children smiling and waving at camera

snli:
 label: 1
---

snli:
 premise : Children smiling and waving at camera

snli:
 label: neutral

sst2:
 sentence : remains utterly satisfied to remain the same throughout 

sst2:
 label: 0
---

sst2:
 sentence : remains utterly satisfied to remain the same throughout 

sst2:
 label: negative

emo:
 text : u r ridiculous i might be ridiculous but i am telling the truth u little d

In [20]:
print(train_dataset_snli[3].keys())


dict_keys(['premise', 'hypothesis', 'label'])
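
The conversion between integer labels and names used above amounts to a pair of lookup tables. A minimal sketch in plain Python follows, using the `emotion` class names printed earlier; the names `id2label`, `label2id` and `decode_labels` are hypothetical helpers. The `ClassLabel` feature also provides `int2str` and `str2int` methods for the same purpose.

```python
# Class names copied from the `emotion` features output above.
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Integer id -> name, and the reverse table for name -> id.
id2label = dict(enumerate(label_names))
label2id = {name: i for i, name in id2label.items()}


def decode_labels(label_ids):
    """Map a list of integer labels to their class names."""
    return [id2label[i] for i in label_ids]


first_ten = [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]  # first ten `emotion` labels from above
print(decode_labels(first_ten))
print(label2id['love'])
```

Using a dictionary rather than list indexing means an id outside the known range, such as `-1`, raises a `KeyError` instead of silently selecting the last name in the list.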


---

That concludes our introductory look into datasets. For more information on the library and the repository, please see [the `datasets` documentation](https://huggingface.co/docs/datasets).

In [21]:
from collections import Counter
from datasets import load_dataset

def distribution(dataset_name):
    """Print the split sizes and the train-set label distribution as percentages."""
    dataset = load_dataset(dataset_name)
    train = dataset['train']
    labels = train['label']
    label_counts = Counter(labels)
    label_names = train.features['label'].names
    # Not every corpus has every split (e.g. "emo" has no validation split).
    sizes = {split: len(dataset[split]) if split in dataset else 0
             for split in ('train', 'validation', 'test')}
    total_samples = sum(sizes.values())

    def label_name(label):
        # -1 marks an example without a gold label (as in "snli"); label_names[-1]
        # would silently report it under the last class name instead.
        return label_names[label] if 0 <= label < len(label_names) else f'unlabeled ({label})'

    print(f'Distributions in "{dataset_name}":')
    print(', '.join(f'{split.capitalize()}: {100 * size / total_samples:.2f}%'
                    for split, size in sizes.items()))
    for label, count in label_counts.items():
        print(f'{label_name(label)}: {100 * count / len(labels):.2f}%')

In [22]:
for dataset_name in ['emotion', 'rotten_tomatoes', 'snli', 'sst2', 'emo']:
    print()
    distribution(dataset_name)


Distributions in "emotion":
Train: 80.00%, Validation: 10.00%, Test: 10.00%
sadness: 29.16%
anger: 13.49%
love: 8.15%
surprise: 3.57%
fear: 12.11%
joy: 33.51%

Distributions in "rotten_tomatoes":
Train: 80.00%, Validation: 10.00%, Test: 10.00%
pos: 50.00%
neg: 50.00%

Distributions in "snli":
Train: 96.49%, Validation: 1.75%, Test: 1.75%
neutral: 33.22%
contradiction: 33.30%
entailment: 33.34%
unlabeled (-1): 0.14%

Distributions in "sst2":
Train: 96.16%, Validation: 1.24%, Test: 2.60%
negative: 44.22%
positive: 55.78%

Distributions in "emo":
Train: 84.56%, Validation: 0.00%, Test: 15.44%
others: 49.56%
angry: 18.26%
sad: 18.11%
happy: 14.07%
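
The percentages above are listed in the order in which labels first occur in the data, which makes the corpora harder to compare. The following sketch, in plain Python on hand-made toy labels, reports classes in id order, includes classes with zero examples, and collects ids without a class name (for example `-1`) under a separate entry. The helper name `label_distribution` and the `<unlabeled>` key are assumptions made for illustration.

```python
from collections import Counter


def label_distribution(labels, label_names):
    """Return {class name: percentage} in class-id order, zero counts included."""
    total = len(labels)
    if total == 0:
        return {}
    counts = Counter(labels)
    dist = {name: 100 * counts.get(i, 0) / total
            for i, name in enumerate(label_names)}
    # Ids outside the known range are reported separately, not mapped to a name.
    unknown = sum(c for i, c in counts.items() if not 0 <= i < len(label_names))
    if unknown:
        dist['<unlabeled>'] = 100 * unknown / total
    return dist


# Hand-made toy labels, not taken from any of the corpora above.
toy_labels = [0, 1, 1, 2, -1, 1, 0, 2]
toy_names = ['entailment', 'neutral', 'contradiction']
for name, percentage in label_distribution(toy_labels, toy_names).items():
    print(f'{name}: {percentage:.2f}%')
```

Returning a dictionary instead of printing directly keeps the computation reusable, for instance to compare the label balance of the train and test splits.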
