
# Introduction to the `datasets` library

This notebook serves as an introduction to the `datasets` Python library and in part as an introduction to the associated dataset repository of the same name. The datasets repository is located at <https://huggingface.co/datasets> and the library documentation is found at <https://huggingface.co/docs/datasets>.

---

## Setup

Install the `datasets` Python package on the system.

In [1]:
!pip install --quiet datasets

Import the `datasets` library.

In [2]:
import datasets

Disable the progress bar to make output from dataset loading a bit less verbose. (This only affects what shows on screen.)

In [3]:
datasets.disable_progress_bar()

---

## Loading a dataset

We can load a dataset from the repository by calling the `load_dataset` function with the name of the dataset. Similarly, the `load_dataset_builder` function gives access to general information about the dataset without loading its contents.

* [Documentation for `load_dataset`](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset)
* [Documentation for `load_dataset_builder`](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset_builder)

It's worth exploring the search and filtering functions of the repository (<https://huggingface.co/datasets>) to get an idea of what datasets are available. For example, how many datasets are there in your native language?

In [4]:
DATASET_NAME = 'emotion'

dataset = datasets.load_dataset(DATASET_NAME)
builder = datasets.load_dataset_builder(DATASET_NAME)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


---

## General dataset information

We can find various pieces of information about the corpus in the `info` field of the object returned by `load_dataset_builder`. (Note that not all datasets in the repository will have useful information here.)

In [5]:
print(builder.info.description)
print(builder.info.citation)

Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.

@inproceedings{saravia-etal-2018-carer,
    title = "{CARER}: Contextualized Affect Representations for Emotion Recognition",
    author = "Saravia, Elvis  and
      Liu, Hsien-Chi Toby  and
      Huang, Yen-Hao  and
      Wu, Junlin  and
      Chen, Yi-Shin",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D18-1404",
    doi = "10.18653/v1/D18-1404",
    pages = "3687--3697",
    abstract = "Emotions are expressed in nuanced ways, which varies by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a

As the description suggests, you can refer to the paper <https://www.aclweb.org/anthology/D18-1404> for more information about the dataset.

---

## `Dataset` and `DatasetDict`

Let's have a look at the dataset itself next. This is the most important object for using datasets -- the builder is mostly useful for general information about a dataset.

In [6]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


We see here `Dataset` objects keyed with `train`, `validation` and `test`. `Dataset` is the central class of the `datasets` library, representing a structured collection of data (e.g. a text corpus, or a part of a text corpus). Each of the three datasets has features `text` and `label`, as we would expect for a text classification dataset. We can also see `num_rows` for each of the three datasets; this is the number of examples in each of the train, validation and test subsets of the data.
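
As a quick check on these numbers, the share of each subset can be worked out directly from the `num_rows` values. The following is a minimal sketch in plain Python; the `split_sizes` dictionary is an assumption for illustration, typed in by hand from the printed output rather than read from the library.

```python
# Split sizes copied by hand from the printed DatasetDict above (an assumption:
# in a live session these would be read from the loaded dataset instead).
split_sizes = {'train': 16000, 'validation': 2000, 'test': 2000}

total = sum(split_sizes.values())

# Percentage of all examples that falls into each split
split_shares = {name: 100 * size / total for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f'{name}: {share:.1f}%')
```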

Note that the top-level object here isn't a `Dataset` but rather a `DatasetDict`. This object is analogous to a Python dictionary: you can access the `Dataset` objects that it holds by the dictionary keys (here `train`, `validation` and `test`). For convenience, the `DatasetDict` object also implements `Dataset` functions (e.g. `map`) which, when called, invoke the same function on each of the `Dataset` objects it holds.

* [Documentation for `Dataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset)
* [Documentation for `DatasetDict`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict)

We can get any of the `Dataset` objects by indexing the `DatasetDict` with one of its keys:

In [7]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

---

## `Dataset` contents

Let's work on the train `Dataset` for now.

In [8]:
train_dataset = dataset['train']

As we saw above, this `Dataset` has the features `text` and `label`, and there are 16,000 examples in the dataset, i.e. 16,000 (`text`, `label`) pairs.

In [9]:
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

Data contained in the `Dataset` object can be accessed by indexing the object in one of two basic ways:

* by row (integer), giving the values of the features for a particular example
* by feature name (string), giving the value of that feature for all rows

(As with `list` objects, we can also slice the `Dataset` object by indexing with a range `start:end`.)

Let's first look at an individual example (row):

In [10]:
train_dataset[0]

{'text': 'i didnt feel humiliated', 'label': 0}

The first ten rows:

In [11]:
train_dataset[0:10]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
  'i am feeling grouchy',
  'ive been feeling a little burdened lately wasnt sure why that was',
  'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny',
  'i feel as confused about life as a teenager or as jaded as a year old man',
  'i have been with petronas for years i feel that petronas has performed well and made a huge profit',
  'i feel romantic too'],
 'label': [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]}

(Note above that slicing doesn't give a list of dictionaries, but rather a single dictionary where the values are lists.)
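
If a list of per-example dictionaries is more convenient, the column-oriented result can be re-paired row by row. This is a minimal sketch in plain Python, using the first three rows of the slice above copied by hand; the names `batch` and `rows` are chosen here for illustration.

```python
# First three rows of the slice shown above, in the column-oriented form
# that slicing returns: one list per feature
batch = {
    'text': ['i didnt feel humiliated',
             'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
             'im grabbing a minute to post i feel greedy wrong'],
    'label': [0, 0, 3],
}

# Re-pair the values position by position to get one dictionary per example
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row['label'], row['text'])
```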

First ten texts:

In [12]:
train_dataset['text'][0:10]

['i didnt feel humiliated',
 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'im grabbing a minute to post i feel greedy wrong',
 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
 'i am feeling grouchy',
 'ive been feeling a little burdened lately wasnt sure why that was',
 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny',
 'i feel as confused about life as a teenager or as jaded as a year old man',
 'i have been with petronas for years i feel that petronas has performed well and made a huge profit',
 'i feel romantic too']

First ten labels:

In [13]:
train_dataset['label'][0:10]

[0, 0, 3, 2, 3, 0, 5, 4, 1, 2]

---

## Feature names

Note above that the labels are integers. This is convenient for machine learning, but makes the data difficult to interpret. To make sense of the integer labels, we can look at the `features` for the dataset.

* [Documentation for dataset features](https://huggingface.co/docs/datasets/about_dataset_features)

In [14]:
train_dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

There's a `ClassLabel` object keyed by `'label'` that has a `names` attribute. Let's pick that out into a `label_names` variable for convenience:

In [15]:
label_names = train_dataset.features['label'].names

label_names

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

We can now use this list to convert the integer labels into human-readable strings to interpret them.

In [16]:
print('text :', train_dataset[3]['text'])
print('label:', train_dataset[3]['label'])
print('---')
print('text :', train_dataset[3]['text'])
print('label:', label_names[train_dataset[3]['label']])

text : i am ever feeling nostalgic about the fireplace i will know that it is still on the property
label: 2
---
text : i am ever feeling nostalgic about the fireplace i will know that it is still on the property
label: love


---

That concludes our introductory look into datasets. For more information on the library and the repository, please see [the `datasets` documentation](https://huggingface.co/docs/datasets).

In [23]:
import datasets
from collections import Counter

def analyze_dataset(dataset_name):
    """Print the description, split proportions and train label distribution of a dataset."""
    # Load only the builder: it provides metadata without loading the examples
    builder = datasets.load_dataset_builder(dataset_name)

    # Print description (may be empty for some datasets)
    print("Description of the dataset:")
    print(builder.info.description)

    # Calculate the size of each split as a percentage of all examples
    splits = builder.info.splits
    total_size = sum(split.num_examples for split in splits.values())

    relative_sizes = {split_name: (split.num_examples / total_size) * 100 for split_name, split in splits.items()}

    print("\nRelative sizes of subsets:")
    for split_name, percentage in relative_sizes.items():
        print(f"{split_name}: {percentage:.2f}%")

    # Calculate label distribution in 'train' subset
    # (the keys are integer label ids, not label names)
    train_dataset = datasets.load_dataset(dataset_name, split='train')
    train_labels = train_dataset['label']
    label_counts = Counter(train_labels)
    train_size = len(train_labels)

    label_percentage = {label_id: (count / train_size) * 100 for label_id, count in label_counts.items()}

    print("\nDistribution of labels in the 'train' subset:")
    for label_id, percentage in label_percentage.items():
        print(f"{label_id}: {percentage:.2f}%")

# Apply function to the specified datasets
datasets_to_analyze = ['emotion', 'rotten_tomatoes', 'snli', 'sst2', 'emo']
for dataset_name in datasets_to_analyze:
    print(f"\nAnalyzing dataset: {dataset_name}\n")
    analyze_dataset(dataset_name)



Analyzing dataset: emotion

Description of the dataset:
Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.


Relative sizes of subsets:
train: 80.00%
validation: 10.00%
test: 10.00%

Distribution of labels in the 'train' subset:
0: 29.16%
3: 13.49%
2: 8.15%
5: 3.57%
4: 12.11%
1: 33.51%

Analyzing dataset: rotten_tomatoes

Description of the dataset:


Relative sizes of subsets:
train: 80.00%
validation: 10.00%
test: 10.00%

Distribution of labels in the 'train' subset:
1: 50.00%
0: 50.00%

Analyzing dataset: snli

Description of the dataset:


Relative sizes of subsets:
test: 1.75%
validation: 1.75%
train: 96.49%

Distribution of labels in the 'train' subset:
1: 33.22%
2: 33.30%
0: 33.34%
-1: 0.14%

Analyzing dataset: sst2

Description of the dataset:


Relative sizes of subsets:
train: 96.16%
validation: 1.24%
test: 2.60%

Distribution of labels in the 'train