<a href="https://colab.research.google.com/github/simulate111/Deep-Learning-in-Human-Language-Technology/blob/main/Exercise%20task%202.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the `datasets` library

This notebook serves as an introduction to the `datasets` Python library and in part as an introduction to the associated dataset repository of the same name. The datasets repository is located at <https://huggingface.co/datasets> and the library documentation is found at <https://huggingface.co/docs/datasets>.

---

## Setup

Install the `datasets` Python package on the system.

In [1]:
!pip install --quiet datasets

Import the `datasets` library.

In [2]:
import datasets

Disable the progress bar to make output from dataset loading a bit less verbose. (This only affects what shows on screen.)

In [3]:
datasets.disable_progress_bar()

---

## Loading a dataset

We can load a dataset from the repository by calling the `load_dataset` function with the name of the dataset. Similarly, the function `load_dataset_builder` gives some general information about the dataset.

* [Documentation for `load_dataset`](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset)
* [Documentation for `load_dataset_builder`](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset_builder)

It's worth exploring the search and filtering functions of the repository (<https://huggingface.co/datasets>) to get an idea of what datasets are available. For example, how many datasets are there in your native language?

In [4]:
DATASET_NAME = 'imdb'

dataset = datasets.load_dataset(DATASET_NAME)
builder = datasets.load_dataset_builder(DATASET_NAME)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
from transformers import BertTokenizer, BertModel
tokenizer1 = BertTokenizer.from_pretrained('bert-base-cased')
model1 = BertModel.from_pretrained("bert-base-cased")



In [6]:
tokenizer2 = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model2 = BertModel.from_pretrained("bert-base-multilingual-cased")

---

## General dataset information

We can find various pieces of information about the corpus in the `info` field of the object returned by `load_dataset_builder`. (Note that not all datasets in the repository will have useful information here.)

In [7]:
print(builder.info.description)
print(builder.info.citation)





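If these fields print empty, as they appear to here for `imdb`, the dataset card on the repository is the better source of general information. The dataset was introduced in Maas et al. (2011), "Learning Word Vectors for Sentiment Analysis" (ACL 2011).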

---

## `Dataset` and `DatasetDict`

Let's have a look at the dataset itself next. This is the most important object for using datasets -- the builder is mostly useful for general information about a dataset.

In [8]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


We see here `Dataset` objects keyed with `train`, `test` and `unsupervised`. `Dataset` is the central class of the `datasets` library, representing a structured collection of data (e.g. a text corpus, or a part of a text corpus). Each of the three datasets has features `text` and `label`, as we would expect for a text classification dataset. We can also see `num_rows` for each of the three datasets; this is the number of examples in each subset: 25,000 in `train`, 25,000 in `test` and 50,000 in `unsupervised`.

Note that the top-level object here isn't a `Dataset` but rather a `DatasetDict`. This object is analogous to a Python dictionary: you can access the `Dataset` objects that it holds by the dictionary keys (here `train`, `test` and `unsupervised`). For convenience, the `DatasetDict` object also implements `Dataset` functions (e.g. `map`) which, when called, invoke the same function on each of the `Dataset` objects it holds.

* [Documentation for `Dataset`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset)
* [Documentation for `DatasetDict`](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict)

We can get any of the `Dataset` objects by indexing the `DatasetDict` with one of its keys:

In [9]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

---

## `Dataset` contents

Let's work on the train `Dataset` for now.

In [10]:
train_dataset = dataset['train']

As we saw above, this `Dataset` has the features `text` and `label`, and there are 25,000 examples in the dataset, i.e. 25,000 (`text`, `label`) pairs.

In [11]:
train_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

Data contained in the `Dataset` object can be accessed by indexing the object in one of two basic ways:

* by row (integer), giving the values of the features for a particular example
* by feature name (string), giving the value of that feature for all rows

(Like `list` objects, we can also slice the `Dataset` object by indexing with two integers `start:end`.)

Let's first look at an individual example (row):

In [12]:
train_dataset[0]

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [13]:
train_dataset[1]

{'text': '"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn\'t matter what one\'s political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn\'t true. I\'ve seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don\'t exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we\'re treated to the site of Vincent Gallo\'s throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) "double-standard" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, 

In [14]:
tokenizer1(train_dataset[0]['text'])

{'input_ids': [101, 146, 12765, 146, 6586, 140, 19556, 19368, 13329, 118, 162, 21678, 2162, 17056, 1121, 1139, 1888, 2984, 1272, 1104, 1155, 1103, 6392, 1115, 4405, 1122, 1165, 1122, 1108, 1148, 1308, 1107, 2573, 119, 146, 1145, 1767, 1115, 1120, 1148, 1122, 1108, 7842, 1118, 158, 119, 156, 119, 10148, 1191, 1122, 1518, 1793, 1106, 3873, 1142, 1583, 117, 3335, 1217, 170, 5442, 1104, 2441, 1737, 107, 6241, 107, 146, 1541, 1125, 1106, 1267, 1142, 1111, 1991, 119, 133, 9304, 120, 135, 133, 9304, 120, 135, 1109, 4928, 1110, 8663, 1213, 170, 1685, 3619, 3362, 2377, 1417, 14960, 1150, 3349, 1106, 3858, 1917, 1131, 1169, 1164, 1297, 119, 1130, 2440, 1131, 3349, 1106, 2817, 1123, 2209, 1116, 1106, 1543, 1199, 3271, 1104, 4148, 1113, 1184, 1103, 1903, 156, 11547, 1162, 1354, 1164, 2218, 1741, 2492, 1216, 1112, 1103, 4357, 1414, 1105, 1886, 2492, 1107, 1103, 1244, 1311, 119, 1130, 1206, 4107, 8673, 1105, 6655, 10552, 3708, 2316, 1104, 8583, 1164, 1147, 11089, 1113, 4039, 117, 1131, 1144, 2673, 1

In [15]:
tokenizer1.tokenize(train_dataset[0]['text'])

['I',
 'rented',
 'I',
 'AM',
 'C',
 '##UR',
 '##IO',
 '##US',
 '-',
 'Y',
 '##EL',
 '##L',
 '##OW',
 'from',
 'my',
 'video',
 'store',
 'because',
 'of',
 'all',
 'the',
 'controversy',
 'that',
 'surrounded',
 'it',
 'when',
 'it',
 'was',
 'first',
 'released',
 'in',
 '1967',
 '.',
 'I',
 'also',
 'heard',
 'that',
 'at',
 'first',
 'it',
 'was',
 'seized',
 'by',
 'U',
 '.',
 'S',
 '.',
 'customs',
 'if',
 'it',
 'ever',
 'tried',
 'to',
 'enter',
 'this',
 'country',
 ',',
 'therefore',
 'being',
 'a',
 'fan',
 'of',
 'films',
 'considered',
 '"',
 'controversial',
 '"',
 'I',
 'really',
 'had',
 'to',
 'see',
 'this',
 'for',
 'myself',
 '.',
 '<',
 'br',
 '/',
 '>',
 '<',
 'br',
 '/',
 '>',
 'The',
 'plot',
 'is',
 'centered',
 'around',
 'a',
 'young',
 'Swedish',
 'drama',
 'student',
 'named',
 'Lena',
 'who',
 'wants',
 'to',
 'learn',
 'everything',
 'she',
 'can',
 'about',
 'life',
 '.',
 'In',
 'particular',
 'she',
 'wants',
 'to',
 'focus',
 'her',
 'attention',
 '##

In [16]:
def tokenizing(data):
    return tokenizer1(data['text'])

In [17]:
tokenized_train_dataset = train_dataset.map(tokenizing)

Token indices sequence length is longer than the specified maximum sequence length for this model (521 > 512). Running this sequence through the model will result in indexing errors


In [18]:
print(tokenized_train_dataset[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be
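The warning printed during the `map` call above reports a sequence of 521 token ids, more than the 512 the model accepts. Tokenizers in `transformers` accept a `truncation=True` argument (optionally with `max_length`) to cut such inputs. What that amounts to for a single sequence can be sketched in plain Python. The helper `truncate_ids` below is hypothetical and not part of either library, and the ids 101 and 102 are assumed to be the `[CLS]` and `[SEP]` ids of the BERT vocabulary.

```python
CLS_ID = 101  # [CLS] id in the BERT vocabulary (assumed; matches the output above)
SEP_ID = 102  # [SEP] id in the BERT vocabulary (assumed)

def truncate_ids(input_ids, max_length=512):
    """Cut a [CLS] ... [SEP] sequence to at most max_length ids."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    # Strip the special tokens, cut the body, then re-attach them.
    body = input_ids[1:-1][:max_length - 2]
    return [CLS_ID] + body + [SEP_ID]

# A made-up sequence of 521 ids, the length reported in the warning.
too_long = [CLS_ID] + list(range(1000, 1519)) + [SEP_ID]
shortened = truncate_ids(too_long)
print(len(too_long), '->', len(shortened))
```

Truncation discards the end of long reviews, so whether it is acceptable depends on the task; for a first look at the data the untruncated ids are harmless as long as they are not passed to the model.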

The first ten rows:

In [19]:
train_dataset[0:10]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

(Note above that slicing doesn't give a list of dictionaries, but rather a single dictionary where the values are lists.)
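When a list of per-example dictionaries is more convenient, the dictionary of lists can be rearranged in plain Python. This is a small sketch using toy values; the helper name `batch_to_rows` is made up for illustration.

```python
# A dictionary of lists with the same shape that slicing returns
# (toy values, not taken from the real dataset).
batch = {
    'text': ['first review', 'second review', 'third review'],
    'label': [0, 1, 0],
}

def batch_to_rows(batch):
    """Turn {'feature': [values...]} into [{'feature': value}, ...]."""
    keys = list(batch.keys())
    return [dict(zip(keys, values)) for values in zip(*batch.values())]

rows = batch_to_rows(batch)
for row in rows:
    print(row['label'], row['text'])
```

With the real data, `batch_to_rows(train_dataset[0:10])` should give ten dictionaries, one per example.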

First ten texts:

In [20]:
train_dataset['text'][0:10]

['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, e

First ten labels:

In [21]:
train_dataset['label'][0:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

---

## Feature names

Note above that the labels are integers. This is convenient for machine learning, but makes the data difficult to interpret. To make sense of the integer labels, we can look at the `features` for the dataset.

* [Documentation for dataset features](https://huggingface.co/docs/datasets/about_dataset_features)

In [22]:
train_dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

There's a `ClassLabel` object keyed by `'label'` that has a `names` attribute. Let's pick that out into a `label_names` variable for convenience:

In [23]:
label_names = train_dataset.features['label'].names

label_names

['neg', 'pos']

We can now use this list to convert the integer labels into human-readable strings to interpret them.

In [24]:
print('text :', train_dataset[3]['text'])
print('label:', train_dataset[3]['label'])
print('---')
print('text :', train_dataset[3]['text'])
print('label:', label_names[train_dataset[3]['label']])

text : This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.
label: 0
---
text : This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me 
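The same lookup works for a whole list of labels at once, for instance to count how many examples carry each label. The sketch below uses made-up integer labels rather than the real column.

```python
from collections import Counter

label_names = ['neg', 'pos']      # as printed by the notebook above
labels = [0, 0, 1, 0, 1, 1, 0]    # made-up integer labels for illustration

# Index into label_names to turn each integer into its readable name.
named = [label_names[i] for i in labels]
counts = Counter(named)
print(named)
print(counts)
```

With the real data, `train_dataset['label']` would take the place of `labels`. Note that plain list indexing would silently map a label of -1 to the last name, so this shortcut assumes every label is a valid non-negative class index.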

---

That concludes our introductory look into datasets. For more information on the library and the repository, please see [the `datasets` documentation](https://huggingface.co/docs/datasets).