In [1]:
%run supportvectors-common.ipynb



<center><img src="https://d4x5p7s4.rocketcdn.me/wp-content/uploads/2016/03/logo-poster-smaller.png"/> </center>
<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes; copying or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. We therefore request that you do not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Datasets


HuggingFace provides a convenient, uniform API for accessing a large variety of datasets in a standardized manner. We will explore its core `datasets` library in this lab.


In [2]:
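# Uncomment this if you do not have datasets installed yet.
# !pip install datasets

# Note: `list_datasets` was deprecated in later releases of `datasets`;
# `huggingface_hub.list_datasets` is the suggested replacement there.
from datasets import list_datasets

available_datasets = list_datasets()

print(f'There are {len(available_datasets)} datasets present today on the HuggingFace hub!')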

There are 59783 datasets present today on the HuggingFace hub!


This number is perhaps larger than one might expect, and it speaks to the richness of the ecosystem. The next lab treats sentiment analysis as a classification task, so let us explore a dataset suited to it. The dataset named `emotion` is a good choice.

In [3]:
from datasets import load_dataset
emotions = load_dataset('emotion')
emotions

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

Conveniently for classification tasks, this dataset is already split into a training set of 16K rows, a validation set of 2K rows, and a test set of 2K rows. Let us explore the training set:

In [4]:
training_dataset = emotions['train']
training_dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

As we would expect, this dataset contains labeled data. The feature is the text, and the label is the sentiment associated with the text. Let us look at a row:


In [5]:
training_dataset[0]

{'text': 'i didnt feel humiliated', 'label': 0}

The label `0` is not very informative on its own. Let us inspect the dataset's features to see which class name each integer stands for.

In [6]:
training_dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

From this we gather that there are six possible sentiment labels. The integer label is the position of the name in this list, so `0` corresponds to `sadness`:

* sadness
* joy
* love
* anger
* fear
* surprise


### From Dataset to Pandas Dataframe

We are familiar with the `pandas.DataFrame` and use it conveniently to explore data. It turns out that the `datasets` library gives an easy bridge to `pandas.DataFrame`.

In [7]:
emotions.set_format(type='pandas')
training_df = emotions['train'][:]
training_df.head()

|   | text | label |
|---|------|-------|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |


Let us augment this dataframe with the actual emotion label name, using the `ClassLabel.int2str()` function.

In [8]:
training_df['emotion_name'] = training_df['label'].apply(lambda row: training_dataset.features['label'].int2str(row))
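An alternative to calling `int2str` once per row is to build the id-to-name dictionary once and let pandas map the whole column. This is a sketch on a hypothetical `toy_df` holding the labels of the five rows shown earlier; `names` repeats the class names listed above.

```python
import pandas as pd

names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Labels of the first five training rows shown earlier.
toy_df = pd.DataFrame({'label': [0, 0, 3, 2, 3]})

# Position in `names` is the integer id, so enumerate gives the mapping.
id_to_name = dict(enumerate(names))
toy_df['emotion_name'] = toy_df['label'].map(id_to_name)
print(toy_df)
```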

In [9]:
training_df.describe(include='all').transpose()

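|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|-------|--------|-----|------|------|-----|-----|-----|-----|-----|-----|
| text | 16000.0 | 15969.0 | i feel on the verge of tears from weariness i ... | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| label | 16000.0 | NaN | NaN | NaN | 1.565937 | 1.50143 | 0.0 | 0.0 | 1.0 | 3.0 | 5.0 |
| emotion_name | 16000.0 | 6.0 | joy | 5362.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |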
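The summary hints at class imbalance: `joy` alone accounts for 5362 of the 16000 rows. A quick way to see the full distribution is `value_counts`. The sketch below uses a small hypothetical frame as a stand-in for `training_df`, so its proportions are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for `training_df`; values are made up.
toy_df = pd.DataFrame({
    'emotion_name': ['joy', 'joy', 'sadness', 'anger', 'joy', 'sadness'],
})

# Fraction of rows per class, most frequent class first.
distribution = toy_df['emotion_name'].value_counts(normalize=True)
print(distribution)
```

Applied to the real dataframe, the same call is `training_df['emotion_name'].value_counts(normalize=True)`.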


In [10]:
training_df.sample(20)

|   | text | label | emotion_name |
|---|------|-------|--------------|
| 8756 | ive made it through a week i just feel beaten ... | 0 | sadness |
| 4660 | i feel this strategy is worthwhile | 1 | joy |
| 6095 | i feel so worthless and weak what does he have... | 0 | sadness |
| 304 | i feel clever nov | 1 | joy |
| 8241 | im moved in ive been feeling kind of gloomy | 0 | sadness |
| 9577 | i allowed myself to feel the really shitty fee... | 0 | sadness |
| 1035 | i feel confused too | 4 | fear |
| 9976 | i feel like a crappy mummy if were stuck in bu... | 0 | sadness |
| 7872 | i feel like i liked my hair much better before... | 2 | love |
| 8341 | i feel the self pressured expectation to keep ... | 4 | fear |


#### `pandas.DataFrame` to `Dataset`

Most data scientists are familiar with the `pandas.DataFrame` and fluent in data wrangling with it. It may interest the reader to learn that, as of version 2, `pandas` offers a high-performance, memory-efficient `pyarrow` backend.

Creating a HuggingFace `Dataset` from a `pandas.DataFrame` is straightforward. We first build a small dataframe:

In [11]:
import numpy as np
import pandas as pd

# Three numeric columns derived from the same grid of 1000 points in [0, 1].
x1 = np.linspace(0, 1, 1000)
x2 = np.sin(x1)
x3 = np.log(1 + x1)

data = pd.DataFrame(data={'x1': x1, 'x2': x2, 'x3': x3})
data.describe().transpose()

Let us now create a HuggingFace `Dataset` object from this dataframe:

In [None]:
from datasets import Dataset
ds = Dataset.from_pandas(data)
ds