# Run Datalab on the [emotion recognition](https://huggingface.co/datasets/dair-ai/emotion) dataset

In [1]:
from datasets import load_dataset, concatenate_datasets

import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

from cleanlab import Datalab

In [2]:
pd.set_option("display.max_colwidth", None) 

In [3]:
dataset_dict = load_dataset("dair-ai/emotion")

No config specified, defaulting to: emotion/split
Found cached dataset emotion (/Users/sanjana/.cache/huggingface/datasets/dair-ai___emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)



In [4]:
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [5]:
dataset = concatenate_datasets(list(dataset_dict.values()))
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 20000
})

In [6]:
dataset.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}

In [7]:
dataset[0]

{'text': 'i didnt feel humiliated', 'label': 0}

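The integer in `label` indexes into the class names listed in `dataset.features`. A minimal sketch of that mapping follows; `label_to_name` is a hypothetical helper written for illustration, and the list is copied from the `ClassLabel` output above (the `ClassLabel` feature also provides an `int2str` method for the same purpose).

```python
# Class names in the order listed by dataset.features["label"] above.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def label_to_name(label):
    """Map an integer class id to its emotion name."""
    return label_names[label]


# The first example shown above.
example = {"text": "i didnt feel humiliated", "label": 0}
print(label_to_name(example["label"]))
```

So the first example, "i didnt feel humiliated", carries the given label `sadness`.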
### Get Embeddings

In [8]:
raw_texts = list(dataset["text"])
labels = list(dataset["label"])

In [9]:
em_model = SentenceTransformer('all-MiniLM-L6-v2')
text_embeddings = em_model.encode(raw_texts)

In [10]:
text_embeddings.shape

(20000, 384)

### Get out-of-sample pred_probs using cross-validation

In [11]:
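model = LogisticRegression(max_iter=400)
# With cv left at its default, cross_val_predict uses 5-fold cross-validation,
# so every row of pred_probs comes from a model that never saw that example.
pred_probs = cross_val_predict(model, text_embeddings, labels, method="predict_proba")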

In [12]:
pred_probs.shape

(20000, 6)

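Before passing `pred_probs` to Datalab, it can be worth checking that every row is a valid probability distribution and that the classifier does better than chance. The following is a minimal sketch and not part of the original notebook: `summarize_pred_probs` is a hypothetical helper, and the toy arrays stand in for the real `pred_probs` and `labels`.

```python
import numpy as np


def summarize_pred_probs(pred_probs, labels):
    """Return basic sanity-check statistics for out-of-sample predicted probabilities."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    # Each row should sum to 1 (up to floating-point error).
    rows_sum_to_one = bool(np.allclose(pred_probs.sum(axis=1), 1.0))
    # The predicted class is the column with the highest probability.
    predicted = pred_probs.argmax(axis=1)
    accuracy = float((predicted == labels).mean())
    return {"rows_sum_to_one": rows_sum_to_one, "accuracy": accuracy}


# Toy stand-in for the real (20000, 6) array: 4 examples, 3 classes.
toy_pred_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
toy_labels = [0, 1, 2, 1]
stats = summarize_pred_probs(toy_pred_probs, toy_labels)
print(stats)
```

On the notebook's data the call would be `summarize_pred_probs(pred_probs, labels)`.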
### Run Datalab

In [13]:
lab = Datalab(dataset, label_name="label")
lab.find_issues(pred_probs=pred_probs, features=text_embeddings)

Finding label issues ...
Finding outlier issues ...
Fitting OOD estimator based on provided features ...
Finding near_duplicate issues ...
Audit complete. 3927 issues found in the dataset.


### Report of all issues found

In [14]:
lab.report()

Here is a summary of the different kinds of issues found in the data:

    issue_type    score  num_issues
         label 0.782050        3737
near_duplicate 0.332089         182
       outlier 0.664571           8

(Note: A lower score indicates a more severe issue across all examples in the dataset.)


----------------------- label issues -----------------------

About this issue:
	Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.
    

Number of examples with this issue: 3737
Overall dataset quality in terms of this issue: : 0.7821

Examples representing most severe instances of this issue:
       is_label_issue  label_score given_label predicted_label
3859             True     0.001196        fear             joy
10350            True     0.002418         joy            fear
17893            True     0.002462        fear             joy
634              True     0.002501        fear            

### Get issue summary

In [15]:
lab.get_summary()

|   | issue_type | score | num_issues |
|---|---|---|---|
| 0 | label | 0.78205 | 3737 |
| 1 | outlier | 0.664571 | 8 |
| 2 | near_duplicate | 0.332089 | 182 |

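The report notes that a lower score indicates a more severe issue, so sorting the summary by `score` ranks the issue types by severity, while `num_issues` divided by the dataset size gives the share of examples flagged. A minimal sketch, with the summary rebuilt by hand from the values shown above (in the notebook, `lab.get_summary()` would be used instead); the `fraction_flagged` column is an addition for illustration.

```python
import pandas as pd

# Values copied from the summary above; in the notebook, use lab.get_summary() instead.
summary = pd.DataFrame(
    {
        "issue_type": ["label", "outlier", "near_duplicate"],
        "score": [0.782050, 0.664571, 0.332089],
        "num_issues": [3737, 8, 182],
    }
)
num_examples = 20000  # size of the concatenated dataset

# Lower score means more severe, so sort ascending to list the worst issue type first.
ranked = summary.sort_values("score").reset_index(drop=True)
# Share of the dataset flagged for each issue type.
ranked["fraction_flagged"] = ranked["num_issues"] / num_examples
print(ranked)
```

The two columns answer different questions: `near_duplicate` has the lowest score even though far fewer examples are flagged for it than for `label`.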

### near_duplicate
Get the top near-duplicate issues

In [16]:
duplicate_issues = lab.get_issues("near_duplicate")
identified_duplicate_issues = duplicate_issues[duplicate_issues["is_near_duplicate_issue"]]
top_nd = identified_duplicate_issues.sort_values("near_duplicate_score").head()
top_nd

|   | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|---|---|---|---|---|
| 112 | True | 0.0 | [19887] | 0.0 |
| 13880 | True | 0.0 | [7333] | 0.0 |
| 13236 | True | 0.0 | [9605] | 0.0 |
| 12892 | True | 0.0 | [11013] | 0.0 |
| 12562 | True | 0.0 | [7669] | 0.0 |


View the near-duplicate sets

In [17]:
near_duplicate_sets = []
for idx, sets in zip(top_nd.index.tolist(), top_nd['near_duplicate_sets']):
    near_duplicate_sets.append([idx] + list(sets))
for i, s in enumerate(near_duplicate_sets):
    print(f"set: {i}")
    for idx in s:
        print(dataset[int(idx)]["text"])
    print("\n")

set: 0
i feel like some of you have pains and you cannot imagine becoming passionate about the group or the idea that is causing pain
i feel like some of you have pains and you cannot imagine becoming passionate about the group or the idea that is causing pain


set: 1
i feel like i am very passionate about youtube and so id quite like to explain why i think youtube is the next best thing for entertainment
i feel like i am very passionate about youtube and so id quite like to explain why i think youtube is the next best thing for entertainment


set: 2
i feel like a tortured artist when i talk to her
i feel like a tortured artist when i talk to her


set: 3
i cant escape the tears of sadness and just true grief i feel at the loss of my sweet friend and sister
i cant escape the tears of sadness and just true grief i feel at the loss of my sweet friend and sister


set: 4
i feel so weird about it
i feel so weird about it



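Once the near-duplicate sets have been reviewed, one option is to keep a single example from each group. The sketch below is one possible approach, not something the notebook does: `indices_to_drop` is a hypothetical helper, and it assumes the sets are symmetric (if example A lists B, then B lists A).

```python
def indices_to_drop(near_duplicate_sets):
    """Pick which indices to drop so one example per near-duplicate group remains.

    near_duplicate_sets maps an example index to the indices flagged as its near duplicates.
    """
    dropped = set()
    for idx in sorted(near_duplicate_sets):
        if idx in dropped:
            continue  # already removed as a duplicate of an earlier example
        # Keep idx and drop everything flagged as its near duplicate.
        dropped.update(int(j) for j in near_duplicate_sets[idx] if int(j) != idx)
    return sorted(dropped)


# Pairs taken from the table above, written in both directions (assumed symmetric).
toy_sets = {
    112: [19887],
    19887: [112],
    7333: [13880],
    13880: [7333],
}
print(indices_to_drop(toy_sets))
```

In the notebook, the mapping could be built from the index and the `near_duplicate_sets` column of `identified_duplicate_issues`. The rule is greedy: it keeps the smallest index in each pair, which is enough for exact duplicates like the ones printed above but may need refinement for larger, loosely connected groups.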

### label

In [18]:
label_issues = lab.get_issues("label")
label_issues.head() 

|   | is_label_issue | label_score | given_label | predicted_label |
|---|---|---|---|---|
| 0 | False | 0.604405 | sadness | sadness |
| 1 | False | 0.272918 | sadness | joy |
| 2 | False | 0.892816 | anger | anger |
| 3 | False | 0.144541 | love | sadness |
| 4 | False | 0.796081 | anger | anger |


View examples with label errors and their suggested label

In [19]:
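identified_label_issues = label_issues[label_issues["is_label_issue"]]
# ascending=False puts the flagged examples with the highest label_score
# (the least severe ones) first; use ascending=True to see the most severe first.
identified_label_issues = identified_label_issues.sort_values("label_score", ascending=False)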
erroneous_examples = identified_label_issues.join(pd.DataFrame(raw_texts, columns=["text"]), how="inner")[["text", "given_label", "predicted_label"]]
erroneous_examples = erroneous_examples.rename({'predicted_label': 'suggested_label'}, axis=1)
erroneous_examples.head()

|   | text | given_label | suggested_label |
|---|---|---|---|
| 17051 | i manage feelings for prince charming and the boy | joy | love |
| 331 | i just love the feeling of something warmly hugging you and feeling so precious and small precious to someone something | joy | love |
| 19066 | i asked some girls what it meant to them to be valued and for the most part the response was that they felt valued when the people around them made them feel valued and treated them in a loving and caring manner | joy | love |
| 481 | i love sliding down on a nice big throbbing cock and feeling what my gorgeous body does to a man | joy | love |
| 5243 | i could have used for this blog post but this one perfectly describes the way i feel as well as give tribute to my | joy | love |

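To see which label confusions dominate, the flagged rows can be grouped by their given and suggested labels. This is a minimal sketch on a toy frame: the column names match `lab.get_issues("label")` as shown above, but the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same columns as lab.get_issues("label"); values are illustrative only.
toy_label_issues = pd.DataFrame(
    {
        "is_label_issue": [True, True, False, True, False],
        "label_score": [0.01, 0.20, 0.90, 0.05, 0.60],
        "given_label": ["joy", "joy", "anger", "fear", "sadness"],
        "predicted_label": ["love", "love", "anger", "joy", "sadness"],
    }
)

flagged = toy_label_issues[toy_label_issues["is_label_issue"]]
# Count how often each (given, suggested) pair occurs among flagged examples.
pair_counts = (
    flagged.groupby(["given_label", "predicted_label"])
    .size()
    .sort_values(ascending=False)
)
print(pair_counts)
```

Running the same grouping on the notebook's `label_issues` frame would show, for example, how many of the flagged examples are `joy` with a suggested label of `love`.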

### outlier
View top examples detected as outliers

In [20]:
outlier_issues = lab.get_issues("outlier")
outlier_issues = outlier_issues.join(pd.DataFrame(raw_texts, columns=["text"]), how="inner")
flagged_outliers = outlier_issues[outlier_issues["is_outlier_issue"]]
flagged_outliers.sort_values("outlier_score")[["text", "outlier_score", "distance_to_nearest_neighbor", "nearest_neighbor"]].head()

|   | text | outlier_score | distance_to_nearest_neighbor | nearest_neighbor |
|---|---|---|---|---|
| 12058 | i wanted to use older kx forks wheel w disc brakes but am was not feeling adventurous enough to try to figure out a stem and lowering the off road height | 0.464728 | 0.722003 | 5197 |
| 7345 | i feel this command is useful to check the free space in log file for all databases in over go | 0.477303 | 0.667775 | 9424 |
| 11640 | i feel a little uncertain about the structure of a revalidation portfolio | 0.486628 | 0.693602 | 10005 |
| 19150 | when i heard the last regulation of the socialist govrenment concerning pensions | 0.489361 | 0.666289 | 9876 |
| 2166 | i feel poles are most useful in pairs all price and stats in this review are for two poles | 0.490448 | 0.645411 | 17580 |

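The `nearest_neighbor` column holds the index of the closest example in embedding space, so reading an outlier next to its neighbor's text can help judge whether it is truly out of place. A minimal sketch under that reading of the column: `with_neighbor_text` is a hypothetical helper, and the toy data stands in for `raw_texts` and the filtered outlier frame.

```python
import pandas as pd


def with_neighbor_text(outliers, texts):
    """Return a copy of `outliers` with the text of each row's nearest neighbor added."""
    result = outliers.copy()
    result["neighbor_text"] = [texts[int(i)] for i in result["nearest_neighbor"]]
    return result


# Toy stand-ins: `texts` plays the role of raw_texts, `toy_outliers` of the filtered frame.
texts = ["i feel happy", "i feel sad", "check the free space in the log file"]
toy_outliers = pd.DataFrame(
    {"text": [texts[2]], "outlier_score": [0.47], "nearest_neighbor": [0]},
    index=[2],
)
print(with_neighbor_text(toy_outliers, texts))
```

In the notebook, the equivalent call would pass the flagged rows of `outlier_issues` together with `raw_texts`.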