<a href="https://colab.research.google.com/github/subhan97ahmed/How-find-bad-labels/blob/main/How_find_bad_labels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Downloading Google's GoEmotions dataset

In [1]:
!wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv

--2023-06-12 13:47:16--  https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 142.251.171.128, 142.250.152.128, 142.251.172.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.251.171.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14174600 (14M) [application/octet-stream]
Saving to: ‘data/full_dataset/goemotions_1.csv’


2023-06-12 13:47:17 (126 MB/s) - ‘data/full_dataset/goemotions_1.csv’ saved [14174600/14174600]



In [3]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("/content/data/full_dataset/goemotions_1.csv")

In [5]:
df.columns

Index(['text', 'id', 'author', 'subreddit', 'link_id', 'parent_id',
       'created_utc', 'rater_id', 'example_very_unclear', 'admiration',
       'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion',
       'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust',
       'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy',
       'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief',
       'remorse', 'sadness', 'surprise', 'neutral'],
      dtype='object')

In [7]:
df[["text","joy"]].loc[lambda d:d["joy"]==1].sample(4)

Unnamed: 0,text,joy
44646,"Yea, I saw him live in Stockholm. It was awesome!",1
14827,"Oh come on, you can't mention Psi-Ops without ...",1
8027,That's crazy; I went to a super [RELIGION] hig...,1
55701,Ooh! Ooh! I want to lose more than 5-10 vanity...,1


In [9]:
X, y = df["text"], df["joy"]

pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000)
)

In [10]:
%%time
pipe.fit(X,y)

CPU times: user 6.01 s, sys: 6.15 s, total: 12.2 s
Wall time: 8.5 s


### Trick 1: Model Uncertainty

In [11]:
pipe.predict_proba(X)


array([[8.34260530e-01, 1.65739470e-01],
       [9.97775877e-01, 2.22412327e-03],
       [9.93142816e-01, 6.85718376e-03],
       ...,
       [9.99774813e-01, 2.25187430e-04],
       [8.47542468e-01, 1.52457532e-01],
       [8.36129953e-01, 1.63870047e-01]])

In [16]:
# Column 0 is P(joy == 0); values near 0.5 mean the model is unsure either way.
probas = pipe.predict_proba(X)[:, 0]

df.loc[(probas > 0.45) & (probas < 0.55)][["text", "joy"]].head(10)

Unnamed: 0,text,joy
10,"I have, and now that you mention it, I think t...",0
57,I don’t even try to make shots like this mysel...,0
286,"play who you find fun, ""tiers don't matter""",0
303,You guys are so cute together! Good luck with ...,0
315,"Ha! They’re very naughty! Pretty cute, tho ❤️",0
331,Oh that's lovely.,0
337,They all have these same moronic talking point...,0
390,I like seeing this. There is hope. I want to h...,0
425,I would’ve LOVED to be a part of something lik...,0
522,I cheered for the Superbowl to be canceled,0
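
The fixed 0.45 to 0.55 window above keeps every row in that band in its original order. An alternative is to rank rows by how close the predicted probability is to 0.5. A minimal sketch, assuming NumPy; the helper name `rank_by_uncertainty` and the toy probabilities are hypothetical and not part of the notebook:

```python
import numpy as np


def rank_by_uncertainty(probas, top_k=10):
    """Return row positions ordered from most to least uncertain.

    `probas` holds one predicted probability per row (for a binary model);
    uncertainty is measured as closeness to 0.5.
    """
    probas = np.asarray(probas, dtype=float)
    distance = np.abs(probas - 0.5)            # 0.0 means maximally unsure
    order = np.argsort(distance, kind="stable")
    return order[:top_k]


# Toy probabilities (made up for illustration).
toy_probas = [0.98, 0.52, 0.10, 0.45, 0.70]
most_uncertain = rank_by_uncertainty(toy_probas, top_k=3)
print(most_uncertain)
```

With the notebook's objects, something like `df.iloc[rank_by_uncertainty(probas)][["text", "joy"]]` would list the least certain rows first.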


### Trick 2: Model Disagreement



In [None]:
df.loc[lambda d:d["joy"]!=pipe.predict(X)]

In [20]:
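def correct_class_confidence(X, y, model):
    """Return, for each row, the probability the model assigns to the given label."""
    probas = model.predict_proba(X)
    values = []
    for i, proba in enumerate(probas):
        # Map each class label to its predicted probability for this row.
        proba_dict = {model.classes_[j]: v for j, v in enumerate(proba)}
        # y[i] is a label-based lookup; it relies on y having the default 0..n-1 index.
        values.append(proba_dict[y[i]])
    return values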


In [24]:
(
    df.assign(confidence=correct_class_confidence(X, y, pipe))
    .loc[lambda d: pipe.predict(d["text"]) != d["joy"]]
    [["text", "joy", "confidence"]]
    .sort_values("confidence")
    .loc[lambda d: d["joy"] == 1]
    .head(20)
)

Unnamed: 0,text,joy,confidence
44431,[NAME]? Is that you?,1,0.101772
69536,There it is!,1,0.129937
44917,Thank you so much. ❤ I'm going to be okay.,1,0.184334
37164,Thank you for this! :),1,0.185591
17172,You know damn well it is.,1,0.20064
40954,"Let's be real, [NAME] is coming back.",1,0.201564
52629,I don’t see the problem,1,0.206169
36560,That will have to be MY cake then!,1,0.222738
64693,oh is that it,1,0.226585
16317,I did that.,1,0.248891
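
The `correct_class_confidence` loop builds one dictionary per row, which can be slow on larger frames. A vectorized sketch of the same lookup, assuming NumPy and assuming the class list is sorted (as scikit-learn's `classes_` is); the function name `correct_class_confidence_vectorized` and the toy arrays are hypothetical:

```python
import numpy as np


def correct_class_confidence_vectorized(probas, y, classes):
    """Pick, for each row, the predicted probability of that row's given label.

    Assumes `classes` is sorted and that every value in `y` appears in it.
    """
    probas = np.asarray(probas, dtype=float)
    classes = np.asarray(classes)
    y = np.asarray(y)
    # Column position of each row's label within the sorted class list.
    cols = np.searchsorted(classes, y)
    # Fancy indexing: one (row, column) pair per example.
    return probas[np.arange(len(y)), cols]


# Toy probabilities and labels (made up for illustration).
toy_probas = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4]])
toy_y = np.array([0, 1, 1])
toy_conf = correct_class_confidence_vectorized(toy_probas, toy_y, classes=[0, 1])
print(toy_conf)
```

With the notebook's objects the call would look like `correct_class_confidence_vectorized(pipe.predict_proba(X), y, pipe.classes_)`.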


### Trick 3: Pruning

In [31]:
!pip install cleanlab==1.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cleanlab==1.0
  Downloading cleanlab-1.0-py2.py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.6/77.6 kB 3.4 MB/s eta 0:00:00
Installing collected packages: cleanlab
  Attempting uninstall: cleanlab
    Found existing installation: cleanlab 2.4.0
    Uninstalling cleanlab-2.4.0:
      Successfully uninstalled cleanlab-2.4.0
Successfully installed cleanlab-1.0


In [34]:
from cleanlab.pruning import get_noise_indices

ordered_label_errors = get_noise_indices(
   s=y,
   psx=pipe.predict_proba(X),
   sorted_index_method='prob_given_label')
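
The `psx` passed to `get_noise_indices` above comes from `pipe.predict_proba(X)` on the same rows the pipeline was trained on. cleanlab's documentation recommends out-of-sample probabilities, for example from cross-validation. A minimal sketch, assuming scikit-learn's `cross_val_predict`; the toy texts, labels and the names `toy_pipe` and `oos_probas` are made up for illustration and are not part of the notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy corpus (hypothetical): 1 = joy, 0 = not joy.
toy_texts = [
    "what a wonderful happy day",
    "so glad and happy for you",
    "this is great fun",
    "happy birthday enjoy your day",
    "I do not see the problem",
    "the bus is late again",
    "where is the report",
    "that is the wrong file",
]
toy_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

toy_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Each row's probabilities come from a model that never saw that row in training.
oos_probas = cross_val_predict(
    toy_pipe, toy_texts, toy_labels, cv=2, method="predict_proba"
)
print(oos_probas.shape)
```

On the real data the equivalent call would be `cross_val_predict(pipe, X, y, cv=5, method="predict_proba")`, and its result could be passed as `psx`. The fold count of 5 is an arbitrary choice here.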

In [35]:
df.iloc[ordered_label_errors][['text','joy']].head(10)

Unnamed: 0,text,joy
20796,My son and I both enjoy taking pictures. It gi...,0
56480,It's wonderful and gives me happy happy feels,0
11193,Glad to see Hearts fans giving [NAME] such a w...,0
56140,Then enjoy it for as long as it's fun. When it...,0
4204,I got matched with the best [NAME] in the worl...,0
69056,Happy to hear this exciting news. Congratulati...,0
22406,"Glad you’re doing better internet stranger, an...",0
15942,"No problem at all, glad I could help :) Cheers!",0
60608,Happy birthday! Enjoy your day and [NAME] shou...,0
37513,The only happy ending I enjoyed on this sub. G...,0
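
To give a feel for what kind of rule can produce a list like the one above, here is a simplified sketch of the confident-learning idea: each class gets a threshold equal to the mean predicted probability of that class among rows carrying that label, and a row is flagged when it confidently belongs to a class other than its given label. This is an illustration only, not cleanlab's actual implementation; the function name `flag_suspect_labels` and the toy arrays are hypothetical, and labels are assumed to be integers `0..K-1` with at least one row per class:

```python
import numpy as np


def flag_suspect_labels(labels, pred_probs):
    """Return positions of likely label errors, least self-confident first."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average P(class k) over the rows labeled k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    above = pred_probs >= thresholds
    # Among classes that clear their threshold, take the most probable one.
    masked = np.where(above, pred_probs, -np.inf)
    confident = np.where(above.any(axis=1), masked.argmax(axis=1), -1)
    suspect = np.flatnonzero((confident != -1) & (confident != labels))
    # Order by the probability assigned to the given label, lowest first.
    self_conf = pred_probs[suspect, labels[suspect]]
    return suspect[np.argsort(self_conf, kind="stable")]


# Toy data (made up): rows 2 and 5 carry labels the probabilities contradict.
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_pred_probs = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.85, 0.15]]
)
suspects = flag_suspect_labels(toy_labels, toy_pred_probs)
print(suspects)
```

The per-class thresholds are what distinguish this from plain disagreement: a class the model is generally less sure about gets a lower bar.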
