**Bad labels: dataset**

Many popular datasets, including MNIST and CIFAR, contain incorrect labels that can distort machine learning benchmarks. The problem is documented on labelerrors.com and in an accompanying research paper on arXiv. We'll use cleanlab to identify and correct such errors in our own training data.

The GoEmotions dataset, created by Google, contains text snippets from Reddit annotated with emotion labels and is designed for emotion prediction tasks. Here we focus on a single label, "excitement". The loading cell below downloads the dataset if it is not already present.

Starting with a simple model using the GoEmotions dataset

Load the Dataset:

In [None]:
import os
import pandas as pd

# File name; later cells read this file from the working directory
file_path = 'goemotions_1.csv'

# Download the file into the working directory if it is not already there
if not os.path.isfile(file_path):
    print(f"File not found: {file_path}. Downloading...")
    # wget saves the file under its own name in the current directory
    os.system(f'wget https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/{file_path}')

# Now load the data
df = pd.read_csv(file_path)
print("File loaded successfully.")


Preview the Data:

In [None]:
pd.set_option('display.max_colwidth', None)
print(df[['text', 'excitement']].loc[lambda d: d['excitement'] == 0].sample(2))


Check Class Imbalance:

In [None]:
print(df['excitement'].value_counts())
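
If the counts are heavily skewed, the `class_weight='balanced'` option used in the model cell compensates by weighting each class inversely to its frequency. As a rough illustration, the sketch below reproduces the formula scikit-learn documents for that mode, `n_samples / (n_classes * class_count)`, in plain Python. The helper name and the toy labels are hypothetical and not part of the GoEmotions data.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical toy labels: 9 negatives and 1 positive
toy_labels = [0] * 9 + [1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rare class receives the larger weight, so its errors count for more during training.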


Create and Train the Model:

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

X, y = df['text'], df['excitement']
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
pipe.fit(X, y)


Finding Bad Labels using Model Disagreement

In [None]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

# Load data
df = pd.read_csv("goemotions_1.csv")

# Setup DataFrame to display full text
pd.set_option('display.max_colwidth', None)

# Create a pipeline with logistic regression and a count vectorizer
X, y = df['text'], df['excitement']
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
pipe.fit(X, y)

# Function to calculate the confidence the model assigns to each example's given label
def correct_class_confidence(X, y, model):
    probas = model.predict_proba(X)
    # Map each class to its column in probas; iterate y by position, not by index label
    class_to_col = {cls: j for j, cls in enumerate(model.classes_)}
    return [proba[class_to_col[label]] for proba, label in zip(probas, y)]

# Assign confidence values to the DataFrame
df['confidence'] = correct_class_confidence(X, y, pipe)

# Keep only the examples where the prediction does not match the label
# (the confidence column computed above carries over)
disagreements = df.loc[lambda d: pipe.predict(d['text']) != d['excitement']]

# Sort by confidence and filter for specific examples
sorted_disagreements = (disagreements
                        .loc[lambda d: d['excitement'] == 0]
                        .sort_values("confidence")
                        .head(20))

print(sorted_disagreements[['text', 'excitement', 'confidence']])
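
To make the ranking idea concrete, here is a minimal sketch on hand-written numbers: look up the probability the model gives to each example's assigned label, then sort from least to most confident. The probabilities, labels, and the `label_confidence` helper are made up for illustration and do not come from the GoEmotions model.

```python
def label_confidence(probas, labels, classes):
    """Return, for each row, the probability assigned to that row's given label."""
    col = {cls: j for j, cls in enumerate(classes)}
    return [row[col[label]] for row, label in zip(probas, labels)]

# Hypothetical predicted probabilities for classes [0, 1]
toy_probas = [[0.9, 0.1], [0.2, 0.8], [0.05, 0.95]]
toy_labels = [0, 1, 0]

confidences = label_confidence(toy_probas, toy_labels, classes=[0, 1])

# Row positions ordered from least to most confident in the given label
suspects = sorted(range(len(toy_labels)), key=lambda i: confidences[i])
print(confidences)
print(suspects)
```

The third row is labeled 0 while the model puts 0.95 on class 1, so it is the first candidate to review by hand.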


Pruning with Cleanlab

In [None]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
# cleanlab 1.x API; cleanlab 2.x renamed this to cleanlab.filter.find_label_issues
from cleanlab.pruning import get_noise_indices

# Load data
df = pd.read_csv("goemotions_1.csv")

# Setup DataFrame to display full text
pd.set_option('display.max_colwidth', None)

# Create a pipeline with logistic regression and a count vectorizer
X, y = df['text'], df['excitement']
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)
pipe.fit(X, y)

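# Generate probabilities of each class for the dataset.
# Note: these are in-sample probabilities from a model fitted on the same rows;
# cleanlab recommends out-of-sample (cross-validated) probabilities where possible.
probabilities = pipe.predict_proba(X)

# Use cleanlab to identify potential label issues (labels passed as a NumPy array)
ordered_label_errors = get_noise_indices(
    s=y.values,
    psx=probabilities,
    sorted_index_method='prob_given_label'
)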

# Display potential mislabeled examples
mislabeled_examples = df.iloc[ordered_label_errors][['text', 'excitement']].head(20)
print(mislabeled_examples)
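
cleanlab builds on confident learning, which compares each example's predicted probabilities against per-class thresholds. The sketch below is a simplified, pure-Python illustration of that idea and not cleanlab's actual implementation: the threshold for a class is the mean predicted probability of that class among examples labeled with it, and an example is flagged when its most probable above-threshold class differs from its given label. The function name and toy numbers are hypothetical, and the sketch assumes every class has at least one labeled example.

```python
def flag_label_issues(probas, labels, n_classes):
    """Simplified confident-learning check; returns (thresholds, flagged positions)."""
    # Per-class threshold: average confidence in class k among rows labeled k
    thresholds = []
    for k in range(n_classes):
        vals = [p[k] for p, y in zip(probas, labels) if y == k]
        thresholds.append(sum(vals) / len(vals))

    issues = []
    for i, (p, y) in enumerate(zip(probas, labels)):
        # Classes the model is "confident" about for this row
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: p[k])
            if best != y:
                issues.append(i)
    return thresholds, issues

# Hypothetical probabilities for classes [0, 1]; the last row looks mislabeled
toy_probas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
toy_labels = [0, 0, 1, 1, 0]

thresholds, issues = flag_label_issues(toy_probas, toy_labels, n_classes=2)
print(thresholds)
print(issues)
```

Because the thresholds adapt to each class, a class the model is generally less sure about needs less probability mass before a disagreement counts as a likely label error.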


Learning with Noisy Labels via Cleanlab

In [None]:
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
# cleanlab 1.x API; cleanlab 2.x renamed this class to CleanLearning
from cleanlab.classification import LearningWithNoisyLabels

# Load the dataset
df = pd.read_csv("goemotions_1.csv")

# Set up the DataFrame to display the full text
pd.set_option('display.max_colwidth', None)

# Prepare data
X, y = df['text'], df['excitement']

# Define the base pipeline with Logistic Regression
base_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight='balanced', max_iter=1000)
)

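# Train the base pipeline
base_pipe.fit(X, y)

# Wrap a fresh, unfitted copy of the pipeline with LearningWithNoisyLabels,
# so that fitting the wrapper does not refit base_pipe itself
from sklearn.base import clone
lnl_model = LearningWithNoisyLabels(clf=clone(base_pipe))
lnl_model.fit(X=X, s=y.values)

# Compare predictions from LearningWithNoisyLabels and the base pipeline
discrepancies = df.loc[lnl_model.predict(X) != base_pipe.predict(X)][['text', 'excitement']]
print("Discrepancies found:")
# Show at most 5 rows; sampling more rows than exist would raise an error
print(discrepancies.sample(min(5, len(discrepancies))))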
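
Conceptually, learning with noisy labels amounts to estimating which labels are suspect, dropping those rows, and retraining on the rest. The following is a minimal scikit-learn sketch of that loop under simplifying assumptions: it uses out-of-sample probabilities from `cross_val_predict` and a fixed 0.5 cutoff on the given label's probability, whereas cleanlab estimates the noise more carefully. The toy corpus and its deliberately wrong label are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the last "boring" text is deliberately mislabeled as 1
texts = np.array(["so excited wow"] * 10 + ["boring dull day"] * 10)
labels = np.array([1] * 10 + [0] * 9 + [1])

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Out-of-sample probabilities: each row is scored by a model that never saw it
probas = cross_val_predict(pipe, texts, labels, cv=5, method="predict_proba")

# Probability assigned to each row's given label (columns follow classes 0, 1)
given_label_conf = probas[np.arange(len(labels)), labels]

# Simplified pruning rule (an assumption of this sketch): drop rows below 0.5
flagged = np.where(given_label_conf < 0.5)[0]
keep = given_label_conf >= 0.5

# Retrain on the remaining rows only
clean_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clean_model.fit(texts[keep], labels[keep])
print(flagged)
```

Scoring each row with a model that was not trained on it matters here: a model fitted on the noisy row itself can partly memorize the wrong label and hide the disagreement.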


Tuning hyperparameters to improve a model risks overfitting to bad labels. It is often more effective to first identify and correct the bad labels in the dataset, so that accuracy measured afterwards reflects how well the model really performs.