# **Goal**: Reproducing some results from the paper [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749) (NeurIPS, 2021)

**Method**: https://github.com/cleanlab/cleanlab

**Data**: https://github.com/cleanlab/label-errors (attached repository)

**Example**: https://github.com/cleanlab/label-errors/blob/main/examples/Tutorial%20-%20How%20To%20Find%20Label%20Errors%20With%20CleanLab.ipynb (attached notebook)

In [3]:
import cleanlab
import numpy as np
import pandas as pd
import json

In the repository, there are predictions from models trained on the following datasets:

In [4]:
datasets = {
    'mnist_test_set': 'image',
    'cifar10_test_set': 'image',
    'cifar100_test_set': 'image',
    'caltech256': 'image',
    'imagenet_val_set': 'image',
    '20news_test_set': 'text',
    'imdb_test_set': 'text',
    'amazon': 'text',
    'audioset_eval_set': 'audio',
}

All metadata files (predictions, labels, etc.) must first be downloaded from the attached repository: `git clone https://github.com/cleanlab/label-errors`.

In [5]:
root_dir = "label-errors/dataset_indexing"
with open(root_dir + "/audioset_eval_set_index_to_youtube_id.json", 'r') as rf:
    AUDIOSET_INDEX_TO_YOUTUBE = json.load(rf)
with open(root_dir + "/imdb_test_set_index_to_filename.json", 'r') as rf:
    IMDB_INDEX_TO_FILENAME = json.load(rf)
with open(root_dir + "/imagenet_val_set_index_to_filepath.json", 'r') as rf:
    IMAGENET_INDEX_TO_FILEPATH = json.load(rf)
with open(root_dir + "/caltech256_index_to_filename.json", "r") as rf:
    CALTECH256_INDEX_TO_FILENAME = json.load(rf)    

**We skip reproducing the exact label errors shown on labelerrors.com because the package has changed since then: `cleanlab` v2.0 is now available and should work at least as well, if not better.**

In [6]:
# def get_label_error_indices_to_match_labelerrors_com():
#     """This method will reproduce the label errors found on labelerrors.com and
#     match (within a few percentage) the counts of label errors in Table 1 in the
#     label errors paper: https://arxiv.org/abs/2103.14749
    
#     While reproducibility is nice, some of these methods have been improved, and
#     if you are not reproducing the results in the paper, we recommend using the
#     latest version of `cleanlab.pruning.get_noise_indices()`

#     Variations in method is due to the fact that this research was
#     conducted over the span of years. All methods use variations of
#     confident learning."""

#     if dataset == 'imagenet_val_set':
#         cj = cleanlab.latent_estimation.compute_confident_joint(
#             s=labels, psx=pyx, calibrate=False, )
#         num_errors = cj.sum() - cj.diagonal().sum()
#     elif dataset == 'mnist_test_set':
#         cj = cleanlab.latent_estimation.compute_confident_joint(
#             s=labels, psx=pyx, calibrate=False, )
#         label_errors_bool = cleanlab.pruning.get_noise_indices(
#             s=labels, psx=pyx, confident_joint=cj, prune_method='prune_by_class',
#         )
#         num_errors = sum(label_errors_bool)
#     elif dataset != 'audioset_eval_set':  # Audioset is special case: it is multi-label
#         cj = cleanlab.latent_estimation.compute_confident_joint(
#             s=labels, psx=pyx, calibrate=False, )
#         num_errors = cleanlab.latent_estimation.num_label_errors(
#             labels=labels, psx=pyx, confident_joint=cj, )
    
#     if dataset == 'audioset_eval_set':  # Special case (multi-label) (TODO: update)
#         label_error_indices = cleanlab.pruning.get_noise_indices(
#             s=labels, psx=pyx, multi_label=True,
#             sorted_index_method='self_confidence', )
#         label_error_indices = label_error_indices[:307]
#     else:
#         prob_label = np.array([pyx[i, l] for i, l in enumerate(labels)])
#         max_prob_not_label = np.array(
#             [max(np.delete(pyx[i], l, -1)) for i, l in enumerate(labels)])
#         normalized_margin = prob_label - max_prob_not_label
#         label_error_indices = np.argsort(normalized_margin)[:num_errors]

#     return label_error_indices
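
The ranking step in the commented-out function above (probability of the given label minus the largest probability among the other classes) can be sketched in vectorized NumPy. This is a minimal illustration on made-up probabilities, not the authors' code; the toy `pyx` and `labels` arrays are assumptions.

```python
import numpy as np

# Toy predicted probabilities (3 examples, 3 classes) and given labels (assumed data).
pyx = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6],
                [0.4, 0.5, 0.1]])
labels = np.array([0, 1, 0])
rows = np.arange(len(labels))

# Probability the model assigns to the given label.
prob_label = pyx[rows, labels]

# Largest probability among all classes other than the given label.
masked = pyx.copy()
masked[rows, labels] = -np.inf
max_prob_not_label = masked.max(axis=1)

# Lower margin means the given label looks more suspicious.
normalized_margin = prob_label - max_prob_not_label
order = np.argsort(normalized_margin)
print(order)
```

Taking the first `num_errors` entries of `order` then corresponds to the last line of the commented-out `else` branch.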

## Find label errors using cleanlab (v2) based on the given predictions

We compute results for all available variations of their methods; specifically `"predicted_neq_given"` is a naive approach indicating an error when `argmax(preds) != label`.
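
As a reference point, the naive rule described above can be written directly in NumPy. This is only a sketch of `argmax(preds) != label` on made-up data; the arrays below are assumptions, not values from the repository.

```python
import numpy as np

# Toy predicted probabilities (3 examples, 2 classes) and given labels (assumed data).
pred_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4]])
labels = np.array([0, 0, 1])

# Flag an example whenever the most probable class differs from the given label.
naive_mask = pred_probs.argmax(axis=1) != labels
print(naive_mask, naive_mask.sum())
```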

In [9]:
results = {}

In [10]:
for dataset, modality in datasets.items():
    # Get the cross-validated predicted probabilities on the test set.
    if dataset == 'amazon' or dataset == 'imagenet_val_set':
        n_parts = 3 if dataset == 'amazon' else 4
        pyx_fn = 'label-errors/cross_validated_predicted_probabilities/' \
             '{}_pyx.part{}_of_{}.npy'
        parts = [np.load(pyx_fn.format(dataset, i + 1, n_parts)) for i in range(n_parts)]
        pyx = np.vstack(parts)
    else:
        pyx = np.load('label-errors/cross_validated_predicted_probabilities/' \
            '{}_pyx.npy'.format(dataset), allow_pickle=True)

    # Get the cross-validated predictions (argmax of pyx) on the test set.
    pred = np.load('label-errors/cross_validated_predicted_labels/'
        '{}_pyx_argmax_predicted_labels.npy'.format(dataset), allow_pickle=True)
    # Get the test set labels
    labels = np.load('label-errors/original_test_labels/'
        '{}_original_labels.npy'.format(dataset), allow_pickle=True)
    # Find label error indices using cleanlab in one line of code. 
    # This will use the most recent version of cleanlab with best results.
    print(f'{dataset} has {pyx.shape[0]} examples')

    temp = {}
    for filter_by in ['prune_by_class', 'prune_by_noise_rate', 'both', 'confident_learning', 'predicted_neq_given']:
        label_error_indices = cleanlab.filter.find_label_issues(
            labels=labels,
            pred_probs=pyx,
            filter_by=filter_by,
            multi_label=(dataset == 'audioset_eval_set'),
            ## this only reorders the result, if None then returns boolean mask
            # return_indices_ranked_by='self_confidence', 
        )
        num_errors = np.sum(label_error_indices)
        temp[filter_by] = label_error_indices
        print('Estimated number of errors by {}:'.format(filter_by), num_errors)
        
    results[dataset] = temp

mnist_test_set has 10000 examples
Estimated number of errors by prune_by_class: 15
Estimated number of errors by prune_by_noise_rate: 15
Estimated number of errors by both: 15
Estimated number of errors by confident_learning: 15
Estimated number of errors by predicted_neq_given: 87
cifar10_test_set has 10000 examples
Estimated number of errors by prune_by_class: 284
Estimated number of errors by prune_by_noise_rate: 284
Estimated number of errors by both: 226
Estimated number of errors by confident_learning: 244
Estimated number of errors by predicted_neq_given: 706
cifar100_test_set has 10000 examples
Estimated number of errors by prune_by_class: 2250
Estimated number of errors by prune_by_noise_rate: 2120
Estimated number of errors by both: 1779
Estimated number of errors by confident_learning: 1846
Estimated number of errors by predicted_neq_given: 3071
caltech256 has 29780 examples
Estimated number of errors by prune_by_class: 2420
Estimated number of errors by prune_by_noise_rate:

## Are these errors correct with respect to human evaluation? The case of Amazon Reviews 

We aim to check if the estimated errors are in line with human ground truth given by the authors based on the MTurk validation.

### Start by processing the MTurk results.

In [11]:
with open('label-errors/mturk/amazon_mturk.json') as f:
    mturk_data = json.load(f)

In [12]:
len(mturk_data)

1000

There are 1000 answers.

In [13]:
mturk_data_clean = []
for obj in mturk_data:
    new_obj = {}
    for k, v in obj.items():
        if k == "mturk":
            for kk, vv in v.items():
                new_obj[kk] = vv
        else:
            new_obj[k] = v
    mturk_data_clean.append(new_obj)

In [14]:
mturk_df = pd.DataFrame(mturk_data_clean)
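
The flattening loop above can also be sketched with dictionary unpacking. The record below is a hypothetical example shaped like the rows shown in the table further down (the `url` field is omitted for brevity); unlike the loop, this version always places the MTurk counts after the other fields.

```python
import pandas as pd

# One hypothetical record with the vote counts nested under "mturk".
records = [
    {"id": 22360,
     "given_original_label": "Positive",
     "our_guessed_label": "Neutral",
     "mturk": {"positive": 3, "negative": 0, "neutral": 2, "off-topic": 0}},
]

# Copy the top-level fields, then merge in the nested vote counts.
flat = [
    {**{k: v for k, v in r.items() if k != "mturk"}, **r["mturk"]}
    for r in records
]
df = pd.DataFrame(flat)
print(df.columns.tolist())
```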

The following fields are available:
- original label (the dataset's ground truth)
- label guessed by CL: the most probable label according to the model, reported AFTER the CL method flagged the example as a potential error
- positive/negative/neutral/off-topic counts of MTurk answers (5 per row)

In [15]:
mturk_df

Unnamed: 0,id,url,given_original_label,our_guessed_label,positive,negative,neutral,off-topic
0,22360,https://labelerrors.com/static/amazon/22360.txt,Positive,Neutral,3,0,2,0
1,38306,https://labelerrors.com/static/amazon/38306.txt,Positive,Neutral,1,1,3,0
2,46831,https://labelerrors.com/static/amazon/46831.txt,Positive,Negative,3,1,0,1
3,51608,https://labelerrors.com/static/amazon/51608.txt,Neutral,Positive,2,1,1,1
4,58198,https://labelerrors.com/static/amazon/58198.txt,Neutral,Positive,3,1,1,0
...,...,...,...,...,...,...,...,...
995,9936139,https://labelerrors.com/static/amazon/9936139.txt,Neutral,Positive,2,1,2,0
996,9938659,https://labelerrors.com/static/amazon/9938659.txt,Negative,Neutral,0,1,4,0
997,9957864,https://labelerrors.com/static/amazon/9957864.txt,Positive,Negative,2,1,2,0
998,9964543,https://labelerrors.com/static/amazon/9964543.txt,Positive,Neutral,2,1,2,0


We first aim to reproduce the 73.2% result from the paper:

In [16]:
mturk_iserror = []
mturk_true_label = []
errors = 0
for i in range(mturk_df.shape[0]):
    x = mturk_df.iloc[i,:]
    if x[x.given_original_label.lower()] < 3:
        errors += 1
        mturk_iserror.append(True)
    else:
        mturk_iserror.append(False)
print(100*errors/1000)

73.2
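
The rule applied in the loop above (an example counts as an error when fewer than 3 of the 5 MTurk workers agree with the given label) can be sketched without an explicit row loop. The toy frame reuses the five rows displayed earlier; the helper names are assumptions.

```python
import numpy as np
import pandas as pd

# First five rows of the MTurk table shown above (off-topic column omitted).
df = pd.DataFrame({
    "given_original_label": ["Positive", "Positive", "Positive", "Neutral", "Neutral"],
    "positive": [3, 1, 3, 2, 3],
    "negative": [0, 1, 1, 1, 1],
    "neutral":  [2, 3, 0, 1, 1],
})
vote_columns = ["positive", "negative", "neutral"]

# For each row, look up the vote count of the column named after the given label.
votes = df[vote_columns].to_numpy()
col_pos = np.array([vote_columns.index(c) for c in df["given_original_label"].str.lower()])
votes_for_given = votes[np.arange(len(df)), col_pos]

# Fewer than 3 agreeing workers out of 5 means no majority for the given label.
is_error = votes_for_given < 3
print(is_error)
```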


Next, let's add a new column: the TRUE label (ground truth according to humans).

In [17]:
label_dict = {
    0: 'Positive',
    1: 'Negative',
    2: 'Neutral',
    3: 'Off-topic'
}

mturk_df = mturk_df.assign(
    mturk_iserror=mturk_iserror,
    mturk_argmax_label=[label_dict[l] for l in mturk_df.iloc[:, 4:].values.argmax(axis=1)]
)
mturk_df

Unnamed: 0,id,url,given_original_label,our_guessed_label,positive,negative,neutral,off-topic,mturk_iserror,mturk_argmax_label
0,22360,https://labelerrors.com/static/amazon/22360.txt,Positive,Neutral,3,0,2,0,False,Positive
1,38306,https://labelerrors.com/static/amazon/38306.txt,Positive,Neutral,1,1,3,0,True,Neutral
2,46831,https://labelerrors.com/static/amazon/46831.txt,Positive,Negative,3,1,0,1,False,Positive
3,51608,https://labelerrors.com/static/amazon/51608.txt,Neutral,Positive,2,1,1,1,True,Positive
4,58198,https://labelerrors.com/static/amazon/58198.txt,Neutral,Positive,3,1,1,0,True,Positive
...,...,...,...,...,...,...,...,...,...,...
995,9936139,https://labelerrors.com/static/amazon/9936139.txt,Neutral,Positive,2,1,2,0,True,Positive
996,9938659,https://labelerrors.com/static/amazon/9938659.txt,Negative,Neutral,0,1,4,0,True,Neutral
997,9957864,https://labelerrors.com/static/amazon/9957864.txt,Positive,Negative,2,1,2,0,True,Positive
998,9964543,https://labelerrors.com/static/amazon/9964543.txt,Positive,Neutral,2,1,2,0,True,Positive


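Because 65.8 != 73.2 (see the next cell), the authors evidently did not take the argmax of the MTurk votes: argmax does not handle ties, it simply returns the first column with the maximal count.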
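
A small sketch of the tie problem, using the vote counts of row 997 from the table above (2 positive, 1 negative, 2 neutral, 0 off-topic, given label Positive). `np.argmax` returns the first maximal entry, so the tie is silently resolved in favour of Positive.

```python
import numpy as np

names = ["Positive", "Negative", "Neutral", "Off-topic"]
votes = np.array([2, 1, 2, 0])  # row 997: positive and neutral are tied
given = "Positive"

# argmax picks the first of the tied maxima, here "Positive".
argmax_label = names[int(np.argmax(votes))]
error_by_argmax = argmax_label != given

# The majority rule asks instead whether at least 3 of 5 workers agree.
error_by_majority = bool(votes[names.index(given)] < 3)

print(argmax_label, error_by_argmax, error_by_majority)
```

The two criteria disagree on this row, which is the kind of case that separates the 65.8% and 73.2% figures.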

In [18]:
(mturk_df.given_original_label != mturk_df.mturk_argmax_label).mean()*100

65.8

In [19]:
mturk_df.to_csv("mturk_amazon_results.csv")

In [20]:
mturk_df = pd.read_csv("mturk_amazon_results.csv", index_col=[0])

### Check Confident Learning vs MTurk

In [22]:
cl_amazon_result = results['amazon']
for k, v in cl_amazon_result.items():
    cleanlab_wrong = pd.Series(np.flatnonzero(v))
    mturk_df = pd.concat([mturk_df, pd.DataFrame({'cl_'+k: mturk_df.id.isin(cleanlab_wrong)})], axis=1)
mturk_df

Unnamed: 0,id,url,given_original_label,our_guessed_label,positive,negative,neutral,off-topic,mturk_iserror,mturk_argmax_label,cl_prune_by_class,cl_prune_by_noise_rate,cl_both,cl_confident_learning,cl_predicted_neq_given
0,22360,https://labelerrors.com/static/amazon/22360.txt,Positive,Neutral,3,0,2,0,False,Positive,True,True,True,True,True
1,38306,https://labelerrors.com/static/amazon/38306.txt,Positive,Neutral,1,1,3,0,True,Neutral,True,True,True,True,True
2,46831,https://labelerrors.com/static/amazon/46831.txt,Positive,Negative,3,1,0,1,False,Positive,True,True,True,True,True
3,51608,https://labelerrors.com/static/amazon/51608.txt,Neutral,Positive,2,1,1,1,True,Positive,True,True,True,True,True
4,58198,https://labelerrors.com/static/amazon/58198.txt,Neutral,Positive,3,1,1,0,True,Positive,True,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,9936139,https://labelerrors.com/static/amazon/9936139.txt,Neutral,Positive,2,1,2,0,True,Positive,True,True,True,True,True
996,9938659,https://labelerrors.com/static/amazon/9938659.txt,Negative,Neutral,0,1,4,0,True,Neutral,True,True,True,True,True
997,9957864,https://labelerrors.com/static/amazon/9957864.txt,Positive,Negative,2,1,2,0,True,Positive,True,True,True,True,True
998,9964543,https://labelerrors.com/static/amazon/9964543.txt,Positive,Neutral,2,1,2,0,True,Positive,True,True,True,True,True


**We discover that the examples flagged by methods other than the naive baseline do not fully cover the samples sent to MTurk.**

In [23]:
mturk_df.iloc[:, 10:].mean(axis=0)  # fraction of the 1000 samples sent to MTurk that each method flagged

cl_prune_by_class         0.860
cl_prune_by_noise_rate    0.758
cl_both                   0.719
cl_confident_learning     0.615
cl_predicted_neq_given    1.000
dtype: float64
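
The coverage figures above come from turning a boolean mask into integer positions and testing membership of the MTurk ids. A minimal sketch on made-up data (the mask and ids are assumptions):

```python
import numpy as np
import pandas as pd

# Toy boolean mask over 6 examples, as returned when no ranking is requested.
mask = np.array([True, False, True, False, False, True])
flagged = np.flatnonzero(mask)  # integer positions of flagged examples

# Toy ids of examples that were sent to human reviewers.
mturk_ids = pd.Series([0, 1, 5, 4])

# Which reviewed examples were flagged, and what fraction of them.
covered = mturk_ids.isin(flagged)
coverage = covered.mean()
print(flagged, covered.tolist(), coverage)
```

This only works when the ids and the mask positions refer to the same indexing of the dataset.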

**In what follows, only the naive approach matches the accuracy reported in the paper -- all other methods score lower, close to random accuracy against the human ground truth.**

In [24]:
for i in range(10, 15):
    print(f'Accuracy of {mturk_df.columns[i]}: {((mturk_df.iloc[:, i] == mturk_df.mturk_iserror).mean()*100).round(3)}')

Accuracy of cl_prune_by_class: 67.6
Accuracy of cl_prune_by_noise_rate: 60.2
Accuracy of cl_both: 58.1
Accuracy of cl_confident_learning: 53.3
Accuracy of cl_predicted_neq_given: 73.2
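
One thing to keep in mind when reading the 73.2 figure: `cl_predicted_neq_given` flags every MTurk sample (coverage 1.000 above), and a method that flags everything has an accuracy equal to the fraction of confirmed errors. A toy sketch with assumed values:

```python
import numpy as np

# Toy human verdicts: True where reviewers confirmed a label error.
is_error = np.array([True, True, False, True])

# A method that flags every reviewed example.
flag_all = np.ones_like(is_error)

accuracy = (flag_all == is_error).mean()
base_rate = is_error.mean()
print(accuracy, base_rate)
```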


These results are based on a sample of only 1000 observations out of roughly 500,000 potential errors, so this is clearly anecdotal evidence.

Yet, it indicates that:
1. We can't extrapolate the results from 1k samples to the whole dataset (as the authors did).
2. The process is very fuzzy and random -- a more critical evaluation is needed.

--------------

### Try reproducing IMDb results

In [74]:
with open('label-errors/mturk/imdb_mturk.json') as f_t:
    data_t = json.load(f_t)
print(len(data_t))

data_clean_t = []
for obj in data_t:
    new_obj = {}
    for k, v in obj.items():
        if k == "mturk":
            for kk, vv in v.items():
                new_obj[kk] = vv
        else:
            new_obj[k] = v
    data_clean_t += [new_obj]
    
df_t = pd.DataFrame(data_clean_t)

mturk_iserror_t = []
mturk_true_label_t = []
errors = 0
for i in range(df_t.shape[0]):
    x = df_t.iloc[i, :]
    if x.given < 3:
        mturk_true_label_t.append(x.our_guessed_label if x.guessed > x.neutral else "Neutral")
        errors += 1
        mturk_iserror_t.append(True)
    else:
        mturk_true_label_t.append(x.given_original_label)
        mturk_iserror_t.append(False)
print(errors)
print(100*errors/1310)

1310
725
55.343511450381676


We recover the 55.3% reported in the paper.
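
The labelling rule used in the loop above can be sketched with `np.where`. The toy frame reuses the first five IMDb rows displayed below; this is an illustration of the notebook's rule, not necessarily the authors' procedure.

```python
import numpy as np
import pandas as pd

# First five IMDb MTurk rows (vote counts for the given label, the guessed label, neutral).
df = pd.DataFrame({
    "given_original_label": ["Negative"] * 5,
    "our_guessed_label": ["Positive"] * 5,
    "given":   [2, 0, 1, 2, 4],
    "guessed": [1, 3, 1, 1, 0],
    "neutral": [2, 2, 2, 2, 1],
})

# Error when fewer than 3 of 5 workers agree with the given label.
is_error = df["given"] < 3

# Among errors, take the guessed label only if it beats "neutral"; otherwise "Neutral".
label_if_error = np.where(df["guessed"] > df["neutral"], df["our_guessed_label"], "Neutral")
true_label = np.where(is_error, label_if_error, df["given_original_label"])
print(list(true_label))
```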

In [75]:
df_t

Unnamed: 0,id,url,given_original_label,our_guessed_label,given,guessed,neutral,off-topic
0,test/neg/10003_3,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0
1,test/neg/1003_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,0,3,2,0
2,test/neg/10050_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,1,1,2,1
3,test/neg/10053_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0
4,test/neg/1008_2,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,4,0,1,0
...,...,...,...,...,...,...,...,...
1305,test/pos/9826_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,3,2,0
1306,test/pos/9855_9,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,1,2,1
1307,test/pos/9877_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,2,2,0
1308,test/pos/9910_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,2,3,0


In [76]:
df_t = df_t.assign(
    mturk_iserror=mturk_iserror_t,
    mturk_true_label=mturk_true_label_t
)
df_t

Unnamed: 0,id,url,given_original_label,our_guessed_label,given,guessed,neutral,off-topic,mturk_iserror,mturk_true_label
0,test/neg/10003_3,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0,True,Neutral
1,test/neg/1003_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,0,3,2,0,True,Positive
2,test/neg/10050_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,1,1,2,1,True,Neutral
3,test/neg/10053_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0,True,Neutral
4,test/neg/1008_2,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,4,0,1,0,False,Negative
...,...,...,...,...,...,...,...,...,...,...
1305,test/pos/9826_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,3,2,0,True,Negative
1306,test/pos/9855_9,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,1,2,1,True,Neutral
1307,test/pos/9877_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,2,2,0,True,Neutral
1308,test/pos/9910_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,2,3,0,True,Neutral


In [77]:
df_t.sort_values("id")

Unnamed: 0,id,url,given_original_label,our_guessed_label,given,guessed,neutral,off-topic,mturk_iserror,mturk_true_label
0,test/neg/10003_3,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0,True,Neutral
1,test/neg/1003_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,0,3,2,0,True,Positive
2,test/neg/10050_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,1,1,2,1,True,Neutral
3,test/neg/10053_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0,True,Neutral
4,test/neg/1008_2,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,4,0,1,0,False,Negative
...,...,...,...,...,...,...,...,...,...,...
1305,test/pos/9826_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,3,2,0,True,Negative
1306,test/pos/9855_9,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,1,2,1,True,Neutral
1307,test/pos/9877_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,2,2,0,True,Neutral
1308,test/pos/9910_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,2,3,0,True,Neutral


In [79]:
cl_imdb_result = results['imdb_test_set']
df_ids_t = pd.Series([int(x[2].split("_")[0]) for x in df_t.id.str.split("/")])
for k, v in cl_imdb_result.items():
    cleanlab_wrong = pd.Series(np.flatnonzero(v))
    df_t = pd.concat([df_t, pd.DataFrame({'cl_'+k: df_ids_t.isin(cleanlab_wrong)})], axis=1)
df_t

Unnamed: 0,id,url,given_original_label,our_guessed_label,given,guessed,neutral,off-topic,mturk_iserror,mturk_true_label,cl_prune_by_class,cl_prune_by_noise_rate,cl_both,cl_confident_learning,cl_predicted_neq_given
0,test/neg/10003_3,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0,True,Neutral,False,False,False,False,False
1,test/neg/1003_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,0,3,2,0,True,Positive,False,False,False,False,False
2,test/neg/10050_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,1,1,2,1,True,Neutral,False,False,False,False,False
3,test/neg/10053_4,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,2,1,2,0,True,Neutral,False,False,False,False,False
4,test/neg/1008_2,https://labelerrors.com/static/imdb/test/neg/1...,Negative,Positive,4,0,1,0,False,Negative,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1305,test/pos/9826_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,3,2,0,True,Negative,False,False,False,False,False
1306,test/pos/9855_9,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,1,2,1,True,Neutral,False,False,False,False,False
1307,test/pos/9877_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,1,2,2,0,True,Neutral,False,False,False,False,False
1308,test/pos/9910_8,https://labelerrors.com/static/imdb/test/pos/9...,Positive,Negative,0,2,3,0,True,Neutral,False,False,False,False,False


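#### We can't compare the IMDb results with `cleanlab` v2.0 as done here, because the MTurk example ids do not correspond to the integer indices returned by `cleanlab`.

The MTurk `id` column holds file paths, while `cleanlab` returns positions in the label array (both are shown in the cells below). The number parsed from each file name above is not such a position, so the `cl_*` columns and the figures in the remaining cells are not meaningful. Linking the two would presumably require an index-to-filename mapping such as the `IMDB_INDEX_TO_FILENAME` file loaded earlier, which is not used here.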

In [81]:
cleanlab_wrong

0           0
1           2
2           8
3          19
4          25
        ...  
2601    24965
2602    24977
2603    24980
2604    24981
2605    24990
Length: 2606, dtype: int64

In [82]:
df_t.id

0       test/neg/10003_3
1        test/neg/1003_4
2       test/neg/10050_4
3       test/neg/10053_4
4        test/neg/1008_2
              ...       
1305     test/pos/9826_8
1306     test/pos/9855_9
1307     test/pos/9877_8
1308     test/pos/9910_8
1309     test/pos/9948_7
Name: id, Length: 1310, dtype: object
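
If a mapping from array position to file name were available, the path-style ids could be translated into positions before the comparison. The sketch below is hypothetical: the layout of `imdb_test_set_index_to_filename.json` is not shown in this notebook, so `index_to_filename` is an assumed list in which entry `i` names example `i`, and the flagged positions are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: position in the label array -> file name (assumed layout).
index_to_filename = ["test/neg/0_2", "test/neg/10003_3", "test/pos/9826_8"]
filename_to_index = {name: i for i, name in enumerate(index_to_filename)}

# Path-style ids as they appear in the MTurk table.
mturk_ids = pd.Series(["test/neg/10003_3", "test/pos/9826_8"])

# Translate each path into its array position.
positions = mturk_ids.map(filename_to_index)

# Toy positions flagged by a label-issue filter (assumed).
flagged = np.array([1])
is_flagged = positions.isin(flagged)
print(positions.tolist(), is_flagged.tolist())
```

Any id missing from the mapping would become `NaN` after `map`, which is worth checking before trusting the comparison.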

In [80]:
df_t.iloc[:, 10:].mean(axis=0)

cl_prune_by_class         0.061069
cl_prune_by_noise_rate    0.061069
cl_both                   0.061069
cl_confident_learning     0.045038
cl_predicted_neq_given    0.106870
dtype: float64

In [67]:
for i in range(10, 15):
    print(f'Accuracy of {df_t.columns[i]}: {((df_t.iloc[:, i] == df_t.mturk_iserror).mean()*100).round(3)}')

Accuracy of cl_prune_by_class: 44.58
Accuracy of cl_prune_by_noise_rate: 44.58
Accuracy of cl_both: 44.58
Accuracy of cl_confident_learning: 44.733
Accuracy of cl_predicted_neq_given: 45.725
