# **Goal**: Reproducing some results from the paper [Confident Learning: Estimating Uncertainty in Dataset Labels](https://arxiv.org/abs/1911.00068) (JAIR, 2021)

**Method**: https://github.com/cleanlab/cleanlab

**Data**: Amazon Reviews 5-core (9.9 GB, 2014) http://jmcauley.ucsd.edu/data/amazon/links.html *// comment: a newer version (14.3 GB, 2018) is also available*

**Example**: https://github.com/cleanlab/examples/blob/master/amazon-reviews-fasttext/amazon_pyx.ipynb

**Notes**:
- The JAIR paper only describes the setup (without code):
> To demonstrate that non-deep-learning methods can be effective in finding label issues under the CL framework, we use a multinomial logistic regression classifier for both finding label errors and learning with noisy labels. The built-in SGD optimizer in the open-sourced fastText library (Joulin et al., 2017) is used with settings: initial learning rate = 0.1, embedding dimension = 100, and n-gram = 3. Out-of-sample predicted probabilities are obtained via 5-fold cross-validation. For input during training, a review is represented as the mean of pre-trained, tri-gram, word-level fastText embeddings (Bojanowski et al., 2017).
- [NeurIPS 2021 paper](https://openreview.net/forum?id=XccDXrDNLek) only mentions using fastText. To reproduce the results, the paper shares the predicted probabilities, but does not share the code used to obtain these predictions (https://github.com/cleanlab/label-errors).

## Data preprocessing

The data can be downloaded from http://jmcauley.ucsd.edu/data/amazon/links.html or accessed from https://drive.google.com/uc?id=1W0B5KjjLBRRPBk0M4Pzoh_zFZ72oGtCr using the method described in https://stackoverflow.com/a/50670037.

In [1]:
import json, gzip
import gdown
import pandas as pd
import numpy as np

from tqdm import tqdm

In [2]:
## amazon5core.txt
# url = "https://drive.google.com/uc?id=1W0B5KjjLBRRPBk0M4Pzoh_zFZ72oGtCr"
# gdown.download(url)

In [3]:
## kcore_5_helpful.csv
# url = "https://drive.google.com/uc?id=1-4t7iJXOh-PJnzIxaVWv5Mo_oRoQ2bln"
# gdown.download(url)

In [4]:
def get_json(path):
    # lazily yield one parsed review (dict) per line of the gzipped file
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)
    
    
def get_dataframe(path):
    # obtain a dataset as pandas.DataFrame for exploration
    dict_for_df = {}
    for i, d in enumerate(tqdm(get_json(path))):
         dict_for_df[i] = d
    return pd.DataFrame.from_dict(dict_for_df, orient='index')


def create_dataframe_helpful(path):
    # obtain a dataset as pandas.DataFrame for exploration
    # as defined in the paper, we take only reviews with helpfulness ratio > 0.5
    
    df = {}
    for i, d in enumerate(get_json(path)):
        h = d['helpful']
        if h[0] > h[1] // 2: 
            df[i] = {'rating': d['overall'], 'text': d['reviewText']}
    return pd.DataFrame.from_dict(df, orient='index')


def create_dataset(path_input, path_output, n_rows=None, verbose=True):
    # create a dataset used in the paper
    # - take only reviews with helpfulness ratio > 0.5
    # - and for classes: 1, 3, 5
    # `n_rows` allows to reduce the number of reviews for prototyping
    
    labels = []
    iterator = tqdm(get_json(path_input)) if verbose else get_json(path_input)
    j = 0
    with open(path_output+".txt", "w") as f:
        for i, d in enumerate(iterator):
            h = d["helpful"]
            if h[0] > h[1] // 2:
                label = int(d["overall"])
                if label in [1, 3, 5]:
                    text = d["reviewText"]
                    if len(text) > 0:
                        f.write(
                            "__label__{} {}\n".format(
                                label,
                                text.strip().replace("\n", " __newline__ "),
                            )
                        )
                        labels.append(label)
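                    # NOTE: j counts every review with a label in {1, 3, 5}, including
                    # those skipped above for empty text, so slightly fewer than
                    # n_rows lines may be written
                    j += 1
                    if n_rows and j == n_rows:
                        break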

    label_map = {1: 0, 3: 1, 5: 2}
    labels = [label_map[l] for l in labels]
    with open(path_output+".npy", "wb") as g:
        np.save(g, np.array(labels))
    
    if verbose:
        print(pd.Series(labels).value_counts())

Creating a dataframe of all the reviews for exploration

In [5]:
get_json('data/kcore_5.json.gz')

<generator object get_json at 0x7f9bcf6e9230>

In [6]:
# df = create_dataframe_helpful('data/kcore_5.json.gz')

In [7]:
# df.to_csv("data/kcore_5_helpful.csv")
df = pd.read_csv("data/kcore_5_helpful.csv")

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13406523 entries, 0 to 13406522
Data columns (total 3 columns):
 #   Column      Dtype  
---  ------      -----  
 0   Unnamed: 0  int64  
 1   rating      float64
 2   text        object 
dtypes: float64(1), int64(1), object(1)
memory usage: 306.9+ MB


In [9]:
df.head()

| | Unnamed: 0 | rating | text |
|---|---|---|---|
| 0 | 1 | 5.0 | I bought this for my husband who plays the pia... |
| 1 | 5 | 4.0 | This work bears deep connections to themes fir... |
| 2 | 6 | 5.0 | You may laugh, but I have found that Otherland... |
| 3 | 7 | 5.0 | Do not try and vacuum the dust. That's impossi... |
| 4 | 8 | 5.0 | What if Dread had come out victorious and left... |


Creating the final dataset used in the fastText model

In [11]:
dataset_metadata = {
    "amazon5core_mini": 100_000,
    "amazon5core_medium": 1_000_000,
    "amazon5core": None,  # full dataset (~10M reviews)
}

In [11]:
for file_name, n_rows in dataset_metadata.items():
    create_dataset(path_input='data/kcore_5.json.gz', path_output="data/"+file_name, n_rows=n_rows)

345162it [00:13, 25100.77it/s]


2    77416
1    15120
0     7442
dtype: int64


3834235it [02:23, 26777.52it/s]


2    756758
1    163119
0     79963
dtype: int64


41135700it [21:57, 31229.02it/s]


2    7890518
1    1188544
0     917375
dtype: int64


## Create models with fasttext

- mini (100k samples): `gdown.download("https://drive.google.com/uc?id=1Z-caK6XskdisHH_lWakxxPJBMoSHmUOH")`
- medium (1M samples): `gdown.download("https://drive.google.com/uc?id=1-7HCceJ5cD2AdfrMR1sOGf3zYo_vIA23")`
- full dataset (10M samples): `gdown.download("https://drive.google.com/uc?id=1UkjeSBXkovxlD1zFbRF_idZdpMeOVdpt")`

In [12]:
import fasttext
import time

In [13]:
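def print_results(results):
    # each result is the (N, precision, recall) tuple returned by model.test();
    # note that i is the 0-based position in `results`, not the value of k
    for i, (N, p, r) in enumerate(results):
        print("Precision@{}\t{:.3f}".format(i, p))
        print("Recall@{}\t{:.3f}".format(i, r))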

Train and evaluate (on train set)

In [14]:
times = {'index': [], 'train': [], 'inference': []}

for path_train in list(dataset_metadata):
    
    times['index'].append(path_train)
    
    st = time.time()
    model = fasttext.train_supervised(
        input=f'data/{path_train}.txt', 
        lr=0.1,
        dim=100,
        wordNgrams=3,
        epoch=6,
        thread=2,
        verbose=2
    )
    et = time.time()
    time.sleep(1)
    
    times['train'].append(et - st)
        
    print(f'Exemplary words used in the model: {model.words[0:10]}')
    print(f'Labels, targets used in the model: {model.labels}')
    
    st = time.time()
    result_1 = model.test(f'data/{path_train}.txt', 1)
    et = time.time()
    time.sleep(1)
    
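    # NOTE: despite their names, result_3 and result_5 are evaluated at k=2 and k=3
    result_3 = model.test(f'data/{path_train}.txt', 2)
    result_5 = model.test(f'data/{path_train}.txt', 3)
    
    times['inference'].append(et - st)
    
    # NOTE: result_3 is passed twice and result_5 is never printed, so the "@1" and
    # "@2" rows in the output below are both the k=2 result ("@0" is the k=1 result)
    print_results((result_1, result_3, result_3))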

Read 20M words
Number of words:  690059
Number of labels: 3
Progress: 100.0% words/sec/thread:  662558 lr:  0.000000 avg.loss:  0.365555 ETA:   0h 0m 0s


Exemplary words used in the model: ['the', 'and', 'of', 'to', 'a', 'is', 'in', 'I', 'that', 'this']
Labels, targets used in the model: ['__label__5', '__label__3', '__label__1']
Precision@0	0.906
Recall@0	0.906
Precision@1	0.492
Recall@1	0.983
Precision@2	0.492
Recall@2	0.983


Read 204M words
Number of words:  3759048
Number of labels: 3
Progress: 100.0% words/sec/thread:  677459 lr:  0.000000 avg.loss:  0.181880 ETA:   0h 0m 0s


Exemplary words used in the model: ['the', 'and', 'of', 'to', 'a', 'is', 'in', 'I', 'that', 'this']
Labels, targets used in the model: ['__label__5', '__label__3', '__label__1']
Precision@0	0.980
Recall@0	0.980
Precision@1	0.499
Recall@1	0.999
Precision@2	0.499
Recall@2	0.999
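In the output above, the rows evaluated with k=2 show a precision of roughly half the recall (for example 0.492 vs. 0.983). With exactly one true label per review, this is what one would expect if k predictions are counted per example. The sketch below illustrates the arithmetic on toy data. It is an illustration under that assumption, not fastText's implementation, and `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k):
    # true_labels: one true label per example
    # ranked_predictions: per example, labels sorted by decreasing probability
    n_predicted = 0
    n_correct = 0
    for y, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        n_predicted += len(top_k)    # k predictions are counted per example
        n_correct += int(y in top_k)
    precision = n_correct / n_predicted
    recall = n_correct / len(true_labels)
    return precision, recall


true_labels = [5, 3, 1, 5]
ranked_predictions = [[5, 3, 1], [5, 3, 1], [1, 3, 5], [3, 5, 1]]

p1, r1 = precision_recall_at_k(true_labels, ranked_predictions, k=1)
p2, r2 = precision_recall_at_k(true_labels, ranked_predictions, k=2)
print(p1, r1)  # 0.5 0.5
print(p2, r2)  # 0.5 1.0
```

For single-label data, precision at k equals recall at k divided by k, so precision at k=1 is the more informative number here.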


Time comparison

In [15]:
df_times = pd.DataFrame(times)
df_times

| | index | train (s) | inference (s) |
|---|---|---|---|
| 0 | amazon5core_mini | 98.159207 | 21.145966 |
| 1 | amazon5core_medium | 950.633152 | 204.628172 |


They trained on only 1M samples. Training on the full 10M samples takes about 2 hours; we did this previously but omit it here because it takes too long.

## Clean data with cleanlab

We use the `fasttext` wrapper available in the `cleanlab` package, which automates cross-validation, as we need unbiased (out-of-sample) predicted probabilities for Confident Learning.

We use parameters mentioned in the JAIR paper (slightly different from the `cleanlab` example).
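To make "out-of-sample" concrete: in K-fold cross-validation, each sample's predicted probabilities come from a model that was trained without that sample. The following is a minimal sketch of the idea in plain Python. The class-prior "model" (`fit_prior`) is a hypothetical stand-in for fastText, and the fold assignment is a simple shuffled split, which may differ from what `cleanlab` does internally.

```python
import random
from collections import Counter


def kfold_splits(n_samples, n_folds, seed=0):
    # Shuffle indices once, then deal them into n_folds disjoint folds.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, holdout in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, holdout


def fit_prior(train_labels, n_classes):
    # Toy stand-in model: predicts the class frequencies of its training set.
    counts = Counter(train_labels)
    return [counts[c] / len(train_labels) for c in range(n_classes)]


def out_of_sample_probs(labels, n_classes, n_folds=5, seed=0):
    pred_probs = [None] * len(labels)
    for train, holdout in kfold_splits(len(labels), n_folds, seed):
        model = fit_prior([labels[i] for i in train], n_classes)
        for i in holdout:
            # Sample i was not part of this model's training data.
            pred_probs[i] = model
    return pred_probs


labels = [0, 1, 2, 2, 2, 1, 0, 2, 2, 2]
pred_probs = out_of_sample_probs(labels, n_classes=3, n_folds=5)
print(len(pred_probs))  # 10
```

Every sample ends up with exactly one probability vector, which is the shape of input that Confident Learning expects.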

In [20]:
import cleanlab
from cleanlab.experimental.fasttext import FastTextClassifier, data_loader

In [21]:
cv_n_folds = 5  # Increasing more improves pyx, at great cost.
seed = 0
lr = 0.1
ngram = 3
epochs = 5  # Increasing more doesn't do much.
dim = 100

In [22]:
labels = np.load("data/amazon5core_medium.npy")

In [23]:
ftc = FastTextClassifier(
    train_data_fn="data/amazon5core_medium.txt",
    batch_size=100000,
    labels=[1, 3, 5],
    kwargs_train_supervised={
        "epoch": epochs,
        "thread": 2,
        "lr": lr,
        "wordNgrams": ngram,
        "bucket": 200000,
        "dim": dim,
        "loss": "softmax",  # possible: 'softmax', 'hs'
    },
)

predictions = cleanlab.count.estimate_cv_predicted_probabilities(
    X=np.arange(len(labels)),
    labels=labels,
    clf=ftc, # model
    cv_n_folds=cv_n_folds,
    seed=seed,
)

output_file_name = (
    "data/"
    + "amazon_pyx_cv__folds_{}__epochs_{}__lr_{}__ngram_{}__dim_{}.npy".format(
        cv_n_folds, epochs, lr, ngram, dim
    )
)
with open(output_file_name, "wb") as f:
    np.save(f, predictions)

Read 163M words
Number of words:  3197476
Number of labels: 3
Progress: 100.0% words/sec/thread:  756192 lr:  0.000000 avg.loss:  0.242466 ETA:   0h 0m 0s
Read 164M words
Number of words:  3201514
Number of labels: 3
Progress: 100.0% words/sec/thread:  741324 lr:  0.000000 avg.loss:  0.241752 ETA:   0h 0m 0s

Test the model's performance (on out-of-sample predictions from 5-fold cross-validation)

In [24]:
from sklearn.metrics import accuracy_score

In [25]:
predictions = np.load("data/amazon_pyx_cv__folds_5__epochs_5__lr_0.1__ngram_3__dim_100.npy")
labels = np.load("data/amazon5core_medium.npy")

In [26]:
accuracy_score(labels, np.argmax(predictions, axis=1))

0.8935289646343415
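As a sanity check, the accuracy above can be converted into the number of reviews whose argmax prediction disagrees with the given label. The sketch below only uses numbers printed in this notebook: the accuracy above and the class counts of the medium dataset.

```python
accuracy = 0.8935289646343415

# Class counts of amazon5core_medium, as printed during dataset creation.
n_samples = 756758 + 163119 + 79963

# Reviews whose most probable class differs from the given label.
n_disagreements = round((1 - accuracy) * n_samples)
print(n_samples, n_disagreements)  # 999840 106454
```

This matches the `predicted_neq_given` count in the output of the last cell, which is the baseline that flags every such disagreement as a label issue.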

Find label issues, trying all the available `filter_by` methods

In [27]:
temp = {}
for filter_by in ['prune_by_class', 'prune_by_noise_rate', 'both', 'confident_learning', 'predicted_neq_given']:
    label_error_indices = cleanlab.filter.find_label_issues(
        labels=labels,
        pred_probs=predictions,
        filter_by=filter_by,
        multi_label=False,
        # return_indices_ranked_by='self_confidence', # this only reorders the result, if None then returns boolean mask
    )
    num_errors = np.sum(label_error_indices)
    temp[filter_by] = label_error_indices
    print(f'Estimated number of errors by the method {filter_by}: {num_errors} | {100*np.round(num_errors / len(labels), 3)}%')

Estimated number of errors by the method prune_by_class: 61072 | 6.1%
Estimated number of errors by the method prune_by_noise_rate: 60555 | 6.1%
Estimated number of errors by the method both: 51166 | 5.1%
Estimated number of errors by the method confident_learning: 44667 | 4.5%
Estimated number of errors by the method predicted_neq_given: 106454 | 10.6%
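To give some intuition for why `confident_learning` flags fewer examples than `predicted_neq_given`, here is a simplified sketch of the confident joint described in the Confident Learning paper. Each class gets a threshold equal to the mean predicted probability of that class among examples labeled with it, and an example is only counted towards a class whose threshold it reaches. This is a toy illustration in plain Python under those assumptions, not `cleanlab`'s implementation; `confident_joint` is a hypothetical helper.

```python
def confident_joint(labels, pred_probs):
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean probability of class j among examples labeled j.
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for y, p in zip(labels, pred_probs) if y == j]
        thresholds.append(sum(own) / len(own))

    joint = [[0] * n_classes for _ in range(n_classes)]
    flagged = []
    for idx, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # not confidently any class: not counted at all
        j_star = max(confident, key=lambda j: p[j])
        joint[y][j_star] += 1
        if j_star != y:
            flagged.append(idx)  # off-diagonal entry: likely label issue
    return thresholds, joint, flagged


labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_probs = [
    [0.90, 0.10], [0.80, 0.20], [0.70, 0.30], [0.05, 0.95],
    [0.10, 0.90], [0.20, 0.80], [0.30, 0.70], [0.65, 0.35],
]

thresholds, joint, flagged = confident_joint(labels, pred_probs)
print(joint)    # [[3, 1], [1, 3]]
print(flagged)  # [3, 7]
```

Because a disagreement only counts when the predicted probability clears the class threshold, low-confidence disagreements are not flagged, which is consistent with the smaller count reported above.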
