# Lab — Label Errors

This lab highlights data-centric AI techniques (using [confident learning](https://jair.org/index.php/jair/article/view/12125)) to improve the accuracy of an XGBoost classifier on a noisy dataset that has label errors.

The data-centric AI (DCAI) techniques demonstrated in this lab improve the dataset itself rather than the model's architecture or hyperparameters. Further accuracy gains are therefore possible by tuning the model on the cleaned data, but that is not the focus of this lab.

In this lab, we will:

- Establish a baseline [XGBoost](https://xgboost.readthedocs.io/) model accuracy on the original data
- Automatically find mislabeled data points by:
    - Computing out-of-sample predicted probabilities
    - Estimating the number of label errors using confident learning
    - Ranking errors, using the number of label errors as a cutoff in identifying issues
- Remove the bad data
- Retrain the exact same XGBoost model to see the improvement in test accuracy

## Software installation

This lab relies on a few PyPI packages. If you don't have them installed, run the following cell:

In [None]:
!pip install xgboost==1.7 scikit-learn pandas cleanlab

## Setup and Data Processing

Let's take a look at the dataset used in this lab, a tabular dataset of student grades.

The data includes three exam scores (numerical features), a written note (categorical feature with missing values), and a (noisy) letter grade (categorical label). Our aim is to train a model to classify the grade for each student based on the other features.

In this dataset, 20% of the grade labels in the `noisy_letter_grade` column are incorrect, because synthetic noise was added for the purpose of this lab. We also have access to the true letter grade each student should have received (the `letter_grade` column), which we use to evaluate both the underlying accuracy of model predictions and how well our approach detects mislabeled data. We are careful to use these true grades only for evaluation, never for model training.

In the real world, you don't have access to the true labels (you only observe the `noisy_letter_grade`, not the true `letter_grade`). So when evaluating models in the real world, you have to be careful to make sure that your test set is free of error (using methods like those covered in this lab, ideally combined with human review).

In [1]:
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

df = pd.read_csv("../student-grades.csv")
df.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
0,f48f73,53,77,93,,C,C
1,0bd4e7,81,64,80,great participation +10,B,B
2,e1795d,74,88,97,,B,B
3,cb9d7a,61,94,78,,C,C
4,9acca4,48,90,91,,C,C


In [2]:
df_c = df.copy()
# Transform letter grades and notes to categorical numbers.
# Necessary for XGBoost.
df['letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['letter_grade'])
df['noisy_letter_grade'] = preprocessing.LabelEncoder().fit_transform(df['noisy_letter_grade'])
df['notes'] = preprocessing.LabelEncoder().fit_transform(df["notes"])
df['notes'] = df['notes'].astype('category')
df.head()

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
0,f48f73,53,77,93,5,2,2
1,0bd4e7,81,64,80,2,1,1
2,e1795d,74,88,97,5,1,1
3,cb9d7a,61,94,78,5,2,2
4,9acca4,48,90,91,5,2,2


# Get What We Need

To apply confident learning (the technique explained in today's lecture), we need to obtain [**out-of-sample** predicted probabilities](https://docs.cleanlab.ai/stable/tutorials/pred_probs_cross_val.html#out-of-sample-predicted-probabilities) for all of our data. To do this, we can use K-fold cross validation: for each fold, we will train on some subset of our data and get predictions on the rest of the data that was _not_ used for training.

We need to choose a model in order to do this. For this lab, we'll use [XGBoost](https://xgboost.readthedocs.io/), a library implementing gradient-boosted decision trees, a class of model commonly used for tabular data.

In [3]:
# Prepare training data (remove labels from the dataframe) and labels
data = df.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)
labels = df['noisy_letter_grade']

# XGBoost has (experimental) support for categorical features,
# enabled here via enable_categorical=True.
# We use default hyperparameters otherwise, for simplicity.
model = XGBClassifier(tree_method="hist", enable_categorical=True)

# Exercise 1: getting out-of-sample predicted probabilities

Compute out-of-sample predicted probabilities for every data point. You can do this manually using for loops and multiple invocations of model training and prediction, or you can use scikit-learn's [cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) (if you use this function, take a look at its documentation, in particular the `method=` keyword argument).
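
As a rough sketch of the manual route, the loop below trains a fresh copy of a model on each training split and predicts only on the held-out split. It runs on synthetic toy data with `LogisticRegression` standing in for XGBoost; the helper name `out_of_sample_pred_probs` and the toy arrays are illustrative assumptions, not part of the lab's code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def out_of_sample_pred_probs(model, X, y, n_splits=5):
    """Return an (N, K) array where row i comes from a model that never saw row i."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_classes = len(np.unique(y))
    pred_probs = np.zeros((len(y), n_classes))
    folds = StratifiedKFold(n_splits=n_splits)
    for train_idx, test_idx in folds.split(X, y):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        fold_model.fit(X[train_idx], y[train_idx])
        # Assumes labels are integers 0..K-1 and every class appears in each training split.
        pred_probs[test_idx] = fold_model.predict_proba(X[test_idx])
    return pred_probs


# Synthetic toy data (illustration only): 3 well-separated classes, 2 features.
rng = np.random.default_rng(0)
toy_y = np.repeat([0, 1, 2], 30)
toy_X = rng.normal(size=(90, 2)) + 3.0 * toy_y[:, None]
toy_probs = out_of_sample_pred_probs(LogisticRegression(max_iter=1000), toy_X, toy_y)
print(toy_probs.shape)  # (90, 3)
```

`StratifiedKFold` with 5 splits mirrors what `cross_val_predict` does by default for a classifier. With a pandas DataFrame that has categorical columns (as in this lab), you would index rows with `.iloc` instead of converting to a NumPy array, so the column dtypes survive.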

In [4]:
# pred_probs should be a Nx5 matrix of out-of-sample predicted probabilities, with N = len(data)
pred_probs = cross_val_predict(model, data, labels, method='predict_proba')

## Checking model accuracy on original data

Now that we have out-of-sample predicted probabilities, we can also check the model's (cross-val) accuracy on the original (noisy) data, so we'll have a baseline to compare our final results against.

In [5]:
preds = np.argmax(pred_probs, axis=1)
acc_original = accuracy_score(labels, preds)  # signature is (y_true, y_pred)
print(f"Accuracy with original data: {round(acc_original*100,1)}%")

Accuracy with original data: 67.4%


# Finding label issues automatically

We count label issues using confident learning. First, we need to compute class thresholds for the different classes.

# Exercise 2: computing class thresholds

Implement the Confident Learning algorithm for computing class thresholds for the 5 classes. You can refer to slide 26 from today's lecture or see equation 2 in [this paper](https://jair.org/index.php/jair/article/view/12125).

The class threshold for each class is the model's expected (average) self-confidence for each class. In other words, to compute the threshold for a particular class, you can average the predicted probability for that class, for all datapoints that are labeled with that particular class.
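
Written as a formula, this average looks as follows. The notation is an assumption of this sketch, chosen to follow the paper: $X_{\tilde{y}=j}$ is the set of data points whose given label is $j$, and $\hat{p}(\tilde{y}=j;\, x, \theta)$ is the model's out-of-sample predicted probability of class $j$ for data point $x$.

$$t_j = \frac{1}{|X_{\tilde{y}=j}|} \sum_{x \in X_{\tilde{y}=j}} \hat{p}(\tilde{y}=j;\, x, \theta)$$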

In [6]:
def compute_class_thresholds(pred_probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # this code is written in this style to make it easier to understand the algorithm
    #
    # a more efficient implementation would use numpy vectorized operations and
    # scan over the data only once
    n_examples, n_classes = pred_probs.shape
    thresholds = np.zeros(n_classes)
    for k in range(n_classes):
        count = 0
        p_sum = 0
        for i in range(n_examples):
            if labels[i] == k:
                count += 1
                p_sum += pred_probs[i, k]
        thresholds[k] = p_sum / count
    return thresholds

In [7]:
# should be a numpy array of length 5
thresholds = compute_class_thresholds(pred_probs, labels.to_numpy())

In [8]:
thresholds

array([0.72064708, 0.64154055, 0.67945672, 0.47879468, 0.45895547])
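
The loop-based function above mentions that a vectorized version is possible. One way to sketch it, assuming `labels` is a NumPy array of integers in `0..K-1` and every class appears at least once, is with fancy indexing and `np.bincount`. The function name and the toy arrays are illustrative, not part of the lab's code.

```python
import numpy as np


def compute_class_thresholds_vectorized(pred_probs, labels):
    n_classes = pred_probs.shape[1]
    # Self-confidence: predicted probability of each example's given label.
    self_conf = pred_probs[np.arange(len(labels)), labels]
    # Sum the self-confidences per class, then divide by the per-class counts.
    sums = np.bincount(labels, weights=self_conf, minlength=n_classes)
    counts = np.bincount(labels, minlength=n_classes)
    return sums / counts


toy_pred_probs = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.6, 0.4],
])
toy_labels = np.array([0, 0, 1, 1])
print(compute_class_thresholds_vectorized(toy_pred_probs, toy_labels))  # [0.8 0.6]
```

Class 0's threshold is the mean of 0.9 and 0.7, and class 1's is the mean of 0.8 and 0.4, matching the definition above.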

# Exercise 3: constructing the confident joint

Next, we compute the confident joint, a matrix that counts data points for each pair of noisy (given) label $\tilde{y}$ and estimated true label $y^*$; its off-diagonal entries are the estimated label errors. You can follow the algorithm that we walked through in slide 27 from today's lecture, or see equation 1 in [this paper](https://jair.org/index.php/jair/article/view/12125).

The confident joint C is a K x K matrix (with K = 5 for this dataset), where `C[i][j]` is an estimate of the number of data points with noisy label `i` and true label `j`. From lecture, recall that we put a data point in bin `(i, j)` if its given label is `i`, and its predicted probability for class `j` is above the threshold for class `j` (`thresholds[j]`). Each data point should only go in a single bin; if a data point's predicted probability is above the class threshold for multiple classes, it goes in the bin for which it has the highest predicted probability. A data point that is below the threshold for every class is not counted at all.

In [40]:
def compute_confident_joint(pred_probs: np.ndarray, labels: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    # written using for loops to be understandable
    #
    # this can be more efficiently implemented using numpy vectorized operations
    n_examples, n_classes = pred_probs.shape
    # np.int was deprecated in NumPy 1.20 and removed in 1.24; the builtin int works everywhere
    C = np.zeros((n_classes, n_classes), dtype=int)
    for data_idx in range(n_examples):
        i = labels[data_idx]  # given (noisy) label
        j = None              # most probable class among those above their threshold
        p_j = -1
        for candidate_j in range(n_classes):
            p = pred_probs[data_idx, candidate_j]
            if p >= thresholds[candidate_j] and p > p_j:
                j = candidate_j
                p_j = p
        # data points below every class threshold are not counted
        if j is not None:
            C[i][j] += 1
    return C

In [41]:
C = compute_confident_joint(pred_probs, labels.to_numpy(), thresholds)


In [30]:
C

array([[157,   4,  18,   8,  14],
       [  4, 145,   1,  42,  11],
       [  2,   4, 103,   5,   8],
       [  3,  37,   0,  76,  10],
       [ 19,  16,  16,   8,  70]])
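
The binning rule can also be sketched without explicit loops. The version below is a minimal illustration, assuming `labels` is a NumPy integer array; the function name, toy arrays, and fixed toy thresholds are hypothetical and chosen only to exercise every bin.

```python
import numpy as np


def compute_confident_joint_vectorized(pred_probs, labels, thresholds):
    n_classes = pred_probs.shape[1]
    # Boolean (N, K) mask: which probabilities clear their class threshold.
    above = pred_probs >= thresholds
    # Hide below-threshold probabilities so argmax only considers confident classes.
    masked = np.where(above, pred_probs, -1.0)
    confident_class = masked.argmax(axis=1)
    # Examples below every threshold are left out of the count.
    has_confident_class = above.any(axis=1)
    C = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (i, j) bin repeats.
    np.add.at(C, (labels[has_confident_class], confident_class[has_confident_class]), 1)
    return C


toy_pred_probs = np.array([
    [0.90, 0.10],  # label 0, confident in class 0 -> bin (0, 0)
    [0.70, 0.30],  # label 0, below both thresholds -> not counted
    [0.20, 0.80],  # label 1, confident in class 1 -> bin (1, 1)
    [0.85, 0.15],  # label 1, confident in class 0 -> bin (1, 0)
    [0.30, 0.70],  # label 0, confident in class 1 -> bin (0, 1)
])
toy_labels = np.array([0, 0, 1, 1, 0])
toy_thresholds = np.array([0.8, 0.6])  # fixed by hand for this toy example
toy_C = compute_confident_joint_vectorized(toy_pred_probs, toy_labels, toy_thresholds)
print(toy_C.tolist())  # [[1, 1], [1, 1]]
```

Ties are resolved the same way as in the loop version: `argmax` returns the first maximal class, just as the strict `p > p_j` comparison keeps the first one found.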

# Exercise 4: count the number of label issues

Now that we have the confident joint C, we can count the estimated number of label issues in our dataset. Recall that this is the sum of the off-diagonal entries (the cases where we estimate that a label has been flipped).

In [42]:
num_label_issues = C.sum() - C.trace()

In [43]:
num_label_issues

230

In [14]:
print('Estimated noise rate: {:.1f}%'.format(100*num_label_issues / pred_probs.shape[0]))

Estimated noise rate: 24.4%
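
The numbers above can be double-checked by hand. The snippet below re-enters the confident joint exactly as printed earlier and also breaks the off-diagonal count down by given label; the per-label breakdown is an extra illustration, not something the lab's own cells compute.

```python
import numpy as np

# Confident joint as printed earlier (rows: given label, columns: estimated true label).
C = np.array([
    [157,   4,  18,   8,  14],
    [  4, 145,   1,  42,  11],
    [  2,   4, 103,   5,   8],
    [  3,  37,   0,  76,  10],
    [ 19,  16,  16,   8,  70],
])
n_examples = 941  # dataset size, as stated in the comments of the lab's code cells

num_label_issues = int(C.sum() - C.trace())
issues_per_given_label = C.sum(axis=1) - np.diag(C)

print(num_label_issues)                               # 230
print(issues_per_given_label.tolist())                # [44, 58, 19, 50, 59]
print(f"{100 * num_label_issues / n_examples:.1f}%")  # 24.4%
```

Note that the confident joint sums to 781 rather than 941, because data points below every class threshold are not placed in any bin.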


# Exercise 5: filter out label issues

In this lab, our approach to identifying issues is to rank the data points by a score ("self-confidence", the model's predicted probability for a data point's given label), from lowest to highest, and then take the first `num_label_issues` of those: the data points the model is least confident about.

First, we want to compute the model's _self-confidence_ for each data point. For a data point `i`, that is `pred_probs[i, labels[i]]`.

In [15]:
# this should be a numpy array of length 941 of probabilities
self_confidences = np.array([pred_probs[i, l] for i, l in enumerate(labels)])

Next, we rank the _indices_ of the data points by self-confidence in ascending order, so the least confident (most suspicious) data points come first.

In [16]:
# this should be a numpy array of length 941 of integer indices
ranked_indices = np.argsort(self_confidences)

Finally, let's compute the indices of label issues as the top `num_label_issues` items in the `ranked_indices`.

In [18]:
issue_idx = ranked_indices[:num_label_issues]

Let's look at a couple of the highest-ranked data points (most likely to be label issues):

In [19]:
df_c.iloc[ranked_indices[:5]]

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
637,77c9c5,0,79,65,"cheated on exam, gets 0pts",F,A
264,dacfb9,80,60,80,,C,F
378,e2614a,85,62,75,,C,F
689,b0306d,77,51,70,,D,B
318,347d55,98,51,74,,C,F
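
The three steps of this exercise (self-confidence, ranking, cutoff) can be traced on a tiny hand-made example. The toy arrays and the cutoff of 2 are made up for illustration; the fancy-indexing line is an assumed vectorized alternative to the list comprehension used above.

```python
import numpy as np

toy_pred_probs = np.array([
    [0.90, 0.10],
    [0.70, 0.30],
    [0.20, 0.80],
    [0.85, 0.15],
    [0.30, 0.70],
])
toy_labels = np.array([0, 0, 1, 1, 0])

# For each row i, pick the column given by toy_labels[i].
toy_self_conf = toy_pred_probs[np.arange(len(toy_labels)), toy_labels]
# argsort is ascending, so the least confident examples come first.
toy_ranked = np.argsort(toy_self_conf)
toy_issue_idx = toy_ranked[:2]  # pretend the estimated number of issues is 2

print(toy_self_conf.tolist())  # [0.9, 0.7, 0.8, 0.15, 0.3]
print(toy_issue_idx.tolist())  # [3, 4]
```

Rows 3 and 4 are flagged because the model assigns their given labels only 0.15 and 0.3 probability, respectively.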


# How'd We Do?

Let's go a step further and see how we did at automatically identifying which data points are mislabeled. If we take the intersection of the label errors identified by Confident Learning and the true label errors, we see that our approach was able to identify about 77% of the true label errors (based on predictions from a model that is only 67% accurate).

In [16]:
# Computing percentage of true errors identified. 
true_error_idx = df[df.letter_grade != df.noisy_letter_grade].index.values
cl_acc = len(set(true_error_idx).intersection(set(issue_idx)))/len(true_error_idx)
print(f"Percentage of errors found: {round(cl_acc*100,1)}%")

Percentage of errors found: 76.6%
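
The percentage above is a recall: the share of true errors that were flagged. A complementary number is precision, the share of flagged data points that are truly mislabeled. The helper below is a hypothetical sketch on made-up indices, not part of the lab's code; it works on any two collections of indices.

```python
def detection_precision_recall(true_error_idx, flagged_idx):
    """Compare flagged indices against known-true error indices."""
    true_set, flagged_set = set(true_error_idx), set(flagged_idx)
    hits = len(true_set & flagged_set)
    precision = hits / len(flagged_set)  # flagged points that are real errors
    recall = hits / len(true_set)        # real errors that were flagged
    return precision, recall


# Made-up indices: 4 true errors, 5 flagged, 3 in common.
print(detection_precision_recall([1, 4, 7, 9], [1, 4, 5, 7, 8]))  # (0.6, 0.75)
```

In this lab's setting, the same function could be called with `true_error_idx` and `issue_idx` to see how many of the flagged data points were flagged correctly.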


# Train a More Robust Model

Now that we have the indices of potential label errors within our data, let's remove them from our data, retrain our model, and see what improvement we can gain.

Keep in mind that our baseline model from above, trained on the original data using the `noisy_letter_grade` as the prediction label, achieved a cross-validation accuracy of 67%.

Let's use a very simple method to handle these label errors and just drop them entirely from the data and retrain our exact same `XGBClassifier`. In a real-world application, a better approach might be to have humans review the issues and _correct_ the labels rather than dropping the data points.

In [20]:
# Remove the label errors found by Confident Learning
data = df.drop(issue_idx)
clean_labels = data['noisy_letter_grade']
data = data.drop(['stud_ID', 'letter_grade', 'noisy_letter_grade'], axis=1)

# Train a more robust classifier with less erroneous data
model = XGBClassifier(tree_method="hist", enable_categorical=True)
clean_pred_probs = cross_val_predict(model, data, clean_labels, method='predict_proba')
clean_preds = np.argmax(clean_pred_probs, axis=1)

acc_clean = accuracy_score(clean_labels, clean_preds)  # signature is (y_true, y_pred)
print(f"Accuracy with original data: {round(acc_original*100, 1)}%")
print(f"Accuracy with errors found by Confident Learning removed: {round(acc_clean*100, 1)}%")

# Compute reduction in error.
err = ((1-acc_original)-(1-acc_clean))/(1-acc_original)
print(f"Reduction in error: {round(err*100,1)}%")

Accuracy with original data: 67.4%
Accuracy with errors found by Confident Learning removed: 90.3%
Reduction in error: 70.3%


After removing the suspected label issues, our model's cross-validation accuracy is now 90%, which means we **reduced the error rate of the model by 70%** (the original model had 67% accuracy).
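
For reference, the reduction in error is the drop in error rate relative to the original error rate. Writing $a_{\text{orig}}$ and $a_{\text{clean}}$ for the two accuracies (symbols introduced here only for illustration) and plugging in the rounded values printed above:

$$\frac{(1 - a_{\text{orig}}) - (1 - a_{\text{clean}})}{1 - a_{\text{orig}}} = \frac{a_{\text{clean}} - a_{\text{orig}}}{1 - a_{\text{orig}}} \approx \frac{0.903 - 0.674}{1 - 0.674} \approx 0.70$$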

**Note: throughout this entire process we never changed any code related to model architecture/hyperparameters, training, or data preprocessing! This improvement comes strictly from increasing the quality of our data, which leaves room for further optimizations on the modeling side.**

# Conclusion

For the student grades dataset, we found that simply dropping identified label errors and retraining the model resulted in a 70% reduction in prediction error on our classification problem (with accuracy improving from 67% to 90%).

An implementation of the Confident Learning algorithm (and much more) is available in the [cleanlab](https://github.com/cleanlab/cleanlab) library on GitHub. This is how today's lab assignment can be done in a single line of code with Cleanlab:

In [18]:
import cleanlab

cl_issue_idx = cleanlab.filter.find_label_issues(labels, pred_probs, return_indices_ranked_by='self_confidence')

In [19]:
df_c.iloc[cl_issue_idx[:5]]

Unnamed: 0,stud_ID,exam_1,exam_2,exam_3,notes,letter_grade,noisy_letter_grade
637,77c9c5,0,79,65,"cheated on exam, gets 0pts",F,A
264,dacfb9,80,60,80,,C,F
378,e2614a,85,62,75,,C,F
689,b0306d,77,51,70,,D,B
318,347d55,98,51,74,,C,F


_Advanced topic_: you might notice that the above `cl_issue_idx` differs in length (by a little bit) from our `issue_idx`. The reason for this is that we implemented a slightly simplified version of the algorithm in this lab. We skipped a calibration step, applied after computing the confident joint, that rescales it so that each row sums to the number of examples with that given label (matching the noisy prior $p(\tilde{y})$) and all entries add up to the total number of examples. If you're interested in the details of this, see equation 3 and the subsequent explanation in the [paper](https://jair.org/index.php/jair/article/view/12125).
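
To give a feel for that calibration step, here is a minimal sketch under stated assumptions: it rescales each row of a confident joint to the count of its given label, keeps the result as fractional counts, and assumes no row of `C` is all zeros. The function name and the toy values are hypothetical, and the cleanlab library's own implementation may differ in details such as rounding back to integers.

```python
import numpy as np


def calibrate_confident_joint_sketch(C, labels):
    n_classes = C.shape[0]
    label_counts = np.bincount(labels, minlength=n_classes)  # examples per given label
    row_sums = C.sum(axis=1, keepdims=True)
    # Rescale each row so it sums to the number of examples with that given label.
    calibrated = C / row_sums * label_counts[:, None]
    # Rescale so all entries add up to the dataset size
    # (this changes nothing when the rows already match the label counts).
    return calibrated / calibrated.sum() * len(labels)


toy_C = np.array([[3, 1],
                  [1, 2]])                 # only 7 of the 10 toy examples were binned
toy_labels = np.array([0] * 6 + [1] * 4)   # 6 examples labeled 0, 4 labeled 1
toy_calibrated = calibrate_confident_joint_sketch(toy_C, toy_labels)
print(toy_calibrated.sum(axis=1))
print(toy_calibrated.sum())
```

After calibration the row sums match the label counts (6 and 4) and the whole matrix sums to 10, so the off-diagonal total, and with it the estimated number of label issues, generally differs from the uncalibrated count.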