# Learning with Mislabeled Data

In this notebook, we will explore the [cleanlab](https://github.com/cleanlab/cleanlab) library which provides functions for "finding, quantifying, and learning with label errors in datasets." In particular, we will do the following:

1. Use `get_noise_indices` to detect mislabeled training labels
2. Use the `LearningWithNoisyLabels` wrapper with various scikit-learn compatible models to make predictions despite the mislabeled data.

**Note:** We use the leaked training labels to evaluate some of the cleanlab functions, but we never use them to train a model.

## -- Credits --

This notebook was inspired by the following discussions/notebooks:

* [This discussion](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/285503) about the mislabeled training data and the [accompanying notebook](https://www.kaggle.com/motloch/nov21-mislabeled-25).
* [This notebook](https://www.kaggle.com/criskiev/game-over-or-eda-of-the-leaked-train-csv) where the [original training labels](https://www.kaggle.com/criskiev/november21) were posted. 
* [This notebook](https://www.kaggle.com/kalaikumarr/comparing-22-different-classification-models) which gets baselines for various models.
* [This notebook](https://www.kaggle.com/kaaveland/tps-nov-2021-some-models-that-work-ok) which tests various sklearn classifiers. I used this notebook to pick models (and sometimes parameters) to test with the `LearningWithNoisyLabels` wrapper.
* [This notebook](https://www.kaggle.com/sugamkhetrapal/tps-nov-2021-1-14-xgboost-linear) which uses XGBoost with linear models (rather than trees as usual).

Please check these out (and upvote them!).

In [None]:
# Global variables for testing changes to this notebook quickly
RANDOM_SEED = 0
NUM_FOLDS = 8

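# Install cleanlab (the `cleanlab.pruning` module and `LearningWithNoisyLabels`
# imported below belong to the 1.x API, so pin the version below 2.0)
!pip install -q "cleanlab<2.0"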

In [None]:
# Generic imports
import numpy as np
import pandas as pd
import time
import gc

# Hide warnings
import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
warnings.filterwarnings('ignore')

# Plotting
import matplotlib.pyplot as plt

# cleanlab
import cleanlab
from cleanlab.pruning import get_noise_indices
from cleanlab.classification import LearningWithNoisyLabels

# Preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# Models & Evaluation
from sklearn.metrics import roc_auc_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Models
from sklearn.base import clone
from sklearn.utils.extmath import softmax
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import MultinomialNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier


In [None]:
# Load data
original_train = pd.read_csv('../input/november21/train.csv')
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

# Feature columns
features = [col for col in train.columns if col not in ['id', 'target']]

# Check that the two train.csv are the same (except for the target)
print(train[features].equals(original_train[features]))

# Save space
y_actual = original_train['target'].copy()
del original_train
gc.collect()

# 1. Find Label Errors

In this section we use cleanlab functions to detect which labels are mislabeled. In particular, we do the following:

1. Use logistic regression to estimate train label probabilities (from `predict_proba`)
2. Use `get_noise_indices` to flag the mislabeled examples
3. Compare with the actual mislabeled examples from the leaked training data.
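
The three steps above can be sketched on synthetic data without cleanlab. This is only a simplified stand-in for what `get_noise_indices` does, not cleanlab's actual algorithm: the per-class mean-confidence threshold, the synthetic dataset and every variable name here are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary data with roughly 20% of the labels flipped on purpose
rng = np.random.default_rng(0)
X, y_clean = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, flip_y=0.0, random_state=0,
)
flipped = rng.random(len(y_clean)) < 0.2
y_noisy = np.where(flipped, 1 - y_clean, y_clean)

# Step 1: out-of-fold class probabilities from a model fit on the noisy labels
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y_noisy,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    method="predict_proba",
)

# Step 2 (simplified rule): per-class threshold = mean predicted probability
# of class k among the examples that currently carry label k
rows = np.arange(len(y_noisy))
thresholds = np.array([probs[y_noisy == k, k].mean() for k in (0, 1)])

# Flag an example when the opposite class clears that class's threshold
other = 1 - y_noisy
flagged = probs[rows, other] >= thresholds[other]

# Step 3: compare the flags with the flips we injected
precision = (flagged & flipped).sum() / flagged.sum()
recall = (flagged & flipped).sum() / flipped.sum()
print(f"flagged={flagged.sum()}, precision={precision:.3f}, recall={recall:.3f}")
```

The out-of-fold probabilities matter here: if the model scored rows it was trained on, it could memorize the wrong labels and the flags would be less informative.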

In [None]:
# Flag suspected label errors and flip them; assumes pandas DataFrame/Series inputs
def fix_labels(X_train, y_train, y_actual):
    
    y_train = y_train.reset_index(drop = True)
    y_actual = y_actual.reset_index(drop = True)
    
    # Logistic regression
    pipeline = make_pipeline(
        StandardScaler(),
        LogisticRegression(
            solver = 'saga', 
            random_state = RANDOM_SEED
        ),
    )

    # Label probabilities
    label_prob = cross_val_predict(
        estimator = pipeline,
        X = X_train,
        y = y_train,
        cv = StratifiedKFold(
            n_splits = NUM_FOLDS, 
            shuffle = True, 
            random_state = RANDOM_SEED
        ),
        n_jobs = -1,
        method = "predict_proba",
    )

    # Estimate label errors
    pred_errors = get_noise_indices(
        s = y_train,
        psx = label_prob,
        sorted_index_method='normalized_margin',
     )

    # Indicator vectors for label errors (1 = label error, 0 = label is fine)
    y_pred = pd.Series(0, index = y_train.index)
    y_pred.iloc[pred_errors] = 1
    y_true = (y_train != y_actual).astype(int)

    # Add "fixed" target labels
    fixed = y_train.copy()
    fixed.iloc[pred_errors] = (y_train.iloc[pred_errors] + 1) % 2
    
    return fixed, y_pred, y_true

In [None]:
%%time
pred_labels, pred_errors, true_errors = fix_labels(train[features], train['target'], y_actual)

In [None]:
# Analysis
print("Total Rows:", len(pred_labels))
print("Actual Errors:", true_errors.sum())
print("Estimated Errors:", pred_errors.sum())
print("\nAccuracy:", round(accuracy_score(true_errors, pred_errors), 3))
print("Precision:", round(precision_score(true_errors, pred_errors), 3))
print("Recall:", round(recall_score(true_errors, pred_errors), 3))

# Confusion matrix
cm = confusion_matrix(true_errors, pred_errors)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix")
plt.xlabel("Predicted Errors")
plt.ylabel("Actual Errors")
plt.show()

# 2. Testing Models with Noisy Data

In this section, we use cleanlab's `LearningWithNoisyLabels` wrapper to make predictions from the partially mislabeled data with various scikit-learn compatible models. For each model, we do the following:

1. Get a baseline by training the vanilla model on the ~1/4 mislabeled training data
2. Use `LearningWithNoisyLabels` to wrap the model and train on the same folds.

We check each of the following models:

* Logistic Regression
* Ridge Regression
* Linear Discriminant Analysis
* SGDClassifier
* Naive Bayes Classifier
* Multi-layer Perceptron Classifier
* XGBoost (with linear models)

**Note (1):** The wrapper expects a scikit-learn compatible estimator with `.fit()`, `.predict()` and `.predict_proba()` methods. Not all of these estimators have a `.predict_proba()` method, so we extend them by defining our own (applying softmax to the decision function).
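
As a small illustration of Note (1), the sketch below builds two-column pseudo-probabilities from a decision function in the same way as the extended classifiers later in this notebook. The synthetic data and variable names are assumptions for the example. The outputs are not calibrated probabilities, but AUC only depends on how the examples are ranked, so it is unaffected.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.utils.extmath import softmax

# Small synthetic binary problem (illustrative only)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
clf = RidgeClassifier().fit(X, y)

# Signed score per example; positive values favour class 1
scores = clf.decision_function(X)

# Two-column pseudo-probabilities: softmax over [-score, +score]
proba = softmax(np.c_[-scores, scores])

# Each row sums to 1, and column 1 equals the logistic sigmoid of 2 * score
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
matches_sigmoid = np.allclose(proba[:, 1], expit(2 * scores))

# The transformation is monotonic, so the ranking (and hence AUC) is preserved
auc_scores = roc_auc_score(y, scores)
auc_proba = roc_auc_score(y, proba[:, 1])
print(rows_sum_to_one, matches_sigmoid, np.isclose(auc_scores, auc_proba))
```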

**Note (2):** The wrapper estimates which labels are wrong using its own internal cross-validation (5 folds by default) and then refits on the cleaned data. So instead of training one model per fold, we are training about 5 extra models per fold, and we should expect significantly longer training times.
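
To make the cost in Note (2) concrete, here is a rough sketch of a prune-then-retrain loop that counts calls to `fit`. It is not cleanlab's implementation: `CountingLogit` is a hypothetical helper, and the fixed 0.5 cut-off is an assumed, much cruder pruning rule than the one the wrapper uses.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


class CountingLogit(LogisticRegression):
    """LogisticRegression that counts how often fit() is called (hypothetical helper)."""

    n_fits = 0  # class attribute, so the count is shared by all clones

    def fit(self, X, y, sample_weight=None):
        CountingLogit.n_fits += 1
        return super().fit(X, y, sample_weight=sample_weight)


# Synthetic data in which about 20% of the labels are assigned at random
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.2, random_state=0)
base = CountingLogit(max_iter=1000)

# (a) out-of-fold probabilities: one fit per inner fold
probs = cross_val_predict(base, X, y, cv=5, method="predict_proba")

# (b) keep only examples whose given label looks plausible out-of-fold
self_confidence = probs[np.arange(len(y)), y]
keep = self_confidence >= 0.5

# (c) one final fit on the pruned data
final_model = clone(base).fit(X[keep], y[keep])

print(CountingLogit.n_fits)  # 5 inner fits + 1 final fit = 6
```

Wrapping this inside an outer cross-validation loop multiplies the count by the number of outer folds, which is why the wrapped runs below take noticeably longer than the baselines.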

## Scoring Functions

The following functions accept a scikit-learn compatible model or pipeline with `fit`, `predict` and `predict_proba` methods. Each returns the per-fold AUC scores, the test set predictions (averaged over the folds) and the out-of-fold predictions. The first function trains the vanilla model and the second trains the wrapped model.
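
Before the full functions, here is a stripped-down sketch of the same bookkeeping on synthetic data: every training row receives exactly one out-of-fold prediction, and the test prediction is the average over the fold models. The dataset, the split sizes and all names are assumptions for this example only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the competition's train/test split
X, y = make_classification(
    n_samples=1200, n_features=10, n_clusters_per_class=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=200, random_state=0)

n_folds = 4
oof = np.zeros(len(X_tr))        # one held-out prediction per training row
test_pred = np.zeros(len(X_te))  # running average over the fold models
fold_auc = np.zeros(n_folds)

skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
for fold, (fit_idx, val_idx) in enumerate(skf.split(X_tr, y_tr)):
    # Fresh, unfitted copy of the model for every fold
    model = clone(LogisticRegression(max_iter=1000))
    model.fit(X_tr[fit_idx], y_tr[fit_idx])

    # Each validation row is predicted by the one model that never saw it
    oof[val_idx] = model.predict_proba(X_tr[val_idx])[:, 1]
    test_pred += model.predict_proba(X_te)[:, 1] / n_folds
    fold_auc[fold] = roc_auc_score(y_tr[val_idx], oof[val_idx])

print("Fold AUCs:", np.round(fold_auc, 4))
print("Overall out-of-fold AUC:", round(roc_auc_score(y_tr, oof), 4))
```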

In [None]:
# Scoring/Training Baseline Function
def train_model(sklearn_model):
    
    # Store the holdout predictions
    oof_preds = np.zeros((train.shape[0],))
    test_preds = np.zeros((test.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    times = np.zeros(NUM_FOLDS)
    print('')
    
    # Stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train['target'])):
        
        # Training and Validation Sets
        X_train, y_train = train[features].iloc[train_idx].to_numpy(), train['target'].iloc[train_idx].to_numpy()
        X_valid, y_valid = train[features].iloc[valid_idx].to_numpy(), train['target'].iloc[valid_idx].to_numpy()
        X_test = test[features].to_numpy()  # NumPy array, matching the training inputs
        
        # Create model
        model = clone(sklearn_model)
            
        start = time.time()

        model.fit(X_train, y_train)
        
        end = time.time()
        
        # validation and test predictions
        valid_preds = model.predict_proba(X_valid)[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / NUM_FOLDS
        oof_preds[valid_idx] = valid_preds
        
        # fold auc score (the timer stopped right after fit, so it covers training only)
        fold_auc = roc_auc_score(y_valid, valid_preds)
        print(f'Fold {fold} (AUC): {round(fold_auc, 5)} in {round(end-start,2)}s.')
        scores[fold] = fold_auc
        times[fold] = end-start
        
        time.sleep(0.5)
        
    print("\nAverage AUC:", round(scores.mean(), 5))
    print(f'Training Time: {round(times.sum(), 2)}s')
    
    return scores, test_preds, oof_preds

In [None]:
# Scoring/Training function for LearningWithNoisyLabels
def train_noisy_model(sklearn_model):
    
    # Store the holdout predictions
    oof_preds = np.zeros((train.shape[0],))
    test_preds = np.zeros((test.shape[0],))
    scores = np.zeros(NUM_FOLDS)
    times = np.zeros(NUM_FOLDS)
    print('')
    
    # Stratified k-fold cross-validation
    skf = StratifiedKFold(n_splits = NUM_FOLDS, shuffle = True, random_state = RANDOM_SEED)
    for fold, (train_idx, valid_idx) in enumerate(skf.split(train, train['target'])):
        
        # Training and Validation Sets
        X_train, y_train = train[features].iloc[train_idx].to_numpy(), train['target'].iloc[train_idx].to_numpy()
        X_valid, y_valid = train[features].iloc[valid_idx].to_numpy(), train['target'].iloc[valid_idx].to_numpy()
        X_test = test[features].to_numpy()  # NumPy array, matching the training inputs
        
        # Create model
        model = LearningWithNoisyLabels(
            clf = clone(sklearn_model)
        )
            
        start = time.time()

        model.fit(X_train, y_train)
        
        end = time.time()
        
        # validation and test predictions
        valid_preds = model.predict_proba(X_valid)[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / NUM_FOLDS
        oof_preds[valid_idx] = valid_preds
        
        # fold auc score (the timer stopped right after fit, so it covers training only)
        fold_auc = roc_auc_score(y_valid, valid_preds)
        print(f'Fold {fold} (AUC): {round(fold_auc, 5)} in {round(end-start,2)}s.')
        scores[fold] = fold_auc
        times[fold] = end-start
        
        time.sleep(0.5)
        
    print("\nAverage AUC:", round(scores.mean(), 5))
    print(f'Training Time: {round(times.sum(), 2)}s')
    
    return scores, test_preds, oof_preds

## 2.1 Logistic Regression

In [None]:
# Logistic Regression
logit_pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        solver = 'saga',
        random_state = RANDOM_SEED,
        n_jobs = -1,
    ),
)

In [None]:
# Logistic Regression Baseline
logit_scores, logit_preds, logit_oof = train_model(logit_pipeline)

submission['target'] = logit_preds
submission.to_csv('logit_submission.csv', index=False)

In [None]:
# Logistic Regression w/ Wrapper
noisy_logit_scores, noisy_logit_preds, noisy_logit_oof = train_noisy_model(logit_pipeline)

submission['target'] = noisy_logit_preds
submission.to_csv('noisy_logit_submission.csv', index=False)

## 2.2 Ridge Regression

The wrapper expects an estimator with a `predict_proba` method, which `RidgeClassifier` does not provide, so we create an equivalent by applying softmax to the decision function:

In [None]:
# Class extending Ridge Regression
class ExtendedRidgeClassifier(RidgeClassifier):
    def predict_proba(self, X):
        temp = self.decision_function(X)
        return softmax(np.c_[-temp, temp])
    
# Ridge Regression
ridge_pipeline = make_pipeline(
    StandardScaler(),
    ExtendedRidgeClassifier(random_state = RANDOM_SEED),
)

In [None]:
# Ridge Regression Baseline
ridge_scores, ridge_preds, ridge_oof = train_model(ridge_pipeline)

submission['target'] = ridge_preds
submission.to_csv('ridge_submission.csv', index=False)

In [None]:
# Ridge Regression w/ Wrapper
noisy_ridge_scores, noisy_ridge_preds, noisy_ridge_oof = train_noisy_model(ridge_pipeline)

submission['target'] = noisy_ridge_preds
submission.to_csv('noisy_ridge_submission.csv', index=False)

## 2.3 Linear Discriminant Analysis

In [None]:
# Linear Discriminant Analysis
lda_pipeline = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(),
)

In [None]:
lda_scores, lda_preds, lda_oof = train_model(lda_pipeline)

submission['target'] = lda_preds
submission.to_csv('lda_submission.csv', index=False)

In [None]:
noisy_lda_scores, noisy_lda_preds, noisy_lda_oof = train_noisy_model(lda_pipeline)

submission['target'] = noisy_lda_preds
submission.to_csv('noisy_lda_submission.csv', index=False)

## 2.4 SGDClassifier

We use parameters borrowed from [this notebook](https://www.kaggle.com/kaaveland/tps-nov-2021-some-models-that-work-ok). With hinge loss, `SGDClassifier` does not provide `predict_proba`, so once again we create an equivalent by applying softmax to the decision function:

In [None]:
# Extended SGDClassifier
class ExtendedSGDClassifier(SGDClassifier):
    def predict_proba(self, X):
        temp = self.decision_function(X)
        return softmax(np.c_[-temp, temp])

# SGDClassifier
sgd_pipeline = make_pipeline(
    RobustScaler(), 
    ExtendedSGDClassifier(
        loss='hinge', 
        learning_rate='adaptive', 
        penalty='l2', 
        alpha=1e-3, 
        eta0=0.025,
        random_state = RANDOM_SEED
    )
)

In [None]:
sgd_scores, sgd_preds, sgd_oof = train_model(sgd_pipeline)

submission['target'] = sgd_preds
submission.to_csv('sgd_submission.csv', index=False)

In [None]:
noisy_sgd_scores, noisy_sgd_preds, noisy_sgd_oof = train_noisy_model(sgd_pipeline)

submission['target'] = noisy_sgd_preds
submission.to_csv('noisy_sgd_submission.csv', index=False)

## 2.5 Naive Bayes Classifier

In [None]:
# Naive Bayes Classifier
nb_pipeline = make_pipeline(
    MinMaxScaler(),
    MultinomialNB(),
)

In [None]:
nb_scores, nb_preds, nb_oof = train_model(nb_pipeline)

submission['target'] = nb_preds
submission.to_csv('nb_submission.csv', index=False)

In [None]:
noisy_nb_scores, noisy_nb_preds, noisy_nb_oof = train_noisy_model(nb_pipeline)

submission['target'] = noisy_nb_preds
submission.to_csv('noisy_nb_submission.csv', index=False)

## 2.6 Multi-Layer Perceptron Classifier

In [None]:
# Multi-layer Perceptron Classifier
mlp_pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(128, 64),
        batch_size = 256, 
        early_stopping = True,
        validation_fraction = 0.2,
        n_iter_no_change = 5,
        random_state = RANDOM_SEED
    ),
)

In [None]:
mlp_scores, mlp_preds, mlp_oof = train_model(mlp_pipeline)

submission['target'] = mlp_preds
submission.to_csv('mlp_submission.csv', index=False)

In [None]:
noisy_mlp_scores, noisy_mlp_preds, noisy_mlp_oof = train_noisy_model(mlp_pipeline)

submission['target'] = noisy_mlp_preds
submission.to_csv('noisy_mlp_submission.csv', index=False)

## 2.7 XGBoost with Linear Models

In [None]:
# XGBoost Classifier
xgb_pipeline = make_pipeline(
    StandardScaler(),
    XGBClassifier(
        booster = 'gblinear',
        eval_metric = 'auc',
        random_state = RANDOM_SEED
    ),
)

In [None]:
xgb_scores, xgb_preds, xgb_oof = train_model(xgb_pipeline)

submission['target'] = xgb_preds
submission.to_csv('xgb_submission.csv', index=False)

In [None]:
noisy_xgb_scores, noisy_xgb_preds, noisy_xgb_oof = train_noisy_model(xgb_pipeline)

submission['target'] = noisy_xgb_preds
submission.to_csv('noisy_xgb_submission.csv', index=False)

We see that the `LearningWithNoisyLabels` wrapper doesn't necessarily lead to better model performance. It may be worth exploring further, especially with better parameter tuning, since we mostly used default settings. However, the longer training times may not be worth it.