# Improving LightGBM model with Cleanlab for the *American Express - Default Prediction* competition

This notebook improves the LightGBM model from this [LightGBM Quickstart notebook](https://www.kaggle.com/code/ambrosm/amex-lightgbm-quickstart) using the [cleanlab](https://github.com/cleanlab/cleanlab/) library for data-centric AI. 

`cleanlab` can improve a model by automatically removing data points that it infers to be mislabeled from the model's training set. With fewer than 5 extra lines of code, we obtain roughly a 1% reduction in error (see the table below) without changing any of the existing model, training, or data-processing code.

| Model      | Public Score |
| ----------- | ----------- |
| LightGBM      | 0.785       |
| LightGBM + `cleanlab`   | 0.787         |
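
To make the "1% reduction in error" concrete, here is a small worked calculation. It assumes that "error" means `1 - score`, which is an interpretation on my part, since the competition metric is not a plain accuracy.

```python
# Public leaderboard scores from the table above.
score_base = 0.785
score_cleanlab = 0.787

# Assumption: "error" is taken to be 1 - score.
error_base = 1 - score_base
error_cleanlab = 1 - score_cleanlab

# Relative reduction in error when moving from the base model to cleanlab.
relative_reduction = (error_base - error_cleanlab) / error_base
print(f"Relative error reduction: {relative_reduction:.1%}")
```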

In [None]:
!pip install git+https://github.com/cleanlab/cleanlab.git
# Install latest version of cleanlab code (as of Jun 24, 2022); equivalent to installing from this commit:  !pip install git+https://github.com/cleanlab/cleanlab.git@4bd688f51c6d1135630e53dfeac2a9a223db03f3

Import dependencies and set seeds:

In [None]:
import cleanlab
from cleanlab.classification import CleanLearning
from lightgbm import LGBMClassifier, log_evaluation

import numpy as np
import pandas as pd
from IPython.display import display
import gc
import random

from sklearn.model_selection import StratifiedKFold
from sklearn.calibration import CalibrationDisplay

SEED = 123  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

Define the metric that the LightGBM model will evaluate. The metric code is taken from @yunchonggan's fast metric implementation: https://www.kaggle.com/competitions/amex-default-prediction/discussion/328020


In [None]:
def amex_metric(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of the normalized weighted Gini and the default rate captured at 4%."""

    # count of positives and negatives
    n_pos = y_true.sum()
    n_neg = y_true.shape[0] - n_pos

    # sort by descending prediction values
    indices = np.argsort(y_pred)[::-1]
    preds, target = y_pred[indices], y_true[indices]

    # filter the top 4% by cumulative row weights
    weight = 20.0 - target * 19.0
    cum_norm_weight = (weight / weight.sum()).cumsum()
    four_pct_filter = cum_norm_weight <= 0.04

    # default rate captured at 4%
    d = target[four_pct_filter].sum() / n_pos

    # weighted gini coefficient
    lorentz = (target / n_pos).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()

    # max weighted gini coefficient
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))

    # normalized weighted gini coefficient
    g = gini / gini_max

    return 0.5 * (g + d)


def lgb_amex_metric(y_true, y_pred):
    """The competition metric with lightgbm's calling convention"""
    return ('amex', amex_metric(y_true, y_pred), True)
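
A quick sanity check of the metric on toy data may help build intuition. The sketch below repeats the metric in compact form so that it runs on its own; the toy labels and scores are hypothetical. A ranking that places every defaulter above every non-defaulter should score (numerically) 1.0, and a random ranking should score lower.

```python
import numpy as np

def amex_metric(y_true, y_pred):
    # Compact copy of the metric above so this snippet runs on its own.
    n_pos = y_true.sum()
    n_neg = y_true.shape[0] - n_pos
    indices = np.argsort(y_pred)[::-1]
    target = y_true[indices]
    weight = 20.0 - target * 19.0  # negatives weigh 20, positives weigh 1
    cum_norm_weight = (weight / weight.sum()).cumsum()
    d = target[cum_norm_weight <= 0.04].sum() / n_pos
    lorentz = (target / n_pos).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()
    gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))
    return 0.5 * (gini / gini_max + d)

# Toy data (hypothetical): 10 defaulters followed by 90 non-defaulters.
y_true = np.array([1] * 10 + [0] * 90)

# Perfect ranking: scores decrease, so all defaulters come first.
perfect_scores = np.linspace(1.0, 0.0, num=100)
perfect = amex_metric(y_true, perfect_scores)

# Random ranking for comparison.
rng = np.random.default_rng(0)
random_scores = rng.random(100)
random_result = amex_metric(y_true, random_scores)

print(f"perfect ranking: {perfect:.3f}, random ranking: {random_result:.3f}")
```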

### Data Preprocessing

We just apply the same feature engineering steps as in [the original notebook](https://www.kaggle.com/code/ambrosm/amex-lightgbm-quickstart):

In [None]:
features_avg = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_8', 'B_9', 'B_10', 'B_11', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', 'B_18', 'B_19', 'B_20', 'B_21', 'B_22', 'B_23', 'B_24', 'B_25', 'B_28', 'B_29', 'B_30', 'B_32', 'B_33', 'B_37', 'B_38', 'B_39', 'B_40', 'B_41', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', 'D_48', 'D_50', 'D_51', 'D_53', 'D_54', 'D_55', 'D_58', 'D_59', 'D_60', 'D_61', 'D_62', 'D_65', 'D_66', 'D_69', 'D_70', 'D_71', 'D_72', 'D_73', 'D_74', 'D_75', 'D_76', 'D_77', 'D_78', 'D_80', 'D_82', 'D_84', 'D_86', 'D_91', 'D_92', 'D_94', 'D_96', 'D_103', 'D_104', 'D_108', 'D_112', 'D_113', 'D_114', 'D_115', 'D_117', 'D_118', 'D_119', 'D_120', 'D_121', 'D_122', 'D_123', 'D_124', 'D_125', 'D_126', 'D_128', 'D_129', 'D_131', 'D_132', 'D_133', 'D_134', 'D_135', 'D_136', 'D_140', 'D_141', 'D_142', 'D_144', 'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_2', 'R_3', 'R_7', 'R_8', 'R_9', 'R_10', 'R_11', 'R_14', 'R_15', 'R_16', 'R_17', 'R_20', 'R_21', 'R_22', 'R_24', 'R_26', 'R_27', 'S_3', 'S_5', 'S_6', 'S_7', 'S_9', 'S_11', 'S_12', 'S_13', 'S_15', 'S_16', 'S_18', 'S_22', 'S_23', 'S_25', 'S_26']
features_min = ['B_2', 'B_4', 'B_5', 'B_9', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', 'B_19', 'B_20', 'B_28', 'B_29', 'B_33', 'B_36', 'B_42', 'D_39', 'D_41', 'D_42', 'D_45', 'D_46', 'D_48', 'D_50', 'D_51', 'D_53', 'D_55', 'D_56', 'D_58', 'D_59', 'D_60', 'D_62', 'D_70', 'D_71', 'D_74', 'D_75', 'D_78', 'D_83', 'D_102', 'D_112', 'D_113', 'D_115', 'D_118', 'D_119', 'D_121', 'D_122', 'D_128', 'D_132', 'D_140', 'D_141', 'D_144', 'D_145', 'P_2', 'P_3', 'R_1', 'R_27', 'S_3', 'S_5', 'S_7', 'S_9', 'S_11', 'S_12', 'S_23', 'S_25']
features_max = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_7', 'B_8', 'B_9', 'B_10', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', 'B_18', 'B_19', 'B_21', 'B_23', 'B_24', 'B_25', 'B_29', 'B_30', 'B_33', 'B_37', 'B_38', 'B_39', 'B_40', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', 'D_48', 'D_49', 'D_50', 'D_52', 'D_55', 'D_56', 'D_58', 'D_59', 'D_60', 'D_61', 'D_63', 'D_64', 'D_65', 'D_70', 'D_71', 'D_72', 'D_73', 'D_74', 'D_76', 'D_77', 'D_78', 'D_80', 'D_82', 'D_84', 'D_91', 'D_102', 'D_105', 'D_107', 'D_110', 'D_111', 'D_112', 'D_115', 'D_116', 'D_117', 'D_118', 'D_119', 'D_121', 'D_122', 'D_123', 'D_124', 'D_125', 'D_126', 'D_128', 'D_131', 'D_132', 'D_133', 'D_134', 'D_135', 'D_136', 'D_138', 'D_140', 'D_141', 'D_142', 'D_144', 'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_3', 'R_5', 'R_6', 'R_7', 'R_8', 'R_10', 'R_11', 'R_14', 'R_17', 'R_20', 'R_26', 'R_27', 'S_3', 'S_5', 'S_7', 'S_8', 'S_11', 'S_12', 'S_13', 'S_15', 'S_16', 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27']
features_last = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_7', 'B_8', 'B_9', 'B_10', 'B_11', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', 'B_18', 'B_19', 'B_20', 'B_21', 'B_22', 'B_23', 'B_24', 'B_25', 'B_26', 'B_28', 'B_29', 'B_30', 'B_32', 'B_33', 'B_36', 'B_37', 'B_38', 'B_39', 'B_40', 'B_41', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', 'D_48', 'D_49', 'D_50', 'D_51', 'D_52', 'D_53', 'D_54', 'D_55', 'D_56', 'D_58', 'D_59', 'D_60', 'D_61', 'D_62', 'D_63', 'D_64', 'D_65', 'D_69', 'D_70', 'D_71', 'D_72', 'D_73', 'D_75', 'D_76', 'D_77', 'D_78', 'D_79', 'D_80', 'D_81', 'D_82', 'D_83', 'D_86', 'D_91', 'D_96', 'D_105', 'D_106', 'D_112', 'D_114', 'D_119', 'D_120', 'D_121', 'D_122', 'D_124', 'D_125', 'D_126', 'D_127', 'D_130', 'D_131', 'D_132', 'D_133', 'D_134', 'D_138', 'D_140', 'D_141', 'D_142', 'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_2', 'R_3', 'R_4', 'R_5', 'R_6', 'R_7', 'R_8', 'R_9', 'R_10', 'R_11', 'R_12', 'R_13', 'R_14', 'R_15', 'R_19', 'R_20', 'R_26', 'R_27', 'S_3', 'S_5', 'S_6', 'S_7', 'S_8', 'S_9', 'S_11', 'S_12', 'S_13', 'S_16', 'S_19', 'S_20', 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27']
for i in ['test', 'train']:
    df = pd.read_parquet(f'../input/amex-data-integer-dtypes-parquet-format/{i}.parquet')
    cid = pd.Categorical(df.pop('customer_ID'), ordered=True)
    last = (cid != np.roll(cid, -1)) # mask for last statement of every customer
    if 'target' in df.columns:
        df.drop(columns=['target'], inplace=True)
    gc.collect()
    print('Read', i)
    # Select the needed columns before aggregating to save time and memory
    df_avg = (df
              .groupby(cid)[features_avg]
              .mean()
              .rename(columns={f: f"{f}_avg" for f in features_avg})
             )
    gc.collect()
    print('Computed avg', i)
    df_min = (df
              .groupby(cid)[features_min]
              .min()
              .rename(columns={f: f"{f}_min" for f in features_min})
             )
    gc.collect()
    print('Computed min', i)
    df_max = (df
              .groupby(cid)[features_max]
              .max()
              .rename(columns={f: f"{f}_max" for f in features_max})
             )
    gc.collect()
    print('Computed max', i)
    df = (df.loc[last, features_last]
          .rename(columns={f: f"{f}_last" for f in features_last})
          .set_index(np.asarray(cid[last]))
         )
    gc.collect()
    print('Computed last', i)
    df = pd.concat([df, df_min, df_max, df_avg], axis=1)
    if i == 'train':
        train = df
    else:
        test = df
    print(f"{i} shape: {df.shape}")
    del df, df_avg, df_min, df_max, cid, last

target = pd.read_csv('../input/amex-default-prediction/train_labels.csv').target.values
print(f"target shape: {target.shape}")
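
To illustrate what the preprocessing loop computes, here is a minimal sketch on a toy table with one feature. The toy values are hypothetical, and the sketch assumes (as the loop above does) that each customer's statements are stored contiguously in chronological order, so that the last row of each customer can be found by comparing each ID with the next one.

```python
import numpy as np
import pandas as pd

# Toy statements (hypothetical): rows are grouped by customer, oldest first.
toy = pd.DataFrame({
    "customer_ID": ["a", "a", "b", "c", "c", "c"],
    "P_2": [0.1, 0.3, 0.5, 0.2, 0.4, 0.9],
})

# A row is a customer's last statement if the next row has a different ID.
# np.roll wraps around, so the final row is compared with the first one.
ids = toy["customer_ID"].to_numpy()
last_mask = ids != np.roll(ids, -1)

# One row per customer: last value plus min, max, and mean over all statements.
grouped = toy.groupby("customer_ID")["P_2"]
toy_features = pd.DataFrame({
    "P_2_last": toy.loc[last_mask].set_index("customer_ID")["P_2"],
    "P_2_min": grouped.min(),
    "P_2_max": grouped.max(),
    "P_2_avg": grouped.mean(),
})
print(toy_features)
```

Note that the wrap-around comparison only marks the final row correctly when the first and last customers differ, which holds whenever the data contains more than one customer.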

Next, split the data into training and validation sets. In this notebook, we train the LightGBM model only on the training split below. For a more competitive submission, you may want to retrain it on the merged training + validation data before submitting predictions from the resulting model.

In [None]:
from sklearn.model_selection import train_test_split

features = [f for f in train.columns if f != 'customer_ID' and f != 'target']
print(f"{len(features)} features")

X_test = test[features]
X_tr, X_va, y_tr, y_va = train_test_split(train[features], target, test_size=0.1, random_state=1, shuffle=True, stratify=target)
print(f"Shape of training data: {X_tr.shape}")
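
The `stratify=target` argument keeps the share of defaulters the same in both splits. A minimal sketch on hypothetical toy labels (20% positives) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 20 positives and 80 negatives.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(len(y)).reshape(-1, 1)

# Same split settings as above, applied to the toy data.
X_a, X_b, y_a, y_b = train_test_split(
    X, y, test_size=0.1, random_state=1, shuffle=True, stratify=y
)
print(f"positive rate: train={y_a.mean():.2f}, validation={y_b.mean():.2f}")
```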

### Training models 

Construct a basic LightGBM model (that logs the validation score every 20 boosting rounds):

In [None]:
def my_booster(random_state=SEED, n_estimators=200):
    return LGBMClassifier(random_state=random_state, n_estimators=n_estimators)

lgbm_kwargs = {
    'eval_set': [(X_va, y_va)],
    'eval_metric': [lgb_amex_metric],
    'callbacks': [log_evaluation(20)],
}


Fit and predict with base LightGBM model:

In [None]:
model = my_booster()
model.fit(X_tr, y_tr, **lgbm_kwargs)

y_va_pred_og = model.predict_proba(X_va, raw_score=True)
score_og = amex_metric(y_va, y_va_pred_og)

n_trees = model.best_iteration_
if not n_trees:  # None (or 0) when early stopping was not used
    n_trees = model.n_estimators
    
y_test_pred_og = model.predict_proba(X_test, raw_score=True)
print(f"Base LightGBM model trained with {n_trees} trees.")
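
The predictions above are raw scores (`raw_score=True`) rather than probabilities. This is harmless here because `amex_metric` only uses the ordering of the predictions (through `np.argsort`). The sketch below assumes LightGBM's default binary objective, where the raw score is the log-odds, and uses hypothetical score values to show that mapping them through the sigmoid leaves the ordering unchanged.

```python
import numpy as np

# Hypothetical raw scores (log-odds), as returned with raw_score=True.
raw = np.array([-2.0, 0.5, 3.0, -0.1, 1.2])

# The logistic (sigmoid) function maps log-odds to probabilities.
prob = 1.0 / (1.0 + np.exp(-raw))

# Sorting by either quantity produces the same ordering of examples.
same_order = np.array_equal(np.argsort(raw), np.argsort(prob))
print(prob.round(3), same_order)
```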

Add `cleanlab` to produce an improved version of the same model:

In [None]:
model = my_booster()  # could be any classification model, not just LightGBM
cl = CleanLearning(clf=model, verbose=True,
                   find_label_issues_kwargs={"frac_noise": 0.2})

cl.fit(X_tr, y_tr, clf_kwargs=lgbm_kwargs, sample_weight=np.ones(len(y_tr)))

y_va_pred_cl = cl.predict_proba(X_va, raw_score=True)
score_cl = amex_metric(y_va, y_va_pred_cl)

n_trees_cl = cl.clf.best_iteration_  # Note we use cl.clf to access some attributes of base model
if not n_trees_cl:  # None (or 0) when early stopping was not used
    n_trees_cl = cl.clf.n_estimators

y_test_pred_cl = cl.predict_proba(X_test, raw_score=True)
print(f"Cleanlab version of LightGBM model trained with {n_trees_cl} trees.")
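
To give a feel for what the `frac_noise` setting does, here is a rough sketch in plain NumPy. It is a deliberately simplified stand-in, not cleanlab's actual algorithm: it flags examples whose most likely class disagrees with the given label, ranks them by the predicted probability of the given label, and keeps only the top `frac_noise` fraction for removal. That last step assumes `frac_noise` behaves as cleanlab's documentation describes it (only the top fraction of estimated issues is returned). The function name `prune_indices` and the toy probabilities are hypothetical.

```python
import numpy as np

def prune_indices(labels, pred_probs, frac_noise=1.0):
    """Simplified stand-in for label-issue filtering (not cleanlab's algorithm)."""
    # Self-confidence: predicted probability of the given label.
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    # Candidate issues: the most likely class disagrees with the given label.
    candidates = np.flatnonzero(pred_probs.argmax(axis=1) != labels)
    # Rank candidates from least to most confident in the given label.
    ranked = candidates[np.argsort(self_confidence[candidates])]
    # Keep only the top frac_noise fraction (truncated to a whole number).
    n_remove = int(frac_noise * len(ranked))
    return ranked[:n_remove]

# Hypothetical out-of-sample predicted probabilities: 6 examples, 2 classes.
labels = np.array([0, 0, 1, 1, 0, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],  # labeled 0, but the model prefers class 1
    [0.30, 0.70],
    [0.95, 0.05],  # labeled 1, but the model strongly prefers class 0
    [0.60, 0.40],
    [0.55, 0.45],  # labeled 1, but the model mildly prefers class 0
])

all_issues = prune_indices(labels, pred_probs, frac_noise=1.0)
top_half = prune_indices(labels, pred_probs, frac_noise=0.5)
print(all_issues, top_half)
```

A smaller `frac_noise` is the more conservative choice: fewer examples are dropped, and only those the model is most sure are mislabeled.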

### Generating submission

In [None]:
predictions = y_test_pred_cl  # change to y_test_pred_og for predictions of base model
sub = pd.DataFrame({'customer_ID': test.index, 'prediction': predictions})
sub.to_csv('submission.csv', index=False)
display(sub)

## Final Notes

The above LightGBM + `cleanlab` model was trained with default LightGBM hyperparameters. The `cleanlab` parameters can also be tuned to further improve overall performance. In practice, I find you can get much better results by manually inspecting the top issues `cleanlab` has identified, rather than automatically removing that data as `CleanLearning` does.

`cleanlab` may be especially useful for gradient boosting models (like LightGBM, XGBoost, or CatBoost), which are particularly sensitive to noisy training data: https://cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/node14.html

While this notebook used a LightGBM model, `cleanlab` can be used with any classifier. Feel free to experiment with other models and let me know if you see an improvement!
