# Classification with Tabular Data using H2OAutoML and Cleanlab


This notebook is based on the following two tutorial notebooks.

- [cleanlab/docs/source/tutorials/tabular.ipynb](https://github.com/cleanlab/cleanlab/blob/0dc384a4edfba31500e672b15026b781ea952f91/docs/source/tutorials/tabular.ipynb)
- [h2o-tutorials/tutorials/sklearn-integration/H2OAutoML_as_sklearn_estimator.ipynb](https://github.com/h2oai/h2o-tutorials/blob/7c8fca34b2bf26870be71232ade52472a087f0ad/tutorials/sklearn-integration/H2OAutoML_as_sklearn_estimator.ipynb)


In this tutorial, we will use `cleanlab` with `H2OAutoML` models to find potential label errors in the German Credit dataset. This dataset contains 1,000 individuals described by 20 features, each labeled as either "good" or "bad" credit risk. `cleanlab` automatically shortlists examples from this dataset that confuse our ML model. Many of these are potential label errors (due to annotator mistakes), edge cases, or otherwise ambiguous examples.

**Overview of what we'll do in this tutorial:**

- Build a simple credit risk classifier with `H2OAutoML`.

- Use this classifier to compute out-of-sample predicted probabilities, `pred_probs`, via cross-validation.

- Identify potential label errors in the data with `cleanlab`'s `find_label_issues` method.

- Train a robust version of the same `H2OAutoML` model via `cleanlab`'s `CleanLearning` wrapper.

**Data:** https://www.openml.org/d/31


## **1. Install required dependencies**


You can use `conda` to install all packages required for this tutorial as follows:

```
!conda env update -n cleanlab-h2oautoml -f ./conda-env.yml
```


In [1]:
import random
import numpy as np

SEED = 123456

np.random.seed(SEED)
random.seed(SEED)

## **2. Load and process the data**


We first load the data features and labels.


In [2]:
from sklearn.datasets import fetch_openml

data = fetch_openml("credit-g")  # get the credit data from OpenML
X_raw = data.data  # features (pandas DataFrame)
y_raw = data.target  # labels (pandas Series)

Next we preprocess the data. We one-hot encode the categorical features and standardize the numeric features. We also encode the labels as integers: "bad" becomes 0 and "good" becomes 1.


In [3]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

cat_features = X_raw.select_dtypes("category").columns
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)

num_features = X_raw.select_dtypes("float64").columns
scaler = StandardScaler()
X_scaled = X_encoded.copy()
X_scaled[num_features] = scaler.fit_transform(X_encoded[num_features])
X_scaled = X_scaled.to_numpy()

y = y_raw.map({"bad": 0, "good": 1})  # encode labels as integers
y = y.to_numpy()

<div class="alert alert-info">
Bringing Your Own Data (BYOD)?

You can easily replace the above with your own tabular dataset, and continue with the rest of the tutorial.

</div>
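
As a rough illustration of bringing your own data, the sketch below applies the same three preprocessing steps (one-hot encoding, standardization, label encoding) to a tiny hand-made table. The column names `income`, `housing`, and `risk`, as well as the variables `X_own` and `y_own`, are hypothetical and do not come from the German Credit dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table standing in for your own dataset.
df = pd.DataFrame(
    {
        "income": [30.0, 45.0, 60.0, 75.0],
        "housing": pd.Categorical(["rent", "own", "own", "rent"]),
        "risk": ["bad", "good", "good", "bad"],
    }
)

features = df.drop(columns="risk")

# One-hot encode the categorical columns, dropping the first level of each.
cat_cols = features.select_dtypes("category").columns
encoded = pd.get_dummies(features, columns=cat_cols, drop_first=True)

# Standardize the numeric columns to zero mean and unit variance.
num_cols = features.select_dtypes("float64").columns
encoded[num_cols] = StandardScaler().fit_transform(encoded[num_cols])

# Convert to plain arrays, as the rest of the tutorial expects.
X_own = encoded.to_numpy(dtype=float)
y_own = df["risk"].map({"bad": 0, "good": 1}).to_numpy()
```

If your numeric columns are stored as integers rather than `float64`, the dtype passed to `select_dtypes` would need to be adjusted accordingly.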


## **3. Select a classification model and compute out-of-sample predicted probabilities**


Here we use `H2OAutoML`, but you can choose _any_ suitable scikit-learn model for this tutorial.


To identify label issues, `cleanlab` requires a probabilistic prediction from your model for every datapoint. However, these predictions will be _overfitted_ (and thus unreliable) for examples the model was previously trained on. `cleanlab` is intended to be used only with **out-of-sample** predicted probabilities, i.e., on examples held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. An additional benefit of cross-validation is that it provides a more reliable evaluation of our model than a single training/validation split.

Below we use `H2OAutoMLClassifier`, the scikit-learn-compatible interface to `H2OAutoML`, to compute predicted probabilities for every datapoint:


In [None]:
from h2o.sklearn import H2OAutoMLClassifier

In [4]:
def getH2O(keep_cross_validation_predictions=True):
    return H2OAutoMLClassifier(
        keep_cross_validation_predictions=keep_cross_validation_predictions,
        max_runtime_secs=30,
        sort_metric="aucpr",
        nfolds=3,
        verbosity="error",
    )

In [5]:
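clf = getH2O()
clf.fit(X_scaled, y)
# Caution: this scores the same rows the model was fitted on, so these
# probabilities are in-sample rather than strictly out-of-sample.
pred_probs = clf.predict_proba(X_scaled)
pred_probs.shape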

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| | |
|---|---|
| H2O_cluster_uptime: | 27 mins 13 secs |
| H2O_cluster_timezone: | Asia/Tokyo |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.36.0.4 |
| H2O_cluster_version_age: | 13 days |
| H2O_cluster_name: | H2O_from_python_kon_gf7hh5 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 7.221 Gb |
| H2O_cluster_total_cores: | 16 |
| H2O_cluster_allowed_cores: | 16 |




(1000, 2)
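
The K-fold procedure described above can be sketched with scikit-learn's `cross_val_predict`, which returns, for every row, probabilities from a model that never saw that row during fitting. This is a minimal sketch under assumptions: it uses synthetic data and a `LogisticRegression` stand-in so that it runs quickly, and the names `X_demo`, `y_demo`, `model`, and `oos_pred_probs` are illustrative only. In principle a scikit-learn-compatible estimator such as `H2OAutoMLClassifier` could be passed instead, at the cost of one full AutoML run per fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data; in the tutorial these would be X_scaled and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Stand-in classifier; any estimator exposing predict_proba can be used.
model = LogisticRegression(max_iter=1000)

# Each row's probabilities come from a model fitted on the other folds only.
oos_pred_probs = cross_val_predict(
    model, X_demo, y_demo, cv=5, method="predict_proba"
)
print(oos_pred_probs.shape)
```

The resulting array has one row per example and one column per class, which is the shape `cleanlab` expects for `pred_probs`.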

## **4. Use cleanlab to find label issues**


Based on the given labels and out-of-sample predicted probabilities, `cleanlab` can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by `cleanlab`'s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model's prediction.


In [6]:
from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=y, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)

print(f"Cleanlab found {len(ranked_label_issues)} potential label errors.")
ranked_label_issues

Cleanlab found 122 potential label errors.


array([505, 757, 190,  92, 228,  80, 754, 409, 796, 780, 949, 598, 412,
       435, 272, 963, 614, 351, 137, 846, 302,  56, 647, 278, 213, 861,
       335, 642, 763, 864, 589, 674, 700, 735, 936, 621, 559, 543, 331,
       580, 900, 175, 980, 720, 357, 557, 424, 457, 349, 834, 978, 876,
       552, 229, 966, 818, 611, 285, 474, 249, 650, 945, 203, 485, 268,
       188, 195, 645, 812, 827, 951, 815, 208, 141, 866,  17, 623, 395,
       221, 340, 145, 896, 704, 392, 605, 986, 743, 808, 658, 934,  14,
       687, 438, 481, 869, 740, 615, 111, 367, 530, 746, 573, 338, 667,
       741,  79, 993,  31, 218, 890, 829, 309, 417, 616, 513, 926, 201,
       703, 666, 985, 462, 441])
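
To make the self-confidence score concrete, here is a small sketch on toy values (the arrays below are hypothetical, not taken from the credit data). It computes the predicted probability of each example's given label and sorts ascending, so the most suspicious labels come first. Note that `find_label_issues` first decides which examples count as issues and only then ranks them; this sketch shows the ranking score alone.

```python
import numpy as np

# Toy values (hypothetical): 4 examples, 2 classes.
toy_labels = np.array([0, 1, 1, 0])
toy_pred_probs = np.array(
    [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.7, 0.3],  # labeled 1, but the model favors class 0
        [0.4, 0.6],  # labeled 0, but the model favors class 1
    ]
)

# Self-confidence: predicted probability of each example's given label.
self_confidence = toy_pred_probs[np.arange(len(toy_labels)), toy_labels]

# Lower self-confidence means a more suspicious label, so sort ascending.
ranking = np.argsort(self_confidence)
print(self_confidence, ranking)
```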

Let's review some of the most likely label errors:


In [7]:
X_raw.iloc[ranked_label_issues].assign(label=y_raw.iloc[ranked_label_issues]).head()

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 505 | no checking | 10.0 | existing paid | new car | 1309.0 | no known savings | 1<=X<4 | 4.0 | male single | guarantor | ... | life insurance | 27.0 | none | own | 1.0 | unskilled resident | 1.0 | none | yes | bad |
| 757 | >=200 | 15.0 | critical/other existing credit | radio/tv | 1271.0 | no known savings | 1<=X<4 | 3.0 | male single | none | ... | no known property | 39.0 | none | for free | 2.0 | skilled | 1.0 | yes | yes | bad |
| 190 | no checking | 24.0 | existing paid | business | 4591.0 | >=1000 | 1<=X<4 | 2.0 | male single | none | ... | life insurance | 54.0 | none | own | 3.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
| 92 | no checking | 12.0 | critical/other existing credit | radio/tv | 797.0 | no known savings | >=7 | 4.0 | female div/dep/mar | none | ... | life insurance | 33.0 | bank | own | 1.0 | unskilled resident | 2.0 | none | yes | bad |
| 228 | no checking | 9.0 | existing paid | radio/tv | 1478.0 | <100 | 4<=X<7 | 4.0 | male single | none | ... | car | 22.0 | none | own | 1.0 | skilled | 1.0 | none | yes | bad |


These examples appear the most suspicious to our model and should be carefully re-examined. Perhaps the original annotators missed something when deciding on the labels for these individuals.
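
When re-examining flagged examples, it can help to see the given label next to the class the model considers most likely. The sketch below builds such a review table from toy values; the arrays `given`, `probs`, and `flagged` are hypothetical stand-ins for `y`, `pred_probs`, and `ranked_label_issues`, and the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical) standing in for y, pred_probs and ranked_label_issues.
given = np.array([0, 1, 1, 0, 1])
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6], [0.1, 0.9]])
flagged = np.array([2, 3])  # indices of flagged examples, most suspicious first

class_names = np.array(["bad", "good"])  # index 0 is "bad", index 1 is "good"
review = pd.DataFrame(
    {
        "given_label": class_names[given[flagged]],
        "suggested_label": class_names[probs[flagged].argmax(axis=1)],
        "given_label_prob": probs[flagged, given[flagged]],
    },
    index=flagged,
)
print(review)
```

The suggested label is only the model's opinion; a human reviewer should make the final call on whether the given label is wrong.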


## **5. Train a more robust model from noisy labels**


Following proper ML practice, let's split our data into train and test sets.


In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.25, random_state=SEED
)

We again standardize the numeric features, this time fitting the scaling parameters solely on the training set.


In [9]:
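scaler = StandardScaler()
# Work on copies so the assignments below do not write into views of X_encoded.
X_train, X_test = X_train.copy(), X_test.copy()
X_train[num_features] = scaler.fit_transform(X_train[num_features])  # fit on train only
X_test[num_features] = scaler.transform(X_test[num_features])  # reuse train statistics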

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
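
The effect of fitting the scaler on the training set alone can be seen in a small sketch with a single synthetic feature (the values and the names `values`, `demo_scaler`, `train_scaled`, and `test_scaled` are hypothetical). The training column ends up exactly standardized, while the test column is only approximately so, because its own statistics are never used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=10.0, size=(200, 1))  # hypothetical feature

train_vals, test_vals = train_test_split(values, test_size=0.25, random_state=0)

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_vals)  # statistics from train only
test_scaled = demo_scaler.transform(test_vals)  # reuses the train statistics

# Train mean/std are 0 and 1 up to rounding; test mean/std are merely close.
print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

Keeping the test set out of the scaler's statistics means the test set stays a fair proxy for unseen data.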

Let's now train and evaluate the original `H2OAutoML` model.


In [None]:
from sklearn.metrics import accuracy_score

In [10]:
clf = getH2O()
clf.fit(X_train, y_train)
acc_og = clf.score(X_test, y_test)

H2O session _sid_a440 closed.
Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| | |
|---|---|
| H2O_cluster_uptime: | 27 mins 45 secs |
| H2O_cluster_timezone: | Asia/Tokyo |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.36.0.4 |
| H2O_cluster_version_age: | 13 days |
| H2O_cluster_name: | H2O_from_python_kon_gf7hh5 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 7.203 Gb |
| H2O_cluster_total_cores: | 16 |
| H2O_cluster_allowed_cores: | 16 |




In [11]:
print(f"Test accuracy of original H2OAutoML: {acc_og}")

Test accuracy of original H2OAutoML: 0.732


`cleanlab` provides a wrapper class that can be easily applied to any scikit-learn compatible model. Once wrapped, the resulting model can still be used in the exact same manner, but it will now train more robustly if the data have noisy labels.


In [None]:
from cleanlab.classification import CleanLearning

In [14]:
clf = getH2O(keep_cross_validation_predictions=False)
cl = CleanLearning(clf)  # cl has same methods/attributes as clf

The following operations take place when we train the `cleanlab`-wrapped model: The original model is trained in a cross-validated fashion to produce out-of-sample predicted probabilities. Then, these predicted probabilities are used to identify label issues, which are then removed from the dataset. Finally, the original model is trained on the remaining clean subset of the data once more.


In [15]:
cl.fit(X_train, y_train)

Computing out of sample predicted probabilites via 5-fold cross validation. May take a while ...

H2OAutoMLClassifier(max_runtime_secs=30, nfolds=3, sort_metric='aucpr',
                    verbosity='error')
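
The three stages described above (cross-validated probabilities, filtering, refitting) can be sketched by hand. This is a simplified illustration under stated assumptions, not `cleanlab`'s actual implementation: it uses synthetic data with deliberately flipped labels, a `LogisticRegression` stand-in for the wrapped model, and a simplified per-class threshold rule for flagging issues. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data with 10% of the labels flipped on purpose.
X_demo, y_true = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y_true), size=30, replace=False)
y_noisy = y_true.copy()
y_noisy[flip] = 1 - y_noisy[flip]

model = LogisticRegression(max_iter=1000)  # stand-in for the wrapped model

# Step 1: out-of-sample predicted probabilities via cross-validation.
probs = cross_val_predict(model, X_demo, y_noisy, cv=5, method="predict_proba")

# Step 2 (simplified rule, not cleanlab's exact one): flag an example when the
# model prefers another class and is at least as confident in that class as it
# is, on average, for examples that carry that class as their given label.
thresholds = np.array([probs[y_noisy == k, k].mean() for k in range(2)])
predicted = probs.argmax(axis=1)
is_issue = (predicted != y_noisy) & (probs.max(axis=1) >= thresholds[predicted])

# Step 3: refit on the examples that were not flagged.
model.fit(X_demo[~is_issue], y_noisy[~is_issue])
print(f"Flagged {is_issue.sum()} of {len(y_noisy)} examples.")
```

`CleanLearning` bundles these stages behind a single `fit` call, which is why the wrapped model can be used like the original one.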

We can get predictions from the resulting model and evaluate them, just as we did for the original `H2OAutoML` model.


In [16]:
preds = cl.predict(X_test)
acc_cl = accuracy_score(y_test, preds)
print(f"Test accuracy of cleanlab's H2OAutoML: {acc_cl}")

Test accuracy of cleanlab's H2OAutoML: 0.752


We can see that the test set accuracy improved slightly, from 0.732 to 0.752, as a result of the data cleaning. Note that this will not always be the case, especially when we evaluate on test data that are themselves noisy. The best practice is to run `cleanlab` to identify potential label issues and then manually review them, rather than blindly trusting any accuracy metric. In particular, the most effort should go into ensuring high-quality test data, since the test set is supposed to reflect the expected performance of our model during deployment.
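
A toy calculation shows how label noise in the test set can distort an accuracy figure. The arrays below are hypothetical: the same predictions are scored once against the true labels and once against a copy of those labels containing two annotation errors.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy values (hypothetical): ten test examples.
true_labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
predictions = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])  # 8 of 10 correct

# The same test set with two annotation errors (positions 0 and 5).
noisy_labels = true_labels.copy()
noisy_labels[[0, 5]] = 1 - noisy_labels[[0, 5]]

acc_true = accuracy_score(true_labels, predictions)  # 0.8
acc_noisy = accuracy_score(noisy_labels, predictions)  # 0.6
print(acc_true, acc_noisy)
```

Here the model is penalized for two predictions that are actually correct, so the measured accuracy understates its real performance. Depending on where the errors fall, noisy test labels can equally overstate it.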


In [17]:
# Hidden code cell to check that cleanlab has improved prediction accuracy
print(f"Test accuracy of original vs cleanlab's H2OAutoML: {acc_og} vs {acc_cl}")
if acc_og >= acc_cl:
    raise Exception("Cleanlab training failed to improve model accuracy.")

Test accuracy of original vs cleanlab's H2OAutoML: 0.732 vs 0.752
