# Structured Tabular Data Classification with scikit-learn

In this tutorial, we will use Cleanlab to find potential label errors in the German Credit dataset. This dataset contains 1,000 examples with 20 features; each example is labeled as either a "good" or a "bad" credit risk. Cleanlab will shortlist *well over a hundred* examples that confuse our ML model the most, many of which are potential label errors, edge cases, or otherwise ambiguous examples.

**Overview of what we'll do in this tutorial:**

- Build a simple classifier with scikit-learn's logistic regression.

- Compute the out-of-sample predicted probabilities, `pyx`, with cross validation.

- Generate a list of potential label errors with Cleanlab's `get_noise_indices`.

- Build and train a robust model with Cleanlab's `LearningWithNoisyLabels` wrapper.

**Data:** https://www.openml.org/d/31

## **1. Install the required dependencies**

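``%%capture`` is a cell magic that hides the cell's output. The package is installed as ``scikit-learn`` (not ``sklearn``), and the ``get_noise_indices`` and ``LearningWithNoisyLabels`` APIs used below belong to the Cleanlab 1.x series, so the install is pinned accordingly.

In [1]:
%%capture

%pip install "cleanlab<2" scikit-learn pandas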

## **2. Load the German Credit dataset**


Fetch the data from OpenML, then load its features into ``X_raw`` and its labels into ``y_raw``.

In [2]:
from sklearn.datasets import fetch_openml

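# version=1 corresponds to OpenML dataset 31; as_frame=True returns pandas objects,
# which the select_dtypes calls below rely on.
data = fetch_openml('credit-g', version=1, as_frame=True)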
X_raw = data.data
y_raw = data.target

## **3. Perform categorical encoding and feature scaling**

Perform one-hot encoding on features with categorical data type.

In [3]:
import pandas as pd

cat_features = X_raw.select_dtypes('category').columns
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)
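
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical stand-ins, not rows from the German Credit data.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column and one numerical column.
toy = pd.DataFrame({
    "housing": pd.Categorical(["own", "rent", "for free", "own"]),
    "duration": [12.0, 24.0, 6.0, 36.0],
})

toy_cat = toy.select_dtypes("category").columns
toy_encoded = pd.get_dummies(toy, columns=toy_cat, drop_first=True)

# The first category ("for free", in sorted order) is dropped,
# so three categories become two indicator columns.
print(toy_encoded.columns.tolist())
# ['duration', 'housing_own', 'housing_rent']
```

Dropping one level per feature avoids perfectly collinear indicator columns, which is usually preferable for a linear model such as logistic regression.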

Perform feature scaling on features with numerical (float64) data type.

In [4]:
from sklearn.preprocessing import StandardScaler

num_features = X_raw.select_dtypes('float64').columns

X_scaled = X_encoded.copy()
scaler = StandardScaler()
X_scaled[num_features] = scaler.fit_transform(X_encoded[num_features])
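
As a quick sanity check of what this step does, the sketch below scales only the float64 columns of a small hypothetical frame and confirms that they end up with zero mean and unit variance, while an already-encoded column is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; "housing_own" stands in for a one-hot encoded column.
toy = pd.DataFrame({
    "duration": [6.0, 12.0, 24.0, 36.0],
    "age": [22.0, 35.0, 47.0, 61.0],
    "housing_own": [1, 0, 1, 1],
})

toy_num = toy.select_dtypes("float64").columns
toy_scaled = toy.copy()
toy_scaled[toy_num] = StandardScaler().fit_transform(toy[toy_num])

# StandardScaler uses the population standard deviation (ddof=0).
print(np.allclose(toy_scaled[toy_num].mean(), 0))
print(np.allclose(toy_scaled[toy_num].std(ddof=0), 1))
```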

Encode the labels as integers: "bad" is encoded as 0 and "good" is encoded as 1.

In [5]:
y = y_raw.map({'bad': 0, 'good': 1})

## **4. Build a classification model**

Build a scikit-learn Logistic Regression model.

In [6]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()


## **5. Compute the out-of-sample predicted probabilities with cross validation**

First, fit the model that is used to compute the out-of-sample predicted probabilities, ``pyx``, on the entire dataset. This model will not be used for model evaluation.

In [7]:
_ = clf.fit(X_scaled, y)

Compute the out-of-sample predicted probabilities, ``pyx``, with cross validation.

In [8]:
from sklearn.model_selection import cross_val_predict

pyx = cross_val_predict(clf, X_scaled, y, cv=3, method='predict_proba')
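
The shape of this output is easy to verify on synthetic data. The following is a small sketch, assuming a made-up binary problem from `make_classification` rather than the credit data: each example gets one row of class probabilities, predicted by a model that did not see that example during training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (not the German Credit dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# cross_val_predict clones the estimator for every fold, so each row of the
# result comes from a model trained without that row.
pyx_toy = cross_val_predict(
    LogisticRegression(), X_toy, y_toy, cv=3, method="predict_proba"
)

print(pyx_toy.shape)                          # (200, 2)
print(np.allclose(pyx_toy.sum(axis=1), 1.0))  # True
```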

## **6. Run Cleanlab to find potential label errors**

In [9]:
from cleanlab.pruning import get_noise_indices

ordered_label_errors = get_noise_indices(
    s=y, psx=pyx, sorted_index_method="prob_given_label"
)
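
To give some intuition for the `"prob_given_label"` ordering, here is a deliberately simplified NumPy sketch. It is not Cleanlab's actual algorithm, which estimates the joint distribution of given and true labels; it only illustrates the idea of ranking suspicious examples by how little probability the model assigns to their given label. The labels and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical given labels and out-of-sample predicted probabilities.
labels = np.array([0, 1, 1, 0, 1])
probs = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],  # labeled 1, but the model favors class 0
    [0.4, 0.6],  # labeled 0, but the model favors class 1
    [0.1, 0.9],
])

# Probability the model assigns to each example's given label.
prob_given_label = probs[np.arange(len(labels)), labels]

# Simplified stand-in for the filtering step: flag disagreements between
# the predicted class and the given label.
suspects = np.flatnonzero(probs.argmax(axis=1) != labels)

# Order the flagged examples from least to most confident in the given label.
ranked = suspects[np.argsort(prob_given_label[suspects])]
print(ranked)  # [2 3]
```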


## **7. Review some of the most likely label errors**

In [10]:
print(f"Cleanlab found {len(ordered_label_errors)} potential label errors.")


Cleanlab found 173 potential label errors.


Print the top few potentially mislabeled examples.

In [11]:
X_raw.iloc[ordered_label_errors].assign(label=y_raw.iloc[ordered_label_errors]).head()

| | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 505 | no checking | 10.0 | existing paid | new car | 1309.0 | no known savings | 1<=X<4 | 4.0 | male single | guarantor | ... | life insurance | 27.0 | none | own | 1.0 | unskilled resident | 1.0 | none | yes | bad |
| 56 | 0<=X<200 | 12.0 | existing paid | radio/tv | 6468.0 | no known savings | unemployed | 2.0 | male single | none | ... | no known property | 52.0 | none | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
| 190 | no checking | 24.0 | existing paid | business | 4591.0 | >=1000 | 1<=X<4 | 2.0 | male single | none | ... | life insurance | 54.0 | none | own | 3.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
| 92 | no checking | 12.0 | critical/other existing credit | radio/tv | 797.0 | no known savings | >=7 | 4.0 | female div/dep/mar | none | ... | life insurance | 33.0 | bank | own | 1.0 | unskilled resident | 2.0 | none | yes | bad |
| 757 | >=200 | 15.0 | critical/other existing credit | radio/tv | 1271.0 | no known savings | 1<=X<4 | 3.0 | male single | none | ... | no known property | 39.0 | none | for free | 2.0 | skilled | 1.0 | yes | yes | bad |


## **8. Adapt with Cleanlab's wrapper and train a more robust model**

Split the one-hot encoded dataset into train and test subsets.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.25)

Scale the numerical (float64) features again, this time fitting the scaler on the training set only so that no statistics from the test set leak into training.

In [13]:
scaler = StandardScaler()

X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
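
The point of calling `fit_transform` on the training set and only `transform` on the test set can be checked by hand. In this minimal sketch with hypothetical values, the test example is standardized with the mean and standard deviation learned from the training data alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test sets.
train_demo = np.array([[10.0], [20.0], [30.0]])
test_demo = np.array([[40.0]])

demo_scaler = StandardScaler().fit(train_demo)
test_demo_scaled = demo_scaler.transform(test_demo)

# The test value is shifted and scaled using training statistics only.
expected = (40.0 - train_demo.mean()) / train_demo.std()
print(np.isclose(test_demo_scaled[0, 0], expected))  # True
```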

Build a new instance of scikit-learn's logistic regression classifier, then wrap it with Cleanlab's ``LearningWithNoisyLabels`` wrapper.

In [14]:
from cleanlab.classification import LearningWithNoisyLabels

clf = LogisticRegression()
lnl = LearningWithNoisyLabels(clf)

Train the wrapped model, `lnl`, on the train set.

In [15]:
_ = lnl.fit(X_train.to_numpy(), y_train.to_numpy())
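
Conceptually, the wrapper finds likely label errors using out-of-sample predicted probabilities, removes them, and retrains the underlying classifier on the remaining data. The sketch below imitates that workflow with scikit-learn alone on synthetic data with deliberately flipped labels. The filtering rule (drop every example whose predicted class disagrees with its given label) is a crude assumption made for illustration and is not the rule Cleanlab applies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data with 10% of the labels flipped on purpose.
X_toy, y_true = make_classification(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_true), size=30, replace=False)
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Step 1: out-of-sample predicted probabilities on the noisy labels.
probs = cross_val_predict(
    LogisticRegression(), X_toy, y_noisy, cv=3, method="predict_proba"
)

# Step 2 (crude stand-in for Cleanlab's filtering): keep only examples
# whose predicted class agrees with the given label.
keep = probs.argmax(axis=1) == y_noisy

# Step 3: retrain on the filtered data.
robust = LogisticRegression().fit(X_toy[keep], y_noisy[keep])
print(f"Kept {keep.sum()} of {len(y_noisy)} examples for retraining.")
```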

## **9. Evaluate the robust model's performance**

In [16]:
from sklearn.metrics import accuracy_score

y_pred = lnl.predict(X_test.to_numpy())
accuracy_score(y_test, y_pred)

0.712
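
When reading this number, keep in mind that the German Credit dataset is imbalanced (700 "good" versus 300 "bad" examples), so a model that always predicts "good" already scores about 0.7 on accuracy. The sketch below uses hypothetical labels with the same 70/30 split to show that baseline, along with balanced accuracy as one possible complementary metric.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical test labels with a 70/30 good/bad split.
y_demo = np.array([1] * 175 + [0] * 75)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

acc = accuracy_score(y_demo, y_base)
bal_acc = balanced_accuracy_score(y_demo, y_base)
print(acc)      # 0.7
print(bal_acc)  # 0.5
```

Because the train/test split above is not seeded, the accuracy reported by the notebook will also vary somewhat from run to run.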

## **What's next?**

Congratulations on completing this tutorial! Check out the next tutorial on using Cleanlab for audio classification, where we even found label errors in one of the most reputable audio datasets!