# CleanLab


## 1. Installing CleanLab with DataLab Extension

This command installs the cleanlab package, which is used for identifying and correcting label issues in datasets. The [datalab] extra additionally installs the dependencies needed by cleanlab's Datalab module, which audits a dataset for several kinds of issues beyond label errors.

In [None]:
!pip install "cleanlab[datalab]"
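
To confirm that the installation succeeded before moving on, a small check such as the following can help. It is a minimal sketch that uses only the Python standard library; installed_version is a hypothetical helper name chosen for this example and is not part of cleanlab.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if cleanlab is installed, otherwise None.
print("cleanlab:", installed_version("cleanlab"))
```

If this prints None, the pip command above most likely ran in a different environment than the notebook kernel.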

## 2. Importing Required Libraries

Here, several essential libraries are imported:

    - numpy and pandas for numerical computations and data handling.
    - load_iris to load the Iris dataset.
    - train_test_split to split the dataset into training and testing sets.
    - RandomForestClassifier to define a classifier model.
    - CleanLearning from cleanlab.classification to wrap a classifier so that it can find label issues and train on noisy labels.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from cleanlab.classification import CleanLearning

## 3. Loading Iris Dataset and Introducing Label Errors


This step loads the Iris dataset, which consists of 150 samples of iris flowers classified into three species. We then introduce errors into a copy of the labels (y) by randomly selecting 5 indices and replacing each selected label with a different class. These simulated mislabeled points are what the later steps try to detect.

In [None]:
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target.copy()  # copy so the original labels in iris.target stay intact

# Corrupt 5 randomly chosen labels
np.random.seed(42)
num_errors = 5
error_indices = np.random.choice(len(y), num_errors, replace=False)
# Shift each chosen label by 1 or 2 (mod 3) so the new label always differs from the old one
y[error_indices] = (y[error_indices] + np.random.randint(1, 3, num_errors)) % 3
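
As a standalone illustration of label-noise injection, here is a minimal sketch of a reusable helper that returns a noisy copy of the labels together with the changed indices, so the clean labels stay available for later comparison. The name inject_label_noise and the use of NumPy's Generator API are assumptions made for this sketch; neither comes from the notebook or from cleanlab.

```python
import numpy as np


def inject_label_noise(labels, num_errors, num_classes, seed=42):
    """Return a noisy copy of labels plus the indices that were changed."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels, copy=True)  # leave the caller's array untouched
    idx = rng.choice(len(noisy), size=num_errors, replace=False)
    # An offset in [1, num_classes - 1] taken modulo num_classes can never map
    # a label back to itself, so every selected label really changes.
    offsets = rng.integers(1, num_classes, size=num_errors)
    noisy[idx] = (noisy[idx] + offsets) % num_classes
    return noisy, idx


# Toy labels with the same layout as the Iris targets: 50 samples per class.
y_clean = np.repeat([0, 1, 2], 50)
y_noisy, flipped = inject_label_noise(y_clean, num_errors=5, num_classes=3)
print("Changed labels:", int((y_noisy != y_clean).sum()))
```

Sampling the replacement uniformly from all classes, by contrast, can pick the original label again, in which case fewer labels are corrupted than requested.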

## 4. Splitting the Dataset into Training and Test Sets

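The dataset is split into training (X_train, y_train) and testing (X_test, y_test) sets. 80% of the data is used for training, and 20% is reserved for testing. Setting random_state=42 makes the split reproducible. Because the label errors were introduced before the split, some of the corrupted labels may end up in the test set rather than the training set.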

In [None]:
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
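
The split shuffles the rows, so positions in X_train no longer match row numbers in the original dataset. One way to keep track is to split an array of row numbers alongside the data, as in the sketch below. The stand-in data, the names row_ids, ids_tr and ids_te, and the corrupted row numbers are all made up for illustration; in the notebook the corrupted rows would be error_indices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the Iris shape (150 rows, 4 features, 3 classes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = np.repeat([0, 1, 2], 50)
row_ids = np.arange(len(y_demo))  # original row number of every sample

# train_test_split accepts any number of arrays and splits them consistently.
X_tr, X_te, y_tr, y_te, ids_tr, ids_te = train_test_split(
    X_demo, y_demo, row_ids, test_size=0.2, random_state=42
)

# Hypothetical corrupted rows, used only to show the bookkeeping.
corrupted = np.array([3, 57, 101, 120, 149])
in_train = np.isin(corrupted, ids_tr)
print("Corrupted rows in training split:", corrupted[in_train])
print("Corrupted rows in test split:", corrupted[~in_train])
```

With ids_tr available, a position p within the training set maps back to original row ids_tr[p].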

## 5. Defining and Fitting the Random Forest Classifier with Cleanlab's CleanLearning

A Random Forest classifier is defined with 100 trees. The classifier is then wrapped in CleanLearning from cleanlab. Calling fit() on the training data (X_train, y_train) estimates which examples are likely mislabeled using cross-validated predicted probabilities, removes those examples, and trains the final classifier on the remaining data. The flagged labels are excluded from training rather than corrected in place.


In [None]:
# Define a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Use Cleanlab's CleanLearning
cleaner = CleanLearning(clf)
cleaner.fit(X_train, y_train)
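
The key ingredient behind this step is a set of out-of-sample predicted probabilities: each example is scored by a model that did not see it during training. The sketch below reproduces only that ingredient with scikit-learn's cross_val_predict. It is an illustration of the idea rather than cleanlab's internal code, and the choice of 5 folds is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X_demo, y_demo = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Each row is predicted by a model fitted on the other folds only.
pred_probs = cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")

print(pred_probs.shape)  # one row per sample, one column per class
```

Probabilities obtained this way are less affected by the model memorizing a wrong label than predictions made on the same data the model was fitted on.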

## 6. Identifying Suspected Label Issues

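The find_label_issues() method identifies potentially mislabeled data points based on the classifier’s out-of-sample predicted probabilities. It returns a pandas DataFrame (label_issues) with one row per training example, including a boolean flag (is_label_issue) and a label quality score (label_quality) in which lower values indicate a more suspicious label. The indices of the suspected mislabeled points, which are positions within X_train, are extracted and printed for review.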

In [None]:
# Get label issues (lower label_quality scores mean more likely mislabeled)
label_issues = cleaner.find_label_issues(X=X_train, labels=y_train)

# Display suspected mislabeled data
mislabeled_indices = np.where(label_issues["is_label_issue"])[0]
print("Suspected label errors at indices:", mislabeled_indices)
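
To give a feel for how predicted probabilities can be turned into label-issue flags, here is a simplified sketch in plain NumPy. It computes a self-confidence score (the probability the model assigns to the given label) and applies a per-class threshold rule in the spirit of confident learning. The function names and the toy probabilities are made up for this example, and the rule is a simplification, not cleanlab's actual implementation.

```python
import numpy as np


def self_confidence(labels, pred_probs):
    """Probability the model assigns to each example's given label."""
    return pred_probs[np.arange(len(labels)), labels]


def flag_by_class_threshold(labels, pred_probs):
    """Simplified confident-learning rule (illustration only)."""
    num_classes = pred_probs.shape[1]
    # Threshold for class k: mean predicted probability of k among examples labeled k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(num_classes)]
    )
    # Mark, per example, the classes whose probability clears their own threshold.
    confident = pred_probs >= thresholds
    # Among those classes pick the most probable one; use -1 if none qualifies.
    masked = np.where(confident, pred_probs, -1.0)
    best = np.where(confident.any(axis=1), masked.argmax(axis=1), -1)
    # Flag examples that confidently belong to a class other than their given label.
    return (best != -1) & (best != labels)


labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model strongly prefers class 1
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
])

scores = self_confidence(labels, pred_probs)
flags = flag_by_class_threshold(labels, pred_probs)
print("Flagged positions:", np.flatnonzero(flags))
print("Lowest-quality position:", int(np.argmin(scores)))
```

In this toy input only the third example is flagged, and it also has the lowest self-confidence score.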

## 7. Displaying Suspected Mislabeled Data Points

After detecting the suspected points, the predict() method is used to obtain the model’s predictions on the training data. A single DataFrame (df_all_suspects) is then built with one row per suspected index and the following columns:

    - The position of the sample within X_train (Index).
    - The feature values (X_train[idx]).
    - The given label, which may be wrong (y_train[idx]).
    - The label predicted by the model (predicted_labels[idx]).

Indexing the arrays with mislabeled_indices builds the whole table in one step and also works when no issues are flagged. The final DataFrame is then printed to display all the suspected mislabeled data points in a readable table format.

In [None]:
# Get model's predictions
predicted_labels = cleaner.predict(X_train)

# Build one table with a row per suspected label issue
df_all_suspects = pd.DataFrame(X_train[mislabeled_indices], columns=iris.feature_names)
df_all_suspects.insert(0, "Index", mislabeled_indices)  # position within X_train
df_all_suspects["Given Label"] = y_train[mislabeled_indices]  # label in the data, possibly wrong
df_all_suspects["Predicted Label"] = predicted_labels[mislabeled_indices]  # model's prediction

# Print the full table of suspected mislabeled data points
print("\n                                           Suspected Mislabeled Data Points")
print("-----------------------------------------------------------------------------------------------------------------------")

print(df_all_suspects.to_string(index=False))
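
Because the label errors in this notebook were injected deliberately, the flagged positions can be compared against the known corrupted ones. The sketch below computes precision and recall for such a comparison. It assumes both index lists refer to the same index space (for example, positions within X_train); detection_report is a hypothetical helper, and the index values are made up for illustration.

```python
def detection_report(flagged, corrupted):
    """Compare flagged indices with known corrupted indices."""
    flagged = {int(i) for i in flagged}
    corrupted = {int(i) for i in corrupted}
    hits = flagged & corrupted
    # Precision: share of flagged points that were truly corrupted.
    precision = len(hits) / len(flagged) if flagged else 0.0
    # Recall: share of corrupted points that were flagged.
    recall = len(hits) / len(corrupted) if corrupted else 0.0
    return {"hits": sorted(hits), "precision": precision, "recall": recall}


# Made-up positions, for illustration only.
report = detection_report(flagged=[4, 17, 63, 90], corrupted=[4, 17, 25])
print(report)
```

A flagged point that was not deliberately corrupted is not necessarily a false alarm; it may be a naturally ambiguous sample that deserves a closer look.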
