# Detecting Issues in Tabular Data (Numeric/Categorical columns) with Datalab


In this 5-minute quickstart tutorial, we use Datalab to detect various issues in a classification dataset with tabular (numeric/categorical) features. Tabular (or *structured*) data are typically organized in a row/column format and stored in a SQL database or in file formats such as CSV, Excel, or Parquet. Here we consider a Student Grades dataset, which contains records for over 900 students, each with three exam grades, some optional notes, and an assigned letter grade (the class label). cleanlab automatically identifies _hundreds_ of examples in this dataset that were assigned an incorrect final grade. You can run the same code from this tutorial to detect incorrect information in your own tabular classification datasets.

**Overview of what we'll do in this tutorial:**

- Train a classifier model (here scikit-learn's HistGradientBoostingClassifier, although any model could be used) and use this classifier to compute (out-of-sample) predicted class probabilities via cross-validation.

- Create a K nearest neighbours (KNN) graph between the examples in the dataset.

- Identify issues in the dataset with cleanlab's `Datalab` audit applied to the predictions and KNN graph.


<div class="alert alert-info">
Quickstart
<br/>
    
Already have (out-of-sample) `pred_probs` from a model trained on your original data labels? Have a `knn_graph` computed between dataset examples (reflecting similarity in their feature values)? Run the code below to find issues in your dataset.

<div  class=markdown markdown="1" style="background:white;margin:16px">  
    
```ipython3 
from cleanlab import Datalab

lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)

lab.get_issues()
```
   
</div>
</div>

## 1. Install required dependencies


You can use `pip` to install all packages required for this tutorial as follows:

```ipython3
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
```

In [None]:
# Package installation (hidden on docs website).
dependencies = ["cleanlab", "datasets"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    dependencies_test = [dependency.split('>')[0] if '>' in dependency 
                         else dependency.split('<')[0] if '<' in dependency 
                         else dependency.split('=')[0] for dependency in dependencies]
    missing_dependencies = []
    for dependency in dependencies_test:
        try:
            __import__(dependency)
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

In [None]:
import random
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors

from cleanlab import Datalab

SEED = 100  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

## 2. Load and process the data


We first load the data features and labels (which are possibly noisy).


In [None]:
grades_data = pd.read_csv("https://s.cleanlab.ai/grades-tabular-demo-v2.csv")
grades_data.head()

In [None]:
X_raw = grades_data[["exam_1", "exam_2", "exam_3", "notes"]]
labels = grades_data["letter_grade"]

Next we preprocess the data. Here we apply one-hot encoding to columns with categorical values and standardize the values in numeric columns.

In [None]:
cat_features = ["notes"]
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)

numeric_features = ["exam_1", "exam_2", "exam_3"]
scaler = StandardScaler()
X_processed = X_encoded.copy()
X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])
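
To see what these two preprocessing steps do, here is a minimal sketch on a tiny made-up table. The `toy` values are hypothetical and not taken from the grades dataset. The sketch relies on the default behavior of `pd.get_dummies`, where a missing categorical value gets no one-hot column of its own.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table: one numeric column and one optional categorical column.
toy = pd.DataFrame(
    {
        "exam_1": [50.0, 70.0, 90.0],
        "notes": ["great participation", "missed class", None],
    }
)

# One-hot encode the categorical column. drop_first=True drops one category
# (the first in sorted order) so the remaining columns are not redundant.
toy_encoded = pd.get_dummies(toy, columns=["notes"], drop_first=True)

# Standardize the numeric column to zero mean and unit variance.
toy_processed = toy_encoded.copy()
toy_processed[["exam_1"]] = StandardScaler().fit_transform(toy_encoded[["exam_1"]])

print(toy_processed)
```

In this sketch, the row with a missing note ends up with no indicator set, which makes it indistinguishable from the dropped category. Whether that is acceptable depends on your data.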

<div class="alert alert-info">
Bringing Your Own Data (BYOD)?

Assign your preprocessed features to variable `X_processed` (and the raw features to `X_raw`, which is only used for display below) and your labels to variable `labels` instead.

</div>

## 3. Select a classification model and compute out-of-sample predicted probabilities


Here we use a simple histogram-based gradient boosting model (similar to XGBoost), but you can choose any suitable scikit-learn model for this tutorial.


In [None]:
clf = HistGradientBoostingClassifier()

To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be _overfitted_ (and thus unreliable) for examples the model was previously trained on. For the best results, cleanlab should be applied with **out-of-sample** predicted class probabilities, i.e., predictions for examples that were held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. We can implement this via the `cross_val_predict` function from scikit-learn. Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.


In [None]:
num_crossval_folds = 5 
pred_probs = cross_val_predict(
    clf,
    X_processed,
    labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
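
If you bring your own `pred_probs`, a quick sanity check on the column ordering can help. The sketch below uses hypothetical labels and an illustrative helper, `sorted_class_names`, which is not part of cleanlab. It only shows which column each class name should occupy under lexicographic ordering.

```python
def sorted_class_names(labels):
    """Return the distinct class names in lexicographic order."""
    return sorted(set(labels))


# Hypothetical letter grades, listed in an arbitrary order.
toy_labels = ["B", "D", "A", "C", "B", "F", "A"]
class_order = sorted_class_names(toy_labels)

# Column j of pred_probs should hold the probability of class_order[j].
column_of_class = {name: j for j, name in enumerate(class_order)}

print(class_order)
print(column_of_class)
```

With `cross_val_predict(..., method="predict_proba")`, scikit-learn returns the columns in sorted class order, so the probabilities computed above should already follow this convention.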

## 4. Construct K nearest neighbours graph

The KNN graph reflects how close each example is when compared to other examples in our dataset (in the numerical space of preprocessed feature values). This similarity information is used by Datalab to identify issues like outliers in our data. For tabular data, think carefully about the most appropriate way to define the similarity between two examples.

Here we use the `NearestNeighbors` class in sklearn to easily compute this graph (with similarity defined by the Euclidean distance between feature values). The graph should be represented as a sparse matrix with nonzero entries indicating nearest neighbors of each example and their distance.

In [None]:
KNN = NearestNeighbors(metric='euclidean')
KNN.fit(X_processed.values)

knn_graph = KNN.kneighbors_graph(mode="distance")
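
To get a feel for what this sparse matrix contains, here is a small sketch on hypothetical one-dimensional data. The values in `toy_X` and the choice of `k = 2` are arbitrary assumptions made for readability, not settings used elsewhere in this tutorial.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 1-D feature values for five examples (no duplicates, no ties).
toy_X = np.array([[0.0], [1.0], [3.0], [7.0], [12.0]])

k = 2  # number of neighbors per example (arbitrary choice for this sketch)
toy_knn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(toy_X)

# Calling kneighbors_graph() without passing data queries the fitted points
# themselves, and a point is not counted as its own neighbor.
toy_graph = toy_knn.kneighbors_graph(mode="distance")

print(toy_graph.shape)      # one row and one column per example
print(toy_graph.toarray())  # entry (i, j) is the distance from i to its neighbor j
```

Each row stores `k` distances, one per nearest neighbor, and every other entry is left empty.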

## 5. Use cleanlab to find label issues


Based on the given labels, predicted probabilities, and KNN graph, cleanlab can quickly help us identify suspicious values in our grades table.

We use cleanlab's `Datalab` class which has several ways of loading the data. In this case, we’ll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a `Datalab` object so that it can audit our dataset for various types of issues.

In [None]:
data = {"X": X_processed.values, "y": labels}

lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)

In [None]:
lab.report()

### Label issues

The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the `get_issues` method.

In [None]:
issue_results = lab.get_issues("label")
issue_results.head()

To review the most severe label issues, sort the DataFrame above by the `label_score` column (a lower score means the label is less likely to be correct).

Let's review some of the most likely label errors:

In [None]:
sorted_issues = issue_results.sort_values("label_score").index

X_raw.iloc[sorted_issues].assign(
    given_label=labels.iloc[sorted_issues], 
    predicted_label=issue_results["predicted_label"].iloc[sorted_issues]
).head()

The dataframe above shows the original label (`given_label`) for examples that cleanlab finds most likely to be mislabeled, as well as an alternative `predicted_label` for each example.

These examples have been labeled incorrectly and should be carefully re-examined - a student with grades of 89, 95 and 73 surely does not deserve a D! 
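
One possible way to organize this re-examination is to build a review queue that contains only the flagged rows, ordered from most to least suspicious. The sketch below uses a hypothetical stand-in for the output of `lab.get_issues("label")`. The column names match the ones used above, but the values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by lab.get_issues("label").
toy_issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, True, False],
        "label_score": [0.92, 0.05, 0.31, 0.77],
        "given_label": ["A", "D", "C", "B"],
        "predicted_label": ["A", "A", "B", "B"],
    }
)

# Keep only the flagged rows, most suspicious (lowest score) first.
review_queue = toy_issues[toy_issues["is_label_issue"]].sort_values("label_score")

print(review_queue[["given_label", "predicted_label", "label_score"]])
```

The `predicted_label` column is a model suggestion rather than ground truth, so it is best treated as a hint for a human reviewer.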

### Outlier issues

According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset.

In [None]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index

X_raw.iloc[sorted_outliers].head()

The student at index 3 has fractional exam scores, which is likely an error. We also see that the students at index 0 and 4 have numerical values in their notes section, which is also probably unintended. Lastly, we see that the student at index 8 has an HTML string in their notes section, definitely a mistake!

### Near-duplicate issues

According to the report, our dataset contains some sets of nearly duplicated examples.
We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.

In [None]:
duplicate_results = lab.get_issues("near_duplicate")
duplicate_results.sort_values("near_duplicate_score").head()

The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Let's view some of these examples to see how similar they are.

Starting with the lowest-scoring example, let's compare it against the examples in its near-duplicate set.

In [None]:
# Identify the row with the lowest near_duplicate_score
lowest_scoring_duplicate = duplicate_results["near_duplicate_score"].idxmin()

# Extract the indices of the lowest scoring duplicate and its near duplicate sets
indices_to_display = [lowest_scoring_duplicate] + duplicate_results.loc[lowest_scoring_duplicate, "near_duplicate_sets"].tolist()

# Display the relevant rows from the original dataset
X_raw.iloc[indices_to_display]

These examples are exact duplicates! Perhaps the same information was accidentally recorded multiple times in this data.

Similarly, let's take a look at another example and the identified near-duplicate sets:


In [None]:
# Identify the next row not in the previous near duplicate set
second_lowest_scoring_duplicate = duplicate_results["near_duplicate_score"].drop(indices_to_display).idxmin()

# Extract the indices of the second lowest scoring duplicate and its near duplicate sets
next_indices_to_display = [second_lowest_scoring_duplicate] + duplicate_results.loc[second_lowest_scoring_duplicate, "near_duplicate_sets"].tolist()

# Display the relevant rows from the original dataset
X_raw.iloc[next_indices_to_display]

We identified another set of exact duplicates in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from [the FAQ](../faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?).
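
One way to avoid splitting duplicates across training and test sets is to split by group, so that all members of a near-duplicate set land on the same side. The sketch below is a minimal illustration on hypothetical data. `duplicate_groups` and `group_train_test_split` are illustrative helper names, not cleanlab functions, and `toy_sets` mimics the `near_duplicate_sets` column shown above.

```python
import random


def duplicate_groups(near_duplicate_sets):
    """Assign a group id to every example so that near duplicates share one id.

    near_duplicate_sets[i] lists the indices flagged as near duplicates of example i.
    """
    parent = list(range(len(near_duplicate_sets)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, neighbors in enumerate(near_duplicate_sets):
        for j in neighbors:
            parent[find(j)] = find(i)  # merge the two groups

    return [find(i) for i in range(len(parent))]


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Split example indices so that each group lands entirely in one split."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test_groups = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test_groups])
    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# Hypothetical near-duplicate sets for six examples: 0 and 3 are duplicates,
# as are 1 and 4, while examples 2 and 5 have no near duplicates.
toy_sets = [[3], [4], [], [0], [1], []]
groups = duplicate_groups(toy_sets)
train_idx, test_idx = group_train_test_split(groups)

print(groups, train_idx, test_idx)
```

If you already use scikit-learn for splitting, its `GroupShuffleSplit` class accepts group ids like these and serves the same purpose.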

This tutorial highlighted a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Datalab with any ML model -- the better the model, the more accurate the data errors detected by Datalab will be!

## Spending too much time on data quality?

Using this open-source package effectively can require significant ML expertise and experimentation, plus handling detected data issues can be cumbersome.

That’s why we built [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) -- an automated platform to find **and fix** issues in your dataset, 100x faster and more accurately.  Cleanlab Studio automatically runs optimized data quality algorithms from this package on top of cutting-edge AutoML & Foundation models fit to your data, and helps you fix detected issues via a smart data correction interface. [Try it](https://cleanlab.ai/) for free!

<p align="center">
  <img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/ml-with-cleanlab-studio.png" alt="The modern AI pipeline automated with Cleanlab Studio">
</p>

In [None]:
# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

identified_label_issues = issue_results[issue_results["is_label_issue"] == True]
label_issue_indices = [3, 723, 709, 886, 689]  # check these examples were found in label issues
if not all(x in identified_label_issues.index for x in label_issue_indices):
    raise Exception("Some highlighted examples are missing from identified_label_issues.")
    
identified_outlier_issues = outlier_results[outlier_results["is_outlier_issue"] == True]
outlier_issue_indices = [3, 7, 0, 4, 8]  # check these examples were found in outlier issues
if not all(x in identified_outlier_issues.index for x in outlier_issue_indices):
    raise Exception("Some highlighted examples are missing from identified_outlier_issues.")
    
identified_duplicate_issues = duplicate_results[duplicate_results["is_near_duplicate_issue"] == True]
duplicate_issue_indices = [690, 246, 185, 582]  # check these examples were found in duplicate issues
if not all(x in identified_duplicate_issues.index for x in duplicate_issue_indices):
    raise Exception("Some highlighted examples are missing from identified_duplicate_issues.")
    
# check that the near duplicates shown are actually flagged as near duplicate sets
if not duplicate_results.iloc[690]["near_duplicate_sets"] == 246:
    raise Exception("These examples are not in the same near duplicate set")
    
if not duplicate_results.iloc[185]["near_duplicate_sets"] == 582:
    raise Exception("These examples are not in the same near duplicate set")

# Function to check if all rows are identical
def are_rows_identical(df):
    first_row = df.iloc[0]
    return all(df.iloc[i].equals(first_row) for i in range(1, len(df)))

# Test to ensure all displayed rows are identical
if not are_rows_identical(X_raw.iloc[indices_to_display]):
    raise Exception("Not all rows are identical! These examples should belong to the same EXACT duplicate set")

# Repeat the test for the next set of indices
if not are_rows_identical(X_raw.iloc[next_indices_to_display]):
    raise Exception("Not all rows are identical! These examples should belong to the same EXACT duplicate set")