# Data quality - label inconsistencies

Label inconsistencies can significantly undermine machine learning models, particularly in sensitive use cases like fraud detection. When labels are inconsistent, for example when similar transactions are labeled differently, the training data becomes noisy and the models trained on it become unreliable. This directly reduces the model’s ability to predict fraudulent behavior, increasing the likelihood of both false positives and false negatives. In critical industries such as finance, these errors can have substantial economic and operational consequences.

Synthetic data generation is also affected by label inconsistencies. If the original dataset contains labeling errors, the synthetic data generated from it will replicate them, propagating flawed patterns and reducing the usefulness of both the synthetic data and any models trained on it. Ensuring label consistency is therefore vital for maintaining the quality of your data.
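As a quick first check, the simplest form of inconsistency (identical feature values carrying different labels) can be surfaced with plain pandas. The snippet below is a minimal sketch, not part of the ydata workflow used later: the helper name `find_conflicting_duplicates` and the toy transactions are made up for illustration, and it only catches exact duplicates, not merely similar rows.

```python
import pandas as pd


def find_conflicting_duplicates(df: pd.DataFrame, label_col: str) -> pd.DataFrame:
    """Return rows whose feature values are identical but whose labels differ.

    Hypothetical helper for illustration only.
    """
    feature_cols = [c for c in df.columns if c != label_col]
    # For every group of identical feature rows, count the distinct labels
    # and broadcast that count back onto each row of the group.
    n_labels = df.groupby(feature_cols, dropna=False)[label_col].transform("nunique")
    return df[n_labels > 1]


# Toy, made-up transactions (not taken from the credit card dataset).
toy = pd.DataFrame({
    "amount": [10.0, 10.0, 25.0, 25.0, 40.0],
    "merchant": ["a", "a", "b", "b", "c"],
    "Class": [0, 1, 0, 0, 1],
})

conflicts = find_conflicting_duplicates(toy, "Class")
print(conflicts)
```

Here only the first two rows are returned: they share the same amount and merchant but disagree on `Class`. Real data rarely contains exact duplicates, which is why the model-based approach in the next section is needed.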

In [3]:
!pip install xgboost


## How to detect inconsistent labels

In [10]:
import pandas as pd
from sklearn.model_selection import cross_val_predict, train_test_split
from xgboost import XGBClassifier

from ydata.dataset import Dataset
from ydata.metadata import Metadata
from ydata.quality.labels import FindInconsistentLabelsEngine, LabelFilter, RankedBy

In [12]:
# This example uses the credit card fraud dataset from Kaggle: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
df = pd.read_csv('data (2).csv')
y = df['Class']
X = df.drop('Class', axis=1).copy()

# Create the train and test split
df_train, df_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# XGBoost has experimental support for categorical data.
# Default hyperparameters are used here for simplicity.
model = XGBClassifier(tree_method="hist", enable_categorical=True)
model.fit(df_train, y_train)

# Compute out-of-sample predicted probabilities for the training split with
# cross-validation, so each row is scored by a model that did not see it.
model = XGBClassifier(tree_method="hist", enable_categorical=True)
pred_probs = cross_val_predict(model, df_train, y_train, method='predict_proba')

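# Re-attach the labels so features and label live in the same table.
df_train['Class'] = y_train
dataset = Dataset(df_train)
metadata = Metadata(dataset)

# Configure the engine to filter with confident learning and to rank
# the flagged rows by self-confidence.
findlabels = FindInconsistentLabelsEngine(filter_type=LabelFilter.CONFIDENT_LEARNING,
                                          indices_ranked_by=RankedBy.SELF_CONFIDENCE)

# Flag the rows whose given label disagrees with the predicted probabilities.
er = findlabels.fit_transform(X=dataset,
                              label_name='Class',
                              metadata=metadata,
                              pred_probs=pred_probs)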

print(f"Number of misclassified labels {len(er)}")

Number of misclassified labels 645
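To give some intuition for what a confident learning filter ranked by self-confidence does, here is a simplified NumPy sketch of the general idea. It is not the ydata implementation, whose internals are not shown in this notebook; the function name `flag_label_issues` and the toy probabilities are assumptions made for illustration. The sketch assumes integer labels `0..K-1` and at least one example per class.

```python
import numpy as np


def flag_label_issues(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """Return indices of likely label issues, least self-confident first.

    Simplified illustration of confident learning, not a library API.
    """
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average probability the model assigns to class k
    # among the examples that were labeled k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    # Self-confidence: probability the model assigns to each example's given label.
    self_conf = pred_probs[np.arange(len(labels)), labels]
    # Mark the classes whose probability reaches that class's threshold.
    above = pred_probs >= thresholds
    # Among those classes keep the most probable one; -1 means "no confident class".
    masked = np.where(above, pred_probs, -1.0)
    confident_class = np.where(above.any(axis=1), masked.argmax(axis=1), -1)
    # An example is suspicious when the model is confident about a different class.
    is_issue = (confident_class != -1) & (confident_class != labels)
    issue_idx = np.flatnonzero(is_issue)
    # Rank the flagged examples from lowest to highest self-confidence.
    return issue_idx[np.argsort(self_conf[issue_idx])]


# Toy, made-up out-of-sample probabilities for a binary problem.
labels = np.array([0, 0, 0, 1, 1, 0])
pred_probs = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.10, 0.90],  # labeled 0, but the model strongly favors class 1
])

issues = flag_label_issues(labels, pred_probs)
print(issues)
```

In this toy example only the last row is flagged. The probabilities passed in should be out-of-sample, as produced by `cross_val_predict` above; in-sample probabilities from an overfit model would make almost every given label look confident.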
