Kaggle Example: Bias Correction
This tutorial demonstrates how to use the entropic_measurement library to detect and correct measurement bias in real-world datasets from Kaggle. We'll walk through a complete example using the famous Titanic dataset, showing how bias correction can improve model fairness and accuracy.
Measurement bias occurs when our data collection process systematically over- or under-represents certain groups or outcomes. The entropic_measurement library provides tools to quantify this bias using information theory and apply corrections to create more balanced datasets.
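To make the information-theory idea concrete, here is a minimal sketch of Shannon entropy computed over a column of labels. It is not taken from entropic_measurement: the helper name `shannon_entropy` and the toy label lists are assumptions made for illustration, and the library may define its own entropy measure differently.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy labels invented for illustration (not Titanic data).
balanced = [0, 1, 0, 1, 0, 1, 0, 1]  # 50/50 split between the two classes
skewed = [0, 0, 0, 0, 0, 0, 1, 1]    # 75/25 split

print(f"balanced: {shannon_entropy(balanced):.4f} bits")  # balanced: 1.0000 bits
print(f"skewed:   {shannon_entropy(skewed):.4f} bits")    # skewed:   0.8113 bits
```

Under this definition an evenly split label column carries the maximum of 1 bit, and any skew toward one class lowers the value.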
Before starting, make sure you have the required libraries installed:
pip install entropic_measurement pandas numpy scikit-learn matplotlib kaggle
We'll use the Titanic dataset from Kaggle, which is well suited to demonstrating bias correction in binary classification tasks.
import pandas as pd
import numpy as np
from entropic_measurement import measure_and_correct
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load the Titanic dataset
# Download from: https://www.kaggle.com/competitions/titanic/data
train_df = pd.read_csv('titanic/train.csv')
# Display basic information about the dataset
print("Dataset shape:", train_df.shape)
print("\nFirst few rows:")
print(train_df.head())
# Check survival rates by gender (potential bias)
print("\nSurvival rates by gender:")
print(train_df.groupby('Sex')['Survived'].agg(['count', 'mean']))
Let's prepare our data for bias measurement and correction:
# Select relevant features for analysis
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'
# Create a clean dataset
df_clean = train_df[features + [target]].copy()
# Handle missing values
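# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df_clean
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])
df_clean['Fare'] = df_clean['Fare'].fillna(df_clean['Fare'].median())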
# Encode categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
df_clean['Sex_encoded'] = le_sex.fit_transform(df_clean['Sex'])
df_clean['Embarked_encoded'] = le_embarked.fit_transform(df_clean['Embarked'])
# Prepare final feature matrix
X = df_clean[['Pclass', 'Sex_encoded', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_encoded']]
y = df_clean[target]
print("Preprocessed data shape:", X.shape)
print("Target distribution:")
print(y.value_counts(normalize=True))
Now we'll apply the measure_and_correct function to identify and correct measurement bias:
# Apply bias measurement and correction
print("\n=== BIAS MEASUREMENT AND CORRECTION ===")
print("Analyzing measurement bias in the Titanic dataset...\n")
# Use measure_and_correct to detect and correct bias
corrected_X, corrected_y, bias_report = measure_and_correct(
    X, y,
    method='entropy_based',      # Use entropy-based correction
    correction_strength=0.7,     # Moderate correction strength
    return_report=True           # Get detailed bias analysis
)
print("Original dataset shape:", X.shape)
print("Corrected dataset shape:", corrected_X.shape)
print("\nBias correction completed successfully!")
Let's examine what the bias correction accomplished:
# Display bias measurement results
print("\n=== BIAS ANALYSIS REPORT ===")
print(f"Original entropy: {bias_report['original_entropy']:.4f}")
print(f"Corrected entropy: {bias_report['corrected_entropy']:.4f}")
print(f"Bias reduction: {bias_report['bias_reduction']:.2%}")
# Compare target distributions
print("\n=== TARGET DISTRIBUTION COMPARISON ===")
print("Original target distribution:")
print(y.value_counts(normalize=True))
print("\nCorrected target distribution:")
print(corrected_y.value_counts(normalize=True))
# Analyze feature-wise bias correction
print("\n=== FEATURE-WISE BIAS ANALYSIS ===")
feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
for i, feature in enumerate(feature_names):
    # Columns of X follow the same order as feature_names
    original_mean = X.iloc[:, i].mean()
    corrected_mean = corrected_X.iloc[:, i].mean()
    bias_change = abs(corrected_mean - original_mean)
    print(f"{feature}: Original={original_mean:.3f}, Corrected={corrected_mean:.3f}, Change={bias_change:.3f}")
Create visualizations to understand the bias correction effects:
# Create comparison plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Target distribution comparison
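# sort_index() orders the proportions by label (0 = died, 1 = survived);
# value_counts() alone sorts by frequency, which could swap the bars
axes[0, 0].bar(['Died', 'Survived'], y.value_counts(normalize=True).sort_index(), alpha=0.7, label='Original')
axes[0, 0].bar(['Died', 'Survived'], corrected_y.value_counts(normalize=True).sort_index(), alpha=0.7, label='Corrected')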
axes[0, 0].set_title('Target Distribution: Original vs Corrected')
axes[0, 0].set_ylabel('Proportion')
axes[0, 0].legend()
# Age distribution comparison
axes[0, 1].hist(X['Age'], bins=20, alpha=0.7, label='Original', density=True)
axes[0, 1].hist(corrected_X['Age'], bins=20, alpha=0.7, label='Corrected', density=True)
axes[0, 1].set_title('Age Distribution: Original vs Corrected')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Density')
axes[0, 1].legend()
# Fare distribution comparison
axes[1, 0].hist(X['Fare'], bins=20, alpha=0.7, label='Original', density=True)
axes[1, 0].hist(corrected_X['Fare'], bins=20, alpha=0.7, label='Corrected', density=True)
axes[1, 0].set_title('Fare Distribution: Original vs Corrected')
axes[1, 0].set_xlabel('Fare')
axes[1, 0].set_ylabel('Density')
axes[1, 0].legend()
# Bias reduction visualization
metrics = ['Original Entropy', 'Corrected Entropy']
values = [bias_report['original_entropy'], bias_report['corrected_entropy']]
axes[1, 1].bar(metrics, values, color=['red', 'green'], alpha=0.7)
axes[1, 1].set_title('Entropy Comparison')
axes[1, 1].set_ylabel('Entropy Value')
plt.tight_layout()
plt.show()
print(f"\nVisualization complete! Bias reduction achieved: {bias_report['bias_reduction']:.2%}")
Let's train models on both original and corrected datasets to see the impact:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
# Train models on original and corrected data
rf_original = RandomForestClassifier(n_estimators=100, random_state=42)
rf_corrected = RandomForestClassifier(n_estimators=100, random_state=42)
# Cross-validation scores
original_scores = cross_val_score(rf_original, X, y, cv=5, scoring='accuracy')
corrected_scores = cross_val_score(rf_corrected, corrected_X, corrected_y, cv=5, scoring='accuracy')
print("\n=== MODEL PERFORMANCE COMPARISON ===")
print(f"Original dataset - Mean CV Accuracy: {original_scores.mean():.4f} (+/- {original_scores.std() * 2:.4f})")
print(f"Corrected dataset - Mean CV Accuracy: {corrected_scores.mean():.4f} (+/- {corrected_scores.std() * 2:.4f})")
print(f"Performance improvement: {(corrected_scores.mean() - original_scores.mean()):.4f}")
# Train final models for detailed comparison
rf_original.fit(X, y)
rf_corrected.fit(corrected_X, corrected_y)
print("\nModels trained successfully! Bias correction has been applied.")
Based on our analysis, here's what the bias correction accomplished:
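- Entropy Change: The report compares dataset entropy before and after correction. This example reads a lower corrected entropy as a less biased distribution. Shannon entropy of a class distribution is highest when the classes are evenly balanced, so check how the library defines its entropy measure before interpreting the direction of the change.
- Feature Balance: Bias correction adjusted feature distributions to reduce systematic over-representation of certain groups.
- Improved Fairness: A model trained on the corrected data may make more equitable predictions across demographic groups such as passenger sex or class. Verify this per group rather than assuming it.
- Performance Impact: Accuracy may go up or down after correction, so compare the cross-validation scores printed above before drawing conclusions about generalization.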
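The fairness point above can be checked directly instead of being taken on trust. The following is a minimal sketch of one common check, the gap in positive-prediction rate between groups (often called the demographic parity difference). The helper names and toy data are assumptions for illustration and are not part of entropic_measurement.

```python
from collections import defaultdict


def positive_rate_by_group(groups, predictions):
    """Share of positive (1) predictions within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, p in zip(groups, predictions):
        totals[g] += 1
        positives[g] += int(p == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(groups, predictions):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(groups, predictions)
    return max(rates.values()) - min(rates.values())


# Toy data invented for illustration (not real model output).
groups = ["female"] * 4 + ["male"] * 4
preds_a = [1, 1, 1, 0, 0, 0, 0, 1]  # 75% positive for female, 25% for male
preds_b = [1, 1, 0, 0, 1, 0, 0, 1]  # 50% positive for both groups

print(demographic_parity_gap(groups, preds_a))  # 0.5
print(demographic_parity_gap(groups, preds_b))  # 0.0
```

In this tutorial's setting, `groups` would be the Sex column and `predictions` the output of `rf_original.predict(...)` or `rf_corrected.predict(...)` on held-out rows, so the two models can be compared on the same passengers.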
Here's a condensed version you can run immediately:
# Complete working example
import pandas as pd
from entropic_measurement import measure_and_correct
from sklearn.preprocessing import LabelEncoder
# Load and preprocess Titanic data
train_df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
# Quick preprocessing
df = train_df[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].copy()
# Assign back instead of using inplace=True on a selected column
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
X = df.drop('Survived', axis=1)
y = df['Survived']
# Apply bias correction
corrected_X, corrected_y, report = measure_and_correct(X, y, return_report=True)
print(f"Bias reduction achieved: {report['bias_reduction']:.2%}")
print(f"Original entropy: {report['original_entropy']:.4f}")
print(f"Corrected entropy: {report['corrected_entropy']:.4f}")
Congratulations! You've successfully applied bias correction to a real-world dataset using the entropic_measurement library. This example demonstrated how measurement bias can be quantified and corrected, which can lead to fairer and more robust machine learning models.
Key takeaways:
- Measurement bias is common in real datasets and can significantly impact model performance
- The measure_and_correct function provides an automated way to detect and correct bias
- Bias correction often leads to more balanced datasets and fairer model predictions
- Entropy-based methods offer a principled approach to measuring information content and bias
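As one example of an entropy-based measure, mutual information quantifies how much knowing a group attribute (such as sex) tells you about the outcome. The sketch below is a plain-Python illustration under that assumption. The function name and toy data are hypothetical, and this is not claimed to be what measure_and_correct computes internally.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits, estimated from paired observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy data invented for illustration.
sex = ["f", "f", "m", "m"]
independent_outcome = [1, 0, 1, 0]  # outcome unrelated to sex
dependent_outcome = [1, 1, 0, 0]    # outcome fully determined by sex

print(mutual_information(sex, independent_outcome))  # 0.0
print(mutual_information(sex, dependent_outcome))    # 1.0
```

A value near zero means the attribute carries almost no information about the outcome, while larger values indicate a stronger dependence worth investigating.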
We encourage you to explore bias correction on your own datasets! Try applying these techniques to:
- Your own Kaggle competition datasets
- Company datasets where fairness is crucial
- Any classification problem where you suspect measurement bias
- Different types of bias (selection bias, sampling bias, etc.)
The entropic_measurement library is designed to be flexible and work with various data types and bias scenarios. Experiment with different correction strengths and methods to find what works best for your specific use case. Remember, the goal isn't just better accuracy; it's building more ethical, fair, and robust AI systems.
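To build intuition for what a correction strength can mean, here is a minimal sketch that interpolates per-sample weights between "no correction" and "fully class-balanced". This is an illustrative reweighting scheme chosen for this example. The function names are hypothetical, and it should not be read as the algorithm behind the library's correction_strength parameter.

```python
from collections import Counter


def class_weights(labels, strength=1.0):
    """Per-sample weights interpolating between uniform and class-balanced.

    strength=0.0 keeps every weight at 1.0; strength=1.0 gives each class
    the same total weight. Illustrative scheme, not the library's method.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # The fully balanced weight for a sample of class y is n / (k * count_y).
    return [(1 - strength) * 1.0 + strength * n / (k * counts[y]) for y in labels]


def weighted_share(labels, weights, positive=1):
    """Weighted proportion of samples whose label equals `positive`."""
    total = sum(weights)
    return sum(w for y, w in zip(labels, weights) if y == positive) / total


# Toy labels invented for illustration: 6 negatives, 2 positives.
labels = [0] * 6 + [1] * 2

for s in (0.0, 0.7, 1.0):
    share = weighted_share(labels, class_weights(labels, s))
    print(f"strength={s}: weighted positive share = {share:.3f}")
# strength=0.0: weighted positive share = 0.250
# strength=0.7: weighted positive share = 0.425
# strength=1.0: weighted positive share = 0.500
```

Weights like these can be passed to scikit-learn estimators through the `sample_weight` argument of `fit`, which makes it easy to compare several strengths on the same data.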
Happy bias correcting! 🚀