Kaggle Example: Bias Correction

This tutorial demonstrates how to use the entropic_measurement library to detect and correct measurement bias in real-world datasets from Kaggle. We'll walk through a complete example using the famous Titanic dataset, showing how bias correction can improve model fairness and accuracy.

Overview

Measurement bias occurs when our data collection process systematically over- or under-represents certain groups or outcomes. The entropic_measurement library provides tools to quantify this bias using information theory and apply corrections to create more balanced datasets.
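To make the information-theoretic idea concrete, here is a minimal sketch of Shannon entropy computed on a label column. It is a generic illustration only: the helper name shannon_entropy is hypothetical and this is not how entropic_measurement computes its report values.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A balanced binary column carries the maximum of 1 bit per label;
# a skewed column carries less.
balanced = [0, 1] * 50
skewed = [0] * 90 + [1] * 10

print(f"Balanced labels: {shannon_entropy(balanced):.3f} bits")
print(f"Skewed labels:   {shannon_entropy(skewed):.3f} bits")
```

Note the direction: for a plain class distribution, a more balanced column has higher entropy, not lower.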

Prerequisites

Before starting, make sure you have the required libraries installed:

pip install entropic_measurement pandas numpy scikit-learn matplotlib kaggle

Step 1: Loading the Kaggle Dataset

We'll use the Titanic dataset from Kaggle, which is perfect for demonstrating bias correction in binary classification tasks.

import pandas as pd
import numpy as np
from entropic_measurement import measure_and_correct
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load the Titanic dataset
# Download from: https://www.kaggle.com/competitions/titanic/data
train_df = pd.read_csv('titanic/train.csv')

# Display basic information about the dataset
print("Dataset shape:", train_df.shape)
print("\nFirst few rows:")
print(train_df.head())

# Check survival rates by gender (potential bias)
print("\nSurvival rates by gender:")
print(train_df.groupby('Sex')['Survived'].agg(['count', 'mean']))

Step 2: Data Preprocessing

Let's prepare our data for bias measurement and correction:

# Select relevant features for analysis
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'

# Create a clean dataset
df_clean = train_df[features + [target]].copy()

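# Handle missing values. Assign the result back instead of calling
# fillna(inplace=True) on a column selection, which is deprecated in
# recent pandas versions and may not modify df_clean at all.
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())
df_clean['Embarked'] = df_clean['Embarked'].fillna(df_clean['Embarked'].mode()[0])
df_clean['Fare'] = df_clean['Fare'].fillna(df_clean['Fare'].median())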

# Encode categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
df_clean['Sex_encoded'] = le_sex.fit_transform(df_clean['Sex'])
df_clean['Embarked_encoded'] = le_embarked.fit_transform(df_clean['Embarked'])

# Prepare final feature matrix
X = df_clean[['Pclass', 'Sex_encoded', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_encoded']]
y = df_clean[target]

print("Preprocessed data shape:", X.shape)
print("Target distribution:")
print(y.value_counts(normalize=True))

Step 3: Using entropic_measurement for Bias Detection and Correction

Now we'll apply the measure_and_correct function to identify and correct measurement bias:

# Apply bias measurement and correction
print("\n=== BIAS MEASUREMENT AND CORRECTION ===")
print("Analyzing measurement bias in the Titanic dataset...\n")

# Use measure_and_correct to detect and correct bias
corrected_X, corrected_y, bias_report = measure_and_correct(
    X, y, 
    method='entropy_based',  # Use entropy-based correction
    correction_strength=0.7,  # Moderate correction strength
    return_report=True  # Get detailed bias analysis
)

print("Original dataset shape:", X.shape)
print("Corrected dataset shape:", corrected_X.shape)
print("\nBias correction completed successfully!")
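The tutorial treats measure_and_correct as a black box. To build intuition for what a correction step can look like, the sketch below implements one well-known technique, reweighing: each row gets the weight P(group) * P(label) / P(group, label), which makes group and label statistically independent in the weighted data. This is an illustration of the general idea, not the algorithm used inside entropic_measurement, and the name reweighing_weights is hypothetical.

```python
import pandas as pd

def reweighing_weights(group, label):
    """Per-row weights w = P(group) * P(label) / P(group, label)."""
    df = pd.DataFrame({"g": list(group), "y": list(label)})
    n = len(df)
    # Empirical marginal and joint probabilities, broadcast back to each row.
    p_g = df.groupby("g")["g"].transform("count") / n
    p_y = df.groupby("y")["y"].transform("count") / n
    p_gy = df.groupby(["g", "y"])["y"].transform("count") / n
    return p_g * p_y / p_gy

# Toy data: women survive far more often than men.
sex = ["female"] * 4 + ["male"] * 6
survived = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

toy = pd.DataFrame({"sex": sex, "survived": survived})
toy["weight"] = reweighing_weights(sex, survived)

# After weighting, every group has the same weighted survival rate.
for name, sub in toy.groupby("sex"):
    rate = (sub["weight"] * sub["survived"]).sum() / sub["weight"].sum()
    print(f"{name}: weighted survival rate = {rate:.2f}")
```

Weights like these can be passed to most scikit-learn estimators through the sample_weight argument of fit, which leaves the rows themselves unchanged.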

Step 4: Analyzing the Results

Let's examine what the bias correction accomplished:

# Display bias measurement results
print("\n=== BIAS ANALYSIS REPORT ===")
print(f"Original entropy: {bias_report['original_entropy']:.4f}")
print(f"Corrected entropy: {bias_report['corrected_entropy']:.4f}")
print(f"Bias reduction: {bias_report['bias_reduction']:.2%}")

# Compare target distributions
print("\n=== TARGET DISTRIBUTION COMPARISON ===")
print("Original target distribution:")
print(y.value_counts(normalize=True))
print("\nCorrected target distribution:")
print(corrected_y.value_counts(normalize=True))

# Analyze feature-wise bias correction
print("\n=== FEATURE-WISE BIAS ANALYSIS ===")
feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
for i, feature in enumerate(feature_names):
    original_mean = X.iloc[:, i].mean()
    corrected_mean = corrected_X.iloc[:, i].mean()
    bias_change = abs(corrected_mean - original_mean)
    print(f"{feature}: Original={original_mean:.3f}, Corrected={corrected_mean:.3f}, Change={bias_change:.3f}")
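Comparing means can hide changes in the shape of a distribution. As a complementary check, the sketch below estimates the Kullback-Leibler divergence between the histograms of a feature before and after correction. The helper histogram_kl is hypothetical and not part of entropic_measurement; the bin count and smoothing constant are arbitrary choices.

```python
import numpy as np

def histogram_kl(original, corrected, bins=20, eps=1e-9):
    """Approximate KL(original || corrected) from histograms on shared bins."""
    original = np.asarray(original, dtype=float)
    corrected = np.asarray(corrected, dtype=float)
    lo = min(original.min(), corrected.min())
    hi = max(original.max(), corrected.max())
    p, _ = np.histogram(original, bins=bins, range=(lo, hi))
    q, _ = np.histogram(corrected, bins=bins, range=(lo, hi))
    # Smooth with eps so empty bins do not produce log(0) or division by zero.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Synthetic stand-in for an age column and a shifted version of it.
rng = np.random.default_rng(0)
ages = rng.normal(30, 10, size=1000)

print(f"Same data:    {histogram_kl(ages, ages):.4f}")
print(f"Shifted by 5: {histogram_kl(ages, ages + 5):.4f}")
```

With the objects from this tutorial, a call such as histogram_kl(X['Age'], corrected_X['Age']) would give one number per feature; a value of 0 means the two histograms are identical.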

Step 5: Visualizing the Impact

Create visualizations to understand the bias correction effects:

# Create comparison plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

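# Target distribution comparison
# value_counts() sorts by frequency, so reindex by class label to make sure
# position 0 is always 'Died' (0) and position 1 is always 'Survived' (1).
orig_dist = y.value_counts(normalize=True).reindex([0, 1], fill_value=0)
corr_dist = corrected_y.value_counts(normalize=True).reindex([0, 1], fill_value=0)
axes[0, 0].bar(['Died', 'Survived'], orig_dist, alpha=0.7, label='Original')
axes[0, 0].bar(['Died', 'Survived'], corr_dist, alpha=0.7, label='Corrected')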
axes[0, 0].set_title('Target Distribution: Original vs Corrected')
axes[0, 0].set_ylabel('Proportion')
axes[0, 0].legend()

# Age distribution comparison
axes[0, 1].hist(X['Age'], bins=20, alpha=0.7, label='Original', density=True)
axes[0, 1].hist(corrected_X['Age'], bins=20, alpha=0.7, label='Corrected', density=True)
axes[0, 1].set_title('Age Distribution: Original vs Corrected')
axes[0, 1].set_xlabel('Age')
axes[0, 1].set_ylabel('Density')
axes[0, 1].legend()

# Fare distribution comparison
axes[1, 0].hist(X['Fare'], bins=20, alpha=0.7, label='Original', density=True)
axes[1, 0].hist(corrected_X['Fare'], bins=20, alpha=0.7, label='Corrected', density=True)
axes[1, 0].set_title('Fare Distribution: Original vs Corrected')
axes[1, 0].set_xlabel('Fare')
axes[1, 0].set_ylabel('Density')
axes[1, 0].legend()

# Bias reduction visualization
metrics = ['Original Entropy', 'Corrected Entropy']
values = [bias_report['original_entropy'], bias_report['corrected_entropy']]
axes[1, 1].bar(metrics, values, color=['red', 'green'], alpha=0.7)
axes[1, 1].set_title('Entropy Comparison')
axes[1, 1].set_ylabel('Entropy Value')

plt.tight_layout()
plt.show()

print(f"\nVisualization complete! Bias reduction achieved: {bias_report['bias_reduction']:.2%}")

Step 6: Model Performance Comparison

Let's train models on both original and corrected datasets to see the impact:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Train models on original and corrected data
rf_original = RandomForestClassifier(n_estimators=100, random_state=42)
rf_corrected = RandomForestClassifier(n_estimators=100, random_state=42)

# Cross-validation scores
original_scores = cross_val_score(rf_original, X, y, cv=5, scoring='accuracy')
corrected_scores = cross_val_score(rf_corrected, corrected_X, corrected_y, cv=5, scoring='accuracy')

print("\n=== MODEL PERFORMANCE COMPARISON ===")
print(f"Original dataset - Mean CV Accuracy: {original_scores.mean():.4f} (+/- {original_scores.std() * 2:.4f})")
print(f"Corrected dataset - Mean CV Accuracy: {corrected_scores.mean():.4f} (+/- {corrected_scores.std() * 2:.4f})")
print(f"Accuracy difference (corrected - original): {(corrected_scores.mean() - original_scores.mean()):+.4f}")

# Train final models for detailed comparison
rf_original.fit(X, y)
rf_corrected.fit(corrected_X, corrected_y)

print("\nModels trained successfully! Bias correction has been applied.")
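The two cross-validation scores above are measured on different data, so they are not directly comparable. One way to get a like-for-like number is to hold out part of the original data, apply the correction to the training part only, and score both models on the same untouched test set. The sketch below shows that pattern. compare_on_shared_holdout and its correct_fn parameter are hypothetical names, and the demo uses synthetic data with an identity "correction" so that it runs without the library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def compare_on_shared_holdout(X, y, correct_fn, seed=42):
    """Train baseline and corrected models; score both on the same test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    baseline = RandomForestClassifier(n_estimators=100, random_state=seed)
    baseline.fit(X_train, y_train)

    # The correction sees only training rows; the test rows stay untouched.
    X_corr, y_corr = correct_fn(X_train, y_train)
    corrected = RandomForestClassifier(n_estimators=100, random_state=seed)
    corrected.fit(X_corr, y_corr)

    return {
        "baseline": accuracy_score(y_test, baseline.predict(X_test)),
        "corrected": accuracy_score(y_test, corrected.predict(X_test)),
    }

# Demo on synthetic data with a no-op correction (placeholder only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

scores = compare_on_shared_holdout(X_demo, y_demo, lambda X, y: (X, y))
print(scores)
```

With this tutorial's data, correct_fn could wrap the library call, for example lambda Xt, yt: measure_and_correct(Xt, yt, return_report=True)[:2], assuming the three-value return shown in Step 3.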

Key Insights and Interpretation

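Here's how to interpret the output of the steps above:

Entropy Change: Compare original_entropy and corrected_entropy in the report. Keep in mind that the Shannon entropy of a class distribution is highest when the classes are balanced, so whether a lower value means less bias depends on which quantity the library reports. Check its documentation before drawing conclusions from the number.
Feature Balance: The feature-wise means and histograms show how far the correction moved each distribution. Large shifts in Sex or Pclass suggest that those groups were re-balanced.
Fairness: A corrected dataset may yield more equitable predictions across demographic groups, but this has to be verified by comparing per-group metrics, not assumed.
Performance Impact: Accuracy can go up or down. The two cross-validation scores in Step 6 are computed on different data (original vs. corrected), so treat their difference as indicative only.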
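Fairness claims are easiest to judge with a per-group metric. The sketch below computes the positive-prediction rate for each group and the gap between the highest and lowest rate (often called the demographic parity difference). The helper selection_rate_gap is hypothetical and the predictions are toy values, not results from the Titanic models.

```python
import pandas as pd

def selection_rate_gap(predictions, groups):
    """Return per-group positive-prediction rates and the max-min gap."""
    frame = pd.DataFrame({"pred": list(predictions), "group": list(groups)})
    rates = frame.groupby("group")["pred"].mean()
    return rates, float(rates.max() - rates.min())

# Toy predictions for eight passengers.
preds = [1, 1, 0, 1, 0, 0, 1, 0]
sex = ["female"] * 4 + ["male"] * 4

rates, gap = selection_rate_gap(preds, sex)
print(rates)
print(f"Selection-rate gap: {gap:.2f}")
```

Running this on the predictions of rf_original and rf_corrected for the same passengers, grouped by df_clean['Sex'], would show whether the correction actually narrowed the gap.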

Complete Working Example

Here's a condensed version you can run immediately:

# Complete working example
import pandas as pd
from entropic_measurement import measure_and_correct
from sklearn.preprocessing import LabelEncoder

# Load and preprocess Titanic data
train_df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Quick preprocessing
df = train_df[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].copy()
# Assign back instead of using fillna(inplace=True) on a column selection
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])

X = df.drop('Survived', axis=1)
y = df['Survived']

# Apply bias correction
corrected_X, corrected_y, report = measure_and_correct(X, y, return_report=True)

print(f"Bias reduction achieved: {report['bias_reduction']:.2%}")
print(f"Original entropy: {report['original_entropy']:.4f}")
print(f"Corrected entropy: {report['corrected_entropy']:.4f}")

Conclusion and Next Steps

Congratulations! You've successfully applied bias correction to a real-world dataset using the entropic_measurement library. This example demonstrated how measurement bias can be quantified and corrected, leading to more fair and robust machine learning models.

Key takeaways:

Measurement bias is common in real datasets and can significantly impact model performance
The measure_and_correct function provides an automated way to detect and correct bias
Bias correction often leads to more balanced datasets and fairer model predictions
Entropy-based methods offer a principled approach to measuring information content and bias

We encourage you to explore bias correction on your own datasets! Try applying these techniques to:

Your own Kaggle competition datasets
Company datasets where fairness is crucial
Any classification problem where you suspect measurement bias
Different types of bias (selection bias, sampling bias, etc.)

The entropic_measurement library is designed to be flexible and work with various data types and bias scenarios. Experiment with different correction strengths and methods to find what works best for your specific use case. Remember, the goal isn't just better accuracy; it's building more ethical, fair, and robust AI systems.

Happy bias correcting! 🚀
