# Exploratory Data Analysis (EDA) for Bias Detection

In this notebook, we use the Adult Income dataset to look for signs of bias in the data, focusing on how income level varies across gender, race, and age.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
url = 'https://raw.githubusercontent.com/selva86/datasets/master/Adult.csv'
df = pd.read_csv(url)
df.head()

In [None]:
# Quick overview
df.info()
df.describe(include='all')
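
`df.info()` reports only true nulls, so placeholder strings standing in for missing values can go unnoticed. Copies of the Adult dataset commonly mark missing entries with `?`; whether this particular CSV does is an assumption to check. The sketch below counts such placeholders per column on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy data for illustration only; not drawn from the Adult dataset.
toy = pd.DataFrame({
    'workclass': ['Private', '?', 'State-gov', ' ?'],
    'occupation': ['Sales', 'Tech-support', '?', 'Sales'],
    'age': [39, 50, 38, 53],
})

# Compare every value as stripped text so ' ?' and '?' are both caught.
placeholder_counts = toy.apply(
    lambda col: col.astype(str).str.strip().eq('?').sum()
)
print(placeholder_counts)
```

If the counts differ noticeably between demographic groups, the missingness itself may be a source of bias worth examining.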

In [None]:
# Gender distribution
sns.countplot(x='sex', data=df)
plt.title('Gender Distribution')
plt.show()

# Income comparison by gender
sns.countplot(x='sex', hue='income', data=df)
plt.title('Income Level by Gender')
plt.show()
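
Count plots show raw counts, which can mislead when groups differ in size: a larger group will have more high earners even if its rate is the same. A minimal sketch of comparing rates instead, using a hypothetical helper `positive_rate_by_group` and made-up toy data (the label `'>50K'` is an assumption about how the income column is encoded):

```python
import pandas as pd

def positive_rate_by_group(frame, group_col, target_col, positive_label):
    """Return, per group, the share of rows whose target equals positive_label."""
    is_positive = frame[target_col].eq(positive_label)
    return is_positive.groupby(frame[group_col]).mean()

# Toy data for illustration only; not drawn from the Adult dataset.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'income': ['>50K', '>50K', '<=50K', '<=50K', '>50K', '<=50K'],
})

# Counts differ (2 vs. 1 high earners), but the rates can still be compared fairly.
rates = positive_rate_by_group(toy, 'sex', 'income', '>50K')
print(rates)
```

On the notebook's `df`, the equivalent call would be `positive_rate_by_group(df, 'sex', 'income', ...)`, with the positive label set to whatever value the `income` column uses for the higher bracket.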

In [None]:
# Income comparison by race
plt.figure(figsize=(10, 4))
sns.countplot(x='race', hue='income', data=df)
plt.title('Income Level by Race')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Age distribution by income
sns.histplot(data=df, x='age', hue='income', bins=30, kde=True)
plt.title('Age Distribution by Income')
plt.show()

In [None]:
# One-hot encode categorical features as 0/1 integers.
# get_dummies returns a new DataFrame, so df itself is left unchanged.
df_encoded = pd.get_dummies(df, drop_first=True, dtype=int)

# Correlation heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(df_encoded.corr(), cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
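
With many one-hot columns, a full heatmap can be hard to read. One alternative is to rank features by their correlation with the encoded income column. The sketch below does this on made-up toy data; the target column name `income_>50K` follows from how `get_dummies` names columns for these toy labels and is an assumption for the real dataset.

```python
import pandas as pd

# Toy data for illustration only; not drawn from the Adult dataset.
toy = pd.DataFrame({
    'age': [25, 38, 47, 52, 29, 61],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'income': ['<=50K', '<=50K', '>50K', '>50K', '<=50K', '>50K'],
})

encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
target_col = 'income_>50K'  # assumed name; check encoded.columns on real data

# Correlation of every feature with the target, strongest first.
corr_with_target = (
    encoded.corr()[target_col]
    .drop(target_col)
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

A strong correlation between a protected attribute and the target is a prompt for closer inspection, not proof of bias on its own.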

### ✅ Summary

- We visualized income disparities across gender and race
- We used count plots and a histogram to look for potential bias in group representation and outcomes
- Correlation heatmaps help explore relationships in encoded data

EDA is a first step in detecting fairness issues before training an AI model.
