# Bias and Fairness Assessment (Binary Classification: Adult Income)

## Dataset Overview: UCI Adult Income Dataset
The **Adult Income dataset** (also known as the **Census Income** dataset) originates from the **UCI Machine Learning Repository**. It was extracted from the 1994 U.S. Census database and is widely used for benchmarking classification models, especially in fairness and bias research.

The task is to **predict whether an individual earns more than $50K per year** based on features such as age, education, occupation, and marital status.

- Target variable: income (binary: <=50K or >50K)

- Samples: 48,842

- Features: 14 demographic and employment-related attributes

- Use case: Benchmarking algorithms, fairness audits, and bias mitigation

Due to its inclusion of sensitive attributes (e.g., sex, race), it’s commonly used in studies evaluating algorithmic fairness and disparate impact.



In this notebook, we’ll train an XGBoost model to predict whether an individual’s annual income exceeds \$50K and then evaluate its performance and fairness across different demographic groups.

### Step 1: Install and import dependencies


In [None]:
! pip install equiboots

In [None]:
! pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [None]:
# fetch dataset
adult = fetch_ucirepo(id=2)
adult = adult.data.features.join(adult.data.targets, how="inner")

In [None]:
adult

## Basic Preprocessing Steps

### 1. Drop missing values

In [None]:
# Drop missing values
adult.dropna(inplace=True)
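
Some distributions of the Adult data mark missing entries with a `?` string rather than `NaN`, in which case `dropna` alone would leave those rows in place. The sketch below uses a hypothetical toy frame (not the fetched data) to show one way to normalize such placeholders first. Whether the `ucimlrepo` copy needs this is an assumption worth checking, for example with `adult.isin(["?"]).sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Adult data
toy = pd.DataFrame(
    {
        "workclass": ["Private", "?", "State-gov", None],
        "age": [39, 50, 38, 53],
    }
)

# Treat the "?" placeholder as missing, then drop incomplete rows
cleaned = toy.replace("?", np.nan).dropna()
print(cleaned)
```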

### 2. Copy the DataFrame to preserve the original

In [None]:
df = adult.copy()

In [None]:
adult['income'].value_counts()

### 3. Encode the target variable

In [None]:
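def outcome_merge(val):
    # The raw labels appear with and without a trailing period,
    # so both spellings of each class map to the same integer
    if val == '<=50K' or val == '<=50K.':
        return 0
    else:
        return 1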

In [None]:
df['income'] = df['income'].apply(outcome_merge)
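
The same merge can be written without a row-wise `apply`. The sketch below runs on a toy Series of the four label spellings that `outcome_merge` handles; it is an illustrative alternative, not a required change.

```python
import pandas as pd

# Toy labels mirroring the two spellings of each class
labels = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."])

# Strip the trailing period, then map each class to 0/1
encoded = labels.str.rstrip(".").map({"<=50K": 0, ">50K": 1})
print(encoded.tolist())  # [0, 0, 1, 1]
```

One difference from `outcome_merge`: an unexpected label becomes `NaN` here instead of silently falling into class 1, which makes stray values easier to spot.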

### 4. Split the data

In [None]:
# Split data
X = df.drop("income", axis=1)
y = df["income"]

In [None]:
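# Convert only the non-numeric (text) columns to the category dtype.
# Note: isinstance(X[col], object) is True for every column, so the
# check has to look at the column's dtype instead.
for col in X.columns:
    if not pd.api.types.is_numeric_dtype(X[col]):
        X[col] = X[col].astype("category")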
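
Because every Python value is an instance of `object`, it is the column's dtype that distinguishes text columns from numeric ones. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"age": [39, 50], "sex": ["Male", "Female"]})

# isinstance(..., object) holds for any Python value, so it cannot
# separate text columns from numeric ones
print(isinstance(toy["age"], object))  # True

# Pick out the non-numeric columns by dtype and convert only those
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
for c in text_cols:
    toy[c] = toy[c].astype("category")

print(toy.dtypes)
```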

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)


## Train XGBoost Model

In [None]:
y_train.value_counts()

In [None]:
model = XGBClassifier(
    eval_metric='logloss',
    random_state=42,
    enable_categorical=True
)
model.fit(X_train, y_train)

## Evaluate XGBoost Model

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)
print(classification_report(y_test, y_pred))

# Bias and Fairness Analysis with EquiBoots

In [None]:
import equiboots as eqb

## Point Estimates

In [None]:
# get predictions and true values
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
y_test = y_test.to_numpy()

X_test[['race', 'sex']] = X_test[['race', 'sex']].astype(str)


# Create fairness DataFrame
fairness_df = X_test[['race', 'sex']].reset_index()

eq = eqb.EquiBoots(
    y_true=y_test,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
)
eq.grouper(groupings_vars=["race", "sex"])

In [None]:
sliced_race_data = eq.slicer("race")
race_metrics = eq.get_metrics(sliced_race_data)

sliced_sex_data = eq.slicer("sex")
sex_metrics = eq.get_metrics(sliced_sex_data)
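
Conceptually, slicing by a variable and computing metrics amounts to filtering the arrays per group and scoring each subset. The sketch below illustrates that idea with plain NumPy on made-up arrays. It is not EquiBoots's implementation, and the helper name `toy_group_metrics` is hypothetical.

```python
import numpy as np

def toy_group_metrics(y_true, y_pred, groups):
    """Accuracy, precision and recall per group (illustrative helper)."""
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        tp = int(np.sum((yt == 1) & (yp == 1)))
        fp = int(np.sum((yt == 0) & (yp == 1)))
        fn = int(np.sum((yt == 1) & (yp == 0)))
        out[str(g)] = {
            "Accuracy": float(np.mean(yt == yp)),
            "Precision": tp / (tp + fp) if tp + fp else float("nan"),
            "Recall": tp / (tp + fn) if tp + fn else float("nan"),
        }
    return out

# Made-up labels and predictions for two groups
toy_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
toy_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
toy_groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

toy_metrics = toy_group_metrics(toy_true, toy_pred, toy_groups)
print(toy_metrics)
```

Both toy groups have the same accuracy here but different precision and recall, which is the kind of gap a single overall score would hide.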

In [None]:
test_config = {
    "test_type": "chi_square",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
    "confidence_level": 0.95,
    "classification_task": "binary_classification",
}
stat_test_results_race = eq.analyze_statistical_significance(
    race_metrics, "race", test_config
)

stat_test_results_sex = eq.analyze_statistical_significance(
    sex_metrics, "sex", test_config
)
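
For intuition, the sketch below computes a Pearson chi-square statistic by hand and applies a Bonferroni adjustment to `alpha`. The contingency table (groups by correct/incorrect counts) and the number of comparisons are assumptions made for illustration; which table EquiBoots builds internally is not shown in this notebook.

```python
import numpy as np

# Hypothetical contingency table: rows are groups, columns are
# (correct, incorrect) prediction counts
observed = np.array([[80, 20], [60, 40]], dtype=float)

# Expected counts if group and outcome were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Bonferroni: divide alpha by the number of comparisons made
alpha, n_comparisons = 0.05, 4  # 4 is an arbitrary illustrative count
alpha_adjusted = alpha / n_comparisons
print(chi2_stat, dof, alpha_adjusted)
```

In practice `scipy.stats.chi2_contingency` returns the statistic together with a p-value; note that it applies a continuity correction to 2x2 tables by default, so its statistic would differ slightly from the uncorrected value computed here.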

In [None]:
stat_test_results_race

In [None]:
overall_stat_results = {"sex": stat_test_results_sex, "race": stat_test_results_race}

## Significance plots
Below we plot the race and sex groups and look at how model performance differs between them.
We run statistical significance tests in two stages: first an omnibus test of whether there is any difference between the groups, marked with an asterisk (*), and then tests of which groups differ significantly, marked with a triangle (▲).

Point estimate significance was determined using the chi-squared test.


In [None]:
eqb.eq_group_metrics_point_plot(
    group_metrics=[race_metrics, sex_metrics],
    metric_cols=[
        "Accuracy",
        "Precision",
        "Recall",
    ],
    category_names=["race", "sex"],
    figsize=(6, 8),
    include_legend=True,
    plot_thresholds=(0.9, 1.1),
    raw_metrics=True,
    show_grid=True,
    y_lim=(0, 1),
    statistical_tests=overall_stat_results
)

In [None]:
from equiboots.tables import metrics_table

In [None]:
stat_metrics_table_point = metrics_table(race_metrics, statistical_tests=stat_test_results_race, reference_group="White")

In [None]:
stat_metrics_table_point

## Precision-Recall, ROC AUC and Calibration by Race
These plots show how performance differs across the race groups.
We exclude the Amer-Indian-Eskimo and Other groups from the analysis because they have too few members
for a fair comparison with the other groups.
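
One way to decide which groups are too small is to count members per group and compare against a minimum size. The sketch below uses made-up labels in place of `X_test["race"]`, and the threshold is an illustrative assumption rather than a library default.

```python
import pandas as pd

# Made-up group labels standing in for the test-set race column
toy_race = pd.Series(["White"] * 400 + ["Black"] * 200 + ["Other"] * 12)

min_size = 150  # illustrative threshold
counts = toy_race.value_counts()
small_groups = counts[counts < min_size].index.tolist()
print(small_groups)  # ['Other']
```

A list built this way could be passed as `exclude_groups` instead of hard-coding the group names.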

In [None]:
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="pr",
    title="Precision-Recall by Race Group",
    exclude_groups=["Amer-Indian-Eskimo", "Other"]
)

In [None]:
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="roc",
    title="ROC AUC by Race Group",
    # figsize=(5, 5),
    decimal_places=2,
    subplots=True,
    exclude_groups=["Amer-Indian-Eskimo", "Other"]
)

In [None]:
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="calibration",
    shade_area=True,
    title="Calibration by Race Group",
    exclude_groups=["Amer-Indian-Eskimo", "Other"]
)
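
A calibration curve compares the mean predicted probability with the observed positive rate inside each probability bin. The sketch below reproduces that comparison with NumPy on synthetic scores. The equal-width binning and the variable names are assumptions for illustration, not a description of how EquiBoots bins its curves.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores that are calibrated by construction: y ~ Bernoulli(p)
toy_prob = rng.uniform(0, 1, size=5000)
toy_true = (rng.uniform(0, 1, size=5000) < toy_prob).astype(int)

# Assign each score to one of 10 equal-width probability bins
n_bins = 10
edges = np.linspace(0, 1, n_bins + 1)
bin_ids = np.clip(np.digitize(toy_prob, edges) - 1, 0, n_bins - 1)

mean_predicted, observed_rate = [], []
for b in range(n_bins):
    mask = bin_ids == b
    if mask.any():  # skip empty bins
        mean_predicted.append(toy_prob[mask].mean())
        observed_rate.append(toy_true[mask].mean())

# Largest gap between predicted and observed rates across bins
max_gap = np.max(np.abs(np.array(mean_predicted) - np.array(observed_rate)))
print(f"max calibration gap: {max_gap:.3f}")
```

For a well-calibrated model the two lists track each other closely; a group whose observed rate sits consistently above or below its mean prediction is being under- or over-predicted.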

## Bootstrap Estimates

In [None]:
int_list = np.linspace(0, 100, num=10, dtype=int).tolist()
eq2 = eqb.EquiBoots(
    y_true=y_test,
    y_pred=y_pred,
    y_prob=y_prob,
    fairness_df=fairness_df,
    fairness_vars=["race"],
    seeds=int_list,
    reference_groups=["White"],
    task="binary_classification",
    bootstrap_flag=True,
    num_bootstraps=5001,
    boot_sample_size=1000,
    group_min_size=150,
    balanced=False,  # False is stratified, True is balanced
)

# Set seeds
eq2.set_fix_seeds(int_list)
print("seeds", eq2.seeds)

eq2.grouper(groupings_vars=["race"])

boots_race_data = eq2.slicer("race")
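
The comment on `balanced=False` describes that setting as stratified sampling. As a rough picture of what a stratified bootstrap draw can look like, the sketch below resamples indices with replacement while keeping each group's share of the data. The helper `stratified_bootstrap_indices` is hypothetical and is not EquiBoots's internal routine.

```python
import numpy as np

def stratified_bootstrap_indices(groups, sample_size, rng):
    """Resample row indices with replacement, keeping group proportions."""
    idx = []
    for g in np.unique(groups):
        members = np.flatnonzero(groups == g)
        # Each group contributes in proportion to its share of the data
        n_g = max(1, round(sample_size * len(members) / len(groups)))
        idx.append(rng.choice(members, size=n_g, replace=True))
    return np.concatenate(idx)

toy_groups = np.array(["A"] * 80 + ["B"] * 20)
rng = np.random.default_rng(42)
sample = stratified_bootstrap_indices(toy_groups, sample_size=50, rng=rng)

# 80/20 split in the data gives a 40/10 split in the resample
print(len(sample), int(np.sum(toy_groups[sample] == "B")))
```

Repeating such a draw many times and recomputing the metrics on each resample yields a distribution per group rather than a single point estimate.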


### Calculate disparities

In [None]:
race_metrics = eq2.get_metrics(boots_race_data)

In [None]:
dispa = eq2.calculate_disparities(race_metrics, "race")

## Calculating Disparity
Here we look at the disparity between each race group and the reference group, which in this case is White.
Comparing prevalence with predicted prevalence shows whether the model's rate of positive predictions differs from the observed rate of positive outcomes.

In [None]:
eqb.eq_group_metrics_plot(
    group_metrics=dispa,
    metric_cols=[
        "Accuracy_Ratio",
        "Precision_Ratio",
        "Predicted_Prevalence_Ratio",
        "Prevalence_Ratio",
        "FP_Rate_Ratio",
        "TN_Rate_Ratio",
        "Recall_Ratio",
    ],
    name="race",
    categories="all",
    plot_type="violinplot",
    color_by_group=True,
    show_grid=False,
    strict_layout=True,
    leg_cols=7,
    plot_thresholds=[0.9, 1.2],
)
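
The `_Ratio` columns suggest each group's metric divided by the reference group's value, so a ratio near 1 indicates parity. A minimal sketch of that arithmetic, using made-up recall values and the same band as `plot_thresholds` in the cell above:

```python
# Made-up recall values per group; "White" is the reference group
toy_recall = {"White": 0.60, "Black": 0.51, "Asian-Pac-Islander": 0.66}
reference = "White"
lower, upper = 0.9, 1.2  # band taken from plot_thresholds

ratios, within_band = {}, {}
for group, value in toy_recall.items():
    ratios[group] = value / toy_recall[reference]
    within_band[group] = lower <= ratios[group] <= upper
    status = "within" if within_band[group] else "outside"
    print(f"{group}: Recall_Ratio={ratios[group]:.2f} ({status} band)")
```

With bootstrapping, this ratio is computed on every resample, which is why the plot shows a distribution per group instead of one value.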

### Calculate differences in metrics

In [None]:
diffs = eq2.calculate_differences(race_metrics, "race")


### Calculate statistical significance

In [None]:
metrics_boot = ['Accuracy_diff', 'Precision_diff', 'Recall_diff', 'F1_Score_diff',
       'Specificity_diff', 'TP_Rate_diff', 'FP_Rate_diff', 'FN_Rate_diff',
       'TN_Rate_diff', 'Prevalence_diff', 'Predicted_Prevalence_diff',
       'ROC_AUC_diff', 'Average_Precision_Score_diff', 'Log_Loss_diff',
       'Brier_Score_diff', 'Calibration_AUC_diff']


test_config = {
    "test_type": "bootstrap_test",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
    "confidence_level": 0.95,
    "classification_task": "binary_classification",
    "tail_type": "two_tailed",
    "metrics": metrics_boot,
}

stat_test_results = eq2.analyze_statistical_significance(
    race_metrics, "race", test_config, diffs
)
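
One common way to test bootstrapped differences is to ask whether zero is a plausible value of the difference distribution. The sketch below does this with a percentile interval and a two-tailed p-value on synthetic differences. It illustrates the general technique only; whether the `bootstrap_test` option computes its p-values in exactly this way is not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bootstrap differences (group metric minus reference metric)
boot_diffs = rng.normal(loc=-0.05, scale=0.02, size=5000)

alpha = 0.05
# Percentile confidence interval for the difference
ci_low, ci_high = np.percentile(
    boot_diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)]
)

# Two-tailed p-value: twice the smaller tail proportion around zero
p_value = 2 * min(np.mean(boot_diffs <= 0), np.mean(boot_diffs >= 0))
significant = p_value < alpha
print(ci_low, ci_high, p_value, significant)
```

With a Bonferroni adjustment, `alpha` would be divided by the number of group comparisons before the final check.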

### Table of statistical significance (difference between metrics)

In [None]:
stat_metrics_table_diff = metrics_table(race_metrics, statistical_tests=stat_test_results, differences=diffs, reference_group="White")

In [None]:
stat_metrics_table_diff

### Plot statistical significance of the differences between metrics

This section plots the metric differences for each group.
Statistical tests determine whether these differences are statistically significant.
Statistical significance is shown with an asterisk (*).

In [None]:
eqb.eq_group_metrics_plot(
    group_metrics=diffs,
    metric_cols=metrics_boot,
    name="race",
    categories="all",
    figsize=(12, 10),
    plot_type="violinplot",
    color_by_group=True,
    show_grid=True,
    max_cols=4,
    strict_layout=True,
    save_path="./images",
    show_pass_fail=False,
    statistical_tests=stat_test_results
)