# Bias and Fairness Assessment (Binary Classification: Adult Income)

Assessing machine learning models for bias and fairness matters for several reasons:

- Prevent Discrimination
  - Avoid unfair treatment based on protected attributes.

- Meet Legal Standards
  - Ensure compliance with laws and anti-discrimination acts.

- Build Trust
  - Fair models are more accepted by users, stakeholders, and regulators.

- Expose Hidden Gaps
  - Surface performance differences across demographic subgroups.

- Promote Ethical AI
  - Prevent reinforcement of societal or historical biases in data.

- Enable Accountability
  - Make models more transparent and open to external review.

- Guide Fairness Fixes
  - Identify where to apply debiasing or fairness-enhancing techniques.

## Dataset Overview: UCI Adult Income Dataset
The **Adult Income dataset** (also known as the **Census Income** dataset) originates from the **UCI Machine Learning Repository**. It was extracted from the 1994 U.S. Census database and is widely used for benchmarking classification models, especially in fairness and bias research.

The task is to **predict whether an individual earns more than $50K per year** based on features such as age, education, occupation, and marital status.

- Target variable: income (binary: <=50K or >50K)

- Samples: 48,842

- Features: 14 demographic and employment-related attributes

- Use case: Benchmarking algorithms, fairness audits, and bias mitigation

Due to its inclusion of sensitive attributes (e.g., sex, race), it’s commonly used in studies evaluating algorithmic fairness and disparate impact.



# Modeling

In this notebook, we’ll train an XGBoost model to predict whether an individual’s annual income exceeds \$50K and then evaluate its performance and fairness across different demographic groups.

### Step 1: Install and import dependencies


In [None]:
! pip install equiboots

In [None]:
! pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [None]:
# fetch dataset
adult = fetch_ucirepo(id=2)
adult = adult.data.features.join(adult.data.targets, how="inner")

In [None]:
adult.head(3)

## Basic Preprocessing Steps

### 1. Drop missing values

In [None]:
# Drop missing values
adult.dropna(inplace=True)

### 2. Copy DataFrame for posterity

In [None]:
df = adult.copy()

In [None]:
adult["income"].value_counts()

### 3. Encode categorical variables

In [None]:
def outcome_merge(val):
    if val == "<=50K" or val == "<=50K.":
        return 0
    else:
        return 1

In [None]:
df["income"] = df["income"].apply(outcome_merge)

In [None]:
#  sex, count and percentages above_50k

income_by_sex = df.groupby("sex")["income"].agg(
    ["count", lambda x: (x.sum() / x.count()) * 100]
)
income_by_sex.columns = ["count", "percentage_above_50k"]
income_by_sex

In [None]:
#  race, count and percentages above_50k

income_by_race = df.groupby("race")["income"].agg(
    ["count", lambda x: (x.sum() / x.count()) * 100]
)
income_by_race.columns = ["count", "percentage_above_50k"]
income_by_race

### 4. Split the data

In [None]:
# Split data
X = df.drop("income", axis=1)
y = df["income"]

In [None]:
for col in X.columns:
    # convert only string (object-dtype) columns; numeric columns stay numeric
    if X[col].dtype == object:
        X[col] = X[col].astype("category")

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

In [None]:
y_train.value_counts()

## Train XGBoost Model

In [None]:
model = XGBClassifier(eval_metric="logloss", random_state=42, enable_categorical=True)
model.fit(X_train, y_train)

## Evaluate XGBoost Model

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)
print(classification_report(y_test, y_pred))

# Bias and Fairness Analysis with EquiBoots

**EquiBoots supports point-estimate fairness analysis at a model's operating point (e.g., the optimal threshold) as well as analysis over multiple bootstrap samples drawn with replacement.**


To initialize an analysis with EquiBoots:

1. Define a fairness DataFrame with the variables of interest.
2. Initialize an EquiBoots object using:
    - Ground truth (y_true)
    - Model probabilities (y_prob)
    - Model predictions (y_pred)
3. Identify the columns/variables that we will be assessing (e.g., race, sex).

In [None]:
import equiboots as eqb

## Point Estimates

In [None]:
# get predictions and true values
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
y_test = y_test.to_numpy()

X_test[["race", "sex"]] = X_test[["race", "sex"]].astype(str)


# Create fairness DataFrame
fairness_df = X_test[["race", "sex"]].reset_index()

eq = eqb.EquiBoots(
    y_true=y_test,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
)

# grouping by variables' groups (e.g., Male, Female, etc)
eq.grouper(groupings_vars=["race", "sex"])

In [None]:
# slicing data by variable of interest
sliced_race_data = eq.slicer("race")

# generating performance metrics (dependent on prediction task, equiboots default task="binary_classification")
race_metrics = eq.get_metrics(sliced_race_data)

# slicing and generating metrics for sex
sliced_sex_data = eq.slicer("sex")
sex_metrics = eq.get_metrics(sliced_sex_data)

In [None]:
# generate statistical significance tests
test_config = {
    "test_type": "chi_square",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
    "confidence_level": 0.95,
    "classification_task": "binary_classification",
}

# stat test race
stat_test_results_race = eq.analyze_statistical_significance(
    race_metrics, "race", test_config
)

# stat test sex
stat_test_results_sex = eq.analyze_statistical_significance(
    sex_metrics, "sex", test_config
)

In [None]:
stat_test_results_race

## Significance plots

- EquiBoots supports statistical testing to assess whether differences in metrics between groups are significant.

- Specifically, omnibus and pairwise chi-square tests are used to assess significance between groups.

- Reference groups to compare against can be provided when the EquiBoots object is initialized:
  - using `reference_groups=["White", "Female"]`,
  - otherwise, the group with the highest number of observations is selected automatically.

- Below we plot the race and sex groups and look at how performance differs across them.

- We conduct statistical significance tests to determine:
  - first, whether there is any difference between the groups (omnibus test); this is represented by the asterisk (*),
  - and, if the omnibus test is significant, which groups differ significantly from the reference group; these are shown with the (▲).
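
The omnibus-then-pairwise procedure described above can be sketched with SciPy. This is only an illustration of the general technique, not EquiBoots' internal implementation: the group names, the counts, and the choice of a correct/incorrect contingency table are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts of (correct, incorrect) predictions per group.
toy_counts = {
    "A": [900, 100],  # reference group
    "B": [160, 40],
    "C": [85, 15],
}
toy_reference = "A"
toy_alpha = 0.05

# Omnibus test: is prediction correctness associated with group membership at all?
toy_table = np.array(list(toy_counts.values()))
_, p_omnibus, dof, _ = chi2_contingency(toy_table)

# Pairwise tests against the reference group, run only if the omnibus test is significant.
pairwise_adjusted = {}
if p_omnibus < toy_alpha:
    others = [g for g in toy_counts if g != toy_reference]
    for g in others:
        pair_table = np.array([toy_counts[toy_reference], toy_counts[g]])
        _, p_pair, _, _ = chi2_contingency(pair_table)
        # Bonferroni adjustment: multiply by the number of comparisons, cap at 1.
        pairwise_adjusted[g] = min(p_pair * len(others), 1.0)

print("omnibus p-value:", p_omnibus)
print("Bonferroni-adjusted pairwise p-values:", pairwise_adjusted)
```

Gating the pairwise tests on the omnibus result limits the number of comparisons that are made when there is no evidence of any group difference in the first place.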



In [None]:
overall_stat_results = {
    "sex": stat_test_results_sex,
    "race": stat_test_results_race,
}

In [None]:
eqb.eq_group_metrics_point_plot(
    group_metrics=[race_metrics, sex_metrics],
    metric_cols=[
        "Accuracy",
        "Precision",
        "Recall",
    ],
    category_names=["race", "sex"],
    figsize=(6, 8),
    include_legend=True,
    plot_thresholds=(0.9, 1.1),
    raw_metrics=True,
    show_grid=True,
    y_lim=(0, 1),
    statistical_tests=overall_stat_results,
    y_lims={(0, 0): (0.70, 1.0), (0, 1): (0.70, 1.0)},
)

In [None]:
from equiboots.tables import metrics_table

In [None]:
stat_metrics_table_point = metrics_table(
    race_metrics, statistical_tests=stat_test_results_race, reference_group="White"
)

In [None]:
# table with metrics per group and statistical significance shown on columns for
# omnibus and/or pairwise
stat_metrics_table_point

## Forest Plots

**Forest plots** within this context provide a clear way to visualize point estimates across multiple groups, making it easy to compare performance metrics side by side. Below is but one example. Available metrics are as follows:


`'Accuracy'`, `'Precision'`, `'Recall'`, `'F1 Score'`, `'Specificity'`,  
`'TP Rate'`, `'FP Rate'`, `'FN Rate'`, `'TN Rate'`, `'TP'`, `'FP'`,  
`'FN'`, `'TN'`, `'Prevalence'`, `'Predicted Prevalence'`, `'ROC AUC'`,  
`'Average Precision Score'`, `'Log Loss'`, `'Brier Score'`, `'Calibration AUC'`.
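
For orientation, a few of the count-based metrics listed above can be computed by hand from a confusion matrix. The sketch below uses the standard textbook definitions and toy labels; it is assumed, not confirmed, that EquiBoots computes them in the same way.

```python
import numpy as np

# Toy labels and predictions (hypothetical, not taken from the Adult dataset).
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Confusion-matrix counts.
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))
n = tp + fp + fn + tn

metrics_toy = {
    "Prevalence": (tp + fn) / n,            # share of actual positives
    "Predicted Prevalence": (tp + fp) / n,  # share of predicted positives
    "TP Rate": tp / (tp + fn),              # same quantity as Recall
    "FP Rate": fp / (fp + tn),
}
print(metrics_toy)
```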


In [None]:
eqb.eq_plot_metrics_forest(
    group_metrics=race_metrics,
    metric_name="Prevalence",
    title="Forest Plot: Race Group Point Estimates",
    reference_group="White",
    figsize=(8, 6),
    sort_groups=True,
    ascending=False,
    statistical_tests=stat_test_results_race,
    save_path="./images",
    filename="prevalance_forest_point_est_race",
)

## Effect Size

EquiBoots also calculates effect size when we are dealing with point estimates. In this case, the effect size for all of the results is low (under 0.2), with the highest being 0.11. This indicates that although statistical significance was found, it is not necessarily a strong finding.

According to [source](www.ibm.com/docs/en/cognos-analytics/12.0.x), for Cramer's V:

- ES ≤ 0.2 is interpreted as a weak result.
- 0.2 < ES ≤ 0.6 is interpreted as a moderate result.
- ES > 0.6 is interpreted as a strong result.
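
As a rough illustration of where such a number comes from, Cramér's V can be derived from the chi-square statistic as $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for an $r \times c$ contingency table with $n$ observations. The sketch below applies that formula and the thresholds above to a made-up table; the helper names `cramers_v` and `interpret_effect_size` are hypothetical and are not part of EquiBoots.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for an r x c contingency table (no continuity correction)."""
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


def interpret_effect_size(es):
    """Map an effect size to the weak / moderate / strong bands listed above."""
    if es <= 0.2:
        return "weak"
    if es <= 0.6:
        return "moderate"
    return "strong"


# Hypothetical (correct, incorrect) counts for two groups.
toy_table = [[900, 100], [160, 40]]
v_toy = cramers_v(toy_table)
print(round(v_toy, 3), interpret_effect_size(v_toy))
```

With large samples, even a small effect like this one can be statistically significant, which is why effect size is reported alongside the p-value.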

In [None]:
eqb.plot_effect_sizes(
    stat_test_results_race,
    xlabel="Race & Ethnicity",
    ylabel="Effect size",
    title="Race/Ethnicity Effect Sizes",
    figsize=(10, 4),
    # rotation=0,
    save_path="./images",
    filename="race_effect_size",
)

## Concluding remarks for the Point Estimate analysis on the model's Operating point

- There are statistically significant differences in Accuracy, Precision, and Recall across both race and sex categories (based on the omnibus test).
- Pairwise analysis shows that these differences are statistically significant for all racial groups when compared to the reference group (White), except for Asian-Pac-Islander.


## Precision-Recall, ROC, and Calibration Curves by Race
These plots show how performance differs across race groups.
We exclude certain groups from the analysis because they have too few members to allow a fair comparison between groups.

In [None]:
# PR curves
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="pr",
    subplots=False,
    figsize=(7, 7),
    title="Precision-Recall by Race Group",
    exclude_groups=["Amer-Indian-Eskimo", "Other"],
)

In [None]:
# ROC curves
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="roc",
    title="ROC AUC by Race Group",
    figsize=(7, 7),
    decimal_places=2,
    subplots=False,
    exclude_groups=["Amer-Indian-Eskimo", "Other"],
)

In [None]:
# calibration curves
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="calibration",
    shade_area=True,
    title="Calibration by Race Group",
    exclude_groups=["Amer-Indian-Eskimo", "Other"],
    subplots=False,
)

In [None]:
# calibration curves per group
eqb.eq_plot_group_curves(
    sliced_race_data,
    curve_type="calibration",
    shade_area=True,
    title="Calibration by Race Group",
    exclude_groups=["Amer-Indian-Eskimo", "Other"],
    subplots=True,
    n_cols=3,
)

### Concluding remarks for the Point Estimate analysis on the model's ROC, PR, and Calibration Curves

- The PR AUC shows no visible differences across race groups.
- The ROC AUC is higher for the Black population.
- In terms of calibration curves, the least calibrated group is Asian-Pac-Islander.


## Bootstrap Estimates

Bootstrap estimates are obtained by randomly sampling, with replacement, from `fairness_df`, `y_true`, `y_prob`, and `y_pred`.
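
The resampling idea can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not how EquiBoots implements it: the toy arrays and the helper `bootstrap_indices` are hypothetical. The key point is that the same resampled row indices are applied to every array, so labels, predictions, and group membership stay aligned.

```python
import numpy as np


def bootstrap_indices(n_samples, n_boot, seed=0):
    """Return an (n_boot, n_samples) array of row indices drawn with replacement."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_samples, size=(n_boot, n_samples))


# Toy data (hypothetical).
y_true_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1])

idx = bootstrap_indices(len(y_true_toy), n_boot=1000)

# One accuracy value per bootstrap sample; indexing both arrays with the
# same idx keeps each (label, prediction) pair together.
boot_accuracy = (y_true_toy[idx] == y_pred_toy[idx]).mean(axis=1)

# Percentile-based 95% interval over the bootstrap distribution.
low, high = np.percentile(boot_accuracy, [2.5, 97.5])
print(f"accuracy 95% interval: [{low:.2f}, {high:.2f}]")
```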

In [None]:
# setting fixed seed for reproducibility
# Alternatively, seeds can be set after initialization
int_list = np.linspace(0, len(y_test), num=len(y_test), dtype=int).tolist()

eq2 = eqb.EquiBoots(
    y_true=y_test,
    y_pred=y_pred,
    y_prob=y_prob,
    fairness_df=fairness_df,
    fairness_vars=["race"],
    seeds=int_list,
    reference_groups=["White"],
    task="binary_classification",
    bootstrap_flag=True,
    num_bootstraps=5001,
    boot_sample_size=len(y_test),  # whole length of test set
    group_min_size=150,  # any group with samples below this number will be ignored
    balanced=False,  # False is stratified (i.e., maintaining groups proportions), True is balanced (equal proportions)
    stratify_by_outcome=False,  # True maintain initial dataset outcome proportions per group
)

# Set seeds after initialization
eq2.set_fix_seeds(int_list)
print("seeds", eq2.seeds)

# group bootstraps by grouping variables (e.g., race)
eq2.grouper(groupings_vars=["race"])

# slice by variable and assign to a variable
# race related bootstraps
boots_race_data = eq2.slicer("race")

### Calculate disparities

- In the context of bias and fairness in machine learning, disparity refers to the differences in model performance, predictions, or outcomes across different demographic or sensitive groups.
- It quantifies how a model's behavior varies for subgroups based on attributes like race, sex, age, or other characteristics.

Here's how you can represent the disparity ratio for a given metric (M) and a specific group (G) compared to a reference group (R):

$$\text{Disparity Ratio} = \frac{M(G)}{M(R)}$$

And here's how you can represent the disparity difference:

$$\text{Disparity Difference} = M(G) - M(R)$$

Where:

- $M(G)$ is the value of the metric for group G.
- $M(R)$ is the value of the metric for the reference group R.


For example, if you are looking at the "Predicted Prevalence" metric (the proportion of individuals predicted to have a positive outcome), the Predicted Prevalence Disparity Ratio for a group (e.g., "Black") compared to a reference group (e.g., "White") would be:

- $$ \text{Predicted Prevalence Disparity Ratio} = \frac{\text{Predicted Prevalence (Black)}}{\text{Predicted Prevalence (White)}} $$

- $$ \text{Predicted Prevalence Disparity Difference} = \text{Predicted Prevalence (Black)}-\text{Predicted Prevalence (White)} $$
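
As a small worked example of the two formulas, the sketch below plugs in made-up predicted prevalence values. The numbers and the helper names are hypothetical and are not results from this notebook.

```python
def disparity_ratio(m_group, m_ref):
    """M(G) / M(R); undefined when the reference metric is zero."""
    return m_group / m_ref


def disparity_difference(m_group, m_ref):
    """M(G) - M(R)."""
    return m_group - m_ref


# Hypothetical predicted prevalence per group.
predicted_prevalence_toy = {"White": 0.20, "Black": 0.10}

ratio_toy = disparity_ratio(
    predicted_prevalence_toy["Black"], predicted_prevalence_toy["White"]
)
diff_toy = disparity_difference(
    predicted_prevalence_toy["Black"], predicted_prevalence_toy["White"]
)
print(f"ratio={ratio_toy:.2f}, difference={diff_toy:.2f}")
```

A ratio of 1 and a difference of 0 both indicate parity with the reference group; the ratio expresses the gap in relative terms, the difference in absolute terms.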

### Takeaway

- Disparity analysis is crucial for identifying potential unfairness in a model's predictions and understanding how it impacts different populations.
- Tools like EquiBoots help to quantify and visualize these disparities, allowing for a more informed assessment of model fairness.

In [None]:
# compute binary classification metrics wrt to race
boots_race_metrics = eq2.get_metrics(boots_race_data)

In [None]:
dispa = eq2.calculate_disparities(boots_race_metrics, "race")

## Calculating Disparity
Here we look at the disparity between the reference group, which in this case is White, and the other race groups.
Comparing prevalence with predicted prevalence shows **whether** the two differ. In this case, we do not see a noticeable difference between predicted prevalence and actual prevalence.

In [None]:
eqb.eq_group_metrics_plot(
    group_metrics=dispa,
    metric_cols=[
        "Accuracy_Ratio",
        "Precision_Ratio",
        "Predicted_Prevalence_Ratio",
        "Prevalence_Ratio",
        "FP_Rate_Ratio",
        "TN_Rate_Ratio",
        "Recall_Ratio",
    ],
    name="race",
    categories="all",
    plot_type="violinplot",
    color_by_group=True,
    show_grid=False,
    strict_layout=True,
    leg_cols=7,
    # plot_thresholds=[0.9, 1.2],
)

### Concluding Remarks for the Bootstrap Disparity Ratios

- Prevalence:
  - Across all bootstraps, the Black population is around 50% less likely to have a higher income than the reference group.
  - The Asian-Pac-Islander group shows a multimodal distribution of ratios, with the largest mode close to a ratio of 1 and two other modes: one around 1.3 times the reference White group and another around 1.7 times.
- Predicted prevalence:
  - shows the same behavior, suggesting that the model is following inherent disparities in the income domain.
- The false positive rate in the Black population is around 50% lower than in the reference group.
- In the remaining charts, the disparity distributions overlap the reference value (1.0).

### Calculate Disparity differences in metrics

In [None]:
diffs = eq2.calculate_differences(boots_race_metrics, "race")

In [None]:
eqb.eq_group_metrics_plot(
    group_metrics=diffs,
    metric_cols=[
        "Accuracy_diff",
        "Precision_diff",
        "Predicted_Prevalence_diff",
        "Prevalence_diff",
        "FP_Rate_diff",
        "TN_Rate_diff",
        "Recall_diff",
    ],
    name="race",
    categories="all",
    plot_type="violinplot",
    color_by_group=True,
    show_grid=False,
    strict_layout=True,
    leg_cols=7,
    # plot_thresholds=[0.9, 1.2],
)

### Calculate statistical significance

- We are using a bootstrap-based approach to determine if the observed differences in various model performance metrics (like Accuracy, Precision, Recall, etc.) between each demographic group and the chosen reference group (White) are statistically significant.

- The bootstrap method involves repeatedly resampling the data to create multiple versions of the metrics for each group. By comparing the distribution of these bootstrapped metric differences to a null hypothesis of no difference, we can calculate a p-value.

- This p-value, adjusted for multiple comparisons using a method like Bonferroni, helps us conclude whether the observed disparities are likely due to random chance or represent a true, statistically significant difference.
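
One common way to turn a bootstrap distribution of differences into a two-tailed p-value is sketched below. It is an assumption that this matches EquiBoots' exact computation; the helper `bootstrap_p_value` and the simulated differences are hypothetical and serve only to illustrate the idea.

```python
import numpy as np


def bootstrap_p_value(diffs):
    """Two-tailed p-value: twice the smaller tail mass on either side of zero."""
    diffs = np.asarray(diffs, dtype=float)
    p_low = np.mean(diffs <= 0)
    p_high = np.mean(diffs >= 0)
    return float(min(1.0, 2 * min(p_low, p_high)))


rng = np.random.default_rng(42)

# Simulated bootstrap differences versus the reference group (hypothetical).
boot_diffs_toy = {
    "group_1": rng.normal(0.05, 0.01, 5000),  # clearly shifted away from zero
    "group_2": rng.normal(0.00, 0.01, 5000),  # centered on zero
}

# Bonferroni adjustment: multiply by the number of comparisons, cap at 1.
n_comparisons = len(boot_diffs_toy)
adjusted_toy = {
    g: min(1.0, bootstrap_p_value(d) * n_comparisons)
    for g, d in boot_diffs_toy.items()
}
print(adjusted_toy)
```

If almost all bootstrap differences fall on one side of zero, the p-value is small and the null hypothesis of no difference is rejected.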



In [None]:
# metrics to perform a statistical test
metrics_boot = [
    "Accuracy_diff",
    "Precision_diff",
    "Recall_diff",
    "F1_Score_diff",
    "Specificity_diff",
    "TP_Rate_diff",
    "FP_Rate_diff",
    "FN_Rate_diff",
    "TN_Rate_diff",
    "Prevalence_diff",
    "Predicted_Prevalence_diff",
    "ROC_AUC_diff",
    "Average_Precision_Score_diff",
    "Log_Loss_diff",
    "Brier_Score_diff",
    "Calibration_AUC_diff",
]

# configuration dictionary to provide parameters around statistical testing
test_config = {
    "test_type": "bootstrap_test",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
    "confidence_level": 0.95,
    "classification_task": "binary_classification",
    "tail_type": "two_tailed",
    "metrics": metrics_boot,
}


stat_test_results = eq2.analyze_statistical_significance(
    metric_dict=boots_race_metrics,  # pass variable sliced metrics
    var_name="race",  # variable name
    test_config=test_config,  # configuration
    differences=diffs,  # the differences of each race group
)

### Table of statistical significance (difference between metrics)

In [None]:
stat_metrics_table_diff = metrics_table(
    boots_race_metrics,
    statistical_tests=stat_test_results,
    differences=diffs,
    reference_group="White",
)

In [None]:
# differences of each race group wrt reference group
# reference group differences are all zero not shown for simplicity
# * depicts statistical significance
stat_metrics_table_diff

### Plot statistical significance between the differences of metrics

This section plots the metrics for each group against each other.
Statistical tests are used to determine whether these differences are statistically significant.
Statistical significance is shown with an asterisk (*).

In [None]:
eqb.eq_group_metrics_plot(
    group_metrics=diffs,
    metric_cols=metrics_boot,
    name="race",
    categories="all",
    figsize=(20, 10),
    plot_type="violinplot",
    color_by_group=True,
    show_grid=True,
    max_cols=6,
    strict_layout=True,
    save_path="./images",
    show_pass_fail=False,
    statistical_tests=stat_test_results,
)

## Bootstrapped Forest Plots

In [None]:
eqb.eq_plot_bootstrap_forest(
    group_boot_metrics=boots_race_metrics,
    metric="ROC AUC",
    reference_group="White",
    title="AUROC - Bootstrapped Race Metrics",
    save_path="./images",
    figsize=(8, 6),
    filename="bootstrapped_roc_auc_race_metrics",
)

In [None]:
eqb.calculate_bootstrap_stats(group_boot_metrics=boots_race_metrics, metric="ROC AUC")

# Conclusion

EquiBoots allows us to compare the performance of machine learning models across different race groups.

In this example, we looked at both point estimates and bootstrapped estimates and analysed their statistical significance.

Overall, in the point-estimate examples we found multiple metrics where performance was statistically different from the reference group, with the caveat that the effect sizes were small, so the differences are not necessarily strong.

The bootstrapped examples also showed where model performance differed. Precision is higher for the Black sample (with statistical significance); however, we cannot say that the model is biased, because prevalence also differs significantly between the Black sample and the reference group (White sample).

- Differences in Accuracy, Precision, Specificity, FP Rate, TN Rate, Prevalence, Predicted Prevalence, Log Loss, and Brier Score are statistically significant for the Black population.
- For the Asian-Pac-Islander population, the Calibration AUC (measured against the 45-degree diagonal) is statistically significantly different from the reference.
- This suggests that the model would benefit from calibration.
- Moreover, the prevalence disparity in outcome observed within the Black population is clearer in the bootstrap analysis, evident in both prevalence and most of the model's metrics.