# Bias and Fairness Assessment (Binary Classification: Adult Income)

Assessing machine learning models for bias and fairness matters for several reasons:

- Prevent Discrimination
  - Avoid unfair treatment based on protected attributes.

- Meet Legal Standards
  - Ensure compliance with laws and anti-discrimination acts.

- Build Trust
  - Fair models are more accepted by users, stakeholders, and regulators.

- Expose Hidden Gaps
  - Surface performance differences across demographic subgroups.

- Promote Ethical AI
  - Prevent reinforcement of societal or historical biases in data.

- Enable Accountability
  - Make models more transparent and open to external review.

- Guide Fairness Fixes
  - Identify where to apply debiasing or fairness-enhancing techniques.

## Dataset Overview: UCI Adult Income Dataset
The **Adult Income dataset** (also known as the **Census Income** dataset) originates from the **UCI Machine Learning Repository**. It was extracted from the 1994 U.S. Census database and is widely used for benchmarking classification models, especially in fairness and bias research.

The task is to **predict whether an individual earns more than $50K per year** based on features such as age, education, occupation, and marital status.

- Target variable: income (binary: <=50K or >50K)

- Samples: 48,842

- Features: 14 demographic and employment-related attributes

- Use case: Benchmarking algorithms, fairness audits, and bias mitigation

Due to its inclusion of sensitive attributes (e.g., sex, race), it’s commonly used in studies evaluating algorithmic fairness and disparate impact.
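As a rough illustration of what "disparate impact" measures, the sketch below computes the ratio of favorable-outcome rates between a group and a reference group. The outcome lists and function names are hypothetical and are not taken from the dataset or from any library; a ratio below 0.8 is often flagged under the "four-fifths" rule of thumb.

```python
def selection_rate(outcomes):
    """Fraction of favorable (1) outcomes in a list of 0/1 values."""
    return sum(outcomes) / len(outcomes)


def disparate_impact_ratio(outcomes_group, outcomes_reference):
    """Ratio of the group's selection rate to the reference group's rate."""
    return selection_rate(outcomes_group) / selection_rate(outcomes_reference)


# Hypothetical toy outcomes (1 = predicted >50K), not taken from the dataset
reference_outcomes = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0]  # 6 of 10 favorable
group_outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # 3 of 10 favorable

ratio = disparate_impact_ratio(group_outcomes, reference_outcomes)
print(f"Disparate impact ratio: {ratio:.2f}")
```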



# Modeling

In this notebook, we’ll train an XGBoost model to predict whether an individual’s annual income exceeds \$50K and then evaluate its performance and fairness across different demographic groups.

### Step 1: Install and import dependencies


In [None]:
# ! pip install equiboots

In [None]:
# ! pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [None]:
# fetch dataset
adult = fetch_ucirepo(id=2)
adult = adult.data.features.join(adult.data.targets, how="inner")

In [None]:
adult.head(3)

## Basic Preprocessing Steps

### 1. Drop missing values

In [None]:
# Drop missing values
adult.dropna(inplace=True)

### 2. Copy DataFrame for posterity

In [None]:
df = adult.copy()

In [None]:
adult["income"].value_counts()

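### 3. Encode the target variable

In [None]:
def outcome_merge(val):
    # The raw labels mix "<=50K" and "<=50K." (trailing period); map both to 0
    if val == "<=50K" or val == "<=50K.":
        return 0
    # Everything else (">50K" and ">50K.") maps to 1
    else:
        return 1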

In [None]:
df["income"] = df["income"].apply(outcome_merge)

In [None]:
# Count and percentage of individuals earning >50K, by sex

income_by_sex = df.groupby("sex")["income"].agg(
    ["count", lambda x: (x.sum() / x.count()) * 100]
)
income_by_sex.columns = ["count", "percentage_above_50k"]
income_by_sex

In [None]:
# Count and percentage of individuals earning >50K, by race

income_by_race = df.groupby("race")["income"].agg(
    ["count", lambda x: (x.sum() / x.count()) * 100]
)
income_by_race.columns = ["count", "percentage_above_50k"]
income_by_race

### 4. Split the data

In [None]:
# Split data
X = df.drop("income", axis=1)
y = df["income"]

In [None]:
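# Cast non-numeric columns to "category" so XGBoost can handle them natively.
# (isinstance(X[col], object) is always True and would also convert numeric columns.)
for col in X.columns:
    if not pd.api.types.is_numeric_dtype(X[col]):
        X[col] = X[col].astype("category")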

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
)

In [None]:
y_train.value_counts()

## Train XGBoost Model

In [None]:
model = XGBClassifier(eval_metric="logloss", random_state=42, enable_categorical=True)
model.fit(X_train, y_train)

## Evaluate XGBoost Model

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)
print(classification_report(y_test, y_pred))

# Bias and Fairness Analysis with EquiBoots

**EquiBoots supports point-estimate fairness analysis at a model's operating point (e.g., the optimal threshold) as well as analysis over multiple bootstrap samples drawn with replacement.**
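To make the idea of bootstrapping with replacement concrete, here is a minimal sketch that estimates a percentile confidence interval for recall within a single group. It uses only NumPy and synthetic data; it is an illustration of the general technique under those assumptions, not EquiBoots' own bootstrap implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy labels and predictions for a single group
y_true = rng.integers(0, 2, size=500)
flip = rng.random(500) < 0.2  # roughly 20% of predictions are wrong
y_pred = np.where(flip, 1 - y_true, y_true)


def recall(y_true, y_pred):
    """True positive rate; returns nan when there are no positives."""
    positives = y_true == 1
    if positives.sum() == 0:
        return float("nan")
    return float((y_pred[positives] == 1).mean())


point_estimate = recall(y_true, y_pred)

# Resample rows with replacement and recompute the metric each time
n_boot = 1000
boot_recalls = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    boot_recalls[b] = recall(y_true[idx], y_pred[idx])

# Percentile interval over the bootstrap distribution
ci_low, ci_high = np.nanpercentile(boot_recalls, [2.5, 97.5])
print(f"Recall: {point_estimate:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

A point estimate reports a single value per group, while the bootstrap interval shows how much that value could move under resampling, which matters most for small groups.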


To initialize an analysis with EquiBoots:

1. Define a fairness DataFrame with the variables of interest.
2. Initialize an EquiBoots object using:
    - Ground truth (y_true)
    - Model probabilities (y_prob)
    - Model predictions (y_pred)
3. Identify the columns/variables to assess (e.g., race, sex).

In [None]:
import equiboots as eqb

# Point Estimates
## Overview
This section runs a step-by-step point-estimate fairness analysis of the model using EquiBoots. It first generates predictions and positive-class probabilities for the test data, converts the true labels to a NumPy array, and casts the sensitive attributes ('race' and 'sex') to strings. It then builds a fairness DataFrame that holds the group attributes, plus the true labels and predicted probabilities needed for later comparisons. An EquiBoots object is initialized with these inputs and used to group the data by race and sex, and metrics such as accuracy, precision, recall, and specificity are computed for each group. A statistical significance test (chi-square with Bonferroni correction) is configured and run to check whether the differences between groups are larger than chance alone would explain. Finally, with the "White" group as the reference for race (and "Male" for sex), baseline metrics are computed at a threshold of 0.5 using a custom specificity function, and a decision threshold is searched for each remaining group that minimizes the difference from the reference. Together, these steps provide a foundation for evaluating and improving model fairness across demographic groups.
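The per-group metrics described above can also be computed by hand from confusion-matrix counts. The following is a minimal NumPy sketch on hypothetical toy data; the helper name `group_metrics` is an assumption for illustration and is not part of EquiBoots.

```python
import numpy as np


def group_metrics(y_true, y_pred, groups):
    """Return accuracy, precision, recall, and specificity for each group label."""
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        tp = int(np.sum((yt == 1) & (yp == 1)))
        tn = int(np.sum((yt == 0) & (yp == 0)))
        fp = int(np.sum((yt == 0) & (yp == 1)))
        fn = int(np.sum((yt == 1) & (yp == 0)))
        results[str(g)] = {
            "accuracy": (tp + tn) / len(yt),
            "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
            "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        }
    return results


# Hypothetical toy data: two groups, A and B
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

metrics = group_metrics(y_true, y_pred, groups)
for name, values in metrics.items():
    print(name, values)
```

In this toy example both groups have the same accuracy (0.75) but different precision and recall, which is why a single aggregate metric can hide group-level gaps.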


##  Unadjusted Fairness Metrics by Race and Sex  
This section visualizes the **original (unadjusted)** model predictions across **race** and **sex**.  
We'll plot **Accuracy**, **Precision**, and **Recall**, along with results from statistical significance testing.


## Step 1: Organize Statistical Test Results
Prepare the significance test results for race and sex groups so we can include them in the performance plot.

In [None]:
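# Get predictions, positive-class probabilities, and true labels
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
y_test = np.asarray(y_test)  # safe to re-run, unlike y_test.to_numpy()

# Cast the sensitive attributes to strings for grouping
X_test[["race", "sex"]] = X_test[["race", "sex"]].astype(str)


# Create the fairness DataFrame; drop the old index so rows align
# positionally with the NumPy arrays above
fairness_df = X_test[["race", "sex"]].reset_index(drop=True)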

eq = eqb.EquiBoots(
    y_true=y_test,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
)

# grouping by variables' groups (e.g., Male, Female, etc)
eq.grouper(groupings_vars=["race", "sex"])

## Step 2: Plot Performance Metrics for Race and Sex (Unadjusted)
Visualize the model's performance for different demographic groups. This plot shows Accuracy, Precision, and Recall grouped by race and sex using the unadjusted predictions. Statistical test results are added to highlight significant differences.

In [None]:
# Slice the data by race and compute per-group metrics
sliced_race_data = eq.slicer("race")
race_metrics = eq.get_metrics(sliced_race_data)

# Slice the data by sex and compute per-group metrics
sliced_sex_data = eq.slicer("sex")
sex_metrics = eq.get_metrics(sliced_sex_data)

In [None]:
# Config for statistical testing
test_config = {
    "test_type": "chi_square",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
    "confidence_level": 0.95,
    "classification_task": "binary_classification",
}

# Run fairness tests
stat_test_results_race = eq.analyze_statistical_significance(
    race_metrics, "race", test_config
)
stat_test_results_sex = eq.analyze_statistical_significance(
    sex_metrics, "sex", test_config
)

# Store results
overall_stat_results = {
    "sex": stat_test_results_sex,
    "race": stat_test_results_race,
}

In [None]:
# Plot original (unadjusted) performance metrics by race and sex
eqb.eq_group_metrics_point_plot(
    group_metrics=[race_metrics, sex_metrics],  # Metrics by group
    metric_cols=["Accuracy", "Precision", "Recall"],  # Which metrics to display
    category_names=["race", "sex"],  # Groups being compared
    figsize=(6, 8),  # Plot size
    include_legend=True,  # Include legend for clarity
    plot_thresholds=(0.9, 1.1),  # Highlight near-equal thresholds
    raw_metrics=True,  # Show actual values (not normalized)
    show_grid=True,  # Show grid for readability
    y_lim=(0, 1),  # Y-axis limit
    statistical_tests=overall_stat_results,  # Include statistical significance tests
    y_lims={  # Customize Y-axis for each group/metric
        ("sex", "Accuracy"): (0.70, 1.0),
        ("sex", "Precision"): (0.70, 1.0),
        ("sex", "Recall"): (0.70, 1.0),
        ("race", "Accuracy"): (0.70, 1.0),
        ("race", "Precision"): (0.70, 1.0),
        ("race", "Recall"): (0.70, 1.0),
    },
)

# **Calculating the Best Possible Thresholds**


## *Best thresholds for race and sex*

In [None]:
# Create a DataFrame (fairness_df) holding the group attributes of each test row
fairness_df = X_test[["race", "sex"]].reset_index(drop=True)

# Add true labels and predicted probabilities
fairness_df["y_true"] = y_test  # ground-truth labels
fairness_df["y_prob"] = y_prob  # predicted probability of the positive class

In [None]:
# Initialize EquiBoots with labels and group information
eq = eqb.EquiBoots(
    y_true=y_test,  # ground-truth labels
    y_prob=y_prob,  # predicted probabilities
    y_pred=y_pred,  # predicted class labels
    fairness_df=fairness_df,  # race and sex of each test row
    fairness_vars=["race", "sex"],  # attributes to analyze for fairness
)

# Group the data by race and sex so metrics can be computed per group
eq.grouper(groupings_vars=["race", "sex"])

## Step 1: Generate Model Predictions and Probabilities

In [None]:
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    confusion_matrix,
)
import numpy as np


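# Specificity (true negative rate) = TN / (TN + FP)
def specificity_score(y_true, y_pred):
    # labels=[0, 1] keeps the matrix 2x2 even if a group contains only one class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)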

In [None]:
# Generate predictions and predicted probabilities
# Ensure 'race' and 'sex' are category dtype for XGBoost
X_test[["race", "sex"]] = X_test[["race", "sex"]].astype("category")

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
# y_test = y_test.to_numpy()

## Step 2: Compute Reference Metrics
## 1. White Group

In [None]:
# Use the "White" group as reference at threshold = 0.5
reference_group = "White"
ref_indices = fairness_df["race"] == reference_group  # boolean mask for the White group
ref_pred = (fairness_df["y_prob"][ref_indices] >= 0.5).astype(int)  # apply the 0.5 threshold
ref_y_true = fairness_df["y_true"][ref_indices]  # true labels for the White group

# Compute baseline metrics
ref_accuracy = accuracy_score(ref_y_true, ref_pred)  # share of correct predictions
ref_precision = precision_score(ref_y_true, ref_pred)  # TP / (TP + FP)
ref_recall = recall_score(ref_y_true, ref_pred)  # TP / (TP + FN)
ref_specificity = specificity_score(ref_y_true, ref_pred)  # TN / (TN + FP)

## 2. Male Group


In [None]:
# Use the "Male" group as reference at threshold = 0.5
reference_group_sex = "Male"
ref_indices_sex = fairness_df["sex"] == reference_group_sex
ref_pred_sex = (fairness_df["y_prob"][ref_indices_sex] >= 0.5).astype(int)
ref_y_true_sex = fairness_df["y_true"][ref_indices_sex]

# Compute Baseline Metrics for Male group
ref_accuracy_sex = accuracy_score(ref_y_true_sex, ref_pred_sex)
ref_precision_sex = precision_score(ref_y_true_sex, ref_pred_sex)
ref_recall_sex = recall_score(ref_y_true_sex, ref_pred_sex)
ref_specificity_sex = specificity_score(ref_y_true_sex, ref_pred_sex)

print("Male Group Metrics (Threshold 0.5):")
print(f"Accuracy: {ref_accuracy_sex}")
print(f"Precision: {ref_precision_sex}")
print(f"Recall: {ref_recall_sex}")
print(f"Specificity: {ref_specificity_sex}")

## Step 3: Find Matching Thresholds for Other Groups
### Race groups

In [None]:
# # Find group-specific thresholds that minimize difference from White group
# group_thresholds = {}  # stores best threshold for each race groups

# for group in fairness_df["race"].unique():  # loops through each unique race group
#     if group == reference_group:
#         group_thresholds[group] = 0.5
#         continue  # assigns threshold to White and skips to next group

#     group_indices = fairness_df["race"] == group  # boolean mask for this group
#     group_probs = fairness_df["y_prob"][
#         group_indices
#     ].values  # predicted probabilities for this group
#     group_true = fairness_df["y_true"][
#         group_indices
#     ].values  # true labels for this group

#     best_threshold = 0.5
#     best_diff = float("inf")

#     for t in np.linspace(
#         0.1, 0.9, 100
#     ):  # try 100 evenly spaced thresholds between 0.1 and 0.9
#         preds = (group_probs >= t).astype(int)  # binarize at threshold t
#         acc = accuracy_score(group_true, preds)

#         # calculate performance metrics
#         prec = precision_score(group_true, preds)
#         rec = recall_score(group_true, preds)
#         spec = specificity_score(group_true, preds)

#         # calculates total difference
#         diff = (
#             abs(acc - ref_accuracy)  # absolute value
#             + abs(prec - ref_precision)
#             + abs(rec - ref_recall)
#             + abs(spec - ref_specificity)
#         )

#         if diff < best_diff:
#             best_diff = diff
#             best_threshold = t

#     group_thresholds[group] = best_threshold

group_thresholds_race = eqb.find_group_thresholds(
    y_true=y_test,
    y_prob=y_prob,
    reference_group="White",
    group_vec=fairness_df["race"],
    ref_metrics=None,
    threshold_range=(0.1, 0.9),
    n_steps=100,
    default_threshold=0.5,
)


# Display optimal thresholds per group
group_thresholds_race

### Sex groups

In [None]:
# # Iterate and find best thresholds for other groups
# group_thresholds_sex = {}

# for group in fairness_df["sex"].unique():
#     if group == reference_group_sex:
#         group_thresholds_sex[group] = 0.5
#         continue

#     group_indices_sex = fairness_df["sex"] == group
#     group_probs_sex = fairness_df["y_prob"][group_indices_sex].values
#     group_true_sex = fairness_df["y_true"][group_indices_sex].values

#     best_threshold_sex = 0.5
#     best_diff_sex = float("inf")

#     for t in np.linspace(0.1, 0.9, 100):
#         preds_sex = (group_probs_sex >= t).astype(int)
#         acc_sex = accuracy_score(group_true_sex, preds_sex)
#         prec_sex = precision_score(group_true_sex, preds_sex)
#         rec_sex = recall_score(group_true_sex, preds_sex)
#         spec_sex = specificity_score(group_true_sex, preds_sex)

#         diff_sex = (
#             abs(acc_sex - ref_accuracy_sex)
#             + abs(prec_sex - ref_precision_sex)
#             + abs(rec_sex - ref_recall_sex)
#             + abs(spec_sex - ref_specificity_sex)
#         )

#         if diff_sex < best_diff_sex:
#             best_diff_sex = diff_sex
#             best_threshold_sex = t

#     group_thresholds_sex[group] = best_threshold_sex

# Store and display group thresholds
print("Optimal thresholds per sex group:")

group_thresholds_sex = eqb.find_group_thresholds(
    y_true=y_test,
    y_prob=y_prob,
    reference_group="Male",
    group_vec=fairness_df["sex"],
    ref_metrics=None,
    threshold_range=(0.1, 0.9),
    n_steps=100,
    default_threshold=0.5,
)


print(group_thresholds_sex)

## **Significance plots**

- EquiBoots supports statistical testing to assess whether differences in metrics between groups are significant.

- Specifically, omnibus and pairwise chi-square tests are used to assess significance between groups.

- Reference groups to compare against can be provided when the EquiBoots object is initialized:
  - using  (reference_groups=["white","female"]),
  - otherwise the group with the highest number of observations is selected automatically.

- Below we plot the race and sex groups and look at how performance differs across them.

- We conduct statistical significance tests to determine:
  - first, whether there is any difference between the groups (omnibus test), marked with an asterisk (*),
  - and, if the omnibus test is significant, which groups differ significantly (pairwise tests), marked with (▲).
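The pairwise step with Bonferroni correction can be sketched with the standard library alone. The example below assumes each group is summarized by hypothetical counts of correct and incorrect predictions and tests every pair with a 2x2 Pearson chi-square test; the contingency tables EquiBoots builds internally may differ, so treat this as an illustration of the technique only. An omnibus test over all groups at once has more than one degree of freedom and would need a general chi-square distribution (for example from SciPy), which is omitted here.

```python
import math
from itertools import combinations


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom, the chi-square
    survival function is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical counts of (correct, incorrect) predictions per group
counts = {"A": (850, 150), "B": (160, 40), "C": (70, 30)}

pairs = list(combinations(counts, 2))
alpha = 0.05
adjusted_alpha = alpha / len(pairs)  # Bonferroni: divide alpha by the number of tests

results = {}
for g1, g2 in pairs:
    stat, p = chi_square_2x2(*counts[g1], *counts[g2])
    results[(g1, g2)] = (stat, p, p < adjusted_alpha)
    print(f"{g1} vs {g2}: chi2={stat:.3f}, p={p:.5f}, significant={p < adjusted_alpha}")
```

Dividing alpha by the number of comparisons keeps the overall chance of a false positive near the nominal 5%, at the cost of making each individual test more conservative.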



##  **Threshold-Adjusted Metrics**
We now apply **group-specific probability thresholds by race** to modify predictions, aiming to make key performance metrics (Accuracy, Precision, Recall, Specificity) **match those of the White group**.  
We then visualize those adjusted race metrics side-by-side with **unadjusted sex metrics**.
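Applying group-specific thresholds amounts to comparing each probability against the cutoff of the group it belongs to. The notebook uses `eqb.grouped_threshold_predict` for this; the vectorized NumPy version below is an independent sketch on hypothetical values, with `apply_group_thresholds` as an illustrative name rather than a library function.

```python
import numpy as np


def apply_group_thresholds(y_prob, group_labels, thresholds, default=0.5):
    """Binarize probabilities with a per-group cutoff (default for unlisted groups)."""
    cutoffs = np.array([thresholds.get(g, default) for g in group_labels])
    return (y_prob >= cutoffs).astype(int)


# Hypothetical toy probabilities, group labels, and thresholds
y_prob = np.array([0.45, 0.55, 0.45, 0.30, 0.52])
labels = np.array(["White", "White", "Black", "Other", "Unknown"])
thresholds = {"White": 0.5, "Black": 0.4, "Other": 0.25}

y_pred_adjusted = apply_group_thresholds(y_prob, labels, thresholds)
print(y_pred_adjusted)  # [0 1 1 1 1]
```

The same probability (0.45) yields different predictions for the first and third rows because their groups use different cutoffs, which is exactly the lever this post-processing step relies on.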




# **Overview**
This section adjusts prediction thresholds to improve fairness across racial groups. It first saves the original statistical test results for race and sex so that results can be compared before and after adjustment. It then defines a threshold for each race group, chosen to match the performance of the White group, and applies these group-specific thresholds to produce new predictions. The adjusted predictions are added to the fairness DataFrame, and a new EquiBoots object is initialized to repeat the analysis by race. Metrics such as accuracy and recall are recalculated for each group, and statistical significance tests are rerun to check whether group differences remain after the threshold changes. A metrics table summarizes per-group metrics and their significance relative to the White group. Finally, the adjusted race results are combined with the original sex results in a plot that compares key metrics across race and sex, marks the band of near-equal performance, and indicates statistically significant differences. This process helps assess and improve the model's fairness by tailoring thresholds to each group.


### Step 1: Prepare Statistical Test Results for Original (Unadjusted) Metrics

In [None]:
# Initialize EquiBoots with labels and group information
eq = eqb.EquiBoots(
    y_true=y_test,  # ground-truth labels
    y_prob=y_prob,  # predicted probabilities
    y_pred=y_pred,  # predicted class labels
    fairness_df=fairness_df,  # race and sex of each test row
    fairness_vars=["race", "sex"],  # attributes to analyze for fairness
)

# Group the data by race and sex so metrics can be computed per group
eq.grouper(groupings_vars=["race", "sex"])

In [None]:
# Define how we test for statistically significant group differences
test_config = {
    "test_type": "chi_square",  # Use the chi-square test
    "alpha": 0.05,  # Significance level
    "adjust_method": "bonferroni",  # Adjust for multiple comparisons
    "confidence_level": 0.95,  # 95% confidence interval
    "classification_task": "binary_classification",  # Task type
}

# Run fairness tests
stat_test_results_race = eq.analyze_statistical_significance(
    race_metrics, "race", test_config
)
stat_test_results_sex = eq.analyze_statistical_significance(
    sex_metrics, "sex", test_config
)

In [None]:
# Prepare statistical test results for original (unadjusted) metrics
# (requires stat_test_results_sex and stat_test_results_race from the cell above)
overall_stat_results = {
    "sex": stat_test_results_sex,
    "race": stat_test_results_race,
}

### Step 2: Define Race-Specific Thresholds
These thresholds were determined by minimizing the difference in performance compared to the White reference group.
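The search behind such thresholds can be sketched as a grid search that, for one group, picks the cutoff whose metrics are closest (in summed absolute difference) to the reference group's metrics at 0.5. The data and helper names below are hypothetical, and this is an illustration of the idea rather than the code of `eqb.find_group_thresholds`; on ties it keeps the lowest threshold.

```python
import numpy as np


def confusion_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity (0.0 when a denominator is empty)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    def ratio(num, den):
        return float(num / den) if den else 0.0

    return np.array([
        ratio(tp + tn, len(y_true)),
        ratio(tp, tp + fp),
        ratio(tp, tp + fn),
        ratio(tn, tn + fp),
    ])


def find_threshold(y_true, y_prob, ref_metrics, grid):
    """Grid-search the cutoff whose metrics are closest (L1 distance) to ref_metrics."""
    best_t, best_diff = None, float("inf")
    for t in grid:
        preds = (y_prob >= t).astype(int)
        diff = float(np.abs(confusion_metrics(y_true, preds) - ref_metrics).sum())
        if diff < best_diff:
            best_t, best_diff = float(t), diff
    return best_t, best_diff


# Hypothetical reference group, evaluated at the default threshold of 0.5
ref_true = np.array([0, 0, 0, 1, 1, 1])
ref_prob = np.array([0.2, 0.3, 0.6, 0.4, 0.7, 0.8])
ref_metrics = confusion_metrics(ref_true, (ref_prob >= 0.5).astype(int))

# Hypothetical second group: same labels, scores shifted down by 0.1
grp_true = np.array([0, 0, 0, 1, 1, 1])
grp_prob = np.array([0.1, 0.2, 0.5, 0.3, 0.6, 0.7])

best_t, best_diff = find_threshold(
    grp_true, grp_prob, ref_metrics, np.linspace(0.1, 0.9, 100)
)
print(f"Best threshold: {best_t:.3f} (total metric difference {best_diff:.3f})")
```

Because the second group's scores are shifted downward, a cutoff below 0.5 reproduces the reference group's metrics, which mirrors why the non-reference thresholds in this notebook fall below 0.5.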

In [None]:
# Manually defined thresholds for each race group (based on prior optimization)
group_thresholds_race = {
    "White": 0.5,
    "Black": 0.407070707070707,
    "Asian-Pac-Islander": 0.4232323232323232,
    "Amer-Indian-Eskimo": 0.41515151515151516,
    "Other": 0.2535353535353535,
}

In [None]:
# Manually defined thresholds for each sex group (based on prior optimization)
group_thresholds_sex = {
    "Male": 0.5,
    "Female": 0.4313131313131313,
}

### Step 3: Define a Function to Apply Group-Specific Thresholds

In [None]:
# # This function applies custom thresholds for each group based on their group label
# def grouped_threshold_predict(
#     y_prob, group_labels, group_thresholds, default_threshold=0.5
# ):
#     predictions = np.zeros_like(
#         y_prob, dtype=int
#     )  # Initialize array of 0s for predicted labels
#     for group in np.unique(
#         group_labels
#     ):  # Loop over each unique group (e.g., each race)
#         idx = group_labels == group  # Get indices where group label matches
#         threshold = group_thresholds.get(
#             group, default_threshold
#         )  # Use custom or default threshold
#         predictions[idx] = (y_prob[idx] >= threshold).astype(
#             int
#         )  # Apply threshold to get predictions
#     return predictions

### Step 4: Apply Grouped Thresholds to Make New Predictions

In [None]:
# Extract race labels for each instance in the dataset
race_labels = fairness_df["race"].values

# Get new predictions using group-specific thresholds
y_pred_grouped_thresh = eqb.grouped_threshold_predict(
    y_prob, race_labels, group_thresholds_race
)

# Save new predictions in the fairness DataFrame
fairness_df["y_pred_grouped_thresh"] = y_pred_grouped_thresh

### Step 5: Create New EquiBoots Object for Adjusted Predictions

In [None]:
# Create a new EquiBoots object using the adjusted predictions
eq_adjusted = eqb.EquiBoots(
    y_true=fairness_df["y_true"],  # True labels
    y_prob=fairness_df["y_prob"],  # Predicted probabilities
    y_pred=fairness_df["y_pred_grouped_thresh"],  # Adjusted predictions
    fairness_df=fairness_df,  # Dataset with sensitive attributes
    fairness_vars=["race"],  # Focus fairness analysis on race
)

# Re-group the data by race (prepares for fairness slicing)
eq_adjusted.grouper(groupings_vars=["race"])

### Step 6: Slice and Evaluate New Metrics by Race

In [None]:
# Extract subgroup data by race
sliced_race_data_adjusted = eq_adjusted.slicer("race")

# Compute fairness performance metrics for each racial group using adjusted predictions
race_metrics_adjusted = eq_adjusted.get_metrics(sliced_race_data_adjusted)

### Step 7: Run Statistical Significance Tests for Adjusted Predictions

In [None]:
# Run statistical significance tests on the adjusted race metrics.
# eq_adjusted and race_metrics_adjusted were created in Steps 5 and 6 above.
test_config = {
    "test_type": "chi_square",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
    "confidence_level": 0.95,
    "classification_task": "binary_classification",
}
stat_test_results_adjusted = eq_adjusted.analyze_statistical_significance(
    race_metrics_adjusted, "race", test_config
)

### Step 8: Combine Adjusted and Original Metrics for Final Plot

In [None]:
# Combine updated race metrics and original sex metrics for a final fairness plot
overall_stat_results_adjusted = {
    "race": stat_test_results_adjusted,  # Adjusted test results for race
    "sex": stat_test_results_sex,  # Original test results for sex
}

In [None]:
# Plot the group performance metrics for both race (adjusted) and sex (original)
eqb.eq_group_metrics_point_plot(
    group_metrics=[race_metrics_adjusted, sex_metrics],  # Grouped metrics to compare
    metric_cols=["Accuracy", "Precision", "Recall"],  # Metrics to plot
    category_names=["race", "sex"],  # Categories to compare
    figsize=(8, 8),  # Size of the plot
    include_legend=True,  # Show legend
    plot_thresholds=(0.9, 1.1),  # Highlight near-equality
    raw_metrics=True,  # Use raw (not normalized) values
    show_grid=True,  # Add grid to plot
    y_lim=(0, 1),  # Set y-axis limits
    statistical_tests=overall_stat_results_adjusted,  # Display statistical test results
    y_lims={  # Customize Y-axis for each group/metric
        ("sex", "Accuracy"): (0.70, 1.0),
        ("sex", "Precision"): (0.70, 1.0),
        ("sex", "Recall"): (0.70, 1.0),
        ("race", "Accuracy"): (0.70, 1.0),
        ("race", "Precision"): (0.70, 1.0),
        ("race", "Recall"): (0.70, 1.0),
    },
)

In [None]:
from equiboots.tables import metrics_table

In [None]:
stat_metrics_table_point = metrics_table(
    race_metrics, statistical_tests=stat_test_results_race, reference_group="White"
)

In [None]:
# table with metrics per group and statistical significance shown on columns for
# omnibus and/or pairwise
stat_metrics_table_point

# Conclusion

EquiBoots provides a framework for evaluating and visualizing machine learning model performance across demographic groups, with a focus on fairness in binary classification. In this analysis, we examined both unadjusted and threshold-adjusted predictions for race and sex groups. Point estimates revealed statistically significant performance differences between the reference group (White) and others, especially for Accuracy, Precision, and Recall. While these differences were statistically significant, some had small effect sizes, suggesting limited practical impact in certain cases.

To address disparities, we applied race-specific probability thresholds to align key metrics with those of the White group. This post-processing approach reduced or eliminated significant differences in many metrics, particularly for the Black and Asian Pacific Islander populations. Specificity and False Positive Rates also improved, demonstrating that group-specific thresholds can help balance performance outcomes. Sex-based metrics, which were not adjusted, were included for comparison.

Although thresholding improved fairness metrics, it does not address deeper issues such as biases in data collection, feature design, or model training. Bootstrapped analyses further confirmed disparities, particularly within the Black population, across multiple metrics including Prevalence, Log Loss, and Brier Score. Miscalibration in groups like Asian Pacific Islander also suggests that further improvements could come from calibration techniques.

Overall, this analysis highlights both the shortcomings of traditional evaluation methods and the value of fairness-aware tools like EquiBoots. Group-specific thresholding is one effective strategy to improve equity, but it should be used alongside broader efforts in fair model development and ethical deployment practices.



# Threshold-Adjusted Metrics for Sex

We now repeat the threshold adjustment for sex, using the 'Male' group as the reference. The reference metrics (Accuracy, Precision, Recall, Specificity) for the 'Male' group were computed earlier at the default threshold of 0.5, and the matching thresholds for each sex group are stored in group_thresholds_sex. The steps below apply those thresholds and repeat the fairness analysis by sex.


### Step 5: Apply Grouped Thresholds to Make New Predictions for Sex

In [None]:
# Extract sex labels for each instance in the dataset
sex_labels = fairness_df["sex"].values

# Get new predictions using group-specific thresholds for sex
y_pred_grouped_thresh_sex = eqb.grouped_threshold_predict(
    y_prob, sex_labels, group_thresholds_sex
)

# Save new predictions in the fairness DataFrame
fairness_df["y_pred_grouped_thresh_sex"] = y_pred_grouped_thresh_sex

### Step 6: Create New EquiBoots Object for Adjusted Sex Predictions

In [None]:
# Create a new EquiBoots object using the adjusted predictions for sex
eq_adjusted_sex = eqb.EquiBoots(
    y_true=fairness_df["y_true"],  # True labels
    y_prob=fairness_df["y_prob"],  # Predicted probabilities
    y_pred=fairness_df["y_pred_grouped_thresh_sex"],  # Adjusted predictions for sex
    fairness_df=fairness_df,  # Dataset with sensitive attributes
    fairness_vars=["sex"],  # Focus fairness analysis on sex
)

# Re-group the data by sex (prepares for fairness slicing)
eq_adjusted_sex.grouper(groupings_vars=["sex"])

### Step 7: Slice and Evaluate New Metrics by Sex

In [None]:
# Extract subgroup data by sex
sliced_sex_data_adjusted = eq_adjusted_sex.slicer("sex")

# Compute fairness performance metrics for each sex group using adjusted predictions
sex_metrics_adjusted = eq_adjusted_sex.get_metrics(sliced_sex_data_adjusted)

### Step 8: Run Statistical Significance Tests for Adjusted Sex Predictions

In [None]:
# Run statistical significance tests on adjusted sex metrics
stat_test_results_sex_adjusted = eq_adjusted_sex.analyze_statistical_significance(
    sex_metrics_adjusted, "sex", test_config
)

### Step 9: Combine Adjusted Race and Adjusted Sex Metrics for Final Plot

In [None]:
# Combine adjusted race metrics and adjusted sex metrics for a final fairness plot
overall_stat_results_adjusted_both = {
    "race": stat_test_results_adjusted,  # Adjusted test results for race
    "sex": stat_test_results_sex_adjusted,  # Adjusted test results for sex
}

### Step 10: Plot Adjusted Performance Metrics for Race and Sex

In [None]:
# Plot the group performance metrics for both race (adjusted) and sex (adjusted)
eqb.eq_group_metrics_point_plot(
    group_metrics=[
        race_metrics_adjusted,
        sex_metrics_adjusted,
    ],  # Grouped metrics to compare
    metric_cols=["Accuracy", "Precision", "Recall"],  # Metrics to plot
    category_names=["race", "sex"],  # Categories to compare
    figsize=(8, 8),  # Size of the plot
    include_legend=True,  # Show legend
    plot_thresholds=(0.9, 1.1),  # Highlight near-equality
    raw_metrics=True,  # Use raw (not normalized) values
    show_grid=True,  # Add grid to plot
    y_lim=(0, 1),  # Set y-axis limits
    statistical_tests=overall_stat_results_adjusted_both,  # Display statistical test results
    y_lims={  # Customize Y-axis for each group/metric
        ("sex", "Accuracy"): (0.70, 1.0),
        ("sex", "Precision"): (0.70, 1.0),
        ("sex", "Recall"): (0.70, 1.0),
        ("race", "Accuracy"): (0.70, 1.0),
        ("race", "Precision"): (0.70, 1.0),
        ("race", "Recall"): (0.70, 1.0),
    },
)