<a href="https://colab.research.google.com/github/temahm/AiCon/blob/main/PredictSalaryOver50Kv2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predict income >50K: Logistic Regression vs XGBoost

We’ll measure whether the model treats groups differently by sex and race.

In [None]:
!pip -q install fairlearn xgboost scikit-learn pandas numpy

Load dataset (Adult)

In [None]:
from sklearn.datasets import fetch_openml

adult = fetch_openml("adult", version=2, as_frame=True)
df = adult.frame.copy()

df.head()

Clean + set target + choose sensitive attribute(s)

In [None]:
import numpy as np
import pandas as pd

df = df.replace("?", np.nan).dropna()

y = (df["class"] == ">50K").astype(int)

A_sex  = df["sex"]   # sensitive feature for fairness evaluation
A_race = df["race"]  # optional, you can do one at a time

X = df.drop(columns=["class"])  # keep everything for now
X = pd.get_dummies(X, drop_first=True)

Exclude sensitive features from training (good practice) but still keep them for evaluation.

In [None]:
# If you want: exclude sex/race from training features
X = X.drop(columns=[c for c in X.columns if c.startswith("sex_") or c.startswith("race_")])

Train/test split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X, y, A_sex, test_size=0.25, random_state=42, stratify=y
)

Train two models: Logistic Regression vs XGBoost

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

lr = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),  # scale only; skipping mean-centering also keeps sparse input sparse
    ("model", LogisticRegression(max_iter=2000))
])

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric="logloss",
    random_state=42
)

lr.fit(X_train, y_train)
xgb.fit(X_train, y_train)

LR is linear and easier to interpret. XGBoost is more powerful and may amplify proxy patterns.

**Measure accuracy + fairness metrics (group gaps)**

Use fairlearn MetricFrame to compute metrics by group.

In [None]:
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate, false_positive_rate, false_negative_rate

def evaluate(model, X_test, y_test, A_test, name="model"):
    y_pred = model.predict(X_test)

    mf = MetricFrame(
        metrics={
            "accuracy": accuracy_score,
            "selection_rate": selection_rate,     # P(ŷ=1)
            "FPR": false_positive_rate,
            "FNR": false_negative_rate,
        },
        y_true=y_test,
        y_pred=y_pred,
        sensitive_features=A_test
    )

    print(f"\n=== {name} (overall) ===")
    print(mf.overall)

    print(f"\n=== {name} (by group) ===")
    print(mf.by_group)

    print(f"\n=== {name} (group gaps: max - min) ===")
    print(mf.difference())

evaluate(lr,  X_test, y_test, A_test, "Logistic Regression")
evaluate(xgb, X_test, y_test, A_test, "XGBoost")

**Selection rate**: the share of people the model predicts as “>50K”

**FPR**: among true ≤50K, how often we incorrectly predict >50K

**FNR**: among true >50K, how often we incorrectly predict ≤50K

**Group gap**: max(metric) − min(metric) across groups (basic disparity measure)
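
To make these definitions concrete, here is a minimal sketch that computes the same three rates and their max − min gaps by hand on a tiny made-up dataset. The helper names `group_rates` and `gap` and the toy labels are illustrative assumptions, not part of fairlearn; on real data the `MetricFrame` call above should be preferred.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Selection rate, FPR and FNR for each group, computed from raw labels."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        out[str(g)] = {
            "selection_rate": float(yp.mean()),          # P(y_hat = 1)
            "FPR": float(yp[yt == 0].mean()),            # P(y_hat = 1 | y = 0)
            "FNR": float(1.0 - yp[yt == 1].mean()),      # P(y_hat = 0 | y = 1)
        }
    return out

def gap(rates, metric):
    """Group gap: max(metric) - min(metric) across groups."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

# Toy data (hypothetical): four people per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

rates = group_rates(y_true, y_pred, groups)
for name, r in rates.items():
    print(name, r)
print("FNR gap:", gap(rates, "FNR"))
```

In the toy data, group F has one of its two true positives missed (FNR 0.5) while group M has none missed (FNR 0.0), so the FNR gap is 0.5.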

# Overall Performance

**Logistic Regression (overall)**

Accuracy: 0.846

FPR: 0.0686

FNR: 0.413


**XGBoost (overall)**

Accuracy: 0.866

FPR: 0.0581

FNR: 0.365

XGBoost performs better. It increases accuracy from 84.6% to 86.6%.
It reduces false positives and reduces false negatives overall.

If we stopped here, we would conclude XGBoost is better.


# Group-Level Results (This Is Where Bias Appears)

## Measuring fairness across sex.

### Selection Rate (Who Gets Predicted >50K)

**Logistic Regression**:

Female: 8.1%

Male: 25.3%

Gap: 17.2 percentage points


**XGBoost**:

Female: 8.3%

Male: 25.8%

Gap: 17.5 percentage points

Under both models, men are predicted to earn >50K about three times as often as women.
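
The gap and the "about three times" figure are two views of the same numbers. As a small worked example, the sketch below recomputes both from the selection rates reported above; the `disparity` helper is a hypothetical name used only for this illustration.

```python
# Selection rates by sex as reported above (as fractions, not percentages).
rates = {
    "Logistic Regression": {"Female": 0.081, "Male": 0.253},
    "XGBoost": {"Female": 0.083, "Male": 0.258},
}

def disparity(by_group):
    """Return the absolute gap (max - min) and the min/max ratio of the rates."""
    hi, lo = max(by_group.values()), min(by_group.values())
    return {"gap": hi - lo, "ratio": lo / hi}

summary = {name: disparity(r) for name, r in rates.items()}
for name, s in summary.items():
    print(f"{name}: gap = {s['gap']:.3f}, min/max ratio = {s['ratio']:.2f}")
```

For both models the ratio comes out near 0.32, meaning women are selected at roughly a third of the rate of men.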

Even though the model does not use sex directly, disparities persist.

### Learning:

*“Removing sensitive features does not remove bias.”*
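
One way to see why is with a proxy feature. The sketch below uses entirely synthetic data (not the Adult dataset): a hypothetical `hours` feature is distributed differently across two groups, and a decision rule that never looks at the group still selects the groups at very different rates. The distribution parameters and the cutoff of 45 are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic population: `group` is the sensitive attribute (0 or 1);
# `hours` is a proxy whose mean is shifted by 6 for group 1.
group = rng.integers(0, 2, size=n)
hours = rng.normal(loc=38 + 6 * group, scale=8, size=n)

# A "blind" rule: it uses only the proxy and never sees `group`.
y_pred = (hours > 45).astype(int)

rate_0 = y_pred[group == 0].mean()
rate_1 = y_pred[group == 1].mean()
print(f"selection rate, group 0: {rate_0:.3f}")
print(f"selection rate, group 1: {rate_1:.3f}")
print(f"gap: {rate_1 - rate_0:.3f}")
```

Dropping the sensitive column changes nothing here, because the information it carried is still available through the correlated feature.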

# Error Disparities (The Evidence)

## FNR (False Negative Rate)

FNR matters because:

*It measures how often we deny opportunity to someone who actually qualifies.*

**Logistic Regression FNR**:

Female: 50.1%

Male: 39.6%

Gap: 10.5 points


**XGBoost FNR**:

Female: 46.6%

Male: 34.5%

Gap: 12.0 points

This is key.

Among people who truly earn >50K, women are more likely to be incorrectly classified as ≤50K.

Even more striking:

XGBoost reduces FNR for both groups but widens the gap between them, from 10.5 to 12.0 percentage points.
This is the core teaching insight.

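# Tradeoff

| Metric        | Logistic Regression | XGBoost |
| ------------- | ------------------- | ------- |
| Accuracy      | 84.6%               | 86.6%   |
| FNR gap       | 10.5 pp             | 12.0 pp |
| Selection gap | 17.2 pp             | 17.5 pp |

Gaps are differences between men and women in percentage points (pp).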

----------------------------------------



XGBoost is more powerful. It captures nonlinear patterns and proxy relationships.

Some of those patterns reflect historical inequality in the data.

## So the model becomes more accurate — but also better at reproducing structural inequity.
----------------------------------------

# Fairness is not a property of the algorithm — it is a property of the entire socio-technical system.

From here, the discussion turns to mitigation options:

- Human-in-the-loop

- Governance oversight

- Threshold adjustments

- Fairness constraints
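
As a rough illustration of the "threshold adjustments" option, the sketch below picks a separate decision threshold per group so that each group ends up with approximately the same FNR. It is a hand-rolled example on synthetic scores, not the notebook's method; the function names, the score distributions and the target FNR of 0.15 are all assumptions. fairlearn ships a `ThresholdOptimizer` in `fairlearn.postprocessing` for a more principled version of this idea.

```python
import numpy as np

def thresholds_for_equal_fnr(scores, y_true, groups, target_fnr):
    """Per-group threshold below which about `target_fnr` of true positives fall."""
    return {
        int(g): float(np.quantile(scores[(groups == g) & (y_true == 1)], target_fnr))
        for g in np.unique(groups)
    }

def fnr_by_group(y_pred, y_true, groups):
    """FNR per group: share of true positives predicted as 0."""
    return {
        int(g): float(1.0 - y_pred[(groups == g) & (y_true == 1)].mean())
        for g in np.unique(groups)
    }

# Synthetic scores: true positives in group 0 tend to score lower than in group 1.
rng = np.random.default_rng(0)
n = 10_000
groups = rng.integers(0, 2, size=n)
y_true = rng.integers(0, 2, size=n)
scores = np.clip(rng.normal(0.35 + 0.25 * y_true + 0.10 * y_true * groups, 0.15), 0, 1)

# Baseline: one threshold for everyone.
single = (scores > 0.5).astype(int)
fnr_single = fnr_by_group(single, y_true, groups)

# Adjusted: one threshold per group, chosen to hit the same target FNR.
thr = thresholds_for_equal_fnr(scores, y_true, groups, target_fnr=0.15)
per_row_thr = np.array([thr[int(g)] for g in groups])
adjusted = (scores >= per_row_thr).astype(int)
fnr_adjusted = fnr_by_group(adjusted, y_true, groups)

print("thresholds:", thr)
print("FNR, single threshold:", fnr_single)
print("FNR, per-group thresholds:", fnr_adjusted)
```

Equalizing FNR this way shifts false positive rates and requires the sensitive attribute at decision time, so whether it is appropriate is itself a governance question.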




# **A simple mitigation (Human-in-the-loop + review queue)**


**Idea**: Send uncertain cases to a human reviewer.

**Step A: get probabilities**

In [None]:
import numpy as np
proba = xgb.predict_proba(X_test)[:, 1]  # probability of >50K

**Step B: define an “uncertain band” and review it**

In [None]:
low, high = 0.45, 0.55
needs_review = (proba >= low) & (proba <= high)

auto_pred = (proba > 0.5).astype(int)

print("Review rate:", needs_review.mean())
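
The width of the band determines reviewer workload, and that workload may not fall evenly on every group. A minimal sketch of how one might check this is shown below; it uses made-up scores and group labels rather than the notebook's `proba` and `A_test`, and `review_rate_by_group` is a hypothetical helper.

```python
import numpy as np

def review_rate_by_group(proba, groups, low, high):
    """Share of each group whose score falls inside the review band [low, high]."""
    proba, groups = np.asarray(proba), np.asarray(groups)
    in_band = (proba >= low) & (proba <= high)
    return {str(g): float(in_band[groups == g].mean()) for g in np.unique(groups)}

# Hypothetical scores skewed toward 0, with randomly assigned group labels.
rng = np.random.default_rng(0)
groups = rng.choice(["Female", "Male"], size=4000)
proba = rng.beta(2, 5, size=4000)

for half_width in (0.05, 0.10, 0.20):
    rates = review_rate_by_group(proba, groups, 0.5 - half_width, 0.5 + half_width)
    print(f"band 0.5 ± {half_width:.2f}:", rates)
```

In the notebook, the same function could be called with `proba` and `A_test` to see how many cases from each group land in the review queue.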

**Step C: simulate “human correction”**

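For this demo, we simulate expert review by overwriting the reviewed predictions with the ground-truth label, i.e. a reviewer who is always right. Real reviewers make mistakes and can carry their own biases, so the results below are a best case for this review band.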

In [None]:
final_pred = auto_pred.copy()
final_pred[needs_review] = y_test.iloc[needs_review].values  # simulated human correction
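
Because a perfect reviewer is an optimistic assumption, it can help to see how the result degrades when reviewers are sometimes wrong. The sketch below is a self-contained variant on synthetic scores: `apply_review` and its `reviewer_accuracy` parameter are hypothetical names introduced here, and the reviewer is modeled very simply as being right independently with a fixed probability.

```python
import numpy as np

def apply_review(proba, y_true, low=0.45, high=0.55, reviewer_accuracy=1.0, seed=0):
    """Auto-decide at 0.5; inside [low, high], use a simulated reviewer who
    returns the true label with probability `reviewer_accuracy`."""
    rng = np.random.default_rng(seed)
    proba, y_true = np.asarray(proba), np.asarray(y_true)
    needs_review = (proba >= low) & (proba <= high)
    final = (proba > 0.5).astype(int)
    reviewer_is_right = rng.random(int(needs_review.sum())) < reviewer_accuracy
    truth = y_true[needs_review]
    final[needs_review] = np.where(reviewer_is_right, truth, 1 - truth)
    return final, needs_review

# Synthetic labels and scores (not the Adult data).
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=5000)
proba = np.clip(rng.normal(0.35 + 0.30 * y_true, 0.15), 0, 1)

for acc in (1.0, 0.9, 0.7):
    final, needs_review = apply_review(proba, y_true, reviewer_accuracy=acc)
    print(f"reviewer accuracy {acc:.1f}: overall accuracy {(final == y_true).mean():.4f}")
```

With `reviewer_accuracy=1.0` this reproduces the ground-truth overwrite used in the cell above; lower values give a more conservative estimate of what review adds.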

**Step D: re-evaluate fairness after HITL**

In [None]:
from fairlearn.metrics import MetricFrame, selection_rate, false_positive_rate, false_negative_rate
from sklearn.metrics import accuracy_score

mf = MetricFrame(
    metrics={
        "accuracy": accuracy_score,
        "selection_rate": selection_rate,
        "FPR": false_positive_rate,
        "FNR": false_negative_rate,
    },
    y_true=y_test,
    y_pred=final_pred,
    sensitive_features=A_test
)

print("\n=== XGBoost + Human Review (overall) ===")
print(mf.overall)

print("\n=== XGBoost + Human Review (by group) ===")
print(mf.by_group)

print("\n=== XGBoost + Human Review (group gaps) ===")
print(mf.difference())

# HITL Results Analysis

### Overall Performance Improved

**XGBoost (before HITL)**

Accuracy: 0.8659

FPR: 0.0581

FNR: 0.3647

**XGBoost + Human Review**

Accuracy: 0.8866 ⬆

FPR: 0.0470 ⬇

FNR: 0.3148 ⬇

With uncertain cases routed to (simulated) human review, accuracy improved from 86.6% to 88.7%.

**False negatives decreased substantially: overall FNR fell from 36.5% to 31.5%.**

# Look at Group-Level Fairness

### Selection Rate Gap

Before HITL (XGBoost): Gap: 0.1749

After HITL: Gap: 0.1754

**Selection disparity is roughly unchanged. HITL does not automatically fix structural imbalance.**

--------

**FPR Gap (False Positive Gap)**

Before HITL: Gap: 0.0617

After HITL: Gap: 0.0493

The difference in false positive errors between men and women decreased.

------------
**FNR Gap (Most Important)**

Before HITL: Gap: 0.1205

After HITL: Gap: 0.1143

Slight reduction.

But what matters more:

**Female FNR went from 0.4658 → 0.4106**

Male FNR went from 0.3453 → 0.2963

**Both improved**


## The algorithm alone improved predictive performance but preserved disparity


## When we added a human review band for uncertain cases, overall error decreased and some disparities shrank

# This demonstrates that fairness is not purely a modeling problem; it is a governance design problem

We’re not claiming humans are perfect.

We’re showing a governance pattern: automation where confident; human oversight where uncertain.

This is practical in admissions, scholarships, internships, hiring shortlists, etc.