# Calculating Fairness Metrics



We will use the census adult income dataset from the UCI ML repository for our example.

Our target is the income bracket, a binary variable (either `<=50K` or `>50K`).

We will use gender as our protected attribute in this example, with women as the protected group.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('https://www.dropbox.com/s/j8scafz8tu8z8zc/census_adult_income.csv?dl=1')

In [None]:
df

Unnamed: 0,age,workclass,functional_weight,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,39,Private,297847,9th,5,Married-civ-spouse,Other-service,Wife,Black,Female,3411,0,34,United-States,<=50K
1,72,Private,74141,9th,5,Married-civ-spouse,Exec-managerial,Wife,Asian-Pac-Islander,Female,0,0,48,United-States,>50K
2,45,Private,178215,9th,5,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,40,United-States,>50K
3,31,Private,86958,9th,5,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
4,55,Private,176012,9th,5,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,23,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,20,Private,86143,Some-college,10,Never-married,Other-service,Other-relative,Asian-Pac-Islander,Male,0,0,30,United-States,<=50K
32557,48,Private,350440,Some-college,10,Married-civ-spouse,Craft-repair,Other-relative,Asian-Pac-Islander,Male,0,0,40,Cambodia,>50K
32558,22,Local-gov,195532,Some-college,10,Never-married,Protective-serv,Other-relative,White,Female,0,0,43,United-States,<=50K
32559,20,Private,176321,Some-college,10,Never-married,Adm-clerical,Other-relative,White,Female,0,0,20,United-States,<=50K


We will create binary columns for gender and income bracket (our target variable).

In [None]:
df['gender'] = (df['sex'] == ' Female')

In [None]:
df['target'] = (df['income_bracket'] == ' >50K')

In [None]:
df = df.drop(['sex', 'income_bracket'], axis=1)

Now we will build a column transformer to standardize / encode our independent variables:
* Gender
* Hours per week
* Working class
* Marital status
* Education
* Occupation

The rest of the variables (including the target) will be dropped.

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, OneHotEncoder

ct = make_column_transformer(
    (OrdinalEncoder(), ['gender']),
    (StandardScaler(), ['hours_per_week']),
    (OneHotEncoder(), ['workclass', 'marital_status', 'education', 'occupation']),
    remainder='drop'
)

Here we sample 80% of the rows for training and hold out the remaining 20% as the test set.

In [None]:
df_train = df.sample(frac=0.8, random_state=1234)
df_test = df.drop(df_train.index)

Time to train the logistic regression model.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
pipeline = make_pipeline(ct, model)
pipeline.fit(df_train, df_train['target'])

Here we check precision, recall, and F-score for the positive class (`>50K`) on the test set.

In [None]:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, fscore, support = precision_recall_fscore_support(df_test['target'], pipeline.predict(df_test))
print('precision:', precision[1])
print('recall:', recall[1])
print('fscore:', fscore[1])

precision: 0.7172593235039029
recall: 0.5146235220908525
fscore: 0.5992753623188406


## Demographic parity

Now we can compute our first fairness metric: demographic parity.  This metric compares the rate at which members of the protected group $G$ are positively classified ($C=1$) with the rate for non-members.  We apply a threshold of 80% to the following ratio, so the model passes only if the ratio is at least 0.8:

$$\frac{ Pr(C=1|x\in G) }{ Pr(C=1|x \notin G) }$$

In [None]:
df_test['pred'] = pipeline.predict(df_test)
num = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == True)]) / len(df_test[df_test['gender'] == True])
den = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == False)]) / len(df_test[df_test['gender'] == False])
demo_par = num/den
demo_par

0.20580399619410086
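The same ratio can be written as a small reusable function. This is a minimal sketch: the name `demographic_parity_ratio` and the toy inputs are hypothetical and chosen only for illustration.

```python
import pandas as pd

def demographic_parity_ratio(pred, protected):
    """Ratio of positive-prediction rates: protected group over everyone else."""
    pred = pd.Series(pred).astype(bool).reset_index(drop=True)
    protected = pd.Series(protected).astype(bool).reset_index(drop=True)
    rate_protected = pred[protected].mean()   # Pr(C=1 | x in G)
    rate_other = pred[~protected].mean()      # Pr(C=1 | x not in G)
    return rate_protected / rate_other

# Hypothetical toy predictions, for illustration only.
toy_pred      = [True, False, False, False, True,  True,  False, True]
toy_protected = [True, True,  True,  True,  False, False, False, False]

ratio = demographic_parity_ratio(toy_pred, toy_protected)
print(ratio)          # 0.25 / 0.75
print(ratio >= 0.8)   # does the toy example pass the 80% threshold?
```

In this notebook the equivalent call would be `demographic_parity_ratio(df_test['pred'], df_test['gender'])`.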

## Equalized Odds and Equal Opportunity

Equalized odds and equal opportunity are alternative metrics that incorporate classifier accuracy.  Equalized odds compares both the true positive and false positive rates across groups, whereas equal opportunity compares only the true positive rates.

With these metrics, we are not asking that members of each group be accepted at the same rate, but that each group have the same rate of being correctly (and incorrectly) accepted.

True positive rates:

$$Pr(C=1|x\in G,y=1) $$
$$Pr(C=1|x \notin G,y=1)$$

False positive rates:

$$Pr(C=1|x\in G,y=0)$$
$$Pr(C=1|x \notin G,y=0)$$

We can also be more thorough in our analysis, by checking both ratios for each metric: protected/other and other/protected.


In [None]:
tp_protected = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == True) &
 (df_test['target'] == True)]) / len(df_test[(df_test['gender'] == True) & (df_test['target'] == True)])
tp_other = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == False) &
 (df_test['target'] == True)]) / len(df_test[(df_test['gender'] == False) & (df_test['target'] == True)])

# gender == True is the protected group, so the fp_* names follow the tp_* names above
fp_protected = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == True) &
 (df_test['target'] == False)]) / len(df_test[(df_test['gender'] == True) & (df_test['target'] == False)])
fp_other = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == False) &
 (df_test['target'] == False)]) / len(df_test[(df_test['gender'] == False) & (df_test['target'] == False)])

tp_protected, tp_other, fp_protected, fp_other

(0.3148936170212766,
 0.5488338192419825,
 0.015532940546331012,
 0.09776168531928901)
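The rates above can be turned into the two ratios mentioned earlier (protected/other and other/protected). The sketch below copies the four reported rates as literals and labels each by the `gender` and `target` conditions used to compute it. The helper names and the choice to apply the 0.8 threshold to these metrics are assumptions for illustration.

```python
# Rates copied from the output above (test set).
tp_rate_protected = 0.3148936170212766    # gender == True,  target == True
tp_rate_other     = 0.5488338192419825    # gender == False, target == True
fp_rate_protected = 0.015532940546331012  # gender == True,  target == False
fp_rate_other     = 0.09776168531928901   # gender == False, target == False

def both_ratios(rate_protected, rate_other):
    """Return (protected/other, other/protected) for a pair of group rates."""
    return rate_protected / rate_other, rate_other / rate_protected

def passes(rate_protected, rate_other, threshold=0.8):
    """True if the smaller of the two ratios is at least `threshold`."""
    return min(both_ratios(rate_protected, rate_other)) >= threshold

tp_ratios = both_ratios(tp_rate_protected, tp_rate_other)
fp_ratios = both_ratios(fp_rate_protected, fp_rate_other)

# Equal opportunity looks only at true positive rates;
# equalized odds additionally requires the false positive rates to be close.
equal_opportunity_ok = passes(tp_rate_protected, tp_rate_other)
equalized_odds_ok = equal_opportunity_ok and passes(fp_rate_protected, fp_rate_other)

print('TPR ratios:', tp_ratios)
print('FPR ratios:', fp_ratios)
print('equal opportunity:', equal_opportunity_ok)
print('equalized odds:', equalized_odds_ok)
```

Taking the minimum of the two ratios means a group that is favored too strongly fails the check just as a disfavored one does.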

## Testing fairness through unawareness

Now train another logistic regression model, this time leaving out the "gender" column.  Does this improve the fairness of the model?  Do we achieve a "fair model" according to the 80% threshold?

In [None]:
ct = make_column_transformer(
    (StandardScaler(), ['hours_per_week']),
    (OneHotEncoder(), ['workclass', 'marital_status', 'education', 'occupation']),
    remainder='drop'
)

df_train = df.sample(frac=0.8, random_state=1234)
df_test = df.drop(df_train.index)

model = LogisticRegression(max_iter=1000)
pipeline = make_pipeline(ct, model)
pipeline.fit(df_train, df_train['target'])

precision, recall, fscore, support = precision_recall_fscore_support(df_test['target'], pipeline.predict(df_test))
print('precision:', precision[1])
print('recall:', recall[1])
print('fscore:', fscore[1])

precision: 0.7186147186147186
recall: 0.5164903546981954
fscore: 0.6010137581462709


In [None]:
df_test['pred'] = pipeline.predict(df_test)
num = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == True)]) / len(df_test[df_test['gender'] == True])
den = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == False)]) / len(df_test[df_test['gender'] == False])
demo_par = num/den
demo_par

0.23873117121316245

Fairness improves slightly: the demographic parity ratio rises from 0.206 to 0.239. However, the model still falls well short of the 80% threshold (0.239 < 0.8), so leaving out the "gender" column alone does not give us a "fair model".

## Searching for a fair model

Use hyperparameter search and model selection techniques to try to find a fairer model (using any fairness metric you prefer).  Possible variations to test:
* Features included in the model
* Model type
* Hyperparameter settings
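One way such a search could be organized is to loop over feature sets and hyperparameter values and record a fairness metric next to accuracy for each candidate. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the census data, and the names `proxy`, `skill`, `feature_sets`, and `demographic_parity_ratio` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data (an assumption for illustration, not the census data):
# `proxy` is correlated with group membership, `skill` is not.
protected = rng.random(n) < 0.4
proxy = rng.normal(loc=np.where(protected, -1.0, 1.0), scale=1.0)
skill = rng.normal(size=n)
y = (0.8 * proxy + skill + rng.normal(scale=0.5, size=n)) > 0.5

X = np.column_stack([proxy, skill])
train = rng.random(n) < 0.8   # roughly 80% of rows for training
valid = ~train                # the rest for validation

def demographic_parity_ratio(pred, protected):
    """Positive-prediction rate of the protected group over the other group."""
    return pred[protected].mean() / pred[~protected].mean()

feature_sets = {'proxy + skill': [0, 1], 'skill only': [1]}

results = []
for label, cols in feature_sets.items():
    for C in (0.01, 0.1, 1.0, 10.0):   # inverse regularization strength
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X[train][:, cols], y[train])
        pred = clf.predict(X[valid][:, cols])
        results.append({
            'features': label,
            'C': C,
            'accuracy': (pred == y[valid]).mean(),
            'demo_par': demographic_parity_ratio(pred, protected[valid]),
        })

# Rank candidates by the fairness metric, keeping accuracy visible alongside it.
results = pd.DataFrame(results).sort_values('demo_par', ascending=False)
print(results)
```

Sorting by the fairness metric alone can favor models that are fair but inaccurate, so it is worth reading both columns before picking a candidate.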

In [None]:
# recursive feature elimination
from sklearn.feature_selection import RFE

# Scale every numeric column and one-hot encode every string column.
# The boolean 'gender' and 'target' columns match neither dtype, so they are dropped.
ct = make_column_transformer(
    (StandardScaler(), df.select_dtypes(include='number').columns),
    (OneHotEncoder(), df.select_dtypes(include='object').columns)
)

df_train = df.sample(frac=0.8, random_state=1234)
df_test = df.drop(df_train.index)
X_train = ct.fit_transform(df_train)

# Keep the 5 transformed features that RFE ranks highest.
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=5)
rfe.fit(X_train, df_train['target'])

# Map the boolean support mask back to the transformer's feature names.
features = list(ct.get_feature_names_out()[rfe.get_support()])
features

['standardscaler__capital_gain',
 'onehotencoder__education_ Preschool',
 'onehotencoder__marital_status_ Married-civ-spouse',
 'onehotencoder__occupation_ Priv-house-serv',
 'onehotencoder__relationship_ Own-child']

In [None]:
ct = make_column_transformer(
    (StandardScaler(), ['capital_gain']),
    (OneHotEncoder(), ['education', 'marital_status', 'occupation', 'relationship']),
    remainder='drop'
)

df_train = df.sample(frac=0.8, random_state=1234)
df_test = df.drop(df_train.index)

model = LogisticRegression(max_iter=1000)
pipeline = make_pipeline(ct, model)
pipeline.fit(df_train, df_train['target'])

df_test['pred'] = pipeline.predict(df_test)
num = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == True)]) / len(df_test[df_test['gender'] == True])
den = len(df_test[(df_test['pred'] == True) & (df_test['gender'] == False)]) / len(df_test[df_test['gender'] == False])
demo_par = num/den
demo_par

0.33324905351546646