# Fairbalance
This notebook demonstrates the use of fairbalance to load data, evaluate its balance, rebalance its protected attributes and evaluate a model.

In [None]:
%pip install fairbalance
%pip install scikit-learn

# Loading data

You can load data using the fairbalance.datasets module. Currently, the available datasets are Adult, German, KDD Census and the American Community Survey. These datasets are lightly preprocessed so that they are ready to use as soon as they are loaded.

In this tutorial, we will use the Adult dataset, whose binary task is to predict whether an individual earns more than $50K (1) or not (0).

In [4]:
from fairbalance.datasets import load_adult, ADULT_METADATA

In [5]:
data, target, cont_columns, cat_columns = load_adult()

Let's take a look at the data:

In [6]:
data.head()

| Unnamed: 0 | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


In [11]:
print("target: \n", target.head())
print("categorical columns: \n", cat_columns)
print("continuous columns: \n", cont_columns)

target: 
    income
0       0
1       0
2       0
3       0
4       0
categorical columns: 
 ['workclass', 'marital-status', 'occupation', 'relationship', 'native-country', 'education', 'sex', 'race']
continuous columns: 
 ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']


ADULT_METADATA contains the protected attributes and privileged classes usually used with this dataset.

In [16]:
ADULT_METADATA

{'protected_attributes': ['sex', 'race'],
 'privileged_classes': {'sex': 'Male', 'race': 'White'}}

# Evaluating the dataset balance

Let's now evaluate the balance of the dataset. For this, we can use the FairnessAnalysis class from the fairbalance.metrics module.

In [17]:
from fairbalance.metrics import FairnessAnalysis

In [19]:
adult_analysis = FairnessAnalysis(X=data,
                                  y=target,
                                  positive_output_target=1,
                                  protected_attributes=ADULT_METADATA["protected_attributes"],
                                  privileged_groups=ADULT_METADATA["privileged_classes"])

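The get_fairness_analysis() method gives a full analysis of the protected attributes. For each attribute, it displays the balance index, the disparate impact ratio (DIR) with its RMSDIR summary, the pointwise mutual information (PMI) with its RMSPMI summary, and the CBS (printed under the COMPOSITE FAIRNESS SCORE heading).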

In [20]:
adult_analysis.get_fairness_analysis()

______

BALANCE INDEX
sex 0.6587009932592764
race 0.17560928080714577
______

DISPARATE IMPACT RATIO
sex {'Female': 0.36052833882228574}
race {'Black': 0.4760563602437251, 'Asian-Pac-Islander': 0.9369908378449117, 'Amer-Indian-Eskimo': 0.4651012834069375, 'Other': 0.4850049931988143}
RMSDIR
sex 0.36052833882228574
race 0.6237248062223386
______

POINTWISE MUTUAL INFORMATION
sex
Male {0: 0.11641842963328615, 1: -0.1496443695438816}
Female {0: -0.13048722919155542, 1: 0.2366433715599068}
race
White {0: 0.04340482896946561, 1: -0.038970203552986166}
Black {0: -0.0592233495120658, 1: 0.15346697180365237}
Asian-Pac-Islander {0: 0.011316639922955816, 1: -0.02591478507214903}
Amer-Indian-Eskimo {0: -0.03150645002805027, 1: 0.10445847701425687}
Other {0: -0.029281104546010503, 1: 0.09661305580981895}
RMSPMI
sex 0.165055816604411
race 0.07304622535023712
______

COMPOSITE FAIRNESS SCORE
sex 0.570794361621946
race 0.4751067313628907
______
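
The RMSDIR and RMSPMI lines in the output above are numerically consistent with a plain root mean square taken over the per-group values. The sketch below checks this against the printed numbers. The rms helper is a hypothetical name introduced for illustration, not part of fairbalance, and the library's actual implementation may differ.

```python
import math

def rms(values):
    """Root mean square of an iterable of numbers (hypothetical helper)."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))

# Per-group values copied from the analysis output above.
dir_race = {
    "Black": 0.4760563602437251,
    "Asian-Pac-Islander": 0.9369908378449117,
    "Amer-Indian-Eskimo": 0.4651012834069375,
    "Other": 0.4850049931988143,
}
pmi_sex = {
    "Male": {0: 0.11641842963328615, 1: -0.1496443695438816},
    "Female": {0: -0.13048722919155542, 1: 0.2366433715599068},
}

# RMS over the unprivileged groups' DIR values.
rmsdir_race = rms(dir_race.values())
# RMS over every (group, class) PMI value of the attribute.
rmspmi_sex = rms(v for per_class in pmi_sex.values() for v in per_class.values())

print(round(rmsdir_race, 4))  # 0.6237, matches the printed RMSDIR for race
print(round(rmspmi_sex, 4))   # 0.1651, matches the printed RMSPMI for sex
```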



You can also compute individual metrics with their own methods, for example the disparate impact ratio:

In [22]:
DIR, RMSDIR = adult_analysis.get_disparate_impact_ratio()

print("DIR: \n", DIR)
print("RMSDIR: \n", RMSDIR)

DIR: 
 {'sex': {'Female': 0.36052833882228574}, 'race': {'Black': 0.4760563602437251, 'Asian-Pac-Islander': 0.9369908378449117, 'Amer-Indian-Eskimo': 0.4651012834069375, 'Other': 0.4850049931988143}}
RMSDIR: 
 {'sex': 0.36052833882228574, 'race': 0.6237248062223386}
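
To make the DIR values easier to read, here is a minimal sketch of the metric on toy data. It assumes the common definition of the disparate impact ratio: the positive-outcome rate of an unprivileged group divided by that of the privileged group. The function name and the toy data are made up for illustration; this is not fairbalance's code, and the library may handle edge cases differently.

```python
from collections import defaultdict

def disparate_impact_ratio(groups, labels, privileged, positive=1):
    """DIR of every unprivileged group relative to the privileged one.

    Illustrative helper based on the common definition (assumption).
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, y in zip(groups, labels):
        totals[g] += 1
        positives[g] += int(y == positive)
    # Positive-outcome rate per group.
    rate = {g: positives[g] / totals[g] for g in totals}
    return {g: rate[g] / rate[privileged] for g in rate if g != privileged}

# Toy data: 4 of 8 men and 1 of 4 women have the positive outcome.
sex = ["Male"] * 8 + ["Female"] * 4
income = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 0, 0, 0]

toy_dir = disparate_impact_ratio(sex, income, privileged="Male")
print(toy_dir)
# {'Female': 0.5}
```

A value of 1 means the group receives the positive outcome at the same rate as the privileged group; the further below 1, the stronger the disparity.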


# Balancing the dataset

Looking at the CBS of each attribute, we can see that both are relatively low (0.57 for sex and 0.48 for race). As a rule of thumb, a value above 0.75 is usually considered good.
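
As a small illustration of this rule of thumb, the snippet below flags the attributes whose score falls under 0.75. The scores are copied from the analysis output above. THRESHOLD is a name chosen here for the rule-of-thumb value, not a fairbalance constant.

```python
# Composite scores copied from the analysis output above.
composite_scores = {"sex": 0.570794361621946, "race": 0.4751067313628907}

# Rule-of-thumb cut-off from the text (assumption: not a library constant).
THRESHOLD = 0.75

to_mitigate = [attr for attr, score in composite_scores.items() if score < THRESHOLD]
print(to_mitigate)
# ['sex', 'race']
```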

Therefore, we need to mitigate the bias. The fairbalance package implements several balancing strategies, available in fairbalance.mitigation_strategies, which rely on processors available in fairbalance.processors. In this example, we'll use the CompleteBalance mitigation strategy with the RandomOverSamplerProcessor processor.

In [31]:
from fairbalance.mitigation_strategies import CompleteBalance
from fairbalance.processors import RandomOverSamplerProcessor

In [32]:
balancer = CompleteBalance(processor=RandomOverSamplerProcessor())

We can simply balance our dataset using the resample() method. Let's balance the two protected attributes at the same time.

In [33]:
balanced_data, balanced_target = balancer.resample(X=data,
                                                   y=target,
                                                   protected_attributes=ADULT_METADATA["protected_attributes"],
                                                   cont_columns=cont_columns,
                                                   cat_columns=cat_columns)
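
For intuition, here is a minimal pure-Python sketch of random oversampling towards a fully balanced dataset. It assumes that "complete balance" means every combination of protected value and target class ends up with the same number of rows. That reading is an assumption, and oversample_to_balance is a hypothetical helper, not the algorithm fairbalance actually runs.

```python
import random
from collections import Counter, defaultdict

def oversample_to_balance(rows, labels, protected, seed=0):
    """Duplicate random rows until every (protected value, label) cell is
    as large as the biggest cell. Illustrative only."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row, y in zip(rows, labels):
        cells[(row[protected], y)].append((row, y))
    target_size = max(len(members) for members in cells.values())
    balanced = []
    for members in cells.values():
        balanced.extend(members)
        # Draw the missing rows with replacement from the same cell.
        balanced.extend(rng.choices(members, k=target_size - len(members)))
    rng.shuffle(balanced)
    new_rows = [row for row, _ in balanced]
    new_labels = [y for _, y in balanced]
    return new_rows, new_labels

# Toy data: 6 men (3 positive) and 3 women (1 positive).
rows = [{"sex": "Male"}] * 6 + [{"sex": "Female"}] * 3
labels = [1, 1, 1, 0, 0, 0] + [1, 0, 0]

new_rows, new_labels = oversample_to_balance(rows, labels, "sex")
counts = Counter((row["sex"], y) for row, y in zip(new_rows, new_labels))
print(sorted(counts.items()))
# [(('Female', 0), 3), (('Female', 1), 3), (('Male', 0), 3), (('Male', 1), 3)]
```

Note that this sketch can only enlarge cells that already contain at least one row; a combination that never occurs in the data stays empty.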

Let's evaluate the balance of the new data:

In [34]:
balanced_data_analysis = FairnessAnalysis(X=balanced_data,
                                          y=balanced_target,
                                          positive_output_target=1,
                                          protected_attributes=ADULT_METADATA["protected_attributes"],
                                          privileged_groups=ADULT_METADATA["privileged_classes"])
balanced_data_analysis.get_fairness_analysis()

______

BALANCE INDEX
sex 1.0
race 1.0
______

DISPARATE IMPACT RATIO
sex {'Female': 0.9983823617518424}
race {'Black': 0.9981616733740747, 'Asian-Pac-Islander': 0.998755599800896, 'Amer-Indian-Eskimo': 0.9875062220009954, 'Other': 0.9927327028372324}
RMSDIR
sex 0.9983823617518424
race 0.9942995313410059
______

POINTWISE MUTUAL INFORMATION
sex
Male {0: 0.0003785535251920788, 1: -0.00045847101385361065}
Female {0: -0.0003786808342135173, 1: 0.0004584217680699599}
race
White {0: 0.000986598207249448, 1: -0.001433874460723081}
Black {0: 0.0014606882104401177, 1: -0.002122383539021307}
Asian-Pac-Islander {0: 0.0006662709839175421, 1: -0.0009684838993840177}
Amer-Indian-Eskimo {0: -0.002229357650186129, 1: 0.0032453728741205297}
Other {0: -0.0008840711219281151, 1: 0.0012860924879759094}
RMSPMI
sex 0.00042043077123955286
race 0.0017025706527084767
______

COMPOSITE FAIRNESS SCORE
sex 0.9990350275417906
race 0.9965651739607544
______



Much better! Both attributes now have a balance index of 1.0, and their composite scores are above 0.99.

# Train-Test split and model training

fairbalance implements a balanced_train_test_split function, which is an extension of sklearn's train_test_split. It can be accessed through the fairbalance.utils module.

In [35]:
from fairbalance.utils import balanced_train_test_split

The balanced_train_test_split function takes the dataset and the target as mandatory parameters. Other parameters are:
- protected_attributes: if this is the only optional parameter given, the function returns a train-test split that preserves, as far as possible, the class proportions of the given protected attributes.
- mitigator: if a mitigator object is given, it is applied to the training data.
- cont_columns and cat_columns: required if a mitigator is given.
- any parameter of sklearn's train_test_split function, such as test_size or random_state.
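
To illustrate what preserving the proportions of a protected attribute means, here is a minimal pure-Python sketch of a stratified split on toy data. stratified_split is a hypothetical helper written for this example; it does not reproduce how balanced_train_test_split is implemented, and it handles a single protected attribute only.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, labels, protected, test_size=0.25, seed=0):
    """Split so that each protected group keeps roughly the same share
    in the train and test sets. Illustrative only."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, row in enumerate(rows):
        by_group[row[protected]].append(idx)
    train_idx, test_idx = [], []
    for indices in by_group.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    X_train, y_train = pick(train_idx)
    X_test, y_test = pick(test_idx)
    return X_train, X_test, y_train, y_test

# Toy data: 8 men and 4 women, so women are one third of the rows.
rows = [{"sex": "Male"}] * 8 + [{"sex": "Female"}] * 4
labels = [1, 0] * 6

X_train, X_test, y_train, y_test = stratified_split(rows, labels, "sex")
train_counts = Counter(row["sex"] for row in X_train)
test_counts = Counter(row["sex"] for row in X_test)
print(dict(train_counts), dict(test_counts))
# {'Male': 6, 'Female': 3} {'Male': 2, 'Female': 1}
```

Women make up one third of both the training and the test set, as in the full toy dataset.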


In [37]:
X_train, X_test, y_train, y_test = balanced_train_test_split(data, target,
                                                             protected_attributes=ADULT_METADATA["protected_attributes"],
                                                             mitigator=CompleteBalance(processor=RandomOverSamplerProcessor()),
                                                             cont_columns=cont_columns,
                                                             cat_columns=cat_columns)

This data can now be used to train any model of your choice, for example one from sklearn!