# YData Quality - Bias & Fairness Tutorial
Time-to-Value: 3 minutes

This notebook provides a tutorial for the Bias & Fairness module of the ydata_quality package.

**Structure:**

1. A bias and fairness introduction
2. Load example dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

## Bias and Fairness
As data is increasingly used for automated decision-making with a heavy impact on individuals' lives, a thorough analysis of data quality must include an understanding of the biases embedded in datasets, which can lead to different treatment based on sensitive attributes. Data Scientists have ethical and legal obligations to develop applications that are unbiased and fair.

### Definitions
We consider _fairness_ to be the absence of differentiated treatment (assistive or punitive) based on sensitive attributes. _Fairness_ can also be thought of as the absence of unjustified basis for differentiated treatment.

We consider _sensitive attributes (a.k.a. sensitive features)_ to be personal details on which, for legal and ethical reasons, differentiated treatment must not be based.

We consider _bias_ to be a systematic, non-negligible difference in treatment directed at a specific sub-group of individuals.

### Remarks
- The absence of sensitive attributes in a data application does not guarantee fairness by default. 
- Biases in data can originate from multiple sources: sample (e.g. fair representation of multiple groups), label (e.g. how the outcome was defined), machine learning pipeline (e.g. how data is digested), application (e.g. how an application is deployed). Not all biases can be detected during data quality analysis, but it is important to keep them in mind from the beginning.



**References**
- Chapter 11 Bias and Fairness | Big Data and Social Science [(link)](https://textbook.coleridgeinitiative.org/chap-bias.html)
- Fairness and Machine Learning - Limitations and Opportunities [(link)](https://fairmlbook.org/)
- Fairness Tutorial - Moritz Hardt - MLSS 2020, Tübingen [(link)](https://www.youtube.com/watch?v=Igq_S_7IfOU)

In [1]:
import pandas as pd

from ydata_quality.bias_fairness import BiasFairness

## Load the example dataset
The "Adult Data Set" (a.k.a. "Census Income") contains a set of records to predict whether an individual's income exceeds $50K/yr, based on census data. 

In [2]:
# This is the DataFrame used in the demo from GE tutorials
df = pd.read_csv('../datasets/transformed/census_10k.csv')
df.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Create the engine
Each engine contains the checks and tests for each suite.

In [3]:
bf = BiasFairness(df=df, sensitive_features=['race', 'sex'], label='income', random_state=42)

### Full Evaluation
The easiest way to run the full data quality analysis is to call `.evaluate()`, which returns a dictionary with the outputs of each test performed.

In [4]:
results = bf.evaluate()

## Check the status
After running the data quality checks, you can check the warnings for each individual test of the Bias & Fairness module. The warnings are sorted by priority and have additional details that can provide better insights for Data Scientists.

In [5]:
bf.report()

	[PROXY IDENTIFICATION] Found 1 feature pairs of correlation to sensitive attributes with values higher than defined threshold (0.5). (Priority 2: usage allowed, limited human intelligibility)
	[SENSITIVE ATTRIBUTE REPRESENTATIVITY] Found 2 values of 'race' sensitive attribute with low representativity in the dataset (below 1.00%). (Priority 2: usage allowed, limited human intelligibility)


In [6]:

bias_fairness_warnings = bf.get_warnings()
bias_fairness_warnings

 relationship_sex    0.643254
 Name: association, dtype: float64),
  Other                 0.0083
 Name: race, dtype: float64)]

## Full Test Suite
In this section, you will find a detailed overview of the tests available in the Bias & Fairness module of ydata_quality. All of them are run by the `evaluate` method, which centralizes input arguments and returns a results dictionary with the specific outputs of each test, keyed by test name.

In [7]:
# Results object structure
list(results.keys())

['performance_discrimination',
 'proxy_identification',
 'sensitive_predictability',
 'sensitive_representativity']

### Performance Discrimination
The "Performance Discrimination" test inspects disparities in classification performance across the values of sensitive attributes. A model is trained on the full data, but the performance metrics are broken down by sub-group, so the Data Scientist can see how a baseline classifier performs on each partition of the data.

In [8]:
performances = bf.performance_discrimination()
performances

{'race':  Other                 [ERROR] Failed performance metric with message...
  Amer-Indian-Eskimo                                             0.572917
  Asian-Pac-Islander                                             0.550847
  White                                                          0.573228
  Black                                                          0.595126
 dtype: object,
 'sex':  Male      0.582815
  Female    0.614891
 dtype: float64}
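
The per-group breakdown described above can be sketched independently of the library. The helper `metric_by_group` and the toy data below are hypothetical, and plain accuracy is used as a stand-in because the tutorial does not state which metric `performance_discrimination` computes.

```python
import pandas as pd

def metric_by_group(y_true, y_pred, groups):
    """Accuracy of a single set of predictions, broken down per sub-group."""
    frame = pd.DataFrame({
        # True where the prediction matches the label
        "correct": pd.Series(y_true).values == pd.Series(y_pred).values,
        "group": pd.Series(groups).values,
    })
    # The mean of a boolean column is the share of correct predictions
    return frame.groupby("group")["correct"].mean()

# Toy labels and predictions, invented for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
sex = ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"]

per_group = metric_by_group(y_true, y_pred, sex)
print(per_group)
```

A large gap between groups in such a breakdown is the kind of disparity this test is meant to surface.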

### Proxy Identification
The "Proxy Identification" test detects associations between sensitive and non-sensitive attributes in the data, alerting Data Scientists to unwanted proxies. Because non-protected features can leak protected ones, disregarding sensitive attributes may give Data Scientists false confidence that this information is no longer available in the data, when it may still be present indirectly.

In [9]:
bf.proxy_identification(th=0.2)

features
relationship_sex       0.643254
marital-status_sex     0.449933
occupation_sex         0.420299
native-country_race    0.409056
hours-per-week_sex     0.224377
income_sex             0.202520
Name: association, dtype: float64
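
To make the idea of an association score between two categorical features concrete, here is a minimal sketch using Cramér's V (without bias correction). This is an assumption for illustration: the tutorial does not state which association measure `proxy_identification` uses, and the helper `cramers_v` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = independent, 1 = perfectly associated)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the marginals divided by n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "relationship": ["Husband", "Wife", "Husband", "Wife", "Husband", "Wife"],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "workclass": ["Private", "Private", "State-gov", "State-gov", "Private", "Private"],
})

v_proxy = cramers_v(toy["relationship"], toy["sex"])  # relationship fully determines sex here
v_weak = cramers_v(toy["workclass"], toy["sex"])      # workclass carries no information on sex here
print(v_proxy, v_weak)
```

In this toy example `relationship` acts as a perfect proxy for `sex`, which mirrors why the `relationship_sex` pair ranks highest in the output above.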

### Sensitive Attributes Predictability
The "Sensitive Attributes Predictability" test builds a predictive model with each sensitive attribute as the target, to gauge how easily a baseline model can predict protected features from non-protected ones. It is similar to "Proxy Identification" but captures more complex relationships, since it is based on a model rather than on association measures. Because non-protected features can leak protected ones, disregarding sensitive attributes may give Data Scientists false confidence that this information is no longer available in the data, when it may still be present indirectly.

The performance metric values are calculated as a ratio of _(real - min)/(max - min)_. The best performance (max) is defined as a perfect prediction, where the predictions equal the true outcomes. The baseline performance (min) is defined as the performance achieved by a naive model, where the predictions are the mode for classification or mean for regression.

The values returned by the test indicate the fraction of the achievable performance that is reached by a baseline model trained to predict a sensitive attribute.
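
The rescaling _(real - min)/(max - min)_ can be written as a small function. This is a sketch of the formula only, not the library's implementation; the name `adjusted_performance` and the example scores are hypothetical.

```python
def adjusted_performance(real, baseline, best=1.0):
    """Rescale a raw score to the share of the gap between a naive baseline (min) and a perfect score (max)."""
    if best == baseline:
        raise ValueError("best and baseline must differ")
    return (real - baseline) / (best - baseline)

# Hypothetical scores: the model reaches 0.8, a naive model reaches 0.5, a perfect model 1.0
example = adjusted_performance(real=0.8, baseline=0.5)
print(round(example, 3))  # 0.6
```

A value of 0 means the model is no better than the naive baseline, and a value of 1 means it predicts the sensitive attribute perfectly.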

In [10]:
sens_pred = bf.sensitive_predictability()
sens_pred

race    0.120460
sex     0.254559
dtype: float64

In [11]:

bf.sensitive_predictability(adjusted_metric=False)

race    0.050544
sex     0.627280
dtype: float64

### Sensitive Attributes Representativity
The "Sensitive Attributes Representativity" test calculates the distribution of categorical features to assess whether any sub-group of a sensitive attribute is underrepresented. It is meant for Data Scientists to validate the data for sampling bias, i.e. the systematic over- or under-representation of some members of a population relative to the others.

In [12]:
bf.sensitive_representativity()

{'race':  White                 0.8556
  Black                 0.0953
  Asian-Pac-Islander    0.0309
  Amer-Indian-Eskimo    0.0099
  Other                 0.0083
 Name: race, dtype: float64,
 'sex':  Male      0.6703
  Female    0.3297
 Name: sex, dtype: float64}
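
The representativity check can be approximated with plain pandas: compute the relative frequency of each value and keep those below a threshold. The sketch below assumes the 1% threshold mentioned in the warning message earlier; the helper `low_representativity` and the toy series are hypothetical and not part of ydata_quality.

```python
import pandas as pd

def low_representativity(series, threshold=0.01):
    """Return the relative frequency of each value whose share falls below `threshold`."""
    shares = series.value_counts(normalize=True)
    return shares[shares < threshold]

# Toy sensitive attribute, invented for illustration only
toy_race = pd.Series(["A"] * 990 + ["B"] * 6 + ["C"] * 4)

flagged = low_representativity(toy_race)
print(flagged)
```

Applied to the `race` distribution above, the same rule would flag `Amer-Indian-Eskimo` (0.0099) and `Other` (0.0083), matching the two values reported in the warning.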