# YData Quality - Bias & Fairness Tutorial
Time-to-Value: 3 minutes

This notebook is a tutorial for the Bias & Fairness module of the ydata_quality package.

**Structure:**

1. A bias and fairness introduction
2. Load example dataset
3. Instantiate the Data Quality engine
4. Run the quality checks
5. Assess the warnings
6. (Extra) Detailed overview

## Bias and Fairness

In [1]:
import pandas as pd

from ydata_quality.bias_fairness import BiasFairness

## Load the example dataset

In [2]:
# Load the census example dataset
df = pd.read_csv('../examples/census/census.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Create the engine
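Each engine bundles the checks and tests of one suite. Here, `BiasFairness` takes the DataFrame, the list of sensitive features, and the name of the label column.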

In [3]:
bf = BiasFairness(df=df, sensitive_features=['race', 'sex'], label='income')
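Before creating the engine, it can help to confirm that the sensitive features and the label are actual column names in the DataFrame. The following is a minimal sketch: `missing_columns` is a hypothetical helper, not part of ydata_quality, and the column list is copied from the `df.head()` output above.

```python
def missing_columns(columns, sensitive_features, label):
    """Return the requested column names that are absent from `columns`.

    Hypothetical helper for illustration; not part of ydata_quality.
    """
    available = set(columns)
    requested = list(sensitive_features) + [label]
    return [name for name in requested if name not in available]


# Column names copied from the df.head() output above.
census_columns = [
    "Unnamed: 0", "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]

print(missing_columns(census_columns, ["race", "sex"], "income"))  # []
print(missing_columns(census_columns, ["gender"], "income"))       # ['gender']
```

With a real DataFrame, `df.columns` can be passed as the first argument.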

### Full Evaluation
The easiest way to run the data quality analysis is to call `.evaluate()`, which returns a dictionary with the output of each operation performed.

In [4]:
results = bf.evaluate()


In [5]:
results

{'performance_discrimination': {'race':  Black                 0.582349
   Asian-Pac-Islander    0.561472
   Other                 0.604348
   Amer-Indian-Eskimo    0.586435
   White                 0.580440
  dtype: float64,
  'sex':  Male      0.589942
   Female    0.600223
  dtype: float64},
 'proxy_identification': Series([], Name: association, dtype: float64)}

## Check the status
After running the data quality checks, you can review the warnings raised by each individual operation. The warnings are sorted by priority and carry additional details that give data scientists better insight.

In [6]:
bf.report()



## Full Test Suite
In this section, you will find a detailed overview of the tests available in the Bias & Fairness module of ydata_quality. The `evaluate` method runs all of them, centralizing the input arguments and returning a results dictionary structured by test.

In [7]:
# Results object structure
list(results.keys())

['performance_discrimination', 'proxy_identification']

### Performance Discrimination


In [8]:
performances = bf.performance_discrimination()
performances


{'race':  Black                 0.582349
  Asian-Pac-Islander    0.561472
  Other                 0.604348
  Amer-Indian-Eskimo    0.586435
  White                 0.580440
 dtype: float64,
 'sex':  Male      0.589942
  Female    0.600223
 dtype: float64}
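The per-group scores are easier to compare once reduced to the spread between the highest and lowest scoring group of each sensitive feature. The sketch below assumes a hypothetical helper, `performance_gaps`, which is not part of ydata_quality. It accepts any mapping of feature name to a dict or pandas Series of group scores, and the example values are copied from the output above.

```python
def performance_gaps(performances):
    """Summarize, per sensitive feature, the spread between group scores.

    Hypothetical helper for illustration; not part of ydata_quality.
    `performances` maps a feature name to a dict or pandas Series of
    {group: score}; both expose `.items()`.
    """
    summary = {}
    for feature, scores in performances.items():
        pairs = list(scores.items())
        high_group, high = max(pairs, key=lambda pair: pair[1])
        low_group, low = min(pairs, key=lambda pair: pair[1])
        summary[feature] = {
            "highest": high_group,
            "lowest": low_group,
            "gap": round(high - low, 6),
        }
    return summary


# Scores copied from the performance_discrimination output above.
performances_example = {
    "race": {
        "Black": 0.582349,
        "Asian-Pac-Islander": 0.561472,
        "Other": 0.604348,
        "Amer-Indian-Eskimo": 0.586435,
        "White": 0.580440,
    },
    "sex": {"Male": 0.589942, "Female": 0.600223},
}

for feature, stats in performance_gaps(performances_example).items():
    print(feature, stats)
```

In this run the spread is roughly 0.043 across `race` groups and roughly 0.010 across `sex` groups. The same function can be called directly on the dictionary returned by `bf.performance_discrimination()`.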

### Proxy Identification


In [9]:
bf.proxy_identification(th=0.2)

features
relationship_sex       0.648892
marital-status_sex     0.461635
occupation_sex         0.423864
native-country_race    0.407741
hours-per-week_sex     0.229309
income_sex             0.215836
Name: association, dtype: float64
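Each index label in the output above appears to join a feature name and a sensitive attribute with an underscore, as in `relationship_sex`. Assuming that naming pattern holds, the sketch below groups the candidate proxies by sensitive attribute. `group_proxies` is a hypothetical helper, not part of ydata_quality, and it splits on the last underscore, so it assumes sensitive attribute names contain no underscore.

```python
from collections import defaultdict


def group_proxies(associations):
    """Group 'feature_sensitive' labels by their sensitive attribute.

    Hypothetical helper for illustration; not part of ydata_quality.
    `associations` is a dict or pandas Series of {label: score}.
    """
    grouped = defaultdict(list)
    for label, score in associations.items():
        # Split on the last underscore: feature names such as
        # 'marital-status' use hyphens, so only the separator matches.
        feature, sensitive = label.rsplit("_", 1)
        grouped[sensitive].append((feature, score))
    # Strongest associations first within each sensitive attribute.
    for pairs in grouped.values():
        pairs.sort(key=lambda pair: pair[1], reverse=True)
    return dict(grouped)


# Scores copied from the proxy_identification(th=0.2) output above.
associations_example = {
    "relationship_sex": 0.648892,
    "marital-status_sex": 0.461635,
    "occupation_sex": 0.423864,
    "native-country_race": 0.407741,
    "hours-per-week_sex": 0.229309,
    "income_sex": 0.215836,
}

for sensitive, pairs in group_proxies(associations_example).items():
    print(sensitive, pairs)
```

Grouped this way, five of the six flagged pairs involve `sex` and one involves `race`. Note that `income`, the label itself, appears among the columns associated with `sex`.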

### Sensitive Predictability

In [10]:
bf.sensitive_predictability()


race    0.049720
sex     0.628948
dtype: float64