
# Fairness and Bias

Bias occurs when the frequency of events, properties, or outcomes captured in a dataset does not accurately reflect their real-world frequency, so we need to audit our data for potential bias. Fairness, in this context, means that the data represents the relevant groups without that kind of distortion.

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

from ipywidgets import interact
import ipywidgets as widgets

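# `display` and `HTML` are part of the public IPython.display module;
# importing them from IPython.core.display is deprecated.
from IPython.display import display, HTML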
import base64
!pip -q install facets-overview==1.0.0
from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator

Download the census dataset.

In [2]:
# Download census dataset
!wget -q https://download.mlcc.google.com/mledu-datasets/adult_census_train.csv
!wget -q https://download.mlcc.google.com/mledu-datasets/adult_census_test.csv

In [3]:
# Read dataset
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
           "marital_status", "occupation", "relationship", "race", "gender",
           "capital_gain", "capital_loss", "hours_per_week", "native_country",
           "income_bracket"]

train_df = pd.read_csv('/content/adult_census_train.csv', names=COLUMNS, 
                       sep=r'\s*,\s*', engine='python', na_values="?")

test_df = pd.read_csv('/content/adult_census_test.csv', names=COLUMNS, 
                      sep=r'\s*,\s*', skiprows=[0],
                      engine='python', na_values="?")

train_df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_bracket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


List the numerical and categorical features.

In [6]:
# Print numerical and categorical features
numerical_features = train_df.select_dtypes(include=np.number).columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

print(numerical_features)
print(categorical_features)

['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'gender', 'native_country', 'income_bracket']


## Spotting bias

There are several questions to ask:
* **Are there missing feature values for a large number of observations?**
* **Are there features that are missing that might affect other features?**
* **Are there any unexpected feature values?**
* **What signs of data skew do you see?**

We use Facets Overview, part of the Facets visualization library, to summarize the statistics of each feature and help answer these questions.

In [4]:
fsg = FeatureStatisticsGenerator()
dataframes = [
    {'table': train_df, 'name': 'trainData'}]
censusProto = fsg.ProtoFromDataFrames(dataframes)
protostr = base64.b64encode(censusProto.SerializeToString()).decode("utf-8")


HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))
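The first three questions can also be checked numerically with plain pandas, alongside the Facets view. The following is a minimal sketch and not part of the original notebook: `audit_features` and `toy_df` are hypothetical names, and the small toy frame stands in for `train_df` so the example runs on its own.

```python
import numpy as np
import pandas as pd


def audit_features(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values, cardinality, and numeric ranges per column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_count": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })
    # Range and share of exact zeros only make sense for numeric columns;
    # non-numeric columns are left as NaN in these three fields.
    numeric = df.select_dtypes(include=np.number)
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    summary["zero_pct"] = ((numeric == 0).mean() * 100).round(2)
    return summary.sort_values("missing_pct", ascending=False)


# Toy stand-in for train_df (hypothetical values, same column names).
toy_df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", None, "Private", "Private"],
    "capital_gain": [2174, 0, 0, 0],
    "hours_per_week": [40, 13, 40, 1],
})

report = audit_features(toy_df)
print(report)
```

On the real data, `audit_features(train_df)` would list the columns with the most missing values first, and the `min`, `max`, and `zero_pct` columns help flag unexpected values such as a very small `hours_per_week`.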

We can also use Plotly to draw a pie chart of each categorical feature.

In [9]:
import plotly.express as px

@interact
def f(categoricalf=categorical_features):
  count = train_df[categoricalf].value_counts()
  # Keep only the 10 most frequent categories of the selected feature
  x, y = count.index[:10], count.values[:10]
  df = pd.DataFrame({'x': x, 'y': y})

  fig = px.pie(df, values='y', names='x')
  fig.show()

Answering these questions, we can spot several potential biases, for example:
* `hours_per_week` has a minimum value of 1, which looks suspicious because most jobs involve far more than one hour of work per week.
* In `capital_gain` and `capital_loss`, more than 90% of the values are zero. This may be reasonable: only people who make investments report these values, and investors make up a small fraction of the whole population.
* `gender` is 67% male and 33% female, which suggests data skew (unfairness), since we would expect a split closer to 50/50 in the general population.
* However, `native_country` cannot be called biased even though it is dominated by the United States (roughly 90 to 95% of records, depending on whether rare and missing values are counted), because the census covers people living in the US.
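Percentages like these can be recomputed directly and compared against an expected distribution. The sketch below is illustrative only: `representation_gap` is a hypothetical helper, the toy series stands in for `train_df["gender"]`, and the 50/50 reference is an assumption chosen for the example, not a figure from the dataset documentation.

```python
import pandas as pd


def representation_gap(series: pd.Series, reference: dict) -> pd.DataFrame:
    """Compare observed category shares with assumed reference shares."""
    # Shares are computed over all rows (missing values count as their own
    # category), not only over the most frequent values.
    observed = series.value_counts(normalize=True, dropna=False)
    ref = pd.Series(reference, dtype=float)
    # Categories absent from the data get an observed share of 0.
    observed = observed.reindex(ref.index, fill_value=0.0)
    return pd.DataFrame({
        "observed": observed,
        "reference": ref,
        "gap": observed - ref,
    })


# Toy stand-in for train_df["gender"], using the 67/33 split noted above.
toy_gender = pd.Series(["Male"] * 67 + ["Female"] * 33)

# ASSUMPTION: the 50/50 reference is illustrative; use a documented
# real-world distribution when auditing actual data.
gap = representation_gap(toy_gender, {"Male": 0.5, "Female": 0.5})
print(gap)
```

A positive `gap` means the category is over-represented relative to the reference, and a negative one means it is under-represented. Because the shares use every row, they can differ from a chart that shows only the top categories.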


We can go deeper by asking these questions:

* **What's missing?**
* **What's being overgeneralized?**
* **What's being underrepresented?**
* **How do the variables, and their values, reflect the real world?**
* **What might we be leaving out?**

Use Facets Dive and try the following settings:

* Select `education` as **Binning | X-axis**, and `income_bracket` as **Color by** and **Label by**.
* Select `marital_status` as **Binning | X-axis**, and `gender` as **Color by** and **Label by**.

In [33]:
SAMPLE_SIZE = 5000
  
train_dive = train_df.sample(SAMPLE_SIZE).to_json(orient='records')

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=train_dive)
display(HTML(html))

Some of these questions can be answered:

* `education` and `income_bracket` reflect the real world: in general, income correlates with education. The higher the education level (Bachelors or higher), the larger the proportion of people with income above 50K.
* In most categories of `marital_status`, the ratio of men to women is close to 1:1. The exception is `Married-civ-spouse`, where men heavily outnumber women. Since women make up only 33% of the dataset overall, the shortfall is concentrated in this category, so married women in particular are under-represented.
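The proportions read off the Facets Dive view can be cross-checked with a normalized cross-tabulation. This is a minimal sketch under stated assumptions: `group_proportions` is a hypothetical helper, and the toy frame (invented rows, real column names) stands in for `train_df`.

```python
import pandas as pd


def group_proportions(df: pd.DataFrame, row: str, col: str) -> pd.DataFrame:
    """Share of each `col` value within every `row` category (rows sum to 1)."""
    return pd.crosstab(df[row], df[col], normalize="index")


# Toy stand-in for train_df (hypothetical rows).
toy_df = pd.DataFrame({
    "marital_status": ["Married-civ-spouse"] * 4 + ["Never-married"] * 4,
    "gender": ["Male", "Male", "Male", "Female",
               "Male", "Female", "Male", "Female"],
})

by_status = group_proportions(toy_df, "marital_status", "gender")
print(by_status)
```

On the real data, `group_proportions(train_df, "marital_status", "gender")` and `group_proportions(train_df, "education", "income_bracket")` would give the numbers behind the two observations above. Unlike the Facets Dive cell, which draws a random sample of 5000 rows, this uses every row.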

References:

* ML Crash Course by Google: https://developers.google.com/machine-learning/crash-course/fairness/identifying-bias 
* Other resources:
  * https://www.kdnuggets.com/2020/12/machine-learning-model-fair.html
  * Tutorial in R: https://www.districtdatalabs.com/fairness-and-bias-in-algorithms