# Simpson's Paradox

In this notebook we will study the statistics and causal effects behind the alleged gender discrimination at U.C. Berkeley in 1973. This is also the example given in [*The Book of Why*](http://bayes.cs.ucla.edu/WHY/) (Judea Pearl; 2018). Instead of constructing an artificial example as asked in the Python challenge, I will analyze this real dataset.

## Background 

## Data 

The dataset can be found on the website of Michael Stob (Professor of Mathematics Emeritus at Calvin University) under "Berkeley"; see http://www.calvin.edu/~stob/data/.

https://rawgit.com/amitrajitbose/GenderBias-UCBerkeley-Report/master/ProjectReport.html

In [1]:
import numpy as np
import pandas as pd
import pprint 

In [108]:
df = pd.read_csv("Berkeley.csv")
df.head()

|   | Admit | Gender | Dept | Freq |
|---|---|---|---|---|
| 0 | Admitted | Male | A | 512 |
| 1 | Rejected | Male | A | 313 |
| 2 | Admitted | Female | A | 89 |
| 3 | Rejected | Female | A | 19 |
| 4 | Admitted | Male | B | 353 |


In [109]:
df_males = df.loc[df.Gender == 'Male']
df_females = df.loc[df.Gender == 'Female']

departments = df.Dept.unique()

df_dept = pd.DataFrame(columns=['Dept', 'AcceptRateMale', 'AcceptRateFemale'])
df_dept.Dept = departments

for dp in departments:
    # First fill each row with the absolute number of admitted applicants
    df_dept.loc[df_dept.Dept == dp, 'AcceptRateMale'] = int(df_males.query("Dept == @dp & Admit == 'Admitted'").Freq.iloc[0])
    df_dept.loc[df_dept.Dept == dp, 'AcceptRateFemale'] = int(df_females.query("Dept == @dp & Admit == 'Admitted'").Freq.iloc[0])

# Then divide by the total number of applicants per department to get acceptance rates.
# groupby sorts by Dept, which matches the order of `departments` in this dataset.
df_dept.AcceptRateMale /= df_males.groupby('Dept').Freq.sum().values
df_dept.AcceptRateFemale /= df_females.groupby('Dept').Freq.sum().values
    
df_dept

|   | Dept | AcceptRateMale | AcceptRateFemale |
|---|---|---|---|
| 0 | A | 0.620606 | 0.824074 |
| 1 | B | 0.630357 | 0.68 |
| 2 | C | 0.369231 | 0.340641 |
| 3 | D | 0.330935 | 0.349333 |
| 4 | E | 0.277487 | 0.239186 |
| 5 | F | 0.0589812 | 0.0703812 |
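
The same per-department rates can also be obtained without an explicit loop. The following is a minimal sketch, not the method used in the cell above. To stay self-contained it rebuilds the data inline from the standard UCBAdmissions counts, which is an assumption, since only the first five rows of `Berkeley.csv` are displayed here (the counts do reproduce the rates printed above). The names `counts`, `rows`, `table` and `rates` are illustrative.

```python
import pandas as pd

# Assumed data: (admitted, rejected) counts per department and gender,
# taken from the standard UCBAdmissions table rather than read from Berkeley.csv.
counts = {
    "A": {"Male": (512, 313), "Female": (89, 19)},
    "B": {"Male": (353, 207), "Female": (17, 8)},
    "C": {"Male": (120, 205), "Female": (202, 391)},
    "D": {"Male": (138, 279), "Female": (131, 244)},
    "E": {"Male": (53, 138), "Female": (94, 299)},
    "F": {"Male": (22, 351), "Female": (24, 317)},
}

# Long format with the same columns as the notebook's DataFrame.
rows = [
    {"Admit": admit, "Gender": gender, "Dept": dept, "Freq": freq}
    for dept, by_gender in counts.items()
    for gender, (admitted, rejected) in by_gender.items()
    for admit, freq in (("Admitted", admitted), ("Rejected", rejected))
]
df = pd.DataFrame(rows)

# One row per (Dept, Gender), one column per admission outcome.
table = df.pivot_table(index=["Dept", "Gender"], columns="Admit",
                       values="Freq", aggfunc="sum")

# Acceptance rate = admitted / (admitted + rejected), reshaped to Dept x Gender.
rates = (table["Admitted"] / table.sum(axis=1)).unstack("Gender")
print(rates.round(6))
```

Because the division is aligned on the `(Dept, Gender)` index, this version does not depend on the row order of the input file.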


In [123]:
# Aggregate acceptance rate per gender: admitted applicants divided by all applicants
df[df.Admit == 'Admitted'].groupby('Gender').Freq.sum() / df.groupby('Gender').Freq.sum()

Gender
Female    0.303542
Male      0.445188
Name: Freq, dtype: float64

We see that in departments A, B, D and F the acceptance rate of women is higher than that of men, and in departments C and E it is only slightly lower (by about 3 to 4 percentage points). However, in the aggregated data the acceptance rate of women, approx. 0.30, is much lower than that of men, approx. 0.45. This is a classic example of Simpson's paradox: when we condition on a feature, here the department, the direction of the association between gender and admission changes.
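
One way to see where the reversal comes from is to write each gender's aggregate rate as a weighted average of its departmental rates, where the weights are the share of that gender's applications sent to each department. The sketch below is illustrative and not part of the original analysis. It rebuilds the counts inline, assuming they are the standard UCBAdmissions table (which reproduces the rates printed above), and the names `admitted`, `rejected`, `applied`, `rate`, `share` and `aggregate` are hypothetical.

```python
import pandas as pd

depts = ["A", "B", "C", "D", "E", "F"]

# Assumed counts (standard UCBAdmissions table); columns are genders.
admitted = pd.DataFrame({"Male": [512, 353, 120, 138, 53, 22],
                         "Female": [89, 17, 202, 131, 94, 24]}, index=depts)
rejected = pd.DataFrame({"Male": [313, 207, 205, 279, 138, 351],
                         "Female": [19, 8, 391, 244, 299, 317]}, index=depts)

applied = admitted + rejected
rate = admitted / applied          # acceptance rate per department and gender
share = applied / applied.sum()    # fraction of each gender's applications per department

# Aggregate rate per gender = sum over departments of (rate * share).
aggregate = (rate * share).sum()

print(share.round(3))
print(aggregate.round(6))
```

With these counts, departments A and B, which have the highest acceptance rates, receive about half of the male applications but under a tenth of the female ones. That difference in where applications go is enough to pull the female aggregate rate down even though women do as well as or better than men within most departments.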