# Gender bias study

For this case study, we look at the admission data from the UC Berkeley gender bias study ([Sex Bias in Graduate Admissions: Data from Berkeley](https://science.sciencemag.org/content/187/4175/398)). We generated a sample of individual student admission entries based on the admission numbers from the [admission table](https://www.randomservices.org/random/data/Berkeley.html) that is kindly provided by [RandomServices](http://www.randomservices.org/random/) under the [Creative Commons License](https://creativecommons.org/licenses/by/2.0/).


The dataset consists of admission records of graduate applicants to UC Berkeley in 1973, along with the departments they applied to. There are three variables: Gender, Admission, and Department. The goal of the study is to investigate whether UC Berkeley had a gender bias in its admission process in 1973.

## Explore Data
Let’s have a look at the data first.

In [None]:
import pandas as pd
data = pd.read_csv("student_admissions_berkeley.csv")
data.head()

Does Admission depend on Gender?

In [None]:
admissions = data.groupby(["Admission", "Gender"]).size().unstack()
admissions.plot(kind='bar');

In [None]:
# Admission rate in percent: admitted / (admitted + rejected), per gender
male, female = admissions["Male"], admissions["Female"]
print("Male admission rate:", round(100 * male["Yes"] / (male["Yes"] + male["No"])))
print("Female admission rate:", round(100 * female["Yes"] / (female["Yes"] + female["No"])))

This suggests that Admission depends on Gender, as males have a higher admission rate than females. If the decision to admit a student depended on the student’s gender, that would be a discriminatory policy against females. We want to investigate whether students were rejected *just* because they were female.

Formally, what we observed above suggests a statistical dependence (or correlation) between Admission and Gender. What we really want to know is whether Gender *causes* Admission. We examine this question next using causal analysis.

## Causal Analysis
In a graphical causal model, the causal graph is the key component: it encodes the causal relationships between variables. We start by building a mental model of the causal graph for the variables in our study. As some subjects are attended more by males than by females, it is plausible to assume that Gender influences Department. Some departments are more competitive to enter than others, so it is reasonable to assume that Department influences Admission. What we would like to investigate is whether Gender influences Admission directly. To test this against the data, we add a directed edge from Gender to Admission to our mental model.

In [None]:
import networkx as nx
import dowhy.gcm as gcm

causal_graph = nx.DiGraph([("Gender", "Department"), ("Gender", "Admission"), ("Department", "Admission")])
gcm.util.plot(causal_graph)
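
To make the structure of this graph explicit, the following minimal sketch lists every directed path from Gender to Admission. It is written in plain Python so that it runs without extra dependencies; the helper name `directed_paths` is hypothetical and not part of any library used in this study.

```python
def directed_paths(edges, source, target):
    """Enumerate all directed paths from source to target in a small DAG."""
    # Build an adjacency list: node -> list of children.
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    paths = []

    def walk(node, path):
        # Record the path once the target is reached.
        if node == target:
            paths.append(path)
            return
        for nxt in children.get(node, []):
            if nxt not in path:  # guard against revisiting a node
                walk(nxt, path + [nxt])

    walk(source, [source])
    return paths


# Same edges as the causal graph defined above.
edges = [("Gender", "Department"), ("Gender", "Admission"), ("Department", "Admission")]
paths = directed_paths(edges, "Gender", "Admission")
for path in paths:
    print(" -> ".join(path))
```

The graph contains one direct path (Gender to Admission) and one indirect path through Department. The question of this study is whether the direct path carries any influence.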

In addition to the graph, we need a data-generating process (also called a “causal model”) for each node given its parents in the causal graph. Although it is possible to assign a causal model to each node individually, `dowhy.gcm` comes with a convenience method that automatically selects a causal model for each node in the causal graph based on the data.

In [None]:
causal_model = gcm.StructuralCausalModel(causal_graph)
gcm.auto.assign_causal_mechanisms(causal_model, data)
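
As a sketch of what "one causal model per node, given its parents" means for this particular graph, the joint distribution factorizes along the edges. The shorthand $G$, $D$, $A$ for Gender, Department, and Admission is introduced here only for brevity:

$$
P(G, D, A) = P(G)\, P(D \mid G)\, P(A \mid G, D)
$$

Each factor corresponds to one node's mechanism. If the edge from Gender to Admission carried no influence, the last factor would reduce to $P(A \mid D)$.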

Now we're ready to learn the data-generating process:

In [None]:
gcm.fit(causal_model, data)

With the causal model fitted, we are now ready to ask various causal questions. We start by computing the causal strengths of the directed arrows in our mental model. In particular, we compute the strength of the arrows pointing into Admission, as we want to see whether the causal influence of Gender on Admission is strong enough to warrant an edge.

In [None]:
arrow_strength_admission = gcm.arrow_strength(causal_model, "Admission")
arrow_strength_admission

It turns out that the direct causal influence of Gender on Admission is very weak. The next step is then to examine the direct causal influence of Gender on Department.

In [None]:
arrow_strength_department = gcm.arrow_strength(causal_model, "Department")
arrow_strength_department

In [None]:
gcm.util.plot(causal_graph, causal_strengths={**arrow_strength_admission, **arrow_strength_department})

Overall, by quantifying the direct causal influences, we observe that Gender has a negligible influence on Admission compared to the causal strength of the other directed edges. This analysis therefore suggests that Gender influences Department, which in turn influences Admission. In other words, we have a chain causal graph Gender→Department→Admission.

We can also go one step further and validate the result of this analysis. As causal graphs encode conditional independence relationships between variables, some of those relationships are statistically testable (under some assumptions). In Gender→Department→Admission, the following (conditional) dependence and independence relationships should hold.

* Gender is dependent on Department
* Department is dependent on Admission
* Gender is dependent on Admission
* Gender is conditionally independent of Admission given Department
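
The relationships listed above can be read off a graph with the d-separation criterion. The following is a minimal sketch in plain Python, not part of the original analysis; the helper names `ancestors` and `d_separated` are hypothetical. It uses the standard moralized ancestral graph construction. Note that d-separation only yields independences; reading the unblocked paths as actual dependences additionally assumes faithfulness.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(edges, x, y, given=()):
    """Check whether x and y are d-separated by `given` in the DAG `edges`."""
    # Map every node to its set of parents.
    parents = {}
    for src, dst in edges:
        parents.setdefault(src, set())
        parents.setdefault(dst, set()).add(src)

    # Step 1: restrict to x, y, the conditioning set, and their ancestors.
    keep = ancestors(parents, {x, y, *given})

    # Step 2: moralize, i.e. link each child to its parents, link parents
    # that share a child, and drop edge directions.
    neighbours = {node: set() for node in keep}
    for child in keep:
        for parent in parents[child]:
            neighbours[child].add(parent)
            neighbours[parent].add(child)
        for a, b in combinations(sorted(parents[child]), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Step 3: remove the conditioning set and test whether y is reachable from x.
    blocked, seen, stack = set(given), {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nxt in neighbours[node] - blocked - seen:
            seen.add(nxt)
            stack.append(nxt)
    return True


chain = [("Gender", "Department"), ("Department", "Admission")]
full = chain + [("Gender", "Admission")]

print(d_separated(chain, "Gender", "Admission"))                        # no conditioning
print(d_separated(chain, "Gender", "Admission", given=["Department"]))  # chain graph
print(d_separated(full, "Gender", "Admission", given=["Department"]))   # with direct edge
```

In the chain graph, conditioning on Department blocks the only path between Gender and Admission. With the direct edge present, no conditioning set can separate the two, which is why the conditional independence test distinguishes the two graphs.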

We now test these relationships on the data using kernel-based (conditional) independence tests.

In [None]:
print("Gender is independent of Department: ", gcm.independence_test(data.Gender, data.Department) > 0.05)

As the p-value is very small (below the 0.05 threshold), we can confidently reject the null hypothesis that Gender is independent of Department.

In [None]:
print("Department is independent of Admission: ", gcm.independence_test(data.Department, data.Admission) > 0.05)

As the p-value for this independence test is also very small, we can confidently reject the null hypothesis that Department is independent of Admission.

In [None]:
print("Gender is independent of Admission, conditioned on Department:",
      gcm.independence_test(data.Gender, data.Admission, conditioned_on=data.Department) > 0.05)

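In this case, the p-value is quite high (above the 0.05 threshold). As such, we cannot reject the null hypothesis that Gender is conditionally independent of Admission given Department. Failing to reject does not prove independence; it only means that the data are consistent with it.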
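
To illustrate the idea behind a conditional independence test in the simplest categorical setting, here is a sketch of a different and much simpler technique than the kernel-based test used above: a stratified permutation test that shuffles Gender within each Department. The data are synthetic and invented for illustration, generated from a chain process in which Admission depends on Department only. The function names `stratified_gap` and `conditional_permutation_p_value` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_gap(records):
    """Size-weighted mean over departments of |P(admit | male) - P(admit | female)|."""
    counts = defaultdict(lambda: {"Male": [0, 0], "Female": [0, 0]})  # [admitted, total]
    for gender, dept, admitted in records:
        counts[dept][gender][0] += admitted
        counts[dept][gender][1] += 1
    gap = 0.0
    for by_gender in counts.values():
        (m_adm, m_tot), (f_adm, f_tot) = by_gender["Male"], by_gender["Female"]
        if m_tot and f_tot:
            gap += (m_tot + f_tot) * abs(m_adm / m_tot - f_adm / f_tot)
    return gap / len(records)


def conditional_permutation_p_value(records, n_permutations=500, seed=0):
    """Permutation p-value for 'Gender independent of Admission given Department'."""
    rng = random.Random(seed)
    observed = stratified_gap(records)

    # Group record indices by department so Gender is only shuffled within a stratum.
    by_dept = defaultdict(list)
    for idx, (_, dept, _) in enumerate(records):
        by_dept[dept].append(idx)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        genders = [gender for gender, _, _ in records]
        for idxs in by_dept.values():
            shuffled = [genders[i] for i in idxs]
            rng.shuffle(shuffled)
            for i, gender in zip(idxs, shuffled):
                genders[i] = gender
        permuted = [(g, d, a) for g, (_, d, a) in zip(genders, records)]
        if stratified_gap(permuted) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Synthetic chain data (assumption): Gender -> Department -> Admission.
rng = random.Random(1)
records = []
for _ in range(2000):
    gender = rng.choice(["Male", "Female"])
    dept = "A" if rng.random() < (0.7 if gender == "Male" else 0.3) else "B"
    admitted = int(rng.random() < (0.6 if dept == "A" else 0.2))
    records.append((gender, dept, admitted))

p_value = conditional_permutation_p_value(records)
print("p-value:", p_value)
```

Shuffling within departments preserves the Gender-Department and Department-Admission relationships while breaking any remaining link between Gender and Admission, which is exactly what the null hypothesis states.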

Overall, the data are consistent with all the (conditional) independence relationships that the chain graph Gender→Department→Admission entails. That is, the statistical tests support the causal graph Gender→Department→Admission. These findings are also in line with the gender bias study by Bickel et al. (1975), which argued that females tended to apply to more crowded departments with very few graduate positions. Looking at the data, we see that this is indeed the case.

In [None]:
data.groupby(["Department", "Gender"]).size().unstack().plot(kind='bar');

In [None]:
data.groupby(["Admission", "Department"]).size().unstack().plot(kind='bar');

A large proportion of males applied to departments A and B, which also have higher admission rates. A large proportion of females, in contrast, applied to the other departments, which have lower admission rates.
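
The arithmetic behind this effect can be seen in a small worked example. The counts below are hypothetical and invented for illustration only (they are not the Berkeley numbers): within each department both genders have the same admission rate, yet the overall rates differ because the genders apply to different departments.

```python
# Hypothetical counts (NOT the Berkeley data): (applicants, admitted)
# per department and gender.
counts = {
    "A": {"Male": (800, 480), "Female": (200, 120)},  # 60% admitted for both genders
    "B": {"Male": (200, 40), "Female": (800, 160)},   # 20% admitted for both genders
}

# Admission rate per department and gender.
per_department = {
    dept: {gender: admitted / applied for gender, (applied, admitted) in by_gender.items()}
    for dept, by_gender in counts.items()
}

# Overall admission rate per gender, pooling all departments.
overall = {}
for gender in ("Male", "Female"):
    applied = sum(counts[dept][gender][0] for dept in counts)
    admitted = sum(counts[dept][gender][1] for dept in counts)
    overall[gender] = admitted / applied

for dept, rates in per_department.items():
    print(dept, {gender: round(rate, 2) for gender, rate in rates.items()})
print("Overall", {gender: round(rate, 2) for gender, rate in overall.items()})
```

In this toy setting the pooled male rate is 520/1000 and the pooled female rate is 280/1000, although no department treats the genders differently. The gap is produced entirely by which department each gender tends to apply to.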