# Lecture 5: Source of Bias
This notebook is part of the [Algorithmic Fairness, Accountability and Ethics (Spring 2026)](https://learnit.itu.dk/course/view.php?id=3025445) course at [IT-University of Copenhagen](https://itu.dk/).

#### Ex.5.1: Data Analysis on the Berkeley admissions in 1973

1. The dataset `BerkeleyAdmissionsData.csv` is a three-way table that presents admissions data at the University of California, Berkeley in 1973 according to the variables department (A, B, C, D, E), gender (male, female), and outcome (admitted, denied) encoded as Yes and No.
2. Load the dataset
3. Did Berkeley admissions in 1973 suffer from gender bias? Why or why not?
    * What methods or metrics did you use?
    * Can you find any signs of [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox) (a form of aggregation bias)? What methods did you use?
    * When you complete the exercise, have a look at [the original paper](https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf)
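
One way to approach the question is to compare the aggregate admission rate per gender with the rate inside each department. The sketch below uses a small made-up table, not the Berkeley data. The column names `department`, `gender`, `admitted`, and `count`, as well as the helper `admission_rate`, are assumptions for illustration, so adapt them to the actual header of `BerkeleyAdmissionsData.csv`.

```python
import pandas as pd

# Toy counts (made up for illustration, NOT the Berkeley data).
toy = pd.DataFrame({
    "department": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "gender":     ["male", "male", "female", "female"] * 2,
    "admitted":   ["Yes", "No", "Yes", "No"] * 2,
    "count":      [80, 20, 18, 2, 10, 90, 20, 80],
})

def admission_rate(df, by):
    """Share of applicants admitted within each group defined by `by`."""
    totals = df.groupby(by)["count"].sum()
    admitted = df[df["admitted"] == "Yes"].groupby(by)["count"].sum()
    return (admitted / totals).round(3)

# Aggregate view: one rate per gender.
overall = admission_rate(toy, ["gender"])

# Disaggregated view: one rate per department and gender.
per_department = admission_rate(toy, ["department", "gender"]).unstack("gender")

print(overall)
print(per_department)
```

In this toy table the female rate is higher inside every department but lower in the aggregate, because most female applicants apply to the department with the lower admission rate. A reversal of this kind is what to look for in the real data.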

#### Ex.5.2: Correlation 
*Inspired by the [Social Data Science and Visualisation Lecture](https://github.com/suneman/socialdata2023/blob/main/lectures/Week2.ipynb) by Sune Lehmann*

You will be working with four files: `Data1.tsv`, `Data2.tsv`, `Data3.tsv`, and `Data4.tsv`. The `.tsv` format stands for tab-separated values. Each file has two columns separated by a tab character: the first column holds the $x$-values and the second column holds the $y$-values.

1. **Calculate simple statistics**
    1. Calculate the *mean* and *variance* of the $x$ and $y$ variables for each dataset (separately).
    2. Calculate *Pearson's* and *Spearman's* correlation coefficients between the $x$ and $y$ variables for each dataset (separately).
    3. Fit a straight line through each dataset. In Python, you can do it like this:
    ```
    from scipy import stats
    slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
    ```
    4. You should get one set of results for each dataset. Compare them. What do you observe?
2. **Visualise datasets**
    1. For each dataset make a plot (including the linear fit)
    2. What do you observe? How does it correspond to the results from the previous subsection?
    3. After you complete the exercise [look here](https://en.wikipedia.org/wiki/Anscombe%27s_quartet).
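
The statistics from the first part can be collected in one helper so that the four datasets are easy to compare side by side. This is a minimal sketch: the function name `summarize` is hypothetical, the input is a made-up toy series, and the commented `np.loadtxt` call assumes the files have no header row.

```python
import numpy as np
from scipy import stats

def summarize(x, y):
    """Return the summary statistics asked for in Ex.5.2 as a dict."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    fit = stats.linregress(x, y)
    return {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),  # sample variance; ddof=0 gives population variance
        "var_y": y.var(ddof=1),
        "pearson": stats.pearsonr(x, y)[0],
        "spearman": stats.spearmanr(x, y)[0],
        "slope": fit.slope,
        "intercept": fit.intercept,
    }

# Toy input (made up). For the exercise, load each file instead, e.g.
# x, y = np.loadtxt("Data1.tsv", delimiter="\t", unpack=True)
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

result = summarize(x, y)
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

The toy series is monotonic but not linear, so Spearman's coefficient is exactly 1 while Pearson's is slightly below 1. This shows why it helps to report both coefficients.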

#### Ex.5.3: Data Analysis on the ProPublica Dataset 

**The goal of this exercise is to have you interact with the COMPAS dataset: clean it for analysis, extract insights, visualize findings, and replicate part of ProPublica's analysis. If you have already worked with the COMPAS dataset and find the exercise boring or redundant, consider working on the other exercises, or analyze possible biases in a dataset of your choice.**

Please remember to use materials on [LearnIT](https://learnit.itu.dk/course/section.php?id=165657) under Lecture 5 – Read before class:
* Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries
* A Survey on Bias and Fairness in Machine Learning 

Also refer to the article [How we analyzed the COMPAS Recidivism Algorithm](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm) and the [ProPublica GitHub repository](https://github.com/propublica/compas-analysis/).


#### Loading and surveying the data
* Load the dataset `compas-scores-two-years.csv`

In [1]:
import pandas as pd

# Load the data and parse the jail timestamps as datetimes
compas = pd.read_csv('data/compas-scores-two-years.csv')
compas['c_jail_in'] = pd.to_datetime(compas['c_jail_in'])
compas['c_jail_out'] = pd.to_datetime(compas['c_jail_out'])
compas.head()

| | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | start | end | event | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 | 327 | 0 | 0 |
| 1 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 9 | 159 | 1 | 1 |
| 2 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 0 | 63 | 0 | 1 |
| 3 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | 6 | Medium | 2013-01-13 | | | 1 | 0 | 1174 | 0 | 0 |
| 4 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | 1 | Low | 2013-03-26 | | | 2 | 0 | 1102 | 0 | 0 |


#### Columns of Interest:
* `age` - Age of the defendant. It is numeric.
* `age_cat` - Age category. It can be 'Less than 25', '25 - 45', or 'Greater than 45'.
* `sex` - Sex of the defendant. It is either 'Male' or 'Female'.
* `race` - Race of the defendant. It can be 'African-American', 'Caucasian', 'Hispanic', 'Asian', 'Native American', or 'Other'.
* `c_charge_degree` - Degree of the charge. It is either M (misdemeanor), F (felony), or O (ordinary traffic offense, not resulting in jail time).
* `priors_count` - Count of prior crimes committed by the defendant. It is numeric.
* `days_b_screening_arrest` - Days between the arrest and the COMPAS screening.
* `decile_score` - The COMPAS score predicted by the system. It is an integer from 1 to 10.
* `score_text` - Category of the decile score. It can be Low (1-4), Medium (5-7), or High (8-10).
* `is_recid` - Indicates whether the defendant recidivated. It can be 0 (no), 1 (yes), or -1 (records with -1 are removed during cleaning).
* `two_year_recid` - Indicates whether the defendant recidivated within two years (0 or 1).
* `c_jail_in` - Time when the defendant was jailed.
* `c_jail_out` - Time when the defendant was released from the jail.

#### Data Cleaning
Now that we have surveyed the dataset, let's clean it. The cleaning is largely based on ProPublica's methods. Requirements for the data filtering:
1. Keep only cases where the COMPAS screening took place within +/- 30 days of the arrest (if the value is missing, the record should be removed).
2. Remove cases where `is_recid` is -1, since we only want binary values for the purpose of our model (0 for no recidivism, 1 for recidivism).
3. Remove cases where `c_charge_degree` is "O", which denotes ordinary traffic offenses (less serious offenses).

Finish cleaning the dataset by filling in the code below based on the description above. The cleaned dataset should have 6172 records and 13 features.

*(**Optional**) Create a "Length of stay in jail" feature (you can compute it from `c_jail_in` and `c_jail_out`) and use it in the exercise.*

In [2]:
print(compas.shape)

# Requirement 1: drop records with a missing days_b_screening_arrest
compas.dropna(subset=['days_b_screening_arrest'], inplace=True)
print(compas.shape)

# Requirement 3: drop ordinary traffic offenses (c_charge_degree == 'O')
compas = compas[compas['c_charge_degree'] != 'O'].copy()
print(compas.shape)  # unchanged: this file contains no such rows

# Requirement 1: keep records screened within +/- 30 days of the arrest
compas = compas[compas['days_b_screening_arrest'].abs() <= 30].copy()

# Requirement 2: drop records where is_recid is -1
compas = compas[compas['is_recid'] != -1].copy()

# Keep only the columns of interest
compas = compas[['age', 'age_cat', 'sex', 'race', 'c_charge_degree',
                 'priors_count', 'days_b_screening_arrest', 'decile_score',
                 'score_text', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']].copy()
print(compas.shape)

(7214, 53)
(6907, 53)
(6907, 53)
(6172, 13)
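
For the optional feature, the two datetime columns can be subtracted directly. The sketch below runs on made-up toy rows rather than the real file, and the column name `length_of_stay` is a hypothetical choice. It assumes both columns were already parsed with `pd.to_datetime`, as in the loading cell.

```python
import pandas as pd

# Toy rows (made up) with the same column names as the COMPAS file.
toy = pd.DataFrame({
    "c_jail_in":  ["2013-01-01 08:00:00", "2013-03-10 00:00:00", None],
    "c_jail_out": ["2013-01-02 20:00:00", "2013-03-20 00:00:00", None],
})
toy["c_jail_in"] = pd.to_datetime(toy["c_jail_in"])
toy["c_jail_out"] = pd.to_datetime(toy["c_jail_out"])

# Timedelta -> fractional days; rows without jail dates stay NaN.
stay = toy["c_jail_out"] - toy["c_jail_in"]
toy["length_of_stay"] = stay.dt.total_seconds() / 86400

print(toy)
```

Rows with missing jail dates keep a missing length of stay, so decide explicitly whether to drop or impute them before using the feature.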


#### Exploratory Data Analysis

First, study the basic statistics of the dataset (if you make plots, make sure that you provide axis labels and titles):
* Frequency of different attributes (such as `race`, `age`, `decile_score`, `priors_count`)
* General descriptive statistics of the dataset

#### Bias Analysis

* Study the distribution of the recidivism score `decile_score` for different categories: is the score distributed the same way across races? Across genders?
    * Make sure that your plots are comparable (e.g. axes have the same scale)
* If it is not distributed in the same way, which biases in the input dataset could lead to different distributions? Think about "how data can unintentionally discriminate" from the theory class.
* Is there a measurement bias? Explain.
* Is there a population bias? Explain.
* Is there a sampling bias? Explain.
* Look at the correlation between features. What do you notice? How could this affect the recidivism score? (*You can use the `associations` function from the `nominal` module of the `dython` package to measure associations between categorical and continuous variables. If you are unsure, check the lecture slides and read the documentation.*)
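
To make the score distributions comparable across groups of different sizes, it helps to work with shares instead of raw counts. The following is a minimal sketch on made-up toy rows with placeholder group labels `A` and `B`. In the exercise, the cleaned `compas` DataFrame would take the place of `toy`.

```python
import pandas as pd

# Toy rows (made up); in the exercise use the cleaned `compas` DataFrame.
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B"],
    "decile_score": [1, 1, 2, 10, 2, 10],
})

# Rows: groups, columns: scores, values: share of the group with that score.
shares = pd.crosstab(toy["race"], toy["decile_score"], normalize="index")

# Reindex so every group shows the full 1-10 scale, even for unseen scores.
shares = shares.reindex(columns=range(1, 11), fill_value=0.0)

print(shares.round(2))
# shares.T.plot(kind="bar", subplots=True, sharex=True, sharey=True) would draw
# one panel per group on identical axes (requires matplotlib).
```

Normalizing per group and sharing both axes keeps the panels visually comparable even when one group has many more records than another.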


#### Replicating ProPublica Analysis
ProPublica used the COMPAS scores to predict recidivism if the score was >= 5 and no recidivism if the score was < 5.

This is not a complete analysis, since it relies solely on the decile score and applies a hard threshold for prediction, discarding all other aspects of individuals. But let's reproduce it anyway.

Let's call this thresholded version of predicted recidivism `predicted_recid`.

* Compute and compare the confusion matrix for each of the races
* Compute and compare the error rate, false positive rate, and false negative rate for each of the races
* What do you conclude?
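
The thresholding and the per-group rates can be sketched as follows. The rows are made up, the group labels `A` and `B` are placeholders, and the helper `group_rates` is hypothetical. Using `two_year_recid` as the ground-truth outcome is an assumption here, so check it against ProPublica's methodology before relying on it.

```python
import pandas as pd

# Toy rows (made up); in the exercise use the cleaned `compas` DataFrame.
toy = pd.DataFrame({
    "race":           ["A", "A", "A", "A", "B", "B", "B", "B"],
    "decile_score":   [7, 3, 8, 2, 6, 4, 1, 9],
    "two_year_recid": [1, 0, 0, 1, 0, 0, 0, 1],
})

# Threshold from the text: a score >= 5 counts as predicted recidivism.
toy["predicted_recid"] = (toy["decile_score"] >= 5).astype(int)

def group_rates(df):
    """Error rate, false positive rate, and false negative rate for one group."""
    actual, predicted = df["two_year_recid"], df["predicted_recid"]
    tp = ((predicted == 1) & (actual == 1)).sum()
    fp = ((predicted == 1) & (actual == 0)).sum()
    fn = ((predicted == 0) & (actual == 1)).sum()
    tn = ((predicted == 0) & (actual == 0)).sum()
    return pd.Series({
        "error_rate": (fp + fn) / len(df),
        "fpr": fp / (fp + tn),  # share of non-recidivists predicted to recidivate
        "fnr": fn / (fn + tp),  # share of recidivists predicted not to recidivate
    })

for name, group in toy.groupby("race"):
    # Confusion matrix: rows are actual outcomes, columns are predictions.
    print(name)
    print(pd.crosstab(group["two_year_recid"], group["predicted_recid"]))

rates = pd.DataFrame({name: group_rates(g) for name, g in toy.groupby("race")}).T
print(rates)
```

Two groups can have similar overall error rates while their false positive and false negative rates differ, which is why the exercise asks for all three.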

#### References
- https://github.com/propublica/compas-analysis/
- https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
- https://mit-serc.pubpub.org/pub/risk-prediction-in-cj/release/2