# CS 378 Homework 1: Machine Learning on the ProPublica COMPAS Dataset (100 pts)

## Deadline: 11:59 pm, September 7, 2022

Please submit this completed notebook file to Canvas once finished. For policies regarding extensions and collaboration/honesty, please see the course syllabus. 

## Overall goals
Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a proprietary AI tool that some U.S. courts use to estimate a defendant's risk of recidivism. The nonprofit organization ProPublica performed an analysis of COMPAS and found that it systematically discriminated against black defendants. In this assignment, you will use a dataset of COMPAS scores (provided as part of this assignment) to reproduce parts of ProPublica's analysis, and also train your own ML-based criminal recidivism predictors.

Before you start working on this assignment, please read [this article](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing), which describes ProPublica's findings. You can also explore ProPublica's methodology [here](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm).


## Loading and surveying the data

You can download the dataset for this assignment as a CSV file from [here](https://drive.google.com/file/d/1e8vyCBn8u2J2s5EVUPSCm7zkI1d5CkFc/view?usp=sharing). Please make sure the CSV file is in the same directory as this Python notebook so that the data can be loaded.

To get started, we first need to load our dataset into the code. We do this using pandas, a popular Python data science library. First, we load our dataset (compas-scores-two-years.csv) into Python as a pandas dataframe. The .head() command returns the first 5 entries so that we can get a peek at what the dataset looks like.

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None # mute pandas' chained-assignment warning, which is harmless here; don't worry about this
df = pd.read_csv('compas-scores-two-years.csv')
df.head()

Then, we look into the kind of data the dataset contains by calling the columns field:

In [None]:
df.columns

The shape field tells us how many rows and columns the dataset has (hence the name shape). We will use it throughout to check how data cleaning and dataset manipulation are going.

In [None]:
df.shape

## Question 1: Data Cleaning (3 points)
Now we clean the data. This data cleaning is largely based on ProPublica's methods. First, we keep only cases where the COMPAS-scored crime happened within 30 days of the person's arrest (that is, `days_b_screening_arrest` lies between -30 and 30). Then, we remove cases where is_recid is -1, since we only want binary values for the purpose of our model (0 for no recidivism, 1 for recidivism). Finally, we don't want the c_charge_degree to be "O", which denotes ordinary traffic offenses (less serious offenses), and we don't want the score_text to be "N/A". All of this is done using the following code:

In [None]:
df_cleaned = df.loc[(df['days_b_screening_arrest'] <= 30) & (df['days_b_screening_arrest'] >= -30) 
              & (df['is_recid'] != -1) & (df['c_charge_degree'] != "O") & (df['score_text'] != 'N/A')]

In [None]:
df_cleaned.shape
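To see what each condition in the filter above removes, here is a minimal sketch on a made-up six-row table (the values are invented for illustration and are not taken from the real dataset). It uses `Series.between` and `Series.notna` as stand-ins for the comparisons in the cell above; `notna` is used because `pd.read_csv`, with its default settings, parses the string "N/A" as a missing value.

```python
import io
import pandas as pd

# Toy stand-in for the real CSV; every value here is made up.
toy_csv = io.StringIO(
    "days_b_screening_arrest,is_recid,c_charge_degree,score_text\n"
    "-1,0,F,Low\n"      # kept: passes every condition
    "45,1,M,High\n"     # dropped: more than 30 days from arrest
    "2,-1,F,Medium\n"   # dropped: is_recid is -1
    "0,1,O,Low\n"       # dropped: ordinary traffic offense
    "3,1,M,N/A\n"       # dropped: score_text is parsed as missing
    ",0,F,Low\n"        # dropped: missing day count fails the range check
)
toy = pd.read_csv(toy_csv)

mask = (
    toy["days_b_screening_arrest"].between(-30, 30)  # inclusive on both ends
    & (toy["is_recid"] != -1)
    & (toy["c_charge_degree"] != "O")
    & toy["score_text"].notna()
)
toy_cleaned = toy.loc[mask]
print(toy_cleaned.shape)
```

Only the first row survives here, so comparing `shape` before and after a filter is a quick sanity check on how many rows each step removes.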

Now we choose which columns to keep. Notice that by leaving out the name columns, we also anonymize the dataset.

In [None]:
df_filtered = df_cleaned[['age','sex', 'race', 'juv_fel_count', 'decile_score', 'priors_count', 'is_recid', 'is_violent_recid', 
                   'v_decile_score']]
df_filtered.head()

In [None]:
df_filtered.shape

#### Question 1.1:  Look at the original dataset and the cleaned one (df_filtered). Pick one column (aside from name) that was deleted and one column that wasn't and provide justifications for why they were deleted / why they were not. (3 points)

One final data manipulation we need to do is on the race column. Notice that the race column stores each race description as a string. If we want to use it as an input to our model later on, it needs to be numeric. Therefore, we create one additional column per race and use these columns as indicator variables (1 denoting that the row belongs to the race and 0 denoting that it doesn't). In addition, we replace the (binary) sex field by 0/1: 1 for male and 0 for female.

In [None]:
df_final = df_filtered.join(pd.get_dummies(df_filtered['race']))
df_final["sex"] = (df_final["sex"] == "Male") + 0 ## Use the binary coding for sex.
df_final
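The same encoding step can be tried on a tiny hand-made frame to see exactly which columns it produces. This is only an illustrative sketch: the three rows are made up, and `.astype(int)` is used here as an alternative spelling of the `+ 0` trick above.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "race": ["Caucasian", "African-American", "Other"],
    "sex": ["Male", "Female", "Male"],
})

# One indicator column per distinct value of `race`, joined on the row index.
encoded = toy.join(pd.get_dummies(toy["race"]))

# Binary coding for sex: 1 for male, 0 for female.
encoded["sex"] = (encoded["sex"] == "Male").astype(int)

print(encoded.columns.tolist())
print(encoded)
```

Note that the original `race` column is still present after the join, alongside the new indicator columns.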

Let us now look at the final shape of our cleaned dataset.

In [None]:
df_final.shape

## Question 2: Data Analysis & Visualization (10 points)

Now we have a cleaned, filtered dataset -- df_final -- to work with. From now on, we are going to work with this dataset unless specified otherwise.

Before we start doing machine learning, we will perform some manual analysis and visualization of the data. We start by importing all the libraries we need for visualization.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from string import ascii_letters

We first want to explore some basic summary statistics of the dataframe.

In [None]:
df_final.describe()

#### Question 2.1: Using the df_final dataset, compute the means of criminal recidivism ('is_recid') for white males, white females, black males, and black females, as well as the general population. Also, construct histograms that visualize the distribution of 'is_recid' for these subgroups and the general population. (4 points)

#### Question 2.2: Repeat the analysis above for decile_score and v_decile_score (the violent-recidivism decile score). (2 points)

#### Question 2.3: Share any insights about the data that you may have developed from the above visualizations. (4 points)

## Question 3: Replicating ProPublica's Analysis (12 points)
Now that we have a much more comprehensive understanding of the dataset after loading, cleaning, analyzing, and visualizing it, we reproduce the threshold-rule analysis that ProPublica conducted. As a recap, ProPublica used the COMPAS scores to predict recidivism if the score was >= 5 and no recidivism if the score was < 5.

Note that this rule is a simplification: it relies solely on the decile score and applies a hard threshold for prediction, discarding all other information about the individual.

### Filtering the dataset using race

Now we use a filtering operation to select the rows for everyone in the African-American population. We can do this as follows:

In [None]:
df_black = df_final[df_final.race == "African-American"]

Take a look at the dataframe we just got.

In [None]:
df_black

### A simplified thresholding rule
Now, let's use a simple thresholding rule to "predict" recidivism, in the spirit of ProPublica's analysis: for `decile_score >= 5`, predict `recidivism = True`; and for `decile_score < 5`, predict `recidivism = False`. We save our prediction to the column `predicted_recid`.

In [None]:
df_black["predicted_recid"] = (df_black.decile_score >= 5)

In [None]:
df_black.head()
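Once a thresholding rule has produced binary predictions, comparing them with the true outcomes gives four counts (true/false positives and negatives), and the usual metrics are ratios of those counts. The sketch below works this through with NumPy on eight made-up (label, score) pairs; it is meant only to pin down the definitions, not to prescribe how the questions below should be answered.

```python
import numpy as np

# Made-up outcomes and decile scores, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([7, 6, 3, 9, 2, 5, 8, 1])

y_pred = scores >= 5  # same thresholding rule as above

# The four cells of the confusion matrix.
tp = int(np.sum(y_pred & (y_true == 1)))   # predicted recid, did recidivate
fp = int(np.sum(y_pred & (y_true == 0)))   # predicted recid, did not
fn = int(np.sum(~y_pred & (y_true == 1)))  # predicted no recid, did recidivate
tn = int(np.sum(~y_pred & (y_true == 0)))  # predicted no recid, did not

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)   # also called the true positive rate
fpr = fp / (fp + tn)      # false positive rate
fnr = fn / (fn + tp)      # false negative rate, equal to 1 - recall

print(tp, fp, fn, tn)
print(accuracy, precision, recall, fpr, fnr)
```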

#### Question 3.1: Using the sklearn package, construct and visualize the confusion matrices for the entire population, the black population, and the white population. (6 points)

#### Question 3.2: Compute the accuracy, precision, recall, false positive rate, and false negative rate for the entire population, the white subpopulation, and the black subpopulation. (6 points)

## Question 4: Machine Learning (70 points)

Now we proceed to the actual machine learning. As mentioned in class, we first define our features `X`, which we use to predict, and the label `Y`, which we try to predict.

In [None]:
X = df_final.drop(columns=['is_recid', 'is_violent_recid', "race"])
Y = df_final['is_recid']

In [None]:
X.head()

#### Question 4.1: Explain why we are dropping is_violent_recid and race. (3 points)

Now we divide the dataset into training and testing parts. We will use the training set to train our model to make predictions, in this case of criminal recidivism. Then, we will use the testing set to see how our model performs. An 80:20 split is pretty standard in practice. We can do this as follows:

In [None]:
# Split the data into train, test
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=155)

The `random_state=155` argument sets a random seed for the splitting, so that every time you run the above code, you will end up with the exact same split.
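The effect of the seed can be checked directly on a small synthetic array. This is a minimal sketch with made-up data (`X_toy` and `y_toy` are hypothetical names): calling `train_test_split` twice with the same `random_state` should return identical pieces.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and alternating 0/1 labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2

# Each call returns [X_train, X_test, y_train, y_test].
first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=155)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=155)

# Compare the two splits piece by piece.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split, first[0].shape, first[1].shape)
```

With 10 samples and `test_size=0.2`, 8 rows go to the training set and 2 to the test set.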

Look into the shapes of train_x, test_x, train_y, and test_y using our favorite shape attribute. If everything has been implemented correctly, they should be (4937, 12), (1235, 12), (4937,), and (1235,).

#### Question 4.2: Write code to construct test datasets (test_x_w, test_y_w) and (test_x_b, test_y_b) corresponding to the white and black individuals in the test set, respectively. (5 points)

### Logistic regression

Now we will experiment with multiple models that we learned about in class and compare their performance. We start with logistic regression. 

#### Question 4.3: Train a Logistic Regression model on the data (this dataset should include individuals of all races). Aim to choose hyperparameters (see documentation) such that the model performs the best and behaves the most equitably! Report on the following metrics: (i) Training accuracy; (ii) Test accuracy, precision, recall, false positive rate, and true positive rate for the overall population;  (iii) Test accuracy, precision, recall, false positive rate, and true positive rate for the white population; (iv) Test accuracy, precision, recall, false positive rate, and true positive rate for the black population; (v) The ROC curves for the black and white populations. (15 points)

#### Question 4.4: Comment on the social implications, as you see them, of your results in Question 4.3. (5 points)

### Neural networks 

#### Question 4.5: Train a neural network on the data. Aim to choose hyperparameters such as depth and width such that the model performs the best and behaves the most equitably. Compute and report on the metrics considered in Question 4.3, (i)-(v). (10 points)

### Decision Trees
#### Question 4.6: Train a decision tree classifier (of a suitable depth) on this dataset. As before, choose hyperparameters so as to maximize performance and equity. Compute and report on the metrics considered in Question 4.3, (i)-(iv). (10 points)

#### Question 4.7: Comment, with some empirical evidence, on how the performance and fairness of the model change with the maximum tree depth. (3 points)

### Random Forests


#### Question 4.8: Read up on random forest classifiers (https://en.wikipedia.org/wiki/Random_forest). Train a random forest and compute the metrics in Question 4.3, (i)-(iv), using your model. Relate your results to those for the decision tree model. (13 points)

### Comparisons

#### Question 4.9: Write a few sentences comparing the performance and fairness/unfairness of the different models you trained in this task.  (6 points) 

## Question 5: Reflections on the case (5 points)

This question is graded based on completion.

#### Question 5.1: Having completed this assignment, what are your thoughts on the use of machine learning in sentencing procedures? For example, you could approach this question by listing some of the pros and cons of human vs. automated decision making in this setting. (5 points)


## References
- https://github.com/propublica/compas-analysis/
- https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
- https://pandas.pydata.org/
- https://jupyter.org/
- https://matplotlib.org/stable/index.html
- https://seaborn.pydata.org/
- https://numpy.org/