# Preprocess the COMPAS Dataset
In this file, we preprocess the COMPAS dataset for binary classification with a neural network. The COMPAS data can be found here: https://github.com/propublica/compas-analysis. We follow much of the same preprocessing methodology as ProPublica, described here: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

In [129]:
import pandas as pd
import csv

## Open and Explore COMPAS Dataset
We open the COMPAS file compas-scores-two-years.csv, found in the following GitHub repo, made public by ProPublica: https://github.com/propublica/compas-analysis

In [94]:
df = pd.read_csv('./raw_data/compas-scores-two-years.csv')

In [95]:
df.head(5)

Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,...,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,...,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,...,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,...,6,Medium,2013-01-13,,,1,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,...,1,Low,2013-03-26,,,2,0,1102,0,0


In [96]:
len(df)

7214

In [97]:
df.columns.values

array(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex',
       'dob', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out',
       'c_case_number', 'c_offense_date', 'c_arrest_date',
       'c_days_from_compas', 'c_charge_degree', 'c_charge_desc',
       'is_recid', 'r_case_number', 'r_charge_degree',
       'r_days_from_arrest', 'r_offense_date', 'r_charge_desc',
       'r_jail_in', 'r_jail_out', 'violent_recid', 'is_violent_recid',
       'vr_case_number', 'vr_charge_degree', 'vr_offense_date',
       'vr_charge_desc', 'type_of_assessment', 'decile_score.1',
       'score_text', 'screening_date', 'v_type_of_assessment',
       'v_decile_score', 'v_score_text', 'v_screening_date', 'in_custody',
       'out_custody', 'priors_count.1', 'start', 'end', 'event',
       'two_year_recid'], dtype=object)

Notice that we have 53 variables for each individual. The two_year_recid column records whether an individual recidivated (committed another crime) within two years. We keep it as a feature; the label for our binary classifier, risk_recid, is derived from score_text in the next section.

## Preprocess Features
We try to follow ProPublica's procedure for preprocessing and classifying the data as closely as possible, as described in this article: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

First, following ProPublica's procedure, we keep only cases whose arrest or charge date falls *within* 30 days of the COMPAS assessment being conducted, and filter out all others.

In ProPublica's words: "It was not always clear, however, which criminal case was associated with an individual’s COMPAS score. To match COMPAS scores with accompanying cases, we considered cases with arrest dates or charge dates within 30 days of a COMPAS assessment being conducted."

In [98]:
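# Keep only rows with -30 < days_b_screening_arrest < 30.
# The strict inequalities also drop rows at exactly +/-30 days and rows where the value is missing.
df = df[df['days_b_screening_arrest'] < 30]
df = df[df['days_b_screening_arrest'] > -30]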

In [99]:
len(df)

6159
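The two comparisons above can also be written as a single boolean mask. The following is a minimal sketch on a small made-up frame (the `toy` values are illustrative and not taken from the COMPAS file). It shows which rows the strict bounds keep, including what happens at the boundaries and with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, chosen only to exercise the boundary cases.
toy = pd.DataFrame(
    {'days_b_screening_arrest': [-45.0, -30.0, -1.0, 0.0, 29.0, 30.0, np.nan]}
)

days = toy['days_b_screening_arrest']
mask = (days > -30) & (days < 30)  # same strict bounds as the cell above
kept = toy[mask]

# Rows at exactly -30 and 30, and the NaN row, are dropped.
print(kept['days_b_screening_arrest'].tolist())
```

Comparisons against NaN evaluate to False, which is why rows with a missing value are removed without an explicit `dropna` call.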

Next, ProPublica treated every score higher than 'Low' for Risk of Recidivism as a positive case. The Risk of Recidivism variable (score_text) takes three values: Low, Medium, and High. We collapse it into a new variable, 'risk_recid': individuals labeled Low get 0, and individuals labeled Medium or High get 1. This gives us a binary classification problem.

From the article: "...scores in the medium and high range garner more interest from supervision agencies than low scores, as a low score would suggest there is little risk of general recidivism, so we considered scores any higher than 'low' to indicate a risk of recidivism."

In [100]:
def label_recid(row):
    if row['score_text'] == 'Low':
        return 0
    elif row['score_text'] == 'Medium' or row['score_text'] == 'High':
        return 1

In [101]:
df['score_text'].head(5)

0       Low
1       Low
2       Low
5       Low
6    Medium
Name: score_text, dtype: object

In [102]:
df['risk_recid'] = df.apply(lambda row: label_recid(row), axis=1)

In [103]:
df['risk_recid'].head(5)

0    0
1    0
2    0
5    0
6    1
Name: risk_recid, dtype: int64
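The same Low/Medium/High collapse can be expressed without a row-wise `apply`. The following is a minimal sketch on made-up values; the `toy` frame and the `score_to_label` dictionary are names introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy values; 'score_text' mirrors the column name used above.
toy = pd.DataFrame({'score_text': ['Low', 'Medium', 'High', 'Low']})

# Low -> 0, Medium/High -> 1. Any value missing from the dict would map to NaN.
score_to_label = {'Low': 0, 'Medium': 1, 'High': 1}
toy['risk_recid'] = toy['score_text'].map(score_to_label)

print(toy['risk_recid'].tolist())
```

A dictionary lookup via `Series.map` keeps the mapping in one place and avoids calling a Python function once per row.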

Now, we try to choose columns that are similar to ProPublica's chosen columns. In their analysis, ProPublica used "race, age, criminal history, future recidivism, charge degree, gender and age." 

To match this, we choose 9 features:
- sex
- age
- race
- juv_fel_count
- juv_misd_count
- juv_other_count
- priors_count
- c_charge_degree
- two_year_recid

We also keep our label 'risk_recid,' created earlier.

We drop all other columns.

In [104]:
df = df[['sex', 
         'age', 
         'race', 
         'juv_fel_count', 
         'juv_misd_count', 
         'juv_other_count', 
         'priors_count', 
         'c_charge_degree', 
         'two_year_recid', 
         'risk_recid']]

In [122]:
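df.head(5) # Cell was re-run after the encoding cells below (see execution count), so the output already shows encoded columns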

Unnamed: 0,sex,age,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,two_year_recid,risk_recid,c_charge_degree_F,c_charge_degree_M
0,1,69,0,0,0,0,0,0,0,1,0
1,1,34,1,0,0,0,0,1,0,1,0
2,1,24,1,0,0,1,4,1,0,1,0
5,1,44,0,0,0,0,0,0,0,0,1
6,1,41,0,0,0,0,14,1,1,1,0


We do some final preprocessing to encode sex, race, and c_charge_degree as binary values. For sex, 0 is female and 1 is male. For race (our sensitive variable), 1 is African-American and 0 is every other race. For c_charge_degree, we one-hot encode into two columns: c_charge_degree_F and c_charge_degree_M.

In [110]:
df['sex'] = df['sex'].map({'Female': 0, 'Male': 1}) # Replace sex (assign back instead of inplace=True on a column selection)

In [115]:
def label_race(row):
    if row['race'] == 'African-American':
        return 1
    else:
        return 0

In [116]:
df['race'] = df.apply(lambda row: label_race(row), axis=1) # Replace race

In [119]:
df = pd.get_dummies(df, prefix=['c_charge_degree'], columns=['c_charge_degree'], dtype=int) # Replace c_charge_degree; dtype=int keeps 0/1 instead of booleans

In [121]:
df.head(5)

Unnamed: 0,sex,age,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,two_year_recid,risk_recid,c_charge_degree_F,c_charge_degree_M
0,1,69,0,0,0,0,0,0,0,1,0
1,1,34,1,0,0,0,0,1,0,1,0
2,1,24,1,0,0,1,4,1,0,1,0
5,1,44,0,0,0,0,0,0,0,0,1
6,1,41,0,0,0,0,14,1,1,1,0
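The three encoding steps above can be tried end to end on a small made-up frame. This is a sketch under the assumption that the columns only contain the category values shown; the `toy` rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows using the same column names as the notebook.
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male'],
    'race': ['African-American', 'Caucasian', 'Other'],
    'c_charge_degree': ['F', 'M', 'F'],
})

# sex: Female -> 0, Male -> 1
toy['sex'] = toy['sex'].map({'Female': 0, 'Male': 1})

# race: African-American -> 1, every other value -> 0
toy['race'] = (toy['race'] == 'African-American').astype(int)

# c_charge_degree: one-hot encode into c_charge_degree_F / c_charge_degree_M
toy = pd.get_dummies(toy, columns=['c_charge_degree'], dtype=int)

print(toy)
```

The comparison-based race encoding is a vectorized equivalent of `label_race`, and `dtype=int` makes `get_dummies` emit 0/1 integers rather than booleans.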


In [125]:
df = df[[col for col in df if col not in ['risk_recid']] + ['risk_recid']] # Move the label (risk_recid) to the last col

In [126]:
df.head(5)

Unnamed: 0,sex,age,race,juv_fel_count,juv_misd_count,juv_other_count,priors_count,two_year_recid,c_charge_degree_F,c_charge_degree_M,risk_recid
0,1,69,0,0,0,0,0,0,1,0,0
1,1,34,1,0,0,0,0,1,1,0,0
2,1,24,1,0,0,1,4,1,1,0,0
5,1,44,0,0,0,0,0,0,0,1,0
6,1,41,0,0,0,0,14,1,1,0,1


In [127]:
len(df)

6159

## Save the Preprocessed Dataset to CSV
Our final preprocessed dataset has 6,159 entries, 10 features, and a binary label (risk_recid). We save to CSV and use it for binary classification in another notebook.

In [128]:
df.to_csv('preprocessed_compas_data.csv')
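Because `to_csv` is called without `index=False`, the row index is written as an unnamed first column. The following sketch shows one way a downstream notebook could read the file back and split features from the label. It uses an in-memory buffer and made-up rows instead of the real file, and the names `X` and `y` are introduced here for illustration.

```python
import io
import pandas as pd

# Hypothetical toy frame with a non-contiguous index, like the filtered data.
toy = pd.DataFrame(
    {'age': [69, 34], 'race': [0, 1], 'risk_recid': [0, 1]},
    index=[0, 5],
)

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column.
loaded = pd.read_csv(buffer, index_col=0)

X = loaded.drop(columns=['risk_recid'])  # features
y = loaded['risk_recid']                 # binary label

print(X.columns.tolist(), y.tolist())
```

Reading with `index_col=0` keeps the original row identifiers out of the feature matrix, which matters if the loaded frame is passed directly to a classifier.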