# Neural network models for predicting COMPAS risk score and two-year recidivism

### COMPAS risk score
The code below trains neural network models on different subsets of the ProPublica dataset to predict the risk score level (high, medium, low) that the COMPAS algorithm would have assigned a given example.  What I show below are (3) models.  One is a classifier that uses 8 features, the next trains only on age, sex and race, and the last model  uses only race to predict COMPAS risk score.  Comments about the models and performance of each is noted within the notebook below.

### Two-year recidivism
Model 4 does not attempt to predict the COMPAS risk score assigned to an example, but rather predicts general two-year recidivism and violent recidivism, given only a few readily available features.  In the end, my simple model is considerably more accurate than the COMPAS scores at predicting both kinds of recidivism.


In [545]:
import pandas as pd
import logging

import numpy as np
import math

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

from mla.neuralnet import NeuralNet
from mla.neuralnet.constraints import MaxNorm, UnitNorm
from mla.neuralnet.layers import Activation, Dense, Dropout
from mla.neuralnet.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam
from mla.neuralnet.parameters import Parameters
from mla.neuralnet.regularizers import *
from mla.utils import one_hot

## Functions
Functions for important tasks like converting dataset rows with lots of text into feature vectors for the model, segmenting the dataset for training and validation, etc.


In [546]:
# logging.basicConfig(level=logging.DEBUG)

def classification(X, y, is_for_score=True):
        
    y = one_hot(y)
    y_len = y.shape[1]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1111)

    model = NeuralNet(
        layers=[
            Dense(512, Parameters(init='uniform', regularizers={'W': L2(0.05)})),
            Activation('relu'),
            Dropout(0.9),
            Dense(128, Parameters(init='normal', constraints={'W': MaxNorm()})),
            Activation('relu'),
            Dense(y_len),
            Activation('softmax'),
        ],
        loss='categorical_crossentropy',
        optimizer=Adadelta(),
        metric='accuracy',
        batch_size=256,
        max_epochs=25,
    )

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('classification accuracy', roc_auc_score(y_test[:, 0], predictions[:, 0]))
    
    example = np.asarray(X[-2:])
    ex_predict = model.predict(example)
    if is_for_score:
        scores = ["LOW", "MEDIUM", "HIGH"]
        print "Your first example is predicted to have a %s risk score" % scores[np.argmax(ex_predict[0])]
        print "Your second example is predicted to have a %s risk score" % scores[np.argmax(ex_predict[1])]
    else:
        is_recid = ["not", ""]
        print "Your first example is %s predicted to offend in the next 2 years." % (is_recid[np.argmax(ex_predict[0])])
        print "Your second example is %s predicted to offend in the next 2 years." % (is_recid[np.argmax(ex_predict[1])])


In [547]:
def process_x_y(x_data, y_data, to_int):
    
    ''' convert the text columns of np.arrays of desired x_data and y_data into int / vector representation
        to_int is the indices of columns with text values (make this programmatic in the next update)'''
    
    # convert text columns to integer values
    le = preprocessing.LabelEncoder()
    for i in to_int:
        temp = x_data[:,i]
        temp_fit = le.fit(temp)
        x_data[:,i] = le.transform(temp)

    for i in range(len(x_data)):
        for j in range(len(x_data[i])):
            if np.isnan(x_data[i][j]):
                x_data[i][j] = 0

    x_data = x_data.astype(int)
    

    y_fit = le.fit(y_data)
    y_data = le.transform(y_data)
    
    return x_data, y_data

In [548]:
def run_model(keep, target, to_int, is_for_score=True):

    y = target.as_matrix()
    x = keep.as_matrix()

    x_data, y_data = process_x_y(x,y, to_int)

    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1111)
    classification(x_data, y_data, is_for_score)

In [549]:
csv_file = 'datasets/compas-scores-two-years-violent.csv'
df = pd.read_csv(csv_file)

## Model 1 - eight features to predict COMPAS risk scores
This model trains on sex, age, race, juvenile felony count, juvenile misdemeanor count, juvenile other count, priors count and the charge degree.  It uses these features to predict the COMPAS risk score level. It does this for two types of risk scores, the general risk, and risk of violent recidivism.  It classifies:
* violent scores with ~87% accuracy
* general recidivism scores with ~84% accuracy.

It also prints the predictions for four examples:
* Violent risk
    1. 22 year old Caucasian male who is being charged for his first misdemeanor - MEDIUM
    2. 22 year old African-American male who is being charged for his first misdemeanor - HIGH
* General risk
    3. 29 year old Caucasian female in the same situation - LOW
    4. 29 year old African-American female in the same situation - MEDIUM

### For calculating your own risk (or an example of choice):

Look for occurrences of the < EXAMPLE > tag in the the code below. Change the values in the indicated array to whatever you want. The order and types of each value correspond to the "keep" variable list a few lines above the arrays.

If you pull the repo and can use iPython notebooks, then run the cell and it will print out the classification for your example.  The sets of values seen in training for each feature are listed below.  If there are no values listed after the "#", then you can enter an integer value into the field.

* 'sex', # 'Male' or 'Female' (I know this enforces a binary but it's what the training data from the FOIA requests provided)
* 'age',
* 'race', # 'African-American', 'Caucasian', 'Asian', 'Hispanic, 'Native American', 'Other'
* 'juv_fel_count',
* 'juv_misd_count',
* 'juv_other_count',
* 'priors_count',
* 'c_charge_degree' # 'F' or 'M' (felony or misdemeanor)

In [555]:
# Predicting the scores assigned for recidivism risks

keep = [
 'sex',
 'age',
 'race',
 'juv_fel_count',
 'juv_misd_count',
 'juv_other_count',
 'priors_count',
 'c_charge_degree']

target_1 = ['v_score_text']
target_2 = ['score_text']
text_cols = [0,2,7]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Male',22,'Caucasian',0,0,0,0,'M'] # <--- you can change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep1.loc[df.shape[0]] = ['Male',22,'African-American',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Female',29,'Caucasian',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Female',29,'African-American',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']

run_model(to_keep1, to_target1, text_cols)
print "for violent recidivism risk score.\n\n"
run_model(to_keep2, to_target2, text_cols)
print "for general recidivism risk score.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a

('classification accuracy', 0.87341875681570325)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for violent recidivism risk score.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.32it/s]
INFO:root:Epoch:0, train loss: 3.27075742451, train accuracy: 0.623634558093, elapsed: 0.270987987518 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.87it/s]
INFO:root:Epoch:1, train loss: 2.23840868849, train accuracy: 0.632323733863, elapsed: 0.269526004791 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.94it/s]
INFO:root:Epoch:2, train loss: 1.8000674205, train accuracy: 0.647715988083, elapsed: 0.265296936035 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.07it/s]
INFO:root:Epoch:3, train loss: 1.53181541716, train accuracy: 0.650198609732, elapsed: 0.265210151672 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.24it/s]
INFO:root:Epoch:4, train loss: 1.43408255379, train accuracy: 0.658639523337, elapsed: 0.267863035202 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.50it/s]
INFO:root:Epoch:5, train loss: 1.29348487914, train accuracy: 0.659136047666, elap

('classification accuracy', 0.84224116081611955)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for general recidivism risk score.




## Model 2 - age, sex and race to predict COMPAS risk scores
This model trains on sex, age and race.  It uses these features to predict the COMPAS risk score level.  It classifies:
* violent scores with ~80% accuracy
* general scores with ~71% accuracy.

It also prints the predictions for four examples:
* Violent risk
    1. 22 year old Caucasian male - MEDIUM
    2. 22 year old African-American male - HIGH
* General risk
    3. 29 year old Caucasian female - MEDIUM
    4. 29 year old African-American female - MEDIUM

In [551]:
# Predicting the scores assigned for recidivism risks using JUST age, sex and race

keep = [
 'sex',
 'age',
 'race']

target_1 = ['v_score_text']
target_2 = ['score_text']
text_cols = [0,2]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Male',22,'Caucasian'] # <--- you can change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep1.loc[df.shape[0]] = ['Male',22,'African-American'] # <--- and /or change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Female',29,'Caucasian'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Female',29,'African-American'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']

run_model(to_keep1, to_target1, text_cols)
print "for violent recidivism risk score.\n\n"
run_model(to_keep2, to_target2, text_cols)
print "for general recidivism risk score.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a

('classification accuracy', 0.80388495092693568)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a HIGH risk score
for violent recidivism risk score.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.39it/s]
INFO:root:Epoch:0, train loss: 3.41556894639, train accuracy: 0.621151936445, elapsed: 0.251341819763 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.50it/s]
INFO:root:Epoch:1, train loss: 2.42968530345, train accuracy: 0.621151936445, elapsed: 0.251350879669 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.87it/s]
INFO:root:Epoch:2, train loss: 1.91347753987, train accuracy: 0.621151936445, elapsed: 0.256721019745 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 74.06it/s]
INFO:root:Epoch:3, train loss: 1.6958359586, train accuracy: 0.621151936445, elapsed: 0.249255895615 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 73.48it/s]
INFO:root:Epoch:4, train loss: 1.53612005188, train accuracy: 0.621151936445, elapsed: 0.250196933746 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.15it/s]
INFO:root:Epoch:5, train loss: 1.48193384667, train accuracy: 0.621151936445, elap

('classification accuracy', 0.70653275883918387)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for general recidivism risk score.




## Model 3 - using race to predict COMPAS risk scores
This model uses only race to predict COMPAS risk score. It classifies:
* violent scores with ~63% accuracy
* general scores with ~62% accuracy.

It also prints the predictions for four examples:
* Violent risk
    1. Caucasian individual - MEDIUM
    2. African-American individual - MEDIUM
* General risk
    3. Caucasian individual - MEDIUM
    4. African-American individual - MEDIUM

In [552]:
# Predicting the score assigned for recidivism risks using JUST race

keep = ['race']

target_1 = ['v_score_text']
target_2 = ['score_text']
text_cols = [0]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Caucasian'] # <--- you can change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep1.loc[df.shape[0]] = ['African-American'] # <--- and /or change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Caucasian'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['African-American'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']

run_model(to_keep1, to_target1, text_cols)
print "for violent recidivism risk score.\n\n"
run_model(to_keep2, to_target2, text_cols)
print "for general recidivism risk score.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a

('classification accuracy', 0.6273991275899673)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for violent recidivism risk score.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 75.46it/s]
INFO:root:Epoch:0, train loss: 1.88798630233, train accuracy: 0.621151936445, elapsed: 0.241130113602 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 73.15it/s]
INFO:root:Epoch:1, train loss: 1.06206941344, train accuracy: 0.621151936445, elapsed: 0.246817111969 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 73.68it/s]
INFO:root:Epoch:2, train loss: 1.02672265932, train accuracy: 0.621151936445, elapsed: 0.24526309967 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.82it/s]
INFO:root:Epoch:3, train loss: 0.984412694214, train accuracy: 0.621151936445, elapsed: 0.248409032822 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.48it/s]
INFO:root:Epoch:4, train loss: 0.966699981092, train accuracy: 0.621151936445, elapsed: 0.250889062881 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.69it/s]
INFO:root:Epoch:5, train loss: 0.957718136034, train accuracy: 0.621151936445, e

('classification accuracy', 0.62116651881890761)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for general recidivism risk score.




## Model 4
Instead of predicting COMPAS risk score, I wanted to see if I could make a model that performs at least as well as the COMPAS algorithm at predicting 2-year rates of recidivism.

Using the same 8 features used in the first model, and using standard validations practices, my model was able to predict:
* violent recidivism with ~71% accuracy
* general recidivism with ~72% accuracy.

Considering, COMPAS was correct in predicting recidivism 61% of the time, and violent recidivism only 20%, this does not paint the use of risk scores in the best light.

It also prints the predictions for four examples:
* Predicted to commit violent crime within 2 years
    1. 22 year old Caucasian male who is being charged for his first misdemeanor - NO
    2. 22 year old African-American male who is being charged for his first misdemeanor - NO
* Predicted to commit any crime within 2 years
    3. 29 year old Caucasian female in the same situation - NO
    4. 29 year old African-American female in the same situation - NO

In [554]:
# Predicting the scores assigned for recidivism risks

keep = [
 'sex',
 'age',
 'race',
 'juv_fel_count',
 'juv_misd_count',
 'juv_other_count',
 'priors_count',
 'c_charge_degree']

target_1 = ['is_violent_recid']
target_2 = ['two_year_recid']
text_cols = [0,2,7]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Male',22,'Caucasian',0,0,0,0,'M'] # <--- you can change the values in this array
to_target1.loc[df.shape[0]] = [0]
to_keep1.loc[df.shape[0]] = ['Male',22,'African-American',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target1.loc[df.shape[0]] = [0]
to_keep2.loc[df.shape[0]] = ['Female',29,'Caucasian',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = [0]
to_keep2.loc[df.shape[0]] = ['Female',29,'African-American',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = [0]

run_model(to_keep1, to_target1, text_cols, is_for_score=False)
print "Where only violent offenses are counted as recidivism.\n\n"
run_model(to_keep2, to_target2, text_cols, is_for_score=False)
print "Where any offense is counted as recidivism.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a

('classification accuracy', 0.71390578624114731)
Your first example is not predicted to offend in the next 2 years.
Your second example is not predicted to offend in the next 2 years.
Where only violent offenses are counted as recidivism.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.93it/s]
INFO:root:Epoch:0, train loss: 2.26700432252, train accuracy: 0.834905660377, elapsed: 0.263825893402 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 70.35it/s]
INFO:root:Epoch:1, train loss: 1.31187440986, train accuracy: 0.834905660377, elapsed: 0.261878013611 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.56it/s]
INFO:root:Epoch:2, train loss: 1.08821329312, train accuracy: 0.835153922542, elapsed: 0.26714015007 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.39it/s]
INFO:root:Epoch:3, train loss: 0.942766406044, train accuracy: 0.835402184707, elapsed: 0.26860499382 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.14it/s]
INFO:root:Epoch:4, train loss: 0.860200055715, train accuracy: 0.835650446872, elapsed: 0.272272825241 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 66.40it/s]
INFO:root:Epoch:5, train loss: 0.787085874958, train accuracy: 0.836891757696, el

('classification accuracy', 0.72063984268643166)
Your first example is not predicted to offend in the next 2 years.
Your second example is not predicted to offend in the next 2 years.
Where any offense is counted as recidivism.


