# A Neural Network model for predicting risk level
The code below trains a neural network model on different subsets of the ProPublica dataset to predict the risk score level (high, medium, low) that the COMPAS algorithm would have assigned a given example.  What I show below are (3) models.  One is a classifier that uses 8 features, the next trains only on age, sex and race, and the last model  uses only race to predict COMPAS risk score.  Notes about modifying the models and performance of each is noted within the notebook below.




In [415]:
import pandas as pd
import logging

import numpy as np
import math

from sklearn.model_selection import train_test_split
# from sklearn.datasets import make_classification
# from sklearn.datasets import make_regression
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

# from mla.datasets import *
# from mla.metrics.metrics import root_mean_squared_log_error, mean_squared_error
from mla.neuralnet import NeuralNet
from mla.neuralnet.constraints import MaxNorm, UnitNorm
from mla.neuralnet.layers import Activation, Dense, Dropout
from mla.neuralnet.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam
from mla.neuralnet.parameters import Parameters
from mla.neuralnet.regularizers import *
from mla.utils import one_hot

## Functions
Setting up functions for important steps like converting dataset rows with lots of text into feature vectors for the model, segmenting the dataset for training and validation...


In [493]:
logging.basicConfig(level=logging.DEBUG)

def classification(X, y, is_for_score=True):
    
    example_1 = X[-1]
    example_2 = X[-2]
    example = np.asarray([example_1, example_2])
    
    y = one_hot(y)
    y_len = y.shape[1]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1111)

    model = NeuralNet(
        layers=[
            Dense(512, Parameters(init='uniform', regularizers={'W': L2(0.05)})),
            Activation('relu'),
            Dropout(0.9),
            Dense(128, Parameters(init='normal', constraints={'W': MaxNorm()})),
            Activation('relu'),
            Dense(y_len),
            Activation('softmax'),
        ],
        loss='categorical_crossentropy',
        optimizer=Adadelta(),
        metric='accuracy',
        batch_size=256,
        max_epochs=25,
    )

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('classification accuracy', roc_auc_score(y_test[:, 0], predictions[:, 0]))
    
    ex_predict = model.predict(example)
    if is_for_score:
        scores = ["LOW", "MEDIUM", "HIGH"]
        print ex_predict
        print "Your first example is predicted to have a %s risk score" % scores[np.argmax(ex_predict[0])]
        print "Your second example is predicted to have a %s risk score" % scores[np.argmax(ex_predict[1])]
    else:
        is_recid = ["not", ""]
        print ex_predict
        print "Your first example is %s predicted to violently offend in the next 2 years." % (is_recid[np.argmax(ex_predict[0])])
        print "Your second example is %s predicted to offend in the next 2 years." % (is_recid[np.argmax(ex_predict[1])])

In [474]:
def process_x_y(x_data, y_data, to_int):
    
    ''' convert the text columns of np.arrays of desired x_data and y_data into int / vector representation
        to_int is the indices of columns with text values (make this programmatic in the next update)'''
    
    # convert text columns to integer values
    le = preprocessing.LabelEncoder()
    for i in to_int:
        temp = x_data[:,i]
        temp_fit = le.fit(temp)
        x_data[:,i] = le.transform(temp)

    for i in range(len(x_data)):
        for j in range(len(x_data[i])):
            if np.isnan(x_data[i][j]):
                x_data[i][j] = 0

    x_data = x_data.astype(int)
    

    y_fit = le.fit(y_data)
    y_data = le.transform(y_data)
    
    return x_data, y_data

In [475]:
def run_model(keep, target, to_int, is_for_score=True):

    y = target.as_matrix()
    x = keep.as_matrix()

    x_data, y_data = process_x_y(x,y, to_int)

    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1111)
    classification(x_data, y_data, is_for_score)

In [476]:
csv_file = 'datasets/compas-scores-two-years-violent.csv'
df = pd.read_csv(csv_file)

## Model 1
This model trains on sex, age, race, juvenile felony count, juvenile misdemeanor count, juvenile other count, priors count and the charge degree.  It uses these features to predict the COMPAS risk score level. It does this for two types of risk scores, the general risk, and risk of violent recidivism.  It classifies violent score with ~89% accuracy, and general recidivism score with ~84% accuracy.

It also prints the predictions for two examples.  One is 22 year old Caucasian male who is being charged for his first misdemeanor, and the other is for a 29 year old Caucasian female in the same situation.

### For calculating your own risk (or an example of choice):

Look for occurrences of the < EXAMPLE > tag in the the code below. Change the values in the indicated array to whatever you want. The order and types of each value correspond to the "keep" variable list a few lines above the arrays.  If you pull the repo and can use iPython notebooks, then run the cell and it will print out the classification for your example.  Values used in the rest of the training set for each feature are listed below.  If there are no possible values listed after the "#", then you can enter an integer value into the field.

* 'sex', # 'Male' or 'Female' (I know this enforces a binary but it's what the training data from the FOIA requests provided)
* 'age',
* 'race', # 'African-American', 'Caucasian', 'Asian', 'Hispanic, 'Native American', 'Other'
* 'juv_fel_count',
* 'juv_misd_count',
* 'juv_other_count',
* 'priors_count',
* 'c_charge_degree' # 'F' or 'M' (felony or misdemeanor)

In [458]:
# Predicting the scores assigned for recidivism risks

keep = [
 'sex',
 'age',
 'race',
 'juv_fel_count',
 'juv_misd_count',
 'juv_other_count',
 'priors_count',
 'c_charge_degree']

target_1 = ['v_score_text']
target_2 = ['score_text']
text_cols = [0,2,7]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Male',22,'Caucasian',0,0,0,0,'M'] # <--- change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Female',29,'Caucasian',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']

run_model(to_keep1, to_target1, text_cols)
print "for violent recidivism risk score.\n\n"
run_model(to_keep2, to_target2, text_cols)
print "for general recidivism risk score.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 70659
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.76it/s]
INFO:root:Epoch:0, train loss: 3.02523268477, train accuracy: 0.715988083416, elapsed: 0.266045093536 sec.
Epoch progress: 100%|██████████| 16/16 

('classification accuracy', 0.88762809163581724)
Your first example is predicted to have a HIGH risk score
Your second example is predicted to have a MEDIUM risk score
for violent recidivism risk score.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.63it/s]
INFO:root:Epoch:0, train loss: 3.52527276931, train accuracy: 0.62140019861, elapsed: 0.2697057724 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.43it/s]
INFO:root:Epoch:1, train loss: 2.264331288, train accuracy: 0.630585898709, elapsed: 0.273370981216 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.36it/s]
INFO:root:Epoch:2, train loss: 1.88108241283, train accuracy: 0.651191658391, elapsed: 0.267838954926 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.90it/s]
INFO:root:Epoch:3, train loss: 1.61499512259, train accuracy: 0.656405163853, elapsed: 0.262670993805 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 62.36it/s]
INFO:root:Epoch:4, train loss: 1.44290141963, train accuracy: 0.669811320755, elapsed: 0.297621965408 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.34it/s]
INFO:root:Epoch:5, train loss: 1.32402383493, train accuracy: 0.673535253227, elapsed:

('classification accuracy', 0.83850385038503861)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a HIGH risk score
for general recidivism risk score.




## Model 2
This model trains on sex, age and race.  It uses these features to predict the COMPAS risk score level.  It classifies violent scores with ~81% accuracy, and general with ~70% accuracy.

It also prints the prediction for two examples.  One is 22 year old Caucasian male, and the other is for a 29 year old Caucasian female.

In [461]:
# Predicting the scores assigned for recidivism risks using JUST age, sex and race

keep = [
 'sex',
 'age',
 'race']

target_1 = ['v_score_text']
target_2 = ['score_text']
text_cols = [0,2]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Male',22,'Caucasian'] # <--- change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['Female',29,'Caucasian'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']

run_model(to_keep1, to_target1, text_cols)
print "for violent recidivism risk score.\n\n"
run_model(to_keep2, to_target2, text_cols)
print "for general recidivism risk score.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 68099
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.32it/s]
INFO:root:Epoch:0, train loss: 3.19464799657, train accuracy: 0.715988083416, elapsed: 0.255340099335 sec.
Epoch progress: 100%|██████████| 16/16 

('classification accuracy', 0.8113096196147862)
Your first example is predicted to have a HIGH risk score
Your second example is predicted to have a MEDIUM risk score
for violent recidivism risk score.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.76it/s]
INFO:root:Epoch:0, train loss: 3.45181498457, train accuracy: 0.62140019861, elapsed: 0.276559114456 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.88it/s]
INFO:root:Epoch:1, train loss: 2.39495813418, train accuracy: 0.62140019861, elapsed: 0.265471935272 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 71.04it/s]
INFO:root:Epoch:2, train loss: 2.04861305886, train accuracy: 0.62140019861, elapsed: 0.258276939392 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 66.81it/s]
INFO:root:Epoch:3, train loss: 1.78059975265, train accuracy: 0.62140019861, elapsed: 0.279517889023 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 70.34it/s]
INFO:root:Epoch:4, train loss: 1.62983492534, train accuracy: 0.62140019861, elapsed: 0.259443998337 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 61.97it/s]
INFO:root:Epoch:5, train loss: 1.53645469038, train accuracy: 0.62140019861, elapsed: 

('classification accuracy', 0.70583844098695581)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for general recidivism risk score.




## Model 3
This model uses only race to predict COMPAS risk score.  It uses these features to predict the COMPAS risk score level.  It classify both violent scores with ~64% accuracy and general scores with ~62% accuracy.

It also prints the prediction for two examples.  One is is Caucasian, while the other is African-American.

In [495]:
# Predicting the score assigned for recidivism risks using JUST race

keep = ['race']

target_1 = ['v_score_text']
target_2 = ['score_text']
text_cols = [0]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Caucasian'] # <--- change the values in this array
to_target1.loc[df.shape[0]] = ['Low']
to_keep2.loc[df.shape[0]] = ['African-American'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = ['Low']

run_model(to_keep1, to_target1, text_cols)
print "for violent recidivism risk score.\n\n"
run_model(to_keep2, to_target2, text_cols)
print "for general recidivism risk score.\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 67075
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 75.27it/s]
INFO:root:Epoch:0, train loss: 1.92027765933, train accuracy: 0.715988083416, elapsed: 0.245267868042 sec.
Epoch progress: 100%|██████████| 16/16 

('classification accuracy', 0.64306025001341272)
[[ 0.02472367  0.83667637  0.13859996]
 [ 0.06569479  0.69822232  0.23608289]]
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for violent recidivism risk score.




Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 72.44it/s]
INFO:root:Epoch:0, train loss: 1.22633597369, train accuracy: 0.62140019861, elapsed: 0.253875017166 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 73.69it/s]
INFO:root:Epoch:1, train loss: 1.02430953608, train accuracy: 0.62140019861, elapsed: 0.24627494812 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 73.92it/s]
INFO:root:Epoch:2, train loss: 0.981280962959, train accuracy: 0.62140019861, elapsed: 0.249930858612 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 77.09it/s]
INFO:root:Epoch:3, train loss: 0.964789983241, train accuracy: 0.62140019861, elapsed: 0.239142894745 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 75.44it/s]
INFO:root:Epoch:4, train loss: 0.952072206228, train accuracy: 0.62140019861, elapsed: 0.243625879288 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 76.08it/s]
INFO:root:Epoch:5, train loss: 0.935923425088, train accuracy: 0.62140019861, elapse

('classification accuracy', 0.62209649536382206)
[[ 0.19830587  0.4989328   0.30276133]
 [ 0.19830587  0.4989328   0.30276133]]
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
for general recidivism risk score.




## Model 4
Instead of predicting COMPAS risk score, I wanted to see if I could make a model that performs at least as well as the COMPAS algorithm at predicting 2-year rates of violent recidivism.

Using the same 8 features used in the first model, and using standard validations practices, my model was able to predict violent recidivism with ~71% accuracy and general recidivism with ~72% accuracy.  Considering, COMPAS was correct in predicting recidivism 61% of the time, and violent recidivism only 20%, this does not paint the use of risk scores in the best light.

In [496]:
# Predicting the scores assigned for recidivism risks

keep = [
 'sex',
 'age',
 'race',
 'juv_fel_count',
 'juv_misd_count',
 'juv_other_count',
 'priors_count',
 'c_charge_degree']

target_1 = ['is_violent_recid']
target_2 = ['two_year_recid']
text_cols = [0,2,7]
to_keep1 = df[keep]
to_keep2 = df[keep]
to_target1 = df[target_1]
to_target2 = df[target_2]

# <EXAMPLE>: Adding in an example of choice
to_keep1.loc[df.shape[0]] = ['Male',22,'Caucasian',0,0,0,0,'M'] # <--- change the values in this array
to_target1.loc[df.shape[0]] = [0]
to_keep2.loc[df.shape[0]] = ['Female',29,'Caucasian',0,0,0,0,'M'] # <--- and /or change the values in this array
to_target2.loc[df.shape[0]] = [1]

run_model(to_keep1, to_target1, text_cols, is_for_score=False)
print "\n\n"
run_model(to_keep2, to_target2, text_cols, is_for_score=False)
print "\n\n"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 70530
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.41it/s]
INFO:root:Epoch:0, train loss: 1.87386480165, train accuracy: 0.824975173784, elapsed: 0.265868186951 sec.
Epoch progress: 100%|██████████| 16/16 

('classification accuracy', 0.71093772958359047)
[[ 0.84068165  0.15931835]
 [ 0.82292494  0.17707506]]
Your first example is not predicted to violently offend in the next 2 years.
Your second example is not predicted to offend in the next 2 years.





Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.20it/s]
INFO:root:Epoch:0, train loss: 3.23924499199, train accuracy: 0.834657398213, elapsed: 0.26778793335 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.68it/s]
INFO:root:Epoch:1, train loss: 1.29289527122, train accuracy: 0.834657398213, elapsed: 0.263152122498 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.60it/s]
INFO:root:Epoch:2, train loss: 1.05842254597, train accuracy: 0.834657398213, elapsed: 0.262568950653 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 68.31it/s]
INFO:root:Epoch:3, train loss: 0.897455537998, train accuracy: 0.834657398213, elapsed: 0.267280101776 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.32it/s]
INFO:root:Epoch:4, train loss: 0.788729211719, train accuracy: 0.834657398213, elapsed: 0.263830900192 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.60it/s]
INFO:root:Epoch:5, train loss: 0.743086680788, train accuracy: 0.834657398213, e

('classification accuracy', 0.71710028739978815)
[[ 0.91120805  0.08879195]
 [ 0.85294366  0.14705634]]
Your first example is not predicted to violently offend in the next 2 years.
Your second example is not predicted to offend in the next 2 years.



