# A Neural Network model for predicting risk level
The code below trains a neural network on different subsets of the ProPublica dataset to predict the risk score level (high, medium, low) that the COMPAS algorithm would have assigned to a given example. Three models are shown: the first is a classifier that uses 8 features, the second trains only on sex, age and race, and the third uses only race to predict the COMPAS risk score. Notes on modifying the models and on the performance of each appear within the notebook below.




In [415]:
import pandas as pd
import logging

import numpy as np
import math

from sklearn.model_selection import train_test_split
# from sklearn.datasets import make_classification
# from sklearn.datasets import make_regression
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

# from mla.datasets import *
# from mla.metrics.metrics import root_mean_squared_log_error, mean_squared_error
from mla.neuralnet import NeuralNet
from mla.neuralnet.constraints import MaxNorm, UnitNorm
from mla.neuralnet.layers import Activation, Dense, Dropout
from mla.neuralnet.optimizers import SGD, RMSprop, Adagrad, Adadelta, Adam
from mla.neuralnet.parameters import Parameters
from mla.neuralnet.regularizers import *
from mla.utils import one_hot

## Functions
The functions below handle the main steps: converting dataset rows with text values into feature vectors for the model, splitting the dataset for training and validation, and so on.


In [431]:
logging.basicConfig(level=logging.DEBUG)

def classification(X, y):

    # X[-1] is the most recently appended row, X[-2] the row before it.
    # Both rows are also part of the data that gets split below.
    example_1 = X[-1]
    example_2 = X[-2]
    example = np.asarray([example_1, example_2])

    y = one_hot(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1111)

    model = NeuralNet(
        layers=[
            Dense(512, Parameters(init='uniform', regularizers={'W': L2(0.05)})),
            Activation('relu'),
            Dropout(0.9),
            Dense(128, Parameters(init='normal', constraints={'W': MaxNorm()})),
            Activation('relu'),
            Dense(3),
            Activation('softmax'),
        ],
        loss='categorical_crossentropy',
        optimizer=Adadelta(),
        metric='accuracy',
        batch_size=256,
        max_epochs=25,
    )

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # NOTE: despite the label, this is the ROC AUC of the first one-hot class
    # versus the rest, not a classification accuracy.
    print('classification accuracy', roc_auc_score(y_test[:, 0], predictions[:, 0]))

    ex_predict = model.predict(example)
    # NOTE: LabelEncoder numbers classes in sorted label order, so check this list
    # against the encoder's classes_ before trusting the printed names.
    scores = ["LOW", "MEDIUM", "HIGH"]
    print("Your first example is predicted to have a %s risk score" % scores[np.argmax(ex_predict[0])])
    print("Your second example is predicted to have a %s risk score" % scores[np.argmax(ex_predict[1])])
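
The number printed as 'classification accuracy' above comes from roc_auc_score applied to the first one-hot column, so it measures how well the model ranks that one class against the rest. As a rough sketch of the difference, the snippet below computes both that one-vs-rest AUC and a plain argmax accuracy. The arrays are made up for illustration and accuracy_from_probs is a hypothetical helper, not part of the notebook or of mla.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def accuracy_from_probs(y_true_onehot, y_prob):
    """Fraction of rows whose most probable class matches the true class."""
    true_class = np.argmax(y_true_onehot, axis=1)
    pred_class = np.argmax(y_prob, axis=1)
    return float(np.mean(true_class == pred_class))


# Made-up one-hot targets and predicted probabilities: 4 rows, 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.50, 0.40, 0.10],
                   [0.30, 0.60, 0.10],
                   [0.40, 0.35, 0.25],
                   [0.20, 0.50, 0.30]])

# ROC AUC of class 0 versus the rest, the quantity computed in classification()
auc_class0 = roc_auc_score(y_true[:, 0], y_prob[:, 0])
# Plain accuracy over all three classes
acc = accuracy_from_probs(y_true, y_prob)

print("AUC (class 0 vs rest): %.2f, accuracy: %.2f" % (auc_class0, acc))
```

On this toy data the two numbers differ widely, which is why the scores reported below are best read as AUC values rather than as the share of correctly classified examples.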


In [432]:
def process_x_y(x_data, y_data, to_int):
    '''Convert the text columns of the x_data and y_data arrays into integer codes.

    to_int lists the indices of the x_data columns that hold text values
    (make this programmatic in the next update).'''

    # convert text columns to integer values
    le = preprocessing.LabelEncoder()
    for i in to_int:
        x_data[:, i] = le.fit_transform(x_data[:, i])

    # replace missing values with 0 so the array can be cast to int
    for i in range(len(x_data)):
        for j in range(len(x_data[i])):
            if np.isnan(x_data[i][j]):
                x_data[i][j] = 0

    x_data = x_data.astype(int)

    # encode the target labels as integers
    y_data = le.fit_transform(y_data)

    return x_data, y_data
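
LabelEncoder numbers the classes in sorted order, which matters when the integer codes are later mapped back to names. The sketch below shows this on a small made-up array; it assumes the target column holds exactly the strings 'Low', 'Medium' and 'High', which has not been checked against the dataset here.

```python
import numpy as np
from sklearn import preprocessing

# Made-up labels standing in for the target column
labels = np.array(['Low', 'Medium', 'High', 'Low'], dtype=object)

le = preprocessing.LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the labels in sorted order; a label's code is its position there
class_names = list(le.classes_)
decoded = [class_names[c] for c in codes]

print("classes: %s, codes: %s" % (class_names, list(codes)))
```

Indexing a list built from classes_ with np.argmax of a prediction is one way to keep the printed name tied to the encoder's own ordering.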

In [433]:
def run_model(keep, target, to_int):

    # .values replaces as_matrix(), which is deprecated in newer pandas;
    # ravel() flattens the single target column into a 1-D array
    y = target.values.ravel()
    x = keep.values

    x_data, y_data = process_x_y(x, y, to_int)

    # classification() performs its own train/test split
    classification(x_data, y_data)

In [440]:
csv_file = 'datasets/compas-scores-two-years-violent.csv'
df = pd.read_csv(csv_file)

## Model 1
This model trains on sex, age, race, juvenile felony count, juvenile misdemeanor count, juvenile other count, priors count and charge degree, and uses these features to predict the COMPAS risk score level. It scores ~0.84 on the held-out 15% of the data. Note that the code computes this figure with roc_auc_score on the first class, so it is an ROC AUC rather than a plain accuracy.

It also prints the prediction for two examples: a 22-year-old Caucasian male charged with his first misdemeanor, and a 29-year-old Caucasian female in the same situation.

### For calculating your own risk (or an example of choice):

Look for occurrences of the < EXAMPLE > tag in the code below and change the values in the indicated arrays to whatever you want. The order and type of each value correspond to the "keep" list a few lines above the arrays. If you pull the repo and can run IPython notebooks, run the cell and it will print the classification for your example. The values that appear in the training set for each feature are listed below; where no values are listed after the "#", enter an integer.

* 'sex', # 'Male' or 'Female' (I know this enforces a binary but it's what the training data from the FOIA requests provided)
* 'age',
* 'race', # 'African-American', 'Caucasian', 'Asian', 'Hispanic', 'Native American', 'Other'
* 'juv_fel_count',
* 'juv_misd_count',
* 'juv_other_count',
* 'priors_count',
* 'c_charge_degree' # 'F' or 'M' (felony or misdemeanor)
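
Because the example arrays are positional, a value in the wrong slot goes unnoticed. As one possible safeguard, the sketch below builds the array from a dictionary keyed by feature name. make_example_row and KEEP are hypothetical names introduced for illustration; they are not part of the notebook.

```python
# Feature order, copied from the "keep" list used for Model 1
KEEP = ['sex', 'age', 'race', 'juv_fel_count', 'juv_misd_count',
        'juv_other_count', 'priors_count', 'c_charge_degree']


def make_example_row(values, columns=KEEP):
    """Return the values ordered to match `columns`; raise on missing or extra keys."""
    missing = [c for c in columns if c not in values]
    extra = [k for k in values if k not in columns]
    if missing or extra:
        raise ValueError('missing: %s, unexpected: %s' % (missing, extra))
    return [values[c] for c in columns]


example = make_example_row({
    'sex': 'Male', 'age': 22, 'race': 'Caucasian',
    'juv_fel_count': 0, 'juv_misd_count': 0, 'juv_other_count': 0,
    'priors_count': 0, 'c_charge_degree': 'M',
})
print(example)
```

The returned list can be used in place of a hand-written array in the < EXAMPLE > lines.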

In [441]:
# Predicting the score assigned for risk of violent recidivism

keep = [
 'sex',
 'age',
 'race',
 'juv_fel_count',
 'juv_misd_count',
 'juv_other_count',
 'priors_count',
 'c_charge_degree']

target = ['v_score_text']
text_cols = [0,2,7]
to_keep = df[keep]
to_target = df[target]

# <EXAMPLE>: Adding in an example of choice
# Each example needs its own row label; reusing df.shape[0] would overwrite the first example.
to_keep.loc[df.shape[0]] = ['Male',22,'Caucasian',0,0,0,0,'M'] # <--- change the values in this array
to_target.loc[df.shape[0]] = ['Low']
to_keep.loc[df.shape[0] + 1] = ['Female',29,'Caucasian',0,0,0,0,'M'] # <--- and/or change the values in this array
to_target.loc[df.shape[0] + 1] = ['Low']

run_model(to_keep, to_target, text_cols)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 70659
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.21it/s]
INFO:root:Epoch:0, train loss: 5.44387433872, train accuracy: 0.715277777778, elapsed: 0.265343904495 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.90it/s]
INFO:root:Epoch:1, train loss: 2.02447621972, train accuracy: 0.715277777778, elapsed: 0.259624004364 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.7

('classification accuracy', 0.84311399346198468)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
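
The SettingWithCopyWarning output above is raised because to_keep and to_target are selections from df that are then modified. One common way to avoid the warning is to take an explicit .copy() before adding rows and to give each added row its own label. The sketch below does this on a made-up three-column frame rather than the real dataset.

```python
import pandas as pd

# Tiny made-up stand-in for the real dataset
df = pd.DataFrame({
    'sex': ['Male', 'Female'],
    'age': [30, 41],
    'race': ['Other', 'Asian'],
})
keep = ['sex', 'age', 'race']

# Explicit copy, so later assignments do not act on a selection of df
to_keep = df[keep].copy()

# Append two example rows under distinct labels
n = df.shape[0]
to_keep.loc[n] = ['Male', 22, 'Caucasian']
to_keep.loc[n + 1] = ['Female', 29, 'Caucasian']

print(to_keep.shape)
```

The original df is left untouched, and both examples end up as the last two rows of to_keep.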


## Model 2
This model trains on sex, age and race, and uses these features to predict the COMPAS risk score level. It scores about 0.815 on the held-out 15% of the data. As with Model 1, the code computes this figure with roc_auc_score on the first class, so it is an ROC AUC rather than a plain accuracy.

It also prints the prediction for two examples: a 22-year-old Caucasian male and a 29-year-old Caucasian female.

In [442]:
# Predicting the score assigned for risk of violent recidivism using JUST age, sex and race

keep = [
 'sex',
 'age',
 'race']

target = ['v_score_text']
text_cols = [0,2]

to_keep = df[keep]
to_target = df[target]

# <EXAMPLE>: Adding in an example of choice
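# Each example needs its own row label; reusing df.shape[0] would overwrite the first example.
to_keep.loc[df.shape[0]] = ['Male',22,'Caucasian']
to_target.loc[df.shape[0]] = ['Low']
to_keep.loc[df.shape[0] + 1] = ['Female',29,'Caucasian']
to_target.loc[df.shape[0] + 1] = ['Low']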

run_model(to_keep, to_target, text_cols)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 68099
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 69.39it/s]
INFO:root:Epoch:0, train loss: 4.30000393931, train accuracy: 0.715277777778, elapsed: 0.263988018036 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 62.68it/s]
INFO:root:Epoch:1, train loss: 1.70967224467, train accuracy: 0.715277777778, elapsed: 0.28728890419 sec.
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 71.85

('classification accuracy', 0.81514288727196038)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score


## Model 3
This model uses only race to predict the COMPAS risk score level. It scores ~0.62 on the held-out 15% of the data. As with the other models, the code computes this figure with roc_auc_score on the first class, so it is an ROC AUC rather than a plain accuracy.

It also prints the prediction for two examples: one is Caucasian and the other is African-American.

In [443]:
# Predicting the score assigned for risk of violent recidivism using JUST race

keep = [
 'race']

target = ['v_score_text']
text_cols = [0]

to_keep = df[keep]
to_target = df[target]

# <EXAMPLE>: Adding in an example of choice
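# Each example needs its own row label; reusing df.shape[0] would overwrite the first example.
to_keep.loc[df.shape[0]] = ['Caucasian']
to_target.loc[df.shape[0]] = ['Low']
to_keep.loc[df.shape[0] + 1] = ['African-American']
to_target.loc[df.shape[0] + 1] = ['Low']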

run_model(to_keep, to_target, text_cols)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
INFO:root:Total parameters: 67075
Epoch progress: 100%|██████████| 16/16 [00:00<00:00, 67.32it/s]
INFO:root:Epoch:0, train loss: 1.20407417912, train accuracy: 0.715277777778, elapsed: 0.270169019699 sec.
Epoch progress: 100%|██████████| 16/16 

('classification accuracy', 0.62489454813877465)
Your first example is predicted to have a MEDIUM risk score
Your second example is predicted to have a MEDIUM risk score
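
In all three runs the logged train accuracy is 0.715277777778 for every epoch shown. One possible explanation, which the logs alone cannot confirm, is that the network predicts a single class for every row, in which case its accuracy would equal that class's share of the data. A quick check is to compare against a majority-class baseline. The sketch below does so on made-up labels; majority_baseline is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def majority_baseline(y):
    """Return the most common label and the accuracy of always predicting it."""
    values, counts = np.unique(y, return_counts=True)
    best = np.argmax(counts)
    return values[best], counts[best] / float(len(y))


# Made-up labels; replace with the real target column to run the check
y = np.array(['Low', 'Low', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Low'])
label, acc = majority_baseline(y)

print("always predicting %s gives accuracy %.3f" % (label, acc))
```

If the baseline computed on the real training labels matches the logged accuracy, the models above are probably not learning to separate the classes, whatever the AUC on one class suggests.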
