## Homework 5-2: Fair prediction

In this homework you will experiment with modifying the logistic regression classifier we built on the COMPAS data, tuning it to get equal false positive rates between black and white defendants.

### Part 0. Loading the data and building the feature matrix.
Free code, copied from our class notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn import metrics
%matplotlib inline

In [2]:
# Select between data on overall arrests and arrests for violent crimes
# This allows quick comparisons of the difference between these two data sets
violent = False

if violent:
    fname ='compas-scores-two-years-violent.csv'
    decile_col = 'v_decile_score'
    score_col = 'v_score_text'
else:
    fname ='compas-scores-two-years.csv'
    decile_col = 'decile_score'
    score_col = 'score_text'


In [3]:
cv = pd.read_csv(fname)

In [4]:
# Data cleaning à la ProPublica
cv = cv[
    (cv.days_b_screening_arrest <= 30) &  
    (cv.days_b_screening_arrest >= -30) &  
    (cv.is_recid != -1) &
    (cv.c_charge_degree != 'O') &
    (cv[score_col] != 'N/A')
]

# Keep only black and white races for this analysis
cv = cv[(cv.race == 'African-American') | (cv.race=='Caucasian')]
         
# renumber the rows from 0 again
cv.reset_index(inplace=True, drop=True) 
cv.shape

(5278, 53)

In [5]:
# build up dummy variables for age category, sex, and charge degree
features = pd.concat(
    [pd.get_dummies(cv.age_cat, prefix='age'),
     pd.get_dummies(cv.sex, prefix='sex'),
     pd.get_dummies(cv.c_charge_degree, prefix='degree'), # felony or misdemeanor charge ('F' or 'M')
     cv.priors_count],
    axis=1)

# We should have one fewer dummy variable than the number of categories, to avoid the "dummy variable trap"
# See https://www.quora.com/When-do-I-fall-in-the-dummy-variable-trap
features.drop(['age_25 - 45', 'sex_Female', 'degree_M'], axis=1, inplace=True)

# Try to predict whether someone is re-arrested
target = cv.two_year_recid
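
The reason for dropping one dummy column per category can be seen on a tiny example. This is a minimal sketch, assuming a made-up four-row age column (the values are hypothetical and not taken from the COMPAS file): the full set of dummies always sums to 1 in every row, which duplicates the information already carried by the model's intercept.

```python
import pandas as pd

# Hypothetical mini-sample of an age category column (made up for illustration)
age_cat = pd.Series(['25 - 45', 'Less than 25', 'Greater than 45', '25 - 45'])

# One indicator column per category
full = pd.get_dummies(age_cat, prefix='age')

# Each row belongs to exactly one category, so the indicators sum to 1
row_sums = full.astype(int).sum(axis=1)
print(row_sums.tolist())

# Dropping one column makes that category the reference level
reduced = full.drop(columns=['age_25 - 45'])
print(list(reduced.columns))
```

After the drop, the remaining coefficients are read relative to the dropped category, here the 25 to 45 age group.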

### Part 1. Your basic logistic regression

Fit a logistic regression to this data. Print out the accuracy, PPV, FPR, and FNR overall, and separately for black and white defendants.

Most of the code you need can be found in the class notebook.
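
As a reminder of what these four numbers mean, here is a minimal sketch that computes them from the four confusion-matrix counts. The helper name `rates_from_counts` and the toy labels are assumptions made for illustration; they are not part of the class notebook.

```python
import numpy as np

def rates_from_counts(tn, fp, fn, tp):
    """Return accuracy, PPV, FPR and FNR from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,   # share of all predictions that are right
        "ppv": tp / (tp + fp),           # of those predicted positive, share truly positive
        "fpr": fp / (fp + tn),           # of the true negatives, share predicted positive
        "fnr": fn / (fn + tp),           # of the true positives, share predicted negative
    }

# Toy labels, made up for illustration only
actual = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
guessed = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

tn = int(np.sum((guessed == 0) & (actual == 0)))
fp = int(np.sum((guessed == 1) & (actual == 0)))
fn = int(np.sum((guessed == 0) & (actual == 1)))
tp = int(np.sum((guessed == 1) & (actual == 1)))

toy_rates = rates_from_counts(tn, fp, fn, tp)
print(toy_rates)
```

Note that FPR and FNR are each computed within one actual class, while PPV is computed within one predicted class, which is why they can move in different directions when a threshold changes.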

In [6]:
# Fit a logistic regression
x = features.values
y = target.values
lr = LogisticRegression()
lr.fit(x,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
# Predict the result on the training data
y_pred = lr.predict(x)
guessed=pd.Series(y_pred)==1
actual=cv.two_year_recid==1

In [11]:
# Free code for you!

# cm is a confusion matrix. The rows are guessed, the columns are actual 
def print_ppv_fpv(cm):
    # the indices here are [col][row] or [actual][guessed]
    TN = cm[False][False]   
    TP = cm[True][True]
    FN = cm[True][False]
    FP = cm[False][True]
    print('Accuracy: ', (TN+TP)/(TN+TP+FN+FP))
    print('PPV: ', TP / (TP + FP))
    print('FPR: ', FP / (FP + TN))
    print('FNR: ', FN / (FN + TP))
    print()

def print_metrics(guessed, actual):
    cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
    print(cm)
    print()
    print_ppv_fpv(cm)    


In [12]:
# Print out the accuracy, PPV, FPR, FNR for
#  - everyone 
print('Everyone')
print('========')
print_metrics(guessed, actual)

#  - just white defendants
print('Caucasian')
print('=====')
subset = cv.race == 'Caucasian'
print_metrics(guessed[subset], actual[subset])


#  - just black defendants
print('African-American')
print('================')
subset = cv.race == 'African-American'
print_metrics(guessed[subset], actual[subset])


Everyone
actual   False  True 
guessed              
False     2076   1047
True       719   1436

Accuracy:  0.665403561955286
PPV:  0.6663573085846868
FPR:  0.25724508050089445
FNR:  0.4216673378977044

Caucasian
=====
actual   False  True 
guessed              
False     1061    493
True       220    329

Accuracy:  0.6609605325725154
PPV:  0.599271402550091
FPR:  0.1717408274785324
FNR:  0.5997566909975669

African-American
actual   False  True 
guessed              
False     1015    554
True       499   1107

Accuracy:  0.6683464566929134
PPV:  0.6892901618929016
FPR:  0.3295904887714663
FNR:  0.33353401565322094



### Part 2. Equalizing false positive rates
Now you'll build your own classifier that equalizes the false positive rates between white and black defendants. There are many ways to do this. We're going to use race explicitly to set a different threshold for white and black defendants.

To begin with, we are going to write our own prediction function, starting with this one:

In [13]:
# This takes a trained LogisticRegression, a set of features, and a threshold
# Predicts true wherever the regression gives a probability > threshold
# Note: returns a numpy array, not a dataframe
def predict_threshold(classifier, features, threshold):
    # predict_proba returns two columns, ordered as in classifier.classes_:
    # probability of false (class 0), then probability of true (class 1)
    # [:,1] selects the second column
    return classifier.predict_proba(features)[:,1] > threshold

In [14]:
# This is the same as lr.predict(x) when we use a threshold of 0.5
guessed2 = predict_threshold(lr, x, 0.5)
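
Which `predict_proba` column holds the positive class can be checked through the fitted model's `classes_` attribute. The sketch below uses a made-up one-feature data set (an assumption for illustration, not the COMPAS features) and confirms that thresholding the positive-class column at 0.5 reproduces `predict` on this toy data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data set, made up for illustration
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# classes_ lists the label belonging to each predict_proba column, in order
print(clf.classes_)

proba = clf.predict_proba(X)
positive_col = list(clf.classes_).index(1)

# Thresholding the positive-class probability at 0.5 should match predict()
manual = (proba[:, positive_col] > 0.5).astype(int)
same = np.array_equal(manual, clf.predict(X))
print(same)
```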

Now adapt this function so it takes two thresholds, `a_threshold` and `b_threshold`, and a boolean column `use_b` that selects `b_threshold` for every row where it is true and `a_threshold` everywhere else. The idea is to allow us to adjust the thresholds independently on two different groups.

In [15]:
# Write a function which takes the following arguments
def predict_threshold_groups(classifier, features, a_threshold, b_threshold, use_b):
    # predict_threshold computes the probabilities from our classifier and
    # returns a boolean numpy array, so call it once per threshold:
    # one array is True where the probability is bigger than a_threshold,
    # the other where it is bigger than b_threshold
    series_a = predict_threshold(classifier, features, a_threshold)
    series_b = predict_threshold(classifier, features, b_threshold)
    
    # Then combine them, selecting values from either array according to use_b
    return np.where(use_b, series_b, series_a)
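
An equivalent way to think about the two-threshold function is to build one threshold per row and compare once. This is a minimal sketch under the assumption that the probabilities have already been computed; the name `predict_with_row_thresholds` and the sample numbers are hypothetical.

```python
import numpy as np

def predict_with_row_thresholds(probabilities, a_threshold, b_threshold, use_b):
    """Compare each probability against the threshold for its own group."""
    # Pick b_threshold where use_b is True, a_threshold elsewhere
    thresholds = np.where(use_b, b_threshold, a_threshold)
    return np.asarray(probabilities) > thresholds

# Made-up probabilities: the first two rows are group A, the last two group B
probs = np.array([0.45, 0.55, 0.55, 0.65])
use_b = np.array([False, False, True, True])

result = predict_with_row_thresholds(probs, 0.5, 0.6, use_b)
print(result)
```

The same probability of 0.55 is classified as positive in group A and negative in group B, which is exactly the effect the exercise relies on.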

Now use this function with different thresholds for black and white defendants. Print out the confusion matrix, accuracy, FPR, and PPV for the results, again overall and for each race.

In [42]:
# Predict recidivism with different thresholds for black and white
is_caucasian = cv.race == 'Caucasian'
is_african_american = cv.race == 'African-American'

guessed = predict_threshold_groups(lr, x, 0.5, 0.587, is_african_american)


# Print out metrics for everyone, black, and white
print('Everyone')
print('========')
print_metrics(guessed, actual)

print('Caucasian')
print('=====')
print_metrics(guessed[is_caucasian], actual[is_caucasian])

print('African-American')
print('================')
print_metrics(guessed[is_african_american], actual[is_african_american])


Everyone
actual   False  True 
guessed              
False     2305   1360
True       490   1123

Accuracy:  0.6494884425918909
PPV:  0.6962182269063856
FPR:  0.17531305903398928
FNR:  0.5477245267821184

Caucasian
=====
actual   False  True 
guessed              
False     1061    493
True       220    329

Accuracy:  0.6609605325725154
PPV:  0.599271402550091
FPR:  0.1717408274785324
FNR:  0.5997566909975669

African-American
actual   False  True 
guessed              
False     1244    867
True       270    794

Accuracy:  0.6418897637795276
PPV:  0.7462406015037594
FPR:  0.178335535006605
FNR:  0.5219747140276941



Tune the thresholds so the False Positive Rate is the same for white and black defendants.
- What did you change to achieve this?
- What effect does this have on the overall accuracy, FPR, FNR, and PPV?
- What effect does this have on the PPV for white and black?


In this case I raised the threshold for black defendants from 0.5 to 0.587 and left the threshold for white defendants at 0.5, which roughly equalizes the FPR at 17.2% for white and 17.8% for black defendants. Overall accuracy fell only slightly, from 66.5% to 64.9%, while the overall FPR fell from 26% to 18%, the overall FNR rose from 42% to 55%, and the overall PPV rose from 67% to 70%. For black defendants, accuracy fell from 67% to 64%, but the PPV (the probability that someone categorized as high risk will actually be re-arrested within two years) increased from 69% to 75%, because the higher threshold removes some of the less risky people from the high risk group. The PPV for white defendants is unchanged at 60%, since their threshold did not change. The cost is a higher false negative rate for black defendants, which rose from 33% to 52%.
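
The threshold above was found by trial and error. One way to automate that search is to scan a list of candidate thresholds and keep the one whose FPR is closest to a target. This is only a sketch under stated assumptions: the helper names `false_positive_rate` and `threshold_matching_fpr` are hypothetical, and the scores are a small made-up sample rather than the COMPAS predictions.

```python
import numpy as np

def false_positive_rate(guessed, actual):
    """FP / (FP + TN), computed over the rows whose actual label is False."""
    negatives = ~actual
    return np.sum(guessed & negatives) / np.sum(negatives)

def threshold_matching_fpr(probabilities, actual, target_fpr, candidates):
    """Return the candidate threshold whose FPR is closest to target_fpr."""
    gaps = [abs(false_positive_rate(probabilities > t, actual) - target_fpr)
            for t in candidates]
    return candidates[int(np.argmin(gaps))]

# Made-up scores for one group: 10 actual negatives followed by 5 actual positives
actual = np.array([False] * 10 + [True] * 5)
probs = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95,
                  0.4, 0.6, 0.7, 0.85, 0.9])

candidates = np.array([0.5, 0.6, 0.7, 0.8])
best = threshold_matching_fpr(probs, actual, target_fpr=0.2, candidates=candidates)
best_fpr = false_positive_rate(probs > best, actual)
print(best, best_fpr)
```

To equalize two groups, one could hold the first group's threshold fixed, measure its FPR, and pass that value as `target_fpr` when scanning thresholds on the second group's rows.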

### Bonus: Predicting race and the impossibility of blinding
So far we've excluded race as a predictive variable, hoping that this would make the results unbiased. But is race encoded in the other data points? To find out, alter the regression above to try to predict race from the other demographic and criminal history variables.

How accurately can you predict race just on these factors alone?

In [47]:
# Use cross validation and the classifier of your choice to see how well you can predict race
from sklearn.model_selection import cross_val_score

my_classifier = tree.DecisionTreeClassifier()
scores = cross_val_score(my_classifier, 
                         features.values, 
                         pd.get_dummies(cv.race, prefix='race')['race_African-American'].values,
                         cv=5)
scores

array([0.64867424, 0.63920455, 0.63541667, 0.62654028, 0.63981043])

Let's compare this accuracy to just guessing one race all the time. Which race is more common in this data, and what would the accuracy be if we always guessed that race?

In [None]:
# What is the most common race in our arrest data?
# African-American

In [48]:
# What is the accuracy if we always guess the most common race?
cv.race.value_counts()

African-American    3175
Caucasian           2103
Name: race, dtype: int64

In [49]:
3175/(3175+2103)

0.6015536187949981

Based on this, how much information about race "leaks" into our original recidivism predictor, even if we don't give it the race variable as a feature?

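The Decision Tree Classifier guessed race correctly about 63.8% of the time on average across the five folds, compared with 60.2% for always guessing African-American. That is an improvement of roughly 2.5 to 4.7 percentage points depending on the fold, so the other features carry some information about race, but not a great deal.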
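
The comparison with the majority-class baseline is simple arithmetic on the numbers printed above. The sketch below copies the cross-validation scores and the race counts from those outputs; the variable names are chosen here for illustration.

```python
import numpy as np

# Cross-validation scores and race counts copied from the outputs above
scores = np.array([0.64867424, 0.63920455, 0.63541667, 0.62654028, 0.63981043])
counts = {'African-American': 3175, 'Caucasian': 2103}

# Accuracy of always guessing the most common race
baseline = max(counts.values()) / sum(counts.values())

# Improvement of each fold over that baseline, as a fraction
gain = scores - baseline

print('baseline:', baseline)
print('mean score:', scores.mean())
print('gain range:', gain.min(), gain.max())
```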