## Homework: Fair prediction

In this homework you will build a logistic regression classifier on the Machine Bias data, then tune it to get equal false positive rates between black and white defendants.

### Submitted By: 
Name - Shreya Vaidyanathan; 
UNI - sv2525

### Part 0. Loading the data and building the feature matrix.
Free code, copied from our class notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn import metrics
%matplotlib inline

In [2]:
# Select between data on overall arrests and arrests for violent crimes
# This allows quick comparisons of the difference between these two data sets
violent = False

if violent:
    fname ='compas-scores-two-years-violent.csv'
    decile_col = 'v_decile_score'
    score_col = 'v_score_text'
else:
    fname ='compas-scores-two-years.csv'
    decile_col = 'decile_score'
    score_col = 'score_text'


In [3]:
cv = pd.read_csv(fname)

In [4]:
# Data cleaning à la ProPublica
cv = cv[
    (cv.days_b_screening_arrest <= 30) &  
    (cv.days_b_screening_arrest >= -30) &  
    (cv.is_recid != -1) &
    (cv.c_charge_degree != 'O') &
    (cv[score_col] != 'N/A')
]

# Keep only black and white races for this analysis
cv = cv[(cv.race == 'African-American') | (cv.race=='Caucasian')]
         
# renumber the rows from 0 again
cv.reset_index(inplace=True, drop=True) 
cv.shape

(5278, 53)

In [5]:
# build up dummy variables for age, race, gender
features = pd.concat(
    [pd.get_dummies(cv.age_cat, prefix='age'),
     pd.get_dummies(cv.sex, prefix='sex'),
     pd.get_dummies(cv.c_charge_degree, prefix='degree'), # felony or misdemeanor charge ('f' or 'm')
     cv.priors_count],
    axis=1)

# We should have one less dummy variable than the number of categories, to avoid the "dummy variable trap"
# See https://www.quora.com/When-do-I-fall-in-the-dummy-variable-trap
features.drop(['age_25 - 45', 'sex_Female', 'degree_M'], axis=1, inplace=True)

# Try to predict whether someone is re-arrested
target = cv.two_year_recid

In [6]:
# COMPAS text score value counts
# cv[decile_col].value_counts()
# high risk rates by race
score_race = pd.crosstab(cv.race, cv[score_col])
score_race['High risk rate'] = score_race['High'] / score_race.sum(axis=1)
score_race

score_text        High   Low  Medium  High risk rate
race
African-American   845  1346     984        0.266142
Caucasian          223  1407     473        0.106039


In [7]:
# high risk rates by sex
score_sex = pd.crosstab(cv.sex, cv[score_col])
score_sex['High risk rate'] = score_sex['High'] / score_sex.sum(axis=1)
score_sex

score_text  High   Low  Medium  High risk rate
sex
Female       148   575     308        0.143550
Male         920  2178    1149        0.216623


### Part 1. Your basic logistic regression

Fit a logistic regression to this data. Print out the accuracy, PPV, and FPR overall, and for just black vs. white defendants. 

Most of the code you need can be found in the class notebook.

In [8]:
# Fit a logistic regression
x = features.values                #Feature vectors
y = target.values                  #Labels  
lr = LogisticRegression()          #Logistic regression initialization   
lr.fit(x,y)                        #Fit the LR to the data

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

This is a logistic regression, so exponentiating each coefficient (undoing the logarithm) gives an odds ratio. Let's look at them to see what weights the model used to make its predictions.
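As a small illustration of what "undoing the logarithm" means, the sketch below uses made-up coefficient values (`log_odds_coefs` and `base_odds` are assumptions, not the fitted values) to show how a log-odds coefficient becomes an odds ratio and how that ratio changes a probability.

```python
import numpy as np

# Hypothetical log-odds coefficients, chosen only for illustration
log_odds_coefs = np.array([-0.7, 0.7, 0.17])

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios = np.exp(log_odds_coefs)

# An odds ratio of r multiplies the odds by r for each one-unit increase
base_odds = 0.5                        # assumed baseline odds (probability 1/3)
new_odds = base_odds * odds_ratios[2]
new_prob = new_odds / (1 + new_odds)   # convert odds back to a probability
print(odds_ratios, new_prob)
```

An odds ratio below 1 lowers the odds of re-arrest and one above 1 raises them, which is how the table of exponentiated coefficients can be read.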

In [9]:
# Predict the result on the training data
# Examine regression coefficients

coeffs = pd.DataFrame(np.exp(lr.coef_), columns=features.columns)
coeffs

   age_Greater than 45  age_Less than 25  sex_Male  degree_F  priors_count
0             0.488937          2.091818  1.417314  1.210107      1.179172


In [10]:
# Crosstab for our predictive model
y_pred = lr.predict(x)
guessed=pd.Series(y_pred)==1

actual=cv.two_year_recid==1

cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
cm  #for "confusion matrix"

actual   False  True 
guessed              
False     2076   1047
True       719   1436


In [11]:
# Free code for you!

# cm is a confusion matrix. The rows are guessed, the columns are actual 
def print_ppv_fpv(cm):
    # the indices here are [col][row] or [actual][guessed]
    TN = cm[False][False]   
    TP = cm[True][True]
    FN = cm[True][False]
    FP = cm[False][True]
    print('Accuracy: ', (TN+TP)/(TN+TP+FN+FP))
    print('PPV: ', TP / (TP + FP))
    print('FPR: ', FP / (FP + TN))
    print('FNR: ', FN / (FN + TP))
    print()

def print_metrics(guessed, actual):
    cm = pd.crosstab(guessed, actual, rownames=['guessed'], colnames=['actual'])
    print(cm)
    print()
    print_ppv_fpv(cm)    


In [12]:
# Print out the accuracy, PPV, FPR, FNR for
#  - everyone 
#  - just white defendants
#  - just black defendants
print_ppv_fpv(cm)
print('White')
jwd = cv.race == 'Caucasian'
print_metrics(guessed[jwd], actual[jwd])

print('Black')
jbd = cv.race == 'African-American'
print_metrics(guessed[jbd], actual[jbd])

Accuracy:  0.665403561955
PPV:  0.666357308585
FPR:  0.257245080501
FNR:  0.421667337898

White
actual   False  True 
guessed              
False     1061    493
True       220    329

Accuracy:  0.660960532573
PPV:  0.59927140255
FPR:  0.171740827479
FNR:  0.599756690998

Black
actual   False  True 
guessed              
False     1015    554
True       499   1107

Accuracy:  0.668346456693
PPV:  0.689290161893
FPR:  0.329590488771
FNR:  0.333534015653
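As a sanity check, the overall figures can be reproduced by hand from the four cells of the overall confusion matrix (TN=2076, FP=719, FN=1047, TP=1436). The helper name `rates` below is hypothetical and not part of the class notebook.

```python
def rates(tn, fp, fn, tp):
    """Return accuracy, PPV, FPR and FNR from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        'accuracy': (tn + tp) / total,
        'ppv': tp / (tp + fp),   # of those predicted positive, share truly positive
        'fpr': fp / (fp + tn),   # of actual negatives, share predicted positive
        'fnr': fn / (fn + tp),   # of actual positives, share predicted negative
    }

# Counts taken from the overall confusion matrix in this notebook
overall = rates(tn=2076, fp=719, fn=1047, tp=1436)
print(overall)
```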



In [13]:
# jbd  #use_b variable for 

### Part 2. Equalizing false positive rates
Now you'll build your own classifier that equalizes the false positive rates between white and black defendants. There are many ways to do this. We're going to use race explicitly to set a different threshold for white and black defendants. 

To begin with, we are going to write our own prediction function, starting with this one:

In [14]:
# This takes a trained LogisticRegression, a set of features, and a threshold
# Predicts true wherever the regression gives a probability > threshold

# Note: returns a numpy array, not a dataframe

def predict_threshold(classifier, features, threshold):
    # predict_proba returns two columns, ordered as in classifier.classes_:
    # probability of false (0), then probability of true (1)
    # [:,1] selects the second column, the probability of true
    return classifier.predict_proba(features)[:,1] > threshold

In [15]:
# This is the same as lr.predict(x) when we use a threshold of 0.5
guessed2 = predict_threshold(lr, x, 0.5)

In [16]:
# predict_threshold(lr, x, 0.2)

Now adapt this function so it takes two thresholds `a_threshold` and `b_threshold`, and a column of values `use_b` which means use the `b_threshold` for any row where it's true. The idea is to allow us to adjust the thresholds independently on two different groups.

In [17]:
# Write a function which takes the following arguments
def predict_threshold_groups(classifier, features, a_threshold, b_threshold, use_b):
    # Calculate probabilities from the classifier and features passed in,
    # giving one boolean array which is True where the probabilities are
    # bigger than a_threshold, and another for b_threshold
    pred_a = predict_threshold(classifier, features, a_threshold)
    pred_b = predict_threshold(classifier, features, b_threshold)

    # Combine them, selecting values from pred_b wherever use_b is True
    # and from pred_a everywhere else
    return np.where(use_b, pred_b, pred_a)
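To make the group-specific thresholding concrete, here is a minimal sketch on made-up numbers. The arrays `probs` and `use_b` are invented for illustration and stand in for the classifier's predicted probabilities and the group indicator.

```python
import numpy as np

# Made-up predicted probabilities and group membership, for illustration only
probs = np.array([0.45, 0.55, 0.58, 0.62])
use_b = np.array([False, False, True, True])

a_threshold, b_threshold = 0.5, 0.6

# Where use_b is True compare against b_threshold, otherwise a_threshold
thresholds = np.where(use_b, b_threshold, a_threshold)
predictions = probs > thresholds
print(predictions)
```

The third row has a probability of 0.58, which would count as a positive under `a_threshold` but not under the stricter `b_threshold`.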

Now use this function with different thresholds for black and white defendants. Print out the confusion matrix, accuracy, FPR, and PPV for the results, again overall and for each race.

In [27]:
# Predict recidivism with different thresholds for black and white
# Print out metrics for everyone, black, and white

# print_metrics(guessed, actual)
print("\n")
pred_res = predict_threshold_groups(lr, x, 0.5, 0.5, jbd)
# print(pred_res)

guessed3 = pd.Series(pred_res)==1
actual = cv.two_year_recid==1
    
print("Everyone")
cm = pd.crosstab(guessed3, actual, rownames=['guessed'], colnames=['actual'])
# print(cm)

print_ppv_fpv(cm)
print('White')
white = cv.race == 'Caucasian'
print_metrics(guessed3[white], actual[white])

print('Black')
black = cv.race == 'African-American'
print_metrics(guessed3[black], actual[black])



Everyone
Accuracy:  0.665403561955
PPV:  0.666357308585
FPR:  0.257245080501
FNR:  0.421667337898

White
actual   False  True 
guessed              
False     1061    493
True       220    329

Accuracy:  0.660960532573
PPV:  0.59927140255
FPR:  0.171740827479
FNR:  0.599756690998

Black
actual   False  True 
guessed              
False     1015    554
True       499   1107

Accuracy:  0.668346456693
PPV:  0.689290161893
FPR:  0.329590488771
FNR:  0.333534015653



In [63]:
# Predict recidivism with different thresholds for black and white
# Print out metrics for everyone, black, and white

# print_metrics(guessed, actual)
print("\n")
pred_res = predict_threshold_groups(lr, x, 0.48, 0.6, jbd)
# print(pred_res)

guessed3 = pd.Series(pred_res)==1
actual = cv.two_year_recid==1
    
print("Everyone")
cm = pd.crosstab(guessed3, actual, rownames=['guessed'], colnames=['actual'])
print(cm)

print_ppv_fpv(cm)
print('White')
white = cv.race == 'Caucasian'
print_metrics(guessed3[white], actual[white])

print('Black')
black = cv.race == 'African-American'
print_metrics(guessed3[black], actual[black])



Everyone
actual   False  True 
guessed              
False     2311   1424
True       484   1059
Accuracy:  0.638499431603
PPV:  0.686325340246
FPR:  0.173166368515
FNR:  0.573499798631

White
actual   False  True 
guessed              
False     1013    451
True       268    371

Accuracy:  0.658107465525
PPV:  0.580594679186
FPR:  0.209211553474
FNR:  0.548661800487

Black
actual   False  True 
guessed              
False     1298    973
True       216    688

Accuracy:  0.625511811024
PPV:  0.761061946903
FPR:  0.142668428005
FNR:  0.585791691752



Tune the thresholds so the False Positive Rate is the same for white and black defendants.
- What did you change to achieve this?
- What effect does this have on the overall accuracy, FPR, FNR, and PPV?
- What effect does this have on the PPV for white and black?


In [60]:
#(your answer here)
pred_res = predict_threshold_groups(lr, x, 0.515, 0.585, jbd)
# print(pred_res)

guessed3 = pd.Series(pred_res)==1
actual = cv.two_year_recid==1
    
print("Everyone")
cm = pd.crosstab(guessed3, actual, rownames=['guessed'], colnames=['actual'])
print(cm)

print_ppv_fpv(cm)
print('White')
white = cv.race == 'Caucasian'
print_metrics(guessed3[white], actual[white])

print('Black')
black = cv.race == 'African-American'
print_metrics(guessed3[black], actual[black])

Everyone
actual   False  True 
guessed              
False     2319   1366
True       476   1117
Accuracy:  0.651004168246
PPV:  0.701192718142
FPR:  0.17030411449
FNR:  0.550140958518

White
actual   False  True 
guessed              
False     1075    499
True       206    323

Accuracy:  0.664764621969
PPV:  0.610586011342
FPR:  0.16081186573
FNR:  0.607055961071

Black
actual   False  True 
guessed              
False     1244    867
True       270    794

Accuracy:  0.64188976378
PPV:  0.746240601504
FPR:  0.178335535007
FNR:  0.521974714028




#### 1. Change to achieve this :

I changed the values passed to the function 'predict_threshold_groups(classifier, features, a_threshold, b_threshold, use_b)', adjusting the threshold for each race group until I found a point where the FPR was nearly equal for both.

I experimented with both 'a_threshold' and 'b_threshold', starting from the initial 0.5 used by the basic predict function and trying pairs such as (0.3, 0.4), (0.5, 0.6), and (0.43, 0.52). The intuition was that the higher the threshold for black defendants, the less likely they are to be falsely flagged by the algorithm. I also moved the thresholds to extremes for both groups to see how the predictions would change. I noticed that the higher the threshold values for 

The final value I chose was - 
pred_res = predict_threshold_groups(lr, x, 0.515, 0.585, jbd)

An interesting thing I came across was that I obtained nearly equal FPR values for both categories whenever the difference (threshold_b - threshold_a) was around 0.06 to 0.07. For instance, here, when threshold_a=0.515 and threshold_b=0.585, the FPR hovers around 17%, and when I set them to (0.42, 0.48) the FPR was around 28-30%.
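The manual trial-and-error described here could also be automated. The sketch below is one possible approach, not the method used in this homework: it fixes `a_threshold` and scans candidate values for `b_threshold`, keeping the one whose group FPR is closest. The helper names `fpr` and `best_b_threshold` are hypothetical, and the data is synthetic rather than the COMPAS data.

```python
import numpy as np

def fpr(pred, actual):
    """False positive rate: share of actual negatives predicted positive."""
    negatives = ~actual
    return (pred & negatives).sum() / negatives.sum()

def best_b_threshold(probs, actual, use_b, a_threshold, candidates):
    """For a fixed a_threshold, pick the b_threshold whose group FPR is closest."""
    target = fpr(probs[~use_b] > a_threshold, actual[~use_b])
    gaps = [abs(fpr(probs[use_b] > t, actual[use_b]) - target) for t in candidates]
    return candidates[int(np.argmin(gaps))], min(gaps)

# Synthetic stand-in data (NOT the COMPAS data): group b gets higher scores
rng = np.random.default_rng(0)
n = 4000
use_b = rng.random(n) < 0.6
probs = np.clip(rng.normal(0.45 + 0.08 * use_b, 0.15), 0, 1)
actual = rng.random(n) < probs

candidates = np.arange(0.30, 0.80, 0.005)
b_threshold, gap = best_b_threshold(probs, actual, use_b, 0.5, candidates)
print(b_threshold, gap)
```

In the notebook, `probs` would presumably come from `lr.predict_proba(x)[:, 1]`, `actual` from the re-arrest column, and `use_b` from `jbd`, all converted to NumPy arrays.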


#### 2. Effect it had on the overall accuracy, FPR, FNR, PPV?
I noticed that small changes in the threshold for the white category changed that category's values very little, whereas the opposite was true for the black category. I first attributed this to the amount of data per group, but the black group is actually the larger one here (3175 defendants versus 2103), so the difference more likely reflects how many predicted probabilities sit close to the threshold in each group.

The overall values of FPR, FNR, and PPV moved more when threshold_b was adjusted. 

When I raised 'threshold_b' above 0.6, the FPR became very low and the FNR shot up, meaning that with a higher threshold more people who went on to be re-arrested were predicted not to be. This is not ideal, so I pulled back. 


#### 3. What effect does this have on the PPV for white and black ?
PPV rises with the threshold. It measures how precise the positive predictions are, and the PPV values are higher when the threshold is set higher: for black defendants it goes from 0.689 at a threshold of 0.5 to 0.746 at 0.585, and for white defendants from 0.599 at 0.5 to 0.611 at 0.515. When we push the PPV higher by raising the threshold for one class, the FNR also increases, which is not an ideal trade-off (as seen above).

So the essential question of where to set the bar for making this prediction is still very tricky and must factor several other things that are involved in the system. 

### Bonus: Predicting race and the impossibility of blinding
So far we've excluded race as a predictive variable, hoping that this would make the results unbiased. But is race encoded in the other data points? To find out, alter the regression above to try to predict race from the other demographic and criminal history variables.

How accurately can you predict race just on these factors alone?

In [20]:
# Use cross validation and the classifier of your choice to see how well you can predict race
from sklearn.model_selection import cross_val_score
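As a hedged sketch of how cross-validated accuracy can be estimated, the block below runs 5-fold cross-validation with a logistic regression on synthetic data. `X_demo` and `y_demo` are invented stand-ins, so the printed score says nothing about the COMPAS data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in features and binary labels (NOT the COMPAS data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# 5-fold cross-validated accuracy of a logistic regression
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         cv=5, scoring='accuracy')
print(scores.mean())
```

For the bonus question, one option would be to pass `features.values` in place of `X_demo` and a boolean race indicator such as `cv.race == 'African-American'` in place of `y_demo`.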

Let's compare this accuracy to just guessing one race all the time. Which race is more common in this data, and what would the accuracy be if we always guessed that race?

In [21]:
# What is the most common race in our arrest data?
# race value counts
cv.race.value_counts()

African-American    3175
Caucasian           2103
Name: race, dtype: int64

In [22]:
# What is the accuracy if we always guess the most common race?
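A minimal way to work out this baseline, using the counts printed by `value_counts()` above (the variable names here are illustrative only):

```python
# Counts taken from the value_counts() output above
counts = {'African-American': 3175, 'Caucasian': 2103}

# Always guessing the most common race is right for exactly that share of rows
majority_race = max(counts, key=counts.get)
baseline_accuracy = counts[majority_race] / sum(counts.values())
print(majority_race, round(baseline_accuracy, 4))
```

Any race classifier has to beat this roughly 60% baseline before its accuracy indicates that the other features carry information about race.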


Based on this, how much information about race "leaks" into our original recidivism predictor, even if we don't give it the race variable as a feature?

(your answer here)