# Model Audit

## Dataset

You will examine the ProPublica COMPAS dataset, which consists of all criminal defendants who were subject to COMPAS screening in Broward County, Florida, during 2013 and 2014. For each defendant, various information fields (‘features’) were also gathered by ProPublica. Broadly, these fields are related to the defendant’s demographic information (e.g., gender and race), criminal history (e.g., the number of prior offenses) and administrative information about the case (e.g., the case number, arrest date, risk of recidivism predicted by the COMPAS tool). Finally, the dataset also contains information about whether the defendant did actually recidivate or not.

The COMPAS score uses answers to 137 questions to assign a risk score to defendants -- essentially a probability of re-arrest. The actual output is two-fold: a risk rating of 1-10 and a "low", "medium", or "high" risk label.

Link to dataset: https://github.com/propublica/compas-analysis

The file we will analyze is: compas-scores-two-years.csv

Link to the ProPublica article:

https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing


## Project Background and Goals

- The COMPAS scores have been shown to have biases against certain racial groups. Analyze the dataset to highlight these biases.  

- Based on the features in the COMPAS dataset, train classifiers to predict who will re-offend (hint: no need to use all features, just the ones you find relevant).  Study if your classifiers are more or less fair than the COMPAS classifier. 

- Build a fair classifier. Is excluding the race from the feature set enough?

## Fair Re-Offend Predictor

by Steve

In [1]:
# Audit comments will be made in commented code cells, to distinguish from Steve's own writing and code

## Download the data

First load the data from the ProPublica repo:
https://github.com/propublica/compas-analysis


In [25]:
import pandas as pd
url = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
df = pd.read_csv(url)
df

Unnamed: 0,id,name,first,last,compas_screening_date,sex,dob,age,age_cat,race,...,v_decile_score,v_score_text,v_screening_date,in_custody,out_custody,priors_count.1,start,end,event,two_year_recid
0,1,miguel hernandez,miguel,hernandez,2013-08-14,Male,1947-04-18,69,Greater than 45,Other,...,1,Low,2013-08-14,2014-07-07,2014-07-14,0,0,327,0,0
1,3,kevon dixon,kevon,dixon,2013-01-27,Male,1982-01-22,34,25 - 45,African-American,...,1,Low,2013-01-27,2013-01-26,2013-02-05,0,9,159,1,1
2,4,ed philo,ed,philo,2013-04-14,Male,1991-05-14,24,Less than 25,African-American,...,3,Low,2013-04-14,2013-06-16,2013-06-16,4,0,63,0,1
3,5,marcu brown,marcu,brown,2013-01-13,Male,1993-01-21,23,Less than 25,African-American,...,6,Medium,2013-01-13,,,1,0,1174,0,0
4,6,bouthy pierrelouis,bouthy,pierrelouis,2013-03-26,Male,1973-01-22,43,25 - 45,Other,...,1,Low,2013-03-26,,,2,0,1102,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7209,10996,steven butler,steven,butler,2013-11-23,Male,1992-07-17,23,Less than 25,African-American,...,5,Medium,2013-11-23,2013-11-22,2013-11-24,0,1,860,0,0
7210,10997,malcolm simmons,malcolm,simmons,2014-02-01,Male,1993-03-25,23,Less than 25,African-American,...,5,Medium,2014-02-01,2014-01-31,2014-02-02,0,1,790,0,0
7211,10999,winston gregory,winston,gregory,2014-01-14,Male,1958-10-01,57,Greater than 45,Other,...,1,Low,2014-01-14,2014-01-13,2014-01-14,0,0,808,0,0
7212,11000,farrah jean,farrah,jean,2014-03-09,Female,1982-11-17,33,25 - 45,African-American,...,2,Low,2014-03-09,2014-03-08,2014-03-09,3,0,754,0,0


## Data Cleaning

I got rid of difficult columns and then turned categorical features into numbers

In [26]:
df.columns

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')

In [27]:
df = df[['id', 'sex', 'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count', 'c_charge_degree', 
       'is_recid', 'is_violent_recid', 'decile_score.1', 'v_decile_score',
       'two_year_recid']]

In [28]:
# Of the kept columns, several groups seem redundant and prone to cause overfitting:
#   'decile_score', 'decile_score.1', 'v_decile_score'
#   'age', 'age_cat'
#   'is_recid', 'is_violent_recid', 'two_year_recid'

# Inclusion of the COMPAS score itself ('decile_score' etc) conflicts with assignment goal:
#    "Study if your classifiers are more or less fair than the COMPAS classifier."

# 'is_recid', 'is_violent_recid' closely related to target feature 'two_year_recid';
# model may just fit to these without finding patterns in other features


In [29]:
df.isnull().sum()

id                  0
sex                 0
age                 0
age_cat             0
race                0
juv_fel_count       0
decile_score        0
juv_misd_count      0
juv_other_count     0
priors_count        0
c_charge_degree     0
is_recid            0
is_violent_recid    0
decile_score.1      0
v_decile_score      0
two_year_recid      0
dtype: int64

In [30]:
df = df.dropna()

In [31]:
# Identifying nulls is a good first step of data cleaning, but why drop nulls when there are none?
# Not a problem but possible sign of poor understanding of code

In [32]:
df = df.set_index('id')

In [33]:
dummy_cols = ['sex','age_cat','race','c_charge_degree'] 

df = pd.get_dummies(df, columns=dummy_cols)

In [34]:
X = df.drop('two_year_recid', axis=1)
y = df['two_year_recid']

In [35]:
# 'is_recid' would be the appropriate target for a model predicting recidivism generally,
# although 'is_violent_recid' and 'two_year_recid' are each appropriate for more specific analysis.

# The starting dataframe was simply named 'df' and all modifications were done in-place; this is sufficient
# but could lead to confusion in more complex analysis.

In [36]:
# looking at dummy encoded columns for audit
X.columns

Index(['age', 'juv_fel_count', 'decile_score', 'juv_misd_count',
       'juv_other_count', 'priors_count', 'is_recid', 'is_violent_recid',
       'decile_score.1', 'v_decile_score', 'sex_Female', 'sex_Male',
       'age_cat_25 - 45', 'age_cat_Greater than 45', 'age_cat_Less than 25',
       'race_African-American', 'race_Asian', 'race_Caucasian',
       'race_Hispanic', 'race_Native American', 'race_Other',
       'c_charge_degree_F', 'c_charge_degree_M'],
      dtype='object')

In [37]:
# dummy encoding the 'sex' and 'c_charge_degree' columns has created redundancies from binary source columns:
# 'sex_Female', 'sex_Male'
# 'c_charge_degree_F', 'c_charge_degree_M'
# these could again lead to overfitting

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

## Model Selection

I picked an SVC because I like them.

In [39]:
from sklearn.svm import SVC

# Create the SVM classifier
clf = SVC()

# Train the classifier
clf.fit(X_train, y_train)

# Get the confusion matrix
y_pred = clf.predict(X_test)

In [40]:
# SVC's use non-linear functions in high dimensional decision space to set decision function, 
# which makes interperetaion of the relationship between individual features and the decision boundary challenging.
# As this exercise is about analyzing how individual features may affect model bias, 
# an SVC was not an ideal model choice.

# Logistic Regression, Random Forest, XGBoost might be better options for interpretability.

# Again, choosing a model "because I like them" does not demonstrate understanding of the process.

## Model Evaluation

I checked accuracy and did a Confusion Matrix

In [41]:
import numpy as np

accuracy = np.mean(y_pred == y_test)

print('Accuracy:', accuracy)

Accuracy: 0.9611973392461197


In [42]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm

array([[951,  69],
       [  1, 783]])

## Model Summary

This is a very good and fair model because it is very accurate and predicts very well.

In [43]:
# "it is very accurate and predicts very well" technically correct but does not address assignment goals: 
#   "Study if your classifiers are more or less fair than the COMPAS classifier."
#   "Build a fair classifier. Is excluding the race from the feature set enough?"


---
# Your Turn
---

## Recommendations to Improve the Model and Reduce Bias

`The simplest method to reduce racial bias in the model would be to drop the 'race' column.  `

## Checking Bias

Using a method of your choosing retrieve feature importance for Steve's model

Compare predictions between `African-American` and `Caucasian` using a Confusion Matrix or any other tools

In [44]:
# SVC is non-linear, use permutation_importance to assign feature importance
from sklearn.inspection import permutation_importance
import numpy as np

# Get names of columns
feature_names = X_test.columns

# Calculate permutation importance (clf is Steve's SVC model in provided code above)
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)

# Use argsort to sort importances by the absolute value
sorted_indices = np.argsort(np.abs(result.importances_mean))

# Print feature names and their importance values sorted by importance
for idx in sorted_indices[::-1]:  # Reverse the order for descending importance
    feature_name = feature_names[idx]
    importance_value = result.importances_mean[idx]
    print(f"{feature_name}: {importance_value}")


is_recid: 0.45221729490022167
age: 0.002383592017738345
priors_count: 0.0010532150776052852
is_violent_recid: -5.543237250554833e-05
v_decile_score: -5.543237250554833e-05
sex_Female: 0.0
juv_fel_count: 0.0
decile_score: 0.0
juv_misd_count: 0.0
juv_other_count: 0.0
decile_score.1: 0.0
c_charge_degree_M: 0.0
c_charge_degree_F: 0.0
age_cat_25 - 45: 0.0
age_cat_Greater than 45: 0.0
age_cat_Less than 25: 0.0
race_African-American: 0.0
race_Asian: 0.0
race_Caucasian: 0.0
race_Hispanic: 0.0
race_Native American: 0.0
race_Other: 0.0
sex_Male: 0.0


In [45]:
# Confusion Matrices
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Extract y_test and y_pred for Caucasians and African-Americans (using informal terms for ease of coding)
y_test_white = y_test[X_test['race_Caucasian']==1]
y_pred_white = y_pred[X_test['race_Caucasian']==1]
y_test_black = y_test[X_test['race_African-American']==1]
y_pred_black = y_pred[X_test['race_African-American']==1]

# Set confusion matrices and scores
cm_white = confusion_matrix(y_test_white, y_pred_white)
accuracy_white = accuracy_score(y_test_white, y_pred_white)
precision_white = precision_score(y_test_white, y_pred_white)
recall_white = recall_score(y_test_white, y_pred_white)
f1_white = f1_score(y_test_white, y_pred_white)

cm_black = confusion_matrix(y_test_black, y_pred_black)
accuracy_black = accuracy_score(y_test_black, y_pred_black)
precision_black = precision_score(y_test_black, y_pred_black)
recall_black = recall_score(y_test_black, y_pred_black)
f1_black = f1_score(y_test_black, y_pred_black)

# Print cm's and scores
print('Metrics of model fit for data on Caucasians')
print(cm_white)
print("Accuracy: {:.4f}".format(accuracy_white))
print("Precision: {:.4f}".format(precision_white))
print("Recall: {:.4f}".format(recall_white))
print("F1 Score: {:.4f}".format(f1_white))

print('\n')
print('Metrics of model fit for data on African-Americans')
print(cm_black)
print("Accuracy: {:.4f}".format(accuracy_black))
print("Precision: {:.4f}".format(precision_black))
print("Recall: {:.4f}".format(recall_black))
print("F1 Score: {:.4f}".format(f1_black))



Metrics of model fit for data on Caucasians
[[354  15]
 [  1 231]]
Accuracy: 0.9734
Precision: 0.9390
Recall: 0.9957
F1 Score: 0.9665


Metrics of model fit for data on African-Americans
[[425  47]
 [  0 473]]
Accuracy: 0.9503
Precision: 0.9096
Recall: 1.0000
F1 Score: 0.9527


#### Results 
The redundant features:
is_recid, is_violent_recid 
decile_score, v_decile_score
Were all among the most important features of the model. The recidivism measures are closely related to the target feature and may act as a "cheat sheet" for the model, inflating its accuracy. The model may be overfit to decile score, which are themselves results of the scoring process we want to analyze. Given the relative importance of these features, it's hard to assign a level of "fairness" to how it treats the demographic features.

The use of an SVC model presented me with technical difficultes getting TensorFlow Fairness Indicators to run on a non-linear model.

Precision is slightly higher for Caucasians while recall is slightly higher (at 100%) for African-Americans. This indicates that the model is indeed better at avoiding false positives for Caucasians, and at avoiding false negatives for African-Americans.
Rephrased more directly, the model can be seen as biased to predicting recidivism among African Americans. However, given the very high model scores for both groups, this may not be a significant or meaningful finding.


# Improve the Model

Implement some/all of your suggestions to make Steve's model better.

In [23]:
# Suggestions for improvement:
#   Set 'is_recid' as target, drop 'two_year_recid' and 'is_violent_recid'
#   Create 'custody_duration' from 'custody_in' and 'custody_out' dates
#   Drop 'decile_score', 'decile_score.1', 'v_decile_score' to not have COMPAS score itself affect model
#   Drop age_cat, after dummy encoding drop 'c_charge_degree_F', 'sex_Female' 
#   Use Logistic Regression model to start, then perhpas other models more interperetable than SVC
#   Use TensorFlow Fairness Indicators to analyze model and make adjustments
#   Create second model using COMPAS score feature(s) and do same fairness analysis for comparison


In [77]:
# Import source data
import pandas as pd
url = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
source_df = pd.read_csv(url)


In [78]:
source_df.columns

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')

In [79]:
# model w/o COMPAS features
# Retain only demographic columns, drop COMPAS scores and 
demographic_df = source_df[['id', 'sex', 'age', 'race', 'juv_fel_count', 
       'juv_misd_count', 'juv_other_count', 'priors_count', 'c_charge_degree', 
       'is_recid']]

In [80]:
# Use 'id' as index
demographic_df = demographic_df.set_index('id')

In [81]:
# dummy encode categorical features 
dummy_cols = ['sex','race','c_charge_degree'] 

demographic_df = pd.get_dummies(demographic_df, columns=dummy_cols)
demographic_df.columns

Index(['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count',
       'priors_count', 'is_recid', 'sex_Female', 'sex_Male',
       'race_African-American', 'race_Asian', 'race_Caucasian',
       'race_Hispanic', 'race_Native American', 'race_Other',
       'c_charge_degree_F', 'c_charge_degree_M'],
      dtype='object')

In [82]:
# drop redundant dummy features
drop_dummies = ['sex_Female', 'c_charge_degree_F']
demographic_df = demographic_df.drop(columns=drop_dummies)

In [83]:
# Set 'is_recid' as target
X = demographic_df.drop('is_recid', axis=1)
y = demographic_df['is_recid']

In [84]:
# train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [85]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the Logistic Regression classifier
model = LogisticRegression(max_iter=1000) # increasing iterations suggested by python warning

# Train the classifier
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In [86]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print cm and scores
print('Model metrics')
print(cm)
print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1 Score: {:.4f}".format(f1))

Model metrics
[[479 428]
 [473 424]]
Accuracy: 0.5006
Precision: 0.4977
Recall: 0.4727
F1 Score: 0.4848


In [87]:
# Retrieve feature coeeficients for feature importance
# Access the coefficients and feature names
coefficients = model.coef_[0]
feature_names = X_train.columns

# Use argsort to sort coefficients by the absolute value
sorted_indices = np.argsort(np.abs(coefficients))

# Print feature names and their coefficients sorted by absolute value
for idx in sorted_indices[::-1]:  # Reverse the order for descending absolute importance
    feature_name = feature_names[idx]
    coefficient_value = coefficients[idx]
    print(f"{feature_name}: {coefficient_value:.3f}")

sex_Male: 0.352
juv_fel_count: 0.238
race_Native American: 0.219
race_African-American: 0.205
c_charge_degree_M: -0.200
priors_count: 0.162
juv_other_count: 0.145
race_Hispanic: -0.109
race_Caucasian: 0.077
race_Other: -0.045
age: -0.042
race_Asian: 0.036
juv_misd_count: 0.024


In [88]:
# Extract y_test and y_pred for Caucasians and African-Americans (using informal terms for ease of coding)
y_test_white = y_test[X_test['race_Caucasian']==1]
y_pred_white = y_pred[X_test['race_Caucasian']==1]
y_test_black = y_test[X_test['race_African-American']==1]
y_pred_black = y_pred[X_test['race_African-American']==1]

# Set confusion matrices and scores
cm_white = confusion_matrix(y_test_white, y_pred_white)
accuracy_white = accuracy_score(y_test_white, y_pred_white)
precision_white = precision_score(y_test_white, y_pred_white)
recall_white = recall_score(y_test_white, y_pred_white)
f1_white = f1_score(y_test_white, y_pred_white)

cm_black = confusion_matrix(y_test_black, y_pred_black)
accuracy_black = accuracy_score(y_test_black, y_pred_black)
precision_black = precision_score(y_test_black, y_pred_black)
recall_black = recall_score(y_test_black, y_pred_black)
f1_black = f1_score(y_test_black, y_pred_black)

# Print cm's and scores
print('Metrics of model fit for data on Caucasians')
print(cm_white)
print("Accuracy: {:.4f}".format(accuracy_white))
print("Precision: {:.4f}".format(precision_white))
print("Recall: {:.4f}".format(recall_white))
print("F1 Score: {:.4f}".format(f1_white))

print('\n')
print('Metrics of model fit for data on African-Americans')
print(cm_black)
print("Accuracy: {:.4f}".format(accuracy_black))
print("Precision: {:.4f}".format(precision_black))
print("Recall: {:.4f}".format(recall_black))
print("F1 Score: {:.4f}".format(f1_black))

Metrics of model fit for data on Caucasians
[[183 152]
 [138 135]]
Accuracy: 0.5230
Precision: 0.4704
Recall: 0.4945
F1 Score: 0.4821


Metrics of model fit for data on African-Americans
[[218 206]
 [286 249]]
Accuracy: 0.4870
Precision: 0.5473
Recall: 0.4654
F1 Score: 0.5030


## Repeating w/o COMPAS data

In [89]:
# Repeat with inclusion of COMPAS score data
compas_df = source_df[['id', 'sex', 'age', 'race', 'juv_fel_count', 
       'juv_misd_count', 'juv_other_count', 'priors_count', 'c_charge_degree', 
       'is_recid', 'decile_score']]

In [90]:
compas_df = compas_df.set_index('id')

In [91]:
# dummy encode categorical features 
dummy_cols = ['sex','race','c_charge_degree'] 

compas_df = pd.get_dummies(compas_df, columns=dummy_cols)

In [92]:
# drop redundant dummy features
drop_dummies = ['sex_Female', 'c_charge_degree_F']
compas_df = compas_df.drop(columns=drop_dummies)

In [93]:
# Set 'is_recid' as target
X = compas_df.drop('is_recid', axis=1)
y = compas_df['is_recid']

In [94]:
# train-test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [95]:
# Create the Logistic Regression classifier
model = LogisticRegression()

# Train the classifier
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [96]:
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print cm and scores
print('Model metrics for Logistic Regression w/ COMPAS score')
print(cm)
print("Accuracy: {:.4f}".format(accuracy))
print("Precision: {:.4f}".format(precision))
print("Recall: {:.4f}".format(recall))
print("F1 Score: {:.4f}".format(f1))

Model metrics for Logistic Regression w/ COMPAS score
[[521 444]
 [431 408]]
Accuracy: 0.5150
Precision: 0.4789
Recall: 0.4863
F1 Score: 0.4826


In [97]:
# Retrieve feature coeeficients for feature importance
# Access the coefficients and feature names
coefficients = model.coef_[0]
feature_names = X_train.columns

# Use argsort to sort coefficients by the absolute value
sorted_indices = np.argsort(np.abs(coefficients))

# Print feature names and their coefficients sorted by absolute value
for idx in sorted_indices[::-1]:  # Reverse the order for descending absolute importance
    feature_name = feature_names[idx]
    coefficient_value = coefficients[idx]
    print(f"{feature_name}: {coefficient_value:.3f}")

sex_Male: 0.360
juv_other_count: 0.241
race_Hispanic: -0.164
decile_score: 0.146
juv_fel_count: 0.144
race_Other: -0.131
priors_count: 0.116
c_charge_degree_M: -0.088
race_Native American: 0.072
race_Asian: -0.064
juv_misd_count: -0.042
race_African-American: 0.041
age: -0.032
race_Caucasian: -0.001


In [98]:
# Extract y_test and y_pred for Caucasians and African-Americans (using informal terms for ease of coding)
y_test_white = y_test[X_test['race_Caucasian']==1]
y_pred_white = y_pred[X_test['race_Caucasian']==1]
y_test_black = y_test[X_test['race_African-American']==1]
y_pred_black = y_pred[X_test['race_African-American']==1]

# Set confusion matrices and scores
cm_white = confusion_matrix(y_test_white, y_pred_white)
accuracy_white = accuracy_score(y_test_white, y_pred_white)
precision_white = precision_score(y_test_white, y_pred_white)
recall_white = recall_score(y_test_white, y_pred_white)
f1_white = f1_score(y_test_white, y_pred_white)

cm_black = confusion_matrix(y_test_black, y_pred_black)
accuracy_black = accuracy_score(y_test_black, y_pred_black)
precision_black = precision_score(y_test_black, y_pred_black)
recall_black = recall_score(y_test_black, y_pred_black)
f1_black = f1_score(y_test_black, y_pred_black)

# Print cm's and scores
print('Metrics of model fit for data on Caucasians')
print(cm_white)
print("Accuracy: {:.4f}".format(accuracy_white))
print("Precision: {:.4f}".format(precision_white))
print("Recall: {:.4f}".format(recall_white))
print("F1 Score: {:.4f}".format(f1_white))

print('\n')
print('Metrics of model fit for data on African-Americans')
print(cm_black)
print("Accuracy: {:.4f}".format(accuracy_black))
print("Precision: {:.4f}".format(precision_black))
print("Recall: {:.4f}".format(recall_black))
print("F1 Score: {:.4f}".format(f1_black))

Metrics of model fit for data on Caucasians
[[184 175]
 [132 133]]
Accuracy: 0.5080
Precision: 0.4318
Recall: 0.5019
F1 Score: 0.4642


Metrics of model fit for data on African-Americans
[[253 189]
 [240 225]]
Accuracy: 0.5270
Precision: 0.5435
Recall: 0.4839
F1 Score: 0.5119
