## General Testing

We will first use the entire dataset to investigate how significant each feature is in predicting arrests. Then, we will check for:

- Predicting arrest from subject and officer information
- Predicting the subject's perceived race from the other features
- Predicting whether a frisk occurred from the other features


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
#load data
df = pd.read_csv('data/cleaned_df.csv')

#change frisk and arrest flags to 1 and 0
df.replace({'Frisk_Flag' : { 'Y' : 1, 'N' : 0}}, inplace=True)
df.replace({'Arrest_Flag' : { 'Y' : 1, 'N' : 0}}, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52147 entries, 0 to 52146
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject_Age_Group         52147 non-null  object
 1   Was_Weapon_Reported       52147 non-null  object
 2   Officer_Gender            52147 non-null  object
 3   Officer_Race              52147 non-null  object
 4   Subject_Perceived_Race    52147 non-null  object
 5   Subject_Perceived_Gender  52147 non-null  object
 6   Arrest_Flag               52147 non-null  int64 
 7   Frisk_Flag                52147 non-null  int64 
 8   Reported_Year             52147 non-null  int64 
 9   Officer_Age               52147 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 4.0+ MB


First, the non-numeric variables need to be converted into dummy (one-hot) columns that our machine learning models can interpret.

We drop the first dummy of each variable to reduce unnecessary correlation between features.
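
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values below are hypothetical and only the column names are borrowed from the dataset. Without dropping, the dummy columns of one variable always sum to 1, so any one of them is fully determined by the others; dropping the first category makes it the implicit baseline.

```python
import pandas as pd

# Toy frame with made-up values (assumption: not taken from cleaned_df.csv).
toy = pd.DataFrame({
    "Officer_Gender": ["M", "F", "M", "N"],
    "Frisk_Flag": [1, 0, 0, 1],
})

# One dummy column per category.
full = pd.get_dummies(toy)

# Same encoding, but the first category (alphabetically, "F") is dropped
# and becomes the baseline the remaining dummies are compared against.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Numeric columns such as `Frisk_Flag` pass through unchanged; only the object columns are expanded.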

In [14]:
dummies_df = pd.get_dummies(df, drop_first=True)
dummies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52147 entries, 0 to 52146
Data columns (total 28 columns):
 #   Column                                                            Non-Null Count  Dtype
---  ------                                                            --------------  -----
 0   Arrest_Flag                                                       52147 non-null  int64
 1   Frisk_Flag                                                        52147 non-null  int64
 2   Reported_Year                                                     52147 non-null  int64
 3   Officer_Age                                                       52147 non-null  int64
 4   Subject_Age_Group_1 - 17                                          52147 non-null  uint8
 5   Subject_Age_Group_18 - 25                                         52147 non-null  uint8
 6   Subject_Age_Group_26 - 35                                         52147 non-null  uint8
 7   Subject_Age_Group_36 - 45                        

In [24]:
dummies_df.corr()

Unnamed: 0,Arrest_Flag,Frisk_Flag,Reported_Year,Officer_Age,Subject_Age_Group_1 - 17,Subject_Age_Group_18 - 25,Subject_Age_Group_26 - 35,Subject_Age_Group_36 - 45,Subject_Age_Group_46 - 55,Subject_Age_Group_56 and Above,...,Officer_Race_White,Subject_Perceived_Race_Asian,Subject_Perceived_Race_Black or African American,Subject_Perceived_Race_Hispanic,Subject_Perceived_Race_Multi-Racial,Subject_Perceived_Race_Native Hawaiian or Other Pacific Islander,Subject_Perceived_Race_Unknown,Subject_Perceived_Race_White,Subject_Perceived_Gender_M,Subject_Perceived_Gender_N
Arrest_Flag,1.0,0.083303,0.333745,-0.061629,-0.03183,-0.023831,0.019372,0.030733,0.001695,0.007119,...,-0.024737,0.015239,0.031819,-0.05737,-0.039318,0.026506,-0.000626,-0.005529,0.030493,-0.030436
Frisk_Flag,0.083303,1.0,0.043615,-0.021776,0.020454,0.023774,-0.010136,-0.006735,-0.008817,-0.011356,...,-0.004341,0.017839,0.064391,0.012337,-0.001484,0.000668,0.009548,-0.075364,0.126071,-0.005494
Reported_Year,0.333745,0.043615,1.0,-0.095119,-0.075716,-0.072637,0.025403,0.049153,0.001864,0.032102,...,-0.051502,0.019566,-0.019224,-0.097209,-0.107039,0.051031,0.115803,0.004712,0.030129,-0.04087
Officer_Age,-0.061629,-0.021776,-0.095119,1.0,-0.001268,0.005021,-0.004736,-0.009035,0.001196,-0.013407,...,0.029189,-0.010184,-0.035162,0.001633,0.040815,-0.003476,-0.015346,0.035829,-0.010043,0.04428
Subject_Age_Group_1 - 17,-0.03183,0.020454,-0.075716,-0.001268,1.0,-0.100045,-0.143355,-0.106205,-0.077585,-0.047176,...,-0.002277,-0.001502,0.081709,0.021698,0.024787,-0.003788,8.8e-05,-0.084573,-0.014431,-0.015875
Subject_Age_Group_18 - 25,-0.023831,0.023774,-0.072637,0.005021,-0.100045,1.0,-0.351195,-0.260184,-0.19007,-0.115572,...,0.010442,0.011542,0.033574,0.030181,0.036705,0.015963,0.008451,-0.056863,-0.053583,-0.023813
Subject_Age_Group_26 - 35,0.019372,-0.010136,0.025403,-0.004736,-0.143355,-0.351195,1.0,-0.372819,-0.272352,-0.165604,...,0.007656,-0.015675,-0.041988,-0.001321,0.001832,-0.005615,-0.008882,0.053221,0.001211,-0.053484
Subject_Age_Group_36 - 45,0.030733,-0.006735,0.049153,-0.009035,-0.106205,-0.260184,-0.372819,1.0,-0.201773,-0.122688,...,-0.00974,0.006931,-0.035721,-0.002798,-0.020235,0.003427,-0.030184,0.045742,0.031479,-0.042656
Subject_Age_Group_46 - 55,0.001695,-0.008817,0.001864,0.001196,-0.077585,-0.19007,-0.272352,-0.201773,1.0,-0.089626,...,-0.002922,0.005703,0.007259,-0.014672,-0.013676,-0.010678,-0.035346,0.019584,0.04609,-0.033779
Subject_Age_Group_56 and Above,0.007119,-0.011356,0.032102,-0.013407,-0.047176,-0.115572,-0.165604,-0.122688,-0.089626,1.0,...,0.009056,0.00118,0.021588,-0.027633,-0.020587,-0.003622,-0.020782,0.006027,0.044488,-0.018259
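
The full correlation matrix is hard to scan. Since the question here is how each feature relates to arrests, one option is to pull out only the `Arrest_Flag` column and sort it by absolute value. Among the columns shown above, `Reported_Year` has the largest correlation with `Arrest_Flag` (about 0.33). The sketch below uses a synthetic frame with hypothetical values so that it runs on its own; with the real data the same chain would be applied to `dummies_df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dummies_df (assumption: values are random, not real).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "Arrest_Flag": rng.integers(0, 2, n),
    "Frisk_Flag": rng.integers(0, 2, n),
    "Officer_Age": rng.integers(21, 65, n),
})
# Make one feature deliberately informative about the target.
toy["Reported_Year"] = 2015 + 2 * toy["Arrest_Flag"] + rng.integers(0, 3, n)

# Correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["Arrest_Flag"]
    .drop("Arrest_Flag")                      # drop the trivial self-correlation
    .sort_values(key=abs, ascending=False)    # rank by magnitude, keep the sign
)
print(target_corr)
```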


The first split predicts `Arrest_Flag` from all other features.

In [29]:
#import necessary tools

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

In [30]:
X = dummies_df.drop('Arrest_Flag', axis=1)
y = dummies_df['Arrest_Flag']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
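
As a quick illustration of what `stratify=y` does in the split above, the following sketch uses synthetic labels with a hypothetical 10% positive rate. When `test_size` is not given, `train_test_split` holds out 25% of the rows, and stratifying keeps the positive rate roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 100 positives, 900 negatives (assumption, not real data).
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Sizes of the two splits (default test_size is 0.25).
print(X_tr.shape, X_te.shape)

# Share of positives in each split; stratification keeps these close.
print(y_tr.mean(), y_te.mean())
```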

In [35]:
# show resulting sizes of the train and test splits
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

In [36]:
# TODO: address class imbalance (possibly later)

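Pipeline:

The pipeline first standardizes the features with StandardScaler and then fits a RandomForestClassifier. Random forests do not require scaled inputs, but keeping the scaler as a pipeline step means the scaling is learned on the training data only and can be reused if a scale-sensitive model is swapped in later.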


In [32]:
scaled_pipeline_2 = Pipeline([('ss', StandardScaler()), 
                              ('RF', RandomForestClassifier(random_state=123))])

scaled_pipeline_2.fit(X_train, y_train)
scaled_pipeline_2.score(X_test, y_test)

0.8927667408146046
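
A fitted pipeline of this shape also gives one way to look at how much each feature contributes: the random forest step exposes `feature_importances_`. The sketch below is an assumption about how this could be done, not something the notebook already runs. It uses synthetic data with hypothetical column names; on the real data the same lookup would use `scaled_pipeline_2.named_steps['RF']` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("RF", RandomForestClassifier(n_estimators=50, random_state=123)),
])
pipe.fit(X_demo, y_demo)

# Impurity-based importances from the forest step, largest first.
importances = pd.Series(
    pipe.named_steps["RF"].feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

These importances sum to 1 and are relative to each other, so they rank features rather than measure statistical significance.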

Grid search over the random forest hyperparameters:

In [33]:
grid = [{'RF__max_depth': [4, 5, 6], 
         'RF__min_samples_split': [2, 5, 10], 
         'RF__min_samples_leaf': [1, 3, 5]}]

gridsearch = GridSearchCV(estimator=scaled_pipeline_2, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)

In [34]:
# Fit the training data
gridsearch.fit(X_train, y_train)

# Print the accuracy on test set
gridsearch.score(X_test, y_test)

KeyboardInterrupt: 
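
The search above was interrupted before finishing. The grid has 3 x 3 x 3 = 27 combinations, and with `cv=5` that is 135 fits plus a final refit, each training a forest with scikit-learn's default of 100 trees. One possible way to get a first result quickly is to shrink the grid, use fewer folds and fewer trees, and optionally parallelize with `n_jobs`. The sketch below shows that idea on synthetic data; the reduced grid, `n_estimators=20`, and `cv=3` are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("RF", RandomForestClassifier(n_estimators=20, random_state=123)),
])

# Reduced grid: 2 x 2 = 4 combinations instead of 27.
small_grid = {"RF__max_depth": [4, 6], "RF__min_samples_leaf": [1, 5]}

N_JOBS = 1  # assumption: set to -1 to use all available CPU cores

search = GridSearchCV(
    pipe, param_grid=small_grid, scoring="accuracy", cv=3, n_jobs=N_JOBS
)
search.fit(X_tr, y_tr)

print(search.best_params_)        # best combination found by cross-validation
print(search.best_score_)         # mean cross-validated accuracy of that combination
print(search.score(X_te, y_te))   # accuracy of the refit best model on the test set
```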

Next steps:

- Evaluation with additional scoring metrics
- SMOTE (synthetic data) for class imbalance
- Grid search on each model
- Feature selection?
- Confusion matrices?
- Ridge and lasso regularization?
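
As a starting point for the evaluation items above, here is a minimal sketch of a confusion matrix and per-class scores on synthetic, imbalanced data. Accuracy alone can look good when one class dominates, which is why precision and recall per class are worth checking. Note that this example swaps in `class_weight="balanced"` as a simple way to handle imbalance; it is not SMOTE, which would need the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (assumption, not the real arrest rate).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=123
)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_te, y_pred, zero_division=0))
```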