# NYPD Civilian Complaints


#### Shweta Kumar A15409222


## Summary of Findings


### Question

For my model, I have decided to predict the duration in days a case will take from when its received to when its closed as a regression model with an added column I shall create called "duration" that contains the days from when the complaint was received to when it was closed. I will use R^2 scores to evaluate my models and as the metric to improve my model based on.

### Baseline Model

For my Baseline Model, I will use the bare minimum and train my model with only the features in the table that do not have large unique values that could make the model messy. 

At the time of prediction, I will not have information on the complainant and officer after the case is closed so I only choose features known when the complaint is filed. 

For the model, there are 4 categorical variables and 3 numerical variables. There are no ordinal features I used for this baseline model. I implemented a Linear Regression model with a PCA in the categorical features pipeline to eliminate unecessary elements. 

* R squared = 0.14884089482053664

This is not a very high accuracy and I will definitely need to improve upon it in my Final Model. However, it was not negative which means atleast it follows the trend of the data. If it had been negative, it would indicate my model fit worse than a horizontal line. 

### Final Model

For thie final model, I decided to transform the "outcome_description" column using a FunctionTransformer that binned each outcome by severity. Any outcome resulting in arrest is labeled as Arrest, and ending in summons is labeled as Summons, and all others are Other. Then, I used OHE to create multiple more features to enhance my model. Aside from my engineered feature, the comoplainant's gender and age were features I decided to add on as well as my EDA from my previous project indicated trends with the complainant's demographic and how long it took to process the complaint. 

I experimented with DecisionTreeRegressor, KNeighborsRegressor, and RandomForestRegressor and found that RandomForestRegressor worked best with my model looking at my R^2 values. I then used GridSearch to figure which parramters would be best for my model and tuned my paramters. However, my default parameters actually ended up working better than those I got from the Search so I kept my default params. 

* R squared = 0.81928371847372

My R squared value indicates significant improvement in accuracy from my Baseline model so I was pleased. 

### Fairness Evaluation

I decided to use the subset of Male vs Female cops the complainant filed against to assess the fairness of my model. I set the significance level to 0.01 and performed a permuation test using r squared as my parity measure as I had done previously. I got a value above my significance level so I fail to reject my null that the model is fair towards cops of both genders. I used R^2 as my parity measure because that is the best measure of accuracy given my model is not a binary classification. 

## Code

In [1]:
#Import all necessary packages
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import datetime
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier

In [13]:
#load the dataset
allegations_fp = os.path.join('data', 'allegations.csv')
allegations = pd.read_csv(allegations_fp)
#Will not be using these columns for any sort oof analysis so I want to just drop them for now
allegations= allegations.drop(["unique_mos_id","first_name","last_name","shield_no"], axis = 1)

In [14]:
#Creating the "duration" column 
allegations = allegations.rename(columns={"month_received": "month", "year_received": "year"})
allegations['date_received'] = pd.to_datetime(allegations[['year', 'month']].assign(day=1))
allegations = allegations.drop(["month","year"], axis = 1)
allegations = allegations.rename(columns={"month_closed": "month", "year_closed": "year"})
allegations['date_closed'] = pd.to_datetime(allegations[['year', 'month']].assign(day=1))
allegations = allegations.drop(["month","year"], axis = 1) #drop month, year but have new datetime columns


allegations["duration"] = allegations["date_closed"] - allegations["date_received"] #calculate total days duration
allegations["duration"] = allegations["duration"] / np.timedelta64(1,'D')
allegations["duration"] = allegations["duration"].apply(lambda x: int(x))

allegations["allegation"] = allegations["allegation"].str.lower() #standardize allegations column

# BASELINE

In [6]:
X = allegations.drop("duration", axis =1) #drop the predictor
y = allegations['duration']
types = X.dtypes
catcols = ["rank_abbrev_incident","rank_incident","mos_ethnicity","mos_gender"] #bare minimum categorical variables
numcols = ["complainant_age_incident","precinct","mos_age_incident"]
cats = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
    ('pca', PCA(svd_solver='full', n_components=0.99))]) #elimnate unnecessary elements
    

ct = ColumnTransformer([
    ('catcols', cats, catcols),
    ('numcols', SimpleImputer(strategy='constant', fill_value=0), numcols) #impute null values in numerical columns
])

pl = Pipeline([('feats', ct), ('reg', LinearRegression())])
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25)
pl.fit(X_tr, y_tr)
pl.score(X_ts, y_ts)

0.14792343530663699

# FINAL MODEL

Using FunctionTransformer to bin outcomes by severity and OHE the results for my new feature

In [48]:
def outcome_bin(X):
    def outcome_helper (column):
        if type(column) == float: 
            return "Other"
        if "Arrest" in column: 
            return "Arrest"
        if "Summons - " in column: 
            return "Summons"
        if "Moving violation summons issued" in column: 
            return "Summons"
        if "Parking summons issued" in column: 
            return "Summons"
        return "Other"
    return pd.DataFrame(X["outcome_description"].apply(outcome_helper))

num_feat = ['outcome_description']
num_transformer = Pipeline([('func-trans', FunctionTransformer(func = outcome_bin)),
                            ('ohe', OneHotEncoder(handle_unknown='ignore',sparse=False))])

catcols = ["rank_abbrev_incident","rank_incident","mos_ethnicity"
            ,"complainant_ethnicity","mos_gender","complainant_gender"]

numcols = ["mos_age_incident", "complainant_age_incident","precinct"]

cats = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
    ('pca', PCA(svd_solver='full', n_components=0.99))]) #elimnate unnecessary elements


preproc = ColumnTransformer(transformers=[('numcols', SimpleImputer(strategy='constant', fill_value=0),numcols),
                                          ('num', num_transformer, num_feat),
                                          ('cat', cats, catcols)], remainder="drop")

pl = Pipeline(steps=[('preprocessor', preproc), ('clf', RandomForestRegressor())])

X = allegations.drop(['fado_type','allegation','contact_reason','date_closed','date_received','duration'], axis=1)
y = allegations['duration']

X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.25)
pl.fit(X_tr, y_tr)
preds = pl.predict(X_ts)
print(pl.score(X_ts, y_ts))

0.81928371847372


In [8]:
#dictionary of parameters to possible tune
parameters = {
    'clf__max_depth': [2,3,4,5,7,10,30,75,100,None], 'clf__max_features':["sqrt","auto","log2"],
    'clf__min_samples_split':[2,3,5,7,10,15,20],
    'clf__min_samples_leaf':[2,3,5,7,10,15,20]
}

In [9]:
#Use a GridSearch to determine best possible params for the RandomForestRegressor
X = allegations.drop("duration", axis = 1)
y = allegations['duration']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pl, parameters, cv = 5) #train with pipeline
#clf.fit(X_train, y_train)
#clf.best_params_

#GridSearch params produced lower accuracy than default params so I keep the default

# FAIRNESS EVALUATION

###### Subset = Males vs Female Cops

In [50]:
#Significance = 0.01
from sklearn import metrics
from sklearn.metrics import r2_score

results = X_ts
results["prediction"] = preds
results["duration"] = y_ts

obs = r2_score(y_ts, preds)

metrs = []
for _ in range(100):
    s = (
        results[['mos_gender', 'prediction', 'duration']]
        .assign(mos_gender=results.mos_gender.sample(frac=1.0, replace=False).reset_index(drop=True))
        .groupby('mos_gender')
        .apply(lambda x: r2_score(x.duration, x.prediction))
        .iloc[-1]
    )
    
    metrs.append(s)
print(pd.Series(metrs <= obs).mean())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results["prediction"] = preds
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  results["duration"] = y_ts


0.02
