### Capstone Project: Reducing Crime in San Francisco, Part 2

by Elton Yeo, DSI13

#### Contents:
- [Preprocessing](#Preprocessing)
- [Modelling](#Modelling)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)
- [Limitations and Next Steps](#Limitations-and-Next-Steps)

In [232]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import geopandas
from shapely.geometry import Point
from pygeocoder import Geocoder

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import cross_val_score, KFold, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import metrics

from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

%matplotlib inline

### Preprocessing

#### Keeping preventable_crime only

In [235]:
final_reports=pd.read_csv('../data/reports_cleaned.csv')

In [236]:
final_reports=final_reports[['incident_date', 'incident_day_of_week', 
                       'incident_time', 'zip', 'incident_category']]

In [237]:
final_reports.head()

Unnamed: 0,incident_date,incident_day_of_week,incident_time,zip,incident_category
0,2019/05/01,Wednesday,01:00,94122,preventable_crime
1,2019/06/22,Saturday,07:45,94103,non_violent_crime
2,2019/05/27,Monday,02:25,94123,violent_crime
3,2018/11/07,Wednesday,03:50,94103,preventable_crime
4,2019/08/15,Thursday,12:45,94115,preventable_crime


In [238]:
final_reports.incident_category.value_counts()

preventable_crime     154868
non_violent_crime      75722
violent_crime          20316
white_collar_crime      9959
drug_crime              7648
Name: incident_category, dtype: int64

In [239]:
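#keep only preventable_crime rows; .copy() gives an independent DataFrame for the later in-place edits
final_reports=final_reports[final_reports.incident_category == 'preventable_crime'].copy()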

In [240]:
final_reports.incident_category.value_counts()

preventable_crime    154868
Name: incident_category, dtype: int64

In [241]:
final_reports.incident_category.replace({'preventable_crime': 1},
                                       inplace=True)

In [242]:
final_reports.rename({'incident_category': 'preventable_crime'}, axis=1, inplace=True)

In [243]:
final_reports.head()

Unnamed: 0,incident_date,incident_day_of_week,incident_time,zip,preventable_crime
0,2019/05/01,Wednesday,01:00,94122,1
3,2018/11/07,Wednesday,03:50,94103,1
4,2019/08/15,Thursday,12:45,94115,1
9,2019/02/27,Wednesday,15:30,94111,1
11,2019/04/08,Monday,00:30,94115,1


In [244]:
#convert incident_date to datetime, then keep only the year
final_reports.loc[:, 'incident_date'] = pd.to_datetime(final_reports['incident_date'])
final_reports.loc[:, 'incident_date'] = final_reports.incident_date.apply(lambda x: x.year)
#parse incident_time, then keep only the hour of day (0-23)
final_reports.loc[:, 'incident_hour'] = pd.to_datetime(final_reports['incident_time'])
final_reports.loc[:, 'incident_hour'] = final_reports.incident_hour.apply(lambda x: x.hour)

In [245]:
final_reports.head()

Unnamed: 0,incident_date,incident_day_of_week,incident_time,zip,preventable_crime,incident_hour
0,2019,Wednesday,01:00,94122,1,1
3,2018,Wednesday,03:50,94103,1,3
4,2019,Thursday,12:45,94115,1,12
9,2019,Wednesday,15:30,94111,1,15
11,2019,Monday,00:30,94115,1,0
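
The same year and hour extraction can also be written with the vectorised `.dt` accessor rather than `apply`. The following is a minimal sketch on a toy frame, not the notebook's own code: the three rows are copied from the preview above, and `toy` and `incident_year` are hypothetical names.

```python
import pandas as pd

# Toy rows copied from the preview above; `toy` is a hypothetical name.
toy = pd.DataFrame({
    "incident_date": ["2019/05/01", "2018/11/07", "2019/08/15"],
    "incident_time": ["01:00", "03:50", "12:45"],
})

# .dt.year and .dt.hour operate on the whole column at once.
toy["incident_year"] = pd.to_datetime(toy["incident_date"], format="%Y/%m/%d").dt.year
toy["incident_hour"] = pd.to_datetime(toy["incident_time"], format="%H:%M").dt.hour

print(toy[["incident_year", "incident_hour"]])
```

Passing an explicit `format` avoids relying on pandas to infer the date and time layout.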


#### Splitting into train and test datasets

We will use 2018 data as our training data, and 2019 data as our testing data. We will ignore all 2020 data. 

In [246]:
final_reports.incident_date.value_counts()

2018    71444
2019    70277
2020    13147
Name: incident_date, dtype: int64

In [247]:
#splitting into train and test datasets; .copy() avoids SettingWithCopyWarning
train=final_reports[final_reports.incident_date == 2018].copy()
test=final_reports[final_reports.incident_date == 2019].copy()

#dropping incident_date column from both datasets
train=train.drop('incident_date', axis=1)
test=test.drop('incident_date', axis=1)

In [248]:
train.shape

(71444, 5)

In [249]:
test.shape

(70277, 5)

In [250]:
#save data to csv
train.to_csv('../data/train.csv', index=False)
test.to_csv('../data/test.csv', index=False)

In [251]:
train.head()

Unnamed: 0,incident_day_of_week,incident_time,zip,preventable_crime,incident_hour
3,Wednesday,03:50,94103,1,3
18,Friday,09:30,94121,1,9
54,Friday,07:30,94103,1,7
66,Sunday,19:47,94118,1,19
71,Sunday,09:20,94107,1,9


In [252]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71444 entries, 3 to 268506
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   incident_day_of_week  71444 non-null  object
 1   incident_time         71444 non-null  object
 2   zip                   71444 non-null  int64 
 3   preventable_crime     71444 non-null  int64 
 4   incident_hour         71444 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 3.3+ MB


In [253]:
train.zip.astype('int64')

3         94103
18        94121
54        94103
66        94118
71        94107
          ...  
268431    94133
268468    94124
268488    94124
268504    94122
268506    94123
Name: zip, Length: 71444, dtype: int64

In [254]:
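#count preventable crimes per (day of week, zip, hour); selecting the target column keeps the text column incident_time out of the sum
train=train.groupby(['incident_day_of_week', 'zip', 'incident_hour'])['preventable_crime'].sum().reset_index()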

In [255]:
train.sort_values('preventable_crime', ascending=False).head()

Unnamed: 0,incident_day_of_week,zip,incident_hour,preventable_crime
2480,Thursday,94103,19,114
44,Friday,94103,20,108
1274,Saturday,94103,23,107
43,Friday,94103,19,107
3701,Wednesday,94103,19,93


In [256]:
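#count preventable crimes per (day of week, zip, hour); selecting the target column keeps the text column incident_time out of the sum
test=test.groupby(['incident_day_of_week', 'zip', 'incident_hour'])['preventable_crime'].sum().reset_index()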

In [257]:
test.sort_values('preventable_crime', ascending=False).head()

Unnamed: 0,incident_day_of_week,zip,incident_hour,preventable_crime
3078,Tuesday,94103,18,92
1967,Sunday,94109,0,89
42,Friday,94103,18,88
12,Friday,94102,12,87
3669,Wednesday,94102,18,86


In [258]:
#getting dummies from categorical data so that they can be processed by models
train = pd.get_dummies(train, columns = ['incident_day_of_week', 'zip', 'incident_hour'], drop_first=True)

#confirming that dummies have been created by checking the number of columns
train.shape

(4268, 56)

In [259]:
#getting dummies from categorical data so that they can be processed by models
test = pd.get_dummies(test, columns = ['incident_day_of_week', 'zip', 'incident_hour'], drop_first=True)

#confirming that dummies have been created by checking the number of columns
test.shape

(4262, 56)
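
Both frames end up with 56 columns here, but calling `get_dummies` separately on each frame only guarantees matching columns when both contain the same categories. One way to guard against a mismatch is to reindex the test columns against the train columns. The sketch below uses hypothetical toy frames (`train_toy`, `test_toy`) and omits `drop_first` for brevity; it is an illustration, not a step from the notebook.

```python
import pandas as pd

# Hypothetical toy frames: the test set lacks zip 94122 and has an unseen zip 94999.
train_toy = pd.DataFrame({"zip": [94103, 94122, 94103], "preventable_crime": [5, 2, 7]})
test_toy = pd.DataFrame({"zip": [94103, 94999], "preventable_crime": [4, 1]})

train_d = pd.get_dummies(train_toy, columns=["zip"])
test_d = pd.get_dummies(test_toy, columns=["zip"])

# Reindex the test columns to match train: missing dummies are filled with 0,
# and dummies for categories unseen in training are dropped.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```

After the reindex, a scaler or model fitted on the train columns can be applied to the test frame without a column mismatch.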

### Modelling

#### Creating our train/test split and scaling

In [260]:
#create our features matrix X and target vector y
X = train.drop(['preventable_crime'], axis=1)
y = train['preventable_crime']

#train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#scaling
ss = StandardScaler()
ss.fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test) 
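
The cross-validation calls in the following sections reuse `X_train_sc`, which was scaled with statistics from all of `X_train`, so every validation fold has already contributed to the scaler. A common alternative is to put the scaler and the model in a pipeline so that scaling is refit inside each fold. The sketch below illustrates this on synthetic data from `make_regression`, which is an assumption standing in for the crime features; it is not the notebook's method.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the notebook would use X_train, y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)

# The scaler is refit on the training part of every fold, so no fold's
# validation rows contribute to the scaling statistics.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)  # default scoring for regressors is R^2

print(scores.mean())
```

With one-hot features like the ones in this notebook the difference is likely to be small, but the pipeline form keeps the cross-validated score free of this leakage.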

#### Baseline MSE score

In [261]:
base_df= pd.DataFrame(y_test)
base_df['baseline']= np.mean(y_test)
base_df

Unnamed: 0,preventable_crime,baseline
1702,1,16.863168
1173,23,16.863168
308,12,16.863168
1322,13,16.863168
2570,21,16.863168
...,...,...
2698,10,16.863168
1672,32,16.863168
4075,9,16.863168
604,3,16.863168


In [262]:
mean_squared_error(y_true=base_df.preventable_crime, y_pred=base_df.baseline)

247.07780927176287
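
The baseline above predicts the mean of `y_test` itself. A slightly stricter variant predicts the mean of the training target, which is what scikit-learn's `DummyRegressor(strategy="mean")` does. The sketch below uses hypothetical toy targets (`y_train_toy`, `y_test_toy`) rather than the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy targets; the notebook would use y_train and y_test.
y_train_toy = np.array([1, 23, 12, 13, 21])
y_test_toy = np.array([10, 32, 9, 3])

# DummyRegressor ignores the features, so a column of zeros is enough.
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
baseline_pred = dummy.predict(np.zeros((len(y_test_toy), 1)))

baseline_mse = mean_squared_error(y_test_toy, baseline_pred)
print(baseline_pred[0], baseline_mse)
```

Here the training mean is 14, so every test row is predicted as 14 and the baseline MSE is computed against that constant.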

#### Linear Regression

In [263]:
#instantiate our model
lr = LinearRegression()

#finding r-squared scores for linear regression
lr_score = cross_val_score(lr, X_train_sc, y_train, cv=10)
lr_score.mean()

0.7630160928049994

In [264]:
lr.fit(X_train_sc, y_train)

y_pred_lr=lr.predict(X_test_sc)

mean_squared_error(y_test, y_pred_lr)

65.85420132167526

#### LassoCV

In [265]:
#instantiate our model and find optimal alpha
lasso = LassoCV(n_alphas=500)

#fitting to lasso
lasso.fit(X_train_sc, y_train)

#input optimal alpha
lasso_opt = Lasso(lasso.alpha_)

#finding r-squared scores for lasso
lasso_score = cross_val_score(lasso_opt, X_train_sc, y_train)
lasso_score.mean()



0.7647805078966549

In [266]:
lasso_opt.fit(X_train_sc, y_train)

y_pred_lasso=lasso_opt.predict(X_test_sc)

mean_squared_error(y_test, y_pred_lasso)

65.71666876835954

#### RidgeCV

In [267]:
#instantiate our model and find optimal alpha
ridge_alphas=np.logspace(0, 5, 200)
ridge = RidgeCV(alphas=ridge_alphas)

#fitting to ridge
ridge.fit(X_train_sc, y_train)

#input optimal alpha
ridge_opt= Ridge(alpha=ridge.alpha_)

#finding r-squared scores for ridge
ridge_score = cross_val_score(ridge_opt, X_train_sc, y_train)
ridge_score.mean()



0.7647855711923855

In [268]:
ridge_opt.fit(X_train_sc, y_train)

y_pred_ridge=ridge_opt.predict(X_test_sc)

mean_squared_error(y_test, y_pred_ridge)

65.80920705863697

#### Random Forest with GridSearchCV

In [269]:
#determining range of hyperparameters
rf_parameters={
        'max_depth': np.arange(25, 30),
        'n_estimators': np.arange(15, 20),
        'min_samples_split': [2,4,6,8,10]
}

#instantiate our model within GridSearchCV
rf=GridSearchCV(RandomForestRegressor(), 
                      rf_parameters,
                      verbose=1)

#fitting our data to our model
rf.fit(X_train_sc, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 125 candidates, totalling 375 fits


[Parallel(n_jobs=1)]: Done 375 out of 375 | elapsed:   51.5s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'max_depth': array([25, 26

In [270]:
#best mean cross-validated r2 score from GridSearchCV
rf.best_score_

0.7377464948132548

In [271]:
#using best hyperparameters to get r2 score on X_test_sc and y_test
best_rf=rf.best_estimator_
best_rf.score(X_test_sc, y_test)

0.7970670421665086

In [272]:
y_pred_rf=best_rf.predict(X_test_sc)

mean_squared_error(y_test, y_pred_rf)

50.14023065053809

#### XGBoost with GridSearchCV

In [273]:
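#determining range of hyperparameters
xgb_parameters={
        'max_depth': [1,3,5,7],
        'n_estimators': [500, 550, 600], 
        'learning_rate': [0.2, 0.4, 0.6, 0.8]
}

#instantiate our model within GridSearchCV
xgb=GridSearchCV(XGBRegressor(), 
                      xgb_parameters,
                      verbose=1)

#fitting our data to our model
xgb.fit(X_train_sc, y_train)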

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=1)]: Done 144 out of 144 | elapsed:  4.4min finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_cons...
                                    objective='reg:squarederror',
                                    random_state=None, reg_alpha=None,
                                    reg_lambda=None, scale_pos_weight=None,
                                    subsample=None, tree_method=None,
                                    validate_p

In [274]:
xgb.best_score_

0.8298653572877429

In [275]:
best_xgb=xgb.best_estimator_
best_xgb.score(X_test_sc, y_test)

0.8451533929561439

In [276]:
y_pred_xgb=best_xgb.predict(X_test_sc)

mean_squared_error(y_test, y_pred_xgb)

38.259160441561505
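
To compare the models side by side, the test-split MSE values reported in the cells above can be collected into one table. The sketch below copies those values rounded to two decimals; the `results` frame and its column names are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Test-split MSE values copied from the cells above (rounded to two decimals).
results = pd.DataFrame({
    "model": ["Baseline (mean)", "Linear Regression", "Lasso", "Ridge",
              "Random Forest", "XGBoost"],
    "mse": [247.08, 65.85, 65.72, 65.81, 50.14, 38.26],
})

# RMSE is in the same unit as the target (crime counts per group).
results["rmse"] = results["mse"] ** 0.5
# Share of the baseline error removed by each model.
results["reduction_vs_baseline"] = 1 - results["mse"] / results.loc[0, "mse"]

print(results.sort_values("mse").round(3))
```

For XGBoost the reduction is about 0.845, which agrees with the test R² of 0.8452 reported above. This is expected, because the baseline predicts the mean of `y_test`.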

#### Final model prediction and evaluation

In [277]:
X_valid = test.drop(['preventable_crime'], axis=1)
y_valid = test['preventable_crime']

#scaling the 2019 hold-out set with the scaler fitted on X_train
X_valid_sc = ss.transform(X_valid)

In [278]:
best_xgb.score(X_valid_sc, y_valid)

0.8022793805526838

### Conclusion and Recommendations

### Limitations and Next Steps

Next step: understand which variables would result in the lowest number of crimes in an area.