## 2023 New Orleans 911 Safety and Crime Calls - Preprocessing and Modeling

### 1. Feature Engineering: 

Apply one-hot encoding to the categorical features.

### 2. Predictive Modeling:
 
**Response Time Prediction**: Develop a model that classifies the initial-call-to-dispatch time as high risk or low risk (`DispatchTimeCategory`) based on the following features:

Type                     object  
TypeText                 object  
Priority                 object  
InitialType              object  
InitialTypeText          object  
InitialPriority         float64  
DispositionText          object  
SelfInitiated            object  
Beat                     object  
Zip                      object  
PoliceDistrict           object  
IncidentCategory         object  
TimeCreate_hour           int32  
TimeCreate_day            int32  
TimeCreate_month          int32  
TimeDispatch_hour         int32  
TimeDispatch_day          int32  
TimeDispatch_month        int32  
TimeArrive_hour           int32  
TimeArrive_day            int32   
TimeArrive_month          int32  
TimeClosed_hour           int32  
TimeClosed_day            int32  
TimeClosed_month          int32  


**Approach**:  
Why a Gradient Boosting Classifier (GBC) might be well suited:

**Handling Complex Relationships**: GBCs can capture complex, non-linear relationships between the features and the target variable. This matters because 911 call data often involves factors that may not have a simple linear effect on the dispatch time category (e.g., type of incident, location, time of day).  
**Robust to Noise and Outliers**: GBCs are generally tolerant of noisy data and outliers in the input features. This dataset, even after cleaning, may still contain data entry errors, and the data cleaning notebook confirmed that it contains outliers.  
**Feature Importance**: GBCs provide feature importance scores, which can help identify the key factors driving dispatch time categorization.

1. Split data into training and testing sets.  
2. Train GBC model.  
3. Use cross-validation to ensure the model generalizes well to unseen data.  
4. Fine-tune hyperparameters using random search for computational efficiency.
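
The four steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's actual run: `make_classification` stands in for the 911 data, and the small search space and iteration counts are chosen only to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split

# Synthetic stand-in for the 911 data (assumption: not the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# 1. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Train a GBC model
gbc = GradientBoostingClassifier(n_estimators=50, random_state=42)
gbc.fit(X_train, y_train)

# 3. Cross-validate on the training set only
cv_scores = cross_val_score(gbc, X_train, y_train, cv=5, scoring="f1_macro")

# 4. Randomized search over a small, illustrative parameter space
search = RandomizedSearchCV(
    gbc,
    param_distributions={"n_estimators": randint(20, 60), "max_depth": randint(2, 5)},
    n_iter=5,
    cv=3,
    scoring="f1_macro",
    random_state=42,
)
search.fit(X_train, y_train)

# Score the tuned model once on the held-out test set
test_score = search.score(X_test, y_test)
print(cv_scores.mean(), search.best_params_, test_score)
```

Keeping the test set out of both cross-validation and the search means the final score is measured on data the tuning never saw.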






In [28]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import OneHotEncoder  # For encoding categorical features as a one-hot numeric array
from sklearn.compose import ColumnTransformer  # For applying different preprocessing steps to different columns
from sklearn.pipeline import Pipeline  # For creating a machine learning pipeline
from sklearn.ensemble import GradientBoostingClassifier  # For training a Gradient Boosting classifier
from sklearn.metrics import classification_report  # For evaluating the performance of the classification model (in this case, GBC)
from sklearn.model_selection import RandomizedSearchCV  # For performing hyperparameter tuning using randomized search
from scipy.stats import randint as sp_randint  # For defining distributions for randomized search

In [29]:
cleaned = pd.read_csv("CapstoneDataCleaned.csv")

In [30]:
cleaned.head(5)

Unnamed: 0,Type,TypeText,Priority,InitialType,InitialTypeText,InitialPriority,TimeCreate,TimeDispatch,TimeArrive,TimeClosed,...,Beat,Zip,PoliceDistrict,Latitude,Longitude,IncidentCategory,InitialCalltoDispatchTime,DispatchToArriveTime,ArrivaltoClose,ActualResponseTime
0,SEXOFF,SEX OFFENSE: GENERAL/MISC,2,ASLT,SIMPLE ASSAULT,1,2023-01-06 16:55:53,2023-01-06 17:02:01,2023-01-06 18:28:28,2023-01-06 19:15:31,...,2A02,70118,2,-90.127641,29.917747,Assault and Violence,0.102222,1.440833,0.784167,1.543056
1,TRESP,TRESPASSING,2,TRESP,TRESPASSING,2,2023-02-03 17:30:36,2023-02-03 18:09:35,2023-02-03 18:16:17,2023-02-03 18:27:44,...,5C02,70117,5,-90.04785,29.968767,Property Safety,0.649722,0.111667,0.190833,0.761389
2,WELFARE,WELFARE CHECK,1,WELFARE,WELFARE CHECK,1,2023-02-03 20:21:39,2023-02-04 09:37:33,2023-02-05 04:31:36,2023-02-05 05:35:19,...,7J04,70128,7,-89.95505,30.022642,Medical and Mental Health,13.265,18.900833,1.061944,32.165833
3,THEFT,THEFT,1,THEFT,THEFT,0,2023-02-03 23:48:51,2023-02-05 09:01:59,2023-02-05 09:31:31,2023-02-05 09:36:10,...,7O10,70127,7,-89.98827,30.034284,Theft and Burglary,33.218889,0.492222,0.0775,33.711111
4,PRIS,PRISONER TRANSPORT,1,PRIS,PRISONER TRANSPORT,1,2023-01-07 06:08:31,2023-01-08 01:02:30,2023-01-08 01:23:23,2023-01-08 02:56:18,...,4H02,70114,4,-90.031897,29.949725,Miscellaneous,18.899722,0.348056,1.548611,19.247778


In [31]:
cleaned.columns

Index(['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText',
       'InitialPriority', 'TimeCreate', 'TimeDispatch', 'TimeArrive',
       'TimeClosed', 'DispositionText', 'SelfInitiated', 'Beat', 'Zip',
       'PoliceDistrict', 'Latitude', 'Longitude', 'IncidentCategory',
       'InitialCalltoDispatchTime', 'DispatchToArriveTime', 'ArrivaltoClose',
       'ActualResponseTime'],
      dtype='object')

## Define the threshold that splits the target variable into low risk and high risk of an above-median dispatch time
Here, we use the median instead of the mean, because the heavy presence of outliers pulls the mean well above an acceptable dispatch time for emergency response.  

For reference, here are other major cities' mean 911 response times for 2023:   

**Chicago**: 3.46 minutes  
**Los Angeles**: 5.7 minutes  
**Seattle**: 7 minutes  
**Dallas**: 8 minutes   
**Miami**: 8 minutes  
**New York City**: 9.1 minutes  
**Atlanta**: 9.5 minutes  
**Houston**: 10 minutes  
**Detroit**: 12 minutes  
**Denver**: 13 minutes  

**New Orleans**:   
Mean initial call to dispatch time: ~2.088 hours -> 125.29 minutes  
Mean initial call to responder arrival time: ~2.363 hours -> 141.77 minutes

The median initial-call-to-dispatch time may be a better threshold for flagging unusually high dispatch times here, because it is not pulled upward by the outlier response cases reported in 2023. Splitting at the median also yields roughly balanced classes (about 50% / 50%).
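
To illustrate why the median gives a balanced split while the mean does not, here is a small sketch on simulated right-skewed times. The lognormal distribution and its parameters are assumptions chosen only to mimic a long tail of slow dispatches; they are not fitted to the New Orleans data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed dispatch times in hours (lognormal is an assumption)
times = rng.lognormal(mean=-1.0, sigma=1.5, size=10_000)

mean_time = times.mean()
median_time = np.median(times)

# Share of calls that would be labeled "high risk" under each threshold
share_above_mean = (times >= mean_time).mean()
share_above_median = (times >= median_time).mean()

print(f"mean={mean_time:.3f} h, median={median_time:.3f} h")
print(f"share >= mean: {share_above_mean:.2%}, share >= median: {share_above_median:.2%}")
```

With a long right tail, the mean sits well above the median, so a mean threshold labels only a minority of calls as high risk, while the median splits them roughly in half.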

- median ->
- if we leave out






In [32]:
initialcalltodispatch_mean = cleaned['InitialCalltoDispatchTime'].mean()
print(initialcalltodispatch_mean)

2.0881001094625256


In [33]:
initialcalltoarrival_mean = cleaned['ActualResponseTime'].mean()
print(initialcalltoarrival_mean)

2.3627500044316814


In [34]:

initialcalltodispatch_median = cleaned['InitialCalltoDispatchTime'].median()
print(initialcalltodispatch_median)

0.355


In [35]:
# Function to categorize dispatch times relative to the median computed above
def categorize_response_time(rt):
    if rt < initialcalltodispatch_median:
        return 'neg_label'
    else:
        return 'pos_label'

# pos_label indicates a dispatch time at or above the median


# Apply categorization to create a new column
cleaned['DispatchTimeCategory'] = cleaned['InitialCalltoDispatchTime'].apply(categorize_response_time)
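
A vectorized alternative to the row-wise `apply` is sketched below using `np.where`. The five dispatch times are the values shown in the `head()` output and the threshold is the median printed earlier; the names `toy` and `threshold` are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

# Dispatch times (hours) taken from the first five rows shown above
toy = pd.DataFrame(
    {"InitialCalltoDispatchTime": [0.102222, 0.649722, 13.265, 33.218889, 18.899722]}
)
threshold = 0.355  # median initial-call-to-dispatch time printed earlier

# Label every row in one vectorized operation instead of a Python-level loop
toy["DispatchTimeCategory"] = np.where(
    toy["InitialCalltoDispatchTime"] < threshold, "neg_label", "pos_label"
)
labels = toy["DispatchTimeCategory"].tolist()
print(labels)
```

On a dataset of this size, the vectorized form typically runs faster than `apply` and produces the same labels.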

In [36]:
cleaned.columns

Index(['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText',
       'InitialPriority', 'TimeCreate', 'TimeDispatch', 'TimeArrive',
       'TimeClosed', 'DispositionText', 'SelfInitiated', 'Beat', 'Zip',
       'PoliceDistrict', 'Latitude', 'Longitude', 'IncidentCategory',
       'InitialCalltoDispatchTime', 'DispatchToArriveTime', 'ArrivaltoClose',
       'ActualResponseTime', 'DispatchTimeCategory'],
      dtype='object')

In [37]:
cleaned.head()

Unnamed: 0,Type,TypeText,Priority,InitialType,InitialTypeText,InitialPriority,TimeCreate,TimeDispatch,TimeArrive,TimeClosed,...,Zip,PoliceDistrict,Latitude,Longitude,IncidentCategory,InitialCalltoDispatchTime,DispatchToArriveTime,ArrivaltoClose,ActualResponseTime,DispatchTimeCategory
0,SEXOFF,SEX OFFENSE: GENERAL/MISC,2,ASLT,SIMPLE ASSAULT,1,2023-01-06 16:55:53,2023-01-06 17:02:01,2023-01-06 18:28:28,2023-01-06 19:15:31,...,70118,2,-90.127641,29.917747,Assault and Violence,0.102222,1.440833,0.784167,1.543056,neg_label
1,TRESP,TRESPASSING,2,TRESP,TRESPASSING,2,2023-02-03 17:30:36,2023-02-03 18:09:35,2023-02-03 18:16:17,2023-02-03 18:27:44,...,70117,5,-90.04785,29.968767,Property Safety,0.649722,0.111667,0.190833,0.761389,pos_label
2,WELFARE,WELFARE CHECK,1,WELFARE,WELFARE CHECK,1,2023-02-03 20:21:39,2023-02-04 09:37:33,2023-02-05 04:31:36,2023-02-05 05:35:19,...,70128,7,-89.95505,30.022642,Medical and Mental Health,13.265,18.900833,1.061944,32.165833,pos_label
3,THEFT,THEFT,1,THEFT,THEFT,0,2023-02-03 23:48:51,2023-02-05 09:01:59,2023-02-05 09:31:31,2023-02-05 09:36:10,...,70127,7,-89.98827,30.034284,Theft and Burglary,33.218889,0.492222,0.0775,33.711111,pos_label
4,PRIS,PRISONER TRANSPORT,1,PRIS,PRISONER TRANSPORT,1,2023-01-07 06:08:31,2023-01-08 01:02:30,2023-01-08 01:23:23,2023-01-08 02:56:18,...,70114,4,-90.031897,29.949725,Miscellaneous,18.899722,0.348056,1.548611,19.247778,pos_label


In [38]:
cleaned2 = cleaned.drop(columns=['Latitude', 'Longitude', 'InitialCalltoDispatchTime', 'DispatchToArriveTime', 'ArrivaltoClose', 'ActualResponseTime'])

In [39]:
cleaned2['Zip'] = cleaned['Zip'].astype('object')

In [40]:
cleaned2['PoliceDistrict'] = cleaned['PoliceDistrict'].astype('object')

In [41]:
cleaned2.dtypes

Type                    object
TypeText                object
Priority                object
InitialType             object
InitialTypeText         object
InitialPriority         object
TimeCreate              object
TimeDispatch            object
TimeArrive              object
TimeClosed              object
DispositionText         object
SelfInitiated           object
Beat                    object
Zip                     object
PoliceDistrict          object
IncidentCategory        object
DispatchTimeCategory    object
dtype: object

In [42]:
cleaned2.head()

Unnamed: 0,Type,TypeText,Priority,InitialType,InitialTypeText,InitialPriority,TimeCreate,TimeDispatch,TimeArrive,TimeClosed,DispositionText,SelfInitiated,Beat,Zip,PoliceDistrict,IncidentCategory,DispatchTimeCategory
0,SEXOFF,SEX OFFENSE: GENERAL/MISC,2,ASLT,SIMPLE ASSAULT,1,2023-01-06 16:55:53,2023-01-06 17:02:01,2023-01-06 18:28:28,2023-01-06 19:15:31,REPORT TO FOLLOW,N,2A02,70118,2,Assault and Violence,neg_label
1,TRESP,TRESPASSING,2,TRESP,TRESPASSING,2,2023-02-03 17:30:36,2023-02-03 18:09:35,2023-02-03 18:16:17,2023-02-03 18:27:44,GONE ON ARRIVAL,N,5C02,70117,5,Property Safety,pos_label
2,WELFARE,WELFARE CHECK,1,WELFARE,WELFARE CHECK,1,2023-02-03 20:21:39,2023-02-04 09:37:33,2023-02-05 04:31:36,2023-02-05 05:35:19,GONE ON ARRIVAL,N,7J04,70128,7,Medical and Mental Health,pos_label
3,THEFT,THEFT,1,THEFT,THEFT,0,2023-02-03 23:48:51,2023-02-05 09:01:59,2023-02-05 09:31:31,2023-02-05 09:36:10,GONE ON ARRIVAL,N,7O10,70127,7,Theft and Burglary,pos_label
4,PRIS,PRISONER TRANSPORT,1,PRIS,PRISONER TRANSPORT,1,2023-01-07 06:08:31,2023-01-08 01:02:30,2023-01-08 01:23:23,2023-01-08 02:56:18,Necessary Action Taken,N,4H02,70114,4,Miscellaneous,pos_label


In [43]:
cleaned2.dtypes

Type                    object
TypeText                object
Priority                object
InitialType             object
InitialTypeText         object
InitialPriority         object
TimeCreate              object
TimeDispatch            object
TimeArrive              object
TimeClosed              object
DispositionText         object
SelfInitiated           object
Beat                    object
Zip                     object
PoliceDistrict          object
IncidentCategory        object
DispatchTimeCategory    object
dtype: object

In [44]:
cleaned2.columns

Index(['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText',
       'InitialPriority', 'TimeCreate', 'TimeDispatch', 'TimeArrive',
       'TimeClosed', 'DispositionText', 'SelfInitiated', 'Beat', 'Zip',
       'PoliceDistrict', 'IncidentCategory', 'DispatchTimeCategory'],
      dtype='object')

In [45]:
cleaned2['InitialPriority'] = pd.to_numeric(cleaned2['InitialPriority'], errors='coerce')
# cleaned2 = cleaned2.drop(columns=['TimeCreate', 'TimeDispatch', 'TimeArrive',
#        'TimeClosed'])

Extract hour, day, and month features from the datetime columns before splitting the data into train/test sets.

In [46]:
# Convert datetime columns to datetime type
datetime_columns = ['TimeCreate', 'TimeDispatch', 'TimeArrive', 'TimeClosed']
for col in datetime_columns:
    cleaned2[col] = pd.to_datetime(cleaned2[col])

# Extracting features from datetime columns
for col in datetime_columns:
    cleaned2[f'{col}_hour'] = cleaned2[col].dt.hour
    cleaned2[f'{col}_day'] = cleaned2[col].dt.day
    cleaned2[f'{col}_month'] = cleaned2[col].dt.month
    cleaned2.drop(col, axis=1, inplace=True)



In [47]:
cleaned2.dtypes

Type                     object
TypeText                 object
Priority                 object
InitialType              object
InitialTypeText          object
InitialPriority         float64
DispositionText          object
SelfInitiated            object
Beat                     object
Zip                      object
PoliceDistrict           object
IncidentCategory         object
DispatchTimeCategory     object
TimeCreate_hour           int32
TimeCreate_day            int32
TimeCreate_month          int32
TimeDispatch_hour         int32
TimeDispatch_day          int32
TimeDispatch_month        int32
TimeArrive_hour           int32
TimeArrive_day            int32
TimeArrive_month          int32
TimeClosed_hour           int32
TimeClosed_day            int32
TimeClosed_month          int32
dtype: object

In [48]:
# Splitting the data into features and target
X = cleaned2.drop('DispatchTimeCategory', axis=1)
y = cleaned2['DispatchTimeCategory']

# Defining the column transformer
categorical_features = ['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText', 
                        'InitialPriority', 'DispositionText', 'SelfInitiated', 'Beat', 
                        'Zip', 'PoliceDistrict', 'IncidentCategory']
numerical_features = [col for col in X.columns if col not in categorical_features]

# categorical_features: Defines a list of categorical columns that will be one-hot encoded.
# numerical_features: Defines a list of numerical columns that will be passed through without transformation.

# preprocessor: Creates a ColumnTransformer object to apply different transformations to different column types.
# ('num', 'passthrough', numerical_features): Applies the passthrough transformer (no transformation) to the numerical features.
# ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features): Applies OneHotEncoder with handle_unknown='ignore' to 
# the categorical features. This converts categorical values into numerical columns, handling unseen categories by ignoring them.

preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])


# Creating the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42))
])

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter Tuning

# Define the parameter space to search
param_dist = {
    'classifier__n_estimators': sp_randint(50, 200),
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'classifier__max_depth': sp_randint(2, 6),
    'classifier__min_samples_split': sp_randint(2, 11),
    'classifier__min_samples_leaf': sp_randint(1, 5),
}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(
    estimator=pipeline, 
    param_distributions=param_dist,
    n_iter=100,  # Number of parameter settings to try
    cv=5,      # Number of folds for cross-validation
    scoring='f1_macro',  # Macro-averaged F1; other options include 'accuracy', 'precision', 'recall'
    random_state=42
)

# Fit the random search to the training data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print("Best parameters found:", best_params)

# Get the best estimator (the trained model with the best parameters)
best_model = random_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))



Best parameters found: {'classifier__learning_rate': 0.2, 'classifier__max_depth': 5, 'classifier__min_samples_leaf': 2, 'classifier__min_samples_split': 7, 'classifier__n_estimators': 179}
              precision    recall  f1-score   support

   neg_label       0.88      0.93      0.91     12570
   pos_label       0.93      0.88      0.90     12502

    accuracy                           0.90     25072
   macro avg       0.90      0.90      0.90     25072
weighted avg       0.90      0.90      0.90     25072



In [27]:
# Feature importance scores: the fitted classifier is the final step of the pipeline,
# so read feature_importances_ from that step rather than from the Pipeline itself
importance = best_model.named_steps['classifier'].feature_importances_