<a href="https://colab.research.google.com/github/somermerriman91/2023-New-Orleans-Crime-Model/blob/main/Copy_of_2023CrimeData_Preprocessing_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 2023 New Orleans 911 Safety and Crime Calls - Preprocessing and Modeling

### 1. Feature Engineering:

#### 1a. Define threshold for classification of target variable between low risk of above median dispatch time and high risk of being above median dispatch time.  
#### 1b. Feature extraction: Extract hour, day, and month features from four datetime features for future seasonality / feature importance analysis.   
#### 1c. Column transformer: Separate categorical variables from numerical in order for categorical features to be one-hot encoded and numerical features to be "passed through" without transformation.



### 2. Modeling:  

**Response Time Prediction**: Develop a model to predict classification of initial call to dispatch time (high risk or low risk - DispatchTimeCategory) based on the following features:  


Type                     object  
TypeText                 object  
Priority                 object  
InitialType              object  
InitialTypeText          object  
InitialPriority         float64  
DispositionText          object  
SelfInitiated            object  
Beat                     object  
Zip                      object  
PoliceDistrict           object  
IncidentCategory         object  
TimeCreate_hour           int32  
TimeCreate_day            int32  
TimeCreate_month          int32  
TimeDispatch_hour         int32  
TimeDispatch_day          int32  
TimeDispatch_month        int32  
TimeArrive_hour           int32  
TimeArrive_day            int32   
TimeArrive_month          int32  
TimeClosed_hour           int32  
TimeClosed_day            int32  
TimeClosed_month          int32  

#### 2a. Splitting the data.
Split data into train and test sets.


#### 2b. Selecting the best model:  
2b.1 Decision Tree Classifier  
2b.2 Random Forest Classifier  
2b.3 Gradient Boosting Classifier

Create a pipeline for each.
Assess each model's performance on Precision, Recall, and F1 scores, as this is a binary classification target variable.

Select the best model based on their classification report scores and conduct hyperparameter tuning.



Why GBC Might Be Well-Suited:

**Handling Complex Relationships**: GBCs can effectively capture complex, non-linear relationships between features and the target variable. This is important because 911 call data often involves factors that might not have a simple linear impact on dispatch time category (e.g., type of incident, location, time of day).
**Robust to Noise and Outliers**: GBCs are generally robust to noisy data and outliers. This dataset, even after cleaning, has potential for data entry errors. It has also been confirmed to contain outliers as seen in the data cleaning notebook.
**Feature Importance**: GBCs provide a way to assess feature importance, which can be insightful for understanding the key factors driving dispatch time categorization.

  
2. Train GBC model.  
3. Use cross-validation to ensure the model generalizes well to unseen data.  
4. Fine-tune hyperparameters using random search for computational efficiency.






In [32]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler, OneHotEncoder  # For encoding categorical features as a one-hot numeric array and scaling numeric features
from sklearn.compose import ColumnTransformer  # For applying different preprocessing steps to different columns
from sklearn.pipeline import Pipeline  # For creating a machine learning pipeline
from sklearn.ensemble import GradientBoostingClassifier  # For training a Gradient Boosting classifier
from sklearn.metrics import classification_report  # For evaluating the performance of the classification model (in this case, GBC)
from sklearn.model_selection import RandomizedSearchCV  # For performing hyperparameter tuning using randomized search
from scipy.stats import randint as sp_randint  # For defining distributions for randomized search
from sklearn.metrics import precision_score, recall_score, f1_score, make_scorer
from sklearn.model_selection import cross_val_score # For conducting cross-validation to ensure the model's performance generalizes well to unseen data.


In [2]:
# load the csv file into the notebook.
cleaned = pd.read_csv("CapstoneDataCleaned.csv")

In [3]:
# observe summary statistics to verify that dataset was loaded correctly.
cleaned.describe()

Unnamed: 0,Zip,PoliceDistrict,Latitude,Longitude,InitialCalltoDispatchTime,DispatchToArriveTime,ArrivaltoClose,ActualResponseTime
count,125360.0,125360.0,125360.0,125360.0,125360.0,125360.0,125360.0,125360.0
mean,70120.836327,4.589861,-90.0543,29.97224,2.0881,0.27465,1.124892,2.36275
std,5.944306,2.23895,0.047299,0.037227,4.382705,1.182769,3.395053,4.626214
min,70112.0,1.0,-90.136659,29.894221,0.0,0.000278,0.000278,0.000556
25%,70116.0,3.0,-90.087323,29.945076,0.041389,0.061944,0.21,0.176111
50%,70119.0,5.0,-90.065928,29.965275,0.355,0.1175,0.613611,0.561944
75%,70126.0,7.0,-90.024475,29.996116,2.070347,0.201389,1.400278,2.401667
max,70131.0,8.0,-89.737284,30.167657,108.229444,86.536944,695.441389,133.283889


In [4]:
cleaned.head(5)

Unnamed: 0,Type,TypeText,Priority,InitialType,InitialTypeText,InitialPriority,TimeCreate,TimeDispatch,TimeArrive,TimeClosed,...,Beat,Zip,PoliceDistrict,Latitude,Longitude,IncidentCategory,InitialCalltoDispatchTime,DispatchToArriveTime,ArrivaltoClose,ActualResponseTime
0,SEXOFF,SEX OFFENSE: GENERAL/MISC,2,ASLT,SIMPLE ASSAULT,1,2023-01-06 16:55:53,2023-01-06 17:02:01,2023-01-06 18:28:28,2023-01-06 19:15:31,...,2A02,70118,2,-90.127641,29.917747,Assault and Violence,0.102222,1.440833,0.784167,1.543056
1,TRESP,TRESPASSING,2,TRESP,TRESPASSING,2,2023-02-03 17:30:36,2023-02-03 18:09:35,2023-02-03 18:16:17,2023-02-03 18:27:44,...,5C02,70117,5,-90.04785,29.968767,Property Safety,0.649722,0.111667,0.190833,0.761389
2,WELFARE,WELFARE CHECK,1,WELFARE,WELFARE CHECK,1,2023-02-03 20:21:39,2023-02-04 09:37:33,2023-02-05 04:31:36,2023-02-05 05:35:19,...,7J04,70128,7,-89.95505,30.022642,Medical and Mental Health,13.265,18.900833,1.061944,32.165833
3,THEFT,THEFT,1,THEFT,THEFT,0,2023-02-03 23:48:51,2023-02-05 09:01:59,2023-02-05 09:31:31,2023-02-05 09:36:10,...,7O10,70127,7,-89.98827,30.034284,Theft and Burglary,33.218889,0.492222,0.0775,33.711111
4,PRIS,PRISONER TRANSPORT,1,PRIS,PRISONER TRANSPORT,1,2023-01-07 06:08:31,2023-01-08 01:02:30,2023-01-08 01:23:23,2023-01-08 02:56:18,...,4H02,70114,4,-90.031897,29.949725,Miscellaneous,18.899722,0.348056,1.548611,19.247778


In [5]:
cleaned.columns

Index(['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText',
       'InitialPriority', 'TimeCreate', 'TimeDispatch', 'TimeArrive',
       'TimeClosed', 'DispositionText', 'SelfInitiated', 'Beat', 'Zip',
       'PoliceDistrict', 'Latitude', 'Longitude', 'IncidentCategory',
       'InitialCalltoDispatchTime', 'DispatchToArriveTime', 'ArrivaltoClose',
       'ActualResponseTime'],
      dtype='object')

## 1a. Define threshold for classification of target variable between low risk of above median dispatch time and high risk of being above median dispatch time.
Here, we are going to use median instead of mean, as the heavy presence of outliers pulled the mean to above an acceptable dispatch time for emergency response.  

For reference, here are other major cities' mean 911 response times for 2023:   

**Chicago**: 3.46 minutes  
**Los Angeles**: 5.7 minutes  
**Seattle**: 7 minutes  
**Dallas**: 8 minutes   
**Miami**: 8 minutes  
**New York City**: 9.1 minutes  
**Atlanta**: 9.5 minutes  
**Houston**: 10 minutes  
**Detroit**: 12 minutes  
**Denver**: 13 minutes  

**New Orleans**:
(calculated below)

1a.1 Calculate point statistics and choose a threshold value for flagging high expected wait times.

Mean initial call to dispatch time: ~2.088 hours -> 125.29 minutes  
Mean initial call to responder arrival time: ~2.363 hours -> 141.77 minutes

Median initial call to dispatch time may be a better metric for flagging unusually high dispatch times here, in hopes that it will reel in some of the outlier response cases reported in 2023. It will also ensure class balance of 50% / 50%.

Median initial call to responder arrival time: 0.355 hour -> 21.29 minutes

While still outside of the major US city average response time range, this threshold can still serve as an ideal response time for the city relative to what its citizens experienced in 2023.






In [6]:
# Calculate the mean of the initial call to dispatch time.
# Due to outliers / positively skewed data, the mean is pulled to above 2 hours. Setting this as the threshold for high risk wait time scenarios
# would not be practical, as national major city averages range from 3.46 minutes to 13 minutes. ~2 hours is no where near the national average.
initialcalltodispatch_mean = cleaned['InitialCalltoDispatchTime'].mean()
print(initialcalltodispatch_mean)
print(initialcalltodispatch_mean * 60)

2.0881001094625256
125.28600656775153


In [7]:
# Confirm average response times are around 2.5 hours, as stated in news articles.
initialcalltoarrival_mean = cleaned['ActualResponseTime'].mean()
print(initialcalltoarrival_mean)
print(initialcalltoarrival_mean * 60)

2.3627500044316814
141.76500026590088


In [8]:
# Let's observe median instead, and see if this can serve as a more useful threshold for flagging potentially unacceptable emergency response wait times.
initialcalltodispatch_median = cleaned['InitialCalltoDispatchTime'].median()
print(initialcalltodispatch_median)
print(initialcalltodispatch_median * 60)

0.355
21.299999999999997


#### 1a.2 Define a function to create a binary feature that labels instances as either "high risk," as in high wait time expected ( > 21.30 minutes), or "low risk," as in low/acceptable wait times expected.

In [9]:
# Function to categorize response times
def categorize_response_time(rt):
    if rt < initialcalltodispatch_median:
        return 'low_risk'
    else:
        return 'high_risk'




# Apply categorization to create a new column
cleaned['DispatchTimeCategory'] = cleaned['InitialCalltoDispatchTime'].apply(categorize_response_time)

In [10]:
# Confirm that the 'DispatchTimeCategory' was added, and contains records of two different class labels.
cleaned.head()

Unnamed: 0,Type,TypeText,Priority,InitialType,InitialTypeText,InitialPriority,TimeCreate,TimeDispatch,TimeArrive,TimeClosed,...,Zip,PoliceDistrict,Latitude,Longitude,IncidentCategory,InitialCalltoDispatchTime,DispatchToArriveTime,ArrivaltoClose,ActualResponseTime,DispatchTimeCategory
0,SEXOFF,SEX OFFENSE: GENERAL/MISC,2,ASLT,SIMPLE ASSAULT,1,2023-01-06 16:55:53,2023-01-06 17:02:01,2023-01-06 18:28:28,2023-01-06 19:15:31,...,70118,2,-90.127641,29.917747,Assault and Violence,0.102222,1.440833,0.784167,1.543056,low_risk
1,TRESP,TRESPASSING,2,TRESP,TRESPASSING,2,2023-02-03 17:30:36,2023-02-03 18:09:35,2023-02-03 18:16:17,2023-02-03 18:27:44,...,70117,5,-90.04785,29.968767,Property Safety,0.649722,0.111667,0.190833,0.761389,high_risk
2,WELFARE,WELFARE CHECK,1,WELFARE,WELFARE CHECK,1,2023-02-03 20:21:39,2023-02-04 09:37:33,2023-02-05 04:31:36,2023-02-05 05:35:19,...,70128,7,-89.95505,30.022642,Medical and Mental Health,13.265,18.900833,1.061944,32.165833,high_risk
3,THEFT,THEFT,1,THEFT,THEFT,0,2023-02-03 23:48:51,2023-02-05 09:01:59,2023-02-05 09:31:31,2023-02-05 09:36:10,...,70127,7,-89.98827,30.034284,Theft and Burglary,33.218889,0.492222,0.0775,33.711111,high_risk
4,PRIS,PRISONER TRANSPORT,1,PRIS,PRISONER TRANSPORT,1,2023-01-07 06:08:31,2023-01-08 01:02:30,2023-01-08 01:23:23,2023-01-08 02:56:18,...,70114,4,-90.031897,29.949725,Miscellaneous,18.899722,0.348056,1.548611,19.247778,high_risk


In [11]:
# Since we have chosen our target variable as the time between the initial citizen call and time of dispatch, we can drop previously calculated time intervals. We will, however, retain time stamps to extract seasonality features later on.
cleaned2 = cleaned.drop(columns=['InitialCalltoDispatchTime', 'DispatchToArriveTime', 'ArrivaltoClose', 'ActualResponseTime'])

In [12]:
# Observe data types as we approach our modeling process.
cleaned2.dtypes

Unnamed: 0,0
Type,object
TypeText,object
Priority,object
InitialType,object
InitialTypeText,object
InitialPriority,object
TimeCreate,object
TimeDispatch,object
TimeArrive,object
TimeClosed,object


In [13]:
# Change 'Zip' and 'PoliceDistrict' to type object, as they are categorical features.
cleaned2['Zip'] = cleaned['Zip'].astype('object')
cleaned2['PoliceDistrict'] = cleaned['PoliceDistrict'].astype('object')

In [14]:
# Confirm data type conversions.
cleaned2.dtypes

Unnamed: 0,0
Type,object
TypeText,object
Priority,object
InitialType,object
InitialTypeText,object
InitialPriority,object
TimeCreate,object
TimeDispatch,object
TimeArrive,object
TimeClosed,object


In [15]:
cleaned2.columns

Index(['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText',
       'InitialPriority', 'TimeCreate', 'TimeDispatch', 'TimeArrive',
       'TimeClosed', 'DispositionText', 'SelfInitiated', 'Beat', 'Zip',
       'PoliceDistrict', 'Latitude', 'Longitude', 'IncidentCategory',
       'DispatchTimeCategory'],
      dtype='object')

1b. Feature extraction: Extract hour, day, and month features from four datetime features for future seasonality / feature importance analysis.

In [16]:
datetime_columns = ['TimeCreate', 'TimeDispatch', 'TimeArrive', 'TimeClosed']
for col in datetime_columns:
    cleaned2[col] = pd.to_datetime(cleaned2[col])
    cleaned2[f'{col}_hour'] = cleaned2[col].dt.hour
    cleaned2[f'{col}_day'] = cleaned2[col].dt.day
    cleaned2[f'{col}_month'] = cleaned2[col].dt.month



In [17]:
cleaned2.dtypes

Unnamed: 0,0
Type,object
TypeText,object
Priority,object
InitialType,object
InitialTypeText,object
InitialPriority,object
TimeCreate,datetime64[ns]
TimeDispatch,datetime64[ns]
TimeArrive,datetime64[ns]
TimeClosed,datetime64[ns]


In [18]:
cleaned2.drop(datetime_columns, axis=1, inplace=True)

In [19]:
# Confirm hour, day, and month features were extracted.
cleaned2.head()

Unnamed: 0,Type,TypeText,Priority,InitialType,InitialTypeText,InitialPriority,DispositionText,SelfInitiated,Beat,Zip,...,TimeCreate_month,TimeDispatch_hour,TimeDispatch_day,TimeDispatch_month,TimeArrive_hour,TimeArrive_day,TimeArrive_month,TimeClosed_hour,TimeClosed_day,TimeClosed_month
0,SEXOFF,SEX OFFENSE: GENERAL/MISC,2,ASLT,SIMPLE ASSAULT,1,REPORT TO FOLLOW,N,2A02,70118,...,1,17,6,1,18,6,1,19,6,1
1,TRESP,TRESPASSING,2,TRESP,TRESPASSING,2,GONE ON ARRIVAL,N,5C02,70117,...,2,18,3,2,18,3,2,18,3,2
2,WELFARE,WELFARE CHECK,1,WELFARE,WELFARE CHECK,1,GONE ON ARRIVAL,N,7J04,70128,...,2,9,4,2,4,5,2,5,5,2
3,THEFT,THEFT,1,THEFT,THEFT,0,GONE ON ARRIVAL,N,7O10,70127,...,2,9,5,2,9,5,2,9,5,2
4,PRIS,PRISONER TRANSPORT,1,PRIS,PRISONER TRANSPORT,1,Necessary Action Taken,N,4H02,70114,...,1,1,8,1,1,8,1,2,8,1


#### 1c. Column transformer: Separate categorical variables from numerical in order for categorical features to be one-hot encoded and numerical features to be "passed through" without transformation.

In [20]:
# Type                            object
# TypeText                        object
# Priority                        object
# InitialType                     object
# InitialTypeText                 object
# InitialPriority                 object
# TimeCreate              datetime64[ns]
# TimeDispatch            datetime64[ns]
# TimeArrive              datetime64[ns]
# TimeClosed              datetime64[ns]
# DispositionText                 object
# SelfInitiated                   object
# Beat                            object
# Zip                             object
# PoliceDistrict                  object



# Splitting the data into features and target
X = cleaned2.drop('DispatchTimeCategory', axis=1)
y = cleaned2['DispatchTimeCategory']


**Preprocessor1** varies slightly from **preprocessor**. "Preprocessor" does not require standard scaling of numerical features, as Decision Trees and Random Forests do not require it. This is because they make decisions based on relative comparisons of features within each node.

**('num', 'passthrough', numerical_features)**: Applies the passthrough transformer (no transformation) to the numerical features.  

**('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)**: Applies OneHotEncoder with handle_unknown='ignore' to the categorical features. This converts categorical values into numerical columns, handling unseen categories by ignoring them.  a

In [24]:
# Defining the column transformer
categorical_features = ['Type', 'TypeText', 'Priority', 'InitialType', 'InitialTypeText',
                        'InitialPriority', 'DispositionText', 'SelfInitiated', 'Beat',
                        'Zip', 'PoliceDistrict', 'IncidentCategory',
                        'TimeCreate_hour', 'TimeCreate_day', 'TimeCreate_month',
                        'TimeDispatch_hour', 'TimeDispatch_day', 'TimeDispatch_month',
                        'TimeArrive_hour', 'TimeArrive_day', 'TimeArrive_month',
                        'TimeClosed_hour', 'TimeClosed_day', 'TimeClosed_month']

numerical_features = []  # There are no numerical features, but we are creating this empty list for potential data entry changes in the future.

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'  # Pass through any remaining columns (if any)
)

#### 2a.1 Decision Tree Classifier

In [21]:
from sklearn.tree import DecisionTreeClassifier


In [33]:
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Perform cross-validation
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='accuracy')

# Print the cross-validation scores and the average accuracy
print(f"Cross-validation scores: {cv_scores}")
print(f"Average cross-validation accuracy: {cv_scores.mean():.4f}")

# Fit the model to the entire training set (optional, if you need a final model)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)


Cross-validation scores: [0.74444112 0.74892811 0.75331539 0.74542554 0.74846687]
Average cross-validation accuracy: 0.7481


In [None]:
clf_report = classification_report(y_test, y_pred)

print('Decision Tree Classification Report:')
print(clf_report)

Classification Report:
              precision    recall  f1-score   support

   high_risk       0.75      0.74      0.75     12502
    low_risk       0.75      0.76      0.75     12570

    accuracy                           0.75     25072
   macro avg       0.75      0.75      0.75     25072
weighted avg       0.75      0.75      0.75     25072



This hyperparameter tuning code would be used in the case that a Decision Tree Classifier was found to be the best model.

In [None]:


# # Define the scoring metrics
# scoring = {
#     'precision': make_scorer(precision_score, pos_label='high_risk'),
#     'recall': make_scorer(recall_score, pos_label='high_risk'),
#     'f1': make_scorer(f1_score, pos_label='high_risk')
# }

# # Define the parameter grid for hyperparameter tuning
# param_grid = {
#     'classifier__criterion': ['gini', 'entropy'],
#     'classifier__max_depth': [None, 5, 10, 15, 20],
#     'classifier__min_samples_split': [2, 5, 10],
#     'classifier__min_samples_leaf': [1, 2, 4]
# }

# # Perform RandomizedSearchCV for hyperparameter tuning
# random_search = RandomizedSearchCV(
#     clf, param_distributions=param_grid, n_iter=50, cv=5, scoring=scoring, refit='f1', random_state=42
# )

# # Fit the randomized search to the training data
# random_search.fit(X_train, y_train)

# # Print the best hyperparameters and corresponding scores
# print("Best Hyperparameters:", random_search.best_params_)
# print("Best Scores:", random_search.best_score_)

# # Get the best model
# best_model = random_search.best_estimator_

# # Make predictions using the best model
# y_pred_best = best_model.predict(X_test)

# # Evaluate the best model
# print(classification_report(y_test, y_pred_best))


#### 2a.2 Random Forest Classifier

In [22]:
from sklearn.ensemble import RandomForestClassifier

In [34]:
# Creating the pipeline with RandomForestClassifier

rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier',RandomForestClassifier())
])

# Perform cross-validation
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='accuracy')

# Print the cross-validation scores and the average accuracy
print(f"Cross-validation scores: {cv_scores}")
print(f"Average cross-validation accuracy: {cv_scores.mean():.4f}")

# Fit the model to the entire training set
rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)



Cross-validation scores: [0.80252268 0.79888324 0.80157543 0.79463529 0.79942165]
Average cross-validation accuracy: 0.7994


In [35]:
rf_report = classification_report(y_test, y_pred_rf)

print('Random Forest Classification Report:')
print(rf_report)

Random Forest Classification Report:
              precision    recall  f1-score   support

   high_risk       0.79      0.82      0.81     12502
    low_risk       0.81      0.79      0.80     12570

    accuracy                           0.80     25072
   macro avg       0.80      0.80      0.80     25072
weighted avg       0.80      0.80      0.80     25072



In [None]:
# Hyperparameter Tuning

# Define the parameter space to search
param_dist = {
    'classifier__n_estimators': sp_randint(50, 200),
    'classifier__max_depth': sp_randint(2, 10),  # Adjust max_depth range as per your dataset
    'classifier__min_samples_split': sp_randint(2, 11),
    'classifier__min_samples_leaf': sp_randint(1, 5),
    'classifier__bootstrap': [True, False]
}

# Create the RandomSearchCV object
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=100,  # Number of parameter settings to try
    cv=5,        # Number of folds for cross-validation
    scoring='f1_macro',  # Choose your desired scoring metric
    random_state=42
)

# Fit the random search to the training data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print("Best parameters found:", best_params)

# Get the best estimator (the trained model with the best parameters)
best_model = random_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)

Best parameters found: {'classifier__bootstrap': True, 'classifier__max_depth': 9, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 196}


Now that we have found the best hyperparameters via RandomSearchCV (less-computationally expensive than GridSearch), we can plug them into our model and look at its resulting classification report scores. They should be better than our initial run of the Random Forest Classifier, as all hyperparameters were at their default values, not the values that would result in the best scores for this dataset.

In [27]:
rf_best = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier',RandomForestClassifier(
    bootstrap=True,
    max_depth=9,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=196,
    random_state=42
))
])


rf_best.fit(X_train, y_train)

# Make predictions
y_pred_rf_best = rf.predict(X_test)



print(classification_report(y_test, y_pred_rf_best))

              precision    recall  f1-score   support

   high_risk       0.80      0.82      0.81     12502
    low_risk       0.81      0.79      0.80     12570

    accuracy                           0.80     25072
   macro avg       0.81      0.81      0.80     25072
weighted avg       0.81      0.80      0.80     25072



#### 2a.3 Gradient Boosting Classifier

In [36]:
preprocessor1 = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# While we currently have no numerical features, it is helpful to include this transformer in the case that a future dataset
# includes numerical features. The final model chosen in this notebook is hypothetically for an app that receives new 911 call data
# as time progresses. The Orleans Parish Communications Department could change their data entry standard practices in the future.

# Creating the pipeline
gb = Pipeline(steps=[
    ('preprocessor', preprocessor1),
    ('classifier', GradientBoostingClassifier(random_state=42))
])

# Perform cross-validation
cv_scores = cross_val_score(gb, X_train, y_train, cv=5, scoring='accuracy')

# Print the cross-validation scores and the average accuracy
print(f"Cross-validation scores: {cv_scores}")
print(f"Average cross-validation accuracy: {cv_scores.mean():.4f}")

# Fit the model to the entire training set
gb.fit(X_train, y_train)

# Make predictions
y_pred_gb = gb.predict(X_test)



Cross-validation scores: [0.78921129 0.78766577 0.79010868 0.7820711  0.78830334]
Average cross-validation accuracy: 0.7875


In [37]:
gb_report = classification_report(y_test, y_pred_gb)

print('Gradient Boosting Classification Report:')
print(gb_report)

Gradient Boosting Classification Report:
              precision    recall  f1-score   support

   high_risk       0.78      0.80      0.79     12502
    low_risk       0.80      0.77      0.78     12570

    accuracy                           0.79     25072
   macro avg       0.79      0.79      0.79     25072
weighted avg       0.79      0.79      0.79     25072



Because the Random Forest and Gradient Boosting Models are similar in accuracy scores, hyperparameter tuning will be conducted in the case that the GB Model has room for improvement that could lead to outperformance over the Random Forest Model.

In [None]:
# Hyperparameter Tuning

# Define the parameter space to search
param_dist_gb = {
    'classifier__n_estimators': sp_randint(50, 200),
    'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
    'classifier__max_depth': sp_randint(2, 6),
    'classifier__min_samples_split': sp_randint(2, 11),
    'classifier__min_samples_leaf': sp_randint(1, 5),
}

# Create the RandomSearchCV object
random_search = RandomizedSearchCV(
    estimator=gb,
    param_distributions=param_dist_gb,
    n_iter=50,  # Number of parameter settings to try. I'm trying 50 here to save computing time.
    cv=5,      # Number of folds for cross-validation
    scoring='f1_macro', # Choose your desired scoring metric (e.g., accuracy, precision, recall, f1-score)
    random_state=42
)

# Fit the random search to the training data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print("Best parameters found:", best_params)


In [None]:
# # Define the parameter space to search
# Gradient Boosting Parameter Space
# param_dist_gb = {
#     'classifier__n_estimators': sp_randint(50, 200),
#     'classifier__learning_rate': [0.01, 0.05, 0.1, 0.2],
#     'classifier__max_depth': sp_randint(2, 6),
#     'classifier__min_samples_split': sp_randint(2, 11),
#     'classifier__min_samples_leaf': sp_randint(1, 5),

# Random Forest Parameter Space
# param_dist = {
#     'classifier__n_estimators': sp_randint(50, 200),
#     'classifier__max_depth': sp_randint(2, 10),  # Adjust max_depth range as per your dataset
#     'classifier__min_samples_split': sp_randint(2, 11),
#     'classifier__min_samples_leaf': sp_randint(1, 5),
#     'classifier__bootstrap': [True, False]

AttributeError: 'Pipeline' object has no attribute 'feature_importances_'