In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Chicago traffic crashes prediction

You've been given a crucial role as a data scientist for Chicago. 
Your job is to predict which accidents might require a response, like medical aid, towing, or both.
You'll analyze factors like accident location, road conditions, speed limits, and time. 
Chicago wants to use this information to better allocate its resources, considering factors like weather and time of day.

**Note**: This dataset is a small subset of the one available at the [Chicago Data Portal](https://data.cityofchicago.org/). 
We've chosen this subset because you'll be using a `KNeighborsClassifier`, which performs efficiently with small to medium-sized datasets but can be quite slow with larger ones. 
In future assignments, you'll work with the entire dataset.
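
To make the speed concern concrete, here is a minimal brute-force sketch of k-nearest neighbors in NumPy. Every query point is compared against every training point, so the cost of one prediction grows with the size of the training set. The helper `knn_predict` and the toy arrays are assumptions made for illustration; scikit-learn's `KNeighborsClassifier` may use tree-based indexes rather than this brute-force approach.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force k-NN majority vote (hypothetical helper for illustration)."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    # Indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote among the neighbors' integer labels
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Two well-separated toy clusters (made-up values)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [5.0, 5.1]])

pred = knn_predict(X_train, y_train, X_query, k=3)
print(pred)
```

The distance matrix has one entry per (query, training) pair, which is why a small subset keeps the assignment fast.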

In [57]:
# load data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/traffic_crashes_Chicago.csv'
data = pd.read_csv(url)
data

Unnamed: 0,CRASH_RECORD_ID,RD_NO,CRASH_DATE_EST_I,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,...,INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE,LOCATION,YEAR
0,a0cdc2e317e24a87ffb5ed39a0f1ab99054fe04167615b...,JG205578,,2023-03-31 07:34:00,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,RAIN,DAYLIGHT,REAR END,...,0.0,2.0,0.0,7,6,3,41.909494,-87.747824,POINT (-87.747823796021 41.909493550808),2023
1,00e93310a117dc0228ee5e00affc77ab0bd3334e54db75...,JG317047,,2023-06-26 16:15:00,20,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,...,0.0,2.0,0.0,16,2,6,,,,2023
2,07c772b5d5b0264284f35a7769114ae681037a123d9872...,JG214567,,2023-04-07 17:15:00,15,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,ANGLE,...,0.0,2.0,0.0,17,6,4,41.834402,-87.616894,POINT (-87.61689418428 41.834401691989),2023
3,4d0d885dfa2da00a8d196c58a8d4f249c3c697fb478ecb...,JG138027,,2023-02-01 16:00:00,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,...,0.0,2.0,0.0,16,4,2,41.962140,-87.645937,POINT (-87.645936592224 41.962140154293),2023
4,2630202e4794a8b4dd665b5ad172b09f0be849937eb5f7...,JG167558,,2023-02-27 09:55:00,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,RAIN,"DARKNESS, LIGHTED ROAD",SIDESWIPE SAME DIRECTION,...,0.0,2.0,0.0,9,2,2,41.891604,-87.625307,POINT (-87.625306944978 41.89160410607),2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12974,39e3191301443098210c420c84157dec9a9fcd3b982f51...,JG271497,Y,2023-05-23 08:41:00,25,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,SIDESWIPE SAME DIRECTION,...,0.0,2.0,0.0,8,3,5,41.707211,-87.628239,POINT (-87.628239101889 41.707211473793),2023
12975,ee4a15023569327d9ac20fe8a06dbc79aa4e353c7dcdf9...,JG232982,,2023-04-21 21:15:00,30,NO CONTROLS,NO CONTROLS,CLEAR,"DARKNESS, LIGHTED ROAD",PARKED MOTOR VEHICLE,...,0.0,1.0,0.0,21,6,4,41.659773,-87.636651,POINT (-87.636650606697 41.659773314849),2023
12976,a0e267ee446b134cbdab5b9ae1f64c698a622c87984768...,JG424169,,2023-09-14 15:15:00,30,NO CONTROLS,NO CONTROLS,CLEAR,DAYLIGHT,REAR END,...,0.0,2.0,0.0,15,5,9,41.793506,-87.711398,POINT (-87.711398027946 41.793506266409),2023
12977,9316f75a7f7d6aee7380b7347705907c336b427cea5898...,,,2023-10-11 08:51:00,30,TRAFFIC SIGNAL,FUNCTIONING PROPERLY,CLEAR,DAYLIGHT,TURNING,...,0.0,2.0,0.0,8,4,10,41.799679,-87.733000,POINT (-87.732999967493 41.799679470254),2023


In [59]:
data.columns

Index(['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I', 'CRASH_DATE',
       'POSTED_SPEED_LIMIT', 'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION',
       'WEATHER_CONDITION', 'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE',
       'TRAFFICWAY_TYPE', 'LANE_CNT', 'ALIGNMENT', 'ROADWAY_SURFACE_COND',
       'ROAD_DEFECT', 'REPORT_TYPE', 'CRASH_TYPE', 'INTERSECTION_RELATED_I',
       'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I', 'DAMAGE', 'DATE_POLICE_NOTIFIED',
       'PRIM_CONTRIBUTORY_CAUSE', 'SEC_CONTRIBUTORY_CAUSE', 'STREET_NO',
       'STREET_DIRECTION', 'STREET_NAME', 'BEAT_OF_OCCURRENCE',
       'PHOTOS_TAKEN_I', 'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
       'WORK_ZONE_TYPE', 'WORKERS_PRESENT_I', 'NUM_UNITS',
       'MOST_SEVERE_INJURY', 'INJURIES_TOTAL', 'INJURIES_FATAL',
       'INJURIES_INCAPACITATING', 'INJURIES_NON_INCAPACITATING',
       'INJURIES_REPORTED_NOT_EVIDENT', 'INJURIES_NO_INDICATION',
       'INJURIES_UNKNOWN', 'CRASH_HOUR', 'CRASH_DAY_OF_WEEK', 'CRASH_MONTH',
       'LA

Train, fine-tune, and test a `KNeighborsClassifier` model for predicting the `CRASH_TYPE` column in the dataset. Create a brief report summarizing your findings for the city of Chicago.

In [4]:
data.CRASH_DATE_EST_I.value_counts()

Y    791
N    126
Name: CRASH_DATE_EST_I, dtype: int64

In [5]:
data.CRASH_DATE_EST_I

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
12974      Y
12975    NaN
12976    NaN
12977    NaN
12978    NaN
Name: CRASH_DATE_EST_I, Length: 12979, dtype: object

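I drop `RD_NO` and `CRASH_RECORD_ID` because they are identifiers and are unlikely to carry any predictive information about the crash. I also drop `CRASH_DATE_EST_I` because it is almost entirely NaN (only 917 of the 12,979 rows have a value) and its meaning is unclear to me. `LOCATION` is dropped because it duplicates `LATITUDE` and `LONGITUDE`. Similarly, I drop `CRASH_DATE` and `DATE_POLICE_NOTIFIED`, which are difficult to encode, and rely on `CRASH_HOUR`, `CRASH_DAY_OF_WEEK`, and `CRASH_MONTH` instead. `BEAT_OF_OCCURRENCE` is dropped as well, and `CRASH_TYPE` is removed from the features because it is the target.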
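
One way to back up the "mostly NaN" judgment is to compute the fraction of missing entries per column. The sketch below uses a small made-up DataFrame as a stand-in for the crash data, and the 0.5 threshold is an arbitrary assumption, not a rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crash data (values are made up for illustration)
toy = pd.DataFrame({
    "CRASH_DATE_EST_I": pd.Series(["Y", np.nan, np.nan, np.nan, np.nan], dtype=object),
    "POSTED_SPEED_LIMIT": [30, 20, 15, 30, 30],
    "LATITUDE": [41.9, np.nan, 41.8, 41.9, 41.8],
})

# Fraction of missing entries per column, largest first
missing_frac = toy.isna().mean().sort_values(ascending=False)

# Columns whose missing fraction exceeds the (assumed) threshold
mostly_missing = missing_frac[missing_frac > 0.5].index.tolist()

print(missing_frac)
print(mostly_missing)
```

Applied to the real `data`, the same `data.isna().mean()` call would rank all columns by how much of their content is missing.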

In [88]:
drop_cols = ['CRASH_TYPE','CRASH_DATE', 'CRASH_DATE_EST_I', 'RD_NO', 'CRASH_RECORD_ID', 'BEAT_OF_OCCURRENCE','DATE_POLICE_NOTIFIED','LOCATION']

In [89]:
# feature matrix: every column except the target and the dropped columns
X = data.drop(columns=drop_cols)

In [90]:
y = data.CRASH_TYPE

In [91]:
# train/test split (default test_size=0.25, shuffled, no fixed random seed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
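
The split above uses the defaults, so it holds out 25% of the rows and changes from run to run. As a hedged alternative, the sketch below shows a reproducible, stratified split on toy data; `random_state=42`, `test_size=0.2`, and the toy labels are assumptions for illustration and were not used to produce the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced target: 5 rows of class "A", 15 of class "B"
toy_X = pd.DataFrame({"CRASH_HOUR": range(20)})
toy_y = pd.Series(["A"] * 5 + ["B"] * 15)

# stratify keeps the class proportions the same in both parts;
# random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print(y_tr.value_counts())
print(y_te.value_counts())
```

Stratifying matters most when one class is much rarer than the other, since a purely random split can leave the test set with too few examples of the rare class.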

In [92]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()

In [93]:
# import stuff for pipeline

#Preprocessing 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures

# Pipeline
from sklearn.pipeline import Pipeline   # Sequentially apply a list of transformations
from sklearn.compose import ColumnTransformer # Applies in parallel transformations to columns
from sklearn.preprocessing import FunctionTransformer # it makes functions compatible with scikit-learn pipelines

# Grid search 
from sklearn.model_selection import GridSearchCV

In [103]:
len(X.dtypes)

42

In [113]:
nums = X.dtypes[(X.dtypes == 'float64') | (X.dtypes == 'int64')]

In [115]:
cats = X.dtypes[(X.dtypes != 'float64') & (X.dtypes != 'int64')]
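
The two dtype filters above can also be written with `DataFrame.select_dtypes`. This is only a sketch on a made-up frame; note that `include="number"` matches every numeric dtype, not just `float64` and `int64`, so it is slightly broader than the comparison used in this notebook.

```python
import pandas as pd

# Toy frame with one integer, one float, and one text column (made-up values)
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30, 20],
    "LATITUDE": [41.9, 41.8],
    "WEATHER_CONDITION": ["RAIN", "CLEAR"],
})

# Column names split by dtype family
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```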

In [116]:
# numerical features pipeline: impute+scale
numeric_features = [feature for feature in nums.index]
numeric_processor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
    ])

In [134]:


# categorical_features pipeline: impute+encode
categorical_features = [i for i in cats.index]
categorical_processor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])



In [135]:
len(categorical_features)

25

In [136]:
len(numeric_features)

17

In [137]:
feature_processor = ColumnTransformer(
    transformers=[
        ('num', numeric_processor, numeric_features),
        ('cat', categorical_processor, categorical_features),
    ],
    remainder='drop',  # discard any column not listed above
)
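
To see what this kind of preprocessor produces, the sketch below runs the same two sub-pipelines on a tiny made-up frame with one numeric and one categorical column. The names `toy`, `num_pipe`, `cat_pipe`, and `toy_processor` are illustrative assumptions; only the transformer settings mirror the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing value in each column (made-up values)
toy = pd.DataFrame({
    "POSTED_SPEED_LIMIT": [30.0, np.nan, 20.0, 30.0],
    "WEATHER_CONDITION": pd.Series(["RAIN", "CLEAR", np.nan, "CLEAR"], dtype=object),
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

toy_processor = ColumnTransformer([
    ("num", num_pipe, ["POSTED_SPEED_LIMIT"]),
    ("cat", cat_pipe, ["WEATHER_CONDITION"]),
])

out = toy_processor.fit_transform(toy)
# The result may be sparse or dense depending on its density; normalize to dense
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)
print(out.shape)
print(out)
```

The output has one scaled column for the numeric feature and one indicator column per category, so its width depends on how many distinct categories appear in the training data.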

In [138]:
pipe = Pipeline(steps=[
    ('preprocessor', feature_processor),
    #('poly_features', PolynomialFeatures(degree=2)),
    ('clf',knn_clf)
])

In [139]:
pipe

In [141]:
param_grid = { 
    'clf__n_neighbors': list(range(1,21)),
    'clf__weights' : ['uniform','distance']
}

# instantiate and fit the grid
grid = GridSearchCV(pipe, param_grid, cv=10, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
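
The grid above has 20 × 2 = 40 candidate settings, each evaluated on 10 folds. Beyond the single best setting, `cv_results_` shows how close the other candidates came. The sketch below demonstrates this on synthetic data from `make_classification` with a smaller grid; the data, the grid values, and the names `toy_pipe` and `toy_grid` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the crash data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
toy_grid = GridSearchCV(
    toy_pipe,
    {"clf__n_neighbors": [1, 5, 9], "clf__weights": ["uniform", "distance"]},
    cv=5,
    scoring="accuracy",
)
toy_grid.fit(X_toy, y_toy)

# One row per candidate, best-ranked first
results = pd.DataFrame(toy_grid.cv_results_)
summary = (
    results[["param_clf__n_neighbors", "param_clf__weights",
             "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

If several candidates score within one standard deviation of the best, the choice of `n_neighbors` is not very sensitive, which is useful context for the report.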

In [142]:
# best predictor
best_clf = grid.best_estimator_

In [143]:
# best hyper-parameters
grid.best_params_

{'clf__n_neighbors': 12, 'clf__weights': 'distance'}

In [144]:
y_test_pred = best_clf.predict(X_test)

In [151]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [147]:
# accuracy
accuracy_score(y_test,y_test_pred)

0.8810477657935285

In [152]:
confusion_matrix(y_test,y_test_pred)

array([[ 579,  339],
       [  47, 2280]])
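
Accuracy alone hides how the errors are split between the two classes. The arithmetic below recomputes accuracy from the confusion matrix printed above and adds per-class recall and precision. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order; the class names themselves are not shown in this output.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[579, 339],
               [47, 2280]])

# Overall accuracy: correct predictions over all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over the true count of each class (row sums)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over the predicted count (column sums)
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(accuracy)             # about 0.881, matching accuracy_score above
print(recall_per_class)     # about 0.63 for the first class, 0.98 for the second
print(precision_per_class)  # about 0.92 for the first class, 0.87 for the second
```

The model recovers only about 63% of the first class (579 of 918), compared with about 98% of the second, so most of its errors are first-class crashes predicted as second-class.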

## Write-up

We trained a k-nearest neighbors model to classify accidents by whether they require injury assistance or towing services. The model reached about 88.1% accuracy on the held-out test set. Before training, we removed features that would not aid prediction, such as ID numbers, as well as features that are difficult to encode for a k-nearest neighbors model, such as dates.

We then imputed missing values, using the median for numerical features and the most frequent value for categorical features. Numerical features were standardized, and categorical features were one-hot encoded, since a k-nearest neighbors model computes distances and therefore needs numeric inputs.

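Finally, we tuned the number of neighbors (1 to 20) and the weighting scheme (uniform or distance) with a 10-fold cross-validated grid search. The best setting was 12 neighbors with distance weighting, and this is the model evaluated on the test set.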