# Project 5: Car crash data

## Modeling Notebook
This notebook is for modeling the car crash data.  It assumes that the previous notebooks have been run. 

## Problem Statement:


In [1]:
#imports 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split


In [2]:
#read data
crash = pd.read_csv('./data/crash_data_modified.csv')
crash.head()

Unnamed: 0,Crash Time of Day,Collision Type,Surface Condition,Light,Traffic Control,Driver Substance Abuse,Driver At Fault,Driver Distracted By,Vehicle First Impact Location,Vehicle Second Impact Location,Vehicle Body Type,Vehicle Movement,Speed Limit,Parked Vehicle
0,Evening,SAME DIR REAR END,DRY,DAYLIGHT,STOP SIGN,No,No,No,SIX OCLOCK,SIX OCLOCK,PASSENGER CAR,STOPPED IN TRAFFIC LANE,25,No
1,Early Morning,SAME DIR REAR END,DRY,DAWN,TRAFFIC SIGNAL,No,No,No,SIX OCLOCK,SIX OCLOCK,PASSENGER CAR,STOPPED IN TRAFFIC LANE,40,No
2,Early Morning,SINGLE VEHICLE,DRY,DAYLIGHT,NO CONTROLS,No,No,No,ELEVEN OCLOCK,ELEVEN OCLOCK,POLICE VEHICLE/NON EMERGENCY,MOVING CONSTANT SPEED,35,No
3,Late Night,SINGLE VEHICLE,DRY,DARK LIGHTS ON,NO CONTROLS,No,No,No,TWELVE OCLOCK,TWELVE OCLOCK,POLICE VEHICLE/EMERGENCY,MOVING CONSTANT SPEED,35,No
4,Night,SAME DIR REAR END,DRY,DARK LIGHTS ON,NO CONTROLS,Yes,Yes,Yes,TWELVE OCLOCK,TWELVE OCLOCK,PASSENGER CAR,ACCELERATING,35,No


In [3]:
# original feature names
crash.columns

Index(['Crash Time of Day', 'Collision Type', 'Surface Condition', 'Light',
       'Traffic Control', 'Driver Substance Abuse', 'Driver At Fault',
       'Driver Distracted By', 'Vehicle First Impact Location',
       'Vehicle Second Impact Location', 'Vehicle Body Type',
       'Vehicle Movement', 'Speed Limit', 'Parked Vehicle'],
      dtype='object')

In [4]:
# Driver substance abuse
crash['Driver Substance Abuse'].value_counts(normalize = True)

No     0.973233
Yes    0.026767
Name: Driver Substance Abuse, dtype: float64

In [5]:
# Driver distracted by
crash['Driver Distracted By'].value_counts(normalize = True)

No     0.784361
Yes    0.215639
Name: Driver Distracted By, dtype: float64

In [6]:
columns_to_dummify = ['Crash Time of Day', 'Collision Type', 'Surface Condition', 'Light',
       'Traffic Control', 'Vehicle First Impact Location',
       'Vehicle Second Impact Location', 'Vehicle Body Type',
       'Vehicle Movement', 'Parked Vehicle']

df = pd.get_dummies(crash,columns=columns_to_dummify, drop_first=True)

In [7]:
df['Driver At Fault'] = df['Driver At Fault'].map({'No':0, 'Yes':1})
df['Driver Substance Abuse'] = df['Driver Substance Abuse'].map({'No':0, 'Yes':1})
df['Driver Distracted By'] = df['Driver Distracted By'].map({'No':0, 'Yes':1})

In [8]:
df.head()

Unnamed: 0,Driver Substance Abuse,Driver At Fault,Driver Distracted By,Speed Limit,Crash Time of Day_Evening,Crash Time of Day_Late Night,Crash Time of Day_Morning,Crash Time of Day_Night,Crash Time of Day_Noon,Collision Type_ANGLE MEETS LEFT TURN,...,Vehicle Movement_PARKED,Vehicle Movement_PARKING,Vehicle Movement_PASSING,Vehicle Movement_RIGHT TURN ON RED,Vehicle Movement_SKIDDING,Vehicle Movement_SLOWING OR STOPPING,Vehicle Movement_STARTING FROM LANE,Vehicle Movement_STARTING FROM PARKED,Vehicle Movement_STOPPED IN TRAFFIC LANE,Parked Vehicle_Yes
0,0,0,0,25,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,40,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,35,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,35,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,35,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


## Null Model

In [9]:
df['Driver At Fault'].value_counts(normalize = True)

0    0.556386
1    0.443614
Name: Driver At Fault, dtype: float64

In this data set, 55.6% of crashes are recorded with the driver not at fault. A null model that always predicts "not at fault" would therefore be correct 55.6% of the time, which is the baseline accuracy the models below need to beat.
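
The same baseline can be obtained from a majority-class classifier. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (`y_toy` and `X_toy` are illustrative only and are not the crash data), so the printed number will not match the 55.6% above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels only: 5 "not at fault" (0) and 4 "at fault" (1)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
null_model = DummyClassifier(strategy="most_frequent")
null_model.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class
baseline_accuracy = null_model.score(X_toy, y_toy)
print(baseline_accuracy)
```

In this notebook the equivalent call would be fit on `X_train, y_train` and scored on `X_test, y_test`, once the split below has been made.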

## Simple Logistic Regression Model

In [10]:
X = df.drop(columns = 'Driver At Fault')
y = df['Driver At Fault']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=13424)

In [12]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train,y_train)
logreg.score(X_train,y_train),logreg.score(X_test,y_test)

(0.8907869877911315, 0.89393217755763)

In [69]:
# map each feature name to its fitted coefficient (log-odds scale)
logreg_coef_dict = dict(zip(X.columns, logreg.coef_[0]))
logreg_coef_df = pd.DataFrame({'coef': logreg.coef_[0]}, index=X.columns)

In [70]:
max(logreg_coef_dict.values())

4.720270414999181

In [71]:
min(logreg_coef_dict.values())

-3.748020648826403

In [72]:
strong_positive_coefs = {key:logreg_coef_dict[key]  for key in logreg_coef_dict.keys() 
                         if logreg_coef_dict[key] >1}
strong_positive_coefs

{'Driver Substance Abuse': 3.0768833946992133,
 'Driver Distracted By': 3.9572506453202436,
 'Collision Type_OPPOSITE DIRECTION SIDESWIPE': 1.0219548695126421,
 'Collision Type_SAME DIR REAR END': 2.9141118727809605,
 'Collision Type_SAME DIR REND LEFT TURN': 1.4199347189186657,
 'Collision Type_SAME DIR REND RIGHT TURN': 1.8257380444717228,
 'Collision Type_SINGLE VEHICLE': 2.769530153272787,
 'Vehicle Movement_BACKING': 4.720270414999181,
 'Vehicle Movement_CHANGING LANES': 2.1450104791761024,
 'Vehicle Movement_ENTERING TRAFFIC LANE': 1.5501430897222583,
 'Vehicle Movement_MAKING U TURN': 1.9983981958612884,
 'Vehicle Movement_PASSING': 1.6210915057291435,
 'Vehicle Movement_RIGHT TURN ON RED': 1.177144457863407}

In [73]:
mild_positive_coefs = {key:logreg_coef_dict[key] for key in logreg_coef_dict.keys() 
              if 0.5 < logreg_coef_dict[key]<=1  }
mild_positive_coefs

{'Collision Type_HEAD ON': 0.6145990211713696,
 'Collision Type_SAME DIRECTION SIDESWIPE': 0.6963573385699274,
 'Surface Condition_ICE': 0.6038949145218858,
 'Surface Condition_SNOW': 0.8531089999960572,
 'Vehicle First Impact Location_ONE OCLOCK': 0.8453980420546452,
 'Vehicle First Impact Location_TWELVE OCLOCK': 0.6947230647338135,
 'Vehicle First Impact Location_TWO OCLOCK': 0.629385356504718,
 'Vehicle First Impact Location_UNDERSIDE': 0.947512763606125,
 'Vehicle Second Impact Location_FIVE OCLOCK': 0.5388644215934185,
 'Vehicle Second Impact Location_FOUR OCLOCK': 0.8434570656801628,
 'Vehicle Second Impact Location_THREE OCLOCK': 0.5575891405412349,
 'Vehicle Body Type_ALL TERRAIN VEHICLE (ATV)': 0.6502620652806128,
 'Vehicle Body Type_AMBULANCE/NON EMERGENCY': 0.768787863909468,
 'Vehicle Body Type_AUTOCYCLE': 0.7284844498524414,
 'Vehicle Body Type_FIRE VEHICLE/NON EMERGENCY': 0.5062737964939242,
 'Vehicle Body Type_LOW SPEED VEHICLE': 0.9281303993492653,
 'Vehicle Movement_L

In [74]:
strong_negative_coefs = {key:logreg_coef_dict[key]  for key in logreg_coef_dict.keys() 
                         if logreg_coef_dict[key] < -1}
strong_negative_coefs

{'Vehicle First Impact Location_SIX OCLOCK': -2.9490090362116024,
 'Vehicle Second Impact Location_SIX OCLOCK': -1.1847054694758332,
 'Vehicle Movement_MOVING CONSTANT SPEED': -1.586952889123024,
 'Vehicle Movement_PARKED': -1.8416024622295983,
 'Vehicle Movement_SLOWING OR STOPPING': -1.1129377247718955,
 'Vehicle Movement_STOPPED IN TRAFFIC LANE': -3.748020648826403,
 'Parked Vehicle_Yes': -1.8416024622295983}

In [75]:
mild_negative_coefs = {key:logreg_coef_dict[key] for key in logreg_coef_dict.keys() 
              if -0.5 >= logreg_coef_dict[key] > -1  }
mild_negative_coefs

{'Collision Type_SAME DIR BOTH LEFT TURN': -0.5970299523481277,
 'Surface Condition_MUD, DIRT, GRAVEL': -0.7514651315362413,
 'Traffic Control_PERSON': -0.6639076503390595,
 'Vehicle First Impact Location_ROOF TOP': -0.6489545401019224,
 'Vehicle Body Type_POLICE VEHICLE/EMERGENCY': -0.5730005276449123,
 'Vehicle Body Type_POLICE VEHICLE/NON EMERGENCY': -0.8360497296936743,
 'Vehicle Movement_STARTING FROM LANE': -0.5499010769790169}

In [81]:
logreg_coef_df.sort_values('coef', ascending = False, inplace = True)
logreg_coef_df[logreg_coef_df['coef']>1]

Unnamed: 0,coef
Vehicle Movement_BACKING,4.72027
Driver Distracted By,3.957251
Driver Substance Abuse,3.076883
Collision Type_SAME DIR REAR END,2.914112
Collision Type_SINGLE VEHICLE,2.76953
Vehicle Movement_CHANGING LANES,2.14501
Vehicle Movement_MAKING U TURN,1.998398
Collision Type_SAME DIR REND RIGHT TURN,1.825738
Vehicle Movement_PASSING,1.621092
Vehicle Movement_ENTERING TRAFFIC LANE,1.550143


In [88]:
logreg_coef_df.sort_values('coef', ascending = False, inplace = True)
logreg_coef_df[(1>=logreg_coef_df['coef']) & (logreg_coef_df['coef']>0.5)]

Unnamed: 0,coef
Vehicle First Impact Location_UNDERSIDE,0.947513
Vehicle Body Type_LOW SPEED VEHICLE,0.92813
Vehicle Movement_MAKING LEFT TURN,0.906453
Surface Condition_SNOW,0.853109
Vehicle First Impact Location_ONE OCLOCK,0.845398
Vehicle Second Impact Location_FOUR OCLOCK,0.843457
Vehicle Body Type_AMBULANCE/NON EMERGENCY,0.768788
Vehicle Body Type_AUTOCYCLE,0.728484
Vehicle Movement_LEAVING TRAFFIC LANE,0.725908
Collision Type_SAME DIRECTION SIDESWIPE,0.696357


In [86]:
logreg_coef_df.sort_values('coef', ascending = True, inplace = True)
logreg_coef_df[logreg_coef_df['coef']<-1]

Unnamed: 0,coef
Vehicle Movement_STOPPED IN TRAFFIC LANE,-3.748021
Vehicle First Impact Location_SIX OCLOCK,-2.949009
Vehicle Movement_PARKED,-1.841602
Parked Vehicle_Yes,-1.841602
Vehicle Movement_MOVING CONSTANT SPEED,-1.586953
Vehicle Second Impact Location_SIX OCLOCK,-1.184705
Vehicle Movement_SLOWING OR STOPPING,-1.112938


In [87]:
logreg_coef_df.sort_values('coef', ascending = True, inplace = True)
logreg_coef_df[(-1<=logreg_coef_df['coef']) & (logreg_coef_df['coef']<-0.5)]

Unnamed: 0,coef
Vehicle Body Type_POLICE VEHICLE/NON EMERGENCY,-0.83605
"Surface Condition_MUD, DIRT, GRAVEL",-0.751465
Traffic Control_PERSON,-0.663908
Vehicle First Impact Location_ROOF TOP,-0.648955
Collision Type_SAME DIR BOTH LEFT TURN,-0.59703
Vehicle Body Type_POLICE VEHICLE/EMERGENCY,-0.573001
Vehicle Movement_STARTING FROM LANE,-0.549901


In [95]:
# surface conditions
logreg_coef_df[logreg_coef_df.index.str.contains('Surface')].sort_values('coef', ascending = False)

Unnamed: 0,coef
Surface Condition_SNOW,0.853109
Surface Condition_ICE,0.603895
Surface Condition_WET,0.256052
Surface Condition_WATER(STANDING/MOVING),0.165033
Surface Condition_SAND,0.029099
Surface Condition_SLUSH,-0.005173
Surface Condition_OIL,-0.027624
"Surface Condition_MUD, DIRT, GRAVEL",-0.751465


In [96]:
# vehicle body type
logreg_coef_df[logreg_coef_df.index.str.contains('Body Type')].sort_values('coef', ascending = False)

Unnamed: 0,coef
Vehicle Body Type_LOW SPEED VEHICLE,0.92813
Vehicle Body Type_AMBULANCE/NON EMERGENCY,0.768788
Vehicle Body Type_AUTOCYCLE,0.728484
Vehicle Body Type_ALL TERRAIN VEHICLE (ATV),0.650262
Vehicle Body Type_FIRE VEHICLE/NON EMERGENCY,0.506274
"Vehicle Body Type_OTHER LIGHT TRUCKS (10,000LBS (4,536KG) OR LESS)",0.468524
Vehicle Body Type_FIRE VEHICLE/EMERGENCY,0.364891
"Vehicle Body Type_CARGO VAN/LIGHT TRUCK 2 AXLES (OVER 10,000LBS (4,536 KG))",0.363128
Vehicle Body Type_SCHOOL BUS,0.346505
"Vehicle Body Type_MEDIUM/HEAVY TRUCKS 3 AXLES (OVER 10,000LBS (4,536KG))",0.346375


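The logistic regression model assigns its largest positive coefficients to backing up (4.72), driver distraction (3.96), and driver substance abuse (3.08), so these are the strongest indicators that the driver was at fault.

Collision type and vehicle movement also carry a lot of weight. Rear-end and single-vehicle collisions, along with changing lanes, making a U-turn, passing, and entering a traffic lane, all have coefficients above 1. On the other side, being stopped in a traffic lane (-3.75) and a first impact at the six o'clock position (-2.95) are the strongest indicators that the driver was not at fault.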
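
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can be easier to read. The sketch below does this for a few coefficients copied from the output above. Each ratio holds the other features fixed and is relative to the category dropped by `drop_first=True`. `LogisticRegression` also applies L2 regularization by default, so treat these as rough magnitudes rather than precise effects.

```python
import math

# Coefficients copied from the logistic regression output above (log-odds scale)
coefs = {
    "Vehicle Movement_BACKING": 4.720270414999181,
    "Driver Distracted By": 3.9572506453202436,
    "Driver Substance Abuse": 3.0768833946992133,
    "Vehicle Movement_STOPPED IN TRAFFIC LANE": -3.748020648826403,
}

# exp(coefficient) is the multiplicative change in the odds of "at fault"
odds_ratios = {name: math.exp(value) for name, value in coefs.items()}

# Print from largest to smallest odds ratio
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f}")
```

A ratio above 1 raises the odds that the driver is at fault and a ratio below 1 lowers them.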



## Decision Tree Modeling

In [19]:
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train,y_train)

DecisionTreeClassifier(max_depth=5)

In [20]:
tree.score(X_train,y_train),tree.score(X_test,y_test)

(0.8617492498452062, 0.8596399314155078)

In [77]:
# impurity-based feature importances (the column keeps the name 'coef' to match the tables above)
tree_coef_dict = dict(zip(X.columns, tree.feature_importances_))
tree_coef_df = pd.DataFrame({'coef': tree.feature_importances_}, index=X.columns)
tree_coef_df.sort_values(by='coef', ascending=False).head(20)

Unnamed: 0,coef
Driver Distracted By,0.497713
Vehicle First Impact Location_SIX OCLOCK,0.178323
Collision Type_SAME DIR REAR END,0.112282
Collision Type_SINGLE VEHICLE,0.082451
Vehicle Movement_MOVING CONSTANT SPEED,0.059013
Vehicle Second Impact Location_TWELVE OCLOCK,0.035352
Vehicle Movement_STOPPED IN TRAFFIC LANE,0.011978
Vehicle Movement_BACKING,0.009143
Vehicle Second Impact Location_ONE OCLOCK,0.004277
Vehicle Body Type_POLICE VEHICLE/NON EMERGENCY,0.003764


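The feature importances for the decision tree differ from the logistic regression coefficients, in part because they measure different things: an importance reflects how much a feature reduces impurity across the tree's splits and carries no sign, while a coefficient is a signed effect on the log-odds.

The tree model agrees that driver distraction is the key indicator of driver responsibility in a crash, accounting for roughly half of the total importance (0.498).

Driver substance abuse ranks much lower here than in the logistic regression model and does not appear among the features shown above. This is worth investigating further.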
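
One possible explanation, offered as a hypothesis and not a finding, is prevalence: substance abuse is flagged in about 2.7% of rows, against 21.6% for distraction. Impurity-based importance is weighted by how many samples a split affects, so a rare flag can score low even when its effect is strong. The sketch below illustrates this on purely synthetic data (`rare_flag`, `common_flag`, and all probabilities are invented for the example).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data only; none of these numbers come from the crash data set.
rng = np.random.default_rng(0)
n_rows = 20_000
rare_flag = rng.random(n_rows) < 0.03    # rare, like substance abuse
common_flag = rng.random(n_rows) < 0.22  # common, like distraction

# Either flag raises the chance of the positive class from 35% to 95%.
p_positive = np.where(rare_flag | common_flag, 0.95, 0.35)
y_toy = (rng.random(n_rows) < p_positive).astype(int)
X_toy = np.column_stack([rare_flag, common_flag]).astype(int)

toy_logreg = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
toy_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_toy, y_toy)

# Column 0 is the rare flag, column 1 is the common flag
rare_coef, common_coef = toy_logreg.coef_[0]
rare_importance, common_importance = toy_tree.feature_importances_
print(f"coefficients: rare={rare_coef:.2f}, common={common_coef:.2f}")
print(f"importances:  rare={rare_importance:.2f}, common={common_importance:.2f}")
```

In this toy setup both flags receive large positive coefficients, while the tree assigns most of its importance to the common flag. Whether the same mechanism explains the crash data would need to be checked directly, for example by comparing fault rates with and without the substance abuse flag.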

### Random Forest

In [52]:
forest = RandomForestClassifier(max_depth = 14)
forest.fit(X_train,y_train)

RandomForestClassifier(max_depth=14)

In [53]:
forest.score(X_train,y_train),forest.score(X_test,y_test)

(0.9093146204772413, 0.9031244046485045)

In [78]:
# impurity-based feature importances averaged over the trees in the forest
forest_coef_dict = dict(zip(X.columns, forest.feature_importances_))
forest_coef_df = pd.DataFrame({'coef': forest.feature_importances_}, index=X.columns)
forest_coef_df.sort_values(by='coef', ascending=False).head(20)

Unnamed: 0,coef
Driver Distracted By,0.238367
Vehicle Second Impact Location_SIX OCLOCK,0.098224
Vehicle First Impact Location_SIX OCLOCK,0.090793
Collision Type_SAME DIR REAR END,0.077417
Vehicle Second Impact Location_TWELVE OCLOCK,0.065813
Vehicle Movement_STOPPED IN TRAFFIC LANE,0.054286
Vehicle First Impact Location_TWELVE OCLOCK,0.053971
Collision Type_SINGLE VEHICLE,0.04104
Vehicle Movement_MOVING CONSTANT SPEED,0.040181
Vehicle Movement_MAKING LEFT TURN,0.037229


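The random forest ranks features similarly to the single tree, with driver distraction again on top, and its test accuracy is about 4 percentage points higher (0.903 vs. 0.860). It also edges out the logistic regression model (0.894).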
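
To keep the comparison in one place, the short sketch below recomputes the gaps from the scores printed in the cells above. The values are copied by hand, and the baseline is the majority-class share from the null model section, which is only approximately the test-set baseline.

```python
# Accuracies copied from the cell outputs above
baseline = 0.556386  # share of the majority class ("not at fault")
test_accuracy = {
    "logistic regression": 0.89393217755763,
    "decision tree": 0.8596399314155078,
    "random forest": 0.9031244046485045,
}

# Gain over the baseline, in percentage points
gain_over_baseline = {
    name: round((accuracy - baseline) * 100, 1)
    for name, accuracy in test_accuracy.items()
}

# Gap between the forest and the single tree, in percentage points
forest_vs_tree = round(
    (test_accuracy["random forest"] - test_accuracy["decision tree"]) * 100, 1
)

print(gain_over_baseline)
print(forest_vs_tree)
```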



## To do:
- Examine the features further
- Isolate surface conditions, vehicle movement, and traffic control
- Decide which other models to try
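
For isolating a feature group, a small helper could generalize the `str.contains` cells used earlier. The sketch below is one way to do it. `coefs_for_group` is a hypothetical name, and the example frame holds a few coefficients copied from the outputs above in place of the full `logreg_coef_df`.

```python
import pandas as pd


def coefs_for_group(coef_df, prefix):
    """Return the rows whose dummy-variable name starts with `prefix`, largest first."""
    mask = coef_df.index.str.startswith(prefix + "_")
    return coef_df[mask].sort_values("coef", ascending=False)


# Small stand-in for logreg_coef_df, using values from the outputs above
example_coef_df = pd.DataFrame(
    {"coef": [0.603895, -0.663908, 4.720270, 0.853109]},
    index=[
        "Surface Condition_ICE",
        "Traffic Control_PERSON",
        "Vehicle Movement_BACKING",
        "Surface Condition_SNOW",
    ],
)

surface = coefs_for_group(example_coef_df, "Surface Condition")
print(surface)
```

Calling it with `"Vehicle Movement"` or `"Traffic Control"` on the full coefficient table would cover the other two groups in the list.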