# Car accidnt severity prediction
Audience: stakeholders of a map company (i.e. Google Map)

## Introduction

Facing traffic jams is an extremely unpleasant experience as it may cause serious delays and interruptions in one’s schedule. Predicting traffic jams and avoiding them would greatly enhance the efficiency and quality of people’s life. One of the critical factors that cause traffic jams is car accidents.

This report will suggest a new model that can predict the severity of car accidents in a particular area. The model will predict the severity level of possible accidents based on some background information of the daily condition (i.e. weather, road condition, etc.).

By predicting the accident’s severity that may occur in a particular area, our map system can recommend a more effective route for its users. This new intelligent system will preoccupy a unique positioning in the market, thus higher customer satisfaction and attraction of new customers will be made possible. This will not only reduce the chance of facing traffic jams, but also reduce the possibility of accidents, as there will be less traffic on the high-risk area. This will give us better brand image by fulfilling the social responsibility as a map company.

## Data

The data that will be used in the model building will be the collision data from Seattle. It includes all collisions provided by SPD and recorded by traffic records. The data has been recorded from 2004 until present, updated weekly. Below shows the first five rows of the data table.

(Its metadata can be found in this link: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf)

In [1]:
#import necessary programs
import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [2]:
collision_data=pd.read_csv("https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv")
collision_data.head()

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


There are 37 attributes in the data. Among them, severity code (SEVERITYCODE), which categorizes the severity of the accident from 0 (unknown) to 3 (fatal), will be our target value. As for attributes that will be used to predict the target value, collision address type (ADDRTYPE), weather condition (WEATHER) and road condition (ROADCOND) will be used.

Below shows the table with only the target and attributes stated above.

In [3]:
df=collision_data[['SEVERITYCODE','ADDRTYPE', 'WEATHER','ROADCOND']]
df.head()

Unnamed: 0,SEVERITYCODE,ADDRTYPE,WEATHER,ROADCOND
0,2,Intersection,Overcast,Wet
1,1,Block,Raining,Wet
2,1,Block,Overcast,Dry
3,1,Block,Clear,Dry
4,2,Intersection,Raining,Wet


### Data pre-processing
This section will pre-process the data before model building. Each attributes and target will be processed.
The data type is shown below.

In [4]:
df.dtypes

SEVERITYCODE     int64
ADDRTYPE        object
WEATHER         object
ROADCOND        object
dtype: object

### Severity code
The unique values of the data are shown below.

In [5]:
df['SEVERITYCODE'].unique()

array([2, 1])

The severity code target has values of 1 and 2. From the metadata, 1 indicates property damage and 2 indicates injury.

### Address type
Remove nan values, convert categorical variables into quantitative variables.

In [6]:
df['ADDRTYPE'].unique()

array(['Intersection', 'Block', 'Alley', nan], dtype=object)

In [7]:
df.dropna()
df=pd.concat([df, pd.get_dummies(df['ADDRTYPE'])], axis=1)
df.drop(['ADDRTYPE'],axis=1, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,Alley,Block,Intersection
0,2,Overcast,Wet,0,0,1
1,1,Raining,Wet,0,1,0
2,1,Overcast,Dry,0,1,0
3,1,Clear,Dry,0,1,0
4,2,Raining,Wet,0,0,1


### Weather
Remove nan values, convert categorical variables into quantitative variables.

In [8]:
df['WEATHER'].unique()

array(['Overcast', 'Raining', 'Clear', nan, 'Unknown', 'Other', 'Snowing',
       'Fog/Smog/Smoke', 'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt',
       'Severe Crosswind', 'Partly Cloudy'], dtype=object)

In [9]:
df.dropna(axis=0, inplace=True)
indexname=df[df['WEATHER']=='Unknown'].index
df.drop(indexname, inplace=True)
indexname2=df[df['WEATHER']=='Other'].index
df.drop(indexname2, inplace=True)
df['WEATHER'].unique()

array(['Overcast', 'Raining', 'Clear', 'Snowing', 'Fog/Smog/Smoke',
       'Sleet/Hail/Freezing Rain', 'Blowing Sand/Dirt',
       'Severe Crosswind', 'Partly Cloudy'], dtype=object)

In [10]:
df=pd.concat([df, pd.get_dummies(df['WEATHER'])],axis=1)
df.drop(['WEATHER'], axis=1, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,ROADCOND,Alley,Block,Intersection,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing
0,2,Wet,0,0,1,0,0,0,1,0,0,0,0,0
1,1,Wet,0,1,0,0,0,0,0,0,1,0,0,0
2,1,Dry,0,1,0,0,0,0,1,0,0,0,0,0
3,1,Dry,0,1,0,0,1,0,0,0,0,0,0,0
4,2,Wet,0,0,1,0,0,0,0,0,1,0,0,0


### Road condition
Remove nan values, convert categorical variables into quantitative variables.

In [11]:
df['ROADCOND'].unique()

array(['Wet', 'Dry', 'Unknown', 'Snow/Slush', 'Ice', 'Other',
       'Sand/Mud/Dirt', 'Standing Water', 'Oil'], dtype=object)

In [12]:
indexname=df[df['ROADCOND']=='Unknown'].index
df.drop(indexname, inplace=True)
indexname2=df[df['ROADCOND']=='Other'].index
df.drop(indexname2, inplace=True)
df['ROADCOND'].unique()

array(['Wet', 'Dry', 'Snow/Slush', 'Ice', 'Sand/Mud/Dirt',
       'Standing Water', 'Oil'], dtype=object)

In [13]:
df=pd.concat([df, pd.get_dummies(df['ROADCOND'])],axis=1)
df.drop(['ROADCOND'], axis=1, inplace=True)
df.head()

Unnamed: 0,SEVERITYCODE,Alley,Block,Intersection,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,Dry,Ice,Oil,Sand/Mud/Dirt,Snow/Slush,Standing Water,Wet
0,2,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
2,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
3,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,2,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


### Feature selection- X and y data
Set the attribute data (feature data) as X, and set the target data (severity code) as y.

In [14]:
X=pd.DataFrame(df)
X.drop(['SEVERITYCODE'], axis=1, inplace=True)
X[0:5]

Unnamed: 0,Alley,Block,Intersection,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,Dry,Ice,Oil,Sand/Mud/Dirt,Snow/Slush,Standing Water,Wet
0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
2,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
3,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


In [15]:
y=df['SEVERITYCODE'].values
y[0:5]

array([2, 1, 1, 1, 2])

### Normalize data

In [16]:
X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

  return self.partial_fit(X, y)
  if __name__ == '__main__':


array([[-0.0598465 , -1.3187756 ,  1.34155615, -0.01735958, -1.33456065,
        -0.05705211,  2.30601482, -0.00538224, -0.48704473, -0.01203576,
        -0.02525264, -0.07231742, -1.58365616, -0.08056091, -0.01864761,
        -0.01940933, -0.07247915, -0.02502187,  1.6358207 ],
       [-0.0598465 ,  0.75827912, -0.74540301, -0.01735958, -1.33456065,
        -0.05705211, -0.43364856, -0.00538224,  2.05319951, -0.01203576,
        -0.02525264, -0.07231742, -1.58365616, -0.08056091, -0.01864761,
        -0.01940933, -0.07247915, -0.02502187,  1.6358207 ],
       [-0.0598465 ,  0.75827912, -0.74540301, -0.01735958, -1.33456065,
        -0.05705211,  2.30601482, -0.00538224, -0.48704473, -0.01203576,
        -0.02525264, -0.07231742,  0.6314502 , -0.08056091, -0.01864761,
        -0.01940933, -0.07247915, -0.02502187, -0.61131394],
       [-0.0598465 ,  0.75827912, -0.74540301, -0.01735958,  0.74931027,
        -0.05705211, -0.43364856, -0.00538224, -0.48704473, -0.01203576,
        -0.025

### Train/test set split
Here we split the data into training and test sets, the test size will be 20% of the whole data set.

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.2, random_state=4)

## Methodology

### Decision Tree
The first classification model built is decision tree. A decision tree was built having entropy as its criterion, and maximum depth of 4 which leads to the best accuracy score. The model is then trained with X_train and y_train data splited above.

In [18]:
from sklearn.tree import DecisionTreeClassifier

In [19]:
Tree=DecisionTreeClassifier(criterion="entropy", max_depth=4)
Tree

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [20]:
Tree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Logistic regression
Then, logistic regression model is built. The inverse of regularization strength (C parameter) was set to 0.01. The model is then trained with X_train and y_train data splited above. 

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
LR=LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)
LR

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

### Support Vector Machine (SVM)
The last model is support vector machine (SVM). The kernel function chosen here was radial basis function (RBF).

In [23]:
from sklearn import svm

In [24]:
clf=svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

## Results (evaluation)
In this section, the test sets of the data will be tested in each classification method and its accuracy score (Jaccard score and F-1 score for all three models, and log loss for logistic regression) will be calculated. It will assess the accuracy and reliability of the model. This will then be visualized in a table. 

In [25]:
#import necessary programs
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

In [26]:
#Decision Tree assessment
yhat2=Tree.predict(X_test)
DT_ja=jaccard_similarity_score(y_test,yhat2)
DT_f1=f1_score(y_test, yhat2)


In [27]:
#Logistic regression assessment
yhat4=LR.predict(X_test)
yhat4_prob=LR.predict_proba(X_test)
LR_ja=jaccard_similarity_score(y_test,yhat4)
LR_f1=f1_score(y_test, yhat4)
LR_ll=log_loss(y_test, yhat4_prob)


In [28]:
#SVM assessment
yhat3=clf.predict(X_test)
SVM_ja=jaccard_similarity_score(y_test,yhat3)
SVM_f1=f1_score(y_test, yhat3)


In [29]:
#Making a table of evaluation metrics
data={'Algorithm':['Decision Tree','SVM','Logistic Regression'], 'Jaccard':[DT_ja, SVM_ja, LR_ja], 
      'F1-score':[DT_f1, SVM_f1, LR_f1], 'LogLoss':['NA','NA',LR_ll]}
report=pd.DataFrame(data)
report

Unnamed: 0,Algorithm,Jaccard,F1-score,LogLoss
0,Decision Tree,0.676062,0.806726,
1,SVM,0.675656,0.806371,
2,Logistic Regression,0.67612,0.806741,0.612382


## Discussion
From the above sections, two models which are decision tree, logistic regression, and support vector machine (SVM) were built and assessed. The three models resulted in very similar evaluation scores. For Jaccard index, Decision Tree model yielded the index of 0.676062, Logistic Regression model yielded 0.676120, and SVM 0.675656. For F-1 score, Decison Tree showed the score of 0.806726, Logistic Regression showed 0.806741, and SVM 0.806371. All of them yielded the Jaccard index and F-1 scores closed to 1, indicating the accuracy of the three models are considerably high. Moreover, since their scores are very similar, it is considered that there is no point of comparing the tiny difference between them.

Rather, the characteristics and properties of the three models should be considered. 

First of all, logistic regression model may not be the best fit for our car accident severity predictor as the model is used when the target data is binary. It was possible to implement this method because the car accident severity levels given in the data consist of the values 1 and 2 only. However, in the reality, the severity level can vary from 0 to 3, having 0 as unknown and 3 as fatal. This problem occurred from not having sufficient data, not the modelling itself, but still, the logistic regression model will fail to predict other levels of severity aside from levels 1 and 2.
?
For the similar reason, SVM is also considered not to be the best model. An SVM classifies binary data by finding the best hyperplane that separates the two layers of data. As mentioned above, in reality, the severity levels vary from 0 to 3. It is not appropriate to use this model in real life applications.


## Conclusion
In conclusion, among these three models, decision tree is considered as the best model. It is because decision tree models are non-parametric, and makes no assumption on the distribution of data and structure of the model. It can classify target variables into discrete sets of values. With this decision tree model, the severity of possible car accidents in given weather, road, and light conditions could be predicted, giving a better rout recommendation for the map users and ultimate reduce the occurance of car accidents, bringing our company a peerless positioning in the market.

On the other hand, it is unfortunate that the decision tree diagram could not be visualized due to the limitation of time and the program (taking unreasonably long time to execute certain codes). For some future recommendations, the tree diagram can be visualized for a better understanding and application of the model. Moreover, K Nearest Neighbors (KNN) model could not be built for the same limitation, therefore KNN model can be built and studied further for a better model evaluation and selection. 