# Classification of Traffic Accident Severity

## Introduction

Road safety is a primary concern for drivers since traffic accidents can be fatal. Without data, drivers rely on their intuitive perception of the environment to decide how cautiously to drive, and that perception can be inaccurate. For example, drivers may underestimate the danger of speeding on a wet road even after the rain has stopped, or overestimate the risks of driving in light fog.

Therefore, it is important to educate drivers with road safety knowledge that is backed up by empirical evidence, to boost driving safety and confidence. A multi-class model that classifies accident severity under different road conditions can inform drivers of the relevant risks so that they can adjust how cautiously they drive.

## Data I - Description

A multi-class model can be used to classify the various degrees of severity. The features used for classification would be variables such as road condition, light condition, weather, whether the driver is speeding, and the lane the driver is in. For example, consider a driver who is travelling on a 'wet' road, under 'dark, street lights on' lighting, 'not speeding', and in a particular lane.

Given such conditions, and assuming that an accident does occur, a trained model can classify the severity of the accident. This provides the driver with the 'worst-case scenario' rather than a probabilistic estimate of an accident occurring, which can still induce an appropriate level of caution in the driver.

The features I use for classification are: whether the driver was speeding, whether the accident was due to inattention, and whether the driver was under the influence of a substance. Other features were included initially, but they had too many categories, which prevented the models from converging.

## Data II - Preprocessing

In [47]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

In [25]:
csv_path = 'C:/Users/Ong Jia Yi/Desktop/STUDY/Co-Cirricular/Coursera - IBM Data Science Professional Certificate/9. Capstone Project/Data-Collisions.csv'
data = pd.read_csv(csv_path)
data = pd.DataFrame(data)

In [26]:
# subset data (copy so that later column assignments do not trigger SettingWithCopyWarning)
selected = ["SEVERITYCODE", "SPEEDING", "INATTENTIONIND", "UNDERINFL"]
data = data[selected].copy()

In [27]:
# check dimensions
data.shape

(194673, 4)

In [28]:
# checking value
for feature in ["SPEEDING", "INATTENTIONIND", "UNDERINFL"]:
    print(data[feature].unique())

[nan 'Y']
[nan 'Y']
['N' '0' nan '1' 'Y']


In [29]:
# replace 'nan' with 'N' for binary cases
data["SPEEDING"] = data["SPEEDING"].fillna("N")
data["INATTENTIONIND"] = data["INATTENTIONIND"].fillna("N")

# unify labelling for UNDERINFL ('0'/'1' become 'N'/'Y')
data["UNDERINFL"] = data["UNDERINFL"].replace({'0': 'N', '1': 'Y'})

# drop 'nan' values
data.dropna(axis=0, inplace=True)

In [30]:
data.shape

(189789, 4)
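The cleaning steps above can also be expressed as one small function, which makes them easier to check on a toy example. This is only a sketch: the `clean_flags` helper and the toy values are hypothetical and not part of the original notebook, and it assumes the three flag columns only ever contain the values printed earlier (`'Y'`, `'N'`, `'0'`, `'1'`, or missing).

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw flag columns (illustrative values, not from the dataset)
toy = pd.DataFrame({
    "SPEEDING": [np.nan, "Y", np.nan, "Y"],
    "INATTENTIONIND": ["Y", np.nan, np.nan, np.nan],
    "UNDERINFL": ["N", "0", "1", np.nan],
})

def clean_flags(df):
    """Return a copy with the three flag columns encoded as 0/1 integers."""
    out = df.copy()
    # Missing values in these two columns are treated as 'N', as in the notebook
    for col in ["SPEEDING", "INATTENTIONIND"]:
        out[col] = out[col].fillna("N")
    # Rows with an unknown UNDERINFL value are dropped
    out = out.dropna(subset=["UNDERINFL"])
    # One mapping covers both labelling schemes ('N'/'Y' and '0'/'1')
    to_binary = {"N": 0, "0": 0, "Y": 1, "1": 1}
    for col in ["SPEEDING", "INATTENTIONIND", "UNDERINFL"]:
        out[col] = out[col].map(to_binary).astype(int)
    return out

cleaned = clean_flags(toy)
print(cleaned)
```

Working on a copy and assigning the result back to each column avoids the chained `inplace=True` calls, which newer pandas versions warn about.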

In [31]:
data

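|        | SEVERITYCODE | SPEEDING | INATTENTIONIND | UNDERINFL |
|--------|--------------|----------|----------------|-----------|
| 0      | 2            | N        | N              | N         |
| 1      | 1            | N        | N              | N         |
| 2      | 1            | N        | N              | N         |
| 3      | 1            | N        | N              | N         |
| 4      | 2            | N        | N              | N         |
| ...    | ...          | ...      | ...            | ...       |
| 194668 | 2            | N        | N              | N         |
| 194669 | 1            | N        | Y              | N         |
| 194670 | 2            | N        | N              | N         |
| 194671 | 2            | N        | N              | N         |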


In [32]:
# convert binary categorical features to numerical values
for col in ["SPEEDING", "INATTENTIONIND", "UNDERINFL"]:
    data[col] = data[col].map({'N': 0, 'Y': 1})

In [33]:
# Select the three binary features.
# One-hot encoding of the multi-category columns ("LIGHTCOND", "ROADCOND", "WEATHER")
# is left commented out below, since those features were dropped.
Feature = data[["SPEEDING", "INATTENTIONIND", "UNDERINFL"]]
#Feature = pd.concat([Feature, pd.get_dummies(data['LIGHTCOND'])], axis=1)
#Feature.drop(["LIGHTCOND", "ROADCOND", "WEATHER"], axis = 1, inplace=True)
Feature.head()

|   | SPEEDING | INATTENTIONIND | UNDERINFL |
|---|----------|----------------|-----------|
| 0 | 0        | 0              | 0         |
| 1 | 0        | 0              | 0         |
| 2 | 0        | 0              | 0         |
| 3 | 0        | 0              | 0         |
| 4 | 0        | 0              | 0         |


In [34]:
# define datasets
X = Feature
y = data['SEVERITYCODE'].values.astype(str)

## Data III - Visualization

The following three tables show that the proportion of severity 2 accidents is higher when any of the three flags is set (speeding, inattention, or substance influence). This suggests that these features are associated with more severe outcomes when an accident happens.

In [35]:
data.groupby(['SPEEDING'])['SEVERITYCODE'].value_counts(normalize=True)

SPEEDING  SEVERITYCODE
0         1               0.702820
          2               0.297180
1         1               0.621665
          2               0.378335
Name: SEVERITYCODE, dtype: float64

In [36]:
data.groupby(['INATTENTIONIND'])['SEVERITYCODE'].value_counts(normalize=True)

INATTENTIONIND  SEVERITYCODE
0               1               0.707708
                2               0.292292
1               1               0.651166
                2               0.348834
Name: SEVERITYCODE, dtype: float64

In [37]:
data.groupby(['UNDERINFL'])['SEVERITYCODE'].value_counts(normalize=True)

UNDERINFL  SEVERITYCODE
0          1               0.703340
           2               0.296660
1          1               0.609473
           2               0.390527
Name: SEVERITYCODE, dtype: float64
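The same per-group proportions can be laid out as a two-way table, which makes the severity 2 shares easier to compare side by side. The sketch below uses `pd.crosstab` with `normalize="index"` on a small made-up frame; the toy values are assumptions for illustration and do not come from the collision data.

```python
import pandas as pd

# Toy data: four accidents without speeding, four with (illustrative values only)
toy = pd.DataFrame({
    "SPEEDING":     [0, 0, 0, 0, 1, 1, 1, 1],
    "SEVERITYCODE": [1, 1, 1, 2, 1, 1, 2, 2],
})

# normalize="index" makes each row sum to 1, i.e. shares of each severity within a group
shares = pd.crosstab(toy["SPEEDING"], toy["SEVERITYCODE"], normalize="index")

# Difference in the severity 2 share between the speeding and non-speeding groups
severity2_gap = shares.loc[1, 2] - shares.loc[0, 2]
print(shares)
print(severity2_gap)
```

On the real data, replacing `toy` with `data` should reproduce the numbers shown in the `groupby` outputs above, one feature at a time.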

## Methodology

I take a metric-based approach to selecting the classification algorithm: each candidate is trained on the same training set and compared on weighted F1-score and accuracy on a held-out test set.

In [40]:
# standardize features (zero mean, unit variance); the scaler is fit on all rows
X = preprocessing.StandardScaler().fit(X).transform(X)

# split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

# verifying set dimensions
print("Training set: ", X_train.shape, y_train.shape)
print("Testing set: ", X_test.shape, y_test.shape)

Training set:  (151831, 3) (151831,)
Testing set:  (37958, 3) (37958,)
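The cell above fits the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. One common alternative, sketched here on synthetic data, is to split first and wrap the scaler and the classifier in a scikit-learn `Pipeline` so that the scaler only sees the training rows. The toy arrays are assumptions for illustration; with three binary features the practical difference is likely to be small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three binary features and the severity labels
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(500, 3))
y_toy = rng.choice(["1", "2"], size=500, p=[0.7, 0.3])

# Split first, so that the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234
)

# The pipeline fits the scaler on the training rows only, then the classifier
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.01, solver="liblinear"),
)
pipe.fit(X_tr, y_tr)

# score() applies the training-set scaling to the test rows before predicting
test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, round(test_accuracy, 3))
```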


## Methodology I - KNN

In [41]:
model = KNeighborsClassifier(n_neighbors = 4).fit(X_train, y_train)

In [42]:
predicted = model.predict(X_test)

In [43]:
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)

## Methodology II - Decision Tree

In [48]:
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(X_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

In [49]:
predicted = Tree_model.predict(X_test)

In [50]:
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)

## Methodology III - Logistic Regression

In [51]:
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)

In [52]:
predicted = LR_model.predict(X_test)

In [53]:
LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)

## Results

In [54]:
table = {
    "Algorithm": ["KNN", "Decision Tree", "LogisticRegression"],
    "F1-score": [KNN_f1, Tree_f1, LR_f1],
    "Accuracy": [KNN_acc, Tree_acc, LR_acc]
}

table = pd.DataFrame(table)
table

|   | Algorithm          | F1-score | Accuracy |
|---|--------------------|----------|----------|
| 0 | KNN                | 0.578927 | 0.697692 |
| 1 | Decision Tree      | 0.573798 | 0.697956 |
| 2 | LogisticRegression | 0.574078 | 0.697929 |


We can see that the three models perform about the same on both metrics. I therefore choose the logistic regression model, since it can also give probability estimates.
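When comparing models on accuracy, it can help to check them against a trivial baseline that always predicts the most frequent class. The reported accuracies (about 0.698) are close to the share of severity 1 accidents in the tables above (roughly 0.70), so such a check may be informative here. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with an assumed 70/30 split; it is an illustration, not a result from the collision data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with an assumed 70/30 class split (not the real dataset)
y_toy = np.array(["1"] * 700 + ["2"] * 300)
# Features are ignored by the dummy model; random binary values as placeholders
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(1000, 3))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, pred)
# zero_division=0 silences the warning for the class that is never predicted
baseline_f1 = f1_score(y_toy, pred, average="weighted", zero_division=0)
print(baseline_acc, baseline_f1)
```

For a 70/30 split this baseline reaches an accuracy of 0.70 and a weighted F1-score of about 0.576, so a model needs to beat those values before its predictions add information beyond the class frequencies.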

In [68]:
table = {
    "Intercept": LR_model.intercept_,
    "Coef:SPEEDING": LR_model.coef_[:,0],
    "Coef:INATTENTION": LR_model.coef_[:,1],
    "Coef:UNDERINFL": LR_model.coef_[:,2]
}

table = pd.DataFrame(table)
table

|   | Intercept | Coef:SPEEDING | Coef:INATTENTION | Coef:UNDERINFL |
|---|-----------|---------------|------------------|----------------|
| 0 | -0.844432 | 0.073669      | 0.099627         | 0.087791       |


Since all three coefficients are positive, each of the three features is associated with higher odds of a severity 2 accident.
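To make the coefficients easier to read, they can be converted to odds ratios by exponentiating them. The sketch below does this for the values reported in the table. It relies on two points: in scikit-learn's binary logistic regression, `coef_` refers to the second entry of `classes_` (here `'2'`), and because the features were standardized, each odds ratio describes a one-standard-deviation increase in the feature rather than a switch from 0 to 1.

```python
import numpy as np

# Values copied from the coefficient table above
intercept = -0.844432
coefs = {
    "SPEEDING": 0.073669,
    "INATTENTIONIND": 0.099627,
    "UNDERINFL": 0.087791,
}

# exp(coefficient) = multiplicative change in the odds of severity 2
# per one-standard-deviation increase in the (standardized) feature
odds_ratios = {name: float(np.exp(value)) for name, value in coefs.items()}

# Predicted probability of severity 2 when every standardized feature is 0,
# i.e. when each feature sits at its mean
p_at_mean = float(1.0 / (1.0 + np.exp(-intercept)))

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
print(f"P(severity 2) at the feature means: {p_at_mean:.3f}")
```

All three odds ratios come out above 1, which matches the reading that each feature raises the odds of a severity 2 outcome.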

## Conclusion

The models provide empirical evidence against driving behaviours such as speeding, inattention, and driving under the influence: each is associated with more severe accidents.