# Capstone Project - Road Accident Severity Prediction

## Introduction: Business Understanding

Road accidents lead to fatalities and economic losses. Thus, <b>preventing loss of life and property</b> is a topic of concern. <br>

The Seattle government could deploy a system that alerts drivers, the health system, and the police, reminding them to be cautious and alert when an incident is likely. As a step towards this solution, I will build a model that will <b>predict the severity of an accident.</b><br>

This model can be the <b>driving mechanism</b> that could <b>warn people</b>, given the weather and the road conditions, about the <b>possibility of a car accident and how severe</b> it would be, so that they can <b>drive more carefully or even change their travel plans</b> if possible. <br>

In accident severity modeling, the <b>input vectors</b> are the characteristics of the accident, such as <b>driver behavior and attributes of vehicle, highway and environment characteristics</b> while the <b>output vector</b> is the corresponding <b>class of accident severity.</b><br>

By recognizing the <b>key factors that influence accident severity</b>, the solution may be of great utility to various <b>Government Departments/Authorities like Police, R&B and Transport</b> from a public policy point of view. <br>

The results of the analysis and modeling can be used by these departments to take appropriate measures to <b>reduce accident impact</b> and thereby <b>improve traffic safety</b>. They are also useful to insurers in terms of reduced claims, better underwriting, and rate making.

## Data

These traffic records were collected by the <b>SPD (Seattle Police Department).</b> <br>
The time-frame of this data is from <b>2004 to present.</b><br>
The data consists of <b>37 attributes and 194,673 collision records</b>. <br>
The dependent variable, <b>“SEVERITYCODE”</b>, contains codes that correspond to different <b>levels of severity</b> of an accident: 0, 1, 2, 2b and 3.

<b>Severity codes are as follows:
- 0: Unknown
- 1: Property damage
- 2: Injury
- 2b: Serious injury
- 3: Fatality</b>

Following is a table of the attributes along with their data types, lengths and descriptions.<br> This metadata is provided by the <b>SDOT Traffic Management Division.</b>

|Attribute          |Data type, length| Description                                                 |
|:------------------|:----------------|:------------------------------------------------------------|
|LOCATION           | Text, 255       |Description of the general location of the collision         | 
|EXCEPTRSNCODE      | Text, 10        |                                                             |
|EXCEPTRSNDESC      | Text, 300       |                                                             |
|SEVERITYCODE       | Text, 100       |A code that corresponds to the severity of the collision: 3: fatality, 2b: serious injury, 2: injury, 1: prop damage, 0: unknown|
|SEVERITYDESC       |Text             |A detailed description of the severity of the collision|
|COLLISIONTYPE      |Text, 300        |Collision type|
|PERSONCOUNT        |Double           |The total number of people involved in the collision|
|PEDCOUNT           |Double           |The number of pedestrians involved in the collision. |
|PEDCYLCOUNT        |Double           |The number of bicycles involved in the collision.|
|VEHCOUNT           |Double           |The number of vehicles involved in the collision.|
|INJURIES           |Double           |The number of total injuries in the collision.|
|SERIOUSINJURIES    |Double           |The number of serious injuries in the collision.|
|FATALITIES         |Double           |The number of fatalities in the collision.|
|INCDATE            |Date             |The date of the incident.|
|INCDTTM            |Text, 30         |The date and time of the incident.|
|JUNCTIONTYPE       |Text, 300        |Category of junction at which collision took place|
|SDOT_COLCODE       |Text, 10         |A code given to the collision by SDOT.|
|SDOT_COLDESC       |Text, 300        |A description of the collision corresponding to the collision code.|
|INATTENTIONIND     |Text, 1          |Whether or not collision was due to inattention.(Y/N)|
|UNDERINFL          |Text, 10         |Whether or not a driver involved was under the influence of drugs or alcohol.| 
|WEATHER            |Text, 300        |A description of the weather conditions during the time of the collision.|
|ROADCOND           |Text, 300        |The condition of the road during the collision.|
|LIGHTCOND          |Text, 300        |The light conditions during the collision.|
|PEDROWNOTGRNT      |Text, 1          |Whether or not the pedestrian right of way was not granted. (Y/N)|
|SDOTCOLNUM         |Text, 10         |A number given to the collision by SDOT.|
|SPEEDING           |Text, 1          |Whether or not speeding was a factor in the collision. (Y/N)|
|ST_COLCODE         |Text, 10         |A code provided by the state that describes the collision|
|ST_COLDESC         |Text, 300        |A description that corresponds to the state’s coding designation.|
|SEGLANEKEY         |Long             |A key for the lane segment in which the collision occurred.|
|CROSSWALKKEY       |Long             |A key for the crosswalk at which the collision occurred.|
|HITPARKEDCAR       |Text, 1          |Whether or not the collision involved hitting a parked car. (Y/N) |

<H4> As observed in the metadata, the severity code has 4 classes (excluding 0, which marks unknown severity). Thus, this is a multi-class classification problem.<br>
I intend to build a machine learning model that predicts the severity and classifies it into these severity codes for public understanding </H4>

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import MarkerCluster
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

### Data Loading

In [2]:
collisions_df = pd.read_csv('Collisions.csv')

In [3]:
collisions_df.head()

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,-122.320757,47.609408,1,328476,329976,EA08706,Matched,Block,,BROADWAY BETWEEN E COLUMBIA ST AND BOYLSTON AVE,...,Wet,Dark - Street Lights On,,,,11.0,From same direction - both going straight - bo...,0,0,N
1,-122.319561,47.662221,2,328142,329642,EA06882,Matched,Block,,8TH AVE NE BETWEEN NE 45TH E ST AND NE 47TH ST,...,Dry,Daylight,,,,32.0,One parked--one moving,0,0,Y
2,-122.327525,47.604393,3,20700,20700,1181833,Unmatched,Block,,JAMES ST BETWEEN 6TH AVE AND 7TH AVE,...,,,,4030032.0,,,,0,0,N
3,-122.327525,47.708622,4,332126,333626,M16001640,Unmatched,Block,,NE NORTHGATE WAY BETWEEN 1ST AVE NE AND NE NOR...,...,,,,,,,,0,0,N
4,-122.29212,47.559009,5,328238,329738,3857118,Unmatched,Block,,M L KING JR ER WAY S BETWEEN S ANGELINE ST AND...,...,,,,,,,,0,0,N


In [4]:
collisions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 40 columns):
X                  213918 non-null float64
Y                  213918 non-null float64
OBJECTID           221389 non-null int64
INCKEY             221389 non-null int64
COLDETKEY          221389 non-null int64
REPORTNO           221389 non-null object
STATUS             221389 non-null object
ADDRTYPE           217677 non-null object
INTKEY             71884 non-null float64
LOCATION           216801 non-null object
EXCEPTRSNCODE      100986 non-null object
EXCEPTRSNDESC      11779 non-null object
SEVERITYCODE       221388 non-null object
SEVERITYDESC       221389 non-null object
COLLISIONTYPE      195159 non-null object
PERSONCOUNT        221389 non-null int64
PEDCOUNT           221389 non-null int64
PEDCYLCOUNT        221389 non-null int64
VEHCOUNT           221389 non-null int64
INJURIES           221389 non-null int64
SERIOUSINJURIES    221389 non-null int64
FATALITIES     

### Cleaning the Dataset

Let's examine the latitudinal and longitudinal data

In [5]:
collisions_df[['X','Y']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 2 columns):
X    213918 non-null float64
Y    213918 non-null float64
dtypes: float64(2)
memory usage: 3.4 MB


In [6]:
collisions_df[['X','Y']].isna().sum()

X    7471
Y    7471
dtype: int64

Dropping the rows that have null values in the longitudinal and latitudinal data columns

In [7]:
collisions_df = collisions_df.dropna(axis=0, subset=['X','Y'])

In [8]:
collisions_df[['X','Y']].isna().sum()

X    0
Y    0
dtype: int64

<H2>The data needs to be pre-processed</H2>

The collisions dataset is sourced from the <b>Seattle Open GeoData Portal</b> and is updated weekly. It contains several unique identifiers and spatial features that are irrelevant for further statistical analysis and model building.<br>  

Features like <b>OBJECTID, INCKEY, COLDETKEY, INTKEY and REPORTNO</b> are unique identifiers.<br>

Features like <b>EXCEPTRSNCODE, EXCEPTRSNDESC and LOCATION</b> won't contribute to the analysis.<br> 
The LOCATION data could help in populating maps and counting collisions in a particular area, but the latitudes and longitudes are already in place to serve that purpose.<br>

Features <b>INCDATE (incident date) and INCDTTM (incident timestamp)</b>: the timestamp column doesn't have consistent values, and most values do not contain the time. Let's keep the incident date from the INCDATE column.<br>

Features like <b>SDOT_COLCODE and SDOT_COLDESC</b> are redundant, and <b>ST_COLCODE, ST_COLDESC and SDOTCOLNUM</b> repeat information already present, so they shouldn't be considered in further analysis. <br>

Feature <b>COLLISIONTYPE</b> has some missing values. These can be filled by mapping from SDOT_COLCODE: SDOT_COLDESC holds the collision description and can be used along with SDOT_COLCODE to impute the null values.<br>

Features like <b>JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND</b> contain null values; it would be best to drop these rows.<br>

Features like <b>INATTENTIONIND, UNDERINFL, SPEEDING</b> are binary variables, and some of them have values recorded for only one class. Thus, we can fill the blank cells with the other value. <br>

Feature <b>PEDROWNOTGRNT</b> has 95% null values, and considering it for model building would create bias, so it's safe to exclude it.

Features like <b>SEGLANEKEY, CROSSWALKKEY, HITPARKEDCAR </b> should be examined; if they aren't correlated with the target variable, they should be excluded as noise.
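
One way to sanity-check the identifier claim above is to compare each column's number of distinct values with the number of rows. The sketch below runs on a small hypothetical frame (`toy_df`), not on the real data, and `find_identifier_columns` is an illustrative helper rather than part of this notebook.

```python
import pandas as pd

def find_identifier_columns(df, threshold=0.99):
    """Return columns whose share of distinct non-null values is at least `threshold`."""
    n_rows = len(df)
    if n_rows == 0:
        return []
    ratios = df.nunique(dropna=True) / n_rows
    return ratios[ratios >= threshold].index.tolist()

# Hypothetical toy data standing in for the collisions table
toy_df = pd.DataFrame({
    "OBJECTID": [1, 2, 3, 4],
    "REPORTNO": ["A1", "B2", "C3", "D4"],
    "WEATHER": ["Clear", "Rain", "Clear", "Clear"],
})

id_cols = find_identifier_columns(toy_df)
print(id_cols)
```

Continuous columns such as coordinates can also be nearly unique, so a check like this only suggests candidates and the final choice still needs the metadata.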

In [9]:
#The INCDATE column has the date with zeroes for the time (00:00:00), 
#so we can keep just the date and convert it into a datetime 

In [10]:
collisions_df['INCDATE']= pd.to_datetime(collisions_df['INCDATE']) 
collisions_df['INCDATE'] = collisions_df['INCDATE'].dt.date
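
The two-step conversion above leaves `INCDATE` holding Python `date` objects. A possible alternative, sketched here on hypothetical sample strings, is `Series.dt.normalize()`, which sets the time part to midnight while keeping a datetime dtype, so the `.dt` accessors remain available.

```python
import pandas as pd

# Hypothetical sample values; the real column may use a different text format
sample = pd.Series(["2020-01-19 00:00:00", "2005-04-11 13:45:00"])

parsed = pd.to_datetime(sample)      # parse the strings into datetime values
dates_only = parsed.dt.normalize()   # reset the time to midnight, dtype stays datetime

print(dates_only.dtype)
print(dates_only.dt.year.tolist())
```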

In [11]:
# Filling the collision type missing values
# Create a dictionary by zipping the Collision types and SDOT_Col codes and their description
# Filling the missing values with the corresponding collision types

In [12]:
collision_type = dict(zip(collisions_df.SDOT_COLCODE,collisions_df.COLLISIONTYPE))
collision_sdot = dict(zip(collisions_df.SDOT_COLCODE,collisions_df.SDOT_COLDESC))

In [13]:
collisions_df.COLLISIONTYPE = collisions_df.COLLISIONTYPE.fillna(collisions_df.SDOT_COLCODE.map(collision_type))

In [14]:
collisions_df['COLLISIONTYPE'].isna().sum()

8797
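
`dict(zip(...))` keeps only the last value seen for each key, so a code whose last row has a missing `COLLISIONTYPE` maps to `NaN`. That may be part of the reason some gaps remain. A hedged alternative is to map each code to its most frequent non-null collision type. The frame `toy_df` and the helper `most_common` below are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: code 11 ends with a missing type, code 0 has no type at all
toy_df = pd.DataFrame({
    "SDOT_COLCODE": [11, 11, 11, 14, 14, 0],
    "COLLISIONTYPE": ["Rear Ended", "Rear Ended", np.nan, "Sideswipe", np.nan, np.nan],
})

def most_common(values):
    """Most frequent non-null value, or NaN when every value is missing."""
    modes = values.dropna().mode()
    return modes.iloc[0] if not modes.empty else np.nan

# One representative collision type per code, skipping codes with no known type
code_to_type = (
    toy_df.groupby("SDOT_COLCODE")["COLLISIONTYPE"]
    .agg(most_common)
    .dropna()
)

# Fill only the missing entries from the per-code lookup
toy_df["COLLISIONTYPE"] = toy_df["COLLISIONTYPE"].fillna(
    toy_df["SDOT_COLCODE"].map(code_to_type)
)
print(toy_df["COLLISIONTYPE"].isna().sum())
```

Codes that never appear with a known type (code 0 in the toy data) still stay missing and need a manual decision.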

In [15]:
coll_na = collisions_df.loc[collisions_df['COLLISIONTYPE'].isna(), 'SDOT_COLCODE']

In [16]:
coll_na.value_counts()

0.0     8244
12.0     322
15.0     230
Name: SDOT_COLCODE, dtype: int64

In [17]:
codes = [0,12,15]
descs = [collision_sdot[k] for k in codes if k in collision_sdot]
print(descs)

['NOT ENOUGH INFORMATION / NOT APPLICABLE', 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE AT ANGLE', 'MOTOR VEHICLE STRUCK MOTOR VEHICLE, RIGHT SIDE SIDESWIPE']


<b>Let's substitute 'COLLISIONTYPE' with 'Other' for 'SDOT_COLCODE' 0, 'Angles' for 12 and 'Sideswipe' for 15</b>

In [18]:
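# Only overwrite rows whose COLLISIONTYPE is still missing
missing_type = collisions_df['COLLISIONTYPE'].isna()
collisions_df.loc[missing_type & (collisions_df['SDOT_COLCODE'] == 0), 'COLLISIONTYPE'] = 'Other'
collisions_df.loc[missing_type & (collisions_df['SDOT_COLCODE'] == 12), 'COLLISIONTYPE'] = 'Angles'
collisions_df.loc[missing_type & (collisions_df['SDOT_COLCODE'] == 15), 'COLLISIONTYPE'] = 'Sideswipe'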

In [19]:
collisions_df = collisions_df.dropna(axis=0, subset=['COLLISIONTYPE'])
collisions_df['COLLISIONTYPE'].isna().sum()

0

In [20]:
# Let us examine columns JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND

In [21]:
collisions_df[['JUNCTIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND']].isna().sum()

JUNCTIONTYPE     8189
WEATHER         24198
ROADCOND        24119
LIGHTCOND       24285
dtype: int64

<b>JUNCTIONTYPE</b> has about 4% of its values missing, <br>
while <b>WEATHER, ROADCOND, LIGHTCOND</b> each have about 11% of their values missing. These shares are small enough that we can work with the remaining data. <br> Drop the rows with missing values in these features.
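
The missing-value shares can be computed directly instead of being estimated from the counts. A minimal sketch on hypothetical toy data (`toy_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy_df = pd.DataFrame({
    "JUNCTIONTYPE": ["Mid-Block", np.nan, "Intersection", "Mid-Block"],
    "WEATHER": ["Clear", np.nan, np.nan, "Rain"],
})

# isna() yields booleans; their mean is the fraction of missing values per column
missing_pct = toy_df.isna().mean().mul(100).round(1)
print(missing_pct)

# Rows that survive dropping every row with at least one missing value
rows_kept = len(toy_df.dropna())
print(rows_kept)
```

Because gaps in different columns do not always fall on the same rows, the share of rows lost by `dropna` can be larger than any single column's missing share.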

In [22]:
collisions_df = collisions_df.dropna(axis=0, subset=['JUNCTIONTYPE', 'WEATHER', 'ROADCOND', 'LIGHTCOND'])

In [23]:
#Let's examine column INATTENTIONIND - Incident due to not paying attention on the road

In [24]:
collisions_df['INATTENTIONIND'].value_counts()

Y    29164
Name: INATTENTIONIND, dtype: int64

In [25]:
collisions_df['INATTENTIONIND'].isna().sum()

156174

In [26]:
#We will map 'Y' to 1 and fill the missing values with 0

In [27]:
collisions_df['INATTENTIONIND'] = collisions_df['INATTENTIONIND'].map({'Y': 1})

In [28]:
collisions_df['INATTENTIONIND'] = collisions_df['INATTENTIONIND'].fillna(0)

In [29]:
collisions_df['INATTENTIONIND'].value_counts()

0.0    156174
1.0     29164
Name: INATTENTIONIND, dtype: int64

In [31]:
#Let's examine column UNDERINFL - Incident when driver was under the influence of drugs/alcohol

In [32]:
collisions_df['UNDERINFL'].value_counts()

N    97263
0    78770
Y     5200
1     4104
Name: UNDERINFL, dtype: int64

In [33]:
collisions_df['UNDERINFL'] = collisions_df['UNDERINFL'].map({'N': 0, '0': 0, 'Y': 1, '1': 1})

In [34]:
collisions_df['UNDERINFL'].value_counts()

0.0    176033
1.0      9304
Name: UNDERINFL, dtype: int64

In [35]:
collisions_df['UNDERINFL'].isna().sum()

1

In [36]:
collisions_df = collisions_df.dropna(axis=0, subset=['UNDERINFL'])

In [36]:
#Let's examine column SPEEDING - Incident when the driver was speeding

In [38]:
collisions_df['SPEEDING'] = collisions_df['SPEEDING'].map({'Y': 1})
collisions_df['SPEEDING'] = collisions_df['SPEEDING'].fillna(0)

In [39]:
collisions_df['SPEEDING'].value_counts()

0.0    176112
1.0      9225
Name: SPEEDING, dtype: int64
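
The three indicator columns above follow the same pattern, so the encoding could be wrapped in one small helper. This is only a sketch: `encode_flag` is a hypothetical function, and it assumes that blanks and unrecognised values should count as 0, which differs slightly from the UNDERINFL step above, where the single unmapped row is dropped instead.

```python
import numpy as np
import pandas as pd

def encode_flag(series, true_values=("Y", "1"), false_values=("N", "0")):
    """Map yes/no style values to 1/0; blanks and unknown values become 0 (an assumption)."""
    mapping = {value: 1 for value in true_values}
    mapping.update({value: 0 for value in false_values})
    return series.map(mapping).fillna(0).astype(int)

# Hypothetical samples imitating the value patterns seen above
speeding = pd.Series(["Y", np.nan, np.nan, "Y"])
underinfl = pd.Series(["N", "0", "Y", "1", np.nan])

print(encode_flag(speeding).tolist())
print(encode_flag(underinfl).tolist())
```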

In [40]:
#Let's examine the 3 columns: 'SEGLANEKEY','CROSSWALKKEY','HITPARKEDCAR'

In [41]:
collisions_df[['SEGLANEKEY','CROSSWALKKEY','HITPARKEDCAR']].nunique()

SEGLANEKEY      2049
CROSSWALKKEY    2305
HITPARKEDCAR       2
dtype: int64

Features <b>SEGLANEKEY and CROSSWALKKEY</b> each have more than 2,000 distinct values, and creating more than 2,000 dummy variables isn't wise, so it is safe to exclude them from further analysis. <br>

Feature <b>HITPARKEDCAR</b> is a binary feature, but its information is already captured by the COLLISIONTYPE value 'Parked Car'. Thus, it is safe to exclude this feature from further analysis.
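
Whether HITPARKEDCAR really duplicates the parked-car collision type can be checked with a cross-tabulation. The sketch below uses hypothetical toy rows (`toy_df`) and assumes the category label is exactly `'Parked Car'`.

```python
import pandas as pd

# Hypothetical toy rows; the last one is a parked-car collision type without the flag
toy_df = pd.DataFrame({
    "COLLISIONTYPE": ["Parked Car", "Parked Car", "Angles", "Rear Ended", "Parked Car"],
    "HITPARKEDCAR": ["Y", "Y", "N", "N", "N"],
})

is_parked_type = toy_df["COLLISIONTYPE"].eq("Parked Car")

# Collapse the collision type into two groups, then count rows per flag value
type_group = toy_df["COLLISIONTYPE"].where(is_parked_type, "Other type")
overlap = pd.crosstab(type_group, toy_df["HITPARKEDCAR"])
print(overlap)

# Share of rows where the two indicators agree
agreement = (is_parked_type == toy_df["HITPARKEDCAR"].eq("Y")).mean()
print(agreement)
```

An agreement close to 1 on the real data would support dropping the flag; a lower value would suggest it carries extra information.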

In [43]:
collisions_df['SEVERITYCODE'].value_counts()

1     125527
2      56498
2b      2980
3        330
0          1
Name: SEVERITYCODE, dtype: int64

|SEVERITYCODE|MEANING|         
|:-----------:|:---------------:|
|1  |	Accidents resulting in property damage|
|2  |	Accidents resulting in injuries|
|2b |	Accidents resulting in serious injuries|
|3	|   Accidents resulting in fatalities|
|0	|   Data Unavailable i.e. Blanks|

In [44]:
#As we can observe, there is 1 record which doesn't have any data. It's appropriate to drop this record

In [53]:
collisions_df = collisions_df[collisions_df.SEVERITYCODE != '0']

In [54]:
collisions_df['SEVERITYCODE'].value_counts()

1     125527
2      56498
2b      2980
3        330
Name: SEVERITYCODE, dtype: int64
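
The counts above are heavily skewed towards code 1. Expressing them as percentages makes the imbalance easier to see. The sketch below simply re-enters the counts printed above.

```python
import pandas as pd

# Counts copied from the value_counts() output above
severity_counts = pd.Series({"1": 125527, "2": 56498, "2b": 2980, "3": 330})

# Percentage of records in each severity class
class_share = (severity_counts / severity_counts.sum() * 100).round(2)
print(class_share)
```

On the DataFrame itself, `collisions_df['SEVERITYCODE'].value_counts(normalize=True)` gives the same proportions as fractions.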

Convert <b>INCDATE</b> to type <b>'datetime'</b>

Convert the categorical columns:<br> 
<b>COLLISIONTYPE, JUNCTIONTYPE, WEATHER, ROADCOND, LIGHTCOND, SPEEDING, UNDERINFL, INATTENTIONIND, SEVERITYCODE</b> to datatype <B>'category'</B>

In [68]:
collisions_df["INCDATE"] = pd.to_datetime(collisions_df["INCDATE"])

In [71]:
cat_cols = ['COLLISIONTYPE','JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND','SPEEDING',
            'UNDERINFL','INATTENTIONIND', 'SEVERITYCODE']
collisions_df[cat_cols] = collisions_df[cat_cols].astype('category')
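
Since the severity codes have a natural order (1 < 2 < 2b < 3), the target could optionally be stored as an ordered categorical. This is a sketch of that option, not what the cell above does, and `severity_dtype` is a hypothetical name.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Explicit category order from least to most severe
severity_dtype = CategoricalDtype(categories=["1", "2", "2b", "3"], ordered=True)

# Hypothetical sample of severity codes
sample = pd.Series(["2", "1", "3", "2b", "1"]).astype(severity_dtype)

print(sample.max())                # most severe code present in the sample
print((sample >= "2b").sum())      # number of records at serious injury or worse
print(sample.cat.codes.tolist())   # integer codes that follow the declared order
```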

<H4> LET'S SELECT THE RELEVANT FEATURES AS OUR FINAL CLEAN DATASET FOR FURTHER DATA ANALYSIS</H4>

In [72]:
colisn_df = collisions_df[['X','Y','INCDATE','COLLISIONTYPE','JUNCTIONTYPE','WEATHER','ROADCOND','LIGHTCOND','SPEEDING',
                           'UNDERINFL','INATTENTIONIND','PERSONCOUNT','PEDCOUNT','PEDCYLCOUNT','VEHCOUNT','INJURIES',
                          'SERIOUSINJURIES','FATALITIES','SEVERITYCODE']]

In [73]:
colisn_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 185335 entries, 0 to 221388
Data columns (total 19 columns):
X                  185335 non-null float64
Y                  185335 non-null float64
INCDATE            185335 non-null datetime64[ns]
COLLISIONTYPE      185335 non-null category
JUNCTIONTYPE       185335 non-null category
WEATHER            185335 non-null category
ROADCOND           185335 non-null category
LIGHTCOND          185335 non-null category
SPEEDING           185335 non-null category
UNDERINFL          185335 non-null category
INATTENTIONIND     185335 non-null category
PERSONCOUNT        185335 non-null int64
PEDCOUNT           185335 non-null int64
PEDCYLCOUNT        185335 non-null int64
VEHCOUNT           185335 non-null int64
INJURIES           185335 non-null int64
SERIOUSINJURIES    185335 non-null int64
FATALITIES         185335 non-null int64
SEVERITYCODE       185335 non-null category
dtypes: category(9), datetime64[ns](1), float64(2), int64(7)
memory us