# Applied Data Science Capstone Project
## Determining severity of an accident
***
### Table of Contents
+ Introduction/Business Problem
+ Data

### Introduction/Business Problem 

Road accidents are extremely common, and they often lead to loss of property and even life. It is therefore useful to have a tool that alerts drivers to be more careful depending on the weather and road conditions. If the predicted severity is high, the driver can decide whether to be extra cautious or to delay the trip if possible.
This tool can also help the police enforce additional safety protocols.

The goal of this project is to predict road accident severity depending on certain weather and road conditions and the time of day.
The data set used for training the model is the one recorded by the Seattle Department of Transportation (SDOT), which includes all types of collisions from 2004 to present.
It has 194,673 records with 38 attributes.

### Data

We will be using the shared data, i.e. the collision data recorded by the Seattle Department of Transportation (SDOT), which is available at:
https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv


In order to develop an accident severity prediction model, we will consider the following attributes.

+ WEATHER - A description of the weather conditions during the time of the collision.
+ ROADCOND - The condition of the road during the collision.
+ LIGHTCOND - The light conditions during the collision.


The target is the severity of the collision, which is represented by the column:

+ SEVERITYCODE - A code that corresponds to the severity of the collision

We have two possible outcomes for this in our data set:

+ 1 - Property Damage Only Collision
+ 2 - Injury Collision
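
As a small illustration of the two outcomes listed above, the codes can be mapped to readable labels. This is only a sketch: the `SEVERITY_LABELS` dictionary and the toy `severity` series are hypothetical names introduced here and are not part of the notebook.

```python
import pandas as pd

# Hypothetical lookup table built from the two outcomes listed above.
SEVERITY_LABELS = {
    1: "Property Damage Only Collision",
    2: "Injury Collision",
}

# Toy target column standing in for df["SEVERITYCODE"] (values are illustrative).
severity = pd.Series([2, 1, 1, 1, 2], name="SEVERITYCODE")

# Replace each code with its description; unmapped codes would become NaN.
severity_desc = severity.map(SEVERITY_LABELS)
print(severity_desc.tolist())
```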


In [1]:
#import required libraries
import pandas as pd
import numpy as np

In [2]:
#data file - shared data for SDOT 
data_file = "https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv"

In [3]:
#read data from file to pandas data frame
df = pd.read_csv(data_file)
df.head()


| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | | ... | Dry | Daylight | | | | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | | 4028032.0 | | 10 | Entering at angle | 0 | 0 | N |


In [4]:
df.shape

(194673, 38)

In [6]:
#Checking the data types
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

In [15]:
df["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
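
The counts above show that the two classes are not equally represented. A minimal sketch of the arithmetic, using the two counts copied from the output above (the variable names are illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

# Share of each class in the full data set.
shares = counts / counts.sum()
print(shares.round(3))

# Ratio of the majority class to the minority class.
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

On the real data frame, `df["SEVERITYCODE"].value_counts(normalize=True)` should give the same shares directly.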

In [45]:
#Create a new df with the independent variables (attributes) and the target variable
df_final = df[['SEVERITYCODE', 'WEATHER', 'ROADCOND','LIGHTCOND']].copy()
df_final.head()

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight |
| 1 | 1 | Raining | Wet | Dark - Street Lights On |
| 2 | 1 | Overcast | Dry | Daylight |
| 3 | 1 | Clear | Dry | Daylight |
| 4 | 2 | Raining | Wet | Daylight |


In [46]:
#Check for missing data in the selected columns
print("Missing values in each column")
print("SEVERITYCODE : " , df_final['SEVERITYCODE'].isnull().sum(axis=0))
print("WEATHER : " , df_final['WEATHER'].isnull().sum(axis=0))
print("ROADCOND : " , df_final['ROADCOND'].isnull().sum(axis=0))
print("LIGHTCOND : " , df_final['LIGHTCOND'].isnull().sum(axis=0))

Missing values in each column
SEVERITYCODE :  0
WEATHER :  5081
ROADCOND :  5012
LIGHTCOND :  5170
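
The per-column check above can also be written as a single call on the whole frame. The sketch below uses a small made-up frame named `toy` in place of `df_final`, so the counts it prints are not those of the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_final; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", np.nan, "Raining", "Clear"],
    "ROADCOND": ["Dry", "Wet", np.nan, "Dry"],
    "LIGHTCOND": ["Daylight", np.nan, np.nan, "Dusk"],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()
# Fraction of missing values per column.
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share.round(2))
```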


In [47]:
#Since the number of rows with missing values is small compared to the total number of records, we can drop these rows
df_final.dropna(subset=['WEATHER', 'ROADCOND','LIGHTCOND'], axis=0, inplace=True)
df_final.shape

(189337, 4)
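
To put the size of this drop in perspective, the two shapes reported so far can be compared directly. The numbers below are copied from the `df.shape` and `df_final.shape` outputs; the variable names are illustrative.

```python
rows_before = 194673  # rows in the full data set (df.shape)
rows_after = 189337   # rows left after dropna (df_final.shape)

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before

print(rows_dropped, f"{share_dropped:.2%}")  # 5336 2.74%
```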

In [48]:
#Analysing the values of Attribute - WEATHER
df_final.groupby(['WEATHER'])['SEVERITYCODE'].value_counts()

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1                  40
                          2                  15
Clear                     1               75200
                          2               35808
Fog/Smog/Smoke            1                 382
                          2                 187
Other                     1                 708
                          2                 116
Overcast                  1               18942
                          2                8739
Partly Cloudy             2                   3
                          1                   2
Raining                   1               21949
                          2               11168
Severe Crosswind          1                  18
                          2                   7
Sleet/Hail/Freezing Rain  1                  85
                          2                  28
Snowing                   1                 732
                          2                 169
Unknown                   (output truncated)

In [49]:
#Drop Unknown and Other
df_final.drop(df_final[df_final.WEATHER == 'Unknown'].index, inplace=True)
df_final.drop(df_final[df_final.WEATHER == 'Other'].index, inplace=True)

In [50]:
#Analysing the values of Attribute - ROADCOND
df_final.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts()

ROADCOND        SEVERITYCODE
Dry             1               83454
                2               39822
Ice             1                 849
                2                 264
Oil             1                  36
                2                  24
Other           1                  67
                2                  41
Sand/Mud/Dirt   1                  43
                2                  22
Snow/Slush      1                 739
                2                 161
Standing Water  1                  78
                2                  30
Unknown         1                 803
                2                 116
Wet             1               31281
                2               15644
Name: SEVERITYCODE, dtype: int64

In [51]:
#Drop Unknown and Other
df_final.drop(df_final[df_final.ROADCOND == 'Unknown'].index, inplace=True)
df_final.drop(df_final[df_final.ROADCOND == 'Other'].index, inplace=True)

In [52]:
#Analysing the values of Attribute - LIGHTCOND
df_final.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts()

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1                1086
                          2                 322
Dark - Street Lights Off  1                 805
                          2                 309
Dark - Street Lights On   1               32485
                          2               14263
Dark - Unknown Lighting   1                   5
                          2                   3
Dawn                      1                1607
                          2                 806
Daylight                  1               74538
                          2               38080
Dusk                      1                3748
                          2                1900
Other                     1                 138
                          2                  47
Unknown                   1                2068
                          2                 237
Name: SEVERITYCODE, dtype: int64

In [53]:
#Drop Unknown and Other
df_final.drop(df_final[df_final.LIGHTCOND == 'Unknown'].index, inplace=True)
df_final.drop(df_final[df_final.LIGHTCOND == 'Other'].index, inplace=True)
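
The three "Drop Unknown and Other" cells follow the same pattern, so they could be expressed as one filter. This is a sketch on a made-up frame named `toy`, not the notebook's own code. The names `uninformative`, `feature_cols` and `toy_clean` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for df_final; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1, 2],
    "WEATHER": ["Clear", "Unknown", "Raining", "Other", "Clear", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Wet", "Dry", "Unknown", "Wet"],
    "LIGHTCOND": ["Daylight", "Dusk", "Other", "Daylight", "Daylight",
                  "Dark - Unknown Lighting"],
})

uninformative = ["Unknown", "Other"]
feature_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

# True for rows where any of the three columns holds exactly "Unknown" or "Other".
has_uninformative = toy[feature_cols].isin(uninformative).any(axis=1)

# Keep only the rows without such labels.
toy_clean = toy[~has_uninformative].copy()
print(toy_clean)
```

Because `isin` matches whole values, a label such as "Dark - Unknown Lighting" is kept, which agrees with the cells above where only the exact values "Unknown" and "Other" are dropped.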

In [54]:
df_final.shape

(169957, 4)

In [55]:
df_final.head(10)

| | SEVERITYCODE | WEATHER | ROADCOND | LIGHTCOND |
|---|---|---|---|---|
| 0 | 2 | Overcast | Wet | Daylight |
| 1 | 1 | Raining | Wet | Dark - Street Lights On |
| 2 | 1 | Overcast | Dry | Daylight |
| 3 | 1 | Clear | Dry | Daylight |
| 4 | 2 | Raining | Wet | Daylight |
| 5 | 1 | Clear | Dry | Daylight |
| 6 | 1 | Raining | Wet | Daylight |
| 7 | 2 | Clear | Dry | Daylight |
| 8 | 1 | Clear | Dry | Daylight |
| 9 | 2 | Clear | Dry | Daylight |


In [56]:
df_feature = pd.concat([df_final,pd.get_dummies(df_final[['WEATHER','ROADCOND','LIGHTCOND']])], axis=1)

In [57]:
df_feature = df_feature.drop(['WEATHER','ROADCOND','LIGHTCOND'],axis=1)
df_feature.head(10)

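| | SEVERITYCODE | WEATHER_Blowing Sand/Dirt | WEATHER_Clear | WEATHER_Fog/Smog/Smoke | WEATHER_Overcast | WEATHER_Partly Cloudy | WEATHER_Raining | WEATHER_Severe Crosswind | WEATHER_Sleet/Hail/Freezing Rain | WEATHER_Snowing | ... | ROADCOND_Snow/Slush | ROADCOND_Standing Water | ROADCOND_Wet | LIGHTCOND_Dark - No Street Lights | LIGHTCOND_Dark - Street Lights Off | LIGHTCOND_Dark - Street Lights On | LIGHTCOND_Dark - Unknown Lighting | LIGHTCOND_Dawn | LIGHTCOND_Daylight | LIGHTCOND_Dusk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |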


In [58]:
df_feature.shape

(169957, 24)

**After data cleaning and preprocessing, we have a data frame with 24 columns:
23 attributes (independent variables) and 1 target variable.**
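
The one-hot encoding step and the split into attributes and target can be sketched on a small made-up frame. The names `toy`, `encoded`, `X` and `y` are hypothetical, and the single `get_dummies` call with `columns=` is an alternative to the concat-then-drop approach used above, not the notebook's own code.

```python
import pandas as pd

# Toy frame standing in for df_final; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [2, 1, 1, 2],
    "WEATHER": ["Overcast", "Raining", "Clear", "Clear"],
    "ROADCOND": ["Wet", "Wet", "Dry", "Dry"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight", "Daylight"],
})

# One-hot encode the three categorical columns and drop the originals in one call.
# dtype=int requests 0/1 integers rather than booleans.
encoded = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND", "LIGHTCOND"], dtype=int)

# Separate the attributes (X) from the target (y).
X = encoded.drop(columns="SEVERITYCODE")
y = encoded["SEVERITYCODE"]

print(X.shape, y.shape)
```

Each row has exactly one 1 within each group of `WEATHER_`, `ROADCOND_` and `LIGHTCOND_` columns, since every record carries a single value per attribute.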