# Applied Data Science Capstone Project - Car accident severity

## Introduction (Business Problem)
The purpose of this project is to build a machine learning model that predicts the severity of a car accident in terms of human fatality, traffic delay, property damage, or any other adverse impact of the accident.

This prediction system may be helpful for people who are driving or travelling on roads that are closed or slowed down by an accident. Having an early idea of the accident severity may help them decide whether to stay on the same route or take a detour. It may also help them cancel or reschedule any appointment they are running late for. An early idea of the severity can also help the police and the paramedics plan how many resources (both people and equipment) need to be involved in dealing with the accident.

I am going to use a supervised machine learning model for this project. Hence, like any supervised machine learning model, a dataset with labelled data will be used to train and validate the model.

The dataset that I will use for this project is the shared dataset that is provided in the course syllabus named **'Data-Collisions.csv'**.

## Data

The dataset, **'Data-Collisions.csv'**, is a __CSV (Comma-Separated Values)__ file provided in the course syllabus and it contains the **Seattle** city traffic accident information from 01-Jan-2004 to 20-May-2020.

The initial dataset has a total of 38 columns (37 features and one label or target column) and a total of 194673 rows or samples.   

The target column or the label of the dataset is the **"SEVERITYCODE"** column, which encodes the severity of the accident.

Some features have missing data, and the dataset contains both numerical and categorical data. Not all attributes/features are useful, so some of them may need to be dropped. Some level of feature engineering will be required to reshape the data in order to improve the predictive power of the model.

In order to decide on which features should be used to build the predictive model I would consider the features that describe human factors, physical factors and environmental factors. After an initial observation it seems to me that the following attributes are good candidates for my project.
- Location
- Car speeding 
- Person under the influence of alcohol
- Junction type
- Weather condition
- Light conditions
- Road conditions
- Number of people involved
- Number of vehicles involved

Below are the column names in the initial dataset that represent the above listed attributes:

**'LOCATION', 'SPEEDING', 'UNDERINFL', 'JUNCTIONTYPE', 'WEATHER', 'LIGHTCOND', 'ROADCOND', 'PERSONCOUNT', 'VEHCOUNT'**

This is only an initial assumption and I may need to add, remove, split or transform features as I analyse the data in more detail along the way.

_[Note: This is where the work for week 2 submission ended. The work for week 3 continues below.]_

## Data Analysis
In this section I will analyse my data in more detail and, if required, apply data cleaning, data transformation and/or feature engineering to the dataset in order to make the data suitable for the applicable methodologies.

In [91]:
# import required libraries
import pandas as pd
import numpy as np

In [92]:
# Read the dataset from CSV file into a Pandas dataframe
df = pd.read_csv('Data-Collisions.csv', low_memory=False)

In [93]:
# Let's see the first 5 records (from the top) of the dataset
df.head() 

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


In [94]:
# Let's see the shape of the dataset
df.shape

(194673, 38)

We can see from above that the dataset has 194673 rows/records and 38 columns, as expected.

In [95]:
# display all column headers as an array
df.columns.values 

array(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY',
       'REPORTNO', 'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION',
       'EXCEPTRSNCODE', 'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC',
       'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT',
       'VEHCOUNT', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE',
       'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER',
       'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING',
       'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY',
       'HITPARKEDCAR'], dtype=object)

In [96]:
# Let's create a new dataframe "df_acc" with only the features that may be important from initial observation
df_acc = df[['SEVERITYCODE', 'LOCATION', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE', 'INCDTTM', 'JUNCTIONTYPE', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SPEEDING', 'HITPARKEDCAR']]

# display the top 5 rows of the new dataframe
df_acc.head()

Unnamed: 0,SEVERITYCODE,LOCATION,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,HITPARKEDCAR
0,2,5TH AVE NE AND NE 103RD ST,2,Injury Collision,Angles,2,0,0,2,2013/03/27 00:00:00+00,3/27/2013 2:54:00 PM,At Intersection (intersection related),N,Overcast,Wet,Daylight,,,N
1,1,AURORA BR BETWEEN RAYE ST AND BRIDGE WAY N,1,Property Damage Only Collision,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,12/20/2006 6:55:00 PM,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On,,,N
2,1,4TH AVE BETWEEN SENECA ST AND UNIVERSITY ST,1,Property Damage Only Collision,Parked Car,4,0,0,3,2004/11/18 00:00:00+00,11/18/2004 10:20:00 AM,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight,,,N
3,1,2ND AVE BETWEEN MARION ST AND MADISON ST,1,Property Damage Only Collision,Other,3,0,0,3,2013/03/29 00:00:00+00,3/29/2013 9:26:00 AM,Mid-Block (not related to intersection),N,Clear,Dry,Daylight,,,N
4,2,SWIFT AVE S AND SWIFT AV OFF RP,2,Injury Collision,Angles,2,0,0,2,2004/01/28 00:00:00+00,1/28/2004 8:04:00 AM,At Intersection (intersection related),0,Raining,Wet,Daylight,,,N


In [97]:
# Let's see the bottom 5 rows of my new dataframe 
df_acc.tail()

Unnamed: 0,SEVERITYCODE,LOCATION,SEVERITYCODE.1,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SPEEDING,HITPARKEDCAR
194668,2,34TH AVE S BETWEEN S DAKOTA ST AND S GENESEE ST,2,Injury Collision,Head On,3,0,0,2,2018/11/12 00:00:00+00,11/12/2018 8:12:00 AM,Mid-Block (not related to intersection),N,Clear,Dry,Daylight,,,N
194669,1,AURORA AVE N BETWEEN N 85TH ST AND N 86TH ST,1,Property Damage Only Collision,Rear Ended,2,0,0,2,2018/12/18 00:00:00+00,12/18/2018 9:14:00 AM,Mid-Block (not related to intersection),N,Raining,Wet,Daylight,,,N
194670,2,20TH AVE NE AND NE 75TH ST,2,Injury Collision,Left Turn,3,0,0,2,2019/01/19 00:00:00+00,1/19/2019 9:25:00 AM,At Intersection (intersection related),N,Clear,Dry,Daylight,,,N
194671,2,GREENWOOD AVE N AND N 68TH ST,2,Injury Collision,Cycles,2,0,1,1,2019/01/15 00:00:00+00,1/15/2019 4:48:00 PM,At Intersection (intersection related),N,Clear,Dry,Dusk,,,N
194672,1,34TH AVE BETWEEN E MARION ST AND E SPRING ST,1,Property Damage Only Collision,Rear Ended,2,0,0,2,2018/11/30 00:00:00+00,11/30/2018 3:45:00 PM,Mid-Block (not related to intersection),N,Clear,Wet,Daylight,,,N


In [98]:
# Let's see the size (rows, columns) of the new dataframe
df_acc.shape

(194673, 19)

When I look at the top 5 and bottom 5 rows of the new dataframe, the **SEVERITYCODE** and **SEVERITYCODE.1** columns appear to be duplicates of each other. There is no need to keep both of these columns, so I will remove the **SEVERITYCODE.1** column.

For this first model I will assume that **LOCATION** does not affect the accident severity, so I will remove this feature too.

I will also drop the following list of features (columns) on the same assumption as **LOCATION**:
- PEDCYLCOUNT
- INCDATE
- INCDTTM
- PEDROWNOTGRNT
- HITPARKEDCAR

In [99]:
# Drop the unimportant features from the dataframe "df_acc"
df_acc = df_acc.drop(['SEVERITYCODE.1', 'LOCATION', 'PEDCYLCOUNT', 'INCDATE', 'INCDTTM', 'PEDROWNOTGRNT', 'HITPARKEDCAR'], axis=1)

# Let's see the top 5 rows of the new dataframe
df_acc.head()

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,Injury Collision,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight,
1,1,Property Damage Only Collision,Sideswipe,2,0,2,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On,
2,1,Property Damage Only Collision,Parked Car,4,0,3,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight,
3,1,Property Damage Only Collision,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight,
4,2,Injury Collision,Angles,2,0,2,At Intersection (intersection related),0,Raining,Wet,Daylight,


In [100]:
# Let's see the size (rows, columns) of the modified dataset
df_acc.shape

(194673, 12)

In [101]:
# Let's see the number of null values in each column of the dataset
df_acc.isnull().sum()

SEVERITYCODE          0
SEVERITYDESC          0
COLLISIONTYPE      4904
PERSONCOUNT           0
PEDCOUNT              0
VEHCOUNT              0
JUNCTIONTYPE       6329
UNDERINFL          4884
WEATHER            5081
ROADCOND           5012
LIGHTCOND          5170
SPEEDING         185340
dtype: int64

I can see that the SPEEDING column has 185340 null values out of 194673 samples (about 95%), which leaves too few recorded values to build a model with. Using SPEEDING as a feature would therefore likely result in a poor predictive model, so I will remove the SPEEDING feature too.
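
To put the SPEEDING figure in context, the share of missing values per column can be computed directly. The snippet below is only a minimal sketch on a small made-up dataframe (`toy` and its values are illustrative assumptions, not rows from 'Data-Collisions.csv'); the last lines repeat the arithmetic for the counts reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: SPEEDING is mostly empty, WEATHER has one gap
toy = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Overcast"],
    "VEHCOUNT": [2, 2, 3, 1],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Same calculation for the null count and row count reported above
speeding_missing_share = 185340 / 194673
print(f"SPEEDING missing share: {speeding_missing_share:.1%}")
```

Sorting the shares in descending order makes it easy to spot columns that are almost entirely empty before deciding what to drop.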

In [102]:
# Let's remove the SPEEDING column from the dataset
df_acc.drop(['SPEEDING'], axis=1, inplace=True)

### Either replace missing values or remove rows with missing values

As I can see, some of my important features have missing (null) values. So, in order to build a good predictive model I can choose to do one of two things:
- I can replace the missing values with median or average values (or, for categorical features, the most frequent value); or
- I can remove all rows with missing values altogether

I have already removed the SPEEDING feature, which leaves the following numbers of missing values:
- COLLISIONTYPE      4904
- JUNCTIONTYPE       6329
- UNDERINFL          4884
- WEATHER            5081
- ROADCOND           5012
- LIGHTCOND          5170

As I can see from above, the number of missing values is small compared to the total number of samples in my dataset. So, if I remove all rows having missing values, I will still have more than 150,000 samples to build my model on, which I consider quite sufficient. Hence, I will remove all rows that have missing values.
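
The two options above can be compared side by side. This is a minimal sketch on a made-up dataframe (`toy`, its values and the variable names are assumptions for illustration). For categorical columns an average is not defined, so the sketch fills gaps with the most frequent value instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two categorical columns
toy = pd.DataFrame({
    "WEATHER": ["Clear", "Clear", np.nan, "Raining", "Clear"],
    "ROADCOND": ["Dry", np.nan, "Wet", "Dry", "Dry"],
    "PERSONCOUNT": [2, 3, 2, 4, 2],
})
cat_cols = ["WEATHER", "ROADCOND"]

# Option 1: fill each gap with the most frequent value of its column
imputed = toy.copy()
imputed[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

# Option 2: drop every row that has at least one missing value
dropped = toy.dropna().reset_index(drop=True)

print("rows after imputing:", len(imputed))
print("rows after dropping:", len(dropped))
```

Imputing keeps every row but introduces guessed values; dropping keeps only observed values but shrinks the dataset. With only a small fraction of rows affected, dropping is the simpler choice.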


In [103]:
# Drop all rows having at least one null value and save the remaining rows into a new dataframe
df_acc_no_null = df_acc.dropna()

In [104]:
# As many samples (rows) are gone, let's reset the index of the dataset
df_acc_no_null.reset_index(drop=True, inplace=True) # drop=True ensures new index is not added as an extra column

In [105]:
df_acc_no_null

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
0,2,Injury Collision,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight
1,1,Property Damage Only Collision,Sideswipe,2,0,2,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On
2,1,Property Damage Only Collision,Parked Car,4,0,3,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight
3,1,Property Damage Only Collision,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
4,2,Injury Collision,Angles,2,0,2,At Intersection (intersection related),0,Raining,Wet,Daylight
...,...,...,...,...,...,...,...,...,...,...,...
183172,2,Injury Collision,Head On,3,0,2,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
183173,1,Property Damage Only Collision,Rear Ended,2,0,2,Mid-Block (not related to intersection),N,Raining,Wet,Daylight
183174,2,Injury Collision,Left Turn,3,0,2,At Intersection (intersection related),N,Clear,Dry,Daylight
183175,2,Injury Collision,Cycles,2,0,1,At Intersection (intersection related),N,Clear,Dry,Dusk


In [106]:
# Let's see the number of null values in the new dataframe (expecting no null values remaining)
df_acc_no_null.isnull().sum()

SEVERITYCODE     0
SEVERITYDESC     0
COLLISIONTYPE    0
PERSONCOUNT      0
PEDCOUNT         0
VEHCOUNT         0
JUNCTIONTYPE     0
UNDERINFL        0
WEATHER          0
ROADCOND         0
LIGHTCOND        0
dtype: int64

In [107]:
# Let's see the size (rows, columns) of the new dataset
df_acc_no_null.shape

(183177, 11)

As I can see, after removing all rows with missing values I still have 183177 samples (about 94% of the original 194673) that I can use to build my predictive model.

In [108]:
# Let's see some basic information of the dataset
df_acc_no_null.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183177 entries, 0 to 183176
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   SEVERITYCODE   183177 non-null  int64 
 1   SEVERITYDESC   183177 non-null  object
 2   COLLISIONTYPE  183177 non-null  object
 3   PERSONCOUNT    183177 non-null  int64 
 4   PEDCOUNT       183177 non-null  int64 
 5   VEHCOUNT       183177 non-null  int64 
 6   JUNCTIONTYPE   183177 non-null  object
 7   UNDERINFL      183177 non-null  object
 8   WEATHER        183177 non-null  object
 9   ROADCOND       183177 non-null  object
 10  LIGHTCOND      183177 non-null  object
dtypes: int64(4), object(7)
memory usage: 15.4+ MB


Here I can see that UNDERINFL is an "object" data type. I was expecting this to be boolean instead. 

In [109]:
# Let's have a look at the top 5 rows 
df_acc_no_null.head()

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
0,2,Injury Collision,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight
1,1,Property Damage Only Collision,Sideswipe,2,0,2,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On
2,1,Property Damage Only Collision,Parked Car,4,0,3,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight
3,1,Property Damage Only Collision,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
4,2,Injury Collision,Angles,2,0,2,At Intersection (intersection related),0,Raining,Wet,Daylight


Since I am trying to predict the accident severity, my target feature in the dataset can be either SEVERITYCODE or SEVERITYDESC. I will use the SEVERITYCODE as my target feature. So, I will remove the SEVERITYDESC feature. 

_Note: Looking at the SEVERITYCODE and SEVERITYDESC features it is clear that SEVERITYCODE=1 refers to "Property Damage Only Collision" and SEVERITYCODE=2 refers to "Injury Collision". So I can say SEVERITYCODE=2 is more serious as it involves human injury. I need to keep this in mind once the SEVERITYDESC feature is removed from the dataframe._

In [110]:
# Remove SEVERITYDESC column
df_acc_no_null = df_acc_no_null.drop(['SEVERITYDESC'], axis=1)

In [111]:
df_acc_no_null.head()

Unnamed: 0,SEVERITYCODE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
0,2,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight
1,1,Sideswipe,2,0,2,Mid-Block (not related to intersection),0,Raining,Wet,Dark - Street Lights On
2,1,Parked Car,4,0,3,Mid-Block (not related to intersection),0,Overcast,Dry,Daylight
3,1,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
4,2,Angles,2,0,2,At Intersection (intersection related),0,Raining,Wet,Daylight


In [112]:
# Let's see the number of unique values of SEVERITYCODE feature
df_acc_no_null['SEVERITYCODE'].nunique()

2

We can see that SEVERITYCODE has only 2 values. As the label or target feature is not a continuous value, a regression model is not suitable for this project. Since the target can only be one of two values, a (binary) classification model can be used to predict it.

I am going to try different classification algorithms and evaluate them to determine the one that works best for this project.
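
Before comparing classifiers it helps to know how the two classes are balanced, because a model that always predicts the larger class already reaches an accuracy equal to that class's share. A minimal sketch with made-up labels (`labels` is an illustrative assumption, not the real target column):

```python
import pandas as pd

# Hypothetical labels: 1 = property damage only, 2 = injury
labels = pd.Series([1, 1, 1, 2, 1, 2, 1, 1, 2, 1], name="SEVERITYCODE")

# Share of each class in the data
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = class_share.max()
print("majority-class baseline accuracy:", majority_baseline)
```

Any classifier tried later should beat this baseline to be considered useful.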

In [113]:
# Let's look at how VEHCOUNT affects SEVERITYCODE
df_acc_no_null.groupby(['VEHCOUNT'])['SEVERITYCODE'].value_counts(normalize=True)

VEHCOUNT  SEVERITYCODE
0         2               0.984615
          1               0.015385
1         2               0.554669
          1               0.445331
2         1               0.748449
          2               0.251551
3         1               0.577419
          2               0.422581
4         1               0.554817
          2               0.445183
5         1               0.503802
          2               0.496198
6         1               0.590278
          2               0.409722
7         1               0.511111
          2               0.488889
8         1               0.666667
          2               0.333333
9         2               0.666667
          1               0.333333
10        2               1.000000
11        1               0.500000
          2               0.500000
12        1               1.000000
Name: SEVERITYCODE, dtype: float64
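
The proportions alone do not show how many collisions fall in each VEHCOUNT group, and a group with very few records can produce extreme shares such as 1.000000. Below is a minimal sketch of how counts and row-wise proportions could be viewed together with `pd.crosstab`, using a made-up dataframe (`toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical toy data: one group (VEHCOUNT=12) has a single record
toy = pd.DataFrame({
    "VEHCOUNT":     [1, 1, 2, 2, 2, 2, 12],
    "SEVERITYCODE": [2, 1, 1, 1, 1, 2, 1],
})

# Absolute number of records per (VEHCOUNT, SEVERITYCODE) pair
counts = pd.crosstab(toy["VEHCOUNT"], toy["SEVERITYCODE"])

# Share of each severity code within a VEHCOUNT group (rows sum to 1)
shares = pd.crosstab(toy["VEHCOUNT"], toy["SEVERITYCODE"], normalize="index")

# Number of records behind each row of proportions
group_size = counts.sum(axis=1)

print(counts)
print(shares)
print(group_size)
```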

In [114]:
# Let's look at how COLLISIONTYPE affects SEVERITYCODE
df_acc_no_null.groupby(['COLLISIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True)

COLLISIONTYPE  SEVERITYCODE
Angles         1               0.606111
               2               0.393889
Cycles         2               0.876980
               1               0.123020
Head On        1               0.566132
               2               0.433868
Left Turn      1               0.604281
               2               0.395719
Other          1               0.738641
               2               0.261359
Parked Car     1               0.939175
               2               0.060825
Pedestrian     2               0.898542
               1               0.101458
Rear Ended     1               0.568299
               2               0.431701
Right Turn     1               0.793857
               2               0.206143
Sideswipe      1               0.865116
               2               0.134884
Name: SEVERITYCODE, dtype: float64

In [115]:
# Let's look at how JUNCTIONTYPE affects SEVERITYCODE
df_acc_no_null.groupby(['JUNCTIONTYPE'])['SEVERITYCODE'].value_counts(normalize=True)

JUNCTIONTYPE                                       SEVERITYCODE
At Intersection (but not related to intersection)  1               0.700535
                                                   2               0.299465
At Intersection (intersection related)             1               0.563519
                                                   2               0.436481
Driveway Junction                                  1               0.696198
                                                   2               0.303802
Mid-Block (but intersection related)               1               0.678388
                                                   2               0.321612
Mid-Block (not related to intersection)            1               0.782642
                                                   2               0.217358
Ramp Junction                                      1               0.679012
                                                   2               0.320988
Unknown                 

In [116]:
# Let's look at how WEATHER affects SEVERITYCODE
df_acc_no_null.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True)

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1               0.734694
                          2               0.265306
Clear                     1               0.673864
                          2               0.326136
Fog/Smog/Smoke            1               0.664875
                          2               0.335125
Other                     1               0.847797
                          2               0.152203
Overcast                  1               0.680943
                          2               0.319057
Partly Cloudy             2               0.600000
                          1               0.400000
Raining                   1               0.660647
                          2               0.339353
Severe Crosswind          1               0.720000
                          2               0.280000
Sleet/Hail/Freezing Rain  1               0.758929
                          2               0.241071
Snowing                   1               0

In [117]:
# Let's look at how ROADCOND affects SEVERITYCODE
df_acc_no_null.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)

ROADCOND        SEVERITYCODE
Dry             1               0.674791
                2               0.325209
Ice             1               0.773345
                2               0.226655
Oil             1               0.600000
                2               0.400000
Other           1               0.658537
                2               0.341463
Sand/Mud/Dirt   1               0.671642
                2               0.328358
Snow/Slush      1               0.831633
                2               0.168367
Standing Water  1               0.724771
                2               0.275229
Unknown         1               0.939324
                2               0.060676
Wet             1               0.665462
                2               0.334538
Name: SEVERITYCODE, dtype: float64

In [118]:
# Let's look at how LIGHTCOND affects SEVERITYCODE
df_acc_no_null.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True)

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1               0.775650
                          2               0.224350
Dark - Street Lights Off  1               0.729706
                          2               0.270294
Dark - Street Lights On   1               0.698256
                          2               0.301744
Dark - Unknown Lighting   1               0.636364
                          2               0.363636
Dawn                      1               0.666123
                          2               0.333877
Daylight                  1               0.664107
                          2               0.335893
Dusk                      1               0.666782
                          2               0.333218
Other                     1               0.753555
                          2               0.246445
Unknown                   1               0.945418
                          2               0.054582
Name: SEVERITYCODE, dtype: float64

In [119]:
# Now let's look at how UNDERINFL affects SEVERITYCODE
df_acc_no_null.groupby(['UNDERINFL'])['SEVERITYCODE'].value_counts(normalize=True)

UNDERINFL  SEVERITYCODE
0          1               0.714206
           2               0.285794
1          1               0.592760
           2               0.407240
N          1               0.679004
           2               0.320996
Y          1               0.620024
           2               0.379976
Name: SEVERITYCODE, dtype: float64

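It appears that the remaining features are associated with SEVERITYCODE to varying degrees: COLLISIONTYPE and VEHCOUNT show large differences between categories, while WEATHER, ROADCOND and LIGHTCOND show smaller ones.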

Also, it appears that the UNDERINFL feature uses two encodings for the same information: 0/1 and N/Y. I will turn all of them into one type of value.
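
One compact way to unify the two encodings is a dictionary-based `replace`, optionally followed by a `map` to booleans. This is a minimal sketch on a made-up series (`underinfl` is an illustrative assumption); it assumes the 0/1 values are stored as strings, as they are in an "object" column.

```python
import pandas as pd

# Hypothetical values mixing both encodings
underinfl = pd.Series(["N", "0", "Y", "1", "0"], name="UNDERINFL")

# Map the string digits onto the Y/N encoding; Y and N stay unchanged
normalized = underinfl.replace({"0": "N", "1": "Y"})
print(normalized.tolist())

# Optional: convert to real booleans with an explicit mapping
as_bool = normalized.map({"Y": True, "N": False})
print(as_bool.tolist())
```

An explicit mapping is used for the boolean step because `astype(bool)` on strings would treat every non-empty string, including 'N', as True.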

In [136]:
# Make a copy of the dataset before applying further alterations
df_final = df_acc_no_null.copy()  # .copy() creates an independent dataframe; plain assignment would only create an alias
df_final.head()

Unnamed: 0,SEVERITYCODE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
0,2,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight
1,1,Sideswipe,2,0,2,Mid-Block (not related to intersection),N,Raining,Wet,Dark - Street Lights On
2,1,Parked Car,4,0,3,Mid-Block (not related to intersection),N,Overcast,Dry,Daylight
3,1,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
4,2,Angles,2,0,2,At Intersection (intersection related),N,Raining,Wet,Daylight


In [137]:
# Replace 1 with Y and 0 with N
df_final.loc[(df_final.UNDERINFL == '1'), 'UNDERINFL'] = 'Y'
df_final.loc[(df_final.UNDERINFL == '0'), 'UNDERINFL'] = 'N'

In [138]:
df_final.head()
#df_final.tail()

Unnamed: 0,SEVERITYCODE,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
0,2,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight
1,1,Sideswipe,2,0,2,Mid-Block (not related to intersection),N,Raining,Wet,Dark - Street Lights On
2,1,Parked Car,4,0,3,Mid-Block (not related to intersection),N,Overcast,Dry,Daylight
3,1,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
4,2,Angles,2,0,2,At Intersection (intersection related),N,Raining,Wet,Daylight


In [140]:
# Let's see how many of UNDERINFL values are in 'Y' or 'N'
print("Total number of UNDERINFL values in Y or N:", df_final['UNDERINFL'].isin(['Y','N']).sum())  # sum() counts the True values; count() would count every row

# Let's see the UNDERINFL data type as of now
print("Current data type of UNDERINFL:", df_final['UNDERINFL'].dtypes)

Total number of UNDERINFL values in Y or N: 183177
Current data type of UNDERINFL: object


In [141]:
# Rows having UNDERINFL=Y
df_final.loc[(df_final.UNDERINFL == 'Y'),['UNDERINFL']]

Unnamed: 0,UNDERINFL
32,Y
104,Y
117,Y
146,Y
160,Y
...,...
183141,Y
183155,Y
183160,Y
183165,Y


In [142]:
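# Convert data type of UNDERINFL into boolean
# Note: astype(bool) would turn both 'Y' and 'N' into True (any non-empty string is truthy),
# so this conversion is left disabled
#df_final['UNDERINFL'].astype(bool)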

In [143]:
# Let's see the number of different values of UNDERINFL
df_final['UNDERINFL'].value_counts()

N    174175
Y      9002
Name: UNDERINFL, dtype: int64

## Create the features dataset (X) and the target dataset (Y)

In [145]:
# Create a new dataframe with the features only
df_X = df_final.drop(['SEVERITYCODE'],axis=1)
df_X.head()

Unnamed: 0,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,VEHCOUNT,JUNCTIONTYPE,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND
0,Angles,2,0,2,At Intersection (intersection related),N,Overcast,Wet,Daylight
1,Sideswipe,2,0,2,Mid-Block (not related to intersection),N,Raining,Wet,Dark - Street Lights On
2,Parked Car,4,0,3,Mid-Block (not related to intersection),N,Overcast,Dry,Daylight
3,Other,3,0,3,Mid-Block (not related to intersection),N,Clear,Dry,Daylight
4,Angles,2,0,2,At Intersection (intersection related),N,Raining,Wet,Daylight


In [146]:
# Create a new dataframe with the target column
df_Y = df_final['SEVERITYCODE']
df_Y.head()

0    2
1    1
2    1
3    1
4    2
Name: SEVERITYCODE, dtype: int64

# Classification methods

I will split the dataset into a training set and a test set. Then I am going to use the following classification methods to build a model on the training set and use the test set to report the accuracy of the model.

- K Nearest Neighbor (KNN)
- Decision Tree
- Support Vector Machine
- Logistic Regression
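
The scikit-learn implementations of these methods expect numeric input, while several features here (for example WEATHER and UNDERINFL) are text categories. One common option is one-hot encoding. The snippet below is a minimal sketch using `pd.get_dummies` on a made-up dataframe; `toy_X` and the choice of one-hot encoding are my assumptions, not a step already performed above.

```python
import pandas as pd

# Hypothetical toy features mixing text categories and a numeric count
toy_X = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Overcast", "Clear"],
    "UNDERINFL": ["N", "Y", "N", "N"],
    "VEHCOUNT": [2, 1, 3, 2],
})

# Each category becomes its own 0/1 column; numeric columns are kept as they are
encoded = pd.get_dummies(toy_X, columns=["WEATHER", "UNDERINFL"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```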


## Split the datasets into train and test data

In [149]:
# import required library to split dataset
from sklearn.model_selection import train_test_split

# split the datasets into train and test datasets; 80% for training and 20% for testing 
x_train, x_test, y_train, y_test = train_test_split(df_X, df_Y, test_size=0.20, random_state=1)

print("Train set. X=", x_train.shape, " Y=", y_train.shape)
print("Test set. X=", x_test.shape, " Y=", y_test.shape)

Train set. X= (146541, 9)  Y= (146541,)
Test set. X= (36636, 9)  Y= (36636,)
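
As an optional variant, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both subsets. This is a minimal sketch on made-up data (`X`, `y` and the 70/30 class mix are assumptions for illustration); the split above does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 70 of class 1 and 30 of class 2
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 70 + [2] * 30)

# stratify=y preserves the 70/30 class mix in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)

print("train share of class 2:", (y_tr == 2).mean())
print("test share of class 2:", (y_te == 2).mean())
```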


## K Nearest Neighbor (KNN)

In [151]:
# import required libraries for KNN
#from sklearn.neighbors import KNeighborsClassifier
#from sklearn import metrics

#Ks = 10
#mean_acc = np.zeros((Ks-1))
#std_acc = np.zeros((Ks-1))

#for n in range(1,Ks):  
    #Train Model and Predict  
#    neigh = KNeighborsClassifier(n_neighbors = n).fit(x_train,y_train)
#    yhat=neigh.predict(x_test)
#    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    
#    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

#print('Mean accuracy: ', mean_acc)
#print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1)

In [None]:
# plot KNN
###########
#import matplotlib.pyplot as plt
#plt.plot(range(1,Ks),mean_acc,'g')
#plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
#plt.legend(('Accuracy ', '+/- 1xstd'))
#plt.ylabel('Accuracy ')
#plt.xlabel('Number of Neighbors (K)')
#plt.tight_layout()
#plt.show()
#print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1)
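
The commented-out loop above records, for each k, the accuracy and a standard error computed as the standard deviation of the per-sample "correct / incorrect" indicator divided by the square root of the number of test samples. A minimal sketch of that calculation with made-up labels and predictions (`y_true` and `y_pred` are illustrative assumptions):

```python
import numpy as np

# Hypothetical true labels and predictions for 8 test samples
y_true = np.array([1, 2, 1, 1, 2, 1, 1, 2])
y_pred = np.array([1, 2, 1, 2, 2, 1, 1, 1])

# Boolean indicator: True where the prediction is correct
correct = (y_true == y_pred)

# Accuracy is the mean of the indicator
accuracy = correct.mean()

# Standard error of the accuracy estimate
std_err = correct.std() / np.sqrt(correct.size)

print("accuracy:", accuracy)
print("standard error:", std_err)
```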

## Decision Tree

In [153]:
# Training 
###########

#from sklearn.tree import DecisionTreeClassifier

# create a DecisionTreeClassifier. Use entropy criterion to see the information gain on each node
#loan_tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)

# Train the model
#loan_tree.fit(x_train,y_train)
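
Since the feature set still contains text categories, the tree cannot be fitted on it directly. One possible approach, shown here only as a sketch under my own assumptions, is to wrap a `OneHotEncoder` and the classifier in a scikit-learn `Pipeline`. The data (`toy_X`, `toy_y`) is made up, and the tree settings mirror the commented-out cell above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one text feature, one numeric feature
toy_X = pd.DataFrame({
    "COLLISIONTYPE": ["Parked Car", "Pedestrian", "Parked Car",
                      "Pedestrian", "Angles", "Angles"],
    "VEHCOUNT": [2, 1, 2, 1, 2, 3],
})
toy_y = pd.Series([1, 2, 1, 2, 1, 2], name="SEVERITYCODE")

# One-hot encode the text column, pass the numeric column through unchanged
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["COLLISIONTYPE"])],
    remainder="passthrough",
)

model = Pipeline([
    ("encode", encode),
    ("tree", DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=1)),
])

# Fit on the toy data and predict the same rows (training predictions only)
model.fit(toy_X, toy_y)
train_pred = model.predict(toy_X)
print(train_pred)
```

Keeping the encoder inside the pipeline means the same transformation is applied automatically to the test set at prediction time.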

## Support Vector Machine

In [155]:
# Train the model
##################

#from sklearn import svm
#clf = svm.SVC(kernel='rbf')
#clf.fit(x_train, y_train)

## Logistic Regression

In [157]:
# Train the model
#from sklearn.linear_model import LogisticRegression
#from sklearn.metrics import confusion_matrix
#LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train,y_train)
#LR
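
Once the models are trained, their test-set predictions can be compared with a few standard metrics. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are illustrative assumptions, not results of any model above). It shows accuracy, the confusion matrix and the F1 score for the injury class (SEVERITYCODE=2).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true labels and predictions (1 = property damage, 2 = injury)
y_true = [1, 1, 1, 1, 2, 2, 2, 1]
y_pred = [1, 1, 2, 1, 2, 1, 2, 1]

# Share of correct predictions
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [1, 2]
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])

# F1 score treating the injury class (2) as the positive class
f1_injury = f1_score(y_true, y_pred, pos_label=2)

print("accuracy:", acc)
print(cm)
print("F1 (injury class):", f1_injury)
```

Because injury collisions are the less frequent class, the F1 score for that class can be more informative than accuracy alone.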