# Capstone Project - Car Accident Severity


This notebook contains the capstone project for the IBM Data Science Professional Certificate. The project uses historical collision data to build a model that predicts car accident severity.

## Business Understanding

### Problem description

Throughout the world, roads are shared by cars, buses, trucks, motorcycles, mopeds, pedestrians, animals, taxis, and other travelers. Travel made possible by motor vehicles supports economic and social development in many countries.

Vehicle crashes are responsible for millions of deaths and injuries. According to the Web-based Injury Statistics Query and Reporting System (WISQARS) of the Centers for Disease Control and Prevention (CDC), National Center for Injury Prevention and Control (NCIPC): https://webappa.cdc.gov/cgi-bin/broker.exe, road traffic crashes are a leading cause of death in the world and the leading cause of non-natural death of healthy citizens for all age groups.

Low- and middle-income countries are most affected. According to the World Health Organization (WHO) Global Status Report on Road Safety 2018 (https://www.who.int/violence_injury_prevention/road_safety_status/2018/en/), the road traffic crash death rate is over three times higher in low-income countries than in high-income countries.

Road traffic injuries place a huge economic burden on low- and middle-income countries. Each year, according to the latest available cost estimate (1998), road traffic injuries cost USD 518 billion worldwide and USD 65 billion in low- and middle-income countries, which exceeds the total amount that these countries receive in development assistance (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.5207&rep=rep1&type=pdf).

According to the Eurostat statistic "Road accident fatalities by vehicles", car drivers and passengers represented the largest category of road traffic deaths in the EU in 2018, with 44.8% of all road traffic fatalities.


<img src="https://ec.europa.eu/eurostat/statistics-explained/images/d/d8/Road_accident_fatalities_by_category_of_vehicles%2C_2018_%28%25%29_pie.png" alt="https://ec.europa.eu/eurostat/statistics-explained/images/d/d8/Road_accident_fatalities_by_category_of_vehicles%2C_2018_%28%25%29_pie.png" class="transparent shrinkToFit" width="470" height="472">

Therefore, it is important to search for new ways to reduce road traffic accidents.


### Objective

The objective is to develop a navigation app that warns car drivers, given the weather, the road conditions and other parameters, about the possibility of a car accident on the chosen route and how severe it would be. The app would offer to change the route where possible; otherwise drivers can take the warning into account and drive more carefully.

### The target audience

The target audience for this app is car drivers with smartphones.

### Stakeholders

The main stakeholders are the governments of countries interested in reducing the number of car accidents on their roads.

### Question

Can we predict, in real time and for any region, the severity of a possible car accident on a given route?

## Data

### Data source

In order to create the required prediction model, we need to find the dependence of the severity of the accident on parameters that can be collected in real time, such as weather conditions.

First, we use statistics on road traffic accidents to determine the factors that influence their severity. For this purpose, data from the SDOT Traffic Management Division were used.

The dataset covers all collisions provided by SPD and recorded by Traffic Records, and is updated weekly. It includes all types of collisions from 2004 to the present. Its attributes describe the circumstances of each accident, the number of victims and the severity of the accident.


### Data acquisition

At this stage, we load the .csv file from its Internet source into our Python environment and create a pandas dataframe. We then analyze the raw data to determine which attributes could be useful for the prediction model.

In [1]:
import pandas as pd
data_path = "https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv"
df = pd.read_csv(data_path, low_memory=False)

After reading the dataset, it helps to look at the dataframe to get a feel for the data. Let's print the first 10 rows of our dataset.

In [2]:
df.head(10)

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N
5,1,-122.387598,47.690575,6,320840,322340,E919477,Matched,Intersection,36974.0,...,Dry,Daylight,,,,10,Entering at angle,0,0,N
6,1,-122.338485,47.618534,7,83300,83300,3282542,Matched,Intersection,29510.0,...,Wet,Daylight,,8344002.0,,10,Entering at angle,0,0,N
7,2,-122.32078,47.614076,9,330897,332397,EA30304,Matched,Intersection,29745.0,...,Dry,Daylight,,,,5,Vehicle Strikes Pedalcyclist,6855,0,N
8,1,-122.33593,47.611904,10,63400,63400,2071243,Matched,Block,,...,Dry,Daylight,,6166014.0,,32,One parked--one moving,0,0,N
9,2,-122.3847,47.528475,12,58600,58600,2072105,Matched,Intersection,34679.0,...,Dry,Daylight,,6079001.0,,10,Entering at angle,0,0,N


As we can see, the attribute names in our dataset are abbreviations. To decode them, we also need the metadata file, available at https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf, which describes each attribute. This lets us understand what information we have.

### Data understanding

#### Defining the target variable

First of all, we must determine which attribute the model will predict. Our task is to predict the severity of a road traffic accident, so we use SEVERITYCODE as the target variable. It encodes the severity of the collision and takes discrete values. Therefore, we will use a classification model for prediction.


#### Selecting the required attributes

Based on the metadata, we can do a preliminary analysis of the attributes and filter out those that will not be useful for the prediction model.

First, we can delete the internal codes and identifiers that SDOT uses for its reports, since this information carries nothing that can be used for data analysis. This includes the following attributes:

* INCKEY - A unique key for the incident
* COLDETKEY - Secondary key for the incident
* SDOT_COLCODE - A code given to the collision by SDOT
* SDOT_COLDESC - A description of the collision corresponding to the collision code
* SDOTCOLNUM - A number given to the collision by SDOT
* REPORTNO - The report number
* STATUS
* EXCEPTRSNCODE
* EXCEPTRSNDESC

In [3]:
df.drop(['INCKEY', 'COLDETKEY', 'SDOT_COLCODE', 'SDOT_COLDESC', 'SDOTCOLNUM', 'REPORTNO', 'STATUS', 'EXCEPTRSNCODE', 'EXCEPTRSNDESC'], axis=1, inplace=True)

Since our algorithm needs to be universal and applicable in different regions, we need to weed out the attributes that are relevant only to Seattle. This includes all attributes that contain specific location data:

* OBJECTID - ESRI unique identifier
* SHAPE - ESRI geometry field
* X, Y - Longitude and latitude of the collision
* INTKEY - Key that corresponds to the intersection associated with a collision
* LOCATION - Description of the general location of the collision
* SEGLANEKEY - A key for the lane segment in which the collision occurred
* CROSSWALKKEY - A key for the crosswalk at which the collision occurred
* ADDRTYPE - Collision address type

In [4]:
df.drop(['OBJECTID', 'X', 'Y', 'INTKEY', 'LOCATION', 'SEGLANEKEY', 'CROSSWALKKEY', 'ADDRTYPE'], axis=1, inplace=True)

The dataset also has attributes describing the number of people and vehicles involved in the incident and the type of incident. Although they give a fuller picture of the scale of the incident, they cannot be used as input parameters for the model, since this information only becomes available after the incident.

* COLLISIONTYPE - Collision type
* PERSONCOUNT - The total number of people involved in the collision
* PEDCOUNT - The number of pedestrians involved in the collision. 
* PEDCYLCOUNT - The number of bicycles involved in the collision. 
* VEHCOUNT - The number of vehicles involved in the collision
* ST_COLCODE - A code provided by the state that describes the collision
* ST_COLDESC - A description that corresponds to the state’s coding designation 
* HITPARKEDCAR - Whether or not the collision involved hitting a parked car
* JUNCTIONTYPE - Category of junction at which collision took place 
* PEDROWNOTGRNT - Whether or not the pedestrian right of way was not granted

In [5]:
df.drop(['COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'ST_COLCODE', 'ST_COLDESC', 'HITPARKEDCAR', 'JUNCTIONTYPE', 'PEDROWNOTGRNT'], axis=1, inplace=True)

We delete attributes containing date and time stamps of incidents, because based on this information, we cannot make assumptions about a possible accident in the future and its severity:

* INCDATE - The date of the incident
* INCDTTM - The date and time of the incident

In [7]:
df.drop(['INCDATE', 'INCDTTM'], axis=1, inplace=True)

This dataset has two SEVERITYCODE columns that duplicate each other, so we remove one of them.

In [8]:
df.drop(['SEVERITYCODE.1'], axis=1, inplace=True)
df.head(10)

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING
0,2,Injury Collision,,N,Overcast,Wet,Daylight,
1,1,Property Damage Only Collision,,0,Raining,Wet,Dark - Street Lights On,
2,1,Property Damage Only Collision,,0,Overcast,Dry,Daylight,
3,1,Property Damage Only Collision,,N,Clear,Dry,Daylight,
4,2,Injury Collision,,0,Raining,Wet,Daylight,
5,1,Property Damage Only Collision,,N,Clear,Dry,Daylight,
6,1,Property Damage Only Collision,,0,Raining,Wet,Daylight,
7,2,Injury Collision,,N,Clear,Dry,Daylight,
8,1,Property Damage Only Collision,,0,Clear,Dry,Daylight,
9,2,Injury Collision,,0,Clear,Dry,Daylight,


Factors that led to the accident, such as the driver being under the influence of alcohol or drugs, speeding or inattention, affect its severity. Unfortunately, these factors cannot be known in advance when planning a route for our users, so they cannot be used in our model. We therefore remove all accidents caused by one of these factors, so as not to distort our statistics.

In [9]:
df['SPEEDING'].unique()
indexSPEEDING = df[ df['SPEEDING'] == 'Y' ].index
df.drop(indexSPEEDING , inplace=True)

In [10]:
df['INATTENTIONIND'].unique()
indexINATTENTIONIND = df[ df['INATTENTIONIND'] == 'Y' ].index
df.drop(indexINATTENTIONIND , inplace=True)

In [11]:
df['UNDERINFL'].unique()
indexUNDERINFL1 = df[ df['UNDERINFL'] == '1' ].index
indexUNDERINFL2 = df[ df['UNDERINFL'] == 'Y' ].index
df.drop(indexUNDERINFL1 , inplace=True)
df.drop(indexUNDERINFL2 , inplace=True)
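
The three filtering cells above can also be written as a single boolean mask. The following is a minimal sketch on a toy frame, assuming the flag values seen in this dataset ('Y' for SPEEDING and INATTENTIONIND, 'Y' or '1' for UNDERINFL). The toy rows and the names toy, driver_factor and filtered are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the collisions frame (values are illustrative only)
toy = pd.DataFrame({
    "SEVERITYCODE":   [1, 2, 1, 2, 1],
    "SPEEDING":       [None, "Y", None, None, None],
    "INATTENTIONIND": [None, None, "Y", None, None],
    "UNDERINFL":      ["N", "0", "N", "1", "0"],
})

# True where any driver-related factor was recorded
driver_factor = (
    toy["SPEEDING"].eq("Y")
    | toy["INATTENTIONIND"].eq("Y")
    | toy["UNDERINFL"].isin(["Y", "1"])
)

# Keep only the rows without a recorded driver-related factor
filtered = toy[~driver_factor]
print(filtered.index.tolist())
```

Building one mask keeps the filtering rule in a single place and avoids repeated in-place drops.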


After all cases caused by these factors have been deleted, we can delete the corresponding columns.

In [12]:
df.drop(['INATTENTIONIND', 'UNDERINFL', 'SPEEDING'], axis=1, inplace=True)
df.head(10)

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,WEATHER,ROADCOND,LIGHTCOND
0,2,Injury Collision,Overcast,Wet,Daylight
1,1,Property Damage Only Collision,Raining,Wet,Dark - Street Lights On
2,1,Property Damage Only Collision,Overcast,Dry,Daylight
3,1,Property Damage Only Collision,Clear,Dry,Daylight
4,2,Injury Collision,Raining,Wet,Daylight
5,1,Property Damage Only Collision,Clear,Dry,Daylight
6,1,Property Damage Only Collision,Raining,Wet,Daylight
7,2,Injury Collision,Clear,Dry,Daylight
8,1,Property Damage Only Collision,Clear,Dry,Daylight
9,2,Injury Collision,Clear,Dry,Daylight


As a result, only those parameters remain that could potentially affect the severity of the accident and that can be collected in real time, so that the user always gets an up-to-date prediction. These are:

* WEATHER - The weather conditions during the time of the collision
* LIGHTCOND - The light conditions during the collision
* ROADCOND - The condition of the road during the collision


### Data pre-processing


#### Dealing with missing values 

Let's check whether the selected parameters contain missing values or values recorded as 'Unknown'. Such cases should be removed from the dataset.

In [14]:
df['WEATHER'].unique()
indexWEATHER = df[ df['WEATHER'] == 'Unknown' ].index
df.drop(indexWEATHER , inplace=True)

In [15]:
df['ROADCOND'].unique()
indexROADCOND = df[ df['ROADCOND'] == 'Unknown' ].index
df.drop(indexROADCOND , inplace=True)

In [16]:
df['LIGHTCOND'].unique()
indexLIGHTCOND = df[ df['LIGHTCOND'] == 'Unknown' ].index
df.drop(indexLIGHTCOND , inplace=True)

In [17]:
# Note: dropna returns a new dataframe; here the result is only displayed, not assigned back to df
df.dropna(subset=['WEATHER', 'ROADCOND', 'LIGHTCOND'])

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,WEATHER,ROADCOND,LIGHTCOND
0,2,Injury Collision,Overcast,Wet,Daylight
1,1,Property Damage Only Collision,Raining,Wet,Dark - Street Lights On
2,1,Property Damage Only Collision,Overcast,Dry,Daylight
3,1,Property Damage Only Collision,Clear,Dry,Daylight
4,2,Injury Collision,Raining,Wet,Daylight
5,1,Property Damage Only Collision,Clear,Dry,Daylight
6,1,Property Damage Only Collision,Raining,Wet,Daylight
7,2,Injury Collision,Clear,Dry,Daylight
8,1,Property Damage Only Collision,Clear,Dry,Daylight
9,2,Injury Collision,Clear,Dry,Daylight
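
Note that dropna returns a new dataframe. In the cell above its result is displayed but not assigned, which would explain why the class counts further down sum to 131776 while the WEATHER counts sum to 126754. The following is a minimal sketch, on a made-up toy frame, of one way to persist the cleaning and to handle 'Unknown' and missing values in a single pass. The names toy, cols and cleaned are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the three condition columns (values are illustrative only)
toy = pd.DataFrame({
    "WEATHER":   ["Clear", "Unknown", None, "Raining"],
    "ROADCOND":  ["Dry", "Wet", "Dry", None],
    "LIGHTCOND": ["Daylight", "Daylight", "Dusk", "Unknown"],
})
cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

# Treat the literal string 'Unknown' as a missing value
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace("Unknown", np.nan)

# dropna returns a new frame, so the result must be assigned to take effect
cleaned = cleaned.dropna(subset=cols)
print(len(toy), "->", len(cleaned))
```

Applying this to the real data would change the row counts and scores reported in the rest of the notebook, so those outputs would need to be regenerated.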


#### Exploratory data analysis

Let's see how many cases of each class are in our dataset:

In [18]:
df['SEVERITYCODE'].value_counts().to_frame()

Unnamed: 0,SEVERITYCODE
1,90851
2,40925


There are 90851 cases of property damage only and 40925 cases of injury collisions. The classes are therefore imbalanced (about 69% versus 31%), which can negatively affect the model. No fatal accidents remain after pre-processing the data, so I was unable to investigate the conditions that could lead to a fatal accident.
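
The class imbalance sets a reference point for any classifier: a model that always predicts the more common class is already right most of the time. The following is a small sketch of that baseline, using only the two counts shown above. The variable names are illustrative.

```python
# Class counts taken from the value_counts output above
counts = {1: 90851, 2: 40925}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 0.69, which is a useful number to keep in mind when reading model accuracies later on.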

Let's see how many accidents happened in each type of weather:

In [19]:
df['WEATHER'].value_counts().to_frame()

Unnamed: 0,WEATHER
Clear,81801
Raining,23598
Overcast,20022
Snowing,588
Fog/Smog/Smoke,378
Other,223
Sleet/Hail/Freezing Rain,80
Blowing Sand/Dirt,39
Severe Crosswind,20
Partly Cloudy,5


Most accidents happened in clear weather. Many accidents also occurred in overcast weather and during rain. These are the weather conditions most frequently encountered on the road. The fewest accidents happened in partly cloudy weather and during a severe crosswind, which are the conditions least frequently encountered on the road.

Let's see how many accidents happened in each type of road condition:


In [20]:
df['ROADCOND'].value_counts().to_frame()

Unnamed: 0,ROADCOND
Dry,91784
Wet,33538
Ice,675
Snow/Slush,566
Other,74
Oil,56
Standing Water,52
Sand/Mud/Dirt,50


Most accidents occurred on a dry road. Many also happened on a wet road. Other conditions are very rare. The fewest accidents happened on standing water and on sand/mud/dirt.

In [21]:
df['LIGHTCOND'].value_counts().to_frame()

Unnamed: 0,LIGHTCOND
Daylight,86498
Dark - Street Lights On,32092
Dusk,4374
Dawn,1794
Dark - No Street Lights,1011
Dark - Street Lights Off,768
Other,162
Dark - Unknown Lighting,9


Most accidents occurred in daylight. Many also happened in the dark with street lights on.

This information only gives us an understanding of how the data is distributed in our dataset. From this distribution we can conclude which conditions are the most frequent.


Let's take a look at the percentage of property damage and injury collisions for each type of condition:

In [22]:
df.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True)

LIGHTCOND                 SEVERITYCODE
Dark - No Street Lights   1               0.801187
                          2               0.198813
Dark - Street Lights Off  1               0.760417
                          2               0.239583
Dark - Street Lights On   1               0.718746
                          2               0.281254
Dark - Unknown Lighting   1               0.555556
                          2               0.444444
Dawn                      1               0.676700
                          2               0.323300
Daylight                  1               0.671437
                          2               0.328563
Dusk                      1               0.679241
                          2               0.320759
Other                     1               0.783951
                          2               0.216049
Name: SEVERITYCODE, dtype: float64

Consider the distribution of data under the most common lighting conditions: Daylight and Dark - Street Lights On. It roughly corresponds to the overall distribution of data across classes. The greatest deviation from the overall distribution is observed under the rarest lighting conditions.

In [23]:
df.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True)

WEATHER                   SEVERITYCODE
Blowing Sand/Dirt         1               0.692308
                          2               0.307692
Clear                     1               0.688036
                          2               0.311964
Fog/Smog/Smoke            1               0.695767
                          2               0.304233
Other                     1               0.695067
                          2               0.304933
Overcast                  1               0.690740
                          2               0.309260
Partly Cloudy             2               0.600000
                          1               0.400000
Raining                   1               0.668785
                          2               0.331215
Severe Crosswind          1               0.750000
                          2               0.250000
Sleet/Hail/Freezing Rain  1               0.737500
                          2               0.262500
Snowing                   1               0

When looking at the distribution of the data for the most frequent weather conditions (Clear, Raining and Overcast), it is noticeable that they almost perfectly correspond to the general distribution. The greatest deviations from the general distribution by classes are observed in snowy and partly cloudy weather. This can be explained by the fact that under these conditions there were few accidents.

In [24]:
df.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)

ROADCOND        SEVERITYCODE
Dry             1               0.688846
                2               0.311154
Ice             1               0.791111
                2               0.208889
Oil             1               0.607143
                2               0.392857
Other           1               0.567568
                2               0.432432
Sand/Mud/Dirt   1               0.700000
                2               0.300000
Snow/Slush      1               0.825088
                2               0.174912
Standing Water  1               0.730769
                2               0.269231
Wet             1               0.672312
                2               0.327688
Name: SEVERITYCODE, dtype: float64

Under the most common conditions (Dry and Wet), the distribution of the data is very close to the general distribution. In most other road conditions, the distribution deviates from the overall distribution. However, these conditions are very rare.

From this analysis, we conclude that under the most common road, weather and lighting conditions the class distribution barely differs from the overall distribution of the data. This is a bad sign, as it may mean that there is no clear correlation between the chosen conditions and the severity of the accident.
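
The strength of the association can be put into a number. The following is a rough sketch that computes the chi-square statistic and Cramér's V for the two most common road conditions, Dry and Wet. The cell counts are reconstructed from the value_counts and normalized outputs above (for example 91784 * 0.311154 is about 28559), so they are approximate, and the choice of Cramér's V as the effect-size measure is an assumption of this sketch rather than part of the original analysis.

```python
import math

# 2x2 table for the two most common road conditions, reconstructed from
# the counts and proportions printed above (approximate values)
table = {
    "Dry": {"property": 63225, "injury": 28559},
    "Wet": {"property": 22548, "injury": 10990},
}

n = sum(sum(row.values()) for row in table.values())
row_totals = {r: sum(cells.values()) for r, cells in table.items()}
col_totals = {c: sum(table[r][c] for r in table) for c in ("property", "injury")}

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for r, cells in table.items():
    for c, observed in cells.items():
        expected = row_totals[r] * col_totals[c] / n
        chi2 += (observed - expected) ** 2 / expected

# Cramer's V for a 2x2 table is sqrt(chi2 / n); it ranges from 0 to 1
cramers_v = math.sqrt(chi2 / n)
print(f"chi2 = {chi2:.1f}, Cramer's V = {cramers_v:.3f}")
```

With more than a hundred thousand rows even a small difference produces a sizeable chi-square value, but a Cramér's V below 0.02 points to a negligible association, which is consistent with the concern stated above.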


Now let's calculate the average accident severity for each combination of conditions:

In [25]:
df_test = df[['LIGHTCOND', 'WEATHER', 'ROADCOND', 'SEVERITYCODE']]

In [26]:
df_grp = df_test.groupby (['LIGHTCOND', 'WEATHER', 'ROADCOND' ], as_index = False).mean()
df_sort = df_grp.sort_values('SEVERITYCODE', ascending=False)
df_sort

Unnamed: 0,LIGHTCOND,WEATHER,ROADCOND,SEVERITYCODE
19,Dark - No Street Lights,Raining,Ice,2.000000
88,Dark - Street Lights On,Sleet/Hail/Freezing Rain,Dry,2.000000
183,Dusk,Clear,Oil,2.000000
184,Dusk,Clear,Other,2.000000
24,Dark - No Street Lights,Sleet/Hail/Freezing Rain,Wet,2.000000
22,Dark - No Street Lights,Raining,Standing Water,2.000000
190,Dusk,Other,Ice,2.000000
170,Daylight,Sleet/Hail/Freezing Rain,Dry,2.000000
17,Dark - No Street Lights,Partly Cloudy,Dry,2.000000
195,Dusk,Overcast,Other,2.000000


There were 220 unique combinations in total. For 21 combinations of conditions, the average accident severity was 2.0, meaning that every recorded accident under those conditions involved an injury. However, such combinations were rare; most of them occurred only once.
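
Because an average over one or two accidents says little, it can help to show the number of accidents next to each mean. The following is a minimal sketch on a made-up toy frame. The column names mean_severity and n_accidents and the threshold min_count are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_test (values are illustrative only)
toy = pd.DataFrame({
    "LIGHTCOND":    ["Daylight", "Daylight", "Daylight", "Dusk"],
    "WEATHER":      ["Clear", "Clear", "Clear", "Clear"],
    "ROADCOND":     ["Dry", "Dry", "Dry", "Oil"],
    "SEVERITYCODE": [1, 2, 1, 2],
})

# Mean severity and number of accidents for each combination of conditions
summary = (
    toy.groupby(["LIGHTCOND", "WEATHER", "ROADCOND"])["SEVERITYCODE"]
       .agg(mean_severity="mean", n_accidents="count")
       .reset_index()
       .sort_values("mean_severity", ascending=False)
)

# Keep only combinations observed at least min_count times
min_count = 3  # hypothetical threshold
reliable = summary[summary["n_accidents"] >= min_count]
print(summary)
print(reliable)
```

In the toy data the Dusk/Clear/Oil combination has a mean of 2.0 but rests on a single accident, so it is filtered out by the threshold.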

#### Turning categorical variables into quantitative variables

Most statistical models cannot take objects or strings as input; for training they accept only numbers. In our dataset all input variables are categorical, so for further analysis we have to convert them into a numeric format.

In [27]:
dummy_variable_1 = pd.get_dummies(df["WEATHER"])
dummy_variable_1.head()

Unnamed: 0,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing
0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,1,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0


In [28]:
df = pd.concat([df, dummy_variable_1], axis=1)

df.drop("WEATHER", axis = 1, inplace=True)

dummy_variable_2 = pd.get_dummies(df["ROADCOND"])
dummy_variable_2.head()

Unnamed: 0,Dry,Ice,Oil,Other,Sand/Mud/Dirt,Snow/Slush,Standing Water,Wet
0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1


In [29]:
df = pd.concat([df, dummy_variable_2], axis=1)

df.drop("ROADCOND", axis = 1, inplace=True)


dummy_variable_3 = pd.get_dummies(df["LIGHTCOND"])
dummy_variable_3.head()

Unnamed: 0,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other
0,0,0,0,0,0,1,0,0
1,0,0,1,0,0,0,0,0
2,0,0,0,0,0,1,0,0
3,0,0,0,0,0,1,0,0
4,0,0,0,0,0,1,0,0


In [30]:
df = pd.concat([df, dummy_variable_3], axis=1)

df.drop("LIGHTCOND", axis = 1, inplace=True)

df.head(10)

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,...,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other.1
0,2,Injury Collision,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,1,Property Damage Only Collision,0,0,0,0,0,0,1,0,...,0,1,0,0,1,0,0,0,0,0
2,1,Property Damage Only Collision,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1,Property Damage Only Collision,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,2,Injury Collision,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,1,0,0
5,1,Property Damage Only Collision,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
6,1,Property Damage Only Collision,0,0,0,0,0,0,1,0,...,0,1,0,0,0,0,0,1,0,0
7,2,Injury Collision,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
8,1,Property Damage Only Collision,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
9,2,Injury Collision,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


After one-hot encoding, our dataset is ready to be used by machine learning algorithms.
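
One side effect of encoding the three columns separately is that each of them contributes a column named Other, so the combined frame contains duplicate column names. The following is a minimal sketch of an alternative, on a made-up toy frame, that encodes all three columns in one call so that every indicator is prefixed with its source column. The toy values are illustrative.

```python
import pandas as pd

# Toy stand-in for the three condition columns (values are illustrative only)
toy = pd.DataFrame({
    "WEATHER":   ["Clear", "Other", "Raining"],
    "ROADCOND":  ["Dry", "Other", "Wet"],
    "LIGHTCOND": ["Daylight", "Other", "Dusk"],
})

# Passing columns= makes get_dummies prefix each indicator with the column name
encoded = pd.get_dummies(toy, columns=["WEATHER", "ROADCOND", "LIGHTCOND"], dtype=int)
print(encoded.columns.tolist())
```

The resulting names, such as WEATHER_Other and ROADCOND_Other, are unique and make it easier to trace a feature back to its original attribute.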

#### Feature selection

Let's define the label set, y:

In [31]:
y = df['SEVERITYCODE'].values
y[0:5]

array([2, 1, 1, 1, 2])

Let's define the feature set, X:

In [32]:
# Build the feature matrix as a new frame so that df itself is left unchanged
Feature = df.drop(['SEVERITYCODE', 'SEVERITYDESC'], axis=1)
Feature.head()

Unnamed: 0,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,...,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other.1
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0


In [33]:
X = Feature
X[0:5]

Unnamed: 0,Blowing Sand/Dirt,Clear,Fog/Smog/Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Sleet/Hail/Freezing Rain,Snowing,...,Standing Water,Wet,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other.1
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,1,0,0


#### Feature Scaling

Data standardization gives the data zero mean and unit variance:

In [34]:
from sklearn import preprocessing

X= preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]


array([[-0.01720594, -1.27938986, -0.0536354 , -0.04117201,  2.3625326 ,
        -0.00615992, -0.46705511, -0.01232054, -0.0246467 , -0.06694862,
        -1.51494522, -0.07175446, -0.02061903, -0.02370389, -0.0194827 ,
        -0.06567873, -0.01986869,  1.71147743, -0.08792853, -0.07656525,
        -0.56739521, -0.00826453, -0.11748153,  0.72350341, -0.1852897 ,
        -0.03508379],
       [-0.01720594, -1.27938986, -0.0536354 , -0.04117201, -0.42327458,
        -0.00615992,  2.14107498, -0.01232054, -0.0246467 , -0.06694862,
        -1.51494522, -0.07175446, -0.02061903, -0.02370389, -0.0194827 ,
        -0.06567873, -0.01986869,  1.71147743, -0.08792853, -0.07656525,
         1.76243998, -0.00826453, -0.11748153, -1.38216349, -0.1852897 ,
        -0.03508379],
       [-0.01720594, -1.27938986, -0.0536354 , -0.04117201,  2.3625326 ,
        -0.00615992, -0.46705511, -0.01232054, -0.0246467 , -0.06694862,
         0.66008988, -0.07175446, -0.02061903, -0.02370389, -0.0194827 ,
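
The scaled values can be checked by hand. For a 0/1 indicator column with mean p, standardization maps 1 to (1 - p) / sqrt(p(1 - p)) and 0 to -p / sqrt(p(1 - p)). The following is a minimal sketch for the Overcast column, using counts from the outputs earlier in this notebook. It assumes the population standard deviation used by StandardScaler and that every row of the frame, including rows without a recorded weather value, counts in the denominator.

```python
import math

# Counts taken from the outputs above: 20022 'Overcast' rows out of
# 90851 + 40925 = 131776 rows in the frame
n_overcast = 20022
n_rows = 90851 + 40925

p = n_overcast / n_rows           # mean of the 0/1 indicator column
sigma = math.sqrt(p * (1 - p))    # population standard deviation

z_one = (1 - p) / sigma           # standardized value where Overcast == 1
z_zero = (0 - p) / sigma          # standardized value where Overcast == 0
print(round(z_one, 4), round(z_zero, 4))
```

The two results should come out close to the 2.3625 and -0.4233 that appear in the fifth column of the array above.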
       

## Predictive Modeling

Our data is ready for machine learning algorithms. As we found in the previous steps, our target consists of discrete values, so this is a classification problem: given a dataset with predefined labels, we need to build a model that predicts the class of a new or unknown case. For this we use the following classification algorithms:

* K-Nearest Neighbors (KNN)
* Decision Tree
* Support Vector Machine
* Logistic Regression

Let's split our dataset into train and test sets to evaluate the algorithms:

In [35]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25, random_state=4 )
print ('Train set: ', X_train.shape, y_train.shape)
print ('Test set: ', X_test.shape, y_test.shape)

Train set:  (98832, 26) (98832,)
Test set:  (32944, 26) (32944,)


### K-Nearest Neighbors (KNN)

The K-Nearest Neighbors algorithm is a classification algorithm that takes a bunch of labeled points and uses them to learn how to label other points. This algorithm classifies cases based on their similarity to other cases.

First, we need to determine which value of K to use. To do this, we try several values and see at which K the algorithm achieves the best accuracy on the test set.

As a metric, we will use the Jaccard similarity score. For single-label classification, scikit-learn's jaccard_similarity_score was equivalent to accuracy_score, so the values below can be read as accuracy.


In [36]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import numpy as np

Ks = 10
mean_acc = np.zeros((Ks-1))

for i in range (1, Ks):
    
    # Train model and predict
    neigh = KNeighborsClassifier(n_neighbors=i).fit(X_train, y_train)
    y_pred = neigh.predict(X_test)

    # jaccard_similarity_score was removed from newer scikit-learn releases;
    # for single-label classification it was equivalent to accuracy_score
    mean_acc[i-1] = metrics.accuracy_score(y_test, y_pred)
    print('Test set accuracy for k = ', i, ': ', mean_acc[i-1])
   

#Finding the best k
k = np.argmax(mean_acc, axis=0)+1
print ('The best accuracy was achieved by k = ', k)

Test set accuracy for k =  1 :  0.6203861097620204
Test set accuracy for k =  2 :  0.6674356483729966
Test set accuracy for k =  3 :  0.65441355026712
Test set accuracy for k =  4 :  0.6563562408936376
Test set accuracy for k =  5 :  0.633226080621661
Test set accuracy for k =  6 :  0.6665250121418164
Test set accuracy for k =  7 :  0.6663125303545411
Test set accuracy for k =  8 :  0.6897765905779505
Test set accuracy for k =  9 :  0.6671017484215639
The best accuracy was achieved by k =  8
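
Choosing K by its score on the test set tends to make the final test score optimistic, because the same data is used both to select and to evaluate the model. The following is a hedged sketch of an alternative that selects K by cross-validation on the training set only. It runs on synthetic data from make_classification rather than on the collisions dataset, and the names X_demo, y_demo, cv_scores and best_k are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the collisions dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=4)

# Score each candidate k with 5-fold cross-validation on the training set only
cv_scores = {}
for k_candidate in range(1, 10):
    knn = KNeighborsClassifier(n_neighbors=k_candidate)
    cv_scores[k_candidate] = cross_val_score(
        knn, X_tr, y_tr, cv=5, scoring="accuracy"
    ).mean()

best_k = max(cv_scores, key=cv_scores.get)

# The held-out test set is used once, for the final estimate
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print(best_k, round(test_accuracy, 3))
```

With this split of roles, the test accuracy remains an estimate of performance on data the model selection never saw.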


Once we have chosen the K value, we can train the model.

In [37]:
model_KNN = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)

In [38]:
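from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
# jaccard_similarity_score was removed from newer scikit-learn releases.
# For single-label classification it was equivalent to accuracy_score,
# so the old name is aliased here to keep the cells below unchanged.
from sklearn.metrics import accuracy_score as jaccard_similarity_score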

y_pred_KNN = model_KNN.predict(X_test)

f1 = []
jac = []
lgloss = []

#print ("F1_Score KNN = ", f1_score(y_test, y_pred_KNN, average='weighted'))
#print ("Jaccard KNN = ", jaccard_similarity_score(y_test, y_pred_KNN))
f1.append(f1_score(y_test, y_pred_KNN, average='weighted'))
jac.append (jaccard_similarity_score(y_test, y_pred_KNN))
lgloss.append (np.nan)

### Decision Tree

Decision trees are built by splitting the training set into distinct nodes, where one node contains all of or most of one category of the data. A decision tree can be constructed by considering the attributes one by one. 

In [39]:
from sklearn.tree import DecisionTreeClassifier

model_DT = DecisionTreeClassifier (criterion = "entropy", max_depth = 4)
model_DT.fit (X_train, y_train)

y_pred_DT = model_DT.predict (X_test)

                 
#print ("F1_Score DT = ", f1_score(y_test, y_pred_DT, average='weighted'))
#print ("Jaccard DT = ", jaccard_similarity_score(y_test, y_pred_DT))
f1.append(f1_score(y_test, y_pred_DT, average='weighted'))
jac.append (jaccard_similarity_score(y_test, y_pred_DT))
lgloss.append (np.nan)

### Support Vector Machine

A Support Vector Machine is a supervised algorithm that can classify cases by finding a separator. SVM works by first mapping data to a high dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. Then, a separator is estimated for the data. The data should be transformed in such a way that a separator could be drawn as a hyperplane. Let's train the SVM algorithm with 'rbf' kernel:

In [40]:
from sklearn import svm

model_SVM = svm.SVC(kernel = 'rbf', gamma='scale')
model_SVM.fit (X_train, y_train)

y_pred_SVM = model_SVM.predict (X_test)

#print ("F1_Score SVM = ", f1_score(y_test, y_pred_SVM, average='weighted'))
#print ("Jaccard SVM = ", jaccard_similarity_score(y_test, y_pred_SVM))
f1.append(f1_score(y_test, y_pred_SVM, average='weighted'))
jac.append (jaccard_similarity_score(y_test, y_pred_SVM))
lgloss.append (np.nan)

### Logistic Regression

A feature of logistic regression is that it predicts the probability that a sample belongs to each class; the cases are then mapped to a discrete class based on that probability.

In [41]:
from sklearn.linear_model import LogisticRegression

# C=0.1 applies relatively strong regularization
model_LR = LogisticRegression(C=0.1, solver='liblinear').fit(X_train, y_train)

y_pred_LR = model_LR.predict(X_test)
y_pred_prob_LR = model_LR.predict_proba(X_test)  # class probabilities, needed for log loss

# print("F1_Score LR = ", f1_score(y_test, y_pred_LR, average='weighted'))
# print("Jaccard LR = ", jaccard_similarity_score(y_test, y_pred_LR))
# print("LogLoss LR= ", log_loss(y_test, y_pred_prob_LR))
f1.append(f1_score(y_test, y_pred_LR, average='weighted'))
jac.append(jaccard_similarity_score(y_test, y_pred_LR))
lgloss.append(log_loss(y_test, y_pred_prob_LR))
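
The step from probability to class, and the log loss computed from those probabilities, can be sketched as follows. This is a simplified binary illustration with 0/1 labels, not the scikit-learn implementation. The linear scores, the labels and the helper names are hypothetical.

```python
from math import exp, log


def sigmoid(z):
    """Map a linear score w·x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def to_class(p_positive, threshold=0.5):
    """Assign the positive class when its probability reaches the threshold."""
    return 1 if p_positive >= threshold else 0


def log_loss_binary(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)


# Hypothetical linear scores for three samples and their true labels
scores = [-2.0, 0.0, 3.0]
y_true_toy = [0, 1, 1]

probs = [sigmoid(z) for z in scores]
classes = [to_class(p) for p in probs]
loss = log_loss_binary(y_true_toy, probs)
coin_flip_loss = log_loss_binary(y_true_toy, [0.5, 0.5, 0.5])
print(classes, loss, coin_flip_loss)
```

A model that always outputs a probability of 0.5 has a log loss of ln 2 ≈ 0.693, which gives a simple reference point when reading the log loss reported for the logistic regression model.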

### Model Evaluation Results

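After the four classification models have been trained, we can compare them using the F1-score and the Jaccard similarity score. For single-label classification, `jaccard_similarity_score` as used above is equivalent to plain accuracy. Log loss is reported only for Logistic Regression, the only model here for which class probabilities were computed.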

Let's present the result as a table:


In [42]:
data = {
    'Algorithm': ['KNN', 'Decision Tree', 'SVM', 'LogisticRegression'],
    'Jaccard': jac,
    'F1-score': f1,
    'LogLoss': lgloss,
}
# Use the algorithm name as the row index
df_result = pd.DataFrame(data).set_index('Algorithm')
df_result

| Algorithm | Jaccard | F1-score | LogLoss |
|---|---|---|---|
| KNN | 0.689777 | 0.565658 | NaN |
| Decision Tree | 0.690991 | 0.564857 | NaN |
| SVM | 0.690869 | 0.565135 | NaN |
| LogisticRegression | 0.691021 | 0.564816 | 0.615403 |


By the Jaccard score, the Logistic Regression model performed best (~69.1% accuracy), while KNN had the highest F1-score (0.566). The differences between the models are very small, about 0.001 on either metric.

The accuracy of the models is essentially equal to the share of the more common class in the data. In other words, a model that always predicted the more common class would reach about the same accuracy.

This confirms the concern that the severity of road accidents is not meaningfully correlated with the chosen parameters.
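
The majority-class comparison can be made explicit with a small baseline check. The sketch below uses hypothetical toy labels with a roughly 70/30 split rather than the project data, and the helper name is illustrative. In scikit-learn, `DummyClassifier(strategy="most_frequent")` provides the same kind of baseline.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)


# Hypothetical labels: 1 = property damage only, 2 = injury collision
y_train_toy = [1] * 70 + [2] * 30
y_test_toy = [1] * 7 + [2] * 3

baseline_label, baseline_acc = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(baseline_label, baseline_acc)
```

Running the same check on the real `y_train` and `y_test` gives the number that each model's accuracy has to beat before its features can be said to carry any signal.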


## Conclusions

In this study, I analyzed the conditions that can affect the severity of road traffic accidents, with the goal of creating a navigation application that alerts drivers to potential danger. Such an app could help drivers change their route where possible or drive more carefully, which could in turn reduce the number of serious road accidents. The data would come from government agencies, which are also potential sponsors of the application.

An important requirement was that the model inputs can be collected in real time and applied to different regions. Of all the available parameters, only three met this requirement: road conditions, lighting conditions and weather conditions. I developed classification models to predict which severity level (property damage only or injury collision) an accident is more likely to have under the conditions that currently exist along a given route.

The developed classification models reached an accuracy of about 69%, which is no better than the share of the more common class in the data. Unfortunately, both the preliminary analysis and the final models indicate that the severity of road accidents does not meaningfully depend on the chosen parameters.


## Future directions

It was not possible to build a reliable model from this dataset that could warn drivers about the likely severity of an accident based on real-time data. Future work should therefore look for other factors that can be obtained in real time and that do affect accident severity.

No fatal cases remain in this dataset after preprocessing. Future work must take such cases into account.

In addition, it may be worth taking the car body type into account, because it can significantly affect the consequences of an accident. Users could set this parameter themselves in the application, so collecting it would not be a problem. Statistics that include this factor may be hard to find, but it could improve the model enough for use in the final product.
