# IBM Applied Data Science Capstone Project

This notebook is mainly used for completing the capstone project.

Feel free to reach me at [@ScientificGhosh](https://twitter.com/ScientificGhosh) on Twitter.

In [1]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import MarkerCluster
sns.set()

## Introduction: Business Understanding

The City of Seattle's Open Data Program makes the data generated by the city openly available to the public in order to improve residents' quality of life; increase transparency, accountability, and comparability; promote economic development and research; and improve internal performance management.

The Traffic Records Group of the Traffic Management Division, Seattle Department of Transportation, provides data on all collisions and crashes that have occurred in the city from 2004 to the present day. The data is updated weekly and can be found at the [Seattle Open GeoData Portal](https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0?geometry=-122.326%2C47.592%2C-122.318%2C47.594).

The objective is to extract from this data the features needed to build a model that predicts the severity of future collisions in the city. Such a model would help the Department of Transportation prioritise its SOPs and focus its efforts so that fewer automobile collisions result in fatalities.

## Data

The dataset is available as comma-separated values (CSV) files, KML files, and ESRI shapefiles that can be downloaded from the Seattle Open GeoData Portal. The data is also available from RESTful API services in formats such as GeoJSON.

### Downloading and Loading the Data

We download the dataset to our project directory and take a look at the data types and the dimensionality of the data. We can see that the dataset contains 221,389 records and 40 fields.

The metadata of the dataset can be found on the website of the [Seattle Department of Transportation](https://www.seattle.gov/Documents/Departments/SDOT/GIS/Collisions_OD.pdf). The dataset summary there describes each field and its possible values.

In [3]:
!wget -O data.csv "https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv"

--2020-09-12 12:02:46--  https://opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv
Resolving opendata.arcgis.com (opendata.arcgis.com)... 34.224.12.157, 50.19.49.12, 54.204.141.17
Connecting to opendata.arcgis.com (opendata.arcgis.com)|34.224.12.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘data.csv’

data.csv                [                <=> ]  80.99M  15.3MB/s    in 6.3s    

2020-09-12 12:02:54 (12.9 MB/s) - ‘data.csv’ saved [84923797]



In [4]:
data = pd.read_csv("data.csv")
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221389 entries, 0 to 221388
Data columns (total 40 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   X                213918 non-null  float64
 1   Y                213918 non-null  float64
 2   OBJECTID         221389 non-null  int64  
 3   INCKEY           221389 non-null  int64  
 4   COLDETKEY        221389 non-null  int64  
 5   REPORTNO         221389 non-null  object 
 6   STATUS           221389 non-null  object 
 7   ADDRTYPE         217677 non-null  object 
 8   INTKEY           71884 non-null   float64
 9   LOCATION         216801 non-null  object 
 10  EXCEPTRSNCODE    100986 non-null  object 
 11  EXCEPTRSNDESC    11779 non-null   object 
 12  SEVERITYCODE     221388 non-null  object 
 13  SEVERITYDESC     221389 non-null  object 
 14  COLLISIONTYPE    195159 non-null  object 
 15  PERSONCOUNT      221389 non-null  int64  
 16  PEDCOUNT         221389 non-null  int6

The data contains several categorical fields and corresponding descriptions which could help us in further analysis. We make an attempt at understanding the data in terms of the fields that we shall take into account for later stages of model building.
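The `data.info()` output shows that several columns have fewer non-null values than the 221,389 rows. A minimal sketch of how those gaps could be tabulated is given below; the helper name `missing_summary` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "share": missing / len(df),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "X": [-122.33, None, -122.31, None],
    "OBJECTID": [1, 2, 3, 4],
    "ADDRTYPE": ["Block", "Intersection", None, "Block"],
})
summary = missing_summary(toy)
print(summary)
```

On the real dataset the same call would be `missing_summary(data)`.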

The `X` and `Y` fields denote the longitude and latitude of the collisions, respectively. We can visualize the first 1,000 collisions that have non-null coordinates on a map.

In [5]:
# Centre the map on Seattle; the name avoids shadowing the built-in `map`.
collision_map = folium.Map(location=[47.60, -122.33], zoom_start=12)
marker_cluster = MarkerCluster().add_to(collision_map)
# folium expects [latitude, longitude], i.e. [Y, X].
locations = data.loc[data['Y'].notna(), ['Y', 'X']].head(1000)
for lat_lon in locations.values.tolist():
    folium.Marker(lat_lon).add_to(marker_cluster)
collision_map
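Because `X` holds longitude and `Y` holds latitude, the two columns have to be swapped before they are passed to folium, which takes `[latitude, longitude]` pairs. The sketch below isolates that step on a made-up toy frame and drops rows where either coordinate is missing. The variable names are illustrative assumptions.

```python
import pandas as pd

# Toy coordinates standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "X": [-122.33, None, -122.31],   # longitude
    "Y": [47.60, 47.61, None],       # latitude
})

# Drop rows missing either coordinate, then reorder to [latitude, longitude].
lat_lon_pairs = toy.dropna(subset=["X", "Y"])[["Y", "X"]].to_numpy().tolist()
print(lat_lon_pairs)
```

Only the first toy row survives, since the other two each lack one coordinate.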

The `WEATHER` field contains a description of the weather conditions at the time of the collision.

In [6]:
data['WEATHER'].value_counts().to_frame('count')

| Value | count |
| --- | --: |
| Clear | 114694 |
| Raining | 34036 |
| Overcast | 28543 |
| Unknown | 15131 |
| Snowing | 919 |
| Other | 860 |
| Fog/Smog/Smoke | 577 |
| Sleet/Hail/Freezing Rain | 116 |
| Blowing Sand/Dirt | 56 |
| Severe Crosswind | 26 |


The `ROADCOND` field describes the condition of the road during the collision. 

In [7]:
data['ROADCOND'].value_counts().to_frame('count')

| Value | count |
| --- | --: |
| Dry | 128535 |
| Wet | 48734 |
| Unknown | 15139 |
| Ice | 1232 |
| Snow/Slush | 1014 |
| Other | 136 |
| Standing Water | 119 |
| Sand/Mud/Dirt | 77 |
| Oil | 64 |


The `LIGHTCOND` field describes the light conditions during the collision.

In [8]:
data['LIGHTCOND'].value_counts().to_frame('count')

| Value | count |
| --- | --: |
| Daylight | 119448 |
| Dark - Street Lights On | 50125 |
| Unknown | 13532 |
| Dusk | 6082 |
| Dawn | 2608 |
| Dark - No Street Lights | 1579 |
| Dark - Street Lights Off | 1239 |
| Other | 244 |
| Dark - Unknown Lighting | 23 |


The `SPEEDING` field classifies collisions based on whether or not speeding was a factor in the collision. Blanks indicate cases where the vehicle was not speeding.

In [9]:
data['SPEEDING'].value_counts().to_frame()

| Value | SPEEDING |
| --- | --: |
| Y | 9928 |
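Since only `Y` is ever recorded, the field can be turned into a binary flag during pre-processing. The following is a minimal sketch on toy data, relying on the assumption stated above that a blank means speeding was not a factor. The name `speeding_flag` is illustrative.

```python
import pandas as pd

# Toy column standing in for data['SPEEDING']; blanks are read as missing values.
toy = pd.DataFrame({"SPEEDING": ["Y", None, None, "Y", None]})

# Assumption taken from the text: a blank means speeding was not a factor.
speeding_flag = toy["SPEEDING"].fillna("N").eq("Y").astype(int)
print(speeding_flag.tolist())  # [1, 0, 0, 1, 0]
```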


The `SEVERITYCODE` field contains a code that corresponds to the severity of the collision, and `SEVERITYDESC` contains a detailed description of that severity.

We can conclude that there were 349 collisions that resulted in at least one fatality, and 3,102 collisions that resulted in serious injuries. The following table lists the meaning of each of the codes used in the `SEVERITYCODE` field:

| SEVERITYCODE Value | Meaning |
| :-: | --- |
| 1 | Accidents resulting in property damage |
| 2 | Accidents resulting in injuries |
| 2b | Accidents resulting in serious injuries |
| 3 | Accidents resulting in fatalities |
| 0 | Data Unavailable i.e. Blanks |

In [10]:
data['SEVERITYCODE'].value_counts().to_frame('count')

| Value | count |
| --- | --: |
| 1 | 137596 |
| 2 | 58747 |
| 0 | 21594 |
| 2b | 3102 |
| 3 | 349 |
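The codes are strings and `2b` sits between `2` and `3`, so a plain numeric cast would not preserve the severity order. One possible way to encode that order is an ordered categorical, sketched below on toy data. Treating code `0` as missing is an assumption made here for illustration, not a decision taken in the notebook.

```python
import pandas as pd

# Severity order implied by the table above; "0" (unavailable) is left out on purpose.
SEVERITY_ORDER = ["1", "2", "2b", "3"]

# Toy column standing in for data['SEVERITYCODE'].
toy = pd.Series(["1", "2", "0", "2b", "3", None], name="SEVERITYCODE")

severity = pd.Categorical(toy, categories=SEVERITY_ORDER, ordered=True)
# Values outside the category list (here "0" and blanks) get the code -1.
ranks = pd.Series(severity.codes, name="severity_rank")
print(ranks.tolist())  # [0, 1, -1, 2, 3, -1]
```

Rows with rank `-1` can then be dropped or handled separately, depending on how the target variable is defined later.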


The `UNDERINFL` field describes whether or not a driver involved was under the influence of drugs or alcohol. The values `0` and `N` denote that the driver was not under the influence, while `1` and `Y` denote that they were.

In [11]:
data['UNDERINFL'].value_counts().to_frame('count')

| Value | count |
| --- | --: |
| N | 103874 |
| 0 | 81676 |
| Y | 5399 |
| 1 | 4230 |
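The four values encode only two states, so they can be collapsed into a single 0/1 flag. The sketch below shows one way to do this. The lookup `UNDERINFL_MAP` and the helper `normalise_underinfl` are hypothetical names, and leaving blanks as missing rather than assigning them a state is an assumption.

```python
import pandas as pd

# Hypothetical lookup built from the four values described above.
UNDERINFL_MAP = {"N": 0, "0": 0, "Y": 1, "1": 1}


def normalise_underinfl(col: pd.Series) -> pd.Series:
    """Map N/0 to 0 and Y/1 to 1; anything else (including blanks) becomes <NA>."""
    # Casting to str first means integer 0/1 values are handled like "0"/"1".
    as_text = col.astype(str).str.strip()
    return as_text.map(UNDERINFL_MAP).astype("Int64")


# Toy column standing in for data['UNDERINFL'].
toy = pd.Series(["N", "0", "Y", "1", None])
flag = normalise_underinfl(toy)
print(flag.tolist())
```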


The `PERSONCOUNT` and `VEHCOUNT` fields indicate how many people and vehicles, respectively, were involved in a collision.

In [12]:
data['PERSONCOUNT'].describe()

count    221389.000000
mean          2.227161
std           1.470190
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max          93.000000
Name: PERSONCOUNT, dtype: float64

In [13]:
data['VEHCOUNT'].describe()

count    221389.000000
mean          1.731057
std           0.829259
min           0.000000
25%           2.000000
50%           2.000000
75%           2.000000
max          15.000000
Name: VEHCOUNT, dtype: float64

As the dataset has probably been sourced from a database table, it contains several unique identifiers and spatial keys that are likely irrelevant to further statistical analysis: `OBJECTID`, `INCKEY`, `COLDETKEY`, `INTKEY`, `SEGLANEKEY`, `CROSSWALKKEY`, and `REPORTNO`. Other fields such as `EXCEPTRSNCODE`, `SDOT_COLCODE`, `SDOTCOLNUM`, and `LOCATION`, along with their corresponding descriptions (if any), are categorical but have too many distinct values to be of much use for analysis. The `INCDATE` and `INCDTTM` fields denote the date and time of the incident but may not be of use in further analyses. The data therefore needs to be pre-processed.
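A first pre-processing step could be to drop the identifier and high-cardinality columns named above. The sketch below uses only the column names listed in the text. The helper `drop_unused` and the toy frame are illustrative assumptions, and the matching description columns are left out because their names are not listed here.

```python
import pandas as pd

# Column names taken from the paragraph above.
ID_COLUMNS = ["OBJECTID", "INCKEY", "COLDETKEY", "INTKEY",
              "SEGLANEKEY", "CROSSWALKKEY", "REPORTNO"]
HIGH_CARDINALITY = ["EXCEPTRSNCODE", "SDOT_COLCODE", "SDOTCOLNUM", "LOCATION"]


def drop_unused(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without identifier and high-cardinality columns."""
    # errors="ignore" skips any listed name that is absent from the frame.
    return df.drop(columns=ID_COLUMNS + HIGH_CARDINALITY, errors="ignore")


# Toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "OBJECTID": [1, 2],
    "REPORTNO": ["A1", "B2"],
    "LOCATION": ["somewhere", "elsewhere"],
    "WEATHER": ["Clear", "Raining"],
    "SEVERITYCODE": ["1", "2"],
})
trimmed = drop_unused(toy)
print(list(trimmed.columns))  # ['WEATHER', 'SEVERITYCODE']
```

Because `drop` returns a new frame, the original `data` stays available for any later step that still needs the keys or timestamps.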