# Applied Data Science Capstone

## Introduction/Business Problem

The Seattle government is concerned with the severity and number of accidents and wants to employ actions based on the analysis of historical data to alert drivers of the imminence of accidents. This study aims to predict the severity of an accident, given the locations, weather and road conservation. The analysis aims to determine a set of possible causes that contribute to the increase in the severity of accidents to allow preventive actions by road users. The targets audience of the project are drivers, rescue groups, police and insurance companys. It's expected to achieve a reduction in the number and severity of accidents to make drivers and passengers more secure.

## About dataset

This dataset is about collisions that occurred between 2004 and 2020 in the city of Seattle. The __Data-Collisions.csv__ data set includes details of 194673 collisions provided by the Seattle Department of Transportation Traffic Management Division. It includes following fields:

| Field | Description |
| --- | --- |
| OBJECTID | ESRI unique identifier |
| LATITUDE (X) | ESRI geometry field |
| LONGITUDE (Y) | ESRI geometry field |
| ADDRTYPE | Collision address type (Alley/Block/Intersection) |
| SEVERITYCODE | A code that corresponds to the severity of the collision (3 — fatality/2b — serious injury/2—injury/1 — prop damage/0 — unknown) |
| COLLISIONTYPE | Collision type |
| INCDTTM | The date and time of the incident |
| UNDERINFL | Whether or not a driver involved was under the influence of drugs or alcohol |
| WEATHER | A description of the weather conditions during the time of the collision |
| ROADCOND | The condition of the road during the collision |
| LIGHTCOND | The light conditions during the collision |
| SPEEDING | Whether or not speeding was a factor in the collision (Y/N) |

## Methodology

### Reading and saving the Data 

Downloading the data set and loading the data from the CSV file

In [1]:
import pandas as pd
# url = 'https://data-seattlecitygis.opendata.arcgis.com/datasets/5b5c745e0f1f48e7a53acec63a0022ab_0.csv'
df = pd.read_csv("Data-Collisions.csv", low_memory=False)
# df.to_csv("Data-Collisions.csv", index=False)

In [2]:
df.head()

Unnamed: 0,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,LOCATION,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,-122.356511,47.517361,1,327920,329420,3856094,Matched,Intersection,34911.0,17TH AVE SW AND SW ROXBURY ST,...,Dry,Daylight,,,,10,Entering at angle,0,0,N
1,-122.361405,47.702064,2,46200,46200,1791736,Matched,Block,,HOLMAN RD NW BETWEEN 4TH AVE NW AND 3RD AVE NW,...,Wet,Dusk,,5101020.0,,13,From same direction - both going straight - bo...,0,0,N
2,-122.317414,47.664028,3,1212,1212,3507861,Matched,Block,,ROOSEVELT WAY NE BETWEEN NE 47TH ST AND NE 50T...,...,Dry,Dark - Street Lights On,,,,30,From opposite direction - all others,0,0,N
3,-122.318234,47.619927,4,327909,329409,EA03026,Matched,Intersection,29054.0,11TH AVE E AND E JOHN ST,...,Wet,Dark - Street Lights On,,,,0,Vehicle going straight hits pedestrian,0,0,N
4,-122.351724,47.560306,5,104900,104900,2671936,Matched,Block,,WEST MARGINAL WAY SW BETWEEN SW ALASKA ST AND ...,...,Ice,Dark - Street Lights On,,9359012.0,Y,50,Fixed object,0,0,N


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221738 entries, 0 to 221737
Data columns (total 40 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   X                214260 non-null  float64
 1   Y                214260 non-null  float64
 2   OBJECTID         221738 non-null  int64  
 3   INCKEY           221738 non-null  int64  
 4   COLDETKEY        221738 non-null  int64  
 5   REPORTNO         221738 non-null  object 
 6   STATUS           221738 non-null  object 
 7   ADDRTYPE         218024 non-null  object 
 8   INTKEY           72027 non-null   float64
 9   LOCATION         217145 non-null  object 
 10  EXCEPTRSNCODE    101335 non-null  object 
 11  EXCEPTRSNDESC    11785 non-null   object 
 12  SEVERITYCODE     221737 non-null  object 
 13  SEVERITYDESC     221738 non-null  object 
 14  COLLISIONTYPE    195287 non-null  object 
 15  PERSONCOUNT      221738 non-null  int64  
 16  PEDCOUNT         221738 non-null  int6

### Data Wrangling

In [4]:
import numpy as np

#### Drop unnecessary columns

In [5]:
df.drop(["INCKEY","COLDETKEY","REPORTNO","STATUS","INTKEY","LOCATION","EXCEPTRSNCODE","EXCEPTRSNDESC","SEVERITYDESC","PERSONCOUNT","PEDCOUNT","PEDCYLCOUNT","VEHCOUNT","INJURIES","SERIOUSINJURIES","FATALITIES","INCDATE","JUNCTIONTYPE","SDOT_COLCODE","SDOT_COLDESC","INATTENTIONIND","UNDERINFL","PEDROWNOTGRNT","SDOTCOLNUM","ST_COLCODE","ST_COLDESC","SEGLANEKEY","CROSSWALKKEY","HITPARKEDCAR"], axis=1, inplace=True)

#### Count missing values in each column

In [6]:
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())

X
False    214260
True       7478
Name: X, dtype: int64
Y
False    214260
True       7478
Name: Y, dtype: int64
OBJECTID
False    221738
Name: OBJECTID, dtype: int64
ADDRTYPE
False    218024
True       3714
Name: ADDRTYPE, dtype: int64
SEVERITYCODE
False    221737
True          1
Name: SEVERITYCODE, dtype: int64
COLLISIONTYPE
False    195287
True      26451
Name: COLLISIONTYPE, dtype: int64
INCDTTM
False    221738
Name: INCDTTM, dtype: int64
WEATHER
False    195097
True      26641
Name: WEATHER, dtype: int64
ROADCOND
False    195178
True      26560
Name: ROADCOND, dtype: int64
LIGHTCOND
False    195008
True      26730
Name: LIGHTCOND, dtype: int64
SPEEDING
True     211802
False      9936
Name: SPEEDING, dtype: int64


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221738 entries, 0 to 221737
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   X              214260 non-null  float64
 1   Y              214260 non-null  float64
 2   OBJECTID       221738 non-null  int64  
 3   ADDRTYPE       218024 non-null  object 
 4   SEVERITYCODE   221737 non-null  object 
 5   COLLISIONTYPE  195287 non-null  object 
 6   INCDTTM        221738 non-null  object 
 7   WEATHER        195097 non-null  object 
 8   ROADCOND       195178 non-null  object 
 9   LIGHTCOND      195008 non-null  object 
 10  SPEEDING       9936 non-null    object 
dtypes: float64(2), int64(1), object(8)
memory usage: 11.8+ MB


#### Drop all rows that do not have data

In [8]:
df = df.dropna(subset=["X","Y","OBJECTID","ADDRTYPE","SEVERITYCODE","COLLISIONTYPE","INCDTTM","WEATHER","ROADCOND","LIGHTCOND"], axis=0)

#### Convert data types to proper format

In [9]:
df['INCDTTM'] = pd.to_datetime(df['INCDTTM'])
df['INCDTTM'].head()

0   2020-01-19 09:01:00
1   2005-04-11 18:31:00
2   2013-03-31 02:09:00
3   2020-01-06 17:55:00
4   2009-12-25 19:00:00
Name: INCDTTM, dtype: datetime64[ns]

In [10]:
df['SPEEDING'].fillna('N', inplace=True)
df['SPEEDING'].value_counts()

N    180271
Y      9290
Name: SPEEDING, dtype: int64

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189561 entries, 0 to 221737
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   X              189561 non-null  float64       
 1   Y              189561 non-null  float64       
 2   OBJECTID       189561 non-null  int64         
 3   ADDRTYPE       189561 non-null  object        
 4   SEVERITYCODE   189561 non-null  object        
 5   COLLISIONTYPE  189561 non-null  object        
 6   INCDTTM        189561 non-null  datetime64[ns]
 7   WEATHER        189561 non-null  object        
 8   ROADCOND       189561 non-null  object        
 9   LIGHTCOND      189561 non-null  object        
 10  SPEEDING       189561 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(7)
memory usage: 12.3+ MB


In [12]:
df['ADDRTYPECODE'] = pd.factorize(df.SPEEDING)[0]
df['COLLISIONTYPECODE'] = pd.factorize(df.SPEEDING)[0]
df['WEATHERCODE'] = pd.factorize(df.WEATHER)[0]
df['ROADCONDCODE'] = pd.factorize(df.ROADCOND)[0]
df['LIGHTCONDCODE'] = pd.factorize(df.LIGHTCOND)[0]
df['SPEEDINGCODE'] = pd.factorize(df.SPEEDING)[0]

In [13]:
list1 = ['3','2b','2','1','0']
list2 = ['Fatality','Serious injury','Injury','Prop damage','unknown']
replacement_map = {i1: i2 for i1, i2 in zip(list1, list2)}
df['SEVERITY'] = df['SEVERITYCODE'].map(replacement_map)

In [14]:
df['YEARS'] = pd.DatetimeIndex(df['INCDTTM']).year
years = df['YEARS']
years = years.tolist()

In [15]:
df.columns.values

array(['X', 'Y', 'OBJECTID', 'ADDRTYPE', 'SEVERITYCODE', 'COLLISIONTYPE',
       'INCDTTM', 'WEATHER', 'ROADCOND', 'LIGHTCOND', 'SPEEDING',
       'ADDRTYPECODE', 'COLLISIONTYPECODE', 'WEATHERCODE', 'ROADCONDCODE',
       'LIGHTCONDCODE', 'SPEEDINGCODE', 'SEVERITY', 'YEARS'], dtype=object)

In [16]:
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
df["ADDRTYPE"].value_counts()

Block           123961
Intersection     65600
Name: ADDRTYPE, dtype: int64

In [18]:
df["SEVERITY"].value_counts()

Prop damage       129499
Injury             56730
Serious injury      2999
Fatality             331
unknown                2
Name: SEVERITY, dtype: int64

In [19]:
df["COLLISIONTYPE"].value_counts()

Parked Car    46776
Angles        35317
Rear Ended    33524
Other         23046
Sideswipe     18293
Left Turn     14022
Pedestrian     7585
Cycles         5883
Right Turn     2968
Head On        2147
Name: COLLISIONTYPE, dtype: int64

In [20]:
df["WEATHER"].value_counts()

Clear                       112433
Raining                      32864
Overcast                     27927
Unknown                      13884
Snowing                        900
Other                          793
Fog/Smog/Smoke                 561
Sleet/Hail/Freezing Rain       115
Blowing Sand/Dirt               49
Severe Crosswind                25
Partly Cloudy                   10
Name: WEATHER, dtype: int64

In [21]:
df["ROADCOND"].value_counts()

Dry               125932
Wet                47245
Unknown            13851
Ice                 1195
Snow/Slush           994
Other                120
Standing Water       106
Sand/Mud/Dirt         65
Oil                   53
Name: ROADCOND, dtype: int64

In [22]:
df["LIGHTCOND"].value_counts()

Daylight                    116884
Dark - Street Lights On      48840
Unknown                      12474
Dusk                          5942
Dawn                          2525
Dark - No Street Lights       1492
Dark - Street Lights Off      1185
Other                          195
Dark - Unknown Lighting         24
Name: LIGHTCOND, dtype: int64

In [23]:
df["SPEEDING"].value_counts()

N    180271
Y      9290
Name: SPEEDING, dtype: int64

#### Comparing the trend of serious injuries or fatalities in the line graph

In [30]:
#df_years_severity.index = df_years_severity.index.map(int) # let's change the index values of df_CI to type integer for plotting
df_years_severity = df_years_severity.transpose()
df_years_severity.plot(kind='line', figsize=(14,8))
plt.title('Severity of accidents')
plt.ylabel('Number of accidents')
plt.xlabel('Years')
plt.show()

TypeError: no numeric data to plot

In [20]:
import folium
from folium import plugins

In [21]:
injuries = df['INJURIES'].sum()
injuries

82922

In [22]:
serious_injuries = df['SERIOUSINJURIES'].sum()
serious_injuries

3371

In [23]:
fat_count = df['FATALITIES'].sum()
fat_count

377

#### Superimposing the location of accidents on the map

In [24]:
avg_longitude = df["X"].astype("float").mean(axis=0)
avg_latitude = df["Y"].astype("float").mean(axis=0)
df_incidents = df.dropna(subset=["X", "Y"], axis=0)
df_incidents_graves = df_incidents.loc[df_incidents['SEVERITYCODE'] == '3']

In [25]:
# define the world map
world_map = folium.Map(location=[round(avg_latitude, 2), round(avg_longitude, 2)], zoom_start=10, tiles='OpenStreetMap')

# instantiate a feature group for the incidents in the dataframe
incidents = folium.map.FeatureGroup()

for lat, lng, in zip(df_incidents_graves.Y, df_incidents_graves.X):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='red',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# display world map
world_map.add_child(incidents)

In [26]:
locations = df.pivot_table(index=['LOCATION'], aggfunc='size')
locations.sort_values(ascending=False).head()

LOCATION
BATTERY ST TUNNEL NB BETWEEN ALASKAN WY VI NB AND AURORA AVE N    298
N NORTHGATE WAY BETWEEN MERIDIAN AVE N AND CORLISS AVE N          297
BATTERY ST TUNNEL SB BETWEEN AURORA AVE N AND ALASKAN WY VI SB    291
AURORA AVE N BETWEEN N 117TH PL AND N 125TH ST                    283
6TH AVE AND JAMES ST                                              276
dtype: int64

In [27]:
collision_descr = df.pivot_table(index=['SDOT_COLDESC'], aggfunc='size')
collision_descr.sort_values(ascending=False).head()

SDOT_COLDESC
MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE     92182
MOTOR VEHICLE STRUCK MOTOR VEHICLE, REAR END               59371
NOT ENOUGH INFORMATION / NOT APPLICABLE                    19164
MOTOR VEHICLE STRUCK MOTOR VEHICLE, LEFT SIDE SIDESWIPE    10945
MOTOR VEHICLE RAN OFF ROAD - HIT FIXED OBJECT               9609
dtype: int64

## K-Nearest Neighbors algorithm

In [28]:
import itertools
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline

In [29]:
df[['SEVERITYCODE','SEVERITYDESC','WEATHER','ROADCOND','LIGHTCOND','SPEEDING','ADDRTYPE','COLLISIONTYPE']]

Unnamed: 0,SEVERITYCODE,SEVERITYDESC,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,ADDRTYPE,COLLISIONTYPE
0,1,Property Damage Only Collision,Clear,Dry,Daylight,N,Intersection,Angles
1,1,Property Damage Only Collision,Raining,Wet,Dusk,N,Block,Rear Ended
2,2,Injury Collision,Clear,Dry,Dark - Street Lights On,N,Block,Head On
3,2,Injury Collision,Raining,Wet,Dark - Street Lights On,N,Intersection,Pedestrian
4,2,Injury Collision,Clear,Ice,Dark - Street Lights On,Y,Block,Other
...,...,...,...,...,...,...,...,...
221733,0,Unknown,,,,N,Block,
221734,1,Property Damage Only Collision,Clear,Dry,Daylight,N,Block,Sideswipe
221735,2,Injury Collision,Clear,Dry,Daylight,N,Intersection,Angles
221736,2,Injury Collision,Clear,Dry,Daylight,Y,Intersection,Angles


The following characteristics of the dataset will be used to make the predictions
1. WEATHER;
2. ROADCOND;
3. LIGHTCOND;
4. SPEEDING;
5. ADDRTYPE;
6. COLLISIONTYPE.

In [30]:
df = df.dropna(subset=['SEVERITYCODE','SEVERITYDESC','WEATHER','ROADCOND','LIGHTCOND','SPEEDING','ADDRTYPE','COLLISIONTYPE'], axis=0)

In [31]:
df.shape

(192988, 28)

In [32]:
df['WEATHER_CODE'] = pd.factorize(df.WEATHER)[0]
df['ROADCOND_CODE'] = pd.factorize(df.ROADCOND)[0]
df['LIGHTCOND_CODE'] = pd.factorize(df.LIGHTCOND)[0]
df['SPEEDING_CODE'] = pd.factorize(df.SPEEDING)[0]
df['ADDRTYPE_CODE'] = pd.factorize(df.SPEEDING)[0]
df['COLLISIONTYPE_CODE'] = pd.factorize(df.SPEEDING)[0]

In [33]:
df.columns.values

array(['X', 'Y', 'ADDRTYPE', 'LOCATION', 'SEVERITYCODE', 'SEVERITYDESC',
       'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT',
       'VEHCOUNT', 'INJURIES', 'SERIOUSINJURIES', 'FATALITIES', 'INCDTTM',
       'SDOT_COLDESC', 'INATTENTIONIND', 'UNDERINFL', 'WEATHER',
       'ROADCOND', 'LIGHTCOND', 'PEDROWNOTGRNT', 'SPEEDING', 'ST_COLCODE',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR', 'YEARS',
       'WEATHER_CODE', 'ROADCOND_CODE', 'LIGHTCOND_CODE', 'SPEEDING_CODE',
       'ADDRTYPE_CODE', 'COLLISIONTYPE_CODE'], dtype=object)

### Define feature sets

In [34]:
X = df[['WEATHER_CODE','ROADCOND_CODE','LIGHTCOND_CODE','SPEEDING_CODE','ADDRTYPE_CODE','COLLISIONTYPE_CODE']]
X[0:5]

Unnamed: 0,WEATHER_CODE,ROADCOND_CODE,LIGHTCOND_CODE,SPEEDING_CODE,ADDRTYPE_CODE,COLLISIONTYPE_CODE
0,0,0,0,0,0,0
1,1,1,1,0,0,0
2,0,0,2,0,0,0
3,1,1,2,0,0,0
4,0,2,2,1,1,1


Our label

In [35]:
y = df['SEVERITYCODE'].values
y[0:5]

array(['1', '1', '2', '2', '2'], dtype=object)

### Normalize Data

In [36]:
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

array([[-0.68227241, -0.56627799, -0.70470271, -0.23210105, -0.23210105,
        -0.23210105],
       [ 0.21869094,  0.53226101,  0.09010742, -0.23210105, -0.23210105,
        -0.23210105],
       [-0.68227241, -0.56627799,  0.88491755, -0.23210105, -0.23210105,
        -0.23210105],
       [ 0.21869094,  0.53226101,  0.88491755, -0.23210105, -0.23210105,
        -0.23210105],
       [-0.68227241,  1.6308    ,  0.88491755,  4.30846831,  4.30846831,
         4.30846831]])

### Train Test Split

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (154390, 6) (154390,)
Test set: (38598, 6) (38598,)


### Classification
K nearest neighbor (KNN)

In [38]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

Training

In [39]:
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh

KNeighborsClassifier(n_neighbors=4)

Predicting

In [40]:
yhat = neigh.predict(X_test)
yhat[0:5]

array(['1', '1', '1', '1', '1'], dtype=object)

Accuracy classification score

In [41]:
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

Train set Accuracy:  0.6366409741563572
Test set Accuracy:  0.6327270843048862


### Discussion of results

The use of a machine learning model for predicting the severity of an automobile accident reached an accuracy around of  0.636 for the training set and 0.632 for the test set.

### Conclusion

From the analysis of the information obtained from the Seattle Department of Transportation Traffic Management Division we can conclude that the weather conditions, the state of conservation of roads, the lighting and the speed of traffic can have an aggravating impact in the case of automobile accidents.