# Coursera Capstone Project Using Seattle Collision Data

### This Jupyter notebook analyzes and presents findings based on collision data for the city of Seattle

# Introduction and Business Understanding

## Overview

Seattle records more than 10,000 traffic collisions per year involving cars, bicyclists, and pedestrians. Understanding the causes of collisions, as well as the conditions that affect their severity, can show officials how to allocate resources to reduce the number and severity of such incidents.

Further, a better understanding of the factors that increase the likelihood of collisions and the probability of injury or property damage can support education efforts that encourage individuals to take greater precautions when making travel decisions.

## Goals of the Project

The goal of the project is to use publicly available data compiled by the Seattle Department of Transportation (SDOT) to identify features in the dataset that yield predictive information on the number and severity of collisions and injuries in Seattle.

We will also use data visualization tools to communicate this information and provide an overview of the current state of traffic collisions in Seattle.

# Data Understanding

The dataset we are using is the *Collisions - All Years* dataset maintained by the SDOT Traffic Management Division's Traffic Records Group. It includes all types of collisions (car, bicycle, and pedestrian) as provided by the Seattle Police Department in their Traffic Records.

The dataset contains information on over 194,000 collisions in Seattle over a 15-year period. The primary attribute we are looking to predict is the severity of the collision, as captured by the Severity Code assigned to it. Interestingly, the dataset description provided by SDOT indicates that the Severity Code should take values between 0 and 3 (including both 2 and 2b to differentiate "injury" from "serious injury"); however, the actual dataset contains only the values 1 and 2 for this attribute. One avenue to explore in a future project is to find additional information on this target attribute.
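
As a quick check of that mismatch, the sketch below compares the codes listed in the metadata with the codes actually observed in a column. The helper name `compare_severity_codes` and the sample list are illustrative assumptions; in this notebook the observed values would come from `df['SEVERITYCODE']` once the data is loaded.

```python
# Codes listed in the SDOT metadata, as summarized in the text above.
DOCUMENTED_CODES = {"0", "1", "2", "2b", "3"}


def compare_severity_codes(observed_values):
    """Return (codes seen in the data, documented codes never seen)."""
    # Cast to str so integer codes (1, 2) can be compared with "2b".
    observed = {str(value) for value in observed_values}
    missing = DOCUMENTED_CODES - observed
    return sorted(observed), sorted(missing)


# Hypothetical sample standing in for df['SEVERITYCODE'].
sample_codes = [2, 1, 1, 1, 2]
observed, missing = compare_severity_codes(sample_codes)
print("Observed:", observed)
print("Documented but absent:", missing)
```

If the second list is non-empty, the target is effectively binary and should be modeled that way.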

The data includes 37 features, among them day, time, month, lighting conditions, road conditions, and weather conditions.
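
The day, time, and month information appears to live in the `INCDATE` and `INCDTTM` columns rather than in separate fields, so it has to be derived. The sketch below assumes the two `INCDTTM` formats visible in this notebook's output (`3/27/13 14:54` and the date-only `11/2/06`); `parse_incdttm` is a hypothetical helper, not part of the original analysis.

```python
from datetime import datetime


def parse_incdttm(text):
    """Parse an INCDTTM string into month, weekday, and hour.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    hour is None when the record carries a date but no time.
    """
    try:
        stamp = datetime.strptime(text, "%m/%d/%y %H:%M")
        hour = stamp.hour
    except ValueError:
        # Fall back to the date-only form, e.g. "11/2/06".
        stamp = datetime.strptime(text, "%m/%d/%y")
        hour = None
    return {"month": stamp.month, "weekday": stamp.weekday(), "hour": hour}


for raw in ["3/27/13 14:54", "12/20/06 18:55", "11/2/06"]:
    print(raw, "->", parse_incdttm(raw))
```

Keeping the hour as `None` for date-only records avoids treating a missing time as midnight.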

A full description of the data can be found at: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf

In [1]:
import pandas as pd
import numpy as np
import folium

In [2]:
path="~/Documents/CertificationStuff/IBMPythonDataScience/Data_Science_Capstone/Data-Collisions.csv"

df = pd.read_csv(path, low_memory=False)

In [3]:
# Drop rows without latitude (Y) or longitude (X)
df.dropna(subset=["X", "Y"], axis=0, inplace=True)

# Alternative (not used): replace missing coordinates with the mean coordinate
#mean_longitude = df["X"].mean()
#mean_latitude = df["Y"].mean()
#df["X"] = df["X"].fillna(mean_longitude)
#df["Y"] = df["Y"].fillna(mean_latitude)

# Drop unnecessary or redundant columns
columns_to_drop = [
    'SEVERITYCODE.1', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
    'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
    'EXCEPTRSNDESC', 'SDOT_COLCODE', 'SDOT_COLDESC', 'PEDROWNOTGRNT',
    'SDOTCOLNUM', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY',
]
df.drop(columns=columns_to_drop, inplace=True)

# Convert the Y/N indicator columns to 0/1 integers.
# Results are assigned back to the column because chained inplace
# replace calls may not update df in recent pandas versions.
df['INATTENTIONIND'] = df['INATTENTIONIND'].fillna(0).replace('Y', 1).astype(int)

df['UNDERINFL'] = df['UNDERINFL'].fillna(0).replace({'N': 0, '*': 0, 'Y': 1}).astype(int)

df['SPEEDING'] = df['SPEEDING'].fillna(0).replace('Y', 1).astype(int)

df['HITPARKEDCAR'] = df['HITPARKEDCAR'].replace({'N': 0, 'Y': 1}).astype(int)

# Consolidate missing (NaN) and "Other" values into "Unknown"
for column in ['WEATHER', 'ROADCOND', 'LIGHTCOND']:
    df[column] = df[column].fillna("Unknown").replace("Other", "Unknown")

In [4]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR
0,2,-122.323148,47.70314,Injury Collision,Angles,2,0,0,2,2013/03/27 00:00:00+00,3/27/13 14:54,At Intersection (intersection related),0,0,Overcast,Wet,Daylight,0,0
1,1,-122.347294,47.647172,Property Damage Only Collision,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,12/20/06 18:55,Mid-Block (not related to intersection),0,0,Raining,Wet,Dark - Street Lights On,0,0
2,1,-122.33454,47.607871,Property Damage Only Collision,Parked Car,4,0,0,3,2004/11/18 00:00:00+00,11/18/04 10:20,Mid-Block (not related to intersection),0,0,Overcast,Dry,Daylight,0,0
3,1,-122.334803,47.604803,Property Damage Only Collision,Other,3,0,0,3,2013/03/29 00:00:00+00,3/29/13 9:26,Mid-Block (not related to intersection),0,0,Clear,Dry,Daylight,0,0
4,2,-122.306426,47.545739,Injury Collision,Angles,2,0,0,2,2004/01/28 00:00:00+00,1/28/04 8:04,At Intersection (intersection related),0,0,Raining,Wet,Daylight,0,0


In [5]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
INATTENTIONIND      int64
UNDERINFL           int64
WEATHER            object
ROADCOND           object
LIGHTCOND          object
SPEEDING            int64
HITPARKEDCAR        int64
dtype: object

In [6]:
df.describe(include="all")

Unnamed: 0,SEVERITYCODE,X,Y,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR
count,189339.0,189339.0,189339.0,189339,184582,189339.0,189339.0,189339.0,189339.0,189339,189339,185146,189339.0,189339.0,189339,189339,189339,189339.0,189339.0
unique,,,,2,10,,,,,5985,157960,7,,,10,8,8,,
top,,,,Property Damage Only Collision,Parked Car,,,,,2006/11/02 00:00:00+00,11/2/06,Mid-Block (not related to intersection),,,Clear,Dry,Daylight,,
freq,,,,132221,46381,,,,,88,88,87390,,,108959,122076,113582,,
mean,1.301671,-122.330518,47.619543,,,2.452986,0.037863,0.028996,1.924136,,,,0.154094,0.0469,,,,0.046055,0.036997
std,0.458984,0.029976,0.056157,,,1.349092,0.200053,0.169143,0.629941,,,,0.36104,0.211425,,,,0.209605,0.188755
min,1.0,-122.419091,47.495573,,,0.0,0.0,0.0,0.0,,,,0.0,0.0,,,,0.0,0.0
25%,1.0,-122.348673,47.575956,,,2.0,0.0,0.0,2.0,,,,0.0,0.0,,,,0.0,0.0
50%,1.0,-122.330224,47.615369,,,2.0,0.0,0.0,2.0,,,,0.0,0.0,,,,0.0,0.0
75%,2.0,-122.311937,47.663664,,,3.0,0.0,0.0,2.0,,,,0.0,0.0,,,,0.0,0.0


In [7]:
possible_severities_counts = df['SEVERITYDESC'].value_counts().to_frame()
print(possible_severities_counts)
print()

possible_weather_conditions = df['WEATHER'].value_counts().to_frame()
print(possible_weather_conditions)
print()

possible_road_conditions = df['ROADCOND'].value_counts().to_frame()
print(possible_road_conditions)
print()

possible_lighting_conditions = df['LIGHTCOND'].value_counts().to_frame()
print(possible_lighting_conditions)
print()

speeding = df['SPEEDING'].value_counts().to_frame()
print(speeding)
print()

under_influence = df['UNDERINFL'].value_counts().to_frame()
print(under_influence)

# print("Possible Severities: ", df['SEVERITYDESC'].unique()) 
# print("Weather Conditions: ", df['WEATHER'].unique())
# print("Road Conditions: ", df['ROADCOND'].unique())
# print("Lighting Conditions: ", df['LIGHTCOND'].unique())

                                SEVERITYDESC
Property Damage Only Collision        132221
Injury Collision                       57118

                          WEATHER
Clear                      108959
Raining                     32015
Overcast                    27136
Unknown                     19591
Snowing                       894
Fog/Smog/Smoke                553
Sleet/Hail/Freezing Rain      112
Blowing Sand/Dirt              50
Severe Crosswind               24
Partly Cloudy                   5

                ROADCOND
Dry               122076
Wet                46064
Unknown            18814
Ice                 1177
Snow/Slush           989
Standing Water       102
Sand/Mud/Dirt         64
Oil                   53

                          LIGHTCOND
Daylight                     113582
Dark - Street Lights On       47314
Unknown                       17632
Dusk                           5775
Dawn                           2422
Dark - No Street Lights        1451
Dark - Stre
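
The severity counts above show that the two classes are imbalanced, which matters when judging any classifier built on this data. As a reference point, the sketch below computes the accuracy of always predicting the most common class, using the two `SEVERITYDESC` counts printed above; the variable names are illustrative.

```python
# Counts copied from the SEVERITYDESC output above.
severity_counts = {
    "Property Damage Only Collision": 132221,
    "Injury Collision": 57118,
}

total = sum(severity_counts.values())
majority_class = max(severity_counts, key=severity_counts.get)
baseline_accuracy = severity_counts[majority_class] / total

print(f"Total collisions: {total}")
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

A severity model evaluated on the full cleaned data needs to beat this constant guess (about 0.70) to be useful.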

In [15]:
# Put some types of collisions into separate data frames so we can analyze them

df_icy_road = df[df['ROADCOND']=='Ice']
df_wet_road = df[df['ROADCOND']=='Wet']
df_snow_road = df[df['ROADCOND']=='Snow/Slush']

df_speeding = df[df['SPEEDING'] == 1]
df_under_influence = df[df['UNDERINFL'] == 1]

df_dark = df[df['LIGHTCOND'] == 'Dark - Street Lights Off']
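
Rather than building one data frame per condition, the injury share for every condition can be computed in a single grouped pass. The sketch below uses a small hypothetical `toy` frame with the same column names; in the notebook the cleaned `df` would take its place.

```python
import pandas as pd

# Hypothetical toy rows; the notebook would use the cleaned df instead.
toy = pd.DataFrame({
    "ROADCOND": ["Dry", "Dry", "Wet", "Wet", "Ice", "Dry"],
    "SEVERITYCODE": [1, 2, 2, 1, 1, 1],
})

# SEVERITYCODE 2 marks an injury collision, so the mean of the boolean
# flag is the share of injury collisions within each road condition.
injury_share = (
    toy.assign(INJURY=toy["SEVERITYCODE"].eq(2))
       .groupby("ROADCOND")["INJURY"]
       .mean()
)
print(injury_share)
```

The same pattern works for `WEATHER`, `LIGHTCOND`, `SPEEDING`, or `UNDERINFL` by changing the grouping column.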

In [16]:
# latitude and longitude of center of Seattle
latitude = 47.6062
longitude = -122.3321

# Create a map of Seattle
seattle_map = folium.Map(location=[latitude,longitude],zoom_start=11)

# Display map of Seattle
seattle_map

In [17]:
# Adding features to the map

# Use folium's top-level names for the feature group and markers
dark_collision = folium.FeatureGroup()

for lat, lng in zip(df_dark.Y, df_dark.X):
    dark_collision.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# snow_collisions = folium.map.FeatureGroup()

# for lat,lng in zip(df_snow_road.Y, df_snow_road.X):
#    snow_collisions.add_child(
#        folium.features.CircleMarker(
#            [lat,lng],
#            radius = 5,
#            color='yellow',
#            fill = True,
#            fill_color='blue',
#            fill_opacity = 0.6
#        )
#    )

seattle_map.add_child(dark_collision)

In [18]:
from folium import plugins

seattle_map = folium.Map(location=[latitude,longitude],zoom_start=11)

dark_collision = plugins.MarkerCluster().add_to(seattle_map)

for lat, lng, label in zip (df_dark.Y, df_dark.X, df_dark.SEVERITYDESC):
    folium.Marker(
        location = [lat,lng],
        icon = None,
        popup = label,
    ).add_to(dark_collision)
    
seattle_map

In [None]:
"""

All the code to create and visualize a decision tree for predicting accident severity.  It didn't seem to yield meaningful results.


!conda install -c conda-forge pydotplus -y
!conda install -c conda-forge python-graphviz -y

from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
import pydotplus
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline

df_decision = df[['SEVERITYDESC','WEATHER','ROADCOND','LIGHTCOND','SPEEDING','UNDERINFL']]
df_decision = df_decision[df_decision.WEATHER != 'Unknown']
df_decision = df_decision[df_decision.ROADCOND != 'Unknown']
df_decision = df_decision[df_decision.LIGHTCOND != 'Unknown']


X = df_decision[['WEATHER','ROADCOND','LIGHTCOND','SPEEDING','UNDERINFL']].values

le_weather = preprocessing.LabelEncoder()
le_weather.fit(['Blowing Sand/Dirt', 'Clear', 'Fog/Smog/Smoke', 'Overcast', 'Partly Cloudy', 'Raining', 'Severe Crosswind', 'Sleet/Hail/Freezing Rain', 'Snowing'])
X[:,0] = le_weather.transform(X[:,0])

le_road = preprocessing.LabelEncoder()
le_road.fit(['Dry','Wet','Ice','Snow/Slush','Standing Water','Sand/Mud/Dirt','Oil'])
X[:,1] = le_road.transform(X[:,1])

le_light = preprocessing.LabelEncoder()
le_light.fit(['Daylight','Dark - Street Lights On','Dusk','Dawn','Dark - No Street Lights','Dark - Street Lights Off','Dark - Unknown Lighting'])
X[:,2] = le_light.transform(X[:,2])

X[0:5]

y = df_decision['SEVERITYDESC']
y[0:5]

X_trainset, X_testset, y_trainset, y_testset = train_test_split(X,y,test_size = 0.3, random_state = 3)

severityTree = DecisionTreeClassifier(criterion = "entropy", max_depth = 4)
severityTree

severityTree.fit(X_trainset, y_trainset)

predictionTree = severityTree.predict(X_testset)

print (predictionTree [0:5])
print (y_testset [0:5])

print ("Decision Tree Accuracy: ", metrics.accuracy_score(y_testset, predictionTree))

dot_data = StringIO()
filename = "severityTree.png"
featureNames = ['WEATHER','ROADCOND','LIGHTCOND','SPEEDING','UNDERINFL']
targetNames = df["SEVERITYDESC"].unique().tolist()
out = tree.export_graphviz(severityTree,feature_names = featureNames, out_file = dot_data, class_names = np.unique(y_trainset), filled = True, special_characters = True, rotate = False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize = (100,200))
plt.imshow(img, interpolation = 'nearest')
"""