# Coursera Capstone Project Using Seattle Collision Data

### This is a Jupyter notebook we will be using to analyze and present findings based on collision data in the city of Seattle

# Introduction and Business Understanding

## Overview

Seattle sees more than 10,000 traffic collisions per year involving cars, bicyclists and pedestrians. Understanding the causes of collisions, as well as the conditions that affect their severity, can show officials how to allocate resources to reduce the number and severity of such incidents.

Further, a better understanding of the factors that increase the likelihood of collisions, and the probability of injury or property damage, can support education efforts that encourage individuals to take greater precautions when making travel decisions.

## Goals of the Project

The goal of the project is to use publicly available data compiled by the Seattle Department of Transportation (SDOT) to identify features in the dataset that yield predictive information on the number and severity of collisions and injuries in Seattle.

We will also look to use data visualization tools to communicate this information and provide an overview of the current state of traffic collisions in Seattle.

# Data Understanding

The dataset we are using is the *Collisions - All Years* dataset maintained by the SDOT Traffic Management Division's Traffic Records Group. This dataset includes all types of collisions, including car, bicycle, and pedestrian, as provided by the Seattle Police Department in their Traffic Records.

The dataset contains information on over 194,000 collisions in Seattle over a 15-year period. The primary attribute we are looking to predict is the severity of the collision, as captured by the Severity Code assigned to it. Interestingly, the dataset description provided by SDOT indicates that this Severity Code attribute should take values between 0 and 3 (including both 2 and 2b, to differentiate "injury" from "serious injury"); however, the actual dataset contains only the values 1 and 2 for this attribute. One avenue to explore in a future project is to find additional information on this target attribute.
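
The gap between the documented and observed Severity Code values can be checked directly. The following is a minimal sketch, assuming the codes listed in the SDOT description are 0, 1, 2, 2b and 3 as summarized above; the `severity` Series is a made-up stand-in for the real `SEVERITYCODE` column, and the variable names are illustrative only.

```python
import pandas as pd

# Codes listed in the SDOT description, as summarized in the text above
documented_codes = {"0", "1", "2", "2b", "3"}

# Hypothetical stand-in for df["SEVERITYCODE"]; swap in the real column to run the check
severity = pd.Series([1, 2, 1, 1, 2], name="SEVERITYCODE")

# Compare as strings so that "2b" can sit alongside the numeric codes
observed_codes = set(severity.astype(str).unique())
missing_codes = sorted(documented_codes - observed_codes)
unexpected_codes = sorted(observed_codes - documented_codes)

print("Observed:", sorted(observed_codes))
print("Documented but absent:", missing_codes)
print("Observed but undocumented:", unexpected_codes)
```

Running the same comparison on the loaded data would make the mismatch explicit before any modeling starts.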

The data includes 37 different features, including day, time, month, lighting conditions, road conditions and weather conditions.

A full description of the data can be found at: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf

In [21]:
import pandas as pd
import numpy as np

In [22]:
path="~/Documents/CertificationStuff/IBMPythonDataScience/Data_Science_Capstone/Data-Collisions.csv"

df = pd.read_csv(path, low_memory=False)

In [23]:
# Drop rows without latitude (Y) or longitude (X)
df = df.dropna(subset=["X", "Y"])

# Alternative (not used): replace missing coordinates with the mean coordinate
#mean_longitude = df["X"].mean()
#mean_latitude = df["Y"].mean()
#df["X"] = df["X"].fillna(mean_longitude)
#df["Y"] = df["Y"].fillna(mean_latitude)

# Drop unnecessary or redundant columns
df = df.drop(columns=[
    'SEVERITYCODE.1', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
    'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
    'EXCEPTRSNDESC', 'SDOT_COLCODE', 'SDOT_COLDESC', 'PEDROWNOTGRNT',
    'SDOTCOLNUM', 'ST_COLCODE', 'ST_COLDESC', 'SEGLANEKEY', 'CROSSWALKKEY',
])

# Convert flag columns to Boolean values.
# Assigning the result back avoids calling inplace methods on a column selection,
# which newer pandas versions warn about.

# 'Y' -> True; missing values -> False
df['INATTENTIONIND'] = df['INATTENTIONIND'].eq('Y')

# Every value other than missing or 'N' becomes True,
# so string codes such as '0' would also count as True
df['UNDERINFL'] = df['UNDERINFL'].notna() & df['UNDERINFL'].ne('N')

# 'Y' -> True; missing values -> False
df['SPEEDING'] = df['SPEEDING'].eq('Y')

# 'Y' -> True; 'N' -> False
df['HITPARKEDCAR'] = df['HITPARKEDCAR'].eq('Y')

# Consolidate missing (NaN) and "Other" values into a single "Unknown" category
for col in ['WEATHER', 'ROADCOND', 'LIGHTCOND']:
    df[col] = df[col].fillna("Unknown").replace("Other", "Unknown")
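
One point worth checking in the cleaning step: for `UNDERINFL`, any value that is neither missing nor `'N'` ends up as `True`. If the raw column mixes `'Y'`/`'N'` with `'0'`/`'1'` strings (an assumption that `df['UNDERINFL'].value_counts(dropna=False)` on the raw data would confirm or rule out), the `'0'` entries would be counted as under the influence. The sketch below uses a made-up Series to contrast that behaviour with an explicit whitelist of truthy codes; `truthy_codes` is an illustrative name, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real column should be inspected with value_counts()
raw = pd.Series(["N", "0", "Y", "1", np.nan, "0"], name="UNDERINFL")

# Treating everything except missing/'N' as True: "0" is counted as True as well
naive = raw.notna() & raw.ne("N")

# Explicit whitelist: only the codes assumed to mean "yes" become True
truthy_codes = ["Y", "1"]
explicit = raw.isin(truthy_codes)

print(pd.DataFrame({"raw": raw, "naive": naive, "explicit": explicit}))
```

If the explicit mapping is adopted, the summary outputs shown further down would need to be regenerated, since the share of `True` values in `UNDERINFL` would change.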

In [24]:
df.head()

Unnamed: 0,SEVERITYCODE,X,Y,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR
0,2,-122.323148,47.70314,Injury Collision,Angles,2,0,0,2,2013/03/27 00:00:00+00,3/27/13 14:54,At Intersection (intersection related),False,False,Overcast,Wet,Daylight,False,False
1,1,-122.347294,47.647172,Property Damage Only Collision,Sideswipe,2,0,0,2,2006/12/20 00:00:00+00,12/20/06 18:55,Mid-Block (not related to intersection),False,True,Raining,Wet,Dark - Street Lights On,False,False
2,1,-122.33454,47.607871,Property Damage Only Collision,Parked Car,4,0,0,3,2004/11/18 00:00:00+00,11/18/04 10:20,Mid-Block (not related to intersection),False,True,Overcast,Dry,Daylight,False,False
3,1,-122.334803,47.604803,Property Damage Only Collision,Other,3,0,0,3,2013/03/29 00:00:00+00,3/29/13 9:26,Mid-Block (not related to intersection),False,False,Clear,Dry,Daylight,False,False
4,2,-122.306426,47.545739,Injury Collision,Angles,2,0,0,2,2004/01/28 00:00:00+00,1/28/04 8:04,At Intersection (intersection related),False,True,Raining,Wet,Daylight,False,False


In [25]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
INATTENTIONIND       bool
UNDERINFL            bool
WEATHER            object
ROADCOND           object
LIGHTCOND          object
SPEEDING             bool
HITPARKEDCAR         bool
dtype: object
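
The listing above shows `INCDATE` and `INCDTTM` stored as `object` (strings), so the day, time and month features mentioned earlier are not yet directly usable. A minimal sketch of one way to derive them follows, assuming `INCDTTM` uses the `month/day/two-digit-year hour:minute` layout seen in the preview, with some entries (such as `11/2/06`) carrying only a date. The sample frame and the new column names (`INCDTTM_PARSED`, `HAS_TIME`, `MONTH`, `DAYOFWEEK`, `HOUR`) are illustrative, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample using the two layouts that appear in INCDTTM
sample = pd.DataFrame({"INCDTTM": ["3/27/13 14:54", "12/20/06 18:55", "11/2/06"]})

# Parse each layout separately; values that do not match a format become NaT
with_time = pd.to_datetime(sample["INCDTTM"], format="%m/%d/%y %H:%M", errors="coerce")
date_only = pd.to_datetime(sample["INCDTTM"], format="%m/%d/%y", errors="coerce")

# Prefer the full timestamp and fall back to the date-only parse
sample["INCDTTM_PARSED"] = with_time.fillna(date_only)
sample["HAS_TIME"] = with_time.notna()

# Derived calendar features; HOUR is left missing when no time was recorded
sample["MONTH"] = sample["INCDTTM_PARSED"].dt.month
sample["DAYOFWEEK"] = sample["INCDTTM_PARSED"].dt.dayofweek  # Monday = 0
sample["HOUR"] = sample["INCDTTM_PARSED"].dt.hour.where(sample["HAS_TIME"])

print(sample)
```

Keeping a `HAS_TIME` flag avoids treating date-only records as collisions that happened at midnight.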

In [26]:
df.describe(include="all")

Unnamed: 0,SEVERITYCODE,X,Y,SEVERITYDESC,COLLISIONTYPE,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,INCDATE,INCDTTM,JUNCTIONTYPE,INATTENTIONIND,UNDERINFL,WEATHER,ROADCOND,LIGHTCOND,SPEEDING,HITPARKEDCAR
count,189339.0,189339.0,189339.0,189339,184582,189339.0,189339.0,189339.0,189339.0,189339,189339,185146,189339,189339,189339,189339,189339,189339,189339
unique,,,,2,10,,,,,5985,157960,7,2,2,10,8,8,2,2
top,,,,Property Damage Only Collision,Parked Car,,,,,2006/11/02 00:00:00+00,11/2/06,Mid-Block (not related to intersection),False,False,Clear,Dry,Daylight,False,False
freq,,,,132221,46381,,,,,88,88,87390,160163,102376,108959,122076,113582,180619,182334
mean,1.301671,-122.330518,47.619543,,,2.452986,0.037863,0.028996,1.924136,,,,,,,,,,
std,0.458984,0.029976,0.056157,,,1.349092,0.200053,0.169143,0.629941,,,,,,,,,,
min,1.0,-122.419091,47.495573,,,0.0,0.0,0.0,0.0,,,,,,,,,,
25%,1.0,-122.348673,47.575956,,,2.0,0.0,0.0,2.0,,,,,,,,,,
50%,1.0,-122.330224,47.615369,,,2.0,0.0,0.0,2.0,,,,,,,,,,
75%,2.0,-122.311937,47.663664,,,3.0,0.0,0.0,2.0,,,,,,,,,,


In [27]:
print("Possible Severities: ", df['SEVERITYDESC'].unique()) 
print("Weather Conditions: ", df['WEATHER'].unique())
print("Road Conditions: ", df['ROADCOND'].unique())
print("Lighting Conditions: ", df['LIGHTCOND'].unique())

Possible Severities:  ['Injury Collision' 'Property Damage Only Collision']
Weather Conditions:  ['Overcast' 'Raining' 'Clear' 'Unknown' 'Snowing' 'Fog/Smog/Smoke'
 'Sleet/Hail/Freezing Rain' 'Blowing Sand/Dirt' 'Severe Crosswind'
 'Partly Cloudy']
Road Conditions:  ['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Sand/Mud/Dirt' 'Standing Water'
 'Oil']
Lighting Conditions:  ['Daylight' 'Dark - Street Lights On' 'Dark - No Street Lights' 'Unknown'
 'Dusk' 'Dawn' 'Dark - Street Lights Off' 'Dark - Unknown Lighting']
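
The summary statistics above list 132,221 of the 189,339 rows as Property Damage Only, roughly 70%, so the two severity classes are not balanced. A quick way to quantify this is sketched below on a made-up Series standing in for `df['SEVERITYCODE']`; the majority-class share is a natural baseline against which any later predictive model could be compared.

```python
import pandas as pd

# Hypothetical stand-in for df["SEVERITYCODE"] (1 = property damage only, 2 = injury)
severity = pd.Series([1, 1, 1, 2, 1, 2, 1, 1, 2, 1], name="SEVERITYCODE")

# Share of each class, ordered by code
shares = severity.value_counts(normalize=True).sort_index()

# Accuracy of a trivial model that always predicts the most common class
majority_share = shares.max()

print(shares)
print("Majority-class baseline accuracy:", round(majority_share, 3))
```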
