# Coursera Capstone Project Using Seattle Collision Data

### A Jupyter notebook for analyzing and presenting findings from collision data in the city of Seattle

# Introduction and Business Understanding

## Overview

Seattle sees more than 10,000 traffic collisions per year involving cars, bicyclists, and pedestrians. Understanding what causes collisions, and which conditions affect their severity, can help officials allocate resources to reduce both the number and the severity of such incidents.

Further, a better understanding of the factors that raise the likelihood of a collision and the probability of injury or property damage can support education efforts that encourage individuals to take greater precautions when making travel decisions.

## Goals of the Project

The goal of the project is to use publicly available data compiled by the Seattle Department of Transportation (SDOT) to identify features in the dataset that yield predictive information on the number and severity of collisions and injuries in Seattle.

We will also use data visualization tools to communicate these findings and provide an overview of the current state of traffic collisions in Seattle.

# Data Understanding

The dataset we are using is the *Collisions - All Years* dataset maintained by the SDOT Traffic Management Division's Traffic Records Group. It covers all types of collisions, including car, bicycle, and pedestrian, as provided by the Seattle Police Department in their Traffic Records.

The dataset contains information on over 194,000 collisions in Seattle over a 15-year period. The primary attribute we are looking to predict is the severity of the collision, as captured by the Severity Code assigned to it. Interestingly, the dataset description provided by SDOT indicates that the Severity Code attribute should take values between 0 and 3 (including both 2 and 2b to differentiate "injury" from "serious injury"); however, the actual dataset only contains the values 1 and 2 for this attribute. One avenue to explore in a future project is to find additional information on this target attribute.
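
The gap between documented and observed severity codes can be checked directly. The following is a minimal sketch on a small made-up frame rather than the real file; the `sample` values and the string labels in `documented_codes` are assumptions based on the description above, not values taken from the SDOT metadata.

```python
import pandas as pd

# Toy stand-in for the collision data; the values are illustrative only.
sample = pd.DataFrame({"SEVERITYCODE": [1, 2, 1, 1, 2, 1]})

# Count how often each severity code occurs, ordered by code.
severity_counts = sample["SEVERITYCODE"].value_counts().sort_index()

# Codes the description above says should exist (assumed string labels).
documented_codes = {"0", "1", "2", "2b", "3"}
observed_codes = {str(code) for code in severity_counts.index}

# Documented codes that never appear in the data.
missing_codes = sorted(documented_codes - observed_codes)

print(severity_counts)
print("Documented codes not present:", missing_codes)
```

Running the same two steps on `df["SEVERITYCODE"]` would show how the real target is distributed between its two observed classes.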

The data includes 37 features, among them day, time, month, lighting conditions, road conditions, and weather conditions.
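
Day, month, and time information may need to be derived from a timestamp string before it can be used as separate features. The sketch below shows one way to do that, assuming a timestamp column named `INCDTTM`; the sample strings and the format string are hypothetical and may not match the real file.

```python
import pandas as pd

# Hypothetical timestamp strings; the real INCDTTM values may be formatted differently.
sample = pd.DataFrame(
    {"INCDTTM": ["3/27/2013 2:54:00 PM", "12/20/2006 6:55:00 PM"]}
)

# Parse the strings with an explicit (assumed) format.
parsed = pd.to_datetime(sample["INCDTTM"], format="%m/%d/%Y %I:%M:%S %p")

# Derive calendar features from the parsed timestamps.
sample["month"] = parsed.dt.month
sample["day_of_week"] = parsed.dt.day_name()
sample["hour"] = parsed.dt.hour

print(sample)
```

If some rows carry a date without a time, an explicit format like this one would fail on them, so the parsing step would need to be adapted to the actual data.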

A full description of the data can be found at: https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Metadata.pdf

In [1]:
import pandas as pd
import numpy as np

In [35]:
path="~/Documents/CertificationStuff/IBMPythonDataScience/Data_Science_Capstone/Data-Collisions.csv"

df = pd.read_csv(path, low_memory=False)

In [36]:
df.head(20)

| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | NaN | NaN | NaN | 10.0 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | NaN | ... | Wet | Dark - Street Lights On | NaN | 6354039.0 | NaN | 11.0 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | NaN | ... | Dry | Daylight | NaN | 4323031.0 | NaN | 32.0 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | NaN | ... | Dry | Daylight | NaN | NaN | NaN | 23.0 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | NaN | 4028032.0 | NaN | 10.0 | Entering at angle | 0 | 0 | N |
| 5 | 1 | -122.387598 | 47.690575 | 6 | 320840 | 322340 | E919477 | Matched | Intersection | 36974.0 | ... | Dry | Daylight | NaN | NaN | NaN | 10.0 | Entering at angle | 0 | 0 | N |
| 6 | 1 | -122.338485 | 47.618534 | 7 | 83300 | 83300 | 3282542 | Matched | Intersection | 29510.0 | ... | Wet | Daylight | NaN | 8344002.0 | NaN | 10.0 | Entering at angle | 0 | 0 | N |
| 7 | 2 | -122.32078 | 47.614076 | 9 | 330897 | 332397 | EA30304 | Matched | Intersection | 29745.0 | ... | Dry | Daylight | NaN | NaN | NaN | 5.0 | Vehicle Strikes Pedalcyclist | 6855 | 0 | N |
| 8 | 1 | -122.33593 | 47.611904 | 10 | 63400 | 63400 | 2071243 | Matched | Block | NaN | ... | Dry | Daylight | NaN | 6166014.0 | NaN | 32.0 | One parked--one moving | 0 | 0 | N |
| 9 | 2 | -122.3847 | 47.528475 | 12 | 58600 | 58600 | 2072105 | Matched | Intersection | 34679.0 | ... | Dry | Daylight | NaN | 6079001.0 | NaN | 10.0 | Entering at angle | 0 | 0 | N |


In [12]:
df.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object
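
The `SEVERITYCODE.1` entry in the listing above likely reflects a second column with the same header in the source file, since pandas appends a numeric suffix to repeated column names on load. A minimal sketch of how to confirm that the copy is redundant before dropping it follows; the frame below is made up for illustration and only mirrors the column names.

```python
import pandas as pd

# Toy frame mirroring the two severity columns; values are illustrative only.
sample = pd.DataFrame(
    {
        "SEVERITYCODE": [1, 2, 1, 2],
        "SEVERITYCODE.1": [1, 2, 1, 2],
        "PERSONCOUNT": [2, 3, 2, 4],
    }
)

# True only if both columns hold identical values in identical positions.
is_duplicate = sample["SEVERITYCODE"].equals(sample["SEVERITYCODE.1"])

# Drop the copy only when it carries no extra information.
if is_duplicate:
    sample = sample.drop(columns=["SEVERITYCODE.1"])

print(is_duplicate, list(sample.columns))
```

Leaving an exact copy of the target among the features would leak the label into any model, so this check is worth running on the real `df` before modeling.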

In [37]:
df.describe(include="all")

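| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 194673.0 | 189339.0 | 189339.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673 | 192747 | 65070.0 | ... | 189661 | 189503 | 4667 | 114936.0 | 9333 | 194655.0 | 189769 | 194673.0 | 194673.0 | 194673 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | 194670.0 | 2 | 3 | NaN | ... | 9 | 9 | 1 | NaN | 1 | 63.0 | 62 | NaN | NaN | 2 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | 1782439.0 | Matched | Block | NaN | ... | Dry | Daylight | Y | NaN | Y | 32.0 | One parked--one moving | NaN | NaN | N |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 189786 | 126926 | NaN | ... | 124510 | 116137 | 4667 | NaN | 9333 | 44421.0 | 44421 | NaN | NaN | 187457 |
| mean | 1.298901 | -122.330518 | 47.619543 | 108479.36493 | 141091.45635 | 141298.811381 | NaN | NaN | NaN | 37558.450576 | ... | NaN | NaN | NaN | 7972521.0 | NaN | NaN | NaN | 269.401114 | 9782.452 | NaN |
| std | 0.457778 | 0.029976 | 0.056157 | 62649.722558 | 86634.402737 | 86986.54211 | NaN | NaN | NaN | 51745.990273 | ... | NaN | NaN | NaN | 2553533.0 | NaN | NaN | NaN | 3315.776055 | 72269.26 | NaN |
| min | 1.0 | -122.419091 | 47.495573 | 1.0 | 1001.0 | 1001.0 | NaN | NaN | NaN | 23807.0 | ... | NaN | NaN | NaN | 1007024.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |
| 25% | 1.0 | -122.348673 | 47.575956 | 54267.0 | 70383.0 | 70383.0 | NaN | NaN | NaN | 28667.0 | ... | NaN | NaN | NaN | 6040015.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |
| 50% | 1.0 | -122.330224 | 47.615369 | 106912.0 | 123363.0 | 123363.0 | NaN | NaN | NaN | 29973.0 | ... | NaN | NaN | NaN | 8023022.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |
| 75% | 2.0 | -122.311937 | 47.663664 | 162272.0 | 203319.0 | 203459.0 | NaN | NaN | NaN | 33973.0 | ... | NaN | NaN | NaN | 10155010.0 | NaN | NaN | NaN | 0.0 | 0.0 | NaN |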
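
The `count` row above differs across columns (for example 189339 for `X` against 194673 for `SEVERITYCODE`), which points to missing values. One way to summarize them per column is sketched below on a small made-up frame; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are illustrative, not taken from the real file.
sample = pd.DataFrame(
    {
        "X": [-122.32, np.nan, -122.33, -122.31],
        "SPEEDING": ["Y", np.nan, np.nan, np.nan],
        "HITPARKEDCAR": ["N", "N", "Y", "N"],
    }
)

# Number and share of missing entries per column, worst columns first.
missing_summary = pd.DataFrame(
    {
        "missing": sample.isna().sum(),
        "share": sample.isna().mean(),
    }
).sort_values("share", ascending=False)

print(missing_summary)
```

Applied to `df`, a summary like this makes it easier to decide which columns to impute, which to drop, and which (such as flag columns that only ever contain `Y`) may use a missing value to mean "no".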


In [32]:
# Option 1: drop rows without a longitude
# df.dropna(subset=["X"], axis=0, inplace=True)

# Option 2: replace missing coordinates with the average coordinate
# mean_longitude = df["X"].mean()
# mean_latitude = df["Y"].mean()

# df["X"] = df["X"].fillna(mean_longitude)
# df["Y"] = df["Y"].fillna(mean_latitude)
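
The two options in the cell above (dropping rows with missing coordinates, or filling them with the mean) can be compared side by side. This is a minimal sketch on a made-up three-row frame, not a recommendation for which option suits the real data.

```python
import numpy as np
import pandas as pd

# Toy coordinates with one missing row; values are illustrative only.
sample = pd.DataFrame(
    {
        "X": [-122.30, np.nan, -122.40],
        "Y": [47.60, np.nan, 47.70],
    }
)

# Option 1: drop rows lacking either coordinate (returns a new frame).
dropped = sample.dropna(subset=["X", "Y"])

# Option 2: fill missing coordinates with the column means.
filled = sample.fillna({"X": sample["X"].mean(), "Y": sample["Y"].mean()})

print(len(dropped), "rows kept after dropping")
print(filled)
```

Mean imputation keeps every row but places all imputed collisions at a single point, which can distort any map or spatial analysis; dropping avoids that at the cost of losing those records.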