This notebook will be used for the data science capstone project.

<h1>Capstone Project - Car accident severity</h1>

In [5]:
import pandas as pd

In [6]:
import numpy as np

In [7]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


<h2>Introduction and Business Problem</h2>

Car accidents can vary in severity. Emergency services such as the police, fire brigade or paramedics are often called to deal with car accidents.

Knowing the severity of an accident before they arrive can help these emergency services plan ahead: which vehicles and equipment to send, and how to start tackling the problem once on the scene. This in turn can lead to better outcomes for everyone involved in the accident.

The severity of accidents can depend on a number of factors, many of which can easily be identified from the accident scene by those present.

Our aim is to build a supervised machine learning model that takes these factors, as gathered from an initial call or report of an accident, and uses them to predict the severity of the accident. The prediction can then be used to inform the emergency services before they arrive at the scene.

In this study, we will focus on building a model for the emergency services in the Seattle area of the United States.

<h2>Data</h2>

Our data set contains accident data recorded in Traffic Records for the city of Seattle, as collected by the SDOT and SPD. There are over 194,000 accident records from 2004 to present, each with up to 37 attributes and an indication of the severity of the accident.

We will use this data set to identify the key attributes and then to train and test our model in order to predict the severity of an accident.
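As a first check on the figures above, the record count, attribute count and balance of the severity classes can be read straight from the loaded DataFrame. The sketch below is illustrative only: the rows are a tiny made-up sample rather than real records, and `SEVERITYCODE` is an assumed name for the severity column, so it should be adjusted to match the actual file.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the real collisions file.
# SEVERITYCODE is an ASSUMED name for the severity label column.
sample_csv = io.StringIO(
    "OBJECTID,SEVERITYCODE,WEATHER,ROADCOND\n"
    "1,2,Overcast,Wet\n"
    "2,1,Raining,Wet\n"
    "3,1,Clear,Dry\n"
)


def summarise(df, target="SEVERITYCODE"):
    """Return the record count, column count and class counts of the target."""
    n_records, n_columns = df.shape
    class_counts = df[target].value_counts().to_dict()
    return n_records, n_columns, class_counts


df = pd.read_csv(sample_csv)
n_records, n_columns, class_counts = summarise(df)
print(n_records, n_columns, class_counts)
```

On the real data, the class counts would show whether the severity classes are imbalanced, which matters when choosing how to train and evaluate the model.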



<h3>Attributes</h3>

Our data contains attributes covering a range of different areas:

* time/date - such as the time of the accident (INCDTTM) and the date of the accident (INCDATE)
* location - such as associated intersection (INTKEY), junction type (JUNCTIONTYPE), address type (ADDRTYPE), a description of the location (LOCATION) and crosswalk id (CROSSWALKKEY)
* involvement - such as the number of people involved (PERSONCOUNT), the number of pedestrians involved (PEDCOUNT), the number of bicycles involved (PEDCYLCOUNT) and the number of vehicles involved (VEHCOUNT)
* conditions - such as weather (WEATHER), road conditions (ROADCOND), light conditions (LIGHTCOND)
* collision details - such as collision description (COLLISIONTYPE, ST_COLCODE, ST_COLDESC, SDOT_COLCODE, SDOT_COLDESC), lane segment involved (SEGLANEKEY), if speeding was a factor (SPEEDING), if a parked car was involved (HITPARKEDCAR), information on the pedestrian right of way (PEDROWNOTGRNT), whether the driver was under the influence or not (UNDERINFL), whether the accident was due to inattention (INATTENTIONIND)
* identification - unique IDs given by various organisations involved in collecting the data, such as OBJECTID, INCKEY, COLDETKEY, REPORTNO and SDOTCOLNUM


<h3>Identifying the relevant attributes</h3>

Not all of these attributes will be relevant for our model. For example, UNDERINFL, which identifies whether someone was under the influence of drugs or alcohol, is unlikely to be known in advance, so we should not include it in our model. In the same way, information on the pedestrian right of way (PEDROWNOTGRNT), and whether speeding (SPEEDING) or driver inattention (INATTENTIONIND) were factors, may also not be known in advance.
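A minimal sketch of excluding these four attributes with pandas follows. The toy DataFrame and the helper name `drop_unknown_in_advance` are made up for illustration; only the column names come from the data set description above.

```python
import pandas as pd

# Attributes the text argues are unlikely to be known when the call comes in.
NOT_KNOWN_IN_ADVANCE = ["UNDERINFL", "PEDROWNOTGRNT", "SPEEDING", "INATTENTIONIND"]


def drop_unknown_in_advance(df):
    """Return a copy of df without the attributes listed above."""
    # errors="ignore" skips any listed column that is absent from df.
    return df.drop(columns=NOT_KNOWN_IN_ADVANCE, errors="ignore")


# Made-up rows, for illustration only.
toy = pd.DataFrame(
    {
        "OBJECTID": [1, 2],
        "WEATHER": ["Clear", "Raining"],
        "UNDERINFL": ["N", "Y"],
        "SPEEDING": [None, "Y"],
    }
)
reduced = drop_unknown_in_advance(toy)
print(list(reduced.columns))
```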

The codes given by the SDOT and SPD are quite detailed. Examples for SDOT_COLDESC include:

* MOTOR VEHICLE STRUCK MOTOR VEHICLE, FRONT END AT ANGLE
* DRIVERLESS VEHICLE STRUCK MOTOR VEHICLE REAR END

Examples for ST_COLDESC include:

* From opposite direction - one left turn - one straight
* From same direction - both going straight - both moving - rear-end

An initial call-out is unlikely to provide enough information to identify the correct code and description, so these attributes should not be used in our model in their current format. COLLISIONTYPE, however, holds similar information in a form that is more likely to be obtainable from the initial report. Examples of COLLISIONTYPE values include:

* Rear Ended
* Angles
* Parked Car


The data contained in the LOCATION attribute can give very specific addresses. This data will be hard to analyse and group in a machine learning model. We will likely be unable to use it unless we can find a way to split or group it into a usable format.
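One possible way to group such addresses, sketched here purely as an assumption, is to keep only the first street named in the description. The sketch assumes LOCATION strings take the form "X AND Y" for intersections or "X BETWEEN Y AND Z" for blocks. That format is not confirmed by the text above and would need checking against the real data, and the street names used are invented.

```python
import re

# ASSUMED format: "<street> AND <street>" or "<street> BETWEEN <street> AND <street>".
_SEPARATOR = re.compile(r"\s+(?:AND|BETWEEN)\s+")


def primary_street(location):
    """Return the first street named in a LOCATION string, or None if unusable."""
    if not isinstance(location, str) or not location.strip():
        return None  # covers missing values such as NaN
    # Split once at the leftmost separator and keep what comes before it.
    return _SEPARATOR.split(location.strip(), maxsplit=1)[0]


# Invented street names, for illustration only.
print(primary_street("MAIN ST AND 1ST AVE"))
print(primary_street("MAIN ST BETWEEN 1ST AVE AND 2ND AVE"))
```

Grouping by the first street would still leave many distinct values, so whether this produces a usable feature is an open question.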

The unique IDs will also not form part of our model, but we will keep OBJECTID in order to identify each individual record.

We will therefore attempt to use the following attributes in our model, all of which should normally be easy to identify or estimate for a collision at the time it is first reported to the emergency services:
* time of the accident (INCDTTM)
* the date of the accident (INCDATE)
* intersection (INTKEY)
* junction type (JUNCTIONTYPE)
* address type (ADDRTYPE)
* crosswalk identifier (CROSSWALKKEY)
* lane segment involved (SEGLANEKEY)
* the number of people involved (PERSONCOUNT)
* the number of pedestrians involved (PEDCOUNT)
* the number of bicycles involved (PEDCYLCOUNT)
* the number of vehicles involved (VEHCOUNT)
* weather (WEATHER)
* road conditions (ROADCOND)
* light conditions (LIGHTCOND)
* collision type (COLLISIONTYPE)
* if a parked car was involved (HITPARKEDCAR)
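
The selection above could be applied to a DataFrame roughly as follows. This is a sketch under stated assumptions: the helper name `select_model_columns` and the one-row toy frame are hypothetical, and only the column names are taken from the list above.

```python
import pandas as pd

ID_COLUMN = "OBJECTID"  # kept only to identify records, not as a model input

FEATURES = [
    "INCDTTM", "INCDATE", "INTKEY", "JUNCTIONTYPE", "ADDRTYPE",
    "CROSSWALKKEY", "SEGLANEKEY", "PERSONCOUNT", "PEDCOUNT", "PEDCYLCOUNT",
    "VEHCOUNT", "WEATHER", "ROADCOND", "LIGHTCOND", "COLLISIONTYPE",
    "HITPARKEDCAR",
]


def select_model_columns(df):
    """Return the identifier plus the chosen attributes, failing loudly if any is absent."""
    wanted = [ID_COLUMN] + FEATURES
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[wanted].copy()


# One-row toy frame with placeholder values and an extra column to be discarded.
toy = pd.DataFrame({col: [0] for col in [ID_COLUMN] + FEATURES + ["REPORTNO"]})
selected = select_model_columns(toy)
print(selected.shape)
```

Raising on a missing column, rather than skipping it silently, is a deliberate choice here so that a renamed or misspelled attribute is noticed before any model is trained.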