# Introduction

Every year, more than 1200 Australians lose their lives on the roads. These deaths cause enormous pain and suffering for the families who lose loved ones, and the grief extends into the wider community. Road accidents also have a significant impact on both the local and national economy, not only through the lost productivity of lives cut short, but also through the injuries, temporary or permanent disabilities, and emotional trauma experienced by the individuals involved and by their families and loved ones.

To date, in Australia, there has been a major policy focus on 3 areas:

1. Speeding prevention and speed limit compliance;
2. Drink driving prevention; and
3. Enhanced restrictions on younger drivers (including extended probationary periods, lower alcohol limits, limits on passengers, etc.).

However, other factors such as the weather, lighting conditions and road conditions have not received the same amount of attention.

By applying data science techniques to a robust dataset that covers a range of factors, policy makers can identify which factors correlate with a high accident rate, and with a high rate of accidents resulting in death, and decide where to direct policy focus. The same data could also be used to integrate recommendations into mapping and/or weather software, so that drivers are routed, where possible, away from areas with conditions known to contribute to accidents (e.g. poor lighting, adverse weather or poor road conditions). It may also allow better use of emergency services resources and shorter response times, by helping emergency services planners predict where their services will be needed in response to car accidents.

Key stakeholders for this project include:
 - Road safety policy makers;
 - Mapping and direction planning software developers;
 - Emergency services response planners; and
 - Road users.

Currently, the data gathered in Australia does not consider factors like the weather, the road conditions, or lighting conditions. As such, analogous data from the US will be utilized to help direct Australian data gathering efforts.

This project will consist of 3 deliverables:

1. A report notebook (assessed in Week 1 and Week 2), which will be presented as a report;
2. A coding notebook (which will show the code); and
3. A final presentation.

This notebook is the report notebook.


# Data

The Seattle SDOT Traffic Management Division, Traffic Records Group has collated data from 2004 to present, which will be used as a basis for initial data modelling in the absence of Australian data. This data is stored [here](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv).

It should be noted that this data has been pre-treated: fatality data has been amalgamated with injury data, and records where the outcome was unknown have been scrubbed. Each element is therefore categorised as resulting in either property damage (1) or physical injury (2), which differs from the description in the metadata file.
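
As a minimal sketch of the two-category coding described above, the snippet below tallies records by severity label. It assumes each record is a dictionary keyed by column name (as `csv.DictReader` would produce); the helper name `tally_severity` and the sample rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter

# Labels follow the two categories described in the text above.
SEVERITY_LABELS = {"1": "property damage", "2": "injury"}


def tally_severity(rows):
    """Count records per severity label, keeping unexpected codes visible."""
    counts = Counter()
    for row in rows:
        code = str(row.get("SEVERITYCODE", "")).strip()
        label = SEVERITY_LABELS.get(code, f"unexpected code {code!r}")
        counts[label] += 1
    return counts


# Made-up sample rows, for illustration only.
sample = [{"SEVERITYCODE": "1"}, {"SEVERITYCODE": "2"}, {"SEVERITYCODE": "1"}]
print(dict(tally_severity(sample)))
```

Reporting unexpected codes under their own label, rather than dropping them, makes it easier to confirm that the pre-treated file really contains only the two categories.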

Approximately 5000 elements (out of nearly 200,000) are missing collision type information, accounting for ~2.5% of the data. However, this is not relevant to the question at hand (how weather, lighting conditions and intersection types impact the likelihood of a collision and the likelihood of a collision resulting in an injury). Similarly, the data on driver intoxication is inconsistent, using both a 1/0 and a Y/N format, and some location data is also missing; but again, these are neither independent nor dependent variables in this analysis and will not be used in the project.
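
The checks described above could be reproduced along the following lines. This is a sketch only: the column names `COLLISIONTYPE` and `UNDERINFL` are assumptions about the file's headers, the helper names are hypothetical, and the sample rows are made up for illustration.

```python
def missing_share(rows, column):
    """Return the fraction of rows whose value in `column` is absent or blank."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not str(row.get(column) or "").strip())
    return missing / len(rows)


def normalise_flag(value):
    """Map 'Y'/'1' to True and 'N'/'0' to False; anything else becomes None."""
    text = "" if value is None else str(value).strip().upper()
    if text in {"Y", "1"}:
        return True
    if text in {"N", "0"}:
        return False
    return None


# Made-up rows; COLLISIONTYPE and UNDERINFL are assumed column names.
rows = [
    {"COLLISIONTYPE": "Angles", "UNDERINFL": "N"},
    {"COLLISIONTYPE": "", "UNDERINFL": "1"},
    {"COLLISIONTYPE": "Parked Car", "UNDERINFL": "0"},
    {"COLLISIONTYPE": "Rear Ended", "UNDERINFL": "Y"},
]

print(missing_share(rows, "COLLISIONTYPE"))
print([normalise_flag(row["UNDERINFL"]) for row in rows])
```

Since neither column is used in this analysis, these helpers would only matter if the scope were later widened to include collision type or intoxication.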


## Feature Selection


The accident severity code (SEVERITYCODE), road conditions (ROADCOND), weather conditions (WEATHER), and lighting conditions (LIGHTCOND) will be the main focus of this analysis. Therefore, elements where these codes are not defined (i.e. "Other"), are unknown, or are blank have been scrubbed for that portion of the analysis. Removing this data resulted in the loss of 24716 out of 194673 elements, or approximately 12.7% of the data. This is viewed as preferable to the distortion of results due to missing data.
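
The scrubbing rule above can be sketched as follows. The exact spellings of the undefined values ("Other", "Unknown", blank) and the case-insensitive matching are assumptions, as is applying the check to the three condition columns only; `is_defined` and `scrub` are hypothetical helper names, and the sample rows are made up.

```python
FEATURES = ("ROADCOND", "WEATHER", "LIGHTCOND")
UNDEFINED = {"", "other", "unknown"}  # assumed spellings, compared case-insensitively


def is_defined(row):
    """True when every feature column holds a defined, known value."""
    return all(
        str(row.get(col) or "").strip().lower() not in UNDEFINED
        for col in FEATURES
    )


def scrub(rows):
    """Return the rows kept and the number of rows removed."""
    kept = [row for row in rows if is_defined(row)]
    return kept, len(rows) - len(kept)


# Made-up sample rows, for illustration only.
sample = [
    {"ROADCOND": "Wet", "WEATHER": "Raining", "LIGHTCOND": "Daylight"},
    {"ROADCOND": "Dry", "WEATHER": "Unknown", "LIGHTCOND": "Daylight"},
    {"ROADCOND": "Other", "WEATHER": "Clear", "LIGHTCOND": "Dusk"},
    {"ROADCOND": "Dry", "WEATHER": "Clear", "LIGHTCOND": ""},
]
kept, removed = scrub(sample)
print(len(kept), removed)

# Share of data lost, using the counts reported in the text.
print(round(100 * 24716 / 194673, 1))
```

A row is dropped if any one of the three columns is undefined, which is the stricter reading of the rule; filtering per column for each individual analysis would retain more data.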
