### This is a Jupyter Notebook for the Capstone project of the Applied Data Science specialisation on Coursera
Link: https://www.coursera.org/learn/applied-data-science-capstone/home/welcome

## Introduction

In this project we attempt to build a machine-learning-based system that predicts the severity of a car accident from known attributes of the accident and the conditions under which it occurred. The severity value labels each accident and splits accidents into categories by their negative impact: human fatalities, injuries, property damage, traffic delay, or any other kind of harm.

#### Interest

Such a system is of interest to insurance companies, police department traffic management divisions, or any other party that analyses labeled incident data. Insurance companies could use the derived severity in insurance payment calculations. Police departments could use the labeled data to prioritise road infrastructure improvements and reduce the number of accidents in the most accident-prone areas.


## Data


#### Dataset

The prediction system is built on collision records provided by the Seattle Police Department. The dataset contains a little under 200 000 records spanning 2004 to 2020.

The dataset covers many types of collisions. Key attributes include the number of people and vehicles involved in an incident, weather and road conditions, and reported violations that might have increased the chance of a collision. Incidents are labeled with a severity code ranging from 1 (property damage) to 3 (fatality).

A closer look at the severity level in the provided dataset shows that only 2 values are present in the records: severity level 1 (property damage) and severity level 2 (injury). A probable explanation for the lack of higher-severity records is the sensitivity of such data, which may have led the authorities to filter it out of the records available to the public.
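
The check on the label values can be sketched with `value_counts`. The snippet below is a minimal illustration on a hypothetical toy frame (`toy` and its five rows are made up for the example, not taken from the dataset); on the real data the same calls would be applied to the `SEVERITYCODE` column of `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real SEVERITYCODE column
toy = pd.DataFrame({'SEVERITYCODE': [1, 2, 1, 1, 2]})

# Number of records per severity level
severity_counts = toy['SEVERITYCODE'].value_counts()

# Share of each severity level, useful for spotting class imbalance
severity_share = toy['SEVERITYCODE'].value_counts(normalize=True)

print(severity_counts)
print(severity_share)
```

Looking at the shares as well as the raw counts shows how unbalanced the two classes are, which matters later when a classifier is evaluated.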

In [63]:
import pandas as pd
import numpy as np

df = pd.read_csv('Data-Collisions.csv')
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

#### Data cleaning

A closer look at the data shows that the dataset contains a lot of information that is not relevant for our analysis and can be removed. In particular:

1. Columns containing various identifiers of an incident - OBJECTID, INCKEY, COLDETKEY, REPORTNO, INTKEY, SDOTCOLNUM, EXCEPTRSNCODE
2. Columns containing location data - X, Y, LOCATION, SEGLANEKEY, CROSSWALKKEY
3. Descriptions of label data - SEVERITYDESC, ST_COLDESC, EXCEPTRSNDESC, SDOT_COLDESC
4. Date and time data - INCDATE, INCDTTM
5. The severity label is duplicated in columns SEVERITYCODE and SEVERITYCODE.1, so we can drop the latter
6. The STATUS column has no description and is irrelevant for the analysis

In [64]:
columns_to_drop = [
    'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO', 'INTKEY', 'SDOTCOLNUM', 'EXCEPTRSNCODE', #identifiers
    'X', 'Y', 'LOCATION', 'SEGLANEKEY', 'CROSSWALKKEY', #location
    'SEVERITYDESC', 'ST_COLDESC', 'EXCEPTRSNDESC', 'SDOT_COLDESC', #descriptions
    'INCDATE', 'INCDTTM', #date and time
    'SEVERITYCODE.1', #duplicated label
    'STATUS' #unknown purpose of this column
]
df.drop(columns=columns_to_drop, inplace=True)
df.shape

(194673, 18)

The following columns needed minor fixes, either to replace missing values with a default value or to make the values in the column consistent:
1. INATTENTIONIND, PEDROWNOTGRNT, SPEEDING - replace NaNs with 0s and 'Y's with 1s
2. UNDERINFL, HITPARKEDCAR - replace 'N's with 0s and 'Y's with 1s

In [66]:
flag_columns = ['INATTENTIONIND', 'PEDROWNOTGRNT', 'SPEEDING', 'UNDERINFL', 'HITPARKEDCAR']
# Missing values in these three flag columns are treated as 0
df[['INATTENTIONIND', 'PEDROWNOTGRNT', 'SPEEDING']] = df[['INATTENTIONIND', 'PEDROWNOTGRNT', 'SPEEDING']].fillna(0)
# Encode the 'N'/'Y' flags as 0/1 in all flag columns
df[flag_columns] = df[flag_columns].replace({'N': 0, 'Y': 1})
df.shape

(194673, 18)
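
As an alternative way to express the same encoding, the flags could be mapped through an explicit dictionary. This is only a sketch on a hypothetical toy frame (`toy` and `flag_map` are illustrative names), and it differs from the cell above in one respect: any value that is not 'Y' or 'N' ends up as 0 instead of being left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one flag with NaN for "no", one with explicit 'N'/'Y'
toy = pd.DataFrame({
    'SPEEDING': ['Y', np.nan, 'Y'],
    'UNDERINFL': ['N', 'Y', 'N'],
})

# Explicit mapping; values missing from the dict become NaN and are then set to 0
flag_map = {'Y': 1, 'N': 0}
for col in ['SPEEDING', 'UNDERINFL']:
    toy[col] = toy[col].map(flag_map).fillna(0).astype(int)

print(toy)
```

The explicit cast to `int` guarantees numeric columns, which most scikit-learn style estimators expect, but it is worth checking the distinct values of each column first so that nothing unexpected is silently turned into 0.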

Finally, we drop all rows that contain 'Unknown' or NaN values in any of the columns.

In [67]:
df.replace('Unknown', np.nan, inplace=True)
df.dropna(inplace=True)
df.shape

(167840, 18)
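
Because this step removes a noticeable share of the rows, it can help to see which columns are responsible. The following is a minimal sketch on a hypothetical toy frame (the rows in `toy` are invented for illustration); the same two expressions could be run on `df` before the rows are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with both 'Unknown' markers and real NaNs
toy = pd.DataFrame({
    'WEATHER': ['Clear', 'Unknown', 'Raining', np.nan],
    'ROADCOND': ['Dry', 'Wet', 'Unknown', 'Dry'],
    'VEHCOUNT': [2, 2, 1, 3],
})

# Treat 'Unknown' as missing, then count missing values per column
cleaned = toy.replace('Unknown', np.nan)
missing_per_column = cleaned.isna().sum()

# Number of rows that survive dropping every row with a missing value
rows_kept = len(cleaned.dropna())

print(missing_per_column)
print(rows_kept)
```

If a single column accounts for most of the missing values, dropping that column may preserve more records than dropping rows.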

In [68]:
df.columns

Index(['SEVERITYCODE', 'ADDRTYPE', 'COLLISIONTYPE', 'PERSONCOUNT', 'PEDCOUNT',
       'PEDCYLCOUNT', 'VEHCOUNT', 'JUNCTIONTYPE', 'SDOT_COLCODE',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SPEEDING', 'ST_COLCODE', 'HITPARKEDCAR'],
      dtype='object')