# Notebook for the Coursera IBM Data Science Capstone Project

### Introduction to the Business Problem

The ultimate purpose of this project is to prevent avoidable car accidents by alerting drivers and the relevant public services to the forecasted severity of accidents. The forecast can serve as a reminder to take extra care in critical situations.

Some car accidents are caused by inattention while driving, drug or alcohol abuse, or speeding. The majority of these accidents can be prevented by setting stricter regulations and enforcing them properly. However, uncontrollable factors such as weather, visibility, and road conditions also significantly increase the probability of car accidents. Revealing the underlying patterns in historical data and sending timely warnings to drivers and public services would therefore help prevent avoidable accidents and allocate rescue efforts more effectively.

The project should benefit individual drivers, local government, police, rescue groups, and car insurance companies. The model and its results are intended to help these audiences make informed decisions that reduce the number of accidents and injuries.

### Description of Data

The data, collected since 2004, consists of 194,673 rows and 37 independent variables. The dependent variable, “SEVERITYCODE”, is a number from 0 to 4 that encodes the severity level of an accident. In this dataset only codes 1 and 2 actually occur, as the value counts in the preprocessing section show.<br>

Severity codes are as follows:<br>
0: Little to no Probability (Clear Conditions)<br>
1: Very Low Probability (Chance of Property Damage)<br>
2: Low Probability (Chance of Injury)<br>
3: Mild Probability (Chance of Serious Injury)<br>
4: High Probability (Chance of Fatality)<br>

A quick look at the data shows that it needs to be preprocessed first.

### Data Preprocessing

The dataset in its original form is not ready for analysis. We first need to take the following actions:<br>
1) Drop the non-relevant columns.<br>
2) Handle null values in some records.<br>
3) Convert object data types into numerical data types.<br>
We focus on four columns: severity (the target) and three features, namely weather conditions, road conditions, and light conditions.
Because the dataset is unbalanced, we also need to resample it into a balanced dataframe, as shown below.
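
The three preprocessing steps can be sketched on a small made-up frame. This is only an illustration under assumptions: the toy values, the `preprocess` helper, and the choice to drop incomplete rows and to use integer category codes are not taken from the notebook, which may handle nulls and encoding differently.

```python
import pandas as pd

# Toy stand-in for the collisions data; the values are made up for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2, 1],
    "WEATHER": ["Clear", "Raining", None, "Clear", "Overcast"],
    "ROADCOND": ["Dry", "Wet", "Dry", None, "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight", "Daylight", "Daylight"],
    "REPORTNO": ["a1", "b2", "c3", "d4", "e5"],  # not relevant to the model
})

def preprocess(frame, target="SEVERITYCODE",
               features=("WEATHER", "ROADCOND", "LIGHTCOND")):
    """Hypothetical helper: keep chosen columns, drop incomplete rows, encode text."""
    # 1) Drop the non-relevant columns by keeping only the ones we need.
    out = frame[[target, *features]].copy()
    # 2) Remove records with a null in any retained column (one option among several).
    out = out.dropna()
    # 3) Convert the text columns to integer category codes.
    for col in features:
        out[col] = out[col].astype("category").cat.codes
    return out.reset_index(drop=True)

clean = preprocess(toy)
print(clean.dtypes)
print(clean)
```

Category codes impose an arbitrary ordering on the labels, which is acceptable for tree-based models but may call for one-hot encoding with distance-based or linear models.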

In [14]:
import pandas as pd
import numpy as np

In [15]:
df = pd.read_csv('Data-Collisions.csv')
df.head(3)



| | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | | | | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | | ... | Wet | Dark - Street Lights On | | 6354039.0 | | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | | ... | Dry | Daylight | | 4323031.0 | | 32 | One parked--one moving | 0 | 0 | N |


In [16]:
df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

In [17]:
df.shape

(194673, 38)

In [18]:
df["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64
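
To make the imbalance concrete, the class shares and the majority-to-minority ratio can be computed from the counts printed above. This is a minimal sketch; the names `counts`, `shares`, and `ratio` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({1: 136485, 2: 58188}, name="SEVERITYCODE")

shares = counts / counts.sum()        # fraction of rows in each class
ratio = counts.max() / counts.min()   # majority-to-minority ratio

print(shares.round(3))
print(f"majority/minority ratio: {ratio:.2f}")
```

Roughly 70% of the rows carry code 1, so a model trained on the raw data could reach about 70% accuracy by always predicting the majority class, which is why the classes are balanced before modelling.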

In [19]:
from sklearn.utils import resample

In [20]:
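# Split the rows by class: code 1 is the majority, code 2 the minority.
df_maj = df[df.SEVERITYCODE==1]
df_min = df[df.SEVERITYCODE==2]

# Downsample the majority class without replacement to the minority class size (58188 rows).
df_maj_dsample = resample(df_maj,replace=False,n_samples=len(df_min),random_state=123)

# Recombine into a balanced dataframe and confirm the class counts.
new_df=pd.concat([df_maj_dsample,df_min])

new_df.SEVERITYCODE.value_counts()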

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64