# This is a notebook for the Coursera IBM Data Science Capstone project

### Introduction of Business Problem

The ultimate purpose of this project is to help prevent avoidable car accidents by alerting drivers and relevant public services to the forecasted severity of accidents. The forecast can serve as a reference that reminds people to be more careful in critical situations.

Some car accidents are caused by inattention while driving, drug or alcohol abuse, or speeding. The majority of these accidents can be prevented by setting stricter regulations and enforcing them properly. However, other, uncontrollable factors such as weather, visibility, and road conditions also significantly increase the probability of car accidents. Revealing the underlying patterns in historical data and sending timely warnings to drivers and public services would therefore help prevent avoidable accidents and improve the allocation of rescue efforts.

The project should benefit individual drivers, local government, police, rescue groups, and car insurance companies. The model and its results are intended to give these audiences guidance for making informed decisions that reduce the number of accidents and injuries.

### Description of Data

The data, collected since 2004, consist of 37 independent variables and 194,673 rows. The dependent variable, “SEVERITYCODE”, holds a number from 0 to 4 that corresponds to the severity level of an accident. In the file used here only codes 1 and 2 occur, as the value counts further down show.<br>

Severity codes are as follows:<br>
0: Little to no Probability (Clear Conditions)<br>
1: Very Low Probability — Chance of Property Damage<br>
2: Low Probability — Chance of Injury<br>
3: Mild Probability — Chance of Serious Injury<br>
4: High Probability — Chance of Fatality<br>

A quick look at the data shows that it needs to be preprocessed first.

### Data Preprocessing

The dataset in its original form is not ready for analysis. We need to take the following actions first.<br>
1) Drop the non-relevant columns.<br>
2) Handle null values in some records.<br>
3) Convert object data types into numerical data types.<br>
We focus on 4 columns: severity (the target) and three features, namely weather conditions, road conditions, and light conditions.
Since the dataset is unbalanced, we need to resample it into a balanced dataframe as below.
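
The three preprocessing steps listed above can be sketched on a toy table. This is only an illustration under stated assumptions: the rows are invented, `REPORTNO` stands in for any non-relevant column, and the category codes come from pandas rather than from the hand-picked ordinal scale used later in this notebook.

```python
import pandas as pd

# Toy stand-in for the collisions table; the rows are invented for illustration.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "WEATHER": ["Clear", "Raining", None, "Clear"],
    "ROADCOND": ["Dry", "Wet", "Dry", None],
    "LIGHTCOND": ["Daylight", "Daylight", "Dusk", "Daylight"],
    "REPORTNO": ["a1", "b2", "c3", "d4"],  # an example of a non-relevant column
})

# 1) Keep only the target and the three condition columns.
cols = ["SEVERITYCODE", "WEATHER", "ROADCOND", "LIGHTCOND"]
subset = toy[cols]

# 2) Drop records with a null in any of the kept columns.
subset = subset.dropna(axis=0, how="any")

# 3) Convert the object columns to integer category codes
#    (assigned alphabetically by pandas, not by severity).
encoded = subset.copy()
for col in ["WEATHER", "ROADCOND", "LIGHTCOND"]:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Category codes are assigned in sorted order, so they carry no notion of "better" or "worse" conditions; a hand-written mapping is needed when the order matters.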

In [14]:
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline

In [2]:
raw_df = pd.read_csv('Data-Collisions.csv')
raw_df.head(3)


Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N


In [3]:
raw_df.columns

Index(['SEVERITYCODE', 'X', 'Y', 'OBJECTID', 'INCKEY', 'COLDETKEY', 'REPORTNO',
       'STATUS', 'ADDRTYPE', 'INTKEY', 'LOCATION', 'EXCEPTRSNCODE',
       'EXCEPTRSNDESC', 'SEVERITYCODE.1', 'SEVERITYDESC', 'COLLISIONTYPE',
       'PERSONCOUNT', 'PEDCOUNT', 'PEDCYLCOUNT', 'VEHCOUNT', 'INCDATE',
       'INCDTTM', 'JUNCTIONTYPE', 'SDOT_COLCODE', 'SDOT_COLDESC',
       'INATTENTIONIND', 'UNDERINFL', 'WEATHER', 'ROADCOND', 'LIGHTCOND',
       'PEDROWNOTGRNT', 'SDOTCOLNUM', 'SPEEDING', 'ST_COLCODE', 'ST_COLDESC',
       'SEGLANEKEY', 'CROSSWALKKEY', 'HITPARKEDCAR'],
      dtype='object')

In [4]:
raw_df.shape

(194673, 38)

In [5]:
raw_df["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

In [6]:
from sklearn.utils import resample

In [7]:
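# Separate the majority class (1) from the minority class (2)
df_maj = raw_df[raw_df.SEVERITYCODE==1]
df_min = raw_df[raw_df.SEVERITYCODE==2]

# Downsample the majority class without replacement to the minority class size (58188)
df_maj_dsample = resample(df_maj,replace=False,n_samples=len(df_min),random_state=123)

# Combine both classes into one balanced dataframe
df=pd.concat([df_maj_dsample,df_min])

df.SEVERITYCODE.value_counts()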

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64

### Methodology

To implement the solution, I used GitHub as the repository and Jupyter Notebook to preprocess the data and build the machine learning models. The code is written in Python with its popular packages Pandas, NumPy, and Sklearn.<br>

Once the data was loaded into a Pandas DataFrame, I used the ‘dtypes’ attribute to check the feature names and their data types. I then selected the most important features for predicting the severity of accidents in Seattle. Among all the features, the following have the most influence on the accuracy of the predictions:<br>

“WEATHER”<br>
“ROADCOND”<br>
“LIGHTCOND”<br>

Also, as I mentioned earlier, “SEVERITYCODE” is the target variable.

I ran a value count on road condition (‘ROADCOND’) and weather condition (‘WEATHER’) to get an idea of the different road and weather conditions. I also ran a value count on light condition (‘LIGHTCOND’) to see the breakdown of accidents across the different light conditions. The severity breakdowns per category appear in the outputs later in this section.
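
The value counts described above are not shown as a cell in this notebook. A minimal sketch of that step, assuming a small invented frame named `sample` in place of the loaded collisions data, could look like this:

```python
import pandas as pd

# Invented sample rows; the real notebook would use the loaded collisions frame.
sample = pd.DataFrame({
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", None],
    "ROADCOND": ["Dry", "Wet", "Dry", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dusk", "Daylight", "Daylight", "Daylight"],
})

counts = {}
for col in ["WEATHER", "ROADCOND", "LIGHTCOND"]:
    # dropna=False keeps missing values visible as their own row
    counts[col] = sample[col].value_counts(dropna=False)
    print(counts[col], "\n")
```

Passing `dropna=False` is a choice made for this sketch so that null records are counted too, which helps when deciding how to handle them.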

After balancing the SEVERITYCODE classes and standardizing the input features, the data was ready for building machine learning models.
I employed three machine learning models:<br>

K Nearest Neighbour (KNN)<br>
Decision Tree<br>
Linear Regression<br>
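
The standardization step mentioned above does not appear as a cell here. As a hedged sketch, assuming scikit-learn's `StandardScaler` and an invented array `X_demo` whose columns stand for the three encoded features, it might be done as follows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented encoded features: columns stand for WEATHER, ROADCOND, LIGHTCOND.
X_demo = np.array([
    [4.0, 4.0, 3.0],
    [2.0, 3.0, 1.0],
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 3.0],
])

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

In a full pipeline the scaler would usually be fitted on the training split only and then applied to the test split, so that no test information leaks into the model.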

After importing the necessary packages and splitting the preprocessed data into train and test sets, I built and evaluated each machine learning model, with the results shown as follows:

In [15]:
import numpy as np
X = new_df[['SEVERITYCODE','WEATHER','ROADCOND','LIGHTCOND']].values
X [0:5]

array([[1, 'Raining', 'Wet', 'Dark - Street Lights On'],
       [1, 'Clear', 'Dry', 'Daylight'],
       [1, 'Unknown', 'Unknown', 'Unknown'],
       [1, 'Clear', 'Dry', 'Daylight'],
       [1, 'Clear', 'Dry', 'Daylight']], dtype=object)

In [46]:
df.groupby(['WEATHER'])['SEVERITYCODE'].value_counts(normalize=True)

WEATHER           SEVERITYCODE
0                 1               0.871572
                  2               0.128428
1                 1               0.622575
                  2               0.377425
2                 2               0.542672
                  1               0.457328
3                 2               0.519539
                  1               0.480461
4                 2               0.527478
                  1               0.472522
Severe Crosswind  2               0.538462
                  1               0.461538
Name: SEVERITYCODE, dtype: float64

In [None]:
# Encode WEATHER on an ordinal scale from 4 (clear) down to 0 (unknown/other)
df['WEATHER'] = df['WEATHER'].replace({
    'Clear': 4,
    'Overcast': 3, 'Partly Cloudy': 3,
    'Raining': 2, 'Fog/Smog/Smoke': 2,
    'Sleet/Hail/Freezing Rain': 1, 'Blowing Sand/Dirt': 1, 'Snowing': 1, 'Severe Crosswind': 1,
    'Unknown': 0, 'Other': 0,
})

# Encode ROADCOND on an ordinal scale from 4 (dry) down to 0 (unknown/other)
df['ROADCOND'] = df['ROADCOND'].replace({
    'Dry': 4,
    'Wet': 3,
    'Sand/Mud/Dirt': 2, 'Standing Water': 2, 'Oil': 2,
    'Ice': 1, 'Snow/Slush': 1,
    'Unknown': 0, 'Other': 0,
})

# value_counts must be called; without parentheses nothing is computed
df.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts()
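
The cell above encodes WEATHER and ROADCOND, but the LIGHTCOND encoding is not shown, even though later outputs list LIGHTCOND codes 0 to 3. The sample rows printed in this notebook suggest that 'Daylight' became 3, 'Dark - Street Lights On' became 1, and 'Unknown' became 0. The sketch below fills in the rest as an assumption: the other category names and their codes in `light_map` are hypothetical and should be checked against the actual column values.

```python
import pandas as pd

# Hypothetical mapping: only Daylight=3, 'Dark - Street Lights On'=1 and
# Unknown=0 are suggested by the notebook's outputs; the rest is assumed.
light_map = {
    "Daylight": 3,
    "Dusk": 2,
    "Dawn": 2,
    "Dark - Street Lights On": 1,
    "Dark - Street Lights Off": 1,
    "Dark - No Street Lights": 1,
    "Dark - Unknown Lighting": 1,
    "Unknown": 0,
    "Other": 0,
}

# Invented rows, including one missing value.
demo = pd.DataFrame({
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Unknown", "Dusk", None]
})

# Series.map leaves missing values as NaN, so nulls can be handled later.
demo["LIGHTCOND"] = demo["LIGHTCOND"].map(light_map)
print(demo)
```

Unlike `replace`, `map` also turns any category missing from the dictionary into NaN, which makes unmapped values easy to spot.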

In [43]:
df.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)

ROADCOND  SEVERITYCODE
0.0       1               0.885034
          2               0.114966
1.0       1               0.635158
          2               0.364842
2.0       2               0.538462
          1               0.461538
3.0       2               0.536359
          1               0.463641
4.0       2               0.527158
          1               0.472842
Name: SEVERITYCODE, dtype: float64

In [37]:
df.groupby(['LIGHTCOND'])['SEVERITYCODE'].value_counts(normalize=True)

LIGHTCOND  SEVERITYCODE
0.0        1               0.893844
           2               0.106156
1.0        1               0.507343
           2               0.492657
2.0        2               0.539887
           1               0.460113
3.0        2               0.539054
           1               0.460946
Name: SEVERITYCODE, dtype: float64

In [62]:
new_df = df[['SEVERITYCODE','WEATHER','ROADCOND','LIGHTCOND']]
new_df.head(3)

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
25055,1,2.0,3.0,1.0
65280,1,4.0,4.0,3.0
86292,1,0.0,0.0,0.0


In [74]:
missing_data =new_df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("") 


SEVERITYCODE
False    116376
Name: SEVERITYCODE, dtype: int64

WEATHER
False    113560
True       2816
Name: WEATHER, dtype: int64

ROADCOND
False    113612
True       2764
Name: ROADCOND, dtype: int64

LIGHTCOND
False    113528
True       2848
Name: LIGHTCOND, dtype: int64



In [55]:
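# Features are the three encoded condition columns; SEVERITYCODE is the target, not an input
X = new_df[['WEATHER','ROADCOND','LIGHTCOND']]
y = df['SEVERITYCODE'].values
y[0:5]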

array([1, 1, 1, 1, 1])

In [75]:
# Note: dropna returns a new DataFrame; new_df is unchanged unless the result is assigned back
new_df.dropna(axis=0,how='any')

Unnamed: 0,SEVERITYCODE,WEATHER,ROADCOND,LIGHTCOND
25055,1,2.0,3.0,1.0
65280,1,4.0,4.0,3.0
86292,1,0.0,0.0,0.0
155111,1,4.0,4.0,3.0
64598,1,4.0,4.0,3.0
...,...,...,...,...
194663,2,2.0,3.0,3.0
194666,2,4.0,3.0,3.0
194668,2,4.0,4.0,3.0
194670,2,4.0,4.0,3.0
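
Because `dropna` returns a new frame, rows with nulls stay in `new_df` unless the result is kept, and scikit-learn's KNN classifier does not accept NaN inputs. A minimal sketch of one way to order the steps, using an invented frame `frame` and the hypothetical names `clean`, `X_clean`, and `y_clean`:

```python
import numpy as np
import pandas as pd

# Invented encoded rows with a few missing values.
frame = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 2, 2, 1],
    "WEATHER": [2.0, 4.0, np.nan, 4.0, 0.0],
    "ROADCOND": [3.0, 4.0, 3.0, np.nan, 0.0],
    "LIGHTCOND": [1.0, 3.0, 3.0, 3.0, 0.0],
})

# dropna returns a new frame, so the result has to be assigned.
clean = frame.dropna(axis=0, how="any")

# Build features and target from the same cleaned frame so the rows stay aligned.
feature_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]
X_clean = clean[feature_cols].values
y_clean = clean["SEVERITYCODE"].values

print(X_clean.shape, y_clean.shape)
```

Taking both `X_clean` and `y_clean` from `clean` avoids a length mismatch between features and labels after rows are dropped. Note that dropping rows can leave the two classes slightly unbalanced again.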


#### K Nearest Neighbor (KNN)

In [76]:
# Split X and y into train and test sets to find the best k
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (93100, 3) (93100,)
Test set: (23276, 3) (23276,)


In [None]:
# Modeling
from sklearn.neighbors import KNeighborsClassifier
k = 2
#Train Model and Predict  
kNN_model = KNeighborsClassifier(n_neighbors=k).fit(X_train,y_train)
kNN_model
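
The cell above fixes `k = 2`, while the earlier comment mentions finding the best k. One common way to do that is to try a range of values and compare accuracy. The sketch below assumes synthetic data from `make_classification` in place of the collision features, and the range 1 to 10 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with three features, like the notebook's X.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=4
)

# Fit one model per k and record its accuracy on the held-out split.
k_values = range(1, 11)
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

# Pick the k with the highest held-out accuracy.
best_k = k_values[int(np.argmax(accuracies))]
print("best k:", best_k, "accuracy:", max(accuracies))
```

Selecting k on the test split makes the reported test accuracy optimistic; a separate validation split or cross-validation would give a fairer estimate.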

#### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
DT_model.fit(X_train,y_train)
DT_model

In [None]:
yhat = DT_model.predict(X_test)
yhat
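
The predictions above are not scored in this notebook. As a hedged sketch of how they might be evaluated, the example below trains the same kind of tree on synthetic data (an assumption, not the collision data) and reports accuracy and a weighted F1 score. The variable names `tree`, `pred`, `acc`, and `f1` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; labels are shifted to 1/2 to mirror SEVERITYCODE.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=4
)
y_syn = y_syn + 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=4
)

# Same settings as the notebook's tree, plus a fixed seed for repeatability.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
tree.fit(X_tr, y_tr)
pred = tree.predict(X_te)

acc = accuracy_score(y_te, pred)
f1 = f1_score(y_te, pred, average="weighted")
print("accuracy:", acc)
print("weighted F1:", f1)
```

The same two calls can be applied to `y_test` and `yhat` from the cells above once the models have been fitted.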

#### Linear Regression
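
This section has no cells yet. Because the target is a two-class label, one possibility is logistic regression, which is the usual linear model for classification; that is a substitution for the linear regression named in the heading, not something this notebook confirms. The sketch below uses synthetic data, and the name `LR_model`, the value `C=0.01`, and the `liblinear` solver are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three features and a binary target.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=4
)

# Fit a regularized logistic regression (hyperparameters are assumed values).
LR_model = LogisticRegression(C=0.01, solver="liblinear").fit(X_tr, y_tr)

# Class predictions and per-class probabilities for the held-out rows.
yhat_lr = LR_model.predict(X_te)
yhat_prob = LR_model.predict_proba(X_te)

ll = log_loss(y_te, yhat_prob)
print("log loss:", ll)
```

Log loss uses the predicted probabilities rather than the hard labels, so it complements accuracy and F1 when comparing this model with the other two.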

### Results and Evaluations

### Conclusion