<a href="https://colab.research.google.com/github/yashrakeshmishra/Coursera_Capstone/blob/master/capstoneProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Project IBM Data Science Course

This Notebook is for the final Capstone Project in the IBM Data Science Professional Certificate course. It will involve a series of operations as required by the project.

# Predicting the severity of accident based on different conditions.

The model is trained using a classification algorithm. A data set consisting of road accidents in the city of Seattle from 2014 to present is used to train the model and evaluate it's metrics. Since the dataset only contains about two types of severities; namely slight and severe. A classic binary classfication algorithm is used to construct a model.

## Importing the important libraries for handling data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

  import pandas.util.testing as tm


## Downloading the data from the internet and storing it as a Pandas dataframe.

In [2]:
df = pd.read_csv('https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv')

## Refining the data

Important colums which will we required to evaluate the severity are kept and other non essential elements are dropped.

In [3]:
data = df[['SEVERITYCODE','PERSONCOUNT','VEHCOUNT','WEATHER','ROADCOND','LIGHTCOND','INATTENTIONIND','UNDERINFL']]
attributes = data.columns.to_list()

The data obtained has many null values. We could have eliminated the null values but that would result in loss of dataset. Therefore, we replace the NaN values with the value with highest frequency in th column, since that value has the highest probability.

In [4]:
for column in attributes:
    data[column].fillna(data[column].mode()[0], inplace=True)

The columns <code> INATTENTIONIND </code> and <code> UNDERINFL </code>contained mixed values such as 'Y','N',1,0. Therefore we convert the data into numerical form.

In [5]:
data['INATTENTIONIND'].replace('Y',1,inplace=True)
data['INATTENTIONIND'].replace('N',0,inplace=True)
data.INATTENTIONIND.astype('int64')
data['UNDERINFL'].replace('Y',1,inplace=True)
data['UNDERINFL'].replace('N',0,inplace=True)
data.UNDERINFL.astype('int64');


Using the **One-Hot-Encoding** for the weather followed by dropping the weather column to obtain additonal columns with different types of weathers. The column value for a particular weather is 1 or 0 depending if the weather condition was true at the time of that particular accident.

In [6]:
weathers=pd.get_dummies(data.WEATHER)
data = pd.concat([data,weathers],axis=1)

In [7]:
data.drop(columns='WEATHER', inplace=True)

In [8]:
data.rename(columns={'Fog/Smog/Smoke': 'Smoke','Sleet/Hail/Freezing Rain':'Hail', 'Unknown': 'Unpredictable Weather', 'Blowing Sand/Dirt': 'Sandy'}, inplace=True)

In [9]:
data.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,VEHCOUNT,ROADCOND,LIGHTCOND,INATTENTIONIND,UNDERINFL,Sandy,Clear,Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Hail,Snowing,Unpredictable Weather
0,2,2,2,Wet,Daylight,1,0,0,0,0,0,1,0,0,0,0,0,0
1,1,2,2,Wet,Dark - Street Lights On,1,0,0,0,0,0,0,0,1,0,0,0,0
2,1,4,3,Dry,Daylight,1,0,0,0,0,0,1,0,0,0,0,0,0
3,1,3,3,Dry,Daylight,1,0,0,1,0,0,0,0,0,0,0,0,0
4,2,2,2,Wet,Daylight,1,0,0,0,0,0,0,0,1,0,0,0,0


Performing the **One-Hot-Encoding** for <code>ROADCOND</code> and <code> LIGHTCOND</code> as well.

In [10]:
lighting = pd.get_dummies(data.LIGHTCOND)
data = pd.concat([data,lighting],axis=1)

In [11]:
data.drop('LIGHTCOND',axis=1,inplace=True)
data.rename(columns={'Unknown': 'Unknown Lighting'}, inplace=True)

In [12]:
roads = pd.get_dummies(data.ROADCOND, prefix='roadcond_')
data = pd.concat([data,roads],axis=1)

In [13]:
data.drop('ROADCOND',axis=1,inplace=True)
data.head()

Unnamed: 0,SEVERITYCODE,PERSONCOUNT,VEHCOUNT,INATTENTIONIND,UNDERINFL,Sandy,Clear,Smoke,Other,Overcast,Partly Cloudy,Raining,Severe Crosswind,Hail,Snowing,Unpredictable Weather,Dark - No Street Lights,Dark - Street Lights Off,Dark - Street Lights On,Dark - Unknown Lighting,Dawn,Daylight,Dusk,Other.1,Unknown Lighting,roadcond__Dry,roadcond__Ice,roadcond__Oil,roadcond__Other,roadcond__Sand/Mud/Dirt,roadcond__Snow/Slush,roadcond__Standing Water,roadcond__Unknown,roadcond__Wet
0,2,2,2,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
1,1,2,2,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1,4,3,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
3,1,3,3,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
4,2,2,2,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1


The data has been cleaned and is now ready to be processed. We know form our X and y sets for processing the data.
The 'y' set consists of the severity codes while the 'X' column consists of the features.

In [14]:
data.shape

(194673, 34)

In [15]:
y = np.asarray(data['SEVERITYCODE'])
y[0:5]

array([2, 1, 1, 1, 2])

In [16]:
X = data.drop('SEVERITYCODE', axis=1)
X = np.asarray(X[X.columns.to_list()])
X = X.astype(int)

Normalizing the dataset:

In [17]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

array([[-0.33020207,  0.12553783,  0.        , -0.2217116 , -0.01696304,
        -1.21707436, -0.05414257, -0.06551471, -0.03476509,  2.45445634,
        -0.00506801, -0.45298634, -0.011333  , -0.02409974, -0.06841713,
        -0.28988624, -0.08920831, -0.07872239, -0.576075  , -0.00751719,
        -0.1141037 ,  0.77768637, -0.17682024, -0.06551471, -0.03476509,
        -0.27267986, -1.4099744 , -0.07905204, -0.01813462, -0.02604842,
        -0.01963186, -0.07200071, -0.02431221, -0.28975087,  1.76085874],
       [-0.33020207,  0.12553783,  0.        , -0.2217116 , -0.01696304,
        -1.21707436, -0.05414257, -0.06551471, -0.03476509, -0.4074222 ,
        -0.00506801,  2.2075721 , -0.011333  , -0.02409974, -0.06841713,
        -0.28988624, -0.08920831, -0.07872239,  1.73588509, -0.00751719,
        -0.1141037 , -1.2858654 , -0.17682024, -0.06551471, -0.03476509,
        -0.27267986, -1.4099744 , -0.07905204, -0.01813462, -0.02604842,
        -0.01963186, -0.07200071, -0.02431221, -0.

## Train/Test dataset

Now we create a Train/Test split to emulate out-of-sample testing to ensure the performance of our model.

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)

Train set: (155738, 35) (155738,)
Test set: (38935, 35) (38935,)
