# Capstone Project - Car accident severity 

## Introduction/Business Understanding:
In an effort to reduce the frequency of car collisions in a community, an algorithm must be developed to predict the severity of an accident given the current weather, road and visibility conditions. When conditions are bad, this model will alert drivers to remind them to be more careful.
    
In most cases, inattentive driving and driving under the influence of drugs or alcohol are the main causes of accidents, and these can be reduced by enacting stricter regulations. Beyond these, weather, visibility, and road conditions are major contributing factors. Their effect can be mitigated by revealing hidden patterns in the data and issuing warnings to local government, police, and drivers at the affected locations.

## Data 
Import the primary modules. First we import two key data analysis modules, pandas and NumPy, along with scikit-learn's `resample` utility, which is used later to balance the classes.

In [3]:
import numpy as np  # numerical computing library
import pandas as pd # primary data structure library
from sklearn.utils import resample  # used below to downsample the majority class

### Data Understanding
In this assignment we will be using the shared data file [Data-Collisions.csv](https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv).

Download the dataset and read it into a pandas dataframe.

In [4]:
df_collision = pd.read_csv("https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv")



Let's take a look at the overview of our dataset.

In [5]:
df_collision.describe()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,INTKEY,SEVERITYCODE.1,PERSONCOUNT,PEDCOUNT,PEDCYLCOUNT,VEHCOUNT,SDOT_COLCODE,SDOTCOLNUM,SEGLANEKEY,CROSSWALKKEY
count,194673.0,189339.0,189339.0,194673.0,194673.0,194673.0,65070.0,194673.0,194673.0,194673.0,194673.0,194673.0,194673.0,114936.0,194673.0,194673.0
mean,1.298901,-122.330518,47.619543,108479.36493,141091.45635,141298.811381,37558.450576,1.298901,2.444427,0.037139,0.028391,1.92078,13.867768,7972521.0,269.401114,9782.452
std,0.457778,0.029976,0.056157,62649.722558,86634.402737,86986.54211,51745.990273,0.457778,1.345929,0.19815,0.167413,0.631047,6.868755,2553533.0,3315.776055,72269.26
min,1.0,-122.419091,47.495573,1.0,1001.0,1001.0,23807.0,1.0,0.0,0.0,0.0,0.0,0.0,1007024.0,0.0,0.0
25%,1.0,-122.348673,47.575956,54267.0,70383.0,70383.0,28667.0,1.0,2.0,0.0,0.0,2.0,11.0,6040015.0,0.0,0.0
50%,1.0,-122.330224,47.615369,106912.0,123363.0,123363.0,29973.0,1.0,2.0,0.0,0.0,2.0,13.0,8023022.0,0.0,0.0
75%,2.0,-122.311937,47.663664,162272.0,203319.0,203459.0,33973.0,2.0,3.0,0.0,0.0,2.0,14.0,10155010.0,0.0,0.0
max,2.0,-122.238949,47.734142,219547.0,331454.0,332954.0,757580.0,2.0,81.0,6.0,2.0,12.0,69.0,13072020.0,525241.0,5239700.0


Let's take a look at the first five items in our dataset.

In [6]:
df_collision.head()

Unnamed: 0,SEVERITYCODE,X,Y,OBJECTID,INCKEY,COLDETKEY,REPORTNO,STATUS,ADDRTYPE,INTKEY,...,ROADCOND,LIGHTCOND,PEDROWNOTGRNT,SDOTCOLNUM,SPEEDING,ST_COLCODE,ST_COLDESC,SEGLANEKEY,CROSSWALKKEY,HITPARKEDCAR
0,2,-122.323148,47.70314,1,1307,1307,3502005,Matched,Intersection,37475.0,...,Wet,Daylight,,,,10,Entering at angle,0,0,N
1,1,-122.347294,47.647172,2,52200,52200,2607959,Matched,Block,,...,Wet,Dark - Street Lights On,,6354039.0,,11,From same direction - both going straight - bo...,0,0,N
2,1,-122.33454,47.607871,3,26700,26700,1482393,Matched,Block,,...,Dry,Daylight,,4323031.0,,32,One parked--one moving,0,0,N
3,1,-122.334803,47.604803,4,1144,1144,3503937,Matched,Block,,...,Dry,Daylight,,,,23,From same direction - all others,0,0,N
4,2,-122.306426,47.545739,5,17700,17700,1807429,Matched,Intersection,34387.0,...,Wet,Daylight,,4028032.0,,10,Entering at angle,0,0,N


Our target variable will be 'SEVERITYCODE' because it measures the severity of an accident. The attributes used to weigh the severity of an accident are 'WEATHER', 'ROADCOND' and 'LIGHTCOND'.

The dependent variable, 'SEVERITYCODE', contains numbers from 0 to 4 that correspond to different levels of accident severity.

Severity codes are as follows:

- 0: Little to no Probability (Clear Conditions)
- 1: Very Low Probability (Chance of Property Damage)
- 2: Low Probability (Chance of Injury)
- 3: Mild Probability (Chance of Serious Injury)
- 4: High Probability (Chance of Fatality)

Furthermore, because some records contain null values, the data needs to be preprocessed before any further analysis.

### Data Preprocessing
The dataset in the original form is not ready for data analysis. In order to prepare the data, first, we need to drop the non-relevant columns. Also, most of the features are of type object, when they should be numerical type.

After analyzing the dataset, I decided to focus on four columns only: severity, weather conditions, road conditions, and light conditions.

To get a better understanding of the dataset, let's check the distinct values in these features.

In [7]:
df_collision["SEVERITYCODE"].value_counts()

1    136485
2     58188
Name: SEVERITYCODE, dtype: int64

The results show that the target feature is imbalanced, so we use a simple statistical technique to balance it: downsampling the majority class.

In [8]:
df_col_max = df_collision[df_collision.SEVERITYCODE == 1]
df_col_min = df_collision[df_collision.SEVERITYCODE == 2]

df_coll_downsampl = resample(df_col_max,
                             replace = False,
                             n_samples = 58188,
                             random_state = 100
                            )
df_balanced = pd.concat([df_coll_downsampl, df_col_min])
df_balanced.SEVERITYCODE.value_counts()

2    58188
1    58188
Name: SEVERITYCODE, dtype: int64
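
The downsampling step above can also be sketched without any library support. The following is a minimal illustration on made-up labels; the helper name `downsample_majority` and the toy records are assumptions for this sketch, not part of the notebook. It derives the target size from the smallest class instead of hard-coding it.

```python
import random
from collections import Counter

def downsample_majority(rows, label_of, seed=100):
    """Return rows with every class randomly reduced to the smallest class size.

    Hypothetical helper for illustration; not part of the notebook.
    """
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        # Sample without replacement, in the spirit of resample(..., replace=False).
        balanced.extend(rng.sample(group, target))
    return balanced

# Toy data: 7 records of class 1 and 3 records of class 2.
toy = [(i, 1) for i in range(7)] + [(i, 2) for i in range(7, 10)]
balanced = downsample_majority(toy, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))
```

Each class ends up with three records here, because the smallest class has three.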

We will be concentrating on the features 'WEATHER', 'ROADCOND' and 'LIGHTCOND', so we convert them to integer codes for analysis.

#### Convert String data to integers

In [31]:
df_balanced["WEATHER"].value_counts()

Clear                       67909
Raining                     20580
Overcast                    16884
Unknown                      6868
Snowing                       484
Other                         426
Fog/Smog/Smoke                339
Sleet/Hail/Freezing Rain       65
Blowing Sand/Dirt              37
Severe Crosswind               12
Partly Cloudy                   4
Name: WEATHER, dtype: int64

In [32]:
df_balanced["WEATHER"].head()

117913        NaN
35013     Raining
193049      Clear
159701    Unknown
70109     Raining
Name: WEATHER, dtype: object

In [57]:
# Map each weather category to an integer code (ordered by frequency)
weather_dict = {'Clear': 1,'Raining': 2,'Overcast': 3,'Unknown': 4,'Snowing': 5,'Other': 6,'Fog/Smog/Smoke': 7,'Sleet/Hail/Freezing Rain': 8,'Blowing Sand/Dirt': 9,'Severe Crosswind': 10,'Partly Cloudy': 11}
# Drop rows with a missing WEATHER value, then apply the mapping
df_balanced.dropna(subset = ['WEATHER'], inplace=True)
df_balanced['WEATHER_INT'] = df_balanced['WEATHER'].map(weather_dict)

In [58]:
df_balanced[['WEATHER','WEATHER_INT']].head()

Unnamed: 0,WEATHER,WEATHER_INT
35013,Raining,2
193049,Clear,1
159701,Unknown,4
70109,Raining,2
149064,Clear,1


In [59]:
df_balanced["ROADCOND"].value_counts()

Dry               75977
Wet               29336
Unknown            6854
Ice                 677
Snow/Slush          518
Other                77
Standing Water       60
Sand/Mud/Dirt        42
Oil                  40
Name: ROADCOND, dtype: int64

In [62]:
# Map each road condition to an integer code (ordered by frequency)
road_dict = {'Dry': 1,'Wet': 2,'Unknown': 3,'Ice': 4,'Snow/Slush': 5,'Other': 6,'Standing Water': 7,'Sand/Mud/Dirt': 8,'Oil': 9}

# Drop rows with a missing ROADCOND value, then apply the mapping
df_balanced.dropna(subset = ['ROADCOND'], inplace=True)
df_balanced['ROADCOND_INT'] = df_balanced['ROADCOND'].map(road_dict)

In [63]:
df_balanced[['ROADCOND','ROADCOND_INT']].head()

Unnamed: 0,ROADCOND,ROADCOND_INT
35013,Wet,2
193049,Dry,1
159701,Unknown,3
70109,Unknown,3
149064,Dry,1


In [64]:
df_balanced["LIGHTCOND"].value_counts()

Daylight                    71615
Dark - Street Lights On     28958
Unknown                      6051
Dusk                         3619
Dawn                         1530
Dark - No Street Lights       859
Dark - Street Lights Off      705
Other                         132
Dark - Unknown Lighting         8
Name: LIGHTCOND, dtype: int64

In [67]:
# Map each light condition to an integer code.
# 'Other' maps to 9; code 6 is not assigned to any category.
light_dict = {'Daylight': 1,'Dark - Street Lights On': 2,'Unknown': 3,'Dusk': 4,'Dawn': 5,'Dark - No Street Lights': 7,'Dark - Street Lights Off': 8,'Other': 9,'Dark - Unknown Lighting': 10}

# Drop rows with a missing LIGHTCOND value, then apply the mapping
df_balanced.dropna(subset = ['LIGHTCOND'], inplace=True)
df_balanced['LIGHTCOND_INT'] = df_balanced['LIGHTCOND'].map(light_dict)

In [68]:
df_balanced[['LIGHTCOND','LIGHTCOND_INT']].head()

Unnamed: 0,LIGHTCOND,LIGHTCOND_INT
35013,Dark - Street Lights On,2
193049,Dark - Street Lights On,2
159701,Unknown,3
70109,Daylight,1
149064,Daylight,1
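
The hand-written dictionaries above follow the frequency order shown by `value_counts()`. As a hedged alternative, such a mapping could be generated from the observed values, which avoids typing mistakes such as a repeated key. The helper name `rank_encoding` and the sample values below are assumptions made for this sketch.

```python
from collections import Counter

def rank_encoding(values):
    """Map each distinct category to 1, 2, 3, ... from most to least frequent.

    Hypothetical helper for illustration; missing values (None) are skipped.
    """
    counts = Counter(v for v in values if v is not None)
    return {cat: rank for rank, (cat, _) in enumerate(counts.most_common(), start=1)}

# Made-up sample, not taken from the collision data.
sample = ["Dry", "Wet", "Dry", None, "Ice", "Dry", "Wet"]
road_codes = rank_encoding(sample)
encoded = [road_codes.get(v) for v in sample]
print(road_codes)   # categories ordered by frequency
print(encoded)      # a missing value stays None instead of receiving a code
```

A generated mapping would change the codes if the category frequencies change, so it should be built once and then reused for any new data.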


### Methodology
To implement the solution, we will use GitHub as a repository and Jupyter Notebook to preprocess the data and build the machine learning models.

We will use the following models:

#### K-Nearest Neighbor (KNN)
KNN predicts the severity code of a data point by finding the k most similar data points (its nearest neighbors) in the training set and assigning the class that is most common among them.
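
To make the idea concrete, here is a minimal sketch of a nearest-neighbor vote in plain Python. It is an illustration only: the function name `knn_predict` and the sample points are made up, and the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up encoded conditions (weather, road, light) and severity codes.
points = [(1, 1, 1), (1, 1, 2), (2, 2, 2), (2, 2, 1), (5, 4, 2)]
labels = [1, 1, 2, 2, 2]
print(knn_predict(points, labels, query=(2, 2, 2), k=3))
```

With k = 3 the three closest points to the query carry the labels 2, 2 and 1, so the vote returns class 2.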

#### Decision Tree
A decision tree model gives us a layout of all possible outcomes so we can fully analyze the consequences of a decision. In this context, the decision tree observes all possible outcomes of different weather conditions.
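
The tree built later in this notebook uses the entropy criterion to choose its splits. The sketch below shows, under made-up data, how entropy and the information gain of one candidate split can be computed; the function names and the "wet road" split are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return entropy(parent) - (weight_left * entropy(left) + weight_right * entropy(right))

# Made-up severity codes for eight collisions, split by a hypothetical "wet road" test.
parent = [1, 1, 1, 1, 2, 2, 2, 2]
dry, wet = [1, 1, 1, 2], [1, 2, 2, 2]
print(entropy(parent))
print(information_gain(parent, dry, wet))
```

A perfectly balanced parent has an entropy of 1 bit; a split is useful to the extent that the child nodes are purer than the parent.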

#### Logistic Regression
Because our dataset only provides two severity code outcomes, our model will only predict one of those two classes. This makes the target binary, which suits logistic regression well.
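
As a rough sketch of what a binary logistic model does at prediction time, the snippet below turns a weighted sum of features into a probability with the sigmoid function and then applies a threshold. The coefficients are made up for illustration and are not the ones fitted in this notebook.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class (severity 2) under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Return severity code 2 when the probability reaches the threshold, else 1."""
    return 2 if predict_proba(features, weights, bias) >= threshold else 1

# Made-up coefficients for the scaled weather, road and light features.
weights, bias = [0.10, -0.05, 0.20], 0.0
print(predict_proba([0.0, 0.0, 0.0], weights, bias))
print(predict_class([1.0, 0.0, 1.0], weights, bias))
```

With all features at zero and no bias the model is undecided (probability 0.5); any positive weighted sum pushes the prediction toward class 2.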

## Define X and Y

In [49]:
from sklearn import preprocessing

In [69]:
Feature = np.asarray(df_balanced[['WEATHER_INT','ROADCOND_INT','LIGHTCOND_INT']])
X = Feature
X[0:5]

array([[2, 2, 2],
       [1, 1, 2],
       [4, 3, 3],
       [2, 3, 1],
       [1, 1, 1]])

In [70]:
y = df_balanced['SEVERITYCODE'].values
y[0:5]

array([1, 1, 1, 1, 1])

In [71]:
# Standardize each feature to zero mean and unit variance
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]



array([[ 0.26285166,  0.79462377,  0.34775004],
       [-0.67855782, -0.5918625 ,  0.34775004],
       [ 2.14567063,  2.18111004,  1.24008278],
       [ 0.26285166,  2.18111004, -0.54458271],
       [-0.67855782, -0.5918625 , -0.54458271]])
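
The scaling above follows z = (x - mean) / std, using the population standard deviation of each column. Below is a minimal sketch with the standard library on made-up codes; the function name `standardize` is an assumption for illustration. The second half reads two values from the printed output to recover the approximate mean and standard deviation of the WEATHER_INT column.

```python
import statistics

def standardize(column):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Made-up weather codes, not the real column.
codes = [1, 1, 2, 4]
scaled = standardize(codes)
print(scaled)

# Reading the notebook output backwards: codes 1 and 2 became -0.67855782 and
# 0.26285166, so one code step equals 1/std and the mean follows from either value.
std = 1 / (0.26285166 - (-0.67855782))
mean = 1 - (-0.67855782) * std
print(round(mean, 3), round(std, 3))
```

Because the integer codes are arbitrary labels rather than measurements, the scaled values carry the same arbitrary ordering; scaling only puts the three columns on a comparable range.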

## K-Nearest Neighbor (KNN)

In [72]:
#Importing Libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [73]:
x_train, x_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', x_train.shape,  y_train.shape)
print ('Test set:', x_test.shape,  y_test.shape)

Train set: (90781, 3) (90781,)
Test set: (22696, 3) (22696,)


In [77]:
# Check the best value of k (ties go to the larger k)
best_knn = 0
prev_score = 0
for k in range(1, 10):
    knn_model = KNeighborsClassifier(n_neighbors = k).fit(x_train, y_train)
    knn_yhat = knn_model.predict(x_test)
    score = accuracy_score(y_test, knn_yhat)  # compute once and reuse
    print("For K = {} accuracy = {}".format(k, score))
    if score >= prev_score:
        best_knn = k
        prev_score = score

For K = 1 accuracy = 0.5081512160733169
For K = 2 accuracy = 0.5087240042298202
For K = 3 accuracy = 0.514760310186817
For K = 4 accuracy = 0.511059217483257
For K = 5 accuracy = 0.510927035600987
For K = 6 accuracy = 0.511587945012337
For K = 7 accuracy = 0.5165667959111738
For K = 8 accuracy = 0.5159058864998237
For K = 9 accuracy = 0.5545470567500881


In [78]:
print ("KNN model is best for k = ", best_knn)

KNN model is best for k =  9


In [79]:
#Building the model with best value of K
best_knn_model = KNeighborsClassifier(n_neighbors = best_knn).fit(x_train, y_train)
best_knn_model

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=9, p=2,
           weights='uniform')

In [80]:
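# Evaluation Metrics
# jaccard score and f1 score
# Note: jaccard_similarity_score was removed in scikit-learn 0.23; for
# single-label targets like ours it is equivalent to accuracy_score.
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score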

print("Train set Accuracy (Jaccard): ", jaccard_similarity_score(y_train, best_knn_model.predict(x_train)))
print("Test set Accuracy (Jaccard): ", jaccard_similarity_score(y_test, best_knn_model.predict(x_test)))

print("Train set Accuracy (F1): ", f1_score(y_train, best_knn_model.predict(x_train), average='weighted'))
print("Test set Accuracy (F1): ", f1_score(y_test, best_knn_model.predict(x_test), average='weighted'))

Train set Accuracy (Jaccard):  0.5594342428481731
Test set Accuracy (Jaccard):  0.5545470567500881
Train set Accuracy (F1):  0.5008373281052447
Test set Accuracy (F1):  0.4943563821036535
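
For single-label predictions the legacy `jaccard_similarity_score` equals plain accuracy, which is why the test value above matches the k = 9 accuracy printed earlier. The sketch below computes accuracy and a support-weighted F1 in plain Python on made-up labels, to show how the two can diverge when a model favours one class; the function names are assumptions for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        n_pred = sum(p == cls for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n_true / len(y_true)
    return total

# Made-up labels: the predictions lean toward class 2.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 2, 2, 2, 2, 2, 2, 1]
print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this toy case the weighted F1 comes out below the accuracy because recall for class 1 is poor, the same direction of gap seen in the KNN scores above.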


## Decision Tree

In [81]:
# importing libraries
from sklearn.tree import DecisionTreeClassifier

In [82]:
# Find the best value for depth (ties go to the larger depth)
best_depth = 0
prev_dscore = 0

for d in range(1,10):
    dt = DecisionTreeClassifier(criterion = 'entropy', max_depth = d).fit(x_train, y_train)
    dt_yhat = dt.predict(x_test)
    dscore = accuracy_score(y_test, dt_yhat)  # compute once and reuse
    print("For depth = {}  the accuracy score is {} ".format(d, dscore))
    if dscore >= prev_dscore:
        best_depth = d
        prev_dscore = dscore

For depth = 1  the accuracy score is 0.5493479027141347 
For depth = 2  the accuracy score is 0.5493479027141347 
For depth = 3  the accuracy score is 0.5510662671836447 
For depth = 4  the accuracy score is 0.5582481494536482 
For depth = 5  the accuracy score is 0.5581600281988015 
For depth = 6  the accuracy score is 0.5574991187874515 
For depth = 7  the accuracy score is 0.5584684525907649 
For depth = 8  the accuracy score is 0.5587768769827283 
For depth = 9  the accuracy score is 0.559525907648925 


In [83]:
print ("Decision Tree model is best for depth d = ", best_depth)

Decision Tree model is best for depth d =  9


In [85]:
# Creating the best model for decision tree with best value of depth 

best_dt_model = DecisionTreeClassifier(criterion = 'entropy', max_depth = best_depth).fit(x_train, y_train)
best_dt_model

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=9,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [86]:
# Evaluation Metrics
# jaccard score and f1 score
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score

print("Train set Accuracy (Jaccard): ", jaccard_similarity_score(y_train, best_dt_model.predict(x_train)))
print("Test set Accuracy (Jaccard): ", jaccard_similarity_score(y_test, best_dt_model.predict(x_test)))

print("Train set Accuracy (F1): ", f1_score(y_train, best_dt_model.predict(x_train), average='weighted'))
print("Test set Accuracy (F1): ", f1_score(y_test, best_dt_model.predict(x_test), average='weighted'))

Train set Accuracy (Jaccard):  0.5641819323426708
Test set Accuracy (Jaccard):  0.559525907648925
Train set Accuracy (F1):  0.5303521501107132
Test set Accuracy (F1):  0.5264419670998483


## Logistic Regression

In [87]:
# importing libraries
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import log_loss

In [88]:
best_solver = ''
best_logloss = float('inf')

# Lower log loss is better, so keep the solver with the smallest value
for k in ('lbfgs', 'saga', 'liblinear', 'newton-cg', 'sag'):
    lr_model = LogisticRegression(C = 0.01, solver = k).fit(x_train, y_train)
    lr_yhat = lr_model.predict(x_test)
    y_prob = lr_model.predict_proba(x_test)
    logloss = log_loss(y_test, y_prob)
    print('When Solver is {}, logloss is : {}'.format(k, logloss))
    if logloss < best_logloss:
        best_solver = k
        best_logloss = logloss

When Solver is lbfgs, logloss is : 0.6830913862931347
When Solver is saga, logloss is : 0.6830914098436894
When Solver is liblinear, logloss is : 0.6830913271342262
When Solver is newton-cg, logloss is : 0.6830914111953412
When Solver is sag, logloss is : 0.683091376175836


In [89]:
print("Solver : '{}' has the best score {}".format(best_solver, best_logloss))

Solver : 'liblinear' has the best score 0.6830913271342262


In [90]:
# Best logistic regression model with best solver

best_lr_model = LogisticRegression(C = 0.01, solver = best_solver).fit(x_train, y_train)
best_lr_model

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [91]:
# Evaluation Metrics
# jaccard score and f1 score
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score

print("Train set Accuracy (Jaccard): ", jaccard_similarity_score(y_train, best_lr_model.predict(x_train)))
print("Test set Accuracy (Jaccard): ", jaccard_similarity_score(y_test, best_lr_model.predict(x_test)))

print("Train set Accuracy (F1): ", f1_score(y_train, best_lr_model.predict(x_train), average='weighted'))
print("Test set Accuracy (F1): ", f1_score(y_test, best_lr_model.predict(x_test), average='weighted'))

Train set Accuracy (Jaccard):  0.5316641147376654
Test set Accuracy (Jaccard):  0.5320761367641875
Train set Accuracy (F1):  0.5300991509225988
Test set Accuracy (F1):  0.5305240301885548


## Model Evaluation

In [92]:
# Jaccard

# KNN
knn_yhat = best_knn_model.predict(x_test)
jacc1 = round(jaccard_similarity_score(y_test, knn_yhat), 2)

# Decision Tree
dt_yhat = best_dt_model.predict(x_test)
jacc2 = round(jaccard_similarity_score(y_test, dt_yhat), 2)

# Logistic Regression
lr_yhat = best_lr_model.predict(x_test)
jacc3 = round(jaccard_similarity_score(y_test, lr_yhat), 2)

jss = [jacc1, jacc2, jacc3]
jss

[0.55, 0.56, 0.53]

In [93]:
# F1_score

# KNN
knn_yhat = best_knn_model.predict(x_test)
f1 = round(f1_score(y_test, knn_yhat, average = 'weighted'), 2)

# Decision Tree
dt_yhat = best_dt_model.predict(x_test)
f2 = round(f1_score(y_test, dt_yhat, average = 'weighted'), 2)

# Logistic Regression
lr_yhat = best_lr_model.predict(x_test)
f3 = round(f1_score(y_test, lr_yhat, average = 'weighted'), 2)

f1_list = [f1, f2, f3]
f1_list

[0.49, 0.53, 0.53]

In [96]:
# log loss

# Logistic Regression
lr_prob = best_lr_model.predict_proba(x_test)
ll_list = ['NA','NA', round(log_loss(y_test, lr_prob), 2)]
ll_list

['NA', 'NA', 0.68]
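
To put the log loss of 0.68 in context, it helps to compare it with the value obtained by always predicting a probability of 0.5, which is ln(2), roughly 0.693. The sketch below computes binary log loss in plain Python on made-up labels and probabilities; the function name `binary_log_loss` is an assumption for illustration.

```python
import math

def binary_log_loss(y_true, prob_positive):
    """Mean negative log-likelihood; y_true holds 0/1, prob_positive holds P(y = 1)."""
    losses = [
        -math.log(p) if y == 1 else -math.log(1.0 - p)
        for y, p in zip(y_true, prob_positive)
    ]
    return sum(losses) / len(losses)

# A model that always answers 0.5 scores ln(2) on any labels.
coin_flip = binary_log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(coin_flip)

# Made-up, slightly informative probabilities give a slightly lower loss.
slightly_better = binary_log_loss([1, 0, 1, 0], [0.55, 0.45, 0.55, 0.45])
print(slightly_better)
```

Lower values are better, so a log loss just under ln(2) indicates predicted probabilities that are only a little more informative than a coin flip.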

In [97]:
columns = ['KNN', 'Decision Tree', 'Logistic Regression']
index = ['Jaccard', 'F1-score', 'Logloss']

accuracy_df = pd.DataFrame([jss, f1_list, ll_list], index = index, columns = columns)
accuracy_df1 = accuracy_df.transpose()
accuracy_df1.columns.name = 'Algorithm'
accuracy_df1

Algorithm,Jaccard,F1-score,Logloss
KNN,0.55,0.49,
Decision Tree,0.56,0.53,
Logistic Regression,0.53,0.53,0.68


## Discussion
At the beginning of this notebook, we had categorical data of type 'object'. This is not a data type that we could feed to an algorithm, so each category was mapped to an integer code with a dictionary, creating new columns of a numerical data type.

After solving that issue we were presented with another: imbalanced data. As mentioned earlier, class 1 was more than twice as large as class 2 (136485 versus 58188 records). The solution was to downsample the majority class with sklearn's resample tool, matching the minority class exactly at 58188 records each.

Once we had analyzed and cleaned the data, it was fed through three ML models: K-Nearest Neighbor, Decision Tree and Logistic Regression. Although the first two are well suited to this project, logistic regression made the most sense because of its binary nature.

The evaluation metrics used to test the accuracy of our models were the Jaccard index, the F1 score and, for logistic regression, log loss. Trying different values of k, maximum tree depth and logistic regression solver helped us pick the best-performing configuration of each model.

## Conclusion
Based on the dataset provided for this capstone, in which weather, road, and light conditions point to certain classes, we can conclude that particular conditions have some impact on whether travel results in property damage (class 1) or injury (class 2). With test scores between roughly 0.49 and 0.56 on balanced classes, that impact appears modest.