##### Result
The seven algorithms were trained on the original class distribution (baseline model), on undersampled data, and on SMOTE-oversampled data. With SMOTE oversampling, the accuracy and F1 score of the models are higher than with the original class distribution or with undersampling, for every algorithm except Naive Bayes.
Random Forest achieved the highest scores of all the algorithms, so we apply hyperparameter tuning to check whether its score can be improved further.

In [1]:
##Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt 
import plotly.offline as py
from sklearn.model_selection import cross_val_score, train_test_split,GridSearchCV, KFold
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,roc_auc_score,roc_curve,f1_score,recall_score,precision_score
from sklearn.exceptions import ConvergenceWarning
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from collections import Counter
import time
from plotly import tools
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

In [2]:
## For ignoring warnings to view clean output
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [3]:
## Importing the dataset
data = pd.read_csv('Data/cleaned.csv',header=0)

In [4]:
### Separating Independent and Dependent features
X = data.iloc[:, :-1]
y = data.iloc[:, -1]  # Accident_severity, the last column

### Data Transformation
#### Handling Categorical Variables - Creating Dummy Variables

In [5]:
# Show the number of distinct categories in each column
for col in data.columns:
    print(col, ':', len(data[col].unique()), 'categories')

Age_band_of_driver : 5 categories
Sex_of_driver : 3 categories
Educational_level : 7 categories
Vehicle_driver_relation : 4 categories
Driving_experience : 8 categories
Lanes_or_Medians : 7 categories
Types_of_Junction : 8 categories
Road_surface_type : 6 categories
Light_conditions : 4 categories
Weather_conditions : 9 categories
Type_of_collision : 10 categories
Vehicle_movement : 13 categories
Pedestrian_movement : 9 categories
Cause_of_accident : 20 categories
Accident_severity : 3 categories


In [5]:
pd.get_dummies(data,drop_first=True).shape

(12316, 100)

In [6]:
X = pd.get_dummies(X, drop_first=True)

In [7]:
X.shape

(12316, 99)
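
The 99 columns follow from the category counts printed earlier: with `drop_first=True`, a feature with k categories contributes k - 1 dummy columns. The sketch below is a quick arithmetic check, assuming the counts are copied by hand from that output; the dictionary name `n_categories` is only illustrative.

```python
# Number of categories per independent feature, copied from the printed output
n_categories = {
    "Age_band_of_driver": 5,
    "Sex_of_driver": 3,
    "Educational_level": 7,
    "Vehicle_driver_relation": 4,
    "Driving_experience": 8,
    "Lanes_or_Medians": 7,
    "Types_of_Junction": 8,
    "Road_surface_type": 6,
    "Light_conditions": 4,
    "Weather_conditions": 9,
    "Type_of_collision": 10,
    "Vehicle_movement": 13,
    "Pedestrian_movement": 9,
    "Cause_of_accident": 20,
}

# drop_first=True keeps k - 1 indicator columns for a feature with k categories
n_dummy_columns = sum(k - 1 for k in n_categories.values())
print(n_dummy_columns)  # 99
```

The extra column in the `(12316, 100)` shape is consistent with the target `Accident_severity` being numeric already, so `get_dummies` leaves it as a single column.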

In [8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [9]:
print("Mean of the dataset: ", np.mean(X).round(8))
print("Standard deviation of the dataset: ", np.std(X).round(8))

Mean of the dataset:  0.0
Standard deviation of the dataset:  1.0
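
As a rough illustration of what `StandardScaler` does to each column, the sketch below standardizes a small made-up column by subtracting its mean and dividing by its population standard deviation. The values and the helper name `standardize` are assumptions for illustration only, not part of the notebook.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0), the estimator StandardScaler uses
    return [(x - mean) / std for x in column]


# Hypothetical 0/1 dummy column, similar to those produced by get_dummies
dummy_column = [0, 0, 1, 0, 1, 0, 0, 0]
z = standardize(dummy_column)

# After standardizing, the column has mean ~0 and standard deviation ~1
print(round(fmean(z), 8), round(pstdev(z), 8))
```

Since every column ends up with mean 0 and variance 1, the mean and standard deviation taken over the whole matrix are also 0 and 1, which matches the values printed by the cell.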


### Handling Class imbalance

In [10]:
### Checking for data imbalance 
y.value_counts()

2    10415
1     1743
0      158
Name: Accident_severity, dtype: int64

In [11]:
print('Slight Injury: ' + str(round(data['Accident_severity'].value_counts()[2] / len(data) * 100, 2)) + '%\nSerious Injury: ' + 
      str(round(data['Accident_severity'].value_counts()[1] / len(data) * 100, 2))  + '%\nFatal Injury: ' + 
      str(round(data['Accident_severity'].value_counts()[0] / len(data) * 100, 2)) + '%')

Slight Injury: 84.56%
Serious Injury: 14.15%
Fatal Injury: 1.28%


#### SMOTE Oversampling Techniques for handling imbalanced dataset

In [12]:
# Oversampling with SMOTE
sm = SMOTE(random_state=0)
# fit_resample replaces fit_sample, a deprecated alias that newer imbalanced-learn releases no longer provide
X_over, y_over = sm.fit_resample(X, y)
## train test split
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(X_over, y_over, test_size=0.2, random_state=42)
# setting 20% of the training data aside as a validation set
x_train_t, x_train_v, y_train_t, y_train_v = train_test_split(X_train_over, y_train_over, test_size=0.2, random_state=42)

In [13]:
# Print class frequencies 
pd.Series(y_over).value_counts()

2    10415
1    10415
0    10415
Name: Accident_severity, dtype: int64
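
To make the balanced counts less of a black box, here is a simplified sketch of the idea behind SMOTE: each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is not imbalanced-learn's implementation; the function name `smote_like`, the toy points, and the choice `k=2` are assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (values are made up)
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like(minority_points, n_new=6)
print(new_points)
```

Because synthetic points are interpolated from real ones, resampling before the train/test split lets rows that end up in the test set influence rows in the training set. A stricter evaluation would probably resample only the training split, so the scores reported here may be optimistic.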

In [14]:
# print the shapes of our training and test set 
print(X_train_over.shape)
print(X_test_over.shape)
print(y_train_over.shape)
print(y_test_over.shape)

(24996, 99)
(6249, 99)
(24996,)
(6249,)


In [15]:
y_test_over.value_counts()

1    2100
0    2085
2    2064
Name: Accident_severity, dtype: int64

#### Random Forest Without Hyperparameter Tuning

In [16]:
start_time = time.time()
RF_model = RandomForestClassifier()
# feeding the training data into the model
RF_model.fit(X_train_over, y_train_over)
print("Execution time: " + str((time.time() - start_time)) + ' sec')

Execution time: 5.987990617752075 sec


In [17]:
# predicting the labels for the test set
y_pred = RF_model.predict(X_test_over)
# computing macro-averaged recall, precision and F1 score on the test set
name = ['Fatal','Serious','Slight']
RF_r=recall_score(y_test_over,y_pred, average='macro')
RF_p=precision_score(y_test_over,y_pred, average='macro')
RF_f=f1_score(y_test_over,y_pred, average='macro')
print("Confusion Matrix: - \n",confusion_matrix(y_test_over, y_pred))
print()
print("Classification Report: - \n",classification_report(y_test_over, y_pred,target_names=name))
print("Recall:", RF_r)
print("Precision:", RF_p)
print("F1 score:", RF_f)

Confusion Matrix: - 
 [[2057    2   26]
 [   6 1727  367]
 [   1   42 2021]]

Classification Report: - 
               precision    recall  f1-score   support

       Fatal       1.00      0.99      0.99      2085
     Serious       0.98      0.82      0.89      2100
      Slight       0.84      0.98      0.90      2064

    accuracy                           0.93      6249
   macro avg       0.94      0.93      0.93      6249
weighted avg       0.94      0.93      0.93      6249

Recall: 0.9293727874842982
Precision: 0.9363211584115742
F1 score: 0.9288250783345221


## Hyperparameter Tuning

#### Model Evaluation : F1 Macro
**In this problem domain all classes should be treated equally, so the macro F1-score is used: it gives the same importance to each label/class.
It will be low for models that perform well only on the common classes while performing poorly on the rare classes.**
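
As a small worked example of macro averaging, the sketch below recomputes the macro precision, recall and F1 of the default Random Forest from its confusion matrix reported earlier. The helper name `macro_scores` is illustrative, and the sketch assumes every class has at least one true and one predicted sample (no zero-division handling).

```python
def macro_scores(cm):
    """Macro precision, recall and F1 from a confusion matrix whose
    rows are true classes and whose columns are predicted classes."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[row][i] for row in range(n))  # column sum
        actual = sum(cm[i])                              # row sum
        p = tp / predicted
        r = tp / actual
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r))
    # Macro averaging: unweighted mean over classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Confusion matrix of the default Random Forest (Fatal, Serious, Slight)
cm_default = [[2057, 2, 26],
              [6, 1727, 367],
              [1, 42, 2021]]
precision, recall, f1 = macro_scores(cm_default)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9363 0.9294 0.9288
```

Each class contributes one third of the final score regardless of its support, which is why a weak "Serious" recall pulls the macro averages down noticeably.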

###  Random Forest

In [18]:
start_time = time.time()
#Setting values for the parameters
n_estimators = [100, 300, 500, 800, 1000]
criterion = ['gini','entropy']
max_depth = [5, 10, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]
max_features = ['auto','log2','sqrt']  # for classifiers 'auto' is an alias of 'sqrt'; it was removed in scikit-learn 1.3

#Creating a dictionary for the hyper parameters
parameters = dict(n_estimators = n_estimators, criterion =criterion, max_depth = max_depth, 
              min_samples_split = min_samples_split, min_samples_leaf = min_samples_leaf, max_features = max_features)
cv = KFold(n_splits=5, shuffle=False)  # random_state only applies when shuffle=True; newer scikit-learn raises an error otherwise
RF_model = RandomForestClassifier()
grid_RF_model = GridSearchCV(RF_model, parameters, cv = cv, scoring='f1_macro', verbose = 1, n_jobs = -1)

# feeding the training data into the model
best_RF= grid_RF_model.fit(X_train_over, y_train_over)
print("Execution time: " + str((time.time() - start_time)) + ' sec')

Fitting 5 folds for each of 3000 candidates, totalling 15000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 15.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 35.4min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 58.4min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 90.8min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 147.3min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 214.5min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 309.1min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 418.3min
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed: 562.9min
[Parallel(n_jobs=-1)]: Done 6042 tasks      | elapsed: 718.6min
[Parallel(n_jobs=-1)]: Done 7192 tasks      | elapsed: 884.8min
[Parallel(n_jobs=-1)]: Done 8442 tasks      | elapsed: 1006.2min
[Parallel(n_jobs=-1)]: Done 9792 tasks      | elapsed: 1157.7min
[Parallel(n_jobs=-1)]: Done 11242 t

Execution time: 149796.46750068665 sec
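
The candidate count in the log follows from the size of the grid: an exhaustive search fits every combination of the listed values once per fold. A quick check, with the grid copied from the tuning cell (the names `grid` and `n_folds` are illustrative):

```python
from math import prod

# Grid values as defined in the tuning cell
grid = {
    "n_estimators": [100, 300, 500, 800, 1000],
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 15, 25, 30],
    "min_samples_split": [2, 5, 10, 15, 100],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["auto", "log2", "sqrt"],
}
n_folds = 5

# Every combination of values is one candidate; each is fitted once per fold
n_candidates = prod(len(values) for values in grid.values())
n_fits = n_candidates * n_folds
hours = 149796.46750068665 / 3600  # execution time reported by the cell
print(n_candidates, n_fits, round(hours, 1))  # 3000 15000 41.6
```

Because the count is a product, each extra value added to any one parameter multiplies the total run time, which is why this search took well over a day on 4 workers.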


In [19]:
#Printing the best hyperparameters
print('The best hyper parameters are:\n',grid_RF_model.best_params_)

The best hyper parameters are:
 {'criterion': 'gini', 'max_depth': 30, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 500}


In [21]:
#Fitting the random forest model with the best hyper parameters obtained through GridSearchCV
start_time = time.time()
RF_model = RandomForestClassifier(criterion ='gini',max_depth=30,max_features='log2',min_samples_leaf=1, min_samples_split=5, n_estimators=500)
RF_model.fit(X_train_over,y_train_over)
y_pred_h = RF_model.predict(X_test_over)
print("Execution time: " + str((time.time() - start_time)) + ' sec')

Execution time: 22.996673345565796 sec


In [22]:
## print the evaluation metrics results
name = ['Fatal','Serious','Slight']
RF_r=recall_score(y_test_over,y_pred_h, average='macro')
RF_p=precision_score(y_test_over,y_pred_h, average='macro')
RF_f=f1_score(y_test_over,y_pred_h, average='macro')
print("Confusion Matrix: - \n",confusion_matrix(y_test_over, y_pred_h))
print()
print("Classification Report: - \n",classification_report(y_test_over, y_pred_h,target_names=name))
print("Recall:", RF_r)
print("Precision:", RF_p)
print("F1 score:", RF_f)

Confusion Matrix: - 
 [[2058    1   26]
 [   7 1718  375]
 [   0   14 2050]]

Classification Report: - 
               precision    recall  f1-score   support

       Fatal       1.00      0.99      0.99      2085
     Serious       0.99      0.82      0.90      2100
      Slight       0.84      0.99      0.91      2064

    accuracy                           0.93      6249
   macro avg       0.94      0.93      0.93      6249
weighted avg       0.94      0.93      0.93      6249

Recall: 0.9327875506903448
Precision: 0.9414493225566417
F1 score: 0.9321057229894095


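##### Observations & Conclusions:
After hyperparameter tuning, the performance of the Random Forest model on the test set improved slightly compared with the default hyperparameters: macro F1 rose from 0.9288 to 0.9321, macro recall from 0.9294 to 0.9328, and macro precision from 0.9363 to 0.9414. The grid search that produced this gain took about 149,796 seconds (roughly 41.6 hours).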