# KNN 

## Introduction

In this notebook, we use the pre-processed accidents data to analyse the relationships between the underlying variables. In particular, we aim to assess how the features behave jointly with other features, to portray the predictive power of multiple classification models, and to assign importance to selected feature variables with respect to their effect on the variable of interest. To do so, we perform a short data analysis to visually inspect patterns, gauge the strength of these relationships by applying several classification algorithms, and perform a form of feature selection in which we assign feature importance through a range of selection techniques.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns
import statsmodels.api as sm
import sklearn

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error, confusion_matrix, classification_report
# plot_confusion_matrix is no longer imported: it was removed in scikit-learn 1.2 and is not used below

from scipy import stats
from scipy.stats import norm, bernoulli

## Data Analysis

In this part, we briefly analyse the behaviour of our feature variables to assess their distributions, observe how they co-occur with other variables and judge, through visual inspection, to what degree individual features might affect the classification. This first analysis gives preliminary results that we expect the subsequent prediction and selection models to capture.

### Model Parameters

As briefly introduced in the data preparation notebook, we obtain a dataset consisting of 16 feature variables. These are:

Accident Severity: A 3-level categorical variable (character string) indicating the severity of an accident, ranging from Slight over Serious to Fatal.

Light Conditions: A categorical variable covering 5 specific light conditions.

Weather Conditions: A categorical variable covering 8 specific weather conditions.

Road Surface Conditions: A categorical variable covering 5 specific surface conditions.

Hour of day: Previously recorded as the continuous variable "time of accident". This variable was transformed into hourly intervals to improve interpretability and policy predictions. As such, we defined 24 intervals (0 to 23), one for each hour of the day.

Day of Week: A categorical variable covering the 7 specific days of the week which are indicated as Monday to Sunday.

Sex of Driver: The gender of the driver causing the accident, given as male or female.

Age of Driver: A categorical variable covering the age of the driver in bins, mostly 5-10 years wide, ranging from under 18 to over 75 (<18, 18-20, 21-25, 26-35, and so on up to 66-75 and >75).

Speed Limit: The speed limit in force where the accident occurred.

Vehicle Type: A categorical variable previously given by multiple sub-categories for cars and motorcycles. These sub-categories were merged into either the major category cars or motorcycles.

Engine Capacity: The engine capacity of the vehicle causing the accident. Defined individually for cars and motorcycles. Both categories are assigned into a 4-level cluster.

Junction Detail: Covers details on the environment surrounding the accident, summarised into a 3-level categorical feature with the levels junction, open street and roundabout. It indicates in which street setting the accident took place.

Multiple Vehicles involved: A binary feature indicating 1 if more than one car was involved in the accident.

Month + Year: Features indicating in which month and which year the accident occurred.

Age of vehicle: A categorical feature indicating the age of the vehicle at the accident date. Binned into 4 levels, ranging from 0-1 to >10 years.

We will now perform a short data analysis with these features.

In [2]:
acc = pd.read_csv('UK_accidents_preprocessed.csv')
acc.count()

Accident_Index                1482372
Accident_Severity             1482372
Road_Class                    1482372
Speed_limit                   1482372
Junction_Detail               1482372
Light_Conditions              1482372
Weather_Conditions            1482372
Road_Surface_Conditions       1482372
Hour_of_Day                   1482372
Year                          1482372
Month                         1482372
Day_of_Week                   1482372
Multiple_Vehicles_involved    1482372
Urban_Area                    1482372
Vehicle_Type                  1482372
Sex_of_Driver                 1482372
Age_of_Driver                 1482372
Engine_Capacity_(CC)          1482372
Age_of_Vehicle                1482372
dtype: int64

In [3]:
acc.isna().sum()

Accident_Index                0
Accident_Severity             0
Road_Class                    0
Speed_limit                   0
Junction_Detail               0
Light_Conditions              0
Weather_Conditions            0
Road_Surface_Conditions       0
Hour_of_Day                   0
Year                          0
Month                         0
Day_of_Week                   0
Multiple_Vehicles_involved    0
Urban_Area                    0
Vehicle_Type                  0
Sex_of_Driver                 0
Age_of_Driver                 0
Engine_Capacity_(CC)          0
Age_of_Vehicle                0
dtype: int64

In [4]:
# First, the unordered categorical variables are transformed into integer codes starting at 1.
# pd.factorize numbers the levels in order of first appearance, so for Accident_Severity
# the intended coding (1 = slight, 3 = fatal) only holds if the levels first appear in that order.

# Re-assign any remaining NaN values and repair the truncated weekday label 'We'

acc['Age_of_Vehicle'] = acc['Age_of_Vehicle'].fillna('New (0-1)')
acc['Day_of_Week'] = acc['Day_of_Week'].replace('We', 'Wednesday')
acc['Age_of_Driver'] = acc['Age_of_Driver'].fillna('21-25')
acc['Engine_Capacity_(CC)'] = acc['Engine_Capacity_(CC)'].fillna('0-125cc')

var = ['Accident_Severity', 'Road_Class', 'Junction_Detail', 'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver']

for col in var:
    acc[col + '_Code'] = pd.factorize(acc[col])[0] + 1

# Then, we assign ordinal codes to the ordered variables through explicit mappings.
# astype(int) raises an error if a label is missing from its mapping.

day_codes = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4,
             'Friday': 5, 'Saturday': 6, 'Sunday': 7}
acc['Day_of_Week_Code'] = acc['Day_of_Week'].map(day_codes).astype(int)

# Next, we do the same for Age of Vehicle

vehicle_age_codes = {'New (0-1)': 1, '2-5': 2, '6-10': 3, '>10': 4}
acc['Age_of_Vehicle_Code'] = acc['Age_of_Vehicle'].map(vehicle_age_codes).astype(int)

# And for Age of Driver bins

driver_age_codes = {'<18': 1, '18-20': 2, '21-25': 3, '26-35': 4, '36-45': 5,
                    '46-55': 6, '56-65': 7, '66-75': 8, '>75': 9}
acc['Age_of_Driver_Code'] = acc['Age_of_Driver'].map(driver_age_codes).astype(int)

# As well as for Engine Capacity

engine_codes = {'0-125cc': 1, '126-350cc': 2, '351-600cc': 3, '601-1150cc': 4,
                '1151-1999cc': 5, '2000-2999cc': 6, '3000-3999cc': 7, '>4000cc': 8}
acc['Engine_Capacity_(CC)_Code'] = acc['Engine_Capacity_(CC)'].map(engine_codes).astype(int)
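
The `pd.factorize` step used for the unordered variables numbers the levels in order of first appearance, not by meaning. The toy series below (made-up values, not taken from the dataset) sketches how this differs from an explicit mapping; `severity_order` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Made-up severity labels; 'Serious' happens to appear first
severity = pd.Series(['Serious', 'Slight', 'Fatal', 'Slight'])

# factorize assigns codes by first appearance: Serious=0, Slight=1, Fatal=2
codes, levels = pd.factorize(severity)
factorized = codes + 1

# An explicit (hypothetical) mapping fixes the intended order regardless of row order
severity_order = {'Slight': 1, 'Serious': 2, 'Fatal': 3}
mapped = severity.map(severity_order)

print(factorized.tolist())
print(mapped.tolist())
```

If the numeric severity code is meant to be read as an ordered scale, the explicit mapping is the safer choice, since its result does not depend on the row order of the file.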


Now that the variables are encoded, we can look at their summary statistics.

In [5]:
acc.describe()

| | Speed_limit | Hour_of_Day | Year | Month | Multiple_Vehicles_involved | Urban_Area | Accident_Severity_Code | Road_Class_Code | Junction_Detail_Code | Light_Conditions_Code | Weather_Conditions_Code | Road_Surface_Conditions_Code | Vehicle_Type_Code | Sex_of_Driver_Code | Day_of_Week_Code | Age_of_Vehicle_Code | Age_of_Driver_Code | Engine_Capacity_(CC)_Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 | 1482372.0 |
| mean | 39.22839 | 13.62957 | 2011.485 | 6.617869 | 0.7076746 | 0.6281608 | 1.177874 | 1.960208 | 1.655555 | 1.674603 | 1.417309 | 1.362196 | 1.085856 | 1.666635 | 3.929747 | 2.906279 | 4.698292 | 4.82437 |
| std | 14.2968 | 5.19385 | 4.273592 | 3.444451 | 0.4548312 | 0.4832959 | 0.4175724 | 1.125635 | 0.6472355 | 1.177433 | 1.071775 | 0.57695 | 0.2801509 | 0.4714159 | 1.937186 | 0.9644327 | 1.889562 | 1.019235 |
| min | 10.0 | 0.0 | 2005.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 30.0 | 10.0 | 2008.0 | 4.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 | 5.0 |
| 50% | 30.0 | 14.0 | 2011.0 | 7.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 | 3.0 | 5.0 | 5.0 |
| 75% | 50.0 | 17.0 | 2015.0 | 10.0 | 1.0 | 1.0 | 1.0 | 3.0 | 2.0 | 3.0 | 1.0 | 2.0 | 1.0 | 2.0 | 6.0 | 4.0 | 6.0 | 5.0 |
| max | 70.0 | 23.0 | 2019.0 | 12.0 | 1.0 | 1.0 | 3.0 | 5.0 | 3.0 | 5.0 | 8.0 | 5.0 | 2.0 | 2.0 | 7.0 | 4.0 | 9.0 | 8.0 |


## KNN - Classification with subsample

In [6]:
# Creating a 40% subsample (592'949 datapoints)
acc_sub = acc.sample(frac=0.4, random_state=1)
acc_sub.count()

Accident_Index                  592949
Accident_Severity               592949
Road_Class                      592949
Speed_limit                     592949
Junction_Detail                 592949
Light_Conditions                592949
Weather_Conditions              592949
Road_Surface_Conditions         592949
Hour_of_Day                     592949
Year                            592949
Month                           592949
Day_of_Week                     592949
Multiple_Vehicles_involved      592949
Urban_Area                      592949
Vehicle_Type                    592949
Sex_of_Driver                   592949
Age_of_Driver                   592949
Engine_Capacity_(CC)            592949
Age_of_Vehicle                  592949
Accident_Severity_Code          592949
Road_Class_Code                 592949
Junction_Detail_Code            592949
Light_Conditions_Code           592949
Weather_Conditions_Code         592949
Road_Surface_Conditions_Code    592949
Vehicle_Type_Code        

In [7]:
# Get the features
features = acc_sub[acc_sub.columns[acc_sub.columns.isin(['Road_Class', 'Junction_Detail', 'Light_Conditions', 
                                             'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver', 
                                             'Day_of_Week', 'Engine_Capacity_(CC)', 'Age_of_Driver', 'Age_of_Vehicle',
                                             'Multiple_Vehicles_involved', 'Urban_Area', 'Month', 'Year', 'Hour_of_Day', 'Speed_limit'])]]
                                             
# Bring them into the desired format
features = features.astype(object)
features[['Multiple_Vehicles_involved','Urban_Area']] = features[['Multiple_Vehicles_involved','Urban_Area']].astype(int)

# Apply one hot encoding to get category-indicating variables for each categorical feature
X = pd.get_dummies(features, drop_first=False).to_numpy()

# Get the label in the right format
y = acc_sub['Accident_Severity'].astype(object).to_numpy()
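
The feature preparation in this cell is repeated for every dataset variant further down. One way to avoid the repetition is a small helper; the sketch below is an assumption about how it could look, not part of the original notebook. The name `build_xy`, its parameters and the toy data are hypothetical. Passing `dtype=int` to `pd.get_dummies` keeps the indicator columns numeric, since recent pandas versions otherwise return boolean columns.

```python
import pandas as pd

def build_xy(df, feature_cols, binary_cols, label_col):
    """One-hot encode feature_cols, leaving binary_cols as 0/1 integers."""
    # Cast everything to object so that numeric columns such as Year are one-hot encoded too
    features = df[feature_cols].astype(object)
    # Binary indicators go back to int so that get_dummies leaves them untouched
    features = features.astype({col: int for col in binary_cols})
    X = pd.get_dummies(features, drop_first=False, dtype=int)
    y = df[label_col].astype(object).to_numpy()
    return X, y

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Light_Conditions': ['Daylight', 'Darkness', 'Daylight'],
    'Year': [2005, 2006, 2005],
    'Urban_Area': [1, 0, 1],
    'Accident_Severity': ['Slight', 'Fatal', 'Slight'],
})
X_toy, y_toy = build_xy(toy, ['Light_Conditions', 'Year', 'Urban_Area'],
                        ['Urban_Area'], 'Accident_Severity')
print(X_toy.columns.tolist())
```

Returning the encoded features as a DataFrame rather than a NumPy array keeps the column names available, which helps when two separately encoded datasets have to be compared.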

In [8]:
# Apply a train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)
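
Because the severe classes are rare, a purely random split can shift the class shares between the training and the test set. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data with made-up class shares; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 84% Slight, 15% Serious, 1% Fatal (made-up shares)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(['Slight'] * 840 + ['Serious'] * 150 + ['Fatal'] * 10)

# stratify=y_demo keeps the class shares the same in both parts of the split
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

fatal_in_test = int(np.sum(yte == 'Fatal'))
print(len(yte), fatal_in_test)
```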

In [9]:
# Loop for different values for k
k_range = range(1, 6)

for k in k_range:
    clf = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)  # n_jobs=-1 runs the neighbour search in parallel on all available CPU cores
    clf.fit(Xtrain, ytrain)
    yhat_knn = clf.predict(Xtest)
    accuracy = np.mean(yhat_knn == ytest)
    print('Test accuracy with k=%.0f neighbors is %.4f' % (k, accuracy))
    print(confusion_matrix(ytest, yhat_knn))
    print(classification_report(ytest, yhat_knn))

Test accuracy with k=1 neighbors is 0.7382
[[   46   375  1199]
 [  327  3330 14266]
 [ 1127 13756 84164]]
              precision    recall  f1-score   support

       Fatal       0.03      0.03      0.03      1620
     Serious       0.19      0.19      0.19     17923
      Slight       0.84      0.85      0.85     99047

    accuracy                           0.74    118590
   macro avg       0.36      0.35      0.35    118590
weighted avg       0.73      0.74      0.74    118590

Test accuracy with k=2 neighbors is 0.6564
[[   88   590   942]
 [  657  5648 11618]
 [ 2215 24724 72108]]
              precision    recall  f1-score   support

       Fatal       0.03      0.05      0.04      1620
     Serious       0.18      0.32      0.23     17923
      Slight       0.85      0.73      0.78     99047

    accuracy                           0.66    118590
   macro avg       0.35      0.37      0.35    118590
weighted avg       0.74      0.66      0.69    118590

Test accuracy with k=3 n
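
The loop above compares values of k on a single test split using accuracy. A possible alternative, shown here only as a sketch on synthetic data, is to compare k by cross-validated balanced accuracy, which weights every class equally. The data from `make_classification` and the names `X_demo`, `y_demo` and `scores` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class problem with imbalanced classes (made-up data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=0)

# Mean balanced accuracy over 5 folds for each candidate k
scores = {}
for k in range(1, 6):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X_demo, y_demo, cv=5,
                                scoring='balanced_accuracy').mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```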

In [10]:
clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
clf.fit(Xtrain, ytrain)
yhat_knn = clf.predict(Xtest)
print('Confusion Matrix')
print('')
print(confusion_matrix(ytest, yhat_knn))
print('')
print('Classification Report')
print('')
print(classification_report(ytest, yhat_knn))

Confusion Matrix

[[   15   200  1405]
 [   85  1468 16370]
 [  159  3628 95260]]

Classification Report

              precision    recall  f1-score   support

       Fatal       0.06      0.01      0.02      1620
     Serious       0.28      0.08      0.13     17923
      Slight       0.84      0.96      0.90     99047

    accuracy                           0.82    118590
   macro avg       0.39      0.35      0.35    118590
weighted avg       0.75      0.82      0.77    118590
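
For context, the accuracy of the k=5 model can be compared with the share of the largest class, which is what a classifier that always predicts 'Slight' would achieve. The sketch below recomputes both numbers from the confusion matrix printed above (rows are true labels); no new data is introduced.

```python
import numpy as np

# Confusion matrix of the k=5 model on the subsample (rows: Fatal, Serious, Slight)
cm = np.array([[15, 200, 1405],
               [85, 1468, 16370],
               [159, 3628, 95260]])

# Share of correct predictions
accuracy = np.trace(cm) / cm.sum()

# Accuracy of always predicting the most frequent true class
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(round(accuracy, 4), round(majority_baseline, 4))
```

The overall accuracy (about 0.82) lies slightly below the majority-class share (about 0.84), which suggests that accuracy alone says little about how well the rare classes are detected.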



## KNN - Classification with a balanced dataset

In [11]:
# Create a balanced dataset with 20'000 observations of each severity category
acc_balanced = pd.concat([acc[acc['Accident_Severity'] == 'Slight'].sample(n=20000, random_state=1),
acc[acc['Accident_Severity'] == 'Serious'].sample(n=20000, random_state=1),
acc[acc['Accident_Severity'] == 'Fatal'].sample(n=20000, random_state=1)])

# Create a dataset consisting of the remaining datapoints that are not included in the balanced set
acc_sub_res = pd.concat([acc_sub, acc_balanced, acc_balanced]).drop_duplicates(keep=False)
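
The `concat` plus `drop_duplicates(keep=False)` idiom compares full row contents, so rows of `acc_sub` that happen to be identical to each other are dropped as well. A possible alternative is to select the complement by index label. The toy frames below are made up to show the difference; `full`, `sub` and `balanced` are hypothetical names.

```python
import pandas as pd

# Made-up data: rows 0 and 1 are identical in every column
full = pd.DataFrame({
    'Accident_Index': ['A1', 'A1', 'A2', 'A3', 'A4'],
    'Accident_Severity': ['Slight', 'Slight', 'Serious', 'Fatal', 'Slight'],
})
sub = full.loc[[0, 1, 2, 3]]
balanced = full.loc[[3, 4]]

# Idiom used above: also removes the two identical rows 0 and 1
rest_concat = pd.concat([sub, balanced, balanced]).drop_duplicates(keep=False)

# Index-based complement: keeps every row of sub whose label is not in balanced
rest_index = sub.loc[~sub.index.isin(balanced.index)]

print(rest_concat.index.tolist(), rest_index.index.tolist())
```

The index-based variant relies on both frames still carrying the index labels of the frame they were sampled from, which holds for the output of `DataFrame.sample`.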

In [12]:
# Get the features for the balanced dataset
features = acc_balanced[acc_balanced.columns[acc_balanced.columns.isin(['Road_Class', 'Junction_Detail', 'Light_Conditions', 
                                             'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver', 
                                             'Day_of_Week', 'Engine_Capacity_(CC)', 'Age_of_Driver', 'Age_of_Vehicle',
                                             'Multiple_Vehicles_involved', 'Urban_Area', 'Month', 'Year', 'Hour_of_Day', 'Speed_limit'])]]
                                             
# Bring them into the desired format
features = features.astype(object)
features[['Multiple_Vehicles_involved','Urban_Area']] = features[['Multiple_Vehicles_involved','Urban_Area']].astype(int)

# Apply one hot encoding to get category-indicating variables for each categorical feature
X = pd.get_dummies(features, drop_first=False).to_numpy()

# Get the label in the right format
y = acc_balanced['Accident_Severity'].astype(object).to_numpy()

In [13]:
# Get the features for the remaining dataset
features_res = acc_sub_res[acc_sub_res.columns[acc_sub_res.columns.isin(['Road_Class', 'Junction_Detail', 'Light_Conditions', 
                                             'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver', 
                                             'Day_of_Week', 'Engine_Capacity_(CC)', 'Age_of_Driver', 'Age_of_Vehicle',
                                             'Multiple_Vehicles_involved', 'Urban_Area', 'Month', 'Year', 'Hour_of_Day', 'Speed_limit'])]]
                                             
# Bring them into the desired format
features_res = features_res.astype(object)
features_res[['Multiple_Vehicles_involved','Urban_Area']] = features_res[['Multiple_Vehicles_involved','Urban_Area']].astype(int)

# Apply one hot encoding to get category-indicating variables for each categorical feature
X_res = pd.get_dummies(features_res, drop_first=False).to_numpy()

# Get the label in the right format
y_res = acc_sub_res['Accident_Severity'].astype(object).to_numpy()
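
The balanced set and the remaining set are one-hot encoded separately here. If a category level is missing from one of them, the two matrices end up with different columns, and concatenating them would misalign features or fail. A hedged sketch of one way to guard against this is to reindex the second frame to the columns of the first before calling `to_numpy()`. The `Weather` values below are made up.

```python
import pandas as pd

# Made-up example: 'Rain' only occurs in the first set, 'Snow' only in the second
train_raw = pd.DataFrame({'Weather': ['Fine', 'Rain', 'Fine']})
rest_raw = pd.DataFrame({'Weather': ['Fine', 'Snow']})

X_train_df = pd.get_dummies(train_raw, dtype=int)

# Align to the training columns: missing levels become all-zero columns,
# levels unseen in training are dropped
X_rest_df = pd.get_dummies(rest_raw, dtype=int).reindex(
    columns=X_train_df.columns, fill_value=0)

print(X_rest_df.columns.tolist())
print(X_rest_df.to_numpy().tolist())
```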

In [14]:
# Apply a train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

In [15]:
# Include the residual X values, that are not included in the balanced dataset, to the test dataset.
Xtest = np.concatenate([Xtest, X_res])

# Include the residual y values, that are not included in the balanced dataset, to the test dataset.
ytest = np.concatenate([ytest, y_res])

In [16]:
clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
clf.fit(Xtrain, ytrain)
yhat_knn = clf.predict(Xtest)
print('Confusion Matrix')
print('')
print(confusion_matrix(ytest, yhat_knn))
print('')
print('Classification Report')
print('')
print(classification_report(ytest, yhat_knn))

Confusion Matrix

[[  2684   1062    536]
 [ 33525  29466  21981]
 [144593 165855 181417]]

Classification Report

              precision    recall  f1-score   support

       Fatal       0.01      0.63      0.03      4282
     Serious       0.15      0.35      0.21     84972
      Slight       0.89      0.37      0.52    491865

    accuracy                           0.37    581119
   macro avg       0.35      0.45      0.25    581119
weighted avg       0.78      0.37      0.47    581119
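
The per-class recall and the macro average in this report can be recomputed directly from the confusion matrix above, where rows are true labels and columns are predictions. The sketch uses only the printed numbers.

```python
import numpy as np

# Confusion matrix of the k=5 model trained on the balanced set
# (rows and columns: Fatal, Serious, Slight)
cm = np.array([[2684, 1062, 536],
               [33525, 29466, 21981],
               [144593, 165855, 181417]])

# Recall: correct predictions divided by the number of true cases per class
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class
precision = np.diag(cm) / cm.sum(axis=0)

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = recall.mean()

print(recall.round(2), precision.round(2), round(balanced_accuracy, 4))
```

The mean recall of about 0.45 is higher than the 0.35 of the unbalanced k=5 model, while overall accuracy drops to 0.37. The two test sets differ, so this comparison is only indicative.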



## KNN - Classification (binary)

In [17]:
# Reclassify 'Fatal' and 'Serious' accidents to a new class 'Serious & Fatal' to make it a binary KNN
acc_sub.loc[acc_sub['Accident_Severity'].isin(['Fatal', 'Serious']), 'Accident_Severity'] = 'Serious & Fatal'
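
The relabelling above overwrites `acc_sub` in place, so re-running the earlier three-class cells afterwards would use the merged labels. A minimal sketch of a non-mutating variant derives the binary label as a separate array; the toy series and the name `y_binary` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up three-class labels
severity = pd.Series(['Slight', 'Fatal', 'Serious', 'Slight'])

# Everything that is not 'Slight' is merged into one class; severity stays unchanged
y_binary = np.where(severity == 'Slight', 'Slight', 'Serious & Fatal')

print(y_binary.tolist())
```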

In [18]:
# Get the features
features = acc_sub[acc_sub.columns[acc_sub.columns.isin(['Road_Class', 'Junction_Detail', 'Light_Conditions', 
                                             'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver', 
                                             'Day_of_Week', 'Engine_Capacity_(CC)', 'Age_of_Driver', 'Age_of_Vehicle',
                                             'Multiple_Vehicles_involved', 'Urban_Area', 'Month', 'Year', 'Hour_of_Day', 'Speed_limit'])]]
                                             
# Bring them into the desired format
features = features.astype(object)
features[['Multiple_Vehicles_involved','Urban_Area']] = features[['Multiple_Vehicles_involved','Urban_Area']].astype(int)

# Apply one hot encoding to get category-indicating variables for each categorical feature
X = pd.get_dummies(features, drop_first=False).to_numpy()

# Get the label in the right format
y = acc_sub['Accident_Severity'].astype(object).to_numpy()

In [19]:
# Apply a train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

In [20]:
clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
clf.fit(Xtrain, ytrain)
yhat_knn = clf.predict(Xtest)
print('Confusion Matrix')
print('')
print(confusion_matrix(ytest, yhat_knn))
print('')
print('Classification Report')
print('')
print(classification_report(ytest, yhat_knn))

Confusion Matrix

[[ 1768 17775]
 [ 3787 95260]]

Classification Report

                 precision    recall  f1-score   support

Serious & Fatal       0.32      0.09      0.14     19543
         Slight       0.84      0.96      0.90     99047

       accuracy                           0.82    118590
      macro avg       0.58      0.53      0.52    118590
   weighted avg       0.76      0.82      0.77    118590



## KNN - Classification (binary) with a balanced dataset

In [21]:
# Create a balanced dataset with 30'000 observations of each of the two severity classes
acc_balanced2 = pd.concat([acc_sub[acc_sub['Accident_Severity'] == 'Slight'].sample(n=30000, random_state=1),
acc_sub[acc_sub['Accident_Severity'] == 'Serious & Fatal'].sample(n=30000, random_state=1)])

In [22]:
# Create a dataset consisting of the remaining datapoints that are not included in the balanced set
acc_sub_res = pd.concat([acc_sub, acc_balanced2, acc_balanced2]).drop_duplicates(keep=False)

In [23]:
# Get the features
features = acc_balanced2[acc_balanced2.columns[acc_balanced2.columns.isin(['Road_Class', 'Junction_Detail', 'Light_Conditions', 
                                             'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver', 
                                             'Day_of_Week', 'Engine_Capacity_(CC)', 'Age_of_Driver', 'Age_of_Vehicle',
                                             'Multiple_Vehicles_involved', 'Urban_Area', 'Month', 'Year', 'Hour_of_Day', 'Speed_limit'])]]
                                             
# Bring them into the desired format
features = features.astype(object)
features[['Multiple_Vehicles_involved','Urban_Area']] = features[['Multiple_Vehicles_involved','Urban_Area']].astype(int)

# Apply one hot encoding to get category-indicating variables for each categorical feature
X = pd.get_dummies(features, drop_first=False).to_numpy()

# Get the label in the right format
y = acc_balanced2['Accident_Severity'].astype(object).to_numpy()

In [24]:
# Get the features for the remaining dataset
features_res = acc_sub_res[acc_sub_res.columns[acc_sub_res.columns.isin(['Road_Class', 'Junction_Detail', 'Light_Conditions', 
                                             'Weather_Conditions', 'Road_Surface_Conditions', 'Vehicle_Type', 'Sex_of_Driver', 
                                             'Day_of_Week', 'Engine_Capacity_(CC)', 'Age_of_Driver', 'Age_of_Vehicle',
                                             'Multiple_Vehicles_involved', 'Urban_Area', 'Month', 'Year', 'Hour_of_Day', 'Speed_limit'])]]
                                             
# Bring them into the desired format
features_res = features_res.astype(object)
features_res[['Multiple_Vehicles_involved','Urban_Area']] = features_res[['Multiple_Vehicles_involved','Urban_Area']].astype(int)

# Apply one hot encoding to get category-indicating variables for each categorical feature
X_res = pd.get_dummies(features_res, drop_first=False).to_numpy()

# Get the label in the right format
y_res = acc_sub_res['Accident_Severity'].astype(object).to_numpy()

In [25]:
# Apply a train-test split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

In [26]:
# Include the residual X values, that are not included in the balanced dataset, to the test dataset.
Xtest = np.concatenate([Xtest, X_res])

# Include the residual y values, that are not included in the balanced dataset, to the test dataset.
ytest = np.concatenate([ytest, y_res])

In [27]:
clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
clf.fit(Xtrain, ytrain)
yhat_knn = clf.predict(Xtest)
print('Confusion Matrix')
print('')
print(confusion_matrix(ytest, yhat_knn))
print('')
print('Classification Report')
print('')
print(classification_report(ytest, yhat_knn))

Confusion Matrix

[[ 39946  33067]
 [198076 273860]]

Classification Report

                 precision    recall  f1-score   support

Serious & Fatal       0.17      0.55      0.26     73013
         Slight       0.89      0.58      0.70    471936

       accuracy                           0.58    544949
      macro avg       0.53      0.56      0.48    544949
   weighted avg       0.80      0.58      0.64    544949

