# 2016 Road Safety - Accidents 

The goal of this report is to build a model that predicts whether a police officer is likely to attend an accident, using the accident data provided at: https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data

In [None]:
# Importing the needed libraries
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None

# Importing libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Importing the machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, precision_recall_curve, average_precision_score
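# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV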
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier

# This is for sampling unbalanced data
from imblearn.under_sampling import RandomUnderSampler

# Importing the counter
from collections import Counter

## Importing and analysing the data

In [None]:
data = pd.read_csv('../input/Accidents_2016.csv', low_memory=False)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

## Feature Selection

Let's first simplify the column names to be able to better understand the exploratory visualizations.

In [None]:
names = ["Accident_Index","Location_Easting","Location_Northing",
         "Longitude","Latitude","Police_Force","Accident_Severity","No_of_Vehicles",
         "No_of_Casualties","Date","Day_of_Week","Time","Local_Authority_District",
         "Local_Authority_Highway","First_Road_Class","First_Road_No","Road_Type","Speed_Limit",
         "Junction_Detail","Junction_Control","Second_Road_Class","Second_Road_Number",
         "Pedestrian_Crossing_Control","Pedestrian_Crossing_Facilities",
         "Light_Conditions","Weather_Conditions","Road_Surface_Conditions","Special_Conditions_at_Site",
         "Carriageway_Hazards","Urban_or_Rural_Area","Police_Attendance","Accident_Location"]
data.columns = names

### Excluding features that seem unrelated to police attendance
To get an overview of the data, I draw a heatmap of the pairwise correlations between the columns. This also helps to reduce the number of features. As the heatmap below shows, some features are highly correlated with each other (e.g., Location_Easting and Longitude), so there is little benefit in using all of them to train the model. The heatmap does not show any meaningful correlation between the individual features and police attendance.

Looking at the documentation of the data, and at the data itself, some other features can be excluded because, by common sense, they do not seem to affect police attendance. For example, "Accident Index" is only a unique identifier of the accident. Other columns mostly describe the cause of the accident itself, such as "Pedestrian Crossing Control" and "Light Conditions".
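Reading redundant pairs off a heatmap by eye can be error-prone, so here is a minimal sketch of listing them programmatically. The helper name `highly_correlated_pairs`, the 0.9 threshold, and the toy columns are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |corr|) for numeric column pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = corr.loc[a, b]
            if value >= threshold:
                pairs.append((a, b, float(value)))
    return pairs

# Toy data (made up): "easting" and "longitude" move together, "speed" is unrelated.
rng = np.random.default_rng(0)
easting = rng.uniform(100_000, 600_000, size=200)
toy = pd.DataFrame({
    "easting": easting,
    "longitude": easting / 70_000 - 7 + rng.normal(0, 0.01, size=200),
    "speed": rng.choice([20, 30, 40, 60, 70], size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data the same helper could be called on `data` after the columns are renamed; one column from each reported pair is then a candidate for removal.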

In [None]:
# Correlations are only defined for numeric columns, so select those first
fig, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(data.select_dtypes(include='number').corr(), ax=ax)

## Data preprocessing and preparation

### Handling the null values
Because the dataset has a lot of samples, I decided to simplify the process by dropping the rows that contain null values.
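Before dropping rows it can help to check how much data is actually lost. The following is a small sketch on a made-up frame (the values are invented; only the column names echo the dataset) that counts nulls per column and the number of rows removed by `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values are made up)
toy = pd.DataFrame({
    "Speed_Limit": [30, 30, np.nan, 60],
    "Time": ["08:15", None, "17:40", "23:05"],
    "No_of_Vehicles": [2, 1, 3, 2],
})

nulls_per_column = toy.isnull().sum()   # number of nulls in each column
rows_before = len(toy)
toy_no_na = toy.dropna()                # drops every row with at least one null
rows_dropped = rows_before - len(toy_no_na)

print(nulls_per_column)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

If the share of dropped rows were large, imputing values would be worth considering instead of dropping them.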

### Handling non-numerical values
There are some non-numerical values in the dataset that should be managed:
- To handle the date, I will extract the month. The day of the month can be skipped, as the day of the week is already included in the dataset, and the year is always 2016.
- To handle the time values, I am going to extract only the hour.
- To handle non-numerical categorical values, I will convert them to numerical category codes.
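The three steps above can be sketched on a toy frame. This version uses `pd.to_datetime` rather than string splitting, which is a swapped-in alternative and not the approach used in the cells of this notebook. It assumes dates are formatted as DD/MM/YYYY and times as HH:MM, and the sample values are made up.

```python
import pandas as pd

# Toy rows (made up) that mimic the Date, Time and a categorical column
toy = pd.DataFrame({
    "Date": ["01/11/2016", "15/02/2016", "31/12/2016"],
    "Time": ["02:30", "17:05", "23:59"],
    "Local_Authority_Highway": ["E09000008", "E10000016", "E09000008"],
})

# Month from the date (assumed DD/MM/YYYY)
toy["Month"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y").dt.month

# Hour from the time of day (assumed HH:MM)
toy["Hour"] = pd.to_datetime(toy["Time"], format="%H:%M").dt.hour

# Integer code per distinct category value
toy["Local_Authority_Highway_Cat"] = (
    toy["Local_Authority_Highway"].astype("category").cat.codes
)

print(toy[["Month", "Hour", "Local_Authority_Highway_Cat"]])
```

Parsing with an explicit format raises an error on malformed entries instead of silently producing a wrong month, which may or may not be desirable depending on how clean the column is.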

### Managing the target categories
The main question to be answered by the model is whether a police officer is likely to attend an accident. The police attendance field of the data has three categories: 1 representing Yes, 2 representing No, and 3 representing No (the accident was self-reported). As the goal is to predict police attendance, the target can be simplified into a binary value: 1 if the police attended, 0 otherwise.
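As a small illustration of this simplification, the sketch below collapses the three codes into a binary target with a vectorised comparison. The sample values are made up.

```python
import pandas as pd

# Made-up attendance codes: 1 = Yes, 2 = No, 3 = No (self-reported)
attendance = pd.Series([1, 2, 3, 1, 3], name="Police_Attendance")

# 1 where the police attended, 0 for both kinds of "No"
binary_target = (attendance == 1).astype(int)

print(binary_target.tolist())
```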

In [None]:
data_no_na = data.dropna()

In [None]:
# The month is the second "/"-separated field of the date; cast it to int, like the hour
data_no_na['Month'] = data_no_na['Date'].apply(lambda x: int(x.split("/")[1]))
data_no_na['Hour'] = data_no_na['Time'].apply(lambda x: int(x.split(":")[0]))

In [None]:
data_no_na['Accident_Location'] = data_no_na['Accident_Location'].astype('category')
data_no_na['Accident_Location_Cat'] = data_no_na['Accident_Location'].cat.codes

data_no_na['Local_Authority_Highway'] = data_no_na['Local_Authority_Highway'].astype('category')
data_no_na['Local_Authority_Highway_Cat'] = data_no_na['Local_Authority_Highway'].cat.codes

data_no_na['Police_Attendance']= data_no_na['Police_Attendance'].apply(lambda x: 1 if x==1 else 0)

In [None]:
features_minimal = data_no_na[[ 'Location_Easting', 'Location_Northing', 'Police_Force', 'Accident_Severity', 'No_of_Vehicles',
       'No_of_Casualties', 'Day_of_Week','Local_Authority_District',
       'First_Road_Class', 'First_Road_No', 'Road_Type', 'Speed_Limit',
       'Junction_Detail', 'Junction_Control', 'Second_Road_Class',
       'Second_Road_Number','Weather_Conditions', 'Road_Surface_Conditions',
       'Special_Conditions_at_Site', 'Carriageway_Hazards',
       'Urban_or_Rural_Area','Month', 'Hour','Accident_Location_Cat','Local_Authority_Highway_Cat']]

In [None]:
target = data_no_na['Police_Attendance']

## Visualizing and Analyzing the Results

I use the following helper functions to visualize and analyse the results of the trained models.

In [None]:
def plot_confusion_matrix(y_test, y_pred,classes,normalize=False,title='Confusion Matrix',cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    np.set_printoptions(precision=2)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
   

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    plt.grid(False)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.

    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], fmt),
                horizontalalignment="center",
                color="white" if cm[i, j] > thresh else "black")
    
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

def plot_side_by_side_confusion_matrix(y_test, y_pred):
    """
    Plots the confusion matrix
    """
    plt.figure(figsize=(12,4))
    plt.subplot(121)
    plot_confusion_matrix(y_test, y_pred, [0,1], normalize=False, title='Confusion Matrix')
    plt.subplot(122)
    plot_confusion_matrix(y_test, y_pred, [0,1], normalize=True, title='Normalized Confusion Matrix')
    
def plot_roc_curve(y_test, probs):
    """
    Plots the ROC curve
    """
    fpr, tpr, thresholds = roc_curve(y_test, probs)
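    # Label both lines so that plt.legend() below has entries to show
    plt.plot(fpr, tpr, lw=1, label='ROC curve')
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Chance')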
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC')
    plt.legend(loc="lower right")

def plot_precision_recall_curve(y_test, probs):
    """
    Plots the Precision-Recall Curve
    """
    precision, recall, _ = precision_recall_curve(y_test, probs)
    average_precision = average_precision_score(y_test, probs)
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')

    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(average_precision))

# Model Selection

The first thing to consider is that this is a binary classification problem, with a considerable number of samples and features available to train the model.

## Linear Support Vector Classification

The first model I would like to test on the data is simply the linear SVC.

Most of the samples record a police attendance, so a model trained on them could become biased toward this outcome. Before splitting the samples into training and test data, it therefore seemed better to use a technique that balances the sampling of the dataset. I used the imbalanced-learn library for this purpose, which can be found here: http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html. However, the final results showed that this technique didn't improve the model in terms of precision and recall.
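To make the idea of balanced sampling concrete, here is a rough sketch of random under-sampling written with plain pandas on toy data. It is not imbalanced-learn's implementation; the helper name `random_under_sample` and the data are assumptions made for illustration.

```python
import pandas as pd

def random_under_sample(X, y, random_state=0):
    """Randomly down-sample every class to the size of the smallest class."""
    smallest = y.value_counts().min()
    # Draw `smallest` rows from each class and keep their index labels
    keep = y.groupby(y).sample(n=smallest, random_state=random_state).index
    return X.loc[keep], y.loc[keep]

# Toy data (made up): 8 attended accidents, 2 not attended
X = pd.DataFrame({"feature": range(10)})
y = pd.Series([1] * 8 + [0] * 2, name="Police_Attendance")

X_bal, y_bal = random_under_sample(X, y)
print(sorted(y_bal.value_counts().to_dict().items()))
```

Under-sampling discards rows of the majority class, so it trades training data for balance. If it is applied before the train/test split, the test set is balanced as well and no longer reflects the original class ratio.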

To support this comparison, the results of the linear SVC trained with the normal sampling are shown first, followed by the same model trained with the balanced sampling:

In [None]:
lin_svc_X_train, lin_svc_X_test, lin_svc_y_train, lin_svc_y_test = train_test_split(features_minimal, target, test_size=0.4, random_state=101)

In [None]:
lin_svc_model = svm.LinearSVC()

In [None]:
lin_svc_model.fit(lin_svc_X_train,lin_svc_y_train)

In [None]:
lin_svc_predictions = lin_svc_model.predict(lin_svc_X_test)

In [None]:
print(confusion_matrix(lin_svc_y_test,lin_svc_predictions))

In [None]:
print(classification_report(lin_svc_y_test,lin_svc_predictions))

Now using the balanced sampling:

In [None]:
rus = RandomUnderSampler(random_state=0)
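# fit_resample replaces the older fit_sample, which newer imbalanced-learn releases no longer provide
X_resampled, y_resampled = rus.fit_resample(features_minimal, target)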
print(sorted(Counter(y_resampled).items()))

In [None]:
balanced_X_train, balanced_X_test, balanced_y_train, balanced_y_test = train_test_split(X_resampled, y_resampled, test_size=0.4, random_state=101)

In [None]:
lin_svc_balanced_model = svm.LinearSVC()

In [None]:
lin_svc_balanced_model.fit(balanced_X_train,balanced_y_train)

In [None]:
lin_svc_balanced_predictions = lin_svc_balanced_model.predict(balanced_X_test)

In [None]:
print(confusion_matrix(balanced_y_test,lin_svc_balanced_predictions))

In [None]:
print(classification_report(balanced_y_test,lin_svc_balanced_predictions))

## Improving the model using GridSearchCV
To tune the model and get better results, I use GridSearchCV on the 'C' parameter of the linear SVC model.
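As a hedged sketch of how such a search can be set up, the example below tunes `C` on synthetic data. Wrapping the classifier in a pipeline with `StandardScaler`, the `f1` scoring, the 3-fold CV, and the grid values are all assumptions added for illustration and differ from the cells in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, mildly imbalanced binary problem (not the accidents data)
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.3, 0.7], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101, stratify=y)

# Scaling inside the pipeline is refitted on each CV training fold
pipeline = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))

# make_pipeline names each step after its lowercased class name
param_grid = {"linearsvc__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Linear SVMs are sensitive to feature scale, so scaling inside the pipeline may also help the solver converge, although that would have to be checked on the real data.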

In [None]:
param_grid = {'C':[1,10,100,1000]}

In [None]:
grid = GridSearchCV(svm.LinearSVC(),param_grid,verbose=3)

In [None]:
grid.fit(lin_svc_X_train,lin_svc_y_train)

In [None]:
grid.best_params_

In [None]:
grid_predictions = grid.predict(lin_svc_X_test)

In [None]:
print(confusion_matrix(lin_svc_y_test, grid_predictions))

In [None]:
print(classification_report(lin_svc_y_test, grid_predictions))

## Analysing the Results

As the results show, tuning the 'C' parameter with GridSearchCV has improved the model in both recall and precision.

## K-Nearest Neighbors Classification

As the results of the linear SVC model didn't seem very satisfying, I also train a KNN classifier to compare the results. KNN is distance-based, so the samples should first be scaled before they are used to train the model.
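The effect of scaling on distances can be illustrated with a small NumPy sketch. The two features and all values below are made up; the standardisation mirrors what `StandardScaler` computes (subtract the column mean, divide by the column standard deviation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy features on very different scales (values are made up)
easting = rng.uniform(100_000, 600_000, size=500)
speed_limit = rng.choice([20, 30, 40, 60, 70], size=500).astype(float)
X = np.column_stack([easting, speed_limit])

# Column-wise statistics used for standardisation
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

def distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Two hypothetical accidents: 1 km apart, very different speed limits
a = np.array([530_000.0, 30.0])
b = np.array([531_000.0, 70.0])

raw_share = distance_share(a, b)
scaled_share = distance_share((a - mean) / std, (b - mean) / std)

print("raw:   ", raw_share)
print("scaled:", scaled_share)
```

Without scaling, the easting column dominates the distance almost entirely; after standardisation, the difference in speed limit is what separates the two points.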

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(features_minimal)

In [None]:
scaled_features  = scaler.transform(features_minimal)

In [None]:
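# Keep the original row index so the scaled features stay aligned with `target`
knn_features = pd.DataFrame(scaled_features, columns=features_minimal.columns, index=features_minimal.index)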

In [None]:
knn_features.head()

In [None]:
knn_X_train, knn_X_test, knn_y_train, knn_y_test = train_test_split(knn_features, target, test_size=0.4, random_state=101)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(knn_X_train,knn_y_train)

In [None]:
knn_predictions = knn.predict(knn_X_test)

In [None]:
print(confusion_matrix(knn_y_test,knn_predictions))

In [None]:
print(classification_report(knn_y_test,knn_predictions))

## Analysis of the Result
The results show that the KNN classifier has lower precision and better recall than the linear SVC; however, the model doesn't seem to be biased toward police attendance. To test whether the KNN classifier improves with more neighbors, I train KNN classifiers with the number of neighbors ranging from 1 to 19.

In [None]:
test_error_rate = list()

for i in range(1,20):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(knn_X_train,knn_y_train)
    knn_pred_i = knn.predict(knn_X_test)
    test_error_rate.append(np.mean(knn_pred_i != knn_y_test))

In [None]:
plt.figure(figsize=(7,4))
plt.plot(range(1,20),test_error_rate,color='blue',linestyle='dashed',marker='o',markerfacecolor='blue',markersize=5)
plt.xlabel('K Value')
plt.ylabel('Error Rate')
plt.title('Error Rate vs. K Value')

## Analysing the Results
The results suggest that a K of around 7-9 should give a balance between the test and train error rates. Beyond that, increasing K does not improve the results much.

(Note: it would have been better to also consider the train error rate in this evaluation.)
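A minimal sketch of what that fuller evaluation might look like is given below. It records both the train and the test error rate for each K, and runs on synthetic data rather than the accidents data; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem (not the accidents data)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)

k_values = range(1, 20)
train_error_rate, test_error_rate = [], []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Error rate = fraction of misclassified samples
    train_error_rate.append(np.mean(knn.predict(X_train) != y_train))
    test_error_rate.append(np.mean(knn.predict(X_test) != y_test))

for k, tr, te in zip(k_values, train_error_rate, test_error_rate):
    print(f"K={k:2d}  train error={tr:.3f}  test error={te:.3f}")
```

With K=1 every training point is its own nearest neighbor, so the train error is zero. A large gap between the two curves at small K is a typical sign of overfitting, and the gap usually narrows as K grows.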


## The final model

Based on this analysis, the K value of the KNN classifier is set to 9. I retrain the model with this tuned parameter to compare the results.

In [None]:
knn9 = KNeighborsClassifier(n_neighbors=9)
knn9.fit(knn_X_train,knn_y_train)

In [None]:
knn9_predictions = knn9.predict(knn_X_test)
print(confusion_matrix(knn_y_test,knn9_predictions))
print(classification_report(knn_y_test,knn9_predictions))

In [None]:
knn9_predictions_probs = knn9.predict_proba(knn_X_test)

In [None]:
plot_side_by_side_confusion_matrix(knn_y_test,knn9_predictions)

In [None]:
plot_roc_curve(knn_y_test, knn9_predictions_probs[:,1])

In [None]:
plot_precision_recall_curve(knn_y_test, knn9_predictions_probs[:,1])

## Analysing the Results

The final results show that the precision and recall have improved significantly. The model also seems to predict the non-attendance cases better (considering the improved precision on the non-attendance target).