# Data Preparation

In this notebook, we convert the selected categorical features into the formats needed for machine learning analysis and modeling.

In [1]:
# %matplotlib inline
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sns
import folium # for mapping
import matplotlib.pyplot as plt
from matplotlib import rc
rc('text', usetex=True)

### Import the Data


In [2]:
df = pd.read_csv('../data/uk_accident_data.csv', low_memory=False, index_col=0)

Display the columns 

In [3]:
pd.set_option('display.max_columns', None)
df.head()

Accident_Severity,Light_Conditions,Road_Surface_Conditions,Road_Type,Speed_limit,Urban_or_Rural_Area,Weather_Conditions,Age_Band_of_Driver,Vehicle_Type
Slight,Daylight,Dry,Single carriageway,Under 40 km/s,Urban,Fine,Adult,Car
Slight,Daylight,Wet,Single carriageway,Under 40 km/s,Urban,Rain,Adult,Car
Slight,Daylight,Wet,Single carriageway,Under 40 km/s,Urban,Rain,Adult,Motorcycle
Slight,Darkness - lights lit,Dry,Single carriageway,Under 40 km/s,Urban,Fine,Adult,Motorcycle
Slight,Darkness - lights lit,Dry,Single carriageway,Under 40 km/s,Urban,Fine,Adult,Car


Display the features and their labels.

In [4]:
for c in df.columns:
    print('Column: %s' % c)
    print('-------------------------------------------------')
    print(df[c].value_counts())
    print('-------------------------------------------------')
    print()

Column: Light_Conditions
-------------------------------------------------
Daylight                  1283201
Darkness - lights lit      298731
Darkness - no lighting      96746
Name: Light_Conditions, dtype: int64
-------------------------------------------------

Column: Road_Surface_Conditions
-------------------------------------------------
Dry      1169930
Wet       470440
Ice        25355
Snow       10627
Flood       2326
Name: Road_Surface_Conditions, dtype: int64
-------------------------------------------------

Column: Road_Type
-------------------------------------------------
Single carriageway    1212181
Dual carriageway       302680
Roundabout             117079
One way street          25667
On/off ramp             21071
Name: Road_Type, dtype: int64
-------------------------------------------------

Column: Speed_limit
-------------------------------------------------
Under 40 km/s    1152832
Over 40 km/s      525820
Name: Speed_limit, dtype: int64
----------------------

### Drop Poor Predictor Features

We drop features that have little to no influence on accident severity. Note that this says nothing about whether these features predict that an accident occurs; it only means that they are poor predictors, for whatever reason, of *accident severity*.

Interestingly, our feature heatmaps show that weather conditions do not play a large role in predicting accident severity. This is likely because people tend to drive more cautiously in inclement weather, so accident severity depends more on other factors (such as speed and vehicle type). Because road surface conditions correlate with weather conditions, it's not surprising that road surface condition is not a good predictor of accident severity either. So we omit these two features.

In [5]:
df.drop(['Road_Surface_Conditions', 'Weather_Conditions'], axis=1, inplace=True)


In [6]:
df.columns.tolist()

['Light_Conditions',
 'Road_Type',
 'Speed_limit',
 'Urban_or_Rural_Area',
 'Age_Band_of_Driver',
 'Vehicle_Type']
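
The heatmaps behind the decision to drop these features are not reproduced in this notebook. As one possible way to put a number on how strongly a categorical feature is associated with severity, the sketch below computes Cramér's V from a contingency table. This is a swapped-in illustration, not the method used above: the helper `cramers_v` and the toy data are hypothetical and are not taken from the accident dataset.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Hypothetical toy data: "speed" separates the classes perfectly,
# while "weather" is split evenly within each class.
toy = pd.DataFrame({
    "severity": ["Slight", "Slight", "Fatal", "Fatal", "Slight", "Fatal", "Slight", "Fatal"],
    "speed":    ["Under", "Under", "Over", "Over", "Under", "Over", "Under", "Over"],
    "weather":  ["Fine", "Rain", "Fine", "Rain", "Fine", "Rain", "Rain", "Fine"],
})

v_speed = cramers_v(toy["speed"], toy["severity"])
v_weather = cramers_v(toy["weather"], toy["severity"])
print(f"speed: {v_speed:.2f}, weather: {v_weather:.2f}")
```

On the real data, the same helper could be called with a feature column and the `Accident_Severity` index values before the drop in `In [5]`. A value near zero would be consistent with treating that feature as a weak predictor of severity.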

### One-hot Encoding of Categorical Attributes

In [7]:
df_oh = pd.get_dummies(
    df,
    columns=["Light_Conditions", "Road_Type", "Speed_limit",
             "Urban_or_Rural_Area", "Age_Band_of_Driver", "Vehicle_Type"],
    prefix=["light", "road", "speed", "area", "age", "vehicle"],
)
# Move Accident_Severity from the index into a regular column
df_oh.reset_index(inplace=True)
df_oh

Unnamed: 0,Accident_Severity,light_Darkness - lights lit,light_Darkness - no lighting,light_Daylight,road_Dual carriageway,road_On/off ramp,road_One way street,road_Roundabout,road_Single carriageway,speed_Over 40 km/s,speed_Under 40 km/s,area_Rural,area_Urban,age_Adolescent,age_Adult,age_Senior,vehicle_Bike,vehicle_Car,vehicle_Goods,vehicle_Motorcycle
0,Slight,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,0,0
1,Slight,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,1,0,0
2,Slight,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,1
3,Slight,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0,0,0,0,1
4,Slight,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678673,Slight,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,0,1,0,0
1678674,Slight,0,0,1,0,0,0,0,1,1,0,1,0,0,1,0,0,1,0,0
1678675,Slight,0,1,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,1,0
1678676,Slight,1,0,0,0,0,0,0,1,0,1,1,0,1,0,0,0,1,0,0
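
To make the encoding step concrete, here is a minimal sketch on a three-row toy frame (hypothetical data that reuses two of the column names above). Each categorical column is replaced by one indicator column per category, named `<prefix>_<category>`. The `dtype=int` argument is an addition for illustration: recent pandas versions return boolean indicators by default, and `dtype=int` yields the 0/1 values shown in the output above.

```python
import pandas as pd

# Hypothetical toy frame, indexed by severity like the notebook's df
toy = pd.DataFrame(
    {
        "Light_Conditions": ["Daylight", "Darkness - lights lit", "Daylight"],
        "Speed_limit": ["Under 40 km/s", "Over 40 km/s", "Under 40 km/s"],
    },
    index=pd.Index(["Slight", "Fatal", "Slight"], name="Accident_Severity"),
)

toy_oh = pd.get_dummies(
    toy,
    columns=["Light_Conditions", "Speed_limit"],
    prefix=["light", "speed"],
    dtype=int,  # force 0/1 integers instead of booleans
)
print(toy_oh)
```

Within each prefix group, exactly one indicator is set per row, so the indicators for a feature always sum to 1.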


### Imbalanced Dataset

Our dataset is imbalanced, meaning that the numbers of "slight", "serious", and "fatal" labels in the accident severity column vary drastically. If we were to use the dataset as is, a model trained on it would be biased toward the slight category, since slight accidents occur far more often than fatal accidents.

To account for this, we use undersampling: we randomly choose instances from the slight and serious cases until each class has as many instances as the fatal class.

In [9]:
df_oh['Accident_Severity'].value_counts()

Slight     1430139
Serious     224816
Fatal        23723
Name: Accident_Severity, dtype: int64
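
The size of the imbalance can be read off these counts directly. The short sketch below hard-codes the three counts printed above (rather than reloading the data) and derives each class's share of the dataset and its ratio to the fatal class. The variable names are illustrative only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"Slight": 1430139, "Serious": 224816, "Fatal": 23723})

share = counts / counts.sum()            # fraction of all rows per class
ratio_to_fatal = counts / counts["Fatal"]  # how many times larger than the fatal class

print(share.round(4))
print(ratio_to_fatal.round(1))
```

Because `share["Slight"]` is also the accuracy of a classifier that always predicts "Slight", plain accuracy on the raw data would be a misleading measure of model quality.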

In [10]:
df_slight  = df_oh[df_oh['Accident_Severity']=='Slight']
df_serious = df_oh[df_oh['Accident_Severity']=='Serious']
df_fatal   = df_oh[df_oh['Accident_Severity']=='Fatal']

In [11]:
from sklearn.utils import resample

df_slight_downsampled = resample(df_slight, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_fatal),     # to match minority class
                                 random_state=123) # reproducible results

df_serious_downsampled = resample(df_serious, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_fatal),     # to match minority class
                                 random_state=123) # reproducible results

In [12]:
df_downsampled = pd.concat([df_fatal, df_slight_downsampled, df_serious_downsampled])

In [13]:
df_downsampled['Accident_Severity'].value_counts()

Serious    23723
Slight     23723
Fatal      23723
Name: Accident_Severity, dtype: int64
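
As an aside, the same kind of balancing can be written more compactly with `groupby(...).sample(...)`, which is available in pandas 1.1 and later. The sketch below is an alternative under that assumption, shown on hypothetical toy data. It is not the code used in this notebook, and it will generally select different rows than `resample` does for the same seed.

```python
import pandas as pd

# Hypothetical toy data with the same kind of class imbalance
toy = pd.DataFrame({
    "Accident_Severity": ["Slight"] * 6 + ["Serious"] * 3 + ["Fatal"] * 2,
    "feature": range(11),
})

# Size of the smallest class
n_minority = int(toy["Accident_Severity"].value_counts().min())

# Draw n_minority rows without replacement from every class
balanced = toy.groupby("Accident_Severity").sample(n=n_minority, random_state=123)

balanced_counts = balanced["Accident_Severity"].value_counts()
print(balanced_counts)
```

Here the minority class is kept in full (in a different row order), while the larger classes are cut down to its size.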

In [14]:
df_downsampled.to_csv('uk_accidents_prepared.csv')