In [34]:
import pandas as pd

#Read the dataset
df = pd.read_csv('./SafeDrive-AI/Traffic_Collision_Data.csv')

#Remove columns that won't be useful
df.drop(['Geo_ID', 'X', 'Y', 'Accident_Year'], axis = 1, inplace = True)

#Convert column to right variable type
df['Accident_Date'] = pd.to_datetime(df['Accident_Date'])

print(df.shape)
df.info()

(74612, 26)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74612 entries, 0 to 74611
Data columns (total 26 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   ID                          74612 non-null  object        
 1   Accident_Date               74612 non-null  datetime64[ns]
 2   Accident_Time               74612 non-null  object        
 3   Location                    74612 non-null  object        
 4   Location_Type               74612 non-null  object        
 5   Classification_Of_Accident  74612 non-null  object        
 6   Initial_Impact_Type         74607 non-null  object        
 7   Road_Surface_Condition      74611 non-null  object        
 8   Environment_Condition       74610 non-null  object        
 9   Light                       74612 non-null  object        
 10  Traffic_Control             74611 non-null  object        
 11  Num_of_Vehicle              74612 non-null

# Feature Engineering for Accident_Time Column

A random forest can only process numerical values, and the Accident_Time column holds hh:mm strings (dtype object), so the model cannot use it directly. We therefore need to convert Accident_Time into a numerical representation that preserves the cyclical nature of time: 23:00 and 1:00 are only two hours apart, so they should end up closer to each other than 11:00 and 14:00, which are three hours apart.

To achieve this, we first convert the time hh:mm into total minutes since midnight. For example, 8:45 becomes 525 minutes (8 hours * 60 minutes + 45 minutes). The general formula is as follows, where $t$ is a time hh:mm, $t_{hh}$ is its hour part, and $t_{mm}$ is its minute part.

$$minutes(t) = (t_{hh} * 60) + t_{mm} $$
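
As a quick sanity check of the formula, here is a minimal sketch that applies it to a single hh:mm string. The helper name `minutes_since_midnight` is hypothetical and only used for illustration; it is not part of the notebook's pipeline.

```python
def minutes_since_midnight(t: str) -> int:
    """Convert an 'hh:mm' string to minutes since midnight.

    Hypothetical helper for illustration only.
    """
    hh, mm = t.split(":")
    return int(hh) * 60 + int(mm)


print(minutes_since_midnight("8:45"))   # 8 * 60 + 45 = 525
print(minutes_since_midnight("23:00"))  # 23 * 60 + 0 = 1380
```

The result always falls in the range 0 to 1439, since a day has 1440 minutes.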

With that in mind, I found a well-known approach for this: using sine and cosine to map each time onto a circle, so the resulting numerical values keep the cyclical property of time.

This means we create two new columns from the single Accident_Time column using the following formulas, where 1440 is the number of minutes in a day (24 * 60):

$$\mathrm{sin\_time}(t) = \sin\left(\frac{2\pi \cdot minutes(t)}{1440}\right)$$

$$\mathrm{cos\_time}(t) = \cos\left(\frac{2\pi \cdot minutes(t)}{1440}\right)$$
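
To see why this encoding keeps the cyclical property, the sketch below compares distances between times in raw minutes and in the encoded (sin, cos) space. The helper names `encode_time` and `encoded_distance` are assumptions made for this illustration, and Euclidean distance is used only as a simple way to measure closeness.

```python
import numpy as np

MINUTES_PER_DAY = 1440  # 24 hours * 60 minutes


def encode_time(minutes):
    """Map minutes since midnight to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * minutes / MINUTES_PER_DAY
    return np.array([np.sin(angle), np.cos(angle)])


def encoded_distance(a, b):
    """Euclidean distance between two encoded times (hypothetical helper)."""
    return float(np.linalg.norm(encode_time(a) - encode_time(b)))


# Raw minutes: 23:00 and 1:00 look far apart, 11:00 and 14:00 look close
raw_midnight = abs(23 * 60 - 1 * 60)
raw_midday = abs(11 * 60 - 14 * 60)

# Encoded: the two-hour gap across midnight is now the smaller distance
d_midnight = encoded_distance(23 * 60, 1 * 60)
d_midday = encoded_distance(11 * 60, 14 * 60)

print(raw_midnight, raw_midday)
print(d_midnight < d_midday)
```

Both columns are needed: sine or cosine alone maps two different times of day to the same value, while the pair identifies each minute of the day uniquely.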



In [35]:
import numpy as np

#Feature engineering for Accident_Time
#Before starting, we need to remove all Unknown values from the dataset
#(.copy() avoids a SettingWithCopyWarning when new columns are added below)
df = df.loc[df['Accident_Time'] != 'Unknown'].copy()

#First step: convert time hh:mm to total minutes since midnight
#(errors = 'coerce' turns any value that does not match hh:mm into NaT)
times = pd.to_datetime(df['Accident_Time'], format = '%H:%M', errors = 'coerce')
hours = times.dt.hour
minutes = times.dt.minute
time_minutes = (hours*60) + minutes

df['time_minutes'] = time_minutes

#Second step: create columns for sin_time and cos_time
calculated_time = (2 * np.pi * df['time_minutes']) / 1440
df['sin_time'] = np.sin(calculated_time)
df['cos_time'] = np.cos(calculated_time)

#Third step: remove original Accident_Time column since it won't be of any use
df.drop(['Accident_Time'], axis = 1, inplace = True)

df.loc[:, ['sin_time', 'cos_time']]

       sin_time  cos_time
0      0.374607  0.927184
1      0.754710  0.656059
2      0.944089 -0.329691
3      0.713250 -0.700909
4      0.082808 -0.996566
...         ...       ...
74607  0.034899 -0.999391
74608 -0.043619 -0.999048
74609 -0.675590 -0.737277
74610 -0.700909 -0.713250
