Acknowledgements

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

# US - Accidents Analysis

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/us-accidents/US_Accidents_Dec19.csv')

In [None]:
df.head()

In [None]:
#Let's take a look at the columns
df.info()

In [None]:
#Where were the data collected from?
df['Source'].value_counts()

In [None]:
sns.set_style('whitegrid')
sns.set_palette('Spectral')

In [None]:
plt.figure(figsize = (8, 6))
sns.countplot(x = df['Source'], order = df['Source'].value_counts().index)

MapQuest stands out as the major source of this accident data

In [None]:
#Let's check any missing data
df.isna().sum()

Some columns are complete while others have missing data. We can deal with the missing data later.

In [None]:
#TMC provides a more detailed event code
#df['TMC'].value_counts()
#After looking at https://wiki.openstreetmap.org/wiki/TMC/Event_Code_List, I do not think those details would add much value
#to this analysis, so I drop the column here
df.drop('TMC', axis = 1, inplace = True)

In [None]:
df['Severity'].value_counts()

In [None]:
sns.countplot(x = df['Severity'])

Severity 2 has the most instances, which fits our expectation that most accidents are of moderate severity

In [None]:
#Let's look at Start and End time for a sec
df[['Start_Time', 'End_Time']].head()

In [None]:
from datetime import datetime

In [None]:
#The times are pretty close, we might be interested to see the time difference between End_Time and Start_Time
time_diff = \
df['End_Time'].apply(datetime.strptime, args = ('%Y-%m-%d %H:%M:%S',)) - \
df['Start_Time'].apply(datetime.strptime, args = ('%Y-%m-%d %H:%M:%S',))

In [None]:
#Convert everything to hour difference, ignoring microseconds
time_diff_hr = time_diff.apply(lambda x: x.days * 24 + x.seconds / 3600)

In [None]:
time_diff_hr[:10]

In [None]:
#Add time_diff_hr back to df as Time_Diff
df['Time_Diff'] = time_diff_hr
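
The same hour difference can also be computed without calling `strptime` row by row. The sketch below is only an illustration on a small made-up frame (`toy` and its timestamps are hypothetical, not taken from the dataset) and assumes the strings follow the `%Y-%m-%d %H:%M:%S` format used above.

```python
import pandas as pd

# Toy stand-in for df[['Start_Time', 'End_Time']]; the values are made up
toy = pd.DataFrame({
    'Start_Time': ['2019-01-01 08:00:00', '2019-01-01 23:30:00'],
    'End_Time':   ['2019-01-01 08:45:00', '2019-01-02 01:00:00'],
})

fmt = '%Y-%m-%d %H:%M:%S'
start = pd.to_datetime(toy['Start_Time'], format=fmt)
end = pd.to_datetime(toy['End_Time'], format=fmt)

# Subtracting two datetime Series gives a timedelta Series;
# total_seconds() / 3600 converts it to hours in one vectorized step
toy_diff_hr = (end - start).dt.total_seconds() / 3600
print(toy_diff_hr.tolist())
```

On a frame with millions of rows, the vectorized version is likely to be noticeably faster than `apply`, though the result should be the same.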

Another use of time is to divide it into buckets, e.g. Morning, Afternoon, Evening. We have both Start_Time and End_Time, but I use only Start_Time.

In [None]:
start_hour = df['Start_Time'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S').hour)

In [None]:
morning = [1 if 6 <= x <= 11 else 0 for x in start_hour]
afternoon = [1 if 12 <= x <= 17 else 0 for x in start_hour]
evening = [1 if 18 <= x <= 23 or 0 <= x <= 5 else 0 for x in start_hour]

In [None]:
#Now we put them back into df
df = df.assign(Morning = morning, Afternoon = afternoon, Evening = evening)
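
As a compact alternative to three list comprehensions, the bucketing can be sketched with `np.select`. The `hours` Series below is hypothetical sample data; the hour boundaries are the same ones used above (6 to 11 is morning, 12 to 17 is afternoon, everything else is evening).

```python
import numpy as np
import pandas as pd

# Hypothetical start hours (0-23), chosen to hit every bucket boundary
hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 23])

# Series.between is inclusive on both ends by default
conditions = [hours.between(6, 11), hours.between(12, 17)]
bucket = np.select(conditions, ['morning', 'afternoon'], default='evening')

# One indicator column per bucket, similar to Morning/Afternoon/Evening
indicators = pd.get_dummies(pd.Series(bucket)).astype(int)
print(list(bucket))
```

Note that with these boundaries the evening bucket spans twelve hours while the other two span six each, which is worth keeping in mind when comparing counts across buckets.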

In [None]:
#We can drop the Start_Time and End_Time columns now
df.drop(['Start_Time', 'End_Time'], axis = 1, inplace = True)

In [None]:
#Now let's see accidents distribution by timeframe
temp = np.asarray(morning) + np.asarray(afternoon) * 2 + np.asarray(evening) * 3
timeframe = ['morning' if x == 1 else 'afternoon' if x == 2 else 'evening' for x in temp]
del temp

In [None]:
plt.figure(figsize = (8, 6))
sns.countplot(x = timeframe, order = ['morning', 'afternoon', 'evening'])
#It seems most accidents happen in the morning
#My initial thought was that the evening would have more accidents because of poor lighting, but the plot shows otherwise

In [None]:
#Now view what columns we have again
df.info()

In [None]:
#The next columns are latitudes and longitudes; it might be hard to use them directly,
#but I am thinking about using K-means to group them into 10 geospatial areas

#Let's look at some sample data first
df[['Start_Lat', 'Start_Lng', 'End_Lat', 'End_Lng']].head(10)

In [None]:
#End_Lat and End_Lng have many missing values, consistent with our earlier findings
#So we would drop End_Lat and End_Lng and only use Start_Lat and Start_Lng for K-means
#Actually, why not just drop all the columns that have a lot of missing values altogether?
#Again, the columns with missing values are
cols_missing_vals = df.isna().sum()[lambda x: x > 0]
cols_missing_vals

In [None]:
#The total # of rows of the df is
num_rows = len(df.index)
num_rows

In [None]:
#Let's say we do not want columns that are missing over 5% of the data
cols_to_drop = cols_missing_vals[lambda x: x > num_rows * 0.05]
cols_to_drop

In [None]:
#Now drop the columns from df
df.drop(cols_to_drop.index, axis = 1, inplace = True)
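
The threshold rule above can also be expressed directly in terms of the missing fraction per column. This is a minimal sketch on a hypothetical three-column frame; the 0.3 cutoff is chosen only so the toy example drops something (the notebook itself uses 5%).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is complete, 'b' is 50% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, 3.0, 4.0],
    'c': ['x', None, 'z', 'w'],
})

# isna().mean() gives the fraction of missing values in each column
missing_frac = toy.isna().mean()

threshold = 0.3  # illustrative cutoff for this toy frame only
kept = toy.loc[:, missing_frac <= threshold]
print(list(kept.columns))
```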

In [None]:
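#Now let's see if we can use kdeplot on a SAMPLE of df to visualize the density of Start_Lng vs. Start_Lat
#(the keyword arguments x, y and fill assume seaborn 0.11 or newer; older versions take positional data and shade)
df_sample = df.sample(10000)
sns.kdeplot(x = df_sample['Start_Lng'], y = df_sample['Start_Lat'], fill = True)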

Imagine overlaying the KDE plot on a map of the United States: many accidents seem to have taken place on the west and east coasts (with some in the south as well), which makes sense

In [None]:
#Now implement K-means to find 10 clusters
#reference on elbow method (not used here): https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203
from sklearn.cluster import KMeans

In [None]:
X = df[['Start_Lat', 'Start_Lng']]
kmeans = KMeans(n_clusters = 10)
geo_cluster = kmeans.fit_predict(X)
df['Geo_Cluster'] = geo_cluster

In [None]:
#Now we can drop Start_Lat and Start_Lng
df.drop(['Start_Lat', 'Start_Lng'], axis = 1, inplace = True)
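
The choice of 10 clusters above is a fixed guess. For reference, the elbow method mentioned in the comment can be sketched as follows, on synthetic 2-D points rather than the real coordinates. The blob centers, the range of `k`, and the `n_init` and `random_state` values are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-means for several k and record the within-cluster sum of squares
ks = list(range(1, 7))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` should show a sharp bend at the number of blobs; on real latitude and longitude data the bend is usually much less clear-cut. K-means here also treats degrees as plain Euclidean coordinates, which is only an approximation of geographic distance.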

#### Additional visualizations showing different data properties/relationships

In [None]:
#Top 10 states for accidents
fig, ax = plt.subplots(figsize = (8, 6))

temp = df['State'].value_counts().head(10)
sns.barplot(x = temp.index, y = temp.values, ax = ax)

ax.set_xlabel('States')
ax.set_ylabel('# of Accidents')

del temp

In [None]:
#Distribution of Temperature(F)
sns.distplot(df['Temperature(F)'].dropna())

In [None]:
#Weather conditions where most accidents happen
fig, ax = plt.subplots(figsize = (8, 6))

temp = df['Weather_Condition'].value_counts().head(10)
sns.barplot(x = temp.index, y = temp.values, ax = ax)

ax.set_xlabel('Weather_Condition')
ax.set_ylabel('# of Accidents')

ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

del temp

Do most accidents take place because of clear skies? Or are clear skies simply the most common weather condition?

-------------------------------------------------------------------------------------------

In [None]:
#What I am going to do next is to drop the features that we are not going to use in this analysis
df.drop(['Description', 'Street', 'Side', 'City', 'County', 'State', 'Zipcode', 'Country', 'Airport_Code', 'Weather_Timestamp', 
        'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight'], axis = 1, inplace = True)

In [None]:
#Lets see the columns that are left
df.info()

In [None]:
#Missing Value Cols
cols_missing_vals = df.isna().sum()[lambda x: x > 0].index
df[cols_missing_vals].info()

In [None]:
#We would fill missing value cols by data type
def fill_missing_values(col_name):
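    #For float64, we fill using median
    #(assign back instead of calling fillna(inplace = True) on a column selection, which may not update df in newer pandas)
    if df[col_name].dtype == np.float64:
        df[col_name] = df[col_name].fillna(df[col_name].median())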
    #For objects, we use existing distribution
    else:
        nas = df[col_name].isna()
        df.loc[nas, col_name] = df.loc[~nas, col_name].sample(nas.sum(), replace = True).values

In [None]:
for col_name in cols_missing_vals:
    fill_missing_values(col_name)
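
To make the two fill strategies concrete, here is a small sketch on a hypothetical frame: the numeric column gets its median, and the object column gets values drawn at random from its observed entries, so the filled column roughly keeps the existing distribution. The column names, values, and `random_state` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
toy = pd.DataFrame({
    'temp': [50.0, np.nan, 70.0, 90.0],
    'wind': ['Calm', 'N', None, 'Calm'],
})

# Numeric column: fill with the median of the observed values (70.0 here)
toy['temp'] = toy['temp'].fillna(toy['temp'].median())

# Object column: sample replacements from the non-missing values
nas = toy['wind'].isna()
draws = toy.loc[~nas, 'wind'].sample(nas.sum(), replace=True, random_state=0)
toy.loc[nas, 'wind'] = draws.values

print(toy)
```

Because the categorical fill is random, results differ between runs unless a seed is fixed, as done here with `random_state`.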

In [None]:
#Now check for columns missing data again
df.isna().any()

In [None]:
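#Now let's find correlations between features
#(restricted to numeric and bool columns, since newer pandas may raise an error on object columns)
df.select_dtypes(include = ['number', 'bool']).corr()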

In [None]:
#Something seems to be off for Turning_Loop, let's inspect it
df['Turning_Loop'].value_counts()

In [None]:
#It does not have any True values, so let's drop col Turning_Loop
df.drop('Turning_Loop', axis = 1, inplace = True)

In [None]:
df_corr = df.select_dtypes(include = ['number', 'bool']).corr()

In [None]:
#Now we can visualize correlations using heatmap
plt.figure(figsize = (10, 8))
sns.heatmap(df_corr, linewidths=.5, cmap = 'Blues')

In [None]:
#It seems only a few variables are highly correlated with each other, which is good.
#Now let's take a look at our data again and see if we need additional processing on some columns
pd.set_option('display.max_columns', None)
df.head()
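
Reading highly correlated pairs off a heatmap by eye is error-prone, so it can help to list them explicitly. The sketch below does this on synthetic data; the column names and the 0.8 cutoff are assumptions made for the example, not values from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'x_noisy' is nearly a copy of 'x', 'z' is independent
rng = np.random.RandomState(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'x_noisy': x + rng.normal(scale=0.1, size=200),
    'z': rng.normal(size=200),
})

corr = toy.corr().abs()
cols = list(corr.columns)

# Walk the upper triangle so each pair is reported once
cutoff = 0.8  # illustrative threshold
high = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > cutoff
]
print(high)
```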

In [None]:
#Drop ID, drop Evening because we only need Morning and Afternoon to represent all three time buckets
df.drop(['ID', 'Evening'], axis = 1, inplace = True)

In [None]:
#For Wind_Direction and Weather_Condition, because they have many categories, 
#we would like to narrow the categories down, see below
df['Wind_Is_Calm'] = df['Wind_Direction'] == 'Calm'
df['Weather_Is_Clear'] = df['Weather_Condition'] == 'Clear'
df.drop(['Wind_Direction', 'Weather_Condition'], axis = 1, inplace = True)

In [None]:
#Get dummies for categorical features
df = pd.get_dummies(data = df, columns = ['Source', 'Timezone', 'Geo_Cluster'], drop_first = True)
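
Dropping `Evening` earlier and passing `drop_first = True` here follow the same reasoning: when indicator columns for one feature always sum to 1, any one of them is implied by the others. A small sketch on a hypothetical `Timezone` column (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Timezone': ['US/Eastern', 'US/Pacific', 'US/Central', 'US/Eastern']})

# Full encoding: one column per category
full = pd.get_dummies(toy, columns=['Timezone'])

# Reduced encoding: the first category (in sorted order) is dropped
reduced = pd.get_dummies(toy, columns=['Timezone'], drop_first=True)

# Every row of the full encoding sums to 1, so one column is redundant
print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

In the reduced encoding, a row with all zeros stands for the dropped category.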

### Suppose we would like to develop a model to predict the severity of accidents

In [None]:
X = df.drop('Severity', axis = 1)
y = df['Severity']

In [None]:
#transform bool columns to int
bool_cols = X.select_dtypes(include = ['bool']).columns
X[bool_cols] = X[bool_cols].astype(int)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1, stratify = y)

In [None]:
#Standardize numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

cols_to_scale = ['Distance(mi)', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Time_Diff']

scaler.fit(X_train[cols_to_scale])

X_train_std = X_train.copy()
X_test_std = X_test.copy()

X_train_std.loc[:, cols_to_scale] = scaler.transform(X_train[cols_to_scale])
X_test_std.loc[:, cols_to_scale] = scaler.transform(X_test[cols_to_scale])
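
The scaler is fitted on the training split only, so the test split is transformed with the training mean and standard deviation. The arithmetic behind that can be sketched in plain NumPy. The arrays below are hypothetical; the population standard deviation (`ddof=0`) is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical single-feature train and test splits
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

# Statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std, ddof=0

train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

print(train_std.ravel())
print(test_std.ravel())
```

The standardized training column has mean 0 and standard deviation 1 by construction; the test column generally does not, and that is expected.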

In [None]:
#Experiment:
#Let's write a binary Logistic Regression class using gradient descent
#We will then map the Severity column to only two classes (one-vs-rest style) and apply it
class LogisticRegressionGD(object):
    def __init__(self, eta = 0.1, n_iter = 30, random_state = 1):
        '''
        eta: learning rate
        n_iter: number of iterations
        '''
        self.w_ = []
        self.random_state = random_state
        self.cost_ = []
        self.eta = eta
        self.n_iter = n_iter
        
    
    def fit(self, X, y):
        '''
        X(n_samples, n_features) numpy array
        y(n_samples,)
        '''
        X = np.asarray(X)
        y = np.asarray(y)
        
        self._initialize_w(X)
        
        for _ in range(self.n_iter):
            net_input = self._net_input(X)
            activation = self._activation(net_input)
            
            #if we forget about regularization...
            cost = np.sum(-y * np.log(activation).ravel() - (1 - y) * np.log(1 - activation).ravel())
            self.cost_.append(cost)

            self.w_[1:] += (self.eta * X.T @ (y.reshape(-1, 1) - activation)).ravel()
            self.w_[0] += self.eta * (y - activation.ravel()).sum()
            
        return self
    
    def _initialize_w(self, X):
        rgen = np.random.RandomState(self.random_state)
        self.w_ = rgen.normal(loc = 0.0, scale = 0.1, size = 1 + X.shape[1])
        
    def _net_input(self, X):
        return X @ self.w_[1:].reshape(-1, 1) + self.w_[0]
    
    def _activation(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -35, 35)))
        
    def predict(self, X):
        X = np.asarray(X)
        return np.where(self._net_input(X) >= 0, 1, 0)
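
The weight update in `fit` adds `eta * X.T @ (y - activation)`, which is a step along the negative gradient of the summed log-loss. One way to convince yourself of that is a finite-difference check. The sketch below uses random synthetic data and standalone helper functions (`sigmoid`, `cost`) that are written only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, X, y):
    # w[0] is the bias and w[1:] are the feature weights, as in the class above
    p = sigmoid(X @ w[1:] + w[0])
    return np.sum(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Small synthetic problem
rng = np.random.RandomState(1)
X = rng.normal(size=(20, 3))
y = (rng.uniform(size=20) > 0.5).astype(float)
w = rng.normal(scale=0.1, size=4)

# Analytic gradient of the cost: minus the quantity added in the update step
err = y - sigmoid(X @ w[1:] + w[0])
analytic = -np.concatenate(([err.sum()], X.T @ err))

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (cost(w + eps * e, X, y) - cost(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(4)
])

print(np.max(np.abs(analytic - numeric)))
```

Because the cost is a sum rather than a mean over samples, the gradient grows with the number of rows, which is one reason a very small `eta` is needed on a large dataset.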

In [None]:
np.unique(y)

In [None]:
#make our target - to predict whether Severity is 4 or not
y_dual_train = y_train.isin([4]).astype(int) 
y_dual_test = y_test.isin([4]).astype(int)

In [None]:
eta = 0.00001
n_iter = 50
lr = LogisticRegressionGD(eta = eta, n_iter = n_iter)
lr.fit(X_train_std, y_dual_train)

In [None]:
plt.figure(figsize = (8, 6))
plt.plot(lr.cost_)
plt.xlabel('n_iter')
plt.ylabel('cost')
plt.title('Cost vs. N_iter')

In [None]:
#Let's do some predictions
y_dual_pred = lr.predict(X_test_std)

In [None]:
y_dual_actual = y_dual_test.reset_index(drop = True).rename('Actual')
y_dual_pred = pd.Series(y_dual_pred.ravel(), name = 'Predicted')

In [None]:
pd.crosstab(y_dual_actual, y_dual_pred)

#### At first glance, the results are not that bad; remember that class 1 is severity 4 and class 0 is all other severities
Now we would like to calculate the F-score to see how we did, because our target classes are imbalanced

In [None]:
from sklearn.metrics import f1_score
f1_score(y_dual_actual, y_dual_pred)

So our F-score is actually quite low if we are interested in severity 4 vs. the rest
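
To see why a crosstab can look fine while the F-score is low, it helps to compute the score by hand from the confusion counts. The labels below are hypothetical (they are not this model's predictions), with class 1 as the rare positive class.

```python
import numpy as np

# Hypothetical labels: 8 negatives and 2 positives
actual    = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
predicted = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

tp = np.sum((actual == 1) & (predicted == 1))  # true positives
fp = np.sum((actual == 0) & (predicted == 1))  # false positives
fn = np.sum((actual == 1) & (predicted == 0))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.mean(actual == predicted)

print(accuracy, precision, recall, f1)
```

Accuracy is dominated by the many correct negatives, while F1 depends only on how the positive class is handled.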

In [None]:
#Let's try scikit-learn's LogisticRegression and see how it performs on the multi-class problem
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'lbfgs', multi_class = 'ovr', max_iter = 1000, n_jobs=-1)
lr.fit(X_train_std, y_train)
y_pred = lr.predict(X_test_std)

In [None]:
y_actual = y_test.reset_index(drop = True).rename('Actual')
y_pred = pd.Series(y_pred.ravel(), name = 'Predicted')
pd.crosstab(y_actual, y_pred)

Note that severity 0 and 1 were not predicted at all because they have so few instances compared to the other classes

In [None]:
#Get F-score
f1_score(y_actual, y_pred, average = 'weighted')
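
A weighted F-score averages the per-class F1 values using class support as weights, so it can stay fairly high even when rare classes are never predicted. The sketch below reproduces that calculation by hand on hypothetical labels; `f1_per_class` is a helper written for this example, not a library function.

```python
import numpy as np

def f1_per_class(actual, predicted, label):
    # F1 for one class treated as positive; 0.0 if the class never appears
    tp = np.sum((actual == label) & (predicted == label))
    fp = np.sum((actual != label) & (predicted == label))
    fn = np.sum((actual == label) & (predicted != label))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical imbalanced labels: class 2 dominates, class 4 is rare
actual = np.array([2] * 8 + [4] * 2)
predicted = np.array([2] * 10)  # the model never predicts class 4

labels = np.unique(actual)
scores = np.array([f1_per_class(actual, predicted, c) for c in labels])
support = np.array([np.sum(actual == c) for c in labels])

macro = scores.mean()                                # every class counts equally
weighted = np.sum(scores * support) / support.sum()  # classes weighted by support

print(scores, macro, weighted)
```

Comparing the weighted score with the macro average gives a quick indication of how much the minority classes are being ignored.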

It seems a Logistic Regression model might not be a good fit for Severity classification using the processed data