**Acknowledgements**

Please cite the following papers if you use this dataset:

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident Dataset." 2019.

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

### Importing the libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import os

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # statistical plots
import matplotlib.pyplot as plt # plotting

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.

### Importing other Libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

### Importing the dataset

In [None]:
data= pd.read_csv("/kaggle/input/us-accidents/US_Accidents_May19.csv")

### **Data Pre-processing**

In [None]:
data.head()

In [None]:
data.info()

### Finding out the columns with null values

In [None]:
data.isnull().sum()

There are many columns that contain null values. We will deal with these null values later.
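
Raw null counts are easier to compare as a share of all rows. The following is a minimal sketch on a made-up toy frame (the values are invented, and only the column names are borrowed from the dataset); on the real data the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the accident data; all values are made up.
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4],
    "Temperature(F)": [60.1, np.nan, 55.0, np.nan],
    "Visibility(mi)": [np.nan, np.nan, np.nan, 10.0],
})

# isnull().mean() gives the fraction of missing values per column;
# sorting puts the columns with the most gaps first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Columns with a large missing share are candidates for dropping entirely, while columns with only a few gaps can be handled by dropping the affected rows.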

Next, we visualize the pairwise correlation between the numeric variables with a heatmap.

In [None]:
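# Correlation is only defined for numeric (and boolean) columns, so select those explicitly.
corr = data.select_dtypes(include=['number', 'bool']).corr()
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corr, annot=True, linewidths=1, linecolor='k', square=True,
            vmin=-1, vmax=1, cbar=True, cbar_kws={"orientation": "vertical"}, ax=ax)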

### **Exploratory Data Analysis**

Let us find out which state has the most recorded accidents. To do this, we plot the 10 states with the highest accident counts.

In [None]:

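fig, ax = plt.subplots()
clr = ("blue", "green", "red", "orange", "purple",'black','pink','gray','darkgreen','brown')
# value_counts() already sorts in descending order; re-sort the top 10 ascending so the largest bar is drawn on top.
data.State.value_counts().head(10).sort_values().plot(kind='barh', color=clr, ax=ax)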


* We can see that **California is the most accident-prone state** by number of recorded accidents, followed by Texas and Florida.

Let us take a look at the **weather conditions** when the accidents occurred. We will consider the **top 10 weather conditions** for this analysis.

In [None]:
fig, ax = plt.subplots()
# value_counts() returns counts in descending order, so head(10) gives the top 10 conditions.
data['Weather_Condition'].value_counts().head(10).plot.bar(width=0.5, edgecolor='k', align='center', ax=ax)
ax.set_xlabel('Weather_Condition')
ax.set_ylabel('Number of Accidents')
ax.set_title('Top 10 Weather Conditions for Accidents')
plt.show()

It can be seen that **most accidents occurred when the weather was clear**. One possible explanation is that people drive more carefully in severe weather, which lowers the probability of an accident compared with clear weather. Clear weather is also simply the most common condition, however, so raw counts alone cannot separate the two effects.

In [None]:
#Converting the date and time in the standard format.
data['time'] = pd.to_datetime(data.Start_Time, format='%Y-%m-%d %H:%M:%S')
data = data.set_index('time')
data.head()

In [None]:
# Add a 'Day' column holding the weekday name of each accident.
data['Start_Time'] = pd.to_datetime(data['Start_Time'], format="%Y-%m-%d %H:%M:%S")
# Series.dt.weekday_name was removed in pandas 1.0; day_name() is its replacement.
data['Day'] = data['Start_Time'].dt.day_name()
data.head()

In [None]:
# Bar chart of the number of accidents per day of the week
fig, ax = plt.subplots()
data['Day'].value_counts().plot.bar(width=0.5, edgecolor='k', align='center', ax=ax)
ax.set_xlabel('Day of the Week')
ax.set_ylabel('Number of accidents')
ax.tick_params(labelsize=20)
ax.set_title('Accidents per day')
plt.show()

More accidents are recorded on weekdays than on weekends.
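
Because `value_counts()` sorts by frequency, the bars above appear in count order rather than calendar order. A minimal sketch of reordering them from Monday to Sunday is shown below; the timestamps are hypothetical stand-ins for `data['Start_Time']`, and `day_order` is a helper name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical timestamps; the notebook itself would use data['Start_Time'].
start = pd.to_datetime(pd.Series([
    "2019-03-04 08:15:00",  # Monday
    "2019-03-04 17:40:00",  # Monday
    "2019-03-06 09:05:00",  # Wednesday
    "2019-03-09 13:30:00",  # Saturday
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count accidents per weekday name, then reorder Monday..Sunday;
# days that never occur in the data get a count of 0.
per_day = start.dt.day_name().value_counts().reindex(day_order, fill_value=0)
print(per_day)
```

Plotting `per_day` instead of the raw `value_counts()` result keeps the weekend bars next to each other, which makes the weekday and weekend comparison easier to read.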

#### Feature Selection for the algorithms
We will select only a subset of the columns as input to the algorithms.

In [None]:
features=['Source','TMC','Severity','Start_Lng','Start_Lat','Distance(mi)','Side','City','County',
             'State','Timezone','Temperature(F)','Humidity(%)','Pressure(in)', 'Visibility(mi)',
             'Wind_Direction','Weather_Condition','Amenity','Bump','Crossing','Give_Way','Junction',
             'No_Exit','Railway','Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal',
             'Turning_Loop','Sunrise_Sunset','Day']

In [None]:
df=data[features].copy()

We will drop the rows with the missing values in the selected features.

In [None]:
# Drop every row that has a missing value in any of the selected features.
df.dropna(how='any', axis=0, inplace=True)

Let us select the state of California for further analysis, since it has the most recorded accidents.

In [None]:
# Select the state of California
state='CA'
df_state=df.loc[df.State==state].copy()
df_state.drop('State',axis=1, inplace=True)
df_state.info()

In [None]:
# Map of accidents, color code by county

sns.scatterplot(x='Start_Lng', y='Start_Lat', data=df_state, hue='County', legend=False, s=20)
plt.show()

Within California, we will select San Francisco as the county.

In [None]:
# Select San Francisco as the county
county='San Francisco'
df_county=df_state.loc[df_state.County==county].copy()
df_county.drop('County',axis=1, inplace=True)
df_county.info()

Encoding the categorical variables and splitting the data into train and test samples.

In [None]:
#Dealing with categorical variables
#Categorical variables are converted into dummy indicator variables.
df_dummy = pd.get_dummies(df_county,drop_first=True)
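
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up three-row frame (the values are invented; only the column names come from the feature list). Each string column with k distinct values becomes k-1 indicator columns, while numeric and boolean columns pass through unchanged.

```python
import pandas as pd

# Toy frame with one numeric, one string and one boolean column (made-up values).
toy = pd.DataFrame({
    "Severity": [2, 3, 2],
    "Side": ["R", "L", "R"],
    "Crossing": [False, True, False],
})

# 'Side' has two levels (L, R); drop_first=True drops the first level (L),
# so a single indicator column 'Side_R' remains.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping the first level avoids a redundant column: a row with `Side_R` equal to 0 is known to be on the left side.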


In [None]:
target='Severity'
y=df_dummy[target]
x=df_dummy.drop(target,axis=1)

In [None]:
#Splitting using the train-test split.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)
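
If the severity classes turn out to be imbalanced, a plain random split can leave a rare class under-represented in the test set. The sketch below shows the `stratify` option of `train_test_split` on synthetic data; the 80/20 class ratio is an assumption chosen for illustration, not a property of the accident data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 80 of class 2 and 20 of class 3 (made-up ratio).
X = np.arange(200).reshape(100, 2)
y = np.array([2] * 80 + [3] * 20)

# stratify=y keeps the class proportions the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y
)

print("test class counts: ", {c: int((y_te == c).sum()) for c in (2, 3)})
print("train class counts:", {c: int((y_tr == c).sum()) for c in (2, 3)})
```

In the notebook this would correspond to passing `stratify=y` in the split above, which only works if every severity level has at least two rows.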

#### 1. Performing Logistic Regression.

In [None]:
lreg=LogisticRegression(random_state=0)
result=lreg.fit(x_train,y_train)
result


In [None]:

y_pred1=lreg.predict(x_test)
acc1=accuracy_score(y_test, y_pred1)
acc1

#### 2. Performing k-nearest neighbors (KNN)

In [None]:

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train,y_train)
y_pred2 = knn.predict(x_test)

# Get the accuracy score
acc2=accuracy_score(y_test, y_pred2)
acc2

#### 3. Decision Tree with Entropy

In [None]:

dt = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)


# Fit the entropy-based tree to the training set
dt.fit(x_train, y_train)

# Use the entropy-based tree to predict test set labels
y_pred3= dt.predict(x_test)

# Evaluate the accuracy of the entropy-based tree
acc3 = accuracy_score(y_test, y_pred3)
acc3


#### 4. Decision Tree with Gini index

In [None]:
dt_gini = DecisionTreeClassifier(max_depth=8, criterion='gini', random_state=1)


# Fit the Gini-based tree to the training set
dt_gini.fit(x_train, y_train)

# Use the Gini-based tree to predict test set labels
y_pred4= dt_gini.predict(x_test)

# Evaluate the accuracy of the Gini-based tree
accuracy_gini = accuracy_score(y_test, y_pred4)
accuracy_gini

#### 5. Random Forest Classifier

In [None]:


rfc=RandomForestClassifier(n_estimators=100)

# Train the model on the training set, then predict the test set labels
rfc.fit(x_train,y_train)

y_pred5=rfc.predict(x_test)


# Get the accuracy score
acc5=accuracy_score(y_test, y_pred5)

acc5


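The Random Forest Classifier performs best on this dataset, with an accuracy of 92.54%. Because the forest is built without a fixed `random_state`, the exact figure may vary slightly between runs.
Similarly, we can build models for other states or counties using different feature sets.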
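
An accuracy figure is easier to interpret next to a trivial baseline that always predicts the most frequent class. The following is a minimal sketch using scikit-learn's `DummyClassifier` on synthetic data; the 90/10 class ratio is an assumption for illustration and says nothing about the actual severity distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: 90 rows of class 2 and 10 rows of class 3 (made-up ratio).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([2] * 90 + [3] * 10)

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(baseline_acc)
```

Fitting the same baseline on `x_train`, `y_train` and scoring it on `x_test`, `y_test` would show how much of each model's accuracy goes beyond simply guessing the dominant severity level.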

In [None]:
y_pred5