# US Accidents Exploratory Data Analysis 

#### US-Accidents can be used for numerous applications such as real-time accident prediction, studying accident hotspot locations, casualty analysis and extracting cause and effect rules to predict accidents, or studying the impact of precipitation or other environmental stimuli on accident occurrence.

### I have used the US Accidents dataset which contains over 3 million records

## Downloading the Data

In [None]:
#pip install opendatasets --upgrade
#import opendatasets as od
datasets_url=("https://www.kaggle.com/sobhanmoosavi/us-accidents?rvi=1")
#od.download(datasets_url)

## Data Preparation and Cleaning
####     1. Load the dataset using pandas 
####     2. Look at information about the data and columns
####     3. Fix missing or incorrect values

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
sns.set_style("darkgrid")

In [None]:
df=pd.read_csv("../input/us-accidents/US_Accidents_Dec20_Updated.csv")
df.head()

#### Checking Information about the data

In [None]:
df.info()

In [None]:
df.describe()

#### How many different dtypes do we have in our dataframe, and how many columns are there of each?

In [None]:
df.dtypes.unique()

In [None]:
len(df.select_dtypes(include='number').columns)

In [None]:
len(df.select_dtypes(include='object').columns)

In [None]:
len(df.select_dtypes(include='bool').columns)
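
##### The per-dtype column counts can also be read in one step. A minimal sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for `df`):

```python
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "Severity": [2, 3],
    "Start_Lat": [34.05, 29.76],
    "City": ["Los Angeles", "Houston"],
    "Crossing": [False, True],
})

# Count how many columns share each dtype (dtype names cast to str for readable labels)
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```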

### Missing Values

In [None]:
df.isna().sum()

##### End_Lat and End_Lng have 282821 missing values. So, it is possible that these 282821 accidents were instantaneous or point accidents

In [None]:
missing_percentages=df.isna().sum().sort_values(ascending=False)/len(df)
missing_percentages=missing_percentages*100
missing_percentages=missing_percentages[missing_percentages>0]

In [None]:
missing_percentages
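
##### The same percentages can be computed more compactly, because the mean of a boolean mask is the fraction of True values. A small sketch on invented data (`toy` and `missing_pct` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some gaps
toy = pd.DataFrame({
    "End_Lat": [34.0, np.nan, np.nan, 40.7],
    "City": ["Los Angeles", None, "Houston", "Miami"],
    "Severity": [2, 3, 2, 4],
})

# isna().mean() gives the fraction of missing rows per column
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```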

In [None]:
x_values=missing_percentages.index.tolist()
y_values=missing_percentages.tolist()

In [None]:
font = {'weight' : 'bold',
        'size'   : 10}

matplotlib.rc('font', **font)
matplotlib.rcParams.update({'font.size': 30})

In [None]:
fig,ax=plt.subplots(figsize=(40,20))
plt.xticks(rotation=90)
sns.barplot(x=x_values,y=y_values,ax=ax)
plt.xlabel("Columns")
plt.ylabel("Missing values percentage")
plt.title("Missing Values for each column")
plt.show()

##### The Number column shows the street number, and because it has over 60% missing values, we might as well drop it
##### As for End_Lat and End_Lng, we can assume they were point accidents and fill them with the corresponding values from Start_Lat and Start_Lng

In [None]:
df=df.drop(columns=["Number"])
df["End_Lat"]=df["End_Lat"].fillna(df["Start_Lat"])
df["End_Lng"]=df["End_Lng"].fillna(df["Start_Lng"])

##### For City we first sort by State and then use ffill to fill the NA values

In [None]:
df=df.sort_values(by="State")
df["City"]=df["City"].ffill()
df=df.sort_index()
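
##### Forward fill after sorting by State can carry a city from one state into the first rows of the next. One alternative, shown only as a sketch and not the method used above, is to fill with the most common city within each state (`toy`, `most_common` and `City_filled` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data: the first TX row has no city
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX", "TX", "TX"],
    "City": ["Los Angeles", None, None, "Houston", None],
})

def most_common(s):
    """Return the most frequent non-missing value, or None if there is none."""
    mode = s.mode()
    return mode.iloc[0] if not mode.empty else None

# Broadcast each state's most common city back onto its rows
state_mode = toy.groupby("State")["City"].transform(most_common)

# Fill gaps only; existing cities are left untouched
toy["City_filled"] = toy["City"].fillna(state_mode)
print(toy)
```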

##### For precipitation we assume that na represents no precipitation and fill up 0 for those values

In [None]:
df["Precipitation(in)"]=df["Precipitation(in)"].fillna(0)

##### We drop the following columns as they might not be of much significance in our analysis

In [None]:
df=df.drop(columns=['Weather_Timestamp','Airport_Code','Timezone','Zipcode','Wind_Direction'])

##### For the following columns we use the most frequently occurring value, as they are of object type and have a very small percentage of missing data

In [None]:
columns=['Nautical_Twilight','Astronomical_Twilight','Civil_Twilight','Sunrise_Sunset','Weather_Condition']
for col in columns:
    df[col]=df[col].fillna(df[col].value_counts().index[0])

##### For the following columns we use the median value, as they are of numeric type and have a very small percentage of missing data

In [None]:
cols_n=['Visibility(mi)','Humidity(%)','Temperature(F)','Pressure(in)',]
for col in cols_n:
    df[col]=df[col].fillna(df[col].median())
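
##### The two rules above (most frequent value for object columns, median for numeric columns) could be wrapped in one helper. A sketch under that assumption, with a hypothetical function name `fill_simple` and invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one object and one numeric column
toy = pd.DataFrame({
    "Sunrise_Sunset": ["Day", "Day", None, "Night"],
    "Humidity(%)": [40.0, np.nan, 60.0, 80.0],
})

def fill_simple(frame):
    """Fill numeric columns with their median and other columns with their mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isna().sum() == 0:
            continue  # nothing to fill
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

filled = fill_simple(toy)
print(filled)
```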

##### Wind_Chill(F) and Wind_Speed(mph) have a significant number of missing values

In [None]:
df_hm=df.loc[:,['Wind_Chill(F)','Temperature(F)','Pressure(in)','Precipitation(in)',"Humidity(%)"]]

In [None]:
df_hm.corr()

##### Seems like Wind_Chill(F) has the highest correlation with Temperature
##### So, we first sort according to temperature and then use ffill

In [None]:
df=df.sort_values(by="Temperature(F)")
df["Wind_Chill(F)"]=df["Wind_Chill(F)"].ffill()
df=df.sort_index()
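
##### To make the sort-then-ffill step concrete, here is a small sketch on invented numbers. Each missing wind chill takes the value of the row just below it in temperature; a gap in the coldest row would stay missing because nothing precedes it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the 21 F row has no wind chill
toy = pd.DataFrame({
    "Temperature(F)": [70.0, 20.0, 45.0, 21.0],
    "Wind_Chill(F)": [70.0, 12.0, 40.0, np.nan],
})

# Sort so that neighbouring rows have similar temperatures
toy = toy.sort_values(by="Temperature(F)")

# Copy the previous (slightly colder) row's wind chill into the gap
toy["Wind_Chill(F)"] = toy["Wind_Chill(F)"].ffill()

# Restore the original row order
toy = toy.sort_index()
print(toy)
```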

In [None]:
df_hm=df.loc[:,['Wind_Speed(mph)','Wind_Chill(F)','Temperature(F)','Pressure(in)','Precipitation(in)',"Humidity(%)"]]

In [None]:
df_hm.corr()

##### Wind_Speed(mph) has the highest correlation with temperature and precipitation
##### So, we first sort by temperature and precipitation and then use ffill

In [None]:
df=df.sort_values(by=["Temperature(F)","Precipitation(in)"])
df["Wind_Speed(mph)"]=df["Wind_Speed(mph)"].ffill()
df=df.sort_index()

## Exploratory Analysis and Visualization
### Columns we will analyze:
#### 1) City
#### 2) Start time
#### 3) Start Latitude and Longitude

### City - Analysing the "City" column of the dataframe

In [None]:
len(df["City"].unique())

In [None]:
cities_by_accident=df["City"].value_counts().index[0:50]
cities_by_accident

##### New York City, the most populous city in the USA, does not show up because the data does not contain entries for New York State

###### We plot the top 50 cities by accidents

In [None]:
x_vals=cities_by_accident.tolist()
y_vals=df["City"].value_counts().tolist()[:50]

In [None]:
font = {'weight' : 'bold',
        'size'   : 50}

matplotlib.rc('font', **font)
matplotlib.rcParams.update({'font.size': 50})

In [None]:
fig,ax=plt.subplots(figsize=(60,40))
ax.barh(x_vals, y_vals)
ax.set_ylabel("Most Prone Cities")
ax.set_xlabel("Number of cases")
plt.title("Top 50 most accident prone cities")
plt.show()

##### The top 5 cities with most accidents reported are - L.A, Houston, Charlotte, Miami and Dallas

##### Next we would like to know the distribution of data

In [None]:
x=df["City"].value_counts()

In [None]:
fig,ax=plt.subplots(figsize=(60,40))
# distplot is deprecated in seaborn; kdeplot draws the same density curve
sns.kdeplot(x=x, linewidth=10, ax=ax)
plt.title("Distribution of accidents over different cities")
plt.show()

##### From the distribution plot, we can see that most cities had between 0 and 2,500 accidents. In fact, the top cities like L.A. and Houston are outliers

In [None]:
cities=df["City"].value_counts()

In [None]:
high_accident_cities=cities[cities>=2500]
low_accident_cities=cities[cities<2500]

In [None]:
len(high_accident_cities)*100/len(cities)

In [None]:
len(low_accident_cities)*100/len(cities)

##### As we can see our assumption was correct
##### Only 1.49% of total cities have higher than 2,500 accidents
##### Whereas 98.50% of total cities have lower than 2,500 accidents
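
##### The split around the 2,500 threshold can also be computed directly from a boolean mask. A sketch with invented counts (`cities_toy` is a hypothetical stand-in for `cities`):

```python
import pandas as pd

# Hypothetical accident counts per city
cities_toy = pd.Series({"A": 5000, "B": 100, "C": 30, "D": 2500})

# Mean of a boolean mask = share of cities meeting the condition
high_share = (cities_toy >= 2500).mean() * 100
low_share = 100 - high_share
print(high_share, low_share)
```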

### Start Time - Analysing the Start_Time column of the dataframe

In [None]:
df["Start_Time"]=pd.to_datetime(df["Start_Time"])

In [None]:
df["Hour"]=df["Start_Time"].dt.hour

In [None]:
df["Hour"].value_counts()

In [None]:
x_val=df["Hour"].value_counts().index
y_val=df["Hour"].value_counts().tolist()

In [None]:
fig,ax=plt.subplots(figsize=(40,30))
sns.barplot(x=x_val, y=y_val, ax=ax)
plt.xlabel("Time")
plt.ylabel("Number of Accidents")
plt.title("Number of accidents for each hour of the day")
plt.show()

##### From the bar plot we can see that the number of accidents increases until 8 A.M., the morning rush hour, and then decreases a little. It increases again until 5 P.M., the evening rush hour. Hence we can say that most accidents occur during office commute times

In [None]:
accident_times=df["Hour"].value_counts()
accident_times=accident_times.sort_index()

In [None]:
rush_hour=accident_times[[6,7,8,15,16,17]]

In [None]:
rush_hour_accidents=rush_hour.sum()
total_accidents=len(df)
rush_hour_accidents*100/total_accidents

##### Therefore we can see that our assumption was correct: more than 40% of accidents occur between 6 A.M. and 8 A.M. and between 3 P.M. and 5 P.M., which are the office commute times
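
##### A variant of the rush-hour calculation that works on normalized counts and tolerates hours with no records. This is a sketch on invented hours; `hours` and `rush_share` are hypothetical names:

```python
import pandas as pd

# Hypothetical accident hours; note that hour 15 never occurs
hours = pd.Series([6, 7, 12, 16, 23, 8, 2, 17, 13, 9], name="Hour")
rush_hours = [6, 7, 8, 15, 16, 17]

# Share of accidents per hour, ordered by hour of day
hour_share = hours.value_counts(normalize=True).sort_index()

# reindex inserts 0 for hours that never appear, so the lookup cannot fail
rush_share = hour_share.reindex(rush_hours, fill_value=0).sum() * 100
print(rush_share)
```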

##### Next, we would like to know at what time the most severe accidents occur

In [None]:
df.columns

In [None]:
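# Align every severity's counts to the same 0-23 hour axis so each bar sits at its own hour
x_val=np.arange(24)
y_val1=df[df["Severity"]==1]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()
y_val2=df[df["Severity"]==2]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()
y_val3=df[df["Severity"]==3]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()
y_val4=df[df["Severity"]==4]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()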

In [None]:
width=0.25
fig,ax=plt.subplots(figsize=(40,30))
plt.bar(x_val-2*width, y_val1, color='green', width=width, label='1')
plt.bar(x_val-width, y_val2, color='blue', width=width, label='2')
plt.bar(x_val, y_val3, color='yellow', width=width, label='3')
plt.bar(x_val+width, y_val4, color='red', width=width, label='4')
plt.xlabel("Time")
plt.ylabel("Number of Accidents")
plt.title("Number of accidents for different severity")
plt.legend(title="Severity")
plt.show()

In [None]:
width=0.25
fig,ax=plt.subplots(figsize=(40,30))
plt.bar(x_val, y_val1, color='green', width=width, label='1')
plt.bar(x_val+width, y_val4, color='red', width=width, label='4')
plt.xlabel("Time")
plt.ylabel("Number of Accidents")
plt.title("Accidents of severity 1 and 4")
plt.legend(title="Severity")
plt.show()

In [None]:
sev1=df[df["Severity"]==1]["Hour"]
sev2=df[df["Severity"]==2]["Hour"]
sev3=df[df["Severity"]==3]["Hour"]
sev4=df[df["Severity"]==4]["Hour"]

In [None]:
fig,ax=plt.subplots(figsize=(40,30))
# distplot is deprecated in seaborn; kdeplot draws the same density curves
sns.kdeplot(x=sev4, color='red', linewidth=5, label='4', ax=ax)
sns.kdeplot(x=sev3, color='yellow', linewidth=5, label='3', ax=ax)
sns.kdeplot(x=sev2, color='blue', linewidth=5, label='2', ax=ax)
sns.kdeplot(x=sev1, color='green', linewidth=5, label='1', ax=ax)
plt.legend(title="Severity")
plt.title("Distribution of Accidents of different severity over all hours")
plt.show()

##### From analysing the severity, we find that accidents of severity 2 and 3 follow the commute time frame. However, those of severity 1 and 4 do not.
##### Accidents of severity 4 occur fairly evenly at any hour of the day, which may point to reckless driving.
##### Accidents of severity 1 have a peak after office hours, which may suggest they are caused by non-regular drivers.
##### We can in some ways also infer that regular drivers are more likely to face accidents of severity 2 and 3
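
##### The hourly shape of each severity level can also be compared numerically with a cross-tabulation. A sketch on invented rows; dividing each column by its total puts rare and common severities on the same scale (`counts` and `shares` are hypothetical names):

```python
import pandas as pd

# Hypothetical accidents with an hour and a severity
toy = pd.DataFrame({
    "Hour": [8, 8, 17, 17, 2, 8],
    "Severity": [2, 2, 3, 4, 4, 1],
})

# Rows = hour of day (0-23), columns = severity, values = number of accidents
counts = pd.crosstab(toy["Hour"], toy["Severity"]).reindex(range(24), fill_value=0)

# Turn each severity column into a distribution over hours
shares = counts.div(counts.sum(axis=0), axis=1)
print(shares.loc[[2, 8, 17]])
```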

##### Another thing we would like to investigate is the trend during weekends

In [None]:
df["Day"]=df["Start_Time"].dt.dayofweek

In [None]:
df_weekends=df[(df["Start_Time"].dt.dayofweek==6)|(df["Start_Time"].dt.dayofweek==5)]

In [None]:
# Align every severity's counts to the same 0-23 hour axis so each bar sits at its own hour
x_val=np.arange(24)
y_val1=df_weekends[df_weekends["Severity"]==1]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()
y_val2=df_weekends[df_weekends["Severity"]==2]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()
y_val3=df_weekends[df_weekends["Severity"]==3]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()
y_val4=df_weekends[df_weekends["Severity"]==4]["Hour"].value_counts().reindex(x_val,fill_value=0).tolist()

In [None]:
width=0.25
fig,ax=plt.subplots(figsize=(40,30))
plt.bar(x_val-2*width, y_val1, color='green', width=width, label='1')
plt.bar(x_val-width, y_val2, color='blue', width=width, label='2')
plt.bar(x_val, y_val3, color='yellow', width=width, label='3')
plt.bar(x_val+width, y_val4, color='red', width=width, label='4')
plt.xlabel("Time")
plt.ylabel("Number of Accidents")
plt.title("Accidents for each hour of the day during weekends")
plt.legend(title="Severity")
plt.show()

##### As we can see, during the weekends the peak occurs between 10 A.M. and 6 P.M., when people are most likely to go out

In [None]:
sev1=df_weekends[df_weekends["Severity"]==1]["Hour"]
sev2=df_weekends[df_weekends["Severity"]==2]["Hour"]
sev3=df_weekends[df_weekends["Severity"]==3]["Hour"]
sev4=df_weekends[df_weekends["Severity"]==4]["Hour"]

In [None]:
fig,ax=plt.subplots(figsize=(40,30))
# distplot is deprecated in seaborn; kdeplot draws the same density curves
sns.kdeplot(x=sev4, color='red', linewidth=5, label='4', ax=ax)
sns.kdeplot(x=sev3, color='yellow', linewidth=5, label='3', ax=ax)
sns.kdeplot(x=sev2, color='blue', linewidth=5, label='2', ax=ax)
sns.kdeplot(x=sev1, color='green', linewidth=5, label='1', ax=ax)
plt.legend(title="Severity")
plt.title("Distribution of Accidents of different severities over all hours during weekends")
plt.show()

##### As for the distribution of severity of accidents, it follows the same trend as in the weekdays
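
##### Weekday and weekend hourly profiles can be placed side by side in one table. A sketch on invented timestamps; `Is_Weekend` and `profile` are hypothetical names, and each row of `profile` sums to 1:

```python
import pandas as pd

# Hypothetical timestamps: 2020-01-04 and 2020-01-05 fall on a weekend
toy = pd.DataFrame({"Start_Time": pd.to_datetime([
    "2020-01-04 11:00", "2020-01-05 14:00", "2020-01-06 08:00",
    "2020-01-07 17:00", "2020-01-04 13:00",
])})
toy["Hour"] = toy["Start_Time"].dt.hour

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday
toy["Is_Weekend"] = toy["Start_Time"].dt.dayofweek >= 5

# Share of accidents per hour, computed separately for weekdays and weekends
profile = pd.crosstab(toy["Is_Weekend"], toy["Hour"], normalize="index")
print(profile)
```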

##### Next we would like to know the trend for each month

In [None]:
df["Month"]=df["Start_Time"].dt.month
df_17=df[df["Start_Time"].dt.year>2016]

In [None]:
x=df["Month"].value_counts().index.tolist()
y=df["Month"].value_counts().to_list()

In [None]:
fig,ax=plt.subplots(figsize=(40,30))
sns.barplot(x=x,y=y,ax=ax)
plt.xlabel("Month of the year")
plt.ylabel("Number of accidents")
plt.title("Number of Accidents for each month")
plt.show()

##### From the plot we can see that accidents are more frequent in the colder months, especially around the year-end festive season

In [None]:
mask=(df["Month"]<3)|(df["Month"]>8)
len(df[mask])/len(df)

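##### Almost 60% of all accidents happen between September and February (autumn and winter), compared with the 50% expected if accidents were spread evenly over the year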
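
##### Raw monthly totals can be skewed if some years are only partly covered by the data, so a per-year breakdown is a useful cross-check. A sketch on invented timestamps (`by_year_month` is a hypothetical name; whether the real data has such gaps is an assumption to verify):

```python
import pandas as pd

# Hypothetical timestamps: 2016 only has records late in the year
toy = pd.DataFrame({"Start_Time": pd.to_datetime([
    "2016-11-03", "2016-12-24", "2017-01-15",
    "2017-06-01", "2017-12-20", "2017-12-31",
])})

# Rows = year, columns = month, values = number of accidents
by_year_month = pd.crosstab(
    toy["Start_Time"].dt.year.rename("Year"),
    toy["Start_Time"].dt.month.rename("Month"),
)
print(by_year_month)
```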

### Starting Latitude and Longitude

In [None]:
df.columns

In [None]:
fig,ax=plt.subplots(figsize=(40,30))
sns.scatterplot(x="Start_Lng", y="Start_Lat", data=df)
plt.title("Density of accidents over all of USA")
plt.show()

##### From the plot we see that a higher density of accidents is reported along the coastal areas, where population density is also higher.
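
##### With millions of overlapping markers a scatter plot saturates, so density is easier to judge from binned counts. A sketch using `np.histogram2d` on randomly generated coordinates (the clusters and bin counts are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coordinates: one dense cluster and one sparse cluster
lng = np.concatenate([rng.normal(-118.0, 0.5, 500), rng.normal(-95.0, 0.5, 100)])
lat = np.concatenate([rng.normal(34.0, 0.5, 500), rng.normal(30.0, 0.5, 100)])

# Count points per longitude/latitude cell
counts, lng_edges, lat_edges = np.histogram2d(lng, lat, bins=[20, 10])

# Locate the cell with the most points
i, j = np.unravel_index(counts.argmax(), counts.shape)
print("densest cell:", (lng_edges[i], lng_edges[i + 1]), (lat_edges[j], lat_edges[j + 1]))
```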

##### We want to see where the density of accidents is highest on each day of the week

In [None]:
df["Day"].unique()

In [None]:
color=['red','blue','olive','green','darkcyan','orange','brown']

In [None]:
for x in sorted(df["Day"].unique().tolist()):
    df_x=df[df["Day"]==x]
    fig,ax=plt.subplots(figsize=(40,30))
    sns.scatterplot(x="Start_Lng", y="Start_Lat", data=df_x, color=color[x])
    plt.title("Accidents on day {} of the week".format(x))
    plt.show()

##### All days of the week show almost the same pattern in terms of location of accidents

##### We will see the distribution of accidents during night and day

In [None]:
df["Sunrise_Sunset"].unique()

In [None]:
df_x=df[df["Sunrise_Sunset"]=='Day']
fig,ax=plt.subplots(figsize=(40,30))
sns.scatterplot(x="Start_Lng", y="Start_Lat", data=df_x, color='indigo')
plt.title("Accidents during Daytime")
plt.show()

In [None]:
df_x=df[df["Sunrise_Sunset"]=='Night']
fig,ax=plt.subplots(figsize=(40,30))
sns.scatterplot(x="Start_Lng", y="Start_Lat", data=df_x, color='darkslategrey')
plt.title("Accidents during Night")
plt.show()

In [None]:
x_vals=(df["Sunrise_Sunset"].value_counts().index.tolist())
y_vals=(df["Sunrise_Sunset"].value_counts().tolist())

In [None]:
fig,ax=plt.subplots(figsize=(40,30))
sns.barplot(x=x_vals,y=y_vals,ax=ax)
plt.title("Accidents during day and night")
plt.show()

##### Number of accidents during the day is more than number of accidents during night

## Insights
#### Cities ----
##### -No data for New York
##### -Less than 2% cities have more than 2,500 accidents
##### -The top 5 cities with most accidents reported are - L.A, Houston, Charlotte, Miami and Dallas
#### Start_Time ----
##### - Over 40% of accidents occur during the commute times of 6 A.M. to 8 A.M. and 3 P.M. to 5 P.M.
##### - Accidents of severity 1 and 4 do not follow this time frame, with accidents of severity 4 evenly distributed and those of severity 1 having a peak after office hours
##### - Accidents of severity 4 may point to reckless driving, and those of severity 1 are possibly caused by non-regular drivers
##### - During the weekends the peak occurs between 10 A.M. and 6 P.M., when people are most likely to go out
##### - As for the distribution of severity of accidents during weekends, it follows the same trend as on weekdays
##### - Accidents are more frequent in the colder months, especially around the year-end festive season
##### - Almost 60% of all accidents happen between September and February
#### Latitude and Longitude ----
##### - The coastal regions show a higher density of reported accidents and are also the more densely populated regions of the USA
##### - All days of the week show roughly the same pattern in terms of location of accidents
##### - Number of accidents during day is higher than number of accidents during night
