In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("/kaggle/input/us-accidents/US_Accidents_June20.csv")

In [None]:
print(df.columns.tolist())

Some important features here to analyze are:
* Severity
* Start Time
* State of Accident (Does one state have more accidents than others?)
* Presence of Crossings, Junctions, Railways, Roundabouts, Stations, Stop
* Precipitation and Visibility

We can use this data to investigate what causes accidents, as well as how some of these features contribute to higher-severity accidents. The features listed above are the ones I'm going to focus on, based on an initial review of what I believe are the most likely indicators of car accidents.

# Part 1: Dealing with Missing Data/Preprocessing

First, we should deal with missing data and work on preprocessing our data to be in a form where it is usable.

In [None]:
df.isnull().sum(axis = 0).sort_values(ascending = False).head(25)

In [None]:
len(df)
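
Raw null counts are easier to judge as a share of all rows. The sketch below shows one way to compute that share with `isnull().mean()`. The `toy` frame and its values are made up for illustration and only borrow column names from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accidents table (values are invented for illustration).
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4],
    "Visibility(mi)": [10.0, np.nan, 5.0, 10.0],
    "Precipitation(in)": [np.nan, np.nan, 0.0, np.nan],
})

# isnull() yields booleans, so the column mean is the fraction of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Show the result as percentages.
print((missing_share * 100).round(1))
```

Running the same two lines on `df` would give the percentage of missing values per column for the real data.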

We have null values for some important features such as visibility, precipitation, and weather condition. To simplify the dataset, I'm going to remove some less important indicators below. For the sake of this exercise, I am removing the following indicators; in real practice, I would do some research to understand which features actually affect car accidents so I don't remove important features prematurely.

In [None]:
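# Drop columns that are not used in this analysis.
# "Humidity(%)" is kept because it is imputed in a later cell.
# The wind direction column is assumed to be named "Wind_Direction" (no unit suffix).
df = df.drop(['Number', 'Wind_Chill(F)', "End_Lat", "End_Lng", "Timezone", "Airport_Code", "Weather_Timestamp", "Zipcode",
              "Pressure(in)", "Wind_Direction"], axis=1)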

In [None]:
df.isnull().sum(axis = 0).sort_values(ascending = False).head(10)

Now we should drop rows with missing values in the following columns: astronomical twilight, nautical twilight, civil twilight, and description. Dropping them won't noticeably affect the dataset because so few values are missing.

In [None]:
df = df.dropna(subset=["Astronomical_Twilight", "Nautical_Twilight", "Civil_Twilight", "Description"])

To fill in the missing precipitation values, we use the median of the observed precipitation values (we don't use the mean because it is more sensitive to skewed data). Another step I could have taken is to create a separate column recording whether the precipitation value was originally null. I do the same for visibility, temperature, and humidity here. It is important to note that this might not be the best course of action if the visibility distribution has a large spread. However, NaN values account for only 2% of the data, so we're hoping this doesn't have a large impact on our overall results.

In [None]:
df['Precipitation(in)'] = df['Precipitation(in)'].fillna(df['Precipitation(in)'].median())
df['Visibility(mi)'] = df['Visibility(mi)'].fillna(df['Visibility(mi)'].median())
df['Temperature(F)'] = df['Temperature(F)'].fillna(df['Temperature(F)'].median())
df['Humidity(%)'] = df['Humidity(%)'].fillna(df['Humidity(%)'].median())
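
The separate "was it null" column mentioned above could look roughly like this. It is a minimal sketch on a made-up `toy` frame; the column name `Precipitation_missing` is a hypothetical choice, not something defined in the dataset.

```python
import numpy as np
import pandas as pd

# Toy precipitation column with two missing readings (invented values).
toy = pd.DataFrame({"Precipitation(in)": [0.0, np.nan, 0.1, 0.0, np.nan]})

# Record missingness BEFORE imputing; after fillna the information is gone.
toy["Precipitation_missing"] = toy["Precipitation(in)"].isnull()

# median() skips NaN by default, so it is computed from the observed values only.
median = toy["Precipitation(in)"].median()
toy["Precipitation(in)"] = toy["Precipitation(in)"].fillna(median)

print(toy)
```

Keeping the flag lets a later model or plot distinguish measured values from imputed ones.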

# Part 2: Data Analysis and Severity Analysis

Next, we can do some EDA to understand when accidents happen, why they happen, and where they happen the most.

# State

In [None]:
plt.figure(figsize=(17,8))
ax = df['State'].value_counts().plot(kind='bar')
ax.set_title("Accident count by state")
ax.set_xlabel("State")

**CA, TX, FL, SC, NC** are the leaders here in car accidents. We see a huge skew towards California. Why? Is this dataset representative, or did the sampling technique cause more California accidents to be scraped?
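
To put a number on the skew, the counts can be turned into shares of all records. This is a small sketch on invented state labels; it quantifies the imbalance but does not by itself answer whether the sampling is representative.

```python
import pandas as pd

# Toy state column (invented records, only the labels mirror the dataset).
toy = pd.DataFrame({"State": ["CA", "CA", "CA", "TX", "TX", "FL"]})

# normalize=True returns each state's fraction of all rows instead of a raw count.
state_share = toy["State"].value_counts(normalize=True)

print((state_share * 100).round(1))
```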

In [None]:
plt.figure(figsize=(17,8))
ax = df['Severity'].value_counts().plot(kind='bar')
ax.set_title("Accident count by severity")
ax.set_xlabel("Severity Rating")

Most accidents received a severity rating of 2. There are fewer accidents with a rating of 4, which means that accidents with a rating of 3 will also be an important part of understanding severe accidents.

In [None]:
print(df["Severity"].value_counts())

# Time of Accident

In [None]:
df['time'] = pd.to_datetime(df["Start_Time"], format='%Y-%m-%d %H:%M:%S')
pd.DatetimeIndex(df['time']).month.value_counts().sort_index().plot.bar(width=0.5,align='center')
plt.title("Accident Count by Month")
plt.xlabel("Month")
plt.ylabel("Accident Count")


For some reason, there are fewer accidents in July. As a general rule, though, the number of accidents does not seem to depend much on the month.

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

df['weekday'] = df['time'].dt.dayofweek
df['weekday'].value_counts().sort_index().plot.bar(width=0.5,align='center')
plt.title("Accident Count by Day of Week")
plt.xlabel("Day of Week (0 = Monday)")
plt.ylabel("Accident Count")


Here we can see that Saturdays and Sundays (days 5 and 6, since `dayofweek` counts from Monday = 0) have far fewer accidents than the other days of the week.
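
Numeric weekday codes are easy to misread, so one option is to count by day name instead. The sketch below uses a few invented timestamps in the same format as `Start_Time`; `day_order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Toy timestamps in the same format as Start_Time (invented values).
toy = pd.DataFrame({"Start_Time": [
    "2020-06-01 08:15:00",  # Monday
    "2020-06-02 17:30:00",  # Tuesday
    "2020-06-06 23:05:00",  # Saturday
    "2020-06-08 07:45:00",  # Monday
]})
toy["time"] = pd.to_datetime(toy["Start_Time"], format="%Y-%m-%d %H:%M:%S")

# Fix the display order; reindex also fills in days with no accidents.
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
counts = toy["time"].dt.day_name().value_counts().reindex(day_order, fill_value=0)

print(counts)
```

Calling `counts.plot.bar()` would then label the bars with day names in calendar order.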

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

# Store the hour in its own column so the 'weekday' column is not overwritten
df['hour'] = df['time'].dt.hour
df['hour'].value_counts().sort_index().plot.bar(width=0.5,align='center')
plt.title("Accident Count by Hour")
plt.xlabel("Hour")
plt.ylabel("Accident Count")

In [None]:
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])

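# Store the hour in its own column so the 'weekday' column is not overwritten
df['hour'] = df['time'].dt.hour
# Keep only the most severe accidents (Severity == 4) and count them per hour
df.loc[df["Severity"] == 4, 'hour'].value_counts().sort_index().plot.bar(width=0.5,align='center')
plt.title("Severe (Severity 4) Accident Count by Hour")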
plt.xlabel("Hour")
plt.ylabel("Accident Count")

Accidents occurred most frequently at 7 and 8 in the morning and at 4 and 5 in the afternoon, and were least frequent after midnight, as expected.

However, extremely severe accidents occurred at pretty much any time of day; there were no major differences in when severe accidents occurred.
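
Because there are far fewer severity 4 accidents than accidents overall, the two hourly profiles are easier to compare as shares than as counts. The sketch below lines them up side by side on invented data; the `hour` column and the `comparison` frame are names assumed for this example.

```python
import pandas as pd

# Toy records: hour of day and severity rating (invented values).
toy = pd.DataFrame({
    "hour": [7, 8, 8, 16, 17, 2, 8, 23],
    "Severity": [2, 2, 3, 2, 4, 4, 2, 4],
})

# Share of all accidents that fall in each hour.
all_share = toy["hour"].value_counts(normalize=True).sort_index()

# Same calculation restricted to the most severe accidents.
severe_share = (
    toy.loc[toy["Severity"] == 4, "hour"]
    .value_counts(normalize=True)
    .sort_index()
)

# Align both profiles on the hour; hours absent from one side become 0.
comparison = pd.DataFrame({"all": all_share, "severity_4": severe_share}).fillna(0)

print(comparison)
```

A flat `severity_4` column next to a peaked `all` column would support the observation that severe accidents are spread more evenly across the day.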

In [None]:
plt.subplots(figsize=(12,5))
df['Weather_Condition'].value_counts().head(20).plot.bar(width=0.5,align='center')
plt.xlabel('Weather Condition',fontsize=16)
plt.ylabel('Accident Count',fontsize=16)
plt.title('Top 20 Weather Conditions by Accident Count',fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

Since most accidents here happen during clear weather, we cannot tell from this plot alone whether inclement weather causes more accidents. It is interesting, however, that light rain and light snow are associated with more accidents than rain and regular snow. Is this because drivers tend to slow down in heavy rain and heavy snow, whereas they might not adjust their driving for light rain and snow?


Now we'll try to see how some of these factors actually affect the severity of the accident rather than just looking at whether an accident occurs or not.

In [None]:
plt.title("Severity with Fog")
df.loc[df["Weather_Condition"] == "Fog"]['Severity'].value_counts().plot.pie(autopct='%1.0f%%',fontsize=16)

In [None]:
plt.title("Severity with Light Rain")
df.loc[df["Weather_Condition"] == "Light Rain"]['Severity'].value_counts().plot.pie(autopct='%1.0f%%',fontsize=16)

In [None]:
plt.title("Severity with Rain")
df.loc[df["Weather_Condition"] == "Rain"]['Severity'].value_counts().plot.pie(autopct='%1.0f%%',fontsize=16)

In [None]:
plt.title("Severity with Heavy Rain")
df.loc[df["Weather_Condition"] == "Heavy Rain"]['Severity'].value_counts().plot.pie(autopct='%1.0f%%',fontsize=16)

In [None]:
plt.title("Severity with Snow")
df.loc[df["Weather_Condition"] == "Snow"]['Severity'].value_counts().plot.pie(autopct='%1.0f%%',fontsize=16)

The share of Level 3 and Level 4 accidents increases as we go from light rain to heavy rain and snow. This means that accidents are more likely to be severe in heavy rain and snow than in light rain and snow.
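
The comparison across pie charts can also be read from a single table. A possible sketch, on invented rows, uses `pd.crosstab` with `normalize="index"` so that each weather condition's row sums to 1; the `severe` series is a name introduced here for the combined share of ratings 3 and 4.

```python
import pandas as pd

# Toy records (invented values; only the labels mirror the dataset).
toy = pd.DataFrame({
    "Weather_Condition": ["Light Rain", "Light Rain", "Light Rain", "Light Rain",
                          "Heavy Rain", "Heavy Rain", "Snow", "Snow"],
    "Severity": [2, 2, 2, 3, 2, 3, 3, 4],
})

# Rows: weather condition. Columns: severity. Values: share within each condition.
shares = pd.crosstab(toy["Weather_Condition"], toy["Severity"], normalize="index")

# Combined share of severity 3 and 4 accidents per condition.
severe = shares[[3, 4]].sum(axis=1)

print(shares.round(2))
print(severe.round(2))
```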

In [None]:
road_features = ['Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
for s in road_features:
    # Severity ratings of the accidents recorded near this road feature
    near_feature = df.loc[df[s] == True, 'Severity']
    # Skip features that never appear in the data
    if len(near_feature) > 0:
        # One figure per feature, holding a single pie chart
        plt.figure(figsize=(12,5))
        near_feature.value_counts().plot.pie(autopct='%1.0f%%',fontsize=16)
        plt.title('Accident Severity Near ' + s,fontsize=16)

From this information, we find that junctions, give ways, railways, and no exits are associated with higher rates of severe crashes.
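
A pie chart for accidents near a feature is easier to interpret against a baseline of accidents away from that feature. The sketch below compares the share of severe accidents in both groups on invented data; treating severity 3 or higher as "severe" and the group labels are assumptions made for this example.

```python
import pandas as pd

# Toy records (invented values; column names mirror the dataset).
toy = pd.DataFrame({
    "Junction": [True, True, False, False, False, True],
    "Severity": [3, 2, 2, 2, 3, 4],
})

# Assumption for this sketch: severity 3 and 4 count as "severe".
toy["severe"] = toy["Severity"] >= 3

# Readable group labels instead of True/False.
group = toy["Junction"].map({True: "near junction", False: "elsewhere"})

# Mean of a boolean column is the share of severe accidents in each group.
severe_rate = toy.groupby(group)["severe"].mean()

print(severe_rate.round(2))
```

The same pattern could be looped over the other road feature columns to see which ones stand out from their baseline.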

From this notebook, we were able to find that certain states had higher accident counts (CA specifically). We see that certain road features, such as crossings, junctions, and railways, are associated with more severe accidents, and that inclement weather is also associated with more severe accidents overall.

After this initial review, if I were to expand on this study, I would attempt to find a correlation between visibility and severity of accidents. In addition, I would try to fit a logistic regression model to the dataset to find out whether we could predict the severity of an accident from a variety of factors. This data analysis is consistent with my hypothesis that inclement weather leads to more severe accidents and that certain junctions and road features are also linked to higher-severity accidents. This shows us that there are many important factors to consider when looking at the severity and frequency of accidents around the country.