In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Exploratory Data Analysis on US Accidents

### Introduction
This dataset contains about 3 million car accident records captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks.

The license for this dataset can be found [here](https://creativecommons.org/licenses/by-nc-sa/4.0/).

Key facts about the dataset:

* It contains information about accidents in the US.
* It can be useful for studying how to prevent accidents.
* It does not appear to contain data about New York City (this is checked later in the notebook).

## Loading the dataset

In [None]:
dataset = pd.read_csv('../input/us-accidents/US_Accidents_Dec20_Updated.csv')

## Data preparation and cleaning

In [None]:
dataset.columns

In [None]:
dataset

In [None]:
dataset.shape

In [None]:
dataset.describe()

In [None]:
dataset.info()

Finding the number of numerical columns using pandas.

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numeric_df = dataset.select_dtypes(include=numerics)
len(numeric_df.columns)

So this dataset contains 14 numerical columns.
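
As a compact alternative to listing each dtype by name, `select_dtypes` also accepts the generic `'number'` selector. The sketch below uses a small made-up frame (its values are purely illustrative, not taken from the dataset) to show which columns are picked up. Note that boolean columns are not counted as numeric.

```python
import pandas as pd

# Toy frame standing in for the accidents data (values are illustrative only)
toy = pd.DataFrame({
    "Severity": [2, 3, 2],
    "Temperature(F)": [36.9, 37.9, 36.0],
    "City": ["Dayton", "Reynoldsburg", "Williamsburg"],
    "Bump": [False, False, True],
})

# 'number' matches every integer and float dtype, but not bool or object
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```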

Now let's find the percentage of missing values in each column of this dataset.

In [None]:
missing_values_percent_per_col = (dataset.isna().sum().sort_values(ascending = False) / len(dataset)) * 100
missing_values_percent_per_col

We can see that some of the columns have missing values in them.
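
The same percentages can be obtained a little more directly, because the mean of a boolean mask is the fraction of `True` entries. This is a minimal sketch on a made-up frame; the column values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are illustrative only)
toy = pd.DataFrame({
    "City": ["Dayton", None, "Columbus", "Dayton"],
    "Wind_Chill(F)": [np.nan, np.nan, 33.3, np.nan],
    "Severity": [2, 3, 2, 2],
})

# isna() gives True/False per cell; the mean of booleans is the fraction of True
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have gaps
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```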

Let's show the percentage of missing values on a bar plot. Columns with no missing values are not informative here, so we filter them out first.

In [None]:
import seaborn as sns
sns.set_style("darkgrid")

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
missing_values_percent_per_col[missing_values_percent_per_col != 0].plot.barh(color = 'cornflowerblue')
plt.xticks(fontsize=15)
plt.yticks(fontsize=20)
plt.xlabel('Percentage of missing values', fontsize=20)
plt.title('Missing values percentage per column', fontsize = 20)

plt.show()


In [None]:
dataset.columns

## Exploratory analysis and Visualization

Columns that we will analyse:
1. City
2. Start_Time
3. Start_Lat, Start_Lng
4. Temperature(F)
5. Weather_Condition


In [None]:
dataset.City

In [None]:
cities = dataset.City.unique()
len(cities)

This dataset contains data from 11,790 US cities. According to World Population Review, there are over 19,000 incorporated places in the US, so this dataset does not cover every city in the country.
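
One detail worth keeping in mind when counting distinct cities: `unique()` returns a missing value as its own entry, while `nunique()` ignores it by default. The toy Series below is an assumption used only to show the difference.

```python
import numpy as np
import pandas as pd

# Toy City column with one missing entry (illustrative values)
city = pd.Series(["Dayton", "Columbus", np.nan, "Dayton"])

with_nan = len(city.unique())   # NaN counts as its own entry here
without_nan = city.nunique()    # NaN is dropped by default
print(with_nan, without_nan)
```

If the City column has missing values, the count obtained from `len(unique())` is therefore one higher than the number of named cities.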

Next, let's see which cities have the highest number of accidents.

In [None]:
cities_by_accidents = dataset.City.value_counts()
cities_by_accidents

It looks like Los Angeles has the highest number of recorded accidents, even though the most populous city in the US is New York. We have to check whether this dataset contains data from New York.

In [None]:
cities_by_accidents[:20]

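In [None]:
# `in` on a Series tests the index labels, so compare against the values instead
'New York' in dataset.City.values

In [None]:
'NY' in dataset.State.values

If both checks return False, this dataset does not contain data from New York City.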
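
One caveat when testing membership on a pandas Series: the `in` operator looks at the index labels, not at the values. The short sketch below, built on a made-up Series, shows how the two checks can disagree and gives two value-based alternatives.

```python
import pandas as pd

# Toy Series with the default integer index 0, 1, 2
city = pd.Series(["Los Angeles", "Houston", "New York"])

in_index = "New York" in city            # tests the index labels 0, 1, 2
in_values = "New York" in city.values    # tests the actual entries
any_match = city.eq("New York").any()    # vectorised equivalent
print(in_index, in_values, any_match)
```

Only the value-based checks say anything about whether a city is present in the column.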

Let's plot a bar graph of the 20 cities with the most accidents in the US.

In [None]:
plt.figure(figsize=(20,10))
cities_by_accidents[:20].plot.barh(color = 'cornflowerblue')
plt.xticks(fontsize=15)
plt.yticks(fontsize=20)
plt.xlabel('Number of accidents', fontsize=20)
plt.title('Top 20 US cities by number of accidents', fontsize = 20)

plt.show()

To check whether most cities have a small or a large number of accidents, we can plot a histogram.

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(cities_by_accidents, log_scale = True, color = 'cornflowerblue')
plt.xticks(fontsize=15)

plt.yticks(fontsize=15)
plt.xlabel('Number of accidents on log scale', fontsize=20)
plt.ylabel('Number of cities', fontsize=20)

plt.title('Distribution of accidents in the cities of the US', fontsize = 20)

plt.show()

From the histogram, we can see that only a few cities have a high number of accidents while the rest have far fewer. The number of accidents per city decreases roughly exponentially.
Let's segregate the cities into high-accident and low-accident groups.

In [None]:
high_accident_cities = cities_by_accidents[cities_by_accidents >= 1000]
low_accident_cities = cities_by_accidents[cities_by_accidents < 1000]

Let's find the percentage of cities with at least 1000 recorded accidents, i.e. the percentage of high-accident cities. Note that these counts cover the whole dataset, not a single year.

In [None]:
(len(high_accident_cities) / len(cities)) * 100

It shows that about 4 percent of the cities recorded at least 1000 accidents over the period covered by the dataset.

In [None]:
cities_by_accidents[cities_by_accidents ==1]

Over 1300 cities have reported just 1 accident over the entire period. This needs to be investigated.
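
The threshold and single-accident checks above can also be written as boolean masks. The sketch below uses hypothetical per-city totals (the city labels and numbers are made up) to show the arithmetic on a scale that is easy to verify by hand.

```python
import pandas as pd

# Hypothetical per-city accident totals over the whole dataset
counts = pd.Series({"A": 5000, "B": 1200, "C": 40, "D": 1, "E": 1})

# Percentage of cities at or above the cutoff
high_share = (counts >= 1000).mean() * 100

# Number of cities that appear exactly once
single_accident = int((counts == 1).sum())
print(high_share, single_accident)
```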

## Start Time

In [None]:
dataset.Start_Time

The Start_Time column has the object data type, which is not suitable for time-based analysis. So, we convert it from object to the datetime format.

In [None]:
dataset.Start_Time = pd.to_datetime(dataset.Start_Time)

In [None]:
dataset.Start_Time[0]

We can now pull out pieces of information from the timestamp to get insights like:
* The distribution of accidents in a day.
* Day of the week with more accidents.
* Trend of accidents over a year and so on.
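
As a small illustration of what the `.dt` accessor offers once a column is in datetime format, the sketch below pulls several components out of two made-up timestamps (they are assumptions, not rows from the dataset).

```python
import pandas as pd

# Two illustrative timestamps in the same text format as a typical Start_Time value
start = pd.to_datetime(pd.Series(["2016-02-08 00:37:08", "2019-12-14 17:05:00"]))

parts = pd.DataFrame({
    "hour": start.dt.hour,
    "dayofweek": start.dt.dayofweek,   # 0 = Monday ... 6 = Sunday
    "day_name": start.dt.day_name(),
    "month": start.dt.month,
    "year": start.dt.year,
})
print(parts)
```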

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(data = dataset.Start_Time.dt.hour, stat = 'density', bins=24, color = 'seagreen')
plt.xticks(fontsize=15)

plt.yticks(fontsize=15)
plt.xlabel('24 hours', fontsize=20)
plt.ylabel('Density of accidents', fontsize=20)

plt.title('Distribution of accidents over a period of 24 hours ', fontsize = 20)

plt.show()

From the above graph, we can observe two peaks. A higher percentage of accidents occur between 6 AM and 10 AM and between 2 PM and 6 PM, probably because people tend to be in a hurry to get to work and to return from work.

Let's plot the distribution of accidents over a week.

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(data = dataset.Start_Time.dt.dayofweek,stat = 'density', bins=7, color = 'seagreen')
plt.xticks(fontsize=15)

plt.yticks(fontsize=15)
plt.xlabel('Days of week', fontsize=20)
plt.ylabel('Density of accidents', fontsize=20)

plt.title('Distribution of accidents over a week ', fontsize = 20)

plt.show()

We can observe from the plot that the number of accidents remains almost constant across weekdays and decreases on weekends, probably because weekends are holidays for most people. (0 = Monday, 1 = Tuesday, 2 = Wednesday, 3 = Thursday, 4 = Friday, 5 = Saturday, 6 = Sunday)

Let's check if the distribution of accidents by hour on weekends remains the same as that on weekdays.


In [None]:
weekends_start_time = dataset.Start_Time[dataset.Start_Time.dt.dayofweek >= 5]
plt.figure(figsize=(20,5))
plt.subplot(1, 2, 1)
sns.histplot(data = weekends_start_time.dt.hour, stat = 'density', bins=24, color = 'indigo')
plt.xticks(fontsize=15)

plt.yticks(fontsize=15)
plt.xlabel('24 hours of weekends', fontsize=20)
plt.ylabel('Density of accidents', fontsize=20)

plt.title('Distribution of accidents by hour for Weekends ', fontsize = 20)

# subplot method is used to plot the two graphs side by side

weekdays_start_time = dataset.Start_Time[dataset.Start_Time.dt.dayofweek <= 4]

plt.subplot(1, 2, 2)
sns.histplot(data = weekdays_start_time.dt.hour, stat = 'density', bins=24, color = 'seagreen')
plt.xticks(fontsize=15)

plt.yticks(fontsize=15)
plt.xlabel('24 hours of weekdays', fontsize=20)
plt.ylabel('Density of accidents', fontsize=20)

plt.title('Distribution of accidents by hour for Weekdays ', fontsize = 20)

plt.show()

From the above two plots, we can see that the hourly pattern of accidents on weekends is quite different from that on weekdays.
* The peak on weekends occurs between 10 AM and 5 PM, unlike the two commute-time peaks on weekdays.
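
The weekday versus weekend comparison can also be summarised as a table rather than two plots. This is a rough sketch on six made-up timestamps; `pd.crosstab` with `normalize="columns"` gives the share of accidents per hour within each group.

```python
import pandas as pd

# Illustrative timestamps: three on a Friday, three on a weekend
start = pd.to_datetime(pd.Series([
    "2019-12-13 08:00", "2019-12-13 08:30", "2019-12-13 17:10",  # Friday
    "2019-12-14 13:00", "2019-12-15 13:45", "2019-12-15 02:00",  # Saturday, Sunday
]))

is_weekend = start.dt.dayofweek >= 5
hour = start.dt.hour

# Share of accidents per hour, normalised within weekdays and weekends separately
hourly_share = pd.crosstab(hour, is_weekend, normalize="columns")
hourly_share.columns = ["weekday", "weekend"]
print(hourly_share)
```

Because each column sums to 1, the two groups can be compared even though weekends contribute fewer records overall.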


Let's have a look at the distribution of accidents for months.

In [None]:
plt.figure(figsize=(15,8))
sns.histplot(data = dataset.Start_Time.dt.month, stat = 'density', bins=12, color = 'indigo')
plt.xticks(fontsize=15)

plt.yticks(fontsize=15)
plt.xlabel('12 Months', fontsize=20)
plt.ylabel('Density of accidents', fontsize=20)

plt.title('Distribution of accidents for Months ', fontsize = 20)

plt.show()

From the above plot, it can be seen that more accidents occur around December.
Let's look at year-by-year plots to verify that this trend is consistent and to rule out any possibility of missing data.

In [None]:
dataset_2016 = dataset[dataset.Start_Time.dt.year == 2016]
dataset_2017 = dataset[dataset.Start_Time.dt.year == 2017]
dataset_2018 = dataset[dataset.Start_Time.dt.year == 2018]
dataset_2019 = dataset[dataset.Start_Time.dt.year == 2019]
dataset_2020 = dataset[dataset.Start_Time.dt.year == 2020]

The dataset has been divided based on years. Let's plot the distribution of accidents for these years.

In [None]:
# (year, data, colour) for each panel; a loop avoids repeating the plotting code
yearly_data = [(2016, dataset_2016, 'indigo'), (2017, dataset_2017, 'seagreen'),
               (2018, dataset_2018, 'seagreen'), (2019, dataset_2019, 'seagreen'),
               (2020, dataset_2020, 'indigo')]

plt.figure(figsize=(20,15))
for position, (year, data, colour) in enumerate(yearly_data, start=1):
    plt.subplot(3, 2, position)          # 3 x 2 grid, one panel per year
    sns.histplot(data = data.Start_Time.dt.month, stat = 'density', bins=12, color = colour)
    plt.xticks(fontsize=15)
    plt.yticks(fontsize=15)
    plt.xlabel('Months', fontsize=20)
    plt.ylabel('Density of accidents', fontsize=20)
    plt.title(f'Distribution of accidents for {year}', fontsize = 20)

plt.tight_layout()       # adjust the panels so that they do not overlap each other
plt.show()

After looking at the above plots, we observe that the distributions for 2017, 2018 and 2019 are more or less balanced, with slight variation. The distributions for 2016 and 2020, however, differ significantly from those of the other years. This could mean that some data is missing for 2016 and 2020. The COVID-19 pandemic started in 2020, so the lockdowns across the US might also be one of the reasons for the variation in that year.
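
One way to look for missing months without reading them off the plots is to tabulate records by year and month. The sketch below does this for a handful of made-up dates; on real data, a year with fewer than 12 months present would be a hint that collection was incomplete.

```python
import pandas as pd

# Illustrative dates: two months present in 2016, three in 2017
start = pd.to_datetime(pd.Series([
    "2016-02-08", "2016-03-01", "2017-01-15", "2017-06-30", "2017-12-24",
]))

# Rows = year, columns = month, cells = number of records
per_month = pd.crosstab(start.dt.year, start.dt.month)

# Number of months that have at least one record in each year
months_with_data = (per_month > 0).sum(axis=1)
print(per_month)
print(months_with_data)
```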

In [None]:
accident_count = [dataset_2016.shape[0], dataset_2017.shape[0], dataset_2018.shape[0], dataset_2019.shape[0], dataset_2020.shape[0]]
years = [2016, 2017, 2018, 2019, 2020]

In [None]:
plt.figure(figsize=(10,5))

sns.lineplot(x=years, y=accident_count, color = 'seagreen')
plt.ticklabel_format(style='plain')
plt.xticks(years, fontsize=15)       # one tick per year

plt.yticks(fontsize=15)

plt.xlabel('Years', fontsize=20)
plt.ylabel('Number of accidents', fontsize=20)

plt.title('Distribution of accidents year over year', fontsize = 20)

plt.show()


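From the above graph, we can observe an increasing trend in the number of recorded accidents year over year. Since some data may be missing for 2016 and 2020, part of this trend could reflect changes in data coverage.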

## Start Latitude and Start Longitude

In [None]:
dataset.Start_Lat

In [None]:
dataset.Start_Lng

Let's make a scatter plot to get an idea of the distribution of accidents in the US. But first, we have to reduce the number of points as there are about 3 million data points.

In [None]:
plt.figure(figsize=(10,5))

# Reducing the dataset to a 10% random sample
sample_df = dataset.sample(int(0.1 * len(dataset)))
# s sets the marker size; seaborn's size argument is meant for a grouping variable
sns.scatterplot(x = sample_df.Start_Lng, y = sample_df.Start_Lat, s = 1, color = 'blueviolet')

plt.xticks(fontsize=15)

plt.yticks(fontsize=15)

plt.xlabel('Start_Lng',fontsize=20)
plt.ylabel('Start_Lat', fontsize=20)

plt.title('Distribution of accidents ', fontsize = 20)

plt.show()

We can see from the above scatter plot that the density of accidents is higher near the coasts than in the center of the country.
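
The down-sampling step used for the scatter plot can also be expressed with the `frac` argument, and passing `random_state` makes the draw repeatable between runs. The coordinates below are made up for illustration, and the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Ten illustrative coordinate pairs
toy = pd.DataFrame({
    "Start_Lat": [39.86, 39.93, 39.06, 39.75, 39.63, 40.10, 39.76, 39.77, 39.78, 39.10],
    "Start_Lng": [-84.06, -82.83, -84.03, -84.21, -84.19, -82.93, -84.23, -84.19, -84.17, -84.52],
})

# frac draws a fixed fraction of rows; random_state makes the draw repeatable
sample_a = toy.sample(frac=0.3, random_state=42)
sample_b = toy.sample(frac=0.3, random_state=42)
print(len(sample_a), sample_a.index.equals(sample_b.index))
```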

Let's show the accident points on a map to get a clear picture of the distribution of accidents over the US.

In [None]:
import folium

In [None]:
# Reducing the number of datapoints to add them on the map

sample1_df = dataset.sample(int(0.001 * len(dataset)))
lat_lon = list(zip(list(sample1_df.Start_Lat), list(sample1_df.Start_Lng)))

In [None]:
# Specifying the center coordinates to position the map to show the US
center = [39.8097343, -98.5556199]

# Adding data points on the Map
map_us = folium.Map(location=center, zoom_start=4)
for point in lat_lon:
    folium.Marker(point).add_to(map_us)

#displaying the map
map_us

## Temperature

In [None]:
dataset['Temperature(F)']

Segregating the accidents by temperature. Temperatures of 70 degrees F and above are treated as hot, and temperatures below 70 degrees F as cold.

In [None]:
cold_temp_count = len(dataset[dataset['Temperature(F)'] < 70.0])
hot_temp_count = len(dataset[dataset['Temperature(F)'] >= 70.0])

Let's plot a pie chart to better visualize the accidents at cold and hot temperatures.

In [None]:
# Setting up the labels for the pie chart
lab = ['Accidents in cold climate', 'Accidents in hot climate', 'Accidents without temperature records']

In [None]:
plt.figure(figsize=(20,10))
plt.pie([cold_temp_count, hot_temp_count, len(dataset) - cold_temp_count - hot_temp_count], labels = lab, autopct = '%0.2f%%', shadow = True, explode = [0.2,0.2,0.2], textprops={'fontsize': 10})
plt.title('Accident Percentages in cold and hot climate ', fontsize = 20)
plt.show()

From the above chart, we can see that about 62 percent of the accidents occurred at temperatures below 70 degrees F, while about 35 percent occurred at 70 degrees F or above. The remaining records have no temperature reading.
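
The cold/hot split can also be done in one step with `pd.cut`, which keeps the records without a temperature reading as a separate group. This is a minimal sketch on five made-up readings; the labels `cold` and `hot` are just names chosen for the example.

```python
import numpy as np
import pandas as pd

# Illustrative readings, including one missing value
temp = pd.Series([30.0, 69.9, 70.0, 85.0, np.nan], name="Temperature(F)")

# right=False makes the bins [-inf, 70) and [70, inf), matching "< 70" and ">= 70"
band = pd.cut(temp, bins=[-np.inf, 70.0, np.inf], right=False, labels=["cold", "hot"])

# dropna=False keeps the rows without a temperature reading as their own group
shares = band.value_counts(normalize=True, dropna=False) * 100
print(shares)
```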

## Weather Condition

Calculating the percentage of accidents under each weather condition.

In [None]:
accident_weather_condition = (dataset.Weather_Condition.value_counts() / len(dataset)) * 100
accident_weather_condition = accident_weather_condition.sort_values(ascending = False)

In [None]:
plt.figure(figsize=(20,10))
accident_weather_condition[:20].plot.barh(color = 'cornflowerblue')
plt.xticks(fontsize=15)
plt.yticks(fontsize=20)
plt.xlabel('Percentage of accidents', fontsize=20)
plt.title('Top 20 weather conditions with associated accident percentages', fontsize = 20)

plt.show()

In [None]:
plt.figure(figsize=(20,10))
(dataset.Weather_Condition.value_counts() / len(dataset))[0:12].plot(kind='pie',autopct = '%0.2f%%', normalize = False, shadow = True)
plt.ylabel("")
plt.title('Distribution of accidents based on weather condition', fontsize = 20)
plt.show()

From the above bar graph and pie chart, we can observe that the majority of accidents occur in Fair, Clear, Mostly Cloudy, Partly Cloudy, Cloudy and Overcast conditions. Each of the remaining conditions accounts for less than 5 percent of the accidents.
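
When reading these percentages, it helps to be clear about the denominator. Dividing `value_counts()` by `len(dataset)` counts rows with no recorded condition in the total, while `value_counts(normalize=True)` leaves them out. The toy Series below is an assumption used to show the difference.

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing condition
weather = pd.Series(["Fair", "Fair", "Cloudy", np.nan], name="Weather_Condition")

# Denominator = all rows, including those with no recorded condition
pct_of_all_rows = weather.value_counts() / len(weather) * 100

# Denominator = only rows that have a recorded condition
pct_of_known = weather.value_counts(normalize=True) * 100
print(pct_of_all_rows["Fair"], pct_of_known["Fair"])
```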

Let's make a correlation matrix for the numeric and boolean variables in the dataset.

In [None]:
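# Restrict to numeric and boolean columns; text columns cannot be correlated
cor = dataset.select_dtypes(include=['number', 'bool']).corr()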
plt.figure(figsize=(25,12))
sns.heatmap(cor, annot=True)
plt.show()

There is a strong correlation between Traffic_Calming and Bump, which is plausible because speed bumps are a common traffic-calming measure.
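
To make the correlation between two boolean flags concrete, here is a small sketch on made-up data. Booleans are treated as 0 and 1, so the Pearson coefficient measures how often the two flags are set together. The values, and the explicit cast to float, are assumptions made for the example.

```python
import pandas as pd

# Toy frame: two boolean flags, one numeric column, one text column
toy = pd.DataFrame({
    "Bump":            [True, False, False, True, False, False],
    "Traffic_Calming": [True, False, False, True, True, False],
    "Distance(mi)":    [0.1, 0.5, 0.0, 0.2, 1.3, 0.4],
    "City":            ["A", "B", "C", "D", "E", "F"],
})

# Keep numeric and boolean columns only; text columns cannot be correlated
numeric_part = toy.select_dtypes(include=["number", "bool"]).astype(float)
cor = numeric_part.corr()
print(cor.round(2))
```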

# Summary and conclusions
### Insights:

* This dataset does not appear to contain data from New York City.
* The number of accidents per city decreases roughly exponentially.
* About 4 percent of the cities recorded at least 1000 accidents over the period covered by the dataset.
* Over 1300 cities have reported just 1 accident over the entire period. This needs to be investigated.
* A higher percentage of accidents occur between 6 AM and 10 AM and between 2 PM and 6 PM, probably because people tend to be in a hurry to get to work and to return from work.
* The peak on weekends occurs between 10 AM and 5 PM, unlike on weekdays.
* More accidents occur around December.
* An increasing trend is observed in the number of accidents year over year.
* The density of accidents is higher near the coasts than in the center of the country.
* About 62 percent of the accidents occurred at temperatures below 70 degrees F, while about 35 percent occurred at 70 degrees F or above.
* The majority of accidents occur in Fair, Clear, Mostly Cloudy, Partly Cloudy, Cloudy and Overcast conditions.


There are many more columns in this dataset that can be analysed for further insights.