# US Accidents EDA  
---

This project focuses on exploring and analyzing the US Accidents dataset, which contains data on traffic incidents across the United States. The goal is to extract meaningful insights, identify patterns, and understand the factors that contribute to accidents across different states and times.

- The dataset is sourced from Kaggle, providing comprehensive records of traffic accidents across the United States.

- Insights derived from this analysis can contribute to strategies aimed at reducing and preventing accidents, enhancing road safety.

In [None]:
import warnings
# Ignore RuntimeWarning messages
warnings.filterwarnings("ignore", category=RuntimeWarning)
# Ignore the FutureWarning about the deprecated use_inf_as_na option
warnings.filterwarnings("ignore", category=FutureWarning, message=".*use_inf_as_na option is deprecated.*")

## Load the Data
---

In [None]:
data_url = "https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents"
data_filename = "/kaggle/input/us-accidents/US_Accidents_March23.csv"

In [None]:
#libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data preparation & cleaning
---
- Load the data using Pandas
- Look at some information about the data and the columns
- Fix any incorrect or missing values
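
As a rough illustration of the steps listed above, the sketch below runs the same kind of checks on a tiny hand-made DataFrame instead of the real file. The toy values and the 50% threshold are assumptions made for this example, not choices taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the accidents table (values are made up for illustration)
toy_df = pd.DataFrame({
    "City": ["Dayton", "Austin", None, "Miami"],
    "End_Lat": [np.nan, np.nan, np.nan, 25.76],
    "Severity": [2, 3, 2, 4],
})

# Percentage of missing values per column, largest first
toy_missing = toy_df.isnull().mean().sort_values(ascending=False) * 100

# Hypothetical rule: drop columns where more than half of the values are missing
threshold = 50
cols_to_drop = toy_missing[toy_missing > threshold].index.tolist()
toy_clean = toy_df.drop(columns=cols_to_drop)

print(toy_missing)
print("Dropped:", cols_to_drop)
```

Using `isnull().mean()` gives the missing fraction directly, which is equivalent to dividing `isnull().sum()` by the number of rows.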

In [None]:
df = pd.read_csv(data_filename)

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
missing_percentage = df.isnull().sum().sort_values(ascending = False) / len(df) * 100
print('\033[32mMissing Value Percentage\033[0m')
missing_percentage

In [None]:
type(missing_percentage)

In [None]:
missing_percentage[missing_percentage > 0]

In [None]:
df.drop(['End_Lat','End_Lng', 'ID'], axis = 1, inplace = True)

In [None]:
# Count the numeric columns ('number' matches every integer and float dtype)
numeric_df = df.select_dtypes(include='number')
f'No. of Numeric Columns = {len(numeric_df.columns)}'

## Exploratory analysis & visualization
---
Columns under consideration:

1. City
2. Start Time
3. Start Lat, Start Lng
4. Temperature
5. Weather Condition

### 1. Exploring Cities
---

In [None]:
cities = df.City.unique()
len(cities)

In [None]:
cities_by_accident = df.City.value_counts()
cities_by_accident[:20]

In [None]:
cities_by_accident.loc['New York']

In [None]:
cities_by_accident[:20].plot(kind='barh', figsize=(12, 7))

In [None]:
# NOTE: aspect attribute widens the plot
sns.displot(cities_by_accident, kind='kde', aspect = 2)
plt.xlabel('Number of Accidents')
plt.ylabel('Density')

In [None]:
high_accident_cities = cities_by_accident[cities_by_accident >= 1000]
low_accident_cities = cities_by_accident[cities_by_accident < 1000]

In [None]:
print(f'Out of {len(cities_by_accident)} cities \033[93m {len(high_accident_cities)} \033[0m have 1000 or more accidents')

In [None]:
print(f'Out of {len(cities_by_accident)} cities \033[93m {len(low_accident_cities)} \033[0m have fewer than 1000 accidents')

In [None]:
# Percentage of cities that have 1000 or more accidents
high = len(high_accident_cities)/len(cities_by_accident) *100
f'{high:.2f}%'

In [None]:
# Percentage of cities that have fewer than 1000 accidents
less = len(low_accident_cities)/len(cities_by_accident) *100
f'{less:.2f}%'

In [None]:
# NOTE: aspect attribute widens the plot
sns.displot(high_accident_cities, kind='kde', log_scale=True, aspect = 2)
plt.xlabel('Number of Accidents')
plt.ylabel('Density')
plt.title('Distribution of High accident cities')

- The plot shows that only a few cities reach around 100,000 accidents.

In [None]:
# NOTE: aspect attribute widens the plot
sns.displot(low_accident_cities, kind='kde', log_scale=True, aspect = 2)
plt.xlabel('Number of Accidents')
plt.ylabel('Density')
plt.title('Distribution of low accident cities')

- The plot shows that a large number of cities have just one accident.

In [None]:
one_accident = len(cities_by_accident[cities_by_accident == 1])
print(f'No. of cities with just one accident \033[32m "{one_accident}" \033[0m')

### 2. Exploring Start Time
---

In [None]:
df.Start_Time

In [None]:
type(df.Start_Time)

In [None]:
# format='mixed' parses timestamps with and without fractional seconds,
# so a second conversion with an explicit format is not needed
df['Start_Time'] = pd.to_datetime(df['Start_Time'], format='mixed')

In [None]:
df['Start_Time']

In [None]:
df.Start_Time[0]

In [None]:
# Extract the hour at which each accident occurred with 'df.Start_Time.dt.hour'
# NOTE: aspect attribute widens the plot; discrete=True centers one bar on each hour
sns.displot(df.Start_Time.dt.hour, discrete=True, stat='percent', aspect=2)
plt.title('Accidents by Hour of Day')
plt.ylabel("Percentage")
plt.xlabel("Hour")
plt.xticks(ticks=range(24))  # dt.hour runs from 0 to 23
plt.show()

- A high percentage of accidents occur between 6 am and 11 am (probably because people are in a hurry to get to work).
- The next highest percentage is between 3 pm and 6 pm.
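
The shares read off the histogram can also be computed as numbers. The following is a minimal sketch on a handful of made-up timestamps; `toy_start` is a hypothetical stand-in for `df.Start_Time`.

```python
import pandas as pd

# Toy timestamps standing in for df.Start_Time (made-up values)
toy_start = pd.Series(pd.to_datetime([
    "2019-03-04 07:15:00",
    "2019-03-04 08:05:00",
    "2019-03-04 08:40:00",
    "2019-03-05 17:30:00",
]))

# Percentage of accidents that fall in each hour of the day (0-23)
hour_share = toy_start.dt.hour.value_counts(normalize=True).sort_index() * 100

# Hour with the largest share of accidents
busiest_hour = hour_share.idxmax()

print(hour_share)
print("Busiest hour:", busiest_hour)
```

Summing `hour_share` over a range of hours, for example `hour_share.loc[6:11].sum()`, would give the share of accidents inside a time window.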

In [None]:
# Extract the day of the week with 'df.Start_Time.dt.dayofweek' (0 = Monday, 6 = Sunday)
# NOTE: aspect attribute widens the plot; discrete=True centers one bar on each day
sns.displot(df.Start_Time.dt.dayofweek, discrete=True, stat='percent', aspect=2)
plt.title('Accidents by Day of the Week')
plt.ylabel("Percentage")
plt.xlabel("Days")
plt.xticks(ticks=range(7), labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()

- Weekends have fewer accidents.

In [None]:
mon_start_time = df.Start_Time[(df.Start_Time.dt.dayofweek == 0)] #Monday
sat_start_time = df.Start_Time[(df.Start_Time.dt.dayofweek == 5)] #Saturday
sun_start_time = df.Start_Time[(df.Start_Time.dt.dayofweek == 6)] #Sunday

In [None]:
# Plot the hourly distribution of accidents for Monday, Saturday and Sunday
# NOTE: aspect attribute widens the plot; discrete=True centers one bar on each hour
day_series = [('Monday', mon_start_time), ('Saturday', sat_start_time), ('Sunday', sun_start_time)]
for day_name, start_times in day_series:
    sns.displot(start_times.dt.hour, discrete=True, stat='percent', aspect=2)
    plt.title(f'Accidents by Hour of Day ({day_name})')
    plt.ylabel("Percentage")
    plt.xlabel("Hour")
    plt.xticks(ticks=range(24))  # dt.hour runs from 0 to 23

plt.show()

**Is the distribution of accidents by hour the same on weekends as on weekdays?**

- On Saturdays and Sundays (weekends), the peak time is between 10 am and 8 pm, unlike on weekdays.
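
One way to put the weekday and weekend hourly profiles side by side is a normalized cross-tabulation. This is only a sketch on made-up timestamps; `toy_start` is a hypothetical stand-in for `df.Start_Time`, and treating `dayofweek >= 5` as the weekend is the assumption used here.

```python
import pandas as pd

# Toy timestamps standing in for df.Start_Time (made-up values)
toy_start = pd.Series(pd.to_datetime([
    "2019-03-04 08:00:00",  # Monday
    "2019-03-05 08:30:00",  # Tuesday
    "2019-03-06 17:10:00",  # Wednesday
    "2019-03-09 13:00:00",  # Saturday
    "2019-03-10 14:20:00",  # Sunday
    "2019-03-10 14:45:00",  # Sunday
]))

hours = toy_start.dt.hour.rename("hour")
is_weekend = (toy_start.dt.dayofweek >= 5).rename("weekend")

# Rows: hour of day; columns: weekday vs weekend; values: percent within each column
profile = pd.crosstab(hours, is_weekend, normalize="columns") * 100
profile = profile.rename(columns={False: "Weekday", True: "Weekend"})

print(profile)
print(profile.idxmax())  # peak hour for each group
```

Normalizing within each column keeps the two groups comparable even though weekdays contribute far more records than weekends.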

In [None]:
year = 2019  # Tunable
start_times_in_year = df.Start_Time[df.Start_Time.dt.year == year]

In [None]:
# Extract the month in which each accident occurred with '.dt.month'
# NOTE: aspect attribute widens the plot; discrete=True centers one bar on each month
sns.displot(start_times_in_year.dt.month, discrete=True, stat='percent', aspect=2)
plt.title(f'Trend of accidents in {year}')
plt.ylabel("Percentage")
plt.xlabel("Month")
plt.xticks(ticks=range(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
           'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

- Much data is missing for 2016, 2017 & 2023
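
Gaps like these can be checked without plotting each year in turn. The sketch below counts records per year and the number of distinct months covered in each year. It uses made-up dates, with `toy_start` as a hypothetical stand-in for `df.Start_Time`.

```python
import pandas as pd

# Toy dates standing in for df.Start_Time (made-up values)
toy_start = pd.Series(pd.to_datetime([
    "2016-02-10", "2016-03-12",
    "2019-01-05", "2019-01-20", "2019-02-14", "2019-03-03",
    "2023-01-08",
]))

# Number of records in each year
per_year = toy_start.dt.year.value_counts().sort_index()

# Number of distinct months that have at least one record, per year
months_covered = toy_start.dt.month.groupby(toy_start.dt.year).nunique()

print(per_year)
print(months_covered)
```

On the real data, a year with fewer than 12 covered months would point to missing periods rather than a genuine drop in accidents.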

In [None]:
df.Source.value_counts().plot(kind='pie')
df.Source.unique()

- Source 2 has inconsistent data (cross-verified).

### 3. Exploring Start Latitude & Start Longitude
---

In [None]:
sns.scatterplot(x = df.Start_Lng, y = df.Start_Lat, s = 20)

In [None]:
# sample_df = df.sample(int(0.01 * len(df)))
sample_df = df.sample(n=1000)

In [None]:
sns.scatterplot(x = sample_df.Start_Lng, y = sample_df.Start_Lat, s = 20)

In [None]:
import folium
m1 = folium.Map(location=[39.5, -98.35], zoom_start=4)
for lat, lon in zip(sample_df.Start_Lat, sample_df.Start_Lng):
    folium.CircleMarker(location=[lat, lon], radius=1, color='blue', fill=True).add_to(m1)
m1

In [None]:
from folium.plugins import HeatMap
m2 = folium.Map(location=[39.5, -98.35], zoom_start=4)
heat_data =(list( zip(sample_df.Start_Lat, sample_df.Start_Lng)))
HeatMap(heat_data).add_to(m2)
m2

## Interrogation

1. Are there more accidents in warmer or colder areas?
2. Which 5 states have the highest number of accidents? How about per capita?
3. Does New York show up in the data? If yes, why is the count lower if this is the most populated city?
4. Among the top 100 cities, which states do they belong to most frequently?
5. At what time of the day are accidents most frequent?
6. Which days of the week have the most accidents?
7. Which months have more accidents?
8. What is the trend of accidents year over year (increasing or decreasing)?
9. When is the number of accidents per unit of traffic the highest?
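
Some of these questions could be approached with simple counts. The sketch below covers questions 2 and 8 on a made-up table. The `State` column name, the state codes `S1` to `S3`, and the population figures are placeholders assumed for the example, not real data.

```python
import pandas as pd

# Toy accident records (hypothetical state codes and made-up dates)
toy_df = pd.DataFrame({
    "State": ["S1", "S1", "S1", "S2", "S2", "S3"],
    "Start_Time": pd.to_datetime([
        "2018-05-01", "2019-06-01", "2019-07-01",
        "2018-08-01", "2019-09-01", "2019-10-01",
    ]),
})

# Placeholder populations; a real analysis would need an external census table
toy_population = pd.Series({"S1": 3_000_000, "S2": 1_000_000, "S3": 250_000})

# Question 2: accidents per state, in total and per 100,000 residents
by_state = toy_df["State"].value_counts()
per_100k = (by_state / toy_population * 100_000).sort_values(ascending=False)

# Question 8: accidents per year, to read off the year-over-year trend
by_year = toy_df["Start_Time"].dt.year.value_counts().sort_index()

print(by_state.head(5))
print(per_100k.head(5))
print(by_year)
```

In this toy example the state with the most accidents is not the one with the highest per capita rate, which is why the two rankings are worth computing separately.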

## Concluding Summary
---

- Fewer than 9% of cities have 1000 or more accidents.
- The number of accidents per city decreases exponentially.
- Over 1000 cities reported just one accident (needs investigation).