## U.S. Accidents Data Analysis

This project analyzes the US Accidents dataset available on Kaggle. The dataset contains over four million accident records from February 2016 to March 2023. Each record includes details such as the location, time, severity, and weather conditions at the time of the accident. The analysis looks for patterns in when and where accidents occur and how they relate to other recorded factors such as weather.

#### Data Loading and Inspection

In [None]:
# Import necessary libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap

# Load the dataset
df = pd.read_csv('US_Accidents_March23.csv')

# Display the first few rows of the dataset
df.head()
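
The full CSV is large, so reading every column can be slow and memory-hungry. The following is a minimal sketch of loading only the columns an analysis needs with `usecols`, and parsing the timestamp while reading. It uses a tiny made-up CSV held in memory as a stand-in for `US_Accidents_March23.csv`; the rows and the `small_df` and `wanted` names are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Tiny made-up stand-in for US_Accidents_March23.csv (illustrative rows only)
csv_text = io.StringIO(
    "ID,Severity,Start_Time,Start_Lat,Start_Lng,Description\n"
    "A-1,2,2016-02-08 05:46:00,39.86,-84.05,Example row one\n"
    "A-2,3,2016-02-08 06:07:59,39.92,-82.83,Example row two\n"
)

# Read only the columns the analysis needs and parse the timestamp on load
wanted = ["Severity", "Start_Time", "Start_Lat", "Start_Lng"]
small_df = pd.read_csv(csv_text, usecols=wanted, parse_dates=["Start_Time"])

print(small_df.dtypes)
```

With the real file, the same `usecols` list could be passed to the `pd.read_csv` call above in place of dropping columns afterwards.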


#### Data Cleaning and Preparation

In [None]:
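# Drop columns that are not needed for this analysis

columns_to_drop = ['ID', 'Number', 'Country', 'Airport_Code', 'Wind_Chill(F)', 'Precipitation(in)', 'Weather_Timestamp', 'Description']
# errors='ignore' skips any listed column that is absent from this version of the dataset
df = df.drop(columns=columns_to_drop, errors='ignore')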

# Remove every row that has at least one missing value
df = df.dropna()

# Convert Start_Time to datetime
df['Start_Time'] = pd.to_datetime(df['Start_Time'])

# Extract date and time components
df['Start_Hour'] = df['Start_Time'].dt.hour
df['Start_Day'] = df['Start_Time'].dt.dayofweek  # Monday=0, Sunday=6
df['Start_Month'] = df['Start_Time'].dt.month
df['Start_Year'] = df['Start_Time'].dt.year

# Basic Statistics
df.describe()
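
A blanket `dropna()` removes any row with at least one missing value, which can discard a large share of the data when a few columns are sparsely populated. The sketch below shows one way to check per-column missingness before dropping rows. It runs on a small made-up frame (`toy`), not on the real dataset, so the numbers it prints say nothing about the actual data.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df before dropna() (values are illustrative)
toy = pd.DataFrame({
    "Severity": [2, 3, 2, 4],
    "Temperature(F)": [36.9, np.nan, 36.0, 35.1],
    "Wind_Speed(mph)": [np.nan, np.nan, 3.5, 4.6],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that would survive a blanket dropna()
rows_kept = len(toy.dropna())
print(f"{rows_kept} of {len(toy)} rows kept")
```

If a single column accounts for most of the loss, dropping that column or passing `subset=` to `dropna()` may preserve more rows.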

#### Exploratory Data Analysis (EDA)

Distribution of accidents by hour

In [None]:
plt.figure(figsize=(10,6))
# discrete=True centers one bar on each integer hour (0-23)
sns.histplot(df['Start_Hour'], discrete=True, color='blue')
plt.title('Distribution of Accidents by Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Frequency')
plt.show()
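
The histogram gives a visual impression of the hourly pattern; a single number can make it easier to compare. The sketch below computes the share of accidents that fall inside rush-hour windows. The windows (07:00-09:59 and 16:00-18:59) are an assumption chosen for illustration, and the `hours` series is made-up data standing in for `df['Start_Hour']`.

```python
import pandas as pd

# Made-up hours standing in for df['Start_Hour']
hours = pd.Series([7, 8, 8, 12, 16, 17, 17, 23, 2, 8])

# Assumed rush-hour windows: 07:00-09:59 and 16:00-18:59 (both bounds inclusive)
rush = hours.between(7, 9) | hours.between(16, 18)

# The mean of a boolean Series is the fraction of True values
rush_share = rush.mean()
print(f"Share of accidents in rush hours: {rush_share:.0%}")
```

On the real data, replacing `hours` with `df['Start_Hour']` would give the corresponding share for the dataset.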

Distribution of accidents by day of the week

In [None]:
plt.figure(figsize=(10, 6))
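# Pass the column by name; recent seaborn versions no longer accept it positionally
sns.countplot(x='Start_Day', data=df, order=list(range(7)), color='orange')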
plt.title('Distribution of Accidents by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Frequency')
plt.xticks(ticks=[0, 1, 2, 3, 4, 5, 6], labels=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()

Distribution of accidents by month

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='Start_Month', data=df, order=list(range(1, 13)), color='green')
plt.title('Distribution of Accidents by Month')
plt.xlabel('Month')
plt.ylabel('Frequency')
# countplot draws the 12 bars at positions 0-11, so the tick positions start at 0
plt.xticks(ticks=range(12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

Distribution of accidents by year

In [None]:
plt.figure(figsize=(10, 6))
sns.countplot(x='Start_Year', data=df, color='purple')
plt.title('Distribution of Accidents by Year')
plt.xlabel('Year')
plt.ylabel('Frequency')
plt.show()

Correlation Matrix

In [None]:
plt.figure(figsize=(12, 8))
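# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
corr_matrix = df.corr(numeric_only=True)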
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
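
A full annotated heatmap can be hard to read when there are many columns. As a complement, the sketch below ranks the numeric columns by the strength of their correlation with `Severity`. It runs on a small made-up frame (`toy`), so the coefficients are illustrative only; `severity_corr` is a name introduced here.

```python
import pandas as pd

# Made-up frame standing in for df (values are illustrative)
toy = pd.DataFrame({
    "Severity": [1, 2, 3, 4, 2],
    "Visibility(mi)": [10.0, 8.0, 5.0, 2.0, 9.0],
    "Humidity(%)": [40.0, 55.0, 70.0, 90.0, 50.0],
    "Weather_Condition": ["Clear", "Rain", "Rain", "Fog", "Clear"],
})

# Correlation of every numeric column with Severity, strongest first
severity_corr = (
    toy.corr(numeric_only=True)["Severity"]
    .drop("Severity")                       # remove the trivial self-correlation
    .sort_values(key=abs, ascending=False)  # rank by absolute strength
)
print(severity_corr)
```

Pearson correlation only captures linear association, and `Severity` is an ordinal score, so the ranking is best read as a rough guide rather than a measure of effect.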

Pair Plot for key variables

In [None]:
subset_df = df[['Severity', 'Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)']]
sns.pairplot(subset_df)
plt.show()

Geospatial Analysis

In [None]:
# Sample 1% of the data to keep the map responsive
sample_df = df.sample(frac=0.01, random_state=42)

# Create a folium map for accident locations
# (named accident_map to avoid shadowing Python's built-in map)
accident_map = folium.Map(location=[sample_df['Start_Lat'].mean(), sample_df['Start_Lng'].mean()], zoom_start=5)
# Build the [lat, lng] pairs in one step instead of looping over rows
heat_data = sample_df[['Start_Lat', 'Start_Lng']].values.tolist()
HeatMap(heat_data).add_to(accident_map)
accident_map

Clustering Example

In [None]:
from sklearn.cluster import KMeans

# Choosing a subset of the data for clustering
cluster_data = df[['Start_Lat', 'Start_Lng']].sample(10000, random_state=42)

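# Applying KMeans clustering to the coordinate columns only,
# so the cell can be re-run after the Cluster column has been added
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
cluster_data['Cluster'] = kmeans.fit_predict(cluster_data[['Start_Lat', 'Start_Lng']])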

# Plotting the clusters on a map, one color per cluster label
cluster_colors = ['red', 'blue', 'green', 'purple', 'orange']
map_clusters = folium.Map(location=[cluster_data['Start_Lat'].mean(), cluster_data['Start_Lng'].mean()], zoom_start=5)
for lat, lng, label in zip(cluster_data['Start_Lat'], cluster_data['Start_Lng'], cluster_data['Cluster']):
    folium.CircleMarker([lat, lng],
                        radius=3,
                        color=cluster_colors[int(label)],
                        fill=True).add_to(map_clusters)

map_clusters
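
A map of colored points shows where the clusters are, but not how large each one is. The sketch below summarizes clusters by size and mean coordinates using a pandas `groupby`. The `clustered` frame is made-up data standing in for `cluster_data` after the `Cluster` column has been assigned; the `summary` name and its column names are introduced here for illustration.

```python
import pandas as pd

# Made-up frame standing in for cluster_data with labels already assigned
clustered = pd.DataFrame({
    "Start_Lat": [34.0, 34.2, 40.7, 40.9, 41.0],
    "Start_Lng": [-118.2, -118.4, -74.0, -73.9, -74.1],
    "Cluster": [0, 0, 1, 1, 1],
})

# One row per cluster: number of points and mean latitude/longitude
summary = clustered.groupby("Cluster").agg(
    n_points=("Start_Lat", "size"),
    center_lat=("Start_Lat", "mean"),
    center_lng=("Start_Lng", "mean"),
)
print(summary)
```

KMeans on raw latitude and longitude treats degrees as flat Euclidean coordinates, which is only an approximation of distance on the ground, so the cluster centers are best treated as rough regional groupings.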

#### Conclusion

- Most accidents occur during rush hours.
- The number of accidents varies significantly by day of the week and month.
- Heat maps show accident hotspots across the United States.
- Correlation analysis reveals potential relationships between weather conditions and accident severity.