In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Preparation and Cleaning

- Load the file using Pandas
- Look at some information about the data and the columns
- Fix any missing or incorrect values

In [None]:
df = pd.read_csv('../input/us-accidents/US_Accidents_Dec20_Updated.csv',index_col='ID')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

# Ask & Answer Questions
- Are there more accidents in warmer or colder areas?
- When (as in the time of accident) are the accidents more frequent?
- Which days of the week (or even months) have the most accidents?
- What is the trend of these accidents over the years? (decreasing/increasing)

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

numeric_df = df.select_dtypes(include=numerics)
len(numeric_df.columns)  # number of numeric columns (len(numeric_df) would count rows)
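
Note that `len()` on a DataFrame returns the number of rows, so counting numeric columns needs the `columns` attribute. A minimal sketch on a made-up frame (`toy_df` and its values are hypothetical, not taken from the accidents dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the accidents data
toy_df = pd.DataFrame({
    "Severity": [2, 3, 2],
    "Temperature(F)": [36.9, 37.9, 36.0],
    "City": ["Dayton", "Dayton", "Cincinnati"],
})

# include="number" selects every integer and float dtype at once
toy_numeric = toy_df.select_dtypes(include="number")

n_rows = len(toy_numeric)                  # number of rows
n_numeric_cols = len(toy_numeric.columns)  # number of numeric columns
print(n_rows, n_numeric_cols)
```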

### Missing Values

In [None]:
df.isna().sum()

There are a lot of columns that have missing values!

#### Missing Column Values Percentage

In [None]:
missing_percentage = df.isna().sum().sort_values(ascending=False) / len(df)
missing_percentage
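
The value computed above is a fraction between 0 and 1 rather than a percentage. An equivalent way to get an actual percentage, sketched here on a hypothetical toy frame, is to take the column-wise mean of the boolean mask and multiply by 100:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
toy_df = pd.DataFrame({
    "City": ["Dayton", None, "Columbus", "Dayton"],
    "Wind_Chill(F)": [np.nan, np.nan, 33.3, np.nan],
    "Severity": [2, 3, 2, 2],
})

# isna() gives a boolean mask; its column-wise mean is the fraction of missing values
toy_missing_pct = toy_df.isna().mean().sort_values(ascending=False) * 100

# Keep only the columns that actually have missing values
toy_missing_pct = toy_missing_pct[toy_missing_pct > 0]
print(toy_missing_pct)
```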

In [None]:
#Filtering out columns without missing values
Y = missing_percentage[missing_percentage != 0]
Y

In [None]:
Y.plot(kind='barh')

# Exploratory Analysis and Visualization
Columns we'll analyze 
- City
- Start Time
- Weather Condition [TODO]
- Temperature [TODO]
- Start Lng, Start Lat

### Cities

In [None]:
df.columns

In [None]:
cities = df.City.unique()
len(cities)

The dataset consists of registered accidents from over 11790 cities in the US.

In [None]:
cities_by_accident = df.City.value_counts()
cities_by_accident[:5] #Top 5 cities by accident count

Interestingly, these cities are also among the most populous in the US. However, the top five do not include the most populous city of all, New York.

In [None]:
cities_by_accident[:20].plot(kind='barh')

In [None]:
import seaborn as sns
sns.set_style('darkgrid')
sns.histplot(cities_by_accident, kde=True, stat='density')  # distplot is deprecated in recent seaborn

In [None]:
high_accident_cities = cities_by_accident[cities_by_accident >= 1000]
less_accident_cities = cities_by_accident[cities_by_accident < 1000]  # strict, so 1000 is not counted twice

In [None]:
len(high_accident_cities)/len(cities)

This shows that fewer than 5% of cities have 1000 or more recorded accidents. Note that these counts are totals over the whole dataset, not yearly figures.
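
The same share can also be computed directly from the counts. The sketch below uses made-up city labels and counts (all hypothetical) to show the idea:

```python
import pandas as pd

# Hypothetical accident counts per city
toy_counts = pd.Series(
    {"A": 5000, "B": 1000, "C": 999, "D": 40, "E": 3},
    name="accidents",
)

# Boolean mask: True where a city has at least 1000 accidents
is_high = toy_counts >= 1000

# The mean of a boolean mask is the share of True values
share_high = is_high.mean()
print(f"{share_high:.0%} of cities have at least 1000 accidents")
```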

In [None]:
sns.histplot(high_accident_cities, kde=True, stat='density')  # distplot is deprecated in recent seaborn

In [None]:
sns.histplot(cities_by_accident,log_scale=True)

We can also see which states have the most accidents by using the `State` column from our dataframe

In [None]:
state_by_accident = df['State'].value_counts()
state_by_accident.plot(kind='bar')  # states are categories, so bars read better than a line

### Start Time

`Start_Time` is the time at which the accident occurred. Let's dive into it!

In [None]:
df.Start_Time

Since the time here is of `dtype: object`, we need to convert it to Timestamp

In [None]:
df.Start_Time = pd.to_datetime(df.Start_Time)
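
As a quick illustration of what the conversion makes possible, the sketch below parses a couple of made-up timestamp strings (hypothetical values) and reads their components through the `.dt` accessor:

```python
import pandas as pd

# Hypothetical timestamps stored as strings (dtype: object)
raw_times = pd.Series(["2019-01-07 08:15:00", "2019-01-13 17:40:00"])

# Convert to datetime64 so that the .dt accessor becomes available
parsed = pd.to_datetime(raw_times)

hours = parsed.dt.hour.tolist()          # hour of day, 0-23
weekdays = parsed.dt.dayofweek.tolist()  # Monday=0 ... Sunday=6
print(hours, weekdays)
```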

In [None]:
df.Start_Time.iloc[0]  # positional access; the index holds ID labels, not integers

### Hour Start Time

In [None]:
sns.histplot(df.Start_Time.dt.hour, bins=24, stat='density')  # distplot is deprecated in recent seaborn

### Days of the week Start Time

In [None]:
sns.histplot(df.Start_Time.dt.dayofweek, bins=7, stat='density')  # distplot is deprecated in recent seaborn

The x-axis represents the day of the week (0 is Monday, 6 is Sunday). Weekdays clearly have more accidents than weekends. Let's take a closer look at how the hourly distributions differ between the two.
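
If day names are easier to read than the numeric codes, `dt.day_name()` can be used instead. A minimal sketch with hypothetical timestamps (`toy_times` and `day_order` are illustrative names):

```python
import pandas as pd

# Hypothetical accident start times
toy_times = pd.Series(pd.to_datetime([
    "2019-01-07 08:00",  # Monday
    "2019-01-07 17:30",  # Monday
    "2019-01-08 09:10",  # Tuesday
    "2019-01-13 14:00",  # Sunday
]))

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Count accidents per named day, keeping calendar order and
# filling days without any accident with 0
per_day = (toy_times.dt.day_name()
           .value_counts()
           .reindex(day_order, fill_value=0))
print(per_day)
```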

### On Sundays

In [None]:
sunday_start_time = df.Start_Time[df.Start_Time.dt.dayofweek == 6]  # 6 is Sunday
sns.histplot(sunday_start_time.dt.hour, bins=24, stat='density')

There's a slight peak at 7am, but the number of accidents is largest during the afternoon, i.e., 1pm to 6pm.

### On Mondays

In [None]:
monday_start_time = df.Start_Time[df.Start_Time.dt.dayofweek == 0]  # 0 is Monday
sns.histplot(monday_start_time.dt.hour, bins=24, stat='density')

> The Distribution on Mondays is very different from that on Sundays

The two peaks in the above graph could represent:
1. People going to work
2. People returning home.
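
One way to compare the weekday and weekend patterns side by side is to tabulate accidents by hour for each group. A sketch on hypothetical timestamps (`toy_times`, `day_type` and `hourly` are illustrative names):

```python
import pandas as pd

# Hypothetical accident start times
toy_times = pd.Series(pd.to_datetime([
    "2019-01-07 08:00",  # Monday morning
    "2019-01-08 08:30",  # Tuesday morning
    "2019-01-09 17:00",  # Wednesday evening
    "2019-01-12 14:00",  # Saturday afternoon
    "2019-01-13 14:20",  # Sunday afternoon
]))

# dayofweek >= 5 means Saturday (5) or Sunday (6)
day_type = (toy_times.dt.dayofweek.ge(5)
            .map({True: "weekend", False: "weekday"})
            .rename("day_type"))
hour = toy_times.dt.hour.rename("hour")

# Rows: hour of day, columns: weekday vs weekend accident counts
hourly = pd.crosstab(hour, day_type)
print(hourly)
```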

Amazing how the human systems that we've built mess with the general statistics ;)

Let's look at the monthly distribution.

### By Month

In [None]:
sns.histplot(df.Start_Time.dt.month, bins=12, stat='density')  # all accidents, not only Sundays

There is a large difference between the number of accidents in summer and in winter, which may be an artifact of the dataset rather than a real effect. Our dataset consists of data from February 2016 to 2020, so not every month is covered the same number of times. This could be one of the reasons. Let's see how this distribution looks for the year 2019.
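
One way to check whether uneven coverage plays a role is to count how many distinct years contribute data to each calendar month. The sketch below does this for a few hypothetical timestamps; with the real data, `toy_times` would be replaced by `df.Start_Time`:

```python
import pandas as pd

# Hypothetical accident dates spanning several years
toy_times = pd.Series(pd.to_datetime([
    "2016-02-08", "2016-06-21", "2017-01-15",
    "2017-06-02", "2018-06-30", "2018-12-24",
]))

# For each calendar month, count the distinct years in which it appears
years_per_month = toy_times.dt.year.groupby(toy_times.dt.month).nunique()
print(years_per_month)
```

Months that appear in fewer years will look artificially low in a pooled monthly histogram.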

### For the year 2019

In [None]:
df_2019 = df[df.Start_Time.dt.year == 2019]
sns.histplot(df_2019.Start_Time.dt.month, bins=12, stat='density')  # distplot is deprecated in recent seaborn

Aha! This distribution seems good. Accidents seem to be more frequent during the winters.

### Start Latitude & Longitude

Now, let's look at a distribution based on the location of accidents (latitude and longitude).

In [None]:
df.Start_Lat

In [None]:
df.Start_Lng

In [None]:
list(zip(df.Start_Lat.head(), df.Start_Lng.head()))  # Pair up latitude and longitude (first 5 rows only)

In [None]:
sample_df = df.sample(int(0.1*len(df))) #Using 10% of our dataset
sns.scatterplot(x=sample_df.Start_Lng, y=sample_df.Start_Lat, s=1, linewidth=0)  # s sets the marker size; size= maps a data variable

Looks like the [map](https://www.mappr.co/wp-content/uploads/2018/11/USA-States-Color-Map.jpg) of the USA, right?
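
Sampling keeps the plot responsive. `DataFrame.sample` also accepts a `frac` argument, and a `random_state` makes the draw repeatable; a sketch on a hypothetical frame of coordinates:

```python
import pandas as pd

# Hypothetical coordinates; only the column names match the real dataset
toy_df = pd.DataFrame({
    "Start_Lat": [39.8 + 0.01 * i for i in range(100)],
    "Start_Lng": [-84.0 - 0.01 * i for i in range(100)],
})

# frac=0.1 draws 10% of the rows; random_state fixes the random draw
toy_sample = toy_df.sample(frac=0.1, random_state=42)
same_again = toy_df.sample(frac=0.1, random_state=42)
print(len(toy_sample))
```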

Next, let's use the library folium to create interactive maps with the latitudes and longitudes.

In [None]:
import folium

In [None]:
from folium.plugins import HeatMap

In [None]:
accident_map = folium.Map()  # avoid shadowing Python's built-in map()
HeatMap(list(zip(df.Start_Lat, df.Start_Lng))).add_to(accident_map)
accident_map
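
Passing every row to `HeatMap` can be slow when the dataset is large. Below is a sketch of preparing a smaller, cleaned list of points first. The helper name `heatmap_points` and its defaults are hypothetical, and it assumes the coordinate columns are named `Start_Lat` and `Start_Lng`:

```python
import numpy as np
import pandas as pd

def heatmap_points(frame, frac=0.1, random_state=0):
    """Return a list of [lat, lng] pairs from a random sample of `frame`.

    Hypothetical helper: assumes columns named Start_Lat and Start_Lng.
    """
    # Drop rows where either coordinate is missing
    coords = frame[["Start_Lat", "Start_Lng"]].dropna()
    # Keep only a fraction of the remaining rows
    sampled = coords.sample(frac=frac, random_state=random_state)
    return sampled.values.tolist()

# Hypothetical toy data with two incomplete rows
toy_df = pd.DataFrame({
    "Start_Lat": [39.86, 39.93, np.nan, 40.10],
    "Start_Lng": [-84.06, -82.83, -84.03, np.nan],
})

points = heatmap_points(toy_df, frac=1.0)
print(points)
```

The resulting list can then be passed to `HeatMap(points)` in place of the full zipped list.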

Have fun interacting with the heatmap!

# Summary and Insights
- New York, the most populous city in the US, does not appear among the top cities by accident count.
- Fewer than 5% of cities have 1000 or more registered accidents in the dataset.
- On weekdays, accidents are most frequent from 6 am to 10 am and from 3 pm to 6 pm.