# Data Analysis Project
This is a small data analysis project in which I look at the accidents that occurred in the USA and at how they relate to cities and time.

## Data explanation

The data is taken from Kaggle.com and was published by Sobhan Moosavi. The title of the dataset is "US Accidents (3 million records -- updated)". It contains a large set of information on vehicle-related accidents that occurred in the USA.

## Downloading the data
Downloading the data and re-uploading it to Jupyter consumes a lot of time and resources, so I used an easier method: a GitHub project by JovianML called opendatasets. This tool goes to the website and downloads the data directly. Further information can be found on its [GitHub page](https://github.com/JovianML/opendatasets).

Install command:
pip install opendatasets --upgrade --quiet

Download data command:
```
import opendatasets as od

downloadURL = 'https://www.kaggle.com/sobhanmoosavi/us-accidents'

od.download(downloadURL)
```

In [None]:
# Installing the package
#pip install opendatasets --upgrade --quiet

## Downloading the data using opendatasets

In [None]:
#import opendatasets as od
#downloadURL = 'https://www.kaggle.com/sobhanmoosavi/us-accidents'
#od.download(downloadURL)

# OR
# Download the file from the website itself, or use the copy attached to this project.

## Importing libraries

In [None]:
import pandas as pd
import seaborn as sns
sns.set(style = 'darkgrid')
# set_palette applies the palette globally; color_palette alone only returns it
sns.set_palette("Paired")
import matplotlib.pyplot as plt
# To map the latitude and longitude
import folium
from folium.plugins import HeatMap

## Dataset Parameters

In [None]:
dataFileName = '../input/us-accidents/US_Accidents_Dec20_Updated.csv'

## Data Processing and Cleaning

In [None]:
data = pd.read_csv(dataFileName)
data

## Going through the data

In [None]:
# Overall review of the data
data.info()

# Looking at columns only
print(data.columns)

# Printing the number of rows and columns
print(f'rows = {len(data)}')
print(f'columns = {len(data.columns)}')

In [None]:
# Missing Values
# .isna() checks for missing values:
# it returns True where a value is empty, missing or null, and False otherwise
print(data.isna())

In [None]:
# Summing the missing data and ordering them
missingData = data.isna().sum().sort_values(ascending = False)

# Missing data as a percentage of all rows
missingPercent = missingData / len(data) * 100
missingPercent

In [None]:
# Columns with missing data only
missingData = missingPercent[missingPercent != 0]
missingData

In [None]:
# Plotting the missing data
graphMissing = missingData.plot(kind = 'barh',  figsize=(8, 6), title = 'Percentage of missing data per column')
graphMissing.set_xlabel('Percentage')
graphMissing.set_ylabel('Data Columns')
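
As a side note, `isna().mean()` gives the missing share per column in one step, which is equivalent to dividing `isna().sum()` by the number of rows. The sketch below uses a small made-up frame and also shows how columns above a threshold could be dropped. The toy values and the 40% threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the accident data (values are made up)
toy = pd.DataFrame({
    'City': ['Dayton', 'Dayton', None, 'Houston', 'Houston'],
    'Wind_Chill': [np.nan, np.nan, np.nan, 30.0, 28.5],
    'Start_Lat': [39.8, 39.7, 29.7, 29.8, 29.9],
})

# Share of missing values per column, in percent
toyMissingPercent = toy.isna().mean() * 100

# Keep only the columns that have missing values, largest first
toyMissing = toyMissingPercent[toyMissingPercent > 0].sort_values(ascending=False)
print(toyMissing)

# Drop every column whose missing share exceeds the assumed threshold
threshold = 40
toyReduced = toy.drop(columns=toyMissingPercent[toyMissingPercent > threshold].index)
print(list(toyReduced.columns))
```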

## Analyzing the data
I want to look at a few columns for this project, namely:
* City
* Start Time
* Start Lat, Start Lng
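
Since only a few columns are needed, one option is to load just those columns, which can reduce memory use on a file of this size. A minimal sketch follows; the inline CSV text, including the extra `ID` and `Severity` columns, is a made-up stand-in for the real file.

```python
import io
import pandas as pd

# Made-up rows standing in for the real CSV file
csvText = """ID,Severity,Start_Time,Start_Lat,Start_Lng,City
A-1,2,2016-02-08 05:46:00,39.865,-84.058,Dayton
A-2,3,2016-02-08 07:23:34,39.928,-82.831,Reynoldsburg
"""

wantedColumns = ['City', 'Start_Time', 'Start_Lat', 'Start_Lng']

# usecols limits the columns that are parsed;
# parse_dates converts Start_Time to datetime while reading
subset = pd.read_csv(io.StringIO(csvText), usecols=wantedColumns, parse_dates=['Start_Time'])
print(subset.dtypes)
```

With the real file, `io.StringIO(csvText)` would be replaced by `dataFileName`.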

## Accidents and City
Let's see which cities have the most accidents in a bar chart.

In [None]:
# Going through cities

cities =  data.City.unique()
print(f'Number of cities with accident: {len(cities)}')

citiesByAccident = data.City.value_counts()
citiesByAccident

In [None]:
# Plotting top 50 cities
graphCities = citiesByAccident[:50].plot(kind = 'bar', figsize=(13, 7), title= 'Number of accidents in the top 50 cities')
graphCities.title.set_size(20)
graphCities.set_xlabel('Cities', fontsize = 15)
graphCities.set_ylabel('Number of Accidents', fontsize = 15)
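
The counts returned by `value_counts()` can also be summarised numerically, for example as the share of cities at or above a given number of accidents. A small sketch with made-up city entries and an illustrative threshold of 2 (both are assumptions; the real analysis would use `data.City` and a much larger threshold):

```python
import pandas as pd

# Made-up city column standing in for data.City
toyCities = pd.Series(['Houston', 'Houston', 'Houston', 'Dayton', 'Dayton', 'Akron', 'Miami'])

toyCounts = toyCities.value_counts()

# Illustrative threshold; the real data would need a far larger one
minAccidents = 2

# The mean of a boolean mask is the fraction of cities at or above the threshold
shareHigh = (toyCounts >= minAccidents).mean()
print(f'{shareHigh:.0%} of cities have at least {minAccidents} accidents')
```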

## Time of accidents
The column "Start_Time" gives the time at which each accident occurred. From this column, one can answer the following questions:
- At what time of day do accidents occur most frequently?
- Which days of the week have the most accidents?
- Which months have the most accidents?
- What is the trend of accidents over the years?


In [None]:
data.Start_Time
# This column is a string, so the best way to begin is to convert it to a real datetime format

In [None]:
data.Start_Time = pd.to_datetime(data.Start_Time)
data.Start_Time
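
If some entries cannot be parsed, `pd.to_datetime` raises an error by default. One way to handle this, sketched below on made-up strings, is `errors='coerce'`, which turns unparseable entries into `NaT` so they can be counted and inspected. Whether the real column contains such entries is not checked here.

```python
import pandas as pd

# Made-up timestamps; the last entry is deliberately invalid
rawTimes = pd.Series(['2016-02-08 05:46:00', '2017-11-30 17:15:20', 'not a date'])

# errors='coerce' replaces values that cannot be parsed with NaT
parsedTimes = pd.to_datetime(rawTimes, errors='coerce')

print(parsedTimes)
print(f'unparseable entries: {parsedTimes.isna().sum()}')
```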

In [None]:
# After converting the column to datetime format, the time and date can easily be handled separately.
# Extract the hour from the datetime values
data.Start_Time.dt.hour

In [None]:
sns.histplot(data.Start_Time.dt.hour, bins=24, stat='probability').set_title('Time of the day of the Accidents')
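
The same hourly distribution can be read as numbers with `value_counts(normalize=True)`, which returns the share of rows per hour. A minimal sketch on made-up timestamps:

```python
import pandas as pd

toyTimes = pd.to_datetime(pd.Series([
    '2019-03-04 07:15:00', '2019-03-04 07:50:00',
    '2019-03-04 16:05:00', '2019-03-05 08:30:00',
]))

# Share of accidents per hour of the day, ordered by hour
hourShare = toyTimes.dt.hour.value_counts(normalize=True).sort_index()
print(hourShare)
```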

In [None]:
# Now let's look at the day of the week chart
sns.histplot(data.Start_Time.dt.dayofweek, bins=7, stat ='probability').set_title('Day of the week of the Accidents')
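
In pandas, `dt.dayofweek` encodes Monday as 0 and Sunday as 6. To make the result easier to read, `dt.day_name()` returns the weekday names directly. A small sketch with made-up dates:

```python
import pandas as pd

toyDates = pd.to_datetime(pd.Series(['2019-03-04', '2019-03-09', '2019-03-10']))

# 2019-03-04 was a Monday, so dayofweek is 0; Saturday is 5 and Sunday is 6
print(toyDates.dt.dayofweek.tolist())

# Weekday names instead of numbers
print(toyDates.dt.day_name().tolist())
```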

In [None]:
# Going one step further: check the hourly trend for each day of the week.
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig, axs =  plt.subplots(1, len(days), figsize=(20, 4))

for i in range(7):
    dayTime = data.Start_Time[data.Start_Time.dt.dayofweek == i]
    sns.histplot(dayTime.dt.hour, bins = 24, ax=axs[i], stat ='probability').set_title(f'{days[i]}')
    
plt.tight_layout()
plt.show()
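
The per-day panels can also be summarised in a single table. The sketch below, on made-up timestamps, uses `pd.crosstab` with `normalize='index'` so that each row (one weekday) sums to 1, which mirrors `stat='probability'` in the plots.

```python
import pandas as pd

toyTimes = pd.to_datetime(pd.Series([
    '2019-03-04 07:15:00',  # Monday
    '2019-03-04 17:40:00',  # Monday
    '2019-03-09 13:05:00',  # Saturday
    '2019-03-09 14:20:00',  # Saturday
]))

# Rows: day of week (0 = Monday), columns: hour, values: share within that day
hourByDay = pd.crosstab(
    toyTimes.dt.dayofweek.rename('dayofweek'),
    toyTimes.dt.hour.rename('hour'),
    normalize='index',
)
print(hourByDay)
```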


In [None]:
# Accidents by month of the year
sns.histplot(data.Start_Time.dt.month, bins=12, stat ='probability').set_title('Accidents by month')

In [None]:
years = [2016, 2017, 2018, 2019]

fig, axs =  plt.subplots(1, len(years), figsize=(20, 4))

for i in range(len(years)):
    year = data.Start_Time[data.Start_Time.dt.year == years[i]]
    sns.histplot(year.dt.month, bins = 12, stat ='probability', ax=axs[i]).set_title(f'{years[i]}')
    
plt.tight_layout()
plt.show()
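
To check whether a year is only partly covered, the number of records per year and month can be tabulated. A minimal sketch on made-up timestamps; a month without records in a given year shows up as 0:

```python
import pandas as pd

toyTimes = pd.to_datetime(pd.Series([
    '2016-11-02 08:00:00', '2016-12-15 09:30:00',
    '2017-01-10 07:45:00', '2017-11-20 16:10:00', '2017-12-01 17:05:00',
]))

# Count records per (year, month) and spread the months across columns
monthlyCounts = (
    toyTimes.groupby([toyTimes.dt.year.rename('year'), toyTimes.dt.month.rename('month')])
    .size()
    .unstack(fill_value=0)
)
print(monthlyCounts)
```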


In [None]:
# s sets the marker size; the size parameter expects a grouping variable instead
sns.scatterplot(x=data.Start_Lng, y=data.Start_Lat, s=1)

In [None]:
# Creating a heatmap
# Named accidentMap so that the built-in map() is not shadowed
accidentMap = folium.Map()
HeatMap(list(zip(data.Start_Lat, data.Start_Lng))).add_to(accidentMap)
accidentMap
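
Passing every row to `HeatMap` can be slow for millions of records, and rows with missing coordinates may cause problems. One possible approach, sketched below with made-up coordinates, is to drop incomplete rows and draw a random sample before building the list of points. The sample size and the seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Made-up coordinates; one row has a missing longitude
toyCoords = pd.DataFrame({
    'Start_Lat': [39.865, 39.928, 29.760, 34.052, 40.713],
    'Start_Lng': [-84.058, -82.831, np.nan, -118.244, -74.006],
})

# Remove rows without a complete coordinate pair
complete = toyCoords.dropna(subset=['Start_Lat', 'Start_Lng'])

# Random sample to keep the map light; size and seed are arbitrary
sampleSize = min(3, len(complete))
sample = complete.sample(n=sampleSize, random_state=42)

# List of [lat, lng] pairs, the shape that could be passed to HeatMap
points = sample[['Start_Lat', 'Start_Lng']].values.tolist()
print(points)
```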

# Summary and Conclusion
### Accidents and cities
- In the data set, 3 columns have more than 40% of their data missing, namely *Wind Chill, Precipitation and Number*. It is generally recommended not to use columns with more than 30-40% missing data for analysis. Hence, it is best either to remove these columns or not to use them at all.
- The number of accidents per city decreases exponentially
- Less than 5% of cities have more than 1000 accidents

### Time of accidents
- Most accidents occur in the morning between *07:00 and 08:00* and later in the afternoon between *15:00 and 17:00*
    - -> Probably, people are going to work and getting home after work
- On weekdays, the number of accidents stays within a similar range, and it decreases over the weekend.
    - -> One can assume that people don't travel as much over the weekend as on weekdays
- From the statements above, it can be concluded that accidents occur mostly when people are commuting to and from work. This theory is supported by the graphs of the time of accidents on different days of the week: there is a clear trend that on weekdays accidents happen during the rush hours, while on weekends they happen during the afternoon.
- It seems that there are more accidents in the months from *October* to *December*. One can speculate that, December being the holiday season, people start to travel more from one place to another, and hence the accident rate increases.
- There is also a steep increase in the number of accidents starting as early as July, but the concrete reason still needs to be investigated.
- It seems that the data is incomplete for the year 2016, which would explain the odd trend at the beginning.

### Place of Accidents
- From the heatmap, it is plausible to conclude that there are more accidents in the coastal areas, which are also the places where the population is high.


Much more information can be extracted from this data set, such as state-wise accidents, sources of the accidents, weather conditions and many more. I have shown just a small sample of the data analysis that one can perform with the help of pandas DataFrames and visualisation. Such analysis can be applied not only to this data set but to any other data.

If there are any questions