## Canada Accident Analysis 1999-2017

- The project gives detailed insights into **long-term road accident trends in Canada between 1999 and 2017**, including but not limited to potential casualties due to road accidents, areas most affected by accidents, the impact of aggressive driving, geographical regions well known for accidents, and environmental conditions.

- This notebook explores and visualizes the National Collision Database (NCDB) [dataset](https://open.canada.ca/data/en/dataset/1eb9eba7-71d1-4b30-9fb1-30cbdab7e63a).

### Contents
**1. Proposing Questions to Be Answered**

**2. Reading Data**

**3. Cleaning Data**

**4. Exploration**

**5. Conclusion**

#### Questions

3.1 Exploration based on date of accidents

- Is the number of accidents per year decreasing (from 1999 to 2017)?
- Which months have a higher frequency of accidents?
- Which Day-of-the-Week is safest to drive?
- Time series of all accidents from 2005 to 2016
- Time series for all accidents in each year

3.2 Exploration based on roads where accidents occurred

- Which types of roads are high risk?
- Which type of road gradient is high risk?

3.3 Exploration based on people involved in the accidents

- What was the condition of the people after the accident?
- What was the age distribution of the people involved?
- What was the sex distribution of the people involved?

3.4 Exploration based on use of safety equipment

- What was the distribution of Safety Equipment used?
- Did the use of Safety Equipment impact the condition of people after the accident?

In [9]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
import calendar
import plotly.express as px

In [10]:
df = pd.read_csv('NCDB_1999_to_2017.csv')


Columns (1,2,5,12) have mixed types. Specify dtype option on import or set low_memory=False.
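
The warning above names two remedies. As a minimal sketch, assuming the mixed-type columns hold categorical codes that can safely be read as text, an explicit `dtype` mapping removes the per-chunk type guessing. The inline CSV and the `code_columns` list are illustrative stand-ins, not the actual NCDB file or its full column set.

```python
import io
import pandas as pd

# Tiny stand-in for NCDB_1999_to_2017.csv (assumption: illustrative rows only).
# C_MNTH mixes numeric-looking codes with the 'UU' sentinel, as in the real file.
sample_csv = io.StringIO(
    "C_YEAR,C_MNTH,C_WDAY,C_VEHS\n"
    "1999,01,1,02\n"
    "1999,UU,U,UU\n"
    "2000,12,7,01\n"
)

# Hypothetical list of code columns to read as text.
code_columns = ["C_MNTH", "C_WDAY", "C_VEHS"]

# Reading these columns as strings keeps codes such as '01' and 'UU' intact
# and avoids the mixed-type warning for them.
sample = pd.read_csv(sample_csv, dtype={col: str for col in code_columns})

print(sample.dtypes)
```

On the full file the same mapping would be passed to the original `pd.read_csv('NCDB_1999_to_2017.csv', ...)` call. Setting `low_memory=False` is the other option the warning mentions, at the cost of higher memory use while parsing.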



In [11]:
pd.options.display.max_columns = None
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6772563 entries, 0 to 6772562
Data columns (total 23 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   C_YEAR  int64 
 1   C_MNTH  object
 2   C_WDAY  object
 3   C_HOUR  object
 4   C_SEV   int64 
 5   C_VEHS  object
 6   C_CONF  object
 7   C_RCFG  object
 8   C_WTHR  object
 9   C_RSUR  object
 10  C_RALN  object
 11  C_TRAF  object
 12  V_ID    object
 13  V_TYPE  object
 14  V_YEAR  object
 15  P_ID    object
 16  P_SEX   object
 17  P_AGE   object
 18  P_PSN   object
 19  P_ISEV  object
 20  P_SAFE  object
 21  P_USER  object
 22  C_CASE  int64 
dtypes: int64(3), object(20)
memory usage: 1.2+ GB


## Data Cleaning

In [None]:
# Treat the 'U...' and 'X...' placeholder codes as missing values (whole-cell matches only)
df_clean = df.replace(['UU','XX','U','X','UUUU','XXXX'], np.nan, regex=False)

def missing_data(data):
    """Return the count and percentage of NaN values per column, largest first."""
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total NaN Values', 'Percentage of NaN Values'])

missing_data(df_clean)
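
To make the effect of the sentinel replacement concrete, here is a small sketch on a made-up frame (the `toy` data is hypothetical, not taken from NCDB). It shows that `replace` with a list matches whole cell values only, and computes the missing share per column in a compact way with `isna().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the NCDB coding style.
toy = pd.DataFrame({
    "C_MNTH": ["01", "UU", "12", "XX"],
    "P_SEX": ["M", "F", "U", "M"],
    "V_YEAR": ["1999", "UUUU", "2005", "2010"],
})

# Same sentinel list as in the cleaning cell above.
sentinels = ["UU", "XX", "U", "X", "UUUU", "XXXX"]
toy_clean = toy.replace(sentinels, np.nan)

# Percentage of missing values per column, largest first.
nan_share = toy_clean.isna().mean().mul(100).sort_values(ascending=False)
print(nan_share)
```

Because matching is done on whole cells, a value such as `'M'` is left untouched while a standalone `'U'` becomes `NaN`.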

In [None]:
# Convert the month and weekday codes to numeric (float64); any non-numeric entry becomes NaN
df_clean['C_MNTH'] = pd.to_numeric(df_clean['C_MNTH'], errors='coerce')
df_clean['C_WDAY'] = pd.to_numeric(df_clean['C_WDAY'], errors='coerce')
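
The conversion above can be illustrated on a small series. This is a sketch on hypothetical values: `errors='coerce'` turns anything unparsable into `NaN`, which forces a `float64` result. If integer codes are preferred, pandas' nullable `Int64` dtype is one option. That last step is an addition for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical mixed-type column: strings, an int, a sentinel and a missing value.
raw_month = pd.Series(["01", "12", "UU", 7, None], name="C_MNTH")

# Unparsable entries ('UU') and missing values become NaN, so the dtype is float64.
as_float = pd.to_numeric(raw_month, errors="coerce")

# Optional: nullable integer dtype keeps whole-number codes and marks gaps as <NA>.
as_int = as_float.astype("Int64")

print(as_float.dtype, as_int.dtype)
```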

### What is the road accident trend from 1999 to 2017?

In [None]:
def create_stack_bar_data(col, df):
    aggregated = df[col].value_counts().sort_index()
    x_values = aggregated.index.tolist()
    y_values = aggregated.values.tolist()
    return x_values, y_values

x1, y1 = create_stack_bar_data('C_YEAR', df_clean)

#x1 = x1[:-1]
#y1 = y1[:-1]
#color1 = ['092a35']*9
#color2 = ['a2738c']*3
#color1.extend(color2)
trace1 = go.Bar(x=x1, y=y1, opacity=0.75, name="year count")
layout = dict(
    height=400,
    title={'text':'Year wise Number of Accidents in Canada','y':0.85,'x':0.5,'xanchor':'center','yanchor':'top'},
    legend=dict(orientation="h"),
    xaxis=dict(title='Year'),
    yaxis=dict(title='Number of Accidents'),
)

fig = go.Figure(data=[trace1], layout=layout)
iplot(fig)
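
One caveat about the yearly count: `value_counts()` on `C_YEAR` counts rows, while the month and weekday cells count distinct `C_CASE` values. The sketch below assumes that each NCDB row describes one person and that `C_CASE` identifies a collision. Both are assumptions about the dataset and should be checked against the NCDB data dictionary. The `toy` frame is made up to show how the two counts can differ.

```python
import pandas as pd

# Hypothetical data: case 1 involves two people, case 2 one person, case 3 two people.
toy = pd.DataFrame({
    "C_YEAR": [1999, 1999, 1999, 2000, 2000],
    "C_CASE": [1, 1, 2, 3, 3],
})

# Row counts per year (what value_counts() on C_YEAR measures).
rows_per_year = toy["C_YEAR"].value_counts().sort_index()

# Distinct collisions per year (same approach as the month and weekday cells).
collisions_per_year = toy.groupby("C_YEAR")["C_CASE"].nunique()

print(pd.DataFrame({"rows": rows_per_year, "collisions": collisions_per_year}))
```

If the assumption holds, using the `nunique()` variant for the yearly chart would make all three bar charts measure the same quantity.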

### Which months have higher frequency of accidents?

In [None]:
# Count distinct collisions (C_CASE) per month
nu_month = df_clean.groupby('C_MNTH')['C_CASE'].nunique()

# Plot the aggregated values directly; the month codes are the index of nu_month
fig = px.bar(x = nu_month.index.tolist(), y = nu_month.tolist(), labels={'x':'Month','y':'Number of Accidents'})
fig.update_layout(
    title={
        'text': "Month Wise Number of Accidents in Canada",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()
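
After the numeric conversion the month codes are floats such as `1.0` and `12.0`, which make for awkward axis labels. As a small sketch, the already imported `calendar` module can map them to abbreviated month names. The `nu_month_toy` series and its counts are made up for illustration.

```python
import calendar
import pandas as pd

# Hypothetical stand-in for nu_month: float month codes as index, made-up counts.
nu_month_toy = pd.Series([120, 95, 130], index=[1.0, 2.0, 12.0], name="C_CASE")

# calendar.month_abbr is indexed 1..12 (index 0 is an empty string).
month_labels = [calendar.month_abbr[int(m)] for m in nu_month_toy.index]

print(month_labels)
```

The resulting list can be passed as `x` to `px.bar` in place of the numeric codes. The names come from the active locale, which is English by default.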

### Which Day-of-the-Week is less safe?

In [None]:
# Count distinct collisions (C_CASE) per day of the week
nu_day = df_clean.groupby('C_WDAY')['C_CASE'].nunique()

# Plot the aggregated values directly; the weekday codes are the index of nu_day
fig = px.bar(x = nu_day.index.tolist(), y = nu_day.tolist(), labels={'x':'Day of the week','y':'Number of Accidents'})
fig.update_layout(
    title={
        'text': "Day Wise Number of Accidents in Canada",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()