<h1>Table of Contents<span class="tocSkip"></span></h1>
    <div class="toc"><ul class="toc-item"><li><span><a href="#Road-Traffic-Accident-Classification---EDA" data-toc-modified-id="Coursework-3-:-English-Property-Prices-1">Road Traffic Accident Classification - EDA</a></span><ul class="toc-item"><li><span><a href="#The-Goal-Of-The-Notebook" data-toc-modified-id="The-goal-of-the-notebook">The Goal Of The Notebook</a></span></li><li><span><a href="#RTA-Dataset-Details" data-toc-modified-id="RTA-Dataset-Details">RTA Dataset Details</a></span></li></ul></li><li><span><a href="#1.0-Loading-and-Preparation-of-Data" data-toc-modified-id="1.0 Loading and Preparation of Data">1.0 Loading and Preparation of Data</a></span><ul class="toc-item"><li><span><a href="#1.1-Import-Libraries" data-toc-modified-id="1.1-Import-Libraries">1.1 Import Libraries</a></span></li><li><span><a href="#1.2-Loading-Dataset-and-Doing-Initial-Analysis" data-toc-modified-id="1.2-Missing-values-in-the-region-or-area-data-(2c)-2.2">1.2 Loading Dataset and Doing Initial Analysis</a></span></li><li><span><a href="#1.3-Identifying-Missing-Values-And-Analysis" data-toc-modified-id="1.3-Identifying-Missing-Values-And-Analysis">1.3 Identifying Missing Values And Analysis</a></span></li></ul></li><li><span><a href="#2.0-EDA-Categorical-Variables" data-toc-modified-id="2.0-EDA-Categorical-Variables">2.0 EDA Categorical Variables</a></span><ul class="toc-item"><li><span><a href="#2.1-Display-Each-Categorical-Variable-Values" data-toc-modified-id="2.1 Display-Each-Categorical-Variable-Values">2.1 Display Each Categorical Variable Values</a></span></li><li><span><a href="#2.2-Display-Count-Of-Each-Category" data-toc-modified-id="2.2-Display-Count-Of-Each-Category">2.2 Display Count Of Each Category</a></span></li><li><span><a href="#2.3-Display-Ratio-Of-Each-Category-Along-With-Entropy-And-Variation-Ratio" data-toc-modified-id="2.3-Display-Ratio-Of-Each-Category-Along-With-Entropy-And-Variation-Ratio">2.3 Display Ratio Of Each Category Along With Entropy And 
Variation Ratio</a></span></li><li><span><a href="#2.4-Display-Segment-Categorical-Features-By-The-Target-Classes" data-toc-modified-id="2.4-Display-Segment-Categorical-Features-By-The-Target-Classes">2.4 Display Segment Categorical Features By The Target Classes</a></span></li></ul></li><li><span><a href="#3.0-EDA-Numerical-Variables" data-toc-modified-id="3.0-EDA-Numerical-Variables">3.0 EDA Numerical Variables</a></span></li></ul></div>

# Road Traffic Accident Classification - EDA

## The Goal Of The Notebook

The goal of this notebook is to perform Exploratory Data Analysis (EDA): to visualise the data, discover trends and patterns, and check assumptions with the help of statistical summaries and graphical representations.

## RTA Dataset Details

This data set was collected from Addis Ababa Sub-city police departments for Master's research work. It was prepared from manual records of road traffic accidents for the years 2017-20. All sensitive information was excluded during data encoding, and the final data set has 32 features and 12316 accident instances. It is then preprocessed and used to identify the major causes of accidents by analyzing it with different machine learning classification algorithms.

# 1.0 Loading and Preparation of Data

## 1.1 Import Libraries


In [None]:
import random
import pandas as pd
import numpy as np
import math

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from scipy.stats import entropy

from IPython.display import display

# Set notebook mode to work offline (a single call is enough)
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=False)

from scipy.stats import chi2_contingency

# render plotly figures inline in the notebook;
# the default renderer is configured through plotly.io
import plotly.io as pio
pio.renderers.default = 'notebook'

## 1.2 Loading Dataset and Doing Initial Analysis

Load the Road Traffic Accident data set into a dataframe named `data`, then do some initial data analysis and share observations.

In [None]:
# load data
data = pd.read_csv('RTA_Dataset.csv')

In [None]:
# rows and columns of the data
print(data.shape)

In [None]:
# Get the number of rows and columns
rows = len(data.axes[0])
cols = len(data.axes[1])

# Print the number of rows and columns
print("Number of Rows: " + str(rows))
print("Number of Columns: " + str(cols))

In [None]:
# visualise the dataset
data.head()

In [None]:
# counts per severity class as a two-column frame (Accident_severity, count),
# so the column names do not depend on the pandas version
severity_counts = (data['Accident_severity'].value_counts()
                   .rename_axis('Accident_severity')
                   .reset_index(name='count'))
fig = px.bar(severity_counts, x='Accident_severity', y='count', text_auto=True, color='Accident_severity')
fig.show("notebook")

In [None]:
# Checking the distribution of the target 'Accident_severity' as counts
data['Accident_severity'].value_counts()

In [None]:
# Checking the distribution of the target 'Accident_severity' in "%" form
data.Accident_severity.value_counts(normalize = True)*100

<b>Inference:</b> The dataset is highly imbalanced: <b>Slight Injury accounts for 84.6%, Serious Injury for 14.1% and Fatal injury for 1.28%</b> of the records.

In [None]:
#concise summary of the dataframe.
data.info()

In [None]:
data['Time'] = data['Time'].astype('datetime64[ns]')
data.info()

## 1.3 Identifying Missing Values And Analysis

In [None]:
# we can use the mean method after isnull
# to get the fraction of missing values
# for each variable (multiply by 100 for a percentage)

data.isnull().mean()

In [None]:
# we can quantify the total number of missing values using
# the isnull method plus the sum method on the dataframe

data.isnull().sum()

In [None]:
# list the columns that contain missing data

[col for col in data.columns if data[col].isnull().sum() > 0]

In [None]:
#Function to make missing value table
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()

        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)

        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")

        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
train_missing= missing_values_table(data)
train_missing

In [None]:
train_missing.index.values

In [None]:

msno.bar(data)

Let’s visualize its missing values using a matrix plot. In the matrix plot, each white line represents missing observations and the lines are visualized in the order they appear in the dataset, top to bottom. For large datasets like this, a cluster of missing values will form brighter white lines.

In [None]:
#Let’s visualize its missing values using a matrix plot
msno.matrix(data)

In [None]:
msno.matrix(data.sample(100))

To find out if the missingness has any correlation with any of the existing variables, we will use a correlation heatmap.

In [None]:
#Let’s visualize its missing values using correlation heatmap
msno.heatmap(data, cmap='rainbow');

High scores indicate that missing values in one column are highly dependent on the missingness of another column. Here we can see a strong correlation between "Vehicle_driver_relation" and "Educational_level", and likewise between "Fitness_of_casuality" and "Work_of_casuality".
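
The idea behind these scores can be sketched by correlating the 0/1 missingness indicators of the columns. This is only an illustration on a made-up toy frame (the columns `a`, `b` and `c` are hypothetical, not RTA columns), and it is not a claim about how `missingno` computes the heatmap internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" and "b" are missing on the same rows,
# "c" is missing on a different row.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": ["x", None, "y", None, "z", "w"],
    "c": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# 1 where a value is missing, 0 otherwise
is_missing = toy.isnull().astype(int)

# Pearson correlation between the missingness indicators
nullity_corr = is_missing.corr()
print(nullity_corr.round(2))
```

A value close to 1 (as for `a` and `b` here) means the two columns tend to be missing on the same rows, while a value near 0 or below means they do not.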



In [None]:
msno.matrix(data.sample(1000).sort_values(['Vehicle_driver_relation','Educational_level']));

In [None]:
msno.matrix(data.sample(1000).sort_values(['Driving_experience','Vehicle_driver_relation']));

In [None]:
msno.matrix(data.sample(1000).sort_values(['Fitness_of_casuality','Work_of_casuality']));

In [None]:
new = train_missing.index.values.tolist()

In [None]:
new

In [None]:
msno.matrix(data.sample(2000).sort_values(new));

In [None]:
null_data = data[data.isnull().any(axis=1)]

In [None]:
null_data.groupby(['Accident_severity']).apply(lambda x: x[x['Driving_experience'].isna()])

The plot shows that if a data point is missing in <b>Fitness_of_casuality</b>, we can guess that it is also missing from the <b>Work_of_casuality</b> column, and vice versa. Because of this connection, we treat the missing data in both columns as <b>missing not at random (MNAR)</b>, so we can't simply delete the missing observations because that can lead to bias.

For the remaining columns, we can say the values are <b>missing completely at random (MCAR)</b> because of the small number of missing values and the low correlation.
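
A quick way to check how often two columns are missing together is to cross-tabulate their missingness flags. The sketch below uses a made-up toy frame (`col_a` and `col_b` are hypothetical names standing in for a pair of columns); it only illustrates the check and does not by itself establish the missingness mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for two columns with linked missingness
toy = pd.DataFrame({
    "col_a": [1, np.nan, 3, np.nan, 5, np.nan],
    "col_b": [1, np.nan, 3, np.nan, np.nan, 6],
})

a_missing = toy["col_a"].isnull()
b_missing = toy["col_b"].isnull()

# Rows cross-tabulated by whether each column is missing
print(pd.crosstab(a_missing, b_missing))

# Share of rows with col_a missing in which col_b is missing as well
share_b_given_a = (a_missing & b_missing).sum() / a_missing.sum()
print(round(share_b_given_a, 2))
```

A share close to 1 would support the reading that the two columns go missing together.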

In [None]:
msno.dendrogram(data)

# 2.0 EDA Categorical Variables

As a first step, separate the categorical variables into a new dataframe.

In [None]:
# find categorical variables
cat_cols = [c for c in data.columns if data[c].dtypes=='O']
df_cat_cols = data[cat_cols]
data[cat_cols].head()

In [None]:
n_cat_cols = len(cat_cols)
print("Number of Categorical Variables: " + str(n_cat_cols))

## 2.1 Display Each Categorical Variable Values

There are 29 categorical variables.

Now, let's check the categories in each variable.


In [None]:
for col in cat_cols:
    print("--------------------------")
    print("Column Name: " + col)
    print("--------------------------")
    print(" ")
    print(df_cat_cols[col].unique())
    print(" ")
    print(" ")

## 2.2 Display Count Of Each Category

Let's check the count of each category in every variable.

In [None]:
for col in cat_cols:
    print("--------------------------")
    print("Column Name: " + col)
    print("--------------------------")
    print(" ")
    print(df_cat_cols[col].value_counts())
    print(" ")
    print(" ")

I have included <b>entropy</b> because, in information theory, entropy is a measure of the amount of uncertainty or randomness in a system. When applied to a categorical variable, it measures the uncertainty or randomness in the distribution of that variable.

I have also included the variation ratio, a measure of the diversity or variability of a categorical variable. It is calculated as one minus the proportion of observations that fall in the most frequent category (the mode). The lower the variation ratio, the more likely the variable is <b>quasi-constant</b> (quasi-constant features are those that show the same value for the great majority of observations in the dataset).
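
As a small worked illustration, consistent with `entropy(prob, base=2)` and the `variationRation` function used in this notebook (the symbols $p_i$ and $p_{\text{mode}}$ and the 90/10 split are chosen here for the example, not taken from the data), consider a variable with $k$ categories and proportions $p_1, \dots, p_k$:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad VR = 1 - p_{\text{mode}}$$

For a two-category variable split 90% / 10%:

$$H = -\left(0.9\log_2 0.9 + 0.1\log_2 0.1\right) \approx 0.47 \text{ bits}, \qquad VR = 1 - 0.9 = 0.1$$

Both values are low, which is what a near-constant variable looks like; an even 50/50 split would instead give $H = 1$ bit and $VR = 0.5$.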

In [None]:
#defining function to calculate Variation Ratio
def variationRation(df, col):
    myFreqs = df[col].value_counts()
    fMode = myFreqs.max()
    n = myFreqs.sum()
    pMode = fMode / n
    VR =1 -pMode
    return VR

In [None]:
for col in cat_cols:
    # category counts as a two-column frame (<col>, count),
    # so the column names do not depend on the pandas version
    df_counts = df_cat_cols[col].value_counts().rename_axis(col).reset_index(name='count')
    prob = df_cat_cols[col].value_counts(normalize=True).tolist()
    Variation = variationRation(df_cat_cols, col)
    ENTR = entropy(prob, base=2)

    colors = px.colors.sequential.Plasma[:len(df_counts)]

    fig = px.bar(df_counts, x=col, y='count', text_auto=True,
                 title='Entropy {:.2f}, Variation Ratio {:.2f}'.format(ENTR, Variation),
                 color=col, color_discrete_sequence=colors)

    fig.show("notebook")


## 2.3 Display Ratio Of Each Category Along With Entropy And Variation Ratio

Entropy: The randomness of the system is measured by entropy. The higher the disorder, the higher the entropy.

Variation Ratio: Fraction of cases different from the mode (most frequent value).

Any variable in which a single category has a high ratio and the variation ratio is low could be quasi-constant.
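
One way to turn this rule of thumb into a check is to flag columns whose most frequent category exceeds a chosen share. The sketch below is an assumption-laden illustration: the helper `mode_share`, the toy columns and the 0.90 threshold are all hypothetical choices, not values taken from the RTA analysis.

```python
import pandas as pd

def mode_share(series: pd.Series) -> float:
    """Share of non-missing observations taken by the most frequent category."""
    # value_counts sorts in descending order, so the first entry is the mode
    return series.value_counts(normalize=True).iloc[0]

# Hypothetical toy data, not the RTA dataset
toy = pd.DataFrame({
    "mostly_one_value": ["a"] * 19 + ["b"],
    "balanced": ["a", "b"] * 10,
})

threshold = 0.90  # assumed cut-off for calling a feature quasi-constant
quasi_constant = [c for c in toy.columns if mode_share(toy[c]) >= threshold]
print(quasi_constant)
```

Note that the mode share is simply one minus the variation ratio, so a threshold of 0.90 corresponds to a variation ratio of at most 0.10.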

In [None]:
variationRation(df_cat_cols, "Vehicle_driver_relation")

In [None]:
for col in cat_cols:
    print("--------------------------")
    print("Column Name: " + col)
    print("--------------------------")
    print(" ")
    prob = (df_cat_cols[col].value_counts()/ len(df_cat_cols[col])).tolist()
    Variation = variationRation(df_cat_cols, col)
    ENTR = entropy(prob, base=2)
    print("Entropy: {}"  .format(ENTR))
    print("Variation Ratio: {}"  .format(Variation))
    print(" ")
    print(df_cat_cols[col].value_counts()/ len(df_cat_cols[col]))
    print(" ")
    print(" ")

In [None]:
# Bar chart of the category proportions for each categorical column
for col in cat_cols:
    # proportions as a two-column frame (<col>, proportion)
    df = (df_cat_cols[col].value_counts() / len(df_cat_cols[col])).rename_axis(col).reset_index(name='proportion')
    prob = df['proportion'].tolist()
    Variation = variationRation(df_cat_cols, col)
    ENTR = entropy(prob, base=2)

    # pick the colors for this column instead of reusing a stale list
    colors = px.colors.sequential.Plasma[:len(df)]

    fig = px.bar(df, x=col, y='proportion', text_auto=True,
                 title='Entropy {:.2f}, Variation Ratio {:.2f}'.format(ENTR, Variation),
                 color=col, color_discrete_sequence=colors)
    fig.show("notebook")

Contingency Table

In [None]:
for col in cat_cols:
    ct_table_ind=pd.crosstab(df_cat_cols["Accident_severity"],df_cat_cols[col])
    print('contingency_table :')
    print(ct_table_ind)
    print("")
    print("")
    print("")

In [None]:
# Assuming you have already defined 'cat_cols' and 'df_cat_cols'

significant_cols = []
insignificant_cols = []

for col in cat_cols:
    if col == 'Accident_severity':
        continue  # do not test the target against itself
    obs = pd.crosstab(df_cat_cols[col], df_cat_cols['Accident_severity'])
    chi2, pval, dof, expected = chi2_contingency(obs)
    if pval < 0.05:
        significant_cols.append(col)
    else:
        insignificant_cols.append(col)

print("Columns associated with Accident_severity outcome variable (significant):")
print(", ".join(significant_cols))
print("\nColumns not associated with Accident_severity outcome variable (insignificant):")
print(", ".join(insignificant_cols))

<b>Inference:</b> As can be observed, the categorical variables <b>Pedestrian_movement</b>, <b>Fitness_of_casuality</b> and <b>Road_allignment</b> are <b>quasi-constant</b> features because approximately 90% of their data falls in one category. In other words, these features have the same value for a very large subset of the observations. Such features are not very useful for making predictions, so we are going to remove them.

## 2.4 Display Segment Categorical Features By The Target Classes

In [None]:
#Segment Categorical features by the target classes
for col in data.select_dtypes(include='object'):
    if data[col].nunique() <=4:
        display(pd.crosstab(data['Accident_severity'], data[col], normalize='index'))
        print("")
        print("")
        print("")
        #display(pd.crosstab(data['target'], data[col]))

In [None]:
#Count plot of target across various categorical features
for col in data.select_dtypes(include='object'):
    if data[col].nunique() <= 4:
        fig = px.histogram(data, x=col, color='Accident_severity', barmode='group')
        fig.update_xaxes(tickangle=60)
        fig.show()

In [None]:
df_temp = data.loc[:, ['Time', 'Area_accident_occured', 'Number_of_casualties', 'Accident_severity']]
df_temp

In [None]:
columns_names = ['Day_of_week','Sex_of_driver', 'Educational_level', 'Vehicle_driver_relation', 'Driving_experience', 'Type_of_vehicle' ,'Owner_of_vehicle', 'Service_year_of_vehicle',
 'Defect_of_vehicle', 'Area_accident_occured', 'Types_of_Junction', 'Road_surface_type', 'Road_surface_conditions', 'Light_conditions', 'Weather_conditions',
  'Vehicle_movement','Casualty_class','Sex_of_casualty', 'Cause_of_accident', 'Accident_severity','Age_band_of_casualty','Age_band_of_driver']

# drop rows with missing values once, outside the loop
df_temp = data[columns_names].dropna()

for feature in columns_names:
    fig = px.histogram(df_temp, x="Accident_severity", color=feature, barmode="group", height=500)
    fig.update_xaxes(categoryorder='total ascending')
    fig.show("notebook")

In [None]:
columns_names = ['Day_of_week','Sex_of_driver', 'Educational_level', 'Vehicle_driver_relation', 'Driving_experience', 'Type_of_vehicle' ,'Owner_of_vehicle', 'Service_year_of_vehicle',
 'Defect_of_vehicle', 'Area_accident_occured', 'Types_of_Junction', 'Road_surface_type', 'Road_surface_conditions', 'Light_conditions', 'Weather_conditions',
  'Vehicle_movement','Casualty_class','Sex_of_casualty', 'Cause_of_accident', 'Accident_severity']

# drop rows with missing values once, outside the loop
df_temp = data[columns_names].dropna()

for feature in columns_names:
    fig = px.histogram(df_temp, x=feature, color="Accident_severity", height=500)
    fig.update_xaxes(categoryorder='total ascending')
    fig.show("notebook")

In [None]:
for feature in columns_names:
    # row-normalised crosstab: share of each category within a severity class
    data_2 = pd.crosstab(index=data['Accident_severity'], columns=data[feature], normalize='index')
    fig = px.imshow(data_2.T,
                labels=dict(x="Accident_severity", y=feature, color="Proportion"),
                y=data_2.columns.tolist(),
                x=data_2.index, height=800
               )
    fig.update_xaxes(side="top")
    fig.show("notebook")

In [None]:
#df_temp['Time'] = pd.to_datetime(data['Time'])
fig = px.scatter(data, x="Time", y="Area_accident_occured",  size="Number_of_casualties", color="Accident_severity", size_max=30)
fig.update_layout( height=1000 )
fig.show("notebook")

In [None]:
['Day_of_week', 'Age_band_of_driver', 'Area_accident_occured', 'Types_of_Junction', 'Light_conditions', 'Weather_conditions', 'Age_band_of_casualty', 'Accident_severity']

In [None]:
data['hour'] = data['Time'].dt.hour

In [None]:
fig = px.histogram(data, x='hour', color="Accident_severity",  height=500).update_xaxes(categoryorder='total ascending')
fig.show("notebook")

In [None]:
# function to divide the whole day into different sessions
def f(x):
    if (x > 4) and (x <= 8):
        return 'Early Morning'
    elif (x > 8) and (x <= 12):
        return 'Morning'
    elif (x > 12) and (x <= 16):
        return 'Noon'
    elif (x > 16) and (x <= 20):
        return 'Eve'
    elif (x > 20) and (x <= 24):
        return 'Night'
    elif (x <= 4):
        return 'Late Night'

In [None]:
data['session'] = data['hour'].apply(f)

In [None]:
data['session']
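
The same binning can also be sketched with `pd.cut`, which assigns the labels in one vectorised call. This is an alternative to the function `f`, shown on a few made-up example hours; the bin edges are an assumption chosen to mirror the conditions in `f` (right-inclusive intervals, with `-1` as the lower edge so that hour 0 is included).

```python
import pandas as pd

# Right-inclusive bins mirroring f: (-1, 4], (4, 8], (8, 12], (12, 16], (16, 20], (20, 24]
bins = [-1, 4, 8, 12, 16, 20, 24]
labels = ['Late Night', 'Early Morning', 'Morning', 'Noon', 'Eve', 'Night']

hours = pd.Series([0, 4, 5, 8, 12, 13, 17, 21, 23])  # example hours, not the RTA data
sessions = pd.cut(hours, bins=bins, labels=labels)
print(sessions.tolist())
```

The result is an ordered categorical, which keeps the sessions in chronological order when they are plotted or grouped.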

In [None]:
# Calculate statistics
df = data["session"].value_counts() / len(data["session"])
prob = (data["session"].value_counts() / len(data["session"])).tolist()
Variation = variationRation(data, "session")
ENTR = entropy(prob, base=2)

# Create a DataFrame for the bar chart (the values are proportions, not raw counts)
df_bar = pd.DataFrame({"session": df.index.values, "proportion": df.values})

# Add a color scale
colors = px.colors.sequential.Plasma[:len(df_bar)]

# Create the bar chart with color
fig = px.bar(df_bar, y="session", x="proportion", text_auto='.2f',
             title='Entropy {:.2f}, Variation Ratio {:.2f}'.format(ENTR, Variation),
             color="proportion", color_continuous_scale=colors)

# Show the chart in a Jupyter Notebook
fig.show("notebook")

In [None]:
pd.crosstab(data['Accident_severity'], data["Vehicle_driver_relation"], normalize='index')

In [None]:
data_2 = pd.crosstab(index=data['Accident_severity'], columns=data['Cause_of_accident'])

In [None]:
data_2

In [None]:
data_2.columns.tolist()

In [None]:
fig = px.imshow(data_2.T,
                labels=dict(x="Accident_severity", y="Cause_of_accident", color="COUNT"),
                y=data_2.columns.tolist(),
                x=data_2.index, height=800
               )
fig.update_xaxes(side="top")
fig.show("notebook")

<b>Inference for Categorical Data</b>

Most of the drivers are <b>male (92%)</b>, fall in the <b>18-50 yrs age groups (67%)</b>, have education up to <b>Junior high school (61%)</b>, are mostly <b>employees (78%)</b> and have <b>2-10 yrs (28%)</b> of driving experience. Moreover, most of the accidents involved passengers of <b>personally owned vehicles</b>.

The largest share of accidents occurred at <b>Y Shape (36.8%)</b> junctions.

Most accidents happened between <b>8 and 18 hours</b>.

As can be observed, the categorical variables <b>Pedestrian_movement</b>, <b>Fitness_of_casuality</b> and <b>Road_allignment</b> are <b>quasi-constant</b> features because approximately 90% of their data falls in one category. In other words, these features have the same value for a very large subset of the observations. Such features are not very useful for making predictions, so we are going to remove them.

# 3.0 EDA Numerical Variables

In [None]:
# find numerical variables

num_cols = [c for c in data.columns if data[c].dtypes!='O']
df_num_cols = data[num_cols]
df_num_cols.head()

In [None]:
#The information contains the number of columns, column labels, column data types
df_num_cols.info()

In [None]:
df_num_cols.describe()

In [None]:
#checking unique values
df_num_cols.Number_of_vehicles_involved.unique()

In [None]:
#checking values counts
df_num_cols.Number_of_vehicles_involved.value_counts()

In [None]:
#checking unique values
df_num_cols.Number_of_casualties.unique()

In [None]:
#checking values counts
df_num_cols.Number_of_casualties.value_counts()

In [None]:
#checking with target value
data.loc[:, ['Time', 'Area_accident_occured', 'Number_of_casualties', 'Number_of_vehicles_involved', 'Accident_severity']]

In [None]:
df_num_cols = df_num_cols.drop(['Time'], axis=1)
df_num_cols.head()

In [None]:
# count the numerical columns that remain after dropping 'Time'
n_num_cols = df_num_cols.shape[1]
print("Number of Numerical Variables: " + str(n_num_cols))

In [None]:
for col in data.select_dtypes(include='object'):
    if data[col].nunique() <= 3:
        display(data.groupby(col)[['Number_of_vehicles_involved', 'Number_of_casualties']].mean())
        print("")
        print("")
        print("")

In [None]:
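# correlate the numeric columns only (text columns cannot be correlated)
corr = data.select_dtypes(include='number').corr()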
corr

In [None]:
plt.figure(figsize=(6,6))
sns.heatmap(corr, cmap='RdBu_r', annot=True, vmax=1, vmin=-1)
plt.show()

<b>We could not find any notable pattern in the numerical variables.</b>