In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
accident_safety_data=pd.read_csv("C:/Applications/Machine Learning/NLP/CapstoneProjectNLP/data/hse_data.csv")

In [3]:
accident_safety_data.head()

,Unnamed: 0,Data,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee or Third Party,Critical Risk,Description
0,0,2016-01-01 00:00:00,Country_01,Local_01,Mining,I,IV,Male,Third Party,Pressed,While removing the drill rod of the Jumbo 08 f...
1,1,2016-01-02 00:00:00,Country_02,Local_02,Mining,I,IV,Male,Employee,Pressurized Systems,During the activation of a sodium sulphide pum...
2,2,2016-01-06 00:00:00,Country_01,Local_03,Mining,I,III,Male,Third Party (Remote),Manual Tools,In the sub-station MILPO located at level +170...
3,3,2016-01-08 00:00:00,Country_01,Local_04,Mining,I,I,Male,Third Party,Others,Being 9:45 am. approximately in the Nv. 1880 C...
4,4,2016-01-10 00:00:00,Country_01,Local_04,Mining,IV,IV,Male,Third Party,Others,Approximately at 11:45 a.m. in circumstances t...


In [4]:
accident_safety_data.columns

Index(['Unnamed: 0', 'Data', 'Countries', 'Local', 'Industry Sector',
       'Accident Level', 'Potential Accident Level', 'Genre',
       'Employee or Third Party', 'Critical Risk', 'Description'],
      dtype='object')

The "Unnamed: 0" column is unwanted: in the rows shown above it simply repeats the row index, so it will not help our analysis.
The "Data" column holds dates and should be named "Date". Therefore, let's drop "Unnamed: 0" and rename "Data" to "Date".

In [6]:
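#create a backup of the dataset before we make any changes to it
#(.copy() is required; plain assignment would only bind a second name to the same DataFrame)
accident_safety_data_new=accident_safety_data.copy()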
accident_safety_data_new.head()

,Unnamed: 0,Data,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee or Third Party,Critical Risk,Description
0,0,2016-01-01 00:00:00,Country_01,Local_01,Mining,I,IV,Male,Third Party,Pressed,While removing the drill rod of the Jumbo 08 f...
1,1,2016-01-02 00:00:00,Country_02,Local_02,Mining,I,IV,Male,Employee,Pressurized Systems,During the activation of a sodium sulphide pum...
2,2,2016-01-06 00:00:00,Country_01,Local_03,Mining,I,III,Male,Third Party (Remote),Manual Tools,In the sub-station MILPO located at level +170...
3,3,2016-01-08 00:00:00,Country_01,Local_04,Mining,I,I,Male,Third Party,Others,Being 9:45 am. approximately in the Nv. 1880 C...
4,4,2016-01-10 00:00:00,Country_01,Local_04,Mining,IV,IV,Male,Third Party,Others,Approximately at 11:45 a.m. in circumstances t...
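
A note on backups in pandas: a backup is only independent of the original if it is made with `DataFrame.copy()`. Plain assignment binds a second name to the same object, so in-place operations such as `drop(..., inplace=True)` or `rename(..., inplace=True)` affect both names. The sketch below illustrates the difference on a small toy frame; the frame and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the accident data (illustrative values only).
original = pd.DataFrame({"Data": ["2016-01-01", "2016-01-02"],
                         "Genre": ["Male", "Male"]})

alias = original          # a second name for the SAME DataFrame
backup = original.copy()  # an independent copy (deep=True is the default)

# An in-place change made through the alias is also visible via `original`.
alias.rename(columns={"Data": "Date"}, inplace=True)

print(list(original.columns))  # renamed, because alias and original are one object
print(list(backup.columns))    # unchanged, because the copy is independent
```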


In [7]:
#dropping "Unnamed" column
accident_safety_data_new.drop('Unnamed: 0',axis='columns', inplace=True)
#renaming "Data" column to "Date"
accident_safety_data_new.rename(columns = {'Data':'Date'}, inplace = True)

In [8]:
#Let us check the shape of our dataset
accident_safety_data_new.shape

(425, 10)

The dataset has 425 rows and 10 columns.

In [9]:
accident_safety_data_new.head()

,Date,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee or Third Party,Critical Risk,Description
0,2016-01-01 00:00:00,Country_01,Local_01,Mining,I,IV,Male,Third Party,Pressed,While removing the drill rod of the Jumbo 08 f...
1,2016-01-02 00:00:00,Country_02,Local_02,Mining,I,IV,Male,Employee,Pressurized Systems,During the activation of a sodium sulphide pum...
2,2016-01-06 00:00:00,Country_01,Local_03,Mining,I,III,Male,Third Party (Remote),Manual Tools,In the sub-station MILPO located at level +170...
3,2016-01-08 00:00:00,Country_01,Local_04,Mining,I,I,Male,Third Party,Others,Being 9:45 am. approximately in the Nv. 1880 C...
4,2016-01-10 00:00:00,Country_01,Local_04,Mining,IV,IV,Male,Third Party,Others,Approximately at 11:45 a.m. in circumstances t...


In [10]:
#Let us check for missing values in the dataset
accident_safety_data_new.isna().apply(pd.value_counts)

,Date,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee or Third Party,Critical Risk,Description
False,425,425,425,425,425,425,425,425,425,425


Every column shows 425 "False" entries and no "True" row, so this dataset has no null values.
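
A more compact way to run the same check is `isna().sum()`, which returns the number of missing values per column directly. It also avoids the top-level `pd.value_counts` function, which recent pandas releases flag as deprecated. The sketch below uses a small toy frame with one deliberately missing value; the frame is an assumption for illustration only.

```python
import pandas as pd

# Toy frame with one deliberately missing value (illustrative only).
df = pd.DataFrame({
    "Genre": ["Male", "Female", None],
    "Local": ["Local_01", "Local_02", "Local_03"],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Single number for the whole frame; 0 would mean no missing values at all.
total_missing = int(missing_per_column.sum())
print(total_missing)
```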

In [11]:
#Let us now check the datatype of the dataset and also get to know some more details
accident_safety_data_new.dtypes

Date                        object
Countries                   object
Local                       object
Industry Sector             object
Accident Level              object
Potential Accident Level    object
Genre                       object
Employee or Third Party     object
Critical Risk               object
Description                 object
dtype: object

Here, we can see that all the columns of the dataset are stored with the "object" datatype. As for the kind of data in each column: "Date" holds timestamps (currently stored as strings), so it is time series data; "Description" is free text; all other columns are categorical.
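
Since "Date" is stored as strings, it may be worth converting it before any time-based analysis. The sketch below shows one possible way, using `pd.to_datetime` for the date column and the `category` dtype for a categorical column. The toy frame reuses the first three dates shown in the output above; the "Description" values are placeholders.

```python
import pandas as pd

# Toy frame; dates are the first three shown by head(), descriptions are placeholders.
df = pd.DataFrame({
    "Date": ["2016-01-01 00:00:00", "2016-01-02 00:00:00", "2016-01-06 00:00:00"],
    "Countries": ["Country_01", "Country_02", "Country_01"],
    "Description": ["text a", "text b", "text c"],
})

# Parse the strings into real timestamps so that .dt accessors and resampling work.
df["Date"] = pd.to_datetime(df["Date"])

# Store a low-cardinality column as a categorical.
df["Countries"] = df["Countries"].astype("category")

print(df.dtypes)
print(df["Date"].dt.month.tolist())
```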

In [12]:
accident_safety_data_new.describe().T

,count,unique,top,freq
Date,425,287,2017-02-08 00:00:00,6
Countries,425,3,Country_01,251
Local,425,12,Local_03,90
Industry Sector,425,3,Mining,241
Accident Level,425,5,I,316
Potential Accident Level,425,6,IV,143
Genre,425,2,Male,403
Employee or Third Party,425,3,Third Party,189
Critical Risk,425,33,Others,232
Description,425,411,On 02/03/17 during the soil sampling in the re...,3


From the above table, we can infer the following:

1. This dataset contains accident data from 3 countries, of which Country_01 has the most accidents (251 of 425).

2. The data is collected from 3 industry sectors, with Mining the most frequent (241 accidents), and from 12 locals, of which Local_03 has the most accidents (90).

3. Accidents are classified into 5 accident levels. 316 accidents are of level I, making it by far the most frequent level. This also means that the data is not distributed evenly across levels.

4. The data consolidates accidents involving employees as well as third parties. The "Third Party" category has the most accidents (189) in this dataset.

5. 403 of the 425 accidents involved males, which means the distribution of data in this column is also heavily imbalanced.

6. 33 different types of critical risk have been identified in the dataset.
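
To put a number on the imbalance, the share of the most frequent category can be computed as `freq / count` from the describe() table. The sketch below does this by hand with the figures copied from that table; the dictionary and variable names are assumptions made for illustration.

```python
# Top-category frequencies copied from the describe() table above.
total_rows = 425
top_freq = {
    "Countries": 251,        # Country_01
    "Industry Sector": 241,  # Mining
    "Accident Level": 316,   # level I
    "Genre": 403,            # Male
}

# Share of rows covered by the most frequent category of each column.
shares = {column: freq / total_rows for column, freq in top_freq.items()}

for column, share in shares.items():
    print(f"{column}: top category covers {share:.1%} of rows")
```

For "Accident Level" this share is also the accuracy of a trivial model that always predicts level I, which is a useful baseline to keep in mind when evaluating any classifier trained on this data.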

Several of the categorical columns in the dataset can be encoded as numerical values, e.g.:

1. Local

2. Accident Level

3. Potential Accident Level


In [None]:
from sklearn.preprocessing import LabelEncoder
lb_make = LabelEncoder()
#accident_safety_data_new['Countries'] = lb_make.fit_transform(accident_safety_data_new['Countries'])
#accident_safety_data_new['Local'] = lb_make.fit_transform(accident_safety_data_new['Local'])
accident_safety_data_new['Accident Level'] = lb_make.fit_transform(accident_safety_data_new['Accident Level'])
accident_safety_data_new['Potential Accident Level'] = lb_make.fit_transform(accident_safety_data_new['Potential Accident Level'])
#accident_safety_data_new['Genre'] = lb_make.fit_transform(accident_safety_data_new['Genre'])
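
`LabelEncoder` assigns integer codes, starting at 0, following the sorted order of the labels. An alternative that makes the intended ordering explicit is to map the levels with a dictionary. The sketch below assumes the levels are the Roman numerals I to VI (only I, III and IV are visible in the rows shown above), and `level_order` is a hypothetical name introduced here.

```python
import pandas as pd

# Assumed ordering of the severity levels (Roman numerals I..VI).
level_order = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6}

# Toy column standing in for "Potential Accident Level" (illustrative values).
levels = pd.Series(["I", "IV", "III", "I", "VI"], name="Potential Accident Level")

# Explicit ordinal encoding; any label missing from the mapping becomes NaN,
# which makes unexpected values easy to spot.
encoded = levels.map(level_order)
print(encoded.tolist())
```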


In [None]:
accident_safety_data_new.head()

UNIVARIATE ANALYSIS

Let us check the distribution of data based on country.

In [None]:
accident_safety_data_new['Countries'].value_counts().plot(kind='bar')

We can see that "Country_01" has the highest number of accident cases.

Let us now see the distribution of accidents with respect to the type of worker (Employee / Third Party / Third Party (Remote)).

In [None]:
accident_safety_data_new['Employee or Third Party'].value_counts().plot(kind='bar')

The graph shows that accidents are split almost equally between employees and third party contractors, with third party contractors slightly higher.

Let us also check the distribution of accidents as per industry sector.

In [None]:
accident_safety_data_new['Industry Sector'].value_counts().plot(kind='bar')

We can see that the majority of accidents have happened in the mining sector, followed by the metal industry and other types of industries.

We will now see the distribution of accidents by gender (the "Genre" column).

In [None]:
accident_safety_data_new['Genre'].value_counts().plot(kind='bar')

Clearly, the distribution of accidents is imbalanced when checked by "Genre". The count of accidents among males is far higher than among females.

Lastly, let us check the distribution by Locals.

In [None]:
accident_safety_data_new['Local'].value_counts().plot(kind='bar')

From the graph, the plants can be divided into 4 categories based on the frequency of accidents:

1. Local_03 - Very High

2. Local_05, Local_01, Local_04, Local_06, Local_10 - High

3. Local_08, Local_02, Local_07 - Medium

4. Local_12, Local_11, Local_09 - Low
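
A grouping like this can also be derived in code by binning the per-plant counts with `pd.cut`, which makes the thresholds explicit and reproducible. The sketch below is only an illustration: the site names, counts and bin edges are made-up assumptions, not values taken from the dataset.

```python
import pandas as pd

# Made-up accident counts per site (hypothetical names and numbers).
counts = pd.Series({"site_a": 90, "site_b": 45, "site_c": 20, "site_d": 4})

# Hypothetical thresholds; intervals are right-inclusive, e.g. (30, 60] is "High".
bins = [0, 10, 30, 60, float("inf")]
labels = ["Low", "Medium", "High", "Very High"]

# Assign each site to a frequency tier.
tiers = pd.cut(counts, bins=bins, labels=labels)
print(tiers)
```

In the notebook, `counts` would be replaced by `accident_safety_data_new['Local'].value_counts()`, with bin edges chosen after looking at the bar chart.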

BIVARIATE ANALYSIS

Let us check the relation between the different factors which have led to accidents.