# Tool Extractor - Injury Treatment Analysis

* February 2nd, 2021
* Ryan Kazmerik
* Enterprise Data Science

## Heinrich / Bird Safety Pyramid

Herbert W. Heinrich was a pioneering occupational health and safety researcher.
His 1931 publication *Industrial Accident Prevention: A Scientific Approach*
was based on the analysis of workplace injury and accident data collected by
his employer, a large insurance company.

The work was continued and popularized in the 1970s by **Frank E. Bird**, who
worked for the Insurance Company of North America. Bird analyzed more than 1.7
million accidents reported by 297 cooperating companies.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Pyramid_of_risks.svg/1024px-Pyramid_of_risks.svg.png" alt="drawing" width="400"/>

Bird's safety theory suggests that there is a 60:3:1 ratio between accidents with no injury, minor injuries (first aid)
and major injuries (medical aid).

**Using this ratio, we should be able to identify tools that are more dangerous (i.e., that cause more major injuries than is typical).**

### Let's load some incidents from the EDW_EHS table in Oracle, filtering by incidents that resulted in injury:

In [1]:
import altair as alt
import pandas as pd
import spacy
import textacy
from altair import datum
from pattern.text.en import singularize
from spacy import displacy

ner = spacy.load('../models/tool_model')
nlp = spacy.load('en_core_web_sm')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
df = pd.read_parquet('../models/tool_model/results_v1.parquet')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32983 entries, 0 to 32982
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   ID                     32983 non-null  int64 
 1   ACTIVITY               32974 non-null  object
 2   AREA                   32983 non-null  object
 3   ASSET                  32983 non-null  object
 4   BODY_PARTS             2732 non-null   object
 5   DESCRIPTION            32983 non-null  object
 6   DIVISION               32983 non-null  object
 7   INJURY_TREATMENT       32983 non-null  object
 8   IMMEDIATE_CAUSES       2664 non-null   object
 9   INCIDENT_CATEGORY      32983 non-null  object
 10  INCIDENT_DATE_TIME     32983 non-null  object
 11  INCIDENT_RELATIONSHIP  32917 non-null  object
 12  INCIDENT_TYPES         32983 non-null  object
 13  INJURY_CLASSIFICATION  2695 non-null   object
 14  OPERATING_AREA         32983 non-null  object
 15  PERSON_INVOLVED    

### Bird's safety theory suggests that for every 600 non-injury incidents, there will be 30 minor incidents (first aid) and 10 major incidents (medical aid). Let's explore this ratio with our dataset:

In [3]:
# label incidents with no recorded treatment as 'Accident w/o Injury'
df['INJURY_TREATMENT'] = df['INJURY_TREATMENT'].fillna('Accident w/o Injury')

df_treatment = df.groupby(['INJURY_TREATMENT']).agg({
        'ID': 'count'
}).reset_index()

df_treatment.head()

Unnamed: 0,INJURY_TREATMENT,ID
0,Accident w/o Injury,30288
1,First Aid,2024
2,Medical Aid,671


In [4]:
alt.Chart(df_treatment).mark_bar().encode(
    x=alt.X('PercentOfTotal:Q', axis=alt.Axis(format='.0%'), title='PERCENT OF TOTAL'),
    y=alt.Y('INJURY_TREATMENT', title='INJURY TREATMENT'),
    color='INJURY_TREATMENT',
    tooltip=[alt.Tooltip('INJURY_TREATMENT', title='Treatment'), alt.Tooltip('ID', title='Count')]
).properties(
    title='Incidents by Injury Treatment',
    height=150,
    width=700
).transform_joinaggregate(
    TotalCount='sum(ID)',
).transform_calculate(
    PercentOfTotal="datum.ID / datum.TotalCount"
)

### The ratio for all incidents works out to roughly 600:40:13, which isn't far off from Bird's safety theory of 600:30:10.

### The safety theory suggests that 94% of all accidents result in no injury, 5% of all accidents result in minor injury & 2% of all accidents result in major injury.

### Since 2012, 92% of our accidents resulted in no injury, 6% of our accidents resulted in minor injury & 2% of our accidents resulted in major injury.

### When compared with the safety theory, this would suggest that we are quite close to the average when it comes to accidents that cause injuries.
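The comparison above can be reproduced with a few lines of arithmetic. The sketch below takes the counts from the `df_treatment` output and the 600:30:10 ratio quoted in this notebook, rescales the observed counts to 600 no-injury accidents, and converts both to shares of the total. The names `observed`, `theory`, `scale_to` and `shares` are illustrative and not part of the original notebook.

```python
# Counts taken from the df_treatment output above.
observed = {'Accident w/o Injury': 30288, 'First Aid': 2024, 'Medical Aid': 671}
# Bird's ratio as quoted in this notebook (600:30:10).
theory = {'Accident w/o Injury': 600, 'First Aid': 30, 'Medical Aid': 10}


def scale_to(counts, base_key, base_value):
    """Rescale counts so that counts[base_key] equals base_value."""
    factor = base_value / counts[base_key]
    return {key: value * factor for key, value in counts.items()}


def shares(counts):
    """Convert counts to fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


observed_ratio = scale_to(observed, 'Accident w/o Injury', 600)
observed_shares = shares(observed)
theory_shares = shares(theory)

for key in observed:
    print(f"{key:<20} ratio {observed_ratio[key]:6.1f}   "
          f"observed {observed_shares[key]:.1%}   theory {theory_shares[key]:.1%}")
```

The theory percentages sum to slightly more than 100% when each is rounded to a whole number, which is why 94%, 5% and 2% appear together above.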

### Let's use the safety theory to examine the 3:1 ratio between minor incidents (first aid) and major incidents (medical aid) - first by activity

In [5]:
# filter to only show rows that resulted in first or medical aid
df_activity = df[df['INJURY_TREATMENT'].str.contains("Aid", na=False)]

df_activity['INJURY_TREATMENT'].value_counts()

First Aid      2024
Medical Aid     671
Name: INJURY_TREATMENT, dtype: int64

In [6]:
df_activity = df_activity.groupby(['ACTIVITY','INJURY_TREATMENT']).agg({
        'ID': 'count'
}).reset_index()

df_activity.head()

Unnamed: 0,ACTIVITY,INJURY_TREATMENT,ID
0,Completions,First Aid,661
1,Completions,Medical Aid,226
2,Construction - Facilities & Pipelines,First Aid,295
3,Construction - Facilities & Pipelines,Medical Aid,95
4,Construction - Lease & Road,First Aid,26


In [7]:
c1 = alt.Chart(df_activity).mark_bar().encode(
    x=alt.X('ACTIVITY', title='ACTIVITY'),
    y=alt.Y('ID', stack="normalize", axis=alt.Axis(format='%', title='PERCENT')),
    color=alt.Color('INJURY_TREATMENT', scale=alt.Scale(domain=['Medical Aid', 'First Aid'], range=['#457FBF', '#F88D2B'])),
    order=alt.Order('INJURY_TREATMENT', sort='ascending'),
    tooltip=[alt.Tooltip('ACTIVITY', title='Activity'), alt.Tooltip('ID', title='Count')]
).properties(
    title='Injury Treatment by Activity',
    height=300,
    width=750
).transform_joinaggregate(
    totals='sum(ID)',
    groupby=['ACTIVITY']
).transform_filter(
    (datum.totals >= 50)
)

line = alt.Chart(pd.DataFrame({'y': [0.75]})).mark_rule(color='red',strokeDash=[5, 5]).encode(y='y')
c1 = c1+line
c1.configure_axisBottom(
    labelAngle=30
)

### Most activities are close to the suggested 3:1 ratio of first aid to medical aid treatment.

### Let's see the breakdown by operating area:

In [8]:
# filter to only show rows that resulted in first or medical aid
df_area = df[df['INJURY_TREATMENT'].str.contains("Aid", na=False)]

# filter out Jonah Energy data
df_area = df_area[~df_area['OPERATING_AREA'].str.contains("Jonah", na=False)]


df_area = df_area.groupby(['OPERATING_AREA','INJURY_TREATMENT']).agg({
        'ID': 'count'
}).reset_index()

df_area.head()

Unnamed: 0,OPERATING_AREA,INJURY_TREATMENT,ID
0,Anadarko Operations,First Aid,106
1,Anadarko Operations,Medical Aid,27
2,Atlantic Canada,First Aid,22
3,Atlantic Canada,Medical Aid,14
4,Canadian Operations,First Aid,978


In [9]:
c2 = alt.Chart(df_area).mark_bar().encode(
    x=alt.X('OPERATING_AREA', title='OPERATING AREA'),
    y=alt.Y('ID', stack="normalize", axis=alt.Axis(format='%', title='PERCENT')),
    color=alt.Color('INJURY_TREATMENT', scale=alt.Scale(domain=['Medical Aid', 'First Aid'], range=['#457FBF', '#F88D2B'])),
    order=alt.Order('INJURY_TREATMENT', sort='ascending'),
    tooltip=[alt.Tooltip('OPERATING_AREA', title='Operating Area'), alt.Tooltip('ID', title='Count')]
).properties(
    title='Injury Treatment by Operating Area',
    height=250,
    width=750
).transform_joinaggregate(
    totals='sum(ID)',
    groupby=['OPERATING_AREA']
).transform_filter(
    # filter on the per-area total computed above, not on individual row counts
    (datum.totals >= 5)
)

line = alt.Chart(pd.DataFrame({'y': [0.75]})).mark_rule(color='red',strokeDash=[5, 5]).encode(y='y')
c2 = c2+line
c2.configure_axisBottom(
    labelAngle=30
)

### The 3:1 ratio is quite close when considering injury treatment by the major operating areas.
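With small counts, a medical-aid share above 25% can easily arise by chance. One way to gauge this, which is not part of the original analysis, is a one-sided binomial tail probability: assuming each treated injury independently has a 25% chance of needing medical aid (the 3:1 ratio), how likely is it to see at least the observed number of medical-aid cases? The sketch below uses only the counts shown in the `df_area` output; the function name `medical_aid_tail` is hypothetical.

```python
from math import comb


def medical_aid_tail(first_aid, medical_aid, p=0.25):
    """P(X >= medical_aid) for X ~ Binomial(first_aid + medical_aid, p)."""
    n = first_aid + medical_aid
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(medical_aid, n + 1)
    )


# (first aid, medical aid) counts from the df_area output above.
areas = {
    'Anadarko Operations': (106, 27),
    'Atlantic Canada': (22, 14),
}

results = {}
for area, (first_aid, medical_aid) in areas.items():
    share = medical_aid / (first_aid + medical_aid)
    tail = medical_aid_tail(first_aid, medical_aid)
    results[area] = (share, tail)
    print(f"{area:<20} medical-aid share {share:.1%}   tail probability {tail:.3f}")
```

A small tail probability suggests the share is higher than chance alone would explain under the 3:1 ratio; a large one means the counts are consistent with it. Independence between incidents is an assumption of this sketch.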

### Let's run our tool extractor to find the names of any tools mentioned in the description column of these incidents:
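The extraction step itself is not shown in this notebook; the `TOOLS` column used below appears to come already populated from the parquet file loaded earlier. As a rough sketch of what such a step might look like, the function below collects entity texts from any spaCy-style pipeline, meaning a callable that returns an object with an `ents` attribute whose items expose `text`. The `KeywordStub` class and the sample sentence are hypothetical stand-ins so the example runs without the trained model at `../models/tool_model`; they do not reflect how the actual model works.

```python
from types import SimpleNamespace


def extract_tools(text, pipeline):
    """Return the lower-cased, de-duplicated entity texts found in text."""
    doc = pipeline(text)
    seen = []
    for ent in doc.ents:
        name = ent.text.lower()
        if name not in seen:
            seen.append(name)
    return seen


class KeywordStub:
    """Hypothetical stand-in for the trained NER model: matches fixed keywords."""

    def __init__(self, keywords):
        self.keywords = keywords

    def __call__(self, text):
        lowered = text.lower()
        ents = [SimpleNamespace(text=k) for k in self.keywords if k in lowered]
        return SimpleNamespace(ents=ents)


# Invented sentence, for illustration only.
stub = KeywordStub(['pipe wrench', 'hammer', 'knife'])
tools = extract_tools(
    'Worker struck his thumb with a hammer while holding a pipe wrench.', stub
)
print(tools)
```

With the real model, the call would presumably be along the lines of `df['DESCRIPTION'].apply(lambda text: extract_tools(text, ner))`, assuming `ner` is the pipeline loaded in the first cell.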

### Let's see what tools lead to the most medical and first aid treatments and their distribution:

In [10]:
tools_flat = []
for idx,tools in enumerate(df['TOOLS']):
    
    for tool in tools:
        tools_flat.append({
            'ID': df['ID'].iloc[idx],
            'DESCRIPTION':  df['DESCRIPTION'].iloc[idx],
            'TOOL': tool,
            'INJURY_TREATMENT': df['INJURY_TREATMENT'].iloc[idx],
            'OPERATING_AREA': df['OPERATING_AREA'].iloc[idx]
        })

df_tools = pd.DataFrame(tools_flat, columns =['ID', 'DESCRIPTION', 'TOOL', 'INJURY_TREATMENT', 'OPERATING_AREA'])

# filter to only show rows that resulted in first or medical aid
df_tools = df_tools[df_tools['INJURY_TREATMENT'].str.contains("Aid", na=False)]

df_tools = df_tools.groupby(['TOOL','INJURY_TREATMENT']).agg({
        'ID': 'count',
        'DESCRIPTION': 'first',
        'OPERATING_AREA':'first'
}).reset_index()

df_tools.head()

Unnamed: 0,TOOL,INJURY_TREATMENT,ID,DESCRIPTION,OPERATING_AREA
0,a driver,First Aid,1,"Approximately 1:30PM on the DFK 2001, a driver...",Texas Operations
1,amperage manager,First Aid,1,At 19:10hrs on the 6 0f May. Amperage supervis...,Canadian Operations
2,angle grinder,First Aid,4,While worker was hammering loose the diverter ...,Canadian Operations
3,angle iron,Medical Aid,2,Worker tacked a piece of angle iron onto a ver...,Canadian Operations
4,axe,First Aid,2,Worker was using a pick axe to remove soil fro...,Canadian Operations


In [11]:
alt.Chart(df_tools).mark_bar().encode(
    x=alt.X('TOOL'),
    y=alt.Y('ID', title='COUNT'),
    color=alt.Color('INJURY_TREATMENT', scale=alt.Scale(domain=['Medical Aid', 'First Aid'], range=['#457FBF', '#F88D2B'])),
    #column='OPERATING_AREA',
    order=alt.Order('INJURY_TREATMENT', sort='ascending'),
    tooltip=[alt.Tooltip('TOOL', title='Tool'), alt.Tooltip('ID', title='Count')]
).properties(
    title='Top Tools by Frequency',
    height=400,
    width=750
).transform_joinaggregate(
    totals='sum(ID)',
    groupby=['TOOL']
).transform_filter(
    (datum.totals > 3)
).configure_axisBottom(
    labelAngle=30
)

### Let's see the same tools as a percentage. We'll also plot a line at our 3:1 ratio (75%):

In [12]:
c3 = alt.Chart(df_tools).mark_bar().encode(
    x='TOOL',
    y=alt.Y('ID', stack="normalize", axis=alt.Axis(format='%', title='PERCENT')),
    color=alt.Color('INJURY_TREATMENT', scale=alt.Scale(domain=['Medical Aid', 'First Aid'], range=['#457FBF', '#F88D2B'])),
    #column='OPERATING_AREA',
    order=alt.Order('INJURY_TREATMENT', sort='ascending'),
    tooltip=[alt.Tooltip('TOOL', title='Tool'), alt.Tooltip('ID', title='Count')]
).properties(
    title='Top Tools by Percent',
    height=400,
    width=750
).transform_joinaggregate(
    totals='sum(ID)',
    groupby=['TOOL']
).transform_filter(
    # same threshold as the frequency chart, so both charts show the same tools
    (datum.totals > 3)
)

line = alt.Chart(pd.DataFrame({'y': [0.75]})).mark_rule(color='red',strokeDash=[5, 5]).encode(y='y')
c3 = c3+line
c3.configure_axisBottom(
    labelAngle=30
)

### As we can see above, most tools follow the 3:1 ratio; however, some tools cause serious injuries (medical aid) more often than Bird's theory predicts.

### Additionally, many of the tools above fall into similar categories such as blades, hammers and wrenches. If we group these tools by category, we can see that three major tool groupings account for most of these incidents:

In [13]:
blades = ['axe', 'blade', 'box cutter', 'cutter', 'exacto knife', 'knife', 'pocket knife', 'saw', 'utility knife']
hammers = ['claw hammer', 'hammer', 'sledge hammer', 'sledgehammer']
wrenches = ['crescent wrench', 'hammer wrench', 'pipe wrench', 'ratchet', 'socket', 'spanner wrench', 
            'torque wrench', 'valve wrench', 'wrench']

def get_category(row):
    # map a single row's TOOL value to its category (applied row-wise below)
    if row['TOOL'] in blades:
        return 'Blade'
    elif row['TOOL'] in hammers:
        return 'Hammer'
    elif row['TOOL'] in wrenches:
        return 'Wrench'
    else:
        return 'Other'

df_tools_flat = pd.DataFrame(tools_flat, columns =['ID', 'DESCRIPTION', 'TOOL', 'INJURY_TREATMENT', 'OPERATING_AREA'])
df_tools_flat['TOOL_CATEGORY'] = df_tools_flat.apply(get_category, axis = 1)  

df_tools_cat = df_tools_flat.groupby(['TOOL_CATEGORY']).agg({
        'ID': 'count'
}).reset_index()

# filter out tool category = Other
df_tools_cat = df_tools_cat[~df_tools_cat['TOOL_CATEGORY'].str.contains("Other", na=False)]

df_tools_cat

Unnamed: 0,TOOL_CATEGORY,ID
0,Blade,268
1,Hammer,368
3,Wrench,532


In [14]:
alt.Chart(df_tools_cat).mark_bar().encode(
    x=alt.X('ID', title='COUNT'),
    y=alt.Y('TOOL_CATEGORY', title='TOOL CATEGORY'),
    color='TOOL_CATEGORY',
    tooltip=[alt.Tooltip('TOOL_CATEGORY', title='Tool Category'), alt.Tooltip('ID', title='Count')]
).properties(
    title='Incidents by Tool Category',
    height=150,
    width=750
)

### One hunch we would like to explore is that retractable blades are safer than non-retractable blades.

### Let's look at all the blade incident descriptions that mention the word 'retractable':

In [64]:
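# keep blade incidents whose description mentions 'retractable'
df_retractable = df_tools_flat[
    (df_tools_flat['TOOL_CATEGORY'] == 'Blade')
    & df_tools_flat['DESCRIPTION'].str.contains("retractable", na=False)
]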

df_retractable = df_retractable.groupby(['DESCRIPTION']).agg({
        'ID': 'count',
        'TOOL_CATEGORY': 'first'
}).reset_index()

descriptions = "\n\n\n".join(df_retractable['DESCRIPTION'])

keyword = 'retractable'
# character offsets of every occurrence of the keyword
matches = [i for i in range(len(descriptions)) if descriptions.startswith(keyword, i)]

# displaCy manual-mode entity spans, one per occurrence
marks = [{"start": match, "end": match + len(keyword), "label": ""} for match in matches]

doc = [{
    "text": descriptions,
    "ents": marks
}]

colors = {"": "linear-gradient(90deg, #D8D800, #FFFF62)"}
options = {"colors": colors}

displacy.render(doc, style="ent", options=options, manual=True)
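The character offsets passed to displaCy can also be found with a regular expression, which makes case-insensitive matching straightforward. This is a sketch of an alternative, not the notebook's method; `keyword_spans` is a hypothetical helper and the sample text is invented for illustration.

```python
import re


def keyword_spans(text, keyword, label=""):
    """Return displaCy manual-mode entity dicts for each occurrence of keyword."""
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    return [
        {"start": m.start(), "end": m.end(), "label": label}
        for m in pattern.finditer(text)
    ]


# Invented text, for illustration only.
sample = "Retractable knife was used. The blade was not retractable."
spans = keyword_spans(sample, "retractable")
doc = [{"text": sample, "ents": spans}]
print(spans)
```

The resulting `doc` has the same shape as the one built in the cell above, so it can be passed to `displacy.render(doc, style="ent", manual=True)` in the same way.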

### Another hunch we would like to explore is the role of gloves in incidents that involved blades. 

### Let's look at all the incident descriptions that mention gloves:

In [58]:
# keep blade incidents whose description mentions gloves; filter before grouping
# so a description is not dropped when its first extracted tool is not a blade
df_gloves = df_tools_flat[
    (df_tools_flat['TOOL_CATEGORY'] == 'Blade')
    & df_tools_flat['DESCRIPTION'].str.contains("gloves", na=False)
]

df_gloves = df_gloves.groupby(['DESCRIPTION']).agg({
        'ID': 'count',
        'TOOL_CATEGORY': 'first'
}).reset_index()
        
descriptions = "\n\n\n".join(df_gloves['DESCRIPTION'])

keyword = 'gloves'
# character offsets of every occurrence of the keyword
matches = [i for i in range(len(descriptions)) if descriptions.startswith(keyword, i)]

# displaCy manual-mode entity spans, one per occurrence
marks = [{"start": match, "end": match + len(keyword), "label": ""} for match in matches]

doc = [{
    "text": descriptions,
    "ents": marks
}]

colors = {"": "linear-gradient(90deg, #FFAC14, #FFCF76)"}
options = {"colors": colors}

displacy.render(doc, style="ent", options=options, manual=True)

### Another hunch we would like to test is that closed-ended wrenches (e.g. ratchets) cause less slippage than open-ended wrenches (e.g. crescent wrenches).
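One possible starting point for this test, sketched under stated assumptions, is to assign each wrench to an end type and compare how often the incident description mentions slipping. Only the two examples named above are assigned here; the remaining wrench names would need to be classified by someone with domain knowledge. The helper names and the toy records are invented for illustration, and the shares they produce are not findings.

```python
# Assumed groupings: only the two examples named in the text are assigned.
CLOSED_ENDED = {'ratchet'}
OPEN_ENDED = {'crescent wrench'}


def wrench_end_type(tool):
    """Return 'closed-ended', 'open-ended', or None for unassigned tools."""
    if tool in CLOSED_ENDED:
        return 'closed-ended'
    if tool in OPEN_ENDED:
        return 'open-ended'
    return None


def slip_share(records):
    """Share of descriptions containing 'slip', per wrench end type."""
    totals, slips = {}, {}
    for tool, description in records:
        end_type = wrench_end_type(tool)
        if end_type is None:
            continue  # skip tools that are not assigned to a group
        totals[end_type] = totals.get(end_type, 0) + 1
        if 'slip' in description.lower():
            slips[end_type] = slips.get(end_type, 0) + 1
    return {key: slips.get(key, 0) / totals[key] for key in totals}


# Toy (TOOL, DESCRIPTION) pairs, invented for illustration only.
toy_records = [
    ('ratchet', 'Worker was tightening a bolt when the ratchet slipped.'),
    ('ratchet', 'Worker dropped the ratchet on his foot.'),
    ('crescent wrench', 'The crescent wrench slipped off the nut.'),
    ('hammer', 'Hammer slipped from hand.'),
]
shares = slip_share(toy_records)
print(shares)
```

On the real data, the `(TOOL, DESCRIPTION)` pairs could presumably be taken from `df_tools_flat`, and the binomial caveat about small counts applies here as well.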