<a href="https://colab.research.google.com/github/sandippani/chatbot_1/blob/main/Capstone_Project_NLP_Batch_2_Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROBLEM STATEMENT
### • DOMAIN: Industrial safety. NLP based Chatbot.
### • CONTEXT:

#### The database comes from one of the biggest industries in Brazil and in the world. Industries and companies around the globe urgently need to understand why employees still suffer injuries and accidents in plants, some of them fatal.

# • DATA DESCRIPTION:
###### The database consists of accident records from 12 different plants in 3 different countries; each row in the data is one occurrence of an accident.
## Columns description:
######  ‣ Data: timestamp or time/date information
###### ‣ Countries: the country where the accident occurred (anonymised)
###### ‣ Local: the city where the manufacturing plant is located (anonymised)
###### ‣ Industry sector: which sector the plant belongs to
###### ‣ Accident level: from I to VI, it registers how severe the accident was (I means not severe, VI means very severe)
###### ‣ Potential Accident Level: Depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)
###### ‣ Genre: whether the person is male or female
###### ‣ Employee or Third Party: if the injured person is an employee or a third party
###### ‣ Critical Risk: some description of the risk involved in the accident
###### ‣ Description: Detailed description of how the accident happened.

# PROJECT OBJECTIVE:
#### Design an ML/DL-based chatbot utility that helps professionals highlight the safety risk from the incident description.


In [None]:
from google.colab import drive
path = '/content/drive/'
drive.mount(path)
filepath = path + 'MyDrive/NLP/data/industrial_safety_and_health_database_with_accidents_description.csv'

Mounted at /content/drive/


In [None]:
import pandas as pd, numpy as np, re, time
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import spacy

### Load Data

In [None]:

newname = ["Countries", "Local", "IndustrySector", "AccidentLevel","PotentialAccidentLevel","Genre","EmployeeOrThirdParty", "CriticalRisk", "Description" ]
df = pd.read_csv(filepath,usecols = ["Countries", "Local", "Industry Sector", "Accident Level","Potential Accident Level","Genre","Employee or Third Party", "Critical Risk", "Description"])
df.columns = newname

In [None]:
df.head()


| | Countries | Local | IndustrySector | AccidentLevel | PotentialAccidentLevel | Genre | EmployeeOrThirdParty | CriticalRisk | Description |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | While removing the drill rod of the Jumbo 08 f... |
| 1 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | During the activation of a sodium sulphide pum... |
| 2 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | In the sub-station MILPO located at level +170... |
| 3 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | Being 9:45 am. approximately in the Nv. 1880 C... |
| 4 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | Approximately at 11:45 a.m. in circumstances t... |


In [None]:
df.shape

(425, 9)

In [None]:
df['Description'].values


array(['While removing the drill rod of the Jumbo 08 for maintenance, the supervisor proceeds to loosen the support of the intermediate centralizer to facilitate the removal, seeing this the mechanic supports one end on the drill of the equipment to pull with both hands the bar and accelerate the removal from this, at this moment the bar slides from its point of support and tightens the fingers of the mechanic between the drilling bar and the beam of the jumbo.',
       'During the activation of a sodium sulphide pump, the piping was uncoupled and the sulfide solution was designed in the area to reach the maid. Immediately she made use of the emergency shower and was directed to the ambulatory doctor and later to the hospital. Note: of sulphide solution = 48 grams / liter.',
       'In the sub-station MILPO located at level +170 when the collaborator was doing the excavation work with a pick (hand tool), hitting a rock with the flat part of the beak, it bounces off hitting the steel ti

In [None]:
import numpy as np, re
from nltk.stem.porter import PorterStemmer

### Clean text by removing unnecessary characters and altering the format of words

In [None]:
def removeNltkStopWords(df,column_name):
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    stopwords = set(stopwords.words('english'))
    # Remove stopwords
    data = df[column_name].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords]))
    return data

In [None]:
def stemmingData(dataList):
    # Stemming our data
    ps = PorterStemmer()
    data = dataList.apply(lambda x: x.split())
    return data.apply(lambda x : ' '.join([ps.stem(word) for word in x]))


In [None]:
def removePunctuations(df,column_name):
    # remove punctuations
    data = df[column_name].apply(lambda s: re.sub(r'[^\w\s]', ' ', s))
    # chain on the previous result; \w also matches "_", so strip it separately
    data = data.apply(lambda s: re.sub(r'_', ' ', s))
    return data
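Punctuation removal needs two passes because `\w` treats the underscore as a word character, and the second pass has to run on the output of the first. The following is a minimal sketch on a plain string; the helper name `strip_punctuation` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def strip_punctuation(text: str) -> str:
    # Pass 1: replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Pass 2: "_" survives pass 1 because \w matches it, so handle it here,
    # operating on the result of pass 1 rather than on the original text.
    return re.sub(r"_", " ", text)

sample = "drill_rod, Jumbo-08!"   # hypothetical input
cleaned = strip_punctuation(sample)
print(cleaned.split())
```

If the second substitution were applied to the original text instead, the commas and hyphens removed in the first pass would reappear.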

In [None]:
def removeHtmlTags(df,column_name):
    # removeHtmlTags
    cleanHtml = re.compile('<.*?>')
    data = df[column_name].apply(lambda s: re.sub(cleanHtml, ' ', s))
    return data

In [None]:
def removeURLs(df,column_name):
    # remove URLS
    data = df[column_name].apply(lambda s: re.sub(r'https?://\S+', ' ', s))
    return data
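A URL pattern anchored with `^` only matches when the string begins with the URL, and a trailing `.*` then consumes the rest of that line. The sketch below contrasts that with an unanchored pattern that stops at the first whitespace; the sample sentence and variable names are assumptions for illustration.

```python
import re

anchored = re.compile(r"^https?:\/\/.*[\r\n]*")  # matches only at the very start
anywhere = re.compile(r"https?://\S+")           # matches a URL wherever it occurs

text = "see report at http://example.com/r1 for details"  # hypothetical input

kept_by_anchored = anchored.sub(" ", text)   # no match, text is unchanged
stripped = anywhere.sub(" ", text)           # only the URL is replaced
print(stripped.split())
```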

In [None]:
def removeEmoji(df,column_name):
    # remove emojis
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    data = df[column_name].apply(lambda s: re.sub(emoji_pattern, ' ', s))
    return data

In [None]:
def removeUnwantedCharacters(df,column_name):
    # Replace special symbols and digits with spaces (keeps only ASCII letters)
    data = df[column_name].apply(lambda s : re.sub('[^a-zA-Z]', ' ', s))
    return data
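The character class `[^a-zA-Z]` removes digits and punctuation, but it also removes accented letters, which splits names in two. A fragment such as `mr crist bal` in the cleaned output is consistent with that effect, although the accented spelling used below is an assumption and the sample sentence is hypothetical.

```python
import re

text = "mr. cristóbal lifted 48 kg at 9:45"   # hypothetical input
letters_only = re.sub("[^a-zA-Z]", " ", text)

# Digits, punctuation, and the accented "ó" are all replaced by spaces.
print(letters_only.split())
```

If names matter downstream, one option is to transliterate accented characters before this step; whether that is worthwhile depends on the task.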

In [None]:
def porterStemmerData(df,column_name):
    # Stemming our data
    ps = PorterStemmer()
    data = df[column_name].apply(lambda x: x.split())
    return data.apply(lambda x : ' '.join([ps.stem(word) for word in x]))

In [None]:
def snowballStemmerData(df,column_name):
    # Stemming our data
    stemmer = SnowballStemmer(language='english')
    data = df[column_name].apply(lambda x: x.split())
    return data.apply(lambda x : ' '.join([stemmer.stem(word) for word in x]))

In [None]:
def nltkLemmatizeData(df,column_name):
    lemmatizer = WordNetLemmatizer()
    # lemmatize our data
    data = df[column_name].apply(lambda x: x.split())
    return data.apply(lambda x : ' '.join([lemmatizer.lemmatize(word) for word in x]))

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
 
def spacyLemmatizer(text):    
    wordList = []
    sentence = nlp(text)
    for word in sentence:
        wordList.append(word.lemma_)
    return " ".join(wordList)

In [None]:
def spacyLemmatizeData(df,column_name):
    # lemmatize our data
    data = df[column_name].apply(lambda x: x.split())
    return data.apply(lambda x : ' '.join([spacyLemmatizer(word) for word in x]))
    


In [None]:
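# Lowercase helper (assumed definition; it is not shown in the cells above)
def convertToLowerCase(df,column_name):
    return df[column_name].str.lower()

#preprocess data
df['Description'] = convertToLowerCase(df,'Description')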
df['Description'] = removeHtmlTags(df,'Description')
df['Description'] = removeURLs(df,'Description')
df['Description'] = removeEmoji(df,'Description')
df['Description'] = removePunctuations(df,'Description')
df['Description'] = removeUnwantedCharacters(df,'Description')
df['Description'] = removeNltkStopWords(df,'Description')
df['Description'] = spacyLemmatizeData(df,'Description')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
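The order of the preprocessing steps matters: NLTK's English stop word list is lowercase, so stop word removal only catches capitalised words if the text has been lowercased first. A minimal sketch with a tiny stand-in stop word set (the set and the helper `drop_stopwords` are assumptions for illustration, not the NLTK list):

```python
STOPWORDS = {"the", "of", "while"}   # hypothetical miniature stop word list

def drop_stopwords(text: str) -> str:
    # Keep only the words that are not in the stop word set.
    return " ".join(w for w in text.split() if w not in STOPWORDS)

raw = "While removing the drill rod of the Jumbo"

without_lowercasing = drop_stopwords(raw)         # "While" slips through
with_lowercasing = drop_stopwords(raw.lower())    # "while" is removed
print(without_lowercasing)
print(with_lowercasing)
```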


**Get length of each description and add a column for that**

In [None]:
def countWords(wordslist):
    # Number of tokens in an already-split description
    return len(wordslist)

In [None]:
def getLength(dataList):
    data = dataList.apply(lambda x: x.split())
    count=[]
    for item in data:
        count.append(countWords(item))
    return  count  

In [None]:
descriptionCount = getLength(df['Description']) 

In [None]:
df['descriptionCount'] = descriptionCount

In [None]:
df.head()

| | Countries | Local | IndustrySector | AccidentLevel | PotentialAccidentLevel | Genre | EmployeeOrThirdParty | CriticalRisk | Description | descriptionCount |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed | remove drill rod jumbo maintenance supervisor ... | 37 |
| 1 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems | activation sodium sulphide pump piping uncoupl... | 27 |
| 2 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools | sub station milpo locate level collaborator ex... | 29 |
| 3 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others | approximately nv cx ob personnel begin task un... | 51 |
| 4 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others | approximately circumstance mechanic anthony gr... | 44 |


**Check whether data is cleaned**

In [None]:
print(df[:10]['Description'].values)


['remove drill rod jumbo maintenance supervisor proceeds loosen support intermediate centralizer facilitate removal see mechanic support one end drill equipment pull hand bar accelerate removal moment bar slide point support tightens finger mechanic drilling bar beam jumbo'
 'activation sodium sulphide pump piping uncoupled sulfide solution design area reach maid immediately make use emergency shower direct ambulatory doctor later hospital note sulphide solution grams liter'
 'sub station milpo locate level collaborator excavation work pick hand tool hit rock flat part beak bounce hit steel tip safety shoe metatarsal area leave foot collaborator cause injury'
 'approximately nv cx ob personnel begin task unlocking soquet bolt bhb machine penultimate bolt identify hexagonal head wear proceed mr crist bal auxiliary assistant climb platform exert pressure hand dado key prevent come bolt moment two collaborator rotate lever anti clockwise direction leave key bolt hit palm leave hand cause 

Limit the text length to the 95th percentile of the description word counts, so that the longest 5% of descriptions do not set the maximum sequence length

In [None]:
def getLimitedTextLength(percentileNumber,dataFrameColumn):
    return dataFrameColumn.quantile(percentileNumber/100)

In [None]:
maxlen = getLimitedTextLength(95,df['descriptionCount'])

In [None]:
print(maxlen)

63.0
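To make the percentile cutoff concrete, here is a pure-Python sketch of a quantile with linear interpolation between the two nearest ranks, which is the default behaviour of pandas `Series.quantile` as far as I am aware. The function name `linear_quantile` and the sample counts are assumptions for illustration.

```python
import math

def linear_quantile(values, q):
    # Sort the values and locate the fractional rank q * (n - 1).
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    # Interpolate linearly between the two neighbouring values.
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

counts = [10, 20, 30, 40, 100]          # hypothetical word counts
cutoff = linear_quantile(counts, 0.95)  # lies between 40 and 100
print(cutoff)
```

The result is a float, so it needs an `int(...)` conversion before it can be used as a sequence length.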


Creating the dictionary

In [None]:
nltk.download('punkt')
vocab_dictionary = {}

for sentence in df['Description']:
  tokens = nltk.word_tokenize(sentence)
  for token in tokens:
    if token not in vocab_dictionary.keys():
      vocab_dictionary[token] = 1
    else:
      vocab_dictionary[token] += 1

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# Sort the vocabulary as (count, word) tuples, most frequent first

def sortFreqDict(vocab_dictionary):
    SortedDict = [(vocab_dictionary[key], key) for key in vocab_dictionary]
    SortedDict.sort()
    SortedDict.reverse()
    return SortedDict

In [None]:
SortedDict = sortFreqDict(vocab_dictionary)
SortedDict

[(191, 'cause'),
 (184, 'leave'),
 (183, 'employee'),
 (181, 'hand'),
 (155, 'right'),
 (133, 'operator'),
 (120, 'activity'),
 (112, 'time'),
 (111, 'injury'),
 (106, 'use'),
 (103, 'moment'),
 (90, 'worker'),
 (88, 'work'),
 (84, 'collaborator'),
 (81, 'area'),
 (80, 'one'),
 (78, 'equipment'),
 (76, 'finger'),
 (75, 'perform'),
 (75, 'assistant'),
 (73, 'accident'),
 (72, 'pipe'),
 (71, 'level'),
 (70, 'support'),
 (67, 'make'),
 (66, 'cm'),
 (65, 'move'),
 (65, 'hit'),
 (65, 'floor'),
 (62, 'cut'),
 (61, 'place'),
 (60, 'remove'),
 (59, 'mesh'),
 (59, 'fall'),
 (57, 'rock'),
 (54, 'mr'),
 (53, 'safety'),
 (53, 'glove'),
 (52, 'x'),
 (52, 'meter'),
 (50, 'team'),
 (50, 'approximately'),
 (48, 'impact'),
 (47, 'height'),
 (46, 'side'),
 (46, 'pump'),
 (46, 'part'),
 (45, 'generate'),
 (45, 'describe'),
 (44, 'circumstance'),
 (43, 'truck'),
 (43, 'kg'),
 (43, 'injure'),
 (43, 'come'),
 (42, 'metal'),
 (42, 'face'),
 (42, 'carry'),
 (40, 'release'),
 (40, 'position'),
 (39, 'towards')
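The order of tied entries in the listing (for example `perform` before `assistant`, both at 75) follows from how Python compares tuples: the count is compared first, then the word. Sorting ascending and reversing therefore puts ties in reverse alphabetical order. A small sketch using four pairs taken from the listing:

```python
pairs = [(75, "assistant"), (65, "hit"), (75, "perform"), (65, "floor")]

pairs.sort()      # ascending by count, then alphabetically by word
pairs.reverse()   # descending by count, ties in reverse alphabetical order
print(pairs)
```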

In [None]:
vocab_freq = vocab_dictionary
THRESHOLD = 2
vocab_freq = dict((k, v) for k, v in vocab_freq.items() if v >= THRESHOLD)
vocab_freq

{'remove': 60,
 'drill': 30,
 'rod': 21,
 'jumbo': 9,
 'maintenance': 32,
 'supervisor': 10,
 'proceeds': 12,
 'loosen': 6,
 'support': 70,
 'intermediate': 3,
 'removal': 14,
 'see': 13,
 'mechanic': 37,
 'one': 80,
 'end': 34,
 'equipment': 78,
 'pull': 15,
 'hand': 181,
 'bar': 23,
 'moment': 103,
 'slide': 25,
 'point': 36,
 'finger': 76,
 'drilling': 24,
 'beam': 9,
 'sodium': 2,
 'sulphide': 2,
 'pump': 46,
 'solution': 12,
 'area': 81,
 'reach': 31,
 'maid': 5,
 'immediately': 22,
 'make': 67,
 'use': 106,
 'emergency': 6,
 'shower': 2,
 'direct': 7,
 'doctor': 5,
 'later': 7,
 'hospital': 10,
 'note': 5,
 'sub': 3,
 'station': 6,
 'milpo': 2,
 'locate': 15,
 'level': 71,
 'collaborator': 84,
 'excavation': 3,
 'work': 88,
 'pick': 7,
 'tool': 12,
 'hit': 65,
 'rock': 57,
 'part': 46,
 'bounce': 3,
 'steel': 14,
 'tip': 9,
 'safety': 53,
 'shoe': 3,
 'metatarsal': 3,
 'leave': 184,
 'foot': 36,
 'cause': 191,
 'injury': 111,
 'approximately': 50,
 'nv': 21,
 'cx': 11,
 'ob': 8,


In [None]:
print("Total number of words occurring at least twice:",len(vocab_freq))

SortedDict = sortFreqDict(vocab_freq)

Total number of words occurring at least twice: 1396




*   Add the special tokens and the unique word tokens to a dictionary that maps each word to a unique integer.
*   Create an inverse dictionary that maps the unique integers back to their respective words.





In [None]:
#we will create dictionaries to provide a unique integer for each word.
PAD_token = 0   # Used for padding short sentences
EOS_token = 1   # End-of-sentence token
UNK_token = 2   # Unknown (out-of-vocabulary) word token
GO_token = 3    # Start-of-sequence token, typically fed to a decoder as its first input

word_num  = 4 # first id for real words (WORD_CODE_START for the model decoder later); ids 0-3 are reserved above
encoding = {}
decoding = {}
decoding = {PAD_token:'<PAD>', EOS_token: '<EOS>',UNK_token : '<UNK>', GO_token : '<GO>' }
encoding = {'<PAD>':PAD_token, '<EOS>': EOS_token, '<UNK>': UNK_token, '<GO>' :  GO_token }
for count, word in SortedDict:  # SortedDict holds (count, word) tuples
    encoding[word] = word_num
    decoding[word_num] = word
    word_num += 1

print("No. of vocab used:", word_num)


No. of vocab used: 1400


In [None]:
decoding


{0: '<PAD>',
 1: '<EOS>',
 2: '<UNK>',
 3: '<GO>',
 4: 'cause',
 5: 'leave',
 6: 'employee',
 7: 'hand',
 8: 'right',
 9: 'operator',
 10: 'activity',
 11: 'time',
 12: 'injury',
 13: 'use',
 14: 'moment',
 15: 'worker',
 16: 'work',
 17: 'collaborator',
 18: 'area',
 19: 'one',
 20: 'equipment',
 21: 'finger',
 22: 'perform',
 23: 'assistant',
 24: 'accident',
 25: 'pipe',
 26: 'level',
 27: 'support',
 28: 'make',
 29: 'cm',
 30: 'move',
 31: 'hit',
 32: 'floor',
 33: 'cut',
 34: 'place',
 35: 'remove',
 36: 'mesh',
 37: 'fall',
 38: 'rock',
 39: 'mr',
 40: 'safety',
 41: 'glove',
 42: 'x',
 43: 'meter',
 44: 'team',
 45: 'approximately',
 46: 'impact',
 47: 'height',
 48: 'side',
 49: 'pump',
 50: 'part',
 51: 'generate',
 52: 'describe',
 53: 'circumstance',
 54: 'truck',
 55: 'kg',
 56: 'injure',
 57: 'come',
 58: 'metal',
 59: 'face',
 60: 'carry',
 61: 'release',
 62: 'position',
 63: 'towards',
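As an illustration of how a word-to-integer dictionary and the special tokens might be used together, the sketch below encodes one cleaned description into a fixed-length list of ids. The helper `encode_sentence`, the miniature `encoding` dictionary, and the length of 6 are assumptions for illustration; the notebook does not show this step.

```python
PAD_token, EOS_token, UNK_token, GO_token = 0, 1, 2, 3

# Hypothetical miniature vocabulary; word ids start at 4, after the special tokens.
encoding = {"<PAD>": PAD_token, "<EOS>": EOS_token, "<UNK>": UNK_token,
            "<GO>": GO_token, "cause": 4, "leave": 5, "employee": 6, "hand": 7}

def encode_sentence(sentence, encoding, maxlen):
    # Map each word to its id, falling back to UNK for out-of-vocabulary words.
    ids = [encoding.get(word, UNK_token) for word in sentence.split()]
    # Truncate so that the end-of-sentence marker always fits.
    ids = ids[: maxlen - 1] + [EOS_token]
    # Pad on the right up to the fixed length.
    return ids + [PAD_token] * (maxlen - len(ids))

encoded = encode_sentence("employee hand injury cause", encoding, 6)
print(encoded)
```

Here `injury` is deliberately missing from the miniature vocabulary, so it maps to `UNK_token`.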
