<a href="https://colab.research.google.com/github/wlail-iu/D590-NLP-F24/blob/main/WLail_Copy_of_HealthCareChatBot_NLP_FinalProjectPart2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **NLP Final Project Part 2**

# 1. Health Care ChatBot

## Team Members

<ul>
<li> Amanda Alonzo </li>
<li> Wade Lail </li>
<li> Edmund Zarek </li>
</ul>



# 2. Preprocessing Steps (30 pts)

The essential steps for our application:
<ol>
<li> Install dependencies </li>
<li> Load chatbot model </li>
<li> Load health care Q&A data from file ("train.csv") </li>
<li> Summarize answers for similar questions </li>
<li> Perform custom named entity recognition for health care terms such as symptom, side effect, treatment, and prevention. </li>
</ol>
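As a rough illustration of the last step, a dictionary lookup can tag health care terms in a question. This is only a sketch: the `HEALTH_TERMS` lexicon and the `tag_health_terms` helper are hypothetical names, and the trigger words are illustrative assumptions rather than a list taken from our data.

```python
import re

# Hypothetical lexicon: these trigger words are illustrative assumptions,
# not a list taken from the project or its data set.
HEALTH_TERMS = {
    "SYMPTOM": {"symptom", "symptoms", "sign", "signs"},
    "SIDE_EFFECT": {"side effect", "side effects"},
    "TREATMENT": {"treatment", "treatments", "therapy"},
    "PREVENTION": {"prevent", "prevention", "preventing"},
}


def tag_health_terms(text):
    """Return sorted (label, phrase) pairs for lexicon phrases found in text."""
    lowered = text.lower()
    found = []
    for label, phrases in HEALTH_TERMS.items():
        for phrase in phrases:
            # \b keeps "sign" from matching inside "design".
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                found.append((label, phrase))
    return sorted(found)


print(tag_health_terms("What are the symptoms and treatments for LCM?"))
```

A trained NER model would generalize better than a fixed word list; the lookup is only meant to show the shape of the output.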

In [2]:
!python --version
!pip install SQLAlchemy
import sqlalchemy
sqlalchemy.__version__
!pip install ChatterBot2

Python 3.10.12


In [None]:
# load file for medical questions and answers that will be used to train the chatbot
import pandas as pd

df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,qtype,Question,Answer
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...
2,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,Individuals of all ages who come into contact ...
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos..."
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen..."


In [None]:
len(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16407 entries, 0 to 16406
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   qtype     16407 non-null  object
 1   Question  16407 non-null  object
 2   Answer    16407 non-null  object
dtypes: object(3)
memory usage: 384.7+ KB



# 3. Feature Extraction (30 pts)

Implement feature extraction tools and methods (term frequency, word embeddings, etc.).

From the Question field, derive a new question_cleansed field that:
- is converted to lower case
- has stop words and punctuation removed

From the Answer field, derive a new answer_cleansed field that:
- keeps only the first sentence of the answer, for readability in a chat format.
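Term frequency is mentioned above but not computed in this notebook. A minimal sketch of counting terms over cleansed questions might look like the following; `term_frequencies` is a hypothetical helper, and the sample strings are the cleansed questions that appear in the cell output further down.

```python
from collections import Counter


def term_frequencies(cleansed_questions):
    """Count how often each whitespace-separated token occurs across all questions."""
    counts = Counter()
    for question in cleansed_questions:
        counts.update(question.split())
    return counts


# Cleansed questions copied from the sample rows printed in this notebook.
sample = [
    "risk lymphocytic choriomeningitis lcm",
    "symptoms lymphocytic choriomeningitis lcm",
    "diagnose lymphocytic choriomeningitis lcm",
]
tf = term_frequencies(sample)
print(tf.most_common(3))
```

With the full data frame, the same function could be called as `term_frequencies(df["question_cleansed"])`.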


In [None]:
# remove punctuation and stop words from question and make lower case
# shorten answer to one sentence

import nltk
import string
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# lower case and remove punctuation
df["question_cleansed"] = df["Question"].str.lower().replace(r'[^\w\s]', '', regex=True)

# remove stop words
df["question_cleansed"] = df["question_cleansed"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

# shorten answer to first sentence
df["answer_cleansed"] = df["Answer"].apply(lambda x: x.split(".")[0]+'.' )

df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,qtype,Question,Answer,question_cleansed,answer_cleansed
0,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,LCMV infections can occur after exposure to fr...,risk lymphocytic choriomeningitis lcm,LCMV infections can occur after exposure to fr...
1,symptoms,What are the symptoms of Lymphocytic Choriomen...,LCMV is most commonly recognized as causing ne...,symptoms lymphocytic choriomeningitis lcm,LCMV is most commonly recognized as causing ne...
2,susceptibility,Who is at risk for Lymphocytic Choriomeningiti...,Individuals of all ages who come into contact ...,risk lymphocytic choriomeningitis lcm,Individuals of all ages who come into contact ...
3,exams and tests,How to diagnose Lymphocytic Choriomeningitis (...,"During the first phase of the disease, the mos...",diagnose lymphocytic choriomeningitis lcm,"During the first phase of the disease, the mos..."
4,treatment,What are the treatments for Lymphocytic Chorio...,"Aseptic meningitis, encephalitis, or meningoen...",treatments lymphocytic choriomeningitis lcm,"Aseptic meningitis, encephalitis, or meningoen..."
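Splitting on the first period, as in the cell above, also cuts at decimals and abbreviations: "The dose is 2.5 mg daily." would be shortened to "The dose is 2.". The sketch below is one possible alternative, assuming a sentence ends at `.`, `!`, or `?` followed by whitespace and then an uppercase letter or digit. `first_sentence` is a hypothetical helper, and the rule is a heuristic (it still splits after titles such as "Dr."), not a full sentence tokenizer.

```python
import re

# Sentence end: ., ! or ? followed by whitespace and then an uppercase letter or digit.
# This is a heuristic assumption, not a full sentence tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def first_sentence(answer):
    """Return the first sentence of answer, keeping its original end punctuation."""
    text = answer.strip()
    if not text:
        return ""
    return _SENTENCE_END.split(text, maxsplit=1)[0]


print(first_sentence("The dose is 2.5 mg daily. Take with food."))
```

It could be applied with `df["Answer"].apply(first_sentence)` in place of the `split(".")` lambda.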


In [None]:
# review some questions to test with in chat below
for index, row in df.iterrows():
  if index < 5:
    print("Q: " + row["Question"]  + " (" + row["question_cleansed"] + ")")
    print("A: " + row["answer_cleansed"] )

Q: Who is at risk for Lymphocytic Choriomeningitis (LCM)? ? (risk lymphocytic choriomeningitis lcm)
A: LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.
Q: What are the symptoms of Lymphocytic Choriomeningitis (LCM) ? (symptoms lymphocytic choriomeningitis lcm)
A: LCMV is most commonly recognized as causing neurological disease, as its name implies, though infection without symptoms or mild febrile illnesses are more common clinical manifestations.
Q: Who is at risk for Lymphocytic Choriomeningitis (LCM)? ? (risk lymphocytic choriomeningitis lcm)
A: Individuals of all ages who come into contact with urine, feces, saliva, or blood of wild mice are potentially at risk for infection.
Q: How to diagnose Lymphocytic Choriomeningitis (LCM) ? (diagnose lymphocytic choriomeningitis lcm)
A: During the first phase of the disease, the most common laboratory abnormalities are a low white blood cell count (leukopenia) and a low p

In [None]:
# generate conversations.yml from our data frame
# using the shortened responses
# todo: consolidate questions for improved performance (time to train is too long - 20 minutes)

yaml_lines = []
yaml_lines.append("categories:")
yaml_lines.append("- medical")
yaml_lines.append("conversations:")

for _, row in df.iterrows():
    yaml_lines.append("- - \"" + str(row["question_cleansed"]).replace('-', '').replace('"','') + "\"")
    yaml_lines.append("  - \"" + str(row["answer_cleansed"]).replace('-', '').replace('"','') + "\"")


yaml_str = "\n".join(yaml_lines)
print(yaml_str[0:10])
# open in write mode so that re-running this cell replaces the file
# instead of appending a second copy of the corpus
with open('conversations.yml', 'w') as the_file:
    the_file.write(yaml_str)

"""
sample format:

categories:
- medical
conversations:
- - what disease does a carcinogen cause
  - cancer.
- - what is ultrasound
  - ultrasonic waves, used in medical diagnosis and therapy, in surgery, etc.
- - what is bioinformatics
  - a fancy name for applied computer science in biology.
- - what is cytology
  - the study of cells.
- - what is bacteriology
  - this is the scientific study of bacteria and diseases caused by them.
- - what is botulism?
  - Botulism is a rare but serious paralytic illness caused by a nerve toxin that is produced by the bacterium Clostridium botulinum and sometimes by strains of Clostridium butyricum and Clostridium baratii. There are five main kinds of botulism. Foodborne botulism is caused by eating foods that contain the botulinum toxin. Wound botulism is caused by toxin produced from a wound infected with Clostridium botulinum. Infant botulism is caused by consuming the spores of the botulinum bacteria, which then grow in the intestines and release toxin. Adult intestinal toxemia (adult intestinal colonization) botulism is a very rare kind of botulism that occurs among adults by the same route as infant botulism. Lastly, iatrogenic botulism can occur from accidental overdose of botulinum toxin. All forms of botulism can be fatal and are considered medical emergencies. Foodborne botulism is a public health emergency because many people can be poisoned by eating a contaminated food.
- - what are marine toxins?
  - Marine toxins are naturally occurring chemicals that can contaminate certain seafood. The seafood contaminated with these chemicals frequently looks, smells, and tastes normal. When humans eat such seafood, disease can result.



"""



categories


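The cell above removes quotes and hyphens so that the text does not break the YAML, and its todo notes that questions should be consolidated. A minimal sketch of both ideas follows. `build_corpus_yaml` is a hypothetical helper; keeping the first answer per question is only one possible way to consolidate; and quoting with `json.dumps` assumes the YAML loader reads the result as a double-quoted scalar.

```python
import json


def build_corpus_yaml(pairs, category="medical"):
    """Build ChatterBot-style corpus text from (question, answer) pairs.

    Keeps only the first answer seen for each distinct question.
    """
    first_answer = {}
    for question, answer in pairs:
        first_answer.setdefault(question, answer)

    lines = ["categories:", f"- {category}", "conversations:"]
    for question, answer in first_answer.items():
        # json.dumps escapes quotes and backslashes; the result is assumed to be
        # read by the YAML loader as a double-quoted scalar.
        lines.append(f"- - {json.dumps(question)}")
        lines.append(f"  - {json.dumps(answer)}")
    return "\n".join(lines) + "\n"


# Illustrative pairs; the third answer is made up to show quote escaping.
pairs = [
    ("risk lcm", "LCMV infections can occur after exposure."),
    ("risk lcm", "Individuals of all ages are at risk."),
    ("symptoms lcm", 'Known as "LCM" - a rodent-borne disease.'),
]
text = build_corpus_yaml(pairs)
print(text)
```

With the data frame, the pairs could be supplied as `zip(df["question_cleansed"], df["answer_cleansed"])`.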

In [None]:
from chatterbot import ChatBot

# set up and configure the bot in a single ChatBot call
# (creating the bot several times keeps only the settings of the last call,
# which would drop read_only)
bot = ChatBot(
    'Doctor',
    # don't save conversations, for privacy (HIPAA: sensitive health care data)
    read_only=True,
    # storage adapter
    storage_adapter='chatterbot.storage.SQLStorageAdapter',
    database_uri='sqlite:///database.sqlite3',
    # best match logic adapter
    logic_adapters=[
        'chatterbot.logic.BestMatch'
    ],
)

In [None]:
from chatterbot.trainers import ChatterBotCorpusTrainer

# train chatbot on our data using yml file generated above
# this takes a few minutes

trainer = ChatterBotCorpusTrainer(bot)

trainer.train("/content/conversations.yml")

Training conversations.yml: [####################] 100%


In [None]:
# set up sentiment analysis for
# proper response to initial chat question of "how do you feel?"
!pip install nltk

import nltk
nltk.download("punkt")
nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [None]:
# test sentiment analysis results
print(sid.polarity_scores("I feel horrible."))

print(sid.polarity_scores("I feel fine."))

print(sid.polarity_scores("my spouse feels sick."))

{'neg': 0.778, 'neu': 0.222, 'pos': 0.0, 'compound': -0.5423}
{'neg': 0.0, 'neu': 0.357, 'pos': 0.643, 'compound': 0.2023}
{'neg': 0.524, 'neu': 0.476, 'pos': 0.0, 'compound': -0.5106}


In [None]:

print(sid.polarity_scores("I feel sick.")['neg']>sid.polarity_scores("I feel sick.")['pos'] )

True
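Comparing `neg` with `pos` treats a neutral reply as "feeling well". A commonly cited convention for VADER is to use the `compound` score with cutoffs of about ±0.05. The sketch below assumes that convention; `classify_feeling` is a hypothetical helper and the threshold is adjustable.

```python
def classify_feeling(scores, threshold=0.05):
    """Map a VADER polarity_scores() dict to 'negative', 'positive', or 'neutral'."""
    compound = scores["compound"]
    if compound <= -threshold:
        return "negative"
    if compound >= threshold:
        return "positive"
    return "neutral"


# The first two dicts are copied from the polarity_scores() output above;
# the third is a made-up all-neutral score for illustration.
print(classify_feeling({'neg': 0.778, 'neu': 0.222, 'pos': 0.0, 'compound': -0.5423}))  # negative
print(classify_feeling({'neg': 0.0, 'neu': 0.357, 'pos': 0.643, 'compound': 0.2023}))   # positive
print(classify_feeling({'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}))          # neutral
```

In the chat, a "neutral" result could lead to a reply that makes no assumption about how the user feels.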


In [None]:
# TBD: will use if our app needs any similarity scores
# similarity score of the user-entered question to the cleansed data question
"""
!pip install spacy
import spacy

nlp = spacy.load("en_core_web_sm")


doc1=nlp("diabetes")
doc2=nlp("what is diabetes?")
t1=doc1[0]
t2=doc2[2]
print(f"sim betw {t1} and {t2} = ", round(t1.similarity(t2),3))
print(f"sim betw {doc1} and {doc2} = ", round(doc1.similarity(doc2),3))
"""




In [None]:
# Testing the bot with fixed inputs
# Get a response to the input text for testing
test_input =  'What is botulism?'
response = bot.get_response(test_input)

print("Input:", test_input)
print("Bot Response:", response)

test_input =   'What are marine toxins?'
response = bot.get_response(test_input)

print("Input:", test_input)
print("Bot Response:", response)

Input: What is botulism?
Bot Response: risk marburg hemorrhagic fever marburg hf
Input: What are marine toxins?
Bot Response: Marine toxins are naturally occurring chemicals that can contaminate certain seafood.


In [None]:
# Get a response to the input text
test_input =  'bacteriology'
response = bot.get_response(test_input)

print("Input:", test_input)
print("Bot Response:", response)

test_input =  'toxins'
response = bot.get_response(test_input)

print("Input:", test_input)
print("Bot Response:", response)

Input: bacteriology
Bot Response: BuschkeOllendorff syndrome results from mutations in the LEMD3 gene.
Input: toxins
Bot Response: Marine toxins are naturally occurring chemicals that can contaminate certain seafood.


In [None]:
test_input =  'how to prevent parasites - cysticercosis'
response = bot.get_response(test_input)

print("Input:", test_input)
print("Bot Response:", response)

test_input =  'risk lymphocytic choriomeningitis'
response = bot.get_response(test_input)

print("Input:", test_input)
print("Bot Response:", response)




Input: how to prevent parasites - cysticercosis
Bot Response: Cysticercosis is an infection caused by the larvae of the tapeworm, Taenia solium.
Input: risk lymphocytic choriomeningitis
Bot Response: LCMV infections can occur after exposure to fresh urine, droppings, saliva, or nesting materials from infected rodents.
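The bot was trained on `question_cleansed`, while the test inputs above are raw text. Applying the same normalization at query time may help matching, although we have not measured this. A minimal sketch follows; `cleanse_question` is a hypothetical helper, and the short `STOP_WORDS` set is a stand-in for the NLTK `stop_words` set loaded earlier.

```python
import re

# Stand-in stop word list (assumption): in the notebook this would be the
# NLTK set already loaded as `stop_words`.
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "how", "is", "of", "the", "to", "what", "who"}


def cleanse_question(text, stop_words=STOP_WORDS):
    """Lower-case, strip punctuation, and drop stop words, mirroring question_cleansed."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(word for word in no_punct.split() if word not in stop_words)


print(cleanse_question("Who is at risk for Lymphocytic Choriomeningitis (LCM)? ?"))
# risk lymphocytic choriomeningitis lcm
```

The result could then be passed to the bot, for example `bot.get_response(cleanse_question(test_input, stop_words))`.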


# 4. Main Functionality (30 pts)

This is our main task: interacting with the chatbot about medical information.

<ol>
<li> Obtain the user's acknowledgement of the medical nature of the service. </li>
<li> For all subsequent steps, identify keywords for mental health issues and add them to the response. </li>
<li> Ask the user how they feel, for a conversational tone. Obtain the adjective using NER and use sentiment analysis to shape the response. </li>
<li> Obtain a question from the user. </li>
<li> Find the similarity score of the most similar question. </li>
<li> Show the answer. </li>
<li> Ask for more information or a new question. </li>
<li> Repeat for all questions. </li>
<li> Ask for any topic the user would like a summary about. </li>
<li> Before exiting, remind the user to defer to advice from their health care provider and to contact a health care professional as needed. </li>
</ol>
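The similarity step in the list above can be sketched with a simple token-overlap measure. This uses Jaccard similarity as a stand-in; it is not the spaCy similarity from the commented-out cell, nor the matching that ChatterBot performs internally. `jaccard` and `most_similar` are hypothetical helpers.

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings (0.0 to 1.0)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def most_similar(query, questions):
    """Return (best_question, score) for the known question closest to query."""
    return max(((q, jaccard(query, q)) for q in questions), key=lambda pair: pair[1])


# Cleansed questions copied from the sample rows printed earlier.
known = [
    "risk lymphocytic choriomeningitis lcm",
    "symptoms lymphocytic choriomeningitis lcm",
    "diagnose lymphocytic choriomeningitis lcm",
]
best, score = most_similar("symptoms lcm", known)
print(best, round(score, 2))
```

`most_similar` raises `ValueError` on an empty question list, so a caller would need to check for that case.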


In [None]:
# Get the user's name
# Provide medical disclaimers
# Ask how they feel
# Get one or more questions and provide a response to each
# When the user enters bye or thanks, stop taking questions
# Provide a boilerplate exit message with medical provider details

import re

name = input("What is your name? ")
print("Welcome to the Dr. Bot Service! For medical emergencies, please dial 911.")
print("Please say Bye or Thanks before you leave.")
print("How do you feel?")
request=input(name+':')

# Check if they are feeling sick or fine and customize response accordingly

if sid.polarity_scores(request)['neg']>sid.polarity_scores(request)['pos']:
  print('Dr. Bot:',"I'm sorry you don't feel well. What would you like to know more about?")
else:
  print('Dr. Bot:',"I'm glad you feel well. What would you like to know more about?")

while True:
    request = input(name+': ')

    # remove punctuation and lower case the request
    # todo: remove stopwords

    clean_request = re.sub(r'[^\w\s]', '', request.lower())

    if clean_request  =='bye' or  clean_request =="thanks":
        print('Dr. Bot: Thanks for chatting with me! Please visit the patient portal for followup appointments and to refill any meds.')
        break

    else:

        response = bot.get_response(clean_request)
        print('Dr. Bot:',response)

What is your name? amanda
Welcome to the Dr. Bot Service! For medical emergencies, please dial 911.
Please say Bye or Thanks before you leave.
How do you feel?
amanda:bye
Dr. Bot: I'm glad you feel well. What would you like to know more about?
amanda: water
Dr. Bot: Summary : We all need to drink water.
amanda: paper
Dr. Bot: treatments congenital heart defects
amanda: what time is it
Dr. Bot: many people affected 3hydroxyacylcoa dehydrogenase deficiency
amanda: $#$255
Dr. Bot: symptoms familial dermographism
amanda: bye
Dr. Bot: Thanks for chatting with me! Please visit the patient portal for followup appointments and to refill any meds.
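In the transcript above, off-topic inputs such as "paper" still receive a medical-sounding reply. One possible guard is to fall back to a fixed message when the match confidence is low. The sketch below assumes the response object exposes a `confidence` attribute, as in upstream ChatterBot; we have not verified this for ChatterBot2. `reply_or_fallback`, the 0.5 cutoff, and the confidence values in the example are all hypothetical.

```python
FALLBACK = "Sorry, I don't have an answer for that. Could you rephrase your question?"


def reply_or_fallback(text, confidence, min_confidence=0.5, fallback=FALLBACK):
    """Return text when confidence reaches min_confidence, else the fallback message."""
    return text if confidence >= min_confidence else fallback


# In the chat loop this might be called as (assumes `response.confidence` exists):
#   print('Dr. Bot:', reply_or_fallback(str(response), response.confidence))
# The confidence values below are made up for illustration.
print(reply_or_fallback("treatments congenital heart defects", 0.12))
print(reply_or_fallback("Marine toxins are naturally occurring chemicals.", 0.93))
```

The cutoff would need tuning against real conversations, since a value that is too high would reject valid medical questions.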


# 5. Personal Contribution Statement (10 pts)


#### Summary of tasks and team members' contributions

<br>

<table>
<tr><td><b>Task</b></td><td><b>Team Member(s)</b></td><td><b>Contribution</b></td></tr>
<tr><td>Learn Web Applications</td><td>Ed</td><td>Searchlight</td></tr>
<tr><td>Learn Web Applications</td><td>Amanda</td><td>Flask Tutorial</td></tr>
<tr><td>Review Health Care Free/Public Data Sources</td><td>Amanda</td><td>Research Kaggle, NIH, etc.</td></tr>
<tr><td>Communicate Project Requirements</td><td>Wade</td><td>Obtain information from T.A.</td></tr>
</table>


---


#### **Amanda Alonzo Personal Statement**

I researched chatbots using Hugging Face and found that, although the models are powerful, loading them for transfer learning was time-consuming and perhaps more processing than our use case needs, because we have our own, more isolated body of questions and answers to draw on.

The chatterbot package provides a more lightweight option that loads quickly and can be custom trained with our data as a corpus. I tried to train the chatterbot using basic text, but this had a limit of 1,000,000 characters, which our data set exceeded. It also seemed to be limited to chronological or sequential conversational order, whereas our use case is a body of questions and answers in which someone may ask any question, in no specific order.

Therefore, I tried training the chatterbot using a YAML document that explicitly states what the user may enter and how the bot can respond, so as to support any sequence of questions.

Due to the nature of health care data, I chose to configure the bot as read-only so that any sensitive information the person enters will not be saved. This keeps PHI and other sensitive information out of storage, in line with the HIPAA considerations we wanted to support.

Although rule-based systems have limited scalability, for health care we wanted to provide more structure and explicit rules. To do this, we opted to show acknowledgements before use, covering medical emergencies and the need to consult an actual doctor.

Lastly, I researched web application frameworks by doing the Flask tutorial, and set up the GitHub and Heroku accounts to prepare for deployment.

---


#### **Wade Lail Personal Statement**


---


#### **Ed Zarek Personal Statement**

## Resources

https://www.datacamp.com/tutorial/building-a-chatbot-using-chatterbot

https://pypi.org/project/ChatterBot2/


## Data:

Health Care Q & A

https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset/data

Medication Side Effects


https://www.kaggle.com/datasets/jithinanievarghese/drugs-side-effects-and-medical-condition

