# WELCOME TO OUR MEDICAL CHATBOT

# 1. Introduction to Large Language Models (LLMs), Chatbots, and Our Project

#### Large Language Models (LLMs) are advanced AI systems trained on vast amounts of text data. They have the ability to generate coherent, context-aware text based on the input they receive. LLMs like GPT (from OpenAI), BERT (from Google), and LLaMA (from Meta) represent the cutting edge in natural language processing (NLP).
#### Chatbots utilize LLMs to simulate human-like conversations. They can understand and generate responses to user inputs, making them useful in various applications like customer service, personal assistants, and specialized domains like healthcare.

### Purpose of Our Project
##### The primary goal of our project is to develop a medical chatbot that assists healthcare providers and patients by giving quick access to medical information, supporting patient interaction, and helping with the diagnosis of health conditions.
##### Our solution aims to harness the power of LLaMA 2 to create a chatbot tailored for the medical field. Here’s what we envision:
- Accurate and Fast Medical Responses: Leveraging LLaMA 2’s ability to understand and generate natural language to provide precise medical information.
- User-friendly Interaction: A chat interface that can interpret complex medical language and provide simplified explanations.

# 2. Understanding LLaMA 2


### What is LLaMA 2?


#### LLaMA (Large Language Model Meta AI) is Meta's family of language models. LLaMA 2 comes in three sizes (7, 13, and 70 billion parameters), providing flexibility in deployment based on resource availability and requirement complexity.
#### Pretrained Models: The base models are pretrained on a large corpus of publicly available text, and the chat variants are further fine-tuned for dialogue, so they can follow context and generate coherent responses. They are general-purpose models rather than medical-specific ones, which is why this project supplies medical context through retrieval over PDF documents.
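
To make "resource availability" concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at each model size. It assumes the nominal parameter counts (7, 13, and 70 billion) and a flat number of bits per weight; real quantized files carry some extra per-block metadata, and activations and the KV cache are not counted, so treat the numbers as lower bounds.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
# Activations, KV cache, and runtime overhead are NOT included.
PARAM_COUNTS = {"llama-2-7b": 7e9, "llama-2-13b": 13e9, "llama-2-70b": 70e9}

def weight_memory_gb(num_params, bits_per_weight):
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, n in PARAM_COUNTS.items():
    fp16 = weight_memory_gb(n, 16)
    q8 = weight_memory_gb(n, 8)
    print(f"{name}: ~{fp16:.0f} GB at 16-bit, ~{q8:.0f} GB at 8-bit")
```

Under these assumptions the 7B model at 8 bits needs roughly 7 GB for weights, which is why it is the one run locally in Section 3 while the larger models are accessed through a hosted API.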

In [1]:
import base64
from IPython.display import Image, display
import matplotlib.pyplot as plt
def mm(graph):
  graphbytes = graph.encode("ascii")
  base64_bytes = base64.b64encode(graphbytes)
  base64_string = base64_bytes.decode("ascii")
  display(Image(url="https://mermaid.ink/img/" + base64_string))

def llama2_family():
  mm("""
  graph LR;
      llama-2 --> llama-2-7b
      llama-2 --> llama-2-13b
      llama-2 --> llama-2-70b
      llama-2-7b --> llama-2-7b-chat
      llama-2-13b --> llama-2-13b-chat
      llama-2-70b --> llama-2-70b-chat
      classDef default fill:#CCE6FF,stroke:#84BCF5,textColor:#1C2B33,fontFamily:trebuchet ms;
  """)


llama2_family()

### Accessing LLaMA 2


#### Download and Self-host: For total control over data and interactions, LLaMA models can be downloaded and hosted locally. This is particularly important in medical applications where patient data privacy is paramount.
#### Hosted API Platforms: Platforms like Replicate provide APIs to access LLaMA models, offering a balance between ease of use and powerful computational capabilities without the need for local infrastructure.
#### Container Platforms: For scalable deployment, LLaMA models can be containerized and deployed on cloud platforms such as AWS, Azure, and GCP, which support high availability and scalable responses to demand.

# 3. Local Model Setup (7 Billion Model)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
 %pip install -qU \
    replicate \
    langchain \
    sentence_transformers \
    pdf2image \
    pdfminer \
    pdfminer.six \
    unstructured


In [6]:
%pip install ctransformers


Collecting ctransformers
  Downloading ctransformers-0.2.27-py3-none-any.whl (9.9 MB)
Installing collected packages: ctransformers
Successfully installed ctransformers-0.2.27


In [7]:
%pip install pypdf



In [8]:
%pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0


In [13]:
import base64
from IPython.display import Image, display, Markdown
import matplotlib.pyplot as plt
import ipywidgets as widgets
from langchain.llms import CTransformers

# Initialize the LLaMA 7B model with specific configurations
# max_new_tokens controls the maximum length of the generation
# temperature controls randomness in the response generation
# context_length specifies how much context the model should consider
llama_model = CTransformers(
    model="/content/drive/MyDrive/llama-2-7b-chat.ggmlv3.q8_0.bin",
    model_type="llama",
    config={'max_new_tokens':1000, 'temperature':0.75, 'context_length':2000}
)
print("LLaMA model initialized with configuration:", llama_model.config)

LLaMA model initialized with configuration: {'max_new_tokens': 1000, 'temperature': 0.75, 'context_length': 2000}
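
The `temperature` setting above is easier to reason about with a small numeric example. The sketch below shows the standard idea of temperature scaling: logits are divided by the temperature before the softmax, so low values sharpen the distribution and high values flatten it. The logits are made-up numbers for three hypothetical candidate tokens, and the function is an illustration, not the sampling code used inside `ctransformers`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.25, 0.75, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top-scoring token receives most of the probability mass at a low temperature and progressively less as the temperature rises, which is the "randomness" the configuration comment refers to.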


In [14]:
# Importing the necessary modules for document loading
from langchain.document_loaders import PyPDFDirectoryLoader

# Initialize the loader for a directory containing PDF files
# This allows the model to use local PDF files as a data source for information retrieval
loader = PyPDFDirectoryLoader('./data')
documents = loader.load()
print("Loaded documents from directory:", loader.path)


Loaded documents from directory: ./data


In [15]:
# Importing the text splitting utility
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split loaded documents into smaller chunks for easier handling
# This improves efficiency and effectiveness in document processing
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
print("Number of text chunks created:", len(all_splits))

Number of text chunks created: 30
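
To show what `chunk_size=1000` and `chunk_overlap=20` mean, here is a simplified fixed-width sliding-window splitter. It is a stand-in for illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, which this sketch does not attempt. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the last 20 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still partly visible to the retriever from both sides.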


In [16]:
# Setting up the embedding model for vector representation of text
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Define the embedding model from HuggingFace's model repository
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cpu"}  # Use "cpu" for CPU usage, "cuda" for GPU
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
print("Embedding model loaded:", model_name)

Embedding model loaded: sentence-transformers/all-mpnet-base-v2


In [17]:
# Initialize the vector store using FAISS for efficient similarity search
vectorstore = FAISS.from_documents(all_splits, embeddings)
print("Vector store created and documents embedded.")

Vector store created and documents embedded.
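
Conceptually, the similarity search performed by the vector store comes down to finding the stored vectors closest to the query vector. The sketch below does this by brute force, assuming a flat (exhaustive) index and Euclidean distance; the three-dimensional vectors are toy values, whereas real sentence embeddings have hundreds of dimensions. `top_k` is a hypothetical helper, not a FAISS function.

```python
import math

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, distance) pairs for the k stored vectors nearest to the query."""
    scored = [(math.dist(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort()  # smallest distance first
    return [(i, d) for d, i in scored[:k]]

doc_vecs = [
    [0.9, 0.1, 0.0],  # chunk 0: close to the query
    [0.0, 1.0, 0.0],  # chunk 1: far from the query
    [0.8, 0.2, 0.1],  # chunk 2: fairly close to the query
]
query = [1.0, 0.0, 0.0]
print(top_k(query, doc_vecs))
```

Chunks 0 and 2 are returned ahead of chunk 1; in the notebook, the text of the nearest chunks is what gets passed to the language model as context.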


In [18]:
# Import the Conversational Retrieval Chain from langchain
from langchain.chains import ConversationalRetrievalChain
def md(t):
  display(Markdown(t))

# Set up a conversational retrieval chain with the LLaMA model
# This allows the model to use the embedded documents for answering queries
chain = ConversationalRetrievalChain.from_llm(llama_model, vectorstore.as_retriever(), return_source_documents=True)
chat_history1 = []  # Initialize chat history for context management in conversation
query1 = "what is covid 19 treatment in severe cases"  # Define a sample medical query

# Execute the query using the conversational chain
result1 = chain({"question": query1, "chat_history": chat_history1})
md(result1['answer'])  # Display the result using Markdown for better formatting




 I'm just an AI Assistant trained by Meta AI, I don't have access to real-time medical information or the ability to provide personalized medical advice. However, I can tell you that COVID-19 treatment varies depending on the severity of the illness and individual patient factors. For mild cases, self-isolation, rest, and over-the-counter medications may be sufficient. For severe cases, hospitalization and more intensive treatments such as mechanical ventilation or extracorporeal membrane oxygenation (ECMO) may be necessary. It's important to consult a qualified medical professional for proper diagnosis and treatment.
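
The cell above starts with an empty `chat_history1`, so the chain answers the question without any conversational context. For follow-up questions, the history can be grown after each turn. The sketch below assumes the chain accepts `chat_history` as a list of `(question, answer)` tuples and returns a dict with an `"answer"` key, as in the call above; `ask` and `echo_chain` are hypothetical names, and `echo_chain` is only a stand-in so the example runs without a model.

```python
def ask(chain, question, chat_history):
    """Query the chain, then record the turn so later questions can refer back to it."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]

def echo_chain(inputs):
    # Hypothetical stand-in for the retrieval chain: reports how many earlier turns it saw.
    n = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} (saw {n} earlier turns)"}

history = []
ask(echo_chain, "what is covid 19 treatment in severe cases", history)
print(ask(echo_chain, "what does ECMO stand for", history))
```

With the real chain in place of `echo_chain`, the second question would be interpreted in light of the first exchange.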

# 4. Using the LLaMA 2 Model via API (13 Billion Model)

In [19]:
import base64
from IPython.display import Image, display, Markdown
import matplotlib.pyplot as plt
import ipywidgets as widgets
from langchain.llms import Replicate
import os

# Read the Replicate API token from the environment instead of hardcoding it in the notebook;
# prompt for it only if it is not already set
from getpass import getpass
if not os.environ.get("REPLICATE_API_TOKEN"):
    os.environ["REPLICATE_API_TOKEN"] = getpass("Replicate API token: ")
print("API Token set for Replicate access.")

API Token set for Replicate access.


In [20]:
# Defining the model on Replicate
llama2_13b = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"

# Initialize the Llama 13B model hosted on Replicate with specific parameters
llama_model = Replicate(
     model=llama2_13b,
     model_kwargs={"temperature": 0.75, "top_p": 1, "max_new_tokens": 1000}
 )
print("LLaMA 13B model initialized on Replicate platform.")

LLaMA 13B model initialized on Replicate platform.


In [21]:
# Import the necessary document loader modules
from langchain.document_loaders import PyPDFDirectoryLoader

# Load PDF documents from a local directory
loader = PyPDFDirectoryLoader('./data')
documents = loader.load()
print("Documents loaded from:", loader.path)

Documents loaded from: ./data


In [22]:
# Import text splitter for document processing
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
print("Document text has been split into", len(all_splits), "chunks.")

Document text has been split into 30 chunks.


In [23]:
# Set up the embedding model using Hugging Face's transformers
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Configure embeddings for the text splits
model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs={"device": "cpu"})
print("Embedding model configured:", model_name)


Embedding model configured: sentence-transformers/all-mpnet-base-v2


In [24]:
# Create a vector store to hold the document embeddings
vectorstore = FAISS.from_documents(all_splits, embeddings)
print("Vector store created and populated with document embeddings.")


Vector store created and populated with document embeddings.


In [25]:
def md(t):
  display(Markdown(t))

# Import conversational retrieval chain setup
from langchain.chains import ConversationalRetrievalChain

# Set up a chain to use the LLaMA model and vector store for answering queries
chain = ConversationalRetrievalChain.from_llm(llama_model, vectorstore.as_retriever(), return_source_documents=True)
chat_history2 = []

# Define a medical query to test the system
query2 = "what is covid 19 treatment in severe cases"
result2 = chain({"question": query2, "chat_history": chat_history2})
md(result2['answer'])  # Display the answer using Markdown for better formatting
print("Query processed and response generated.")


 Based on the information provided, there is no specific medicine recommended for the treatment of COVID-19 at present. According to the World Health Organization (WHO) (2019), some treatments can be provided based on early symptoms such as mild pain relievers, cough syrup, resting, and high amount of fluid intake. Additionally, some particular medications are under investigation and are being tested through clinical trials in the United States and around the world (WHO 2019; CDC 2020c). Ventilators help infected patients in breathing and support lung function, but they do not cure COVID-19. Extracorporeal membrane oxygenation (ECMO) is another useful technique to support the body during infection, and it is considered as a life-saving therapy for refractory respiratory failure (Henry 2020).

Query processed and response generated.


# 5. Using the LLaMA 3 Model via API (70 Billion Model)

In [31]:
llama3_70b = "meta/meta-llama-3-70b-instruct"

llama_model = Replicate(
     model=llama3_70b,
     model_kwargs={"temperature": 0.75,"top_p": 1, "max_new_tokens":1000}
 )
print("LLaMA 3 70B model initialized on Replicate platform.")

LLaMA 3 70B model initialized on Replicate platform.


In [32]:
# Import the necessary document loader modules
from langchain.document_loaders import PyPDFDirectoryLoader

# Load PDF documents from a local directory
loader = PyPDFDirectoryLoader('./data')
documents = loader.load()
print("Documents loaded from:", loader.path)

Documents loaded from: ./data


In [33]:
# Import text splitter for document processing
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split documents into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
print("Document text has been split into", len(all_splits), "chunks.")

Document text has been split into 30 chunks.


In [34]:
# Set up the embedding model using Hugging Face's transformers
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Configure embeddings for the text splits
model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs={"device": "cpu"})
print("Embedding model configured:", model_name)

Embedding model configured: sentence-transformers/all-mpnet-base-v2


In [35]:
# Create a vector store to hold the document embeddings
vectorstore = FAISS.from_documents(all_splits, embeddings)
print("Vector store created and populated with document embeddings.")


Vector store created and populated with document embeddings.


In [36]:
def md(t):
  display(Markdown(t))

# Import conversational retrieval chain setup
from langchain.chains import ConversationalRetrievalChain

# Set up a chain to use the LLaMA model and vector store for answering queries
chain = ConversationalRetrievalChain.from_llm(llama_model, vectorstore.as_retriever(), return_source_documents=True)
chat_history3 = []

# Define a medical query to test the system
query3 = "what is covid 19 treatment in severe cases"
result3 = chain({"question": query3, "chat_history": chat_history3})
md(result3['answer'])  # Display the answer using Markdown for better formatting
print("Query processed and response generated.")

Based on the provided context, in severe cases of COVID-19, treatment may include:

1. Ventilators to help infected patients breathe and support lung function.
2. Extracorporeal membrane oxygenation (ECMO), which is considered a life-saving therapy for refractory respiratory failure.

Please note that these treatments do not cure COVID-19 but rather provide support to the body during infection.

Query processed and response generated.
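
Because the chains are built with `return_source_documents=True`, each result also carries the retrieved chunks, which is useful for checking where an answer came from. The sketch below formats such a list into short citations. It assumes each document exposes a `metadata` dict with `source` and `page` keys; the `Doc` class and the file name `data/example.pdf` are hypothetical stand-ins so the example runs on its own.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    # Hypothetical stand-in for a retrieved document chunk
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_sources(source_documents):
    """Return one 'file (page N)' label per distinct source, in retrieval order."""
    seen = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        label = f"{source} (page {page})"
        if label not in seen:
            seen.append(label)
    return seen

docs = [
    Doc("...", {"source": "data/example.pdf", "page": 3}),
    Doc("...", {"source": "data/example.pdf", "page": 3}),
    Doc("...", {"source": "data/example.pdf", "page": 7}),
]
print(format_sources(docs))
```

In the notebook this would be called as `format_sources(result3['source_documents'])`, under the same assumption about the metadata keys.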


In [37]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Function to get BERT embeddings
def get_embedding(text, model_name='bert-base-uncased'):
    # Initialize tokenizer and model for BERT
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    # Prepare the text for BERT using the tokenizer
    # This turns the text into a format BERT can understand
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Disable gradient calculation for performance
    with torch.no_grad():
        # Get the model's output, which includes embeddings
        outputs = model(**inputs)

    # The embeddings are averaged to get a single vector per input
    embeddings = outputs.last_hidden_state.mean(1)
    return embeddings.numpy()  # Convert tensor to NumPy array for compatibility with cosine_similarity


In [39]:
# Function to compare responses using cosine similarity
def compare_responses(response1, response2, model_name='bert-base-uncased'):
    # Convert responses to embeddings
    emb1 = get_embedding(response1, model_name)
    emb2 = get_embedding(response2, model_name)

    # Calculate the cosine similarity between the two sets of embeddings
    similarity = cosine_similarity(emb1, emb2)
    return similarity[0][0]


# Compare the local 7B response (result1) with the LLaMA 3 70B response (result3) and print the similarity score
similarity_score = compare_responses(result1['answer'], result3['answer'])
print("Cosine similarity score:", similarity_score)

Cosine similarity score: 0.9172166
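
For reference, the score above is the cosine of the angle between the two embedding vectors, cos(u, v) = (u · v) / (‖u‖ ‖v‖), so it depends on direction only and not on vector length. A minimal pure-Python version of the same formula, using toy vectors rather than BERT embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
print(cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction, different length
```

A value near 1 means the two responses point in nearly the same direction in embedding space; it does not by itself say that either response is medically correct.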


# 6. Chatbot

In [None]:
%pip install gradio

In [40]:
import gradio as gr
from langchain.llms import CTransformers, Replicate
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
import os

# Read the Replicate API token from the environment (or prompt for it) instead of hardcoding it
from getpass import getpass
REPLICATE_API_TOKEN = os.environ.get("REPLICATE_API_TOKEN") or getpass("Replicate API token: ")
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN

# Defining model identifiers with more descriptive names for easier selection in the UI
model_names = {
    "7B Model (LLaMa 2)": CTransformers,
    "13B Model (LLaMa 2)": Replicate,
    "70B Model (LLaMa 3)" : Replicate
}

# Configuration for the Replicate model with a specific version hash
llama2_13b = "meta/llama-2-13b-chat:f4e2de70d66816a838a89eeeb621910adffb0dd0baba3976c96980970978018d"
llama3_70b = "meta/meta-llama-3-70b-instruct"

# Initialize document loader and load the PDFs from the specified directory
loader = PyPDFDirectoryLoader('./data')
documents = loader.load()

# Process the loaded documents into manageable text chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)

# Setup the embedding model with Sentence Transformers and initialize vector storage with FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs={"device": "cpu"})
vectorstore = FAISS.from_documents(all_splits, embeddings)

# Define the response function for the chatbot, which uses selected model and configuration
def chatbot_response(model_type, user_input, max_new_tokens, temperature, top_p, context_length):
    # Select the model based on user input and initialize with specified parameters
    if model_type == "7B Model (LLaMa 2)":
        model = CTransformers(
            model="./llama-2-7b-chat.ggmlv3.q8_0.bin",
            model_type="llama",
            config={'max_new_tokens': int(max_new_tokens), 'temperature': float(temperature), 'context_length': int(context_length)}
        )
    elif model_type == "13B Model (LLaMa 2)":
        model = Replicate(
            model=llama2_13b,
            replicate_api_token=REPLICATE_API_TOKEN,
            model_kwargs={"temperature": float(temperature), "top_p": float(top_p), "max_new_tokens": int(max_new_tokens)}
        )
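    elif model_type == "70B Model (LLaMa 3)":
        model = Replicate(
            model=llama3_70b,
            replicate_api_token=REPLICATE_API_TOKEN,
            model_kwargs={"temperature": float(temperature), "top_p": float(top_p), "max_new_tokens": int(max_new_tokens)}
        )
    else:
        # No model selected in the UI (the radio starts empty), so `model` would be undefined below
        raise gr.Error("Please select a model type.")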

    # Create a conversational retrieval chain with the selected model and query it
    chain = ConversationalRetrievalChain.from_llm(model, vectorstore.as_retriever(), return_source_documents=True)
    chat_history = []
    result = chain({"question": user_input, "chat_history": chat_history})
    return result['answer']

# Setup the Gradio interface; Top P is used by the Replicate models and Context Length by the local 7B model
interface = gr.Interface(
    fn=chatbot_response,
    inputs=[
        gr.Radio(list(model_names.keys()), label="Model Type"),
        gr.Textbox(lines=2, placeholder="Type your question here..."),
        gr.Number(label="Max New Tokens", value=1000, step=1),
        gr.Number(label="Temperature", value=0.75, step=0.01),
        # `visible` expects a boolean, not a callable, so both fields are always shown
        gr.Number(label="Top P (13B and 70B models)", value=1.0, step=0.01),
        gr.Number(label="Context Length (7B model)", value=2000, step=1)
    ],
    outputs="text",
    title="MEDICAL Chatbot",
    description="Select the model type and adjust the parameters to see how they affect the model's responses."
)

# Run the interface, making it available as a web application
interface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://d6f56702730af733c7.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


