<a href="https://colab.research.google.com/github/sunnamsriram1/-Apps/blob/main/sriramBioMistral_chatbot_ipynb_%E0%B0%95%E0%B0%BE%E0%B0%AA%E0%B1%80.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Medical RAG Chatbot with the BioMistral Open-Source LLM

In this notebook, we build a medical chatbot using the BioMistral LLM and a heart health PDF file.

## Installation

In [None]:
!pip install langchain sentence-transformers chromadb llama-cpp-python langchain_community pypdf



## Import libraries

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS, Chroma
from langchain_community.llms import LlamaCpp
from langchain.chains import RetrievalQA, LLMChain

In [None]:
import pathlib
import textwrap
from IPython.display import display
from IPython.display import Markdown



def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
# Used to securely store your API key
from google.colab import userdata

## Setup HuggingFace Access Token

- Log in to [HuggingFace.co](https://huggingface.co/)
- Click on your profile icon at the top-right corner, then choose [“Settings.”](https://huggingface.co/settings/)
- In the left sidebar, navigate to [“Access Tokens”](https://huggingface.co/settings/tokens)
- Generate a new access token. A “read” token is enough to download models; the “write” role is only needed to push to the Hub.


In [None]:
# Read the token from Colab's secret store.
# Outside Colab, use `os.getenv('HUGGINGFACEHUB_API_TOKEN')` to fetch an environment variable instead.
import os

HUGGINGFACEHUB_API_TOKEN = userdata.get("HUGGINGFACEHUB_API_TOKEN")
# Export the token value itself, not the literal variable name.
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
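The token lookup above depends on Colab's `userdata` store. As a rough sketch of a more portable fallback, the helper below reads the token from the environment and only prompts when it is missing. `resolve_token` and its parameters are hypothetical names introduced for illustration; they are not part of Colab, LangChain, or the Hugging Face libraries.

```python
import os
from getpass import getpass


def resolve_token(name="HUGGINGFACEHUB_API_TOKEN", env=None, prompt=getpass):
    """Return the token from the environment, or ask for it interactively.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        # Fall back to a hidden prompt so the token is never echoed in the notebook.
        token = prompt(f"Enter {name}: ").strip()
    if not token:
        raise ValueError(f"{name} is empty")
    return token
```

Passing `env` and `prompt` as arguments keeps the helper easy to exercise without touching the real environment.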

## Import document

In [None]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
loader = PyPDFDirectoryLoader("/content/drive/MyDrive/Pdf")
docs = loader.load()
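Before loading, it can help to confirm which PDF files are actually in the folder, since the loader reads whatever it finds there. The sketch below is a plain `pathlib` check; `list_pdfs` is a hypothetical helper, it only looks at the top level of the directory, and the loader's own file-matching rules may differ.

```python
from pathlib import Path


def list_pdfs(directory):
    """Return the PDF files found directly inside `directory`, sorted by path."""
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder}")
    # Match the extension case-insensitively (e.g. report.PDF).
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

For example, `[p.name for p in list_pdfs("/content/drive/MyDrive/Pdf")]` would list the file names that the directory holds, assuming Drive is mounted at that path.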

In [None]:
docs

[Document(metadata={'source': '/content/drive/MyDrive/Pdf/4159216 (3).pdf', 'page': 0}, page_content="STATE LEVEL POLICE RECRUITMENT BOARD, ANDHRA PRADESH,\nMANGALAGIRI.\n  \nFILLED IN ONLINE APPLICATION FORM\n  \nFor the posts of SCT PCs (Civil) (Men & Women),  \n  \nSCT PCs (APSP) (Men) in Police Dept.\n  \n  Vide Notification Rc.No.161/SLPRB/Rect.2/2022, dt: 28.11.2022.\nBasic Details\nRegistration No.\n4159216\n1. Name of the Candidate\n  \n(as per SSC or Equivalent Certificate)\nSUNNAM SEETHARAM\n2. Father's/ Husband's Name\nS VEERRAJU\n3. Gender\nMale\n4. Date of Birth\n  \n(as per SSC or Equivalent certificate)\n16-Aug-1996\n5. SSC/Equivalent Roll No\n1111127157\n6. Community\nABO-ST\n7. Mobile No.\n8688655324\n8. e-Mail Id\nbtgsuryacomputers@gmail.com\nAge Relaxation Details\n9. As per the information you provided in Column Nos. 9 & 10 in Online Application Form (Stage 1), do you \ncome under the Reservation Quota?\nABO-ST\na) Do You come under Creamylayer or Non-Creamylayer\nN

## Text Splitting - Chunking

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)
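To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is only an illustration of overlapping windows: `sliding_chunks` is a hypothetical function, and it does not reproduce how `RecursiveCharacterTextSplitter` picks split points (that splitter prefers natural separators such as paragraph and line breaks).

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=50):
    """Split `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea that lets neighbouring chunks in the notebook share up to 50 characters of context.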

In [None]:
len(chunks)

1552

In [None]:
chunks[0]

Document(metadata={'source': '/content/drive/MyDrive/Pdf/4159216 (3).pdf', 'page': 0}, page_content='STATE LEVEL POLICE RECRUITMENT BOARD, ANDHRA PRADESH,\nMANGALAGIRI.\n  \nFILLED IN ONLINE APPLICATION FORM\n  \nFor the posts of SCT PCs (Civil) (Men & Women),  \n  \nSCT PCs (APSP) (Men) in Police Dept.\n  \n  Vide Notification Rc.No.161/SLPRB/Rect.2/2022, dt: 28.11.2022.\nBasic Details\nRegistration No.')

In [None]:
chunks[1]

Document(metadata={'source': '/content/drive/MyDrive/Pdf/4159216 (3).pdf', 'page': 0}, page_content="Basic Details\nRegistration No.\n4159216\n1. Name of the Candidate\n  \n(as per SSC or Equivalent Certificate)\nSUNNAM SEETHARAM\n2. Father's/ Husband's Name\nS VEERRAJU\n3. Gender\nMale\n4. Date of Birth\n  \n(as per SSC or Equivalent certificate)\n16-Aug-1996\n5. SSC/Equivalent Roll No\n1111127157\n6. Community")

In [None]:
chunks[2]

Document(metadata={'source': '/content/drive/MyDrive/Pdf/4159216 (3).pdf', 'page': 0}, page_content='5. SSC/Equivalent Roll No\n1111127157\n6. Community\nABO-ST\n7. Mobile No.\n8688655324\n8. e-Mail Id\nbtgsuryacomputers@gmail.com\nAge Relaxation Details\n9. As per the information you provided in Column Nos. 9 & 10 in Online Application Form (Stage 1), do you \ncome under the Reservation Quota?\nABO-ST')

In [None]:
chunks[3]

Document(metadata={'source': '/content/drive/MyDrive/Pdf/4159216 (3).pdf', 'page': 0}, page_content='come under the Reservation Quota?\nABO-ST\na) Do You come under Creamylayer or Non-Creamylayer\nNA\nb) Sub Caste(BC-E Candidates)\nNA\n10. Do you claim benefit of Physical Measurements and age Relaxtion under category of ABO-ST? (Please \ngo through the Notification)\nYes\nWhat is your Native Scheduled Area')

In [None]:
chunks[4]

Document(metadata={'source': '/content/drive/MyDrive/Pdf/4159216 (3).pdf', 'page': 0}, page_content='Yes\nWhat is your Native Scheduled Area\nButtaygudem (Wholly)\n11. Do you claim age relaxation under category Employee of the A.P. Government? (Please go through the \nNotification)\nNo\ni) Date of Joining in Service\nNA\nii) Length of Service as on 28.11.2022\nNA\niii) Are you still in Service\nNA')

## Embeddings

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")


## Vector Store - FAISS or ChromaDB

In [None]:
vectorstore = Chroma.from_documents(chunks, embeddings)

In [None]:
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x79fd9a802d40>

In [None]:
query = "what is network"  # or try: "who is at risk of heart disease?"
search = vectorstore.similarity_search(query)
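Conceptually, a similarity search embeds the query and ranks the stored chunk vectors by how close they are to it. The sketch below shows that ranking with cosine similarity on toy vectors. It is an assumption-based illustration: `cosine` and `top_k` are hypothetical helpers, and the metric a real vector store uses depends on its configuration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k vectors most similar to `query_vec`."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings" standing in for real model output.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))  # [0, 2]
```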

In [None]:
to_markdown(search[0].page_content)

>   *  Network Scanning: To check the number of ac -
> tive hosts on the network.
>   *  Vulnerability scanning: Means to check the 
> weaknesses in the target so that it attacker us -
> es those to gain the access of the target
> So now I am going to use the universal vulner -

## Retriever

In [None]:
retriever = vectorstore.as_retriever(
    search_kwargs={'k': 5}
)

In [None]:
retriever.invoke(query)


[Document(metadata={'page': 9, 'source': '/content/drive/MyDrive/Pdf/Guide_To_Kali_Linux_pdf.pdf'}, page_content='Extra 03/2013 10BASICS\nKali Linux is the latest linux distribution made \nfor penetration testing by and used by secu -\nrity assessors and hackers. Kali Linux is al-\nso considered as a successor to Backtrack. Back -\ntrack was based on Ubuntu Distribution ( www.'),
 Document(metadata={'page': 4, 'source': '/content/drive/MyDrive/Pdf/Guide_To_Kali_Linux_pdf.pdf'}, page_content='nity support, Kali is an open source Linux distribution \ncontaining many security tools to meet the needs of \nHIPAA network vulnerability scans. \nKALI LINUX  \n– A Solution to HACKING/SECURITY 40\nBy Deepanshu Khanna, Linux Security Researcher\nToday is the world of technology and everyone some -'),
 Document(metadata={'page': 13, 'source': '/content/drive/MyDrive/Pdf/Guide_To_Kali_Linux_pdf.pdf'}, page_content='•  apt-get update\n•  apt-get upgrade\n•  apt-get dist-upgrade\nSummary\nKali Linux 

## Large Language Model - Open Source

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
llm = LlamaCpp(
    model_path="/content/drive/MyDrive/Model&Data/BioMistral-7B.Q4_K_M.gguf",
    temperature=0.3,
    max_tokens=2048,
    top_p=1,
)

ValidationError: 1 validation error for LlamaCpp
__root__
  Could not load Llama model from path: /content/drive/MyDrive/Model&Data/BioMistral-7B.Q4_K_M.gguf. Received error Model path does not exist: /content/drive/MyDrive/Model&Data/BioMistral-7B.Q4_K_M.gguf (type=value_error)
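The error above means the GGUF file was not found at the given Drive path. One way to surface this earlier is to check the path before constructing the model. The following is a minimal sketch; `require_model_file` is a hypothetical helper, not part of `llama-cpp-python` or LangChain.

```python
from pathlib import Path


def require_model_file(path):
    """Return `path` as a Path if it points to an existing file, otherwise raise."""
    model = Path(path)
    if not model.is_file():
        # Distinguish a missing folder (e.g. Drive not mounted) from a missing file.
        if model.parent.is_dir():
            hint = "file not found in existing folder"
        else:
            hint = "parent folder is missing"
        raise FileNotFoundError(f"{model} ({hint})")
    return model
```

The result can then be passed on as `model_path=str(require_model_file(...))`, so a wrong path fails with a short message before any model loading starts.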

## RAG Chain

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
from langchain.prompts import ChatPromptTemplate

In [None]:
template = """
<|context|>
You are an AI assistant that follows instructions extremely well.
Please be truthful and give direct answers based on the context below.
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""

In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
rag_chain = (
    {"context": retriever,  "query": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
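In plain terms, a RAG chain retrieves chunks for the question, folds them into the prompt, and sends the filled prompt to the model. The sketch below shows only the prompt-assembly step in ordinary Python. It is not how LangChain implements the pipe operator; `SKETCH_TEMPLATE`, `format_docs`, and `build_prompt` are hypothetical names defined here so the example is self-contained.

```python
# Hypothetical template for illustration; it follows the same tag layout
# as the notebook's prompt but is defined locally.
SKETCH_TEMPLATE = """<|context|>
{context}
</s>
<|user|>
{query}
</s>
<|assistant|>
"""


def format_docs(page_contents):
    """Join retrieved chunk texts into one context string."""
    return "\n\n".join(text.strip() for text in page_contents)


def build_prompt(page_contents, query):
    """Fill the template with the joined context and the user question."""
    return SKETCH_TEMPLATE.format(context=format_docs(page_contents), query=query)
```

Joining only the chunk text keeps metadata out of the prompt, which can matter when the model's context window is small.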

In [None]:
response = rag_chain.invoke("what disease affect the heart?")

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4075.93 ms
llama_print_timings:      sample time =      79.20 ms /   103 runs   (    0.77 ms per token,  1300.52 tokens per second)
llama_print_timings: prompt eval time =    9099.07 ms /    16 tokens (  568.69 ms per token,     1.76 tokens per second)
llama_print_timings:        eval time =   87709.65 ms /   102 runs   (  859.90 ms per token,     1.16 tokens per second)
llama_print_timings:       total time =   97416.49 ms /   118 tokens


In [None]:
to_markdown(response)

> The heart is affected by many diseases, some of which include coronary artery disease, cardiomyopathy, endocarditis, myocarditis, arrhythmia, congestive heart failure, atherosclerosis, hypertrophic cardiomyopathy, valvular heart disease, arrhythmogenic right ventricular cardiomyopathy, dilated cardiomyopathy, hypertension, and heart valve stenosis.

In [None]:
# Simple chat loop: type 'exit' to stop.
while True:
  user_input = input("Input Prompt: ").strip()
  if user_input == 'exit':
    print('Exiting')
    break  # leave the loop without stopping the notebook kernel
  if user_input == '':
    continue
  result = rag_chain.invoke(user_input)
  print("Answer: ", result)

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4075.93 ms
llama_print_timings:      sample time =     251.28 ms /   368 runs   (    0.68 ms per token,  1464.47 tokens per second)
llama_print_timings: prompt eval time =    7480.05 ms /    14 tokens (  534.29 ms per token,     1.87 tokens per second)
llama_print_timings:        eval time =  300823.51 ms /   367 runs   (  819.68 ms per token,     1.22 tokens per second)
llama_print_timings:       total time =  310512.93 ms /   381 tokens


Answer:   Heart diseases refer to a group of conditions that involve the heart and blood vessels, including coronary artery disease, heart failure, arrhythmias, and congenital heart defects. Coronary artery disease is the most common type of heart disease and involves the buildup of plaque in the arteries that supply blood to the heart, which can lead to reduced blood flow and oxygen supply to the heart muscle. This can cause chest pain, shortness of breath, and other symptoms, and can increase the risk of heart attack and stroke. Heart failure is another common type of heart disease, which occurs when the heart muscle becomes weakened or damaged, causing it to pump blood less efficiently than it should. This can lead to symptoms such as fatigue, shortness of breath, swelling in the legs, and weight gain, and can increase the risk of premature death. Arrhythmias are another type of heart disease, which occur when there are abnormalities in the rate or rhythm of the heartbeat. This can 