<a href="https://colab.research.google.com/github/saishdesai23/DentalAI/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dental AI Assistant: A Medical Chatbot built using Retrieval QA chain and Prompt Tuning (NLP)

This is an experimental chatbot built to study how Retrieval-Augmented Generation (RAG) and prompt tuning can be combined, so that the chatbot answers queries using both an external source of information and the knowledge of the pre-trained model.

### Library Installation

In [118]:
!pip install langchain --quiet
!pip install tiktoken --quiet
!pip install -U sentence-transformers --quiet
!pip install chromadb --quiet
!pip install openai --quiet



RAG Pipeline:

1) Loading

2) Chunking/Splitting

3) Embedding and storing

4) Retrieval and generation using prompts

## Step 1: Loading the Document

Pretrained language models are trained on generic English text. To answer a domain-specific question, however, the model needs the support of an external source of information. To generate a coherent response, we use a page from the Cleveland Clinic website (https://my.clevelandclinic.org/health/diseases/10946-cavities).

Please note that the data is used for educational purposes only and has not been altered for this use.

In [119]:
from langchain.document_loaders import WebBaseLoader

# Initialize the WebBaseLoader with the URL of the document to be loaded
loader = WebBaseLoader("https://my.clevelandclinic.org/health/diseases/10946-cavities")

# Load the document and store it in the 'data' variable
data = loader.load()

# Display the content of the loaded document
print(data)

[Document(page_content='Cavities (Tooth Decay): Symptoms, Causes & Treatment800.223.2273100 Years of Cleveland ClinicMyChartNeed Help?GivingCareersSearchClevelandClinic.orgFind A DoctorLocations & DirectionsPatients & VisitorsHealth LibraryInstitutes & DepartmentsAppointmentsHome/Health Library/Diseases & Conditions/CavitiesAdvertisementAdvertisementAdvertisementCavitiesCavities are holes, or areas of tooth decay, that form in your teeth surfaces. Causes include plaque buildup, eating lots of sugary snacks and poor oral hygiene. Treatments include dental fillings, root canal therapy and tooth extraction. The sooner you treat a cavity, the better your chance for a predictable outcome and optimal oral health.ContentsArrow DownOverviewSymptoms and CausesDiagnosis and TestsManagement and TreatmentPreventionOutlook / PrognosisLiving WithAdditional Common QuestionsContentsArrow DownOverviewSymptoms and CausesDiagnosis and TestsManagement and TreatmentPreventionOutlook / PrognosisLiving WithA

## Step 2: Splitting the Document into Chunks

We now split the loaded document into manageable chunks (also known as splits) for easier processing and retrieval in the subsequent stages of the RAG pipeline.

In [120]:
import tiktoken


# Set up token encoding for the GPT-3.5 Turbo model (https://github.com/openai/tiktoken)
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [121]:
tokenizer = tiktoken.get_encoding('cl100k_base')

# Define a function to calculate the token length of a given text
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("Dentin decay: Dentin is the layer just beneath your tooth enamel.")

15

In [122]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with specified parameters (https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 20,
    length_function = tiktoken_len
)

In [123]:
# Split the loaded document into smaller chunks
chunks = text_splitter.split_documents(data)

In [124]:
# Check the total number of chunks generated
len(chunks)

38
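To illustrate how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of a plain sliding-window splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter works recursively on separators such as paragraphs and sentences). The helper `split_with_overlap` and the placeholder tokens are hypothetical names introduced only for this example.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows

# Placeholder tokens stand in for real tiktoken tokens (an assumption for the sketch)
toy_tokens = [f"w{i}" for i in range(250)]
toy_chunks = split_with_overlap(toy_tokens, chunk_size=100, chunk_overlap=20)

print([len(c) for c in toy_chunks])             # [100, 100, 90]
print(toy_chunks[0][-20:] == toy_chunks[1][:20])  # True: 20 shared tokens
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.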

## Step 3: Storing the Vector Embeddings in Vector Database
1. **Database Storage**: To facilitate future retrieval of our document splits, it's essential to store them in a database.

2. **Embedding Model**: To convert our document splits into vector embeddings, we require an embedding model.

3. **Vector Store**: Finally, the vector embeddings and documents will be stored in a vector store. For this purpose, we will be using ChromaDB.

We use HuggingFaceEmbeddings to generate the vector for each chunk (https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#).

In [125]:
from langchain.embeddings import HuggingFaceEmbeddings


# Specify the model name and additional arguments
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device' : 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

# Initialize HuggingFace Embeddings
hf = HuggingFaceEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)

In [126]:
# testing the embedding
embed = hf.embed_documents(texts=['h','e'])

# Print the length of one of the embeddings to check its dimensions
print(len(embed[1]))

384
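The check above shows that every input text, however short, maps to a vector of the same length. The sketch below reproduces only that property, using a toy word-hashing function. It is a hypothetical stand-in (`toy_embed` and `dim=8` are assumptions for illustration) and captures none of the semantic information a trained sentence-transformer model encodes.

```python
import zlib

def toy_embed(text, dim=8):
    """Map a text to a fixed-length vector by hashing each word into one of dim buckets."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash() for strings
        bucket = zlib.crc32(word.encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec

short_vec = toy_embed("tooth decay")
long_vec = toy_embed("acids in plaque dissolve tooth enamel and create cavities")

# Both vectors have the same length regardless of how long the text is
print(len(short_vec), len(long_vec))
```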


In [127]:
# Save document chunks and their respective embeddings in a Vector Store
from langchain.vectorstores import Chroma

# Initialize Chroma vector database with chunks and HuggingFace embeddings
vectordb = Chroma.from_documents(chunks, hf)

In [128]:
# Perform a similarity search on the vector database
vectordb.similarity_search('bleeding gums', k=3)

[Document(page_content='sodas, juice and milk). The bacteria convert these carbohydrates into acids.Bacteria, acid, food and saliva mix to form dental plaque. This sticky substance coats your teeth.Without proper brushing and flossing, acids in plaque dissolve tooth enamel, creating cavities, or holes, in the enamel surface.What are the risk factors for cavities?Certain factors increase your risk of cavities, including:Dry mouth (xerostomia). Certain conditions (like Sjögren’s', metadata={'description': 'A cavity is a hole, or area of decay, in your tooth. Cavities form when acids in your mouth erode (wear down) your tooth enamel — your tooth’s hard, outer layer.', 'language': 'en', 'source': 'https://my.clevelandclinic.org/health/diseases/10946-cavities', 'title': 'Cavities (Tooth Decay): Symptoms, Causes & Treatment'}),
 Document(page_content='sodas, juice and milk). The bacteria convert these carbohydrates into acids.Bacteria, acid, food and saliva mix to form dental plaque. This st
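Conceptually, a similarity search compares the query's embedding with every stored embedding and returns the closest chunks. The sketch below ranks a few hand-written three-dimensional vectors by cosine similarity. The vectors, the texts and the helpers `cosine_similarity` and `top_k` are made up for illustration, and the distance metric a real vector store applies depends on its configuration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Return the texts of the k entries in store most similar to query_vec.
    store is a list of (text, vector) pairs."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical toy store: (chunk text, embedding) pairs
toy_store = [
    ("plaque and enamel", [0.9, 0.1, 0.0]),
    ("root canal therapy", [0.1, 0.9, 0.1]),
    ("fluoride toothpaste", [0.7, 0.2, 0.3]),
    ("insurance and billing", [0.0, 0.1, 0.9]),
]

print(top_k([1.0, 0.0, 0.0], toy_store, k=2))
# ['plaque and enamel', 'fluoride toothpaste']
```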

## Step 4: Retrieve and Generate

The retrieval step uses a RetrievalQA chain to fetch the chunks most relevant to the question and pass them to the language model. To improve the generated answers, we replace the chain's default prompt template with a custom one.
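Before the model is called, the retrieved chunks are placed in the prompt's `{context}` slot and the user's question in the `{question}` slot. The sketch below shows that substitution in plain Python. The helper `build_prompt`, the shortened template and the sample chunks are assumptions for illustration and are not LangChain's implementation.

```python
TOY_TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}

Question: {question}
Helpful Answer:"""

def build_prompt(retrieved_chunks, question, template=TOY_TEMPLATE):
    """Join the retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return template.format(context=context, question=question)

toy_prompt = build_prompt(
    [
        "A dentist drills out the decayed tissue and fills the hole.",
        "Dental fillings consist of composite resin, silver amalgam or gold.",
    ],
    "What does a dentist drill?",
)
print(toy_prompt)
```

Because the model sees only what ends up in this string, the quality of the answer depends directly on which chunks the retriever returns.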

In [129]:
import os

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Read the API key from the environment instead of hardcoding it in the notebook
openai_api_key = os.environ["OPENAI_API_KEY"]

# Initialize a language model with ChatOpenAI
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.6, openai_api_key=openai_api_key)

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}

Question: {question}
Helpful Answer:"""

# Initialize the prompt template
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


In [130]:
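# Initialize the RetrievalQA chain with the language model and the vector database retriever
# chain_type_kwargs passes the custom prompt template defined above
# return_source_documents=True also returns the retrieved source documents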
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [131]:
question = "What does a dentist drill? And what does he do after that?"
result = qa_chain({"query": question})

# Check the result of the query
result["result"]

'A dentist drills out decayed tissue from a tooth. After drilling, the dentist fills the hole with a dental filling made of composite resin, silver amalgam, or gold. Thanks for asking!'

In [132]:

# Check the source documents from which the answer was generated
result["source_documents"]

[Document(page_content='tooth, a dentist drills out the decayed tissue and fills the hole. Dental fillings consist of composite resin (a tooth-colored material), silver amalgam or gold.Root canal therapyRoot canal therapy relieves pain from advanced tooth decay. Endodontists (specialists who treat issues that affect a tooth’s root) usually perform the procedure. During root canal treatment, an endodontist removes the tooth pulp, then fills the canals and pulp chamber with gutta-percha (a special filling', metadata={'description': 'A cavity is a hole, or area of decay, in your tooth. Cavities form when acids in your mouth erode (wear down) your tooth enamel — your tooth’s hard, outer layer.', 'language': 'en', 'source': 'https://my.clevelandclinic.org/health/diseases/10946-cavities', 'title': 'Cavities (Tooth Decay): Symptoms, Causes & Treatment'}),
 Document(page_content='tooth, a dentist drills out the decayed tissue and fills the hole. Dental fillings consist of composite resin (a to