<a href="https://colab.research.google.com/github/saishdesai23/DentalAI/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG

### Library Installation

In [56]:
!pip install langchain --quiet
!pip install tiktoken --quiet
!pip install -U sentence-transformers --quiet
!pip install chromadb --quiet
!pip install openai

Collecting openai
  Downloading openai-1.3.7-py3-none-any.whl (221 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Installing collected packages: httpcore, httpx, openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
Successfully installed httpcore-1.0.2 httpx-0.25.2 openai-1

## Step 1: Loading the Document

In [13]:
from langchain.document_loaders import WebBaseLoader

# Initialize the WebBaseLoader with the URL of the document to be loaded
loader = WebBaseLoader("https://my.clevelandclinic.org/health/diseases/10946-cavities")

# Load the document and store it in the 'data' variable
data = loader.load()

In [15]:
# Display the content of the loaded document
print(data)

[Document(page_content='Cavities (Tooth Decay): Symptoms, Causes & Treatment800.223.2273100 Years of Cleveland ClinicMyChartNeed Help?GivingCareersSearchClevelandClinic.orgFind A DoctorLocations & DirectionsPatients & VisitorsHealth LibraryInstitutes & DepartmentsAppointmentsHome/Health Library/Diseases & Conditions/CavitiesAdvertisementAdvertisementAdvertisementCavitiesCavities are holes, or areas of tooth decay, that form in your teeth surfaces. Causes include plaque buildup, eating lots of sugary snacks and poor oral hygiene. Treatments include dental fillings, root canal therapy and tooth extraction. The sooner you treat a cavity, the better your chance for a predictable outcome and optimal oral health.ContentsArrow DownOverviewSymptoms and CausesDiagnosis and TestsManagement and TreatmentPreventionOutlook / PrognosisLiving WithAdditional Common QuestionsContentsArrow DownOverviewSymptoms and CausesDiagnosis and TestsManagement and TreatmentPreventionOutlook / PrognosisLiving WithA

## Step 2: Splitting the Document into Chunks

Split the document into manageable chunks, measuring chunk length in tokens rather than characters.

In [19]:
import tiktoken

# Look up which encoding the GPT-3.5 Turbo model uses (https://github.com/openai/tiktoken)
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [20]:
tokenizer = tiktoken.get_encoding('cl100k_base')

# Define a function to calculate the token length of a given text
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("Dentin decay: Dentin is the layer just beneath your tooth enamel.")

15

In [21]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter with specified parameters (https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 20,
    length_function = tiktoken_len
)

In [22]:
# Split the loaded document into smaller chunks
chunks = text_splitter.split_documents(data)

In [32]:
# Check the total number of chunks generated
len(chunks)

38
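To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a sliding window over a token list. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `split_with_overlap` and the toy word list are assumptions made purely for illustration.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


# Hypothetical input: 250 placeholder "tokens" standing in for real ones.
words = [f"w{i}" for i in range(250)]
windows = split_with_overlap(words, chunk_size=100, chunk_overlap=20)
print([len(w) for w in windows])  # [100, 100, 90]
```

With the same settings as the cell above (100 and 20), each window starts 80 tokens after the previous one, so the last 20 tokens of one chunk reappear at the start of the next.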

## Step 3: Storing the Vector Embeddings in a Vector Database

1. **Database Storage**: To facilitate future retrieval of our document splits, it's essential to store them in a database.

2. **Embedding Model**: To convert our document splits into vector embeddings, we require an embedding model.

3. **Vector Store**: Finally, the vector embeddings and documents will be stored in a vector store. For this purpose, we will be using ChromaDB.
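As a rough picture of what a vector store does at query time, the sketch below ranks a few hand-written vectors against a query vector. It is a toy stand-in, not ChromaDB's implementation: the helper names, the three-dimensional vectors, and the choice of cosine similarity are all assumptions for illustration (the distance metric a real store uses depends on its configuration).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the texts of the k stored vectors most similar to query_vec.
    `store` is a list of (text, vector) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical store: made-up 3-dimensional "embeddings" for three chunks.
toy_store = [
    ("chunk about enamel", [1.0, 0.0, 0.0]),
    ("chunk about fillings", [0.0, 1.0, 0.0]),
    ("chunk about gums", [0.0, 0.2, 1.0]),
]
nearest = top_k([0.0, 0.1, 0.9], toy_store, k=2)
print(nearest)  # ['chunk about gums', 'chunk about fillings']
```

A real vector store holds one vector per document chunk and uses indexing structures to avoid comparing the query against every entry.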

In [41]:
from langchain.embeddings import HuggingFaceEmbeddings
# https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#
# Specify the model name and additional arguments
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device' : 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

# Initialize HuggingFace Embeddings
hf = HuggingFaceEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)

In [44]:
# Test the embedding model on two short sample strings
embed = hf.embed_documents(texts=['h','e'])

# Print the length of one of the embeddings to check its dimensions
print(len(embed[1]))

384
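The `normalize_embeddings` option passed in `encode_kwargs` controls whether each embedding is scaled to unit length. The sketch below shows what that scaling does in plain Python; `l2_normalize` is a hypothetical helper written for this example, not part of the library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]

# For unit-length vectors, the dot product equals the cosine similarity.
other = l2_normalize([4.0, 3.0])
dot = sum(x * y for x, y in zip(unit, other))
print(round(dot, 2))  # 0.96
```

With normalization switched off, as in the cell above, the vectors keep their original magnitudes, so length-sensitive distance measures and cosine similarity can rank results differently.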


In [47]:
# Save document chunks and their respective embeddings in a Vector Store
from langchain.vectorstores import Chroma

# Initialize Chroma vector database with chunks and HuggingFace embeddings
vectordb = Chroma.from_documents(chunks, hf)

In [48]:
# Perform a similarity search on the vector database
vectordb.similarity_search('bleeding gums', k=3)

[Document(page_content='sugary foods and drinks.Many adults also get cavities. Sometimes, new decay develops around the edges of cavities treated in childhood. Adults are also more likely to have receding gums. This condition exposes your teeth roots to plaque, which can cause cavities.AdvertisementCleveland Clinic is a non-profit academic medical center. Advertising on our site helps support our mission. We do not endorse non-Cleveland Clinic products or services. PolicySymptoms and CausesWhat are the signs of', metadata={'description': 'A cavity is a hole, or area of decay, in your tooth. Cavities form when acids in your mouth erode (wear down) your tooth enamel — your tooth’s hard, outer layer.', 'language': 'en', 'source': 'https://my.clevelandclinic.org/health/diseases/10946-cavities', 'title': 'Cavities (Tooth Decay): Symptoms, Causes & Treatment'}),
 Document(page_content='sodas, juice and milk). The bacteria convert these carbohydrates into acids.Bacteria, acid, food and saliva

## Step 4: Retrieve and Generate

In [108]:
import os

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Read the API key from the environment instead of hardcoding it in the notebook
openai_api_key = os.environ["OPENAI_API_KEY"]

# Initialize a language model with ChatOpenAI
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.6, openai_api_key=openai_api_key)

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}

Question: {question}
Helpful Answer:"""

# Initialize the prompt template
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Initialize a RetrievalQA chain with the language model and vector database retriever
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [109]:
# Initialize the chain (same configuration as the previous cell)
# return_source_documents=True also returns the retrieved source chunks
# chain_type_kwargs passes in the custom prompt template defined above
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
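Conceptually, the chain retrieves the most relevant chunks, pastes them into the `{context}` slot of the prompt, and sends the filled prompt to the language model. The sketch below imitates only the prompt-filling step with plain string formatting. The shortened template, the `build_prompt` helper, the sample chunks, and the blank-line separator are assumptions for illustration; this is not LangChain's internal code.

```python
# Shortened, hypothetical version of the prompt template used in this notebook.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}

Question: {question}
Helpful Answer:"""


def build_prompt(retrieved_chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(retrieved_chunks)  # assumed separator between chunks
    return TEMPLATE.format(context=context, question=question)


# Made-up chunks standing in for what the retriever would return.
sample_chunks = [
    "A dentist drills out the decayed tissue and fills the hole.",
    "Dental fillings consist of composite resin, silver amalgam or gold.",
]
prompt = build_prompt(sample_chunks, "What does a dentist drill?")
print(prompt)
```

The model sees only this final string, which is why the quality of the retrieved chunks largely determines the quality of the answer.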

In [117]:
question = "What does a dentist drill? And what does he do after that?"
result = qa_chain({"query": question})
# Check the result of the query
result["result"]

'A dentist drills out decayed tissue from a tooth. After drilling, the dentist fills the hole with a dental filling made of composite resin, silver amalgam, or gold. Thanks for asking!'

In [115]:
# Check the source documents the answer was drawn from
result["source_documents"]

[Document(page_content='tooth, a dentist drills out the decayed tissue and fills the hole. Dental fillings consist of composite resin (a tooth-colored material), silver amalgam or gold.Root canal therapyRoot canal therapy relieves pain from advanced tooth decay. Endodontists (specialists who treat issues that affect a tooth’s root) usually perform the procedure. During root canal treatment, an endodontist removes the tooth pulp, then fills the canals and pulp chamber with gutta-percha (a special filling', metadata={'description': 'A cavity is a hole, or area of decay, in your tooth. Cavities form when acids in your mouth erode (wear down) your tooth enamel — your tooth’s hard, outer layer.', 'language': 'en', 'source': 'https://my.clevelandclinic.org/health/diseases/10946-cavities', 'title': 'Cavities (Tooth Decay): Symptoms, Causes & Treatment'}),
 Document(page_content='/ PrognosisLiving WithAdditional Common QuestionsContentsArrow DownOverviewSymptoms and CausesDiagnosis and TestsMa