# **Implementing Retrieval-Augmented Generation (RAG) on Medical Wiki Terms Dataset**



## **1. Project Overview**

- **Objective**: Build a RAG architecture for answering medical-related questions based on the `gamino/wiki_medical_terms` dataset.

- **Dataset**: Medical Wiki Terms dataset containing medical terms and definitions.

- **Key Components**:

  1. Data preprocessing

  2. Document embedding and storage

  3. Retrieval system setup

  4. Language model integration for response generation

  5. Evaluation of the RAG pipeline

---

## **2. Setup and Data Loading**

### **2.1. Install Required Libraries**

- Install the necessary libraries:

  - LangChain

  - Hugging Face Datasets

  - Pinecone vector store

  - OpenAI or Hugging Face Transformers for language models

In [1]:
!pip install -q langchain==0.3.8
!pip install -q langchain-community==0.3.8
!pip install -q langchain-huggingface==0.1.2
!pip install -q langchain-pinecone==0.2.0

!pip install -q transformers==4.46.3
!pip install -q datasets==3.1.0

!pip install -q pinecone-client==5.0.1

!pip install -q pyngrok==7.2.1
!pip install -q streamlit==1.40.1



In [3]:
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset

device = "cuda" if torch.cuda.is_available() else "cpu"


### **2.2. Load the Dataset**

- Load the `gamino/wiki_medical_terms` dataset using the Hugging Face Datasets library.

- Explore the dataset to understand its structure (fields like `page_title` and `page_text`).


In [4]:
dataset_raw = load_dataset("gamino/wiki_medical_terms")


In [5]:
dataset_raw['train']

Dataset({
    features: ['page_title', 'page_text', '__index_level_0__'],
    num_rows: 6861
})

---


## **3. Preprocessing**


### **3.1. Data Cleaning**

- Remove unnecessary fields or rows if any.

- Ensure consistent formatting for terms and definitions.
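As one possible way to apply the formatting bullet above, the sketch below normalizes whitespace while keeping paragraph breaks, since the chunking step later relies on them. The helper names `normalize_text` and `clean_records` are hypothetical and are not used elsewhere in this notebook.

```python
import re


def normalize_text(text: str) -> str:
    """Collapse runs of spaces/tabs and limit blank lines to one paragraph break."""
    text = re.sub(r"[ \t]+", " ", text)      # repeated spaces or tabs -> single space
    text = re.sub(r"\n{3,}", "\n\n", text)   # three or more newlines -> one blank line
    return text.strip()


def clean_records(records):
    """Keep (title, text) pairs that are non-empty after normalization."""
    cleaned = []
    for title, text in records:
        title, text = normalize_text(title), normalize_text(text)
        if title and text:
            cleaned.append((title, text))
    return cleaned
```

With the dataset loaded, this could be applied as `clean_records(zip(titles, texts))` before building the documents.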

In [6]:
pages = [doc for doc in dataset_raw['train']['page_text']]
metadata = [title for title in dataset_raw['train']['page_title']]

# Ensure pages and metadata have the same length
assert len(pages) == len(metadata), "Mismatch between pages and metadata lengths!"

# Ensure no record contains a null field
assert dataset_raw['train'].filter(lambda x: any(v is None for v in x.values())).num_rows == 0, "Dataset contains null values!"



### **3.2. Convert to LangChain Documents**

- Combine the medical terms and their definitions into LangChain `Document` objects.

- Metadata should include the `page_title` for easy traceability.

In [7]:
from langchain.schema import Document



# Create a list of Document objects

input_documents = [
    Document(page_content=page, metadata={"title": title})
    for page, title in zip(pages, metadata)
]

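### **3.3. Chunk the `page_text` Field**

- Each `page_text` is a full article, which is too long to embed as a single unit, so it is split into chunks.

- Use recursive character splitting, which prefers paragraph and line boundaries, to preserve context for retrieval.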
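To make the idea of recursive chunking concrete, here is a simplified, dependency-free sketch. It is not LangChain's implementation: it ignores chunk overlap, and the names `recursive_split` and `merge_pieces` are hypothetical. It tries coarse separators first and only falls back to finer ones when a piece is still too long.

```python
def merge_pieces(pieces, chunk_size, sep):
    """Greedily join adjacent pieces with `sep` while staying within chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = []
            for part in text.split(sep):
                # Parts that are still too long are split on the finer separators
                pieces.extend(recursive_split(part, chunk_size, separators[i + 1:]))
            return merge_pieces(pieces, chunk_size, sep)
    # No separator left: fall back to a hard cut every chunk_size characters
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

The library splitter used in the next cell adds overlap between neighbouring chunks on top of this basic strategy.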

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter



def split_documents(input_documents, chunk_size=1000, chunk_overlap=100):
    """
    Splits a list of Document objects into smaller chunks while preserving metadata.

    Parameters:
        input_documents (list of Document): List of Document objects to be split.
        chunk_size (int): Maximum size of each chunk in characters.
        chunk_overlap (int): Number of overlapping characters between chunks.

    Returns:
        list of Document: A list of new Document objects with split content.
    """
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    output_documents = []

    for doc in input_documents:
        # Split the content of each document
        chunks = text_splitter.split_text(doc.page_content)

        # Create new Document objects for each chunk while preserving metadata
        for chunk in chunks:
            output_documents.append(Document(page_content=chunk, metadata=doc.metadata))

    return output_documents


In [9]:
chunked_docs = split_documents(input_documents=input_documents, chunk_size = 1000)

In [10]:
print(f"Number of chunks: {len(chunked_docs)}")

print(f"Average length (no. of words) of chunks: {sum(len(doc.page_content.split()) for doc in chunked_docs) / len(chunked_docs):.2f} words")

print(f"Average length (no. of characters) of chunks: {sum(len(doc.page_content) for doc in chunked_docs) / len(chunked_docs):.2f} characters")


Number of chunks: 101798
Average length (no. of words) of chunks: 92.20 words
Average length (no. of characters) of chunks: 616.28 characters


In [12]:
counter = 0
for doc in chunked_docs:
    if len(doc.page_content.split()) < 11:
        print(doc.page_content)
        print(f"Metadata:{doc.metadata}")
        print("*" * 100)
        counter += 1

# These very short chunks are mostly bare section headings, so they can be dropped
print(f"Number of documents with length less than 11 words: {counter}")

Signs and symptoms
Metadata:{'title': 'Paracetamol poisoning'}
****************************************************************************************************
Cause
Metadata:{'title': 'Paracetamol poisoning'}
****************************************************************************************************
Pathophysiology
Metadata:{'title': 'Paracetamol poisoning'}
****************************************************************************************************
Diagnosis
Metadata:{'title': 'Paracetamol poisoning'}
****************************************************************************************************
Prevention
Limitation of availability
Metadata:{'title': 'Paracetamol poisoning'}
****************************************************************************************************
Treatment
Gastric decontamination
Metadata:{'title': 'Paracetamol poisoning'}
****************************************************************************************************
Acetylc

**Note**: `12599` chunks contain fewer than 11 words. Most are bare section headings that carry no useful content, so we drop them.

In [13]:
# Drop chunks with fewer than 11 words

chunked_docs = [doc for doc in chunked_docs if len(doc.page_content.split()) >= 11]

In [14]:
print(f"Number of new chunks: {len(chunked_docs)}")

print(f"Average length (no. of words) of new chunks: {sum(len(doc.page_content.split()) for doc in chunked_docs) / len(chunked_docs):.2f} words")

print(f"Average length (no. of characters) of new chunks: {sum(len(doc.page_content) for doc in chunked_docs) / len(chunked_docs):.2f} characters")


Number of new chunks: 89199
Average length (no. of words) of new chunks: 104.91 words
Average length (no. of characters) of new chunks: 700.82 characters


In [15]:
print(f"Max Chunk size in words: {max(len(doc.page_content.split()) for doc in chunked_docs)}")

print(f"Min Chunk size in words: {min(len(doc.page_content.split()) for doc in chunked_docs)}")

Max Chunk size in words: 201
Min Chunk size in words: 11


**Note:** The embedding model truncates any input that exceeds its maximum sequence length, so chunks must fit within it. According to the [model documentation on Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), **all-MiniLM-L6-v2** truncates input longer than **256** word pieces by default. The largest chunk here has **201** words, but word pieces are tokens rather than words, and one word can map to several tokens, so a 201-word chunk may still exceed the limit. Treat the word count as a rough proxy and check token counts with the model's tokenizer to be sure.
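Since the model's limit is expressed in word pieces, a token-based check is more direct than a word count. The sketch below is a minimal version of such a check; `find_chunks_over_limit` is a hypothetical helper, and the whitespace counter in the demo is only a stand-in so the example runs offline (it undercounts real word pieces).

```python
def find_chunks_over_limit(texts, count_tokens, limit=256):
    """Return (index, token_count) for every text whose token count exceeds limit."""
    over = []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        if n > limit:
            over.append((i, n))
    return over


# Stand-in token counter: whitespace-separated words (an assumption for the demo only)
demo = find_chunks_over_limit(["short text", "word " * 300], lambda t: len(t.split()))
print(demo)  # [(1, 300)]
```

For a real check, `count_tokens` could be `lambda t: len(tokenizer(t)["input_ids"])`, with `tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")` and `texts` built from `chunked_docs`.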


In [16]:
chunked_docs[0].page_content

'Paracetamol poisoning, also known as acetaminophen poisoning, is caused by excessive use of the medication paracetamol (acetaminophen). Most people have few or non-specific symptoms in the first 24 hours following overdose. These include feeling tired, abdominal pain, or nausea. This is typically followed by a couple of days without any symptoms, after which yellowish skin, blood clotting problems, and confusion occurs as a result of liver failure. Additional complications may include kidney failure, pancreatitis, low blood sugar, and lactic acidosis. If death does not occur, people tend to recover fully over a couple of weeks. Without treatment, death from toxicity occurs 4 to 18 days later.Paracetamol poisoning can occur accidentally or as an attempt to die by suicide. Risk factors for toxicity include alcoholism, malnutrition, and the taking of certain other hepatotoxic medications. Liver damage results not from paracetamol itself, but from one of its metabolites,'

In [11]:
chunked_docs[1].page_content

'medications. Liver damage results not from paracetamol itself, but from one of its metabolites, N-acetyl-p-benzoquinone imine (NAPQI). NAPQI decreases the livers glutathione and directly damages cells in the liver. Diagnosis is based on the blood level of paracetamol at specific times after the medication was taken. These values are often plotted on the Rumack-Matthew nomogram to determine level of concern.Treatment may include activated charcoal if the person seeks medical help soon after the overdose. Attempting to force the person to vomit is not recommended. If there is a potential for toxicity, the antidote acetylcysteine is recommended. The medication is generally given for at least 24 hours. Psychiatric care may be required following recovery. A liver transplant may be required if damage to the liver becomes severe. The need for transplant is often based on low blood pH, high blood lactate, poor blood clotting, or significant hepatic encephalopathy. With early treatment liver'

---

### **3.4. Summarize Each `page_text` Using an LLM**

- In our trial-and-error experiments this appeared to improve retrieval accuracy, but the approach is still under test.

In [18]:
from transformers import pipeline, AutoTokenizer



model_name = "google/pegasus-xsum"

summarizer = pipeline("summarization", model=model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)





Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [19]:
# Function to summarize a single document
def summarize_document(doc, max_length=200, min_length=30):
    # Tokenize the input and truncate it to the model's maximum input length
    inputs = tokenizer(
        doc.page_content,
        max_length=summarizer.model.config.max_position_embeddings,
        truncation=True,
        return_tensors="pt",
    ).to(summarizer.model.device)  # keep the inputs on the same device as the model

    # Generate the summary with beam search
    summary_ids = summarizer.model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
        length_penalty=1.5,
        num_beams=4,
    )

    # Decode the summary
    summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return Document(page_content=summary_text, metadata=doc.metadata)


In [20]:
chunked_docs[2].page_content

'lactate, poor blood clotting, or significant hepatic encephalopathy. With early treatment liver failure is rare. Death occurs in about 0.1% of cases.Paracetamol poisoning was first described in the 1960s. Rates of poisoning vary significantly between regions of the world. In the United States more than 100,000 cases occur a year. In the United Kingdom it is the medication responsible for the greatest number of overdoses. Young children are most commonly affected. In the United States and the United Kingdom, paracetamol is the most common cause of acute liver failure.'

In [21]:
summarize_document(chunked_docs[2])

Document(metadata={'title': 'Paracetamol poisoning'}, page_content='paracetamol poisoning is the most common cause of acute liver failure in the United States and the United Kingdom, and is the most common cause of death in the United States and the United Kingdom.')

In [22]:
# Summarize all documents

summarized_documents = [summarize_document(doc) for doc in input_documents]



# Output: List of summarized documents

for doc in summarized_documents:

    print(f"Title: {doc.metadata['title']}")

    print(f"Summary: {doc.page_content}")

    print("\n")

KeyboardInterrupt: 

## **4. Build the Retrieval System**


### **4.1. Choose a Vector Store**

- Use Pinecone vector database.

- Decide on an embedding model; this notebook uses `sentence-transformers/all-MiniLM-L6-v2`.

In [12]:
# Extract the text from the documents

texts = [doc.page_content for doc in chunked_docs]

# Extract the metadata from the documents

metadata = [{"title": doc.metadata.get("title", "No title")} for doc in chunked_docs]


In [13]:
# # Writing the list to the file
# with open("documents.csv", "w") as file:
#     for doc in texts:
#         file.write(doc + "\n")

In [14]:
from langchain_huggingface import HuggingFaceEmbeddings



embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


### Find your Pinecone API key at [app.pinecone.io](https://app.pinecone.io)

In [20]:
from pinecone import Pinecone, ServerlessSpec
import os
import getpass



if not os.getenv("PINECONE_API_KEY"):

    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")


pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)


index_name = "langchain-medical"

### Note:
Run this cell only if the index does not exist yet and you need to create it.

In [21]:
# if index_name not in pc.list_indexes().names():

#     pc.create_index(

#         name=index_name,

#         dimension=384,

#         metric="cosine",

#         spec=ServerlessSpec(

#             cloud='aws',

#             region='us-east-1'

#         )

#     )

In [22]:
index = pc.Index(index_name)

### Note:
Run this cell only if you want to delete your index

In [10]:
# pc.delete_index(index_name)

### **4.2. Create and Store Embeddings**

- Use the `langchain_pinecone` package to store the documents.

- Store the embeddings in the chosen vector database

In [23]:
# Store the embeddings in the chosen vector database

from langchain_pinecone import PineconeVectorStore



vector_store = PineconeVectorStore(index=index, embedding=embedding_model)

### Note:
Run this cell only when populating the index for the first time. In this case, the index already exists in our Pinecone account.

In [24]:
# ids = [f"id-{i}" for i in range(len(chunked_docs))]



# vector_store.add_documents(documents=chunked_docs, ids=ids)
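With tens of thousands of chunks, uploading in slices can make progress easier to observe and resume. The sketch below assumes only that the store exposes `add_documents(documents=..., ids=...)`, as in the commented cell above. The helper name `add_in_batches` and the batch size of 100 are assumptions, and the vector store may already batch requests internally.

```python
def add_in_batches(store, documents, batch_size=100, id_prefix="id-"):
    """Upload documents slice by slice, using the same id scheme as above (id-0, id-1, ...)."""
    total = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        ids = [f"{id_prefix}{i}" for i in range(start, start + len(batch))]
        store.add_documents(documents=batch, ids=ids)
        total += len(batch)
    return total
```

A call such as `add_in_batches(vector_store, chunked_docs)` would then replace the single `add_documents` call.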

### **4.3. Implement Search**

- Test similarity search to retrieve relevant documents for a sample query.
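For intuition, similarity search ranks stored vectors by how close they are to the query vector. The brute-force sketch below uses cosine similarity on plain Python lists. It only illustrates the ranking idea: a vector database typically relies on approximate indexes rather than scanning every vector, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```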

In [25]:
results = vector_store.similarity_search(query="What is Amoebiasis",k=1)

for doc in results:

    print(f"* {doc.page_content} [{doc.metadata}]")

* Society and culture
An outbreak of amoebic dysentery occurs in Diana Gabaldons novel A Breath of Snow and Ashes.

References
External links

Amoebiasis - Centers for Disease Control and Prevention [{'title': 'Amoebiasis'}]


In [26]:
results = vector_store.similarity_search(query="What is paracetamol?",k=3)

for doc in results:

    print(f"* {doc.page_content} [{doc.metadata}]")
    print("*"*100)

* Paracetamol,  also known as acetaminophen, is a medication used to treat fever and mild to moderate pain. Common brand names include Tylenol and Panadol. [{'title': 'Paracetamol'}]
****************************************************************************************************
* Pain
Paracetamol is used for the relief of mild to moderate pain such as headache, muscle aches, minor arthritis pain, toothache as well as pain caused by cold, flu, sprains, and dysmenorrhea. It is recommended, in particular, for acute mild to moderate pain, since the evidence for the treatment of chronic pain is insufficient. [{'title': 'Paracetamol'}]
****************************************************************************************************
* It is on the World Health Organizations List of Essential Medicines. Paracetamol is available as a generic medication, with brand names including Tylenol and Panadol among others. In 2019, it was the 145th most commonly prescribed medication in the Unite

---


## **5. Integrate a Language Model**

### **5.1. Select a Language Model**

- Use a generative model like OpenAI GPT-4, Hugging Face T5, or FLAN-T5.

- Load the model using the appropriate API or library.

In [27]:
from langchain.chains import LLMChain, StuffDocumentsChain
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.llms import HuggingFaceHub


if not os.getenv("HUGGINGFACE_API_KEY"):

    os.environ["HUGGINGFACE_API_KEY"] = getpass.getpass("Enter your Hugging Face API token: ")

hf_api_token = os.environ.get("HUGGINGFACE_API_KEY")

hf_llm = HuggingFaceHub(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",  # Instruction-tuned model served from the Hugging Face Hub
    model_kwargs={
        "max_length": 512,
        "truncation": True,
        "do_sample": False  # Deterministic generation
    },
    huggingfacehub_api_token=hf_api_token

    )



In [28]:
from langchain.chains import RetrievalQA

# Connect to the existing Pinecone index
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})


# Set up RAG chain
qa_chain = RetrievalQA.from_chain_type(llm=hf_llm, retriever=retriever, return_source_documents=False)


### Note:
The generated text echoes the full prompt (instructions and retrieved context), so the model's actual answer starts after **Helpful Answer:**

In [31]:
query = "what is another name for paracetamol?"
result = qa_chain.invoke({"query": query})

# Output the response
print("Generated Answer:", result["result"])

if 'source_documents' in result.keys():
    print("\nSource Documents:")
    for doc in result["source_documents"]:
        print(f"Title: {doc.metadata['title']}")
        print(f"Content: {doc.page_content}")




Generated Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Paracetamol,  also known as acetaminophen, is a medication used to treat fever and mild to moderate pain. Common brand names include Tylenol and Panadol.

It is on the World Health Organizations List of Essential Medicines. Paracetamol is available as a generic medication, with brand names including Tylenol and Panadol among others. In 2019, it was the 145th most commonly prescribed medication in the United States, with more than 4 million prescriptions.

Society and culture
Naming
Paracetamol is the Australian Approved Name and British Approved Name as well as the international nonproprietary name used by the WHO and in many other countries; acetaminophen is the United States Adopted Name and Japanese Accepted Name and also the name generally used in Canada, Venezuela, Colombia, and Iran. Both paracetamol 
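Because the output repeats the prompt before the answer, a small post-processing step can isolate the answer. This is a minimal sketch; `extract_answer` is a hypothetical helper, and it assumes the marker text is exactly `Helpful Answer:`.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last occurrence of marker, or the whole text if it is absent."""
    _, found, answer = generated_text.rpartition(marker)
    return answer.strip() if found else generated_text.strip()
```

It could be applied as `extract_answer(result["result"])` before printing the response.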

### **5.2. Create the RAG Pipeline**

- Combine the retriever and the language model into a RAG pipeline.

- Pass the retrieved context from the retriever to the generator to produce answers.
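The two bullets above can be sketched without any framework. The instruction text mirrors the prompt visible in the generated output earlier; the exact layout of the `Question:` and `Helpful Answer:` lines is an assumption, and `build_prompt`, `answer_question`, `retrieve`, and `generate` are hypothetical names standing in for the retriever and the language model.

```python
def build_prompt(question, contexts):
    """Stuff the retrieved passages and the question into a single prompt string."""
    context_block = "\n\n".join(contexts)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


def answer_question(question, retrieve, generate, k=3):
    """Retrieve up to k passages for the question, then ask the generator to answer."""
    contexts = retrieve(question)[:k]
    return generate(build_prompt(question, contexts))
```

Here `retrieve` would wrap the vector store search and `generate` would wrap the LLM call.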

In [None]:
from pyngrok import ngrok, conf
import os
import subprocess
import streamlit
import getpass
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Ensure the Streamlit app file exists before launching it
if not os.path.exists("/kaggle/input/depi-deployment/app.py"):
    raise FileNotFoundError("app.py was not found at /kaggle/input/depi-deployment/app.py.")

# Retrieve ngrok auth token from .env or prompt the user
ngrok_auth_token = os.getenv("NGROK_AUTH_TOKEN")
if not ngrok_auth_token:
    print("Ngrok auth token not found in .env file.")
    print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
    ngrok_auth_token = getpass.getpass("Ngrok Auth Token: ")

# Configure ngrok with the auth token
conf.get_default().auth_token = ngrok_auth_token

# Define the Streamlit app port
port = 8501

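# Run the Streamlit app as a subprocess; stderr is merged into stdout so the
# loop below also shows errors and no unread pipe can fill up and block the app
process = subprocess.Popen(["streamlit", "run", "/kaggle/input/depi-deployment/app.py", "--server.port", str(port)], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)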

# Connect ngrok to the Streamlit app
public_url = ngrok.connect(port).public_url
print(f' * ngrok tunnel "{public_url}" -> "http://127.0.0.1:{port}"')

# Stream the subprocess output to check for issues
try:
    for line in process.stdout:
        print(line.decode("utf-8").strip())
except KeyboardInterrupt:
    print("Stopping Streamlit app...")
    process.terminate()
    ngrok.disconnect(public_url)


Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth


 ········


 * ngrok tunnel "https://45f1-34-31-102-179.ngrok-free.app" -> "http://127.0.0.1:8501"

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://172.19.2.2:8501
External URL: http://34.31.102.179:8501



In [None]:
!ps aux | grep ngrok

---

## **6. Evaluate the RAG System**


A question/answer generation model can produce question-answer pairs from the dataset. Those pairs can then serve as a test set for measuring retrieval and generation quality in a single pass.

In [None]:
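# Placeholder only: "question-generation" is not a built-in transformers pipeline task,
# and `large_text` is not defined in this notebook.
# from transformers import pipeline

# qa_pipeline = pipeline("question-generation")
# questions = qa_pipeline(large_text)
# print(questions)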
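Until generated question-answer pairs are available, the retrieval side can be measured on its own with a small hand-labeled set. The sketch below computes a hit rate at k: the share of questions for which the expected article title appears among the top k retrieved titles. The helper name `hit_rate_at_k` and the labeled pairs are hypothetical, and this covers retrieval only, not generation quality.

```python
def hit_rate_at_k(labeled_queries, retrieve_titles, k=3):
    """Fraction of (question, expected_title) pairs whose title is in the top-k results."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for question, expected_title in labeled_queries:
        if expected_title in retrieve_titles(question)[:k]:
            hits += 1
    return hits / len(labeled_queries)
```

With the vector store from section 4, `retrieve_titles` could be `lambda q: [d.metadata["title"] for d in vector_store.similarity_search(q, k=3)]`.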
