## Problem Statement

### Business Context

The healthcare industry is rapidly evolving, with professionals facing increasing challenges in managing vast volumes of medical data while delivering accurate and timely diagnoses. The need for quick access to comprehensive, reliable, and up-to-date medical knowledge is critical for improving patient outcomes and ensuring informed decision-making in a fast-paced environment.

Healthcare professionals often encounter information overload, struggling to sift through extensive research and data to create accurate diagnoses and treatment plans. This challenge is amplified by the need for efficiency, particularly in emergencies, where time-sensitive decisions are vital. Furthermore, access to trusted, current medical information from renowned manuals and research papers is essential for maintaining high standards of care.

To address these challenges, healthcare centers can focus on integrating systems that streamline access to medical knowledge, provide tools to support quick decision-making, and enhance efficiency. Leveraging centralized knowledge platforms and ensuring healthcare providers have continuous access to reliable resources can significantly improve patient care and operational effectiveness.

**Common Questions to Answer**

1. **Critical Care Protocols:** "What is the protocol for managing sepsis in a critical care unit?"

2. **General Surgery:** "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"

3. **Dermatology:** "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"

4. **Neurology:** "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"


### Objective

As an AI specialist, your task is to develop a RAG-based AI solution using renowned medical manuals to address healthcare challenges. The objective is to **understand** issues like information overload, **apply** AI techniques to streamline decision-making, **analyze** its impact on diagnostics and patient outcomes, **evaluate** its potential to standardize care practices, and **create** a functional prototype demonstrating its feasibility and effectiveness.

### Data Description

The **Merck Manuals** are medical references published by the American pharmaceutical company Merck & Co. They cover a wide range of medical topics, including disorders, tests, diagnoses, and drugs. The manuals have been published since 1899, when Merck & Co. was still a subsidiary of the German company Merck.

The manual is provided as a PDF with over 4,000 pages divided into 23 sections.

## Installing and Importing Necessary Libraries and Dependencies

This command installs the Python libraries required to build and evaluate a Retrieval-Augmented Generation (RAG) pipeline:

*   langchain and langchain_community provide tools to build LLM-based applications and manage components like document loaders, retrievers, and chains.
*   chromadb is a vector database used for storing and retrieving embeddings efficiently.
*   pymupdf enables reading and parsing PDF documents for text extraction.
*   tiktoken is used for tokenizing text compatible with OpenAI models, essential for chunking and embedding.
*   datasets from Hugging Face provides utilities for loading and managing datasets used during evaluation.
*   evaluate offers evaluation metrics such as accuracy and BLEU for assessing model performance.
*   langchain_openai integrates OpenAI models with LangChain for embedding generation and response synthesis.
*   ragas (Retrieval-Augmented Generation Assessment) is used to evaluate RAG systems using metrics like faithfulness, context precision, and answer relevancy.

The -q flag keeps the installation output quiet.

In [None]:
# Install required libraries
!pip install -q langchain_community==0.3.27 \
              langchain==0.3.27 \
              chromadb==1.0.15 \
              pymupdf==1.26.3 \
              tiktoken==0.9.0 \
              datasets==4.0.0 \
              evaluate==0.4.5 \
              langchain_openai==0.3.30 \
              ragas
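Because ragas is installed without a pinned version, it can be useful to record which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # Package is not installed in this environment
    return found


if __name__ == "__main__":
    # Package names taken from the pip command above
    names = ["langchain", "langchain-community", "langchain-openai",
             "chromadb", "pymupdf", "tiktoken", "datasets", "evaluate", "ragas"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Recording these values alongside the results makes a later rerun easier to compare.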

**Importing required libraries**

This section imports all the essential libraries required to build, process, and evaluate a Retrieval-Augmented Generation (RAG) pipeline.

The os module allows interaction with the operating system, such as setting environment variables, and json handles reading and writing JSON data.
PyMuPDFLoader from langchain.document_loaders is used to load and extract text content from PDF files, while OpenAI provides access to OpenAI’s models and services.

For data processing, tiktoken handles tokenization—counting and splitting text into manageable pieces for language models—and pandas supports loading and analyzing tabular data.

LangChain components such as RecursiveCharacterTextSplitter break text into overlapping chunks for embedding, OpenAIEmbeddings generates vector representations of text using OpenAI models, and Chroma stores and retrieves these embeddings for semantic search.

For RAG evaluation, evaluate from the ragas library runs automated performance assessments using metrics like Faithfulness, AnswerRelevancy, and LLMContextPrecisionWithoutReference, which measure how accurate, relevant, and contextually grounded the model’s responses are.
Finally, Dataset from the Hugging Face datasets library structures the inputs (questions, answers, and contexts) in a tabular format, and ChatOpenAI provides an interface for interacting with OpenAI’s chat-based LLMs within LangChain.

In [None]:
# Import core libraries
import os                                                                       # Interact with the operating system (e.g., set environment variables)
import json                                                                     # Read/write JSON data

# Import libraries for working with PDFs and OpenAI
from langchain.document_loaders import PyMuPDFLoader                            # Load and extract text from PDF files
from openai import OpenAI                                                       # Access OpenAI's models and services

# Import libraries for processing dataframes and text
import tiktoken                                                                 # Tokenizer used for counting and splitting text for models
import pandas as pd                                                             # Load, manipulate, and analyze tabular data

# Import LangChain components for data loading, chunking, embedding, and vector DBs
from langchain.text_splitter import RecursiveCharacterTextSplitter              # Break text into overlapping chunks for processing
from langchain.embeddings.openai import OpenAIEmbeddings                        # Create vector embeddings using OpenAI's models  # type: ignore
from langchain.vectorstores import Chroma                                       # Store and search vector embeddings using Chroma DB  # type: ignore

# Import components to run evaluation on RAG pipeline outputs
from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    LLMContextPrecisionWithoutReference,
)
from datasets import Dataset                                                    # Used to structure the input (questions, answers, contexts etc.) in tabular format
from langchain_openai import ChatOpenAI                                         # This is needed since LLM is used in metric computation

## Question Answering using LLM

Mount Google Drive so the notebook can read the configuration file and the PDF.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

#### Loading the API Credentials and Initializing the Client

This section loads API credentials from a configuration file and initializes the OpenAI client for further use.

First, the JSON configuration file is opened and read. The json.load() function parses the file contents and converts them into a Python dictionary named config. From this dictionary, the OpenAI API key (OPENAI_API_KEY) and the API base URL (OPENAI_API_BASE) are extracted.

Next, these credentials are stored as environment variables using the os.environ dictionary. This allows secure access to the API key and base URL throughout the notebook without hardcoding sensitive information in multiple places.

Finally, the OpenAI client is initialized with the extracted API key and base URL. This establishes a connection to the OpenAI API, enabling subsequent operations such as generating embeddings, chat completions, or performing RAG-related tasks through the client.

In [None]:
# Load the JSON configuration file and extract the credentials
file_name = '/content/drive/MyDrive/Colab_Notebooks/GenAI_course/Transformers_Text_Generation/Project_Medical_Assistant/config.json'  # Path to the configuration file
with open(file_name, 'r') as file:                                              # Open the config file in read mode
    config = json.load(file)                                                    # Parse the JSON content into a dictionary

OPENAI_API_KEY = config.get("OPENAI_API_KEY")                                   # Extract the API key from the config
OPENAI_API_BASE = config.get("OPENAI_API_BASE")                                 # Extract the API base URL from the config

# Store API credentials in environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY                                   # Set API key as environment variable
os.environ["OPENAI_BASE_URL"] = OPENAI_API_BASE                                 # Set API base URL as environment variable

# Initialize the OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY, base_url=OPENAI_API_BASE)               # Create an instance of the OpenAI client
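If a key is missing from the configuration file, config.get() returns None, and assigning None to os.environ raises a TypeError that does not name the missing key. The sketch below shows one way to fail earlier with a clearer message. The function name load_config and the REQUIRED_KEYS tuple are illustrative assumptions, not part of the original notebook.

```python
import json

# Keys the notebook expects in config.json
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_API_BASE")


def load_config(path, required_keys=REQUIRED_KEYS):
    """Load a JSON config file and raise early if a required key is missing or empty."""
    with open(path, "r") as file:
        config = json.load(file)
    missing = [key for key in required_keys if not config.get(key)]
    if missing:
        raise KeyError(f"Missing or empty config keys: {', '.join(missing)}")
    return config
```

With this helper, the credentials could be read as config = load_config(file_name) before they are placed in environment variables.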

This function, response(), is designed to generate text responses from an OpenAI language model based on a user-provided prompt.

It takes in several parameters:

1. user_prompt — the input text or query to which the model should respond.

2. max_tokens — the maximum number of tokens (words or subwords) to include in the generated output.

3. temperature — a parameter that controls randomness in the model’s responses. Lower values (e.g., 0.3) make the output more focused and deterministic.

4. top_p — a parameter for nucleus sampling, controlling the diversity of possible words the model can choose from. A value of 0.95 allows for some variation while maintaining coherence.

Inside the function, a chat completion request is created using the OpenAI client with the model gpt-4o-mini. The user’s input is passed as a message with the role "user". The API processes this request and returns a structured response containing one or more choices (one by default).

Finally, the function returns only the textual content of the first generated message — completion.choices[0].message.content — which represents the model’s reply to the user prompt.

In [None]:
# Define a function to get a response
def response(user_prompt, max_tokens=500, temperature=0.3, top_p=0.95):
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                                                     # Specify the model to use
        messages=[
            {"role": "user", "content": user_prompt}                            # User prompt is the input/query to respond to
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply
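To make the effect of temperature and top_p more concrete, the following toy sketch applies both to a made-up score list for three candidate tokens. It is a conceptual illustration only, not a description of how the OpenAI API is implemented internally; the function names and the toy scores are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # Subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return {index: p / mass for index, p in kept}   # Renormalize the surviving tokens


toy_logits = [2.0, 1.0, 0.1]                        # Made-up scores for three candidate tokens
print(top_p_filter(softmax_with_temperature(toy_logits, 1.0), 0.9))
print(top_p_filter(softmax_with_temperature(toy_logits, 0.3), 0.95))
```

With the lower temperature, most of the probability mass moves to the top-scoring token, so fewer candidates survive the top_p cut.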

### Question 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
question_1 = "What is the protocol for managing sepsis in a critical care unit?"
base_response_question1 = response(question_1)
print(base_response_question1)

### Question 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
question_2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
base_response_question2 = response(question_2)
print(base_response_question2)

### Question 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
question_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
base_response_question3 = response(question_3)
print(base_response_question3)

### Question 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
question_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
base_response_question4 = response(question_4)
print(base_response_question4)

This section creates and displays a pandas DataFrame to organize the questions and their corresponding model-generated responses.

The pd.DataFrame() constructor is used to build a structured table with two columns:

1. "questions" — contains the list of question variables (question_1, question_2, question_3, question_4).

2. "base_prompt_responses" — holds the responses generated by the model for each question (base_response_question1, base_response_question2, base_response_question3, base_response_question4).

By aligning each question with its respective response, the DataFrame provides a clear, tabular view of the model’s outputs for easy inspection and comparison.

Finally, result_df.head() displays the first few rows of the DataFrame, allowing a quick preview of the stored data.

In [None]:
# Create the DataFrame
result_df = pd.DataFrame({
    "questions": [question_1, question_2, question_3, question_4],
    "base_prompt_responses": [base_response_question1, base_response_question2, base_response_question3, base_response_question4]})

# Display the DataFrame
result_df.head()

## Question Answering using LLM with Prompt Engineering

In [None]:
system_prompt = """
You are an AI assistant designed to be a helpful and accurate information resource. Your primary function is to search the web to find the most relevant and up-to-date information to answer user questions.

User input will be a question or a request for information. The response should be within 500 words.

When crafting your response:

1. You must use a web search tool to find relevant information.

2. Your answer should be accurate, concise, and directly address the user's query.

3. You must cite your sources. For each key piece of information, provide the source title and URL where the information was found.

4. Synthesize the information from one or more reliable sources into a coherent answer.

5. If you cannot find a clear answer or the information is ambiguous, clearly state that.

6. Do not provide personal opinions, speculations, or information that is not supported by the search results.

Please adhere to the following response guidelines:

1. Provide clear, direct answers based on the information retrieved from the web.

2. Do not include information from outside the retrieved web search results.

3. If the user's query is vague, you may ask for clarification.

4. If a search yields no relevant information, clearly state that you were unable to find the answer.

Here is an example of how to structure your response:

Answer: [Provide the answer synthesized from the web search results.]

Sources:

[Information snippet 1] (Source: [Source Title], [URL])

[Information snippet 2] (Source: [Source Title], [URL])
"""

This function, query_openai(), is designed to send a structured prompt and query to an OpenAI language model and return the generated response.

It accepts two parameters:

1. prompt — a system-level instruction that defines the model’s behavior or context (for example, specifying tone, role, or task).

2. query — the user’s actual question or input that the model needs to answer.

Inside the function, both are combined into a messages list following the chat completion format, where the "system" message sets context and the "user" message contains the actual query.

The OpenAI client’s chat.completions.create() method is then called using the "gpt-4o-mini" model. Parameters like max_tokens, temperature, and top_p control the length, randomness, and diversity of the generated output.

Finally, the function returns only the text portion of the model’s reply, extracted from response.choices[0].message.content. Because the prompt and query are passed in as arguments, the function can be reused to compare different system prompts on the same question. Each call is a single-turn exchange: no conversation history is carried between calls.

In [None]:
# Define a function that accepts a system prompt and a query, and returns the response generated by the OpenAI model

def query_openai(prompt, query):
    """
    Queries the OpenAI model with a given prompt and query.

    Args:
        prompt (str): The system prompt that sets the model's behavior.
        query (str): The query to be answered by the model.

    Returns:
        str: The model's response.
    """
    messages = [
        {"role": "system", "content": prompt},                                  # System message sets the context
        {"role": "user", "content": query}                                      # User message carries the question
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",                                                    # Or another suitable OpenAI model
        messages=messages,
        max_tokens=1000,                                                        # Max number of tokens to generate; adjust as needed
        temperature=0.3,                                                        # Controls randomness in output
        top_p=0.95                                                              # Controls diversity via nucleus sampling
    )
    return response.choices[0].message.content

### Question 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
question_1 = "What is the protocol for managing sepsis in a critical care unit?"
response_with_prompt_eng_1=query_openai(system_prompt,question_1)
response_with_prompt_eng_1

### Question 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
question_2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
response_with_prompt_eng_2=query_openai(system_prompt,question_2)
response_with_prompt_eng_2

### Question 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
question_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
response_with_prompt_eng_3=query_openai(system_prompt,question_3)
response_with_prompt_eng_3

### Question 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
question_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
response_with_prompt_eng_4=query_openai(system_prompt,question_4)
response_with_prompt_eng_4

This section appends a new column to the existing DataFrame to store model responses generated using prompt engineering techniques.

A new column named responses_with_prompt_eng is created in result_df, containing responses (response_with_prompt_eng_1, response_with_prompt_eng_2, response_with_prompt_eng_3, response_with_prompt_eng_4) that correspond to each question already present in the DataFrame. This allows for a side-by-side comparison between base responses and improved responses generated after applying prompt engineering.

Finally, result_df.head() displays the first few rows of the updated DataFrame, providing a quick preview of all questions and both sets of responses for analysis.

In [None]:
# Add the results to a new column in the DataFrame
result_df['responses_with_prompt_eng'] = [response_with_prompt_eng_1, response_with_prompt_eng_2, response_with_prompt_eng_3, response_with_prompt_eng_4]

# Display the DataFrame
result_df.head()

## Data Preparation for RAG

### Loading the Data

This section loads a PDF document into memory for further text processing and analysis.

The variable document_path stores the file path of the medical diagnosis manual PDF located in Google Drive. The PyMuPDFLoader from LangChain is then initialized with this path — it serves as a document loader specifically designed to read and extract text content from PDF files.

By calling loader.load(), the PDF is parsed, and its textual content (along with metadata such as page information) is loaded into the document variable. This variable now holds the extracted text in a structured format, making it ready for subsequent steps such as chunking, embedding, or retrieval in a RAG pipeline.

In [None]:
document_path = "/content/drive/MyDrive/Colab_Notebooks/GenAI_course/Transformers_Text_Generation/Project_Medical_Assistant/medical_diagnosis_manual.pdf"
loader = PyMuPDFLoader(document_path)
document = loader.load()

### Data Overview

This code iterates through and displays the text content of the first ten pages of the loaded PDF document.

The for loop runs from 0 to 9 (a total of ten iterations), where each iteration represents one page of the document.
Inside the loop:

1. print(f"Page Number : {i+1}") displays the current page number, adjusted by +1 since Python indexing starts at 0.

2. print(document[i].page_content) prints the actual text extracted from that page using the page_content attribute of the document object.

The end="\n" argument is the default line ending for print(), so it does not change the output; each print call already ends with a line break.
This step is primarily used to verify that the PDF was loaded correctly and to inspect the extracted content before applying any text processing or chunking operations.

In [None]:
# Display first 10 pages
for i in range(10):
    print(f"Page Number : {i+1}",end="\n")
    print(document[i].page_content,end="\n")

### Data Chunking

This section initializes a text splitter to divide large text documents into smaller, manageable chunks that can be efficiently processed by language models.

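The RecursiveCharacterTextSplitter.from_tiktoken_encoder() method measures chunk length with OpenAI’s tiktoken encoder. The cl100k_base encoding is the one used by GPT-4 and by OpenAI’s embedding models, so chunk sizes are counted in the same tokens the embedding model sees. The gpt-4o-mini model used for generation relies on the newer o200k_base encoding, so its token counts for the same text can differ slightly.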

The chunk_size=256 parameter specifies that each chunk of text will contain up to 256 tokens (not characters). This helps maintain meaningful context within each chunk while preventing the model from exceeding its token limits during embedding or retrieval.

By splitting text recursively at logical boundaries (e.g., paragraphs, sentences, or words), this splitter preserves context better than simple character-based splitting, improving both embedding quality and downstream RAG performance.

In [None]:
# Initialize a text splitter that uses OpenAI's token encoder
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base', # Encoding used by popular LLMs
    chunk_size=256, # Each chunk will have up to 256 tokens
)

Split the document into chunks

In [None]:
document_chunks = loader.load_and_split(text_splitter)

Display the number of chunks

In [None]:
print(f"Created {len(document_chunks)} chunks.")
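To illustrate what a token-based chunk_size means, the following simplified sketch splits a token list into fixed windows with an optional overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter breaks text on separators and merges the pieces up to the size limit), and whitespace-separated words stand in for tiktoken tokens here. The function name chunk_tokens is hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of at most chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap               # How far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):       # The last window reached the end
            break
    return chunks


# Whitespace words used as stand-in tokens (an assumption for illustration)
words = "sepsis is a life threatening response to infection that needs rapid treatment".split()
for chunk in chunk_tokens(words, chunk_size=5, chunk_overlap=2):
    print(" ".join(chunk))
```

A larger overlap repeats more text between neighboring chunks, which helps keep sentences that straddle a boundary retrievable at the cost of more chunks to embed.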

### Embedding

Initialize the OpenAI embedding model using API credentials, create embeddings for the first two chunks, and display the vector dimension.

In [None]:
# Initialize the OpenAI Embeddings model with API credentials
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY,                                                     # Your OpenAI API key for authentication
    openai_api_base=OPENAI_API_BASE                                             # The OpenAI API base URL endpoint
)

# Generate embeddings (vector representations) for the first two document chunks
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)      # Embedding for chunk 0
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)      # Embedding for chunk 1

# Check and print the dimension (length) of the embedding vector
print("Dimension of embedding_1:", len(embedding_1))                            # 1536 for text-embedding-ada-002 and text-embedding-3-small; 3072 for text-embedding-3-large
print("Dimension of embedding_2:", len(embedding_2))
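Embedding vectors become useful once they are compared. As a minimal sketch, cosine similarity between two vectors can be computed with the standard library alone; the function below is an illustrative helper, not part of the original notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In the notebook, cosine_similarity(embedding_1, embedding_2) would give a rough sense of how semantically close the first two chunks are, with values nearer 1 indicating more similar text.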

### Vector Database

Convert chunks into vector representations using the embedding model.

This section defines and creates a directory to store the vector database that will hold text embeddings.

The variable out_dir is assigned the name 'vectorstore', which serves as the folder where the vector database files (generated by tools like Chroma) will be saved.

The if not os.path.exists(out_dir): condition checks whether a directory with that name already exists. If it does not, os.makedirs(out_dir) creates the directory.

This ensures that the storage location for embeddings is properly set up before saving or initializing the vector store—preventing file path errors during later stages of the RAG pipeline.

In [None]:
out_dir = 'vectorstore'    # Name of the directory that will hold the vector database

if not os.path.exists(out_dir):
    os.makedirs(out_dir)

This section builds a vector store—a database that stores vector embeddings of document chunks—and saves it for future retrieval operations.

The Chroma.from_documents() method initializes a Chroma vector database by converting the provided text chunks (document_chunks) into embeddings using the specified embedding model (embedding_model). These embeddings represent the semantic meaning of the text, enabling efficient similarity searches later on.

The parameter persist_directory=out_dir defines where the vector database files will be saved (in this case, the vectorstore folder). This ensures that the database can be reused across sessions without needing to regenerate embeddings each time.

In summary, this code step transforms processed text into searchable vector representations and stores them locally, forming the foundation for semantic retrieval in the RAG pipeline.

In [None]:
# Building the vector store and saving it to disk for future use
vectorstore = Chroma.from_documents(
    document_chunks,                                                            # Documents to index
    embedding_model,                                                            # Embedding model for converting text to vectors
    persist_directory=out_dir                                                   # Save vector DB files here
)

Load Vector Database

In [None]:
vectorstore = Chroma(
    persist_directory=out_dir,
    embedding_function=embedding_model
)

### Retriever

Set up a similarity-based retriever to fetch the top 5 most relevant document chunks.

In [None]:
retriever = vectorstore.as_retriever(
    search_type='similarity',                                                   # Use similarity search (based on vector distance)
    search_kwargs={'k': 5}                                                      # Retrieve top 5 most relevant documents
)
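Conceptually, a similarity retriever scores every stored vector against the query vector and returns the k best matches. The toy sketch below shows that idea with a small in-memory index. It uses cosine similarity purely for illustration; the metric Chroma actually applies depends on how the collection is configured, and real vector stores use approximate indexes instead of a full scan. The names top_k and toy_index are hypothetical.

```python
import heapq
import math


def top_k(query_vector, indexed_vectors, k=5):
    """Return the ids of the k stored vectors most similar to the query vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    scored = ((cosine(query_vector, vec), doc_id) for doc_id, vec in indexed_vectors.items())
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Made-up two-dimensional "embeddings" keyed by chunk id
toy_index = {"chunk_a": [1.0, 0.0], "chunk_b": [0.9, 0.1], "chunk_c": [0.0, 1.0]}
print(top_k([1.0, 0.0], toy_index, k=2))
```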

System and User Prompt Template

In [None]:
# Define the system prompt for the model
qna_system_message = """
You are an AI assistant designed to support users in efficiently reviewing provided documents. Your task is to provide accurate, concise, and relevant answers based on the context provided from the source material.

User input will include the necessary context for you to answer their questions. This context will begin with the token: ###Context

The context contains excerpts from one or more documents, along with associated metadata such as titles, authors, or specific sections relevant to the query.

When crafting your response:

1. Use only the provided context to answer the question.

2. If the answer is found in the context, respond with concise and direct answers.

3. Include the document title and, where applicable, a page or section reference as the source.

4. If the question is unrelated to the context or the context is empty, clearly respond with: "Sorry, this is out of my knowledge base."

Please adhere to the following response guidelines:

1. Provide clear, direct answers using only the given context.

2. Do not include any additional information outside of the context.

3. Avoid rephrasing or generalizing unless explicitly relevant to the question.

4. If no relevant answer exists in the context, respond with: "Sorry, this is out of my knowledge base."

5. If the context is not provided, your response should also be: "Sorry, this is out of my knowledge base."

Here is an example of how to structure your response:

Answer: [Answer based on context]

Source: [Source details with page or section]
"""

In [None]:
# Define the user message template
qna_user_message_template = """
###Context
Here are some excerpts from the medical manual and their sources that are relevant to the medical question mentioned below:
{context}

###Question
{question}
"""

### Response Function

This function, generate_rag_response(), generates an AI-powered answer using a Retrieval-Augmented Generation (RAG) pipeline.

It takes the user’s input (user_input) and retrieves the most relevant information from a document database before generating a final response with the OpenAI model.

Here’s a step-by-step explanation:

1. Retrieve relevant chunks:
The retriever searches the vector database for the top k most relevant document chunks (retriever.get_relevant_documents(query=user_input, k=k)), where k determines how many context pieces to fetch.

2. Prepare the context:
The text content from each retrieved chunk is extracted and combined into a single string (context_for_query), which acts as the contextual background for the model.

3. Format the user message:
A user message template (qna_user_message_template) is updated by replacing placeholders {context} and {question} with the actual context and user query. This ensures the model receives both the question and supporting information.

4. Generate the response:
The OpenAI chat completion API (client.chat.completions.create) is called with:
*   The system message (qna_system_message) defining the model’s role or behavior.
*   The user message, containing the combined context and query. The parameters max_tokens, temperature, and top_p control response length, creativity, and sampling diversity.

5. Return the model output:
The text response is extracted from the API output (response.choices[0].message.content.strip()).
If an error occurs during the API call, it catches the exception and returns a clear error message.

Overall, this function ties together retrieval and generation—fetching factual context from stored documents and using an LLM to produce a coherent, context-aware answer.

In [None]:
def generate_rag_response(user_input,k=5,max_tokens=500,temperature=0.3,top_p=0.95):
    # qna_system_message and qna_user_message_template are read from module scope
    # Retrieve relevant document chunks
    relevant_document_chunks = retriever.get_relevant_documents(query=user_input,k=k)
    context_list = [d.page_content for d in relevant_document_chunks]

    # Combine document chunks into a single context
    context_for_query = ". ".join(context_list)

    # Fill the template placeholders with the context and the question
    user_message = qna_user_message_template.replace('{context}', context_for_query)
    user_message = user_message.replace('{question}', user_input)

    # Generate the response
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": qna_system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        # Extract the generated text from the response
        response = response.choices[0].message.content.strip()
    except Exception as e:
        response = f'Sorry, I encountered the following error: \n {e}'

    return response
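
When debugging, it can help to inspect the assembled user message without calling the API. The sketch below mirrors steps 2 and 3 of the function above; `demo_template` and `build_user_message` are hypothetical names introduced here, and the chunk strings are placeholders rather than real retrieved text.

```python
# Stand-in for the notebook's qna_user_message_template (same placeholders)
demo_template = """
###Context
{context}

###Question
{question}
"""

def build_user_message(chunks, question, template=demo_template):
    """Join the chunk texts and substitute them, with the question, into the template."""
    context = ". ".join(chunks)
    # Same two-step str.replace substitution as generate_rag_response
    message = template.replace("{context}", context)
    return message.replace("{question}", question)

demo_message = build_user_message(
    ["first retrieved chunk", "second retrieved chunk"],
    "What is the protocol for managing sepsis in a critical care unit?",
)
print(demo_message)
```

Printing the message this way shows exactly what the model would receive, which makes issues such as an empty context or a leftover placeholder easy to spot.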

## Question Answering using RAG

### Question 1: What is the protocol for managing sepsis in a critical care unit?

In [None]:
question_1 = "What is the protocol for managing sepsis in a critical care unit?"
response_with_rag_1 = generate_rag_response(question_1)
response_with_rag_1

### Question 2: What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?

In [None]:
question_2 = "What are the common symptoms for appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?"
response_with_rag_2 = generate_rag_response(question_2)
response_with_rag_2

### Question 3: What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?

In [None]:
question_3 = "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?"
response_with_rag_3 = generate_rag_response(question_3)
response_with_rag_3

### Question 4: What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?

In [None]:
question_4 = "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?"
response_with_rag_4 = generate_rag_response(question_4)
response_with_rag_4

This section adds a new column to the existing DataFrame to store responses generated using the RAG (Retrieval-Augmented Generation) approach.

A new column named responses_with_RAG is created in result_df, where each entry (response_with_rag_1, response_with_rag_2, response_with_rag_3, response_with_rag_4) corresponds to the model’s RAG-based answer for the respective question.

This allows for direct comparison between:

1. Base responses (without retrieval),

2. Prompt-engineered responses, and

3. RAG-enhanced responses.

Finally, result_df.head() displays the first few rows of the updated DataFrame, providing a preview of all questions and their responses across different response-generation techniques.

In [None]:
# Add the results to a new column in the DataFrame
result_df['responses_with_RAG'] = [response_with_rag_1, response_with_rag_2, response_with_rag_3, response_with_rag_4]

# Display the DataFrame
result_df.head()

## Output Evaluation

Defining required System Prompts

In [None]:
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.

Return only the score at the end, in dictionary format (not JSON). The score should be in the range of 1 to 5.
Example {groundedness_score:4}
"""

In [None]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.

Return only the score at the end, in dictionary format (not JSON). The score should be in the range of 1 to 5.
Example {relevance_score:4}
"""

In [None]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

This function, generate_ground_relevance_response(), evaluates a model’s response based on two key metrics used in RAG evaluation — groundedness and relevance.

It takes as input the user’s question (user_input) and the model’s generated answer (response), and returns two separate evaluations:
one assessing how factually grounded the answer is in the retrieved documents, and another evaluating how relevant the answer is to the query.

Here’s a step-by-step explanation:

1. Retrieve supporting context:
The retriever searches for the top 5 most relevant document chunks related to the user’s query. Their text content is stored in context_for_query.

2. Construct evaluation prompts:
Two separate prompts are built using formatted text blocks:

*   groundedness_prompt: uses a predefined system instruction (groundedness_rater_system_message) asking the model to judge whether the provided answer is factually supported by the retrieved context.
*   relevance_prompt: uses another instruction (relevance_rater_system_message) asking the model to rate how relevant the answer is to the user’s question.

Both prompts include the context, question, and answer values dynamically inserted via the user_message_template.

3. Generate model evaluations:
The OpenAI model gpt-4o is called twice — once for groundedness and once for relevance.
Parameters like max_tokens, temperature, and top_p control the output length and determinism (temperature is set to 0 for consistent evaluations).

4. Return evaluation results:
The function returns two text outputs — the groundedness evaluation (response_1) and the relevance evaluation (response_2) — extracted from the model’s responses.

In summary, this function automates the self-evaluation step of a RAG pipeline by using an LLM to assess how well its answers align with retrieved evidence and how relevant they are to the user’s original query.

In [None]:
def generate_ground_relevance_response(user_input,response, max_tokens=500,temperature=0,top_p=0.95):
    # Retrieve the top 5 chunks and join them into one string, as generate_rag_response does
    retrieved_docs = retriever.get_relevant_documents(user_input, k=5)
    context_for_query = ". ".join(doc.page_content for doc in retrieved_docs)

    # Fill the shared evaluation template once; both prompts reuse it
    user_message = user_message_template.format(context=context_for_query, question=user_input, answer=response)

    # Combine the groundedness rater instructions and the user message into one prompt
    groundedness_prompt = f"""[INST]{groundedness_rater_system_message}\n
                user: {user_message}
                [/INST]"""

    # Combine the relevance rater instructions and the user message into one prompt
    relevance_prompt = f"""[INST]{relevance_rater_system_message}\n
                user: {user_message}
                [/INST]"""

    response_1 = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": groundedness_prompt}
                ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
            )

    response_2 = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": relevance_prompt}
                ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
            )

    return response_1.choices[0].message.content,response_2.choices[0].message.content
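
The function above sends the rater instructions and the evaluation input together in a single user turn wrapped in [INST] tags. Since the chat completion API also accepts a separate system turn (as generate_rag_response already uses), one possible alternative is to split the two. The sketch below only builds the `messages` list and makes no API call; `build_judge_messages` is a hypothetical helper, not part of the notebook.

```python
def build_judge_messages(rater_instructions, question, context_chunks, answer):
    """Return a chat `messages` list with the rater instructions as the system turn."""
    context = ". ".join(context_chunks)
    user_content = (
        f"###Question\n{question}\n\n"
        f"###Context\n{context}\n\n"
        f"###Answer\n{answer}\n"
    )
    return [
        {"role": "system", "content": rater_instructions},
        {"role": "user", "content": user_content},
    ]

# Placeholder inputs, for illustration only
demo_messages = build_judge_messages(
    "You are tasked with rating AI generated answers.",
    "example question",
    ["chunk one", "chunk two"],
    "example answer",
)
for message in demo_messages:
    print(message["role"], "->", message["content"][:40])
```

The resulting list could be passed as the `messages` argument of client.chat.completions.create in place of the single user message.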

#### **Evaluation 1: Base Prompt Response Evaluation**

In [None]:
# Question 1
llm_judge_base_ground_1,llm_judge_base_rel_1 = generate_ground_relevance_response(user_input=question_1, response=result_df.base_prompt_responses[0])
print(llm_judge_base_ground_1,end="\n\n")
print(llm_judge_base_rel_1)

In [None]:
# Question 2
llm_judge_base_ground_2,llm_judge_base_rel_2 = generate_ground_relevance_response(user_input=question_2, response=result_df.base_prompt_responses[1])
print(llm_judge_base_ground_2,end="\n\n")
print(llm_judge_base_rel_2)

In [None]:
# Question 3
llm_judge_base_ground_3,llm_judge_base_rel_3 = generate_ground_relevance_response(user_input=question_3, response=result_df.base_prompt_responses[2])
print(llm_judge_base_ground_3,end="\n\n")
print(llm_judge_base_rel_3)

In [None]:
# Question 4
llm_judge_base_ground_4,llm_judge_base_rel_4 = generate_ground_relevance_response(user_input=question_4, response=result_df.base_prompt_responses[3])
print(llm_judge_base_ground_4,end="\n\n")
print(llm_judge_base_rel_4)
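
The four cells above repeat the same call with a different index. As a sketch of a more compact alternative, the loop below collects both evaluations for every question. `judge_all` and `fake_judge` are hypothetical names; the stand-in judge returns fixed strings so that the example runs without API access.

```python
def judge_all(questions, responses, judge):
    """Call judge(user_input=..., response=...) for each pair and collect both outputs."""
    groundedness_outputs, relevance_outputs = [], []
    for question, response in zip(questions, responses):
        ground, rel = judge(user_input=question, response=response)
        groundedness_outputs.append(ground)
        relevance_outputs.append(rel)
    return groundedness_outputs, relevance_outputs

def fake_judge(user_input, response):
    """Stand-in for generate_ground_relevance_response; returns fixed outputs."""
    return "{groundedness_score:4}", "{relevance_score:5}"

demo_ground, demo_rel = judge_all(["q1", "q2"], ["a1", "a2"], fake_judge)
print(demo_ground)
print(demo_rel)
```

In the notebook, the same helper could be called as judge_all(result_df.questions, result_df.base_prompt_responses, generate_ground_relevance_response).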

This section adds two new columns to the existing DataFrame to store evaluation scores for the model’s base responses, focusing on groundedness and relevance.

Each column stores the evaluation scores extracted from the outputs of the LLM-based evaluation functions:

1. base_prompt_responses_groundedness_score — contains how factually supported (grounded) each base response is with respect to the retrieved context.

2. base_prompt_responses_relevance_score — contains how relevant each base response is to the original user question.

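The index [-2] selects the second-to-last character of each evaluation string (e.g., llm_judge_base_ground_1). Because the rater is instructed to end its output with a dictionary such as {groundedness_score:4}, that character is the single-digit score immediately before the closing brace. The extraction therefore works only when the output ends in exactly that format, with no trailing whitespace or punctuation.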

By adding these columns, the DataFrame now combines both responses and their quality scores, allowing an easy side-by-side comparison of each question, its answer, and the corresponding evaluation metrics.

Finally, result_df.head() displays the first few rows, providing a quick preview of the dataset with these newly added evaluation columns.

In [None]:
result_df['base_prompt_responses_groundedness_score'] = [llm_judge_base_ground_1[-2], llm_judge_base_ground_2[-2], llm_judge_base_ground_3[-2],llm_judge_base_ground_4[-2]]
result_df['base_prompt_responses_relevance_score'] = [llm_judge_base_rel_1[-2], llm_judge_base_rel_2[-2], llm_judge_base_rel_3[-2],llm_judge_base_rel_4[-2]]

# Display the DataFrame
result_df.head()
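
Indexing with [-2] breaks if the judge adds a trailing newline, a period, or quotes around the key. A more tolerant option is to search the output for the score pattern. The sketch below is one way to do that under the assumption that the judge writes a key ending in `_score` followed by a colon and a digit from 1 to 5; `extract_score` is a hypothetical helper, not part of the notebook.

```python
import re

def extract_score(judge_output):
    """Return the last 1-5 value written as '<name>_score: N' in the text, or None."""
    matches = re.findall(r"_score['\"]?\s*:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None

print(extract_score("Step 1 ... Step 4 ... {groundedness_score:4}"))
print(extract_score("{'relevance_score': 5}\n"))
print(extract_score("{groundedness_score:4}."))
print(extract_score("The answer is well supported."))
```

Returning None when no score is found makes extraction failures explicit instead of silently storing an unrelated character.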

#### **Evaluation 2: Prompt Engineering Response Evaluation**

In [None]:
# Question 1
ground1,rel1 = generate_ground_relevance_response(user_input=result_df.questions[0], response=result_df.responses_with_prompt_eng[0], max_tokens=516)
print(ground1,end="\n\n")
print(rel1)

In [None]:
# Question 2
ground2,rel2 = generate_ground_relevance_response(user_input=result_df.questions[1], response=result_df.responses_with_prompt_eng[1], max_tokens=516)
print(ground2,end="\n\n")
print(rel2)

In [None]:
# Question 3
ground3,rel3 = generate_ground_relevance_response(user_input=result_df.questions[2], response=result_df.responses_with_prompt_eng[2], max_tokens=516)
print(ground3,end="\n\n")
print(rel3)

In [None]:
# Question 4
ground4,rel4 = generate_ground_relevance_response(user_input=result_df.questions[3], response=result_df.responses_with_prompt_eng[3], max_tokens=516)
print(ground4,end="\n\n")
print(rel4)

This section appends two additional columns to the DataFrame to record evaluation scores for responses generated using prompt engineering techniques.

The columns capture how effectively prompt engineering improved the quality of model responses:

1. prompt_engineering_responses_groundedness_score — represents how factually accurate or well-grounded each prompt-engineered response is based on retrieved context.

2. prompt_engineering_responses_relevance_score — represents how relevant and contextually appropriate each response is to the corresponding question.

Expressions such as ground1[-2] and rel1[-2] take the second-to-last character of each evaluation string, which is the score digit when the output ends with a dictionary such as {groundedness_score:4}.

By adding these two columns, the DataFrame now contains comparative evaluation metrics for both base responses and prompt-engineered responses, making it easier to analyze how prompt engineering impacts factual grounding and relevance.

Finally, result_df.head() displays the first few rows of the updated DataFrame, showing the questions, responses, and their evaluation scores.

In [None]:
result_df['prompt_engineering_responses_groundedness_score'] = [ground1[-2], ground2[-2], ground3[-2],ground4[-2]]
result_df['prompt_engineering_responses_relevance_score'] = [rel1[-2], rel2[-2], rel3[-2],rel4[-2]]

# Display the DataFrame
result_df.head()

#### **Evaluation 3: RAG Response Evaluation**

In [None]:
# Question 1
llm_judge_rag_ground_1,llm_judge_rag_rel_1 = generate_ground_relevance_response(user_input=result_df.questions[0], response=result_df.responses_with_RAG[0], max_tokens=500)
print(llm_judge_rag_ground_1,end="\n\n")
print(llm_judge_rag_rel_1)

In [None]:
# Question 2
llm_judge_rag_ground_2,llm_judge_rag_rel_2 = generate_ground_relevance_response(user_input=result_df.questions[1], response=result_df.responses_with_RAG[1], max_tokens=500)
print(llm_judge_rag_ground_2,end="\n\n")
print(llm_judge_rag_rel_2)

In [None]:
# Question 3
llm_judge_rag_ground_3,llm_judge_rag_rel_3 = generate_ground_relevance_response(user_input=result_df.questions[2], response=result_df.responses_with_RAG[2], max_tokens=500)
print(llm_judge_rag_ground_3,end="\n\n")
print(llm_judge_rag_rel_3)

In [None]:
# Question 4
llm_judge_rag_ground_4,llm_judge_rag_rel_4 = generate_ground_relevance_response(user_input=result_df.questions[3], response=result_df.responses_with_RAG[3], max_tokens=500)
print(llm_judge_rag_ground_4,end="\n\n")
print(llm_judge_rag_rel_4)

This section adds two new columns to the DataFrame to store evaluation scores for responses generated using the RAG (Retrieval-Augmented Generation) approach.

These columns capture how well the RAG-based responses perform in terms of factual correctness and contextual alignment:

1. RAG_responses_groundedness_score — measures how factually accurate and evidence-backed each RAG-generated response is when compared with the retrieved document context.

2. RAG_responses_relevance_score — measures how relevant and contextually appropriate each RAG-generated response is to the original user question.

Each score (e.g., llm_judge_rag_ground_1[-2]) is read from the evaluation string returned by the LLM judge: [-2] is the second-to-last character, which is the score digit when the output ends with a dictionary such as {groundedness_score:4}.

By adding these two columns, the DataFrame now includes a complete set of evaluation metrics — for base responses, prompt-engineered responses, and RAG responses — enabling detailed performance comparison across all response-generation methods.

Finally, displaying result_df shows the fully updated table with questions, all response types, and their associated groundedness and relevance scores.

In [None]:
result_df['RAG_responses_groundedness_score'] = [llm_judge_rag_ground_1[-2], llm_judge_rag_ground_2[-2], llm_judge_rag_ground_3[-2],llm_judge_rag_ground_4[-2]]
result_df['RAG_responses_relevance_score'] = [llm_judge_rag_rel_1[-2], llm_judge_rag_rel_2[-2], llm_judge_rag_rel_3[-2],llm_judge_rag_rel_4[-2]]
result_df

This section standardizes and analyzes the evaluation metrics stored in the DataFrame by cleaning column names, ensuring numeric data types, and calculating average scores for each response generation approach.

Here’s what happens step by step:

1. Clean column names:
result_df.columns = result_df.columns.str.strip() removes any leading or trailing spaces from the column names to prevent reference errors caused by inconsistent naming.

2. Select relevant columns:
The list cols defines the six numeric evaluation columns corresponding to groundedness and relevance scores for the three approaches — Base Prompt, Prompt Engineering, and RAG.

3. Convert columns to numeric:
result_df[cols].apply(pd.to_numeric, errors='coerce') ensures that all selected columns are treated as numeric values. Any non-numeric entries (e.g., text or None) are safely converted to NaN using errors='coerce'.

4. Compute averages:
The mean(numeric_only=True) function calculates the average groundedness and relevance scores for each method:

*   Base Prompt Evaluation: measures the model’s raw performance without enhancements.
*   Prompt Engineering Evaluation: reflects how tailored prompts improve factual grounding and relevance.
*   RAG Response Evaluation: evaluates how integrating retrieval-based context impacts accuracy and contextual alignment.

5. Print summary results:
The printed output provides a concise comparison of the average groundedness and relevance scores across the three approaches, highlighting which method delivers the best overall performance.

In [None]:
result_df.columns = result_df.columns.str.strip()
cols = [
    'base_prompt_responses_groundedness_score',
    'base_prompt_responses_relevance_score',
    'prompt_engineering_responses_groundedness_score',
    'prompt_engineering_responses_relevance_score',
    'RAG_responses_groundedness_score',
    'RAG_responses_relevance_score'
]
result_df[cols] = result_df[cols].apply(pd.to_numeric, errors='coerce')

print("Average scores for Base Prompt Evaluation:")
print(result_df[['base_prompt_responses_groundedness_score', 'base_prompt_responses_relevance_score']].mean(numeric_only=True))
print("\n")
print("Average scores for Prompt Engineering Evaluation:")
print(result_df[['prompt_engineering_responses_groundedness_score', 'prompt_engineering_responses_relevance_score']].mean(numeric_only=True))
print("\n")
print("Average scores for RAG Response Evaluation:")
print(result_df[['RAG_responses_groundedness_score', 'RAG_responses_relevance_score']].mean(numeric_only=True))
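
One consequence of errors='coerce' is worth seeing in isolation: a score that failed to extract becomes NaN and is then left out of the average rather than raising an error. The pure-Python sketch below is a simplified stand-in for that behavior, not pandas itself; the helper names and the sample values are made up for illustration.

```python
import math

def coerce_to_number(value):
    """Return float(value), or NaN when the value cannot be parsed as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan

def mean_ignoring_nan(values):
    """Average the non-NaN entries; return NaN when none remain."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan

# "}" stands for a failed [-2] extraction (made-up sample values)
raw_scores = ["4", "5", "}", "3"]
numeric_scores = [coerce_to_number(s) for s in raw_scores]
average = mean_ignoring_nan(numeric_scores)
print(numeric_scores)
print(average)
```

Here the average is taken over three values instead of four, so it is worth checking for NaN entries before comparing averages across methods.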

## Actionable Insights and Business Recommendations

1. **Actionable Insights**

* RAG Outperforms the Other Methods in This Evaluation:
Integrating retrieval mechanisms (vector database + embeddings) provides both high factual grounding and strong contextual relevance. On the four test questions scored here, this indicates that augmenting the model with real-world knowledge sources is the most reliable approach for domain-specific or information-sensitive use cases (e.g., healthcare, finance, customer support).

* Prompt Engineering Alone Is Insufficient:
Despite structured instructions, prompt engineering without external knowledge leads to lower factual accuracy. It highlights the model’s limitation in reasoning accurately without grounding its responses in verified information.

* Base Prompt Offers Good Relevance but Moderate Accuracy:
The model’s baseline performance suggests it understands the context but occasionally “hallucinates” details—typical of LLMs when not connected to a factual retrieval layer.

2. **Business Recommendations**

- Adopt RAG-Based Systems for Production:
    - Implement RAG pipelines for all use cases requiring factual reliability (e.g., medical advice, policy guidance, financial documentation).

    - Store relevant domain data in a well-structured vector database (e.g., Chroma, Pinecone) to ensure accurate retrieval.

- Use Prompt Engineering as a Complementary Strategy:

    - Continue refining prompts to improve tone, structure, and clarity—but always pair them with retrieval for factual consistency.

    - Use prompt engineering for conversational quality and contextual framing, while RAG ensures correctness.

- Establish Continuous Evaluation Frameworks:

    - Regularly monitor groundedness and relevance scores to detect performance drift.

    - Automate this evaluation pipeline (using libraries like ragas) to ensure quality consistency as data or models evolve.

- Leverage Insights for Business Scaling:

    - Deploy RAG-enhanced models in customer-facing products to improve trust and reduce misinformation.

    - Extend the same architecture to other internal knowledge domains (e.g., HR FAQs, product documentation) to increase productivity.

3. **Summary**: RAG-based AI systems deliver the most accurate, reliable, and contextually aligned results. For business adoption, combining retrieval mechanisms with well-engineered prompts offers the best balance between natural interaction and factual precision — driving both customer trust and operational efficiency.