## Problem Statement
# Digital Transformation in the Healthcare Industry

The healthcare industry is currently experiencing a major digital transformation fueled by the adoption of emerging technologies such as:

- **Internet of Things (IoT)**
- **Artificial Intelligence (AI)**
- **Machine Learning (ML)**
- **Big Data Analytics**

These innovations offer tremendous potential to improve patient care, streamline healthcare operations, and enhance system sustainability. However, the transition to digital healthcare is fraught with significant challenges, including:

- **Shortage of Skilled Personnel and Infrastructure**: There is a persistent gap between healthcare demand and the availability of professionals and supporting infrastructure.
- **Data Fragmentation and Synchronization Issues**: Inconsistent, unsynchronized data across systems hampers effective decision-making.
- **Cybersecurity and Data Privacy Risks**: The proliferation of connected medical devices and digital records has increased the attack surface for cyber threats.
- **Lack of Interoperability**: Diverse platforms and systems often fail to communicate efficiently, obstructing integrated care.
- **Regulatory and Organizational Barriers**: Institutional resistance and unclear policy frameworks hinder the widespread adoption of digital technologies.

The **COVID-19 pandemic** has further amplified the urgency of digital transformation by exposing the limitations of traditional healthcare systems. Although some progress has been made, realizing the full potential of digital healthcare remains a complex endeavor requiring coordinated efforts across technology, policy, and healthcare practice.

---

### DeepEval Usage Disclaimer

Before using **DeepEval**, please be aware of the following:

- **Telemetry**: Basic usage data (e.g., number of tests, metrics used) may be collected.  
   _No personal data is shared._

- **To disable telemetry**:  
  Export the following in your shell environment:  
  `export DEEPEVAL_TELEMETRY_OPT_OUT="YES"`  
  or set it from Python:  
  `os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"`

- **Cache Files**:  
  DeepEval creates local cache files like `.deep-eval-cache` in the working directory.  
   



In [142]:
import os
import openai
from dotenv import load_dotenv
import warnings
from langchain.vectorstores import Chroma
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader
from langchain.schema import Document
from langchain.chains import RetrievalQA
from langchain_community.retrievers import BM25Retriever
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)


warnings.filterwarnings("ignore")


In [143]:
os.environ["DEEPEVAL_TELEMETRY_OPT_OUT"] = "YES"

### Create Model Client and Set Up Authentication



In [205]:
load_dotenv('UAIS_NEW.env')

AZURE_OPENAI_ENDPOINT = os.environ["MODEL_ENDPOINT"]
OPENAI_API_VERSION = os.environ["API_VERSION"]
EMBEDDINGS_DEPLOYMENT_NAME = os.environ["EMBEDDINGS_MODEL_NAME"]
CHAT_DEPLOYMENT_NAME = os.environ["CHAT_MODEL_NAME"]
subscription_key = os.environ["AZURE_OPENAI_API_KEY"]


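chat_client = openai.AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=OPENAI_API_VERSION,
    azure_deployment=CHAT_DEPLOYMENT_NAME,
    # Pass the key explicitly instead of relying on the AZURE_OPENAI_API_KEY lookup
    api_key=subscription_key,
)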

We create the `embeddings` client with `AzureOpenAIEmbeddings` and the `chat_model` with `AzureChatOpenAI`; both connect to the Azure OpenAI deployments configured above.

In [206]:
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment=EMBEDDINGS_DEPLOYMENT_NAME,
    openai_api_version=OPENAI_API_VERSION,
    model=EMBEDDINGS_DEPLOYMENT_NAME,
    api_key=subscription_key)

chat_model = AzureChatOpenAI(
    openai_api_version=OPENAI_API_VERSION,
    azure_deployment=CHAT_DEPLOYMENT_NAME,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    openai_api_key=subscription_key
)

### Purpose of Custom Wrapper for AzureChatOpenAI

The `DeepEval` library does not natively support the `AzureChatOpenAI` class from the **LangChain** library.  
To enable compatibility, we created a custom wrapper class called `AzureChatModelWrapper` that conforms to the `DeepEvalBaseLLM` interface expected by DeepEval.

This custom wrapper:

- Passes the **Azure model instance** (`AzureChatOpenAI`) to DeepEval in a compatible format
- Implements required methods like `generate`, `a_generate`, and `get_model_name`
- Allows DeepEval to **invoke and evaluate** responses using the Azure-hosted GPT model

By doing this, we ensure **seamless integration** between **Azure OpenAI services** and the **DeepEval evaluation framework**, enabling reliable testing and metric computation.


In [179]:
# Wrap AzureChatOpenAI in a compatible wrapper
class AzureChatModelWrapper(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        return self.model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        return (await self.model.ainvoke(prompt)).content

    def get_model_name(self):
        return "azure-gpt4o-mini"

### Wrapping AzureChatOpenAI for DeepEval


In [147]:
# Wrap it for DeepEval
wrapped_model = AzureChatModelWrapper(chat_model)

In [180]:
# Define a Function to Load and Extract Text from PDF
def load_pdf_with_langchain(pdf_path):
 
    # Use LangChain's built-in loader
    loader = PyMuPDFLoader(pdf_path)

    # Load the PDF into LangChain's Document format
    documents = loader.load()

    # PyMuPDFLoader returns one Document per PDF page; chunking happens later
    print(f"Successfully loaded {len(documents)} pages from the PDF.")
    return documents

In [181]:
# Define a function to chunk documents using RecursiveCharacterTextSplitter.
def chunk_documents(documents, chunk_size=600, chunk_overlap=100):
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    return splitter.split_documents(documents)
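
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character windows with overlap. It is a simplification and not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` also prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will generally differ. The helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each window shares its last `chunk_overlap` characters with the start of the next one, which is what lets a sentence cut at a boundary still appear whole in at least one chunk.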

### Chroma Telemetry and Tiktoken Cache Configuration
 
> The cell below sets `ANONYMIZED_TELEMETRY` to `"False"`, which opts out of Chroma's anonymized usage telemetry.  
> A custom Tiktoken cache directory can be configured the same way through the `TIKTOKEN_CACHE_DIR` environment variable, so that downloaded encoding files are reused across runs instead of being fetched again.

In [150]:
os.environ["ANONYMIZED_TELEMETRY"]="False"

### Function: `store_embeddings`

This function manages the storage and reuse of document embeddings using **Chroma**.

#### Purpose:
- It checks whether a vector store already exists in the specified directory.
- If it does, it **loads and reuses** the existing vector store.
- If it doesn't, it **creates a new vector store** from the provided document and **persists** it.

#### Why This Matters:
Even when working with a **single PDF**, maintaining an efficient and reusable embedding workflow is key to ensuring smooth development and repeatable evaluations.

By storing embeddings smartly, we:
- **Speed up repeated testing and iteration**
- **Maintain alignment between document chunks and embeddings**
- **Ensure consistency across different evaluation phases**

This strategy provides flexibility and reliability—especially useful when you're refining prompt logic, retrieval thresholds, or evaluation settings over time.


In [182]:
# Define a function to create and store embeddings in a local ChromaDB vector store.

def store_embeddings(persist_directory,docs=None):
    
    # Check if vector store already exists
    if os.path.exists(persist_directory) and os.path.isdir(persist_directory):
        print(f"Loading existing vector store from {persist_directory}")
        # Load existing vector store
        vector_store = Chroma(
            persist_directory=persist_directory,
            embedding_function=embeddings
        )
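    else:
        # Creating a new vector store requires documents to embed
        if not docs:
            raise ValueError("docs must be provided when no vector store exists yet")
        print(f"Creating new vector store in {persist_directory}")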
        vector_store = Chroma.from_documents(
            docs,
            embedding=embeddings,
            persist_directory=persist_directory
        )
        vector_store.persist()
    
    return vector_store

### Function: `get_processed_document_name`

This function retrieves the names (paths) of all documents already embedded and stored in a **Chroma vector store**.

#### Purpose:
- Loads the vector store from the specified directory.
- Extracts and inspects metadata from stored documents.
- Gathers a set of unique source file paths that have already been processed.

By doing this, we ensure efficient ingestion by recognizing previously processed PDFs and maintaining a clean, duplication-free embedding workflow.

This approach becomes especially valuable as the system evolves—whether handling multiple files or incremental updates—by preserving consistency and avoiding unnecessary reprocessing.


In [183]:
def get_processed_document_name(persist_directory):
    # Load the vector store to read the metadata of the stored documents
    vectorstore = Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings
    )
        
    # Extract metadata from all documents in the store
    all_metadatas = vectorstore.get()["metadatas"]
    
    # Create a set of source file paths from metadata
    processed_sources = set()
    for metadata in all_metadatas:
        if metadata and "source" in metadata:
            processed_sources.add(metadata["source"])
    
    return processed_sources
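
The metadata scan can be tried without a vector store. The sketch below assumes, based on the function above, that `get()` returns a dictionary whose `"metadatas"` entry holds one metadata dictionary per stored chunk. The `fake_store_contents` value is a hand-written stand-in, and its file names are illustrative.

```python
# Hand-written stand-in for the dictionary returned by vectorstore.get();
# the file names and page numbers are illustrative, not real data.
fake_store_contents = {
    "metadatas": [
        {"source": "testing.pdf", "page": 0},
        {"source": "testing.pdf", "page": 1},
        {"source": "other.pdf", "page": 0},
        None,          # entries without metadata are skipped
        {"page": 3},   # entries without a "source" key are skipped
    ]
}

# Same filtering logic as the loop above, written as a set comprehension.
processed_sources = {
    metadata["source"]
    for metadata in fake_store_contents["metadatas"]
    if metadata and "source" in metadata
}
print(sorted(processed_sources))
```

Because many chunks share one source file, collecting into a set collapses them to one entry per PDF.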

### Function: `filter_new_pdf`

This function identifies and separates **new PDF files** from those that have already been embedded and stored in the **Chroma vector store**.

#### Purpose:
- Retrieves the list of previously processed document paths using their metadata.
- Compares incoming PDF paths against the stored records.
- Returns only the PDFs that have not yet been embedded.
- Provides a clear message indicating whether new files are detected.

This step ensures the pipeline stays lean and avoids redundant work—particularly beneficial as your document set grows or evolves over time. By automatically distinguishing new content, the system stays responsive and efficient, even as the dataset scales.


In [184]:
def filter_new_pdf(pdf_path, persist_directory):
    """Filter out PDFs that have already been processed."""
    processed_sources = get_processed_document_name(persist_directory)

    # Ensure pdf_path is a list
    if isinstance(pdf_path, str):
        pdf_path = [pdf_path]
    
    # Find PDFs that haven't been processed yet
    new_pdf = [path for path in pdf_path if path not in processed_sources]
    
    if new_pdf:
        print(f"Found {len(new_pdf)} new PDFs to process: {new_pdf}")
    else:
        print("No new PDFs to process.")
        
    return new_pdf
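
One caveat of the comparison above is that it matches path strings exactly, so `./testing.pdf` and `testing.pdf` are treated as different files. A minimal sketch of a more tolerant comparison follows, assuming you are free to compare absolute paths on both sides. `normalize_path` and `filter_new_paths` are hypothetical helpers, not part of this notebook.

```python
import os


def normalize_path(path: str) -> str:
    """Return an absolute, normalized form of path for comparison purposes."""
    return os.path.abspath(path)  # abspath also collapses "./" and "../" segments


def filter_new_paths(pdf_paths, processed_sources):
    """Keep only the paths whose normalized form has not been processed yet."""
    seen = {normalize_path(source) for source in processed_sources}
    return [path for path in pdf_paths if normalize_path(path) not in seen]


already_processed = {"testing.pdf"}
new_paths = filter_new_paths(["./testing.pdf", "report.pdf"], already_processed)
print(new_paths)
```

This only helps when both paths are resolved from the same working directory; storing a normalized path in the metadata at ingestion time would be the more robust choice.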

In [185]:
# Define function to retrieve top_k semantically relevant documents from ChromaDB using vector search.
def retrieve_chunks(query,vectorstore, top_k=5):
    results = vectorstore.similarity_search(query, k=top_k*2)  # fetch more to be safe
    unique_results = []
    seen_contents = set()

    for doc in results:
        if doc.page_content not in seen_contents:
            unique_results.append(doc)
            seen_contents.add(doc.page_content)
        if len(unique_results) >= top_k:
            break

    return unique_results
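
The over-fetch and de-duplication step can be exercised in isolation. The sketch below repeats the same loop on plain stand-in objects; `FakeDoc` is a hypothetical replacement for a LangChain `Document` that models only `page_content`. Duplicates like these could appear, for example, if the same PDF were embedded into the store more than once.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document; only page_content is modeled."""
    page_content: str


def dedupe_top_k(results, top_k=5):
    """Keep the first top_k results with distinct page_content, preserving rank order."""
    unique_results, seen_contents = [], set()
    for doc in results:
        if doc.page_content not in seen_contents:
            unique_results.append(doc)
            seen_contents.add(doc.page_content)
        if len(unique_results) >= top_k:
            break
    return unique_results


ranked = [FakeDoc("a"), FakeDoc("a"), FakeDoc("b"), FakeDoc("c"), FakeDoc("b")]
top_two = dedupe_top_k(ranked, top_k=2)
print([doc.page_content for doc in top_two])
```

Fetching `top_k * 2` candidates first gives the loop room to discard duplicates and still return `top_k` distinct chunks, although fewer are returned when more than half of the candidates are duplicates.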


### Guiding the Model to Use Only Retrieved Context During Evaluation

Earlier, our `generate_answer` function used a **general prompt** that allowed the model to answer using the retrieved context and its own knowledge.

But now, in the **evaluation phase**, our focus shifts to **strict control**: we want to measure how well the model performs when it's **only allowed to use the retrieved context**.

To support this, we revise the prompt to include:

> **"Generate an answer strictly based on the above context; do not use your own knowledge. If the query is not covered in the context, respond with: 'This query is not as per the PDF.'"**

### Why This Matters

- It isolates the model’s behavior based on context alone.  
- It prevents answers from being influenced by pre-trained knowledge.  
- It enables **fair and measurable evaluation** using DeepEval metrics such as **faithfulness**, **hallucination**, and **contextual precision**.

By refining the prompt this way, we ensure the model is tested under realistic and controlled retrieval-based generation conditions.


In [186]:
def generate_answer(query, top_chunks, model_name=CHAT_DEPLOYMENT_NAME):
    context = "\n\n".join([doc.page_content for doc in top_chunks])
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer (generate an answer strictly based on the above context; do not use your own knowledge. "
        f"If the query is not covered in the context, respond with: 'This query is not as per the PDF.'):"
    )
    
    response = chat_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        model=model_name
    )
    
    gpt_output = response.choices[0].message.content
    return gpt_output
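
To inspect exactly what the model receives without calling the API, the prompt assembly can be pulled into a pure function. This is a sketch under that assumption: `build_grounded_prompt` is a hypothetical helper that takes plain strings instead of `Document` objects, and the sample question and chunks are illustrative.

```python
def build_grounded_prompt(query: str, chunk_texts: list[str]) -> str:
    """Assemble the context-only prompt from plain chunk strings."""
    context = "\n\n".join(chunk_texts)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer (generate an answer strictly based on the above context; "
        "do not use your own knowledge. "
        "If the query is not covered in the context, respond with: "
        "'This query is not as per the PDF.'):"
    )


prompt = build_grounded_prompt("What is MIoT?", ["First chunk.", "Second chunk."])
print(prompt)
```

Separating prompt construction from the API call also makes it easy to check that the fallback sentence in the prompt matches the one expected during evaluation.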



In [209]:
def classify_text(query):
    # `history` is the module-level list of previous answers defined next to pdf_bot()
    prompt = (
        f"Context:\n{history}\n\n"
        f"Question: {query}\n\n"
        'Classify the question as "YES" or "NO". Output only that single word.\n'
        'Answer "YES" if the question is a conversational follow-up to the context above, '
        "including requests to generate MCQs or a PPT from that content.\n"
        'Answer "NO" otherwise. Do not ignore these instructions.'
    )

    response = chat_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        model=CHAT_DEPLOYMENT_NAME)
    # Normalize the label so the caller's comparison with "YES" tolerates case and whitespace
    return response.choices[0].message.content.strip().upper()

In [197]:
def history_chat(query):
    prompt = f'''Use the context given below to answer the question.
      question: {query}
      context:  {history}
    generate an answer in balanced tone like an advanced chatbot'''


    response = chat_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        model=CHAT_DEPLOYMENT_NAME)
    history.append(response.choices[0].message.content)
    print(f"\n {response.choices[0].message.content}")
    return response.choices[0].message.content

### Function: `pdf_chatbot_pipeline`

This function implements a complete **single-PDF RAG pipeline**, integrating all core stages of a retrieval-augmented system.

#### Pipeline Breakdown:

1. **PDF Handling**  
   - Loads the specified PDF and checks if it has already been embedded.  
   - Avoids reprocessing by leveraging existing embeddings when available.

2. **Chunking & Embedding**  
   - Splits the document into manageable content chunks.  
   - Creates a Chroma vector store or updates it, ensuring efficient reuse.

3. **Semantic Retrieval**  
   - Runs a semantic search on the vector store based on the user's query.  
   - Retrieves top-matching chunks that are most relevant to the question.

4. **Context-Aware Response Generation**  
   - Feeds the retrieved context to the model.  
   - Generates a grounded, document-based answer.

#### Final Output:
Returns a structured dictionary containing:
- `context`: Retrieved document chunks  
- `question`: User's original input  
- `AI_generated_response`: Final answer generated from the context

Though this pipeline is applied to a single PDF, it follows the same scalable and modular structure as multi-document systems—making it easy to extend or optimize further as needed.


In [189]:
def pdf_chatbot_pipeline(pdf_path, user_query,persist_directory):
    """
    Full pipeline: Load → Chunk → Embed → Retrieve → Generate
    Returns a dictionary with context, question, and AI-generated response.
    """

    
    # Check if vector store already exists
    if os.path.exists(persist_directory) and os.path.isdir(persist_directory):
        print(f"Using existing embeddings from {persist_directory}")
        # Find any new PDF that haven't been processed yet
        new_pdf = filter_new_pdf(pdf_path, persist_directory)
        
        if new_pdf:
            # Process only the new PDFs
            print(f"Processing {len(new_pdf)} new PDF...")
            raw_docs = load_pdf_with_langchain(new_pdf[0])
            chunks = chunk_documents(raw_docs)
            
            # Load existing vector store and add new documents
            vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=embeddings
            )
            
            # Add new documents to the existing vector store
            vectorstore.add_documents(chunks)
            vectorstore.persist()
            print(f"Added {len(chunks)} new chunks to existing vector store")
        else:
            # Just load the existing vector store
            vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=embeddings
            )
    else:
        # Load and process PDFs only if no existing vector store
        print(f"No existing embeddings found. Processing PDF...")
        raw_docs = load_pdf_with_langchain(pdf_path)
        chunks = chunk_documents(raw_docs)
        vectorstore = store_embeddings(docs=chunks, persist_directory=persist_directory)

    # Retrieve relevant chunks based on the user query
    retrieved = retrieve_chunks(user_query, vectorstore)

    # Generate the answer using retrieved chunks
    answer = generate_answer(user_query, retrieved)

    # Format and return the response
    return {
        'context': retrieved,
        'question': user_query,
        'AI_generated_response': answer
    }




> **Important Note:**  
>  
> This pipeline **intelligently checks** whether embeddings for the given PDF are already stored in the specified **Chroma vector store** (`persist_directory`).  
> - If embeddings are found, it **reuses them** to prevent unnecessary recomputation.  
> - If the PDF is new, the pipeline will **process and append** its embeddings to the existing vector store.  

You can also **customize the storage location** by modifying the `persist_directory` parameter. Each directory holds its own independent Chroma store on disk, so pointing the pipeline at different directories keeps different sets of documents separate.


In [236]:
history=[]
def pdf_bot():
    
    pdf_path=input("\n enter your PDF Path")
    persist_directory=input("\n enter your chromaDb Directory")
    while True:
        query=input("\n enter your question : ")
        
        
        if query != "exit":
            if len(history)==0:
                response=pdf_chatbot_pipeline(pdf_path,query,persist_directory)
                history.append(response["AI_generated_response"])
                print(response["AI_generated_response"])
            else:
                label=classify_text(query)
                print(label)
                if label=="YES":
                    history_chat(query)

                else:
                    history.clear()
                    response=pdf_chatbot_pipeline(pdf_path,query,persist_directory)
                    history.append(response["AI_generated_response"])
                    print(response["AI_generated_response"])
                    
        else:
            history.clear()
            break
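
The branching in `pdf_bot` is easier to follow, and to test, when separated from the input loop and the model calls. The sketch below mirrors that decision logic; `route_query` is a hypothetical helper, and the two lambda classifiers stand in for the LLM-backed `classify_text`.

```python
def route_query(query, history, classify):
    """Return which branch pdf_bot() would take for this query.

    classify is any callable mapping a query to "YES" or "NO";
    here it stands in for the LLM-backed classify_text().
    """
    if query == "exit":
        return "exit"
    if not history:
        return "retrieve"    # first question: always run the RAG pipeline
    if classify(query) == "YES":
        return "follow_up"   # answer from the stored history
    return "retrieve"        # unrelated question: start over with retrieval


always_yes = lambda query: "YES"
always_no = lambda query: "NO"

print(route_query("Summarize that as MCQs", ["previous answer"], always_yes))
print(route_query("A new topic", ["previous answer"], always_no))
```

Keeping the routing as a pure function means the follow-up behaviour can be checked with stub classifiers before any API call is made.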

In [175]:
# Example usage

response = pdf_chatbot_pipeline("testing.pdf", "How has the COVID-19 pandemic accelerated the adoption of digital technologies in healthcare?",persist_directory="workshop_rag.db")
print(response)
print("\n AI-Generated Response:\n")
print("-" * 80)
print(response["AI_generated_response"])


Using existing embeddings from workshop_rag.db
No new PDFs to process.
{'context': [Document(metadata={'total_pages': 16, 'producer': 'PDFium', 'author': '', 'keywords': '', 'modDate': '', 'creationDate': 'D:20220624125453', 'format': 'PDF 1.7', 'creator': 'PDFium', 'trapped': '', 'page': 8, 'subject': '', 'source': 'testing.pdf', 'title': '', 'file_path': 'testing.pdf', 'creationdate': 'D:20220624125453', 'moddate': ''}, page_content='care practitioners looked for solutions to delayed patient treatment while managing\nmassive inﬂuxes of COVID-19 patients. Healthcare has experienced a decade-long\nchange in an instant, from telehealth to remote patient monitoring, online patient\nportals to drive-through clinics. Because of the outbreak, healthcare innovation was\nvital [24]. Healthcare companies are now moving swiftly to convert promise into\nreality. As per BDO’s 2021 healthcare digital transformation survey, 93% of health-\ncare organizations have or are in the process of building a

### Handling Queries Not Covered in the Context

If the query isn't covered in the context, the model should respond:

> **"This query is not as per the PDF."**

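A response in this exact form indicates that the prompt kept the model from answering an out-of-scope question from its own knowledge, so the answer stays grounded in the retrieved context.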


In [176]:

response_option = pdf_chatbot_pipeline("testing.pdf", "explain theory of relativity?",persist_directory="workshop_rag.db")
print(response_option)
print("\n AI-Generated Response:\n")
print("-" * 80)
print(response_option["AI_generated_response"])

Using existing embeddings from workshop_rag.db
No new PDFs to process.
{'context': [Document(metadata={'creationDate': 'D:20220624125453', 'subject': '', 'trapped': '', 'title': '', 'moddate': '', 'format': 'PDF 1.7', 'total_pages': 16, 'keywords': '', 'modDate': '', 'creationdate': 'D:20220624125453', 'source': 'testing.pdf', 'creator': 'PDFium', 'producer': 'PDFium', 'file_path': 'testing.pdf', 'page': 1, 'author': ''}, page_content='e-mail: harpreetchanni@yahoo.in\nP. Shrivastava\ne-mail: prateeks1398@gmail.com\nC. L. Chowdhary\nSchool of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India\ne-mail: chiranji.lal@vit.ac.in\n© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022\nB. K. Tripathy et al. (eds.), Next Generation Healthcare Informatics,\nStudies in Computational Intelligence 1039,\nhttps://doi.org/10.1007/978-981-19-2416-3_16\n279'), Document(metadata={'subject': '', 'creator': 'PDFium', 'source': 'testing.pdf

### Accessing Page Content from Retrieved Chunks

We extract relevant **retrieved chunks** based on the query.  
From the response context, we access the `page_content` of each document to get the actual text data used for generating answers.


In [20]:
retrieved_context = [doc.page_content for doc in response['context']]
retrieved_context

['preventable medical errors and signal a significant improvement in quality of patient care. \nThe grand challenge of the medical internet of things (MIoT) is to sufficiently enable the \ndeployment of patient-centric and context-aware networked medical systems in all care \nenvironments, ranging from hospital floors to operating rooms, intensive care units to \nhome care units. Heterogeneous devices in each care environment would effectively share \ndata (efficiently, safely and securely) to minimize preventable errors often introduced by \nhumans.',
 'ICE standard will enable dramatic improvements to patient safety because cross-vendor \ninter-device communications may significantly reduce preventable medical errors. \nExamples include patient transfers from the Operating Room (OR) to Intensive Care Units \n(ICU) and reducing false alarms or deaths due to Patient-Controlled Analgesia (PCA). In \nboth of these examples, synthesis of data from diverse range of medical devices may enab

### Human-Written Reference Response

With a **human-written response** in place, we can use it as a reference to **evaluate** whether the **AI-generated response** meets different quality standards.

This comparison allows us to measure key evaluation metrics such as:

- **Answer relevance**
- **Faithfulness**
- **Hallucination detection**

By comparing the AI's output against the human response, we can better understand how well the model performs in real-world scenarios.



In [124]:
human_answer="""The Medical Internet of Things (MIoT) is improving hospital safety by enabling better coordination among medical devices across various care settings—like hospital wards, operating rooms, ICUs, and even home care. It allows secure and efficient data sharing between different devices, reducing the risk of human error. For example, the ICE (Integrated Clinical Environment) standard supports communication between devices from different vendors, which is especially useful during patient transfers, helping prevent issues like false alarms or dosing errors with systems such as Patient-Controlled Analgesia (PCA).

MIoT also enhances real-time decision-making by integrating data from multiple sources, boosting safety, efficiency, and security. It supports mobile medical devices by enabling secure, compliant, self-organizing networks tailored to individual patients. Beyond direct care, MIoT can help reduce hospital-acquired infections by tracking antibiotic use and monitoring hand hygiene, ultimately saving lives and cutting healthcare costs."""


In [125]:
print(human_answer)

The Medical Internet of Things (MIoT) is improving hospital safety by enabling better coordination among medical devices across various care settings—like hospital wards, operating rooms, ICUs, and even home care. It allows secure and efficient data sharing between different devices, reducing the risk of human error. For example, the ICE (Integrated Clinical Environment) standard supports communication between devices from different vendors, which is especially useful during patient transfers, helping prevent issues like false alarms or dosing errors with systems such as Patient-Controlled Analgesia (PCA).

MIoT also enhances real-time decision-making by integrating data from multiple sources, boosting safety, efficiency, and security. It supports mobile medical devices by enabling secure, compliant, self-organizing networks tailored to individual patients. Beyond direct care, MIoT can help reduce hospital-acquired infections by tracking antibiotic use and monitoring hand hygiene, ulti


### Adding Low-Relevance Chunks for Evaluation

Now, we are going to **deliberately create low-relevance chunks** (written by us) and **add them to the retrieved relevant chunks**. 

This setup allows us to analyze how the presence of **irrelevant or partially relevant information** affects key retrieval evaluation metrics such as:

- **Contextual Precision**
- **Contextual Recall**
- **Contextual Relevancy**

By mixing in low-relevance data, we can better observe how the system handles noise and test its ability to maintain high-quality retrieval performance.


In [22]:
low_relevance_chunks= ["""Many respected institutions and well-regarded medical professionals advertise their services as a way to inform the public about available healthcare options. Over time, physicians and hospitals have increasingly adopted marketing and public relations strategies to connect with their communities and raise awareness of the care they offer. These approaches, while often useful for improving visibility and accessibility, have also led to a noticeable shift in how healthcare services are presented."""
                        ,
                       """tools of operations management, for example, supply chain management, are useful only to a limited extent... These observations, however, have had little impact on the vast cost of consulting fees... Patients who are sick, or worried that they may be sick, generally, are neither capable of understanding their physiological status nor inclined to shop around for bargains... The value of life often far outweighs the consideration of cost... The root of the disequilibrium in healthcare is the heavy, often total, dependence of the patient on the medical practitioner."""]

                        


In [126]:
low_relevance_chunks

['Many respected institutions and well-regarded medical professionals advertise their services as a way to inform the public about available healthcare options. Over time, physicians and hospitals have increasingly adopted marketing and public relations strategies to connect with their communities and raise awareness of the care they offer. These approaches, while often useful for improving visibility and accessibility, have also led to a noticeable shift in how healthcare services are presented.',
 'tools of operations management, for example, supply chain management, are useful only to a limited extent... These observations, however, have had little impact on the vast cost of consulting fees... Patients who are sick, or worried that they may be sick, generally, are neither capable of understanding their physiological status nor inclined to shop around for bargains... The value of life often far outweighs the consideration of cost... The root of the disequilibrium in healthcare is the

In [127]:
# Prepend the low-relevance chunks to the retrieved context
retrieved_context_with_noise = low_relevance_chunks + retrieved_context

In [128]:
retrieved_context_with_noise

['Many respected institutions and well-regarded medical professionals advertise their services as a way to inform the public about available healthcare options. Over time, physicians and hospitals have increasingly adopted marketing and public relations strategies to connect with their communities and raise awareness of the care they offer. These approaches, while often useful for improving visibility and accessibility, have also led to a noticeable shift in how healthcare services are presented.',
 'tools of operations management, for example, supply chain management, are useful only to a limited extent... These observations, however, have had little impact on the vast cost of consulting fees... Patients who are sick, or worried that they may be sick, generally, are neither capable of understanding their physiological status nor inclined to shop around for bargains... The value of life often far outweighs the consideration of cost... The root of the disequilibrium in healthcare is the

## Contextual Precision

The **contextual precision** metric measures how well your RAG pipeline’s **retriever** ranks **relevant document chunks** higher than irrelevant ones for a given input query.

In simple terms:  
> Are the most relevant chunks appearing at the top of the retrieved list?

`deepeval` uses a **self-explaining LLM-based evaluation** for this metric. That means it not only returns a score but also provides a **reason** for the score using an LLM as a judge.

### Required Inputs for `ContextualPrecisionMetric` in `deepeval`

When creating an `LLMTestCase`, you need to provide:

- `input`: The user’s query
- `actual_output`: The actual response generated by the LLM (not used for this metric)
- `expected_output`: The expected response (used as reference)
- `retrieval_context`: The top-N retrieved chunks (document nodes) from your vector store

> This metric helps evaluate the **quality of retrieval**, not the generated answer.
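
To see why ranking matters for this metric, here is a minimal sketch of a weighted cumulative precision over the judge's per-chunk verdicts. It follows the formula as I understand it from the DeepEval documentation: the precision at each rank that holds a relevant chunk, averaged over the number of relevant chunks. Treat it as an illustration, not DeepEval's implementation. The function name `contextual_precision` and the verdict lists are hypothetical.

```python
def contextual_precision(verdicts: list[bool]) -> float:
    """Weighted cumulative precision over ranked relevance verdicts.

    verdicts[k] is True when the chunk at rank k + 1 was judged relevant.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score, relevant_so_far = 0.0, 0
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / rank  # precision at this rank
    return score / relevant_total


clean = contextual_precision([True, True, True])
noisy_first = contextual_precision([False, False, True, True, True])
print(clean, round(noisy_first, 3))
```

Under this formula, irrelevant chunks placed ahead of the relevant ones lower the score even though every relevant chunk is still retrieved, which is the effect that prepending low-relevance chunks to the context is meant to expose.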




In [76]:
# Evaluate without noise
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:10, 10.57s/test case]

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The context emphasizes the role of MIoT in minimizing preventable medical errors by deploying patient-centric, context-aware networked systems, which aligns with the real-time monitoring and improved hospital safety in the expected output."
    },
    {
        "verdict": "yes",
        "reason": "It discusses cross-vendor inter-device communication and synthesis of data to generate actionable information in real-time, directly supporting MIoT's contribution to faster response times and enhanced hospital safety."
    },
    {
        "verdict": "yes",
        "reason": "This document highlights integration and interoperability improving efficiency, safety, and reducing preventable medical errors, which supports MIoT's role in ensuring better compliance with safety protocols."
    },
    {





In [77]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because all nodes in the retrieval contexts are relevant to the input, showcasing MIoT's improvements to hospital safety through features like minimizing errors, real-time monitoring, interoperability, and infection prevention. Great ranking!


### Observation:

This evaluation demonstrates the ideal outcome of a well-aligned retrieval system. When running the **Contextual Precision** metric, the system achieved a **perfect score of 1.0**, indicating that **all retrieved document chunks were highly relevant** to the query.

---

#### Why It Worked:
- The retriever surfaced **only meaningful, context-rich nodes** that directly supported the expected output.
- Chunks discussed core aspects of **MIoT’s impact on hospital safety** — from reducing human error and supporting compliance to enhancing decision-making and infection control.
- **No irrelevant or noisy chunks** appeared in the top-k retrieved results.

---

#### Insight:
This shows how **clean retrieval context** — free from off-topic or generic content — can **significantly improve the quality** of answers generated by RAG systems.

---

#### Outcome:
- **Score:** 1.0 (Perfect Precision)
- **Verdicts:** All retrieved chunks were relevant and properly ranked.
- **Impact:** Ensures a highly focused, context-aware response.

---

#### Takeaway:
To replicate this result across other test cases:
- Keep your document set **highly focused** on the topic.
- Optimize chunking to **preserve contextual integrity**.
- Apply **retrieval filtering** or reranking to minimize unrelated content.

This helps maintain a **high signal-to-noise ratio** and leads to better grounding and precision in answers.
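
As one illustration of the filtering and reranking step, the sketch below drops chunks whose relevance score falls under a cutoff and sorts the rest best-first. Everything in it is an assumption made for illustration: the `(chunk, score)` pairs, the scores themselves, and the `min_score` cutoff would come from your own retriever or reranker.

```python
from typing import List, Tuple

def filter_and_rerank(scored_chunks: List[Tuple[str, float]], min_score: float) -> List[str]:
    """Keep chunks scoring at least min_score, ordered from most to least relevant."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in kept]

# Hypothetical similarity scores attached to placeholder chunks.
scored = [
    ("Healthcare marketing strategies", 0.21),
    ("MIoT and preventable medical errors", 0.88),
    ("Cross-vendor device communication", 0.74),
]
reranked = filter_and_rerank(scored, min_score=0.5)
print(reranked)
```

The cutoff is a tuning knob: set it too high and relevant chunks are lost, which would show up as lower contextual recall.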


### Testing Contextual Precision with Injected Noise Using DeepEval

In this step, we inject noisy chunks at the top of the retrieved context to see how they affect contextual precision. The metric uses an LLM judge to decide whether each retrieved chunk is relevant to the input and the expected (human) answer, then scores how highly the relevant chunks rank. This shows how the score responds when useful information is pushed down the ranking by off-topic chunks.
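
The noisy context can be built by placing the off-topic chunks ahead of the clean retrieval list. The following is a minimal sketch: `inject_noise` and the placeholder strings are hypothetical, and in the notebook the result would be assigned to `retrieved_context_with_noise`.

```python
from typing import List

def inject_noise(context: List[str], noise: List[str]) -> List[str]:
    """Return a new list with the noise chunks ranked ahead of the original context."""
    return list(noise) + list(context)

# Hypothetical placeholder chunks standing in for real document excerpts.
clean_example = ["MIoT and preventable medical errors", "Cross-vendor device communication"]
noise_example = ["Healthcare marketing strategies", "Operations management and costs"]

noisy_example = inject_noise(clean_example, noise_example)
for rank, chunk in enumerate(noisy_example, start=1):
    print(rank, chunk)
```

Building a new list leaves the clean context untouched, so both variants can be evaluated side by side.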


In [28]:
# Evaluate with noise
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise
)

metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:12, 12.18s/test case]

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The text focuses on healthcare marketing and visibility strategies, which are not relevant to MIoT or hospital safety."
    },
    {
        "verdict": "no",
        "reason": "This document discusses operations management and the relationship between patients and medical practitioners but does not address MIoT or specific hospital safety improvements."
    },
    {
        "verdict": "yes",
        "reason": "This document explicitly discusses the Medical Internet of Things (MIoT) and its role in minimizing preventable medical errors, which aligns with the expected output highlighting MIoT's impact on hospital safety."
    },
    {
        "verdict": "yes",
        "reason": "The document mentions inter-device communication reducing preventable medical errors and enabling real-time actiona




In [29]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 0.5249999999999999
Reason: The score is 0.52 because four nodes in the retrieval contexts discuss MIoT's role in minimizing medical errors, inter-device communication, or interoperability, which align with hospital safety improvements. However, the first and second nodes focus on unrelated topics like healthcare marketing and operations management, and the seventh node emphasizes healthcare-associated infections without addressing MIoT's role, yet they rank higher than relevant nodes.


### Observation:

The contextual precision score was **0.52**, which is **below the threshold of 0.6**, resulting in a **failed test**.

**Why?**  
Although several retrieved chunks were highly relevant to the question *"How does MIoT improve hospital safety?"*, the two highest-ranked chunks were **irrelevant**. They discussed **healthcare advertising** and **operations management and healthcare costs**, which have nothing to do with MIoT or hospital safety.

**Relevant content** came later in the list, but by then, the damage to precision was already done.

**Key takeaway:**  
Even if relevant chunks exist, **poor ranking order** (putting unrelated chunks first) can hurt contextual precision. The retriever must **prioritize relevance** at the top to pass this evaluation metric.
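
To see why rank order matters so much, it helps to look at how the score is computed. As far as the deepeval documentation describes it, contextual precision is a weighted cumulative precision: for every rank k that holds a relevant chunk, take the precision of the top k chunks, then average those values over the number of relevant chunks. The sketch below is an independent reimplementation of that formula, not deepeval's code. The seven-item verdict list is an assumption reconstructed from the reason text above (nodes 1, 2, and 7 irrelevant, four nodes relevant).

```python
from typing import List

def weighted_cumulative_precision(verdicts: List[bool]) -> float:
    """Average of precision@k taken at every rank k that holds a relevant chunk."""
    relevant_seen = 0
    total = 0.0
    for k, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_seen += 1
            total += relevant_seen / k  # precision of the top-k chunks
    return total / relevant_seen if relevant_seen else 0.0

# Clean context: every chunk relevant (three shown for brevity).
clean_score = weighted_cumulative_precision([True, True, True])

# Noisy context; verdict order is assumed from the judge's reason text.
noisy_score = weighted_cumulative_precision(
    [False, False, True, True, True, True, False]
)

print(f"clean: {clean_score:.3f}")
print(f"noisy: {noisy_score:.3f}")
```

Under that assumed ordering the function returns about 0.525, in line with the logged score. If the two noisy chunks were moved to the end of the list, every relevant chunk would outrank every irrelevant one and the score would return to 1.0.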


### Limiting Scoring to Top-k Retrieved Context Chunks (Precision@k)

To evaluate only the **top-k** retrieved context chunks—such as the top 3—instead of scoring all retrieved chunks, you can use **Precision@k**.

This method focuses on the highest-ranked chunks, which usually have the greatest impact on the model’s response.

#### Option 1: Manually Pre-trim the `retrieval_context`

Before passing the `retrieval_context` to the `LLMTestCase`, trim the list to include only the top `k` chunks. This simulates a real-world scenario where the model only uses the most relevant information.

In our case, we applied this top-3 cut to the context in which noisy chunks had been deliberately placed at the top. The clean context already scored 1.0, so trimming it to the top 3 or 5 chunks would not change its result; the noisy context is where the cut makes a visible difference.



In [30]:
k = 3

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise[:k]
)

metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:07,  7.18s/test case]

**************************************************
Contextual Precision Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The document primarily discusses marketing and public relations strategies adopted by hospitals and physicians, which is unrelated to MIoT or hospital safety improvements."
    },
    {
        "verdict": "no",
        "reason": "This text explores the socio-economic dynamics of healthcare, such as costs and patient dependency on practitioners, but makes no mention of MIoT or technological aids that improve hospital safety."
    },
    {
        "verdict": "yes",
        "reason": "The document highlights the role of MIoT in minimizing preventable medical errors by enabling deployment of patient-centric and networked medical systems across care environments. This is directly aligned with the question and expected output regarding improving hospital safety."
    }
]
 
Score: 0.33333333333333




In [31]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)


Success: False
Score: 0.3333333333333333
Reason: The score is 0.33 because the node ranked third explicitly explains how MIoT improves hospital safety, highlighting 'minimizing preventable medical errors' and 'deployment of patient-centric and networked medical systems.' However, the first and second nodes, which are ranked higher, are irrelevant: the first focuses on 'marketing and public relations strategies adopted by hospitals and physicians,' and the second discusses 'socio-economic dynamics of healthcare, such as costs and patient dependency,' neither addressing MIoT or hospital safety improvements.


### Observation:

In this test, we evaluated only the **top 3 retrieved chunks** (Precision@3).

**Result:** The score was **0.33**, which is **below the passing threshold of 0.6**.

**What happened?**
- The **first two chunks** were off-topic (marketing and cost discussions).
- The **relevant chunk** about MIoT improving hospital safety was ranked **third**—too low.

**Why it matters:**  
Even when relevant content exists, **poor ranking order** hurts precision. To succeed, your retriever must **put the most relevant chunks at the top**.

This is a good example of how Precision@k highlights the **importance of ranking** in retrieval quality.


> ## **Important Note**
> 
> **We are evaluating this result using the `deepeval` library, which uses an LLM to act as a judge.**
> 
> **At the backend, we are using `gpt-4o-mini` (Azure-hosted) to perform the evaluation.**
> 
> **Because this involves an LLM's reasoning, you are likely to get slightly different results when you re-run the code, even with the same inputs.**
> 
> **This variability happens because LLM-based evaluations can be non-deterministic by nature. Small differences in phrasing or internal model behavior can influence how it interprets relevance, alignment, or context.**
> 
> **Therefore, while these evaluations provide valuable insights, treat individual scores as part of a broader trend rather than absolute judgments.**
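
One way to act on this advice is to repeat each evaluation several times and look at the spread instead of a single score. The sketch below uses only the standard library. `run_once` is a hypothetical callable that you would replace with a function that runs `evaluate(...)` and returns the metric score, and the canned scores are made up for illustration.

```python
import statistics
from typing import Callable, Dict

def summarize_runs(run_once: Callable[[], float], n_runs: int, threshold: float) -> Dict[str, float]:
    """Call run_once n_runs times and report mean, spread, and pass rate."""
    scores = [run_once() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
    }

# Stand-in for a real evaluation: replays made-up scores instead of calling an LLM judge.
canned_scores = iter([0.50, 0.60, 0.70, 0.60])
summary = summarize_runs(lambda: next(canned_scores), n_runs=4, threshold=0.6)
print(summary)
```

A score that hovers around the threshold, as in this made-up series, will flip between pass and fail across runs, so the pass rate is often more informative than any single verdict.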


## Contextual Recall

The **contextual recall** metric evaluates how well your RAG pipeline’s **retriever** supports the **expected answer**.  
It measures the extent to which the `retrieval_context` aligns with the `expected_output`.

In other words:  
> Did the retriever include the necessary information to answer the question accurately?

`deepeval` uses a **self-explaining LLM-based evaluation** for this metric, where an LLM acts as a judge and explains the score.

### Required Inputs for `ContextualRecallMetric` in `deepeval`

When creating an `LLMTestCase`, provide the following:

- `input`: The original user query (not used in this metric)
- `actual_output`: The AI-generated response (also not used)
- `expected_output`: The reference human-written answer
- `retrieval_context`: The document chunks retrieved from your vector store

This metric helps ensure your retriever is pulling **all the essential context** needed to answer correctly—even if not perfectly ranked.
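
As a rough mental model, the score is the share of sentences in the expected output that the judge can attribute to at least one retrieved chunk. The sketch below is an illustrative reimplementation of that ratio, not deepeval's code, and the verdict lists are hypothetical.

```python
from typing import List

def contextual_recall(attributable: List[bool]) -> float:
    """Fraction of expected-output sentences supported by the retrieval context."""
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)

# Hypothetical verdicts: one entry per sentence in the expected output.
all_supported = contextual_recall([True, True, True, True])
partly_supported = contextual_recall([True, True, True, False])

print(all_supported, partly_supported)
```

Unlike contextual precision, the position of a chunk in the retrieved list plays no role in this ratio; only whether supporting text is present somewhere in the context.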






### Evaluating Contextual Recall with DeepEval

In this step, we define a test case that includes the input question, the AI-generated response, the human-written reference answer, and the top retrieved context.

We then apply the **Contextual Recall** metric to evaluate whether the retrieved chunks contain enough information to support the expected (human) answer.

The model compares the **retrieval context** against the **expected output** to see how much relevant content was captured. It also provides a detailed explanation (reason) for the score.

This helps us understand how **complete and helpful** the retriever is in supplying the necessary context to answer the query effectively.


In [32]:

test_case1 = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)



metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case1], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:09,  9.79s/test case]

**************************************************
Contextual Recall Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The phrase 'MIoT, or the Medical Internet of Things, enhances hospital safety by enabling real-time monitoring' aligns with the retrieval context, specifically the 1st node which discusses 'deployment of patient-centric and context-aware networked medical systems...' and emphasizes minimizing 'preventable errors often introduced by humans.'"
    },
    {
        "verdict": "yes",
        "reason": "The reference to 'faster response times, and improved decision-making' can be attributed to the 2nd node which mentions 'generation of real-time actionable information' aimed at reducing preventable errors."
    },
    {
        "verdict": "yes",
        "reason": "The description of 'Devices like smart sensors, wearables, and connected medical equipment ... track patients' vital signs' aligns with




In [33]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because every sentence in the expected output is fully supported by the nodes in the retrieval context, demonstrating a clear and comprehensive alignment between the provided information and the expected output. Great job!



### Observation: 

This evaluation helps us understand how complete and helpful the retriever is in surfacing supporting content for answering the query.

In this case, the **Contextual Recall score was 1.0**, well above the threshold of 0.6. The judge attributed every sentence of the expected output to at least one retrieved node, so no penalty was applied. Claims such as **faster response times** and **compliance with safety protocols** are not stated verbatim in the context; the judge accepted them here on the basis of closely related passages.

---

#### Key Observations:
- **Accurate matches** were found for concepts like human error reduction, vital sign tracking, alerting staff, and overall hospital safety.
- **Implicit alignment** carried claims such as "faster response times", which the judge tied to the context's "real-time actionable information" rather than to identical phrasing.
- Because such matches depend on the judge's interpretation, a stricter reading on another run could mark them as unsupported and lower the score.

---

#### Takeaway:
Although this run reached perfect recall, **recall can dip when key claims are only implied rather than explicitly present**. To keep recall stable across runs, make sure every part of the expected answer has **clear support** in the retrieved chunks.

---

#### Key Insight:
This strong performance suggests that when working with a **topically cohesive document**, retrieval becomes more precise, and the system is more likely to surface **complete and contextually rich** information.

It reinforces the value of **domain-focused content** in improving retrieval quality and achieving high recall in single-PDF RAG pipelines.



### Testing Contextual Recall with Injected Noise Using DeepEval

In this step, we inject noisy chunks at the top of the retrieved context to see how they affect contextual recall. The metric uses an LLM judge to check whether each sentence of the human reference answer can be attributed to at least one retrieved chunk; ranking order is not part of the score. This shows whether the supporting evidence is still recognized once unrelated chunks are mixed in.


In [None]:

test_case1 = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise
)



metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case1], [metric])

In [42]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.6
Reason: The score is 0.60 because while several sentences align well with nodes in retrieval context, such as the connection between MIoT and reducing preventable medical errors (3rd and 5th nodes in retrieval context), details like faster response times, decision-making efficiency, and operational aspects like asset management are not directly supported by the retrieved nodes.


### Observation

This evaluation helps us understand how complete and helpful the retriever is in surfacing supporting content for answering the query.

In this case, the Contextual Recall score was **0.6**, just meeting the threshold of 0.6. While several relevant points were captured—particularly those concerning preventable medical errors and device interoperability—other expected concepts were not directly supported due to **noise introduced in the retrieved context**.

---

### Key Observations

- Accurate matches were found for **human error reduction**, **networked systems**, **data sharing**, and **alerting staff to risks**, particularly around ICU transfers and PCA-related safety.
- **Noise in the retrieved context**, such as unrelated content about healthcare marketing and consulting inefficiencies, diluted the relevance of the evidence pool and likely contributed to missing support for some expected claims.
- Key expectations like **faster response times**, **decision-making efficiency**, **protocol compliance**, and **asset management** were **not clearly grounded** in the retrieved nodes, either due to being absent or buried in less relevant passages.
- The presence of semantically misaligned or tangential chunks made it harder for the judge model to align every part of the expected answer with explicit context.

---

### Takeaway

The fall in recall is not solely due to missing information, but also to the **presence of irrelevant content** that competes with or overshadows critical supporting evidence. Even when relevant information exists, its impact can be weakened by surrounding noise, making it harder for the judge model to attribute each expected claim to the context.

---

### Key Insight

This outcome reinforces the importance of **retrieval precision** in addition to relevance. In RAG pipelines, especially those built on long or heterogeneous documents, controlling for **contextual noise** is vital. A few off-topic chunks can significantly degrade contextual recall, highlighting the need for better filtering, ranking, or chunking strategies to isolate the most informative text and maintain high-quality answer grounding.


### Recall@k (Top-k Context Evaluation)

To focus only on the top-k retrieved chunks with noise (e.g., top 3), we use **Recall@k**.

You can do this by **slicing the retrieval context** before passing it to the test case.

We tested this with **k = 3** to see if the top 3 chunks alone cover the expected answer.


In [52]:
k = 3

test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise[:k]
)

metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:07,  7.58s/test case]

**************************************************
Contextual Recall Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The 3rd node in the retrieval context describes MIoT's role in minimizing preventable medical errors and its deployment in various care environments. This aligns with the sentence's points about enhancing hospital safety and real-time monitoring ('...signal a significant improvement in quality of patient care...enable deployment of patient-centric...networked medical systems...')."
    },
    {
        "verdict": "yes",
        "reason": "The 3rd node states that heterogeneous devices in care environments share data efficiently to minimize errors. This is consistent with the description of smart sensors and connected equipment tracking vital signs ('...heterogeneous devices in each care environment would effectively share data...to minimize preventable errors...')."
    },
    {
        "verd




In [53]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.75
Reason: The score is 0.75 because while node(s) in the retrieval context provide strong alignment with aspects of real-time monitoring, minimizing errors, and data-sharing in hospital environments (e.g., 3rd node), they lack explicit references to protocols, asset management, and broader patient care safety impacts mentioned in sentence(s) of the expected output.


### Observation

This evaluation helps us understand how complete and helpful the retriever is in surfacing supporting content for answering the query.

In this case, the **Contextual Recall score was 0.75**, surpassing the threshold of 0.6. The majority of core concepts—such as MIoT's role in reducing preventable errors, supporting real-time monitoring, and enabling system interoperability—were well represented in the retrieved content. However, the score did not reach full recall due to the **presence of noise and missing grounding for some subclaims**.

---

### Key Observations

- **Strong alignment** was observed for themes like human error reduction, smart device monitoring, and patient-centric networked systems.
- **Some retrieved chunks included off-topic content**, such as healthcare marketing or consulting discussions, which diluted contextual relevance.
- Key expected claims — like **compliance with protocols**, **asset management**, and **generalized safety improvements** — were not explicitly supported in the retrieved content.
- In the **Top 3 Chunks (with Noise)**, two of the three chunks were off-topic; only the third carried MIoT-specific content, which limited how many expected claims could be grounded.

---

### Takeaway

The retrieval system successfully surfaced high-value content, but its effectiveness was limited by the inclusion of **semantically unrelated chunks**. These distracted from the core message and lowered the score by weakening support for several parts of the expected answer.

---

### Key Insight

> Retrieval systems must be both relevant and precise.  
A small number of high-ranking noisy chunks can overshadow useful content. Ensuring **focused, thematic retrieval** is critical to maximize contextual grounding and overall answer quality in RAG pipelines.

---

### Comparative Chunk Analysis

| Chunk Type                     | MIoT Signal Present | Noise Present | Description |
|-------------------------------|---------------------|---------------|-------------|
| **Fully Relevant**            | Yes                 | No            | Clean, on-topic chunks that directly support major claims like device interoperability and error prevention. |
| **Relevant with Noise**       | Yes                 | Yes           | Partial grounding; includes valid MIoT content but mixed with unrelated details (e.g., healthcare marketing). |
| **Top 3 Chunks (with Noise)** | Minimal (third chunk only) | Yes     | Two highly ranked off-topic chunks introduce irrelevant information such as consulting economics and patient behavior; only the third chunk supports the expected answer, reducing overall grounding accuracy. |



## Contextual Relevancy

The **contextual relevancy** metric measures how relevant the information in your `retrieval_context` is for answering the given input query.  
It focuses on the **overall quality and usefulness** of the retrieved content, regardless of specific expected answers.

`deepeval` uses a **self-explaining LLM-based evaluation**, where the model not only scores the result but also provides a reason for the score—making the evaluation more transparent.

### Required Inputs for `ContextualRelevancyMetric` in `deepeval`

When creating an `LLMTestCase`, you need to provide:

- `input`: The user’s query  
- `actual_output`: The AI-generated response (not used for this metric)  
- `retrieval_context`: The top-N document chunks retrieved from the vector store

> This metric is useful for assessing the **general relevance** of the retrieved documents—whether or not the final answer is perfect.




### How ContextualRelevancyMetric Handles Context Chunks

By default, `ContextualRelevancyMetric` splits the retrieval context into smaller "statements"—usually by sentences using periods (`.`).

This means that even if you pass entire context chunks to the evaluator, the metric evaluates **each sentence individually** for its relevance to the input query.

As a result, the scoring is more fine-grained and reflects sentence-level relevancy rather than evaluating full chunks as a whole.
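
Under that sentence-level view, the score reduces to the fraction of statements judged relevant. The sketch below is an illustrative reimplementation, not deepeval's code. It takes verdicts in the nested shape that appears in the verbose logs (one group per chunk, each with a `verdicts` list), and the sample data is hypothetical.

```python
from typing import Dict, List

def contextual_relevancy(groups: List[Dict]) -> float:
    """Share of extracted statements, across all chunks, with verdict 'yes'."""
    verdicts = [v["verdict"] for g in groups for v in g.get("verdicts", [])]
    if not verdicts:
        return 0.0
    return sum(v == "yes" for v in verdicts) / len(verdicts)

# Hypothetical two-chunk example: three relevant statements out of four.
sample_groups = [
    {"verdicts": [{"statement": "A", "verdict": "yes"},
                  {"statement": "B", "verdict": "yes"}]},
    {"verdicts": [{"statement": "C", "verdict": "no"},
                  {"statement": "D", "verdict": "yes"}]},
]
relevancy_score = contextual_relevancy(sample_groups)
print(relevancy_score)
```

A long chunk with a single off-topic sentence therefore costs only one statement rather than the whole chunk.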


### Evaluating Contextual Relevancy 

We use the input question and the retrieved chunks to check how relevant the context is overall.

The metric ignores the actual and expected answers and focuses only on how well the context supports the input query.

It returns a score and explanation to show if the retrieved content was generally useful.


In [54]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

metric = ContextualRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:08,  8.45s/test case]

**************************************************
Contextual Relevancy Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdicts": [
            {
                "statement": "Preventable medical errors signal a significant improvement in quality of patient care.",
                "verdict": "yes",
                "reason": null
            },
            {
                "statement": "The grand challenge of the medical internet of things (MIoT) is to sufficiently enable the deployment of patient-centric and context-aware networked medical systems in all care environments, ranging from hospital floors to operating rooms, intensive care units to home care units.",
                "verdict": "yes",
                "reason": null
            },
            {
                "statement": "Heterogeneous devices in each care environment would effectively share data (efficiently, safely and securely) to minimize preventable errors often introduced




In [55]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.8421052631578947
Reason: The score is 0.84 because while several statements such as 'Heterogeneous devices in each care environment would effectively share data (efficiently, safely and securely) to minimize preventable errors often introduced by humans.' and 'ICE standard will enable dramatic improvements to patient safety because cross-vendor inter-device communications may significantly reduce preventable medical errors.' directly relate to MIoT improving hospital safety, irrelevancy arises from mentions of remote monitoring at home and unrelated challenges like data synthesis or vague statements like 'humans.'


### Observation:

This evaluation assesses how well the retrieved context semantically aligns with the user query:  
**"How does MIoT improve hospital safety?"**

---

#### Score Summary:
- **Contextual Relevancy Score:** **0.84**
- **Pass Status:** Passed (threshold = 0.6)

---

#### Key Highlights:
- The majority of statements in the retrieval context were **highly relevant**, especially around:
  - **Patient-centric systems**
  - **Real-time actionable insights**
  - **Inter-device communication (ICE)**
  - **Hospital safety and efficiency**

- Examples of strong relevance:
  - “Heterogeneous devices... share data... to minimize preventable errors”
  - “ICE standard... reduce preventable medical errors”
  - “Integration and interoperability may improve efficiency, safety and security”

- A few minor drops were noted:
  - Some chunks discussed **general safety limitations or contextual risks**, not **direct MIoT benefits** (e.g., challenges with remote monitoring or human dependency on manual data synthesis).

---

#### Takeaway:
By **excluding noisy, off-topic chunks** (like those related to advertising or healthcare economics), the system achieved **clear and targeted relevance**, resulting in a strong contextual alignment.

This confirms that **relevance improves when retrieval is focused** and semantically aligned with the query, especially in a **single-PDF setup** that is tightly scoped to the subject matter.


### Extracting Relevant Statements

This code parses the verbose logs from the evaluation and lists all statements where the verdict is `"yes"`.

It helps isolate only the **relevant content** from the retrieval context for further analysis or debugging.


In [56]:
import json

# Extract JSON string from verbose logs
logs = result.test_results[0].metrics_data[0].verbose_logs
json_text = logs.split("Verdicts:\n", 1)[1].strip()

# Parse the verdict groups and collect statements judged relevant
verdicts = json.loads(json_text)
relevant_statements = [
    v["statement"]
    for group in verdicts
    for v in group.get("verdicts", [])
    if v.get("verdict") == "yes"
]

# Print the relevant statements with numbering
print("Relevant Statements (verdict: yes):\n")
for count, statement in enumerate(relevant_statements, start=1):
    print(f"{count}. {statement.strip()}")


Relevant Statements (verdict: yes):

1. Preventable medical errors signal a significant improvement in quality of patient care.
2. The grand challenge of the medical internet of things (MIoT) is to sufficiently enable the deployment of patient-centric and context-aware networked medical systems in all care environments, ranging from hospital floors to operating rooms, intensive care units to home care units.
3. Heterogeneous devices in each care environment would effectively share data (efficiently, safely and securely) to minimize preventable errors often introduced by humans.
4. ICE standard will enable dramatic improvements to patient safety because cross-vendor inter-device communications may significantly reduce preventable medical errors.
5. Examples include patient transfers from the Operating Room (OR) to Intensive Care Units (ICU) and reducing false alarms or deaths due to Patient-Controlled Analgesia (PCA).
6. In both of these examples, synthesis of data from diverse range of

### Observation: Relevant Statement Extraction

The evaluation identified **16 highly relevant statements** from the retrieved context chunks that directly support the query:

> **"How does MIoT improve hospital safety?"**

---

#### What These Statements Captured:
- **Preventable Error Reduction**  
  Descriptions of MIoT minimizing human-introduced mistakes via automated monitoring systems.

- **Interoperability & Integration**  
  Cross-device communication (e.g., ICE standard) that supports seamless transitions between care environments.

- **Real-Time Data Sharing**  
  Synthesis of actionable patient information enabling immediate clinical response.

- **Infection Control**  
  Use of connectivity principles to track hygiene behaviors and antibiotic usage.

---

#### Insight:
This confirms that **despite minor noise**, the retriever surfaced a **strong and diverse set of semantically aligned statements**. These grounded chunks provided meaningful support to the final response, increasing both **faithfulness** and **relevancy** in the RAG pipeline.
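
As a rough illustration of how such verdicts turn into a score, the sketch below counts `"yes"` verdicts over all verdicts. It assumes the score is the share of relevant statements, which is how the DeepEval documentation describes contextual relevancy; the helper name `relevancy_ratio` and the toy statements are hypothetical and not part of the notebook.

```python
def relevancy_ratio(verdict_groups):
    """Return (relevant, total, ratio) for verdict groups shaped like the verbose logs."""
    # Flatten the per-chunk groups into one list of verdict dicts.
    verdicts = [v for group in verdict_groups for v in group.get("verdicts", [])]
    relevant = sum(
        1 for v in verdicts if v.get("verdict", "").strip().lower() == "yes"
    )
    total = len(verdicts)
    return relevant, total, (relevant / total if total else 0.0)


# Toy data in the same shape as the logs (statements are made up for illustration).
example_groups = [
    {"verdicts": [
        {"statement": "Devices share data to minimize preventable errors.", "verdict": "yes"},
        {"statement": "Hospitals advertise their services.", "verdict": "no"},
    ]},
    {"verdicts": [
        {"statement": "ICE enables cross-vendor communication.", "verdict": "yes"},
        {"statement": "Consulting fees vary.", "verdict": "no"},
    ]},
]

relevant, total, ratio = relevancy_ratio(example_groups)
print(f"{relevant}/{total} relevant -> {ratio:.2f}")
```

Applied to the parsed `verdicts` list from the cell above, the same helper would give a quick cross-check of the reported score.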


### Testing Contextual Relevancy with Injected Noise Using DeepEval

In this step, we intentionally inject noisy chunks at the top of the retrieved context to evaluate their impact on contextual relevancy.

DeepEval's `ContextualRelevancyMetric` then judges each statement in the retrieval context against the input query and reports the share that is relevant. The human reference (`expected_output`) is included in the test case, but it is not what this metric scores, and the ranking of chunks is measured by Contextual Precision rather than by this metric.

This process helps assess how much off-topic material dilutes the retrieved context, even when the relevant chunks are still present.
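
A minimal sketch of the injection step, assuming the retrieved chunks are held as a plain list of strings. The chunk texts and the name `noisy_chunks` are placeholders for illustration; only `retrieved_context_with_noise` corresponds to a variable used in the next cell.

```python
# Hypothetical stand-ins for the chunks returned by the retriever.
retrieved_context = [
    "Heterogeneous devices share data to minimize preventable errors.",
    "The ICE standard enables cross-vendor inter-device communication.",
]

# Hypothetical off-topic chunks used as noise.
noisy_chunks = [
    "Many institutions advertise their services to inform the public.",
    "Consulting fees vary widely between providers.",
]

# Prepend the noise so that irrelevant chunks occupy the top ranks.
retrieved_context_with_noise = noisy_chunks + retrieved_context

for rank, chunk in enumerate(retrieved_context_with_noise, start=1):
    print(rank, chunk)
```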


In [60]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    expected_output=human_answer,
    retrieval_context=retrieved_context_with_noise
)

metric = ContextualRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:09,  9.25s/test case]

**************************************************
Contextual Relevancy Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdicts": [
            {
                "statement": "Many respected institutions and well-regarded medical professionals advertise their services as a way to inform the public about available healthcare options.",
                "verdict": "no",
                "reason": "The statement addresses healthcare advertising, which is unrelated to how MIoT improves hospital safety."
            },
            {
                "statement": "Over time, physicians and hospitals have increasingly adopted marketing and public relations strategies to connect with their communities and raise awareness of the care they offer.",
                "verdict": "no",
                "reason": "The statement focuses on marketing and public relations strategies, which do not pertain to MIoT or hospital safety."
            },
            {
  




In [62]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.6153846153846154
Reason: The score is 0.62 because while statements such as 'deployment of patient-centric and context-aware networked medical systems' and 'heterogeneous devices in each care environment would effectively share data' are relevant to how MIoT improves hospital safety, many other statements discuss unrelated topics like patient behavior, healthcare advertising, and consulting fees, making them irrelevant to the input.


### Observation: Contextual Relevancy with Noise in Retrieved Chunks

The **Contextual Relevancy score was 0.62**, slightly above the threshold of 0.6, indicating a **pass**—but with significant dilution from noisy content.

---

#### Key Findings:

- **Relevant Content**:
  - Strongly aligned chunks discussed MIoT’s impact on hospital safety, such as:
    - Real-time data sharing
    - Cross-device interoperability (ICE standard)
    - Preventable error reduction
    - Infection control via connected monitoring

- **Irrelevant Content**:
  - Several top-ranked chunks discussed **marketing**, **consulting costs**, and **healthcare economics**, which are unrelated to the query.
  - These off-topic chunks **weakened the overall contextual focus** of the retriever output.

---

#### Why Score Was Pulled Down:
Even though highly relevant statements existed (and were correctly identified), the presence of **distracting non-relevant context** lowered the share of relevant statements and therefore the score. This is typical when working with **multi-topic documents or mixed-domain sources**.

---

#### Takeaway:
To further improve contextual relevancy:
- Apply **semantic reranking** post-retrieval to prioritize topic-focused chunks.
- Use **topic filtering** or document segmentation to isolate relevant sub-domains before embedding.
- Explore **hierarchical retrieval**, retrieving first by document, then chunk.

Despite the noise, this evaluation confirms the retriever successfully surfaced **semantically critical content** to support the answer—but future refinement should aim to suppress unrelated context more effectively.
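
The reranking idea can be sketched as follows. This is only a toy: it scores chunks by lexical (Jaccard) overlap with the query as a stand-in for a real semantic reranker such as a cross-encoder or embedding similarity, and the function names, example chunks, and `min_score` cutoff are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(query, chunks, min_score=0.0):
    """Sort chunks by Jaccard overlap with the query; keep those above min_score."""
    q = tokenize(query)
    scored = []
    for chunk in chunks:
        c = tokenize(chunk)
        union = q | c
        score = len(q & c) / len(union) if union else 0.0
        scored.append((score, chunk))
    # Highest overlap first; Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored if score > min_score]


query = "How does MIoT improve hospital safety?"
chunks = [
    "Physicians have adopted marketing and public relations strategies.",
    "MIoT devices share data to improve patient safety in the hospital.",
    "Consulting fees vary between providers.",
]

reranked = rerank(query, chunks)
print(reranked)
```

With the default cutoff, chunks that share no token with the query are dropped, which mirrors the "suppress unrelated context" goal described above.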


### Comparison of Contextual Metrics (With vs. Without Noise)

| Metric                   | **Without Noise** | **With Noise** | Impact |
|--------------------------|-------------------|----------------|--------|
| **Contextual Precision** | 1.00              | 0.56           | Precision dropped due to top-ranked irrelevant chunks |
| **Contextual Recall**    | 1.00              | 0.60           | Lower recall as key concepts were missing or unsupported |
| **Contextual Relevancy** | 0.84              | 0.62           | Score barely passed; diluted by unrelated content |

---

### Insights:

- **Without Noise**: All metrics performed strongly—retriever provided focused, aligned, and rich context.
- **With Noise**: Metrics degraded due to retrieval of **off-topic** chunks (e.g., marketing, consulting costs).
- **Precision** was **most sensitive** to ranking irrelevant content at the top.
- **Recall** suffered when critical elements like **"faster response"** or **"protocol compliance"** weren’t supported.
- **Relevancy** dropped slightly but still passed, showing partial alignment despite noise.
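
To see why precision reacts so strongly to noise at the top, here is a small sketch of a rank-weighted precision. It assumes the weighted cumulative precision formula described in the DeepEval documentation (precision at each relevant rank, averaged over the relevant chunks). The relevance flags are illustrative and are not meant to reproduce the 0.56 reported above.

```python
def contextual_precision(relevance_flags):
    """Rank-weighted precision over retrieved chunks (1 = relevant, 0 = not)."""
    total_relevant = sum(relevance_flags)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    seen_relevant = 0
    for k, flag in enumerate(relevance_flags, start=1):
        if flag:
            seen_relevant += 1
            # Precision at rank k, counted only at positions holding a relevant chunk.
            score += seen_relevant / k
    return score / total_relevant


clean = [1, 1, 1]                # only relevant chunks
noise_on_top = [0, 0, 1, 1, 1]   # two noisy chunks ranked first
noise_at_bottom = [1, 1, 1, 0, 0]

for name, flags in [("clean", clean),
                    ("noise on top", noise_on_top),
                    ("noise at bottom", noise_at_bottom)]:
    print(f"{name}: {contextual_precision(flags):.2f}")
```

Under this formula the same two noisy chunks cost nothing when ranked last but roughly halve the score when ranked first, which is the sensitivity noted in the insights.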

---

### Recommendation:
To improve overall performance, use:
- **Semantic reranking**
- **Content filters**
- **Smarter chunking or hybrid retrieval strategies**

This ensures **important concepts stay at the top** and **noise is suppressed**, even when dealing with multi-domain PDFs.


# Generator Evaluation Metrics

After the retrieval step, the **generation phase** is responsible for producing the final response. This involves:

- Creating a prompt by combining the **user’s input** with the **retrieved context**
- Passing that prompt to the **LLM**, which then generates the answer

To assess the quality of the generated response, we focus on the following key evaluation metrics:

- **Answer Relevancy** – How well does the response align with the user’s query?
- **Faithfulness** – Is the generated content factually grounded in the retrieved context?
- **Hallucination Check** – Does the model introduce unsupported or made-up information?
- **Custom LLM as a Judge (G-Eval)** – Uses an LLM to evaluate responses across custom criteria

These metrics help ensure that the generated output is not only relevant but also reliable and trustworthy.


## LLM-based Answer Relevancy - DeepEval

The **Answer Relevancy** metric evaluates how well the **actual output** from your LLM matches the **intent and content** of the original input query.

This metric focuses on whether the generated response stays **on-topic** and provides meaningful, query-specific information.

`deepeval` uses a **self-explaining LLM-based evaluation**, meaning it not only gives a score but also provides a **reason** for the verdict using an LLM as a judge.

### Required Inputs for `AnswerRelevancyMetric` in `deepeval`:

- `input`: The user’s query  
- `actual_output`: The response generated by your LLM

This metric is useful for identifying off-topic or overly generic answers, helping ensure your generated output is truly relevant to the user's question.



### Evaluating Answer Relevancy

In this step, we evaluate how relevant the LLM's generated response is to the input question.

We use `AnswerRelevancyMetric`, which compares the question and the generated answer to see if the response stays on-topic and addresses the query meaningfully.

The evaluation returns a score and a reason, helping us understand how well the model aligned its response with the user's intent.


In [86]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
)

metric = AnswerRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:09,  9.00s/test case]

**************************************************
Answer Relevancy Verbose Logs
**************************************************

Statements:
[
    "MIoT improves hospital safety.",
    "It enables the deployment of patient-centric and context-aware networked medical systems across various care environments.",
    "Care environments include hospital floors, operating rooms, intensive care units, and home care units.",
    "MIoT facilitates the efficient, safe, and secure sharing of data among heterogeneous medical devices.",
    "It minimizes preventable errors often introduced by humans.",
    "The ICE standard supports cross-vendor inter-device communications.",
    "Cross-vendor inter-device communications can reduce preventable medical errors during patient transfers.",
    "Patient transfers include movements such as from the Operating Room to Intensive Care Units.",
    "The ICE standard can mitigate risks like false alarms.",
    "The ICE standard can mitigate risks like deat




In [87]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.9473684210526315
Reason: The score is 0.95 because the answer is highly relevant and addresses the question effectively, though one statement about patient transfers doesn't directly tie back to how MIoT improves hospital safety.


### Observation: 

The **Answer Relevancy** score was **0.95**, which is **well above the 0.6 threshold**, resulting in a **successful pass**.

**What went well?**  
The generated response stayed highly focused on the input question: *“How does MIoT improve hospital safety?”*  
It covered key aspects like **reducing preventable errors**, **real-time data sharing**, **device interoperability**, and **infection prevention**.

**Minor issue:**  
According to the judge's reason, one statement about **patient transfers** did not tie back directly to how MIoT improves hospital safety, which slightly lowered the score, though not enough to fail.

**Key takeaway:**  
The response was **on-topic, informative, and aligned** with the user's query. This confirms that the generation step effectively used the retrieved context to produce a relevant and meaningful answer.
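
Because these scores are ratios of statement counts, a raw score can be mapped back to a plausible count. The sketch below does this with `fractions.Fraction`; the helper name `as_ratio` and the denominator cap of 50 are assumptions, and the result is only consistent with the logs (the statement list above is truncated), not a confirmed count.

```python
from fractions import Fraction


def as_ratio(score, max_denominator=50):
    """Approximate a score by the closest fraction with a small denominator."""
    return Fraction(score).limit_denominator(max_denominator)


print(as_ratio(0.9473684210526315))  # answer relevancy from this run -> 18/19
print(as_ratio(0.6153846153846154))  # contextual relevancy with noise -> 8/13
```

A result of 18/19 would match the reason given above, where a single statement was judged not directly relevant.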


## Faithfulness

The **faithfulness** metric evaluates whether the LLM's generated response (**actual_output**) is **factually consistent** with the information found in the **retrieved context**.

It helps detect whether the model has introduced **hallucinations**—claims that are not grounded in the source material.

`deepeval` uses a **self-explaining LLM-based evaluation**, meaning it not only provides a score but also includes a rationale for how the score was determined.

### Required Inputs for `FaithfulnessMetric` in `deepeval`:

- `input`: The original user query (not used in this metric)  
- `actual_output`: The response generated by your LLM  
- `retrieval_context`: The top-N document chunks retrieved from your vector store

This metric is essential for ensuring that generated answers stay **fact-based** and **trustworthy**, especially in high-stakes domains like healthcare.




In [65]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    retrieval_context=retrieved_context
)

metric = FaithfulnessMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:10, 10.27s/test case]

**************************************************
Faithfulness Verbose Logs
**************************************************

Truths (limit=None):
[
    "Preventable medical errors signal a significant improvement in quality of patient care.",
    "The grand challenge of the medical internet of things (MIoT) is to enable the deployment of patient-centric and context-aware networked medical systems in all care environments.",
    "Care environments include hospital floors, operating rooms, intensive care units, and home care units.",
    "Heterogeneous devices in each care environment would share data to minimize preventable errors caused by humans.",
    "ICE standard may enable dramatic improvements to patient safety.",
    "Cross-vendor inter-device communications may significantly reduce preventable medical errors.",
    "Examples of preventable errors include patient transfers from the Operating Room (OR) to Intensive Care Units (ICU) and reducing false alarms or deaths due to P




In [66]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 1.0
Reason: The score is 1.00 because there are no contradictions, indicating the actual output aligns perfectly with the retrieval context. Great job!


### Observation: 

The **Faithfulness** test achieved a **perfect score of 1.00**, meaning the LLM's response is **entirely grounded** in the retrieved context.

**What went well?**  
Every major claim in the response—such as reducing preventable errors, enabling real-time data sharing, and supporting interoperability—was fully supported by the documents.  
No contradictions or hallucinated facts were found. One minor claim about tracking hand-washing habits received an “idk” (uncertain), but it did not affect the perfect score.
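
The reason an “idk” verdict leaves the score untouched can be sketched as below. This assumes, as I understand DeepEval's behaviour, that only an explicit `"no"` (a contradiction) counts against faithfulness; the function name and the verdict lists are hypothetical.

```python
def faithfulness_score(verdicts):
    """Share of claims that are not contradicted by the retrieval context."""
    if not verdicts:
        return 1.0  # assumption: no claims means nothing to contradict
    not_contradicted = sum(1 for v in verdicts if v.strip().lower() != "no")
    return not_contradicted / len(verdicts)


# "idk" is treated like "yes": the claim is unverified but not contradicted.
print(faithfulness_score(["yes", "yes", "idk"]))
print(faithfulness_score(["yes", "no", "idk", "yes"]))
```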

---

### Compared to Answer Relevancy

- **Faithfulness Score:** 1.00   
- **Answer Relevancy Score:** 0.95 

**Key difference:**  
- **Relevancy** measured how well the response answered the *question* (one statement was judged slightly off-topic).
- **Faithfulness** checked whether the answer stuck to the *retrieved context* (and it did, perfectly).

**Takeaway:**  
The LLM generated an answer that was both **relevant to the question** and **factually aligned** with the retrieved content—making it a **strong, reliable response**.


## Hallucination Check

The **hallucination** metric checks whether the LLM generates any **factually incorrect or unsupported information** in its response.

It does this by comparing the `actual_output` to a **human-verified ground truth context**, rather than relying on retrieved documents.

`deepeval` uses a **self-explaining LLM evaluation**, meaning the model provides both a score and a reason for its judgment.

### Required Inputs for `HallucinationMetric` in `deepeval`:

- `input`: The original user query (not used in the scoring)  
- `actual_output`: The response generated by the LLM  
- `context`: Human-verified ground truth chunks used for factual reference

This metric is especially important for identifying **hallucinations**, or fabricated details, which can undermine trust and accuracy in high-stakes applications.
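
Unlike the other metrics in this section, a lower hallucination score is better, and the threshold acts as a maximum. The sketch below assumes the score is the fraction of context entries that the output contradicts, as described in the DeepEval documentation; the helper names are hypothetical.

```python
def hallucination_score(verdicts):
    """Fraction of context entries the output fails to agree with ('no' verdicts)."""
    if not verdicts:
        return 0.0
    contradicted = sum(1 for v in verdicts if v.strip().lower() == "no")
    return contradicted / len(verdicts)


def passes(score, threshold=0.6):
    """For this metric the threshold is an upper bound, so lower scores pass."""
    return score <= threshold


# One context entry the output agrees with, then one it does not.
for verdicts in (["yes"], ["no"]):
    score = hallucination_score(verdicts)
    print(verdicts, score, "pass" if passes(score) else "fail")
```

With a single context entry, as in the cells that follow, the score can only be 0.0 or 1.0.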



In [122]:
print(human_answer)

The Medical Internet of Things (MIoT) is improving hospital safety by enabling better coordination among medical devices across various care settings—like hospital wards, operating rooms, ICUs, and even home care. It allows secure and efficient data sharing between different devices, reducing the risk of human error. For example, the ICE (Integrated Clinical Environment) standard supports communication between devices from different vendors, which is especially useful during patient transfers, helping prevent issues like false alarms or dosing errors with systems such as Patient-Controlled Analgesia (PCA).

MIoT also enhances real-time decision-making by integrating data from multiple sources, boosting safety, efficiency, and security. It supports mobile medical devices by enabling secure, compliant, self-organizing networks tailored to individual patients. Beyond direct care, MIoT can help reduce hospital-acquired infections by tracking antibiotic use and monitoring hand hygiene, ulti

In [94]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    context=[human_answer]
)

metric = HallucinationMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:03,  3.46s/test case]

**************************************************
Hallucination Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "yes",
        "reason": "The actual output aligns with the context, describing how MIoT improves hospital safety by enhancing inter-device communication, reducing human errors, using the ICE standard to facilitate vendor interoperability, addressing hand hygiene and antibiotic use to reduce infections, and improving real-time decision-making through integration and interoperability. There is no contradiction."
    }
]
 
Score: 0.0
Reason: The score is 0.00 because the actual output fully aligns with the context without any contradictions, accurately reflecting the details provided in the factual alignments.



Metrics Summary

  - ✅ Hallucination (score: 0.0, threshold: 0.6, strict: False, evaluation model: azure-gpt4o-mini, reason: The score is 0.00 because the actual output fully aligns with the context without any con




In [95]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.0
Reason: The score is 0.00 because the actual output fully aligns with the context without any contradictions, accurately reflecting the details provided in the factual alignments.


### Observation: Hallucination Check Result

The **Hallucination Metric** scored **0.00**, the best possible value (lower is better for this metric), indicating that the model did **not introduce any unsupported or fabricated information**.  
All parts of the generated answer were consistent with the **human-verified context**, so the judge found nothing in the output that contradicts the reference.

**What went well?**  
The model’s statements—on MIoT’s role in patient safety, device interoperability, error reduction, and infection tracking—**all matched the ground truth** without any contradictions.

---

### Comparison with Other Generator Metrics

- **Answer Relevancy**: **0.95**  
  The answer was highly aligned with the question, though one statement was judged not to tie back directly to the core topic.

- **Faithfulness**: **1.00**  
  The response was completely grounded in the retrieved context, showing strong consistency.

- **Hallucination**: **0.00**  
  No hallucinated facts were introduced; everything aligned with the **human-validated reference**.

---

### Takeaway

The model not only generated a **relevant** and **faithful** answer but also maintained **factual integrity** throughout.  
This confirms that both the **retrieval and generation steps** worked together to produce a response that is **accurate, reliable, and hallucination-free**—a key benchmark for real-world LLM applications in sensitive domains like healthcare.


### Hallucination Evaluation with Noisy Retrieval Context Using DeepEval

This evaluation assesses how the hallucination metric behaves when the reference `context` is replaced with **noisy content** that is largely irrelevant to the input query.

### Evaluation Setup

The **input query** and the **LLM-generated output** remain unchanged. However, the human-verified reference passed as `context` is swapped for an **unrelated, off-topic passage** (`noisy_human_answer`) to simulate degraded evidence.

### Objective

This test is designed to evaluate whether the model:

- **Generates hallucinated content** not supported by the context.
- **Introduces unsupported claims** in the presence of limited or misleading evidence.  
- Maintains **factual accuracy** despite retrieval degradation.

This scenario is essential for analyzing model robustness and reliability when the retrieval pipeline surfaces a mix of relevant and irrelevant information.


In [114]:
# now let's inject the noise 

noisy_human_answer="""Theory of Relativity and MIoT: A Conceptual Parallel
Albert Einstein’s theory of relativity—including Special and General Relativity—transformed how we understand time, space, and gravity. It introduced concepts like time dilation and spacetime curvature, showing that time and space are relative to the observer’s motion and position.

Drawing a loose analogy, Medical Internet of Things (MIoT) systems are becoming increasingly context-aware, adapting to patients’ needs in real time. Just as relativity tells us that events unfold differently depending on one’s frame of reference, MIoT networks respond dynamically to a patient’s clinical context—location, vitals, and device interactions.

Some futurists and AI theorists suggest that, like spacetime in physics, data environments in hospitals may evolve into dynamic, responsive systems that “bend” around the patient’s condition. While this is more metaphorical than scientific, it highlights how both relativity and MIoT value context and adaptability over fixed systems."""

In [115]:
noisy_human_answer

'Theory of Relativity and MIoT: A Conceptual Parallel\nAlbert Einstein’s theory of relativity—including Special and General Relativity—transformed how we understand time, space, and gravity. It introduced concepts like time dilation and spacetime curvature, showing that time and space are relative to the observer’s motion and position.\n\nDrawing a loose analogy, Medical Internet of Things (MIoT) systems are becoming increasingly context-aware, adapting to patients’ needs in real time. Just as relativity tells us that events unfold differently depending on one’s frame of reference, MIoT networks respond dynamically to a patient’s clinical context—location, vitals, and device interactions.\n\nSome futurists and AI theorists suggest that, like spacetime in physics, data environments in hospitals may evolve into dynamic, responsive systems that “bend” around the patient’s condition. While this is more metaphorical than scientific, it highlights how both relativity and MIoT value context a

In [120]:
test_case = LLMTestCase(
    input=response['question'],
    actual_output=response['AI_generated_response'],
    context=[noisy_human_answer]
)

metric = HallucinationMetric(
    threshold=0.6,
    model=wrapped_model,
    include_reason=True,
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:05,  5.49s/test case]

**************************************************
Hallucination Verbose Logs
**************************************************

Verdicts:
[
    {
        "verdict": "no",
        "reason": "The actual output focuses on MIoT\u2019s practical implementations, such as improving hospital safety, interoperability, and reducing errors, while the context primarily discusses a conceptual parallel between Einstein's Theory of Relativity and MIoT's context-aware adaptability. The actual output does not contradict the context, but it does not explicitly align with the metaphorical and conceptual themes discussed in the context. A possible correction would be to incorporate the conceptual parallel and adaptability themes mentioned in the context."
    }
]
 
Score: 1.0
Reason: The score is 1.00 because there are no factual alignments between the actual output and the context, and the actual output diverges significantly by omitting the conceptual and metaphorical themes from the context, demonstr




In [121]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: False
Score: 1.0
Reason: The score is 1.00 because there are no factual alignments between the actual output and the context, and the actual output diverges significantly by omitting the conceptual and metaphorical themes from the context, demonstrating a complete misalignment.


### Observation: Hallucination Check with Noisy Retrieval Context

In this case, we intentionally tested the model with a **noisy context**.

**Score:** 1.00 *(failed: for this metric lower is better, and 1.00 exceeds the 0.6 maximum threshold)*


---

### What happened?

- The model generated a detailed and accurate response about **MIoT improving hospital safety**.
- However, the noisy context discussed **relativity theory** and a **loose analogy between relativity and MIoT**, none of which is reflected in the generated answer.
- The judge's own reason notes that the output "does not contradict the context" but does not align with it either. The answer therefore appears to **ignore the given context**, and DeepEval returned a "no" verdict, which is scored as **hallucination** with respect to this context.

---

### Comparison with Previous Hallucination Test

| Scenario                   | Score | Verdict          |
|----------------------------|-------|------------------|
| Clean context (accurate)   | 0.00  | No hallucination |
| Noisy context (this test)  | 1.00  | Hallucination    |

---

### Takeaway
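This comparison shows that the hallucination score depends heavily on the context it is checked against, which highlights the importance of **effective retrieval filtering** in RAG pipelines to ensure reliable, grounded answers.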


## Custom LLM as a Judge (G-Eval)

**G-Eval** is a flexible evaluation framework in `deepeval` that uses a language model with **chain-of-thought (CoT)** reasoning to judge LLM responses based on **any custom criteria** you define.

It is the **most versatile** metric in the DeepEval suite and is well-suited for use cases that require **domain-specific rules**, nuanced assessments, or multiple evaluation dimensions.

### How It Works

You define your evaluation logic using **custom prompts** under `evaluation_steps`, allowing you to guide how the LLM should score and explain its decisions.

### Required Inputs for `GEval` in `deepeval`:

- `input`: The original user query (optional, depending on use case)  
- `actual_output`: The LLM-generated response  
- `expected_output` (optional): Human-verified answer for comparison  
- `context` (optional): Supporting documents for grounding or factual reference

G-Eval is ideal for building **task-specific benchmarks**, performing **multi-step evaluations**, or tailoring assessments to **real-world application requirements**.





### Custom Evaluation using G-Eval in DeepEval

In this setup, we are using the **G-Eval** framework to evaluate the LLM’s response based on **custom logic** defined through a series of steps.

Here’s what’s happening:

1. We define a `test_case` that includes:
   - The input question
   - The model’s actual output
   - A human-verified expected answer
   - The retrieval context (document chunks the model used)

2. We configure a **custom metric** named `"RAG Fact Checker"` using `G-Eval`.

3. The evaluation uses a step-by-step approach:
   - Extract statements from the generated output
   - Check if they answer the question and penalize irrelevant ones
   - Compare with the expected answer and penalize any missing or inaccurate claims
   - Ensure statements are backed by the retrieval context
   - Penalize any made-up or hallucinated content

4. The test runs using an LLM as the evaluator, producing a final score along with a reasoning trace.

This process gives a **comprehensive, explainable assessment** of how well the generated answer holds up in terms of **relevance, completeness, accuracy, and grounding**.


In [73]:


test_case = LLMTestCase(
    input=response['question'],
    actual_output=response["AI_generated_response"],
    expected_output=human_answer,
    retrieval_context=retrieved_context
)

metric = GEval(
    threshold=0.6,
    model=wrapped_model,
    name="RAG Fact Checker",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Create a list of statements from 'actual output'",
        "Validate if they are relevant and answers the given question in 'input', penalize if any statements are irrelevant",
        "Also Validate if they exist in 'expected output', penalize if any statements are missing or factually wrong",
        "Also validate if these statements are grounded in the 'retrieval context' and penalize if they are missing or factually wrong",
        "Finally also penalize if any statements seem to be invented or made up and do not make sense factually given the 'input' and 'retrieval context'"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT,
                       LLMTestCaseParams.ACTUAL_OUTPUT,
                       LLMTestCaseParams.EXPECTED_OUTPUT,
                       LLMTestCaseParams.RETRIEVAL_CONTEXT],
    verbose_mode=True
)

result = evaluate([test_case], [metric])

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |██████████████████████████████|100% (1/1) [Time Taken: 00:02,  2.32s/test case]

**************************************************
RAG Fact Checker (GEval) Verbose Logs
**************************************************

Criteria:
None 
 
Evaluation Steps:
[
    "Create a list of statements from 'actual output'",
    "Validate if they are relevant and answers the given question in 'input', penalize if any statements are irrelevant",
    "Also Validate if they exist in 'expected output', penalize if any statements are missing or factually wrong",
    "Also validate if these statements are grounded in the 'retrieval context' and penalize if they are missing or factually wrong",
    "Finally also penalize if any statements seem to be invented or made up and do not make sense factually given the 'input' and 'retrieval context'"
]
 
Score: 0.8
Reason: The actual output is detailed, addresses relevant aspects of MIoT improving hospital safety, and aligns well with both the input and retrieval context, emphasizing reduction of medical errors, interoperability, and infect




In [74]:
print('Success:', result.test_results[0].metrics_data[0].success)
print('Score:', result.test_results[0].metrics_data[0].score)
print('Reason:', result.test_results[0].metrics_data[0].reason)

Success: True
Score: 0.8
Reason: The actual output is detailed, addresses relevant aspects of MIoT improving hospital safety, and aligns well with both the input and retrieval context, emphasizing reduction of medical errors, interoperability, and infection management. However, some key points from expected output, such as the role of smart sensors and wearables in tracking vital signs or faster decision-making, are missing.
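
The cell above reads only the first metric of the first test case. When several test cases or metrics are evaluated together, the same three fields can be collected in a loop. The following is a minimal sketch: `summarize_results` is a hypothetical helper (not part of `deepeval`), and `demo_result` is a stand-in object that only mimics the `test_results[...].metrics_data[...]` shape used above so the snippet runs on its own.

```python
from types import SimpleNamespace


def summarize_results(result):
    """Flatten an evaluation result into one dict per (test case, metric) pair."""
    rows = []
    for case_idx, test_result in enumerate(result.test_results):
        for metric_data in test_result.metrics_data:
            rows.append({
                "case": case_idx,
                "success": metric_data.success,
                "score": metric_data.score,
                "reason": metric_data.reason,
            })
    return rows


# Stand-in for illustration only; in the notebook, pass the real `result` instead.
demo_result = SimpleNamespace(
    test_results=[
        SimpleNamespace(
            metrics_data=[
                SimpleNamespace(success=True, score=0.8, reason="Example reason"),
            ]
        )
    ]
)

for row in summarize_results(demo_result):
    print(row)
```

In the notebook itself, `summarize_results(result)` would be called on the object returned by `evaluate(...)`.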


### Observation: G-Eval (RAG Fact Checker) Evaluation

The **G-Eval custom metric** scored **0.80**, above the passing threshold of **0.6** set in the metric configuration, indicating that the response is **largely relevant, accurate, and grounded**, though not perfect.
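
The pass/fail verdict follows from comparing the score with the configured threshold. The sketch below illustrates that comparison; `passes_threshold` is a hypothetical helper, and the inclusive `>=` comparison is an assumption about how the library derives `success`, not something confirmed by the output above.

```python
def passes_threshold(score: float, threshold: float = 0.6) -> bool:
    """Return True when the score meets the threshold (inclusive comparison assumed)."""
    return score >= threshold


print(passes_threshold(0.8))   # score from this run against the 0.6 threshold
print(passes_threshold(0.55))  # a hypothetical lower score
```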

---

### What went well?
- The output **accurately addressed** the input question.
- It incorporated important points from the **retrieval context**, like **reducing preventable errors**, **device interoperability**, and **infection tracking**.
- No hallucinations were detected, and the content remained aligned with both the query and the supporting documents.

---

### What was penalized?
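- The answer **missed specific points** from the expected output, such as the role of **smart sensors** and **wearables** in tracking vital signs, and **faster decision-making**.
- It was also more elaborate than the concise human answer, although the judge's reason does not cite length as a penalty.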

---

### Comparison with Other Metrics

| Metric              | Score | Focus                                                        | Verdict                     |
|---------------------|-------|--------------------------------------------------------------|-----------------------------|
| Answer Relevancy    | 0.95  | Does it answer the user query?                               | Highly relevant             |
| Faithfulness        | 1.00  | Is it factually grounded in retrieved content?               | Fully faithful              |
| Hallucination       | 0.00  | Any unsupported/made-up content?                             | No hallucinations           |
| **G-Eval (Custom)** | 0.80  | Relevance, grounding, and completeness vs. the expected answer | Passed with room to improve |

---

### Takeaway

This G-Eval result acts as a **composite judgment**, blending **relevance, truthfulness, completeness, and grounding**.  
While the model performed strongly overall, minor omissions (like not naming specific IoT devices) slightly lowered the final score.  
Still, for this test case the result suggests that the RAG system generated a **high-quality, trustworthy, and well-contextualized response**.


> ## **Important Note**
> 
> **We are evaluating this result using the `deepeval` library, which uses an LLM to act as a judge.**
> 
> **On the backend, we are using `gpt-4o-mini` (Azure-hosted) to perform the evaluation.**
> 
> **Because the score comes from an LLM's reasoning, you will likely get slightly different results when you re-run the code, even with the same inputs.**
> 
> **This variability happens because LLM-based evaluations can be non-deterministic by nature. Small differences in phrasing or internal model behavior can influence how it interprets relevance, alignment, or context.**
> 
> **Therefore, while these evaluations provide valuable insights, treat individual scores as part of a broader trend rather than absolute judgments.**
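
One way to act on the note above is to repeat the evaluation several times and look at the spread of scores rather than a single value. The following is a minimal sketch under stated assumptions: `aggregate_scores` is a hypothetical helper, and `fake_judge` is a stub that stands in for a real judge call so the snippet runs without an API key. In the notebook, the callable would instead re-run `evaluate([test_case], [metric])` and return the resulting score.

```python
import random
import statistics
from typing import Callable, Dict


def aggregate_scores(score_fn: Callable[[], float], runs: int = 5) -> Dict[str, float]:
    """Call score_fn `runs` times and summarize the resulting scores."""
    scores = [score_fn() for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two data points
        "stdev": statistics.stdev(scores) if runs > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# Stub judge for illustration only: returns a score near 0.8 with small noise.
rng = random.Random(0)


def fake_judge() -> float:
    return round(0.8 + rng.uniform(-0.05, 0.05), 2)


summary = aggregate_scores(fake_judge, runs=5)
print(summary)
```

A small standard deviation across runs indicates a stable verdict; a large one suggests the evaluation steps or the threshold may need tightening before individual scores are trusted.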
