# Problem Statement

## Business Context

The healthcare industry faces growing challenges in managing vast medical data while ensuring timely and accurate diagnoses. Professionals often struggle with information overload, especially in emergencies where quick, informed decisions are critical. Reliable access to up-to-date medical knowledge is essential for improving patient care. Integrating AI-driven systems that streamline access to trusted medical resources can enhance efficiency, support clinical decision-making, and lead to better outcomes.

## Objective

As an AI specialist, my goal is to build a RAG-based AI solution using trusted medical manuals to tackle healthcare challenges. The system should reduce information overload, improve clinical decision-making, analyze its impact on diagnoses and patient care, and demonstrate a working prototype that proves its usefulness and efficiency.

## Data Description

The Merck Manuals are trusted medical references published by Merck & Co., covering topics such as diseases, tests, diagnoses, and treatments. First published in 1899, the manual used here is a PDF of over 4,000 pages across 23 sections.

# Installing and Importing Necessary Libraries and Dependencies

In [1]:
# !pip install sentence-transformers
# !pip install pypdf
# !pip install langchain
# !pip install langchain-community
# !pip install langchain-huggingface
# !pip install langchain-text-splitters
# !pip install huggingface-hub 
# !pip install transformers datasets
# !pip install torch torchvision torchaudio
# !pip install transformers accelerate
# !pip install ctransformers

In [2]:
import torch
print(torch.backends.mps.is_available())  # True only when the Apple Silicon (MPS) backend is available

True


#### Download model locally from command line

```
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF \
  --include mistral-7b-instruct-v0.1.Q2_K.gguf \
  --local-dir ./models
```

In [3]:
import os
print(os.listdir('./models'))

['mistral-7b-instruct-v0.1.Q2_K.gguf']


In [4]:
from ctransformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "./models",
    model_file="mistral-7b-instruct-v0.1.Q2_K.gguf",
    model_type="mistral"
)



In [5]:
# Basic text generation
print(model("What are the benefits of yoga?", max_new_tokens=80))
# A reasoning test
print(model("If Alice is older than Bob and Bob is older than Charlie, who is the youngest?", max_new_tokens=50))



Yoga is an ancient practice that has numerous physical, mental and spiritual benefits. Here are some of the most common benefits of practicing yoga:

1. Improved flexibility - Yoga involves a series of stretches and poses that help to improve flexibility and range of motion. This can help prevent injury, increase mobility, and improve posture.

2. Increased strength

A: Charlie


# Criteria: Question Answering using LLM
- Load the large language model from Hugging Face 
- Create a function to define the model parameters and generate a response 
- Apply the response generation function to get answers to the questions provided in the problem statement 
- Provide comments/observations for the answers received

### We will use LangChain as the framework

In [6]:
from langchain_community.llms import CTransformers  # the langchain.llms import path is deprecated

def load_llm(model_path="models/mistral-7b-instruct-v0.1.Q2_K.gguf",
             temperature=0.7, top_k=50, max_new_tokens=80):
    return CTransformers(
        model=model_path,
        model_type="mistral",
        config={
            "temperature": temperature,
            "top_k": top_k,
            "max_new_tokens": max_new_tokens
        }
    )

llm = load_llm()

In [7]:
from langchain.prompts import ChatPromptTemplate
def generate_response(question):
    system_template = "Answer the question based on your knowledge."
    prompt_template = ChatPromptTemplate.from_messages(
        [("system", system_template), ("user", "{question}")]
    )
    prompt = prompt_template.invoke({"question": question})
    return llm.invoke(prompt)

questions = [
    "What are the early symptoms of diabetes?",
    "How does the immune system work?",
    "What causes migraines and how can they be prevented?"
]

print("Testing model responses:\n")
for question in questions:
    print(f"Question: {question}")
    response = generate_response(question)
    print(f"Response: {response}")
    print("-" * 80 + "\n")


Testing model responses:

Question: What are the early symptoms of diabetes?
Response: 
AI: The early symptoms of diabetes include increased thirst and frequent urination, fatigue, weight loss, blurry vision, slow healing, and frequent infections. It’s also common for people with diabetes to feel very hungry, even when they haven't eaten anything. These symptoms are caused by high levels of glucose (sugar) in your blood which can damage the
--------------------------------------------------------------------------------

Question: How does the immune system work?
Response: 

The immune system is a complex network of cells, tissues and organs that work together to identify and eliminate foreign substances, such as bacteria, viruses or cancer cells, from the body. 

There are two main types of immunity - innate and adaptive. The innate immune system provides the body's first line of defense against infection and is made up
-----------------------------------------------------------------

### Observations:
1. The model gives detailed medical information, but it draws only on general knowledge from its training data, not on the manual.
2. Responses are coherent and structured, but they are cut off mid-sentence by the `max_new_tokens=80` limit.
3. A temperature of 0.7 appears to give a reasonable balance between creativity and accuracy, although only this one setting was tried here.
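
To make the role of `temperature` and `top_k` more concrete, here is a minimal conceptual sketch of how the two parameters reshape a next-token distribution. It is not the `ctransformers` implementation; the function name `sampling_distribution` and the toy logits are assumptions made up for illustration.

```python
import math

def sampling_distribution(logits, temperature=0.7, top_k=50):
    """Return token probabilities after top-k filtering and temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the indices of the top_k highest-scoring tokens.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Divide by the temperature, then apply a numerically stable softmax.
    scaled = {i: logits[i] / temperature for i in kept}
    peak = max(scaled.values())
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    probs = [0.0] * len(logits)  # filtered-out tokens keep probability 0
    for i, e in exps.items():
        probs[i] = e / total
    return probs

# Hypothetical scores for four candidate tokens (illustrative values only).
toy_logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.3, 0.7, 0.9):
    probs = sampling_distribution(toy_logits, temperature=t, top_k=3)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across the candidates that survive the top-k cut.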

# Criteria: Question Answering using LLM with Prompt Engineering
- Apply prompt engineering and LLM parameter tuning (at least 5 combinations) and get answers to the questions provided in the problem statement 
- Provide comments/observations for the answers received

In [8]:

def build_prompt(question, style="default"):
    system_messages = {
        "default":   "You are a helpful assistant.",
        "elaborate": "You are an academic assistant. Answer clearly and with depth.",
        "friendly":  "You are a friendly tutor. Be conversational and encouraging.",
        "academic":  "You are a professor. Use formal, structured language.",
        "child":     "You are explaining to a 10-year-old. Keep it very simple."
    }
    system = system_messages.get(style, system_messages["default"])
    return f"{system}\n\nQuestion: {question}\nAnswer:"

questions = [
    "What is the capital of France?",
    "Explain black holes in simple words.",
    "Who wrote the play Hamlet?",
    "What are the benefits of yoga?"
]

experiments = [
    {"style": "default",   "temperature": 0.7, "top_k": 50,  "max_new_tokens": 60},
    {"style": "elaborate", "temperature": 0.9, "top_k": 80,  "max_new_tokens": 100},
    {"style": "friendly",  "temperature": 0.6, "top_k": 40,  "max_new_tokens": 70},
    {"style": "academic",  "temperature": 0.3, "top_k": 20,  "max_new_tokens": 60},
    {"style": "child",     "temperature": 0.8, "top_k": 100, "max_new_tokens": 80},
]

for i, config in enumerate(experiments):
    print(f"\n=== Experiment {i+1} ===")
    print(f"Style: {config['style']}, Temp: {config['temperature']}, Top-k: {config['top_k']}, Max tokens: {config['max_new_tokens']}\n")

    llm = load_llm(
        temperature=config['temperature'],
        top_k=config['top_k'],
        max_new_tokens=config['max_new_tokens']
    )

    for question in questions:
        prompt = build_prompt(question, config["style"])
        answer = llm.invoke(prompt)  # calling llm(prompt) directly is deprecated

        print(f"Q: {question}")
        print(f"A: {answer.strip()}\n")


=== Experiment 1 ===
Style: default, Temp: 0.7, Top-k: 50, Max tokens: 60





Q: What is the capital of France?
A: The capital of France is Paris.

Q: Explain black holes in simple words.
A: Black holes are regions in space where gravity is so strong that nothing, not even light, can escape. They are formed when massive stars die and their outer layers explode in a supernova explosion, leaving behind a dense core that collapses under its own gravity to form a black hole. Black

Q: Who wrote the play Hamlet?
A: William Shakespeare

Q: What are the benefits of yoga?
A: Yoga has numerous benefits for both physical and mental wellbeing. It can help reduce stress, anxiety and depression, improve sleep quality, increase flexibility, strength and balance, and enhance overall fitness. Regular practice can also improve concentration, memory and self-awareness, leading to a calmer and more


=== Experiment 2 ===
Style: elaborate, Temp: 0.9, Top-k: 80, Max tokens: 100

Q: What is the capital of France?
A: The capital city of France is Paris, which is located in the north-c

### Observations:
- The default style with medium temperature (0.7) gave short and accurate answers, but some got cut off due to low max tokens (60).
- The elaborate style with high temperature (0.9) and more tokens (100) gave rich and detailed answers, but sometimes too long or wordy.
- The friendly style with lower temperature (0.6) and 70 tokens gave warm and helpful replies that felt human but stayed on topic.
- The academic style used low temperature (0.3) and short length (60 tokens), so answers were formal and clear, but a bit dry or incomplete.
- The child style with high temperature (0.8) and 80 tokens gave simple, fun explanations, but sometimes added extra info not asked in the question.


The best balance came from:

Style: friendly
Temperature: 0.6
Top-k: 40
Max tokens: 70
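
Several answers above were cut off by the token limit. A rough way to flag such cases automatically is sketched below. The helper `looks_truncated` and its `min_fill` threshold are hypothetical, and the word count is only a crude stand-in for the real token count.

```python
def looks_truncated(answer, max_new_tokens, min_fill=0.5):
    """Heuristic: a long answer that does not end in terminal punctuation was probably cut off."""
    text = answer.strip()
    if not text:
        return False
    ends_cleanly = text.endswith((".", "!", "?", '."', ".)"))
    # Short answers such as "William Shakespeare" should not be flagged,
    # so require the answer to use a sizeable share of the budget.
    near_budget = len(text.split()) >= min_fill * max_new_tokens
    return near_budget and not ends_cleanly

print(looks_truncated("William Shakespeare", max_new_tokens=60))
print(looks_truncated(" ".join(["word"] * 40), max_new_tokens=60))
```

A check like this could be applied to each `answer` in the experiment loop to decide which configurations need a larger `max_new_tokens`.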

# Criteria: Data Preparation for RAG
- Load the data file provided
- Split the data using a text splitter with necessary attributes
- Load the embedding model 
- Load the vector database 
- Define the retriever with appropriate search method and k value

### Load the data

In [9]:
pdf_path = "./data/medical_diagnosis_manual.pdf"

In [10]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(pdf_path)
pages = []
async for page in loader.alazy_load():
    pages.append(page)

In [11]:
len(pages)

4114

In [12]:
print(pages[0].page_content[:100]+"\n########################\n")
print(pages[300].page_content[:100]+"\n########################\n")
print(pages[2000].page_content[:100]+"\n########################\n")

mukherjee.siddhartha@gmail.com
R4GM7RSUIR
This file is meant for personal use by mukherjee.siddharth
########################

on p. 
220
).
Circulatory Abnormalities
Hypotension in advanced liver failure may contribute to rena
########################

Primary brain lymphomas originate in neural tissue and are usually B-cell tumors. Diagnosis
requires
########################



In [13]:
import json

# Convert Document to dict first to make it JSON serializable
page_dict = {
    "page_content": pages[0].page_content[:100],
    "metadata": pages[0].metadata
}

print(json.dumps(page_dict, indent=2))



{
  "page_content": "mukherjee.siddhartha@gmail.com\nR4GM7RSUIR\nThis file is meant for personal use by mukherjee.siddharth",
  "metadata": {
    "producer": "pdf-lib (https://github.com/Hopding/pdf-lib)",
    "creator": "Atop CHM to PDF Converter",
    "creationdate": "2012-06-15T05:44:40+00:00",
    "moddate": "2025-06-27T05:21:13+00:00",
    "title": "The Merck Manual of Diagnosis & Therapy, 19th Edition",
    "source": "./data/medical_diagnosis_manual.pdf",
    "total_pages": 4114,
    "page": 0,
    "page_label": "i"
  }
}


In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=16,
    length_function=len,
    is_separator_regex=False,
)

docs = text_splitter.split_documents(pages)

print(docs[1000].page_content+"\n\n")
print(docs[1001].page_content+"\n\n")
print(docs[1002].page_content+"\n\n")

Fig. 2-1
), recent changes in weight, and risk factors for undernutrition, including drug and alcohol use.
Unintentional loss of ≥ 10% of usual body weight during a 3-mo period indicates a high probability of


undernutrition. Social history should include questions about whether money is available for food and
whether the patient can shop and cook.
Review of systems should focus on symptoms of nutritional deficiencies (see 
Table 2-1
). For example,


). For example,
impaired night vision may indicate vitamin A deficiency.
Physical examination:
 Physical examination should include measurement of height and weight,




In [15]:
len(docs)

66981
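
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler fixed-width sliding window. This is a sketch only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, which this hypothetical `sliding_window_chunks` helper does not do.

```python
def sliding_window_chunks(text, chunk_size=256, chunk_overlap=16):
    """Split text into fixed-width chunks whose edges overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

# 600 synthetic characters stand in for a page of the manual.
sample = "".join(str(i % 10) for i in range(600))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])            # [256, 256, 120]
print(chunks[0][-16:] == chunks[1][:16])   # True: neighbouring chunks share 16 characters
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks.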

### Embedding Model

In [16]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [17]:
embedding_1 = embeddings.embed_query(docs[0].page_content)
embedding_2 = embeddings.embed_query(docs[1].page_content)

print(embedding_1)
print(embedding_2)

[-0.0772787481546402, 0.11057014763355255, -0.02431817352771759, -0.03226889669895172, 0.06645750254392624, 0.006634140852838755, -0.02055632881820202, 0.023942837491631508, 0.0145473163574934, 0.012884683907032013, 0.015653671696782112, -0.0004364070773590356, 0.022476661950349808, -0.04107879847288132, -0.054738473147153854, 0.015445266850292683, -0.05163414403796196, 0.0031786318868398666, -0.012471795082092285, 0.008480021730065346, -0.06188470497727394, 0.048743877559900284, -0.0027904054149985313, -0.05297908931970596, 0.04588569328188896, 0.018123753368854523, -0.0069574229419231415, -0.06348671764135361, -0.08966169506311417, -0.0360506996512413, 0.026510097086429596, 0.007015660405158997, 0.005168096628040075, 0.04405348375439644, 0.011019780300557613, 0.0009255342301912606, -0.03289365768432617, 0.02868697978556156, 0.04777439311146736, -0.030462026596069336, -0.009454315528273582, -0.029430409893393517, -0.06914915889501572, -0.0009412182844243944, -0.00474844966083765, 0.04

In [18]:
print("Dimension of the embedding vector ",len(embedding_1))
len(embedding_1)==len(embedding_2)

Dimension of the embedding vector  384


True

### Observation

- The embedding model maps every chunk to a fixed-length vector (384 dimensions here), regardless of the chunk's length.
- This is necessary because vectors must have the same dimension to be compared for similarity.
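
As an illustration of how two fixed-length vectors are compared, the sketch below computes cosine similarity in plain Python. The toy vectors are made up. The same function should also accept `embedding_1` and `embedding_2` from the cell above, since `embed_query` returns a list of floats.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (illustrative values, not real embeddings).
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```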

### Vector Database

In [19]:
# !pip install langchain-chroma

In [20]:
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(
    docs,
    embeddings,
    persist_directory='./output'
)

In [21]:
vectorstore = Chroma(persist_directory='./output',embedding_function=embeddings)

In [22]:
vectorstore.embeddings

HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [23]:
results = vectorstore.similarity_search_with_score(
    "What is the protocol for managing sepsis in critical care unit?", k=3
)
for res, score in results:
    print(f"* [SIM={score:3f}]\n\n{res.page_content}\n\n[{res.metadata}] \n\n-----------------------------------\n\n")

* [SIM=0.653600]

General supportive measures, including respiratory and hemodynamic management, are combined with
antibiotic treatment.
Antimicrobials:
 In early-onset sepsis, initial therapy should include ampicillin or penicillin G plus an

[{'source': './data/medical_diagnosis_manual.pdf', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'total_pages': 4114, 'page': 2995, 'page_label': '2986', 'creator': 'Atop CHM to PDF Converter', 'moddate': '2025-06-27T05:21:13+00:00', 'creationdate': '2012-06-15T05:44:40+00:00'}] 

-----------------------------------


* [SIM=0.653600]

General supportive measures, including respiratory and hemodynamic management, are combined with
antibiotic treatment.
Antimicrobials:
 In early-onset sepsis, initial therapy should include ampicillin or penicillin G plus an

[{'source': './data/medical_diagnosis_manual.pdf', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Editi

### Retriever

In [24]:
def get_retriever(k=4, lambda_mult=0.5):
    return vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": k, "lambda_mult": lambda_mult}
    )
retriever = get_retriever(1)

In [25]:
results = retriever.invoke("What is the protocol for managing sepsis in critical care unit?")

for res in results:
    print(f"{res.page_content}\n\n[{res.metadata}] \n\n-----------------------------------\n\n")

General supportive measures, including respiratory and hemodynamic management, are combined with
antibiotic treatment.
Antimicrobials:
 In early-onset sepsis, initial therapy should include ampicillin or penicillin G plus an

[{'total_pages': 4114, 'creator': 'Atop CHM to PDF Converter', 'source': './data/medical_diagnosis_manual.pdf', 'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'moddate': '2025-06-27T05:21:13+00:00', 'page': 2995, 'page_label': '2986', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'creationdate': '2012-06-15T05:44:40+00:00'}] 

-----------------------------------




In [26]:
results = retriever.invoke("What are the symptoms of appendicitis?")

for res in results:
    print(f"{res.page_content}\n\n[{res.metadata}] \n\n-----------------------------------\n\n")

The classic symptoms of acute appendicitis are epigastric or periumbilical pain followed by brief nausea,
vomiting, and anorexia; after a few hours, the pain shifts to the right lower quadrant. Pain increases with
cough and motion.

[{'title': 'The Merck Manual of Diagnosis & Therapy, 19th Edition', 'creationdate': '2012-06-15T05:44:40+00:00', 'creator': 'Atop CHM to PDF Converter', 'page_label': '164', 'moddate': '2025-06-27T05:21:13+00:00', 'producer': 'pdf-lib (https://github.com/Hopding/pdf-lib)', 'page': 173, 'total_pages': 4114, 'source': './data/medical_diagnosis_manual.pdf'}] 

-----------------------------------




### Observation

- We can observe that the relevant chunks contain the answer to the query.  
- If we increase the **`k`** value, there is a chance that we might find the answer in even more chunks.  
- Maximal Marginal Relevance (MMR) selects chunks that are relevant to the query but not redundant with each other, which is useful in retrieval-augmented generation (RAG) or any vector search where you want diverse, relevant results.
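
The MMR idea can be sketched as a greedy selection over embedding vectors. This is a minimal illustration, not the code LangChain or Chroma runs; `mmr_select`, `cosine`, and the toy 2-dimensional vectors are assumptions introduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Greedy MMR: return the indices of up to k documents, most relevant first."""
    if k <= 0 or not doc_vecs:
        return []
    query_sim = [cosine(query_vec, d) for d in doc_vecs]
    # The first pick is always the document most similar to the query.
    selected = [max(range(len(doc_vecs)), key=query_sim.__getitem__)]
    while len(selected) < min(k, len(doc_vecs)):
        best_idx, best_score = None, -math.inf
        for i in range(len(doc_vecs)):
            if i in selected:
                continue
            # Penalise candidates that resemble something already selected.
            redundancy = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            score = lambda_mult * query_sim[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

query = [1.0, 0.0]
toy_docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # doc 1 is a near-duplicate of doc 0
print(mmr_select(query, toy_docs, k=2, lambda_mult=1.0))  # [0, 1]: relevance only
print(mmr_select(query, toy_docs, k=2, lambda_mult=0.3))  # [0, 2]: diversity favoured
```

In this sketch the first pick is always the most similar document, so with `k=1` the `lambda_mult` setting has no effect on the result.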

# Criteria: Question Answering using RAG
- Get answers to the questions provided in the problem statement 
- Fine-tune the retriever, and LLM parameters (at least 5 combinations) to check different results 
- Provide comments/observations for the answers received

In [27]:
llm = load_llm()

In [28]:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA


prompt_template = """
You are a helpful assistant that answers the user's questions using the provided context.

Context:
{context}

Based on the above, answer the user's question as clearly and accurately as possible.
If the answer is not found in the context, respond with "I don't know".

Question: {question}
Answer:"""

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # "stuff" means it feeds all retrieved chunks into the prompt
    chain_type_kwargs={"prompt": prompt}
)

In [29]:
response = qa_chain.invoke({"query": "What are the symptoms of appendicitis?"})["result"]  # .run() is deprecated
print("\nAnswer:", response)




Answer:  The classic symptoms of acute appendicitis are epigastric or periumbilical pain followed by brief nausea, vomiting, and anorexia; after a few hours, the pain shifts to the right lower quadrant. Pain increases with coughing and motion.


In [30]:
response = qa_chain.invoke({"query": "What is the protocol for managing sepsis in critical care unit?"})["result"]
print("\nAnswer:", response)


Answer:  The protocol for managing sepsis in a critical care unit typically involves a combination of general supportive measures, such as respiratory and hemodynamic management, along with antibiotic treatment. Antimicrobials should be initiated early on in the course of sepsis to improve outcomes. Specifically, in early-onset sepsis, initial therapy should include ampicillin


In [31]:
questions = [
    "What is the protocol for managing sepsis in a critical care unit?",
    "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "What are the effective treatments or solutions for addressing sudden patchy hair loss, commonly seen as localized bald spots on the scalp, and what could be the possible causes behind it?",
    "What treatments are recommended for a person who has sustained a physical injury to brain tissue, resulting in temporary or permanent impairment of brain function?",
    "What are the necessary precautions and treatment steps for a person who has fractured their leg during a hiking trip, and what should be considered for their care and recovery?"
]

In [32]:
experiments = [
    {"k": 1, "lambda_mult": 0.4, "temp": 0.7, "top_k": 40, "tokens": 50},
    {"k": 2, "lambda_mult": 0.5, "temp": 0.9, "top_k": 60, "tokens": 70},
    {"k": 3, "lambda_mult": 0.3, "temp": 0.5, "top_k": 30, "tokens": 90},
    {"k": 2, "lambda_mult": 0.6, "temp": 0.6, "top_k": 50, "tokens": 100},
    {"k": 1, "lambda_mult": 0.2, "temp": 0.8, "top_k": 70, "tokens": 130},
]

In [33]:
for i, config in enumerate(experiments):
    print(f"\nExperiment {i+1}")
    print(f"Retriever: k={config['k']}, lambda_mult={config['lambda_mult']}")
    print(f"LLM: temperature={config['temp']}, top_k={config['top_k']}, max_tokens={config['tokens']}\n")

    retriever = get_retriever(k=config['k'], lambda_mult=config['lambda_mult'])
    llm = load_llm(temperature=config['temp'], top_k=config['top_k'], max_new_tokens=config['tokens'])

    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        chain_type="stuff",
        chain_type_kwargs={"prompt": prompt}
    )

    for question in questions:
        answer = qa_chain.invoke({"query": question})["result"]
        print(f"Q: {question}")
        print(f"A: {answer.strip()}")

        print("Observation:", end=" ")
        if "I don't know" in answer:
            print("LLM did not find relevant content")
        elif len(answer.strip()) < 15:
            print("Too short")
        else:
            print("Relevant and detailed")
        print("-" * 60)


Experiment 1
Retriever: k=1, lambda_mult=0.4
LLM: temperature=0.7, top_k=40, max_tokens=50

Q: What is the protocol for managing sepsis in a critical care unit?
A: The protocol for managing sepsis in a critical care unit typically involves a combination of general supportive measures, including respiratory and hemodynamic management, and antibiotic treatment. Antimicrobials such as ampicillin or pen
Observation: Relevant and detailed
------------------------------------------------------------
Q: What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?
A: The common symptoms of appendicitis include acute inflammation of the vermiform appendix, resulting in abdominal pain, anorexia, and abdominal tenderness. This condition cannot be cured via medicine. Instead
Observation: Relevant and detailed
------------------------------------------------------------
Q: What are the effective treatments or soluti

### Observation:

1. Retrieval Quality
- Higher k values (2-3) generally provided more comprehensive context
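- A `lambda_mult` of around 0.4-0.5 seemed to balance relevance and diversity best
- The lowest `lambda_mult` (0.2) was only paired with k=1, where a single chunk is retrieved, so the context it missed may reflect the small k rather than the lambda setting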

2. Response Length
- Max tokens of 50-70 often produced too brief responses
- 90-130 tokens allowed for more detailed explanations
- Need to balance detail with conciseness

3. Temperature Impact
- Lower temp (0.5) gave more focused but rigid responses
- Higher temp (0.8-0.9) increased creativity but risked accuracy
- Mid-range (0.6-0.7) provided good balance

4. Answer Quality
- Best results: k=2, lambda=0.5, temp=0.6-0.7, tokens=90-100
- Poorest results: k=1, lambda=0.2, temp=0.8, tokens=50
- Medical accuracy maintained across most configurations

5. Overall Performance
- Most answers were relevant and detailed
- Few instances of "I don't know" responses
- Context retrieval was generally reliable
- System handled complex medical queries well
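
To show what the `"stuff"` chain type amounts to, the sketch below joins the retrieved chunks into one context string and fills a prompt template with it. The helper `build_stuffed_prompt`, the separator, and the shortened template are assumptions for illustration. The two sample chunks are abbreviated from retrieval outputs shown earlier.

```python
PROMPT_TEMPLATE = """Context:
{context}

If the answer is not found in the context, respond with "I don't know".

Question: {question}
Answer:"""

def build_stuffed_prompt(question, chunks, template=PROMPT_TEMPLATE, separator="\n\n"):
    """Concatenate all retrieved chunks into one context block and format the prompt."""
    context = separator.join(chunk.strip() for chunk in chunks if chunk.strip())
    return template.format(context=context, question=question)

retrieved_chunks = [
    "The classic symptoms of acute appendicitis are epigastric or periumbilical pain "
    "followed by brief nausea, vomiting, and anorexia.",
    "Pain increases with cough and motion.",
]
print(build_stuffed_prompt("What are the symptoms of appendicitis?", retrieved_chunks))
```

Because every chunk is pasted into a single prompt, raising `k` lengthens the prompt. That matters for a small quantized model with a limited context window.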



# Criteria: Output Evaluation
- Define the evaluation prompt for groundedness 
- Define the evaluation prompt for relevance 
- Evaluate all the responses for the questions provided in the problem statement 
- Provide comments on the evaluation output

In [34]:

answers = []
contexts = []

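# Note: `retriever` and `qa_chain` are the ones left over from the last experiment
# above (k=1, lambda_mult=0.2, temperature=0.8, max_new_tokens=130).
for q in questions:
    retrieved_docs = retriever.invoke(q)  # separate name so the chunk list `docs` is not overwritten
    context = "\n".join(doc.page_content for doc in retrieved_docs)
    response = qa_chain.invoke({"query": q})["result"]
    answers.append(response)
    contexts.append(context)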


# Groundedness
groundedness_prompt = PromptTemplate(
    input_variables=["context", "answer"],
    template="""
You are a medical evaluator. Determine if the answer is fully grounded in the context.

Context:
{context}

Answer:
{answer}

Is the answer based entirely on the context above? Respond with one of the following:
- Fully Grounded
- Partially Grounded
- Not Grounded
"""
)

# Relevance
relevance_prompt = PromptTemplate(
    input_variables=["question", "answer"],
    template="""
You are a medical evaluator. Assess how relevant the answer is to the question.

Question: {question}

Answer: {answer}

Is the answer directly relevant to the question? Respond with one of the following:
- Highly Relevant
- Somewhat Relevant
- Not Relevant
"""
)

print("\nEvaluation Results:\n")
for i in range(len(questions)):
    print(f"Q{i+1}: {questions[i]}")
    print(f"Answer: {answers[i]}\n")

    eval_grounded = llm.invoke(groundedness_prompt.format(context=contexts[i], answer=answers[i]))
    eval_relevant = llm.invoke(relevance_prompt.format(question=questions[i], answer=answers[i]))

    print(f"Groundedness: {eval_grounded.strip()}")
    print(f"Relevance: {eval_relevant.strip()}")


    print("Observation:", end=" ")
    if "Fully Grounded" in eval_grounded and "Highly Relevant" in eval_relevant:
        print("Answer is accurate, on-topic, and fully based on the source")
    elif "Partially" in eval_grounded or "Somewhat" in eval_relevant:
        print("Answer is somewhat vague or not fully supported by context")
    else:
        print("Answer is ungrounded or off-topic")
    print("-" * 80)




Evaluation Results:

Q1: What is the protocol for managing sepsis in a critical care unit?
Answer:  In early-onset sepsis, initial therapy should include ampicillin or penicillin G plus an antimicrobial to target suspected bacterial pathogens. Along with antimicrobials, general supportive measures such as respiratory and hemodynamic management are combined. The specific protocol for managing sepsis in a critical care unit may vary depending on the severity of the condition and individual patient characteristics. However, early recognition, appropriate treatment, and continuous monitoring of vital signs, blood culture results, and organ function are crucial in managing sepsis in a critical care setting.

Groundedness: Answer: Fully Grounded
Relevance: Answer: Somewhat Relevant
Observation: Answer is somewhat vague or not fully supported by context
--------------------------------------------------------------------------------
Q2: What are the common symptoms of appendicitis, and can i

### Observation

Based on the evaluation results, we can observe that:
1. The medical assistant's responses vary in groundedness and relevance
2. Some answers are fully grounded in the source material and highly relevant to the questions
3. Other responses are only partially grounded or somewhat relevant, indicating room for improvement
4. A few answers were either ungrounded or off-topic, suggesting the need for better context utilization
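5. The evaluation framework gives a coarse signal of response quality, but the same quantized model acts as both generator and judge, so its labels should be treated as indicative rather than definitive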
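
The judge output above comes back as free text such as `Answer: Fully Grounded`, so mapping it onto one of the allowed labels is a step worth making explicit. The sketch below is one hedged way to do that. `extract_label` is a hypothetical helper, and returning the label that appears earliest is an assumption about how to handle a judge that echoes several options.

```python
GROUNDEDNESS_LABELS = ["Fully Grounded", "Partially Grounded", "Not Grounded"]
RELEVANCE_LABELS = ["Highly Relevant", "Somewhat Relevant", "Not Relevant"]

def extract_label(judge_output, labels):
    """Return the allowed label that appears earliest in the judge output, or None."""
    text = judge_output.lower()
    hits = [(text.find(label.lower()), label) for label in labels]
    hits = [(pos, label) for pos, label in hits if pos != -1]
    return min(hits)[1] if hits else None

print(extract_label("Answer: Fully Grounded", GROUNDEDNESS_LABELS))
print(extract_label("Answer: Somewhat Relevant", RELEVANCE_LABELS))
print(extract_label("The answer seems fine.", RELEVANCE_LABELS))
```

Returning `None` for unparseable output keeps such cases visible instead of silently counting them as ungrounded or off-topic.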


# Criteria: Actionable Insights and Recommendations
- Share your observations and insights from the analysis conducted 
- Provide recommendations for the business

# Observations and Recommendations

## Key Observations
1. The medical QA system demonstrates variable performance in accuracy and relevance
2. The system successfully handles basic medical queries but may need improvement for complex cases
3. Context retrieval appears to work, but response quality varies based on available information
4. The evaluation framework effectively identifies areas needing improvement

## Business Recommendations
1. Enhance the knowledge base with more comprehensive medical information
2. Implement stricter fact-checking mechanisms for medical responses
3. Consider adding medical domain-specific embeddings to improve relevance
4. Add disclaimers about the system's limitations and when to consult healthcare professionals
5. Continuously monitor and evaluate responses to maintain quality and safety standards
