[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1Y2sT9fbvpn21KfKG-9sDBelT0SE3_ArU/view?usp=sharing)

# Evaluating the Medical Assistant with Flotorch Eval

This notebook walks through evaluating a **Medical Assistant**—a retrieval-augmented agent that answers questions about clinical conditions, treatments, and patient education—using the **Flotorch SDK** together with the **Flotorch Eval** library.

---

### **Use Case Overview**

The assistant guides clinicians and patients across:
- **Cardiovascular care** — hypertension workups, coronary artery disease management, heart failure escalation plans
- **Respiratory medicine** — asthma action plans, COPD therapy ladders, pneumonia treatment guidelines
- **Digestive health** — gastrointestinal differential diagnoses, nutrition considerations, post-operative monitoring
- **Endocrine and metabolic disorders** — diabetes protocols, thyroid dysfunction pathways, hormone replacement nuances
- **Neurology and behavioral health** — migraine regimens, neurodegenerative care plans, mental health screenings

It retrieves evidence-backed content from a **Comprehensive Medical Reference Guide** (diagnostic criteria, treatment pathways, medication guidance, patient handouts) so that generated advice remains grounded in current clinical practice.

This notebook evaluates the **answer relevance** of those responses—confirming that each answer stays on-topic and directly addresses the medical question.

---

### **Notebook Workflow**

We follow a structured evaluation loop:

1. **Iterate Questions** – Read each prompt from `gt.json` (ground truth set).  
2. **Retrieve Context** – Query the medical knowledge base for the most relevant passages.  
3. **Generate Answer** – Compose a response with the shared system prompt and inference LLM.  
4. **Store Results** – Record question, context, generated answer, and expected answer.  
5. **Evaluate Answer Relevance** – Score topical alignment with `LLMEvaluator`.  
6. **Display Results** – Compare Ragas and DeepEval outputs side by side.

---

### **Metric Evaluated — Answer Relevance**

We focus on a single answer-quality signal: **Answer Relevance**. It asks whether the response directly addresses the user’s question, independent of factual accuracy. Scores near 1.0 mean the assistant stayed on topic, mid-range scores expose partial coverage, and low scores flag answers that ignored the request—giving us an immediate cue for debugging prompts, retrieval, or guardrails.

#### Ragas Answer Relevance (Flotorch `evaluation_engine="ragas"`)
- Generates auxiliary questions from the model’s answer using an LLM.  
- Embeds those questions and the original prompt to measure similarity.  
- Reports an average similarity score (0.0–1.0) that reflects topical alignment.

#### DeepEval Answer Relevance (Flotorch `evaluation_engine="deepeval"`)
- Extracts intent-bearing snippets from the answer with DeepEval’s judging LLM.  
- Labels each snippet as relevant or irrelevant to the user question.  
- Applies optional thresholds (e.g., `0.8`) to trigger pass/fail outcomes and surfaces rationales.

Both engines share the same objective ("does the answer actually address the question?") but emphasize different diagnostics: Ragas provides a fast, embedding-based similarity score, while DeepEval adds snippet-level verdicts and guardrail-friendly thresholds.

---

### **Evaluation Engine**

- `evaluation_engine="auto"` — lets Flotorch Eval mix Ragas and DeepEval according to the priority routing described in the [flotorch-eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop) so that retrieval metrics default to the best available backend.
- `evaluation_engine="ragas"` — keeps every metric inside the [**Ragas**](https://docs.ragas.io/en/stable/getstarted/) rubric for retrieval-aware evaluations (faithfulness, answer relevance, context precision, aspect critic, etc.).
- `evaluation_engine="deepeval"` — routes metrics through DeepEval’s judging engine, enabling custom prompts, thresholds, and rationales while still capturing Flotorch gateway telemetry.

We begin with the Ragas-only flow and then repeat the run with the DeepEval backend to highlight differences in scoring and diagnostics.

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Knowledge Base.  
- `gt.json` containing question–answer pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates.

---
#### **Documentation References**

- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) for reference pipelines.  
- [**Ragas Answer Relevance**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/) - detailed explanation of the metric.
- [**DeepEval Answer Relevance**](https://deepeval.com/docs/metrics-answer-relevancy) - detailed explanation of the metric.

---


## 1. Install Dependencies

First, we install the required packages, pinned to specific versions.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval`: The evaluation library. The cell below installs the base package; if dependencies for LLM-based (Ragas) evaluations are missing in your environment, try the `[llm]` extra (`flotorch-eval[llm]`).
-   `opentelemetry-instrumentation-httpx`: Pinned explicitly in a second install command.

In [None]:
# Install the Flotorch SDK and Flotorch Eval (pinned versions)
# You can safely ignore dependency errors during installation.

%pip install flotorch==3.1.0b1 flotorch-eval==2.0.0
%pip install opentelemetry-instrumentation-httpx==0.58b0

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a stronger, separate model such as `flotorch/gpt-4o`, which helps reduce self-evaluation bias and improve judgment quality.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `digital-twin` |

In [None]:
import getpass  # Prompt for secrets without echoing them in the notebook

# Flotorch credentials
try:
    FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
    print("Success")
except getpass.GetPassWarning as e:
    print(f"Warning: {e}")
    FLOTORCH_API_KEY = ""

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # Flotorch gateway endpoint

inference_model_name = "flotorch/<your-model-name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging answer quality
evaluation_embedding_model_name = "flotorch/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>"  # Knowledge Base ID

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Package results and run metric scoring for the generated answers



In [None]:
#Required imports
import json
from typing import List
from tqdm import tqdm # Use standard tqdm for simple progress bars
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval.llm_eval import display_llm_evaluation_results


print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) from local files.

### Files Required

**1. `gt.json` (Ground Truth)**  
Contains question-answer pairs for evaluation. Each `answer` is the expected correct response.

```json
[
  {
    "question": "What is the definition of hypertension according to the guide?",
    "answer": "Hypertension is defined as persistent elevation of blood pressure above 140/90 mmHg."
  },
  {
    "question": "What is the target blood pressure for patients with chronic kidney disease?",
    "answer": "The target blood pressure for CKD patients is less than 130/80 mmHg."
  }
]
```

**2. `prompt.json` (Agent Prompts)**  
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a helpful Medical assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```
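Malformed ground-truth entries are easier to catch before the experiment loop than after it. The sketch below checks the `gt.json` layout shown above (a list of objects with non-empty `question` and `answer` strings). `validate_ground_truth` is a hypothetical helper written for this notebook, not part of the Flotorch SDK.

```python
import json


def validate_ground_truth(items):
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    if not isinstance(items, list):
        return ["top-level JSON value must be a list of objects"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected an object, got {type(item).__name__}")
            continue
        for key in ("question", "answer"):
            value = item.get(key)
            # Reject missing keys, non-string values, and whitespace-only strings.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {i}: missing or empty '{key}'")
    return problems


# Inline sample standing in for an uploaded gt.json file.
sample = json.loads("""
[
  {"question": "What is the target blood pressure for patients with chronic kidney disease?",
   "answer": "The target blood pressure for CKD patients is less than 130/80 mmHg."},
  {"question": "   ", "answer": "Orphaned answer"},
  {"question": "Missing answer"}
]
""")

problems = validate_ground_truth(sample)
for problem in problems:
    print(problem)
```

In the notebook, the same function could be called on `ground_truth` right after `json.load`.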

### Instructions
Run the next cell and upload `gt.json`, then `prompt.json`, when prompted. The cell uses Colab's `files.upload()` and reads the first file uploaded at each step.

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your prompt file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully. Keys: {list(prompt_config)}")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a structured message list with (`{role: ..., content: ...}`) fields ready for LLM consumption

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages
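One reason to fill the template with chained `str.replace` calls rather than `str.format` is that `format` treats every brace pair in the template as a placeholder. The sketch below uses a hypothetical template that contains a literal JSON example; it is an illustration of the design choice, not a prompt used elsewhere in this notebook.

```python
# Hypothetical template with literal braces in addition to the two placeholders.
template = (
    'Reply as JSON like {"answer": "..."}.\n'
    "Context:\n{context}\n\nQuestion:\n{question}"
)
context = "Hypertension is defined as persistent elevation of blood pressure above 140/90 mmHg."
question = "What is the definition of hypertension according to the guide?"

# str.format parses the literal JSON braces as a replacement field and raises.
try:
    template.format(context=context, question=question)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

# Chained replace only touches the two known placeholders.
user_content = template.replace("{context}", context).replace("{question}", question)

print("str.format failed:", format_failed)
print(user_content)
```

The trade-off is that chained replacement is order-sensitive: if retrieved context ever contained the literal text `{question}`, the second `replace` would substitute into it.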


## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline for each question to generate answers for evaluation.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")
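Because failed calls are stored as items instead of being raised, it can help to separate them before scoring so that error strings are not evaluated as if they were answers. The following is a minimal sketch using a stand-in dataclass; `Item` and `split_failures` are hypothetical, and the real `EvaluationItem` is only assumed to expose a `metadata` mapping as constructed in the loop above.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Stand-in for EvaluationItem, reduced to the fields used here (assumption)."""
    question: str
    generated_answer: str
    metadata: dict = field(default_factory=dict)


def split_failures(items):
    """Partition items into (succeeded, failed) using the 'error' metadata key."""
    succeeded = [it for it in items if "error" not in (it.metadata or {})]
    failed = [it for it in items if "error" in (it.metadata or {})]
    return succeeded, failed


items = [
    Item("Q1", "An on-topic answer", {"latency_ms": 420}),  # hypothetical header key
    Item("Q2", "Error: timeout", {"error": "timeout"}),
]

ok, failed = split_failures(items)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for it in failed:
    print(f"  failed: {it.question} ({it.metadata['error']})")
```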

## 8. Initialize the Evaluator

### Using Ragas Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can set up the `LLMEvaluator`.

This class is the core component of the **Flotorch-Eval** library — think of it as the *“head judge”* for our evaluation process. It coordinates metric calculations, semantic comparisons, and LLM-based judgments using the configuration we provide.

### Parameter Insights

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — The evaluator uses:
  - an **LLM** (`inferencer_model`) for reasoning-based checks, and  
  - an **embedding model** (`embedding_model`) for semantic and contextual similarity evaluations.  
- **`evaluation_engine`** — Here, we set this to `"ragas"`, meaning the evaluator will use the **[Ragas framework](https://docs.ragas.io/en/stable/getstarted/)** for metric computation.  
  Ragas is well-suited for RAG-style evaluations and handles metrics such as:
  - **Faithfulness**
  - **Answer Relevance**
  - **Context Precision**
  - **Aspect Critic (custom maliciousness check)**  
- **`metrics`** — In this configuration, we evaluate only **`MetricKey.ANSWER_RELEVANCE`**.

### Answer Relevance Metric

**Definition**: evaluates how pertinent and directly relevant the generated answer is to the given question. It measures whether the response actually addresses what was asked, regardless of factual accuracy. The score ranges from 0 to 1, calculated by: **generating questions from the answer using an LLM, then computing semantic similarity between generated questions and the original question**. This metric is crucial for ensuring the Medical Assistant provides on-topic, focused responses that directly address user queries about medical conditions.

**How It Works**:
1. Generate multiple questions from the given answer using an LLM
2. Compute semantic similarity between generated questions and the original question using embeddings
3. Score = Average similarity (higher similarity = more relevant answer)
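The three steps above can be sketched with toy vectors. This is an illustration of the averaging idea only, not the Ragas implementation: the embeddings below are hand-written stand-ins for what an embedding model would return, and `answer_relevance` is a hypothetical function name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec, generated_question_vecs):
    """Mean similarity between the original question and questions generated from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumption): the first axis stands for "diabetes symptoms".
question_vec = [1.0, 0.0, 0.0]
on_topic = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]   # questions generated from a relevant answer
off_topic = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # questions generated from an unrelated answer

high = answer_relevance(question_vec, on_topic)
low = answer_relevance(question_vec, off_topic)
print(f"on-topic: {high:.2f}, off-topic: {low:.2f}")
```

With these vectors the on-topic score is approximately 0.9 and the off-topic score is 0, which mirrors the high and low relevance cases in the example that follows.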

**Example**:

*Question*: "What are the common symptoms of Type 2 Diabetes?"

*Relevant Answer* (High Relevance): "Common symptoms of Type 2 Diabetes include increased thirst, frequent urination, increased hunger, fatigue, blurred vision, and slow-healing sores." → **Score: 1.0**

*Irrelevant Answer* (Low Relevance): "Hypertension is treated with ACE inhibitors and beta-blockers. Patients should monitor blood pressure regularly." → **Score: 0.0** (talks about hypertension instead of diabetes symptoms)


In [None]:
# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.ANSWER_RELEVANCE,
    ],
    evaluation_engine="ragas"
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation (Ragas)

### Purpose
Execute the evaluation process to score all generated answers using the **answer relevance** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **answer relevance** by comparing the generated answer against the original question.

- Collect the following outputs:
  - Answer relevance scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute answer relevance scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.
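To make the difference between the two modes concrete, the sketch below contrasts sequential and concurrent execution of a stubbed scoring call. It does not describe how `aevaluate()` is implemented; `score_one` is a hypothetical stand-in for a single LLM-backed scoring request, and the concurrency limit is an assumed parameter.

```python
import asyncio


async def score_one(item):
    """Stand-in for one LLM-backed scoring call (assumption)."""
    await asyncio.sleep(0.05)  # simulated network latency
    return {"question": item, "score": 1.0}


async def score_sequential(items):
    """Await each call in turn: total time grows with the number of items."""
    return [await score_one(it) for it in items]


async def score_concurrent(items, limit=4):
    """Run calls concurrently, capped by a semaphore to avoid flooding the endpoint."""
    sem = asyncio.Semaphore(limit)

    async def guarded(it):
        async with sem:
            return await score_one(it)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(it) for it in items))


items = ["q1", "q2", "q3", "q4"]
sequential = asyncio.run(score_sequential(items))
concurrent = asyncio.run(score_concurrent(items))
print(sequential == concurrent)
```

Inside a notebook, where an event loop is already running, the coroutines would be awaited directly (as in the cell below) instead of being passed to `asyncio.run`.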


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 10. View Per-Question Results (Ragas)

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.

In [None]:
display_llm_evaluation_results(eval_results)

## 11. Initialize the Evaluator (DeepEval)

### Using DeepEval Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can switch the `LLMEvaluator` to the **DeepEval** backend to compare its scoring with the Ragas run.

This class is still the *“head judge”* for our evaluation process; we’re simply changing which judging rubric it follows.

### Parameter Insights (DeepEval Mode)

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — DeepEval-powered scoring still needs the evaluator LLM and embeddings for semantic checks.  
- **`evaluation_engine="deepeval"`** — Routes metrics through DeepEval, which (per the [flotorch-eval repository](https://github.com/FissionAI/flotorch-eval/tree/develop)) unlocks the following metric keys:  
  - **`MetricKey.FAITHFULNESS`**
  - **`MetricKey.ANSWER_RELEVANCY`**
  - **`MetricKey.CONTEXT_RELEVANCY`**
  - **`MetricKey.CONTEXT_PRECISION`**
  - **`MetricKey.CONTEXT_RECALL`**
  - **`MetricKey.HALLUCINATION`**
  These are the same metrics surfaced in Flotorch’s *auto* mode when Ragas prerequisites (like embeddings) are missing.  
- **`metric_configs`** — Lets us pass DeepEval-specific arguments, such as a custom threshold for answer relevancy (we set `0.8` below) plus tuning knobs for other diagnostics.

### DeepEval Answer Relevance Metric

**Definition**: checks whether every statement in the generated answer directly addresses the user’s question using the [DeepEval answer relevancy rubric](https://deepeval.com/docs/metrics-answer-relevancy). The metric returns a score between 0 and 1, where higher values indicate more fully on-topic answers; the score is then compared against the configured threshold.

**How It Works**:
1. Extracts intent-carrying snippets from the answer with DeepEval’s grading LLM.
2. Judges each snippet against the original question, marking it relevant or irrelevant and providing short rationales.
3. Computes the answer relevance score as the proportion of relevant snippets, then compares the result to any threshold set in `metric_configs` (e.g., `0.8`) to drive pass/fail signals.
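The scoring arithmetic in step 3 can be sketched as follows. The verdicts are hard-coded here, whereas in practice they come from the judging LLM; `relevancy_score` and `passes` are hypothetical names, and treating a score equal to the threshold as a pass is an assumption.

```python
def relevancy_score(verdicts):
    """Proportion of answer snippets judged relevant (True) to the question."""
    if not verdicts:
        return 0.0
    return sum(1 for v in verdicts if v) / len(verdicts)


def passes(score, threshold=0.8):
    """Threshold check; 'greater than or equal' is an assumption in this sketch."""
    return score >= threshold


# Three symptom-related snippets plus one off-topic aside.
verdicts = [True, True, True, False]
score = relevancy_score(verdicts)
print(score, passes(score))              # 3 of 4 relevant: below the 0.8 threshold
print(passes(relevancy_score([True] * 4)))  # every snippet relevant: clears it
```

A single off-topic snippet in a short answer is enough to fall below a `0.8` threshold, which is worth keeping in mind when choosing the value.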

**Example**:

*Question*: "What are the common symptoms of Type 2 Diabetes?"

- *Pass Scenario* (Score ≈ 1.0): The answer lists classic symptoms—polydipsia, polyuria, fatigue—so every snippet is marked relevant and the threshold is cleared.
- *Fail Scenario* (Score ≈ 0.0): The answer pivots to hypertension treatments; DeepEval flags each snippet as irrelevant and explains the mismatch.

This mirrors the Ragas workflow while adding threshold-driven outcomes and explanatory diagnostics useful for automated guardrails.


In [None]:
# Configure DeepEval-specific metric arguments
metric_args = {
    "answer_relevancy": {"threshold": 0.8},
}

# Initialize the LLMEvaluator client using DeepEval
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.ANSWER_RELEVANCE,
    ],
    evaluation_engine="deepeval",
    metric_configs=metric_args,
)

print("LLMEvaluator client initialized.")


## 12. Run Evaluation (DeepEval)

### Purpose
Execute the evaluation process to score all generated answers using the **answer relevance** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **answer relevance** by comparing the generated answer against the original question.

- Collect the following outputs:
  - Answer relevance scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute answer relevance scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")


### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 13. View Per-Question Results (DeepEval)

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.

In [None]:
display_llm_evaluation_results(eval_results)

## 14. View Raw JSON Results

### Purpose
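Display the complete evaluation results in JSON format for detailed inspection and programmatic access. Because `eval_results` was reassigned in step 12, this cell prints the DeepEval run; to keep the Ragas results as well, store them under a different variable name before re-running the evaluation.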

### Output Structure
The JSON output includes for each question:
- **model**: The evaluation LLM model used
- **input_query**: The original question
- **context**: Full retrieved context passages (not truncated)
- **generated_answer**: Complete LLM-generated response
- **groundtruth_answer**: Expected correct answer
- **evaluation_metrics**: Dictionary containing:
  - **Answer Relevance**: Answer relevance score (0.0 to 1.0)
  - **average_score**: Average of all evaluated metrics
  - **total_latency_ms**: Total evaluation time in milliseconds
  - **total_cost**: Cost of evaluation in USD
  - **total_tokens**: Token count for evaluation

This raw JSON format is useful for further analysis, exporting results, or integrating with other tools.
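As one example of downstream analysis, the sketch below summarizes scores from a result list shaped like the structure described above. The sample data is hand-written, and the metric key name `answer_relevance` is an assumption; check the actual keys in your printed output before reusing it. `summarize` is a hypothetical helper.

```python
from statistics import mean

METRIC_KEY = "answer_relevance"  # assumed key name; verify against your real output

# Hand-written sample mirroring the fields listed above (not real results).
sample_results = [
    {"input_query": "Q1", "evaluation_metrics": {METRIC_KEY: 0.92}},
    {"input_query": "Q2", "evaluation_metrics": {METRIC_KEY: 0.41}},
    {"input_query": "Q3", "evaluation_metrics": {}},  # e.g. scoring failed for this item
]


def summarize(results, threshold=0.8):
    """Count scored items, average their scores, and list questions below the threshold."""
    scored = [
        (r["input_query"], r["evaluation_metrics"][METRIC_KEY])
        for r in results
        if METRIC_KEY in r.get("evaluation_metrics", {})
    ]
    scores = [s for _, s in scored]
    return {
        "scored": len(scored),
        "unscored": len(results) - len(scored),
        "mean": mean(scores) if scores else None,
        "below_threshold": [q for q, s in scored if s < threshold],
    }


summary = summarize(sample_results)
print(summary)
```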

In [None]:
print("--- Aggregate Evaluation Results ---")
print(json.dumps(eval_results, indent=2))

## 15. Summary

### What We Accomplished

This notebook demonstrates a full answer-relevance evaluation workflow using **both** the Ragas and DeepEval engines. We can baseline response relevance with Ragas and then validate the same evaluation set with DeepEval’s thresholded judging rubric.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation.
   - Connected to `FlotorchVectorStore` for context retrieval.
   - Prepared shared prompts, credentials, and evaluation artifacts.

2. **Generated Responses**
   - Loaded ground-truth questions from `gt.json` and prompts from `prompt.json`.
   - Retrieved the top medical context passages per query.
   - Generated answers with the inference LLM while collecting latency, cost, and token metadata.

3. **Scored Answer Relevance Twice**
   - **Ragas run**: computed answer relevance scores via semantic similarity to establish a baseline.
   - **DeepEval run**: re-scored the same `evaluation_items` with `evaluation_engine="deepeval"`, applying a `0.8` threshold and capturing rationales plus telemetry for each verdict.

4. **Visualized & Exported Results**
   - Rendered separate tables for Ragas and DeepEval outcomes to compare how each engine rates the same answers.
   - Printed the raw JSON for the most recent run (the DeepEval results held in `eval_results`), including metric scores and gateway telemetry for downstream analysis or automation.

### Key Takeaways

- **Answer Relevance ≈ 1.0** signals that the generated answer fully addresses the medical question; scores near **0.0** expose off-topic or evasive responses. The metric does not assess factual accuracy, so a high score alone does not rule out hallucinated guidance.
- Ragas delivers fast, retrieval-aware scoring that’s easy to track across experimentation, regression suites, and tuning cycles.
- DeepEval layers threshold-driven guardrails and snippet-level rationales on top of the score, helping teams triage prompt drift, injection attempts, or policy violations.
- Monitoring both engines together provides confidence that the assistant stays on-topic while surfacing the telemetry needed for production governance and alerting.

### Ragas vs. DeepEval in Practice

| Dimension | Ragas Answer Relevance | DeepEval Answer Relevance |
|-----------|------------------------|---------------------------|
| **Docs** | [Metric docs](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/) | [Metric docs](https://deepeval.com/docs/metrics-answer-relevancy) |
| **Core Flow** | Generates follow-up questions from the answer and scores semantic similarity to the original question. | Uses LLM judgements to label each answer snippet as relevant/irrelevant, aggregates relevance, and applies optional thresholds. |
| **Engine Toggle** | `evaluation_engine="ragas"` keeps scoring inside the Ragas framework with minimal configuration. | `evaluation_engine="deepeval"` unlocks DeepEval metrics per the [Flotorch Eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop). |
| **Additional Metrics** | Answer relevance, faithfulness, context precision, aspect critic (when configured). | Answer relevancy, faithfulness, context precision/recall, hallucination, plus threshold-based outcomes and rationale traces. |

**Example Question**: “What are the common symptoms of Type 2 Diabetes?”

- **Ragas Outcome**  
  - High-scoring run: Symptoms like polydipsia, polyuria, and fatigue appear prominently, so similarity remains near 1.0.  
  - Low-scoring run: The answer shifts to treatment plans without listing symptoms; the score collapses toward 0.0 even if other details sound reasonable.
- **DeepEval Outcome**  
  - Pass scenario: DeepEval marks each symptom-oriented snippet as relevant, clears the 0.8 threshold, and attaches supporting rationales.  
  - Fail scenario: When the answer wanders into hypertension management, DeepEval flags every off-topic snippet, explains the mismatch, and surfaces telemetry for alerting.

DeepEval’s richer diagnostic output complements the Ragas baseline, pairing quick semantic checks with guardrail-friendly reasoning so medical stakeholders can maintain trust in the assistant’s on-topic behavior.