[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1QN0rqMopoTvq72H7Y06JIt6Gl0zcLpWx/view?usp=sharing)

# Evaluating the Fission Labs Assistant with Flotorch Eval

This notebook walks through evaluating a **Fission Labs Assistant**, a retrieval-augmented agent that answers questions about the **Fission Labs** organization, using the **Flotorch SDK** alongside the **Flotorch Eval** library.

---

### **Use Case Overview**

The **Fission Labs Assistant** helps users explore facts about Fission Labs—founded in 2008, headquartered in Sunnyvale, California, with delivery hubs in Hyderabad.  
It retrieves relevant information from a **Fission Labs Knowledge Base** built from internal documentation and generates helpful, accurate responses.

This notebook focuses on evaluating the **faithfulness** of the model’s answers — that is, whether the generated responses are **factually grounded** in the retrieved context.

---

### **Notebook Workflow**

We’ll follow a structured evaluation process:

1. **Iterate Questions** – Loop through each question in the `gt.json` file (Ground Truth).  
2. **Retrieve Context** – Fetch relevant passages from the Fission Labs Knowledge Base.  
3. **Generate Answer** – Use the system prompt and LLM to produce a response.  
4. **Store Results** – Log each question, retrieved context, generated answer, and ground truth.  
5. **Evaluate Faithfulness** – Use `LLMEvaluator` from Flotorch Eval to assess how well each response is grounded in the given context.  
6. **Display Results** – Summarize the faithfulness scores in a simple comparison table.

---

### **Metric Evaluated — Faithfulness**

We focus on a single quality signal: **Faithfulness**. It measures how faithfully a generated answer reflects the supporting context retrieved from our knowledge base. A high score means every claim in the answer is grounded in the evidence; a low score indicates hallucinations or contradictions.

#### Ragas Faithfulness (Flotorch `evaluation_engine="ragas"`)
According to the [Ragas documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/), the pipeline:
- breaks the generated answer into atomic claims using an evaluator LLM,
- checks each claim against retrieved context passages for support, and
- reports the ratio of supported claims to total claims as a score between 0.0 and 1.0.
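
The ratio in the last step can be sketched in a few lines. This is a minimal illustration of the arithmetic only, assuming the per-claim verdicts have already been produced by the evaluator LLM; `faithfulness_score` is a hypothetical helper, not part of Ragas or Flotorch Eval, and returning 0.0 for an answer with no claims is a choice made for this sketch.

```python
from typing import List


def faithfulness_score(claim_supported: List[bool]) -> float:
    """Return supported claims / total claims for one answer."""
    if not claim_supported:
        # No claims were extracted; 0.0 is an arbitrary choice for this sketch.
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer with three claims, two of them supported.
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
print(round(score, 2))  # 0.67
```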

#### DeepEval Faithfulness (Flotorch `evaluation_engine="deepeval"`)
Per the [DeepEval faithfulness specification](https://deepeval.com/docs/metrics-faithfulness), the evaluator:
- extracts fine-grained statements from the answer,
- verifies each statement against the provided context using DeepEval’s judging rubric,
- outputs a 0.0–1.0 score, and
- optionally applies a threshold (for example `0.8`) to automate pass/fail checks or generate reasoning about unsupported claims.

Both engines serve the same goal—ensuring generated answers stay truthful—but they offer different tuning knobs. Ragas emphasizes retrieval-aware scoring, while DeepEval adds thresholding and richer diagnostics for production monitoring.

---

### **Evaluation Engine**

- `evaluation_engine="auto"` — lets Flotorch Eval mix Ragas and DeepEval according to the priority routing described in the [flotorch-eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop) (Ragas first, DeepEval as fallback) so that faithfulness, answer relevance, and context metrics always run on the best available backend.
- `evaluation_engine="ragas"` — keeps every metric inside the [**Ragas**](https://docs.ragas.io/en/stable/getstarted/) rubric for retrieval-aware evaluations (faithfulness, answer relevance, context precision, aspect critic, etc.), which is what we configure in the first half of this notebook.
- `evaluation_engine="deepeval"` — routes metrics through DeepEval’s engine (faithfulness, answer relevancy, context relevancy, context precision, context recall, hallucination) while still capturing Flotorch gateway telemetry as documented in the [Flotorch Eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop). This mode is showcased later in the notebook.

We begin with the Ragas-only flow and then repeat the run with the DeepEval engine to highlight the differences in scoring and configuration.  

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Knowledge Base.  
- `gt.json` containing question–answer pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates.  
---

#### **Documentation References**
- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) — reference implementation with sample notebooks and evaluation pipelines.
- [**Ragas Faithfulness Documentation**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/) — detailed explanation of the metric.
- [**DeepEval Faithfulness Documentation**](https://deepeval.com/docs/metrics-faithfulness) — detailed explanation of the metric.
---

## 1. Install Dependencies

First, we install the required libraries, each pinned to a specific version.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval`: The evaluation library that provides the LLM-based (Ragas and DeepEval) evaluations.
-   `opentelemetry-instrumentation-httpx`: Installed at a pinned version alongside the two Flotorch packages.

In [None]:
# Install the Flotorch SDK and Flotorch Eval
# You can safely ignore dependency errors during installation.
%pip install flotorch==3.1.0b1 flotorch-eval==2.0.0
%pip install opentelemetry-instrumentation-httpx==0.58b0

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `digital-twin` |

In [None]:
import getpass  # Prompts for the key without echoing it in the notebook

# Flotorch authentication
try:
    FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
    print("Success")
except getpass.GetPassWarning as e:
    print(f"Warning: {e}")
    FLOTORCH_API_KEY = ""

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # Flotorch gateway endpoint

inference_model_name = "flotorch/<your-model-name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging answer quality
evaluation_embedding_model_name = "flotorch/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>"
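
Because the model names and Knowledge Base ID above are placeholders, it is easy to run later cells with a value left unfilled. The sketch below shows one way to catch that early. `unfilled_settings` and the `example_settings` dictionary are hypothetical names introduced for illustration; in the notebook you would pass your own variables instead.

```python
import re
from typing import Dict, List

# Matches template markers such as <your_model_name>.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def unfilled_settings(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still contain a <placeholder>."""
    return [
        name
        for name, value in settings.items()
        if not value.strip() or PLACEHOLDER.search(value)
    ]


# Example values: the first entry is filled in, the other two are deliberately not.
example_settings = {
    "inference_model_name": "flotorch/gpt-4o-mini",
    "evaluation_llm_model_name": "flotorch/<your_model_name>",
    "knowledge_base_repo": "",
}
missing = unfilled_settings(example_settings)
print(missing)  # ['evaluation_llm_model_name', 'knowledge_base_repo']
```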

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Runs metric scoring for the generated answers



In [None]:
# Required imports
import json
from typing import List
from tqdm import tqdm # Use standard tqdm for simple progress bars
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval.llm_eval import display_llm_evaluation_results


print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) from local files.

### Files Required

**1. `gt.json` (Ground Truth)**
Contains question-answer pairs for evaluation. Each `answer` is the expected correct response.

```json
[
  {
    "question": "When was Fission Labs founded?",
    "answer": "Fission Labs was founded in 2008."
  },
  {
    "question": "Who is the Co-Founder and CEO of Fission Labs?",
    "answer": "Eswar Lingam is the Co-Founder and CEO of Fission Labs."
  }
]
```

**2. `prompt.json` (Agent Prompts)**
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a helpful Fission labs assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```

### Instructions
Run the next cell and upload `gt.json`, then `prompt.json`, when prompted. The uploaded file names are stored in `gt_path` and `prompts_path`.
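
A quick structural check can catch malformed files before they reach the experiment loop. The sketch below validates data shaped like the two examples above. `validate_ground_truth` and `validate_prompts` are hypothetical helpers written for this illustration, and the inline sample data is intentionally incomplete so that each check reports a problem.

```python
import json
from typing import Any, List


def validate_ground_truth(records: Any) -> List[str]:
    """Return a list of problems found in gt.json-style data (empty list means valid)."""
    if not isinstance(records, list):
        return ["top level must be a JSON array"]
    problems = []
    for index, record in enumerate(records):
        for key in ("question", "answer"):
            value = record.get(key) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{key}'")
    return problems


def validate_prompts(config: Any) -> List[str]:
    """Return a list of problems found in prompt.json-style data."""
    if not isinstance(config, dict):
        return ["top level must be a JSON object"]
    problems = []
    if not isinstance(config.get("system_prompt"), str):
        problems.append("missing 'system_prompt'")
    template = config.get("user_prompt_template")
    if not isinstance(template, str):
        problems.append("missing 'user_prompt_template'")
    else:
        for placeholder in ("{context}", "{question}"):
            if placeholder not in template:
                problems.append(f"template lacks {placeholder}")
    return problems


# Deliberately flawed samples: an empty answer and a template without {context}.
sample_gt = json.loads('[{"question": "When was Fission Labs founded?", "answer": ""}]')
sample_prompts = json.loads(
    '{"system_prompt": "Answer from context.", "user_prompt_template": "{question}"}'
)
gt_problems = validate_ground_truth(sample_gt)
prompt_problems = validate_prompts(sample_prompts)
print(gt_problems)
print(prompt_problems)
```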

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your Prompts file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully. Keys: {list(prompt_config.keys())}")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a structured message list with (`{role: ..., content: ...}`) fields ready for LLM consumption

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages
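
`create_messages` fills the template with `str.replace` rather than `str.format`. The sketch below illustrates one reason this is a reasonable choice: `str.format` treats every brace pair in the template as a field, so a template that contains a literal JSON example fails, while `str.replace` only touches the two known placeholders. The context string and the second template are hypothetical examples.

```python
template = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
# Hypothetical retrieved chunk that happens to contain braces.
context_text = 'Config snippet: {"region": "us-west"}'
question = "Which region is configured?"

# str.replace substitutes only the two known placeholders.
user_content = template.replace("{context}", context_text).replace("{question}", question)
print(user_content)

# str.format parses every brace pair in the template as a replacement field,
# so a template carrying a literal JSON example raises an error.
template_with_json = 'Reply as {"answer": "..."}\n\nQuestion:\n{question}'
try:
    template_with_json.format(question=question)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print(format_failed)  # True
```

One caveat of the chained `replace` calls: if a retrieved chunk itself contains the literal text `{question}`, the second call substitutes into it as well.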


## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")  # Key name matches prompt.json

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline for each question to generate answers for evaluation.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")
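
Because failed questions are stored with an `"Error: ..."` answer, they would be scored for faithfulness along with the real answers. A minimal sketch of separating them first is shown below. It assumes the items expose a `metadata` attribute holding the dictionary passed in the loop above; `ItemStub` is a hypothetical stand-in for `EvaluationItem` so the example runs without the Flotorch packages, and the metadata values are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ItemStub:
    """Stand-in for EvaluationItem with only the fields this sketch needs (hypothetical)."""
    question: str
    generated_answer: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_failed(items: List[ItemStub]) -> Tuple[List[ItemStub], List[ItemStub]]:
    """Separate items whose metadata carries the "error" key written by the except branch."""
    succeeded = [item for item in items if "error" not in item.metadata]
    failed = [item for item in items if "error" in item.metadata]
    return succeeded, failed


items = [
    ItemStub("When was Fission Labs founded?", "Fission Labs was founded in 2008.", {"latency_ms": 850}),
    ItemStub("Who is the CEO?", "Error: timeout", {"error": "timeout"}),
]
succeeded, failed = split_failed(items)
print(len(succeeded), len(failed))  # 1 1
```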

## Flotorch-Eval Has Both Ragas and DeepEval Support
1. `flotorch-eval` ships Ragas and DeepEval engines together, letting auto mode pick the right one per the [docs](https://github.com/FissionAI/flotorch-eval/tree/develop).
2. Ragas stays primary for faithfulness, answer relevance, and context precision when embeddings are configured.
3. DeepEval backs up the same metrics whenever Ragas prerequisites are missing, so coverage never drops.
4. Gateway telemetry (latency, cost, tokens) remains active because auto mode still passes Flotorch headers through each run.
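
The routing idea in points 2 and 3 can be pictured with a small sketch. This is an illustration of "Ragas first, DeepEval as fallback" only, not the flotorch-eval implementation: `pick_engine`, the `ragas_ready` flag, and the string metric names are hypothetical, with the metric sets taken from the lists in this notebook.

```python
from typing import Set

# Metric names as listed in this notebook; the strings are illustrative, not library identifiers.
RAGAS_METRICS: Set[str] = {"faithfulness", "answer_relevance", "context_precision", "aspect_critic"}
DEEPEVAL_METRICS: Set[str] = {
    "faithfulness", "answer_relevancy", "context_relevancy",
    "context_precision", "context_recall", "hallucination",
}


def pick_engine(metric: str, ragas_ready: bool) -> str:
    """Prefer Ragas when it covers the metric and its prerequisites are met, else use DeepEval."""
    if metric in RAGAS_METRICS and ragas_ready:
        return "ragas"
    if metric in DEEPEVAL_METRICS:
        return "deepeval"
    raise ValueError(f"No engine available for metric '{metric}'")


print(pick_engine("faithfulness", ragas_ready=True))   # ragas
print(pick_engine("faithfulness", ragas_ready=False))  # deepeval
print(pick_engine("hallucination", ragas_ready=True))  # deepeval
```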


## 8. Initialize the Evaluator

### Using Ragas Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can set up the `LLMEvaluator`.

This class is the core component of the **Flotorch-Eval** library — think of it as the *“head judge”* for our evaluation process. It coordinates metric calculations, semantic comparisons, and LLM-based judgments using the configuration we provide.

### Parameter Insights

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — The evaluator uses:
  - an **LLM** (`inferencer_model`) for reasoning-based checks, and  
  - an **embedding model** (`embedding_model`) for semantic and contextual similarity evaluations.  
- **`evaluation_engine`** — Here, we set this to `"ragas"`, meaning the evaluator will use the **[Ragas framework](https://docs.ragas.io/en/stable/getstarted/)** for metric computation.  
  Ragas is well-suited for RAG-style evaluations and handles metrics such as:
  - **Faithfulness**
  - **Answer Relevance**
  - **Context Precision**
  - **Aspect Critic (custom maliciousness check)**
- **`metrics`** — In this configuration, we evaluate only **`MetricKey.FAITHFULNESS`**.

### Faithfulness Metric

**Definition**: evaluates how factually consistent a generated response is with the retrieved context. It measures whether all claims in the generated answer can be supported by the context. The score ranges from 0 to 1, calculated as: **Number of claims supported by context / Total number of claims in the response**. This metric is crucial for preventing hallucinations and ensuring the AI doesn't fabricate information beyond what's provided in the source documents.

**How It Works**:
1. Breaks the answer into individual claims
2. Checks each claim against retrieved context  
3. Score = (Supported claims) / (Total claims)

**Example**:

*Context*: "Fission Labs was founded in 2008 and is headquartered in Sunnyvale, California."

*Faithful Answer* (Correct): "Fission Labs was founded in 2008 and is headquartered in Sunnyvale, California." → **Score: 1.0**

*Unfaithful Answer* (Incorrect): "Fission Labs was founded in 2005 and has over 1000 employees." → **Score: 0.0**


In [None]:
# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.FAITHFULNESS,
    ],
    evaluation_engine="ragas"
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation (Ragas)

### Purpose
Execute the evaluation process to score all generated answers using the **faithfulness** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **faithfulness** by comparing the generated answer against the retrieved context.

- Collect the following outputs:
  - Faithfulness scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute faithfulness scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.
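
To see why concurrent execution tends to finish sooner, the sketch below compares the two styles with a simulated evaluation call. It does not use Flotorch Eval: `fake_score` is a hypothetical stand-in that sleeps instead of calling a model, and the timings only illustrate the general pattern. The example is written as a plain script. Inside a notebook, where an event loop is already running, replace `asyncio.run(...)` with a direct `await`.

```python
import asyncio
import time
from typing import Any, Callable, Coroutine, List, Tuple


async def fake_score(question: str) -> float:
    """Stand-in for one LLM-judged evaluation: sleeps instead of calling a model."""
    await asyncio.sleep(0.05)
    return 1.0


async def score_sequentially(questions: List[str]) -> List[float]:
    # Each call finishes before the next one starts.
    return [await fake_score(q) for q in questions]


async def score_concurrently(questions: List[str]) -> List[float]:
    # All calls are in flight at once; gather preserves input order.
    return list(await asyncio.gather(*(fake_score(q) for q in questions)))


def timed(
    runner: Callable[[List[str]], Coroutine[Any, Any, List[float]]],
    questions: List[str],
) -> Tuple[List[float], float]:
    """Run one of the coroutines above and return its scores and elapsed seconds."""
    start = time.perf_counter()
    scores = asyncio.run(runner(questions))
    return scores, time.perf_counter() - start


questions = [f"question {i}" for i in range(5)]
sequential_scores, sequential_seconds = timed(score_sequentially, questions)
concurrent_scores, concurrent_seconds = timed(score_concurrently, questions)
print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```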


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 10. View Per-Question Results

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.

In [None]:
display_llm_evaluation_results(eval_results)

## 11. Initialize the Evaluator

### Using DeepEval Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can switch the `LLMEvaluator` to the **DeepEval** backend to compare its scoring with the Ragas run.

This class is still the *“head judge”* for our evaluation process; we’re simply changing which judging rubric it follows.

### Parameter Insights (DeepEval Mode)

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — DeepEval-powered scoring still needs the evaluator LLM and embeddings for semantic checks.  
- **`evaluation_engine="deepeval"`** — Routes metrics through DeepEval, which (per the [flotorch-eval repository](https://github.com/FissionAI/flotorch-eval/tree/develop)) unlocks the following metric keys:
  - **`MetricKey.FAITHFULNESS`**
  - **`MetricKey.ANSWER_RELEVANCY`**
  - **`MetricKey.CONTEXT_RELEVANCY`**
  - **`MetricKey.CONTEXT_PRECISION`**
  - **`MetricKey.CONTEXT_RECALL`**
  - **`MetricKey.HALLUCINATION`**
  These are the same metrics surfaced in Flotorch’s *auto* mode when Ragas prerequisites (like embeddings) are missing.
- **`metric_configs`** — Lets us pass DeepEval-specific arguments such as custom thresholds for faithfulness (we set `0.8` below) and tuning knobs for other metrics (maliciousness, hallucination, etc.).

### DeepEval Faithfulness Metric

**Definition**: checks whether every claim in the generated answer is supported by the supplied context using the [DeepEval faithfulness rubric](https://deepeval.com/docs/metrics-faithfulness). The metric returns a score between 0 and 1, where higher values indicate fully grounded answers that satisfy the configured threshold.

**How It Works**:
1. Extracts atomic statements from the model answer with DeepEval’s grading LLM.
2. Verifies each statement against the provided context to classify it as supported or unsupported, optionally producing reasoning for any mismatch.
3. Computes the faithfulness score as supported statements ÷ total statements, then compares the result to any threshold set in `metric_configs` (e.g., `0.8`) to drive pass/fail signals.

**Example**:

*Context*: "Fission Labs was founded in 2008, is headquartered in Sunnyvale, California, and runs delivery hubs in Hyderabad, India."

*Faithful Answer* (Score ≈ 1.0): "Fission Labs was founded in 2008 in Sunnyvale and operates delivery hubs in Hyderabad." — every statement is grounded, so the answer clears the 0.8 threshold.

*Unfaithful Answer* (Score ≈ 0.0): "Fission Labs launched in 2012, moved its headquarters to Austin, and recently opened a hardware division." — unsupported claims drive the score to 0, and DeepEval surfaces reasoning indicating the contradictions.

This mirrors the Ragas faithfulness workflow while adding threshold-driven outcomes and optional explanations that are useful for automated guardrails.
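
The threshold step can be illustrated with a few lines of Python. This is a sketch of the pass/fail logic only, assuming a score at or above the threshold counts as a pass; `apply_threshold` is a hypothetical helper and not a DeepEval or Flotorch Eval function.

```python
from typing import Dict


def apply_threshold(score: float, threshold: float = 0.8) -> Dict[str, object]:
    """Turn a 0.0-1.0 faithfulness score into a pass/fail record."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    return {"score": score, "threshold": threshold, "passed": score >= threshold}


# Three of three statements supported, then two of three.
fully_grounded = apply_threshold(3 / 3)
partly_grounded = apply_threshold(2 / 3)
print(fully_grounded["passed"])   # True
print(partly_grounded["passed"])  # False
```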

In [None]:
# Configure a custom threshold for the faithfulness metric
metric_args = {
    "faithfulness": {
        "threshold": 0.8
    }
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.FAITHFULNESS,
    ],
    evaluation_engine="deepeval",
    metric_configs=metric_args
)

print("LLMEvaluator client initialized.")

## 12. Run Evaluation (DeepEval)

### Purpose
Execute the evaluation process to score all generated answers using the **faithfulness** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **faithfulness** by comparing the generated answer against the retrieved context.

- Collect the following outputs:
  - Faithfulness scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute faithfulness scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 13. View Per-Question Results

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.


In [None]:
display_llm_evaluation_results(eval_results)

## 14. View Raw JSON Results

### Purpose
Display the complete evaluation results in JSON format for detailed inspection and programmatic access.

### Output Structure
The JSON output includes for each question:
- **model**: The evaluation LLM model used
- **input_query**: The original question
- **context**: Full retrieved context passages (not truncated)
- **generated_answer**: Complete LLM-generated response
- **groundtruth_answer**: Expected correct answer
- **evaluation_metrics**: Dictionary containing:
  - **faithfulness**: Faithfulness score (0.0 to 1.0)
  - **average_score**: Average of all evaluated metrics
  - **total_latency_ms**: Total evaluation time in milliseconds
  - **total_cost**: Cost of evaluation in USD
  - **total_tokens**: Token count for evaluation

This raw JSON format is useful for further analysis, exporting results, or integrating with other tools.
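
As one example of such further analysis, the sketch below summarizes faithfulness across questions. It assumes the results can be treated as a list of per-question records carrying the field names listed above. The actual shape of `eval_results` may differ, so adapt the access path to what the next cell prints. `summarize` is a hypothetical helper and the sample values are made up for illustration.

```python
from statistics import mean
from typing import Any, Dict, List

# Hypothetical records using the field names listed above; the values are made up.
sample_results: List[Dict[str, Any]] = [
    {
        "input_query": "When was Fission Labs founded?",
        "evaluation_metrics": {"faithfulness": 1.0, "total_latency_ms": 900},
    },
    {
        "input_query": "Who is the Co-Founder and CEO of Fission Labs?",
        "evaluation_metrics": {"faithfulness": 0.5, "total_latency_ms": 1100},
    },
]


def summarize(results: List[Dict[str, Any]], metric: str = "faithfulness",
              threshold: float = 0.8) -> Dict[str, Any]:
    """Compute the mean and minimum score and list questions scoring below the threshold."""
    scores = [record["evaluation_metrics"][metric] for record in results]
    below = [
        record["input_query"]
        for record in results
        if record["evaluation_metrics"][metric] < threshold
    ]
    return {"mean": mean(scores), "min": min(scores), "below_threshold": below}


summary = summarize(sample_results)
print(summary["mean"], summary["min"])  # 0.75 0.5
print(summary["below_threshold"])
```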

In [None]:
print("--- Aggregate Evaluation Results ---")
print(json.dumps(eval_results, indent=2))

## 15. Summary

### What We Accomplished

This notebook provided a complete, step-by-step workflow for evaluating a RAG agent using Flotorch Eval, scoring faithfulness with both the Ragas and DeepEval engines.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation
   - Connected to `FlotorchVectorStore` for context retrieval
   - Initialized `LLMEvaluator` with the Ragas engine, then with the DeepEval engine, for faithfulness scoring

2. **Generated Responses**
   - Loaded ground truth questions from `gt.json`
   - Retrieved relevant context from the Knowledge Base for each question
   - Generated answers using the inference LLM with retrieved context
   - Captured metadata (cost, latency, tokens) from each LLM call

3. **Evaluated Faithfulness**
   - Scored each generated answer using the faithfulness metric under the Ragas and DeepEval engines
   - Verified that answers are factually grounded in retrieved context
   - Collected evaluation metrics and gateway statistics for each question

4. **Visualized Results**
   - Displayed per-question scores in a formatted table for quick analysis
   - Exported complete results as JSON for further processing
   - Observed that all real-world answers scored 1.0 (perfect faithfulness)

### Key Takeaways

- **Faithfulness = 1.0** means every claim in the generated answer is supported by the context (no hallucinations)
- **Faithfulness = 0.0** means none of the answer's claims are supported by the context; scores in between indicate that only some claims are supported
- The metric evaluates **Generated Answer ↔ Context**, NOT **Generated Answer ↔ Ground Truth**
- Faithfulness is crucial for ensuring RAG systems produce reliable, context-grounded responses without hallucinations—critical for applications like the Fission Labs Assistant where accuracy is paramount

### Ragas vs. DeepEval in Practice

| Dimension | Ragas Faithfulness | DeepEval Faithfulness |
|-----------|--------------------|------------------------|
| **Docs** | [Metric docs](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/) | [Metric docs](https://deepeval.com/docs/metrics-faithfulness) |
| **Core Flow** | Splits answers into claims and checks each one against retrieved passages; score is supported-claims ÷ total claims (0–1). | Similar claim analysis, with optional reasoning text and configurable thresholds for pass/fail automation. |
| **Engine Toggle** | `evaluation_engine="ragas"` keeps all scoring inside the Ragas framework. | `evaluation_engine="deepeval"` enables DeepEval metrics per the [Flotorch Eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop). |
| **Additional Metrics** | Faithfulness, answer relevance, context precision, aspect critic (when configured). | Faithfulness, answer relevancy, context relevancy, context precision, context recall, hallucination, plus threshold-based outcomes. |

**Example Context**: “Fission Labs was founded in 2008 and operates delivery hubs in Hyderabad.”

- **Ragas Outcome**
  - Faithful answer: “Founded in 2008 with delivery hubs in Hyderabad” → score ≈ 1.0
  - Unfaithful answer: “Founded in 2008 with offices in New York” → the unsupported “New York” claim lowers the score (if the answer is split into two claims and one is supported, the score is 0.5)
- **DeepEval Outcome**
  - Faithful answer: passes when threshold is 0.8, no hallucinations flagged
  - Unfaithful answer: fails threshold, DeepEval surfaces reasoning showing the “New York” claim isn’t in the context

DeepEval’s broader metric set complements the Ragas run, giving both retrieval-focused and hallucination-aware views of answer quality.
