[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1mpYsJOxzWxPRtXrh9VpZcKJRhXFCpXSq/view?usp=sharing)

# Evaluating the Financial Banking Assistant with Flotorch Eval

This notebook walks through evaluating a **Financial Banking Assistant**—a retrieval-augmented agent that answers questions about credit cards, loans, savings products, and investment options—using the **Flotorch SDK** alongside the **Flotorch Eval** library.

---

### **Use Case Overview**

The **Financial Banking Assistant** helps customers explore comprehensive information about:
- **Credit Cards** — premium rewards for travel, everyday cashback programs, and student-friendly starter cards
- **Loans** — unsecured personal lending, fixed-rate mortgage options, and flexible auto financing plans
- **Investments** — high-yield savings products, laddered certificate of deposit portfolios, and managed brokerage offerings
- **Savings Accounts** — tiered deposit accounts spanning everyday banking, premium perks, youth-focused savings, and money-market balances

It retrieves relevant information from a **Financial Banking Knowledge Base** containing detailed product documentation with eligibility requirements, fees, APR ranges, rewards structures, and policy information to generate helpful, accurate responses for customer inquiries.

This notebook focuses on evaluating the **context precision** of the retrieval system — that is, whether the system retrieves **relevant and useful context** from the knowledge base that actually helps answer customer questions accurately.

---

### **Notebook Workflow**

We’ll follow a structured evaluation process:

1. **Iterate Questions** – Loop through each question in the `financial_banking_gt.json` file (Ground Truth).  
2. **Retrieve Context** – Fetch relevant passages from the Financial Banking Knowledge Base.  
3. **Generate Answer** – Use the system prompt and LLM to produce a response.  
4. **Store Results** – Log each question, retrieved context, generated answer, and ground truth.  
5. **Evaluate Context Precision** – Use `LLMEvaluator` from Flotorch Eval to assess how relevant the retrieved context is to the question.  
6. **Display Results** – Summarize the context precision scores in a simple comparison table.

---

### **Metric Evaluated — Context Precision**

We focus on a single retrieval-quality signal: **Context Precision**. It measures how effectively the retriever ranks relevant passages above irrelevant ones for a given question. A high score means the first few chunks already contain the information needed to answer; a low score indicates off-topic passages dominate the ranking.

#### Ragas Context Precision (Flotorch `evaluation_engine="ragas"`)
According to the [Ragas Context Precision documentation](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/?h=context), the evaluator:
- inspects each retrieved chunk for relevance to the user query and reference answer,
- computes precision@k across the ranked list with higher weight on early positions, and
- reports an aggregate score between 0.0 and 1.0 that rewards “relevant-first” retrieval.
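
To make the rank weighting concrete, here is a minimal sketch of the precision@k averaging described in the Ragas documentation. It assumes the binary relevance verdicts are already available (in practice the judge LLM produces them). The function name `context_precision_at_k` is illustrative and is not part of Ragas or Flotorch Eval.

```python
from typing import Sequence


def context_precision_at_k(verdicts: Sequence[int]) -> float:
    """Rank-weighted precision over binary relevance verdicts (1 = relevant).

    verdicts[0] is the top-ranked chunk. The score is the mean of precision@k
    taken at every rank k that holds a relevant chunk.
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            relevant_seen += 1
            weighted_sum += relevant_seen / rank  # precision@k at this rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0


relevant_first = context_precision_at_k([1, 1, 0])  # relevant chunks ranked first
relevant_last = context_precision_at_k([0, 1, 1])   # same chunks, ranked last
print(round(relevant_first, 3), round(relevant_last, 3))
```

Both lists contain the same two relevant chunks. Ranking them first scores 1.0, while pushing them behind an irrelevant chunk drops the score to roughly 0.58.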

#### DeepEval Contextual Precision (Flotorch `evaluation_engine="deepeval"`)
Per the [DeepEval contextual precision specification](https://deepeval.com/docs/metrics-contextual-precision), the evaluator:
- gathers an LLM judgment for every retrieved chunk given the user query,
- labels each chunk as relevant or irrelevant using DeepEval’s structured rubric, and
- returns a rank-weighted precision score over those verdicts (relevant chunks placed higher in the ranking score better), with an optional threshold for alerting or pass/fail gating.

Both engines target the same goal—ensuring retrieval delivers helpful evidence—but they emphasise different diagnostics. Ragas centres on ranking-aware scoring, while DeepEval exposes fine-grained relevance verdicts suitable for production monitoring.

---

### **Evaluation Engine**

- `evaluation_engine="auto"` — lets Flotorch Eval mix Ragas and DeepEval according to the priority routing described in the [flotorch-eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop) so that retrieval metrics default to the best available backend.
- `evaluation_engine="ragas"` — keeps every metric inside the [**Ragas**](https://docs.ragas.io/en/stable/getstarted/) rubric for retrieval-aware evaluations (faithfulness, answer relevance, context precision, aspect critic, etc.).
- `evaluation_engine="deepeval"` — routes metrics through DeepEval’s judging engine, enabling custom prompts, thresholds, and rationales while still capturing Flotorch gateway telemetry.

We begin with the Ragas-only flow and then repeat the run with the DeepEval backend to highlight differences in scoring and diagnostics.  

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Knowledge Base.  
- `gt.json` containing question–answer pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates.  

---

#### **Documentation References**

- [**flotorch-eval GitHub repo**](https://github.com/FissionAI/flotorch-eval/tree/develop) — reference implementation with sample notebooks and evaluation pipelines.
- [**Ragas Context Precision Documentation**](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/?h=context) — detailed explanation of the metric.
- [**DeepEval Contextual Precision Documentation**](https://deepeval.com/docs/metrics-contextual-precision)  — detailed explanation of the metric.

---

## 1. Install Dependencies

First, we install the required libraries at pinned versions.

-   `flotorch`: The main Python SDK for interacting with all Flotorch services, including LLMs and Knowledge Bases.
-   `flotorch-eval`: The evaluation library used for the LLM-based evaluations in this notebook.
-   `opentelemetry-instrumentation-httpx`: Installed at a pinned version alongside the two Flotorch packages.

In [None]:
# Install the Flotorch SDK and Flotorch Eval at pinned versions
# Dependency-resolver warnings during installation can usually be ignored.

%pip install flotorch==3.1.0b1 flotorch-eval==2.0.0
%pip install opentelemetry-instrumentation-httpx==0.58b0

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

### Example

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key | `sk_...` |
| `FLOTORCH_BASE_URL` | Gateway endpoint | `https://gateway.flotorch.cloud` |
| `inference_model_name` | The LLM your agent uses to generate answers | `flotorch/gpt-4o-mini` |
| `evaluation_llm_model_name` | The LLM used to evaluate the answers | `flotorch/gpt-4o` |
| `evaluation_embedding_model_name` | Embedding model for semantic similarity checks | `open-ai/text-embedding-ada-002` |
| `knowledge_base_repo` | The ID of your Flotorch Knowledge Base | `digital-twin` |

In [None]:
import getpass  # Prompt for the API key without echoing it in the notebook

# Flotorch authentication
try:
    FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ")
    print("Success")
except getpass.GetPassWarning as e:
    print(f"Warning: {e}")
    FLOTORCH_API_KEY = ""

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ")  # Gateway or console endpoint

inference_model_name = "flotorch/<your_model_name>"  # Model generating answers
evaluation_llm_model_name = "flotorch/<your_model_name>"  # Model judging retrieval quality
evaluation_embedding_model_name = "flotorch/<embedding_model_name>"  # Embedding model for similarity checks

knowledge_base_repo = "<your_knowledge_base_id>"  # Knowledge Base ID
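
Before moving on, it can help to confirm that none of the template markers were left in place. The sketch below is one way to do that check. The helper `find_unset_placeholders` and the example values are hypothetical and are not part of the Flotorch SDK. In the notebook you would pass the variables defined in the cell above.

```python
import re
from typing import Dict, List

PLACEHOLDER = re.compile(r"<[^<>]+>")  # matches template markers such as <your_model_name>


def find_unset_placeholders(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still contain a <...> marker."""
    return [
        name
        for name, value in settings.items()
        if not value.strip() or PLACEHOLDER.search(value)
    ]


# Hypothetical values for illustration only.
example_settings = {
    "inference_model_name": "flotorch/gpt-4o-mini",
    "evaluation_llm_model_name": "flotorch/<your_model_name>",
    "knowledge_base_repo": "",
}
unset = find_unset_placeholders(example_settings)
print("Still to configure:", unset)
```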

## 3. Import Required Libraries

### Purpose
Import all required components for evaluating the RAG assistant.

### Key Components
- `json` : Loads configuration files and ground truth data from disk
- `tqdm` : Shows a lightweight progress bar while iterating over evaluation items
- `FlotorchLLM` : Connects to the Flotorch inference endpoint for answer generation
- `FlotorchVectorStore` : Retrieves context snippets from the configured knowledge base
- `memory_utils` : Utility helpers for extracting text from vector-store search results
- `LLMEvaluator`, `EvaluationItem`, `MetricKey` : Run metric scoring for the collected evaluation items



In [None]:
#Required imports
import json
from typing import List
from tqdm import tqdm # Use standard tqdm for simple progress bars
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval.llm_eval import display_llm_evaluation_results

print("Imported necessary libraries successfully")

## 4. Load Data and Prompts

### Purpose
Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) from local files.

### Files Required

**1. `gt.json` (Ground Truth)**  
Contains question-answer pairs for evaluation. Each `answer` is the expected correct response.

```json
[
    {
        "question": "What is the APR range for the Premium Rewards Credit Card?",
        "answer": "The APR range for the Premium Rewards Credit Card is 18.99% - 26.99% Variable APR based on creditworthiness."
    },
    {
        "question": "What is the minimum credit score required for a Personal Loan Prime?",
        "answer": "The minimum credit score required for a Personal Loan Prime is 700+."
    }
]
```
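
Because a malformed entry only surfaces later in the loop, a quick structural check right after loading can save a run. The following is a minimal sketch that assumes the format shown above (a list of objects with `question` and `answer` strings). `validate_ground_truth` is a hypothetical helper and the sample data is illustrative.

```python
import json
from typing import Any, List


def validate_ground_truth(entries: Any) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data looks usable."""
    if not isinstance(entries, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"item {index}: expected an object")
            continue
        for key in ("question", "answer"):
            value = entry.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{key}'")
    return problems


# Illustrative sample: the second entry has no "answer" field.
sample = json.loads("""
[
  {"question": "What is the APR range for the Premium Rewards Credit Card?",
   "answer": "18.99% - 26.99% Variable APR based on creditworthiness."},
  {"question": "What is the minimum credit score required for a Personal Loan Prime?"}
]
""")
problems = validate_ground_truth(sample)
print(problems)
```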

**2. `prompt.json` (Agent Prompts)**  
Defines the system prompt and user prompt template with `{context}` and `{question}` placeholders for dynamic formatting.

```json
{
  "system_prompt": "You are a helpful Financial Banking assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```

### Instructions
Run the next cell and upload `gt.json` and `prompt.json` when prompted. The cell reads each uploaded file and sets `gt_path` and `prompts_path` from the uploaded file names.

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your Prompts file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully: {len(prompt_config)} keys ({', '.join(prompt_config)})")

## 5. Define Helper Function

### Purpose
Create a prompt-formatting helper for LLM message construction.

### Functionality
The `create_messages` function:
- Builds the final prompt that will be sent to the LLM.
- Accepts system prompt, user prompt template, question, and retrieved context chunks
- Replaces `{context}` and `{question}` placeholders in the user prompt
- Returns a structured message list with (`{role: ..., content: ...}`) fields ready for LLM consumption

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages
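
As a quick sanity check of the formatting logic, the sketch below applies the same placeholder substitution to the sample template from `prompt.json` and two made-up context chunks. `render_user_prompt` is a condensed stand-in for the user-message part of `create_messages`, written here only so the snippet runs on its own.

```python
from typing import List

CHUNK_SEPARATOR = "\n\n---\n\n"  # same separator create_messages places between chunks


def render_user_prompt(template: str, question: str, context: List[str]) -> str:
    """Fill the {context} and {question} placeholders of a user prompt template."""
    context_text = CHUNK_SEPARATOR.join(context)
    return template.replace("{context}", context_text).replace("{question}", question)


template = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
chunks = ["Annual Fee: $195", "APR Range: 18.99% - 26.99%"]  # illustrative chunks
user_content = render_user_prompt(template, "What is the annual fee?", chunks)
print(user_content)
```

If the template has no `{context}` placeholder, the retrieved chunks are silently dropped, so it is worth confirming that the placeholder is present in the loaded template.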


## 6. Initialize Clients

### Purpose
Set up the infrastructure for RAG pipeline execution.

### Components Initialized
1. **FlotorchLLM** (`inference_llm`): Connects to the LLM endpoint for generating answers based on retrieved context
2. **FlotorchVectorStore** (`kb`): Connects to the Knowledge Base for semantic search and context retrieval
3. **Prompt Variables**: Extracts system prompt and user prompt template from `prompt_config` for dynamic message formatting

These clients power the evaluation loop by retrieving relevant context and generating answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
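# prompt.json stores the template under "user_prompt_template"; "user_prompt" is accepted as a fallback key
user_prompt_template = prompt_config.get("user_prompt_template", prompt_config.get("user_prompt", "{question}"))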

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

### Purpose
Execute the full RAG pipeline for each question using real Knowledge Base retrieval to generate answers for evaluation.

### Pipeline Steps
For each question in `ground_truth`, the loop performs:

1. **Retrieve Context**: Searches the Knowledge Base (`kb.search()`) to fetch relevant context passages based on the question
2. **Build Messages**: Uses `create_messages()` to format the system prompt, user prompt, question, and retrieved context into LLM-ready messages
3. **Generate Answer**: Invokes the inference LLM (`inference_llm.invoke()`) with `return_headers=True` to capture response metadata (cost, latency, tokens)
4. **Store for Evaluation**: Packages question, generated answer, expected answer, context, and metadata into an `EvaluationItem` object

### Error Handling
A `try...except` block gracefully handles API failures, storing error messages as evaluation items to ensure the loop completes without crashes.
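
The loop and its error handling can be dry-run without credentials. The sketch below follows the same retrieve, generate, and store pattern with stand-in functions. `DryRunItem`, `run_loop`, `fake_retrieve`, and `fake_generate` are all hypothetical names introduced for illustration, and `DryRunItem` only mirrors the fields stored per question rather than the real `EvaluationItem` class.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DryRunItem:
    """Hypothetical stand-in mirroring the fields stored per question."""
    question: str
    generated_answer: str
    expected_answer: str
    context: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def run_loop(
    ground_truth: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> List[DryRunItem]:
    """Retrieve, generate, and record one item per question; failures are recorded, not raised."""
    items = []
    for qa in ground_truth:
        question, expected = qa.get("question", ""), qa.get("answer", "")
        try:
            context = retrieve(question)
            answer = generate(question, context)
            items.append(DryRunItem(question, answer, expected, context))
        except Exception as exc:  # keep going so one failure does not abort the run
            items.append(DryRunItem(question, f"Error: {exc}", expected, [], {"error": str(exc)}))
    return items


def fake_retrieve(question: str) -> List[str]:
    if "loan" in question.lower():
        raise TimeoutError("knowledge base timed out")  # simulated API failure
    return ["Annual Fee: $195"]


def fake_generate(question: str, context: List[str]) -> str:
    return f"Based on the context: {context[0]}"


items = run_loop(
    [{"question": "What is the annual fee?", "answer": "$195"},
     {"question": "What is the loan APR?", "answer": "n/a"}],
    fake_retrieve,
    fake_generate,
)
failed = [item.question for item in items if "error" in item.metadata]
print(len(items), failed)
```

Failed items carry an empty context list, so it may be worth filtering them out (or at least counting them) before scoring, since they would otherwise pull the average precision down for reasons unrelated to retrieval quality.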


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")

## 8. Initialize the Evaluator

### Using Ragas Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can set up the `LLMEvaluator`.

This class is the core component of the **Flotorch-Eval** library — think of it as the *“head judge”* for our evaluation process. It coordinates metric calculations, semantic comparisons, and LLM-based judgments using the configuration we provide.

### Parameter Insights

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — The evaluator uses:
  - an **LLM** (`inferencer_model`) for reasoning-based checks, and  
  - an **embedding model** (`embedding_model`) for semantic and contextual similarity evaluations.  
- **`evaluation_engine`** — Here, we set this to `"ragas"`, meaning the evaluator will use the **[Ragas framework](https://docs.ragas.io/en/stable/getstarted/)** for metric computation.  
  Ragas is well-suited for RAG-style evaluations and handles metrics such as:
  - **Faithfulness**
  - **Answer Relevance**
  - **Context Precision**
  - **Aspect Critic (custom maliciousness check)**  

- **`metrics`** — In this configuration, we evaluate only **`MetricKey.CONTEXT_PRECISION`**.

### Context Precision Metric

**Definition**: measures how relevant the retrieved context chunks are to the question. The score ranges from 0 to 1 and reflects **the proportion of top-ranked chunks that actually help answer the question**. High precision confirms the retriever surfaces the right evidence first; low precision reveals noisy or off-topic context.

**How It Works**:
1. Inspects each retrieved chunk for relevance to the question (and reference answer when available)
2. Labels chunks as relevant or irrelevant and applies higher weight to earlier ranks
3. Computes precision over the ranking to produce the final score

**Example**:

*Question*: "What is the annual fee for the Premium Rewards Credit Card?"

*Relevant Context* (High Score): "Premium Rewards Credit Card - Product Code: CC-PR-001, Annual Fee: $195 (waived first year for qualified applicants), APR Range: 18.99% - 26.99%..." → **Score: 1.0**

*Irrelevant Context* (Low Score): "High-Yield Savings Account offers 4.35% APY with FDIC insurance up to $250,000. Minimum opening deposit is $100..." → **Score: 0.0**


In [None]:
# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.CONTEXT_PRECISION,
    ],
    evaluation_engine="ragas"
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation (Ragas)

### Purpose
Execute the evaluation process to score the retrieved context of every evaluation item using the **context precision** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **context precision** by judging the retrieved context against the question and the reference answer.

- Collect the following outputs:
  - Context precision scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute context precision scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.
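
To illustrate why the concurrent path tends to finish sooner, the sketch below times simulated judge calls run one after another and then all at once. It uses only `asyncio` from the standard library. `fake_judge_call` is a hypothetical stand-in that sleeps instead of calling a model, and nothing here reflects how `aevaluate()` is implemented internally.

```python
import asyncio
import time


async def fake_judge_call(question: str, delay: float = 0.1) -> float:
    """Stand-in for one LLM-judge request; sleeps instead of calling a model."""
    await asyncio.sleep(delay)
    return 1.0


async def score_sequentially(questions):
    # Each call waits for the previous one to finish.
    return [await fake_judge_call(q) for q in questions]


async def score_concurrently(questions):
    # All calls are in flight at the same time.
    return await asyncio.gather(*(fake_judge_call(q) for q in questions))


def timed(coro_factory, questions):
    start = time.perf_counter()
    scores = asyncio.run(coro_factory(questions))
    return scores, time.perf_counter() - start


questions = [f"q{i}" for i in range(5)]
_, sequential_seconds = timed(score_sequentially, questions)
_, concurrent_seconds = timed(score_concurrently, questions)
print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

This snippet is written as a standalone script. Inside a notebook, where an event loop is already running, use `await` directly (as the asynchronous cell below does) rather than `asyncio.run`.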


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (alternative: run this cell instead of the asynchronous one)

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = evaluator_client.evaluate(evaluation_items)

print("Evaluation complete.")

## 10. View Per-Question Results (Ragas)

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.

In [None]:
display_llm_evaluation_results(eval_results)

## 11. Initialize the Evaluator

### Using DeepEval Engine

Now that we have our `evaluation_items` list (containing the generated answers), we can switch the `LLMEvaluator` to the **DeepEval** backend to compare its scoring with the Ragas run.

This class is still the *“head judge”* for our evaluation process; we’re simply changing which judging rubric it follows.

### Parameter Insights (DeepEval Mode)

- **`api_key` / `base_url`** — Standard credentials used to authenticate and connect with the Flotorch-Eval service.  
- **`inferencer_model` / `embedding_model`** — DeepEval-powered scoring still needs the evaluator LLM and embeddings for semantic checks.  
- **`evaluation_engine="deepeval"`** — Routes metrics through DeepEval, which (per the [flotorch-eval repository](https://github.com/FissionAI/flotorch-eval/tree/develop)) unlocks the following metric keys:  
  - **`MetricKey.FAITHFULNESS`**
  - **`MetricKey.ANSWER_RELEVANCY`**
  - **`MetricKey.CONTEXT_RELEVANCY`**
  - **`MetricKey.CONTEXT_PRECISION`**
  - **`MetricKey.CONTEXT_RECALL`**
  - **`MetricKey.HALLUCINATION`**
  These are the same metrics surfaced in Flotorch’s *auto* mode when Ragas prerequisites (like embeddings) are missing.  
- **`metric_configs`** — Lets us pass DeepEval-specific arguments such as custom thresholds for contextual precision (we set `0.75` below) and tuning knobs for other diagnostics (maliciousness, hallucination, etc.).

### DeepEval Context Precision Metric

**Definition**: checks whether each retrieved chunk is relevant to the question using the [DeepEval contextual precision rubric](https://deepeval.com/docs/metrics-contextual-precision). The metric returns a score between 0 and 1, where higher values indicate the retriever consistently surfaces helpful evidence within the top-ranked chunks.

**How It Works**:
1. Extracts the ranked context list returned by the retriever.  
2. Labels each chunk as relevant or irrelevant given the question (and, when supplied, the reference answer).  
3. Computes precision over the ranking—optionally applying thresholds in `metric_configs` to drive alerts or pass/fail signalling.

**Example**:

*Question*: "What is the annual fee for the Premium Rewards Credit Card?"

*High Precision (Score ≈ 1.0)*: Top chunks cite "Annual Fee: $195 (waived first year for qualified applicants)" and supporting fee details.

*Low Precision (Score ≈ 0.0)*: Top chunks drift to unrelated savings accounts or mortgage content, so DeepEval flags the retrieval as off-topic.

This mirrors the Ragas precision workflow while adding threshold-driven outcomes and richer chunk-by-chunk diagnostics that are useful for production monitoring.
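
The threshold-driven outcome can be pictured as a simple gate over per-question scores. The sketch below assumes that a score at or above the threshold counts as a pass. `gate_by_threshold`, the illustrative threshold of 0.75, and the scores are made up for illustration and are not DeepEval or Flotorch Eval APIs.

```python
from typing import Dict, List


def gate_by_threshold(scores: Dict[str, float], threshold: float = 0.75) -> Dict[str, List[str]]:
    """Split questions into passed/failed by comparing each score with the threshold."""
    outcome: Dict[str, List[str]] = {"passed": [], "failed": []}
    for question, score in scores.items():
        outcome["passed" if score >= threshold else "failed"].append(question)
    return outcome


# Made-up scores for illustration only.
example_scores = {"annual fee": 1.0, "loan credit score": 0.58, "cd ladder terms": 0.75}
outcome = gate_by_threshold(example_scores)
print(outcome)
```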


In [None]:
# Configure a custom pass/fail threshold for context_precision
metric_args = {
    "context_precision": {
        "threshold": 0.75
    }
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.CONTEXT_PRECISION,
    ],
    evaluation_engine="deepeval",
    metric_configs=metric_args
)

print("LLMEvaluator client initialized.")

## 12. Run Evaluation (DeepEval)

### Purpose
Execute the evaluation process to score the retrieved context of every evaluation item using the **context precision** metric.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- For each evaluation item:
  - The evaluator scores **context precision** by judging the retrieved context against the question and the reference answer.

- Collect the following outputs:
  - Context precision scores
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute context precision scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (alternative: run this cell instead of the asynchronous one)

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = evaluator_client.evaluate(evaluation_items)

print("Evaluation complete.")

## 13. View Per-Question Results (DeepEval)

### Purpose
Display evaluation results in a formatted table for easy analysis and comparison.

In [None]:
display_llm_evaluation_results(eval_results)

## 14. View Raw JSON Results

### Purpose
Display the complete evaluation results in JSON format for detailed inspection, exporting, or downstream automation. The payload reflects whichever evaluation engine was executed most recently (DeepEval in this workflow).

### Output Structure
Each entry contains:
- **model**: The evaluation LLM model that produced the scores.  
- **input_query**: The original user question.  
- **context**: Full retrieved context passages (untruncated).  
- **generated_answer**: Complete LLM-generated response.  
- **groundtruth_answer**: Expected reference answer.  
- **evaluation_metrics**: Dictionary with metric scores and telemetry, for example:  
  - **context_precision / contextual_precision**: Context precision score (0.0–1.0) under the active engine.  
  - **average_score**: Mean of all returned metrics.  
  - **total_latency_ms**: Total evaluation time in milliseconds.  
  - **total_cost**: Cost of the evaluation in USD.  
  - **total_tokens**: Token count consumed during scoring.

Use this JSON output for further analysis, comparisons between Ragas and DeepEval runs, or integration with orchestration tools.
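
As one example of downstream analysis, the sketch below computes a mean precision score and lists the questions that fall below a chosen threshold. It assumes the results are a list of dictionaries with the `input_query` and `evaluation_metrics` fields listed above; the actual payload shape may differ, and `summarize`, `precision_score`, and the sample values are hypothetical.

```python
from statistics import mean
from typing import Any, Dict, List, Optional

SCORE_KEYS = ("context_precision", "contextual_precision")  # key name depends on the engine


def precision_score(entry: Dict[str, Any]) -> Optional[float]:
    """Return the precision score of one entry, or None if neither key is present."""
    metrics = entry.get("evaluation_metrics", {})
    for key in SCORE_KEYS:
        if key in metrics:
            return float(metrics[key])
    return None


def summarize(results: List[Dict[str, Any]], threshold: float) -> Dict[str, Any]:
    """Mean precision plus the questions scoring below the threshold."""
    scored = [(entry["input_query"], precision_score(entry)) for entry in results]
    scored = [(query, score) for query, score in scored if score is not None]
    return {
        "mean_precision": round(mean(score for _, score in scored), 3) if scored else None,
        "below_threshold": [query for query, score in scored if score < threshold],
    }


# Hypothetical payload for illustration only.
sample_results = [
    {"input_query": "What is the annual fee?", "evaluation_metrics": {"context_precision": 1.0}},
    {"input_query": "What is the loan APR?", "evaluation_metrics": {"contextual_precision": 0.5}},
]
summary = summarize(sample_results, threshold=0.75)
print(summary)
```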

In [None]:
import json

print("--- Aggregate Evaluation Results ---")
print(json.dumps(eval_results, indent=2))


## 15. Summary

### What We Accomplished

This notebook demonstrates a full retrieval-evaluation workflow using **both** the Ragas and DeepEval engines for context precision. You can baseline retrieval quality with Ragas and then validate the same evaluation set with DeepEval's thresholded judging rubric.

### Workflow Summary

1. **Configured Infrastructure**
   - Set up `FlotorchLLM` for answer generation.
   - Connected to `FlotorchVectorStore` for context retrieval.
   - Prepared shared prompts, credentials, and evaluation artifacts.

2. **Generated Responses**
   - Loaded prompts and ground-truth questions from `financial_banking_gt.json`.
   - Retrieved top-N context chunks per query.
   - Generated answers with the inference LLM while collecting latency, cost, and token metadata.

3. **Scored Retrieval Quality Twice**
   - **Ragas run**: computed `llm_context_precision_with_reference` to establish a retrieval baseline.
   - **DeepEval run**: re-scored the same `evaluation_items` with `evaluation_engine="deepeval"`, applying a `0.75` contextual-precision threshold via `metric_configs` and capturing gateway telemetry (cost, latency, tokens).

4. **Visualized & Exported Results**
   - Rendered separate tables for Ragas and DeepEval outcomes to compare how each engine scores each question.
   - Printed the raw JSON for the most recent run (DeepEval in this workflow), including metric scores and gateway telemetry, for downstream analysis or automation.

### Key Takeaways

- **Context Precision ≈ 1.0** indicates that early-ranked context chunks are highly relevant; scores near **0.0** reveal noisy retrieval.
- DeepEval adds thresholding plus chunk-level rationales, making it easier to trigger alerts when precision drops below operational requirements.
- Both engines evaluate **Retrieved Context ↔ Question relevance**, not direct answer-to-ground-truth similarity.
- Running both engines side-by-side offers confidence in retrieval health across experimentation, regression testing, and production monitoring.

### Ragas vs. DeepEval in Practice

| Dimension | Ragas Context Precision | DeepEval Context Precision |
|-----------|-------------------------|----------------------------|
| **Docs** | [Metric docs](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/) | [Metric docs](https://deepeval.com/docs/metrics-contextual-precision) |
| **Core Flow** | Computes precision@k over the ranked context list, weighting early chunks and using LLM relevance verdicts for each chunk. | Uses LLM judgements to label each chunk as relevant/irrelevant, aggregates precision, and exposes rationales plus optional thresholds. |
| **Engine Toggle** | `evaluation_engine="ragas"` keeps scoring inside the Ragas framework with minimal configuration. | `evaluation_engine="deepeval"` unlocks DeepEval metrics per the [Flotorch Eval repo](https://github.com/FissionAI/flotorch-eval/tree/develop). |
| **Additional Metrics** | Context precision, answer relevance, faithfulness, aspect critic (when configured). | Context precision, context recall, answer relevancy, hallucination, plus threshold-based outcomes and reasoning traces. |

**Example Retrieval**: “What is the annual fee for the Premium Rewards Credit Card?”

- **Ragas Outcome**
  - High-scoring run: Top-ranked chunks quote “Annual Fee: $195 (waived first year for qualified applicants)” so context precision stays near 1.0.
  - Low-scoring run: If retrieval floats savings-account details to the top, precision collapses toward 0.0 even when the downstream answer looks correct.
- **DeepEval Outcome**
  - Pass scenario: DeepEval labels the first chunks as relevant, clearing the 0.75 threshold and returning supporting rationales.
  - Fail scenario: When irrelevant chunks dominate, DeepEval flags the retrieval, captures explanations for each off-topic chunk, and surfaces telemetry for alerting.

DeepEval’s broader metric set complements the Ragas baseline, giving both ranking-aware scores and guardrail-friendly diagnostics.