[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1eNJlbx1PBrVeC2A3NsUrDZehRF2mvcmu/view?usp=sharing)

# Evaluating the Flotorch Product Assistant with Flotorch Eval



In this notebook, we perform a structured evaluation of a **RAG-based QA system** using the **Flotorch SDK** together with the **Flotorch Eval** framework.

---

### **Use Case Overview**

The **Flotorch Product Assistant** helps users understand and navigate Flotorch’s tools, SDKs, and workflows.  
It retrieves information from a knowledge base populated with **Flotorch product documentation** and generates accurate, helpful answers to user queries.

This notebook evaluates how well the assistant performs in providing **relevant, factual, and safe** responses.

---

### **Notebook Workflow**

We’ll go through the following steps:

1. **Iterate Questions** – Loop through each question in a `gt.json` (Ground Truth) file.  
2. **Retrieve Context** – Query the Flotorch Knowledge Base for relevant information.  
3. **Generate Answer** – Use the system prompt and Flotorch LLM to produce an answer.  
4. **Store Results** – Save each question, retrieved context, generated answer, and ground truth.  
5. **Evaluate** – Use `LLMEvaluator` from Flotorch Eval to assess the outputs.  
6. **Display Scores** – Show the evaluation results in a simple table.

---

### **Metrics Evaluated**

To measure the assistant’s quality, we’ll use four key metrics:

- **Faithfulness** – Is the answer supported by the retrieved context?  
- **Answer Relevancy** – Does it directly address the user’s question?  
- **Context Precision** – Is the retrieved context appropriate and focused?  
- **Hallucination** – Does the answer contain fabricated or factually incorrect information that the input or retrieved context does not support?
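
To make **Context Precision** more concrete, the sketch below computes a rank-weighted precision over binary relevance labels, following the formula described in the Ragas documentation. The function name and the hand-written labels are illustrative assumptions. In the actual metric, an LLM judges whether each retrieved chunk is relevant, so this is a simplified illustration rather than the library's implementation.

```python
from typing import List


def context_precision_at_k(relevance: List[int]) -> float:
    """Rank-weighted precision over binary relevance labels (1 = relevant).

    Illustrative only: the labels are supplied by hand here, whereas the
    real metric derives them from LLM judgments.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / total_relevant


# Relevant chunks ranked first score higher than the same chunks ranked last.
print(context_precision_at_k([1, 1, 0]))
print(context_precision_at_k([0, 1, 1]))
```

The same two relevant chunks score lower when they appear after an irrelevant one, which is why this metric rewards retrievers that rank useful context at the top.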

---

### **Requirements**

- Flotorch account with configured LLM, embedding model, and Knowledge Base.  
- `gt.json` containing question–answer pairs for evaluation.  
- `prompt.json` containing the system and user prompt templates.


## 1. Install Dependencies

Before proceeding, install the required libraries for model interaction and evaluation:

- **`flotorch`** — The primary Python SDK for accessing Flotorch services, including LLM inference, knowledge bases, and related APIs.  
- **`flotorch-eval[all]`** — The evaluation toolkit. The `[all]` option ensures that all optional dependencies needed for metrics, evaluation engines, and integrations are installed.


In [None]:
# Install the Flotorch SDK (flotorch) and Flotorch Eval (flotorch-eval)
# You can safely ignore dependency errors during installation.

%pip install flotorch==3.1.0b1 flotorch-eval==2.0.0
%pip install opentelemetry-instrumentation-httpx==0.58b0
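
Because dependency messages can be noisy during installation, it may help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution names are taken from the `pip` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip command above.
for package in ("flotorch", "flotorch-eval"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```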

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

In [None]:
# -----------------------------------------------------------
# Flotorch Model Configuration
# -----------------------------------------------------------
# Update the placeholders below with the correct values
# from your Flotorch Console before running inference
# or evaluation.
# -----------------------------------------------------------

from getpass import getpass

# Authentication
FLOTORCH_API_KEY = getpass("Enter your Flotorch API key: ")
FLOTORCH_BASE_URL = input("Enter Base URL: ")   # Example: "https://gateway.flotorch.cloud"

# Model names (replace with your actual model IDs)
inference_model_name = "<INFERENCE_MODEL_NAME>"                # Model used by your agent
evaluation_llm_model_name = "<EVALUATION_LLM_MODEL_NAME>"      # Model used for scoring
evaluation_embedding_model_name = "<EMBEDDING_MODEL_NAME>"     # Embedding model for evaluation,
                                                               # e.g. "flotorch/text-embedding-model"

# Knowledge base repository used for retrieval
knowledge_base_repo = "<KNOWLEDGE_BASE_REPO_NAME>"
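
A forgotten placeholder typically only surfaces later as an API error. As a minimal sketch, assuming placeholders keep the `<UPPER_CASE>` form used above, a small check can list the settings that still need a value. The helper name `unset_placeholders` and the example values are hypothetical.

```python
import re
from typing import Dict, List

PLACEHOLDER = re.compile(r"^<[A-Z_]+>$")  # matches values like "<INFERENCE_MODEL_NAME>"


def unset_placeholders(settings: Dict[str, str]) -> List[str]:
    """Return the names of settings that are empty or still hold a <PLACEHOLDER>."""
    return [
        name
        for name, value in settings.items()
        if not value.strip() or PLACEHOLDER.match(value.strip())
    ]


# Example values; in the notebook you would pass the variables defined above.
example_settings = {
    "inference_model_name": "<INFERENCE_MODEL_NAME>",
    "evaluation_llm_model_name": "flotorch/gpt-4o",
    "knowledge_base_repo": "",
}
print(unset_placeholders(example_settings))
```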


## 3. Import Libraries

Next, we import all the necessary modules from Python and the Flotorch libraries that we'll use throughout the notebook.

In [None]:
# Required imports
import json
from typing import List
from tqdm import tqdm  # Use standard tqdm for simple progress bars

# Google Colab helper for uploading local files
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval import display_llm_evaluation_results

## 4. Load Data and Prompts

Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) from local files.

Your files should be structured as follows:

**`gt.json` (Ground Truth)**
A list of question-answer objects. The `answer` is the "perfect" response you are testing against.
```json
[
  {
    "question": "What is Flotorch?",
    "answer": "Flotorch is an end-to-end platform for..."
  },
  {
    "question": "How do I install the SDK?",
    "answer": "You can install it using pip: pip install flotorch."
  }
]
```

**`prompt.json` (Agent Prompts)**
Contains the system prompt and the user prompt template. Note the `{context}` and `{question}` placeholders, which we will fill dynamically.
```json
{
  "system_prompt": "You are a helpful Flotorch product assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```

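**Note:** In Google Colab, the cell below opens an upload dialog for each file. If you instead upload the files through the file icon on the left, skip the `files.upload()` calls and set `gt_path` and `prompts_path` to the uploaded file paths.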
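
Malformed input files tend to fail late, in the middle of the experiment loop. The sketch below checks already-parsed JSON against the two structures described above and returns a list of problems instead of raising. The function names are hypothetical, and the checks cover only the keys and placeholders named in this section.

```python
from typing import Any, List


def validate_ground_truth(data: Any) -> List[str]:
    """Check that data is a list of objects with non-empty 'question' and 'answer'."""
    if not isinstance(data, list):
        return ["gt.json must contain a JSON list"]
    problems = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object")
            continue
        for key in ("question", "answer"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{key}'")
    return problems


def validate_prompts(config: Any) -> List[str]:
    """Check that config has both prompts and that the template has both placeholders."""
    if not isinstance(config, dict):
        return ["prompt.json must contain a JSON object"]
    problems = []
    for key in ("system_prompt", "user_prompt_template"):
        if not isinstance(config.get(key), str):
            problems.append(f"missing '{key}'")
    template = config.get("user_prompt_template")
    if isinstance(template, str):
        for placeholder in ("{context}", "{question}"):
            if placeholder not in template:
                problems.append(f"user_prompt_template lacks {placeholder}")
    return problems


# Small in-memory examples; in the notebook you would pass the loaded JSON objects.
print(validate_ground_truth([{"question": "What is Flotorch?", "answer": ""}]))
print(validate_prompts({"system_prompt": "You are helpful.", "user_prompt_template": "{question}"}))
```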

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your Prompts file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully: {len(prompt_config)} keys found")

## 5. Define Helper Function

This simple helper function, `create_messages`, builds the final prompt that will be sent to the LLM. It takes the system prompt, user template, question, and retrieved context, and formats them into the standard list of message objects (`{role: ..., content: ...}`) that the model expects.

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages
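
The helper fills the placeholders with `str.replace` rather than `str.format`. One practical consequence, sketched below, is that a template may contain other literal braces (for example a JSON hint) without breaking. The function `fill_template` and the template text are hypothetical and only mirror the substitution step of `create_messages`.

```python
def fill_template(template: str, context: str, question: str) -> str:
    """Same substitution approach as create_messages: plain string replacement."""
    return template.replace("{context}", context).replace("{question}", question)


# Hypothetical template that contains literal braces (a JSON hint for the model).
template = 'Reply as JSON like {"answer": "..."}.\nContext:\n{context}\n\nQuestion:\n{question}'

filled = fill_template(template, "Flotorch is a platform.", "What is Flotorch?")
print(filled)

# str.format treats every brace pair as a field, so the same template fails.
try:
    template.format(context="Flotorch is a platform.", question="What is Flotorch?")
    format_ok = True
except (KeyError, ValueError, IndexError):
    format_ok = False
print("str.format succeeded:", format_ok)
```

One caveat of chained replacement: if a retrieved chunk happens to contain the literal text `{question}`, the second `replace` will substitute it as well.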

## 6. Initialize Clients

We create the clients for the generative LLM (`FlotorchLLM`) and the Knowledge Base (`FlotorchVectorStore`). These clients will be used inside the loop to get context and generate answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

This is the core logic for *generating* the answers. We loop through each question in our `ground_truth` list and perform the full RAG pipeline:

1.  **Retrieve Context**: We call `kb.search()` to get context from the vector store.
2.  **Build Messages**: We use our `create_messages` helper to assemble the final prompt.
3.  **Generate Answer**: We call `inference_llm.invoke()` to get the agent's response.
4.  **Store for Evaluation**: We package all this information into an `EvaluationItem` object and add it to our `evaluation_items` list.

We also include a `try...except` block to gracefully handle any errors during the API calls, ensuring the loop doesn't crash.


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"Experiment completed. {len(evaluation_items)} items are ready for evaluation.")
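
Items recorded in the `except` branch carry an `"error"` entry in their metadata and an error message in place of an answer, so scoring them alongside real answers can pull the metrics down. The sketch below separates the two groups before evaluation. `Item` is a hypothetical stand-in for `EvaluationItem`, the `latency_ms` key is made up for illustration, and the sketch assumes the real class exposes `metadata` as an attribute.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Item:
    """Hypothetical stand-in for EvaluationItem, holding only the fields used here."""
    question: str
    generated_answer: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_failures(items: List[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate items whose metadata carries an "error" key from the rest."""
    succeeded = [item for item in items if "error" not in (item.metadata or {})]
    failed = [item for item in items if "error" in (item.metadata or {})]
    return succeeded, failed


items = [
    Item("What is Flotorch?", "Flotorch is a platform.", {"latency_ms": 120}),
    Item("How do I install the SDK?", "Error: timeout", {"error": "timeout"}),
]
succeeded, failed = split_failures(items)
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
```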

## 8. Initialize the Evaluator

With the `evaluation_items` list ready, we can now initialize the `LLMEvaluator`.

The `LLMEvaluator` is the central component of the **Flotorch-Eval** library—it orchestrates metric computation, similarity analysis, and LLM-based judgments based on the configuration provided.

### Parameter Overview

- **`api_key` / `base_url`**  
  Credentials required to authenticate and communicate with the Flotorch-Eval service.

- **`inferencer_model` / `embedding_model`**  
  The evaluator uses:  
  - an **LLM** (`inferencer_model`) for reasoning-driven evaluations, and  
  - an **embedding model** (`embedding_model`) for semantic similarity and context comparison.

- **`evaluation_engine`**  
  Set to `"ragas"` in this notebook, meaning the evaluator uses the **Ragas** framework, which is well-suited for RAG system evaluation.  
  Ragas supports core metrics such as:  
  - **Faithfulness**  
  - **Answer Relevance**  
  - **Context Precision**  
  - **Aspect Critic**  

  Other available engines include:  
  - **`"deepeval"`** — Uses the DeepEval framework for LLM-as-judge and critic-based metrics.  
  - **`"auto"`** — Automatically selects the appropriate engine for each metric.

- **`metrics`**  
  Defines the metrics to compute—in this case:  
  **faithfulness**, **answer_relevance**, **context_precision**, and **aspect_critic**.

- **`metric_configs`**  
  Provides configuration options for metrics that require additional parameters.  
  A key example is **`ASPECT_CRITIC`**, which functions as a *parent metric*.  
  Under **Aspect Critic**, you can define one or more evaluation aspects, such as:  
  - **hallucination**  (used in this notebook)
  - **maliciousness**   
  - **toxicity**  
  - **instruction adherence**, etc.

  Here, we configure Aspect Critic specifically to evaluate **Hallucination**, illustrating how checks for fabricated or unsupported information can be integrated into a RAG evaluation workflow.
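
As a sketch of how several aspects could be declared at once, the helper below builds the nested `name` / `definition` structure from a plain mapping. The helper `build_aspects` and the wording of the definitions are hypothetical, and it is an assumption that additional aspects follow the same shape as the single-aspect configuration used in this notebook.

```python
from typing import Dict


def build_aspects(definitions: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Turn {aspect name: definition} into the nested shape used for Aspect Critic."""
    return {
        name: {"name": name, "definition": definition}
        for name, definition in definitions.items()
    }


aspects = build_aspects({
    "hallucination": "Does the answer contain claims that the retrieved context does not support?",
    "maliciousness": "Is the answer intended to harm, deceive, or exploit the user?",
})
print(aspects)

# In the notebook this would be passed as (assumption: same shape per aspect):
# metric_args = {MetricKey.ASPECT_CRITIC: aspects}
```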

In [None]:
# Configure a custom metric for hallucination
metric_args = {
    MetricKey.ASPECT_CRITIC: {
        "hallucination": {
            "name": "hallucination",
            "definition": "Does the answer contain information that is not supported by the retrieved context?"
        }
    }
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.FAITHFULNESS,
        MetricKey.ANSWER_RELEVANCE,
        MetricKey.CONTEXT_PRECISION,
        MetricKey.ASPECT_CRITIC
    ],
    metric_configs=metric_args,
    evaluation_engine="ragas"
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation

### Purpose
Execute the evaluation process to score all generated answers using all configured **Ragas** metrics.

### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question to compute metric scores.  
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.
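
To illustrate why concurrent execution tends to be faster for I/O-bound scoring, the sketch below compares awaiting calls one by one with starting them together via `asyncio.gather`. It does not reflect how `aevaluate()` is implemented. `fake_score` is a hypothetical stand-in that sleeps instead of calling a model, and the third sample question is made up.

```python
import asyncio
import time
from typing import List


async def fake_score(question: str) -> float:
    """Stand-in for one LLM-backed scoring call: waits briefly, returns a dummy score."""
    await asyncio.sleep(0.1)
    return float(len(question))


async def score_sequentially(questions: List[str]) -> List[float]:
    # Each call finishes before the next one starts.
    return [await fake_score(q) for q in questions]


async def score_concurrently(questions: List[str]) -> List[float]:
    # All calls are started together and awaited as a group; order is preserved.
    return list(await asyncio.gather(*(fake_score(q) for q in questions)))


questions = ["What is Flotorch?", "How do I install the SDK?", "What is a Knowledge Base?"]

# asyncio.run() is for plain scripts; in a notebook, where an event loop is
# already running, use `await score_concurrently(questions)` instead.
start = time.perf_counter()
sequential_scores = asyncio.run(score_sequentially(questions))
sequential_time = time.perf_counter() - start

start = time.perf_counter()
concurrent_scores = asyncio.run(score_concurrently(questions))
concurrent_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

The same distinction explains the top-level `await` used below: it works in a notebook because an event loop is already running, whereas a standalone script would wrap the coroutine in `asyncio.run()`.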


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 10. View Per-Question Results

The `eval_results` variable contains a list of dictionaries, where each dictionary holds
the detailed evaluation results and metrics for an individual question.

### Process
- Pass `eval_results` to the `display_llm_evaluation_results` method.
- This renders the results in a clean, structured, and human-readable format.


In [None]:
display_llm_evaluation_results(eval_results)

## 11. View Raw JSON Results

Finally, we can print the raw `eval_results` list as formatted JSON. This is useful for seeing all the data at once, including gateway metrics like latency and cost for each individual evaluation call, which are stored inside the `evaluation_metrics` dictionary for each item.

In [None]:
print("--- Raw Evaluation Results ---")

print(json.dumps(eval_results, indent=2))
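
For a run-level summary, the per-question values can be averaged. The sketch below assumes each result is a dictionary whose `evaluation_metrics` entry maps metric names to numbers; the exact keys in a real `eval_results` list may differ, so the sample data and the helper name `average_metrics` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def average_metrics(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Average every numeric value found under 'evaluation_metrics' across results."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for result in results:
        for name, value in result.get("evaluation_metrics", {}).items():
            # Skip non-numeric entries; bool is excluded because it subclasses int.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[name] += value
                counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Hypothetical sample shaped like the description above.
sample_results = [
    {"question": "Q1", "evaluation_metrics": {"faithfulness": 1.0, "answer_relevance": 0.8}},
    {"question": "Q2", "evaluation_metrics": {"faithfulness": 0.5, "answer_relevance": 0.6}},
]
print(average_metrics(sample_results))
```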

## 12. Summary

This notebook provided a complete, step-by-step workflow for evaluating a RAG agent using Flotorch Eval.

We successfully:

1.  **Configured** clients for the Flotorch SDK (`FlotorchLLM`, `FlotorchVectorStore`) and Flotorch Eval (`LLMEvaluator`).
2.  **Generated** responses by looping through a `gt.json` file, retrieving context from the Knowledge Base, and calling the inference LLM.
3.  **Evaluated** all responses by passing the full `evaluation_items` list to `evaluator_client.aevaluate()` (or `evaluate()` for sequential execution), collecting detailed metrics for Faithfulness, Answer Relevancy, Context Precision, and Hallucination.
4.  **Displayed** the final, per-question scores in a formatted table, allowing for easy analysis of where the agent is succeeding or failing.