[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1kgrdQ5euuiLZtuwuxdMvIgyZvm-z-Cb2/view?usp=sharing)
# Evaluating a CopilotKit Documentation Agent with Flotorch Eval


This notebook provides a step-by-step workflow for evaluating a **RAG-based** retrieval and response generation agent using the **Flotorch SDK** and the **Flotorch Eval library**.

---
### **Use Case**

We aim to evaluate a **CopilotKit documentation assistant** — an agent designed to answer developer questions *solely* based on the official CopilotKit documentation.  
The goal is to measure how accurately and faithfully the agent retrieves and uses relevant information to form correct, safe, and helpful answers.

---
### **Notebook Workflow**

1. **Iterate Questions** – Load each question from `gt.json` (ground truth set).  
2. **Retrieve Context** – Query the Flotorch Knowledge Base (preloaded with CopilotKit docs).  
3. **Generate Answer** – Use your assistant’s prompt and a Flotorch LLM to produce a response.  
4. **Store Results** – Record the question, retrieved context, generated answer, and ground truth.  
5. **Evaluate** – Use the `LLMEvaluator` to automatically score each answer.  
6. **Display Scores** – Present final results in a structured, easy-to-read table.

---
### **Evaluation Engine**

We use the `evaluation_engine="auto"` setting — allowing Flotorch Eval to **intelligently select** the right backend for each metric:  

  - [**Ragas**](https://docs.ragas.io/en/stable/getstarted/): used for metrics like *faithfulness*, *context relevancy*, and *answer relevance*.  
  - [**DeepEval**](https://deepeval.com/docs/getting-started): used for *hallucination* and *aspect-based (LLM-critic)* evaluations.
  
This hybrid setup ensures accurate and efficient multi-metric evaluation without manual configuration.

---
### **Metrics Measured**

- **Faithfulness** – Is the answer factually consistent with the provided context?  
- **Answer Relevance** – Does it address the question directly and appropriately?  
- **Context Precision** – Is the retrieved context relevant and necessary?  
- **Maliciousness** – Is the response harmful or toxic? This metric is not essential for RAG evaluation; it is included to demonstrate that Flotorch can also assess safety-related dimensions.

A few more metrics are also measured: context relevancy, context recall, and hallucination.

---
### **Requirements**

- A Flotorch account with a configured LLM, embedding model, and Knowledge Base (loaded with CopilotKit docs).  
- A `gt.json` file containing question–answer pairs for evaluation.  
- A `prompt.json` file defining the system and user prompt templates for your agent.  


## 1. Install Dependencies

Before proceeding, install the required libraries for model interaction and evaluation:

- **`flotorch`** — The primary Python SDK for accessing Flotorch services, including LLM inference, knowledge bases, and related APIs.  
- **`flotorch-eval[all]`** — The evaluation toolkit. The `[all]` option ensures that all optional dependencies needed for metrics, evaluation engines, and integrations are installed.


In [None]:
# Install the Flotorch SDK and the Flotorch Eval library.
# You can safely ignore dependency errors during installation.

%pip install flotorch==3.1.0b1 flotorch-eval==2.0.0
%pip install opentelemetry-instrumentation-httpx==0.58b0

## 2. Configure Environment

This is the main configuration step. Set your API key, base URL, and the model names you want to use.

-   **`FLOTORCH_API_KEY`**: Your Flotorch API key (found in your Flotorch Console).
-   **`FLOTORCH_BASE_URL`**: Your Flotorch console instance URL.
-   **`inference_model_name`**: The LLM your agent uses to *generate* answers (your 'agent's brain').
-   **`evaluation_llm_model_name`**: The LLM used to *evaluate* the answers (the 'evaluator's brain'). This is typically a powerful, separate model like `flotorch/gpt-4o` to ensure an unbiased, high-quality judgment.
-   **`evaluation_embedding_model_name`**: The embedding model used for semantic similarity checks during evaluation.
-   **`knowledge_base_repo`**: The ID of your Flotorch Knowledge Base, which acts as the 'source of truth' for your RAG agent.

In [None]:
# -----------------------------------------------------------
# Flotorch Model Configuration
# -----------------------------------------------------------
# Update the placeholders below with the correct values
# from your Flotorch Console before running inference
# or evaluation.
# -----------------------------------------------------------

from getpass import getpass

# Authentication
FLOTORCH_API_KEY = getpass("Enter your Flotorch API key: ")
FLOTORCH_BASE_URL = input("Enter Base URL: ")   # Example: "https://gateway.flotorch.cloud"

# Model names (replace with your actual model IDs)
inference_model_name = "<INFERENCE_MODEL_NAME>"                # Model used by your agent
evaluation_llm_model_name = "<EVALUATION_LLM_MODEL_NAME>"      # Model used for scoring
evaluation_embedding_model_name = "<EMBEDDING_MODEL_NAME>"     # Embedding model for evaluation
                                                               # Example: "openai/text-embedding-ada-002"

# Knowledge base repository used for retrieval
knowledge_base_repo = "<KNOWLEDGE_BASE_REPO_NAME>"
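
Because the values above start out as `<PLACEHOLDER>` strings, it can help to check them before making any API calls. The following is a minimal sketch, not part of the Flotorch SDK: `find_unset_placeholders` and `example_settings` are hypothetical names introduced here for illustration.

```python
import re

# Matches values that still look like "<SOME_PLACEHOLDER>".
PLACEHOLDER_PATTERN = re.compile(r"^<[A-Z_]+>$")


def find_unset_placeholders(settings: dict) -> list:
    """Return the names of settings that are empty or still hold a <PLACEHOLDER> value."""
    return [
        name
        for name, value in settings.items()
        if not value or PLACEHOLDER_PATTERN.match(str(value).strip())
    ]


# Hypothetical example values; in the notebook, pass the real variables instead.
example_settings = {
    "inference_model_name": "<INFERENCE_MODEL_NAME>",
    "evaluation_llm_model_name": "flotorch/gpt-4o",
    "knowledge_base_repo": "",
}

unset = find_unset_placeholders(example_settings)
print(unset)  # ['inference_model_name', 'knowledge_base_repo']
```

In the notebook you would build the dictionary from the actual configuration variables and stop early if the returned list is not empty.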


## 3. Import Libraries

Next, we import all the necessary modules from Python and the Flotorch libraries that we'll use throughout the notebook.

In [None]:
# Required imports
import json
from typing import List
from tqdm import tqdm

# Google Colab helper used for file uploads
from google.colab import files

# Flotorch SDK components
from flotorch.sdk.llm import FlotorchLLM
from flotorch.sdk.memory import FlotorchVectorStore
from flotorch.sdk.utils import memory_utils

# Flotorch Eval components
from flotorch_eval.llm_eval import LLMEvaluator, EvaluationItem, MetricKey
from flotorch_eval import display_llm_evaluation_results

## 4. Load Data and Prompts

Here, we load our ground truth questions (`gt.json`) and the agent prompts (`prompt.json`) from local files.

Your files should be structured as follows:

**`gt.json` (Ground Truth)**
A list of question-answer objects. The `answer` is the "perfect" response you are testing against.
```json
[
  {
    "question": "What is CopilotKit?",
    "answer": "CopilotKit is an open-source framework and hosted service, described as 'The Agentic Application Framework'."
  },
  {
    "question": "How do I provide application context to my copilot?",
    "answer": "You can provide application state and context to your copilot by using the `useCopilotReadable` hook."
  }
]
```

**`prompt.json` (Agent Prompts)**
Contains the system prompt and the user prompt template. Note the `{context}` and `{question}` placeholders, which we will fill dynamically.
```json
{
  "system_prompt": "You are a helpful CopilotKit framework assistant. Answer based only on the context provided.",
  "user_prompt_template": "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
}
```

**Note:** In Google Colab, the cell below opens an upload dialog for each file. Outside Colab, skip the `files.upload()` calls and set `gt_path` and `prompts_path` to the local file paths instead.
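
Since every later step depends on these two files having the structure shown above, a quick structural check can catch problems before any LLM calls are made. This is a minimal sketch; `validate_ground_truth` and `validate_prompt_config` are hypothetical helpers introduced here, not part of Flotorch, and the sample data is made up for illustration.

```python
def validate_ground_truth(items) -> list:
    """Return a list of human-readable problems found in a gt.json payload."""
    if not isinstance(items, list):
        return ["top-level value must be a list of objects"]
    problems = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index}: expected an object")
            continue
        for key in ("question", "answer"):
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {index}: missing or empty '{key}'")
    return problems


def validate_prompt_config(config) -> list:
    """Return a list of human-readable problems found in a prompt.json payload."""
    if not isinstance(config, dict):
        return ["top-level value must be an object"]
    problems = []
    if not config.get("system_prompt"):
        problems.append("system_prompt is missing or empty")
    template = config.get("user_prompt_template", "")
    for placeholder in ("{context}", "{question}"):
        if placeholder not in template:
            problems.append(f"user_prompt_template is missing the {placeholder} placeholder")
    return problems


# Illustrative data only: the second item has an empty question,
# and the template lacks the {context} placeholder.
sample_gt = [
    {"question": "What is CopilotKit?", "answer": "An open-source framework."},
    {"question": "", "answer": "Some answer."},
]
sample_prompts = {
    "system_prompt": "You are a helpful assistant.",
    "user_prompt_template": "Question:\n{question}",
}

gt_problems = validate_ground_truth(sample_gt)
prompt_problems = validate_prompt_config(sample_prompts)
print(gt_problems)
print(prompt_problems)
```

An empty list from both functions means the files match the structure this notebook expects.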

In [None]:
print("Please upload your Ground Truth file (gt.json)")
gt_upload = files.upload()

gt_path = list(gt_upload.keys())[0]
with open(gt_path, 'r') as f:
    ground_truth = json.load(f)
print(f"Ground truth loaded successfully — {len(ground_truth)} items\n")


print("Please upload your prompt file (prompt.json)")
prompts_upload = files.upload()

prompts_path = list(prompts_upload.keys())[0]
with open(prompts_path, 'r') as f:
    prompt_config = json.load(f)
print(f"Prompts loaded successfully. Keys: {list(prompt_config.keys())}")

## 5. Define Helper Function

This simple helper function, `create_messages`, builds the final prompt that will be sent to the LLM. It takes the system prompt, user template, question, and retrieved context, and formats them into the standard list of message objects (`{role: ..., content: ...}`) that the model expects.

In [None]:
def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    """
    Creates a list of messages for the LLM based on the provided prompts, question, and optional context.
    """
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context

    # Format the user prompt template
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content}
    ]
    return messages
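
To see what the helper produces, the sketch below calls it with two made-up context chunks. The function body is repeated only so the block runs on its own; the chunk text and question are illustrative and do not come from the real Knowledge Base.

```python
from typing import List


def create_messages(system_prompt: str, user_prompt_template: str, question: str, context: List[str] = None):
    # Same logic as the helper above, repeated so this block runs on its own.
    context_text = ""
    if context:
        if isinstance(context, list):
            context_text = "\n\n---\n\n".join(context)
        elif isinstance(context, str):
            context_text = context
    user_content = user_prompt_template.replace("{context}", context_text).replace("{question}", question)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


# Illustrative inputs (not taken from the real knowledge base).
chunks = ["const state = { count: 1 };", "Second retrieved chunk."]
messages = create_messages(
    system_prompt="Answer based only on the context provided.",
    user_prompt_template="Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:",
    question="How do I provide application context?",
    context=chunks,
)

# Chunks are joined with a "---" separator line between them.
print(messages[1]["content"])
```

Because the helper uses `str.replace` rather than `str.format`, a template that contains other literal braces (for example a JSON snippet) does not raise a formatting error.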

## 6. Initialize Clients

We create the clients for the generative LLM (`FlotorchLLM`) and the Knowledge Base (`FlotorchVectorStore`). These clients will be used inside the loop to get context and generate answers for each question.

In [None]:
# 1. Set up the LLM for generating answers
inference_llm = FlotorchLLM(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    model_id=inference_model_name
)

# 2. Set up the Knowledge Base connection
kb = FlotorchVectorStore(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    vectorstore_id=knowledge_base_repo
)

# 3. Load prompts into variables
system_prompt = prompt_config.get("system_prompt", "")
user_prompt_template = prompt_config.get("user_prompt_template", "{question}")

print("Models and Knowledge Base are ready.")

## 7. Run Experiment Loop

This is the core logic for *generating* the answers. We loop through each question in our `ground_truth` list and perform the full RAG pipeline:

1.  **Retrieve Context**: We call `kb.search()` to get context from the vector store.
2.  **Build Messages**: We use our `create_messages` helper to assemble the final prompt.
3.  **Generate Answer**: We call `inference_llm.invoke()` to get the agent's response.
4.  **Store for Evaluation**: We package all this information into an `EvaluationItem` object and add it to our `evaluation_items` list.

We also include a `try...except` block to gracefully handle any errors during the API calls, ensuring the loop doesn't crash.


In [None]:
evaluation_items = [] # This will store our results

# Use simple tqdm for a progress bar
print(f"Running experiment on {len(ground_truth)} items...")

for qa in tqdm(ground_truth):
    question = qa.get("question", "")
    gt_answer = qa.get("answer", "")

    try:
        # --- 1. Retrieve Context ---
        search_results = kb.search(query=question)
        context_texts = memory_utils.extract_vectorstore_texts(search_results)

        # --- 2. Build Messages ---
        messages = create_messages(
            system_prompt=system_prompt,
            user_prompt_template=user_prompt_template,
            question=question,
            context=context_texts
        )

        # --- 3. Generate Answer ---
        response, headers = inference_llm.invoke(messages=messages, return_headers=True)
        generated_answer = response.content

        # --- 4. Store for Evaluation ---
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=generated_answer,
            expected_answer=gt_answer,
            context=context_texts, # Store the context for later display
            metadata=headers,
        ))

    except Exception as e:
        print(f"[ERROR] Failed on question '{question[:50]}...': {e}")
        # Store a failure case so we can see it
        evaluation_items.append(EvaluationItem(
            question=question,
            generated_answer=f"Error: {e}",
            expected_answer=gt_answer,
            context=[],
            metadata={"error": str(e)},
        ))

print(f"\n\nExperiment completed. {len(evaluation_items)} items are ready for evaluation.")
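
Because failed questions are stored as items whose metadata carries an `"error"` key, it can be useful to separate them before evaluation so they do not distort the scores. The sketch below is one way to do that. It assumes each item exposes its `metadata` as an attribute, which mirrors the constructor keyword used above but is not confirmed here; `split_failures` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `EvaluationItem` instances.

```python
from types import SimpleNamespace


def split_failures(items):
    """Separate items whose metadata carries an "error" key from the rest."""
    succeeded, failed = [], []
    for item in items:
        # Assumption: metadata is a dict-like attribute on the item.
        metadata = getattr(item, "metadata", None) or {}
        if "error" in metadata:
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed


# Stand-in objects for illustration; real code would pass `evaluation_items`.
demo_items = [
    SimpleNamespace(question="Q1", metadata={"x-example-header": "value"}),
    SimpleNamespace(question="Q2", metadata={"error": "timeout"}),
]

succeeded, failed = split_failures(demo_items)
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
for item in failed:
    print(f"  failed: {item.question} ({item.metadata['error']})")
```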

## 8. Initialize the Evaluator

With the `evaluation_items` list prepared, we can now initialize the `LLMEvaluator`.

The `LLMEvaluator` is the core component of the **Flotorch-Eval** ecosystem. It coordinates metric computation, semantic similarity checks, and LLM-based judgments using the configuration you provide. In essence, it acts as the central evaluation engine for the entire pipeline.

### Parameter Overview

- **`api_key` / `base_url`**  
  Authentication details required to connect to the Flotorch-Eval service.

- **`inferencer_model` / `embedding_model`**  
  Models used internally by the evaluator:
  - The **inferencer model** performs reasoning-driven evaluations.
  - The **embedding model** handles semantic similarity and context-comparison tasks.

- **`evaluation_engine`**  
  Set to `"auto"` in this notebook, enabling Flotorch Eval to automatically select the appropriate backend for each metric:
  - **[Ragas](https://docs.ragas.io/en/stable/getstarted/)** — used for RAG-specific metrics such as *faithfulness*, *answer relevance*, and *context precision*.  
  - **[DeepEval](https://deepeval.com/docs/getting-started)** — used for hallucination checks and LLM-critic–style aspect evaluations.  

  This dynamic routing ensures optimal accuracy without requiring manual engine selection.

- **`metrics`**  
  A list of `MetricKey` values defining the evaluation dimensions to compute. These determine the “scorecard” the evaluator will produce.

- **`metric_configs`**  
  Provides additional configuration for metrics that require custom parameters.  
  In this notebook, the **`ASPECT_CRITIC`** metric is configured with a `maliciousness` aspect — a targeted check to identify harmful or unsafe content.  
  *Aspect Critic acts as a parent metric and can support multiple customized evaluation aspects.*

- **Thresholds**  
  Each metric may optionally define a pass/fail threshold (from `0.0` to `1.0`), allowing you to specify the minimum acceptable score for success.

This setup ensures a robust and flexible evaluation process, leveraging both Ragas and DeepEval where they perform best.
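
As a rough illustration of what a threshold means, the sketch below compares a score against a minimum (or maximum) acceptable value. It is not Flotorch Eval's implementation: `passes_threshold` is a hypothetical helper, the scores are made up, and treating hallucination as a lower-is-better metric is an assumption.

```python
def passes_threshold(score: float, threshold: float, higher_is_better: bool = True) -> bool:
    """Return True when a score in [0.0, 1.0] satisfies its threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if higher_is_better:
        return score >= threshold
    return score <= threshold


# Made-up scores for illustration only.
faithfulness_ok = passes_threshold(0.85, threshold=0.8)
# Assumption: for hallucination, a lower score is better, so the threshold is a ceiling.
hallucination_ok = passes_threshold(0.4, threshold=0.3, higher_is_better=False)

print(faithfulness_ok)   # True
print(hallucination_ok)  # False
```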


In [None]:
# Per-metric configuration: a custom maliciousness aspect plus pass/fail thresholds
metric_args = {
    MetricKey.ASPECT_CRITIC: {
        "maliciousness": {
            "name": "maliciousness",
            "definition": "Is the response harmful or toxic?"
        }
    },
    "faithfulness": {"threshold": 0.8},
    "context_relevancy": {"threshold": 0.7},
    "context_precision": {"threshold": 0.6},
    "context_recall": {"threshold": 0.5},
    "answer_relevance": {"threshold": 0.7},
    "hallucination": {"threshold": 0.3},
}

# Initialize the LLMEvaluator client
evaluator_client = LLMEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    embedding_model=evaluation_embedding_model_name,
    inferencer_model=evaluation_llm_model_name,
    metrics=[
        MetricKey.FAITHFULNESS,
        MetricKey.CONTEXT_RELEVANCY,
        MetricKey.CONTEXT_PRECISION,
        MetricKey.CONTEXT_RECALL,
        MetricKey.ANSWER_RELEVANCE,
        MetricKey.ASPECT_CRITIC,
        MetricKey.HALLUCINATION,
    ],
    evaluation_engine="auto",
    metric_configs=metric_args
)

print("LLMEvaluator client initialized.")

## 9. Run Evaluation

### Purpose
Execute the evaluation process to score all generated answers.
### Process
- Call either:
  - `evaluator_client.evaluate()` for **synchronous** (sequential) execution, or  
  - `evaluator_client.aevaluate()` for **asynchronous** (concurrent) execution  
  using the complete list of `evaluation_items`.

- Collect the following outputs:
  - Ragas and DeepEval metrics
  - Gateway metrics (cost, latency, token usage)
  - Structured evaluation results

### Output
- A complete evaluation report ready for analysis.

> **Note:**  
> This step may take a few minutes, as it requires LLM calls for each question.
> Use the **synchronous** method for standard sequential execution, or the **asynchronous** method for faster, concurrent processing.
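
The difference between the two modes can be sketched with plain `asyncio`, independent of Flotorch. In the example below, `score_item` is a hypothetical stand-in for one LLM-backed evaluation call, simulated with a short sleep. The block is written as a standalone script and uses `asyncio.run`; inside a notebook, where an event loop is already running, you would `await` the coroutines directly instead.

```python
import asyncio
import time


async def score_item(item: str, delay: float = 0.05) -> dict:
    """Hypothetical stand-in for one evaluation call; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return {"item": item, "score": 1.0}


async def score_sequentially(items):
    # Each call waits for the previous one to finish.
    return [await score_item(item) for item in items]


async def score_concurrently(items):
    # All calls are started together; gather preserves input order.
    return await asyncio.gather(*(score_item(item) for item in items))


items = ["q1", "q2", "q3", "q4"]

start = time.perf_counter()
sequential = asyncio.run(score_sequentially(items))
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
concurrent = asyncio.run(score_concurrently(items))
concurrent_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Both approaches return the same results in the same order; the concurrent version simply overlaps the waiting time, which is why it tends to finish sooner when each call is dominated by latency.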


### Asynchronous Evaluation

In [None]:
print("Starting evaluation... This may take a few minutes.")

eval_results = await evaluator_client.aevaluate(evaluation_items)

print("Evaluation complete.")

### Synchronous Evaluation (uncomment the code below to run synchronously)

In [None]:
# print("Starting evaluation... This may take a few minutes.")

# eval_results = evaluator_client.evaluate(evaluation_items)

# print("Evaluation complete.")

## 10. View Per-Question Results

The `eval_results` variable now contains a list of dictionaries, with each dictionary holding the detailed results and metrics for a single question.

We can now pass this list to `display_llm_evaluation_results` to display the scores for each query in a readable table. This table is the primary output, allowing you to compare the generated answer with the ground truth and see scores such as faithfulness, answer relevance, and context precision side by side.

In [None]:
display_llm_evaluation_results(eval_results)

## 11. View Raw JSON Results

Finally, we can print the raw `eval_results` list as formatted JSON. This is useful for seeing all the data at once, including gateway metrics like latency and cost for each individual evaluation call, which are stored inside the `evaluation_metrics` dictionary for each item.

In [None]:
print("--- Raw Evaluation Results ---")
print(json.dumps(eval_results, indent=2))
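
The raw list can also be summarized into one average per metric. The sketch below assumes each result is a dictionary whose `evaluation_metrics` entry maps metric names to values, as described above; the exact shape of the real output may differ, `average_metrics` is a hypothetical helper, and the sample data is made up.

```python
from collections import defaultdict
from statistics import mean


def average_metrics(results, metrics_key="evaluation_metrics"):
    """Average every numeric value found under `metrics_key`, grouped by metric name."""
    buckets = defaultdict(list)
    for result in results:
        for name, value in result.get(metrics_key, {}).items():
            # Skip non-numeric entries (and booleans, which are ints in Python).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                buckets[name].append(float(value))
    return {name: round(mean(values), 3) for name, values in buckets.items()}


# Made-up results for illustration; real code would pass `eval_results`.
sample_results = [
    {"question": "Q1", "evaluation_metrics": {"faithfulness": 0.9, "answer_relevance": 0.8}},
    {"question": "Q2", "evaluation_metrics": {"faithfulness": 0.7, "answer_relevance": "n/a"}},
]

averages = average_metrics(sample_results)
print(averages)  # {'faithfulness': 0.8, 'answer_relevance': 0.8}
```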

## 12. Summary

This notebook provided a complete, step-by-step workflow for evaluating a RAG agent using Flotorch Eval.

We successfully:

1.  **Configured** clients for the Flotorch SDK (`FlotorchLLM`, `FlotorchVectorStore`) and Flotorch Eval (`LLMEvaluator`).
2.  **Generated** responses by looping through a `gt.json` file, retrieving context from the Knowledge Base, and calling the inference LLM.
3.  **Evaluated** all responses in a single call to `evaluator_client.aevaluate()` (or `evaluate()` for synchronous execution), collecting metrics for faithfulness, answer relevance, context relevancy, context precision, context recall, hallucination, and maliciousness.
4.  **Displayed** the final, per-question scores in a formatted table, allowing for easy analysis of where the agent is succeeding or failing.