# Building and Evaluating LlamaIndex Agents with Query Engine Tools

You can install all the dependencies for this tutorial using:

In [1]:
%pip install litellm llama-index-embeddings-google-genai llama-index-llms-google-genai llama-index weave -q

Note: you may need to restart the kernel to use updated packages.


We'll use a `.env` file to manage API keys securely. You can also set them manually as environment variables, but this tutorial uses a `.env` setup.

Also include `.env` in your `.gitignore` to avoid accidentally exposing sensitive API keys.

In [19]:
from dotenv import load_dotenv

load_dotenv()

True
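`load_dotenv()` returning `True` only says a `.env` file was found, not that it holds the keys you need. The check below is a minimal sketch: the helper name `missing_keys` is hypothetical, and the variable names `GOOGLE_API_KEY` and `WANDB_API_KEY` are assumptions about what the Gemini and Weave clients read, so adjust them to your setup.

```python
import os
from typing import Iterable, Mapping, Optional


def missing_keys(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Assumed variable names; change them to whatever your providers expect.
REQUIRED_KEYS = ["GOOGLE_API_KEY", "WANDB_API_KEY"]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` surfaces a missing key immediately instead of as an authentication error several cells later.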

## Building Agents with Query Engine Tools

In LlamaIndex, an agent with query engine tools receives a user query and breaks it down to determine which query engines should handle the original query or specific sub-queries. With parallel tool calling enabled by default, the agent sends queries or sub-queries to multiple relevant query engines simultaneously, then synthesizes the results into a comprehensive response. The agent repeats this process until the user's query is fully answered or it concludes that the query cannot be answered.
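To make the fan-out step concrete, here is a toy sketch in plain `asyncio`. It is not how LlamaIndex implements tool calling: the stub tools, the `TOOLS` registry, and `fan_out` are all hypothetical names, and the sub-queries are supplied by hand rather than chosen by an LLM.

```python
import asyncio


async def lyft_tool(question: str) -> str:
    # Stub standing in for a real query engine call.
    await asyncio.sleep(0)
    return f"lyft_10k answered: {question}"


async def uber_tool(question: str) -> str:
    # Stub standing in for a real query engine call.
    await asyncio.sleep(0)
    return f"uber_10k answered: {question}"


TOOLS = {"lyft_10k": lyft_tool, "uber_10k": uber_tool}


async def fan_out(sub_queries: dict[str, str]) -> dict[str, str]:
    """Send each sub-query to its tool concurrently and collect the results."""
    names = list(sub_queries)
    results = await asyncio.gather(
        *(TOOLS[name](sub_queries[name]) for name in names)
    )
    return dict(zip(names, results))
```

In a script you would call `asyncio.run(fan_out({...}))`; in a notebook, `await fan_out({...})`. A real agent would then pass the collected results back to the LLM to synthesize the final answer.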

### Setting the LLM and Embedding Model

In [3]:
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from google.genai import types

llm = GoogleGenAI(
    model="gemini-2.5-flash",
    generation_config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)  # Disables thinking
    ),
)

embed_model = GoogleGenAIEmbedding(model_name="text-embedding-004")

### Downloading the Data

In [4]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

--2025-07-05 16:50:21--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘data/10k/uber_2021.pdf’


2025-07-05 16:50:23 (1.09 MB/s) - ‘data/10k/uber_2021.pdf’ saved [1880483/1880483]

--2025-07-05 16:50:23--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting respons

### Setting Up the Vector Store and Indexing the Data

In [5]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# load data
lyft_docs = SimpleDirectoryReader(input_files=["./data/10k/lyft_2021.pdf"]).load_data()
uber_docs = SimpleDirectoryReader(input_files=["./data/10k/uber_2021.pdf"]).load_data()

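# build index, passing the Gemini embedding model explicitly
# (otherwise LlamaIndex falls back to its global default embedding model)
lyft_index = VectorStoreIndex.from_documents(lyft_docs, embed_model=embed_model)
uber_index = VectorStoreIndex.from_documents(uber_docs, embed_model=embed_model)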

# persist index
lyft_index.storage_context.persist(persist_dir="./storage/lyft")
uber_index.storage_context.persist(persist_dir="./storage/uber")

### Setting up the Query Engine 

In [6]:
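# Pass the Gemini LLM explicitly so answer synthesis does not use the global default LLM
lyft_engine = lyft_index.as_query_engine(llm=llm, similarity_top_k=3)
uber_engine = uber_index.as_query_engine(llm=llm, similarity_top_k=3)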
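`similarity_top_k=3` asks each query engine to retrieve the three chunks most similar to the query before synthesizing an answer. The sketch below illustrates the idea on toy vectors, assuming cosine similarity as the ranking metric. The names `cosine_similarity`, `top_k`, and `toy_chunks` are hypothetical, and the three-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: list[float], chunks: dict[str, list[float]], k: int = 3
) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunks,
        key=lambda chunk_id: cosine_similarity(query_vec, chunks[chunk_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings: the first axis loosely means "about revenue".
toy_chunks = {
    "revenue_table": [0.9, 0.1, 0.0],
    "risk_factors": [0.1, 0.9, 0.0],
    "revenue_notes": [0.8, 0.2, 0.1],
    "legal": [0.0, 0.2, 0.9],
}
```

With a query vector of `[1.0, 0.0, 0.0]`, `top_k(..., k=2)` keeps the two revenue chunks and drops the rest. A larger `k` gives the LLM more context at the cost of more tokens and potentially less relevant text.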

### Making Query Engine Tools for the Agent

In [7]:
from llama_index.core.tools import QueryEngineTool

query_engine_tools = [
    QueryEngineTool.from_defaults(
        query_engine=lyft_engine,
        name="lyft_10k",
        description=(
            "Provides information about Lyft financials for year 2021. "
            "Use a detailed plain text question as input to the tool."
        ),
    ),
    QueryEngineTool.from_defaults(
        query_engine=uber_engine,
        name="uber_10k",
        description=(
            "Provides information about Uber financials for year 2021. "
            "Use a detailed plain text question as input to the tool."
        ),
    ),
]

### Setting up the Agent

In this step, we give the agent a large language model (LLM), which is responsible for making decisions, and a set of tools, which give it the ability to take actions.

In [8]:
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.workflow import Context

agent = FunctionAgent(tools=query_engine_tools, llm=llm)

### Try It Out!

In [9]:
from llama_index.core.agent.workflow import ToolCallResult, AgentStream

handler = agent.run("What's the revenue for Lyft in 2021 vs Uber?")

async for ev in handler.stream_events():
    if isinstance(ev, ToolCallResult):
        print(
            f"Call {ev.tool_name} with args {ev.tool_kwargs}\nReturned: {ev.tool_output}"
        )
    elif isinstance(ev, AgentStream):
        print(ev.delta, end="", flush=True)

response = await handler

Call lyft_10k with args {'input': 'What was Lyft revenue in 2021?'}
Returned: Lyft's revenue in 2021 was $3,208,323,000.
Call uber_10k with args {'input': 'What was Uber revenue in 2021?'}
Returned: $17,455 million.
In 2021, Lyft's revenue was $3,208,323,000, while Uber's revenue was $17,455,000,000.

## Evaluating the Agent with W&B Weave

When using Weave for evaluation, you need three main components:

1. **Dataset**: A collection of queries or inputs you want to evaluate your application on.  

2. **Model**: This is an abstraction that represents the application you want to evaluate. It's not a literal machine learning model, but a wrapper provided by Weave that defines how your application handles input and produces output.  

3. **Scorers**: These are the metrics or scoring functions that assess how well your application performs on the dataset. For example, they might check answer correctness or retrieval quality.
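How the three pieces fit together can be sketched in plain Python. This is a conceptual illustration only, not Weave's implementation: `evaluate`, `echo_model`, and `non_empty` are hypothetical names, and the model is a stub standing in for the agent.

```python
import asyncio
from statistics import fmean
from typing import Awaitable, Callable

Row = dict[str, str]
ModelFn = Callable[[str], Awaitable[str]]
ScorerFn = Callable[[Row, str], float]


async def evaluate(
    dataset: list[Row], model: ModelFn, scorers: dict[str, ScorerFn]
) -> dict[str, float]:
    """Run the model on every row, score each output, and average per scorer."""
    per_scorer: dict[str, list[float]] = {name: [] for name in scorers}
    for row in dataset:
        output = await model(row["query"])
        for name, scorer in scorers.items():
            per_scorer[name].append(scorer(row, output))
    # Assumes a non-empty dataset; fmean raises on an empty list.
    return {name: fmean(values) for name, values in per_scorer.items()}


async def echo_model(query: str) -> str:
    # Stand-in for the agent under evaluation.
    return f"Answer to: {query}"


def non_empty(row: Row, output: str) -> float:
    # Trivial scorer: 1.0 if the model produced any text at all.
    return 1.0 if output.strip() else 0.0
```

The dataset supplies the inputs, the model turns each input into an output, and each scorer maps a row and its output to a number that is then averaged across rows.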


### Initializing the Project and Creating the Dataset

**Agentic RAG Evaluation: Two Types**

There are two main evaluation approaches for Agentic RAG systems:

1. **Reference-Based** - Uses queries with golden/ground truth answers to measure correctness
2. **Reference-Free** - Uses only queries without ground truth, relying on heuristics or LLM judgment to assess quality


We'll demonstrate both approaches by creating two datasets: one with ground truth answers for reference-based evaluation, and one with queries only for reference-free evaluation.
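As a small illustration of the difference, the sketch below pairs a reference-based check with a reference-free heuristic. Both are hypothetical helpers rather than Weave scorers: `matches_reference` needs the ground truth answer, while `states_a_figure` only inspects the generated text. The regular expression assumes dollar amounts written like `$3,208,323,000` or `$17,455 million`.

```python
import re
from decimal import Decimal

_UNITS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}
_AMOUNT = re.compile(
    r"\$\s*(\d[\d,]*(?:\.\d+)?)(?:\s+(thousand|million|billion))?",
    re.IGNORECASE,
)


def dollar_amounts(text: str) -> set[int]:
    """Extract dollar amounts from text, normalized to whole dollars."""
    amounts = set()
    for number, unit in _AMOUNT.findall(text):
        value = Decimal(number.replace(",", ""))
        if unit:
            value *= _UNITS[unit.lower()]
        amounts.add(int(value))
    return amounts


def matches_reference(answer: str, reference: str) -> bool:
    """Reference-based: every figure in the reference appears in the answer."""
    expected = dollar_amounts(reference)
    return bool(expected) and expected <= dollar_amounts(answer)


def states_a_figure(answer: str) -> bool:
    """Reference-free heuristic: the answer commits to at least one figure."""
    return bool(dollar_amounts(answer))
```

Normalizing units matters here because the same revenue can be written as `$17,455 million` or `$17,455,000,000`. A plain string comparison would treat those as different answers.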

#### Creating Reference-Based Dataset

In [24]:
import weave
from weave import Dataset

weave.init(project_name="llama_index_evaluations")

eval_dataset_with_reference = Dataset(
    name="agentic-rag-evaluation-dataset-with-reference",
    rows=[
        {
            "id": "0",
            "query": "What's the revenue for Lyft in 2021 vs Uber?",
            "reference": "In 2021, Lyft's revenue was $3,208,323,000 and Uber's revenue was $17,455 million.",
        },
    ],
)

weave.publish(eval_dataset_with_reference)
dataset_ref = weave.ref("agentic-rag-evaluation-dataset-with-reference").get()

weave: 📦 Published to https://wandb.ai/deep-learning-assignments/llama_index_evaluations/weave/objects/agentic-rag-evaluation-dataset-with-reference/versions/s8ZhlTyCQSxeehwLzJmMuF00uDOm2JhK65K4ea8ReY4


#### Creating Reference-Free Dataset

In [11]:
import weave
from weave import Dataset

weave.init(project_name="llama_index_evaluations")

eval_dataset_without_reference = Dataset(
    name="agentic-rag-evaluation-dataset-without-reference",
    rows=[
        {"id": "0", "query": "What's the revenue for Lyft in 2021 vs Uber?"},
    ],
)

weave.publish(eval_dataset_without_reference)
dataset_ref = weave.ref("agentic-rag-evaluation-dataset-without-reference").get()

weave: 📦 Published to https://wandb.ai/deep-learning-assignments/llama_index_evaluations/weave/objects/agentic-rag-evaluation-dataset-without-reference/versions/t2mQhnjjQG19L0lFeNxCGD0Zkgd6lda2QsUCKvoFPoI


### Setting the Model

In [12]:
import weave
import asyncio
from llama_index.core.agent.workflow import AgentOutput


class LlamaIndexAgenticRAG(weave.Model):
    @weave.op()
    async def predict(self, query: str) -> AgentOutput:
        agent = FunctionAgent(tools=query_engine_tools, llm=llm)
        handler = agent.run(query)
        # predict is already a coroutine, so await the handler instead of starting a new event loop
        response = await handler
        return response

### Defining the Scorers (Without Reference)

When evaluating RAG or Agentic RAG applications, we focus on two core components. The first is retrieval, which measures how effectively the system retrieves relevant information from the knowledge-base based on the input query. The second is generation, which evaluates how well the system generates an answer using the retrieved context.

**Defining Scorers**

To evaluate both components, we will define scorers to assess both the retrieval and generation process. In this tutorial we'll use two built-in scorers to do the evaluation:

- Retrieval Evaluation

    `ContextRelevancyScorer`: Measures how relevant the retrieved context is to the input query. Returns a score between 0 and 1, with higher scores indicating better relevance.

- Generation Evaluation

    `HallucinationFreeScorer`: Checks whether the generated response is faithful to the retrieved context, that is, whether it adds information the context does not support.


**Evaluation Modes:** We'll perform this evaluation in both reference-free mode (no ground truth answer provided) and reference-based mode (a ground truth answer is available). We will start with the reference-free evaluation and then move on to the reference-based setup.

#### Evaluating Agentic Systems with Multiple RAG Calls

In agentic systems, a single user query often breaks down into multiple sub-queries, each triggering its own RAG operation.

To evaluate these systems properly, we need to evaluate each individual RAG call and then aggregate those evaluations into a single overall score for the original query.

**Wrapper Scorer Approach**

We use a wrapper scorer that takes a base scorer (like hallucination or context relevance) and applies it to every RAG call made by the agent. It then combines all results to provide one comprehensive evaluation score for the entire query.

This ensures we evaluate every step of the agent's multi-step process, giving us a complete picture of its performance.

In [13]:
import weave
from weave import Scorer
from typing import Dict
import numpy as np
from weave.scorers import HallucinationFreeScorer, ContextRelevancyScorer


class AgenticHallucinationFreeScorer(Scorer):
    base_scorer: Scorer = HallucinationFreeScorer(model_id="gemini/gemini-2.5-flash")

    @weave.op
    async def score(self, output: AgentOutput) -> Dict:
        tool_calls = output.tool_calls
        tool_calls = [
            {
                "query": tool_call.tool_kwargs["input"],
                "output": tool_call.tool_output.content,
                "context": [
                    node.text for node in tool_call.tool_output.raw_output.source_nodes
                ],
            }
            for tool_call in tool_calls
        ]

        scores = await asyncio.gather(
            *[
                self.base_scorer.score(
                    output=tool_call["output"], context="\n".join(tool_call["context"])
                )
                for tool_call in tool_calls
            ]
        )

        # Mean of the has_hallucination flags: 0.0 means no tool call was flagged (lower is better)
        final_score = np.mean([score["has_hallucination"] for score in scores])
        return {"hallucination_free_score": final_score}


class AgenticContextRelevancyScorer(Scorer):
    base_scorer: Scorer = ContextRelevancyScorer(model_id="gemini/gemini-2.5-flash")

    @weave.op
    async def score(self, output: AgentOutput) -> Dict:
        tool_calls = output.tool_calls
        tool_calls = [
            {
                "query": tool_call.tool_kwargs["input"],
                "output": tool_call.tool_output.content,
                "context": [
                    node.text for node in tool_call.tool_output.raw_output.source_nodes
                ],
            }
            for tool_call in tool_calls
        ]

        scores = await asyncio.gather(
            *[
                self.base_scorer.score(
                    output=tool_call["output"], context="\n".join(tool_call["context"])
                )
                for tool_call in tool_calls
            ]
        )

        final_score = np.mean([score["relevancy_score"] for score in scores])
        return {"relevancy_score": final_score}
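Note that `AgenticHallucinationFreeScorer` averages `has_hallucination` across tool calls, so the number it reports under `hallucination_free_score` is the fraction of calls flagged as hallucinated: `0.0` means none were flagged. If you would rather report a higher-is-better value, one option is to invert the mean. The helper below is a hypothetical sketch of that aggregation, not part of Weave.

```python
from statistics import fmean


def hallucination_free_fraction(per_call_flags: list[bool]) -> float:
    """Fraction of tool calls whose answer was NOT flagged as hallucinated."""
    if not per_call_flags:
        raise ValueError("need at least one tool call to score")
    flagged = fmean(1.0 if flag else 0.0 for flag in per_call_flags)
    return 1.0 - flagged
```

For example, two tool calls with flags `[False, False]` give `1.0`, while `[True, False]` gives `0.5`. Whichever direction you choose, keep the metric name consistent with it so dashboards are not misread.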

### Performing Evaluations

In [14]:
hallucination_free_scorer = AgenticHallucinationFreeScorer()
context_relevancy_scorer = AgenticContextRelevancyScorer()

evaluation_without_reference = weave.Evaluation(
    dataset=eval_dataset_without_reference,
    scorers=[context_relevancy_scorer, hallucination_free_scorer],
)

In [20]:
import nest_asyncio

nest_asyncio.apply()

model = LlamaIndexAgenticRAG()
result = asyncio.run(evaluation_without_reference.evaluate(model))

result

{'output': {'raw': {'index': {'mean': 0.0},
   'usage_metadata': {'candidates_token_count': {'mean': 41.0},
    'prompt_token_count': {'mean': 272.0},
    'total_token_count': {'mean': 313.0}}}},
 'AgenticContextRelevancyScorer': {'relevancy_score': {'mean': 1.0}},
 'AgenticHallucinationFreeScorer': {'hallucination_free_score': {'mean': 0.0}},
 'model_latency': {'mean': 5.51798677444458}}

### Defining the Scorers (With Reference)

Now that we've covered reference-free evaluation, we'll move on to reference-based evaluation, where each query includes a ground truth answer.

In this setup, we can define custom scorers that compare the agent's behavior against the expected (reference) response. These scorers use the input query, the retrieved context or generated response, and the ground truth answer to evaluate retrieval and generation quality.

**Defining Custom Scorers**

For reference-based evaluation, we'll define two custom LLM-judged metrics: context precision, which checks whether the retrieved context supports the ground truth answer, and answer correctness, which checks whether the generated response covers the key facts in the ground truth. The same pattern can measure accuracy, completeness, or any other domain-specific quality by direct comparison.

The following cells define these scorers and apply them to our dataset.

In [26]:
from textwrap import dedent
from typing import Any
from weave.scorers.scorer_types import LLMScorer
from pydantic import BaseModel, Field


class ContextPrecisionResponse(BaseModel):
    reason: str = Field(
        description="Step-by-step reasoning about whether the retrieved context was useful in arriving at the given ground truth answer"
    )
    score: int = Field(
        description="Binary score indicating if the context was useful in producing the answer (1 for useful, 0 for not useful)"
    )


class ContextPrecisionWithReferenceScorer(LLMScorer):
    name: str = "context_precision_with_reference"
    prompt_template: str = dedent(
        """
    You are given a question, a retrieved context, and the correct (ground truth) answer.

    Your task is to evaluate whether the retrieved context was useful in arriving at the given answer.

    - If the context includes information that directly supports or helps generate the answer, return a score of 1.
    - If the context is unrelated or not helpful in generating the answer, return a score of 0.

    Think step by step and provide a reason for your decision.

    Question: {question}

    Context: {context}

    Answer: {answer}

    Reasoning:
    <your reasoning here>

    Final Score (0 or 1):
    """
    )
    model_id: str = "gemini/gemini-2.0-flash"

    @weave.op
    async def score(
        self, *, output: AgentOutput, query: str, reference: str, **kwargs: Any
    ) -> dict:
        tool_calls = output.tool_calls
        contexts = [
            node.text
            for tool_call in tool_calls
            for node in tool_call.tool_output.raw_output.source_nodes
        ]
        prompt = self.prompt_template.format(
            question=query, context="\n".join(contexts), answer=reference
        )
        response = await self._acompletion(
            messages=[{"role": "user", "content": prompt}],
            response_format=ContextPrecisionResponse,
            model=self.model_id,
        )
        response = ContextPrecisionResponse.model_validate_json(
            response.choices[0].message.content
        )
        return response.model_dump()


class AnswerCorrectnessResponse(BaseModel):
    reason: str = Field(
        description="Step-by-step reasoning about whether the generated answer covers all the key points in the reference answer"
    )
    score: int = Field(
        description="Binary score: 1 if the generated answer fully captures all factual details from the reference answer, else 0"
    )


class AnswerCorrectnessScorer(LLMScorer):
    name: str = "answer_correctness"
    prompt_template: str = dedent(
        """
    You are given a question, a generated answer, and the reference (ground truth) answer.

    Your task is to decide whether the generated answer includes **all the key factual points** from the reference answer.

    - If it fully matches the reference in meaning and completeness, return 1.
    - If anything is missing, inaccurate, or not supported by the reference, return 0.

    Question: {question}

    Generated Answer: {generated_answer}

    Reference Answer: {reference_answer}

    Reasoning:
    <your step-by-step reasoning here>

    Final Score (0 or 1):
    """
    )
    model_id: str = "gemini/gemini-2.0-flash"

    @weave.op
    async def score(
        self, *, output: AgentOutput, query: str, reference: str, **kwargs: Any
    ) -> dict:
        generated_answer = output.response.blocks[0].text
        prompt = self.prompt_template.format(
            question=query,
            generated_answer=generated_answer,
            reference_answer=reference,
        )
        response = await self._acompletion(
            messages=[{"role": "user", "content": prompt}],
            response_format=AnswerCorrectnessResponse,
            model=self.model_id,
        )
        response = AnswerCorrectnessResponse.model_validate_json(
            response.choices[0].message.content
        )
        return response.model_dump()

### Performing Evaluations

In [27]:
context_precision_scorer = ContextPrecisionWithReferenceScorer()
answer_correctness_scorer = AnswerCorrectnessScorer()

evaluation_with_reference = weave.Evaluation(
    dataset=eval_dataset_with_reference,
    scorers=[context_precision_scorer, answer_correctness_scorer],
)

In [28]:
import nest_asyncio

nest_asyncio.apply()

model = LlamaIndexAgenticRAG()
result = asyncio.run(evaluation_with_reference.evaluate(model))

result

{'output': {'raw': {'index': {'mean': 0.0},
   'usage_metadata': {'candidates_token_count': {'mean': 47.0},
    'prompt_token_count': {'mean': 257.0},
    'total_token_count': {'mean': 304.0}}}},
 'context_precision_with_reference': {'score': {'mean': 1.0}},
 'answer_correctness': {'score': {'mean': 1.0}},
 'model_latency': {'mean': 4.837825059890747}}

## Next Steps

Now that you have a better understanding of how to evaluate RAG agents, you can apply similar approaches to web search agents. While web search is not exactly the same as RAG, the evaluation methods are quite comparable, especially in reference-free mode, where ground truth answers aren't available.

You can use the same evaluation metrics or create custom evaluators tailored to your specific needs. This flexible approach allows you to assess various types of agentic and retrieval-based systems effectively.

We hope this tutorial helps you explore and experiment with different evaluation techniques for your own applications.