# Building and Evaluating LlamaIndex ReAct Agents

You can install all the dependencies for this tutorial using:

In [None]:
%pip install litellm llama-index-embeddings-google-genai llama-index-llms-google-genai llama-index weave python-dotenv nest_asyncio -q

We’ll use a `.env` file to manage API keys securely. You can also set them manually as environment variables, but for this tutorial, we’ll go ahead with a `.env` setup.  

Also include `.env` in your `.gitignore` to avoid accidentally exposing sensitive API keys.

In [None]:
from dotenv import load_dotenv

load_dotenv()
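
Before going further, it can help to confirm that the keys actually loaded. The following is a minimal sketch rather than part of the original notebook: the helper name `missing_keys` is hypothetical, and it assumes the Gemini key is stored as `GOOGLE_API_KEY` and the Weights & Biases key as `WANDB_API_KEY`.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Assumed key names; adjust them to match your own .env file.
missing = missing_keys(["GOOGLE_API_KEY", "WANDB_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.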

## Building a ReAct Agent

ReAct (Reasoning and Acting) breaks a complex task into a series of thoughts, actions, and observations. Because every step is visible, a ReAct agent can work through multi-step problems transparently and adjust its plan as new observations arrive. This makes the agent’s decision-making process easier to inspect, so developers can debug, refine, and optimize LLM responses.
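
To make the thought, action, observation cycle concrete, here is a minimal offline sketch. It is not how LlamaIndex implements its agent: the model outputs are scripted strings instead of LLM calls, and `run_react_loop` is a hypothetical helper. A real agent would send each observation back to the LLM to produce the next step.

```python
import json
from typing import Callable, Dict, List


def run_react_loop(
    scripted_steps: List[str], tools: Dict[str, Callable[..., int]]
) -> List[str]:
    """Replay scripted model outputs, executing each requested tool call."""
    transcript = []
    for step in scripted_steps:
        transcript.append(step)
        if "Answer:" in step:
            break  # the model has produced its final answer
        # Split each "Key: value" line so we can read the tool name and arguments.
        fields = dict(
            line.split(": ", 1) for line in step.splitlines() if ": " in line
        )
        result = tools[fields["Action"]](**json.loads(fields["Action Input"]))
        transcript.append(f"Observation: {result}")
    return transcript


steps = [
    'Thought: Multiply 2 and 4 first.\nAction: multiply\nAction Input: {"a": 2, "b": 4}',
    'Thought: Now add 8 to 20.\nAction: add\nAction Input: {"a": 20, "b": 8}',
    "Thought: I can answer without more tools.\nAnswer: 28",
]
tools = {"multiply": lambda a, b: a * b, "add": lambda a, b: a + b}

for line in run_react_loop(steps, tools):
    print(line)
```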

### Defining the Tools and the LLM

In [2]:
from llama_index.llms.google_genai import GoogleGenAI
from google.genai import types

llm = GoogleGenAI(
    model="gemini-2.0-flash",
    generation_config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)


def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b


def add(a: int, b: int) -> int:
    """Add two integers and return the result."""
    return a + b

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.


### Setting up the Agent

In [3]:
from llama_index.core.agent.workflow import ReActAgent


agent = ReActAgent(tools=[multiply, add], llm=llm)

### Try It Out!

In [4]:
from llama_index.core.agent.workflow import AgentStream

handler = agent.run("What is 20+(2*4)?")

async for ev in handler.stream_events():
    if isinstance(ev, AgentStream):
        print(f"{ev.delta}", end="", flush=True)

response = await handler

```
Thought: The current language of the user is: English. I need to perform the calculation 20 + (2 * 4). I will start by multiplying 2 and 4, and then adding the result to 20.
Action: multiply
Action Input: {"a": 2, "b": 4}
Thought: The current language of the user is: English. Now I need to add 8 to 20.
Action: add
Action Input: {'a': 20, 'b': 8}
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: 28
```


## Evaluating the Agent with W&B Weave

When using Weave for evaluation, you need three main components:

1. **Dataset**: A collection of queries or inputs you want to evaluate your application on.

2. **Model**: An abstraction that represents the application you want to evaluate. It’s not a literal machine learning model, but a wrapper provided by Weave that defines how your application handles input and produces output.

3. **Scorers**: The metrics or scoring functions that assess how well your application performs on the dataset. For example, they might check correctness or retrieval quality.
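
The way these three pieces fit together can be sketched in plain Python. This is only an analogue of the idea, not Weave's implementation: `run_evaluation`, `toy_predict`, and `is_numeric` are hypothetical names, and the "model" is a stub that always returns the same answer.

```python
from statistics import mean
from typing import Callable, Dict, List

Row = Dict[str, str]


def run_evaluation(
    rows: List[Row],
    predict: Callable[[str], str],
    scorers: Dict[str, Callable[[str, str], float]],
) -> Dict[str, float]:
    """Run predict on every row, score each output, and average per scorer."""
    scores: Dict[str, List[float]] = {name: [] for name in scorers}
    for row in rows:
        output = predict(row["query"])
        for name, scorer in scorers.items():
            scores[name].append(scorer(output, row["query"]))
    return {name: mean(values) for name, values in scores.items()}


def toy_predict(query: str) -> str:
    """Stand-in for the application under test: always answers '10'."""
    return "10"


def is_numeric(output: str, query: str) -> float:
    """Toy scorer: 1.0 if the output is an integer string, else 0.0."""
    return 1.0 if output.strip().lstrip("-").isdigit() else 0.0


rows = [{"query": "What is 5+3+2"}, {"query": "What is 20+(2*4)?"}]
summary = run_evaluation(rows, toy_predict, {"is_numeric": is_numeric})
print(summary)
```

Each scorer receives the model output together with the row's fields, which mirrors how the scorers later in this tutorial take `output` and `query`.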

### Initializing the Project and Creating the Dataset

In [5]:
import weave
from weave import Dataset

weave.init(project_name="llama_index_react_agent_evaluations")

dataset = Dataset(
    name="react-agent-evaluation-dataset",
    rows=[
        {
            "id": "0",
            "query": "What is 5+3+2",
        },
        {
            "id": "1",
            "query": "What is 20+(2*4)?",
        },
    ],
)

weave.publish(dataset)

weave: Logged in as Weights & Biases user: siddharth-plaksha.
weave: View Weave data at https://wandb.ai/deep-learning-assignments/llama_index_react_agent_evaluations/weave
weave: 📦 Published to https://wandb.ai/deep-learning-assignments/llama_index_react_agent_evaluations/weave/objects/react-agent-evaluation-dataset/versions/85yI6eRFBYkFCHp3ZgKLg1G0PTEsU2wXlXkgr5cBq3s


ObjectRef(entity='deep-learning-assignments', project='llama_index_react_agent_evaluations', name='react-agent-evaluation-dataset', _digest='85yI6eRFBYkFCHp3ZgKLg1G0PTEsU2wXlXkgr5cBq3s', _extra=())

weave: 🍩 https://wandb.ai/deep-learning-assignments/llama_index_react_agent_evaluations/r/call/0197e2f4-613f-7316-91fa-88c66e3dcc0a


### Setting the Model

In [11]:
import weave
from llama_index.core.agent.workflow import ReActAgent
from typing import Sequence, List
from llama_index.core.agent.workflow import (
    AgentOutput,
    AgentStream,
)
from llama_index.core.tools import BaseTool
from llama_index.core import PromptTemplate


react_system_header_str = """\

You are designed to help with a variety of tasks, from answering questions \
    to providing summaries to other types of analyses.

## Tools
You have access to a wide variety of tools. You are responsible for using
the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools
to complete each subtask.

You have access to the following tools:
{tool_desc}

## Output Format
To answer the question, please use the following format.

```
Thought: One-liner explanation for the tool selection and parameter selection
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
```

Please ALWAYS start with a Thought.

Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.

If this format is used, the user will respond in the following format:

```
Observation: tool response
```

You should keep repeating the above format until you have enough information
to answer the question without using any more tools. At that point, you MUST respond
in one of the following two formats:

```
Thought: your one-liner thought here, stating how you have everything to complete the task
Answer: [your answer here]
```

```
Thought: I cannot answer the question with the provided tools.
Answer: Sorry, I cannot answer your query.
```

## Additional Rules
- The answer MUST contain a sequence of bullet points that explain how you arrived at the answer. This can include aspects of the previous conversation history.
- You MUST obey the function signature of each tool. Do NOT pass in no arguments if the function expects arguments.

## Current Conversation
Below is the current conversation consisting of interleaving human and assistant messages.

"""

react_system_prompt = PromptTemplate(react_system_header_str)

llm = GoogleGenAI(model="gemini-2.0-flash")


class LlamaIndexReActAgent(weave.Model):
    @staticmethod
    def get_react_tool_descriptions(tools: Sequence[BaseTool]) -> List[str]:
        """Describe each tool's name, description, and argument schema."""
        tool_descs = []
        for tool in tools:
            tool_desc = (
                f"> Tool Name: {tool.metadata.name}\n"
                f"Tool Description: {tool.metadata.description}\n"
                f"Tool Args: {tool.metadata.fn_schema_str}\n"
            )
            tool_descs.append(tool_desc)
        return tool_descs

    @weave.op()
    async def predict(self, query: str) -> tuple:
        # Build a fresh agent for each query and apply the custom ReAct header.
        agent = ReActAgent(tools=[multiply, add], llm=llm)
        agent.update_prompts({"react_header": react_system_prompt})

        handler = agent.run(query)
        # Collect every non-streaming event as the trace for the scorers.
        trace = []
        async for ev in handler.stream_events():
            if not isinstance(ev, AgentStream):
                trace.append(ev)

        # Wait for the run to finish before returning.
        await handler
        return trace, self.get_react_tool_descriptions(agent.tools)

Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
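
The header above doubles the braces around its JSON examples so that they survive templating, while `{tool_desc}` and `{tool_names}` stay as placeholders. The sketch below shows that escaping rule with Python's built-in `str.format` on a shortened, hypothetical template. It assumes `PromptTemplate` follows the same convention, which is worth checking against your installed LlamaIndex version.

```python
# A shortened stand-in for the ReAct header, not the full prompt.
template = (
    "You have access to the following tools:\n"
    "{tool_desc}\n"
    'Action Input: JSON kwargs (e.g. {{"input": "hello world"}})'
)

# Single braces are filled in; doubled braces collapse to literal braces.
rendered = template.format(tool_desc="> Tool Name: add")
print(rendered)
```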


### Defining the Scorers

To evaluate the ReAct agent's tool usage, we define three LLM-based scorers, each focused on a different aspect:

- **Parameter Selection:** Measures the agent's accuracy in providing correct parameters when invoking selected tools. It evaluates whether all required parameters are included, parameter names and types match the tool signature, and no invalid or extraneous parameters are passed.

- **Tool Selection:** Measures the agent's ability to identify and select the most appropriate tool(s) for accomplishing the given task. It evaluates whether the chosen tool matches the task requirements, if tools were used when necessary, and if irrelevant tools were avoided.

- **Tool Calling:** Measures the complete correctness of tool invocation by combining both tool selection and parameter accuracy. It evaluates whether the agent made the right tool call with correct parameters, representing end-to-end tool usage effectiveness.
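
To illustrate what the Parameter Selection criterion checks, here is a rule-based sketch that validates calls against a function signature using the standard `inspect` module. It is a deterministic stand-in for illustration, not the LLM-based scorer this tutorial uses: `check_parameters` is a hypothetical helper, and it borrows the 1.0 / 0.0 / -1.0 scheme from the scorers defined below.

```python
import inspect
from typing import Any, Callable, Dict, List, Tuple


def multiply(a: int, b: int) -> int:
    """Multiply two integers and return the result."""
    return a * b


def check_parameters(
    calls: List[Tuple[str, Dict[str, Any]]], tools: Dict[str, Callable[..., Any]]
) -> float:
    """Return 1.0 if every call is valid, 0.0 if there are no calls, else -1.0."""
    if not calls:
        return 0.0
    for name, kwargs in calls:
        if name not in tools:
            return -1.0  # unknown tool
        signature = inspect.signature(tools[name])
        try:
            # bind() raises TypeError for missing or unexpected arguments.
            signature.bind(**kwargs)
        except TypeError:
            return -1.0
        for key, value in kwargs.items():
            param = signature.parameters.get(key)
            if param is None or param.annotation is inspect.Parameter.empty:
                continue  # no annotation to check against
            # Only plain class annotations such as int or str are checked.
            if isinstance(param.annotation, type) and not isinstance(
                value, param.annotation
            ):
                return -1.0
    return 1.0


tools = {"multiply": multiply}
print(check_parameters([("multiply", {"a": 2, "b": 4})], tools))  # valid call
print(check_parameters([("multiply", {"a": 2})], tools))  # missing "b"
print(check_parameters([("multiply", {"a": "2", "b": 4})], tools))  # wrong type
print(check_parameters([], tools))  # no tool calls
```

A check like this catches only structural mistakes. Judging whether the argument values make sense for the task is where an LLM-based scorer is useful.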

In [12]:
from textwrap import dedent
from pydantic import BaseModel, Field
import weave
from weave.scorers.scorer_types import LLMScorer
from typing import Dict
from llama_index.core.agent.workflow import AgentInput


class ParameterSelectionCorrectnessResponse(BaseModel):
    score: float = Field(
        description=dedent(
            """
        A float score indicating correctness of tool parameter usage:
        - 1.0 = All tool calls were valid and used correct parameters.
        - 0.0 = No tool calls were made by the assistant.
        - -1.0 = At least one tool call was made, but with missing, incorrect, or invalid parameters.
        """
        ).strip()
    )


class ParameterSelectionCorrectnessScorer(LLMScorer):
    name: str = "parameter_selection_correctness"
    prompt_template: str = dedent(
        """
        You are designed to **evaluate whether tools were correctly used** in a conversation between an assistant and a user.

        ## Tools
        You have access to a list of tools and their function signatures. Your task is NOT to call them, but to inspect whether the assistant **previously called these tools correctly**, based on the function signature.

        You must check:
        - Whether the assistant used only tools listed in the available tools.
        - Whether required parameters were correctly included.
        - Whether no extra or invalid parameters were used.
        - Whether parameters follow the correct types or structure.

        ## Output Format
        You must respond with **only one** of the following three values:
        - "1.0" → All tool calls were valid and used correct parameters.
        - "0.0" → No tool calls were made by the assistant.
        - "-1.0" → At least one tool call was made, but with missing, incorrect, or invalid parameters.

        Do not include any other explanation, justification, or text in your output.

        ## Tool Definitions
        Here are the available tools and their parameters:
        {tool_desc}

        ## Trace
        Below is the trace with possible tool calls. Tool calls are in the format:
        CALL: tool_name(param1=value1, param2=value2)

        [BEGIN TRACE]
        {trace}
        [END TRACE]

        Evaluate the assistant's tool usage in this trace based on the tool definitions above, and output only the correct value from the list: "1.0", "0.0", "-1.0".
        """
    )

    model_id: str = "gemini/gemini-2.0-flash"

    @weave.op
    async def score(self, output: tuple, query: str) -> Dict:
        # output contains trace and tool descriptions
        trace, tool_desc = output

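        # Keep the most recent AgentInput event found in the trace.
        agent_input = None
        for event in trace:
            if isinstance(event, AgentInput):
                agent_input = event

        if agent_input is None:
            raise ValueError("The trace contains no AgentInput event to score.")

        # Render every message after the first as text.
        trace = "\n".join(str(message) for message in agent_input.input[1:])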

        prompt = self.prompt_template.format(trace=trace, tool_desc=tool_desc)

        response = await self._acompletion(
            messages=[{"role": "user", "content": prompt}],
            response_format=ParameterSelectionCorrectnessResponse,
            model=self.model_id,
        )
        parsed = ParameterSelectionCorrectnessResponse.model_validate_json(
            response.choices[0].message.content
        )
        return parsed.model_dump()


class ToolSelectionCorrectnessResponse(BaseModel):
    score: float = Field(
        description=dedent(
            """
        A float score indicating correctness of tool selection:
        - 1.0 = The assistant selected and used the most appropriate tools for the user’s task.
        - 0.0 = No tools were selected by the assistant, even though tool use was clearly necessary.
        - -1.0 = The assistant selected or used tools that were clearly incorrect or suboptimal for the user’s task.
        """
        ).strip()
    )


class ToolSelectionCorrectnessScorer(LLMScorer):
    name: str = "tool_selection_correctness"
    prompt_template: str = dedent(
        """
        You are designed to **evaluate whether the correct tools were selected and used** in a conversation between an assistant and a user.

        ## Task
        You are provided with a list of tools and their descriptions. Your goal is NOT to execute or validate tool parameters, but to assess whether the assistant chose the **most appropriate tool(s)** for the user's intent, based on the tool capabilities.

        You must check:
        - Whether the assistant selected tools relevant to the user’s request.
        - Whether any obviously incorrect or irrelevant tools were used.
        - Whether tool usage was necessary for the given task.
        - Whether the assistant missed calling a necessary tool when it should have.

        ## Output Format
        You must respond with **only one** of the following three values:
        - "1.0" → The assistant selected and used the most appropriate tools for the user’s task.
        - "0.0" → No tools were selected by the assistant, even though tool use was clearly necessary.
        - "-1.0" → The assistant selected or used tools that were clearly incorrect or suboptimal for the user’s task.

        Do not include any other explanation, justification, or text in your output.

        ## Tool Descriptions
        Here are the available tools and what they are used for:

        {tool_desc}

        ## Trace
        Below is the trace, including the reasoning behind tool selection and any tool calls. Tool calls are in the format:
        CALL: tool_name(param1=value1, param2=value2)

        [BEGIN TRACE]
        {trace}
        [END TRACE]

        Evaluate the assistant’s **tool selection** based on the user's needs and the tool descriptions above, and output only the correct value from the list: "1.0", "0.0", "-1.0".
        """
    )

    model_id: str = "gemini/gemini-2.0-flash"

    @weave.op
    async def score(self, output: tuple, query: str) -> Dict:
        trace, tool_desc = output

        prompt = self.prompt_template.format(trace=trace, tool_desc=tool_desc)

        response = await self._acompletion(
            messages=[{"role": "user", "content": prompt}],
            response_format=ToolSelectionCorrectnessResponse,
            model=self.model_id,
        )
        parsed = ToolSelectionCorrectnessResponse.model_validate_json(
            response.choices[0].message.content
        )
        return parsed.model_dump()


class ToolAccuracyResponse(BaseModel):
    score: float = Field(
        description=dedent(
            """
        A float score indicating accuracy of tool usage:
        - 1.0 = The assistant used the correct tool(s) at the appropriate time and passed all parameters correctly.
        - 0.0 = No tool was used by the assistant, even though tool usage was clearly necessary.
        - -1.0 = A tool was used, but either the wrong tool was selected or it was invoked with missing, incorrect, or invalid parameters.
        """
        ).strip()
    )


class ToolAccuracyScorer(LLMScorer):
    name: str = "tool_accuracy"
    prompt_template: str = dedent(
        """
        You are designed to **evaluate whether the assistant accurately used tools** in a conversation with a user.

        ## Task
        You are provided with a list of tools, their descriptions, and function signatures. Your job is to inspect the assistant’s tool usage in a conversation trace. Your evaluation must consider **both**:
        - Whether the assistant selected the **correct tool(s)** to fulfill the user's intent.
        - Whether the tool(s) were invoked with the **correct parameters**, based on the function signatures.

        You must check:
        - Did the assistant use tools relevant to the user's task?
        - Were any necessary tools **omitted**?
        - Were any tools used that were **clearly irrelevant**?
        - Did the assistant use only tools listed in the available tools?
        - Were all **required parameters included**?
        - Were there any **extra or invalid parameters**?
        - Did the parameters follow the **correct names, types, and structure**?

        ## Output Format
        You must respond with **only one** of the following three values:
        - "1.0" → The assistant used the correct tool(s) at the appropriate time and passed all parameters correctly.
        - "0.0" → No tool was used by the assistant, even though tool usage was clearly necessary.
        - "-1.0" → A tool was used, but either the wrong tool was selected or it was invoked with missing, incorrect, or invalid parameters.

        Do not include any other explanation, justification, or text in your output.

        ## Tool Definitions
        Here are the available tools and their function signatures:

        {tool_desc}

        ## Trace
        Below is the trace of the reasoning behind tool call selection. Tool calls are in the format:
        CALL: tool_name(param1=value1, param2=value2)

        [BEGIN TRACE]
        {trace}
        [END TRACE]

        Evaluate the assistant’s tool usage based on the above definitions, and output only the correct value from the list: "1.0", "0.0", "-1.0".
        """
    )

    model_id: str = "gemini/gemini-2.0-flash"

    @weave.op
    async def score(self, output: tuple, query: str) -> Dict:
        trace, tool_desc = output

        prompt = self.prompt_template.format(trace=trace, tool_desc=tool_desc)

        response = await self._acompletion(
            messages=[{"role": "user", "content": prompt}],
            response_format=ToolAccuracyResponse,
            model=self.model_id,
        )
        parsed = ToolAccuracyResponse.model_validate_json(
            response.choices[0].message.content
        )
        return parsed.model_dump()
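
The response models above type `score` as a plain float, so a judge reply such as `0.5` would still pass validation even though the rubric allows only three values. A stricter check could look like the sketch below. It uses only the standard library, `parse_judge_score` is a hypothetical helper, and it assumes the judge replies with JSON shaped like `{"score": 1.0}`.

```python
import json

ALLOWED_SCORES = {1.0, 0.0, -1.0}


def parse_judge_score(raw: str) -> float:
    """Parse a judge reply such as '{"score": 1.0}' and enforce the rubric."""
    score = float(json.loads(raw)["score"])
    if score not in ALLOWED_SCORES:
        raise ValueError(f"score {score} is not one of {sorted(ALLOWED_SCORES)}")
    return score


print(parse_judge_score('{"score": -1.0}'))
```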

### Performing Evaluations

In [13]:
parameter_selection_scorer = ParameterSelectionCorrectnessScorer()
tool_selection_scorer = ToolSelectionCorrectnessScorer()
tool_accuracy = ToolAccuracyScorer()

evaluation = weave.Evaluation(
    dataset=dataset,
    scorers=[parameter_selection_scorer, tool_selection_scorer, tool_accuracy],
)

llama_index_reAct_model = LlamaIndexReActAgent()

In [14]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

asyncio.run(evaluation.evaluate(llama_index_reAct_model))

weave: Evaluated 1 of 2 examples
weave: Evaluated 2 of 2 examples
weave: Evaluation summary {
weave:   "parameter_selection_correctness": {
weave:     "score": {
weave:       "mean": 1.0
weave:     }
weave:   },
weave:   "tool_selection_correctness": {
weave:     "score": {
weave:       "mean": 1.0
weave:     }
weave:   },
weave:   "tool_accuracy": {
weave:     "score": {
weave:       "mean": 1.0
weave:     }
weave:   },
weave:   "model_latency": {
weave:     "mean": 4.0369240045547485
weave:   }
weave: }


{'parameter_selection_correctness': {'score': {'mean': 1.0}},
 'tool_selection_correctness': {'score': {'mean': 1.0}},
 'tool_accuracy': {'score': {'mean': 1.0}},
 'model_latency': {'mean': 4.0369240045547485}}