[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/operations/trulens/query-agent-evaluation-with-trulens.ipynb)

# Query Agent Evaluation

In this recipe, we extend the [E-Commerce Query Agent](../../../weaviate-services/agents/query-agent-get-started.ipynb) and show how to evaluate and improve its performance.

We will use [TruLens](https://www.trulens.org/) to trace and evaluate the agent. By using TruLens we can identify opportunities to improve the agent, track our experiments, and compare across versions of the app.

To evaluate the agent, we can access metadata from the intermediate steps in the trace. Then, we can use this metadata to evaluate things like the relevance of the filter used by the query agent, or the relevance of the intermediate search results.

[Custom evaluations](https://www.trulens.org/component_guides/evaluation/feedback_implementations/custom_feedback_functions/) are particularly valuable here, because they allow us to easily extend existing feedback functions to unique scenarios. In this example, we show how to record a Query Agent run and how to use custom instructions to tailor an existing LLM judge to our situation.

By evaluating this ecommerce agent, we are able to identify opportunities for improvement when the search results include items that do not match what the customer is looking for.

Follow along!

## 1. Set Up Keys and Install Packages

In [None]:
#! pip install trulens-core trulens-providers-openai trulens-dashboard weaviate-client weaviate-agents datasets pydantic==2.10.6 # note: pydantic < 2.11.0 is required for now due to compatibility issue

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-proj-..."
os.environ["WEAVIATE_URL"]="..."
os.environ["WEAVIATE_API_KEY"]="..."
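
Because the placeholder values above must be replaced before the later cells can connect, a quick check can catch missing keys early. The following is a minimal sketch and not part of the original recipe: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and treating a trailing `...` as an unfilled placeholder is an assumption based on the values shown above.

```python
import os

# Names taken from the cell above; adjust if your setup uses different variables.
REQUIRED_KEYS = ("OPENAI_API_KEY", "WEAVIATE_URL", "WEAVIATE_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the required names that are unset, empty, or still a '...' placeholder."""
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        if not value or value.endswith("..."):
            missing.append(name)
    return missing
```

Calling `missing_keys()` before creating the client returns an empty list when all three variables look filled in.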

## 2. Create the Weaviate client

In [None]:
import weaviate
from weaviate.classes.init import Auth
from weaviate.agents.query import QueryAgent

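# `headers` was never defined in this notebook, and the `text2vec_weaviate`
# vectorizer used below does not need extra API-key headers, so it is omitted.
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
)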

## 3. Prepare the Collections

In [None]:
from weaviate.classes.config import Configure, Property, DataType

# Using `auto-schema` to infer the data schema during import
client.collections.create(
    "Brands",
    description="A dataset that lists information about clothing brands, their parent companies, average rating and more.",
    vectorizer_config=Configure.Vectorizer.text2vec_weaviate(),
)

# Explicitly defining the data schema
client.collections.create(
    "ECommerce",
    description="A dataset that lists clothing items, their brands, prices, and more.",
    vectorizer_config=Configure.Vectorizer.text2vec_weaviate(),
    properties=[
        Property(name="collection", data_type=DataType.TEXT),
        Property(
            name="category",
            data_type=DataType.TEXT,
            description="The category to which the clothing item belongs",
        ),
        Property(
            name="tags",
            data_type=DataType.TEXT_ARRAY,
            description="The tags that are associated with the clothing item",
        ),
        Property(name="subcategory", data_type=DataType.TEXT),
        Property(name="name", data_type=DataType.TEXT),
        Property(
            name="description",
            data_type=DataType.TEXT,
            description="A detailed description of the clothing item",
        ),
        Property(
            name="brand",
            data_type=DataType.TEXT,
            description="The brand of the clothing item",
        ),
        Property(name="product_id", data_type=DataType.UUID),
        Property(
            name="colors",
            data_type=DataType.TEXT_ARRAY,
            description="The colors on the clothing item",
        ),
        Property(name="reviews", data_type=DataType.TEXT_ARRAY),
        Property(name="image_url", data_type=DataType.TEXT),
        Property(
            name="price",
            data_type=DataType.NUMBER,
            description="The price of the clothing item in USD",
        ),
    ],
)

In [None]:
from datasets import load_dataset

brands_dataset = load_dataset(
    "weaviate/agents", "query-agent-brands", split="train", streaming=True
)
ecommerce_dataset = load_dataset(
    "weaviate/agents", "query-agent-ecommerce", split="train", streaming=True
)

brands_collection = client.collections.get("Brands")
ecommerce_collection = client.collections.get("ECommerce")

with brands_collection.batch.dynamic() as batch:
    for item in brands_dataset:
        batch.add_object(properties=item["properties"], vector=item["vector"])

with ecommerce_collection.batch.dynamic() as batch:
    for item in ecommerce_dataset:
        batch.add_object(properties=item["properties"], vector=item["vector"])

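# Report failed imports for both collections, not just Brands
for name, collection in (("Brands", brands_collection), ("ECommerce", ecommerce_collection)):
    failed_objects = collection.batch.failed_objects
    if failed_objects:
        print(f"{name}: number of failed imports: {len(failed_objects)}")
        print(f"{name}: first failed object: {failed_objects[0]}")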

print(f"Size of the ECommerce dataset: {len(ecommerce_collection)}")
print(f"Size of the Brands dataset: {len(brands_collection)}")

## 4. Create the Query Agent

In [None]:
from weaviate.agents.query import QueryAgent
from trulens.apps.app import instrument

class Agent:
    def __init__(self, client):
        self.agent =  QueryAgent(
            client=client,
            collections=["ECommerce", "Brands"],
        )

    @instrument
    def run(self, query):
        return self.agent.run(query)
    
    @instrument
    def fetch_sources(self, agent_response):  # fetching sources is unnecessary for running the agent, but gives us more power to examine and evaluate the sources
        sources = []
        for source in agent_response.sources:
            object_id = source.object_id
            collection_name = source.collection
            collection = client.collections.get(collection_name)
            data_obj = collection.query.fetch_object_by_id(object_id)
            sources.append(data_obj)
        return sources
    
    @instrument
    def run_and_fetch_sources(self, query):
        agent_response = self.run(query)
        self.fetch_sources(agent_response)
        return agent_response
    
query_agent = Agent(client)

## 5. Set the logging

By default, TruLens will log traces and evaluations locally in a `sqlite` file.

TruLens can also connect to any [SQLAlchemy-compatible database](https://www.trulens.org/component_guides/logging/where_to_log/) to store the logs, including [Snowflake](https://www.trulens.org/component_guides/logging/where_to_log/log_in_snowflake/).

In [None]:
from trulens.core import TruSession

# Set logging
session = TruSession()

## 6. Define evaluations

For this example, we will define three feedback functions to use as evaluators.

* Answer relevance will evaluate the relevance of the answer end-to-end
* Filter relevance will assess the effectiveness of the filter produced as an intermediate step by the agent
* Context relevance identifies the relevance of the search results produced as an intermediate step by the agent

In addition to scoring relevance, all three feedback functions also provide chain-of-thought reasons to explain the score. These explanations, along with the scores, are accessible in the TruLens dashboard.

In [None]:
from trulens.providers.openai import OpenAI as fOpenAI
from trulens.core import Feedback
from trulens.core import Select



# Initialize OpenAI-based feedback function collection class:
fopenai = fOpenAI()

# answer relevance
f_answer_relevance = Feedback(fopenai.relevance_with_cot_reasons, name = "Answer Relevance").on_input().on(Select.RecordCalls.run.rets.final_answer)

# filter relevance
filter_relevance_custom_criteria = "You are specifically gauging the relevance of the filter, described as a python list of dictionaries, to the query. The filter is a list of dictionaries, where each dictionary represents a filter condition. Each dictionary has three keys: 'operator', 'property_name', and 'value'. The 'operator' key is a string that represents the comparison operator to use for the filter condition. The 'property_name' key is a string that represents the property of the object to filter on. The 'value' key is a float that represents the value to compare the property to. The relevance score should be a float between 0 and 1, where 0 means the filter is not relevant to the query, and 1 means the filter is highly relevant to the query."

f_filter_relevance = Feedback(fopenai.relevance_with_cot_reasons, name = "Filter Relevance",
                              min_score_val=0,
                              max_score_val=1,
                              criteria = filter_relevance_custom_criteria,
                              ).on_input().on(Select.RecordCalls.run.rets.searches[0][0].filters[0][0].collect())

# context relevance
f_context_relevance = (
    Feedback(fopenai.context_relevance_with_cot_reasons, 
                 name = "Context Relevance",
                 criteria="Evaluate the relevance of the individual SEARCH RESULT option to the query, regardless of whether the user requests multiple options. If the only issue is that the SEARCH RESULT does not provide a list of multiple options, return a RELEVANCE score of 3.")
                 .on_input()
                 .on(Select.RecordCalls.fetch_sources.rets[:].properties)
)
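
LLM judges are not the only option for evaluating intermediate steps. As a complement to the filter relevance judge above, the sketch below shows what a deterministic custom check could look like: a plain Python function that returns a score between 0 and 1. It assumes the filter has the shape described in `filter_relevance_custom_criteria` (a list of dictionaries with `operator`, `property_name`, and `value` keys). The function name `budget_filter_score` and the operator spellings are hypothetical, and the actual filter objects returned by the Query Agent may differ.

```python
import re

# Assumed operator spellings; the agent's real filter objects may use other names.
UPPER_BOUND_OPERATORS = {"<", "<=", "LessThan", "LessEqual"}


def budget_filter_score(query: str, filters: list) -> float:
    """Score 1.0 if a price filter caps results at the budget named in the query.

    Returns 1.0 when the query names no dollar budget (nothing to enforce)
    and 0.0 when a budget is named but no matching price cap is found.
    """
    match = re.search(r"\$\s*(\d+(?:\.\d+)?)", query)
    if match is None:
        return 1.0
    budget = float(match.group(1))

    for condition in filters:
        value = condition.get("value")
        if (
            condition.get("property_name") == "price"
            and condition.get("operator") in UPPER_BOUND_OPERATORS
            and isinstance(value, (int, float))
            and value <= budget
        ):
            return 1.0
    return 0.0
```

If the shape matches what you see in your traces, a function like this could be wrapped in `Feedback(...)` with the same selectors as `f_filter_relevance`; that wiring is an assumption and is not shown here.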

## 7. Register the agent

Registering the agent allows us to track our experiments. We give it a name and a version, and list the feedback functions we want to use to evaluate application runs.

In [None]:
from trulens.apps.app import TruApp

tru_agent = TruApp(
    query_agent,
    app_name="query agent",
    app_version="base",
    feedbacks=[f_answer_relevance, f_filter_relevance, f_context_relevance],
)

## 8. Run and record the agent

Using the `tru_agent` we just defined as a context manager, we can run our agent. This records the traces and kicks off evaluations each time the agent runs.

In [None]:
with tru_agent as recording:
    response = query_agent.run_and_fetch_sources("I like vintage clothes, can you list me some options that are less than $200?")

## 9. Run the dashboard

TruLens ships with a Streamlit dashboard that can be launched using `run_dashboard`. It reads from the evaluation and trace logs.

This is the primary way you can explore traces, view evaluation results, and compare app versions.

In [None]:
from trulens.dashboard import run_dashboard

run_dashboard(session)  # open a local streamlit app to explore

## 10. Identify issue using the TruLens dashboard

By evaluating the query agent, we notice it occasionally returns non-clothing items even though the customer specifically asks for clothing.

![trulens identify issues](images/trulens_weaviate_identify_issues.gif)

## 11. Improve the agent

Let's add an additional instruction to the system prompt to guide the agent to return only the type of result the user is looking for.

In [None]:
from weaviate.agents.query import QueryAgent
from trulens.apps.app import instrument

class Agent:
    def __init__(self, client):
        self.agent =  QueryAgent(
            client=client,
            collections=["ECommerce", "Brands"],
            system_prompt="You are a helpful assistant that always returns only results that match the user's query. For example, if the user asks for clothing, only return clothing."
        )

    @instrument
    def run(self, query):
        return self.agent.run(query)
    
    @instrument
    def fetch_sources(self, agent_response):  # fetching sources is unnecessary for running the agent, but gives us more power to examine and evaluate the sources
        sources = []
        for source in agent_response.sources:
            object_id = source.object_id
            collection_name = source.collection
            collection = client.collections.get(collection_name)
            data_obj = collection.query.fetch_object_by_id(object_id)
            sources.append(data_obj)
        return sources
    
    @instrument
    def run_and_fetch_sources(self, query):
        agent_response = self.run(query)
        self.fetch_sources(agent_response)
        return agent_response
    
query_agent = Agent(client)

## 12. Validate performance

Last, we'll register the improved version of the app and validate the performance improvement.

In [None]:
from trulens.apps.app import TruApp

tru_agent = TruApp(
    query_agent,
    app_name="query agent",
    app_version="modified system prompt",
    feedbacks=[f_answer_relevance, f_filter_relevance, f_context_relevance],
)

In [None]:
with tru_agent as recording:
    response = query_agent.run_and_fetch_sources("I like vintage clothes, can you list me some options that are less than $200?")

In the dashboard, we can compare application versions and their evaluation results.

Comparing the two versions, we see that context relevance improves with the modified system prompt.

![trulens validate](images/trulens_weaviate_validate.gif)
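
The dashboard does this comparison for you, but the underlying idea is simply averaging each feedback score per app version. The sketch below illustrates that aggregation in plain Python. It is not TruLens code: `mean_score_by_version` is a hypothetical helper, and the `(app_version, feedback_name, score)` tuple format is an assumed export shape chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(rows):
    """Average each feedback score per app version.

    `rows` is an iterable of (app_version, feedback_name, score) tuples,
    a hypothetical format used only for this sketch.
    """
    grouped = defaultdict(list)
    for app_version, feedback_name, score in rows:
        grouped[(app_version, feedback_name)].append(score)
    # One mean per (version, feedback) pair, e.g. ("base", "Context Relevance")
    return {key: mean(scores) for key, scores in grouped.items()}
```

Looking up the same feedback name under `"base"` and `"modified system prompt"` in the returned dictionary gives the two numbers to compare.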