# Transformation Agent Demo

## Connect to the Weaviate Cloud instance

> Reminder: Weaviate Agents are only available for Weaviate Cloud instances.

Connect to your Weaviate instance, using credentials from the Weaviate Cloud console. Here, they are loaded from the `.env` file.

In [1]:
from dotenv import load_dotenv
import weaviate
import os

load_dotenv()

weaviate_url = os.getenv("WEAVIATE_URL")
weaviate_api_key = os.getenv("WEAVIATE_API_KEY")

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=weaviate_api_key,
)

assert client.is_ready()
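
If the `.env` file is missing or incomplete, `os.getenv` returns `None` and the problem only surfaces later, inside the connection call. A minimal sketch of an early check is shown here; the helper name `require_env` is hypothetical and not part of the Weaviate client.

```python
import os
from typing import Mapping


def require_env(names: list[str], env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the requested variables, raising if any is missing or empty."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Illustration with an explicit mapping instead of the real environment.
example_env = {"WEAVIATE_URL": "https://example.invalid", "WEAVIATE_API_KEY": "secret"}
creds = require_env(["WEAVIATE_URL", "WEAVIATE_API_KEY"], example_env)
```

In the notebook, this could be called as `require_env(["WEAVIATE_URL", "WEAVIATE_API_KEY"])` right after `load_dotenv()`.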

## Add data

In [2]:
import json

with open("data/simplified_posts.json", "r") as f:
    data = json.load(f)

In [3]:
client.collections.delete("ForumPost")

In [4]:
from weaviate.classes.config import Configure, DataType, Property

client.collections.create(
    "ForumPost",
    description="This collection contains conversations from the Weaviate Forum.",
    properties=[
        Property(
            name="user_id",
            description="Unique identifier for the user creating the thread.",
            data_type=DataType.INT,
        ),
        Property(
            name="conversation",
            description="Text of the entire forum conversation thread, truncated to 20,000 characters maximum for context limit.",
            data_type=DataType.TEXT,
        ),
        Property(
            name="conversation_full",
            description="Full text of the entire forum conversation thread.",
            data_type=DataType.TEXT,
        ),
        Property(
            name="date_created",
            description="Date and time when the thread was first created.",
            data_type=DataType.DATE,
        ),
        Property(
            name="has_accepted_answer",
            description="Whether the thread has an accepted answer.",
            data_type=DataType.BOOL,
        ),
        Property(
            name="title",
            description="Title text of the forum thread.",
            data_type=DataType.TEXT,
        ),
        Property(
            name="topic_id",
            description="Unique identifier for the topic of the thread.",
            data_type=DataType.INT,
        ),
    ],
    vectorizer_config=[
        Configure.NamedVectors.text2vec_weaviate(
            name="default", source_properties=["conversation_full", "title"]
        ),
    ],
    inverted_index_config=Configure.inverted_index(
        index_null_state=True,
        index_timestamps=True,
    ),
)


<weaviate.collections.collection.sync.Collection at 0x120f6b8f0>

In [5]:
from tqdm import tqdm
from datetime import datetime, timezone
from weaviate.util import generate_uuid5

posts = client.collections.get("ForumPost")

with posts.batch.fixed_size(200) as batch:
    for i, row in tqdm(enumerate(data)):
        row["date_created"] = datetime.fromisoformat(row["date_created"]).replace(
            tzinfo=timezone.utc
        )
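        # Keep the untruncated text before shortening `conversation`.
        row["conversation_full"] = row["conversation"]
        if len(row["conversation"]) > 20000:
            # Keep the first and last 10,000 characters, joined by "...".
            row["conversation"] = (
                row["conversation"][:10000] + "..." + row["conversation"][-10000:]
            )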
        batch.add_object(properties=row, uuid=generate_uuid5(row["topic_id"]))

if posts.batch.failed_objects:
    for obj in posts.batch.failed_objects[:5]:
        # Each entry is an error object with attributes; it is not subscriptable.
        print(f"Failed to add object {obj.object_.uuid}: {obj.message}")

596it [00:00, 26294.92it/s]
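
The import loop shortens long conversations by keeping the start and the end of the text. As a sketch, the same rule can be written as a standalone function; the name `truncate_middle` and its parameters are assumptions made for illustration. Note that, as in the loop above, the result is `max_chars` plus the length of the marker.

```python
def truncate_middle(text: str, max_chars: int = 20000, marker: str = "...") -> str:
    """Keep the head and tail of `text`, replacing the middle with `marker`."""
    if max_chars < 2:
        raise ValueError("max_chars must be at least 2")
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    # Slice the first and last `half` characters and join them with the marker.
    return text[:half] + marker + text[-half:]


short = truncate_middle("abcdefghij", max_chars=4)
long_result = truncate_middle("a" * 30000)
```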


In [6]:
print(len(posts))

596


In [7]:
collection = client.collections.get("ForumPost")

print("Currently existing properties:\n")
for p in collection.config.get().properties:
    print(f"Property: {p.name}")

Currently existing properties:

Property: user_id
Property: conversation
Property: conversation_full
Property: date_created
Property: has_accepted_answer
Property: title
Property: topic_id


## How to use the Transformation Agent (TA)

First, define the operation for the TA to perform:

In [8]:
from weaviate.classes.config import DataType
from weaviate.agents.classes import Operations

add_technical_complexity = Operations.append_property(
    property_name="technicalComplexity",
    data_type=DataType.INT,
    view_properties=["conversation"],
    instruction="""
    Rate the technical complexity of the user's forum post query
    on a scale from 1 to 5, where 1 is very simple and 5 is very complex.
    """,
)

### Run the TA

The TA runs asynchronously: `update_all()` starts the workflow and returns a response containing a workflow ID, so the work continues in the background and you can check its status later.

In [9]:
from weaviate.agents.transformation import TransformationAgent

ta = TransformationAgent(
    client=client,
    collection="ForumPost",
    operations=[add_technical_complexity]
)

ta_response = ta.update_all()


### Check the status of the TA

In [10]:
ta.get_status(workflow_id=ta_response.workflow_id)

{'workflow_id': 'TransformationWorkflow-41f9bd0027ca972a56b19b7813c401cb',
 'status': {'batch_count': 3,
  'end_time': None,
  'start_time': '2025-06-09 18:45:10',
  'state': 'running',
  'total_duration': None,
  'total_items': 596}}

You can also use a helper function that checks the status in a loop until the workflow finishes:

In [11]:
from helpers import get_ta_status

get_ta_status(agent_instance=ta, workflow_id=ta_response.workflow_id)

Waiting... Elapsed time: 149.18 seconds
Total time: 151.83 seconds
{'workflow_id': 'TransformationWorkflow-41f9bd0027ca972a56b19b7813c401cb', 'status': {'batch_count': 3, 'end_time': '2025-06-09 18:47:42', 'start_time': '2025-06-09 18:45:10', 'state': 'completed', 'total_duration': 151.831738, 'total_items': 596}}
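
The `helpers` module is not shown in this notebook. A minimal sketch of what such a polling loop might look like follows. It assumes the status dictionary has the shape printed above and that any state other than `'running'` is terminal; the function name `wait_for_workflow` and its parameters are hypothetical.

```python
import time
from typing import Any, Callable


def wait_for_workflow(
    get_status: Callable[[], dict[str, Any]],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll `get_status` until the workflow leaves the 'running' state."""
    start = time.monotonic()
    while True:
        response = get_status()
        # Assumption: the state lives under response["status"]["state"].
        if response["status"]["state"] != "running":
            return response
        if time.monotonic() - start > timeout_seconds:
            raise TimeoutError(f"Workflow still running after {timeout_seconds} s")
        sleep(poll_seconds)


# Illustration with a fake status source instead of a real agent.
fake_states = iter(["running", "running", "completed"])
sleep_calls: list[float] = []
final = wait_for_workflow(
    lambda: {"status": {"state": next(fake_states)}},
    poll_seconds=0,
    sleep=sleep_calls.append,
)
```

With a real agent, the callable could be `lambda: ta.get_status(workflow_id=ta_response.workflow_id)`.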


In [None]:
print("Currently existing properties:\n")
for p in collection.config.get().properties:
    print(f"Property: {p.name}")

You can define as many operations as you like.

In [12]:
from helpers import TECHNICAL_DOMAIN_CATEGORIES, ROOT_CAUSE_CATEGORIES, ACCESS_CONTEXT_CATEGORIES


add_technical_domain = Operations.append_property(
    property_name="technicalDomain",
    data_type=DataType.TEXT,
    view_properties=["conversation", "title"],
    instruction=f"""
    Identify the primary technical domain of the user's forum post query.
    The answer must be one of the following:
    {TECHNICAL_DOMAIN_CATEGORIES.keys()}

    The definitions of the categories are as follows:
    {TECHNICAL_DOMAIN_CATEGORIES}

    Remember that the answer must be one of these categories:
    {TECHNICAL_DOMAIN_CATEGORIES.keys()}
    """,
)

add_root_cause_category = Operations.append_property(
    property_name="rootCauseCategory",
    data_type=DataType.TEXT,
    view_properties=["conversation", "title"],
    instruction=f"""
    Based on the text, what was the fundamental issue behind the user's question? The answer must be one of the following categories:
    {ROOT_CAUSE_CATEGORIES.keys()}

    The definitions of the categories are as follows:
    {ROOT_CAUSE_CATEGORIES}
    For example, if the user was confused about how to use a specific feature of Weaviate, the answer should be "conceptual_misunderstanding".

    Remember that the answer must be one of these categories:
    {ROOT_CAUSE_CATEGORIES.keys()}
    """,
)

add_access_context = Operations.append_property(
    property_name="accessContext",
    data_type=DataType.TEXT,
    view_properties=["conversation", "title"],
    instruction=f"""
    Based on the text, how was the user trying to access Weaviate? The answer must be one of the following categories:

    {ACCESS_CONTEXT_CATEGORIES.keys()}

    The definitions of the categories are as follows:
    {ACCESS_CONTEXT_CATEGORIES}
    For example, if the user was using the Weaviate Python client library, the answer should be "python_client".

    Remember that the answer must be one of these categories:
    {ACCESS_CONTEXT_CATEGORIES.keys()}
    """,
)

was_it_caused_by_outdated_stack = Operations.append_property(
    property_name="causedByOutdatedStack",
    data_type=DataType.BOOL,
    view_properties=["conversation", "title"],
    instruction="""
    Based on the text, was the user's question caused by an outdated version of Weaviate or its components, such as the client library being used?
    """,
)

was_it_a_documentation_gap = Operations.append_property(
    property_name="isDocumentationGap",
    data_type=DataType.BOOL,
    view_properties=["conversation", "title"],
    instruction="""
    Based on the text, identify whether the user's question was caused by a lack of documentation or unclear instructions regarding Weaviate.

    This does not include cases where the documentation exists, and the user did not find it, or did not read it.
    This also does not include cases where the user was asking about a feature that is not supported by Weaviate,
    or the user was asking about a feature that is not part of a first-party Weaviate product, such as a third-party integration or a custom implementation.
    This also does not include cases where there was a bug in the code, or the user was using an outdated version of Weaviate or its components.

    Only mark this as true if the user was asking about a feature or an aspect
    that is not covered by the documentation, or the documentation was unclear or incorrect.
    """,
)

create_summary = Operations.append_property(
    property_name="summary",
    data_type=DataType.TEXT,
    view_properties=["conversation", "title"],
    instruction="""
    Briefly summarize the user's question and the resolution provided (if any) in a few sentences.
    """,
)
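
One detail of the f-string instructions above: interpolating `some_dict.keys()` renders as `dict_keys([...])` in the prompt text. The sketch below shows one way to build the same kind of instruction with plain, comma-separated names. The function `build_category_instruction` and the example categories are hypothetical; the real category dictionaries live in `helpers`.

```python
def build_category_instruction(question: str, categories: dict[str, str]) -> str:
    """Build an instruction that constrains the answer to the given category names."""
    names = ", ".join(categories)
    definitions = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    return (
        f"{question}\n"
        f"The answer must be one of the following: {names}\n\n"
        f"The definitions of the categories are as follows:\n{definitions}\n\n"
        f"Remember that the answer must be one of these categories: {names}"
    )


# Hypothetical categories, for illustration only.
EXAMPLE_CATEGORIES = {
    "queries": "Searching, filtering, or aggregating data.",
    "ingestion": "Importing or updating objects.",
}

instruction = build_category_instruction(
    "Identify the primary technical domain of the user's forum post query.",
    EXAMPLE_CATEGORIES,
)
raw_rendering = f"{EXAMPLE_CATEGORIES.keys()}"
```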

You can pass multiple operations to a single TA instance and have it perform them all in one run.

In [13]:
ta = TransformationAgent(
    client=client,
    collection="ForumPost",
    operations=[
        add_technical_domain,
        add_root_cause_category,
        add_access_context,
        was_it_caused_by_outdated_stack,
        was_it_a_documentation_gap,
        create_summary
    ],
)

ta_response = ta.update_all()

In [14]:
get_ta_status(agent_instance=ta, workflow_id=ta_response.workflow_id)

Waiting... Elapsed time: 9.56 seconds
Waiting... Elapsed time: 20.01 seconds
Waiting... Elapsed time: 30.50 seconds
Waiting... Elapsed time: 40.97 seconds
Waiting... Elapsed time: 51.46 seconds
Waiting... Elapsed time: 62.13 seconds
Waiting... Elapsed time: 72.60 seconds
Waiting... Elapsed time: 83.06 seconds
Waiting... Elapsed time: 93.50 seconds
Waiting... Elapsed time: 104.25 seconds
Waiting... Elapsed time: 114.73 seconds
Waiting... Elapsed time: 125.35 seconds
Waiting... Elapsed time: 135.86 seconds
Waiting... Elapsed time: 146.35 seconds
Waiting... Elapsed time: 156.83 seconds
Waiting... Elapsed time: 167.47 seconds
Waiting... Elapsed time: 177.96 seconds
Waiting... Elapsed time: 188.44 seconds
Waiting... Elapsed time: 198.91 seconds
Total time: 206.99 seconds
{'workflow_id': 'TransformationWorkflow-2df14f478751f7a464e1061231241931', 'status': {'batch_count': 3, 'end_time': '2025-06-09 18:51:59', 'start_time': '2025-06-09 18:48:32', 'state': 'completed', 'total_duration': 206.986

In [15]:
print("Currently existing properties:\n")
for p in collection.config.get().properties:
    print(f"Property: {p.name}")

Currently existing properties:

Property: user_id
Property: conversation
Property: conversation_full
Property: date_created
Property: has_accepted_answer
Property: title
Property: topic_id
Property: technicalComplexity
Property: technicalDomain
Property: rootCauseCategory
Property: accessContext
Property: causedByOutdatedStack
Property: isDocumentationGap
Property: summary
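
Rather than reading the printed list by eye, the check can be automated. This is a small sketch using plain lists of names; the helper `missing_properties` is hypothetical.

```python
def missing_properties(existing: list[str], expected: list[str]) -> list[str]:
    """Return the expected property names that are not in `existing`, in order."""
    existing_set = set(existing)
    return [name for name in expected if name not in existing_set]


# Names copied from the printed output above.
existing_names = [
    "user_id", "conversation", "conversation_full", "date_created",
    "has_accepted_answer", "title", "topic_id", "technicalComplexity",
    "technicalDomain", "rootCauseCategory", "accessContext",
    "causedByOutdatedStack", "isDocumentationGap", "summary",
]
expected_names = ["technicalDomain", "rootCauseCategory", "summary"]
missing = missing_properties(existing_names, expected_names)
```

In the notebook, `existing_names` could be built with `[p.name for p in collection.config.get().properties]`.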


## Queries enabled by the new properties

In [16]:
from weaviate.classes.aggregate import GroupByAggregate

analysis_props = [
    "technicalComplexity",
    "technicalDomain",
    "rootCauseCategory",
    "accessContext",
    "causedByOutdatedStack",
    "isDocumentationGap"
]

for prop in analysis_props:

    response = collection.aggregate.over_all(
        group_by=GroupByAggregate(prop=prop)
    )
    print(f"\nProperty: {prop}")
    for group in response.groups:
        print(f"Value: {group.grouped_by} Count: {group.total_count}")


Property: technicalComplexity
Value: GroupedBy(prop='technicalComplexity', value=4.0) Count: 294
Value: GroupedBy(prop='technicalComplexity', value=5.0) Count: 188
Value: GroupedBy(prop='technicalComplexity', value=3.0) Count: 71
Value: GroupedBy(prop='technicalComplexity', value=2.0) Count: 43

Property: technicalDomain
Value: GroupedBy(prop='technicalDomain', value='queries') Count: 173
Value: GroupedBy(prop='technicalDomain', value='deployment') Count: 144
Value: GroupedBy(prop='technicalDomain', value='integration') Count: 97
Value: GroupedBy(prop='technicalDomain', value='ingestion') Count: 83
Value: GroupedBy(prop='technicalDomain', value='server_setup') Count: 71
Value: GroupedBy(prop='technicalDomain', value='others') Count: 12
Value: GroupedBy(prop='technicalDomain', value='security') Count: 9
Value: GroupedBy(prop='technicalDomain', value='configuration') Count: 1
Value: GroupedBy(prop='technicalDomain', value='property_recommendation') Count: 1
Value: GroupedBy(prop='techni

In [17]:
from weaviate.classes.query import Filter

prop = "technicalDomain"
response = collection.aggregate.over_all(
    group_by=GroupByAggregate(prop=prop),
    filters=Filter.by_property(name="rootCauseCategory").equal("conceptual_misunderstanding")
)

print(f"\nProperty: {prop}")
for group in response.groups:
    print(f"Value: {group.grouped_by} Count: {group.total_count}")


Property: technicalDomain
Value: GroupedBy(prop='technicalDomain', value='queries') Count: 65
Value: GroupedBy(prop='technicalDomain', value='integration') Count: 30
Value: GroupedBy(prop='technicalDomain', value='ingestion') Count: 19
Value: GroupedBy(prop='technicalDomain', value='deployment') Count: 18
Value: GroupedBy(prop='technicalDomain', value='others') Count: 6
Value: GroupedBy(prop='technicalDomain', value='server_setup') Count: 5
Value: GroupedBy(prop='technicalDomain', value='security') Count: 1
Value: GroupedBy(prop='technicalDomain', value='multimodal search') Count: 1
Value: GroupedBy(prop='technicalDomain', value='property_recommendation') Count: 1
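
Some grouped values, such as `'multimodal search'`, may fall outside the intended category list, since the agent's output is generated text. The allowed list itself lives in `helpers` and is not shown here, so the sketch below uses a hypothetical `allowed` set to show how grouped counts could be split into expected and unexpected values. The function name `split_by_vocabulary` is also an assumption.

```python
def split_by_vocabulary(
    counts: dict[str, int], allowed: set[str]
) -> tuple[dict[str, int], dict[str, int]]:
    """Split value counts into those inside and outside the allowed vocabulary."""
    in_vocab = {value: n for value, n in counts.items() if value in allowed}
    out_of_vocab = {value: n for value, n in counts.items() if value not in allowed}
    return in_vocab, out_of_vocab


# Counts copied from part of the output above; `allowed` is hypothetical.
counts = {"queries": 65, "integration": 30, "ingestion": 19, "multimodal search": 1}
allowed = {"queries", "integration", "ingestion", "deployment"}
expected_counts, unexpected_counts = split_by_vocabulary(counts, allowed)
```

In the notebook, `counts` could be built as `{g.grouped_by.value: g.total_count for g in response.groups}`.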


In [None]:
from weaviate.classes.generate import GenerativeConfig

anthropic_key = os.getenv("ANTHROPIC_API_KEY")

# Close the existing connection before reconnecting with the Anthropic key header.
client.close()

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=weaviate_api_key,
    headers={
        "X-Anthropic-Api-Key": anthropic_key,
    }
)

collection = client.collections.get("ForumPost")


response = collection.generate.fetch_objects(
    filters=(
        Filter.by_property(name="rootCauseCategory").equal("conceptual_misunderstanding") &
        Filter.by_property(name="technicalDomain").equal("queries")
    ),
    limit=100,
    generative_provider=GenerativeConfig.anthropic(model="claude-3-7-sonnet-latest"),
    grouped_task="""
    From these Weaviate Forum post conversations, identify 3-5 most common things
    that we can help users to understand better about Weaviate queries.
    If possible, also provide a count of each type in the sample.
    """,
    grouped_properties=["summary", "title"]
)

print(f"\n{response.generative.text}")


# Common Points of Confusion in Weaviate Queries

Based on an analysis of these 63 forum posts, here are the most common areas where users need better understanding about Weaviate queries:

## 1. Hybrid Search Configuration and Scoring (12 posts)
Users frequently struggle to understand how hybrid search combines vector and keyword searches, particularly:
- How scoring works in hybrid search and why scores might seem inconsistent
- How to adjust the alpha parameter to balance between vector and keyword components
- When to use hybrid search vs. pure vector or keyword search
- How to properly configure hybrid search with filters

Examples include posts like "Hybrid search in weaviate", "Hybrid similarity scoring is so weird", and "Hybrid search with embedding outside the database".

## 2. Property Configuration for Searching vs. Filtering (10 posts)
Many users confuse when to use indexSearchable vs. indexFilterable, or struggle with:
- How to properly configure text fields for search vs

In [19]:
client.close()