# Computer-Use Agents SOTA Challenge

This notebook demonstrates how to create a computer use agent with Cua and evaluate it using HUD.

## Step 1: Connect to cloud services

You will need a Cua account to run computer use agents in the cloud and a HUD account to evaluate them.

1. Create a Cua account at https://www.trycua.com/
2. Start a Cua container at https://www.trycua.com/dashboard/containers
3. Create a HUD account at https://www.hud.so/
4. Create a `.env` file in the working directory or its parent (the two locations the next cell loads from), like this:

```
# Required environment variables:
CUA_API_KEY=
CUA_CONTAINER_NAME=
HUD_API_KEY=

# Any LLM provider will work:
ANTHROPIC_API_KEY=
OPENAI_API_KEY=
```

In [None]:
# Read the .env file

from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')
load_dotenv(dotenv_path='.env')
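
Before moving on, it can help to confirm that the variables from the template were actually picked up. The following is a minimal sketch using only the standard library; `check_env` is a hypothetical helper, not part of Cua or HUD, and it assumes that one of the two provider keys listed in the template is the one you intend to use. Extend `PROVIDER_VARS` if you use a different provider.

```python
import os

# Names taken from the .env template in Step 1.
REQUIRED_VARS = ["CUA_API_KEY", "CUA_CONTAINER_NAME", "HUD_API_KEY"]
# Assumption: at least one of these provider keys should be present.
PROVIDER_VARS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def check_env(env=None):
    """Return a list of readable problems; an empty list means nothing is missing."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if not any(env.get(name) for name in PROVIDER_VARS):
        problems.append("no LLM provider key is set (" + " or ".join(PROVIDER_VARS) + ")")
    return problems


for problem in check_env():
    print("Missing configuration:", problem)
```

If the cell prints nothing, every variable it checks has a non-empty value.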

## Step 2: Create a Computer Use Agent

Connect to your running Cua container using the Cua SDK and initialize an agent.

In [None]:
import logging
from pathlib import Path
import os

from agent import ComputerAgent
from computer import Computer, VMProviderType

# Connect to your existing cloud container
computer = Computer(
    os_type="linux",
    provider_type=VMProviderType.CLOUD,
    api_key=os.getenv("CUA_API_KEY"),
    name=os.getenv("CUA_CONTAINER_NAME"),
    verbosity=logging.INFO
)

# Create agent
agent = ComputerAgent(
    model="openai/computer-use-preview",
    tools=[computer],
    trajectory_dir=str(Path("trajectories")),
    only_n_most_recent_images=3,
    verbosity=logging.INFO
)

## Step 3: Run a Simple Task

Try running the computer use agent on a simple task.

Trajectories are saved in the format: `trajectories/YYYY-MM-DD_computer-use-pre_XXX`.

To view a replay of the agent's actions, upload the trajectory to the [trajectory viewer](https://www.trycua.com/trajectory-viewer).

You can also connect to an agent through VNC on the [Cua Dashboard](https://www.trycua.com/dashboard).

In [None]:
tasks = [
    "Look for a repository named trycua/cua on GitHub."
]

for i, task in enumerate(tasks, start=1):
    print(f"\nExecuting task {i}/{len(tasks)}: {task}")
    async for result in agent.run(task):
        print(result)

    print(f"\n✅ Task {i}/{len(tasks)} completed: {task}")
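
To find the run you just recorded, for example before uploading it to the trajectory viewer, you can look up the newest directory under `trajectories`. This is a rough sketch that relies only on the directory layout described above; `latest_trajectory` is a hypothetical helper, and choosing by modification time rather than by name is an assumption made here for simplicity.

```python
from pathlib import Path


def latest_trajectory(root="trajectories"):
    """Return the most recently modified run directory under root, or None if there is none."""
    root = Path(root)
    if not root.is_dir():
        return None
    runs = [p for p in root.iterdir() if p.is_dir()]
    # max() with default=None handles an empty trajectories folder.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


print("Latest trajectory:", latest_trajectory())
```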

## Step 4: Evaluate the Agent with HUD

Test your agent's performance on a selection of tasks from the OSWorld benchmark:

In [None]:
import uuid
from pprint import pprint
from agent.integrations.hud import run_full_dataset

# Full dataset evaluation (runs via HUD's run_dataset under the hood)
job_name = f"osworld-test-{str(uuid.uuid4())[:4]}"

results = await run_full_dataset(
    dataset="ddupont/OSWorld-Tiny-Public",          # You can also pass a Dataset or a list[dict]
    job_name=job_name,                   # Optional; defaults to a timestamp for custom datasets
    model="openai/computer-use-preview", # Or any supported model string
    max_concurrent=20,                   # Tune to your infra
    max_steps=50,                        # Safety cap per task
    split="train[:3]",                   # Limit to just 3 tasks
    # instructions="...",       # Set a custom system prompt
    # callbacks=[],             # Set custom callbacks
    # tools=[your_python_func], # Add custom tools
)

# results is a list from hud.datasets.run_dataset; inspect/aggregate as needed
print(f"Job: {job_name}")
print(f"Total results: {len(results)}")
pprint(results[:3])
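
One way to aggregate the list is to average a per-task score. The sketch below assumes that each result object exposes a numeric `reward` attribute; this notebook does not confirm that, so inspect the `pprint` output first and adjust the attribute name if needed. `summarize_rewards` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for illustration only.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_rewards(results):
    """Average the `reward` attribute across results, skipping entries without one."""
    # Assumption: a missing or None reward means the task was not scored.
    rewards = [r.reward for r in results if getattr(r, "reward", None) is not None]
    return {
        "scored": len(rewards),
        "skipped": len(results) - len(rewards),
        "mean_reward": mean(rewards) if rewards else None,
    }


# Stand-in data; replace `sample` with the real `results` list.
sample = [SimpleNamespace(reward=1.0), SimpleNamespace(reward=0.0), SimpleNamespace(reward=None)]
print(summarize_rewards(sample))
```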

If HUD is not available, you can also test your agent offline on the OSWorld-G evaluation set:

In [None]:
from agent.benchmark import run_offline_dataset
import json
import logging

# Choose a model and any ComputerAgent kwargs you want
# Examples:
#   model="openai/computer-use-preview"
#   model="anthropic/claude-3-5-sonnet-20241022"
# You can also pass other ComputerAgent kwargs like only_n_most_recent_images, callbacks, etc.
results = await run_offline_dataset(
    dataset="MMInstruction/OSWorld-G",
    split="test[:50]",
    model="openai/computer-use-preview",
    only_n_most_recent_images=3,
    verbosity=logging.INFO,     # Print agent outputs
    # instructions="...",       # Set a custom system prompt
    # callbacks=[],             # Set custom callbacks
    # tools=[your_python_func], # Add custom tools
)

print("Summary:", json.dumps(results["summary"], indent=2))
print("First 3 results:", json.dumps(results["results"][:3], indent=2))
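
Since the cell above already serializes `results` with `json.dumps`, the same data can be written to disk so that runs with different models or settings can be compared later. This is a minimal sketch; `save_results`, the `benchmark_results` directory, and the file naming scheme are all hypothetical choices, not part of the Cua SDK.

```python
import json
from datetime import datetime
from pathlib import Path


def save_results(results, out_dir="benchmark_results", label="osworld-g"):
    """Write results to a timestamped JSON file and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = out_dir / f"{label}-{stamp}.json"
    # Assumes `results` is JSON-serializable as a whole.
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path
```

Calling `save_results(results)` after an offline run returns the path of the file it wrote.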

## Step 5: Improve your Agent

Improve your agent to get the highest score possible on OSWorld-Verified. Here are some ideas to get you started:

- Experiment with different models or combinations of models
- Try adding your own custom tools to the agent
- Read the ComputerAgent source code, and come up with your own improved version/subclass
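
As a starting point for the custom tools idea, the commented-out `tools=[your_python_func]` option in the evaluation cells suggests that a plain Python function can be passed as a tool. The sketch below shows what such a function might look like; `current_utc_time` is a hypothetical example, and the assumption that the agent describes the tool to the model using its signature and docstring is not confirmed by this notebook.

```python
from datetime import datetime, timezone


def current_utc_time() -> str:
    """Return the current UTC time as an ISO 8601 string.

    Intended for tasks that depend on today's date, which the model
    may not be able to read reliably from a screenshot.
    """
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


print(current_utc_time())
```

Under that assumption, the function would be passed alongside the computer, for example `tools=[computer, current_utc_time]` when constructing `ComputerAgent`.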