# ComputerAgent HUD Integration for OSWorld

This notebook demonstrates how to use ComputerAgent with HUD to run benchmark tasks from OSWorld (and, as a second example, SheetBench).
The ComputerAgent integration exposes the same interface as OperatorAgent but works with both Claude and OpenAI models.

In [None]:
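# Install dependencies if needed.
# Note: each `!` command runs in its own subshell, so `!source` does not
# activate the venv for the running kernel; run these in a terminal instead.
# !uv venv
# !source .venv/bin/activate
# !uv sync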

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Required environment variables:
# - HUD_API_KEY (for HUD access)
# - ANTHROPIC_API_KEY (for Claude models)
# - OPENAI_API_KEY (for OpenAI models)

from hud import gym, load_taskset
from pprint import pprint
import asyncio
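
Before going further, it can help to confirm that the keys listed above are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of HUD or the agent package.

```python
import os

# Keys named in the cell above. Likely only the provider key for the
# model you actually use is needed alongside HUD_API_KEY (an assumption).
REQUIRED_VARS = ["HUD_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All listed environment variables are set.")
```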

In [5]:
# Import the HUD-integrated ComputerAgent
from agent.integrations.hud import ComputerAgent

In [5]:
# Load OSWorld taskset
taskset = await load_taskset("OSWorld-Verified")
print(f"Total tasks in OSWorld: {len(taskset)}")

# Select a test task
test = taskset[148]
print(f"Task prompt: {test.prompt}")

Total tasks in OSWorld: 367
Task prompt: Can you make my computer bring back the last tab I shut down?


In [6]:
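# Load SheetBench taskset (this overwrites `taskset` and `test` from the
# previous cell; the rest of the walkthrough runs this SheetBench task)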
taskset = await load_taskset("SheetBench-V2")
print(f"Total tasks in SheetBench: {len(taskset)}")

# Select a test task
test = taskset[0]
print(f"Task prompt: {test.prompt}")

Total tasks in SheetBench: 50
Task prompt: Given the Input data, determine the ticker with the greatest correlation between volume and next day price change.
- in ANSWER tab put the Ticker in A1 and the correlation in B1
  - use CORREL to determine correlation
- be sure to first sort the date by ticker z to a and then date ascending before calculating nextdaypricechange %
Correlation should be rounded to 2 decimal points


In [7]:
# Create environment (takes ~2.5 minutes to start)
env = await gym.make(test)
print("Environment ready!")

[INFO] 2025-08-08 19:08:17,078 | hud.environment | View the live trace at https://app.hud.so/trace/ca88c178-cf40-499b-8ad3-d5d60348d9fe


Environment ready!


In [8]:
await env.stream()  # returns an HTML snippet embedding the live view (VNC)

'\n    <div style="width: 960px; height: 540px; overflow: hidden;">\n        <div style="transform: scale(0.5); transform-origin: top left;">\n            <iframe src="https://live.anchorbrowser.io?sessionId=21376c89-e539-4f07-b23f-db4a3749d61a" width="1920" height="1080" style="border: 1px solid #ddd;">\n            </iframe>\n        </div>\n    </div>\n    '

## Test with any supported CUA model

The ComputerAgent integration can use Claude, OpenAI, UI-TARS, or composed models just like the original ComputerAgent:

In [13]:
import logging
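# Create the ComputerAgent. The OpenAI computer-use-preview model is active here;
# uncomment the Claude line (and comment out the OpenAI one) to use Claude instead.
# The variable is named `claude_agent` regardless of which model is selected.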
claude_agent = ComputerAgent(
    # model="anthropic/claude-3-5-sonnet-20241022",
    model="openai/computer-use-preview",
    # environment="linux",  # OSWorld typically uses Linux
    environment="browser", # SheetBench uses the browser
    trajectory_dir="trajectories",
    verbosity=logging.INFO,
)

print(f"Created agent: {claude_agent.name}")

Created agent: computeragent-computer-use-preview


In [14]:
# Initial observation
obs, _ = await env.reset()
print("Initial observation complete")

# Agent loop (at most 8 steps, using whichever model the agent was created with)
for i in range(8):
    print(f"========= Step {i + 1} ==========")
    
    try:
        action, done = await claude_agent.predict(obs)
        print(f"Agent's action: {action}")

        obs, reward, terminated, info = await env.step(action)

        if done or terminated:
            print(f"Task completed after {i + 1} steps")
            break
            
    except Exception as e:
        print(f"Error in step {i + 1}: {e}")
        break

Initial observation complete


2025-08-08 19:14:10,479 - agent.ComputerAgent - INFO - LLM processing started with 1 messages
2025-08-08 19:14:18,867 - agent.ComputerAgent - INFO - Computer: click({'button': 'left', 'x': 55, 'y': 149})


Agent's action: [ClickAction(type='click', reasoning='Sorting dataset for analysis preparation', logs={'conversation_length': 3}, point=Point(x=77, y=174), button='left', pattern=None, hold_keys=None)]


2025-08-08 19:14:24,566 - agent.ComputerAgent - INFO - LLM processing started with 4 messages
2025-08-08 19:14:30,430 - agent.ComputerAgent - INFO - Computer: keypress({'keys': ['CTRL', 'A']})


Agent's action: [PressAction(type='press', reasoning='Sorting dataset for analysis preparation', logs={'conversation_length': 5}, keys=['ctrl', 'a'])]


2025-08-08 19:14:36,137 - agent.ComputerAgent - INFO - LLM processing started with 6 messages
2025-08-08 19:14:42,483 - agent.ComputerAgent - INFO - Computer: click({'button': 'left', 'x': 73, 'y': 151})


Agent's action: [ClickAction(type='click', reasoning='Sorting dataset for analysis preparation', logs={'conversation_length': 7}, point=Point(x=102, y=176), button='left', pattern=None, hold_keys=None)]


2025-08-08 19:14:48,687 - agent.ComputerAgent - INFO - LLM processing started with 8 messages
2025-08-08 19:14:59,516 - agent.ComputerAgent - INFO - Computer: keypress({'keys': ['CTRL', 'A']})


Agent's action: [PressAction(type='press', reasoning='Sorting dataset for analysis preparation', logs={'conversation_length': 9}, keys=['ctrl', 'a'])]


2025-08-08 19:15:05,229 - agent.ComputerAgent - INFO - LLM processing started with 10 messages
2025-08-08 19:15:15,153 - agent.ComputerAgent - INFO - Computer: drag({'path': [{'x': 55, 'y': 147}, {'x': 319, 'y': 713}]})


Agent's action: [DragAction(type='drag', reasoning='Highlighting data for sorting preparation', logs={'conversation_length': 12}, path=[Point(x=77, y=172), Point(x=448, y=835)], pattern=None, hold_keys=None)]


2025-08-08 19:15:21,362 - agent.ComputerAgent - INFO - LLM processing started with 13 messages
2025-08-08 19:15:33,774 - agent.ComputerAgent - INFO - Computer: click({'button': 'left', 'x': 229, 'y': 41})


Agent's action: [ClickAction(type='click', reasoning='Opening sort options for data', logs={'conversation_length': 15}, point=Point(x=322, y=48), button='left', pattern=None, hold_keys=None)]


2025-08-08 19:15:39,973 - agent.ComputerAgent - INFO - LLM processing started with 16 messages
2025-08-08 19:15:52,928 - agent.ComputerAgent - INFO - Computer: click({'button': 'left', 'x': 430, 'y': 96})


Agent's action: [ClickAction(type='click', reasoning='Choosing "Sort range" for sorting', logs={'conversation_length': 18}, point=Point(x=604, y=112), button='left', pattern=None, hold_keys=None)]


2025-08-08 19:15:59,611 - agent.ComputerAgent - INFO - LLM processing started with 19 messages
2025-08-08 19:16:17,003 - agent.ComputerAgent - INFO - Computer: click({'button': 'left', 'x': 530, 'y': 172})


Agent's action: [ClickAction(type='click', reasoning='Accessing advanced sorting options now', logs={'conversation_length': 21}, point=Point(x=745, y=201), button='left', pattern=None, hold_keys=None)]
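
In the log above, the coordinates the model emits (the `Computer: click(...)` lines) differ from those in the resulting HUD action (the `Point(...)` values). The ratios are consistent with the model working on a 1024×768 view that is rescaled to a 1440×900 screen, with results truncated to integers. Those two resolutions are inferred from the numbers in this log and are not stated anywhere in the notebook, so treat the sketch below as a check on the arithmetic rather than a description of how the integration works; `scale_point` is a hypothetical helper.

```python
# Assumed resolutions, inferred from the logged coordinate pairs
# (not confirmed by the notebook or the library).
MODEL_SIZE = (1024, 768)
SCREEN_SIZE = (1440, 900)


def scale_point(x, y, src=MODEL_SIZE, dst=SCREEN_SIZE):
    """Rescale a point from `src` to `dst` resolution, truncating to integers."""
    return x * dst[0] // src[0], y * dst[1] // src[1]


# (model coordinates, HUD action coordinates) pairs copied from the log above.
logged_pairs = [
    ((55, 149), (77, 174)),
    ((73, 151), (102, 176)),
    ((55, 147), (77, 172)),
    ((319, 713), (448, 835)),
    ((229, 41), (322, 48)),
    ((430, 96), (604, 112)),
    ((530, 172), (745, 201)),
]

for model_xy, hud_xy in logged_pairs:
    print(model_xy, "->", scale_point(*model_xy), "logged:", hud_xy)
```

Every logged pair matches under this assumption, which suggests the mismatch is a resolution rescale rather than the agent clicking somewhere other than where the model asked.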


## Evaluate Results

In [15]:
# Evaluate environment state
result = await env.evaluate()
print("=== Final Evaluation ===")
pprint(result)

=== Final Evaluation ===
{'error': None,
 'gold_file_url': 'https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/615426c8-9df7-4ffa-92e9-200134a84da9/gold_solution_2.xlsx?',
 'logs': 'INFO: Starting evaluation with evaluator: sheets_cell_values\n'
         "INFO: Evaluator args: [{'A1': 'ABC', 'B1': '-0.08'}]\n"
         'INFO: Partial rewarding: False\n'
         'INFO: Starting sheets_cell_values evaluation for environment: '
         'af7a34a0-43b0-44d2-82d0-2b66ed16f1ea\n'
         "INFO: Raw args received: [{'A1': 'ABC', 'B1': '-0.08'}] (type: "
         "<class 'list'>)\n"
         'INFO: Partial rewarding enabled: False\n'
         'INFO: === Google Sheets Cell Value Verification ===\n'
         'INFO: Current page URL: '
         'https://docs.google.com/spreadsheets/d/1h-Ec3rW9sAME2sTn8qxIvFxO6qXtdURPacEFL5DJnqw/edit?gid=700326861#gid=700326861\n'
         'INFO: ✅ Confirmed on Google Sheets page\n'
         'INFO: Processing args parameter...\n'
     

In [16]:
# Clean up
await env.close()
print("Environment closed")

Environment closed


## Run OSWorld-Verified in parallel

In [17]:
from agent.integrations.hud import run_job
from hud import load_taskset
from hud.taskset import TaskSet
import logging

# Load taskset
taskset = await load_taskset("OSWorld-Verified")
taskset = TaskSet(tasks=taskset[:10])  # limit to 10 tasks instead of all 367

# Run benchmark job
job = await run_job(
    model="openai/computer-use-preview",
    task_or_taskset=taskset,
    job_name="test-computeragent-job",
    max_concurrent_tasks=5,
    # add any extra ComputerAgent kwargs:
    verbosity=logging.INFO,  # Enable logging
    trajectory_dir="trajectories",  # Save trajectories locally
)

# Print the job analytics here, or view the results at app.hud.so
print(await job.get_analytics())
print(f"View results at: https://app.hud.so/jobs/{job.id}")

  0%|----------------------------------------| 0/200 [1:24<??:??, ?? steps/min]2025-08-08 19:24:29,970 - agent.ComputerAgent - INFO - LLM processing started with 1 messages
  0%|----------------------------------------| 0/200 [1:25<??:??, ?? steps/min]2025-08-08 19:24:30,647 - agent.ComputerAgent - INFO - LLM processing started with 1 messages
2025-08-08 19:24:31,329 - agent.ComputerAgent - INFO - LLM processing started with 1 messages
  0%|----------------------------------------| 0/200 [1:26<??:??, ?? steps/min]2025-08-08 19:24:31,958 - agent.ComputerAgent - INFO - LLM processing started with 1 messages
  0%|----------------------------------------| 0/200 [1:28<??:??, ?? steps/min]2025-08-08 19:24:35,310 - agent.ComputerAgent - INFO - Computer: wait({})
2025-08-08 19:24:36,641 - agent.ComputerAgent - INFO - Computer: wait({})
2025-08-08 19:24:37,969 - agent.ComputerAgent - INFO - Computer: wait({})
2025-08-08 19:24:39,338 - agent.ComputerAgent - INFO - Computer: wait({})
  2%|-------

No screenshot found, taking screenshot


2025-08-08 19:30:35,460 - agent.ComputerAgent - INFO - LLM processing started with 45 messages
 50%|████████████████████--------------------| 100/200 [7:32<7:32, 13.2 steps/min]2025-08-08 19:30:38,650 - agent.ComputerAgent - INFO - LLM processing started with 44 messages
 50%|████████████████████--------------------| 100/200 [7:35<7:35, 13.2 steps/min]2025-08-08 19:30:40,900 - agent.ComputerAgent - INFO - LLM processing started with 48 messages
 50%|████████████████████--------------------| 100/200 [7:37<7:37, 13.1 steps/min]2025-08-08 19:30:43,737 - agent.ComputerAgent - INFO - Computer: click({'button': 'left', 'x': 79, 'y': 637})
 50%|████████████████████--------------------| 101/200 [7:40<7:30, 13.2 steps/min]2025-08-08 19:30:46,567 - agent.ComputerAgent - INFO - Agent: I am unable to complete the task. However, I see that the outside task has been completed meaning the "Focus Active Editor Group" key binding has been set. Task completed
2025-08-08 19:30:47,191 - agent.ComputerAgen

{'task_count': 11, 'avg_reward': 0.2, 'success_rate': 18.181818181818183}
View results at: https://app.hud.so/jobs/d80a9a78-0e06-4b49-ba3e-cb5c8db4ba7c
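
`max_concurrent_tasks=5` caps how many tasks run at once. How `run_job` enforces that cap is not shown in this notebook; a common way to bound concurrency in asyncio is a semaphore, sketched below with a stand-in coroutine. `run_with_limit`, `demo`, and `fake_task` are hypothetical names, and nothing here calls HUD.

```python
import asyncio


async def run_with_limit(coros, limit):
    """Run coroutines concurrently, with at most `limit` in flight at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(c) for c in coros))


async def demo(n_tasks=10, limit=5):
    in_flight = 0
    peak = 0

    async def fake_task(i):
        # Stand-in for one benchmark task; records peak concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i

    results = await run_with_limit([fake_task(i) for i in range(n_tasks)], limit)
    return results, peak


results, peak = asyncio.run(demo())
print(f"Completed {len(results)} tasks; peak concurrency was {peak}")
```

Inside a notebook, where an event loop is already running, use `results, peak = await demo()` instead of `asyncio.run(demo())`.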
