In [1]:
# Install first with: uv pip install -e ".[dev]"
from hud import gym, load_taskset
from pprint import pprint

In [2]:
taskset = await load_taskset("OSWorld-Ubuntu")
print(f"Total tasks in OSWorld: {len(taskset)}")

test = taskset[144]
print(f"Task prompt: {test.prompt}")

Total tasks in OSWorld: 369
Task prompt: Can you make my computer bring back the last tab I shut down?

Task 0:
Prompt: Please help me change the background of VS Code to the photo in Downloads.

Task 1:
Prompt: Could you help me remove the dock on the left side of the screen?

Task 2:
Prompt: Please help me create a new python file named "test.py" via VS Code and save it at "/home/user/Desktop".

Task 3:
Prompt: Find discussions of community and open one with most replies.

Task 4:
Prompt: So, I've been dabbling with coding a Snake game in Python, and I finally got it up and running. It's pretty cool, but it's not without its quirks. The biggest issue I'm facing right now is that the snake can't seem to eat the food, no matter what. Could you help me tweak the code so the snake can actually eat the food? Thanks a bunch!

Task 5:
Prompt: I am looking for an website address I accessed a month ago, but Youtube websites which take almost all of my browsing history are interrupting my sear

In [3]:
# The Ubuntu environment takes around 2.5 minutes to start; several environments can be started in parallel
env = await gym.make(test)

2025-05-27 09:57:20,806 - hud.gym - INFO - Creating private environment
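
Since startup takes minutes, starting several environments concurrently can save wall-clock time. The following is a minimal sketch of one way to do that with `asyncio.gather` and a semaphore, not the library's own mechanism. The helper `make_envs` and its `max_concurrent` parameter are hypothetical names introduced here; `make_env` is any coroutine function that takes a task and returns an environment, such as `gym.make` from the cell above.

```python
import asyncio


async def make_envs(make_env, tasks, max_concurrent=4):
    """Start one environment per task, at most `max_concurrent` at a time.

    `make_env` is assumed to be a coroutine function (e.g. gym.make).
    Results are returned in the same order as `tasks`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def _make_one(task):
        # The semaphore caps how many startups are in flight at once.
        async with semaphore:
            return await make_env(task)

    return await asyncio.gather(*(_make_one(task) for task in tasks))
```

In the notebook this might be called as `envs = await make_envs(gym.make, [taskset[i] for i in range(4)])`; remember to close every environment afterwards.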


In [4]:
from hud.agent import ClaudeAgent

# Define a new agent each time to reset the message history
# Make sure to define the environment variable ANTHROPIC_API_KEY
agent = ClaudeAgent()

# Initial observation
obs, _ = await env.reset()
print("Initial observation complete")

# Agent loop
for i in range(8):
    print(f"========= Step {i + 1} =========")
    action, done = await agent.predict(obs)
    print(f"Agent's action: {action}")

    obs, reward, terminated, info = await env.step(action)

    if done or terminated:
        break

Initial observation complete
Agent's action: [PressAction(type='press', keys=['ctrl', 'shift', 't'])]
Agent's action: [ResponseAction(type='response', text="Great! I've successfully reopened your last closed tab. As you can see, the TripAdvisor tab has been restored and is now displayed alongside your other open tabs (Lonely Planet and Airbnb).\n\nThe keyboard shortcut I used was Ctrl+Shift+T, which is the standard shortcut in Chrome for reopening the most recently closed tab. You can use this shortcut multiple times in succession to reopen multiple closed tabs in the reverse order they were closed.")]
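
The loop in the cell above can be wrapped in a reusable coroutine so that each task gets the same reset, predict, step sequence. This is a sketch under the assumption that `env` and `agent` expose the same awaitable `reset`, `step`, and `predict` methods with the return shapes used above; `run_episode` is a hypothetical helper name, not part of `hud`.

```python
async def run_episode(env, agent, max_steps=8):
    """Run one agent/environment loop and return the actions taken.

    Assumes env.reset() -> (obs, info), agent.predict(obs) -> (action, done),
    and env.step(action) -> (obs, reward, terminated, info), as in the cell above.
    """
    obs, _ = await env.reset()
    actions = []
    for _ in range(max_steps):
        action, done = await agent.predict(obs)
        actions.append(action)
        # The final action is still sent to the environment before stopping.
        obs, reward, terminated, info = await env.step(action)
        if done or terminated:
            break
    return actions
```

Usage in the notebook would look like `actions = await run_episode(env, ClaudeAgent())`, creating a fresh agent per episode so the message history starts empty.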


In [5]:
# Evaluate environment state
result = await env.evaluate()
pprint(result)

{'error': None,
 'logs': 'INFO: Starting evaluation...\n'
         'INFO: Evaluating task 08d9a8b1-7b7a-4ba7-a226-4e266e13f6df...\n'
         'INFO: Evaluator configuration:\n'
         'INFO:   Metric function(s): is_expected_tabs\n'
         'INFO:   Metric conjunction: and\n'
         'INFO:   Result getter: get_open_tabs_info\n'
         'INFO:   Expected getter: get_rule\n'
         'INFO:   Metric options: {}\n'
         'INFO: Setting up post-config for evaluation...\n'
         'INFO: Evaluating single metric: is_expected_tabs\n'
         "INFO: Getting result state using config: {'type': 'open_tabs_info'}\n"
         "INFO: Getting expected state using config: {'type': 'rule', 'rules': "
         "{'type': 'url', 'urls': ['https://www.lonelyplanet.com', "
         "'https://www.airbnb.com', 'https://www.tripadvisor.com']}}\n"
         'INFO: Comparing result state with expected state\n'
         'INFO: Final evaluation result: 1\n'
         'INFO: Completed evaluation.\n'
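
The printed result is cut off above, so it is unclear whether the score is also exposed under its own key. As a fallback, the score can be read from the `logs` string. The sketch below assumes the log keeps the `Final evaluation result: <number>` line format shown above, which may change between versions; `final_result_from_logs` is a hypothetical helper.

```python
import re


def final_result_from_logs(logs):
    """Return the number after 'Final evaluation result:' or None if absent.

    Relies on the log line format seen in this notebook's output (an assumption).
    """
    match = re.search(r"Final evaluation result:\s*(-?\d+(?:\.\d+)?)", logs)
    return float(match.group(1)) if match else None
```

For the run above, `final_result_from_logs(result["logs"])` would parse the `1` from the log line.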
    

In [6]:
# Make sure to close the environment to avoid being charged for idle time
await env.close()

Parallel runs for the whole dataset

In [26]:

from hud import run_job
taskset = await load_taskset("OSWorld-Ubuntu")
job = await run_job(
    ClaudeAgent, taskset, "osworld-test",
    max_steps_per_task=20, max_concurrent_tasks=20, auto_reply_question=True,
)

In [None]:
await job.get_analytics()