Skip to content

feat(cua-bench): RL integration of tinker into cua-bench#1200

Merged
ddupont808 merged 18 commits into
trycua:mainfrom
dongyuanjushi:cuabench-tinker
Apr 6, 2026
Merged

feat(cua-bench): RL integration of tinker into cua-bench#1200
ddupont808 merged 18 commits into
trycua:mainfrom
dongyuanjushi:cuabench-tinker

Conversation

@dongyuanjushi
Copy link
Copy Markdown
Contributor

@dongyuanjushi dongyuanjushi commented Mar 20, 2026

RL integration of tinker into cua-bench

  • Customize an agent using open-sourced model, e.g., Qwen3.5-4B, Qwen3.5-9B
image
  • Trajectory collection of Cb traces to push forward to Tinker client
  • RL workflow using Tinker client
  • Sync model updates from Tinker and interface for relaunching agent

Summary by CodeRabbit

New Features

  • Added support for three new agent models: Qwen3.5, Qwen3-VL, and OpenCUA, expanding the available agent options.
  • Introduced off-policy RL training loop with GRPO algorithm support, enabling reinforcement learning fine-tuning workflows.
  • Added checkpoint management and automated rollout orchestration for training pipelines.

Bug Fixes

  • Improved headless mode detection to automatically check for display availability.
  • Enhanced path resolution for task dependencies with validation.

Chores

  • Made Playwright a core dependency.
  • Added OpenTelemetry observability support.

@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented Mar 20, 2026

@dongyuanjushi is attempting to deploy a commit to the Cua Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0ae94481-e130-4f1e-bf2f-91d5f7f05562

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a complete Tinker off-policy RL training infrastructure for OpenCUA, including three new agent implementations (Qwen35, Qwen3VL, OpenCUA), an RL loop with GRPO training, rollout orchestration, checkpoint management, and trace loading. Environment setup, dependencies, and agent loop configurations are updated to support the new training pipeline.

Changes

Cohort / File(s) Summary
New Agent Implementations
cua_bench/agents/qwen35_agent.py, cua_bench/agents/qwen3vl_agent.py, cua_bench/agents/opencua_agent.py
Added three new registered agent classes with perform_task implementations that initialize from kwargs/env variables, construct computer tools from a DesktopSession, execute ComputerAgent runs, track token usage and steps, and classify transient failures via _is_retryable_api_error helper.
Agent Registry Updates
cua_bench/agents/__init__.py
Updated __all__ and added import of Qwen35Agent to enable package-level export and self-registration with the agent registry.
Tinker RL Training Loop
cua_bench/trainer/off_policy/tinker/rl_loop.py
New module implementing the core off-policy training loop: manages epochs, checkpointing, rollout subprocess execution, episode loading/validation, reward metrics, and GRPO batch construction with training step invocation.
GRPO Trainer
cua_bench/trainer/off_policy/tinker/grpo.py
New module for group-relative policy optimization: defines GRPOConfig, groups episodes by task, computes trajectory advantages, renders trajectory prompts, tokenizes actions, samples reference logprobs, builds Tinker Datum objects, and performs training steps via AdamW optimization.
Checkpoint Management
cua_bench/trainer/off_policy/tinker/checkpoints.py
New module providing checkpoint save/load: CheckpointInfo dataclass, save_checkpoint() for persisting training state and manifest metadata, and get_last_checkpoint() for resuming from the latest saved state.
Rollout Orchestration
cua_bench/trainer/off_policy/tinker/rollout.py
New module for subprocess-based rollout execution: orchestrates cb run dataset CLI invocations, polls for session completion, validates output directories, and aggregates results across multiple rollout runs.
Trace Loading
cua_bench/trainer/off_policy/tinker/traces.py
New module for reading HuggingFace trace datasets: defines Step and Episode dataclasses, extracts actions and screenshots from trace events, and provides load_episode(), load_run(), and load_runs() functions for batch loading.
Tinker Package Init
cua_bench/trainer/off_policy/tinker/__init__.py
New package entry point re-exporting public symbols from RL loop, GRPO, and checkpoints modules via __all__.
Agent Loop: Qwen35
agent/agent/loops/qwen35.py
New module implementing Qwen3-5 agent loop: uses LiteLLM chat completions with custom <tool_call> parsing, Nous system prompt generation, image smart-resizing, and coordinate rescaling; includes predict_click helper.
Agent Loop: Qwen3VL
agent/agent/loops/qwen3vl.py
New module registering Qwen3VlConfig subclass of GenericVlmConfig to prioritize Qwen model patterns.
Agent Loop: OpenCUA Updates
agent/agent/loops/opencua.py
Refactored OpenCUAConfig.predict_step from delegation to bespoke LiteLLM flow with function schema building, image smart-resizing, <tool_call> parsing, coordinate rescaling, and updated capability reporting.
Agent Loop Registry
agent/agent/loops/__init__.py
Updated imports and __all__ to register qwen35 and qwen3vl loop modules.
Agent Module Scaffolding
agent/agent/agent.py
Added commented-out debug print blocks and conditional has_image derivation within the computer-action handling loop; minor whitespace change.
CLI & Container Infrastructure
cua_bench/cli/commands/interact.py, cua_bench/cli/commands/run.py, cua_bench/runner/task_runner.py
Interactive mode now detects X11/Wayland display availability before setting headless mode; detached run command and container runner now resolve dev_path to absolute paths with validation (existence and installability checks); added OPENAI_ENDPOINT to host-provided API key environment variables.
Docker & Dependencies
Dockerfile, pyproject.toml
Dockerfile now installs libsndfile1 system library and chains OpenTelemetry dependencies with cua-agent extras; added playwright>=1.55.0 as core pyproject.toml dependency.
Documentation
plan.md
Added planning document describing end-to-end Tinker RL integration architecture, existing trace schema, proposed modules, checkpoint/resume workflows, and open questions.

Sequence Diagram(s)

sequenceDiagram
    participant RLLoop as RL Loop
    participant Rollout as Rollout Orchestrator
    participant Subprocess as cb CLI
    participant Sessions as Session Manager
    participant TraceLoader as Trace Loader
    participant GRPO as GRPO Batch Builder
    participant TrainingClient as Tinker Training Client

    RLLoop->>RLLoop: Initialize ServiceClient & resume from checkpoint
    loop For each epoch
        RLLoop->>RLLoop: Optionally save state checkpoint
        RLLoop->>Rollout: run_rollouts(tasks, agent, model, steps, num_rollouts)
        Rollout->>Subprocess: Invoke cb run dataset N times
        Subprocess->>Sessions: Create training sessions
        Rollout->>Sessions: Poll until all sessions terminal
        Rollout-->>RLLoop: Return [(run_id, run_dir), ...]
        
        RLLoop->>TraceLoader: load_run(run_dir) for each output
        TraceLoader-->>RLLoop: [Episode, ...]
        
        RLLoop->>RLLoop: Validate episode counts per task
        RLLoop->>GRPO: build_batch(episodes, sampling_client, tokenizer, renderer)
        GRPO->>GRPO: Group episodes by task, compute advantages
        GRPO->>GRPO: Render prompts, tokenize actions, sample ref logprobs
        GRPO->>GRPO: Build Datum objects
        GRPO-->>RLLoop: [Datum, ...]
        
        RLLoop->>TrainingClient: forward_backward(datums, loss_fn="importance_sampling")
        RLLoop->>TrainingClient: optim_step(AdamW params)
        TrainingClient-->>RLLoop: metrics dict
        RLLoop->>RLLoop: Log training metrics
    end
    RLLoop->>TrainingClient: save_checkpoint(name="final")
    RLLoop->>RLLoop: Training complete
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Suggested labels

release:pypi/agent, release:pypi/bench

Suggested reviewers

  • ddupont808
  • r33drichards

Poem

🐰 ✨ A rabbit hops through training loops so fine,
With Tinker steps and rollouts in a line,
New agents born—Qwen3 and OpenCUA's gleam,
Checkpoints saved, adventures in the stream,
Off-policy rewards paint the RL dream! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.65% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat(cua-bench): RL integration of tinker into cua-bench' accurately reflects the main objective and scope of the changeset, which adds comprehensive RL training infrastructure via Tinker integration into the cua-bench system.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dongyuanjushi dongyuanjushi changed the title RL integration of tinker into cua-bench feat(cua-bench): RL integration of tinker into cua-bench Mar 20, 2026
@dongyuanjushi dongyuanjushi marked this pull request as ready for review April 2, 2026 03:04
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🧹 Nitpick comments (13)
libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py (1)

146-148: Prefer explicit runtime handling over assert in trace ingestion.

Using assert here can cause avoidable episode drops; a defensive coercion/fallback keeps training robust to imperfect rows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py` around lines
146 - 148, The current use of assert on trajectory_id can drop episodes; replace
the assert with explicit runtime handling in the trace ingestion code where
trajectory_id is computed from ds (the assignment using ds[0]["trajectory_id"]).
If the value is None or not a str, coerce it to a safe default (e.g., empty
string or str(value)) and emit a warning/log (using the module logger) including
the offending value and row context, rather than asserting, so ingestion
continues robustly.
libs/python/agent/agent/agent.py (1)

1025-1041: Please remove leftover debug scaffolding before merge.

The commented debug blocks and no-op inspection branch add noise to a hot path and make the action loop harder to maintain.

Also applies to: 1053-1055, 1061-1069, 1074-1084

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/python/agent/agent/agent.py` around lines 1025 - 1041, Remove the
leftover commented debug scaffolding and any no-op inspection branches related
to LLM response printing: delete the commented print blocks that reference
result.get("usage"), the loop over result.get("output") (including checks for
"computer_call", "function_call", "message"), the print(f"\n[DEBUG][LLM
Response] output={result.get('output')}") lines, and any adjacent no-op/disabled
inspection branches in the action loop that only read the variable
result/output; ensure you remove all similar debug prints elsewhere in the same
method that reference result or output while leaving the actual action-loop
behavior and return values untouched.
libs/cua-bench/plan.md (1)

36-45: Add language specifiers to fenced code blocks for markdown lint compliance.

Several ASCII diagram and file structure code blocks lack language specifiers, triggering markdown lint warnings. Using text or plaintext would satisfy MD040.

📝 Example fix for the trace storage paths block
-```
+```text
 ~/.local/share/cua-bench/runs/<run_id>/
 ├── task_0_trace/          # HF Dataset (arrow files)

Apply similarly to blocks at lines 73, 112, 214, and 452.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/plan.md` around lines 36 - 45, Update the fenced ASCII/code
blocks in plan.md to include a language specifier (e.g., ```text or
```plaintext) so they satisfy MD040; specifically change the block that begins
with "~/.local/share/cua-bench/runs/<run_id>/" and the other similar
ASCII/file-structure blocks in the document to start with ```text (or
```plaintext) instead of bare ```, keeping the exact ASCII content intact.
libs/cua-bench/cua_bench/agents/opencua_agent.py (2)

12-33: Consider extracting shared retry error detection to a common module.

This _is_retryable_api_error function is duplicated in qwen3vl_agent.py. Both agents could import from a shared location.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/opencua_agent.py` around lines 12 - 33,
Extract the duplicate retry-detection logic into a shared helper and import it
from both agents: move the implementation of _is_retryable_api_error into a
common module (e.g., a new function named is_retryable_api_error in a shared
util module), remove the duplicate definitions from opencua_agent.py and
qwen3vl_agent.py, and update both files to import and call
is_retryable_api_error instead of their local _is_retryable_api_error; ensure
the shared function preserves the same checks
(asyncio.TimeoutError/TimeoutError, litellm exception types, and message
substring checks) and update any references to the original function name
accordingly.

200-200: Unused variable last_exc.

The variable last_exc is declared but never assigned or used. It appears to be leftover from a retry implementation that was removed.

🧹 Remove unused variable
-        last_exc: BaseException | None = None
-        
         try:
             step = 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/opencua_agent.py` at line 200, Remove the
unused variable declaration last_exc from opencua_agent.py: locate the
declaration "last_exc: BaseException | None = None" (within the surrounding
function or class where it appears) and delete it, ensuring no other code
references last_exc remain; run tests/lint to confirm no-unused-variable
warnings are resolved.
libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py (1)

77-82: Silent manifest load failure could mask underlying issues.

When the manifest exists but is corrupt, silently resetting to an empty list means previous checkpoint references are lost. Consider logging a warning to aid debugging.

📝 Add warning log
     if manifest_path.exists():
         try:
             manifest = json.loads(manifest_path.read_text())
         except Exception:
+            print(f"[checkpoint] Warning: could not parse {manifest_path}, starting fresh manifest")
             manifest = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py` around
lines 77 - 82, The manifest load currently swallows all exceptions and silently
replaces manifest with an empty list; update the block around manifest,
manifest_path.exists(), json.loads(manifest_path.read_text()) to catch
exceptions and emit a warning (e.g., logger.warning or module logger) that
includes the manifest_path and the caught exception so corrupt/invalid manifests
are visible in logs while still falling back to an empty list.
libs/cua-bench/cua_bench/agents/qwen3vl_agent.py (2)

12-33: Duplicate _is_retryable_api_error function.

This function is identical to the one in opencua_agent.py. Consider extracting it to a shared utility module (e.g., base.py or a new utils.py) to avoid duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 12 - 33, The
_is_retryable_api_error function is duplicated in qwen3vl_agent.py and
opencua_agent.py; extract this function into a shared utility module (e.g.,
create utils.py or base.py with _is_retryable_api_error), remove the duplicate
definitions from both agents, and update qwen3vl_agent.py and opencua_agent.py
to import _is_retryable_api_error from the new shared module; ensure the
imported name and behavior remain identical and run tests/import checks to
confirm no symbol changes.

56-135: Significant duplication of _create_custom_computer with opencua_agent.py.

This method is nearly identical to the one in opencua_agent.py, differing only in hardcoded dimensions. Consider extracting to a shared base class or utility function, parameterizing the dimensions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 56 - 135, The
_create_custom_computer method is duplicated from opencua_agent.py (only
dimensions differ); extract the shared logic into a single reusable function or
base-class method (e.g., build_custom_computer(session, width, height,
environment) or BaseAgent._create_custom_computer_impl) and have
qwen3vl_agent._create_custom_computer and opencua_agent._create_custom_computer
call that helper with the appropriate parameters (use the differing
get_dimensions value as the width/height argument or override get_environment if
needed); update references to the unique symbols in the diff (screenshot, click,
double_click, type_text, keypress, move, scroll, drag, wait, get_dimensions,
get_environment) so both agents delegate to the shared implementation and remove
the duplicated code.
libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py (1)

79-88: Blocking poll loop may hang indefinitely on stuck sessions.

The polling loop waits for all sessions to reach terminal state, but if a session gets stuck in a non-terminal state (e.g., due to a container issue), it will block until the full timeout expires. Consider adding a progress indicator or intermediate status logging.

📝 Suggested improvement: add progress logging
     deadline = time.monotonic() + timeout
+    last_log = 0.0
     while time.monotonic() < deadline:
         if _all_sessions_terminal(run_id):
             break
+        elapsed = time.monotonic() - (deadline - timeout)
+        if elapsed - last_log >= 60:  # Log every minute
+            print(f"[rollout] Still waiting for run {run_id} ({int(elapsed)}s elapsed)...")
+            last_log = elapsed
         time.sleep(poll_interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py` around lines
79 - 88, The poll loop that waits using _all_sessions_terminal(run_id) can sit
idle until timeout if sessions are stuck; add periodic progress logging inside
the loop (every poll_interval) that logs run_id, elapsed time (timeout -
remaining), and a short status such as the count or IDs of non-terminal sessions
(implement a helper like _nonterminal_sessions(run_id) if missing) so operators
see progress and which sessions are stuck; keep the existing TimeoutError but
ensure each iteration emits a concise log entry with run_id, poll_interval, and
current non-terminal info to aid debugging.
libs/python/agent/agent/loops/opencua.py (1)

305-335: Inconsistent action format conversion in Priority 2 path.

In Priority 3 (lines 356-360), convert_qwen_tool_args_to_computer_action is called to convert Qwen-style arguments to the internal action schema. However, in Priority 2 (lines 305-335), this conversion is missing. If the XML-parsed tool call contains Qwen-format arguments (e.g., {"action": "left_click", "coordinate": [x, y]}), it won't be converted to the expected format.

♻️ Suggested fix to add conversion in Priority 2
             if coord and isinstance(coord, (list, tuple)) and len(coord) >= 2:
                 rx, ry = _rescale(int(round(float(coord[0]))), int(round(float(coord[1]))))
                 raw_args = {**raw_args, "coordinate": [rx, ry]}

+            # Convert Qwen format to Computer Calls format if this is a computer tool
+            if fn_name == "computer":
+                converted_action = convert_qwen_tool_args_to_computer_action(raw_args)
+                if converted_action:
+                    raw_args = converted_action
+
             fake_cm = {
                 "role": "assistant",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/python/agent/agent/loops/opencua.py` around lines 305 - 335, Priority 2
path parses XML tool calls into tool_call/raw_args but never normalizes
Qwen-style action fields; update the block that builds raw_args after calling
_parse_tool_call_from_text to pass raw_args through
convert_qwen_tool_args_to_computer_action (e.g., raw_args =
convert_qwen_tool_args_to_computer_action(raw_args)) before rescaling
coordinates and JSON-encoding, so fake_cm created for
convert_completion_messages_to_responses_items uses the normalized internal
action schema.
libs/cua-bench/cua_bench/agents/qwen35_agent.py (1)

227-233: DONE detection only checks first content item.

The loop only checks content[0].get("text", "") for "DONE". If the completion indicator appears in subsequent content items, it won't be detected.

♻️ Check all content items for DONE
                 for item in result["output"]:
                     if item["type"] == "message":
                         content = item.get("content", [])
-                        if content and "DONE" in content[0].get("text", ""):
+                        for c in content:
+                            if isinstance(c, dict) and "DONE" in c.get("text", ""):
+                                print(f"\n[Task completed] Agent indicated completion at step {step}")
+                                task_completed = True
+                                break
+                        if task_completed:
-                            print(f"\n[Task completed] Agent indicated completion at step {step}")
-                            task_completed = True
                             break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/qwen35_agent.py` around lines 227 - 233, The
DONE detection currently only inspects the first content item (content[0]) in
the loop over result["output"], so move to checking all entries in the content
list: for each message-type item (variable item) iterate or use a predicate
(e.g., any(...)) over content entries to see if any entry's .get("text","")
contains "DONE", and if so set task_completed = True and break; update the logic
around result["output"], item, content, step, and task_completed in
qwen35_agent.py accordingly.
libs/python/agent/agent/loops/qwen35.py (1)

502-510: Consider removing commented-out code blocks.

Multiple commented-out sections for thinking_text extraction exist (lines 502-510, 550-556). If these are intentionally disabled alternatives, consider removing them or adding a comment explaining why they're preserved.

Also applies to: 550-556

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/python/agent/agent/loops/qwen35.py` around lines 502 - 510, Remove the
dead/commented-out thinking_text extraction block (the regex that strips
<tool_call> and the output_items.append that creates a "message" with
"output_text") or, if it's being intentionally kept as an alternative, leave a
short explanatory comment above it stating why it is preserved (e.g., "kept as
alternative output handling for tool-invoked thoughts; disabled because X").
Ensure references to the variables/functions are clear: the commented regex
using r"<tool_call>[\s\S]*?</tool_call>", the thinking_text variable, and the
output_items.append that emits a
{"type":"message","role":"assistant","content":[{"type":"output_text","text":
thinking_text}]} so future readers understand its purpose or that it was
intentionally removed.
libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py (1)

212-234: Dead code: skipped_zero_advantage counter is never incremented.

The counter is initialized at line 212 and printed at lines 233-234, but it's never incremented. The zero-advantage skip at lines 185-186 doesn't update this counter.

♻️ Either increment the counter or remove the dead code
+    skipped_zero_advantage = 0
+
     for task_desc, task_episodes in task_groups.items():
         for ep in task_episodes:
             advantage = _trajectory_advantage(ep, task_episodes)

             # Skip if advantage is exactly zero (all same reward in group)
             if advantage == 0.0:
+                skipped_zero_advantage += 1
                 continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py` around lines 212
- 234, The skipped_zero_advantage counter is never updated; either increment it
whenever a datum is intentionally dropped (e.g., when _build_datum returns None
or when advantage == 0 in the pending loop) or remove the counter and the
trailing print entirely. Locate the for-loop iterating over pending and update
the branch that currently ignores zero-advantage samples to increment
skipped_zero_advantage before continuing, or delete the skipped_zero_advantage
variable and the final print statement if you prefer not to log skipped datums.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@libs/cua-bench/cua_bench/agents/qwen35_agent.py`:
- Around line 238-243: The current logic treats both successful completion and
hitting the step limit as FailureMode.NONE, making them indistinguishable;
update the control flow in the qwen35_agent logic that sets failure_mode (using
task_completed, step, and self.max_steps) so that: if task_completed ->
FailureMode.NONE, elif step >= self.max_steps -> FailureMode.MAX_STEPS_EXCEEDED,
else -> FailureMode.NONE (restore the commented MAX_STEPS_EXCEEDED branch using
FailureMode.MAX_STEPS_EXCEEDED to allow callers to differentiate max-step
terminations from success).

In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py`:
- Around line 117-118: The get_dimensions function in qwen3vl_agent.py currently
returns a hardcoded (1280, 800) which may not match the real session
(opencua_agent.py uses 1024x768); replace the hardcoded tuple in async def
get_dimensions() with logic that queries the actual session/window size (e.g.,
read from the agent/session object or a session API like
get_window_size/get_viewport) and return that value, falling back to a
configurable default if the session query fails; update callers to accept the
dynamic result if necessary so coordinate calculations use the real session
dimensions.
- Around line 261-272: The return path in qwen3vl_agent.py always sets
failure_mode to FailureMode.NONE, ignoring when the agent hit the step limit;
restore the check used in opencua_agent.py by adding an elif branch that sets
failure_mode = FailureMode.MAX_STEPS_EXCEEDED when step >= self.max_steps (keep
the existing task_completed check first), then return the AgentResult as before
(references: FailureMode, step, self.max_steps, task_completed, AgentResult,
total_usage).

In `@libs/cua-bench/cua_bench/cli/commands/interact.py`:
- Around line 172-179: The headless detection incorrectly treats missing
DISPLAY/WAYLAND_DISPLAY as no-GUI on macOS; update the logic around env.headless
so macOS (platform.system() == "Darwin" or sys.platform.startswith("darwin")) is
assumed to have a display and only fall back to checking DISPLAY/WAYLAND_DISPLAY
on non-macOS Unix platforms. Import the appropriate platform/sys module, compute
has_display = True for macOS, otherwise use the existing environment variable
checks, and set env.headless = not has_display (leaving the existing user-facing
message behavior intact).

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py`:
- Around line 118-122: The code constructs a CheckpointInfo from the last
manifest entry but accesses last["state_path"] directly which can raise KeyError
if that entry is malformed; update the logic in the function that reads manifest
(use the manifest and last variables and the CheckpointInfo construction) to
safely fetch state_path via last.get("state_path") and validate it (e.g., if not
present, either scan backwards for the most recent entry with a state_path or
raise a clear ValueError/RuntimeError that includes the manifest entry
contents), then pass the validated path and epoch into CheckpointInfo; ensure
the error message names the manifest/last entry so debugging is straightforward.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py`:
- Line 174: Remove the stray debugger call `breakpoint()` left in the GRPO
training loop: either delete the `breakpoint()` invocation outright or guard it
behind a development-only flag (e.g., an environment/config check like DEBUG or
a `debug` argument) so production runs are not interrupted; look for the
`breakpoint()` occurrence in grpo.py (inside the GRPO training/step function)
and update/remove it accordingly.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rl_loop.py`:
- Around line 99-101: The call to
AutoTokenizer.from_pretrained(config.base_model, trust_remote_code=True) enables
remote code execution; make this behavior configurable and documented: add a
boolean config flag (e.g., config.allow_trust_remote_code or
tokenizer_trust_remote_code) and use it instead of hardcoding True when calling
AutoTokenizer.from_pretrained, document in README that certain models require
trust_remote_code and list recommended trusted sources, and ensure environments
with strict security can default the flag to False so the code path is safe
unless explicitly enabled for trusted models like Qwen.
- Around line 80-96: The code ignores config.resume_from; update the resume
logic in run() to honor config.resume_from by checking if config.resume_from is
set and, if so, load that checkpoint (e.g., call
checkpoints.get_checkpoint(config.resume_from) or pass config.resume_from
directly) and create the training client via
service_client.create_training_client_from_state_with_optimizer(resume_path) and
set start_epoch from the loaded resume info; otherwise fall back to
checkpoints.get_last_checkpoint(config.log_path) and to
create_lora_training_client as currently written. Ensure you reference
config.resume_from, checkpoints.get_checkpoint or
checkpoints.get_last_checkpoint,
service_client.create_training_client_from_state_with_optimizer, and start_epoch
when making the change.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py`:
- Around line 56-62: The hardcoded "--with", "libs/python/agent/agent" entry in
the cmd list inside rollout.py can be missing depending on CWD; make this path
configurable by adding a parameter or config lookup (e.g., rollout function
argument like agent_lib_path or reading an env var such as AGENT_LIB_PATH) and
use pathlib.Path.resolve() to produce an absolute path; update the code that
builds cmd (the cmd list that includes tasks_path, agent, model, max_steps) to
insert the resolved agent_lib_path (and keep a sensible default matching the
current value) so callers can override the path when launching rollouts.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py`:
- Around line 54-59: The reconstructed action text currently handles
"computer_call" and "tool_use" but omits "function_call", losing important
behavior signals; update the branch in traces.py to add an elif for kind ==
"function_call" that appends a JSON entry to parts (similar style to existing
branches) including the function name and its arguments (e.g., use
item.get("name") and item.get("arguments") or fallback to item.get("input")), so
function call details are preserved in the reconstructed trace.

In `@libs/python/agent/agent/agent.py`:
- Around line 642-643: The early-return in run() can leave loop_kwargs
unassigned and cause UnboundLocalError when _on_run_end(loop_kwargs, ...) is
later called; to fix, initialize loop_kwargs (e.g., loop_kwargs = {} or None)
before entering the loop in run(), update it inside the loop as currently done,
and if you choose None ensure the later call to _on_run_end checks for None (or
skip calling _on_run_end) so the unbound variable is never referenced; reference
run(), on_run_continue, loop_kwargs, and _on_run_end when applying the change.

In `@libs/python/agent/agent/loops/qwen35.py`:
- Line 701: predict_click is passing image dims in the wrong order to
_unnormalize_coordinate causing X/Y swap; after computing rh, rw =
smart_resize(h, w, ...) call _unnormalize_coordinate with (rw, rh) (width,
height) not (rh, rw), i.e. change the args = await _unnormalize_coordinate(args,
(rh, rw)) call inside predict_click to args = await
_unnormalize_coordinate(args, (rw, rh)) so the
_unnormalize_coordinate(width,height) expectation is satisfied.
- Around line 432-451: The debug print in the completion message loop is leaking
potentially sensitive data; replace or remove the stdout print in the loop that
iterates over completion_messages (the block that builds step_content from
content/item and prints f"Step {i}: Role: {role}, Content: {step_content}"). Use
the module logger (e.g., logger.debug) instead of print and ensure logs are
guarded by a debug-level check or redact sensitive parts (truncate or mask
content), keeping the same logic for building step_content so behavior is
preserved but no raw prints go to stdout.

---

Nitpick comments:
In `@libs/cua-bench/cua_bench/agents/opencua_agent.py`:
- Around line 12-33: Extract the duplicate retry-detection logic into a shared
helper and import it from both agents: move the implementation of
_is_retryable_api_error into a common module (e.g., a new function named
is_retryable_api_error in a shared util module), remove the duplicate
definitions from opencua_agent.py and qwen3vl_agent.py, and update both files to
import and call is_retryable_api_error instead of their local
_is_retryable_api_error; ensure the shared function preserves the same checks
(asyncio.TimeoutError/TimeoutError, litellm exception types, and message
substring checks) and update any references to the original function name
accordingly.
- Line 200: Remove the unused variable declaration last_exc from
opencua_agent.py: locate the declaration "last_exc: BaseException | None = None"
(within the surrounding function or class where it appears) and delete it,
ensuring no other code references last_exc remain; run tests/lint to confirm
no-unused-variable warnings are resolved.

In `@libs/cua-bench/cua_bench/agents/qwen35_agent.py`:
- Around line 227-233: The DONE detection currently only inspects the first
content item (content[0]) in the loop over result["output"], so move to checking
all entries in the content list: for each message-type item (variable item)
iterate or use a predicate (e.g., any(...)) over content entries to see if any
entry's .get("text","") contains "DONE", and if so set task_completed = True and
break; update the logic around result["output"], item, content, step, and
task_completed in qwen35_agent.py accordingly.

In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py`:
- Around line 12-33: The _is_retryable_api_error function is duplicated in
qwen3vl_agent.py and opencua_agent.py; extract this function into a shared
utility module (e.g., create utils.py or base.py with _is_retryable_api_error),
remove the duplicate definitions from both agents, and update qwen3vl_agent.py
and opencua_agent.py to import _is_retryable_api_error from the new shared
module; ensure the imported name and behavior remain identical and run
tests/import checks to confirm no symbol changes.
- Around line 56-135: The _create_custom_computer method is duplicated from
opencua_agent.py (only dimensions differ); extract the shared logic into a
single reusable function or base-class method (e.g.,
build_custom_computer(session, width, height, environment) or
BaseAgent._create_custom_computer_impl) and have
qwen3vl_agent._create_custom_computer and opencua_agent._create_custom_computer
call that helper with the appropriate parameters (use the differing
get_dimensions value as the width/height argument or override get_environment if
needed); update references to the unique symbols in the diff (screenshot, click,
double_click, type_text, keypress, move, scroll, drag, wait, get_dimensions,
get_environment) so both agents delegate to the shared implementation and remove
the duplicated code.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py`:
- Around line 77-82: The manifest load currently swallows all exceptions and
silently replaces manifest with an empty list; update the block around manifest,
manifest_path.exists(), json.loads(manifest_path.read_text()) to catch
exceptions and emit a warning (e.g., logger.warning or module logger) that
includes the manifest_path and the caught exception so corrupt/invalid manifests
are visible in logs while still falling back to an empty list.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py`:
- Around line 212-234: The skipped_zero_advantage counter is never updated;
either increment it whenever a datum is intentionally dropped (e.g., when
_build_datum returns None or when advantage == 0 in the pending loop) or remove
the counter and the trailing print entirely. Locate the for-loop iterating over
pending and update the branch that currently ignores zero-advantage samples to
increment skipped_zero_advantage before continuing, or delete the
skipped_zero_advantage variable and the final print statement if you prefer not
to log skipped datums.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py`:
- Around line 79-88: The poll loop that waits using
_all_sessions_terminal(run_id) can sit idle until timeout if sessions are stuck;
add periodic progress logging inside the loop (every poll_interval) that logs
run_id, elapsed time (timeout - remaining), and a short status such as the count
or IDs of non-terminal sessions (implement a helper like
_nonterminal_sessions(run_id) if missing) so operators see progress and which
sessions are stuck; keep the existing TimeoutError but ensure each iteration
emits a concise log entry with run_id, poll_interval, and current non-terminal
info to aid debugging.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py`:
- Around line 146-148: The current use of assert on trajectory_id can drop
episodes; replace the assert with explicit runtime handling in the trace
ingestion code where trajectory_id is computed from ds (the assignment using
ds[0]["trajectory_id"]). If the value is None or not a str, coerce it to a safe
default (e.g., empty string or str(value)) and emit a warning/log (using the
module logger) including the offending value and row context, rather than
asserting, so ingestion continues robustly.

In `@libs/cua-bench/plan.md`:
- Around line 36-45: Update the fenced ASCII/code blocks in plan.md to include a
language specifier (e.g., ```text or ```plaintext) so they satisfy MD040;
specifically change the block that begins with
"~/.local/share/cua-bench/runs/<run_id>/" and the other similar
ASCII/file-structure blocks in the document to start with ```text (or
```plaintext) instead of bare ```, keeping the exact ASCII content intact.

In `@libs/python/agent/agent/agent.py`:
- Around line 1025-1041: Remove the leftover commented debug scaffolding and any
no-op inspection branches related to LLM response printing: delete the commented
print blocks that reference result.get("usage"), the loop over
result.get("output") (including checks for "computer_call", "function_call",
"message"), the print(f"\n[DEBUG][LLM Response] output={result.get('output')}")
lines, and any adjacent no-op/disabled inspection branches in the action loop
that only read the variable result/output; ensure you remove all similar debug
prints elsewhere in the same method that reference result or output while
leaving the actual action-loop behavior and return values untouched.

In `@libs/python/agent/agent/loops/opencua.py`:
- Around line 305-335: Priority 2 path parses XML tool calls into
tool_call/raw_args but never normalizes Qwen-style action fields; update the
block that builds raw_args after calling _parse_tool_call_from_text to pass
raw_args through convert_qwen_tool_args_to_computer_action (e.g., raw_args =
convert_qwen_tool_args_to_computer_action(raw_args)) before rescaling
coordinates and JSON-encoding, so fake_cm created for
convert_completion_messages_to_responses_items uses the normalized internal
action schema.

In `@libs/python/agent/agent/loops/qwen35.py`:
- Around line 502-510: Remove the dead/commented-out thinking_text extraction
block (the regex that strips <tool_call> and the output_items.append that
creates a "message" with "output_text") or, if it's being intentionally kept as
an alternative, leave a short explanatory comment above it stating why it is
preserved (e.g., "kept as alternative output handling for tool-invoked thoughts;
disabled because X"). Ensure references to the variables/functions are clear:
the commented regex using r"<tool_call>[\s\S]*?</tool_call>", the thinking_text
variable, and the output_items.append that emits a
{"type":"message","role":"assistant","content":[{"type":"output_text","text":
thinking_text}]} so future readers understand its purpose or that it was
intentionally removed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8bbb89bf-347d-4b4a-a373-e2dc3ae2fc49

📥 Commits

Reviewing files that changed from the base of the PR and between 131565b and 46930ad.

📒 Files selected for processing (23)
  • libs/cua-bench/Dockerfile
  • libs/cua-bench/cua_bench/agents/__init__.py
  • libs/cua-bench/cua_bench/agents/opencua_agent.py
  • libs/cua-bench/cua_bench/agents/qwen35_agent.py
  • libs/cua-bench/cua_bench/agents/qwen3vl_agent.py
  • libs/cua-bench/cua_bench/cli/commands/interact.py
  • libs/cua-bench/cua_bench/cli/commands/run.py
  • libs/cua-bench/cua_bench/runner/task_runner.py
  • libs/cua-bench/cua_bench/trainer/__init__.py
  • libs/cua-bench/cua_bench/trainer/off_policy/__init__.py
  • libs/cua-bench/cua_bench/trainer/off_policy/tinker/__init__.py
  • libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py
  • libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py
  • libs/cua-bench/cua_bench/trainer/off_policy/tinker/rl_loop.py
  • libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py
  • libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py
  • libs/cua-bench/plan.md
  • libs/cua-bench/pyproject.toml
  • libs/python/agent/agent/agent.py
  • libs/python/agent/agent/loops/__init__.py
  • libs/python/agent/agent/loops/opencua.py
  • libs/python/agent/agent/loops/qwen35.py
  • libs/python/agent/agent/loops/qwen3vl.py

Comment thread libs/cua-bench/cua_bench/agents/qwen35_agent.py
Comment on lines +117 to +118
async def get_dimensions():
return (1280, 800)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hardcoded screen dimensions may not match actual session.

The dimensions (1280, 800) are hardcoded and differ from opencua_agent.py's (1024, 768). If the actual session has different dimensions, this could cause coordinate mismatches. Consider querying the session for actual dimensions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 117 - 118, The
get_dimensions function in qwen3vl_agent.py currently returns a hardcoded (1280,
800) which may not match the real session (opencua_agent.py uses 1024x768);
replace the hardcoded tuple in async def get_dimensions() with logic that
queries the actual session/window size (e.g., read from the agent/session object
or a session API like get_window_size/get_viewport) and return that value,
falling back to a configurable default if the session query fails; update
callers to accept the dynamic result if necessary so coordinate calculations use
the real session dimensions.

Comment on lines +261 to +272
if task_completed:
failure_mode = FailureMode.NONE
# elif step >= self.max_steps:
# failure_mode = FailureMode.MAX_STEPS_EXCEEDED
else:
failure_mode = FailureMode.NONE

return AgentResult(
total_input_tokens=total_usage.get("prompt_tokens", 0),
total_output_tokens=total_usage.get("completion_tokens", 0),
failure_mode=failure_mode,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Inconsistent failure mode: always returns NONE even when max steps exceeded.

Unlike opencua_agent.py which returns FailureMode.MAX_STEPS_EXCEEDED when the step limit is hit (lines 252-253), this implementation always returns FailureMode.NONE. The commented-out code on lines 263-264 suggests this was intentional, but the inconsistency may cause confusion when analyzing benchmark results.

🔧 Suggested fix for consistency
             if task_completed:
                 failure_mode = FailureMode.NONE
-            # elif step >= self.max_steps:
-            #     failure_mode = FailureMode.MAX_STEPS_EXCEEDED
+            elif step >= self.max_steps:
+                failure_mode = FailureMode.MAX_STEPS_EXCEEDED
             else:
                 failure_mode = FailureMode.NONE
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if task_completed:
failure_mode = FailureMode.NONE
# elif step >= self.max_steps:
# failure_mode = FailureMode.MAX_STEPS_EXCEEDED
else:
failure_mode = FailureMode.NONE
return AgentResult(
total_input_tokens=total_usage.get("prompt_tokens", 0),
total_output_tokens=total_usage.get("completion_tokens", 0),
failure_mode=failure_mode,
)
if task_completed:
failure_mode = FailureMode.NONE
elif step >= self.max_steps:
failure_mode = FailureMode.MAX_STEPS_EXCEEDED
else:
failure_mode = FailureMode.NONE
return AgentResult(
total_input_tokens=total_usage.get("prompt_tokens", 0),
total_output_tokens=total_usage.get("completion_tokens", 0),
failure_mode=failure_mode,
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 261 - 272, The
return path in qwen3vl_agent.py always sets failure_mode to FailureMode.NONE,
ignoring when the agent hit the step limit; restore the check used in
opencua_agent.py by adding an elif branch that sets failure_mode =
FailureMode.MAX_STEPS_EXCEEDED when step >= self.max_steps (keep the existing
task_completed check first), then return the AgentResult as before (references:
FailureMode, step, self.max_steps, task_completed, AgentResult, total_usage).

Comment on lines +172 to +179
# Set headless to False for interactive mode (fall back if no display available)
import os
has_display = bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))
if not has_display:
print(
f"{GREY}No X server detected ($DISPLAY not set). Running in headless mode.{RESET}"
)
env.headless = not has_display
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Interactive mode is misdetected on macOS and may always run headless.

The current check treats missing DISPLAY/WAYLAND_DISPLAY as “no GUI”, which is false on macOS and breaks expected interactive behavior.

💡 Proposed fix
-        import os
-        has_display = bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))
+        import os
+        import sys
+        has_display = (
+            sys.platform == "darwin"
+            or bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))
+        )
         if not has_display:
             print(
-                f"{GREY}No X server detected ($DISPLAY not set). Running in headless mode.{RESET}"
+                f"{GREY}No graphical display detected. Running in headless mode.{RESET}"
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/cli/commands/interact.py` around lines 172 - 179,
The headless detection incorrectly treats missing DISPLAY/WAYLAND_DISPLAY as
no-GUI on macOS; update the logic around env.headless so macOS
(platform.system() == "Darwin" or sys.platform.startswith("darwin")) is assumed
to have a display and only fall back to checking DISPLAY/WAYLAND_DISPLAY on
non-macOS Unix platforms. Import the appropriate platform/sys module, compute
has_display = True for macOS, otherwise use the existing environment variable
checks, and set env.headless = not has_display (leaving the existing user-facing
message behavior intact).

Comment on lines +118 to +122
last = manifest[-1]
return CheckpointInfo(
state_path=last["state_path"],
epoch=last.get("epoch", 0),
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential KeyError if checkpoint entry is malformed.

If the last manifest entry is missing the state_path key (e.g., due to partial write or corruption), this will raise an unhandled KeyError, crashing the training loop during resume. Consider using .get() with validation.

🛡️ Defensive fix
     last = manifest[-1]
+    state_path = last.get("state_path")
+    if not state_path:
+        print(f"[checkpoint] Warning: last checkpoint entry missing state_path, skipping")
+        return None
     return CheckpointInfo(
-        state_path=last["state_path"],
+        state_path=state_path,
         epoch=last.get("epoch", 0),
     )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
last = manifest[-1]
return CheckpointInfo(
state_path=last["state_path"],
epoch=last.get("epoch", 0),
)
last = manifest[-1]
state_path = last.get("state_path")
if not state_path:
print(f"[checkpoint] Warning: last checkpoint entry missing state_path, skipping")
return None
return CheckpointInfo(
state_path=state_path,
epoch=last.get("epoch", 0),
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py` around
lines 118 - 122, The code constructs a CheckpointInfo from the last manifest
entry but accesses last["state_path"] directly which can raise KeyError if that
entry is malformed; update the logic in the function that reads manifest (use
the manifest and last variables and the CheckpointInfo construction) to safely
fetch state_path via last.get("state_path") and validate it (e.g., if not
present, either scan backwards for the most recent entry with a state_path or
raise a clear ValueError/RuntimeError that includes the manifest entry
contents), then pass the validated path and epoch into CheckpointInfo; ensure
the error message names the manifest/last entry so debugging is straightforward.

Comment on lines +56 to +62
cmd = [
"cb", "run", "dataset", str(tasks_path),
"--agent", agent,
"--model", model,
"--max-steps", str(max_steps),
"--with", "libs/python/agent/agent"
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Hardcoded --with path may not exist in all environments.

The command hardcodes --with libs/python/agent/agent assuming a specific repository structure. This path is relative to the working directory, which may vary depending on where the rollout is launched from.

🔧 Suggested fix: make the path configurable
 def _run_single_rollout(
     tasks_path: str | Path,
     agent: str,
     model: str,
     max_steps: int,
     *,
     timeout: int = 3600,
     poll_interval: int = 10,
     extra_env: dict[str, str] | None = None,
+    with_paths: list[str] | None = None,
 ) -> tuple[str, Path]:
     ...
     cmd = [
         "cb", "run", "dataset", str(tasks_path),
         "--agent", agent,
         "--model", model,
         "--max-steps", str(max_steps),
-        "--with", "libs/python/agent/agent"
     ]
+    for wp in (with_paths or []):
+        cmd.extend(["--with", wp])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py` around lines
56 - 62, The hardcoded "--with", "libs/python/agent/agent" entry in the cmd list
inside rollout.py can be missing depending on CWD; make this path configurable
by adding a parameter or config lookup (e.g., rollout function argument like
agent_lib_path or reading an env var such as AGENT_LIB_PATH) and use
pathlib.Path.resolve() to produce an absolute path; update the code that builds
cmd (the cmd list that includes tasks_path, agent, model, max_steps) to insert
the resolved agent_lib_path (and keep a sensible default matching the current
value) so callers can override the path when launching rollouts.

Comment on lines +54 to +59
elif kind == "computer_call":
parts.append(json.dumps({"action": item.get("action")}))
elif kind == "tool_use":
parts.append(
json.dumps({"tool": item.get("name"), "input": item.get("input")})
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

function_call actions are currently omitted from reconstructed action text.

This drops part of the model behavior signal for traces containing function/tool calls, which can skew GRPO targets.

💡 Proposed fix
         elif kind == "computer_call":
             parts.append(json.dumps({"action": item.get("action")}))
+        elif kind == "function_call":
+            parts.append(
+                json.dumps(
+                    {
+                        "function": item.get("name"),
+                        "arguments": item.get("arguments"),
+                    }
+                )
+            )
         elif kind == "tool_use":
             parts.append(
                 json.dumps({"tool": item.get("name"), "input": item.get("input")})
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif kind == "computer_call":
parts.append(json.dumps({"action": item.get("action")}))
elif kind == "tool_use":
parts.append(
json.dumps({"tool": item.get("name"), "input": item.get("input")})
)
elif kind == "computer_call":
parts.append(json.dumps({"action": item.get("action")}))
elif kind == "function_call":
parts.append(
json.dumps(
{
"function": item.get("name"),
"arguments": item.get("arguments"),
}
)
)
elif kind == "tool_use":
parts.append(
json.dumps({"tool": item.get("name"), "input": item.get("input")})
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py` around lines 54
- 59, The reconstructed action text currently handles "computer_call" and
"tool_use" but omits "function_call", losing important behavior signals; update
the branch in traces.py to add an elif for kind == "function_call" that appends
a JSON entry to parts (similar style to existing branches) including the
function name and its arguments (e.g., use item.get("name") and
item.get("arguments") or fallback to item.get("input")), so function call
details are preserved in the reconstructed trace.

Comment on lines +642 to 643
if not should_continue:
return False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Early stop path can crash with UnboundLocalError in run().

If on_run_continue returns False on the first iteration, execution breaks before loop_kwargs is assigned, but _on_run_end(loop_kwargs, ...) is still called later.

💡 Proposed fix
         await self._on_run_start(run_kwargs, old_items)
+        loop_kwargs: Dict[str, Any] = dict(run_kwargs)
 
         while new_items[-1].get("role") != "assistant" if new_items else True:
             # Lifecycle hook: Check if we should continue based on callbacks (e.g., budget manager)
             should_continue = await self._on_run_continue(run_kwargs, old_items, new_items)
             if not should_continue:
                 break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/python/agent/agent/agent.py` around lines 642 - 643, The early-return in
run() can leave loop_kwargs unassigned and cause UnboundLocalError when
_on_run_end(loop_kwargs, ...) is later called; to fix, initialize loop_kwargs
(e.g., loop_kwargs = {} or None) before entering the loop in run(), update it
inside the loop as currently done, and if you choose None ensure the later call
to _on_run_end checks for None (or skip calling _on_run_end) so the unbound
variable is never referenced; reference run(), on_run_continue, loop_kwargs, and
_on_run_end when applying the change.

Comment on lines +432 to +451
for i, msg in enumerate(completion_messages):
role = msg.get("role")
content = msg.get("content")
if isinstance(content, list):
step_content = []
for item in content:
item_type = item.get("type")
if item_type == "text":
step_content.append(item.get('text'))
elif item_type == "image_url":
step_content.append("Image URL: " + item.get('image_url').get('url')[:100])
else:
item = content
step_content = ""
if isinstance(item, dict) and item.get("type") == "image_url":
step_content = "Image URL: " + item.get('image_url').get('url')[:100]
else:
step_content = content

print(f"Step {i}: Role: {role}, Content: {step_content}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Debug print statements should be removed or guarded.

Lines 432-451 contain verbose debug logging that prints every message step with its content. This will pollute stdout in production and may leak sensitive information.

🧹 Remove or use proper logging
-        for i, msg in enumerate(completion_messages):
-            role = msg.get("role")
-            content = msg.get("content")
-            if isinstance(content, list):
-                step_content = []
-                for item in content:
-                    item_type = item.get("type")
-                    if item_type == "text":
-                        step_content.append(item.get('text'))
-                    elif item_type == "image_url":
-                        step_content.append("Image URL: " + item.get('image_url').get('url')[:100])
-            else:
-                item = content
-                step_content = ""
-                if isinstance(item, dict) and item.get("type") == "image_url":
-                    step_content = "Image URL: " + item.get('image_url').get('url')[:100]
-                else:
-                    step_content = content
-                    
-            print(f"Step {i}: Role: {role}, Content: {step_content}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/python/agent/agent/loops/qwen35.py` around lines 432 - 451, The debug
print in the completion message loop is leaking potentially sensitive data;
replace or remove the stdout print in the loop that iterates over
completion_messages (the block that builds step_content from content/item and
prints f"Step {i}: Role: {role}, Content: {step_content}"). Use the module
logger (e.g., logger.debug) instead of print and ensure logs are guarded by a
debug-level check or redact sensitive parts (truncate or mask content), keeping
the same logic for building step_content so behavior is preserved but no raw
prints go to stdout.

content_text = ((choice.get("message") or {}).get("content")) or ""
tool_call = _parse_tool_call_from_text(content_text) or {}
args = tool_call.get("arguments") or {}
args = await _unnormalize_coordinate(args, (rh, rw))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Bug: Incorrect dimension order in predict_click causes swapped coordinates.

Line 701 passes (rh, rw) to _unnormalize_coordinate, but rh is height and rw is width (from line 664: rh, rw = smart_resize(h, w, ...)). The function expects (width, height) at line 189: width, height = float(dims[0]), float(dims[1]).

This will cause X and Y coordinates to be scaled incorrectly.

-        args = await _unnormalize_coordinate(args, (rh, rw))
+        args = await _unnormalize_coordinate(args, (rw, rh))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/python/agent/agent/loops/qwen35.py` at line 701, predict_click is
passing image dims in the wrong order to _unnormalize_coordinate causing X/Y
swap; after computing rh, rw = smart_resize(h, w, ...) call
_unnormalize_coordinate with (rw, rh) (width, height) not (rh, rw), i.e. change
the args = await _unnormalize_coordinate(args, (rh, rw)) call inside
predict_click to args = await _unnormalize_coordinate(args, (rw, rh)) so the
_unnormalize_coordinate(width,height) expectation is satisfied.

@sentry
Copy link
Copy Markdown

sentry Bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 7.17391% with 427 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
libs/python/agent/agent/loops/qwen35.py 6.80% 274 Missing ⚠️
libs/python/agent/agent/loops/opencua.py 4.57% 146 Missing ⚠️
libs/python/agent/agent/agent.py 0.00% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

@ddupont808 ddupont808 merged commit 43734e7 into trycua:main Apr 6, 2026
11 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants