feat(cua-bench): RL integration of tinker into cua-bench#1200
Conversation
|
@dongyuanjushi is attempting to deploy a commit to the Cua Team on Vercel. A member of the Team first needs to authorize it. |
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds a complete Tinker off-policy RL training infrastructure for OpenCUA, including three new agent implementations (Qwen35, Qwen3VL, OpenCUA), an RL loop with GRPO training, rollout orchestration, checkpoint management, and trace loading. Environment setup, dependencies, and agent loop configurations are updated to support the new training pipeline. Changes
Sequence Diagram(s)sequenceDiagram
participant RLLoop as RL Loop
participant Rollout as Rollout Orchestrator
participant Subprocess as cb CLI
participant Sessions as Session Manager
participant TraceLoader as Trace Loader
participant GRPO as GRPO Batch Builder
participant TrainingClient as Tinker Training Client
RLLoop->>RLLoop: Initialize ServiceClient & resume from checkpoint
loop For each epoch
RLLoop->>RLLoop: Optionally save state checkpoint
RLLoop->>Rollout: run_rollouts(tasks, agent, model, steps, num_rollouts)
Rollout->>Subprocess: Invoke cb run dataset N times
Subprocess->>Sessions: Create training sessions
Rollout->>Sessions: Poll until all sessions terminal
Rollout-->>RLLoop: Return [(run_id, run_dir), ...]
RLLoop->>TraceLoader: load_run(run_dir) for each output
TraceLoader-->>RLLoop: [Episode, ...]
RLLoop->>RLLoop: Validate episode counts per task
RLLoop->>GRPO: build_batch(episodes, sampling_client, tokenizer, renderer)
GRPO->>GRPO: Group episodes by task, compute advantages
GRPO->>GRPO: Render prompts, tokenize actions, sample ref logprobs
GRPO->>GRPO: Build Datum objects
GRPO-->>RLLoop: [Datum, ...]
RLLoop->>TrainingClient: forward_backward(datums, loss_fn="importance_sampling")
RLLoop->>TrainingClient: optim_step(AdamW params)
TrainingClient-->>RLLoop: metrics dict
RLLoop->>RLLoop: Log training metrics
end
RLLoop->>TrainingClient: save_checkpoint(name="final")
RLLoop->>RLLoop: Training complete
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 13
🧹 Nitpick comments (13)
libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py (1)
146-148: Prefer explicit runtime handling overassertin trace ingestion.Using
asserthere can cause avoidable episode drops; a defensive coercion/fallback keeps training robust to imperfect rows.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py` around lines 146 - 148, The current use of assert on trajectory_id can drop episodes; replace the assert with explicit runtime handling in the trace ingestion code where trajectory_id is computed from ds (the assignment using ds[0]["trajectory_id"]). If the value is None or not a str, coerce it to a safe default (e.g., empty string or str(value)) and emit a warning/log (using the module logger) including the offending value and row context, rather than asserting, so ingestion continues robustly.libs/python/agent/agent/agent.py (1)
1025-1041: Please remove leftover debug scaffolding before merge.The commented debug blocks and no-op inspection branch add noise to a hot path and make the action loop harder to maintain.
Also applies to: 1053-1055, 1061-1069, 1074-1084
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/python/agent/agent/agent.py` around lines 1025 - 1041, Remove the leftover commented debug scaffolding and any no-op inspection branches related to LLM response printing: delete the commented print blocks that reference result.get("usage"), the loop over result.get("output") (including checks for "computer_call", "function_call", "message"), the print(f"\n[DEBUG][LLM Response] output={result.get('output')}") lines, and any adjacent no-op/disabled inspection branches in the action loop that only read the variable result/output; ensure you remove all similar debug prints elsewhere in the same method that reference result or output while leaving the actual action-loop behavior and return values untouched.libs/cua-bench/plan.md (1)
36-45: Add language specifiers to fenced code blocks for markdown lint compliance.Several ASCII diagram and file structure code blocks lack language specifiers, triggering markdown lint warnings. Using
textorplaintextwould satisfy MD040.📝 Example fix for the trace storage paths block
-``` +```text ~/.local/share/cua-bench/runs/<run_id>/ ├── task_0_trace/ # HF Dataset (arrow files)Apply similarly to blocks at lines 73, 112, 214, and 452.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/plan.md` around lines 36 - 45, Update the fenced ASCII/code blocks in plan.md to include a language specifier (e.g., ```text or ```plaintext) so they satisfy MD040; specifically change the block that begins with "~/.local/share/cua-bench/runs/<run_id>/" and the other similar ASCII/file-structure blocks in the document to start with ```text (or ```plaintext) instead of bare ```, keeping the exact ASCII content intact.libs/cua-bench/cua_bench/agents/opencua_agent.py (2)
12-33: Consider extracting shared retry error detection to a common module.This
_is_retryable_api_errorfunction is duplicated inqwen3vl_agent.py. Both agents could import from a shared location.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/agents/opencua_agent.py` around lines 12 - 33, Extract the duplicate retry-detection logic into a shared helper and import it from both agents: move the implementation of _is_retryable_api_error into a common module (e.g., a new function named is_retryable_api_error in a shared util module), remove the duplicate definitions from opencua_agent.py and qwen3vl_agent.py, and update both files to import and call is_retryable_api_error instead of their local _is_retryable_api_error; ensure the shared function preserves the same checks (asyncio.TimeoutError/TimeoutError, litellm exception types, and message substring checks) and update any references to the original function name accordingly.
200-200: Unused variablelast_exc.The variable
last_excis declared but never assigned or used. It appears to be leftover from a retry implementation that was removed.🧹 Remove unused variable
- last_exc: BaseException | None = None - try: step = 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/agents/opencua_agent.py` at line 200, Remove the unused variable declaration last_exc from opencua_agent.py: locate the declaration "last_exc: BaseException | None = None" (within the surrounding function or class where it appears) and delete it, ensuring no other code references last_exc remain; run tests/lint to confirm no-unused-variable warnings are resolved.libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py (1)
77-82: Silent manifest load failure could mask underlying issues.When the manifest exists but is corrupt, silently resetting to an empty list means previous checkpoint references are lost. Consider logging a warning to aid debugging.
📝 Add warning log
if manifest_path.exists(): try: manifest = json.loads(manifest_path.read_text()) except Exception: + print(f"[checkpoint] Warning: could not parse {manifest_path}, starting fresh manifest") manifest = []🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py` around lines 77 - 82, The manifest load currently swallows all exceptions and silently replaces manifest with an empty list; update the block around manifest, manifest_path.exists(), json.loads(manifest_path.read_text()) to catch exceptions and emit a warning (e.g., logger.warning or module logger) that includes the manifest_path and the caught exception so corrupt/invalid manifests are visible in logs while still falling back to an empty list.libs/cua-bench/cua_bench/agents/qwen3vl_agent.py (2)
12-33: Duplicate_is_retryable_api_errorfunction.This function is identical to the one in
opencua_agent.py. Consider extracting it to a shared utility module (e.g.,base.pyor a newutils.py) to avoid duplication.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 12 - 33, The _is_retryable_api_error function is duplicated in qwen3vl_agent.py and opencua_agent.py; extract this function into a shared utility module (e.g., create utils.py or base.py with _is_retryable_api_error), remove the duplicate definitions from both agents, and update qwen3vl_agent.py and opencua_agent.py to import _is_retryable_api_error from the new shared module; ensure the imported name and behavior remain identical and run tests/import checks to confirm no symbol changes.
56-135: Significant duplication of_create_custom_computerwithopencua_agent.py.This method is nearly identical to the one in
opencua_agent.py, differing only in hardcoded dimensions. Consider extracting to a shared base class or utility function, parameterizing the dimensions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 56 - 135, The _create_custom_computer method is duplicated from opencua_agent.py (only dimensions differ); extract the shared logic into a single reusable function or base-class method (e.g., build_custom_computer(session, width, height, environment) or BaseAgent._create_custom_computer_impl) and have qwen3vl_agent._create_custom_computer and opencua_agent._create_custom_computer call that helper with the appropriate parameters (use the differing get_dimensions value as the width/height argument or override get_environment if needed); update references to the unique symbols in the diff (screenshot, click, double_click, type_text, keypress, move, scroll, drag, wait, get_dimensions, get_environment) so both agents delegate to the shared implementation and remove the duplicated code.libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py (1)
79-88: Blocking poll loop may hang indefinitely on stuck sessions.The polling loop waits for all sessions to reach terminal state, but if a session gets stuck in a non-terminal state (e.g., due to a container issue), it will block until the full timeout expires. Consider adding a progress indicator or intermediate status logging.
📝 Suggested improvement: add progress logging
deadline = time.monotonic() + timeout + last_log = 0.0 while time.monotonic() < deadline: if _all_sessions_terminal(run_id): break + elapsed = time.monotonic() - (deadline - timeout) + if elapsed - last_log >= 60: # Log every minute + print(f"[rollout] Still waiting for run {run_id} ({int(elapsed)}s elapsed)...") + last_log = elapsed time.sleep(poll_interval)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py` around lines 79 - 88, The poll loop that waits using _all_sessions_terminal(run_id) can sit idle until timeout if sessions are stuck; add periodic progress logging inside the loop (every poll_interval) that logs run_id, elapsed time (timeout - remaining), and a short status such as the count or IDs of non-terminal sessions (implement a helper like _nonterminal_sessions(run_id) if missing) so operators see progress and which sessions are stuck; keep the existing TimeoutError but ensure each iteration emits a concise log entry with run_id, poll_interval, and current non-terminal info to aid debugging.libs/python/agent/agent/loops/opencua.py (1)
305-335: Inconsistent action format conversion in Priority 2 path.In Priority 3 (lines 356-360),
convert_qwen_tool_args_to_computer_actionis called to convert Qwen-style arguments to the internal action schema. However, in Priority 2 (lines 305-335), this conversion is missing. If the XML-parsed tool call contains Qwen-format arguments (e.g.,{"action": "left_click", "coordinate": [x, y]}), it won't be converted to the expected format.♻️ Suggested fix to add conversion in Priority 2
if coord and isinstance(coord, (list, tuple)) and len(coord) >= 2: rx, ry = _rescale(int(round(float(coord[0]))), int(round(float(coord[1])))) raw_args = {**raw_args, "coordinate": [rx, ry]} + # Convert Qwen format to Computer Calls format if this is a computer tool + if fn_name == "computer": + converted_action = convert_qwen_tool_args_to_computer_action(raw_args) + if converted_action: + raw_args = converted_action + fake_cm = { "role": "assistant",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/python/agent/agent/loops/opencua.py` around lines 305 - 335, Priority 2 path parses XML tool calls into tool_call/raw_args but never normalizes Qwen-style action fields; update the block that builds raw_args after calling _parse_tool_call_from_text to pass raw_args through convert_qwen_tool_args_to_computer_action (e.g., raw_args = convert_qwen_tool_args_to_computer_action(raw_args)) before rescaling coordinates and JSON-encoding, so fake_cm created for convert_completion_messages_to_responses_items uses the normalized internal action schema.libs/cua-bench/cua_bench/agents/qwen35_agent.py (1)
227-233: DONE detection only checks first content item.The loop only checks
content[0].get("text", "")for "DONE". If the completion indicator appears in subsequent content items, it won't be detected.♻️ Check all content items for DONE
for item in result["output"]: if item["type"] == "message": content = item.get("content", []) - if content and "DONE" in content[0].get("text", ""): + for c in content: + if isinstance(c, dict) and "DONE" in c.get("text", ""): + print(f"\n[Task completed] Agent indicated completion at step {step}") + task_completed = True + break + if task_completed: - print(f"\n[Task completed] Agent indicated completion at step {step}") - task_completed = True break🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/agents/qwen35_agent.py` around lines 227 - 233, The DONE detection currently only inspects the first content item (content[0]) in the loop over result["output"], so move to checking all entries in the content list: for each message-type item (variable item) iterate or use a predicate (e.g., any(...)) over content entries to see if any entry's .get("text","") contains "DONE", and if so set task_completed = True and break; update the logic around result["output"], item, content, step, and task_completed in qwen35_agent.py accordingly.libs/python/agent/agent/loops/qwen35.py (1)
502-510: Consider removing commented-out code blocks.Multiple commented-out sections for
thinking_textextraction exist (lines 502-510, 550-556). If these are intentionally disabled alternatives, consider removing them or adding a comment explaining why they're preserved.Also applies to: 550-556
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/python/agent/agent/loops/qwen35.py` around lines 502 - 510, Remove the dead/commented-out thinking_text extraction block (the regex that strips <tool_call> and the output_items.append that creates a "message" with "output_text") or, if it's being intentionally kept as an alternative, leave a short explanatory comment above it stating why it is preserved (e.g., "kept as alternative output handling for tool-invoked thoughts; disabled because X"). Ensure references to the variables/functions are clear: the commented regex using r"<tool_call>[\s\S]*?</tool_call>", the thinking_text variable, and the output_items.append that emits a {"type":"message","role":"assistant","content":[{"type":"output_text","text": thinking_text}]} so future readers understand its purpose or that it was intentionally removed.libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py (1)
212-234: Dead code:skipped_zero_advantagecounter is never incremented.The counter is initialized at line 212 and printed at lines 233-234, but it's never incremented. The zero-advantage skip at lines 185-186 doesn't update this counter.
♻️ Either increment the counter or remove the dead code
+ skipped_zero_advantage = 0 + for task_desc, task_episodes in task_groups.items(): for ep in task_episodes: advantage = _trajectory_advantage(ep, task_episodes) # Skip if advantage is exactly zero (all same reward in group) if advantage == 0.0: + skipped_zero_advantage += 1 continue🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py` around lines 212 - 234, The skipped_zero_advantage counter is never updated; either increment it whenever a datum is intentionally dropped (e.g., when _build_datum returns None or when advantage == 0 in the pending loop) or remove the counter and the trailing print entirely. Locate the for-loop iterating over pending and update the branch that currently ignores zero-advantage samples to increment skipped_zero_advantage before continuing, or delete the skipped_zero_advantage variable and the final print statement if you prefer not to log skipped datums.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@libs/cua-bench/cua_bench/agents/qwen35_agent.py`:
- Around line 238-243: The current logic treats both successful completion and
hitting the step limit as FailureMode.NONE, making them indistinguishable;
update the control flow in the qwen35_agent logic that sets failure_mode (using
task_completed, step, and self.max_steps) so that: if task_completed ->
FailureMode.NONE, elif step >= self.max_steps -> FailureMode.MAX_STEPS_EXCEEDED,
else -> FailureMode.NONE (restore the commented MAX_STEPS_EXCEEDED branch using
FailureMode.MAX_STEPS_EXCEEDED to allow callers to differentiate max-step
terminations from success).
In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py`:
- Around line 117-118: The get_dimensions function in qwen3vl_agent.py currently
returns a hardcoded (1280, 800) which may not match the real session
(opencua_agent.py uses 1024x768); replace the hardcoded tuple in async def
get_dimensions() with logic that queries the actual session/window size (e.g.,
read from the agent/session object or a session API like
get_window_size/get_viewport) and return that value, falling back to a
configurable default if the session query fails; update callers to accept the
dynamic result if necessary so coordinate calculations use the real session
dimensions.
- Around line 261-272: The return path in qwen3vl_agent.py always sets
failure_mode to FailureMode.NONE, ignoring when the agent hit the step limit;
restore the check used in opencua_agent.py by adding an elif branch that sets
failure_mode = FailureMode.MAX_STEPS_EXCEEDED when step >= self.max_steps (keep
the existing task_completed check first), then return the AgentResult as before
(references: FailureMode, step, self.max_steps, task_completed, AgentResult,
total_usage).
In `@libs/cua-bench/cua_bench/cli/commands/interact.py`:
- Around line 172-179: The headless detection incorrectly treats missing
DISPLAY/WAYLAND_DISPLAY as no-GUI on macOS; update the logic around env.headless
so macOS (platform.system() == "Darwin" or sys.platform.startswith("darwin")) is
assumed to have a display and only fall back to checking DISPLAY/WAYLAND_DISPLAY
on non-macOS Unix platforms. Import the appropriate platform/sys module, compute
has_display = True for macOS, otherwise use the existing environment variable
checks, and set env.headless = not has_display (leaving the existing user-facing
message behavior intact).
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py`:
- Around line 118-122: The code constructs a CheckpointInfo from the last
manifest entry but accesses last["state_path"] directly which can raise KeyError
if that entry is malformed; update the logic in the function that reads manifest
(use the manifest and last variables and the CheckpointInfo construction) to
safely fetch state_path via last.get("state_path") and validate it (e.g., if not
present, either scan backwards for the most recent entry with a state_path or
raise a clear ValueError/RuntimeError that includes the manifest entry
contents), then pass the validated path and epoch into CheckpointInfo; ensure
the error message names the manifest/last entry so debugging is straightforward.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py`:
- Line 174: Remove the stray debugger call `breakpoint()` left in the GRPO
training loop: either delete the `breakpoint()` invocation outright or guard it
behind a development-only flag (e.g., an environment/config check like DEBUG or
a `debug` argument) so production runs are not interrupted; look for the
`breakpoint()` occurrence in grpo.py (inside the GRPO training/step function)
and update/remove it accordingly.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rl_loop.py`:
- Around line 99-101: The call to
AutoTokenizer.from_pretrained(config.base_model, trust_remote_code=True) enables
remote code execution; make this behavior configurable and documented: add a
boolean config flag (e.g., config.allow_trust_remote_code or
tokenizer_trust_remote_code) and use it instead of hardcoding True when calling
AutoTokenizer.from_pretrained, document in README that certain models require
trust_remote_code and list recommended trusted sources, and ensure environments
with strict security can default the flag to False so the code path is safe
unless explicitly enabled for trusted models like Qwen.
- Around line 80-96: The code ignores config.resume_from; update the resume
logic in run() to honor config.resume_from by checking if config.resume_from is
set and, if so, load that checkpoint (e.g., call
checkpoints.get_checkpoint(config.resume_from) or pass config.resume_from
directly) and create the training client via
service_client.create_training_client_from_state_with_optimizer(resume_path) and
set start_epoch from the loaded resume info; otherwise fall back to
checkpoints.get_last_checkpoint(config.log_path) and to
create_lora_training_client as currently written. Ensure you reference
config.resume_from, checkpoints.get_checkpoint or
checkpoints.get_last_checkpoint,
service_client.create_training_client_from_state_with_optimizer, and start_epoch
when making the change.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py`:
- Around line 56-62: The hardcoded "--with", "libs/python/agent/agent" entry in
the cmd list inside rollout.py can be missing depending on CWD; make this path
configurable by adding a parameter or config lookup (e.g., rollout function
argument like agent_lib_path or reading an env var such as AGENT_LIB_PATH) and
use pathlib.Path.resolve() to produce an absolute path; update the code that
builds cmd (the cmd list that includes tasks_path, agent, model, max_steps) to
insert the resolved agent_lib_path (and keep a sensible default matching the
current value) so callers can override the path when launching rollouts.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py`:
- Around line 54-59: The reconstructed action text currently handles
"computer_call" and "tool_use" but omits "function_call", losing important
behavior signals; update the branch in traces.py to add an elif for kind ==
"function_call" that appends a JSON entry to parts (similar style to existing
branches) including the function name and its arguments (e.g., use
item.get("name") and item.get("arguments") or fallback to item.get("input")), so
function call details are preserved in the reconstructed trace.
In `@libs/python/agent/agent/agent.py`:
- Around line 642-643: The early-return in run() can leave loop_kwargs
unassigned and cause UnboundLocalError when _on_run_end(loop_kwargs, ...) is
later called; to fix, initialize loop_kwargs (e.g., loop_kwargs = {} or None)
before entering the loop in run(), update it inside the loop as currently done,
and if you choose None ensure the later call to _on_run_end checks for None (or
skip calling _on_run_end) so the unbound variable is never referenced; reference
run(), on_run_continue, loop_kwargs, and _on_run_end when applying the change.
In `@libs/python/agent/agent/loops/qwen35.py`:
- Line 701: predict_click is passing image dims in the wrong order to
_unnormalize_coordinate causing X/Y swap; after computing rh, rw =
smart_resize(h, w, ...) call _unnormalize_coordinate with (rw, rh) (width,
height) not (rh, rw), i.e. change the args = await _unnormalize_coordinate(args,
(rh, rw)) call inside predict_click to args = await
_unnormalize_coordinate(args, (rw, rh)) so the
_unnormalize_coordinate(width,height) expectation is satisfied.
- Around line 432-451: The debug print in the completion message loop is leaking
potentially sensitive data; replace or remove the stdout print in the loop that
iterates over completion_messages (the block that builds step_content from
content/item and prints f"Step {i}: Role: {role}, Content: {step_content}"). Use
the module logger (e.g., logger.debug) instead of print and ensure logs are
guarded by a debug-level check or redact sensitive parts (truncate or mask
content), keeping the same logic for building step_content so behavior is
preserved but no raw prints go to stdout.
---
Nitpick comments:
In `@libs/cua-bench/cua_bench/agents/opencua_agent.py`:
- Around line 12-33: Extract the duplicate retry-detection logic into a shared
helper and import it from both agents: move the implementation of
_is_retryable_api_error into a common module (e.g., a new function named
is_retryable_api_error in a shared util module), remove the duplicate
definitions from opencua_agent.py and qwen3vl_agent.py, and update both files to
import and call is_retryable_api_error instead of their local
_is_retryable_api_error; ensure the shared function preserves the same checks
(asyncio.TimeoutError/TimeoutError, litellm exception types, and message
substring checks) and update any references to the original function name
accordingly.
- Line 200: Remove the unused variable declaration last_exc from
opencua_agent.py: locate the declaration "last_exc: BaseException | None = None"
(within the surrounding function or class where it appears) and delete it,
ensuring no other code references last_exc remain; run tests/lint to confirm
no-unused-variable warnings are resolved.
In `@libs/cua-bench/cua_bench/agents/qwen35_agent.py`:
- Around line 227-233: The DONE detection currently only inspects the first
content item (content[0]) in the loop over result["output"], so move to checking
all entries in the content list: for each message-type item (variable item)
iterate or use a predicate (e.g., any(...)) over content entries to see if any
entry's .get("text","") contains "DONE", and if so set task_completed = True and
break; update the logic around result["output"], item, content, step, and
task_completed in qwen35_agent.py accordingly.
In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py`:
- Around line 12-33: The _is_retryable_api_error function is duplicated in
qwen3vl_agent.py and opencua_agent.py; extract this function into a shared
utility module (e.g., create utils.py or base.py with _is_retryable_api_error),
remove the duplicate definitions from both agents, and update qwen3vl_agent.py
and opencua_agent.py to import _is_retryable_api_error from the new shared
module; ensure the imported name and behavior remain identical and run
tests/import checks to confirm no symbol changes.
- Around line 56-135: The _create_custom_computer method is duplicated from
opencua_agent.py (only dimensions differ); extract the shared logic into a
single reusable function or base-class method (e.g.,
build_custom_computer(session, width, height, environment) or
BaseAgent._create_custom_computer_impl) and have
qwen3vl_agent._create_custom_computer and opencua_agent._create_custom_computer
call that helper with the appropriate parameters (use the differing
get_dimensions value as the width/height argument or override get_environment if
needed); update references to the unique symbols in the diff (screenshot, click,
double_click, type_text, keypress, move, scroll, drag, wait, get_dimensions,
get_environment) so both agents delegate to the shared implementation and remove
the duplicated code.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py`:
- Around line 77-82: The manifest load currently swallows all exceptions and
silently replaces manifest with an empty list; update the block around manifest,
manifest_path.exists(), json.loads(manifest_path.read_text()) to catch
exceptions and emit a warning (e.g., logger.warning or module logger) that
includes the manifest_path and the caught exception so corrupt/invalid manifests
are visible in logs while still falling back to an empty list.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.py`:
- Around line 212-234: The skipped_zero_advantage counter is never updated;
either increment it whenever a datum is intentionally dropped (e.g., when
_build_datum returns None or when advantage == 0 in the pending loop) or remove
the counter and the trailing print entirely. Locate the for-loop iterating over
pending and update the branch that currently ignores zero-advantage samples to
increment skipped_zero_advantage before continuing, or delete the
skipped_zero_advantage variable and the final print statement if you prefer not
to log skipped datums.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py`:
- Around line 79-88: The poll loop that waits using
_all_sessions_terminal(run_id) can sit idle until timeout if sessions are stuck;
add periodic progress logging inside the loop (every poll_interval) that logs
run_id, elapsed time (timeout - remaining), and a short status such as the count
or IDs of non-terminal sessions (implement a helper like
_nonterminal_sessions(run_id) if missing) so operators see progress and which
sessions are stuck; keep the existing TimeoutError but ensure each iteration
emits a concise log entry with run_id, poll_interval, and current non-terminal
info to aid debugging.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py`:
- Around line 146-148: The current use of assert on trajectory_id can drop
episodes; replace the assert with explicit runtime handling in the trace
ingestion code where trajectory_id is computed from ds (the assignment using
ds[0]["trajectory_id"]). If the value is None or not a str, coerce it to a safe
default (e.g., empty string or str(value)) and emit a warning/log (using the
module logger) including the offending value and row context, rather than
asserting, so ingestion continues robustly.
In `@libs/cua-bench/plan.md`:
- Around line 36-45: Update the fenced ASCII/code blocks in plan.md to include a
language specifier (e.g., ```text or ```plaintext) so they satisfy MD040;
specifically change the block that begins with
"~/.local/share/cua-bench/runs/<run_id>/" and the other similar
ASCII/file-structure blocks in the document to start with ```text (or
```plaintext) instead of bare ```, keeping the exact ASCII content intact.
In `@libs/python/agent/agent/agent.py`:
- Around line 1025-1041: Remove the leftover commented debug scaffolding and any
no-op inspection branches related to LLM response printing: delete the commented
print blocks that reference result.get("usage"), the loop over
result.get("output") (including checks for "computer_call", "function_call",
"message"), the print(f"\n[DEBUG][LLM Response] output={result.get('output')}")
lines, and any adjacent no-op/disabled inspection branches in the action loop
that only read the variable result/output; ensure you remove all similar debug
prints elsewhere in the same method that reference result or output while
leaving the actual action-loop behavior and return values untouched.
In `@libs/python/agent/agent/loops/opencua.py`:
- Around line 305-335: Priority 2 path parses XML tool calls into
tool_call/raw_args but never normalizes Qwen-style action fields; update the
block that builds raw_args after calling _parse_tool_call_from_text to pass
raw_args through convert_qwen_tool_args_to_computer_action (e.g., raw_args =
convert_qwen_tool_args_to_computer_action(raw_args)) before rescaling
coordinates and JSON-encoding, so fake_cm created for
convert_completion_messages_to_responses_items uses the normalized internal
action schema.
In `@libs/python/agent/agent/loops/qwen35.py`:
- Around line 502-510: Remove the dead/commented-out thinking_text extraction
block (the regex that strips <tool_call> and the output_items.append that
creates a "message" with "output_text") or, if it's being intentionally kept as
an alternative, leave a short explanatory comment above it stating why it is
preserved (e.g., "kept as alternative output handling for tool-invoked thoughts;
disabled because X"). Ensure references to the variables/functions are clear:
the commented regex using r"<tool_call>[\s\S]*?</tool_call>", the thinking_text
variable, and the output_items.append that emits a
{"type":"message","role":"assistant","content":[{"type":"output_text","text":
thinking_text}]} so future readers understand its purpose or that it was
intentionally removed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8bbb89bf-347d-4b4a-a373-e2dc3ae2fc49
📒 Files selected for processing (23)
libs/cua-bench/Dockerfilelibs/cua-bench/cua_bench/agents/__init__.pylibs/cua-bench/cua_bench/agents/opencua_agent.pylibs/cua-bench/cua_bench/agents/qwen35_agent.pylibs/cua-bench/cua_bench/agents/qwen3vl_agent.pylibs/cua-bench/cua_bench/cli/commands/interact.pylibs/cua-bench/cua_bench/cli/commands/run.pylibs/cua-bench/cua_bench/runner/task_runner.pylibs/cua-bench/cua_bench/trainer/__init__.pylibs/cua-bench/cua_bench/trainer/off_policy/__init__.pylibs/cua-bench/cua_bench/trainer/off_policy/tinker/__init__.pylibs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.pylibs/cua-bench/cua_bench/trainer/off_policy/tinker/grpo.pylibs/cua-bench/cua_bench/trainer/off_policy/tinker/rl_loop.pylibs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.pylibs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.pylibs/cua-bench/plan.mdlibs/cua-bench/pyproject.tomllibs/python/agent/agent/agent.pylibs/python/agent/agent/loops/__init__.pylibs/python/agent/agent/loops/opencua.pylibs/python/agent/agent/loops/qwen35.pylibs/python/agent/agent/loops/qwen3vl.py
| async def get_dimensions(): | ||
| return (1280, 800) |
There was a problem hiding this comment.
Hardcoded screen dimensions may not match actual session.
The dimensions (1280, 800) are hardcoded and differ from opencua_agent.py's (1024, 768). If the actual session has different dimensions, this could cause coordinate mismatches. Consider querying the session for actual dimensions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 117 - 118, The
get_dimensions function in qwen3vl_agent.py currently returns a hardcoded (1280,
800) which may not match the real session (opencua_agent.py uses 1024x768);
replace the hardcoded tuple in async def get_dimensions() with logic that
queries the actual session/window size (e.g., read from the agent/session object
or a session API like get_window_size/get_viewport) and return that value,
falling back to a configurable default if the session query fails; update
callers to accept the dynamic result if necessary so coordinate calculations use
the real session dimensions.
| if task_completed: | ||
| failure_mode = FailureMode.NONE | ||
| # elif step >= self.max_steps: | ||
| # failure_mode = FailureMode.MAX_STEPS_EXCEEDED | ||
| else: | ||
| failure_mode = FailureMode.NONE | ||
|
|
||
| return AgentResult( | ||
| total_input_tokens=total_usage.get("prompt_tokens", 0), | ||
| total_output_tokens=total_usage.get("completion_tokens", 0), | ||
| failure_mode=failure_mode, | ||
| ) |
There was a problem hiding this comment.
Inconsistent failure mode: always returns NONE even when max steps exceeded.
Unlike opencua_agent.py which returns FailureMode.MAX_STEPS_EXCEEDED when the step limit is hit (lines 252-253), this implementation always returns FailureMode.NONE. The commented-out code on lines 263-264 suggests this was intentional, but the inconsistency may cause confusion when analyzing benchmark results.
🔧 Suggested fix for consistency
if task_completed:
failure_mode = FailureMode.NONE
- # elif step >= self.max_steps:
- # failure_mode = FailureMode.MAX_STEPS_EXCEEDED
+ elif step >= self.max_steps:
+ failure_mode = FailureMode.MAX_STEPS_EXCEEDED
else:
failure_mode = FailureMode.NONE📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if task_completed: | |
| failure_mode = FailureMode.NONE | |
| # elif step >= self.max_steps: | |
| # failure_mode = FailureMode.MAX_STEPS_EXCEEDED | |
| else: | |
| failure_mode = FailureMode.NONE | |
| return AgentResult( | |
| total_input_tokens=total_usage.get("prompt_tokens", 0), | |
| total_output_tokens=total_usage.get("completion_tokens", 0), | |
| failure_mode=failure_mode, | |
| ) | |
| if task_completed: | |
| failure_mode = FailureMode.NONE | |
| elif step >= self.max_steps: | |
| failure_mode = FailureMode.MAX_STEPS_EXCEEDED | |
| else: | |
| failure_mode = FailureMode.NONE | |
| return AgentResult( | |
| total_input_tokens=total_usage.get("prompt_tokens", 0), | |
| total_output_tokens=total_usage.get("completion_tokens", 0), | |
| failure_mode=failure_mode, | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/cua-bench/cua_bench/agents/qwen3vl_agent.py` around lines 261 - 272, The
return path in qwen3vl_agent.py always sets failure_mode to FailureMode.NONE,
ignoring when the agent hit the step limit; restore the check used in
opencua_agent.py by adding an elif branch that sets failure_mode =
FailureMode.MAX_STEPS_EXCEEDED when step >= self.max_steps (keep the existing
task_completed check first), then return the AgentResult as before (references:
FailureMode, step, self.max_steps, task_completed, AgentResult, total_usage).
| # Set headless to False for interactive mode (fall back if no display available) | ||
| import os | ||
| has_display = bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY")) | ||
| if not has_display: | ||
| print( | ||
| f"{GREY}No X server detected ($DISPLAY not set). Running in headless mode.{RESET}" | ||
| ) | ||
| env.headless = not has_display |
There was a problem hiding this comment.
Interactive mode is misdetected on macOS and may always run headless.
The current check treats missing DISPLAY/WAYLAND_DISPLAY as “no GUI”, which is false on macOS and breaks expected interactive behavior.
💡 Proposed fix
- import os
- has_display = bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))
+ import os
+ import sys
+ has_display = (
+ sys.platform == "darwin"
+ or bool(os.environ.get("DISPLAY") or os.environ.get("WAYLAND_DISPLAY"))
+ )
if not has_display:
print(
- f"{GREY}No X server detected ($DISPLAY not set). Running in headless mode.{RESET}"
+ f"{GREY}No graphical display detected. Running in headless mode.{RESET}"
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/cua-bench/cua_bench/cli/commands/interact.py` around lines 172 - 179,
The headless detection incorrectly treats missing DISPLAY/WAYLAND_DISPLAY as
no-GUI on macOS; update the logic around env.headless so macOS
(platform.system() == "Darwin" or sys.platform.startswith("darwin")) is assumed
to have a display and only fall back to checking DISPLAY/WAYLAND_DISPLAY on
non-macOS Unix platforms. Import the appropriate platform/sys module, compute
has_display = True for macOS, otherwise use the existing environment variable
checks, and set env.headless = not has_display (leaving the existing user-facing
message behavior intact).
| last = manifest[-1] | ||
| return CheckpointInfo( | ||
| state_path=last["state_path"], | ||
| epoch=last.get("epoch", 0), | ||
| ) |
There was a problem hiding this comment.
Potential KeyError if checkpoint entry is malformed.
If the last manifest entry is missing the state_path key (e.g., due to partial write or corruption), this will raise an unhandled KeyError, crashing the training loop during resume. Consider using .get() with validation.
🛡️ Defensive fix
last = manifest[-1]
+ state_path = last.get("state_path")
+ if not state_path:
+ print(f"[checkpoint] Warning: last checkpoint entry missing state_path, skipping")
+ return None
return CheckpointInfo(
- state_path=last["state_path"],
+ state_path=state_path,
epoch=last.get("epoch", 0),
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| last = manifest[-1] | |
| return CheckpointInfo( | |
| state_path=last["state_path"], | |
| epoch=last.get("epoch", 0), | |
| ) | |
| last = manifest[-1] | |
| state_path = last.get("state_path") | |
| if not state_path: | |
| print(f"[checkpoint] Warning: last checkpoint entry missing state_path, skipping") | |
| return None | |
| return CheckpointInfo( | |
| state_path=state_path, | |
| epoch=last.get("epoch", 0), | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/checkpoints.py` around
lines 118 - 122, The code constructs a CheckpointInfo from the last manifest
entry but accesses last["state_path"] directly which can raise KeyError if that
entry is malformed; update the logic in the function that reads manifest (use
the manifest and last variables and the CheckpointInfo construction) to safely
fetch state_path via last.get("state_path") and validate it (e.g., if not
present, either scan backwards for the most recent entry with a state_path or
raise a clear ValueError/RuntimeError that includes the manifest entry
contents), then pass the validated path and epoch into CheckpointInfo; ensure
the error message names the manifest/last entry so debugging is straightforward.
| cmd = [ | ||
| "cb", "run", "dataset", str(tasks_path), | ||
| "--agent", agent, | ||
| "--model", model, | ||
| "--max-steps", str(max_steps), | ||
| "--with", "libs/python/agent/agent" | ||
| ] |
There was a problem hiding this comment.
Hardcoded --with path may not exist in all environments.
The command hardcodes --with libs/python/agent/agent assuming a specific repository structure. This path is relative to the working directory, which may vary depending on where the rollout is launched from.
🔧 Suggested fix: make the path configurable
def _run_single_rollout(
tasks_path: str | Path,
agent: str,
model: str,
max_steps: int,
*,
timeout: int = 3600,
poll_interval: int = 10,
extra_env: dict[str, str] | None = None,
+ with_paths: list[str] | None = None,
) -> tuple[str, Path]:
...
cmd = [
"cb", "run", "dataset", str(tasks_path),
"--agent", agent,
"--model", model,
"--max-steps", str(max_steps),
- "--with", "libs/python/agent/agent"
]
+ for wp in (with_paths or []):
+ cmd.extend(["--with", wp])🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/rollout.py` around lines
56 - 62, The hardcoded "--with", "libs/python/agent/agent" entry in the cmd list
inside rollout.py can be missing depending on CWD; make this path configurable
by adding a parameter or config lookup (e.g., rollout function argument like
agent_lib_path or reading an env var such as AGENT_LIB_PATH) and use
pathlib.Path.resolve() to produce an absolute path; update the code that builds
cmd (the cmd list that includes tasks_path, agent, model, max_steps) to insert
the resolved agent_lib_path (and keep a sensible default matching the current
value) so callers can override the path when launching rollouts.
| elif kind == "computer_call": | ||
| parts.append(json.dumps({"action": item.get("action")})) | ||
| elif kind == "tool_use": | ||
| parts.append( | ||
| json.dumps({"tool": item.get("name"), "input": item.get("input")}) | ||
| ) |
There was a problem hiding this comment.
function_call actions are currently omitted from reconstructed action text.
This drops part of the model behavior signal for traces containing function/tool calls, which can skew GRPO targets.
💡 Proposed fix
elif kind == "computer_call":
parts.append(json.dumps({"action": item.get("action")}))
+ elif kind == "function_call":
+ parts.append(
+ json.dumps(
+ {
+ "function": item.get("name"),
+ "arguments": item.get("arguments"),
+ }
+ )
+ )
elif kind == "tool_use":
parts.append(
json.dumps({"tool": item.get("name"), "input": item.get("input")})
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| elif kind == "computer_call": | |
| parts.append(json.dumps({"action": item.get("action")})) | |
| elif kind == "tool_use": | |
| parts.append( | |
| json.dumps({"tool": item.get("name"), "input": item.get("input")}) | |
| ) | |
| elif kind == "computer_call": | |
| parts.append(json.dumps({"action": item.get("action")})) | |
| elif kind == "function_call": | |
| parts.append( | |
| json.dumps( | |
| { | |
| "function": item.get("name"), | |
| "arguments": item.get("arguments"), | |
| } | |
| ) | |
| ) | |
| elif kind == "tool_use": | |
| parts.append( | |
| json.dumps({"tool": item.get("name"), "input": item.get("input")}) | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/cua-bench/cua_bench/trainer/off_policy/tinker/traces.py` around lines 54
- 59, The reconstructed action text currently handles "computer_call" and
"tool_use" but omits "function_call", losing important behavior signals; update
the branch in traces.py to add an elif for kind == "function_call" that appends
a JSON entry to parts (similar style to existing branches) including the
function name and its arguments (e.g., use item.get("name") and
item.get("arguments") or fallback to item.get("input")), so function call
details are preserved in the reconstructed trace.
| if not should_continue: | ||
| return False |
There was a problem hiding this comment.
Early stop path can crash with UnboundLocalError in run().
If on_run_continue returns False on the first iteration, execution breaks before loop_kwargs is assigned, but _on_run_end(loop_kwargs, ...) is still called later.
💡 Proposed fix
await self._on_run_start(run_kwargs, old_items)
+ loop_kwargs: Dict[str, Any] = dict(run_kwargs)
while new_items[-1].get("role") != "assistant" if new_items else True:
# Lifecycle hook: Check if we should continue based on callbacks (e.g., budget manager)
should_continue = await self._on_run_continue(run_kwargs, old_items, new_items)
if not should_continue:
break🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/python/agent/agent/agent.py` around lines 642 - 643, The early-return in
run() can leave loop_kwargs unassigned and cause UnboundLocalError when
_on_run_end(loop_kwargs, ...) is later called; to fix, initialize loop_kwargs
(e.g., loop_kwargs = {} or None) before entering the loop in run(), update it
inside the loop as currently done, and if you choose None ensure the later call
to _on_run_end checks for None (or skip calling _on_run_end) so the unbound
variable is never referenced; reference run(), on_run_continue, loop_kwargs, and
_on_run_end when applying the change.
| for i, msg in enumerate(completion_messages): | ||
| role = msg.get("role") | ||
| content = msg.get("content") | ||
| if isinstance(content, list): | ||
| step_content = [] | ||
| for item in content: | ||
| item_type = item.get("type") | ||
| if item_type == "text": | ||
| step_content.append(item.get('text')) | ||
| elif item_type == "image_url": | ||
| step_content.append("Image URL: " + item.get('image_url').get('url')[:100]) | ||
| else: | ||
| item = content | ||
| step_content = "" | ||
| if isinstance(item, dict) and item.get("type") == "image_url": | ||
| step_content = "Image URL: " + item.get('image_url').get('url')[:100] | ||
| else: | ||
| step_content = content | ||
|
|
||
| print(f"Step {i}: Role: {role}, Content: {step_content}") |
There was a problem hiding this comment.
Debug print statements should be removed or guarded.
Lines 432-451 contain verbose debug logging that prints every message step with its content. This will pollute stdout in production and may leak sensitive information.
🧹 Remove or use proper logging
- for i, msg in enumerate(completion_messages):
- role = msg.get("role")
- content = msg.get("content")
- if isinstance(content, list):
- step_content = []
- for item in content:
- item_type = item.get("type")
- if item_type == "text":
- step_content.append(item.get('text'))
- elif item_type == "image_url":
- step_content.append("Image URL: " + item.get('image_url').get('url')[:100])
- else:
- item = content
- step_content = ""
- if isinstance(item, dict) and item.get("type") == "image_url":
- step_content = "Image URL: " + item.get('image_url').get('url')[:100]
- else:
- step_content = content
-
- print(f"Step {i}: Role: {role}, Content: {step_content}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/python/agent/agent/loops/qwen35.py` around lines 432 - 451, The debug
print in the completion message loop is leaking potentially sensitive data;
replace or remove the stdout print in the loop that iterates over
completion_messages (the block that builds step_content from content/item and
prints f"Step {i}: Role: {role}, Content: {step_content}"). Use the module
logger (e.g., logger.debug) instead of print and ensure logs are guarded by a
debug-level check or redact sensitive parts (truncate or mask content), keeping
the same logic for building step_content so behavior is preserved but no raw
prints go to stdout.
| content_text = ((choice.get("message") or {}).get("content")) or "" | ||
| tool_call = _parse_tool_call_from_text(content_text) or {} | ||
| args = tool_call.get("arguments") or {} | ||
| args = await _unnormalize_coordinate(args, (rh, rw)) |
There was a problem hiding this comment.
Bug: Incorrect dimension order in predict_click causes swapped coordinates.
Line 701 passes (rh, rw) to _unnormalize_coordinate, but rh is height and rw is width (from line 664: rh, rw = smart_resize(h, w, ...)). The function expects (width, height) at line 189: width, height = float(dims[0]), float(dims[1]).
This will cause X and Y coordinates to be scaled incorrectly.
- args = await _unnormalize_coordinate(args, (rh, rw))
+ args = await _unnormalize_coordinate(args, (rw, rh))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@libs/python/agent/agent/loops/qwen35.py` at line 701, predict_click is
passing image dims in the wrong order to _unnormalize_coordinate causing X/Y
swap; after computing rh, rw = smart_resize(h, w, ...) call
_unnormalize_coordinate with (rw, rh) (width, height) not (rh, rw), i.e. change
the args = await _unnormalize_coordinate(args, (rh, rw)) call inside
predict_click to args = await _unnormalize_coordinate(args, (rw, rh)) so the
_unnormalize_coordinate(width,height) expectation is satisfied.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
RL integration of tinker into cua-bench
Summary by CodeRabbit
New Features
Bug Fixes
Chores