Skip to content

feat(cua-driver): desktop-scope Phase 1 — capture_scope config, get_desktop_state, Windows screen-absolute actions (#1968)#2019

Merged
f-trycua merged 6 commits into
mainfrom
feat/1968-phase1-desktop-scope
Jun 26, 2026
Merged

feat(cua-driver): desktop-scope Phase 1 — capture_scope config, get_desktop_state, Windows screen-absolute actions (#1968)#2019
f-trycua merged 6 commits into
mainfrom
feat/1968-phase1-desktop-scope

Conversation

@f-trycua

@f-trycua f-trycua commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Implements Phase 1 of #1968 — an opt-in desktop-scope modality for cua-driver (the "Computer-Use 1.0" vision-only, foreground, window-less, screen-absolute loop), complementing the default per-window background model.

What ships in Phase 1

  1. capture_scope: "window" | "desktop" config (default "window"), on all three platforms, wired through set_config / get_config like the existing capture_mode. macOS persists it (+ session overrides via a new effective_scope()); Windows/Linux keep it in-memory, matching their existing config behaviour.
  2. get_desktop_state tool on all three platforms — vision-only full-display screenshot via the existing capture::screenshot_display_bytes(), no downscale (true screen pixels so screen-absolute picks land), reporting both image dims and true screen size. structuredContent: platform, screenshot_width/height, screen_width/height, screen(scale)_factor, screenshot_mime_type (+ screenshot_file_path when screenshot_out_file is set). Over the CLI the PNG surfaces as screenshot_png_b64, same as get_window_state.
  3. Screen-absolute, window-less foreground click + scroll — Windows only (Phase 1 ships the Windows actuator; macOS/Linux global-pointer primitives are Phase 2). Reuses send_click_synthesized (already ABSOLUTE | VIRTUALDESK); adds send_wheel_synthesized for scroll. The window-less branch activates only when capture_scope == "desktop" AND no pid/window_id AND x,y present; otherwise returns a structured desktop_scope_disabled error pointing the caller at set_config capture_scope=desktop. Existing pid-based paths are unchanged.
  4. Capability: get_desktop_state claims screen.capture + screen.dimensions (existing tokens — additive, no CAPABILITY_VERSION bump).

Decisions baked in (were the open questions in planning)

  • No downscale for get_desktop_state — exact screen pixels for the GUI loop.
  • Explicit gate: window-less actions require capture_scope=desktop, not inferred from missing pid/window_id (avoids reinterpreting legacy x,y+pid calls).
  • macOS sibling effective_scope() rather than widening the existing effective() -> (String,u32) tuple (zero churn to existing call sites).
  • Windows click target via WindowFromPoint keeping the foreground-swap/UIPI path (what makes Chromium-content clicks land).
  • macOS/Linux desktop-scope actions: get_desktop_state works everywhere; the screen-absolute actions are Windows-only in Phase 1 (macOS CGEvent::post and Linux XTest pointer primitives are Phase 2, explicitly out of scope here).

Explicitly out of scope (later phases / the issue)

Desktop-wide SoM / cross-window element indices; a single computer-tool wrapper; computer-server wire-compat; multi-display (main display only); macOS/Linux foreground actuators (Phase 2).

Testing — Windows Phase 1 validated live ✅

Windows — live, in an interactive desktop session on the test VM (driven through a running cua-driver serve):

  • get_desktop_state → full-display capture returns a PNG + true screen size (1512×949). Confirms the earlier Session-0 The handle is invalid (0x80070006) was only the service-session wall — capture works in a real desktop.
  • Screen-absolute click (no pid/window_id, desktop scope) → ✅ Sent click via SendInput at screen (756,474) on HWND 0x103c8 (desktop scope) — lands on a real window via WindowFromPoint.
  • Screen-absolute scroll✅ Scrolled down via SendInput wheel (3 ticks) at screen (756,474) (desktop scope).
  • Negative gate: capture_scope=window + window-less click → structured desktop_scope_disabled rejection. ✓
  • All 72 platform-windows unit tests pass natively (incl. the 9 new desktop-scope tests: is_windowless_desktop_action, wheel_mouse_data sign/magnitude, get_desktop_state schema, config default).

Other platforms:

  • core: cargo test -p cua-driver-core capability → 11 passed (capability registration).
  • macOS: cargo build -p platform-macos clean; 4 new capture_scope / get_desktop_state / effective_scope tests pass. (Screen-absolute actuator is Phase 2.)
  • Linux: compiles in CI (the nix build + gnome/kde/xfce integration jobs are green); no live desktop-scope runtime check yet. (Actuator is Phase 2.)

Config persistence — capture_scope survives across processes ✅

Rebased on #2034 (config disk-persistence parity). capture_scope is now persisted to ~/.cua-driver/config.json and reloaded at startup on all three platforms, so it survives across stateless cua-driver call invocations: set_config capture_scope=desktop in one process is honored by the get_desktop_state / click / scroll of the next. Verified live on Windows + Linux — a separate get_config reads back desktop, and a fresh stateless click passes the desktop-scope gate straight into actuation (no held serve/MCP session required).

Downstream

Unblocks a --scope desktop switch in the Gemini computer-use-preview DesktopComputer backend, doubling as a drop-in full-screen computer-use shim. Phase 1 (Windows) is enough for a first end-to-end Gemini demo on a Windows host.
With #2034 merged + this branch rebased on it, the persisted capture_scope means the backend can drive desktop-scope over the per-call CLI as-is (no persistent connection required) — set_config capture_scope=desktop sticks across calls.

🤖 Generated with Claude Code

https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ

Summary by CodeRabbit

  • New Features

    • Added a new desktop capture mode that can grab the full screen at native resolution.
    • Added support for a new capture_scope setting, letting users switch between window-based and desktop-based behavior.
    • Extended scrolling and clicking behavior to work with screen coordinates in desktop mode.
    • Added mouse-wheel input support on Windows.
  • Bug Fixes

    • Configuration responses now consistently show the active capture scope.
    • Improved validation and error messages when unsupported capture settings are provided.

@vercel

vercel Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Ignored Ignored Preview Jun 26, 2026 4:01am

Request Review

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ff717769-07f9-486d-b9f4-5d1ae193c2ce

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds capture_scope to driver config, config tools, and session overrides on macOS/Linux/Windows. Introduces get_desktop_state on all three platforms, with Windows also adding desktop-scope click/scroll handling and wheel injection support.

Changes

Desktop scope capture and input

Layer / File(s) Summary
get_desktop_state capability mapping
libs/cua-driver/rust/crates/cua-driver-core/src/tool.rs
get_desktop_state now maps to screen.capture and screen.dimensions, and the capability-requirement test list includes it.
macOS capture_scope config propagation
libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs, libs/cua-driver/rust/crates/platform-macos/src/tools/get_config.rs, libs/cua-driver/rust/crates/platform-macos/src/tools/set_config.rs
capture_scope is added to DriverConfig, loaded from config JSON, stored in session overrides, resolved through effective_scope, and returned by the macOS config tools.
macOS desktop-state tool
libs/cua-driver/rust/crates/platform-macos/src/tools/get_screen_size.rs, libs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rs, libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs
get_desktop_state is registered, main_screen_size() is exposed to it, and the tool captures the main display at native resolution with image and geometry output.
Linux desktop-state and config
libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs
capture_scope, get_config, set_config, X11 screen-size factoring, and GetDesktopStateTool are added together, with registry wiring and tests.
Windows desktop-scope input
libs/cua-driver/rust/crates/platform-windows/src/input/mod.rs, libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs, libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs
click and scroll accept window-less screen coordinates when capture_scope is desktop, and wheel synthesis is added for synthesized scrolling.
Windows desktop capture and config
libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs
Windows adds physical_screen_size(), get_desktop_state, capture_scope config get/set support, response updates, and desktop-scope tests.

Sequence Diagram(s)

sequenceDiagram
  participant set_config
  participant SessionConfigRegistry
  participant write_driver_config_key
  participant get_config
  set_config->>SessionConfigRegistry: store capture_scope in ConfigOverrides
  set_config->>write_driver_config_key: persist capture_scope for global config
  get_config->>SessionConfigRegistry: effective_scope(session_id, global)
  SessionConfigRegistry-->>get_config: resolved capture_scope
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • trycua/cua#1692 — Shares the capture_scope-driven get_desktop_state desktop-capture flow on Windows.
  • trycua/cua#1776 — Introduces the session-scoped config override plumbing that this PR extends with capture_scope on macOS.
  • trycua/cua#1966 — Changes the Linux screenshot capture path used by the new desktop-state tool.

Suggested reviewers

  • r33drichards

Poem

🐇 I hopped through pixels, wide and bright,
with capture_scope set just right.
A desktop map, both near and far,
now glows in bytes like a tiny star.
Hooray! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: desktop-scope Phase 1 with capture_scope, get_desktop_state, and Windows screen-absolute actions.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/1968-phase1-desktop-scope

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs (2)

5119-5134: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

List capture_scope in the set_config description’s known keys.

The schema accepts capture_scope on Line 5134, but the user-facing “Known keys” text above it omits the new key, which makes the opt-in path harder for tool callers to discover.

Proposed wording update
                 Known keys:\n\
                 - `capture_mode` (string: `vision` | `ax` | `som`)\n\
+                - `capture_scope` (string: `window` | `desktop`)\n\
                 - `max_image_dimension` (integer)\n\
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs` around lines
5119 - 5134, The set_config documentation is missing the new capture_scope
option from the “Known keys” list even though the input_schema already accepts
it. Update the set_config description text near the schema in impl_.rs to add
capture_scope alongside capture_mode, max_image_dimension, experimental_pip, and
experimental_pip_geometry, keeping the wording consistent with the existing
style so callers can discover the desktop/window scope option.

1940-1946: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Update the x/y schema text for desktop-scope coordinates.

Line 1941 makes pid optional for desktop-scope clicks, but Lines 1945-1946 still describe x/y only as window-local get_window_state pixels. Agents reading the schema can still choose the wrong coordinate space for get_desktop_state.

Proposed wording update
-                    "x":{"type":"number","description":"X in window-local screenshot pixels — same space as the PNG get_window_state returns. Must be provided together with y."},
-                    "y":{"type":"number","description":"Y in window-local screenshot pixels. Must be provided together with x."},
+                    "x":{"type":"number","description":"X coordinate. With pid/window_id, this is window-local screenshot pixels from get_window_state. With capture_scope=\"desktop\" and no pid/window_id, this is a true screen pixel from get_desktop_state. Must be provided together with y."},
+                    "y":{"type":"number","description":"Y coordinate. With pid/window_id, this is window-local screenshot pixels from get_window_state. With capture_scope=\"desktop\" and no pid/window_id, this is a true screen pixel from get_desktop_state. Must be provided together with x."},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs` around lines
1940 - 1946, The schema text for x and y still only describes window-local
get_window_state pixels, which conflicts with the desktop-scope behavior allowed
by pid being optional. Update the x/y descriptions in the same schema block to
explicitly mention desktop-scope coordinates from get_desktop_state when
capture_scope is desktop and no pid/window_id is provided, while keeping the
existing window-local wording for window-based clicks. Use the existing schema
definitions in the object properties section to keep the coordinate-space
guidance consistent with pid, window_id, element_index, and element_token.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs`:
- Around line 3298-3302: The new capture_scope handling in the Linux tool parser
is mutating the shared DriverConfig, which can leak a session’s “desktop” choice
into other sessions; update the capture_scope branches in impl_ to treat it as a
session-only override whenever _session_id is present, matching the macOS peer
behavior. Use the existing capture_scope match logic and the session-related
state in the same tool parsing flow to apply the override locally rather than
writing it back to the shared config, and make the same change in the other
capture_scope branch around the later matching block.
- Around line 2779-2786: The Linux desktop structured payload in
get_desktop_state is missing the scale_factor field that peer platforms include.
Update the structured JSON built in the get_desktop_state path to add
scale_factor, using the same value already reported by get_screen_size on Linux
(1.0), so clients can rely on a consistent geometry contract across platforms.
- Around line 2755-2767: The screenshot output path handling in the Linux tool
does not expand a leading home prefix, so `screenshot_out_file` can be written
literally as `~/...` instead of the user’s home directory. Update the `out_file`
handling inside the `spawn_blocking` block to normalize the provided path before
`std::fs::write`, expanding `~/` the same way the macOS desktop-state tool does.
Use the existing `screenshot_out_file` argument flow and the
`out_file.as_deref()` branch to apply the expansion before writing or returning
the path.

In `@libs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rs`:
- Around line 27-31: The `get_desktop_state` description is advertising a macOS
desktop-only loop that immediately performs `click(x,y)` and `scroll(x,y)`
without `pid` or `window_id`, which is not supported yet. Update the tool
description in `get_desktop_state` to only describe the screenshot/state capture
behavior and native-resolution metadata, and remove any wording that implies
window-less desktop interaction on macOS until Phase 2.

In `@libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs`:
- Around line 161-163: The startup config load in the macOS tools module is
accepting any persisted capture_scope string, which can bypass the validation
enforced by set_config. Update the config restore logic around the capture_scope
handling in the startup path so it only applies values that match the allowed
enum (window or desktop) and otherwise leaves cfg.capture_scope at its default.
Use the existing set_config validation and the capture_scope field in the config
struct as the reference points when aligning the startup behavior with the
setter.

In `@libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs`:
- Around line 619-623: The wheel injection path in mouse input currently ignores
the result of SetCursorPos in the wheel-routing logic, which can cause SendInput
to run at the wrong cursor location when positioning fails. Update the wheel
handling block in the mouse input flow to propagate the SetCursorPos(sx, sy)
result with the ? operator so the operation aborts on failure before calling
SendInput, keeping the target window consistent.

In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs`:
- Around line 2011-2019: `WindowFromPoint` in the click-sending path is
returning a child window, but `send_click_synthesized` expects the top-level
HWND when it validates foreground ownership. Update the logic in
`spawn_blocking` to promote the hit-tested window to its root ancestor (using
the existing Windows ancestor API) before converting it to `hwnd_u` and passing
it into `crate::input::send_click_synthesized`, while keeping the original
screen coordinates unchanged.

---

Outside diff comments:
In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs`:
- Around line 5119-5134: The set_config documentation is missing the new
capture_scope option from the “Known keys” list even though the input_schema
already accepts it. Update the set_config description text near the schema in
impl_.rs to add capture_scope alongside capture_mode, max_image_dimension,
experimental_pip, and experimental_pip_geometry, keeping the wording consistent
with the existing style so callers can discover the desktop/window scope option.
- Around line 1940-1946: The schema text for x and y still only describes
window-local get_window_state pixels, which conflicts with the desktop-scope
behavior allowed by pid being optional. Update the x/y descriptions in the same
schema block to explicitly mention desktop-scope coordinates from
get_desktop_state when capture_scope is desktop and no pid/window_id is
provided, while keeping the existing window-local wording for window-based
clicks. Use the existing schema definitions in the object properties section to
keep the coordinate-space guidance consistent with pid, window_id,
element_index, and element_token.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c53e2e6b-0f8d-4043-8dea-dcdc7773dc62

📥 Commits

Reviewing files that changed from the base of the PR and between 9e84c1d and 6a7ae3f.

📒 Files selected for processing (10)
  • libs/cua-driver/rust/crates/cua-driver-core/src/tool.rs
  • libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs
  • libs/cua-driver/rust/crates/platform-macos/src/tools/get_config.rs
  • libs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rs
  • libs/cua-driver/rust/crates/platform-macos/src/tools/get_screen_size.rs
  • libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs
  • libs/cua-driver/rust/crates/platform-macos/src/tools/set_config.rs
  • libs/cua-driver/rust/crates/platform-windows/src/input/mod.rs
  • libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs
  • libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs

Comment on lines +2755 to +2767
let out_file = args.opt_str("screenshot_out_file");

let result = tokio::task::spawn_blocking(move || -> anyhow::Result<_> {
// Vision-only: capture the FULL DISPLAY at native size. No downscale
// so screen-absolute pixels land exactly.
let png = crate::capture::screenshot_display_bytes()?;
let (shot_w, shot_h) = crate::capture::png_dimensions_pub(&png)?;
// True screen size from the X11 root window.
let (screen_w, screen_h) = x11_screen_size()?;
// Optional: write PNG to disk instead of returning base64.
let written = if let Some(path) = out_file.as_deref() {
std::fs::write(path, &png)?;
Some(path.to_string())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Expand ~/ before writing screenshot_out_file.

Linux currently writes "~/foo.png" literally, while the peer macOS desktop-state tool expands the home prefix. This makes cross-platform callers fail on Linux for the same output path.

Proposed fix
-        let out_file = args.opt_str("screenshot_out_file");
+        let out_file = args.opt_str("screenshot_out_file").map(|s| {
+            if let Some(rest) = s.strip_prefix("~/") {
+                if let Ok(home) = std::env::var("HOME") {
+                    return format!("{home}/{rest}");
+                }
+            }
+            s
+        });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let out_file = args.opt_str("screenshot_out_file");
let result = tokio::task::spawn_blocking(move || -> anyhow::Result<_> {
// Vision-only: capture the FULL DISPLAY at native size. No downscale
// so screen-absolute pixels land exactly.
let png = crate::capture::screenshot_display_bytes()?;
let (shot_w, shot_h) = crate::capture::png_dimensions_pub(&png)?;
// True screen size from the X11 root window.
let (screen_w, screen_h) = x11_screen_size()?;
// Optional: write PNG to disk instead of returning base64.
let written = if let Some(path) = out_file.as_deref() {
std::fs::write(path, &png)?;
Some(path.to_string())
let out_file = args.opt_str("screenshot_out_file").map(|s| {
if let Some(rest) = s.strip_prefix("~/") {
if let Ok(home) = std::env::var("HOME") {
return format!("{home}/{rest}");
}
}
s
});
let result = tokio::task::spawn_blocking(move || -> anyhow::Result<_> {
// Vision-only: capture the FULL DISPLAY at native size. No downscale
// so screen-absolute pixels land exactly.
let png = crate::capture::screenshot_display_bytes()?;
let (shot_w, shot_h) = crate::capture::png_dimensions_pub(&png)?;
// True screen size from the X11 root window.
let (screen_w, screen_h) = x11_screen_size()?;
// Optional: write PNG to disk instead of returning base64.
let written = if let Some(path) = out_file.as_deref() {
std::fs::write(path, &png)?;
Some(path.to_string())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs` around lines
2755 - 2767, The screenshot output path handling in the Linux tool does not
expand a leading home prefix, so `screenshot_out_file` can be written literally
as `~/...` instead of the user’s home directory. Update the `out_file` handling
inside the `spawn_blocking` block to normalize the provided path before
`std::fs::write`, expanding `~/` the same way the macOS desktop-state tool does.
Use the existing `screenshot_out_file` argument flow and the
`out_file.as_deref()` branch to apply the expansion before writing or returning
the path.

Comment on lines +2779 to +2786
let mut structured = json!({
"platform": "linux",
"screenshot_width": shot_w,
"screenshot_height": shot_h,
"screen_width": screen_w,
"screen_height": screen_h,
"screenshot_mime_type": "image/png",
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Include scale_factor in desktop structured content.

get_desktop_state should expose the same geometry contract as the peer platforms; Linux already reports 1.0 from get_screen_size, but omits it here. Clients normalizing desktop screenshots across platforms can’t rely on the field.

Proposed fix
                 let mut structured = json!({
                     "platform": "linux",
                     "screenshot_width": shot_w,
                     "screenshot_height": shot_h,
                     "screen_width": screen_w,
                     "screen_height": screen_h,
+                    "scale_factor": 1.0,
                     "screenshot_mime_type": "image/png",
                 });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let mut structured = json!({
"platform": "linux",
"screenshot_width": shot_w,
"screenshot_height": shot_h,
"screen_width": screen_w,
"screen_height": screen_h,
"screenshot_mime_type": "image/png",
});
let mut structured = json!({
"platform": "linux",
"screenshot_width": shot_w,
"screenshot_height": shot_h,
"screen_width": screen_w,
"screen_height": screen_h,
"scale_factor": 1.0,
"screenshot_mime_type": "image/png",
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs` around lines
2779 - 2786, The Linux desktop structured payload in get_desktop_state is
missing the scale_factor field that peer platforms include. Update the
structured JSON built in the get_desktop_state path to add scale_factor, using
the same value already reported by get_screen_size on Linux (1.0), so clients
can rely on a consistent geometry contract across platforms.

Comment on lines +3298 to +3302
"capture_scope" => match val.as_str() {
Some(s @ ("window" | "desktop")) => { cfg.capture_scope = s.to_owned(); parts.push(format!("capture_scope={s}")); }
Some(other) => return ToolResult::error(format!("`capture_scope` must be \"window\" or \"desktop\", got \"{other}\".")),
None => return ToolResult::error(format!("`capture_scope` must be a string, got {val}.")),
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Keep capture_scope session-scoped when a session is present.

Both new capture_scope branches mutate the shared DriverConfig, so a session enabling "desktop" can leak into other Linux sessions. The PR contract and macOS peer implementation treat this as a session override when _session_id is present.

Also applies to: 3340-3346

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs` around lines
3298 - 3302, The new capture_scope handling in the Linux tool parser is mutating
the shared DriverConfig, which can leak a session’s “desktop” choice into other
sessions; update the capture_scope branches in impl_ to treat it as a
session-only override whenever _session_id is present, matching the macOS peer
behavior. Use the existing capture_scope match logic and the session-related
state in the same tool parsing flow to apply the override locally rather than
writing it back to the shared config, and make the same change in the other
capture_scope branch around the later matching block.

Comment on lines +27 to +31
description: "Capture a full-display vision screenshot in true screen pixels \
(no downscale), for capture_scope=\"desktop\" GUI loops where the agent then \
drives click(x,y)/scroll(x,y) with no pid/window_id. Returns the PNG at native \
display resolution plus the true screen size and backing scale factor so \
screen-absolute pixel picks land exactly. Vision-only: no AX tree walk."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't advertise window-less click/scroll on macOS yet.

This description says get_desktop_state is for macOS loops that then call click(x,y) / scroll(x,y) without pid or window_id, but this PR explicitly defers those desktop-scope actions on macOS to Phase 2. As written, the tool contract points agents at a flow that will fail immediately after capture.

Suggested fix
-        description: "Capture a full-display vision screenshot in true screen pixels \
-            (no downscale), for capture_scope=\"desktop\" GUI loops where the agent then \
-            drives click(x,y)/scroll(x,y) with no pid/window_id. Returns the PNG at native \
-            display resolution plus the true screen size and backing scale factor so \
-            screen-absolute pixel picks land exactly. Vision-only: no AX tree walk."
+        description: "Capture a full-display vision screenshot in true screen pixels \
+            (no downscale). Returns the PNG at native display resolution plus the true \
+            screen size and backing scale factor so screen-absolute pixel picks land \
+            exactly. Vision-only: no AX tree walk. macOS desktop-scope click/scroll \
+            actions are deferred to a later phase."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
description: "Capture a full-display vision screenshot in true screen pixels \
(no downscale), for capture_scope=\"desktop\" GUI loops where the agent then \
drives click(x,y)/scroll(x,y) with no pid/window_id. Returns the PNG at native \
display resolution plus the true screen size and backing scale factor so \
screen-absolute pixel picks land exactly. Vision-only: no AX tree walk."
description: "Capture a full-display vision screenshot in true screen pixels \
(no downscale). Returns the PNG at native display resolution plus the true \
screen size and backing scale factor so screen-absolute pixel picks land \
exactly. Vision-only: no AX tree walk. macOS desktop-scope click/scroll \
actions are deferred to a later phase."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rs`
around lines 27 - 31, The `get_desktop_state` description is advertising a macOS
desktop-only loop that immediately performs `click(x,y)` and `scroll(x,y)`
without `pid` or `window_id`, which is not supported yet. Update the tool
description in `get_desktop_state` to only describe the screenshot/state capture
behavior and native-resolution metadata, and remove any wording that implies
window-less desktop interaction on macOS until Phase 2.

Comment on lines +161 to +163
if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
cfg.capture_scope = v.to_owned();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

Reject invalid persisted capture_scope values.

set_config only allows window/desktop, but startup currently accepts any string from config.json. A stale or hand-edited file can therefore boot the daemon into a state that get_config will echo even though set_config would reject it. Keep the default when the stored value is outside the enum.

Suggested fix
-    if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
-        cfg.capture_scope = v.to_owned();
-    }
+    if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
+        if matches!(v, "window" | "desktop") {
+            cfg.capture_scope = v.to_owned();
+        }
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
cfg.capture_scope = v.to_owned();
}
if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
if matches!(v, "window" | "desktop") {
cfg.capture_scope = v.to_owned();
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs` around lines 161
- 163, The startup config load in the macOS tools module is accepting any
persisted capture_scope string, which can bypass the validation enforced by
set_config. Update the config restore logic around the capture_scope handling in
the startup path so it only applies values that match the allowed enum (window
or desktop) and otherwise leaves cfg.capture_scope at its default. Use the
existing set_config validation and the capture_scope field in the config struct
as the reference points when aligning the startup behavior with the setter.

Comment on lines +619 to +623
// Wheel routes to the window under the cursor — place it on the target.
let _ = SetCursorPos(sx, sy);

let events = [wheel_input];
let sent = SendInput(&events, std::mem::size_of::<INPUT>() as i32);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

Handle the result of SetCursorPos to prevent injecting at the wrong location.

If SetCursorPos(sx, sy) fails (e.g., due to invalid coordinates), silently ignoring the error causes SendInput to execute at the current cursor location instead of the intended target, scrolling the wrong window. Propagate the result using the ? operator to abort if the cursor cannot be placed.

Example correction
-        // Wheel routes to the window under the cursor — place it on the target.
-        let _ = SetCursorPos(sx, sy);
+        // Wheel routes to the window under the cursor — place it on the target.
+        SetCursorPos(sx, sy)?;

         let events = [wheel_input];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Wheel routes to the window under the cursor — place it on the target.
let _ = SetCursorPos(sx, sy);
let events = [wheel_input];
let sent = SendInput(&events, std::mem::size_of::<INPUT>() as i32);
// Wheel routes to the window under the cursor — place it on the target.
SetCursorPos(sx, sy)?;
let events = [wheel_input];
let sent = SendInput(&events, std::mem::size_of::<INPUT>() as i32);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs` around lines
619 - 623, The wheel injection path in mouse input currently ignores the result
of SetCursorPos in the wheel-routing logic, which can cause SendInput to run at
the wrong cursor location when positioning fails. Update the wheel handling
block in the mouse input flow to propagate the SetCursorPos(sx, sy) result with
the ? operator so the operation aborts on failure before calling SendInput,
keeping the target window consistent.

Comment on lines +2011 to +2019
let send_result = tokio::task::spawn_blocking(move || -> anyhow::Result<u64> {
use windows::Win32::Foundation::POINT;
use windows::Win32::UI::WindowsAndMessaging::WindowFromPoint;
let target = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
if target.0.is_null() {
anyhow::bail!("No window under screen point ({sx},{sy}).");
}
let hwnd_u = target.0 as u64;
crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs | sed -n '2011,2019p'

Repository: trycua/cua

Length of output: 787


🏁 Script executed:

rg -t rs "fn send_click_synthesized" -A 20 -B 5 --type rust

Repository: trycua/cua

Length of output: 175


🏁 Script executed:

rg "fn send_click_synthesized" -B 3 -A 30

Repository: trycua/cua

Length of output: 148


🏁 Script executed:

rg "send_click_synthesized" -B 3 -A 20

Repository: trycua/cua

Length of output: 148


🏁 Script executed:

find libs/cua-driver -type f -name "*.rs" | xargs grep -n "send_click_synthesized"

Repository: trycua/cua

Length of output: 3293


🏁 Script executed:

cat -n libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs

Repository: trycua/cua

Length of output: 32066


Promote WindowFromPoint results to a top-level HWND before send_click_synthesized

WindowFromPoint returns the deepest visible child. However, the caller send_click_synthesized verifies foreground success by checking GetForegroundWindow() == target. Windows SetForegroundWindow behavior promotes the target to its ancestor, causing this check to fail (Root != Child), and the function aborts with an error.

Use GetAncestor(leaf, GA_ROOT) to target the top-level frame. The screen coordinates for the SendInput event will remain correct.

Proposed fix
-                use windows::Win32::UI::WindowsAndMessaging::WindowFromPoint;
-                let target = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
-                if target.0.is_null() {
+                use windows::Win32::UI::WindowsAndMessaging::{GetAncestor, WindowFromPoint, GA_ROOT};
+                let leaf = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
+                if leaf.0.is_null() {
                     anyhow::bail!("No window under screen point ({sx},{sy}).");
                 }
+                let root = unsafe { GetAncestor(leaf, GA_ROOT) };
+                let target = if root.0.is_null() { leaf } else { root };
                 let hwnd_u = target.0 as u64;
                 crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let send_result = tokio::task::spawn_blocking(move || -> anyhow::Result<u64> {
use windows::Win32::Foundation::POINT;
use windows::Win32::UI::WindowsAndMessaging::WindowFromPoint;
let target = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
if target.0.is_null() {
anyhow::bail!("No window under screen point ({sx},{sy}).");
}
let hwnd_u = target.0 as u64;
crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?;
let send_result = tokio::task::spawn_blocking(move || -> anyhow::Result<u64> {
use windows::Win32::Foundation::POINT;
use windows::Win32::UI::WindowsAndMessaging::{GetAncestor, WindowFromPoint, GA_ROOT};
let leaf = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
if leaf.0.is_null() {
anyhow::bail!("No window under screen point ({sx},{sy}).");
}
let root = unsafe { GetAncestor(leaf, GA_ROOT) };
let target = if root.0.is_null() { leaf } else { root };
let hwnd_u = target.0 as u64;
crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs` around lines
2011 - 2019, `WindowFromPoint` in the click-sending path is returning a child
window, but `send_click_synthesized` expects the top-level HWND when it
validates foreground ownership. Update the logic in `spawn_blocking` to promote
the hit-tested window to its root ancestor (using the existing Windows ancestor
API) before converting it to `hwnd_u` and passing it into
`crate::input::send_click_synthesized`, while keeping the original screen
coordinates unchanged.

f-trycua and others added 6 commits June 25, 2026 20:48
…re + screen.dimensions) (#1968)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…ault window) (#1968)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…ault window) (#1968)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…on capture) (#1968)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…on capture) (#1968)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…+ window-less screen-absolute click/scroll (#1968)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
@f-trycua f-trycua force-pushed the feat/1968-phase1-desktop-scope branch from 6a7ae3f to b6d7451 Compare June 26, 2026 04:00
@github-actions

Copy link
Copy Markdown
Contributor

Linux visual regression artifacts

Matrix jobs now run independently. Download visual artifacts from this workflow run.
Each background-GUI job uploads a .gif of the interaction plus two annotated PNGs (<app>.png raw, <app>-atspi.png with AT-SPI element boxes); the cua-driver-linux-som-overlays artifact adds <app>-som.png cua Set-of-Marks overlays:

  • cua-driver-linux-cursor-click-gif
  • cua-driver-linux-background-terminal-gif
  • cua-driver-linux-parallel-drag-xserver
  • cua-driver-linux-background-gui-chromium
  • cua-driver-linux-background-gui-tk
  • cua-driver-linux-background-gui-gtk3-gedit
  • cua-driver-linux-background-gui-gtk3-mousepad
  • cua-driver-linux-background-gui-gtk3-scite
  • cua-driver-linux-background-gui-gtk4-characters
  • cua-driver-linux-background-gui-qt5-manuskript
  • cua-driver-linux-background-gui-qt5-klog
  • cua-driver-linux-background-gui-qt5-openambit
  • cua-driver-linux-background-gui-qt6-kate
  • cua-driver-linux-background-gui-qt6-kcalc
  • cua-driver-linux-background-gui-qt6-okular
  • cua-driver-linux-background-gui-qt6-qownnotes
  • cua-driver-linux-background-gui-electron-zettlr
  • cua-driver-linux-background-gui-electron-joplin
  • cua-driver-linux-background-gui-electron-logseq
  • cua-driver-linux-som-overlays

Open workflow run and download artifacts

@f-trycua f-trycua marked this pull request as ready for review June 26, 2026 05:10
@f-trycua f-trycua merged commit fc27185 into main Jun 26, 2026
71 checks passed
@f-trycua f-trycua deleted the feat/1968-phase1-desktop-scope branch June 26, 2026 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant