feat(cua-driver): desktop-scope Phase 1 — capture_scope config, get_desktop_state, Windows screen-absolute actions (#1968)#2019
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds ChangesDesktop scope capture and input
Sequence Diagram(s)sequenceDiagram
participant set_config
participant SessionConfigRegistry
participant write_driver_config_key
participant get_config
set_config->>SessionConfigRegistry: store capture_scope in ConfigOverrides
set_config->>write_driver_config_key: persist capture_scope for global config
get_config->>SessionConfigRegistry: effective_scope(session_id, global)
SessionConfigRegistry-->>get_config: resolved capture_scope
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs (2)
5119-5134: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick winList
capture_scopein theset_configdescription’s known keys.The schema accepts
capture_scopeon Line 5134, but the user-facing “Known keys” text above it omits the new key, which makes the opt-in path harder for tool callers to discover.Proposed wording update
Known keys:\n\ - `capture_mode` (string: `vision` | `ax` | `som`)\n\ + - `capture_scope` (string: `window` | `desktop`)\n\ - `max_image_dimension` (integer)\n\🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs` around lines 5119 - 5134, The set_config documentation is missing the new capture_scope option from the “Known keys” list even though the input_schema already accepts it. Update the set_config description text near the schema in impl_.rs to add capture_scope alongside capture_mode, max_image_dimension, experimental_pip, and experimental_pip_geometry, keeping the wording consistent with the existing style so callers can discover the desktop/window scope option.
1940-1946: 🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick winUpdate the
x/yschema text for desktop-scope coordinates.Line 1941 makes
pidoptional for desktop-scope clicks, but Lines 1945-1946 still describex/yonly as window-localget_window_statepixels. Agents reading the schema can still choose the wrong coordinate space forget_desktop_state.Proposed wording update
- "x":{"type":"number","description":"X in window-local screenshot pixels — same space as the PNG get_window_state returns. Must be provided together with y."}, - "y":{"type":"number","description":"Y in window-local screenshot pixels. Must be provided together with x."}, + "x":{"type":"number","description":"X coordinate. With pid/window_id, this is window-local screenshot pixels from get_window_state. With capture_scope=\"desktop\" and no pid/window_id, this is a true screen pixel from get_desktop_state. Must be provided together with y."}, + "y":{"type":"number","description":"Y coordinate. With pid/window_id, this is window-local screenshot pixels from get_window_state. With capture_scope=\"desktop\" and no pid/window_id, this is a true screen pixel from get_desktop_state. Must be provided together with x."},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs` around lines 1940 - 1946, The schema text for x and y still only describes window-local get_window_state pixels, which conflicts with the desktop-scope behavior allowed by pid being optional. Update the x/y descriptions in the same schema block to explicitly mention desktop-scope coordinates from get_desktop_state when capture_scope is desktop and no pid/window_id is provided, while keeping the existing window-local wording for window-based clicks. Use the existing schema definitions in the object properties section to keep the coordinate-space guidance consistent with pid, window_id, element_index, and element_token.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs`:
- Around line 3298-3302: The new capture_scope handling in the Linux tool parser
is mutating the shared DriverConfig, which can leak a session’s “desktop” choice
into other sessions; update the capture_scope branches in impl_ to treat it as a
session-only override whenever _session_id is present, matching the macOS peer
behavior. Use the existing capture_scope match logic and the session-related
state in the same tool parsing flow to apply the override locally rather than
writing it back to the shared config, and make the same change in the other
capture_scope branch around the later matching block.
- Around line 2779-2786: The Linux desktop structured payload in
get_desktop_state is missing the scale_factor field that peer platforms include.
Update the structured JSON built in the get_desktop_state path to add
scale_factor, using the same value already reported by get_screen_size on Linux
(1.0), so clients can rely on a consistent geometry contract across platforms.
- Around line 2755-2767: The screenshot output path handling in the Linux tool
does not expand a leading home prefix, so `screenshot_out_file` can be written
literally as `~/...` instead of the user’s home directory. Update the `out_file`
handling inside the `spawn_blocking` block to normalize the provided path before
`std::fs::write`, expanding `~/` the same way the macOS desktop-state tool does.
Use the existing `screenshot_out_file` argument flow and the
`out_file.as_deref()` branch to apply the expansion before writing or returning
the path.
In `@libs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rs`:
- Around line 27-31: The `get_desktop_state` description is advertising a macOS
desktop-only loop that immediately performs `click(x,y)` and `scroll(x,y)`
without `pid` or `window_id`, which is not supported yet. Update the tool
description in `get_desktop_state` to only describe the screenshot/state capture
behavior and native-resolution metadata, and remove any wording that implies
window-less desktop interaction on macOS until Phase 2.
In `@libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs`:
- Around line 161-163: The startup config load in the macOS tools module is
accepting any persisted capture_scope string, which can bypass the validation
enforced by set_config. Update the config restore logic around the capture_scope
handling in the startup path so it only applies values that match the allowed
enum (window or desktop) and otherwise leaves cfg.capture_scope at its default.
Use the existing set_config validation and the capture_scope field in the config
struct as the reference points when aligning the startup behavior with the
setter.
In `@libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs`:
- Around line 619-623: The wheel injection path in mouse input currently ignores
the result of SetCursorPos in the wheel-routing logic, which can cause SendInput
to run at the wrong cursor location when positioning fails. Update the wheel
handling block in the mouse input flow to propagate the SetCursorPos(sx, sy)
result with the ? operator so the operation aborts on failure before calling
SendInput, keeping the target window consistent.
In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs`:
- Around line 2011-2019: `WindowFromPoint` in the click-sending path is
returning a child window, but `send_click_synthesized` expects the top-level
HWND when it validates foreground ownership. Update the logic in
`spawn_blocking` to promote the hit-tested window to its root ancestor (using
the existing Windows ancestor API) before converting it to `hwnd_u` and passing
it into `crate::input::send_click_synthesized`, while keeping the original
screen coordinates unchanged.
---
Outside diff comments:
In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs`:
- Around line 5119-5134: The set_config documentation is missing the new
capture_scope option from the “Known keys” list even though the input_schema
already accepts it. Update the set_config description text near the schema in
impl_.rs to add capture_scope alongside capture_mode, max_image_dimension,
experimental_pip, and experimental_pip_geometry, keeping the wording consistent
with the existing style so callers can discover the desktop/window scope option.
- Around line 1940-1946: The schema text for x and y still only describes
window-local get_window_state pixels, which conflicts with the desktop-scope
behavior allowed by pid being optional. Update the x/y descriptions in the same
schema block to explicitly mention desktop-scope coordinates from
get_desktop_state when capture_scope is desktop and no pid/window_id is
provided, while keeping the existing window-local wording for window-based
clicks. Use the existing schema definitions in the object properties section to
keep the coordinate-space guidance consistent with pid, window_id,
element_index, and element_token.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c53e2e6b-0f8d-4043-8dea-dcdc7773dc62
📒 Files selected for processing (10)
libs/cua-driver/rust/crates/cua-driver-core/src/tool.rslibs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rslibs/cua-driver/rust/crates/platform-macos/src/tools/get_config.rslibs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rslibs/cua-driver/rust/crates/platform-macos/src/tools/get_screen_size.rslibs/cua-driver/rust/crates/platform-macos/src/tools/mod.rslibs/cua-driver/rust/crates/platform-macos/src/tools/set_config.rslibs/cua-driver/rust/crates/platform-windows/src/input/mod.rslibs/cua-driver/rust/crates/platform-windows/src/input/mouse.rslibs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs
| let out_file = args.opt_str("screenshot_out_file"); | ||
|
|
||
| let result = tokio::task::spawn_blocking(move || -> anyhow::Result<_> { | ||
| // Vision-only: capture the FULL DISPLAY at native size. No downscale | ||
| // so screen-absolute pixels land exactly. | ||
| let png = crate::capture::screenshot_display_bytes()?; | ||
| let (shot_w, shot_h) = crate::capture::png_dimensions_pub(&png)?; | ||
| // True screen size from the X11 root window. | ||
| let (screen_w, screen_h) = x11_screen_size()?; | ||
| // Optional: write PNG to disk instead of returning base64. | ||
| let written = if let Some(path) = out_file.as_deref() { | ||
| std::fs::write(path, &png)?; | ||
| Some(path.to_string()) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
Expand ~/ before writing screenshot_out_file.
Linux currently writes "~/foo.png" literally, while the peer macOS desktop-state tool expands the home prefix. This makes cross-platform callers fail on Linux for the same output path.
Proposed fix
- let out_file = args.opt_str("screenshot_out_file");
+ let out_file = args.opt_str("screenshot_out_file").map(|s| {
+ if let Some(rest) = s.strip_prefix("~/") {
+ if let Ok(home) = std::env::var("HOME") {
+ return format!("{home}/{rest}");
+ }
+ }
+ s
+ });📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let out_file = args.opt_str("screenshot_out_file"); | |
| let result = tokio::task::spawn_blocking(move || -> anyhow::Result<_> { | |
| // Vision-only: capture the FULL DISPLAY at native size. No downscale | |
| // so screen-absolute pixels land exactly. | |
| let png = crate::capture::screenshot_display_bytes()?; | |
| let (shot_w, shot_h) = crate::capture::png_dimensions_pub(&png)?; | |
| // True screen size from the X11 root window. | |
| let (screen_w, screen_h) = x11_screen_size()?; | |
| // Optional: write PNG to disk instead of returning base64. | |
| let written = if let Some(path) = out_file.as_deref() { | |
| std::fs::write(path, &png)?; | |
| Some(path.to_string()) | |
| let out_file = args.opt_str("screenshot_out_file").map(|s| { | |
| if let Some(rest) = s.strip_prefix("~/") { | |
| if let Ok(home) = std::env::var("HOME") { | |
| return format!("{home}/{rest}"); | |
| } | |
| } | |
| s | |
| }); | |
| let result = tokio::task::spawn_blocking(move || -> anyhow::Result<_> { | |
| // Vision-only: capture the FULL DISPLAY at native size. No downscale | |
| // so screen-absolute pixels land exactly. | |
| let png = crate::capture::screenshot_display_bytes()?; | |
| let (shot_w, shot_h) = crate::capture::png_dimensions_pub(&png)?; | |
| // True screen size from the X11 root window. | |
| let (screen_w, screen_h) = x11_screen_size()?; | |
| // Optional: write PNG to disk instead of returning base64. | |
| let written = if let Some(path) = out_file.as_deref() { | |
| std::fs::write(path, &png)?; | |
| Some(path.to_string()) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs` around lines
2755 - 2767, The screenshot output path handling in the Linux tool does not
expand a leading home prefix, so `screenshot_out_file` can be written literally
as `~/...` instead of the user’s home directory. Update the `out_file` handling
inside the `spawn_blocking` block to normalize the provided path before
`std::fs::write`, expanding `~/` the same way the macOS desktop-state tool does.
Use the existing `screenshot_out_file` argument flow and the
`out_file.as_deref()` branch to apply the expansion before writing or returning
the path.
| let mut structured = json!({ | ||
| "platform": "linux", | ||
| "screenshot_width": shot_w, | ||
| "screenshot_height": shot_h, | ||
| "screen_width": screen_w, | ||
| "screen_height": screen_h, | ||
| "screenshot_mime_type": "image/png", | ||
| }); |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Include scale_factor in desktop structured content.
get_desktop_state should expose the same geometry contract as the peer platforms; Linux already reports 1.0 from get_screen_size, but omits it here. Clients normalizing desktop screenshots across platforms can’t rely on the field.
Proposed fix
let mut structured = json!({
"platform": "linux",
"screenshot_width": shot_w,
"screenshot_height": shot_h,
"screen_width": screen_w,
"screen_height": screen_h,
+ "scale_factor": 1.0,
"screenshot_mime_type": "image/png",
});📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let mut structured = json!({ | |
| "platform": "linux", | |
| "screenshot_width": shot_w, | |
| "screenshot_height": shot_h, | |
| "screen_width": screen_w, | |
| "screen_height": screen_h, | |
| "screenshot_mime_type": "image/png", | |
| }); | |
| let mut structured = json!({ | |
| "platform": "linux", | |
| "screenshot_width": shot_w, | |
| "screenshot_height": shot_h, | |
| "screen_width": screen_w, | |
| "screen_height": screen_h, | |
| "scale_factor": 1.0, | |
| "screenshot_mime_type": "image/png", | |
| }); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs` around lines
2779 - 2786, The Linux desktop structured payload in get_desktop_state is
missing the scale_factor field that peer platforms include. Update the
structured JSON built in the get_desktop_state path to add scale_factor, using
the same value already reported by get_screen_size on Linux (1.0), so clients
can rely on a consistent geometry contract across platforms.
| "capture_scope" => match val.as_str() { | ||
| Some(s @ ("window" | "desktop")) => { cfg.capture_scope = s.to_owned(); parts.push(format!("capture_scope={s}")); } | ||
| Some(other) => return ToolResult::error(format!("`capture_scope` must be \"window\" or \"desktop\", got \"{other}\".")), | ||
| None => return ToolResult::error(format!("`capture_scope` must be a string, got {val}.")), | ||
| }, |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift
Keep capture_scope session-scoped when a session is present.
Both new capture_scope branches mutate the shared DriverConfig, so a session enabling "desktop" can leak into other Linux sessions. The PR contract and macOS peer implementation treat this as a session override when _session_id is present.
Also applies to: 3340-3346
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs` around lines
3298 - 3302, The new capture_scope handling in the Linux tool parser is mutating
the shared DriverConfig, which can leak a session’s “desktop” choice into other
sessions; update the capture_scope branches in impl_ to treat it as a
session-only override whenever _session_id is present, matching the macOS peer
behavior. Use the existing capture_scope match logic and the session-related
state in the same tool parsing flow to apply the override locally rather than
writing it back to the shared config, and make the same change in the other
capture_scope branch around the later matching block.
| description: "Capture a full-display vision screenshot in true screen pixels \ | ||
| (no downscale), for capture_scope=\"desktop\" GUI loops where the agent then \ | ||
| drives click(x,y)/scroll(x,y) with no pid/window_id. Returns the PNG at native \ | ||
| display resolution plus the true screen size and backing scale factor so \ | ||
| screen-absolute pixel picks land exactly. Vision-only: no AX tree walk." |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Don't advertise window-less click/scroll on macOS yet.
This description says get_desktop_state is for macOS loops that then call click(x,y) / scroll(x,y) without pid or window_id, but this PR explicitly defers those desktop-scope actions on macOS to Phase 2. As written, the tool contract points agents at a flow that will fail immediately after capture.
Suggested fix
- description: "Capture a full-display vision screenshot in true screen pixels \
- (no downscale), for capture_scope=\"desktop\" GUI loops where the agent then \
- drives click(x,y)/scroll(x,y) with no pid/window_id. Returns the PNG at native \
- display resolution plus the true screen size and backing scale factor so \
- screen-absolute pixel picks land exactly. Vision-only: no AX tree walk."
+ description: "Capture a full-display vision screenshot in true screen pixels \
+ (no downscale). Returns the PNG at native display resolution plus the true \
+ screen size and backing scale factor so screen-absolute pixel picks land \
+ exactly. Vision-only: no AX tree walk. macOS desktop-scope click/scroll \
+ actions are deferred to a later phase."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| description: "Capture a full-display vision screenshot in true screen pixels \ | |
| (no downscale), for capture_scope=\"desktop\" GUI loops where the agent then \ | |
| drives click(x,y)/scroll(x,y) with no pid/window_id. Returns the PNG at native \ | |
| display resolution plus the true screen size and backing scale factor so \ | |
| screen-absolute pixel picks land exactly. Vision-only: no AX tree walk." | |
| description: "Capture a full-display vision screenshot in true screen pixels \ | |
| (no downscale). Returns the PNG at native display resolution plus the true \ | |
| screen size and backing scale factor so screen-absolute pixel picks land \ | |
| exactly. Vision-only: no AX tree walk. macOS desktop-scope click/scroll \ | |
| actions are deferred to a later phase." |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-macos/src/tools/get_desktop_state.rs`
around lines 27 - 31, The `get_desktop_state` description is advertising a macOS
desktop-only loop that immediately performs `click(x,y)` and `scroll(x,y)`
without `pid` or `window_id`, which is not supported yet. Update the tool
description in `get_desktop_state` to only describe the screenshot/state capture
behavior and native-resolution metadata, and remove any wording that implies
window-less desktop interaction on macOS until Phase 2.
| if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) { | ||
| cfg.capture_scope = v.to_owned(); | ||
| } |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win
Reject invalid persisted capture_scope values.
set_config only allows window/desktop, but startup currently accepts any string from config.json. A stale or hand-edited file can therefore boot the daemon into a state that get_config will echo even though set_config would reject it. Keep the default when the stored value is outside the enum.
Suggested fix
- if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
- cfg.capture_scope = v.to_owned();
- }
+ if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) {
+ if matches!(v, "window" | "desktop") {
+ cfg.capture_scope = v.to_owned();
+ }
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) { | |
| cfg.capture_scope = v.to_owned(); | |
| } | |
| if let Some(v) = json.get("capture_scope").and_then(|v| v.as_str()) { | |
| if matches!(v, "window" | "desktop") { | |
| cfg.capture_scope = v.to_owned(); | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-macos/src/tools/mod.rs` around lines 161
- 163, The startup config load in the macOS tools module is accepting any
persisted capture_scope string, which can bypass the validation enforced by
set_config. Update the config restore logic around the capture_scope handling in
the startup path so it only applies values that match the allowed enum (window
or desktop) and otherwise leaves cfg.capture_scope at its default. Use the
existing set_config validation and the capture_scope field in the config struct
as the reference points when aligning the startup behavior with the setter.
| // Wheel routes to the window under the cursor — place it on the target. | ||
| let _ = SetCursorPos(sx, sy); | ||
|
|
||
| let events = [wheel_input]; | ||
| let sent = SendInput(&events, std::mem::size_of::<INPUT>() as i32); |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
Handle the result of SetCursorPos to prevent injecting at the wrong location.
If SetCursorPos(sx, sy) fails (e.g., due to invalid coordinates), silently ignoring the error causes SendInput to execute at the current cursor location instead of the intended target, scrolling the wrong window. Propagate the result using the ? operator to abort if the cursor cannot be placed.
Example correction
- // Wheel routes to the window under the cursor — place it on the target.
- let _ = SetCursorPos(sx, sy);
+ // Wheel routes to the window under the cursor — place it on the target.
+ SetCursorPos(sx, sy)?;
let events = [wheel_input];📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Wheel routes to the window under the cursor — place it on the target. | |
| let _ = SetCursorPos(sx, sy); | |
| let events = [wheel_input]; | |
| let sent = SendInput(&events, std::mem::size_of::<INPUT>() as i32); | |
| // Wheel routes to the window under the cursor — place it on the target. | |
| SetCursorPos(sx, sy)?; | |
| let events = [wheel_input]; | |
| let sent = SendInput(&events, std::mem::size_of::<INPUT>() as i32); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rs` around lines
619 - 623, The wheel injection path in mouse input currently ignores the result
of SetCursorPos in the wheel-routing logic, which can cause SendInput to run at
the wrong cursor location when positioning fails. Update the wheel handling
block in the mouse input flow to propagate the SetCursorPos(sx, sy) result with
the ? operator so the operation aborts on failure before calling SendInput,
keeping the target window consistent.
| let send_result = tokio::task::spawn_blocking(move || -> anyhow::Result<u64> { | ||
| use windows::Win32::Foundation::POINT; | ||
| use windows::Win32::UI::WindowsAndMessaging::WindowFromPoint; | ||
| let target = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) }; | ||
| if target.0.is_null() { | ||
| anyhow::bail!("No window under screen point ({sx},{sy})."); | ||
| } | ||
| let hwnd_u = target.0 as u64; | ||
| crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?; |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
cat -n libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs | sed -n '2011,2019p'Repository: trycua/cua
Length of output: 787
🏁 Script executed:
rg -t rs "fn send_click_synthesized" -A 20 -B 5 --type rustRepository: trycua/cua
Length of output: 175
🏁 Script executed:
rg "fn send_click_synthesized" -B 3 -A 30Repository: trycua/cua
Length of output: 148
🏁 Script executed:
rg "send_click_synthesized" -B 3 -A 20Repository: trycua/cua
Length of output: 148
🏁 Script executed:
find libs/cua-driver -type f -name "*.rs" | xargs grep -n "send_click_synthesized"Repository: trycua/cua
Length of output: 3293
🏁 Script executed:
cat -n libs/cua-driver/rust/crates/platform-windows/src/input/mouse.rsRepository: trycua/cua
Length of output: 32066
Promote WindowFromPoint results to a top-level HWND before send_click_synthesized
WindowFromPoint returns the deepest visible child. However, the caller send_click_synthesized verifies foreground success by checking GetForegroundWindow() == target. Windows SetForegroundWindow behavior promotes the target to its ancestor, causing this check to fail (Root != Child), and the function aborts with an error.
Use GetAncestor(leaf, GA_ROOT) to target the top-level frame. The screen coordinates for the SendInput event will remain correct.
Proposed fix
- use windows::Win32::UI::WindowsAndMessaging::WindowFromPoint;
- let target = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
- if target.0.is_null() {
+ use windows::Win32::UI::WindowsAndMessaging::{GetAncestor, WindowFromPoint, GA_ROOT};
+ let leaf = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) };
+ if leaf.0.is_null() {
anyhow::bail!("No window under screen point ({sx},{sy}).");
}
+ let root = unsafe { GetAncestor(leaf, GA_ROOT) };
+ let target = if root.0.is_null() { leaf } else { root };
let hwnd_u = target.0 as u64;
crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let send_result = tokio::task::spawn_blocking(move || -> anyhow::Result<u64> { | |
| use windows::Win32::Foundation::POINT; | |
| use windows::Win32::UI::WindowsAndMessaging::WindowFromPoint; | |
| let target = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) }; | |
| if target.0.is_null() { | |
| anyhow::bail!("No window under screen point ({sx},{sy})."); | |
| } | |
| let hwnd_u = target.0 as u64; | |
| crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?; | |
| let send_result = tokio::task::spawn_blocking(move || -> anyhow::Result<u64> { | |
| use windows::Win32::Foundation::POINT; | |
| use windows::Win32::UI::WindowsAndMessaging::{GetAncestor, WindowFromPoint, GA_ROOT}; | |
| let leaf = unsafe { WindowFromPoint(POINT { x: sx, y: sy }) }; | |
| if leaf.0.is_null() { | |
| anyhow::bail!("No window under screen point ({sx},{sy})."); | |
| } | |
| let root = unsafe { GetAncestor(leaf, GA_ROOT) }; | |
| let target = if root.0.is_null() { leaf } else { root }; | |
| let hwnd_u = target.0 as u64; | |
| crate::input::send_click_synthesized(hwnd_u, sx, sy, count, &button)?; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs` around lines
2011 - 2019, `WindowFromPoint` in the click-sending path is returning a child
window, but `send_click_synthesized` expects the top-level HWND when it
validates foreground ownership. Update the logic in `spawn_blocking` to promote
the hit-tested window to its root ancestor (using the existing Windows ancestor
API) before converting it to `hwnd_u` and passing it into
`crate::input::send_click_synthesized`, while keeping the original screen
coordinates unchanged.
…re + screen.dimensions) (#1968) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…ault window) (#1968) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…ault window) (#1968) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…on capture) (#1968) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…on capture) (#1968) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
…+ window-less screen-absolute click/scroll (#1968) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
6a7ae3f to
b6d7451
Compare
Linux visual regression artifactsMatrix jobs now run independently. Download visual artifacts from this workflow run.
|
Implements Phase 1 of #1968 — an opt-in desktop-scope modality for cua-driver (the "Computer-Use 1.0" vision-only, foreground, window-less, screen-absolute loop), complementing the default per-window background model.
What ships in Phase 1
capture_scope: "window" | "desktop"config (default"window"), on all three platforms, wired throughset_config/get_configlike the existingcapture_mode. macOS persists it (+ session overrides via a neweffective_scope()); Windows/Linux keep it in-memory, matching their existing config behaviour.get_desktop_statetool on all three platforms — vision-only full-display screenshot via the existingcapture::screenshot_display_bytes(), no downscale (true screen pixels so screen-absolute picks land), reporting both image dims and true screen size.structuredContent:platform, screenshot_width/height, screen_width/height, screen(scale)_factor, screenshot_mime_type(+screenshot_file_pathwhenscreenshot_out_fileis set). Over the CLI the PNG surfaces asscreenshot_png_b64, same asget_window_state.send_click_synthesized(alreadyABSOLUTE | VIRTUALDESK); addssend_wheel_synthesizedfor scroll. The window-less branch activates only whencapture_scope == "desktop"AND nopid/window_idANDx,ypresent; otherwise returns a structureddesktop_scope_disablederror pointing the caller atset_config capture_scope=desktop. Existing pid-based paths are unchanged.get_desktop_stateclaimsscreen.capture+screen.dimensions(existing tokens — additive, noCAPABILITY_VERSIONbump).Decisions baked in (were the open questions in planning)
get_desktop_state— exact screen pixels for the GUI loop.capture_scope=desktop, not inferred from missing pid/window_id (avoids reinterpreting legacyx,y+pid calls).effective_scope()rather than widening the existingeffective() -> (String,u32)tuple (zero churn to existing call sites).WindowFromPointkeeping the foreground-swap/UIPI path (what makes Chromium-content clicks land).get_desktop_stateworks everywhere; the screen-absolute actions are Windows-only in Phase 1 (macOSCGEvent::postand Linux XTest pointer primitives are Phase 2, explicitly out of scope here).Explicitly out of scope (later phases / the issue)
Desktop-wide SoM / cross-window element indices; a single
computer-tool wrapper;computer-serverwire-compat; multi-display (main display only); macOS/Linux foreground actuators (Phase 2).Testing — Windows Phase 1 validated live ✅
Windows — live, in an interactive desktop session on the test VM (driven through a running
cua-driver serve):get_desktop_state→ full-display capture returns a PNG + true screen size (1512×949). Confirms the earlier Session-0The handle is invalid (0x80070006)was only the service-session wall — capture works in a real desktop.click(no pid/window_id, desktop scope) →✅ Sent click via SendInput at screen (756,474) on HWND 0x103c8 (desktop scope)— lands on a real window viaWindowFromPoint.scroll→✅ Scrolled down via SendInput wheel (3 ticks) at screen (756,474) (desktop scope).capture_scope=window+ window-less click → structureddesktop_scope_disabledrejection. ✓platform-windowsunit tests pass natively (incl. the 9 new desktop-scope tests:is_windowless_desktop_action,wheel_mouse_datasign/magnitude,get_desktop_stateschema, config default).Other platforms:
cargo test -p cua-driver-core capability→ 11 passed (capability registration).cargo build -p platform-macosclean; 4 newcapture_scope/get_desktop_state/effective_scopetests pass. (Screen-absolute actuator is Phase 2.)Config persistence —
capture_scopesurvives across processes ✅Rebased on #2034 (config disk-persistence parity).
capture_scopeis now persisted to~/.cua-driver/config.jsonand reloaded at startup on all three platforms, so it survives across statelesscua-driver callinvocations:set_config capture_scope=desktopin one process is honored by theget_desktop_state/click/scrollof the next. Verified live on Windows + Linux — a separateget_configreads backdesktop, and a fresh statelessclickpasses the desktop-scope gate straight into actuation (no heldserve/MCP session required).Downstream
Unblocks a
--scope desktopswitch in the Geminicomputer-use-previewDesktopComputerbackend, doubling as a drop-in full-screen computer-use shim. Phase 1 (Windows) is enough for a first end-to-end Gemini demo on a Windows host.With #2034 merged + this branch rebased on it, the persisted
capture_scopemeans the backend can drive desktop-scope over the per-call CLI as-is (no persistent connection required) —set_config capture_scope=desktopsticks across calls.🤖 Generated with Claude Code
https://claude.ai/code/session_01KMXCW4M5uK1HRGjjH4wueZ
Summary by CodeRabbit
New Features
capture_scopesetting, letting users switch between window-based and desktop-based behavior.Bug Fixes