feat(computer): main-thread synthetic-input executor + CEF crash fix (1/8 of #3307)#3340
Conversation
Run enigo keyboard/mouse on the app main thread via a native-registry executor; enigo's macOS TSMGetInputSourceProperty traps off-thread and crashes the CEF host. Adds mouse/keyboard tools, the main_thread bridge, and downscaled screenshots so the model can see them. Slice 1/7 of tinyhumansai#3307 (was the 'computer control' area).
📝 WalkthroughWalkthroughAdds a main-thread executor for synthetic input (registered at Tauri startup) and routes keyboard/mouse operations through it; replaces large-screenshot base64 truncation with blocking downscaling to inline JPEG data-URLs and updates tests. ChangesMain Thread Input Execution
Screenshot Image Optimization
Sequence Diagram(s)sequenceDiagram
participant Core
participant TauriShell
participant AppMainThread
participant Enigo_macOS
Core->>TauriShell: request `computer.input_on_main_thread`(MainThreadInputOp)
TauriShell->>AppMainThread: run_on_main_thread(op.run)
AppMainThread->>Enigo_macOS: perform synthetic input operations
Enigo_macOS-->>AppMainThread: Result<String, String>
AppMainThread-->>TauriShell: Result<String, String>
TauriShell-->>Core: Result<String, String>
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/openhuman/tools/impl/browser/screenshot.rs (1)
170-183: 💤 Low valueConsider adding debug logging for the downscale path.
The new downscale logic handles multiple fallback dimensions and has several decision points, but no diagnostic logging. Adding
debug!ortrace!statements would help troubleshoot issues in production (e.g., why a particular dimension was selected, original vs final sizes).+ log::debug!( + "screenshot: oversized capture ({} bytes), downscaling...", + bytes.len() + ); match tokio::task::spawn_blocking(move || downscale_to_jpeg(&bytes, MAX_RAW_BYTES)).await { Ok(Ok((jpeg, w, h))) => { + log::debug!("screenshot: downscaled to {}x{} ({} bytes)", w, h, jpeg.len()); Ok(Self::data_url_result(As per coding guidelines: "Add substantial debug-level logs while implementing features or fixes in Rust using
log/tracingcrate with stable prefixes and correlation fields."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/tools/impl/browser/screenshot.rs` around lines 170 - 183, The downscale path invoked via tokio::task::spawn_blocking calling downscale_to_jpeg(&bytes, MAX_RAW_BYTES) lacks diagnostic logs; add debug-level tracing (e.g., tracing::debug! or debug!) around the downscale flow in screenshot.rs: log the original bytes length, MAX_RAW_BYTES, chosen output dimension(s) (w,h) returned by downscale_to_jpeg, which fallback branch was taken, and any error messages (include the error variable e) with a stable prefix and correlation field (e.g., "screenshot::downscale" and a request_id or path) so that the Ok(Ok((jpeg,w,h))) branch logs final dimensions and Ok(Err(e)) / Err(e) branches log the failure details before returning ToolResult via data_url_result or ToolResult::success/error.src/openhuman/tools/impl/computer/main_thread.rs (1)
33-48: ⚡ Quick winAdd debug logs around main-thread dispatch and failure branches.
This bridge is a critical cross-module dispatch path, but it currently has no trace/debug visibility for dispatch attempts or failures.
Proposed patch
pub async fn run_input_on_main<F>(op: F) -> Result<String, String> where F: FnOnce() -> Result<String, String> + Send + 'static, { + log::debug!("[computer-main-thread] dispatching synthetic input op"); let req = MainThreadInputOp { run: Box::new(op) }; match request_native_global::<MainThreadInputOp, Result<String, String>>( INPUT_ON_MAIN_THREAD_METHOD, req, @@ .await { - Ok(inner) => inner, - Err(e) => Err(format!( - "synthetic input requires the desktop app's main-thread executor (unavailable: {e})" - )), + Ok(inner) => { + if let Err(err) = &inner { + log::debug!("[computer-main-thread] main-thread input op failed: {err}"); + } + inner + } + Err(e) => { + log::debug!("[computer-main-thread] dispatch unavailable: {e}"); + Err(format!( + "synthetic input requires the desktop app's main-thread executor (unavailable: {e})" + )) + } } }As per coding guidelines: "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with stable grep-friendly prefixes".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/tools/impl/computer/main_thread.rs` around lines 33 - 48, The run_input_on_main bridge currently has no logging; add trace/debug logs around the dispatch and failure branches: log before calling request_native_global (include INPUT_ON_MAIN_THREAD_METHOD and that a MainThreadInputOp is being dispatched), log on successful Ok(inner) return (including a stable prefix and the returned value or summary), and log on Err(e) with the full error message (use same grep-friendly prefix). Update the function run_input_on_main to emit these logs using the existing logging facility, referencing MainThreadInputOp, request_native_global, and INPUT_ON_MAIN_THREAD_METHOD so dispatch attempts, successes, and failures are visible.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src-tauri/src/lib.rs`:
- Around line 2816-2818: Document the newly registered native IPC method
register_native_global::<MainThreadInputOp, Result<String, String>, _,
_>(INPUT_ON_MAIN_THREAD_METHOD) in the Tauri shell architecture doc
(tauri-shell.md): add a short purpose/summary for INPUT_ON_MAIN_THREAD_METHOD,
specify the exact request shape (fields/types) and the response shape
(Result<String, String> success/error format), mark that this is a
desktop-only/native IPC surface, and include a minimal example request and
expected response to illustrate usage and error semantics.
In `@src/openhuman/tools/impl/computer/keyboard.rs`:
- Around line 190-201: The current closure passed to run_input_on_main (invoked
in the type path calling Enigo::new and enigo.text within into_result) holds the
Tauri main thread for the entire text/Hotkey sequence; refactor so only the
minimal Enigo call runs on the main thread: split long text or hotkey combos
into small chunks (e.g., per-character or small-buffer), perform all sequencing,
delays (HOTKEY_INTER_KEY_DELAY) and async sleeps off the main thread, and invoke
run_input_on_main/run_input_on_main-like helper repeatedly for each chunk to
call Enigo::new() once per dispatch and execute only the short
enigo.text(&chunk) action on the main thread. Apply the same chunked dispatch
pattern to the similar block referenced around lines 286-321.
In `@src/openhuman/tools/impl/computer/mouse.rs`:
- Around line 230-239: The closure passed to run_input_on_main currently
constructs Enigo and calls humanized_move (and similar humanized_* functions) on
the app main thread, which blocks UI; instead, compute the waypoint sequence and
pacing off-thread and only perform the minimal Enigo calls on the main thread.
Concretely: move creation of the waypoint list and sleep/timing logic out of the
run_input_on_main closure (call humanized_move or a new helper that returns
Vec<(x,y,delay)> from the worker thread), then inside the run_input_on_main
closure create Enigo (Enigo::new) and iterate the precomputed waypoints applying
enigo.mouse_move_to / button events without sleeping; for pacing use a
background timer that schedules main-thread enigo calls or marshal each enigo
call back to the main thread with short, non-blocking hops. Apply the same
approach to the other usages referenced (humanized_click,
humanized_double_click, humanized_drag) so only the actual Enigo operations run
on the main thread.
---
Nitpick comments:
In `@src/openhuman/tools/impl/browser/screenshot.rs`:
- Around line 170-183: The downscale path invoked via
tokio::task::spawn_blocking calling downscale_to_jpeg(&bytes, MAX_RAW_BYTES)
lacks diagnostic logs; add debug-level tracing (e.g., tracing::debug! or debug!)
around the downscale flow in screenshot.rs: log the original bytes length,
MAX_RAW_BYTES, chosen output dimension(s) (w,h) returned by downscale_to_jpeg,
which fallback branch was taken, and any error messages (include the error
variable e) with a stable prefix and correlation field (e.g.,
"screenshot::downscale" and a request_id or path) so that the Ok(Ok((jpeg,w,h)))
branch logs final dimensions and Ok(Err(e)) / Err(e) branches log the failure
details before returning ToolResult via data_url_result or
ToolResult::success/error.
In `@src/openhuman/tools/impl/computer/main_thread.rs`:
- Around line 33-48: The run_input_on_main bridge currently has no logging; add
trace/debug logs around the dispatch and failure branches: log before calling
request_native_global (include INPUT_ON_MAIN_THREAD_METHOD and that a
MainThreadInputOp is being dispatched), log on successful Ok(inner) return
(including a stable prefix and the returned value or summary), and log on Err(e)
with the full error message (use same grep-friendly prefix). Update the function
run_input_on_main to emit these logs using the existing logging facility,
referencing MainThreadInputOp, request_native_global, and
INPUT_ON_MAIN_THREAD_METHOD so dispatch attempts, successes, and failures are
visible.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3849692f-1e2e-448e-ba07-c42b9f1b4443
📒 Files selected for processing (6)
app/src-tauri/src/lib.rssrc/openhuman/tools/impl/browser/screenshot.rssrc/openhuman/tools/impl/computer/keyboard.rssrc/openhuman/tools/impl/computer/main_thread.rssrc/openhuman/tools/impl/computer/mod.rssrc/openhuman/tools/impl/computer/mouse.rs
- Document the computer.input_on_main_thread native-registry method in gitbooks/developing/architecture/tauri-shell.md (CodeRabbit). - Unit-test run_input_on_main Ok/Err passthrough via a registered executor.
JPEG has no alpha channel, so an RGBA capture (PNG screenshots often carry one) would fail to encode and return no inline image, leaving vision-driven control blind. Convert the thumbnail to RGB first; add an RGBA regression test.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/tools/impl/browser/screenshot.rs (1)
209-229: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winAdd debug-level logging for downscale attempts and outcome.
The
downscale_to_jpegfunction implements iterative downscaling but lacks instrumentation. As per coding guidelines, new Rust features should include debug/trace logs for entry/exit, branch decisions, and outcomes with stable grep-friendly prefixes.Consider logging:
- Entry: input byte size
- Each iteration: attempted
max_dim, resulting JPEG byte count- Outcome: which dimension succeeded or fallback return
This will make it easy to trace why a particular downscale dimension was chosen and diagnose inline-budget issues.
📊 Example logging additions
fn downscale_to_jpeg(bytes: &[u8], max_bytes: usize) -> Result<(Vec<u8>, u32, u32), String> { + log::debug!("downscale_to_jpeg: input {} bytes, budget {} bytes", bytes.len(), max_bytes); let img = image::load_from_memory(bytes).map_err(|e| format!("decode: {e}"))?; let mut last: Option<(Vec<u8>, u32, u32)> = None; for max_dim in [1568u32, 1280, 1024, 768, 600] { let thumb = img.thumbnail(max_dim, max_dim).to_rgb8(); let (w, h) = (thumb.width(), thumb.height()); let mut buf = std::io::Cursor::new(Vec::new()); image::codecs::jpeg::JpegEncoder::new_with_quality(&mut buf, 72) .encode_image(&thumb) .map_err(|e| format!("jpeg encode: {e}"))?; let out = buf.into_inner(); + log::debug!("downscale_to_jpeg: tried {}×{} → {} bytes", w, h, out.len()); if out.len() <= max_bytes { + log::debug!("downscale_to_jpeg: success at {}×{}, {} bytes", w, h, out.len()); return Ok((out, w, h)); } last = Some((out, w, h)); } + if let Some((_, w, h)) = &last { + log::debug!("downscale_to_jpeg: fallback to smallest attempt {}×{}", w, h); + } last.ok_or_else(|| "could not produce a fitting JPEG".to_string()) }As per coding guidelines: "Add substantial debug-level logs while implementing features or fixes in Rust using
log/tracingcrate with stable prefixes and correlation fields."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/tools/impl/browser/screenshot.rs` around lines 209 - 229, Add debug-level instrumentation to downscale_to_jpeg: at function entry log the input bytes length (e.g., "screenshot:downscale entry bytes=%d"); inside the for loop log the attempted max_dim before thumbnailing and log the resulting JPEG size and dimensions after encoding (e.g., "screenshot:downscale attempt max_dim=%d out_bytes=%d w=%d h=%d"); when returning early on success log the chosen max_dim/dimensions with a stable prefix (e.g., "screenshot:downscale success ..."); and when falling back to last or error log the final decision ("screenshot:downscale fallback out_bytes=%d w=%d h=%d" or "screenshot:downscale error ..."). Use the project's tracing/log crate, debug level, and keep the prefix "screenshot:downscale" so grep can find all entries; add logs at the start, after each image::codecs::jpeg::JpegEncoder encode_image call, on the Ok return, and on the err/fallback path (referencing function name downscale_to_jpeg, variable max_dim, out, w, h, and last).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/openhuman/tools/impl/browser/screenshot.rs`:
- Around line 209-229: Add debug-level instrumentation to downscale_to_jpeg: at
function entry log the input bytes length (e.g., "screenshot:downscale entry
bytes=%d"); inside the for loop log the attempted max_dim before thumbnailing
and log the resulting JPEG size and dimensions after encoding (e.g.,
"screenshot:downscale attempt max_dim=%d out_bytes=%d w=%d h=%d"); when
returning early on success log the chosen max_dim/dimensions with a stable
prefix (e.g., "screenshot:downscale success ..."); and when falling back to last
or error log the final decision ("screenshot:downscale fallback out_bytes=%d
w=%d h=%d" or "screenshot:downscale error ..."). Use the project's tracing/log
crate, debug level, and keep the prefix "screenshot:downscale" so grep can find
all entries; add logs at the start, after each image::codecs::jpeg::JpegEncoder
encode_image call, on the Ok return, and on the err/fallback path (referencing
function name downscale_to_jpeg, variable max_dim, out, w, h, and last).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 83d09e2c-27e7-4d10-afcf-834bd3c76ee0
📒 Files selected for processing (1)
src/openhuman/tools/impl/browser/screenshot.rs
oxoxDev
left a comment
There was a problem hiding this comment.
Foundation slice of the computer-control EPIC — the main-thread SIGTRAP fix and screenshot downscale are solid, but the synthetic-input tools aren't approval-gated, which blocks this.
Blocker — mouse/keyboard execute without the ApprovalGate
They're default-off (registered only under computer_control.enabled, good), but neither tool overrides external_effect_with_args, so it inherits the trait default false. The ApprovalGate is triggered exclusively by external_effect_with_args (agent/harness/engine/tools.rs:162) — PermissionLevel::Dangerous does not trigger it (it's only a static channel-capability filter). So once computer_control.enabled=true, raw coordinate clicks and arbitrary keystrokes run with no approval prompt, including on background/auto-approved turns, with no sensitive-app denylist (strictly more dangerous than ax_interact, which at least has SENSITIVE_APPS). Keystrokes can land in a focused sudo/password field or Terminal unattended. This misses the bar #3168 set (opt-in and approval-gated).
Fix: add fn external_effect(&self) -> bool { true } to both MouseTool and KeyboardTool (every action mutates the host — no per-arg discrimination needed). Coordinate input can't denylist by app, so the gate is the only real boundary here.
Note: the user_filter.rs comment "PermissionLevel::Dangerous so the approval gate fires per-action" is factually wrong (Dangerous doesn't fire the gate), and ax_interact.rs's comment claims it "mirrors the opt-in posture of the mouse/keyboard tools" — i.e. ax_interact's author believed these were already gated. Worth correcting both while here.
Major — main-thread executor blocks the UI thread
The enigo-on-main-thread dispatch correctly fixes the TSMGetInputSourceProperty SIGTRAP, but run_on_main_thread runs the entire op on-main, including its std::thread::sleeps (hotkey 20ms x N keys; humanized-move per-waypoint dwells — hundreds of ms on a full-screen drag). That blocks the CEF/UI main thread for the full duration of every synthetic-input op → UI jank on an always-on path. Keep only the macOS layout lookup on-main and run the movement/sleeps off-thread; and wrap the closure in catch_unwind so an enigo FFI panic can't unwind on the main thread.
Clean
Screenshot downscale is correct (aspect-ratio preserved, RGBA handled, no panic on tiny/huge, no new info leak). Headless path fails closed. No-receiver case returns a clean cancel error.
Nit: mouse coordinates are bounded to 0..=32768, not actual screen/monitor bounds — fine as a sanity guard, but underscores that the ApprovalGate is the real safety boundary.
This is the foundation slice the rest of the stack builds on (slice 3/#3342 wires these to the agent), so the gating should be fixed here before the train rolls.
| .await? | ||
| into_result( | ||
| "type", | ||
| run_input_on_main(move || { |
There was a problem hiding this comment.
Blocker (gating): this dispatches arbitrary keystrokes, but KeyboardTool never overrides external_effect_with_args, so it inherits false and never routes through the ApprovalGate (the gate fires only on external_effect_with_args at engine/tools.rs:162 — PermissionLevel::Dangerous does NOT trigger it). With computer_control.enabled=true on a background/auto-approved turn, type/press/hotkey can type into a focused sudo/password field or Terminal unattended, and there's no sensitive-app denylist here (unlike ax_interact's SENSITIVE_APPS). Add fn external_effect(&self) -> bool { true } to KeyboardTool.
| .await? | ||
| into_result( | ||
| "move", | ||
| run_input_on_main(move || { |
There was a problem hiding this comment.
Blocker (gating): same issue as keyboard — MouseTool doesn't override external_effect_with_args, so blind coordinate clicks execute without the ApprovalGate. Raw mouse input has no app/element scoping (can't denylist), so the gate is the only real boundary. Add fn external_effect(&self) -> bool { true } to MouseTool. (Also: the user_filter.rs comment claiming Dangerous fires the gate is incorrect — worth fixing.)
| let (tx, rx) = tokio::sync::oneshot::channel(); | ||
| let run = req.run; | ||
| input_app | ||
| .run_on_main_thread(move || { |
There was a problem hiding this comment.
Major: this runs the whole enigo op on the main thread, including its std::thread::sleeps (hotkey inter-key delays; humanized-move per-waypoint dwells — hundreds of ms on a big drag), blocking the CEF/UI main thread for the op duration → input jank on an always-on path. Consider keeping only the macOS layout lookup on-main and running movement/sleeps off-thread; and wrap (run)() in std::panic::catch_unwind so an enigo FFI panic can't unwind on the main thread (currently it'd drop tx → caller sees a clean cancel, but the main-thread panic itself is the risk).
…ansai#3340 review) Per @oxoxDev: MouseTool/KeyboardTool inherited external_effect=false, so neither hit the ApprovalGate — PermissionLevel::Dangerous alone does NOT trigger it (the gate keys off external_effect_with_args). With computer_control.enabled, blind clicks / arbitrary keystrokes could run unattended on an auto-approved turn, with no sensitive-app denylist. - Override external_effect → true on both tools (gate every action). - Wrap the main-thread input executor in catch_unwind so an enigo FFI panic can't unwind across the app main thread. - Correct the user_filter.rs / ax_interact.rs comments that wrongly claimed Dangerous fires the gate. - Tests: assert both tools route through the gate.
…ps (Phase 1.5)
Adds a model-chosen `vision_click { description }` action to the `automate`
loop for apps that expose no usable accessibility tree (Slack, Discord,
VS Code). Flow: screenshot the app window -> ask the main vision model for the
target's pixel coordinates (via the existing `[IMAGE:]` marker path) -> map
image pixels to absolute screen points -> guarded left-click.
- New `accessibility/vision_click.rs`: pure `image_to_screen` coordinate
transform (folds in the deferred F2 mapping -- the px->pt ratio absorbs the
capture downscale + Retina backing scale), tolerant locate-response parser,
capture geometry, and the main-thread guarded click (`run_input_on_main`,
Change 1.15).
- Section 1.8 safety guard: only clicks when the target app is frontmost;
refuses on positive evidence another app is focused, so synthetic input never
lands on OpenHuman's own CEF window.
- Reuses the main `chat` vision provider -- no new inference API, no new tool,
no new approval surface (inherits `automate`'s Dangerous + mutations gate).
- 19 new unit tests (pure transform/parse + scripted-backend loop integration,
incl. the frontmost-refusal guard). All 25 automate + vision_click tests green.
Closes the last open Phase 1.5 item (tinyhumansai#3148). Stacks on tinyhumansai#3340-tinyhumansai#3346.
Summary
Slice 1/8 of #3307 — computer-control input primitives + the CEF crash fix.
TSMGetInputSourcePropertytraps off-thread (SIGTRAP) and crashes the CEF host; dispatching throughrun_on_main_threadfixes it.mouse/keyboardtools and themain_threadbridge (INPUT_ON_MAIN_THREAD_METHOD).screenshotoutput so the model can actually see it.Files (6)
tools/impl/computer/{mouse,keyboard,main_thread,mod}.rs,tools/impl/browser/screenshot.rs,app/src-tauri/src/lib.rs(main-thread executor registration only).Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Documentation