fix(core): prevent SIGBUS stack overflow in composio tool path#2069
Conversation
Production crash (`SIGBUS / KERN_PROTECTION_FAILURE` at a `tokio-rt-worker` stack guard page) when the in-process core ran a chat-driven `delegate_to_integrations_agent → integrations_agent → composio_list_tools → load_config_with_timeout` chain. The toml parser's serde Visitor frames piled on top of ~50 frames of agent tower and breached the 2 MB worker stack. Fix: * Move `toml::from_str::<Config>` onto the blocking pool via `spawn_blocking` (`parse_toml_off_worker`) so the parser runs on a fresh thread stack with no async tower above it. * Install a custom `tauri::async_runtime` with `thread_stack_size(8 MiB)` at the top of `pub fn run()` so the in-process core inherits 4× the prior headroom for everything else in the tower. * Add `tests/composio_list_tools_stack_overflow_regression.rs` — a structural guard that drives `run_subagent(integrations_agent) → composio_list_tools → load_config_with_timeout` on a production- shaped 2 MB worker. Module docs explain what it does and does not catch (we can't easily mock the upper chat-channel layers in cargo-test, so the bare path fits in 2 MB even without the parser- move; the test serves as a structural regression).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughInstalls a custom Tokio runtime with 8 MiB thread stacks in Tauri, offloads TOML deserialization to Tokio's blocking pool, updates related docs/comments, and adds a regression test that reproduces the low-worker-stack SIGBUS scenario. ChangesStack Overflow Prevention
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/composio_list_tools_stack_overflow_regression.rs (1)
323-327: ⚡ Quick winAssert that the composio path was actually exercised, not just “no crash.”
Right now the test passes on mere task completion. Add a small execution assertion (e.g., provider turn count) so early-return regressions don’t produce false positives.
Suggested strengthening
- rt.block_on(async { - tokio::spawn(drive_subagent()) - .await - .expect("subagent task must complete without SIGBUS / panic"); - }); + rt.block_on(async { + let turns = tokio::spawn(drive_subagent()) + .await + .expect("subagent task must complete without SIGBUS / panic"); + assert!( + turns >= 2, + "expected at least one tool-call roundtrip through composio_list_tools" + ); + }); } -async fn drive_subagent() { +async fn drive_subagent() -> usize { @@ - let _ = with_parent_context(parent, async move { + let _ = with_parent_context(parent, async move { run_subagent( &def, "list available gmail actions", SubagentRunOptions::default(), ) .await }) .await; + *provider.iter.lock() }Also applies to: 330-382
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/composio_list_tools_stack_overflow_regression.rs` around lines 323 - 327, The test currently only awaits tokio::spawn(drive_subagent()) for completion; change it to also assert that the composio path was exercised by reading and asserting a runtime-visible counter/metric after the task completes (e.g., check a provider turn counter or a probes struct exposed by drive_subagent or the provider), such as reading provider_turn_count (or an equivalent atomic/metric returned from or shared with drive_subagent) and assert it is > 0; apply the same addition to the other block covering lines 330-382 so both assertions validate that the provider/Composio path ran rather than merely completing without a crash.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 473-474: The doc comment mentioning that
config::ops::load_config_with_timeout "is an additional optimization that avoids
paying the parse on repeat calls" is stale; update the rationale in load.rs to
remove or reword that claim so it no longer asserts a cache optimization for
load_config_with_timeout (reference the symbol load_config_with_timeout in the
comment block) and instead document the current behavior accurately (e.g., that
the caching optimization was reverted or that no additional caching is
performed).
---
Nitpick comments:
In `@tests/composio_list_tools_stack_overflow_regression.rs`:
- Around line 323-327: The test currently only awaits
tokio::spawn(drive_subagent()) for completion; change it to also assert that the
composio path was exercised by reading and asserting a runtime-visible
counter/metric after the task completes (e.g., check a provider turn counter or
a probes struct exposed by drive_subagent or the provider), such as reading
provider_turn_count (or an equivalent atomic/metric returned from or shared with
drive_subagent) and assert it is > 0; apply the same addition to the other block
covering lines 330-382 so both assertions validate that the provider/Composio
path ran rather than merely completing without a crash.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 02917c27-90bb-41fe-8b88-0b4c6cdf0ab3
📒 Files selected for processing (4)
app/src-tauri/src/lib.rssrc/openhuman/config/ops.rssrc/openhuman/config/schema/load.rstests/composio_list_tools_stack_overflow_regression.rs
graycyrus
left a comment
There was a problem hiding this comment.
Walkthrough
Solid fix for the SIGBUS stack overflow in the composio tool path. The two-pronged approach — moving toml::from_str::<Config> onto the blocking pool via spawn_blocking and bumping the Tauri tokio worker stack to 8 MiB — addresses both the immediate trigger and gives future headroom. The regression test is well-structured and honestly documents its own limitations. Only one minor stale-docs nit.
Change Summary
| File | Change type | Description |
|---|---|---|
app/src-tauri/src/lib.rs |
Modified | Custom tokio runtime with 8 MiB worker stacks, set before any Tauri async call |
src/openhuman/config/ops.rs |
Modified | Doc comments explaining why spawn_blocking instead of caching |
src/openhuman/config/schema/load.rs |
Modified | New parse_toml_off_worker helper; parse_config_with_recovery uses it for both primary and backup parse paths |
tests/composio_list_tools_stack_overflow_regression.rs |
Added | Regression test driving run_subagent → composio_list_tools → load_config_with_timeout on a 2 MiB worker stack |
Per-file notes
lib.rs — The std::mem::forget + tauri::async_runtime::set pattern is correct per Tauri's docs. Block comment thoroughly explains the rationale and the "must call before any other tauri async call" ordering constraint.
load.rs — parse_toml_off_worker is clean: takes ownership of the string (required for 'static in spawn_blocking), flattens the join error into the same String error path, and callers don't need to change their error handling. Both the primary parse and the backup-file recovery path go through it.
ops.rs — Doc-only change, accurately describes why caching was abandoned.
Test file — Excellent module-level documentation explaining the crash, why existing tests missed it, and the honest caveat about what the test does/doesn't catch. Setup mirrors production constraints (2 MiB worker stack, representative config TOML, stub provider/memory).
Findings
[minor] tests/composio_list_tools_stack_overflow_regression.rs:387-390 — Stale doc comment says load_config_with_timeout "is fronted by a process-global cache keyed on OPENHUMAN_WORKSPACE, invalidated by Config::save()" and that hot-path consumers "get a clone, never re-entering the parser." But the PR description and the ops.rs doc update both confirm the cache was tried and reverted because it caused 6 flakes in json_rpc_e2e.rs. This block should be removed or rewritten to match reality (spawn_blocking only, no cache). Same class of stale-docs issue that CodeRabbit caught in load.rs:474 (fixed in cc141de), but this instance in the test file was missed.
Summary
EXC_BAD_ACCESS (SIGBUS)in the in-process core when a chat-driven Gmail (or other composio) tool call runs throughdelegate_to_integrations_agent → integrations_agent → composio_list_tools → load_config_with_timeout. The TOML parse forConfigblew the 2 MB tokio worker stack guard page on top of the ~50-frame agent tower.toml::from_str::<Config>onto the tokio blocking pool viaspawn_blockingso the parser runs on a fresh thread stack.tauri::async_runtimewiththread_stack_size(8 MiB)at the top ofpub fn run()so the in-process core'stokio::spawn(run_server_embedded(..))inherits 4× the prior headroom for everything else in the tower.composio_list_tools→load_config_with_timeoutchain on a production-shaped 2 MB worker.Problem
User report: the Tauri window quits without warning when sending a Gmail-related agent request. macOS crash report (
crahs.log):The address falls inside the stack guard page of a 2 MB tokio worker. Stack frames at the top of the crashed thread were
toml_parser::on_array_open → value → … → toml::from_str → Config::load_or_init → load_config_with_timeout → ComposioListToolsTool::execute → subagent_runner::run_inner_loop → SkillDelegationTool::execute → Agent::execute_tools → web channel run_chat_task. About 100 frames total, with the toml parser's serde-monomorphisedVisitorframes for the deeply-nestedConfig(KB-sized per frame) tipping the worker over.composio_list_toolsreloadsConfigfrom disk on every invocation (per #1710 Wave 4, so a mid-sessioncomposio.modetoggle is observed). That's the trigger; the underlying issue is that the toml parser + agent tower together no longer fit in a 2 MB worker stack.Solution
Two complementary changes:
src/openhuman/config/schema/load.rs— wraptoml::from_str::<Config>intokio::task::spawn_blockingvia new helperparse_toml_off_worker. The blocking pool thread starts fresh (no async tower above the parse), so the serdeVisitorframes no longer compound with the caller's frames.parse_config_with_recoverynow routes both the primary parse and the backup-recovery parse through the helper.app/src-tauri/src/lib.rs— at the very top ofpub fn run(), build a customtokio::runtime::Builder::new_multi_thread().thread_stack_size(8 * 1024 * 1024)runtime andtauri::async_runtime::set(handle)it before any other tauri call. The in-process core runs viatokio::spawn(run_server_embedded(..))on this runtime, so every JSON-RPC handler gets 4× the prior worker stack headroom. The runtime is leaked (per Tauri's contract: "you cannot drop the underlying TokioRuntime").Tried but reverted: an
Arc<Config>cache frontingload_config_with_timeoutthat would skip the per-call parse entirely. Worked correctness-wise in lib unit tests but caused 6 flakes intests/json_rpc_e2e.rs(in-process JSON-RPC servers loading config mid-mutation, race-prone even with mtime checks). The two changes above carry the production stack-overflow fix without it.Regression test (
tests/composio_list_tools_stack_overflow_regression.rs) is a structural guard: it drivesrun_subagent(integrations_agent) → composio_list_tools → load_config_with_timeouton a production-shaped 2 MB worker with a stubbedProvider. Module docs explain what it does and does not catch (we can't easily mock the upper chat-channel layers in cargo-test, so the bare path fits in 2 MB even without the parser-move; the test catches future structural regressions in the path).Submission Checklist
tests/composio_list_tools_stack_overflow_regression.rsexercises the changed path;pnpm test:rustis clean (45 integration + 7602 lib tests + new regression).crahs.log).Impact
spawn_blockinground-trip). Imperceptible — the parse is microseconds and already ran on tokio time.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
Validation Run
runtime_dispatch::message_dispatch_processes_messages_in_parallelcleared on re-run, also flaked on baseline without changes — unrelated), `cargo test --test json_rpc_e2e` (45 passed)Validation Blocked
Behavior Changes
composio_list_tools/ per-action composio tools. No surface or API change.Parity Contract
load_config_with_timeoutsemantics unchanged; the per-call reload behavior added in Prioritize fully local speech and Composer operation #1710 Wave 4 is retained.parse_config_with_recoverystill falls back to.bakrecovery; newparse_toml_off_workerreturns errors via the same(Config, was_corrupted)channel as the original inline parse.Duplicate / Superseded PR Handling
Summary by CodeRabbit
Bug Fixes
Documentation
Tests