
fix(core): prevent SIGBUS stack overflow in composio tool path#2069

Merged
senamakel merged 3 commits into
tinyhumansai:mainfrom
senamakel:fix/composio-stack-overflow
May 18, 2026

Conversation

Member

@senamakel senamakel commented May 18, 2026

Summary

  • Fix EXC_BAD_ACCESS (SIGBUS) in the in-process core when a chat-driven Gmail (or other composio) tool call runs through delegate_to_integrations_agent → integrations_agent → composio_list_tools → load_config_with_timeout. The TOML parse for Config blew the 2 MB tokio worker stack guard page on top of the ~50-frame agent tower.
  • Move toml::from_str::<Config> onto the tokio blocking pool via spawn_blocking so the parser runs on a fresh thread stack.
  • Install a custom tauri::async_runtime with thread_stack_size(8 MiB) at the top of pub fn run() so the in-process core's tokio::spawn(run_server_embedded(..)) inherits 4× the prior headroom for everything else in the tower.
  • Add a structural regression test that drives the full sub-agent → composio_list_tools → load_config_with_timeout chain on a production-shaped 2 MB worker.

Problem

User report: the Tauri window quits without warning when sending a Gmail-related agent request. macOS crash report (crahs.log):

Triggered by Thread: 46  tokio-rt-worker
Exception Type:   EXC_BAD_ACCESS (SIGBUS)
Exception Subtype: KERN_PROTECTION_FAILURE at 0x3028534f0
Exception Message: Could not determine thread index for stack guard region

The address falls inside the stack guard page of a 2 MB tokio worker. Stack frames at the top of the crashed thread were toml_parser::on_array_open → value → … → toml::from_str → Config::load_or_init → load_config_with_timeout → ComposioListToolsTool::execute → subagent_runner::run_inner_loop → SkillDelegationTool::execute → Agent::execute_tools → web channel run_chat_task. About 100 frames total, with the toml parser's serde-monomorphised Visitor frames for the deeply-nested Config (KB-sized per frame) tipping the worker over.

composio_list_tools reloads Config from disk on every invocation (per #1710 Wave 4, so a mid-session composio.mode toggle is observed). That's the trigger; the underlying issue is that the toml parser + agent tower together no longer fit in a 2 MB worker stack.

Solution

Two complementary changes:

  1. src/openhuman/config/schema/load.rs — wrap toml::from_str::<Config> in tokio::task::spawn_blocking via new helper parse_toml_off_worker. The blocking pool thread starts fresh (no async tower above the parse), so the serde Visitor frames no longer compound with the caller's frames. parse_config_with_recovery now routes both the primary parse and the backup-recovery parse through the helper.

  2. app/src-tauri/src/lib.rs — at the very top of pub fn run(), build a custom runtime with tokio::runtime::Builder::new_multi_thread().thread_stack_size(8 * 1024 * 1024) and register its handle via tauri::async_runtime::set(handle) before any other tauri call. The in-process core runs via tokio::spawn(run_server_embedded(..)) on this runtime, so every JSON-RPC handler gets 4× the prior worker stack headroom. The runtime is leaked (per Tauri's contract: "you cannot drop the underlying TokioRuntime").
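
A minimal sketch of the idea behind the first change, written against the standard library only so it stands alone. std::thread::Builder is swapped in for tokio::task::spawn_blocking, and parse_depth is a hypothetical recursive parser standing in for toml::from_str::<Config>. The name parse_off_worker echoes parse_toml_off_worker, but the body is an assumption and not the code in load.rs.

```rust
use std::thread;

/// Hypothetical recursive-descent parser: returns the maximum '[' nesting
/// depth, recursing once per opening bracket (as a TOML array parser would).
fn parse_depth(input: &[u8], pos: &mut usize) -> Result<usize, String> {
    let mut max_child = 0;
    while *pos < input.len() {
        match input[*pos] {
            b'[' => {
                *pos += 1;
                let child = parse_depth(input, pos)?;
                if *pos >= input.len() || input[*pos] != b']' {
                    return Err("unclosed '['".to_string());
                }
                *pos += 1;
                max_child = max_child.max(child + 1);
            }
            // Let the caller consume the closing bracket.
            b']' => return Ok(max_child),
            _ => *pos += 1,
        }
    }
    Ok(max_child)
}

/// Runs the parse on a freshly spawned thread, so the parser's frames start
/// at the bottom of an empty stack instead of on top of the caller's frames.
/// Takes an owned String because the closure must be 'static.
fn parse_off_worker(raw: String) -> Result<usize, String> {
    let handle = thread::Builder::new()
        .name("config-parse".into())
        .stack_size(2 * 1024 * 1024)
        .spawn(move || -> Result<usize, String> {
            let mut pos = 0;
            let depth = parse_depth(raw.as_bytes(), &mut pos)?;
            if pos != raw.len() {
                return Err("unexpected ']'".to_string());
            }
            Ok(depth)
        })
        .map_err(|e| format!("failed to spawn parse thread: {e}"))?;
    // Flatten a panic in the parse thread into the same String error path.
    handle
        .join()
        .map_err(|_| "parse thread panicked".to_string())?
}
```

Calling parse_off_worker("[[a],[b]]".to_string()) yields Ok(2). The stack size stays at 2 MiB here to show that the benefit comes from the fresh stack, not from a larger one.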

Tried but reverted: an Arc<Config> cache fronting load_config_with_timeout that would skip the per-call parse entirely. Worked correctness-wise in lib unit tests but caused 6 flakes in tests/json_rpc_e2e.rs (in-process JSON-RPC servers loading config mid-mutation, race-prone even with mtime checks). The two changes above carry the production stack-overflow fix without it.

Regression test (tests/composio_list_tools_stack_overflow_regression.rs) is a structural guard: it drives run_subagent(integrations_agent) → composio_list_tools → load_config_with_timeout on a production-shaped 2 MB worker with a stubbed Provider. Module docs explain what it does and does not catch (we can't easily mock the upper chat-channel layers in cargo-test, so the bare path fits in 2 MB even without the parser-move; the test catches future structural regressions in the path).
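
The shape of such a structural guard can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of the test file: run_on_worker_sized_stack and nested_call are hypothetical names, and a plain std thread stands in for a tokio worker.

```rust
use std::thread;

/// Stack size of a default tokio worker, as described in the PR.
const WORKER_STACK_BYTES: usize = 2 * 1024 * 1024;

/// Runs `work` on a thread sized like a production worker. A panic comes back
/// as Err. A real guard-page hit aborts the whole process, which also fails
/// the test run, just less gracefully.
fn run_on_worker_sized_stack<T, F>(work: F) -> Result<T, String>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .name("worker-sized".into())
        .stack_size(WORKER_STACK_BYTES)
        .spawn(work)
        .map_err(|e| format!("spawn failed: {e}"))?
        .join()
        .map_err(|_| "workload panicked".to_string())
}

/// Hypothetical workload: recursion with roughly 1 KiB of locals per frame,
/// standing in for the sub-agent call chain. Returns `depth`.
fn nested_call(depth: usize) -> usize {
    let scratch = std::hint::black_box([0u8; 1024]);
    if depth == 0 {
        usize::from(scratch[0])
    } else {
        1 + nested_call(depth - 1) + usize::from(scratch[depth % 1024])
    }
}
```

A test would then assert that run_on_worker_sized_stack(|| nested_call(100)) returns Ok(100). As with the real test, a path that fits comfortably in 2 MiB only guards against future growth of the chain.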

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • N/A: behaviour-only fix; no new product surface to measure diff coverage on. tests/composio_list_tools_stack_overflow_regression.rs exercises the changed path; pnpm test:rust is clean (45 integration + 7602 lib tests + new regression).
  • N/A: behaviour-only change — no feature rows added/removed/renamed.
  • N/A: no feature IDs touched.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • N/A: change does not touch release-cut surfaces (in-process core stack budget + a regression test).
  • N/A: no linked issue — surfaced from a user-supplied macOS crash dump (crahs.log).

Impact

  • Runtime/platform: desktop only. Affects the in-process Rust core inside the Tauri shell. 8 MiB worker stacks × ~CPU-count workers raise virtual-address reservation modestly (physical memory is only paged in when touched). 8 MiB matches the macOS main-thread default and the usual Linux default (ulimit -s 8192); it is above the 1 MiB Windows default, but each platform accepts an explicitly requested 8 MiB thread stack, so no platform reaches a hard limit.
  • Performance: per-call config reload now hops to the blocking pool (one spawn_blocking round-trip). Imperceptible — the parse is microseconds and already ran on tokio time.
  • Security: none.
  • Compatibility: no API changes. Internal-only.
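
The reservation arithmetic behind the runtime/platform note can be checked with a short sketch. The helper names are hypothetical, and the 10-worker figure used below is an example machine, not a measurement from this PR.

```rust
use std::thread;

const MIB: usize = 1024 * 1024;

/// Upper bound on address space reserved for worker stacks.
fn reserved_stack_bytes(workers: usize, stack_bytes: usize) -> usize {
    workers * stack_bytes
}

/// Extra reservation, in MiB, when the per-thread stack grows from `old` to `new`.
fn extra_reservation_mib(workers: usize, old: usize, new: usize) -> usize {
    (reserved_stack_bytes(workers, new) - reserved_stack_bytes(workers, old)) / MIB
}

/// Assumes one worker per logical CPU, which is the default for tokio's
/// multi-thread runtime; falls back to 1 if the count is unavailable.
fn default_worker_count() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}
```

With 10 workers, extra_reservation_mib(10, 2 * MIB, 8 * MIB) is 60, so the move from 2 MiB to 8 MiB stacks reserves 60 MiB more address space on that example machine.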

Related

  • Closes:
  • Follow-up PR(s)/TODOs: a per-call config cache would eliminate the per-tool parse entirely (the most efficient fix) but needs care around in-process integration tests that swap workspace paths mid-flight; tracked as a follow-up.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: `fix/composio-stack-overflow`
  • Commit SHA: `9a311b608`

Validation Run

  • `pnpm --filter openhuman-app format:check` (ran via pre-push hook; auto-fixed and re-committed)
  • N/A: no TS changes.
  • Focused tests: `cargo test --test composio_list_tools_stack_overflow_regression` (1 passed), `cargo test --lib` (7602 passed; transient timing flake on runtime_dispatch::message_dispatch_processes_messages_in_parallel cleared on re-run, also flaked on baseline without changes — unrelated), `cargo test --test json_rpc_e2e` (45 passed)
  • Rust fmt/check (if changed): `cargo check --bin openhuman-core` clean
  • Tauri fmt/check (if changed): `cargo check --manifest-path app/src-tauri/Cargo.toml` clean

Validation Blocked

  • `command:` N/A
  • `error:` N/A
  • `impact:` N/A

Behavior Changes

  • Intended behavior change: composio tool calls no longer crash the in-process core under deep async towers.
  • User-visible effect: Tauri window stays alive when an agent sub-call uses composio_list_tools / per-action composio tools. No surface or API change.

Parity Contract

  • Legacy behavior preserved: load_config_with_timeout semantics unchanged; the per-call reload behavior added in #1710 Wave 4 is retained.
  • Guard/fallback/dispatch parity checks: parse_config_with_recovery still falls back to .bak recovery; new parse_toml_off_worker returns errors via the same (Config, was_corrupted) channel as the original inline parse.
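
A rough sketch of the (Config, was_corrupted) fallback contract described above. Everything here is hypothetical: the one-field Config, the toy parse_config, the example mode values, and the behaviour when both parses fail are assumptions for illustration, not the logic in load.rs.

```rust
/// Hypothetical one-field config; the real Config is deeply nested.
#[derive(Debug, PartialEq)]
struct Config {
    mode: String,
}

/// Toy parser accepting a single `mode = value` line.
fn parse_config(raw: &str) -> Result<Config, String> {
    let (key, value) = raw
        .trim()
        .split_once('=')
        .ok_or_else(|| "missing '='".to_string())?;
    if key.trim() != "mode" {
        return Err(format!("unknown key {:?}", key.trim()));
    }
    Ok(Config {
        mode: value.trim().trim_matches('"').to_string(),
    })
}

/// Tries the primary text first, then the backup. The bool is `true` only
/// when the primary was unusable and the backup was used instead.
fn parse_with_recovery(primary: &str, backup: Option<&str>) -> Result<(Config, bool), String> {
    match parse_config(primary) {
        Ok(cfg) => Ok((cfg, false)),
        Err(primary_err) => match backup.map(parse_config) {
            Some(Ok(cfg)) => Ok((cfg, true)),
            Some(Err(backup_err)) => {
                Err(format!("primary: {primary_err}; backup: {backup_err}"))
            }
            None => Err(primary_err),
        },
    }
}
```

Because the caller sees the same tuple and the same String error regardless of where the parse ran, moving the parse to another thread does not change the error handling at the call site.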

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: N/A
  • Resolution: N/A

Summary by CodeRabbit

  • Bug Fixes

    • Resolved stack-overflow crashes during configuration loading and improved parsing stability under heavy workloads.
    • Made async runtime startup more robust to prevent runtime aborts.
  • Documentation

    • Clarified config-loading timeout and recovery behavior in docs.
  • Tests

    • Added a regression test to ensure stack-overflow issues do not recur.

Review Change Stack

senamakel added 2 commits May 18, 2026 01:53
Production crash (`SIGBUS / KERN_PROTECTION_FAILURE` at a
`tokio-rt-worker` stack guard page) when the in-process core ran a
chat-driven `delegate_to_integrations_agent → integrations_agent →
composio_list_tools → load_config_with_timeout` chain. The toml
parser's serde Visitor frames piled on top of ~50 frames of agent
tower and breached the 2 MB worker stack.

Fix:

* Move `toml::from_str::<Config>` onto the blocking pool via
  `spawn_blocking` (`parse_toml_off_worker`) so the parser runs on a
  fresh thread stack with no async tower above it.
* Install a custom `tauri::async_runtime` with `thread_stack_size(8
  MiB)` at the top of `pub fn run()` so the in-process core inherits
  4× the prior headroom for everything else in the tower.
* Add `tests/composio_list_tools_stack_overflow_regression.rs` — a
  structural guard that drives `run_subagent(integrations_agent) →
  composio_list_tools → load_config_with_timeout` on a production-
  shaped 2 MB worker. Module docs explain what it does and does not
  catch (we can't easily mock the upper chat-channel layers in
  cargo-test, so the bare path fits in 2 MB even without the parser-
  move; the test serves as a structural regression).
@senamakel senamakel requested a review from a team May 18, 2026 08:56
Contributor

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c48a9433-cf2b-4f5d-b713-0ba1df0ed46d

📥 Commits

Reviewing files that changed from the base of the PR and between 9a311b6 and cc141de.

📒 Files selected for processing (1)
  • src/openhuman/config/schema/load.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/config/schema/load.rs

📝 Walkthrough

Walkthrough

Installs a custom Tokio runtime with 8 MiB thread stacks in Tauri, offloads TOML deserialization to Tokio's blocking pool, updates related docs/comments, and adds a regression test that reproduces the low-worker-stack SIGBUS scenario.

Changes

Stack Overflow Prevention

| Layer / File(s) | Summary |
| --- | --- |
| Tauri async runtime with larger stack (app/src-tauri/src/lib.rs) | Custom Tokio multi-thread runtime with 8 MiB per-thread stack is created, leaked for process lifetime, and registered with tauri::async_runtime::set before other startup work. |
| Config parsing moved to blocking pool (src/openhuman/config/ops.rs, src/openhuman/config/schema/load.rs) | parse_config_with_recovery and backup parsing now call parse_toml_off_worker(), which performs toml::from_str on Tokio's blocking pool; documentation/comments explain stack-growth mitigation and blocking pool behavior. |
| Stack overflow regression test with minimal stack (tests/composio_list_tools_stack_overflow_regression.rs) | Adds a regression test reproducing SIGBUS with ~2 MiB worker stack, including env synchronization, EnvGuard, a representative config.toml fixture, stub Provider/Memory, and helpers that run the subagent path. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

working

Poem

🐰 I hopped through stacks both thin and deep,
Gave threads eight megs so they could sleep,
Put TOML parsing on a safer track,
A gentle leak keeps runtime back—
Now tests run through without a peep.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(core): prevent SIGBUS stack overflow in composio tool path' directly describes the main change—fixing a SIGBUS crash by preventing stack overflow in the composio tool execution path. It clearly summarizes the primary objective and is specific enough to convey the key issue. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot added the working label May 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/composio_list_tools_stack_overflow_regression.rs (1)

323-327: ⚡ Quick win

Assert that the composio path was actually exercised, not just “no crash.”

Right now the test passes on mere task completion. Add a small execution assertion (e.g., provider turn count) so early-return regressions don’t produce false positives.

Suggested strengthening
-    rt.block_on(async {
-        tokio::spawn(drive_subagent())
-            .await
-            .expect("subagent task must complete without SIGBUS / panic");
-    });
+    rt.block_on(async {
+        let turns = tokio::spawn(drive_subagent())
+            .await
+            .expect("subagent task must complete without SIGBUS / panic");
+        assert!(
+            turns >= 2,
+            "expected at least one tool-call roundtrip through composio_list_tools"
+        );
+    });
 }
 
-async fn drive_subagent() {
+async fn drive_subagent() -> usize {
@@
-    let _ = with_parent_context(parent, async move {
+    let _ = with_parent_context(parent, async move {
         run_subagent(
             &def,
             "list available gmail actions",
             SubagentRunOptions::default(),
         )
         .await
     })
     .await;
+    *provider.iter.lock()
 }

Also applies to: 330-382
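
One way to realise this suggestion is a shared atomic turn counter, sketched below with the standard library only. StubProvider, its chat method, and the string replies are hypothetical stand-ins for the test's stubbed Provider; the real trait and reply types are not shown in this thread.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stub: the first turn requests a tool call, later turns finish.
struct StubProvider {
    turns: Arc<AtomicUsize>,
}

impl StubProvider {
    fn chat(&self) -> &'static str {
        // fetch_add returns the previous value, so turn 0 is the tool call.
        match self.turns.fetch_add(1, Ordering::SeqCst) {
            0 => "tool_call",
            _ => "final",
        }
    }
}

/// Minimal agent loop: keeps asking the provider until it stops requesting
/// tools. Returns how many tool calls were dispatched.
fn drive_subagent(provider: StubProvider) -> usize {
    let mut tool_calls = 0;
    while provider.chat() == "tool_call" {
        tool_calls += 1;
    }
    tool_calls
}

/// Runs the loop on another thread and reads the counter afterwards, so the
/// caller can assert that the tool path actually ran.
fn run_and_count_turns() -> usize {
    let turns = Arc::new(AtomicUsize::new(0));
    let provider = StubProvider {
        turns: Arc::clone(&turns),
    };
    thread::spawn(move || drive_subagent(provider))
        .join()
        .expect("subagent thread must not panic");
    turns.load(Ordering::SeqCst)
}
```

Here run_and_count_turns() returns 2: one turn that requests the tool and one that produces the final answer. An early return before the first provider call would leave the counter at 0 and fail a `turns >= 2` assertion.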

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/composio_list_tools_stack_overflow_regression.rs` around lines 323 -
327, The test currently only awaits tokio::spawn(drive_subagent()) for
completion; change it to also assert that the composio path was exercised by
reading and asserting a runtime-visible counter/metric after the task completes
(e.g., check a provider turn counter or a probes struct exposed by
drive_subagent or the provider), such as reading provider_turn_count (or an
equivalent atomic/metric returned from or shared with drive_subagent) and assert
it is > 0; apply the same addition to the other block covering lines 330-382 so
both assertions validate that the provider/Composio path ran rather than merely
completing without a crash.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 473-474: The doc comment mentioning that
config::ops::load_config_with_timeout "is an additional optimization that avoids
paying the parse on repeat calls" is stale; update the rationale in load.rs to
remove or reword that claim so it no longer asserts a cache optimization for
load_config_with_timeout (reference the symbol load_config_with_timeout in the
comment block) and instead document the current behavior accurately (e.g., that
the caching optimization was reverted or that no additional caching is
performed).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 02917c27-90bb-41fe-8b88-0b4c6cdf0ab3

📥 Commits

Reviewing files that changed from the base of the PR and between 0f616e4 and 9a311b6.

📒 Files selected for processing (4)
  • app/src-tauri/src/lib.rs
  • src/openhuman/config/ops.rs
  • src/openhuman/config/schema/load.rs
  • tests/composio_list_tools_stack_overflow_regression.rs

Comment thread src/openhuman/config/schema/load.rs Outdated
@senamakel senamakel merged commit 579addf into tinyhumansai:main May 18, 2026
25 checks passed
Contributor

@graycyrus graycyrus left a comment


Walkthrough

Solid fix for the SIGBUS stack overflow in the composio tool path. The two-pronged approach — moving toml::from_str::<Config> onto the blocking pool via spawn_blocking and bumping the Tauri tokio worker stack to 8 MiB — addresses both the immediate trigger and gives future headroom. The regression test is well-structured and honestly documents its own limitations. Only one minor stale-docs nit.

Change Summary

| File | Change type | Description |
| --- | --- | --- |
| app/src-tauri/src/lib.rs | Modified | Custom tokio runtime with 8 MiB worker stacks, set before any Tauri async call |
| src/openhuman/config/ops.rs | Modified | Doc comments explaining why spawn_blocking instead of caching |
| src/openhuman/config/schema/load.rs | Modified | New parse_toml_off_worker helper; parse_config_with_recovery uses it for both primary and backup parse paths |
| tests/composio_list_tools_stack_overflow_regression.rs | Added | Regression test driving run_subagent → composio_list_tools → load_config_with_timeout on a 2 MiB worker stack |

Per-file notes

lib.rs — The std::mem::forget + tauri::async_runtime::set pattern is correct per Tauri's docs. Block comment thoroughly explains the rationale and the "must call before any other tauri async call" ordering constraint.

load.rs — parse_toml_off_worker is clean: takes ownership of the string (required for 'static in spawn_blocking), flattens the join error into the same String error path, and callers don't need to change their error handling. Both the primary parse and the backup-file recovery path go through it.

ops.rs — Doc-only change, accurately describes why caching was abandoned.

Test file — Excellent module-level documentation explaining the crash, why existing tests missed it, and the honest caveat about what the test does/doesn't catch. Setup mirrors production constraints (2 MiB worker stack, representative config TOML, stub provider/memory).

Findings

[minor] tests/composio_list_tools_stack_overflow_regression.rs:387-390 — Stale doc comment says load_config_with_timeout "is fronted by a process-global cache keyed on OPENHUMAN_WORKSPACE, invalidated by Config::save()" and that hot-path consumers "get a clone, never re-entering the parser." But the PR description and the ops.rs doc update both confirm the cache was tried and reverted because it caused 6 flakes in json_rpc_e2e.rs. This block should be removed or rewritten to match reality (spawn_blocking only, no cache). Same class of stale-docs issue that CodeRabbit caught in load.rs:474 (fixed in cc141de), but this instance in the test file was missed.
