fix(core): prevent SIGBUS stack overflow in composio tool path by senamakel · Pull Request #2069 · tinyhumansai/openhuman

senamakel · 2026-05-18T08:56:12Z

Summary

Fix EXC_BAD_ACCESS (SIGBUS) in the in-process core when a chat-driven Gmail (or other composio) tool call runs through delegate_to_integrations_agent → integrations_agent → composio_list_tools → load_config_with_timeout. The TOML parse for Config, running on top of the ~50-frame agent tower, overflowed the 2 MB tokio worker stack into its guard page.
Move toml::from_str::<Config> onto the tokio blocking pool via spawn_blocking so the parser runs on a fresh thread stack.
Install a custom tauri::async_runtime with thread_stack_size(8 MiB) at the top of pub fn run() so the in-process core's tokio::spawn(run_server_embedded(..)) inherits 4× the prior headroom for everything else in the tower.
Add a structural regression test that drives the full sub-agent → composio_list_tools → load_config_with_timeout chain on a production-shaped 2 MB worker.

Problem

User report: the Tauri window quits without warning when sending a Gmail-related agent request. macOS crash report (crahs.log):

Triggered by Thread: 46  tokio-rt-worker
Exception Type:   EXC_BAD_ACCESS (SIGBUS)
Exception Subtype: KERN_PROTECTION_FAILURE at 0x3028534f0
Exception Message: Could not determine thread index for stack guard region

The address falls inside the stack guard page of a 2 MB tokio worker. Stack frames at the top of the crashed thread were toml_parser::on_array_open → value → … → toml::from_str → Config::load_or_init → load_config_with_timeout → ComposioListToolsTool::execute → subagent_runner::run_inner_loop → SkillDelegationTool::execute → Agent::execute_tools → web channel run_chat_task. The stack was about 100 frames deep in total, and the toml parser's serde-monomorphised Visitor frames for the deeply nested Config (KB-sized per frame) tipped the worker over.

composio_list_tools reloads Config from disk on every invocation (per #1710 Wave 4, so a mid-session composio.mode toggle is observed). That's the trigger; the underlying issue is that the toml parser + agent tower together no longer fit in a 2 MB worker stack.

Solution

Two complementary changes:

src/openhuman/config/schema/load.rs: wrap toml::from_str::<Config> in tokio::task::spawn_blocking via a new helper, parse_toml_off_worker. The blocking pool thread starts fresh (no async tower above the parse), so the serde Visitor frames no longer compound with the caller's frames. parse_config_with_recovery now routes both the primary parse and the backup-recovery parse through the helper.
app/src-tauri/src/lib.rs: at the very top of pub fn run(), build a custom runtime with tokio::runtime::Builder::new_multi_thread().thread_stack_size(8 * 1024 * 1024) and register it via tauri::async_runtime::set(handle) before any other tauri call. The in-process core runs via tokio::spawn(run_server_embedded(..)) on this runtime, so every JSON-RPC handler gets 4× the prior worker stack headroom. The runtime is leaked (per Tauri's contract: "you cannot drop the underlying TokioRuntime").
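
The effect of a larger thread stack can be sketched with the standard library alone. This is not the PR's Tauri/tokio setup: `std::thread::Builder::stack_size` stands in for tokio's `thread_stack_size`, and `deep_call` and `run_with_stack` are hypothetical names. Each frame of `deep_call` keeps a 1 KiB buffer alive, so a depth of 3000 needs roughly 3 MiB of stack, which is assumed here to be more than a 2 MiB stack could hold.

```rust
use std::hint::black_box;
use std::thread;

const MIB: usize = 1024 * 1024;

/// Recurses `depth` times, keeping a 1 KiB buffer alive in every frame,
/// so total stack use is at least `depth` KiB.
#[inline(never)]
fn deep_call(depth: usize) -> usize {
    let mut frame = [0u8; 1024];
    black_box(&mut frame); // keep the buffer from being optimised away
    if depth == 0 {
        return 0;
    }
    let below = deep_call(depth - 1);
    // `frame` is read after the recursive call, so this is not a tail call.
    below + 1 + frame[0] as usize
}

/// Runs `deep_call` on a dedicated thread with the requested stack size.
fn run_with_stack(stack_bytes: usize, depth: usize) -> usize {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || deep_call(depth))
        .expect("thread spawn failed")
        .join()
        .expect("deep_call panicked")
}
```

Calling `run_with_stack(8 * MIB, 3000)` completes and returns the recursion depth. The same call with a 2 MiB stack is deliberately not shown, since a stack overflow aborts the process rather than returning an error.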

Tried but reverted: an Arc<Config> cache fronting load_config_with_timeout that would skip the per-call parse entirely. It behaved correctly in lib unit tests but caused 6 flakes in tests/json_rpc_e2e.rs (in-process JSON-RPC servers loading config mid-mutation, race-prone even with mtime checks). The two changes above carry the production stack-overflow fix without it.

Regression test (tests/composio_list_tools_stack_overflow_regression.rs) is a structural guard: it drives run_subagent(integrations_agent) → composio_list_tools → load_config_with_timeout on a production-shaped 2 MB worker with a stubbed Provider. Module docs explain what it does and does not catch (we can't easily mock the upper chat-channel layers in cargo-test, so the bare path fits in 2 MB even without the parser-move; the test catches future structural regressions in the path).

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
N/A: behaviour-only fix; no new product surface to measure diff coverage on. tests/composio_list_tools_stack_overflow_regression.rs exercises the changed path; pnpm test:rust is clean (45 integration + 7602 lib tests + new regression).
N/A: behaviour-only change — no feature rows added/removed/renamed.
N/A: no feature IDs touched.
No new external network dependencies introduced (mock backend used per Testing Strategy)
N/A: change does not touch release-cut surfaces (in-process core stack budget + a regression test).
N/A: no linked issue — surfaced from a user-supplied macOS crash dump (crahs.log).

Impact

Runtime/platform: desktop only. Affects the in-process Rust core inside the Tauri shell. 8 MiB worker stacks × ~CPU-count workers raises virtual-address reservation modestly (physical memory is only paged in when touched). 8 MiB matches the macOS main-thread default and the usual Linux default thread stack size; it is larger than the 1 MiB Windows default but well within what each platform allows, so no platform reaches a hard limit.
Performance: per-call config reload now hops to the blocking pool (one spawn_blocking round-trip). Imperceptible — the parse is microseconds and already ran on tokio time.
Security: none.
Compatibility: no API changes. Internal-only.
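
The stack-reservation estimate in the runtime/platform note is simple arithmetic, sketched below. The function names and the worker count of 8 are hypothetical; the PR only says the worker count is roughly the CPU count.

```rust
const MIB: usize = 1024 * 1024;

/// Virtual address space reserved for worker stacks: one stack per worker.
fn reserved_stack_bytes(workers: usize, stack_bytes: usize) -> usize {
    workers * stack_bytes
}

/// Additional reservation when each worker stack grows from `old` to `new` bytes.
/// Returns 0 if the new size is not larger than the old one.
fn extra_reservation(workers: usize, old: usize, new: usize) -> usize {
    reserved_stack_bytes(workers, new).saturating_sub(reserved_stack_bytes(workers, old))
}
```

On an assumed 8-worker machine, `reserved_stack_bytes(8, 8 * MIB)` is 64 MiB, and `extra_reservation(8, 2 * MIB, 8 * MIB)` is 48 MiB over the previous 2 MiB stacks. This is address space, not resident memory.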

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: `fix/composio-stack-overflow`
Commit SHA: `9a311b608`

Validation Run

`pnpm --filter openhuman-app format:check` (ran via pre-push hook; auto-fixed and re-committed)
N/A: no TS changes.
Focused tests: `cargo test --test composio_list_tools_stack_overflow_regression` (1 passed), `cargo test --lib` (7602 passed; transient timing flake on runtime_dispatch::message_dispatch_processes_messages_in_parallel cleared on re-run, also flaked on baseline without changes — unrelated), `cargo test --test json_rpc_e2e` (45 passed)
Rust fmt/check (if changed): `cargo check --bin openhuman-core` clean
Tauri fmt/check (if changed): `cargo check --manifest-path app/src-tauri/Cargo.toml` clean

Validation Blocked

`command:` N/A
`error:` N/A
`impact:` N/A

Behavior Changes

Intended behavior change: composio tool calls no longer crash the in-process core under deep async towers.
User-visible effect: Tauri window stays alive when an agent sub-call uses composio_list_tools / per-action composio tools. No surface or API change.

Parity Contract

Legacy behavior preserved: load_config_with_timeout semantics unchanged; the per-call reload behavior added in #1710 (Prioritize fully local speech and Composer operation), Wave 4, is retained.
Guard/fallback/dispatch parity checks: parse_config_with_recovery still falls back to .bak recovery; new parse_toml_off_worker returns errors via the same (Config, was_corrupted) channel as the original inline parse.
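
The fallback contract described above can be illustrated with a toy parser. This is a sketch under assumptions, not the PR's `parse_config_with_recovery`: `parse_with_recovery` is a hypothetical name, a `u32` stands in for `Config`, and the boolean is assumed to mean "the primary file was unreadable and the backup was used".

```rust
/// Tries the primary text first; on failure, falls back to the backup text.
/// Returns the parsed value plus a flag that is true when the backup was used.
fn parse_with_recovery(primary: &str, backup: Option<&str>) -> Result<(u32, bool), String> {
    match primary.trim().parse::<u32>() {
        // Primary parsed cleanly: nothing was corrupted.
        Ok(value) => Ok((value, false)),
        Err(primary_err) => match backup {
            // Primary failed: report success only if the backup parses.
            Some(raw) => raw
                .trim()
                .parse::<u32>()
                .map(|value| (value, true))
                .map_err(|backup_err| format!("primary: {primary_err}; backup: {backup_err}")),
            None => Err(format!("primary: {primary_err}; no backup available")),
        },
    }
}
```

Both parse attempts go through the same function and the same `Result` type, which mirrors the claim that callers did not need to change their error handling.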

Duplicate / Superseded PR Handling

Duplicate PR(s): N/A
Canonical PR: N/A
Resolution: N/A

Summary by CodeRabbit

Bug Fixes
- Resolved stack-overflow crashes during configuration loading and improved parsing stability under heavy workloads.
- Made async runtime startup more robust to prevent runtime aborts.
Documentation
- Clarified config-loading timeout and recovery behavior in docs.
Tests
- Added a regression test to ensure stack-overflow issues do not recur.

Production crash (`SIGBUS / KERN_PROTECTION_FAILURE` at a `tokio-rt-worker` stack guard page) when the in-process core ran a chat-driven `delegate_to_integrations_agent → integrations_agent → composio_list_tools → load_config_with_timeout` chain. The toml parser's serde Visitor frames piled on top of ~50 frames of agent tower and breached the 2 MB worker stack.

Fix:

* Move `toml::from_str::<Config>` onto the blocking pool via `spawn_blocking` (`parse_toml_off_worker`) so the parser runs on a fresh thread stack with no async tower above it.
* Install a custom `tauri::async_runtime` with `thread_stack_size(8 MiB)` at the top of `pub fn run()` so the in-process core inherits 4× the prior headroom for everything else in the tower.
* Add `tests/composio_list_tools_stack_overflow_regression.rs`: a structural guard that drives `run_subagent(integrations_agent) → composio_list_tools → load_config_with_timeout` on a production-shaped 2 MB worker. Module docs explain what it does and does not catch (we can't easily mock the upper chat-channel layers in cargo-test, so the bare path fits in 2 MB even without the parser-move; the test serves as a structural regression).

coderabbitai · 2026-05-18T08:56:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c48a9433-cf2b-4f5d-b713-0ba1df0ed46d

📥 Commits

Reviewing files that changed from the base of the PR and between 9a311b6 and cc141de.

📒 Files selected for processing (1)

src/openhuman/config/schema/load.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/openhuman/config/schema/load.rs

📝 Walkthrough

Installs a custom Tokio runtime with 8 MiB thread stacks in Tauri, offloads TOML deserialization to Tokio's blocking pool, updates related docs/comments, and adds a regression test that reproduces the low-worker-stack SIGBUS scenario.

Changes

Stack Overflow Prevention

| Layer / File(s) | Summary |
|---|---|
| Tauri async runtime with larger stack `app/src-tauri/src/lib.rs` | Custom Tokio multi-thread runtime with 8 MiB per-thread stack is created, leaked for process lifetime, and registered with `tauri::async_runtime::set` before other startup work. |
| Config parsing moved to blocking pool `src/openhuman/config/ops.rs`, `src/openhuman/config/schema/load.rs` | `parse_config_with_recovery` and backup parsing now call `parse_toml_off_worker()`, which performs `toml::from_str` on Tokio's blocking pool; documentation/comments explain stack-growth mitigation and blocking pool behavior. |
| Stack overflow regression test with minimal stack `tests/composio_list_tools_stack_overflow_regression.rs` | Adds a regression test reproducing SIGBUS with ~2 MiB worker stack, including env synchronization, `EnvGuard`, a representative `config.toml` fixture, stub `Provider`/`Memory`, and helpers that run the subagent path. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

working

Poem

🐰 I hopped through stacks both thin and deep,
Gave threads eight megs so they could sleep,
Put TOML parsing on a safer track,
A gentle leak keeps runtime back—
Now tests run through without a peep.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(core): prevent SIGBUS stack overflow in composio tool path' directly describes the main change: fixing a SIGBUS crash by preventing stack overflow in the composio tool execution path. It clearly summarizes the primary objective and is specific enough to convey the key issue. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/composio_list_tools_stack_overflow_regression.rs (1)

323-327: ⚡ Quick win

Assert that the composio path was actually exercised, not just “no crash.”

Right now the test passes on mere task completion. Add a small execution assertion (e.g., provider turn count) so early-return regressions don’t produce false positives.

Suggested strengthening

-    rt.block_on(async {
-        tokio::spawn(drive_subagent())
-            .await
-            .expect("subagent task must complete without SIGBUS / panic");
-    });
+    rt.block_on(async {
+        let turns = tokio::spawn(drive_subagent())
+            .await
+            .expect("subagent task must complete without SIGBUS / panic");
+        assert!(
+            turns >= 2,
+            "expected at least one tool-call roundtrip through composio_list_tools"
+        );
+    });
 }
 
-async fn drive_subagent() {
+async fn drive_subagent() -> usize {
@@
-    let _ = with_parent_context(parent, async move {
+    let _ = with_parent_context(parent, async move {
         run_subagent(
             &def,
             "list available gmail actions",
             SubagentRunOptions::default(),
         )
         .await
     })
     .await;
+    *provider.iter.lock()
 }

Also applies to: 330-382
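
The idea behind this suggestion, counting provider turns so that an early return cannot pass as success, can be sketched without the project's types. Everything below is hypothetical: `StubProvider`, `drive`, and `run_and_count` are illustrative names, a plain thread stands in for the tokio task, and the strings stand in for real provider responses.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stub provider that counts how many turns it has served.
struct StubProvider {
    turns: Arc<AtomicUsize>,
}

impl StubProvider {
    /// The first turn requests a tool call; every later turn ends the run.
    fn next_turn(&self) -> &'static str {
        match self.turns.fetch_add(1, Ordering::SeqCst) {
            0 => "call composio_list_tools",
            _ => "done",
        }
    }
}

/// Keeps asking the provider for turns until it reports completion.
fn drive(provider: &StubProvider) {
    while provider.next_turn() != "done" {}
}

/// Runs the driver on another thread and returns the number of turns served.
fn run_and_count() -> usize {
    let turns = Arc::new(AtomicUsize::new(0));
    let provider = StubProvider { turns: Arc::clone(&turns) };
    thread::spawn(move || drive(&provider))
        .join()
        .expect("driver must complete without panicking");
    turns.load(Ordering::SeqCst)
}
```

A test built this way can assert `run_and_count() >= 2`: one turn that issues the tool call and one that finishes. A driver that returned before reaching the provider would leave the counter at 0 and fail the assertion.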

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/composio_list_tools_stack_overflow_regression.rs` around lines 323 -
327, The test currently only awaits tokio::spawn(drive_subagent()) for
completion; change it to also assert that the composio path was exercised by
reading and asserting a runtime-visible counter/metric after the task completes
(e.g., check a provider turn counter or a probes struct exposed by
drive_subagent or the provider), such as reading provider_turn_count (or an
equivalent atomic/metric returned from or shared with drive_subagent) and assert
it is > 0; apply the same addition to the other block covering lines 330-382 so
both assertions validate that the provider/Composio path ran rather than merely
completing without a crash.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 473-474: The doc comment mentioning that
config::ops::load_config_with_timeout "is an additional optimization that avoids
paying the parse on repeat calls" is stale; update the rationale in load.rs to
remove or reword that claim so it no longer asserts a cache optimization for
load_config_with_timeout (reference the symbol load_config_with_timeout in the
comment block) and instead document the current behavior accurately (e.g., that
the caching optimization was reverted or that no additional caching is
performed).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 02917c27-90bb-41fe-8b88-0b4c6cdf0ab3

📥 Commits

Reviewing files that changed from the base of the PR and between 0f616e4 and 9a311b6.

📒 Files selected for processing (4)

app/src-tauri/src/lib.rs
src/openhuman/config/ops.rs
src/openhuman/config/schema/load.rs
tests/composio_list_tools_stack_overflow_regression.rs

graycyrus

Walkthrough

Solid fix for the SIGBUS stack overflow in the composio tool path. The two-pronged approach — moving toml::from_str::<Config> onto the blocking pool via spawn_blocking and bumping the Tauri tokio worker stack to 8 MiB — addresses both the immediate trigger and gives future headroom. The regression test is well-structured and honestly documents its own limitations. Only one minor stale-docs nit.

Change Summary

| File | Change type | Description |
|---|---|---|
| `app/src-tauri/src/lib.rs` | Modified | Custom tokio runtime with 8 MiB worker stacks, set before any Tauri async call |
| `src/openhuman/config/ops.rs` | Modified | Doc comments explaining why `spawn_blocking` instead of caching |
| `src/openhuman/config/schema/load.rs` | Modified | New `parse_toml_off_worker` helper; `parse_config_with_recovery` uses it for both primary and backup parse paths |
| `tests/composio_list_tools_stack_overflow_regression.rs` | Added | Regression test driving `run_subagent → composio_list_tools → load_config_with_timeout` on a 2 MiB worker stack |

Per-file notes

lib.rs — The std::mem::forget + tauri::async_runtime::set pattern is correct per Tauri's docs. Block comment thoroughly explains the rationale and the "must call before any other tauri async call" ordering constraint.

load.rs — parse_toml_off_worker is clean: takes ownership of the string (required for 'static in spawn_blocking), flattens the join error into the same String error path, and callers don't need to change their error handling. Both the primary parse and the backup-file recovery path go through it.

ops.rs — Doc-only change, accurately describes why caching was abandoned.

Test file — Excellent module-level documentation explaining the crash, why existing tests missed it, and the honest caveat about what the test does/doesn't catch. Setup mirrors production constraints (2 MiB worker stack, representative config TOML, stub provider/memory).
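
The ownership and error-flattening points in the load.rs note can be sketched with the standard library alone. This is not the PR's `parse_toml_off_worker`: `std::thread` stands in for `tokio::task::spawn_blocking`, a toy bracket-depth parser stands in for `toml::from_str::<Config>`, and `max_depth`, `parse`, and `parse_on_fresh_stack` are hypothetical names.

```rust
use std::thread;

/// Toy recursive-descent parser: returns the maximum `[` nesting depth,
/// using one stack frame per nesting level.
fn max_depth(bytes: &[u8], pos: &mut usize) -> Result<usize, String> {
    let mut deepest = 0;
    while *pos < bytes.len() {
        match bytes[*pos] {
            b'[' => {
                *pos += 1;
                let inner = max_depth(bytes, pos)?;
                if *pos >= bytes.len() || bytes[*pos] != b']' {
                    return Err("unclosed '['".to_string());
                }
                *pos += 1; // consume the matching ']'
                deepest = deepest.max(inner + 1);
            }
            b']' => return Ok(deepest), // the caller consumes this ']'
            _ => *pos += 1,
        }
    }
    Ok(deepest)
}

fn parse(raw: &str) -> Result<usize, String> {
    let mut pos = 0;
    let depth = max_depth(raw.as_bytes(), &mut pos)?;
    if pos != raw.len() {
        return Err("unexpected ']'".to_string());
    }
    Ok(depth)
}

/// Runs `parse` on a new thread so its frames start on an empty stack.
/// Takes `String` by value because the closure must own its input.
fn parse_on_fresh_stack(raw: String) -> Result<usize, String> {
    let handle = thread::Builder::new()
        .spawn(move || parse(&raw))
        .map_err(|e| format!("failed to spawn parse thread: {e}"))?;
    // Flatten a thread panic into the same String error path as parse errors.
    handle.join().map_err(|_| "parse thread panicked".to_string())?
}
```

`parse_on_fresh_stack("[[[]]]".to_string())` yields `Ok(3)`, while an unbalanced input such as `"[["` comes back as an `Err(String)`, the same shape a spawn failure or a panic in the helper thread would produce.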

Findings

[minor] tests/composio_list_tools_stack_overflow_regression.rs:387-390 — Stale doc comment says load_config_with_timeout "is fronted by a process-global cache keyed on OPENHUMAN_WORKSPACE, invalidated by Config::save()" and that hot-path consumers "get a clone, never re-entering the parser." But the PR description and the ops.rs doc update both confirm the cache was tried and reverted because it caused 6 flakes in json_rpc_e2e.rs. This block should be removed or rewritten to match reality (spawn_blocking only, no cache). Same class of stale-docs issue that CodeRabbit caught in load.rs:474 (fixed in cc141de), but this instance in the test file was missed.

senamakel added 2 commits May 18, 2026 01:53

chore: apply pre-push rustfmt auto-fixes

9a311b6

senamakel requested a review from a team May 18, 2026 08:56

coderabbitai Bot added the working label (A PR that is being worked on by the team.) May 18, 2026

coderabbitai Bot requested changes May 18, 2026

Comment thread src/openhuman/config/schema/load.rs Outdated

docs: remove stale config-cache claim from parse rationale

cc141de

coderabbitai Bot approved these changes May 18, 2026

senamakel merged commit 579addf into tinyhumansai:main May 18, 2026
25 checks passed

graycyrus reviewed May 18, 2026

senamakel merged 3 commits into tinyhumansai:main from senamakel:fix/composio-stack-overflow
