fix(memory/ingestion): bound the job channel + reject submits at cap (#2442) by obchain · Pull Request #2444 · tinyhumansai/openhuman

obchain · 2026-05-21T12:16:18Z

Summary

Replace mpsc::unbounded_channel in the memory ingestion queue with mpsc::channel(DEFAULT_QUEUE_CAPACITY) (512).
Switch IngestionQueue::submit from tx.send() to tx.try_send(); distinguish Full from Closed in the warn-level log so observability can tell over-pressure apart from worker shutdown.
Call IngestionState::dequeue() on both drop paths to roll back the producer's earlier enqueue(), so the memory_ingestion_status queue-depth gauge stays accurate under sustained overflow.
Add start_worker_with_capacity so unit tests can drive the at-capacity branch deterministically without faking a slow worker.
New unit tests cover: submit-at-capacity drops, recovery after drain, channel-closed drop accounting, and a guardrail on the DEFAULT_QUEUE_CAPACITY ceiling.

Problem

src/openhuman/memory/ingestion/queue.rs:106 on main (6137b67) built the job channel with mpsc::unbounded_channel. The worker (ingestion_worker in the same file) drains one job at a time under the IngestionState::acquire() singleton lock because the local extraction LLM cannot run concurrently — per-job work is on the order of seconds-to-minutes depending on doc size + model.

Two producer sites push without backpressure:

src/openhuman/memory/store/client.rs:152 — put_doc
src/openhuman/memory/store/client.rs:266 — store_skill_sync

Both increment IngestionState::enqueue() and call IngestionQueue::submit(job). submit already handled the "worker gone" path (SendError) but the channel itself had no capacity bound, so a buggy / misconfigured / hostile producer that submits faster than the worker can drain grows the in-flight buffer indefinitely (each IngestionJob owns a full NamespaceDocumentInput — title, body, metadata) until the process OOMs.

Not exploitable across a trust boundary, but it is a robustness gap: a runaway skill, a misconfigured Composio sync, or an agent re-ingesting the same source on every tick can DoS the local core with no user-visible warning.

Solution

Three-step fix, all in src/openhuman/memory/ingestion/queue.rs:

1. mpsc::unbounded_channel → mpsc::channel(DEFAULT_QUEUE_CAPACITY). DEFAULT_QUEUE_CAPACITY = 512 keeps the in-flight buffer at or below roughly 50 MB at typical doc sizes (1–100 KB; 512 × 100 KB ≈ 51 MB in the worst case), while still absorbing reasonable bulk-import bursts (Notion workspace backfill, large Slack history).
2. tx.send(job) → tx.try_send(job) (non-blocking). The Full and Closed variants are logged distinctly so observability can tell over-pressure apart from worker shutdown. Both call state.dequeue() so the queue-depth gauge does not drift upward.
3. start_worker_with_capacity(memory, state, capacity) is exposed for tests; start_worker_with_state delegates with the default cap so callers see no signature change.

No producer signature changes — put_doc and store_skill_sync continue to call submit() exactly as before.
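
The submit path described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from queue.rs: it uses std::sync::mpsc::sync_channel as a stand-in for tokio::sync::mpsc::channel (so the "closed" case is std's TrySendError::Disconnected rather than Tokio's Closed), and Job, State, and Queue are hypothetical simplifications of IngestionJob, IngestionState, and IngestionQueue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Hypothetical stand-in for `IngestionJob` (the real job owns a full document).
#[derive(Debug)]
pub struct Job {
    pub title: String,
}

/// Hypothetical stand-in for `IngestionState`: only the queue-depth gauge.
#[derive(Default)]
pub struct State {
    depth: AtomicUsize,
}

impl State {
    pub fn enqueue(&self) {
        self.depth.fetch_add(1, Ordering::SeqCst);
    }
    pub fn dequeue(&self) {
        self.depth.fetch_sub(1, Ordering::SeqCst);
    }
    pub fn depth(&self) -> usize {
        self.depth.load(Ordering::SeqCst)
    }
}

pub struct Queue {
    tx: SyncSender<Job>,
    state: Arc<State>,
}

impl Queue {
    /// Build a bounded queue. The receiver is handed back so a test can
    /// drain it by hand instead of running a worker.
    pub fn with_capacity(capacity: usize, state: Arc<State>) -> (Self, Receiver<Job>) {
        let (tx, rx) = sync_channel(capacity);
        (Queue { tx, state }, rx)
    }

    /// Non-blocking submit: `true` if accepted, `false` if dropped.
    /// Assumes the producer already called `state.enqueue()` for this job.
    pub fn submit(&self, job: Job) -> bool {
        match self.tx.try_send(job) {
            Ok(()) => true,
            Err(TrySendError::Full(dropped)) => {
                // Over-pressure: undo the producer's enqueue so the gauge stays accurate.
                self.state.dequeue();
                eprintln!("dropping job: queue at capacity title={}", dropped.title);
                false
            }
            Err(TrySendError::Disconnected(dropped)) => {
                // Receiver gone (worker shutdown): same rollback, different message.
                self.state.dequeue();
                eprintln!("dropping job: worker channel closed title={}", dropped.title);
                false
            }
        }
    }
}
```

Holding the receiver without draining it is what makes the at-capacity branch deterministic in a test: no slow worker is needed, only a small capacity.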

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) — 4 new cases covering capacity-bound submit, drain recovery, worker-gone, and a DEFAULT_QUEUE_CAPACITY guardrail.
Diff coverage ≥ 80% — every new line in submit, start_worker_with_capacity, and the constant guardrail is exercised by a dedicated test.
Coverage matrix updated — N/A: robustness fix, no new feature row needed.
All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: no matrix row affected.
No new external network dependencies introduced — Rust-only change in the ingestion module.
Manual smoke checklist updated if this touches release-cut surfaces — N/A: not on the release-cut surface list.
Linked issue closed via Closes #NNN in the ## Related section.

Impact

Runtime/platform: desktop core (mac/win/linux) + standalone CLI. No frontend changes.
Performance: no expected overhead in the common case. try_send is cheaper than send (no async wait), and bounded channels have the same fast path as unbounded for non-full sends.
Security: tightens local-DoS surface; a runaway producer can no longer balloon the core's RSS via the ingestion queue.
Migration / compatibility: behaviour change — under sustained overload submit returns false for the overflow jobs instead of silently buffering them. Callers (put_doc, store_skill_sync) already ignore the return value, so the observable change is "dropped + logged" instead of "buffered until OOM". The underlying document upsert that ran before submit is unaffected — only the graph-extraction follow-up is skipped.
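
To make the compatibility note concrete, here is a small sketch of the "upsert first, best-effort enqueue second" ordering. It is an illustration under assumptions, not the code in client.rs: Store, its HashMap-backed document table, and the String job payload are all hypothetical, and std::sync::mpsc::sync_channel stands in for the Tokio channel.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical store: a document table plus a bounded follow-up queue.
pub struct Store {
    docs: HashMap<String, String>,
    tx: SyncSender<String>,
}

impl Store {
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Store { docs: HashMap::new(), tx }, rx)
    }

    /// Hypothetical `put_doc`: the upsert always happens; the extraction
    /// follow-up is enqueued only if the bounded queue has room.
    pub fn put_doc(&mut self, id: &str, body: &str) {
        self.docs.insert(id.to_string(), body.to_string());
        // Result deliberately ignored, mirroring the call sites described above.
        let _ = self.tx.try_send(id.to_string());
    }

    pub fn contains(&self, id: &str) -> bool {
        self.docs.contains_key(id)
    }
}
```

With a capacity of 1 and nothing draining the queue, a second put_doc still stores its document; only its follow-up job is dropped.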

Related

Closes: Memory ingestion queue is unbounded — buggy producer can OOM the core #2442
Surface introduced in #325 (feat: background ingestion queue for memory graph extraction).
Related path-side hardening pass: fix(security): always canonicalize paths before policy check #2111.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: fix/2442-bounded-ingestion-queue
Commit SHA: b5af973

Validation Run

pnpm --filter openhuman-app format:check — N/A: no frontend changes.
pnpm typecheck — N/A: no TypeScript changes.
Focused tests: cargo test --lib -- ingestion::queue (4/4 pass).
Rust fmt/check: cargo fmt --all --check clean; cargo check --manifest-path Cargo.toml clean (warnings unrelated to this change).
Tauri fmt/check: N/A — no Tauri changes.

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: the ingestion queue is no longer unbounded; sustained overload causes follow-up extraction jobs to be dropped + logged instead of accumulating until OOM.
User-visible effect: under realistic single-user load, none — capacity 512 covers reasonable bulk imports. Under pathological producer overload, "memory ingestion queue at capacity" warnings appear in the log and memory_ingestion_status reflects the bounded depth instead of growing without limit.

Parity Contract

Legacy behavior preserved: submit still returns true on accepted enqueue, false on any drop; existing put_doc / store_skill_sync call sites continue to ignore the return value exactly as before; worker loop semantics are unchanged.
Guard/fallback/dispatch parity checks: IngestionState::enqueue/dequeue accounting still matches the worker's acquire/dequeue pairing — the drop paths add a matching dequeue() so the gauge is correct.

Duplicate / Superseded PR Handling

Duplicate PR(s): None.
Canonical PR: This PR.
Resolution (closed/superseded/updated): N/A.

Summary by CodeRabbit

Improvements
- Ingestion queue now uses a bounded capacity (default 512) to avoid uncontrolled growth.
- Queue overflow is handled by dropping excess jobs and recording the drop, preventing system stalls.
- Queue capacity is configurable for advanced tuning.
- Ingestion status tracking tightened to remain accurate when jobs are dropped.
Bug Fixes
- Startup now guards against invalid (zero) queue capacity.

Producers can DoS the core today by calling `put_doc` or `store_skill_sync` faster than the worker drains; the channel was `mpsc::unbounded_channel`, so a runaway loop would grow the in-flight buffer (each `IngestionJob` owns a full document body) until the process OOMs.

Switch the channel to `mpsc::channel(DEFAULT_QUEUE_CAPACITY)` (512) and change `submit` from `tx.send()` to `tx.try_send()`. The two drop reasons (`Full`, `Closed`) are logged with distinct messages so observability can tell over-pressure apart from worker shutdown. Both paths roll `IngestionState::dequeue()` back so the `memory_ingestion_status` queue-depth gauge stays accurate under sustained overflow.

`start_worker_with_capacity` is exposed (in addition to `start_worker_with_state`, which now delegates with the default cap) so unit tests can drive the at-capacity branch deterministically without faking a slow worker.

Tests added in the same file: capacity enforcement, recovery after drain, channel-closed drop accounting, and a guardrail on `DEFAULT_QUEUE_CAPACITY` so future bumps don't regress the memory-ceiling intent.

Closes tinyhumansai#2442

coderabbitai · 2026-05-21T12:18:44Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f790f526-7a6f-4095-8c67-fc60c7cae216

📥 Commits

Reviewing files that changed from the base of the PR and between b5af973 and ea3022b.

📒 Files selected for processing (1)

src/openhuman/memory/ingestion/queue.rs

📝 Walkthrough

This PR converts the unbounded ingestion queue to a bounded channel with DEFAULT_QUEUE_CAPACITY = 512, preventing memory exhaustion from misbehaving producers. Submission becomes non-blocking via try_send, distinguishing full-buffer drops from worker-shutdown cases while maintaining queue-depth accounting. Worker startup is expanded with a configurable capacity variant and comprehensive tests.

Changes

Bounded Ingestion Queue

Layer / File(s)	Summary
Bounded queue contract and default capacity `src/openhuman/memory/ingestion/queue.rs`	Module documentation, `DEFAULT_QUEUE_CAPACITY` constant (512 jobs), and `IngestionQueue` sender type changed from `UnboundedSender` to bounded `mpsc::Sender` with documented drop-on-full behavior.
Submission with overflow and shutdown distinction `src/openhuman/memory/ingestion/queue.rs`	`IngestionQueue::submit` uses non-blocking `try_send`, decrementing queue-depth counter on both `Full` (overflow) and `Closed` (worker gone) errors; returns `false` to signal drop without blocking.
Worker startup with configurable capacity `src/openhuman/memory/ingestion/queue.rs`	New exported `start_worker_with_capacity(capacity)` function creates bounded channel, spawns worker, and logs capacity. `start_worker_with_state` delegates to it using `DEFAULT_QUEUE_CAPACITY`. Worker receiver parameter updated to bounded `mpsc::Receiver`.
Bounded queue tests `src/openhuman/memory/ingestion/queue.rs`	Test module verifies capacity-based dropping, recovery after receiver drain, closed-receiver behavior, and asserts `DEFAULT_QUEUE_CAPACITY` is nonzero and reasonably bounded.
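
A guardrail of the kind listed in the last row might look like the sketch below. Only DEFAULT_QUEUE_CAPACITY = 512 and the 1–100 KB document range come from the PR description; the ceiling MAX_REASONABLE_CAPACITY and both helper functions are assumptions made for illustration, since the actual bound asserted in queue.rs is not shown on this page.

```rust
/// Value taken from the PR description.
pub const DEFAULT_QUEUE_CAPACITY: usize = 512;

/// Upper end of the "typical doc size" range quoted in the PR (100 KB),
/// interpreted here as 100 * 1024 bytes.
pub const TYPICAL_MAX_DOC_BYTES: usize = 100 * 1024;

/// Hypothetical ceiling for the guardrail; not taken from the PR.
pub const MAX_REASONABLE_CAPACITY: usize = 4096;

/// Worst-case payload bytes held by a full queue when every job carries
/// `doc_bytes` of document data.
pub const fn worst_case_buffer_bytes(capacity: usize, doc_bytes: usize) -> usize {
    capacity * doc_bytes
}

/// Guardrail predicate: nonzero and below the assumed ceiling.
pub fn capacity_is_reasonable(capacity: usize) -> bool {
    capacity > 0 && capacity <= MAX_REASONABLE_CAPACITY
}
```

Under these assumptions a full queue of 100 KB documents holds 52,428,800 bytes (50 MiB) of payload, which is the memory ceiling the constant is meant to protect.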

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

rust-core

Poem

🐰 A queue once wild and without end,
Now bounded tight, with caps to spend.
Five-twelve jobs max before we drop,
No more the runaway climb non-stop!
The worker drains with steady care,
While tests ensure the burden's fair.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title clearly summarizes the main changes: bounding the job channel and rejecting submissions at capacity, directly matching the PR's core objectives.
Linked Issues check	✅ Passed	All coding requirements from `#2442` are met: bounded channel with DEFAULT_QUEUE_CAPACITY=512, try_send with Full/Closed distinction, queue_depth decrement on drop, start_worker_with_capacity exposed, and comprehensive unit test coverage.
Out of Scope Changes check	✅ Passed	All changes are within scope of `#2442`: queue bounding logic, bounded channel migration, error handling, worker capacity configuration, and related unit tests directly address the unbounded queue issue.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/openhuman/memory/ingestion/queue.rs (1)

96-113: ⚡ Quick win

Include document_id in the drop-path warnings.

These warnings are the only breadcrumb when extraction is skipped after the document was already upserted, so they should carry the stable ID too.

♻️ Suggested log enrichment

                 log::warn!(
-                    "[memory:ingestion_queue] dropping job: queue at capacity ({} pending) namespace={} title={}",
+                    "[memory:ingestion_queue] dropping job: queue at capacity ({} pending) doc_id={} namespace={} title={}",
                     self.tx.max_capacity(),
+                    dropped.document_id,
                     dropped.document.namespace,
                     dropped.document.title,
                 );
@@
                 log::warn!(
-                    "[memory:ingestion_queue] dropping job: worker channel closed (shutdown?) namespace={} title={}",
+                    "[memory:ingestion_queue] dropping job: worker channel closed (shutdown?) doc_id={} namespace={} title={}",
+                    dropped.document_id,
                     dropped.document.namespace,
                     dropped.document.title,
                 );

As per coding guidelines, "Use structured, grep-friendly context with stable prefixes ... and include correlation fields such as request IDs, method names, and entity IDs when available."
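
As a rough illustration of the enriched warnings, the two messages can be built by small formatting helpers so the prefix and field order are easy to test. The format strings follow the suggested diff above; the helper names and the use of plain &str arguments (instead of fields on the dropped job) are assumptions for this sketch.

```rust
/// Build the "queue full" warning text, following the suggested diff.
pub fn full_drop_message(max_capacity: usize, doc_id: &str, namespace: &str, title: &str) -> String {
    format!(
        "[memory:ingestion_queue] dropping job: queue at capacity ({} pending) doc_id={} namespace={} title={}",
        max_capacity, doc_id, namespace, title
    )
}

/// Build the "worker channel closed" warning text, following the suggested diff.
pub fn closed_drop_message(doc_id: &str, namespace: &str, title: &str) -> String {
    format!(
        "[memory:ingestion_queue] dropping job: worker channel closed (shutdown?) doc_id={} namespace={} title={}",
        doc_id, namespace, title
    )
}
```

Both messages keep the same stable prefix and the same doc_id field name, so a single grep for the document ID finds either drop reason.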

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/ingestion/queue.rs` around lines 96 - 113, Update the
two warning logs in the ingestion queue drop paths to include the stable
document ID: add dropped.document.document_id to the formatted message and
arguments for both the capacity-full branch and the Closed branch (the code
around self.tx.max_capacity() and the self.state.dequeue() branch). Ensure the
same field name (dropped.document.document_id) is used in both log lines so the
warnings include namespace, title, and the stable document_id for tracing.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/ingestion/queue.rs`:
- Around line 153-161: The function start_worker_with_capacity currently calls
tokio::sync::mpsc::channel(capacity) which will panic if capacity is 0; add an
explicit guard at the start of start_worker_with_capacity to check for capacity
== 0 and return/fail fast with a clear error message (e.g.,
panic!("start_worker_with_capacity: capacity must be >= 1") or convert to a
Result and return an Err) before calling mpsc::channel, so callers of
UnifiedMemory/IngestionQueue receive a clear diagnostic instead of an internal
Tokio panic.

---

Nitpick comments:
In `@src/openhuman/memory/ingestion/queue.rs`:
- Around line 96-113: Update the two warning logs in the ingestion queue drop
paths to include the stable document ID: add dropped.document.document_id to the
formatted message and arguments for both the capacity-full branch and the Closed
branch (the code around self.tx.max_capacity() and the self.state.dequeue()
branch). Ensure the same field name (dropped.document.document_id) is used in
both log lines so the warnings include namespace, title, and the stable
document_id for tracing.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cb6f2cdc-71aa-4d4f-8bf9-74674130f331

📥 Commits

Reviewing files that changed from the base of the PR and between ec9708a and b5af973.

📒 Files selected for processing (1)

src/openhuman/memory/ingestion/queue.rs

graycyrus

Clean fix — well scoped, well tested, addresses a real robustness gap.

What this does

Bounds the ingestion job channel from mpsc::unbounded_channel → mpsc::channel(512) and switches submit to non-blocking try_send with distinct Full vs Closed logging. The worker-side drain and the IngestionState accounting are both handled correctly on both drop paths. Four new tests exercise the capacity-bound, drain-recovery, worker-gone, and constant-guardrail branches.

Area	Files	Verdict
Rust core (memory/ingestion)	`queue.rs`	✅

Notes

CodeRabbit already flagged the zero-capacity guard on start_worker_with_capacity — agree that's worth adding, tokio::sync::mpsc::channel(0) panics.
DEFAULT_QUEUE_CAPACITY = 512 is a reasonable middle ground; the doc comment justifying the number is appreciated.
No producer signature changes, no cross-cutting breakage — put_doc and store_skill_sync continue to call submit() unchanged.
The enqueue()/dequeue() accounting is correctly balanced on all three paths (success, full, closed).

Nothing else to flag beyond what CodeRabbit caught. Nice work.

… doc_id on drop

CodeRabbit on tinyhumansai#2444 flagged two follow-ups:

1. `tokio::sync::mpsc::channel(0)` panics with a cryptic Tokio-internal message ("mpsc bounded channel requires buffer > 0"). Add an explicit `assert!(capacity > 0, …)` in `start_worker_with_capacity` so misuse surfaces a clear, grep-friendly message at the call site instead of looking like a Tokio bug. New `#[should_panic]` test `start_worker_rejects_zero_capacity` pins the contract.
2. The drop-path warn logs now include `doc_id` alongside `namespace` and `title` so each warn line is a stable breadcrumb back to the upserted document whose graph-extraction follow-up was skipped.

5/5 tests pass; cargo fmt + check clean.

obchain · 2026-05-21T18:38:08Z

Pushed ea3022b7 addressing both CodeRabbit items:

Added assert!(capacity > 0, …) at the head of start_worker_with_capacity so misuse surfaces a clear message instead of Tokio's cryptic internal panic. New #[should_panic] test pins the assertion text.
Drop-path warn logs now include doc_id alongside namespace + title for grep-friendly trace back to the upserted document.

cargo test --lib -- ingestion::queue — 5/5 pass.
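
The zero-capacity guard and the delegation from start_worker_with_state can be sketched as below. This is a simplified, std-only illustration: the real functions also take the memory handle and state and spawn the worker, the String payload is a placeholder, and the assertion message is borrowed from CodeRabbit's suggestion rather than from the merged code. Note that std's sync_channel(0) is a valid rendezvous channel and does not panic, so here the assert is the only thing rejecting zero, whereas with Tokio it pre-empts the library's own panic.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Value taken from the PR description.
pub const DEFAULT_QUEUE_CAPACITY: usize = 512;

/// Simplified stand-in: creates the bounded channel but spawns no worker.
pub fn start_worker_with_capacity(capacity: usize) -> (SyncSender<String>, Receiver<String>) {
    // Fail fast with a clear, grep-friendly message at the call site.
    assert!(
        capacity > 0,
        "start_worker_with_capacity: capacity must be >= 1"
    );
    sync_channel(capacity)
}

/// Existing entry point keeps its shape and delegates with the default cap.
pub fn start_worker_with_state() -> (SyncSender<String>, Receiver<String>) {
    start_worker_with_capacity(DEFAULT_QUEUE_CAPACITY)
}
```

In a unit test the same contract would typically be pinned with #[should_panic(expected = "...")], which is what the follow-up commit describes.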

obchain requested a review from a team May 21, 2026 12:16

coderabbitai Bot requested changes May 21, 2026

graycyrus reviewed May 21, 2026

coderabbitai Bot added the rust-core label (Core Rust runtime in src/: CLI, core_server, shared infrastructure) May 21, 2026

coderabbitai Bot approved these changes May 21, 2026

obchain requested a review from graycyrus May 21, 2026 20:28

obchain wants to merge 2 commits into tinyhumansai:main from obchain:fix/2442-bounded-ingestion-queue