fix(memory/ingestion): bound the job channel + reject submits at cap (#2442) #2444

Open
obchain wants to merge 2 commits into tinyhumansai:main from obchain:fix/2442-bounded-ingestion-queue

Conversation

@obchain
Contributor

@obchain obchain commented May 21, 2026

Summary

  • Replace mpsc::unbounded_channel in the memory ingestion queue with mpsc::channel(DEFAULT_QUEUE_CAPACITY) (512).
  • Switch IngestionQueue::submit from tx.send() to tx.try_send(); distinguish Full from Closed in the warn-level log so observability can tell over-pressure apart from worker shutdown.
  • Call IngestionState::dequeue() on both drop paths to undo the producer's earlier enqueue(), so the memory_ingestion_status queue-depth gauge stays accurate under sustained overflow.
  • Add start_worker_with_capacity so unit tests can drive the at-capacity branch deterministically without faking a slow worker.
  • New unit tests cover: submit-at-capacity drops, recovery after drain, channel-closed drop accounting, and a guardrail on the DEFAULT_QUEUE_CAPACITY ceiling.

Problem

src/openhuman/memory/ingestion/queue.rs:106 on main (6137b67) built the job channel with mpsc::unbounded_channel. The worker (ingestion_worker in the same file) drains one job at a time under the IngestionState::acquire() singleton lock because the local extraction LLM cannot run concurrently — per-job work is on the order of seconds-to-minutes depending on doc size + model.

Two producer sites push without backpressure:

  • src/openhuman/memory/store/client.rs:152 → put_doc
  • src/openhuman/memory/store/client.rs:266 → store_skill_sync

Both increment IngestionState::enqueue() and call IngestionQueue::submit(job). submit already handled the "worker gone" path (SendError) but the channel itself had no capacity bound, so a buggy / misconfigured / hostile producer that submits faster than the worker can drain grows the in-flight buffer indefinitely (each IngestionJob owns a full NamespaceDocumentInput — title, body, metadata) until the process OOMs.

Not exploitable across a trust boundary, but it is a robustness gap: a runaway skill, a misconfigured Composio sync, or an agent re-ingesting the same source on every tick can DoS the local core with no user-visible warning.

Solution

Three-step fix, all in src/openhuman/memory/ingestion/queue.rs:

  1. mpsc::unbounded_channel → mpsc::channel(DEFAULT_QUEUE_CAPACITY). DEFAULT_QUEUE_CAPACITY = 512 keeps the in-flight buffer at or below roughly 50 MB at typical doc sizes (1–100 KB; 512 × 100 KB ≈ 50 MB), while still absorbing reasonable bulk-import bursts (Notion workspace backfill, large Slack history).
  2. tx.send(job) → tx.try_send(job) (non-blocking). The Full and Closed variants are logged distinctly so observability can tell over-pressure apart from worker shutdown. Both call state.dequeue() so the queue-depth gauge does not drift upward.
  3. start_worker_with_capacity(memory, state, capacity) is exposed for tests; start_worker_with_state delegates with the default cap so callers see no signature change.

No producer signature changes — put_doc and store_skill_sync continue to call submit() exactly as before.
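The three steps above can be sketched without pulling in tokio. This is a minimal illustration, not the code in queue.rs: it swaps in std::sync::mpsc::sync_channel as a stand-in for tokio's bounded mpsc::channel (std names the closed case Disconnected rather than Closed), and Job, State, and Queue are simplified hypothetical types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Mirrors the default cap named in this PR.
pub const DEFAULT_QUEUE_CAPACITY: usize = 512;

/// Hypothetical, simplified job; the real IngestionJob owns a full document.
pub struct Job {
    pub title: String,
}

/// Hypothetical stand-in for IngestionState: only the queue-depth gauge.
#[derive(Default)]
pub struct State {
    depth: AtomicUsize,
}

impl State {
    pub fn enqueue(&self) {
        self.depth.fetch_add(1, Ordering::SeqCst);
    }
    pub fn dequeue(&self) {
        self.depth.fetch_sub(1, Ordering::SeqCst);
    }
    pub fn depth(&self) -> usize {
        self.depth.load(Ordering::SeqCst)
    }
}

pub struct Queue {
    tx: SyncSender<Job>,
    state: Arc<State>,
}

impl Queue {
    /// Step 3: the default constructor delegates with the default cap.
    pub fn new(state: Arc<State>) -> (Self, Receiver<Job>) {
        Self::with_capacity(DEFAULT_QUEUE_CAPACITY, state)
    }

    /// Step 1: a bounded channel instead of an unbounded one.
    pub fn with_capacity(capacity: usize, state: Arc<State>) -> (Self, Receiver<Job>) {
        let (tx, rx) = sync_channel(capacity);
        (Queue { tx, state }, rx)
    }

    /// Step 2: non-blocking submit. The producer has already called
    /// `state.enqueue()`, so both drop paths undo it with `dequeue()`.
    pub fn submit(&self, job: Job) -> bool {
        match self.tx.try_send(job) {
            Ok(()) => true,
            Err(TrySendError::Full(dropped)) => {
                eprintln!("dropping job: queue at capacity title={}", dropped.title);
                self.state.dequeue();
                false
            }
            Err(TrySendError::Disconnected(dropped)) => {
                eprintln!("dropping job: worker channel closed title={}", dropped.title);
                self.state.dequeue();
                false
            }
        }
    }
}
```

With a capacity of 1 and no worker draining, the first submit is accepted, the second is dropped as Full, and the gauge stays at 1 because the dropped job's enqueue() was rolled back.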

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) — 4 new cases covering capacity-bound submit, drain recovery, worker-gone, and a DEFAULT_QUEUE_CAPACITY guardrail.
  • Diff coverage ≥ 80% — every new line in submit, start_worker_with_capacity, and the constant guardrail is exercised by a dedicated test.
  • Coverage matrix updated — N/A: robustness fix, no new feature row needed.
  • All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: no matrix row affected.
  • No new external network dependencies introduced — Rust-only change in the ingestion module.
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: not on the release-cut surface list.
  • Linked issue closed via Closes #NNN in the ## Related section.

Impact

  • Runtime/platform: desktop core (mac/win/linux) + standalone CLI. No frontend changes.
  • Performance: no added cost in the common case. try_send is cheaper than send (no async wait), and bounded channels have the same fast path as unbounded for non-full sends.
  • Security: tightens local-DoS surface; a runaway producer can no longer balloon the core's RSS via the ingestion queue.
  • Migration / compatibility: behaviour change — under sustained overload submit returns false for the overflow jobs instead of silently buffering them. Callers (put_doc, store_skill_sync) already ignore the return value, so the observable change is "dropped + logged" instead of "buffered until OOM". The underlying document upsert that ran before submit is unaffected — only the graph-extraction follow-up is skipped.

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/2442-bounded-ingestion-queue
  • Commit SHA: b5af973

Validation Run

  • pnpm --filter openhuman-app format:check — N/A: no frontend changes.
  • pnpm typecheck — N/A: no TypeScript changes.
  • Focused tests: cargo test --lib -- ingestion::queue (4/4 pass).
  • Rust fmt/check: cargo fmt --all --check clean; cargo check --manifest-path Cargo.toml clean (warnings unrelated to this change).
  • Tauri fmt/check: N/A — no Tauri changes.

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: the ingestion queue is no longer unbounded; sustained overload causes follow-up extraction jobs to be dropped + logged instead of accumulating until OOM.
  • User-visible effect: under realistic single-user load, none — capacity 512 covers reasonable bulk imports. Under pathological producer overload, "memory ingestion queue at capacity" warnings appear in the log and memory_ingestion_status reflects the bounded depth instead of growing without limit.
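The memory ceiling behind the 512 figure is simple arithmetic: capacity times document size. The sketch below works it through and adds a guardrail check in the spirit of the DEFAULT_QUEUE_CAPACITY test. ASSUMED_CAPACITY_CEILING is a hypothetical value chosen for illustration (the PR's actual upper bound is not shown here), and the estimate counts document bodies only, ignoring title and metadata overhead.

```rust
/// Mirrors the default cap named in this PR.
pub const DEFAULT_QUEUE_CAPACITY: usize = 512;

/// Hypothetical ceiling for the guardrail; an assumption, not the PR's value.
pub const ASSUMED_CAPACITY_CEILING: usize = 4096;

/// Worst-case bytes buffered if every slot holds a document of `doc_bytes`.
pub const fn worst_case_buffer_bytes(capacity: usize, doc_bytes: usize) -> usize {
    capacity * doc_bytes
}

/// Guardrail: the capacity must be nonzero and must not exceed the ceiling.
pub fn capacity_is_sane(capacity: usize) -> bool {
    capacity > 0 && capacity <= ASSUMED_CAPACITY_CEILING
}
```

At 100 KiB per document, worst_case_buffer_bytes(512, 100 * 1024) is 52,428,800 bytes (50 MiB); at 1 KiB per document it is 524,288 bytes (0.5 MiB).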

Parity Contract

  • Legacy behavior preserved: submit still returns true on accepted enqueue, false on any drop; existing put_doc / store_skill_sync call sites continue to ignore the return value exactly as before; worker loop semantics are unchanged.
  • Guard/fallback/dispatch parity checks: IngestionState::enqueue/dequeue accounting still matches the worker's acquire/dequeue pairing — the drop paths add a matching dequeue() so the gauge is correct.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): None.
  • Canonical PR: This PR.
  • Resolution (closed/superseded/updated): N/A.

Summary by CodeRabbit

  • Improvements
    • Ingestion queue now uses a bounded capacity (default 512) to avoid uncontrolled growth.
    • Queue overflow is handled by dropping excess jobs and recording the drop, preventing system stalls.
    • Queue capacity is configurable for advanced tuning.
    • Ingestion status tracking tightened to remain accurate when jobs are dropped.
  • Bug Fixes
    • Startup now guards against invalid (zero) queue capacity.

Review Change Stack

Producers can DoS the core today by calling `put_doc` or
`store_skill_sync` faster than the worker drains; the channel was
`mpsc::unbounded_channel`, so a runaway loop would grow the in-flight
buffer (each `IngestionJob` owns a full document body) until the
process OOMs.

Switch the channel to `mpsc::channel(DEFAULT_QUEUE_CAPACITY)` (512)
and change `submit` from `tx.send()` to `tx.try_send()`. The two drop
reasons (`Full`, `Closed`) are logged with distinct messages so
observability can tell over-pressure apart from worker shutdown. Both
paths call `IngestionState::dequeue()` to undo the producer's earlier
`enqueue()`, so the `memory_ingestion_status` queue-depth gauge stays
accurate under sustained overflow.

`start_worker_with_capacity` is exposed (in addition to
`start_worker_with_state`, which now delegates with the default cap)
so unit tests can drive the at-capacity branch deterministically
without faking a slow worker.

Tests added in the same file: capacity enforcement, recovery after
drain, channel-closed drop accounting, and a guardrail on
`DEFAULT_QUEUE_CAPACITY` so future bumps don't regress the
memory-ceiling intent.

Closes tinyhumansai#2442
@obchain obchain requested a review from a team May 21, 2026 12:16
@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f790f526-7a6f-4095-8c67-fc60c7cae216

📥 Commits

Reviewing files that changed from the base of the PR and between b5af973 and ea3022b.

📒 Files selected for processing (1)
  • src/openhuman/memory/ingestion/queue.rs

📝 Walkthrough

Walkthrough

This PR converts the unbounded ingestion queue to a bounded channel with DEFAULT_QUEUE_CAPACITY = 512, preventing memory exhaustion from misbehaving producers. Submission becomes non-blocking via try_send, distinguishing full-buffer drops from worker-shutdown cases while maintaining queue-depth accounting. Worker startup is expanded with a configurable capacity variant and comprehensive tests.

Changes

Bounded Ingestion Queue

Layer / File(s) | Summary
Bounded queue contract and default capacity (src/openhuman/memory/ingestion/queue.rs) | Module documentation, DEFAULT_QUEUE_CAPACITY constant (512 jobs), and IngestionQueue sender type changed from UnboundedSender to bounded mpsc::Sender with documented drop-on-full behavior.
Submission with overflow and shutdown distinction (src/openhuman/memory/ingestion/queue.rs) | IngestionQueue::submit uses non-blocking try_send, decrementing queue-depth counter on both Full (overflow) and Closed (worker gone) errors; returns false to signal drop without blocking.
Worker startup with configurable capacity (src/openhuman/memory/ingestion/queue.rs) | New exported start_worker_with_capacity(capacity) function creates bounded channel, spawns worker, and logs capacity. start_worker_with_state delegates to it using DEFAULT_QUEUE_CAPACITY. Worker receiver parameter updated to bounded mpsc::Receiver.
Bounded queue tests (src/openhuman/memory/ingestion/queue.rs) | Test module verifies capacity-based dropping, recovery after receiver drain, closed-receiver behavior, and asserts DEFAULT_QUEUE_CAPACITY is nonzero and reasonably bounded.
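
The capacity and drain-recovery cases in the last row can be exercised deterministically with a small capacity and no worker at all, which is the motivation given for start_worker_with_capacity. The sketch below is an illustration only: it uses std::sync::mpsc::sync_channel as a stand-in for tokio's bounded channel, and count_accepted and recovers_after_drain are hypothetical helper names, not functions from the PR.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Tries `attempts` non-blocking sends into a channel of `capacity` that
/// nobody drains, and returns (accepted, dropped).
pub fn count_accepted(capacity: usize, attempts: usize) -> (usize, usize) {
    // `_rx` stays alive until the end of the function, so the channel is open.
    let (tx, _rx) = sync_channel::<usize>(capacity);
    let mut accepted = 0;
    let mut dropped = 0;
    for i in 0..attempts {
        match tx.try_send(i) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver is still alive"),
        }
    }
    (accepted, dropped)
}

/// Fills the channel, confirms the next send is rejected as Full, drains one
/// item, and confirms a send is accepted again. Intended for `capacity >= 1`.
pub fn recovers_after_drain(capacity: usize) -> bool {
    let (tx, rx) = sync_channel::<usize>(capacity);
    for i in 0..capacity {
        tx.try_send(i).expect("slot available while filling");
    }
    let full = matches!(tx.try_send(usize::MAX), Err(TrySendError::Full(_)));
    // try_recv never blocks, so this helper cannot hang.
    let drained = rx.try_recv().is_ok();
    let accepted_again = tx.try_send(usize::MAX).is_ok();
    full && drained && accepted_again
}
```

For example, count_accepted(2, 5) accepts two sends and drops three, and recovers_after_drain(2) returns true.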

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

rust-core

Poem

🐰 A queue once wild and without end,
Now bounded tight, with caps to spend.
Five-twelve jobs max before we drop,
No more the runaway climb non-stop!
The worker drains with steady care,
While tests ensure the burden's fair.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | Title clearly summarizes the main changes: bounding the job channel and rejecting submissions at capacity, directly matching the PR's core objectives.
Linked Issues check | ✅ Passed | All coding requirements from #2442 are met: bounded channel with DEFAULT_QUEUE_CAPACITY=512, try_send with Full/Closed distinction, queue_depth decrement on drop, start_worker_with_capacity exposed, and comprehensive unit test coverage.
Out of Scope Changes check | ✅ Passed | All changes are within scope of #2442: queue bounding logic, bounded channel migration, error handling, worker capacity configuration, and related unit tests directly address the unbounded queue issue.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/openhuman/memory/ingestion/queue.rs (1)

96-113: ⚡ Quick win

Include document_id in the drop-path warnings.

These warnings are the only breadcrumb when extraction is skipped after the document was already upserted, so they should carry the stable ID too.

♻️ Suggested log enrichment
                 log::warn!(
-                    "[memory:ingestion_queue] dropping job: queue at capacity ({} pending) namespace={} title={}",
+                    "[memory:ingestion_queue] dropping job: queue at capacity ({} pending) doc_id={} namespace={} title={}",
                     self.tx.max_capacity(),
+                    dropped.document_id,
                     dropped.document.namespace,
                     dropped.document.title,
                 );
@@
                 log::warn!(
-                    "[memory:ingestion_queue] dropping job: worker channel closed (shutdown?) namespace={} title={}",
+                    "[memory:ingestion_queue] dropping job: worker channel closed (shutdown?) doc_id={} namespace={} title={}",
+                    dropped.document_id,
                     dropped.document.namespace,
                     dropped.document.title,
                 );

As per coding guidelines, "Use structured, grep-friendly context with stable prefixes ... and include correlation fields such as request IDs, method names, and entity IDs when available."
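
A small sketch of the enriched warnings, using the message text from the suggested diff above. DroppedJob and DropReason are hypothetical types introduced only so the formatting can be shown and checked in isolation; the real code logs through log::warn! and reads the fields from the dropped IngestionJob.

```rust
/// Hypothetical, flattened view of the fields the warning needs.
pub struct DroppedJob<'a> {
    pub document_id: &'a str,
    pub namespace: &'a str,
    pub title: &'a str,
}

/// Hypothetical enum for the two drop paths.
pub enum DropReason {
    Full { capacity: usize },
    Closed,
}

/// Builds the grep-friendly warning line for a dropped job.
pub fn drop_warning(reason: &DropReason, job: &DroppedJob<'_>) -> String {
    match reason {
        DropReason::Full { capacity } => format!(
            "[memory:ingestion_queue] dropping job: queue at capacity ({} pending) doc_id={} namespace={} title={}",
            capacity, job.document_id, job.namespace, job.title
        ),
        DropReason::Closed => format!(
            "[memory:ingestion_queue] dropping job: worker channel closed (shutdown?) doc_id={} namespace={} title={}",
            job.document_id, job.namespace, job.title
        ),
    }
}
```

Keeping the stable prefix and the doc_id= key identical on both paths means a single grep for the prefix plus the document ID finds every skipped extraction for that document.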

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/ingestion/queue.rs` around lines 96 - 113, Update the
two warning logs in the ingestion queue drop paths to include the stable
document ID: add dropped.document.document_id to the formatted message and
arguments for both the capacity-full branch and the Closed branch (the code
around self.tx.max_capacity() and the self.state.dequeue() branch). Ensure the
same field name (dropped.document.document_id) is used in both log lines so the
warnings include namespace, title, and the stable document_id for tracing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/ingestion/queue.rs`:
- Around line 153-161: The function start_worker_with_capacity currently calls
tokio::sync::mpsc::channel(capacity) which will panic if capacity is 0; add an
explicit guard at the start of start_worker_with_capacity to check for capacity
== 0 and return/fail fast with a clear error message (e.g.,
panic!("start_worker_with_capacity: capacity must be >= 1") or convert to a
Result and return an Err) before calling mpsc::channel, so callers of
UnifiedMemory/IngestionQueue receive a clear diagnostic instead of an internal
Tokio panic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cb6f2cdc-71aa-4d4f-8bf9-74674130f331

📥 Commits

Reviewing files that changed from the base of the PR and between ec9708a and b5af973.

📒 Files selected for processing (1)
  • src/openhuman/memory/ingestion/queue.rs

Comment thread src/openhuman/memory/ingestion/queue.rs
Contributor

@graycyrus graycyrus left a comment


Clean fix — well scoped, well tested, addresses a real robustness gap.

What this does

Bounds the ingestion job channel from mpsc::unbounded_channel to mpsc::channel(512) and switches submit to non-blocking try_send with distinct Full vs Closed logging. The worker-side drain and the IngestionState accounting are both handled correctly on both drop paths. Four new tests exercise the capacity-bound, drain-recovery, worker-gone, and constant-guardrail branches.

Area Files Verdict
Rust core (memory/ingestion) queue.rs

Notes

  • CodeRabbit already flagged the zero-capacity guard on start_worker_with_capacity; agreed that it is worth adding, since tokio::sync::mpsc::channel(0) panics.
  • DEFAULT_QUEUE_CAPACITY = 512 is a reasonable middle ground; the doc comment justifying the number is appreciated.
  • No producer signature changes, no cross-cutting breakage — put_doc and store_skill_sync continue to call submit() unchanged.
  • The enqueue()/dequeue() accounting is correctly balanced on all three paths (success, full, closed).

Nothing else to flag beyond what CodeRabbit caught. Nice work.

… doc_id on drop

CodeRabbit on tinyhumansai#2444 flagged two follow-ups:

1. `tokio::sync::mpsc::channel(0)` panics with a cryptic Tokio-internal
   message ("mpsc bounded channel requires buffer > 0"). Add an explicit
   `assert!(capacity > 0, …)` in `start_worker_with_capacity` so misuse
   surfaces a clear, grep-friendly message at the call site instead of
   looking like a Tokio bug. New `#[should_panic]` test
   `start_worker_rejects_zero_capacity` pins the contract.

2. The drop-path warn logs now include `doc_id` alongside `namespace` and
   `title` so each warn line is a stable breadcrumb back to the upserted
   document whose graph-extraction follow-up was skipped.

5/5 tests pass; cargo fmt + check clean.
@obchain
Contributor Author

obchain commented May 21, 2026

Pushed ea3022b7 addressing both CodeRabbit items:

  • Added assert!(capacity > 0, …) at the head of start_worker_with_capacity so misuse surfaces a clear message instead of Tokio's cryptic internal panic. New #[should_panic] test pins the assertion text.
  • Drop-path warn logs now include doc_id alongside namespace + title for grep-friendly trace back to the upserted document.

cargo test --lib -- ingestion::queue — 5/5 pass.
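
For readers skimming the thread, the zero-capacity guard looks roughly like the sketch below. It is an illustration under stated assumptions: bounded_channel_checked is a hypothetical helper, the assertion message is invented (the PR's exact text is not shown here), and std::sync::mpsc::sync_channel stands in for tokio's channel. Note that std's sync_channel(0) does not panic (it creates a rendezvous channel), so in this stand-in the assert is the only thing rejecting zero.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Creates a bounded channel, failing fast with a clear message when the
/// requested capacity is zero.
pub fn bounded_channel_checked<T>(capacity: usize) -> (SyncSender<T>, Receiver<T>) {
    // Hypothetical message text, chosen for illustration only.
    assert!(
        capacity > 0,
        "ingestion queue capacity must be > 0 (got {})",
        capacity
    );
    sync_channel(capacity)
}
```

Calling bounded_channel_checked::<u8>(0) panics at the call site with the message above, while a capacity of 1 yields a channel that accepts one non-blocking send.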

@coderabbitai coderabbitai Bot added the rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. label May 21, 2026
@obchain obchain requested a review from graycyrus May 21, 2026 20:28

Labels

rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Development

Successfully merging this pull request may close these issues.

Memory ingestion queue is unbounded — buggy producer can OOM the core

2 participants