fix(retain): keep oversized items in one async child to stop FK race (#1795)#1805

Merged
nicoloboschi merged 3 commits into
mainfrom
fix/issue-1795-oversized-retain-no-fragment
May 28, 2026

Conversation

@nicoloboschi
Collaborator

Summary

Fixes #1795. submit_async_retain was splitting oversized retain payloads into N independent async_operations rows that all shared one document_id. Workers have no per-document gate for retain (the busy-bank guard in claim_tasks only covers consolidation), so siblings ran concurrently — each entered handle_document_tracking with is_first_batch=True and cascade-deleted the previous winner's memory_units. The loser's final ANN pass then inserted memory_links referencing now-deleted units, tripping fk_memory_links_from_unit_id_memory_units. Concurrent siblings also exhausted OS thread budgets via per-child sentence-transformer pools (libgomp resource-unavailable failures) and left partial document state visible to dry-run skip checks.

This is fix #2 from the issue's suggested fixes: do not pre-split a single document across independent child operations; let the in-process splitter handle intra-document chunking sequentially.
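The failure mode and the fix can be illustrated with a toy model. This is only a sketch: `ToyStore` and `retain_chunk` are hypothetical stand-ins, the "cascade delete" is modelled as clearing a list, and the interleaving of concurrent siblings is simulated by calling them one after another with `is_first_batch=True`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for memory_units, keyed by document_id."""

    units: dict[str, list[str]] = field(default_factory=dict)

    def handle_document_tracking(self, document_id: str, is_first_batch: bool) -> None:
        # A "first batch" treats the document as fresh and clears prior units,
        # mimicking the cascade delete described in the summary.
        if is_first_batch:
            self.units[document_id] = []

    def retain_chunk(self, document_id: str, chunk: str, is_first_batch: bool) -> None:
        self.handle_document_tracking(document_id, is_first_batch)
        self.units.setdefault(document_id, []).append(chunk)


chunks = ["chunk-1", "chunk-2", "chunk-3"]

# Pre-fix shape: every chunk is an independent child operation, so each one
# believes it is the first batch for the shared document_id.
racy = ToyStore()
for chunk in chunks:
    racy.retain_chunk("doc-1", chunk, is_first_batch=True)

# Post-fix shape: one child, chunks processed in order, only i == 1 is "first".
sequential = ToyStore()
for i, chunk in enumerate(chunks, start=1):
    sequential.retain_chunk("doc-1", chunk, is_first_batch=(i == 1))

print(racy.units["doc-1"])        # only the last writer's chunk remains
print(sequential.units["doc-1"])  # all three chunks remain
```

In the real system the siblings also overlap in time, which is what lets a loser write memory_links against units a later sibling has already deleted; the toy model only shows why every sibling claiming `is_first_batch=True` is unsafe.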

Approach

  • New _split_contents_into_async_children helper for the async submit path. It packs items into children by token budget but never fragments a single item across children — oversized items go into their own one-item child holding the full un-chunked content.
  • The worker's existing in-process splitter (retain_batch_async → _split_contents_into_sub_batches) re-chunks that content sequentially inside one worker slot with correct is_first_batch=(i==1) semantics. This is the same path that already enforces SELECT … FOR UPDATE + content-hash gating between batches of one call.
  • Small items still pack together so genuinely independent inputs keep cross-worker parallelism.
  • Metadata field names (num_sub_batches, sub_batch_index, total_sub_batches) and the parent/child operation structure are unchanged, so dashboards and status APIs are untouched.
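The packing rule above could be sketched roughly as follows. This is not the code of `_split_contents_into_async_children`: the function name, the whitespace-based token counter, and the choice to flush pending small items before an oversized one are all assumptions made for illustration.

```python
from typing import Callable


def split_into_children(
    contents: list[dict],
    max_tokens: int,
    # Assumption: a crude whitespace tokenizer stands in for the real counter.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[list[dict]]:
    """Pack items into children by token budget without fragmenting any item."""
    children: list[list[dict]] = []
    current: list[dict] = []
    current_tokens = 0

    for item in contents:
        tokens = count_tokens(item["content"])

        if tokens > max_tokens:
            # Oversized item: close the pending child, then give the item
            # its own one-item child with the content left un-chunked.
            if current:
                children.append(current)
                current, current_tokens = [], 0
            children.append([item])
            continue

        if current and current_tokens + tokens > max_tokens:
            # Budget would overflow: start a new child for this small item.
            children.append(current)
            current, current_tokens = [], 0

        current.append(item)
        current_tokens += tokens

    if current:
        children.append(current)
    return children


items = [
    {"content": "a b"},
    {"content": "c d"},
    {"content": " ".join(["w"] * 50), "document_id": "big"},
    {"content": "e f"},
]
children = split_into_children(items, max_tokens=5)
print([len(child) for child in children])
```

With a budget of 5 tokens, the two leading small items share a child, the 50-token item is isolated with its metadata intact, and the trailing small item starts a new child.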

Test plan

  • 8 pure-Python tests for the new helper (tests/test_batch_chunking.py):
    • single oversized item → exactly 1 child holding the full content
    • metadata preserved through the helper
    • small items packed by budget
    • mixed small + oversized → small packed, oversized isolated
    • multiple oversized items → one each
    • single small item, empty input, oversized-at-boundaries
  • 3 integration tests against the real DB (tests/test_async_batch_retain.py):
    • test_oversized_single_item_creates_one_child_not_many: submits one oversized doc and asserts the async_operations table has exactly one retain row whose task_payload.contents holds the un-chunked content. Verified to fail on the pre-fix code with "Expected 1 child for an oversized single item, got 7".
    • test_oversized_single_item_drains_without_fk_violation: drives a worker drain end-to-end and asserts no memory_links rows have orphan FKs in either direction, which is the exact invariant the pre-fix code violated.
    • test_oversized_item_among_small_items_keeps_small_items_packed: confirms the parallelism optimization for genuinely independent items isn't lost.
  • Full tests/test_async_batch_retain.py + tests/test_batch_chunking.py suite passes (34 tests).
  • Sanity sweep of tests/test_document_tracking.py, tests/test_op_cancellation.py, tests/test_async_retain_tags.py (22 tests) passes.
  • ./scripts/hooks/lint.sh clean.
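An "orphan FK in either direction" check of the kind the drain test describes might look like the sketch below. It uses an in-memory SQLite database purely so the example is self-contained; the real tests run against Postgres. The `to_unit_id` column and the `id` primary key are assumptions (only `from_unit_id` is implied by the constraint name), and no foreign key is declared here so that an orphan row can be inserted to show the query detecting it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, assumed schema; no FK constraint so orphans can be simulated.
    CREATE TABLE memory_units (id TEXT PRIMARY KEY);
    CREATE TABLE memory_links (from_unit_id TEXT, to_unit_id TEXT);
    INSERT INTO memory_units VALUES ('u1'), ('u2');
    INSERT INTO memory_links VALUES ('u1', 'u2');
    """
)

# Count links whose source or target unit no longer exists.
ORPHAN_QUERY = """
    SELECT COUNT(*)
    FROM memory_links AS l
    LEFT JOIN memory_units AS f ON f.id = l.from_unit_id
    LEFT JOIN memory_units AS t ON t.id = l.to_unit_id
    WHERE f.id IS NULL OR t.id IS NULL
"""


def count_orphan_links(connection: sqlite3.Connection) -> int:
    return connection.execute(ORPHAN_QUERY).fetchone()[0]


healthy = count_orphan_links(conn)

# Simulate the pre-fix outcome: a link pointing at a unit that was deleted.
conn.execute("INSERT INTO memory_links VALUES ('deleted-unit', 'u2')")
broken = count_orphan_links(conn)

print(healthy, broken)
```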

…1795)

submit_async_retain split oversized retain payloads into N independent
async_operations rows that all shared one document_id. Workers have no
per-document gate for retain (claim_tasks only guards consolidation),
so siblings ran concurrently — each entered handle_document_tracking
with is_first_batch=True, cascade-deleting the previous winner's
memory_units. The loser's final ANN pass then inserted memory_links
referencing now-deleted units, tripping
fk_memory_links_from_unit_id_memory_units. Concurrent siblings also
exhausted OS thread budgets via per-child sentence-transformer pools
(libgomp resource-unavailable failures) and left partial document
state visible to dry-run skip checks.

Add _split_contents_into_async_children for the async submit path: it
packs items into children by token budget but never fragments a single
item across children. Oversized items go into their own one-item child
holding the full un-chunked content; the worker's existing in-process
splitter (retain_batch_async → _split_contents_into_sub_batches)
re-chunks them sequentially inside one worker slot with correct
is_first_batch=(i==1) semantics — the same path that already enforces
SELECT … FOR UPDATE + content-hash gating between batches of one call.

Small items still pack together so genuinely independent inputs keep
cross-worker parallelism. Metadata field names (num_sub_batches,
sub_batch_index, total_sub_batches) are unchanged.

Tests:
- 8 pure-Python tests for the new helper covering single oversized,
  metadata preservation, packing by budget, mixed inputs, multiple
  oversized, boundary positioning, empty input.
- 3 integration tests against the real DB:
  - test_oversized_single_item_creates_one_child_not_many asserts the
    async_operations table has exactly one retain row with the
    un-chunked content (fails on pre-fix code: "got 7" children).
  - test_oversized_single_item_drains_without_fk_violation drives a
    worker drain and asserts no memory_links rows have orphan FKs in
    either direction — the exact invariant pre-fix code violated.
  - test_oversized_item_among_small_items_keeps_small_items_packed
    confirms the parallelism optimization isn't lost.

The two structural assertions (test_oversized_single_item_creates_one_child_not_many
and test_oversized_item_among_small_items_keeps_small_items_packed) only need to
verify the async_operations rows that submit_async_retain inserts — those rows
commit before submit_task is called. The previous version let SyncTaskBackend
drive the full LLM-based retain pipeline synchronously, which timed out at
CI's 300s per-test limit even though it ran in ~5s locally.

Monkeypatch _task_backend.submit_task to a no-op so the structural assertions
fire in ~30ms without running the worker.
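A minimal sketch of that no-op pattern, assuming `submit_task` is an async method and using hypothetical `FakeEngine` / `FakeTaskBackend` classes in place of the real engine. It uses `unittest.mock.patch.object` so it runs standalone; inside a pytest test the equivalent would be `monkeypatch.setattr` on the same attribute.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeTaskBackend:
    """Hypothetical backend whose real submit_task would run the pipeline."""

    async def submit_task(self, payload: dict) -> None:
        raise RuntimeError("would run the full retain pipeline")


class FakeEngine:
    """Hypothetical engine: records a row, then hands it to the backend."""

    def __init__(self) -> None:
        self._task_backend = FakeTaskBackend()
        self.rows: list[dict] = []  # stand-in for the async_operations table

    async def submit_async_retain(self, contents: list[dict]) -> None:
        row = {"operation_type": "retain", "contents": contents}
        self.rows.append(row)  # the row is "committed" before submission
        await self._task_backend.submit_task(row)


async def run_structural_check() -> list[dict]:
    engine = FakeEngine()
    # Replace submit_task with a no-op so only the inserted rows are examined.
    with patch.object(engine._task_backend, "submit_task", new=AsyncMock()):
        await engine.submit_async_retain([{"content": "oversized document"}])
    return engine.rows


rows = asyncio.run(run_structural_check())
print(len(rows))
```

Because the row is recorded before `submit_task` is awaited, the assertion on the rows does not depend on any worker execution.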

Also slim the drain test's payload from ~3x to ~1.2x the per-batch token budget.
That still triggers in-process splitting (~2 sub-batches → the path that
exercises is_first_batch=(i==1) sequencing) but cuts LLM extraction work from
~5 chunks to ~2, keeping wall time comfortably under 300s on slower runners.

The structural regression assertions still fail without the engine fix —
verified by temporarily reverting hindsight_api/engine/memory_engine.py and
re-running: "Expected 1 child for an oversized single item, got 7. Issue #1795:
per-chunk children race on the shared document_id."

test_oversized_single_item_drains_without_fk_violation drives the full
retain pipeline (LLM extraction + embeddings + ANN + consolidation)
synchronously through SyncTaskBackend. Even with the payload trimmed
to ~1.2x the batch budget (~2 sub-batches), Gemini API latency in CI
varies enough that the 300s per-test timeout fires intermittently.

The fix is already covered without it:
- test_oversized_single_item_creates_one_child_not_many is the direct
  regression test for #1795. It asserts on the async_operations rows
  submit_async_retain inserts and was empirically shown to fail on
  the pre-fix engine ("Expected 1 child for an oversized single item,
  got 7"). No worker execution needed.
- test_oversized_item_among_small_items_keeps_small_items_packed
  covers the mixed-batch case structurally.
- 8 unit tests in test_batch_chunking.py cover the helper directly.
- The FK constraint fk_memory_links_from_unit_id_memory_units is
  enforced by Postgres itself; any orphan write would error at insert
  time, so the engine cannot silently regress without other tests
  noticing.

@nicoloboschi nicoloboschi merged commit 74525cc into main May 28, 2026
143 of 144 checks passed
@nicoloboschi nicoloboschi deleted the fix/issue-1795-oversized-retain-no-fragment branch May 28, 2026 12:13
Development

Successfully merging this pull request may close these issues.

0.7.0: oversized retain split can enqueue concurrent chunks for same document, causing worker thread exhaustion and ANN FK failures