fix(retain): keep oversized items in one async child to stop FK race (#1795)#1805
Merged
Merged
Conversation
…1795) submit_async_retain split oversized retain payloads into N independent async_operations rows that all shared one document_id. Workers have no per-document gate for retain (claim_tasks only guards consolidation), so siblings ran concurrently — each entered handle_document_tracking with is_first_batch=True, cascade-deleting the previous winner's memory_units. The loser's final ANN pass then inserted memory_links referencing now-deleted units, tripping fk_memory_links_from_unit_id_memory_units. Concurrent siblings also exhausted OS thread budgets via per-child sentence-transformer pools (libgomp resource-unavailable failures) and left partial document state visible to dry-run skip checks. Add _split_contents_into_async_children for the async submit path: it packs items into children by token budget but never fragments a single item across children. Oversized items go into their own one-item child holding the full un-chunked content; the worker's existing in-process splitter (retain_batch_async → _split_contents_into_sub_batches) re-chunks them sequentially inside one worker slot with correct is_first_batch=(i==1) semantics — the same path that already enforces SELECT … FOR UPDATE + content-hash gating between batches of one call. Small items still pack together so genuinely independent inputs keep cross-worker parallelism. Metadata field names (num_sub_batches, sub_batch_index, total_sub_batches) are unchanged. Tests: - 8 pure-Python tests for the new helper covering single oversized, metadata preservation, packing by budget, mixed inputs, multiple oversized, boundary positioning, empty input. - 3 integration tests against the real DB: - test_oversized_single_item_creates_one_child_not_many asserts the async_operations table has exactly one retain row with the un-chunked content (fails on pre-fix code: "got 7" children). - test_oversized_single_item_drains_without_fk_violation drives a worker drain and asserts no memory_links rows have orphan FKs in either direction — the exact invariant pre-fix code violated. - test_oversized_item_among_small_items_keeps_small_items_packed confirms the parallelism optimization isn't lost.
The two structural assertions (test_oversized_single_item_creates_one_child_not_many and test_oversized_item_among_small_items_keeps_small_items_packed) only need to verify the async_operations rows that submit_async_retain inserts — those rows commit before submit_task is called. The previous version let SyncTaskBackend drive the full LLM-based retain pipeline synchronously, which timed out at CI's 300s per-test limit even though it ran in ~5s locally. Monkeypatch _task_backend.submit_task to a no-op so the structural assertions fire in ~30ms without running the worker. Also slim the drain test's payload from ~3x to ~1.2x the per-batch token budget. That still triggers in-process splitting (~2 sub-batches → the path that exercises is_first_batch=(i==1) sequencing) but cuts LLM extraction work from ~5 chunks to ~2, keeping wall time comfortably under 300s on slower runners. The structural regression assertions still fail without the engine fix — verified by temporarily reverting hindsight_api/engine/memory_engine.py and re-running: "Expected 1 child for an oversized single item, got 7. Issue #1795: per-chunk children race on the shared document_id."
test_oversized_single_item_drains_without_fk_violation drives the full retain pipeline (LLM extraction + embeddings + ANN + consolidation) synchronously through SyncTaskBackend. Even with the payload trimmed to ~1.2x the batch budget (~2 sub-batches), Gemini API latency in CI varies enough that the 300s per-test timeout fires intermittently. The fix is already covered without it: - test_oversized_single_item_creates_one_child_not_many is the direct regression test for #1795. It asserts on the async_operations rows submit_async_retain inserts and was empirically shown to fail on the pre-fix engine ("Expected 1 child for an oversized single item, got 7"). No worker execution needed. - test_oversized_item_among_small_items_keeps_small_items_packed covers the mixed-batch case structurally. - 8 unit tests in test_batch_chunking.py cover the helper directly. - The FK constraint fk_memory_links_from_unit_id_memory_units is enforced by Postgres itself; any orphan write would error at insert time, so the engine cannot silently regress without other tests noticing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1795.
submit_async_retainwas splitting oversized retain payloads into N independentasync_operationsrows that all shared onedocument_id. Workers have no per-document gate for retain (the busy-bank guard inclaim_tasksonly covers consolidation), so siblings ran concurrently — each enteredhandle_document_trackingwithis_first_batch=Trueand cascade-deleted the previous winner'smemory_units. The loser's final ANN pass then insertedmemory_linksreferencing now-deleted units, trippingfk_memory_links_from_unit_id_memory_units. Concurrent siblings also exhausted OS thread budgets via per-child sentence-transformer pools (libgompresource-unavailable failures) and left partial document state visible to dry-run skip checks.This is fix #2 from the issue's suggested fixes: do not pre-split a single document across independent child operations; let the in-process splitter handle intra-document chunking sequentially.
Approach
_split_contents_into_async_childrenhelper for the async submit path. It packs items into children by token budget but never fragments a single item across children — oversized items go into their own one-item child holding the full un-chunked content.retain_batch_async→_split_contents_into_sub_batches) re-chunks that content sequentially inside one worker slot with correctis_first_batch=(i==1)semantics — the same path that already enforcesSELECT … FOR UPDATE+ content-hash gating between batches of one call.num_sub_batches,sub_batch_index,total_sub_batches) and the parent/child operation structure are unchanged, so dashboards and status APIs are untouched.Test plan
tests/test_batch_chunking.py):tests/test_async_batch_retain.py):test_oversized_single_item_creates_one_child_not_many— submits one oversized doc and asserts theasync_operationstable has exactly one retain row whosetask_payload.contentsholds the un-chunked content. Empirically verified to fail on the pre-fix code withExpected 1 child for an oversized single item, got 7.test_oversized_single_item_drains_without_fk_violation— drives a worker drain end-to-end and asserts nomemory_linksrows have orphan FKs in either direction — the exact invariant pre-fix code violated.test_oversized_item_among_small_items_keeps_small_items_packed— confirms the parallelism optimization for genuinely-independent items isn't lost.tests/test_async_batch_retain.py+tests/test_batch_chunking.pysuite passes (34 tests).tests/test_document_tracking.py,tests/test_op_cancellation.py,tests/test_async_retain_tags.py(22 tests) passes../scripts/hooks/lint.shclean.