
fix(retain): prevent duplicate memory units from delta upsert chunk issues #1178

Merged
nicoloboschi merged 1 commit into main from fix/delta-upsert-chunk-cleanup
Apr 21, 2026

Conversation

@nicoloboschi
Collaborator

Summary

Fixes two bugs in the streaming retain pipeline that caused duplicate and stale memory units when documents were upserted multiple times with the same or different content.

Bug 1: Out-of-order chunk index assignment (root cause of delta always falling back)

The streaming retain pipeline extracts facts from chunks concurrently via asyncio.create_task. Chunks arrive at the database consumer in task-completion order (non-deterministic), but chunk_index was assigned based on arrival order:

# BUG: chunk_index based on arrival order, not original position
cm.chunk_index = global_chunk_offset + content_idx_in_batch

This caused chunk_id = {bank}_{doc}_{chunk_index} to be non-deterministic. On subsequent upserts, delta retain compared new content (deterministically chunked) against DB chunks stored at scrambled positions — every hash mismatched — so delta always fell back to full streaming re-processing, wasting LLM calls and risking duplicates through the recovery path.

Fix: Use the original global_idx (position in the pre-chunked content array) which is passed through the queue from the producer:

# FIX: deterministic chunk_id regardless of task completion order
cm.chunk_index = global_idx

Verified with tests: test_repeated_upsert_chunks_not_scrambled confirms chunks are stored at their correct indices, and test_delta_detects_unchanged_after_first_retain confirms delta now correctly identifies all chunks as unchanged on a second upsert (previously it always fell back).
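The fix relies on the producer attaching each chunk's original position to the queue item, so the consumer never depends on task-completion order. Here is a minimal standalone sketch of that pattern (names like producer/consumer and the dict-based store are illustrative, not the actual pipeline code):

```python
import asyncio
import hashlib

async def producer(chunks: list[str], queue: asyncio.Queue) -> None:
    # Pass each chunk's original position (global_idx) through the queue
    # alongside the chunk itself.
    for global_idx, chunk in enumerate(chunks):
        await queue.put((global_idx, chunk))
    await queue.put(None)  # sentinel: no more chunks

async def consumer(bank: str, doc: str, queue: asyncio.Queue) -> dict[str, str]:
    stored: dict[str, str] = {}
    while (item := await queue.get()) is not None:
        global_idx, chunk = item
        # Deterministic chunk_id: the same content always yields the same
        # ids, so delta retain can compare hashes position by position
        # regardless of which extraction task finished first.
        chunk_id = f"{bank}_{doc}_{global_idx}"
        stored[chunk_id] = hashlib.sha256(chunk.encode()).hexdigest()
    return stored

async def main() -> dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    prod = asyncio.create_task(producer(["alpha", "beta", "gamma"], queue))
    result = await consumer("bank1", "doc1", queue)
    await prod
    return result

print(sorted(asyncio.run(main())))
```

With arrival-order indexing, the same three chunks could land at ids ending in _2, _0, _1 on one run and _1, _2, _0 on the next; with global_idx they are stable across runs.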

Bug 2: Concurrent upsert race condition

The streaming path splits into separate transactions:

  1. TXN1: handle_document_tracking (cascade-delete old doc + insert new doc row)
  2. LLM extraction (seconds of latency)
  3. TXN2: store_chunks_batch + insert_facts_batch

Two concurrent retains for the same document could interleave:

Request A:  TXN1 (delete→nothing, insert doc) → LLM extraction → TXN2 (store chunks + units)
Request B:       TXN1 (delete A's doc row, but NO chunks to cascade yet!) → LLM → TXN2 (ON CONFLICT overwrites A's chunks + inserts own units)

Result: A's memory units (stale) + B's memory units both linked to same chunks = duplicates + data corruption.

Fix: PostgreSQL advisory lock per (bank_id, document_id) serializes concurrent retain operations:

  • Keyed on hash(bank_id:document_id) — doesn't block unrelated documents
  • Session-level lock held across the multi-transaction pipeline
  • Second request waits, then sees correct state
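A sketch of the locking shape, assuming an asyncpg-style pool (the function and key-derivation names are hypothetical; the actual PR's key hashing may differ):

```python
import hashlib
from typing import Any

def lock_key(bank_id: str, document_id: str) -> int:
    # Map (bank_id, document_id) to a signed 64-bit key for
    # pg_advisory_lock, which takes a bigint. Unrelated documents hash to
    # different keys and therefore never contend.
    digest = hashlib.sha256(f"{bank_id}:{document_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

async def retain_serialized(pool: Any, bank_id: str, document_id: str) -> None:
    # `pool` is assumed to be an asyncpg.Pool. Session-level advisory locks
    # belong to the connection (session), not to a transaction, so one
    # connection must be held across the whole multi-transaction pipeline.
    key = lock_key(bank_id, document_id)
    async with pool.acquire() as conn:
        await conn.execute("SELECT pg_advisory_lock($1)", key)
        try:
            ...  # TXN1 (doc tracking), LLM extraction, TXN2 (chunks + facts)
        finally:
            await conn.execute("SELECT pg_advisory_unlock($1)", key)
```

Because the lock survives transaction boundaries, request B blocks at pg_advisory_lock until A's TXN2 commits and A unlocks, then observes A's committed chunks instead of deleting a half-written document.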

Bug 3: Content hash mismatch in recovery detection (minor)

The recovery path computed new_content_hash from raw content, but the stored hash was computed after _sanitize_text(). If content contained control characters, recovery wouldn't trigger, causing unnecessary re-processing.

Fix: Apply same sanitization before hashing in recovery check.
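The invariant is that both sides of the recovery comparison hash the same bytes. A minimal sketch, with a stand-in sanitizer (the project's _sanitize_text may strip a different character set):

```python
import hashlib
import re

def sanitize_text(text: str) -> str:
    # Stand-in for the pipeline's _sanitize_text: drop control characters
    # except tab (\x09) and newline (\x0a).
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)

def content_hash(text: str) -> str:
    # Hash the *sanitized* content, matching what handle_document_tracking
    # stores, so the recovery check compares like with like.
    return hashlib.sha256(sanitize_text(text).encode("utf-8")).hexdigest()

raw = "hello\x00world"
# Before the fix, hashing `raw` directly could never match the stored
# hash of the sanitized text, so recovery never triggered.
assert content_hash(raw) == content_hash("helloworld")
```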

Test plan

  • test_repeated_upsert_chunks_not_scrambled — chunks stored at deterministic indices
  • test_delta_detects_unchanged_after_first_retain — delta correctly detects unchanged content on 2nd/3rd upsert
  • All 16 existing test_delta_retain.py tests pass
  • All 50 test_retain.py tests pass
  • All 8 HTTP API retain integration tests pass
  • test_chunk_storage_upsert.py idempotency tests pass
  • Lint passes

@nicoloboschi nicoloboschi force-pushed the fix/delta-upsert-chunk-cleanup branch 5 times, most recently from 18c991d to 566d2f5 Compare April 21, 2026 10:41
…ng and concurrent upserts

Two bugs in the streaming retain pipeline caused duplicate/stale memory units
when documents were upserted multiple times:

1. **Out-of-order chunk index assignment**: The producer-consumer pipeline
   extracted facts from chunks concurrently, but assigned chunk_index based on
   task completion order rather than the original document position. This caused
   chunks to be stored at scrambled indices, making delta retain unable to
   detect unchanged chunks on subsequent upserts (always falling back to
   expensive full re-processing).

2. **Concurrent upsert race condition**: The streaming path splits document
   tracking (cascade-delete) and chunk/unit creation into separate transactions
   with LLM extraction in between. Two concurrent retains for the same document
   could interleave, producing duplicates or stale data.

Fixes:
- Use the original `global_idx` (position in pre-chunked content) for
  chunk_index instead of arrival-order-based offset
- Add a PostgreSQL advisory lock per (bank_id, document_id) to serialize
  concurrent retain operations on the same document
- Add stale-request detection: after acquiring the lock, skip if the document
  was already updated by a more recent retain (prevents older content from
  overwriting newer conversation state)
- Use pg_try_advisory_lock with pool.acquire timeout to avoid deadlocks
  when pool is near capacity (graceful degradation)
- Fix content hash mismatch in recovery detection (sanitize before hashing
  to match what handle_document_tracking stores)
@nicoloboschi nicoloboschi force-pushed the fix/delta-upsert-chunk-cleanup branch from 566d2f5 to 57e7d3c Compare April 21, 2026 10:56
@nicoloboschi nicoloboschi merged commit 511ca72 into main Apr 21, 2026
53 of 54 checks passed