
fix(retain): prevent duplicate memory units from delta upsert chunk issues #1178

Merged
nicoloboschi merged 1 commit into main from fix/delta-upsert-chunk-cleanup
Apr 21, 2026

Conversation

@nicoloboschi
Collaborator

Summary

Fixes two bugs in the streaming retain pipeline that caused duplicate and stale memory units when documents were upserted multiple times with the same or different content.

Bug 1: Out-of-order chunk index assignment (root cause of delta always falling back)

The streaming retain pipeline extracts facts from chunks concurrently via asyncio.create_task. Chunks arrive at the database consumer in task-completion order (non-deterministic), but chunk_index was assigned based on arrival order:

# BUG: chunk_index based on arrival order, not original position
cm.chunk_index = global_chunk_offset + content_idx_in_batch

This caused chunk_id = {bank}_{doc}_{chunk_index} to be non-deterministic. On subsequent upserts, delta retain compared new content (deterministically chunked) against DB chunks stored at scrambled positions — every hash mismatched — so delta always fell back to full streaming re-processing, wasting LLM calls and risking duplicates through the recovery path.

Fix: Use the original global_idx (position in the pre-chunked content array) which is passed through the queue from the producer:

# FIX: deterministic chunk_id regardless of task completion order
cm.chunk_index = global_idx

Verified with tests: test_repeated_upsert_chunks_not_scrambled confirms chunks are stored at their correct indices, and test_delta_detects_unchanged_after_first_retain confirms delta now correctly identifies all chunks as unchanged on a second upsert (previously it always fell back).
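The fix relies on the producer attaching each chunk's original position to the queue item, so the consumer never depends on task-completion order. Here is a minimal standalone sketch of that pattern (names like producer/consumer and the dict-based store are illustrative, not the actual pipeline code):

```python
import asyncio
import hashlib

async def producer(chunks: list[str], queue: asyncio.Queue) -> None:
    # Pass each chunk's original position (global_idx) through the queue
    # alongside the chunk itself.
    for global_idx, chunk in enumerate(chunks):
        await queue.put((global_idx, chunk))
    await queue.put(None)  # sentinel: no more chunks

async def consumer(bank: str, doc: str, queue: asyncio.Queue) -> dict[str, str]:
    stored: dict[str, str] = {}
    while (item := await queue.get()) is not None:
        global_idx, chunk = item
        # Deterministic chunk_id: the same content always yields the same
        # ids, so delta retain can compare hashes position by position
        # regardless of which extraction task finished first.
        chunk_id = f"{bank}_{doc}_{global_idx}"
        stored[chunk_id] = hashlib.sha256(chunk.encode()).hexdigest()
    return stored

async def main() -> dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    prod = asyncio.create_task(producer(["alpha", "beta", "gamma"], queue))
    result = await consumer("bank1", "doc1", queue)
    await prod
    return result

print(sorted(asyncio.run(main())))
```

With arrival-order indexing, the same three chunks could land at ids ending in _2, _0, _1 on one run and _1, _2, _0 on the next; with global_idx they are stable across runs.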

Bug 2: Concurrent upsert race condition

The streaming path splits into separate transactions:

  1. TXN1: handle_document_tracking (cascade-delete old doc + insert new doc row)
  2. LLM extraction (seconds of latency)
  3. TXN2: store_chunks_batch + insert_facts_batch

Two concurrent retains for the same document could interleave:

Request A:  TXN1 (delete→nothing, insert doc) → LLM extraction → TXN2 (store chunks + units)
Request B:       TXN1 (delete A's doc row, but NO chunks to cascade yet!) → LLM → TXN2 (ON CONFLICT overwrites A's chunks + inserts own units)

Result: A's memory units (stale) + B's memory units both linked to same chunks = duplicates + data corruption.

Fix: PostgreSQL advisory lock per (bank_id, document_id) serializes concurrent retain operations:

  • Keyed on hash(bank_id:document_id) — doesn't block unrelated documents
  • Session-level lock held across the multi-transaction pipeline
  • Second request waits, then sees correct state
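A sketch of the locking shape, assuming an asyncpg-style pool (the function and key-derivation names are hypothetical; the actual PR's key hashing may differ):

```python
import hashlib
from typing import Any

def lock_key(bank_id: str, document_id: str) -> int:
    # Map (bank_id, document_id) to a signed 64-bit key for
    # pg_advisory_lock, which takes a bigint. Unrelated documents hash to
    # different keys and therefore never contend.
    digest = hashlib.sha256(f"{bank_id}:{document_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

async def retain_serialized(pool: Any, bank_id: str, document_id: str) -> None:
    # `pool` is assumed to be an asyncpg.Pool. Session-level advisory locks
    # belong to the connection (session), not to a transaction, so one
    # connection must be held across the whole multi-transaction pipeline.
    key = lock_key(bank_id, document_id)
    async with pool.acquire() as conn:
        await conn.execute("SELECT pg_advisory_lock($1)", key)
        try:
            ...  # TXN1 (doc tracking), LLM extraction, TXN2 (chunks + facts)
        finally:
            await conn.execute("SELECT pg_advisory_unlock($1)", key)
```

Because the lock survives transaction boundaries, request B blocks at pg_advisory_lock until A's TXN2 commits and A unlocks, then observes A's committed chunks instead of deleting a half-written document.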

Bug 3: Content hash mismatch in recovery detection (minor)

The recovery path computed new_content_hash from raw content, but the stored hash was computed after _sanitize_text(). If content contained control characters, recovery wouldn't trigger, causing unnecessary re-processing.

Fix: Apply same sanitization before hashing in recovery check.
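The invariant is that both sides of the recovery comparison hash the same bytes. A minimal sketch, with a stand-in sanitizer (the project's _sanitize_text may strip a different character set):

```python
import hashlib
import re

def sanitize_text(text: str) -> str:
    # Stand-in for the pipeline's _sanitize_text: drop control characters
    # except tab (\x09) and newline (\x0a).
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)

def content_hash(text: str) -> str:
    # Hash the *sanitized* content, matching what handle_document_tracking
    # stores, so the recovery check compares like with like.
    return hashlib.sha256(sanitize_text(text).encode("utf-8")).hexdigest()

raw = "hello\x00world"
# Before the fix, hashing `raw` directly could never match the stored
# hash of the sanitized text, so recovery never triggered.
assert content_hash(raw) == content_hash("helloworld")
```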

Test plan

  • test_repeated_upsert_chunks_not_scrambled — chunks stored at deterministic indices
  • test_delta_detects_unchanged_after_first_retain — delta correctly detects unchanged content on 2nd/3rd upsert
  • All 16 existing test_delta_retain.py tests pass
  • All 50 test_retain.py tests pass
  • All 8 HTTP API retain integration tests pass
  • test_chunk_storage_upsert.py idempotency tests pass
  • Lint passes

@nicoloboschi nicoloboschi force-pushed the fix/delta-upsert-chunk-cleanup branch 5 times, most recently from 18c991d to 566d2f5 Compare April 21, 2026 10:41
…ng and concurrent upserts

Two bugs in the streaming retain pipeline caused duplicate/stale memory units
when documents were upserted multiple times:

1. **Out-of-order chunk index assignment**: The producer-consumer pipeline
   extracted facts from chunks concurrently, but assigned chunk_index based on
   task completion order rather than the original document position. This caused
   chunks to be stored at scrambled indices, making delta retain unable to
   detect unchanged chunks on subsequent upserts (always falling back to
   expensive full re-processing).

2. **Concurrent upsert race condition**: The streaming path splits document
   tracking (cascade-delete) and chunk/unit creation into separate transactions
   with LLM extraction in between. Two concurrent retains for the same document
   could interleave, producing duplicates or stale data.

Fixes:
- Use the original `global_idx` (position in pre-chunked content) for
  chunk_index instead of arrival-order-based offset
- Add a PostgreSQL advisory lock per (bank_id, document_id) to serialize
  concurrent retain operations on the same document
- Add stale-request detection: after acquiring the lock, skip if the document
  was already updated by a more recent retain (prevents older content from
  overwriting newer conversation state)
- Use pg_try_advisory_lock with pool.acquire timeout to avoid deadlocks
  when pool is near capacity (graceful degradation)
- Fix content hash mismatch in recovery detection (sanitize before hashing
  to match what handle_document_tracking stores)
@nicoloboschi nicoloboschi force-pushed the fix/delta-upsert-chunk-cleanup branch from 566d2f5 to 57e7d3c Compare April 21, 2026 10:56
@nicoloboschi nicoloboschi merged commit 511ca72 into main Apr 21, 2026
53 of 54 checks passed