fix(consolidator): dedupe + ON CONFLICT for observation_sources INSERT #1340

Merged
nicoloboschi merged 1 commit into vectorize-io:main from youchi1:fix/consolidator-source-dedup
Apr 30, 2026

Conversation

Contributor

@youchi1 youchi1 commented Apr 29, 2026

Summary

Both _execute_update_action and _execute_create_action in the consolidator insert rows into the observation_sources junction table. Both INSERT sites were missing two safeguards, causing UniqueViolationError on observation_sources_pkey under realistic load in three scenarios:

  1. Intra-batch duplicates — source_ids (or source_memory_ids) can contain the same id more than once when several memories collapse to the same effective source. The unfiltered list was passed to executemany, which then hit the unique constraint within the same batch.
  2. Concurrent consolidation — two consolidator workers racing on the same observation hit the DELETE-then-INSERT window in _execute_update_action.
  3. Residual rows — rare but possible at transaction boundaries when the DELETE doesn't observe a row that the INSERT then collides with.
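Scenario 1 can be reproduced in miniature. This sketch uses sqlite3 as a stand-in for the Postgres/asyncpg code path (table and column names follow the PR; asyncpg raises UniqueViolationError where sqlite3 raises IntegrityError):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation_sources ("
    "observation_id INTEGER, source_id INTEGER, "
    "PRIMARY KEY (observation_id, source_id))"
)
try:
    # Unfiltered batch carrying the same (observation_id, source_id) pair
    # twice, as happens when several memories collapse to the same
    # effective source.
    conn.executemany(
        "INSERT INTO observation_sources VALUES (?, ?)",
        [(1, 10), (1, 11), (1, 10)],
    )
    batch_failed = False
except sqlite3.IntegrityError:
    # asyncpg surfaces this as UniqueViolationError on
    # observation_sources_pkey, rolling back the whole batch.
    batch_failed = True

print("batch aborted:", batch_failed)
```

The whole executemany batch aborts on the duplicate, which is why a single bad pair was enough to roll back an entire consolidation run.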

Fix

  • dict.fromkeys(source_ids) (and dict.fromkeys(source_memory_ids)) — preserves insertion order while deduping in-batch.
  • ON CONFLICT (observation_id, source_id) DO NOTHING on both INSERT sites — absorbs any surviving duplicate without aborting the batch.

Both layers are needed: dedupe avoids the round-trip on common in-batch duplicates; ON CONFLICT handles cross-batch / concurrent races.
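The two layers combined look roughly like this. It is an illustrative sketch, again with sqlite3 standing in for asyncpg/Postgres; the helper name link_sources is hypothetical, while the table, columns, and conflict target follow the PR:

```python
import sqlite3

def link_sources(conn, observation_id, source_ids):
    # Layer 1: dict.fromkeys dedupes while preserving insertion order.
    unique_ids = list(dict.fromkeys(source_ids))
    # Layer 2: ON CONFLICT ... DO NOTHING absorbs rows that already exist,
    # e.g. from a concurrent worker or a residual row the DELETE missed.
    conn.executemany(
        "INSERT INTO observation_sources (observation_id, source_id) "
        "VALUES (?, ?) "
        "ON CONFLICT (observation_id, source_id) DO NOTHING",
        [(observation_id, sid) for sid in unique_ids],
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation_sources ("
    "observation_id INTEGER, source_id INTEGER, "
    "PRIMARY KEY (observation_id, source_id))"
)
link_sources(conn, 1, [10, 11, 10, 12])  # intra-batch duplicate: deduped
link_sources(conn, 1, [11, 13])          # cross-batch overlap: ON CONFLICT
rows = conn.execute("SELECT COUNT(*) FROM observation_sources").fetchone()[0]
print(rows)  # 4 distinct (observation_id, source_id) pairs survive
```

Both calls complete without raising: the first batch is deduped before hitting the database, and the second batch's overlap with already-inserted rows is silently skipped by the conflict clause.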

Reproduced in production on a workload that retained sessions with overlapping source memories. After this change, consolidation runs that were previously rolling back per-batch now complete cleanly.

Test plan

  • Reproduced UniqueViolationError on observation_sources_pkey on the original code with overlapping source_ids
  • Confirmed both INSERT sites are exercised (line ~1029 update path, line ~1416 create path)
  • Verified deployed fix on a live system; consolidation no longer raises

Both _execute_update_action and _execute_create_action insert into the
observation_sources junction table. Previously, both:
  - Built INSERT batches without deduping the source_ids list
  - Lacked ON CONFLICT handling

This caused UniqueViolationError on (observation_id, source_id) under
several scenarios:
  1. Same source_id repeated within source_ids (a single batch can have
     duplicates when several memories collapse to the same effective
     source).
  2. Concurrent consolidation of the same observation racing on the
     DELETE-then-INSERT pattern in _execute_update_action.
  3. Residual rows surviving the DELETE (rare but possible at transaction
     boundaries).

Fix:
  - dict.fromkeys() preserves insertion order while deduping the list.
  - ON CONFLICT (observation_id, source_id) DO NOTHING absorbs any
    surviving duplicates without aborting the entire batch.

Both layers are needed: dedupe avoids the round-trip on intra-batch
duplicates, ON CONFLICT handles cross-batch / concurrent races.
@nicoloboschi nicoloboschi merged commit c4dc8c3 into vectorize-io:main Apr 30, 2026
@youchi1 youchi1 deleted the fix/consolidator-source-dedup branch May 4, 2026 13:25