
fix(consolidation): indefinite retry with backoff + dedup-by-bank guard#1811

Merged: nicoloboschi merged 2 commits into main from fix/consolidation-retry-dedup-by-bank on May 28, 2026

nicoloboschi commented on May 28, 2026

Summary

Two coupled changes to the consolidation task-retry path in execute_task:

  1. Indefinite retry with capped exponential backoff (60, 120, 240, 480, 960, then pinned at 1800s) — replaces the generic 60s × 3 cap that other task types still use. Transient outages must eventually recover; capping silently dead-letters a bank's unconsolidated rows.
  2. Per-bank dedup guard — if another consolidation op is already pending for the same bank, skip the retry and let the peer cover the work. Without this, indefinite retry would let multiple ops for the same bank retry forever in lockstep during an outage.
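A minimal sketch of a helper that would produce the schedule in item 1. The name `backoff_seconds`, its `base` and `cap` parameters, and the zero-based `retry_count` are assumptions for illustration; the PR's `_consolidation_retry_backoff_seconds` may be written differently.

```python
def backoff_seconds(retry_count: int, base: int = 60, cap: int = 1800) -> int:
    """Delay before the next attempt. Assumes retry_count is 0 for the first retry."""
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    delay = base
    for _ in range(retry_count):
        if delay >= cap:
            # Already pinned at the cap; further doubling changes nothing.
            return cap
        delay *= 2
    return min(delay, cap)


schedule = [backoff_seconds(n) for n in range(7)]
print(schedule)              # [60, 120, 240, 480, 960, 1800, 1800]
print(backoff_seconds(100))  # 1800
```

The sixth step would be 1920s under pure doubling, which is why the schedule pins at 1800s from that point on. There is no attempt limit in the helper itself; giving up is a decision left to the caller.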

This surfaced from the discussion on #1799: the symptom there (event-chain breaks leave memory_units unconsolidated) is largely a retry-budget exhaustion problem during long outages, not a missing-trigger problem. Fixing the retry layer addresses it without polling memory_units.

Why

Before this PR, execute_task retried transient consolidation failures with the generic schedule (60s × 3, then _mark_failed). During a sustained upstream outage:

  1. Op A runs → LLM 503 → re-pended via RetryTaskAt.
  2. Retain succeeds and enqueues op B for the same bank (submit_async_consolidation dedup only checks pending, not processing).
  3. Both A and B burn 3 retry slots each; every additional retain spawns C, D, … each burning three more.
  4. After the budget is exhausted, the ops are marked failed and the bank's unconsolidated rows sit untouched until the next retain triggers a fresh op — which then also burns its budget.

This PR breaks the cycle:

  • The dedup guard collapses step 3 to a single retrying op per bank.
  • The indefinite-with-cap schedule on that single op rides out outages of any length, so step 4 doesn't happen: the next attempt eventually succeeds when the dependency comes back.
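To make the difference concrete, the sketch below adds up the delays of both schedules. It is rough arithmetic under stated assumptions: it ignores the time each attempt spends running, and `retries_to_outlast` is a hypothetical helper written for this example, not part of the PR.

```python
from itertools import accumulate

OLD_DELAYS = [60, 60, 60]                    # generic schedule: 60s x 3
NEW_DELAYS = [60, 120, 240, 480, 960, 1800]  # first six consolidation retries

# Seconds after the first failure at which each retry is scheduled.
print(list(accumulate(OLD_DELAYS)))  # [60, 120, 180]
print(list(accumulate(NEW_DELAYS)))  # [60, 180, 420, 900, 1860, 3660]


def retries_to_outlast(outage_seconds: int, base: int = 60, cap: int = 1800) -> int:
    """Count retries scheduled until one lands at or after the end of the outage."""
    elapsed, delay, count = 0, base, 0
    while elapsed < outage_seconds:
        elapsed += delay
        count += 1
        delay = min(delay * 2, cap)
    return count


print(retries_to_outlast(6 * 3600))  # 16
```

Under these assumptions the old budget is spent about three minutes after the first failure, while a six-hour outage costs the single surviving op 16 scheduled attempts.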

Deterministic failures (integrity violations, embedding dimension errors) are still filtered upstream by _is_non_retryable_task_error and marked failed immediately. Other task types keep their existing 60s × 3 generic schedule.

What changed

hindsight_api/engine/memory_engine.py

  • _consolidation_retry_backoff_seconds(retry_count) — capped exponential backoff helper, no attempt cap.
  • _has_other_pending_consolidation(bank_id, operation_id) — single-row check on async_operations, fails open on DB errors.
  • Generic exception branch of execute_task for consolidation tasks: fire failure webhook, run dedup check, then either bare-raise (poller marks failed) or RetryTaskAt with the new backoff.
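The decision made in that exception branch can be sketched as below. This is an illustration, not the code in memory_engine.py: `handle_consolidation_failure`, its callable parameters, and the `RetryTaskAt` constructor shown here are assumed stand-ins, and the real branch uses a bare `raise` inside the `except` block rather than re-raising a passed-in exception.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


class RetryTaskAt(Exception):
    """Stand-in for the real exception; the actual constructor may differ."""

    def __init__(self, retry_at: datetime):
        super().__init__(f"retry at {retry_at.isoformat()}")
        self.retry_at = retry_at


def handle_consolidation_failure(
    exc: Exception,
    retry_count: int,
    has_peer_pending: Callable[[], bool],
    fire_failure_webhook: Callable[[Exception], None],
    now: datetime,
) -> None:
    """Always raises: the original error (op gets marked failed) or RetryTaskAt."""
    fire_failure_webhook(exc)
    if has_peer_pending():
        # A pending peer for the same bank will cover the work, so do not retry.
        raise exc
    # Capped exponential backoff: 60, 120, 240, 480, 960, then 1800.
    delay = min(60 * 2 ** min(retry_count, 5), 1800)
    raise RetryTaskAt(now + timedelta(seconds=delay))


now = datetime(2026, 5, 28, tzinfo=timezone.utc)
try:
    handle_consolidation_failure(
        RuntimeError("LLM 503"), 2, lambda: False, lambda e: None, now
    )
except RetryTaskAt as retry:
    print(retry.retry_at - now)  # 0:04:00
```

Passing the peer check in as a callable keeps the decision logic testable without a database, which matches how the regression tests are described: peer pending, no peer, and peer in a different bank.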

tests/test_consolidation_retry_dedup_by_bank.py (new) — 7 regression tests:

  • Dedup guard: peer pending → no RetryTaskAt; no peer → RetryTaskAt fires; peer in different bank → no suppression.
  • Backoff schedule: unit test of the helper (60/120/240/480/960/cap), parameterised execute_task test verifying retry_at matches each step, indefinite-retry test at retry_count=100 confirming the cap holds and we never give up.

Test plan

  • pytest tests/test_consolidation_retry_dedup_by_bank.py — 9 passed
  • pytest tests/test_consolidation_retry_budget.py tests/test_consolidation_failure_recovery.py tests/test_integrity_violation_not_retried.py — no regressions
  • ./scripts/hooks/lint.sh — clean
  • CI green

What this PR does not change

  • No new env vars, no opt-in flag — these are correctness fixes for the current retry semantics.
  • No changes to submit_async_consolidation dedup semantics — that path still only dedups pending.
  • No memory_units polling — the discussion on #1799 (feat(worker): add opt-in periodic scanner for pending consolidations) ruled out the scanner approach.
  • Other task types (batch_retain, refresh_mental_model, webhook_delivery) keep their existing 60s × 3 retry schedule.

…ending

When a consolidation task hits a transient error, execute_task raises
RetryTaskAt to re-queue the same operation. During a long upstream outage
(LLM provider down, DB flapping), every successful retain on the same bank
also enqueues a fresh consolidation op via submit_async_consolidation, so
each op independently consumes its own 3-retry budget — a retry storm
against the same broken dependency.

Add a per-bank dedup check before raising RetryTaskAt: if another
consolidation op is already in 'pending' for the same bank, the current op
is failed instead of retried. The pending peer will process the same
unconsolidated rows when the worker picks it up.

The check fails open: a DB hiccup during the dedup lookup returns False so
the normal retry path runs rather than swallowing a real failure.
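A fail-open check of this kind might look like the sketch below, shown against an in-memory SQLite table so it runs on its own. Only the table name `async_operations` comes from this PR; the column names (`operation_id`, `bank_id`, `operation_type`, `status`), the function name, and the use of `sqlite3` are assumptions for illustration.

```python
import sqlite3


def has_other_pending_consolidation(
    conn: sqlite3.Connection, bank_id: str, operation_id: str
) -> bool:
    """True if a different consolidation op is pending for the same bank."""
    try:
        row = conn.execute(
            "SELECT 1 FROM async_operations "
            "WHERE bank_id = ? AND operation_type = 'consolidation' "
            "AND status = 'pending' AND operation_id != ? LIMIT 1",
            (bank_id, operation_id),
        ).fetchone()
        return row is not None
    except sqlite3.Error:
        # Fail open: a failed lookup must not suppress the normal retry.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE async_operations "
    "(operation_id TEXT, bank_id TEXT, operation_type TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO async_operations VALUES (?, ?, ?, ?)",
    [
        ("op-a", "bank-1", "consolidation", "processing"),
        ("op-b", "bank-1", "consolidation", "pending"),
        ("op-c", "bank-2", "consolidation", "processing"),
    ],
)

print(has_other_pending_consolidation(conn, "bank-1", "op-a"))  # True
print(has_other_pending_consolidation(conn, "bank-2", "op-c"))  # False
```

Excluding the current `operation_id` matters: without it, an op that is itself recorded as pending would count as its own peer and never retry.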
… backoff

Replace the inherited 60s × 3 generic retry for consolidation tasks with a
consolidation-specific schedule: exponential backoff (60, 120, 240, 480,
960, then pinned at 1800s cap) with no attempt cap.

Capping retries silently dead-letters a bank's unconsolidated rows whenever
an upstream outage (LLM provider down, DB flapping) lasts longer than the
budget — exactly the failure mode the dedup-by-bank guard was meant to
contain. The guard already prevents retry storms by collapsing duplicate
ops to a single retrying op per bank, so indefinite retry on that single op
is safe: the dependency comes back, the next scheduled attempt succeeds.

Deterministic failures (integrity violations, embedding dimension errors)
are still filtered upstream by `_is_non_retryable_task_error` and marked
failed immediately. Only generic transient errors reach the indefinite
retry path. Other task types (batch_retain, refresh_mental_model,
webhook_delivery) keep their existing 60s × 3 generic schedule.
nicoloboschi changed the title from "fix(consolidation): skip task retry when peer consolidation already pending" to "fix(consolidation): indefinite retry with backoff + dedup-by-bank guard" on May 28, 2026
nicoloboschi merged commit 0d2ba56 into main on May 28, 2026
72 checks passed