fix(consolidation): indefinite retry with backoff + dedup-by-bank guard#1811
Merged
Conversation
…ending When a consolidation task hits a transient error, execute_task raises RetryTaskAt to re-queue the same operation. During a long upstream outage (LLM provider down, DB flapping), every successful retain on the same bank also enqueues a fresh consolidation op via submit_async_consolidation, so each op independently consumes its own 3-retry budget — a retry storm against the same broken dependency. Add a per-bank dedup check before raising RetryTaskAt: if another consolidation op is already in 'pending' for the same bank, the current op is failed instead of retried. The pending peer will process the same unconsolidated rows when the worker picks it up. The check fails open: a DB hiccup during the dedup lookup returns False so the normal retry path runs rather than swallowing a real failure.
… backoff Replace the inherited 60s × 3 generic retry for consolidation tasks with a consolidation-specific schedule: exponential backoff (60, 120, 240, 480, 960, then pinned at 1800s cap) with no attempt cap. Capping retries silently dead-letters a bank's unconsolidated rows whenever an upstream outage (LLM provider down, DB flapping) lasts longer than the budget — exactly the failure mode the dedup-by-bank guard was meant to contain. The guard already prevents retry storms by collapsing duplicate ops to a single retrying op per bank, so indefinite retry on that single op is safe: the dependency comes back, the next scheduled attempt succeeds. Deterministic failures (integrity violations, embedding dimension errors) are still filtered upstream by `_is_non_retryable_task_error` and marked failed immediately. Only generic transient errors reach the indefinite retry path. Other task types (batch_retain, refresh_mental_model, webhook_delivery) keep their existing 60s × 3 generic schedule.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two coupled changes to the consolidation task-retry path in
execute_task:pendingfor the same bank, skip the retry and let the peer cover the work. Without this, indefinite retry would let multiple ops for the same bank retry forever in lockstep during an outage.Surfaces from the discussion on #1799: the symptom there (event-chain breaks leave
memory_unitsunconsolidated) is largely a retry-budget exhaustion problem during long outages, not a missing-trigger problem. Fixing the retry layer addresses it without pollingmemory_units.Why
Before this PR,
execute_taskretried transient consolidation failures with the generic schedule (60s × 3, then_mark_failed). During a sustained upstream outage:RetryTaskAt.submit_async_consolidationdedup only checkspending, notprocessing).failedand the bank's unconsolidated rows sit untouched until the next retain triggers a fresh op — which then also burns its budget.This PR breaks the cycle:
Deterministic failures (integrity violations, embedding dimension errors) are still filtered upstream by
_is_non_retryable_task_errorand marked failed immediately. Other task types keep their existing 60s × 3 generic schedule.What changed
hindsight_api/engine/memory_engine.py_consolidation_retry_backoff_seconds(retry_count)— capped exponential backoff helper, no attempt cap._has_other_pending_consolidation(bank_id, operation_id)— single-row check onasync_operations, fails open on DB errors.execute_taskforconsolidationtasks: fire failure webhook, run dedup check, then either bare-raise (poller marks failed) orRetryTaskAtwith the new backoff.tests/test_consolidation_retry_dedup_by_bank.py(new) — 7 regression tests:execute_tasktest verifyingretry_atmatches each step, indefinite-retry test atretry_count=100confirming the cap holds and we never give up.Test plan
pytest tests/test_consolidation_retry_dedup_by_bank.py— 9 passedpytest tests/test_consolidation_retry_budget.py tests/test_consolidation_failure_recovery.py tests/test_integrity_violation_not_retried.py— no regressions./scripts/hooks/lint.sh— cleanWhat this PR does not change
submit_async_consolidationdedup semantics — that path still only dedupspending.memory_unitspolling — the discussion on feat(worker): add opt-in periodic scanner for pending consolidations #1799 ruled out the scanner approach.