fix(db_utils): make acquire_with_retry yield exactly once by slayoffer · Pull Request #1880 · vectorize-io/hindsight

slayoffer · 2026-05-31T08:44:20Z

Summary

acquire_with_retry is an @asynccontextmanager whose retry loop wraps yield conn. That violates the single-yield contract: when the caller's body raises a retryable-looking exception (e.g. ConnectionError, asyncio.TimeoutError), the loop iterates and tries to yield a second time, and Python surfaces the resulting generator-protocol error as:

RuntimeError("generator didn't stop after athrow()")

— masking the real cause of the failure on every retryable inner error.

Fix: wrap only the acquire in the retry loop via contextlib.AsyncExitStack. The yield runs exactly once, after a connection is successfully acquired. User-code exceptions propagate unchanged.

Why this matters

_is_retryable matches a fairly broad set of error classes by name (InterfaceError, ConnectionDoesNotExistError, TooManyConnectionsError, DeadlockDetectedError) plus OSError / ConnectionError / asyncio.TimeoutError / Oracle ORA-00060. Anything inside an async with acquire_with_retry(backend) as conn: block that raised one of those would surface as RuntimeError("generator didn't stop after athrow()"). In our deployment that single masked bug accounted for ~1.9k identical failed consolidation operations on one large bank over ~57 days before we caught it — diagnosis was hard because the visible error pointed at the generator machinery, not the underlying transient.

The prior retry-of-user-code branch was also already non-functional (always crashed with the RuntimeError above before any retry could happen), so removing it is strictly an observability and correctness improvement, not a behaviour change for the happy path.

What changes

hindsight-api-slim/hindsight_api/engine/db_utils.py (backend path only — the legacy asyncpg.Pool path is untouched)

Before:

for attempt in range(max_retries + 1):
    try:
        async with backend_or_pool.acquire() as conn:
            ...
            yield conn          # ← inside the retry loop
            return
    except Exception as e:
        if not _is_retryable(e): raise
        ...

After:

async with AsyncExitStack() as stack:
    conn: Any = None
    for attempt in range(max_retries + 1):
        try:
            conn = await stack.enter_async_context(backend_or_pool.acquire())
            break
        except Exception as e:
            if not _is_retryable(e): raise
            ...
    ...
    yield conn              # ← exactly once, outside the loop

Docstring extended to spell out the contract: the retry covers the acquire step; user-code exceptions are not retried.

Test plan

New unit test tests/test_db_utils.py::test_retryable_user_code_exception_propagates_unchanged asserts:
- a ConnectionError raised inside the async with body propagates as ConnectionError (not RuntimeError)
- the connection's __aexit__ is called exactly once (no double-release on AsyncExitStack cleanup)
uv run pytest tests/test_db_utils.py -v → 1/1 passing
uv run ruff check hindsight_api/engine/db_utils.py tests/test_db_utils.py → clean
Maintainer-side CI

Refs

Original fix landed in our fork at xsolla/hindsight-agentic-memory#29; this PR cherry-picks the same commit onto vectorize-io/hindsight:main.

🤖 Generated with Claude Code

acquire_with_retry's retry loop wrapped the yield, violating @asynccontextmanager's single-yield contract. When user code inside the async with block raised a retryable exception, the loop iterated and tried to yield again, producing RuntimeError("generator didn't stop after athrow()") on every retryable inner error. This masked the real cause and was the root of 1,934 identical failed consolidation ops on shurick-memory in production since 2026-03-30. Retry now wraps only the acquire (via AsyncExitStack). User-code exceptions inside the block propagate as their real types — strictly better for observability, since the prior retry-of-user-code branch was already non-functional (always crashed with the RuntimeError above). Includes a regression unit test asserting (a) the original retryable exception propagates unchanged and (b) the connection is released exactly once. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nicoloboschi

Nice catch!
LGTM

nicoloboschi approved these changes Jun 1, 2026

View reviewed changes

nicoloboschi merged commit 2a9589f into vectorize-io:main Jun 1, 2026
63 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(db_utils): make acquire_with_retry yield exactly once#1880

fix(db_utils): make acquire_with_retry yield exactly once#1880
nicoloboschi merged 1 commit into
vectorize-io:mainfrom
slayoffer:fix/upstream-acquire-with-retry-single-yield

slayoffer commented May 31, 2026

Uh oh!

nicoloboschi left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

slayoffer commented May 31, 2026

Summary

Why this matters

What changes

Test plan

Refs

Uh oh!

nicoloboschi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants