Skip to content

fix(db_utils): make acquire_with_retry yield exactly once#1880

Merged
nicoloboschi merged 1 commit into
vectorize-io:mainfrom
slayoffer:fix/upstream-acquire-with-retry-single-yield
Jun 1, 2026
Merged

fix(db_utils): make acquire_with_retry yield exactly once#1880
nicoloboschi merged 1 commit into
vectorize-io:mainfrom
slayoffer:fix/upstream-acquire-with-retry-single-yield

Conversation

@slayoffer
Copy link
Copy Markdown
Contributor

Summary

acquire_with_retry is an @asynccontextmanager whose retry loop wraps yield conn. That violates the single-yield contract: when the caller's body raises a retryable-looking exception (e.g. ConnectionError, asyncio.TimeoutError), the loop iterates and tries to yield a second time, and Python surfaces the resulting generator-protocol error as:

RuntimeError("generator didn't stop after athrow()")

— masking the real cause of the failure on every retryable inner error.

Fix: wrap only the acquire in the retry loop via contextlib.AsyncExitStack. The yield runs exactly once, after a connection is successfully acquired. User-code exceptions propagate unchanged.

Why this matters

_is_retryable matches a fairly broad set of error classes by name (InterfaceError, ConnectionDoesNotExistError, TooManyConnectionsError, DeadlockDetectedError) plus OSError / ConnectionError / asyncio.TimeoutError / Oracle ORA-00060. Anything inside an async with acquire_with_retry(backend) as conn: block that raised one of those would surface as RuntimeError("generator didn't stop after athrow()"). In our deployment that single masked bug accounted for ~1.9k identical failed consolidation operations on one large bank over ~57 days before we caught it — diagnosis was hard because the visible error pointed at the generator machinery, not the underlying transient.

The prior retry-of-user-code branch was also already non-functional (always crashed with the RuntimeError above before any retry could happen), so removing it is strictly an observability and correctness improvement, not a behaviour change for the happy path.

What changes

hindsight-api-slim/hindsight_api/engine/db_utils.py (backend path only — the legacy asyncpg.Pool path is untouched)

Before:

for attempt in range(max_retries + 1):
    try:
        async with backend_or_pool.acquire() as conn:
            ...
            yield conn          # ← inside the retry loop
            return
    except Exception as e:
        if not _is_retryable(e): raise
        ...

After:

async with AsyncExitStack() as stack:
    conn: Any = None
    for attempt in range(max_retries + 1):
        try:
            conn = await stack.enter_async_context(backend_or_pool.acquire())
            break
        except Exception as e:
            if not _is_retryable(e): raise
            ...
    ...
    yield conn              # ← exactly once, outside the loop

Docstring extended to spell out the contract: the retry covers the acquire step; user-code exceptions are not retried.

Test plan

  • New unit test tests/test_db_utils.py::test_retryable_user_code_exception_propagates_unchanged asserts:
    • a ConnectionError raised inside the async with body propagates as ConnectionError (not RuntimeError)
    • the connection's __aexit__ is called exactly once (no double-release on AsyncExitStack cleanup)
  • uv run pytest tests/test_db_utils.py -v → 1/1 passing
  • uv run ruff check hindsight_api/engine/db_utils.py tests/test_db_utils.py → clean
  • Maintainer-side CI

Refs

Original fix landed in our fork at xsolla/hindsight-agentic-memory#29; this PR cherry-picks the same commit onto vectorize-io/hindsight:main.

🤖 Generated with Claude Code

acquire_with_retry's retry loop wrapped the yield, violating
@asynccontextmanager's single-yield contract. When user code inside
the async with block raised a retryable exception, the loop iterated
and tried to yield again, producing RuntimeError("generator didn't
stop after athrow()") on every retryable inner error. This masked
the real cause and was the root of 1,934 identical failed
consolidation ops on shurick-memory in production since 2026-03-30.

Retry now wraps only the acquire (via AsyncExitStack). User-code
exceptions inside the block propagate as their real types — strictly
better for observability, since the prior retry-of-user-code branch
was already non-functional (always crashed with the RuntimeError above).

Includes a regression unit test asserting (a) the original retryable
exception propagates unchanged and (b) the connection is released
exactly once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!
LGTM

@nicoloboschi nicoloboschi merged commit 2a9589f into vectorize-io:main Jun 1, 2026
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants