
feat(redis): survive transient Redis outages with bounded reconnects#76

Merged
ChiragAgg5k merged 9 commits into main from feat/redis-resilience-retries
Apr 23, 2026


@ChiragAgg5k ChiragAgg5k commented Apr 22, 2026

Summary

Harden the Redis broker and connection adapters so workers survive transient Redis outages (DNS flaps, failover, restarts, brief network partitions) instead of crash-looping under a supervisor.

  • Connection layer (Connection/Redis.php, Connection/RedisCluster.php): lazy getRedis() now retries up to 5 attempts with exponential backoff + full jitter (100 ms base, 3 s cap) before throwing. close() is best-effort and always clears the cached handle.
  • Broker (Broker/Redis.php): consume() catches RedisException and RedisClusterException raised by the blocking pop, drops the stale connection (errors from close are guarded so they cannot interrupt the reconnect path), sleeps for a capped backoff with full jitter (100 ms base, 5 s cap), and continues. Backoff resets on the first successful pop.
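
The backoff described in the two bullets above can be sketched as a pair of small functions. This is a minimal illustration, not the PR's code: the helper names `backoffCeilingMs` and `fullJitterDelayMs` are hypothetical, and the exact formula used in the adapters may differ.

```php
<?php
// Hypothetical helper (not from the PR): exponential backoff ceiling with a cap.
function backoffCeilingMs(int $attempt, int $baseMs = 100, int $capMs = 3000): int
{
    // $attempt is 1-based: 100, 200, 400, ... until the cap is reached.
    return (int) min($capMs, $baseMs * (2 ** ($attempt - 1)));
}

// Full jitter: pick a uniformly random delay between 0 and the ceiling,
// so workers that failed at the same moment do not retry at the same moment.
function fullJitterDelayMs(int $attempt, int $baseMs = 100, int $capMs = 3000): int
{
    return mt_rand(0, backoffCeilingMs($attempt, $baseMs, $capMs));
}
```

With the connection-layer defaults (100 ms base, 3 s cap), the ceilings for attempts 1 to 6 in this sketch are 100, 200, 400, 800, 1600 and 3000 ms; the actual sleep is a random value at or below the ceiling.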

Motivation

Before this change, a single Redis exception during brPop would bubble out of consume() and kill the worker process. Any transient Redis issue — failover, restart, brief network partition — caused the worker fleet to rely on the process supervisor for recovery, which can reopen many connections at the same instant and create a thundering herd on the recovering Redis.

Similarly, getRedis() opened a single socket with no retry, so a one-off DNS or TCP hiccup during boot surfaced as an unrecoverable failure to the caller.

What changed

src/Queue/Connection/Redis.php

  • Added CONNECT_MAX_ATTEMPTS (5), CONNECT_BACKOFF_MS (100), CONNECT_MAX_BACKOFF_MS (3000) constants.
  • getRedis() wraps new \Redis() + connect() + setOption() in a retry loop. On failure it throws a \RedisException with host, port, attempt count, and the original exception as previous. close() wraps phpredis close in try/finally so stale handles are always cleared.
  • If setup fails after a socket was opened, the temporary Redis instance is closed before retrying so failed setOption() attempts do not leak sockets.
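
A rough sketch of the retry loop these bullets describe, written without a phpredis dependency so it can run anywhere. Everything named here is an assumption for illustration: `connectWithRetry`, the `$open` callable (standing in for `new \Redis()` plus `connect()`), the `$configure` callable (standing in for `setOption()`), and the injected `$sleepMs`. It also catches `\Throwable` and throws `\RuntimeException`, whereas the PR catches and throws `\RedisException`.

```php
<?php
// Sketch only: $open and $configure are hypothetical stand-ins for the phpredis calls.
function connectWithRetry(
    callable $open,
    callable $configure,
    callable $sleepMs,
    int $maxAttempts = 5,
    int $baseMs = 100,
    int $capMs = 3000
): object {
    $last = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $handle = null;
        try {
            $handle = $open();
            $configure($handle);
            return $handle;
        } catch (\Throwable $e) {
            $last = $e;
            // If setup failed after the socket was opened, close the temporary
            // handle so the failed attempt does not leak a socket.
            if ($handle !== null) {
                try {
                    $handle->close();
                } catch (\Throwable $ignored) {
                    // Best-effort cleanup; the original error is what matters.
                }
            }
        }
        // Sleep only between attempts, not after the final failure.
        if ($attempt < $maxAttempts) {
            $ceiling = (int) min($capMs, $baseMs * (2 ** ($attempt - 1)));
            $sleepMs(mt_rand(0, $ceiling));
        }
    }
    // Keep the last underlying error reachable via getPrevious().
    throw new \RuntimeException("Failed to connect after {$maxAttempts} attempts", 0, $last);
}
```

Passing the sleep function in as a callable is a choice made for this sketch so the loop can be exercised without real delays; in production it could simply wrap `usleep($ms * 1000)`.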

src/Queue/Connection/RedisCluster.php

  • Added the same retry constants and a retry loop around new \RedisCluster(...), catching \RedisClusterException. On failure it throws a \RedisClusterException with cluster node list, attempt count, and the original exception as previous. close() wraps phpredis close in try/finally so stale handles are always cleared.
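
The try/finally shape of close() mentioned for both adapters might look roughly like the following. `LazyConnection`, its `$factory` argument, and the `$redis` property are hypothetical names for this sketch; the real adapters hold a phpredis handle rather than a generic object.

```php
<?php
// Sketch only: a stand-in for a connection adapter that lazily caches its handle.
class LazyConnection
{
    /** @var callable Hypothetical factory that builds a new handle. */
    private $factory;

    private ?object $redis = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    public function getRedis(): object
    {
        // Create the handle on first use and reuse it afterwards.
        return $this->redis ??= ($this->factory)();
    }

    public function close(): void
    {
        try {
            if ($this->redis !== null) {
                $this->redis->close();
            }
        } finally {
            // Runs even when close() throws, so the next getRedis()
            // builds a fresh handle instead of reusing a dead one.
            $this->redis = null;
        }
    }
}
```

In this sketch an exception from the underlying close() still propagates to the caller; only the cached handle is guaranteed to be cleared. A caller that wants a fully silent close would wrap the call in its own try/catch.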

src/Queue/Broker/Redis.php

  • Added RECONNECT_BACKOFF_MS (100) and RECONNECT_MAX_BACKOFF_MS (5000) constants.
  • consume() catches \RedisException|\RedisClusterException from the blocking pop. If the broker was closed, it exits cleanly. Otherwise it drops the stale connection, sleeps for mt_rand(0, backoffMs) (full jitter), and continues the loop.
  • The broker backoff is maintained as a capped delay value instead of an ever-growing exponent.
  • Backoff resets to the base delay after any successful pop.
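
The control flow in the bullets above can be approximated by the loop below. It is a sketch under stated assumptions, not the broker's code: `consumeLoop` and all of its callable parameters are hypothetical, `\RuntimeException` stands in for the phpredis exception types, and doubling the delay after each failure is one plausible way to keep a capped delay value.

```php
<?php
// Sketch only: every callable here is a hypothetical stand-in for broker internals.
function consumeLoop(
    callable $pop,            // blocking pop; throws when the connection is lost
    callable $handle,         // processes one message
    callable $isClosed,       // true once close() has been called on the broker
    callable $dropConnection, // discards the stale connection
    callable $sleepMs,        // sleeps for the given number of milliseconds
    int $baseMs = 100,
    int $capMs = 5000
): void {
    $backoffMs = $baseMs;
    while (!$isClosed()) {
        try {
            $message = $pop();
        } catch (\RuntimeException $e) { // the PR catches \RedisException|\RedisClusterException
            if ($isClosed()) {
                return; // shutting down: exit cleanly instead of retrying
            }
            try {
                $dropConnection();
            } catch (\Throwable $ignored) {
                // A failing close must not stop the reconnect path.
            }
            $sleepMs(mt_rand(0, $backoffMs));          // full jitter
            $backoffMs = min($capMs, $backoffMs * 2);  // capped delay, no growing exponent
            continue;
        }
        // Any pop that returns without throwing resets the delay to the base.
        $backoffMs = $baseMs;
        if ($message !== null) {
            $handle($message);
        }
    }
}
```

Because the loop has no attempt ceiling, the only exit in this sketch is `$isClosed()` returning true, which matches the unbounded-retry behaviour the PR describes for the broker.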

Swoole considerations

The usleep() calls cooperate with the Swoole reactor because src/Queue/Adapter/Swoole.php:37 sets SWOOLE_HOOK_ALL, which hooks usleep to Coroutine::sleep. If that flag is ever narrowed, these sleeps will block the reactor.

Design notes

  • Full jitter is used because the realistic failure mode is all workers losing the connection simultaneously; full jitter spreads reconnect attempts during recovery.
  • Broker retries are unbounded. There is no max-attempt ceiling in consume(), because a worker should stay alive across arbitrarily long outages. Operators rely on closed=true from close() to end the loop.
  • Caught types are phpredis exceptions in the broker, which is an abstraction leak but consistent with the existing Redis-specific broker and adapters. Translating to a neutral ConnectionException is out of scope here.

Test plan

  • composer lint
  • vendor/bin/phpstan analyse --memory-limit=1G
  • GitHub Actions adapter tests
  • Greptile review

Out of scope / follow-ups

  • Connection\Redis constructor accepts $user/$password but getRedis() never calls auth(). Pre-existing bug; worth a separate PR.
  • Telemetry hook or log on reconnect — operators currently get no signal when a worker enters the retry loop. Candidate for a follow-up using utopia-php/telemetry.
  • Translate phpredis exceptions into a driver-neutral ConnectionException so the broker stops depending on phpredis exception types directly.
  • Promote CONNECT_MAX_ATTEMPTS etc. to constructor parameters if operators want to tune them per deployment.

The broker's consume() loop previously rethrew any RedisException raised
during the blocking pop, crashing the worker on every transient network
blip. The connection layer also opened a brand-new socket on the first
call with no retry, so a single DNS or TCP hiccup during boot would take
the process down.

Connection layer (Redis, RedisCluster):
  - getRedis() now retries up to 5 attempts with exponential backoff
    (100ms base, 3s cap) and full jitter to avoid thundering herd on
    recovery.
  - close() is best-effort and swallows Throwable so a dead socket
    doesn't mask the original error.

Broker (Redis):
  - On RedisException during pop, drop the stale connection and retry
    with exponential backoff (100ms base, 5s cap, full jitter). Worker
    stays alive across outages instead of crash-looping under a
    supervisor.
  - Attempt counter resets on the first successful pop so each outage
    starts from a fresh backoff.

Relies on the SWOOLE_HOOK_ALL hook flags set in Adapter/Swoole.php so
usleep yields cooperatively inside coroutines rather than blocking the
reactor.

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR hardens the Redis broker and both connection adapters to survive transient outages. The consume() loop now catches both \RedisException and \RedisClusterException, closes the stale handle, sleeps with capped exponential backoff and full jitter, and continues rather than propagating; getRedis() in both connection classes retries up to five times with the same jitter strategy. Previously flagged issues (stale socket on close() throw, socket leak on setOption() failure, missing \RedisClusterException catch in the broker) are all resolved in this revision.

Confidence Score: 5/5

Safe to merge — all previously raised P1 issues are resolved and no new defects were found.

All three P1 findings from prior rounds (stale handle not nulled on close() throw, missing \RedisClusterException catch in broker, socket leak on setOption() failure) are correctly addressed. The backoff math, jitter implementation, and test assertions are accurate. No new correctness or security concerns were identified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/Queue/Broker/Redis.php | Reconnect loop with exponential backoff and full jitter added to consume(); both \RedisException and \RedisClusterException are caught; optional reconnect/success callbacks added; connection close is properly guarded with try/catch. |
| src/Queue/Connection/Redis.php | getRedis() now retries up to 5 times with exponential backoff; socket leak on setOption() failure is correctly handled via $connected flag; close() uses try/finally to always null the handle. |
| src/Queue/Connection/RedisCluster.php | Mirrors Redis.php changes: retry loop with exponential backoff around new \RedisCluster(); close() uses try/finally; no socket-leak concern since \RedisCluster constructor manages its own cleanup on failure. |
| tests/Queue/E2E/Adapter/RedisReconnectCallbackTest.php | New unit tests for reconnect and reconnect-success callbacks; namespace matches existing test files; fakes correctly simulate one-shot failure and recovery; assertions are accurate. |

Reviews (8): Last reviewed commit: "feat(redis): expose reconnect success ca..."

Comment thread src/Queue/Broker/Redis.php Outdated
Comment thread src/Queue/Connection/Redis.php
Comment thread src/Queue/Broker/Redis.php Outdated
@ChiragAgg5k ChiragAgg5k merged commit 91de91b into main Apr 23, 2026
8 checks passed
@ChiragAgg5k ChiragAgg5k deleted the feat/redis-resilience-retries branch April 23, 2026 11:37