feat(redis): survive transient Redis outages with bounded reconnects #76
ChiragAgg5k merged 9 commits into main
The broker's consume() loop previously rethrew any RedisException raised
during the blocking pop, crashing the worker on every transient network
blip. The connection layer also opened a brand-new socket on the first
call with no retry, so a single DNS or TCP hiccup during boot would take
the process down.
Connection layer (Redis, RedisCluster):
- getRedis() now retries up to 5 attempts with exponential backoff
(100ms base, 3s cap) and full jitter to avoid thundering herd on
recovery.
- close() is best-effort and swallows Throwable so a dead socket
doesn't mask the original error.
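
The connection-layer retry described above can be sketched roughly as follows. This is an illustration, not the code from the PR: connectWithRetry() and backoffCeilingMs() are hypothetical names, the sleep function is injectable only so the sketch can be exercised without waiting, and the sketch throws a plain \RuntimeException where the PR throws phpredis exception types. Skipping the sleep after the final attempt is also an assumption.

```php
<?php

// Values quoted in the commit message; the names mirror the PR description.
const CONNECT_MAX_ATTEMPTS = 5;
const CONNECT_BACKOFF_MS = 100;
const CONNECT_MAX_BACKOFF_MS = 3000;

/** Upper bound of the jitter window for a 1-based attempt number. */
function backoffCeilingMs(int $attempt, int $baseMs, int $capMs): int
{
    // Clamp the exponent so the shift cannot overflow during long outages.
    $exponent = min($attempt - 1, 20);

    return min($capMs, $baseMs << $exponent);
}

/**
 * Calls $connect until it succeeds or the attempt budget is spent.
 * $sleep receives microseconds (hypothetical hook, defaults to usleep).
 */
function connectWithRetry(callable $connect, ?callable $sleep = null)
{
    $sleep ??= static fn (int $us) => usleep($us);
    $last = null;

    for ($attempt = 1; $attempt <= CONNECT_MAX_ATTEMPTS; $attempt++) {
        try {
            return $connect();
        } catch (\Throwable $e) {
            $last = $e;
            if ($attempt === CONNECT_MAX_ATTEMPTS) {
                break; // assumption: no sleep once the budget is exhausted
            }
            $ceiling = backoffCeilingMs($attempt, CONNECT_BACKOFF_MS, CONNECT_MAX_BACKOFF_MS);
            // Full jitter: a uniform random delay in [0, ceiling] milliseconds.
            $sleep(mt_rand(0, $ceiling) * 1000);
        }
    }

    // Keep the original failure reachable through getPrevious().
    throw new \RuntimeException(
        sprintf('connect failed after %d attempts', CONNECT_MAX_ATTEMPTS),
        0,
        $last
    );
}
```

With these values the jitter ceilings for the four sleeps between five attempts are 100, 200, 400 and 800 ms, so the 3 s cap would only come into play if the attempt budget were raised.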
Broker (Redis):
- On RedisException during pop, drop the stale connection and retry
with exponential backoff (100ms base, 5s cap, full jitter). Worker
stays alive across outages instead of crash-looping under a
supervisor.
- Attempt counter resets on the first successful pop so each outage
starts from a fresh backoff.
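
A minimal sketch of the broker behaviour described above, under stated assumptions: consumeLoop() and its callable parameters ($pop, $drop, $isClosed, $handle, $sleep) are hypothetical stand-ins for the real broker internals, and the sketch catches \Throwable so it runs without phpredis, whereas the PR catches the phpredis exception types.

```php
<?php

// Values quoted in the commit message; the names mirror the PR description.
const RECONNECT_BACKOFF_MS = 100;
const RECONNECT_MAX_BACKOFF_MS = 5000;

/**
 * $pop performs one blocking pop and may throw, $drop discards the stale
 * connection, $isClosed reports shutdown, $handle receives each message,
 * and $sleep takes microseconds (defaults to usleep).
 */
function consumeLoop(
    callable $pop,
    callable $drop,
    callable $isClosed,
    callable $handle,
    ?callable $sleep = null
): void {
    $sleep ??= static fn (int $us) => usleep($us);
    $attempt = 0;

    while (!$isClosed()) {
        try {
            $message = $pop();
            $attempt = 0; // first successful pop: next outage starts fresh
        } catch (\Throwable $e) {
            if ($isClosed()) {
                return; // closed while blocked in the pop: exit cleanly
            }
            try {
                $drop();
            } catch (\Throwable $ignored) {
                // A failing close must not abort the reconnect path.
            }
            $attempt++;
            $ceiling = min(
                RECONNECT_MAX_BACKOFF_MS,
                RECONNECT_BACKOFF_MS << min($attempt - 1, 20)
            );
            // Full jitter: a uniform random delay in [0, ceiling] milliseconds.
            $sleep(mt_rand(0, $ceiling) * 1000);
            continue;
        }

        if ($message !== null) {
            $handle($message);
        }
    }
}
```

Because the counter is reset inside the try block, a single successful pop is enough to bring the next failure back to the 100 ms ceiling.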
Relies on the SWOOLE_HOOK_ALL hook flags set in Adapter/Swoole.php so
usleep yields cooperatively inside coroutines rather than blocking the
reactor.
Greptile Summary
This PR hardens the Redis broker and both connection adapters to survive transient outages.
Confidence Score: 5/5
Safe to merge: all previously raised P1 issues are resolved and no new defects were found.
All three P1 findings from prior rounds (stale handle not nulled on close() throw, missing \RedisClusterException catch in broker, socket leak on setOption() failure) are correctly addressed. The backoff math, jitter implementation, and test assertions are accurate. No new correctness or security concerns were identified. No files require special attention.
Reviews (8): Last reviewed commit: "feat(redis): expose reconnect success ca..."
Summary
Harden the Redis broker and connection adapters so workers survive transient Redis outages (DNS flaps, failover, restarts, brief network partitions) instead of crash-looping under a supervisor.
- Connection layer (Connection/Redis.php, Connection/RedisCluster.php): lazy getRedis() now makes up to 5 attempts with exponential backoff + full jitter (100 ms base, 3 s cap) before throwing. close() is best-effort and always clears the cached handle.
- Broker (Broker/Redis.php): consume() catches RedisException and RedisClusterException raised by the blocking pop, drops the stale connection without letting close failures derail the reconnect path, applies capped backoff with full jitter (100 ms base, 5 s cap), and continues. Backoff resets on the first successful pop.
Motivation
Before this change, a single Redis exception during brPop would bubble out of consume() and kill the worker process. Any transient Redis issue (failover, restart, brief network partition) caused the worker fleet to rely on the process supervisor for recovery, which can reopen many connections at the same instant and create a thundering herd on the recovering Redis.
Similarly, getRedis() opened a single socket with no retry, so a one-off DNS or TCP hiccup during boot surfaced as an unrecoverable failure to the caller.
What changed
src/Queue/Connection/Redis.php
- CONNECT_MAX_ATTEMPTS (5), CONNECT_BACKOFF_MS (100), CONNECT_MAX_BACKOFF_MS (3 000) constants.
- getRedis() wraps new \Redis() + connect() + setOption() in a retry loop. On failure it throws a \RedisException with host, port, attempt count, and the original exception as previous.
- close() wraps phpredis close in try/finally so stale handles are always cleared.
- Failed setOption() attempts do not leak sockets.
src/Queue/Connection/RedisCluster.php
- getRedis() retries new \RedisCluster(...), catching \RedisClusterException. On failure it throws a \RedisClusterException with cluster node list, attempt count, and the original exception as previous.
- close() wraps phpredis close in try/finally so stale handles are always cleared.
src/Queue/Broker/Redis.php
- RECONNECT_BACKOFF_MS (100) and RECONNECT_MAX_BACKOFF_MS (5 000) constants.
- consume() catches \RedisException|\RedisClusterException from the blocking pop. If the broker was closed, it exits cleanly. Otherwise it drops the stale connection, sleeps for mt_rand(0, backoffMs) (full jitter), and continues the loop.
Swoole considerations
The usleep() calls cooperate with the Swoole reactor because src/Queue/Adapter/Swoole.php:37 sets SWOOLE_HOOK_ALL, which hooks usleep to Coroutine::sleep. If that flag is ever narrowed, these sleeps will block the reactor.
Design notes
- consume(): a worker should stay alive across arbitrarily long outages. Operators rely on closed=true from close() to end the loop.
- ConnectionException is out of scope here.
Test plan
- composer lint
- vendor/bin/phpstan analyse --memory-limit=1G
Out of scope / follow-ups
- Connection\Redis constructor accepts $user/$password but getRedis() never calls auth(). Pre-existing bug; worth a separate PR.
- utopia-php/telemetry.
- ConnectionException so the broker stops depending on phpredis exception types directly.
- CONNECT_MAX_ATTEMPTS etc. to constructor parameters if operators want to tune them per deployment.