Description
Expected Behavior
A flurry of transient connection errors should not result in a client getting into a permanently broken state. At some point, it should try a new connection.
Current Behavior
If you have a redis.Client with a PoolSize of, say, 5, and you get 5 failed connections in a row without an intervening successful connection, every subsequent call to that redis client will fail. It will return a stored error without trying to make a new connection.
Possible Solution
See "Possible Implementation" below.
Steps to Reproduce
Reproduction courtesy of my colleague @nickstenning: https://gist.github.com/nickstenning/e9479b62b3609927e27dc95e05385250
In this repro, we create a client with a PoolSize of 2. It then experiences two failed connections in a row, and from that point forward never attempts to make a new connection again, only returns the last stored connection error.
Context (Environment)
This affected us because we had PoolSize=2 (which makes sense for our use case where we have mostly single-threaded access to the redis.Client), we saw 2 connection errors in a row, and then our app got stuck in a tight spinloop where it was repeatedly calling redis and getting an error back. Our expectation was that calling the redis client would perform some sort of network operation, so it was surprising that it did not.
We were running go-redis v9.5.3 (which has disappeared, like v9.5.4 - see #3057).
Detailed Description
The relevant code is here: https://github.com/redis/go-redis/blob/v9.5.2/internal/pool/pool.go#L209-L220
In particular, once p.dialErrorsNum reaches p.cfg.PoolSize, dialConn() will always return an error without attempting to create a connection.
Possible Implementation
Remove the code that's counting and storing errors. It's not clear to me what the intention of this code is, but I don't think getting the client into a permanently broken state like this is a good idea.