[automatic failover] Automatic failover client improvements (part 2) #4297
Merged
…aiters()' in 'TrackingConnectionPool' (#4270) - remove the check for number of waiters in TrackingConnectionPool
…#4268) - set maxtotal connections for echoStrategy
…cuitBreakerFailoverBase.clusterFailover' (#4275) * - replace CircuitBreaker with Cluster for CircuitBreakerFailoverBase.clusterFailover - improve thread safety with provider initialization * - formatting
* - minor optimizations on fail fast * - volatile failfast
* - replace minConsecutiveSuccessCount with numberOfRetries - add retries into healthCheckImpl - apply changes to strategy implementations config classes - fix unit tests * - fix typo * - fix failing tests * - add tests for retry logic * - formatting * - format * - revisit numRetries for healthCheck, replace with numProbes and implement built-in policies - new types probecontext, ProbePolicy, HealthProbeContext - add delayer executor pool to healthCheckImpl - adjustments on worker pool of healthCheckImpl for shared use of workers * - format * - expand comment with example case * - drop pooled executor for delays * - polish * - fix tests * - formatting * - checking failing tests * - fix test * - fix flaky tests * - fix flaky test * - add tests for builtin probing policies * - fix flaky test
* - move failover provider to mcf * - make iterateActiveCluster package private
#4291) * User-provided ssl config for lag-aware health check * ssl scenario test for lag-aware healthcheck * format * format * address review comments - use getters instead of fields
* - implement max failover attempt - add tests * - fix user receive the intended exception * - clean + format * - java doc for exceptions * format * - more tests on exception types in max failover attempts mechanism * format * fix failing timing in test * disable health checks * rename to switchToHealthyCluster * format
src/test/java/redis/clients/jedis/scenario/LagAwareStrategySslIT.java
Test Results: 273 files (+1), 273 suites (+1), 11m 4s ⏱️ (+9s). Results for commit 1ce1746. ± Comparison against base commit e838e48. This pull request removes 13 and adds 30 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results.
ggivo approved these changes on Sep 30, 2025
Labels
breakingchange
Pull request that has breaking changes. Must include the breaking behavior in release notes.
feature
Improved failover resilience and health check enhancements for Active-Active deployments
This PR enhances the automatic failover and failback system with improved resilience, better health check probing strategies, and refined exception handling. These changes make the failover system more robust for production Active-Active Redis deployments.
🚀 New Features Added
1. Configurable Failover Retry Mechanism
- maxNumFailoverAttempts configuration - Controls maximum failover attempts before giving up (default: 10)
- delayInBetweenFailoverAttempts configuration - Delay between failover attempts in milliseconds (default: 12000ms)
2. Enhanced Exception Handling
- JedisFailoverException - Base exception for failover-related errors
- JedisPermanentlyNotAvailableException - Thrown when max failover attempts exceeded
- JedisTemporarilyNotAvailableException - Thrown when clusters temporarily unavailable but retries remain
- assertOperability() method to validate cluster health before command execution
3. Improved Health Check Probing System
- ProbingPolicy interface for flexible health check evaluation strategies
- ProbingPolicy.BuiltIn.ALL_SUCCESS - Requires all probes to succeed
- ProbingPolicy.BuiltIn.MAJORITY - Requires majority of probes to succeed
- ProbingPolicy.BuiltIn.AT_LEAST_ONE - Requires at least one probe to succeed
- numProbes configuration - Number of health check probes to execute (replaces minConsecutiveSuccessCount)
- delayInBetweenProbes configuration - Delay between individual probes in milliseconds (default: 100ms)
- HealthProbeContext - Tracks probe execution state and results
4. SSL Support for Redis Enterprise REST API
- SslOptions support in LagAwareStrategy for HTTPS connections to Redis Enterprise
- sslOptions() builder method in LagAwareStrategy.ConfigBuilder
- getSslVerifyMode() method in SslOptions class
🔧 Core Improvements
1. Failover Logic Enhancements
- iterateActiveCluster() renamed to switchToHealthyCluster() for clarity
- findWeightedHealthyClusterToIterate() now excludes the source cluster from candidates
- canIterateOnceMore() renamed to canIterateFrom() with cluster parameter
- AtomicLong
2. Connection Management
- CircuitBreakerCommandExecutor - Now properly closes connections in finally block
- forceDisconnect() now calls setBroken() first to prevent race conditions with connection pool
3. Health Check System Refactoring
- minConsecutiveSuccessCount() replaced with getNumProbes(), getPolicy(), and getDelayInBetweenProbes()
- EchoStrategy now uses JedisPooled with connection pool (max 2 connections)
4. LagAwareStrategy Enhancements
- Throws JedisException when BDB not found instead of returning UNHEALTHY
- Throws JedisException on availability check errors for better error propagation
📦 Package Reorganization
1. Multi-Cluster Provider Moved
- MultiClusterPooledConnectionProvider moved from redis.clients.jedis.providers to redis.clients.jedis.mcf
- mcf package
🔄 API Changes
1. MultiClusterClientConfig Builder
2. HealthCheckStrategy Interface
3. HealthCheckStrategy.Config Builder
4. LagAwareStrategy.ConfigBuilder
🐛 Bug Fixes
1. Connection Leak Fix
- CircuitBreakerCommandExecutor was not closing connections on exceptions
2. Race Condition in forceDisconnect()
- returnResource instead of returnBrokenResource
- setBroken() before closing socket
3. Initialization Race Condition
- activeCluster during initialization could race with health check callbacks
- switchToHealthyCluster() within lock after initialization complete
4. KeyStore Type Handling
- SslOptions could fail when keystore/truststore type is null
- Uses KeyStore.getDefaultType() when type is null
🎯 Behavioral Changes
1. Failover Retry Behavior
Before:
After:
- Retries up to maxNumFailoverAttempts times
- Waits delayInBetweenFailoverAttempts between attempts
- Throws JedisTemporarilyNotAvailableException while retries remain
- Throws JedisPermanentlyNotAvailableException when max attempts exceeded
2. Health Check Probing
Before:
After:
3. Cluster Removal
Before:
After:
📊 Configuration Examples
Basic Failover with Retry
Health Check with Probing Policy
LagAwareStrategy with SSL
🔍 Technical Details
Failover Freeze Mechanism
The failover freeze prevents rapid repeated failover attempts:
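A minimal sketch of such a freeze follows, assuming that a new failover attempt is counted only after the previous delay window has elapsed and that the attempt counter resets after a successful failover. The class names, method names, and this exact counting rule are hypothetical. Only the two configuration names and the temporary/permanent distinction come from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins named after the exceptions in this PR; not the Jedis classes.
class TemporarilyNotAvailable extends RuntimeException {
  TemporarilyNotAvailable(String msg) { super(msg); }
}

class PermanentlyNotAvailable extends RuntimeException {
  PermanentlyNotAvailable(String msg) { super(msg); }
}

class FailoverFreezeSketch {
  private final int maxNumFailoverAttempts;
  private final long delayInBetweenFailoverAttemptsMs;
  private final AtomicInteger attempts = new AtomicInteger();
  private final AtomicLong freezeUntilMs = new AtomicLong();

  FailoverFreezeSketch(int maxNumFailoverAttempts, long delayInBetweenFailoverAttemptsMs) {
    this.maxNumFailoverAttempts = maxNumFailoverAttempts;
    this.delayInBetweenFailoverAttemptsMs = delayInBetweenFailoverAttemptsMs;
  }

  // Called when a failover is needed but no healthy cluster is available.
  // The clock value is passed in so the behaviour is easy to test.
  void onNoHealthyCluster(long nowMs) {
    long until = freezeUntilMs.get();
    // Count a new attempt only once the previous freeze window has elapsed;
    // the CAS lets exactly one of several racing threads open the next window.
    if (nowMs >= until
        && freezeUntilMs.compareAndSet(until, nowMs + delayInBetweenFailoverAttemptsMs)) {
      attempts.incrementAndGet();
    }
    if (attempts.get() > maxNumFailoverAttempts) {
      throw new PermanentlyNotAvailable("max failover attempts exceeded");
    }
    throw new TemporarilyNotAvailable("no healthy cluster yet, retries remain");
  }

  // Assumed reset point: a successful failover clears the counter and the window.
  void onFailoverSucceeded() {
    attempts.set(0);
    freezeUntilMs.set(0);
  }

  public static void main(String[] args) {
    FailoverFreezeSketch freeze = new FailoverFreezeSketch(2, 100);
    for (long t : new long[] {0, 50, 100, 200}) {
      try {
        freeze.onNoHealthyCluster(t);
      } catch (RuntimeException e) {
        System.out.println(t + "ms: " + e.getClass().getSimpleName());
      }
    }
  }
}
```

Calls that arrive inside the window (50ms in the example) do not consume an attempt, which is what keeps a burst of failing commands from exhausting the retry budget at once.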
Health Check Probing
The new probing system provides more flexibility:
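As a rough illustration of the three built-in policies, the sketch below evaluates a policy after all probes have run. ProbingPolicySketch, isHealthy, and runProbes are hypothetical names; the real ProbingPolicy interface may decide earlier or use a different signature, and the delay between probes is omitted here.

```java
import java.util.function.BooleanSupplier;

// Hypothetical re-creation of the policy semantics; only the constant names
// ALL_SUCCESS, MAJORITY and AT_LEAST_ONE come from this PR.
enum ProbingPolicySketch {
  ALL_SUCCESS {
    @Override
    boolean isHealthy(int successes, int numProbes) {
      return successes == numProbes;
    }
  },
  MAJORITY {
    @Override
    boolean isHealthy(int successes, int numProbes) {
      return successes * 2 > numProbes; // strictly more than half
    }
  },
  AT_LEAST_ONE {
    @Override
    boolean isHealthy(int successes, int numProbes) {
      return successes >= 1;
    }
  };

  abstract boolean isHealthy(int successes, int numProbes);

  // Runs numProbes probes back to back and lets the policy judge the outcome.
  static boolean runProbes(ProbingPolicySketch policy, int numProbes, BooleanSupplier probe) {
    int successes = 0;
    for (int i = 0; i < numProbes; i++) {
      if (probe.getAsBoolean()) {
        successes++;
      }
    }
    return policy.isHealthy(successes, numProbes);
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // A probe that fails on its first call and succeeds afterwards.
    BooleanSupplier flaky = () -> calls[0]++ > 0;
    System.out.println("MAJORITY healthy: " + runProbes(MAJORITY, 3, flaky));
  }
}
```

Under this reading, a single transient probe failure marks the endpoint unhealthy with ALL_SUCCESS but not with MAJORITY or AT_LEAST_ONE.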
Thread Safety Improvements
- activeClusterChangeLock renamed from activeClusterIndexLock for clarity
- activeCluster assignments now within lock (except initialization)
API Changes
HealthCheckStrategy interface:
- Removed: int minConsecutiveSuccessCount()
- Added: int getNumProbes(), ProbingPolicy getPolicy(), int getDelayInBetweenProbes()
HealthCheckStrategy.Config.Builder:
- Removed: minConsecutiveSuccessCount(int)
- Added: numProbes(int), policy(ProbingPolicy), delayInBetweenProbes(int)
Package relocation:
- MultiClusterPooledConnectionProvider moved from redis.clients.jedis.providers to redis.clients.jedis.mcf
Migration Guide
🧪 Testing
All changes validated with:
📝 Additional Notes
- ALL_SUCCESS with numProbes=3 is equivalent to the old minConsecutiveSuccessCount=3