[automatic failover] Automatic failover client improvements (part 2) #4297

atakavci · 2025-09-29T21:56:24Z

Improved failover resilience and health check enhancements for Active-Active deployments

This PR enhances the automatic failover and failback system with improved resilience, better health check probing strategies, and refined exception handling. These changes make the failover system more robust for production Active-Active Redis deployments.

🚀 New Features Added

1. Configurable Failover Retry Mechanism

NEW: maxNumFailoverAttempts configuration - Controls maximum failover attempts before giving up (default: 10)
NEW: delayInBetweenFailoverAttempts configuration - Delay between failover attempts in milliseconds (default: 12000ms)
NEW: Failover freeze mechanism to prevent rapid repeated failover attempts
NEW: Automatic failover attempt counter with reset on successful failover

2. Enhanced Exception Handling

NEW: JedisFailoverException - Base exception for failover-related errors
NEW: JedisPermanentlyNotAvailableException - Thrown when max failover attempts exceeded
NEW: JedisTemporarilyNotAvailableException - Thrown when clusters temporarily unavailable but retries remain
NEW: assertOperability() method to validate cluster health before command execution

3. Improved Health Check Probing System

NEW: ProbingPolicy interface for flexible health check evaluation strategies
NEW: ProbingPolicy.BuiltIn.ALL_SUCCESS - Requires all probes to succeed
NEW: ProbingPolicy.BuiltIn.MAJORITY - Requires majority of probes to succeed
NEW: ProbingPolicy.BuiltIn.AT_LEAST_ONE - Requires at least one probe to succeed
NEW: numProbes configuration - Number of health check probes to execute (replaces minConsecutiveSuccessCount)
NEW: delayInBetweenProbes configuration - Delay between individual probes in milliseconds (default: 100ms)
NEW: HealthProbeContext - Tracks probe execution state and results

4. SSL Support for Redis Enterprise REST API

NEW: SslOptions support in LagAwareStrategy for HTTPS connections to Redis Enterprise
NEW: sslOptions() builder method in LagAwareStrategy.ConfigBuilder
NEW: getSslVerifyMode() method in SslOptions class

🔧 Core Improvements

1. Failover Logic Enhancements

CHANGED: iterateActiveCluster() renamed to switchToHealthyCluster() for clarity
CHANGED: Failover now accepts source cluster parameter to avoid switching to same cluster
CHANGED: findWeightedHealthyClusterToIterate() now excludes the source cluster from candidates
CHANGED: canIterateOnceMore() renamed to canIterateFrom() with cluster parameter
IMPROVED: Thread-safe failover freeze tracking using AtomicLong
IMPROVED: Failover attempt counting with automatic reset on success

2. Connection Management

FIXED: Connection leak in CircuitBreakerCommandExecutor - Now properly closes connections in finally block
IMPROVED: Connection acquisition error handling - Validates operability before throwing exception
IMPROVED: forceDisconnect() now calls setBroken() first to prevent race conditions with connection pool

3. Health Check System Refactoring

CHANGED: Health check now uses probe-based evaluation instead of consecutive success counting
CHANGED: minConsecutiveSuccessCount() replaced with getNumProbes(), getPolicy(), and getDelayInBetweenProbes()
IMPROVED: Health check timeout handling with better error logging
IMPROVED: Probe execution with configurable delays between attempts
IMPROVED: Race condition prevention in health status updates with detailed documentation
CHANGED: EchoStrategy now uses JedisPooled with connection pool (max 2 connections)
CHANGED: Shared worker thread pool for all health checks instead of per-instance executors

4. LagAwareStrategy Enhancements

CHANGED: Now throws JedisException when BDB not found instead of returning UNHEALTHY
CHANGED: Throws JedisException on availability check errors for better error propagation
IMPROVED: Better error messages and logging
NEW: SSL support for secure REST API connections

📦 Package Reorganization

1. Multi-Cluster Provider Moved

MOVED: MultiClusterPooledConnectionProvider from redis.clients.jedis.providers to redis.clients.jedis.mcf
REASON: Better package organization - all multi-cluster failover components now in mcf package
UPDATED: All import statements across codebase to reflect new package location

🔄 API Changes

1. MultiClusterClientConfig Builder

// NEW configuration methods
builder.maxNumFailoverAttempts(10)              // Max failover attempts
builder.delayInBetweenFailoverAttempts(12000)   // Delay between attempts (ms)

2. HealthCheckStrategy Interface

// CHANGED: Old methods removed
- int minConsecutiveSuccessCount()

// NEW: Replaced with probe-based methods
+ int getNumProbes()
+ ProbingPolicy getPolicy()
+ int getDelayInBetweenProbes()

3. HealthCheckStrategy.Config Builder

// CHANGED: Old builder methods
- minConsecutiveSuccessCount(int)

// NEW: Probe-based configuration
+ numProbes(int)                    // Number of probes (default: 3)
+ policy(ProbingPolicy)             // Probing policy (default: ALL_SUCCESS)
+ delayInBetweenProbes(int)         // Delay between probes (default: 100ms)

4. LagAwareStrategy.ConfigBuilder

// NEW: SSL configuration for REST API
builder.sslOptions(SslOptions)      // Configure SSL for HTTPS connections

🐛 Bug Fixes

1. Connection Leak Fix

FIXED: CircuitBreakerCommandExecutor was not closing connections on exceptions
SOLUTION: Added try-finally block to ensure connection closure
IMPACT: Prevents connection pool exhaustion during failures
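
To illustrate the shape of this fix, here is a minimal sketch of the try-finally pattern. It does not reproduce the actual CircuitBreakerCommandExecutor code: `ConnectionLeakSketch`, `FakeConnection`, and `execute` are hypothetical stand-ins, and the open-connection counter exists only to make the leak observable.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical stand-in for an executor that borrows a connection per command.
final class ConnectionLeakSketch {

    // Fake connection that counts how many instances are currently open.
    static final class FakeConnection implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();

        FakeConnection() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    // Runs the command and always releases the connection, even if the command throws.
    static <T> T execute(Function<FakeConnection, T> command) {
        FakeConnection connection = new FakeConnection();
        try {
            return command.apply(connection);
        } finally {
            connection.close();
        }
    }
}
```

Without the finally block, a command that throws would leave the connection checked out, which is how a pool can run dry during a failure storm.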

2. Race Condition in forceDisconnect()

FIXED: Concurrent close attempts could call returnResource instead of returnBrokenResource
SOLUTION: Call setBroken() before closing socket
IMPACT: Proper connection pool state management during fast failover
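
The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not the Jedis Connection class: `FakePool` and `FakeConnection` are hypothetical, and only the method names `setBroken`, `returnResource`, and `returnBrokenResource` are taken from the description.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a pooled connection; not the Jedis implementation.
final class ForceDisconnectSketch {

    static final class FakePool {
        final AtomicInteger healthyReturns = new AtomicInteger();
        final AtomicInteger brokenReturns = new AtomicInteger();

        void returnResource(FakeConnection c) {
            healthyReturns.incrementAndGet();
        }

        void returnBrokenResource(FakeConnection c) {
            brokenReturns.incrementAndGet();
        }
    }

    static final class FakeConnection {
        private final FakePool pool;
        private volatile boolean broken;
        private volatile boolean socketOpen = true;

        FakeConnection(FakePool pool) {
            this.pool = pool;
        }

        void setBroken() {
            broken = true;
        }

        // A borrower calling close() picks the return path from the broken flag.
        void close() {
            if (broken) {
                pool.returnBrokenResource(this);
            } else {
                pool.returnResource(this);
            }
        }

        // Mark broken BEFORE touching the socket, so a concurrent close()
        // can never observe a dead socket on a connection still flagged healthy.
        void forceDisconnect() {
            setBroken();
            socketOpen = false;
        }

        boolean isSocketOpen() {
            return socketOpen;
        }
    }
}
```

If the two statements in `forceDisconnect()` were swapped, a `close()` running between them would hand a dead connection back to the pool as if it were healthy.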

3. Initialization Race Condition

FIXED: Direct assignment to activeCluster during initialization could race with health check callbacks
SOLUTION: Use switchToHealthyCluster() within lock after initialization complete
IMPACT: Thread-safe cluster switching during startup

4. KeyStore Type Handling

FIXED: SslOptions could fail when keystore/truststore type is null
SOLUTION: Use KeyStore.getDefaultType() when type is null
IMPACT: Better SSL configuration compatibility
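
A minimal sketch of the null-type fallback, using only JDK calls (`KeyStore.getDefaultType()` and `KeyStore.getInstance(String)`). The class and method names are hypothetical and this is not the SslOptions code itself.

```java
import java.security.KeyStore;
import java.security.KeyStoreException;

// Hypothetical helper showing the fallback; not the Jedis SslOptions class.
final class KeyStoreTypeSketch {

    // Returns the configured type, or the JDK default when none was configured.
    static String resolveType(String configuredType) {
        return configuredType != null ? configuredType : KeyStore.getDefaultType();
    }

    // Creates an (unloaded) KeyStore instance for the resolved type.
    static KeyStore newKeyStore(String configuredType) {
        try {
            return KeyStore.getInstance(resolveType(configuredType));
        } catch (KeyStoreException e) {
            throw new IllegalArgumentException("Unsupported keystore type: " + configuredType, e);
        }
    }
}
```

Passing null directly to `KeyStore.getInstance` would fail, so resolving the type first lets callers omit it from their SSL configuration.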

🎯 Behavioral Changes

1. Failover Retry Behavior

Before:

Failover would throw exception immediately if no healthy cluster available
No retry mechanism for transient failures

After:

Failover attempts up to maxNumFailoverAttempts times
Waits delayInBetweenFailoverAttempts between attempts
Throws JedisTemporarilyNotAvailableException while retries remain
Throws JedisPermanentlyNotAvailableException when max attempts exceeded
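
From the caller's side, the two exception types suggest different reactions: retry later on the temporary one, give up on the permanent one. The sketch below shows one way that could look. The exception classes here are local stand-ins declared for illustration (the real classes are JedisTemporarilyNotAvailableException and JedisPermanentlyNotAvailableException), and `callWithRetry`, `maxRetries`, and `backoffMillis` are hypothetical names.

```java
import java.util.function.Supplier;

// Caller-side retry sketch with stand-in exception types; not Jedis code.
final class FailoverRetrySketch {

    static class FailoverException extends RuntimeException {
        FailoverException(String message) {
            super(message);
        }
    }

    // Stand-in for the "temporarily not available" case: retries remain.
    static class TemporarilyNotAvailable extends FailoverException {
        TemporarilyNotAvailable(String message) {
            super(message);
        }
    }

    // Stand-in for the "permanently not available" case: attempts exhausted.
    static class PermanentlyNotAvailable extends FailoverException {
        PermanentlyNotAvailable(String message) {
            super(message);
        }
    }

    // Retries only on the temporary exception; the permanent one propagates at once.
    static <T> T callWithRetry(Supplier<T> command, int maxRetries, long backoffMillis) {
        for (int attempt = 0;; attempt++) {
            try {
                return command.get();
            } catch (TemporarilyNotAvailable e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

Whether an application should retry at all, and for how long, depends on its own latency budget; the client already spaces out its failover attempts internally.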

2. Health Check Probing

Before:

Required N consecutive successful health checks
Single failure reset the counter
No configurable probing strategy

After:

Executes N probes with configurable delay between them
Evaluates results based on policy (ALL_SUCCESS, MAJORITY, AT_LEAST_ONE)
More flexible, and faster when the policy reaches a decision before all N probes have run

3. Cluster Removal

Before:

Cluster switch notification sent within lock
Potential deadlock risk

After:

Notification data collected within lock
Notification sent after lock released
Better concurrency and reduced lock contention
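
The "collect inside the lock, notify outside it" pattern can be sketched like this. All names (`ClusterSwitchSketch`, `switchTo`, the String-based event) are hypothetical simplifications and do not mirror the provider's actual fields or listener types.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical sketch of deferring listener callbacks until the lock is released.
final class ClusterSwitchSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Consumer<String> listener;
    private String activeCluster;

    ClusterSwitchSketch(String initialCluster, Consumer<String> listener) {
        this.activeCluster = initialCluster;
        this.listener = listener;
    }

    void switchTo(String next) {
        String event = null;
        lock.lock();
        try {
            if (!next.equals(activeCluster)) {
                // Only capture the notification data while holding the lock.
                event = activeCluster + " -> " + next;
                activeCluster = next;
            }
        } finally {
            lock.unlock();
        }
        // The callback runs after unlock, so a slow or re-entrant listener
        // cannot block other threads or deadlock against this lock.
        if (event != null) {
            listener.accept(event);
        }
    }

    String activeCluster() {
        lock.lock();
        try {
            return activeCluster;
        } finally {
            lock.unlock();
        }
    }

    boolean isLockHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }
}
```

The trade-off is that listeners may observe the event slightly after the state change, which is usually acceptable for notifications.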

📊 Configuration Examples

Basic Failover with Retry

MultiClusterClientConfig config = MultiClusterClientConfig.builder(clusterConfigs)
    .circuitBreakerFailureRateThreshold(50.0f)
    .maxNumFailoverAttempts(10)                    // NEW: Max 10 failover attempts
    .delayInBetweenFailoverAttempts(12000)         // NEW: 12 second delay between attempts
    .build();

Health Check with Probing Policy

HealthCheckStrategy.Config healthConfig = HealthCheckStrategy.Config.builder()
    .interval(5000)                                 // Check every 5 seconds
    .timeout(2000)                                  // 2 second timeout per probe
    .numProbes(5)                                   // NEW: Execute 5 probes
    .policy(ProbingPolicy.BuiltIn.MAJORITY)        // NEW: Require majority success
    .delayInBetweenProbes(200)                     // NEW: 200ms between probes
    .build();

LagAwareStrategy with SSL

SslOptions sslOptions = SslOptions.builder()
    .truststore(truststorePath)
    .truststorePassword(password)
    .build();

LagAwareStrategy.Config lagConfig = LagAwareStrategy.Config.builder(restEndpoint, credentials)
    .sslOptions(sslOptions)                        // NEW: SSL for REST API
    .interval(5000)
    .timeout(3000)
    .numProbes(3)                                  // NEW: Probe-based health checks
    .policy(ProbingPolicy.BuiltIn.ALL_SUCCESS)    // NEW: All probes must succeed
    .build();

🔍 Technical Details

Failover Freeze Mechanism

The failover freeze prevents rapid repeated failover attempts:

First failover attempt sets a freeze-until timestamp
Subsequent attempts within freeze period reuse same attempt count
After freeze period expires, new attempt increments counter
Counter resets to 0 on successful failover
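
The four steps above can be sketched with an AtomicLong and an AtomicInteger. This is a simplified model, not the provider's code: the class name, the explicit `nowMillis` parameter (used so the behavior is deterministic), and the exact boundary at which attempts count as exceeded are all assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of the freeze window and attempt counter.
final class FailoverAttemptTracker {
    private final int maxAttempts;
    private final long delayMillis;
    private final AtomicLong freezeUntil = new AtomicLong(0);
    private final AtomicInteger attempts = new AtomicInteger(0);

    FailoverAttemptTracker(int maxAttempts, long delayMillis) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    // Records a failed failover at the given time.
    // Returns true while retries remain, false once attempts exceed the maximum.
    boolean onFailedAttempt(long nowMillis) {
        long until = freezeUntil.get();
        // Only the thread that wins the CAS opens a new freeze window and counts
        // an attempt; calls inside the window reuse the current count.
        if (nowMillis >= until && freezeUntil.compareAndSet(until, nowMillis + delayMillis)) {
            attempts.incrementAndGet();
        }
        return attempts.get() <= maxAttempts;
    }

    // A successful failover resets the counter.
    void onSuccess() {
        attempts.set(0);
    }

    int attempts() {
        return attempts.get();
    }
}
```

In real use the caller would pass `System.currentTimeMillis()`; a false return corresponds to the "permanently not available" case and a true return to the "temporarily not available" case.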

Health Check Probing

The new probing system provides more flexibility:

Execute N probes with configurable delays
Each probe result (success/failure) recorded in context
Policy evaluates context after each probe
Early termination when policy determines outcome
Final result based on policy decision
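
The probing flow described above can be sketched as a loop that asks the policy for a decision after every probe. The types below are local stand-ins written for illustration: they borrow the policy names from this PR but are not the ProbingPolicy or HealthProbeContext classes, and the majority threshold (more than half of the configured probes) is an assumption.

```java
import java.util.function.BooleanSupplier;

// Hypothetical stand-ins; these are NOT the Jedis types introduced by this PR.
final class ProbingSketch {

    enum Decision { CONTINUE, HEALTHY, UNHEALTHY }

    enum Policy {
        ALL_SUCCESS {
            Decision evaluate(int total, int ok, int failed) {
                if (failed > 0) return Decision.UNHEALTHY;
                return ok == total ? Decision.HEALTHY : Decision.CONTINUE;
            }
        },
        MAJORITY {
            Decision evaluate(int total, int ok, int failed) {
                int needed = total / 2 + 1;
                if (ok >= needed) return Decision.HEALTHY;
                // Unhealthy once a majority can no longer be reached.
                if (failed > total - needed) return Decision.UNHEALTHY;
                return Decision.CONTINUE;
            }
        },
        AT_LEAST_ONE {
            Decision evaluate(int total, int ok, int failed) {
                if (ok > 0) return Decision.HEALTHY;
                return failed == total ? Decision.UNHEALTHY : Decision.CONTINUE;
            }
        };

        abstract Decision evaluate(int total, int ok, int failed);
    }

    // Verdict plus the number of probes that actually ran.
    static final class Result {
        final boolean healthy;
        final int probesRun;

        Result(boolean healthy, int probesRun) {
            this.healthy = healthy;
            this.probesRun = probesRun;
        }
    }

    static Result check(BooleanSupplier probe, int numProbes, Policy policy, long delayMillis) {
        int ok = 0;
        int failed = 0;
        for (int i = 0; i < numProbes; i++) {
            if (i > 0 && delayMillis > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Interruption stops probing; report unhealthy.
                    Thread.currentThread().interrupt();
                    return new Result(false, ok + failed);
                }
            }
            if (probe.getAsBoolean()) ok++; else failed++;
            Decision decision = policy.evaluate(numProbes, ok, failed);
            if (decision != Decision.CONTINUE) {
                // Early termination: the outcome can no longer change.
                return new Result(decision == Decision.HEALTHY, ok + failed);
            }
        }
        return new Result(false, ok + failed);
    }
}
```

With five probes and the majority policy, for example, three successes in a row settle the outcome and the remaining two probes are skipped.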

Thread Safety Improvements

activeClusterIndexLock renamed to activeClusterChangeLock for clarity
All activeCluster assignments now within lock (except initialization)
Notification callbacks moved outside locks to prevent deadlocks
Atomic operations for failover freeze and attempt counting

⚠️ Breaking Changes

API Changes

HealthCheckStrategy interface:
- Removed: int minConsecutiveSuccessCount()
- Added: int getNumProbes(), ProbingPolicy getPolicy(), int getDelayInBetweenProbes()
HealthCheckStrategy.Config.Builder:
- Removed: minConsecutiveSuccessCount(int)
- Added: numProbes(int), policy(ProbingPolicy), delayInBetweenProbes(int)
Package relocation:
- MultiClusterPooledConnectionProvider moved from redis.clients.jedis.providers to redis.clients.jedis.mcf

Migration Guide

// OLD: Consecutive success count
HealthCheckStrategy.Config.builder()
    .minConsecutiveSuccessCount(3)
    .build();

// NEW: Probe-based with policy
HealthCheckStrategy.Config.builder()
    .numProbes(3)
    .policy(ProbingPolicy.BuiltIn.ALL_SUCCESS)
    .delayInBetweenProbes(100)
    .build();

🧪 Testing

All changes validated with:

✅ Unit tests for new probing policies
✅ Integration tests for failover retry mechanism
✅ Connection leak tests
✅ Race condition tests for initialization
✅ SSL configuration tests for LagAwareStrategy
✅ Multi-threaded failover scenario tests

📝 Additional Notes

Default values maintain backward-compatible behavior for most use cases
Probing policy ALL_SUCCESS with numProbes=3 is equivalent to the old minConsecutiveSuccessCount=3
New exception types provide better error handling and retry logic
SSL support enables secure REST API communication with Redis Enterprise
Shared worker thread pool reduces resource consumption for health checks

* - replace minConsecutiveSuccessCount with numberOfRetries - add retries into healthCheckImpl - apply changes to strategy implementations config classes - fix unit tests * - fix typo * - fix failing tests * - add tests for retry logic * - formatting * - format * - revisit numRetries for healthCheck, replace with numProbes and implement built in policies - new types probecontext, ProbePolicy, HealthProbeContext - add delayer executor pool to healthCheckImpl - adjustments on worker pool of healthCheckImpl for shared use of workers * - format * - expand comment with example case * - drop pooled executor for delays * - polish * - fix tests * - formatting * - checking failing tests * - fix test * - fix flaky tests * - fix flaky test * - add tests for builtin probing policies * - fix flaky test

* - implement max failover attempt - add tests * - fix user receive the intended exception * -clean+format * - java doc for exceptions * format * - more tests on exception types in max failover attempts mechanism * format * fix failing timing in test * disable health checks * rename to switchToHealthyCluster * format

github-actions · 2025-09-29T22:08:27Z

Test Results

Files: 273 (+1), Suites: 273 (+1), Duration: 11m 4s (+9s)
Tests: 10 001 (+17), Passed: 9 956 (+428), Skipped: 45 (-411), Failed: 0 (±0)
Runs: 2 639 (+17), Passed: 2 639 (+17), Skipped: 0 (±0), Failed: 0 (±0)

Results for commit 1ce1746. ± Comparison against base commit e838e48.

This pull request removes 13 and adds 30 tests. Note that renamed tests count towards both.

redis.clients.jedis.mcf.LagAwareStrategyUnitTest ‑ unhealthy_and_cache_reset_on_exception_then_recovers_next_time
redis.clients.jedis.mcf.LagAwareStrategyUnitTest ‑ unhealthy_when_no_bdb_returned
redis.clients.jedis.mcf.LagAwareStrategyUnitTest ‑ unhealthy_when_no_matching_host_found
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testCanIterateOnceMore
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testCircuitBreakerForcedTransitions
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testConnectionPoolConfigApplied
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testHealthChecksStopAfterProviderClose
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testIterateActiveCluster
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testIterateActiveClusterOutOfRange
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testRunClusterFailoverPostProcessor
…

redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_AllSuccess_EarlyFail_Integration
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_ExhaustProbesAndStayUnhealthy
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_Majority_EarlySuccess_Integration
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_RealHealthCheckWithProbes
redis.clients.jedis.mcf.HealthCheckTest ‑ testPolicy_AllSuccess_StopsOnFirstFailure
redis.clients.jedis.mcf.HealthCheckTest ‑ testPolicy_Majority_EarlyFailStopsAtTwo
redis.clients.jedis.mcf.HealthCheckTest ‑ testPolicy_Majority_EarlySuccessStopsAtThree
redis.clients.jedis.mcf.HealthCheckTest ‑ testRetryLogic_ExhaustAllProbesAndFail
redis.clients.jedis.mcf.HealthCheckTest ‑ testRetryLogic_FailThenSucceedOnRetry
redis.clients.jedis.mcf.HealthCheckTest ‑ testRetryLogic_InterruptionStopsProbes
…

♻️ This comment has been updated with latest results.

atakavci and others added 9 commits September 3, 2025 14:03

[automatic failover] Remove the check for 'GenericObjectPool.getNumW…

aaed216

…aiters()' in 'TrackingConnectionPool' (#4270) - remove the check for number of waiters in TrackingConnectionPool

[automatic failover] Configure max total connections for EchoStrategy (…

beb5e14

…#4268) - set maxtotal connections for echoStrategy

[automatic failover] Replace 'CircuitBreaker' with 'Cluster' for 'Cir…

e972c21

…cuitBreakerFailoverBase.clusterFailover' (#4275) * - replace CircuitBreaker with Cluster for CircuitBreakerFailoverBase.clusterFailover - improve thread safety with provider initialization * - formatting

[automatic failover] Minor optimizations on fast failover (#4277)

55a96a9

* - minor optimizations on fail fast * - volatile failfast

[automatic failover] Move failover provider to mcf (#4294)

213e749

* - move failover provider to mcf * - make iterateActiveCluster package private

[automatic failover] Add SSL configuration support to LagAwareStrategy (

0d8e184

#4291) * User-provided ssl config for lag-aware health check * ssl scenario test for lag-aware healthcheck * format * format * address review comments - use getters instead of fields

Merge remote-tracking branch 'redis/master' into feature/automatic-fa…

1ce1746

…ilover-2

atakavci requested review from uglide and ggivo September 29, 2025 21:56

atakavci self-assigned this Sep 29, 2025

atakavci added breakingchange and feature labels Sep 29, 2025

github-advanced-security bot found potential problems Sep 29, 2025

src/test/java/redis/clients/jedis/scenario/LagAwareStrategySslIT.java (alert dismissed)

ggivo approved these changes Sep 30, 2025

ggivo merged commit b899f3f into master Sep 30, 2025
19 checks passed

atakavci added this to the 7.0.0 milestone Oct 7, 2025
