atakavci
Contributor

Improved failover resilience and health check enhancements for Active-Active deployments

This PR enhances the automatic failover and failback system with improved resilience, better health check probing strategies, and refined exception handling. These changes make the failover system more robust for production Active-Active Redis deployments.

🚀 New Features Added

1. Configurable Failover Retry Mechanism

  • NEW: maxNumFailoverAttempts configuration - Controls maximum failover attempts before giving up (default: 10)
  • NEW: delayInBetweenFailoverAttempts configuration - Delay between failover attempts in milliseconds (default: 12000ms)
  • NEW: Failover freeze mechanism to prevent rapid repeated failover attempts
  • NEW: Automatic failover attempt counter with reset on successful failover

2. Enhanced Exception Handling

  • NEW: JedisFailoverException - Base exception for failover-related errors
  • NEW: JedisPermanentlyNotAvailableException - Thrown when max failover attempts exceeded
  • NEW: JedisTemporarilyNotAvailableException - Thrown when clusters are temporarily unavailable but retries remain
  • NEW: assertOperability() method to validate cluster health before command execution

3. Improved Health Check Probing System

  • NEW: ProbingPolicy interface for flexible health check evaluation strategies
  • NEW: ProbingPolicy.BuiltIn.ALL_SUCCESS - Requires all probes to succeed
  • NEW: ProbingPolicy.BuiltIn.MAJORITY - Requires majority of probes to succeed
  • NEW: ProbingPolicy.BuiltIn.AT_LEAST_ONE - Requires at least one probe to succeed
  • NEW: numProbes configuration - Number of health check probes to execute (replaces minConsecutiveSuccessCount)
  • NEW: delayInBetweenProbes configuration - Delay between individual probes in milliseconds (default: 100ms)
  • NEW: HealthProbeContext - Tracks probe execution state and results

4. SSL Support for Redis Enterprise REST API

  • NEW: SslOptions support in LagAwareStrategy for HTTPS connections to Redis Enterprise
  • NEW: sslOptions() builder method in LagAwareStrategy.ConfigBuilder
  • NEW: getSslVerifyMode() method in SslOptions class

🔧 Core Improvements

1. Failover Logic Enhancements

  • CHANGED: iterateActiveCluster() renamed to switchToHealthyCluster() for clarity
  • CHANGED: Failover now accepts source cluster parameter to avoid switching to same cluster
  • CHANGED: findWeightedHealthyClusterToIterate() now excludes the source cluster from candidates
  • CHANGED: canIterateOnceMore() renamed to canIterateFrom() with cluster parameter
  • IMPROVED: Thread-safe failover freeze tracking using AtomicLong
  • IMPROVED: Failover attempt counting with automatic reset on success

2. Connection Management

  • FIXED: Connection leak in CircuitBreakerCommandExecutor - Now properly closes connections in finally block
  • IMPROVED: Connection acquisition error handling - Validates operability before throwing exception
  • IMPROVED: forceDisconnect() now calls setBroken() first to prevent race conditions with connection pool

3. Health Check System Refactoring

  • CHANGED: Health check now uses probe-based evaluation instead of consecutive success counting
  • CHANGED: minConsecutiveSuccessCount() replaced with getNumProbes(), getPolicy(), and getDelayInBetweenProbes()
  • IMPROVED: Health check timeout handling with better error logging
  • IMPROVED: Probe execution with configurable delays between attempts
  • IMPROVED: Race condition prevention in health status updates with detailed documentation
  • CHANGED: EchoStrategy now uses JedisPooled with connection pool (max 2 connections)
  • CHANGED: Shared worker thread pool for all health checks instead of per-instance executors

4. LagAwareStrategy Enhancements

  • CHANGED: Now throws JedisException when BDB not found instead of returning UNHEALTHY
  • CHANGED: Throws JedisException on availability check errors for better error propagation
  • IMPROVED: Better error messages and logging
  • NEW: SSL support for secure REST API connections

📦 Package Reorganization

1. Multi-Cluster Provider Moved

  • MOVED: MultiClusterPooledConnectionProvider from redis.clients.jedis.providers to redis.clients.jedis.mcf
  • REASON: Better package organization - all multi-cluster failover components now in mcf package
  • UPDATED: All import statements across codebase to reflect new package location

🔄 API Changes

1. MultiClusterClientConfig Builder

// NEW configuration methods
builder.maxNumFailoverAttempts(10)              // Max failover attempts
builder.delayInBetweenFailoverAttempts(12000)   // Delay between attempts (ms)

2. HealthCheckStrategy Interface

// CHANGED: Old method removed
- int minConsecutiveSuccessCount()

// NEW: Replaced with probe-based methods
+ int getNumProbes()
+ ProbingPolicy getPolicy()
+ int getDelayInBetweenProbes()

3. HealthCheckStrategy.Config Builder

// CHANGED: Old builder method removed
- minConsecutiveSuccessCount(int)

// NEW: Probe-based configuration
+ numProbes(int)                    // Number of probes (default: 3)
+ policy(ProbingPolicy)             // Probing policy (default: ALL_SUCCESS)
+ delayInBetweenProbes(int)         // Delay between probes (default: 100ms)

4. LagAwareStrategy.ConfigBuilder

// NEW: SSL configuration for REST API
builder.sslOptions(SslOptions)      // Configure SSL for HTTPS connections

🐛 Bug Fixes

1. Connection Leak Fix

  • FIXED: CircuitBreakerCommandExecutor was not closing connections on exceptions
  • SOLUTION: Added try-finally block to ensure connection closure
  • IMPACT: Prevents connection pool exhaustion during failures
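The try-finally pattern behind this fix can be sketched as follows. This is a minimal illustration, not the actual CircuitBreakerCommandExecutor code: `ConnectionLeakSketch`, `FakeConnection`, and `execute` are hypothetical names introduced only for the example.

```java
import java.util.function.Function;

// Hypothetical stand-ins for illustration; not the Jedis classes.
class ConnectionLeakSketch {

    /** Minimal stand-in for a pooled connection that records whether it was closed. */
    static final class FakeConnection implements AutoCloseable {
        private boolean closed;

        @Override
        public void close() {
            closed = true; // a real pooled connection would return itself to the pool here
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Runs the command and always closes the connection, even if the command throws. */
    static <T> T execute(FakeConnection connection, Function<FakeConnection, T> command) {
        try {
            return command.apply(connection);
        } finally {
            connection.close();
        }
    }

    public static void main(String[] args) {
        FakeConnection connection = new FakeConnection();
        try {
            execute(connection, c -> {
                throw new IllegalStateException("simulated command failure");
            });
        } catch (IllegalStateException expected) {
            System.out.println("closed after failure: " + connection.isClosed());
        }
    }
}
```

Without the finally block, the exception path would skip `close()` and the connection would never go back to the pool.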

2. Race Condition in forceDisconnect()

  • FIXED: Concurrent close attempts could call returnResource instead of returnBrokenResource
  • SOLUTION: Call setBroken() before closing socket
  • IMPACT: Proper connection pool state management during fast failover
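A rough sketch of why the ordering matters, assuming a pool that decides between a healthy and a broken return based on a flag on the connection. `FakePool`, `FakeConnection`, and the counters are hypothetical stand-ins, not the Jedis connection or pool classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration; not the Jedis classes.
class ForceDisconnectSketch {

    /** Counts how connections were handed back. */
    static final class FakePool {
        final AtomicInteger returnedHealthy = new AtomicInteger();
        final AtomicInteger returnedBroken = new AtomicInteger();
    }

    static final class FakeConnection {
        private final FakePool pool;
        private volatile boolean broken;
        private volatile boolean socketOpen = true;

        FakeConnection(FakePool pool) {
            this.pool = pool;
        }

        void setBroken() {
            broken = true;
        }

        /** Hands the connection back to the pool; the broken flag picks the return path. */
        void close() {
            if (broken) {
                pool.returnedBroken.incrementAndGet();
            } else {
                pool.returnedHealthy.incrementAndGet();
            }
        }

        /** Marks the connection broken BEFORE touching the socket, so a concurrent
         *  close() that runs at any point after this call starts sees the flag. */
        void forceDisconnect() {
            setBroken();
            socketOpen = false; // stands in for closing the underlying socket
        }

        boolean isSocketOpen() {
            return socketOpen;
        }
    }

    public static void main(String[] args) {
        FakePool pool = new FakePool();
        FakeConnection connection = new FakeConnection(pool);
        connection.forceDisconnect();
        connection.close();
        System.out.println("returned as broken: " + pool.returnedBroken.get());
    }
}
```

If the socket were closed first, a `close()` racing in between the two steps would still see `broken == false` and return a dead connection as healthy.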

3. Initialization Race Condition

  • FIXED: Direct assignment to activeCluster during initialization could race with health check callbacks
  • SOLUTION: Use switchToHealthyCluster() within lock after initialization complete
  • IMPACT: Thread-safe cluster switching during startup

4. KeyStore Type Handling

  • FIXED: SslOptions could fail when keystore/truststore type is null
  • SOLUTION: Use KeyStore.getDefaultType() when type is null
  • IMPACT: Better SSL configuration compatibility
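The null-type fallback could look roughly like this. `KeyStore.getDefaultType()` and `KeyStore.getInstance(String)` are standard JDK calls; the class and method names around them are hypothetical and do not mirror the SslOptions internals.

```java
import java.security.KeyStore;
import java.security.KeyStoreException;

// Hypothetical helper for illustration; not the SslOptions implementation.
class KeyStoreTypeSketch {

    /** Returns the configured type, or the JVM default when none was configured. */
    static String resolveType(String configuredType) {
        return configuredType != null ? configuredType : KeyStore.getDefaultType();
    }

    /** Creates an (unloaded) KeyStore instance for the resolved type. */
    static KeyStore newKeyStore(String configuredType) throws KeyStoreException {
        return KeyStore.getInstance(resolveType(configuredType));
    }

    public static void main(String[] args) throws KeyStoreException {
        // Passing null no longer fails; the platform default type is used instead.
        KeyStore keyStore = newKeyStore(null);
        System.out.println("resolved keystore type: " + keyStore.getType());
    }
}
```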

🎯 Behavioral Changes

1. Failover Retry Behavior

Before:

  • Failover would throw exception immediately if no healthy cluster available
  • No retry mechanism for transient failures

After:

  • Failover attempts up to maxNumFailoverAttempts times
  • Waits delayInBetweenFailoverAttempts between attempts
  • Throws JedisTemporarilyNotAvailableException while retries remain
  • Throws JedisPermanentlyNotAvailableException when max attempts exceeded
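From the caller's side, the two exception types suggest a retry-on-temporary, give-up-on-permanent pattern. The sketch below is one possible way to use them, not code from this PR: the three exception classes are local stand-ins that reuse the PR's names, the hierarchy (both extending JedisFailoverException) is an assumption based on the "base exception" wording, and `callWithRetry` is a hypothetical helper.

```java
import java.util.function.Supplier;

// Local stand-ins for illustration; the real classes live in Jedis.
class FailoverExceptionSketch {

    static class JedisFailoverException extends RuntimeException {
        JedisFailoverException(String message) {
            super(message);
        }
    }

    // Assumed to extend the base failover exception.
    static class JedisTemporarilyNotAvailableException extends JedisFailoverException {
        JedisTemporarilyNotAvailableException(String message) {
            super(message);
        }
    }

    // Assumed to extend the base failover exception.
    static class JedisPermanentlyNotAvailableException extends JedisFailoverException {
        JedisPermanentlyNotAvailableException(String message) {
            super(message);
        }
    }

    /** Retries while the failure is temporary; permanent failures propagate immediately. */
    static <T> T callWithRetry(Supplier<T> command, int maxRetries, long backoffMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return command.get();
            } catch (JedisTemporarilyNotAvailableException e) {
                if (attempt >= maxRetries) {
                    throw e; // caller-side retry budget exhausted
                }
                sleep(backoffMillis);
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String reply = callWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new JedisTemporarilyNotAvailableException("failover in progress");
            }
            return "OK";
        }, 5, 10L);
        System.out.println(reply + " after " + calls[0] + " calls");
    }
}
```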

2. Health Check Probing

Before:

  • Required N consecutive successful health checks
  • Single failure reset the counter
  • No configurable probing strategy

After:

  • Executes N probes with configurable delay between them
  • Evaluates results based on policy (ALL_SUCCESS, MAJORITY, AT_LEAST_ONE)
  • More flexible and faster health determination

3. Cluster Removal

Before:

  • Cluster switch notification sent within lock
  • Potential deadlock risk

After:

  • Notification data collected within lock
  • Notification sent after lock released
  • Better concurrency and reduced lock contention
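The collect-inside, notify-outside pattern can be illustrated with a small sketch. Cluster names are plain strings here and `NotifyOutsideLockSketch`, `remove`, and the listener are hypothetical; the real provider works with its own cluster types and selection logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical illustration; not the MultiClusterPooledConnectionProvider code.
class NotifyOutsideLockSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> clusters;
    private final Consumer<String> switchListener;
    private String active;

    NotifyOutsideLockSketch(List<String> clusterNames, Consumer<String> switchListener) {
        this.clusters = new ArrayList<>(clusterNames);
        this.switchListener = switchListener;
        this.active = clusters.isEmpty() ? null : clusters.get(0);
    }

    /** Removes a cluster; if it was active, switches and notifies after unlocking. */
    void remove(String cluster) {
        String switchedTo = null;
        lock.lock();
        try {
            clusters.remove(cluster);
            if (cluster.equals(active)) {
                active = clusters.isEmpty() ? null : clusters.get(0);
                switchedTo = active; // only collect the notification data here
            }
        } finally {
            lock.unlock();
        }
        // The listener runs without the lock, so it cannot deadlock against it.
        if (switchedTo != null) {
            switchListener.accept(switchedTo);
        }
    }

    boolean isLockHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }

    String activeCluster() {
        lock.lock();
        try {
            return active;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        NotifyOutsideLockSketch[] ref = new NotifyOutsideLockSketch[1];
        ref[0] = new NotifyOutsideLockSketch(Arrays.asList("east", "west"),
            name -> System.out.println("switched to " + name
                + ", lock held: " + ref[0].isLockHeldByCurrentThread()));
        ref[0].remove("east");
    }
}
```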

📊 Configuration Examples

Basic Failover with Retry

MultiClusterClientConfig config = MultiClusterClientConfig.builder(clusterConfigs)
    .circuitBreakerFailureRateThreshold(50.0f)
    .maxNumFailoverAttempts(10)                    // NEW: Max 10 failover attempts
    .delayInBetweenFailoverAttempts(12000)         // NEW: 12 second delay between attempts
    .build();

Health Check with Probing Policy

HealthCheckStrategy.Config healthConfig = HealthCheckStrategy.Config.builder()
    .interval(5000)                                 // Check every 5 seconds
    .timeout(2000)                                  // 2 second timeout per probe
    .numProbes(5)                                   // NEW: Execute 5 probes
    .policy(ProbingPolicy.BuiltIn.MAJORITY)        // NEW: Require majority success
    .delayInBetweenProbes(200)                     // NEW: 200ms between probes
    .build();

LagAwareStrategy with SSL

SslOptions sslOptions = SslOptions.builder()
    .truststore(truststorePath)
    .truststorePassword(password)
    .build();

LagAwareStrategy.Config lagConfig = LagAwareStrategy.Config.builder(restEndpoint, credentials)
    .sslOptions(sslOptions)                        // NEW: SSL for REST API
    .interval(5000)
    .timeout(3000)
    .numProbes(3)                                  // NEW: Probe-based health checks
    .policy(ProbingPolicy.BuiltIn.ALL_SUCCESS)    // NEW: All probes must succeed
    .build();

🔍 Technical Details

Failover Freeze Mechanism

The failover freeze prevents rapid repeated failover attempts:

  1. The first failover attempt sets a freeze-until timestamp
  2. Subsequent attempts within the freeze period reuse the same attempt count
  3. After freeze period expires, new attempt increments counter
  4. Counter resets to 0 on successful failover
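The four steps above can be sketched with an AtomicLong for the freeze timestamp and an atomic counter for attempts. This is a simplified model under stated assumptions, not the provider's code: the class and method names are hypothetical, and the clock is passed in as a parameter to keep the example deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the freeze steps; not the Jedis implementation.
class FailoverFreezeSketch {
    private final AtomicLong freezeUntilMillis = new AtomicLong(0);
    private final AtomicInteger attempts = new AtomicInteger(0);
    private final long delayMillis;

    FailoverFreezeSketch(long delayInBetweenFailoverAttemptsMillis) {
        this.delayMillis = delayInBetweenFailoverAttemptsMillis;
    }

    /** Registers a failover attempt at the given time and returns the attempt count. */
    int onFailoverAttempt(long nowMillis) {
        long until = freezeUntilMillis.get();
        // Outside the freeze window: one thread wins the CAS, starts a new
        // window, and increments the counter (steps 1 and 3).
        if (nowMillis >= until && freezeUntilMillis.compareAndSet(until, nowMillis + delayMillis)) {
            return attempts.incrementAndGet();
        }
        // Inside the window (or lost the CAS): reuse the current count (step 2).
        return attempts.get();
    }

    /** Resets the counter and the freeze window after a successful failover (step 4). */
    void onFailoverSuccess() {
        attempts.set(0);
        freezeUntilMillis.set(0);
    }

    int attemptCount() {
        return attempts.get();
    }

    public static void main(String[] args) {
        FailoverFreezeSketch freeze = new FailoverFreezeSketch(12000);
        System.out.println(freeze.onFailoverAttempt(1000));  // first attempt
        System.out.println(freeze.onFailoverAttempt(5000));  // still frozen
        System.out.println(freeze.onFailoverAttempt(13000)); // freeze expired
        freeze.onFailoverSuccess();
        System.out.println(freeze.attemptCount());
    }
}
```

A caller would compare the returned count against maxNumFailoverAttempts to decide between the temporary and the permanent exception; that comparison is left out here because its exact boundary is not stated in this description.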

Health Check Probing

The new probing system provides more flexibility:

  1. Execute N probes with configurable delays
  2. Each probe result (success/failure) recorded in context
  3. Policy evaluates context after each probe
  4. Early termination when policy determines outcome
  5. Final result based on policy decision
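A compact model of this loop, including early termination, might look like the following. It is a sketch under assumptions rather than the HealthCheck implementation: the enum reuses the built-in policy names from this PR, but the `evaluate` signature, the `Decision` and `Result` types, and the reading of MAJORITY as "more than half" are all hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of probe evaluation; not the ProbingPolicy API.
class ProbingSketch {

    enum Decision { CONTINUE, HEALTHY, UNHEALTHY }

    enum Policy {
        ALL_SUCCESS, MAJORITY, AT_LEAST_ONE;

        /** Decides as soon as the outcome can no longer change. */
        Decision evaluate(int numProbes, int successes, int failures) {
            int required;
            switch (this) {
                case ALL_SUCCESS: required = numProbes; break;
                case MAJORITY: required = numProbes / 2 + 1; break; // assumed: strictly more than half
                default: required = 1;
            }
            int remaining = numProbes - successes - failures;
            if (successes >= required) return Decision.HEALTHY;
            if (successes + remaining < required) return Decision.UNHEALTHY;
            return Decision.CONTINUE;
        }
    }

    static final class Result {
        final boolean healthy;
        final int probesExecuted;

        Result(boolean healthy, int probesExecuted) {
            this.healthy = healthy;
            this.probesExecuted = probesExecuted;
        }
    }

    /** Runs up to numProbes probes, pausing between them, and stops early when decided. */
    static Result run(Policy policy, int numProbes, long delayMillis, BooleanSupplier probe) {
        int successes = 0;
        int failures = 0;
        for (int i = 0; i < numProbes; i++) {
            if (i > 0 && delayMillis > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return new Result(false, successes + failures); // interruption stops probing
                }
            }
            if (probe.getAsBoolean()) successes++; else failures++;
            Decision decision = policy.evaluate(numProbes, successes, failures);
            if (decision != Decision.CONTINUE) {
                return new Result(decision == Decision.HEALTHY, successes + failures);
            }
        }
        return new Result(false, successes + failures); // only reached when numProbes <= 0
    }

    public static void main(String[] args) {
        Result result = run(Policy.MAJORITY, 5, 100, () -> true);
        System.out.println("healthy=" + result.healthy + ", probes=" + result.probesExecuted);
    }
}
```

With five probes and MAJORITY, three successes settle the outcome, so the last two probes and their delays are skipped.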

Thread Safety Improvements

  • activeClusterChangeLock renamed from activeClusterIndexLock for clarity
  • All activeCluster assignments now within lock (except initialization)
  • Notification callbacks moved outside locks to prevent deadlocks
  • Atomic operations for failover freeze and attempt counting

⚠️ Breaking Changes

API Changes

  1. HealthCheckStrategy interface:

    • Removed: int minConsecutiveSuccessCount()
    • Added: int getNumProbes(), ProbingPolicy getPolicy(), int getDelayInBetweenProbes()
  2. HealthCheckStrategy.Config.Builder:

    • Removed: minConsecutiveSuccessCount(int)
    • Added: numProbes(int), policy(ProbingPolicy), delayInBetweenProbes(int)
  3. Package relocation:

    • MultiClusterPooledConnectionProvider moved from redis.clients.jedis.providers to redis.clients.jedis.mcf

Migration Guide

// OLD: Consecutive success count
HealthCheckStrategy.Config.builder()
    .minConsecutiveSuccessCount(3)
    .build();

// NEW: Probe-based with policy
HealthCheckStrategy.Config.builder()
    .numProbes(3)
    .policy(ProbingPolicy.BuiltIn.ALL_SUCCESS)
    .delayInBetweenProbes(100)
    .build();

🧪 Testing

All changes validated with:

  • ✅ Unit tests for new probing policies
  • ✅ Integration tests for failover retry mechanism
  • ✅ Connection leak tests
  • ✅ Race condition tests for initialization
  • ✅ SSL configuration tests for LagAwareStrategy
  • ✅ Multi-threaded failover scenario tests

📝 Additional Notes

  • Default values maintain backward-compatible behavior for most use cases
  • Probing policy ALL_SUCCESS with numProbes=3 is equivalent to the old minConsecutiveSuccessCount=3
  • New exception types provide better error handling and retry logic
  • SSL support enables secure REST API communication with Redis Enterprise
  • Shared worker thread pool reduces resource consumption for health checks

atakavci and others added 9 commits September 3, 2025 14:03
…aiters()' in 'TrackingConnectionPool' (#4270)

- remove the check for number of waiters in TrackingConnectionPool
…cuitBreakerFailoverBase.clusterFailover' (#4275)

* - replace CircuitBreaker with Cluster for CircuitBreakerFailoverBase.clusterFailover
- improve thread safety with provider initialization

* - formatting
* - minor optimizations on fail fast

* -  volatile failfast
* - replace minConsecutiveSuccessCount with numberOfRetries
- add retries into healthCheckImpl
- apply changes to strategy implementations config classes
- fix unit tests

* - fix typo

* - fix failing tests

* - add tests for retry logic

* - formatting

* - format

* - revisit numRetries for healthCheck ,replace with numProbes and implement built in policies
- new types probecontext, ProbePolicy, HealthProbeContext
- add delayer executor pool to healthCheckImpl
-  adjustments on  worker pool of healthCheckImpl for shared use of workers

* - format

* - expand comment with example case

* - drop pooled executor for delays

* - polish

* - fix tests

* - formatting

* - checking failing tests

* - fix test

* - fix flaky tests

* - fix flaky test

* - add tests for builtin probing policies

* - fix flaky test
* - move failover provider to mcf

* - make iterateActiveCluster package private
#4291)

* User-provided ssl config for lag-aware health check

* ssl scenario test for lag-aware healthcheck

* format

* format

* address review comments

  - use getters instead of fields
* - implement max failover attempt
- add tests

* - fix user receive the intended exception

* -clean+format

* - java doc for exceptions

* format

* - more tests on exception types in max failover attempts mechanism

* format

* fix failing timing in test

* disable health checks

* rename to switchToHealthyCluster

* format
@atakavci atakavci requested review from uglide and ggivo September 29, 2025 21:56
@atakavci atakavci self-assigned this Sep 29, 2025
@atakavci atakavci added breakingchange Pull request that has breaking changes. Must include the breaking behavior in release notes. feature labels Sep 29, 2025

github-actions bot commented Sep 29, 2025

Test Results

   273 files  + 1    273 suites  +1   11m 4s ⏱️ +9s
10 001 tests +17  9 956 ✅ +428  45 💤  - 411  0 ❌ ±0 
 2 639 runs  +17  2 639 ✅ + 17   0 💤 ±  0  0 ❌ ±0 

Results for commit 1ce1746. ± Comparison against base commit e838e48.

This pull request removes 13 and adds 30 tests. Note that renamed tests count towards both.
redis.clients.jedis.mcf.LagAwareStrategyUnitTest ‑ unhealthy_and_cache_reset_on_exception_then_recovers_next_time
redis.clients.jedis.mcf.LagAwareStrategyUnitTest ‑ unhealthy_when_no_bdb_returned
redis.clients.jedis.mcf.LagAwareStrategyUnitTest ‑ unhealthy_when_no_matching_host_found
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testCanIterateOnceMore
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testCircuitBreakerForcedTransitions
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testConnectionPoolConfigApplied
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testHealthChecksStopAfterProviderClose
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testIterateActiveCluster
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testIterateActiveClusterOutOfRange
redis.clients.jedis.providers.MultiClusterPooledConnectionProviderTest ‑ testRunClusterFailoverPostProcessor
…
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_AllSuccess_EarlyFail_Integration
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_ExhaustProbesAndStayUnhealthy
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_Majority_EarlySuccess_Integration
redis.clients.jedis.mcf.HealthCheckIntegrationTest ‑ testProbingLogic_RealHealthCheckWithProbes
redis.clients.jedis.mcf.HealthCheckTest ‑ testPolicy_AllSuccess_StopsOnFirstFailure
redis.clients.jedis.mcf.HealthCheckTest ‑ testPolicy_Majority_EarlyFailStopsAtTwo
redis.clients.jedis.mcf.HealthCheckTest ‑ testPolicy_Majority_EarlySuccessStopsAtThree
redis.clients.jedis.mcf.HealthCheckTest ‑ testRetryLogic_ExhaustAllProbesAndFail
redis.clients.jedis.mcf.HealthCheckTest ‑ testRetryLogic_FailThenSucceedOnRetry
redis.clients.jedis.mcf.HealthCheckTest ‑ testRetryLogic_InterruptionStopsProbes
…

♻️ This comment has been updated with latest results.

@ggivo ggivo merged commit b899f3f into master Sep 30, 2025
19 checks passed
@atakavci atakavci added this to the 7.0.0 milestone Oct 7, 2025