Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 103 additions & 0 deletions content/develop/clients/failover.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
---
categories:
- docs
- develop
- stack
- oss
- rs
- rc
- oss
- kubernetes
- clients
description: Improve reliability using the failover/failback features of client libraries.
linkTitle: Failover/failback
title: Failover and failback
topics:
- failover
- failback
- resilience
- health checks
relatedPages:
- /develop/clients/jedis/failover
- /develop/clients/redis-py/failover
scope: overview
weight: 50
---

Some Redis client libraries support
[failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. Use this page
to get a general overview of the concepts and then see the documentation for
your client library to learn how to configure it for failover and failback:

- [Jedis]({{< relref "/develop/clients/jedis/failover" >}})
- [redis-py]({{< relref "/develop/clients/redis-py/failover" >}}) (preview)

## Concepts

You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
or independent Redis servers that are all suitable to serve your app.
Typically, you would prefer to use some database endpoints over others for a particular
instance of your app (perhaps the ones that are closest geographically to the app server
to reduce network latency). However, if the best endpoint is not available due
to a failure, it is generally better to switch to another, suboptimal endpoint
than to let the app fail completely.

*Failover* is the technique of actively checking for connection failures or
unacceptably slow connections and automatically switching to the best available endpoint
when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:

{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}

The complementary technique of *failback* then involves periodically checking the health
of all endpoints that have failed. If any endpoints recover, the failback mechanism
automatically switches the connection to the one with the highest priority.
This could potentially be repeated until the optimal endpoint is available again.

{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}

### Detecting connection problems

Redis clients use a [circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern) to detect connection problems.

The circuit breaker is a software component that tracks the sequence of recent
Redis connection attempts and commands, recording which ones have succeeded and
which have failed.
(Note that many command failures are caused by transient errors such as timeouts,
so before recording a failure, the first response should usually be just to retry
the command a few times.)

The status of the attempted command calls is kept in a "sliding window", which
is simply a buffer where the least recent item is dropped as each new
one is added. The buffer can be configured to have a fixed number of failures and/or a failure ratio (specified as a percentage), both based on a time window.

{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}

When the number of failures in the window exceeds a configured
threshold, the circuit breaker declares the server to be unhealthy and triggers
a failover.

### Selecting a failover target

Since you may have multiple Redis servers available to fail over to, the client
lets you configure a list of endpoints to try, ordered by priority or
"weight". When a failover is triggered, the client selects the highest-weighted
endpoint that is still healthy and uses it for the temporary connection.

### Health checks

Given that the original endpoint had some geographical or other advantage
over the failover target, you will generally want to fail back to it as soon
as it recovers. In the meantime, another server might recover that is
still better than the current failover target, so it might be worth
failing back to that server even if it is not optimal.

Clients periodically run a "health check" on each server to see if it has recovered.
The health check can be as simple as sending a Redis
[`PING`]({{< relref "/commands/ping" >}}) or
[ECHO]({{< relref "/commands/echo" >}}) command and ensuring that it gives the
expected response.

You can also configure the client to run health checks on the current target
server during periods of inactivity, even if no failover has occurred. This can
help to detect problems even if your app is not actively using the server.
157 changes: 78 additions & 79 deletions content/develop/clients/jedis/failover.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,85 +12,29 @@ categories:
description: Improve reliability using the failover/failback features of Jedis.
linkTitle: Failover/failback
title: Failover and failback
topics:
- failover
- failback
- resilience
- health checks
- retries
relatedPages:
- /develop/clients/failover
scope: [client-specific, implementation]
weight: 50
---

Jedis supports [failover and failback](https://en.wikipedia.org/wiki/Failover)
to improve the availability of connections to Redis databases. This page explains
the concepts and describes how to configure Jedis for failover and failback.
how to configure Jedis for failover and failback. For an overview of the concepts,
see the main [Failover/failback]({{< relref "/develop/clients/failover" >}}) page.

## Concepts

You may have several [Active-Active databases]({{< relref "/operate/rs/databases/active-active" >}})
or independent Redis servers that are all suitable to serve your app.
Typically, you would prefer to use some database endpoints over others for a particular
instance of your app (perhaps the ones that are closest geographically to the app server
to reduce network latency). However, if the best endpoint is not available due
to a failure, it is generally better to switch to another, suboptimal endpoint
than to let the app fail completely.

*Failover* is the technique of actively checking for connection failures or
unacceptably slow connections and automatically switching to the best available endpoint
when they occur. This requires you to specify a list of endpoints to try, ordered by priority. The diagram below shows this process:

{{< image filename="images/failover/failover-client-reconnect.svg" alt="Failover and client reconnection" >}}

The complementary technique of *failback* then involves periodically checking the health
of all endpoints that have failed. If any endpoints recover, the failback mechanism
automatically switches the connection to the one with the highest priority.
This could potentially be repeated until the optimal endpoint is available again.

{{< image filename="images/failover/failover-client-failback.svg" alt="Failback: client switches back to original server" width="75%" >}}

### Detecting connection problems
## Failover configuration

Jedis uses the [resilience4j](https://resilience4j.readme.io/docs/getting-started)
library to detect connection problems using a
[circuit breaker design pattern](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern).

The circuit breaker is a software component that tracks the sequence of recent
Redis connection attempts and commands, recording which ones have succeeded and
which have failed.
(Note that many command failures are caused by transient errors such as timeouts,
so before recording a failure, the first response should usually be just to retry
the command a few times.)

The status of the attempted command calls is kept in a "sliding window", which
is simply a buffer where the least recent item is dropped as each new
one is added. The buffer can be configured to have a fixed number of failures and/or a failure ratio (specified as a percentage), both based on a time window.

{{< image filename="images/failover/failover-sliding-window.svg" alt="Sliding window of recent connection attempts" >}}

When the number of failures in the window exceeds a configured
threshold, the circuit breaker declares the server to be unhealthy and triggers
a failover.

### Selecting a failover target

Since you may have multiple Redis servers available to fail over to, Jedis
lets you configure a list of endpoints to try, ordered by priority or
"weight". When a failover is triggered, Jedis selects the highest-weighted
endpoint that is still healthy and uses it for the temporary connection.

### Health checks

Given that the original endpoint had some geographical or other advantage
over the failover target, you will generally want to fail back to it as soon
as it recovers. In the meantime, another server might recover that is
still better than the current failover target, so it might be worth
failing back to that server even if it is not optimal.

Jedis periodically runs a "health check" on each server to see if it has recovered.
The health check can be as simple as
sending a Redis [`PING`]({{< relref "/commands/ping" >}}) command and ensuring
that it gives the expected response.

You can also configure Jedis to run health checks on the current target
server during periods of inactivity, even if no failover has occurred. This can
help to detect problems even if your app is not actively using the server.

## Failover configuration

The example below shows a simple case with a list of two servers,
`redis-east` and `redis-west`, where `redis-east` is the preferred
target. If `redis-east` fails, Jedis should fail over to
Expand Down Expand Up @@ -150,7 +94,9 @@ poolConfig.setTestWhileIdle(true);
poolConfig.setTimeBetweenEvictionRuns(Duration.ofSeconds(1));
```

Supply the weighted list of endpoints using the `MultiDbConfig` builder.
Supply the weighted list of endpoints using the `MultiDbConfig` builder
(see [Selecting a failover target]({{< relref "/develop/clients/failover#selecting-a-failover-target" >}}) for a full description of how
the weighted list is used).
Use the `weight` option to order the endpoints, with the highest
weight being tried first.

Expand Down Expand Up @@ -203,7 +149,8 @@ but will also handle the connection management and failover transparently.
### Circuit breaker configuration

The `MultiDbConfig.CircuitBreakerConfig` builder lets you pass several options to configure
the circuit breaker:
the circuit breaker (see [Detecting connection problems]({{< relref "/develop/clients/failover#detecting-connection-problems" >}}) for more information on how the
circuit breaker works):

| Builder method | Default value | Description|
| --- | --- | --- |
Expand Down Expand Up @@ -275,9 +222,27 @@ MultiDbClient client = MultiDbClient.builder()

## Health check configuration

There are several strategies available for health checks that you can configure using the
`MultiDbConfig` builder. The sections below explain these strategies
in more detail.
Each health check consists of one or more separate "probes", each of which is a simple
test (such as a [`PING`]({{< relref "/commands/ping" >}}) command) to determine if the database is available. The results of the separate probes are combined
using a configurable policy to determine if the database is healthy.

There are several strategies available for health checks that you can deploy using the
`MultiDbConfig` builder. Each strategy is a class that implements the `HealthCheckStrategy`
interface. Use the constructor of a `HealthCheckStrategy` implementation to pass
a `HealthCheckStrategy.Config` object to configure the health check behavior.
The methods of the base `HealthCheckStrategy.Config` builder are shown below.
Note that some strategies (including your own custom strategies) may use a
subclass of `HealthCheckStrategy.Config` to provide extra options.

| Builder method | Default value | Description|
| --- | --- | --- |
| `interval()` | `1000` | Interval in milliseconds between health checks. |
| `timeout()` | `1000` | Timeout in milliseconds for health check requests. |
| `numProbes()` | `3` | Number of probes to perform during each health check. |
| `delayInBetweenProbes()` | `100` | Delay in milliseconds between probes during a health check. |
| `policy()` | `ProbingPolicy.BuiltIn.ALL_SUCCESS` | Policy to determine if the database is healthy based on the probe results. The options are `ALL_SUCCESS` (all probes must succeed), `ANY_SUCCESS` (at least one probe must succeed), and `MAJORITY_SUCCESS` (majority of probes must succeed). |

The sections below explain the available strategies in more detail.

### `PingStrategy` (default)

Expand All @@ -287,6 +252,23 @@ and checks that it gives the expected response. Any unexpected response
or exception indicates an unhealthy server. Although `PingStrategy` is
very simple, it is a good basic approach for most Redis deployments.

Although `PingStrategy` is the default, you can still activate it
explicitly using the `healthCheckStrategy()` method of the `MultiDbConfig.DatabaseConfig`
builder. Use this approach if you want to configure the default
`PingStrategy` with custom options, as shown in the example below.

```java
MultiDbConfig.DatabaseConfig dbConfig =
MultiDbConfig.DatabaseConfig.builder(hostAndPort, clientConfig)
.healthCheckStrategy(new PingStrategy(PingStrategy.Config.builder()
.interval(5000) // Check every 5 seconds
.timeout(3000) // 3 second timeout
.numProbes(5) // 5 probes per check
.delayInBetweenProbes(100) // 100ms delay between probes
.build()))
.build();
```

### `LagAwareStrategy` (preview)

`LagAwareStrategy` (currently in preview) is designed specifically for
Expand Down Expand Up @@ -320,13 +302,12 @@ MultiDbConfig.DatabaseConfig dbConfig =
.build();
```

The `LagAwareStrategy.Config` builder has the following options:
The `LagAwareStrategy.Config` builder has the following options in
addition to the standard options provided by `HealthCheckStrategy.Config`:

| Builder method | Default value | Description|
| --- | --- | --- |
| `sslOptions()` | `null` | Standard SSL options for connecting to the REST API. |
| `interval()` | `5000` | Interval in milliseconds between health checks. |
| `timeout()` | `3000` | Timeout in milliseconds for health check requests. |
| `extendedCheckEnabled()` | `false` | Enable extended lag checking (this includes lag validation in addition to the standard datapath validation). |
| `availabilityLagTolerance()` | `100` | Maximum lag tolerance in milliseconds for extended lag checking. |

Expand Down Expand Up @@ -366,16 +347,34 @@ MultiDbConfig.DatabaseConfig dbConfig = MultiDbConfig.DatabaseConfig.builder(eas
.build();
```

## Manual failback
## Managing databases at runtime

Although you will typically configure all databases during the initial connection, you can also modify the configuration at runtime. The example below shows how to add and remove database endpoints.

```java
HostAndPort other = new HostAndPort("redis-south.example.com", 14000);

// Create the database config as you would for the initial connection.
client.addDatabase(DatabaseConfig.builder(other, config)
// ...
.weight(0.5f)
.build()
);

// Remove the database from the failover set.
client.removeDatabase(other);
```

### Manual failback

By default, the failback mechanism runs health checks on all servers in the
weighted list and selects the highest-weighted server that is
healthy. However, you can also use the `setActiveDatabase()` method of
`MultiDbClient` to select which database to use manually:

```java
// The `setActiveDatabase()` method receives the `HostAndPort` of the
// cluster to switch to.
// The `setActiveDatabase()` method receives the `Endpoint` (eg,`HostAndPort`)
// of the cluster to switch to.
client.setActiveDatabase(west);
```

Expand Down
Loading