From 51905bc58a869a961595b7acfe1a71d7ff7ea835 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 14 Nov 2025 17:23:13 -0600 Subject: [PATCH 1/9] DOC-5860 RS: Disaster recovery strategies for Active-Active databases --- .../rs/databases/active-active/_index.md | 4 +- .../develop/app-failover-active-active.md | 6 + .../active-active/disaster-recovery.md | 226 ++++++++++++++++++ 3 files changed, 235 insertions(+), 1 deletion(-) create mode 100644 content/operate/rs/databases/active-active/disaster-recovery.md diff --git a/content/operate/rs/databases/active-active/_index.md b/content/operate/rs/databases/active-active/_index.md index d046f160df..0618dfbc92 100644 --- a/content/operate/rs/databases/active-active/_index.md +++ b/content/operate/rs/databases/active-active/_index.md @@ -59,4 +59,6 @@ Other Redis Enterprise Software features can also be used to enhance the perform - [Plan your Active-Active deployment]({{< relref "/operate/rs/databases/active-active/planning.md" >}}) - [Get started with Active-Active]({{< relref "/operate/rs/databases/active-active/get-started.md" >}}) -- [Create an Active-Active database]({{< relref "/operate/rs/databases/active-active/create.md" >}}) \ No newline at end of file +- [Create an Active-Active database]({{< relref "/operate/rs/databases/active-active/create.md" >}}) +- [Develop applications with Active-Active databases]({{}}) +- Review [disaster recovery strategies for Active-Active databases]({{< relref "/operate/rs/databases/active-active/disaster-recovery" >}}) \ No newline at end of file diff --git a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md index 17fddbf6c2..8f80287a87 100644 --- a/content/operate/rs/databases/active-active/develop/app-failover-active-active.md +++ b/content/operate/rs/databases/active-active/develop/app-failover-active-active.md @@ -15,6 +15,10 @@ An application deployed with 
an Active-Active database connects to a replica of If that replica is not available, the application can failover to a remote replica, and failback again if necessary. In this article we explain how this process works. +{{}} +For other disaster recovery strategies, including network-based, proxy-based, and client library approaches, see [Active-Active disaster recovery strategies]({{}}). +{{}} + Active-Active connection failover can improve data availability, but can negatively impact data consistency. Active-Active replication, like Redis replication, is asynchronous. An application that fails over to another replica can miss write operations. @@ -28,6 +32,8 @@ Your application can detect two types of failure: 1. **Local failures** - The local replica is down or otherwise unavailable 1. **Replication failures** - The local replica is available but fails to replicate to or from remote replicas +You can also use [database availability API requests]({{}}) to determine if a database replica is available to handle read and write operations. Lag-aware database availability requests consider CRDT replication lag as a health check criterion to prevent reading stale data during failback scenarios. + ### Local Failures Local failure is detected when the application is unable to connect to the database endpoint for any reason. Reasons for a local failure can include: multiple node failures, configuration errors, connection refused, connection timed out, unexpected protocol level errors. 
diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md new file mode 100644 index 0000000000..d71e51de56 --- /dev/null +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -0,0 +1,226 @@ +--- +Title: Disaster recovery strategies for Active-Active databases +alwaysopen: false +categories: +- docs +- operate +- rs +- rc +description: Disaster recovery strategies for Active-Active databases using network, proxy, client library, and application-based approaches. +linkTitle: Disaster recovery +weight: 50 +--- + +An application deployed with an Active-Active database connects to a database member that is geographically nearby. If that database member becomes unavailable, the application can fail over to a secondary Active-Active database member, and fail back to the original database member again if it recovers. + +However, Active-Active Redis databases do not have a built-in failover or failback mechanism for application connections. 
To implement failover and failback, you can use one of the following disaster recovery strategies: + +- [Network-based](#network-based-disaster-recovery): Global traffic managers and load balancers for routing + +- [Proxy-based](#proxy-based-disaster-recovery): Software proxies handle detection and routing logic + +- [Client library-based](#client-library-based-disaster-recovery): Database client libraries with built-in failover logic + +- [Application-based](#application-based-disaster-recovery): Custom application-level monitoring and connectivity management + +## Detect failures with health checks + +You can use the following health checks to help detect Active-Active database failures and determine when to failover to a secondary Active-Active member or failback to the primary member: + +- [`PING`]({{}}) or [`ECHO`]({{}}) + +- Connection timeouts or Redis errors + +- [Lag-aware database availability requests]({{}}) + +- Probing the keyspace with [`SET`]({{}}) or [`GET`]({{}}) commands to cover all available shards + +- A custom health check + +## Considerations for disaster recovery + +When implementing a disaster recovery strategy for an Active-Active database, consider the following: + +- Is the Active-Active database an on-premise, cloud, multi-cloud, or hybrid-cloud deployment? + +- Number of regions and availability zones + +- Application server redundancy and deployment locations + +- Acceptable values for Recovery Point Objective (RPO) and Recovery Time Objective (RTO) + +- Latency and throughput requirements + +- Number of application errors that can be tolerated during a failure + +- Tolerance for reading stale but eventually consistent data during a failover scenario + +- Is concurrent access, in which different application servers can read from or write to different Active-Active database members, acceptable? + +- Are there any regulatory or policy requirements for disaster recovery? 
+ +- Does the application connect to the Active-Active database using a Redis client library or through a development framework or ecosystem? + +- Does the Active-Active database use DNS, the [OSS Cluster API]({{}}), or the [discovery service]({{}})? + +- Is rate-limiting control needed? + +- Can you modify the existing codebase or introduce new components such as load balancers or proxies? + +## Network-based disaster recovery + +Network-based solutions use DNS or load balancing to route traffic across regions without application changes. + +Advantages: + +- Because routing happens at the network level: + + - No application code changes are needed. + + - Development frameworks are agnostic and can connect to a single Active-Active database member's endpoint. + +### Cross-region availability + +For cross-region availability, you can use a global traffic manager or a global load balancer. + +Advantages: + +- If DNS routing is available at the application level, no additional load balancer is required between the application and the data tier to resolve the Active-Active database member’s FQDN, reducing the latency. + +- Protects against data center failure since failure in one region should not affect services running in another region. + +#### Global traffic manager + +A global traffic manager acts as an intelligent DNS server that directs clients to healthy endpoints based on distance, latency, or availability. You should configure the traffic manager to route to the local region first and fail over to other regions if an issue occurs. 
+ +Advantages: + +- High availability + +- Latency optimization + +- Seamless disaster recovery + +Considerations: + +- DNS propagation delays affect failover time + +- DNS caches can impact proper functioning + +- Limited custom health check support + +- May route traffic during CRDT synchronization, causing stale data reads + +#### Global load balancer + +For real-time traffic control and more advanced routing logic for cross-region failover and failback, you can use a global load balancer. However, this solution can have higher latency than a global traffic manager. + +### Cross-zone availability + +If your deployment does not require cross-region availability, you can use a regional load balancer to route requests to a healthy Active-Active database member in a different availability zone within the same region. + +## Proxy-based disaster recovery + +If you add a lightweight proxy software component between the clients and the Active-Active database, applications can dynamically route requests to the optimal endpoint. + +Advantages: + +- Proxies provide out-of-the-box proactive and reactive health check methods, such as polling target health periodically using either a TCP connection or an HTTP request, or monitoring live operations for errors + +- Proxies can be configured to easily run the desired A-A health check policy, such as the lag-aware database availability. + +- If an Active-Active database member fails, a proxy can automatically detect the issue and redirect traffic to a healthy Active-Active database member without requiring DNS propagation delays or client disconnections. This enables fast, controlled failover and minimizes downtime. + +Considerations: + +- If you do not use DNS to resolve the Active-Active database members' FQDNs: + + - The proxies must have static IPs. + + - Adding a new node to the cluster requires that the proxy be configured with the new endpoint. 
+ + - A config syncer component is required to discover topology changes and reconfigure the proxy. + +- Proxies introduce latency. + +- Proxy failures can disconnect clients and cause disruption. + +### Avoid concurrent access across replicas + +If concurrent access across replicas must be avoided in every scenario, you can use a centralized proxy with a standby proxy instance for high availability. + +Advantages: + +- Concurrent access across replicas is not possible + +- Failover and failback are simultaneous regardless of the Active-Active health check policy + +Considerations: + +- Although the proxy can be monitored with a watchdog and restarted in case of failure, this setup does not grant high availability for the proxy. + +- Limited scalability + +### Co-locate to reduce latency and improve scalability + +To reduce latency and improve scalability, you can use a proxy co-located in the application server. + +Advantages: + +- Reduced latency + +- Better scalability + +Considerations: + +- Failover and failback might not be simultaneous depending on the Active-Active health check policy. + +### Pool proxies for scalability + +You can use a pool of active proxies to scale the routing layer. Application servers can balance new connections to the pool of proxies using a round-robin distribution algorithm, such as DNS-based round robin. + +Advantages: + +- High availability without complex monitoring and failover solutions + +- Flexible scalability of the routing layer + +Considerations: + +- Concurrent access across replicas is possible, but can be mitigated using database availability API requests. + +## Client library-based disaster recovery + +Some Redis client libraries support geographic failover and failback. These client libraries monitor all Active-Active database members and instantiate connections for all endpoints in advance to allow faster failover and failback. 
+ +Advantages: + +- No additional hardware or software components required + +- No high availability considerations + +- No scalability concerns + +- Tighter control over connectivity such as timeouts, connection retries, and dynamic reconfiguration + +- OSS Cluster API support + +- Low latency + +Considerations: + +- Requires code changes for failover and failback logic + +- Concurrent access across replicas is possible, but can be mitigated using the distributed health status provided by the database availability API requests + +- When a development framework uses Redis transparently, failover and failback might not be easy to configure + +For additional information, see the following client library guides for failover and failback: + +- [Jedis (Java)]({{}}) + +## Application-based disaster recovery + +For complete control over failover and failback, you can implement disaster recovery mechanisms directly in the application server. + +For more information, see [Application failover with Active-Active databases]({{}}). From ad03b8d343990541b3c25e430b4bf9bcc8fb3786 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Tue, 18 Nov 2025 16:04:26 -0600 Subject: [PATCH 2/9] DOC-5860 Style fixes, added definitions and links --- .../active-active/disaster-recovery.md | 86 +++++++++---------- 1 file changed, 43 insertions(+), 43 deletions(-) diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index d71e51de56..ebbcb3b91f 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -13,29 +13,29 @@ weight: 50 An application deployed with an Active-Active database connects to a database member that is geographically nearby. 
If that database member becomes unavailable, the application can fail over to a secondary Active-Active database member, and fail back to the original database member again if it recovers. -However, Active-Active Redis databases do not have a built-in failover or failback mechanism for application connections. To implement failover and failback, you can use one of the following disaster recovery strategies: +However, Active-Active Redis databases do not have a built-in [failover](https://en.wikipedia.org/wiki/Failover) or failback mechanism for application connections. To implement failover and failback, you can use one of the following disaster recovery strategies: -- [Network-based](#network-based-disaster-recovery): Global traffic managers and load balancers for routing +- [Network-based](#network-based-disaster-recovery): Global traffic managers and load balancers for routing. -- [Proxy-based](#proxy-based-disaster-recovery): Software proxies handle detection and routing logic +- [Proxy-based](#proxy-based-disaster-recovery): Software proxies handle detection and routing logic. -- [Client library-based](#client-library-based-disaster-recovery): Database client libraries with built-in failover logic +- [Client library-based](#client-library-based-disaster-recovery): Database client libraries with built-in failover logic. -- [Application-based](#application-based-disaster-recovery): Custom application-level monitoring and connectivity management +- [Application-based](#application-based-disaster-recovery): Custom application-level monitoring and connectivity management. ## Detect failures with health checks You can use the following health checks to help detect Active-Active database failures and determine when to failover to a secondary Active-Active member or failback to the primary member: -- [`PING`]({{}}) or [`ECHO`]({{}}) +- [`PING`]({{}}) or [`ECHO`]({{}}). -- Connection timeouts or Redis errors +- Connection timeouts or Redis errors. 
-- [Lag-aware database availability requests]({{}}) +- [Lag-aware database availability requests]({{}}). -- Probing the keyspace with [`SET`]({{}}) or [`GET`]({{}}) commands to cover all available shards +- Probing the keyspace with [`SET`]({{}}) or [`GET`]({{}}) commands to cover all available shards. -- A custom health check +- A custom health check. ## Considerations for disaster recovery @@ -43,17 +43,17 @@ When implementing a disaster recovery strategy for an Active-Active database, co - Is the Active-Active database an on-premise, cloud, multi-cloud, or hybrid-cloud deployment? -- Number of regions and availability zones +- Number of regions and availability zones. -- Application server redundancy and deployment locations +- Application server redundancy and deployment locations. -- Acceptable values for Recovery Point Objective (RPO) and Recovery Time Objective (RTO) +- Acceptable values for the maximum amount of data that can be lost during a failure (Recovery Point Objective) and the maximum acceptable time to restore service after a failure (Recovery Time Objective). -- Latency and throughput requirements +- Latency and throughput requirements. -- Number of application errors that can be tolerated during a failure +- Number of application errors that can be tolerated during a failure. -- Tolerance for reading stale but eventually consistent data during a failover scenario +- Tolerance for reading stale but eventually consistent data during a failover scenario. - Is concurrent access, in which different application servers can read from or write to different Active-Active database members, acceptable? @@ -85,7 +85,7 @@ For cross-region availability, you can use a global traffic manager or a global Advantages: -- If DNS routing is available at the application level, no additional load balancer is required between the application and the data tier to resolve the Active-Active database member’s FQDN, reducing the latency. 
+- If DNS routing is available at the application level, no additional load balancer is required between the application and the data tier to resolve the Active-Active database member’s FQDN, reducing latency. - Protects against data center failure since failure in one region should not affect services running in another region. @@ -95,21 +95,21 @@ A global traffic manager acts as an intelligent DNS server that directs clients Advantages: -- High availability +- High availability. -- Latency optimization +- Latency optimization. -- Seamless disaster recovery +- Seamless disaster recovery. Considerations: -- DNS propagation delays affect failover time +- DNS propagation delays affect failover time. -- DNS caches can impact proper functioning +- DNS caches can impact proper functioning. -- Limited custom health check support +- Limited custom health check support. -- May route traffic during CRDT synchronization, causing stale data reads +- May route traffic during CRDT synchronization, causing stale data reads. #### Global load balancer @@ -125,7 +125,7 @@ If you add a lightweight proxy software component between the clients and the Ac Advantages: -- Proxies provide out-of-the-box proactive and reactive health check methods, such as polling target health periodically using either a TCP connection or an HTTP request, or monitoring live operations for errors +- Proxies provide out-of-the-box proactive and reactive health check methods, such as polling target health periodically using either a TCP connection or an HTTP request, or monitoring live operations for errors. - Proxies can be configured to easily run the desired A-A health check policy, such as the lag-aware database availability. @@ -141,9 +141,9 @@ Considerations: - A config syncer component is required to discover topology changes and reconfigure the proxy. -- Proxies introduce latency. +- Proxies introduce latency. -- Proxy failures can disconnect clients and cause disruption. 
+- Proxy failures can disconnect clients and cause disruptions. ### Avoid concurrent access across replicas @@ -151,15 +151,15 @@ If concurrent access across replicas must be avoided in every scenario, you can Advantages: -- Concurrent access across replicas is not possible +- Concurrent access across replicas is not possible. -- Failover and failback are simultaneous regardless of the Active-Active health check policy +- Failover and failback are simultaneous regardless of the Active-Active health check policy. Considerations: - Although the proxy can be monitored with a watchdog and restarted in case of failure, this setup does not grant high availability for the proxy. -- Limited scalability +- Limited scalability. ### Co-locate to reduce latency and improve scalability @@ -167,9 +167,9 @@ To reduce latency and improve scalability, you can use a proxy co-located in the Advantages: -- Reduced latency +- Reduced latency. -- Better scalability +- Better scalability. Considerations: @@ -181,9 +181,9 @@ You can use a pool of active proxies to scale the routing layer. Application ser Advantages: -- High availability without complex monitoring and failover solutions +- High availability without complex monitoring and failover solutions. -- Flexible scalability of the routing layer +- Flexible scalability of the routing layer. Considerations: @@ -195,25 +195,25 @@ Some Redis client libraries support geographic failover and failback. These clie Advantages: -- No additional hardware or software components required +- No additional hardware or software components required. -- No high availability considerations +- No high availability considerations. -- No scalability concerns +- No scalability concerns. -- Tighter control over connectivity such as timeouts, connection retries, and dynamic reconfiguration +- Tighter control over connectivity such as timeouts, connection retries, and dynamic reconfiguration. -- OSS Cluster API support +- OSS Cluster API support. 
-- Low latency +- Low latency. Considerations: -- Requires code changes for failover and failback logic +- Requires code changes for failover and failback logic. -- Concurrent access across replicas is possible, but can be mitigated using the distributed health status provided by the database availability API requests +- Concurrent access across replicas is possible, but can be mitigated using the distributed health status provided by the database availability API requests. -- When a development framework uses Redis transparently, failover and failback might not be easy to configure +- When a development framework uses Redis transparently, failover and failback might not be easy to configure. For additional information, see the following client library guides for failover and failback: From 678dcc1090dc8e2ced774cd62d8f3ddbd773e34c Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Tue, 18 Nov 2025 18:05:01 -0600 Subject: [PATCH 3/9] DOC-5860 Added A-A disaster recovery diagrams --- .../active-active/disaster-recovery.md | 20 +++++++++++++++++++ .../centralized-proxy.svg | 1 + .../client-library-connection-pool.svg | 1 + .../client-library.svg | 1 + .../global-load-balancer.svg | 1 + .../gtm-with-DNS.svg | 1 + .../gtm-with-load-balancer.svg | 1 + .../regional-load-balancer.svg | 1 + 8 files changed, 27 insertions(+) create mode 100644 static/images/active-active-disaster-recovery/centralized-proxy.svg create mode 100644 static/images/active-active-disaster-recovery/client-library-connection-pool.svg create mode 100644 static/images/active-active-disaster-recovery/client-library.svg create mode 100644 static/images/active-active-disaster-recovery/global-load-balancer.svg create mode 100644 static/images/active-active-disaster-recovery/gtm-with-DNS.svg create mode 100644 static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg create mode 100644 static/images/active-active-disaster-recovery/regional-load-balancer.svg diff --git 
a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index ebbcb3b91f..2ce8aedf2d 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -111,14 +111,26 @@ Considerations: - May route traffic during CRDT synchronization, causing stale data reads. +The following diagram shows how a global traffic manager with DNS resolution routes traffic and allows applications to connect directly to the nearest Active-Active database member: + +{{}} + +If the environment does not allow DNS resolution, you can use a load balancer to direct traffic to the cluster nodes: + +{{}} + #### Global load balancer For real-time traffic control and more advanced routing logic for cross-region failover and failback, you can use a global load balancer. However, this solution can have higher latency than a global traffic manager. +{{}} + ### Cross-zone availability If your deployment does not require cross-region availability, you can use a regional load balancer to route requests to a healthy Active-Active database member in a different availability zone within the same region. +{{}} + ## Proxy-based disaster recovery If you add a lightweight proxy software component between the clients and the Active-Active database, applications can dynamically route requests to the optimal endpoint. @@ -161,6 +173,8 @@ Considerations: - Limited scalability. +{{}} + ### Co-locate to reduce latency and improve scalability To reduce latency and improve scalability, you can use a proxy co-located in the application server. @@ -175,6 +189,8 @@ Considerations: - Failover and failback might not be simultaneous depending on the Active-Active health check policy. +{{}} + ### Pool proxies for scalability You can use a pool of active proxies to scale the routing layer. 
Application servers can balance new connections to the pool of proxies using a round-robin distribution algorithm, such as DNS-based round robin. @@ -189,6 +205,8 @@ Considerations: - Concurrent access across replicas is possible, but can be mitigated using database availability API requests. +{{}} + ## Client library-based disaster recovery Some Redis client libraries support geographic failover and failback. These client libraries monitor all Active-Active database members and instantiate connections for all endpoints in advance to allow faster failover and failback. @@ -215,6 +233,8 @@ Considerations: - When a development framework uses Redis transparently, failover and failback might not be easy to configure. +{{}} + For additional information, see the following client library guides for failover and failback: - [Jedis (Java)]({{}}) diff --git a/static/images/active-active-disaster-recovery/centralized-proxy.svg b/static/images/active-active-disaster-recovery/centralized-proxy.svg new file mode 100644 index 0000000000..899d4f1cff --- /dev/null +++ b/static/images/active-active-disaster-recovery/centralized-proxy.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/client-library-connection-pool.svg b/static/images/active-active-disaster-recovery/client-library-connection-pool.svg new file mode 100644 index 0000000000..b36728b754 --- /dev/null +++ b/static/images/active-active-disaster-recovery/client-library-connection-pool.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/client-library.svg b/static/images/active-active-disaster-recovery/client-library.svg new file mode 100644 index 0000000000..588ee7d8bd --- /dev/null +++ b/static/images/active-active-disaster-recovery/client-library.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/global-load-balancer.svg 
b/static/images/active-active-disaster-recovery/global-load-balancer.svg new file mode 100644 index 0000000000..2b931c85d2 --- /dev/null +++ b/static/images/active-active-disaster-recovery/global-load-balancer.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/gtm-with-DNS.svg b/static/images/active-active-disaster-recovery/gtm-with-DNS.svg new file mode 100644 index 0000000000..32f60181d3 --- /dev/null +++ b/static/images/active-active-disaster-recovery/gtm-with-DNS.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg b/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg new file mode 100644 index 0000000000..143b2223a7 --- /dev/null +++ b/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/regional-load-balancer.svg b/static/images/active-active-disaster-recovery/regional-load-balancer.svg new file mode 100644 index 0000000000..44bd34995f --- /dev/null +++ b/static/images/active-active-disaster-recovery/regional-load-balancer.svg @@ -0,0 +1 @@ + \ No newline at end of file From 14593f8c5f59681605f3a24864585798771e3d46 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Wed, 19 Nov 2025 16:39:59 -0600 Subject: [PATCH 4/9] DOC-5860 Trimmed svg view boxes --- .../active-active-disaster-recovery/centralized-proxy.svg | 2 +- .../client-library-connection-pool.svg | 2 +- .../images/active-active-disaster-recovery/client-library.svg | 2 +- .../co-located-proxy-and-app.svg | 1 + .../active-active-disaster-recovery/global-load-balancer.svg | 2 +- static/images/active-active-disaster-recovery/gtm-with-DNS.svg | 2 +- .../active-active-disaster-recovery/gtm-with-load-balancer.svg | 2 +- static/images/active-active-disaster-recovery/proxy-pool.svg | 1 + 
.../active-active-disaster-recovery/regional-load-balancer.svg | 2 +- 9 files changed, 9 insertions(+), 7 deletions(-) create mode 100644 static/images/active-active-disaster-recovery/co-located-proxy-and-app.svg create mode 100644 static/images/active-active-disaster-recovery/proxy-pool.svg diff --git a/static/images/active-active-disaster-recovery/centralized-proxy.svg b/static/images/active-active-disaster-recovery/centralized-proxy.svg index 899d4f1cff..c9fef9c3bc 100644 --- a/static/images/active-active-disaster-recovery/centralized-proxy.svg +++ b/static/images/active-active-disaster-recovery/centralized-proxy.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/client-library-connection-pool.svg b/static/images/active-active-disaster-recovery/client-library-connection-pool.svg index b36728b754..3fe09acf42 100644 --- a/static/images/active-active-disaster-recovery/client-library-connection-pool.svg +++ b/static/images/active-active-disaster-recovery/client-library-connection-pool.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/client-library.svg b/static/images/active-active-disaster-recovery/client-library.svg index 588ee7d8bd..b7371a6bcd 100644 --- a/static/images/active-active-disaster-recovery/client-library.svg +++ b/static/images/active-active-disaster-recovery/client-library.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/co-located-proxy-and-app.svg b/static/images/active-active-disaster-recovery/co-located-proxy-and-app.svg new file mode 100644 index 0000000000..1c19b74419 --- /dev/null +++ b/static/images/active-active-disaster-recovery/co-located-proxy-and-app.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/global-load-balancer.svg 
b/static/images/active-active-disaster-recovery/global-load-balancer.svg index 2b931c85d2..e2f3f27fb3 100644 --- a/static/images/active-active-disaster-recovery/global-load-balancer.svg +++ b/static/images/active-active-disaster-recovery/global-load-balancer.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/gtm-with-DNS.svg b/static/images/active-active-disaster-recovery/gtm-with-DNS.svg index 32f60181d3..53779f01ed 100644 --- a/static/images/active-active-disaster-recovery/gtm-with-DNS.svg +++ b/static/images/active-active-disaster-recovery/gtm-with-DNS.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg b/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg index 143b2223a7..4181558b2e 100644 --- a/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg +++ b/static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/proxy-pool.svg b/static/images/active-active-disaster-recovery/proxy-pool.svg new file mode 100644 index 0000000000..edb1f6e18c --- /dev/null +++ b/static/images/active-active-disaster-recovery/proxy-pool.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/static/images/active-active-disaster-recovery/regional-load-balancer.svg b/static/images/active-active-disaster-recovery/regional-load-balancer.svg index 44bd34995f..3a641e9d7f 100644 --- a/static/images/active-active-disaster-recovery/regional-load-balancer.svg +++ b/static/images/active-active-disaster-recovery/regional-load-balancer.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file From eafb01c3554d8e0d536e4696f97dddffd4b6f8f0 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Wed, 19 Nov 
2025 17:24:25 -0600 Subject: [PATCH 5/9] DOC-5860 Added diagram intros and alt text --- .../active-active/disaster-recovery.md | 34 ++++++++++++++----- .../client-library-connection-pool.svg | 2 +- 2 files changed, 26 insertions(+), 10 deletions(-) diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index 2ce8aedf2d..58c78c7d96 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -111,25 +111,29 @@ Considerations: - May route traffic during CRDT synchronization, causing stale data reads. -The following diagram shows how a global traffic manager with DNS resolution routes traffic and allows applications to connect directly to the nearest Active-Active database member: +The following diagram shows how a global traffic manager with DNS resolution routes traffic: -{{}} +{{Diagram of a global traffic manager routing applications to Active-Active database members across regions}} If the environment does not allow DNS resolution, you can use a load balancer to direct traffic to the cluster nodes: -{{}} +{{Diagram of a global traffic manager with a load balancer directing traffic to Active-Active database members across regions}} #### Global load balancer For real-time traffic control and more advanced routing logic for cross-region failover and failback, you can use a global load balancer. However, this solution can have higher latency than a global traffic manager. 
-{{}} +The following diagram shows how a global load balancer routes traffic between regions: + +{{Diagram of a global load balancer routing traffic between Active-Active database members in different regions}} ### Cross-zone availability If your deployment does not require cross-region availability, you can use a regional load balancer to route requests to a healthy Active-Active database member in a different availability zone within the same region. -{{}} +The following diagram shows how a regional load balancer routes traffic across availability zones: + +{{Diagram of a regional load balancer routing traffic across availability zones within a single region}} ## Proxy-based disaster recovery @@ -173,7 +177,9 @@ Considerations: - Limited scalability. -{{}} +The following diagram shows a centralized proxy architecture with a standby proxy instance: + +{{Diagram of a centralized proxy architecture with active and standby proxy instances routing to Active-Active database members}} ### Co-locate to reduce latency and improve scalability @@ -189,7 +195,9 @@ Considerations: - Failover and failback might not be simultaneous depending on the Active-Active health check policy. -{{}} +The following diagram shows a co-located proxy architecture where each application server has its own proxy: + +{{Diagram of co-located proxy architecture where each application server has its own proxy instance}} ### Pool proxies for scalability @@ -205,7 +213,9 @@ Considerations: - Concurrent access across replicas is possible, but can be mitigated using database availability API requests. -{{}} +The following diagram shows a pool of proxies: + +{{Diagram of a pool of active proxy instances}} ## Client library-based disaster recovery @@ -233,7 +243,13 @@ Considerations: - When a development framework uses Redis transparently, failover and failback might not be easy to configure. 
-{{}} +The following diagram shows a client library-based disaster recovery approach: + +{{Diagram of client libraries routing traffic to Active-Active database members}} + +The following diagram shows a client-based disaster recovery approach that also uses [connection pooling]({{}}): + +{{Diagram of client libraries with connection pooling routing traffic to Active-Active database members}} For additional information, see the following client library guides for failover and failback: diff --git a/static/images/active-active-disaster-recovery/client-library-connection-pool.svg b/static/images/active-active-disaster-recovery/client-library-connection-pool.svg index 3fe09acf42..b7a9ff71ea 100644 --- a/static/images/active-active-disaster-recovery/client-library-connection-pool.svg +++ b/static/images/active-active-disaster-recovery/client-library-connection-pool.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file From c4593c461f72c6a8ba506656bf0a2ab8958e8893 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 21 Nov 2025 10:06:01 -0600 Subject: [PATCH 6/9] DOC-5860 Added redis-py failover link to A-A disaster recovery doc --- content/operate/rs/databases/active-active/disaster-recovery.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index 58c78c7d96..2532eafde3 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -255,6 +255,8 @@ For additional information, see the following client library guides for failover - [Jedis (Java)]({{}}) +- [redis-py (Python)]({{}}) + ## Application-based disaster recovery For complete control over failover and failback, you can implement disaster recovery mechanisms directly in the application server. 
From e019f4ae43a9147bc384822bb610f7937a77c5e2 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 21 Nov 2025 10:58:29 -0600 Subject: [PATCH 7/9] DOC-5860 Added some missing commas --- .../operate/rs/databases/active-active/disaster-recovery.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index 2532eafde3..dbb89f0112 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -65,7 +65,7 @@ When implementing a disaster recovery strategy for an Active-Active database, co - Is rate-limiting control needed? -- Can you modify the existing codebase or introduce new components such as load balancers or proxies? +- Can you modify the existing codebase or introduce new components, such as load balancers or proxies? ## Network-based disaster recovery @@ -229,7 +229,7 @@ Advantages: - No scalability concerns. -- Tighter control over connectivity such as timeouts, connection retries, and dynamic reconfiguration. +- Tighter control over connectivity, such as timeouts, connection retries, and dynamic reconfiguration. - OSS Cluster API support. 
From e2a1c728bd481dbf5efe680a8cd329754a148bf9 Mon Sep 17 00:00:00 2001 From: Rachel Elledge Date: Fri, 21 Nov 2025 15:53:38 -0600 Subject: [PATCH 8/9] DOC-5860 Copy edits and fixed diagram --- .../rs/databases/active-active/disaster-recovery.md | 10 +++++----- .../active-active-disaster-recovery/proxy-pool.svg | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index dbb89f0112..eee6a7e0bd 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -141,9 +141,9 @@ If you add a lightweight proxy software component between the clients and the Ac Advantages: -- Proxies provide out-of-the-box proactive and reactive health check methods, such as polling target health periodically using either a TCP connection or an HTTP request, or monitoring live operations for errors. +- Proxies provide proactive and reactive health check methods, such as polling target health periodically using either a TCP connection or an HTTP request, or monitoring live operations for errors. -- Proxies can be configured to easily run the desired A-A health check policy, such as the lag-aware database availability. +- Proxies can be configured to run Active-Active health checks, such as the lag-aware database availability requests. - If an Active-Active database member fails, a proxy can automatically detect the issue and redirect traffic to a healthy Active-Active database member without requiring DNS propagation delays or client disconnections. This enables fast, controlled failover and minimizes downtime. @@ -153,9 +153,9 @@ Considerations: - The proxies must have static IPs. - - Adding a new node to the cluster requires that the proxy be configured with the new endpoint. 
+ - If you add a new node to the cluster, you must configure the proxy with the new endpoint. - - A config syncer component is required to discover topology changes and reconfigure the proxy. + - A configuration syncer component is required to discover topology changes and reconfigure the proxy. - Proxies introduce latency. @@ -167,7 +167,7 @@ If concurrent access across replicas must be avoided in every scenario, you can Advantages: -- Concurrent access across replicas is not possible. +- Prevents concurrent access across replicas. - Failover and failback are simultaneous regardless of the Active-Active health check policy. diff --git a/static/images/active-active-disaster-recovery/proxy-pool.svg b/static/images/active-active-disaster-recovery/proxy-pool.svg index edb1f6e18c..1d80cc458f 100644 --- a/static/images/active-active-disaster-recovery/proxy-pool.svg +++ b/static/images/active-active-disaster-recovery/proxy-pool.svg @@ -1 +1 @@ - \ No newline at end of file + \ No newline at end of file From e0f0d1ca5e2869cead2cc52c70270db223e6f6ad Mon Sep 17 00:00:00 2001 From: Rachel Elledge <86307637+rrelledge@users.noreply.github.com> Date: Fri, 21 Nov 2025 16:29:30 -0600 Subject: [PATCH 9/9] Update content/operate/rs/databases/active-active/disaster-recovery.md Co-authored-by: David Dougherty --- content/operate/rs/databases/active-active/disaster-recovery.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/operate/rs/databases/active-active/disaster-recovery.md b/content/operate/rs/databases/active-active/disaster-recovery.md index eee6a7e0bd..3c373e067e 100644 --- a/content/operate/rs/databases/active-active/disaster-recovery.md +++ b/content/operate/rs/databases/active-active/disaster-recovery.md @@ -31,7 +31,7 @@ You can use the following health checks to help detect Active-Active database fa - Connection timeouts or Redis errors. -- [Lag-aware database availability requests]({{}}). +- [Lag-aware database availability requests]({{}}). 
- Probing the keyspace with [`SET`]({{}}) or [`GET`]({{}}) commands to cover all available shards.