replica & async rest API health check enhancement #1599

Merged 13 commits into zalando:master on Jul 15, 2020

Conversation

@ksarabu1 (Contributor) commented Jun 23, 2020

Description: REST API health check enhancement for the endpoints /replica and /async (or /asynchronous) to support an additional query parameter, lag.

  • GET /replica?lag=<max-lag>: replica check endpoint. In addition to the checks performed by /replica, it also checks replication latency and returns status code 200 only when the latency is below the specified value (in bytes). The key leader_optime from DCS is used as the leader WAL position to compute the latency on the replica, for performance reasons. Please note that the value in leader_optime might be a couple of seconds old (based on loop_wait).

  • GET /asynchronous?lag=<max-lag> or GET /async?lag=<max-lag>: asynchronous standby check endpoint. In addition to the checks performed by /asynchronous or /async, it also checks replication latency and returns status code 200 only when the latency is below the specified value (in bytes). The key leader_optime from DCS is used as the leader WAL position to compute the latency on the replica, for performance reasons. Please note that the value in leader_optime might be a couple of seconds old (based on loop_wait). A usage sketch follows below.
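
A usage sketch (not part of the PR) of how a load balancer health probe or monitoring script could call the new endpoints; the base URL and the 1 MB threshold are illustrative assumptions:

```python
# Minimal usage sketch: probe the lag-aware health-check endpoints.
# The base URL and the 1 MB threshold are illustrative, not defaults.
import requests

MAX_LAG_BYTES = 1048576  # 1 MB; the lag parameter is interpreted in bytes

def replica_is_healthy(base_url="http://localhost:8008"):
    # 200 only if the node is a running replica AND its lag, computed
    # against the leader_optime value cached in DCS, is below the threshold.
    resp = requests.get("%s/replica" % base_url,
                        params={"lag": MAX_LAG_BYTES}, timeout=2)
    return resp.status_code == 200

def async_is_healthy(base_url="http://localhost:8008"):
    # Same check, but only an asynchronous standby can return 200.
    resp = requests.get("%s/asynchronous" % base_url,
                        params={"lag": MAX_LAG_BYTES}, timeout=2)
    return resp.status_code == 200

if __name__ == "__main__":
    print("replica healthy:", replica_is_healthy())
    print("async healthy:", async_is_healthy())
```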

CyberDem0n pushed a commit that referenced this pull request Jun 24, 2020
Patroni is caching the cluster view in the DCS object because not all
operations require the most up-to-date values. The cached version is
valid for TTL seconds. So far it worked quite well; the only known
problem was that the `last_leader_operation` for some DCS
implementations was not very up-to-date:

* Etcd: since the `/optime/leader` key is updated right after the
`/leader` key, usually all replicas get the value from the previous HA
loop. Therefore the value is somewhere between `loop_wait` and
`loop_wait*2` old. We improve it by using a 10ms artificial sleep after
receiving watch notification from `compareAndSwap` operation on the
leader key. It usually gives enough time for the primary to update
the `/optime/leader`. On average that makes the cached version
`loop_wait/2` old.

* ZooKeeper: Patroni itself is not that interested in the most
up-to-date values of the member and leader/optime ZNodes. In the case of
a leader race it just reads everything from ZooKeeper, but during normal
operation it relies on the cache. In order to see recent values,
replicas set a watch on the `leader/optime` ZNode and re-read it after
it has been updated by the primary. On average that makes the cached
version `loop_wait/2` old.

* Kubernetes: last_leader_operation is stored in the same object as the
leader key itself, therefore the update is atomic and we always see the
latest version. That makes the cached version `loop_wait/2` old on avg.

* Consul: HA loops on the primary and replicas are not synchronized,
therefore at the moment when we read the cluster state from the Consul
KV we see the last_leader_operation value that is between 0 and
loop_wait old. On average that makes the cached version `loop_wait` old.
Unfortunately we can't make it much better without performing periodic
updates from Consul, which might have negative side effects.

Since the `optime/leader` is only updated at most once per HA loop
cycle, the value stored in the DCS is usually `loop_wait/2` old on avg.
For the majority of DCS implementations we can promise that the cached
version in Patroni will match the value in DCS most of the time,
therefore there is no need to make additional requests. The only
exception is Consul, but we could probably just document it, so that
anyone relying on the last_leader_operation value to check replication
lag can adjust their thresholds accordingly.

Will help to implement #1599
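
For illustration only, a minimal sketch of the kind of lag check this enables on the REST API side; `get_cached_leader_optime` and `get_local_wal_position` are hypothetical callables, not Patroni's actual API:

```python
# Hypothetical sketch of a lag check against the cached leader optime.
# The two callables are placeholders, not Patroni functions. Because the
# cached optime is roughly loop_wait/2 old on average (up to ~loop_wait
# for Consul), thresholds should account for that staleness.

def replication_lag_ok(max_lag_bytes, get_cached_leader_optime, get_local_wal_position):
    leader_optime = get_cached_leader_optime()   # WAL position published by the primary
    local_position = get_local_wal_position()    # replay/receive position on this replica
    if leader_optime is None or local_position is None:
        return False  # cannot prove the lag is within the threshold
    lag = max(leader_optime - local_position, 0)
    return lag <= max_lag_bytes
```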
@ksarabu1 ksarabu1 requested a review from CyberDem0n July 7, 2020 19:11
ksarabu1 and others added 3 commits July 10, 2020 19:24
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
@ksarabu1 ksarabu1 requested a review from CyberDem0n July 10, 2020 23:49
@FxKu (Member) commented Jul 15, 2020

👍

CyberDem0n added a commit that referenced this pull request Jul 15, 2020
@CyberDem0n (Collaborator)

👍

@CyberDem0n CyberDem0n merged commit 8a62999 into zalando:master Jul 15, 2020
@CyberDem0n (Collaborator)

Merged, thank you @ksarabu1!

CyberDem0n pushed a commit that referenced this pull request Mar 10, 2021
Commit 04b9fb9 introduced additional conditions for updating the cached
version of the leader optime. It was required for implementing health
checks based on replication lag in #1599.

What was in fact forgotten is that the event should be cleared after the
new value of the optime has been fetched. Not doing so results in
running the HA loop more frequently than required.

In addition to that, slightly increase the watch timeout; it will keep
HA loops in sync across all nodes in the cluster.

Close #1599
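
A generic sketch of the watch/event pattern this fix is about (not Patroni's code); `fetch_leader_optime` is a placeholder:

```python
# If the event is not cleared once the fresh optime has been consumed,
# every subsequent wait() returns immediately and the HA loop runs far
# more often than the configured loop_wait.
import threading

optime_updated = threading.Event()  # set by the DCS watch callback
WATCH_TIMEOUT = 10  # seconds; a slightly larger timeout keeps loops in sync

def ha_loop_iteration(fetch_leader_optime):
    woke_early = optime_updated.wait(WATCH_TIMEOUT)
    if woke_early:
        optime_updated.clear()  # the forgotten step: reset before the next wait()
    return fetch_leader_optime()
```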
CyberDem0n added a commit that referenced this pull request Mar 29, 2021
1. Commit 04b9fb9 introduced additional conditions for updating the cached version of the leader optime. It was required for implementing health checks based on replication lag in #1599.
  What was in fact forgotten is that the event should be cleared after the new value of the optime has been fetched. Not doing so results in running the HA loop more frequently than required.

2. Don't watch for sync members.
  The watch for sync member(s) was introduced in order to give the leader a signal that one of the members set the `nosync` tag to true.
  Since then we have got a few more conditions that should be notified about; therefore, instead of watching all members of the cluster, every cluster member checks whether the condition is met and, instead of updating its ZNode, performs delete+create (a sketch of this idea follows after this commit message).
  Since every member is already watching for new ZNodes created inside $scope/members/, they automatically get notified about important changes, and therefore watching sync members is redundant.

3. In addition to that, slightly increase the watch timeout; it will keep HA loops in sync across all nodes in the cluster.
Close #1873
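
An illustrative kazoo sketch of the delete+create idea (not Patroni's code): setting a ZNode's data does not fire a children watch on its parent, but recreating the node does, so members watching $scope/members/ are notified without per-node data watches. Paths and payloads below are made up.

```python
from kazoo.client import KazooClient

def republish_member(zk, path, data):
    # delete + create instead of set(): this fires the parent's children watch
    if zk.exists(path):
        zk.delete(path)
    zk.create(path, data, ephemeral=True)

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/scope/members")
zk.get_children("/scope/members",
                watch=lambda event: print("members changed:", event))
republish_member(zk, "/scope/members/node1", b'{"nosync": true}')
```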