replica & async rest API health check enhancement #1599

Merged 13 commits into zalando:master on Jul 15, 2020

Conversation

@ksarabu1 (Contributor) commented Jun 23, 2020

Description: REST API health check enhancement for the endpoints /replica and /async (or /asynchronous) to support an additional query parameter, lag.

  • GET /replica?lag=<max-lag>: replica check endpoint. In addition to the checks performed by /replica, it also checks replication latency and returns status code 200 only when the latency is below the specified value (in bytes). The key leader_optime from DCS is used as the leader WAL position to compute the latency on the replica, for performance reasons. Please note that the value in leader_optime might be a couple of seconds old (based on loop_wait).

  • GET /asynchronous?lag=<max-lag> or GET /async?lag=<max-lag>: asynchronous standby check endpoint. In addition to the checks performed by /asynchronous or /async, it also checks replication latency and returns status code 200 only when the latency is below the specified value (in bytes). The key leader_optime from DCS is used as the leader WAL position to compute the latency on the replica, for performance reasons. Please note that the value in leader_optime might be a couple of seconds old (based on loop_wait). A usage sketch follows below.
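
A usage sketch (not part of the PR) of how a load balancer health probe or monitoring script could call the new endpoints; the base URL and the 1 MB threshold are illustrative assumptions:

```python
# Minimal usage sketch: probe the lag-aware health-check endpoints.
# The base URL and the 1 MB threshold are illustrative, not defaults.
import requests

MAX_LAG_BYTES = 1048576  # 1 MB; the lag parameter is interpreted in bytes

def replica_is_healthy(base_url="http://localhost:8008"):
    # 200 only if the node is a running replica AND its lag, computed
    # against the leader_optime value cached in DCS, is below the threshold.
    resp = requests.get("%s/replica" % base_url,
                        params={"lag": MAX_LAG_BYTES}, timeout=2)
    return resp.status_code == 200

def async_is_healthy(base_url="http://localhost:8008"):
    # Same check, but only an asynchronous standby can return 200.
    resp = requests.get("%s/asynchronous" % base_url,
                        params={"lag": MAX_LAG_BYTES}, timeout=2)
    return resp.status_code == 200

if __name__ == "__main__":
    print("replica healthy:", replica_is_healthy())
    print("async healthy:", async_is_healthy())
```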

CyberDem0n pushed a commit that referenced this pull request Jun 24, 2020
Patroni is caching the cluster view in the DCS object because not all
operations require the most up-to-date values. The cached version is
valid for TTL seconds. So far it worked quite well; the only known
problem was that the `last_leader_operation` for some DCS
implementations was not very up-to-date:

* Etcd: since the `/optime/leader` key is updated right after the
`/leader` key, usually all replicas get the value from the previous HA
loop. Therefore the value is somewhere between `loop_wait` and
`loop_wait*2` old. We improve it by using a 10ms artificial sleep after
receiving watch notification from `compareAndSwap` operation on the
leader key. It usually gives enough time for the primary to update
the `/optime/leader`. On average that makes the cached version
`loop_wait/2` old.

* ZooKeeper: Patroni itself is not that interested in the most
up-to-date values of the member and leader/optime ZNodes. In the case of
a leader race it just reads everything from ZooKeeper, but during normal
operation it relies on the cache. In order to see recent values,
replicas set a watch on the `leader/optime` ZNode and re-read it after
it has been updated by the primary. On average that makes the cached
version `loop_wait/2` old.

* Kubernetes: last_leader_operation is stored in the same object as the
leader key itself, therefore the update is atomic and we always see the
latest version. That makes the cached version `loop_wait/2` old on avg.

* Consul: HA loops on the primary and replicas are not synchronized,
therefore at the moment when we read the cluster state from the Consul
KV we see the last_leader_operation value that is between 0 and
loop_wait old. On average that makes the cached version `loop_wait` old.
Unfortunately we can't make it much better without performing periodic
updates from Consul, which might have negative side effects.

Since the `optime/leader` is only updated at most once per HA loop
cycle, the value stored in the DCS is usually `loop_wait/2` old on avg.
For the majority of DCS implementations we can promise that the cached
version in Patroni will match the value in DCS most of the time,
therefore there is no need to make additional requests. The only
exception is Consul, but we could probably just document it, so that
anyone relying on the last_leader_operation value to check replication
lag can adjust their thresholds accordingly.

Will help to implement #1599
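
For illustration only, a minimal sketch of the kind of lag check this enables on the REST API side; `get_cached_leader_optime` and `get_local_wal_position` are hypothetical callables, not Patroni's actual API:

```python
# Hypothetical sketch of a lag check against the cached leader optime.
# The two callables are placeholders, not Patroni functions. Because the
# cached optime is roughly loop_wait/2 old on average (up to ~loop_wait
# for Consul), thresholds should account for that staleness.

def replication_lag_ok(max_lag_bytes, get_cached_leader_optime, get_local_wal_position):
    leader_optime = get_cached_leader_optime()   # WAL position published by the primary
    local_position = get_local_wal_position()    # replay/receive position on this replica
    if leader_optime is None or local_position is None:
        return False  # cannot prove the lag is within the threshold
    lag = max(leader_optime - local_position, 0)
    return lag <= max_lag_bytes
```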
@ksarabu1 ksarabu1 requested a review from CyberDem0n July 7, 2020 19:11
ksarabu1 and others added 3 commits July 10, 2020 19:24
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
@ksarabu1 ksarabu1 requested a review from CyberDem0n July 10, 2020 23:49
@FxKu (Member) commented Jul 15, 2020

👍

CyberDem0n added a commit that referenced this pull request Jul 15, 2020
@CyberDem0n (Collaborator)

👍

@CyberDem0n CyberDem0n merged commit 8a62999 into zalando:master Jul 15, 2020
@CyberDem0n (Collaborator)

Merged, thank you @ksarabu1!

CyberDem0n pushed a commit that referenced this pull request Mar 10, 2021
Commit 04b9fb9 introduced additional conditions for updating the cached
version of the leader optime. It was required for implementing health
checks based on replication lag in #1599.

What was in fact forgotten is that the event should be cleared after the
new value of the optime has been fetched. Not doing so results in
running the HA loop more frequently than required.

In addition to that, slightly increase the watch timeout; it will keep
HA loops in sync across all nodes in the cluster.

Close #1599
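
A generic sketch of the watch/event pattern this fix is about (not Patroni's code); `fetch_leader_optime` is a placeholder:

```python
# If the event is not cleared once the fresh optime has been consumed,
# every subsequent wait() returns immediately and the HA loop runs far
# more often than the configured loop_wait.
import threading

optime_updated = threading.Event()  # set by the DCS watch callback
WATCH_TIMEOUT = 10  # seconds; a slightly larger timeout keeps loops in sync

def ha_loop_iteration(fetch_leader_optime):
    woke_early = optime_updated.wait(WATCH_TIMEOUT)
    if woke_early:
        optime_updated.clear()  # the forgotten step: reset before the next wait()
    return fetch_leader_optime()
```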
CyberDem0n added a commit that referenced this pull request Mar 29, 2021
1. Commit 04b9fb9 introduced additional conditions for updating the cached version of the leader optime. It was required for implementing health checks based on replication lag in #1599.
  What was in fact forgotten is that the event should be cleared after the new value of the optime has been fetched. Not doing so results in running the HA loop more frequently than required.

2. Don't watch for sync members.
  The watch for sync member(s) was introduced in order to give the leader a signal that one of the members set the `nosync` tag to true.
  Since then we have got a few more conditions that should be notified about; therefore, instead of watching all members of the cluster, every cluster member checks whether the condition is met and, instead of updating its ZNode, performs delete+create (a sketch of this idea follows after this commit message).
  Since every member is already watching for new ZNodes created inside $scope/members/, they automatically get notified about important changes, and therefore watching sync members is redundant.

3. In addition to that, slightly increase the watch timeout; it will keep HA loops in sync across all nodes in the cluster.
Close #1873
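
An illustrative kazoo sketch of the delete+create idea (not Patroni's code): setting a ZNode's data does not fire a children watch on its parent, but recreating the node does, so members watching $scope/members/ are notified without per-node data watches. Paths and payloads below are made up.

```python
from kazoo.client import KazooClient

def republish_member(zk, path, data):
    # delete + create instead of set(): this fires the parent's children watch
    if zk.exists(path):
        zk.delete(path)
    zk.create(path, data, ephemeral=True)

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/scope/members")
zk.get_children("/scope/members",
                watch=lambda event: print("members changed:", event))
republish_member(zk, "/scope/members/node1", b'{"nosync": true}')
```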