-
Notifications
You must be signed in to change notification settings - Fork 814
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
replica & async rest API health check enhancement #1599
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
CyberDem0n
pushed a commit
that referenced
this pull request
Jun 24, 2020
Patroni is caching the cluster view in the DCS object because not all operations require the most up-to-date values. The cached version is valid for TTL seconds. o far it worked quite well, the only known problem was that the `last_leader_operation` for some DCS implementations was not very up-to-date: * Etcd: since the `/optime/leader` key is updated right after the '/leader` key, usually all replicas get the value from the previous HA loop. Therefore the value is somewhere between `loop_wait` and `loop_wait*2` old. We improve it by using the 10ms artificial sleep after receiving watch notification from `compareAndSwap` operation on the leader key. It usually gives enough time for the primary to update the `/optime/leader`. On average that makes the cached version `loop_wait/2` old. * ZooKeeper: Patroni itself is not so much interested in most up-to-date values of member and leader/optime ZNodes. In case of the leader race it just reads everything from ZooKeeper, but during normal operation it is relying on cache. In order to see the recent value on replicas they are doing watch on the `leader/optime` Znode and will re-read it after it was updated by the primary. On average that makes the cached version `loop_wait/2` old. * Kubernetes: last_leader_operation is stored in the same object as the leader key itself and therefore update is atomic and we always see the latest version. That makes the cached version `loop_wait/2` old on avg. * Consul: HA loops on the primary and replicas are not synchronized, therefore at the moment when we read the cluster state from the Consul KV we see the last_leader_operation value that is between 0 and loop_wait old. On average that makes the cached version `loop_wait` old. Unfortunately we can't make it much better without performing periodic updates from Consul, what might have negative side effects. Since the `optime/leader` is only updated at most once per HA loop cycle, the value stored in the DCS is usually `loop_wait/2` old on avg. For majority of DCS implementations we could promise that the cached version in Patroni will match the value in DCS most of the time, therefore there is no need to make additional requests. The only exception is Consul, but probably we could just document it, so when someone relying on last_leader_operation value to check the replication lag can correspondingly adjust thresholds. Will help to implement #1599
CyberDem0n
reviewed
Jun 24, 2020
CyberDem0n
reviewed
Jul 6, 2020
CyberDem0n
requested changes
Jul 10, 2020
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
Co-authored-by: Alexander Kukushkin <cyberdemn@gmail.com>
CyberDem0n
approved these changes
Jul 13, 2020
👍 |
CyberDem0n
added a commit
that referenced
this pull request
Jul 15, 2020
Patroni is caching the cluster view in the DCS object because not all operations require the most up-to-date values. The cached version is valid for TTL seconds. So far it worked quite well, the only known problem was that the `last_leader_operation` for some DCS implementations was not very up-to-date: * Etcd: since the `/optime/leader` key is updated right after the `/leader` key, usually all replicas get the value from the previous HA loop. Therefore the value is somewhere between `loop_wait` and `loop_wait*2` old. We improve it by using the 10ms artificial sleep after receiving watch notification from `compareAndSwap` operation on the leader key. It usually gives enough time for the primary to update the `/optime/leader`. On average that makes the cached version `loop_wait/2` old. * ZooKeeper: Patroni itself is not so much interested in most up-to-date values of member and leader/optime ZNodes. In case of the leader race it just reads everything from ZooKeeper, but during normal operation it is relying on cache. In order to see the recent value on replicas they are doing watch on the `leader/optime` Znode and will re-read it after it was updated by the primary. On average that makes the cached version `loop_wait/2` old. * Kubernetes: last_leader_operation is stored in the same object as the leader key itself and therefore update is atomic and we always see the latest version. That makes the cached version `loop_wait/2` old on avg. * Consul: HA loops on the primary and replicas are not synchronized, therefore at the moment when we read the cluster state from the Consul KV we see the last_leader_operation value that is between 0 and loop_wait old. On average that makes the cached version `loop_wait` old. Unfortunately we can't make it much better without performing periodic updates from Consul, which might have negative side effects. Since the `optime/leader` is only updated at most once per HA loop cycle, the value stored in the DCS is usually `loop_wait/2` old on avg. For majority of DCS implementations we could promise that the cached version in Patroni will match the value in DCS most of the time, therefore there is no need to make additional requests. The only exception is Consul, but probably we could just document it, so when someone relying on last_leader_operation value to check the replication lag can correspondingly adjust thresholds. Will help to implement #1599
👍 |
Merged, thank you @ksarabu1! |
CyberDem0n
pushed a commit
that referenced
this pull request
Mar 10, 2021
Commit 04b9fb9 introduced additional conditions for updating cached version of the leader optime. It was required for implementing health-checks based on replication lag in the #1599 What in fact was forgotten, the event should be cleared after the new value of the optime was fetched. Not doing so results in running the HA loop more frequent than it is required. In addition to that sligtly increase watch timeout, it will keep HA loops in sync across all nodes in the cluster. Close #1599
CyberDem0n
added a commit
that referenced
this pull request
Mar 29, 2021
1. Commit 04b9fb9 introduced additional conditions for updating cached version of the leader optime. It was required for implementing health-checks based on replication lag in the #1599. What in fact was forgotten, the event should be cleared after the new value of the optime was fetched. Not doing so results in running the HA loop more frequently than is required. 2. Don't watch for sync members. The watch for sync member(s) was introduced in order to give a signal to the leader that one of the members set the `nosync` tag to true. Since that time we have got a few more conditions that should be notified about, therefore instead of watching for all members of the cluster every cluster member checks whether the condition is met, and instead of updating ZNode performs delete+create. Since every member is already watching for new ZNodes to be created inside the $scope/members/, they automatically get notified about important changes, and therefore watching for sync members is redundant. 3. In addition to that, slightly increase watch timeout, it will keep HA loops in sync across all nodes in the cluster. Close #1873
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description: rest API health check enhancement on endpoints /replica & /async or /asynchronous to support additional query parameter
lag
.GET /replica?lag=<max-lag>
: replica check endpoint. In addition to checks fromreplica
, it also checks replication latency and returns status code 200 only when the latency is below specified value (in bytes). The key leader_optime from DCS is used for Leader wal position and to compute latency on replica for performance reasons. Please note that the value in leader_optime might be couple of seconds old (based on loop_wait).GET /asynchronous?lag=<max-lag>
orGET /async&lag=<max-lag>
: asynchronous standby check endpoint. In addition to checks fromasynchronous
orasync
, it also checks replication latency and returns status code 200 only when the latency is below specified value (in bytes). The key leader_optime from DCS is used for Leader wal position and compute latency on replica for performance reasons. Please note that the value in leader_optime might be a couple of seconds old (based on loop_wait).