Use `node_status_table` as a single source of truth about node liveness #17625
Added a mapping of the service unavailable status to an appropriate error code, so that the resulting error is not confusing. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Introduced the `alive_timeout_ms` property, which configures the time that must elapse since the last status update before a node is marked as offline. Signed-off-by: Michał Maślanka <michal@redpanda.com>
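For illustration only, the timeout rule described in this commit could look roughly like the sketch below. The names `node_status_sketch` and `is_alive_at` are hypothetical and not taken from the Redpanda code base, and the strict `<` comparison at the boundary is an assumption.

```cpp
#include <chrono>

// Hypothetical stand-in for a single node status entry (not Redpanda's type).
struct node_status_sketch {
    std::chrono::steady_clock::time_point last_seen;
};

// A node counts as alive only while its last status update is more recent
// than the configured timeout (the role `alive_timeout_ms` plays above).
// Assumption: a node whose age equals the timeout is already offline.
inline bool is_alive_at(
  const node_status_sketch& status,
  std::chrono::steady_clock::time_point now,
  std::chrono::milliseconds alive_timeout) {
    return now - status.last_seen < alive_timeout;
}
```

Passing `now` explicitly keeps the check deterministic, which makes the boundary behaviour easy to test.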
Force-pushed from 4b9976d to 5275d35
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47385#018ea9a4-13a0-4439-9102-6d0f1bd07220
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47403#018eaa73-b479-4d7d-9ed8-791fa917f053
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47436#018ead56-dfe8-4a02-bfc3-aacc2808fc78
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47436#018ead56-dfed-41e6-b125-8a5f2092ef18
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47436#018eadf6-6351-4eb6-be40-e44950ccdc48
Force-pushed from 5275d35 to 4a26a66
@@ -84,6 +89,11 @@ class health_monitor_frontend
    get_cluster_health_overview(model::timeout_clock::time_point);

    ss::future<bool> does_raft0_have_leader();
    /**
     * Method validating if a node is known and alive. It will return an empty
     * optional if the node is not present in node status table.
"node is not present in node status table"
What are the semantics of this? Is it "node is unknown" or "there hasn't been a successful ping yet"? I remember we were struggling with this in the partition balancer...
There is always a possibility that a node is already in the members table but hasn't been pinged yet.
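The three-state result discussed in this thread can be sketched as follows. This is a simplified, hypothetical model (`status_table_sketch` and its members are illustrative names, not the actual `node_status_table` or `health_monitor_frontend` API): an empty optional means the table has no entry for the node, which covers both an unknown node and a member that has not been pinged yet.

```cpp
#include <chrono>
#include <optional>
#include <unordered_map>

// Hypothetical, simplified stand-in for a node status table.
class status_table_sketch {
public:
    using clock = std::chrono::steady_clock;
    using node_id = int;

    // Record a successful status update (ping) for a node.
    void record_update(node_id id, clock::time_point when) {
        _last_seen[id] = when;
    }

    // std::nullopt: no entry for the node (unknown, or never pinged so far).
    // true/false:   entry exists and is younger/older than the timeout.
    std::optional<bool> is_alive(
      node_id id,
      clock::time_point now,
      std::chrono::milliseconds alive_timeout) const {
        auto it = _last_seen.find(id);
        if (it == _last_seen.end()) {
            return std::nullopt;
        }
        return std::make_optional(now - it->second < alive_timeout);
    }

private:
    std::unordered_map<node_id, clock::time_point> _last_seen;
};
```

With this shape a caller has to decide explicitly how to treat the empty case, for example by falling back to the members table, rather than silently reading "no data" as "offline".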
Replaced all references to `health_monitor` node status with queries to `health_monitor_frontend::is_alive`. This way the caller obtains more up-to-date information about cluster member liveness. Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
To provide guidance on which interface to use when querying node status, added a deprecation annotation to the `is_alive()` method of node status. This directs all users of the health infrastructure to the `health_monitor::is_alive` method. Signed-off-by: Michał Maślanka <michal@redpanda.com>
A hello request received from a peer indicates that the peer is alive. We may use the hello request to reset the `node_status_backend` reconnection backoff timeout, decreasing the latency of discovering that the peer is alive. Signed-off-by: Michał Maślanka <michal@redpanda.com>
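The effect of resetting the backoff on a hello request can be illustrated with a minimal sketch. The class `reconnect_backoff_sketch` and its doubling policy are assumptions made for this example; the actual backoff used by `node_status_backend` may differ.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical exponential reconnect backoff (not Redpanda's implementation).
class reconnect_backoff_sketch {
public:
    using ms = std::chrono::milliseconds;

    reconnect_backoff_sketch(ms base, ms max)
      : _base(base)
      , _max(max)
      , _current(base) {}

    // Called after a failed connection attempt: returns how long to wait
    // before the next attempt and doubles the following wait, up to the cap.
    ms next_wait() {
        ms wait = _current;
        _current = std::min<ms>(_current * 2, _max);
        return wait;
    }

    // Called when a hello request arrives from the peer: the peer is known
    // to be up, so the next reconnect attempt should not be delayed further.
    void on_hello() { _current = _base; }

private:
    ms _base;
    ms _max;
    ms _current;
};
```

Without the reset, a peer that restarts while the backoff sits at its cap would only be rediscovered after the full capped wait. With the reset, the next attempt uses the base wait again.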
Force-pushed from 4a26a66 to c0b0704
/backport v23.3.x
Failed to create a backport PR to v23.3.x branch. I tried:
Use the frequently updated `node_status_table` as the source of information about node liveness. The `node_status_table` is updated frequently because its updates are generated by a heartbeat-based failure detector that is independent of Raft. This information is far more up to date than the information based on health reports, as health reports are collected infrequently.

Fixes: #17197, #17198
Backports Required
Release Notes
Improvements