Use `node_status_table` as a single source of truth about node liveness #17625

mmaslankaprv · 2024-04-04T10:13:09Z

Use the frequently updated `node_status_table` as the source of information about node liveness. The table is updated by a heartbeat-based failure detector that is independent of Raft, so its information is far more up to date than the information derived from health reports, which are collected infrequently.

Fixes: #17197, #17198

Backports Required

Release Notes

Improvements

more accurate node status reporting

vbotbuildovich · 2024-04-04T16:25:14Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47385#018ea9a4-13a0-4439-9102-6d0f1bd07220

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47403#018eaa73-b479-4d7d-9ed8-791fa917f053

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47436#018ead56-dfe8-4a02-bfc3-aacc2808fc78

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47436#018ead56-dfed-41e6-b125-8a5f2092ef18

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47436#018eadf6-6351-4eb6-be40-e44950ccdc48

src/v/cluster/health_monitor_types.h

src/v/cluster/health_monitor_frontend.cc

ztlpn · 2024-04-04T17:38:52Z

src/v/cluster/health_monitor_frontend.h

@@ -84,6 +89,11 @@ class health_monitor_frontend
      get_cluster_health_overview(model::timeout_clock::time_point);

    ss::future<bool> does_raft0_have_leader();
+    /**
+     * Method validating if a node is known and alive. It will return an empty
+     * optional if the node is not present in node status table.


> node is not present in node status table

What are the semantics of this? Is it "node is unknown" or "there hasn't been a successful ping yet"? I remember we were struggling with this in the partition balancer...

mmaslankaprv · 2024-04-05

There is always a possibility that a node is already in the members table but hasn't been pinged yet.
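To make the three possible outcomes discussed in this thread concrete, here is a minimal sketch of a lookup that returns an empty optional when a node has no entry in the status table. The `status_table` alias, the free function `is_alive`, and the helper `alive_or_false` are hypothetical stand-ins for illustration, not the signatures used in this PR.

```cpp
#include <optional>
#include <unordered_map>

using node_id = int;

// Hypothetical stand-in for the node status table:
// node id -> outcome of the most recent liveness check.
using status_table = std::unordered_map<node_id, bool>;

// Returns std::nullopt when the node has no status entry at all, for example
// a node that is already a cluster member but has not been pinged yet.
std::optional<bool> is_alive(const status_table& table, node_id id) {
    auto it = table.find(id);
    if (it == table.end()) {
        return std::nullopt;
    }
    return it->second;
}

// One possible caller policy (an assumption, not necessarily what this PR
// does): treat a node without a status entry as not alive.
bool alive_or_false(const status_table& table, node_id id) {
    return is_alive(table, id).value_or(false);
}
```

Returning an optional keeps "no information yet" distinct from "known to be dead", and leaves the choice of how to treat the unknown case to each caller.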

Replaced all references to the `health_monitor` node status with queries to `health_monitor_frontend::is_alive`. This way callers obtain more up-to-date information about cluster member liveness. Signed-off-by: Michał Maślanka <michal@redpanda.com>

To provide guidance on which interface to use when querying node status, added a deprecation annotation to the `is_alive()` method of node status. This directs all users of the health infrastructure to the `health_monitor::is_alive` method. Signed-off-by: Michał Maślanka <michal@redpanda.com>

When a node receives a hello request from a peer, the request indicates that the peer is alive. We can use the hello request to reset the `node_status_backend` reconnection backoff timeout, which reduces the latency of discovering that the peer is alive. Signed-off-by: Michał Maślanka <michal@redpanda.com>
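The idea of resetting the backoff on a hello request can be sketched as follows. The `reconnect_backoff` class and its method names are hypothetical and only illustrate the mechanism; they are not the `node_status_backend` implementation.

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical exponential backoff used when reconnecting to a peer.
class reconnect_backoff {
public:
    using duration = std::chrono::milliseconds;

    reconnect_backoff(duration base, duration max)
      : _base(base)
      , _max(max)
      , _current(base) {}

    // Delay to wait before the next reconnection attempt.
    duration current() const { return _current; }

    // After a failed attempt: double the delay, capped at the maximum.
    void on_failure() { _current = std::min(_current * 2, _max); }

    // When the peer proves it is alive (for example, its hello request
    // arrives): fall back to the base delay so the next attempt is prompt.
    void reset() { _current = _base; }

private:
    duration _base;
    duration _max;
    duration _current;
};
```

Without the reset, a peer that has just restarted could stay unnoticed for up to the maximum backoff delay. With it, the next attempt happens after the base delay.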

vbotbuildovich · 2024-04-05T19:24:49Z

/backport v23.3.x

vbotbuildovich · 2024-04-05T19:25:45Z

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17625-v23.3.x-554 remotes/upstream/v23.3.x
git cherry-pick -x f16cdf5a42f286fa838bb81fc9d0976d5eaf693e e557e26076142af407bc4ae0a147e945d0b51d11 ed10f1e4e3a7f73f1336da2a5e32cb4334a8fb09 bff2992033cb28840afa20f761376a9cd612f9a4 aa36d7915dea88da331fc863d64148ac46331cfe e7212dacf2b295588239f958380a0a04fcb26b1d c0b07045917d7374e35af0cdbdce6af5822dfecc

Workflow run logs.

github-actions bot added the area/redpanda label Apr 4, 2024

mmaslankaprv requested review from nvartolomei, bharathv and ztlpn April 4, 2024 10:20

mmaslankaprv added 2 commits April 4, 2024 15:26

rpc: map service unavailable error code

f16cdf5

Added a mapping from the service unavailable status to the appropriate error code, so that the reported error is not confusing. Signed-off-by: Michał Maślanka <michal@redpanda.com>
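As a rough illustration of the kind of mapping this commit describes, the sketch below translates a transport-level status into a caller-facing error code. The enums `rpc_status` and `rpc_errc`, their values, and the function `map_status` are hypothetical and are not taken from the Redpanda source.

```cpp
#include <cstdint>

// Hypothetical wire-level status carried in an RPC response.
enum class rpc_status : std::uint32_t {
    success = 200,
    request_timeout = 408,
    server_error = 500,
    service_unavailable = 503,
};

// Hypothetical error codes surfaced to callers.
enum class rpc_errc {
    success,
    client_request_timeout,
    service_error,
    service_unavailable,
};

// Map every known status explicitly so that "service unavailable" is not
// folded into a generic error that would mislead the caller.
constexpr rpc_errc map_status(rpc_status s) {
    switch (s) {
    case rpc_status::success:
        return rpc_errc::success;
    case rpc_status::request_timeout:
        return rpc_errc::client_request_timeout;
    case rpc_status::service_unavailable:
        return rpc_errc::service_unavailable;
    case rpc_status::server_error:
        return rpc_errc::service_error;
    }
    // Status values outside the enum fall back to a generic error.
    return rpc_errc::service_error;
}
```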

config: introduced a timeout deciding if node is alive

e557e26

Introduced the `alive_timeout_ms` property, which configures the time that must elapse since the last status update before a node is marked as offline. Signed-off-by: Michał Maślanka <michal@redpanda.com>
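The check that such a timeout implies can be sketched as below. The helper `within_alive_timeout` and its parameters are hypothetical; only the property name `alive_timeout_ms` comes from the commit message, and the exact boundary behaviour (strictly less than) is an assumption.

```cpp
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical helper: a node counts as alive only while the time since its
// last status update is shorter than the configured timeout.
inline bool within_alive_timeout(
  clock_type::time_point last_update,
  clock_type::time_point now,
  std::chrono::milliseconds alive_timeout) {
    return now - last_update < alive_timeout;
}
```

With a timeout of 100 ms, for example, a node last seen 50 ms ago is reported as alive, while one last seen 100 ms ago or earlier is reported as offline.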

mmaslankaprv force-pushed the metadata-node-status branch from 4b9976d to 5275d35 Compare April 4, 2024 13:26

nvartolomei mentioned this pull request Apr 4, 2024

cluster: node status table as source of truth for alive #17582

Closed

6 tasks

mmaslankaprv force-pushed the metadata-node-status branch from 5275d35 to 4a26a66 Compare April 4, 2024 17:19

ztlpn reviewed Apr 4, 2024

mmaslankaprv added 5 commits April 5, 2024 08:38

c/admin: use is_alive api when querying for brokers

bff2992

Signed-off-by: Michał Maślanka <michal@redpanda.com>

c/metrics_reporter: use new is_alive api in metrics reporter

aa36d79

Signed-off-by: Michał Maślanka <michal@redpanda.com>

mmaslankaprv force-pushed the metadata-node-status branch from 4a26a66 to c0b0704 Compare April 5, 2024 06:38

mmaslankaprv requested a review from ztlpn April 5, 2024 08:03

redpanda-data deleted a comment from vbotbuildovich Apr 5, 2024

ztlpn approved these changes Apr 5, 2024

piyushredpanda merged commit 84d54b4 into redpanda-data:dev Apr 5, 2024
19 checks passed

vbotbuildovich mentioned this pull request Apr 5, 2024

[v23.3.x] Use node_status_table as a single source of truth about node liveness #17683

Closed

mmaslankaprv mentioned this pull request Apr 8, 2024

[v23.3.x] Use node_status_table as a single source of truth about node liveness #17698

Merged

ztlpn mentioned this pull request Apr 8, 2024

CI Failure (assert topic_info.start_offset == new_lwm, topic_info) in TestReadReplicaService.test_identical_lwms_after_delete_records #17247

Closed

mmaslankaprv mentioned this pull request Apr 9, 2024

CI Failure (requests.exceptions.TooManyRedirects: Exceeded 30 redirects.) in MultiRestartTest.test_recovery_after_multiple_restarts #17569

Closed
