Use node_status_table as a single source of truth about node liveness #17625

Merged 7 commits on Apr 5, 2024


@mmaslankaprv (Member) commented Apr 4, 2024

Use the frequently updated node_status_table as the source of information about node liveness. The node_status_table is updated frequently because its updates are generated by a heartbeat-based failure detector that is independent of Raft. This information is far more up to date than the information based on health reports, as health reports are collected infrequently.

Fixes: #17197, #17198

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • more accurate node status reporting

Added mapping of service unavailable status to appropriate error code.
This way the error is not confusing.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Introduced the `alive_timeout_ms` property, which configures the time that
must elapse since the last status update before a node is marked as offline.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
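
The timeout rule described in this commit message can be illustrated with a minimal sketch. It is not the code from this PR: the helper name `is_alive_at`, the use of `std::chrono::steady_clock`, and the treatment of the exact boundary (a node whose last update is exactly `alive_timeout` old is considered offline) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical helper, not Redpanda's implementation: a node counts as alive
// while less than `alive_timeout` has elapsed since its last status update.
inline bool is_alive_at(
  clock_type::time_point last_seen,
  clock_type::time_point now,
  std::chrono::milliseconds alive_timeout) {
    // Assumption: a negative timeout is a caller error.
    assert(alive_timeout.count() >= 0);
    return now - last_seen < alive_timeout;
}
```

With a 100 ms timeout, an update seen 50 ms ago keeps the node alive under this sketch, while an update seen 150 ms ago marks it offline.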
src/v/cluster/health_monitor_frontend.cc
@@ -84,6 +89,11 @@ class health_monitor_frontend
get_cluster_health_overview(model::timeout_clock::time_point);

ss::future<bool> does_raft0_have_leader();
/**
* Method validating if a node is known and alive. It will return an empty
* optional if the node is not present in node status table.
Contributor:

"node is not present in node status table"

What are the semantics of this? Is it "node is unknown" or "there hasn't been a successful ping yet"? I remember we were struggling with this in the partition balancer...

Member Author:

There is always a possibility that a node is already in the members table but hasn't been pinged yet.
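
The three outcomes discussed in this thread (no entry, alive, not alive) can be sketched with `std::optional<bool>`. This is only an illustration of the semantics; `status_table_sketch`, `record_update`, and the integer `node_id` are hypothetical names and do not reflect the actual `node_status_table` or `health_monitor_frontend` code.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <unordered_map>

using node_id = int;
using clock_type = std::chrono::steady_clock;

// Hypothetical stand-in for a node status table: maps a node id to the last
// time a status update was received from that node.
class status_table_sketch {
public:
    explicit status_table_sketch(std::chrono::milliseconds alive_timeout)
      : _alive_timeout(alive_timeout) {
        // Assumption: the timeout must be positive to be meaningful.
        assert(alive_timeout.count() > 0);
    }

    void record_update(node_id id, clock_type::time_point when) {
        _last_seen[id] = when;
    }

    // Empty optional: the node has no entry (it is unknown, or it is a
    // member that has not been pinged yet). Otherwise true or false
    // depending on whether the last update is still fresh.
    std::optional<bool>
    is_alive(node_id id, clock_type::time_point now) const {
        auto it = _last_seen.find(id);
        if (it == _last_seen.end()) {
            return std::nullopt;
        }
        return now - it->second < _alive_timeout;
    }

private:
    std::chrono::milliseconds _alive_timeout;
    std::unordered_map<node_id, clock_type::time_point> _last_seen;
};
```

A caller has to decide how to treat the empty case explicitly, for example `table.is_alive(id, now).value_or(false)` if a node without an entry should be handled as not alive.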

Replaced all references to `health_monitor` node status with queries to
`health_monitor_frontend::is_alive`. This way the caller obtains more
up-to-date information about cluster member liveness.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
To provide guidance on which interface to use when querying node status,
added a deprecation annotation to the `is_alive()` method of node
status. This signals to all users of the health infrastructure that they
should use the `health_monitor::is_alive` method instead.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
A hello request received from a peer indicates that the peer is alive.
We can use the hello request to reset the `node_status_backend`
reconnection backoff timeout, which decreases the latency of discovering
that the peer is alive.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
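
The idea of resetting the reconnection backoff on a hello request can be sketched as follows. This is a simplified illustration, not the `node_status_backend` code: the class `reconnect_backoff_sketch`, its method names, and the doubling policy are assumptions chosen to show why a reset shortens discovery latency.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical exponential reconnect backoff, for illustration only.
class reconnect_backoff_sketch {
public:
    using duration = std::chrono::milliseconds;

    reconnect_backoff_sketch(duration base, duration max)
      : _base(base)
      , _max(max)
      , _current(base) {
        assert(base.count() > 0 && base <= max);
    }

    // Delay to wait before the next reconnect attempt.
    duration current() const { return _current; }

    // After a failed connection attempt: double the delay, capped at _max.
    void on_failure() { _current = std::min(_current * 2, _max); }

    // A hello request arrived from the peer, so the peer is alive. Reset the
    // delay so the next attempt does not wait out a long backoff.
    void on_hello() { _current = _base; }

private:
    duration _base;
    duration _max;
    duration _current;
};
```

Without the reset, a peer that comes back after several failed attempts would only be contacted once the accumulated delay expires. With the reset, the next attempt uses the base delay again.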
@piyushredpanda piyushredpanda merged commit 84d54b4 into redpanda-data:dev Apr 5, 2024
19 checks passed
@vbotbuildovich (Collaborator):

/backport v23.3.x

@vbotbuildovich (Collaborator):

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17625-v23.3.x-554 remotes/upstream/v23.3.x
git cherry-pick -x f16cdf5a42f286fa838bb81fc9d0976d5eaf693e e557e26076142af407bc4ae0a147e945d0b51d11 ed10f1e4e3a7f73f1336da2a5e32cb4334a8fb09 bff2992033cb28840afa20f761376a9cd612f9a4 aa36d7915dea88da331fc863d64148ac46331cfe e7212dacf2b295588239f958380a0a04fcb26b1d c0b07045917d7374e35af0cdbdce6af5822dfecc

