Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using node_status_table to determine node unavailability in partition balancer #11146

Merged

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Jun 1, 2023

Using node_status_table last heartbeat timestamps to determine if a
node is unavailable. Previously partition_balancer_planner was using
follower state coming from the controller raft group.

Fixes: #7218

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

bharathv
bharathv previously approved these changes Jun 1, 2023
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.. probably good to have Alexey do another pass.

src/v/cluster/partition_balancer_backend.h Outdated Show resolved Hide resolved
src/v/cluster/partition_balancer_state.h Outdated Show resolved Hide resolved
for (const auto& follower : follower_metrics) {
auto unavailable_dur = now - follower.last_heartbeat;
auto node_status = _state.node_status().get_node_status(id);
// node status is not yet available, wait for it to be updated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed on slack, we should cover the case when a remote node didn't respond to pings at all since the current node startup. I guess we should use time when we discovered the node through the members table as last_seen in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will open a follow up pr with the changes in node_status subsystem it seems that we also do not clean the status table when node is removed from the cluster

Since `partition_balancer_backend` is going to use the
`node_status_table` to recognize unavailable node it has be has an
access to it.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv force-pushed the partition-balancer-node-status branch 2 times, most recently from a4bed64 to b1edff0 Compare June 2, 2023 12:48
Using `node_status_table` last heartbeat timestamps to determine if a
node is unavailable. Previously `partition_balancer_planner` was using
follower state coming from the controller raft group.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv force-pushed the partition-balancer-node-status branch from b1edff0 to a66fccb Compare June 2, 2023 12:57
@mmaslankaprv mmaslankaprv merged commit a776aaa into redpanda-data:dev Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants