
c/health: reduced number of health_node_report copies #17863

Merged 1 commit on Apr 17, 2024


mmaslankaprv
Member

mmaslankaprv commented Apr 15, 2024

Every time `health_monitor_backend::get_cluster_health` was called, we returned a copy of each node health report. The health report data structure may be large, as it contains all partition replica information.

To reduce the number of copies while still not interrupting asynchronous iteration over the health report, the API was changed to return a list of shared pointers.

Node health reports aren't modified in place by the health monitor backend; instead, they are replaced entirely. This makes it easy to share the underlying data without worrying about concurrent access.
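A minimal sketch of the sharing scheme described above, assuming heavily simplified, hypothetical stand-ins for the real types: `node_health_report` and `cluster_health_report` are reduced to a couple of fields, and `health_store` is an invented toy in place of `health_monitor_backend`. It uses `std::shared_ptr<const T>` to stay self-contained; the actual code may use a Seastar smart pointer instead. Because the store only ever swaps in a new pointer, a caller that already holds a snapshot keeps reading the old report even if an update lands while it is iterating.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the real node health report.
struct node_health_report {
    int32_t node_id;
    std::vector<std::string> partitions; // stands in for per-replica state
};

// Published reports are read-only: consumers get a pointer to const.
using node_health_report_ptr = std::shared_ptr<const node_health_report>;

struct cluster_health_report {
    std::vector<node_health_report_ptr> node_reports;
};

// Hypothetical toy version of the backend's per-node report cache.
class health_store {
public:
    // Replace the node's report wholesale; a published report is never mutated.
    void update(node_health_report fresh) {
        const int32_t id = fresh.node_id;
        _reports[id] = std::make_shared<node_health_report>(std::move(fresh));
    }

    // Hands out pointers to the current reports instead of copying them.
    cluster_health_report get_cluster_health() const {
        cluster_health_report out;
        out.node_reports.reserve(_reports.size());
        for (const auto& entry : _reports) {
            out.node_reports.push_back(entry.second);
        }
        return out;
    }

private:
    std::unordered_map<int32_t, node_health_report_ptr> _reports;
};
```

For example, a caller holding the result of `get_cluster_health()` still sees the old partition list after `update()` replaces that node's entry, while later callers see the fresh report; the old report is freed once the last snapshot referencing it goes away.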

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • Significantly reduced the number of health report copies

```cpp
auto serde_fields() {
    return std::tie(
      raft0_leader, node_states, node_reports, bytes_in_cloud_storage);
}
cluster_health_report copy() const;
```
Member

Are these copy() functions actually used anywhere?

Member Author

tests only
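Once the report holds shared pointers, its implicit copy constructor only copies the pointers, so an explicit `copy()` is what a test would call to get an independent deep copy. A minimal sketch, assuming simplified hypothetical types; the real `copy()` bodies are not shown in this excerpt of the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified types, not the real Redpanda definitions.
struct node_health_report {
    int32_t node_id;
    std::vector<std::string> partitions;
};

using node_health_report_ptr = std::shared_ptr<const node_health_report>;

struct cluster_health_report {
    std::vector<node_health_report_ptr> node_reports;

    // Deep copy: allocate a new report for every entry. The implicit copy
    // constructor would only copy the pointers and share the reports.
    // Assumes every entry is non-null.
    cluster_health_report copy() const {
        cluster_health_report out;
        out.node_reports.reserve(node_reports.size());
        for (const auto& p : node_reports) {
            out.node_reports.push_back(std::make_shared<node_health_report>(*p));
        }
        return out;
    }
};
```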

```diff
@@ -114,8 +114,8 @@ class health_monitor_backend {
 };
```
Member

So most of the methods, like get_cluster_health, collect_current_node_health, etc., still return a copy. Is that expected? Aren't those the most-used methods for accessing the health report?

Member

Discussed on slack:

get_cluster_health returns the report struct, which internally holds pointers to the individual node reports.

get_current_node_health still returns by value, but it is not used in the metadata update path.

@vbotbuildovich
Collaborator

vbotbuildovich commented Apr 15, 2024

```cpp
    write(out, bytes_in_cloud_storage);
}

ss::future<> serde_async_read(iobuf_parser& in, const serde::header& h) {
```
Member

There is a lot going on in this one commit. Is it all related?

Member Author

Yes, I know it is a lot of changes for one commit, but it is all related: this one change (the type we use to propagate cluster_health_reports) affects many other places, such as:

  • serialization
  • callers of health_monitor_frontend apis
  • tests
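To illustrate why the type change ripples into serialization: a pointer cannot go on the wire, so the encoder must write each pointed-to report and the decoder must allocate a fresh shared pointer per decoded report. This is a sketch using a hypothetical toy byte encoding (host byte order, no versioning, hypothetical `encode`/`decode` helpers); it is not Redpanda's `serde` format or API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical simplified types.
struct node_health_report {
    int32_t node_id;
    std::vector<int32_t> partition_ids;
};

struct cluster_health_report {
    std::vector<std::shared_ptr<const node_health_report>> node_reports;
};

// Toy helpers: raw host-byte-order copies of trivially copyable values.
template <typename T>
void put_raw(std::vector<char>& out, T v) {
    static_assert(std::is_trivially_copyable<T>::value, "raw copy needs POD");
    const char* p = reinterpret_cast<const char*>(&v);
    out.insert(out.end(), p, p + sizeof(T));
}

template <typename T>
T get_raw(const std::vector<char>& in, std::size_t& pos) {
    assert(pos + sizeof(T) <= in.size()); // toy bounds check
    T v{};
    std::memcpy(&v, in.data() + pos, sizeof(T));
    pos += sizeof(T);
    return v;
}

// Serialize the pointed-to reports, never the pointers themselves.
void encode(std::vector<char>& out, const cluster_health_report& r) {
    put_raw<uint32_t>(out, static_cast<uint32_t>(r.node_reports.size()));
    for (const auto& p : r.node_reports) {
        put_raw<int32_t>(out, p->node_id);
        put_raw<uint32_t>(out, static_cast<uint32_t>(p->partition_ids.size()));
        for (int32_t id : p->partition_ids) {
            put_raw<int32_t>(out, id);
        }
    }
}

// Each decoded report gets its own newly allocated shared pointer.
cluster_health_report decode(const std::vector<char>& in) {
    std::size_t pos = 0;
    cluster_health_report r;
    const auto n = get_raw<uint32_t>(in, pos);
    for (uint32_t i = 0; i < n; ++i) {
        node_health_report nr{};
        nr.node_id = get_raw<int32_t>(in, pos);
        const auto count = get_raw<uint32_t>(in, pos);
        for (uint32_t j = 0; j < count; ++j) {
            nr.partition_ids.push_back(get_raw<int32_t>(in, pos));
        }
        r.node_reports.push_back(
          std::make_shared<node_health_report>(std::move(nr)));
    }
    return r;
}
```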


Signed-off-by: Michał Maślanka <michal@redpanda.com>
mmaslankaprv merged commit 040d13f into redpanda-data:dev on Apr 17, 2024
17 checks passed
mmaslankaprv deleted the shared-health branch on April 17, 2024 06:25
@vbotbuildovich
Collaborator

/backport v23.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

```shell
git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17863-v23.3.x-530 remotes/upstream/v23.3.x
git cherry-pick -x 226b9bff0971bd07156f75171d528f327c9f2096
```

Workflow run logs.
