c/health: reduced number of health_node_report copies #17863

mmaslankaprv · 2024-04-15T10:13:29Z

Every time `health_monitor_backend::get_cluster_health` was called, we returned a copy of each node health report. The health report data structure may be large, as it contains all partition replica information.

To reduce the number of copies while still not interrupting asynchronous iteration over the health report, the API was changed to return a list of shared pointers.

Node health reports aren't modified in place by the health monitor backend; instead, they are completely replaced. This makes it easy to share the underlying data without worrying about concurrent access.
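The sharing scheme described above can be illustrated with a minimal sketch. It is not the code from this PR: `node_health_report` is reduced to two fields, `health_backend_sketch` and `replace_report` are hypothetical names, and `std::shared_ptr` stands in for whatever pointer type the real backend uses. The point it shows is that returning pointers to immutable reports copies no report data, and that replacing a report wholesale leaves earlier snapshots valid.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a node health report.
struct node_health_report {
    int32_t node_id;
    std::vector<std::string> partitions; // potentially large payload
};

// Reports are immutable once published, so readers can share them freely.
using report_ptr = std::shared_ptr<const node_health_report>;

// Hypothetical backend sketch (not the real health_monitor_backend).
class health_backend_sketch {
public:
    // Publish a new report for a node. The previous report is never
    // mutated; the map entry is pointed at a freshly allocated object.
    void replace_report(node_health_report r) {
        const auto id = r.node_id;
        _reports[id]
          = std::make_shared<const node_health_report>(std::move(r));
    }

    // Copies only the pointers, never the report payloads.
    std::vector<report_ptr> get_cluster_health() const {
        std::vector<report_ptr> out;
        out.reserve(_reports.size());
        for (const auto& [id, ptr] : _reports) {
            out.push_back(ptr);
        }
        return out;
    }

private:
    std::map<int32_t, report_ptr> _reports;
};
```

A caller that holds the returned vector across a suspension point keeps its reports alive through the reference counts, even if the backend replaces them in the meantime.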

Backports Required

Release Notes

Improvements

Greatly reduced the number of health report copies.

StephanDollberg · 2024-04-15T10:43:27Z

src/v/cluster/health_monitor_types.h

-    auto serde_fields() {
-        return std::tie(
-          raft0_leader, node_states, node_reports, bytes_in_cloud_storage);
+    cluster_health_report copy() const;


Are these copy() functions actually used anywhere?

StephanDollberg · 2024-04-15T10:55:54Z

src/v/cluster/health_monitor_backend.h

@@ -114,8 +114,8 @@ class health_monitor_backend {
    };


So most of the methods like get_cluster_health, collect_current_node_health etc still return a copy. Is that expected? Aren't those the most used methods to access the health report?

Discussed on slack:

get_cluster_health returns the report struct, which internally holds the pointers to the individual node reports.

get_current_node_health still returns by value, but it is not used in the metadata update path.
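As a rough illustration of a report struct that holds pointers to node reports, the sketch below contrasts an ordinary copy of such a struct, which shares the node reports, with an explicit `copy()` that duplicates them. The semantics of the `copy()` member added in this PR are not spelled out in the discussion, so treating it as a deep copy is an assumption; the type names, fields, and use of `std::shared_ptr` are likewise hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified node report.
struct node_report_sketch {
    int32_t node_id;
    uint64_t replica_count;
};

// Hypothetical cluster report holding shared, immutable node reports.
struct cluster_report_sketch {
    std::vector<std::shared_ptr<const node_report_sketch>> node_reports;

    // Assumed deep-copy semantics: allocate a fresh object for every
    // node report instead of sharing the existing ones.
    cluster_report_sketch copy() const {
        cluster_report_sketch out;
        out.node_reports.reserve(node_reports.size());
        for (const auto& r : node_reports) {
            if (r) {
                out.node_reports.push_back(
                  std::make_shared<const node_report_sketch>(*r));
            } else {
                out.node_reports.push_back(nullptr);
            }
        }
        return out;
    }
};
```

Under these assumptions, returning the struct by value is cheap because only the pointers are copied, while `copy()` remains available for callers that need an independent set of reports.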

vbotbuildovich · 2024-04-15T14:13:54Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47792#018ee1df-5556-459f-a026-38304debb5e9

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47792#018ee1e7-0381-4848-8c36-cde1b1a5fdc9

dotnwat · 2024-04-15T21:57:28Z

src/v/cluster/health_monitor_types.h

+        write(out, bytes_in_cloud_storage);
+    }
+
+    ss::future<> serde_async_read(iobuf_parser& in, const serde::header& h) {


there is a lot going on in this one commit. is it all related?

Yes, I know it is a lot of changes for one commit, but they are all related: this one change (to the type we use to propagate cluster_health_reports) affects many other places, such as:

- serialization
- callers of health_monitor_frontend APIs
- tests

src/v/cluster/health_monitor_types.h

src/v/cluster/tests/partition_balancer_planner_test.cc

Every time `health_monitor_backend::get_cluster_health` was called, we returned a copy of each node health report. The health report data structure may be large, as it contains all partition replica information. To reduce the number of copies while still not interrupting asynchronous iteration over the health report, the API was changed to return a list of shared pointers. Node health reports aren't modified in place by the health monitor backend; instead, they are completely replaced. This makes it easy to share the underlying data without worrying about concurrent access.

Signed-off-by: Michał Maślanka <michal@redpanda.com>

vbotbuildovich · 2024-04-17T06:25:24Z

/backport v23.3.x

vbotbuildovich · 2024-04-17T06:26:25Z

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17863-v23.3.x-530 remotes/upstream/v23.3.x
git cherry-pick -x 226b9bff0971bd07156f75171d528f327c9f2096

Workflow run logs.

github-actions bot added the area/redpanda label Apr 15, 2024

mmaslankaprv force-pushed the shared-health branch from a762ea0 to 46e3b81 Compare April 15, 2024 10:25

mmaslankaprv marked this pull request as ready for review April 15, 2024 10:25

mmaslankaprv requested review from bharathv, ztlpn and StephanDollberg April 15, 2024 10:25

StephanDollberg previously approved these changes Apr 15, 2024

mmaslankaprv dismissed StephanDollberg’s stale review via 062c807 April 15, 2024 11:38

mmaslankaprv force-pushed the shared-health branch from 46e3b81 to 062c807 Compare April 15, 2024 11:38

mmaslankaprv requested a review from StephanDollberg April 15, 2024 13:02

StephanDollberg previously approved these changes Apr 15, 2024

mmaslankaprv dismissed StephanDollberg’s stale review via cfae3ff April 15, 2024 15:49

mmaslankaprv force-pushed the shared-health branch from 062c807 to cfae3ff Compare April 15, 2024 15:49

StephanDollberg previously approved these changes Apr 15, 2024

dotnwat reviewed Apr 15, 2024

ztlpn reviewed Apr 16, 2024

mmaslankaprv dismissed StephanDollberg’s stale review via 226b9bf April 16, 2024 13:40

mmaslankaprv force-pushed the shared-health branch from cfae3ff to 226b9bf Compare April 16, 2024 13:40

mmaslankaprv requested review from dotnwat, StephanDollberg and ztlpn April 16, 2024 14:28

ztlpn approved these changes Apr 16, 2024

mmaslankaprv merged commit 040d13f into redpanda-data:dev Apr 17, 2024
17 checks passed

mmaslankaprv deleted the shared-health branch April 17, 2024 06:25

vbotbuildovich mentioned this pull request Apr 17, 2024

[v23.3.x] c/health: reduced number of health_node_report copies #17911

Open

mmaslankaprv mentioned this pull request Apr 23, 2024

[v23.3.x] c/health: reduced number of health_node_report copies #18017

Merged
