Distribute health report collection #17158

mmaslankaprv · 2024-03-18T09:26:18Z

This PR changes the health monitor backend so that health report collection is distributed across the cluster. Previously, every node queried the cluster health from the leader of the redpanda/controller/0 partition. This put additional pressure on that node because it had to serialize the node reports.

With this change, every node queries every other node directly to collect health report statistics. The overhead of serializing reports and handling health report requests is therefore spread evenly across all nodes in the cluster.
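
As a rough illustration of the load shift described above, the Python sketch below counts how many node reports each node serializes in one collection round. It is a toy model: the function names and the counting rules are assumptions made for illustration and are not taken from the Redpanda code base.

```python
def centralized_load(nodes, leader):
    """Toy model of the old scheme: every node asks the controller leader.

    Assumption: the leader fetches each follower's report once, then
    serializes one report per cluster node in every reply it sends.
    """
    load = {n: 0 for n in nodes}
    followers = [n for n in nodes if n != leader]
    for follower in followers:
        # The follower serializes its own report once for the leader.
        load[follower] += 1
    for _requester in followers:
        # The leader serializes the whole cluster report for each requester.
        load[leader] += len(nodes)
    return load


def distributed_load(nodes):
    """Toy model of the new scheme: every node queries every other node."""
    load = {n: 0 for n in nodes}
    for requester in nodes:
        for peer in nodes:
            if peer != requester:
                # The peer serializes only its own report for the requester.
                load[peer] += 1
    return load
```

In this model, a five-node cluster with node 0 as leader gives the leader 20 serializations and each follower 1 under the old scheme, versus 4 per node under the new one. The totals are similar, but the work no longer concentrates on a single node.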

Backports Required

Release Notes

Improvements

Lower overhead of health report collection.

vbotbuildovich · 2024-03-18T11:44:00Z

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46352#018e5112-4b55-4a31-8cac-a6589b54a3dc

vbotbuildovich · 2024-03-18T11:54:48Z

new failures in https://buildkite.com/redpanda/redpanda/builds/46352#018e5123-8908-4be6-a8d9-65917ff8736a:

"rptest.tests.offset_for_leader_epoch_archival_test.OffsetForLeaderEpochArchivalTest.test_querying_remote_partitions.remote_reads=.False.True"

new failures in https://buildkite.com/redpanda/redpanda/builds/46397#018e5380-ef83-48ca-9be7-49e0e580eae6:

"rptest.tests.offset_for_leader_epoch_archival_test.OffsetForLeaderEpochArchivalTest.test_querying_archive"

mmaslankaprv · 2024-03-18T19:44:42Z

/ci-repeat 1

bharathv

controller leader likes this change (rest of the nodes don't :P)

StephanDollberg

Nice, so this is even fully backwards compatible as we just reuse the APIs that the controller was already using.

vbotbuildovich · 2024-03-19T10:22:58Z

/backport v23.3.x

vbotbuildovich · 2024-03-19T10:23:55Z

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17158-v23.3.x-694 remotes/upstream/v23.3.x
git cherry-pick -x 4e64c0c773bf05b63b0966de47d51892ab455629 7517e9c124243178b55f53b5e6c538d3777b1141 df0c94b47fbe9ef490577cb4447ccac249dc70b5 086c0328c622eb93b13cb85cab00955dd90dc19e 977b21b75ce4d71137c69836dfc7b5c6b10a4df2 37b7f24a2d430654b6d02cad979c50e8ffc61a7d

Workflow run logs.

mmaslankaprv added 3 commits March 18, 2024 10:12

c/tests: refactored health monitor backend unit test

4e64c0c

The partition sizes contained in a node health report do not have to be equal on all nodes. Change the health monitor test to account for that. Signed-off-by: Michal Maslanka <michal@redpanda.com>
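
One possible way to make such a comparison tolerant of differing sizes is sketched below in Python. The `PartitionStatus` type and its fields are hypothetical, simplified stand-ins, and this is not the actual change made to the unit test.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PartitionStatus:
    # Hypothetical, simplified per-partition entry of a node health report.
    ntp: str
    leader_id: int
    size_bytes: int


def without_size(statuses):
    """Zero out sizes and sort by partition so reports can be compared."""
    normalized = (replace(s, size_bytes=0) for s in statuses)
    return sorted(normalized, key=lambda s: s.ntp)


def same_ignoring_size(report_a, report_b):
    """True when two reports agree on everything except partition sizes."""
    return without_size(report_a) == without_size(report_b)
```

With this helper, two reports that list the same partitions and leaders compare equal even when the replicas on different nodes report different sizes.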

c/health_monitor_backend: cache self id for ergonomy of use

df0c94b

Signed-off-by: Michal Maslanka <michal@redpanda.com>

github-actions bot added the area/redpanda label Mar 18, 2024

mmaslankaprv added 3 commits March 18, 2024 13:05

c/health_monitor_backend: skip intermediate type while collecting status

086c032

Signed-off-by: Michal Maslanka <michal@redpanda.com>

c/health_monitor_backend: remove one redundant copy of health report

977b21b

Signed-off-by: Michal Maslanka <michal@redpanda.com>

tests: moved health report related benchmark to cluster module

37b7f24

Signed-off-by: Michal Maslanka <michal@redpanda.com>

mmaslankaprv requested review from dotnwat, bharathv, ztlpn and StephanDollberg March 18, 2024 19:44

bharathv approved these changes Mar 18, 2024

StephanDollberg approved these changes Mar 19, 2024

mmaslankaprv merged commit 4ad77c3 into redpanda-data:dev Mar 19, 2024
11 checks passed

mmaslankaprv deleted the health-update branch March 19, 2024 10:22

vbotbuildovich mentioned this pull request Mar 19, 2024

[v23.3.x] Distribute health report collection #17180

Closed

mmaslankaprv mentioned this pull request Mar 25, 2024

[v23.3.x] Distribute health report collection #17360

Merged

WillemKauf mentioned this pull request Apr 11, 2024

CI Failure (TooManyRedirects: Exceeded 30 redirects) in CloudStorageScrubberTest.test_scrubber #17149

Closed
