Distribute health report collection #17158

Merged 6 commits on Mar 19, 2024

@mmaslankaprv mmaslankaprv commented Mar 18, 2024

Change the health monitor backend to distribute health report collection. Previously, every node queried the cluster health from the `redpanda/controller/0` partition leader. This put additional pressure on that node, as it had to handle serialization of the node reports.

With this change, each node queries every other node directly to collect that node's health report. This way the overhead of serializing reports and handling health report requests is evenly distributed among all the nodes in the cluster.
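
To illustrate why this spreads the load, here is a toy model in Python. It is not Redpanda code: the function names and the cost model (one unit of work per serialized node report, and a leader that serializes the full cluster report for each requesting follower) are assumptions made only for this sketch.

```python
def centralized_load(nodes, leader):
    """Node reports serialized per node when followers ask the leader (assumed model)."""
    load = {n: 0 for n in nodes}
    followers = [n for n in nodes if n != leader]
    # The leader gathers one node report from every other node.
    for n in followers:
        load[n] += 1
    # Each follower then requests the whole cluster report from the leader,
    # so the leader serializes len(nodes) node reports per request.
    for _ in followers:
        load[leader] += len(nodes)
    return load


def distributed_load(nodes):
    """Node reports serialized per node when every node queries every other node."""
    load = {n: 0 for n in nodes}
    for requester in nodes:
        for target in nodes:
            if target != requester:
                # The target serializes only its own node report.
                load[target] += 1
    return load


if __name__ == "__main__":
    cluster = [0, 1, 2, 3, 4]
    print("centralized:", centralized_load(cluster, leader=0))
    print("distributed:", distributed_load(cluster))
```

Under these assumptions, a five-node cluster has the leader serializing 20 node reports per round in the centralized scheme while each follower serializes 1. In the distributed scheme every node serializes 4. The non-leader nodes do slightly more work than before, but no single node becomes a hotspot.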

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • reduced overhead of health report collection

The partition sizes contained in node health reports don't have to be
equal across all of the nodes. Change the health monitor test to account
for that fact.

Signed-off-by: Michal Maslanka <michal@redpanda.com>

vbotbuildovich commented Mar 18, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/46352#018e5123-8908-4be6-a8d9-65917ff8736a:

"rptest.tests.offset_for_leader_epoch_archival_test.OffsetForLeaderEpochArchivalTest.test_querying_remote_partitions.remote_reads=.False.True"

new failures in https://buildkite.com/redpanda/redpanda/builds/46397#018e5380-ef83-48ca-9be7-49e0e580eae6:

"rptest.tests.offset_for_leader_epoch_archival_test.OffsetForLeaderEpochArchivalTest.test_querying_archive"

@mmaslankaprv

/ci-repeat 1

@bharathv bharathv left a comment

controller leader likes this change (rest of the nodes don't :P)

@StephanDollberg StephanDollberg left a comment

Nice, so this is even fully backwards compatible as we just reuse the APIs that the controller was already using.

@mmaslankaprv mmaslankaprv merged commit 4ad77c3 into redpanda-data:dev Mar 19, 2024
11 checks passed
@mmaslankaprv mmaslankaprv deleted the health-update branch March 19, 2024 10:22
@vbotbuildovich

/backport v23.3.x

@vbotbuildovich

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17158-v23.3.x-694 remotes/upstream/v23.3.x
git cherry-pick -x 4e64c0c773bf05b63b0966de47d51892ab455629 7517e9c124243178b55f53b5e6c538d3777b1141 df0c94b47fbe9ef490577cb4447ccac249dc70b5 086c0328c622eb93b13cb85cab00955dd90dc19e 977b21b75ce4d71137c69836dfc7b5c6b10a4df2 37b7f24a2d430654b6d02cad979c50e8ffc61a7d

Workflow run logs.
