cluster: publish total available reclaim size to balancer #16354

Merged: 2 commits merged into redpanda-data:dev on Feb 1, 2024

Conversation

@dotnwat dotnwat commented Jan 29, 2024

SM publishes reclaimable space up to the local retention level to the cluster balancer. However, a disk may be nearly full even at the local retention target, causing the balancer to believe that no space is available for reclaim. This is problematic for decommission, which needs to be able to find somewhere to create new replicas.

This change swaps out the reclaimable space at the local retention level for the total reclaimable space. The balancer policy will be expanded further to also take the local retention targets into account and optimize its decisions. The changes for that are in #16372.
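The difference between the two published values can be illustrated with a minimal sketch. This is not the Redpanda implementation: `PartitionUsage`, its fields, and both helper functions are hypothetical names, and the model assumes that each partition has some amount of local data that is safe to delete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionUsage:
    """Hypothetical per-partition disk accounting (all sizes in bytes)."""
    local_bytes: int             # data currently held on the local disk
    local_retention_target: int  # size the local retention policy aims for
    removable_bytes: int         # assumed: local data that is safe to delete


def reclaimable_to_local_retention(p: PartitionUsage) -> int:
    # Old report: only data above the local retention target counts.
    excess = max(0, p.local_bytes - p.local_retention_target)
    return min(excess, p.removable_bytes)


def total_reclaimable(p: PartitionUsage) -> int:
    # New report: everything that is safe to delete counts,
    # even if removing it goes below the local retention target.
    return min(p.local_bytes, p.removable_bytes)


def node_report(partitions, per_partition) -> int:
    # Sum a per-partition estimate into a single node-level figure.
    return sum(per_partition(p) for p in partitions)


# Both partitions already sit at or below their local retention target.
partitions = [
    PartitionUsage(local_bytes=100, local_retention_target=100, removable_bytes=60),
    PartitionUsage(local_bytes=80, local_retention_target=90, removable_bytes=30),
]
old_report = node_report(partitions, reclaimable_to_local_retention)
new_report = node_report(partitions, total_reclaimable)
```

In this made-up scenario `old_report` is 0 while `new_report` is 90, which mirrors the stuck case described above: under the old report the balancer sees nothing to reclaim on the node.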

Related https://github.com/redpanda-data/core-internal/issues/719

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Improvements

  • Publish total reclaimable space to avoid a stuck decommission scenario.

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
@vbotbuildovich

new failures in https://buildkite.com/redpanda/redpanda/builds/44451#018d57bd-426f-4946-94f7-cdf8271819d1:

"rptest.tests.audit_log_test.AuditLogTestOauth.test_admin_oauth"

@dotnwat dotnwat changed the title from "cluster: expose local retention and total available reclaim sizes" to "cluster: total available reclaim sizes" on Jan 30, 2024
@dotnwat dotnwat self-assigned this Jan 30, 2024
@dotnwat dotnwat changed the title from "cluster: total available reclaim sizes" to "cluster: publish total available reclaim size to balancer" on Jan 30, 2024
@andrwng andrwng left a comment

With this change alone, is the balancer actually balancing any differently? Or does this just change the conditions under which the balancer will balance? I think the latter, but I just want to make sure I understand.

ztlpn commented Jan 31, 2024

@andrwng the answer is both!

  • changes to total size report will affect the triggering conditions ("is this node under disk pressure so that some replicas have to be moved off of it?") as well as individual replica placement decisions ("can I place a replica on this node without causing disk pressure?")
  • changes to per-partition size report will affect individual placement decisions too ("if I move this partition replica, what is the amount of disk space it will occupy on the destination node if it is under disk pressure?")
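The two kinds of decisions described above can be sketched as follows. This is a simplified illustration, not the actual balancer code: `NodeDisk`, `MAX_USAGE_RATIO`, and the helper functions are hypothetical, and the model assumes that reported reclaimable space is simply subtracted from used space before comparing against a usage threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeDisk:
    """Hypothetical node-level disk report (all sizes in bytes)."""
    total_bytes: int
    used_bytes: int
    reclaimable_bytes: int  # the figure published to the balancer


MAX_USAGE_RATIO = 0.8  # assumed threshold, not taken from the source


def effective_used(node: NodeDisk) -> int:
    # Assumption: reclaimable space is treated as if it were already free.
    return node.used_bytes - node.reclaimable_bytes


def under_disk_pressure(node: NodeDisk, max_ratio: float = MAX_USAGE_RATIO) -> bool:
    # Triggering condition: should replicas be moved off this node?
    return effective_used(node) / node.total_bytes > max_ratio


def can_place(node: NodeDisk, replica_bytes: int,
              max_ratio: float = MAX_USAGE_RATIO) -> bool:
    # Placement decision: would adding this replica cause disk pressure?
    return (effective_used(node) + replica_bytes) / node.total_bytes <= max_ratio


# The same disk, reported with and without reclaimable space.
nothing_reclaimable = NodeDisk(total_bytes=1000, used_bytes=900, reclaimable_bytes=0)
some_reclaimable = NodeDisk(total_bytes=1000, used_bytes=900, reclaimable_bytes=300)
```

With these made-up numbers, `nothing_reclaimable` is under disk pressure and rejects a 50-byte replica, while `some_reclaimable` is not under pressure and accepts it. The `replica_bytes` argument stands in for the per-partition size report: a larger reported size (for example 250) is rejected even on the second node.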

andrwng commented Jan 31, 2024

Got it, thanks for the explanation! That makes sense; I recalled that we don't try to balance space across nodes, but it makes sense that we gate individual moves based on whether there is available/reclaimable space.

@dotnwat dotnwat merged commit 236b5ea into redpanda-data:dev Feb 1, 2024
18 of 20 checks passed
@vbotbuildovich

/backport v23.3.x

@vbotbuildovich

/backport v23.2.x

@vbotbuildovich

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16354-v23.2.x-948 remotes/upstream/v23.2.x
git cherry-pick -x 930db9f0c9e3828656895a80c4d8c0fecaeed029 20e10636780195e0ca1dc8b6d79a3ad998466c05


4 participants