Improvements & cleanup in node operations tests, fix for finishing node operations #7862

Merged

Member

@mmaslankaprv mmaslankaprv commented Dec 20, 2022

The members backend reconciliation loop processes a single node update at
a time. This limitation introduces a dependency between subsequent
partition rebalance phases. After a node is added, some partitions may be
moved to a node that is then requested to be decommissioned and shut down
before the previous rebalancing phase has finished, so decommissioning
must be prioritized over rebalancing.

Introduced a change that always executes the node decommission operation
first, without waiting for the rebalancing to finish. As part of the
node decommissioning process, all required reallocations (the ones that
target the decommissioned node) are canceled. The rebalance triggered by
the node addition is scheduled again after decommissioning finishes.
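
To illustrate the cancellation step described above, here is a minimal C++ sketch. The names `node_id`, `reallocation` and `cancel_moves_targeting` are hypothetical and made up for this example; they are not the actual `members_backend` types or functions, and the real change may track canceled moves differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types for illustration only.
using node_id = std::int32_t;

struct reallocation {
    int partition;   // partition whose replica is being moved
    node_id target;  // node the replica is being moved to
    bool cancelled;  // set once the move has been canceled
};

// Cancel every in-flight move that targets the node being decommissioned.
// Returns the affected partitions so that a later rebalance (for example the
// one triggered by a node addition) can plan them again.
inline std::vector<int>
cancel_moves_targeting(std::vector<reallocation>& moves, node_id decommissioned) {
    std::vector<int> affected;
    for (auto& m : moves) {
        if (!m.cancelled && m.target == decommissioned) {
            m.cancelled = true;
            affected.push_back(m.partition);
        }
    }
    return affected;
}
```

Returning the affected partitions is only one possible way to let the addition rebalance be rescheduled once decommissioning finishes.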

Test improvements

Added failure injection to the random node operations test and refactored the node operations fuzzy test to share the same code.

Fixes: #7874

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • no need to wait for all reallocations to finish before decommissioning a node

@mmaslankaprv mmaslankaprv changed the title Improvements & cleanup in node operations tests Improvements & cleanup in node operations tests, fix for finishing node operations Dec 20, 2022
@mmaslankaprv mmaslankaprv added this to the v22.3.x-next milestone Dec 20, 2022
@piyushredpanda piyushredpanda modified the milestones: v22.3.x-next, v22.3.10 Dec 20, 2022
@mmaslankaprv
Member Author

ci failure: k8s

Resolved review threads:
src/v/kafka/server/handlers/metadata.cc (outdated)
src/v/cluster/members_backend.cc (outdated)
src/v/cluster/members_backend.cc (outdated)
tests/rptest/utils/node_operations.py (outdated)
tests/rptest/utils/node_operations.py
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Refactored the node operations fuzzy test to share logic with its
smaller version, the random node operations test.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Added learner recovery throttling to prevent a node from finishing
decommissioning before it is recommissioned.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv merged commit 2310a4b into redpanda-data:dev Dec 21, 2022
@mmaslankaprv
Member Author

/backport v22.3.x

@vbotbuildovich
Collaborator

Failed to run cherry-pick command. I executed the below command:

git cherry-pick -x 7825d52b2a736a8de8609e12d062d96ffda0e3a2 18fee690153a0cf2e7bea9a29885e5c619f4097e cd561d154edd46f9dcbaa222ed41113cf7c3e241 a32a029f01863b31948330641363e623e3ccf71e 713940a13335e2c03036cdf44187010914096f74 d970a5e481651ad23761203c8699fc4c2b89e54f ff7f2a305c1c28d71a571b66c71692fa16b933d8

Workflow run logs.

@mmaslankaprv mmaslankaprv deleted the rebalancing-tests-follow-up branch December 21, 2022 16:06
}
// sort updates to prioritize decommissions/recommissions over node
// additions, use stable sort to keep de/recommissions order
static auto is_de_or_recommission = [](const update_meta& meta) {
Member

I don't think there is any point in making this static (it might even introduce thread-safety overhead), since the lambda captures no state.
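
For reference, a minimal sketch of the same prioritization with a plain (non-static) local lambda. The `update_meta` fields, the `update_type` enum and the function `prioritize_updates` are assumptions made up for this example, not the actual `members_backend.cc` definitions. Whether a `static` captureless lambda really gets an initialization guard depends on the compiler and language standard; the non-static form simply avoids the question.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for the real update types.
enum class update_type { added, decommissioned, recommissioned };

struct update_meta {
    update_type type;
    int node;
};

inline void prioritize_updates(std::vector<update_meta>& updates) {
    // Captureless local lambda: no state to keep between calls, so there is
    // nothing to gain from giving it static storage duration.
    auto is_de_or_recommission = [](const update_meta& meta) {
        return meta.type == update_type::decommissioned
               || meta.type == update_type::recommissioned;
    };

    // Stable sort: de/recommissions move ahead of node additions, and the
    // relative order inside each group is preserved.
    std::stable_sort(
      updates.begin(),
      updates.end(),
      [&](const update_meta& lhs, const update_meta& rhs) {
          return is_de_or_recommission(lhs) && !is_de_or_recommission(rhs);
      });
}
```

The comparator only returns true when the left element is a de/recommission and the right one is not, which makes it a valid strict weak ordering for `std::stable_sort`.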

Development

Successfully merging this pull request may close these issues.

test: test_flipping_decommission_recommission timing out
5 participants