-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Made partition move cancellations more robust #8322
Conversation
c83e298
to
e7222ec
Compare
e7222ec
to
1d44749
Compare
* │ │ │ | ||
* │ ▼ │ | ||
* │ ┌──────────┐ │ | ||
* │ finish │ moving │ revert_cancel │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can repurpose the finish command for revert_cancel? I.e. if we are in the moving(B -> A)
state, revert_cancel becomes finish(B)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was considering this, but when it comes to writing the code it would be much more complicated. Handling of finish command already is pretty complex and i didn't want to add to it. Given we introduce the feature in the feature table adding a new command is not really an effort and i think it makes handling a little bit easier.
Introduced a feature enabling partition move reverts. The feature is needed as the revert is going to be based on a new controller command. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
When partition is being moved new replicas are allocated on the nodes were the partition replicas are moved to but until the move finishes the old replicas are still present on the nodes removed from the replica set. Updated topic dispatcher logic to reflect that. Now we remove allocation from partition allocator only in finsh moving command handler. Signed-off-by: Michal Maslanka <michal@redpanda.com>
1d44749
to
c39b2fc
Compare
Added handling for `revert_cancel_partition_move` command to the topic table. The command is going to be emitted by controller_backend when partition underlying raft group reconfiguration that is requested to be canceled is already finished. The command finishes move and updates the topic partition with replica set aligned with the target of original move command. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Controller backend is now dispatching replication of revert cancel partition move command when partition reconfiguration finished before backend had a chance to cancel it. When supported the revert command is used by the backend instead of reconfiguring the partition raft group to the replica set from before the requested move. Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
c39b2fc
to
10ec8c8
Compare
// move hasn't yet requested | ||
if (not_yet_moved) { | ||
// just update configuration revision | ||
auto it = _topics.local().all_topics_metadata().find( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the difference between the code in this branch and update_partition_replica_set
is that the latter is wrapped in apply_configuration_change_on_leader
, but why do we do it in the force-abort case and not in the soft cancel case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whole cancel_replica_set_update
is executed on the leader. Force cancellation should work even when there is no leader. But when configuration change already finished there must be a leader as it was successful, hence we dispatch this last update piece on the leader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just catching up on reviews.
Problem statement
Since reverting partition configuration involves a genuine Raft group reconfiguration it is vulnerable to all the same issues that may prevent raft reconfiguration from finishing.
Since the replica set of partition was already updated in the topic table the controller backend reconciliation loop would wait for the reconfiguration to finish and as the movement was already cancelled it can not be cancelled again.
Example
Consider a situation where in a 4-nodes cluster containing nodes
(0,1,2,3)
a partition is being moved from replica set(0,1,2)
to(1,2,3)
. In this case a node 3 is added to the replica set while node 0 is removed. When configuration change is requested Raft configuration is going through the set of changes as described in [1] if cancel is requested after raft reconfiguration to(1,2,3)
finished but before the finish_moving_replicas command was replicated controller backend will instruct raft to revert configuration back to(0,1,2)
. The first step involving change from(1,2,3)
to(0,1,2)
is adding node 0 as a learner to the current configuration i.e. there will be one learner - 0. If node 0 is unavailable the raft reconfiguration will never finish as node 0 will never be promoted to the voter.Solution
Added handling for
revert_cancel_partition_move
command to the topic table. The command is going to be emitted by controller_backend when partition underlying raft group reconfiguration that is requested to be canceled is already finished. The command finishes move and updates the topic partition with replica set aligned with the target of originalmove command.
Partition movement state transitions:
When partition is being moved from replica set A to replica set B it may
either be finished or cancelled. The cancellation on the other hand may be
reverted. The partition movement state transitions are explained in the
following diagram:
Backports Required
UX Changes
Release Notes
Improvements