Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made partition move cancellations more robust #8322

Merged
merged 9 commits into from
Jan 23, 2023

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Jan 19, 2023

Problem statement

Since reverting partition configuration involves a genuine Raft group reconfiguration it is vulnerable to all the same issues that may prevent raft reconfiguration from finishing.

Since the replica set of partition was already updated in the topic table the controller backend reconciliation loop would wait for the reconfiguration to finish and as the movement was already cancelled it can not be cancelled again.

Example

Consider a situation where in a 4-nodes cluster containing nodes (0,1,2,3) a partition is being moved from replica set (0,1,2) to (1,2,3). In this case a node 3 is added to the replica set while node 0 is removed. When configuration change is requested Raft configuration is going through the set of changes as described in [1] if cancel is requested after raft reconfiguration to (1,2,3) finished but before the finish_moving_replicas command was replicated controller backend will instruct raft to revert configuration back to (0,1,2). The first step involving change from (1,2,3) to (0,1,2) is adding node 0 as a learner to the current configuration i.e. there will be one learner - 0. If node 0 is unavailable the raft reconfiguration will never finish as node 0 will never be promoted to the voter.

Solution

Added handling for revert_cancel_partition_move command to the topic table. The command is going to be emitted by controller_backend when partition underlying raft group reconfiguration that is requested to be canceled is already finished. The command finishes move and updates the topic partition with replica set aligned with the target of original
move command.

Partition movement state transitions:

When partition is being moved from replica set A to replica set B it may
either be finished or cancelled. The cancellation on the other hand may be
reverted. The partition movement state transitions are explained in the
following diagram:


 A - replica set A
 B - replica set B

 ┌──────────┐   move           ┌──────────┐       finish         ┌──────────┐
 │  static  │                  │  moving  │                      │  static  │
 │          ├─────────────────►│          ├─────────────────────►│          │
 │    A     │                  │  A -> B  │                      │    B     │
 └──────────┘                  └────┬─────┘                      └──────────┘
      ▲                             │                                  ▲
      │                             │                                  │
      │                             │ cancel/force_cancel              │
      │                             │                                  │
      │                             │                                  │
      │                             ▼                                  │
      │                        ┌──────────┐                            │
      │        finish          │  moving  │     revert_cancel          │
      └────────────────────────┤          ├────────────────────────────┘
                               │  B -> A  │
                               └──────────┴─────┐
                                    ▲           │
                                    │           │
                                    │           │
                                    │           │
                                    └───────────┘
                                        force_cancel

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Improvements

  • more robust partition movements related with decommissioning/recommissioning nodes that are unavailable

@mmaslankaprv mmaslankaprv force-pushed the revert-cancel branch 4 times, most recently from c83e298 to e7222ec Compare January 20, 2023 14:39
@mmaslankaprv mmaslankaprv changed the title Revert cancel Made partition move cancellations more robust Jan 20, 2023
src/v/cluster/topic_updates_dispatcher.cc Show resolved Hide resolved
src/v/features/feature_table.h Outdated Show resolved Hide resolved
src/v/cluster/types.h Outdated Show resolved Hide resolved
src/v/cluster/types.h Outdated Show resolved Hide resolved
src/v/cluster/topic_table.h Outdated Show resolved Hide resolved
src/v/cluster/topic_table.h Outdated Show resolved Hide resolved
* │ │ │
* │ ▼ │
* │ ┌──────────┐ │
* │ finish │ moving │ revert_cancel │
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can repurpose the finish command for revert_cancel? I.e. if we are in the moving(B -> A) state, revert_cancel becomes finish(B).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering this, but when it comes to writing the code it would be much more complicated. Handling of finish command already is pretty complex and i didn't want to add to it. Given we introduce the feature in the feature table adding a new command is not really an effort and i think it makes handling a little bit easier.

src/v/cluster/controller_backend.cc Show resolved Hide resolved
Introduced a feature enabling partition move reverts. The feature is
needed as the revert is going to be based on a new controller command.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv marked this pull request as ready for review January 23, 2023 07:47
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
When partition is being moved new replicas are allocated on the nodes
were the partition replicas are moved to but until the move finishes the
old replicas are still present on the nodes removed from the
replica set. Updated topic dispatcher logic to reflect that. Now we
remove allocation from partition allocator only in finsh moving command
handler.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Added handling for `revert_cancel_partition_move` command to the topic
table. The command is going to be emitted by controller_backend when
partition underlying raft group reconfiguration that is requested to be
canceled is already finished. The command finishes move and updates the
topic partition with replica set aligned with the target of original
move command.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Controller backend is now dispatching replication of revert cancel
partition move command when partition reconfiguration finished before
backend had a chance to cancel it. When supported the revert command is
used by the backend instead of reconfiguring the partition raft group to
the replica set from before the requested move.

Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
Signed-off-by: Michal Maslanka <michal@redpanda.com>
// move hasn't yet requested
if (not_yet_moved) {
// just update configuration revision
auto it = _topics.local().all_topics_metadata().find(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the difference between the code in this branch and update_partition_replica_set is that the latter is wrapped in apply_configuration_change_on_leader, but why do we do it in the force-abort case and not in the soft cancel case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whole cancel_replica_set_update is executed on the leader. Force cancellation should work even when there is no leader. But when configuration change already finished there must be a leader as it was successful, hence we dispatch this last update piece on the leader

@mmaslankaprv mmaslankaprv merged commit 760b8a8 into redpanda-data:dev Jan 23, 2023
Copy link
Contributor

@bharathv bharathv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just catching up on reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants