
Properly integrate replica set state switching in resharding transfers #4480

Merged
merged 26 commits into from
Jun 18, 2024

Conversation

timvisee
Member

Tracked in: #4213

In #4373 I implemented a first iteration of point migration. In practice, transferring still appeared to be broken because a lot more had to be done for replica set state switching and shard routing.

This PR does all that needs to be done in terms of shard transfers to make resharding work. Specifically, to actually migrate points from all shards to the new target shard, and to replicate the new shard matching the configured replication factor.

Most importantly, this does:

  • support transfers with a different source/target shard (for resharding)
  • extend the unique shard key with a target shard field (for resharding)
  • resharding transfer must merge points in target shard, it should not replace them
  • the resharding driver state must be aware of the peers we know about
  • use correct replica set state switching for starting and finishing resharding transfers
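The first two bullets can be sketched as follows. This is an illustrative, simplified model of a resharding-aware transfer, not Qdrant's actual types: the struct and method names (`ShardTransfer`, `target_shard`, `is_resharding`) are hypothetical, though the `to_shard_id: Option<_>` field mirrors the shape used in this PR, where a regular transfer moves the same shard between peers and a resharding transfer targets a different shard ID.

```rust
// Hypothetical, simplified model of a resharding-aware shard transfer.
// Only the source/target shard resolution from this PR is sketched here.
#[derive(Debug, Clone, Copy)]
struct ShardTransfer {
    /// Source shard ID
    shard_id: u32,
    /// Target shard ID; only set for resharding transfers
    to_shard_id: Option<u32>,
}

impl ShardTransfer {
    /// Resolve the effective target shard: for a regular transfer the target
    /// is the same shard on another peer; for resharding it differs.
    fn target_shard(&self) -> u32 {
        self.to_shard_id.unwrap_or(self.shard_id)
    }

    /// A transfer is a resharding transfer iff source and target shards differ.
    fn is_resharding(&self) -> bool {
        self.target_shard() != self.shard_id
    }
}

fn main() {
    let regular = ShardTransfer { shard_id: 2, to_shard_id: None };
    let resharding = ShardTransfer { shard_id: 2, to_shard_id: Some(5) };
    assert_eq!(regular.target_shard(), 2);
    assert!(!regular.is_resharding());
    assert_eq!(resharding.target_shard(), 5);
    assert!(resharding.is_resharding());
}
```

Defaulting the target to the source shard via `unwrap_or` keeps regular (non-resharding) transfers working unchanged.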

What doesn't work yet and will be resolved in a separate PR:

  • migrate points within the same node

I've extensively tested this locally with various scenarios, such as different shard numbers, replication factors, and target nodes. All works well.

We cannot (automatically) test this yet, because the consensus operation for finalizing resharding is missing. Once that's there, I'll add an integration test to assert correct behavior.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using the cargo clippy --all --all-features command?

@timvisee timvisee requested a review from ffuugoo June 17, 2024 14:19
@timvisee timvisee mentioned this pull request Jun 17, 2024
76 tasks
Comment on lines +61 to +77
let this_peer_id = consensus.this_peer_id();
let is_receiver = this_peer_id == shard_transfer.to;
let is_sender = this_peer_id == shard_transfer.from;

// Get the source and target shards, in case of resharding the target shard is different
let from_shard_id = shard_transfer.shard_id;
let to_shard_id = shard_transfer
.to_shard_id
.unwrap_or(shard_transfer.shard_id);

let shards_holder = self.shards_holder.read().await;
let from_replica_set = shards_holder.get_shard(&from_shard_id).ok_or_else(|| {
CollectionError::service_error(format!("Shard {from_shard_id} doesn't exist"))
})?;
let to_replica_set = shards_holder.get_shard(&to_shard_id).ok_or_else(|| {
CollectionError::service_error(format!("Shard {to_shard_id} doesn't exist"))
})?;
Member Author


Because we now have transfers with a different source and target shard ID (in case resharding), we need a separate source and target replica set in our shard transfer logic. This theme will repeat throughout this PR in various places.

Comment on lines +812 to +817
// List the peers in active or resharding state
// If resharding, there will be only one shard in resharding state and we should not
// consider all to be dead
// TODO(resharding): accept resharding state as active like below?
let active_or_resharding_peers = peers.iter().filter_map(|(&peer_id, &state)| {
matches!(state, ReplicaState::Active | ReplicaState::Resharding).then_some(peer_id)
Member Author


When resharding, we only have one shard in the Resharding state, and none in Active state.

This prevents messing with our locally disabled peers list. I'm not too happy with it though, maybe there's a better place where we can prevent this behavior.
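The filter quoted above can be sketched as a self-contained example. The `ReplicaState` enum here is reduced to the variants relevant to this comment, and the peer-ID type and function name are illustrative; only the `matches!(...).then_some(...)` filtering pattern is taken from the snippet in this PR.

```rust
use std::collections::HashMap;

// Illustrative sketch of the peer filtering discussed above; ReplicaState is
// reduced to the variants relevant here and peer IDs are plain u64s.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReplicaState {
    Active,
    Resharding,
    Dead,
}

/// Collect peers that are Active or, during resharding, in the Resharding
/// state, so the single resharding replica is not treated as dead.
fn active_or_resharding_peers(peers: &HashMap<u64, ReplicaState>) -> Vec<u64> {
    let mut ids: Vec<u64> = peers
        .iter()
        .filter_map(|(&peer_id, &state)| {
            matches!(state, ReplicaState::Active | ReplicaState::Resharding).then_some(peer_id)
        })
        .collect();
    ids.sort_unstable(); // deterministic order for display/testing only
    ids
}

fn main() {
    let peers = HashMap::from([
        (1, ReplicaState::Active),
        (2, ReplicaState::Dead),
        (3, ReplicaState::Resharding),
    ]);
    assert_eq!(active_or_resharding_peers(&peers), vec![1, 3]);
}
```

Without including `Resharding` here, the lone resharding replica would look dead and could wrongly end up in the locally disabled peers list, which is the behavior the comment above describes preventing.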

@timvisee timvisee merged commit 3255719 into dev Jun 18, 2024
17 checks passed
@timvisee timvisee deleted the resharding-shard-state-routing branch June 18, 2024 15:51
generall pushed a commit that referenced this pull request Jun 21, 2024

Properly integrate replica set state switching in resharding transfers (#4480)

* Add serde/validate attributes to resharding operations, matching others

* Fix broken comment

* Add debug message for resharding driver entering stages

* Fix shard transfer start setting state of wrong replica for resharding

* Remove obsolete clones

* Add target shard ID to shard key, add relevant gRPC types

* Move target shard ID below source shard ID field

* Rename collection_name to collection_id

* Reformat

* Transferring point batches must merge points in case of resharding

* In resharding state, sync list of peers on start

* Add logic for setting replica set state through consensus dispatcher

* Properly start resharding transfer

* Properly finish resharding transfers, set shard state correctly

* Fix shard transfer initialisation with different target shard

* Fix shard state handling with resharding on all nodes on transfer start

* Don't reset locally disabled state if only existing shard is resharding

* Add important TODOs

* Update OpenAPI and gRPC specification

* Elaborate on some logic in code with comments

* Use user configured shard transfer method for replication

* Add debug assert, on transfer start we should not replace existing shard

* On transfer start, be aware of different sender/receiver local states

This fixes transfers where we might not have a replica on all nodes

* Fix shard transfer not setting cutoff point on target shard

* While resharding, migrate shards in numerical order

* On shard transfer initialisation, ensure transfer targets given shard
@timvisee timvisee added this to the v1.11: resharding milestone Jul 18, 2024