Properly integrate replica set state switching in resharding transfers #4480
Conversation
let this_peer_id = consensus.this_peer_id();
let is_receiver = this_peer_id == shard_transfer.to;
let is_sender = this_peer_id == shard_transfer.from;

// Get the source and target shards, in case of resharding the target shard is different
let from_shard_id = shard_transfer.shard_id;
let to_shard_id = shard_transfer
    .to_shard_id
    .unwrap_or(shard_transfer.shard_id);

let shards_holder = self.shards_holder.read().await;
let from_replica_set = shards_holder.get_shard(&from_shard_id).ok_or_else(|| {
    CollectionError::service_error(format!("Shard {from_shard_id} doesn't exist"))
})?;
let to_replica_set = shards_holder.get_shard(&to_shard_id).ok_or_else(|| {
    CollectionError::service_error(format!("Shard {to_shard_id} doesn't exist"))
})?;
Because we now have transfers with a different source and target shard ID (in case of resharding), we need a separate source and target replica set in our shard transfer logic. This theme repeats throughout this PR in various places.
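For illustration, here is a minimal, self-contained sketch of the idea. The type and helper names below are simplified stand-ins, not the actual Qdrant definitions: a transfer carries an optional target shard ID, and every call site has to resolve the source/target pair instead of assuming a single shard ID.

// Simplified sketch, not the actual Qdrant types.
type PeerId = u64;
type ShardId = u32;

struct ShardTransfer {
    shard_id: ShardId,            // source shard
    to_shard_id: Option<ShardId>, // target shard, only set for resharding transfers
    from: PeerId,                 // sender peer
    to: PeerId,                   // receiver peer
}

impl ShardTransfer {
    // Resolve the (source, target) shard pair; for regular transfers both are the same.
    fn shard_ids(&self) -> (ShardId, ShardId) {
        (self.shard_id, self.to_shard_id.unwrap_or(self.shard_id))
    }
}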
// List the peers in active or resharding state
// If resharding, there will be only one shard in resharding state and we should not
// consider all to be dead
// TODO(resharding): accept resharding state as active like below?
let active_or_resharding_peers = peers.iter().filter_map(|(&peer_id, &state)| {
    matches!(state, ReplicaState::Active | ReplicaState::Resharding).then_some(peer_id)
});
When resharding, we only have one shard in the Resharding state, and none in the Active state.
This prevents messing with our locally disabled peers list. I'm not too happy with it, though; maybe there's a better place where we can prevent this behavior.
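As a hedged illustration of why this matters (simplified, hypothetical types and helper, not the actual replica set code): a replica set whose only replica is in the Resharding state must still count as having an alive peer, otherwise the locally disabled peers handling would treat the whole set as dead.

use std::collections::HashMap;

type PeerId = u64;

#[derive(Clone, Copy)]
enum ReplicaState {
    Active,
    Dead,
    Resharding,
}

// Hypothetical helper: list peers considered alive; Resharding counts alongside Active.
fn active_or_resharding_peers(peers: &HashMap<PeerId, ReplicaState>) -> Vec<PeerId> {
    peers
        .iter()
        .filter_map(|(&peer_id, &state)| {
            matches!(state, ReplicaState::Active | ReplicaState::Resharding).then_some(peer_id)
        })
        .collect()
}

fn main() {
    // During resharding the new replica set may contain a single replica in
    // the Resharding state; it must not be considered "all dead".
    let peers = HashMap::from([(1_u64, ReplicaState::Resharding)]);
    assert!(!active_or_resharding_peers(&peers).is_empty());
}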
Properly integrate replica set state switching in resharding transfers (#4480)

* Add serde/validate attributes to resharding operations, matching others
* Fix broken comment
* Add debug message for resharding driver entering stages
* Fix shard transfer start setting state of wrong replica for resharding
* Remove obsolete clones
* Add target shard ID to shard key, add relevant gRPC types
* Move target shard ID below source shard ID field
* Rename collection_name to collection_id
* Reformat
* Transferring point batches must merge points in case of resharding
* In resharding state, sync list of peers on start
* Add logic for setting replica set state through consensus dispatcher
* Properly start resharding transfer
* Properly finish resharding transfers, set shard state correctly
* Fix shard transfer initialisation with different target shard
* Fix shard state handling with resharding on all nodes on transfer start
* Don't reset locally disabled state if only existing shard is resharding
* Add important TODOs
* Update OpenAPI and gRPC specification
* Elaborate on some logic in code with comments
* Use user configured shard transfer method for replication
* Add debug assert, on transfer start we should not replace existing shard
* On transfer start, be aware of different sender/receiver local states

  This fixes transfers where we might not have a replica on all nodes

* Fix shard transfer not setting cutoff point on target shard
* While resharding, migrate shards in numerical order
* On shard transfer initialisation, ensure transfer targets given shard
Tracked in: #4213
In #4373 I implemented a first iteration of point migration. In practice, transferring still appeared to be broken because a lot more had to be done for replica set state switching and shard routing.
This PR does all that needs to be done in terms of shard transfers to make resharding work. Specifically, to actually migrate points from all shards to the new target shard, and to replicate the new shard matching the configured replication factor.
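As a rough sketch of that flow (hypothetical names, not the actual resharding driver; it only mirrors the steps described above and in the commit list: migrate each existing shard in numerical order into the new target shard, then replicate the new shard up to the configured replication factor):

// Hypothetical planning helper, for illustration only.
fn plan_resharding(source_shards: &[u32], target_shard: u32, replication_factor: usize) -> Vec<String> {
    let mut steps = Vec::new();

    // Migrate points from every existing shard into the new target shard, in numerical order.
    let mut sources: Vec<u32> = source_shards.to_vec();
    sources.sort_unstable();
    for shard_id in sources {
        steps.push(format!("migrate points: shard {shard_id} -> shard {target_shard}"));
    }

    // Replicate the new shard until it matches the configured replication factor,
    // using the user configured shard transfer method.
    for replica in 1..replication_factor {
        steps.push(format!("replicate shard {target_shard} (replica {replica} of {replication_factor})"));
    }

    steps
}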
Most importantly, this does:
What doesn't work yet and will be resolved in a separate PR:
I've extensively tested this locally with various scenarios, such as different numbers of shards, replication factors, and target nodes. All works well.
We cannot (automatically) test this yet, because the consensus operation for finalizing resharding is missing. Once that's there, I'll add an integration test to assert correct behavior.
All Submissions:
* Did you create your branch from dev?

New Feature Submissions:
* Have you formatted your code using the cargo +nightly fmt --all command prior to submission?
* Have you checked your code using the cargo clippy --all --all-features command?