
Replicate new shard to match replication factor in resharding driver #4381

Merged: 4 commits into dev from reshard-driver-replicate on Jun 7, 2024

Conversation

@timvisee (Member) commented on Jun 3, 2024

Tracked in: #4213
Depends on: #4379

Implements the third stage of the resharding driver. It keeps replicating our new shard until we match the desired replication factor. This is dynamic: it follows the latest configured replication factor, even if it changes while this stage is running. If we don't have enough peers to place that many replicas, the stage completes early. A rough sketch of this loop is shown below.
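A minimal sketch of that loop, assuming hypothetical stand-ins (`replicate_stage`, `replicate_to_peer`, `desired_replicas`) rather than Qdrant's actual internals:

```rust
/// We can never place more replicas than we have peers, so if there aren't
/// enough peers we settle for fewer and finish this stage early.
fn desired_replicas(replication_factor: usize, peer_count: usize) -> usize {
    replication_factor.min(peer_count)
}

/// Keep replicating the new shard until the (dynamic) target is reached.
fn replicate_stage(
    mut replica_count: usize,
    read_replication_factor: impl Fn() -> usize,
    read_peer_count: impl Fn() -> usize,
    mut replicate_to_peer: impl FnMut() -> Result<(), String>,
) -> Result<(), String> {
    // Re-read the configured replication factor on every iteration, so the
    // loop follows changes made while replication is already in progress.
    while replica_count < desired_replicas(read_replication_factor(), read_peer_count()) {
        replicate_to_peer()?;
        replica_count += 1;
    }
    Ok(())
}

fn main() {
    // One replica exists, the factor is 3, and 5 peers are available.
    let result = replicate_stage(1, || 3, || 5, || {
        println!("replicating new shard to one more peer…");
        Ok(())
    });
    assert!(result.is_ok());
}
```

The key design point is that the target is recomputed each iteration, so the stage both picks up configuration changes mid-flight and caps the target at the number of available peers.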

The PR does not yet implement the following; these parts are marked as TODOs:

  • using the preferred shard transfer method, as configured by the user
  • balancing the shard distribution across the peers

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code with the cargo clippy --all --all-features command?

@timvisee timvisee requested a review from ffuugoo June 3, 2024 15:39
@timvisee timvisee mentioned this pull request Jun 3, 2024
Comment on lines 491 to 509
/// Await for a resharding shard transfer to succeed.
///
/// Yields on a successful transfer.
///
/// Returns an error if:
/// - the transfer failed or got aborted
/// - the transfer timed out
/// - no matching transfer is ongoing; it never started or went missing without a notification
async fn await_transfer_success(
    reshard_key: &ReshardKey,
    transfer: &ShardTransfer,
    shard_holder: &Arc<LockedShardHolder>,
    collection_id: &CollectionId,
    consensus: &dyn ShardTransferConsensus,
    await_transfer_end: impl Future<Output = CollectionResult<Result<(), ()>>>,
) -> CollectionResult<()> {
@timvisee (Member, Author) commented:
Though this and its usage may seem a bit complex, there are three important things it does (sketched after the list below):

  1. we subscribe to shard transfer changes before we actually start transfers, to ensure we don't miss notifications
  2. it enforces a timeout, we don't wait longer than 24 hours for a transfer to complete
  3. it does a periodic sanity check to ensure the transfer is still active in consensus; if it isn't, we assume it has failed (and panic in debug mode, because this cannot happen)
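A minimal tokio sketch of those three points, using hypothetical names (`await_transfer`, `transfer_is_active`, the `done` channel); this is not the PR's actual implementation:

```rust
use std::time::Duration;

use tokio::sync::oneshot;
use tokio::time::{interval, timeout};

const TRANSFER_TIMEOUT: Duration = Duration::from_secs(24 * 60 * 60); // 24 hours
const SANITY_CHECK_PERIOD: Duration = Duration::from_secs(60);

async fn await_transfer(
    // (1) This receiver is created *before* the transfer is started, so a
    // completion notification can never be missed.
    mut done: oneshot::Receiver<Result<(), ()>>,
    // Hypothetical check that consensus still knows about the transfer.
    transfer_is_active: impl Fn() -> bool,
) -> Result<(), String> {
    // (2) Never wait longer than the global timeout.
    timeout(TRANSFER_TIMEOUT, async {
        let mut check = interval(SANITY_CHECK_PERIOD);
        loop {
            tokio::select! {
                result = &mut done => {
                    return match result {
                        Ok(Ok(())) => Ok(()),
                        _ => Err("transfer failed or was aborted".to_string()),
                    };
                }
                // (3) Periodically verify the transfer is still known to
                // consensus; if it vanished, assume it failed (and panic in
                // debug builds, since this should be impossible).
                _ = check.tick() => {
                    if !transfer_is_active() {
                        debug_assert!(false, "transfer went missing from consensus");
                        return Err("transfer went missing without a notification".to_string());
                    }
                }
            }
        }
    })
    .await
    .map_err(|_| "transfer timed out".to_string())?
}

#[tokio::main]
async fn main() {
    // Subscribe first (the receiver exists), only then "start" the transfer.
    let (tx, rx) = oneshot::channel();
    tx.send(Ok(())).unwrap();
    assert!(await_transfer(rx, || true).await.is_ok());
}
```

Creating the `done` receiver before starting the transfer is what makes point 1 race-free: the notification channel exists before any completion event can fire.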

@timvisee timvisee marked this pull request as ready for review June 4, 2024 09:36
@ffuugoo (Contributor) left a comment:

LGTM. Sorry for taking so long to review this. 🙈

@timvisee timvisee merged commit b5a691e into dev Jun 7, 2024
17 checks passed
@timvisee timvisee deleted the reshard-driver-replicate branch June 7, 2024 10:16
timvisee added a commit that referenced this pull request Jun 11, 2024
Replicate new shard to match replication factor in resharding driver (#4381)

* Add replication stage for resharding

* Use consistent naming

* Simplify awaiting shard transfer success, add periodic sanity check

* Fix typos
@timvisee timvisee added this to the v1.11: resharding milestone Jul 18, 2024