Fix resharding cleanup datarace with update queue #9014

Merged: timvisee merged 4 commits into dev from fix-resharding-clean-update-queue on May 21, 2026
@timvisee (Member) commented May 12, 2026

Two systems conflict:

  1. Resharding's propagate deletes stage: deletes points that no longer belong in the (old) shard
  2. Update queue: holds pending operations that have not landed in segments yet

During resharding up, we copy a portion of points from all existing shards into a new shard. We leave the originals in place for some time to prevent read consistency problems. At the end of resharding, we delete these moved points from the old shards, as they no longer belong there.

Combined with the update queue, this allows a data race. Queued operations are not cleared, yet they may still create or update points that no longer belong in the shard. So, after the old points have been cleared, the same points may land in the shard again from the update queue.
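To make the ordering problem concrete, here is a minimal, single-threaded sketch of the race. It is a toy model, not Qdrant code: `Shard`, `Op`, `clean` and `drain_queue` are hypothetical names, and the rule that even point IDs moved to the new shard is an assumption made purely for illustration.

```rust
use std::collections::{BTreeSet, VecDeque};

/// Hypothetical stand-in for a queued update operation.
#[derive(Debug, Clone, Copy)]
enum Op {
    Upsert(u64),
}

/// Toy shard: points already applied to segments, plus a queue of pending operations.
#[derive(Default)]
struct Shard {
    segments: BTreeSet<u64>,
    update_queue: VecDeque<Op>,
}

impl Shard {
    /// Cleanup: delete every applied point that no longer belongs in this shard.
    /// It only looks at `segments`, not at the pending queue.
    fn clean(&mut self, belongs_here: impl Fn(u64) -> bool) {
        self.segments.retain(|id| belongs_here(*id));
    }

    /// Apply all pending operations to the segments, in FIFO order.
    fn drain_queue(&mut self) {
        while let Some(Op::Upsert(id)) = self.update_queue.pop_front() {
            self.segments.insert(id);
        }
    }
}

/// Returns the point IDs left in the old shard when cleanup runs before the queue drains.
fn clean_before_queue_drains() -> Vec<u64> {
    let mut shard = Shard::default();
    shard.segments.extend([1, 2, 3, 4]);
    // An update for point 2 is still pending when cleanup starts.
    shard.update_queue.push_back(Op::Upsert(2));

    // Assumption for this sketch: even IDs moved to the new shard during resharding.
    shard.clean(|id| id % 2 == 1);
    // The pending update lands afterwards and brings point 2 back.
    shard.drain_queue();

    shard.segments.into_iter().collect()
}
```

In this model `clean_before_queue_drains()` returns `[1, 2, 3]`: point 2 was removed by the cleanup but reappears once the queued upsert is applied.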

This PR adds a test that demonstrates this behavior. It fixes the problem by making every clear operation first wait until the current update queue has been applied, using a plunger-like operation. The test fails before the fix and succeeds with it.
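The plunger idea can be sketched with a plain channel and a worker thread. This is an illustration under assumptions, not the implementation in this PR: `Message`, `Plunger` and `clean_after_plunger` are hypothetical names, the queue is modeled with `std::sync::mpsc`, and the rule that even point IDs moved away is again assumed.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical queue message: a regular update, or a plunger carrying a callback channel.
enum Message {
    Upsert(u64),
    Plunger(Sender<()>),
}

/// Returns the point IDs left in the old shard when cleanup waits for a plunger first.
fn clean_after_plunger() -> Vec<u64> {
    let segments = Arc::new(Mutex::new(BTreeSet::from([1u64, 2, 3, 4])));
    let (queue_tx, queue_rx) = channel::<Message>();

    // Update worker: applies queued operations strictly in FIFO order.
    let worker_segments = Arc::clone(&segments);
    let worker = thread::spawn(move || {
        for message in queue_rx {
            match message {
                Message::Upsert(id) => {
                    worker_segments.lock().unwrap().insert(id);
                }
                // Reaching the plunger means every earlier operation has been applied.
                Message::Plunger(done) => {
                    let _ = done.send(());
                }
            }
        }
    });

    // An update for point 2 is already queued when cleanup is requested.
    queue_tx.send(Message::Upsert(2)).unwrap();

    // Push the plunger behind the pending operations and wait for it to come out.
    let (done_tx, done_rx) = channel();
    queue_tx.send(Message::Plunger(done_tx)).unwrap();
    done_rx.recv().unwrap();

    // Only now delete points that no longer belong here (assumption: even IDs moved away).
    segments.lock().unwrap().retain(|id| id % 2 == 1);

    // Close the queue and let the worker finish.
    drop(queue_tx);
    worker.join().unwrap();

    let remaining: Vec<u64> = segments.lock().unwrap().iter().copied().collect();
    remaining
}
```

Here `clean_after_plunger()` returns `[1, 3]`, because the pending upsert for point 2 is applied before the cleanup deletes it. The sketch only covers operations that were queued before the plunger; anything enqueued later is outside what it models.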

All Submissions:

  • My PR targets the dev branch (not master) and my branch was created from dev.
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
  3. Have you checked your code using the `cargo clippy --workspace --all-features` command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@timvisee added the bug label May 12, 2026
@timvisee requested a review from ffuugoo May 12, 2026 13:30
@timvisee marked this pull request as ready for review May 12, 2026 13:30
coderabbitai[bot]

This comment was marked as resolved.

@qdrant deleted a comment from coderabbitai[bot] May 12, 2026
@qdrant deleted a comment from coderabbitai[bot] May 12, 2026
@timvisee merged commit 97cb524 into dev May 21, 2026
15 checks passed
@timvisee deleted the fix-resharding-clean-update-queue branch May 21, 2026 10:43
generall pushed a commit that referenced this pull request May 22, 2026
* Add test to show cleanup may conflict with update queue

* When invoking clean task, first wait for current update queue

* Don't hold shard holder lock for a long time

* Also assert the clean task finished completely

Labels: bug, release:1.18.1
