Fix resharding cleanup datarace with update queue by timvisee · Pull Request #9014 · qdrant/qdrant

timvisee · 2026-05-12T10:01:09Z

Two systems conflict:

* Resharding "propagate deletes" stage: deletes points that no longer belong in the (old) shard
* Update queue: holds pending operations that have not landed in segments yet

During resharding up, we copy a portion of points from all existing shards into a new shard. We leave the originals in place for some time to prevent read consistency problems. At the end of resharding we delete these moved points, as they don't belong in the old shards anymore.

Combined with the update queue, this allows a data race. The cleanup does not clear queued operations, yet those operations may still create or update points that no longer belong in the shard. So, after the cleanup of old points has finished, the same old points may land in the shard again from the update queue.
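The ordering problem can be replayed in a small single-threaded model. This is only an illustrative sketch, not Qdrant code: `ToyShard`, `Op`, `clean`, `drain_queue` and the "odd IDs still belong here" rule are all hypothetical names and assumptions made up for the example.

```rust
use std::collections::{BTreeSet, VecDeque};

/// A queued write that is pending and not yet applied to segments (hypothetical).
#[derive(Debug, Clone, Copy)]
pub enum Op {
    Upsert(u64),
}

/// Toy shard: `segments` holds applied point IDs, `queue` holds pending operations.
#[derive(Default)]
pub struct ToyShard {
    pub segments: BTreeSet<u64>,
    pub queue: VecDeque<Op>,
}

impl ToyShard {
    /// Enqueue an operation without applying it.
    pub fn enqueue(&mut self, op: Op) {
        self.queue.push_back(op);
    }

    /// Apply every pending operation to the segments, in FIFO order.
    pub fn drain_queue(&mut self) {
        while let Some(op) = self.queue.pop_front() {
            match op {
                Op::Upsert(id) => {
                    self.segments.insert(id);
                }
            }
        }
    }

    /// Delete applied points that no longer belong in this shard.
    /// Pending operations in the queue are not touched.
    pub fn clean(&mut self, belongs: impl Fn(u64) -> bool) {
        self.segments.retain(|id| belongs(*id));
    }
}

/// Replays the problematic order: clean first, queued write lands afterwards.
/// Returns the points that are in the shard but should not be.
pub fn race_leaves_stale_points() -> Vec<u64> {
    let mut shard = ToyShard::default();
    shard.segments.extend([1, 2, 3, 4]);
    // Assumption for the example: after resharding only odd IDs belong here.
    let belongs = |id: u64| id % 2 == 1;
    shard.enqueue(Op::Upsert(2)); // update to a moved point, still pending
    shard.clean(belongs); // removes 2 and 4 from segments
    shard.drain_queue(); // the queued upsert brings point 2 back
    shard
        .segments
        .iter()
        .copied()
        .filter(|id| !belongs(*id))
        .collect()
}
```

In this model, calling `drain_queue` before `clean` instead of after it leaves no stale points, which is the ordering the fix enforces.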

This PR adds a test that reproduces this behavior. It fixes the problem by making every cleanup operation first wait for the current update queue to be applied, using a plunger-like operation. The test fails before the fix and passes with it.
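A plunger-like wait can be sketched with a FIFO channel: a marker message is pushed through the queue, and the worker acknowledges it once everything queued before it has been applied. The code below is a minimal sketch under that assumption, using only the Rust standard library; `Msg`, `spawn_worker`, `clean_after_plunger` and `plunger_demo` are hypothetical names and do not reflect the actual implementation in this PR.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Messages accepted by the hypothetical update worker.
pub enum Msg {
    Upsert(u64),
    /// Changes no data; the worker acknowledges it when it reaches it.
    Plunger(Sender<()>),
}

pub type Segments = Arc<Mutex<BTreeSet<u64>>>;

/// Spawn a worker that applies queued messages in FIFO order.
pub fn spawn_worker(segments: Segments) -> (Sender<Msg>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Msg>();
    let handle = thread::spawn(move || {
        for msg in rx {
            match msg {
                Msg::Upsert(id) => {
                    // Simulate a slow update so the queue is still busy
                    // when the cleanup starts.
                    thread::sleep(Duration::from_millis(5));
                    segments.lock().unwrap().insert(id);
                }
                Msg::Plunger(ack) => {
                    // Everything queued before the plunger is applied by now.
                    let _ = ack.send(());
                }
            }
        }
    });
    (tx, handle)
}

/// Wait until the queue as it exists right now has been applied,
/// then delete points that no longer belong in the shard.
pub fn clean_after_plunger(
    queue: &Sender<Msg>,
    segments: &Segments,
    belongs: impl Fn(u64) -> bool,
) {
    let (ack_tx, ack_rx) = mpsc::channel();
    queue.send(Msg::Plunger(ack_tx)).expect("worker stopped");
    ack_rx.recv().expect("worker dropped the plunger");
    // The lock is taken only after the wait, so it is never held while blocking.
    segments.lock().unwrap().retain(|id| belongs(*id));
}

/// Queue two writes to moved points, then clean with a plunger in front.
pub fn plunger_demo() -> Vec<u64> {
    let segments: Segments = Arc::new(Mutex::new(BTreeSet::from([1, 2, 3, 4])));
    let (queue, worker) = spawn_worker(Arc::clone(&segments));
    queue.send(Msg::Upsert(2)).unwrap();
    queue.send(Msg::Upsert(4)).unwrap();
    // Assumption for the example: only odd IDs still belong in this shard.
    clean_after_plunger(&queue, &segments, |id| id % 2 == 1);
    drop(queue); // close the channel so the worker exits
    worker.join().unwrap();
    let result: Vec<u64> = segments.lock().unwrap().iter().copied().collect();
    result
}
```

Note that a plunger of this kind only covers operations that were queued before it was sent, which matches the "current update queue" wording above; anything queued later is outside what this sketch guarantees.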

All Submissions:

My PR targets the dev branch (not master) and my branch was created from dev.
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
Have you checked your code using cargo clippy --workspace --all-features command?

Changes to Core Features:

Have you added an explanation of what your changes do and why you'd like us to include them?
Have you written new tests for your core changes, as applicable?
Have you successfully run tests with your changes locally?

* Add test to show cleanup may conflict with update queue
* When invoking clean task, first wait for current update queue
* Don't hold shard holder lock for a long time
* Also assert the clean task finished completely

timvisee added 3 commits May 12, 2026 11:40

Add test to show cleanup may conflict with update queue

deb8d10

When invoking clean task, first wait for current update queue

07be145

Don't hold shard holder lock for a long time

048f9c7

timvisee added the bug Something isn't working label May 12, 2026

timvisee requested a review from ffuugoo May 12, 2026 13:30

timvisee marked this pull request as ready for review May 12, 2026 13:30

qdrant deleted a comment from coderabbitai Bot May 12, 2026

Also assert the clean task finished completely

d6be5c8

qdrant deleted a comment from coderabbitai Bot May 12, 2026

timvisee added the release:1.18.1 label May 13, 2026

ffuugoo approved these changes May 21, 2026

timvisee merged commit 97cb524 into dev May 21, 2026
15 checks passed

timvisee deleted the fix-resharding-clean-update-queue branch May 21, 2026 10:43

timvisee merged 4 commits into dev from fix-resharding-clean-update-queue
