Recreate workers/optimizers async to not block consensus#9121

Merged
generall merged 8 commits into dev from recreate-optimizers-async
May 21, 2026
Member

@timvisee timvisee commented May 21, 2026

When the optimizer configuration of a collection changes, we recreate workers (such as the update worker and optimizers).

To achieve this, we first stop, finish and destruct running workers, and wait for them to complete. Then we recreate workers and optimizers with the updated configuration.

If the update worker is currently processing an expensive operation, taking it down can take a very long time. This means the process is blocking, and how long it blocks depends on what the workers are currently doing.

Some consensus operations depend on this. An obvious example is the update collection configuration API. This is a huge problem, because it may block consensus for a long time. That cascades into unstable consensus and failing nodes.

This PR resolves the problem by moving this expensive process into the background. The consensus operation immediately goes through, and workers/optimizers are recreated in the background.
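
The difference between the two approaches can be sketched roughly as follows. This is a toy model in Python, not Qdrant's Rust code: `BusyWorker`, `recreate_blocking`, `recreate_in_background` and `timed` are hypothetical names, and a sleeping thread stands in for an update worker that is busy with an expensive operation.

```python
import threading
import time


class BusyWorker:
    """Hypothetical stand-in for an update worker stuck in a long operation."""

    def __init__(self, busy_for_sec: float) -> None:
        self._thread = threading.Thread(target=time.sleep, args=(busy_for_sec,))
        self._thread.start()

    def stop_and_wait(self) -> None:
        # Stopping has to wait until the current operation has finished.
        self._thread.join()


def recreate_blocking(worker: BusyWorker) -> None:
    # Old behavior (sketch): the caller waits for the worker to stop.
    worker.stop_and_wait()
    # ... new workers would be created here ...


def recreate_in_background(worker: BusyWorker) -> threading.Thread:
    # New behavior (sketch): hand the slow part to a background thread
    # and return to the caller immediately.
    task = threading.Thread(target=recreate_blocking, args=(worker,))
    task.start()
    return task


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start


_, blocking_elapsed = timed(recreate_blocking, BusyWorker(0.3))
task, background_elapsed = timed(recreate_in_background, BusyWorker(0.3))
task.join()  # only so the demo exits cleanly
```

In this sketch the blocking call takes about as long as the worker stays busy, while the background variant returns almost immediately. The trade-off is that the caller no longer learns directly whether recreation succeeded.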

All Submissions:

  • My PR targets the dev branch (not master) and my branch was created from dev.
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --workspace --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

.write()
.report_optimizer_error(format!("Failed to recreate optimizers: {err}"));
}

Member Author

Noteworthy change.

We now recreate workers and optimizers in the background. This is fallible. On error, the issue is now propagated as an optimizer error so that it is exposed in collection info. Users may trigger the operation again to give it another shot, after which the error is also cleared.
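
A minimal sketch of this error-reporting pattern, written in Python for illustration only. `ShardStatus`, `clear_optimizer_error` and `failing_recreate` are hypothetical names, and the sketch assumes the error is cleared when a later run succeeds; only the `report_optimizer_error` name and the message format come from the diff above.

```python
import threading
from typing import Callable, Optional


class ShardStatus:
    """Hypothetical holder for the optimizer error shown in collection info."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.optimizer_error: Optional[str] = None

    def report_optimizer_error(self, message: str) -> None:
        with self._lock:
            self.optimizer_error = message

    def clear_optimizer_error(self) -> None:
        with self._lock:
            self.optimizer_error = None


def recreate_in_background(
    status: ShardStatus, recreate: Callable[[], None]
) -> threading.Thread:
    def run() -> None:
        try:
            recreate()
        except Exception as err:
            # Nobody awaits this task, so the failure is recorded, not returned.
            status.report_optimizer_error(f"Failed to recreate optimizers: {err}")
        else:
            # Assumption: a successful run clears an error from an earlier attempt.
            status.clear_optimizer_error()

    task = threading.Thread(target=run)
    task.start()
    return task


def failing_recreate() -> None:
    raise RuntimeError("disk full")  # made-up failure for the demo


status = ShardStatus()
recreate_in_background(status, failing_recreate).join()
error_after_failure = status.optimizer_error

recreate_in_background(status, lambda: None).join()  # the user triggers it again
error_after_retry = status.optimizer_error
```

Because the caller has already returned by the time the task fails, a status field like this is the only place the failure can surface.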

Comment on lines +103 to +121
    # Send a collection config update through consensus. Any optimizers_config change
    # triggers optimizer recreation, which must not block the consensus apply thread on
    # the busy worker.
    start = time.time()
    try:
        r = requests.patch(
            f"{peer_uri}/collections/{COLLECTION}",
            json={"optimizers_config": {"default_segment_number": 2}},
            timeout=PATCH_CLIENT_TIMEOUT_SEC,
        )
    except requests.exceptions.Timeout:
        elapsed = time.time() - start
        raise AssertionError(
            f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s "
            f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker"
        )
    elapsed = time.time() - start

    assert_http_ok(r)
Member Author

This would previously fail.

We submit a 20-second-wait operation above. This update collection call would have been blocked by it.

@timvisee timvisee marked this pull request as ready for review May 21, 2026 14:00
@timvisee timvisee changed the title from "Recreate optimizers asynchronously to not block consensus" to "Recreate workers/optimizers async to not block consensus" May 21, 2026
Contributor

coderabbitai Bot commented May 21, 2026

Walkthrough

This PR converts optimizer recreation from a blocking awaited operation to a non-blocking background single-flight mechanism. It introduces a RecreateOptimizersState atomic state coordinator that ensures only one recreation task runs at a time while coalescing concurrent requests into a follow-up run. The background method spawns a detached task that repeatedly applies optimizer config updates across all shards. Four call sites (Raft snapshot apply, vector schema updates, collection metadata ops) now invoke the background path instead of awaiting, and shard-level update logic is refactored to properly sequence worker stopping, config reading, and worker restart. An integration test validates that consensus operations complete promptly when optimizer recreation is triggered.
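
The single-flight behavior with coalescing described above can be illustrated with a small Python sketch. It is not the PR's implementation: the walkthrough mentions an atomic state in Rust, whereas this sketch uses a lock-protected enum, and `SingleFlight`, `State`, `request`, `wait_idle` and `slow_job` are hypothetical names.

```python
import threading
from enum import Enum
from typing import Callable, List


class State(Enum):
    IDLE = 0
    RUNNING = 1
    RERUN_REQUESTED = 2


class SingleFlight:
    """At most one background run; requests made meanwhile share one follow-up run."""

    def __init__(self, job: Callable[[], None]) -> None:
        self._job = job
        self._state = State.IDLE
        self._lock = threading.Lock()  # stands in for the atomic state
        self._idle = threading.Event()
        self._idle.set()

    def request(self) -> None:
        with self._lock:
            if self._state is not State.IDLE:
                # A run is in flight: remember that one more pass is needed.
                self._state = State.RERUN_REQUESTED
                return
            self._state = State.RUNNING
            self._idle.clear()
        threading.Thread(target=self._run).start()

    def _run(self) -> None:
        while True:
            self._job()
            with self._lock:
                if self._state is State.RERUN_REQUESTED:
                    self._state = State.RUNNING
                    continue  # run once more for all coalesced requests
                self._state = State.IDLE
                self._idle.set()
                return

    def wait_idle(self) -> None:
        self._idle.wait()


runs: List[int] = []
started = threading.Event()
release = threading.Event()


def slow_job() -> None:
    runs.append(len(runs) + 1)
    started.set()
    release.wait()  # the first run blocks here until the demo releases it


flight = SingleFlight(slow_job)
flight.request()  # starts the first run
started.wait()
for _ in range(5):
    flight.request()  # all five are coalesced into a single follow-up run
release.set()
flight.wait_idle()
run_count = len(runs)
```

In this demo, six requests lead to two runs of the job: the one already in flight, plus one follow-up that picks up whatever configuration is current when it starts.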

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • qdrant/qdrant#8767: Refactors LocalShard::on_optimizer_config_update error handling and optimizer-error clearing, overlapping with the shard optimizer-update logic changes in this PR.

Suggested reviewers

  • agourlay
  • generall
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and clearly summarizes the main objective: making optimizer/worker recreation asynchronous to prevent blocking consensus operations. |
| Description check | ✅ Passed | The description comprehensively explains the problem (blocking consensus during expensive worker shutdown), the solution (moving recreation to background), and provides context for why this matters. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py`:
- Around line 113-118: The except block catching requests.exceptions.Timeout
should preserve the original exception context: change the handler to capture
the caught Timeout (e.g., "except requests.exceptions.Timeout as e:") and
re-raise the AssertionError with "from e" so the original traceback is kept;
update the block around the failing message that references
PATCH_CLIENT_TIMEOUT_SEC and elapsed to use "raise AssertionError(... ) from e"
in the test_collection_update_not_blocked_by_busy_update_worker.py test.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dea0a05f-d17f-4630-8409-9fd68f7e7523

📥 Commits

Reviewing files that changed from the base of the PR and between 6764b67 and 96e96d0.

📒 Files selected for processing (7)
  • lib/collection/src/collection/collection_ops.rs
  • lib/collection/src/collection/mod.rs
  • lib/collection/src/collection/state_management.rs
  • lib/collection/src/collection/vector_name_schema.rs
  • lib/collection/src/shards/local_shard/updaters.rs
  • lib/storage/src/content_manager/toc/collection_meta_ops.rs
  • tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py

Comment on lines +113 to +118
    except requests.exceptions.Timeout:
        elapsed = time.time() - start
        raise AssertionError(
            f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s "
            f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker"
        )
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py"

# Show surrounding context for the reported lines
nl -ba "$FILE" | sed -n '80,160p'

Repository: qdrant/qdrant

Length of output: 100


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py"
awk 'NR>=90 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE"

Repository: qdrant/qdrant

Length of output: 1886


Preserve the original Timeout exception context when failing the assertion (tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py, lines 113-118).

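The current except requests.exceptions.Timeout: re-raises AssertionError without from, so the original Timeout is attached only as implicit context (__context__) rather than as the explicit cause (__cause__) of the assertion failure.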

Proposed fix
-    except requests.exceptions.Timeout:
+    except requests.exceptions.Timeout as err:
         elapsed = time.time() - start
         raise AssertionError(
             f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s "
             f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker"
-        )
+        ) from err
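
The effect of `from err` can be checked with a small self-contained example. It uses the built-in `TimeoutError` in place of `requests.exceptions.Timeout` so that it runs without third-party packages; `fail_without_from`, `fail_with_from` and `capture` are hypothetical helper names.

```python
def fail_without_from() -> None:
    try:
        raise TimeoutError("request timed out")
    except TimeoutError:
        # Implicit chaining: the TimeoutError ends up in __context__ only.
        raise AssertionError("update was blocked")


def fail_with_from() -> None:
    try:
        raise TimeoutError("request timed out")
    except TimeoutError as err:
        # Explicit chaining: the TimeoutError is recorded as __cause__.
        raise AssertionError("update was blocked") from err


def capture(fn) -> AssertionError:
    """Run fn and return the AssertionError it raises."""
    try:
        fn()
    except AssertionError as caught:
        return caught
    raise RuntimeError("expected an AssertionError")


implicit = capture(fail_without_from)
explicit = capture(fail_with_from)
```

In both cases the original exception stays reachable. With `from err` the traceback labels it as the direct cause, which is the distinction Ruff's B904 rule asks for.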
🧰 Tools
🪛 Ruff (0.15.13)

[warning] 115-118: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

@generall generall self-requested a review May 21, 2026 14:25
@generall generall merged commit b7ae3e8 into dev May 21, 2026
20 of 21 checks passed
@generall generall deleted the recreate-optimizers-async branch May 21, 2026 14:56
generall pushed a commit that referenced this pull request May 22, 2026
* Recreate optimizers in non-blocking fashion from consensus calls

* Update comments

* On optimizer config update failure, report error status to local shard

* Add a test to confirm we don't block consensus

* Rerun recreation if called multiple times

* Use atomics instead

* Move to the bottom

* Reformat