Recreate workers/optimizers async to not block consensus#9121
Conversation
| .write() | ||
| .report_optimizer_error(format!("Failed to recreate optimizers: {err}")); | ||
| } | ||
|
|
There was a problem hiding this comment.
Noteworthy change.
We now tend to recreate workers and optimizers in the background. This is fallible. On error, it now propagates the issue as optimizer error so it is exposed in collection info. Users may trigger the operation again to give it another shot, after which the error is also cleared.
| # Send a collection config update through consensus. Any optimizers_config change | ||
| # triggers optimizer recreation, which must not block the consensus apply thread on | ||
| # the busy worker. | ||
| start = time.time() | ||
| try: | ||
| r = requests.patch( | ||
| f"{peer_uri}/collections/{COLLECTION}", | ||
| json={"optimizers_config": {"default_segment_number": 2}}, | ||
| timeout=PATCH_CLIENT_TIMEOUT_SEC, | ||
| ) | ||
| except requests.exceptions.Timeout: | ||
| elapsed = time.time() - start | ||
| raise AssertionError( | ||
| f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s " | ||
| f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker" | ||
| ) | ||
| elapsed = time.time() - start | ||
|
|
||
| assert_http_ok(r) |
There was a problem hiding this comment.
This would previously fail.
We submit a 20-second-wait operation above. This update collection call would have been blocked by it.
📝 WalkthroughWalkthroughThis PR converts optimizer recreation from a blocking awaited operation to a non-blocking background single-flight mechanism. It introduces a Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py`:
- Around line 113-118: The except block catching requests.exceptions.Timeout
should preserve the original exception context: change the handler to capture
the caught Timeout (e.g., "except requests.exceptions.Timeout as e:") and
re-raise the AssertionError with "from e" so the original traceback is kept;
update the block around the failing message that references
PATCH_CLIENT_TIMEOUT_SEC and elapsed to use "raise AssertionError(... ) from e"
in the test_collection_update_not_blocked_by_busy_update_worker.py test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dea0a05f-d17f-4630-8409-9fd68f7e7523
📒 Files selected for processing (7)
lib/collection/src/collection/collection_ops.rslib/collection/src/collection/mod.rslib/collection/src/collection/state_management.rslib/collection/src/collection/vector_name_schema.rslib/collection/src/shards/local_shard/updaters.rslib/storage/src/content_manager/toc/collection_meta_ops.rstests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py
| except requests.exceptions.Timeout: | ||
| elapsed = time.time() - start | ||
| raise AssertionError( | ||
| f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s " | ||
| f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker" | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py"
# Show surrounding context for the reported lines
nl -ba "$FILE" | sed -n '80,160p'Repository: qdrant/qdrant
Length of output: 100
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py"
awk 'NR>=90 && NR<=140 {printf "%d:%s\n", NR, $0}' "$FILE"Repository: qdrant/qdrant
Length of output: 1886
Preserve the original Timeout exception context when failing the assertion (tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py, lines 113-118).
The current except requests.exceptions.Timeout: re-raises AssertionError without from, which discards the original traceback.
Proposed fix
- except requests.exceptions.Timeout:
+ except requests.exceptions.Timeout as err:
elapsed = time.time() - start
raise AssertionError(
f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s "
f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker"
- )
+ ) from err📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except requests.exceptions.Timeout: | |
| elapsed = time.time() - start | |
| raise AssertionError( | |
| f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s " | |
| f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker" | |
| ) | |
| except requests.exceptions.Timeout as err: | |
| elapsed = time.time() - start | |
| raise AssertionError( | |
| f"Collection update did not return within {PATCH_CLIENT_TIMEOUT_SEC}s " | |
| f"(waited {elapsed:.1f}s) - consensus apply was blocked by the busy update worker" | |
| ) from err |
🧰 Tools
🪛 Ruff (0.15.13)
[warning] 115-118: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/consensus_tests/test_collection_update_not_blocked_by_busy_worker.py`
around lines 113 - 118, The except block catching requests.exceptions.Timeout
should preserve the original exception context: change the handler to capture
the caught Timeout (e.g., "except requests.exceptions.Timeout as e:") and
re-raise the AssertionError with "from e" so the original traceback is kept;
update the block around the failing message that references
PATCH_CLIENT_TIMEOUT_SEC and elapsed to use "raise AssertionError(... ) from e"
in the test_collection_update_not_blocked_by_busy_update_worker.py test.
* Recreate optimizers in non-blocking fashion from consensus calls * Update comments * On optimizer config update failure, report error status to local shard * Add a test to confirm we don't block consensus * Rerun recreation if called multiple times * Use atomics instead * Move to the bottom * Reformat
When the optimizer configuration of a collection changes, we recreate workers (such as the update worker and optimizers).
To achieve this, we first stop, finish and destruct running workers. We wait for them to be complete. Then we recreate workers and optimizers with the updated configuration.
If the update worker is currently processing an expensive operation. This process if taking down the update worker can take a very long time. It means this process is blocking, affected by what workers are currently doing.
Some consensus operations depend on this. An obvious example is the update collection configuration API. This is a huge problem, because it may block consensus for a long time. That cascades into unstable consensus and failing nodes.
This PR resolves the problem by moving this expensive process into the background. The consensus operation immediately goes through, and workers/optimizers are recreated in the background.
All Submissions:
devbranch (notmaster) and my branch was created fromdev.New Feature Submissions:
cargo +nightly fmt --allcommand prior to submission?cargo clippy --workspace --all-featurescommand?Changes to Core Features: