fix: venv torchvision alignment, supervisor RPC timeouts, get_model flood protection, replica pre-check, and safe log handler #4839
Merged
qinxuye merged 8 commits into xorbitsai:main on Apr 24, 2026
…ock, and add supervisor RPC timeouts
- Remove bare `torchvision` from ENGINE_VIRTUALENV_PACKAGES["sentence_transformers"] to avoid mixed-source torch/torchvision conflicts (P0)
- Add _exclusive_venv_path_lock around install_packages to prevent concurrent package installation race conditions (P0)
- Add per-worker list_models timeout with asyncio.gather for concurrent aggregation, preventing a single hung worker from blocking all queries (P1)
- Add get_model RPC timeout via xo.wait_for to bound supervisor->worker call duration (P1)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
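The per-worker timeout with concurrent aggregation described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `list_models_with_timeout`, the `workers` mapping, and the demo coroutines are hypothetical names, and plain `asyncio.wait_for` stands in for the `xo.wait_for` call the commit mentions.

```python
import asyncio


async def list_models_with_timeout(workers, timeout=1.0):
    """Query every worker concurrently; skip workers that time out or fail.

    `workers` is a hypothetical mapping of worker name -> zero-argument
    coroutine function returning a dict of model uid -> model info.
    """

    async def query(name, fetch):
        try:
            # Bound each worker call individually so one hang cannot
            # stall the whole aggregation.
            return name, await asyncio.wait_for(fetch(), timeout)
        except asyncio.TimeoutError:
            return name, None
        except Exception:
            return name, None

    # All worker queries run concurrently; total latency is bounded by
    # roughly one timeout rather than the sum over workers.
    results = await asyncio.gather(*(query(n, f) for n, f in workers.items()))

    merged = {}
    for _name, models in results:
        if models is not None:
            merged.update(models)
    return merged


async def _demo():
    async def fast():
        return {"model-a": {"worker": "w1"}}

    async def hung():
        await asyncio.sleep(10)  # simulates a worker that never answers in time
        return {"model-b": {"worker": "w2"}}

    return await list_models_with_timeout({"w1": fast, "w2": hung}, timeout=0.1)


demo_result = asyncio.run(_demo())
```

In this sketch the hung worker's models are simply omitted from the result; whether the real supervisor logs, retries, or reports such workers is not stated in the commit message.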
…del_spec
Add "#system_torch# ; #engine# == \"sentence_transformers\"" to virtualenv.packages for jina-reranker-v3 (rerank) and jina-embeddings-v3 (embedding) to ensure torch comes from the parent environment, preventing the torch (child venv) / torchvision (parent) version mismatch that causes "operator torchvision::nms does not exist".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eck, and SafeRotatingFileHandler
- Add REST-layer negative cache (TTL 10s) for get_model to prevent retry floods from blocking the Supervisor Actor message queue (2026041601)
- Change require_model HTTP status from 400 to 404 for Model not found
- Enhance get_model error message with the list of available model uids
- Invalidate negative cache on successful model launch
- Add replica uid pre-check before asyncio.gather in launch_builtin_model and _launch_builtin_sharded_model to prevent partial-deploy-then-rollback (2026041602)
- Add SafeRotatingFileHandler that auto-creates parent directories, preventing sub pool startup failure when the log dir is missing (2026041603)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
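For readers unfamiliar with the pattern, a TTL-based negative cache of the kind this commit describes might look like the sketch below. The class name `NegativeCache` and its methods are assumptions made for illustration; the PR itself only mentions a `_MODEL_NOT_FOUND_CACHE` with a 10s TTL, and its actual structure is not shown here.

```python
import time


class NegativeCache:
    """Remembers recent "not found" lookups for a short time (hypothetical sketch)."""

    def __init__(self, ttl=10.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._expires = {}   # model uid -> expiry timestamp

    def mark_missing(self, model_uid):
        # Record that a lookup for this uid just failed.
        self._expires[model_uid] = self._clock() + self._ttl

    def is_missing(self, model_uid):
        # True only while the entry exists and has not expired.
        expires = self._expires.get(model_uid)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires[model_uid]
            return False
        return True

    def invalidate(self, model_uid):
        # Call after a successful launch so the model is visible immediately.
        self._expires.pop(model_uid, None)
```

A REST handler would check `is_missing` first and answer 404 right away on a hit, so repeated retries for a missing model never reach the supervisor until the entry expires or is invalidated.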
- Update model-not-found tests to expect HTTP 404 instead of 400
- Remove torchvision assertion from virtualenv packages test since torchvision is now supplied per-model via #system_torchvision#
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
terminate_model does not use require_model, so it still returns 400 when deleting a non-existent model raises ValueError.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The HTTPException(404) raised by the negative cache fast path was being caught by the broad `except Exception` handler and re-wrapped as 500. Add an `except HTTPException: raise` clause to let it pass through.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
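The pass-through fix relies on Python trying `except` clauses in order, so a specific clause placed before the broad one wins. The sketch below shows the idea under stated assumptions: `HTTPException` here is a small stand-in class so the example runs without a web framework, and `handle` and `status_of` are hypothetical helpers, not functions from the PR.

```python
class HTTPException(Exception):
    """Stand-in for a web framework's HTTP error type (hypothetical here)."""

    def __init__(self, status_code, detail=""):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def handle(lookup):
    """Run a lookup and translate unexpected errors into a 500."""
    try:
        return lookup()
    except HTTPException:
        # Already carries the intended status (e.g. 404); let it through.
        raise
    except Exception as exc:
        # Only genuinely unexpected errors become 500.
        raise HTTPException(500, str(exc)) from exc


def status_of(lookup):
    """Return the HTTP status a caller would observe for this lookup."""
    try:
        handle(lookup)
    except HTTPException as exc:
        return exc.status_code
    return 200
```

Without the first `except` clause, the 404 would fall into the broad handler and surface as a 500, which is the behavior the commit describes fixing.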
…n tests
- Revert "delete again" assertion back to 400 (terminate_model path)
- Add autouse fixture to clear _MODEL_NOT_FOUND_CACHE between unit tests to prevent cross-test pollution from the negative cache
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qinxuye
reviewed
Apr 24, 2026
Remove local document references (2026041601.md, 2026041602.md, 2026041603.md) that are not part of the upstream repository.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Remove bare `torchvision` from `ENGINE_VIRTUALENV_PACKAGES` to avoid mixed-source torch/torchvision conflicts; torchvision is now supplied per-model via `#system_torchvision#` in model_spec
- Add `install_packages` lock to prevent concurrent package installations
- Add negative cache against `get_model` floods with configurable TTL
- Fix `HTTPException(404)` being swallowed by broad `except Exception` and re-wrapped as 500
- Add `SafeRotatingFileHandler` for concurrent log rotation
- Return HTTP 404 for model not found on the `require_model` path (correct REST semantics)
- Add `#system_torch#` to jina-reranker-v3 and jina-embeddings-v3 model_spec

Test plan
- Model-not-found tests expect HTTP 404 via `require_model`
- `terminate_model` path keeps HTTP 400 (does not use `require_model`)
- virtualenv packages test updated for the `#system_torchvision#` approach
- … (`test_embedding_model_with_flag` failure is a pre-existing transformers version issue, unrelated to this PR)

🤖 Generated with Claude Code