fix: venv torchvision alignment, supervisor RPC timeouts, get_model flood protection, replica pre-check, and safe log handler by m199369309 · Pull Request #4839 · xorbitsai/inference

m199369309 · 2026-04-23T01:29:09Z

Summary

Remove bare torchvision from ENGINE_VIRTUALENV_PACKAGES to avoid mixed-source torch/torchvision conflicts; torchvision is now supplied per-model via #system_torchvision# in model_spec
Add install_packages lock to prevent concurrent package installations
Add supervisor RPC timeouts to avoid hanging calls
Add REST negative cache for get_model floods with configurable TTL
Fix negative cache HTTPException(404) being swallowed by broad except Exception and re-wrapped as 500
Add autouse fixture to clear negative cache between unit tests (prevent cross-test pollution)
Add replica UID pre-check before launching
Add SafeRotatingFileHandler for concurrent log rotation
Update model-not-found HTTP status from 400 to 404 for require_model path (correct REST semantics)
Add #system_torch# to jina-reranker-v3 and jina-embeddings-v3 model_spec
Fix tests to match updated status codes and torchvision removal
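
A commit message in this PR describes SafeRotatingFileHandler as auto-creating parent directories so that startup does not fail when the log directory is missing. The sketch below illustrates that idea only; the class name `DirCreatingRotatingFileHandler` and the `demo` helper are hypothetical and are not the PR's actual implementation.

```python
import logging
import logging.handlers
import os
import tempfile


class DirCreatingRotatingFileHandler(logging.handlers.RotatingFileHandler):
    """Hypothetical sketch: make sure the log directory exists before opening."""

    def _open(self):
        # baseFilename is an absolute path set by FileHandler.__init__.
        os.makedirs(os.path.dirname(self.baseFilename), exist_ok=True)
        return super()._open()


def demo() -> bool:
    """Log one line into a directory that does not exist yet."""
    base = tempfile.mkdtemp()
    path = os.path.join(base, "missing", "logs", "app.log")
    handler = DirCreatingRotatingFileHandler(path, maxBytes=1024, backupCount=1)
    logger = logging.getLogger("dir_creating_demo")
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("hello")
    logger.removeHandler(handler)
    handler.close()
    return os.path.isfile(path)
```

Overriding `_open` rather than `__init__` means the directory is also recreated when the handler reopens the file after a rollover.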

Test plan

Unit tests updated for HTTP 404 on model-not-found via require_model
terminate_model path keeps HTTP 400 (does not use require_model)
Torchvision virtualenv test updated to reflect #system_torchvision# approach
Negative cache HTTPException passthrough fixed
Negative cache cleared between unit tests via autouse fixture
CI pipeline passes (test_embedding_model_with_flag failure is a pre-existing transformers version issue, unrelated to this PR)

🤖 Generated with Claude Code

…ock, and add supervisor RPC timeouts

- Remove bare `torchvision` from ENGINE_VIRTUALENV_PACKAGES["sentence_transformers"] to avoid mixed-source torch/torchvision conflicts (P0)
- Add _exclusive_venv_path_lock around install_packages to prevent concurrent package installation race conditions (P0)
- Add per-worker list_models timeout with asyncio.gather for concurrent aggregation, preventing single-worker hang from blocking all queries (P1)
- Add get_model RPC timeout via xo.wait_for to bound supervisor->worker call duration (P1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
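
The per-worker timeout combined with asyncio.gather can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `list_models_on_worker` is a hypothetical stand-in for the real worker RPC, and `asyncio.wait_for` is used here in place of the `xo.wait_for` call named in the commit message.

```python
import asyncio


async def list_models_on_worker(name: str, delay: float) -> dict:
    """Hypothetical worker RPC; `delay` simulates how long the worker takes."""
    await asyncio.sleep(delay)
    return {name: ["model-a"]}


async def list_models_with_timeout(workers: dict, timeout: float) -> dict:
    """Query every worker concurrently and skip any that exceed `timeout`."""

    async def query(name: str, delay: float):
        try:
            return await asyncio.wait_for(list_models_on_worker(name, delay), timeout)
        except asyncio.TimeoutError:
            return None  # a hung worker no longer blocks the others

    results = await asyncio.gather(*(query(n, d) for n, d in workers.items()))
    merged = {}
    for result in results:
        if result is not None:
            merged.update(result)
    return merged


def demo() -> dict:
    # "hung" would take 5 seconds, but it is cut off after 0.2 seconds.
    return asyncio.run(
        list_models_with_timeout({"fast": 0.01, "hung": 5.0}, timeout=0.2)
    )
```

Because each query carries its own timeout, the aggregate call is bounded by roughly one timeout interval instead of the slowest worker.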

…del_spec

Add "#system_torch# ; #engine# == \"sentence_transformers\"" to virtualenv.packages for jina-reranker-v3 (rerank) and jina-embeddings-v3 (embedding) to ensure torch comes from the parent environment, preventing a torch (child venv) / torchvision (parent) version mismatch that causes "operator torchvision::nms does not exist".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

qinxuye

LGTM

…eck, and SafeRotatingFileHandler

- Add REST-layer negative cache (TTL 10s) for get_model to prevent retry floods from blocking Supervisor Actor message queue (2026041601)
- Change require_model HTTP status from 400 to 404 for Model not found
- Enhance get_model error message with available model uids list
- Invalidate negative cache on successful model launch
- Add replica uid pre-check before asyncio.gather in launch_builtin_model and _launch_builtin_sharded_model to prevent partial-deploy-then-rollback (2026041602)
- Add SafeRotatingFileHandler that auto-creates parent directories, preventing sub pool startup failure when log dir is missing (2026041603)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
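
A negative cache with a TTL, as described for get_model, might look roughly like the sketch below. The class and method names are hypothetical and the injectable `clock` parameter is an assumption added for testability; only the 10-second TTL and the invalidate-on-launch behavior come from the commit message.

```python
import time


class NegativeCache:
    """Hypothetical sketch: remember missing model UIDs for `ttl` seconds."""

    def __init__(self, ttl: float = 10.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires = {}  # model_uid -> expiry timestamp

    def mark_missing(self, model_uid: str) -> None:
        """Record that a lookup for this UID just failed."""
        self._expires[model_uid] = self._clock() + self._ttl

    def is_missing(self, model_uid: str) -> bool:
        """Return True while a recent failed lookup is still within its TTL."""
        expiry = self._expires.get(model_uid)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[model_uid]  # entry expired, allow a real lookup
            return False
        return True

    def invalidate(self, model_uid: str) -> None:
        """Drop the entry, e.g. after the model is launched successfully."""
        self._expires.pop(model_uid, None)
```

While `is_missing` returns True, the REST layer can answer 404 directly without sending another message to the supervisor, which is what shields its queue from retry floods.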

- Update model-not-found tests to expect HTTP 404 instead of 400
- Remove torchvision assertion from virtualenv packages test since torchvision is now supplied per-model via #system_torchvision#

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

terminate_model does not use require_model, so it still returns 400 for ValueError on non-existent model deletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The HTTPException(404) raised by the negative cache fast path was being caught by the broad `except Exception` handler and re-wrapped as 500. Add an `except HTTPException: raise` clause to let it pass through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
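
The passthrough fix can be illustrated with a small self-contained sketch. The `HTTPException` class below is a stand-in defined only for this example (an assumption, so the snippet runs without a web framework), and `lookup` and `handle_get_model` are hypothetical names, not the PR's functions.

```python
class HTTPException(Exception):
    """Stand-in for the web framework's HTTPException (assumption)."""

    def __init__(self, status_code: int, detail: str = ""):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def lookup(model_uid: str, known) -> str:
    """Hypothetical fast path: raise 404 when the model is not known."""
    if model_uid not in known:
        raise HTTPException(404, f"Model not found: {model_uid}")
    return model_uid


def handle_get_model(model_uid: str, known) -> str:
    try:
        return lookup(model_uid, known)
    except HTTPException:
        raise  # let the 404 through instead of re-wrapping it
    except Exception as e:
        # Anything unexpected is still reported as a server error.
        raise HTTPException(500, str(e))
```

Without the first `except` clause, the 404 would fall into the broad handler and reach the client as a 500, which is the behavior the commit corrects.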

…n tests

- Revert "delete again" assertion back to 400 (terminate_model path)
- Add autouse fixture to clear _MODEL_NOT_FOUND_CACHE between unit tests to prevent cross-test pollution from the negative cache

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove local document references (2026041601.md, 2026041602.md, 2026041603.md) that are not part of the upstream repository.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

qinxuye

LGTM

XprobeBot added the bug Something isn't working label Apr 23, 2026

XprobeBot added this to the v2.x milestone Apr 23, 2026

m199369309 changed the title ~~fix: remove bare torchvision from engine deps, add install_packages lock, and supervisor RPC timeouts~~ fix: remove bare torchvision from engine deps, add install_packages lock, supervisor RPC timeouts, and model_spec torch alignment Apr 23, 2026

qinxuye approved these changes Apr 23, 2026


m199369309 requested a review from qinxuye April 23, 2026 14:59

m199369309 and others added 4 commits April 23, 2026 23:28

fix: revert delete bogus model test to expect 400

e3910d2


qinxuye reviewed Apr 24, 2026


Comment thread xinference/api/utils.py Outdated

chore: remove internal doc references from comments

afaafa6


qinxuye approved these changes Apr 24, 2026


qinxuye merged commit 256603b into xorbitsai:main Apr 24, 2026
4 of 14 checks passed

qinxuye merged 8 commits into xorbitsai:main from m199369309:fix/venv-torchvision-and-supervisor-rpc-timeout

3 participants
