
fix: venv torchvision alignment, supervisor RPC timeouts, get_model flood protection, replica pre-check, and safe log handler #4839

Merged
qinxuye merged 8 commits into xorbitsai:main from m199369309:fix/venv-torchvision-and-supervisor-rpc-timeout
Apr 24, 2026

@m199369309 m199369309 commented Apr 23, 2026

Summary

  • Remove bare torchvision from ENGINE_VIRTUALENV_PACKAGES to avoid mixed-source torch/torchvision conflicts; torchvision is now supplied per-model via #system_torchvision# in model_spec
  • Add install_packages lock to prevent concurrent package installations
  • Add supervisor RPC timeouts to avoid hanging calls
  • Add REST negative cache for get_model floods with configurable TTL
  • Fix negative cache HTTPException(404) being swallowed by broad except Exception and re-wrapped as 500
  • Add autouse fixture to clear negative cache between unit tests (prevent cross-test pollution)
  • Add replica UID pre-check before launching
  • Add SafeRotatingFileHandler for concurrent log rotation
  • Update model-not-found HTTP status from 400 to 404 for require_model path (correct REST semantics)
  • Add #system_torch# to jina-reranker-v3 and jina-embeddings-v3 model_spec
  • Fix tests to match updated status codes and torchvision removal
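
The negative cache with a configurable TTL mentioned above can be illustrated roughly as follows. This is a minimal sketch, not the code from this PR: the class name `NegativeCache`, its methods, and the injectable `clock` parameter are assumptions made for illustration, and the 10 second default simply mirrors the TTL quoted in the commit message.

```python
import time
from typing import Callable, Dict


class NegativeCache:
    """Remembers keys that were recently reported as not found (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._expires_at: Dict[str, float] = {}

    def mark_missing(self, key: str) -> None:
        # Record that the key was not found; the entry expires after the TTL.
        self._expires_at[key] = self._clock() + self._ttl

    def is_missing(self, key: str) -> bool:
        # True only while an unexpired entry exists; expired entries are dropped.
        expires = self._expires_at.get(key)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires_at[key]
            return False
        return True

    def invalidate(self, key: str) -> None:
        # Called when the key starts to exist, e.g. after a successful launch.
        self._expires_at.pop(key, None)

    def clear(self) -> None:
        # Useful between unit tests to avoid cross-test pollution.
        self._expires_at.clear()
```

In a REST handler, `is_missing` would be checked before contacting the supervisor, so repeated lookups of the same unknown model UID are answered locally until the entry expires or is invalidated.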

Test plan

  • Unit tests updated for HTTP 404 on model-not-found via require_model
  • terminate_model path keeps HTTP 400 (does not use require_model)
  • Torchvision virtualenv test updated to reflect #system_torchvision# approach
  • Negative cache HTTPException passthrough fixed
  • Negative cache cleared between unit tests via autouse fixture
  • CI pipeline passes (test_embedding_model_with_flag failure is a pre-existing transformers version issue, unrelated to this PR)

🤖 Generated with Claude Code

…ock, and add supervisor RPC timeouts

- Remove bare `torchvision` from ENGINE_VIRTUALENV_PACKAGES["sentence_transformers"] to avoid mixed-source torch/torchvision conflicts (P0)
- Add _exclusive_venv_path_lock around install_packages to prevent concurrent package installation race conditions (P0)
- Add per-worker list_models timeout with asyncio.gather for concurrent aggregation, preventing single-worker hang from blocking all queries (P1)
- Add get_model RPC timeout via xo.wait_for to bound supervisor->worker call duration (P1)
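
The per-worker timeout combined with `asyncio.gather` could look roughly like the sketch below. It is an illustration under assumptions, not the supervisor code: `list_models_from_workers` and the shape of `workers` (address mapped to an async callable) are hypothetical, and the standard-library `asyncio.wait_for` stands in for the `xo.wait_for` call named in the commit message.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def list_models_from_workers(
    workers: Dict[str, Callable[[], Awaitable[List[str]]]],
    timeout: float,
) -> Dict[str, List[str]]:
    """Query every worker concurrently and skip the ones that time out."""

    async def query(address, fetch):
        try:
            # Bound each worker call individually so one hang cannot stall the rest.
            return address, await asyncio.wait_for(fetch(), timeout=timeout)
        except asyncio.TimeoutError:
            return address, None

    pairs = await asyncio.gather(*(query(a, f) for a, f in workers.items()))
    # Keep only workers that answered in time.
    return {address: models for address, models in pairs if models is not None}
```

Because the timeout wraps each call rather than the whole `gather`, the total wait is bounded by roughly one timeout period regardless of how many workers are unresponsive.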

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@XprobeBot XprobeBot added the bug Something isn't working label Apr 23, 2026
@XprobeBot XprobeBot added this to the v2.x milestone Apr 23, 2026
…del_spec

Add "#system_torch# ; #engine# == \"sentence_transformers\"" to virtualenv.packages
for jina-reranker-v3 (rerank) and jina-embeddings-v3 (embedding) to ensure torch
comes from the parent environment, preventing torch(child venv)/torchvision(parent)
version mismatch that causes "operator torchvision::nms does not exist".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@m199369309 m199369309 changed the title from "fix: remove bare torchvision from engine deps, add install_packages lock, and supervisor RPC timeouts" to "fix: remove bare torchvision from engine deps, add install_packages lock, supervisor RPC timeouts, and model_spec torch alignment" Apr 23, 2026

@qinxuye qinxuye left a comment

LGTM

…eck, and SafeRotatingFileHandler

- Add REST-layer negative cache (TTL 10s) for get_model to prevent retry
  floods from blocking Supervisor Actor message queue (2026041601)
- Change require_model HTTP status from 400 to 404 for Model not found
- Enhance get_model error message with available model uids list
- Invalidate negative cache on successful model launch
- Add replica uid pre-check before asyncio.gather in launch_builtin_model
  and _launch_builtin_sharded_model to prevent partial-deploy-then-rollback (2026041602)
- Add SafeRotatingFileHandler that auto-creates parent directories,
  preventing sub pool startup failure when log dir is missing (2026041603)
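
A rotating file handler that creates its parent directory on open might be sketched as below. This is only an illustration of the idea in the last bullet: the class name `DirCreatingRotatingFileHandler` is hypothetical and is not the `SafeRotatingFileHandler` added in this PR, whose implementation may differ.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


class DirCreatingRotatingFileHandler(RotatingFileHandler):
    """Rotating handler that ensures the log directory exists (hypothetical sketch)."""

    def _open(self):
        # baseFilename is an absolute path set by FileHandler before _open() runs.
        # _open() is also used after rollover, so a directory removed at runtime
        # is recreated on the next rotation.
        os.makedirs(os.path.dirname(self.baseFilename), exist_ok=True)
        return super()._open()
```

With the stock `RotatingFileHandler`, constructing the handler for a path whose directory is missing raises `FileNotFoundError`; the override above avoids that failure at startup.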

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@m199369309 m199369309 changed the title from "fix: remove bare torchvision from engine deps, add install_packages lock, supervisor RPC timeouts, and model_spec torch alignment" to "fix: venv torchvision alignment, supervisor RPC timeouts, get_model flood protection, replica pre-check, and safe log handler" Apr 23, 2026
@m199369309 m199369309 requested a review from qinxuye April 23, 2026 14:59
m199369309 and others added 4 commits April 23, 2026 23:28
- Update model-not-found tests to expect HTTP 404 instead of 400
- Remove torchvision assertion from virtualenv packages test since
  torchvision is now supplied per-model via #system_torchvision#

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
terminate_model does not use require_model, so it still returns 400
for ValueError on non-existent model deletion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The HTTPException(404) raised by the negative cache fast path was being
caught by the broad `except Exception` handler and re-wrapped as 500.
Add an `except HTTPException: raise` clause to let it pass through.
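
The passthrough pattern described here can be shown in isolation. The sketch below is self-contained and makes several assumptions: the `HTTPException` class is a stand-in for the web framework's exception type, and `lookup_model`, `known_models`, and `recently_missing` are hypothetical names, not identifiers from the repository.

```python
class HTTPException(Exception):
    """Stand-in for the web framework's HTTP error type (assumption)."""

    def __init__(self, status_code: int, detail: str = ""):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def lookup_model(model_uid: str, known_models: dict, recently_missing: set):
    try:
        if model_uid in recently_missing:
            # Fast path: answer from the negative cache with a 404.
            raise HTTPException(404, f"Model not found: {model_uid}")
        return known_models[model_uid]
    except HTTPException:
        # Let deliberate HTTP errors through unchanged.
        raise
    except Exception as e:
        # Anything unexpected is still reported as a server error.
        raise HTTPException(500, str(e)) from e
```

Without the `except HTTPException: raise` clause, the 404 raised on the fast path would be caught by `except Exception` and returned to the client as a 500.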

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…n tests

- Revert "delete again" assertion back to 400 (terminate_model path)
- Add autouse fixture to clear _MODEL_NOT_FOUND_CACHE between unit tests
  to prevent cross-test pollution from the negative cache

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread xinference/api/utils.py Outdated
Remove local document references (2026041601.md, 2026041602.md,
2026041603.md) that are not part of the upstream repository.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@qinxuye qinxuye left a comment

LGTM

@qinxuye qinxuye merged commit 256603b into xorbitsai:main Apr 24, 2026
4 of 14 checks passed

3 participants