[ci][llm] Deflake llm_batch_vllm:test_vllm_llama_lora model downloading behavior #60199
Conversation
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
A model downloading locking issue was causing vLLM engine init to fail with "Cannot find any model weights".

Cause: all download types (tokenizer, full model, etc.) shared the same file lock. When a tokenizer-only download acquired the lock first and completed, subsequent full model downloads would hit a TimeoutError, wait for the lock to be released, and then return without downloading (assuming the weights were already cached). The failure showed up inconsistently because in some cases the model files were already cached and the test passed.

Reproduction:
1. Clear cache with `rm -rf ~/.cache/huggingface/hub/models--s3:----air-example-data--llama-3.2-216M-dummy`
2. Run test with `pytest -vs release/llm_tests/batch/test_batch_vllm.py::test_vllm_llama_lora`
3. Observe the lock waits and the engine init failure: `RuntimeError: Cannot find any model weights`

Fix: use separate lock paths for different download types. This way, a completed tokenizer-only download can no longer cause a full model download to skip fetching the model weights.

Testing (on Anyscale and Ray OSS):
1. Clear cache
2. Add fix
3. Run test - success
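For context, a minimal sketch of the pre-fix failure mode, assuming a `filelock`-style API (the function names and lock path here are illustrative, not Ray's actual code):

```python
from filelock import FileLock, Timeout

# The bug: one lock path shared by ALL download types.
LOCK_PATH = "/tmp/model-download.lock"

def fetch_files(tokenizer_only: bool) -> None:
    """Stand-in for the actual Hugging Face download."""

def download(tokenizer_only: bool) -> None:
    try:
        # Try to become the downloader.
        with FileLock(LOCK_PATH, timeout=0):
            fetch_files(tokenizer_only)
    except Timeout:
        # Another process holds the lock: wait for it to finish, then
        # return, assuming the files are now cached. If that process was
        # tokenizer-only and we need full weights, nothing was fetched.
        with FileLock(LOCK_PATH):
            pass
```

Giving each download type its own lock path removes this false "already downloaded" signal.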
Code Review
This pull request effectively resolves a race condition in model downloading by introducing separate file locks for different download types. The use of a lock_suffix based on the download type is a clean and direct solution to the problem described. Additionally, the clarification in the log message for missing hash files is a welcome improvement, reducing potential confusion for users. The changes are well-implemented and address the issue correctly. I have one minor suggestion to improve maintainability.
```python
if tokenizer_only:
    lock_suffix = "-tokenizer"
elif exclude_safetensors:
    lock_suffix = "-exclude-safetensors"
else:
    lock_suffix = "-full"
```
The logic to use different lock suffixes based on the download type is a great fix for the race condition. To improve maintainability and avoid magic strings, you could consider defining "-tokenizer", "-exclude-safetensors", and "-full" as module-level or class-level constants. This would make the code more robust and easier to read, especially if these suffixes are ever used elsewhere.
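A sketch of that suggestion (constant names are illustrative; `tokenizer_only` and `exclude_safetensors` come from the diff above):

```python
# Module-level constants instead of inline magic strings.
TOKENIZER_LOCK_SUFFIX = "-tokenizer"
EXCLUDE_SAFETENSORS_LOCK_SUFFIX = "-exclude-safetensors"
FULL_LOCK_SUFFIX = "-full"

if tokenizer_only:
    lock_suffix = TOKENIZER_LOCK_SUFFIX
elif exclude_safetensors:
    lock_suffix = EXCLUDE_SAFETENSORS_LOCK_SUFFIX
else:
    lock_suffix = FULL_LOCK_SUFFIX
```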
…ding behavior (ray-project#60199) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
…ding behavior (ray-project#60199) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
…ding behavior (ray-project#60199) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Description
Deflakes `llm_batch_vllm:test_vllm_llama_lora` model downloading.

Problem

A model downloading locking issue was causing vLLM engine init to fail with "Cannot find any model weights".
Cause: all download types (tokenizer, full model, etc.) shared the same file lock. When a tokenizer-only download acquired the lock first and completed, subsequent full model downloads would hit a TimeoutError, wait for the lock to be released, and then return without downloading (assuming the weights were already cached).

The failure showed up inconsistently because in some cases the model files were already cached and the test passed.
Reproduction:
1. Clear cache: `rm -rf ~/.cache/huggingface/hub/models--llama-3.2-216M-lora-dummy/`
2. Run test: `pytest -vs release/llm_tests/batch/test_batch_vllm.py::test_vllm_llama_lora`
3. Observe engine init failure: `RuntimeError: Cannot find any model weights`

Fix: use separate lock paths for different download types.
This way, a completed tokenizer-only download can no longer cause a full model download to skip fetching the model weights.
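For illustration, a minimal sketch of how a per-type suffix keeps the lock files separate (the helper name and on-disk layout are assumptions, not the actual Ray code):

```python
def lock_path_for(model_cache_dir: str, lock_suffix: str) -> str:
    # Each download type locks its own file, so a finished
    # tokenizer-only download never satisfies the lock wait of a
    # concurrent full-model download.
    return f"{model_cache_dir}{lock_suffix}.lock"

# Distinct paths -> distinct locks, e.g.:
#   .../models--llama-3.2-216M-lora-dummy-tokenizer.lock
#   .../models--llama-3.2-216M-lora-dummy-full.lock
```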
Testing (on Anyscale and Ray OSS)

1. Clear cache
2. Add fix
3. Run test - success

Also ran all `llm_batch_vllm` release tests on Anyscale and Ray OSS to confirm no regressions in model downloading behavior.

Related issues
n/a - internal Ray CI
Additional information