[ci][llm] Deflake llm_batch_vllm:test_vllm_llama_lora model downloading behavior #60199

Merged
kouroshHakha merged 4 commits into ray-project:master from nrghosh:nrghosh/fix-llm-ci-model-downloading
Jan 17, 2026
Conversation


@nrghosh commented Jan 16, 2026

Description

  • Fix llm_batch_vllm:test_vllm_llama_lora model downloading
  • Improve confusing log message re: model downloading behavior

Problem

A model-download locking issue was causing vLLM engine init to fail with "Cannot find any model weights".

Cause: all download types (tokenizer, full model, etc.) shared the same file lock. When a tokenizer-only download acquired the lock first and completed, a subsequent full-model download would hit a TimeoutError, wait for the lock to be released, and then return without downloading anything, assuming the weights were already cached.

The failure surfaced inconsistently because in some runs the model files were already cached, in which case the test passed.
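For illustration, here is a minimal sketch of the shared-lock pattern that produces this race; the function names, lock layout, and filelock usage are assumptions for the sketch, not the actual Ray code:

```python
# Hypothetical sketch (not the actual Ray code): every download type shares
# one lock file, so a finished tokenizer-only download convinces a concurrent
# full-model download that its artifacts are already cached.
import os
from filelock import FileLock, Timeout

def fetch_from_remote(model_id: str, tokenizer_only: bool) -> None:
    """Stand-in for the real download logic; only illustrates the flow."""
    kind = "tokenizer" if tokenizer_only else "full model"
    print(f"downloading {kind} for {model_id}")

def download(model_id: str, tokenizer_only: bool, cache_dir: str = "/tmp/models") -> None:
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, f"{model_id}.lock")  # one lock for ALL download types
    try:
        with FileLock(lock_path, timeout=0):
            fetch_from_remote(model_id, tokenizer_only)
    except Timeout:
        # Another download holds the lock: wait for it to finish, then return
        # without downloading, wrongly assuming *our* artifacts are cached.
        with FileLock(lock_path):
            pass
```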

Reproduction:

  1. Clear the cache with `rm -rf ~/.cache/huggingface/hub/models--llama-3.2-216M-lora-dummy/`
  2. Run the test with `pytest -vs release/llm_tests/batch/test_batch_vllm.py::test_vllm_llama_lora`
  3. Observe the failure to fetch weights and the resulting engine init failure: `RuntimeError: Cannot find any model weights`

Fix: use separate lock paths for different download types (see the sketch below).

This way, a completed tokenizer-only download can no longer cause a full-model download to skip fetching the model weights.
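A minimal sketch of the idea; the suffix values match the snippet quoted in the review comment below, while the helper name and path composition are assumptions:

```python
def lock_path_for(model_id: str, tokenizer_only: bool, exclude_safetensors: bool,
                  cache_dir: str = "/tmp/models") -> str:
    """Hypothetical helper: derive a distinct lock file per download type."""
    if tokenizer_only:
        lock_suffix = "-tokenizer"
    elif exclude_safetensors:
        lock_suffix = "-exclude-safetensors"
    else:
        lock_suffix = "-full"
    return f"{cache_dir}/{model_id}{lock_suffix}.lock"
```

With per-type lock paths, a tokenizer-only download contends only with other tokenizer-only downloads, so a full-model download never mistakes a released tokenizer lock for "weights already present."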

Testing (on Anyscale and Ray OSS)

  1. Clear the cache
  2. Apply the fix
  3. Run the test: it passes

Also ran all llm_batch_vllm release tests on Anyscale and Ray OSS to confirm no regressions in model downloading behavior.


Related issues

n/a - internal Ray CI


Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh requested a review from a team as a code owner January 16, 2026 01:49
@nrghosh changed the title from "[ci][llm] lm ci model downloading" to "[ci][llm] Fix llm_batch_vllm:test_vllm_llama_lora model downloading behavior" Jan 16, 2026
@nrghosh changed the title from "[ci][llm] Fix llm_batch_vllm:test_vllm_llama_lora model downloading behavior" to "[ci][llm] Deflake llm_batch_vllm:test_vllm_llama_lora model downloading behavior" Jan 16, 2026
@gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves a race condition in model downloading by introducing separate file locks for different download types. The use of a lock_suffix based on the download type is a clean and direct solution to the problem described. Additionally, the clarification in the log message for missing hash files is a welcome improvement, reducing potential confusion for users. The changes are well-implemented and address the issue correctly. I have one minor suggestion to improve maintainability.

Comment on lines +140 to +145
```python
if tokenizer_only:
    lock_suffix = "-tokenizer"
elif exclude_safetensors:
    lock_suffix = "-exclude-safetensors"
else:
    lock_suffix = "-full"
```
Severity: medium

The logic to use different lock suffixes based on the download type is a great fix for the race condition. To improve maintainability and avoid magic strings, you could consider defining "-tokenizer", "-exclude-safetensors", and "-full" as module-level or class-level constants. This would make the code more robust and easier to read, especially if these suffixes are ever used elsewhere.
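As a sketch of that suggestion (constant and function names here are illustrative, not from the PR):

```python
# Module-level constants replace the inline magic strings.
LOCK_SUFFIX_TOKENIZER = "-tokenizer"
LOCK_SUFFIX_EXCLUDE_SAFETENSORS = "-exclude-safetensors"
LOCK_SUFFIX_FULL = "-full"

def choose_lock_suffix(tokenizer_only: bool, exclude_safetensors: bool) -> str:
    """Illustrative refactor of the suffix selection above."""
    if tokenizer_only:
        return LOCK_SUFFIX_TOKENIZER
    if exclude_safetensors:
        return LOCK_SUFFIX_EXCLUDE_SAFETENSORS
    return LOCK_SUFFIX_FULL
```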

@nrghosh requested a review from aslonnie January 16, 2026 02:14
@nrghosh added the llm, ci, and go (add ONLY when ready to merge, run all tests) labels Jan 16, 2026
@kouroshHakha merged commit 4c64f43 into ray-project:master Jan 17, 2026
6 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
[ci][llm] Deflake llm_batch_vllm:test_vllm_llama_lora model downloading behavior (ray-project#60199)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
[ci][llm] Deflake llm_batch_vllm:test_vllm_llama_lora model downloading behavior (ray-project#60199)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
[ci][llm] Deflake llm_batch_vllm:test_vllm_llama_lora model downloading behavior (ray-project#60199)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
