[Core] Modify the initialization parameters of the lora manager #25249
Conversation
Code Review
This pull request refactors the initialization of the LoRA manager to use a single `vllm_config` object, which simplifies the API and improves code clarity. The changes are well implemented across the codebase. However, I've found a critical issue in one of the updated tests where the configuration for the test is not correctly set up, which will lead to test failures. I've provided a suggestion to fix it.
```diff
 model_config = ModelConfig(max_model_len=16)
 vllm_config = VllmConfig(model_config=model_config,
                          lora_config=lora_config)

 vllm_config.scheduler_config.max_num_seqs = 4
 vllm_config.scheduler_config.max_num_batched_tokens = 2
 worker_adapter_manager = LRUCacheWorkerLoRAManager(
-    4, 2,
-    dummy_model.unpadded_vocab_size - lora_config.lora_extra_vocab_size,
-    lora_config, device, EMBEDDING_MODULES, EMBEDDING_PADDING_MODULES)
+    vllm_config, device, EMBEDDING_MODULES, EMBEDDING_PADDING_MODULES)

 worker_adapter_manager.max_num_seqs = 4
 worker_adapter_manager.max_num_batched_tokens = 2
```
The `vllm_config` is not correctly initialized for the test. The `ModelConfig` within it doesn't have the `hf_config` from the `dummy_model`, which will cause `vllm_config.model_config.get_vocab_size()` to return 0 inside `LRUCacheWorkerLoRAManager`. This leads to incorrect behavior, especially when calculating `target_embedding_padding`.

To fix this, you should associate the `dummy_model.config` with the `vllm_config`'s `model_config`. Also, the manual setting of `max_num_seqs` and `max_num_batched_tokens` on `worker_adapter_manager` is redundant, as these are already set during initialization from the `vllm_config`.
Suggested change:

```diff
 model_config = ModelConfig(max_model_len=16)
 vllm_config = VllmConfig(model_config=model_config,
                          lora_config=lora_config)
+# Manually set hf_config for the test since ModelConfig doesn't take it
+# in __init__ and we are not loading from a real model path.
+vllm_config.model_config.hf_config = dummy_model.config
 vllm_config.scheduler_config.max_num_seqs = 4
 vllm_config.scheduler_config.max_num_batched_tokens = 2
 worker_adapter_manager = LRUCacheWorkerLoRAManager(
     vllm_config, device, EMBEDDING_MODULES, EMBEDDING_PADDING_MODULES)
-worker_adapter_manager.max_num_seqs = 4
-worker_adapter_manager.max_num_batched_tokens = 2
```
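A quick way to check that the suggestion behaves as described is a couple of assertions at the end of the test. This is only an illustrative sketch, assuming the suggested `hf_config` assignment above has been applied; it uses nothing beyond the attributes already discussed in this thread.

```python
# Illustrative sanity checks, assuming the suggested change above is applied.

# With dummy_model.config attached as hf_config, the vocab size should no
# longer be 0, so target_embedding_padding is derived from real values.
assert vllm_config.model_config.get_vocab_size() > 0

# The limits are read from vllm_config.scheduler_config during initialization,
# which is why the manual assignments on worker_adapter_manager are redundant.
assert worker_adapter_manager.max_num_seqs == 4
assert worker_adapter_manager.max_num_batched_tokens == 2
```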
Just some nits. Otherwise LGTM
Thanks for the reminder! @jeejeelee
@jeejeelee, thanks for the CC; we will update our plugin accordingly.
…-project#25249) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@Yikun @wangxiyuan Updated our plugin: vllm-project/vllm-ascend#3095
### What this PR does / why we need it?
Fix the impact on LoRA that vllm-project/vllm#25249 brought.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.2
- vLLM main: vllm-project/vllm@9607d5e

Signed-off-by: paulyu12 <507435917@qq.com>
This PR broke LoRA in the TPU plugin, but thanks to @gpolovets1 for the fix: https://github.com/vllm-project/tpu_commons/pull/720. cc: @jeejeelee @Isotr0py
…-project#25249) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: charlifu <charlifu@amd.com>
Purpose
- Pass `vllm_config` to the internal LoRA manager to facilitate the advancement of [Core] Enable LoRA support for classification model #24596, as #24596 needs to obtain detailed information such as the task type.
- Rename (`lora.py` -> `lora_weight.py`).

This PR will slightly affect the LoRA implementation of hardware plugins. cc @xuechendi @Yikun
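For hardware-plugin maintainers, here is a minimal sketch of the call-site change, based only on the constructor arguments visible in this PR's diff; the surrounding variable names are assumed to already exist on the plugin side.

```python
# Sketch of adapting a plugin to the new LRUCacheWorkerLoRAManager signature.
# Assumes the plugin already builds a VllmConfig (vllm_config) and has the
# device and embedding-module constants in scope.

# Before this PR: scalar limits, the vocab size, and lora_config were passed
# explicitly, e.g.:
# worker_lora_manager = LRUCacheWorkerLoRAManager(
#     max_num_seqs, max_num_batched_tokens,
#     unpadded_vocab_size - lora_config.lora_extra_vocab_size,
#     lora_config, device, EMBEDDING_MODULES, EMBEDDING_PADDING_MODULES)

# After this PR: pass the whole vllm_config and let the manager derive the
# scheduler, model, and LoRA settings from it.
worker_lora_manager = LRUCacheWorkerLoRAManager(
    vllm_config, device, EMBEDDING_MODULES, EMBEDDING_PADDING_MODULES)
```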
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.