
Conversation

@jeejeelee
Collaborator

@jeejeelee jeejeelee commented Nov 27, 2025

Purpose

We have removed the LoRA extra-vocab code, which leaves embedding_padding_modules unused; this PR cleans up the related code. It also switches to a smaller base model and corresponding LoRA to reduce CI pressure.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee jeejeelee marked this pull request as draft November 27, 2025 14:07
@mergify

mergify bot commented Nov 27, 2025

Documentation preview: https://vllm--29611.org.readthedocs.build/en/29611/

@mergify mergify bot added documentation Improvements or additions to documentation llama Related to Llama models speculative-decoding v1 labels Nov 27, 2025
@mergify

mergify bot commented Nov 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jeejeelee.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 27, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides a good cleanup by removing the unused embedding_padding_modules and related logic for padding LoRA embedding weights. The changes are consistent across the codebase. I have one suggestion to improve robustness by adding an assertion to ensure vocabulary size consistency for lm_head LoRA adapters, which will provide clearer error messages for users.

if pin_memory:
    loras[module_name].lora_a = loras[module_name].lora_a.pin_memory()
else:
    loras[module_name].lora_b = tensor.to(device=device, dtype=dtype)
Contributor


high

While the padding logic for lm_head LoRA adapters is being removed, it's good practice to add an explicit check for vocabulary size mismatch. This will prevent potential runtime errors with cryptic messages if a user tries to load a LoRA with a different vocabulary size for the lm_head, providing a much clearer error message instead.

Suggested change
-loras[module_name].lora_b = tensor.to(device=device, dtype=dtype)
+if "lm_head" in module_name and model_vocab_size is not None:
+    assert model_vocab_size == tensor.shape[0], (
+        f"The lm_head LoRA vocab size ({tensor.shape[0]}) must be consistent"
+        f" with the base model's vocabulary size({model_vocab_size})."
+    )
+loras[module_name].lora_b = tensor.to(device=device, dtype=dtype)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@mergify mergify bot removed the needs-rebase label Nov 27, 2025
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@jeejeelee jeejeelee marked this pull request as ready for review November 27, 2025 16:34
@jeejeelee jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 27, 2025
@DarkLight1337
Member

Please update the description explaining which code is unused

"""Main function that sets up and runs the prompt processing."""
engine = initialize_engine()
lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")
Collaborator Author


We have removed the LoRA extra-vocab code, so we no longer support this type of LoRA weight. Accordingly, I changed the base model and LoRA model in this script.


@pytest.fixture(scope="session")
-def zephyr_lora_files():
+def qwen3_lora_files():
Collaborator Author


Using a smaller LoRA model can reduce CI pressure.

@jeejeelee
Collaborator Author

@DarkLight1337 I have updated the description

Member

@DarkLight1337 DarkLight1337 left a comment


LGTM but maybe we should get LoRA TP to pass on main first

@jeejeelee
Collaborator Author

@DarkLight1337 I will monitor the LoRA TP tests more closely; they are quite flaky

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
generate_and_test(
    llm, olmoe_lora_files, lora_id=1, compare_lower=fully_sharded_loras
)
generate_and_test(
Collaborator Author


Relax TP4+fully_sharded_loras test
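For context, the relaxed comparison can be sketched as a small helper. This is hypothetical — the actual generate_and_test implementation lives in the vLLM test suite — but it illustrates what toggling compare_lower for the TP4 fully-sharded case does:

```python
def outputs_match(expected: str, actual: str, compare_lower: bool = False) -> bool:
    """Compare generated text against a reference string.

    With fully sharded LoRA at TP4, generations can drift in
    capitalization only, so compare_lower relaxes the check to a
    case-insensitive comparison instead of exact equality.
    """
    if compare_lower:
        return expected.lower() == actual.lower()
    return expected == actual
```

For example, outputs_match("SELECT name FROM singer", "select name from singer", compare_lower=True) passes, while the strict comparison would fail.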

Member

@hmellor hmellor left a comment


LGTM, just some nits

@pytest.fixture(scope="session")
-def zephyr_lora_files():
+def qwen3_lora_files():
    """Download zephyr LoRA files once per test session."""
Member


Docstring should be updated too

Collaborator Author

@jeejeelee jeejeelee Nov 28, 2025


Good catch


# Model name constants used across tests
-MODEL_NAME_ZEPHYR = "HuggingFaceH4/zephyr-7b-beta"
+MODEL_NAME_ZEPHYR = "Qwen/Qwen3-0.6B"
Member


Suggested change
-MODEL_NAME_ZEPHYR = "Qwen/Qwen3-0.6B"
+MODEL_NAME_QWEN = "Qwen/Qwen3-0.6B"

Comment on lines 132 to 136
if "lora_embedding_A" in tensor_name and model_vocab_size is not None:
assert model_vocab_size == tensor.shape[1], (
f"The embedding LoRA size({tensor.shape[1]}) must be consistent"
f" with the base model's vocabulary size({model_vocab_size})."
)
Member


Could we raise a runtime error instead of asserting?

Collaborator Author


Sure
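For reference, the assert-to-exception conversion might look like the sketch below. The helper name and the choice of ValueError are assumptions for illustration; the real check lives in vLLM's LoRA weight-loading path:

```python
def check_embedding_vocab_size(tensor_name, tensor_shape, model_vocab_size):
    """Raise instead of assert so the check survives `python -O`
    and surfaces as a catchable, user-facing error."""
    if "lora_embedding_A" in tensor_name and model_vocab_size is not None:
        lora_vocab_size = tensor_shape[1]
        if lora_vocab_size != model_vocab_size:
            raise ValueError(
                f"The embedding LoRA size ({lora_vocab_size}) must be "
                f"consistent with the base model's vocabulary size "
                f"({model_vocab_size})."
            )
```

Unlike an assert, this error cannot be stripped by the interpreter's optimized mode and gives callers a specific exception type to handle.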

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
@DarkLight1337
Member

Flaky test

@vllm-bot vllm-bot merged commit 39e63de into vllm-project:main Nov 29, 2025
60 of 62 checks passed
@jeejeelee jeejeelee deleted the cleanup-lora branch November 29, 2025 12:02
