[Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding #21068
Conversation
Code Review
This pull request introduces support for LoRA with speculative decoding in the V1 engine, addressing a dimension mismatch issue that arises when speculative decoding is enabled. The changes involve adjusting the dimension of AdapterMapping.prompt_mapping to store LoRA IDs for each sampled token, ensuring compatibility with both enabled and disabled speculative decoding scenarios. The test plan includes running the V1 server with LoRA and speculative decoding enabled, covering single and multiple requests in parallel.
NickLucche
left a comment
Would you mind turning your scripts into tests, if not already present?
This pull request has merge conflicts that must be resolved before it can be merged.
I have turned the script into a unit test; it succeeded.
@NickLucche Can you help take a look?
This pull request has merge conflicts that must be resolved before it can be merged.
```diff
  # to max_batches * (num_speculative_decoding_tokens + 1).
  self.prompt_mapping_meta = LoRAKernelMeta.make(
-     self.max_loras, max_batches, device=device
+     self.max_loras, max_num_batched_tokens, device=device
```
Hi @xiaohongchen1991. I have seen configurations with max_batches and max_num_batched_tokens set to 1024 and 8192 respectively. In such cases there is a constraint on how big num_speculative_decoding_tokens can be. I think we should add an assert like assert max_num_batched_tokens >= max_batches * (num_speculative_decoding_tokens + 1) so we catch out-of-bounds errors.
What do you think? Thanks.
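The suggested guard could be sketched like this, using the example values quoted in this thread (1024, 8192); in vLLM it would sit next to the LoRAKernelMeta.make call rather than at module level, so treat the surrounding wiring as illustrative:

```python
# Example configuration values from this thread (illustrative only).
max_batches = 1024
max_num_batched_tokens = 8192
num_speculative_decoding_tokens = 5

# Each sequence can contribute up to num_speculative_decoding_tokens + 1
# sampled tokens per step, so the flattened prompt mapping must fit within
# the batched-token budget.
assert max_num_batched_tokens >= max_batches * (num_speculative_decoding_tokens + 1), (
    "num_speculative_decoding_tokens is too large for max_num_batched_tokens"
)
```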
Also, should we just use max_batches if spec_decode is disabled? It might be useful when debugging issues.
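The sizing suggested above could be sketched as a hypothetical helper (the names follow this discussion rather than vLLM's actual code, where the sizing lives inline in the punica wrapper setup):

```python
def prompt_mapping_size(max_batches: int, num_speculative_tokens: int) -> int:
    """Number of LoRA-id slots needed for the prompt mapping."""
    if num_speculative_tokens == 0:
        # Spec decode disabled: one sampled token per sequence, so the
        # original max_batches sizing suffices (and is easier to debug).
        return max_batches
    # Spec decode enabled: up to num_speculative_tokens + 1 sampled
    # tokens per sequence per forward pass.
    return max_batches * (num_speculative_tokens + 1)
```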
Hi @varun-sundar-rabindranath. Assert added.
thanks @li2haipeng !
Thanks @xiaohongchen1991 for the great work - the changes generally look good to me. The preparation of … I am a little out of sync with Spec. Decode - particularly I am a bit confused as to what happens when …
Not sure if I understand it correctly, but this situation would be guarded after adding the assert, right?
```diff
  lora_requests: set[LoRARequest]
  prompt_lora_mapping, token_lora_mapping, lora_requests = (
-     input_batch.make_lora_inputs(num_scheduled_tokens)
+     input_batch.make_lora_inputs(num_scheduled_tokens, num_sampled_tokens)
```
@li2haipeng can you also add an assert after this line, like
assert len(prompt_lora_mapping) <= self.max_num_batched_tokens
My main concern is that, given we are doing
prompt_lora_mapping = tuple(req_lora_mapping.repeat(num_sampled_tokens))
in gpu_input_batch.py :: make_lora_inputs(), I wonder if len(prompt_lora_mapping) could exceed max_num_batched_tokens. If that happens, I think we will catch it here: vllm/lora/punica_wrapper/punica_base.py, line 192 in da786e3 (self._sampler_indices[: sampler_indices.shape[0]].copy_(sampler_indices)).
Also, I think the interaction between max_batches * (num_speculative_decoding_tokens + 1) and max_num_batched_tokens should be captured, and we should raise an error during engine startup when they are incompatible. For example, if the user creates an engine with LoRA + Spec Decode + max_num_seqs=512 + max_num_batched_tokens=1024 + num_speculative_decoding_tokens=5, this will assert deep in the code - but it'll be much better to assert during startup, somewhere near create_engine_config (line 1285 in da786e3), which handles max_num_batched_tokens.
What do you think?
cc @robertgshaw2-redhat
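A startup-time check along the lines described above might look like the following sketch; the function name and placement are assumptions for illustration, and in vLLM the real check would live in the config validation near create_engine_config:

```python
def validate_lora_spec_decode_config(
    max_num_seqs: int,
    max_num_batched_tokens: int,
    num_speculative_tokens: int,
) -> None:
    # With LoRA + spec decode, the flattened per-sampled-token LoRA mapping
    # can hold up to max_num_seqs * (num_speculative_tokens + 1) entries,
    # which must fit within the batched-token budget.
    required = max_num_seqs * (num_speculative_tokens + 1)
    if max_num_batched_tokens < required:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) must be at "
            f"least max_num_seqs * (num_speculative_tokens + 1) = {required}. "
            "Consider increasing max_num_batched_tokens or decreasing "
            "num_speculative_tokens."
        )

# The failing example from this comment: 1024 < 512 * (5 + 1) = 3072,
# so the engine would reject this configuration at startup.
```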
I agree with Varun
@varun-sundar-rabindranath Fixed. Thanks for your suggestion.
@li2haipeng can you record this restriction too, please, i.e. set …
@varun-sundar-rabindranath @li2haipeng we don't need to set it to False because this PR fixed the issue: #28318.
```python
"Qwen/Qwen3-1.7B",
"AngelSlim/Qwen3-1.7B_eagle3",
"premjatin/qwen-linear-algebra-coder",
1,
```
Given the issues around TP, it'd be good to add a TP = 2 test as well. But it can be a fast-follow after #28318 lands. Thanks.
varun-sundar-rabindranath
left a comment
LGTM! Thanks @xiaohongchen1991 @li2haipeng @dcmaddix for working on this ❤️
```python
), (
    "Consider increasing max_num_batched_tokens or "
    "decreasing num_speculative_tokens"
)
```
nit: can you turn it into a ValueError to stay consistent with the error-raising mechanism in this file? Thanks.
Done
robertgshaw2-redhat
left a comment
stamp post varun's review
Head branch was pushed to by a user without write access
…llm-project#21068) Signed-off-by: Sean Chen <xiaohong_chen1991@hotmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Danielle Robinson <dcmaddix@gmail.com> Co-authored-by: Haipeng Li <li2haipeng@gmail.com> Co-authored-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Bump vLLM version to v0.11.2.
What's broken and changed by vLLM:
1. structured_output is broken by vllm-project/vllm#26866
2. get_mrope_input_positions is broken by vllm-project/vllm#28399
3. graph mode is broken by vllm-project/vllm#25110; we'll upgrade torch to 2.8 to fix the problem later
4. embedding is broken by vllm-project/vllm#27583
5. `get_attn_backend_cls` and the attention backend are broken by vllm-project/vllm#28534
6. spec decode is broken by vllm-project/vllm#28771
7. sp feature is broken by vllm-project/vllm#27126
8. mtp is broken by vllm-project/vllm#27922
9. lora is broken by vllm-project/vllm#21068
10. execute_model is broken by vllm-project/vllm#26866
11. `VLLM_DISABLE_SHARED_EXPERTS_STREAM` env is broken by vllm-project/vllm#28159
12. kv cache is broken by vllm-project/vllm#27753
13. dp is broken by vllm-project/vllm#25110
What's broken and changed by ourselves:
1. qwen vl is broken by vllm-project/vllm#28455; we'll remove model files in the future to avoid this kind of error
2. Engine core is broken by vllm-project/vllm#23691; we'll remove the patch file in the future
3. Ascend scheduler is broken by vllm-project/vllm#28733; we'll remove the Ascend scheduler later
4. qwen3-next is broken by vllm-project/vllm#28083; we'll remove model files in the future to avoid this kind of error
5. qwen vl is broken by vllm-project/vllm#27764; we'll remove model files in the future
Known issues:
1. ray doesn't work
2. the accuracy of qwen3-next is not correct
3. qwen3-vl is broken
4. prefix cache + ascend scheduler + deepseek v2 lite is broken
Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: shen-shanshan <467638484@qq.com> - vLLM version: v0.11.2 --------- Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: leo-pony <nengjunma@outlook.com> Co-authored-by: MengqingCao <cmq0113@163.com> Co-authored-by: hfadzxy <starmoon_zhang@163.com> Co-authored-by: leo-pony <nengjunma@outlook.com>
Purpose
This PR adds support for LoRA with speculative decoding for the V1 engine on GPU (TPU is not covered by this PR).
The current V1 LoRA implementation assumes one sampled token per prompt in each forward pass. See the logits computation, where logits.size(0) is the sum of the numbers of sampled tokens across prompts, while lora_logits.size(0) is the number of prompts.
This works when num_sample_tokens is 1, i.e. when running the server without speculative decoding. If speculative decoding is enabled, num_sample_tokens becomes num_speculative_decoding_tokens + 1, and the code fails with a dimension-mismatch error.
This commit adjusts the dimension of AdapterMapping.prompt_mapping so that it stores a LoRA id for each sampled token. If speculative decoding is disabled, its size becomes the number of prompts and it stores a LoRA id for each prompt, the same as the original implementation.
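The adjustment can be sketched in plain Python; this hypothetical helper stands in for the repeat done in gpu_input_batch.py :: make_lora_inputs, which operates on numpy arrays and whose signature may differ:

```python
def expand_prompt_lora_mapping(
    req_lora_mapping: list[int], num_sampled_tokens: int
) -> tuple[int, ...]:
    # Repeat each request's LoRA id once per sampled token so that every
    # sampled logit row has a matching LoRA id.
    return tuple(
        lora_id
        for lora_id in req_lora_mapping
        for _ in range(num_sampled_tokens)
    )

# Without spec decode, num_sampled_tokens == 1 and the mapping stays one id
# per prompt; with k speculative tokens, each id appears k + 1 times.
```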
Test Plan
Run the V1 server with both LoRA and speculative decoding enabled. Test scenarios: a single request, and multiple requests in parallel.
Test Result
The dimensions of logits and lora_logits match in the logits computation.
(Optional) Documentation Update