Conversation

eldarkurtic
Contributor

This PR adds support for EAGLE3 speculative decoding for the GPT-OSS model. The changes were tested with a locally trained speculator model, and reasonable acceptance rates were observed.

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables Eagle3 speculative decoding for the GPT-OSS model. The changes are generally well-implemented, including adding the model to the supported list, implementing the SupportsEagle3 interface, and generalizing the embedding layer sharing logic. However, I've identified a critical issue in GptOssModel.forward where a TypeError can occur due to improper handling of a None residual on the first layer of a pipeline stage. I have provided a code suggestion to fix this bug.
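The None-residual bug the bot describes follows a common vLLM decoder-layer convention. Below is a hedged, illustrative sketch of the pattern, assuming the usual (hidden_states, residual) tuple convention; the function name and body are stand-ins, not the actual GptOssModel code.

```python
import torch
from typing import Optional, Tuple


def decoder_layer_forward(
    hidden_states: torch.Tensor,
    residual: Optional[torch.Tensor],
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Illustrative stand-in for a decoder layer's residual handling.
    if residual is None:
        # First layer of a pipeline stage: there is no residual to add yet,
        # so seed the residual stream with the current hidden states.
        residual = hidden_states
    else:
        # Adding a None residual here is exactly the TypeError the review
        # flags; the branch above prevents it.
        hidden_states = hidden_states + residual
        residual = hidden_states
    # ... attention / MLP computation would follow here ...
    return hidden_states, residual
```

The key point is that the first layer of each pipeline stage must branch on `residual is None` rather than unconditionally adding it.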

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Contributor

@rahul-tuli rahul-tuli left a comment


Looks good to me!

Collaborator

@NickLucche NickLucche left a comment


LGTM, but perhaps let's wait for someone more familiar with EAGLE.

It would be great to have smoke tests for model init, at least with layer-capped versions.

  f"{self.disable_by_batch_size=}")

- eagle3_target_supported = ["llama", "qwen"]
+ eagle3_target_supported = ["llama", "qwen", "gpt_oss"]
Collaborator


Unrelated to this PR, but I wonder why we have to list models here instead of relying on SupportsEagle3 dispatching.

Contributor Author


To be honest, I'm not sure; I just followed llama.py / qwen.py to see how EAGLE is enabled there.
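For context, interface-based dispatch would look roughly like the sketch below. `SupportsEagle3` here is a stand-in marker class and the surrounding names are illustrative; vLLM's real interface lives in its model-interfaces module and may differ.

```python
class SupportsEagle3:
    """Stand-in marker interface: the model can act as an EAGLE-3 target."""


class LlamaForCausalLM(SupportsEagle3):
    pass


class GptOssForCausalLM(SupportsEagle3):
    pass


class SomeOtherModel:
    pass


def eagle3_target_supported(model: object) -> bool:
    # isinstance dispatch on the marker interface replaces matching
    # architecture name strings against a hard-coded list.
    return isinstance(model, SupportsEagle3)
```

With this scheme, adding a new target model only requires implementing the interface, with no central list to update.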

logger.info(
"Assuming the EAGLE head shares the same vocab embedding"
" with the target model.")
del self.model.model.embed_tokens
Collaborator


should be picked up by gc

Contributor Author


I took this from llama.py and qwen.py. Should I remove it here or leave it for consistency?
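The pattern under discussion can be sketched as follows. Class and attribute names are illustrative, not vLLM's actual ones; the point is that once the attribute is rebound to the target's embedding, the draft's original embedding has no remaining references and is collectable either way, so the explicit `del` mainly documents intent.

```python
import torch.nn as nn


class TinyDraftModel(nn.Module):
    # Illustrative stand-in for an EAGLE draft model with its own embedding.
    def __init__(self, vocab_size: int = 16, hidden_size: int = 4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)


def share_vocab_embedding(draft: TinyDraftModel,
                          target_embed: nn.Embedding) -> None:
    # Drop the draft's own embedding, then rebind the attribute to the
    # target model's embedding so no duplicate weights are kept alive.
    del draft.embed_tokens
    draft.embed_tokens = target_embed
```

After the call, the draft and target modules reference the same `nn.Embedding` parameters.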

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Sep 19, 2025
@benchislett
Collaborator

Noting #23596, which implements similar support but also includes Blackwell support via FlashInfer. It should be fine to merge this one first, but we should make sure they're consistent, as we will need Blackwell support as well.

@jiahanc
Contributor

jiahanc commented Sep 19, 2025

I think for models with alternative attention patterns like gpt-oss, you need to pick the correct attention builders for the draft model (the draft model uses full attention), as in https://github.com/vllm-project/vllm/pull/23596/files#diff-a4809a837fbf535a8f0999b11087a53ec1c53948b50c0a1fe64396bc86de9461R883-R906. Without this, it will use sliding-window attention, which will cause accuracy issues (the target model's KV cache will be overwritten by the draft model).
This part is also required: https://github.com/vllm-project/vllm/pull/23596/files#diff-80ee7e2a62f9dcfbb8a312dc4e3948557e97ef187290daebbcae1e28596bda29R1107-R1111.
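The concern above can be illustrated with a minimal sketch. All names below are hypothetical stand-ins, not vLLM's actual API; the point is that draft layers must be given full-attention specs regardless of what the corresponding target layer uses, so their KV-cache layout does not follow the target's sliding window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionSpec:
    # None means full attention; an int is the sliding-window size.
    sliding_window: Optional[int]


def draft_attention_spec(target_layer_spec: AttentionSpec) -> AttentionSpec:
    # The EAGLE-3 draft head is full attention everywhere, so it must not
    # inherit a sliding-window spec from the target layer it mirrors.
    return AttentionSpec(sliding_window=None)
```

Wiring the draft model through a mapping like this (instead of reusing the target's per-layer builders) avoids the KV-cache overwriting issue described above.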

@eldarkurtic
Contributor Author

The goal of this one was to first add support for the simplest Llama-like speculator from EAGLE3; then we can build on top of it for more complex architectures. This will unblock some preliminary experimentation with speculative decoding.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 22, 2025 07:04
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 22, 2025
@DarkLight1337 DarkLight1337 merged commit 21467f9 into vllm-project:main Sep 22, 2025
58 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Labels
gpt-oss Related to GPT-OSS models ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1
Projects
Status: Done
7 participants