[Tracking Issue][Performance]: Speculative decoding performance/QoL improvements #28947

Below are items that would improve the usability and performance of speculative decoding in vLLM. Please feel free to review, suggest changes, and collaborate if you are interested!

Point of Contact: @benchislett, drafted by @benchislett

## Asynchronous Scheduling Support
See the main Async Scheduling tracking issue for background: https://github.com/vllm-project/vllm/issues/27679

- [x] Basic support (Merged https://github.com/vllm-project/vllm/pull/24799)
- [ ] Improved overlapping beyond drafting time (todo, discussion https://github.com/vllm-project/vllm/pull/24799#issuecomment-3528663588)
- [ ] Support for penalties and bad_words (todo, discussion https://github.com/vllm-project/vllm/pull/24799#issuecomment-3442944707)
- [ ] Full support for runtime sampling parameters [logprobs, maybe more] (todo)
   - logprobs: https://github.com/vllm-project/vllm/pull/29223
- [ ] Compatibility with structured outputs (todo, discussion https://github.com/vllm-project/vllm/pull/24799#issuecomment-3438133490)

## New Drafting Styles

- [ ] Draft-model support (In Review https://github.com/vllm-project/vllm/pull/24322)
- [ ] PARD support (planned extension of #24322)
- [ ] Hybrid ngram-eagle drafting (WIP https://github.com/vllm-project/vllm/pull/24344)

## Performance Improvements
- [ ] Full CUDA graph support for the drafter
- [ ] Reduce number of kernels issued in EAGLE drafting
  - [ ] Task 1: optimize prepare_inputs_padded (In Review https://github.com/vllm-project/vllm/pull/28597)
  - [ ] Task 2: clean up and optimize work done between iterations of the drafter (todo: needs issue)
  - [ ] Task 3: make EAGLE's prepare_inputs methods stateful [cleaner implementation, which will also save a few redundant copies] (todo: needs issue)
- [ ] Fix acceptance rate inconsistency when using padded drafter batch (WIP https://github.com/vllm-project/vllm/pull/26498)
- [ ] First-class support for multimodal EAGLE
  - [ ] Re-enable torch.compile & CUDA graphs for EAGLE with multimodal models (https://github.com/vllm-project/vllm/pull/28896)
- [ ] Support more architecture configurations
  - [ ] Support EAGLE3 without custom LM Head (In Review https://github.com/vllm-project/vllm/pull/27688)

## Sampling Improvements
- [ ] Support for probability-aware sampling (WIP https://github.com/vllm-project/vllm/pull/20459)
- [ ] General performance investigation for speculative sampling
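
For readers new to the topic, the per-token acceptance rule that probability-aware speculative sampling is generally built on can be sketched as follows. This is a minimal pure-Python illustration, assuming "probability-aware" means the rejection sampler uses the drafter's token probabilities instead of treating drafts as greedy. It is not vLLM's implementation, and the function name `accept_or_resample` is hypothetical.

```python
import random


def accept_or_resample(draft_token, draft_probs, target_probs, rng):
    """Standard speculative-sampling step for a single position (illustrative only).

    draft_probs / target_probs are probability lists over the same vocabulary.
    Returns (token, accepted).
    """
    q = draft_probs[draft_token]   # drafter's probability of its own proposal
    p = target_probs[draft_token]  # target model's probability of that token

    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    total = sum(residual)
    if total <= 0.0:
        # Degenerate case (no positive residual): fall back to the target distribution.
        residual, total = list(target_probs), sum(target_probs)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False
```

Under this rule the emitted token is distributed according to the target model, which is why passing real draft probabilities (rather than assuming a one-hot draft distribution) matters for non-greedy sampling.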

## CC List
@pavanimajety @vadiklyutiy @hjjq @nvpohanh 
