Below are some of the items that would improve the usability and performance of speculative decoding in vLLM. Please feel free to review, make suggestions, and collaborate if you are interested!
Point of Contact: @benchislett, drafted by @benchislett
Asynchronous Scheduling Support
See main Async Scheduling tracking issue for background: #27679
- Basic support (Merged: [Core] Async Scheduling X Spec Decoding Compatibility #24799)
- Improved overlapping beyond drafting time (todo; discussion in a comment on #24799)
- Support for penalties and bad_words (todo; discussion in a comment on #24799)
- Full support for runtime sampling parameters [logprobs, maybe more] (todo)
- Compatibility with structured outputs (todo; discussion in a comment on #24799)
New Drafting Styles
- Draft-model support (In Review feat: spec decode with draft models #24322)
- PARD support (planned extension of feat: spec decode with draft models #24322)
- Hybrid ngram-eagle drafting (WIP [Spec Decode][Hybrid] Add ngram-eagle SD method #24344)
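For readers new to the n-gram half of the hybrid method, here is a minimal sketch of n-gram (prompt-lookup) drafting: match the last `n` tokens against earlier context and propose the tokens that followed the most recent match. This is only an illustration of the general idea, not the code in #24344 or vLLM's proposer; the function name `ngram_propose` and its parameters are hypothetical.

```python
from typing import List


def ngram_propose(tokens: List[int], n: int, k: int) -> List[int]:
    """Propose up to k draft tokens (hypothetical helper, not a vLLM API).

    Looks for the most recent earlier occurrence of the last n tokens
    and returns the tokens that followed it.
    """
    if n <= 0 or k <= 0 or len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left so the most recent earlier match wins.
    # Starting at len(tokens) - n - 1 excludes the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    # No earlier match: propose nothing (a hybrid scheme could fall
    # back to a learned drafter such as EAGLE here).
    return []
```

When no match is found the function returns an empty list, which is the natural point for a hybrid scheme to hand over to a model-based drafter.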
Performance Improvements
- Full CUDA graph support for the drafter
- Reduce number of kernels issued in EAGLE drafting
  - Task 1: optimize prepare_inputs_padded (In Review [Perf] Optimize EAGLE prepare_inputs_padded with triton kernels #28597)
  - Task 2: clean up and optimize work done between iterations of the drafter (todo: needs issue)
  - Task 3: make EAGLE's prepare_inputs methods stateful [a cleaner implementation, which will also save a few redundant copies] (todo: needs issue)
- Fix acceptance rate inconsistency when using padded drafter batch (WIP [Bugfix] Invalidate positions when using padded speculative decoding #26498)
- First-class support for multimodal EAGLE
  - Re-enable torch.compile & CUDA graphs for EAGLE with multimodal models (Eagle: MM Cuda Graphs with MRope #28896)
- Support more architecture configurations
  - Support EAGLE3 without custom LM Head (In Review [Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden_states #27688)
Sampling Improvements
- Support for probability-aware sampling (WIP [V1][Spec Decode][Feature] Spec decode with probs #20459)
- General performance investigation for speculative sampling
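For context on the probability-aware item, the textbook speculative sampling rule accepts a draft token `x` with probability `min(1, p(x)/q(x))`, where `p` is the target distribution and `q` is the draft distribution, and on rejection resamples from the normalized residual `max(0, p - q)`. The sketch below shows that rule for a single position in plain Python. It is not the implementation in #20459 or vLLM's rejection sampler; `verify_draft_token` and its arguments are hypothetical names.

```python
import random
from typing import List, Optional, Tuple


def verify_draft_token(
    target_probs: List[float],
    draft_probs: List[float],
    draft_token: int,
    rng: Optional[random.Random] = None,
) -> Tuple[int, bool]:
    """Return (token, accepted) for one drafted position (illustrative only)."""
    if rng is None:
        rng = random.Random()
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    # Accept the draft token with probability min(1, p / q).
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized
    # implicitly by passing it as sampling weights.
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    if sum(residual) <= 0.0:
        # Degenerate case (distributions match up to rounding):
        # fall back to sampling from the target distribution.
        residual = list(target_probs)
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False
```

This rule needs the draft probabilities `q`, which is why probability-aware sampling requires the drafter to return probabilities rather than only token ids.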
CC List
chaunceyjiang, nvpohanh and zihaoanllm