
[WIP] Speculative decoding using a draft model #2188

Draft · wants to merge 1 commit into base: main

Conversation

cadedaniel
Collaborator

Speculative decoding

This PR adds speculative decoding to vLLM using the draft-model approach, first explored in the original speculative decoding papers.

As a simplified overview, this type of speculative decoding runs a smaller draft model to guess which tokens the larger target model will emit. The target model then verifies those guesses and emits them if they pass verification. This reduces latency because many tokens can be verified in a single forward pass of the target model.
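As a rough illustration of the loop (a sketch only; the class, method, and helper names here are hypothetical stand-ins, not this PR's API):

def speculative_step(draft_model, target_model, token_ids, k):
    # Draft: the small model autoregressively guesses the next k tokens.
    draft_tokens = []
    for _ in range(k):
        draft_tokens.append(draft_model.sample_next(token_ids + draft_tokens))
    # Verify: a single forward pass of the large model scores all k guesses at once.
    target_probs = target_model.probs_for(token_ids, draft_tokens)
    # Accept the verified prefix (via rejection sampling), so each step emits between
    # 1 and k + 1 tokens while preserving the target model's output distribution.
    accepted = accept_tokens(draft_tokens, target_probs)
    return token_ids + accepted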

Running this at Anyscale, I see a 30-50% latency reduction depending on the draft model, target model and dataset.

Usage

The usage looks like this:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    speculative_model="JackFram/llama-68m", # The draft model. Must have same vocabulary as target model.
    tensor_parallel_size=4,
    speculative_model_uses_tp_1=True, # Whether the draft model should use TP=1 or same TP as target model.
    num_speculative_tokens=3, # The number of speculative tokens to score.
)
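Generation should then work as in standard vLLM usage, since the speculative settings are passed to the constructor above (illustrative only):

from vllm import SamplingParams

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8, max_tokens=128))
print(outputs[0].outputs[0].text)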

Feature list

  • Any vLLM model can be used as the draft model as long as it has the same vocabulary as the target model.
  • Rejection sampling to guarantee the target model distribution (a minimal sketch follows this list).
  • The draft model can use a tensor-parallel degree of 1, or the same tensor-parallel degree as the target model. This reduces draft model latency and contributes to the overall speedup.
  • The vLLM scheduler has been improved to support >1 token per step.
  • Tests for rejection sampling, various workers, scheduler, and e2e speculative decoding.
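A minimal sketch of the rejection rule that preserves the target distribution (illustrative only; see the rejection sampler linked in the reviewer guide for the actual implementation), assuming per-position probability vectors p_target and p_draft over the vocabulary:

import torch

def accept_or_resample(draft_token: int, p_target: torch.Tensor, p_draft: torch.Tensor):
    # Accept the draft token x with probability min(1, p_target[x] / p_draft[x]).
    accept_prob = torch.clamp(p_target[draft_token] / p_draft[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token, True
    # On rejection, resample from the renormalized residual max(0, p_target - p_draft),
    # which makes the overall output exactly follow the target distribution.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, num_samples=1)), False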

Future work (not implemented)

  • Top-k speculative decoding ("tree attention") to increase draft hit rates
  • Optimized PagedAttention (multi-query attention) for scoring
  • Beam-sampling + speculative decoding

This PR is marked as a draft because nontrivial work is required to get it into a mergeable state.

Guide for reviewers

The following are key files for understanding this speculative decoding implementation:

  • Rejection sampler. Uses rejection sampling to recover the target model distribution from samples drawn from the draft distribution.
  • Draft target worker. A worker which contains both the draft and target models. This orchestrates the drafting of speculative continuations, scoring, and accepting speculative tokens based on rejection sampling.
  • Multi-step worker. This runs the draft model several times, without invoking the vLLM scheduler each time.
  • Single-TP worker. This allows the DraftTargetWorker to run the draft model with TP=1, while the target model is TP=2, 4 or 8. This reduces latency of the draft model.
  • Key scheduler change. The scheduler and block manager now have the notion of "preallocated slots". This allows them to schedule enough KV block space for the worker to run several steps before the next scheduler invocation (see the sketch after this list).
  • Key sequence changes. Sequences now track "processed tokens". A processed token is one that has been processed by the model such that the KV activations have been saved to cache.
  • Performance optimizations. Significant modifications have been made to the sampler so that the draft model can run several steps without CPU synchronization. This plus the SingleTPWorker reduced draft model latency from 5ms to 1.5ms.
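As a rough illustration of the preallocated-slot accounting mentioned above (a hypothetical helper, not the PR's block manager API):

def blocks_needed(num_processed_tokens: int, lookahead_slots: int, block_size: int = 16) -> int:
    # Reserve KV-cache blocks for the tokens already processed plus the slots the worker
    # may fill before the next scheduler invocation (e.g. k draft tokens + 1 bonus token).
    total_slots = num_processed_tokens + lookahead_slots
    return -(-total_slots // block_size)  # ceiling division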

@Lvjinhong

Lvjinhong commented Dec 19, 2023

I'm very fortunate to witness such great work from you. How is the current progress? May I use your method to accelerate Llama 2 70B on vLLM for now?

@cadedaniel
Collaborator Author

I'm very fortunate to witness such great work from you. How is the current progress? May I use your method to accelerate Llama 2 70B on vLLM for now?

Hi @Lvjinhong. The current PR requires some work to get into a working state on the public vLLM repo. I will start on this later this week, but given the US holidays I expect to finish in early January.

@zhaoyang-star
Contributor

Very glad to see this work. I also have a WIP version of speculative decoding and am looking forward to using this feature.

@cadedaniel
Collaborator Author

Created a plan to break this PR into separate pieces. Pending review from the original vLLM authors, I will start on it this week. https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/edit

@wasertech

wasertech commented Jan 17, 2024

From the little I understood, I think there is a way to avoid needing a separate draft model for your vocabulary by directly using the model's n-grams? It should probably be the subject of a separate PR, but I think this is the way forward for everyone to easily enjoy the benefits of speculative decoding.

What do you think, @cadedaniel? Do you think it's as straightforward as @simon-mo puts it?

I would imagine after this PR, ngram support should be very straightforward.

@cadedaniel
Collaborator Author

Yep. This PR builds the framework for scoring and verifying draft tokens, independent of whether they come from a draft model, from something like Medusa, or from n-gram lookup such as prompt lookup or RAG lookup.
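For reference, a minimal sketch of the prompt-lookup (n-gram) idea mentioned above (illustrative only; not part of this PR):

def prompt_lookup_propose(token_ids: list, ngram_size: int = 2, k: int = 3) -> list:
    # Match the most recent n-gram against earlier positions in the sequence and
    # propose the tokens that followed that earlier match as draft tokens.
    if len(token_ids) <= ngram_size:
        return []
    pattern = token_ids[-ngram_size:]
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == pattern:
            return token_ids[start + ngram_size:start + ngram_size + k]
    return []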

@wasertech

wasertech commented Jan 17, 2024

I think I understand a bit better now. Thank you again, @cadedaniel, for this incredible work!

@UranusSeven
Contributor

UranusSeven commented Jan 19, 2024

@cadedaniel Hi! Thanks for your great work!

As I understand it, during the prefill stage, we are going to run the draft model once and the target model once. Will this increase the first token latency?

@cadedaniel
Collaborator Author

Yes, the time to first token is a few milliseconds higher with a 68M-parameter draft model. It can be optimized in future versions, e.g. with Medusa/EAGLE, where draft tokens are generated without an independent KV cache.

@UranusSeven
Contributor

@cadedaniel Thanks for your reply! I also want to confirm my understanding regarding the decoding step's impact on first token latency:

  • Generating k draft tokens per iteration introduces a delay of k * d milliseconds, where d is the time to run the draft model once.
  • Additionally, the target model's execution time of t milliseconds further contributes to the overall latency.
  • This means the total decoding time is approximately k * d + t milliseconds, ignoring the cost of sampling (see the worked example after this list).
  • Consequently, newly arriving prefill requests might experience a wait of up to k * d + t milliseconds before execution.
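For concreteness, a worked example (the ~1.5 ms draft latency is the figure quoted in the PR description; the 20 ms target latency is a made-up number):

k, d, t = 3, 1.5, 20.0        # draft steps, draft latency (ms), target latency (ms)
decode_step_ms = k * d + t    # 3 * 1.5 + 20.0 = 24.5 ms per speculative decoding step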

I'm wondering if it helps to break the decoding step into 2 sub-steps, drafting and verifying. In this way, newly arriving prefill requests wait up to k * d milliseconds. And, if the newly arriving prompts are short, they can be batched with the verifying step.

What do you think?

@cadedaniel
Collaborator Author

@UranusSeven could you ask your question in a discussion post? Happy to answer there.

@SinanAkkoyun

SinanAkkoyun commented Jan 30, 2024

Hi, thanks for the great work! Do you have any tokens-per-second benchmarks? I'd like to use speculative decoding with DeepSeek's 33B and 1B models.

Also, does this PR support a draft model with a different tokenizer than the main model (for example, Llama with a DeepSeek draft model)?

@qizzzh

qizzzh commented Jan 31, 2024

Out of curiosity, does the proposal support separate deployment of the draft and target models? Asking because in production the two likely have different QPS and compute resource requirements.

@zhuohan123 zhuohan123 mentioned this pull request Jan 31, 2024

@xunfeng1980 xunfeng1980 left a comment

vllm-public/vllm/config.py", line 62
    def __init__(self,
                ^
SyntaxError: '(' was never closed

cuda_graph_max_context_len: int = 5000,
cuda_graph_cache_size: int = 10,
flash_style: bool = False,
max_chunked_prefill_len: int = -1,

) ?

vllm-public/vllm/engine/async_llm_engine.py", line 7, in <module>
    from vllm.anyscale.lora.utils import LoRARequest
ModuleNotFoundError: No module named 'vllm.anyscale'

Having the same issue. Any update on this?

@cadedaniel
Collaborator Author

Hi everyone, want to provide a quick update here:

  • Over the last few weeks I've prioritized optimizing Mixtral latency (Optimized fused MoE Kernel #2913).
  • Now I will focus on getting this merged full-time. I am aiming to finish the merges by February with @LiuXiaoxuanPKU's help reviewing.
  • After the correctness tests are merged, I will accept any optimizations that pass the correctness tests. I will list out some major optimizations that people can take on (there are already some technical discussions happening on MQA, cc @ymwangg @robertgshaw2-neuralmagic).

@binarycrayon

binarycrayon commented Feb 20, 2024 via email

@ymwangg
Contributor

ymwangg commented Feb 21, 2024

Thanks for the update! Looking forward to seeing how to incorporate our work once your PRs are out.

@hchoi-moveworks

Thanks @cadedaniel for the truly inspiring work! 🙏

Would speculative decoding work with vLLM's continuous batching as well? Would that be the "Integrate speculative decoding with LLMEngine" step in the proposed design doc?
https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/edit

@cadedaniel
Collaborator Author

Thanks @cadedaniel for the truly inspiring work! 🙏

Would speculative decoding work with vLLM's continuous batching as well? Would that be the "Integrate speculative decoding with LLMEngine" step in the proposed design doc?

https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/edit

Yep! Once the e2e correctness tests pass you can use it with continuous batching.

@simon-mo simon-mo mentioned this pull request Apr 4, 2024
@paolovic

Hi @cadedaniel,
thanks for your great contribution!
Since I'm also very keen on this feature, can I support you somehow?

@cadedaniel
Collaborator Author

Thanks! It would be helpful if you could test it out and add features or optimizations. Once #3951 is merged it will be correct but not yet fast; I'll post more about optimizing it this week.

@paolovic

Thanks! It would be helpful if you could test it out and add features or optimizations. Once #3951 is merged it will be correct but not yet fast; I'll post more about optimizing it this week.

Great, I'll start tomorrow (CET)

@cocoza4

cocoza4 commented Apr 25, 2024

Will the error "AssertionError: Speculative decoding not yet supported for RayGPU backend." go away if this PR is merged? Ref: #4358

@HimanshuJanbandhu

HimanshuJanbandhu commented May 27, 2024

Hi guys,
First of all, great to see your work on speculative decoding.
However, there is a new paper, Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding (https://arxiv.org/abs/2309.08168), which shows how a draft model can be created by skipping layers of the original LLM.
There is also a recent paper by Meta on the same idea: LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (https://arxiv.org/pdf/2404.16710).
This method further reduces memory usage, since the same model is used for both drafting and verification; the draft model is created simply by skipping some layers of the original model.
I would love to see this implemented in vLLM. Since I am keen on this feature, can I support you somehow?
The official implementation of "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" is available at https://github.com/dilab-zju/self-speculative-decoding/tree/main
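For illustration, the core of the layer-skipping idea looks roughly like this (a sketch only; the names are hypothetical and this is not code from either paper or from vLLM):

def draft_forward_with_layer_skip(hidden_states, layers, skip_layer_ids):
    # Draft pass: run only the layers not in the skip set; the full model
    # (all layers) is then used to verify the drafted tokens.
    for i, layer in enumerate(layers):
        if i in skip_layer_ids:
            continue
        hidden_states = layer(hidden_states)
    return hidden_states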

@w32zhong

w32zhong commented May 31, 2024

Hi @HimanshuJanbandhu, I would like also mention my recent work S3D (https://arxiv.org/abs/2405.20314). It is very similar to your mentioned Self-Speculative Decoding work which is simple to implement and easy to be integrated to existing stacks. But we have achieved better efficiency in general (compared to Self-Spec), our method combines layer-skipping with multiple next-token generation/unmasking. Although ours requires a bit training, but it should be straightforward just like training a Transformer encoder like BERT.

@cadedaniel
Collaborator Author

We welcome a self-speculative implementation!

@paolovic

paolovic commented Jun 4, 2024

Thanks! It would be helpful if you could test it out and add features or optimizations. Once #3951 is merged it will be correct but not yet fast; I'll post more about optimizing it this week.

Great, I'll start tomorrow (CET)

By the way: sorry for volunteering and never coming back to you. Unfortunately, I am and will be really busy until the beginning of August ✌🏻

But thank you very much for your efforts!!!
