[WIP] Speculative Decoding #1797

Open · wants to merge 47 commits into main

Conversation

@LiuXiaoxuanPKU (Collaborator) commented Nov 27, 2023

This is an attempt to implement speculative decoding (paper) in vLLM. It is not optimized and not tested (please avoid using it for now). The current design:

  1. Uses Hugging Face instead of PagedAttention for the draft model.
  2. Does not keep a KV cache for the draft model.
  3. Does not support tensor parallelism.
  4. Uses the prefix-cache kernel for token verification.
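A rough, hypothetical sketch of the propose/verify loop these design points imply. It uses greedy-match verification rather than the paper's full rejection-sampling acceptance, and the function name and callable-model interface are assumptions for illustration, not this PR's actual API:

```python
import torch


def speculative_step(target_model, draft_model, input_ids, propose_cnt=4):
    """One propose/verify step (greedy-match variant; model APIs are placeholders).

    Both models are assumed to be callables mapping a [1, seq_len] tensor of
    token ids to logits of shape [1, seq_len, vocab_size].
    """
    prompt_len = input_ids.shape[1]

    # 1. Propose: run the draft model autoregressively for propose_cnt tokens
    #    (the PR does this with Hugging Face, without a paged KV cache).
    draft_ids = input_ids
    draft_tokens = []
    for _ in range(propose_cnt):
        logits = draft_model(draft_ids)[:, -1, :]
        next_token = torch.argmax(logits, dim=-1, keepdim=True)
        draft_tokens.append(next_token.item())
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)

    # 2. Verify: a single target-model forward over prompt + draft tokens
    #    (this is where the prefix-cache verification kernel comes in).
    target_logits = target_model(draft_ids)
    target_preds = torch.argmax(target_logits[0, prompt_len - 1:-1, :], dim=-1)

    # 3. Accept the longest prefix of draft tokens the target agrees with,
    #    then append one "bonus" token from the target itself.
    accepted = []
    for i, token in enumerate(draft_tokens):
        if token != target_preds[i].item():
            break
        accepted.append(token)
    bonus_pos = prompt_len - 1 + len(accepted)
    accepted.append(torch.argmax(target_logits[0, bonus_pos, :]).item())
    return accepted
```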

@pian13131 (Contributor) left a comment:

Just some tiny comments.

@@ -408,11 +416,24 @@ def _process_sequence_group_outputs(self, seq_group: SequenceGroup,
# We reuse the parent sequence here to reduce redundant memory
# copies, especially when using non-beam search sampling methods.
last_child_sample = child_samples[-1]
parent.append_token_id(last_child_sample.output_token,
last_child_sample.logprobs)
if last_child_sample.accepted_tokens:

nit: if FLAGS.ENABLE_SD:?

Comment on lines +21 to +22
self.propose_cnt = config.propose_cnt
self.draft_model_config = config.draft_model_config

nit: self.config = config?


# propose draft tokens
# the function will run the draft model and set draft_tokens and draft_token_probs of each seq
def set_draft_tokens(self, seq_group_list: List[SequenceGroupMetadata],

propose() might be a better name

)
if FLAGS.ENABLE_SD:
output = _multi_query_cached_kv_attention(
query, key, value, key_cache, value_cache, input_metadata)

Why do we need to pass key and value? I think those two variables have already been copied into key_cache and value_cache by cache_ops.reshape_and_cache(). Maybe I am missing something?

@void-main left a comment:

Nice work, congrats!

for seq_group_metadata in seq_group_metadata_list:
assert len(
seq_group_metadata.seq_data
) == 1, f"Speculative Decoding does nor beam search for now: {len(seq_group_metadata.seq_data)}"


A little typo in the assert message: "does nor beam search" should presumably read "does not support beam search".

@@ -573,6 +594,11 @@ def step(self) -> List[RequestOutput]:
if scheduler_outputs.is_empty():
return ignored

# only enable speculative decoding for generation run
if self.spec_dec_worker and (not scheduler_outputs.prompt_run):
self.spec_dec_worker.set_draft_tokens(seq_group_metadata_list,


In a multi-GPU inference scenario, will this method be called by all the workers?

Do you think it would be better to run it only on rank 0 and broadcast the tokens to the other ranks?
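A minimal sketch of the rank-0 propose-and-broadcast pattern suggested here, assuming a torch.distributed process group is already initialized. The spec_dec_worker.propose() call returning a picklable list of draft tokens is a placeholder API, not this PR's:

```python
import torch.distributed as dist


def propose_on_rank0_and_broadcast(spec_dec_worker, seq_group_metadata_list):
    """Run the draft model only on rank 0 and share the proposed tokens."""
    if dist.get_rank() == 0:
        # Placeholder call: whatever produces the draft tokens on rank 0.
        payload = [spec_dec_worker.propose(seq_group_metadata_list)]
    else:
        payload = [None]
    # broadcast_object_list pickles the object on src and fills it in on
    # every other rank, so all workers see the same draft tokens.
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```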

logger.setLevel("WARNING")


class SpecDecWorker(Worker):


This worker is too tightly coupled to assisted decoding.

Do you think it would be a good idea to abstract a base class for SpD and move these specific implementations into a concrete class like AssistedSpecDecWorker?

But I believe we could refactor this later.
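One possible shape for that abstraction. This is only an illustrative sketch; the class and method names (SpecDecWorkerBase, AssistedSpecDecWorker, propose, accept) are hypothetical and not code from this PR:

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SpecDecWorkerBase(ABC):
    """Interface for draft-token proposers, independent of how drafts are made."""

    @abstractmethod
    def propose(self, seq_group_metadata_list: List[Any],
                propose_cnt: int) -> List[List[int]]:
        """Return propose_cnt draft token ids per sequence group."""

    @abstractmethod
    def accept(self, target_outputs: Any) -> None:
        """Record which draft tokens the target model accepted."""


class AssistedSpecDecWorker(SpecDecWorkerBase):
    """Concrete proposer backed by a Hugging Face draft model, as in this PR."""

    def __init__(self, draft_model: Any) -> None:
        self.draft_model = draft_model

    def propose(self, seq_group_metadata_list, propose_cnt):
        # The PR's HF-based autoregressive drafting would live here.
        raise NotImplementedError

    def accept(self, target_outputs):
        # Bookkeeping of accepted/rejected draft tokens would live here.
        raise NotImplementedError
```

Other drafting strategies could then plug in behind the same interface without touching the engine.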

@@ -69,7 +69,7 @@ def __init__(
revision: Optional[str] = None,
tokenizer_revision: Optional[str] = None,
seed: int = 0,
gpu_memory_utilization: float = 0.9,
gpu_memory_utilization: float = 0.8,


This seems a little hacky to me. What if the sequence is long and ends up taking more than 0.2 of the GPU memory?

Do you think it would be better to actually run the assistant (draft) model in profile_num_available_blocks?
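A rough sketch of what profiling both models could look like, based on torch.cuda peak-memory statistics. The two profiling callables and the block-size arithmetic are placeholders, not vLLM's actual profile_num_available_blocks implementation:

```python
import torch


def profile_available_gpu_blocks(run_target_profile_pass, run_draft_profile_pass,
                                 block_bytes, gpu_memory_utilization=0.9):
    """Estimate KV-cache blocks after running worst-case passes of both models."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_target_profile_pass()
    run_draft_profile_pass()  # include the HF draft model in the profile
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    total = torch.cuda.get_device_properties(0).total_memory
    free_for_cache = total * gpu_memory_utilization - peak
    return max(int(free_for_cache // block_bytes), 0)
```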

pass


if triton.__version__ >= "2.1.0":


Maybe we should assert that the version is greater than or equal to 2.1.0?
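One way to implement the suggested check, using packaging.version so the comparison is numeric rather than a lexicographic string comparison (the message text is illustrative):

```python
import triton
from packaging import version

# A plain string comparison on __version__ can mis-order releases
# (e.g. "2.10.0" vs "2.9.0"), so parse the version before comparing.
assert version.parse(triton.__version__) >= version.parse("2.1.0"), (
    f"the multi-query prefix kernel requires triton>=2.1.0, got {triton.__version__}")
```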

offs_d[:, None] // x) * stride_k_cache_d + (
(start_n + offs_n[None, :]) %
block_size) * stride_k_cache_bl + (
offs_d[:, None] % x) * stride_k_cache_x


good job! this would be faster than my version! 👍

block_mask = tl.where(
block_start_loc < cur_batch_seq_len - cur_batch_ctx_len, 1, 0)

for start_n in range(0, block_mask * (start_m + 1) * BLOCK_M, BLOCK_N):


I wonder what is special about the K and V of the draft tokens; why do we need to process these tokens separately?

self.scale,
self.alibi_slopes,
)
if FLAGS.ENABLE_SD:


Correct me if I'm wrong, but for assisted decoding propose_cnt is usually small (maybe around 4?), which makes the first dimension of q small, so the q@k GEMM and the qk@v GEMM are both small. For such cases, is it really worth using Tensor Cores for the GEMMs?

@@ -573,6 +594,11 @@ def step(self) -> List[RequestOutput]:
if scheduler_outputs.is_empty():
return ignored

# only enable speculative decoding for generation run
if self.spec_dec_worker and (not scheduler_outputs.prompt_run):
@void-main commented Dec 12, 2023:

Reporting a bug here: when we start the vLLM API server with python3 -m vllm.entrypoints.api_server --model=/path/to/tgt_model/ --draft-model=/path/to/draft/model/ --propose-cnt=5, the server errors out. It looks like you forgot to call set_draft_tokens and accept_tokens in AsyncLLMEngine.
