
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. #3951

Merged
187 commits merged into vllm-project:main on Apr 23, 2024

Conversation

Collaborator

@cadedaniel cadedaniel commented Apr 9, 2024

This PR adds e2e correctness tests for speculative decoding. It is PR 7/9 in the speculative decoding open sourcing plan.

The E2E correctness tests verify that the generated output of a sequence with speculative decoding is equal to the generated output without speculative decoding when temperature is 0. We test various batch sizes, models, speculative lengths, block sizes, num_gpu_blocks (& preemption), and max_model_lens (& skipping speculation for some/all sequences) and verify that this core greedy-equality property holds.

See test_correctness.py for more details on test methodology.
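The core property is greedy equality: with temperature 0, the output token ids with and without speculative decoding must match exactly. Below is a minimal pytest-style sketch of that idea; the model name, engine arguments (speculative_model, num_speculative_tokens), and test structure are illustrative assumptions, not the actual contents of test_correctness.py.

```python
# Illustrative sketch only; the real tests cover many batch sizes, models,
# speculative lengths, block sizes, and preemption/skip-speculation cases.
from vllm import LLM, SamplingParams

PROMPTS = ["Hello, my name is", "The capital of France is"]
GREEDY = SamplingParams(temperature=0.0, max_tokens=32)


def _greedy_token_ids(**extra_engine_args):
    """Generate with greedy sampling and return per-prompt token ids."""
    llm = LLM(model="JackFram/llama-68m", **extra_engine_args)
    outputs = llm.generate(PROMPTS, GREEDY)
    return [tuple(out.outputs[0].token_ids) for out in outputs]


def test_spec_decode_greedy_equality():
    baseline = _greedy_token_ids()
    with_spec = _greedy_token_ids(
        speculative_model="JackFram/llama-68m",  # same model as draft & target
        num_speculative_tokens=3,
    )
    # Speculative decoding must not change greedy outputs.
    assert baseline == with_spec
```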

Bugfixes

To make the tests pass, this PR introduces several fixes that are listed in order of notoriety:

  • The vLLM sampler now modifies the probability distributions so that the sampling method is encoded within the distribution itself; this is gated behind a flag used by speculative decoding. Concretely, the token chosen by greedy sampling has its probability set to 1.0, which allows speculative decoding's rejection sampler to guarantee output equality (see the sketch after this list).
  • Batch expansion was incorrectly scoring tokens when some sequences were skipped. This is fixed.
  • When a "bonus token" is emitted, its KV is never generated by the draft worker. This reduces accuracy for proposers that use KV, e.g. draft models. This PR disables bonus tokens, with a follow-up issue to re-enable them --> [Speculative decoding] [Performance]: Re-enable bonus tokens #4212
  • Fixed an incorrect system efficiency calculation.
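For context on why this guarantees equality: the standard speculative-decoding rejection rule accepts a proposed token x with probability min(1, p_target(x) / p_draft(x)). Once greedy sampling is encoded in both distributions (probability 1.0 on the argmax token), that ratio is either 1 or 0, so acceptance becomes deterministic and matches plain greedy decoding. A hedged sketch of that acceptance rule, not vLLM's RejectionSampler:

```python
import torch


def accept_draft_tokens(target_probs: torch.Tensor,
                        draft_probs: torch.Tensor,
                        draft_token_ids: torch.Tensor) -> torch.Tensor:
    """Sketch of the standard acceptance rule for speculative decoding.

    target_probs, draft_probs: [batch, k, vocab] probability distributions.
    draft_token_ids: [batch, k] tokens proposed by the draft model.
    Returns a boolean mask of accepted proposals.
    """
    idx = draft_token_ids.unsqueeze(-1)
    p = target_probs.gather(-1, idx).squeeze(-1)  # target prob of proposal
    q = draft_probs.gather(-1, idx).squeeze(-1)   # draft prob of proposal (> 0)
    # Accept with probability min(1, p/q). With greedy encoded as one-hot
    # distributions, p/q is 1.0 when draft and target agree and 0.0 otherwise,
    # so the comparison below becomes deterministic.
    u = torch.rand_like(p)
    return u < torch.clamp(p / q, max=1.0)
```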

Minor feature additions

The following features were added:

  • The vLLM sampler can now return sampling results as on-GPU tensors instead of only Python data structures. This allows the rejection sampler to consume GPU tensors directly rather than serializing to CPU and back to GPU.
  • Spec decode metrics are emitted by the vLLM engine via the stats object. This was required to verify in the correctness tests that spec decode with the same model for draft and target has a 100% acceptance rate (a sketch of the metric definitions follows this list).
  • The draft model now has a configurable max_model_len for use in testing. This was required to test preemption.
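The two metrics the tests lean on are straightforward ratios; a hedged sketch with illustrative field names (not the exact vLLM metrics objects) is below. Draft acceptance rate is accepted draft tokens over proposed draft tokens; system efficiency compares emitted tokens against the per-step maximum of k accepted tokens plus one bonus token.

```python
from dataclasses import dataclass


@dataclass
class SpecDecodeCounters:
    """Illustrative counters; field names are assumptions, not vLLM's."""
    num_draft_tokens: int     # tokens proposed by the draft worker
    num_accepted_tokens: int  # proposals accepted by the rejection sampler
    num_emitted_tokens: int   # tokens actually appended to sequences
    num_spec_steps: int       # number of verification iterations
    k: int                    # proposal length per step


def draft_acceptance_rate(c: SpecDecodeCounters) -> float:
    return c.num_accepted_tokens / c.num_draft_tokens


def system_efficiency(c: SpecDecodeCounters) -> float:
    # Each verification step can emit at most k accepted tokens + 1 bonus token.
    return c.num_emitted_tokens / (c.num_spec_steps * (c.k + 1))
```

With the same model as draft and target under greedy sampling, every proposal should be accepted, so draft acceptance rate is expected to be 1.0; this is what the correctness test asserts via the emitted stats.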

@cadedaniel cadedaniel changed the title [Draft] [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. Apr 22, 2024
@cadedaniel cadedaniel marked this pull request as ready for review April 22, 2024 07:15
@cadedaniel cadedaniel enabled auto-merge (squash) April 22, 2024 17:37
Collaborator

@LiuXiaoxuanPKU LiuXiaoxuanPKU left a comment

Thoughts before discussing this PR. Skip sampler & tests.

speculative_max_model_len is mainly used for testing that sequences can
skip speculation.
"""

Collaborator

Do we want to add a check to make sure speculative_max_model_len < min(draft_max_model_len, target_max_model_len) in case the user sets speculative_max_model_len inappropriately?
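A hedged sketch of the kind of bound check being suggested here; the function and argument names are illustrative, not vLLM's actual config code:

```python
from typing import Optional


def resolve_speculative_max_model_len(
        speculative_max_model_len: Optional[int],
        draft_max_model_len: int,
        target_max_model_len: int) -> int:
    """Validate or default the speculative max model length (sketch)."""
    upper_bound = min(draft_max_model_len, target_max_model_len)
    if speculative_max_model_len is None:
        # Default: allow speculation whenever both models fit the sequence.
        return upper_bound
    if speculative_max_model_len > upper_bound:
        raise ValueError(
            f"speculative_max_model_len={speculative_max_model_len} must not "
            f"exceed min(draft_max_model_len, target_max_model_len)="
            f"{upper_bound}.")
    return speculative_max_model_len
```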

Collaborator Author

  • Cade to fix

# process the output tokens. Otherwise, they are (chunked) prefill
# samples and should not be processed.
stages = [seq.data._stage for seq in seq_group.seqs_dict.values()]
if all(stage == SequenceStage.DECODE for stage in stages):
Collaborator

  1. A bit of a concern here: from this, it seems we assume the DECODE stage only has 1 new token?
  2. I assume chunked prefill and speculative decoding cannot be turned on at the same time? Did we explicitly check or document that somewhere?

Collaborator Author

  • (Cade fill out answer)
  • Cade to verify args and raise if chunked prefill enabled while spec decode enabled

Collaborator Author

Answer for future readers:

  • Chunked prefill and speculative decoding are compatible from a systems perspective; however, the current vLLM implementations need more work before they can be enabled together. I'll add a validation check which raises if both are enabled (sketched below).
  • The DECODE stage currently only reports 1 new token. The scheduler uses this via the token budget to prevent a batch from becoming compute-bound. When chunked prefill is enabled, we will need to adjust this value to account for the "new tokens" computed during speculative verification. When chunked prefill is disabled, the new-token budget is max_num_batched_tokens, and we are OK with the budget system not taking speculative decoding into account.
  • I'll make an issue for integrating chunked prefill with spec decode soon!
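A hedged sketch of the validation described above; the argument names echo vLLM's engine flags, but the function itself is illustrative:

```python
from typing import Optional


def verify_spec_decode_args(enable_chunked_prefill: bool,
                            speculative_model: Optional[str]) -> None:
    """Raise if chunked prefill and speculative decoding are both enabled."""
    if enable_chunked_prefill and speculative_model is not None:
        raise ValueError(
            "Speculative decoding is currently incompatible with chunked "
            "prefill; disable one of the two features.")
```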

@@ -680,12 +760,36 @@ def _get_logprobs(
return result_prompt_logprobs, result_sample_logprobs


def _modify_greedy_probs_inplace(logprobs: torch.Tensor, probs: torch.Tensor,
Collaborator Author

Cade to list how this fits into the sampler overall

  • Why not use a very small temperature instead?
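For future readers, a hedged sketch of what a greedy-probability rewrite like _modify_greedy_probs_inplace does (an illustrative reimplementation, not the code in this diff): rows that were sampled greedily are overwritten with a one-hot distribution so the rejection sampler sees probability 1.0 on the sampled token. A very small but nonzero temperature would not give the same guarantee, since the distribution would still place mass on other tokens and rejection sampling would remain stochastic.

```python
import torch


def modify_greedy_probs_inplace(logprobs: torch.Tensor,
                                probs: torch.Tensor,
                                sample_indices: torch.Tensor,
                                greedy_samples: torch.Tensor) -> None:
    """Encode greedy sampling into the distributions in place (sketch).

    probs, logprobs: [num_tokens, vocab_size]
    sample_indices:  rows that were sampled greedily
    greedy_samples:  argmax token id chosen for each of those rows
    """
    # Put all probability mass on the greedily sampled token so that
    # downstream rejection sampling accepts/recovers it deterministically.
    probs[sample_indices, :] = 0.0
    probs[sample_indices, greedy_samples] = 1.0
    # Keep logprobs consistent: log(1) = 0 for the chosen token, -inf elsewhere.
    logprobs[sample_indices, :] = -float("inf")
    logprobs[sample_indices, greedy_samples] = 0.0
```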

Collaborator

@LiuXiaoxuanPKU LiuXiaoxuanPKU left a comment

Addressed my concerns after discussion; please add some docs to clarify, thanks!

@cadedaniel
Collaborator Author

Applied feedback @LiuXiaoxuanPKU. I will enable auto-merge with your approval; if you have any more comments, I'm happy to take them in a future PR.

@cadedaniel cadedaniel enabled auto-merge (squash) April 22, 2024 21:25
@cadedaniel
Collaborator Author

main branch was broken, merging again to get #4271

@cadedaniel cadedaniel merged commit 62b8aeb into vllm-project:main Apr 23, 2024
47 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Apr 25, 2024
robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 26, 2024
alexeykondrat pushed a commit to alexeykondrat/ci-vllm that referenced this pull request May 1, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024