[Bug fix][Core] fixup ngram not setup correctly #4551
Conversation
ngram_prompt_lookup_max/ngram_prompt_lookup_min need to be passed through SpecDecodeWorker.create_worker's draft_worker_kwargs. If those two are not passed, an exception is raised because the dict cannot pop those two keys.
cc @comaniac |
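For reference, a minimal sketch of the failure and fix described above. The function body and names below are simplified stand-ins, not the actual vLLM implementation; only the `ngram_prompt_lookup_max`/`ngram_prompt_lookup_min` keys and the `dict.pop` behavior are taken from the PR description.

```python
# Minimal sketch of the failure mode (not the real vLLM code): create_worker
# pops the ngram settings out of draft_worker_kwargs, so the caller must put
# them there even when ngram speculation is disabled.

def create_worker(draft_worker_kwargs: dict) -> str:
    # dict.pop without a default raises KeyError when the key is missing,
    # which is the exception described above.
    lookup_max = draft_worker_kwargs.pop("ngram_prompt_lookup_max")
    lookup_min = draft_worker_kwargs.pop("ngram_prompt_lookup_min")

    if lookup_max > 0:
        return f"ngram proposer (min={lookup_min}, max={lookup_max})"
    return "multi-step draft-model proposer"


# After the fix, the caller always forwards both keys through the kwargs dict:
draft_worker_kwargs = {
    "ngram_prompt_lookup_max": 4,
    "ngram_prompt_lookup_min": 1,
    # ...remaining draft worker arguments...
}
print(create_worker(draft_worker_kwargs))  # -> ngram proposer (min=1, max=4)
```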
Oops. We can merge first, but it would be better to add a unit test to cover this case.
+1. Let's get a test covering this path. |
Why was it not covered by existing tests? |
I guess the existing tests instantiate the worker directly, but this is more of an end-to-end path starting from a higher level? |
It is because the current ngram path still uses the draft model, set to the target model, to get some info such as the vocab size. In this failure, the ngram test case actually turned into a multi-step case with the draft model equal to the target model... I added a check assert in conftest to ensure we are actually on the ngram running path when the corresponding param is set. |
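Roughly the kind of guard being described, as a hypothetical conftest helper. The attribute name `proposer_worker` and the class-name check are illustrative assumptions, not the exact vLLM test code:

```python
# Hypothetical conftest.py helper: given the worker built for a test and the
# test's parameters, fail fast if ngram speculation was requested but the
# worker silently fell back to the multi-step draft-model path.

def assert_ngram_path_taken(worker, test_params: dict) -> None:
    if test_params.get("ngram_prompt_lookup_max", 0) > 0:
        proposer_name = type(worker.proposer_worker).__name__
        assert "ngram" in proposer_name.lower(), (
            f"expected the ngram proposer, got {proposer_name}: the test is "
            "running the multi-step draft-model path instead")
```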
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Retrying test infra failure |
@cadedaniel this should be ready to merge. |
Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Spec decode tests started failing in the main branch after this PR: https://buildkite.com/vllm/ci/builds/6784#018f551e-d727-491c-be34-9d9fa29f4ea4 |
The fix PR is here: #4672 |
- ruff / isort / yapf formatting; add request class; init file added; adding CPU_executor change; adding support for cpu engine; backslash error fix; tests update; update worker test
- Disable cuda version check in vllm-openai image (vllm-project#4530)
- [Bugfix] Fix `asyncio.Task` not being subscriptable (vllm-project#4623)
- [CI] use ccache actions properly in release workflow (vllm-project#4629)
- [CI] Add retry for agent lost (vllm-project#4633)
- Update lm-format-enforcer to 0.10.1 (vllm-project#4631)
- [Kernel] Make static FP8 scaling more robust (vllm-project#4570)

  Previously, FP8 static scaling works if the scales are overestimating the maxima of all activation tensors during computation. However, this will not always be the case even if the scales were calibrated very carefully. For example, with the activations in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

  | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
  |-------------------|---------|--------|-------:|--------|-------:|---|-------:|
  | mmlu | N/A | none | 0 | acc | 0.2295 | ± | 0.0035 |
  | - humanities | N/A | none | 5 | acc | 0.2421 | ± | 0.0062 |
  | - other | N/A | none | 5 | acc | 0.2398 | ± | 0.0076 |
  | - social_sciences | N/A | none | 5 | acc | 0.2171 | ± | 0.0074 |
  | - stem | N/A | none | 5 | acc | 0.2125 | ± | 0.0073 |

  With the fix in this PR, where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

  | Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
  |-------------------|---------|--------|-------:|--------|-------:|---|-------:|
  | mmlu | N/A | none | 0 | acc | 0.7008 | ± | 0.0036 |
  | - humanities | N/A | none | 5 | acc | 0.6453 | ± | 0.0065 |
  | - other | N/A | none | 5 | acc | 0.7692 | ± | 0.0072 |
  | - social_sciences | N/A | none | 5 | acc | 0.8083 | ± | 0.0070 |
  | - stem | N/A | none | 5 | acc | 0.6115 | ± | 0.0083 |

  This is not perfect yet, but it is getting very close to the FP16 / dynamic activation scale performance. (A rough sketch of this clamping follows the commit list below.)

- [Core][Optimization] change python dict to pytorch tensor (vllm-project#4607)
- [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (vllm-project#4642)
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (vllm-project#4609)
- [Core][Optimization] change copy-on-write from dict[int, list] to list (vllm-project#4648)
- [Bug fix][Core] fixup ngram not setup correctly (vllm-project#4551)
  Co-authored-by: Lei Wen <wenlei03@qiyi.com>
  Co-authored-by: Cade Daniel <edacih@gmail.com>
  Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
- [Core][Distributed] support cpu&device in broadcast tensor dict (vllm-project#4660)
- [Core][Distributed] support both cpu and device tensor in broadcast tensor dict (vllm-project#4660)
- [Core] Optimize sampler get_logprobs (vllm-project#4594)
- [CI] Make mistral tests pass (vllm-project#4596)
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (vllm-project#4573)
- [Misc] Add `get_name` method to attention backends (vllm-project#4685)
- [Core] Faster startup for LoRA enabled models (vllm-project#4634)
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap (vllm-project#4659)
- [CI/Test] fix swap test for multi gpu (vllm-project#4689)
- [Misc] Use vllm-flash-attn instead of flash-attn (vllm-project#4686)
- [Dynamic Spec Decoding] Auto-disable by the running queue size (vllm-project#4592)
  Co-authored-by: Cade Daniel <edacih@gmail.com>
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (vllm-project#4672)
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (vllm-project#4626)
- consolidation
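As a side note on the FP8 commit quoted in the list above, here is a rough PyTorch sketch of the clamping idea. The real fix lives in a CUDA kernel; this is only an illustration of the technique described in that commit message:

```python
# Illustration only: clamp statically scaled activations into the e4m3
# representable range before casting, so values the calibration scale did
# not cover do not turn into NaNs.
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448.0


def quantize_static_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    scaled = x / scale
    scaled = scaled.clamp(min=-FP8_E4M3_MAX, max=FP8_E4M3_MAX)
    return scaled.to(torch.float8_e4m3fn)
```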