
[Bug fix][Core] fixup ngram not setup correctly #4551

Merged
7 commits merged into vllm-project:main on May 7, 2024

Conversation

@leiwen83 (Contributor) commented on May 2, 2024

ngram_prompt_lookup_max/ngram_prompt_lookup_min need to be passed through SpecDecodeWorker.create_worker's draft_worker_kwargs.

If those two are not passed, an exception is raised because the dict cannot pop those two keys.
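For context, a minimal sketch of the failure mode (simplified, hypothetical signatures and return values, not the actual vLLM code): create_worker pops the two keys from draft_worker_kwargs, so the caller has to put them there first.

```python
# Minimal sketch of the failure described above. The function and its return
# values are simplified stand-ins, not the real SpecDecodeWorker.create_worker.

def create_worker(draft_worker_kwargs: dict):
    # dict.pop without a default raises KeyError if the caller did not
    # forward these two settings into draft_worker_kwargs.
    lookup_max = draft_worker_kwargs.pop("ngram_prompt_lookup_max")
    lookup_min = draft_worker_kwargs.pop("ngram_prompt_lookup_min")
    if lookup_max > 0:
        return ("ngram proposer", lookup_min, lookup_max)
    return ("multi-step draft worker", draft_worker_kwargs)


# Before this fix the caller effectively did:
#   create_worker({"model": "draft"})   # -> KeyError("ngram_prompt_lookup_max")
# With the fix, the two knobs are forwarded explicitly:
print(create_worker({
    "model": "draft",
    "ngram_prompt_lookup_max": 4,
    "ngram_prompt_lookup_min": 1,
}))
```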

@rkooo567 (Collaborator) commented on May 2, 2024

cc @comaniac

leiwen83 marked this pull request as ready for review on May 2, 2024, 14:14
@comaniac (Collaborator) left a comment

Oops. We can merge first, but it would be better to add a unit test to cover this case.

@cadedaniel (Collaborator) commented

+1. Let's get a test covering this path.

@cadedaniel (Collaborator) commented

Why was it not covered by existing tests?

@comaniac (Collaborator) commented on May 2, 2024

> Why was it not covered by existing tests?

I guess the existing tests instantiated the worker directly, but this is more of an end-to-end path starting from a higher level?

@leiwen83 (Contributor, Author) commented on May 3, 2024

> Why was it not covered by existing tests?

It is because the current ngram path still uses the draft model setting (set to the target model) to get some info like the vocab size. With this failure, the ngram test case actually turned into a multi-step case with the draft model the same as the target model...

I added an assert in conftest to ensure we are actually on the ngram running path when the corresponding param is set (a rough sketch is shown below).
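A rough sketch of that kind of conftest guard; the access path to the proposer worker and the function name are hypothetical, not the real test helpers:

```python
# Hypothetical conftest-style guard: when the ngram params are set, assert that
# the ngram proposer is actually in use instead of silently falling back to a
# multi-step worker with draft model == target model.
# `spec_decode_worker.proposer_worker` is an assumed access path for illustration.

def assert_ngram_path(llm_engine, ngram_prompt_lookup_max):
    if not ngram_prompt_lookup_max:
        return  # not an ngram test case, nothing to verify
    proposer = type(llm_engine.spec_decode_worker.proposer_worker).__name__
    assert "NGram" in proposer, (
        f"expected an ngram proposer but got {proposer}; the test would "
        "otherwise run as a multi-step case with draft model == target model")
```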

Review thread on tests/spec_decode/e2e/conftest.py (outdated, resolved)
cadedaniel and others added 3 commits May 3, 2024 13:54
@cadedaniel (Collaborator) commented

Retrying test infra failure

@comaniac (Collaborator) commented on May 7, 2024

@cadedaniel this should be able to merge.

cadedaniel enabled auto-merge (squash) on May 7, 2024, 18:40
cadedaniel merged commit 8344f77 into vllm-project:main on May 7, 2024
49 checks passed
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 8, 2024
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
@simon-mo (Collaborator) commented on May 8, 2024

Spec decode tests started failing on the main branch after this PR: https://buildkite.com/vllm/ci/builds/6784#018f551e-d727-491c-be34-9d9fa29f4ea4

@comaniac (Collaborator) commented on May 8, 2024

The fix PR is here: #4672
Meanwhile, @cadedaniel adjusted the test config to work around this issue in #4592, so we should be good after merging this one.

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 19, 2024
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
tybalex pushed a commit to rubra-ai/vllm that referenced this pull request May 25, 2024
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
SwapnilDreams100 pushed a commit to SwapnilDreams100/vllm that referenced this pull request May 29, 2024
ruff formatting

formatting -isort

formatting yapf

add request class

init file added

adding CPU_executor change

adding support for  cpu engine

formatting

backslash error fix

formatting

tests update

update worker test

update worker test formatting

Disable cuda version check in vllm-openai image (vllm-project#4530)

[Bugfix] Fix `asyncio.Task` not being subscriptable (vllm-project#4623)

[CI] use ccache actions properly in release workflow (vllm-project#4629)

[CI] Add retry for agent lost (vllm-project#4633)

Update lm-format-enforcer to 0.10.1 (vllm-project#4631)

[Kernel] Make static FP8 scaling more robust (vllm-project#4570)

Previously, FP8 static scaling worked if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
With the fix in this PR, where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
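For illustration, a PyTorch-level sketch of the clamping described above; the actual fix lives in the FP8 quantization kernel, and the function name here is made up:

```python
import torch

# PyTorch-level analogue of the clamp described above: after dividing by the
# static scale, saturate to the finite range of float8_e4m3fn so out-of-range
# values do not turn into NaNs when cast.

def clamped_fp8_quantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scaled = x / scale
    clamped = torch.clamp(scaled, min=-fp8_max, max=fp8_max)
    return clamped.to(torch.float8_e4m3fn)
```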

[Core][Optimization] change python dict to pytorch tensor (vllm-project#4607)

[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (vllm-project#4642)

[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (vllm-project#4609)

[Core][Optimization] change copy-on-write from dict[int, list] to list (vllm-project#4648)

[Bug fix][Core] fixup ngram not setup correctly (vllm-project#4551)

Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

[Core][Distributed] support cpu&device in broadcast tensor dict (vllm-project#4660)

[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (vllm-project#4660)

[Core] Optimize sampler get_logprobs (vllm-project#4594)

[CI] Make mistral tests pass (vllm-project#4596)

[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi  (vllm-project#4573)

[Misc] Add `get_name` method to attention backends (vllm-project#4685)

[Core] Faster startup for LoRA enabled models (vllm-project#4634)

[Core][Optimization] change python dict to pytorch tensor for blocks to swap (vllm-project#4659)

[CI/Test] fix swap test for multi gpu (vllm-project#4689)

[Misc] Use vllm-flash-attn instead of flash-attn (vllm-project#4686)

[Dynamic Spec Decoding] Auto-disable by the running queue size (vllm-project#4592)

Co-authored-by: Cade Daniel <edacih@gmail.com>

[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (vllm-project#4672)

[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (vllm-project#4626)

consolidation
SwapnilDreams100 pushed a commit to SwapnilDreams100/vllm that referenced this pull request May 29, 2024
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>