
Disable cuda version check in vllm-openai image #4530

Merged
4 commits merged into vllm-project:main on May 5, 2024

Conversation

@zhaoyang-star (Contributor) commented on May 1, 2024

Fix #4521

Currently there is no need to check the CUDA version when using the fp8 KV cache. As of now, vLLM's binaries are compiled against CUDA 12.1 and the public PyTorch release versions by default, and the vllm-openai image also ships with CUDA 12.1.
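For illustration only, here is a minimal sketch of the kind of runtime gate this change relaxes: probing the local nvcc before enabling the fp8 KV cache. The names (`local_nvcc_release`, `check_fp8_kv_cache_support`) and the version threshold are hypothetical, not vLLM's actual API; the point is that such a probe is unnecessary when the shipped wheels and the vllm-openai image are already built against CUDA 12.1.

```python
# Hypothetical sketch only; none of these names come from the vLLM codebase.
import subprocess
from typing import Optional


def local_nvcc_release() -> Optional[str]:
    """Return the local nvcc release string (e.g. '12.1'), or None if nvcc is absent."""
    try:
        out = subprocess.check_output(["nvcc", "--version"], text=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    for line in out.splitlines():
        if "release" in line:
            # Line looks like: "Cuda compilation tools, release 12.1, V12.1.105"
            return line.split("release")[-1].split(",")[0].strip()
    return None


def check_fp8_kv_cache_support(kv_cache_dtype: str) -> None:
    if kv_cache_dtype != "fp8":
        return
    release = local_nvcc_release()
    if release is None:
        # No nvcc inside the vllm-openai image is fine: the binaries were
        # built with CUDA 12.1, so there is nothing to verify at runtime.
        return
    major = int(release.split(".")[0])
    if major < 11:  # illustrative threshold only
        raise RuntimeError(f"fp8 KV cache needs a newer CUDA toolkit than {release}")
```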

@simon-mo (Collaborator) commented on May 1, 2024

Sorry, I just merged the other PR. Can you resolve the conflict?

@simon-mo (Collaborator) commented on May 2, 2024

🤦‍♂️ Sorry, another conflict.

@zhaoyang-star (Contributor, Author) commented

@simon-mo The conflict is resolved. Please take a look.

@simon-mo merged commit 0650e59 into vllm-project:main on May 5, 2024
57 of 59 checks passed
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024
SwapnilDreams100 pushed a commit to SwapnilDreams100/vllm that referenced this pull request May 29, 2024
ruff formatting

formatting -isort

formatting yapf

add request class

init file added

adding CPU_executor change

adding support for  cpu engine

formatting

backslash error fix

formatting

tests update

update worker test

update worker test formatting

Disable cuda version check in vllm-openai image (vllm-project#4530)

[Bugfix] Fix `asyncio.Task` not being subscriptable (vllm-project#4623)

[CI] use ccache actions properly in release workflow (vllm-project#4629)

[CI] Add retry for agent lost (vllm-project#4633)

Update lm-format-enforcer to 0.10.1 (vllm-project#4631)

[Kernel] Make static FP8 scaling more robust (vllm-project#4570)

Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
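As a hedged illustration of the clamping described above, here is a PyTorch-level sketch rather than the actual CUDA kernel change; the function name `scaled_fp8_quant_static` is made up for this example.

```python
# Illustrative only: clamp scaled activations into the float8_e4m3fn range
# before the cast, so values that exceed the calibrated scale saturate
# instead of turning into NaN/Inf.
import torch


def scaled_fp8_quant_static(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scaled = x / scale
    clamped = torch.clamp(scaled, min=-fp8_max, max=fp8_max)
    return clamped.to(torch.float8_e4m3fn)
```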

[Core][Optimization] change python dict to pytorch tensor (vllm-project#4607)

[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (vllm-project#4642)

[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (vllm-project#4609)

[Core][Optimization] change copy-on-write from dict[int, list] to list (vllm-project#4648)

[Bug fix][Core] fixup ngram not setup correctly (vllm-project#4551)

Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

[Core][Distributed] support cpu&device in broadcast tensor dict (vllm-project#4660)

[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (vllm-project#4660)

[Core] Optimize sampler get_logprobs (vllm-project#4594)

[CI] Make mistral tests pass (vllm-project#4596)

[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi  (vllm-project#4573)

[Misc] Add `get_name` method to attention backends (vllm-project#4685)

[Core] Faster startup for LoRA enabled models (vllm-project#4634)

[Core][Optimization] change python dict to pytorch tensor for blocks to swap (vllm-project#4659)

[CI/Test] fix swap test for multi gpu (vllm-project#4689)

[Misc] Use vllm-flash-attn instead of flash-attn (vllm-project#4686)

[Dynamic Spec Decoding] Auto-disable by the running queue size (vllm-project#4592)

Co-authored-by: Cade Daniel <edacih@gmail.com>

[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (vllm-project#4672)

[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (vllm-project#4626)

consolidation
Development

Successfully merging this pull request may close these issues.

--kv_cache_dtype fp8 should not check for nvcc
2 participants