
[Misc] Use vllm-flash-attn instead of flash-attn #4686

Merged: 2 commits merged into main from vllm-flash-attn on May 8, 2024

Conversation

WoosukKwon (Collaborator)

This PR switches vLLM to the pre-built vllm-flash-attn wheel instead of the original flash-attn package.
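For anyone adapting local code, here is a minimal sketch of what the switch amounts to in practice. The import-fallback pattern and the function name follow the upstream flash-attn API and are an assumption, not the exact vLLM backend code:

```python
# Sketch (assumption, not the exact vLLM implementation): prefer the pre-built
# vllm-flash-attn wheel and fall back to the original flash-attn package.
try:
    from vllm_flash_attn import flash_attn_varlen_func  # pre-built wheel
except ImportError:
    from flash_attn import flash_attn_varlen_func  # original flash-attn package
```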

@LiuXiaoxuanPKU (Collaborator) left a comment

LGTM!

@WoosukKwon WoosukKwon merged commit 89579a2 into main May 8, 2024
24 of 25 checks passed
@WoosukKwon WoosukKwon deleted the vllm-flash-attn branch May 8, 2024 20:15
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 9, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
@maxin9966

Thank you very much. By the way, does vllm-flash-attn support Turing-architecture GPUs like the 2080 Ti? I recall that flash-attn v1 supported Turing GPUs.
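For reference, a hedged way to check a given card locally. The 8.0 threshold reflects FlashAttention-2's documented Ampere-or-newer requirement (which the vllm-flash-attn wheel packages); Turing cards such as the 2080 Ti report compute capability 7.5:

```python
import torch

# Hedged check, assuming FlashAttention-2 requires compute capability >= 8.0
# (Ampere or newer); Turing GPUs such as the RTX 2080 Ti report (7, 5).
major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    print(f"sm_{major}{minor}: vllm-flash-attn should be usable")
else:
    print(f"sm_{major}{minor}: expect vLLM to fall back to another attention backend")
```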

tybalex pushed a commit to rubra-ai/vllm that referenced this pull request May 25, 2024
SwapnilDreams100 pushed a commit to SwapnilDreams100/vllm that referenced this pull request May 29, 2024
ruff formatting

formatting -isort

formatting yapf

add request class

init file added

adding CPU_executor change

adding support for cpu engine

formatting

backslash error fix

formatting

tests update

update worker test

update worker test formatting

Disable cuda version check in vllm-openai image (vllm-project#4530)

[Bugfix] Fix `asyncio.Task` not being subscriptable (vllm-project#4623)

[CI] use ccache actions properly in release workflow (vllm-project#4629)

[CI] Add retry for agent lost (vllm-project#4633)

Update lm-format-enforcer to 0.10.1 (vllm-project#4631)

[Kernel] Make static FP8 scaling more robust (vllm-project#4570)

Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors seen during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following mostly random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|
With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|
This is not perfect yet, but it is getting very close to the FP16 / dynamic-activation-scale performance.
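For illustration, a hedged PyTorch sketch of the clamping described above (per-tensor static scale; this is not the actual CUDA kernel, and the divide-by-scale convention is an assumption):

```python
import torch

# Sketch of the fix described above (not the actual CUDA kernel): clamp the
# scaled activations to the representable range of float8_e4m3fn before the
# cast, so no value overflows to NaN/inf.
def static_fp8_quantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    finfo = torch.finfo(torch.float8_e4m3fn)
    x_scaled = x / scale                                  # static, pre-calibrated scale
    x_clamped = torch.clamp(x_scaled, finfo.min, finfo.max)
    return x_clamped.to(torch.float8_e4m3fn)
```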

[Core][Optimization] change python dict to pytorch tensor (vllm-project#4607)

[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (vllm-project#4642)

[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (vllm-project#4609)

[Core][Optimization] change copy-on-write from dict[int, list] to list (vllm-project#4648)

[Bug fix][Core] fixup ngram not setup correctly (vllm-project#4551)

Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

[Core][Distributed] support cpu&device in broadcast tensor dict (vllm-project#4660)

[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (vllm-project#4660)

[Core] Optimize sampler get_logprobs (vllm-project#4594)

[CI] Make mistral tests pass (vllm-project#4596)

[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi  (vllm-project#4573)

[Misc] Add `get_name` method to attention backends (vllm-project#4685)

[Core] Faster startup for LoRA enabled models (vllm-project#4634)

[Core][Optimization] change python dict to pytorch tensor for blocks to swap (vllm-project#4659)

[CI/Test] fix swap test for multi gpu (vllm-project#4689)

[Misc] Use vllm-flash-attn instead of flash-attn (vllm-project#4686)

[Dynamic Spec Decoding] Auto-disable by the running queue size (vllm-project#4592)

Co-authored-by: Cade Daniel <edacih@gmail.com>

[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (vllm-project#4672)

[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (vllm-project#4626)

consolidation