[Bug]: Cannot use FlashAttention-2 backend because the flash_attn package is not found #4906
Comments
4070 Ti Super
I also hit this problem.
Since #4686, we can use vllm-flash-attn. This is not yet available in the latest release, v0.4.2, but you can build a new vLLM wheel from source; here is how I did it.
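The build commands were roughly these (a sketch from memory; the `build` stage name comes from my reading of the vLLM Dockerfile at the time, and `vllm-build` is just a local tag I picked):

```bash
# Clone vLLM and build only up to the "build" stage of its Dockerfile,
# which is the stage that compiles the vllm wheel.
git clone https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 docker build . --target build --tag vllm-build
```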
This builds the container up to the build stage, which will contain the wheel for vllm in the image. Then install with:
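Concretely, something like the following; the in-container path to the wheel is an assumption, so check where your Dockerfile version actually drops it:

```bash
# Copy the wheel out of the build image and install it into the current env.
docker create --name vllm-tmp vllm-build
docker cp vllm-tmp:/workspace/dist ./dist   # wheel location is an assumption; verify in the Dockerfile
docker rm vllm-tmp
pip install dist/*.whl
```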
Now you can run vllm and get:
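Something along these lines (the model is only an example, and the exact log wording may vary by version):

```bash
# Start the OpenAI-compatible server; the chosen attention backend is logged
# during startup.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Expected in the logs:
#   INFO ... Using FlashAttention-2 backend.
```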
@atineoSE Thank you very much. By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080 Ti?
I have the same problem on Linux.

My env:
```
torch           2.3.0
xformers        0.0.26.post1
vllm            0.4.2
vllm-flash-attn 2.5.8.post2
vllm_nccl_cu12  2.18.1.0.4.0
```

CUDA:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   25C    P0    56W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

Problem: the same "Cannot use FlashAttention-2 backend because the flash_attn package is not found" warning as in the title.
@atineoSE can you share the wheel somewhere? I cannot compile the wheel using this Docker setup. Thanks.
@mces89 you have to compile for your architecture, so the wheel is not universal. You can use the steps above. Alternatively, there's no absolute need to go through Docker: I just followed the build-from-source instructions in the README and ran the install from a checkout, as sketched below.
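A minimal sketch of that flow, assuming the README's `pip install -e .` route (my recollection of the instructions at the time):

```bash
# Build vLLM from source without Docker; compilation targets the GPU
# architecture of the machine you build on.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```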
Hello @atineoSE, I installed vllm 0.5.0.post1 via pip, which also pulls in vllm-flash-attn. Should I do something different in my code to use FlashAttention-2? What does this message about falling back to XFormers mean? With vllm 0.4.2, FlashAttention-2 was working.
@ameza13 this is a new issue and not what the OP mentioned. I have encountered this when running vLLM with microsoft/Phi-3-medium-4k-instruct. Indeed, it looks like the FlashAttention-2 backend does not support the sliding window, so such a model needs to fall back to some other backend (XFormers in this case). The model works just fine, though I'm not sure if this implies some performance penalty.
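If you want to see the fallback yourself, here is a sketch (the flags are my assumptions, not from this thread; --trust-remote-code was needed for Phi-3 at the time, as far as I remember):

```bash
# Phi-3-medium uses sliding-window attention, so the FlashAttention-2 backend
# is skipped and vLLM falls back to XFormers at startup.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-medium-4k-instruct \
    --trust-remote-code
# The model loads and serves normally; only the attention kernel differs.
```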
Same for me. My env:
Your current environment
```
Driver Version: 545.23.08
CUDA Version: 12.3
python 3.9
vllm 0.4.2
flash_attn 2.4.2~2.5.8 (I have tried various versions of flash_attn)
torch 2.3
```
🐛 Describe the bug
```
Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
Using XFormers backend.
```
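A quick way to confirm what the warning is reporting (my own diagnostic, not from the vLLM docs): vLLM 0.4.2 only selects the FlashAttention-2 backend if the flash_attn package imports cleanly in the same environment.

```bash
# If this import fails, vLLM 0.4.2 cannot use the FlashAttention-2 backend
# and falls back to XFormers.
python -c "import flash_attn; print(flash_attn.__version__)"
```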