[Bug]: Cannot use FlashAttention-2 backend because the flash_attn package is not found #4906
Comments
4070 Ti Super
I also hit this problem.
Since #4686, we can use vllm-flash-attn. This is not yet available in the latest release, v0.4.2, but you can build a new vLLM wheel from source; here is how I did it.
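The build commands were roughly these (a sketch from memory; the `build` stage name comes from my reading of the vLLM Dockerfile at the time, and `vllm-build` is just a local tag I picked):

```bash
# Clone vLLM and build only up to the "build" stage of its Dockerfile,
# which is the stage that compiles the vllm wheel.
git clone https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 docker build . --target build --tag vllm-build
```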
This builds the container up to the build stage, which will contain the wheel for vllm in the image. Then install with:
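Concretely, something like the following; the in-container path to the wheel is an assumption, so check where your Dockerfile version actually drops it:

```bash
# Copy the wheel out of the build image and install it into the current env.
docker create --name vllm-tmp vllm-build
docker cp vllm-tmp:/workspace/dist ./dist   # wheel location is an assumption; verify in the Dockerfile
docker rm vllm-tmp
pip install dist/*.whl
```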
Now you can run vllm and get:
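Something along these lines (the model is only an example, and the exact log wording may vary by version):

```bash
# Start the OpenAI-compatible server; the chosen attention backend is logged
# during startup.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
# Expected in the logs:
#   INFO ... Using FlashAttention-2 backend.
```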
@atineoSE Thank you very much. By the way, does vllm-flash-attn support Turing architecture GPUs like the 2080 Ti?
I have the same problem on Linux.

My env:
```
torch           2.3.0
xformers        0.0.26.post1
vllm            0.4.2
vllm-flash-attn 2.5.8.post2
vllm_nccl_cu12  2.18.1.0.4.0
```

CUDA:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:26:00.0 Off |                    0 |
| N/A   25C    P0    56W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

Problem: the same "Cannot use FlashAttention-2 backend because the flash_attn package is not found" warning as in the title.
@atineoSE can you share the wheel somewhere? I cannot compile the wheel using this Docker setup. Thanks.
@mces89 you have to compile for your architecture, so the wheel is not universal. You can use the steps above. Alternatively, there's no absolute need to go through Docker: I just followed the build-from-source instructions in the README and ran the install from a checkout, as sketched below.
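A minimal sketch of that flow, assuming the README's `pip install -e .` route (my recollection of the instructions at the time):

```bash
# Build vLLM from source without Docker; compilation targets the GPU
# architecture of the machine you build on.
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
```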
Hello @atineoSE, I installed vllm 0.5.0.post1 via pip, which also pulls in vllm-flash-attn. Should I do something different in my code to use FlashAttention-2? What does this message about falling back to XFormers mean? With vllm 0.4.2, FlashAttention-2 was working.
@ameza13 this is a new issue and not what the OP mentioned. I have encountered this when running vLLM with microsoft/Phi-3-medium-4k-instruct. Indeed, it looks like the FlashAttention-2 backend does not support the sliding window, so such a model needs to fall back to some other backend (XFormers in this case). The model works just fine, though I'm not sure if this implies some performance penalty.
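If you want to see the fallback yourself, here is a sketch (the flags are my assumptions, not from this thread; --trust-remote-code was needed for Phi-3 at the time, as far as I remember):

```bash
# Phi-3-medium uses sliding-window attention, so the FlashAttention-2 backend
# is skipped and vLLM falls back to XFormers at startup.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-medium-4k-instruct \
    --trust-remote-code
# The model loads and serves normally; only the attention kernel differs.
```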
Same for me. My env:
Your current environment
```
Driver Version: 545.23.08
CUDA Version: 12.3
python 3.9
vllm 0.4.2
flash_attn 2.4.2~2.5.8 (I have tried various versions of flash_attn)
torch 2.3
```
🐛 Describe the bug
```
Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
Using XFormers backend.
```
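A quick way to confirm what the warning is reporting (my own diagnostic, not from the vLLM docs): vLLM 0.4.2 only selects the FlashAttention-2 backend if the flash_attn package imports cleanly in the same environment.

```bash
# If this import fails, vLLM 0.4.2 cannot use the FlashAttention-2 backend
# and falls back to XFormers.
python -c "import flash_attn; print(flash_attn.__version__)"
```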