Your current environment
I am using vLLM 0.9.1, with flash-attn installed via:
MAX_JOBS=4 pip install flash-attn --no-build-isolation
🐛 Describe the bug
When I try to run the Phi-3 Small models (both 8k and 128k), they fail with:
AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
The error is raised in vllm/attention/backends/blocksparse_attn.py, line 416, where the code accesses prefill_metadata via if prefill_meta := attn_metadata.prefill_metadata:, so attn_metadata must be None at that point.
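For reference, here is a minimal reproduction sketch. The model name, trust_remote_code flag, prompt, and sampling parameters are illustrative assumptions rather than my exact setup:

```python
# Minimal reproduction sketch; model name, flags, and prompt are illustrative.
from vllm import LLM, SamplingParams

# Phi-3 Small ships custom model code (including its blocksparse attention
# config), so trust_remote_code is needed to load it.
llm = LLM(model="microsoft/Phi-3-small-8k-instruct", trust_remote_code=True)

params = SamplingParams(max_tokens=32)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
# Generation crashes with:
# AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
```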
I tried to trace the error myself. It goes all the way down into phi3_small.py, where the model calls self.attn in its forward pass, but I cannot find where attn_metadata is actually populated. I saw in other posts that the model relies on a context manager to handle this, which I am not very familiar with.
Other models such as Llama do not hit this issue, so I suspect it is related to the attention type: Phi-3 Small goes through torch.ops.vllm.unified_attention in vllm/attention/layer.py, while the Llama model never touches that line.
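For anyone else tracing this, here is my rough understanding of how that metadata is looked up, paraphrased from vllm/attention/layer.py and vllm/forward_context.py, so please treat the details as assumptions rather than the exact implementation:

```python
# Rough sketch of my understanding; names and control flow are simplified,
# not a verbatim copy of vLLM internals.
from vllm.forward_context import get_forward_context

def fetch_attn_metadata(layer_name: str):
    """Mimic how the unified attention op looks up per-step attention metadata."""
    # The model runner wraps each forward pass in set_forward_context(...),
    # which stashes the metadata in a global forward context.
    forward_context = get_forward_context()
    attn_metadata = forward_context.attn_metadata

    # In some code paths the metadata appears to be a dict keyed by layer name
    # (this is my reading of layer.py, so treat it as an assumption).
    if isinstance(attn_metadata, dict):
        attn_metadata = attn_metadata.get(layer_name)

    # If this comes back None, blocksparse_attn.py line 416 raises the
    # AttributeError on attn_metadata.prefill_metadata.
    return attn_metadata
```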
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.