Integrate xformers memory_efficient_attention; this can be beneficial if your device's compute architecture is below sm_80.
On sm_80 and above, xformers.memory_efficient_attention and flash_attn perform almost identically.
But xformers does not expand the KV heads automatically, so we have to do it manually, and that extra matrix work makes this implementation noticeably slower.
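
A minimal sketch of what that manual expansion looks like, assuming inputs shaped `[batch, seq, heads, head_dim]`; the helper name `expand_kv` is illustrative, not the PR's actual code:

```python
import torch
import xformers.ops as xops

def expand_kv(key: torch.Tensor, value: torch.Tensor, n_rep: int):
    """Repeat each KV head n_rep times so the KV heads match the query heads.

    key/value: [batch, seq, n_kv_heads, head_dim] ->
               [batch, seq, n_kv_heads * n_rep, head_dim]
    """
    if n_rep == 1:
        return key, value
    b, s, h, d = key.shape
    # expand adds a broadcast dim; reshape materializes the copy.
    # This extra copy is the matrix work that slows the xformers path down.
    key = key[:, :, :, None, :].expand(b, s, h, n_rep, d).reshape(b, s, h * n_rep, d)
    value = value[:, :, :, None, :].expand(b, s, h, n_rep, d).reshape(b, s, h * n_rep, d)
    return key, value

# usage (hypothetical shapes):
# out = xops.memory_efficient_attention(q, *expand_kv(k, v, n_heads // n_kv_heads))
```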
So the fallback logic is: try flash_attn first, then fall back to xformers, and finally to plain Torch matmul attention.
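
A hedged sketch of that fallback order; the function name `attention` and the availability flags are illustrative, and the real dispatch in the PR may differ:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func  # fastest path on sm_80+
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False

try:
    import xformers.ops as xops
    HAS_XFORMERS = True
except ImportError:
    HAS_XFORMERS = False


def attention(q, k, v, n_rep: int = 1, causal: bool = True):
    """q: [batch, seq, n_heads, dim]; k/v: [batch, seq, n_kv_heads, dim]."""
    if HAS_FLASH:
        # flash_attn broadcasts grouped KV heads itself, no expansion needed.
        return flash_attn_func(q, k, v, causal=causal)
    if n_rep > 1:
        # Manual GQA expansion: repeat each KV head n_rep times.
        k = torch.repeat_interleave(k, n_rep, dim=2)
        v = torch.repeat_interleave(v, n_rep, dim=2)
    if HAS_XFORMERS:
        bias = xops.LowerTriangularMask() if causal else None
        return xops.memory_efficient_attention(q, k, v, attn_bias=bias)
    # Last resort: plain matmul attention in PyTorch.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # -> [b, h, s, d]
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    if causal:
        mask = torch.ones(scores.shape[-2:], dtype=torch.bool, device=q.device).triu(1)
        scores = scores.masked_fill(mask, float("-inf"))
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2)
```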