integrate xformers #452

Merged
merged 2 commits into turboderp:dev on May 26, 2024
Conversation

laoda513
Contributor

Integrates xformers memory_efficient_attention, which can be beneficial if your device's architecture is below sm_80.
On sm_80 and above, xformers.memory_efficient_attention and flash_attn are almost equally efficient.
However, xformers does not expand the KV heads automatically, so we need to do it manually, and that matrix operation makes this implementation much slower.
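A minimal sketch of the manual KV-head expansion described above (shapes and names are assumptions, not the PR's actual code): with grouped-query attention, the KV tensors carry fewer heads than the query tensor, so each KV head is repeated until the head counts match before calling memory_efficient_attention.

```python
import torch

def expand_kv(k: torch.Tensor, v: torch.Tensor, num_q_heads: int):
    # k, v: [batch, seq_len, num_kv_heads, head_dim]  (hypothetical layout)
    batch, seq_len, num_kv_heads, head_dim = k.shape
    groups = num_q_heads // num_kv_heads
    # Repeat each KV head 'groups' times so the head count matches q
    k = k[:, :, :, None, :].expand(batch, seq_len, num_kv_heads, groups, head_dim)
    v = v[:, :, :, None, :].expand(batch, seq_len, num_kv_heads, groups, head_dim)
    return (k.reshape(batch, seq_len, num_q_heads, head_dim),
            v.reshape(batch, seq_len, num_q_heads, head_dim))
```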

So the fallback logic is: try flash_attn first, then fall back to xformers, and finally to Torch matmul attention.
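As a rough sketch of that fallback order (not the PR's actual code; the function and flag names here are assumptions), the dispatch could look like:

```python
import torch

try:
    from flash_attn import flash_attn_func
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

try:
    import xformers.ops as xops
    HAS_XFORMERS = True
except ImportError:
    HAS_XFORMERS = False

def attn_forward(q, k, v):
    # q, k, v: [batch, seq_len, num_heads, head_dim]; k/v already expanded to match q
    if HAS_FLASH_ATTN:
        return flash_attn_func(q, k, v, causal=True)
    if HAS_XFORMERS:
        return xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
    # Plain Torch matmul attention as the last resort
    q, k, v = (x.transpose(1, 2) for x in (q, k, v))  # -> [batch, heads, seq, dim]
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    causal = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool, device=q.device), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    return (torch.softmax(scores, dim=-1) @ v).transpose(1, 2)
```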

@laoda513
Contributor Author

For SM >= 80, we continue using flash_attn. If flash_attn is not available, the xformers implementation is 30-50% slower than flash_attn but 30-50% faster than Torch's matmul attention with very long contexts (~100K).
For SM < 80, the xformers implementation reduces memory cost similarly to SM >= 80 (it actually uses even less memory than SM >= 80, but I'm not sure by how much, so I didn't modify the code related to model loading) and is 30-50% faster than Torch's matmul attention.
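As a side note, the sm_80 threshold mentioned here can be checked at runtime; a minimal sketch (not code from this PR):

```python
import torch

def supports_flash_attn(device: int = 0) -> bool:
    # flash_attn requires compute capability sm_80 (Ampere) or newer
    major, _minor = torch.cuda.get_device_capability(device)
    return major >= 8
```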

@Ph0rk0z

Ph0rk0z commented May 18, 2024

I had a slight clash with the latest dev that I fixed, but it seems to be working on P100 and 2080s. It is a good fallback when you run inference on mixed cards that don't support FA. Can cram a bigger Wizard in now. Yes, processing longer contexts takes more time, but that's better than being unable to load/use the model.

@turboderp turboderp changed the base branch from master to dev May 26, 2024 21:27
@turboderp turboderp self-requested a review May 26, 2024 21:34
@turboderp turboderp merged commit fceb4fd into turboderp:dev May 26, 2024