Add MQA attention in experimental/gen_ai (pytorch#2504)
Summary: This diff open-sources `mqa_attn_splitk`, a multi-query attention (MQA) operator for the decoding phase of LLM inference. The operator supports:

- BF16 input query and output types
- BF16/INT4 KV cache types
- Max context length of 16,384
- Fixed head dimension of 128
- Arbitrary number of query heads

The INT4 KV `mqa_attn_splitk` is ~1.7x faster than its BF16 KV counterpart.

Differential Revision: D56110657
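
For readers unfamiliar with the shapes involved, below is a minimal pure-PyTorch sketch of a single MQA decode step under the constraints listed above. It is a readability reference only, not the fused kernel: the function name, argument names, and tensor layout here are assumptions for illustration, not the actual API of `mqa_attn_splitk`.

```python
# Minimal pure-PyTorch reference for one multi-query attention (MQA)
# decode step: all query heads share a single K/V head. This is NOT the
# fused `mqa_attn_splitk` kernel; layout and names are assumptions.
import math
import torch

def mqa_decode_reference(
    q: torch.Tensor,        # [B, H_q, D]  query for the current decode token
    k_cache: torch.Tensor,  # [B, T, D]    single shared key head over T cached tokens
    v_cache: torch.Tensor,  # [B, T, D]    single shared value head
) -> torch.Tensor:          # [B, H_q, D]  attention output per query head
    d = q.shape[-1]
    # Every one of the H_q query heads attends to the same K/V head
    # (the "multi-query" part). Accumulate in float32 for stability.
    scores = torch.einsum("bhd,btd->bht", q.float(), k_cache.float()) / math.sqrt(d)
    probs = scores.softmax(dim=-1)
    out = torch.einsum("bht,btd->bhd", probs, v_cache.float())
    return out.to(q.dtype)

# Example with the constraints from the commit message: BF16 inputs,
# head dim fixed at 128, arbitrary query head count. Context length may
# go up to 16,384; a short T is used here to keep the example fast.
B, H_q, D, T = 2, 8, 128, 512
q = torch.randn(B, H_q, D, dtype=torch.bfloat16)
k = torch.randn(B, T, D, dtype=torch.bfloat16)
v = torch.randn(B, T, D, dtype=torch.bfloat16)
out = mqa_decode_reference(q, k, v)  # -> shape [2, 8, 128]
```

The `splitk` suffix suggests the real kernel additionally partitions work along the context-length (K) dimension across thread blocks and reduces the partial softmax results, which is what makes long contexts like 16,384 tokens tractable during decoding; the sketch above omits that parallelization entirely.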