Add GQA attention in experimental/gen_ai #2504

Closed
wants to merge 1 commit

Conversation

@sryap (Contributor) commented Apr 16, 2024

Summary:
This diff open sources `gqa_attn_splitk`, a grouped-query attention
operator for the decoding phase of LLM inference. The operator
supports the following:

  • BF16 input query and output types
  • BF16/INT4 KV cache types
  • 16,384 max context length
  • Fixed head dimension of 128
  • Arbitrary number of query heads
  • Fixed number of KV heads (1)

The INT4 KV `gqa_attn_splitk` is ~1.7x faster than the BF16 KV
counterpart.

Differential Revision: D56110657
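For context, the sketch below is a minimal PyTorch reference for what a grouped-query decode step with a single KV head computes. It is not the FBGEMM kernel or its API; the tensor names, shapes, and batching are illustrative assumptions.

```python
import torch

def gqa_decode_reference(q, k_cache, v_cache, seq_len):
    """Reference grouped-query attention for one decode step.

    Assumed (illustrative) shapes:
      q:       [B, H_q, D]   current-step queries (BF16 in the kernel)
      k_cache: [B, 1, T, D]  key cache with the fixed single KV head
      v_cache: [B, 1, T, D]  value cache with the fixed single KV head
      seq_len: number of valid cache positions (up to the 16,384 max)
    """
    B, H_q, D = q.shape
    # With one KV head, every query head attends to the same K/V.
    k = k_cache[:, 0, :seq_len].float()               # [B, T, D]
    v = v_cache[:, 0, :seq_len].float()               # [B, T, D]
    scores = torch.einsum("bhd,btd->bht", q.float(), k) / (D ** 0.5)
    probs = torch.softmax(scores, dim=-1)             # [B, H_q, T]
    out = torch.einsum("bht,btd->bhd", probs, v)      # [B, H_q, D]
    return out.to(q.dtype)

# Example with the operator's fixed head dimension of 128.
B, H_q, D, T = 2, 8, 128, 4096
q = torch.randn(B, H_q, D, dtype=torch.bfloat16)
k_cache = torch.randn(B, 1, T, D, dtype=torch.bfloat16)
v_cache = torch.randn(B, 1, T, D, dtype=torch.bfloat16)
out = gqa_decode_reference(q, k_cache, v_cache, seq_len=1024)  # [2, 8, 128]
```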

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D56110657

netlify bot commented Apr 16, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: f48e2fd
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66204f7e88c762000851359d
😎 Deploy Preview: https://deploy-preview-2504--pytorch-fbgemm-docs.netlify.app

sryap added a commit to sryap/FBGEMM that referenced this pull request Apr 16, 2024
Summary:

This diff open sources `mqa_attn_splitk`, a multi-query attention
operator for the decoding phase of LLM inference. The operator
supports the following:

- BF16 input query and output types
- BF16/INT4 KV cache types
- 16,384 max context length
- Fixed head dimension of 128
- Arbitrary query head size

The INT4 KV `mqa_attn_splitk` is ~1.7x faster than the BF16 KV
counterpart.

Differential Revision: D56110657
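As a rough illustration of the INT4 KV cache idea (storing K/V in 4 bits with per-group scale and zero-point, then dequantizing on the fly), here is a group-wise quantize/dequantize sketch. The group size, layout, and function names are assumptions for illustration and do not describe the kernel's actual packing.

```python
import torch

def quantize_int4_groupwise(x, group_size=64):
    """Asymmetric 4-bit quantization per group of `group_size` elements.

    x: [..., D] float tensor with D divisible by group_size.
    Returns integer codes in [0, 15] plus per-group scale and zero-point.
    """
    g = x.reshape(*x.shape[:-1], -1, group_size)
    x_min = g.amin(dim=-1, keepdim=True)
    x_max = g.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / 15.0
    codes = ((g - x_min) / scale).round().clamp(0, 15).to(torch.uint8)
    return codes, scale, x_min

def dequantize_int4_groupwise(codes, scale, zero):
    g = codes.to(torch.float32) * scale + zero
    return g.reshape(*codes.shape[:-2], -1)

# Round-trip one KV cache row with the operator's head dimension of 128.
k_row = torch.randn(128)
codes, scale, zero = quantize_int4_groupwise(k_row)
k_approx = dequantize_int4_groupwise(codes, scale, zero)
print((k_row - k_approx).abs().max())  # small quantization error
```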
sryap added a commit to sryap/FBGEMM that referenced this pull request Apr 17, 2024
Summary:

This diff open sources `gqa_attn_splitk`, a grouped-query attention
operator for the decoding phase of LLM inference. The operator
supports the following:

- BF16 input query and output types
- BF16/INT4 KV cache types
- 16,384 max context length
- Fixed head dimension of 128
- Arbitrary number of query heads

The INT4 KV `gqa_attn_splitk` is ~1.7x faster than the BF16 KV
counterpart.

Reviewed By: jianyuh, amylittleyang

Differential Revision: D56110657
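The `splitk` suffix presumably refers to splitting the attention over the context dimension and then merging partial results. A minimal sketch of that combine step using per-split log-sum-exp statistics is below; the chunking scheme and names are illustrative assumptions, not the kernel's implementation.

```python
import torch

def gqa_decode_splitk_reference(q, k, v, num_splits=4):
    """Split the context dim into chunks and merge partial attention outputs.

    q: [B, H, D]; k, v: [B, T, D] (single KV head already squeezed).
    Each split produces a locally normalized partial output plus its
    log-sum-exp; weighting the partials by softmax over the per-split
    log-sum-exps reproduces full attention exactly.
    """
    D = q.shape[-1]
    outs, lses = [], []
    for ks, vs in zip(k.chunk(num_splits, dim=1), v.chunk(num_splits, dim=1)):
        scores = torch.einsum("bhd,btd->bht", q.float(), ks.float()) / (D ** 0.5)
        lses.append(torch.logsumexp(scores, dim=-1))                 # [B, H]
        outs.append(torch.einsum("bht,btd->bhd",
                                 torch.softmax(scores, dim=-1), vs.float()))
    weights = torch.softmax(torch.stack(lses), dim=0).unsqueeze(-1)  # [S, B, H, 1]
    return (torch.stack(outs) * weights).sum(dim=0).to(q.dtype)

# Sanity check against single-pass attention.
B, H, D, T = 1, 8, 128, 512
q, k, v = torch.randn(B, H, D), torch.randn(B, T, D), torch.randn(B, T, D)
ref = torch.softmax(torch.einsum("bhd,btd->bht", q, k) / D ** 0.5, dim=-1) @ v
assert torch.allclose(gqa_decode_splitk_reference(q, k, v), ref, atol=1e-4)
```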

@sryap changed the title from "Add MQA attention in experimental/gen_ai" to "Add GQA attention in experimental/gen_ai" on Apr 17, 2024
@facebook-github-bot (Contributor) commented:
This pull request has been merged in e839246.
