Add GQA attention in experimental/gen_ai #2504

Closed
wants to merge 1 commit

Conversation

@sryap (Contributor) commented Apr 16, 2024

Summary:
This diff open sources `gqa_attn_splitk`, a grouped-query attention
operator for the decoding phase of LLM inference. The operator
supports the following:

  • BF16 input query and output types
  • BF16/INT4 KV cache types
  • 16,384 max context length
  • Fixed head dimension of 128
  • Arbitrary number of query heads
  • Fixed number of KV heads (1)

The INT4 KV `gqa_attn_splitk` is ~1.7x faster than the BF16 KV
counterpart.

Differential Revision: D56110657
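For context, the sketch below is a minimal PyTorch reference for what a grouped-query decode step with a single KV head computes. It is not the FBGEMM kernel or its API; the tensor names, shapes, and batching are illustrative assumptions.

```python
import torch

def gqa_decode_reference(q, k_cache, v_cache, seq_len):
    """Reference grouped-query attention for one decode step.

    Assumed (illustrative) shapes:
      q:       [B, H_q, D]   current-step queries (BF16 in the kernel)
      k_cache: [B, 1, T, D]  key cache with the fixed single KV head
      v_cache: [B, 1, T, D]  value cache with the fixed single KV head
      seq_len: number of valid cache positions (up to the 16,384 max)
    """
    B, H_q, D = q.shape
    # With one KV head, every query head attends to the same K/V.
    k = k_cache[:, 0, :seq_len].float()               # [B, T, D]
    v = v_cache[:, 0, :seq_len].float()               # [B, T, D]
    scores = torch.einsum("bhd,btd->bht", q.float(), k) / (D ** 0.5)
    probs = torch.softmax(scores, dim=-1)             # [B, H_q, T]
    out = torch.einsum("bht,btd->bhd", probs, v)      # [B, H_q, D]
    return out.to(q.dtype)

# Example with the operator's fixed head dimension of 128.
B, H_q, D, T = 2, 8, 128, 4096
q = torch.randn(B, H_q, D, dtype=torch.bfloat16)
k_cache = torch.randn(B, 1, T, D, dtype=torch.bfloat16)
v_cache = torch.randn(B, 1, T, D, dtype=torch.bfloat16)
out = gqa_decode_reference(q, k_cache, v_cache, seq_len=1024)  # [2, 8, 128]
```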

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D56110657

netlify bot commented Apr 16, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: f48e2fd
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66204f7e88c762000851359d
😎 Deploy Preview: https://deploy-preview-2504--pytorch-fbgemm-docs.netlify.app

sryap added a commit to sryap/FBGEMM that referenced this pull request Apr 16, 2024
Summary:

This diff open sources `mqa_attn_splitk`, a multi-query attention
operator for the decoding phase of LLM inference. The operator
supports the following:

- BF16 input query and output types
- BF16/INT4 KV cache types
- 16,384 max context length
- Fixed head dimension of 128
- Arbitrary query head size

The INT4 KV `mqa_attn_splitk` is ~1.7x faster than the BF16 KV
counterpart.

Differential Revision: D56110657
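As a rough illustration of the INT4 KV cache idea (storing K/V in 4 bits with per-group scale and zero-point, then dequantizing on the fly), here is a group-wise quantize/dequantize sketch. The group size, layout, and function names are assumptions for illustration and do not describe the kernel's actual packing.

```python
import torch

def quantize_int4_groupwise(x, group_size=64):
    """Asymmetric 4-bit quantization per group of `group_size` elements.

    x: [..., D] float tensor with D divisible by group_size.
    Returns integer codes in [0, 15] plus per-group scale and zero-point.
    """
    g = x.reshape(*x.shape[:-1], -1, group_size)
    x_min = g.amin(dim=-1, keepdim=True)
    x_max = g.amax(dim=-1, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / 15.0
    codes = ((g - x_min) / scale).round().clamp(0, 15).to(torch.uint8)
    return codes, scale, x_min

def dequantize_int4_groupwise(codes, scale, zero):
    g = codes.to(torch.float32) * scale + zero
    return g.reshape(*codes.shape[:-2], -1)

# Round-trip one KV cache row with the operator's head dimension of 128.
k_row = torch.randn(128)
codes, scale, zero = quantize_int4_groupwise(k_row)
k_approx = dequantize_int4_groupwise(codes, scale, zero)
print((k_row - k_approx).abs().max())  # small quantization error
```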
sryap added a commit to sryap/FBGEMM that referenced this pull request Apr 17, 2024
Summary:

This diff open sources `gqa_attn_splitk`, a grouped-query attention
operator for the decoding phase of LLM inference. The operator
supports the following:

- BF16 input query and output types
- BF16/INT4 KV cache types
- 16,384 max context length
- Fixed head dimension of 128
- Arbitrary number of query heads

The INT4 KV `gqa_attn_splitk` is ~1.7x faster than the BF16 KV
counterpart.

Reviewed By: jianyuh, amylittleyang

Differential Revision: D56110657
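The `splitk` suffix presumably refers to splitting the attention over the context dimension and then merging partial results. A minimal sketch of that combine step using per-split log-sum-exp statistics is below; the chunking scheme and names are illustrative assumptions, not the kernel's implementation.

```python
import torch

def gqa_decode_splitk_reference(q, k, v, num_splits=4):
    """Split the context dim into chunks and merge partial attention outputs.

    q: [B, H, D]; k, v: [B, T, D] (single KV head already squeezed).
    Each split produces a locally normalized partial output plus its
    log-sum-exp; weighting the partials by softmax over the per-split
    log-sum-exps reproduces full attention exactly.
    """
    D = q.shape[-1]
    outs, lses = [], []
    for ks, vs in zip(k.chunk(num_splits, dim=1), v.chunk(num_splits, dim=1)):
        scores = torch.einsum("bhd,btd->bht", q.float(), ks.float()) / (D ** 0.5)
        lses.append(torch.logsumexp(scores, dim=-1))                 # [B, H]
        outs.append(torch.einsum("bht,btd->bhd",
                                 torch.softmax(scores, dim=-1), vs.float()))
    weights = torch.softmax(torch.stack(lses), dim=0).unsqueeze(-1)  # [S, B, H, 1]
    return (torch.stack(outs) * weights).sum(dim=0).to(q.dtype)

# Sanity check against single-pass attention.
B, H, D, T = 1, 8, 128, 512
q, k, v = torch.randn(B, H, D), torch.randn(B, T, D), torch.randn(B, T, D)
ref = torch.softmax(torch.einsum("bhd,btd->bht", q, k) / D ** 0.5, dim=-1) @ v
assert torch.allclose(gqa_decode_splitk_reference(q, k, v), ref, atol=1e-4)
```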

@sryap changed the title from "Add MQA attention in experimental/gen_ai" to "Add GQA attention in experimental/gen_ai" on Apr 17, 2024
@facebook-github-bot (Contributor) commented:
This pull request has been merged in e839246.
