[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache #5669


Closed

kimishpatel wants to merge 7 commits into gh/kimishpatel/112/base from gh/kimishpatel/112/head

Contributor

kimishpatel commented Sep 25, 2024

Stack from ghstack (oldest at bottom):

sdpa_with_kv_cache updates the kv cache as part of the op. With a quantized
kv cache, the cache update happens separately, and the quantized cache is
then dequantized. After that we call sdpa_with_kv_cache, which copies the k
and v data into the dequantized cache. That copy is unnecessary, because the
actual cache is the quantized one.
For very large context lengths this adds a significant amount of data copying.

Subsequent diffs will deprecate the sdpa_with_kv_cache op and decompose it
into a) an update_cache op and b) a custom_sdpa op.

Differential Revision: D62623241
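
To make the redundant copy concrete, here is a minimal plain-Python sketch. It does not use the ExecuTorch ops: `quantize`, `dequantize`, `update_quantized_cache`, `fused_style`, `split_style`, and `decode` are hypothetical stand-ins, and the symmetric per-tensor scale is an assumption chosen for brevity. It only counts how many elements the fused-style flow writes into the temporary dequantized buffer compared with the split flow.

```python
# Illustrative sketch only: plain-Python stand-ins, not the ExecuTorch ops.
# Every function name here is hypothetical.

def quantize(row, scale):
    """Symmetric int8-style quantization of one row of floats."""
    return [max(-128, min(127, round(x / scale))) for x in row]

def dequantize(qrow, scale):
    """Map quantized integers back to floats."""
    return [q * scale for q in qrow]

def update_quantized_cache(qcache, new_row, pos, scale):
    """The only cache write the quantized flow needs."""
    qcache[pos] = quantize(new_row, scale)

def fused_style(qcache, new_row, pos, scale, stats):
    """Old flow: after dequantizing, a fused 'update + attend' op copies
    the new row into the dequantized buffer as well."""
    update_quantized_cache(qcache, new_row, pos, scale)
    deq = [dequantize(r, scale) for r in qcache]
    deq[pos] = list(new_row)              # redundant write to a temporary
    stats["extra_copied"] += len(new_row)
    return deq[: pos + 1]                 # rows attention would read

def split_style(qcache, new_row, pos, scale, stats):
    """New flow: update the quantized cache, dequantize, attend. No extra copy."""
    update_quantized_cache(qcache, new_row, pos, scale)
    deq = [dequantize(r, scale) for r in qcache]
    return deq[: pos + 1]

def decode(rows, scale, fused):
    """Feed rows one position at a time and report the extra elements copied."""
    qcache = [[0] * len(rows[0]) for _ in rows]
    stats = {"extra_copied": 0}
    step = fused_style if fused else split_style
    out = None
    for pos, row in enumerate(rows):
        out = step(qcache, row, pos, scale, stats)
    return out, stats["extra_copied"]

# Values are multiples of the scale, so quantization is exact in this toy case.
rows = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
fused_out, fused_copies = decode(rows, 0.5, fused=True)
split_out, split_copies = decode(rows, 0.5, fused=False)
print(fused_out == split_out, fused_copies, split_copies)
```

In this toy setup both flows hand the same rows to attention, while only the fused-style flow pays for the extra writes. The extra work grows with the number of new tokens, which is why long prompts make it noticeable.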


[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache

5d9d688

pytorch-bot bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5669

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f43dc24 with merge base ee32848:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

Contributor

facebook-github-bot commented Sep 25, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241

This was referenced Sep 25, 2024

[ExecuTorch] Some updated to kv cache #5663

Closed

Fix dequantize per channel to handle double scale type #5524

Closed

[ExecuTorch] Add quantized kv cache to llama #5664

Closed

Refactor custom SDPA op to separate kv cache update from the custom sdpa op #5665

Closed

Add update_quantized_cache op #5527

Closed

[Executorch][llama] Update SDPA op to use quantized kv cache #5666

Closed

[Executorch][llama] Refactoring sdpa #5667

Closed

[Executorch] Update EXECUTORCH_LIBRARY macro #5668

Closed

[Executorch][quant] Optimize per channel dequantize #5670

Merged

swolchok approved these changes


Update on "[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache"

232b644

Contributor

facebook-github-bot commented Sep 26, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241

facebook-github-bot added the fb-exported label

kimishpatel mentioned this pull request

Dont quantize the current token for attention #5715

Merged


Update on "[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache"

4eadd61

kimishpatel added a commit that referenced this pull request


[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache

d3b0ccb

Pull Request resolved: #5669

ghstack-source-id: 245150344
@exported-using-ghexport

Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/)

Contributor

facebook-github-bot commented Sep 27, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241


Update on "[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache"

b625b57

Contributor

facebook-github-bot commented Sep 30, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241


Update on "[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache"

Contributor

facebook-github-bot commented Oct 1, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241


Update on "[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache"

c352976

Contributor

facebook-github-bot commented Oct 1, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241


Update on "[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache"

f43dc24

Contributor

facebook-github-bot commented Oct 1, 2024

This pull request was exported from Phabricator. Differential Revision: D62623241

facebook-github-bot closed this in

43d7662

Contributor

facebook-github-bot commented Oct 2, 2024

This pull request has been merged in 43d7662.

facebook-github-bot added the Merged label

kedarnath03 pushed a commit to kedarnath03/executorch that referenced this pull request


[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache

7677f0b

Pull Request resolved: pytorch/executorch#5669

ghstack-source-id: 245751544
@exported-using-ghexport

Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/)


Labels

CLA Signed fb-exported Merged