-
Notifications
You must be signed in to change notification settings - Fork 683
[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache #5669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…_kv_cache sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5669
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit f43dc24 with merge base ee32848 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This pull request was exported from Phabricator. Differential Revision: D62623241 |
…f sdpa_with_kv_cache" sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
This pull request was exported from Phabricator. Differential Revision: D62623241 |
…f sdpa_with_kv_cache" sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
…_kv_cache Pull Request resolved: #5669 sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. ghstack-source-id: 245150344 @exported-using-ghexport Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/)
This pull request was exported from Phabricator. Differential Revision: D62623241 |
…f sdpa_with_kv_cache" sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
This pull request was exported from Phabricator. Differential Revision: D62623241 |
…f sdpa_with_kv_cache" sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
This pull request was exported from Phabricator. Differential Revision: D62623241 |
…f sdpa_with_kv_cache" sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
This pull request was exported from Phabricator. Differential Revision: D62623241 |
…f sdpa_with_kv_cache" sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/) [ghstack-poisoned]
This pull request was exported from Phabricator. Differential Revision: D62623241 |
This pull request has been merged in 43d7662. |
…_kv_cache Pull Request resolved: pytorch/executorch#5669 sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates happens separately. Then the quantized cache is dequantized. After that we call sdpa_with_kv_cache which copies k and v data into dequantized cache. Although this is not needed because the actual cache is the one that is quantized. For very large context length this will add significant amount data copy. Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that using a) update_cache op and b) custom_sdpa op. ghstack-source-id: 245751544 @exported-using-ghexport Differential Revision: [D62623241](https://our.internmc.facebook.com/intern/diff/D62623241/)
Stack from ghstack (oldest at bottom):
sdpa_with_kv_cache updates kv cache. In quantized kv cache, cache updates
happens separately. Then the quantized cache is dequantized. After that
we call sdpa_with_kv_cache which copies k and v data into dequantized cache.
Although this is not needed because the actual cache is the one that is
quantized.
For very large context length this will add significant amount data copy.
Subsequent diffs will deprecate sdpa_with_kv_cache op and deconstruct that
using a) update_cache op and b) custom_sdpa op.
Differential Revision: D62623241