[Executorch][llama] Add custom_sdpa and use that instead of sdpa_with_kv_cache #5531

kimishpatel · 2024-09-20T22:50:58Z

Stack from ghstack (oldest at bottom):

sdpa_with_kv_cache updates the kv cache as part of the op. With a quantized
kv cache, the cache update happens separately, and the quantized cache is
then dequantized. After that we call sdpa_with_kv_cache, which copies the k
and v data into the dequantized cache. This copy is not needed, because the
actual cache is the quantized one. For very large context lengths it adds a
significant amount of data copying.

Subsequent diffs will deprecate the sdpa_with_kv_cache op and decompose it
into a) an update_cache op and b) a custom_sdpa op.
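
To make the proposed split concrete, here is a toy pure-Python sketch of a single attention head. The three functions reuse the op names from this PR only for readability; their signatures, the list-of-lists layout, and the causal masking are assumptions made for illustration and are not the actual ExecuTorch kernels.

```python
import math

# Toy stand-ins only: names mirror the ops discussed in this PR, but the
# signatures and data layout are assumptions for illustration.

def update_cache(cache, new_rows, start_pos):
    """Write new_rows into cache in place, starting at start_pos."""
    for i, row in enumerate(new_rows):
        cache[start_pos + i] = list(row)


def custom_sdpa(q, k_cache, v_cache, start_pos):
    """Causal attention of q over the caches. Never writes to the caches."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_row in enumerate(q):
        visible = start_pos + i + 1  # token i sees rows up to its own position
        scores = [
            scale * sum(a * b for a, b in zip(q_row, k_cache[j]))
            for j in range(visible)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_cache[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v_cache[0]))
        ])
    return out


def sdpa_with_kv_cache(q, k, v, k_cache, v_cache, start_pos):
    """Fused form: copy k and v into the caches, then attend."""
    update_cache(k_cache, k, start_pos)
    update_cache(v_cache, v, start_pos)
    return custom_sdpa(q, k_cache, v_cache, start_pos)


def fresh_cache(max_len=4, dim=2):
    """Cache with one earlier token already stored at position 0."""
    cache = [[0.0] * dim for _ in range(max_len)]
    cache[0] = [0.0, 1.0]
    return cache


q, k_new, v_new, start_pos = [[1.0, 1.0]], [[1.0, 0.0]], [[0.5, 2.0]], 1

# Fused path: the op itself copies k_new and v_new into the caches.
k_cache_a, v_cache_a = fresh_cache(), fresh_cache()
fused = sdpa_with_kv_cache(q, k_new, v_new, k_cache_a, v_cache_a, start_pos)

# Decomposed path: the caller updates the caches, then attends without a copy.
k_cache_b, v_cache_b = fresh_cache(), fresh_cache()
update_cache(k_cache_b, k_new, start_pos)
update_cache(v_cache_b, v_new, start_pos)
split = custom_sdpa(q, k_cache_b, v_cache_b, start_pos)
```

In the quantized flow described above, the dequantized cache handed to attention would already contain the new k and v rows, so under these assumptions only the custom_sdpa step is needed and the extra copy done by the fused form can be dropped.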

Differential Revision: D62623241

pytorch-bot · 2024-09-20T22:51:02Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5531

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 27 New Failures

As of commit 6e8af9b with merge base b2517d6:

NEW FAILURES - The following jobs have failed:

These jobs failed with the same compile error:
/pytorch/executorch/kernels/quantized/cpu/op_dequantize.cpp:400:42: error: non-constant-expression cannot be narrowed from type 'size_t' (aka 'unsigned long') to 'long' in initializer list [-Wc++11-narrowing]

Build documentation / build (buck2) / Build doc (gh)
pull / test-custom-ops-linux (buck2) / linux-job (gh)
pull / test-custom-ops-linux (cmake) / linux-job (gh)
pull / test-llama-runner-linux (bf16, buck2, portable) / linux-job (gh)
pull / test-llama-runner-linux (bf16, cmake, portable) / linux-job (gh)
pull / test-llama-runner-linux (fp32, buck2, portable) / linux-job (gh)
pull / test-llama-runner-linux (fp32, buck2, xnnpack+custom) / linux-job (gh)
pull / test-llama-runner-linux (fp32, buck2, xnnpack+custom+qe) / linux-job (gh)
pull / test-llama-runner-linux (fp32, cmake, portable) / linux-job (gh)
pull / test-llama-runner-linux (fp32, cmake, xnnpack+custom) / linux-job (gh)
pull / test-llama-runner-linux (fp32, cmake, xnnpack+custom+qe) / linux-job (gh)
pull / test-llama-runner-qnn-linux (fp32, cmake, qnn) / linux-job (gh)
pull / test-llava-runner-linux / linux-job (gh)
pull / test-models-linux (buck2, mv3, portable, linux.2xlarge, 90) / linux-job (gh)
pull / test-models-linux (buck2, mv3, xnnpack-quantization-delegation, linux.2xlarge, 90) / linux-job (gh)
pull / test-models-linux (cmake, mv3, portable, linux.2xlarge, 90) / linux-job (gh)
pull / test-models-linux (cmake, mv3, xnnpack-quantization-delegation, linux.2xlarge, 90) / linux-job (gh)
pull / test-models-linux (cmake, vit, portable, linux.2xlarge, 90) / linux-job (gh)
pull / test-models-linux (cmake, vit, xnnpack-delegation, linux.2xlarge, 90) / linux-job (gh)
pull / test-pybind-build-linux (cmake) / linux-job (gh)
pull / test-quantized-aot-lib-linux (cmake) / linux-job (gh)
pull / test-selective-build-linux (buck2) / linux-job (gh)
pull / test-selective-build-linux (cmake) / linux-job (gh)
pull / unittest / linux / linux-job (gh)
pull / unittest-arm (buck2) / linux-job (gh)

The macOS job hit the same narrowing error, reported against 'long long':
pull / unittest / macos / macos-job (gh)
/Users/ec2-user/runner/_work/executorch/executorch/pytorch/executorch/kernels/quantized/cpu/op_dequantize.cpp:400:42: error: non-constant-expression cannot be narrowed from type 'size_t' (aka 'unsigned long') to 'long long' in initializer list [-Wc++11-narrowing]

The lint job failed separately:
Lint / lintrunner / linux-job (gh)
>>> Lint for kernels/quantized/test/test_quant_dequant_per_token.py:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-09-20T22:51:31Z

This pull request was exported from Phabricator. Differential Revision: D62623241

facebook-github-bot · 2024-09-24T19:21:36Z

This pull request was exported from Phabricator. Differential Revision: D62623241

facebook-github-bot added the CLA Signed label Sep 20, 2024

This was referenced Sep 20, 2024

[ExecuTorch] Some updated to kv cache #5523

Closed

Fix dequantize per channel to handle double scale type #5524

Closed

[ExecuTorch] Add quantized kv cache to llama #5525

Closed

facebook-github-bot added the fb-exported label Sep 20, 2024

kimishpatel mentioned this pull request Sep 24, 2024

[Executorch][quant] Optimize per channel dequantize #5596

Closed

kimishpatel closed this Oct 1, 2024
