[ET-VK] Fix embedding_q4gsw out-of-bounds access with dynamic shapes#18558
meta-codesync[bot] merged commit 2ad79a8 into gh/SS-JIA/513/base
Conversation
ghstack-source-id: 359350851
Pull Request resolved: #18558
Stack from ghstack (oldest at bottom):
The embedding_q4gsw shader used push constants for num_indices,
out_height, and embed_dim that were captured at graph build time and
never updated when input tensors were dynamically resized. This caused
out-of-bounds GPU memory reads when the actual input was smaller than
the initial allocation, resulting in VK_ERROR_DEVICE_LOST on Mali GPUs.
The fix derives all shape-dependent values (embed_dim, out_height,
num_indices) from the output tensor's sizes UBO, which is automatically
updated on resize. Only truly constant values (group_size,
is_linear_weight) remain as push constants.
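The stale-versus-fresh distinction can be sketched in Python (illustrative names only, not the actual ExecuTorch Vulkan API): a push constant is a snapshot taken at graph build time, while a value read from the sizes UBO reflects the tensor's current shape at dispatch time.

```python
# Illustrative model of the bug (hypothetical names, not the real API):
# push constants are captured once at build time; the sizes UBO is read
# at dispatch time, so it tracks dynamic resizes.

class FakeTensor:
    def __init__(self, sizes):
        self.sizes = sizes  # mutated in place on dynamic resize


def capture_push_constants(out):
    # Snapshot at graph build time -- goes stale if `out` is resized later.
    return {"num_indices": out.sizes[1]}


def read_sizes_ubo(out):
    # Read at dispatch time -- always reflects the current shape.
    return {"num_indices": out.sizes[1]}


out = FakeTensor([1, 256, 4096])      # graph built for 256 tokens
pc = capture_push_constants(out)

out.sizes = [1, 7, 4096]              # dynamically resized to 7 tokens
ubo = read_sizes_ubo(out)

# A padding thread at y=7 slips past the stale bound but is correctly
# rejected by the UBO-derived bound.
y = 7
assert not (y >= pc["num_indices"])   # 7 >= 256 is False: thread proceeds, OOB read
assert y >= ubo["num_indices"]        # 7 >= 7 is True: thread exits early
```

The asserts show why moving the bound from a push constant to the sizes UBO is sufficient: the guard expression in the shader is unchanged, only the source of the bound differs.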
Root cause: with a 7-token input on a graph built for 256 tokens,
rounding the dispatch size up to the local workgroup size created an
extra thread (y=7) that passed the stale bounds check (7 >= 256 is
false) and read past the 7-element indices buffer.
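The rounding that produces the stray thread can be shown with a small Python sketch (a local workgroup size of 8 in y is assumed here for illustration; the shader's actual local size may differ):

```python
# Sketch of compute-dispatch rounding: the global size is rounded up to
# a multiple of the local workgroup size, so some threads are padding.

def dispatched_threads(global_size: int, local_size: int) -> int:
    groups = -(-global_size // local_size)  # ceiling division
    return groups * local_size


# 7 tokens with an assumed local size of 8 -> 8 threads dispatched,
# so thread y=7 exists even though only indices 0..6 are valid.
threads_y = dispatched_threads(7, 8)

# With the stale bound (256), the guard `y >= num_indices` is False for
# y=7, so the padding thread reads indices[7], past the 7-element buffer.
```

With the bound taken from the sizes UBO instead (7 after resize), the same guard evaluates to true for y=7 and the padding thread returns without touching memory.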
Differential Revision: D98642319