[ExecuTorch][WebGPU] SymInt arithmetic ops (add/sub/mul/floordiv) for dynamic shapes by pytorchbot · Pull Request #20712 · pytorch/executorch

pytorchbot · 2026-07-04T17:05:55Z

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #20573 by @JulianCloudNTH
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/65/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/65/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/JulianCloudNTH/65/orig

@diff-train-skip-merge

… dynamic shapes

Pull Request resolved: #20573

**Register scalar SymInt arithmetic so dynamic-shape graphs lower without an "unsupported op" failure.**

**Problem:** A dynamic-shape exported program emits scalar SymInt arithmetic nodes (`add`/`sub`/`mul`/`floordiv`) to compute live sizes and positions (e.g. `input_pos + S`, `seq_len // n`). The WebGPU backend registered only `et_vk.select_as_symint.default`, so `WebGPUGraph::build()` threw `unsupported op: add` when loading any dynamic `.pte`.

**Solution:**
- Before: only `select_as_symint` produced a live SymInt; any arithmetic on it was unsupported.
- After: `add`/`sub`/`mul`/`floordiv` each recompute their output SymInt from the operands via a resize hook whenever a live operand changes.

**Implementation:**
- `register_sym_binary` reads each operand (live SymInt via `read_symint`, else a static `Int`), seeds the build-time value, and registers a resize hook on any live operand (`set_symint` on recompute).
- `floordiv` rounds toward negative infinity (Python semantics).
- Mirrors Vulkan `backends/vulkan/runtime/graph/ops/impl/SymIntOps.cpp` (add/sub/floordiv/mul).
- Registered under the bare targets `add`/`sub`/`mul`/`floordiv`, which are distinct registry keys from the tensor `aten.add.Tensor`/`aten.mul.Tensor` ops.

**Constraints:** An output that folded to a static `Int` is a no-op. No GPU kernel, no dispatch, no change to the static-shape path. `sym_size.int` is intentionally not added here (depends on the tensor-shape dim-source API in the following diff).

Co-authored-with: Claude Code.

ghstack-source-id: 399812821
@exported-using-ghexport

Differential Revision: [D109906102](https://our.internmc.facebook.com/intern/diff/D109906102/)
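The recompute-on-change behavior described above can be modeled in a few lines. This is a minimal Python sketch, not the C++ implementation: the `SymInt` class, its `hooks` list, and the `read` helper are hypothetical stand-ins, and only the name `register_sym_binary` and the floor-toward-negative-infinity rule come from the description. `floor_div` is written on top of a truncating division to show the adjustment a language with C-style integer division would need.

```python
from typing import Callable, Dict, List, Union


class SymInt:
    """Hypothetical stand-in for a live scalar owned by the graph."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.hooks: List[Callable[[], None]] = []

    def set(self, value: int) -> None:
        # Notify dependents only when the value actually changes.
        if value != self.value:
            self.value = value
            for hook in self.hooks:
                hook()


Operand = Union[SymInt, int]


def trunc_div(a: int, b: int) -> int:
    """C-style integer division: rounds toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def floor_div(a: int, b: int) -> int:
    """Floor division built on truncating division."""
    q = trunc_div(a, b)
    # Truncation rounded up when the signs differ and a remainder exists.
    if a - q * b != 0 and (a < 0) != (b < 0):
        q -= 1
    return q


OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "floordiv": floor_div,
}


def read(operand: Operand) -> int:
    """Read a live SymInt or a static int."""
    return operand.value if isinstance(operand, SymInt) else operand


def register_sym_binary(op: str, lhs: Operand, rhs: Operand) -> SymInt:
    fn = OPS[op]
    out = SymInt(fn(read(lhs), read(rhs)))  # seed the build-time value

    def recompute() -> None:
        out.set(fn(read(lhs), read(rhs)))

    # Only live operands can change, so only they get a hook.
    for operand in (lhs, rhs):
        if isinstance(operand, SymInt):
            operand.hooks.append(recompute)
    return out


input_pos, seq_len = SymInt(0), SymInt(8)
end = register_sym_binary("add", input_pos, seq_len)
half = register_sym_binary("floordiv", end, 2)
input_pos.set(8)
seq_len.set(1)
print(end.value, half.value)  # 9 4
```

Chaining works because `out.set` fires the output's own hooks, so `half` follows `end` without any extra wiring.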

pytorch-bot · 2026-07-04T17:05:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20712

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Cancelled Jobs

As of commit cd1f7ec with merge base 9b7fd14:

NEW FAILURES - The following jobs have failed:

Build Presets / windows (pybind) / build (gh)
pull / test-binary-size-linux-gcc / linux-job (gh)
RuntimeError: Command docker exec -t 89eafcc9eaf182a95ca410406a40e23269259fccde5ba4bede0cd017843faa1b /exec failed with exit code 1

CANCELLED JOBS - The following jobs were cancelled. Please retry:

Build Presets / windows (windows) / build (gh)
pull / unittest-nxp-neutron / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-07-04T17:06:36Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #20574

The WebGPU backend baked static tensor shapes at build time, so a dynamic `.pte` needed a separate graph for each shape (prefill vs. decode). This adds a tensor-shape resize engine mirroring Vulkan: tensors carry live `cur_dims` ≤ max, inputs resize per call, and a bounded fixpoint propagates tensor-level resize hooks.

**Key changes:**
- `WebGPUTensor`: add `cur_dims`/`cur_nbytes` (live sizes ≤ max allocation), initialized to max at build
- `WebGPUGraph`: `resize_input`/`set_cur_dims` validate live dims fit max, `propagate_resize` runs tensor hooks for dirty shapes
- `update_symints_from_inputs` reads live `cur_dims`; adds `sym_size.int` dim source path
- `copy_inputs` uploads only live bytes; `WebGPUBackend::execute` shrinks inputs and resizes outputs to live shapes

Static graphs stay byte-identical: `cur == max` forever, no hooks fire, no reallocations.

ghstack-source-id: 399812823
@exported-using-ghexport

Differential Revision: [D109906091](https://our.internmc.facebook.com/intern/diff/D109906091/)
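A rough model of the resize engine may help when reading the later per-op hooks. The Python sketch below is an illustration under assumptions, not the C++ code: the `Tensor` class, the `dirty` flag, the rank check, the pass bound of `len(tensors) + 1`, and the 4-byte (fp32) element size are all hypothetical choices; only the names `cur_dims`, `cur_nbytes`, `set_cur_dims`, and `propagate_resize` come from the description.

```python
from math import prod
from typing import Callable, List, Sequence


class Tensor:
    """Hypothetical tensor record: allocation is sized for max_dims."""

    def __init__(self, max_dims: Sequence[int]) -> None:
        self.max_dims = list(max_dims)
        self.cur_dims = list(max_dims)  # live dims start at the max
        self.hooks: List[Callable[[], None]] = []
        self.dirty = False

    @property
    def cur_nbytes(self) -> int:
        return 4 * prod(self.cur_dims)  # assumes fp32 elements


def set_cur_dims(t: Tensor, dims: Sequence[int]) -> None:
    dims = list(dims)
    # Live dims must fit inside the build-time allocation.
    if len(dims) != len(t.max_dims) or any(
        d > m for d, m in zip(dims, t.max_dims)
    ):
        raise ValueError(f"live dims {dims} exceed max {t.max_dims}")
    if dims != t.cur_dims:
        t.cur_dims = dims
        t.dirty = True


def propagate_resize(tensors: List[Tensor]) -> None:
    """Run hooks for dirty tensors until nothing changes (bounded)."""
    for _ in range(len(tensors) + 1):
        dirty = [t for t in tensors if t.dirty]
        if not dirty:
            return
        for t in dirty:
            t.dirty = False
            for hook in t.hooks:
                hook()
    if any(t.dirty for t in tensors):
        raise RuntimeError("resize propagation did not converge")


# A chain x -> y -> z of shape-preserving ops.
x, y, z = (Tensor([1, 128, 64]) for _ in range(3))
x.hooks.append(lambda: set_cur_dims(y, x.cur_dims))
y.hooks.append(lambda: set_cur_dims(z, y.cur_dims))

set_cur_dims(x, [1, 5, 64])
propagate_resize([x, y, z])
print(z.cur_dims, z.cur_nbytes)  # [1, 5, 64] 1280
```

On a static graph no tensor is ever marked dirty, so `propagate_resize` returns on its first pass without running any hook.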

Pull Request resolved: #20575

These ops baked their dispatch count, param UBO, and output dims at `build()` for the max seq-len. On a dynamic-shape graph at a smaller live S they would over-dispatch and leave the output sized at the max, so the resize engine could not actually shrink them.

This adds tensor resize hooks to rms_norm, embedding_q4gsw, and apply_rotary_emb. When an input is resized, each hook recomputes the live row/token count, rewrites the param UBO, updates the dispatch `workgroup_count_x`, and sets the output's `cur_dims`. The hook is inert until a resize happens, so static graphs are byte-identical.

Implementation:
- `rms_norm`: recompute `num_rows` from live `cur_dims`; out dims follow the input.
- `embedding_q4gsw`: recompute `num_indices`/`total_blocks`; out dims = indices dims + `[embed_dim]`.
- `apply_rotary_emb`: `add_rope_dispatch` now returns its uniform handle; one hook rewrites both the xq and xk dispatches/UBOs for the live S and sets both outputs.
- Each keeps its uniform buffer alive via `own_uniform_buffer` (the hook rewrites it) instead of releasing it at build.

Mirrors Vulkan per-op `resize_*_node` (recompute sizes + dispatch each execute). No kernel/WGSL/numerics change. Behavior-neutral on static graphs (hook only fires when live dims differ from max). `quantized_linear` and SDPA resize hooks land in following diffs; `prepack` needs none (constants are fixed-size).

ghstack-source-id: 399812824
@exported-using-ghexport

Differential Revision: [D109906096](https://our.internmc.facebook.com/intern/diff/D109906096/)
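The shape bookkeeping in the `rms_norm` and `embedding_q4gsw` hooks can be sketched as follows. This Python model assumes that `rms_norm` normalizes over the last dim, so that the row count is the product of the leading dims; the function names are hypothetical, and `total_blocks` is left out because its formula is not given here.

```python
from math import prod
from typing import List, Sequence, Tuple


def rms_norm_live(cur_dims: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (num_rows, out_dims); assumes normalization over the last dim."""
    num_rows = prod(cur_dims[:-1])
    return num_rows, list(cur_dims)  # out dims follow the input


def embedding_live(
    indices_dims: Sequence[int], embed_dim: int
) -> Tuple[int, List[int]]:
    """Return (num_indices, out_dims) with out dims = indices dims + [embed_dim]."""
    num_indices = prod(indices_dims)
    return num_indices, list(indices_dims) + [embed_dim]


print(rms_norm_live([1, 5, 2048]))   # (5, [1, 5, 2048])
print(embedding_live([1, 5], 2048))  # (5, [1, 5, 2048])
```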

Pull Request resolved: #20576

**Make the 4-bit quantized linear serve any live M (rows) from one graph, so a dynamic prefill+decode graph computes correct-size outputs.**

**Problem:** `linear_q4gsw` baked its dispatch count, `params.M`, and output shape at `build()` for the max M. On a dynamic-shape graph at a smaller live M (e.g. decode M=1 vs prefill M=S) it would over-dispatch and leave the output sized at the max.

**Solution:**
- Before: one fixed dispatch sized for the build-time M.
- After: a tensor resize hook on the input recomputes the live M from `cur_dims`, rewrites `params.M`, updates the dispatch `workgroup_count_x` for the SAME kernel chosen at build (bicol GEMV / shmem GEMM / register-tiled), and sets the output `cur_dims` (= input dims with the last dim replaced by N). Inert until the input is resized.

**Implementation:**
- The build-time kernel select (bicol GEMV for M==1, else shmem GEMM for large K/N, else register-tiled) is fixed at build; the hook re-runs `compute_q4gsw_workgroup_count` for whichever of the three the build chose and rewrites the param UBO + output dims for the live M. It does not switch kernels (runtime M-switching is a separate optimization).
- `own_uniform_buffer` keeps the param UBO alive so the hook can rewrite it.
- Mirrors Vulkan `resize_q4gsw_linear_node` (recompute M-derived dispatch each execute).

**Constraints:** Behavior-neutral on static graphs (hook fires only when the input's live M differs from the max). No kernel/WGSL/numerics change. Runtime M-based kernel switching is deliberately out of scope (a later opt diff).

Co-authored-with: Claude Code.

ghstack-source-id: 399812825
@exported-using-ghexport

Differential Revision: [D109906094](https://our.internmc.facebook.com/intern/diff/D109906094/)

Pull Request resolved: #20577

**Make the elementwise add and mul ops serve any live shape from one graph.**

**Problem:** `aten.add.Tensor` and `aten.mul.Tensor` baked their element count + param UBO(s) + output shape at `build()` for the max shape. On a dynamic-shape graph at a smaller live shape they would over-dispatch and leave the output sized at the max.

**Solution:**
- Before: one fixed dispatch sized for the build-time shape.
- After: each registers a resize hook on BOTH operands (the dynamic one may be either operand by arg order). The hook recomputes the live element count, rewrites the param UBO(s), updates the dispatch `workgroup_count_x`, and sets the output `cur_dims`. Inert until an operand is resized.

**Implementation:**
- `add`: out follows the larger operand (robust when one input is a static residual and the other is the dynamic-S tensor); rewrites `AddParams`.
- `mul`: recomputes the broadcast output shape and rebuilds all three `TensorMeta` UBOs via `fill_tensor_meta_broadcast`.
- Each keeps its uniform buffer(s) alive via `own_uniform_buffer` instead of releasing at build.
- Mirrors Vulkan per-op `resize_*_node` (recompute sizes + dispatch each execute).

**Constraints:** Behavior-neutral on static graphs (the hook fires only when an operand's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.

ghstack-source-id: 399812828
@exported-using-ghexport

Differential Revision: [D109906093](https://our.internmc.facebook.com/intern/diff/D109906093/)
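For `mul`, the hook has to recompute a broadcast output shape from the two live operand shapes. The Python sketch below assumes the usual right-aligned (NumPy/PyTorch style) broadcasting rule; whether `fill_tensor_meta_broadcast` follows exactly this rule is not stated here, and `broadcast_shape` is a hypothetical helper name.

```python
from math import prod
from typing import List, Sequence


def broadcast_shape(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Right-aligned broadcast of two shapes; size-1 dims stretch."""
    rank = max(len(a), len(b))
    a = [1] * (rank - len(a)) + list(a)
    b = [1] * (rank - len(b)) + list(b)
    out = []
    for x, y in zip(a, b):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        out.append(y if x == 1 else x)
    return out


# Live dynamic-S activation times a static per-channel weight.
live = broadcast_shape([1, 5, 2048], [2048])
print(live, prod(live))  # [1, 5, 2048] 10240
```

The live element count that drives the dispatch is then just the product of the broadcast dims.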

Pull Request resolved: #20578

**Make sigmoid and select_copy serve any live shape from one graph; fix select's last-token index under dynamic shapes.**

**Problem:** Both ops baked their dispatch/params/output shape at `build()` for the max shape. `select_copy` was worse: a negative index (e.g. `-1` for the last token) was normalized against the build-time MAX dim, so at a smaller live S it selected a stale/zero position past the live data, producing wrong (often zero) output.

**Solution:**
- `sigmoid` (generic `add_unary_op`): a resize hook recomputes `num_elements`/dispatch and sets the output `cur_dims` (shape-preserving).
- `select_copy`: KEEP the raw (possibly negative) index at build; a resize hook re-resolves it against the LIVE dim, recomputes the output dims (= input minus `dim`), rebuilds the out/in `TensorMeta` UBOs and the dispatch.
- Both keep their uniform buffer(s) alive via `own_uniform_buffer`.

**Implementation:**
- The select out/in meta is rebuilt from synthetic `WebGPUTensor{dims}` via `fill_tensor_meta` (reads only `.dims`).
- Mirrors Vulkan per-op `resize_*_node`.

**Constraints:** Behavior-neutral on static graphs (hooks fire only when an input's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.

ghstack-source-id: 399812832
@exported-using-ghexport

Differential Revision: [D109906095](https://our.internmc.facebook.com/intern/diff/D109906095/)
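The `select_copy` bug is easiest to see with numbers. In this small Python sketch (helper names are hypothetical), a graph built at a max of 128 tokens runs with 5 live tokens: resolving `-1` against the max picks row 127, which lies past the live data, while resolving it against the live dim picks row 4.

```python
from typing import List, Sequence


def resolve_index(index: int, dim_size: int) -> int:
    """Normalize a possibly negative index against a given dim size."""
    resolved = index + dim_size if index < 0 else index
    if not 0 <= resolved < dim_size:
        raise IndexError(f"index {index} out of range for size {dim_size}")
    return resolved


def select_out_dims(in_dims: Sequence[int], dim: int) -> List[int]:
    """Output dims of select: the input dims minus the selected dim."""
    return list(in_dims[:dim]) + list(in_dims[dim + 1:])


MAX_S, LIVE_S = 128, 5
stale = resolve_index(-1, MAX_S)    # old behavior: baked at build
fresh = resolve_index(-1, LIVE_S)   # new behavior: re-resolved per resize
print(stale, fresh)                        # 127 4
print(select_out_dims([LIVE_S, 2048], 0))  # [2048]
```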

Pull Request resolved: #20579

**Make `view_copy` track the live sequence length under dynamic shapes.**

**Problem:** `view_copy` lowers to a flat DMA buffer copy (`add_buffer_copy`) sized at the build-time max shape. With one dynamic graph serving any seq-len S (prefill S=K, decode S=1), the copy moved the full max-S byte count and the output kept its max dims, so a downstream consumer read a live shape that was too large.

**Solution:** register a tensor resize hook on the input so the copy follows the live input numel (a view preserves numel).
- Before: `copy_nbytes` and the output dims are fixed at the serialized max.
- After: the hook recomputes the live numel from `cur_dims(in)`, scales the single dynamic output dim to preserve numel, sets the output `cur_dims`, and rewrites the Copy dispatch's `copy_nbytes`.

**Implementation:**
- Keep the existing DMA path (`Kind::Copy`); the hook only rewrites `copy_nbytes` via `dispatch_at`, no new kernel.
- Handle the aliased in/out fast path (no copy emitted) by still setting the output `cur_dims` so the resize cascade reaches consumers.
- Mirrors Vulkan's `view_buffer` contiguous fast path; numel-preserving like the other dynamic-shape op hooks.

**Constraints:** inert on a static graph (`cur_dims == dims`), so byte-identical to the prior behavior; fp32-only and numel-preserving invariants unchanged.

Co-authored-with: Claude Code.

ghstack-source-id: 399812833
@exported-using-ghexport

Differential Revision: [D109906098](https://our.internmc.facebook.com/intern/diff/D109906098/)
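The numel-preserving rescale can be illustrated with a short Python sketch. How the hook identifies the dynamic output dim is not described here, so this version takes it as an explicit, hypothetical `dyn_dim` argument; the 4-byte element size follows the fp32-only constraint.

```python
from math import prod
from typing import List, Sequence, Tuple


def live_view_dims(
    in_cur_dims: Sequence[int], out_max_dims: Sequence[int], dyn_dim: int
) -> Tuple[List[int], int]:
    """Return (live output dims, copy_nbytes) for a numel-preserving view."""
    live_numel = prod(in_cur_dims)
    # Product of every output dim except the dynamic one.
    rest = prod(d for i, d in enumerate(out_max_dims) if i != dyn_dim)
    if rest == 0 or live_numel % rest != 0:
        raise ValueError("live numel does not fit the view's static dims")
    out = list(out_max_dims)
    out[dyn_dim] = live_numel // rest
    return out, 4 * live_numel  # fp32: 4 bytes per element


# Built at S=128: [1, 128, 2048] viewed as [1, 128, 32, 64]; live S=5.
dims, nbytes = live_view_dims([1, 5, 2048], [1, 128, 32, 64], dyn_dim=1)
print(dims, nbytes)  # [1, 5, 32, 64] 40960
```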

Pull Request resolved: #20580

**Make `sdpa_with_kv_cache` serve any live seq-len S from one graph (batched prefill S=K and decode S=1).**

**Problem:** the existing dynamic path only reacted to a live `input_pos` (decode), with S captured at build time. It rewrote the QK dispatch (which depends on `context_len`) but left `update_cache`, softmax, and AV sized for the build-time S. Under a dynamic seq-len S (one graph serving prefill and decode), `kv_numel`, the QK/AV tile grids, and the softmax row count all depend on S and were stale.

**Solution:** a single recompute hook driven by either a live S (q tensor resize) or a live `input_pos` (SymInt), recomputing every per-step quantity from the live shape.
- Before: hook keyed only on `input_pos`; recomputes ctx + QK count; S fixed.
- After: hook keyed on q (always) and `input_pos` (when SymInt); reads live S from `cur_dims(q)` and live pos, recomputes all five dispatches' counts + UBOs (`update_cache` K/V, QK, softmax, AV), and sets the output `cur_dims` to q's.

**Implementation:**
- Capture the `update_cache`/softmax/AV dispatch indices (previously only QK) so their workgroup counts can be rewritten per step.
- QK/AV workgroup counts use the landed register-tiled grids (`Hq*ceil(S/TM)*ceil(ctx-or-D/TN)`); softmax is one workgroup per `Hq*S` row.
- Register the hook on q unconditionally. It is inert until q is resized, so a static graph is byte-identical.
- Mirrors Vulkan `DynamicDispatchNode` (recompute workgroups per execute); scratch is sized at build (S=max, ctx=Cmax) so buffers never move and bind groups stay valid.

**Constraints:** fp32-only, batch=1, GQA, `is_causal=true`, `D%4==0` invariants unchanged; the static / decode-only paths are unaffected (the q hook never fires without a resize).

Co-authored-with: Claude Code.

ghstack-source-id: 399812834
@exported-using-ghexport

Differential Revision: [D109906097](https://our.internmc.facebook.com/intern/diff/D109906097/)
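The grid formulas quoted above can be evaluated directly. The Python sketch below is an illustration under assumptions: the tile sizes `tm = tn = 8`, the relation `ctx = input_pos + S`, and the example values `Hq = 32`, `D = 64` are not stated in this description and are chosen only to make the arithmetic concrete. The function name is hypothetical.

```python
from typing import Dict


def div_up(a: int, b: int) -> int:
    return (a + b - 1) // b


def sdpa_workgroup_counts(
    hq: int, s: int, input_pos: int, d: int, tm: int = 8, tn: int = 8
) -> Dict[str, int]:
    """Per-step workgroup counts for QK, softmax, and AV.

    Assumes ctx = input_pos + s and hypothetical tile sizes tm/tn.
    """
    ctx = input_pos + s
    s_tiles = div_up(s, tm)
    return {
        "qk": hq * s_tiles * div_up(ctx, tn),  # Hq*ceil(S/TM)*ceil(ctx/TN)
        "softmax": hq * s,                     # one workgroup per Hq*S row
        "av": hq * s_tiles * div_up(d, tn),    # Hq*ceil(S/TM)*ceil(D/TN)
    }


prefill = sdpa_workgroup_counts(hq=32, s=8, input_pos=0, d=64)
decode = sdpa_workgroup_counts(hq=32, s=1, input_pos=8, d=64)
print(prefill)  # {'qk': 32, 'softmax': 256, 'av': 256}
print(decode)   # {'qk': 64, 'softmax': 32, 'av': 256}
```

All three counts depend on S, and QK additionally on the position, which is why a hook keyed only on `input_pos` left softmax and AV stale.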

…t/end)

Pull Request resolved: #20581

**Make `slice_copy` support a dynamic gather range so the RoPE-freqs slice `[input_pos : input_pos + S]` works under one dynamic graph.**

**Problem:** the static slice handler read `start` via a scalar reader that throws on a SymInt and ignored `end` (output length baked AOT). The RoPE-freqs slice uses a SymInt `input_pos` for start and a live S for the range, so the static op could neither build nor resize for it.

**Solution:** read start/end as possibly-dynamic SymInts and add a resize hook that recomputes the gather offset and live output length each step.
- Before: `start` is a static scalar (SymInt throws); `end` ignored; output length fixed at the serialized max.
- After: `start`/`end` read via a SymInt-aware reader; a hook recomputes `out[dim] = (end - start + step - 1) / step`, rewrites `out_meta`/`in_meta`/`params` UBOs + the dispatch count, and sets the output `cur_dims`.

**Implementation:**
- Hook registered on the `start`/`end` value-ids when they are SymInts and on the input tensor always (inert until resized, so a static slice is byte-identical).
- Output/input `TensorMeta` rebuilt from live dims; `dim`/`step` stay static.
- Keep the uniforms alive via `own_uniform_buffer` so the hook can rewrite them.
- Mirrors Vulkan `resize_slice_copy_node`.

**Constraints:** fp32-only; `dim`/`step` static; numerics + layout unchanged; inert on a static graph.

NOTE (stacking): this diff sits on top of the in-review `slice_copy` op (D108793168); rebase onto it once that op lands on master.

Co-authored-with: Claude Code.

ghstack-source-id: 399812835
@exported-using-ghexport

Differential Revision: [D109906092](https://our.internmc.facebook.com/intern/diff/D109906092/)
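The output-length formula is a ceiling division of the range by the step. A minimal Python sketch follows; the clamp to zero for an empty range and the function names are assumptions added for the illustration.

```python
from typing import Tuple


def slice_out_len(start: int, end: int, step: int) -> int:
    """out[dim] = (end - start + step - 1) / step, clamped at zero."""
    if step <= 0:
        raise ValueError("step must be positive")
    return max(0, (end - start + step - 1) // step)


def rope_freqs_range(input_pos: int, s: int) -> Tuple[int, int]:
    """The RoPE-freqs slice [input_pos : input_pos + S]."""
    return input_pos, input_pos + s


start, end = rope_freqs_range(input_pos=10, s=4)
print(slice_out_len(start, end, 1))  # 4
print(slice_out_len(0, 5, 2))        # 3 (rows 0, 2, 4)
```

For positive steps this agrees with `len(range(start, end, step))`, which makes a convenient cross-check.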

…+ per-op resize)

Pull Request resolved: #20582

**End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.**

**Problem:** the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

**Solution:** a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.
- Cases A-D: dynamic + static `rms_norm` (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
- Cases F-H: `rms(rms(x))` cascade, `rms(x)+x` (rms->add cascade), `rms(x)*x` (mul).
- Cases I-L: dynamic `linear_q4gsw` (GEMM at several M), `sdpa_with_kv_cache` (GQA prefill at several S), `embedding_q4gsw` (int64 ids), `apply_rotary_emb` (two outputs).
- Cases M-N: dynamic `sigmoid` (elementwise) and `select_copy(0, -1)` (negative index resolved against the live leading dim each call).
- Graph-reuse variants: every dynamic op above (`rms_norm` incl. a grow-first smallest→largest order, the `rms(rms(x))` cascade, `linear_q4gsw`, `embedding_q4gsw`, `apply_rotary_emb`, `sigmoid`, `select_copy`) also runs ONE loaded graph across multiple live shapes, proving buffers never move so bind groups stay valid across every resize.

**Implementation:**
- `test/ops/dynamic_shape/test_dynamic_shape_export.py` exports each toy model through `VulkanPartitioner` with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
- `test/native/test_dynamic_shape.cpp` loads each `.pte`, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3). Reuse tests split each per-op helper into load-once + run-at-shape so a single `Module` serves the whole shape sweep.
- Multi-output ops select their output by full shape, never numel.

**Constraints:** numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if `sym_size.int`/`copy_` op coverage is incomplete (does not fail the suite).

Co-authored-with: Claude Code.

ghstack-source-id: 399812841
@exported-using-ghexport

Differential Revision: [D109906090](https://our.internmc.facebook.com/intern/diff/D109906090/)

… (prefill path)

Pull Request resolved: #20583

**Lift the 65535 workgroup-per-dim dispatch cap so single-shot SDPA prefill runs at any sequence length.**

**Problem**: The WebGPU backend is 1D-dispatch-only and throws when a kernel's workgroup count exceeds the device per-dim limit (`maxComputeWorkgroupsPerDimension`, spec floor 65535). SDPA prefill QK exceeds it around S~362 (softmax/AV at S=2048), blocking single-shot / long-context prefill.

**Solution**: Fold a >limit 1D workgroup count into 2D; the shader reconstructs the linear index from `@builtin(num_workgroups)`.
- **Before**: `compute_1d_workgroup_count` throws if `count > limit`; dispatch `(count, 1, 1)`.
- **After**: `compute_2d_workgroup_count` returns `{count, 1}` (fast path) or a near-square `{x, y}` (`x = ceil(sqrt(count))` clamped to `limit`, `y = div_up(count, x)`); dispatch `(x, y, 1)`. A flat `{limit, div_up(count, limit)}` split would idle up to ~half the launched workgroups when `count` just exceeds `limit`; the near-square split holds the waste to `O(sqrt(count))` (e.g. 65536 -> `{256, 256}`, 0 inactive).

**Implementation**:
- `WgCount` + pure `fold_workgroup_count_2d` + `compute_2d_workgroup_count` in `WebGPUUtils.h` (device-free, unit-testable; `queried_max_workgroups` factored out of the 1D path)
- `WebGPUDispatch.workgroup_count_y` (default 1, declared last so existing aggregate inits are unchanged); both `dispatchWorkgroups` calls + the profiling record pass `(x, y, 1)`
- Per-kernel in-shader reconstruction: thread-form `idx = gid.x + gid.y*(num_workgroups.x*wg_size)` (QK/AV/add); row-form `row_idx = wid.x + wid.y*num_workgroups.x` (softmax; keeps a `valid` predicate, not an early return, so `workgroupBarrier()`s stay uniform)
- `Sdpa.cpp`: QK/softmax/AV counts via the 2D helper; the dynamic-`input_pos` resize hook recomputes both x and y for QK
- Reference: ET-Vulkan dispatches over natural N-D extents (never folds a flat count nor guards the per-dim limit) and MLX `get_2d_grid_dims` packs whole tensor dims; for our flattened scalar count the near-square split is the correct no-shape-info analog (a pack-to-limit split would reproduce the idle-half waste)

**Constraints**:
- `y=1` fast path keeps every non-folded dispatch byte-identical to the prior 1D path
- Scope = prefill path only; `rms_norm`/`embedding`/`lm_head`/`update_cache` are row/token-indexed and never hit the cap, so they keep the 1D path
- Throws if a 3rd dispatch dimension would be needed; this is unreachable for real prefill (the `uint32` element guard fires first at S~11585)

Co-authored-with: Claude Code.

ghstack-source-id: 399812920
@exported-using-ghexport

Differential Revision: [D109517684](https://our.internmc.facebook.com/intern/diff/D109517684/)
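The difference between the flat and near-square splits can be checked numerically. The Python sketch below follows the formulas quoted above (`x = ceil(sqrt(count))` clamped to `limit`, `y = div_up(count, x)`); it is a model of the arithmetic only, and the function names `fold_flat`, `fold_near_square`, and `inactive` are hypothetical rather than the names used in `WebGPUUtils.h`.

```python
from math import isqrt
from typing import Tuple

LIMIT = 65535  # spec floor for maxComputeWorkgroupsPerDimension


def div_up(a: int, b: int) -> int:
    return (a + b - 1) // b


def fold_flat(count: int, limit: int = LIMIT) -> Tuple[int, int]:
    """Pack-to-limit split, shown only for comparison."""
    if count <= limit:
        return (count, 1)
    return (limit, div_up(count, limit))


def fold_near_square(count: int, limit: int = LIMIT) -> Tuple[int, int]:
    """Near-square split as described: x = ceil(sqrt(count)), clamped."""
    if count <= limit:
        return (count, 1)  # fast path: identical to the 1D dispatch
    x = isqrt(count)
    if x * x < count:
        x += 1  # round the integer square root up
    x = min(x, limit)
    y = div_up(count, x)
    if y > limit:
        raise ValueError("count needs a third dispatch dimension")
    return (x, y)


def inactive(count: int, grid: Tuple[int, int]) -> int:
    """Launched workgroups that fall past the real count."""
    return grid[0] * grid[1] - count


count = LIMIT + 1
print(fold_flat(count), inactive(count, fold_flat(count)))
# (65535, 2) 65534
print(fold_near_square(count), inactive(count, fold_near_square(count)))
# (256, 256) 0
```

Just past the limit, the flat split launches 131070 workgroups for 65536 units of work, so about half sit idle, while the near-square split launches exactly 65536.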

…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math: a wrong `{x, y}` means out-of-bounds writes or dropped threads. The prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp`: device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.

ghstack-source-id: 399812923
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
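The "covers `[0, count)`" property can be checked by brute force on small grids. The Python sketch below simulates the row-form reconstruction `row_idx = wid.x + wid.y*num_workgroups.x` with a `valid` predicate; it is a hypothetical model of the check, not the contents of `test_dispatch_2d.cpp`.

```python
from typing import List


def covered_indices(count: int, x: int, y: int) -> List[int]:
    """Linear indices that an (x, y, 1) dispatch actually processes."""
    seen = []
    for wid_y in range(y):
        for wid_x in range(x):
            row_idx = wid_x + wid_y * x  # num_workgroups.x == x
            valid = row_idx < count      # predicate instead of early return
            if valid:
                seen.append(row_idx)
    return seen


def covers_exactly(count: int, x: int, y: int) -> bool:
    """True when every index in [0, count) is hit exactly once."""
    return covered_indices(count, x, y) == list(range(count))


print(covers_exactly(10, 4, 3))           # True: 12 launched, 2 masked off
print(covers_exactly(10, 3, 3))           # False: only 9 launched
print(covers_exactly(65536, 256, 256))    # True
```

A grid that is too small shows up as dropped indices, and the `valid` predicate is what keeps an oversized grid from writing out of bounds.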

Pull Request resolved: #20651

**Lift the 65535 workgroup-per-dim cap for `mul` and `permute` so they run at any numel.**

`mul.Tensor` and `permute` still used `compute_1d_workgroup_count`, which throws once `numel / wg_size > 65535`. A realistic Llama-3.2-1B LoRA layer hits this (`mul` over `[2048, 8192]` = 262k workgroups; `permute` of `[2048, 2048]` = 65536). `add`/`sub`/`div`/`fill`/`sdpa` already use the 2D fold; this brings `mul` + `permute` in line.

Key changes:
- `mul/BinaryOp.cpp`, `permute/Permute.cpp`: `compute_1d_workgroup_count` → `compute_2d_workgroup_count` (returns `utils::WgCount`); dispatch + resize hook now set both `workgroup_count_x` and `workgroup_count_y`.
- `binary_mul.wgsl`, `permute.wgsl`: `main` takes `@builtin(num_workgroups)`; flat index `gid.x + gid.y * (num_workgroups.x * wg_size)` (regenerated `*_wgsl.h`).

Mirrors the landed `add` op fold (`runtime/ops/add/{BinaryOp.cpp,binary_add.wgsl}`).

Co-authored-with: Claude Code.

ghstack-source-id: 399812930
@exported-using-ghexport

Differential Revision: [D110149677](https://our.internmc.facebook.com/intern/diff/D110149677/)
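The two workgroup counts quoted above can be reproduced with simple arithmetic. The Python sketch below assumes a workgroup size of 64 for both kernels; that value is not stated here and is inferred only because it yields both quoted numbers.

```python
from math import prod
from typing import Sequence

LIMIT = 65535  # per-dimension workgroup cap


def div_up(a: int, b: int) -> int:
    return (a + b - 1) // b


def workgroups(shape: Sequence[int], wg_size: int = 64) -> int:
    """1D workgroup count for one thread per element (assumed wg_size)."""
    return div_up(prod(shape), wg_size)


mul_count = workgroups([2048, 8192])
permute_count = workgroups([2048, 2048])
print(mul_count, mul_count > LIMIT)          # 262144 True
print(permute_count, permute_count > LIMIT)  # 65536 True
```

The `permute` case exceeds the cap by exactly one workgroup, which is the boundary where a 1D-only dispatch starts to throw.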

…scripten Dawn

Pull Request resolved: #20652

**Key the `timedWaitAny` instance setup to the actual Dawn API instead of `__EMSCRIPTEN__`, so native-rig Dawn and emscripten/emdawnwebgpu use the modern `requiredFeatures` path and only the vendored Dawn uses the legacy `capabilities.*` path.**

The instance-descriptor setup was guarded by `#if defined(__EMSCRIPTEN__)`, which routed emscripten (emdawnwebgpu, emcc 4.0.19+) through the legacy `capabilities.*` API that no longer exists there. The guard now keys off the API actually present.

Key changes:
- `WebGPUDevice.cpp`: `#if defined(__EMSCRIPTEN__)` → `#if defined(WEBGPU_DAWN_INSTANCE_CAPABILITIES)`. The legacy `instance_desc.capabilities.*` path is taken only by the buck-vendored Dawn (which defines the macro); native cmake Dawn and emscripten leave it undefined and take the `requiredFeatures` / `WGPUInstanceFeatureName_TimedWaitAny` path.

Co-authored-with: Claude Code.

ghstack-source-id: 399812934
@exported-using-ghexport

Differential Revision: [D110149678](https://our.internmc.facebook.com/intern/diff/D110149678/)

Pull Request resolved: #20706

Convert the remaining hand-rolled `int main()` + printf/`bool ok` native tests to GTest so the whole `backends/webgpu/test/` suite is uniform, filterable via `--gtest_filter`, and self-reporting (extends the GTest conversion already applied to `test_dynamic_shape`). The five converted files are a harness-only change: every test case, tensor shape, tolerance, artifact filename, and skip condition is preserved 1:1, and only the pass/fail reporting mechanism changes. This diff additionally wires the already-GTest `webgpu_dynamic_shape_test` into the CI runner so the dynamic-shape suite actually executes.

Key changes:
- `test/test_webgpu_native.cpp`, `test/native/test_dispatch_order.cpp`, `test/native/test_index.cpp`, `test/native/test_scratch_buffer.cpp`, `test/native/test_update_cache.cpp`: `main`+`printf`/`bool ok` accumulator → `TEST()` cases using `EXPECT_*`/`ASSERT_*`; each keeps a custom `main()` that brings up the WebGPU device once then `RUN_ALL_TESTS()` (device-absent still SKIPs by returning 0). `test_index`/`test_webgpu_native` use inclusive `EXPECT_LE(err, tol)` to match the original `err > tol` fail gate exactly.
- `CMakeLists.txt`: move every native-test target into the `if(TARGET GTest::gtest)` block, linking `GTest::gtest`.
- `scripts/test_webgpu_native_ci.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` to the native-test configure so the now-gtest-gated targets are defined, and wire `webgpu_dynamic_shape_test` into the runner: export its `.pte`s + goldens via `export_dynamic_shape_cases`, add it to the built/run target list behind the same `--target help` probe, and run it guarded (mirroring the `index` test).
- `test/test_build_webgpu.sh`: add `-DEXECUTORCH_BUILD_TESTS=ON` so the local build script (which builds the now-gtest-gated targets unconditionally) still finds them.

ghstack-source-id: 399812941
@exported-using-ghexport

Differential Revision: [D110536636](https://our.internmc.facebook.com/intern/diff/D110536636/)

pytorchbot temporarily deployed to cadence July 4, 2026 17:06 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed label Jul 4, 2026

pytorchbot had a problem deploying to cadence July 4, 2026 17:33 — with GitHub Actions Error

JCNTH added 14 commits July 4, 2026 10:35

ghost requested review from kirklandsign and larryliu0820 as code owners July 4, 2026 17:35

ghost had a problem deploying to cadence July 4, 2026 17:36 — with GitHub Actions Error

ghost self-requested a review July 4, 2026 17:36

ghost approved these changes Jul 4, 2026

Merge branch 'main' into gh/JulianCloudNTH/65/orig

cd1f7ec

ghost temporarily deployed to cadence July 4, 2026 17:36 — with GitHub Actions Inactive

ghost temporarily deployed to cadence July 4, 2026 18:07 — with GitHub Actions Inactive

ghost merged commit 005ae03 into main Jul 4, 2026
183 of 187 checks passed

ghost deleted the gh/JulianCloudNTH/65/orig branch July 4, 2026 22:06

This pull request was closed.

16 commits merged into main from gh/JulianCloudNTH/65/orig
