Skip to content

[ET-VK][conv2d] Auto-route SlidingWindow conv2d to im2col/GEMM via device-aware heuristic#20059

Open
SS-JIA wants to merge 3 commits into
gh/SS-JIA/557/basefrom
gh/SS-JIA/557/head
Open

[ET-VK][conv2d] Auto-route SlidingWindow conv2d to im2col/GEMM via device-aware heuristic#20059
SS-JIA wants to merge 3 commits into
gh/SS-JIA/557/basefrom
gh/SS-JIA/557/head

Conversation

@SS-JIA
Copy link
Copy Markdown
Contributor

@SS-JIA SS-JIA commented Jun 5, 2026

Stack from ghstack (oldest at bottom):

Wire a routing heuristic into the production conv2d dispatch (add_conv2d_node in Convolution.cpp) so that aten.convolution.default automatically selects the im2col + GEMM path (conv2d_gemm_impl) instead of the direct sliding-window shader when it is beneficial. The new should_use_conv2d_im2col() helper applies preconditions (groups == 1, dilation == 1, Kh*Kw > 1 — 1x1 is already handled by the optimized conv2d_pw path) and then selects im2col when device_is_mali() || c_out >= 128. The rationale is that Mali (buffer path) benefits from im2col universally, while on Adreno (texture path) im2col only amortizes the fixed im2col-gather cost once the output channel count is large enough; smaller c_out shapes stay on the direct shader where they are faster, which also avoids the costly texture3d gather that occurs for very wide im2col matrices.

Differential Revision: D107595816

[ghstack-poisoned]
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Jun 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20059

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit b361111 with merge base 502fdbe (image):

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Jun 5, 2026

CLA Missing ID

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 5, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@SS-JIA
Copy link
Copy Markdown
Contributor Author

SS-JIA commented Jun 5, 2026

@claude Review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Claude finished @SS-JIA's task in 2m 3s —— View job


PR Review: [ET-VK][conv2d] Auto-route SlidingWindow conv2d to im2col/GEMM via device-aware heuristic

  • Gather context and read changed files
  • Analyze routing heuristic logic
  • Review test coverage
  • Post review feedback

Summary

Clean PR that wires a device-aware routing heuristic into the production conv2d dispatch so that SlidingWindow convolutions can automatically select the im2col+GEMM path when beneficial. The heuristic is simple and well-motivated, the test coverage is thorough (boundary configs, forced-path selectors, cross-device verification strategy), and the force_direct escape hatch for tests is well-designed.


Bug: Transposed convolutions may be incorrectly routed to im2col+GEMM

Convolution.cpp:527-530 — After the Pointwise and Depthwise early returns, the code unconditionally evaluates should_use_conv2d_im2col() for all remaining methods, which includes Conv2dMethod::Transposed:

const bool use_im2col = force_direct
    ? false
    : should_use_conv2d_im2col(graph, weight_data, groups_val, kernel_params);
if (use_im2col) {
    return conv2d_gemm_impl(...);  // Wrong for transposed!
}

A transposed conv2d with groups==1, dilation==1, Kh*Kw > 1, and c_out >= 128 (or on Mali) would pass all preconditions in should_use_conv2d_im2col() and incorrectly dispatch to conv2d_gemm_impl, which implements standard (non-transposed) forward convolution.

Suggested fix — guard the heuristic so it only fires for SlidingWindow:

const bool use_im2col = (!force_direct && method == Conv2dMethod::SlidingWindow)
    ? should_use_conv2d_im2col(graph, weight_data, groups_val, kernel_params)
    : false;

Fix this →


Minor observations

  1. Redundant kernel-size check (Convolution.cpp:329-331) — The Kh*Kw <= 1 check is defensive but can never fire for SlidingWindow (since get_conv2d_method routes 1×1 to Pointwise). If you add the method == SlidingWindow guard above, this becomes even more clearly dead code. Fine to keep as a safety net, just noting it.

  2. Magic constant — The c_out >= 128 threshold is well-justified in comments. Consider extracting it to a named constant (e.g., kIm2colCoutThreshold) for grep-ability if this threshold needs tuning in the future, but this is entirely optional.

  3. Comment verbosity (Convolution.cpp:300-314) — The block comment on should_use_conv2d_im2col is detailed, which is appropriate for a heuristic where the "why" matters. Reads well.

  4. Test design — The boundary test configs at c_out=64/128 combined with all three selectors ("", "direct", "im2col") is a solid strategy for catching routing regressions. The "direct" selector addition also nicely ensures Mali CI doesn't lose coverage of the direct shader path.


Verdict

The PR is well-structured and the heuristic is sound. The transposed-conv routing issue above is the only correctness concern — it's likely not triggered by current model graphs (transposed convs with c_out≥128 and large kernels are uncommon in practice), but it's a latent bug worth fixing before landing.


[ghstack-poisoned]
SS-JIA pushed a commit that referenced this pull request Jun 5, 2026
…vice-aware heuristic

Pull Request resolved: #20059

Wire a routing heuristic into the production conv2d dispatch (`add_conv2d_node` in `Convolution.cpp`) so that `aten.convolution.default` automatically selects the im2col + GEMM path (`conv2d_gemm_impl`) instead of the direct sliding-window shader when it is beneficial. The new `should_use_conv2d_im2col()` helper applies preconditions (`groups == 1`, `dilation == 1`, `Kh*Kw > 1` — 1x1 is already handled by the optimized `conv2d_pw` path) and then selects im2col when `device_is_mali() || c_out >= 128`. The rationale is that Mali (buffer path) benefits from im2col universally, while on Adreno (texture path) im2col only amortizes the fixed im2col-gather cost once the output channel count is large enough; smaller `c_out` shapes stay on the direct shader where they are faster, which also avoids the costly texture3d gather that occurs for very wide im2col matrices.
ghstack-source-id: 390484641
@exported-using-ghexport

Differential Revision: [D107595816](https://our.internmc.facebook.com/intern/diff/D107595816/)
[ghstack-poisoned]
SS-JIA pushed a commit that referenced this pull request Jun 5, 2026
…vice-aware heuristic

Pull Request resolved: #20059

Wire a routing heuristic into the production conv2d dispatch (`add_conv2d_node` in `Convolution.cpp`) so that `aten.convolution.default` automatically selects the im2col + GEMM path (`conv2d_gemm_impl`) instead of the direct sliding-window shader when it is beneficial. The new `should_use_conv2d_im2col()` helper applies preconditions (`groups == 1`, `dilation == 1`, `Kh*Kw > 1` — 1x1 is already handled by the optimized `conv2d_pw` path) and then selects im2col when `device_is_mali() || c_out >= 128`. The rationale is that Mali (buffer path) benefits from im2col universally, while on Adreno (texture path) im2col only amortizes the fixed im2col-gather cost once the output channel count is large enough; smaller `c_out` shapes stay on the direct shader where they are faster, which also avoids the costly texture3d gather that occurs for very wide im2col matrices.
ghstack-source-id: 390528476
@exported-using-ghexport

Differential Revision: [D107595816](https://our.internmc.facebook.com/intern/diff/D107595816/)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants