[MPSInductor] Fix nested loop var elimination #156566
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156566
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 7e04fb8 with merge base 1d993fa:
BROKEN TRUNK - The following job failed but was also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge -f "Lint + MPS are green"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Changes the threshold from "less than max threadgroup size" to "less than or equal to it", which eliminates redundant trivial loops. I.e., it changes the shader code generated for

```python
import torch

def f(x):
    var, mean = torch.var_mean(x, dim=2, keepdim=True)
    return x / var, var

torch.compile(f)(torch.rand(1, 16, 1024, dtype=torch.float32, device='mps'))
```

from

```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int x0 = xindex;
    threadgroup float3 tmp_acc_0[1024];
    tmp_acc_0[r0_index * 1] = 0.0;
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp0 = in_ptr0[r0_1 + 1024*x0];
        tmp_acc_0[r0_index * 1] = ::c10::metal::welford_combine(tmp_acc_0[r0_index * 1], float3(tmp0, 0.0, 1.0));
    }
    auto tmp1 = c10::metal::threadgroup_welford_combine(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp4 = in_ptr0[r0_1 + 1024*x0];
        auto tmp5 = tmp4 / tmp3;
        out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp5);
    }
}
```

to

```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int r0_1 = r0_index;
    int x0 = xindex;
    threadgroup float tmp_acc_0[1024];
    auto tmp0 = in_ptr0[r0_1 + 1024*x0];
    tmp_acc_0[r0_index * 1] = tmp0;
    auto tmp1 = c10::metal::threadgroup_welford_reduce(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    auto tmp4 = tmp0 / tmp3;
    out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp4);
}
```

Pull Request resolved: #156567
Approved by: https://github.com/dcci
ghstack dependencies: #156566
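To make the boundary concrete, here is a minimal sketch of the `<` vs `<=` comparison described above, assuming a hypothetical `emits_reduction_loop` helper and a hard-coded 1024-thread limit; the actual MPSInductor codegen is structured differently.

```python
# Illustrative sketch only: hypothetical names, not the real Inductor code.
MAX_THREADGROUP_SIZE = 1024  # matches max_total_threads_per_threadgroup in the kernels above

def emits_reduction_loop(reduction_numel: int) -> bool:
    # Old behavior: the loop was elided only when reduction_numel < MAX_THREADGROUP_SIZE,
    # so a 1024-element reduction still got a trivial one-iteration loop per thread.
    # New behavior: the loop is elided whenever reduction_numel <= MAX_THREADGROUP_SIZE.
    return reduction_numel > MAX_THREADGROUP_SIZE

assert not emits_reduction_loop(1024)  # one element per thread: no per-thread loop needed
assert emits_reduction_loop(2048)      # more elements than threads: loop still required
```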
Stack from ghstack (oldest at bottom):
As reduction results must be kept around
Add a regression test specific to this issue; see the sketch below.
Fixes #156426
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
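A minimal sketch of the kind of regression test mentioned above, built from the `var_mean` repro in this stack. The test name and skip logic are assumptions; `torch.var_mean`, `torch.compile`, `torch.testing.assert_close`, and `torch.backends.mps.is_available` are existing PyTorch APIs.

```python
import torch

def f(x):
    var, mean = torch.var_mean(x, dim=2, keepdim=True)
    return x / var, var

def test_var_mean_reduction_result_kept():  # hypothetical test name
    # Regression check for #156426: compiled MPS output must match eager.
    if not torch.backends.mps.is_available():
        return  # repro is MPS-specific; skip on other machines
    x = torch.rand(1, 16, 1024, dtype=torch.float32, device="mps")
    eager = f(x)
    compiled = torch.compile(f)(x)
    for e, c in zip(eager, compiled):
        torch.testing.assert_close(e, c)
```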