[Passes] Move LoopInterchange into optimization pipeline #145503

kasuga-fj · 2025-06-24T12:05:28Z

As mentioned in #145071, LoopInterchange should be part of the optimization pipeline rather than the simplification pipeline. This patch moves LoopInterchange into the optimization pipeline.

More contexts:

By default, LoopInterchange attempts to improve data locality, however, it also takes increasing vectorization opportunities into account. Given that, it is reasonable to run it as close to vectorization as possible.
I looked into previous changes related to the placement of LoopInterchange, but couldn’t find any strong motivation suggesting that it benefits other simplifications.
As far as I tried some tests (including llvm-test-suite), removing LoopInterchange from the simplification pipeline does not affect other simplifications. Therefore, there doesn't seem to be much value in keeping it there.
The new position reduces compile-time for ThinLTO, probably because it only runs once per function in post-link optimization, rather than both in pre-link and post-link optimization.

As mentioned in llvm#145071, LoopInterchange should be part of the optimization pipeline rather than the simplification pipeline. This patch moves LoopInterchange into the optimization pipeline. More contexts: - By default, LoopInterchange attempts to improve data locality, however, it also takes vectorization opportunities into account. Given that, it is reasonable to run it as close to vectorization as possible. - I looked into previous changes related to the placement of LoopInterchange, but couldn’t find any strong motivation suggesting that it benefits other simplifications. - As far as I tried some tests (including llvm-test-suite), removing LoopInterchange from the simplification pipeline does not affect other simplifications. Therefore, there doesn't seem to be much value in keeping it there.

kasuga-fj · 2025-06-24T12:11:45Z

llvm/lib/Passes/PassBuilderPipelines.cpp

+  if (PTO.LoopInterchange)
+    LPM.addPass(LoopInterchangePass());


I'm not sure where LoopInterchange should be placed within the optimization pipeline...

Hmm, would be interesting to get some more data about the impact on different positions, but current palcement seems reasonable to me at least

Yes, collecting diverse data would be ideal, but since the current LoopInterchange is rarely triggered in practice, I haven’t been able to gather enough data to have a meaningful discussion about the optimal placement...

TSVC s235 might be an interesting example. If LoopDistribute were extended to support non-innermost loops and it ran before LoopInterchange, the following could become possible:

Original:

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++) for (int i = 0; i < LEN2; i++) { a[i] += b[i] * c[i]; for (int j = 1; j < LEN2; j++) { aa[j][i] = aa[j-1][i] + bb[j][i] * a[i]; } }

Distributed:

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++) for (int i = 0; i < LEN2; i++) a[i] += b[i] * c[i]; for (int nl = 0; nl < 200*(ntimes/LEN2); nl++) for (int i = 0; i < LEN2; i++) for (int j = 1; j < LEN2; j++) aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];

Interchanged:

for (int nl = 0; nl < 200*(ntimes/LEN2); nl++) for (int i = 0; i < LEN2; i++) a[i] += b[i] * c[i]; // The i-loop and j-loop are interchanged. for (int nl = 0; nl < 200*(ntimes/LEN2); nl++) for (int j = 1; j < LEN2; j++) for (int i = 0; i < LEN2; i++) aa[j][i] = aa[j-1][i] + bb[j][i] * a[i];

IIRC, when I ran the last code in the past, it was about 8x faster than the original. But achieving this would require a lot of work on LoopDistribute...

kasuga-fj · 2025-06-24T12:13:32Z

llvm/lib/Passes/PassBuilderPipelines.cpp

-  if (PTO.LoopInterchange)
-    LPM2.addPass(LoopInterchangePass());


It might be reasonable to keep this, at least for now?

fhahn

Makes sense in general but do you have any data on how this change impacts the # of loops interchanged, and the impact on perf for the cases where it helps?

This should also at least have a test checking the position of LoopInterchange in the pipline and ideally a phase-ordering tests that shows an improvement

kasuga-fj · 2025-06-24T15:59:31Z

As tested with the llvm-test-suite, this patch reduces the number of interchanged loops by one. The omitted transformation corresponds to the following loop in llvm-test-suite/MultiSource/Applications/obsequi/tables.c:

for (i = 0; i < 32; i++)
  for (j = 0; j < 32; j++)
    for (k = 0; k < 2; k++)
      fill_in_key_entry(&g_keyinfo[k][i][j], n_rows, n_cols);

This change is due to the ordering between full unrolling and loop interchange. In the original pipeline, the k-loop is lifted to the outermost position because doing so improves data locality. On the other hand, with this patch, the k-loop is fully unrolled first, preventing the interchange from being applied. While it's unclear to me which behavior is preferable in general, in this case, performance remained largely unaffected.

I think LoopInterchange should not be in the simplification pipeline because loop versioning may be introduced inside the LoopInterchange in the future (ref: #123436), which is better suited to the optimization pipeline. Additionally, if we're going to change its position, I believe we should do it before #124911, which is the primary motivation behind submitting this patch now. Or should we wait until it actually becomes a problem?

This should also at least have a test checking the position of LoopInterchange in the pipeline

Exactly, I'll add it. Just to confirm, are you referring to a test like https://github.com/llvm/llvm-project/blob/559218f7ee78f77d32c14cd14a56b799167596f5/llvm/test/Other/new-pass-manager.ll ?

ideally a phase-ordering tests that shows an improvement

I agree with it, but performance improvement is not a purpose of this patch, and I don't have a good example for it...

llvmbot · 2025-06-26T12:57:43Z

@llvm/pr-subscribers-llvm-transforms

Author: Ryotaro Kasuga (kasuga-fj)

Changes

As mentioned in #145071, LoopInterchange should be part of the optimization pipeline rather than the simplification pipeline. This patch moves LoopInterchange into the optimization pipeline.

More contexts:

By default, LoopInterchange attempts to improve data locality, however, it also takes increasing vectorization opportunities into account. Given that, it is reasonable to run it as close to vectorization as possible.
I looked into previous changes related to the placement of LoopInterchange, but couldn’t find any strong motivation suggesting that it benefits other simplifications.
As far as I tried some tests (including llvm-test-suite), removing LoopInterchange from the simplification pipeline does not affect other simplifications. Therefore, there doesn't seem to be much value in keeping it there.

Full diff: https://github.com/llvm/llvm-project/pull/145503.diff

2 Files Affected:

(modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+4-3)
(added) llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll (+48)

diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index c83d2dc1f1514..98821bb1408a7 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -690,9 +690,6 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
 
   LPM2.addPass(LoopDeletionPass());
 
-  if (PTO.LoopInterchange)
-    LPM2.addPass(LoopInterchangePass());
-
   // Do not enable unrolling in PreLinkThinLTO phase during sample PGO
   // because it changes IR to makes profile annotation in back compile
   // inaccurate. The normal unroller doesn't pay attention to forced full unroll
@@ -1547,6 +1544,10 @@ PassBuilder::buildModuleOptimizationPipeline(OptimizationLevel Level,
   //        this may need to be revisited once we run GVN before loop deletion
   //        in the simplification pipeline.
   LPM.addPass(LoopDeletionPass());
+
+  if (PTO.LoopInterchange)
+    LPM.addPass(LoopInterchangePass());
+
   OptimizePM.addPass(createFunctionToLoopPassAdaptor(
       std::move(LPM), /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/false));
 
diff --git a/llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll b/llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll
new file mode 100644
index 0000000000000..d0b6447d92fe7
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll
@@ -0,0 +1,48 @@
+; RUN: opt -passes='default<O3>' -enable-loopinterchange -disable-output \
+; RUN:     -disable-verify -verify-analysis-invalidation=0 \
+; RUN:     -debug-pass-manager=quiet %s 2>&1 | FileCheck %s
+
+; Test the position of LoopInterchange in the pass pipeline.
+
+; CHECK-NOT:  Running pass: LoopInterchangePass
+; CHECK:      Running pass: ControlHeightReductionPass
+; CHECK-NEXT: Running pass: LoopSimplifyPass
+; CHECK-NEXT: Running pass: LCSSAPass
+; CHECK-NEXT: Running pass: LoopRotatePass
+; CHECK-NEXT: Running pass: LoopDeletionPass
+; CHECK-NEXT: Running pass: LoopRotatePass
+; CHECK-NEXT: Running pass: LoopDeletionPass
+; CHECK-NEXT: Running pass: LoopInterchangePass
+; CHECK-NEXT: Running pass: LoopDistributePass
+; CHECK-NEXT: Running pass: InjectTLIMappings
+; CHECK-NEXT: Running pass: LoopVectorizePass
+
+
+define void @foo(ptr %a, i32 %n) {
+entry:
+  br label %for.i.header
+
+for.i.header:
+  %i = phi i32 [ 0, %entry ], [ %i.next, %for.i.latch ]
+  br label %for.j
+
+for.j:
+  %j = phi i32 [ 0, %for.i.header ], [ %j.next, %for.j ]
+  %tmp = mul i32 %i, %n
+  %offset = add i32 %tmp, %j
+  %idx = getelementptr inbounds i32, ptr %a, i32 %offset
+  %load = load i32, ptr %idx, align 4
+  %inc = add i32 %load, 1
+  store i32 %inc, ptr %idx, align 4
+  %j.next = add i32 %j, 1
+  %j.exit = icmp eq i32 %j.next, %n
+  br i1 %j.exit, label %for.i.latch, label %for.j
+
+for.i.latch:
+  %i.next = add i32 %i, 1
+  %i.exit = icmp eq i32 %i.next, %n
+  br i1 %i.exit, label %for.i.header, label %exit
+
+exit:
+  ret void
+}

kasuga-fj · 2025-06-26T12:59:25Z

llvm/test/Transforms/LoopInterchange/position-in-pipeline.ll

+; RUN: opt -passes='default<O3>' -enable-loopinterchange -disable-output \
+; RUN:     -disable-verify -verify-analysis-invalidation=0 \
+; RUN:     -debug-pass-manager=quiet %s 2>&1 | FileCheck %s
+
+; Test the position of LoopInterchange in the pass pipeline.
+
+; CHECK-NOT:  Running pass: LoopInterchangePass
+; CHECK:      Running pass: ControlHeightReductionPass
+; CHECK-NEXT: Running pass: LoopSimplifyPass
+; CHECK-NEXT: Running pass: LCSSAPass
+; CHECK-NEXT: Running pass: LoopRotatePass
+; CHECK-NEXT: Running pass: LoopDeletionPass
+; CHECK-NEXT: Running pass: LoopRotatePass
+; CHECK-NEXT: Running pass: LoopDeletionPass
+; CHECK-NEXT: Running pass: LoopInterchangePass
+; CHECK-NEXT: Running pass: LoopDistributePass
+; CHECK-NEXT: Running pass: InjectTLIMappings
+; CHECK-NEXT: Running pass: LoopVectorizePass


Please let me know if there is a better way...

fhahn

In light of enabling by default, would be good to check the compile time impact for the new position, which may have less overhead?

fhahn · 2025-06-27T13:44:04Z

llvm/lib/Passes/PassBuilderPipelines.cpp

+  if (PTO.LoopInterchange)
+    LPM.addPass(LoopInterchangePass());


Hmm, would be interesting to get some more data about the impact on different positions, but current palcement seems reasonable to me at least

kasuga-fj · 2025-06-27T14:20:07Z

In light of enabling by default, would be good to check the compile time impact for the new position, which may have less overhead?

I believe it reduces overhead, but I didn't actually measure it. Anyway, I’m planning to work on reducing the compile time of LoopInterchange, so I will take the impact then as well.

sjoerdmeijer · 2025-07-01T14:18:31Z

I think moving this to the optimisation pipeline makes perfect sense.

In terms of performance improvements or not, I am not expecting it to make a big difference, and I am not concerned about it. For two reasons: we know interchange is not yet triggering for our motivating examples and from our point of view this is about enablement, and later we will become more interested in performance and might need other optimisers too like the distribute example you gave. Therefore, I expect some more optimisation pipeline tuning later. This won't be a massive churn, because it's just moving the pass and that's why I don't mind its current position in the pipeline, or this new one.

My only curiousity is if this changes compile-times. That should be a quick check. If you want, I can quickly do this if that helps.

kasuga-fj · 2025-07-02T12:28:01Z

Here are the compile-time impact by changing the position.

https://llvm-compile-time-tracker.com/compare.php?from=f02992c483d40325ceac5c3520b0b0fca42f8928&to=21856d54baa0b81035fb16c736cbc71cbb4df87a&stat=instructions%3Au

As mentioned in #124911 (comment), it reduces the compile-time, particularly for ThinLTO. However, since I'm planning to submit some PRs that modify LoopInterchange to further reduce compile time, it might be better to reevaluate later.

My only curiousity is if this changes compile-times.

My concern is that, once it is enabled by default, repositioning it may require additional effort. That said, since it's rarely triggered in practice at this moment, compile time appears to be the only aspect with any measurable data.

sjoerdmeijer

Okay, thanks, LGTM.

Like I said, the move makes sense, and it improves compile-times, so there is a lot to like here.

I will leave it up to you if you want to merge this. I think you can, and even if we want to change things later, that won't be a lot of churn and I am expecting some anyway. But again, up to you.

kasuga-fj · 2025-07-02T17:01:53Z

I also don’t have any specific cases where this change would cause a regression. So, I will wait for a few days just in case, and if there are no objections, I will go ahead and merge.

kasuga-fj · 2025-07-04T10:57:45Z

Since there appear to be no strong objections, I will go ahead and merge this for now.

For anyone landing on this PR: Please feel free to revert if you run into any issue.

kasuga-fj commented Jun 24, 2025

View reviewed changes

kasuga-fj requested review from nikic, fhahn and sjoerdmeijer June 24, 2025 12:14

fhahn reviewed Jun 24, 2025

View reviewed changes

Add test to check the position of LoopInterchange in the pipeline

11e2519

llvmbot added the llvm:transforms label Jun 26, 2025

kasuga-fj commented Jun 26, 2025

View reviewed changes

fhahn reviewed Jun 27, 2025

View reviewed changes

kasuga-fj mentioned this pull request Jul 1, 2025

[LoopInterchange] Enable it by default #124911

Open

sjoerdmeijer approved these changes Jul 2, 2025

View reviewed changes

kasuga-fj merged commit 3099b7e into llvm:main Jul 4, 2025
8 checks passed

		if (PTO.LoopInterchange)
		LPM2.addPass(LoopInterchangePass());

[Passes] Move LoopInterchange into optimization pipeline #145503

[Passes] Move LoopInterchange into optimization pipeline #145503

Uh oh!

Conversation

kasuga-fj commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kasuga-fj Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

fhahn Jun 27, 2025

Choose a reason for hiding this comment

Uh oh!

kasuga-fj Jun 27, 2025

Choose a reason for hiding this comment

Uh oh!

kasuga-fj Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

fhahn left a comment

Choose a reason for hiding this comment

Uh oh!

kasuga-fj commented Jun 24, 2025

Uh oh!

llvmbot commented Jun 26, 2025

Uh oh!

kasuga-fj Jun 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fhahn left a comment

Choose a reason for hiding this comment

Uh oh!

fhahn Jun 27, 2025

Choose a reason for hiding this comment

Uh oh!

kasuga-fj commented Jun 27, 2025

Uh oh!

sjoerdmeijer commented Jul 1, 2025

Uh oh!

kasuga-fj commented Jul 2, 2025

Uh oh!

sjoerdmeijer left a comment

Choose a reason for hiding this comment

Uh oh!

kasuga-fj commented Jul 2, 2025

Uh oh!

kasuga-fj commented Jul 4, 2025

Uh oh!

Uh oh!

Uh oh!

kasuga-fj commented Jun 24, 2025 •

edited

Loading

kasuga-fj Jun 26, 2025 •

edited

Loading