
JIT: Move loop inversion to after loop recognition #115850


Merged
amanasifkhalid merged 6 commits into dotnet:main from move-loop-inversion on Jun 14, 2025

Conversation

amanasifkhalid
Member

Prerequisite to #113709. I expect diffs to go both ways: In some cases, loop canonicalization unlocks pattern-based loop inversion, whereas in other cases, we now recognize fewer loops due to loop inversion no longer introducing new cycles pre-canonicalization.
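For readers unfamiliar with the transformation being moved, here is a minimal source-level sketch of loop inversion, assuming the usual textbook shape: a top-tested `while` loop becomes a one-time guard plus a bottom-tested `do`/`while`. The function names (`sumWhile`, `sumInverted`, `checkEquivalent`) are hypothetical and exist only for illustration; the JIT performs this on its flow graph, not on source code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Top-tested loop: the condition is evaluated at the loop head, and each
// iteration ends with a jump back to that head.
int sumWhile(const std::vector<int>& values)
{
    int         sum = 0;
    std::size_t i   = 0;
    while (i < values.size())
    {
        sum += values[i];
        i++;
    }
    return sum;
}

// Inverted loop: the condition is duplicated. One copy guards entry to the
// loop, the other sits at the bottom and doubles as the back edge.
int sumInverted(const std::vector<int>& values)
{
    int         sum = 0;
    std::size_t i   = 0;
    if (i < values.size()) // duplicated condition, evaluated once on entry
    {
        do
        {
            sum += values[i];
            i++;
        } while (i < values.size()); // bottom test
    }
    return sum;
}

// Hypothetical helper: both shapes must compute the same result.
void checkEquivalent(const std::vector<int>& values)
{
    assert(sumWhile(values) == sumInverted(values));
}
```

Calling `checkEquivalent` with an empty vector exercises the guard, which is the case the duplicated condition exists to handle.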

@Copilot Copilot AI review requested due to automatic review settings May 21, 2025 21:11
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 21, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR moves the loop inversion phase to after loop recognition, adds immediate block compaction/removal for newly altered test blocks, and triggers a DFS rebuild with fresh loop analysis when any loops were inverted.

  • Add single-predecessor block compaction/removal in optInvertWhileLoop
  • Recompute the DFS tree and re-run loop finding after any loop inversions
  • Relocate the PHASE_INVERT_LOOPS call in the compilation pipeline
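As a toy illustration of why a flow-graph change calls for a fresh DFS before loop finding, the sketch below counts back edges (edges to a block still on the DFS stack) in a small adjacency-list graph. It is not the JIT's implementation: `Graph`, `Color`, `visit`, and `countBackEdges` are hypothetical names, and real loop recognition does considerably more than count back edges.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjacency list: block index -> indices of successor blocks.
using Graph = std::vector<std::vector<int>>;

enum class Color { White, Grey, Black }; // unvisited, on the DFS stack, finished

static void visit(const Graph& g, int block, std::vector<Color>& color, int& backEdges)
{
    color[block] = Color::Grey;
    for (int succ : g[block])
    {
        if (color[succ] == Color::Grey)
        {
            backEdges++; // successor is an ancestor on the DFS stack: a cycle
        }
        else if (color[succ] == Color::White)
        {
            visit(g, succ, color, backEdges);
        }
    }
    color[block] = Color::Black;
}

// Runs a fresh DFS from `entry` and returns the number of back edges found.
int countBackEdges(const Graph& g, int entry)
{
    assert(entry >= 0 && static_cast<std::size_t>(entry) < g.size());
    std::vector<Color> color(g.size(), Color::White);
    int                backEdges = 0;
    visit(g, entry, color, backEdges);
    return backEdges;
}
```

On a straight-line graph `0 -> 1 -> 2 -> 3` the count is zero; once an edge `2 -> 1` is added, only a new traversal reports the cycle, which is the reason any cached result has to be rebuilt after the graph changes.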

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File           Description
optimizer.cpp  Inserted block compaction/removal and DFS invalidation
compiler.cpp   Moved the loop inversion phase to a later point in compCompile

Comments suppressed due to low confidence (1)

src/coreclr/jit/compiler.cpp:4668

  • Add targeted tests that verify the new phase ordering and ensure that both block compaction and removal occur as expected after loop inversion.
DoPhase(this, PHASE_INVERT_LOOPS, &Compiler::optInvertLoops);

@amanasifkhalid
Member Author

The diffs will be hard to parse for this, so I'm looking more at metrics. Here are some metric diffs for aspnet on win-x64:

Base:

  • Loops found: 29252
  • Loops inverted: 10694
  • Loops cloned: 1885
  • Loops unrolled: 12
  • Loops IV widened: 3047
  • Widened IVs: 3047
  • Unused IVs removed: 4539
  • Loops downward counted: 1873
  • Loops strength reduced: 1702
  • RBO: 30708
  • Jump threadings: 9735

Diff:

  • Loops found: 29085 (-167)
  • Loops inverted: 9074 (-1620)
  • Loops cloned: 3596 (+1711)
  • Loops unrolled: 12
  • Loops IV widened: 2999 (-48)
  • Widened IVs: 2999 (-48)
  • Unused IVs removed: 4498 (-41)
  • Loops downward counted: 1863 (-10)
  • Loops strength reduced: 1693 (-9)
  • RBO: 33859 (+3151)
  • Jump threadings: 9774 (+39)

The metrics show that we're inverting fewer loops overall, but there are plenty of cases where we invert new loops, which unblocks other loop opts; in particular, we're doing a lot more cloning. The drop in loops found overall is due to loop inversion no longer introducing new cycles before loop recognition runs.
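The deltas in the listing can be re-derived with a few lines of arithmetic. The sketch below copies a subset of the base and diff values from this comment and computes the absolute and relative change for each; the `Metric` struct and the helper names are hypothetical, introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdio>

// One row of the metric listing: the value before (base) and after (diff).
struct Metric
{
    const char* name;
    long        base;
    long        diff;
};

// Values copied from the aspnet win-x64 listing in this comment.
const Metric kMetrics[] = {
    {"Loops found", 29252, 29085},
    {"Loops inverted", 10694, 9074},
    {"Loops cloned", 1885, 3596},
    {"RBO", 30708, 33859},
    {"Jump threadings", 9735, 9774},
};

// Absolute change, matching the parenthesized numbers in the listing.
long delta(const Metric& m)
{
    return m.diff - m.base;
}

// Relative change as a percentage of the base value.
double percentChange(const Metric& m)
{
    assert(m.base != 0);
    return 100.0 * static_cast<double>(delta(m)) / static_cast<double>(m.base);
}

// Prints one line per metric: name, signed delta, signed percentage.
void printMetricDeltas()
{
    for (const Metric& m : kMetrics)
    {
        std::printf("%-16s %+6ld (%+.1f%%)\n", m.name, delta(m), percentChange(m));
    }
}
```

Seen as percentages, cloned loops rise by roughly 91% while inverted loops fall by roughly 15%, which is consistent with cloning being the main driver of the size diffs.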

PerfScore diffs are overwhelmingly negative in non-PGO collections. This might be heuristic-derived profile weights for cloned loops inflating PerfScores, and/or something else...

@AndyAyersMS
Member

Diffs

Assuming the diffs are largely cloning related, it appears that extra cloning is pretty costly. It is hard to know how much of it is really beneficial. I wish we had better heuristics.

@amanasifkhalid
Member Author

amanasifkhalid commented May 27, 2025

> It is hard to know how much of it is really beneficial.

Right, because of this, I've decided to flip my ordering and enable graph-based loop inversion with the existing phase ordering. Locally, the diffs are slightly easier to triage. Once that's in, hopefully it'll be easier to triage the diffs on this PR and see if there's anything actionable.

@amanasifkhalid
Member Author

CI failures indicate we will need the fix in #113935 to proceed. @EgorBo are you able to revive that work?

@EgorBo
Member

EgorBo commented Jun 6, 2025

> CI failures indicate we will need the fix in #113935 to proceed. @EgorBo are you able to revive that work?

Ah, sure, let me revive it

@amanasifkhalid
Member Author

I'm removing fgRenumberBlocks while I'm here to avoid opening another PR, FYI.

@amanasifkhalid
Member Author

Diffs show yet another round of large size increases, though most of this seems to be driven by coreclr_tests. In particular, it looks like we're doing a lot more loop cloning in our HW intrinsics code:

Top method regressions (bytes):
        3902 (49.82 % of base) : 308137.dasm - CompareVectorWithZero:TestVector512Equality() (FullOpts)
        3902 (49.82 % of base) : 308162.dasm - CompareVectorWithZero:TestVector512Inequality() (FullOpts)
        3062 (107.21 % of base) : 323865.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_SubtractionInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        3062 (107.21 % of base) : 323985.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_SubtractionUInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        3058 (107.07 % of base) : 321271.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_AdditionInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        3058 (107.07 % of base) : 321391.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_AdditionUInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        3032 (93.18 % of base) : 321266.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_AdditionInt64:RunMaskingValueScenario():this (FullOpts)
        3032 (93.18 % of base) : 321386.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_AdditionUInt64:RunMaskingValueScenario():this (FullOpts)
        3028 (92.94 % of base) : 323860.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_SubtractionInt64:RunMaskingValueScenario():this (FullOpts)
        3028 (92.94 % of base) : 323980.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_SubtractionUInt64:RunMaskingValueScenario():this (FullOpts)
        2970 (100.00 % of base) : 321971.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        2970 (100.00 % of base) : 322091.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionUInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        2970 (100.00 % of base) : 323084.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_MultiplyInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        2970 (100.00 % of base) : 323204.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_MultiplyUInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        2902 (103.35 % of base) : 324099.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorUnaryOpTest__op_UnaryNegationInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        2902 (103.35 % of base) : 324213.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorUnaryOpTest__op_UnaryNegationUInt64:RunBroadcastAndMaskingScenario():this (FullOpts)
        2888 (47.56 % of base) : 128125.dasm - VectorTest+VectorRelopTest`1[ulong]:VectorRelOp(ulong,ulong):int (Tier0-FullOpts)
        2872 (82.62 % of base) : 322086.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionUInt64:RunMaskingValueScenario():this (FullOpts)
        2872 (82.62 % of base) : 323079.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_MultiplyInt64:RunMaskingValueScenario():this (FullOpts)
        2872 (82.62 % of base) : 323199.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_MultiplyUInt64:RunMaskingValueScenario():this (FullOpts)

Top method improvements (bytes):
        -526 (-8.20 % of base) : 321525.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_BitwiseAndSByte:RunMaskingValueScenario():this (FullOpts)
        -526 (-8.20 % of base) : 321755.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_BitwiseOrSByte:RunMaskingValueScenario():this (FullOpts)
        -526 (-8.20 % of base) : 322385.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_ExclusiveOrSByte:RunMaskingValueScenario():this (FullOpts)
        -524 (-8.19 % of base) : 321410.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_BitwiseAndByte:RunMaskingValueScenario():this (FullOpts)
        -524 (-8.19 % of base) : 321640.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_BitwiseOrByte:RunMaskingValueScenario():this (FullOpts)
        -524 (-8.19 % of base) : 322270.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_ExclusiveOrByte:RunMaskingValueScenario():this (FullOpts)
        -516 (-8.95 % of base) : 321874.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionByte:RunMaskingZeroScenario():this (FullOpts)
        -516 (-8.93 % of base) : 321994.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionSByte:RunMaskingZeroScenario():this (FullOpts)
        -492 (-16.12 % of base) : 299397.dasm - SmallLoop1:TestEntryPoint():int (FullOpts)
        -492 (-16.12 % of base) : 19722.dasm - SmallLoop1:TestEntryPoint():int (Tier0-FullOpts)
        -438 (-7.39 % of base) : 321870.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionByte:RunMaskingValueScenario():this (FullOpts)
        -438 (-7.37 % of base) : 321990.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionSByte:RunMaskingValueScenario():this (FullOpts)
        -434 (-7.66 % of base) : 321174.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_AdditionByte:RunMaskingZeroScenario():this (FullOpts)
        -434 (-7.64 % of base) : 321294.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_AdditionSByte:RunMaskingZeroScenario():this (FullOpts)
        -434 (-7.64 % of base) : 322987.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_MultiplyByte:RunMaskingZeroScenario():this (FullOpts)
        -434 (-7.61 % of base) : 323107.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_MultiplySByte:RunMaskingZeroScenario():this (FullOpts)
        -432 (-7.31 % of base) : 321922.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionInt16:RunMaskingZeroScenario():this (FullOpts)
        -432 (-7.31 % of base) : 322042.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_DivisionUInt16:RunMaskingZeroScenario():this (FullOpts)
        -426 (-7.51 % of base) : 323768.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_SubtractionByte:RunMaskingZeroScenario():this (FullOpts)
        -426 (-7.49 % of base) : 323888.dasm - JIT.HardwareIntrinsics.General._Vector512_1.VectorBinaryOpTest__op_SubtractionSByte:RunMaskingZeroScenario():this (FullOpts)

Top method regressions (percentages):
          86 (358.33 % of base) : 30264.dasm - System.Runtime.Intrinsics.Vector64:Create(double):System.Runtime.Intrinsics.Vector64`1[double] (Instrumented Tier1)
          86 (358.33 % of base) : 39176.dasm - System.Runtime.Intrinsics.Vector64:Create(double):System.Runtime.Intrinsics.Vector64`1[double] (Tier1)
          86 (358.33 % of base) : 305146.dasm - System.Runtime.Intrinsics.Vector64:Create[double](double):System.Runtime.Intrinsics.Vector64`1[double] (FullOpts)
          86 (358.33 % of base) : 29956.dasm - System.Runtime.Intrinsics.Vector64:Create[double](double):System.Runtime.Intrinsics.Vector64`1[double] (Tier0-FullOpts)
         188 (348.15 % of base) : 65714.dasm - System.Runtime.Intrinsics.Vector64:Narrow(System.Runtime.Intrinsics.Vector64`1[long],System.Runtime.Intrinsics.Vector64`1[long]):System.Runtime.Intrinsics.Vector64`1[int] (Instrumented Tier1)
         188 (348.15 % of base) : 65777.dasm - System.Runtime.Intrinsics.Vector64:Narrow(System.Runtime.Intrinsics.Vector64`1[ulong],System.Runtime.Intrinsics.Vector64`1[ulong]):System.Runtime.Intrinsics.Vector64`1[uint] (Instrumented Tier1)
         146 (347.62 % of base) : 54822.dasm - System.Runtime.Intrinsics.Vector64`1[double]:op_Addition(System.Runtime.Intrinsics.Vector64`1[double],System.Runtime.Intrinsics.Vector64`1[double]):System.Runtime.Intrinsics.Vector64`1[double] (Tier1)
          40 (333.33 % of base) : 344367.dasm - SwitchTest:TestEntryPoint():int (FullOpts)
          40 (333.33 % of base) : 124230.dasm - SwitchTest:TestEntryPoint():int (Tier0-FullOpts)
          82 (292.86 % of base) : 30256.dasm - System.Runtime.Intrinsics.Vector64:Create(float):System.Runtime.Intrinsics.Vector64`1[float] (Instrumented Tier1)
          82 (292.86 % of base) : 39215.dasm - System.Runtime.Intrinsics.Vector64:Create(float):System.Runtime.Intrinsics.Vector64`1[float] (Tier1)
          82 (292.86 % of base) : 30262.dasm - System.Runtime.Intrinsics.Vector64:Create[float](float):System.Runtime.Intrinsics.Vector64`1[float] (Instrumented Tier1)
          82 (292.86 % of base) : 54956.dasm - System.Runtime.Intrinsics.Vector64:Create[float](float):System.Runtime.Intrinsics.Vector64`1[float] (Tier1)
         204 (291.43 % of base) : 65655.dasm - System.Runtime.Intrinsics.Vector64:Narrow(System.Runtime.Intrinsics.Vector64`1[double],System.Runtime.Intrinsics.Vector64`1[double]):System.Runtime.Intrinsics.Vector64`1[float] (Instrumented Tier1)
          68 (283.33 % of base) : 30252.dasm - System.Runtime.Intrinsics.Vector64:Create(int):System.Runtime.Intrinsics.Vector64`1[int] (Instrumented Tier1)
          68 (283.33 % of base) : 39238.dasm - System.Runtime.Intrinsics.Vector64:Create(uint):System.Runtime.Intrinsics.Vector64`1[uint] (Instrumented Tier1)
          68 (283.33 % of base) : 305145.dasm - System.Runtime.Intrinsics.Vector64:Create[int](int):System.Runtime.Intrinsics.Vector64`1[int] (FullOpts)
          68 (283.33 % of base) : 29952.dasm - System.Runtime.Intrinsics.Vector64:Create[int](int):System.Runtime.Intrinsics.Vector64`1[int] (Tier0-FullOpts)
          68 (283.33 % of base) : 55016.dasm - System.Runtime.Intrinsics.Vector64:Create[uint](uint):System.Runtime.Intrinsics.Vector64`1[uint] (Instrumented Tier1)
          68 (283.33 % of base) : 55288.dasm - System.Runtime.Intrinsics.Vector64:Create[uint](uint):System.Runtime.Intrinsics.Vector64`1[uint] (Tier1)

Diffs in our non-test collections, particularly the ones with Dynamic PGO enabled, are much less dramatic. Also, the TP improvement pays for #116017, which is nice. @AndyAyersMS are you ok with this going into Preview 6?

Member

@AndyAyersMS AndyAyersMS left a comment


Sure, let's take this.

@amanasifkhalid
Member Author

/ba-g unrelated wasm build failure, and a known issue

@amanasifkhalid amanasifkhalid merged commit b146d75 into dotnet:main Jun 14, 2025
106 of 109 checks passed
@amanasifkhalid amanasifkhalid deleted the move-loop-inversion branch June 14, 2025 17:22