JIT: Remove early fgOptimizeBranch pass #115734


Merged
merged 2 commits into dotnet:main on May 21, 2025

Conversation

amanasifkhalid (Member)

Part of #107749. fgOptimizeBranch looks for cases where a block jumps to a conditional block, which jumps back to the lexical successor of the first block. When it is profitable, fgOptimizeBranch clones the condition into the first block to avoid the jump-check-jump pattern. While probably more pattern-matchy in its analysis than ideal, this can help block layout align loops properly.

However, I see little reason to run this transformation early, especially before loop recognition; from the diffs, fgOptimizeBranch seems to create unnatural loop shapes with some frequency, pessimizing downstream loop opts. Removing the early pass means we can also make the late pass more aggressive without splitting off the implementation.
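The jump-check-jump shape and the condition-cloning transformation can be sketched in C. This is a hypothetical illustration of the flow shape only, not JIT code; the function and label names are made up:

```c
/* Sketch of the pattern fgOptimizeBranch targets: block B1 ends in an
 * unconditional jump to a conditional block B2, and B2 jumps back to
 * B1's lexical successor. */
int sum_jump_check_jump(const int *a, int n) {
    int i = 0, sum = 0;
    goto test;              /* B1: unconditional jump over the body... */
body:                       /* ...which is B1's lexical successor */
    sum += a[i];
    i++;
test:                       /* B2: the conditional block */
    if (i < n) goto body;   /* jumps back to B1's successor */
    return sum;
}

/* After cloning the condition into B1, the entry tests the condition
 * itself and falls through into the body; the loop is now do-while
 * shaped and the extra jump on the hot path is gone. */
int sum_condition_cloned(const int *a, int n) {
    int i = 0, sum = 0;
    if (i >= n) goto done;  /* cloned copy of the test */
body:
    sum += a[i];
    i++;
    if (i < n) goto body;   /* original test, now the back-edge */
done:
    return sum;
}
```

Both functions compute the same result; only the branch structure differs.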

@Copilot Copilot AI review requested due to automatic review settings May 19, 2025 17:59
@Copilot Copilot AI left a comment


Pull Request Overview

This PR removes the early branch optimization pass (fgOptimizeBranch) to prevent the formation of unnatural loop shapes before loop recognition.

  • Removed early branch optimization loop from optOptimizeFlow.
  • Aimed at enabling a more aggressive and correctly placed branch optimization in later phases.

@github-actions bot added the needs-area-label label on May 19, 2025
@amanasifkhalid added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and removed the needs-area-label label on May 19, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid (Member Author)

Some metric diffs from benchmarks.run_pgo on win-x64:

130 (66.67% of base) : 29998.dasm - System.SpanHelpers:LastIndexOfValueType[int,System.SpanHelpers+Negate`1[int]](byref,int,int):int (Tier1):

- LoopsFoundDuringOpts                      : 1
- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 21
- PerfScore                                 : 16.193149
+ LoopsFoundDuringOpts                      : 2
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 35
+ PerfScore                                 : 16.443149

RBO getting pessimized seemed to block a significant code size reduction, though it looks like it was mostly cold code removed considering the small PerfScore diff. I'll look for a smaller example to dig into.

234 (22.44% of base) : 127581.dasm - System.IO.File:WriteAllBytes(System.String,System.ReadOnlySpan`1[ubyte]) (Tier1):

- LoopsFoundDuringOpts                      : 7
- UnusedIVsRemoved                          : 1
- LoopsStrengthReduced                      : 0
- RedundantBranchesEliminated               : 13
- BasicBlocksAtCodegen                      : 65
+ LoopsFoundDuringOpts                      : 9
+ UnusedIVsRemoved                          : 3
+ LoopsStrengthReduced                      : 2
+ RedundantBranchesEliminated               : 11
+ BasicBlocksAtCodegen                      : 87

We recognize and optimize more loops, and do less RBO. Neither makes a difference PerfScore-wise. Maybe we should skip some of these loop opts if the loop is cold.

183 (57.91% of base) : 56190.dasm - System.SpanHelpers:NonPackedContainsValueType[short](byref,short,int):ubyte (Tier1):

- LoopsFoundDuringOpts                      : 3
- UnusedIVsRemoved                          : 0
- LoopsStrengthReduced                      : 0
- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 25
- PerfScore                                 : 46.213569
+ LoopsFoundDuringOpts                      : 4
+ UnusedIVsRemoved                          : 1
+ LoopsStrengthReduced                      : 1
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 36
+ PerfScore                                 : 46.213918

Same story as above. Now some size decreases:

-220 (-35.31% of base) : 144716.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.Metadata.PE.TupleTypeDecoder:DecodeTypeArguments(System.Collections.Immutable.ImmutableArray`1[Microsoft.CodeAnalysis.CSharp.Symbols.TypeWithAnnotations]):System.Collections.Immutable.ImmutableArray`1[Microsoft.CodeAnalysis.CSharp.Symbols.TypeWithAnnotations]:this (Tier1):

- LoopsCloned                               : 1
- RedundantBranchesEliminated               : 1
- JumpThreadingsPerformed                   : 1
- BasicBlocksAtCodegen                      : 18
+ LoopsCloned                               : 0
+ RedundantBranchesEliminated               : 0
+ JumpThreadingsPerformed                   : 0
+ BasicBlocksAtCodegen                      : 12

We no longer clone a loop, but the loop was cold anyway. I don't think we should be considering such loops to begin with:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1     31 [000..009)-> BB03(0),BB02(1)         ( cond )                   i IBC idxlen
BB02 [0001]  2       BB01,BB11             1     31 [009..00B)                           (return)                   i IBC
BB03 [0002]  1       BB01                  0      0 [00B..026)-> BB09(1)                 (always)                   i IBC rare hascall gcsafe idxlen
BB04 [0003]  1       BB09                  0      0 [026..039)-> BB07(0.2),BB05(0.8)     ( cond )                   i IBC rare hascall gcsafe idxlen bwd bwd-target
BB05 [0015]  1       BB04                  0      0 [038..039)-> BB07(0.48),BB06(0.52)   ( cond )                   i IBC rare bwd
BB06 [0016]  1       BB05                  0      0 [038..039)-> BB08(1)                 (always)                   i IBC rare bwd
BB07 [0017]  2       BB04,BB05             0      0 [038..039)-> BB08(1)                 (always)                   i IBC rare bwd
BB08 [0018]  2       BB06,BB07             0      0 [038..052)-> BB09(1)                 (always)                   i IBC rare hascall gcsafe idxlen bwd
BB09 [0004]  2       BB03,BB08             0      0 [052..056)-> BB04(0.9),BB10(0.1)     ( cond )                   i IBC rare bwd bwd-src
BB10 [0005]  1       BB09                  0      0 [056..059)-> BB12(0.48),BB11(0.52)   ( cond )                   i IBC rare
BB11 [0006]  1       BB10                  0      0 [059..061)-> BB02(1)                 (always)                   i IBC rare hascall gcsafe
BB12 [0007]  1       BB10                  0      0 [061..06E)                           (return)                   i IBC rare hascall gcsafe
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
...
Considering loop L00 to clone for optimizations.
Analyzing iteration for L00 with header BB09
  Preheader = BB03
  Checking exiting block BB09
    Could not extract an IV
  Could not find any IV
Loop cloning: rejecting loop L00. Could not analyze iteration.

-111 (-23.47% of base) : 57373.dasm - System.Reflection.RuntimePropertyInfo:GetIndexParametersSpan():System.ReadOnlySpan`1[System.Reflection.ParameterInfo]:this (Tier1)

- LoopsCloned                               : 1
- LoopsIVWidened                            : 0
- WidenedIVs                                : 0
- UnusedIVsRemoved                          : 0
- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 24
+ LoopsCloned                               : 0
+ LoopsIVWidened                            : 1
+ WidenedIVs                                : 1
+ UnusedIVsRemoved                          : 1
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 22

Same story as above; we no longer optimize a cold loop. I'll look for some diffs with PerfScore implications next.

@amanasifkhalid (Member Author)

benchmarks.run_pgo has few PerfScore diffs, and the few it has are not all that extreme. aspnet's metric diffs look similar:

342 (28.84% of base) : 56586.dasm - System.SpanHelpers:NonPackedIndexOfValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,int):int (Instrumented Tier1)

- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 50
- PerfScore                                 : 60.861364
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 68
+ PerfScore                                 : 62.030193

207 (110.11% of base) : 58050.dasm - System.SpanHelpers:LastIndexOfValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,int):int (Tier1)

- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 21
- PerfScore                                 : 32.740386
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 35
+ PerfScore                                 : 33.990386

That's quite a lot of code being left behind. However, if I tell graph-based loop inversion to invert cold loops, I can get RBO to kick in again:

- LoopsInverted                             : 1
+ LoopsInverted                             : 2
...
  RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 21
- PerfScore                                 : 32.740386
+ BasicBlocksAtCodegen                      : 22
+ PerfScore                                 : 26.326873

@amanasifkhalid (Member Author)

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs. Here's what I'm thinking of doing:

  • Get this checked in.
  • PR a change to disable loop unrolling/cloning for cold loops. I think we ought to keep cold code compact, and if we're going to start inverting more cold loops (see below), then we probably want to keep cloning/unrolling in check.
  • Tweak loop inversion to invert cold loops. It might seem like wasted effort, but it unblocks RBO in some instances, and I'd rather we consolidate any canonicalization abilities we're accidentally getting from fgOptimizeBranch into loop inversion.

Let me know what you think -- thanks!
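For readers less familiar with the terminology, loop inversion turns a while-shaped loop (exit test at the top, back-edge to the test) into a zero-trip guard plus a do-while body, so the exit test and the back-edge coincide at the bottom. A minimal C sketch of the two shapes (illustrative names, not JIT code):

```c
/* While-shaped: the header test runs before every iteration,
 * including the first, and the back-edge jumps up to it. */
int sum_while(const int *a, int n) {
    int i = 0, s = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* Inverted: an up-front guard skips the loop entirely when it would
 * run zero times; inside, the body is a do-while, so each iteration
 * ends with a single combined test-and-branch. */
int sum_inverted(const int *a, int n) {
    int i = 0, s = 0;
    if (i < n) {
        do {
            s += a[i];
            i++;
        } while (i < n);
    }
    return s;
}
```

The guard also gives downstream passes a block that dominates the loop body, which is what makes follow-on optimizations like RBO easier to apply.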

@AndyAyersMS (Member)

> cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs. Here's what I'm thinking of doing:
>
>   • Get this checked in.
>   • PR a change to disable loop unrolling/cloning for cold loops. I think we ought to keep cold code compact, and if we're going to start inverting more cold loops (see below), then we probably want to keep cloning/unrolling in check.
>   • Tweak loop inversion to invert cold loops. It might seem like wasted effort, but it unblocks RBO in some instances, and I'd rather we consolidate any canonicalization abilities we're accidentally getting from fgOptimizeBranch into loop inversion.
>
> Let me know what you think -- thanks!

Is this still the plan or are you reconsidering...?

@EgorBo (Member)

EgorBo commented May 20, 2025

I'm just curious - isn't this PR both size and PerfScore regression? e.g. win-x64 PerfScore:
[image: win-x64 PerfScore regression summary]

@amanasifkhalid (Member Author)

amanasifkhalid commented May 20, 2025

> Is this still the plan or are you reconsidering...?

That's still my plan.

> I'm just curious - isn't this PR both size and PerfScore regression?

Yes, this PR on its own is likely a net regression, but I think we'll need to do this to unblock graph-based loop inversion (I also think this shouldn't be run so early as a matter of principle). Right now, lexical loop inversion transforms some loops, and fgOptimizeBranch sort of inverts some others -- we don't see it in the metrics, but by no longer running fgOptimizeBranch early, we're essentially inverting fewer loops, which sometimes pessimizes RBO. I think that's where the bulk of PerfScore regressions are coming from.

I expect to address these regressions by getting graph-based loop inversion to handle any loops we were relying on fgOptimizeBranch to invert, but one of the challenges of turning that on is fgOptimizeBranch tends to create unnatural loops. So while we'd expect graph-based loop inversion to invert more loops than the lexical variant, I'm seeing plenty of diffs in that PR where we recognize fewer loops now that inversion runs after flow opts. Removing this early fgOptimizeBranch pass fixes that, but if you'd rather not take regressions in the meantime, I can try to get graph-based loop inversion checked in first; it'll just be harder to triage the diffs. Once that's on, I think we really ought to remove this.
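As background on why inversion unblocks it: redundant branch optimization folds a branch whose outcome is implied by a dominating branch. A hypothetical C example of the pattern (illustrative only):

```c
/* The outer test dominates the inner one. Along the taken path,
 * x > 0 implies x != 0, so RBO can fold the inner branch to
 * always-true and delete the now-unreachable path. Inversion helps
 * by creating a guard block that dominates the loop body in the
 * same way. */
int classify(int x) {
    if (x > 0) {
        if (x != 0)     /* redundant: implied by the dominating x > 0 */
            return 1;
        return 2;       /* unreachable once the branch is folded */
    }
    return 0;
}
```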

@amanasifkhalid (Member Author)

Also, assuming this PR gets approved, I'll wait until after the Preview 5 snap to merge it.

amanasifkhalid added a commit that referenced this pull request May 21, 2025
@amanasifkhalid (Member Author)

/ba-g stackoverflowtester failed

@amanasifkhalid amanasifkhalid merged commit 272948a into dotnet:main May 21, 2025
106 of 108 checks passed
@amanasifkhalid amanasifkhalid deleted the remove-early-branch-opt branch May 21, 2025 13:43
SimaTian pushed a commit that referenced this pull request May 27, 2025