JIT: Remove early fgOptimizeBranch pass #115734


Merged
merged 2 commits into dotnet:main on May 21, 2025

Conversation

amanasifkhalid (Member)

Part of #107749. fgOptimizeBranch looks for cases where a block jumps to a conditional block, which jumps back to the lexical successor of the first block. When it is profitable, fgOptimizeBranch clones the condition into the first block to avoid the jump-check-jump pattern. While probably more pattern-matchy in its analysis than ideal, this can help block layout align loops properly.

However, I see little reason to run this transformation early, especially before loop recognition; from the diffs, fgOptimizeBranch seems to create unnatural loop shapes with some frequency, pessimizing downstream loop opts. Removing the early pass means we can also make the late pass more aggressive without splitting off the implementation.
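The jump-check-jump shape and the condition-cloning transformation can be sketched in C. This is a hypothetical illustration of the flow shape only, not JIT code; the function and label names are made up:

```c
/* Sketch of the pattern fgOptimizeBranch targets: block B1 ends in an
 * unconditional jump to a conditional block B2, and B2 jumps back to
 * B1's lexical successor. */
int sum_jump_check_jump(const int *a, int n) {
    int i = 0, sum = 0;
    goto test;              /* B1: unconditional jump over the body... */
body:                       /* ...which is B1's lexical successor */
    sum += a[i];
    i++;
test:                       /* B2: the conditional block */
    if (i < n) goto body;   /* jumps back to B1's successor */
    return sum;
}

/* After cloning the condition into B1, the entry tests the condition
 * itself and falls through into the body; the loop is now do-while
 * shaped and the extra jump on the hot path is gone. */
int sum_condition_cloned(const int *a, int n) {
    int i = 0, sum = 0;
    if (i >= n) goto done;  /* cloned copy of the test */
body:
    sum += a[i];
    i++;
    if (i < n) goto body;   /* original test, now the back-edge */
done:
    return sum;
}
```

Both functions compute the same result; only the branch structure differs.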

@Copilot Copilot AI review requested due to automatic review settings May 19, 2025 17:59
@Copilot Copilot AI left a comment


Pull Request Overview

This PR removes the early branch optimization pass (fgOptimizeBranch) to prevent the formation of unnatural loop shapes before loop recognition.

  • Removed early branch optimization loop from optOptimizeFlow.
  • Aimed at enabling a more aggressive and correctly placed branch optimization in later phases.

@github-actions bot added the needs-area-label label on May 19, 2025
@amanasifkhalid added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) and removed the needs-area-label label on May 19, 2025

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid (Member Author)

Some metric diffs from benchmarks.run_pgo on win-x64:

130 (66.67% of base) : 29998.dasm - System.SpanHelpers:LastIndexOfValueType[int,System.SpanHelpers+Negate`1[int]](byref,int,int):int (Tier1):

- LoopsFoundDuringOpts                      : 1
- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 21
- PerfScore                                 : 16.193149
+ LoopsFoundDuringOpts                      : 2
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 35
+ PerfScore                                 : 16.443149

RBO getting pessimized seemed to block a significant code size reduction, though it looks like it was mostly cold code removed considering the small PerfScore diff. I'll look for a smaller example to dig into.

234 (22.44% of base) : 127581.dasm - System.IO.File:WriteAllBytes(System.String,System.ReadOnlySpan`1[ubyte]) (Tier1):

- LoopsFoundDuringOpts                      : 7
- UnusedIVsRemoved                          : 1
- LoopsStrengthReduced                      : 0
- RedundantBranchesEliminated               : 13
- BasicBlocksAtCodegen                      : 65
+ LoopsFoundDuringOpts                      : 9
+ UnusedIVsRemoved                          : 3
+ LoopsStrengthReduced                      : 2
+ RedundantBranchesEliminated               : 11
+ BasicBlocksAtCodegen                      : 87

We recognize and optimize more loops, and do less RBO. Neither makes a difference PerfScore-wise. Maybe we should skip some of these loop opts if the loop is cold.

183 (57.91% of base) : 56190.dasm - System.SpanHelpers:NonPackedContainsValueType[short](byref,short,int):ubyte (Tier1):

- LoopsFoundDuringOpts                      : 3
- UnusedIVsRemoved                          : 0
- LoopsStrengthReduced                      : 0
- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 25
- PerfScore                                 : 46.213569
+ LoopsFoundDuringOpts                      : 4
+ UnusedIVsRemoved                          : 1
+ LoopsStrengthReduced                      : 1
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 36
+ PerfScore                                 : 46.213918

Same story as above. Now some size decreases:

-220 (-35.31% of base) : 144716.dasm - Microsoft.CodeAnalysis.CSharp.Symbols.Metadata.PE.TupleTypeDecoder:DecodeTypeArguments(System.Collections.Immutable.ImmutableArray`1[Microsoft.CodeAnalysis.CSharp.Symbols.TypeWithAnnotations]):System.Collections.Immutable.ImmutableArray`1[Microsoft.CodeAnalysis.CSharp.Symbols.TypeWithAnnotations]:this (Tier1):

- LoopsCloned                               : 1
- RedundantBranchesEliminated               : 1
- JumpThreadingsPerformed                   : 1
- BasicBlocksAtCodegen                      : 18
+ LoopsCloned                               : 0
+ RedundantBranchesEliminated               : 0
+ JumpThreadingsPerformed                   : 0
+ BasicBlocksAtCodegen                      : 12

We no longer clone a loop, but the loop was cold anyway. I don't think we should be considering such loops to begin with:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1     31 [000..009)-> BB03(0),BB02(1)         ( cond )                   i IBC idxlen
BB02 [0001]  2       BB01,BB11             1     31 [009..00B)                           (return)                   i IBC
BB03 [0002]  1       BB01                  0      0 [00B..026)-> BB09(1)                 (always)                   i IBC rare hascall gcsafe idxlen
BB04 [0003]  1       BB09                  0      0 [026..039)-> BB07(0.2),BB05(0.8)     ( cond )                   i IBC rare hascall gcsafe idxlen bwd bwd-target
BB05 [0015]  1       BB04                  0      0 [038..039)-> BB07(0.48),BB06(0.52)   ( cond )                   i IBC rare bwd
BB06 [0016]  1       BB05                  0      0 [038..039)-> BB08(1)                 (always)                   i IBC rare bwd
BB07 [0017]  2       BB04,BB05             0      0 [038..039)-> BB08(1)                 (always)                   i IBC rare bwd
BB08 [0018]  2       BB06,BB07             0      0 [038..052)-> BB09(1)                 (always)                   i IBC rare hascall gcsafe idxlen bwd
BB09 [0004]  2       BB03,BB08             0      0 [052..056)-> BB04(0.9),BB10(0.1)     ( cond )                   i IBC rare bwd bwd-src
BB10 [0005]  1       BB09                  0      0 [056..059)-> BB12(0.48),BB11(0.52)   ( cond )                   i IBC rare
BB11 [0006]  1       BB10                  0      0 [059..061)-> BB02(1)                 (always)                   i IBC rare hascall gcsafe
BB12 [0007]  1       BB10                  0      0 [061..06E)                           (return)                   i IBC rare hascall gcsafe
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
...
Considering loop L00 to clone for optimizations.
Analyzing iteration for L00 with header BB09
  Preheader = BB03
  Checking exiting block BB09
    Could not extract an IV
  Could not find any IV
Loop cloning: rejecting loop L00. Could not analyze iteration.

-111 (-23.47% of base) : 57373.dasm - System.Reflection.RuntimePropertyInfo:GetIndexParametersSpan():System.ReadOnlySpan`1[System.Reflection.ParameterInfo]:this (Tier1)

- LoopsCloned                               : 1
- LoopsIVWidened                            : 0
- WidenedIVs                                : 0
- UnusedIVsRemoved                          : 0
- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 24
+ LoopsCloned                               : 0
+ LoopsIVWidened                            : 1
+ WidenedIVs                                : 1
+ UnusedIVsRemoved                          : 1
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 22

Same story as above; we no longer optimize a cold loop. I'll look for some diffs with PerfScore implications next.

@amanasifkhalid (Member Author)

benchmarks.run_pgo has few PerfScore diffs, and the few it has are not all that extreme. aspnet's metric diffs look similar:

342 (28.84% of base) : 56586.dasm - System.SpanHelpers:NonPackedIndexOfValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,int):int (Instrumented Tier1)

- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 50
- PerfScore                                 : 60.861364
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 68
+ PerfScore                                 : 62.030193

207 (110.11% of base) : 58050.dasm - System.SpanHelpers:LastIndexOfValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,int):int (Tier1)

- RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 21
- PerfScore                                 : 32.740386
+ RedundantBranchesEliminated               : 0
+ BasicBlocksAtCodegen                      : 35
+ PerfScore                                 : 33.990386

That's quite a lot of code being left behind. However, if I tell graph-based loop inversion to invert cold loops, I can get RBO to kick in again:

- LoopsInverted                             : 1
+ LoopsInverted                             : 2
...
  RedundantBranchesEliminated               : 1
- BasicBlocksAtCodegen                      : 21
- PerfScore                                 : 32.740386
+ BasicBlocksAtCodegen                      : 22
+ PerfScore                                 : 26.326873

@amanasifkhalid (Member Author)

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs. Here's what I'm thinking of doing:

  • Get this checked in.
  • PR a change to disable loop unrolling/cloning for cold loops. I think we ought to keep cold code compact, and if we're going to start inverting more cold loops (see below), then we probably want to keep cloning/unrolling in check.
  • Tweak loop inversion to invert cold loops. It might seem like wasted effort, but it unblocks RBO in some instances, and I'd rather we consolidate any canonicalization abilities we're accidentally getting from fgOptimizeBranch into loop inversion.

Let me know what you think -- thanks!
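For readers less familiar with the terminology, loop inversion turns a while-shaped loop (exit test at the top, back-edge to the test) into a zero-trip guard plus a do-while body, so the exit test and the back-edge coincide at the bottom. A minimal C sketch of the two shapes (illustrative names, not JIT code):

```c
/* While-shaped: the header test runs before every iteration,
 * including the first, and the back-edge jumps up to it. */
int sum_while(const int *a, int n) {
    int i = 0, s = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* Inverted: an up-front guard skips the loop entirely when it would
 * run zero times; inside, the body is a do-while, so each iteration
 * ends with a single combined test-and-branch. */
int sum_inverted(const int *a, int n) {
    int i = 0, s = 0;
    if (i < n) {
        do {
            s += a[i];
            i++;
        } while (i < n);
    }
    return s;
}
```

The guard also gives downstream passes a block that dominates the loop body, which is what makes follow-on optimizations like RBO easier to apply.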

@AndyAyersMS (Member)

> cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs. Here's what I'm thinking of doing:
>
>   • Get this checked in.
>   • PR a change to disable loop unrolling/cloning for cold loops. I think we ought to keep cold code compact, and if we're going to start inverting more cold loops (see below), then we probably want to keep cloning/unrolling in check.
>   • Tweak loop inversion to invert cold loops. It might seem like wasted effort, but it unblocks RBO in some instances, and I'd rather we consolidate any canonicalization abilities we're accidentally getting from fgOptimizeBranch into loop inversion.
>
> Let me know what you think -- thanks!

Is this still the plan or are you reconsidering...?

@EgorBo (Member)

EgorBo commented May 20, 2025

I'm just curious - isn't this PR both size and PerfScore regression? e.g. win-x64 PerfScore:
[image: win-x64 PerfScore regression summary]

@amanasifkhalid (Member Author)

amanasifkhalid commented May 20, 2025

> Is this still the plan or are you reconsidering...?

That's still my plan.

> I'm just curious - isn't this PR both size and PerfScore regression?

Yes, this PR on its own is likely a net regression, but I think we'll need to do this to unblock graph-based loop inversion (I also think this shouldn't be run so early as a matter of principle). Right now, lexical loop inversion transforms some loops, and fgOptimizeBranch sort of inverts some others -- we don't see it in the metrics, but by no longer running fgOptimizeBranch early, we're essentially inverting fewer loops, which sometimes pessimizes RBO. I think that's where the bulk of PerfScore regressions are coming from.

I expect to address these regressions by getting graph-based loop inversion to handle any loops we were relying on fgOptimizeBranch to invert, but one of the challenges of turning that on is fgOptimizeBranch tends to create unnatural loops. So while we'd expect graph-based loop inversion to invert more loops than the lexical variant, I'm seeing plenty of diffs in that PR where we recognize fewer loops now that inversion runs after flow opts. Removing this early fgOptimizeBranch pass fixes that, but if you'd rather not take regressions in the meantime, I can try to get graph-based loop inversion checked in first; it'll just be harder to triage the diffs. Once that's on, I think we really ought to remove this.
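As background on why inversion unblocks it: redundant branch optimization folds a branch whose outcome is implied by a dominating branch. A hypothetical C example of the pattern (illustrative only):

```c
/* The outer test dominates the inner one. Along the taken path,
 * x > 0 implies x != 0, so RBO can fold the inner branch to
 * always-true and delete the now-unreachable path. Inversion helps
 * by creating a guard block that dominates the loop body in the
 * same way. */
int classify(int x) {
    if (x > 0) {
        if (x != 0)     /* redundant: implied by the dominating x > 0 */
            return 1;
        return 2;       /* unreachable once the branch is folded */
    }
    return 0;
}
```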

@amanasifkhalid (Member Author)

Also, assuming this PR gets approved, I'll wait until after the Preview 5 snap to merge it.

amanasifkhalid added a commit that referenced this pull request May 21, 2025
@amanasifkhalid (Member Author)

/ba-g stackoverflowtester failed

@amanasifkhalid amanasifkhalid merged commit 272948a into dotnet:main May 21, 2025
106 of 108 checks passed
@amanasifkhalid amanasifkhalid deleted the remove-early-branch-opt branch May 21, 2025 13:43
SimaTian pushed a commit that referenced this pull request May 27, 2025