JIT: Remove early fgOptimizeBranch pass #115734
Conversation
Pull Request Overview
This PR removes the early branch optimization pass (fgOptimizeBranch) to prevent the formation of unnatural loop shapes before loop recognition.
- Removed the early branch optimization loop from optOptimizeFlow.
- Aims to enable more aggressive, better-placed branch optimization in later phases.
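To make "unnatural loop shape" concrete: if a duplicated condition creates a second entry into a loop's cycle, no single block dominates the whole cycle, so loop recognition can't treat it as a natural loop. A hypothetical C# sketch of such flow (goto is used only to make the shape explicit; this is an illustration, not code from the diffs):

```csharp
static int MultiEntryLoop(int x)
{
    // Two entries into the Test <-> Body cycle: falling into Test from
    // the method entry, or jumping straight to Body via the guard below.
    if (x < 0)
        goto Body;
Test:
    if (x >= 10)
        goto Done;
Body:
    x++;
    goto Test; // back edge: Body -> Test
Done:
    return x;
}
```

Because Body is reachable without passing through Test, the back edge's target doesn't dominate its source, so the cycle isn't a natural loop and downstream loop opts won't model it.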
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Some metric diffs:
- LoopsFoundDuringOpts : 1
- RedundantBranchesEliminated : 1
- BasicBlocksAtCodegen : 21
- PerfScore : 16.193149
+ LoopsFoundDuringOpts : 2
+ RedundantBranchesEliminated : 0
+ BasicBlocksAtCodegen : 35
+ PerfScore : 16.443149

RBO getting pessimized seemed to block a significant code size reduction, though it looks like it was mostly cold code removed considering the small PerfScore diff. I'll look for a smaller example to dig into.
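For reference, RBO here is the JIT's redundant branch optimization: it folds a conditional branch whose outcome is implied by a dominating check. A minimal hypothetical example, not taken from these diffs:

```csharp
static int Classify(int x)
{
    if (x > 0)
    {
        // x > 0 is established by the dominating check above, so RBO can
        // fold this branch to 'true' and delete the dead path.
        if (x > 0)
            return 1;
        return -1; // unreachable once the inner branch is folded
    }
    return 0;
}
```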
- LoopsFoundDuringOpts : 7
- UnusedIVsRemoved : 1
- LoopsStrengthReduced : 0
- RedundantBranchesEliminated : 13
- BasicBlocksAtCodegen : 65
+ LoopsFoundDuringOpts : 9
+ UnusedIVsRemoved : 3
+ LoopsStrengthReduced : 2
+ RedundantBranchesEliminated : 11
+ BasicBlocksAtCodegen : 87

We recognize and optimize more loops, and do less RBO. Neither makes a difference PerfScore-wise. Maybe we should skip some of these loop opts if the loop is cold.
- LoopsFoundDuringOpts : 3
- UnusedIVsRemoved : 0
- LoopsStrengthReduced : 0
- RedundantBranchesEliminated : 1
- BasicBlocksAtCodegen : 25
- PerfScore : 46.213569
+ LoopsFoundDuringOpts : 4
+ UnusedIVsRemoved : 1
+ LoopsStrengthReduced : 1
+ RedundantBranchesEliminated : 0
+ BasicBlocksAtCodegen : 36
+ PerfScore : 46.213918

Same story as above. Now, some size decreases:
- LoopsCloned : 1
- RedundantBranchesEliminated : 1
- JumpThreadingsPerformed : 1
- BasicBlocksAtCodegen : 18
+ LoopsCloned : 0
+ RedundantBranchesEliminated : 0
+ JumpThreadingsPerformed : 0
+ BasicBlocksAtCodegen : 12

We no longer clone a loop, but the loop was cold anyway. I don't think we should be considering such loops to begin with:
- LoopsCloned : 1
- LoopsIVWidened : 0
- WidenedIVs : 0
- UnusedIVsRemoved : 0
- RedundantBranchesEliminated : 1
- BasicBlocksAtCodegen : 24
+ LoopsCloned : 0
+ LoopsIVWidened : 1
+ WidenedIVs : 1
+ UnusedIVsRemoved : 1
+ RedundantBranchesEliminated : 0
+ BasicBlocksAtCodegen : 22

Same story as above; we no longer optimize a cold loop. I'll look for some diffs with PerfScore implications next.
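For reference on the IV counters above: IV widening promotes a 32-bit induction variable to a 64-bit register on 64-bit targets so address computations don't re-extend it each iteration, and unused-IV removal then deletes the narrow copy if nothing still reads it. A hypothetical loop of the kind these opts target, not from these diffs:

```csharp
static long Total(int[] a)
{
    long total = 0;
    // 'i' is a 32-bit IV; widening lets the JIT keep a 64-bit copy so the
    // a[i] address computation needs no sign extension per iteration.
    for (int i = 0; i < a.Length; i++)
        total += a[i];
    return total;
}
```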
- RedundantBranchesEliminated : 1
- BasicBlocksAtCodegen : 50
- PerfScore : 60.861364
+ RedundantBranchesEliminated : 0
+ BasicBlocksAtCodegen : 68
+ PerfScore : 62.030193
- RedundantBranchesEliminated : 1
- BasicBlocksAtCodegen : 21
- PerfScore : 32.740386
+ RedundantBranchesEliminated : 0
+ BasicBlocksAtCodegen : 35
+ PerfScore : 33.990386

That's quite a lot of code being left behind. However, if I tell graph-based loop inversion to invert cold loops, I can get RBO to kick in again:

- LoopsInverted : 1
+ LoopsInverted : 2
...
RedundantBranchesEliminated : 1
- BasicBlocksAtCodegen : 21
- PerfScore : 32.740386
+ BasicBlocksAtCodegen : 22
+ PerfScore : 26.326873
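A source-level analogy of what loop inversion buys (hypothetical sketch, not from this PR): rotating a top-test loop into a guarded bottom-test loop leaves a single conditional branch per iteration, and the dominating guard gives RBO something to work with.

```csharp
// 'Cond' and 'Body' are placeholder methods for illustration.
static bool Cond(int x) => x < 100;
static int Body(int x) => x * 2 + 1;

// Before inversion: the test runs at the top, and each iteration also
// takes an unconditional jump back up to it.
static int BeforeInversion(int x)
{
    while (Cond(x))
        x = Body(x);
    return x;
}

// After inversion: a guarded do-while. Only the back-edge test runs per
// iteration, and because the guard dominates the body, RBO can fold
// checks inside the loop that the guard already proves.
static int AfterInversion(int x)
{
    if (Cond(x))
    {
        do
        {
            x = Body(x);
        } while (Cond(x));
    }
    return x;
}
```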
cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs. Here's what I'm thinking of doing:

Let me know what you think -- thanks!
Is this still the plan, or are you reconsidering...?
That's still my plan. Yes, this PR on its own is likely a net regression, but I think we'll need to do this to unblock graph-based loop inversion (I also think this pass shouldn't be run so early, as a matter of principle). Right now, lexical loop inversion transforms some loops, and I expect to address these regressions by getting graph-based loop inversion to handle any loops we were relying on lexical inversion to transform.
Also, assuming this PR gets approved, I'll wait until after the Preview 5 snap to merge it.
/ba-g stackoverflowtester failed
Part of #107749.

fgOptimizeBranch looks for cases where a block jumps to a conditional block, which jumps back to the lexical successor of the first block. When it is profitable, fgOptimizeBranch clones the condition into the first block to avoid the jump-check-jump pattern. While probably more pattern-matchy in its analysis than ideal, this can help block layout align loops properly.

However, I see little reason to run this transformation early, especially before loop recognition; from the diffs, fgOptimizeBranch seems to create unnatural loop shapes with some frequency, pessimizing downstream loop opts. Removing the early pass means we can also make the late pass more aggressive without splitting off the implementation.
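As a rough source-level illustration (hypothetical example; the pass operates on flowgraph blocks, not C# source): a simple while loop typically produces exactly this shape, with the entry jumping over the body to the bottom test, and the test jumping back to the body, the entry's lexical successor.

```csharp
static int CountBits(uint n)
{
    int count = 0;
    // Flowgraph sketch of the shape fgOptimizeBranch targets:
    //   BB1 (entry): count = 0;                       goto BB3
    //   BB2 (body):  count += (int)(n & 1); n >>= 1;  // BB1's lexical successor
    //   BB3 (test):  if (n != 0) goto BB2
    // Cloning the test from BB3 into BB1 lets the entry fall through into
    // the body when the loop runs, removing the jump-check-jump round trip.
    while (n != 0)
    {
        count += (int)(n & 1);
        n >>= 1;
    }
    return count;
}
```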