Huge compile-time regression in beta/nightly #91128

Closed · bnjbvr opened this issue Nov 22, 2021 · 13 comments · Fixed by #92110
Labels
  • A-LLVM: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
  • I-compiletime: Problems and improvements with respect to compile times.
  • P-critical: Critical priority.
  • perf-regression: Performance regression.
  • regression-from-stable-to-beta: Performance or correctness regression from stable to beta.
  • T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.


bnjbvr commented Nov 22, 2021

We have a serious Rust compile-time regression on an internal (unfortunately closed-source) crate, on both the beta and nightly channels.

Before, compiling the crate took around 1 minute 30 seconds in automation. Now, on the nightly channel, it seems to take around an hour.

Running with RUSTFLAGS="-Z new-llvm-pass-manager=no" makes the regression disappear on the nightly channel, so the new pass manager seems to be at fault here.

I've tried compiling with a rustc build from #89830, but it didn't make the regression disappear, so it might not be the same inlining issue.

Thanks a bunch to @lqd, who suggested disabling the pass manager and trying the above patch. Pinging LLVM people: @nikic @Mark-Simulacrum. Of course, I'm happy to run any other diagnostic and try any tool that could help figure out what's going on here.

@jyn514 added the A-LLVM, perf-regression, and regression-from-stable-to-beta labels Nov 22, 2021
@rustbot added the I-prioritize (prioritization requested) label Nov 22, 2021
@jyn514 added the T-compiler label Nov 22, 2021

apiraino commented Nov 22, 2021

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

We assign P-critical to raise the flag for the team. Further investigation will probably need some way to reproduce the issue.

@rustbot label -I-prioritize +P-critical

@rustbot added the P-critical label and removed the I-prioritize label Nov 22, 2021

bnjbvr commented Nov 23, 2021

I've managed to reduce our use case to a small file that reproduces the issue; all the code is accessible at https://github.com/EmbarkStudios/rustc-compile-time-regress:

  • on stable, it takes around 26 seconds to compile on my machine with --release
  • on beta and nightly, it takes around 2 minutes to compile with --release

There are multiple variants of the Shim trait in our code base, in some cases with many more functions that use the same patterns as those shown in this example. I am almost certain that it's the compilation of all these traits and impls that somehow causes the very long compile times, hence the reduced test case.
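
For illustration, here is a hypothetical sketch of the general shape described above. The actual reduced test case is in the linked repository; apart from the Shim trait name, every identifier below is made up:

    // Hypothetical sketch only -- see the linked repository for the real
    // reduced test case. The pattern: a trait with many methods, whose
    // impls each build chains of iterator adapters.
    trait Shim {
        fn process(&self, input: &[u32]) -> Vec<u32>;
        // ...the real trait variants contain many more such methods...
    }

    struct MyShim;

    impl Shim for MyShim {
        fn process(&self, input: &[u32]) -> Vec<u32> {
            input
                .iter()
                .copied()
                .map(|x| x + 1)
                .map(|x| x * 2)
                .map(|x| x ^ 0xff)
                .collect()
        }
    }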

Note: I've also tried the build from #89830 again, and the compile time is still the same with it, so it's definitely not the same issue.


nikic commented Nov 24, 2021

Problematic IR for opt -O3: https://gist.github.com/nikic/cae7dbfe7da91308ba012b4b3172eebc

nagisa added a commit to nagisa/rust that referenced this issue Nov 24, 2021

nikic commented Nov 24, 2021

Top passes by time (-time-passes):

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  423.0093 ( 33.2%)   0.5118 ( 33.9%)  423.5211 ( 33.2%)  423.8182 ( 33.2%)  ModuleInlinerWrapperPass
  423.0046 ( 33.2%)   0.5112 ( 33.9%)  423.5159 ( 33.2%)  423.8129 ( 33.2%)  DevirtSCCRepeatedPass
  370.2776 ( 29.0%)   0.2598 ( 17.2%)  370.5374 ( 29.0%)  370.8072 ( 29.0%)  CorrelatedValuePropagationPass
   23.3253 (  1.8%)   0.0002 (  0.0%)   23.3255 (  1.8%)   23.3373 (  1.8%)  GVNPass
    6.0481 (  0.5%)   0.1486 (  9.9%)    6.1967 (  0.5%)    6.1992 (  0.5%)  InlinerPass
    6.1613 (  0.5%)   0.0029 (  0.2%)    6.1642 (  0.5%)    6.1672 (  0.5%)  InstCombinePass
    4.7685 (  0.4%)   0.0080 (  0.5%)    4.7765 (  0.4%)    4.7780 (  0.4%)  SROAPass
    3.0350 (  0.2%)   0.0041 (  0.3%)    3.0391 (  0.2%)    3.0403 (  0.2%)  ADCEPass

Most time is spent in CVP (CorrelatedValuePropagationPass). The real problem, though, is that the new pass manager produces a huge function: "_ZN102_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$9size_hint17haa64424fbc292ec1E" (which demangles to <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::size_hint) has 750k or so instructions.

Most likely it's another case of catastrophic inlining, though not the same as the other one (the patch mentioned above indeed doesn't fix this case).
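
For context on the shape involved: Map's size_hint simply delegates to the inner iterator's size_hint, so each adapter layer adds one more small function to the call tree. A minimal sketch (not the actual reproducer):

    // Each `map` layer wraps the previous iterator, and each layer's
    // `size_hint` just forwards to the one below it, so a deeply nested
    // chain hands the inliner a tall tree of tiny calls rooted at
    // `Map::size_hint` -- the function that ballooned here.
    fn nested_hint() -> (usize, Option<usize>) {
        (0..100u32)
            .map(|x| x + 1)
            .map(|x| x * 2)
            .map(|x| x ^ 1)
            // ...many more layers in the real code...
            .size_hint()
    }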


kdy1 commented Nov 29, 2021

This also affects swc; there's an issue for it (swc-project/swc#2872).

With the workaround:

===-------------------------------------------------------------------------===
  Total Execution Time: 14.6342 seconds (14.6144 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
   2.5938 ( 18.8%)   0.0213 (  2.5%)   2.6151 ( 17.9%)   2.6148 ( 17.9%)  17898317626  Function Integration/Inlining
   1.4954 ( 10.8%)   0.0053 (  0.6%)   1.5008 ( 10.3%)   1.5005 ( 10.3%)  17085737226  MemCpy Optimization
   0.8417 (  6.1%)   0.0065 (  0.8%)   0.8482 (  5.8%)   0.8477 (  5.8%)  6363534522  SROA #2
   0.6932 (  5.0%)   0.0054 (  0.6%)   0.6986 (  4.8%)   0.6982 (  4.8%)  4464950446  Combine redundant instructions #2
   0.5760 (  4.2%)   0.0057 (  0.7%)   0.5817 (  4.0%)   0.5811 (  4.0%)  4107118695  Global Value Numbering
   0.5173 (  3.7%)   0.0046 (  0.6%)   0.5219 (  3.6%)   0.5214 (  3.6%)  3496755636  Combine redundant instructions #4
   0.4593 (  3.3%)   0.0047 (  0.6%)   0.4640 (  3.2%)   0.4633 (  3.2%)  2996247269  Combine redundant instructions #3
   0.4480 (  3.2%)   0.0048 (  0.6%)   0.4528 (  3.1%)   0.4523 (  3.1%)  2818690217  Combine redundant instructions #5
   0.4087 (  3.0%)   0.0047 (  0.6%)   0.4134 (  2.8%)   0.4131 (  2.8%)  3581923261  Dead Store Elimination
   0.2232 (  1.6%)   0.0053 (  0.6%)   0.2284 (  1.6%)   0.2280 (  1.6%)  1827801629  Combine redundant instructions
   0.2236 (  1.6%)   0.0046 (  0.5%)   0.2281 (  1.6%)   0.2278 (  1.6%)  1707889189  Memory SSA
   0.1726 (  1.3%)   0.0530 (  6.3%)   0.2256 (  1.5%)   0.2256 (  1.5%)  864823098  Interprocedural Sparse Conditional Constant Propagation
   0.2089 (  1.5%)   0.0048 (  0.6%)   0.2137 (  1.5%)   0.2132 (  1.5%)  1552468131  Early CSE w/ MemorySSA
   0.0974 (  0.7%)   0.1058 ( 12.6%)   0.2033 (  1.4%)   0.2033 (  1.4%)  1119829288  Global Variable Optimizer
   0.1913 (  1.4%)   0.0047 (  0.6%)   0.1960 (  1.3%)   0.1956 (  1.3%)  1338609401  Jump Threading

Without the workaround:

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  ---Instr---  --- Name ---
  10.0674 ( 30.6%)   2.4567 ( 30.9%)  12.5241 ( 30.6%)  12.5284 ( 30.6%)  80168985859  ModuleInlinerWrapperPass
   9.9170 ( 30.1%)   2.3687 ( 29.7%)  12.2857 ( 30.0%)  12.2897 ( 30.1%)  78943942431  DevirtSCCRepeatedPass
   2.7391 (  8.3%)   0.5879 (  7.4%)   3.3269 (  8.1%)   3.3273 (  8.1%)  19748806913  InlinerPass
   2.0915 (  6.3%)   0.3909 (  4.9%)   2.4824 (  6.1%)   2.4811 (  6.1%)  15760601093  InstCombinePass
   0.6599 (  2.0%)   0.4267 (  5.4%)   1.0867 (  2.7%)   1.0869 (  2.7%)  6916625691  BlockFrequencyAnalysis
   0.8143 (  2.5%)   0.0207 (  0.3%)   0.8350 (  2.0%)   0.8351 (  2.0%)  8673702465  MemCpyOptPass
   0.5102 (  1.5%)   0.3107 (  3.9%)   0.8208 (  2.0%)   0.8205 (  2.0%)  5318224460  BranchProbabilityAnalysis
   0.6374 (  1.9%)   0.0592 (  0.7%)   0.6966 (  1.7%)   0.6956 (  1.7%)  5198860878  SROA
   0.5529 (  1.7%)   0.0574 (  0.7%)   0.6103 (  1.5%)   0.6099 (  1.5%)  4223120654  GVN
   0.4074 (  1.2%)   0.0916 (  1.1%)   0.4990 (  1.2%)   0.4980 (  1.2%)  3360041371  EarlyCSEPass
   0.1873 (  0.6%)   0.2731 (  3.4%)   0.4605 (  1.1%)   0.4604 (  1.1%)  3700728437  AAManager

kdy1 added a commit to kdy1/swc that referenced this issue Nov 29, 2021
@sanxiyn added the I-compiletime label Nov 30, 2021

pnkfelix commented Dec 2, 2021

As a workaround, we have disabled NewPM by default in the 1.57 release.

Can someone confirm whether this compile-time regression is indeed resolved on 1.57?


bnjbvr commented Dec 2, 2021

FWIW, I tested with the pre-release and the problem had disappeared. I don't know if much changed between the pre-release and the final release, though. Sorry, I forgot to confirm here when the pre-release happened.

(Hi Felix!)


apiraino commented Dec 2, 2021

Discussed in the T-compiler meeting on Zulip; linking this to the progress of #91190.


nikic commented Dec 8, 2021

A slightly cleaned up IR reproducer (still large): https://gist.github.com/nikic/d66abc8901a21594d0798d169eb9d725

I've investigated this in a bit more detail. First, a note on the general design of the inliner, which is not very robust. LLVM uses a bottom-up inliner: inlining starts at leaf functions, and a local inlining decision is made for each call. After the leaves have been inlined, the function grows, and if it gets large enough it will not be inlined into its own callers due to cost modelling. This is what generally prevents runaway inlining.

However, this does require that the callee is already "maximally inlined": if we have X calls Y calls Z and decide not to inline Z into Y, but then, after inlining Y into X, decide that we now also want to inline Z into X, we can get into these exponential inlining situations (see the sketch below).

An obvious case where the callee is not maximally inlined is with SCCs (presumably the reason why the inliner attempts to inline already-inlined callees at all), where there is no "bottom" to start from. The inliner handles this with a combination of inlining history and doing a round of inlining over all functions in an SCC before reprocessing newly inlined calls, so that function size increases in a balanced way across the SCC.
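
A hedged Rust-flavored sketch of that X/Y/Z scenario (the real decisions happen on LLVM IR, and the "large leaf" here is invented):

    // Illustrative only. Bottom-up, the inliner first considers z-into-y;
    // suppose the cost model declines, so y is "maximally inlined" and its
    // resulting size should later keep things out of x. The failure mode:
    // y *is* inlined into x, the z calls are re-evaluated in x's context
    // and inlined after all, leaving x with four copies of z's body -- and
    // this compounds at every additional level of such a tree.
    fn z() -> u64 { (0..1_000u64).sum() } // stand-in for a large leaf
    fn y() -> u64 { z() + z() }
    fn x() -> u64 { y() + y() }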

In this particular case, we run into a different manifestation of this issue: Z is not inlined into Y because the call to Z is considered a cold callsite based on BFI (block frequency) information. But after inlining Y into X, the callsite of Z becomes non-cold. This obviously doesn't make sense and is the result of precision loss. I'm not completely clear on why things degenerate as much as they do, though.

Fixing the BFI issue (possibly through some rescaling during inlining) should address this particular issue, but it's rather concerning how fragile the inliner is.


nikic commented Dec 9, 2021

Okay, here's the part that I was missing before (the BFI issue compounds on this, but this is really the root problem): the inlining advisor has a concept of "deferred inlining", where it decides not to inline a call into the caller because it determines that it would be more profitable to inline the caller into its own callers instead. Specifically, this can happen if the caller is an internal function, which means there is a very large cost bonus if it can be removed entirely. So if we can show that, by not inlining into this caller, the caller will in turn be inlined into its callers and then dropped entirely, we may defer.

Now, this explicitly goes against the bottom-up inlining approach and causes predictable issues: if you have a tree of calls for which inlining gets deferred, then on reaching the top-level function (where inlining is no longer deferred), we can end up inlining the whole tree, because each individual inlining decision will now look profitable.

For example, if we run https://gist.github.com/nikic/ec18c736e69d29614ff268e2977fc491 through opt -inline, we'll end up flattening the whole call tree and inlining some 2000 calls to @foo().
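
A hedged Rust analogue of the gist's shape (the gist itself is LLVM IR; the fan-out and names here are invented):

    // A tree of private (internal) functions. At each level the advisor can
    // defer: "don't inline here -- the caller is internal, will itself be
    // inlined upward, and can then be deleted for a large cost bonus".
    // Nothing gets inlined until the public root, where every individual
    // decision suddenly looks profitable, so the whole tree is flattened
    // into `root`.
    fn foo() { std::hint::black_box(0u32); } // the leaf everything fans out to
    fn l1() { foo(); foo(); }
    fn l2() { l1(); l1(); }
    fn l3() { l2(); l2(); }
    pub fn root() { l3(); l3(); } // 2^4 = 16 calls to foo() once flattened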


pnkfelix commented Dec 9, 2021

@rustbot assign @nikic


nikic commented Dec 9, 2021

I ran an experiment to disable deferred inlining entirely (#91703), with results at https://perf.rust-lang.org/compare.html?start=3b263ceb5cb89b6d53b5a03b47ec447c3a7f7765&end=07378cd9e76010238f64ea03d1219774eb60510d. The result looks promising as far as rustc is concerned: compile times drop, and there is not much impact on check/incr-unchanged builds (our proxy for run-time). The worst regression there is "token-stream-stress check" at 1.3%; otherwise it's mostly mildly positive.

Unfortunately, it doesn't seem like this can be disabled without patching LLVM. At this point I'm pretty convinced that deferred inlining is broken on a fundamental level and should be removed entirely, with any regressions arising from that addressed in some other way. However, I suspect that upstream is going to disagree with that :)

For an extra fun test case, https://gist.github.com/nikic/1262b5f7d27278e1b34a190ae10947f5 is less than 100 lines of IR, and produces about 500000 lines of IR when run through opt -inline :)


nikic commented Dec 14, 2021

Upstream review is https://reviews.llvm.org/D115497.
