Tracking issue for enabling multiple CGUs in release mode by default #45320
Comments
alexcrichton added the A-codegen and C-tracking-issue labels on Oct 16, 2017
alexcrichton referenced this issue on Oct 16, 2017: Replace the link step of codegen-units with ThinLTO #35996 (Closed)
|
Presumably this will remain a stable compiler flag, but what else will change by default? Is the plan to make this only affect |
|
I would specifically propose that any opt level greater than 1 uses 16 codegen units and ThinLTO enabled by default for those 16 codegen units. |
alexcrichton referenced this issue on Oct 17, 2017: ThinLTO + O2 build time regression in rust-doom #45188 (Closed)
|
This is very interesting. One would think that ThinLTO-driven inlining should always have to do less work than our much more conservative pre-LLVM inlining. But that doesn't always seem to be the case, as in the rust-doom crate. Most tests on the irlo thread seem to profit though. Overall it's still not a clear picture to me. |
|
@michaelwoerister was that comment meant for a different thread? I forget which one as well though, so I'll respond here! As a "quick benchmark" I compiled the regex test suite with 16 CGUs + ThinLTO and then toggled inlining in all CGUs on/off. Surprisingly inlining in all CGUs was 5s faster to compile, and additionally did better on tons of benchmarks. In those timings |
|
@alexcrichton Yeah, the comment was in response to #45188 (rust-doom) but since that was closed already, I put it here. Regarding the compilation time difference between pre- and post-trans inlining, my hypothesis would be that sometimes pre-trans inlining will lead to more code being eliminated early on (as in the case of regex apparently) and sometimes it will have the opposite effect. I suspect that there's room for improvement by tuning which LLVM passes we run specifically for ThinLTO. Or do we do that already? |
|
Heh it's true yeah, I'd imagine that there's always room for improvement in pass tuning in Rust :). Right now we perform (afaik) 0 customization of any pass manager in LLVM. All of the normal optimization passes, LTO optimization passes, and ThinLTO optimization passes are all the same as what's in LLVM itself. |
|
I wanted to also take the time to tabulate all the results from the call to action thread to make sure it's all visible in one place. Note that all the timings below compare a release build to a release build with 16 CGUs and ThinLTO enabled.
Improvements
All of the following compile times improved, sorted from most improved to least improved
Regressions
The following crates regressed in compile times, sorted from smallest regression to largest
alexcrichton's attempt to reproduce the regressions
Here I attempt to reproduce the regressions on my own machine with
Unfortunately the only regression I was able to reproduce was the rust-belt regression. I'll be looking more into that. |
|
Looking into Almost all of the benefit of multiple codegen units is spreading out the work across all CPUs. Enabling ThinLTO to avoid losing any perf is fundamentally doing more work than what's already happening at What's happening here is that we have one huge CGU, but when we split it up we still have one huge CGU. This means that the CGU which may take ~1% less time in LLVM only gives us a tiny sliver of a window to run ThinLTO passes. In the case of the So generalizing even further, I think that enabling ThinLTO and multiple CGUs by default is going to regress compile time performance in any crate where our partitioning implementation doesn't actually partition very well. In the case of I would be willing to conclude, however, that such a situation is likely quite rare. Almost all other crates in the ecosystem will benefit from the partitioning which should basically evenly split up a crate. @michaelwoerister maybe you have thoughts on this though? |
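To make the "one huge CGU" argument concrete, here is a minimal sketch of the standard lower bound on parallel wall time: no schedule can finish before the largest single work item is done, nor faster than the total work split evenly across workers. The function name and the cost numbers are hypothetical illustrations, not measurements from this thread and not anything rustc computes.

```rust
/// Lower bound on parallel wall time for independent work items (CGUs here).
/// `costs` are abstract cost units per CGU; `workers` is the number of threads.
/// Hypothetical helper for illustration only; not part of rustc.
fn wall_time_lower_bound(costs: &[u64], workers: usize) -> u64 {
    assert!(workers > 0, "need at least one worker");
    let total: u64 = costs.iter().sum();
    let largest = costs.iter().copied().max().unwrap_or(0);
    // Ceiling division: the best case if work could be split perfectly evenly.
    let even_share = (total + workers as u64 - 1) / workers as u64;
    // The slowest single item also bounds the wall time from below.
    largest.max(even_share)
}

fn main() {
    // Assumed cost units, chosen only to show the effect of imbalance.
    let balanced = [100u64; 16]; // 16 similar CGUs, 1600 units in total
    let mut skewed = vec![10u64; 15]; // 15 tiny CGUs...
    skewed.push(1450); // ...plus one huge CGU, also 1600 units in total

    println!("balanced, 8 workers: {}", wall_time_lower_bound(&balanced, 8));
    println!("skewed,   8 workers: {}", wall_time_lower_bound(&skewed, 8));
}
```

With these assumed numbers the balanced split is bounded by 200 units while the skewed split is bounded by 1450, so nearly all of the potential parallel speedup is lost even though both splits contain the same total work.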
alexcrichton referenced this issue on Oct 19, 2017: 32 codegen units may not always be better at -O0 #44941 (Closed)
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Oct 20, 2017
|
Thank you so much for collecting and analyzing such a large amount of data, @alexcrichton! Your conclusions make sense to me. Unevenly distributed CGU sizes are also a problem for incremental compilation; e.g. the tokio-webpush-simple@030-minor-change test sees >90% CGU re-use, yet compile time is almost the same as from-scratch. In the non-incremental case it might be an option to detect the problematic case right after partitioning and then switch to non-LTO mode? I'm not sure it's worth the extra complexity though. For cases where there is one big CGU but that CGU contains multiple functions, we should be able to rather easily redistribute functions to other CGUs. We'd only need a metric for the size of a In conclusion, judging from the table above, I think we should make
|
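The redistribution idea above can be sketched with a well-known greedy heuristic (longest-processing-time-first): sort items by size and always place the next one in the currently lightest bucket. This is only an illustration under the assumption that some per-function size metric exists; `rebalance` and `bucket_loads` are hypothetical names, and this is not how rustc's partitioner is implemented.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Distribute items of the given sizes over `bins` buckets using the
/// longest-processing-time-first greedy heuristic. Returns, per bucket,
/// the indices of the items assigned to it. Hypothetical helper.
fn rebalance(sizes: &[u64], bins: usize) -> Vec<Vec<usize>> {
    assert!(bins > 0, "need at least one bucket");
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    order.sort_by_key(|&i| Reverse(sizes[i])); // largest items first

    // Min-heap of (current load, bucket index).
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> =
        (0..bins).map(|b| Reverse((0u64, b))).collect();
    let mut out: Vec<Vec<usize>> = vec![Vec::new(); bins];

    for i in order {
        // Take the lightest bucket, add the item, and put the bucket back.
        let Reverse((load, b)) = heap.pop().expect("bins > 0");
        out[b].push(i);
        heap.push(Reverse((load + sizes[i], b)));
    }
    out
}

/// Total size assigned to each bucket.
fn bucket_loads(sizes: &[u64], buckets: &[Vec<usize>]) -> Vec<u64> {
    buckets
        .iter()
        .map(|b| b.iter().map(|&i| sizes[i]).sum::<u64>())
        .collect()
}

fn main() {
    // Assumed function sizes in some abstract metric (for example, a statement count).
    let sizes = [8u64, 7, 6, 5, 4];
    let buckets = rebalance(&sizes, 2);
    println!("assignment: {:?}", buckets);
    println!("loads:      {:?}", bucket_loads(&sizes, &buckets));

    // A single huge function cannot be split by this approach.
    let skewed = [100u64, 1, 1, 1, 1];
    println!("skewed loads: {:?}", bucket_loads(&skewed, &rebalance(&skewed, 2)));
}
```

The second input shows the limit of this approach: when one function dominates, the heaviest bucket stays at that function's size no matter how the rest are shuffled.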
|
FWIW, as the author of |
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Oct 20, 2017
bors added a commit that referenced this issue on Oct 20, 2017
This seems like a good thing to have in our back pocket, but I'm also wary of trying to do this by default. For example the
Agreed! So far in all the crates I've seen, O(instructions) has been quite a good metric for "how long this takes in all LLVM-related passes", so counting MIR sounds reasonable to me as well. Again though I also feel like this is ok to have in our back pocket. I'd want to dig more into the
Quite surprisingly I've yet to see any data that it's beneficial to compile time in release mode or doesn't hurt runtime. (all data is that it hurts both compile time and runtime performance!) That being said we're seeing such big wins from ThinLTO on big projects today that we can probably just save this as a possible optimization for the future. On the topic of inline functions I recently dug up #10212 again (later closed in favor of #14527, but the latter may no longer be true today). I suspect that may "fix" quite a bit of the compile time issue here without coming at a loss of performance? In any case possibly lots of interesting things we could do there.
Also a good question! I find this to be a difficult one, however, because the CGU number will affect the output artifact, which means that if we do this sort of probing it'll be beneficial for performance but come at the cost of more difficult deterministic builds. You'd have to specify the CGUs manually or build on the same-ish hardware to get a deterministic build I think?
I think though that if you have 1 CPU then this is universally a huge regression. With one CPU we'd split the preexisting one huge CGU into N different ones, probably take roughly the same amount of time to optimize those, and then tack on ThinLTO and more optimization passes. My guess is that with one CPU we'd easily see 50% regressions. With 2+ CPUs however I'd probably expect to see benefits to compile time. Anything giving us twice the resources to churn the cgus more quickly should quickly start seeing wins in theory I think.
For now though the deterministic builds wins me over in terms of leaving this as-is for all hardware configurations. That and I doubt anyone's compiling Rust on single-core machines nowadays! |
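The one-CPU versus multi-CPU reasoning above can be written down as back-of-the-envelope arithmetic. The sketch below assumes an idealized model in which all work divides perfectly across CPUs; the function name is hypothetical, and the ThinLTO overhead of 50 units is an assumption picked only to mirror the "50% regression" guess, not a measured figure.

```rust
/// Idealized wall time when optimization work plus extra ThinLTO work
/// divides perfectly across `cpus`. Hypothetical model for illustration.
fn ideal_wall_time(opt_work: f64, thinlto_work: f64, cpus: u32) -> f64 {
    assert!(cpus > 0, "need at least one CPU");
    (opt_work + thinlto_work) / f64::from(cpus)
}

fn main() {
    let opt_work = 100.0; // assumed cost of optimizing the crate as a single CGU
    let thinlto_work = 50.0; // assumed extra cost of ThinLTO (not measured)

    // Baseline: one CGU runs on one core regardless of how many CPUs exist.
    let baseline = opt_work;

    for cpus in [1u32, 2, 4, 8] {
        let t = ideal_wall_time(opt_work, thinlto_work, cpus);
        println!("{} CPU(s): {:.1} units ({:.2}x of baseline)", cpus, t, t / baseline);
    }
}
```

Under these assumptions one CPU takes 1.5x the baseline while two CPUs already drop to 0.75x. Real builds will not split this evenly, so the crossover point should be read as optimistic.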
|
One thing I think that's also worth pointing out is that up to this point we've mostly been measuring the runtime of an entire The benefit of ThinLTO and multiple CGUs is leveraging otherwise idle parallelism on the build machine. It's overall increasing the amount of work the compiler does. For a Put another way, ThinLTO and multiple CGUs should only be beneficial for builds which don't have many crates compiling in parallel for long parts of the build. If a build is 100% parallel for the entire time then ThinLTO will likely regress compile time performance. Now you might realize, however, that one very common case where you're only building one crate is in an incremental build! Typically if you do an incremental build you're only building a handful of crates, often serially. In that sense I think that there's some massive wins of ThinLTO + multiple CGUs in incremental builds rather than entire crate builds. Although improving both is of course great as well :) |
For deterministic builds you have to do some extra configuration anyway (e.g. path remapping) so I would not consider that a blocker. And I guess there are single core VMs around somewhere. But all of this is such a niche case that I don't really care
MIR-only RLIBs should put us into a pretty good spot regarding this, so I'm quite confident that we're on the right path overall. |
That's really interesting. For incremental compilation enabled the situation might be different (because pre-trans inlining hurts re-use) but for the non-incremental case it sounds like it's pretty clear what to do. |
Yeah, maybe we should revisit this at some point. Although I have to say the current solution of only |
|
Hm yeah that's a good point about needing configuration anyway for deterministic builds. It now seems like a more plausible route to take! Also that's a very interesting point about incremental and inlining on our own end... Maybe we should dig more into those ThinLTO runtime regressions at some point! Also yeah I don't really want to change how we trans inline functions just yet, I do like the simplicity too :) |
@alexcrichton out of curiosity, how did you figure this out, including which function it was? I've wanted to look into why the webrender build is so slow and would welcome tips. |
|
@jrmuizel oh sure I'd love to explain! So I originally found
That drops The graph here isn't always the easiest to read, but we've clearly got two huge bars, both of which correspond to taking a huge amount of time for that one CGU (first is optimization, second is ThinLTO + codegen). Next I ran:
and that command dumps a bunch of IR files into Inside that file it was 70k lines and some poking around showed that one function was 66k lines of IR. I sort of forget now how at this point I went from that IR to determining there was a huge function in there though... |
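The "poking around" step, finding which function dominates an IR file, can be approximated mechanically. The sketch below counts lines per function in textual LLVM IR by looking for lines that start with `define ` and end at a bare `}`. The helper names are hypothetical, the symbol extraction only handles simple unquoted names, and this is not the procedure used in the comment above, just one way to get a similar answer.

```rust
/// Return (symbol name, line count) for each function definition found in
/// textual LLVM IR, largest first. Hypothetical helper; assumes definitions
/// start with "define " at column 0 and end with a line that is exactly "}".
fn function_sizes(ir: &str) -> Vec<(String, usize)> {
    let mut sizes = Vec::new();
    let mut current: Option<(String, usize)> = None;

    for line in ir.lines() {
        if current.is_none() && line.starts_with("define ") {
            current = Some((symbol_name(line), 0));
        }
        // Count every line that belongs to the function being tracked.
        if let Some((_, count)) = current.as_mut() {
            *count += 1;
        }
        if line == "}" {
            if let Some(done) = current.take() {
                sizes.push(done);
            }
        }
    }

    sizes.sort_by(|a, b| b.1.cmp(&a.1)); // largest first
    sizes
}

/// Extract the text between the first '@' and the following '('.
/// Good enough for simple names; quoted symbols are not handled.
fn symbol_name(define_line: &str) -> String {
    define_line
        .split('@')
        .nth(1)
        .and_then(|rest| rest.split('(').next())
        .unwrap_or("<unknown>")
        .to_string()
}

// Tiny made-up IR sample so the example runs without any input files.
const SAMPLE_IR: &str = r#"
define i32 @small() {
  ret i32 0
}

define void @big(i32 %x) {
entry:
  %a = add i32 %x, 1
  %b = add i32 %a, 1
  ret void
}
"#;

fn main() {
    for (name, lines) in function_sizes(SAMPLE_IR) {
        println!("{:>6} lines  {}", lines, name);
    }
}
```

On a real `.ll` file the first row of the output would point at the kind of outlier described above, where one function accounts for most of the lines in its CGU.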
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Oct 20, 2017
bors added a commit that referenced this issue on Oct 20, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Oct 20, 2017
bors added a commit that referenced this issue on Oct 20, 2017
bors added a commit that referenced this issue on Oct 22, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Oct 25, 2017
JordiPolo referenced this issue on Oct 26, 2017: Big functions makes it difficult to parallelize compilation #33 (Open)
bors added a commit that referenced this issue on Oct 27, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Oct 31, 2017
bors added a commit that referenced this issue on Nov 3, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Nov 25, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Nov 26, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Nov 26, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Nov 29, 2017
alexcrichton referenced this issue on Nov 29, 2017: rustc: Prepare to enable ThinLTO by default #46382 (Merged)
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Dec 21, 2017
alexcrichton referenced this issue on Dec 21, 2017: rustc: Set release mode cgus to 16 by default #46910 (Merged)
kennytm added a commit to kennytm/rust that referenced this issue on Dec 23, 2017
kennytm added a commit to kennytm/rust that referenced this issue on Dec 23, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Dec 24, 2017
|
I don't know if this is the right place, but I think we should consider whether the symbol issues reported in the Nightly section of #46552 are a blocker or not. See also the discussion in the commit comment here ;) The tl;dr is that certain versions of LLVM (I've seen this on 3.8, and on whatever LLVM rustc nightly uses) append seemingly random garbage to the end of some names, e.g. we get:
instead of:
This knocks the debuginfo out of sync (it doesn't have the garbage appended). I've been able to repro this with clang 3.8 for C++ files as well (haven't tested on other LLVM versions), and switching to 5.0 seems to have fixed the issue. I don't know if it's within reach, but perhaps we should attempt upgrading to LLVM 5.0 before releasing this on stable? Note I understand this is for release mode, but I see this in debug mode on rustc nightly right now as well... |

alexcrichton commented Oct 16, 2017 (edited)
I'm opening this up to serve as a tracking issue for enabling multiple codegen units in release mode by default. I've written up a lengthy summary before but the tl;dr is that multiple codegen units enable us to run optimization/code generation in parallel, making use of all available computing resources and often speeding up compilations by more than 2x.
Historically this has not been done due to claims of a loss in performance, but the recently implemented ThinLTO is intended to assuage such concerns. The most viable route forward seems to be to enable multiple CGUs and ThinLTO at the same time in release mode.
Performance summary
Blocking issues:
ThinLTO exposes too many symbols - fixed
first attempt - blocked on presumed LLVM bug - Rust tracking issue - fixed
blocked on test failures - current presumed cause of test failures - next attempt
Possible build-time regressions using multiple CGUs in debug mode - couldn't reproduce
Reported build time regression in rust-doom - couldn't reproduce
ThinLTO broken some MSVC rlibs - fixed
Potential blockers/bugs:
proposed fix - update to rust