
Tracking issue for enabling multiple CGUs in release mode by default #45320

Closed
8 of 11 tasks
alexcrichton opened this issue Oct 16, 2017 · 20 comments · Fixed by #46910
Labels
A-codegen Area: Code generation C-tracking-issue Category: A tracking issue for an RFC or an unstable feature.

Comments

@alexcrichton (Member)

alexcrichton commented Oct 16, 2017

I'm opening this up to serve as a tracking issue for enabling multiple codegen units (CGUs) in release mode by default. I've written up a lengthy summary before, but the tl;dr is that multiple codegen units let us run optimization and code generation in parallel, making use of all available computing resources and often speeding up compilations by more than 2x.

Historically this has not been done due to reported losses in runtime performance, but the recently implemented ThinLTO is intended to assuage those concerns. The most viable route forward seems to be to enable multiple CGUs and ThinLTO at the same time in release mode.

Performance summary

Blocking issues:

Potential blockers/bugs:

@alexcrichton alexcrichton added A-codegen Area: Code generation C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Oct 16, 2017
@alexcrichton (Member Author)

cc @michaelwoerister

@bstrie (Contributor)

bstrie commented Oct 16, 2017

Presumably this will remain a stable compiler flag, but what else will change by default? Is the plan to make this only affect cargo build --release, or is the plan to also make this affect rustc --opt-level=2? Do we want to make --opt-level=3 continue to use only a single CGU, to hedge against regressions for people who are already willing to trade compile time for runtime?

@alexcrichton (Member Author)

I would specifically propose that any opt level greater than 1 uses 16 codegen units and ThinLTO enabled by default for those 16 codegen units.
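To make the proposal concrete, the decision could be sketched roughly like this (hypothetical code, not rustc's actual option handling; the function name and exact ThinLTO interaction with an explicit user setting are made up):

```rust
// Sketch of the proposed default: any opt-level above 1 gets 16 codegen
// units with ThinLTO enabled, while an explicit -C codegen-units setting
// is respected as-is. Returns (codegen_units, thinlto_enabled).
fn default_cgus_and_thinlto(opt_level: u32, user_cgus: Option<usize>) -> (usize, bool) {
    match user_cgus {
        // Respect an explicit user choice; ThinLTO only matters with >1 CGU.
        Some(n) => (n, n > 1),
        // The proposed new default for -O2 and above.
        None if opt_level > 1 => (16, true),
        // Debug / -O1 keep a single CGU, no ThinLTO.
        None => (1, false),
    }
}

fn main() {
    assert_eq!(default_cgus_and_thinlto(2, None), (16, true));
    assert_eq!(default_cgus_and_thinlto(3, None), (16, true));
    assert_eq!(default_cgus_and_thinlto(1, None), (1, false));
    assert_eq!(default_cgus_and_thinlto(2, Some(1)), (1, false));
}
```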

@michaelwoerister (Member)

This is very interesting. One would think that ThinLTO-driven inlining should always have to do less work than our much more conservative pre-LLVM inlining. But that doesn't always seem to be the case, as in the rust-doom crate. Most tests on the IRLO thread seem to profit, though. Overall it's still not a clear picture to me.

@alexcrichton (Member Author)

@michaelwoerister was that comment meant for a different thread? I forget which one as well though, so I'll respond here!

As a "quick benchmark" I compiled the regex test suite with 16 CGUs + ThinLTO and then toggled inlining into all CGUs on and off. Surprisingly, inlining in all CGUs compiled 5s faster, and additionally did better on tons of benchmarks. In those timings, after is the configuration where inlining happens in only one CGU, and before is where inlining happens in all CGUs. One benchmark, misc::reverse_suffix_no_quadratic, got twice as slow!

@michaelwoerister (Member)

@alexcrichton Yeah, the comment was in response to #45188 (rust-doom) but since that was closed already, I put it here.

Regarding the compilation time difference between pre- and post-trans inlining, my hypothesis would be that sometimes pre-trans inlining will lead to more code being eliminated early on (as in the case of regex apparently) and sometimes it will have the opposite effect. I suspect that there's room for improvement by tuning which LLVM passes we run specifically for ThinLTO. Or do we do that already?

@alexcrichton (Member Author)

Heh it's true yeah, I'd imagine that there's always room for improvement in pass tuning in Rust :). Right now we perform (afaik) 0 customization of any pass manager in LLVM. All of the normal optimization passes, LTO optimization passes, and ThinLTO optimization passes are all the same as what's in LLVM itself.

@alexcrichton (Member Author)

alexcrichton commented Oct 19, 2017

I wanted to also take the time to tabulate all the results from the call-to-action thread to make sure it's all visible in one place. Note that all the timings below compare a release build to a release build with 16 CGUs and ThinLTO enabled.

Improvements

All of the following compile times improved, sorted from most improved to least improved:

| project | compile time | runtime |
|---|---|---|
| internal project | -67% | |
| unnamed project | -67% | |
| cargo | -61% | |
| crates.io final crate | -57% | |
| rustling-ontology | -61% | no impact |
| servo | -59% | |
| base100 | -55% | |
| rust-doom one crate | -54% | 0% |
| serde | -50% | -12% to +16% |
| distributary | -44% | -3% |
| bindgen | -39% | +7% |
| lewton | -37% | |
| crates.io | -33% | |
| paniopticon | -32% | -64% (?) |
| lewton benchmarks | -26% | no impact |
| internal project | -23% | |
| rust-doom | -8% | 0% |
| webrender | -3% | |
| rust-belt | -3% | |

Regressions

The following crates regressed in compile times, sorted from smallest regression to largest:

| project | compile time | runtime | before | after |
|---|---|---|---|---|
| rust-belt | +1% | | 97.498s | 99.09s |
| slab-alloc | +2.5% | +5.5% | 90.323s | 92.65s |
| ripfs-rs | +11% | | 312s | 347s |
| rayon | +13% | | 7.27s | 8.18s |
| channel | +23% | -39% | 7.13s | 8.80s |
| unnamed project | +30% | | 15.62s | 20.24s |
| channel final crate | +60% | | 0.78s | 1.25s |

alexcrichton's attempt to reproduce the regressions

Here I attempt to reproduce the regressions on my own machine with rustc 1.22.0-nightly (f6d751454 2017-10-17)

| project | compile time | before | after |
|---|---|---|---|
| ripfs-rs final crate | -43% | 8.18s | 4.7s |
| rust-belt final crate | -35% | 12.0s | 7.79s |
| slab-alloc | -30% | 28.98s | 21.12s |
| channel final crate | -25% | 0.83s | 0.62s |
| ripfs-rs | -20% | 145.66s | 115.87s |
| slab-alloc final crate | -17% | 0.75s | 0.62s |
| rayon | -14% | 6.24s | 5.35s |
| channel | -12% | 4.91s | 4.33s |
| rayon final crate | -3% | 1.75s | 1.70s |
| rust-belt | +8% | 104.62s | 113.81s |
rust-belt +8% 104.62s 113.81s

Unfortunately the only regression I was able to reproduce was the rust-belt regression. I'll be looking more into that.

@alexcrichton (Member Author)

Looking into rust-belt, one interesting thing I've found is that the inflate crate takes longer to compile with ThinLTO than it would otherwise. Looking at profiling information it appears that one codegen unit in this crate takes 99% of the time in LLVM. This codegen unit appears to be basically entirely dominated by one function.

Almost all of the benefit of multiple codegen units comes from spreading the work across all CPUs. Enabling ThinLTO to avoid losing any runtime perf is fundamentally doing more work than what already happens at -O2, but we get speedups across the board in most cases. If we have one huge CGU, split it in two, and work on those halves in parallel, then (in a perfect world) we've freed up 50% of the original time in which to run ThinLTO. If ThinLTO takes more than that 50% we'd have a build-time regression, but in practice that almost never happens.
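The arithmetic above can be written down as a toy model (entirely hypothetical numbers and function names, not anything rustc computes):

```rust
// Toy model: total single-CGU LLVM time is normalized to 1.0. With perfect
// parallelism, wall time after splitting is dominated by the largest CGU's
// fraction of the work, plus whatever the extra ThinLTO passes cost on the
// critical path. ThinLTO "wins" if that sum still beats the original build.
fn thinlto_wins(largest_cgu_fraction: f64, thinlto_cost: f64) -> bool {
    largest_cgu_fraction + thinlto_cost < 1.0
}

fn main() {
    // An even 16-way split leaves lots of headroom for ThinLTO passes.
    assert!(thinlto_wins(1.0 / 16.0, 0.5));
    // One CGU still holding ~99% of the work (the `inflate` case) leaves
    // almost no window, so nearly any ThinLTO cost becomes a regression.
    assert!(!thinlto_wins(0.99, 0.05));
}
```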

What's happening here is that we have one huge CGU, and when we split the crate up we still have one huge CGU. That means the build spends maybe ~1% less wall-clock time in LLVM, leaving only a tiny sliver of a window in which to run ThinLTO passes. In the case of the inflate crate, this means that adding ThinLTO passes is an overall build-time regression.

So generalizing even further, I think that enabling ThinLTO and multiple CGUs by default is going to regress compile-time performance in any crate where our partitioning implementation doesn't actually partition very well. In the case of inflate we've got one huge function which hogs almost all the optimization time (I think), and the entire crate is basically dominated by that one function.

I would be willing to conclude, however, that such a situation is likely quite rare. Almost all other crates in the ecosystem will benefit from the partitioning which should basically evenly split up a crate. @michaelwoerister maybe you have thoughts on this though?

@michaelwoerister (Member)

Thank you so much for collecting and analyzing such a large amount of data, @alexcrichton!

Your conclusions make sense to me. Unevenly distributed CGU sizes are also a problem for incremental compilation; e.g. the tokio-webpush-simple@030-minor-change test sees >90% CGU re-use, yet compile time is almost the same as from-scratch.

In the non-incremental case it might be an option to detect the problematic case right after partitioning and then switch to non-LTO mode? I'm not sure it's worth the extra complexity though.

For cases where there is one big CGU but that CGU contains multiple functions, we should be able to rather easily redistribute functions to other CGUs. We'd only need a metric for the size of a TransItem. The number of MIR instructions might be a sufficient heuristic here. That would not help for inflate but it might help for other crates.
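A redistribution pass along those lines could be a simple greedy bin-packing by estimated size, largest items first (a sketch under the MIR-instruction-count assumption; `TransItem` here is a stand-in struct, not the compiler's actual type):

```rust
/// Stand-in for a translation item, sized by its MIR instruction count.
struct TransItem {
    name: String,
    mir_insts: usize,
}

/// Greedily assign items to the currently smallest CGU, largest first
/// (classic LPT scheduling). This evens out CGUs made of many functions,
/// but cannot help when a single item dominates, as in `inflate`.
fn partition(mut items: Vec<TransItem>, n_cgus: usize) -> Vec<Vec<TransItem>> {
    items.sort_by(|a, b| b.mir_insts.cmp(&a.mir_insts));
    let mut cgus: Vec<Vec<TransItem>> = (0..n_cgus).map(|_| Vec::new()).collect();
    let mut sizes = vec![0usize; n_cgus];
    for item in items {
        let smallest = (0..n_cgus).min_by_key(|&i| sizes[i]).unwrap();
        sizes[smallest] += item.mir_insts;
        cgus[smallest].push(item);
    }
    cgus
}

fn main() {
    // One huge function plus several small ones, split across 4 CGUs.
    let items: Vec<TransItem> = (0..8)
        .map(|i| TransItem {
            name: format!("fn{}", i),
            mir_insts: if i == 0 { 1000 } else { 10 },
        })
        .collect();
    let cgus = partition(items, 4);
    // The huge function ends up alone; the small ones share the other CGUs.
    assert!(cgus.iter().any(|c| c.len() == 1 && c[0].mir_insts == 1000));
    println!("CGU sizes: {:?}", cgus.iter().map(|c| c.len()).collect::<Vec<_>>());
}
```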

In conclusion, judging from the table above, I think we should make cgus=16 + ThinLTO the default sooner rather than later. Open questions I see:

  • What to do about -Zinline-in-all-cgus? I would have thought that it would always be a clear win to disable this when ThinLTO is enabled but that doesn't seem to be the case.
  • Are there hardware configurations where this never is a win? E.g. should we default to traditional compilation if num_cpus <= 2?

@johnthagen (Contributor)

FWIW, as the author of rust-belt, which is a made-for-fun Piston game, I am more than happy to take a small compile-time regression when it seems like the vast majority of crates get huge improvements!

@alexcrichton (Member Author)

In the non-incremental case it might be an option to detect the problematic case right after partitioning and then switch to non-LTO mode? I'm not sure it's worth the extra complexity though.

This seems like a good thing to have in our back pocket, but I'm also wary of trying to do this by default. For example, the inflate case is one where we may wish to disable ThinLTO, but the one massive function could also critically rely on a function in a different CGU being inlined. That sort of case would be difficult for us to detect... In any case, I agree that we probably don't need the complexity just yet, so I think we can defer this for later.

For cases where there is one big CGU but that CGU contains multiple functions, we should be able to rather easily redistribute functions to other CGUs.

Agreed! So far, in all the crates I've seen, instruction count has been quite a good metric for how long a CGU takes in all LLVM-related passes, so counting MIR instructions sounds reasonable to me as well. Again, though, I also feel like this is OK to have in our back pocket. I'd want to dig more into the tokio-webpush-simple minor-change example; I'm curious how skewed the CGU distribution is there and whether a heuristic like this would help.

What to do about -Zinline-in-all-cgus?

Quite surprisingly, I've yet to see any data showing that it's beneficial to compile time in release mode or that it doesn't hurt runtime. (All the data so far says it hurts both compile-time and runtime performance!)

That being said we're seeing such big wins from ThinLTO on big projects today that we can probably just save this as a possible optimization for the future.

On the topic of inline functions I recently dug up #10212 again (later closed in favor of #14527, but the latter may no longer be true today). I suspect that may "fix" quite a bit of the compile time issue here without coming at a loss of performance? In any case possibly lots of interesting things we could do there.

Are there hardware configurations where this never is a win? E.g. should we default to traditional compilation if num_cpus <= 2?

Also a good question! I find this to be a difficult one, however, because the CGU count affects the output artifact, which means that if we do this sort of probing it'll be beneficial for performance but come at the cost of more difficult deterministic builds. You'd have to specify the CGU count manually, or build on the same-ish hardware, to get a deterministic build, I think?

I think, though, that if you have 1 CPU then this is universally a huge regression. With one CPU we'd split the preexisting single huge CGU into N different ones, probably take roughly the same amount of time to optimize those, and then tack on ThinLTO and more optimization passes. My guess is that with one CPU we'd easily see 50% regressions.

With 2+ CPUs, however, I'd expect to see compile-time benefits. Anything giving us twice the resources to churn through the CGUs should, in theory, start seeing wins quickly.

For now, though, the deterministic-builds concern wins me over in favor of leaving this as-is for all hardware configurations. That, and I doubt anyone's compiling Rust on single-core machines nowadays!

@alexcrichton (Member Author)

One thing I think that's also worth pointing out is that up to this point we've mostly been measuring the runtime of an entire cargo build. That's actually, I believe, the absolute worst case scenario for where ThinLTO will provide benefit. Despite this, it's showing huge improvements for lots of projects!

The benefit of ThinLTO and multiple CGUs is leveraging otherwise idle parallelism on the build machine. It's overall increasing the amount of work the compiler does. For a cargo build from scratch, though, you typically already have tons of crates compiling for the first half of the build in parallel. In that sense there's not actually any idle parallelism.

Put another way, ThinLTO and multiple CGUs should only be beneficial for builds which don't have many crates compiling in parallel for long parts of the build. If a build is 100% parallel the entire time, then ThinLTO will likely regress compile-time performance.

Now you might realize, however, that one very common case where you're only building one crate is in an incremental build! Typically if you do an incremental build you're only building a handful of crates, often serially. In that sense I think that there's some massive wins of ThinLTO + multiple CGUs in incremental builds rather than entire crate builds.

Although improving both is of course great as well :)

@michaelwoerister (Member)

For now though the deterministic builds wins me over in terms of leaving this as-is for all hardware configurations. That and I doubt anyone's compiling Rust on single-core machines nowadays!

For deterministic builds you have to do some extra configuration anyway (e.g. path remapping) so I would not consider that a blocker. And I guess there are single core VMs around somewhere. But all of this is such a niche case that I don't really care :)

The benefit of ThinLTO and multiple CGUs is leveraging otherwise idle parallelism on the build machine.

MIR-only RLIBs should put us into a pretty good spot regarding this, so I'm quite confident that we're on the right path overall.

@michaelwoerister (Member)

What to do about -Zinline-in-all-cgus?

Quite surprisingly I've yet to see any data that it's beneficial to compile time in release mode or doesn't hurt runtime. (all data is that it hurts both compile time and runtime performance!)

That's really interesting. With incremental compilation enabled the situation might be different (because pre-trans inlining hurts re-use), but for the non-incremental case it sounds like it's pretty clear what to do.

@michaelwoerister (Member)

On the topic of inline functions I recently dug up #10212 again.

Yeah, maybe we should revisit this at some point. Although I have to say the current solution of only internal and external symbols and nothing in between is really nice and simple.

@alexcrichton (Member Author)

Hm yeah that's a good point about needing configuration anyway for deterministic builds. It now seems like a more plausible route to take!

Also, that's a very interesting point about incremental compilation and inlining on our own end... Maybe we should dig more into those ThinLTO runtime regressions at some point!

Also yeah I don't really want to change how we trans inline functions just yet, I do like the simplicity too :)

@jrmuizel (Contributor)

This codegen unit appears to be basically entirely dominated by one function.

@alexcrichton out of curiosity, how did you figure this out, and which function was it? I've wanted to look into why the webrender build is so slow and would welcome tips.

@alexcrichton (Member Author)

@jrmuizel oh sure, I'd love to explain! I originally found inflate as a problematic crate when compiling rust-belt, as it just took a while and I decided to dig deeper. I checked out the crate and ran:

$ RUSTFLAGS='-Z trans-time-graph -Z thinlto -C codegen-units=16' cargo +nightly build --release
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling adler32 v1.0.2
   Compiling inflate v0.3.3 (file:///home/alex/code/inflate)
    Finished release [optimized] target(s) in 16.47 secs

That drops inflate-timings.html in the current directory, and opening that up I see:

[screenshot, 2017-10-20: the -Z trans-time-graph output for inflate]

The graph here isn't always the easiest to read, but we've clearly got two huge bars, both of which correspond to taking a huge amount of time for that one CGU (first is optimization, second is ThinLTO + codegen).

Next I ran:

$ RUSTFLAGS='-Z trans-time-graph -Z thinlto -C codegen-units=16' cargo +nightly rustc --release -- --emit llvm-ir

and that command dumps a bunch of IR files into target/release/deps. Our interesting CGU is inflate0 so I opened up target/release/deps/inflate-9778032a59339daf.inflate0-3828e589be87d42871e8ca3be2241a9d.rs.rust-cgu.ll (we sure do love our long filenames).

Inside that file it was 70k lines and some poking around showed that one function was 66k lines of IR.

I sort of forget now how I went from that IR to determining that there was one huge function in there, though...
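For the record, one mechanical way to get from an IR dump to the offending function is to count lines per `define` block (a quick-and-dirty sketch; it relies on LLVM's textual IR convention of `define` lines and each function's closing `}` starting at column 0):

```rust
use std::fs;

/// Return (function header, line count) for the longest function in a
/// textual LLVM IR (.ll) file.
fn biggest_function(ir: &str) -> Option<(String, usize)> {
    let mut best: Option<(String, usize)> = None;
    let mut current: Option<(String, usize)> = None;
    for line in ir.lines() {
        if line.starts_with("define ") {
            // Start of a function body; remember its header line.
            current = Some((line.to_string(), 1));
        } else if line.starts_with('}') {
            // End of the current function; keep it if it's the biggest so far.
            if let Some((header, count)) = current.take() {
                let count = count + 1; // include the closing brace line
                if best.as_ref().map_or(true, |b| count > b.1) {
                    best = Some((header, count));
                }
            }
        } else if let Some((_, ref mut count)) = current {
            *count += 1;
        }
    }
    best
}

fn main() {
    let path = match std::env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: biggest-fn <file.ll>");
            return;
        }
    };
    let ir = fs::read_to_string(&path).expect("failed to read IR file");
    if let Some((header, lines)) = biggest_function(&ir) {
        println!("{} lines: {}", lines, header);
    }
}
```

Pointed at the inflate0 CGU dump above, something like this would print the 66k-line function directly instead of requiring manual poking around.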

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 20, 2017
This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for rust-lang#45320 and shaking out bugs.
bors added a commit that referenced this issue Oct 20, 2017
rustbuild: Compile rustc with ThinLTO

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.
bors added a commit that referenced this issue Oct 22, 2017
rustbuild: Compile rustc with ThinLTO

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.
alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 25, 2017
This commit moves the standard library to get compiled with multiple codegen
units and ThinLTO like the compiler itself. This I would hope is the last major
step towards closing out rust-lang#45320
alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 31, 2017
This commit moves the standard library to get compiled with multiple codegen
units and ThinLTO like the compiler itself. This I would hope is the last major
step towards closing out rust-lang#45320
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 21, 2017
This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
kennytm added a commit to kennytm/rust that referenced this issue Dec 23, 2017
…haelwoerister

rustc: Set release mode cgus to 16 by default

This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
kennytm added a commit to kennytm/rust that referenced this issue Dec 23, 2017
…haelwoerister

rustc: Set release mode cgus to 16 by default

This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 24, 2017
This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
@m4b (Contributor)

m4b commented Dec 24, 2017

I don't know if this is the right place, but I think it should be decided whether the symbol issues reported in the Nightly section of #46552 are a blocker or not. See also the discussion in the commit comment. ;)

The tl;dr is that certain versions of LLVM (I've seen this on 3.8, and on whatever LLVM rustc nightly uses) append seemingly random garbage to the end of some symbol names, e.g. we get:

_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE.llvm.D64EB761

instead of:

_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE

This knocks the debuginfo out of sync (it doesn't have the garbage appended).
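Until that's fixed, tooling that matches symbols against debuginfo can work around it by truncating the name at the suffix (a workaround sketch; the helper name is made up):

```rust
/// Strip an LLVM-appended `.llvm.<hex>` suffix from a mangled symbol name
/// so it matches the name the debuginfo records.
fn strip_llvm_suffix(symbol: &str) -> &str {
    match symbol.find(".llvm.") {
        Some(idx) => &symbol[..idx],
        None => symbol,
    }
}

fn main() {
    let sym = "_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE.llvm.D64EB761";
    assert_eq!(
        strip_llvm_suffix(sym),
        "_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE"
    );
    // Names without the suffix pass through untouched.
    assert_eq!(strip_llvm_suffix("main"), "main");
}
```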

I've been able to repro this with clang 3.8 for C++ files as well (I haven't tested other LLVM versions), and switching to 5.0 seems to have fixed the issue.

I don't know if it's within reach, but perhaps we should attempt upgrading to LLVM 5.0 before releasing this on stable?

Note I understand this is for release mode, but I see this in debug mode on rustc nightly right now as well...

bors added a commit that referenced this issue Dec 25, 2017
rustc: Set release mode cgus to 16 by default

This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes #45320
6 participants