
Tracking issue for enabling multiple CGUs in release mode by default #45320

Closed
8 of 11 tasks
alexcrichton opened this issue Oct 16, 2017 · 20 comments · Fixed by #46910
Labels
A-codegen Area: Code generation C-tracking-issue Category: A tracking issue for an RFC or an unstable feature.

Comments

@alexcrichton (Member)

alexcrichton commented Oct 16, 2017

I'm opening this up to serve as a tracking issue for enabling multiple codegen units (CGUs) in release mode by default. I've written up a lengthy summary before, but the tl;dr is that multiple codegen units let us run optimization and code generation in parallel, making use of all available computing resources and often speeding up compilations by more than 2x.

Historically this has not been done due to reported losses in runtime performance, but the recently implemented ThinLTO is intended to assuage those concerns. The most viable route forward seems to be to enable multiple CGUs and ThinLTO at the same time in release mode.

Performance summary

Blocking issues:

Potential blockers/bugs:

@alexcrichton alexcrichton added A-codegen Area: Code generation C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Oct 16, 2017
@alexcrichton (Member Author)

cc @michaelwoerister

@bstrie (Contributor)

bstrie commented Oct 16, 2017

Presumably this will remain a stable compiler flag, but what else will change by default? Is the plan to make this only affect cargo build --release, or is the plan to also make this affect rustc --opt-level=2? Do we want to make --opt-level=3 continue to use only a single CGU, to hedge against regressions for people who are already willing to trade compile time for runtime?

@alexcrichton (Member Author)

I would specifically propose that any opt level greater than 1 uses 16 codegen units and ThinLTO enabled by default for those 16 codegen units.
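To make the proposal concrete, the decision could be sketched roughly like this (hypothetical code, not rustc's actual option handling; the function name and exact ThinLTO interaction with an explicit user setting are made up):

```rust
// Sketch of the proposed default: any opt-level above 1 gets 16 codegen
// units with ThinLTO enabled, while an explicit -C codegen-units setting
// is respected as-is. Returns (codegen_units, thinlto_enabled).
fn default_cgus_and_thinlto(opt_level: u32, user_cgus: Option<usize>) -> (usize, bool) {
    match user_cgus {
        // Respect an explicit user choice; ThinLTO only matters with >1 CGU.
        Some(n) => (n, n > 1),
        // The proposed new default for -O2 and above.
        None if opt_level > 1 => (16, true),
        // Debug / -O1 keep a single CGU, no ThinLTO.
        None => (1, false),
    }
}

fn main() {
    assert_eq!(default_cgus_and_thinlto(2, None), (16, true));
    assert_eq!(default_cgus_and_thinlto(3, None), (16, true));
    assert_eq!(default_cgus_and_thinlto(1, None), (1, false));
    assert_eq!(default_cgus_and_thinlto(2, Some(1)), (1, false));
}
```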

@michaelwoerister (Member)

This is very interesting. One would think that ThinLTO-driven inlining should always have to do less work than our much more conservative pre-LLVM inlining. But that doesn't always seem to be the case, as in the rust-doom crate. Most tests on the IRLO thread seem to profit, though. Overall it's still not a clear picture to me.

@alexcrichton (Member Author)

@michaelwoerister was that comment meant for a different thread? I forget which one as well though, so I'll respond here!

As a "quick benchmark" I compiled the regex test suite with 16 CGUs + ThinLTO and then toggled inlining into all CGUs on and off. Surprisingly, inlining in all CGUs compiled 5s faster, and additionally did better on tons of benchmarks. In those timings, after is the configuration where inlining happens in only one CGU, and before is where inlining happens in all CGUs. One benchmark, misc::reverse_suffix_no_quadratic, got twice as slow!

@michaelwoerister (Member)

@alexcrichton Yeah, the comment was in response to #45188 (rust-doom) but since that was closed already, I put it here.

Regarding the compilation time difference between pre- and post-trans inlining, my hypothesis would be that sometimes pre-trans inlining will lead to more code being eliminated early on (as in the case of regex apparently) and sometimes it will have the opposite effect. I suspect that there's room for improvement by tuning which LLVM passes we run specifically for ThinLTO. Or do we do that already?

@alexcrichton (Member Author)

Heh it's true yeah, I'd imagine that there's always room for improvement in pass tuning in Rust :). Right now we perform (afaik) 0 customization of any pass manager in LLVM. All of the normal optimization passes, LTO optimization passes, and ThinLTO optimization passes are all the same as what's in LLVM itself.

@alexcrichton (Member Author)

alexcrichton commented Oct 19, 2017

I wanted to also take the time to tabulate all the results from the call-to-action thread to make sure it's all visible in one place. Note that all the timings below compare a release build to a release build with 16 CGUs and ThinLTO enabled.

Improvements

All of the following compile times improved, sorted from most improved to least improved:

| project | compile time | runtime |
|---|---|---|
| internal project | -67% | |
| unnamed project | -67% | |
| cargo | -61% | |
| crates.io final crate | -57% | |
| rustling-ontology | -61% | no impact |
| servo | -59% | |
| base100 | -55% | |
| rust-doom one crate | -54% | 0% |
| serde | -50% | -12% to +16% |
| distributary | -44% | -3% |
| bindgen | -39% | +7% |
| lewton | -37% | |
| crates.io | -33% | |
| paniopticon | -32% | -64% (?) |
| lewton benchmarks | -26% | no impact |
| internal project | -23% | |
| rust-doom | -8% | 0% |
| webrender | -3% | |
| rust-belt | -3% | |

Regressions

The following crates regressed in compile times, sorted from smallest regression to largest:

| project | compile time | runtime | before | after |
|---|---|---|---|---|
| rust-belt | +1% | | 97.498s | 99.09s |
| slab-alloc | +2.5% | +5.5% | 90.323s | 92.65s |
| ripfs-rs | +11% | | 312s | 347s |
| rayon | +13% | | 7.27s | 8.18s |
| channel | +23% | -39% | 7.13s | 8.80s |
| unnamed project | +30% | | 15.62s | 20.24s |
| channel final crate | +60% | | 0.78s | 1.25s |

alexcrichton's attempt to reproduce the regressions

Here I attempt to reproduce the regressions on my own machine with rustc 1.22.0-nightly (f6d751454 2017-10-17)

| project | compile time | before | after |
|---|---|---|---|
| ripfs-rs final crate | -43% | 8.18s | 4.7s |
| rust-belt final crate | -35% | 12.0s | 7.79s |
| slab-alloc | -30% | 28.98s | 21.12s |
| channel final crate | -25% | 0.83s | 0.62s |
| ripfs-rs | -20% | 145.66s | 115.87s |
| slab-alloc final crate | -17% | 0.75s | 0.62s |
| rayon | -14% | 6.24s | 5.35s |
| channel | -12% | 4.91s | 4.33s |
| rayon final crate | -3% | 1.75s | 1.70s |
| rust-belt | +8% | 104.62s | 113.81s |
rust-belt +8% 104.62s 113.81s

Unfortunately the only regression I was able to reproduce was the rust-belt regression. I'll be looking more into that.

@alexcrichton (Member Author)

Looking into rust-belt, one interesting thing I've found is that the inflate crate takes longer to compile with ThinLTO than it would otherwise. Looking at profiling information it appears that one codegen unit in this crate takes 99% of the time in LLVM. This codegen unit appears to be basically entirely dominated by one function.

Almost all of the benefit of multiple codegen units comes from spreading the work across all CPUs. Enabling ThinLTO to avoid losing any runtime perf is fundamentally doing more work than what already happens at -O2, but we get speedups across the board in most cases. If we have one huge CGU, split it in two, and work on those halves in parallel, then (in a perfect world) we've freed up 50% of the original time in which to run ThinLTO. If ThinLTO takes more than that 50% we'd have a build-time regression, but in practice that almost never happens.
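The arithmetic above can be written down as a toy model (entirely hypothetical numbers and function names, not anything rustc computes):

```rust
// Toy model: total single-CGU LLVM time is normalized to 1.0. With perfect
// parallelism, wall time after splitting is dominated by the largest CGU's
// fraction of the work, plus whatever the extra ThinLTO passes cost on the
// critical path. ThinLTO "wins" if that sum still beats the original build.
fn thinlto_wins(largest_cgu_fraction: f64, thinlto_cost: f64) -> bool {
    largest_cgu_fraction + thinlto_cost < 1.0
}

fn main() {
    // An even 16-way split leaves lots of headroom for ThinLTO passes.
    assert!(thinlto_wins(1.0 / 16.0, 0.5));
    // One CGU still holding ~99% of the work (the `inflate` case) leaves
    // almost no window, so nearly any ThinLTO cost becomes a regression.
    assert!(!thinlto_wins(0.99, 0.05));
}
```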

What's happening here is that we have one huge CGU, and when we split the crate up we still have one huge CGU. That means the build spends maybe ~1% less wall-clock time in LLVM, leaving only a tiny sliver of a window in which to run ThinLTO passes. In the case of the inflate crate, this means that adding ThinLTO passes is an overall build-time regression.

So generalizing even further, I think that enabling ThinLTO and multiple CGUs by default is going to regress compile-time performance in any crate where our partitioning implementation doesn't actually partition very well. In the case of inflate we've got one huge function which hogs almost all the optimization time (I think), and the entire crate is basically dominated by that one function.

I would be willing to conclude, however, that such a situation is likely quite rare. Almost all other crates in the ecosystem will benefit from the partitioning which should basically evenly split up a crate. @michaelwoerister maybe you have thoughts on this though?

@michaelwoerister (Member)

Thank you so much for collecting and analyzing such a large amount of data, @alexcrichton!

Your conclusions make sense to me. Unevenly distributed CGU sizes are also a problem for incremental compilation; e.g. the tokio-webpush-simple@030-minor-change test sees >90% CGU re-use, yet compile time is almost the same as from-scratch.

In the non-incremental case it might be an option to detect the problematic case right after partitioning and then switch to non-LTO mode? I'm not sure it's worth the extra complexity though.

For cases where there is one big CGU but that CGU contains multiple functions, we should be able to rather easily redistribute functions to other CGUs. We'd only need a metric for the size of a TransItem. The number of MIR instructions might be a sufficient heuristic here. That would not help for inflate but it might help for other crates.
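A redistribution pass along those lines could be a simple greedy bin-packing by estimated size, largest items first (a sketch under the MIR-instruction-count assumption; `TransItem` here is a stand-in struct, not the compiler's actual type):

```rust
/// Stand-in for a translation item, sized by its MIR instruction count.
struct TransItem {
    name: String,
    mir_insts: usize,
}

/// Greedily assign items to the currently smallest CGU, largest first
/// (classic LPT scheduling). This evens out CGUs made of many functions,
/// but cannot help when a single item dominates, as in `inflate`.
fn partition(mut items: Vec<TransItem>, n_cgus: usize) -> Vec<Vec<TransItem>> {
    items.sort_by(|a, b| b.mir_insts.cmp(&a.mir_insts));
    let mut cgus: Vec<Vec<TransItem>> = (0..n_cgus).map(|_| Vec::new()).collect();
    let mut sizes = vec![0usize; n_cgus];
    for item in items {
        let smallest = (0..n_cgus).min_by_key(|&i| sizes[i]).unwrap();
        sizes[smallest] += item.mir_insts;
        cgus[smallest].push(item);
    }
    cgus
}

fn main() {
    // One huge function plus several small ones, split across 4 CGUs.
    let items: Vec<TransItem> = (0..8)
        .map(|i| TransItem {
            name: format!("fn{}", i),
            mir_insts: if i == 0 { 1000 } else { 10 },
        })
        .collect();
    let cgus = partition(items, 4);
    // The huge function ends up alone; the small ones share the other CGUs.
    assert!(cgus.iter().any(|c| c.len() == 1 && c[0].mir_insts == 1000));
    println!("CGU sizes: {:?}", cgus.iter().map(|c| c.len()).collect::<Vec<_>>());
}
```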

In conclusion, judging from the table above, I think we should make cgus=16 + ThinLTO the default sooner rather than later. Open questions I see:

  • What to do about -Zinline-in-all-cgus? I would have thought that it would always be a clear win to disable this when ThinLTO is enabled but that doesn't seem to be the case.
  • Are there hardware configurations where this never is a win? E.g. should we default to traditional compilation if num_cpus <= 2?

@johnthagen (Contributor)

FWIW, as the author of rust-belt, which is a made-for-fun Piston game, I am more than happy to take a small compile-time regression when it seems like the vast majority of crates get huge improvements!

@alexcrichton (Member Author)

In the non-incremental case it might be an option to detect the problematic case right after partitioning and then switch to non-LTO mode? I'm not sure it's worth the extra complexity though.

This seems like a good thing to have in our back pocket, but I'm also wary of trying to do this by default. For example, the inflate case is one where we may wish to disable ThinLTO, but the one massive function could also critically rely on a function in a different CGU being inlined. That sort of case would be difficult for us to detect... In any case, I agree that we probably don't need the complexity just yet, so I think we can defer this for later.

For cases where there is one big CGU but that CGU contains multiple functions, we should be able to rather easily redistribute functions to other CGUs.

Agreed! So far, in all the crates I've seen, instruction count has been quite a good metric for how long a CGU takes in all LLVM-related passes, so counting MIR instructions sounds reasonable to me as well. Again, though, I also feel like this is OK to have in our back pocket. I'd want to dig more into the tokio-webpush-simple minor-change example; I'm curious how skewed the CGU distribution is there and whether a heuristic like this would help.

What to do about -Zinline-in-all-cgus?

Quite surprisingly, I've yet to see any data showing that it's beneficial to compile time in release mode or that it doesn't hurt runtime. (All the data so far says it hurts both compile-time and runtime performance!)

That being said we're seeing such big wins from ThinLTO on big projects today that we can probably just save this as a possible optimization for the future.

On the topic of inline functions I recently dug up #10212 again (later closed in favor of #14527, but the latter may no longer be true today). I suspect that may "fix" quite a bit of the compile time issue here without coming at a loss of performance? In any case possibly lots of interesting things we could do there.

Are there hardware configurations where this never is a win? E.g. should we default to traditional compilation if num_cpus <= 2?

Also a good question! I find this to be a difficult one, however, because the CGU count affects the output artifact, which means that if we do this sort of probing it'll be beneficial for performance but come at the cost of more difficult deterministic builds. You'd have to specify the CGU count manually, or build on the same-ish hardware, to get a deterministic build, I think?

I think, though, that if you have 1 CPU then this is universally a huge regression. With one CPU we'd split the preexisting single huge CGU into N different ones, probably take roughly the same amount of time to optimize those, and then tack on ThinLTO and more optimization passes. My guess is that with one CPU we'd easily see 50% regressions.

With 2+ CPUs, however, I'd expect to see compile-time benefits. Anything giving us twice the resources to churn through the CGUs should, in theory, start seeing wins quickly.

For now, though, the deterministic-builds concern wins me over in favor of leaving this as-is for all hardware configurations. That, and I doubt anyone's compiling Rust on single-core machines nowadays!

@alexcrichton (Member Author)

One thing I think that's also worth pointing out is that up to this point we've mostly been measuring the runtime of an entire cargo build. That's actually, I believe, the absolute worst case scenario for where ThinLTO will provide benefit. Despite this, it's showing huge improvements for lots of projects!

The benefit of ThinLTO and multiple CGUs is leveraging otherwise idle parallelism on the build machine. It's overall increasing the amount of work the compiler does. For a cargo build from scratch, though, you typically already have tons of crates compiling for the first half of the build in parallel. In that sense there's not actually any idle parallelism.

Put another way, ThinLTO and multiple CGUs should only be beneficial for builds which don't have many crates compiling in parallel for long parts of the build. If a build is 100% parallel the entire time, then ThinLTO will likely regress compile-time performance.

Now you might realize, however, that one very common case where you're only building one crate is in an incremental build! Typically if you do an incremental build you're only building a handful of crates, often serially. In that sense I think that there's some massive wins of ThinLTO + multiple CGUs in incremental builds rather than entire crate builds.

Although improving both is of course great as well :)

@michaelwoerister (Member)

For now though the deterministic builds wins me over in terms of leaving this as-is for all hardware configurations. That and I doubt anyone's compiling Rust on single-core machines nowadays!

For deterministic builds you have to do some extra configuration anyway (e.g. path remapping) so I would not consider that a blocker. And I guess there are single core VMs around somewhere. But all of this is such a niche case that I don't really care :)

The benefit of ThinLTO and multiple CGUs is leveraging otherwise idle parallelism on the build machine.

MIR-only RLIBs should put us into a pretty good spot regarding this, so I'm quite confident that we're on the right path overall.

@michaelwoerister (Member)

What to do about -Zinline-in-all-cgus?

Quite surprisingly I've yet to see any data that it's beneficial to compile time in release mode or doesn't hurt runtime. (all data is that it hurts both compile time and runtime performance!)

That's really interesting. With incremental compilation enabled the situation might be different (because pre-trans inlining hurts re-use), but for the non-incremental case it sounds like it's pretty clear what to do.

@michaelwoerister (Member)

On the topic of inline functions I recently dug up #10212 again.

Yeah, maybe we should revisit this at some point. Although I have to say the current solution of only internal and external symbols and nothing in between is really nice and simple.

@alexcrichton (Member Author)

Hm yeah that's a good point about needing configuration anyway for deterministic builds. It now seems like a more plausible route to take!

Also, that's a very interesting point about incremental compilation and inlining on our own end... Maybe we should dig more into those ThinLTO runtime regressions at some point!

Also yeah I don't really want to change how we trans inline functions just yet, I do like the simplicity too :)

@jrmuizel (Contributor)

This codegen unit appears to be basically entirely dominated by one function.

@alexcrichton out of curiosity, how did you figure this out, and which function was it? I've wanted to look into why the webrender build is so slow and would welcome tips.

@alexcrichton (Member Author)

@jrmuizel oh sure, I'd love to explain! I originally found inflate as a problematic crate when compiling rust-belt, as it just took a while and I decided to dig deeper. I checked out the crate and ran:

$ RUSTFLAGS='-Z trans-time-graph -Z thinlto -C codegen-units=16' cargo +nightly build --release
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling adler32 v1.0.2
   Compiling inflate v0.3.3 (file:///home/alex/code/inflate)
    Finished release [optimized] target(s) in 16.47 secs

That drops inflate-timings.html in the current directory, and opening that up I see:

[screenshot, 2017-10-20: the -Z trans-time-graph output for inflate]

The graph here isn't always the easiest to read, but we've clearly got two huge bars, both of which correspond to taking a huge amount of time for that one CGU (first is optimization, second is ThinLTO + codegen).

Next I ran:

$ RUSTFLAGS='-Z trans-time-graph -Z thinlto -C codegen-units=16' cargo +nightly rustc --release -- --emit llvm-ir

and that command dumps a bunch of IR files into target/release/deps. Our interesting CGU is inflate0 so I opened up target/release/deps/inflate-9778032a59339daf.inflate0-3828e589be87d42871e8ca3be2241a9d.rs.rust-cgu.ll (we sure do love our long filenames).

Inside that file it was 70k lines and some poking around showed that one function was 66k lines of IR.

I sort of forget now how I went from that IR to determining that there was one huge function in there, though...
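For the record, one mechanical way to get from an IR dump to the offending function is to count lines per `define` block (a quick-and-dirty sketch; it relies on LLVM's textual IR convention of `define` lines and each function's closing `}` starting at column 0):

```rust
use std::fs;

/// Return (function header, line count) for the longest function in a
/// textual LLVM IR (.ll) file.
fn biggest_function(ir: &str) -> Option<(String, usize)> {
    let mut best: Option<(String, usize)> = None;
    let mut current: Option<(String, usize)> = None;
    for line in ir.lines() {
        if line.starts_with("define ") {
            // Start of a function body; remember its header line.
            current = Some((line.to_string(), 1));
        } else if line.starts_with('}') {
            // End of the current function; keep it if it's the biggest so far.
            if let Some((header, count)) = current.take() {
                let count = count + 1; // include the closing brace line
                if best.as_ref().map_or(true, |b| count > b.1) {
                    best = Some((header, count));
                }
            }
        } else if let Some((_, ref mut count)) = current {
            *count += 1;
        }
    }
    best
}

fn main() {
    let path = match std::env::args().nth(1) {
        Some(p) => p,
        None => {
            eprintln!("usage: biggest-fn <file.ll>");
            return;
        }
    };
    let ir = fs::read_to_string(&path).expect("failed to read IR file");
    if let Some((header, lines)) = biggest_function(&ir) {
        println!("{} lines: {}", lines, header);
    }
}
```

Pointed at the inflate0 CGU dump above, something like this would print the 66k-line function directly instead of requiring manual poking around.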

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 20, 2017
This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for rust-lang#45320 and shaking out bugs.
bors added a commit that referenced this issue Oct 20, 2017
rustbuild: Compile rustc with ThinLTO

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.
bors added a commit that referenced this issue Oct 22, 2017
rustbuild: Compile rustc with ThinLTO

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.
alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 25, 2017
This commit moves the standard library to get compiled with multiple codegen
units and ThinLTO like the compiler itself. This I would hope is the last major
step towards closing out rust-lang#45320
alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 31, 2017
This commit moves the standard library to get compiled with multiple codegen
units and ThinLTO like the compiler itself. This I would hope is the last major
step towards closing out rust-lang#45320
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 21, 2017
This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
kennytm added a commit to kennytm/rust that referenced this issue Dec 23, 2017
…haelwoerister

rustc: Set release mode cgus to 16 by default

This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
kennytm added a commit to kennytm/rust that referenced this issue Dec 23, 2017
…haelwoerister

rustc: Set release mode cgus to 16 by default

This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 24, 2017
This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes rust-lang#45320
@m4b (Contributor)

m4b commented Dec 24, 2017

I don't know if this is the right place, but I think it should be decided whether the symbol issues reported in the Nightly section of #46552 are a blocker or not. See also the discussion in the commit comment. ;)

The tl;dr is that certain versions of LLVM (I've seen this on 3.8, and on whatever LLVM rustc nightly uses) append seemingly random garbage to the end of some symbol names, e.g. we get:

_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE.llvm.D64EB761

instead of:

_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE

This knocks the debuginfo out of sync (it doesn't have the garbage appended).
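Until that's fixed, tooling that matches symbols against debuginfo can work around it by truncating the name at the suffix (a workaround sketch; the helper name is made up):

```rust
/// Strip an LLVM-appended `.llvm.<hex>` suffix from a mangled symbol name
/// so it matches the name the debuginfo records.
fn strip_llvm_suffix(symbol: &str) -> &str {
    match symbol.find(".llvm.") {
        Some(idx) => &symbol[..idx],
        None => symbol,
    }
}

fn main() {
    let sym = "_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE.llvm.D64EB761";
    assert_eq!(
        strip_llvm_suffix(sym),
        "_ZN3std3sys4unix2os8ENV_LOCK17hbf5ac5d1fa9db31cE"
    );
    // Names without the suffix pass through untouched.
    assert_eq!(strip_llvm_suffix("main"), "main");
}
```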

I've been able to repro this with clang 3.8 for C++ files as well (I haven't tested other LLVM versions), and switching to 5.0 seems to have fixed the issue.

I don't know if it's within reach, but perhaps we should attempt upgrading to LLVM 5.0 before releasing this on stable?

Note I understand this is for release mode, but I see this in debug mode on rustc nightly right now as well...

bors added a commit that referenced this issue Dec 25, 2017
rustc: Set release mode cgus to 16 by default

This commit is the next attempt to enable multiple codegen units by default in
release mode, getting some of those sweet, sweet parallelism wins by running
codegen in parallel. Performance should not be lost due to ThinLTO being on by
default as well.

Closes #45320
6 participants