Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rustbuild: Compile rustc with ThinLTO #45400

Merged
merged 2 commits into from
Oct 22, 2017

Conversation

alexcrichton
Copy link
Member

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.

@@ -175,6 +175,9 @@ fn main() {
if let Ok(s) = env::var("RUSTC_CODEGEN_UNITS") {
cmd.arg("-C").arg(format!("codegen-units={}", s));
}
if stage != "0" && env::var("RUSTC_THINLTO").is_ok() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this skips stage0 as the current stage0 doesn't have ThinLTO (but that should be coming soon) but despite that the ThinLTO implementation on beta is probably too buggy to use.

@alexcrichton
Copy link
Member Author

r? @Mark-Simulacrum

cc @michaelwoerister
cc #44675

@@ -618,6 +618,10 @@ impl<'a> Builder<'a> {
// FIXME: cargo bench does not accept `--release`
if self.config.rust_optimize && cmd != "bench" {
cargo.arg("--release");

if mode != Mode::Libstd {
cargo.env("RUSTC_THINLTO", "1");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not want this for std?

This will (afaict) override the codegen-units setting in config.toml unconditionally, and isn't gated either, both of which seem somewhat odd. I think I'm okay with the latter (we expect ThinLTO to work reliably and not lose performance) but I'd prefer that we then remove codegen-units from the configuration.

OTOH, if we are disabling this for std due to performance concerns, then I feel like it should be behind a toggle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I'm just trying to be conservative, my plan is to, after this lands, enable this for libstd. I just suspect that'll have more bugs/regressions associated with it (for whatever reason) and wanted to get the rustc wins ASAP

@Mark-Simulacrum
Copy link
Member

Once Travis is fixed, I'd also like to run a perf try on this to make sure it's not a regression in performance (don't want to eat compiler dev cycles for no reason).

@kennytm kennytm added the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Oct 20, 2017
@@ -175,6 +175,9 @@ fn main() {
if let Ok(s) = env::var("RUSTC_CODEGEN_UNITS") {
cmd.arg("-C").arg(format!("codegen-units={}", s));
}
if stage != "0" && env::var("RUSTC_THINLTO").is_ok() {
cmd.arg("-Ccodegen-units=16").arg("-Zthinlto");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we respect the rust.codegen-units settings in config.toml here instead of hard-coding "16"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all intended to be pretty temporary, and for now I'd prefer to just keep this easy to remove later as I don't think anyone's really configuring codegen units anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexcrichton Ok, please add a FIXME.

(I did set the CGU to 3 😆)

@michaelwoerister
Copy link
Member

Awesome :)

@alexcrichton
Copy link
Member Author

@bors: try

@Mark-Simulacrum should be fixed now I think

@bors
Copy link
Contributor

bors commented Oct 20, 2017

⌛ Trying commit 115d187eaed09cff8a60dacf5694d237f9b8b15f with merge 4306d764ad37fe6df6c145c2790541f26e6134aa...

@bors
Copy link
Contributor

bors commented Oct 20, 2017

☀️ Test successful - status-travis
State: approved= try=True

@Mark-Simulacrum
Copy link
Member

Looks good to me. I'll go ahead and run perf once I can and if that doesn't show any major regressions r+. Excited to improve rustc compile times!

@Mark-Simulacrum
Copy link
Member

Hm, so this looks like it regresses compiler perf by quite a bit, so I don't think we'd be ready to do this quite yet: http://perf.rust-lang.org/compare.html?start=95272a07f1fe3a82497225a4cd0d67080b8ffebf&end=4306d764ad37fe6df6c145c2790541f26e6134aa&stat=instructions%3Au. Do you think we can do something within this PR to improve that? Otherwise this seems like it'd be a net loss to compiler devs.

@alexcrichton
Copy link
Member Author

Thanks for collecting the data @Mark-Simulacrum!

I do sort of disagree with the conclusion I think though. The best thing to look at is the timing comparison rather than the instruction comparison (aka instructions mostly just show where else to look). Here we still see a "wall of red" but everything is quite modest. It looks like most incremental benchmarks fared particulary bad, but again losing at most hundreds of milliseconds.

I fear that if we accept zero runtime regressions that we will never make this change. I consider turning ThinLTO and multiple CGUs on for rustc to be an important step #45320, and if we never do this we'll never actually close #45320 as finished. The data on #45320 shows massive wins across the ecosystem. I think it's inevitable that enabling ThinLTO and multiple CGUs is going to come at a slight performance loss. It's simply architecturally different and less optimized than today's setup.

On the other hand, however, I'd like to consider the other extreme for a moment. Let's say that we had LTO enabled unconditionally for all Rust crates at 1.0.0. Clearly the compile times are sub-par and have easy wins (turn LTO off). If you disable LTO, however, you're very likely to regress on almost all benchmarks slightly. Despite this I don't think we'd ever consider enabling LTO by default!

What we are doing, however, is effectively enabling "whole crate LTO" by default today. This compilation model is mostly unprecedented in C/C++ and is often why Rust code performs so well despite not paying much attention to it. That we blocked on enabling mulitple CGUs in release mode on ThinLTO was surprising enough to me originally, but if we don't enable multiple CGUs at this juncture I fear that we never will. We should of course be honest in that ThinLTO and multiple CGUs will come at a performance regression, but the "regression" here, in my opinion, is very much in the "lost in the noise" territory. For example all of these 1-5% regressions could indicating timing differences or immediately get swallowed up by other improvements. Furthermore we continue to have an escape hatch to get localized equivalences to today's codegen, #[inline].

I don't think a policy of "zero regressions tolerated" is too helpful in this situation, and in this situation I think not merging this could be actively harmful to compile times in rustc in the long run actually. Here we see some modest regressions which may or may not improve as ThinLTO is refined upstream over time. If this actually affects code in practice there's an escape hatch through #[inline].


With my own personal opinions out of the way, I did indeed dig more into the differences between these two compilers. I chose the syntex_syntax benchmark as it showed the largest cpu-clock regression (~10%), and this benchmark is specifically for a building the syntex_syntax crate in incremental mode where the incremental cache is 100% full and has 100% cache hits (no source changes).

I was indeed able to reproduce the regression, where compile times increased from ~13s to ~14s. Taking a look at the profile data from before this change and then after this change I noticed a few things:

  • This benchmark is very heavy on HashMap as it's a huge crate with 100% cache hits in incremental, which means it's basically entirely bottlenecked on the performance there
  • Notably the before profile highest function is try_mark_green, a large incremental-related function. Afterewards the highest function (a number of them actually), are HashMap::get and related functionality.
  • This seems to imply that with one CGU the HashMap::get function was inlined, but ThinLTO decided to not inline it, however.
  • The assembly generated for HashMap::get is not the biggest thing in the world, but it's also not exactly small.

So my conclusion here is that the performance difference mostly likely does indeed just come down to inlining decisions. I'm not sure why ThinLTO didn't inline this function whereas one CGU did indeed inline it.

I personally think we should move forward regardless with this patch no matter what. I think multiple CGUs has the potential for a huge speedup of rustc compile times (trans/LLVM is an absolutely huge portion of bootstrap, parallelizing anything would be a huge boon).

Despite this, though, I'm curious to evaluate the impact of #[inline] to attempt to recover any changes from multiple CGUs. I've pushed a commit up with a few more #[inline] annotations throughout HashMap (on the hottest functions here) and we can rerun performance data to see what's up.

@alexcrichton
Copy link
Member Author

@bors: try

@bors
Copy link
Contributor

bors commented Oct 20, 2017

⌛ Trying commit 2997c03c831acbbb94c2943fd70e1e33c89e0aaa with merge 57a4ae1e8edd91e10f5d7865423b81f79e22adb6...

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for rust-lang#45320 and shaking out bugs.
@alexcrichton
Copy link
Member Author

@bors: try

@bors
Copy link
Contributor

bors commented Oct 20, 2017

⌛ Trying commit d01427f with merge 11de9ca...

bors added a commit that referenced this pull request Oct 20, 2017
rustbuild: Compile rustc with ThinLTO

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.
@bors
Copy link
Contributor

bors commented Oct 20, 2017

☀️ Test successful - status-travis
State: approved= try=True

@alexcrichton
Copy link
Member Author

Ok the new comparison I believe is less worse than the old comparison, and sure enough the before/after of the #[inline] annotations shows improvements across the board.

@Mark-Simulacrum
Copy link
Member

I'm happy with your explanation, thanks! Seems like this overall should be an improvement in compile times for the rustc itself, and only a marginal loss for the ecosystem, so I'm happy with that.
@bors r+

@bors
Copy link
Contributor

bors commented Oct 20, 2017

📌 Commit d01427f has been approved by Mark-Simulacrum

@petrochenkov
Copy link
Contributor

improvement in compile times for the rustc itself, and only a marginal loss for the ecosystem

I guess it's okay for development builds, but it looks like this affects compilers released on stable channel as well.
There are no reasons to pessimize stable compilers in favor of faster bootstrap if this bootstrap happens only once in 6 weeks.

@Mark-Simulacrum
Copy link
Member

We can disable the ThinLTO on dist builds -- but then again, I think the argument @alexcrichton makes is that we don't enable LTO either (which would net us perf wins, I'd imagine).

I'm okay with disabling ThinLTO on dist builds, but I do think it should be by default enabled for developers working on rustc, since we expect it to improve compile times of rustc. In the long run, I expect (perhaps wrongly) that ThinLTO + codgen-units will eventually be equivalent to a non-parallel optimizer, and perhaps even slightly better.

Since next beta will dist without this though I think we're okay landing this for now and we can iterate: I don't think it's worth not landing this, but if you or others disagree, feel free to r-.

@petrochenkov
Copy link
Contributor

petrochenkov commented Oct 20, 2017

we don't enable LTO either

For stable releases it would be nice to enable it as well.
I don't remember why exactly this wasn't done earlier (LTO bugs? too much memory? or maybe it was in old infra?), @alexcrichton probably knows better.

@kennytm kennytm added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 21, 2017
@bors
Copy link
Contributor

bors commented Oct 22, 2017

⌛ Testing commit d01427f with merge 1042190...

bors added a commit that referenced this pull request Oct 22, 2017
rustbuild: Compile rustc with ThinLTO

This commit enables ThinLTO for the compiler as well as multiple codegen units.
This is intended to get the benefits of parallel codegen while also avoiding
any major loss of perf. Finally this commit is also intended as further testing
for #45320 and shaking out bugs.
@bors
Copy link
Contributor

bors commented Oct 22, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: Mark-Simulacrum
Pushing 1042190 to master...

@michaelwoerister
Copy link
Member

So here are some thoughts from me:

  • It is really surprising how small the performance regression is. The trade-off here is between bors merge time and very rigorously testing ThinLTO on the one hand and having nightly builds with slightly worse runtime performance on the other hand. To me the choice is pretty clear.
  • I agree that we should build stable and beta releases with just one CGU and whatever other performance optimization we can do. I think the only reason we don't use regular LTO for stable releases is that rustc is built out of Rust dylibs which are incompatible with LTO.
  • @alexcrichton, @Mark-Simulacrum, do we have numbers on how this influences build times in CI?

Overall I'm in favor of keeping this enabled for nightly builds.

@alexcrichton alexcrichton deleted the bootstrap-thinlto branch October 23, 2017 15:51
@alexcrichton
Copy link
Member Author

@petrochenkov sorry didn't get a chance to comment on this before it merged, will continue on #45444

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants