
rustc: Don't inline in CGUs at -O0 #45075

Merged
merged 1 commit into rust-lang:master from alexcrichton:inline-less on Oct 9, 2017

Conversation

9 participants
@alexcrichton (Member) commented Oct 6, 2017

This commit tweaks the behavior of inlining functions into multiple codegen
units when rustc is compiling in debug mode. Today rustc will unconditionally
treat `#[inline]` functions by translating them into all codegen units that
they're needed within, marking the linkage as `internal`. This commit changes
the behavior so that in debug mode (compiling at `-O0`) rustc will instead only
translate `#[inline]` functions into *one* codegen unit, forcing all other
codegen units to reference this one copy.
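As a rough illustration (not code from this PR), suppose an `#[inline]` function is called from two modules that end up in different codegen units:

```rust
#[inline]
pub fn helper(x: u32) -> u32 {
    x.wrapping_mul(3).wrapping_add(1)
}

pub mod a {
    // Before this PR, at -O0 this codegen unit would get its own `internal`
    // copy of `helper`; after it, the CGU just references one shared copy.
    pub fn use_helper() -> u32 {
        super::helper(10)
    }
}

pub mod b {
    // Likewise here: no second copy of `helper` is translated at -O0 anymore.
    pub fn use_helper() -> u32 {
        super::helper(20)
    }
}
```

With optimizations enabled the old per-CGU copies are kept, since LLVM can only inline functions defined in the same module.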

The goal here is to improve debug compile times by reducing the amount of
translation that happens on behalf of multiple codegen units. It was discovered
in #44941 that increasing the number of codegen units had the adverse side
effect of increasing the overall work done by the compiler, and the suspicion
here was that the compiler was inlining, translating, and codegen'ing more
functions with more codegen units (for example `String` would be basically
inlined into all codegen units if used). The strategy in this commit should
reduce the cost of `#[inline]` functions to that of one codegen unit: inline
functions are translated and codegen'd only once.

Collected data shows that this does indeed improve the situation from before:
the overall cpu-clock time increases at a much slower rate, and when pinned to
one core rustc does not consume significantly more wall-clock time than with one
codegen unit.

One caveat of this commit is that the symbol names for inlined functions that
are only translated once needed some slight tweaking. These inline functions
could be translated into multiple crates, and we need to make sure the symbols
don't collide, so the crate name/disambiguator is mixed into the symbol name
hash in these situations.
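To make that caveat concrete, here is a minimal sketch of the idea (not the compiler's actual symbol-mangling code; the function and inputs below are made up for illustration), using only the standard library's hasher:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Once an #[inline] function is translated only once per crate with
// non-internal linkage, two downstream crates could otherwise emit the same
// symbol for it. Mixing the instantiating crate's name and disambiguator into
// the hash keeps the two symbols distinct.
fn symbol_hash(item_path: &str, instantiating_crate: &str, disambiguator: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    item_path.hash(&mut hasher);
    instantiating_crate.hash(&mut hasher);
    disambiguator.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The same item instantiated from two different crates now hashes differently.
    let a = symbol_hash("alloc::string::String::new", "crate_a", 0x1234);
    let b = symbol_hash("alloc::string::String::new", "crate_b", 0x5678);
    assert_ne!(a, b);
    println!("{:016x} vs {:016x}", a, b);
}
```

The idea is that this per-crate hash feeds into the mangled symbol name, so two crates that each pull in the same `#[inline]` function no longer export clashing symbols.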

@rust-highfive (Collaborator) commented Oct 6, 2017

r? @eddyb

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton (Member, Author) commented Oct 6, 2017

@@ -280,75 +280,74 @@ fn place_root_translation_items<'a, 'tcx, I>(tcx: TyCtxt<'a, 'tcx, 'tcx>,
let mut internalization_candidates = FxHashSet();

for trans_item in trans_items {
let is_root = trans_item.instantiation_mode(tcx) == InstantiationMode::GloballyShared;

@alexcrichton (Member, Author) commented Oct 6, 2017

This file's diff is probably best viewed with no whitespace (very few changes here, just indentation)
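Roughly, the partitioning decision this diff touches can be sketched like this (a self-contained approximation, not the actual rustc code; the enum and helper names below are illustrative):

```rust
// Rough sketch: at -O0 every item, including #[inline] functions, is treated
// as "globally shared" (translated into a single CGU and referenced from the
// others), whereas with optimizations an #[inline] function still gets a
// local copy in every CGU that uses it, so LLVM can inline it.
#[derive(PartialEq)]
enum InstantiationMode {
    GloballyShared, // one definition; other CGUs link against it
    LocalCopy,      // an internal copy in every CGU that needs it
}

struct TransItem {
    is_inline_fn: bool,
}

fn instantiation_mode(item: &TransItem, optimize: bool) -> InstantiationMode {
    if !optimize {
        // The change in this PR: at -O0, never duplicate items across CGUs.
        return InstantiationMode::GloballyShared;
    }
    if item.is_inline_fn {
        InstantiationMode::LocalCopy
    } else {
        InstantiationMode::GloballyShared
    }
}

fn is_root(item: &TransItem, optimize: bool) -> bool {
    // Mirrors the line in the hunk above: roots are the globally shared items.
    instantiation_mode(item, optimize) == InstantiationMode::GloballyShared
}

fn main() {
    let inline_fn = TransItem { is_inline_fn: true };
    assert!(is_root(&inline_fn, false)); // -O0: translated once, shared
    assert!(!is_root(&inline_fn, true)); // optimized: per-CGU copy, as before
}
```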

@alexcrichton (Member, Author) commented Oct 6, 2017

Oh, I should also mention that I only changed the compiler's default behavior at -O0. My thinking was that if we don't have any way to inline into all codegen units then we basically kill performance for optimized multi-codegen-unit builds, so I didn't want to tamper with anyone relying on that. Once we have ThinLTO, however, I believe that will be the avenue by which we achieve inlining, so I think we could turn this behavior on by default.

@michaelwoerister (Contributor) commented Oct 7, 2017

@bors r+

Thanks, @alexcrichton! I'm excited about this :) Apart from reducing plain multi-CGU overhead, it could have quite a positive effect on incremental compilation with debuginfo enabled.

@alexcrichton it would be great if you could do some runtime performance benchmarks to see the effect of this.

One thing that I'm not sure about is whether it is a good idea to also apply this to drop-glue and shims. With the changes in this PR they will all end up in one big codegen unit (the "fallback" codegen unit), and that might not be what we want. Also, characteristic_def_id_of_trans_item() should probably be made smarter before extending this to scenarios other than -O0. The implementation still assumes that it never has to deal with anything other than regular functions.

@bors (Contributor) commented Oct 7, 2017

📌 Commit cb7f7f0 has been approved by michaelwoerister

@alexcrichton (Member, Author) commented Oct 7, 2017

@michaelwoerister when you say performance benchmarks, do you mean of activating this in release mode? I sort of assumed that the benchmarks could only be worse than the status quo, where we don't inline generics across codegen units but do inline #[inline] functions into all CGUs. I could check just how badly it fares, though :)
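(For readers, a rough illustration of that distinction, not rustc internals; the functions below are made up:)

```rust
// A generic function: each monomorphization is placed in a codegen unit when
// it is instantiated, but (pre-ThinLTO) it is not duplicated into every CGU
// just so LLVM can inline it across CGU boundaries.
pub fn twice<T: Clone>(x: T) -> (T, T) {
    (x.clone(), x)
}

// An #[inline] function: before this PR it was copied (with internal linkage)
// into every codegen unit that called it, even at -O0, so LLVM could inline it.
#[inline]
pub fn add_one(x: u32) -> u32 {
    x + 1
}
```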

@mersinvald (Contributor) commented Oct 7, 2017

@alexcrichton I've tested this PR on the same project as before (codegen units: three build times in seconds, average in parentheses):

1:  66.25 65.56 64.43 (65.41)
2:  63.62 63.66 65.21 (64.16)
4:  60.76 60.81 60.74 (60.77)
8:  60.88 60.96 61.31 (61.05)
16: 61.31 62.62 62.27 (62.06)
32: 63.65 64.10 64.37 (64.04)

Interesting that 2 CGUs now performs only as well as 32 (with inlining it was the fastest option), and 4 CGUs seems to be the fastest and most stable option.

Also, builds are now faster with any number of CGUs than with inlining (by roughly 10 seconds at least).

@alexcrichton (Member, Author) commented Oct 7, 2017

@mersinvald fascinating! I'm particularly curious about the huge dip using one codegen unit. Before, you mentioned that one CGU took 80.41s to compile, and now you're seeing 65.41s. That's quite a big improvement, and it definitely shouldn't be affected by this PR! (This PR should have the same performance in one-CGU mode before and after.)

Do you know what might be causing that discrepancy? I'm still surprised at the drastic difference between 4 and 32 cgus!

@mersinvald (Contributor) commented Oct 7, 2017

@alexcrichton I'll recheck now with a rustc built one commit before this PR; our code base has changed a bit, so apologies in advance for any confusion :)

@mersinvald (Contributor) commented Oct 7, 2017

@alexcrichton rechecked.

Before PR:
4:  62.87 63.40 63.77 (63.34)
32: 69.74 69.47 69.55 (69.53)

After PR:
1:  66.25 65.56 64.43 (65.41)
2:  63.62 63.66 65.21 (64.16)
4:  60.76 60.81 60.74 (60.77)
8:  60.88 60.96 61.31 (61.05)
16: 61.31 62.62 62.27 (62.06)
32: 63.65 64.10 64.37 (64.04)

So, disabling inlining clearly has a good effect; the gap between 4 and 32 CGUs is smaller.

And it's about 2 seconds faster with 4 CGUs anyway :)

@bors (Contributor) commented Oct 8, 2017

☔️ The latest upstream changes (presumably #44841) made this pull request unmergeable. Please resolve the merge conflicts.


@alexcrichton alexcrichton force-pushed the alexcrichton:inline-less branch from cb7f7f0 to 4b2bdf7 Oct 8, 2017

@alexcrichton (Member, Author) commented Oct 8, 2017

@mersinvald hm ok very interesting! Thanks so much again for taking the time to collect this data :)

Something still feels not quite right though about the 4 -> 32 codegen unit transition. Do you know if there's one crate in particular that slows down by 4 seconds? Or does everything in general just slow down a little bit?

@bors: r=michaelwoerister

@bors (Contributor) commented Oct 8, 2017

📌 Commit 4b2bdf7 has been approved by michaelwoerister

@mersinvald (Contributor) commented Oct 8, 2017

@alexcrichton if there were some cargo option to display timings for every built crate, I could measure it. Is there something like that?

@michaelwoerister (Contributor) commented Oct 9, 2017

@mersinvald Do you maybe have a mechanical hard drive instead of an SSD in the computer you are testing this on? Having more CGUs will lead to more I/O, and we are probably testing mostly on machines with at least some kind of SSD.

@michaelwoerister (Contributor) commented Oct 9, 2017

@bors p=1 (this is kind of important)

@alexcrichton (Member, Author) commented Oct 9, 2017

@mersinvald ah, unfortunately there isn't an easy way to see how long each crate takes to compile, but perhaps you know of some that take longer than others?

@bors (Contributor) commented Oct 9, 2017

⌛️ Testing commit 4b2bdf7 with merge 199e837...

bors added a commit that referenced this pull request Oct 9, 2017

Auto merge of #45075 - alexcrichton:inline-less, r=michaelwoerister
rustc: Don't inline in CGUs at -O0

@bors (Contributor) commented Oct 9, 2017

💔 Test failed - status-travis

@michaelwoerister (Contributor) commented Oct 9, 2017

> @michaelwoerister when you say performance benchmarks, you mean of activating this in release mode?

No, I was actually interested in the runtime performance difference in debug mode. In particular: is there any difference at all? Do we do any inlining with -O0? What about #[inline(always)]?

@michaelwoerister (Contributor) commented Oct 9, 2017

@bors retry

@bors (Contributor) commented Oct 9, 2017

⌛️ Testing commit 4b2bdf7 with merge 72d6501...

bors added a commit that referenced this pull request Oct 9, 2017

Auto merge of #45075 - alexcrichton:inline-less, r=michaelwoerister
rustc: Don't inline in CGUs at -O0

@aidanhs (Member) commented Oct 9, 2017

[01:24:43] Dist std stage2 (x86_64-unknown-linux-gnu -> x86_64-unknown-netbsd)
[01:26:25] Dist analysis
[01:26:25] image_src: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-std/x86_64-unknown-netbsd/release/deps/save-analysis", dst: "/checkout/obj/build/tmp/dist/rust-analysis-nightly-x86_64-unknown-netbsd-image/lib/rustlib/x86_64-unknown-netbsd/analysis"
[01:26:26] Dist src
[01:26:35] Dist extended stage2 (x86_64-unknown-netbsd)
[01:26:35] Dist cargo stage2 (x86_64-unknown-netbsd)
[01:26:35]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
[01:26:35]                                  Dload  Upload   Total   Spent    Left  Speed
[01:26:35] 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   282    0   282    0     0    847      0 --:--:-- --:--:-- --:--:--   849
[01:26:35] thread 'main' panicked at 'downloaded openssl sha256 different
[01:26:35] expected: 6b3977c61f2aedf0f96367dcfb5c6e578cf37e7b8d913b4ecb6643c3cb88d8c0
[01:26:35] found:    8552e3169b5a6071edfab25bf7dacb3bbd40ae5325bd472690047b90c1cd0589
[01:26:35] ', /checkout/src/bootstrap/native.rs:380:16
[01:26:35] note: Run with `RUST_BACKTRACE=1` for a backtrace.
[01:26:35] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap dist --host x86_64-unknown-netbsd --target x86_64-unknown-netbsd
[01:26:35] Build completed unsuccessfully in 1:24:19

Possibly #40474?

@bors (Contributor) commented Oct 9, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: michaelwoerister
Pushing 72d6501 to master...

@bors bors merged commit 4b2bdf7 into rust-lang:master Oct 9, 2017

2 checks passed:
continuous-integration/travis-ci/pr: The Travis CI build passed
homu: Test successful

@bluss bluss added the relnotes label Oct 9, 2017

@alexcrichton alexcrichton deleted the alexcrichton:inline-less branch Oct 10, 2017

kennytm added a commit to kennytm/rust that referenced this pull request Oct 13, 2017

Rollup merge of rust-lang#45209 - kennytm:treat-checksum-error-as-download-error, r=Mark-Simulacrum

rustbuild: Make openssl download more reliable.

1. Add the `-f` flag to curl, so that when the server returns 403 or 500 it fails immediately.
2. Move the checksum check into the retry loop, on the assumption that a checksum failure is due to a broken download that can be fixed by downloading again.

This PR was created in response to two recent spurious failures in rust-lang#45075 (comment) and rust-lang#45030 (comment).

r? @Mark-Simulacrum , cc @aidanhs
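A minimal sketch of that retry-plus-checksum idea (illustrative only, not the actual rustbuild code; it assumes the `sha2` crate and shells out to `curl -f`, and the function name is made up):

```rust
use std::fs;
use std::process::Command;

use sha2::{Digest, Sha256};

/// Download `url` to `path`, retrying a few times; a checksum mismatch is
/// treated like any other download failure and simply triggers another try.
fn download_with_checksum(url: &str, path: &str, expected_sha256: &str) -> Result<(), String> {
    for attempt in 1..=4 {
        // `-f` makes curl fail fast on HTTP errors (403/500) instead of saving
        // the error page as if it were the file.
        let status = Command::new("curl")
            .args(["-f", "-L", "-o", path, url])
            .status()
            .map_err(|e| e.to_string())?;

        if status.success() {
            let data = fs::read(path).map_err(|e| e.to_string())?;
            let actual: String = Sha256::digest(&data)
                .iter()
                .map(|b| format!("{:02x}", b))
                .collect();
            if actual == expected_sha256 {
                return Ok(());
            }
            eprintln!("attempt {}: checksum mismatch, retrying", attempt);
        } else {
            eprintln!("attempt {}: curl failed, retrying", attempt);
        }
    }
    Err(format!("failed to download {} after several attempts", url))
}
```

Treating a bad checksum like any other failed transfer keeps all the retry handling in one loop, which is essentially the change the rollup describes.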