rustc: Don't inline in CGUs at -O0 #45075
Conversation
rust-highfive assigned eddyb on Oct 6, 2017
r? @eddyb (rust_highfive has picked a reviewer for you, use r? to override)
rust-highfive assigned michaelwoerister and unassigned eddyb on Oct 6, 2017
alexcrichton referenced this pull request on Oct 6, 2017:
Closed: 32 codegen units may not always be better at -O0 #44941
alexcrichton reviewed on Oct 6, 2017
@@ -280,75 +280,74 @@ fn place_root_translation_items<'a, 'tcx, I>(tcx: TyCtxt<'a, 'tcx, 'tcx>,
    let mut internalization_candidates = FxHashSet();

    for trans_item in trans_items {
        let is_root = trans_item.instantiation_mode(tcx) == InstantiationMode::GloballyShared;
alexcrichton (Author, Member) commented on Oct 6, 2017
This file's diff is probably best viewed with no whitespace (very few changes here, just indentation)
Oh, I should also mention that I only changed the compiler's default behavior in -O0 mode. My thinking was that if we don't have any way to inline into all codegen units then we basically kill performance for optimized codegen-unit builds, so I didn't want to tamper with anyone relying on that. Once we have ThinLTO, however, I believe that is the avenue by which we'd achieve inlining, so I think we could turn this behavior on by default.
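For context, the user-facing knob this interacts with is the codegen-unit count in a Cargo profile. A minimal sketch (the values here are illustrative, not part of this PR):

```toml
# Hypothetical Cargo.toml fragment: a debug build split into multiple
# codegen units. With this PR, #[inline] functions at -O0 are translated
# into just one of these units instead of being cloned into all of them.
[profile.dev]
opt-level = 0        # debug mode, i.e. -O0
codegen-units = 4    # split the crate into 4 CGUs
```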
@bors r+ Thanks, @alexcrichton! I'm excited about this. @alexcrichton, it would be great if you could do some runtime performance benchmarks to see the effect of this. One thing that I'm not sure about is whether it is a good idea to also apply this to drop-glue and shims. With the changes in this PR they will all end up in one big codegen unit (the "fallback" codegen unit) and that might not be what we want. Also
@michaelwoerister when you say performance benchmarks, you mean of activating this in release mode? I sort of assumed that the benchmarks could only be worse than this, which is where we don't inline generics across codegen units but we inline
@alexcrichton I've tested this PR on the same project as before:
Interesting that now 2 CGUs perform only as well as 32 (with inlining it was the fastest option), and 4 CGUs seems the fastest and most stable option. Also, builds are now faster with any CGU number than with inlining (approx. -10 secs at least).
@mersinvald fascinating! I'm particularly curious about the huge dip using one codegen unit. Before you mentioned that one CGU was 80.41s to compile and now you're seeing 65.41s. That's quite a big improvement, and definitely shouldn't be affected by this PR! (this PR should have the same performance in one-CGU mode before and after). Do you know what might be causing that discrepancy? I'm still surprised at the drastic difference between 4 and 32 CGUs!
@alexcrichton I'll recheck now with a one-commit-before-this-PR rustc; our code base changed a bit, so I apologize in advance for any confusion :)
@alexcrichton rechecked.
So, disabling inlining clearly has a good effect; the dip between 4 and 32 CGUs is lower. Still -2 secs with 4 CGUs anyway.
alexcrichton force-pushed the alexcrichton:inline-less branch from cb7f7f0 to 4b2bdf7 on Oct 8, 2017
@mersinvald hm ok very interesting! Thanks so much again for taking the time to collect this data :) Something still feels not quite right though about the 4 -> 32 codegen unit transition. Do you know if there's one crate in particular that slows down by 4 seconds? Or does everything in general just slow down a little bit?
@bors: r=michaelwoerister
@alexcrichton if there were some cargo option to display timings for every built crate, I could measure it. Is there something like that?
@mersinvald Do you maybe have a mechanical hard drive instead of an SSD in the computer you are testing this on? Having more CGUs will lead to more I/O, and we are probably testing mostly on machines with at least some kind of SSD.
@bors p=1 (this is kind of important)
@mersinvald ah, unfortunately there isn't anything easy for seeing how long individual crates take to compile, but you may know of some that take longer than others perhaps?
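Lacking per-crate timing output, one rough workaround is to time whole rebuilds while varying the CGU count. A hedged sketch (assumes a Cargo project; the actual build command is commented out so the loop itself is inert):

```shell
# Hypothetical timing sketch: rebuild from scratch at several CGU counts.
# Varying RUSTFLAGS changes the codegen-unit count per build.
for n in 1 2 4 32; do
  flags="-C codegen-units=$n"
  echo "building with RUSTFLAGS='$flags'"
  # cargo clean && time RUSTFLAGS="$flags" cargo build
done
```

Comparing the wall-clock times across runs would at least narrow down whether the 4 -> 32 slowdown is uniform or concentrated in one build.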
bors added a commit that referenced this pull request on Oct 9, 2017
No, I was actually interested in the runtime performance difference in debug mode. In particular: is there any difference at all? Do we do any inlining with
@bors retry
bors added a commit that referenced this pull request on Oct 9, 2017
carols10cents added the S-waiting-on-bors label on Oct 9, 2017
Possibly #40474?
alexcrichton commented on Oct 6, 2017
This commit tweaks the behavior of inlining functions into multiple codegen
units when rustc is compiling in debug mode. Today rustc will unconditionally
treat #[inline] functions by translating them into all codegen units that
they're needed within, marking the linkage as internal. This commit changes
the behavior so that in debug mode (compiling at -O0) rustc will instead only
translate #[inline] functions into one codegen unit, forcing all other
codegen units to reference this one copy.
The goal here is to improve debug compile times by reducing the amount of
translation that happens on behalf of multiple codegen units. It was discovered
in #44941 that increasing the number of codegen units had the adverse side
effect of increasing the overall work done by the compiler, and the suspicion
here was that the compiler was inlining, translating, and codegen'ing more
functions with more codegen units (for example, String would basically be
inlined into all codegen units if used). The strategy in this commit should
reduce the cost of #[inline] functions to being equivalent to one codegen
unit, which is only translating and codegen'ing inline functions once.
Collected data shows that this does indeed improve the situation from before
as the overall cpu-clock time increases at a much slower rate and when pinned to
one core rustc does not consume significantly more wall clock time than with one
codegen unit.
One caveat of this commit is that the symbol names for inlined functions that
are only translated once needed some slight tweaking. These inline functions
could be translated into multiple crates and we need to make sure the symbols
don't collide, so the crate name/disambiguator is mixed into the symbol name
hash in these situations.
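To make the mechanism concrete, here is a hedged sketch (the function is invented for illustration, not taken from the PR): #[inline] is only a hint, and with this change a -O0 build materializes the function's machine code in a single CGU that other CGUs reference, rather than stamping an internal-linkage copy into every CGU that calls it. Observable program behavior is identical either way.

```rust
// Hypothetical example: under this PR, at -O0 this #[inline] function is
// translated into one codegen unit and other CGUs call that single copy,
// instead of each CGU receiving its own internal-linkage clone.
#[inline]
pub fn checked_double(x: u32) -> Option<u32> {
    // checked_mul returns None on overflow instead of panicking/wrapping
    x.checked_mul(2)
}

fn main() {
    // Semantics are unchanged regardless of CGU count or inlining strategy;
    // only where the code is emitted differs.
    assert_eq!(checked_double(21), Some(42));
    assert_eq!(checked_double(u32::MAX), None);
}
```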