32 codegen units may not always be better at -O0 #44941

Closed
alexcrichton opened this issue Sep 30, 2017 · 21 comments
Labels
regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

@alexcrichton
Member

Continuing discussion from #44853 (comment)

@alexcrichton alexcrichton changed the title 32 codegen units may not always be better 32 codegen units may not always be better at -O0 Sep 30, 2017
@alexcrichton
Member Author

@mersinvald unfortunately I can't read the perf files you mentioned locally, and the perf diff output isn't as illuminating as I hoped it would be.

I wonder, could you try investigating if one crate in particular takes much longer? Maybe take some of the crates that take longest to compile (sorting the rlibs by size is typically a rough measurement for this) and see how long they take?

I'm hoping that this is one crate hitting an accidentally pathological case, but it may also just be a case where everything slows down with 32 codegen units (even though we don't intend that).

One other possibility may be the linker (like ld)? In theory, though, that should show up as the final crate taking much longer to finish with 32 cgus than with 2.

@crumblingstatue
Contributor

crumblingstatue commented Sep 30, 2017

I tested compiling regex with different numbers of codegen units on my 2-core Pentium E2200.
Here are the results (in seconds):

1:  33.4
2:  28.76
4:  28.90
8:  29.57
16: 29.97
32: 31.0

It gets a decent win when the number of codegen units equals the number of cores, but gets increasingly slower beyond that.

Here are the results for rustfmt (in seconds). These are a bit wilder: it's actually faster with 4 units than with 2, but gets increasingly slower after that, to the point where 32 units is slower than 1 unit.

1:  177.31
2:  172.14
4:  167.91
8:  170.75
16: 175.80
32: 179.49

@retep998
Member

retep998 commented Sep 30, 2017

Testing with winapi 0.3 on a Ryzen 7 1700X.

| codegen-units | Time (s) |
|---|---|
| 1 | 15.17 |
| 2 | 14.63 |
| 4 | 15.18 |
| 8 | 15.44 |
| 16 | 14.68 |
| 32 | 14.89 |
| 64 | 15.22 |
| 128 | 15.22 |
| 256 | 15.92 |
| 512 | 17.34 |
| 1024 | 20.24 |
| 2048 | 27.12 |
| 4096 | 40.74 |
| 8192 | 69.23 |

@oyvindln
Contributor

oyvindln commented Oct 1, 2017

There is a pretty noticeable change in compile times on perf.rust-lang.org with the commit that changed the default number as well.

@alexcrichton
Member Author

I'm going to tag this as a regression until we can figure out what the spike is

@alexcrichton alexcrichton added regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Oct 4, 2017
@alexcrichton alexcrichton added this to the 1.22 milestone Oct 4, 2017
@alexcrichton
Member Author

Ok, looking more into the data from perf.rust-lang.org: we're measuring both instructions executed and "cpu-clock". Instruction counts certainly don't have much bearing on wall time (but serve as great indicators of when to look for a regression), and apparently cpu-clock is not the same as wall time either. I believe the cpu-clock stat is the total time summed across all threads on all cpus.
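
For anyone puzzled by that distinction, here's a minimal standalone sketch (my own illustration, unrelated to rustc's internals) of why cpu-clock, summed across threads, can exceed wall time on a multi-core machine:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Spin (burning CPU) for roughly the given wall-clock duration.
fn busy_for(d: Duration) {
    let start = Instant::now();
    while start.elapsed() < d {
        std::hint::black_box(0u64);
    }
}

fn main() {
    let wall = Instant::now();
    // Four threads, each busy for ~1s of wall time.
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| busy_for(Duration::from_secs(1))))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // Wall time is ~1s, but `perf stat -e cpu-clock` on this process
    // reports ~4s, because it sums busy time across all threads/CPUs.
    println!("wall: {:?}", wall.elapsed());
}
```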

I'm trying to rationalize the numbers we're seeing and it's not quite making sense. The regressions shown on perf.rust-lang.org imply to me that if I pin the process to one CPU we'll see that much increase in compile time, but I see virtually no increase in compile time locally. In fact, if I pin cargo/rustc to one CPU it typically gets faster with more cgus presumably because the objects are a little quicker to create overall.

Basically I'm having a lot of difficulty reproducing any slowdowns here. Assistance investigating why things are slowing down as opposed to just "things are slowing down" would be much appreciated.

@alexcrichton
Member Author

I believe these are all the reported regressions from this change

@alexcrichton
Member Author

alexcrichton commented Oct 5, 2017

Some more investigation

I found locally I was able to reproduce some slowdowns with the syntex_syntax crate:

| cgus | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| wall time (s) | 32.9 | 24.9 | 22.8 | 21.4 | 22.0 | 21.8 |
| cpu-clock (ms) | 33064 | 35218 | 40584 | 44581 | 48542 | 51788 |
| 1-core wall time (s) | 34.3 | 35.2 | 36.4 | 38.1 | 40.3 | 41.7 |

Here "wall time" is the wall time it took to compile the syntex_syntax crate on an 8 core cpu. The "cpu-clock" is the statistic reported by perf state, and "1 core wall time" is the amount of time the compilation takes when pinned to one cpu (taskset -c 0).

There's a clear trend here: parallelization is helping, but it's also taking up more CPU time. That is, increasing the number of codegen units doesn't purely divide the work; it also increases the total amount of work being done (going from 1 to 32 cgus raises cpu-clock from 33064 to 51788, roughly 57% more total work, while wall time only drops from 32.9 to 21.8 seconds). This becomes particularly noticeable with more codegen units limited to one core on this big crate.

So on one hand there's certainly some overhead from more codegen units. We're shuffling more files around (not copying, just working with), spawning more threads, more messages, more synchronization, etc. I don't think that these factors would cause such a drastic increase in compile times, however. Instead what I think may be happening here (that I have just now remembered) is how we treat #[inline].

@michaelwoerister correct me if I'm wrong, but I believe that we'll inline any #[inline] function into all codegen units instead of just one. This means that if you're using String you're probably going to get a copy of each of its functions in all of your codegen units. This to me seems like the root cause of the timing regressions here.
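
As a hypothetical illustration (the function names below are made up; this is not code from any real crate):

```rust
// With `-C codegen-units=32` at -O0, an `#[inline]` function is
// translated into every codegen unit that uses it, each copy with
// `internal` linkage.
#[inline]
pub fn bump(x: u32) -> u32 {
    x.wrapping_add(1)
}

// If `caller_a` and `caller_b` land in different codegen units, each
// unit translates and codegens its own copy of `bump`. The same holds
// for `#[inline]` std functions (e.g. `String` methods), so the total
// codegen work grows with the number of codegen units.
pub fn caller_a() -> u32 {
    bump(1)
}

pub fn caller_b() -> u32 {
    bump(2)
}
```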

As a result my conclusion is that increasing the number of codegen units increases the amount of work the compiler does, and parallelism isn't always enough to recover that extra work. In the worst case, with N codegen units you can probably do N times more work, and with fewer than N cores you'll notice the increase in compile times.

@michaelwoerister do you agree with the conclusions here? If so, do you think we could treat #[inline] differently in debug mode? Is it worth it to solve this? Can you think of another possible solution?

I'm personally hesitant to set the number of codegen units based on the number of cores on the machine. That may well yield optimal performance (perhaps with cgus at 2x the number of cores), since the extra work needed for each codegen unit should come out in the wash thanks to parallelization. It would, however, make builds nondeterministic across machines with different numbers of cpus, forcing deterministic compiles to explicitly specify the number of codegen units.

My personal preference for a solution right now is:

  • Don't inline #[inline] into all codegen units in debug mode, only inline into one codegen unit.
  • Reduce the number of codegen units from 32 to 16 here if we continue to see problems. Most machines compiling Rust today have fewer than 16 cores, so we should still get most of the parallelism benefit, while also reducing the extra cost that each additional codegen unit brings.

@michaelwoerister
Member

@alexcrichton, yes, I think your assumptions and conclusions are correct. I also think that #[inline] functions are the main source of work duplication here. There's also debuginfo, which cannot be shared between codegen units, but that is usually rather cheap afaik.

Don't inline #[inline] into all codegen units in debug mode, only inline into one codegen unit.

We probably mean the same thing here: I would suggest treating them the same as regular polymorphic functions:

  • Instantiate only "on-demand", and
  • have exactly one instance and link to that.

That might require some special handling in the symbol hash to avoid conflicts between crates, just as we do for generic functions at the moment:

```rust
// If this is an instance of a generic function, we also hash in
// the ID of the instantiating crate. This avoids symbol conflicts
// in case the same instance is emitted in two crates of the same
// project.
if substs.types().next().is_some() {
    hasher.hash(tcx.crate_name.as_str());
    hasher.hash(tcx.sess.local_crate_disambiguator().as_str());
}
```

If performance turns out to be unbearable, we could perhaps have a heuristic to still do some inlining (see the sketch after this list):

  • if the function body is small, and
  • if the function does not call other inline functions.
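
A sketch of what that heuristic could look like (hypothetical; the cutoff is an assumed value, and nothing like this exists in rustc):

```rust
/// Hypothetical predicate: duplicate an `#[inline]` function into a
/// referencing codegen unit only if re-translating it is cheap.
fn should_duplicate(body_statements: usize, calls_other_inline_fns: bool) -> bool {
    // Assumed cutoff; a real implementation would tune this, e.g. by
    // counting MIR statements in the function body.
    const SMALL_BODY: usize = 16;
    body_statements <= SMALL_BODY && !calls_other_inline_fns
}
```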

Reduce the number of codegen units from 32 to 16 here if we continue to see problems.

Seems like a sensible approach.

Other solutions?

I wonder if ThinLTO would allow us to completely get rid of duplicating bodies of inline functions across codegen units during trans. That would be awesome for incremental compilation and it would also help in the non-incremental case.

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 6, 2017
Rationale explained in the included comment as well as rust-lang#44941
@alexcrichton
Member Author

alexcrichton commented Oct 6, 2017

Some excellent news!

I implemented the strategy of "only inline functions into one codegen unit" and the updated timings I get are:

| cgus | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| wall time (s) | 30.7 | 25.2 | 21.0 | 19.0 | 18.4 | 18.7 |
| cpu-clock (ms) | 30864 | 31387 | 31876 | 32525 | 34341 | 35370 |
| 1-core wall time (s) | 31.2 | 30.8 | 30.8 | 30.6 | 30.7 | 30.6 |

In other words, I think this indeed makes a huge dent in the "more cgus causes rustc to do more work" problem. I'm going to send a PR shortly implementing this.

@michaelwoerister I also think that you're correct in that ThinLTO should eliminate the need for inlining functions into all codegen units unconditionally. Just gotta land ThinLTO and test it out first :)

@alexcrichton
Member Author

Ok I've created a PR at #45075

@mersinvald and @crumblingstatue, if y'all could test out that PR locally that'd be great! My hope is that it should make 32 codegen units continue to be better than 1, and perhaps give us flexibility to increase the number of cgus in the future.

alexcrichton added a commit to alexcrichton/rust that referenced this issue Oct 8, 2017
This commit tweaks the behavior of inlining functions into multiple codegen
units when rustc is compiling in debug mode. Today rustc will unconditionally
treat `#[inline]` functions by translating them into all codegen units that
they're needed within, marking the linkage as `internal`. This commit changes
the behavior so that in debug mode (compiling at `-O0`) rustc will instead only
translate `#[inline]` functions into *one* codegen unit, forcing all other
codegen units to reference this one copy.

The goal here is to improve debug compile times by reducing the amount of
translation that happens on behalf of multiple codegen units. It was discovered
in rust-lang#44941 that increasing the number of codegen units had the adverse side
effect of increasing the overall work done by the compiler, and the suspicion
here was that the compiler was inlining, translating, and codegen'ing more
functions with more codegen units (for example `String` would be basically
inlined into all codegen units if used). The strategy in this commit should
reduce the cost of `#[inline]` functions to being equivalent to one codegen
unit, which is only translating and codegen'ing inline functions once.

Collected [data] shows that this does indeed improve the situation from [before]
as the overall cpu-clock time increases at a much slower rate and when pinned to
one core rustc does not consume significantly more wall clock time than with one
codegen unit.

One caveat of this commit is that the symbol names for inlined functions that
are only translated once needed some slight tweaking. These inline functions
could be translated into multiple crates and we need to make sure the symbols
don't collide, so the crate name/disambiguator is mixed into the symbol name
hash in these situations.

[data]: rust-lang#44941 (comment)
[before]: rust-lang#44941 (comment)
bors added a commit that referenced this issue Oct 9, 2017

rustc: Reduce default CGUs to 16

Rationale explained in the included comment as well as #44941
bors added a commit that referenced this issue Oct 9, 2017
rustc: Don't inline in CGUs at -O0

@alexcrichton
Member Author

@mersinvald can you try re-testing with tonight's nightly which should have some improvements?

@mersinvald
Contributor

@alexcrichton sure, will do it tomorrow.

@mersinvald
Contributor

mersinvald commented Oct 12, 2017

@alexcrichton I've run the tests on rustc 1.22.0-nightly (a47c9f870 2017-10-11).

I've also measured the hot build time, i.e. only the ~200 LOC crate itself plus link time.

Cold build (s):

| cgus | run 1 | run 2 | run 3 | avg |
|---|---|---|---|---|
| 4 | 63.13 | 63.60 | 62.84 | 63.19 |
| 32 | 69.70 | 70.32 | 69.76 | 69.92 |

Hot build (s):

| cgus | run 1 | run 2 | run 3 | avg |
|---|---|---|---|---|
| 4 | 5.39 | 5.28 | 5.28 | 5.31 |
| 32 | 6.70 | 6.10 | 6.10 | 6.30 |

@alexcrichton
Member Author

alexcrichton commented Oct 18, 2017

@mersinvald ok I've done some further investigation into your benchmark. The setup was:

Results:

Build from scratch

Here I compiled the positions-api crate from scratch with a clean target directory and various settings (times in seconds).

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| debug #0 | 43.368 | 37.003 | 34.790 | 34.254 | 34.376 | 34.287 | 34.849 |
| debug #1 | 43.911 | 37.401 | 34.974 | 33.668 | 34.612 | 34.199 | 34.904 |
| release thinlto=no #0 | 91.289 | 70.930 | 63.766 | 59.373 | 57.378 | 56.717 | 56.278 |
| release thinlto=no #1 | 91.516 | 71.976 | 63.714 | 58.143 | 58.023 | 57.228 | 57.023 |
| release thinlto=yes #0 | 90.799 | 96.447 | 91.650 | 80.653 | 81.806 | 82.241 | 85.829 |
| release thinlto=yes #1 | 90.615 | 97.803 | 89.126 | 83.018 | 81.862 | 82.962 | 84.413 |

There's a suspicious jump at 2 CGUs where compilation takes longer (with ThinLTO), but it quickly flattens out, with times improving as more and more codegen units are thrown at it. Additionally, in debug mode we see wins across the board as soon as multiple codegen units are enabled.

Incremental build

Here I compiled the positions-api crate with a preexisting target directory. For these timings only the positions-api crate was rebuilt, effectively measuring the time it takes to compile just that one crate.

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| debug #0 | 5.147 | 3.945 | 3.326 | 3.129 | 3.257 | 3.265 | 3.604 |
| debug #1 | 5.310 | 3.927 | 3.399 | 3.148 | 3.308 | 3.228 | 3.415 |
| release thinlto=no #0 | 14.114 | 11.626 | 7.808 | 6.741 | 5.145 | 4.751 | 4.438 |
| release thinlto=no #1 | 14.032 | 11.648 | 7.865 | 6.635 | 4.871 | 4.657 | 4.421 |
| release thinlto=yes #0 | 14.121 | 15.107 | 11.088 | 8.836 | 6.876 | 6.368 | 6.366 |
| release thinlto=yes #1 | 14.054 | 14.960 | 10.448 | 7.283 | 7.119 | 6.345 | 6.280 |

Here, like before, debug builds leveled off pretty quickly, benefiting a good amount from multiple CGUs. Release builds didn't fare so well after 8 CGUs, but still showed improvements across the board over the original build time with just one CGU. Again, though, there's a jump at 2 CGUs in release mode, which is a little suspicious.


@mersinvald can you try running with that script and similar settings to guarantee that the only difference in the data collection here is hardware?

@lnicola
Member

lnicola commented Oct 18, 2017

Are the incremental crate-only builds 10-15x slower than full non-incremental builds?

@alexcrichton
Member Author

oh oops, I mixed up the tables

@mersinvald
Contributor

mersinvald commented Oct 19, 2017

@alexcrichton
I've run tests with the exact same revisions of the compiler and circles-server.

rustc 1.22.0-nightly (4e9527c 2017-10-16)
2-core / 4-thread (hyperthreaded) Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
SSD + 8GB RAM
Arch Linux

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| debug #0 | 7.197 | 5.804 | 5.561 | 5.512 | 5.766 | 5.832 | 6.607 |
| debug #1 | 7.170 | 5.834 | 5.533 | 5.490 | 5.782 | 5.913 | 6.277 |
| release thinlto=no #0 | 18.490 | 15.631 | 13.683 | 10.496 | 9.092 | 9.368 | 9.457 |
| release thinlto=no #1 | 18.496 | 15.671 | 13.846 | 9.143 | 9.124 | 9.391 | 9.496 |
| release thinlto=yes #0 | 18.502 | 20.199 | 18.267 | 12.506 | 13.133 | 13.334 | 13.704 |
| release thinlto=yes #1 | 18.647 | 20.059 | 15.457 | 12.428 | 12.731 | 13.368 | 13.660 |

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| debug #0 | 77.231 | 72.293 | 69.728 | 70.492 | 71.270 | 72.801 | 73.855 |
| debug #1 | 76.678 | 70.609 | 70.246 | 69.892 | 72.097 | 72.374 | 73.683 |
| release thinlto=no #0 | 162.873 | 139.603 | 130.557 | 128.175 | 130.154 | 131.654 | 132.660 |
| release thinlto=no #1 | 160.766 | 140.791 | 129.430 | 126.258 | 130.905 | 132.073 | 134.501 |
| release thinlto=yes #0 | 159.064 | 184.526 | 179.635 | 180.172 | 184.242 | 189.370 | 192.278 |
| release thinlto=yes #1 | 157.843 | 181.420 | 180.135 | 178.202 | 186.637 | 191.419 | 192.772 |

@alexcrichton
Member Author

Awesome, thanks @mersinvald!

So reading that, I'd conclude that in debug mode 16 cgus is a universal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue?

Otherwise, it looks like the proposal for turning on multiple cgus by default in release mode means that 16 cgus is universally better than 1 cgu in terms of compile time if you only compile the last crate. If you compile the entire crate graph, it looks like multiple cgus regress the build time?

@michaelwoerister
Member

So reading that, I'd conclude that in debug mode 16 cgus is a universal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue?

👍

@alexcrichton
Member Author

Ok!
