rustc: Default 32 codegen units at O0 #44853

alexcrichton · 2017-09-25T22:57:31Z

This commit changes the default of rustc to use 32 codegen units when compiling
in debug mode, typically an opt-level=0 compilation. Since their inception
codegen units have matured quite a bit, gaining features such as:

* Parallel translation and codegen enabling codegen units to get worked on even
  more quickly.
* Deterministic and reliable partitioning through the same infrastructure as
  incremental compilation.
* Global rate limiting through the jobserver crate to avoid overloading the
  system.

The largest benefit of codegen units has forever been faster compilation through
parallel processing of modules on the LLVM side of things, using all the cores
available on build machines that typically have many available. Some downsides
have been fixed through the features above, but the major downside remaining is
that using codegen units reduces opportunities for inlining and optimization.
This, however, doesn't matter much during debug builds!

In this commit the default number of codegen units for debug builds has been
raised from 1 to 32. This should enable most cargo build compiles that are
bottlenecked on translation and/or code generation to immediately see speedups
through parallelization on available cores.

Work is being done to always enable multiple codegen units (and therefore
parallel codegen), but it requires at least #44841 to be landed and stabilized.
Stay tuned if you're interested in that aspect!

alexcrichton · 2017-09-25T22:57:39Z

r? @michaelwoerister

rust-highfive · 2017-09-25T22:57:43Z

r? @nikomatsakis

(rust_highfive has picked a reviewer for you, use r? to override)

ishitatsuyuki · 2017-09-26T00:47:11Z

I feel this is a Cargo thing.

daboross · 2017-09-26T06:22:44Z

@ishitatsuyuki Cargo can definitely control this - but we want a sane default when running raw rustc too, right?

This will make it so that if no other configuration happens - either via rustc parameters, or cargo configuration, rustc defaults to a very parallel build of a single crate.

Beginner projects which don't use cargo shouldn't have to have a slow build just because they aren't using cargo. This makes a sane default for all uses of rustc.

michaelwoerister · 2017-09-26T08:00:50Z

This is where the jobserver (for cpu resource management) and async-llvm (for peak memory consumption) really pay off :)

r=me with the tests fixed.

michaelwoerister · 2017-09-26T08:04:30Z

On a separate note: I don't like how we often duplicate things between sess.opts and sess.opts.cg/sess.opts.debugging_opts, where only one of them is the correct value but both of them are accessible. But that's not something to solve in this PR.

michaelwoerister · 2017-09-26T08:09:31Z

What's the situation with perf.rlo? Are we still limiting benchmarking to a single core there?
cc @Mark-Simulacrum

Mark-Simulacrum · 2017-09-26T12:39:40Z

To my knowledge, all benchmarks on perf.rlo currently are using 8 threads of parallelism. @alexcrichton may be able to correct me if I recall incorrectly.

alexcrichton · 2017-09-26T15:18:19Z

@bors: r=michaelwoerister

bors · 2017-09-26T15:18:21Z

📌 Commit 9e35b79 has been approved by michaelwoerister

alexcrichton · 2017-09-26T15:18:57Z

@michaelwoerister

agreed that the duplication is unfortunate! I'd hope that one day we could just use functions to access these rather than accessing fields, but agreed that this is probably best left for a future PR

mersinvald · 2017-09-26T16:04:46Z

Forwarding here a comment that I accidentally left in #44841

I've just run some build time benchmarks on my project that uses a lot of popular rust libraries and codegen (diesel, hyper, serde, tokio, futures, reqwest) on my Intel Core i5 laptop (skylake 2c/4t) and got these results:

1 unit:    92 secs
2 units:   81 secs
4 units:   83 secs
8 units:   85 secs
16 units:  90 secs
32 units:  102 secs

Cargo profile:

# The development profile, used for `cargo build`.
[profile.dev]
opt-level = 0
debug = true 
lto = false
debug-assertions = true 
codegen-units = N

rustc 1.22.0-nightly (17f56c5 2017-09-21)

As expected, the best results are with the number of codegen units equal to the number of CPUs, and 32 is way too much for an average machine.

Did you consider an option to select the number of codegen units depending on the number of CPUs, with the num_cpus crate?

Thank you for working on compile times!
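
For illustration, the CPU-based selection suggested above could look roughly like the sketch below. It is only an assumption about what such a policy might be, not what rustc does: `suggested_codegen_units` and the cap of 32 are hypothetical, and the standard library's `std::thread::available_parallelism` is swapped in for the num_cpus crate so that the example has no dependencies.

```rust
use std::thread;

/// Hypothetical policy: one codegen unit per logical CPU, never fewer
/// than 1 and never more than `cap`.
fn suggested_codegen_units(cpus: usize, cap: usize) -> usize {
    // `cap.max(1)` keeps the clamp range valid even if a cap of 0 is passed.
    cpus.clamp(1, cap.max(1))
}

fn main() {
    // Fall back to a single unit if the CPU count cannot be determined.
    let cpus = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!(
        "{} logical CPUs -> {} codegen units",
        cpus,
        suggested_codegen_units(cpus, 32)
    );
}
```

On the 2c/4t laptop described above this policy would pick 4 units, which is within the range that measured fastest there.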

alexcrichton · 2017-09-26T17:39:27Z

@mersinvald fascinating!

First up though, can you clarify what you were measuring? Was it a cargo build of the entire workspace? Just one crate? From a fresh target directory?

Locally I ran cargo build --all for the entire workspace with the latest nightly (rustc 1.22.0-nightly (6c476ce46 2017-09-25)) and got the following timings

cgus= 1 Duration { secs: 97, nanos: 667334447 }
cgus= 2 Duration { secs: 91, nanos: 974436776 }
cgus= 4 Duration { secs: 88, nanos: 860553853 }
cgus= 8 Duration { secs: 86, nanos: 138881102 }
cgus=16 Duration { secs: 84, nanos: 37066957 }
cgus=32 Duration { secs: 84, nanos: 907956016 }

oddly though sometimes it was very variable what the build times were...

cgus= 1 Duration { secs: 99, nanos: 325352640 }
cgus= 2 Duration { secs: 90, nanos: 759988443 }
cgus= 4 Duration { secs: 88, nanos: 654634816 }
cgus= 8 Duration { secs: 86, nanos: 381700701 }
cgus=16 Duration { secs: 87, nanos: 243640455 }
cgus=32 Duration { secs: 47, nanos: 899950745 }

I've got an 8 core machine locally, but the number of cores vs number of codegen units should have little effect on compile time (in theory). The codegen units are chosen to be explicitly high here to hopefully make sure that no codegen unit takes too long in the optimizer, allowing ideally for optimal use of all available cores throughout compilation. Additionally, more cgus should mean a lower peak memory of rustc itself due to async translation/codegen.
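
The load-balancing argument in the previous paragraph can be sketched with a toy scheduling model. This is an idealized illustration, not rustc's actual scheduler: `makespan` is a hypothetical helper, costs are abstract time units, and the model deliberately ignores any per-unit overhead, which is exactly the kind of cost the measurements in this thread are probing.

```rust
/// Greedy list scheduling: each unit is handed to the worker that becomes
/// free earliest. Returns the time at which the last worker finishes.
fn makespan(unit_costs: &[u64], workers: usize) -> u64 {
    // Always keep at least one worker so the lookup below cannot fail.
    let mut finish_times = vec![0u64; workers.max(1)];
    for &cost in unit_costs {
        // Index of the least-loaded worker (first one wins on ties).
        let idx = finish_times
            .iter()
            .enumerate()
            .min_by_key(|&(_, t)| *t)
            .map(|(i, _)| i)
            .unwrap();
        finish_times[idx] += cost;
    }
    finish_times.into_iter().max().unwrap_or(0)
}

fn main() {
    let cores = 8;
    // The same 32 units of total work, partitioned three different ways.
    let one_unit = [32u64];
    let eight_uneven = [11u64, 3, 3, 3, 3, 3, 3, 3];
    let thirty_two_even = [1u64; 32];

    println!("1 unit:          {}", makespan(&one_unit, cores));
    println!("8 uneven units:  {}", makespan(&eight_uneven, cores));
    println!("32 even units:   {}", makespan(&thirty_two_even, cores));
}
```

In this model a single oversized unit dominates the total time, while many small units keep all cores busy. Real builds add costs the model leaves out, so it explains the intent behind a high default rather than predicting actual build times.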

Are you sure you didn't have anything else running in the background when you were collecting that timing? And were the timings you got reproducible?
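
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of a timing harness. The thread does not say how the numbers above were collected, so this is an assumption: it passes the real rustc flag `-C codegen-units=N` through Cargo's `RUSTFLAGS` environment variable and times a clean `cargo build --all`. The helper names `rustflags_for` and `time_build` are hypothetical.

```rust
use std::io;
use std::process::Command;
use std::time::{Duration, Instant};

/// Builds the RUSTFLAGS value requesting `cgus` codegen units.
fn rustflags_for(cgus: u32) -> String {
    format!("-C codegen-units={}", cgus)
}

/// Runs `cargo clean`, then times `cargo build --all` in the current
/// directory. Only failure to launch cargo is reported as an error; a
/// failing build is still timed, so check the build output separately.
fn time_build(cgus: u32) -> io::Result<Duration> {
    Command::new("cargo").arg("clean").status()?;
    let start = Instant::now();
    Command::new("cargo")
        .args(&["build", "--all"])
        .env("RUSTFLAGS", rustflags_for(cgus))
        .status()?;
    Ok(start.elapsed())
}

fn main() {
    for &cgus in &[1u32, 2, 4, 8, 16, 32] {
        match time_build(cgus) {
            Ok(elapsed) => println!("cgus={:2} {:?}", cgus, elapsed),
            Err(err) => eprintln!("cgus={:2} could not run cargo: {}", cgus, err),
        }
    }
}
```

Running each configuration several times on an otherwise idle machine helps separate real differences from the run-to-run variance mentioned above.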

mersinvald · 2017-09-26T19:09:53Z

@alexcrichton

> First up though, can you clarify what you were measuring?

I'm building with cargo build from the root of the repo, so it's a whole workspace build
Before every iteration I do cargo clean

Good point about background tasks, though. I disabled everything that can eat up CPU to get more steady results and re-ran each build three times:

1:  80.79 79.97 80.47 (avg 80.41)
2:  74.5 73.91 74.75  (avg 74.38)
4:  75.84 75.48 75.83 (avg 75.71)
8:  79.39 78.7 78.27  (avg 78.78)
16: 80.68 80.70 80.75 (avg 80.71)
32: 83.89 83.81 84.26 (avg 83.98)

Results seem to be quite steady and reproducible

2-4 units are optimal for my setup.
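
As a small worked example, the per-configuration averages and their change relative to a single codegen unit can be recomputed from the three runs listed above. The `mean` helper is hypothetical, and the last printed digit may differ slightly from the quoted averages depending on rounding.

```rust
/// Arithmetic mean of the samples (NaN for an empty slice).
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

fn main() {
    // (codegen units, three wall-clock times in seconds) from the table above.
    let runs: [(u32, [f64; 3]); 6] = [
        (1, [80.79, 79.97, 80.47]),
        (2, [74.5, 73.91, 74.75]),
        (4, [75.84, 75.48, 75.83]),
        (8, [79.39, 78.7, 78.27]),
        (16, [80.68, 80.70, 80.75]),
        (32, [83.89, 83.81, 84.26]),
    ];

    let baseline = mean(&runs[0].1);
    for (cgus, times) in runs.iter() {
        let avg = mean(times);
        // Negative percentages are faster than the 1-unit baseline.
        println!(
            "{:>2} units: avg {:.2}s, {:+.1}% vs 1 unit",
            cgus,
            avg,
            (avg / baseline - 1.0) * 100.0
        );
    }
}
```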

Btw, I've updated rustc to the latest nightly version before running new tests, so now it is:

Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
rustc 1.22.0-nightly (6c476ce 2017-09-25)
Linux 4.12.1 #8 SMP PREEMPT Sun Jul 16 01:00:39 MSK 2017 x86_64 GNU/Linux

alexcrichton · 2017-09-26T19:50:15Z

@mersinvald hm so if you're on Linux, mind poking around with perf? Could you try a perf record of 2 cgus and a perf record of 32 cgus? I think perf diff will work here for looking at the comparison.

Otherwise though this is indeed curious! It may be worth drilling into specific crates as well, maybe going one at a time running rustc by hand. If one crate takes way longer in 32 codegen units than in 2 then that's something to investigate. Overall builds tend to be hard to drill into :(

mersinvald · 2017-09-26T21:00:10Z

@alexcrichton ok, I'll do perf tomorrow)

mersinvald · 2017-09-29T02:02:23Z

@alexcrichton I've collected statistics for a clean cargo build

https://drive.google.com/drive/folders/0B28cL71oGfpOTVRKMUZjTlFQU2M?usp=sharing

The diff was made with perf diff perf.data.2 perf.data.32 > diff

Hope it will help.

I don't think I can interpret this data myself, but if you need me to run perf on some specific crates, feel free to ask, I'm happy to help.

bors · 2017-09-29T10:10:23Z

⌛ Testing commit 9e35b79 with merge d514263...

bors · 2017-09-29T12:56:19Z

☀️ Test successful - status-appveyor, status-travis
Approved by: michaelwoerister
Pushing d514263 to master...

rust-lang/rust#44853 changed the default number of codegen units from 1 to 32 for the dev profile. Unfortunately this broke our dev builds so we are reverting the change in the Cargo.toml.

rust-highfive assigned nikomatsakis Sep 25, 2017

rust-highfive assigned michaelwoerister and unassigned nikomatsakis Sep 25, 2017

alexcrichton force-pushed the debug-codegen-units branch 3 times, most recently from a6c4f73 to 8e09bfb Compare September 25, 2017 23:06

alexcrichton force-pushed the debug-codegen-units branch from 8e09bfb to 9e35b79 Compare September 26, 2017 15:18

alexcrichton mentioned this pull request Sep 26, 2017

rustc: Implement ThinLTO #44841

Merged

arielb1 added the S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. label Sep 26, 2017

bors merged commit 9e35b79 into rust-lang:master Sep 29, 2017

bors mentioned this pull request Sep 29, 2017

rustc: Enable LTO and multiple codegen units #44783

Closed

alexcrichton deleted the debug-codegen-units branch September 30, 2017 08:00

alexcrichton mentioned this pull request Sep 30, 2017

32 codegen units may not always be better at -O0 #44941

Closed

This was referenced Oct 2, 2017

use a single codegen-unit with the dev profile rust-embedded/cortex-m-quickstart#18

Closed

compiling for msp430 doesn't work with multiple codegen units #45000

Closed

alexcrichton mentioned this pull request Oct 7, 2017

rustc: Don't inline in CGUs at -O0 #45075

Merged

japaric mentioned this pull request Oct 9, 2017

Compiling Debug builds fails with undefined reference to 'rust_begin_unwind'. rust-embedded/discovery#52

Closed

bluss added the relnotes Marks issues that should be documented in the release notes of the next release. label Oct 9, 2017

alexcrichton mentioned this pull request Dec 29, 2017

Updated RELEASES.md for 1.23.0 #46327

Merged
