Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Rust compile times #6065

Open
str4d opened this issue Jul 8, 2022 · 8 comments
Open

Improve Rust compile times #6065

str4d opened this issue Jul 8, 2022 · 8 comments
Labels
A-build Area: Build system I-performance Problems and improvements with respect to performance L-rust Involves Rust code.

Comments

@str4d
Copy link
Contributor

str4d commented Jul 8, 2022

Currently we spend quite a bit of time recompiling Rust when making incremental changes. It would be nice to figure out how to improve this. Ideas for improvement:

  • We seem to have a noticeable performance cost in compiling the Rust library and the binaries separately.
  • We currently use codegen-units = 1.
@str4d str4d added I-performance Problems and improvements with respect to performance A-build Area: Build system L-rust Involves Rust code. labels Jul 8, 2022
@str4d
Copy link
Contributor Author

str4d commented Jul 8, 2022

Benchmarked codegen-units on Ryzen 9 5950X. Note that the -j30 parameter affects the C++ side, not the Rust side (which just uses all 32 threads).

./zcutil/clean.sh && ./zcutil/build.sh -j30

With codegen-units = 1

real    3m6.597s
user    6m24.224s
sys     0m43.280s

real    3m6.023s
user    6m23.447s
sys     0m43.794s

Without codegen-units = 1

real    2m29.704s
user    5m58.134s
sys     0m44.960s

real    2m21.490s
user    5m59.819s
sys     0m45.727s

touch src/rust/src/rustzcash.rs && make -j30

With codegen-units = 1

real    1m27.347s
user    1m26.618s
sys     0m0.920s

real    1m28.987s
user    1m31.265s
sys     0m4.161s

Without codegen-units = 1

real    0m59.504s
user    1m19.499s
sys     0m1.066s

real    0m59.122s
user    1m21.626s
sys     0m4.131s

So it looks like using the default codegen-units saves us around 30 seconds of build time, which is around 22% of clean build time and 33% of incremental rebuild time.

@str4d
Copy link
Contributor Author

str4d commented Aug 29, 2022

Ran the same scripts on the same CPU again, but on current master:

./zcutil/clean.sh && time ./zcutil/build.sh -j30

Note that this does not include the build time of vendored dependencies (otherwise you'd see a disparity between the two measurements).

With codegen-units = 1

real    3m3.069s
user    6m18.913s
sys     0m44.811s

real    2m58.695s
user    6m15.580s
sys     0m44.638s

Without codegen-units = 1

real    2m19.684s
user    6m2.841s
sys     0m46.162s

real    2m18.342s
user    6m3.672s
sys     0m45.786s

This is slightly faster than before; it might be due to the LLVM 14 or Rust 1.63 updates.

Without codegen-units = 1, and merging --lib and --bins building

real    2m10.844s
user    5m43.436s
sys     0m44.201s

real    2m10.715s
user    5m47.419s
sys     0m44.728s

So it looks like using the default codegen-units currently saves us around 40 seconds of build time, and building the binaries and library at the same time saves a further 10 seconds. The latter is because the binaries can be built in parallel; currently that is prevented due to locks on the Rust build folder. The rationale for separating them was to avoid any cost of the binary builds when not needed, but clearly they are not the main cost, because the library build cost itself is so high:

image

str4d added a commit to str4d/zcash that referenced this issue Aug 29, 2022
This enables `cargo` to parallelize the library and binary builds
internally, reducing the Rust build time by the build time of the
binaries (because they are overall faster than the library build).

Part of zcash#6065.
@str4d
Copy link
Contributor Author

str4d commented Aug 29, 2022

Hmm, I had zcash-inspect disabled because of a local linker issue. I re-enabled it because parallelising the library and binaries also fixed my linker issue, and it revealed a significantly different performance profile (I presume since zcash-inspect was merged):

image

Avenues for improvement:

  • Figure out what is going on with the orchard crate's codegen.
  • Move more of the librustzcash code to use the cxx crate, and see what effect that has.
  • Move more of the librustzcash code out to external Rust crates.
    • Some of these might make sense to live in this repo, but librustzcash is intended to only be FFI code, so any business logic should likely move into https://github.com/zcash/librustzcash.

@str4d
Copy link
Contributor Author

str4d commented Feb 6, 2023

I forgot to mention that I obtained the above figures by adding the --timings flag to RUST_BUILD_OPTS in src/Makefile.am.

@str4d
Copy link
Contributor Author

str4d commented Feb 6, 2023

I tried using rustc's self-profiling to figure out where the time was being spent:

diff --git a/src/Makefile.am b/src/Makefile.am
index 0e21d2d44c..daecbe9832 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -139,7 +139,7 @@ LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_SHANI)
 endif
 
 cargo-build: $(CARGO_CONFIGURED) $(LIBSECP256K1)
-       $(rust_verbose)$(RUST_ENV_VARS) $(CARGO) build $(RUST_BUILD_OPTS) $(cargo_verbose)
+       $(rust_verbose)$(RUST_ENV_VARS) RUSTC_BOOTSTRAP=1 $(CARGO) rustc $(RUST_BUILD_OPTS) $(cargo_verbose) --lib -- -Z self-profile -Z self-profile-events=default,args
 
 cargo-build-lib: cargo-build
 

Here's what I get on my Ryzen 9 5950X (also with a zcashd reindex running in the background during its "read all the blocks" phase):

image

Oof, that's a lot of time doing linking! And about half of it is LTO:

zcash/Cargo.toml

Lines 111 to 112 in feec543

[profile.release]
lto = true

It turns out that lto = true is equivalent to lto = 'fat'. Here's what happens when I switch to lto = 'thin' (which should give close-to-equivalent performance improvements):

image

It uses a bunch more threads (the timeline scrolls off down to a thread number of 196), and saves over 20s of compile time.

Some of the durations for comparison (in seconds):

Phase lto = true lto = 'thin'
LLVM_passes 83.40 59.94
codegen_module_perform_lto 37.56 17.40
codegen_module_optimize 44.23 42.30

str4d added a commit to str4d/zcash that referenced this issue Feb 6, 2023
This should achieve similar performance gains to "fat" LTO (which we
were previously using) while taking substantially less time to run
(over 20s saved on a Ryzen 9 5950X).

Part of zcash#6065.
@str4d
Copy link
Contributor Author

str4d commented Feb 6, 2023

I tried combining #6407 with removing codegen-units (on my Ryzen 9 5950X):

./zcutil/clean.sh && time ./zcutil/build.sh -j30

master

real    2m52.682s
user    6m24.228s
sys     0m42.750s

#6407

real    2m28.203s
user    7m5.917s
sys     0m44.407s

#6407 without codegen-units = 1

real    1m45.812s
user    8m38.458s
sys     0m50.999s

touch src/rust/src/rustzcash.rs && time make -j30:

master

real    1m28.580s
user    1m27.796s
sys     0m0.899s

#6407

real    1m8.626s
user    1m51.819s
sys     0m5.348s

#6407 without codegen-units = 1

real    0m17.458s
user    2m57.936s
sys     0m5.158s

So the combination of ThinLTO and default codegen-units basically almost completely eliminates the slowness of recompiles, and has an equivalent effect on clean builds. We should definitely consider the latter. @ebfull what was the original motivation for codegen-units = 1?

@daira
Copy link
Contributor

daira commented Feb 6, 2023

We should check that this doesn't break determinism of the gitian builds.

str4d added a commit to str4d/zcash that referenced this issue Feb 6, 2023
This moves us back to using the default value of 16 for the release
profile (which does not enable incremental builds).

Part of zcash#6065.
@str4d
Copy link
Contributor Author

str4d commented Mar 2, 2023

@ebfull what was the original motivation for codegen-units = 1?

@ebfull told me that the original motivation here was performance; he seems to recall that the compiler at the time could produce faster code if all codegen was in a single unit. We should check there are no significant performance regressions caused by #6411 (or if there are, decide if they are worth the compile time improvements).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-build Area: Build system I-performance Problems and improvements with respect to performance L-rust Involves Rust code.
Projects
None yet
Development

No branches or pull requests

2 participants