
Selectively enable opt-level 1 #8141

Merged · 1 commit · May 15, 2024
Conversation

@bgw (Member) commented May 14, 2024:

This PR compiles all non-workspace dependencies, as well as turbo-tasks-memory (which is particularly sensitive) with basic optimizations. Most crates in the workspace still use opt-level 0 locally.

While not as good as applying opt-level 1 everywhere, this significantly reduces execution times relative to opt-level 0, at the cost of making cold builds about 50-60% slower. Warm build times are largely unaffected.

The debugging (gdb/lldb) experience may also be slightly worsened by the optimizations.
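Concretely, this kind of selective configuration is expressed with Cargo profile overrides in the workspace Cargo.toml. A minimal sketch (the actual diff is shown further down in the review thread):

```toml
# Workspace Cargo.toml (sketch): workspace crates keep the dev-profile
# default (opt-level 0); everything else gets basic optimizations.

# Optimize all non-workspace dependencies.
[profile.dev.package."*"]
opt-level = 1

# Also optimize this especially performance-sensitive workspace crate.
[profile.dev.package.turbo-tasks-memory]
opt-level = 1
```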

What about cargo check/cargo clippy/rust-analyzer? No expected change, as (outside of proc macros) these don't perform LLVM code generation.

Why selectively, and not everywhere? While applying this everywhere can give us about 3x faster execution, the selective approach still captures most of the runtime performance benefits while avoiding most of the compilation cost (especially for warm builds). I believe we should still optimize more for build times than execution times. I benchmarked applying opt-level 1 to all crates here: https://docs.google.com/document/d/1iaREbzYpDmBt54fT2egzptTfx0OYsTIJ633gRqddzDY/edit?usp=sharing

Why not just a few hot dependencies? I tried profiling the debug build and only optimizing the hot crates, but I wasn't able to get meaningful improvements in my testing.

Benchmarking Notes

  • System configuration is here: https://github.com/bgw/benchmark-scripts . This is a downclocked machine with most CPU cores disabled to get low-noise measurements. Treat these results as relative to each other, not as absolute values.
  • Build benchmarks are run with mold, as GNU ld is incredibly slow (and often causes OOMs with 16GB of RAM). We're already using mold in the private nextpack meta-repository. I'll follow up with another PR to use mold or lld by default.
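
For reference, the same linker selection can be made persistent in `.cargo/config.toml` instead of passing RUSTFLAGS on every invocation. A sketch, assuming an x86_64 Linux target:

```toml
# .cargo/config.toml (sketch): use mold for all builds on this target,
# equivalent to RUSTFLAGS=-Clink-arg=-fuse-ld=mold in the commands below.
[target.x86_64-unknown-linux-gnu]
rustflags = ["-C", "link-arg=-fuse-ld=mold"]
```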

Build Time Benchmarks

There's a significant regression to cold builds, but there's no meaningful regression for warm builds.

Cold time to build tests (2 runs):

rm -rf target/ && time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo nextest run -- dummy_filter_build_only_dont_run_any_tests

Before:

real    9m29.839s
real    9m27.522s

After:

real    15m28.105s
real    15m28.577s

Warm time to build tests (2 runs):

Modify a string in an error message inside crates/turbopack-ecmascript/src/minify.rs. This forces recompilation of all dependent crates without meaningfully changing any behavior. Then run:

time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo nextest run -- dummy_filter_build_only_dont_run_any_tests

Before:

real    1m33.497s
real    1m36.134s

After:

real    1m41.232s
real    1m40.153s

Warm time to build single binary (2 runs):

This is less dependent on linking than the tests, which generate many binary targets.

Modify a string in an error message inside crates/turbopack-ecmascript/src/minify.rs. This forces recompilation of all dependent crates without meaningfully changing any behavior. Then run:

time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo build -p turbopack-cli

Before:

real    0m37.565s
real    0m37.058s

After:

real    0m36.450s
real    0m36.849s

Cold time to build a single turborepo binary:

rm -rf target/ && time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo build -p turbo

Before:

real    3m43.488s

After:

real    4m54.416s

Execution Time Benchmarks

turbopack-cli's bench_startup

cargo bench --profile dev -p turbopack-cli

Before:

bench_startup/Turbopack CSR/1000 modules
                        time:   [20.744 s 20.869 s 20.995 s]

After:

bench_startup/Turbopack CSR/1000 modules
                        time:   [7.8037 s 7.8505 s 7.9030 s]

Test Execution (excluding build, 2 runs)

With a completely warm build cache (such that nothing needs to build), run:

time RUSTFLAGS=-Clink-arg=-fuse-ld=mold cargo nextest run -E 'not test(node_file_trace)'

Before:

real    2m51.767s
real    2m51.482s

After:

real    1m17.286s
real    1m12.520s




A contributor commented:

🟢 Turbopack Benchmark CI successful 🟢

Thanks


github-actions bot commented May 14, 2024

⚠️ CI failed ⚠️

The following steps have failed in CI:

  • Turbopack Rust tests (mac/win, non-blocking)

See workflow summary for details

@@ -77,8 +77,13 @@ turbopack-wasi = [
[workspace.lints.clippy]
too_many_arguments = "allow"

[profile.dev.package.turbo-tasks-macros]
@bgw (Member Author) commented:

This config was redundant with the

[profile.dev.build-override]

below. I measured no meaningful change to build times with/without it.


# Set the options for dependencies (not crates in the workspace), this mostly impacts cold builds
[profile.dev.package."*"]
opt-level = 1

# Set the settings for build scripts and proc-macros.
[profile.dev.build-override]
@bgw (Member Author) commented May 14, 2024:

Follow-up: One problem is that due to precedence ordering issues in cargo, dependencies like syn will now be compiled with opt-level = 1 instead of opt-level = 3. I need to perform further benchmarking to determine if this is significant and worth working around.

rust-lang/cargo#9351
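
A sketch of the interaction, assuming the existing [profile.dev.build-override] section sets opt-level = 3:

```toml
# Sketch of the precedence issue (rust-lang/cargo#9351): syn is both a
# proc-macro dependency (covered by build-override) and an ordinary
# package (covered by the "*" glob). Cargo resolves the conflict in
# favor of the package setting, so syn is built at opt-level 1
# instead of 3.
[profile.dev.build-override]
opt-level = 3

[profile.dev.package."*"]
opt-level = 1
```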

A contributor commented:

I guess this will mostly affect macro eval? turbo tasks has a few thousand alone but I would hope that the aggregate wins from not optimizing deps would be greater than the perf loss.

also TY for adding comments. our build configs are under documented

@bgw bgw marked this pull request as ready for review May 14, 2024 23:18
@bgw bgw requested review from a team as code owners May 14, 2024 23:18
@bgw bgw requested review from mehulkar and codybrouwers May 14, 2024 23:18
@mehulkar mehulkar requested review from chris-olszewski and NicholasLYang and removed request for mehulkar May 15, 2024 03:08
@arlyon (Contributor) left a comment:

I am on board with this pending verification of syn costs. I don't expect it to be an issue but can't hurt

@bgw (Member Author) commented May 15, 2024:

I modified the Cargo.toml to explicitly include opt-level = 3 for all the "big" proc-macro related dependencies I could find that would've been overridden by [profile.dev.package."*"].

Additions to `Cargo.toml`
[profile.dev.package.syn]
opt-level = 3

[profile.dev.package.quote]
opt-level = 3

[profile.dev.package.proc-macro2]
opt-level = 3

[profile.dev.package.serde_derive]
opt-level = 3

[profile.dev.package.serde_derive_internals]
opt-level = 3

[profile.dev.package.wasm-bindgen-backend]
opt-level = 3

[profile.dev.package.bindgen]
opt-level = 3

[profile.dev.package.tokio-macros]
opt-level = 3

[profile.dev.package.rkyv_derive]
opt-level = 3

I repeated the "Warm time to build single binary" benchmark, which uses many of these proc macros, but doesn't include the time spent optimizing them at opt-level = 3. This is the benchmark I'd expect to see the most benefit from improved proc macro execution. There was no meaningful change:

real    0m37.078s

@NicholasLYang (Contributor) left a comment:

Sounds good to me. Our sccache setup should hopefully mitigate cold builds being slower

@bgw bgw merged commit 46b888b into main May 15, 2024
50 of 52 checks passed
@bgw bgw deleted the bgw/dev-opt-level branch May 15, 2024 22:00