
rustc_codegen_ssa: tune codegen according to available concurrency #82127

Merged
1 commit merged into rust-lang:master on Feb 23, 2021

Conversation

tgnottingham
Contributor

This change tunes ahead-of-time codegen according to the amount of
concurrency available, rather than according to the number of CPUs on
the system. This can lower memory usage by reducing the number of
compiled LLVM modules in memory at once, particularly across several
rustc instances.

Previously, each rustc instance would assume that it should codegen
ahead of time to meet the demand of number-of-CPUs workers. But often, a
rustc instance doesn't have nearly that much concurrency available to
it, because the available concurrency is split, via the jobserver,
across all active rustc instances spawned by the driving cargo process,
and is further limited by the -j flag. Therefore, each rustc might have
had several times as many LLVM modules in memory as it really needed to
meet demand. If the modules were large, the effect on memory usage would
be noticeable.

With this change, the required amount of ahead-of-time codegen scales up
with the actual number of workers running within a rustc instance. Note
that the number of workers running can be less than the actual
concurrency available to a rustc instance. However, if more concurrency
is actually available, workers are spun up quickly as job tokens are
acquired, and the ahead-of-time codegen scales up quickly as well.

@rust-highfive
Collaborator

r? @varkor

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 15, 2021
@tgnottingham
Contributor Author

tgnottingham commented Feb 15, 2021

This change mostly targets concurrent rustc scenarios (i.e. cargo building multiple crates at once) or any scenario where -j is used with a number less than the number of CPUs. I'm not sure these scenarios apply to rustc-perf benchmarks, so we may not see a change there.

But here are some graphs of manual benchmark results.

[Graph: rustc_quarter_workers]

This shows the memory usage using the current queue_full_enough heuristic versus the proposed one when compiling the entire stage0 rustc with -j equal to the number of CPUs (eight in this case). The reduction in peak memory usage is ~1GB, or 16%, accounting for the base memory usage outside of compilation. On systems with more CPUs, the results might differ, though I'd expect the reduction to be more dramatic.

[Graph: rustc_all]

This shows how some additional heuristics perform. The line labels indicate when the heuristic considers the queue to be full enough for the main thread to decide to stop codegenning and start LLVMing (see code comments for more detail than you probably want).

Notice that the Items >= 1 heuristic performs better than the proposed one during part of the compilation, but not all of it. However, I wouldn't be comfortable using it, as it could more easily increase compile times.

[Graph: rustc_middle_lto_on]

This shows the memory usage of the current heuristic versus the proposed one when compiling rustc_middle with -j1 and thin local LTO on. Technically, it's simulated -j1 via -Z no-parallel-llvm because I wasn't using x.py to compile, but they should behave the same or similarly enough. The reduction in peak memory usage is ~250MB, or 10%. This is meant to show the potential improvement when using a -j value less than the number of CPUs on the system. Remember that rustc_middle is kind of an extreme example though.

[Graph: rustc_middle_lto_off]

This shows the same benchmark but with LTO off. The reduction in peak memory usage is ~600MB, or 22%.

One thing I learned from this is how little an LLVM module seems to take up in memory after it's been serialized into an in-memory buffer (ThinBuffer) and the module proper dropped. I expected enabling thin LTO to cause a huge increase in memory usage because all of the modules have to be kept in memory until LTO is done. But it appears the serialized forms are small enough that the impact is negligible.

Stats were gathered by polling system memory usage once per second via the free command on Linux. The machine was otherwise left alone during benchmarking. I'm sure there's a better way, but this did a passable job.
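The measurement loop described above amounts to something like the following. This is a sketch of the approach, not the exact commands used; the sample count and log file name are illustrative.

```shell
#!/bin/sh
# Poll system memory once per second via `free` (Linux) and log
# "timestamp used-MiB" pairs for later graphing. SAMPLES would be
# large enough to cover the whole build in practice.
SAMPLES=3
for _ in $(seq 1 "$SAMPLES"); do
    echo "$(date +%s) $(free -m | awk '/^Mem:/ {print $3}')"
    sleep 1
done > mem_usage.log
```

Something like `ps` or cgroup accounting would isolate the compiler's own usage more precisely, but as noted, whole-system polling on an otherwise idle machine does a passable job.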

@rustbot label T-compiler A-codegen I-compilemem

@rustbot rustbot added A-codegen Area: Code generation I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 15, 2021
@tgnottingham
Contributor Author

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 15, 2021
@bors
Contributor

bors commented Feb 15, 2021

⌛ Trying commit ed858772f2047fdb02db3328fef70e7cec4023a1 with merge b12cc6754498881db4edf807a09dd6d3c6cf4f15...

@bors
Contributor

bors commented Feb 15, 2021

☀️ Try build successful - checks-actions
Build commit: b12cc6754498881db4edf807a09dd6d3c6cf4f15

@rust-timer
Collaborator

Queued b12cc6754498881db4edf807a09dd6d3c6cf4f15 with parent 9503ea1, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking try commit (b12cc6754498881db4edf807a09dd6d3c6cf4f15): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 15, 2021
@tgnottingham
Contributor Author

tgnottingham commented Feb 15, 2021

Added more to the massive comment. Wanted to note that keeping the queue full is not only beneficial for meeting demand of existing workers, but for meeting demand when we have a surge in available concurrency.

Apologies if the comment is too monstrous. It's hard for me to express the scheduling nuances succinctly. Or anything, really. :)

As expected, the rustc-perf results don't show any change in max-rss. There may be a small regression in compile times (0.6% on bootstrap).

@tgnottingham
Contributor Author

@varkor, were you already reviewing this, or would you prefer to have it off your plate? I know @michaelwoerister has worked in this area (I always seem to end up in your neck of the woods, @michaelwoerister!).

@varkor
Member

varkor commented Feb 18, 2021

@tgnottingham: sorry, I've been quite busy this week and last. You will probably get a quicker review if you have another reviewer in mind already, but I'll try to get to it this weekend if not :)

@tgnottingham
Contributor Author

@varkor, no problem at all. :)

@tgnottingham
Contributor Author

tgnottingham commented Feb 22, 2021

I checked the effects of this change on compiling rustc, post #70951. On my system, it now reduces peak memory usage by 13%, down from 16%, pre #70951.

@varkor
Member

varkor commented Feb 22, 2021

Thanks for the change, and the detailed analysis and comment! This all looks good to me!

@bors r+

@bors
Contributor

bors commented Feb 22, 2021

📌 Commit 5f243d3 has been approved by varkor

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 22, 2021
@bors
Contributor

bors commented Feb 23, 2021

⌛ Testing commit 5f243d3 with merge 0196107...

@bors
Contributor

bors commented Feb 23, 2021

☀️ Test successful - checks-actions
Approved by: varkor
Pushing 0196107 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 23, 2021
@bors bors merged commit 0196107 into rust-lang:master Feb 23, 2021
@rustbot rustbot added this to the 1.52.0 milestone Feb 23, 2021