Base CGU estimates on basic block count, not statement count. #113047

Closed

Conversation

nnethercote
Contributor

r? @ghost

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jun 26, 2023
@nnethercote nnethercote marked this pull request as draft June 26, 2023 07:51
@nnethercote
Contributor Author

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jun 26, 2023
@bors
Contributor

bors commented Jun 26, 2023

⌛ Trying commit 64ee4e1 with merge 57da7fc899d24430772acd1c0df9059d593714e0...

@bors
Contributor

bors commented Jun 26, 2023

☀️ Try build successful - checks-actions
Build commit: 57da7fc899d24430772acd1c0df9059d593714e0


@rust-timer
Collaborator

Finished benchmarking commit (57da7fc899d24430772acd1c0df9059d593714e0): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                             mean    range            count
Regressions ❌ (primary)      1.7%    [0.9%, 3.0%]     3
Regressions ❌ (secondary)    -       -                0
Improvements ✅ (primary)     -0.8%   [-2.9%, -0.4%]   58
Improvements ✅ (secondary)   -0.8%   [-1.4%, -0.4%]   24
All ❌✅ (primary)             -0.6%   [-2.9%, 3.0%]    61

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean    range             count
Regressions ❌ (primary)      4.0%    [1.7%, 7.4%]      4
Regressions ❌ (secondary)    2.7%    [2.0%, 3.0%]      5
Improvements ✅ (primary)     -5.1%   [-10.7%, -1.5%]   5
Improvements ✅ (secondary)   -2.5%   [-3.1%, -2.1%]    3
All ❌✅ (primary)             -1.0%   [-10.7%, 7.4%]    9

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean    range            count
Regressions ❌ (primary)      2.0%    [1.1%, 2.9%]     2
Regressions ❌ (secondary)    -       -                0
Improvements ✅ (primary)     -3.1%   [-3.4%, -2.8%]   2
Improvements ✅ (secondary)   -       -                0
All ❌✅ (primary)             -0.5%   [-3.4%, 2.9%]    4

Binary size

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                             mean    range            count
Regressions ❌ (primary)      0.5%    [0.0%, 1.7%]     32
Regressions ❌ (secondary)    0.9%    [0.2%, 1.5%]     2
Improvements ✅ (primary)     -0.5%   [-0.7%, -0.4%]   4
Improvements ✅ (secondary)   -1.1%   [-1.1%, -1.1%]   1
All ❌✅ (primary)             0.4%    [-0.7%, 1.7%]    36

Bootstrap: 662.496s -> 666.596s (0.62%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Jun 26, 2023
@klensy
Contributor

klensy commented Jun 26, 2023

Interesting: the run https://github.com/rust-lang/rust/actions/runs/5375275677 is still spinning, even though it should have completed a while ago.

https://github.com/rust-lang/rust/actions/runs/5375275677/jobs/9751558060#step:25:1784

thread '[codegen-units] tests/codegen-units/partitioning/incremental-merging.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:2906:13
Build completed unsuccessfully in 0:12:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    [codegen-units] tests/codegen-units/partitioning/incremental-merging.rs

@Kobzol
Contributor

Kobzol commented Jun 26, 2023

Seems like some kind of GH bug; all the workflow jobs have actually completed, but the workflow run isn't marked as completed yet.

@nnethercote
Contributor Author

@klensy: I got a "PR run failed" email saying three jobs had failed, but when I visited here they were still spinning.

@nnethercote
Contributor Author

Some explanation: after much effort, I concluded that any attempt to improve CGU formation is basically doomed while the CGU size estimates aren't very good. There's no point doing anything clever if the estimates are regularly off by a factor of 2 or more in both directions (underestimating and overestimating).

So I wrote code to compare the CGU estimates to the actual compile times, and computed an error value based on the mean absolute deviation. The goal was not to come up with a perfect measure of error, but something reasonably good that would let me evaluate alternative CGU size estimation functions.
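
The error measure was roughly the following (a minimal sketch with hypothetical names, not the actual analysis code; it normalizes both the estimates and the measured times so each sums to 1, then takes the mean absolute deviation between the two distributions):

```rust
/// Hypothetical sketch of the error measure: normalize the per-CGU size
/// estimates and the measured compile times so each sums to 1, then take
/// the mean absolute deviation between the two distributions.
fn estimate_error(estimates: &[f64], times: &[f64]) -> f64 {
    assert_eq!(estimates.len(), times.len());
    let est_total: f64 = estimates.iter().sum();
    let time_total: f64 = times.iter().sum();
    let n = estimates.len() as f64;
    estimates
        .iter()
        .zip(times)
        .map(|(e, t)| (e / est_total - t / time_total).abs())
        .sum::<f64>()
        / n
}

fn main() {
    // Three CGUs whose estimates don't quite match their measured times.
    let err = estimate_error(&[100.0, 50.0, 25.0], &[2.0, 1.8, 0.4]);
    println!("error = {err:.3}");
}
```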

I tried quite a few different estimation functions, and found that basing the estimates on basic block counts rather than statement counts was a big win. This makes a certain amount of sense, because basic blocks are pretty easy for compilers to handle: a basic block with 20 statements is not much harder to compile than one with 5, because most of the complexity comes from the control flow between basic blocks. (Picture me waving my hands right now.)
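
As a toy illustration of why the two estimates can diverge (simplified structs, not rustc's actual MIR types):

```rust
// Simplified stand-ins for a MIR body; not rustc's actual types.
struct BasicBlock {
    statements: usize,
}

struct Body {
    blocks: Vec<BasicBlock>,
}

// Old-style estimate: total number of statements.
fn statement_count_estimate(body: &Body) -> usize {
    body.blocks.iter().map(|bb| bb.statements).sum()
}

// New-style estimate: number of basic blocks.
fn basic_block_count_estimate(body: &Body) -> usize {
    body.blocks.len()
}

fn main() {
    // Same number of statements, very different control-flow complexity.
    let straight_line = Body { blocks: vec![BasicBlock { statements: 40 }] };
    let branchy = Body {
        blocks: (0..20).map(|_| BasicBlock { statements: 2 }).collect(),
    };
    assert_eq!(statement_count_estimate(&straight_line), 40);
    assert_eq!(statement_count_estimate(&branchy), 40);
    assert_eq!(basic_block_count_estimate(&straight_line), 1);
    assert_eq!(basic_block_count_estimate(&branchy), 20);
}
```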

Here is the data showing the error measure for the old (statement-based) and new (basic-block-based) estimate functions. The error for the new function is better in every case except two (marked with *), where it is marginally worse. Some errors drop to zero because the number of CGUs dropped to one; with a single CGU there can be no discrepancy between estimates, so the error is trivially zero.

Debug
                             old    new
bitmaps-3.1.0                0.58   0.00
cargo-0.60.0                 0.38   0.13
cexpr-0.6.0                  0.39   0.13
clap-3.1.6                   0.29   0.14
cranelift-codegen-0.82.1     0.34   0.10
diesel-1.4.8                 0.35   0.00
exa-0.10.1                   0.39   0.22
html5ever-0.26.0             0.27   0.15
image-0.24.1                 0.26   0.05
regex-1.5.5                  0.57   0.42
ripgrep-13.0.0               0.14   0.17 *
serde-1.0.136                0.42   0.00
stm32f4-0.14.0               0.20   0.07
syn-1.0.89                   0.38   0.14
ucd                          0.66   0.57
unicode-normalization-0.1.19 0.54   0.00
webrender-2022               0.28   0.12

Opt
                             old    new
bitmaps-3.1.0                0.46   0.00
cargo-0.60.0                 0.22   0.18
cexpr-0.6.0                  0.48   0.19
clap-3.1.6                   0.29   0.19
cranelift-codegen-0.82.1     0.33   0.33
diesel-1.4.8                 0.57   0.00
exa-0.10.1                   0.00   0.00
html5ever-0.26.0             0.26   0.11
image-0.24.1                 0.14   0.16 *
regex-1.5.5                  0.44   0.35
ripgrep-13.0.0               0.27   0.19
serde-1.0.136                0.61   0.00
stm32f4-0.14.0               0.11   0.11
syn-1.0.89                   0.46   0.19
ucd                          0.86   0.39
unicode-normalization-0.1.19 0.72   0.00
webrender-2022               0.26   0.23

This PR implements the new estimator, but the results are worse than I'd hoped.

  • Instruction counts are great! But they're not really relevant for this PR, because of the parallelism in the backend.
  • Walltimes look slightly worse, though not that many benchmarks are affected. If you select "Show non-relevant results" then the results seem better overall, with lots of small improvements.
  • max-rss is a wash.
  • Binary size is a little worse. Not sure why; I would have thought the wins and losses would balance out.

I think the reason why we didn't see bigger wall time improvement is that even with the improved estimate, it only takes one bad estimate to throw the whole thing off, especially if that is a bad underestimate -- you can end up with a single "long pole" CGU that takes much longer than the other CGUs to compile.
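
As a toy illustration of that effect (made-up numbers and a simple greedy scheduler, not the backend's actual scheduling):

```rust
// Greedily schedule per-CGU compile times onto worker threads and report the
// resulting wall time (makespan).
fn makespan(mut times: Vec<f64>, workers: usize) -> f64 {
    times.sort_by(|a, b| b.partial_cmp(a).unwrap());
    let mut loads = vec![0.0f64; workers];
    for t in times {
        // Give each CGU to the currently least-loaded worker.
        let mut best = 0;
        for i in 1..loads.len() {
            if loads[i] < loads[best] {
                best = i;
            }
        }
        loads[best] += t;
    }
    loads.into_iter().fold(0.0, f64::max)
}

fn main() {
    // 16 equal CGUs on 8 workers: wall time is 2 units.
    println!("{}", makespan(vec![1.0; 16], 8));
    // Same total work, but one badly underestimated CGU turned into a 5-unit
    // "long pole": wall time is now 5 units, dominated by that single CGU.
    let mut skewed = vec![11.0 / 15.0; 15];
    skewed.push(5.0);
    println!("{}", makespan(skewed, 8));
}
```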

Anyway, I thought this was worth explaining, and I'd be interested to hear if others think the perf results make this worth merging.

cc @rust-lang/wg-compiler-performance @wesleywiser @mw

@michaelwoerister
Member

Thanks for the explanation, that's very interesting!

So I wrote code to compare the CGU estimates to the actual compile times

Is that end-to-end wall time?

I could also imagine that inline functions play a non-trivial role here:

  • They complicate the logic around merging CGUs, because the same inline function can appear in two CGUs, and when the CGUs are merged those copies collapse into one - i.e. sizeof(CGU1 + CGU2) < sizeof(CGU1) + sizeof(CGU2) (see the sketch after this list).
  • I suspect that the cost of performing the inlining is non-trivial inside of LLVM. But it's also hard to tell beforehand how much inlining will even be done.
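
To illustrate the first point, here is a toy sketch (made-up item sets, with unit cost per item rather than real size estimates) of why the merged size is sub-additive when inline items are shared:

```rust
use std::collections::HashSet;

// A merged CGU keeps only one copy of each item, so shared inline items are
// deduplicated when two CGUs are merged.
fn merged_size(cgu1: &HashSet<&str>, cgu2: &HashSet<&str>) -> usize {
    cgu1.union(cgu2).count()
}

fn main() {
    let cgu1: HashSet<&str> = ["foo", "inline_helper"].into_iter().collect();
    let cgu2: HashSet<&str> = ["bar", "inline_helper"].into_iter().collect();
    // 2 + 2 items, but the merged CGU contains only 3:
    // sizeof(CGU1 + CGU2) < sizeof(CGU1) + sizeof(CGU2).
    assert_eq!(merged_size(&cgu1, &cgu2), 3);
    assert!(merged_size(&cgu1, &cgu2) < cgu1.len() + cgu2.len());
}
```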

It would be interesting to know if the cost estimates for debug builds (which do almost no inlining) are more accurate than for release builds (which do inlining, but probably also other super-linear optimizations).

What I'd really like to see at some point is the effect of throwing some machine learning on the problem. That seems a good fit for dealing with such a fuzzy, statistical problem. It would be a nice research project 🙂

@nnethercote
Contributor Author

Is that end-to-end wall time?

It is walltime for the "gen" phase (generating the LLVM IR) plus the "opt" phase (LLVM doing its work) plus the "lto" phase (opt builds only).

What I'd really like to see at some point is the effect of throwing some machine learning on the problem. That seems a good fit for dealing with such a fuzzy, statistical problem. It would be a nice research project 🙂

Perhaps, though I wonder if overfitting to specific machines would be a problem.

@michaelwoerister
Member

Perhaps, though I wonder if overfitting to specific machines would be a problem.

Yes, that's true. On the other hand, I think we already have some overfitting to the existing benchmarks and benchmark machines going on.

FWIW, the change in this PR looks like an overall improvement to me.

@nnethercote
Contributor Author

I will close this. I have been collecting more MIR stats for others with data science expertise to take a look at. Hopefully that will lead to a better estimation function.

@nnethercote nnethercote deleted the cgu-bb-estimate branch July 23, 2023 22:19