Remove most `#[inline]` annotations #119

Conversation
I see more regressions than improvements, and with the exception of
I don't know anything about the relative importance of each benchmark, though.
The general order of importance for hash table operations is (from most important to least):
I am particularly worried about the regression in insertion benchmarks. Looking at the disassembly shows that there are 2 out-of-line functions:
As a point of comparison, in the C++ version of SwissTables, every function is inline except for
I'm personally very wary of treating these microbenchmarks as serious regressions and/or as grounds for skipping this PR entirely. One benchmark got 100% faster by removing
One thing I've tried to emphasize with this PR is drawing from data. Data sources like perf.r-l.o and these local benchmarks are showing that 90% of the wins come from just inlining the functions which otherwise would not be candidates for inlining (like non-generic functions). There's an extremely long tail of "regressions" elsewhere because
I think we should make an explicit decision to trade off a minuscule amount of perf in microbenchmarks for a 20% compile-time improvement in hashmap-heavy crates. This is a balancing act, and I think it's fine to use concrete data to guide insertion of
I'd want to be clear though: Rust's compilation model has no parallel in C++. What C++ does with headers does not at all match Rust generics and
The entire crate is "inlined" anyway since it's generic.
Using
Isn't the codegen'ing done anyways for generic functions? This means that effectively in hashbrown all we are doing is adding the
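For readers following along, here is a minimal sketch of the distinction being debated; the function names are made up for illustration and are not hashbrown's actual API. Generic functions are monomorphized into the downstream crate, so LLVM can inline them there even without `#[inline]`, while non-generic helpers are compiled once in hashbrown and need `#[inline]` (or LTO) to be inlined across the crate boundary.

```rust
// Hypothetical illustration, not hashbrown's real code.
use std::hash::{Hash, Hasher};

// Generic: instantiated (monomorphized) in whichever crate calls it, so its
// body is available to LLVM there and it can be inlined without `#[inline]`.
pub fn hash_one<K: Hash>(key: &K) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    key.hash(&mut h);
    h.finish()
}

// Non-generic: compiled once in this crate. Without `#[inline]` (or LTO),
// downstream crates can only call it out of line.
#[inline]
pub fn bucket_index(hash: u64, bucket_mask: usize) -> usize {
    (hash as usize) & bucket_mask
}
```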
I disagree with your interpretation of the perf.r-l.o data: if you filter the results to only look at the |
Looking at the wall time measurements, it's pretty clear to me that most of the check regressions, while theoretically 1-3%, are actually less than ~100ms longer in terms of compile time. I agree with @alexcrichton here that the trade-off in compile time on optimized/debug LLVM builds is more than worth the possibly tiny losses in performance.
No,
Instead, the behavior of this PR is that only one CGU has a hash map (because it must be monomorphized), all 16 CGUs reference it, and then ThinLTO will inline across codegen units as necessary.

I will again, as I usually do, strongly push back against religiously adhering to the numbers provided by perf.rust-lang.org. What you're looking at is instruction counts, which do not guarantee any sort of correlation with runtime. They can be, and often are, an indicator that when instruction counts change, something about the wall time changes. Moving around a few percent of instructions here or there isn't a meaningful number on its own, though; it simply means "please take the time to investigate more to understand what this change means".

As @Mark-Simulacrum points out, the "regressions" here are on the order of milliseconds. I don't think anyone's going to lament that rustc is a few milliseconds slower on each crate; no one is even close to the scale where that matters at all. What people do care about is shaving 20% off their compile time when using hash maps. That's actually a significant win, and has real-world impacts on any "big" crate.
I made some comments about the
More generally, this change has two effects.
So the question is: what's the right balance between performance and compile times? Different people will have different opinions about this. @Amanieu worked hard to get
I don't know what the right answer is here, but having
As a small note here, you could put inlining behind a feature that is enabled by default. That's what I did for
Of course, if you depend on anything that depends on hashbrown that enables the feature, then I don't think it can be turned off.
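A rough sketch of what that feature-gated approach could look like; the `inline-more` feature name and the helper below are illustrative assumptions, not necessarily what hashbrown ended up shipping:

```rust
// Assumed Cargo.toml wiring (illustrative):
//
//   [features]
//   default = ["inline-more"]
//   inline-more = []

// The attribute only expands to `#[inline]` when the feature is enabled, so
// users who care more about compile times can opt out with
// `default-features = false`.
#[cfg_attr(feature = "inline-more", inline)]
pub fn h2(hash: u64) -> u8 {
    // Hypothetical small helper: keep the top 7 bits of the hash.
    (hash >> 57) as u8
}
```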
Just my two cents, but I would gladly wait multiple minutes extra on compile times if it improves final runtime performance by even 5%. If you need faster debug iteration, why not just do something like:
I guess one way to do this is to measure multiple versions: no inlining, full inlining, and several points in between. If we had data on, say, five different versions, it might show that there is a sweet spot where we can get a big chunk of the compile-time wins with very little runtime performance cost. (Or it might show that there is no such sweet spot.) I can see that @alexcrichton did some of that already. It would be instructive to have more data points; I understand that this would take a significant amount of time.
I sort of get the impression that very few folks are OK with admitting that getting compile times under control will require changing the code we send to rustc. I feel that most of the discussion here is "look at the red number on perf.r-l.o, that means we can't land anything, right?" I find that line of reasoning pretty unproductive, and it also misses the point of what perf.r-l.o even is, which (I'll say again) is purely instruction counts, which may correlate with wall-time performance but don't always.

I don't think it's the case that 100% of Rust users want the absolute fastest code at all costs no matter how long it takes. I'm sure that's the case for some, but I think there's a very large chunk of users who also want to be able to iterate reasonably fast (cue everyone who's ever thought that Rust compiles slowly). I feel like our job in the standard library is to strike a balance, and adding

I personally find it extremely difficult and frustrating to make these sorts of changes. As I mentioned above, I feel that few want to admit that these sorts of changes are necessary for getting compile times under control. This has been true for all of Rust's history; for example, I was quite frustrated that parallel codegen was stymied due to the lack of ThinLTO originally. This later ended up being the only major dip in compile times in Rust's history when we finally got it enabled with ThinLTO. This is a way of saying that I'm running out of steam for making these kinds of changes, since for years basically no one seems to "be on my side". That's a sign that I'm one of the only people who cares enough about this to put energy into it, and it's not really worth my time if I'm always the sole advocate.
@alexcrichton Your explanation of
I think what you are doing always sets an example (it does for me at least; I started working on a de-inlining PR for a crate yesterday, though it's possible I'm too optimistic about `inline(always)` too, for small methods).
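Since `inline(always)` came up, here is a brief, hedged illustration of the difference, using hypothetical functions: `#[inline]` is a hint that also makes the body available to other crates and CGUs, while `#[inline(always)]` additionally asks LLVM to inline at every call site, which generally only pays off for very small methods.

```rust
// Hypothetical examples, not from any particular crate.

// Hint only: the optimizer still decides whether inlining is worthwhile,
// but the body is available across crate/CGU boundaries.
#[inline]
fn combine(a: u64, b: u64) -> u64 {
    a.rotate_left(5) ^ b
}

// Forced: inlined at every call site where possible. Overusing this on
// non-trivial functions can bloat code size and hurt both compile time
// and runtime.
#[inline(always)]
fn low_bits(x: u64) -> u8 {
    (x & 0x7f) as u8
}
```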
Thanks @alexcrichton for your explanation of
I am happy to accept what @BurntSushi suggested, which is to put the inlining behind an
However I would also like to get a better understanding of how
It's clear to me that rust-lang/rust#64600 and this PR have identified that excessive inlining of library functions can have a shockingly large effect on compile times.
* Avoids unnecessary rebuilds when locally developing the crate.
* Helps when debugging and looking at symbols to see what we got.
This commit goes through and deletes almost all `#[inline]` annotations in this crate. It looks like before this commit basically every single function is `#[inline]`, but this is generally not necessary for performance and can have a severe impact on compile times in both debug and release modes, most severely in release mode.

Some `#[inline]` annotations are definitely necessary, however. Most functions in this crate are already candidates for inlining because they're generic, but functions like `Group` and `BitMask` aren't candidates for inlining without `#[inline]`. Additionally LLVM is by no means perfect, so some `#[inline]` may still be necessary to get some further speedups.

The procedure used to generate this commit looked like:

* Remove all `#[inline]` annotations.
* Run `cargo bench`, comparing against the `master` branch, and add `#[inline]` to hot spots as necessary.
* A [PR] was made against rust-lang/rust to [evaluate the impact][run1] on the compiler for more performance data.
* Using this data, `perf diff` was used locally to determine further hot spots and more `#[inline]` annotations were added.
* A [second round of benchmarking][run2] was done.

The numbers are at the point where I think this should land in the crate and get published to move into the standard library. There are up to 20% wins in compile time for hashmap-heavy crates (like Cargo) and milder wins (up to 10%) for a number of other large crates. The regressions are all in the 1-3% range and are largely on benchmarks taking a handful of milliseconds anyway, which I'd personally say is a worthwhile tradeoff.

For comparison, the benchmarks of this crate before and after this commit look like so:

```
name                          baseline ns/iter  new ns/iter  diff ns/iter  diff %    speedup
insert_ahash_highbits         7,137             9,044        1,907         26.72%    x 0.79
insert_ahash_random           7,575             9,789        2,214         29.23%    x 0.77
insert_ahash_serial           9,833             9,476        -357          -3.63%    x 1.04
insert_erase_ahash_highbits   15,824            19,164       3,340         21.11%    x 0.83
insert_erase_ahash_random     16,933            20,353       3,420         20.20%    x 0.83
insert_erase_ahash_serial     20,857            27,675       6,818         32.69%    x 0.75
insert_erase_std_highbits     35,117            38,385       3,268         9.31%     x 0.91
insert_erase_std_random       35,357            37,236       1,879         5.31%     x 0.95
insert_erase_std_serial       30,617            34,136       3,519         11.49%    x 0.90
insert_std_highbits           15,675            18,180       2,505         15.98%    x 0.86
insert_std_random             16,566            17,803       1,237         7.47%     x 0.93
insert_std_serial             14,612            16,025       1,413         9.67%     x 0.91
iter_ahash_highbits           1,715             1,640        -75           -4.37%    x 1.05
iter_ahash_random             1,721             1,634        -87           -5.06%    x 1.05
iter_ahash_serial             1,723             1,636        -87           -5.05%    x 1.05
iter_std_highbits             1,715             1,634        -81           -4.72%    x 1.05
iter_std_random               1,715             1,637        -78           -4.55%    x 1.05
iter_std_serial               1,722             1,637        -85           -4.94%    x 1.05
lookup_ahash_highbits         4,565             5,809        1,244         27.25%    x 0.79
lookup_ahash_random           4,632             4,047        -585          -12.63%   x 1.14
lookup_ahash_serial           4,612             4,906        294           6.37%     x 0.94
lookup_fail_ahash_highbits    4,206             3,976        -230          -5.47%    x 1.06
lookup_fail_ahash_random      4,327             4,211        -116          -2.68%    x 1.03
lookup_fail_ahash_serial      8,999             4,386        -4,613        -51.26%   x 2.05
lookup_fail_std_highbits      13,284            13,342       58            0.44%     x 1.00
lookup_fail_std_random        13,172            13,614       442           3.36%     x 0.97
lookup_fail_std_serial        11,240            11,539       299           2.66%     x 0.97
lookup_std_highbits           13,075            13,333       258           1.97%     x 0.98
lookup_std_random             13,257            13,193       -64           -0.48%    x 1.00
lookup_std_serial             10,782            10,917       135           1.25%     x 0.99
```

The summary of this from what I can tell is that the microbenchmarks are sort of all over the place, but they're neither consistently regressing nor improving, as expected. In general I would be surprised if there's much of a significant performance regression attributed to this commit, and `#[inline]` can always be selectively added back in easily without adding it to every function in the crate.

[PR]: rust-lang/rust#64846
[run1]: rust-lang/rust#64846 (comment)
[run2]: rust-lang/rust#64846 (comment)
f1666de to 4e9e27d
I've pushed up a version which adds back
Thanks @alexcrichton! Just to satisfy my curiosity (and check that I understand
I would need to verify, but I think your understanding is correct and that would have the same effect of causing
Sorry about the delay, I'm dealing with some CI issues in #121.
@bors r+
📌 Commit 4e9e27d has been approved by
☀️ Test successful - checks-travis
Thanks @Amanieu! Mind publishing so I can include this in rust-lang/rust as well?
I've just published hashbrown 0.6.2. I made the
Pulls in rust-lang/hashbrown#119 which should be a good improvement for compile times of hashmap-heavy crates.
Thanks! I've opened rust-lang/rust#65766 to merge this into libstd.
… r=Mark-Simulacrum Update hashbrown to 0.6.2 Pulls in rust-lang/hashbrown#119 which should be a good improvement for compile times of hashmap-heavy crates.