
Remove most `#[inline]` annotations #119

Merged
merged 4 commits on Oct 15, 2019

Conversation

@alexcrichton
Member

alexcrichton commented Oct 1, 2019

This commit goes through and deletes almost all `#[inline]` annotations
in this crate. Before this commit basically every single function was
`#[inline]`, but this is generally not necessary for performance and can
have a severe impact on compile times in both debug and release modes,
most severely in release mode.

Some `#[inline]` annotations are definitely necessary, however. Most
functions in this crate are already candidates for inlining because
they're generic, but functions like those on `Group` and `BitMask` aren't
candidates for inlining without `#[inline]`. Additionally, LLVM is by no
means perfect, so some `#[inline]` annotations may still be necessary to
get further speedups.

The procedure used to generate this commit looked like:

  • Remove all `#[inline]` annotations.
  • Run `cargo bench`, comparing against the `master` branch, and add
    `#[inline]` to hot spots as necessary.
  • A PR (rust-lang/rust#64846) was made against rust-lang/rust to
    evaluate the impact on the compiler for more performance data.
  • Using this data, `perf diff` was used locally to determine further hot
    spots, and more `#[inline]` annotations were added.
  • A second round of benchmarking was done.
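The `cargo bench` comparison in the second step can be approximated with a crude std-only timing harness (a sketch only; `insert_n` and its constants are made up here, and real comparisons should use the crate's actual benches):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Build a map of `n` entries; a stand-in for an `insert` microbenchmark body.
fn insert_n(n: u64) -> HashMap<u64, u64> {
    let mut map = HashMap::new();
    for i in 0..n {
        map.insert(i, i.wrapping_mul(0x9E37_79B9));
    }
    map
}

fn main() {
    let n = 100_000;
    let start = Instant::now();
    let map = insert_n(n);
    let elapsed = start.elapsed();
    assert_eq!(map.len(), n as usize);
    println!("{} inserts in {:?} ({:.1} ns/insert)",
             n, elapsed, elapsed.as_nanos() as f64 / n as f64);
}
```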

The numbers are at the point where I think this should land in the crate
and get published to move into the standard library. There are up to 20%
wins in compile time for hashmap-heavy crates (like Cargo) and milder
wins (up to 10%) for a number of other large crates. The regressions are
all in the 1-3% range and are largely on benchmarks taking a handful of
milliseconds anyway, which I'd personally say is a worthwhile tradeoff.

For comparison, the benchmarks of this crate before and after this
commit look like so:

   name                         baseline ns/iter  new ns/iter  diff ns/iter   diff %  speedup
   insert_ahash_highbits        7,137             9,044               1,907   26.72%   x 0.79
   insert_ahash_random          7,575             9,789               2,214   29.23%   x 0.77
   insert_ahash_serial          9,833             9,476                -357   -3.63%   x 1.04
   insert_erase_ahash_highbits  15,824            19,164              3,340   21.11%   x 0.83
   insert_erase_ahash_random    16,933            20,353              3,420   20.20%   x 0.83
   insert_erase_ahash_serial    20,857            27,675              6,818   32.69%   x 0.75
   insert_erase_std_highbits    35,117            38,385              3,268    9.31%   x 0.91
   insert_erase_std_random      35,357            37,236              1,879    5.31%   x 0.95
   insert_erase_std_serial      30,617            34,136              3,519   11.49%   x 0.90
   insert_std_highbits          15,675            18,180              2,505   15.98%   x 0.86
   insert_std_random            16,566            17,803              1,237    7.47%   x 0.93
   insert_std_serial            14,612            16,025              1,413    9.67%   x 0.91
   iter_ahash_highbits          1,715             1,640                 -75   -4.37%   x 1.05
   iter_ahash_random            1,721             1,634                 -87   -5.06%   x 1.05
   iter_ahash_serial            1,723             1,636                 -87   -5.05%   x 1.05
   iter_std_highbits            1,715             1,634                 -81   -4.72%   x 1.05
   iter_std_random              1,715             1,637                 -78   -4.55%   x 1.05
   iter_std_serial              1,722             1,637                 -85   -4.94%   x 1.05
   lookup_ahash_highbits        4,565             5,809               1,244   27.25%   x 0.79
   lookup_ahash_random          4,632             4,047                -585  -12.63%   x 1.14
   lookup_ahash_serial          4,612             4,906                 294    6.37%   x 0.94
   lookup_fail_ahash_highbits   4,206             3,976                -230   -5.47%   x 1.06
   lookup_fail_ahash_random     4,327             4,211                -116   -2.68%   x 1.03
   lookup_fail_ahash_serial     8,999             4,386              -4,613  -51.26%   x 2.05
   lookup_fail_std_highbits     13,284            13,342                 58    0.44%   x 1.00
   lookup_fail_std_random       13,172            13,614                442    3.36%   x 0.97
   lookup_fail_std_serial       11,240            11,539                299    2.66%   x 0.97
   lookup_std_highbits          13,075            13,333                258    1.97%   x 0.98
   lookup_std_random            13,257            13,193                -64   -0.48%   x 1.00
   lookup_std_serial            10,782            10,917                135    1.25%   x 0.99

My read of this is that the microbenchmarks are sort of all over the
place, but they're neither consistently regressing nor improving, as
expected. In general I would be surprised if there's much of a
significant performance regression attributable to this commit, and
`#[inline]` can always be selectively added back in easily without
adding it to every function in the crate.

@nnethercote

nnethercote commented Oct 1, 2019

the microbenchmarks are sort of all over the place, but they're neither consistently regressing nor improving

I see more regressions than improvements, and with the exception of lookup_fail_ahash_serial, the regressions are mostly larger than the improvements. This becomes clearer if you sort the table by diff %.

   name                         baseline ns/iter  new ns/iter  diff ns/iter   diff %  speedup
   insert_erase_ahash_serial    20,857            27,675              6,818   32.69%   x 0.75
   insert_ahash_random          7,575             9,789               2,214   29.23%   x 0.77
   lookup_ahash_highbits        4,565             5,809               1,244   27.25%   x 0.79
   insert_ahash_highbits        7,137             9,044               1,907   26.72%   x 0.79
   insert_erase_ahash_highbits  15,824            19,164              3,340   21.11%   x 0.83
   insert_erase_ahash_random    16,933            20,353              3,420   20.20%   x 0.83
   insert_std_highbits          15,675            18,180              2,505   15.98%   x 0.86
   insert_erase_std_serial      30,617            34,136              3,519   11.49%   x 0.90
   insert_std_serial            14,612            16,025              1,413    9.67%   x 0.91
   insert_erase_std_highbits    35,117            38,385              3,268    9.31%   x 0.91
   insert_std_random            16,566            17,803              1,237    7.47%   x 0.93
   lookup_ahash_serial          4,612             4,906                 294    6.37%   x 0.94
   insert_erase_std_random      35,357            37,236              1,879    5.31%   x 0.95
   lookup_fail_std_random       13,172            13,614                442    3.36%   x 0.97
   lookup_fail_std_serial       11,240            11,539                299    2.66%   x 0.97
   lookup_std_highbits          13,075            13,333                258    1.97%   x 0.98
   lookup_std_serial            10,782            10,917                135    1.25%   x 0.99
   lookup_fail_std_highbits     13,284            13,342                 58    0.44%   x 1.00
   lookup_std_random            13,257            13,193                -64   -0.48%   x 1.00
   lookup_fail_ahash_random     4,327             4,211                -116   -2.68%   x 1.03
   insert_ahash_serial          9,833             9,476                -357   -3.63%   x 1.04
   iter_ahash_highbits          1,715             1,640                 -75   -4.37%   x 1.05
   iter_std_random              1,715             1,637                 -78   -4.55%   x 1.05
   iter_std_highbits            1,715             1,634                 -81   -4.72%   x 1.05
   iter_std_serial              1,722             1,637                 -85   -4.94%   x 1.05
   iter_ahash_serial            1,723             1,636                 -87   -5.05%   x 1.05
   iter_ahash_random            1,721             1,634                 -87   -5.06%   x 1.05
   lookup_fail_ahash_highbits   4,206             3,976                -230   -5.47%   x 1.06
   lookup_ahash_random          4,632             4,047                -585  -12.63%   x 1.14
   lookup_fail_ahash_serial     8,999             4,386              -4,613  -51.26%   x 2.05

I don't know anything about the relative importance of each benchmark, though.

@Amanieu

Collaborator

Amanieu commented Oct 1, 2019

The general order of importance for hash table operations is (from most important to least):

  • successful lookup
  • unsuccessful lookup
  • insert (which implies an unsuccessful lookup)
  • remove
  • iteration

I am particularly worried about the regression in the insertion benchmarks. Looking at the disassembly shows that there are two out-of-line functions: `HashMap::insert` (which checks whether the key exists) and `RawTable::insert` (which doesn't).

As a point of comparison, in the C++ version of SwissTables, every function is inline except for `prepare_for_insert`, which roughly maps to `RawTable::insert`.

@alexcrichton

Member Author

alexcrichton commented Oct 1, 2019

I'm personally very wary of considering these microbenchmarks serious regressions and/or grounds for skipping this PR entirely. One benchmark got 100% faster by removing `#[inline]`, which I think shows that these results are sort of all over the place and extremely susceptible to decisions in LLVM, and penalizing all users with more codegen does not seem like a fair tradeoff. I've also seen that when enabling LTO on `master` most of these benchmarks 'regress'. I think they've just got a good deal of variation.

I am particularly worried about the regression in insertion benchmarks.

One thing I've tried to emphasize with this PR is drawing from data. Data sources like perf.r-l.o and these local benchmarks are showing that 90% of the wins come from just inlining the functions which otherwise would not be candidates for inlining (like non-generic functions). There's an extremely long tail of "regressions" elsewhere, but I think we should make an explicit decision to trade off a minuscule amount of perf in microbenchmarks for a 20% compile-time win in hashmap-heavy crates.

This is a balancing act, and I think it's fine to use concrete data to guide insertion of `#[inline]`, but I want to push back very hard against the idea that everything needs `#[inline]`. With what I mentioned above, I'm very wary of using these benchmarks in this repository to guide the insertion of `#[inline]` beyond the "90% of the perf matters" case. I don't think getting a few percent on these benchmarks actually translates to real-world wins anywhere else.

As a point of comparison, in the C++ version of SwissTables, every function is inline except for prepare_for_insert which roughly maps to RawTable::insert.

I'd want to be clear, though: Rust's compilation model has no parallel in C++. What C++ does with headers does not at all match Rust generics and `#[inline]`. While it's similar, there are some subtle but crucial differences.

The entire crate is "inlined" anyway since it's generic. Using `#[inline]` causes the compiler to do extra work, such as codegen'ing the function into every single codegen unit which references it, as well as adding `inlinehint` in LLVM. Those two are the source of quite large compile-time slowdowns when using `HashMap` heavily (as seen on perf.r-l.o). They are also almost always "fixed" via ThinLTO, just like all other non-`#[inline]` functions, which may or may not be generic in Rust.

@Amanieu

Collaborator

Amanieu commented Oct 1, 2019

Using `#[inline]` causes the compiler to do extra work, such as codegen'ing into every single codegen unit which references it as well as adding `inlinehint` to LLVM.

Isn't the codegen'ing done anyway for generic functions? This means that effectively, in hashbrown, all we are doing is adding the `inlinehint` attribute to a few functions, which in turn causes LLVM to inline more aggressively.

One thing I've tried to emphasize with this PR is drawing from data.

I disagree with your interpretation of the perf.r-l.o data: if you filter the results to only look at the check results, you can see that this change is a 1%-3% regression across the board. I would argue that this is a very significant regression considering that introducing hashbrown only achieved an average speedup of 5%.

@Mark-Simulacrum

Member

Mark-Simulacrum commented Oct 2, 2019

Looking at the wall-time measurements, it's pretty clear to me that most of the check regressions, while nominally 1-3%, actually amount to less than ~100ms of extra compile time. I agree with @alexcrichton here that the trade-off in compile time on optimized/debug LLVM builds is more than worth the possibly tiny losses in runtime performance.

@alexcrichton

Member Author

alexcrichton commented Oct 2, 2019

No, `#[inline]` is very different from just an inline hint. As I mentioned before, there's no equivalent in C++ for what `#[inline]` does. In debug mode rustc basically ignores `#[inline]`, pretending you didn't even write it. In release mode the compiler will, by default, codegen an `#[inline]` function into every single referencing codegen unit, and then it will also add `inlinehint`. This means that if you have 16 CGUs and they all reference a hash map, every single one is getting the entire hash map implementation inlined into it.

Instead, the behavior with this PR is that only one CGU has the hash map (because it must be monomorphized somewhere), all 16 CGUs reference it, and then ThinLTO will inline across codegen units as necessary.
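As a rough illustration of the two strategies (the function is hypothetical, and the compiler behavior is paraphrased in the comments):

```rust
// With `#[inline]` (release mode): the body below is codegen'd into every
// CGU that references it, and LLVM also gets an `inlinehint` -- more
// codegen work per referencing CGU.
#[inline]
pub fn probe_distance(hash: u64, slot: usize, mask: usize) -> usize {
    slot.wrapping_sub(hash as usize) & mask
}

// Without `#[inline]`: one CGU holds the body, the other CGUs merely call
// it, and ThinLTO may still inline it across CGU boundaries afterwards.
pub fn probe_distance_outlined(hash: u64, slot: usize, mask: usize) -> usize {
    slot.wrapping_sub(hash as usize) & mask
}

fn main() {
    // Both compute the same thing; only the codegen strategy differs.
    assert_eq!(probe_distance(2, 5, 0b111), 3);
    assert_eq!(probe_distance_outlined(2, 5, 0b111), 3);
    println!("ok");
}
```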

I will again, as I usually do, strongly push back against religiously adhering to the numbers provided by perf.rust-lang.org. What you're looking at is instruction counts, which guarantee no correlation with runtime. A change in instruction counts can be, and often is, an indicator that something about the wall time changed. But moving around a few percent of instructions here or there doesn't mean anything in terms of a meaningful number; it simply means "please take the time to investigate more to understand what this change means".

As @Mark-Simulacrum points out, the "regressions" here are on the order of milliseconds. I don't think anyone's going to lament that rustc is a few milliseconds slower on each crate; no one is even close to the scale where that matters at all. What people do care about is shaving 20% off their compile time when using hash maps. That's actually a significant win, and has real-world impacts on any "big" crate.

@nnethercote

nnethercote commented Oct 3, 2019

I made some comments about the rustc perf effects yesterday, here.

More generally, this change has two effects.

  1. It somewhat hurts the runtime performance of code that uses HashMap/HashSet.
  2. It improves compile times of code that uses HashMap/HashSet, sometimes significantly. (This applies to debug and opt builds, but not to check builds.)

So the question is: what's the right balance between performance and compile times? Different people will have different opinions about this. @Amanieu worked hard to get hashbrown as fast as possible, and so will naturally be reluctant to make changes that compromise that. @alexcrichton works on Cargo, which stands to gain 18% compile time speedups, and so will naturally have a different opinion.

I don't know what the right answer is here, but having #[inline] on every (or almost every) function does seem excessive. Looking at executable sizes might be instructive, too.

@BurntSushi

Member

BurntSushi commented Oct 3, 2019

As a small note here, you could put inlining behind a feature that is enabled by default. That's what I did for regex (among other things): https://docs.rs/regex/1.3.1/regex/#performance-features

Of course, if you depend on anything that depends on hashbrown that enables the feature, then I don't think it can be turned off.
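A hedged sketch of that feature-gate approach (the `inline-more` feature name and the `bucket_index` function are made up here for illustration):

```rust
// Cargo.toml (sketch):
// [features]
// default = ["inline-more"]
// inline-more = []

// `#[inline]` is applied only when the (hypothetical) feature is enabled,
// so downstream users can opt out to recover compile time.
#[cfg_attr(feature = "inline-more", inline)]
pub fn bucket_index(hash: u64, bucket_mask: usize) -> usize {
    (hash as usize) & bucket_mask
}

fn main() {
    // Behavior is identical either way; only codegen differs.
    assert_eq!(bucket_index(0x2A, 0xF), 0xA);
    println!("ok");
}
```

Note that, as mentioned above, default features compose transitively: any dependency that enables the feature turns it on for the whole build.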

@novacrazy

novacrazy commented Oct 3, 2019

Just my two cents, but I would gladly wait multiple extra minutes of compile time if it improves final runtime performance by even 5%. If you need faster debug iteration, why not just do something like `#[cfg_attr(not(debug_assertions), inline)]`?

@nnethercote

nnethercote commented Oct 3, 2019

I guess one way to do this is to measure multiple versions: no inlining, full inlining, and several points in between. If we had data on, say, five different versions, it might show that there is a sweet spot where we can get a big chunk of the compile-time wins with very little runtime performance cost. (Or it might show that there is no such sweet spot.)

I can see that @alexcrichton did some of that already. It would be instructive to have more data points; I understand that this would take a significant amount of time.

@alexcrichton

Member Author

alexcrichton commented Oct 3, 2019

I sort of get the impression that very few folks are okay with admitting that getting compile times under control will require changing the code we send to rustc. I feel that most of the discussion here is "look at the red number on perf.r-l.o, that means we can't land anything, right?" I find that line of reasoning pretty unproductive; it also misses the point of what perf.r-l.o even is, which, I'll say again, is purely instruction counts, which may correlate with wall-time performance but don't always.

I don't think it's the case that 100% of Rust users want the absolute fastest code at all costs no matter how long it takes. I'm sure that's the case for some but I think there's a very large chunk of users that want to also be able to reasonably iterate fast as well (cue everyone who's ever thought that Rust compiles slowly). I feel like our job in the standard library is to strike a balance, and adding #[inline] on every function we can find is a case of gross overuse and feels entirely driven by fear that someone might eventually show a benchmark that's slower. The hashbrown crate is just that, a crate on crates.io. If @Amanieu you really want to keep #[inline] everywhere then @BurntSushi's idea seems reasonable, but I would personally have blocked hashbrown landing in the standard library had I seen that #[inline] was applied literally everywhere.

I personally find it extremely difficult and frustrating to make these sorts of changes. As I mentioned above, I feel that few want to admit that these sorts of changes are necessary for getting compile times under control. This has been true for all of Rust's history; for example, I was quite frustrated that parallel codegen was stymied due to the lack of ThinLTO originally. This later ended up being the only major dip in compile times in Rust's history when we finally got it enabled with ThinLTO. This is a way of saying that I'm running out of steam for making these kinds of changes, since for years basically no one has seemed to "be on my side". That's a sign that I'm one of the only people who cares enough about this to put energy into it, and it's not really worth my time if I'm always the sole advocate.

@bluss

Member

bluss commented Oct 3, 2019

@alexcrichton Your explanation of #[inline] here is hugely helpful! It would be great if we could spread some up to date information in the community about how inlining works in Rust in the landscape of codegen units, debug vs release compiles etc.

I think what you are doing always sets an example (it does for me at least; I started working on a de-inlining PR for a crate yesterday, though it's possible I'm too optimistic about `inline(always)` for small methods).

@Amanieu

Collaborator

Amanieu commented Oct 3, 2019

Thanks @alexcrichton for your explanation of #[inline], it makes the problem much clearer. I agree that we should aim to reduce compile times (18% reduction in compile time for cargo is huge and definitely worth the performance cost).

I am happy to accept what @BurntSushi suggested, which is to put the inlining behind an inline feature, with the exception of some hand-picked hot functions which are always marked #[inline].

However I would also like to get a better understanding of how #[inline] interacts with codegen units in Rust. If my understanding is correct, the main performance cost is that we need to generate LLVM IR multiple times if a function is referenced by multiple codegen units. What if we only marked internal methods with #[inline] but not the public API of HashMap? If my understanding is correct (which it probably isn't) the public methods will only be monomorphized in one codegen unit, and all the inlined internal methods will only be referenced from that codegen unit. Would this avoid the issue of generating LLVM IR multiple times for the same method?
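The layout being asked about might look like this sketch (the functions are hypothetical, and whether this actually avoids duplicated IR across CGUs is exactly the open question):

```rust
// Internal helper: marked `#[inline]` so it can be inlined into the public
// method below, within whatever CGU ends up holding that method.
#[inline]
fn h1(hash: u64) -> usize {
    hash as usize
}

// Public entry point: *not* `#[inline]`, so (per the hypothesis above) it
// is codegen'd into a single CGU, which would then be the only CGU that
// ever references the inlined helper.
pub fn find_slot(hash: u64, bucket_mask: usize) -> usize {
    h1(hash) & bucket_mask
}

fn main() {
    assert_eq!(find_slot(21, 0b111), 5);
    println!("ok");
}
```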

@nnethercote

nnethercote commented Oct 3, 2019

It's clear to me that rust-lang/rust#64600 and this PR have identified that excessive inlining of library functions can have a shockingly large effect on debug/opt compile times. As someone who has put a lot of energy into improving compile times, I'm taking this as good news -- there's a whole new area of potential improvements that I previously didn't know about. But every one of those potential improvements could involve a compile time vs runtime trade-off. New tools to identify which inlined functions cause the most bloat will be very valuable.

Avoids unnecessary rebuilds when locally developing the crate.
Helps when debugging and looking at symbols to see what we got.
@alexcrichton alexcrichton force-pushed the alexcrichton:less-generics branch from f1666de to 4e9e27d Oct 9, 2019
@alexcrichton

Member Author

alexcrichton commented Oct 9, 2019

I've pushed up a version which adds back `#[inline]` behind a `#[cfg_attr]` for all "removed" `#[inline]` annotations.

@Amanieu

Collaborator

Amanieu commented Oct 9, 2019

Thanks @alexcrichton!

Just to satisfy my curiosity (and check that I understand #[inline] correctly), could you tell me if the following would work in theory? I'm not asking you to change the PR, I'm happy with it as it is.

However I would also like to get a better understanding of how #[inline] interacts with codegen units in Rust. If my understanding is correct, the main performance cost is that we need to generate LLVM IR multiple times if a function is referenced by multiple codegen units. What if we only marked internal methods with #[inline] but not the public API of HashMap? If my understanding is correct (which it probably isn't) the public methods will only be monomorphized in one codegen unit, and all the inlined internal methods will only be referenced from that codegen unit. Would this avoid the issue of generating LLVM IR multiple times for the same method?

@alexcrichton

Member Author

alexcrichton commented Oct 9, 2019

I would need to verify, but I think your understanding is correct, and that would have the same effect of causing `std::collections::HashMap` to not forcibly get inlined into all CGUs.

@Amanieu

Collaborator

Amanieu commented Oct 13, 2019

Sorry about the delay, I'm dealing with some CI issues in #121.

@Amanieu

Collaborator

Amanieu commented Oct 15, 2019

@bors r+

@bors

Contributor

bors commented Oct 15, 2019

📌 Commit 4e9e27d has been approved by Amanieu

@bors

Contributor

bors commented Oct 15, 2019

⌛️ Testing commit 4e9e27d with merge b8c34c9...

bors added a commit that referenced this pull request Oct 15, 2019
Remove most `#[inline]` annotations

   lookup_fail_std_random       13,172            13,614                442    3.36%   x 0.97
   lookup_fail_std_serial       11,240            11,539                299    2.66%   x 0.97
   lookup_std_highbits          13,075            13,333                258    1.97%   x 0.98
   lookup_std_random            13,257            13,193                -64   -0.48%   x 1.00
   lookup_std_serial            10,782            10,917                135    1.25%   x 0.99
```

The summary of this from what I can tell is that the microbenchmarks are
sort of all over the place, but they're neither consistently regressing
nor improving, as expected. In general I would be surprised if there's
much of a significant performance regression attributed to this commit,
and `#[inline]` can always be selectively added back in easily without
adding it to every function in the crate.

[PR]: rust-lang/rust#64846
[run1]: rust-lang/rust#64846 (comment)
[run2]: rust-lang/rust#64846 (comment)
@bors


bors commented Oct 15, 2019

☀️ Test successful - checks-travis
Approved by: Amanieu
Pushing b8c34c9 to master...

@bors bors merged commit 4e9e27d into rust-lang:master Oct 15, 2019
1 of 2 checks passed
Travis CI - Pull Request Build Failed
homu Test successful
@alexcrichton alexcrichton deleted the alexcrichton:less-generics branch Oct 22, 2019
@alexcrichton


alexcrichton commented Oct 22, 2019

Thanks @Amanieu! Mind publishing so I can include this in rust-lang/rust as well?

@Amanieu


Amanieu commented Oct 23, 2019

I've just published hashbrown 0.6.2. I made the `inline-more` feature enabled by default since I don't want to regress performance for anyone using the hashbrown crate directly. However, libstd uses hashbrown with default features disabled and shouldn't be affected.

alexcrichton added a commit to alexcrichton/rust that referenced this pull request Oct 24, 2019
Pulls in rust-lang/hashbrown#119 which should be a good improvement for
compile times of hashmap-heavy crates.
@alexcrichton


alexcrichton commented Oct 24, 2019

Thanks! I've opened rust-lang/rust#65766 to merge this into libstd

Centril added a commit to Centril/rust that referenced this pull request Oct 24, 2019
… r=Mark-Simulacrum

Update hashbrown to 0.6.2

Pulls in rust-lang/hashbrown#119 which should be a good improvement for
compile times of hashmap-heavy crates.