
manually implement Hash for DefId #91660

Merged: 1 commit, Dec 14, 2021
Conversation

llogiq
Contributor

@llogiq llogiq commented Dec 8, 2021

This might speed up hashing for hashers that can work on individual u64s. Just as an experiment, suggested in a reddit thread on FxHasher. cc @nnethercote

Note that this should not be merged as-is; the new code path should first be cfg-ed to 64-bit platforms only.

@rust-highfive
Collaborator

r? @jackh726

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 8, 2021
@Mark-Simulacrum
Member

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 8, 2021
@bors
Contributor

bors commented Dec 8, 2021

⌛ Trying commit ad29818c9fa32552de852d1656757960a5487cf0 with merge abe401b21ff392a1e7153686e94b22d7949b9a0c...

@bors
Contributor

bors commented Dec 8, 2021

☀️ Try build successful - checks-actions
Build commit: abe401b21ff392a1e7153686e94b22d7949b9a0c

@rust-timer
Collaborator

Queued abe401b21ff392a1e7153686e94b22d7949b9a0c with parent ce0f7ba, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (abe401b21ff392a1e7153686e94b22d7949b9a0c): comparison url.

Summary: This change led to moderate relevant improvements 🎉 in compiler performance.

  • Moderate improvement in instruction counts (up to -0.4% on full builds of externs)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 8, 2021
@llogiq
Contributor Author

llogiq commented Dec 8, 2021

Those results appear to be a mixed bag. I suppose this might be because the CrateNum gets shifted out so those bits don't get distributed to as many bit positions by the subsequent multiplication in FxHash.

// hash one u64 instead of two u32s
impl Hash for DefId {
    fn hash<H: Hasher>(&self, h: &mut H) {
        ((u64::from(u32::from(self.krate)) << 32) | u64::from(u32::from(self.index))).hash(h)
    }
}


Suggested change
((u64::from(u32::from(self.krate)) << 32) | u64::from(u32::from(self.index))).hash(h)
((u64::from(u32::from(self.index)) << 32) | u64::from(u32::from(self.krate))).hash(h)

It may be faster this way: https://rust.godbolt.org/z/EMeebaGb3

Contributor Author


Yes but that would exacerbate the problem of def index bit distribution further (at least my theory is that we usually have more defs in a crate than we have crates within one build, but I may be wrong).

I'd rather reorder the fields in the type.

Contributor


This is a little-endian only improvement, right?


Yeah (probably 64-bit LE only to boot) -- if you throw a --target on the godbolt for a BE arch you can see what it looks like.

In practice the most popular architectures these days are LE (x86-64 and aarch64), but I guess if one were really motivated they could use #[cfg] to keep it optimal on both.
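
For illustration only, a rough sketch of what an endianness-aware version could look like (plain u32 fields stand in for CrateNum and DefIndex so it compiles on its own, and it assumes the fields are actually laid out in declaration order; this is not what the PR ends up doing):

    use std::hash::{Hash, Hasher};

    // Stand-in for rustc's DefId, with plain u32 fields instead of the
    // CrateNum/DefIndex newtypes.
    pub struct DefId {
        pub krate: u32,
        pub index: u32,
    }

    impl Hash for DefId {
        fn hash<H: Hasher>(&self, h: &mut H) {
            // Pick the packing that matches how the two adjacent u32 fields sit
            // in memory, so the optimizer can turn this into a single 64-bit load.
            #[cfg(target_endian = "little")]
            let combined = ((self.index as u64) << 32) | (self.krate as u64);
            #[cfg(target_endian = "big")]
            let combined = ((self.krate as u64) << 32) | (self.index as u64);
            combined.hash(h)
        }
    }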

Contributor


Thanks! I always find endianness tricky to think about, just checking my reasoning :)

@nnethercote
Contributor

Those results appear to be a mixed bag.

Are they? The instruction count changes of significance are all small improvements.

I suppose this might be because the CrateNum gets shifted out so those bits don't get distributed to as many bit positions by the subsequent multiplication in FxHash.

I don't understand this. Can you elaborate?

@llogiq
Contributor Author

llogiq commented Dec 8, 2021

I guess I was reading too much into the results. After looking into the hash function again, I think it's probably noise. I'd like to see if reordering the fields brings more benefit. What's your take? Should we try to optimize the derived Hash impl for 64-bit hashers?


@nnethercote
Contributor

I will experiment today with different variations. I have local benchmarking set up which is much faster than CI for doing multiple perf runs.

@nnethercote
Contributor

Ok, some very surprising results.

v1

Like the original in this PR (with slightly different syntax for the integer conversions):

    fn hash<H: Hasher>(&self, h: &mut H) {
        (((self.krate.as_u32() as u64) << 32) | (self.index.as_u32() as u64)).hash(h)
    }

Gave some tiny instruction count wins (I only measured Full builds):

| Benchmark & Profile | Scenario | % Change | Significance Factor |
|---|---|---|---|
| externs debug | full | -0.60% | 3.00x |
| externs opt | full | -0.60% | 3.00x |
| deep-vector debug | full | -0.55% | 2.77x |
| inflate debug | full | -0.35% | 1.76x |

v2

Alex's suggestion of swapping the order:

    fn hash<H: Hasher>(&self, h: &mut H) {
        (((self.index.as_u32() as u64) << 32) | (self.krate.as_u32() as u64)).hash(h)
    }

Gave gigantic slowdowns. Here's the first part of the results:

| Benchmark & Profile | Scenario | % Change | Significance Factor |
|---|---|---|---|
| stm32f4 check | full | 1350.47% | 6752.37x |
| stm32f4 debug | full | 1264.39% | 6321.93x |
| stm32f4 opt | full | 1242.46% | 6212.28x |
| unused-warnings check | full | 416.26% | 2081.28x |
| unused-warnings debug | full | 385.00% | 1925.00x |
| unused-warnings opt | full | 382.23% | 1911.17x |
| externs debug | full | 336.14% | 1680.69x |
| externs opt | full | 335.54% | 1677.72x |
| many-assoc-items debug | full | 161.70% | 808.52x |
| many-assoc-items check | full | 161.44% | 807.21x |
| many-assoc-items opt | full | 161.37% | 806.83x |
| externs check | full | 94.20% | 470.98x |

I did a Cachegrind diff and confirmed that it was due to huge numbers of collisions. E.g. this hashbrown function:

         .               fn move_next(&mut self, bucket_mask: usize) {
         .                   // We should have found an empty bucket by now and ended the probe.
         .                   debug_assert!(
         .                       self.stride <= bucket_mask,
         .                       "Went past end of probe sequence"
         .                   );
         .               
   386,363 ( 0.02%)          self.stride += Group::WIDTH;
         6 ( 0.00%)          self.pos += self.stride;
 1,571,177 ( 0.09%)          self.pos &= bucket_mask;
         .               }

changed to this:

            .               fn move_next(&mut self, bucket_mask: usize) {
            .                   // We should have found an empty bucket by now and ended the probe.
            .                   debug_assert!(
            .                       self.stride <= bucket_mask,
            .                       "Went past end of probe sequence"
            .                   ); 
            .               
  337,778,897 ( 3.86%)          self.stride += Group::WIDTH;
            6 ( 0.00%)          self.pos += self.stride;
1,476,729,111 (16.86%)          self.pos &= bucket_mask;
            .               }   
            .           }

Wow. FxHasher does really poorly when most of the entropy is in the higher bits.
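
One plausible way to see why (paraphrasing FxHasher's 64-bit mixing step from the rustc-hash crate, so treat the details as approximate): a wrapping multiply only propagates entropy from low bits towards high bits, and hashbrown takes the bucket index from the low bits of the hash (the `pos &= bucket_mask` above). With the nearly-constant CrateNum in the low 32 bits of the input word, the low bits of the resulting hash barely vary:

    // FxHasher's mixing step, roughly: rotate, xor in the new word, multiply.
    const K: u64 = 0x517cc1b727220a95;

    fn fx_step(hash: u64, word: u64) -> u64 {
        (hash.rotate_left(5) ^ word).wrapping_mul(K)
    }

    fn main() {
        // Inputs that differ only in their upper 32 bits (as in v2, where the
        // DefIndex entropy is shifted up) all collapse to the same low bits.
        for x in 0..4u64 {
            let h = fx_step(0, x << 32);
            println!("input {:#x}: hash low 32 bits = {:#010x}", x << 32, h & 0xffff_ffff);
        }
    }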

v3

Like v1, but with the fields in DefId swapped. Gave slightly better results than v1.

| Benchmark & Profile | Scenario | % Change | Significance Factor |
|---|---|---|---|
| match-stress-enum check | full | -2.14% | 10.72x |
| match-stress-enum debug | full | -2.04% | 10.21x |
| match-stress-enum opt | full | -2.04% | 10.19x |
| externs opt | full | -1.22% | 6.09x |
| externs debug | full | -1.22% | 6.08x |
| deep-vector debug | full | -0.56% | 2.80x |
| webrender-wrench debug | full | -0.35% | 1.75x |

But I get the same test failures seen on CI, which I don't understand. I get those same failures even if I write hash like this:

    fn hash<H: Hasher>(&self, h: &mut H) {
        self.krate.hash(h);
        self.index.hash(h);
    }
}

So now I'm really confused.

@llogiq
Contributor Author

llogiq commented Dec 8, 2021

Perhaps those tests rely on the field order of DefId? I'm not sure how (I haven't had time to look into them), but given your suggestion, I'll do that next.

@llogiq
Contributor Author

llogiq commented Dec 8, 2021

Also that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32 bit systems.

@nnethercote
Contributor

Also that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32 bit systems.

By "large shift" do you mean the left shift by 32? It's the obvious way to combine two 32 bit values into a 64 bit value. If you're talking about doing something more complex like xor or rotate then that's just a kind of pre-hash before the real hash, which doesn't seem worthwhile; might as well just hash the two fields in the normal way.

I did try another thing: changing FxHasher to be more like the fallback hasher from aHash.

pub(crate) const MULTIPLE: u64 = 6364136223846793005;

#[inline(always)]
pub(crate) const fn folded_multiply(s: u64, by: u64) -> u64 {
    let result = (s as u128).wrapping_mul(by as u128);
    ((result & 0xffff_ffff_ffff_ffff) as u64) ^ ((result >> 64) as u64)
}

impl FxHasher {
    #[inline]
    fn add_to_hash(&mut self, i: u64) {
        self.hash = folded_multiply(i ^ self.hash, MULTIPLE);
    }
}

The folded multiply is clever: it folds the overflow bytes from the multiply back into the result. This avoids the FxHasher weakness with higher bits and compiles down to the same number of instructions as FxHasher: https://rust.godbolt.org/z/17xqa5e5e

It doesn't go bad when the DefId fields are hashed in the other order, but overall it's a tiny bit more instructions than FxHasher. Not sure why, maybe it inlines a little differently.

@nnethercote
Contributor

Thinking some more: within the compiler FxHasher is mostly hashing integers and pointers, and some strings.

  • The integers are all or almost all 32-bits or less, so the weakness in bits 32-63 doesn't matter.
  • As for the pointers, on 64-bit machines they're only really 48-bits, I think? So the amount of distinguishing information is very much weighted towards the lower bits.
  • As for strings (e.g. symbols), they are processed in 64-bit chunks, and so the upper bits weakness will have some effect there.

@alex
Member

alex commented Dec 9, 2021

As for the pointers, on 64-bit machines they're only really 48-bits, I think? So the amount of distinguishing information is very much weighted towards the lower bits.

Except the bottom 2-3 bits, which are always 0. (I'm not sure if this meaningfully impacts your point.)

(Sorry for the confusion of commenting from two different accounts!)

@llogiq
Contributor Author

llogiq commented Dec 9, 2021

The UI tests appear to check trait coherence and ambiguity. Probably a good idea to dig into that code, as it may be less robust than we'd like.

@llogiq
Contributor Author

llogiq commented Dec 9, 2021

Having had a deeper look at the failing tests, I found that the difference hinges on the iteration order of the trait lookup, which is a HashMap, so it's expected that the order changes when the hash function does. I think updating the tests here should be fine.

@llogiq
Contributor Author

llogiq commented Dec 9, 2021

Ok, so I guess we want to avoid doing such optimizations generically. The risk of getting an avalanche of collisions (pardon the pun) is very real.

For this PR, before merging I think I should #[cfg]-out the change for all but 64-bit platforms. I also wonder whether it would be worth making the order of the DefId fields platform-endian-dependent, but then we'd get different UI test output on those platforms. Given that the difference isn't that large, I don't think it should block this PR.
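
As a rough sketch of that shape (u32 stand-ins for the real field types, with `index` placed first as in the v3 layout; the exact merged code may differ):

    use std::hash::{Hash, Hasher};

    pub struct DefId {
        // `index` first: it usually carries more entropy than `krate`, so it
        // ends up in the low bits of the combined u64 below.
        pub index: u32,
        pub krate: u32,
    }

    // On 64-bit targets, hash one u64 instead of two u32s.
    #[cfg(target_pointer_width = "64")]
    impl Hash for DefId {
        fn hash<H: Hasher>(&self, h: &mut H) {
            (((self.krate as u64) << 32) | (self.index as u64)).hash(h)
        }
    }

    // Everywhere else, hash the fields one by one, as the derive would.
    #[cfg(not(target_pointer_width = "64"))]
    impl Hash for DefId {
        fn hash<H: Hasher>(&self, h: &mut H) {
            self.krate.hash(h);
            self.index.hash(h);
        }
    }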

@llogiq
Contributor Author

llogiq commented Dec 9, 2021

This should be ready now. In practice it boils down to @nnethercote's v3 (which gave slightly better results in his benchmarks) on 64-bit systems, keeping the derived impls on all other pointer widths. Do we need another perf run, or can we merge this, @jackh726?

@apiraino apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Dec 9, 2021
@camelid
Member

camelid commented Dec 9, 2021

Also, I find this form more readable; do others?

Extracting krate = self.krate.as_u32() as u64 and index = self.index.as_u32() as u64 variables might improve it even further.
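
Something like this, just restating the same expression with the suggested locals (whether it reads better is a matter of taste):

    fn hash<H: Hasher>(&self, h: &mut H) {
        // name the two halves before combining them into one u64
        let krate = self.krate.as_u32() as u64;
        let index = self.index.as_u32() as u64;
        ((krate << 32) | index).hash(h)
    }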

@llogiq
Contributor Author

llogiq commented Dec 9, 2021

@camelid with the comments I added, this should hopefully now be sufficiently clear.

@camelid
Member

camelid commented Dec 9, 2021

I'd suggest making the comments into doc-comments so rustdoc picks them up. Also, what's the disadvantage of extracting local variables? I'm not sure why adding comments helps with that.

@llogiq
Contributor Author

llogiq commented Dec 9, 2021

I even made some ascii art to show where each bit ends up. So what's unclear? 😄

My reasoning for not making those doc comments is that they are for folks working on exactly that file whereas docs are for people using the type. If I just use a DefId, I don't need to care how it's going to be hashed. On the other hand, if I want to change DefId, I read the source anyway, which will include the comments.

@rust-timer
Collaborator

Finished benchmarking commit (03e04a6f4e077238d1936bfd25c5b6450327e583): comparison url.

Summary: This change led to large relevant improvements 🎉 in compiler performance.

  • Large improvement in instruction counts (up to -2.3% on full builds of match-stress-enum)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 9, 2021
@camelid
Member

camelid commented Dec 9, 2021

I even made some ascii art to show where each bit ends up. So what's unclear? 😄

Oh, I wasn't saying it's unclear, just that the code would be a little easier to read if local variables were extracted. Your comment is definitely helpful :)

This also reorders the fields to reduce the assembly operations for hashing
and changes two UI tests that depended on the former ordering because of
hashmap iteration order.
@nnethercote
Contributor

@bors r+

@bors
Contributor

bors commented Dec 10, 2021

📌 Commit 635533b has been approved by nnethercote

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 10, 2021
@klensy
Contributor

klensy commented Dec 10, 2021

Is it actually a good idea to change the order of the fields and rely on the current behavior? For the default repr(Rust), "there are no guarantees of data layout made by this representation", so is relying on the field order UB?

@llogiq
Contributor Author

llogiq commented Dec 10, 2021

We only rely on the field order for performance, not correctness, so there's nothing to worry about here.

@bors
Contributor

bors commented Dec 14, 2021

⌛ Testing commit 635533b with merge a2d25b4...

@bors
Contributor

bors commented Dec 14, 2021

☀️ Test successful - checks-actions
Approved by: nnethercote
Pushing a2d25b4 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Dec 14, 2021
@bors bors merged commit a2d25b4 into rust-lang:master Dec 14, 2021
@rustbot rustbot added this to the 1.59.0 milestone Dec 14, 2021
@llogiq llogiq deleted the make-a-hash-of-def-ids branch December 14, 2021 05:38
@rust-timer
Collaborator

Finished benchmarking commit (a2d25b4): comparison url.

Summary: This change led to large relevant improvements 🎉 in compiler performance.

  • Large improvement in instruction counts (up to -2.3% on full builds of match-stress-enum)

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

@michaelwoerister
Member

We only rely on field order for performance, not correctness. Nothing to worry here.

Field order could be guaranteed by #[repr(C)], right? Would that have any downsides?

@llogiq
Contributor Author

llogiq commented Dec 16, 2021

Probably not. Although at that point we may also want to #[cfg] the field order for 64-bit big-endian systems.

I'll create a follow-up PR.
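
A rough sketch of what that follow-up could look like (u32 stand-ins for the field types again; whether the extra complexity is worth it is exactly the open question):

    // Guarantee the layout with repr(C) and pick the field order per target
    // endianness, so a single 64-bit load of the struct yields
    // (krate << 32) | index on either endianness, matching the Hash impl.
    #[cfg(target_endian = "little")]
    #[repr(C)]
    pub struct DefId {
        pub index: u32,
        pub krate: u32,
    }

    #[cfg(target_endian = "big")]
    #[repr(C)]
    pub struct DefId {
        pub krate: u32,
        pub index: u32,
    }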
