
Implement a fast HashMap / hash function #11783

Closed
brson opened this Issue Jan 25, 2014 · 44 comments

brson (Contributor) commented Jan 25, 2014

Our current HashMap uses cryptographically secure hashing, but it is slow, and sometimes that is a big problem. Implement another hash map that is very fast.

brson (Contributor, Author) commented Jan 25, 2014

Not all of our HashMap's performance problems are necessarily due to the hashing algorithm. The Rust implementation, paired with our IterBytes trait, may also be part of it.

thestinger (Contributor) commented Jan 25, 2014

A single-pass SipHash implementation is indeed faster than the stateful implementation used by Rust, but it's not an enormous improvement. Rust's hash table is faster than the libstdc++/libc++ unordered_map when using a C++ (no asm or SIMD intrinsics) single-pass SipHash implementation because linear open addressing is a good fit for a strong hash function.

SipHash fundamentally requires a 60-120 cycle (on x86_64) setup time, and then provides a hashing rate of ~2 cycles per byte (maybe you can do better with asm and SIMD intrinsics). This is tolerable for long strings, because it's only twice as slow as good hash functions without the same guarantees. For integers and short strings, the setup time is a huge expense.
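To make that cost structure concrete, here is a minimal sketch using today's std API (`DefaultHasher`, which is currently SipHash-based; this API postdates this discussion and is shown purely for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One-shot hash of any `Hash` value. For an 8-byte integer the
/// compression loop touches a single block, so the fixed setup and
/// finalization rounds dominate the total cost.
fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new(); // fixed cost: key load + initial rounds
    value.hash(&mut h);               // ~2 cycles/byte, amortized on long inputs
    h.finish()                        // fixed cost: finalization rounds
}

fn main() {
    println!("u64:   {:016x}", hash_one(&42u64));
    println!("short: {:016x}", hash_one(&"key"));
}
```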

I'm sure that a hash function can exist for small fixed-size keys with far better performance and the same security guarantees, and we may be able to find a weak block cipher to use for this. It will of course still be slower than using the identity function as the hash like C++ standard library implementations all seem to do (with a few extra bit operations in the table itself while matching it to a bucket).

thestinger (Contributor) commented Jan 25, 2014

Tagged with I-compiletime since I think this is a major bottleneck in librustc. librustc may be able to use an improved TrieMap instead, but the current implementation is not good enough.

huonw (Member) commented Jan 25, 2014

Something like SmallIntMap would also be usable for NodeId tables with high occupancy, since NodeIds increment by 1 from 0. (This gave a few tens of megabytes of memory savings when I converted the ast map in #11616, although I imagine no other table will be as full as that one.)
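The SmallIntMap idea is just a vector indexed by the key; a minimal sketch (hypothetical code, not the std implementation):

```rust
/// Map from small, dense unsigned integers to values: the key is the
/// index. With high occupancy this wastes little space and needs no
/// hashing at all.
struct SmallIntMap<V> {
    slots: Vec<Option<V>>,
}

impl<V> SmallIntMap<V> {
    fn new() -> Self {
        SmallIntMap { slots: Vec::new() }
    }

    fn insert(&mut self, key: usize, value: V) {
        if key >= self.slots.len() {
            self.slots.resize_with(key + 1, || None);
        }
        self.slots[key] = Some(value);
    }

    fn get(&self, key: usize) -> Option<&V> {
        self.slots.get(key).and_then(|slot| slot.as_ref())
    }
}

fn main() {
    // NodeIds count up from 0, so occupancy stays high.
    let mut m = SmallIntMap::new();
    m.insert(0, "crate root");
    m.insert(1, "fn main");
    assert_eq!(m.get(1), Some(&"fn main"));
}
```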

c-a (Contributor) commented Jan 25, 2014

Go uses the AESENC instruction, available on newer x86/amd64 processors, to implement a fast hash function that is supposedly secure against hash-collision DoS attacks. Unfortunately, I can't find any other references to using AESENC for this purpose.

thestinger (Contributor) commented Jan 25, 2014

What do we do on other processors though?

bnoordhuis (Contributor) commented Jan 27, 2014

Is it a requirement that the hash function is cryptographically secure? For example, is a hash function that's seeded by a random number at start-up acceptable?

thestinger (Contributor) commented Jan 27, 2014

@bnoordhuis: SipHash provides security against DoS attacks. That's the sense in which it's cryptographically secure. Hash functions like murmurhash don't actually provide working DoS protection.

bnoordhuis (Contributor) commented Jan 27, 2014

@thestinger I think we're talking about the same thing: preventing collision attacks, right?

Taking V8 as an example, it uses what looks like a Jenkins hash derivative (the inner loop is two adds, two shifts and a xor) with a randomized initial value. It doesn't give quite the same guarantees as SipHash but the tradeoff is that it's much faster. Is an approach like that acceptable?

Apropos random numbers, I noticed that each hash map is created with a new random seed. Has anyone measured what happens when you generate the random seed once instead of repeatedly?
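For reference, the shape bnoordhuis describes looks roughly like a seeded Jenkins one-at-a-time hash. A sketch, using the classic one-at-a-time constants rather than V8's actual values:

```rust
/// Jenkins-style one-at-a-time hash with a randomized initial value.
/// A few adds, shifts, and xors per byte, so it's much cheaper than
/// SipHash, but it does not carry SipHash's security guarantees.
fn seeded_oaat(bytes: &[u8], seed: u32) -> u32 {
    let mut h = seed;
    for &b in bytes {
        h = h.wrapping_add(b as u32);
        h = h.wrapping_add(h << 10);
        h ^= h >> 6;
    }
    // Finalization: spread the last few bytes through the whole word.
    h = h.wrapping_add(h << 3);
    h ^= h >> 11;
    h.wrapping_add(h << 15)
}

fn main() {
    let seed = 0x9e37_79b9; // would come from a process-wide RNG in practice
    println!("{:08x}", seeded_oaat(b"example", seed));
}
```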

brson (Contributor, Author) commented Jan 27, 2014

@bnoordhuis It's not necessarily a requirement that every hash map prevents collision attacks, but I think it is a requirement that the default one does, as long as we also have a fast option. If the default is both fast and prevents collisions, that is even better.

I don't recall anybody measuring the difference in seeding strategies. We do have a task-local RNG that, imo, should generally be used for these kinds of things.

thestinger (Contributor) commented Jan 27, 2014

@brson: At one point, I tried using a constant as the key and it didn't make an impact. The speed issue is entirely caused by SipHash, although a weak hash wouldn't mix as well with linear open addressing.

bnoordhuis (Contributor) commented Jan 28, 2014

@thestinger Can you qualify 'weak hash'? As long as the hash function's output is unpredictable and has the avalanche property, it should be good, right?

cgaebel (Contributor) commented Jan 29, 2014

I am currently investigating using Robin Hood Hashing [1][2] for this. I will report back with results and a pull request.

[1] http://codecapsule.com/2013/11/11/robin-hood-hashing/
[2] http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
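For readers unfamiliar with Robin Hood hashing: on insertion into an open-addressed table, an entry that has probed further than the resident entry steals that slot, and the displaced entry continues probing. A compact sketch of just the displacement rule (illustrative only; no deletion or resizing, and independent of the eventual PR):

```rust
/// Toy Robin Hood table over u64 keys. Hash 0 is reserved to mark an
/// empty bucket; real implementations use a separate flag instead.
struct Table {
    hashes: Vec<u64>,
    keys: Vec<u64>,
    mask: usize, // capacity - 1, with capacity a power of two
}

impl Table {
    fn with_capacity_pow2(cap: usize) -> Table {
        Table { hashes: vec![0; cap], keys: vec![0; cap], mask: cap - 1 }
    }

    /// Insert, assuming spare capacity (no resize logic in this sketch).
    fn insert(&mut self, mut hash: u64, mut key: u64) {
        let mut idx = (hash as usize) & self.mask;
        let mut dist = 0; // our probe sequence length so far
        loop {
            if self.hashes[idx] == 0 {
                self.hashes[idx] = hash;
                self.keys[idx] = key;
                return;
            }
            // How far is the resident entry from its home bucket?
            let resident_dist = idx.wrapping_sub(self.hashes[idx] as usize) & self.mask;
            if resident_dist < dist {
                // Rob the rich: the resident has a shorter probe length,
                // so we take its slot and keep probing with the displaced
                // entry. This caps the variance of probe lengths.
                std::mem::swap(&mut hash, &mut self.hashes[idx]);
                std::mem::swap(&mut key, &mut self.keys[idx]);
                dist = resident_dist;
            }
            idx = (idx + 1) & self.mask;
            dist += 1;
        }
    }
}

fn main() {
    let mut t = Table::with_capacity_pow2(8);
    for k in 1..=5u64 {
        t.insert(k.wrapping_mul(0x9e3779b97f4a7c15), k); // toy hash, for demo only
    }
    println!("{:?}", t.keys);
}
```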

huonw (Member) commented Jan 29, 2014

#11863 is relevant to this since it changes the infrastructure of hashing in std.

ghost commented Jan 29, 2014

I've ported the xxHash algorithm's single-pass version, and plan on doing more of these, like MurmurHash3. I'm unsure what a good Rust hashing API would look like though, so I'm holding off on that for now.

thestinger (Contributor) commented Jan 29, 2014

@bnoordhuis: Yes, but the default will need to remain one providing a strong guarantee of security against denial of service attacks. Ideally, we won't need a second hash table.

I think running a fraction of the rounds of a block cipher is a good path to investigate for fixed-size keys.
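To make the fixed-size-key direction concrete: an invertible multiply-xorshift mixer over one 64-bit word has no per-byte loop at all. The sketch below uses the well-known splitmix64 finalizer; xoring in a secret to "key" it is shown only for flavor, and is not by itself known to provide SipHash's DoS guarantee:

```rust
/// splitmix64 finalizer: an invertible mixer over a 64-bit word. A
/// handful of multiplies and xor-shifts, so the cost for an 8-byte key
/// is a few cycles rather than SipHash's ~60-cycle setup.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58476d1ce4e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d049bb133111eb);
    x ^= x >> 31;
    x
}

fn main() {
    let secret = 0x0123_4567_89ab_cdef_u64; // per-process random key (illustrative)
    let key = 42u64;
    println!("{:016x}", mix64(key ^ secret));
}
```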

metajack (Contributor) commented Jan 31, 2014

Found this over in Clojure-land, which might be useful:
http://dev.clojure.org/display/design/Better+hashing

The discussion happened when someone found a small program that had terrible hashing performance:
https://groups.google.com/forum/#!msg/clojure/5vg0T8F7_1w/-x75e9ve1UgJ

erickt (Contributor) commented Jan 31, 2014

@Jurily: can you comment on #11863? Will this design work for xxHash?

ghost commented Jan 31, 2014

@erickt My two concerns were the variable number and types of keys on construction (xxHash uses only one u32) and the need for a generic Result type (xxHash's result is u32). I'm happy.

brson (Contributor, Author) commented Feb 1, 2014

Would extending xxHash's result to u64 (instead of parameterizing the hash result) hurt its performance?

cmr (Member) commented Feb 1, 2014

@eddyb wants to use our new default type parameters for this.

ghost commented Feb 1, 2014

For 64-bit, I'd rather port MurmurHash (which also has a 32-bit variant). It's what libstdc++ uses. Quoting:

(3) - That's about 1.68 bytes per cycle, or about 9.5 cycles per 16-byte chunk. The inner loop is 20 instructions long, so we're sustaining over 2 instructions per cycle. Hooray for modern platforms with fast 64-bit multipliers and superscalar architectures. :)

ghost commented Feb 2, 2014

I've come to believe there's nothing wrong with SipHash:

```rust
#[cfg(test)]
mod rust {
    #[bench]
    fn iterbytes(bench: &mut BenchHarness) {
        use std::to_bytes::{IterBytes};
        bench_base( bench, |v| {
            let mut xxh: XXHState = XXHState::new(0);
            v.iter_bytes(true, |bytes| {
                xxh.update(bytes);
                true
            });
            xxh.digest();
        });
    }

    #[bench]
    fn oneshot(bench: &mut BenchHarness) {
        bench_base( bench, |v| {
            xxh32(v, 0);
        });
    }
}

#[cfg(test)]
mod siphash {
    #[bench]
    fn oneshot(bench: &mut BenchHarness) {
        use std::hash;
        use std::hash::{Streaming};
        bench_base( bench, |v| {
            let mut sip: hash::State = hash::default_state();
            sip.write(v);
            sip.result_u64();
        });
    }

    #[bench]
    fn iterbytes(bench: &mut BenchHarness) {
        bench_base( bench, |v| {
            v.hash();
        });
    }
}
```

Gives:

```
test xxhash::rust::iterbytes  ... bench:    908360 ns/iter (+/- 32305) = 72 MB/s
test xxhash::rust::oneshot    ... bench:     17026 ns/iter (+/- 69) = 3849 MB/s
test xxhash::rust::chunks_64  ... bench:     29224 ns/iter (+/- 183) = 2242 MB/s
test siphash::iterbytes       ... bench:    830721 ns/iter (+/- 1775) = 78 MB/s
test siphash::oneshot         ... bench:    127336 ns/iter (+/- 647) = 514 MB/s
```
thestinger (Contributor) commented Feb 2, 2014

@Jurily: SipHash is great for enormous streams. The performance problems are with very small keys. Try it on keys of 1 to 64 bytes, which is where 99% of hash table keys will fall.

ghost commented Feb 2, 2014

At 1, 2, and 4 bytes, the choice of algorithm doesn't really matter; the overhead completely dominates my benchmark:

```
test siphash::chunks_01       ... bench:    793589 ns/iter (+/- 53379) = 82 MB/s
test xxhash::clang::chunks_01 ... bench:   3517378 ns/iter (+/- 7811) = 74 MB/s
test xxhash::rust::chunks_01  ... bench:    851273 ns/iter (+/- 1006) = 76 MB/s

test siphash::chunks_02       ... bench:    521734 ns/iter (+/- 6753) = 125 MB/s
test xxhash::clang::chunks_02 ... bench:   2072373 ns/iter (+/- 2239) = 126 MB/s
test xxhash::rust::chunks_02  ... bench:    480232 ns/iter (+/- 3435) = 136 MB/s

test siphash::chunks_04       ... bench:    380421 ns/iter (+/- 77124) = 172 MB/s
test xxhash::clang::chunks_04 ... bench:   1267936 ns/iter (+/- 3077) = 206 MB/s
test xxhash::rust::chunks_04  ... bench:    241673 ns/iter (+/- 1050) = 271 MB/s
```
thestinger (Contributor) commented Feb 2, 2014

That's exactly the point, though. Neither of those algorithms is suitable for small fixed-size keys, which are the vast majority of keys we care about. Even with strings, only the performance on small ones (< 64 bytes) is going to matter for most applications.

ghost commented Feb 2, 2014

Something like this maybe? Source.

It seems to sacrifice all notions of security in favor of small-key performance.

thestinger (Contributor) commented Feb 2, 2014

Most programming languages don't offer a hash secure against denial-of-service attacks. Using something like MurmurHash or xxHash for security isn't going to be useful in practice, because key-independent attacks against them are publicly known.

I think the default needs to be secure from these kinds of denial of service attacks, and it needs to be much faster for small keys. We can offer faster hashes without the guarantee, but it's the case where the guarantee is upheld that I care most about.

ghost commented Feb 3, 2014

It seems to me that the requirements "reasonably secure" and "much faster for small inputs" are mutually exclusive. Can we split this issue in two?

thestinger (Contributor) commented Feb 3, 2014

I don't see why they're mutually exclusive. A 64-bit fixed-size input doesn't need to be compressed so there's no need for something like SipHash. As I stated above, I think security by default is a requirement and it also needs to be faster on small keys.

bill-myers (Contributor) commented Feb 12, 2014

I think all you need to do is have the HashMap use a fast but "insecure" hash initially, and when the length of any chain surpasses a given constant threshold, recreate the table using a secure hash (or perhaps add a secondary table using the secure hash that would be used for future insertions).

It might also be nice to then switch to a red-black tree if a secure hash chain becomes too long, so that we always have a sublinear worst case, rather than merely having very high probability of the worst case being sublinear, but this is a low priority issue.
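A sketch of the first paragraph's idea, using a chained table for brevity (Rust's actual table is open-addressed, as the next comment notes; the hash functions here are illustrative stand-ins for a weak seeded hash and SipHash):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const THRESHOLD: usize = 8;

fn fast_hash(k: u64) -> u64 {
    k.wrapping_mul(0x9e3779b97f4a7c15) // cheap multiplicative hash, no DoS guarantee
}

fn secure_hash(k: u64) -> u64 {
    let mut h = DefaultHasher::new(); // SipHash-based in today's std
    k.hash(&mut h);
    h.finish()
}

/// Toy map that starts with the fast hash and rebuilds itself with the
/// secure one if any chain grows suspiciously long.
struct AdaptiveMap {
    buckets: Vec<Vec<(u64, u64)>>,
    secure: bool,
}

impl AdaptiveMap {
    fn new() -> Self {
        AdaptiveMap { buckets: vec![Vec::new(); 64], secure: false }
    }

    fn bucket(&self, k: u64) -> usize {
        let h = if self.secure { secure_hash(k) } else { fast_hash(k) };
        (h as usize) % self.buckets.len()
    }

    fn insert(&mut self, k: u64, v: u64) {
        let idx = self.bucket(k);
        self.buckets[idx].push((k, v));
        if !self.secure && self.buckets[idx].len() > THRESHOLD {
            // A chain this long suggests adversarial keys: rehash
            // everything with the secure hash, as suggested above.
            self.secure = true;
            let old = std::mem::take(&mut self.buckets);
            self.buckets = vec![Vec::new(); 64];
            for (k, v) in old.into_iter().flatten() {
                let idx = self.bucket(k);
                self.buckets[idx].push((k, v));
            }
        }
    }
}

fn main() {
    let mut m = AdaptiveMap::new();
    for k in 0..1000 {
        m.insert(k, k);
    }
    println!("escalated to secure hash: {}", m.secure);
}
```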

thestinger (Contributor) commented Feb 12, 2014

The hash table doesn't use chaining. Switching to chaining would probably hurt performance, especially with the new Robin Hood hashing pull request.

pnkfelix (Member) commented Feb 12, 2014

@thestinger I interpreted @bill-myers's suggestion as being applicable to open-addressed tables as well as chained ones. (It's just about tracking the maximum probe sequence length, and switching the representation to a secure hash when it passes some threshold, no?)

cgaebel (Contributor) commented Feb 13, 2014

@pnkfelix: I don't really like the idea of performance characteristics of the hashtable switching dynamically at runtime depending on some opaque parameter. You're talking about a switch from linear array accesses to a balanced binary search tree! I know asymptotically the RB-tree looks nice, but there's gotta be a better way.

I think just having a fast, secure hash function is that way. It keeps everything simple, y'know?

pnkfelix (Member) commented Feb 13, 2014

@cgaebel I think the first sentence of my comment applies to both the first and second paragraphs of bill-myers's suggestion. The second sentence of my comment was meant to apply solely to the first paragraph.

But it seems to me like you are interpreting my whole comment as being in support of the rb-tree suggestion in his second paragraph? That was not my intent...

erickt (Contributor) commented Feb 24, 2014

@Jurily: now that my new Hash code has landed, could you try porting your xxHash implementation over to it to see if it'll work for you?

erickt (Contributor) commented Mar 11, 2014

@brson / @alexcrichton: have you considered factoring out the FNV hasher from rustc so it can be used by everyone? In my opinion, doing that along with @Jurily's xxHash would be sufficient for us to close this issue.
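For context, the FNV hasher is tiny; here is a sketch of 64-bit FNV-1a written against today's Hasher trait (the hashing traits at the time of this thread differed):

```rust
use std::hash::Hasher;

/// FNV-1a, 64-bit: xor each byte in, then multiply by the FNV prime.
/// No setup cost and very fast on the short keys rustc cares about,
/// but no protection against crafted collisions.
struct FnvHasher(u64);

impl Default for FnvHasher {
    fn default() -> Self {
        FnvHasher(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for FnvHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    let mut h = FnvHasher::default();
    h.write(b"NodeId#42");
    println!("{:016x}", h.finish());
}
```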

bors added a commit that referenced this issue Mar 11, 2014

auto merge of #12081 : cgaebel/rust/robinhood-hashing, r=brson
Partially addresses #11783.

Previously, Rust's hashtable was totally unoptimized. It used an Option
per key-value pair and very naive open addressing.

The old hashtable had very high variance in lookup time. For an example,
see the 'find_nonexisting' benchmark below. Robin Hood hashing fixes this:
keys in 'lucky' spots with short probe sequence lengths have their spots
stolen by keys with long probe sequence lengths, which reduces the variance
of probe lengths while maintaining the same mean.

Other optimization liberties were also taken. Everything is as cache-aware
as possible, and this hashtable should perform extremely well for both
large and small keys and values.

Benchmarks:

```
comprehensive_old_hashmap         378 ns/iter (+/- 8)
comprehensive_new_hashmap         206 ns/iter (+/- 4)
1.8x faster

old_hashmap_as_queue              238 ns/iter (+/- 8)
new_hashmap_as_queue              119 ns/iter (+/- 2)
2x faster

old_hashmap_insert                172 ns/iter (+/- 8)
new_hashmap_insert                146 ns/iter (+/- 11)
1.17x faster

old_hashmap_find_existing         50 ns/iter (+/- 12)
new_hashmap_find_existing         35 ns/iter (+/- 6)
1.43x faster

old_hashmap_find_notexisting      49 ns/iter (+/- 49)
new_hashmap_find_notexisting      34 ns/iter (+/- 4)
1.44x faster

Memory usage of old hashtable (64-bit assumed):

aligned(8+sizeof(Option)+sizeof(K)+sizeof(V))/0.75 + 48ish bytes

Memory usage of new hashtable:

(aligned(sizeof(K))
+ aligned(sizeof(V))
+ 8)/0.9 + 112ish bytes

Timing of building librustc:

compile_and_link: x86_64-unknown-linux-gnu/stage0/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc
time: 0.457 s   parsing
time: 0.028 s   gated feature checking
time: 0.000 s   crate injection
time: 0.108 s   configuration 1
time: 1.049 s   expansion
time: 0.219 s   configuration 2
time: 0.222 s   maybe building test harness
time: 0.223 s   prelude injection
time: 0.268 s   assigning node ids and indexing ast
time: 0.075 s   external crate/lib resolution
time: 0.026 s   language item collection
time: 1.016 s   resolution
time: 0.038 s   lifetime resolution
time: 0.000 s   looking for entry point
time: 0.030 s   looking for macro registrar
time: 0.061 s   freevar finding
time: 0.138 s   region resolution
time: 0.110 s   type collecting
time: 0.072 s   variance inference
time: 0.126 s   coherence checking
time: 9.110 s   type checking
time: 0.186 s   const marking
time: 0.049 s   const checking
time: 0.418 s   privacy checking
time: 0.057 s   effect checking
time: 0.033 s   loop checking
time: 1.293 s   compute moves
time: 0.182 s   match checking
time: 0.242 s   liveness checking
time: 0.866 s   borrow checking
time: 0.150 s   kind checking
time: 0.013 s   reachability checking
time: 0.175 s   death checking
time: 0.461 s   lint checking
time: 13.112 s  translation
  time: 4.352 s llvm function passes
  time: 96.702 s    llvm module passes
  time: 50.574 s    codegen passes
time: 154.611 s LLVM passes
  time: 2.821 s running linker
time: 15.750 s  linking


compile_and_link: x86_64-unknown-linux-gnu/stage1/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc
time: 0.422 s   parsing
time: 0.031 s   gated feature checking
time: 0.000 s   crate injection
time: 0.126 s   configuration 1
time: 1.014 s   expansion
time: 0.251 s   configuration 2
time: 0.249 s   maybe building test harness
time: 0.273 s   prelude injection
time: 0.279 s   assigning node ids and indexing ast
time: 0.076 s   external crate/lib resolution
time: 0.033 s   language item collection
time: 1.028 s   resolution
time: 0.036 s   lifetime resolution
time: 0.000 s   looking for entry point
time: 0.029 s   looking for macro registrar
time: 0.063 s   freevar finding
time: 0.133 s   region resolution
time: 0.111 s   type collecting
time: 0.077 s   variance inference
time: 0.565 s   coherence checking
time: 8.953 s   type checking
time: 0.176 s   const marking
time: 0.050 s   const checking
time: 0.401 s   privacy checking
time: 0.063 s   effect checking
time: 0.032 s   loop checking
time: 1.291 s   compute moves
time: 0.172 s   match checking
time: 0.249 s   liveness checking
time: 0.831 s   borrow checking
time: 0.121 s   kind checking
time: 0.013 s   reachability checking
time: 0.179 s   death checking
time: 0.503 s   lint checking
time: 14.385 s  translation
  time: 4.495 s llvm function passes
  time: 92.234 s    llvm module passes
  time: 51.172 s    codegen passes
time: 150.809 s LLVM passes
  time: 2.542 s running linker
time: 15.109 s  linking
```

BUT accesses are much more cache friendly. In fact, if the probe
sequence length is below 8, only two cache lines' worth of hashes will be
pulled into cache. This is unlike the old version, which had to stride
over the stored keys and values and got less cache friendly the bigger
the stored values were.

And did you notice the higher load factor? We can now reasonably get a
load factor of 0.9 with very good performance.

Please review this very closely. This is my first major contribution to Rust. Sorry for the ugly diff!
alexcrichton (Member) commented Mar 12, 2014

I haven't considered moving it out, but it would certainly be quite trivial to do so. I'm not sure where the best place for it would be. I guess std::hash isn't so bad a place (std::hash::xxhash and std::hash::fnv), but I don't think std::hash should become a comprehensive home for every hashing algorithm. Perhaps a few in there are OK?

erickt (Contributor) commented Mar 12, 2014

@alexcrichton: We could also go in the opposite direction and pull std::hash into its own crate. I would feel a little better putting a bunch of hashing algorithms in their own crate instead of cluttering up libstd.

alexcrichton (Member) commented Mar 12, 2014

I think that would require `extern crate hash;` to be present alongside a `deriving(Hash)` attribute, but that seems reasonable to me.

bors added a commit that referenced this issue Mar 12, 2014

auto merge of #12081 : cgaebel/rust/robinhood-hashing, r=huonw (same description as above)

bors added a commit that referenced this issue Mar 12, 2014

auto merge of #12081 : cgaebel/rust/robinhood-hashing, r=alexcrichton (same description as above)

bors added a commit that referenced this issue Mar 13, 2014

auto merge of #12081 : cgaebel/rust/robinhood-hashing, r=alexcrichton (same description as above)
pczarn (Contributor) commented Jul 18, 2014

What about that interesting approach brought up by @bill-myers? It might not be very elegant, yet it could prevent denial of service. I imagine it would fit easily into the current implementation of HashMap, given a well-tuned heuristic for the maximum allowed probe count during insertion (or resizing?).

ghost commented Sep 30, 2014

xxh64 is pretty much ready for a PR, but it's not faster than SipHash for very small keys. The overhead is still ridiculous:

```
test xxh64::bench_long_str                       ... bench:       211 ns/iter (+/- 1) = 2113 MB/s
test xxh64::bench_str_under_8_bytes              ... bench:        52 ns/iter (+/- 3) = 57 MB/s
test hash::sip::tests::bench_str_under_8_bytes   ... bench:        48 ns/iter (+/- 1)
```

There is an alternative approach: Python, where "ints are their own hash codes" and the safety comes from:

```
-R     : use a pseudo-random salt to make hash() values of various types be
         unpredictable between separate invocations of the interpreter, as
         a defense against denial-of-service attacks
```
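Python's approach translated into Rust terms would be a pass-through hasher, sketched below against today's Hasher trait (illustrative only, and exactly the kind of scheme the next comment objects to):

```rust
use std::hash::Hasher;

/// Identity hasher: an integer is its own hash code, Python-style.
/// Near-zero cost, but any set of keys congruent modulo the table size
/// collides, no matter what seed the process picked.
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for non-integer input: fold the bytes in.
        for &b in bytes {
            self.0 = (self.0 << 8) | b as u64;
        }
    }

    fn write_u64(&mut self, n: u64) {
        self.0 = n; // the whole "hash"
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    let mut h = IdentityHasher(0);
    h.write_u64(1234);
    assert_eq!(h.finish(), 1234);
}
```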
thestinger (Contributor) commented Sep 30, 2014

@Jurily: Python's randomization only applies to types like strings, not integers. It's still vulnerable to key-independent collisions, even for the types where it uses the keyed hash, because that hash doesn't provide the necessary guarantees.

steveklabnik (Member) commented Jan 21, 2015

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/rfcs#631
