Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upImplement a fast HashMap / hash function #11783
Comments
This comment has been minimized.
This comment has been minimized.
|
Not all the performance problems of our HashMap may be due to the hashing algorithm. The Rust implementation, paired with our IterBytes trait may also be a part of it. |
This comment has been minimized.
This comment has been minimized.
|
A single-pass
I'm sure that a hash function can exist for small fixed-size keys with far better performance and the same security guarantees, and we may be able to find a weak block cipher to use for this. It will of course still be slower than using the identity function as the hash like C++ standard library implementations all seem to do (with a few extra bit operations in the table itself while matching it to a bucket). |
This comment has been minimized.
This comment has been minimized.
|
Tagged with |
This comment has been minimized.
This comment has been minimized.
|
Something like |
This comment has been minimized.
This comment has been minimized.
|
Go uses the AESENC instruction available on newer x86/amd64 processors to implement a fast, and supposedly secure against hash collision DOS attacks, hash function. Unfortunately I can't find any other references of using AESENC for this purpose. |
This comment has been minimized.
This comment has been minimized.
|
What do we do on other processors though? |
This comment has been minimized.
This comment has been minimized.
|
Is it a requirement that the hash function is cryptographically secure? For example, is a hash function that's seeded by a random number at start-up acceptable? |
This comment has been minimized.
This comment has been minimized.
|
@bnoordhuis: SipHash provides security against DoS attacks. That's the sense in which it's cryptographically secure. Hash functions like murmurhash don't actually provide working DoS protection. |
This comment has been minimized.
This comment has been minimized.
|
@thestinger I think we're talking about the same thing: preventing collision attacks, right? Taking V8 as an example, it uses what looks like a Jenkins hash derivative (the inner loop is two adds, two shifts and a xor) with a randomized initial value. It doesn't give quite the same guarantees as SipHash but the tradeoff is that it's much faster. Is an approach like that acceptable? Apropos random numbers, I noticed that each hash map is created with a new random seed. Has anyone measured what happens when you generate the random seed once instead of repeatedly? |
This comment has been minimized.
This comment has been minimized.
|
@bnoordhuis It's not necessarily a requirement that it prevents collision attacks. I think it is a requirement that the default hash map prevents these attacks, as long as we have a fast option. If the default is both fast and prevents collisions then that is even better. I don't recall anybody measuring the difference in seeding strategies. We do have a task-local rng that imo should generally be used for these types of things. |
This comment has been minimized.
This comment has been minimized.
|
@brson: At one point, I tried using a constant as the key and it didn't make an impact. The speed issue is entirely caused by SipHash, although a weak hash wouldn't mix as well with linear open addressing. |
This comment has been minimized.
This comment has been minimized.
|
@thestinger Can you qualify 'weak hash'? As long as the hash function's output is unpredictable and has the avalanche property, it should be good, right? |
This comment has been minimized.
This comment has been minimized.
|
I am currently investigating using Robin Hood Hashing [1][2] for this. I will report back with results and a pull request. [1] http://codecapsule.com/2013/11/11/robin-hood-hashing/ |
This comment has been minimized.
This comment has been minimized.
|
#11863 is relevant to this since it changes the infrastructure of hashing in |
This comment has been minimized.
This comment has been minimized.
ghost
commented
Jan 29, 2014
|
I've ported the xxHash algorithm's single-pass version, and plan on doing more of these, like MurmurHash3. I'm unsure what a good Rust hashing API would look like though, so I'm holding off on that for now. |
This comment has been minimized.
This comment has been minimized.
|
@bnoordhuis: Yes, but the default will need to remain one providing a strong guarantee of security against denial of service attacks. Ideally, we won't need a second hash table. I think running a fraction of the rounds of a block cipher is a good path to investigate for fixed-size keys. |
This comment has been minimized.
This comment has been minimized.
|
Found this over in Clojure-land, which might be useful: The discussion happened when someone found a small program that had terrible hashing performance: |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ghost
commented
Jan 31, 2014
|
@erickt My two concerns were the variable number and types of keys on construction (xxHash only uses one |
This comment has been minimized.
This comment has been minimized.
|
Would extending xxHash's result to u64 (instead of parameterizing the hash result) hurt its performance? |
This comment has been minimized.
This comment has been minimized.
|
@eddyb wants to use our new default type parameters for this. |
This comment has been minimized.
This comment has been minimized.
ghost
commented
Feb 1, 2014
|
For 64-bit, I'd rather port MurmurHash (which also has a 32-bit variant). It's what libstdc++ uses. Quoting:
|
This comment has been minimized.
This comment has been minimized.
ghost
commented
Feb 2, 2014
|
I've come to believe there's nothing wrong with SipHash: #[cfg(test)]
mod rust {
#[bench]
fn iterbytes(bench: &mut BenchHarness) {
use std::to_bytes::{IterBytes};
bench_base( bench, |v| {
let mut xxh: XXHState = XXHState::new(0);
v.iter_bytes(true, |bytes| {
xxh.update(bytes);
true
});
xxh.digest();
});
}
#[bench]
fn oneshot(bench: &mut BenchHarness) {
bench_base( bench, |v| {
xxh32(v, 0);
});
}
}
#[cfg(test)]
mod siphash {
#[bench]
fn oneshot(bench: &mut BenchHarness) {
use std::hash;
use std::hash::{Streaming};
bench_base( bench, |v| {
let mut sip: hash::State = hash::default_state();
sip.write(v);
sip.result_u64();
});
}
#[bench]
fn iterbytes(bench: &mut BenchHarness) {
bench_base( bench, |v| {
v.hash();
});
}
}Gives
|
This comment has been minimized.
This comment has been minimized.
|
@Jurily: SipHash is great for enormous streams. The performance problems are with very small keys. Try it out on keys that are 1 to 64 bytes as 99% of hash table keys are going to be. |
This comment has been minimized.
This comment has been minimized.
ghost
commented
Feb 2, 2014
|
At 1, 2 and 4, choice of algorithm doesn't really matter. The overhead is completely dominating my benchmark:
|
This comment has been minimized.
This comment has been minimized.
|
That's exactly the point though. Neither of those algorithms is suitable for small fixed-size keys, which are the vast majority of keys we care about. Even with strings, only the performance on small ones (< 64 in length) is going to matter for most applications. |
This comment has been minimized.
This comment has been minimized.
ghost
commented
Feb 2, 2014
This comment has been minimized.
This comment has been minimized.
|
Most programming languages don't offer a hash secure from denial of service attacks. Using something like Murmurhash or xxHash for security reasons isn't going to be useful in practice because key-independent attacks are publicly known. I think the default needs to be secure from these kinds of denial of service attacks, and it needs to be much faster for small keys. We can offer faster hashes without the guarantee, but it's the case where the guarantee is upheld that I care most about. |
This comment has been minimized.
This comment has been minimized.
ghost
commented
Feb 3, 2014
|
It seems to me that the requirements "reasonably secure" and "much faster for small inputs" are mutually exclusive. Can we split this issue in two? |
This comment has been minimized.
This comment has been minimized.
|
I don't see why they're mutually exclusive. A 64-bit fixed-size input doesn't need to be compressed so there's no need for something like SipHash. As I stated above, I think security by default is a requirement and it also needs to be faster on small keys. |
This comment has been minimized.
This comment has been minimized.
|
I think all you need to do is have the HashMap use a fast but "insecure" hash initially, and when the length of any chain surpasses a given constant threshold, recreate the table using a secure hash (or perhaps add a secondary table using the secure hash that would be used for future insertions). It might also be nice to then switch to a red-black tree if a secure hash chain becomes too long, so that we always have a sublinear worst case, rather than merely having very high probability of the worst case being sublinear, but this is a low priority issue. |
This comment has been minimized.
This comment has been minimized.
|
The hash table doesn't use chaining. Switching to chaining will probably hurt performance... especially with the new robin hood hashing pull request. |
This comment has been minimized.
This comment has been minimized.
|
@thestinger I interpreted @bill-myers 's suggestion as being applicable to open-addressed tables as well as chained ones. (Its just about tracking the maximum probe sequence length, and switching the representation to a secure hash when it passes some threshold, no?) |
This comment has been minimized.
This comment has been minimized.
|
@pnkfelix: I don't really like the idea of performance characteristics of the hashtable switching dynamically at runtime depending on some opaque parameter. You're talking about a switch from linear array accesses to a balanced binary search tree! I know asymptotically the RB-tree looks nice, but there's gotta be a better way. I think just having a fast, secure hash function is that way. It keeps everything simple, y'know? |
This comment has been minimized.
This comment has been minimized.
|
@cgaebel I think the first sentence of my comment applies to both the first and second paragraphs of bill-myers 's suggestion. The second sentence of my comment was meant to apply solely to the first paragraph of his suggestion. But it seems to me like you are interpreting my whole comment as being in support of the rb-tree suggestion in his second paragraph? That was not my intent... |
This comment has been minimized.
This comment has been minimized.
|
@Jurily: now that my new |
huonw
referenced this issue
Mar 4, 2014
Closed
consider using a different hash than SipHash for integer keys in rustc #10586
This comment has been minimized.
This comment has been minimized.
|
@brson / @alexcrichton: have you considered factoring out the FNV hasher from rustc so it can be used by everyone? In my opinion, doing that along with @Jurily's xxHash would be sufficient for us to close this issue. |
bors
added a commit
that referenced
this issue
Mar 11, 2014
This comment has been minimized.
This comment has been minimized.
|
I haven't considered moving it out, but it would certainly be quite trivial to do so. I'm not sure where the best place for it would be. I guess |
This comment has been minimized.
This comment has been minimized.
|
@alexcrichton: We could also go in the opposite direction and pull edit: *crate, not module |
This comment has been minimized.
This comment has been minimized.
|
I think that would require |
bors
added a commit
that referenced
this issue
Mar 12, 2014
bors
added a commit
that referenced
this issue
Mar 12, 2014
bors
added a commit
that referenced
this issue
Mar 13, 2014
This comment has been minimized.
This comment has been minimized.
|
What about that interesting approach brought up by @bill-myers? It might be not very elegant, yet it could prevent denial of service. I can imagine it would easily fit into the current implementation of HashMap, given well tuned heuristic for determining the maximum allowed probe count during insertion (or resizing?). |
This comment has been minimized.
This comment has been minimized.
ghost
commented
Sep 30, 2014
|
xxh64 is pretty much ready for a PR, but it's not faster than SipHash for very small keys. The overhead is still ridiculous:
There is an alternative approach: Python, where "ints are their own hash codes", and the safety comes from
|
This comment has been minimized.
This comment has been minimized.
|
@Jurily: Python's randomization only applies to types like strings, not integers. It's still vulnerable to key independent collisions even for the types even when it's using the keyed hash because it doesn't provide the necessary guarantees. |
This comment has been minimized.
This comment has been minimized.
|
I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized. This issue has been moved to the RFCs repo: rust-lang/rfcs#631 |
brson commentedJan 25, 2014
Our current HashMap uses cryptographically secure hashing but it is slow. Sometimes this is a big problem. Implement another hash map that is very fast.