Tracking issue for custom hashers in HashMap #27713
Comments
alexcrichton added A-collections, T-libs, B-unstable labels Aug 12, 2015
|
I don't know much about hashing functions, but I just want to say that this one is very important for performance. The SipHasher is very slow, and always eats up between 5% and 20% of the CPU time of each of my projects. By comparison the FnvHasher is around 6 times faster on my machine according to some quick benchmarking. The purpose of the SipHasher is to protect against DDoSes, but the vast majority of the applications written in Rust never even open a socket. |
|
SipHash looks good on long enough data, but it is slow on small values like integers. Its fixed four round finalization at the end is an example of overhead that is comparatively larger when hashing small values. |
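For readers who want to see what swapping the hasher looks like in practice, here is a minimal sketch. It assumes the `std::hash::BuildHasherDefault` helper available in today's standard library (not the unstable `HashState` API under discussion in this thread), and the `Fnv1a` type and `FnvMap` alias are hypothetical names written out by hand for illustration rather than taken from any crate.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// 64-bit FNV-1a, written out by hand for illustration.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        // No finalization rounds: the running state is the hash.
        self.0
    }
}

/// Hypothetical alias: a HashMap that uses FNV-1a instead of SipHash.
type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut squares: FnvMap<u32, u32> = FnvMap::default();
    for i in 0..10 {
        squares.insert(i, i * i);
    }
    assert_eq!(squares.get(&7), Some(&49));
}
```

Note that a hasher like this gives up the collision-flooding resistance that SipHash provides, which is the trade-off the comments above are weighing.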
|
See: https://github.com/shepmaster/twox-hash/blob/master/README.md for scaling comparison. Based on these results (which I haven't vetted for quality), one can conclude that SipHash is actually a pretty good "general purpose" hasher. Fnv doesn't scale as well to longer strings (> 32 bytes), and XXHash doesn't scale well to smaller strings (< 32 bytes). |
|
(XXhash was constructed to be a fast checksum for gigabytes of lz4 data, so that's what it is good at) |
Ms2ger referenced this issue Aug 16, 2015: Tracking: Unstable Rust feature gates used by Servo #5286 (Open)
|
If |
|
Just noting some discussion that was tangentially related to this: The current hasher infra heavily penalizes Farmhash (and cityhash, and murmurhash), which wants to branch on the size of the input to choose the "optimal" algorithm. Community's current solution is to just buffer in a Vec until However more generally it would be nice (even for SipHash) to be able to indicate to the hasher "there is only one thing to hash, here it is, now give me the final hash". Sketch of design: add:
pub trait Hasher {
    ...
    /// Hash only the given bytes, and immediately finalize.
    /// Results are unspecified if other bytes were previously written to this hasher
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        self.write(bytes);
        self.finish()
    }
}
pub trait Hash {
    ...
    /// Hashes only this value
    fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
        self.hash(state);
        state.finish()
    }
}
So far, so pointless. But back-compat! But now hashers can override write_only to something optimized, and Hash impls can specialize in the following way:
// #[derive]d impl
impl Hash {
    fn hash() { ... } // same-old
    #[if(only_has_one_field)] // magic
    fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
        self.only_field.hash_one_shot(state)
    }
}
And key types like
fn hash_one_shot<H: Hasher>(&self, state: &mut H) -> u64 {
    state.write_only(self.as_slice())
}
However we must be wary of #5257! In particular, slices and strs must mix in some "bonus" value so that This seems really sketchy, but I can't think of a situation where this would actually break things. You will get potentially curious behaviour, where I've been flying around the country and haven't slept though. |
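To make the override idea concrete, here is a small self-contained sketch. Since `write_only` is only a proposal in this thread and not part of `std::hash::Hasher`, it is modelled below as a hypothetical extension trait `WriteOnly`; `ToyHasher` and its mixing steps are likewise invented for illustration and are not a real hash algorithm.

```rust
use std::hash::Hasher;

/// Hypothetical extension trait standing in for the proposed `Hasher::write_only`.
trait WriteOnly: Hasher {
    /// Hash only `bytes` and finalize; the default simply streams.
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        self.write(bytes);
        self.finish()
    }
}

/// Toy hasher (not a real algorithm) that records how often the
/// streaming path is taken.
#[derive(Default)]
struct ToyHasher {
    state: u64,
    streamed_writes: u32,
}

impl Hasher for ToyHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.streamed_writes += 1;
        for &b in bytes {
            self.state = self.state.rotate_left(5) ^ u64::from(b);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

impl WriteOnly for ToyHasher {
    /// Override: with the whole input in hand, branch on its length up front.
    fn write_only(&mut self, bytes: &[u8]) -> u64 {
        if bytes.len() <= 8 {
            // Short input: pack the bytes into a single word and skip streaming.
            let mut word = [0u8; 8];
            word[..bytes.len()].copy_from_slice(bytes);
            u64::from_le_bytes(word) ^ (bytes.len() as u64)
        } else {
            // Long input: fall back to the ordinary streaming path.
            self.write(bytes);
            self.finish()
        }
    }
}

fn main() {
    let mut short = ToyHasher::default();
    short.write_only(b"key");
    assert_eq!(short.streamed_writes, 0); // fast path taken

    let mut long = ToyHasher::default();
    long.write_only(b"a considerably longer key");
    assert_eq!(long.streamed_writes, 1); // fell back to streaming
}
```

As the proposal itself warns, the one-shot and streaming paths here produce different values for the same bytes, so a real design would need to decide when the two are allowed to disagree.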
|
Incidentally, this proposal soft-fixes #27108 (&[u8] and &str would indeed hash_one_shot to equal values). |
|
To me, the naming of |
|
@Gankro I'm exploring a slightly different design here: https://github.com/ranma42/rust-hash/tree/master/src Aside from the naming, which can obviously be made more consistent with that currently used in Rust, the main difference is that the stream-oriented part of the code is aggressively inlined. Ideally I would want the implementation of one-shot digest to be automatically generated by the compiler, which should be able to get rid of the streaming overhead simply with constant folding and dead code elimination. This actually happens in the branch I posted, namely the The usual structure of hash functions makes me wonder whether there is much to gain by having |
|
An important point is to allow setting the seed for the HashState (or the equivalent). Some applications (like simulators) must have reproducible executions, so it's necessary to specify the same seed for the hasher to get the same iteration order on HashSet and HashMap. |
|
@malbarbo Yes, this is well-handled by the with_hash_state method. You make your own state with the desired seed, and then pass it to the HashMap. |
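A rough sketch of the seeded-state idea follows. It is written against `HashMap::with_hasher` and `std::hash::BuildHasher` as they exist in today's standard library, which is an assumption relative to the `with_hash_state` method named above. `SeededFnv` and `FixedSeed` are hypothetical types, and the seeding scheme (XOR into the FNV-1a start state) is chosen only for illustration.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical seeded hasher: FNV-1a with the seed folded into the start state.
struct SeededFnv(u64);

impl Hasher for SeededFnv {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Hypothetical state object that hands every hasher the same caller-chosen seed.
#[derive(Clone, Copy)]
struct FixedSeed(u64);

impl BuildHasher for FixedSeed {
    type Hasher = SeededFnv;

    fn build_hasher(&self) -> SeededFnv {
        SeededFnv(0xcbf2_9ce4_8422_2325 ^ self.0)
    }
}

/// Build a map with the given seed and return its keys in iteration order.
fn iteration_order(seed: u64) -> Vec<u32> {
    let mut map = HashMap::with_hasher(FixedSeed(seed));
    for key in 0..100u32 {
        map.insert(key, ());
    }
    map.keys().copied().collect()
}

fn main() {
    // Same seed and same insertions: both maps are walked in the same order.
    assert_eq!(iteration_order(42), iteration_order(42));
}
```

The iteration order of a `HashMap` is still unspecified by the documentation; the sketch only relies on it being a deterministic function of the hasher and the sequence of operations, which holds in practice for the current implementation.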
aldanor commented Oct 29, 2015
|
@Gankro the whole problem here is that |
|
Nominating for 1.6 discussion |
alexcrichton added the I-nominated label Nov 4, 2015
|
CC @shepmaster As one of the hash-thing maintainers, are you happy with this API? |
|
The libs team decided that this API is likely ready for stabilization pending a final audit of the ergonomics and usage. If anything arises it will be bumped out of FCP for a later cycle. |
alexcrichton added final-comment-period and removed I-nominated labels Nov 5, 2015
|
From the point-of-view of a consumer of the API, I've found it to be pleasing enough to use, especially because of the
use std::collections::HashMap;
use twox_hash::RandomXxHashState;
let mut hash: HashMap<_, _, RandomXxHashState> = Default::default();
hash.insert(42, "the answer");
assert_eq!(hash.get(&42), Some(&"the answer"));
From an implementor point-of-view, I'll softly agree with @ranma42's point:
In my mind, the state of the hash is the internal bits-and-bobs that change every time you add more data to be hashed. In the usage of twox-hash, it's really more like the "state" is a seed and the "hasher" contains the real state. However, looking back from a user POV, neither "state" nor "seed" sound great when you write the declaration ( Naming is hard! Falling into Java naming land, the parameter to the From a pragmatic POV, there will probably never be more than ~100 implementations of hash algorithms, so it's a very small number of people that will have to struggle with this. That may mean it's not worth spending enormous amounts of time on this naming. There will be many more people who use a custom hash algorithm, so as long as the implementations have a nice bit of example of "use it like this", it will probably be just fine. |
|
Thanks for weighing in @shepmaster! I agree that the naming here perhaps isn't the best, although I am also at a bit of a loss of what would fit as a good name. For construction of a custom hash map, though, you can even use |
|
@alexcrichton I mentioned that context ( In the stub library I wrote, I named Would it still be possible to rename these traits (and maybe their methods)? Although the names I proposed may not be much easier to understand, they would align with the existing nomenclature. This would at least make them easier to use for people who are already using them in other languages (and it would provide nice consistency for anybody writing library bindings/wrappers). |
|
Hm yeah Regardless I think it's still possible to tweak some naming here. If larger changes happen, however, then I think we'll have to punt on this until another cycle. |
|
@alexcrichton Yes, in the In my stub library, the object hierarchy is designed so that the additional state (seeds, initialization vectors, number of rounds/passes, any parameters) of the hashing algorithm belongs to the object implementing the This has the convenient advantage that in most cases the initialisation data does not need to be stored in the context (for example, the I did some more work on this, but I did not clean it up and push it to the public repo, but since there seems to be some interest, I will try to get around to it this weekend. |
|
Yeah we wanted to avoid "factory" to avoid java-isms but we also felt that we should avoid "builder" because the builder pattern in Rust already has a concept (e.g. chaining method calls to build something) which this isn't quite the same as. |
I guess I'd disagree. For example, I could see some code like:
trait BuildHasher {
    type Hasher;
    fn build_hasher(&self) -> Self::Hasher;
}
struct MyCoolHash {
    seed: u32,
    stream: u32,
}
struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}
impl MyCoolHashBuilder {
    fn new() -> Self {
        MyCoolHashBuilder {
            seed: 0, // make a random seed
            stream: 1, // make a random stream
        }
    }
    fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder {
            seed: seed,
            ..self
        }
    }
    fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder {
            stream: stream,
            ..self
        }
    }
}
impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;
    fn build_hasher(&self) -> Self::Hasher {
        MyCoolHash {
            seed: self.seed,
            stream: self.stream,
        }
    }
}
fn main() {
    use std::collections::HashMap;
    let x = MyCoolHashBuilder::new().seed(42).stream(1);
    HashMap::with_hasher(x);
}
There's two parts to the builder pattern - the interesting and unique configuration and the actual building. The The biggest downside I see is that I'd expect that most hashing is going to have a small number of knobs to tweak, which means the number of configuration methods would be small (or zero!). |
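The sketch above leaves `MyCoolHash` without a `Hasher` implementation, so it does not compile as posted. A runnable variant might look like the following, assuming the real `std::hash::BuildHasher` trait from today's standard library in place of the locally declared one. The mixing step in `write` and the way `seed` and `stream` are folded into a single `state` field are placeholders invented so the example runs; they are not part of the original comment.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

struct MyCoolHash {
    state: u64,
}

impl Hasher for MyCoolHash {
    fn write(&mut self, bytes: &[u8]) {
        // Placeholder mixing step, chosen only so the example runs.
        for &b in bytes {
            self.state = self.state.rotate_left(7) ^ u64::from(b);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

#[derive(Clone, Copy)]
struct MyCoolHashBuilder {
    seed: u32,
    stream: u32,
}

impl MyCoolHashBuilder {
    // Configuration half of the builder: chainable setters.
    fn new() -> Self {
        MyCoolHashBuilder { seed: 0, stream: 1 }
    }

    fn seed(self, seed: u32) -> Self {
        MyCoolHashBuilder { seed, ..self }
    }

    fn stream(self, stream: u32) -> Self {
        MyCoolHashBuilder { stream, ..self }
    }
}

// Building half: the trait is the "final step" that produces a hasher.
impl BuildHasher for MyCoolHashBuilder {
    type Hasher = MyCoolHash;

    fn build_hasher(&self) -> MyCoolHash {
        // Fold both knobs into the hasher's starting state.
        MyCoolHash {
            state: (u64::from(self.seed) << 32) | u64::from(self.stream),
        }
    }
}

fn main() {
    let builder = MyCoolHashBuilder::new().seed(42).stream(1);
    let mut map: HashMap<&str, u32, MyCoolHashBuilder> = HashMap::with_hasher(builder);
    map.insert("the answer", 42);
    assert_eq!(map.get("the answer"), Some(&42));
}
```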
|
Oh interesting! That's actually a good point that the trait could be considered the "final step" rather than the configuration up front, and along those lines I'd be pretty cool with |
|
@alexcrichton I'd still argue for the proposed |
|
I suppose I personally prefer the name "hash builder" or We don't have that many instances of "make" in libstd I think, the ones I could find are:
I feel that "build", however, at least does show up throughout libstd |
|
I agree that (That said, I won't stand strongly in the way of |
|
We're avoiding factory for no reason at all, that's fine, builder can be our own factory. I think BuildHasher is better than MakeHasher. However, no reason to make a builder style API for it just because of the name. Utility first. |
alexcrichton added the I-nominated label Dec 16, 2015
|
The final piece we believe warrants discussion is the naming of the trait itself, of which the two leading candidates seem to be |
alexcrichton removed the I-nominated label Dec 17, 2015
|
Is it too late to submit Womb as Rust's version of a Factory? I think this would definitely resolve any and all confusion with regards to the semantics. Definitely. |
jminer commented Dec 25, 2015
|
I feel like |
|
I think the name |
|
I think it does have to be random forever, at least to some extent. Changing the default would undermine the safety of applications that use HashMap. If you accept that, guaranteeing it in the name seems kind of reasonable, as a warning. |
|
Also: I maintain that |
|
The libs team discussed this in triage recently and the decision was to stabilize with the names |
alexcrichton referenced this issue Jan 15, 2016: std: Stabilize APIs for the 1.7 release #30943 (Merged)
jminer commented Jan 16, 2016
|
@alexcrichton I had two reasons for preferring Then again, I agree that |
alexcrichton commented Aug 12, 2015
This is a tracking issue for the unstable hashmap_hasher feature in the standard library. This provides the ability to create a HashMap with a custom hashing implementation that abides by the HashState trait. This has already been used quite a bit in the compiler itself as well as in Servo I believe.
Some notable points to consider:
HashState trait really necessary?
HashState correct?
Default appropriate here?
new constructor be leveraged to create hash maps that use a hasher implementing Default? Right now the new constructor only works with RandomState.
Hasher implementation? In theory it should be quite ergonomic.
cc @Gankro
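On the question of `new` versus `Default`, the following sketch shows how the two constructors behave in today's standard library, which is an assumption relative to the unstable API described in this issue. `FixedState` is a hypothetical alias introduced here for illustration.

```rust
use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// Hypothetical alias: builds `DefaultHasher::default()` each time,
/// so hashing is not randomized per map.
type FixedState = BuildHasherDefault<DefaultHasher>;

fn main() {
    // `new` is only defined for the default `RandomState` parameter.
    let mut random: HashMap<u32, &str, RandomState> = HashMap::new();
    random.insert(1, "one");

    // Any other hasher state that implements `Default` goes through
    // `Default::default()` instead.
    let mut fixed: HashMap<u32, &str, FixedState> = HashMap::default();
    fixed.insert(1, "one");

    assert_eq!(random.get(&1), fixed.get(&1));
}
```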