should the "initial" hash also be "randomized" ? #21
Comments
@cbeck88 thanks a lot for sharing your thoughts. I'm a bit busy right now, but I'll definitely try to answer this week.
An educated guess: instead of iteratively trying new incremented values, could we use a PRNG?
I switched to a more standard hash function (the murmur3 finalizer) and it yields better performance while still providing a good distribution.
I opened a WIP / experimental PR to show you my ideas on this, will write more later!
Here is roughly my concern: if the initial input has a pathological distribution, the algorithm will fail to complete and will time out. Here's an example input distribution I had in mind: suppose you are using bit operations to pack multiple bits into an enum, and you want to associate strings with each bit for the purpose of diagnostic messages.
Or, for another example where I think it would go badly, suppose that all of the enum values are multiples of 32.
Now, we want to be able to associate the enums with string representations of their names. We could implement an […]. However, what is going to happen in the pmh tables when I try to do […]?
If there are only, say, 16 strings or fewer, then […]. So in the first pass of the […]. Why is that bad? It's bad because in the second part of the algorithm, where we choose a random seed for each bucket, the chance that a random seed is good gets exponentially smaller as the bucket size increases.
(That is: I think the running time of this loop is exponential in […].) If it were okay for the buckets to be arbitrarily large, then we could just skip the entire "first table" part. A "naive" way to try to do perfect hashing would be to just choose random numbers […]. The linear-time perfect hashing paper, referred to from the original blog post, makes a statistical argument that, if I understand correctly, has two parts: […]
This is why they say it's linear time even though it is probabilistic: you could start the algorithm and measure with a stopwatch, and if it doesn't succeed after 100N steps or something, you could start over, and with high probability you won't ever have to restart more than a few times. The thing is that we don't have the luxury of starting over; if the second part fails to finish, we'll just get a compilation failure.

So, to try to improve the success rate and make sure the algorithm can handle even pathological inputs, I want to try to find a "hash looks good" condition, so we can try a few "initial" hashes and only proceed to building the second table if the bucket distribution looks good. (Since otherwise, building the second table will probably fail.) In the commits I sent over, I messed around with "sum of squares of bucket sizes" as a metric for success, because in a "good" hash most of the buckets should be like a constant, or […].

I think the sum-of-squares thing is not very rigorous though. The best thing would be to read the paper in detail and see what condition they use. I think the "rigorous" answer would be something like, sum of […]. It's still not completely rigorous either, because in the paper they assume a random hash function model, and we aren't using "random hash functions". But there's only so much you can do. We are giving the hash functions pretty large seeds (64 bits), and if the tables are usually not that big, then it should work almost the same, I think.

Also: I think that using a seed for the "initial hash" will fix all of these pathological input cases, because we should be able to quickly find a seed that has a "pretty good" bucket size distribution, and then the algorithm is likely to succeed.

If the sum of the squares of the bucket sizes is […].

When I have some free time again I'm going to try creating some test cases along these lines, and see if those tests fail to compile on master and if they do compile after my PR, but I didn't get around to it yet.
Hi, I opened PR #26 to show what I think is a "pathological input" where the current master fails, and which shows the advantage of the […]. Basically, even if the inputs (deterministically) cause a bad hash in the first round, with the initial seed we can detect that and try a new seed before trying to build the second table. Or so the idea goes.
Thanks for sharing your implementation here -- I have been studying it for a while now. I have some questions:
In the perfect minimal hashing implementation `frozen/bits/pmh.h`, you have quite faithfully implemented what is described generally here: http://stevehanov.ca/blog/index.php?id=119

And in this blog, it is claimed that the algorithm takes linear time, due to this paper: http://cmph.sourceforge.net/papers/esa09.pdf
However, it seems to me that there is a disconnect, because in that paper, not only are the hashes for each bucket random, but the "initial" hash is also. In the blog post (and in frozen), the hash key used for each bucket is either chosen strategically or by trial and error, but the "initial" hash that is used to populate the buckets is fixed, and is not random.
This is a problem because the overall running time analysis relies on all of the buckets being small, as if the initial hash were random. In the totally random hash function model, suppose that we are going along and handling the buckets greedily. If we have already mapped a constant fraction of the items, and we encounter a bucket with k items, we expect to take 2^O(k) guesses before we find a hash seed that accommodates them all. So if "most" of the items are in "large" buckets, we take exponential time.

I think this means that for "bad" inputs, that is, inputs where the unseeded version of `elsa` gives a bad distribution of buckets, the algorithm will always fail or take a long time.

Intuitively, one way to fix it is to use the seeded version of `elsa` for the initial bucket placement also. Then inspect the distribution of buckets, and if there were a huge number of collisions, try a different seed, until it is within some tolerance. I attempted to do this in an ad-hoc way in a branch here: https://github.com/cbeck88/frozen/tree/pmh_seeds
However, I'm having problems with runtime performance now: it seems that when I compile with clang, invoking `elsa` twice instead of once makes us slower than gcc for `unordered_map<int, int>`.

My hope would be that this would help users who are troubleshooting "possible infinite loop" errors and the like, since basically, if the initial hash is bad and they get a bucket with half of the items in it (or even several buckets with sqrt(n) items), I would expect the algorithm to time out.

Do you think that my assessment is wrong? Do you think it would be a good idea to try to simplify `elsa` so that it can be run twice as fast? Do you think that instead of using `elsa` for the initial hash, a different hash should be used, and `elsa` should only be used for the second part? Where did the `elsa` hash algorithm come from?