
should the "initial" hash also be "randomized" ? #21

Closed
cbeck88 opened this issue Nov 9, 2017 · 7 comments
Closed

should the "initial" hash also be "randomized" ? #21

cbeck88 opened this issue Nov 9, 2017 · 7 comments

Comments

@cbeck88 (Collaborator) commented Nov 9, 2017

Thanks for sharing your implementation here -- I have been studying it for a while now. I have some questions:

In the perfect minimal hashing implementation frozen/bits/pmh.h, you have quite faithfully implemented what is described generally here: http://stevehanov.ca/blog/index.php?id=119

And in this blog, it is claimed that the algorithm takes linear time, due to this paper: http://cmph.sourceforge.net/papers/esa09.pdf

However, it seems to me that there is a disconnect, because in that paper, not only are the hashes for each bucket random, but the "initial" hash is also. In the blog post (and in frozen), the hash key used for each bucket is either chosen strategically or by trial and error, but the "initial" hash that is used to populate the buckets is fixed, and is not random.

This is a problem because the overall running time analysis relies on all of the buckets being small, as they would be if the initial hash were random. In the totally random hash function model, suppose that we are going along and handling the buckets greedily. If we have already mapped a constant fraction of the items, and we encounter a bucket with k items, we expect to take 2^O(k) guesses before we find a hash seed that accommodates them all. So if "most" of the items are in "large" buckets, we take exponential time.
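To make that estimate concrete (idealizing the seeded hash as a truly random function, which is of course an assumption): if a fraction p of the output slots is already occupied, a single random seed sends all k items of a bucket to free, pairwise-distinct slots with probability at most (1-p)^k, so

\[
\Pr[\text{a given seed works for a size-}k\text{ bucket}] \le (1-p)^k
\quad\Longrightarrow\quad
\mathbb{E}[\text{number of guesses}] \ge (1-p)^{-k} = 2^{\Omega(k)}.
\]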

I think this means that for "bad" inputs -- basically, inputs where the unseeded version of elsa gives a bad distribution of buckets -- the algorithm will always fail or take a long time.

Intuitively, one way to fix it is to use the seeded version of elsa for the initial bucket placement as well. Then inspect the distribution of buckets, and if there were a huge number of collisions, try a different seed, until the distribution is within some tolerance.

I attempted to do this in an ad-hoc way in a branch here: https://github.com/cbeck88/frozen/tree/pmh_seeds
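Something along these lines, as a rough sketch (this is not the actual code in that branch, and hash(key, seed) just stands in for the seeded elsa):

    #include <cstddef>
    #include <cstdint>

    // Sketch only: re-seed the first-level hash until the buckets it produces
    // are not too collision-heavy, then hand off to the existing per-bucket search.
    template <std::size_t M, class Key, std::size_t N, class Hash>
    constexpr std::uint64_t pick_initial_seed(const Key (&keys)[N], const Hash &hash) {
      std::uint64_t seed = 1;
      while (true) {
        std::size_t bucket_size[M] = {};
        for (std::size_t i = 0; i < N; ++i)
          ++bucket_size[hash(keys[i], seed) % M]; // first-level placement with this seed

        std::size_t largest = 0;
        for (std::size_t b = 0; b < M; ++b)
          if (bucket_size[b] > largest)
            largest = bucket_size[b];

        if (largest <= 4) // arbitrary "within some tolerance" bound; assumes M is comparable to N
          return seed;

        ++seed;           // distribution looks bad: try another seed
      }
    }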

However, I'm now having problems with runtime performance -- it seems that when I compile with clang, invoking elsa twice instead of once makes us slower than gcc for unordered_map<int, int>.

My hope is that this would help users who are troubleshooting "possible infinite loop" errors and the like. Basically, if the initial hash is bad and they get a bucket with half of the items in it (or even several buckets with sqrt(n) items), I would expect the algorithm to time out.

Do you think that my assessment is wrong? Do you think it would be a good idea to try to simplify elsa so that it can be run twice as fast? Do you think instead of using elsa for the initial hash, a different hash should be used, and elsa should only be used for the second part?

Where did the elsa hash algorithm come from?

@serge-sans-paille (Owner) commented:

@cbeck88 thanks a lot for sharing your thoughts. I'm a bit busy right now but I'll definitely try to answer this week.

@serge-sans-paille (Owner) commented:

An educated guess: instead of iteratively trying new incremented values, we could use a PRNG?
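Something like a tiny constexpr generator, for example (just a sketch, not what any PR actually does):

    #include <cstdint>

    // Sketch: a constexpr xorshift64 generator; the state must be non-zero.
    struct xorshift64 {
      std::uint64_t state;

      constexpr std::uint64_t next() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
      }
    };

The per-bucket search would then draw d = rng.next() on each attempt instead of doing d++.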

@serge-sans-paille (Owner) commented:

@cbeck88 #23 is an attempt to randomize the search. I still use a fixed seed to start the search; alternatives are:

  • providing an extra template parameter to the container
  • using __LINE__
  • ??

What do you think?

@serge-sans-paille (Owner) commented Dec 2, 2017

Do you think it would be a good idea to try to simplify elsa so that it can be run twice as fast?

I switched to a more standard hash function (the murmur3 finalizer), and it yields better performance while still providing a good distribution.
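For reference, the murmur3 64-bit finalizer is just a short sequence of xor-shift and multiply rounds (shown here as a sketch; the exact way it is wired into frozen differs in the details):

    #include <cstdint>

    // The standard murmur3 64-bit finalizer ("fmix64"): mixes all input bits
    // into all output bits with a few xor-shift/multiply rounds.
    constexpr std::uint64_t fmix64(std::uint64_t k) {
      k ^= k >> 33;
      k *= 0xff51afd7ed558ccdULL;
      k ^= k >> 33;
      k *= 0xc4ceb9fe1a85ec53ULL;
      k ^= k >> 33;
      return k;
    }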

@cbeck88 (Collaborator, Author) commented Dec 2, 2017

I opened a WIP / experimental PR to show you my ideas on this, will write more later!

@cbeck88 (Collaborator, Author) commented Dec 4, 2017

@serge-sans-paille:

So here, roughly, is my concern: if the initial input has a pathological distribution, the algorithm will fail to complete and will time out.

Here's an example input distribution I had in mind: suppose you are using bit operations to pack multiple flags into an enum, and you want to associate a string with each bit for the purpose of diagnostic messages.

enum BitMask {
  a = 1,
  b = 2,
  c = 4,
  d = 8,
  e = 16,
  f = 32,
};

Or, for another example where I think it would go badly, suppose that all of the enum values are multiples of 32.

enum MyEnum {
  foo = 32,
  bar = 64,
  baz = 96,
  qaz = 128,
};

Now, we want to be able to associate the enums with string representations of their names. We could implement an enum2string function using switch, which would be fast. If we also want string2enum... we could also implement that as a bunch of if/else. But often, the programmers want to be able to iterate over all of the valid enum values. So we end up with a bunch of extern const std::map<const MyEnum, const std::string> in the code. We'd like to replace all that stuff with frozen::map.

However, what is going to happen in the pmh tables when I try to do

frozen::make_unordered_map<int, frozen::string>({{32, "foo"}, {64, "bar"}, {96, "baz"}, {128, "qaz"}, ...});

If there are only, say, 16 strings or fewer, then M is going to be maybe 16, maybe 32. Either way, the "unseeded" version of elsa returns just its input: elsa{}(32) == 32, elsa{}(64) == 64, elsa{}(96) == 96. And all of these numbers, mod M, are going to be zero.

So in the first pass of make_pmh_tables, where we put every item into one of the M buckets, they are actually all going to end up in bucket 0.
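(A quick check of that claim, assuming the unseeded elsa really is the identity on these integers:)

    #include <cstddef>

    // Assuming elsa{}(x) == x for these integer keys (as described above),
    // the first-level bucket index is just key % M.
    constexpr std::size_t M = 16; // a plausible table size for a handful of entries
    static_assert(32 % M == 0 && 64 % M == 0 && 96 % M == 0 && 128 % M == 0,
                  "every key lands in bucket 0");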

Why is that bad? It's bad because in the second part of the algorithm, where we choose a random seed for each bucket, the chance that a random seed is good gets exponentially smaller as the bucket size increases.

    // Repeatedly try different values of d until we find a hash function
    // that places all items in the bucket into free slots
    std::size_t d = 1;
    cvector<std::size_t, M> slots;

    while (slots.size() < bsize) {
      auto slot = hash(key(it[bucket[slots.size()] - 1]), d) % M;

      if (values[slot] != 0 || !all_different_from(slots, slot)) {
        slots.clear();
        d++;
        continue;
      }

      slots.push_back(slot);
    }

(That is: I think the running time of this loop is exponential in bsize. There is also some dependency on how many items have already been mapped, but when bsize is very large, say half the elements, I think it should always be a bad case, and exponentially bad in bsize.)

If it were okay for the buckets to be arbitrarily large, then we could just skip the entire "first table" part. A "naive" way to do perfect hashing would be to just choose random numbers d over and over until one of them manages to hash everything perfectly, and then write that down as the seed. The thing is, that will not be linear time -- it will be exponential, IIUC, in n, where n is the number of elements.

The linear-time perfect hashing paper, referred to from the original blog post, makes a statistical argument that, if I understand correctly, has two parts:

  • If the distribution of bucket sizes from the initial hash is "good enough", then the greedy choosing of seeds (construction of secondary table) takes place in linear time
  • With high probability, the initial distribution of bucket sizes is "good enough".

This is why they say it's linear time even though it is probabilistic -- you could start the algorithm and measure with a stopwatch, and if it doesn't succeed after 100N steps or something, you could start over, and with high probability you will never have to restart more than a few times.

The thing is that we don't have the luxury of starting over: if the second part fails to finish, we'll just get a compilation failure.

So to try to improve the success rate, and make sure the algorithm can handle even pathological inputs, I want to try to find a "hash looks good" condition, so we can try a few "initial" hashes and only proceed to building the second table if the bucket distribution looks good. (Since otherwise, building the second table will probably fail.)

In the commits I sent over, I messed around with "sum of squares of bucket sizes" as a metric for success. In a "good" hash, most of the buckets should be constant-sized, or log n or so at worst -- and overall I would expect the sum of squared bucket sizes to be O(n). In an extremely imbalanced hash, where one bucket has all the elements, the sum of squared bucket sizes would be O(n^2). So I just chose some limit like 10n and said that anything exceeding this threshold is rejected.
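Concretely, the check I have in mind is roughly this (a sketch, not the exact code in those commits):

    #include <cstddef>

    // Sketch of the "hash looks good" test: reject a first-level seed whose
    // sum of squared bucket sizes exceeds some small multiple of N (10*N here).
    template <std::size_t M>
    constexpr bool buckets_look_good(const std::size_t (&bucket_size)[M], std::size_t N) {
      std::size_t sum_of_squares = 0;
      for (std::size_t b = 0; b < M; ++b)
        sum_of_squares += bucket_size[b] * bucket_size[b];
      return sum_of_squares <= 10 * N;
    }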

I think the sum-of-squares thing is not very rigorous though. The best thing would be to read the paper in detail and see what condition they use. I think the "rigorous" answer would be something like the sum of c^{bucket size} for some constant c. That makes sense to me because I think the running time of the second part is indeed going to be exponential in the size of each bucket, and we do it for each bucket. I'm really not sure what c is supposed to be, though.

It's still not completely rigorous, because in the paper they assume a random hash function model, and we aren't using "random hash functions". But there's only so much you can do. We are giving the hash functions pretty large seeds (64 bits), and if the tables are usually not that big, then it should work almost the same, I think.

Also: I think that using a seed for the "initial hash" will fix all of these pathological input cases, because we should be able to quickly find a seed that gives a "pretty good" bucket size distribution, and then the algorithm is likely to succeed. If the sum of the squares of the bucket sizes is O(n), it means that the majority of the elements are in buckets of size O(1), and there can't be more than a few buckets as large as sqrt(n).
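To spell that out, using the 10n threshold from above:

\[
\sum_i b_i^2 \le 10n
\;\Longrightarrow\;
\max_i b_i \le \sqrt{10n},
\qquad
\#\{\text{elements in buckets of size} \ge s\} = \sum_{b_i \ge s} b_i \le \frac{1}{s} \sum_i b_i^2 \le \frac{10n}{s},
\]

so no single bucket can exceed sqrt(10n), and only O(n/s) elements can sit in buckets of size s or more.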

When I have some free time again, I'm going to try creating some test cases along these lines and see whether they fail to compile on master and compile after my PR, but I haven't gotten around to it yet.

@cbeck88 (Collaborator, Author) commented Dec 6, 2017

Hi,

I opened PR #26 to show what I think is a "pathological input" where current master fails, and which shows the advantage of the do { ... } while loop in PR #24.

Basically, even if the inputs (deterministically) cause a bad first-round hash with the initial seed, we can detect that and try a new seed before trying to build the second table. Or so the idea goes.

cbeck88 closed this as completed Dec 27, 2017