
New hashmap implementation #5999

Merged · 1 commit · Sep 2, 2020

Conversation

@Sahnvour (Contributor) commented Aug 8, 2020

First off, I'm sorry this comes shortly after @squeek502's work on the Robin Hood hashmap. I have been working on and off on this hashmap implementation for quite some time, with the goal of contributing it to the std.

However, I think the two implementations are complementary and no work was wasted.
I tried to adhere as closely as possible to the current API.

Design

It's based on open addressing (all elements are stored in a single contiguous array) and linear probing (collisions are resolved by simply trying the next slot in the array). Quite similar to Google's widely publicized Swiss tables, but a lot simpler.

1. Fast

The goal is to have a hashmap that is as fast as possible for lookups (considered the most important use case) and insertion/removal (second most important).

Statistics

We assume that the hash function is of good quality, giving unbiased results and spreading elements evenly over the available slots. This is an absolute prerequisite for hashmap implementations, and it is the case with Zig's standard hash function (though there are obviously other candidates).
The probability of an element being assigned to a given slot is 1/number_of_slots. This does not mean that there is a bijection between keys and slots, so we still need to handle collisions.

Linear probing is an efficient (see next §) and very simple way to deal with collisions. I quite like that it's very easy to understand and has very predictable behavior for the CPU. However, when probing a collision chain to find the correct key, the algorithm needs to perform equality comparisons on the keys. For simple types such as ints, that's not really a problem, but keys can be larger and much more complex (you don't want to do lots of string comparisons, for example). To remedy this, the hashmap keeps 6 bits from each hash and stores them as per-slot metadata (along with the slot's state: free, used, tombstone).

The mechanism that determines which ideal slot a key belongs to already uses log2(number_of_slots) bits from the hash, since we always keep a power-of-two number of slots to do fast modulus by masking. Keys that belong in the same ideal slot have identical low log2(number_of_slots) bits in their hashes. But since the hash function is assumed to give random results, the 6 high bits of their hashes are very likely to differ. Using these bits further helps to differentiate keys without resorting to equality comparison.
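
To make the bit budget concrete, here is a minimal sketch (hypothetical helper functions in 2020-era Zig syntax, not the PR's actual code; exactly which 6 bits are kept is an implementation detail) of splitting a 64-bit hash into the ideal-slot index and the fingerprint:

```zig
const std = @import("std");

/// Ideal slot for `hash`, assuming `capacity` is a power of two so that
/// masking acts as a fast modulus.
fn slotIndex(hash: u64, capacity: u64) u64 {
    std.debug.assert(std.math.isPowerOfTwo(capacity));
    return hash & (capacity - 1);
}

/// The 6 highest bits of the hash. They are independent of the low bits used
/// for the slot index, so two keys colliding on the same slot will usually
/// still differ in their fingerprint.
fn fingerprint(hash: u64) u6 {
    return @truncate(u6, hash >> 58);
}

test "split a hash into slot index and fingerprint" {
    const hash: u64 = 0xDEADBEEFCAFEBABE; // stand-in for a real hash result
    std.debug.assert(slotIndex(hash, 64) == (hash & 63));
    std.debug.assert(fingerprint(hash) == (hash >> 58));
}
```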

Spatial locality

Pieces of data that are accessed together benefit greatly from being close together in memory, as CPUs optimize for this use case. By using only 8 bits of metadata per element and storing all the metadata contiguously, when accessing one (while doing a lookup, for example) we typically get the metadata of 7 other slots for free (assuming a 64-byte cache line).

Effectively this means that even if the hashmap is nearly full and has collisions that require probing, it is almost free, since a probe chain of length 8 is already in cache.
Not having to access the slot array until we're certain to have found the correct slot also means that we don't waste memory bandwidth on unused data. If the metadata were embedded within the slots, probing would make poor use of the cache, as metadata entries would be spaced apart by the size of an element.
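
As an illustration of that lookup path, here is a simplified, hypothetical probe loop (invented names, a plain u32 key type, 2020-era syntax); it only touches the slot arrays once the fingerprint already matches:

```zig
const Metadata = packed struct {
    used: bool,
    tombstone: bool,
    fingerprint: u6,
};

/// Linear-probe lookup over a contiguous metadata array. `metadata.len` is
/// assumed to be a power of two, and the table is assumed to always keep at
/// least one free slot so the loop terminates.
fn find(metadata: []const Metadata, keys: []const u32, key: u32, hash: u64) ?usize {
    const mask = metadata.len - 1;
    const fp = @truncate(u6, hash >> 58);
    var idx = @truncate(usize, hash) & mask;
    // Free slots end a probe chain; used and tombstone slots keep it going.
    while (metadata[idx].used or metadata[idx].tombstone) : (idx = (idx + 1) & mask) {
        // Consecutive metadata bytes share cache lines, so walking a short
        // chain here is cheap; the colder `keys` array is only read when the
        // 6-bit fingerprint already matches.
        if (metadata[idx].used and metadata[idx].fingerprint == fp) {
            if (keys[idx] == key) return idx;
        }
    }
    return null;
}
```

The actual implementation has more to it (insertion, deletion, growth), but the hot lookup path boils down to a loop like this over one byte per slot.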

2. Memory efficient

8 bits of metadata is quite low, and it's hard to go lower while keeping comparable advantages.

The hashmap also holds only one allocation that contains both the metadata and slot arrays. This is hard to measure, but I think it helps reduce pressure on the allocator and fragmentation.
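
For illustration only, here is a hypothetical sketch of the combined allocation's size for a u32 -> u32 map, laid out as [ header | metadata | keys | values ] (names invented; real code must also handle alignment between the sections):

```zig
const std = @import("std");

/// Total byte size of one block backing `cap` slots of a u32 -> u32 map.
fn combinedAllocationSize(cap: usize) usize {
    const header_size = @sizeOf(u32); // e.g. the capacity itself
    const metadata_size = cap * @sizeOf(u8);
    const keys_size = cap * @sizeOf(u32);
    const values_size = cap * @sizeOf(u32);
    return header_size + metadata_size + keys_size + values_size;
}

test "one allocation for everything" {
    // A 64-slot u32 -> u32 map fits in a single block of a few hundred
    // bytes rather than several separate allocations.
    std.debug.assert(combinedAllocationSize(64) == 4 + 64 * (1 + 4 + 4));
}
```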

3. Small footprint

Extra effort was spent on keeping the struct as small as possible: it's only 16 bytes (24 for the Managed variant).
The fields embedded in the struct are the ones needed most frequently, to avoid unnecessary cache misses. Other candidate fields are stored in the allocation (mainly the capacity).

To keep it at 16 bytes, the size of the hashmap is limited to 32 bits (about 4 billion entries). This seems like a reasonable choice to me.
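
A hypothetical struct with that shape (field names invented for illustration, not necessarily the PR's) indeed comes out to 16 bytes on 64-bit targets:

```zig
const std = @import("std");

const Size = u32; // the 32-bit size limit mentioned above

const UnmanagedSketch = struct {
    // Single pointer into the combined metadata + slots allocation; the
    // capacity is stored inside that allocation rather than in the struct.
    metadata: ?[*]u8 = null,
    size: Size = 0,      // number of live entries
    available: Size = 0, // insertions left before the next grow
};

comptime {
    if (@sizeOf(usize) == 8) {
        // 8 (pointer) + 4 + 4 = 16 bytes; a Managed variant additionally
        // stores an allocator reference, hence the 24 bytes quoted above.
        std.debug.assert(@sizeOf(UnmanagedSketch) == 16);
    }
}
```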

Comparison with status quo

Pros

  • in my benchmarks, almost always faster by a good margin
  • uses less memory, only 8 bits of overhead per element
  • can achieve higher load factors for similar speed, for even lower memory use

Cons

  • iteration speed over elements is slower by an order of magnitude (you can't beat contiguous arrays!)
  • absolutely no order guarantee (which can be nice, but is not a main goal of hashmaps in my opinion)
  • modification invalidates live iterators

Performance

I've taken the best of 3 runs from my benchmark, which is basically just a longer version of what's now used in gotta-go-fast. Ported from https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/

Master

iterate while adding 0.428s
iterate while removing 0.427s

insert 100M int 13.606s
clear 100M int 0.127s
reinsert 100M int 7.300s
remove 100M int 8.824s
reinsert 100M int 7.293s
deinit map 0.148s

5% distinct 9.440s
25% distinct 10.650s
50% distinct 11.089s
100% distinct 9.932s

0% success, ffffffff  6.095ns
0% success, ffffffff00000000  6.168ns
25% success, ffffffff  6.161ns
25% success, ffffffff00000000  6.190ns
50% success, ffffffff  6.186ns
50% success, ffffffff00000000  6.123ns
75% success, ffffffff  6.238ns
75% success, ffffffff00000000  6.097ns
100% success, ffffffff  6.608ns
100% success, ffffffff00000000  6.597ns

This PR

iterate while adding 6.430s
iterate while removing 7.732s

insert 100M int 9.489s
clear 100M int 0.005s
reinsert 100M int 5.474s
remove 100M int 3.331s
reinsert 100M int 6.953s
deinit map 0.452s

5% distinct 3.844s
25% distinct 5.801s
50% distinct 6.572s
100% distinct 8.181s

0% success, ffffffff  5.641ns
0% success, ffffffff00000000  5.986ns
25% success, ffffffff  5.543ns
25% success, ffffffff00000000  5.374ns
50% success, ffffffff  5.077ns
50% success, ffffffff00000000  4.923ns
75% success, ffffffff  4.614ns
75% success, ffffffff00000000  4.526ns
100% success, ffffffff  4.398ns
100% success, ffffffff00000000  4.413ns

Right tool for the job

I think there's a use for both implementations, and I renamed the current one to sliceable_hash_map, as sliceability is its main advantage. That is only a proposal.

@daurnimator added the "standard library" label (This issue involves writing Zig code for the standard library.) on Aug 8, 2020
@squeek502 (Collaborator) commented

Just to clarify, I didn't do any work on the new hash map implementation. That was all @andrewrk. I just made some graphs.

@Sahnvour force-pushed the hashmap branch 2 times, most recently from aecb1f3 to e34b954 on August 9, 2020 at 11:37
@Sahnvour (Contributor, Author) commented Aug 9, 2020

There are still a few errors (for example, I think the translate-c code depends on the ordering of the current hashmap somewhere), which I'll fix if there is interest in this.

@squeek502 (Collaborator) commented

Ran my basic insertion benchmark out of curiosity and got some interesting results.

(note: this only goes up to 10 million insertions)

[graphs: 5999-i32, 5999-strings]

I keep coming back to this comment: ziglang/gotta-go-fast#2 (comment)

because hash maps can have very different performance characteristics depending on the size of the map, a benchmark like 'add 1,000,000 elements, get 1,000,000 times, then remove all the elements one by one' would only represent one single point along the continuum of possible hash map sizes, and therefore it would fail to catch regressions in performance for other map sizes.

@Sahnvour (Contributor, Author) commented

That's true, and I totally agree that there are many (too many, in fact) dimensions along which we can compare hashmaps.
However, that would undersell Martin's benchmarks, which try to model some "real life" scenarios and are, in my opinion, well thought out. Granted, they are more interested in big hashmaps.

I ran your benchmark on my machine and got these (raw) results
[image: raw benchmark results]
So I guess we can also add CPU arch, memory speed, OS, etc. to the mix.

For example, I think the low insertion counts (< 1000 ?) mostly measure allocation speed, rather than insertion.

I'm a bit puzzled by your results in this range, and can't quite wrap my head around the sawtooth pattern in my graph for this PR. Maybe the insert counts are pathologically bad at every odd point (for example, having just triggered a grow + rehash, thus increasing the mean insertion time).

@squeek502 (Collaborator) commented

I'm a total beginner to this type of stuff so I'm more than willing to believe that I'm not taking the proper things into account. My graphs are very strange indeed.

@ifreund (Member) commented Aug 11, 2020

I haven't read through your implementation in detail, but from my knowledge of the swiss tables implementation they make good use of SIMD to speed up lookups. What advantages does your design have over such an approach? Or could that be a potential future improvement?

@Sahnvour (Contributor, Author) commented

@ifreund It is my understanding that Google tries to achieve extremely high load factors in its hashmaps, because memory usage is literally money to them. I would have to watch their talks again, so don't quote me on that, but IIRC they go as high as 97.5% occupancy to reduce memory waste.

This means that even with a very good hash function that provides good distribution, they likely have long collision chains when their maps are full, because the set of possible input keys is almost always larger than the set of possible hash results. The performance of flat hashmaps (like this one, or the standard one) relies heavily on reducing collision probability by using more memory than would be minimally needed to store all the elements. The more additional memory you reserve, the more slots remain unused and thus break collision chains. This means that when looking for a slot (to either do a lookup or a modification), you can statistically stop traversing your data earlier.

When you allow very high load factors, the number of free slots is reduced to the bare minimum you find acceptable (this really is just a tradeoff) and collision chains get longer and longer. This implies that when looking for a slot, you will likely have to probe through more already-used slots before finding the right one. In both Google's implementation and this one, probing is mainly done by looking at an array of 8-bit metadata. When collision chains get longer, testing multiple metadata bytes at once with SIMD becomes interesting.
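
For rough intuition (a textbook back-of-the-envelope estimate, not a measurement of either implementation): Knuth's classical analysis of linear probing with a uniform hash puts the expected probe count at about (1 + 1/(1-a)) / 2 for a successful lookup and (1 + 1/(1-a)^2) / 2 for an unsuccessful one, where a is the load factor. At a = 0.80 that is roughly 3 and 13 probes; at a = 0.975 it is roughly 20 and 800. Chains that long are where comparing 16 metadata bytes per SIMD instruction starts to pay off, whereas at moderate load factors a short scalar probe that stays within one cache line is already cheap.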

However, probing in packs of, say, 16 metadata bytes complicates the design and implementation (this is best explained in their conference talks, so I won't try to reproduce it from memory :) in ways that IMO are not needed to achieve good performance at less extreme load factors.
When I prototyped this implementation in C++, I was able to beat Google's in pretty much all benchmarks from the suite I linked in my previous comment. Your mileage may vary though.

So to conclude, it might be an improvement but it also might not. 😄

@andrewrk (Member) commented

Just to clarify, I didn't do any work on the new hash map implementation. That was all @andrewrk. I just made some graphs.

they were pretty graphs tho 😉

However, I think the two implementations are complementary and no work was wasted.

I agree with this conclusion. However, let's try to make it really clear what the differences and usage patterns would be. I think it would be fair to say that the API of the master branch hash map is strictly more convenient. Having the ArrayList of entries available for direct access is pretty nice in terms of ergonomics, and the fact that order is preserved (and independent of the hash function) is a really nice property. So it would seem the benefits of this implementation are trading some of this API convenience for better resource usage (memory & CPU). This being the case, I'd like to make sure it actually is significantly better than status quo hash maps before adding it as another option. Based on some of the data here, it's not entirely clear, right? I think before merging this it would be worth it to understand the benefits in a more conclusive way, so that the doc comments in the std lib can confidently explain when to choose one over the other.

sawtooth pattern

My guess is it has to do with the underlying memory allocator returning memory that it already has available vs requesting extra memory from the OS. E.g. imagine appending to an array list, and on the 9th element appended, it doubles its capacity.

@Sahnvour (Contributor, Author) commented

Well, to me the results are pretty clear in that this PR has better (up to 2.5-3x) performance on all operations except iteration. The results are focused on large hashmaps, but that is what we currently have to measure.
It would indeed be interesting to have some benchmarks targeting small hashmaps, because that is a common use case that can have a different performance profile.

Can we define the level of insight you need in order to accept or reject this PR?

Thinking again about the sawtooth: apart from the influence of the allocator, in my opinion this is normal and expected, but not necessarily representative. Since storage is not reserved for the hashmap in the benchmark, it has to grow when necessary. Depending on the number of elements inserted (the x values), if a grow was just triggered because that count exceeded the previous capacity, the total time divided by the number of inserts (the y values) will be higher than for larger insert counts, until the hashmap has to grow again. The choice of x values just happens to illustrate that.
If we plotted the total time for the inserts instead, we would likely see some kind of plateaus.

@andrewrk (Member) commented

Well, to me the results are pretty clear in that this PR has better (up to 2.5-3x) performance on all operations except iteration.

I see - I think I focused too much on the graphs and neglected to pay more attention to your original performance statistics. OK I see now.

Can we define the level of insight you need in order to accept or reject this PR?

OK never mind about the performance situation, I'm convinced on that end.

Let's start with the fully qualified names for each hash map, along with their doc comments, and once we get those settled I think it will be easy to take the next step towards merging.

So far it looks like we have:

std.sliceable_hash_map.HashMap with doc comments:

/// Insertion order is preserved.
/// Deletions perform a "swap removal" on the entries list.
/// Modifying the hash map while iterating is allowed, however one must understand
/// the (well defined) behavior when mixing insertions and deletions with iteration.
/// For a hash map that can be initialized directly that does not store an Allocator
/// field, see `HashMapUnmanaged`.
/// When `store_hash` is `false`, this data structure is biased towards cheap `eql`
/// functions. It does not store each item's hash in the table. Setting `store_hash`
/// to `true` incurs slightly more memory cost by storing each key's hash in the table
/// but only has to call `eql` for hash collisions. 

std.hash_map.HashMap with no doc comments.

Now imagine you're someone who has never used zig before and you want a hash map, and you're faced with these two options. Let's come up with some nice names and explanations to guide such a user into making the choice that will work best for them.

@Sahnvour (Contributor, Author) commented

Totally agree; naming things is hard though. I've been in this position before, having to name two different implementations optimized for different use cases, and it's not easy. For example, we have to avoid names that, when compared to one another, lead people to think there's an absolutely better option.

For the current hashmap, I propose sliceable_hash_map or ordered_hash_map, depending on what we think is its main advantage. I was focused on the entries being stored contiguously and thus sliceable, but maybe you think its ordered property is more interesting.

For the new hashmap from this PR, it's a bit harder. I would go for fast_hash_map, which is self-describing, but I fear it would outshine the other because fast is always better, right?

@andrewrk (Member) commented

"fast" is always implied in all code, and there is always a better adjective to describe in the name. Better for the name to mention the constraints rather than the ephemeral performance. Consider how misleading the name "quicksort" is, regardless of whether it was actually faster than its contemporaries. A better name for "quicksort" would have been "partition-exchange sort".

It is misleading because, depending on the usage pattern, the sliceable hash map might be faster, for example if the bottleneck in the code is iterating over the hash map. Better to describe the semantic limitations or lack thereof. Here are some ideas:

  • SparseHashMap. If we compare this to ArrayHashMap and ArrayList, there is a nice consistency here.
  • HashMap. I am OK with using the bare name for this implementation because it has the "default" semantic properties that one expects to find in a hash map API. Let's just make the doc comments easy to find and helpful. I don't think it should say "performance oriented" because again all code is performance oriented. The doc comments should focus on what constraints and guarantees the API does or does not have. The programmer will be expected to choose the API to match the constraints of their problem domain.

After thinking about this a bit I agree with your original proposal to make it the default and name it simply HashMap (the second bullet point). I inspected the code a bit more closely and I see that this new hash map has low memory overhead for small maps, which is great! I think this is quite suitable for default use.
