
Conversation

@sentientwaffle
Contributor

When entries are inserted into and removed from a hash map at an equivalent rate (maintaining a mostly-consistent total count of entries), the map should never need to be resized. But `HashMapUnmanaged.available` does not presently count tombstoned slots as "available", so this put/remove pattern eventually panics (assertion failure) when `available` reaches `0`.

The solution implemented here is to count tombstoned slots as "available". Another approach (which hashbrown takes: https://github.com/rust-lang/hashbrown/blob/b3eaf32e608d1ec4c10963a4f495503d7f8a7ef5/src/raw/mod.rs#L1455-L1542) would be to rehash all entries in place when there are too many tombstones. This is more complex, but avoids an `O(n)` bad case when the hash map has many tombstones.
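For illustration, here is a minimal sketch of the put/remove pattern that exhausts `available` (a hypothetical test using the managed `std.AutoHashMap` wrapper and `ensureTotalCapacity`; the PR's actual regression test lives in lib/std/hash_map.zig):

```zig
const std = @import("std");

test "steady-state put/remove should not exhaust capacity" {
    var map = std.AutoHashMap(u32, u32).init(std.testing.allocator);
    defer map.deinit();
    try map.ensureTotalCapacity(16); // reserve once; no further growth should be needed

    var i: u32 = 0;
    while (i < 1000) : (i += 1) {
        map.putAssumeCapacityNoClobber(i, i); // take one slot...
        _ = map.remove(i); // ...and give it back, leaving a tombstone behind
    }
    // Without the fix, every insert that lands on a genuinely empty slot
    // decrements `available`, removals never give anything back, and the loop
    // eventually trips `assert(self.available > 0)`.
}
```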

The assertion failure is shown below. The new test case exercises this behavior. This is the same problem described in #7468.

…/zig/lib/std/debug.zig:224:14: 0x20919b in std.debug.assert (test)
    if (!ok) unreachable; // assertion failure
             ^
…/zig/lib/std/hash_map.zig:1013:23: 0x20f5f7 in HashMapUnmanaged(u32,u32,AutoContext(u32),80).putAssumeCapacityNoClobberContext (test)
                assert(self.available > 0);
                      ^
…/zig/lib/std/hash_map.zig:551:68: 0x20ac70 in HashMap(u32,u32,AutoContext(u32),80).putAssumeCapacityNoClobber (test)
            return self.unmanaged.putAssumeCapacityNoClobberContext(key, value, self.ctx);
                                                                   ^
…/zig/lib/std/hash_map.zig:1916:39: 0x20a513 in test "std.hash_map loop putAssumeCapacity/remove" (test)
        map.putAssumeCapacityNoClobber(20 + i, i);
                                      ^
…/zig/lib/std/special/test_runner.zig:77:28: 0x240507 in std.special.main (test)
        } else test_fn.func();
                           ^
…/zig/lib/std/start.zig:543:22: 0x23797c in std.start.callMain (test)
            root.main();
                     ^
…/zig/lib/std/start.zig:495:12: 0x210bee in std.start.callMainWithArgs (test)
    return @call(.{ .modifier = .always_inline }, callMain, .{});
           ^
…/zig/lib/std/start.zig:409:17: 0x20b286 in std.start.posixCallMainAndExit (test)
    std.os.exit(@call(.{ .modifier = .always_inline }, callMainWithArgs, .{ argc, argv, envp }));
                ^
…/zig/lib/std/start.zig:322:5: 0x20b092 in std.start._start (test)
    @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{});
    ^

I'm not quite sure how the "std.hash_map ensureUnusedCapacity with tombstones" test should be updated, since it was written with the expectation (and a note) that tombstones count as load.

As an aside, the current default_max_load_percentage = 80 is rather high and may lead to (or exacerbate) primary clustering issues, so that should be addressed in the future.

@Sahnvour
Contributor

If I remember correctly, the reason why tombstones participate in the load factor (hence are not counted as "available" slots) is that they participate in collision chains. Even if the hashmap has only 0.1% used slots and 79.9% tombstones, lookups will still very likely have to walk long probe chains, and that's what we try to minimize. Hence the fact that we grow even if the hashmap appears almost empty.
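To make the probing cost concrete, here is a toy model of an unsuccessful lookup under linear probing (illustrative only, not the std.hash_map code): a miss can only stop at a genuinely free slot, so every tombstone in the chain is one more slot that must be visited.

```zig
const Slot = enum { free, tombstone, used };

/// Number of slots an unsuccessful lookup visits when starting at `home`.
fn missProbeLength(metadata: []const Slot, home: usize) usize {
    var i: usize = 0;
    while (i < metadata.len) : (i += 1) {
        if (metadata[(home + i) % metadata.len] == .free) return i + 1;
        // .used slots are compared against the key and rejected;
        // .tombstone slots are skipped; both still cost a probe.
    }
    return metadata.len; // no free slot left: the whole table is scanned
}
```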

When entries are inserted and removed into a hash map at an equivalent rate (maintaining a mostly-consistent total count of entries), the map should never need to be resized.

In this scenario, resizes can still occur, but their likelihood decreases exponentially.
I agree this behaviour may be surprising, but your change keeps longer probe lengths to avoid a resize. Maybe it's worth it, maybe not, but I'm not sure it's so clear-cut. If we're thinking about the most generic use of hashmaps (which isn't defined, so I'm not sure it even makes sense; as always, everything is a matter of compromise), I'd argue that lookup performance is the main metric to optimize for.

As an aside, the current default_max_load_percentage = 80 is rather high and may lead to (or exacerbate) primary clustering issues, so that should be addressed in the future.

From my tests it's a good compromise between memory usage and performance (but as always, it depends on the use case). The hashmap performs quite nicely even with this value, but feel free to do measurements on probe lengths to convince yourself.

However I'm quite against lowering it without a substantial amount of actual data to back this claim. Clustering is a matter of hash function quality, and the stdlib uses state of the art, very high quality hash functions (at least last time I checked).

@Sahnvour requested a review from SpexGuy on December 14, 2021, 23:21.
@sentientwaffle
Contributor Author

Thanks for the clarification & feedback!

A brief summary of my use case:

  • AutoHashMapUnmanaged with a known maximum number of entries.
  • After reserving some capacity during setup (ensureCapacity), it shouldn't require any further allocations/reallocations.
  • Entries are periodically removed or inserted (putAssumeCapacityNoClobber) but the total number of entries never exceeds the capacity reserved up front.
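In code, the shape is roughly the following (a sketch with hypothetical names, not TigerBeetle's actual code; `ensureTotalCapacity` is the current name for the `ensureCapacity` step above):

```zig
const std = @import("std");

const Tracker = struct {
    map: std.AutoHashMapUnmanaged(u32, u32) = .{},

    fn setup(self: *Tracker, allocator: std.mem.Allocator, max_entries: u32) !void {
        // The only allocation happens here, during setup.
        try self.map.ensureTotalCapacity(allocator, max_entries);
    }

    fn insert(self: *Tracker, key: u32, value: u32) void {
        // No allocator in scope: the map must never need to grow beyond the
        // capacity reserved in setup().
        self.map.putAssumeCapacityNoClobber(key, value);
    }

    fn removeKey(self: *Tracker, key: u32) void {
        _ = self.map.remove(key);
    }
};
```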

In this scenario, resizes can still occur, but their likelihood decreases exponentially.
I agree this behaviour may be surprising, but your change keeps longer probe lengths to avoid a resize. Maybe it's worth it, maybe not, but I'm not sure it's so clear-cut. If we're thinking about the most generic use of hashmaps (which isn't defined, so I'm not sure it even makes sense; as always, everything is a matter of compromise), I'd argue that lookup performance is the main metric to optimize for.

Right, but HashMapUnmanaged doesn't store a reference to the allocator, and putAssumeCapacity{,NoClobber} doesn't take it as an argument, so it can't grow/resize. I'm aware that my change compromises on the lookup performance, but the existing behavior seems unsafe. Lookup performance could be regained with the approach I linked in hashbrown.

The current behavior seems unsafe especially because whether and when it panics is so unpredictable. Currently, when putAssumeCapacity lands on a tombstone slot it succeeds, but entries that land on an empty slot decrement available, and eventually hit the assertion failure.
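To spell out the bookkeeping, here is a toy model of the `available` accounting (illustrative only; the real logic is in lib/std/hash_map.zig):

```zig
const std = @import("std");

const Available = struct {
    count: u32,

    fn insertIntoEmptySlot(self: *Available) void {
        std.debug.assert(self.count > 0); // the assertion that currently fires
        self.count -= 1;
    }

    fn insertIntoTombstone(self: *Available) void {
        // Before this PR: no change here, and no change on removal either, so
        // the counter only ever went down (the leak).
        // With this PR: reusing a tombstone also consumes availability...
        self.count -= 1;
    }

    fn removeLeavingTombstone(self: *Available) void {
        // ...because a removal now pays it back.
        self.count += 1;
    }
};
```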

@rohlem
Contributor

rohlem commented Dec 15, 2021

I don't have much expertise on this, but it sounds like both options can be justified.
Additionally, this looks like a reasonably small code change.

Maybe it could be exposed via a compile-time option, just like the load ratio is currently configurable?
Especially since the interface doesn't change, that way each usage scenario can choose its preferred trade-off, or periodically benchmark the performance and decide based on that.

As for which option should be the default, I'm not sure.
I agree that panics should be absolutely avoidable, so maybe all interfaces that can lead to random panics should be reconsidered.

Devil's advocate, in a theoretically endless usage scenario with random data, I don't know if I'd prefer theoretically boundless memory usage or runtime.
I feel like Zig's design generally favours low-memory use cases.
I wouldn't always consider that more important, though of course boundless memory requirements make any program gradually inexecutable in all environments.

I guess a compromise could be to make it a tombstone_availability_percentage parameter, if that doesn't add too much complexity.
Just my 2 cents though, feel free to ignore if it's not helpful.

@jorangreef
Contributor

jorangreef commented Dec 15, 2021

@Sahnvour I understand where you're coming from.

As luck would have it, there's a recent paper, Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering, from Stony Brook/Google/MIT, published in July this year, that dives into all of this and comes back out with some counter-intuitive recommendations (my understanding of Swiss and FB's F14 tables was altered on several points in the past hour!).

Apologies for the lengthy quotes, but they're such awesome sections and seem to be speaking right to the heart of what we're wanting to know here:

It is widely believed and taught, however, that linear probing should never be used at high load factors; this is because of an effect known as primary clustering which causes insertions at a load factor of 1 − 1/x to take expected time Θ(x²) (rather than the intuitive running time of Θ(x)). The dangers of primary clustering, first discovered by Knuth in 1963, have now been taught to generations of computer scientists, and have influenced the design of some of the most widely used hash tables in production.

We show that primary clustering is not the foregone conclusion that it is reputed to be. We demonstrate that seemingly small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions: if these design decisions are made correctly, then even if a hash table operates continuously at a load factor of 1 − Θ(1/x), the expected amortized cost per insertion/deletion is Õ(x). This is because the tombstones left behind by deletions can actually cause an anti-clustering effect that combats primary clustering. Interestingly, these design decisions, despite their remarkable effects, have historically been viewed as simply implementation-level engineering choices.

The dangers of primary clustering (and the advice of using quadratic probing as a solution) have been taught to generations of computer scientists over roughly six decades. The folklore advice has shaped some of the most widely used hash tables in production, including the high-performance hash tables authored by both Google [1] and Facebook [25]. The consequence is that primary clustering—along with the design compromises made to avoid it—has a first-order impact on the performance of hash tables used by millions of users every day.

What the classical analysis misses. Classically, the analysis of linear probing considers the costs of insertions in an insertion-only workload. Of course, the fact that the final insertion takes expected time Θ(x²) doesn't mean that all of the insertions do; most of the insertions are performed at much lower load factors, and the average cost is only Θ(x).
The more pressing concern is what happens for workloads that operate continuously at high load factors, for example, the workload in which a user first fills the table to a load factor of 1 − 1/x, and then alternates between insertions and deletions indefinitely. Now almost all of the insertions are performed at a high load factor. Conventional wisdom has it that these insertions must therefore all incur the wrath of primary clustering.

This conventional wisdom misses an important point, however, which is that the tombstones created by deletions actually substantially change the combinatorial structure of the hash table. Whereas insertions add elements at the ends of runs, deletions tend to place tombstones in the middles of runs. If implemented correctly, then the anti-clustering effects of deletions actually outpace the clustering effects of insertions.

We call this new phenomenon primary anti-clustering. The effect is so powerful that, as we shall see, it is even worthwhile to simulate deletions in insertion-only workloads by prophylactically adding tombstones. Our results flip the narrative surrounding deletions in hash tables: whereas past work on analyzing tombstones [7,68] has focused on showing that tombstones do not degrade performance in various open-addressing-based hash tables, we argue that tombstones actually help performance. By harnessing the power of tombstones in the right way, we can rewrite the asymptotic landscape of linear probing.


This point from the paper in particular also helped me to understand better how queries and insertions interact:

Tombstones interact asymmetrically with queries and insertions: queries treat a tombstone as being a value that does not match the query, whereas insertions treat the tombstone as a free slot.

Therefore, I think the key is to see this asymmetry, and then deal with queries and insertions separately:

  • How does replacing an element with a tombstone affect queries over the remaining elements? In fact, the remaining elements all stay in the same place with the same linear probing path length, so no impact.

  • How does replacing an element with a tombstone affect insertions? Here the paper shows something radical in "that the tombstones left behind by those deletions have a primary-anti-clustering effect, that is, they have a tendency to speed up future insertions".
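To make the asymmetry concrete, here is a toy insertion probe (illustrative only, not the std implementation), the counterpart of the lookup sketch earlier: an insertion may stop at the first tombstone it meets and reuse that slot, whereas a missed lookup must keep going until it reaches a genuinely free slot.

```zig
const Slot = enum { free, tombstone, used };

/// Find a slot for a key known to be absent (putAssumeCapacityNoClobber-style);
/// assumes at least one free or tombstoned slot exists.
fn findInsertSlot(metadata: []const Slot, home: usize) usize {
    var i: usize = 0;
    while (true) : (i += 1) {
        const idx = (home + i) % metadata.len;
        switch (metadata[idx]) {
            .free, .tombstone => return idx, // both are usable for an insert
            .used => {}, // occupied by some other key; keep probing
        }
    }
}
```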

So... from the point of view of performance, I think we're actually good on incrementing the available counter when removing an existing element, so that the available counter doesn't leak.

And this paper also seems to support the 80% load factor that the std lib arrived at... my thinking on this changed overnight: I used to think that 50% was the safe default for linear/triangular probing, but your numbers have also quantitatively shown that the current default of 80% is reasonable. So we are agreed there.

On the other hand, the early resize definitely carries a very heavy performance cost (rehashing all keys in the table), and impacts the overall amortized runtime. The fewer resizes the better, unless the user explicitly decides on a compact/rehash policy to shorten existing probe lengths, but I don't believe the std lib should expose compact/rehash in the interface or make any decision to handle that internally at all. If minimizing existing probe lengths is really a concern then there are more recent probing strategies we can adopt that would offer guaranteed worst-case 1 cache miss 99% of the time with a 2nd cache miss 1% of the time, no primary clustering. I believe that's significantly better than linear probing's average case, and certainly its worst-case. I would love to implement this in Zig, and hopefully will be able to soon.

However, after performance, from the immediate point of view of correctness and explicitness, the surprising status quo assertion crash is a critical showstopper for embedded environments. Without this PR, TigerBeetle would be unable to rely on the std lib's AutoHashMapUnmanaged and we would then have to write our own very similar implementation but without forcing the unnecessary early resize.

I think @sentientwaffle (who's on the TigerBeetle team with me) also makes a strong case when he shows that the status quo of the counter leak breaks the ensureCapacity()/putAssumeCapacity() interface, at exactly the point where an allocator is not provided.

@Sahnvour
Contributor

Thanks for the paper, it's very interesting (although I can't say I get all the subtleties) 🙂

It's true and intuitive that tombstones help insertions, and that synthetic benchmarks testing only insertions may be quite far from real-world usage. In fact, I believe benchmarking hashmaps is depressingly hard because of the number of different use cases and their performance characteristics. The gotta-go-fast benchmark suite could definitely be expanded and improved.

However, after performance, from the immediate point of view of correctness and explicitness, the surprising status quo assertion crash is a critical showstopper for embedded environments. Without this PR, TigerBeetle would be unable to rely on the std lib's AutoHashMapUnmanaged and we would then have to write our own very similar implementation but without forcing the unnecessary early resize.

I think @sentientwaffle (who's on the TigerBeetle team with me) also makes a strong case when he shows that the status quo of the counter leak breaks the ensureCapacity()/putAssumeCapacity() interface, at exactly the point where an allocator is not provided.

Yes, this is absolutely an issue.

If minimizing existing probe lengths is really a concern then there are more recent probing strategies we can adopt that would offer guaranteed worst-case 1 cache miss 99% of the time with a 2nd cache miss 1% of the time, no primary clustering. I believe that's significantly better than linear probing's average case, and certainly its worst-case. I would love to implement this in Zig, and hopefully will be able to soon.

You're referring to the proposed graveyard hashing, correct?
Minimizing probing is a concern, and I emphasized this point previously, but after reading the main parts of the paper I might be convinced that, given a good hash function (which we have) and the fact that primary clustering isn't really an issue... it's probably not a sufficient reason to reject this PR.

@andrewrk merged commit ef0566d into ziglang:master on Dec 17, 2021.
@andrewrk
Member

andrewrk commented Dec 17, 2021

Perf results are in

It does seem to be worse perf according to our benchmarks as currently measured:

  • std.AutoHashMap - Insert 10M int - 19% more CPU instructions
  • std.AutoHashMap - Random distinct - 15% more CPU instructions
  • std.AutoHashMap - Random find - 14% more CPU instructions

Some open questions:

  • are there missing hash map benchmarks to paint a full picture?
  • should we complicate the hash map API a bit more so that when doing removals, the caller can opt-out of the available slots incrementing?

@jorangreef
Contributor

jorangreef commented Dec 17, 2021

@andrewrk this PR also introduced a necessary fix to guard against an infinite loop on wraparound (we should have highlighted this more in the discussion), and I believe that's what's responsible for the performance regression (an extra branch in the hot loop, which might be mitigated in the short term with unrolling, or in the long term with a better probing strategy): https://github.com/ziglang/zig/pull/10337/files#diff-8d3864cfd9fd2f29dd2b0458387c1036858fa8ac78d64517736675fbe3eaf33cR1118

Regarding benchmarks:

  • I don't know if this is how they currently work, but it would be great for the hash table benchmarks to work on the basis of ceteris paribus. We should be clear on comparing hash table performance over the same memory cost, i.e. compare hash tables of equivalent size, not one hash table of 512 bytes vs another of 4096 bytes, because the latter has an unfair advantage, when in fact the former might even have better performance/cost ratio, or power-to-weight ratio. I believe this is the more important metric for hash tables.
  • Given equal hash table sizes, intuitively speaking, for some reassurance, it had also been shown before the paper linked above that tombstones do not affect the performance of existing lookups (this paper just took it further to show they help performance, at the same table size). My gut feel is that the difference is down to the extra limit check, and the comparison of unequal hash table sizes. This could be checked quickly by re-running the benchmarks without the limit check.

Regarding complicating the hash map API, I would keep it simple and not let the current probing implementation leak out, because I don't think the current probing strategy is the best maximum to begin with.

There are slightly better probing strategies that don't use tombstones at all, that don't need to handle infinite loops on wraparound, and that have tighter guaranteed worst-case bounds. I wouldn't be worried by the regression at this stage; we should rather move to a better maximum.

@jorangreef
Contributor

On second thought, there's a trick we could use to safely eliminate the expensive limit check completely: just set available at startup to be one less than the actual capacity, so that lookups are always guaranteed to terminate naturally in the presence of tombstones, with no infinite loop. I think this change will clear up the performance regression.

Both linear probing (what we appear to be using at present) and Swiss Table's triangular probing (but then we must always use power-of-two table sizes, which we do at present) will then be guaranteed to probe all slots and find a free slot. Sorry about this, my bad!
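A sketch of the proposed invariant (illustrative only, not the std code): never let more than capacity - 1 slots become non-free, so at least one genuinely free slot always remains and every probe sequence terminates on its own, without an extra wraparound check in the hot loop.

```zig
/// Initial value of `available` for a freshly allocated table (assumes capacity >= 2).
fn initialAvailable(capacity: u32, max_load_percentage: u32) u32 {
    const load_budget = capacity * max_load_percentage / 100;
    // Cap the budget so one slot can never be handed out, e.g.
    // initialAvailable(8, 80) == 6 and initialAvailable(2, 80) == 1.
    return if (load_budget < capacity - 1) load_budget else capacity - 1;
}
```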

@andrewrk
Member

No need to apologize, I'll count it as a win for our new perf tracking system :-)

@jorangreef
Contributor

Gotta go fast!

@Sahnvour
Contributor

I was just thinking about this yesterday and about to propose the same thing :)
Plus, it's interesting to see that the Project Euler 14 benchmark has very little perf variation (and arguably looks more like a real-world usage than just inserting 100M ints).

I don't know if this is how they currently work, but it would be great for the hash table benchmarks to work on the basis of ceteris paribus. We should be clear on comparing hash table performance over the same memory cost, i.e. compare hash tables of equivalent size, not one hash table of 512 bytes vs another of 4096 bytes, because the latter has an unfair advantage, when in fact the former might even have better performance/cost ratio, or power-to-weight ratio. I believe this is the more important metric for hash tables.

They do run the same code at each commit, and track metrics evolution. So unless we change the load factor for example, the comparison is quite fair.

sentientwaffle added a commit to sentientwaffle/zig that referenced this pull request Dec 17, 2021
andrewrk pushed a commit that referenced this pull request Dec 17, 2021
See #10337 for context.

In #10337 the `available` tracking fix necessitated an additional condition on the probe loop in both `getOrPut` and `getIndex` to prevent an infinite loop. Previously, this condition was implicit thanks to the guaranteed presence of a free slot.

The new condition hurts the `HashMap` benchmarks (#10337 (comment)).

This commit removes that extra condition on the loop. Instead, when probing, first check whether the "home" slot is the target key — if so, return it. Otherwise, save the home slot's metadata to the stack and temporarily "free" the slot (but don't touch its value). Then continue with the original loop. Once again, the loop will be implicitly broken by the new "free" slot. The original metadata is restored before the function returns.

`getOrPut` has one additional gotcha: if the home slot is a tombstone and `getOrPut` misses, then the home slot is written with the new key; that is, its original metadata (the tombstone) is not restored.

Other changes:

- Test hash map misses.
- Test using `getOrPutAssumeCapacity` to get keys at the end (along with `get`).
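For readers following along, here is a toy, self-contained version of the probing change the commit message describes (illustrative only; the Slot layout and function are hypothetical, not the actual std.hash_map internals):

```zig
const Slot = union(enum) {
    free,
    tombstone,
    used: u32, // the stored key (values omitted for brevity)
};

fn getIndexSketch(slots: []Slot, home: usize, key: u32) ?usize {
    // 1. Check the "home" slot directly.
    switch (slots[home]) {
        .used => |k| if (k == key) return home,
        .free => return null,
        .tombstone => {},
    }

    // 2. Temporarily mark the home slot free so the probe loop below is
    //    guaranteed to terminate once it wraps around; restore it on return.
    const saved = slots[home];
    slots[home] = .free;
    defer slots[home] = saved;

    // 3. Probe the remaining slots; the sentinel guarantees termination.
    var i = (home + 1) % slots.len;
    while (true) : (i = (i + 1) % slots.len) {
        switch (slots[i]) {
            .free => return null, // a real free slot, or our sentinel after wrapping
            .used => |k| if (k == key) return i,
            .tombstone => {},
        }
    }
}
```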