New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
std.hash_map: adding a rehash() method #17890
Conversation
3c24425
to
212b8cc
Compare
I noticed that i had an assignment to |
I thought the algorithm was cleaner without comments, but if you think its worth adding a note about how I'm using the fingerprint to mark slots that we've already hashed ahead of the cursor, let me know. Now that I'm thinking about it, I need to make sure it's not possible for that to mis-identify a slot, let me review actually and perhaps add one thing. |
As a follow up to #17851, the blocks now take (when you call
Of which 20% of the time is in that |
8c3efaf
to
93072d5
Compare
I rebased it on latest master, maybe the unrelated build failure goes away? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to enable a strategy choice at init, to either have an automatic rehash at a certain point, or keep it manual.
(as you noted, this is why tombstones initially counted as part of the load factor, to avoid pathological performance degradation)
Also you might consider proposing your microbenchmark to https://github.com/ziglang/gotta-go-fast/ if this project is still relevant.
Hi @Sahnvour I made those changes, thank you. |
lib/std/hash_map.zig
Outdated
@@ -1505,6 +1511,85 @@ pub fn HashMapUnmanaged( | |||
return result; | |||
} | |||
|
|||
/// Rehash the map, in-place |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does it do?
when should you call this?
what happens to existing key/value pointers?
all this information should be in the doc comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @andrewk, can you help me understand what kind of answer you expect here?
what does it do?
This is only required because the HashMap algorithm currently uses tombstones for deletion slots and has no way of cleaning these up, or re-using them effectively, and over time gets super slow due to excessive probing.
when should you call this?
This can be called whenever you believe your HashMap is suffering from performance degredation due to excessive tombstone buildup.
what happens to the existing key/value pointers
I believe that they are maintained correctly at their new locations in the hashmap.
all this information should be in the doc comments
When you say doc comments, do you mean ///
comments before the pub fn rehash
line? Or somewhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would code know when there is "excessive" tombstone buildup (not by performance monitoring)? Can you query for the stats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can easily determine that, in my original issue on #17851 it took a somewhat large number of iterations to expose the performance issue.
This is a band-aid and the best fix would be a different HashMap implementation, but without this patch there are scenarios where the HashMap has terrible performance by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you say doc comments, do you mean
///
comments before thepub fn rehash
line? Or somewhere else?
Yes, precisely - triple slash comments in front of the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great.
What do you think of a1b78a2
0b24698
to
0965fe5
Compare
This allows a highly fragmented hash_map to have tombstones removed as the values are all rehashed. It would be nice to make this rehash() automatically, but that currently presents a challenge where it doesn't work with adapted contexts since the keys are not preserved in the map for re-hashing and the hash value is not stored currently, and the non-adapted contexts require a bit of additional book-keeping to check before calling rehash().
lib/std/hash_map.zig
Outdated
@@ -679,6 +679,11 @@ pub fn HashMap( | |||
self.unmanaged = .{}; | |||
return result; | |||
} | |||
|
|||
/// Rehash the map, in-place |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copy-paste the docs here too please, it will help IDE users.
lib/std/hash_map.zig
Outdated
/// All existing key/value pointers in the HashMap are maintained at | ||
/// their new rehashed location. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm still not quite clear on this, does it invalidate existing pointers or not?
All existing key/value pointers in the HashMap are maintained
Sounds like yes
new rehashed location
Sounds like no
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I maybe don't know enough about Zig, we do this:
var keys_ptr = self.keys();
var values_ptr = self.values();
and later, to store the pointer to its new rehashed location:
assert(metadata[idx].isFree());
metadata[idx].fill(fingerprint);
keys_ptr[idx] = keys_ptr[curr];
values_ptr[idx] = values_ptr[curr];
the pointers are moved in the array but their value is preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also using std.mem.swap
if (metadata[idx].isUsed()) {
std.mem.swap(K, &keys_ptr[curr], &keys_ptr[idx]);
std.mem.swap(V, &values_ptr[curr], &values_ptr[idx]);
} else {
metadata[idx].used = 1;
keys_ptr[idx] = keys_ptr[curr];
values_ptr[idx] = values_ptr[curr];
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll ask the question with code instead:
const a_key_ptr = &map.keys()[i];
const a_value_ptr = &map.values()[i];
map.rehash();
foo(a_key_ptr.*);
bar(a_value_ptr.*);
does this invoke illegal behavior?
the answer to this question is not communicated clearly in the doc comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give you a hint: it definitely invalidates existing pointers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if you have a_key_ptr
as a pointer into the keys array, the value that points to will (could) change after rehashing.
And that needs to be reflected in the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if you have
a_key_ptr
as a pointer into the keys array, the value that points to will (could) change after rehashing.
Incorrect. The value pointed to will remain the same. The pointer is invalidated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is my rehash()
approach flawed, or do I just need to document that this occurs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's expected that a rehash will invalidate live pointers, so documenting it should be enough, as other functions modifying the hashmap do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, I didn't review this in time, and now it has bitrotted. Furthermore, so many pull requests have stacked up that I can't keep up and I am therefore declaring Pull Request Bankruptcy and closing old PRs that now have conflicts with master branch. If you want to reroll, you are by all means welcome to revisit this changeset with respect to the current state of master branch, and there's a decent chance your patch will be reviewed the second time around. Either way, I'm closing this now, otherwise the PR queue will continue to grow indefinitely. |
This allows a highly fragmented HashMap to have tombstones removed as the values are all rehashed.
It would be nice to make this
rehash()
automatically, but that currently presents a challenge where it doesn't work with adapted contexts since the keys are not preserved in the map for re-hashing and the hash value is not stored currently, and the non-adapted contexts require a bit of additional book-keeping to check before calling rehash().This is a partial fix for #17851, but requires the user to call
rehash()
periodically to get the benefit.