std.hash_map: adding a rehash() method #17890

Closed
wants to merge 4 commits into from
Contributor

@mrjbq7 mrjbq7 commented Nov 6, 2023

This allows a highly fragmented HashMap to have tombstones removed as the values are all rehashed.

It would be nice to call rehash() automatically, but that is currently a challenge. It doesn't work with adapted contexts, because the keys are not preserved in the map for rehashing and the hash value is not currently stored. Non-adapted contexts would need a bit of additional bookkeeping to check before calling rehash().

This is a partial fix for #17851, but requires the user to call rehash() periodically to get the benefit.
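
To make the intended usage concrete, here is a minimal sketch of a churn workload that calls rehash() periodically. It assumes a Zig standard library that ships the `rehash()` method proposed in this PR on the managed `HashMap`; the helper name `churn`, the map size, and the rehash interval are illustrative assumptions, not taken from the benchmark in #17851.

```zig
const std = @import("std");

/// Hypothetical helper: each round removes the oldest key (which can leave a
/// tombstone) and inserts a fresh one, rehashing every `rehash_every` rounds.
/// Returns the number of live entries at the end.
fn churn(allocator: std.mem.Allocator, rounds: usize, rehash_every: usize) !u32 {
    var map = std.AutoHashMap(u64, u64).init(allocator);
    defer map.deinit();

    // Start with 1000 live entries.
    var next: u64 = 0;
    while (next < 1000) : (next += 1) try map.put(next, next);

    var oldest: u64 = 0;
    var round: usize = 0;
    while (round < rounds) : (round += 1) {
        _ = map.remove(oldest);
        oldest += 1;
        try map.put(next, next);
        next += 1;
        // Periodically rebuild the table in place to drop tombstones.
        if ((round + 1) % rehash_every == 0) map.rehash();
    }
    return map.count();
}

pub fn main() !void {
    const live = try churn(std.heap.page_allocator, 100_000, 10_000);
    std.debug.print("live entries after churn: {d}\n", .{live});
}
```

The interval is a trade-off: rehashing more often costs more time per block, rehashing less often lets probe sequences grow longer between calls.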

@mrjbq7 mrjbq7 force-pushed the rehash branch 2 times, most recently from 3c24425 to 212b8cc Compare November 6, 2023 15:36
Contributor Author

mrjbq7 commented Nov 6, 2023

I noticed that I had an unnecessary assignment to self.available; this PR doesn't change the meaning of that property, so I have just removed it. I think it's clean now and ready for review.

Contributor Author

mrjbq7 commented Nov 6, 2023

I thought the algorithm was cleaner without comments, but if you think it's worth adding a note about how I'm using the fingerprint to mark slots that we've already hashed ahead of the cursor, let me know.

Now that I'm thinking about it, I need to make sure it's not possible for that to misidentify a slot. Let me review it, and perhaps add one thing.

Contributor Author

mrjbq7 commented Nov 6, 2023

As a follow up to #17851, the blocks now take (when you call rehash() in each block):

➜  build git:(rehash) ✗ ./stage3/bin/zig run -O ReleaseFast ../maptest.zig --zig-lib-dir ~/Projects/zig/lib 2>&1
inserting 2000000 took 116 ms
2000000 block took 0 ms
3000000 block took 80 ms
4000000 block took 82 ms
5000000 block took 87 ms
6000000 block took 91 ms
7000000 block took 93 ms
8000000 block took 95 ms
9000000 block took 95 ms
10000000 block took 95 ms
11000000 block took 96 ms
12000000 block took 95 ms
13000000 block took 95 ms
14000000 block took 95 ms
15000000 block took 95 ms
16000000 block took 95 ms
17000000 block took 95 ms
18000000 block took 95 ms
19000000 block took 97 ms
20000000 block took 96 ms
21000000 block took 95 ms
22000000 block took 95 ms
23000000 block took 94 ms
24000000 block took 95 ms
25000000 block took 96 ms
26000000 block took 94 ms
27000000 block took 95 ms
28000000 block took 95 ms
29000000 block took 95 ms
30000000 block took 96 ms
31000000 block took 94 ms
32000000 block took 94 ms
33000000 block took 96 ms
34000000 block took 95 ms
35000000 block took 95 ms
36000000 block took 95 ms
37000000 block took 95 ms
38000000 block took 95 ms
39000000 block took 95 ms
40000000 block took 96 ms
41000000 block took 94 ms
42000000 block took 95 ms
43000000 block took 95 ms
44000000 block took 96 ms
45000000 block took 95 ms
46000000 block took 95 ms
47000000 block took 95 ms
48000000 block took 95 ms
49000000 block took 95 ms
50000000 block took 94 ms
51000000 block took 95 ms
52000000 block took 95 ms
53000000 block took 95 ms
54000000 block took 95 ms
55000000 block took 95 ms
56000000 block took 95 ms
57000000 block took 95 ms
58000000 block took 95 ms
59000000 block took 95 ms
60000000 block took 95 ms
61000000 block took 95 ms
62000000 block took 95 ms
63000000 block took 95 ms
64000000 block took 95 ms
65000000 block took 95 ms
66000000 block took 95 ms
67000000 block took 95 ms
68000000 block took 95 ms
69000000 block took 95 ms
70000000 block took 95 ms
71000000 block took 94 ms
72000000 block took 95 ms
73000000 block took 96 ms
74000000 block took 95 ms
75000000 block took 95 ms
76000000 block took 95 ms
77000000 block took 95 ms
78000000 block took 95 ms
79000000 block took 95 ms
80000000 block took 95 ms
81000000 block took 94 ms
82000000 block took 95 ms
83000000 block took 95 ms
84000000 block took 95 ms
85000000 block took 95 ms
86000000 block took 95 ms
87000000 block took 95 ms
88000000 block took 95 ms
89000000 block took 95 ms
90000000 block took 95 ms
91000000 block took 95 ms
92000000 block took 95 ms
93000000 block took 95 ms
94000000 block took 95 ms
95000000 block took 95 ms
96000000 block took 95 ms
97000000 block took 95 ms
98000000 block took 97 ms
99000000 block took 95 ms
100000000 block took 95 ms
101000000 block took 95 ms
102000000 block took 95 ms
103000000 block took 94 ms
104000000 block took 96 ms
105000000 block took 95 ms
106000000 block took 95 ms
107000000 block took 95 ms
108000000 block took 95 ms
109000000 block took 95 ms
110000000 block took 95 ms
111000000 block took 94 ms
112000000 block took 96 ms
113000000 block took 95 ms
114000000 block took 95 ms
115000000 block took 95 ms
116000000 block took 95 ms
117000000 block took 95 ms
118000000 block took 95 ms
119000000 block took 94 ms
120000000 block took 96 ms
121000000 block took 95 ms
122000000 block took 95 ms
123000000 block took 95 ms
124000000 block took 95 ms
125000000 block took 95 ms
126000000 block took 95 ms
127000000 block took 95 ms
128000000 block took 95 ms
129000000 block took 98 ms
130000000 block took 94 ms
131000000 block took 96 ms
132000000 block took 95 ms
133000000 block took 95 ms
134000000 block took 95 ms
135000000 block took 95 ms
136000000 block took 95 ms
137000000 block took 95 ms
138000000 block took 95 ms
139000000 block took 95 ms
140000000 block took 95 ms
141000000 block took 95 ms
142000000 block took 96 ms
143000000 block took 94 ms
144000000 block took 96 ms
145000000 block took 95 ms
146000000 block took 95 ms
147000000 block took 94 ms
148000000 block took 95 ms
149000000 block took 95 ms
150000000 block took 95 ms
151000000 block took 96 ms
152000000 block took 94 ms
153000000 block took 96 ms
154000000 block took 95 ms
155000000 block took 94 ms
156000000 block took 95 ms
157000000 block took 96 ms
158000000 block took 96 ms
159000000 block took 95 ms
160000000 block took 95 ms
161000000 block took 95 ms
162000000 block took 95 ms
163000000 block took 95 ms
164000000 block took 95 ms
165000000 block took 95 ms
166000000 block took 95 ms
167000000 block took 94 ms
168000000 block took 95 ms
169000000 block took 95 ms
170000000 block took 95 ms
171000000 block took 95 ms
172000000 block took 95 ms
173000000 block took 96 ms
174000000 block took 95 ms
175000000 block took 95 ms
176000000 block took 95 ms
177000000 block took 97 ms
178000000 block took 95 ms
179000000 block took 95 ms
180000000 block took 95 ms
181000000 block took 95 ms
182000000 block took 95 ms
183000000 block took 95 ms
184000000 block took 95 ms
185000000 block took 96 ms
186000000 block took 94 ms
187000000 block took 95 ms
188000000 block took 95 ms
189000000 block took 96 ms
190000000 block took 95 ms
191000000 block took 95 ms
192000000 block took 95 ms
193000000 block took 95 ms
194000000 block took 95 ms
195000000 block took 95 ms
196000000 block took 95 ms
197000000 block took 96 ms
198000000 block took 94 ms
199000000 block took 95 ms
200000000 block took 95 ms
201000000 block took 95 ms
202000000 block took 96 ms
203000000 block took 95 ms
204000000 block took 94 ms
205000000 block took 95 ms
206000000 block took 95 ms
207000000 block took 96 ms
208000000 block took 94 ms
209000000 block took 95 ms
210000000 block took 95 ms
211000000 block took 95 ms
212000000 block took 95 ms
213000000 block took 95 ms
214000000 block took 94 ms
215000000 block took 95 ms
216000000 block took 95 ms
217000000 block took 95 ms
218000000 block took 96 ms
219000000 block took 94 ms
220000000 block took 95 ms
221000000 block took 95 ms
222000000 block took 96 ms
223000000 block took 95 ms
224000000 block took 95 ms
225000000 block took 95 ms
226000000 block took 95 ms
227000000 block took 95 ms
228000000 block took 96 ms
229000000 block took 94 ms
230000000 block took 95 ms
231000000 block took 94 ms
232000000 block took 96 ms
233000000 block took 94 ms
234000000 block took 95 ms
235000000 block took 95 ms
236000000 block took 95 ms
237000000 block took 95 ms
238000000 block took 94 ms
239000000 block took 95 ms
240000000 block took 96 ms
241000000 block took 94 ms
242000000 block took 96 ms
243000000 block took 95 ms
244000000 block took 94 ms
245000000 block took 95 ms
246000000 block took 95 ms
247000000 block took 95 ms
248000000 block took 95 ms
249000000 block took 96 ms

About 20% of that time is spent in the rehash() call, which is a lot, but it is much, much better than the performance degradation we had before, and still 40% faster than the ArrayHashMap.

Contributor Author

mrjbq7 commented Nov 10, 2023

I rebased it on latest master, maybe the unrelated build failure goes away?

Contributor

Sahnvour left a comment

It would be interesting to enable a strategy choice at init, to either have an automatic rehash at a certain point, or keep it manual.
(as you noted, this is why tombstones initially counted as part of the load factor, to avoid pathological performance degradation)

Also you might consider proposing your microbenchmark to https://github.com/ziglang/gotta-go-fast/ if this project is still relevant.

Contributor Author

mrjbq7 commented Nov 13, 2023

Hi @Sahnvour I made those changes, thank you.

mrjbq7 changed the title from "hash_map: adding a rehash() method" to "std.hash_map: adding a rehash() method" on Feb 3, 2024
@@ -1505,6 +1511,85 @@ pub fn HashMapUnmanaged(
return result;
}

/// Rehash the map, in-place
Member

what does it do?

when should you call this?

what happens to existing key/value pointers?

all this information should be in the doc comments

Contributor Author

Hi @andrewrk, can you help me understand what kind of answer you expect here?

what does it do?

This is only required because the HashMap algorithm currently uses tombstones for deleted slots and has no way of cleaning them up or reusing them effectively, so over time it gets very slow due to excessive probing.

when should you call this?

This can be called whenever you believe your HashMap is suffering from performance degradation due to excessive tombstone buildup.

what happens to the existing key/value pointers

I believe that they are maintained correctly at their new locations in the hashmap.

all this information should be in the doc comments

When you say doc comments, do you mean /// comments before the pub fn rehash line? Or somewhere else?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would code know when there is "excessive" tombstone buildup (not by performance monitoring)? Can you query for the stats?

Contributor Author

mrjbq7 commented Feb 7, 2024

I don't think you can easily determine that. In my original issue, #17851, it took a fairly large number of iterations to expose the performance issue.

This is a band-aid and the best fix would be a different HashMap implementation, but without this patch there are scenarios where the HashMap has terrible performance by default.
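
Since the map does not expose tombstone statistics, one rough option is to do the bookkeeping on the caller's side. The sketch below is an assumption-heavy illustration, not something in this patch: `ChurnMap`, its `removed_since_rehash` counter, and the "half of capacity" threshold are all hypothetical, and it assumes a Zig standard library that includes `rehash()` on the managed `HashMap`. The counter is only an upper bound on tombstones, because later inserts or a grow may already have reclaimed some of them.

```zig
const std = @import("std");

/// Hypothetical wrapper (not part of std): counts successful removals and
/// rehashes once they reach half of the current capacity.
fn ChurnMap(comptime K: type, comptime V: type) type {
    return struct {
        const Self = @This();

        map: std.AutoHashMap(K, V),
        removed_since_rehash: usize = 0,

        pub fn init(allocator: std.mem.Allocator) Self {
            return .{ .map = std.AutoHashMap(K, V).init(allocator) };
        }

        pub fn deinit(self: *Self) void {
            self.map.deinit();
        }

        pub fn put(self: *Self, key: K, value: V) !void {
            try self.map.put(key, value);
        }

        pub fn remove(self: *Self, key: K) bool {
            if (!self.map.remove(key)) return false;
            self.removed_since_rehash += 1;
            // Arbitrary heuristic threshold; tune it for the workload.
            const threshold: usize = self.map.capacity() / 2;
            if (self.removed_since_rehash >= threshold) {
                self.map.rehash(); // invalidates pointers into the map
                self.removed_since_rehash = 0;
            }
            return true;
        }
    };
}

pub fn main() !void {
    var m = ChurnMap(u32, u32).init(std.heap.page_allocator);
    defer m.deinit();

    // Sliding window: keep roughly the last 100 keys alive.
    var i: u32 = 0;
    while (i < 1000) : (i += 1) {
        try m.put(i, i);
        if (i >= 100) _ = m.remove(i - 100);
    }
    std.debug.print("live entries: {d}\n", .{m.map.count()});
}
```

This only covers the default (non-adapted) context, which matches the limitation described in the PR description.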

Member

When you say doc comments, do you mean /// comments before the pub fn rehash line? Or somewhere else?

Yes, precisely - triple slash comments in front of the function.

Contributor Author

mrjbq7 commented Feb 7, 2024

That's great.

What do you think of a1b78a2?

@mrjbq7 mrjbq7 force-pushed the rehash branch 3 times, most recently from 0b24698 to 0965fe5 Compare February 7, 2024 21:57
This allows a highly fragmented hash_map to have tombstones removed as
the values are all rehashed.

It would be nice to make this rehash() automatically, but that currently
presents a challenge where it doesn't work with adapted contexts since
the keys are not preserved in the map for re-hashing and the hash value
is not stored currently, and the non-adapted contexts require a bit of
additional book-keeping to check before calling rehash().
@@ -679,6 +679,11 @@ pub fn HashMap(
self.unmanaged = .{};
return result;
}

/// Rehash the map, in-place
Member

copy-paste the docs here too please, it will help IDE users.

Comment on lines 1520 to 1521
/// All existing key/value pointers in the HashMap are maintained at
/// their new rehashed location.
Member

Sorry, I'm still not quite clear on this, does it invalidate existing pointers or not?

All existing key/value pointers in the HashMap are maintained

Sounds like yes

new rehashed location

Sounds like no

Contributor Author

mrjbq7 commented Feb 9, 2024

Maybe I don't know enough about Zig. We do this:

            var keys_ptr = self.keys();
            var values_ptr = self.values();

and later, to store the pointer to its new rehashed location:

                    assert(metadata[idx].isFree());
                    metadata[idx].fill(fingerprint);
                    keys_ptr[idx] = keys_ptr[curr];
                    values_ptr[idx] = values_ptr[curr];

The pointers are moved within the array, but their values are preserved.

Contributor Author

mrjbq7 commented Feb 9, 2024

And also using std.mem.swap

                    if (metadata[idx].isUsed()) {
                        std.mem.swap(K, &keys_ptr[curr], &keys_ptr[idx]);
                        std.mem.swap(V, &values_ptr[curr], &values_ptr[idx]);
                    } else {
                        metadata[idx].used = 1;
                        keys_ptr[idx] = keys_ptr[curr];
                        values_ptr[idx] = values_ptr[curr];
                    }

Member

I'll ask the question with code instead:

const a_key_ptr = &map.keys()[i];
const a_value_ptr = &map.values()[i];
map.rehash();
foo(a_key_ptr.*);
bar(a_value_ptr.*);

does this invoke illegal behavior?

the answer to this question is not communicated clearly in the doc comments.

Member

I'll give you a hint: it definitely invalidates existing pointers

Contributor Author

mrjbq7 commented Feb 9, 2024

Yes, if you have a_key_ptr as a pointer into the keys array, the value that points to will (could) change after rehashing.

And that needs to be reflected in the documentation.

Member

Yes, if you have a_key_ptr as a pointer into the keys array, the value that points to will (could) change after rehashing.

Incorrect. The value pointed to will remain the same. The pointer is invalidated.

Contributor Author

Is my rehash() approach flawed, or do I just need to document that this occurs?

Contributor

It's expected that a rehash will invalidate live pointers, so documenting it should be enough, as other functions modifying the hashmap do.
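
For readers following the thread, the practical consequence can be sketched as follows: treat any pointer obtained before rehash() as dead and look the entry up again afterwards. This is a minimal illustration assuming a Zig standard library that includes `rehash()` on the managed `HashMap`; the function name `valueAfterRehash` is hypothetical, and the public `getPtr` is used here instead of indexing the internal key/value arrays.

```zig
const std = @import("std");

/// Hypothetical example: shows the safe pattern around rehash().
fn valueAfterRehash(allocator: std.mem.Allocator) !u32 {
    var map = std.AutoHashMap(u32, u32).init(allocator);
    defer map.deinit();

    try map.put(1, 10);
    try map.put(2, 20);
    _ = map.remove(1); // can leave a tombstone behind

    // A pointer into the map's storage, valid only until the map is modified.
    const before = map.getPtr(2).?;
    before.* += 1;

    map.rehash();
    // `before` must not be dereferenced from here on: the entry may have
    // moved to a different slot. Look it up again instead.
    const after = map.getPtr(2).?;
    return after.*;
}

pub fn main() !void {
    const v = try valueAfterRehash(std.heap.page_allocator);
    std.debug.print("value for key 2 after rehash: {d}\n", .{v});
}
```

The same rule already applies to other mutating calls such as put, so the doc comment only needs to state that rehash() belongs to that group.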

Contributor Author

Hi @Sahnvour and @andrewrk, thank you. I have updated the docs.

I apologize; my understanding of Zig wasn't deep enough yet to recognize the issue of invalidating live pointers into the map.

Member

andrewrk commented May 9, 2024

I'm sorry, I didn't review this in time, and now it has bitrotted. Furthermore, so many pull requests have stacked up that I can't keep up and I am therefore declaring Pull Request Bankruptcy and closing old PRs that now have conflicts with master branch.

If you want to reroll, you are by all means welcome to revisit this changeset with respect to the current state of master branch, and there's a decent chance your patch will be reviewed the second time around.

Either way, I'm closing this now, otherwise the PR queue will continue to grow indefinitely.

Contributor Author

mrjbq7 commented May 10, 2024

Hi @andrewrk, see #19923 for a rebased PR.
