[collections] Increase HashMap Load Factor #21251
Conversation
I previously pulled some sketchy math out of my ass to figure out the hashmap load factor. It turns out the paper has a whole chapter on this, and I didn't use it. Using it, we can push this hashmap all the way up to a load factor of 98%. That's pretty cool.
Paper: https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf
Relevant discussion starts on page 31 of the pdf.
r? @gankro
r? [someone that's good at closely reading papers]
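For readers skimming the thread, here is a minimal sketch (not the std::collections implementation; names and numbers are illustrative) of where a maximum load factor enters an open-addressing map: it is the threshold that decides when the table must grow before the next insertion.

```rust
// Illustrative sketch only -- not the actual std implementation.
// The maximum load factor is the knob this PR changes: it gates when the
// table must grow before the next insertion.

struct Table<V> {
    slots: Vec<Option<(u64, V)>>, // (cached hash, value); None = empty slot
    len: usize,                   // number of occupied slots
}

impl<V> Table<V> {
    fn needs_resize(&self) -> bool {
        // ~0.909 approximates the current setting; the PR proposes raising it.
        const MAX_LOAD_FACTOR: f64 = 0.909;
        (self.len + 1) as f64 > self.slots.len() as f64 * MAX_LOAD_FACTOR
    }
}
```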
|
This seems like something we can just experimentally verify, no? Also I believe http://cglab.ca/~morin/publications/hashing/robinhood-siamjc.pdf has more modern results on this problem. |
|
Also of note: expected sequence length is sensitive to load factor and table size, as I understand it. (The table posted above demonstrates this). |
|
I was shooting for 4 in the average case. The longest probe sequence is proportional to log table size, so that's a little silly to use.
|
|
It's only proportional to log n if you have a load factor of 1, which is silly to do. It's log log n otherwise. |
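To put concrete numbers on the log n vs. log log n distinction above (base-2 logs, purely illustrative arithmetic):

```rust
// log2(n) vs log2(log2(n)) for a few table sizes, to show why an expected
// longest probe sequence of O(log log n) (load factor bounded away from 1)
// is so much more forgiving than O(log n).
fn main() {
    for &n in &[1_000u64, 1_000_000, 1_000_000_000] {
        let log_n = (n as f64).log2();
        println!(
            "n = {:>13}: log2(n) ~ {:>4.1}, log2(log2(n)) ~ {:.1}",
            n,
            log_n,
            log_n.log2()
        );
    }
}
```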
|
Well, it's within 5% of a load factor of 1, so I'm being a little lazy here...
|
|
@cgaebel Your 1 - α^k is inaccurate, but closer to the truth. There should be two formulas: one for insertions and another for lookups. You should read this: http://www.pvk.ca/Blog/2013/11/26/the-other-robin-hood-hashing/
@gankro, this part of the introduction is troubling:
|
|
@pczarn Ah crap, you're right. |
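As a rough baseline for the lookup-vs-insertion distinction above, the classical linear-probing estimates (Knuth's ½(1 + 1/(1−α)) for hits and ½(1 + 1/(1−α)²) for misses) can be evaluated directly. Robin Hood reordering is generally understood to change the variance of lookup cost rather than the mean, so treat this as a sanity check, not the exact formulas from the linked post.

```rust
// Classical linear-probing estimates (Knuth), used only as a rough baseline.
// Successful lookups and insertions/misses obey different formulas, which is
// the distinction being made above.

/// Expected probes for a successful lookup at load factor `a` (0 < a < 1).
fn expected_hit_probes(a: f64) -> f64 {
    0.5 * (1.0 + 1.0 / (1.0 - a))
}

/// Expected probes for an unsuccessful lookup / insertion at load factor `a`.
fn expected_miss_probes(a: f64) -> f64 {
    0.5 * (1.0 + 1.0 / ((1.0 - a) * (1.0 - a)))
}

fn main() {
    for &a in &[0.75, 0.909, 0.95, 0.98] {
        println!(
            "alpha = {:<5}: hit ~ {:>5.1} probes, miss ~ {:>7.1} probes",
            a,
            expected_hit_probes(a),
            expected_miss_probes(a)
        );
    }
}
```

By these estimates, at α = 0.98 a miss costs over a thousand probes while a hit stays around 25, which is why the two cases have to be argued separately.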
|
I've gathered some actual data at https://github.com/cgaebel/bench-psl. See data.csv if you just want something to throw into matlab/R/mathematica/excel. Otherwise, it contains the rust code that generated the data, along with a patch to rustc to get the necessary instrumentation. It looks like 0.9 was a pretty good choice, even though the math to back it up was pretty shoddy. If anyone else is interested in finding some interesting conclusions from the data, please take a look! Direct link to the data: https://github.com/cgaebel/bench-psl/blob/master/data.csv?raw=true |
|
I'll try running another experiment where I measure psl avg+max as a hashmap is grown from minimum size to a few tens of megabytes, given a variety of load factors. I think that might provide more useful, directly actionable information. |
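For context on what those avg/max numbers measure: the probe sequence length of an occupied slot is just the wrapped distance from the key's home bucket. A hypothetical helper like the following (not the actual bench-psl instrumentation) captures the statistic in question.

```rust
// Hypothetical sketch of the statistic being collected: for each occupied
// slot we know its index and its key's home (ideal) bucket, and the probe
// sequence length is the wrapped distance between them. This is not the
// actual cgaebel/bench-psl instrumentation, just the idea behind it.

fn displacement(slot: usize, home: usize, capacity: usize) -> usize {
    (slot + capacity - home) % capacity
}

/// `occupied` holds (slot index, home bucket) pairs for every filled slot.
fn avg_and_max_psl(occupied: &[(usize, usize)], capacity: usize) -> (f64, usize) {
    if occupied.is_empty() {
        return (0.0, 0);
    }
    let mut sum = 0usize;
    let mut max = 0usize;
    for &(slot, home) in occupied {
        let d = displacement(slot, home, capacity);
        sum += d;
        max = max.max(d);
    }
    (sum as f64 / occupied.len() as f64, max)
}
```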
|
Thinking about this a bit more:

- Growing a HashMap from 50% load to 100% load means the average load factor is 75%. At that load factor, the average probe sequence length will be somewhere in the neighborhood of 0-1. That means that even if we set the maximum load factor to 100%, the average lookup will still be extremely fast.
- If we do set the maximum load factor to 100%, the worst case is a log(n) lookup cost. That... still sounds pretty good, and the memory accesses are sequential. This "average worst case bound" matches the worst-case bound of a hashtable chained with balanced binary search trees.

Because of these two points, I'm now inclined to consider upping the load factor to 100%. |
|
(At a 100% load factor.) It looks like the average probe sequence length grows like O(log n), where n is the number of elements we grow the table to. Here's a fun way to estimate the average probe sequence length from how many elements you expect the table to grow to: count the digits in that element count and subtract 2. Growing to 10000 will have an average probe sequence length of about 3. |
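Restating that rule of thumb as code (an empirical fit from the description above, not a derived bound):

```rust
// "Count the digits in the element count, subtract 2": an empirical rule of
// thumb for the average probe sequence length at a 100% maximum load factor.
fn approx_avg_psl(n_elements: u64) -> usize {
    let digits = n_elements.to_string().len();
    digits.saturating_sub(2)
}

fn main() {
    for &n in &[1_000u64, 10_000, 1_000_000] {
        println!("grow to {:>9} elements: avg psl ~ {}", n, approx_avg_psl(n));
    }
}
```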
|
I think you're combining different ideas in inappropriate ways. Average/maximum probe length is important for reads, because that is how long a read takes. However a write in a basically full table may require basically every element to be touched. e.g. you have one empty space left in your table, you hash to the space after it, you now need to shift every element in the table over one. That's an O(n) operation, and completely unacceptable to occur. And it is expected to occur on the last few insertions. With |
|
With a constant number of spaces left, a series of |
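To make the shifting behaviour described above concrete, here is a simplified Robin Hood insertion (no growth, no deletion; an illustrative sketch, not the std implementation). With a single empty slot left and an unlucky hash, the displaced chain ripples across nearly the whole table before it terminates.

```rust
// Simplified Robin Hood insertion into a fixed-size table, to show how the
// last insertions into a nearly full table can touch O(n) slots.

struct Entry {
    home: usize, // ideal bucket for this key's hash
    key: u64,
}

fn probe_len(slot: usize, home: usize, cap: usize) -> usize {
    (slot + cap - home) % cap
}

/// Inserts `key`, whose hash maps to bucket `home`; returns how many slots were touched.
fn robin_hood_insert(table: &mut Vec<Option<Entry>>, home: usize, key: u64) -> usize {
    let cap = table.len();
    let mut slot = home;
    let mut cur = Entry { home, key };
    let mut touched = 0;
    loop {
        touched += 1;
        match table[slot].take() {
            None => {
                table[slot] = Some(cur);
                return touched;
            }
            Some(existing) => {
                // Robin Hood rule: whichever entry is further from its home
                // keeps the slot; the other one continues probing forward.
                if probe_len(slot, existing.home, cap) < probe_len(slot, cur.home, cap) {
                    table[slot] = Some(cur);
                    cur = existing;
                } else {
                    table[slot] = Some(existing);
                }
                slot = (slot + 1) % cap;
            }
        }
    }
}

fn main() {
    // 8-slot table with one empty slot; every resident sits in its home bucket.
    let cap = 8;
    let mut table: Vec<Option<Entry>> = (0..cap)
        .map(|i| {
            if i == cap - 1 {
                None
            } else {
                Some(Entry { home: i, key: i as u64 })
            }
        })
        .collect();
    // One more insertion aimed at bucket 0 ripples across the whole table.
    let touched = robin_hood_insert(&mut table, 0, 999);
    println!("slots touched: {}", touched); // prints 8
}
```

The demo constructs exactly the situation described: seven of eight slots occupied by entries sitting at home, and the final insertion touches all eight slots before finding the one empty space.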
|
We can change the load factor to 0.95. For α=0.95:
For the current load factor, α=0.909, Pr{displacement < 8} = 0.77 |
|
To make lookups read one cache line on average, we'd want displacement <= 4. Also, what formula are you using for the probability? Is this for lookups, insertions, or removals? |
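One possible back-of-the-envelope reading of the "<= 4" figure, assuming 16-byte probe entries and 64-byte cache lines (the entry size is an assumption for illustration, not the actual std layout):

```rust
// Assumed layout for illustration: 16 bytes of probed data per entry and a
// 64-byte cache line gives four adjacent probes per line.
use std::mem::size_of;

#[allow(dead_code)]
#[repr(C)]
struct IllustrativeEntry {
    hash: u64,
    value: u64,
}

fn main() {
    const CACHE_LINE_BYTES: usize = 64;
    println!(
        "entries per cache line: {}",
        CACHE_LINE_BYTES / size_of::<IllustrativeEntry>()
    ); // prints 4
}
```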
|
Ah. For lookups, I'm using the following formula, taken from Alfredo Viola (2005) and simplified for b=1: |
|
Closing due to inactivity. I don't think we have sufficient data to make this change. I also don't think we should be optimizing for space over speed for this collection in general (space is cheap, time is expensive). |
|
I agree with closing this due to inactivity, but in general, optimizing data structures for space is the first step in optimizing for speed in the land of caches. |
|
I don't think that's a super applicable concern for a HashMap. I suppose for a small hashmap there are some gains to be had, but cache concerns don't scale like they would in e.g. a BTreeMap with a good access pattern. |
|
Sorry I should clarify: for a HashMap's load factor. Saving space in other ways (e.g. moving to u32 hashes) is totally legitimate for cache optimization. |