Conversation

@cgaebel commented Jan 16, 2015

I previously pulled some sketchy math out of my ass to figure out the hashmap load factor. It turns out the paper has a chapter on this that I didn't use. Using it, we can push this hashmap all the way up to a load factor of 98%. That's pretty cool.

Paper: https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf
Relevant discussion starts on page 31 of the pdf.

r? @gankro
r? [someone that's good at closely reading papers]

@Gankra commented Jan 17, 2015

This seems like something we can just experimentally verify, no?

Also I believe http://cglab.ca/~morin/publications/hashing/robinhood-siamjc.pdf has more modern results on this problem.

@Gankra commented Jan 17, 2015

Although "shoot for 4" is a bit ambiguous. Average case? Worst case? The paper I linked seems to focus on the worst case. In particular, page 4 has this table:

[screenshot: the table from page 4 of the linked paper]

@Gankra commented Jan 17, 2015

Also of note: expected sequence length is sensitive to load factor and table size, as I understand it. (The table posted above demonstrates this).

@cgaebel commented Jan 17, 2015

I was shooting for 4 in the average case. The longest probe sequence is proportional to the log of the table size, so that's a little silly to use.


@Gankra commented Jan 17, 2015

It's only proportional to log n if you have a load factor of 1, which is silly to do. It's log log n otherwise.

@cgaebel commented Jan 17, 2015

Well, it's within 5% of a load factor of 1, so I'm being a little lazy here...


@pczarn commented Jan 18, 2015

@cgaebel Your 1 − α^k is inaccurate, but it's closer to the truth.

There should be two formulas: one for insertions and another for lookups.

You should read this: http://www.pvk.ca/Blog/2013/11/26/the-other-robin-hood-hashing/
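
For reference, the textbook linear-probing analysis (Knuth) makes exactly that split. These are the plain linear-probing forms, offered only as a hedged point of comparison; they are not necessarily the exact Robin Hood formulas in question:

```latex
% Expected probes at load factor \alpha under plain linear probing (Knuth):
% successful search, i.e. a lookup of a key that is present:
C^{\mathrm{hit}}_{\alpha} \approx \tfrac{1}{2}\Bigl(1 + \frac{1}{1-\alpha}\Bigr)
% unsuccessful search, i.e. the probe path an insertion follows:
C^{\mathrm{miss}}_{\alpha} \approx \tfrac{1}{2}\Bigl(1 + \frac{1}{(1-\alpha)^{2}}\Bigr)
```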

@gankro, this part of the introduction is troubling:

Each element has associated with it an infinite probe sequence consisting of i.i.d. integers uniformly distributed over {1, …, n}, representing the consecutive places of probes for that element. It is assumed that when searching for an element, its infinite probe sequence is available to the searcher. The probe sequence for the i-th element is denoted by X_{i,0}, X_{i,1}, X_{i,2}, …

@Gankra commented Jan 19, 2015

@pczarn Ah crap, you're right.

@cgaebel commented Jan 19, 2015

I've gathered some actual data at https://github.com/cgaebel/bench-psl. See data.csv if you just want something to throw into MATLAB/R/Mathematica/Excel. Otherwise, the repo contains the Rust code that generated the data, along with a patch to rustc to get the necessary instrumentation.

It looks like 0.9 was a pretty good choice, even though the math to back it up was pretty shoddy. If anyone else is interested in drawing interesting conclusions from the data, please take a look! Direct link to the data: https://github.com/cgaebel/bench-psl/blob/master/data.csv?raw=true
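
In case anyone wants to poke at this without patching rustc, here is a minimal, self-contained sketch of the same kind of measurement. The table size, the SplitMix64-style mixer, and the sequential keys are all assumptions of mine, not bench-psl's setup:

```rust
// A toy Robin Hood (linear probing) table: fill it to a target load
// factor, then report the average and maximum probe sequence length.
// Hypothetical parameters throughout -- this is not the bench-psl harness.

const SLOTS: usize = 1 << 16;

// SplitMix64 finalizer standing in for a real hasher.
fn hash(key: u64) -> u64 {
    let mut x = key.wrapping_add(0x9e3779b97f4a7c15);
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

// Distance of `slot` from `home`, wrapping around the table.
fn displacement(home: usize, slot: usize) -> usize {
    (slot + SLOTS - home) % SLOTS
}

fn insert(table: &mut [Option<u64>], mut key: u64) {
    let mut slot = hash(key) as usize % SLOTS;
    let mut dist = 0;
    loop {
        match table[slot] {
            None => {
                table[slot] = Some(key);
                return;
            }
            Some(occupant) => {
                let occ_dist = displacement(hash(occupant) as usize % SLOTS, slot);
                // Robin Hood rule: evict the occupant if it is closer to
                // home than the key we are carrying.
                if occ_dist < dist {
                    table[slot] = Some(key);
                    key = occupant;
                    dist = occ_dist;
                }
            }
        }
        slot = (slot + 1) % SLOTS;
        dist += 1;
    }
}

fn main() {
    for &alpha in &[0.50, 0.75, 0.90, 0.95, 0.98] {
        let mut table: Vec<Option<u64>> = vec![None; SLOTS];
        let n = (alpha * SLOTS as f64) as usize;
        for k in 0..n as u64 {
            insert(&mut table, k);
        }
        let (mut total, mut max) = (0usize, 0usize);
        for (slot, &entry) in table.iter().enumerate() {
            if let Some(key) = entry {
                let d = displacement(hash(key) as usize % SLOTS, slot);
                total += d;
                max = max.max(d);
            }
        }
        println!("alpha = {:.2}: avg psl = {:.3}, max psl = {}",
                 alpha, total as f64 / n as f64, max);
    }
}
```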

@cgaebel commented Jan 19, 2015

I'll try running another experiment where I measure average and maximum PSL as a hashmap is grown from its minimum size to a few tens of megabytes, across a variety of load factors. I think that might provide more useful, directly actionable information.

@cgaebel commented Jan 19, 2015

Thinking about this a bit more:

Growing a HashMap from 50% load to 100% load means the average load factor is 75%. At that load factor, the average probe sequence length will be somewhere in the neighborhood of 0-1. That means that even if we set the maximum load factor to 100%, the average lookup will still be extremely fast.

If we do set the maximum load factor to 100%, the worst-case lookup cost is log(n). That... still sounds pretty good, and the memory accesses are sequential. This "average worst case" bound matches the worst-case bound of a hash table that chains with balanced binary search trees.

Because of these two points, I'm now inclined to consider upping the load factor to 100%.
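
As a cross-check on that figure: Robin Hood only reorders elements within the probe footprint of plain linear probing, so its mean displacement should match the classic linear-probing estimate. A quick sketch, assuming Knuth's ½(1 + 1/(1 − α)) successful-search formula rather than anything from the paper above:

```rust
// Mean displacement (probes past the home slot) for plain linear probing
// at load factor `alpha`, from Knuth's ~0.5 * (1 + 1/(1 - alpha)) estimate
// for a successful search, minus the initial probe. Blows up as alpha -> 1.
fn mean_displacement(alpha: f64) -> f64 {
    alpha / (2.0 * (1.0 - alpha))
}

fn main() {
    for &alpha in &[0.75, 0.90, 0.95, 0.98] {
        println!("alpha = {:.2}: mean displacement ~ {:.2}",
                 alpha, mean_displacement(alpha));
    }
    // 0.75 -> 1.50, 0.90 -> 4.50, 0.95 -> 9.50, 0.98 -> 24.50
}
```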

@cgaebel commented Jan 19, 2015

(at 100% load factor)

It looks like the average probe sequence length grows as O(log n), where n is the number of elements the table grows to.

Here's a fun way to estimate the average probe sequence length from the number of elements you expect the table to grow to: count the digits and subtract 2. Growing to 10,000 gives an average probe sequence length of about 3.
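
Spelling that rule of thumb out (the digit count of n is ⌊log10 n⌋ + 1, so the heuristic is roughly log10(n) − 1); a tiny sketch taking the rule at face value:

```rust
// The digit-count heuristic spelled out: digits(n) - 2, where
// digits(n) = floor(log10(n)) + 1, i.e. roughly log10(n) - 1.
fn approx_avg_psl(n: f64) -> f64 {
    n.log10() - 1.0
}

fn main() {
    for &n in &[10_000f64, 1_000_000f64, 100_000_000f64] {
        println!("grow to {:>9}: avg psl ~ {}", n, approx_avg_psl(n));
    }
    // 10_000 -> 3, 1_000_000 -> 5, 100_000_000 -> 7
}
```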

@Gankra commented Jan 19, 2015

I think you're combining different ideas in inappropriate ways. Average/maximum probe length is important for reads, because that is how long a read takes.

However, a write into a basically full table may require touching basically every element. E.g., you have one empty space left in your table and you hash to the space right after it: you now need to shift every element in the table over by one. That's an O(n) operation, and completely unacceptable. And it is expected to occur on the last few insertions. With k spaces left, you expect to move n/k elements on an insertion or removal, because that's how far away the space you need is.
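
A toy illustration of that worst case, under a hypothetical setup (identity-style hash, every element in its home slot, exactly one hole; none of this is std's HashMap): inserting a single key whose home slot sits just past the hole shifts nearly the whole table.

```rust
// Toy worst-case Robin Hood insert: identity-style hash, every element in
// its home slot, one empty slot. Hypothetical setup, not std's HashMap.

const SLOTS: usize = 1 << 12;

fn home(key: usize) -> usize {
    key % SLOTS
}

// Robin Hood insert that counts how many existing elements get displaced.
fn insert_counting_moves(table: &mut [Option<usize>], mut key: usize) -> usize {
    let mut slot = home(key);
    let mut dist = 0;
    let mut moves = 0;
    loop {
        match table[slot] {
            None => {
                table[slot] = Some(key);
                return moves;
            }
            Some(occupant) => {
                let occ_dist = (slot + SLOTS - home(occupant)) % SLOTS;
                if occ_dist < dist {
                    // Swap: the occupant is picked up and shifted onward.
                    table[slot] = Some(key);
                    key = occupant;
                    dist = occ_dist;
                    moves += 1;
                }
            }
        }
        slot = (slot + 1) % SLOTS;
        dist += 1;
    }
}

fn main() {
    let hole = 0;
    // Keys 1..SLOTS each sit exactly in their home slot; slot `hole` is empty.
    let mut table: Vec<Option<usize>> = (0..SLOTS)
        .map(|i| if i == hole { None } else { Some(i) })
        .collect();
    // A new key whose home slot is right after the hole.
    let key = SLOTS + hole + 1; // home(key) == hole + 1
    let moves = insert_counting_moves(&mut table, key);
    println!("table size {}: one insert displaced {} elements", SLOTS, moves);
    // Prints SLOTS - 2: nearly the whole table shifts over by one.
}
```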

@Gankra commented Jan 19, 2015

With a constant number of spaces left, a series of m alternating insertions and removals will take O(mn) expected time.

@pczarn commented Jan 23, 2015

We can change the load factor to 0.95.

for α=0.95:

| X | Pr{X} |
| --- | --- |
| displacement = 0 | 0.08090420 |
| displacement = 1 | 0.08406587 |
| displacement = 2 | 0.07775618 |
| displacement = 3 | 0.07029125 |
| displacement = 4 | 0.06337353 |
| displacement = 5 | 0.05712849 |
| displacement = 6 | 0.05149544 |
| displacement = 7 | 0.04641329 |
| displacement = 8 | 0.04182829 |
| displacement = 9 | 0.03769211 |
| displacement = 10 | 0.03396111 |
| displacement = 11 | 0.03059584 |
| displacement = 12 | 0.02756072 |
| displacement = 13 | 0.02482359 |
| displacement = 14 | 0.02235541 |
| displacement = 15 | 0.02012997 |
| displacement < 8 | 0.53142827 |
| displacement < 16 | 0.77037537 |

For the current load factor, α=0.909, Pr{displacement < 8} = 0.77

@cgaebel commented Jan 23, 2015

To make lookups read one cache line on average, we'd want displacement <= 4. Also, what formula are you using for the probability? Is this for lookups, insertions, or removals?

@pczarn commented Jan 23, 2015

Ah. For lookups, I'm using the following formula, taken from Alfredo Viola (2005) and simplified for b=1:

[formula image from Viola (2005), simplified for b = 1]

@Gankra commented Feb 21, 2015

Closing due to inactivity. I don't think we have sufficient data to make this change. I also don't think we should be optimizing for space over speed for this collection in general (space is cheap, time is expensive).

Gankra closed this Feb 21, 2015
@cgaebel commented Feb 21, 2015

I agree with closing this due to inactivity, but in general, optimizing data structures for space is the first step in optimizing for speed in the land of caches.

@Gankra commented Feb 21, 2015

I don't think that's a super applicable concern for a HashMap.

I suppose for a small hashmap there are some gains to be had, but cache concerns don't scale the way they would in e.g. a BTreeMap with a good access pattern.

@Gankra commented Feb 21, 2015

Sorry, I should clarify: for a HashMap's load factor. Saving space in other ways (e.g. moving to u32 hashes) is totally legitimate for cache optimization.
