Cache conscious hashmap table #36692
Conversation
rust-highfive assigned aturon Sep 24, 2016
r? @aturon (rust_highfive has picked a reviewer for you, use r? to override)
arthurprs force-pushed the arthurprs:hashmap-layout branch from 976004c to 006c6ba Sep 24, 2016
bluss reviewed Sep 24, 2016
```diff
@@ -371,8 +370,7 @@ impl<K, V, M> EmptyBucket<K, V, M>
     pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
         unsafe {
             *self.raw.hash = hash.inspect();
-            ptr::write(self.raw.key as *mut K, key);
-            ptr::write(self.raw.val as *mut V, value);
+            ptr::write(self.raw.pair as *mut (K, V), (key, value));
```
bluss (Contributor) Sep 24, 2016
It would feel more natural to have two writes here and skip making a tuple. Does it matter in either direction for performance?
arthurprs (Author, Contributor) Sep 24, 2016 • edited
I looked at the disassembly and the end result seems to be the same.
for (usize, usize) it's both MOVDQU
for (usize, [u64; 10]) it's
```
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
    unsafe {
        *self.raw.hash = hash.inspect();
ec1d: 49 89 39                mov    %rdi,(%r9)
ec20: 48 8b 8d 10 fe ff ff    mov    -0x1f0(%rbp),%rcx
ec27: 49 89 0c 24             mov    %rcx,(%r12)
ec2b: 0f 28 85 50 ff ff ff    movaps -0xb0(%rbp),%xmm0
ec32: 41 0f 11 44 24 48       movups %xmm0,0x48(%r12)
ec38: 0f 28 85 10 ff ff ff    movaps -0xf0(%rbp),%xmm0
ec3f: 0f 28 8d 20 ff ff ff    movaps -0xe0(%rbp),%xmm1
ec46: 0f 28 95 30 ff ff ff    movaps -0xd0(%rbp),%xmm2
ec4d: 0f 28 9d 40 ff ff ff    movaps -0xc0(%rbp),%xmm3
ec54: 41 0f 11 5c 24 38       movups %xmm3,0x38(%r12)
ec5a: 41 0f 11 54 24 28       movups %xmm2,0x28(%r12)
ec60: 41 0f 11 4c 24 18       movups %xmm1,0x18(%r12)
ec66: 41 0f 11 44 24 08       movups %xmm0,0x8(%r12)
ec6c: 4c 8b 75 b8             mov    -0x48(%rbp),%r14
        let pair_mut = self.raw.pair as *mut (K, V);
        ptr::write(&mut (*pair_mut).0, key);
        ptr::write(&mut (*pair_mut).1, value);
```
```
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
    unsafe {
        *self.raw.hash = hash.inspect();
ec1d: 49 89 39                mov    %rdi,(%r9)
ec20: 48 8b 8d 10 fe ff ff    mov    -0x1f0(%rbp),%rcx
ec27: 49 89 0c 24             mov    %rcx,(%r12)
ec2b: 0f 28 85 50 ff ff ff    movaps -0xb0(%rbp),%xmm0
ec32: 41 0f 11 44 24 48       movups %xmm0,0x48(%r12)
ec38: 0f 28 85 10 ff ff ff    movaps -0xf0(%rbp),%xmm0
ec3f: 0f 28 8d 20 ff ff ff    movaps -0xe0(%rbp),%xmm1
ec46: 0f 28 95 30 ff ff ff    movaps -0xd0(%rbp),%xmm2
ec4d: 0f 28 9d 40 ff ff ff    movaps -0xc0(%rbp),%xmm3
ec54: 41 0f 11 5c 24 38       movups %xmm3,0x38(%r12)
ec5a: 41 0f 11 54 24 28       movups %xmm2,0x28(%r12)
ec60: 41 0f 11 4c 24 18       movups %xmm1,0x18(%r12)
ec66: 41 0f 11 44 24 08       movups %xmm0,0x8(%r12)
ec6c: 4c 8b 75 b8             mov    -0x48(%rbp),%r14
        let pair_mut = self.raw.pair as *mut (K, V);
        ptr::write(pair_mut, (key, value));
```
for (String, usize) it's
```
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
    unsafe {
        *self.raw.hash = hash.inspect();
f670: 4d 89 20                mov    %r12,(%r8)
f673: 48 8b 45 90             mov    -0x70(%rbp),%rax
f677: 49 89 06                mov    %rax,(%r14)
f67a: f3 41 0f 7f 46 08       movdqu %xmm0,0x8(%r14)
f680: 49 89 5e 18             mov    %rbx,0x18(%r14)
f684: 48 8b 5d c8             mov    -0x38(%rbp),%rbx
        let pair_mut = self.raw.pair as *mut (K, V);
        ptr::write(pair_mut, (key, value));
```
```
pub fn put(mut self, hash: SafeHash, key: K, value: V) -> FullBucket<K, V, M> {
    unsafe {
        *self.raw.hash = hash.inspect();
f670: 4d 89 20                mov    %r12,(%r8)
f673: 48 8b 45 90             mov    -0x70(%rbp),%rax
f677: 49 89 06                mov    %rax,(%r14)
f67a: f3 41 0f 7f 46 08       movdqu %xmm0,0x8(%r14)
f680: 49 89 5e 18             mov    %rbx,0x18(%r14)
f684: 48 8b 5d c8             mov    -0x38(%rbp),%rbx
        let pair_mut = self.raw.pair as *mut (K, V);
        // ptr::write(pair_mut, (key, value));
        ptr::write(&mut (*pair_mut).0, key);
        ptr::write(&mut (*pair_mut).1, value);
```
This subset is the .contains_key() benchmarks (retrieving the key but not the value). So this would be the drawback, where the old layout had better cache usage. It seems ok to give this up in return for the rest?
Results for x86
x86 again (with usize hashes from #36595, thus 31 hash bits)
Maybe with bigger hashmaps, to make sure it's well out of the CPU cache size?
After the 3000th look I finally saw that iter_keys_big_value was busted; here are several others for good measure:
durka reviewed Sep 25, 2016
If the potential downside is wasted space, shouldn't there be some memory benchmarks as well?
```diff
 ///
-/// This design uses less memory and is a lot faster than the naive
+/// This design uses is a lot faster than the naive
 /// `Vec<Option<u64, K, V>>`, because we don't pay for the overhead of an
```
```diff
@@ -48,12 +48,14 @@ const EMPTY_BUCKET: u64 = 0;
 /// which will likely map to the same bucket, while not being confused
 /// with "empty".
 ///
-/// - All three "arrays represented by pointers" are the same length:
+/// - All two "arrays represented by pointers" are the same length:
```
arthurprs force-pushed the arthurprs:hashmap-layout branch from 006c6ba to 9098c5c Sep 25, 2016
alexcrichton added the T-libs label Sep 26, 2016
cc @pczarn
@Veedrac PTAL
@arthurprs Seems like a solid improvement.
@rfcbot fcp merge Looks like we've got solid wins all around to consider merging?
rfcbot commented Oct 3, 2016
FCP proposed with disposition to merge. Review requested from: No concerns currently listed.
I'm happy to go along with the experts here.
arthurprs referenced this pull request Oct 5, 2016: Exposure of HashMap iteration order allows for O(n²) blowup. #36481 (Open)
Has anyone evaluated this on a real workload? The first one that comes to mind is of course rustc.
I'm not familiar enough with the bootstrap process, but if somebody provides some guidance I could do it.
Tip from simulacrum: we can use https://github.com/rust-lang-nursery/rustc-benchmarks to test the rustc impact. Rustc building itself is a heavier (and more important?) benchmark; I don't know exactly what to time there.
@arthurprs short of timing an execution
@arthurprs You can run

I agree with changing the memory layout. However, the tradeoffs are subtle. The benefits and drawbacks of this change depend on circumstances such as the sizes of keys and values.

There is one more drawback that you didn't describe in detail. Let's say the user wants to iterate through the HashMap's keys. The user will access every key, which will waste some memory and cache bandwidth on loading the map's values. So neither layout is truly cache conscious; both are cache conscious in different ways. Of course you have to decide if the efficiency of the keys() and values() iterators is important enough to give the change to the layout a second thought.

I think the benefits outweigh the drawbacks, because accessing single map entries is very common.
I don't think those tests will be feasible on my laptop, especially considering the trial and error involved.

I think the benefits far outweigh the drawbacks. There's potential to waste some padding, but in the real world that's frequently not the case (try using GitHub search in the rust repo and skim some pages). We shouldn't optimize for keys() and values(), and those will definitely take a hit (as per the benchmarks).
arthurprs force-pushed the arthurprs:hashmap-layout branch from 9098c5c to 70f9b98 Oct 10, 2016
brson added the relnotes label Oct 10, 2016
Nice work @arthurprs !
bors added a commit that referenced this pull request Oct 12, 2016
bors added a commit that referenced this pull request Oct 12, 2016
bors added a commit that referenced this pull request Oct 12, 2016
I'll fix it.
arthurprs force-pushed the arthurprs:hashmap-layout branch from c5068a4 to c435821 Oct 12, 2016
Travis is happy again.
@arthurprs Have you looked into why the buildbot tests failed? (log link) It's in the big testsuite and I don't see the PR changing anything there. It was unfortunately green on travis before, and still the buildbot build failed.
It's the nopt builder, so presumably related to debug assertions?
Yes, I should have said "CI should be happy".
@bors: r+
bors added a commit that referenced this pull request Oct 13, 2016
I'm not sure it's related to the PR.
@bors: retry
bors added a commit that referenced this pull request Oct 14, 2016
error: pretty-printing failed in round 0 revision None
@bors retry
bors added a commit that referenced this pull request Oct 14, 2016
bors merged commit c435821 into rust-lang:master Oct 14, 2016
rfcbot commented Oct 17, 2016
All relevant subteam members have reviewed. No concerns remain.
rfcbot commented Oct 24, 2016
It has been one week since all blocks to the FCP were resolved.
arthurprs commented Sep 24, 2016 • edited
Right now the internal HashMap representation is 3 unzipped arrays hhhkkkvvv. I propose to change it to hhhkvkvkv (in further iterations kvkvkvhhh may allow in-place growth). A previous attempt is at #21973.
benefits
This layout is generally more cache conscious, as it makes the value immediately accessible after a key matches. The separate hash array is a no-brainer because of how the Robin Hood algorithm works, and that part is unchanged.
Lookups: upon a successful match in the hash array, the code can check the key and immediately have access to the value in the same or the next cache line (effectively saving an L1/L2/L3 cache miss compared to the current layout).
Inserts/Deletes/Resize: moving values in the table (robin hooding it) is faster because it touches consecutive cache lines and uses fewer instructions.
Some backing benchmarks for the benefits of this layout (besides the ones below) can be seen at http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ as well.
drawbacks
The obvious drawback is that padding can be wasted between the key and value. Because of that, keys(), values() and contains_key() can consume more cache and be slower.
Total wasted padding between items (C being the capacity of the table).
In practice the padding between K-K and V-V can be smaller than between K-V and V-K. The overhead is capped(ish) at sizeof u64 - 1, so we can actually measure the worst case (a u8 at the end of the key type and a value with alignment of 1, hardly the average case in practice).
Starting from the worst case, the memory overhead is:
- HashMap<u64, u8>: 46% memory overhead (aka worst case)
- HashMap<u64, u16>: 33% memory overhead
- HashMap<u64, u32>: 20% memory overhead
- HashMap<T, T>: 0% memory overhead

benchmarks
I have a test repo to run benchmarks here: https://github.com/arthurprs/hashmap2/tree/layout