
Cache full key hash codes in ChampMap #6975

Merged
merged 6 commits on Aug 14, 2018

Conversation

retronym
Member

@retronym retronym commented Jul 26, 2018

(This builds atop #6856)

Fixes scala/scala-dev#525

TODO:

  • measure the footprint increase that we're incurring by caching the hash codes
  • consider caching the size in BitMapIndexedNode to avoid the complexity of MapNodeSource/Replaced. The extra field will likely come for free in otherwise wasted space now that we've added the hashes field (a rough sketch of the resulting layout follows).
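
A rough sketch of the node layout the second TODO item aims for, with hypothetical field names (not the actual ChampHashMap internals): the cached key hashes sit alongside the content array, and the cached size makes it easy to tell whether `updated` added or overwrote a binding.

```scala
// Hypothetical sketch only: illustrative field names, not the real internals.
final class BitmapIndexedMapNodeSketch(
  val dataMap: Int,                // bitmap: which positions hold key/value payload
  val nodeMap: Int,                // bitmap: which positions hold sub-nodes
  val content: Array[Any],         // interleaved keys and values, followed by sub-nodes
  val cachedKeyHashes: Array[Int], // cached hash code of each payload key
  val size: Int                    // cached number of entries in this subtree
)
```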

@scala-jenkins scala-jenkins added this to the 2.13.0-M5 milestone Jul 26, 2018
@retronym retronym added the performance the need for speed. usually compiler performance, sometimes runtime performance. label Jul 26, 2018
@SethTisue SethTisue modified the milestones: 2.13.0-M5, 2.13.0-RC1 Aug 7, 2018
@SethTisue
Member

@retronym this is a blocker for RC1 but not for M5, right?

@dwijnand dwijnand added the WIP label Aug 8, 2018
@dwijnand
Member

dwijnand commented Aug 8, 2018

JFYI: there are merge conflicts in src/library/scala/collection/immutable/ChampHashMap.scala

@adriaanm adriaanm modified the milestones: 2.13.0-RC1, 2.13.0-M5 Aug 8, 2018
@adriaanm adriaanm added the prio:blocker release blocker (used only by core team, only near release time) label Aug 8, 2018
@retronym
Member Author

retronym commented Aug 8, 2018

@msteindorfer Could I ask you to review this change? We're hoping to get this into M5, rather than waiting for RC1, but we want to make sure we don't regress the behaviour.

@msteindorfer
Contributor

@retronym yes, I'll do a review asap. When is the deadline for M5?

@dwijnand
Member

dwijnand commented Aug 8, 2018

@msteindorfer
Contributor

@dwijnand thanks. I'll do the review in a timely manner so that it can be merged for M5.

Contributor

@msteindorfer msteindorfer left a comment


@retronym the implementation of element hash code caching looks good to me.

As you already pointed out, due to 8-byte alignment on the JVM you could now also add the cached size field to BitmapIndexedMapNode (since the cost is already paid for by adding the hashes field). It would also simplify MapNodeSource/Replaced as you said.

Further considerations for the future:

  • Since you've now added the hashes of the keys, you could optimize hashCode() so that it doesn't have to recompute the key hashes (only the value hashes); this would require a custom iterator and accessors to the hashes, though (see the sketch after this list).
  • How about ChampHashSet: do you plan to introduce caching there as well, or will this be a map-only specialization?
    • Same benefits / drawbacks w.r.t. partial collisions would apply for sets as well.
    • Having the size cached in BitmapIndexedSetNode would help in future with specializing set-algebra operations.
  • Theoretically you could introduce a feature flag for element hash-code caching that might be useful for A/B (performance) testing. (Not sure which data you have that speaks for having element hash code caching as an always-on default for all users.)
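
To make the first bullet concrete, here is a hypothetical sketch of that optimization. The `cachedKeyHashes` and `values` accessors are assumed, not part of the PR, and the sketch illustrates skipping the key re-hashing rather than reproducing the exact value the library's mapHash produces.

```scala
import scala.util.hashing.MurmurHash3

def mapHashFromCachedKeyHashes(cachedKeyHashes: Iterator[Int], values: Iterator[Any]): Int = {
  val entrySeed = 0x3c074a61 // arbitrary seed, chosen only for this sketch
  // Build each entry hash from the cached key hash plus a freshly computed
  // value hash, then combine the entry hashes order-independently.
  val entryHashes = cachedKeyHashes.zip(values).map { case (keyHash, value) =>
    val h = MurmurHash3.mix(entrySeed, keyHash)
    MurmurHash3.finalizeHash(MurmurHash3.mixLast(h, value.##), 2)
  }
  MurmurHash3.unorderedHash(entryHashes, MurmurHash3.mapSeed)
}
```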

@retronym
Member Author

Not sure which data you have that speaks for having element hash code caching as an always-on default for all users

My rationale here is that this is the status quo in the 2.12.x collections, and that we would really need data to justify removing the caching. The first performance test I tried ran into exactly the kind of regression that would pop up in real-world code. I've discussed the problem with some users, and the feedback was that it is quite common to use map keys with relatively expensive hash codes (case classes composing collections of case classes, etc.)
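
An illustration of the kind of key described above, using made-up domain types: the synthesized hashCode walks the whole nested structure on every call, which is exactly the work that caching avoids repeating.

```scala
// Hypothetical domain types; nothing here is from the PR.
case class LineItem(sku: String, quantity: Int)
case class Order(id: Long, items: Vector[LineItem], tags: Set[String])

// Order.hashCode re-hashes every LineItem and every tag each time it is
// called, so using Order values as map keys makes key hashing relatively
// expensive unless the hash is cached somewhere.
```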

How about ChampHashSet

You're right, I should add it there, too. I'll simplify the Replaced part.

Having the size cached in BitmapIndexedSetNode would help in future with specializing set-algebra operations.

Could you give an example of what you have in mind?

@retronym retronym changed the title from "WIP Cache full key hash codes in ChampMap" to "Cache full key hash codes in ChampMap" Aug 10, 2018
@retronym retronym removed the WIP label Aug 10, 2018
The extra field comes "for free" after the previous addition of the
field for `hashes`, and it simplifies detecting whether `updated` adds
or overwrites a binding.
@msteindorfer
Contributor

I've discussed the problem with some users, and the feedback was that it is quite common to use map keys with relatively expensive hash codes (case classes composing collections of case classes, etc.)

I see your point; however, there's an alternative design direction that solves this problem, at least when keys are immutable. One could cache the hash codes within the keys (e.g., by having a val hashCode in the case class). The canonical CHAMP implementation in the Capsule library furthermore caches the collection's hashes. When you combine both, it addresses the concern of composed keys with expensive hash codes and reduces the need to cache the elements' hash codes inside the collection.

Having the size cached in BitmapIndexedSetNode would help in future with specializing set-algebra operations.

Could you give an example of what you have in mind?

Say you want to subtract one set from another and two sub-trees are referentially equal. Then you can discard the sub-tree from the result right away without looking into it. If you have the size cached, you immediately know how many elements you've discarded; otherwise you'd need a traversal to recover the count of discarded elements.
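
A hedged sketch of that fast path, written against a hypothetical trie-node type (not the real ChampHashSet internals): referentially equal sub-trees cancel out, and the cached size gives the number of discarded elements without a traversal.

```scala
// Hypothetical node type, just enough to show the shape of the check.
trait SetNodeSketch { def cachedSize: Int }

// Returns the remaining sub-tree (None if it vanished entirely) together with
// the number of elements removed from `left`.
def diffSubtrees(left: SetNodeSketch, right: SetNodeSketch)
                (recurse: (SetNodeSketch, SetNodeSketch) => (Option[SetNodeSketch], Int)): (Option[SetNodeSketch], Int) =
  if (left eq right) (None, left.cachedSize) // whole sub-tree cancels; the count is free
  else recurse(left, right)                  // otherwise do the usual per-child work
```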

Let me know if there's more to review in this PR.

@retronym
Member Author

One could cache the hash codes within the keys

Good point. In Scala that often comes down to `override val hashCode = MurmurHash3.productHash(this)` in a case class.
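
For reference, that pattern spelled out on a made-up key type (plain Scala, not tied to this PR):

```scala
import scala.util.hashing.MurmurHash3

// Cache the structural hash once in the key itself, so every collection that
// stores this key reuses it instead of re-walking the nested structure.
// (Assumes the fields are effectively immutable, as discussed above.)
case class CompositeKey(ids: Vector[Long], labels: List[String]) {
  override val hashCode: Int = MurmurHash3.productHash(this)
}
```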

Let me know if there's more to review in this PR.

I've just pushed a further commit that explores the idea of using the cached hash codes to compute set hash codes more efficiently. The change is a bit cumbersome, because we need to base this on the un-improved hash code (for hash-code compatibility with other Map subclasses). But it sure saves a lot of Tuple creations and hashCode calls during that operation.
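
As a rough illustration of what such a computation can look like (assuming an accessor for the cached, un-improved element hashes, which is not spelled out here): a Set's hash is an unordered combination of its elements' hash codes, so the cached values can be fed straight into the combiner without calling hashCode on each element again. Whether this matches the library's exact result depends on details (e.g. ## versus hashCode), so treat it as the idea only.

```scala
import scala.util.hashing.MurmurHash3

// `cachedElementHashes` stands in for the accessor the real nodes would expose.
def setHashFromCachedHashes(cachedElementHashes: Iterator[Int]): Int =
  MurmurHash3.unorderedHash(cachedElementHashes, MurmurHash3.setSeed)
```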

@msteindorfer
Contributor

Good point. In Scala that often comes down to `override val hashCode = MurmurHash3.productHash(this)` in a case class.

One advantage of caching within the keys is that it works across collection instances. Overall it's a tricky decision whether key hash codes should be cached inside the collection or in user code.

Let me know if there's more to review in this PR.

I've just pushed a further commit that explores the idea of using the cached hash codes to compute set hash codes more efficiently. The change is a bit cumbersome, because we need to base this on the un-improved hash code (for hash-code compatibility with other Map subclasses). But it sure saves a lot of Tuple creations and hashCode calls during that operation.

A quick general question: is this PR still on track for M5 or are you now experimenting with different designs (as with unimproved hashes)?

I had a look at the latest commit that stores unimproved hashes, and at a glance it looks fine. As you said, it's a bit cumbersome, but not complicated code. Did you benchmark the effects already?

@retronym retronym added the WIP label Aug 10, 2018
@retronym
Member Author

"Exploit cached hash code in ChampHashSet.{hashCode,equals}" could be deferred to a subsequent PR if the rest is ready in time for M5, but I figured we might as well review it as a unit. I haven't benchmarked it yet. I guess we need to make such benchmark parameteric in how slow the hashCode implementation is.

@retronym
Member Author

retronym commented Aug 10, 2018

I've also added caching of the size in Set and used it to make sure we only structurally share when adding an element that is reference-equal to an existing element. (Users expect that after `set += x`, `set.exists(_ eq x)` holds.)
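
A minimal, self-contained sketch of that check (hypothetical helper, not the ChampHashSet node code): structural sharing is only safe when the stored element is the very same object as the one being added; if it is merely ==-equal, the new reference has to be stored so the expectation above holds.

```scala
// Decide between reusing the unchanged set and building one that stores the
// incoming reference. `reuseUnchanged` / `storeIncoming` stand in for the
// real node operations.
def onEqualElement[A <: AnyRef](stored: A, incoming: A)
                               (reuseUnchanged: => Set[A], storeIncoming: => Set[A]): Set[A] =
  if (stored eq incoming) reuseUnchanged // same object: sharing preserves `_ eq x`
  else storeIncoming                     // equal but distinct: must keep the new reference
```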

@msteindorfer
Contributor

I had another look at the PR. So far it looks good to me.

Users expect that after `set += x`, `set.exists(_ eq x)` holds.

OK. I remember you mentioning that the same is true for maps (i.e. that the compiler relies on that behaviour).

@szeiger
Member

szeiger commented Aug 13, 2018

@retronym Do you want to make any further changes or should we merge this for M5?

@retronym retronym removed the WIP label Aug 14, 2018
@retronym
Member Author

@szeiger I think this is ready for M5.
