New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache full key hash codes in ChampMap #6975
Conversation
@retronym this is a blocker for RC1 but not for M5, right? |
JFYI merge conflicts in |
@msteindorfer Could I ask you to review this change? We're hoping to get this into to M5, rather than waiting for RC1, but want to make sure we don't regress the behaviour. |
@retronym yes, I'll do a review asap. When is the deadline for M5? |
@dwijnand thanks. I'll do the review in a timely manner such that I can be merged for M5. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@retronym the implementation of element hash code caching looks good to me.
As you already pointed out, due to 8-byte alignment on the JVM you could now also add the cached size
field to BitmapIndexedMapNode
(since the cost is already paid for by adding the hashes
field). It would also simplify MapNodeSource
/Replaced
as you said.
Further considerations for the future:
- Since you've now added the
hashes
of the keys, you could optimizehashCode()
such that it doesn't have to recompute the key hashes (only the value hashes); would required a custom iterator and accessors to hashes though. - How about
ChampHashSet
do you plan to also introduce caching there as well, or will it be a map-only specialization?- Same benefits / drawbacks w.r.t. partial collisions would apply for sets as well.
- Having the size cached in
BitmapIndexedSetNode
would help in future with specializing set-algebra operations.
- Theoretically you could introduce a feature flag for element hash-code caching that might be useful for A/B (performance) testing. (Not sure which data you have that speaks for having element hash code caching as an always-on default for all users.)
My rationale here is that this is the status quo in 2.12.x collections, and that we would really need data to justify that change to remove the caching. The first performance test that I tried ended up running into the regression that it would pop up in real-world code. I've discussed the problem with some users, and the feedback was that is is quite common to use map keys with relatively expensive hash codes (case classes composining collections of case classes, etc)
You're right, I should add it there, too. I'll simplify the
Could you give an example of what you have in mind? |
94a9746
to
59ec86b
Compare
59ec86b
to
305657a
Compare
The extra field comes "for free" after the previous addition of the field for `hashes`, and it simplifies the handling of detecting whether `updated` adds or overrwrites a binding.
305657a
to
c250eba
Compare
I see your point, however there's an alternative design vector to solve this problem, at least when keys are immutable. One could cache the hash codes within the keys (e.g., having a
Say you want to subtract one set from another and two sub-trees are referential equal. Then you could discard the sub-tree from the result right away without looking into the sub-tree. In case you've the Let me know if there's more to review in this PR. |
679e2d3
to
5c6b410
Compare
Good point. In Scala that often comes down to
I've just pushed a further commit that explores the idea of using using the cached hash codes to more efficiently compute set hash codes. The change is a bit cumbersome, because we need to base this on the un-improved hashcode (for hash code compatibility with other |
One advantage of caching the within the keys is that it works across collection instances. Overall it's a tricky decision to make if key hash codes should be cached inside the collection or in the user code.
A quick general question: is this PR still on track for M5 or are you now experimenting with different designs (as with unimproved hashes)? I had a look a the latest commit that stores unimproved hashes, and at a glance it looks fine. As you said it's a bit cumbersome, but not complicated code. Did you benchmark the effects already? |
5c6b410
to
1574d17
Compare
"Exploit cached hash code in ChampHashSet.{hashCode,equals}" could be deferred to a subsequent PR if the rest is ready in time for M5, but I figured we might as well review it as a unit. I haven't benchmarked it yet. I guess we need to make such benchmark parameteric in how slow the |
I've also added caching of |
I had another look at the PR. So far it looks good to me.
Ok. I think remembering you mentioned the same is true for maps (that the compiler relies on that behaviour). |
@retronym Do you want to make any further changes or should we merge this for M5? |
@szeiger I think this is ready for M5. |
(This builds atop #6856)
Fixes scala/scala-dev#525
TODO:
size
in BitMapIndexedNode to avoid complexity ofMapNodeSource
/Replaced
. The extra field will likely come for free in otherwise wasted space now that we've added thehashes
field.