Speed up instance variable cache misses #8744
Conversation
This is an experimental commit that uses a functional red-black tree to create an index of the ancestor shapes. It uses an Okasaki style functional red-black tree: https://www.cs.tufts.edu/comp/150FP/archive/chris-okasaki/redblack99.pdf

This tree is advantageous because:

* It offers O(log n) insertions and O(log n) lookups.
* It shares memory with previous "versions" of the tree.

When we insert a node in the tree, only the parts of the tree that need to be rebalanced are newly allocated. Parts of the tree that don't need to be rebalanced are not reallocated, so "new trees" are able to share memory with old trees. This is in contrast to a sorted set, where we would have to duplicate the set and also re-sort it on each insertion.

I've added a new stat to RubyVM.stat so we can understand how the red-black tree grows.
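For readers unfamiliar with the structure, here is a minimal sketch of an Okasaki-style persistent red-black tree in plain Ruby. It is an illustration only, not the PR's implementation (which is C inside CRuby's shape code), and all names (`Node`, `tree_insert`, `tree_lookup`) are invented for the example. The last line demonstrates the memory-sharing property described above: inserting into one side of the tree leaves the untouched subtree shared between the old and new versions.

```ruby
# A minimal persistent (Okasaki-style) red-black tree. The empty tree is nil.
# Names here are invented for illustration; the PR's real version is in C.
Node = Struct.new(:color, :left, :key, :value, :right)

# The four rotation cases from Okasaki's paper, applied under a black parent.
def balance(color, l, k, v, r)
  if color == :black
    if l&.color == :red && l.left&.color == :red
      return Node.new(:red,
                      Node.new(:black, l.left.left, l.left.key, l.left.value, l.left.right),
                      l.key, l.value,
                      Node.new(:black, l.right, k, v, r))
    elsif l&.color == :red && l.right&.color == :red
      return Node.new(:red,
                      Node.new(:black, l.left, l.key, l.value, l.right.left),
                      l.right.key, l.right.value,
                      Node.new(:black, l.right.right, k, v, r))
    elsif r&.color == :red && r.left&.color == :red
      return Node.new(:red,
                      Node.new(:black, l, k, v, r.left.left),
                      r.left.key, r.left.value,
                      Node.new(:black, r.left.right, r.key, r.value, r.right))
    elsif r&.color == :red && r.right&.color == :red
      return Node.new(:red,
                      Node.new(:black, l, k, v, r.left),
                      r.key, r.value,
                      Node.new(:black, r.right.left, r.right.key, r.right.value, r.right.right))
    end
  end
  Node.new(color, l, k, v, r)
end

def ins(t, k, v)
  return Node.new(:red, nil, k, v, nil) if t.nil?
  if    k < t.key then balance(t.color, ins(t.left, k, v), t.key, t.value, t.right)
  elsif k > t.key then balance(t.color, t.left, t.key, t.value, ins(t.right, k, v))
  else                 Node.new(t.color, t.left, k, v, t.right)
  end
end

def tree_insert(t, k, v)
  n = ins(t, k, v)
  Node.new(:black, n.left, n.key, n.value, n.right) # the root is always black
end

def tree_lookup(t, k)
  t = (k < t.key ? t.left : t.right) while t && t.key != k
  t && t.value
end

t1 = nil
[[1, "a"], [2, "b"], [3, "c"]].each { |k, v| t1 = tree_insert(t1, k, v) }
t2 = tree_insert(t1, 4, "d")     # a new "version" of the tree
tree_lookup(t2, 4)               # => "d"
tree_lookup(t1, 4)               # => nil  (the old version is untouched)
t2.left.equal?(t1.left)          # => true (subtree is shared, not copied)
```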
We're only going to create a redblack tree on platforms that have mmap
Force-pushed from 4a7b24b to f69a8e4
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
So the implementation mostly goes over my head, but a couple of things. If cache misses have decent complexity now, should we bump or remove the Line 34 in c44d654
This significantly speeds up looking up a shape ancestor, but do you think #8650 still makes sense assuming the red-black tree is merged? I'd think it would for the supposedly common "close ancestor" case. On another note, I triggered an internal Shopify CI build to see if it spots any issue with your PR.
Looks really good, thanks for doing this! A couple of small optional nits.
```diff
@@ -120,7 +348,7 @@ shape_alloc(void)
     shape_id_t shape_id = GET_SHAPE_TREE()->next_shape_id;
     GET_SHAPE_TREE()->next_shape_id++;

-    if (shape_id == MAX_SHAPE_ID) {
+    if (shape_id == (MAX_SHAPE_ID + 1)) {
```
Why did this change?
I wanted `MAX_SHAPE_ID` to be actually allocatable. We could probably roll it back, but since `MAX_SHAPE_ID` wasn't actually allocatable, the name seemed odd.
We can just remove it. It's not even used anymore, except in tests. There's a bunch of other stuff I want to clean up after this lands, but I wanted to keep the diff as small as possible (I realize this is a big PR 😢)
Yes, I think that PR still makes sense. The RB tree seems to only pay off when we need to examine more than ~10 shapes. I think "near shape" hints would complement this well. WDYT @jhawthorn?
Force-pushed from 808fa57 to ae89520
I'm not sure. After this change, I think the bigger benefit from that would be Jean's change to avoid flip-flopping the cache, rather than the speed of ivar lookup.
Well, the assumption is that in the rather common case of the "memoization" pattern, the common ancestor is very close, so in many cases we might get away with just checking two or three shapes. It's also very helpful for objects that are sometimes frozen and sometimes not. I'll try to find some time to benchmark this again now that the index has been merged, and see if I can demonstrate a substantial perf gain.
Alright, so something I totally missed until now is that all shapes with more than So it kinda changes things. Still, I benchmarked my (unmodified) branch against the newest master, and it's still quite significantly faster in the "close ancestor" case, which I suspect is common both for light memoization use and for frozen objects.
So I think it could make sense to check a small number of ancestors before giving up and using the index.
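To make the suggestion concrete, here's a hypothetical sketch in plain Ruby of walking a few parent links before falling back to the index. Everything here is a stand-in for illustration: `Shape` is not CRuby's actual struct, `LINEAR_PROBE_LIMIT` is an arbitrary guess at the "small number", and a `Set` stands in for the red-black tree index.

```ruby
require "set"

# Hypothetical stand-in for a shape with a parent pointer; not CRuby's struct.
Shape = Struct.new(:id, :parent)

LINEAR_PROBE_LIMIT = 4 # the "small number of ancestors" to walk first

def ancestor?(shape, target_id, index)
  probe = shape
  LINEAR_PROBE_LIMIT.times do
    return true if probe.id == target_id # cheap hit for close ancestors
    probe = probe.parent
    return false if probe.nil?           # ran out of ancestors entirely
  end
  index.include?(target_id)              # give up and use the O(log n) index
end

# Build a chain of 21 shapes, leaf at depth 20.
chain = [Shape.new(0, nil)]
1.upto(20) { |i| chain << Shape.new(i, chain.last) }
leaf  = chain.last
index = Set.new(chain.map(&:id)) # stands in for the red-black tree index

ancestor?(leaf, 19, index) # => true, found by the linear probe
ancestor?(leaf, 0, index)  # => true, found via the index
ancestor?(leaf, 99, index) # => false
```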
This PR speeds up instance variable cache misses by introducing a red-black tree as a shape cache. With the red-black tree, we can easily check whether an instance variable has been set and what index the IV uses. Before this change, in the worst case, IV lookup would be `O(n)`, but with the red-black tree, the worst case is `O(log n)` (where `n` == shape depth).

I added a benchmark to show the difference:
This change has an impact on YJIT as well. When IV sites become megamorphic, rather than exiting, the JIT will generate machine code that does a "slow path" read on the IV. Below is a benchmark to demonstrate:
With YJIT on master:
With YJIT on this branch: