Speed up instance variable cache misses #8744

Merged
merged 8 commits into from Oct 24, 2023

tenderlove
Member

@tenderlove tenderlove commented Oct 23, 2023

This PR speeds up instance variable cache misses by introducing a red-black tree as a shape cache. With the red-black tree, we can easily check whether an instance variable has been set and what index the IV uses. Before this change, in the worst case, IV lookup would be O(n), but with the red-black tree, the worst case is O(log n) (where n == shape depth).
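
To make the O(n) worst case concrete, here is a minimal sketch of a lookup that walks a shape chain from the leaf back toward the root. `ToyShape`, `toy_add_ivar`, and `toy_find_index` are hypothetical names for illustration only; they are not CRuby's actual structs or functions, and the real implementation is in C.

```ruby
# Hypothetical model: each shape records the ivar it added, that ivar's
# index, and a pointer to its parent shape. Not CRuby's real layout.
ToyShape = Struct.new(:parent, :ivar_name, :index)
TOY_ROOT = ToyShape.new(nil, nil, nil)

# Adding an ivar creates a child shape with the next free index.
def toy_add_ivar(shape, name)
  next_index = shape.index.nil? ? 0 : shape.index + 1
  ToyShape.new(shape, name, next_index)
end

# Walk from the leaf toward the root until the name is found.
# Returns [index_or_nil, number_of_shapes_examined].
def toy_find_index(shape, name)
  steps = 0
  while shape && shape.ivar_name
    steps += 1
    return [shape.index, steps] if shape.ivar_name == name
    shape = shape.parent
  end
  [nil, steps]
end

leaf = TOY_ROOT
75.times { |i| leaf = toy_add_ivar(leaf, :"@a#{i}") }

p toy_find_index(leaf, :@a74) # => [74, 1]   most recently added ivar
p toy_find_index(leaf, :@a0)  # => [0, 75]   oldest ivar: walks the whole chain
p toy_find_index(leaf, :@nope) # => [nil, 75] a miss also walks the whole chain
```

In this model the cost of a lookup grows with the shape depth, which is the case the tree index is meant to bound at O(log n).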

I added a benchmark to show the difference:

$ make benchmark ITEM=vm_ivar_ic_miss
/Users/aaron/.rubies/arm64/ruby-trunk/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
	            --executables="compare-ruby::/Users/aaron/.rubies/arm64/ruby-trunk/bin/ruby --disable=gems -I.ext/common --disable-gem" \
	            --executables="built-ruby::./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems --disable-gem" \
	            --output=markdown --output-compare -v $(find ./benchmark -maxdepth 1 -name 'vm_ivar_ic_miss' -o -name '*vm_ivar_ic_miss*.yml' -o -name '*vm_ivar_ic_miss*.rb' | sort) 
compare-ruby: ruby 3.3.0dev (2023-10-23T15:37:50Z master 62c674f98c) [arm64-darwin23]
built-ruby: ruby 3.3.0dev (2023-10-23T22:56:14Z rb-shape-index 4a7b24be69) [arm64-darwin23]
# Iteration per second (i/s)

|                 |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|vm_ivar_ic_miss  |      3.465M|   23.031M|
|                 |           -|     6.65x|

This change has an impact on YJIT as well. When IV sites become megamorphic, rather than exit, the JIT will generate machine code that does a "slow path" read on the IV. Below is a benchmark to demonstrate:

class Foo
  def initialize idx
    case idx
    when 0; then @c0 = 1
    when 1; then @c1 = 1
    when 2; then @c2 = 1
    when 3; then @c3 = 1
    when 4; then @c4 = 1
    when 5; then @c5 = 1
    when 6; then @c6 = 1
    when 7; then @c7 = 1
    when 8; then @c8 = 1
    end

    @a0 = @a1 = @a2 = @a3 = @a4 = @a5 = @a6 = @a7 = @a8 = @a9 = @a10 = @a11 = @a12 = @a13 = @a14 = @a15 = @a16 = @a17 = @a18 = @a19 = @a20 = @a21 = @a22 = @a23 = @a24 = @a25 = @a26 = @a27 = @a28 = @a29 = @a30 = @a31 = @a32 = @a33 = @a34 = @a35 = @a36 = @a37 = @a38 = @a39 = @a40 = @a41 = @a42 = @a43 = @a44 = @a45 = @a46 = @a47 = @a48 = @a49 = @a50 = @a51 = @a52 = @a53 = @a54 = @a55 = @a56 = @a57 = @a58 = @a59 = @a60 = @a61 = @a62 = @a63 = @a64 = @a65 = @a66 = @a67 = @a68 = @a69 = @a70 = @a71 = @a72 = @a73 = @a74 = @b = 1
  end

  def b; @b; end
end

# Force the `@b` site to be megamorphic
9.times { Class.new(Foo).new(_1).b }
obj = Foo.new(8)
60000000.times { obj.b }

With YJIT on master:

$ time ruby --yjit --yjit-call-threshold=2 -v test.rb
ruby 3.3.0dev (2023-10-23T15:37:50Z master 62c674f98c) +YJIT [arm64-darwin23]

________________________________________________________
Executed in   10.37 secs    fish           external
   usr time   10.28 secs   30.29 millis   10.25 secs
   sys time    0.13 secs    5.12 millis    0.12 secs

With YJIT on this branch:

$ time ./ruby --yjit --yjit-call-threshold=2 -v test.rb
ruby 3.3.0dev (2023-10-23T22:56:14Z rb-shape-index 4a7b24be69) +YJIT [arm64-darwin23]

________________________________________________________
Executed in    2.61 secs    fish           external
   usr time    2.45 secs    0.10 millis    2.45 secs
   sys time    0.04 secs    2.15 millis    0.04 secs

This is an experimental commit that uses a functional red-black tree to
create an index of the ancestor shapes.  It uses an Okasaki-style
functional red-black tree:

  https://www.cs.tufts.edu/comp/150FP/archive/chris-okasaki/redblack99.pdf

This tree is advantageous because:

* It offers O(log n) insertions and O(log n) lookups.
* It shares memory with previous "versions" of the tree.

When we insert a node in the tree, only the parts of the tree that need
to be rebalanced are newly allocated.  Parts of the tree that don't need
to be rebalanced are not reallocated, so "new trees" are able to share
memory with old trees.  This is in contrast to a sorted set where we
would have to duplicate the set, and also resort the set on each
insertion.
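
The sharing described above can be sketched in Ruby. This is an illustrative port of Okasaki's insertion algorithm, not the C code from this PR; `RBNode`, `rb_insert`, `rb_lookup`, and the integer keys standing in for ivar IDs are all assumptions made for the example.

```ruby
# Persistent red-black tree after Okasaki. Nodes are never mutated;
# an insert allocates new nodes only along the path it touches.
RBNode = Struct.new(:color, :left, :key, :value, :right)

def rb_lookup(node, key)
  while node
    return node.value if key == node.key
    node = key < node.key ? node.left : node.right
  end
  nil
end

# Okasaki's four rebalancing cases: a black node with a red child that
# itself has a red child becomes a red node with two black children.
def rb_balance(color, left, key, value, right)
  if color == :black
    if left&.color == :red && left.left&.color == :red
      ll = left.left
      return RBNode.new(:red, RBNode.new(:black, ll.left, ll.key, ll.value, ll.right),
                        left.key, left.value, RBNode.new(:black, left.right, key, value, right))
    elsif left&.color == :red && left.right&.color == :red
      lr = left.right
      return RBNode.new(:red, RBNode.new(:black, left.left, left.key, left.value, lr.left),
                        lr.key, lr.value, RBNode.new(:black, lr.right, key, value, right))
    elsif right&.color == :red && right.left&.color == :red
      rl = right.left
      return RBNode.new(:red, RBNode.new(:black, left, key, value, rl.left),
                        rl.key, rl.value, RBNode.new(:black, rl.right, right.key, right.value, right.right))
    elsif right&.color == :red && right.right&.color == :red
      rr = right.right
      return RBNode.new(:red, RBNode.new(:black, left, key, value, right.left),
                        right.key, right.value, RBNode.new(:black, rr.left, rr.key, rr.value, rr.right))
    end
  end
  RBNode.new(color, left, key, value, right)
end

def rb_ins(node, key, value)
  return RBNode.new(:red, nil, key, value, nil) if node.nil?
  if key < node.key
    rb_balance(node.color, rb_ins(node.left, key, value), node.key, node.value, node.right)
  elsif key > node.key
    rb_balance(node.color, node.left, node.key, node.value, rb_ins(node.right, key, value))
  else
    node # key already present: reuse the existing subtree as-is
  end
end

# Returns a NEW root (always black); the old root stays valid.
def rb_insert(root, key, value)
  n = rb_ins(root, key, value)
  RBNode.new(:black, n.left, n.key, n.value, n.right)
end

def rb_each_node(node, &block)
  return if node.nil?
  rb_each_node(node.left, &block)
  block.call(node)
  rb_each_node(node.right, &block)
end

# Counts nodes in new_root and how many are the same objects as in old_root.
def rb_sharing(old_root, new_root)
  old_nodes = {}.compare_by_identity
  rb_each_node(old_root) { |n| old_nodes[n] = true }
  total = shared = 0
  rb_each_node(new_root) { |n| total += 1; shared += 1 if old_nodes.key?(n) }
  [total, shared]
end

old_tree = nil
(1..100).each { |k| old_tree = rb_insert(old_tree, k, "shape-#{k}") }
new_tree = rb_insert(old_tree, 101, "shape-101")

total, shared = rb_sharing(old_tree, new_tree)
puts "new tree: #{total} nodes, #{shared} shared with the old tree"
p rb_lookup(old_tree, 101) # => nil (the old version is untouched)
p rb_lookup(new_tree, 101) # => "shape-101"
```

Only the nodes on the path from the root to the insertion point are reallocated, so both versions of the tree remain usable and most of their memory is common.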

I've added a new stat to RubyVM.stat so we can understand how the
red-black tree grows.

We're only going to create a red-black tree on platforms that have mmap.
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
@tenderlove tenderlove requested a review from nobu October 24, 2023 03:09
@byroot
Member

byroot commented Oct 24, 2023

So the implementation goes over my head for the most part, but a couple things.

If cache misses have decent complexity now, should we bump or remove the SHAPE_MAX_NUM_IVS constant?

ruby/shape.h, line 34 in c44d654:

# define SHAPE_MAX_NUM_IVS 80

It was added to avoid a factorial complexity situation when setting many ivars, which should be resolved now, right?

This significantly speeds up looking up a shape ancestor, but do you think #8650 still makes sense assuming the red-black tree is merged? I'd think it would for the supposedly common "close ancestor" case.

On another note, I triggered an internal Shopify CI build to see if it spots any issue with your PR.

Contributor

@jemmaissroff jemmaissroff left a comment

Looks really good, thanks for doing this! A couple small optional nits

shape.c
@@ -120,7 +348,7 @@ shape_alloc(void)
     shape_id_t shape_id = GET_SHAPE_TREE()->next_shape_id;
     GET_SHAPE_TREE()->next_shape_id++;

-    if (shape_id == MAX_SHAPE_ID) {
+    if (shape_id == (MAX_SHAPE_ID + 1)) {
Contributor

Why did this change?

Member Author

I wanted MAX_SHAPE_ID to be actually allocatable. We could probably roll it back, but since MAX_SHAPE_ID wasn't actually allocatable the name seemed odd.

@tenderlove
Member Author

If cache misses have decent complexity now, should we bump or remove the SHAPE_MAX_NUM_IVS constant?

We can just remove it. It's not even used anymore, except in tests. There's a bunch of other stuff I want to clean up after this lands, but I wanted to keep the diff as small as possible (I realize this is a big PR 😢 )

@tenderlove
Member Author

This significantly speeds up looking up a shape ancestor, but do you think #8650 still makes sense assuming the red-black tree is merged? I'd think it would for the supposedly common "close ancestor" case.

Yes, I think that PR still makes sense. The RB tree seems to only pay off when we need to examine more than ~10 shapes. I think "near shape" hints would complement this well. WDYT @jhawthorn?

@tenderlove tenderlove merged commit e71f343 into ruby:master Oct 24, 2023
109 of 191 checks passed
@tenderlove tenderlove deleted the rb-shape-index branch October 24, 2023 17:52
@jhawthorn
Member

The RB tree seems to only pay off when we need to examine more than ~10 shapes. I think "near shape" hints would complement this well.

I'm not sure. After this change, I think the bigger benefit of that PR would be Jean's change to avoid flip-flopping the cache, rather than the speed of ivar lookup.

@byroot
Member

byroot commented Oct 24, 2023

the speed of ivar lookup.

Well, the assumption is that in the rather common case of the "memoization" pattern, the common ancestor is very close, so in many cases we might get away with just checking two or three shapes.
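
For readers unfamiliar with the pattern, here is a small illustration of the "memoization" case being discussed. The `Report` class is a made-up example, not taken from the benchmarks in this thread. It shows only the observable part: two instances of the same class end up with different sets of ivars depending on whether the memoized method has been called yet, which is why their shapes would be expected to differ by a single step.

```ruby
# Hypothetical example class; only @rows is set eagerly.
class Report
  def initialize(rows)
    @rows = rows
  end

  # @total is assigned lazily, the first time #total is called.
  def total
    @total ||= @rows.sum
  end
end

fresh = Report.new([1, 2, 3])
used  = Report.new([4, 5, 6])
used.total

p fresh.instance_variables # => [:@rows]
p used.instance_variables  # => [:@rows, :@total]
```

A call site that reads `@rows` sees both kinds of object, so its cached shape may not match even though the ivar it needs was set in a shape only one step away.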

It's also very helpful for objects that are sometimes frozen and sometimes not, like Set.

I'll try to find some time to benchmark this again now that the index was merged, see if I can demonstrate a substantial perf gain.

@casperisfine
Contributor

Alright, so something I totally missed until now is that all shapes with more than ANCESTOR_CACHE_THRESHOLD ivars have an index. Somehow I thought only one out of every ANCESTOR_CACHE_THRESHOLD shapes had one.

So it kinda changes things.

Still, I benchmarked my (unmodified) branch against the newest master, and it's still significantly faster in the "close ancestor" case, which I suspect is common both with light memoization and with frozen objects.

/usr/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
	            --executables="compare-ruby::../miniruby-master -I.ext/common --disable-gem" \
	            --executables="built-ruby::./miniruby --disable-gem" \
	            --output=markdown --output-compare -v $(find ./benchmark -maxdepth 1 -name 'vm_ivar_memoize' -o -name '*vm_ivar_memoize*.yml' -o -name '*vm_ivar_memoize*.rb' | sort) 
compare-ruby: ruby 3.3.0dev (2023-10-25T07:50:00Z master 526292d9fe) [arm64-darwin22]
last_commit=LLDB: Use `expression` to save the result into the history [ci skip]
built-ruby: ruby 3.3.0dev (2023-10-25T08:53:12Z shapes_double_sear.. 63014a9464) [arm64-darwin22]
warming up......
# Iteration per second (i/s)

|                                     |compare-ruby|built-ruby|
|:------------------------------------|-----------:|---------:|
|vm_ivar_stable_shape                 |     11.145M|   11.714M|
|                                     |           -|     1.05x|
|vm_ivar_memoize_unstable_shape       |      7.534M|    9.818M|
|                                     |           -|     1.30x|
|vm_ivar_memoize_unstable_shape_miss  |     11.583M|   11.011M|
|                                     |       1.05x|         -|
|vm_ivar_unstable_undef               |      8.546M|    8.176M|
|                                     |       1.05x|         -|
|vm_ivar_divergent_shape              |      7.899M|  863.465k|
|                                     |       9.15x|         -|
|vm_ivar_divergent_shape_imbalanced   |     10.108M|    2.271M|
|                                     |       4.45x|         -|

So I think it could make sense to check a small number of ancestors before giving up and using the index.
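
A rough sketch of that idea, under stated assumptions: `IndexedShape`, `hybrid_find`, and `MAX_WALK` are hypothetical names, a frozen Hash stands in for the red-black tree index, and the limit of 3 is an arbitrary choice for the example rather than a value proposed in this thread.

```ruby
# Hypothetical model: every shape carries a name => index map of all of
# its ancestors (a Hash here, standing in for the red-black tree).
IndexedShape = Struct.new(:parent, :ivar_name, :index, :ancestor_index)
INDEXED_ROOT = IndexedShape.new(nil, nil, nil, {}.freeze)
MAX_WALK = 3 # assumed limit on how many shapes to check by hand

def indexed_add_ivar(shape, name)
  idx = shape.index.nil? ? 0 : shape.index + 1
  IndexedShape.new(shape, name, idx, shape.ancestor_index.merge(name => idx).freeze)
end

# Check up to max_walk nearby shapes first, then fall back to the index.
# Returns [index_or_nil, :walk or :index] so the path taken is visible.
def hybrid_find(shape, name, max_walk: MAX_WALK)
  cursor = shape
  max_walk.times do
    break if cursor.nil? || cursor.ivar_name.nil?
    return [cursor.index, :walk] if cursor.ivar_name == name
    cursor = cursor.parent
  end
  [shape.ancestor_index[name], :index]
end

leaf = INDEXED_ROOT
20.times { |i| leaf = indexed_add_ivar(leaf, :"@v#{i}") }

p hybrid_find(leaf, :@v19)  # => [19, :walk]   close ancestor: no index lookup
p hybrid_find(leaf, :@v0)   # => [0, :index]   distant ancestor: uses the index
p hybrid_find(leaf, :@nope) # => [nil, :index] miss: resolved by the index
```

The trade-off is that a distant hit or a miss pays for the short walk on top of the index lookup, so the limit would need to be tuned against benchmarks like the ones above.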
