Speed up instance variable cache misses #8744

tenderlove · 2023-10-23T23:24:16Z

This PR speeds up instance variable cache misses by introducing a red-black tree as a shape cache. With the red-black tree, we can easily check whether an instance variable has been set and what index the IV uses. Before this change, in the worst case, IV lookup would be O(n), but with the red-black tree, the worst case is O(log n) (where n == shape depth).
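To make the O(n) worst case concrete, here is a minimal sketch of a lookup that walks the shape chain from the current shape back toward the root. It is a toy model, not the C code in shape.c: `ShapeNode`, `build_chain`, and `linear_lookup` are hypothetical names, and each shape is assumed to record only the one ivar its transition added.

```ruby
# Toy model of a shape chain. ShapeNode, edge_name and iv_index are
# hypothetical names chosen for illustration, not CRuby's structs.
ShapeNode = Struct.new(:parent, :edge_name, :iv_index)

# Build a chain root -> name[0] -> name[1] -> ..., one ivar per transition.
# Returns the leaf shape (the shape of an object with every ivar set).
def build_chain(names)
  names.each_with_index.reduce(nil) do |parent, (name, idx)|
    ShapeNode.new(parent, name, idx)
  end
end

# Walk from the given shape toward the root until the transition that
# added `name` is found. Returns [index, steps]; index is nil when the
# ivar was never set. The number of steps grows with the shape depth.
def linear_lookup(shape, name)
  steps = 0
  while shape
    steps += 1
    return [shape.iv_index, steps] if shape.edge_name == name
    shape = shape.parent
  end
  [nil, steps]
end

leaf = build_chain((0...75).map { |i| :"@a#{i}" })
p linear_lookup(leaf, :@a74) # most recently added ivar: found on the first step
p linear_lookup(leaf, :@a0)  # oldest ivar: the whole chain is visited
```

In this model, looking up the oldest ivar (or one that was never set) visits every shape in the chain, which is the case the red-black tree index is meant to avoid.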

I added a benchmark to show the difference:

$ make benchmark ITEM=vm_ivar_ic_miss
/Users/aaron/.rubies/arm64/ruby-trunk/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
	            --executables="compare-ruby::/Users/aaron/.rubies/arm64/ruby-trunk/bin/ruby --disable=gems -I.ext/common --disable-gem" \
	            --executables="built-ruby::./miniruby -I./lib -I. -I.ext/common  ./tool/runruby.rb --extout=.ext  -- --disable-gems --disable-gem" \
	            --output=markdown --output-compare -v $(find ./benchmark -maxdepth 1 -name 'vm_ivar_ic_miss' -o -name '*vm_ivar_ic_miss*.yml' -o -name '*vm_ivar_ic_miss*.rb' | sort) 
compare-ruby: ruby 3.3.0dev (2023-10-23T15:37:50Z master 62c674f98c) [arm64-darwin23]
built-ruby: ruby 3.3.0dev (2023-10-23T22:56:14Z rb-shape-index 4a7b24be69) [arm64-darwin23]
# Iteration per second (i/s)

|                 |compare-ruby|built-ruby|
|:----------------|-----------:|---------:|
|vm_ivar_ic_miss  |      3.465M|   23.031M|
|                 |           -|     6.65x|

This change has an impact on YJIT as well. When IV sites become megamorphic, rather than exit, the JIT will generate machine code that does a "slow path" read on the IV. Below is a benchmark to demonstrate:

class Foo
  def initialize idx
    case idx
    when 0; then @c0 = 1
    when 1; then @c1 = 1
    when 2; then @c2 = 1
    when 3; then @c3 = 1
    when 4; then @c4 = 1
    when 5; then @c5 = 1
    when 6; then @c6 = 1
    when 7; then @c7 = 1
    when 8; then @c8 = 1
    end

    @a0 = @a1 = @a2 = @a3 = @a4 = @a5 = @a6 = @a7 = @a8 = @a9 = @a10 = @a11 = @a12 = @a13 = @a14 = @a15 = @a16 = @a17 = @a18 = @a19 = @a20 = @a21 = @a22 = @a23 = @a24 = @a25 = @a26 = @a27 = @a28 = @a29 = @a30 = @a31 = @a32 = @a33 = @a34 = @a35 = @a36 = @a37 = @a38 = @a39 = @a40 = @a41 = @a42 = @a43 = @a44 = @a45 = @a46 = @a47 = @a48 = @a49 = @a50 = @a51 = @a52 = @a53 = @a54 = @a55 = @a56 = @a57 = @a58 = @a59 = @a60 = @a61 = @a62 = @a63 = @a64 = @a65 = @a66 = @a67 = @a68 = @a69 = @a70 = @a71 = @a72 = @a73 = @a74 = @b = 1
  end

  def b; @b; end
end

# Force the `@b` site to be megamorphic
9.times { Class.new(Foo).new(_1).b }
obj = Foo.new(8)
60000000.times { obj.b }

With YJIT on master:

$ time ruby --yjit --yjit-call-threshold=2 -v test.rb
ruby 3.3.0dev (2023-10-23T15:37:50Z master 62c674f98c) +YJIT [arm64-darwin23]

________________________________________________________
Executed in   10.37 secs    fish           external
   usr time   10.28 secs   30.29 millis   10.25 secs
   sys time    0.13 secs    5.12 millis    0.12 secs

With YJIT on this branch:

$ time ./ruby --yjit --yjit-call-threshold=2 -v test.rb
ruby 3.3.0dev (2023-10-23T22:56:14Z rb-shape-index 4a7b24be69) +YJIT [arm64-darwin23]

________________________________________________________
Executed in    2.61 secs    fish           external
   usr time    2.45 secs    0.10 millis    2.45 secs
   sys time    0.04 secs    2.15 millis    0.04 secs

This is an experimental commit that uses a functional red-black tree to create an index of the ancestor shapes. It uses an Okasaki-style functional red-black tree:

https://www.cs.tufts.edu/comp/150FP/archive/chris-okasaki/redblack99.pdf

This tree is advantageous because:

* It offers O(log n) insertions and O(log n) lookups.
* It shares memory with previous "versions" of the tree.

When we insert a node in the tree, only the parts of the tree that need to be rebalanced are newly allocated. Parts of the tree that don't need to be rebalanced are not reallocated, so "new trees" are able to share memory with old trees. This is in contrast to a sorted set, where we would have to duplicate the set and also re-sort it on each insertion.

I've added a new stat to RubyVM.stat so we can understand how the red-black tree grows.
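The memory sharing described above can be sketched in a few lines of Ruby. This is a minimal illustration of Okasaki-style insertion, not the C implementation from this PR: `RBNode`, `rb_insert`, and `rb_lookup` are hypothetical names, and plain integers stand in for the ivar IDs used as keys.

```ruby
# Immutable-by-convention tree node. color is :red or :black.
# All names here are illustrative, not taken from shape.c.
RBNode = Struct.new(:color, :left, :key, :value, :right)

def rb_red?(node)
  !node.nil? && node.color == :red
end

def rb_lookup(node, key)
  while node
    return node.value if key == node.key
    node = key < node.key ? node.left : node.right
  end
  nil
end

# Okasaki's balance step: a black node with a red child that itself has
# a red child is rewritten into a red node with two black children.
def rb_balance(color, l, k, v, r)
  if color == :black
    if rb_red?(l) && rb_red?(l.left)
      ll = l.left
      return RBNode.new(:red, RBNode.new(:black, ll.left, ll.key, ll.value, ll.right),
                        l.key, l.value, RBNode.new(:black, l.right, k, v, r))
    elsif rb_red?(l) && rb_red?(l.right)
      lr = l.right
      return RBNode.new(:red, RBNode.new(:black, l.left, l.key, l.value, lr.left),
                        lr.key, lr.value, RBNode.new(:black, lr.right, k, v, r))
    elsif rb_red?(r) && rb_red?(r.left)
      rl = r.left
      return RBNode.new(:red, RBNode.new(:black, l, k, v, rl.left),
                        rl.key, rl.value, RBNode.new(:black, rl.right, r.key, r.value, r.right))
    elsif rb_red?(r) && rb_red?(r.right)
      rr = r.right
      return RBNode.new(:red, RBNode.new(:black, l, k, v, r.left),
                        r.key, r.value, RBNode.new(:black, rr.left, rr.key, rr.value, rr.right))
    end
  end
  RBNode.new(color, l, k, v, r)
end

# Copies only the nodes on the path to the insertion point; subtrees off
# that path are passed through untouched and therefore shared.
def rb_ins(node, key, value)
  return RBNode.new(:red, nil, key, value, nil) if node.nil?
  if key < node.key
    rb_balance(node.color, rb_ins(node.left, key, value), node.key, node.value, node.right)
  elsif key > node.key
    rb_balance(node.color, node.left, node.key, node.value, rb_ins(node.right, key, value))
  else
    node # key already present: keep the existing node
  end
end

# Returns a new root (always black); the old root stays valid.
def rb_insert(root, key, value)
  n = rb_ins(root, key, value)
  RBNode.new(:black, n.left, n.key, n.value, n.right)
end

old_tree = (1..7).reduce(nil) { |t, k| rb_insert(t, k, k * 10) }
new_tree = rb_insert(old_tree, 8, 80)
p rb_lookup(old_tree, 8)                   # the old version does not see key 8
p rb_lookup(new_tree, 8)                   # the new version does
p new_tree.left.equal?(old_tree.left)      # the untouched left subtree is the same object
```

Inserting key 8 only rebuilds nodes along the right-hand path, so the old tree keeps answering lookups as before while the new tree reuses the old left subtree.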

We're only going to create a redblack tree on platforms that have mmap

Update benchmark/vm_ivar_ic_miss.yml

Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>

byroot · 2023-10-24T13:02:36Z

So the implementation goes over my head for the most part, but a couple things.

If cache misses have decent complexity now, should we bump or remove the SHAPE_MAX_NUM_IVS constant?

ruby/shape.h

Line 34 in c44d654

# define SHAPE_MAX_NUM_IVS 80

It was added to avoid a factorial-complexity situation when setting many ivars, which should be resolved now, right?

This significantly speeds up looking up a shape ancestor, but do you think #8650 still makes sense assuming the red-black tree is merged? I'd think it would for the supposedly common "close ancestor" case.

On another note, I triggered an internal Shopify CI build to see if it spots any issues with your PR.

jemmaissroff

Looks really good, thanks for doing this! A couple small optional nits

jemmaissroff · 2023-10-24T13:02:45Z

shape.c

@@ -120,7 +348,7 @@ shape_alloc(void)
    shape_id_t shape_id = GET_SHAPE_TREE()->next_shape_id;
    GET_SHAPE_TREE()->next_shape_id++;

-    if (shape_id == MAX_SHAPE_ID) {
+    if (shape_id == (MAX_SHAPE_ID + 1)) {


Why did this change?

tenderlove

I wanted MAX_SHAPE_ID to be actually allocatable. We could probably roll it back, but since MAX_SHAPE_ID wasn't actually allocatable the name seemed odd.
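The off-by-one being discussed can be illustrated with a small Ruby model. This is only a sketch of the comparison in the diff above: `MAX_ID` is a tiny stand-in for MAX_SHAPE_ID, `allocatable_ids` is a hypothetical helper, and what the real `shape_alloc` does once the limit is hit is not modeled.

```ruby
MAX_ID = 7 # small stand-in for MAX_SHAPE_ID, chosen for illustration

# Hand out IDs the way the diff does (read next_id, then increment it)
# and stop as soon as the freshly read ID equals `limit`.
def allocatable_ids(limit)
  ids = []
  next_id = 0
  loop do
    id = next_id
    next_id += 1
    break if id == limit
    ids << id
  end
  ids
end

p allocatable_ids(MAX_ID).last     # old check `id == MAX_ID`: MAX_ID itself is never handed out
p allocatable_ids(MAX_ID + 1).last # new check `id == MAX_ID + 1`: MAX_ID is the last valid ID
```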

byroot · 2023-10-24T13:49:20Z

On another note, I triggered an internal Shopify CI build to see if it spots any issues with your PR.

✅

tenderlove · 2023-10-24T15:41:52Z

If cache misses have decent complexity now, should we bump or remove the SHAPE_MAX_NUM_IVS constant?

We can just remove it. It's not even used anymore, except in tests. There's a bunch of other stuff I want to clean up after this lands, but I wanted to keep the diff as small as possible (I realize this is a big PR 😢 )

tenderlove · 2023-10-24T16:02:49Z

This significantly speeds up looking up a shape ancestor, but do you think #8650 still makes sense assuming the red-black tree is merged? I'd think it would for the supposedly common "close ancestor" case.

Yes, I think that PR still makes sense. The RB tree seems to only pay off when we need to examine more than ~10 shapes. I think "near shape" hints would complement this well. WDYT @jhawthorn?

jhawthorn · 2023-10-24T22:06:55Z

The RB tree seems to only pay off when we need to examine more than ~10 shapes. I think "near shape" hints would complement this well.

I'm not sure. After this change, I think the bigger benefit of that PR would be Jean's change to avoid flip-flopping the cache, rather than the speed of ivar lookup.

byroot · 2023-10-24T22:10:19Z

the speed of ivar lookup.

Well, the assumption is that in the rather common case of the "memoization" pattern, the common ancestor is very close, so in many cases we might get away with just checking two or three shapes.

It's also very helpful for objects that are sometimes frozen and sometimes not, like Set.

I'll try to find some time to benchmark this again now that the index was merged, see if I can demonstrate a substantial perf gain.

casperisfine · 2023-10-25T09:25:12Z

Alright, so something I totally missed until now is that all shapes with more than ANCESTOR_CACHE_THRESHOLD ivars have an index; somehow I thought only one out of every ANCESTOR_CACHE_THRESHOLD shapes had one.

So it kinda changes things.

Still, I benchmarked my (unmodified) branch against the newest master, and it's still significantly faster in the "close ancestor" case, which I suspect is common both with light memoization and with frozen objects.

/usr/bin/ruby --disable=gems -rrubygems -I./benchmark/lib ./benchmark/benchmark-driver/exe/benchmark-driver \
	            --executables="compare-ruby::../miniruby-master -I.ext/common --disable-gem" \
	            --executables="built-ruby::./miniruby --disable-gem" \
	            --output=markdown --output-compare -v $(find ./benchmark -maxdepth 1 -name 'vm_ivar_memoize' -o -name '*vm_ivar_memoize*.yml' -o -name '*vm_ivar_memoize*.rb' | sort) 
compare-ruby: ruby 3.3.0dev (2023-10-25T07:50:00Z master 526292d9fe) [arm64-darwin22]
last_commit=LLDB: Use `expression` to save the result into the history [ci skip]
built-ruby: ruby 3.3.0dev (2023-10-25T08:53:12Z shapes_double_sear.. 63014a9464) [arm64-darwin22]
warming up......
# Iteration per second (i/s)

|                                     |compare-ruby|built-ruby|
|:------------------------------------|-----------:|---------:|
|vm_ivar_stable_shape                 |     11.145M|   11.714M|
|                                     |           -|     1.05x|
|vm_ivar_memoize_unstable_shape       |      7.534M|    9.818M|
|                                     |           -|     1.30x|
|vm_ivar_memoize_unstable_shape_miss  |     11.583M|   11.011M|
|                                     |       1.05x|         -|
|vm_ivar_unstable_undef               |      8.546M|    8.176M|
|                                     |       1.05x|         -|
|vm_ivar_divergent_shape              |      7.899M|  863.465k|
|                                     |       9.15x|         -|
|vm_ivar_divergent_shape_imbalanced   |     10.108M|    2.271M|
|                                     |       4.45x|         -|

So I think it could make sense to check a small number of ancestors before giving up and using the index.
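A rough sketch of that idea, under stated assumptions: the lookup first checks a few of the nearest ancestors and only then consults the index. `IndexedShape`, `NEAR_ANCESTOR_LIMIT`, `add_ivar`, and `hybrid_lookup` are hypothetical names, a plain Hash stands in for the per-shape red-black tree, and every shape carries an index here, which ignores ANCESTOR_CACHE_THRESHOLD.

```ruby
# Toy model: `index` maps ivar name => ivar index for the whole ancestry.
# A Hash stands in for the red-black tree; names are hypothetical.
IndexedShape = Struct.new(:parent, :edge_name, :iv_index, :index)

NEAR_ANCESTOR_LIMIT = 3 # hypothetical tuning knob, not a constant from the PR

# Create the child shape reached by setting `name` on `parent`.
# Hash#merge copies the whole index, unlike the memory-sharing tree.
def add_ivar(parent, name)
  idx = parent ? parent.iv_index + 1 : 0
  index = (parent ? parent.index : {}).merge(name => idx)
  IndexedShape.new(parent, name, idx, index)
end

# Check the nearest ancestors first, then fall back to the index.
# Returns [iv_index_or_nil, :near or :index] so the path taken is visible.
def hybrid_lookup(shape, name)
  cursor = shape
  NEAR_ANCESTOR_LIMIT.times do
    break if cursor.nil?
    return [cursor.iv_index, :near] if cursor.edge_name == name
    cursor = cursor.parent
  end
  [shape.index[name], :index]
end

leaf = (0...20).reduce(nil) { |s, i| add_ivar(s, :"@a#{i}") }
p hybrid_lookup(leaf, :@a19) # recent ivar: resolved by the short walk
p hybrid_lookup(leaf, :@a0)  # distant ivar: resolved through the index
p hybrid_lookup(leaf, :@nope) # unset ivar: the index reports nil
```

Whether the short walk pays for itself depends on how often the target is a close ancestor, which is what the benchmark table above is trying to measure.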

tenderlove requested review from byroot, jhawthorn and jemmaissroff October 23, 2023 23:24

tenderlove added 5 commits October 23, 2023 16:35

increase the maximum number of ivs

e85271b

remove IV limit / support complex shapes on classes

2d6ad3c

geniv objects can become too complex

6d22d3e

Don't cache on platforms without mmap

f69a8e4

We're only going to create a redblack tree on platforms that have mmap

tenderlove force-pushed the rb-shape-index branch from 4a7b24b to f69a8e4 Compare October 23, 2023 23:35

updating bindgen

41f34e5

nobu reviewed Oct 24, 2023

benchmark/vm_ivar_ic_miss.yml (outdated)

Update benchmark/vm_ivar_ic_miss.yml

2681d52

Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>

tenderlove requested a review from nobu October 24, 2023 03:09

jemmaissroff approved these changes Oct 24, 2023

Addressing feedback

ae89520

tenderlove force-pushed the rb-shape-index branch from 808fa57 to ae89520 Compare October 24, 2023 16:34

tenderlove merged commit e71f343 into ruby:master Oct 24, 2023
109 of 191 checks passed

tenderlove deleted the rb-shape-index branch October 24, 2023 17:52

nevinera mentioned this pull request Dec 30, 2023

Add initializer support method for object shape optimization tycooon/memery#44

Closed
