Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the NEWOBJ API narrower #7393

Merged
merged 5 commits into from Apr 6, 2023

Conversation

eightbitraptor
Copy link
Contributor

@eightbitraptor eightbitraptor commented Feb 27, 2023

Since the introduction of variable width allocation with RVARGC there are a lot of different *NEWOBJ* macros. Currently there are:

  • RB_RVARGC_NEWOBJ_OF
  • RB_RVARGC_EC_NEWOBJ_OF
  • RB_NEWOBJ_OF, an alias of RB_RVARGC_NEWOBJ_OF
  • RB_EC_NEWOBJ_OF, an alias of RB_RVARGC_EC_NEWOBJ_OF
  • NEWOBJ_OF, an alias of RB_RVARGC_NEWOBJ_OF
  • RVARGC_NEWOBJ_OF, an alias of RB_RVARGC_NEWOBJ_OF

This PR merges RB_RVARGC_NEWOBJ_OF,RB_RVARGC_EC_NEWOBJ_OF, RVARGC_NEWOBJ_OF, RB_NEWOBJ_OF and NEWOBJ_OF into a single macro that takes an execution context as an argument (which can be NULL; if so then the current execution context is found and used using GET_EC()).

The resulting single macro has been named NEWOBJ_OF to reflect that it should now be the only way of creating new objects.

Both RB_NEWOBJ_OF and NEWOBJ_OF have seperate implementations that are part of the public API exposed in include/ruby/internal/newobj.h so they are available to extension authors. These have not been modified in any way.

Note that implementing this PR required gc.h to include vm_core.h. This was not possibly because vm_core.h included shape.h, which included gc.h creating a circular dependency.

To address this I have removed the shape_list, next_shape_id and root_shape members from rb_vm_struct and instead promoted them to their own type: rb_shape_tree_t. This is defined as a global object accessible using rb_shape_tree_ptr and the GET_SHAPE_TREE() function macro.

In practice, this means that GET_VM()->shape_list etc are now replaced with GET_SHAPE_TREE()->shape_list.

@matzbot matzbot requested a review from a team February 27, 2023 16:59
@eightbitraptor eightbitraptor force-pushed the mvh-narrow-newobj-api branch 2 times, most recently from 9417280 to c735d37 Compare February 28, 2023 14:35
@nobu
Copy link
Member

nobu commented Mar 1, 2023

In the commit "Remove RB_RVARGC_EC_NEWOBJ_OF and RB_EC_NEWOBJ_OF", "teh" seems a typo.

@eightbitraptor eightbitraptor force-pushed the mvh-narrow-newobj-api branch 4 times, most recently from a2bd681 to c61a058 Compare March 2, 2023 20:42
Copy link
Member

@peterzhu2118 peterzhu2118 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a huge fan of passing in the EC to RVARGC_NEWOBJ_OF since the EC is just an implementation detail of the current GC, so in the future we should think about how we can get rid of passing the EC into the GC.

ext/socket/extconf.rb Show resolved Hide resolved
include/ruby/internal/core/rbasic.h Outdated Show resolved Hide resolved
ractor_core.h Show resolved Hide resolved
internal/gc.h Outdated Show resolved Hide resolved
mjit_c.rb Outdated Show resolved Hide resolved
@eightbitraptor eightbitraptor force-pushed the mvh-narrow-newobj-api branch 3 times, most recently from fe3148c to 5cc67cc Compare March 7, 2023 20:51
k0kubun added a commit that referenced this pull request Mar 8, 2023
This is to fix a weird bindgen behavior on Matt's branch:
#7393
@eightbitraptor eightbitraptor force-pushed the mvh-narrow-newobj-api branch 2 times, most recently from 0df5558 to 5607e41 Compare March 8, 2023 21:45
internal/gc.h Outdated Show resolved Hide resolved
internal/gc.h Outdated Show resolved Hide resolved
iseq.h Outdated Show resolved Hide resolved
internal/gc.h Outdated Show resolved Hide resolved
@eightbitraptor eightbitraptor force-pushed the mvh-narrow-newobj-api branch 6 times, most recently from 2b2630d to c8bd5ef Compare March 16, 2023 19:39
@eightbitraptor eightbitraptor force-pushed the mvh-narrow-newobj-api branch 2 times, most recently from b7c7181 to 0e0c7e5 Compare March 16, 2023 21:38
@ko1
Copy link
Contributor

ko1 commented Mar 20, 2023

I'm not a huge fan of passing in the EC to RVARGC_NEWOBJ_OF since the EC is just an implementation detail of the current GC, so in the future we should think about how we can get rid of passing the EC into the GC.

The background to pass the ec is here:

  • ec is needed for Ractor systems (from 3.0)
  • Getting ec from TLS (thread-local storage) can have overhead
  • Therefore I made a plan to migrate all C-API to have the first parameter ec (rbc_str_new(c, ...) for example, in this case c is "context"). It is like mrb_ APIs of mruby.
  • This is why Primitive for builtin methods receive the ec parameter from VM.

However, now C-compiler specific TLS feature (__thread on gcc for example https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html) provides nice performance on current Ruby (as I observed).

So if you measured at least microbenchmarks the following:

  • with passing ec (current macro)
  • without passing ec (and GET_EC() everytime)

and if we don't have a difference, I agree to avoid passing the ec.

@ko1
Copy link
Contributor

ko1 commented Mar 20, 2023

(which can be NULL; if so then the current execution context is found and used using GET_EC()).

For me, this kind of branch should be avoided to reduce branches.

@eightbitraptor
Copy link
Contributor Author

I got around to looking at these benchmarks recently, and to summarise the situation so far:

I refactored the NEWOBJ api's to reduce complexity at the call sites. This has resulted in an api that requires us to either pass an ec or NULL to the NEWOBJ_OF internal macro.

@peterzhu2118 had a concern that requiring call sites to pass the ec was leaking GC abstractions.

@ko1 was concerned that not passing the ec and instead relying on GET_EC might be a performance problem because of the overhead incurred by accessing a thread local every time.

Additionally @ko1 was concerned that the extra branch point required now in the NEWOBJ_OF macro might also cause a performance impact.

I built a micro-benchmark, designed to produce many putstring instructions ran it across several variations of this feature.

This is the benchmark:

eval("def string_literal; foo = [#{ 10_000.times.map { "'test'" }.join(",") }]; nil; end")

before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
10000.times { string_literal }
p Process.clock_gettime(Process::CLOCK_MONOTONIC) - before

These are the branches I used, and how they compare:

  • master - passes EC explicitly in the cases of the putstring and newarray instructions. So that we don't have to use GET_EC to retreive the thread local ec variable.

  • mvh-narrow-newobj-api - Based on master but with the newobj paths in gc.c and the NEWOBJ_OF macros refactored to reduce paths through the code. ec still must be explicitly passed, although this is now decided using a branch generated by the macro, rather than by using completely seperate code paths as in master.

  • mvh-narrow-newobj-api-macro-branch - Same as mvh-narrow-newobj-api. but additionally removes the branch point, by using some pre-processor string concatenation and dynamic macro name generation to handle the branch at the pre-processor level.

  • mvh-narrow-newobj-api-remove-ec - This builds on top of the refactors carried out in mvh-narrow-newobj-api but also completely removes the code path that passes the ec in the putstring and newarray instructions, relying instead on always getting the ec from the thread. This eliminates all the extra code paths required to check the presence of the ec.

results

Screenshot 2023-03-28 at 17 21 29

master mvh-narrow-newobj-api mvh-narrow-newobj-api-remove-ec mvh-narrow-newbj-api-macro-branch
1 3.50582845 3.51671825 3.42554660 3.62113722
2 3.52347104 3.50921376 3.42405504 3.63043113
3 3.50835028 3.51863772 3.48882321 3.62117951
4 3.50525342 3.51858662 3.43421950 3.60670734
5 3.50967737 3.51790606 3.42857820 3.60528875
avg 3.51051611 3.51621248 3.44024451 3.61694879

Benchmarking was done on Fedora 37 on a Ryzen 3600, running at a fixed 3.6GHz, with each process restricted to a single CPU core and with ASLR disabled, to try and reduce variance.

We can see from these results that mvh-narrow-newobj-api-remove-ec is consitently faster than all the other branches. From this I conclude that the compiler optimisations for thread locals are good enough now to offset the performance hit caused by repeatedly accessing the thread local ec using GET_EC.

So I would like to merge the patches that remove the ec from the internal NEWOBJ_OF into this branch and merge it.

@eightbitraptor
Copy link
Contributor Author

One thing I don't have a good answer for at this time is why mvh-narrow-newobj-api-macro-branch is slower than mvh-narrow-newobj-api, given that it's the same functionality but missing a branch.

We can even see this by looking at the generated code for the rb_ec_str_ressurrect function with objdump:

objdump -M intel --disassemble=rb_ec_str_resurrect --no-addresses miniruby | wc -l
  • on the mvh-narrow-newobj-api the generated asm is 193 lines
  • on the mvh-narrow-newobj-api-macro-branch branch the generated asm is 181 lines

In addition, in mvh-narrow-newobj-api we can clearly see the calls to rb_current_ec where it gets the thread local

❯ objdump -M intel --disassemble=rb_ec_str_resurrect --no-addresses miniruby | grep ruby_current_ec
        48 8b 05 61 5f 2b 00    mov    rax,QWORD PTR [rip+0x2b5f61]        # <ruby_current_ec@@Base+0x50dca0>
        48 8b 05 29 5f 2b 00    mov    rax,QWORD PTR [rip+0x2b5f29]        # <ruby_current_ec@@Base+0x50dca0>

So given that it generates more code, and that extra code is fetching the ec, I can't explain why it's faster.

I don't think this is anything to worry about, given that removing the ec is faster than both of these branches. But it's a curiosity I can't explain.

@tenderlove
Copy link
Member

We can see from these results that mvh-narrow-newobj-api-remove-ec is consitently faster than all the other branches. From this I conclude that the compiler optimisations for thread locals are good enough now to offset the performance hit caused by repeatedly accessing the thread local ec using GET_EC.

We should probably try this benchmark with more than 1 thread live in the system. I'd guess we'll see stalls when there are multiple threads?

@eightbitraptor
Copy link
Contributor Author

@tenderlove It looks like you're right. I think restricting the benchmark too far was skewing the results.

I changed the benchmark script to this:

eval("def string_literal; foo = [#{ 10_000.times.map { "'test'" }.join(",") }]; nil; end")

before = Process.clock_gettime(Process::CLOCK_MONOTONIC)

5.times.map do
  Thread.new do
    10000.times { string_literal }
  end
end.map(&:join)

p Process.clock_gettime(Process::CLOCK_MONOTONIC) - before

And ran all the benchmarks again, outside of the cgroup so they can be moved between cores. And the numbers I got were as follows (averages of 20 runs):

feature time
refactor 16.18893562
refactor+macro 15.64237013
refactor+ec-removal 17.37436765

So, as expected, this time removing the explicit ec and relying on GET_EC is the slowest, and there does seem to be a decent speedup by removing the branch from the code, and relying on the pre-processor, which makes sense as there's fewer machine instructions generated on this branch.

If you're happy with this, then I'll tidy up the commits and merge the refactor with the macro.

@tenderlove
Copy link
Member

If you're happy with this, then I'll tidy up the commits and merge the refactor with the macro.

Sounds good to me!

We can just make newobj_of take a ractor
so that now shape can happily include gc.h
NEWOBJ_OF is now our canonical newobj macro. It takes an optional ec
The socket extensions rubysocket.h pulls in the "private" include/gc.h,
which now depends on vm_core.h. vm_core.h pulls in id.h

when tool/update-deps generates the dependencies for the makefiles, it
generates the line for id.h to be based on VPATH, which is configured in
the extconf.rb for each of the extensions. By default VPATH does not
include the actual source directory of the current Ruby so the
dependency fails to resolve and linking fails.

We need to append the topdir and top_srcdir to VPATH to have the
dependancy picked up correctly (and I believe we need both of these to
cope with in-tree and out-of-tree builds).

I copied this from the approach taken in
https://github.com/ruby/ruby/blob/master/ext/objspace/extconf.rb#L3
@eightbitraptor eightbitraptor merged commit 2a34bca into ruby:master Apr 6, 2023
98 checks passed
@eightbitraptor eightbitraptor deleted the mvh-narrow-newobj-api branch April 6, 2023 10:07
@ko1
Copy link
Contributor

ko1 commented Dec 28, 2023

We should probably try this benchmark with more than 1 thread live in the system. I'd guess we'll see stalls when there are multiple threads?

Sorry why does it introduce stalls?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants