Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError #7522

jeremyevans · 2023-03-14T23:35:23Z

Originally, when 2e7bceb fixed cfuncs to no longer use the VM stack for large array splats, it was thought to have fully fixed Bug #4040, since the issue was fixed for methods defined in Ruby (iseqs) back in Ruby 2.2.

After additional research, I determined that same issue affects almost all types of method calls, not just iseq and cfunc calls. There were two main types of remaining issues, important cases (where large array splat should work) and pedantic cases (where large array splat raised SystemStackError instead of ArgumentError).

Important cases:

# bmethod
define_method(:a){|*a|}
a(*1380888.times)

# send
def b(*a); end
send(:b, *1380888.times)

# symproc
:b.to_proc.call(self, *1380888.times)

# method to proc
def d; yield(*1380888.times) end
d(&method(:b))

# method_missing
def self.method_missing(*a); end
not_a_method(*1380888.times)

Pedantic cases:

# iseq with only required or optional arguments
def a; end
a(*1380888.times)
def b(_); end
b(*1380888.times)
def c(_=nil); end
c(*1380888.times)

# attr reader/writer
c = Class.new do
  attr_accessor :a
  alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)

# Struct aref/aset
c = Struct.new(:a) do
  alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)

This patch fixes all usage of CALLER_SETUP_ARG with splatting a large number of arguments, and required similar fixes to use a temporary hidden array in three other cases where the VM would use the VM stack for handling a large number of arguments. However, it is possible there may be additional cases where splatting a large number of arguments still causes a SystemStackError.

This has a measurable performance impact, as it requires additional checks for a large number of arguments in many additional cases.

This change is fairly invasive, as there were many different VM functions that needed to be modified to support this. To avoid too much API change, I modified struct rb_calling_info to add a heap_argv member for storing the array, so I would not have to thread it through many functions. This struct is always stack allocated, which helps ensure sure GC doesn't collect it early.

Because of how invasive the changes are, and how rarely large arrays are actually splatted in Ruby code, the existing test/spec suites are not great at testing for correct behavior. To try to find and fix all issues, I set VM_ARGC_STACK_MAX to -1, ensuring that a temporary array is used for all array splat method calls. This was very helpful in finding breaking cases, especially ones involving flagged keyword hashes.

Fixes [Bug #4040]

Note that this initial commit uses -1 as VM_ARGC_STACK_MAX. This is for testing, to ensure that a temporary array is used for almost all argument splats. This makes things significantly slower. After we get a clean CI run, I'll push another commit that changes VM_ARGC_STACK_MAX back to 128.

jeremyevans · 2023-03-15T01:42:34Z

CI is clean except for arm64 YJIT failures (which appear to definitely be related to these changes). Going to push the commit to set VM_ARGC_STACK_MAX back to 128.

ko1 · 2023-04-14T02:10:52Z

vm_insnhelper.c

+            cfp->sp[-1] = argv_ary;
+            calling->argc = 1;
+            calling->heap_argv = argv_ary;
+            RB_GC_GUARD(argv_ary);


is it needed? (calling is on the stack)

This one may not be needed (I think I added it first when debugging). I'll try removing it.

ko1 · 2023-04-14T02:12:55Z

vm_insnhelper.c

+            calling->argc = 1;
+            calling->heap_argv = argv_ary;
+            RB_GC_GUARD(argv_ary);
+            RB_GC_GUARD(ary);


is it needed? (caller may mark the ary?)

I think this was definitely needed to fix a segfault under GC.stress = true when VM_ARGC_STACK_MAX was set to -1 (forget which test did that).

eregon · 2023-04-16T13:43:24Z

vm_insnhelper.c

+            rb_ary_cat(argv_ary, argv, argc);
+            rb_ary_cat(argv_ary, ptr, len);


Does this mean there is always a full copy of the arguments when nb-of-args > VM_ARGC_STACK_MAX? (for any kind of call)

This is only for callers that use CALLER_SETUP_ARG with ALLOW_HEAP_ARGV or ALLOW_HEAP_ARGV_KEEP_KWSPLAT. Currently, that is these type of calls only:

cfunc

ifunc

non-iseq bmethod

send(*args) # but not send(:meth, *args)

:symbol.to_proc.call(*args) # but not :symbol.to_proc.call(recv, *args)

jeremyevans · 2023-04-24T21:38:15Z

This passes CI. I'm going to squash all of the heap_argv related code into a single commit, with separate commits for the five separate optimizations. After that passes CI, I plan to merge this.

…aising SystemStackError Originally, when 2e7bceb fixed cfuncs to no longer use the VM stack for large array splats, it was thought to have fully fixed Bug ruby#4040, since the issue was fixed for methods defined in Ruby (iseqs) back in Ruby 2.2. After additional research, I determined that same issue affects almost all types of method calls, not just iseq and cfunc calls. There were two main types of remaining issues, important cases (where large array splat should work) and pedantic cases (where large array splat raised SystemStackError instead of ArgumentError). Important cases: ```ruby define_method(:a){|*a|} a(*1380888.times) def b(*a); end send(:b, *1380888.times) :b.to_proc.call(self, *1380888.times) def d; yield(*1380888.times) end d(&method(:b)) def self.method_missing(*a); end not_a_method(*1380888.times) ``` Pedantic cases: ```ruby def a; end a(*1380888.times) def b(_); end b(*1380888.times) def c(_=nil); end c(*1380888.times) c = Class.new do attr_accessor :a alias b a= end.new c.a(*1380888.times) c.b(*1380888.times) c = Struct.new(:a) do alias b a= end.new c.a(*1380888.times) c.b(*1380888.times) ``` This patch fixes all usage of CALLER_SETUP_ARG with splatting a large number of arguments, and required similar fixes to use a temporary hidden array in three other cases where the VM would use the VM stack for handling a large number of arguments. However, it is possible there may be additional cases where splatting a large number of arguments still causes a SystemStackError. This has a measurable performance impact, as it requires additional checks for a large number of arguments in many additional cases. This change is fairly invasive, as there were many different VM functions that needed to be modified to support this. To avoid too much API change, I modified struct rb_calling_info to add a heap_argv member for storing the array, so I would not have to thread it through many functions. This struct is always stack allocated, which helps ensure sure GC doesn't collect it early. Because of how invasive the changes are, and how rarely large arrays are actually splatted in Ruby code, the existing test/spec suites are not great at testing for correct behavior. To try to find and fix all issues, I tested this in CI with VM_ARGC_STACK_MAX to -1, ensuring that a temporary array is used for all array splat method calls. This was very helpful in finding breaking cases, especially ones involving flagged keyword hashes. Fixes [Bug ruby#4040] Co-authored-by: Jimmy Miller <jimmy.miller@shopify.com>

Currently, bmethod arguments are copied from the VM stack to the C stack in vm_call_bmethod, then copied from the C stack to the VM stack later in invoke_iseq_block_from_c. This is inefficient. This adds vm_call_iseq_bmethod and vm_call_noniseq_bmethod. vm_call_iseq_bmethod is an optimized method that skips stack copies (though there is one copy to remove the receiver from the stack), and avoids calling vm_call_bmethod_body, rb_vm_invoke_bmethod, invoke_block_from_c_proc, invoke_iseq_block_from_c, and vm_yield_setup_args. Th vm_call_iseq_bmethod argument handling is similar to the way normal iseq methods are called, and allows for similar performance optimizations when using splats or keywords. However, even in the no argument case it's still significantly faster. A benchmark is added for bmethod calling. In my environment, it improves bmethod calling performance by 38-59% for simple bmethod calls, and up to 180% for bmethod calls passing literal keywords on both sides. ``` ./miniruby-iseq-bmethod: 18159792.6 i/s ./miniruby-m: 13174419.1 i/s - 1.38x slower bmethod_simple_1 ./miniruby-iseq-bmethod: 15890745.4 i/s ./miniruby-m: 10008972.7 i/s - 1.59x slower bmethod_simple_0_splat ./miniruby-iseq-bmethod: 13142804.3 i/s ./miniruby-m: 11168595.2 i/s - 1.18x slower bmethod_simple_1_splat ./miniruby-iseq-bmethod: 12375791.0 i/s ./miniruby-m: 8491140.1 i/s - 1.46x slower bmethod_no_splat ./miniruby-iseq-bmethod: 10151258.8 i/s ./miniruby-m: 8716664.1 i/s - 1.16x slower bmethod_0_splat ./miniruby-iseq-bmethod: 8138802.5 i/s ./miniruby-m: 7515600.2 i/s - 1.08x slower bmethod_1_splat ./miniruby-iseq-bmethod: 8028372.7 i/s ./miniruby-m: 5947658.6 i/s - 1.35x slower bmethod_10_splat ./miniruby-iseq-bmethod: 6953514.1 i/s ./miniruby-m: 4840132.9 i/s - 1.44x slower bmethod_100_splat ./miniruby-iseq-bmethod: 5287288.4 i/s ./miniruby-m: 2243218.4 i/s - 2.36x slower bmethod_kw ./miniruby-iseq-bmethod: 8931358.2 i/s ./miniruby-m: 3185818.6 i/s - 2.80x slower bmethod_no_kw ./miniruby-iseq-bmethod: 12281287.4 i/s ./miniruby-m: 10041727.9 i/s - 1.22x slower bmethod_kw_splat ./miniruby-iseq-bmethod: 5618956.8 i/s ./miniruby-m: 3657549.5 i/s - 1.54x slower ```

This optimizes the following calls: * ~10-15% for f(*a) when a does not end with a flagged keywords hash * ~10-15% for f(*a) when a ends with an empty flagged keywords hash * ~35-40% for f(*a, **kw) if kw is empty This still copies the array contents to the VM stack, but avoids some overhead. It would be faster to use the array pointer directly, but that could cause problems if the array was modified during the call to the function. You could do that optimization for frozen arrays, but as splatting frozen arrays is uncommon, and the speedup is minimal (<5%), it doesn't seem worth it. The vm_send_cfunc benchmark has been updated to test additional cfunc call types, and the numbers above were taken from the benchmark results.

Similar to the bmethod optimization, this avoids using CALLER_ARG_SPLAT if not necessary. As long as the method argument can be shifted off, other arguments are passed through as-is. This optimizes the following types of calls: * send(meth, arg) ~5% * send(meth, *args) ~75% for args.length == 200 * send(meth, *args, **kw) ~50% for args.length == 200 * send(meth, **kw) ~25% * send(meth, kw: 1) ~115% Note that empty argument splats do get slower with this approach, by about 20%. This is probably because iseq argument setup is slower for empty argument splats than CALLER_SETUP_ARG is. Other than non-empty argument splats, other argument splats are faster, with the speedup depending on the number of arguments. The following types of calls are not optimized: * send(*args) * send(*args, **kw) This is because the you cannot shift the method argument off without first splatting the arg.

Similar to the bmethod/send optimization, this avoids using CALLER_ARG_SPLAT if not necessary. As long as the receiver argument can be shifted off, other arguments are passed through as-is. This optimizes the following types of calls: * symproc.(recv) ~5% * symproc.(recv, *args) ~65% for args.length == 200 * symproc.(recv, *args, **kw) ~45% for args.length == 200 * symproc.(recv, **kw) ~30% * symproc.(recv, kw: 1) ~100% Note that empty argument splats do get slower with this approach, by about 2-3%. This is probably because iseq argument setup is slower for empty argument splats than CALLER_SETUP_ARG is. Other than non-empty argument splats, other argument splats are faster, with the speedup depending on the number of arguments. The following types of calls are not optimized: * symproc.(*args) * symproc.(*args, **kw) This is because the you cannot shift the receiver argument off without first splatting the arg.

CALLER_ARG_SPLAT is not necessary for method_missing. We just need to unshift the method name into the arguments. This optimizes all method_missing calls: * mm(recv) ~9% * mm(recv, *args) ~215% for args.length == 200 * mm(recv, *args, **kw) ~55% for args.length == 200 * mm(recv, **kw) ~22% * mm(recv, kw: 1) ~100% Note that empty argument splats do get slower with this approach, by about 30-40%. Other than non-empty argument splats, other argument splats are faster, with the speedup depending on the number of arguments.

jeremyevans · 2023-04-24T22:57:35Z

CI is clean after the rebasing. I plan to merge this around midnight JST.

jeremyevans requested a review from ko1 March 14, 2023 23:35

jeremyevans force-pushed the large-array-splat-4040 branch 2 times, most recently from a96468d to 96cdd85 Compare March 24, 2023 02:48

jeremyevans force-pushed the large-array-splat-4040 branch from 96cdd85 to 5c71d84 Compare April 2, 2023 20:57

ko1 reviewed Apr 14, 2023

View reviewed changes

jeremyevans force-pushed the large-array-splat-4040 branch from 5c71d84 to 19583f6 Compare April 14, 2023 23:31

eregon reviewed Apr 16, 2023

View reviewed changes

maximecb requested review from XrXr, k0kubun and jimmyhmiller April 17, 2023 14:12

jeremyevans force-pushed the large-array-splat-4040 branch from f77f0cf to cc8bfc7 Compare April 24, 2023 20:34

jeremyevans and others added 6 commits April 24, 2023 14:58

jeremyevans force-pushed the large-array-splat-4040 branch from cc8bfc7 to b50b8e2 Compare April 24, 2023 22:03

jeremyevans merged commit a82a24e into ruby:master Apr 25, 2023
96 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError #7522

Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError #7522

jeremyevans commented Mar 14, 2023

jeremyevans commented Mar 15, 2023

ko1 Apr 14, 2023

jeremyevans Apr 14, 2023

ko1 Apr 14, 2023

jeremyevans Apr 14, 2023

eregon Apr 16, 2023

jeremyevans Apr 16, 2023

jeremyevans commented Apr 24, 2023

jeremyevans commented Apr 24, 2023

		rb_ary_cat(argv_ary, argv, argc);
		rb_ary_cat(argv_ary, ptr, len);

Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError #7522

Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError #7522

Conversation

jeremyevans commented Mar 14, 2023

jeremyevans commented Mar 15, 2023

ko1 Apr 14, 2023

Choose a reason for hiding this comment

jeremyevans Apr 14, 2023

Choose a reason for hiding this comment

ko1 Apr 14, 2023

Choose a reason for hiding this comment

jeremyevans Apr 14, 2023

Choose a reason for hiding this comment

eregon Apr 16, 2023

Choose a reason for hiding this comment

jeremyevans Apr 16, 2023

Choose a reason for hiding this comment

jeremyevans commented Apr 24, 2023

jeremyevans commented Apr 24, 2023