
Reduce implicit array allocations on caller side of method calling #8853

Merged

Conversation

jeremyevans (Contributor):

This eliminates array allocations for the following common cases:

```
f(1, *a)
f(1, *a, &lvar)
f(*a, **lvar)
f(*a, **lvar, &lvar)
```

This also eliminates array allocations for these less common cases that I think are still worth optimizing:

```
f(1, *a, &@ivar)
f(*a, **@ivar)
f(*a, **@ivar, &@ivar)
f(*a, kw: 1)
f(*a, kw: 1, &lvar)
f(*a, kw: 1, &@ivar)
```

This is handled via the peephole optimizer.

In terms of safety: f(*a, &lvar) and f(*a, &@ivar) already avoid array allocations today, and all of the cases above are exactly as safe as those.

Due to how the compiler works, while f(*a) does not allocate an array, f(1, *a) does. This is possible to fix in the compiler, but the change is much more complex, so this attempts to fix the issue in a simpler way using the peephole optimizer.

Eliminating this array allocation is safe, since just as in the f(*a) case, nothing else on the caller side can modify the array.
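To make the safety argument concrete, here is a small sketch (method and variable names are invented for illustration): the callee's rest parameter is always a distinct array from the caller's, so eliminating the caller-side copy cannot let the callee mutate the caller's array.

```ruby
# Sketch: the callee's *rest parameter is a fresh array, so even when
# the caller-side allocation for the f(1, *a) shape is eliminated,
# mutating rest cannot affect the caller's array a.
def append_marker(first, *rest)
  rest << :marker  # mutates only the callee-side array
  rest
end

a = [1, 2, 3]
result = append_marker(0, *a)
p a       # => [1, 2, 3] — caller's array is unchanged
p result  # => [1, 2, 3, :marker]
```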
@jeremyevans jeremyevans requested a review from ko1 November 7, 2023 01:00
Comment on lines +3876 to +3889
```
/*
 * Eliminate array allocation for f(*a, **lvar) and f(*a, **@iv)
 *
 *   splatarray true
 *   getlocal / getinstancevariable
 *   send ARGS_SPLAT|KW_SPLAT and not ARGS_BLOCKARG
 * =>
 *   splatarray false
 *   getlocal / getinstancevariable
 *   send
 */
else if (!(flag & VM_CALL_ARGS_BLOCKARG) && (flag & VM_CALL_KW_SPLAT)) {
    OPERAND_AT(iobj, 0) = Qfalse;
}
```
Member:

Is this safe for this code?

```
ary = [1,2]
(kwd = Object.new).singleton_class.define_method(:to_hash) {ary << 3; {}}
p(*ary, **kwd)
```

This should still print only the lines 1 and 2.

jeremyevans (Contributor, Author):

As stated in the pull request description and commit messages, this is as safe as f(*a, &lvar). The same issue you describe already exists with f(*a, &lvar):

```
ary = [1,2]
kwd = Object.new
kwd.define_singleton_method(:to_hash) {ary << 3; {}}
kwd.define_singleton_method(:to_proc) {ary << 4; lambda{}}

p(*ary, &kwd)

puts
ary = [1,2]
p(*ary, **kwd)

puts
ary = [1,2]
p(*ary, **kwd, &kwd)
```

Currently:

```
1
2
4

1
2

1
2
```

After:

```
1
2
4

1
2
3

1
2
4
3
```

Note that printing 4 before 3 reflects a different bug, where Ruby calls to_proc before to_hash (even in the current case where an array is allocated).

As we currently accept the behavior for f(*a, &lvar), I think we should accept the behavior for the other cases.

Alternatively, we could have f(*a, &lvar) allocate an array. However, that makes almost all code using that style of method calling slower.
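For context, the "style of method calling" at stake is the common delegation pattern; a minimal sketch (method names invented) of the f(*a, &lvar) shape whose allocation elimination is being discussed:

```ruby
# Sketch of the delegation pattern that benefits from eliminating the
# caller-side array allocation: forwarding *args and a block onward.
def target(*args)
  args.sum
end

def delegate(*args, &blk)
  target(*args, &blk)  # the f(*a, &lvar) call shape discussed above
end

p delegate(1, 2, 3)  # => 6
```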

jeremyevans (Contributor, Author):

I submitted a pull request to ensure Ruby calls to_hash before to_proc: #8877

compile.c Outdated
Comment on lines 3894 to 3916
```
/*
 * Eliminate array allocation for f(*a, **lvar, &lvar) and f(*a, **@iv, &@iv)
 *
 *   splatarray true
 *   getlocal / getinstancevariable
 *   getlocal / getinstancevariable
 *   send ARGS_SPLAT|KW_SPLAT|ARGS_BLOCKARG
 * =>
 *   splatarray false
 *   getlocal / getinstancevariable
 *   getlocal / getinstancevariable
 *   send
 */
if (IS_NEXT_INSN_ID(niobj, send)) {
    niobj = niobj->next;
    unsigned int flag = vm_ci_flag((const struct rb_callinfo *)OPERAND_AT(niobj, 0));

    if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) && (flag & VM_CALL_ARGS_BLOCKARG)) {
        OPERAND_AT(iobj, 0) = Qfalse;
    }
}
```
Member:

Ditto for to_proc.

compile.c Outdated

```
@@ -3888,6 +3888,30 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                OPERAND_AT(iobj, 0) = Qfalse;
            }
        }
    } else if (IS_NEXT_INSN_ID(niobj, getlocal) | IS_NEXT_INSN_ID(niobj, getinstancevariable)) {
```
Contributor:

`|` ?

jeremyevans (Contributor, Author):

Thanks, that should be `||`, though in this case `|` has the same behavior.
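The general distinction is that `|` evaluates both operands while `||` short-circuits; the two macro calls here are side-effect-free, which is why the behavior is the same. A quick Ruby illustration of the difference (C's `|` versus `||` behave analogously for side-effecting operands):

```ruby
# Count how many times a side-effecting operand is evaluated.
$calls = 0
def noisy
  $calls += 1
  true
end

true || noisy  # || short-circuits: noisy is never called
p $calls       # => 0

true | noisy   # TrueClass#| evaluates its argument unconditionally
p $calls       # => 1
```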

@ko1 (Contributor) commented Nov 18, 2023:

The first commit makes sense to me, but the others seem too ad hoc (limited to lvar and ivar). Maybe a node is needed for it?

@jeremyevans (Contributor, Author):

The reason I limited this to lvar and ivar is that those are fairly easy to handle, not risky (only pathological code fails), and very common. The other common case is splatting of method calls, but handling that is more complex, more risky (actual code might fail), and slightly less common, so I don't think we should do that. This could easily be extended to gvar and cvar, but those are very uncommon, so I don't think it is worth it.
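To illustrate why method-call splats are riskier than local or instance variable splats, here is a hypothetical sketch (the constant and method names are invented): a method may return an array that is aliased elsewhere, and the caller-side copy is what isolates the callee from that shared state.

```ruby
# Sketch: cached_args returns a shared array rather than a fresh copy.
# Because Ruby copies the splatted array for this call shape, the
# callee's rest parameter stays isolated from the shared CACHE.
CACHE = [1, 2]
def cached_args
  CACHE  # aliased: every caller sees the same array object
end

def collect(*rest)
  rest
end

got = collect(*cached_args)
got << 99   # mutate what the callee received
p CACHE     # => [1, 2] — the shared array is untouched
```

If splats of method calls were optimized the same way as lvar/ivar splats, non-pathological code like this caching pattern could start observing mutations.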

@jeremyevans (Contributor, Author):

To check how common these types of method calls are, I added some metrics to the code to see how many call sites would be optimized by removing allocations:

For ruby -e '' (just loads rubygems):

```
  5 : f(1, *a)
  2 : f(1, *a, &lvar) | f(1, *a, &@ivar)
```

Current stdlib:

```
139 : f(1, *a)
  3 : f(1, *a, &lvar) | f(1, *a, &@ivar)
  4 : f(*a, **lvar) | f(*a, **@ivar)
```

minitest 5.20 (bundled gem):

```
  4 : f(1, *a)
  1 : f(1, *a, &lvar) | f(1, *a, &@ivar)
  2 : f(*a, **lvar) | f(*a, **@ivar)
  2 : f(*a, **lvar, &lvar) | f(*a, **@ivar, &@ivar)
```

Rails master:

```
 77 : f(1, *a)
  5 : f(1, *a, &lvar) | f(1, *a, &@ivar)
 26 : f(*a, **lvar) | f(*a, **@ivar)
  5 : f(*a, kw: 1)
```

So all of the optimizations seem to be used by the standard library and common gems, except maybe f(*a, kw: 1, &lvar) | f(*a, kw: 1, &@ivar). I could remove that one if we don't think it is worth keeping.

Code to enable metrics:

```
diff --git a/compile.c b/compile.c
index 41008f3cfd..88b281eedd 100644
--- a/compile.c
+++ b/compile.c
@@ -3848,6 +3848,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
             niobj = niobj->next;
             unsigned int flag = vm_ci_flag((const struct rb_callinfo *)OPERAND_AT(niobj, 0));
             if ((flag & VM_CALL_ARGS_SPLAT) && !(flag & (VM_CALL_KW_SPLAT|VM_CALL_ARGS_BLOCKARG))) {
+write(1, "1", 1);
                 OPERAND_AT(iobj, 0) = Qfalse;
             }
         } else if (IS_NEXT_INSN_ID(niobj, getlocal) || IS_NEXT_INSN_ID(niobj, getinstancevariable)) {
@@ -3870,6 +3871,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                     *  send
                     */
                     if ((flag & VM_CALL_ARGS_BLOCKARG) && !(flag & VM_CALL_KW_SPLAT)) {
+write(1, "2", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }

@@ -3885,6 +3887,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                     *  send
                     */
                     else if (!(flag & VM_CALL_ARGS_BLOCKARG) && (flag & VM_CALL_KW_SPLAT)) {
+write(1, "3", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }
                 }
@@ -3909,6 +3912,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                     unsigned int flag = vm_ci_flag((const struct rb_callinfo *)OPERAND_AT(niobj, 0));

                     if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) && (flag & VM_CALL_ARGS_BLOCKARG)) {
+write(1, "4", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }
                 }
@@ -3933,6 +3937,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal

                 if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) &&
                         (flag & VM_CALL_KW_SPLAT_MUT) && !(flag & VM_CALL_ARGS_BLOCKARG)) {
+write(1, "5", 1);
                     OPERAND_AT(iobj, 0) = Qfalse;
                 }
             }
@@ -3958,6 +3963,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal

                     if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) &&
                             (flag & VM_CALL_KW_SPLAT_MUT) && (flag & VM_CALL_ARGS_BLOCKARG)) {
+write(1, "6", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }
                 }
```

Program to load files and collect metrics (pipe its output into the next program):

```
# coding: UTF-8
b = binding
ARGV.flat_map{|x| File.directory?(x) ? Dir["#{x}/**/*.rb"] : x}.each do |file|
  code = "# coding: UTF-8\nBEGIN{throw :valid, true}\n" + File.binread(file)
  catch(:valid){eval(code, b, file)}
rescue SyntaxError
end
puts
```

Program to parse the metrics into the output above:

```
res = $stdin.read

types = (<<END).split("\n")
f(1, *a)
f(1, *a, &lvar) | f(1, *a, &@ivar)
f(*a, **lvar) | f(*a, **@ivar)
f(*a, **lvar, &lvar) | f(*a, **@ivar, &@ivar)
f(*a, kw: 1)
f(*a, kw:1, &lvar) | f(*a, kw:1, &@ivar)
END

%w[1 2 3 4 5 6].each do |i|
  count = res.count(i)
  next unless count > 0
  print count.to_s.rjust(3), ' : ', types[i.to_i-1]
  puts
end
```

@byroot (Member) commented Dec 1, 2023:

Have you tried this patch on yjit-bench? I suspect that if this translates to real-world gains, it should be visible on something like the lobsters bench, as it's a Rails app and so likely has significant delegation going on.

@jeremyevans (Contributor, Author):

> Have you tried this patch on yjit-bench? I suspect that if this translates to real-world gains, it should be visible on something like the lobsters bench, as it's a Rails app and so likely has significant delegation going on.

Here are the results. I ran with WARMUP_ITRS=1 MIN_BENCH_ITRS=2 MIN_BENCH_TIME=1 and it still took over 15 minutes on my machine. For more accurate numbers, you would want to run it with standard settings on a machine that is not being used for anything except the benchmark.

Note that other than the compiler change to turn splatarray true into splatarray false in certain cases (which should always improve runtime performance), the only other change in the pull request is the small one to vm_insnhelper.c. That should not result in an additional allocation (the pull request moves that allocation from the caller side to the callee side). Possibly combining the branches makes things slower; I could keep the split branches to remove the added conditional.

```
after: ruby 3.3.0dev (2023-11-30T23:30:30Z reduce-implicit-al.. b2707195ec) [x86_64-openbsd7.4]
before: ruby 3.3.0dev (2023-11-06T19:39:09Z v3_3_0_preview3~213 49b6dc8f07) [x86_64-openbsd7.4]

--------------  ----------  ----------  -----------  ----------  ------------  --------------
bench           after (ms)  stddev (%)  before (ms)  stddev (%)  after/before  before 1st itr
activerecord    123.9       3.1         122.3        0.4         1.01          1.04
erubi_rails     45.8        5.0         46.2         5.1         0.99          0.97
hexapdf         6239.6      0.1         6181.9       0.2         1.01          1.01
liquid-c        147.5       0.5         147.3        0.5         1.00          0.99
liquid-compile  173.0       0.3         177.5        0.6         0.97          0.98
liquid-render   318.4       0.5         320.2        0.1         0.99          1.00
mail            321.9       0.1         323.1        0.1         1.00          0.97
psych-load      5883.5      0.1         5790.1       0.2         1.02          1.02
railsbench      4092.9      0.5         4147.1       0.2         0.99          0.96
ruby-lsp        14.6        10.5        14.7         10.5        0.99          1.02
sequel          182.7       1.9         183.1        1.7         1.00          0.99
binarytrees     675.3       0.3         655.1        0.2         1.03          1.03
chunky_png      2207.5      0.0         2228.3       0.2         0.99          0.99
erubi           576.7       0.1         573.6        0.1         1.01          1.01
etanni          665.1       0.1         661.6        0.0         1.01          1.00
fannkuchredux   3741.2      0.2         3594.7       0.0         1.04          1.04
lee             2571.0      0.4         2550.8       0.3         1.01          1.01
nbody           168.7       0.2         169.0        0.2         1.00          0.99
optcarrot       10568.0     0.0         10501.7      0.3         1.01          1.00
ruby-json       10451.8     0.9         10318.8      0.0         1.01          1.00
rubykon         19591.2     0.1         19625.6      0.0         1.00          1.00
30k_ifelse      3390.4      0.0         3371.2       0.1         1.01          1.02
30k_methods     7287.3      0.0         7375.1       0.9         0.99          1.00
cfunc_itself    300.8       0.1         294.4        0.3         1.02          1.01
fib             439.6       0.1         432.4        0.3         1.02          1.02
getivar         204.1       0.1         213.7        0.2         0.95          0.96
keyword_args    506.8       0.2         491.4        0.0         1.03          1.02
respond_to      695.8       0.1         704.7        0.2         0.99          0.99
setivar         141.2       0.4         137.4        0.3         1.03          1.03
setivar_object  164.7       0.4         167.7        0.2         0.98          0.98
setivar_young   166.4       0.3         174.6        4.6         0.95          0.99
str_concat      186.0       2.0         185.5        2.1         1.00          1.01
throw           43.8        0.4         43.5         6.5         1.01          1.02
--------------  ----------  ----------  -----------  ----------  ------------  --------------
```

@byroot (Member) commented Dec 2, 2023:

Hum, not sure why lobsters isn't on that result list; it's the one where I would have expected a significant amount of delegation, and hence a noticeable speedup.

To share my line of thought: if this change shows significant gains on at least one headline benchmark, then I'm in favor of taking the risk of merging such a change soonish so it makes it into 3.3. Otherwise I'd advocate for waiting until January so we don't risk introducing a subtle bug this close to release.

Either way, if we want to merge this for 3.3, I strongly believe it should be before the first RC (coming very soon, I heard).

Due to how the compiler works, f(*a, &lvar) and f(*a, &@iv) do not allocate an array, but f(1, *a, &lvar) and f(1, *a, &@iv) do. It's probably possible to fix this in the compiler, but it seems easiest to fix it in the peephole optimizer.

Eliminating this array allocation is as safe as the current elimination of the array allocation for f(*a, &lvar) and f(*a, &@iv).

The compiler already eliminates the array allocation for f(*a, &lvar) and f(*a, &@iv), and eliminating the array allocation for keyword splat is as safe as eliminating it for block passes.

The compiler already eliminates the array allocation for f(*a, &lvar) and f(*a, &@iv). If that is safe, then eliminating it for f(*a, **lvar) and f(*a, **@iv) as the last commit did is as safe, and eliminating it for f(*a, **lvar, &lvar) and f(*a, **@iv, &@iv) is also as safe.
In cases where the compiler can detect the hash is static, it would
use duphash for the hash part. As the hash is static, there is no need
to allocate an array.

Similar to the previous commit, but this handles the block pass
case.
For the following:

```
def f(*a); a end
p f(*a, kw: 3)
```

`setup_parameters_complex` pushes `{kw: 3}` onto `a`.  This worked fine
back when `concatarray true` was used and `a` was already a copy. It
does not work correctly with the optimization to switch to
`concatarray false`.  This duplicates the array on the callee side
in such a case.

This affects cases when passing a regular splat and a keyword splat
(or literal keywords) in a method call, where the method does not
accept keywords.

This allocation could probably be avoided, but doing so would
make `setup_parameters_complex` more complicated.
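The behavior this commit preserves can be checked directly; a minimal sketch of the case where the callee does not accept keywords:

```ruby
# When the callee takes no keyword parameters, literal keywords in the
# call become a trailing positional hash collected into the rest array.
# The caller's array must remain unmodified even with the optimization.
def f(*a)
  a
end

a = [1, 2]
r = f(*a, kw: 3)
p r  # => [1, 2, {:kw=>3}]
p a  # => [1, 2] — the caller's array is not mutated
```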
… f(*a, kw: 1, &arg)

These are similar to the f(1, *a, &lvar), f(*a, **kw, &lvar), and f(*a, kw: 1, &lvar) optimizations, but they use the getblockparamproxy instruction instead of getlocal.

This also fixes the else style to be more similar to the surrounding
code.
@jeremyevans jeremyevans force-pushed the reduce-implicit-allocations-optimizer branch from 7e82b16 to f1c43a7 Compare December 7, 2023 15:28
@jeremyevans jeremyevans merged commit 7738b31 into ruby:master Dec 7, 2023
98 checks passed