
Reduce implicit array allocations on caller side of method calling #8853

Merged

Conversation

jeremyevans (Contributor):

This eliminates array allocations for the following common cases:

```
f(1, *a)
f(1, *a, &lvar)
f(*a, **lvar)
f(*a, **lvar, &lvar)
```

This also eliminates array allocations for these less common cases that I think are still worth optimizing:

```
f(1, *a, &@ivar)
f(*a, **@ivar)
f(*a, **@ivar, &@ivar)
f(*a, kw: 1)
f(*a, kw: 1, &lvar)
f(*a, kw: 1, &@ivar)
```

This is handled via the peephole optimizer.

In terms of safety: f(*a, &lvar) and f(*a, &@ivar) already avoid array allocations today, and all of the cases above are exactly as safe as those.

Due to how the compiler works, while f(*a) does not allocate an array, f(1, *a) does. This is possible to fix in the compiler, but the change is much more complex, so this attempts to fix the issue in a simpler way using the peephole optimizer.

Eliminating this array allocation is safe, since just as in the f(*a) case, nothing else on the caller side can modify the array.
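To make the safety argument concrete, here is a small sketch (method and variable names are invented for illustration): the callee's rest parameter is always a distinct array from the caller's, so eliminating the caller-side copy cannot let the callee mutate the caller's array.

```ruby
# Sketch: the callee's *rest parameter is a fresh array, so even when
# the caller-side allocation for the f(1, *a) shape is eliminated,
# mutating rest cannot affect the caller's array a.
def append_marker(first, *rest)
  rest << :marker  # mutates only the callee-side array
  rest
end

a = [1, 2, 3]
result = append_marker(0, *a)
p a       # => [1, 2, 3] — caller's array is unchanged
p result  # => [1, 2, 3, :marker]
```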
@jeremyevans jeremyevans requested a review from ko1 November 7, 2023 01:00
Comment on lines +3876 to +3889
```
/*
 * Eliminate array allocation for f(*a, **lvar) and f(*a, **@iv)
 *
 *   splatarray true
 *   getlocal / getinstancevariable
 *   send ARGS_SPLAT|KW_SPLAT and not ARGS_BLOCKARG
 * =>
 *   splatarray false
 *   getlocal / getinstancevariable
 *   send
 */
else if (!(flag & VM_CALL_ARGS_BLOCKARG) && (flag & VM_CALL_KW_SPLAT)) {
    OPERAND_AT(iobj, 0) = Qfalse;
}
```
Member:

Is this safe for this code?

```
ary = [1,2]
(kwd = Object.new).singleton_class.define_method(:to_hash) {ary << 3; {}}
p(*ary, **kwd)
```

This should still print only the lines 1 and 2.

jeremyevans (Contributor, Author):

As stated in the pull request description and commit messages, this is as safe as f(*a, &lvar). The same issue you describe already exists with f(*a, &lvar):

```
ary = [1,2]
kwd = Object.new
kwd.define_singleton_method(:to_hash) {ary << 3; {}}
kwd.define_singleton_method(:to_proc) {ary << 4; lambda{}}

p(*ary, &kwd)

puts
ary = [1,2]
p(*ary, **kwd)

puts
ary = [1,2]
p(*ary, **kwd, &kwd)
```

Currently:

```
1
2
4

1
2

1
2
```

After:

```
1
2
4

1
2
3

1
2
4
3
```

Note that printing 4 before 3 reflects a different bug, where Ruby calls to_proc before to_hash (even in the current case where an array is allocated).

As we currently accept the behavior for f(*a, &lvar), I think we should accept the behavior for the other cases.

Alternatively, we could have f(*a, &lvar) allocate an array. However, that makes almost all code using that style of method calling slower.
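For context, the "style of method calling" at stake is the common delegation pattern; a minimal sketch (method names invented) of the f(*a, &lvar) shape whose allocation elimination is being discussed:

```ruby
# Sketch of the delegation pattern that benefits from eliminating the
# caller-side array allocation: forwarding *args and a block onward.
def target(*args)
  args.sum
end

def delegate(*args, &blk)
  target(*args, &blk)  # the f(*a, &lvar) call shape discussed above
end

p delegate(1, 2, 3)  # => 6
```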

jeremyevans (Contributor, Author):

I submitted a pull request to ensure Ruby calls to_hash before to_proc: #8877

compile.c Outdated
Comment on lines 3894 to 3916
```
/*
 * Eliminate array allocation for f(*a, **lvar, &lvar) and f(*a, **@iv, &@iv)
 *
 *   splatarray true
 *   getlocal / getinstancevariable
 *   getlocal / getinstancevariable
 *   send ARGS_SPLAT|KW_SPLAT|ARGS_BLOCKARG
 * =>
 *   splatarray false
 *   getlocal / getinstancevariable
 *   getlocal / getinstancevariable
 *   send
 */
if (IS_NEXT_INSN_ID(niobj, send)) {
    niobj = niobj->next;
    unsigned int flag = vm_ci_flag((const struct rb_callinfo *)OPERAND_AT(niobj, 0));

    if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) && (flag & VM_CALL_ARGS_BLOCKARG)) {
        OPERAND_AT(iobj, 0) = Qfalse;
    }
}
```
Member:

Ditto for to_proc.

compile.c Outdated

```
@@ -3888,6 +3888,30 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                OPERAND_AT(iobj, 0) = Qfalse;
            }
        }
    } else if (IS_NEXT_INSN_ID(niobj, getlocal) | IS_NEXT_INSN_ID(niobj, getinstancevariable)) {
```
Contributor:

`|` ?

jeremyevans (Contributor, Author):

Thanks, that should be `||`, though in this case `|` has the same behavior.
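The general distinction is that `|` evaluates both operands while `||` short-circuits; the two macro calls here are side-effect-free, which is why the behavior is the same. A quick Ruby illustration of the difference (C's `|` versus `||` behave analogously for side-effecting operands):

```ruby
# Count how many times a side-effecting operand is evaluated.
$calls = 0
def noisy
  $calls += 1
  true
end

true || noisy  # || short-circuits: noisy is never called
p $calls       # => 0

true | noisy   # TrueClass#| evaluates its argument unconditionally
p $calls       # => 1
```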

@ko1 (Contributor) commented Nov 18, 2023:

The first commit makes sense to me, but the others seem too ad hoc (limited to lvar and ivar). Maybe a node is needed for it?

@jeremyevans (Contributor, Author):

The reason I limited this to lvar and ivar is that those are fairly easy to handle, not risky (only pathological code fails), and very common. The other common case is splatting of method calls, but handling that is more complex, more risky (actual code might fail), and slightly less common, so I don't think we should do that. This could easily be extended to gvar and cvar, but those are very uncommon, so I don't think it is worth it.
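To illustrate why method-call splats are riskier than local or instance variable splats, here is a hypothetical sketch (the constant and method names are invented): a method may return an array that is aliased elsewhere, and the caller-side copy is what isolates the callee from that shared state.

```ruby
# Sketch: cached_args returns a shared array rather than a fresh copy.
# Because Ruby copies the splatted array for this call shape, the
# callee's rest parameter stays isolated from the shared CACHE.
CACHE = [1, 2]
def cached_args
  CACHE  # aliased: every caller sees the same array object
end

def collect(*rest)
  rest
end

got = collect(*cached_args)
got << 99   # mutate what the callee received
p CACHE     # => [1, 2] — the shared array is untouched
```

If splats of method calls were optimized the same way as lvar/ivar splats, non-pathological code like this caching pattern could start observing mutations.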

@jeremyevans (Contributor, Author):

To check how common these types of method calls are, I added some metrics to the code to see how many call sites would be optimized by removing allocations:

For ruby -e '' (just loads rubygems):

```
  5 : f(1, *a)
  2 : f(1, *a, &lvar) | f(1, *a, &@ivar)
```

Current stdlib:

```
139 : f(1, *a)
  3 : f(1, *a, &lvar) | f(1, *a, &@ivar)
  4 : f(*a, **lvar) | f(*a, **@ivar)
```

minitest 5.20 (bundled gem):

```
  4 : f(1, *a)
  1 : f(1, *a, &lvar) | f(1, *a, &@ivar)
  2 : f(*a, **lvar) | f(*a, **@ivar)
  2 : f(*a, **lvar, &lvar) | f(*a, **@ivar, &@ivar)
```

Rails master:

```
 77 : f(1, *a)
  5 : f(1, *a, &lvar) | f(1, *a, &@ivar)
 26 : f(*a, **lvar) | f(*a, **@ivar)
  5 : f(*a, kw: 1)
```

So all of the optimizations seem to be used by the standard library and common gems, except maybe f(*a, kw: 1, &lvar) | f(*a, kw: 1, &@ivar). I could remove that one if we don't think it is worth keeping.

Code to enable metrics:

```
diff --git a/compile.c b/compile.c
index 41008f3cfd..88b281eedd 100644
--- a/compile.c
+++ b/compile.c
@@ -3848,6 +3848,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
             niobj = niobj->next;
             unsigned int flag = vm_ci_flag((const struct rb_callinfo *)OPERAND_AT(niobj, 0));
             if ((flag & VM_CALL_ARGS_SPLAT) && !(flag & (VM_CALL_KW_SPLAT|VM_CALL_ARGS_BLOCKARG))) {
+write(1, "1", 1);
                 OPERAND_AT(iobj, 0) = Qfalse;
             }
         } else if (IS_NEXT_INSN_ID(niobj, getlocal) || IS_NEXT_INSN_ID(niobj, getinstancevariable)) {
@@ -3870,6 +3871,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                     *  send
                     */
                     if ((flag & VM_CALL_ARGS_BLOCKARG) && !(flag & VM_CALL_KW_SPLAT)) {
+write(1, "2", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }

@@ -3885,6 +3887,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                     *  send
                     */
                     else if (!(flag & VM_CALL_ARGS_BLOCKARG) && (flag & VM_CALL_KW_SPLAT)) {
+write(1, "3", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }
                 }
@@ -3909,6 +3912,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal
                     unsigned int flag = vm_ci_flag((const struct rb_callinfo *)OPERAND_AT(niobj, 0));

                     if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) && (flag & VM_CALL_ARGS_BLOCKARG)) {
+write(1, "4", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }
                 }
@@ -3933,6 +3937,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal

                 if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) &&
                         (flag & VM_CALL_KW_SPLAT_MUT) && !(flag & VM_CALL_ARGS_BLOCKARG)) {
+write(1, "5", 1);
                     OPERAND_AT(iobj, 0) = Qfalse;
                 }
             }
@@ -3958,6 +3963,7 @@ iseq_peephole_optimize(rb_iseq_t *iseq, LINK_ELEMENT *list, const int do_tailcal

                     if ((flag & VM_CALL_ARGS_SPLAT) && (flag & VM_CALL_KW_SPLAT) &&
                             (flag & VM_CALL_KW_SPLAT_MUT) && (flag & VM_CALL_ARGS_BLOCKARG)) {
+write(1, "6", 1);
                         OPERAND_AT(iobj, 0) = Qfalse;
                     }
                 }
```

Program to load files and collect metrics (pipe its output into the next program):

```
# coding: UTF-8
b = binding
ARGV.flat_map{|x| File.directory?(x) ? Dir["#{x}/**/*.rb"] : x}.each do |file|
  code = "# coding: UTF-8\nBEGIN{throw :valid, true}\n" + File.binread(file)
  catch(:valid){eval(code, b, file)}
rescue SyntaxError
end
puts
```

Program to parse the metrics into the output above:

```
res = $stdin.read

types = (<<END).split("\n")
f(1, *a)
f(1, *a, &lvar) | f(1, *a, &@ivar)
f(*a, **lvar) | f(*a, **@ivar)
f(*a, **lvar, &lvar) | f(*a, **@ivar, &@ivar)
f(*a, kw: 1)
f(*a, kw:1, &lvar) | f(*a, kw:1, &@ivar)
END

%w[1 2 3 4 5 6].each do |i|
  count = res.count(i)
  next unless count > 0
  print count.to_s.rjust(3), ' : ', types[i.to_i-1]
  puts
end
```

@byroot (Member) commented Dec 1, 2023:

Have you tried this patch on yjit-bench? I suspect that if this translates to real-world gains, it should be visible on something like the lobsters bench, as it's a Rails app and so likely has significant delegation going on.

@jeremyevans (Contributor, Author):

> Have you tried this patch on yjit-bench? I suspect that if this translates to real-world gains, it should be visible on something like the lobsters bench, as it's a Rails app and so likely has significant delegation going on.

Here are the results. I ran with WARMUP_ITRS=1 MIN_BENCH_ITRS=2 MIN_BENCH_TIME=1 and it still took over 15 minutes on my machine. For more accurate numbers, you would want to run it with standard settings on a machine that is not being used for anything except the benchmark.

Note that other than the compiler change to turn splatarray true into splatarray false in certain cases (which should always improve runtime performance), the only other change in the pull request is the small one to vm_insnhelper.c. That should not result in an additional allocation (the pull request moves that allocation from the caller side to the callee side). Possibly combining the branches makes things slower; I could keep the split branches to remove the added conditional.

```
after: ruby 3.3.0dev (2023-11-30T23:30:30Z reduce-implicit-al.. b2707195ec) [x86_64-openbsd7.4]
before: ruby 3.3.0dev (2023-11-06T19:39:09Z v3_3_0_preview3~213 49b6dc8f07) [x86_64-openbsd7.4]

--------------  ----------  ----------  -----------  ----------  ------------  --------------
bench           after (ms)  stddev (%)  before (ms)  stddev (%)  after/before  before 1st itr
activerecord    123.9       3.1         122.3        0.4         1.01          1.04
erubi_rails     45.8        5.0         46.2         5.1         0.99          0.97
hexapdf         6239.6      0.1         6181.9       0.2         1.01          1.01
liquid-c        147.5       0.5         147.3        0.5         1.00          0.99
liquid-compile  173.0       0.3         177.5        0.6         0.97          0.98
liquid-render   318.4       0.5         320.2        0.1         0.99          1.00
mail            321.9       0.1         323.1        0.1         1.00          0.97
psych-load      5883.5      0.1         5790.1       0.2         1.02          1.02
railsbench      4092.9      0.5         4147.1       0.2         0.99          0.96
ruby-lsp        14.6        10.5        14.7         10.5        0.99          1.02
sequel          182.7       1.9         183.1        1.7         1.00          0.99
binarytrees     675.3       0.3         655.1        0.2         1.03          1.03
chunky_png      2207.5      0.0         2228.3       0.2         0.99          0.99
erubi           576.7       0.1         573.6        0.1         1.01          1.01
etanni          665.1       0.1         661.6        0.0         1.01          1.00
fannkuchredux   3741.2      0.2         3594.7       0.0         1.04          1.04
lee             2571.0      0.4         2550.8       0.3         1.01          1.01
nbody           168.7       0.2         169.0        0.2         1.00          0.99
optcarrot       10568.0     0.0         10501.7      0.3         1.01          1.00
ruby-json       10451.8     0.9         10318.8      0.0         1.01          1.00
rubykon         19591.2     0.1         19625.6      0.0         1.00          1.00
30k_ifelse      3390.4      0.0         3371.2       0.1         1.01          1.02
30k_methods     7287.3      0.0         7375.1       0.9         0.99          1.00
cfunc_itself    300.8       0.1         294.4        0.3         1.02          1.01
fib             439.6       0.1         432.4        0.3         1.02          1.02
getivar         204.1       0.1         213.7        0.2         0.95          0.96
keyword_args    506.8       0.2         491.4        0.0         1.03          1.02
respond_to      695.8       0.1         704.7        0.2         0.99          0.99
setivar         141.2       0.4         137.4        0.3         1.03          1.03
setivar_object  164.7       0.4         167.7        0.2         0.98          0.98
setivar_young   166.4       0.3         174.6        4.6         0.95          0.99
str_concat      186.0       2.0         185.5        2.1         1.00          1.01
throw           43.8        0.4         43.5         6.5         1.01          1.02
--------------  ----------  ----------  -----------  ----------  ------------  --------------
```

@byroot (Member) commented Dec 2, 2023:

Hum, not sure why lobsters isn't on that result list; it's the one where I would have expected a significant amount of delegation, and hence a noticeable speedup.

To share my line of thought: if this change shows significant gains on at least one headline benchmark, then I'm in favor of taking the risk of merging such a change soonish so it makes it into 3.3. Otherwise I'd advocate for waiting until January so we don't risk introducing a subtle bug this close to release.

Either way, if we want to merge this for 3.3, I strongly believe it should be before the first RC (coming very soon, I heard).

Due to how the compiler works, f(*a, &lvar) and f(*a, &@iv) do not allocate an array, but f(1, *a, &lvar) and f(1, *a, &@iv) do. It's probably possible to fix this in the compiler, but it seems easiest to fix it in the peephole optimizer.

Eliminating this array allocation is as safe as the current elimination of the array allocation for f(*a, &lvar) and f(*a, &@iv).

The compiler already eliminates the array allocation for f(*a, &lvar) and f(*a, &@iv), and eliminating the array allocation for keyword splat is as safe as eliminating it for block passes.

The compiler already eliminates the array allocation for f(*a, &lvar) and f(*a, &@iv). If that is safe, then eliminating it for f(*a, **lvar) and f(*a, **@iv) as the last commit did is as safe, and eliminating it for f(*a, **lvar, &lvar) and f(*a, **@iv, &@iv) is also as safe.
In cases where the compiler can detect the hash is static, it would
use duphash for the hash part. As the hash is static, there is no need
to allocate an array.

Similar to the previous commit, but this handles the block pass
case.
For the following:

```
def f(*a); a end
p f(*a, kw: 3)
```

`setup_parameters_complex` pushes `{kw: 3}` onto `a`.  This worked fine
back when `concatarray true` was used and `a` was already a copy. It
does not work correctly with the optimization to switch to
`concatarray false`.  This duplicates the array on the callee side
in such a case.

This affects cases when passing a regular splat and a keyword splat
(or literal keywords) in a method call, where the method does not
accept keywords.

This allocation could probably be avoided, but doing so would
make `setup_parameters_complex` more complicated.
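The behavior this commit preserves can be checked directly; a minimal sketch of the case where the callee does not accept keywords:

```ruby
# When the callee takes no keyword parameters, literal keywords in the
# call become a trailing positional hash collected into the rest array.
# The caller's array must remain unmodified even with the optimization.
def f(*a)
  a
end

a = [1, 2]
r = f(*a, kw: 3)
p r  # => [1, 2, {:kw=>3}]
p a  # => [1, 2] — the caller's array is not mutated
```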
… f(*a, kw: 1, &arg)

These are similar to the f(1, *a, &lvar), f(*a, **kw, &lvar), and f(*a, kw: 1, &lvar) optimizations, but they use the getblockparamproxy instruction instead of getlocal.

This also fixes the else style to be more similar to the surrounding
code.
@jeremyevans jeremyevans force-pushed the reduce-implicit-allocations-optimizer branch from 7e82b16 to f1c43a7 Compare December 7, 2023 15:28
@jeremyevans jeremyevans merged commit 7738b31 into ruby:master Dec 7, 2023
98 checks passed