
`where` in single `multi` vs. `sub` is 10x slower #2002

Open
zoffixznet opened this issue Jul 1, 2018 · 2 comments

Comments

@zoffixznet
Contributor

commented Jul 1, 2018

Notice that simply changing sub to multi makes this code 10x slower. At first I thought the auto-genned proto was the cause, but the same issue exists even if you add one of your own.

I'm guessing this might be due to the candidate-selection process invoking the where just to figure out whether the types match, but are there any improvements we can make here?

<Zoffix_> m: sub  foo ($ where rand.so) {}; my $n = 42e0; { for ^100_000 {  foo $n };  say now - ENTER now; }
<camelia> rakudo-moar 64bdb3dd7: OUTPUT: «0.11642038␤»

<Zoffix_> m: multi  foo ($ where rand.so) {}; my $n = 42e0; { for ^100_000 {  foo $n };  say now - ENTER now; }
<camelia> rakudo-moar 64bdb3dd7: OUTPUT: «1.23743391␤»
<Zoffix_> m: proto foo(|) {*}; multi  foo ($ where rand.so) {}; my $n = 42e0; { for ^100_000 {  foo $n };  say now - ENTER now; }
<camelia> rakudo-moar 64bdb3dd7: OUTPUT: «1.25060981␤»

<Zoffix_> m: say 1.23743391 / 0.11642038
<camelia> rakudo-moar 64bdb3dd7: OUTPUT: «10.629014525␤»
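
To make the guess above concrete, here's a minimal sketch (the sub/multi names are just for illustration, and the exact counts may vary between Rakudo versions) suggesting that a where block on a multi candidate runs more than once per call, whereas on a sub it runs only once:

    my $sub-checks   = 0;
    my $multi-checks = 0;
    sub   check-sub   ($ where { $sub-checks++;   True }) { }
    multi check-multi ($ where { $multi-checks++; True }) { }
    check-sub   42;
    check-multi 42;
    say $sub-checks;    # 1
    say $multi-checks;  # typically 2: once while the dispatcher checks the
                        # candidate, and once more when the call is bound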
@zoffixznet

Contributor Author

commented Jul 1, 2018

BTW, if you dump the QAST for this code, you'll see no ACCEPTS calls, because that literal gets optimized into a simple param check. However, if you --profile it, you'll see the .ACCEPTS is there, eating up a large chunk of time along with slow-path-binder methods.

multi foo (42) {}; { for ^100_000 { foo 42 };  say now - ENTER now; }

Where's it coming from?

@jnthn

Member

commented Jul 1, 2018

Where's it coming from?

The multi-dispatcher calling the slow-path signature binder to determine if the signature can be bound.

A little background. Naively:

  • A signature bind would call into the binder, which interprets the signature and does what is needed
  • A multi-dispatch would walk through the available candidates until it finds an applicable one, applying tie-breakers when there are multiple applicable ones
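
A toy, self-contained illustration of that naive walk (the candidate list and the accepts/body keys are purely illustrative, not Rakudo's actual dispatcher):

    # Each "candidate" pairs an applicability check with a body; the naive
    # dispatcher walks the candidates in order and calls the first one
    # whose check accepts the argument.
    my @candidates =
        %( accepts => sub ($x) { $x ~~ Int }, body => sub ($x) { "Int candidate: $x" } ),
        %( accepts => sub ($x) { True      }, body => sub ($x) { "fallback: $x"      } );

    sub naive-dispatch($arg) {
        for @candidates -> %cand {
            return %cand<body>($arg) if %cand<accepts>($arg);
        }
        die "No applicable candidates";
    }

    say naive-dispatch(42);     # Int candidate: 42
    say naive-dispatch("hi");   # fallback: hi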

That would all be rather slow, however, and no small number of things are done to avoid all the work:

  • We compile many signatures into a QAST tree that does the work the binder would do. It doesn't cover everything - unpacking with subsignatures being the main example - but it covers nearly all of the common cases.
  • The above enables a whole cascade of optimizations to take place in the specializer (type check elimination, definedness check elimination, hllization check elimination, arg buffer bounds check elimination, not bothering to allocate the hash that would back *%_ if it's not used, decont elimination, rewriting named arg access to a fast pointer chase rather than a hash lookup, special-casing slurpy construction, and probably some other stuff I forgot).
  • There's also a multi-dispatch cache. The multi-dispatcher makes a cache entry when the outcome of the dispatch can be determined purely by the types, their definedness, and rw-ness. Further such calls are thus served from this dispatch cache, and so don't have to go through the multi-dispatch code at all (a small illustration follows after this list).
  • The specializer can use knowledge of argument types (or statistics + guard insertion, or a combination of the two) in order to simply emit a call directly to the cache entry, and potentially even inline it.
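
As a rough illustration of the dispatch-cache point above (the names are made up for the example, and timings vary by machine and Rakudo version), a multi whose dispatch can be decided purely by type should be served from the cache and specialized, while a where-constrained candidate forces a full dispatch on every call:

    multi by-type  (Int $x)         { }
    multi by-type  (Str $x)         { }
    multi by-where ($x where * > 0) { }

    { for ^100_000 { by-type  42 }; say "by-type:  ", now - ENTER now; }
    { for ^100_000 { by-where 42 }; say "by-where: ", now - ENTER now; }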

Which is why (though nobody ever puts it this way :P) so many forms of dispatch are 10x faster than the one in question here. If it weren't for those, things would be uniformly slow. :-)

As for what can be done about the case in question, there are a few possible directions we can explore, sorted in order of ease.

  1. The dispatch cache doesn't care what we stick into it; it just invokes it. We could, therefore, stick a closure into it that does the work needed to tie-break a smaller set of candidates, rather than check all of them again. This wouldn't immediately avoid the slow-path binder, but it would avoid some re-checking on each dispatch.
  2. The multi-dispatcher currently calls the signature binder to determine if there is a match. An alternative strategy would be to just call the first candidate, catching any bind-fail exception and continuing on to the next candidate. This would avoid a vast amount of re-work and would let the optimized code do the checking. However, we need to find a way to make it not be ridiculously fragile: we only want to catch bind exceptions for the immediate callee. I don't immediately know how we achieve that, but it should be possible to come up with something robust enough. This would be quite a big win, as it would also avoid the re-checking that we currently do on a successful dispatch. (A rough user-level sketch of this idea follows after the list.)
  3. If we've done both 1 and 2, then at this point the closure in step 1 is hopefully just a loop over a bunch of code objects that we invoke and then see if we get an exception from. In a perfect world, we'd then have a way for spesh to unroll the loop and spit out optimized calls to each of the candidates. However, we could also consider a code-gen solution where we compile a tie-breaker, which can then be installed into the caches.
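
For 2, here is a rough user-level sketch of the idea, with made-up names; the real change would live in the dispatcher rather than in user code. It also shows the fragility mentioned above: a plain try cannot tell a bind failure of the immediate callee apart from one thrown deeper in the call chain.

    my @candidates =
        sub (Int $x) { "Int candidate: $x" },
        sub (    $x) { "fallback: $x"      };

    sub dispatch-by-trying($arg) {
        for @candidates -> &cand {
            my $result = try cand($arg);
            # If binding $arg to this candidate failed, $! holds the exception
            # and we move on to the next candidate. The fragility: this `try`
            # swallows any exception, including bind failures from calls made
            # inside the candidate, which a robust implementation must not do.
            return $result unless $!;
        }
        die "No applicable candidates";
    }

    say dispatch-by-trying(42);    # Int candidate: 42
    say dispatch-by-trying("hi");  # fallback: hi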

Note that in the event all tie-breaks fail, we can then end up having to look at the next group of candidates (where a group is a set of candidates considered equally narrow). This will add some complexity to both 1 and 3. Effectively, the dispatcher probably has to be restructured into two steps: first, find all the candidates, by narrowness group, that would be applicable modulo tie-breaking; then, in the second step, go through them until one works. Done alone, this would actually make things slower; it needs the work suggested in step 1 to make it an actual win.
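
A compact sketch of that two-step shape (the grouping and the candidates are illustrative only): step one walks the narrowness groups, step two tie-breaks within a group by actually trying each candidate.

    # Candidates pre-sorted into narrowness groups, narrowest group first.
    my @groups =
        [ sub (Int $x where * > 0) { "positive Int: $x" },
          sub (Int $x)             { "Int: $x"          } ],
        [ sub (    $x)             { "anything: $x"     } ];

    sub two-step-dispatch($arg) {
        for @groups -> @group {          # step 1: groups, by narrowness
            for @group -> &cand {        # step 2: try candidates in the group
                my $result = try cand($arg);
                return $result unless $!;
            }
        }
        die "No applicable candidates";
    }

    say two-step-dispatch(-1);    # Int: -1
    say two-step-dispatch("hi");  # anything: hi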

On the other hand, 2 can probably be done without much restructuring, and it would in the end likely boil down to a relatively small patch, the problem being working out what code to write.

While 3 depends on 1 and 2 to make sense, 1 and 2 are both individually worthwhile. 1 is "just" NQP-level work; 2 might involve some MoarVM fiddling too, and some thought about how it might impact other backends (this optimization may turn out to be MoarVM-specific, however).
