
Proper tail calls #1888

Closed
wants to merge 8 commits

Conversation

@DemiMarie commented Feb 7, 2017

Rendered

@camlorn commented Feb 7, 2017

It should be possible to implement tail calls as some sort of transformation in rustc itself, predicated on the backend allowing manipulation of the stack and supporting some form of goto. I assume that WebAssembly at least would allow us to write our own method calls, but I haven't looked at it.

C is a problem, save in the case that the become keyword is used on the function we are in (there is a name for this that I'm forgetting). In that case, you may be able to just reassign to the arguments and reuse variables. More advanced constructions might be possible by wrapping the tail calls in some sort of outer running loop and returning codes as to which to call next, but this doesn't adequately address how to go about passing arguments around. I wouldn't rule out being able to support this in ANSI C, it's just incredibly tricky.
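For what it's worth, the special case described above (a become whose target is the enclosing function itself, usually called a self tail call or direct tail recursion) is exactly the one that lowers to argument reassignment plus a loop. A minimal sketch in Rust, with gcd as a hypothetical example not taken from the RFC:

// `become gcd(b, a % b)` in tail position would lower to: reassign the
// parameters, then jump back to the top of the function body.
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b;
        a = b;
        b = t;
    }
    a
}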

@camlorn commented Feb 7, 2017

Okay, an outline for a specific scheme in C:

  • Post-monomorphization, find all functions that use become. Build a list of the possible tail calls that may be reached from each of these functions.

  • Declare a C variable for every variable in all of the functions making up the set. Add a variable, code, that says which function we're in; a code of 0 means we're done. Add another variable, ret, to hold the return value.

  • Give each function a nonzero code and copy the body into a while loop that dispatches based on the code variable. When the while loop exits, return ret. Any instances of return are translated into an assignment to ret and setting code to 0.

  • When entering a tailcall function, redirect the call to the special version instead.

I believe this scheme works in all cases. We can deal with the issue of things that have Drop impls by dropping them: the variable can stay around without a problem, as long as the impl gets called. The key point is that we're declaring the slots up front so that the stack doesn't keep growing. The biggest disadvantage is that we have to declare all the slots for all the functions, and consequently the combined stack frame is potentially (much?) larger than if we had done it the way the RFC currently proposes. If parity in performance across backends is a concern, this scheme could be used in all backends. In any case, it works for any backend that goes through C.

Unless I'm missing something obvious, anyway.
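For concreteness, here is a sketch of the dispatch loop this scheme would generate, written in Rust rather than C for readability; the is_even/is_odd pair is a hypothetical stand-in for "the set of functions reachable via become":

// Two mutually tail-recursive functions fused into one dispatch loop.
// `code` says which function we're "in"; 0 means we're done.
fn is_even(start: u64) -> bool {
    // Slots for every variable of every function in the set, declared once,
    // so the stack does not grow no matter how many tail calls happen.
    let mut n = start;
    let mut ret = false;
    let mut code = 1; // 1 = is_even, 2 = is_odd
    while code != 0 {
        match code {
            1 => {
                if n == 0 { ret = true; code = 0; } // `return true`
                else { n -= 1; code = 2; }          // `become is_odd(n - 1)`
            }
            2 => {
                if n == 0 { ret = false; code = 0; } // `return false`
                else { n -= 1; code = 1; }           // `become is_even(n - 1)`
            }
            _ => unreachable!(),
        }
    }
    ret
}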

@glaebhoerl (Contributor) commented Feb 7, 2017

@camlorn Does that work for function pointers? I don't immediately see any reason it wouldn't; it's just that the "Build a list of the possible tail calls that may be reached from each of these functions." step sticks out, because in the interesting cases it's presumably "all of them"?

(Also, this feels very like defunctionalization? Is it?)

@camlorn commented Feb 7, 2017

@glaebhoerl
Can you become a function pointer? This was not how I read the RFC, though it would make sense if this were the case. Nonetheless, you are correct: if function pointers are allowed, this probably does indeed break my scheme. It might be possible to get around it, somehow.

I don't know what defunctionalization is. Is this defunctionalization? I'll get back to you once I learn a new word.

@ranma42 (Contributor) commented Feb 7, 2017

Is there any benchmarking data regarding the callee-pops calling convention?
AFAICT Windows uses such a calling convention (stdcall) for most APIs.
I have repeatedly looked for benchmarks comparing stdcall to cdecl, but I have only found minor differences (in either direction, possibly related to the interaction with optimisations), and I was unable to find anything providing a conclusive answer on which one results in better performance.

@camlorn commented Feb 7, 2017

@ranma42
I'm not sure why there would be a difference: either you do your jmp for return and then pop or you pop and then do your jmp for return, but in either case someone is popping the same amount of stuff?

Also, why does it matter here?

@camlorn commented Feb 7, 2017

@glaebhoerl
Apparently today is idea day:

Instead of making the outer loop be inside a function that declares all the needed variables, make the outer loop something that expects a struct morally equivalent to the tuple (int code, void* ptr, void* args), then have it cast ptr to the appropriate function pointer type by switching on code, cast args to a function-pointer-specific argument structure, then call the function pointer. It should be possible to get the args struct to be inline as opposed to an additional level of indirection somehow, but I'm not sure how to do it without violating strict aliasing. This has the advantage of making the stack frame roughly the same size as what it would be in the LLVM backend, but the disadvantage of being slower (but maybe we can sometimes use the faster while-loop with switch statement approach).

I don't think this is defunctionalization, based off a quick google of that term.
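A sketch of what that trampoline could look like, again in Rust; all names here are illustrative. Each function returns a description of the next call instead of making it, so only one frame is live at a time (the args indirection disappears here because the argument fits in a word):

// A step either finishes with a value or names the next function to call.
enum Step {
    Done(bool),
    Call(fn(u64) -> Step, u64),
}

fn is_even_step(n: u64) -> Step {
    if n == 0 { Step::Done(true) } else { Step::Call(is_odd_step, n - 1) }
}

fn is_odd_step(n: u64) -> Step {
    if n == 0 { Step::Done(false) } else { Step::Call(is_even_step, n - 1) }
}

// The outer running loop: dispatch on whatever the last call returned.
fn trampoline(mut f: fn(u64) -> Step, mut arg: u64) -> bool {
    loop {
        match f(arg) {
            Step::Done(v) => return v,
            Step::Call(next, next_arg) => { f = next; arg = next_arg; }
        }
    }
}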

@ranma42 (Contributor) commented Feb 7, 2017

@camlorn That is my opinion, too, but it is mentioned as "one major drawback of proper tail calls" in the current RFC

Later phases in the compiler assert that these requirements are met.

New nodes are added in HIR and HAIR to correspond to `become`. In MIR, however,
a new flag is added to the `TerminatorKind::Call` varient. This flag is only

@mglagla commented Feb 7, 2017

Typo: varient -> variant

@DemiMarie (Author) commented Feb 7, 2017

@camlorn @ranma42 The drawback of a callee-pops calling convention is that for caller-pops calling conventions, much of the stack pointer motion can be eliminated by the optimizer, since it is all in one function. However, with a callee-pops calling convention, you might be able to do the same thing in the callee – but I don't think you gain anything except on Windows, due to the red zone which Windows doesn't have.

I really don't know what I am talking about on the performance front though. Easy way to find out would be to patch the LLVM bindings that Rust uses to always enable tail calls at the LLVM level, then build the compiler, and finally see if the modified compiler is faster or slower than the original.

@DemiMarie (Author) commented Feb 7, 2017

@camlorn My intent was that one can become any function or method that uses the Rust ABI or the rust-call ABI (both of which lower to LLVM fastcc), provided that the return types match. Haven't thought about function pointers, but I believe that tail calls on trait object methods are an equivalent problem.

@camlorn commented Feb 7, 2017

@DemiMarie
Good point. They are.

I think my latest idea works out, but I'm not quite sure where you put the supporting structs without heap allocation. I do agree that not being able to do it in all backends might sadly be a deal breaker.

Is there a reason that Rustc doesn't already always enable tail calls in release mode?

[implementation]: #implementation

A current, mostly-functioning implementation can be found at
[DemiMarie/rust/tree/explicit-tailcalls](/DemiMarie/rust/tree/explicit-tailcalls).

@cramertj (Member) commented Feb 7, 2017

This 404s for me.

@cramertj (Member) commented Feb 8, 2017

Is there any particular reason this RFC specifies that become should be implemented at an LLVM level rather than through some sort of MIR transformation? I don't know how they work, but it seems like maybe StorageLive and StorageDead could be used to mark the callee's stack as expired prior to the function call.

@archshift (Contributor) left a comment

Just a note:

You shouldn't be changing the template file, but rather copying the template to a new file (0000-proper-tail-calls.md) and changing that!

@archshift (Contributor) commented Feb 8, 2017

I wonder if one can simulate the behavior of computed goto dispatch using these tail calls. That would be pretty neat indeed!

@DemiMarie (Author) commented Feb 8, 2017

@archshift There is a better way to do that (get rustc to emit the appropriate LLVM IR for a loop wrapped around a match when told to do so, perhaps by an attribute).

@DemiMarie (Author) commented Feb 8, 2017

@archshift done.

@Stebalien (Contributor) commented Feb 8, 2017

As a non-FP/non-PL person, it would be really nice to see some concrete examples of where become is nicer than a simple while loop. Personally, I only ever use recursion when I want a stack.

@ranma42 (Contributor) commented Feb 8, 2017

@Stebalien a case where they are typically nicer than a loop is when they are used to encode (the states of a) state machine. That is because instead of explicitly looping and changing the state, it is sufficient to call the appropriate function (i.e. the state is implicitly encoded by the function being run at that time). Note that this often makes it easier for the compiler to detect optimisation opportunities, as in some cases a state can trivially be inlined.

@Stebalien (Contributor) commented Feb 8, 2017

@ranma42 I see. Usually, I'd just put the state in an enum and use a while + match loop but I can see how become with a bunch of individual functions could be cleaner. Thanks!
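A quick sketch of the two encodings being compared, using a hypothetical counter state machine; the enum-plus-loop version is what you write today, and the become version (left in comments, since it does not compile) is what this RFC would allow:

enum State { Start, Running(u32), Done(u32) }

fn run() -> u32 {
    let mut s = State::Start;
    loop {
        s = match s {
            State::Start => State::Running(0),
            State::Running(n) if n < 10 => State::Running(n + 1),
            State::Running(n) => State::Done(n),
            State::Done(n) => return n,
        };
    }
}

// With proper tail calls, each state is its own function and each state
// transition is a tail call:
//   fn start() -> u32 { become running(0) }
//   fn running(n: u32) -> u32 {
//       if n < 10 { become running(n + 1) } else { become done(n) }
//   }
//   fn done(n: u32) -> u32 { n }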

@sgrif (Contributor) commented Feb 8, 2017

Should this RFC include at least one example of what this syntax looks like in use? (e.g. an entire function body)

@arthurprs commented Feb 8, 2017

A good example snippet would go a long way. 👍 overall, as the surface area is fairly small and it really helps Rust's functional mojo.

@DemiMarie (Author) commented Feb 8, 2017

Pinging @thepowersgang because they are, to the best of my knowledge, the only person working on an alternative Rust compiler, and because their compiler (mrustc) compiles via C, so they would need to implement one of the above solutions.

@bbarker commented Jun 23, 2018

@ehaliewicz commented Jul 30, 2018

@bbarker yep, that's similar to what WebKit does. It's a classic trick, also used by Chicken Scheme.

@timthelion commented Nov 3, 2018

@bbarker Trampolining can be implemented as a library (and already has been) https://docs.rs/tramp/0.3.0/tramp/
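For readers who haven't clicked through, usage looks roughly like this (adapted from the tramp 0.3.0 documentation; I may be off on details):

#[macro_use]
extern crate tramp;

use tramp::{tramp, Rec};

fn factorial(n: u128) -> u128 {
    // Returns either a finished value (rec_ret!) or a thunk for the next
    // step (rec_call!); `tramp` keeps evaluating thunks in a loop.
    fn fac_with_acc(n: u128, acc: u128) -> Rec<u128> {
        if n > 1 {
            rec_call!(fac_with_acc(n - 1, acc * n))
        } else {
            rec_ret!(acc)
        }
    }
    tramp(fac_with_acc(n, 1))
}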

@bbarker commented Nov 4, 2018

@timthelion thanks, that looks very promising!

@likeabbas commented Nov 4, 2018

Wow. The source code for this is genius: https://docs.rs/crate/tramp/0.3.0/source/src/lib.rs. This could be implemented in a nightly build quickly if it were made a part of the language, using the become keyword to replace the tramp function call.

@timthelion commented Nov 4, 2018

Here is another example of trampolining, from elm-lang: https://package.elm-lang.org/packages/elm-lang/core/3.0.0/Trampoline It seems that with enum types, trampolines are quite natural.

@jonhoo (Contributor) commented Nov 4, 2018

@likeabbas it's worth pointing out that tramp will heap-allocate for each recursive call though (unless I'm misreading it, rec_call! calls Thunk::new which allocates)
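A minimal model of why that is (this is not the crate's actual code, just the shape of any boxed-thunk trampoline): the "next step" has to be type-erased into a boxed closure, and that box is a fresh heap allocation on every bounce:

enum Bounce<T> {
    Done(T),
    // Box<dyn FnOnce(...)> erases the concrete closure type,
    // at the cost of one allocation per deferred call.
    Thunk(Box<dyn FnOnce() -> Bounce<T>>),
}

fn run<T>(mut b: Bounce<T>) -> T {
    loop {
        match b {
            Bounce::Done(v) => return v,
            Bounce::Thunk(f) => b = f(), // each step pays for its Box
        }
    }
}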

@timthelion commented Nov 4, 2018

I wrote some benchmarks for that package and unfortunately I either made a serious error in my control-group loop implementation, or trampolining is slightly slower than a straight-up loop.

At first I simply added a benchmark. You can see my code and results here: https://gitlab.com/timthelion/trampoline-rs/commit/84f6c843658c6c3a5893effa031ce734b910171c

test bench_oddness      ... bench:     398,832 ns/iter (+/- 261,575)
test bench_oddness_loop ... bench:           2 ns/iter (+/- 0)

bench_oddness_loop is my loop-based re-implementation of the pre-existing recursive algorithm that shipped with the package. As you can see, the speed demons among you would prefer to use a loop.

Then I used stacker to prevent trampolining unless the stack is running low. This seems to roughly double the speed. https://gitlab.com/timthelion/trampoline-rs/commit/b2a1cf4d4cb01a99088e991e6fc120841c122a26

test bench_oddness      ... bench:     158,448 ns/iter (+/- 78,957)
test bench_oddness_loop ... bench:           2 ns/iter (+/- 0)

I also believe that there is unnecessary boxing going on in that code, but I have yet to get past the 7th chapter of "the rust book" and this is really my second day with the language, so I'll leave that judgement to those who have more experience with the language.

Edit I opened a merge request to the library here https://gitlab.com/bzim/trampoline-rs/merge_requests/2

@ranma42 (Contributor) commented Nov 5, 2018

Another option is to implement trampolines (ab)using macros
https://play.rust-lang.org/?version=nightly&mode=release&edition=2015&gist=0d709d2f38ff60c9975eb0967c7e3232

EDIT: an extended version, with an example of some "typical" tail recursion usage and minor improvements (simpler syntax when no mutual recursion is used, support for type arguments):

https://play.rust-lang.org/?version=nightly&mode=release&edition=2015&gist=73f412aad2ebc9e7ef30e7d4c6cd2c0d

@timthelion commented Nov 5, 2018

I've added @ranma42's version to the benchmark suite. https://gitlab.com/timthelion/trampoline-rs/commit/03ee85c293cdcb57142612ae1c03d2bece0da5f5

The results are 8x faster than my stacker version:

test bench_oddness       ... bench:     193,335 ns/iter (+/- 89,934)
test bench_oddness_loop  ... bench:           2 ns/iter (+/- 0)
test bench_oddness_macro ... bench:      25,085 ns/iter (+/- 31,424)

but it seems, if I understand correctly, that this version only works for TCO that does not cross module boundaries, which is very useful for implementing state machines but still rules out TCO's use in monad-based parser combinator libraries.

I'm starting to have serious doubts about the validity of my benchmark; it seems to me that my loop-based implementation is too fast. If I understand my own code correctly, the loop should compile to circa 1000 iterations of decrement, compare, short jump. If my processor is 2 GHz, 2 ns is just 4 cycles. 1000 iterations should take at minimum 3000 cycles, or 1500 ns, no?

@RalfJung (Member) commented Nov 5, 2018

@timthelion I think you need to tame the optimizer. Likely your entire example gets optimized to a constant. The keyword here is black_box and there is an RFC for it. I seem to recall it also exists somewhere in libtest already where you should be able to use it in your benchmarks.

Something like this should be able to stop the optimizer from knowing what val will be:

use test::{black_box, Bencher};

#[bench]
fn bench_oddness_loop(b: &mut Bencher) {
    let mut val: u128 = 1000;
    b.iter(|| {
        // black_box will not mutate the value but LLVM does not know that,
        // so it cannot const-propagate the value after this call.
        black_box(&mut val);
        oddness::is_even_loop(val);
        val += 1;
    });
}
@timthelion commented Nov 5, 2018

@RalfJung I tried blackboxing val in a commit here https://gitlab.com/timthelion/trampoline-rs/commit/701224bd36bd8da055617fbe166b77d8d6b32d59 but it had no effect on the speed.

@RalfJung (Member) commented Nov 5, 2018

Hm, maybe the way I told you to use black_box is wrong... @gnzlbg halp? :D

@Nemo157 (Contributor) commented Nov 5, 2018

It's the implementation that's getting optimized away: https://rust.godbolt.org/z/yoB7iD

Changing it to use i = black_box(i - 1); in is_even_loop results in similar benchmark times for me (but is probably suppressing other, lesser optimizations than the one that just turns it into a single instruction).
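For concreteness, a hypothetical reconstruction of the patched function (the real is_even_loop is in the linked repository; this is just the shape of the fix):

// Nightly-only: black_box lives in the `test` crate at this point in time.
use test::black_box;

fn is_even_loop(mut i: u128) -> bool {
    let mut even = true;
    while i > 0 {
        // Making the decrement opaque stops LLVM from collapsing the whole
        // loop into a single parity check on the initial value.
        i = black_box(i - 1);
        even = !even;
    }
    even
}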

@gnzlbg (Contributor) commented Nov 5, 2018

That code is not using the result of is_even, so it can be optimized away. Use black_box like this:

use test::{black_box, Bencher};

#[bench]
fn bench_oddness(b: &mut Bencher) {
    let mut val: u128 = 1000;
    b.iter(|| {
        black_box(oddness::is_even(black_box(val)));
        val += 1;
    });
}

The black_box(val) prevents the compiler from using any information about the value of val when optimizing is_even, and the black_box(is_even(...)) forces the compiler to materialize the result of the computation.

@Nemo157 (Contributor) commented Nov 5, 2018

@gnzlbg is_even is a pub fn so it's not really optimized away; it's just that LLVM is capable of detecting that it's checking for even-ness and compiling it down to a 2-instruction implementation. ~2ns per call seems appropriate then.

EDIT: I assume the point was to test looping vs trampoline, if LLVM is removing the loop then that defeats the benchmark.

@timthelion commented Nov 5, 2018

Yes, obviously if LLVM has gone full singularity on us and can read and understand what the code is doing, then it's not a fair test. The point was to test the efficiency of a straight loop vs a trampoline.

@timthelion commented Nov 5, 2018

So the new results with @Nemo157's black_boxing of the decrement really are very similar.

test bench_oddness       ... bench:     162,518 ns/iter (+/- 85,663)
test bench_oddness_loop  ... bench:     149,110 ns/iter (+/- 184,309)
test bench_oddness_macro ... bench:      18,632 ns/iter (+/- 24,032)

https://gitlab.com/timthelion/trampoline-rs/commit/0276338a4bcde0b44de4ddc6e0e0fb9537c55a56

Perhaps this trampolining thing isn't so stupid after all. But why is it so similar? Shouldn't the trampoline still be slower, judging by my stacker test?

@bbarker commented Nov 5, 2018

@timthelion commented Nov 5, 2018

@bbarker it wasn't the whole function call that was being optimized away, it was that the loop (which tests for evenness) in this particular example was being transformed into a direct test for evenness. Pseudo-random inputs would not change that.

@timthelion commented Nov 5, 2018

It seemed like the two versions were suspiciously close in speed, and it occurred to me that at a recursion depth of just 1000, it is likely no trampolining was actually taking place. So I increased that depth to 1 million and found the trampoline version to be roughly twice as slow. The crazy macro version is now the one being suspiciously fast ;)

DEPTH=1_000_000

test bench_oddness       ... bench:   4,458,959 ns/iter (+/- 386,660)
test bench_oddness_loop  ... bench:   2,256,811 ns/iter (+/- 217,545)
test bench_oddness_macro ... bench:      69,017 ns/iter (+/- 6,315)

DEPTH=10_000_000

test bench_oddness       ... bench:  44,760,280 ns/iter (+/- 3,127,017)
test bench_oddness_loop  ... bench:  22,540,487 ns/iter (+/- 1,121,262)
test bench_oddness_macro ... bench:     664,491 ns/iter (+/- 58,690)

DEPTH=100_000_000

test bench_oddness       ... bench: 451,533,923 ns/iter (+/- 11,650,011)
test bench_oddness_loop  ... bench: 228,790,228 ns/iter (+/- 6,842,624)
test bench_oddness_macro ... bench:   6,773,690 ns/iter (+/- 661,104)

https://gitlab.com/timthelion/trampoline-rs/commit/337a9f009291754caa45b44ede5256b07b7bd0cc

@LinuxMercedes commented Nov 5, 2018

It seems to me that using stacker isn't the best choice here, as it's possible that tail-recursive functions will call arbitrarily-nested functions during their execution. Only tail-recursing when close to overflowing the stack violates the common assumption that tail-recursive functions only consume one stack frame.

EDIT: I am also seeing a ridiculously fast macro benchmark, which is a little confusing since the disassembly is pretty close to that of the loop. I'll look at it a bit more if I have time today.

@timthelion commented Nov 5, 2018

@LinuxMercedes well, it would be best if the cut-off were at least configurable. I'd also be curious whether we're seeing a saw-tooth memory allocation pattern to boot, but it seems that cargo bench doesn't tell us memory usage.

@ranma42 (Contributor) commented Nov 5, 2018

@timthelion I think one of the issues with the benchmark is that the number of operations performed in each bench iteration depends on the number of iterations that have already been performed (this explains the huge standard deviation).
Another issue is related to the definition of the bench_oddness_loop function, which LLVM recognises and optimizes to a non-looping one.

@timthelion commented Nov 6, 2018

@ranma42 we could try benchmarking factorial as well, but I believe that trampolining will never be as fast as a loop, which means that if it is to be our method of TCO, it should not be the recommended one for functions that can be translated into a loop. However, there still remains a set of design patterns (continuation-passing style/monads; for those unfamiliar with monads, CPS and monads are the same thing) which cannot be translated into loops, and it seems very desirable to allow such patterns and to standardize the method of trampolining so as to guarantee compatibility between CPS-based libraries.

@ranma42 (Contributor) commented Nov 6, 2018

@timthelion if the underlying compiler is good enough, trampolining can be optimized out. As an example, compare the assembly for length64 (the trampoline-recursive version) and length_loop64 (the loop-based one). They are identical, which is unsurprising, given that after the macro expansion the trampoline version is basically equivalent to the loop-based one (just one loop with a single state, termination vs unwrapping of the tail based on the result of the match of the data structure being examined).

EDIT: the functions I mentioned are implemented in the last playground I posted

@OvermindDL1 commented Nov 6, 2018

The problem with the above TCO benchmarks is that they are just self-recursive function calls, which are easy to optimize out and are not always the common functional pattern. So I'm wondering how well it would optimize a large number of functions that tail-call each other, with the next callee chosen by a random number or so. That should make a pretty good 'worst case' for optimization of the trampoline.

Is there really not a way to just pop the call stack and then jump to the new function after setting up its frame, from within the function? That's usually not a hard step to put into a compiler (if you don't mind the stack traces not containing 'everything', which is fine for TCO, especially explicit TCO)...
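Something like the following would exercise that worst case under the trampoline encoding discussed above (all names hypothetical; the LCG constants are Knuth's MMIX parameters):

// Each state picks its successor pseudo-randomly, which defeats both
// inlining and any closed-form rewrite of the call sequence.
enum Step {
    Done(u64),
    Call(fn(u64, u64) -> Step, u64, u64),
}

const FNS: [fn(u64, u64) -> Step; 2] = [state_a, state_b];

fn lcg(seed: u64) -> u64 {
    seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407)
}

fn state_a(n: u64, seed: u64) -> Step {
    if n == 0 { return Step::Done(seed); }
    let s = lcg(seed);
    Step::Call(FNS[(s >> 63) as usize], n - 1, s)
}

fn state_b(n: u64, seed: u64) -> Step {
    if n == 0 { return Step::Done(!seed); }
    let s = lcg(seed);
    Step::Call(FNS[(s >> 62 & 1) as usize], n - 1, s)
}

fn run(n: u64) -> u64 {
    let mut step = state_a(n, 42);
    loop {
        match step {
            Step::Done(v) => return v,
            Step::Call(f, m, s) => step = f(m, s),
        }
    }
}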

@Pauan referenced this pull request Nov 7, 2018: Reviving proper tail calls? #23 (open)

@haf referenced this pull request Feb 17, 2019: Add compiler-known attribute for enforced tail recursion #721 (open)
@slanterns commented Mar 15, 2019

Is there any plan for 2019?
