
Proper tail calls #1888

Closed
wants to merge 8 commits into from

Conversation

@DemiMarie commented Feb 7, 2017

Rendered

@ahicks92 commented Feb 7, 2017

It should be possible to implement tail calls as some sort of transformation in rustc itself, predicated on the backend allowing manipulation of the stack and supporting some form of goto. I assume that WebAssembly at least would allow for us to write our own method calls, but haven't looked at it.

C is a problem, save in the case where the become keyword is used on the function we are in (there is a name for this that I'm forgetting). In that case, you may be able to just reassign the arguments and reuse variables. More advanced constructions might be possible by wrapping the tail calls in some sort of outer running loop and returning codes indicating which function to call next, but this doesn't adequately address how to pass arguments around. I wouldn't rule out being able to support this in ANSI C; it's just incredibly tricky.
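The easy self-call case described above can be sketched in today's Rust; `gcd` here is an illustrative example (not from the RFC), showing the argument-reassignment loop a become-style self call could lower to:

```rust
// Self tail call lowered by hand: instead of `become gcd(b, a % b)`,
// reassign the arguments and loop back to the top. The stack frame
// is reused on every "call".
fn gcd(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let t = a % b; // compute the new arguments before overwriting
        a = b;
        b = t;
    }
    a
}
```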

@ahicks92 commented Feb 7, 2017

Okay, an outline for a specific scheme in C:

  • Post-monomorphization, find all functions that use become. Build a list of the possible tail calls that may be reached from each of these functions.

  • Declare a C variable for all variables in all the functions making up the set. Add a variable, code, that says which function we're in; code of 0 means we're done. Add another variable, ret, to hold the return value.

  • Give each function a nonzero code and copy the body into a while loop that dispatches based on the code variable. When the while loop exits, return ret. Any instances of return are translated into an assignment to ret and setting code to 0.

  • When entering a tailcall function, redirect the call to the special version instead.

I believe this scheme works in all cases. We can deal with the issue of things that have Drop impls by dropping them: the variable can stay around without a problem, as long as the impl gets called. The key point is that we're declaring the slots up front so that the stack doesn't keep growing. The biggest disadvantage is that we have to declare all the slots for all the functions, and consequently the combined stack frame is potentially (much?) larger than if we had done it in the way the RFC currently proposes. If parity in terms of performance is a concern, this could be the scheme used in all backends. Nonetheless, it works for any backend that C can be compiled to.

Unless I'm missing something obvious, anyway.
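A rough Rust rendition of the merged-function scheme may help. A hypothetical pair of mutually tail-recursive functions (is_even / is_odd, illustrative names) is folded into one function whose while loop dispatches on `code`, exactly as the steps above describe for a C backend:

```rust
// Hypothetical lowering of two mutually tail-recursive functions into
// one merged function. `code` selects the "current function":
// 1 = is_even, 2 = is_odd, 0 = done (return `ret`).
fn is_even_merged(mut n: u64) -> bool {
    let mut code: u8 = 1;
    let mut ret = false;
    while code != 0 {
        match code {
            // body of is_even: if n == 0 { return true } else { become is_odd(n - 1) }
            1 => {
                if n == 0 { ret = true; code = 0; } else { n -= 1; code = 2; }
            }
            // body of is_odd: if n == 0 { return false } else { become is_even(n - 1) }
            2 => {
                if n == 0 { ret = false; code = 0; } else { n -= 1; code = 1; }
            }
            _ => unreachable!(),
        }
    }
    ret
}
```

Because all the slots live in one frame, the stack stays constant even for a depth of a million "calls".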

@glaebhoerl (Contributor) commented Feb 7, 2017

@camlorn Does that work for function pointers? I don't immediately see any reason it wouldn't, just the "Build a list of the possible tail calls that may be reached from each of these functions." snippet which sticks out otherwise, because in the interesting cases it's presumably "all of them"?

(Also, this feels very like defunctionalization? Is it?)

@ahicks92 commented Feb 7, 2017

@glaebhoerl
Can you become a function pointer? This was not how I read the RFC, though it would make sense if this were the case. Nonetheless, you are correct: if function pointers are allowed, this probably does indeed break my scheme. It might be possible to get around it, somehow.

I don't know what defunctionalization is. Is this defunctionalization? I'll get back to you once I learn a new word.

@ranma42 (Contributor) commented Feb 7, 2017

Is there any benchmarking data regarding the callee-pops calling convention?
AFAICT Windows uses such a calling convention (stdcall) for most APIs.
I have repeatedly looked for benchmarks comparing stdcall to cdecl, but I have only found minor differences (in either direction, possibly related to the interaction with optimisations) and I was unable to find something providing a conclusive answer on which one results in better performance.

@ahicks92 commented Feb 7, 2017

@ranma42
I'm not sure why there would be a difference: either you do your jmp for return and then pop or you pop and then do your jmp for return, but in either case someone is popping the same amount of stuff?

Also, why does it matter here?

@ahicks92 commented Feb 7, 2017

@glaebhoerl
Apparently today is idea day:

Instead of making the outer loop be inside a function that declares all the needed variables, make the outer loop something that expects a struct morally equivalent to the tuple (int code, void* ptr, void* args). Have it cast ptr to the appropriate function pointer type by switching on code, cast args to a function-pointer-specific argument structure, then call the function pointer. It should be possible to get the args struct to be inline as opposed to an additional level of indirection somehow, but I'm not sure how to do it without violating strict aliasing. This has the advantage of making the stack frame roughly the same size as what it would be in the LLVM backend, but the disadvantage of being slower (though maybe we can sometimes use the faster while-loop-with-switch approach).
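The outer-loop-plus-function-pointer idea above can be sketched in safe Rust, with fn pointers standing in for the void* casts (Step, run, and the even/odd steps are illustrative names, not part of the RFC):

```rust
// Trampoline sketch: each step returns either the final result or the
// next function pointer plus its argument. The outer loop `run` owns
// only (f, arg), so the stack stays flat regardless of "call" depth.
enum Step {
    Done(bool),
    Call(fn(u64) -> Step, u64),
}

fn even_step(n: u64) -> Step {
    if n == 0 { Step::Done(true) } else { Step::Call(odd_step, n - 1) }
}

fn odd_step(n: u64) -> Step {
    if n == 0 { Step::Done(false) } else { Step::Call(even_step, n - 1) }
}

fn run(mut f: fn(u64) -> Step, mut arg: u64) -> bool {
    loop {
        match f(arg) {
            Step::Done(v) => return v,
            Step::Call(next, a) => { f = next; arg = a; }
        }
    }
}
```

The cost relative to the while-loop scheme is one indirect call per tail call, which is the slowdown mentioned above.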

I don't think this is defunctionalization, based off a quick google of that term.

@ranma42 (Contributor) commented Feb 7, 2017

@camlorn That is my opinion, too, but it is mentioned as "one major drawback of proper tail calls" in the current RFC

0000-template.md Outdated
Later phases in the compiler assert that these requirements are met.

New nodes are added in HIR and HAIR to correspond to `become`. In MIR, however,
a new flag is added to the `TerminatorKind::Call` varient. This flag is only
@mglagla (Contributor) Feb 7, 2017

Typo: varient -> variant

@DemiMarie (Author) commented Feb 7, 2017

@camlorn @ranma42 The drawback of a callee-pops calling convention is that with caller-pops conventions, much of the stack pointer motion can be eliminated by the optimizer, since it is all in one function. With a callee-pops convention you might be able to do the same thing in the callee, but I don't think you gain anything except on Windows, due to the red zone, which Windows doesn't have.

I really don't know what I am talking about on the performance front though. Easy way to find out would be to patch the LLVM bindings that Rust uses to always enable tail calls at the LLVM level, then build the compiler, and finally see if the modified compiler is faster or slower than the original.

@DemiMarie (Author) commented Feb 7, 2017

@camlorn My intent was that one can become any function or method that uses the Rust ABI or the rust-call ABI (both of which lower to LLVM fastcc), provided that the return types match. Haven't thought about function pointers, but I believe that tail calls on trait object methods are an equivalent problem.

@ahicks92 commented Feb 7, 2017

@DemiMarie
Good point. They are.

I think my latest idea works out, but I'm not quite sure where you put the supporting structs without heap allocation. I do agree that not being able to do it in all backends might sadly be a deal breaker.

Is there a reason that Rustc doesn't already always enable tail calls in release mode?

0000-template.md Outdated
[implementation]: #implementation

A current, mostly-functioning implementation can be found at
[DemiMarie/rust/tree/explicit-tailcalls](/DemiMarie/rust/tree/explicit-tailcalls).
@cramertj (Member) Feb 7, 2017

This 404s for me.

@cramertj (Member) commented Feb 8, 2017

Is there any particular reason this RFC specifies that become should be implemented at an LLVM level rather than through some sort of MIR transformation? I don't know how they work, but it seems like maybe StorageLive and StorageDead could be used to mark the callee's stack as expired prior to the function call.

@archshift (Contributor) left a comment

Just a note:

You shouldn't be changing the template file, but rather copying the template to a new file (0000-proper-tail-calls.md) and changing that!

@archshift (Contributor) commented Feb 8, 2017

I wonder if one can simulate the behavior of computed goto dispatch using these tail calls. That would be pretty neat indeed!

@DemiMarie (Author) commented Feb 8, 2017

@archshift There is a better way to do that (get rustc to emit the appropriate LLVM IR for a loop wrapped around a match when told to do so, perhaps by an attribute).

@DemiMarie (Author) commented Feb 8, 2017

@archshift done.

@Stebalien (Contributor) commented Feb 8, 2017

As a non-FP/non-PL person, it would be really nice to see some concrete examples of where become is nicer than a simple while loop. Personally, I only ever use recursion when I want a stack.

@ranma42 (Contributor) commented Feb 8, 2017

@Stebalien a case where they are typically nicer than a loop is when they are used to encode (the states of a) state machine. That is because instead of explicitly looping and changing the state, it is sufficient to call the appropriate function (i.e. the state is implicitly encoded by the function being run at that time). Note that this often makes it easier for the compiler to detect optimisation opportunities, as in some cases a state can trivially be inlined.
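The contrast can be made concrete. Below is the explicit loop-plus-enum encoding of a tiny state machine (an illustrative recognizer for (ab)*, not from the RFC); with become, each state would instead be its own function ending in a tail call to the next state, and the enum and loop would disappear:

```rust
// Explicit-state encoding: the current state lives in an enum and a
// loop drives the transitions. With `become`, state_a and state_b
// would be separate functions tail-calling each other.
#[derive(Clone, Copy)]
enum State { A, B }

fn matches_ab_star(input: &[u8]) -> bool {
    let mut state = State::A;
    let mut rest = input;
    loop {
        match (state, rest.split_first()) {
            (State::A, None) => return true, // A is the accepting state
            (State::A, Some((&b'a', tail))) => { state = State::B; rest = tail; }
            (State::B, Some((&b'b', tail))) => { state = State::A; rest = tail; }
            _ => return false,
        }
    }
}
```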

@Stebalien (Contributor) commented Feb 8, 2017

@ranma42 I see. Usually, I'd just put the state in an enum and use a while + match loop but I can see how become with a bunch of individual functions could be cleaner. Thanks!

@sgrif (Contributor) commented Feb 8, 2017

Should this RFC include at least one example of what this syntax looks like in use? (e.g. an entire function body)

@arthurprs commented Feb 8, 2017

A good example snippet would go a long way. 👍 overall, as the surface is fairly small and it really helps Rust's functional mojo.

@DemiMarie (Author) commented Feb 8, 2017

Pinging @thepowersgang because they are the only person working on an alternative Rust compiler to the best of my knowledge, and because since their compiler (mrustc) compiles via C they would need to implement one of the above solutions.

@timthelion commented Nov 5, 2018

@bbarker it wasn't the whole function call that was being optimized away, it was that the loop (which tests for evenness) in this particular example was being transformed into a direct test for evenness. Pseudo-random inputs would not change that.

@timthelion commented Nov 5, 2018

It seemed like the two versions were suspiciously close in speed, and it occurred to me that at a recursion depth of just 1000, it is likely no trampolining is actually taking place. So I increased that depth to 1 million and found the trampoline version to be roughly twice as slow. The crazy macro version is now the one being suspiciously fast ;)

DEPTH=1_000_000

test bench_oddness       ... bench:   4,458,959 ns/iter (+/- 386,660)
test bench_oddness_loop  ... bench:   2,256,811 ns/iter (+/- 217,545)
test bench_oddness_macro ... bench:      69,017 ns/iter (+/- 6,315)

DEPTH=10_000_000

test bench_oddness       ... bench:  44,760,280 ns/iter (+/- 3,127,017)
test bench_oddness_loop  ... bench:  22,540,487 ns/iter (+/- 1,121,262)
test bench_oddness_macro ... bench:     664,491 ns/iter (+/- 58,690)

DEPTH=100_000_000

test bench_oddness       ... bench: 451,533,923 ns/iter (+/- 11,650,011)
test bench_oddness_loop  ... bench: 228,790,228 ns/iter (+/- 6,842,624)
test bench_oddness_macro ... bench:   6,773,690 ns/iter (+/- 661,104)

https://gitlab.com/timthelion/trampoline-rs/commit/337a9f009291754caa45b44ede5256b07b7bd0cc

@LinuxMercedes commented Nov 5, 2018

It seems to me that using stacker isn't the best choice here, as it's possible that tail-recursive functions will call arbitrarily-nested functions during their execution. Only tail-recursing when close to overflowing the stack violates the common assumption that tail-recursive functions only consume one stack frame.

EDIT: I am also seeing a ridiculously fast macro benchmark, which is a little confusing since the disassembly is pretty close to that of the loop. I'll look at it a bit more if I have time today.

@timthelion commented Nov 5, 2018

@LinuxMercedes well, it would be best if the cut-off was at least configurable. I'd also be curious whether we're seeing a saw-tooth memory allocation pattern to boot. But it seems that cargo bench doesn't tell us memory usage.

@ranma42 (Contributor) commented Nov 5, 2018

@timthelion I think one of the issues with the benchmark is that the number of operations performed in each bench iteration depends on the number of iterations that have already been performed (this explains the huge standard deviation).
Another issue is related to the definition of the bench_oddness_loop function, which LLVM recognises and optimizes to a non-looping one.

@timthelion commented Nov 6, 2018

@ranma42 we could try benchmarking factorial as well, but I believe that trampolining will never be as fast as a loop, which means that if it is to be our method of TCO it should not be the recommended one for functions which can be translated into a loop. However, there still remains a set of design patterns (continuation-passing style / monads; for those unfamiliar with monads, CPS and monads are the same thing) which cannot be translated into loops, and it seems very desirable to allow such patterns and to standardize the method of trampolining so as to guarantee compatibility between CPS-based libraries.

@ranma42 (Contributor) commented Nov 6, 2018

@timthelion if the underlying compiler is good enough, trampolining can be optimized out. As an example, compare the assembly for length64 (the trampoline-recursive version) and length_loop64 (the loop-based one). They are identical, which is unsurprising, given that after the macro expansion the trampoline version is basically equivalent to the loop-based one (just one loop with a single state, termination vs unwrapping of the tail based on the result of the match of the data structure being examined).

EDIT: the functions I mentioned are implemented in the last playground I posted

@OvermindDL1 commented Nov 6, 2018

The problem with the above TCO benchmarks is that they are just self-recursive function calls, which are easy to optimize out and are not always the common functional pattern. I'm wondering how well it would optimize a large number of functions that tail-call each other, chosen by a random number or so. That should make for a pretty good 'worst case' for optimization of the trampoline.

Is there really not a way to just pop the call stack and then jump to the new function, after setting up its frame, from within the function? That's usually not a hard step to put into a compiler (if you don't mind the stack traces not containing 'everything', which is fine for TCO, especially explicit TCO)...

@slanterns (Contributor) commented Mar 15, 2019

Is there any plan for 2019?


@boggle commented Jul 9, 2020

Can we at least get an update for 2020?

@archshift (Contributor) commented Jul 9, 2020

For a language with many conspicuous FP paradigms integrated into it, it's pretty disappointing that simple recursion in Rust is pretty much a non-starter because of this. There are many situations where it's plainly more readable to use recursion to express a recursive algorithm than to try and manually lift it to a loop with mutable state.

If we want a compiler that reduces cognitive load for the programmer, we should encourage expressing algorithms in their most natural representation.

@matu3ba commented Jul 20, 2020

@archshift
This is blocked by the loop-optimization workaround and the missing LLVM fix to define forward progress. The justification was to fix wrong loop behavior (removal of complete loops) while upholding optimizations.

Overall, to do TCE/TCO the compiler needs to prove that the drop order does not matter. I might be wrong on this, but any strict tail call optimization is a proof over the structure of a finite automaton in which each node (containing some code) does not affect the existence of RAII variables (creating or dropping them). A relaxation would run a (functional) PDA proof on all paths to show that the creation and drop order of any variable remains the same. This has corresponding compile-time implications.

Correct me if I am wrong here.

@DemiMarie (Author) commented Aug 2, 2020

@matu3ba My proposal changed the drop order so that drops happened before the call was made. This changes semantics, but since tail calls were explicit under my proposal, this was fine.

@matu3ba commented Aug 3, 2020

@DemiMarie Thanks for your kind reply. And yes, you are correct that nobody sane writes loop in a 1-state recursive function / automaton.

As I understand the RFC, you explain how to do the following:

 //  1. Trivial instructions between the call and return do not prevent the
 //     transformation from taking place, though currently the analysis cannot
 //     support moving any really useful instructions (only dead ones).
 //  2. This pass transforms functions that are prevented from being tail
 //     recursive by an associative and commutative expression to use an
 //     accumulator variable, thus compiling the typical naive factorial or
 //     'fib' implementation into efficient code.
 //  3. TRE is performed if the function returns void, if the return
 //     returns the result returned by the call, or if the function returns a
 //     run-time constant on all exits from the function.  It is possible, though
 //     unlikely, that the return returns something else (like constant 0), and
 //     can still be TRE'd.  It can be TRE'd if ALL OTHER return instructions in
 //     the function return the exact same value.
 //  4. If it can prove that callees do not access their caller stack frame,
 //     they are marked as eligible for tail call elimination (by the code
 //     generator).

Sorry for not being very explicit, but I was talking more about general cases of TRE, which are limited by what needs to be proved by LLVM or Rust. Specifically function chaining, i.e. for automata.

The LLVM description, which is hopefully the correct one, 1. does not allow function chaining, and 2. has limitations on memory layout (no GEP), etc. [since it's more complicated to keep track of all the memory accesses in C/C++/LLVM].

When you have function chaining, one might also write loop or infinite loops inside an end state, which would be silently optimized away (and thus be unsound).

@bbarker commented Dec 5, 2020

workaround and the missing LLVM fix to define forward-progress.

I would simply like to highlight for anyone that didn't click that link, without judgement, that this was reported in 2006. Those were the days. Anyway, at least there appears to be progress on the forward progress.

@matklad (Member) commented Mar 13, 2021

Those who are interested in become for computed goto purposes might want to take a look at this Zig issue: ziglang/zig#8220

Apparently, there's a general shape of loop/match construct, which convinces LLVM to generate the right code.

@matthieu-m commented Apr 25, 2021

Clang landed [[clang::musttail]] in https://reviews.llvm.org/D99517.

Based on this support, a C protobuf parser was created which doubled (2x) the speed at which protobuf is parsed according to https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html .

It is not clear from the above which targets the attribute is supported on. At the same time, a conditionally supported functionality could -- at least in nightly -- allow experimentation to check the potential speed-up from tail calls compared to the loop/match construct mentioned in the Zig issue above.

@DemiMarie (Author) commented Apr 25, 2021

Sorry for not being very explicit, but I was talking more about general cases of TRE, which are limited by what needs to be proved by LLVM or Rust. Specifically function chaining, i.e. for automata.

Actually, general tail call elimination is the goal of the RFC. That’s why become must be explicit: it changes when local variables are dropped. If become is used for anything but a tail call, that is a compile time error. If LLVM cannot generate a tail call, that is a bug in LLVM.

@burdges commented Apr 25, 2021

As an aside, are there any directives that make LLVM leave as little as possible on the stack? Ideally profiler guided optimizations would attempt to detect where such places belong.

@leonardo-m commented Oct 5, 2021

Elsewhere they are discussing the [[clang::musttail]] attribute (in LLVM 13):

https://releases.llvm.org/13.0.0/tools/clang/docs/ReleaseNotes.html#major-new-features

https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html

https://reviews.llvm.org/D107872

Perhaps we could introduce in Nightly a similar Rust attribute like #[tailcall] that stops the compilation of a function (or group of mutually recursive ones) if rustc can't replace all its recursive calls with loops.

@matthieu-m commented Oct 6, 2021

Elsewhere they are discussing the [[clang::musttail]] attribute (in LLVM 13):
[...]
Perhaps we could introduce in Nightly a similar Rust attribute like #[tailcall] that stops the compilation of a function (or group of mutually recursive ones) if rustc can't replace all its recursive calls with loops.

There is already a keyword reserved for tail calls: become. Is there a reason to use an attribute rather than a keyword when the keyword already exists?

Note that rustc itself is unlikely to perform the replacements, it would typically leave this up to LLVM. What rustc should do, however, is to check the feasibility; typically, such tail calls require that no destructor be executed "around" the call, and it seems from the original article that the function to be called must have the same arguments as the function calling it.

@DemiMarie (Author) commented Oct 6, 2021

Note that rustc itself is unlikely to perform the replacements, it would typically leave this up to LLVM. What rustc should do, however, is to check the feasibility; typically, such tail calls require that no destructor be executed "around" the call, and it seems from the original article that the function to be called must have the same arguments as the function calling it.

One advantage of having become be explicit is that it can change when destructors run, so that destructors run before the tail call instead of afterwards.
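The drop-order point can be illustrated with today's Rust, using an explicit drop standing in for what become would do implicitly (Guard and step are illustrative names, not from the RFC):

```rust
// A guard type whose drop is observable.
struct Guard;
impl Drop for Guard {
    fn drop(&mut self) {
        println!("guard dropped");
    }
}

fn step(n: u64) -> u64 { n + 1 }

// Plain return: _guard is dropped AFTER step returns, so the caller's
// frame must survive the call -- not a true tail call.
fn with_return(n: u64) -> u64 {
    let _guard = Guard;
    step(n)
}

// What `become step(n)` would imply: locals are dropped BEFORE the
// call, leaving nothing live across it, so the frame can be reused.
fn with_become_semantics(n: u64) -> u64 {
    let guard = Guard;
    drop(guard);
    step(n)
}
```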

@timthelion commented Oct 6, 2021

Note that rustc itself is unlikely to perform the replacements, it would typically leave this up to LLVM.

If rustc doesn't do the replacements itself then can it guarantee that functions with become will be TCO'd? It would be kind of nightmarish if you specified TCOing with become and then had the stack explode and had no idea why.

@Ericson2314 (Contributor) commented Oct 7, 2021

@timthelion what is meant is that rustc will tell LLVM to guarantee it, since for a function that isn't inlined this is out of rustc's hands. The guarantee is still made, however.

@OvermindDL1 commented Oct 19, 2021

Should this issue still be closed considering it seems there's at least some movement happening on it? Or has another issue number taken over (and if so could the OP be updated)?
