
TokenStream manipulations are 1000x too slow #65080

Closed · dtolnay opened this issue Oct 4, 2019 · 15 comments

@dtolnay (Member) commented Oct 4, 2019

Context: illicitonion/num_enum#14
Switching a proc macro from being token-based to operating on strings, with just a final conversion from string to TokenStream, can be a 100x improvement in compile time. If we care that people continue to write macros using tokens, the performance needs to be better.

I minimized the slow part of the num_enum macro to this benchmark:
https://github.com/alexcrichton/proc-macro2/tree/12bac84dd8d090d2987a57b747c7ae7bbeb8a3d0/benches/bench-libproc-macro
On my machine the string implementation takes 8ms and the token implementation takes 25721ms.

I know that there is a proc macro server that these calls end up talking to, but I wouldn't expect this huge of a factor from that. If the server calls are the only thing making this slow, is there maybe a way we could buffer operations in memory to defer and batch the server work?

I will file issues in proc-macro2 and quote as well to see if anything can be improved on their end.

FYI @eddyb @petrochenkov @alexcrichton

@Mark-Simulacrum (Member) commented Oct 4, 2019

This is (AFAICT, from a brief analysis) because the code appears to allocate a fresh vector for the entire token stream each time we append one token. I believe that is because our Extend<TokenTree> for TokenStream implementation maps each token tree to a token stream with two elements, the tree and NonJoint, presumably via this From<TokenTree> impl.

Ultimately, the root cause boils down to this piece of code:

    #[stable(feature = "token_stream_extend", since = "1.30.0")]
    impl Extend<TokenTree> for TokenStream {
        fn extend<I: IntoIterator<Item = TokenTree>>(&mut self, trees: I) {
            self.extend(trees.into_iter().map(TokenStream::from));
        }
    }

    #[stable(feature = "token_stream_extend", since = "1.30.0")]
    impl Extend<TokenStream> for TokenStream {
        fn extend<I: IntoIterator<Item = TokenStream>>(&mut self, streams: I) {
            // FIXME(eddyb) Use an optimized implementation if/when possible.
            *self = iter::once(mem::replace(self, Self::new())).chain(streams).collect();
        }
    }

That code carries a FIXME from eddyb noting that it is super inefficient, and rightfully so. We essentially always create a TokenStreamBuilder and push ~two streams onto it: one with lots of elements and one with two (the token from the original extend, plus NonJoint). Having done that, we call TokenStreamBuilder::build, which calls TokenStream::from_streams with the two streams; that allocates a vector for the contents of both streams, and we have a new token stream.

I believe this means that in order to append N tokens we create N vectors of 1, 2, 3, ..., N elements (or so, possibly modulo a constant factor). If N is large, this is really slow.

cc @pnkfelix as well, as they did some investigating into a similar issue in #57735 (comment).


I am not sure what we can do about this -- ideally we would not re-create a new token stream on every appended token, but TokenStream has a structure that isn't really amenable to appending tokens into it (despite the name). Maybe we can have proc macro keep a TokenStreamBuilder instead of a TokenStream around, which would let us append a bunch of tiny token streams together before collating them all, but that also seems non-ideal...

@Mark-Simulacrum (Member) commented Oct 4, 2019

cc @nnethercote as well

@eddyb (Member) commented Oct 4, 2019

I remember the FIXME, but I also remember this not being that big of a problem, for some reason. I guess if the first impl didn't end up going through the second impl, it would be faster?

TokenStreamBuilder could totally take individual TokenTrees as well, AFAIK, and you could at least keep one builder for the entire extend (though if it's called with iter::once(tt), that would require even more optimization work).

Maybe the main problem here is that there is no API for creating a builder out of a TokenStream in CoW fashion?

Sadly the original vision of using ropes falls apart when you can't even slice a TokenStream.

@eddyb (Member) commented Oct 4, 2019

> Maybe we can have proc macro keep a TokenStreamBuilder instead of a TokenStream around, which would let us append a bunch of tiny tokenstreams together before collating them all, but that also seems non-ideal.

One nice way of doing this (if all else fails) is representing TokenStream on the server side (for proc macros, not rustc in general) as TokenStream | TokenStreamBuilder, and finishing the builder whenever a TokenStream method is used that we can't easily implement on the builder itself.

@nnethercote (Contributor) commented Oct 7, 2019

@matklad: this might affect your approach in #64782.

@nnethercote (Contributor) commented Oct 7, 2019

> our Extend<TokenTree> for TokenStream implementation maps the token tree to a token stream with two elements: tree, and NonJoint

I think you misread that code. It's a token stream with a single element; that element is a 2-tuple.

> we call TokenStreamBuilder::build, which calls TokenStream::from_streams with two streams, which allocates the vector for the contents of both streams, and we have a new tokenstream.

I have confirmed (via profiling and println! insertion) that this part is true. In the relevant part of execution, the second stream passed to TokenStream::from_streams always has a single element, while the first stream is one larger than the previous call.

@nnethercote (Contributor) commented Oct 7, 2019

I tried changing TokenStream from this:

    pub struct TokenStream(Option<Lrc<Vec<TreeAndJoint>>>);

to this:

    pub struct TokenStream(Vec<TreeAndJoint>);

This drastically sped up the microbenchmark under discussion, because new elements can be trivially appended to an existing TokenStream, rather than having to duplicate it before appending.

However, it was a big slowdown for the rustc-perf benchmarks. Most of them slowed down by 2-20%, but script-servo and style-servo were almost 5x worse. This is because the second representation is much more expensive to clone, and cloning happens frequently and from many places.

@alexcrichton (Member) commented Oct 7, 2019

@nnethercote I'm not entirely sure where this bottoms out so this may be a distracting comment, but would Rc::make_mut and such make those two representations "effectively equivalent" for the purposes of the benchmark here? That should still give us cheap clones and pushing a bunch of stuff onto one clone should still be quite cheap since it doesn't have to reallocate.

@nnethercote (Contributor) commented Oct 8, 2019

@alexcrichton: Indeed, I had the same thought late last night! I have used make_mut successfully in TokenStream::from_streams to fix half of the quadratic behaviour. I am now working at using it in TokenStreamBuilder::push.

@nnethercote (Contributor) commented Oct 8, 2019

#65198 fixes this so that the TokenStream code is roughly 10x slower than the string code, and the slowdown is consistent for different values of N (i.e. no more quadratic behaviour). There is room for additional improvement, but this is a good start.

It would make sense to add a benchmark to rustc-perf for this kind of code. The given microbenchmark is a possibility, but I'd prefer to use real-world code that demonstrated the problem, if possible.

bors added a commit that referenced this issue Oct 8, 2019
Speed up `TokenStream` concatenation

This PR fixes the quadratic behaviour identified in #65080.

r? @Mark-Simulacrum
@Mark-Simulacrum (Member) commented Oct 8, 2019

I'd personally prefer the microbenchmark as it's less likely to stop building in the future in some sense :)

But that PR looks excellent. Do we know if the remaining 10x cost is for some particular reason, or are strings just that much faster than the token code? (Lrc, etc.)

@nnethercote (Contributor) commented Oct 8, 2019

The 10x is because there's a lot of faffing around: a couple of once iterators, a chain iterator, a conversion from a TokenTree to a TokenStream, creating a TokenStreamBuilder, pushing onto it, Lrc, etc.

bors added a commit that referenced this issue Oct 9, 2019
Speed up `TokenStream` concatenation

This PR fixes the quadratic behaviour identified in #65080.

r? @Mark-Simulacrum
@eddyb (Member) commented Oct 9, 2019

Oh, I thought the implementation must be using Lrc<[TreeAndJoint]>; I'm really glad to be wrong and that this is getting fixed 🎉

There's likely no benefit in representing a proc_macro::TokenStream as a builder server-side but we could probably add a way to extend a proc_macro::TokenStream with a proc_macro::TokenTree.

@nnethercote (Contributor) commented Oct 11, 2019

@dtolnay: is the improved performance good enough for your purposes?

@dtolnay (Member, Author) commented Oct 11, 2019

I confirmed that the performance is massively improved in nightly-2019-10-10. Thanks!

FYI @illicitonion @anlumo

@dtolnay closed this Oct 11, 2019