
Implement converting an AST to a token tree #43081

Open
alexcrichton opened this Issue Jul 6, 2017 · 17 comments


alexcrichton commented Jul 6, 2017

Updated description

First implemented in #43230, the compiler can now losslessly tokenize a few AST nodes from their original token stream when necessary. Currently, however, this is somewhat buggy:

  • It only happens for items in certain situations (i.e., no inner attributes). The consequence is that more and more attributed items will have invalid span information when sent to procedural macros.
  • We don't invalidate the cache when the item is later changed. The consequence is that procedural macros receive a stale view of what the AST node actually represents.

The "real bug" here is that we haven't actually implemented this ticket. The initial implementation in #43230 was effectively just a heuristic, and not even a great one! As a result we still basically need the ability to losslessly tokenize an AST node back to its original set of tokens.

Some bugs that arise from this are:

  • Usage of #[cfg] modifies the AST but doesn't invalidate the cache -- #48644
  • Modules with attributes disagree on what their internal tokens are -- #47627
  • macro_rules! and procedural macros don't play well together -- #49846
  • Invoking a macro with brackets loses span information -- #50840
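
The first bug on that list is easy to reproduce with ordinary code. A runnable illustration (the struct is made up for the example): `#[cfg]` rewrites the item before expansion, so tokens captured at parse time still mention the stripped field while the post-cfg AST does not, and a stale token cache hands procedural macros the wrong view.

```rust
// Runnable sketch of the #[cfg] staleness problem (#48644): the parser's
// cached tokens for `Config` include `only_on_unix`, but after cfg-stripping
// the AST may not, so cached tokens and the AST can disagree.
#[derive(Debug, Default)]
struct Config {
    #[cfg(unix)]
    only_on_unix: u32,
    common: u32,
}

fn main() {
    // On non-unix targets the struct only has `common`, even though the
    // original source tokens mention both fields.
    println!("{:?}", Config::default());
}
```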

Original description

There's an associated FIXME in the code right now, and to fix it we'll need to implement tokenization of an AST node. The current thinking is to save the TokenStream instances adjacent to an AST node and use those instead of converting the node back into a token stream.
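
A minimal, self-contained sketch of that idea, with stand-in types rather than the real libsyntax definitions: the parser's tokens are cached next to the node and preferred over the lossy stringify-and-retokenize fallback.

```rust
// Stand-in types for illustration only; the real compiler uses its own
// TokenStream and ast::Item definitions.
#[derive(Clone, Debug)]
struct TokenStream(Vec<String>);

#[derive(Debug)]
struct Item {
    name: String,
    /// Tokens the parser consumed for this item, if they were captured.
    tokens: Option<TokenStream>,
}

impl Item {
    /// Prefer the cached tokens (lossless, original spans); otherwise fall
    /// back to pretty-printing and re-tokenizing, which loses span data.
    fn to_tokens(&self) -> TokenStream {
        match &self.tokens {
            Some(tokens) => tokens.clone(),
            None => TokenStream(vec![format!("fn {}() {{}}", self.name)]),
        }
    }
}

fn main() {
    let cached = Item {
        name: "foo".into(),
        tokens: Some(TokenStream(vec!["fn".into(), "foo".into(), "()".into(), "{}".into()])),
    };
    let uncached = Item { name: "bar".into(), tokens: None };
    println!("lossless: {:?}", cached.to_tokens());
    println!("fallback: {:?}", uncached.to_tokens());
}
```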

cc @dtolnay, @nrc, @jseyfried

mystor referenced this issue Jul 7, 2017: Release 0.12 #123 (closed)

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jul 14, 2017

proc_macro: Use an item's tokens if available
This partly resolves the `FIXME` located in `src/libproc_macro/lib.rs` when
interpreting interpolated tokens. All instances of `ast::Item` which have a list
of tokens attached to them now use that list of tokens to losslessly get
converted into a `TokenTree` instead of going through stringification and losing
span information.

cc rust-lang#43081

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jul 28, 2017


bors added a commit that referenced this issue Jul 28, 2017

Auto merge of #43230 - alexcrichton:more-tokenstream, r=nrc,jseyfried
Implement tokenization for some items in proc_macro

This PR is a partial implementation of #43081 targeted towards preserving span information in attribute-like procedural macros. Currently all attribute-like macros lose span information on the input token stream if it is iterated over, due to the compiler's inability to losslessly tokenize an AST node. This PR takes the strategy of saving off a list of tokens in particular AST nodes in order to return a lossless tokenized version. There are a few limitations with this PR, however, so the old fallback remains in place.
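
The "old fallback" amounts to printing the node back to a string and re-lexing it. A small demonstration of that round trip, using the proc-macro2 crate as a stand-in for the compiler's token stream (add proc-macro2 as a dependency; the textual content survives the round trip, but inside rustc every token in the re-lexed stream would carry a fresh span instead of the user's original spans):

```rust
use proc_macro2::TokenStream;
use std::str::FromStr;

fn main() {
    // Original tokens, as the parser would have captured them.
    let original = TokenStream::from_str("fn foo() { bar() }").unwrap();

    // Lossy fallback: pretty-print the tokens and re-lex the resulting string.
    let reparsed = TokenStream::from_str(&original.to_string()).unwrap();

    // The textual content round-trips; the spans would not.
    assert_eq!(original.to_string(), reparsed.to_string());
    println!("round-tripped: {}", reparsed);
}
```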

mattico added a commit to mattico/rust that referenced this issue Jul 29, 2017


matthewhammer pushed a commit to matthewhammer/rust that referenced this issue Aug 3, 2017


alexcrichton added a commit to alexcrichton/rust that referenced this issue May 22, 2018

rustc: Correctly pretty-print macro delimiters
This commit updates the `Mac_` AST structure to keep track of the delimiters
that it originally had for its invocation. This allows us to faithfully
pretty-print macro invocations not using parentheses (e.g. `vec![...]`). This in
turn helps procedural macros due to rust-lang#43081.

Closes rust-lang#50840
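
The delimiter point is visible in ordinary Rust: any of the three delimiters is accepted at a macro invocation, so a printer that normalizes them cannot reproduce the original tokens. A runnable illustration:

```rust
// A macro invocation may use (), [] or {}; a pretty-printer that always emits
// one of them (historically parentheses) loses which delimiter was written,
// and with it the chance to reuse the original spans (#50840).
fn main() {
    let a = vec![1, 2, 3]; // brackets, as usually written
    let b = vec!(1, 2, 3); // parentheses are accepted too
    let c = vec! {1, 2, 3}; // and so are braces
    assert_eq!(a, b);
    assert_eq!(b, c);
}
```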

alexcrichton added a commit to alexcrichton/rust that referenced this issue May 22, 2018


bors added a commit that referenced this issue May 24, 2018

Auto merge of #50971 - alexcrichton:no-stringify, r=petrochenkov
rustc: Correctly pretty-print macro delimiters


mark-i-m added a commit to mark-i-m/rust that referenced this issue May 24, 2018


HMPerson1 added a commit to HMPerson1/rust-clippy that referenced this issue Oct 26, 2018

bors added a commit that referenced this issue Nov 19, 2018

Auto merge of #55971 - SergioBenitez:skip-non-semantic, r=alexcrichton
Ignore non-semantic tokens for 'probably_eq' streams.

Improves the situation in #43081 by skipping typically non-semantic tokens when checking for 'probably_eq'.

r? @alexcrichton

HMPerson1 added a commit to HMPerson1/rust-clippy that referenced this issue Dec 21, 2018

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Dec 23, 2018

submodules: update clippy from a416c5e0 to fc24fce7
Fixes clippy tool state

Changes:
````
FIXME > TODO
rustup rust-lang#56992
Document map_clone known problems rust-lang#498
Remove header link
test: panic at map_unit_fn.rs:202 for map() without args
rm unused file map_unit_fn.stderr
panic at map_unit_fn.rs:202 for map() without args
Change contrib.md hierarchy, link to it from readme
Workaround rust-lang#43081
Teach `suspicious_else_formatting` about `if .. {..} {..}`
Link to `rustc_driver` crate in plugin
mutex_atomic: Correct location of AtomicBool and friends
Update README local run command to specify syspath
Do not mark as_ref as useless if it's followed by a method call
Changes lint sugg to bitwise and operator `&`
Run update_lints after renaming
Rename lint to MODULE_NAME_REPETITIONS
Add renaming tests
Move renaming to the right place
Implements lint for order comparisons against bool
fix(module_name_repeat): Try to register renamed lint, not valid yet
Fix an endless loop in the tests.
Fix `implicit_return` false positives.
chore(moduel_name_repeat): Rename stutter lint to module_name_repeat to avoid ableist language
Make integration tests fail on 'E0463'
base tests: make sure cargo-clippy binary can be called directly
Revert "Merge pull request rust-lang#3257 from o01eg/remove-sysroot"
````

bors added a commit that referenced this issue Dec 24, 2018

Auto merge of #57079 - matthiaskrgr:clippy, r=oli-obk
submodules: update clippy from a416c5e0 to fc24fce7


dtolnay commented Dec 30, 2018

Unfortunately, no matter how we slice it, this is a very difficult issue :(

@alexcrichton or @petrochenkov would it be possible to make an order of magnitude estimate of effort for a real fix? Multiple person-weeks, multiple months, most of a year?

Does most of the difficulty come from performance implications? I imagine the naive fix of making the AST look more like Syn's AST where every token is individually represented would be pretty bad for compiler performance.

I've been hitting this issue over and over and over since it was filed. It wasn't so bad with only derive macros, where the symptom was usually just error messages being printed in the wrong place, but now, with attribute macros and function-like macros, I often find it totally impossible to build composable macros (as seen in #57207), and the consequence of losing spans is no longer cosmetic: code that should compile doesn't compile.


petrochenkov commented Dec 31, 2018

@dtolnay
I think we should start with attaching associated token streams to all AST fragments that can be used as interpolated tokens in macros (the list can be found in enum Nonterminal), as is already done for ast::Item. This shouldn't require much time.
(Note that a significant part of enum Nonterminal is used only for legacy quasi-quoting, which is being removed right now.)

This may regress performance of expansion-heavy compilation by a few percent, but correctness seems more important and this can be optimized later.
We need to start by attaching tokens to ast::Expr and running benchmarks.
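
For context, an "interpolated token" is what a matched fragment such as `$e:expr` becomes when it is forwarded onward: a single opaque nonterminal rather than the original tokens. A tiny runnable illustration (the macro exists purely for demonstration):

```rust
// `$e` arrives inside the expansion as one interpolated `expr` nonterminal;
// attaching the original tokens to such nonterminals is what the proposal
// above is about.
macro_rules! forward {
    ($e:expr) => {
        stringify!($e)
    };
}

fn main() {
    assert_eq!(forward!(1 + 2), "1 + 2");
    println!("ok");
}
```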


Tokens attached to non-terminals will fix the decl macro issues, but not #[cfg] or inner attributes.
Longer term we need some dual AST/tokens representation with cfgs, built-in derives and other compiler transformations correctly transforming both parts.

Alternatively, the AST as it exists now would only be created after expansion, and only tokens plus the things necessary for expansion, like attribute targets, would exist before that (unfortunately, quite a lot is needed to resolve and expand an attribute).
Or some hybrid using AST only for expanded parts and being gradually folded from token-only to AST-only during expansion.
That's some design work that requires a focused person who knows things to sit and work through the design and proof-of-concept implementation.
I have no idea how much time it would take me personally; I've never done this kind of high-level work.


dtolnay commented Jan 1, 2019

Thanks! That's very helpful.

That's some design work that requires a focused person who knows things to sit and work through the design and proof-of-concept implementation.
I have no idea how much time it would take me personally; I've never done this kind of high-level work.

Let's try to find the right person to at least be thinking passively about this problem, or else it will keep sitting here racking up mentions as every procedural macro library hits it. Would @eddyb be the most appropriate person?


petrochenkov commented Jan 1, 2019

cc @matklad who was also interested in an alternative AST.


matklad commented Jan 1, 2019

Thanks for the ping @petrochenkov! It's interesting that I am experimenting with macro expansion in rust-analyzer at this very moment :)

I don't have much expertise here: I don't really know how Rust macros actually work, and I have taken only a cursory look at the spans/syntax-context infrastructure. Below are my random thoughts about AST/token-tree conversion in the context of IDEs:

In rust-analyzer, the syntax trees are not based on token trees (in an IDE context, braces are not always balanced, and you need to account for trivia, etc.). My current plan is to make token trees strictly a macro-specific interface and materialize them only when expanding macro invocations. And, because rust-analyzer syntax trees are pretty heavyweight, I also want (rust-analyzer/rust-analyzer#386) to introduce a "post-expansion" AST based on a more traditional hierarchy-of-enums representation.

Longer term we need some dual AST/tokens representation with cfgs, built-in derives and other compiler transformations correctly transforming both parts.

One way to achieve that would be a libsyntax2-style tree, whose "real" representation is a sequence of token trees and where AST nodes don't store data themselves besides pointers into the token tree.
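
A toy model of that shape, with all types invented for illustration: the token sequence is the single source of truth, and typed nodes are just ranges into it, so the tokens and the "AST" cannot drift apart.

```rust
// Invented, simplified types: real libsyntax2/rowan trees are considerably
// richer, but the ownership story is the same.
#[derive(Debug, Clone)]
struct Token {
    text: String,
}

/// A "node" is nothing but a kind tag plus a range into the token buffer.
#[derive(Debug)]
struct SyntaxNode {
    kind: &'static str,
    range: std::ops::Range<usize>,
}

impl SyntaxNode {
    fn text(&self, tokens: &[Token]) -> String {
        tokens[self.range.clone()]
            .iter()
            .map(|t| t.text.as_str())
            .collect::<Vec<_>>()
            .join(" ")
    }
}

fn main() {
    let tokens: Vec<Token> = ["fn", "foo", "(", ")", "{", "}"]
        .iter()
        .map(|s| Token { text: s.to_string() })
        .collect();
    let fn_item = SyntaxNode { kind: "FN", range: 0..tokens.len() };
    println!("{} -> {}", fn_item.kind, fn_item.text(&tokens));
}
```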


alexcrichton commented Jan 2, 2019

I think we probably just haven't put enough weight on this issue yet because it wasn't really all that pressing earlier, but I definitely agree with @dtolnay that it's far more pressing now!

I've always thought that the hard part of this is that we'd basically have to make libsyntax's AST look like syn's, which I figured was a monumental task given the size and usage of libsyntax. I never really considered the performance implications, but I wouldn't be surprised if it had more than a few regressions.

Thinking on this for a bit though, we may be able to get to a shorter-term solution. Up to now #[cfg] is the only extension (AFAIK) which actually modifies the AST pre-expansion. If that's true, then I think we can take a hybrid libsyntax/syn-like approach. We could perhaps say that all composite structures, those which can be internally #[cfg]'d, have the ability to generate their token stream. They would do this by iterating over their components and recursively generating token streams. All terminal nodes have cached token streams from the parser (like ast::Item does today).

A strategy like that, which is specific to #[cfg] and falls apart if something else modifies the AST, may be an order of magnitude less work than "make the entire AST syn-like". I haven't thought this through too much though, and the numerous locations we can place #[cfg] may effectively mean that we're just implementing a ToTokens trait for the whole AST anyway.
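
A minimal sketch of that hybrid, again with made-up types: leaf nodes keep the tokens the parser captured, and composite nodes regenerate their stream by concatenating whatever children survived #[cfg].

```rust
// Leaf nodes carry tokens cached at parse time; composite nodes rebuild their
// stream from their (possibly cfg-stripped) children, so the result always
// matches the current AST.
enum Node {
    Leaf { tokens: Vec<String> },
    Composite { children: Vec<Node> },
}

impl Node {
    fn to_tokens(&self) -> Vec<String> {
        match self {
            Node::Leaf { tokens } => tokens.clone(),
            Node::Composite { children } => {
                children.iter().flat_map(Node::to_tokens).collect()
            }
        }
    }
}

fn main() {
    // A "module" whose second item was removed by #[cfg]; regenerating from
    // the surviving child yields an up-to-date token stream.
    let module = Node::Composite {
        children: vec![Node::Leaf {
            tokens: ["fn", "a", "(", ")", "{", "}"].iter().map(|s| s.to_string()).collect(),
        }],
    };
    println!("{:?}", module.to_tokens());
}
```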

In any case the best way forward with this I think is to try to brainstorm a bit to figure out a few possibilities of what a "reasonable" solution might look like, and then bring this up with the broader compiler team because it has a lot of implications on the compiler!


dtolnay commented Jan 2, 2019

How feasible is it to implement ToTokens for the existing AST without data structure changes? We care most about preserving spans of identifiers as those are what lead to compilation failures. As I understand it, for every identifier represented in the AST there should be a way to form a proc_macro::Span corresponding to its hygiene context. As a first pass, we care less about spans of other tokens like keywords and punctuation since those would typically only affect error message placement.
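
A hedged sketch of what such a conversion could look like, written against the proc-macro2 crate rather than rustc's internals (the constructors are real proc-macro2 API; the helper and scenario are invented): identifiers are emitted with a caller-supplied span, while punctuation gets an invented call-site span.

```rust
use proc_macro2::{Ident, Punct, Spacing, Span, TokenStream, TokenTree};

// Hypothetical helper: turn a path's segments into tokens, reusing a real
// span for each identifier and inventing spans only for the `::` punctuation.
fn path_to_tokens(segments: &[&str], ident_span: Span) -> TokenStream {
    let mut tokens = Vec::new();
    for (i, segment) in segments.iter().enumerate() {
        if i > 0 {
            // Punct::new uses a call-site span; per the comment above, only
            // identifier spans matter for hygiene and type errors.
            tokens.push(TokenTree::Punct(Punct::new(':', Spacing::Joint)));
            tokens.push(TokenTree::Punct(Punct::new(':', Spacing::Alone)));
        }
        tokens.push(TokenTree::Ident(Ident::new(segment, ident_span)));
    }
    tokens.into_iter().collect()
}

fn main() {
    let ts = path_to_tokens(&["std", "string", "String"], Span::call_site());
    println!("{}", ts); // prints the reconstructed path tokens
}
```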


alexcrichton commented Jan 2, 2019

I'm not sure I actually have a great idea of how feasible that would be. I feel like it would be difficult because we don't track all spans everywhere, but that's based on a mental snapshot of the AST from about 3 years ago. Diagnostics have come a long way since then, which often comes with annotating more parts of the AST with more spans.

I'm pretty sure, however, that most punctuation in the AST doesn't have span information. For example nothing in #[foo] or foo::bar or foo + bar has span information on just the punctuation itself. That may be the hardest part, because all identifiers may have spans at this point.

(oh right, also keywords aren't really tracked at all)

It may just not actually be that hard to implement a raw ToTokens for the AST. The pretty-printing module is basically already that, and we'd just switch it all to pretty-printing to a TokenStream and then pretty-printing that later on.


dtolnay commented Jan 2, 2019

all identifiers may have spans at this point.

That's what I mean. We can implement conversions from AST to token stream (similar to ToTokens) in a way that preserves spans of all identifiers while inventing spans (as currently done) for keywords and punctuation.

It may just not actually be that hard to implement a raw ToTokens for the AST. The pretty-printing module is basically already that, and we'd just switch it all to pretty-printing to a TokenStream and then pretty-printing that later on.

@alexcrichton or @petrochenkov would one of you be able to give this a shot for one of the simplest AST types? Then we could see whether other people can fill in the work for the rest of the AST.


alexcrichton commented Jan 3, 2019

Before we give this a stab I'd want to game out the viability of this strategy. I feel like the punctuation/keyword spans are basically unused for TokenStream but are critical for setting span information on all the rustc AST nodes (as the hi/lo is used all over the place in the parser). I'd be wary that if identifiers were correct and nothing else, we'd still be in a pretty bad place.


dtolnay commented Feb 5, 2019

@eddyb had some ideas at the Rust All Hands for how to fix this. As I understand it, we want to treat cfg eliminations and other manipulations of the syntax tree as overlays, with each overlay also carrying a token range that identifies the contiguous group of source tokens involved. This overlay mechanism already exists to some extent but is collapsed at some point in a way that is not information-preserving. Based on the token range information captured in overlays it should be possible to reconstruct accurate spans for every token in the syntax tree. And the performance will possibly even be better than the current stringify-reparse workaround.
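
A self-contained sketch of the overlay idea, with invented names: instead of mutating the cached tokens, each transformation records the contiguous range of original tokens it removed, so the surviving tokens, and with them the original spans, remain reconstructible.

```rust
use std::ops::Range;

// An overlay records which contiguous slice of the original token buffer a
// transformation (here, a cfg elimination) removed, instead of rewriting the
// buffer itself.
struct Overlay {
    removed: Range<usize>,
}

fn surviving_tokens<'a>(original: &[&'a str], overlays: &[Overlay]) -> Vec<&'a str> {
    original
        .iter()
        .enumerate()
        .filter(|(i, _)| !overlays.iter().any(|o| o.removed.contains(i)))
        .map(|(_, t)| *t)
        .collect()
}

fn main() {
    // Tokens of `struct S { #[cfg(unix)] a: u32, b: u32 }`, heavily simplified.
    let original = [
        "struct", "S", "{", "#[cfg(unix)]", "a", ":", "u32", ",", "b", ":", "u32", "}",
    ];
    // The cfg elimination is recorded as an overlay over tokens 3..8 rather
    // than by editing the buffer, so spans for the rest stay reconstructible.
    let overlays = [Overlay { removed: 3..8 }];
    println!("{:?}", surviving_tokens(&original, &overlays));
}
```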

This will be impactful not just for procedural macros. We identified that IDEs also care about span information captured accurately in macro-expanded code.

Mentioning @alexreg who may be interested in tackling this issue.


elichai commented Feb 6, 2019

Just so I understand: this issue is about the Span?
If so, is it anywhere close to being solved? This gives me a lot of problems in my attribute macro; every error in the function is shown as if it's on the macro.

For example:

83 |     #[logfn(INFO)]                                      
   |     ^^^^^^^^^^^^^^ expected struct `std::string::String`, found array of 20 elements
