Replace the pretty-print/retokenize hack with a direct comparsion of AST nodes #75820

Aaron1011 · 2020-08-22T21:10:37Z

Currently, proc-macro expansion pretty-prints the Nonterminal being passed to the proc macro:

Lines 283 to 293 in 5528caf

    
           // FIXME(#43081): Avoid this pretty-print + reparse hack 
        
           let source = pprust::nonterminal_to_string(nt); 
        
           let filename = FileName::macro_expansion_source_code(&source); 
        
           let tokens_for_real = parse_stream_from_source_str(filename, source, sess, Some(span)); 
        
           // During early phases of the compiler the AST could get modified 
        
           // directly (e.g., attributes added or removed) and the internal cache 
        
           // of tokens my not be invalidated or updated. Consequently if the 
        
           // "lossless" token stream disagrees with our actual stringification 
        
           // (which has historically been much more battle-tested) then we go 
        
           // with the lossy stream anyway (losing span information).

This ensures that the TokenStream we pass to a proc-macro always agrees with the AST struct it is attached to. This is necessary because the compiler may directly modify the AST structure after it is parsed but before it is passed to a proc-macro. Currently, it looks like #[cfg] attributes are the only way for a proc-macro to observe this: a derive macro will have cfg-stripping applied to the fields/variants of the target struct/enum before it is invoked.

Unfortunately, while this check has no false negatives (keeping an invalid TokenStream), it has had a large number of false positives (throwing away the TokenStream when it could have been kept):

When a false positive occurs, it results in an instance of #43081. Originally, I thought that these false positives could only result in unspanned/mis-spanned compiler errors - while obnoxious to deal with, they can be fixed as they come up.

Unfortunately, using the re-created TokenStream means that in addition to losing location information, we also lose hygiene information (since both are accessed via a Span). This is much more serious - it results in code compiling that should not. In two cases, users actually wrote macros that explicitly relied on hygiene information being lost, likely because they did not realize that this was a bug. Fixing pretty-print/retokenize false positives can therefore incur unavoidable breakage in certain crates. If one of these buggy macros was exposed in a public API, the problem would be much worse. See #73084 for more details.

I propose that we replace the pretty-print/reparse hack with the following strategy:

During token collection, we clone the AST struct (e.g. Item) just before the TokenStream is attached. We then attach a struct AstTokens(P(T), TokenStream) to the struct. The struct now stores a copy of itself (behind a P) as it was just after parsing.
We #[derive(PartialEq)] for the entire AST - while a few structs/enum do this, most don't.
When we need to tokenize a Nonterminal, we compare the current AST struct to the original one that it stores. If they are equal, we use the stored TokenStream - otherwise, we pretty-print and retokenize.

There are some details that would need to be worked out:

We would need to temporarily remove/ignore certain attributes from the original AST struct prior to the comparison, since the outer AST struct (wrapped in a Nonterminal) has an attribute macro removed before it is expanded.
We might want to create an internal AstEq trait, which we derive instead of PartialEq. It's conceivable that we might want to implement PartialEq for a 'looser' form of equality on some AST structs in the future, which would be incompatible with the 'strict' equality (e.g. checking that NodeIds match exactly) required by this check. However, doing so might be overkill.
The overhead of cloning an AST structure and walking it via PartialEQ could conceivable be higher than pretty-printing and re-tokenizing - we would need to benchmark it.

However, adopting this scheme would allow us to bypass all issues resulting from inconsistencies in the pretty-printer. While we would still require the pretty-printer to produce correct code (when we detect a mismatch), the proc-macro infrastructure would no longer require it to exactly preserve certain information (empty where clauses, extra parentheses, etc.)

Alternatives:

Keep the pretty-print/re-tokenize hack. It seems very unlikely that Pretty-printing adds additional parentheses, breaking the proc-macro reparse check #75734 is the last issue we will encounter, so we will probably need to continue to make modifications to the pretty-printer to support its use with proc-macros.
Attempt to track modification to the AST structure - e.g. instead of handing out mutable references in MutVisitor, create some kind of smart pointer that invalidates the TokenStream when mutable access is required. I attempted several variants of this approach, but all proved to be extremely invasive, requiring changes across the entire compiler.
Do nothing for now, but remove the pretty-print hack without replacing it with anything once Compiler loses location information before calling macros (sometimes) #43081 is fully resolved (e.g. we handle #[cfg]s and inner attributes correctly). While this approach might work, it seems dangerous - any modifications to the AST done by the compiler risk getting lost whenever proc-macros are involved. While I'm pretty sure that #[cfg]s are the only place where we intentionally modify the AST before a proc-macro sees it, we might inadvertently introduce additional modifications in the future.

The text was updated successfully, but these errors were encountered:

Aaron1011 · 2020-08-23T21:10:28Z

@petrochenkov: If you're supportive of the general concept, I plan to start working on implementing this.

petrochenkov · 2020-08-23T21:18:14Z

I need to think.
Alternatives 2 + 3 is probably the proper long term solution, and it's indeed invasive and needs design work.

petrochenkov · 2020-08-24T21:03:01Z

TLDR: I don't think relying on AST more is the right direction.

We don't really need AST until expansion is done, before that some simplified mostly token-based represenation like #43081 (comment) could be used. So it could make sense to rely on it less.

The plans to get rid of nonterminal tokens and always treat them as simply delimited groups with token streams inside (and not only in proc macros) also fit into the picture.

People also expressed desire to delay parsing inside any delimited groups until after expansion (#65860) to be able to use unstable syntax inside items marked with #[cfg(nightly)].

(We'll have to force pre-expansion parsing into AST at least in some cases though to maintain the current behavior of cfgs in derive macro input.)

cc @matklad as an interested party

Aaron1011 · 2020-08-24T23:26:08Z

With regard to the PreexpTokenStream described in #43081 (comment):

Without const generics, we could probably get away with determining the attribute target via a simple rule like 'stop at the first comma or delimited stream at the same depth' (which would handle #[cfg] field: Ty, and field: MyType<#[cfg] T>). Unfortunately, const generics means that we need to be able to CFG-strip nested statements before running a 'derive' macro - see #75864.

Aaron1011 · 2021-01-12T14:17:33Z

The hack was removed entirely in #79472.

Aaron1011 added the A-proc-macros Area: Procedural macros label Aug 22, 2020

estebank added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-cleanup Category: PRs that clean code up or issues documenting cleanup. labels Aug 23, 2020

petrochenkov mentioned this issue Nov 8, 2020

Syntactically permit postfix macros to reject them later #78849

Closed

Aaron1011 closed this as completed Jan 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace the pretty-print/retokenize hack with a direct comparsion of AST nodes #75820

Replace the pretty-print/retokenize hack with a direct comparsion of AST nodes #75820

Aaron1011 commented Aug 22, 2020 •

edited

Aaron1011 commented Aug 23, 2020

petrochenkov commented Aug 23, 2020

petrochenkov commented Aug 24, 2020

Aaron1011 commented Aug 24, 2020 •

edited

Aaron1011 commented Jan 12, 2021

Replace the pretty-print/retokenize hack with a direct comparsion of AST nodes #75820

Replace the pretty-print/retokenize hack with a direct comparsion of AST nodes #75820

Comments

Aaron1011 commented Aug 22, 2020 • edited

Aaron1011 commented Aug 23, 2020

petrochenkov commented Aug 23, 2020

petrochenkov commented Aug 24, 2020

Aaron1011 commented Aug 24, 2020 • edited

Aaron1011 commented Jan 12, 2021

Aaron1011 commented Aug 22, 2020 •

edited

Aaron1011 commented Aug 24, 2020 •

edited