
proc_macro: implement TokenTree, TokenKind, hygienic quote!, and other API #40939

Merged
merged 13 commits on Jul 6, 2017

Conversation

@jseyfried (Contributor) commented Mar 31, 2017

All new API is gated behind #![feature(proc_macro)] and may be used with #[proc_macro], #[proc_macro_attribute], and #[proc_macro_derive] procedural macros.

More specifically, this PR adds the following in proc_macro:

// `TokenStream` constructors:
impl TokenStream { fn empty() -> TokenStream { ... } }
impl From<TokenTree> for TokenStream { ... }
impl From<TokenKind> for TokenStream { ... }
impl<T: Into<TokenStream>> FromIterator<T> for TokenStream { ... } 
macro quote($($t:tt)*) { ... } // A hygienic `TokenStream` quoter

// `TokenStream` destructuring: 
impl TokenStream { fn is_empty(&self) -> bool { ... } }
impl IntoIterator for TokenStream { type Item = TokenTree; ... }

struct TokenTree { span: Span, kind: TokenKind }
impl From<TokenKind> for TokenTree { ... }
impl Display for TokenTree { ... }

struct Span { ... } // a region of source code along with expansion/hygiene information
impl Default for Span { ... } // a span from the current procedural macro definition
impl Span { fn call_site() -> Span { ... } } // the call site of the current expansion
fn quote_span(span: Span) -> TokenStream;

enum TokenKind {
    Group(Delimiter, TokenStream), // A delimited sequence, e.g. `( ... )`
    Term(Term), // a unicode identifier, lifetime ('a), or underscore
    Op(char, Spacing), // a punctuation character (`+`, `,`, `$`, etc.).
    Literal(Literal), // a literal character (`'a'`), string (`"hello"`), or number (`2.3`)
}

enum Delimiter {
    Parenthesis, // `( ... )`
    Brace, // `{ ... }`
    Bracket, // `[ ... ]`
    None, // an implicit delimiter, e.g. `$var`, where `$var` is `...`.
}

struct Term { ... } // An interned string
impl Term {
    fn intern(string: &str) -> Term { ... }
    fn as_str(&self) -> &str { ... }
}

enum Spacing {
    Alone, // not immediately followed by another `Op`, e.g. `+` in `+ =`.
    Joint, // immediately followed by another `Op`, e.g. `+` in `+=`
}

struct Literal { ... }
impl Display for Literal { ... }
impl Literal {
    fn integer(n: i128) -> Literal { ... } // unsuffixed integer literal
    fn float(n: f64) -> Literal { ... } // unsuffixed floating point literal
    fn u8(n: u8) -> Literal { ... } // similarly: i8, u16, i16, u32, i32, u64, i64, f32, f64
    fn string(string: &str) -> Literal { ... }
    fn character(ch: char) -> Literal { ... }
    fn byte_string(bytes: &[u8]) -> Literal { ... }
}

For details on quote! hygiene, see this example and declarative macros 2.0.
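
A minimal usage sketch of how these constructors compose in a function-like procedural macro. Everything beyond the API names listed above (the crate setup, the macro name, the generated item) is an assumption for illustration, not part of this PR; `quote!` covers the common case where most of the output is written out literally, while the explicit constructors matter when tokens are computed.

#![feature(proc_macro)]
extern crate proc_macro;

use proc_macro::{Literal, Spacing, Term, TokenKind, TokenStream};

// Hypothetical example: expands `define_answer!()` to the item
// `const ANSWER: i64 = 42;`, built only from the constructors above.
#[proc_macro]
pub fn define_answer(_input: TokenStream) -> TokenStream {
    vec![
        TokenKind::Term(Term::intern("const")),   // `const`
        TokenKind::Term(Term::intern("ANSWER")),  // `ANSWER`
        TokenKind::Op(':', Spacing::Alone),       // `:`
        TokenKind::Term(Term::intern("i64")),     // `i64`
        TokenKind::Op('=', Spacing::Alone),       // `=`
        TokenKind::Literal(Literal::integer(42)), // unsuffixed `42`
        TokenKind::Op(';', Spacing::Alone),       // `;`
    ]
    .into_iter()
    .collect() // via `From<TokenKind>` and `FromIterator<T: Into<TokenStream>>`
}

Whether `ANSWER` is then nameable at the call site depends on the hygiene information carried by the generated spans, which is the hygiene that `Span` encodes (see the note on `quote!` hygiene above).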

r? @nrc


@eddyb (Member) commented Mar 31, 2017

Can't we call TokenTree just Token? I'm a bit disappointed in how I fooled myself we could get rid of the "tree" model but flattening seems to not be in the cards.
However, I still think we can keep simpler names in the API.

@SimonSapin (Contributor)

Nice! I like the overall approach, some comments on the details:

impl IntoIterator for TokenStream { type Item = TokenTree; ... }

Would impls for &TokenStream and &mut TokenStream also be useful?

struct Symbol { ... } // An interned string

To me the word "symbol" evokes punctuation, excluding letters. Could this be called something else? Identifier, Ident, Word, … ? (Are the first two not appropriate because "identifier" excludes keywords?)

Joint, // immediately followed by another `Op`, e.g. `+` in `+=`

So the = in += would be a separate token? &&, .., or :: would similarly be two tokens each? (>>= or ... three tokens?) Please expand the doc-comments in the PR to explain this.

impl Display for Literal { ... }
impl Literal {

I’d also add:

  • fn byte_string(bytes: &[u8]) -> Literal, much more convenient than building a sequence of u8 literals for a slice literal. (And more efficient?)
  • Maybe fn byte(byte: u8) -> Literal similar to Literal::u8 but with byte literal syntax, though it only makes a difference when looking at generated code for debugging.
  • fn as_str(&self) -> Option<&str> and other fn as_*(&self) -> Option<*> methods to extract the value. (Though number literals without a type suffix are tricky.) The Display impl technically provides the same information, but writing a Rust parser (even just for literals) is a complex undertaking.

In src/libproc_macro/lib.rs

#[unstable(feature = "proc_macro", issue = "38356")]
#[macro_export]
macro_rules! quote { () => {} }

In src/libproc_macro/quote.rs

macro_rules! quote {
    () => { TokenStream::empty() };
    ($($t:tt)*) => { [ $( quote_tree!($t), )* ].iter().cloned().collect::<TokenStream>() };
}

Why are there two quote macros, with one empty, and only the empty one exported?

@SimonSapin (Contributor)

I think this PR addresses all of my previous comments in #38356 (comment) 👍

@bors (Contributor) commented Mar 31, 2017

☔ The latest upstream changes (presumably #40950) made this pull request unmergeable. Please resolve the merge conflicts.

@jseyfried (Contributor Author)

@eddyb

I'm a bit disappointed in how I fooled myself we could get rid of the "tree" model but flattening seems to not be in the cards.

Do you mean w.r.t. the internals? I believe we can still use the internal flattened representation you proposed with this API (imo, would be nice but not high priority).

@SimonSapin

Would impls for &TokenStream and &mut TokenStream also be useful?

impl IntoIterator for &TokenStream { type Item = TokenTree; ... } and/or
impl TokenStream { fn iter(&self) -> TokenIter { ... } } could be useful -- both would be equivalent to tokens.clone().into_iter().

impl IntoIterator for &TokenStream { type Item = &TokenTree; ... } isn't possible (yet, perhaps), and would make an internal "flattened" representation more complex / less efficient.

TokenStream uses shared, ref-counted memory, so I don't think impl IntoIterator for &mut TokenStream makes sense.
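
A small sketch of consuming a stream with the by-value iterator from the API above (the helper function is hypothetical); keeping the original stream means cloning it first, which is cheap because the underlying buffer is shared:

use proc_macro::{TokenKind, TokenStream};

// Counts the delimited groups at the top level of a stream. `into_iter()`
// consumes a stream and yields owned `TokenTree`s, so we iterate over a
// clone and leave the caller's stream intact.
fn count_groups(tokens: &TokenStream) -> usize {
    tokens.clone()
        .into_iter()
        .filter(|tree| match tree.kind {
            TokenKind::Group(..) => true,
            _ => false,
        })
        .count()
}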

To me the word "symbol" evokes punctuation, excluding letters. Could this be called something else?

Yeah. Currently, Symbol is just an interned string, so Word/Ident don't seem appropriate. If we restrict the type to valid TokenKind::Words, I'd prefer Word for symmetry with TokenKind. Otherwise, perhaps Str/InternedStr? cc @nrc

So the = in += would be a separate token? &&, .., or :: would similarly be two tokens each?

Yeah, it's TokenKind::Op(char, OpKind), so just a single character per Op -- I'll mention that in the doc comment.
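
For illustration, using the names from the PR description above (the helper function itself is hypothetical), `+=` comes through as two single-character `Op` tokens, the first marked `Joint`:

use proc_macro::{Spacing, TokenKind, TokenStream};

// Builds the two tokens of `+=`: `+` is `Joint` because `=` follows with no
// space in between, and `=` is `Alone` because the operator ends there.
fn plus_eq() -> TokenStream {
    vec![
        TokenKind::Op('+', Spacing::Joint),
        TokenKind::Op('=', Spacing::Alone),
    ]
    .into_iter()
    .collect()
}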

I’d also add [more methods for Literal]

Yeah, I agree that we should have Literal::{byte, byte_string, as_*}. This API isn't intended to be exhaustive, e.g. we also want API for emitting "first class" diagnostics (@sergio is working on this).

Why are there two quote macros, with one empty, and only the empty one exported?

The non-empty one is incomplete (by design) and only used to bootstrap the internal implementation of proc_macro::quote. The exported one is a placeholder for the internal implementation (à la lang items). After we support stability hygiene and stability-checking of macro resolutions, we should be able to replace the internal implementation with a #[proc_macro].

@bors (Contributor) commented Apr 5, 2017

☔ The latest upstream changes (presumably #40811) made this pull request unmergeable. Please resolve the merge conflicts.

@nrc (Member) previously requested changes Apr 10, 2017

There's a bunch of changes requested, but most are just about docs, and I don't think there is anything major. Assuming all the fixes are straightforward, then r=me

pub struct TokenStream {
    inner: TokenStream_,
}
pub struct TokenStream(tokenstream::TokenStream);
Member:

Not this PR (and pretty nitty), but just pointing out that it should be either token_stream or Tokenstream; currently the naming is inconsistent.


/// An iterator over `TokenTree`s.
#[unstable(feature = "proc_macro", issue = "38356")]
pub struct TokenIter {
Member:

Should we call this a TokenTreeIter? That leaves room for a flattening TokenIter and means the name and target correspond

Contributor Author:

@nrc Yeah, TokenTreeIter makes more sense.

On a similar note, @eddyb proposed using TokenNode instead of TokenKind for better correspondence to TokenTree and to leave room for Token/TokenKind -- what do you think?

}

impl TokenTree {
    fn from_raw(stream: tokenstream::TokenStream, next: &mut Option<tokenstream::TokenStream>)
Member:

I don't like from_raw as a name - 'raw' can mean so many different things, and perhaps the most likely here is as in raw string which is wrong. How about from_internal or from_rustc_internal or something like that?

Contributor Author:

I'll change to from_internal.

        TokenTree { span: Span(span), kind: kind }
    }

    fn to_raw(self) -> tokenstream::TokenStream {
Member:

likewise here.

And somewhat connected, is there a way to mark the API which we expect macro authors to use? Is it all the pub API? Or is some of that transitional or only intended for the compiler to use?

Contributor Author:

It's all the pub API not in __internal (which is unusable without #![feature(proc_macro_internals)]).

@@ -80,7 +524,11 @@ pub struct LexError {
/// all of the contents.
#[unstable(feature = "proc_macro_internals", issue = "27812")]
#[doc(hidden)]
#[path = ""]
Member:

what is this for? I'm not sure what the effect is

Contributor Author:

This is for `mod quote;` in `__internal`; it allows the module's file to be libproc_macro/quote.rs. Thinking about this some more, I'll move `mod quote` into the crate root and remove the `#[path = ""]`.


//! # Quasiquoter
//! This file contains the implementation internals of the quasiquoter provided by `qquote!`.

Member:

Could you add an overview of how the quasi-quoter works please?

Contributor Author:

Will do.

@@ -196,6 +201,10 @@ impl TokenStream {
        }
    }

    pub fn builder() -> TokenStreamBuilder {
Member:

I think this would be better as a free function rather than a method on TokenStream

Member:

Or it could be a new method on TokenStreamBuilder

Contributor Author:

I'll change it to TokenStreamBuilder::new() (the motivation for TokenStream::builder() was to avoid making people bring anything else into scope).

@@ -225,13 +234,107 @@ impl TokenStream {
        }
        true
    }

    pub fn as_tree(self) -> (TokenTree, bool /* joint? */) {
Member:

Could these public methods get docs please? (In particular here, some more explanation of what joint means would be good).

Contributor:

Also, since this takes self by value, should this be named into_tree instead? Or are the conventions different for reference-counted things?

Contributor Author:

@nrc will do.
@lfairy Perhaps, I don't think it matters much.

@@ -8,47 +8,37 @@
// option. This file may not be copied, modified, or distributed
// except according to those terms.

#![feature(plugin, plugin_registrar, rustc_private)]
Member:

It looks like anyone with an existing procedural macro is going to have to make some significant changes. Could you write up either here or on internals or somewhere, a guide to the changes needed to make a proc macro work after this PR please?

Contributor Author:

Existing procedural macros will lose the syntax::tokenstream::TokenStream quoter from proc_macro_internals, AFAIK other than that they won't have to make any changes. I could include a syntax::tokenstream::TokenStream quoter for back-compat.

How many people are actually using "existing procedural macros" though? This PR doesn't affect the widely used legacy plugin system; I didn't think many people were using SyntaxExtension::ProcMacro from #36154.

@@ -455,3 +461,38 @@ pub fn is_op(tok: &Token) -> bool {
        _ => true,
    }
}

#[derive(Clone, Eq, PartialEq, Debug)]
pub struct LazyTokenStream(RefCell<Option<TokenStream>>);
Member:

Is this meant to be used by macro authors? If so, could you document why and when?

Contributor Author:

No, nothing in libsyntax can be used by proc macro authors.

}

#[derive(Clone, Eq, PartialEq, Debug)]
pub struct LazyTokenStream(RefCell<Option<TokenStream>>);
Contributor:

As an aside: if the dynamic borrow checks end up too inefficient, then this could be re-implemented on top of MoveCell.

Contributor:

std::cell::Cell is now like MoveCell :) rust-lang/rfcs#1651
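
A sketch of that idea (the type and method names here are hypothetical, not a proposal for this PR): the same compute-once caching with `Cell`, moving the value out and back instead of taking a dynamic borrow:

use std::cell::Cell;

// Lazily computed, cached value without `RefCell`'s runtime borrow flags.
struct Lazy<T>(Cell<Option<T>>);

impl<T: Clone> Lazy<T> {
    fn new() -> Lazy<T> {
        Lazy(Cell::new(None))
    }

    fn force(&self, f: impl FnOnce() -> T) -> T {
        // Move the cached value out (leaving `None`), compute it on first
        // use, then store a copy back for the next caller.
        let value = self.0.take().unwrap_or_else(f);
        self.0.set(Some(value.clone()));
        value
    }
}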

        let mut opt_stream = self.0.borrow_mut();
        if opt_stream.is_none() {
            *opt_stream = Some(f());
        };
Contributor:

Nit: extra semicolon

@Mark-Simulacrum added the S-waiting-on-author label (Status: This is awaiting some action, such as code changes or more information, from the author) on Apr 14, 2017
@carols10cents (Member)

Friendly ping that there are merge conflicts and some comments for you @jseyfried! ❤️

@abonander (Contributor) commented Apr 23, 2017

Should we use a different feature gate for the unstable APIs? I'm under the impression that language feature gates and library feature gates are meant to be distinct from each other, and proc_macro currently falls under the language side.

@nikomatsakis (Contributor)

Hmm, is there a strict separation between language/lib feature-gates? I wouldn't think so.

@abonander (Contributor)

It's kind of implied by the structure of the Unstable Book but that might be more of an issue for #41476 where this came up originally.

@arielb1 (Contributor) commented Apr 25, 2017

status: still waiting for #40847.

@arielb1 (Contributor) commented May 2, 2017

status: still waiting for #40847.

1 similar comment
@carols10cents (Member)

status: still waiting for #40847.

@oli-obk (Contributor) commented Jul 9, 2017

The lack of expansion info broke clippy together with proc macros, since code inside derives will now trigger lints. Is there any new way to detect expanded code? If not, what needs to be done in order to create expansion info that doesn't influence save-analysis?

See https://github.com/Manishearth/rust-clippy/issues/1877

@nrc (Member) commented Jul 13, 2017

Since we no longer have the extra expansion information, @nrc's save-analysis code tries to re-parse the call site span and expects it to match the proc-macro generated tokens.

Ah, I wondered why I was seeing so many errors in the RLS :-(

I think we must fix this. This feels like another reason not to conflate macro hygiene with expansion history. That might be too hard a way to fix this though.

If we want to fix this, we could instead mark tokens from FromStr for TokenStream specially so that save-analysis would know that they are "weird", but so that they would still behave like unexpanded tokens w.r.t. hygiene. I don't have a strong opinion here.

Not using spans for hygiene seems a much nicer fix here - the span should reflect the expansion history and the hygiene should be altered (to erase hygiene, essentially). I expect we want the correct expansion info for errors as well as save-analysis.

@oli-obk (Contributor) commented Jul 13, 2017

I already have a fix in #43179

@jseyfried (Contributor Author) commented Jul 13, 2017

@nrc
We could instead just include hygiene bending information along with expansion information so that span expansion info and hygiene info remain unified (@oli-obk's solution is fine for impl FromStr for TokenStream, so we're OK for now).

That is, I don't think hygiene bending requires us to split up span/hygiene info or construct a "fake" expansion history just for hygiene. I think it'd be nicer to have a single source of truth for expansion info that is expressive enough for hygiene bending.

bors added a commit that referenced this pull request Jul 20, 2017
Tidy: allow common lang+lib features

This lets changes to the Rust language that have both library and language
components share one feature gate.

The feature gates need to be "about the same change": the library and
language components must be either both unstable or both stable, and must
share the same tracking issue.

Removes the ugly "proc_macro" exception added by #40939.

Closes #43089
@nrc (Member) commented Jul 25, 2017

So #43179 did not fix the expansion info issue (at least not completely) - I'm still seeing errors in the RLS where we are treating macro-generated names as hand-written ones (and I think this is causing #43371 too).

bors added a commit that referenced this pull request Aug 1, 2017
Three small fixes for save-analysis

First commit does some naive deduplication of macro uses. We end up with lots of duplication here because of the weird way we get this data (we extract a use for every span generated by a macro use).

Second commit is basically a typo fix.

Third commit is a bit interesting, it partially reverts a change from #40939 where temporary variables in format! (and thus println!) got a span with the primary pointing at the value stored into the temporary (e.g., `x` in `println!("...", x)`). If `format!` had a definition it should point at the temporary in the macro def, but since it is built-in, that is not possible (for now), so `DUMMY_SP` is the best we can do (using the span in the callee really breaks save-analysis because it thinks `x` is a definition as well as a reference).

There aren't tests for this stuff because: the deduplication is filtered out by any of the users of save-analysis, so it is purely an efficiency change; I couldn't actually find an example for the second commit that we have any machinery to test; and the third commit is tested by the RLS, so there will be a test once I update the RLS version and uncomment the previously failing tests.

r? @jseyfried
@aidanhs mentioned this pull request Aug 10, 2017
bors added a commit that referenced this pull request Feb 15, 2019
Remove `LazyTokenStream`.

`LazyTokenStream` was added in #40939. Perhaps it was an effective optimization then, but no longer. This PR removes it, making the code both simpler and faster.

r? @alexcrichton
Labels: S-waiting-on-author (Status: This is awaiting some action, such as code changes or more information, from the author)