RFC: libsyntax2.0 #2256

Open · wants to merge 4 commits into base: master
@matklad
Member

matklad commented Dec 23, 2017

Hi!

This RFC proposes to change the AST and parser used by rustc, so as to create a solid base on top of which great IDE support can be developed. The RFC is largely informed by my experience developing IntelliJ Rust and by my experiments with an IDE-ready syntax tree for Rust in fall. I am not so knowledgeable about the internals of rustc, and especially about the macro expansion machinery, so input from the compiler team would be very valuable!

@rust-lang/compiler @rust-lang/dev-tools @nrc @jseyfried

Rendered

@mark-i-m

Contributor

mark-i-m commented Dec 23, 2017

👍 x1000000

The fact that Rust has neither a stable, official, correct parser nor an official, correct grammar is very annoying. I think this is a major step towards making tooling easier to contribute to and more complete/stable.

@matklad

Member

matklad commented Dec 23, 2017

Note that the RFC explicitly does not propose to create an official grammar. There's https://github.com/nox/rust-rfcs/blob/master/text/1331-grammar-is-canonical.md for that.

However, I do hope to produce a comprehensive, progressive test-suite for Rust parsers.

@eddyb

Member

eddyb commented Dec 23, 2017

While this model doesn't use ADTs like rustc does, libsyntax's AST has more or less 3 main uses left in the compiler: macro expansion, name resolution (the two of which are intertwined) and lowering to HIR (which the rest of the compiler works with).

So as long as those 3 tasks can be performed without much difficulty, the actual representation of the AST doesn't matter much to the compiler itself.

My preference would be something auto-generated from an official grammar, such that everything matches the names of rules and whatnot from there. I am biased towards schemes which allow reuse of parse results, e.g. for when a macro or derive uses an input expression / type multiple times.

@Manishearth

So, firstly, I was under the impression that syn was the end goal of "stable libsyntax"

That said, syn doesn't handle comments. But still, having the library live in the crates ecosystem sounds like the best way to deal with this. For one, we don't have to tiptoe around having a typed API because it's fine to do major version updates.

For another, this way it doesn't impact the compiler; I'm concerned that this will lead to performance and readability issues since we're losing a lot of the typed AST if merged into the compiler.

As an "officially maintained crate" this makes sense, but less so as a replacement of libsyntax IMO.

@Manishearth

Member

Manishearth commented Dec 23, 2017

cc @mystor

@mystor


mystor commented Dec 23, 2017

syn has quite different goals from a theoretical stable "libsyntax". syn aims to be correct enough for use by procedural macros, but is willing to sacrifice a lot of functionality to reduce build times.

For example, syn puts no effort into producing good error messages, because it is assumed that all input to it is well-formed. syn error messages usually read "unable to parse a Item" or something along those lines, because it does not track what went wrong.

syn is also being rewritten on top of the proc-macro API. This API is good for procedural macros, but isn't necessarily the sort of API which you'd want to build a general rust syntax parser on top of. For example, the span information exposed by proc-macro is intentionally limited to add flexibility in the implementation of rustc, but we would want to allow access to lower-level details in a theoretical libsyntax rewrite.

It's likely that at some point the ecosystem will either grow another rust parser, or syn will grow enough feature flags that it can start to pick up these sorts of use cases. I'm not sure which solution is preferable.

In general I think I support the idea of making a feature-complete alternative to libsyntax outside of the compiler, and I think it fills a different niche than syn. However, I doubt that the timeline for merging these efforts back into rustc will be short.

@mystor


mystor commented Dec 23, 2017

Also, cc @dtolnay if he hasn't been pinged already, as he wrote most of syn.

@scottmcm scottmcm added the T-compiler label Dec 24, 2017

@matklad

Member

matklad commented Dec 24, 2017

My preference would be something auto-generated from an official grammar,

I think we could and should generate the typed AST layer, but the parser itself is better hand-written.

Both IntelliJ and fall generate the parsing code as well, which is very useful because maintaining the parser is relatively low-effort. But the generated parsers have suboptimal error reporting, performance and correctness (because Rust already has some funny aspects of the grammar which are not natural to express in a declarative way).

While hand-writing a parser takes more effort, I think it's a good investment long-term. And given that we already have a hand-written parser, it may actually save time :)

One problem with the generated AST and hand-written parser is that there's no guarantee that they match, but this is not a big problem in practice: such bugs are easy to notice, fix and test for.

@matklad

Member

matklad commented Dec 24, 2017

By the way, I am still looking for a good term describing such non-Abstract Syntax Trees. So if anyone knows one, please let me know!

@matklad

Member

matklad commented Dec 24, 2017

I am biased towards schemes which allow reuse of parse results, e.g. for when a macro or derive uses an input expression / type multiple times.

Reusing subtrees is also important for incremental reparsing, so I am very interested in allowing this as well. To do this, we should store lengths instead of offsets in syntax tree nodes, and store (lazily-calculated) offsets at the file level.
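To make this concrete, here is a minimal sketch (all names hypothetical, not code from the RFC or fall): nodes that know only their length are position-independent, so an unchanged subtree can be reused after an edit, and absolute offsets fall out of a walk from the root.

```rust
// Sketch: nodes store lengths, not offsets; absolute offsets are recovered
// on demand by summing the lengths of left siblings along a path.
struct Node {
    kind: &'static str,
    len: u32,            // total text length covered by this subtree
    children: Vec<Node>, // a real design would share these, e.g. via Arc<Node>
}

/// Absolute offset of the node reached by following `path` from `root`.
fn offset_of(root: &Node, path: &[usize]) -> u32 {
    let mut node = root;
    let mut offset = 0;
    for &idx in path {
        offset += node.children[..idx].iter().map(|c| c.len).sum::<u32>();
        node = &node.children[idx];
    }
    offset
}

fn main() {
    // "fn f(){}" as a flat token-level tree, 8 bytes in total
    let file = Node {
        kind: "FILE",
        len: 8,
        children: vec![
            Node { kind: "FN_KW", len: 2, children: vec![] },
            Node { kind: "WHITESPACE", len: 1, children: vec![] },
            Node { kind: "NAME", len: 1, children: vec![] },
            Node { kind: "PARAMS", len: 2, children: vec![] },
            Node { kind: "BLOCK", len: 2, children: vec![] },
        ],
    };
    assert_eq!(offset_of(&file, &[2]), 3); // NAME starts right after "fn "
}
```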

@matklad

Member

matklad commented Dec 24, 2017

I'm concerned that this will lead to performance and readability issues since we're losing a lot of the typed AST if merged into the compiler.

@Manishearth please take a closer look at the Typed Tree section. I claim that it's possible to regain full type safety, by layering newtype wrappers on top of raw syntax tree.

It's easier to see this in action, so here's some example code from fall. This is the code for "add impl" action, which turns this:

```rust
struct Foo<X, Y: Clone> {}
```

into this:

```rust
struct Foo<X, Y: Clone> {}

impl<X, Y: Clone> Foo<X, Y> {

}
```

Note how precisely the function in question characterizes the set of applicable syntactical constructs: T: NameOwner<'f> + TypeParametersOwner<'f>. These Owner traits are generated from the grammar and, for example, structs and enums implement them. You can even combine typed and untyped access in a single visitor :) In general, my experience from fall is that this two-tiered implementation in Rust is a never-ending stream of goodness.

It's true though that there are some performance considerations: there's no identifier interning built in, and accessing a child of a particular type is a linear scan of all children. However, this tree should be converted to HIR early on, and it's always possible to create a side table which maps syntax tree nodes to arbitrary cached data, because a node is an integer index.
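For readers who haven't clicked through to fall, here is a rough, self-contained sketch of the two-tier idea (all names hypothetical; fall's actual generated code differs):

```rust
// Sketch: a raw Copy node, generated newtype wrappers per node kind, and
// generated Owner traits for shared capabilities such as "has a name".
#[derive(Copy, Clone)]
pub struct Node<'f> {
    pub idx: u32,      // index into the file's node storage
    pub file: &'f str, // stand-in for a reference to the parsed file
}

#[derive(Copy, Clone)]
pub struct StructDef<'f>(pub Node<'f>);
#[derive(Copy, Clone)]
pub struct EnumDef<'f>(pub Node<'f>);

pub trait NameOwner<'f> {
    fn name(&self) -> Option<&'f str>;
}
pub trait TypeParametersOwner<'f> {
    fn type_parameters(&self) -> Option<Node<'f>>;
}

// Structs (and likewise enums) implement the generated traits ...
impl<'f> NameOwner<'f> for StructDef<'f> {
    fn name(&self) -> Option<&'f str> { None /* find the NAME child */ }
}
impl<'f> TypeParametersOwner<'f> for StructDef<'f> {
    fn type_parameters(&self) -> Option<Node<'f>> { None /* likewise */ }
}

// ... so "add impl" can be written once for every node kind that has both
// a name and type parameters:
pub fn add_impl<'f, T: NameOwner<'f> + TypeParametersOwner<'f>>(decl: &T) {
    let _name = decl.name();
    let _params = decl.type_parameters();
    // render `impl<...> Name<...> { }` from the two pieces above
}
```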

@matklad

Member

matklad commented Dec 24, 2017

@Manishearth I'd like to reiterate that "stable libsyntax" is an explicit non-goal of this RFC. The intended clients of the library (if everything goes well, of which there's no certainty) are rustc, rls, rustfmt and clippy.

I also agree with @mystor that syn has slightly different constraints and that syn, and not libsyntax2.0, would be a better syn.

@matklad

Member

matklad commented Dec 24, 2017

While hand-writing a parser takes more effort, I think it's a good investment long-term.

To expand on this a bit, I think the ideal would be to have two parsers in the compiler.

  1. An LR parser, generated directly from the grammar by something like LALRPOP, which does zero error reporting and error recovery, but is bloody correct and bloody fast.

  2. Something like the proposed libsyntax2.0, which does sophisticated error recovery and can parse anything that resembles Rust into some sort of tree.

The command-line compiler would then use the LR parser, and only the LR parser, for compilation. However, if the LR parser fails to parse some file, the compiler would reparse it with libsyntax2.0 to give a nice error report. The IDE stuff would use libsyntax2.0.
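A tiny sketch of that division of labour (every type and function below is a hypothetical stand-in):

```rust
// Sketch: compile with the strict, fast LR parser; if it fails, reparse
// with the recovering parser purely to produce good diagnostics.
struct Ast;          // output of the strict LR parser
struct Diagnostic;   // rich error, with recovery context
struct LosslessTree; // output of the libsyntax2.0-style parser

fn lr_parse(_text: &str) -> Result<Ast, ()> { unimplemented!("stand-in") }
fn lossless_parse(_text: &str) -> LosslessTree { unimplemented!("stand-in") }

impl LosslessTree {
    fn errors(&self) -> Vec<Diagnostic> { Vec::new() }
}

fn compile_file(text: &str) -> Result<Ast, Vec<Diagnostic>> {
    match lr_parse(text) {
        Ok(ast) => Ok(ast), // the common case: correct and fast
        // The LR parser failed: reparse with error recovery to explain why.
        Err(()) => Err(lossless_parse(text).errors()),
    }
}
```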

@Centril

Contributor

Centril commented Dec 24, 2017

@matklad That sounds like a jolly good idea! An added benefit is that you may then define an Arbitrary impl for the AST / token tree itself and use property-based testing to check that forall ast. parser_1(show(ast)) == parser_2(show(ast)). You've then essentially tested that the sophisticated parser conforms to the spec. The downside is some duplication of effort, but it may be well worth it?

@Manishearth

Member

Manishearth commented Dec 24, 2017

@Manishearth please take a closer look at the Typed Tree section. I claim that it's possible to regain full type safety, by layering newtype wrappers on top of raw syntax tree.

Oh, I saw that, I'm saying we don't need that weird intermediate layer if we have extensible enums (or if we keep it out of tree so that it can be versioned independently)

I'd like to reiterate that "stable libsyntax" is an explicit non-goal of this RFC. The intended clients of the library (if everything goes well, of which there's no certainty) are rustc, rls, rustfmt and clippy.

Well, if we're not trying to stabilize it, then why the major changes? Your main point against the current libsyntax is that it's not a pure function -- but this can be fixed in a focused RFC!

I think this RFC as currently stated needs a lot more work clarifying what axes we're evaluating things against, and why the new proposal is better under those axes.

I also feel like the parser that IDEs and rustfmt need is different from the parser that the compiler needs (IDEs and rustfmt want parsers that are more lax in dealing with errors, and that preserve comments); which only further makes me feel like having a separate officially-maintained library, of which rustc is not a client, would help. You yourself put forth two parsers -- only one of these is necessary in the compiler -- why not maintain this out of tree?


TLDR, before moving forward here I'd like to see:

  • better motivation as to why current libsyntax needs change
  • better understanding of the axes along which we are evaluating libsyntax
  • better clarification as to why the proposal is better along those axes
  • more reasoning as to why this should be made part of rustc and not officially maintained out of tree (which makes it easier to version and also doesn't saddle rustc with additional complexity)
@matklad

Member

matklad commented Dec 24, 2017

@Centril this is a little bit off topic, but the property-based testing should proceed in a slightly different manner :)

First, note that the proposed tree structure does not really have a pretty-print/show operation, because it simply does not exist without the underlying source text. Second, if you generate ASTs, you never check that both parsers reject invalid code.

So the better property would be forall text. (parser_1(text).is_err() && parser_2(text).is_err()) || (parser_1(text).unwrap() == parser_2(text).unwrap()).

And to generate arbitrary text, you can take existing Rust code for valid inputs, and for invalid inputs you can take some valid code and cut&paste fragments of it over itself. This is how fall checks that its incremental and full reparsers are equivalent (1, 2).
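As a sketch, the differential property could look like this with the proptest crate (parser names and types are stand-ins; the `".*"` strategy would be replaced by the cut&paste mutation scheme described above):

```rust
// Sketch: both parsers must agree on acceptance, and on the resulting tree
// whenever they accept. `parser_1`/`parser_2` are hypothetical stand-ins.
use proptest::prelude::*;

#[derive(Debug, PartialEq)]
struct Tree;
struct ParseError;

fn parser_1(_text: &str) -> Result<Tree, ParseError> { unimplemented!("LR parser stand-in") }
fn parser_2(_text: &str) -> Result<Tree, ParseError> { unimplemented!("hand-written parser stand-in") }

proptest! {
    #[test]
    fn parsers_agree(text in ".*") {
        match (parser_1(&text), parser_2(&text)) {
            (Ok(a), Ok(b)) => prop_assert_eq!(a, b),
            (Err(_), Err(_)) => {} // both rejecting the input is also agreement
            _ => prop_assert!(false, "parsers disagree on whether the input parses"),
        }
    }
}
```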

@matklad

Member

matklad commented Dec 24, 2017

Yeah, totally agree that the motivation section needs more work!

My main mistake is that I overemphasized the "pure function" part, while my main point is actually that it's difficult to build great tooling on top of the current lossy libsyntax.

However, I don't want to dive too deep into a discussion of why libsyntax is worse for tooling than the proposed solution, simply because libsyntax actually exists while libsyntax2.0 is little more than a pipe dream at the moment. Essentially, what this RFC proposes is "let's agree that building a prototype of libsyntax2.0 using the proposed tree structure is a good idea". No harm will be done to the compiler as a part of this RFC :) I am posting this as an RFC and not as a discussion on IRC or internals because it's a rather ambitious project anyway, and I would like to gather community consensus.

@Manishearth

Member

Manishearth commented Dec 24, 2017

And essentially what this RFC proposes is "let's agree that building a prototype of libsyntax2.0 using the proposed tree structure is a good idea"

Right, but it's not clear as to why the proposed structure is better :)

(it would be worth making it very explicit that this RFC does not try to replace libsyntax, just pave a path for something which may eventually replace it but through a different RFC, since otherwise there will be two separate proposals being discussed in tandem)

@matklad

Member

matklad commented Dec 24, 2017

The one question I want answered before building a prototype is whether it is at all feasible to use the proposed implementation in the compiler, or whether macros will throw a wrench into the works. So I am eagerly waiting for @nrc or @jseyfried to clarify the following points:

  1. Currently, parsing, name resolution and macro expansion are intertwined. Will it be possible to separate parsing into its own step, so that resolve and expansion call into the parser and use the syntax tree, but the parser itself knows nothing about macro expansion?

  2. Is it possible to parse a Rust source file in isolation, without knowing its location on the file system, the set of previously parsed files, etc.? It looks like include_str!, include_bytes! and include! are not a problem?

  3. Is it possible to base macro expansion on top of the proposed tree structure? Currently, the macro expander operates on token trees, which store hygiene information within themselves (if I understand the source code correctly). Will it be possible to use the text instead of token trees, and store hygiene on the side?

@matklad

Member

matklad commented Dec 24, 2017

So, on to the actual discussion of why the proposed structure is better, and of what metrics are used to measure this "better" :)

The crucial point here is that, to make writing good tooling possible, the syntax tree must be lossless: it must explicitly account for comment nodes and whitespace. It must also be capable of representing partially-parsed code. For example, for

```rust
fn main() {
    foo(xs, 1
}
```

the parser must produce a function call node for foo(xs, 1 and parameter nodes for xs and 1.

However, I don't think I can back this claim up with anything other than "it's obvious that it is supposed to work this way": all the IDE stuff I've written was using lossless trees. Maybe I am wrong and it is actually quite possible to create perfect IDE support on top of a conventional AST.

But, if we assume "losslessness" as an axiom, then representing syntax as a generic tree which points to ranges in the original string seems to be the minimal and natural approach?

The trick with typed wrappers seems weirder, of course, but, in my personal experience, it's a great way to represent ASTs.

@Manishearth, I think I don't fully understand your point about extensible enums. I thought it was about making the error specifically an enum like enum SyntaxError { MissingSemicolon, ... #[doc(ignore)] __not_exhaustive }? Or is it about some way of representing single-inheritance OOP? Could you give an example?

The problem with an AST is that AST nodes don't naturally form a hierarchy, and you need all of structs, enums and traits to work with them in a type-safe manner. For example, you need a concrete struct for nodes like "if expression" and "structure definition", an enum for "any expression", and a trait for "stuff with generic parameters, which also may have a where clause". The proposed implementation allows generating an arbitrary collection of types for representing AST nodes, while using a plain Copy index as the internal representation, which seems quite nifty to me! (A sketch of what this could look like follows below.)
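A sketch of those three kinds of generated types over the same Copy index (names hypothetical; an Owner-trait sketch appears earlier in the thread):

```rust
// Sketch: a concrete struct per node kind, an enum over related kinds, and
// plain pattern matching — all wrapping a single Copy index.
#[derive(Copy, Clone)]
struct Node(u32); // the only thing that is actually stored

#[derive(Copy, Clone)]
struct IfExpr(Node); // concrete node kinds
#[derive(Copy, Clone)]
struct CallExpr(Node);

#[derive(Copy, Clone)]
enum Expr { // "any expression"
    If(IfExpr),
    Call(CallExpr),
    // ... one variant per expression kind, all generated
}

fn describe(expr: Expr) -> &'static str {
    match expr {
        Expr::If(_) => "an if expression",
        Expr::Call(_) => "a function call",
    }
}
```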

So here are the axes which differentiate between libsyntax and the proposed libsyntax2:

  1. losslessness of the tree
  2. isolation and reusability of the API (even if we move libsyntax to crates.io, we'll probably need to move the string interner with it as well, because you want interned identifiers in the real AST)
  3. conceptual elegance (hugely subjective, of course!)
  4. a similar representation is successfully used in the compiler+IDE use case
@matklad

Member

matklad commented Dec 24, 2017

more reasoning as to why this should be made part of rustc and not officially maintained out of tree

I am not entirely sure whether the IDE and the compiler should use separate parsers or a single one (some discussion on internals 1). Originally I was of the opinion that their requirements differ too much, and that macros make this whole idea impossible, but now I think that it is in fact feasible to use a single parser.

The obvious benefit is of course code reuse.

The less obvious benefit is that it's not entirely clear where the compiler ends and the IDE starts, and the syntax tree seems to be one natural boundary. It's useful to share the actual tree data structure between the compiler and the IDE, because the IDE can reparse a file incrementally, while the compiler has to do a full reparse (or it has to learn about text editing). It's also interesting to think about how the compiler/IDE reports errors. For sure the error must be detected by the compiler. But the IDE must be able to suggest a quick fix for it, and quick fixes are all about text editing, file formatting and interacting with the user. So it would be nice if the compiler could collect some context during checks, and then pass this context information to some layer which either prints it to stdout as an error message, or suggests a quick fix in the IDE. It would be nice if this context information could use the language of the syntax tree!

Making macro expansion work with the IDE tree also allows some nifty features, like live debugging of macro expansions, for example :)

@Manishearth

Member

Manishearth commented Dec 25, 2017

However, I don't think I can back this claim up with anything other than "it's obvious that it is supposed to work this way": all the IDE stuff I've written was using lossless trees.

Nah, this is still useful data 😄

And I see why lossless trees work well here!

My preferred way of representing this would be to use a complete AST as much as possible, and have nodes like ItemKind::PartialItem, PartialExpr which are more like lex trees with some parsed info when things are partial. But I can see why that's problematic.

Still, worth putting this other proposal up there, I don't know if it's actually better -- you're far more experienced here than I 😄

I think I don't fully understand your point about extensible enums. I thought it was about making the error specifically an enum like enum SyntaxError { MissingSemicolon, ... #[doc(ignore)] __not_exhaustive }? Or is it about some way of representing single-inheritance OOP? Could you give an example?

It's not really relevant anymore, but I was under the impression you were doing the untyped tree so that you wouldn't have a stability problem -- and this is better tackled by having a typed tree where you are never allowed to have an exhaustive match on an enum, you must have a wildcard branch.

Stability was not the actual motivation here, as you clarified, so this is a moot point 😄

The less obvious benefit is that it's not entirely clear where compiler ends and IDE starts, and the syntax tree seems to be one natural boundary.

I disagree, but this is pretty subjective anyway 😄. But yeah, this isn't something we need to discuss in this RFC.


Reading through the RFC again with the motivations in mind I think it's much clearer why we should do this. I like the design! Still feel like the "full AST where possible" design might be better but I'm not sure.

Might leave more specific comments later.

(Inline review thread on text/0000-libsyntax2.0.md)

@Manishearth

Manishearth Dec 25, 2017

Member

I'm wondering: How do we figure out the best way to interpret a partial tree? Should we rely on indentation as a hint? Are there well-known solutions here?



@matklad

matklad Dec 25, 2017

Member

Here's the array of tricks I am aware of:

  1. Code is typically typed left-to-right, so it's possible to have "commit points" in grammar/parser. Here's an example from fall:
```
pub rule fn_def {
  'const'? 'unsafe'? linkage?
  'fn' <commit> ident
  type_parameters?
  value_parameters
  ret_type?
  where_clause?
  {block_expr | ';'}
}
```

This <commit> means that as soon as the parser sees the fn keyword, it produces a function node, even if there are parsing errors after this point. Commit effectively makes the trailing part optional. In other words, one way to deal with partial parses is to treat certain prefixes of parses as parses. These commit points mesh nicely with the LL-ness of the parser: you commit just after the necessary lookahead.

  2. Code usually has block structure, so it makes sense, when parsing a block expression, to first parse it approximately as a token tree, and then try to parse the internals of the block. That way, parse errors stay isolated to blocks. Of course, you can do this with all kinds of lists and whatnot, as long as you can invent a robust covering grammar. This trick also makes incremental parsers more efficient (changes are isolated to one block unless the block structure itself changes) and allows one to parse stuff lazily. There are variations of this trick: for example, you can lex a string literal as just anything between " characters, and then lex the contents of the literal with a second lexer which properly deals with escape sequences.

  3. When parsing repeated constructs, like item*, a useful error-recovery strategy is, after each (partially) parsed item, to skip tokens until a token from FIRST(item) appears. (example from fall)
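A sketch of that third trick in a hand-written parser (the Parser type and its methods are hypothetical stand-ins):

```rust
// Sketch of FIRST-set recovery: after a failed item, consume tokens (into
// an ERROR node, in a real parser) until something that can start an item
// appears, then try again.
const ITEM_FIRST: &[&str] = &["fn", "struct", "enum", "trait", "impl", "use"];

struct Parser {
    tokens: Vec<String>,
    pos: usize,
}

impl Parser {
    fn at_eof(&self) -> bool { self.pos >= self.tokens.len() }
    fn current(&self) -> &str { &self.tokens[self.pos] }
    fn bump(&mut self) { self.pos += 1 } // a real parser would also attach the token to the tree
    fn parse_item(&mut self) -> bool { false } // stub: try to parse one item

    fn parse_items(&mut self) {
        while !self.at_eof() {
            if !self.parse_item() {
                self.bump(); // always make progress, to avoid looping forever
                // skip to the next plausible item start
                while !self.at_eof() && !ITEM_FIRST.contains(&self.current()) {
                    self.bump();
                }
            }
        }
    }
}
```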


(Inline review thread on the "Typed Tree" section of text/0000-libsyntax2.0.md)

@Manishearth

Manishearth Dec 25, 2017

Member

FWIW this is kinda reminiscent of Go's AST, however Go's AST is designed that way more because Go doesn't have ADTs. (I've always found it annoying to work with because of that, but this isn't an inherent problem in this approach)



@matklad

matklad Dec 25, 2017

Member

Yeah, in this representation you could use enums and pattern-match, for example, against all kinds of expressions, but you won't be able to destructure structs.


@scottmcm

Member

scottmcm commented Dec 25, 2017

Might be interesting to look at how Roslyn solved a bunch of these same challenges for C# parsing and ASTs. There are probably better resources, but here's one I could find quickly: https://blogs.msdn.microsoft.com/ericlippert/2012/06/08/persistence-facades-and-roslyns-red-green-trees/

@matklad

Member

matklad commented Dec 25, 2017

Might be interesting to look at how Roslyn solved a bunch of these same challenges for C# parsing and ASTs.

A most reasonable question! I've investigated a little how trees are represented in Dart SDK and Roslyn.

So for both Dart and C#, an object-oriented approach is used, where there's a hierarchy of classes corresponding to syntactical constructs, and each class stores its children as typed fields. However, it is possible to view the AST as untyped, because all nodes share a supertype, are linked via parent links, and provide a getChildren method, which reconstructs a list of children from the fields.

But they use different strategies to represent whitespace and comments.

In Roslyn, each token class has two fields for leading and trailing trivia.

The Dart one looks weird (or maybe I am reading the wrong thing?). It seems they build an explicit linked list of non-whitespace, non-comment tokens, and additionally attach comments (but not whitespace?) to the following token: https://github.com/dart-lang/sdk/blob/0c48a1163577a1157ea16c00b2fe914c1759357b/pkg/front_end/lib/src/scanner/token.dart#L477

This all makes me less sure that the representation I propose is good, though I still think it's better than the alternatives. One of its worst parts is that accessing a child of a node is linear in the number of children, not constant. However, this can be fixed in a couple of ways: first, you can store the tree in such a way that all children of a node are stored in a contiguous slice, which makes linear time pretty fast. Second, because a node is essentially a file-local index, you can store side tables like Vec<StructDefData>, Vec<TraitDefData> which make AST lookup constant. And what's interesting about the last trick is that you are very flexible in what data you store in such side tables. Joke: it looks like SoA/ECS, and we should build the compiler like a game engine so that we can rewrite a browser like a game engine.
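A sketch of the side-table trick (types hypothetical): because a node is just a file-local index, per-kind cached data can live outside the tree in flat, SoA-style storage.

```rust
// Sketch: constant-time access to cached, typed data keyed by node index,
// with no linear scan over the node's children.
use std::collections::HashMap;

#[derive(Copy, Clone, PartialEq, Eq, Hash)]
struct Node(u32); // file-local index into flat node storage

struct StructDefData {
    interned_name: u32, // e.g. an interned identifier, computed once
}

#[derive(Default)]
struct SideTables {
    struct_defs: HashMap<Node, StructDefData>,
}

impl SideTables {
    fn struct_def(&self, node: Node) -> Option<&StructDefData> {
        self.struct_defs.get(&node)
    }
}
```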

@arielb1

Contributor

arielb1 commented Dec 25, 2017

Joke: it looks like SoA/ECS, and we should build the compiler like a game engine so that we can rewrite a browser like a game engine.

The compiler itself is already an ECS; this is just pushing that approach into the parser too.

Major revisions to the text of RFC
* clarify goals and motivation: this is about IDE stuff, and not about
  stable access to AST

* Elaborate specifics of IDEs

* Retroactively justify the proposed syntax tree structure by listing
  design constraints which it satisfies
@eddyb

Member

eddyb commented Jan 3, 2018

@matklad With the latest Span changes we can probably just make libsyntax use (file, lo, hi)
cc @petrochenkov

@petrochenkov

Contributor

petrochenkov commented Jan 3, 2018

At least one reason to concatenate all source files and use lo in that giant file instead of a (file, lo) pair was to use the storage of lo most effectively.
A single u32 could represent both one huge auto-generated file and many smaller files.

Now that most of the compiler uses packed spans (Span), I think we can make unpacked spans (SpanData) larger and use (file, rel_lo, rel_hi) instead of (abs_lo, abs_hi), but this needs benchmarking of course.

@matklad

Member

matklad commented Jan 3, 2018

One more aspect of the internal representation is that the compiler should not need all this whitespace and comment information most of the time. So, at least in theory, it should be possible to parse the source into a lossless tree, condense the tree into a more compact representation (forgetting various bits of information), and then restore the full tree by reparsing when the need arises; a sketch follows after the list below. Specifically, I think only the following cases need access to the full syntax tree:

  1. Syntax error reporting
  2. Macro expansion
  3. Working with the files currently opened in the IDE: we need to maintain the syntax tree to do stuff like brace matching, highlighting, autoindenting, and to be able to incrementally reparse it.
  4. Refactoring (we need to load the tree only for files actually affected by refactoring).
  5. Error reporting: ideally, each error report should be accompanied with a quick-fix action, which is a form of refactoring.
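The sketch mentioned above (all types hypothetical): only the condensed tree stays resident, and the full tree is recreated by reparsing the stored text whenever one of these five cases needs it.

```rust
// Sketch: condense the lossless tree after parsing; restore it on demand.
struct FullTree;    // lossless: whitespace, comments, error nodes, ...
struct CompactTree; // significant nodes only

fn parse_lossless(_text: &str) -> FullTree { unimplemented!("stand-in") }
fn condense(_full: &FullTree) -> CompactTree { CompactTree }

// Restore on demand; the full tree is dropped again when `f` returns.
fn with_full_tree<R>(text: &str, f: impl FnOnce(&FullTree) -> R) -> R {
    let full = parse_lossless(text);
    f(&full)
}
```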
@eddyb

Member

eddyb commented Jan 3, 2018

@matklad That compact representation is HIR in the current implementation, which also desugars things like (expr) (stripping parens), for loops, if let, while let, nested use, etc.

@matklad

Member

matklad commented Jan 3, 2018

@eddyb right, the idea is not about "having a compact representation", but about loading and unloading the full representation on demand. Nested uses are a good example, actually. Suppose that during type-checking of a method call we find that we should produce an error because the method's trait is implemented, but not in scope. Ideally, we would also like to produce a suggestion for the IDE to import this trait, and this suggestion should be aware of nested use imports (you want to add the trait to some existing use import, instead of creating a brand-new one). So we need access to the syntax tree, and one way to get it is just to parse the source code a second time (which might be better than storing all trees in memory, if only a small number of files contain errors).

@eddyb


Member

eddyb commented Jan 3, 2018

@matklad You have the original Span so you can do a very local reparse, btw.
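
A minimal sketch of that idea, assuming a simple byte-offset Span (names hypothetical):

// Hypothetical sketch: with the original span at hand, only the relevant
// slice of the source needs to be fed back through the parser.
struct Span { lo: usize, hi: usize }

fn reparse_local(source: &str, span: Span) -> &str {
    // Stand-in for running the real parser over just this slice.
    &source[span.lo..span.hi]
}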

@matklad


Member

matklad commented Jan 10, 2018

I am tentatively starting to put some code in here: https://github.com/matklad/libsyntax2 :-)

@Pzixel


Pzixel commented Jan 13, 2018

@matklad I have been using Roslyn for a while as a user, and I can say that it's damn good at these things. You can play with this page to see what tree Roslyn actually generates: https://roslynquoter.azurewebsites.net/ .

I think you can learn something from it :)

@matklad matklad referenced this pull request Jan 13, 2018

Open

libsyntax2.0 #1

@matklad


Member

matklad commented Feb 24, 2018

So, this has been quiet for some time.

I'd love to receive more "official" feedback from the @rust-lang/compiler team (an FCP maybe?) to better understand how to reach the goals I want to reach :-)

So, here are some core questions about this RFC I would love to hear feedback on:

  • In general, is this "completely crazy and ain't gonna work", or "rather crazy and might, actually, work"? :)

  • Do we need the ability to represent comments and whitespace in the AST explicitly, or is it sufficient just to store offsets for significant nodes?

  • Does it make sense to get rid of the global state in the parser, so that each file can be parsed independently, in any order, using only the text of the file itself as input? Will it inevitably lead to worse memory usage?

  • Can the compiler's parser be usable as a stand-alone library with a small interface? Or is it inevitable that the parser should integrate tightly with internal compiler data structures, to provide performance and convenience for compiler writers?

  • Finally, could the proposed two-layer extendable (in the ECS sense) syntax tree data structure be a good approach to achieving the three previous goals?

@nikomatsakis


Contributor

nikomatsakis commented Mar 7, 2018

@matklad d'oh. I've been carrying around a printed copy of this RFC for about 2 weeks, but haven't gotten around to actually reading it and giving feedback. I will make a point to do so!

@nikomatsakis


Contributor

nikomatsakis commented Mar 13, 2018

@matklad ok so I read this last night. I'm definitely 👍 on the general idea of having a shared parsing library. I think the details here matter a lot -- for example, what representation to use, whether and what to auto-generate, etc -- but I don't necessarily think that an RFC is the right place to hash them out. I'm personally optimistic that we can craft a single library that is usable for IDEs, proc macros, and the compiler, but it'll definitely take some iteration and tinkering to get the balance right. (I don't consider these use cases as particularly divergent, though proc macros add the fun of wanting to be more extensible.)

I was thinking that it might be profitable to discuss these matters "live", at the upcoming Rust All Hands gathering, presuming that many of the stakeholders will be there?

One thing I would like to note:

I've been wanting for some time to add a mode to LALRPOP where it generates values of some pre-defined type, much like the trees you define here. The idea would be that you just write a grammar with no actions and we'll build up a tree; we could then layer tree transformers on top of that (a bit like what ANTLR does, iirc). I'd love for this "tree representation" we are discussing here to be an independent standard that we could use for that -- this would in turn allow us to have both hand-written and LALRPOP-generated parsers that are compatible (I'm not sure how much hand-writing buys us compared with LALRPOP's existing error recovery mechanisms, but it's really hard to tell).

I think -- strategically -- it's a mistake to start out with too big of a goal. We should probably start by iterating on actual code and putting it to use. Put another way, I feel like replacing libsyntax would be the "final step", not the first one. But it's good to have that goal in mind as we plan our steps, to make sure we're not accidentally building up things with critical flaws.
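
As a rough Rust sketch of what such a shared, action-free tree representation might look like (names are illustrative, not a proposed API):

// A homogeneous parse-tree node that both a hand-written parser and a
// LALRPOP-generated one could build: one node type tagged with a syntax
// kind, instead of a separate ADT variant per grammar rule.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    SourceFile,
    FnDef,
    Name,
    Whitespace,
    Error, // error recovery produces ordinary nodes too
}

#[derive(Debug)]
struct Node {
    kind: SyntaxKind,
    len: u32,            // text length; absolute offsets derived by walking
    children: Vec<Node>, // tokens are simply childless nodes
}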

@mark-i-m


Contributor

mark-i-m commented Mar 13, 2018

I'm not sure how much hand-writing buys us compared with LALRPOP's existing error recovery mechanisms, but it's really hard to tell

My understanding from the discussion above was that we gained a lot in terms of custom error messages. For example, would it be possible to do something like rust-lang/rust#48858 with an auto-generated parser? Also, is Rust's grammar even known to be LALR?

[drawbacks]: #drawbacks
- No harm will be done as long as the new libsyntax exists as an
experiemt on crates.io. However, actually using it in the compiler

@pickfire

pickfire Apr 15, 2018

s/experiemt/experiment/


* It is minimal: it stores small amount of data and has no
dependencies. For instance, it does not need compiler's string
interner or literal data representation.

@pickfire

pickfire Apr 15, 2018

Did you mean internal?


@rpjohnst

rpjohnst Apr 15, 2018

No, a "string interner" is a data structure that combines all copies of a string into one so they can be shared, for lower memory usage and fast comparison.


new tree.
* A prototype implementation of the macro expansion on top of the new
sytnax tree.

@pickfire

pickfire Apr 15, 2018

s/sytnax/syntax/


@mark-i-m


Contributor

mark-i-m commented Apr 25, 2018

Out of curiosity, what were the results of the All Hands meeting on this?

@matklad


Member

matklad commented Apr 27, 2018

@mark-i-m argh, I should have written it here ages ago, thanks for the ping!

The conclusion from the all-hands discussion was that a lot here depends on the actual implementation details, and that it makes sense to experiment with the parse-tree approach.

For experimenting, we've decided that it would be more interesting to add a parse-tree mode to LALRPOP (which differs from the "let's write another Rust parser by hand" approach I proposed in the RFC).

The current work on LALRPOP is tracked in this issue: lalrpop/lalrpop#354

@matklad


Member

matklad commented Jul 28, 2018

I've just studied the implementation of Swift's libsyntax, and it has some nice ideas. It is also surprisingly easy to understand.

They use a three-layered representation:

  • The first layer is RawSyntax: an immutable persistent (i.e., it stores lengths, not ranges) homogeneous tree (green node).
  • The second layer is SyntaxData: an immutable, lazily materializable homogeneous tree with ranges (red node).
  • The third layer is Syntax and its generated subclasses. This is a typed layer in the style of this RFC.

Now, what makes their representation really interesting is that they store child nodes in an array, and not in a linked list (as proposed in the RFC). This allows for O(1) child access:

class ClassDeclSyntax final : public DeclSyntax {
public:
  enum Cursor : uint32_t {
    Attributes,
    Modifiers,
    ClassKeyword,
    Identifier,
    GenericParameterClause,
    InheritanceClause,
    GenericWhereClause,
    Members,
  };
};


llvm::Optional<GenericParameterClauseSyntax> ClassDeclSyntax::getGenericParameterClause() {
  auto ChildData = Data->getChild(Cursor::GenericParameterClause);
  if (!ChildData)
    return llvm::None;
  return GenericParameterClauseSyntax {Root, ChildData.get()};
}

What I find suboptimal about Swift's representation is that whitespace is attached directly to tokens, which makes it harder to implement "a declaration owns all preceding comments" logic. This can be fixed with the following representation:

struct Trivia {
    kind: SyntaxKind,
    text: InternedString,
}

// SmallVec is parameterized by a backing array in Rust.
type Trivias = SmallVec<[Arc<Trivia>; 4]>;

struct GreenNode {
    kind: SyntaxKind,
    len: TextUnit,
    leading_trivia: Trivias,
    children: [(Arc<GreenNode>, Trivias)], // DST: trailing unsized slice
}

Another bit is that the Arc-based GreenNode representation is optimized for the rare use case of modification. At the same time, I think the majority of syntax nodes are immutable once created. A nice thing about the layered representation, though, is that we can have different implementations of the layers. In particular, RedNodes might be backed either by GreenNodes, or by some more compact representation which stores all nodes of a single file in a single array.

EDIT: a short video overview of libsyntax: https://www.youtube.com/watch?v=5ivuYGxW_3M
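
For illustration, a minimal Rust sketch of the green/red split described above: green nodes store only lengths, while red nodes derive absolute offsets on the way down (all names hypothetical):

use std::sync::Arc;

// Green nodes are position-independent: they record only the length of
// the text they cover, so identical subtrees can be shared.
struct GreenNode {
    kind: u16,
    len: u32,
    children: Vec<Arc<GreenNode>>,
}

// Red nodes are thin cursors over green nodes: the absolute offset is
// computed lazily while descending from the root.
struct RedNode {
    green: Arc<GreenNode>,
    offset: u32,
}

impl RedNode {
    fn children(&self) -> impl Iterator<Item = RedNode> + '_ {
        let mut offset = self.offset;
        self.green.children.iter().map(move |green| {
            let child = RedNode { green: Arc::clone(green), offset };
            offset += green.len;
            child
        })
    }
}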

@matklad


Member

matklad commented Jul 30, 2018

An interesting observation about Swift's tree:

Each node holds an Arc to the root of the tree and a raw pointer to a particular tree node. This is convenient, because you get owned syntax tree nodes with value semantics and parent pointers. However, that means that read-only traversal of the tree generates atomic refcount increments/decrements, which I expect might cause contention if a single file is processed concurrently. That is, Root in return GenericParameterClauseSyntax {Root, ChildData.get()}; is actually a non-free atomic operation (I might be reading the C++ wrong here).

However, in Rust, we can make the strong pointer to the root generic, and use either Arc<Root> or &'a Root for it. Moreover, it is possible to implement

fn as_borrowed<'a>(owned: &'a SyntaxNode<Arc<Root>>) -> SyntaxNode<&'a Root>

That is, it is possible at runtime to switch from an owned, Arc'ed version to a Copy (!) borrowed version for local processing and, I think (with a bit of unsafe magic), back as well.
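
A sketch of that genericity over the root pointer, using a placeholder raw pointer for the node handle (not actual libsyntax2 code):

use std::sync::Arc;

struct Root; // stands in for the real syntax tree root

// The root pointer is generic, so the same node type comes in an owned
// flavor (Arc<Root>) and a Copy, borrowed flavor (&Root).
#[derive(Clone, Copy)]
struct SyntaxNode<R> {
    root: R,
    node: *const (), // placeholder for the raw pointer into the tree
}

type OwnedNode = SyntaxNode<Arc<Root>>;  // value semantics, refcounted
type RefNode<'a> = SyntaxNode<&'a Root>; // Copy, no refcount traffic

fn as_borrowed<'a>(owned: &'a OwnedNode) -> RefNode<'a> {
    SyntaxNode { root: &*owned.root, node: owned.node }
}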

@lnicola


lnicola commented Jul 31, 2018

@matklad Are you aware of the Roslyn team's work on this? It might be worth looking into. There's a very high level description here and maybe here.

It does look similar to what Swift is doing.

@matklad


Member

matklad commented Jul 31, 2018

@lnicola yes, I am aware of Roslyn's approach. Swift's libsyntax is indeed a realization of this red/green idea. Their implementation is much easier to read, and it also does not rely on GC.

@maxbrunsfeld


maxbrunsfeld commented Aug 5, 2018

I've just been doing some work on tree-sitter-rust, the incremental Rust parser that Atom will soon ship with (and hopefully Xray will ship with at some point).

I had one interesting realization related to macros: for syntax highlighting purposes, and probably other purposes as well, it's desirable, where possible, to parse the contents of token trees as expressions and items, as opposed to leaving them in the unstructured form that rustc -Z ast-json-noexpand represents them in.

For example, before I added this feature, code like this would not syntax highlight very nicely, because we wouldn't know that c.d is a field access and that T and U are types.

assert_eq!(a::b::<T, U>(), c.d);

Of course, not all token trees have a regular structure like this, so we need to 'fall back' to parsing them in an unstructured way. With tree-sitter, I handle this via GLR's ambiguity resolution mechanism. With a hand-written parser, you could probably use a multi-pass approach.

I also might be overthinking this; I'm curious if you have thoughts on this issue.

@matklad

Member

matklad commented Aug 5, 2018

I had one interesting realization related to macros: for syntax highlighting purposes, and probably other purposes as well, it's desirable to, if possible, parse the contents of token trees as expressions and items, as opposed to leaving them in the unstructured form that rustc -Z ast-json-noexpand represents them in.

That is true. Extend selection is next to useless if what looks to a human like an expression is represented as a token tree in the syntax tree.

I also agree that non-deterministic parsing of a macro invocation body as an expression or an item is a good approximation, especially with GLR approaches, where you can cheaply try all variants of the parse.

However, the proper solution here is indeed two-phase parsing. You need to know the difference between macro_rules! foo { ($arg:expr) => { ... }} and macro_rules! foo { ($arg:item) => { ... }} to parse a macro invocation correctly. So, you first need to parse the file in isolation, treating each macro invocation as a token stream. Then you need to resolve macro calls, interpret macro definitions, and then inject true syntax trees into the macro calls. And don't forget that two-phase parsing is also useful for things like embedding CSS/JS into HTML!

So, for practical purposes, I would probably use the following progression (see the sketch after this list):

  1. Treat macro invocations as token trees.
  2. Hard-code the println family of macros ([e]print[ln], write[ln], format, panic, unreachable, logging macros).
  3. Add support for two-phase parsing.
  4. Using two-phase parsing and GLR, implement a heuristic approach to parsing (I'd do it on top of two-phase parsing and not directly in the grammar, just to exercise the two-phase approach a bit).
  5. Extend the heuristics by, for example, trying to guess the appropriate macro definition based on an index of the macros in the project (surprisingly, such a "resolve" wouldn't be too far from the truth for 2015-style macros).
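
Here is a small, self-contained illustration of why invocation parsing depends on the definition; both macros are made up for the example:

// The same invocation syntax must be parsed differently depending on the
// fragment specifiers in the definition.
macro_rules! takes_expr {
    ($arg:expr) => { let _ = $arg; };
}
macro_rules! takes_tokens {
    ($($arg:tt)*) => {};
}

fn main() {
    takes_expr!(1 + 2); // the body must parse as an expression
    takes_tokens!(=> not an expression =>); // any balanced token stream is fine
}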
@maxbrunsfeld


maxbrunsfeld commented Aug 5, 2018

So, you first need parse the file in isolation, treating each macro invocation as a token stream. Then you need to resolve macro calls, interpret macro definitions and then inject true syntax trees into macro calls. And don't forget that two-phase parsing is also useful for things like embedding CSS/JS into HTML!

Yeah, multi-phase parsing is definitely useful in general; we currently use it for things like JS in HTML, templating languages like EJS, etc. The hard parts of what you propose seem to be resolving macro calls correctly and interpreting macro definitions.

You need to know the difference between macro_rules! foo { ($arg:expr) => { ... }} and macro_rules! foo { ($arg:item) => { ... }} to parse a macro invocation correctly.

But I don't think you could do this based on some finite heuristic, because macro patterns can have arbitrary nesting and complexity. For example, in this macro from serde, the outer token tree isn't an expression, but there are many expressions nested within it, before and after the => token.

declare_tests! {
    // ...

    test_result {
        Ok::<i32, i32>(0) => &[
            Token::Enum { name: "Result" },
            Token::Str("Ok"),
            Token::I32(0),
        ],
    }

    // ...
}

With my current approximate approach based on GLR, I'm able to determine on-the-fly that these inner token trees are expressions, but the outer one is not. How would a heuristic-based system deal with macros like this?

@matklad


Member

matklad commented Aug 7, 2018

How would a heuristic-based system deal with macros like this?

I mean something like "heuristically resolve the macro call, and then interpret the macro definition". But this is probably a lot of effort for little gain compared to what GLR already gives you.

@matklad


Member

matklad commented Aug 7, 2018

Status update:

I've implemented (approximately) a Swift-style syntax tree in libsyntax2, in both owned and borrowed variants: https://github.com/matklad/libsyntax2/blob/2fb854ccdae6f1f12b60441e5c3b283bdc81fb0a/src/yellow/syntax.rs

I've also hacked quite a bit on the parser itself, so that it now parses the majority of non-weird Rust constructs. Here's, for example, libsyntax2-based extend selection: https://www.youtube.com/watch?v=21NbnLhj-S4
