Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src/grammar for the canonical grammar of the Rust language #1331

Merged
merged 1 commit into from Jan 15, 2016

Conversation

Projects
None yet
@nagisa
Copy link
Contributor

nagisa commented Oct 21, 2015

@nagisa nagisa changed the title src/grammar for the canonical gramar of the Rust language src/grammar for the canonical grammar of the Rust language Oct 21, 2015

@nagisa nagisa force-pushed the nagisa:grammar branch from 6e25ec4 to 5e0efc2 Oct 21, 2015

@nagisa nagisa force-pushed the nagisa:grammar branch from 5e0efc2 to 44e88d3 Oct 21, 2015

@steveklabnik

This comment has been minimized.

Copy link
Member

steveklabnik commented Oct 21, 2015

👍 😍 💯

@nrc nrc added the T-lang label Oct 22, 2015

@nikomatsakis nikomatsakis self-assigned this Oct 22, 2015

@bstrie

This comment has been minimized.

Copy link
Contributor

bstrie commented Oct 27, 2015

The semi-canonical step may be unnecessary. I don't expect there to be very many discrepancies between what we think the grammar is and what the grammar actually is, so we could just switch to the fully canonical stage directly. Even if a discrepancy is discovered after that point, all it would take would be an RFC to determine whether it should result in a change to the grammar or a change to the parser (obviously taking into account how much breakage would result from a change to the parser).

@matklad

This comment has been minimized.

Copy link
Member

matklad commented Nov 3, 2015

I'm hugely interested in this, many thanks!

One question that bothers me is how the consistency between libsyntax and src/grammar is to be enforced? I am afraid that if automatic comparison of parsers is not a part of the test suite, then the grammar will never become canonical (or the rust parser will never become conforming, if you prefer this half of the glass).

And if we speak of test suite, it would be ideal if it is implementation independent and included in something like ctrs.

@steveklabnik

This comment has been minimized.

Copy link
Member

steveklabnik commented Nov 3, 2015

We have vestigial grammar tests still in the makefiles, I think.

@Aatch

This comment has been minimized.

Copy link
Contributor

Aatch commented Nov 6, 2015

I share @matklad's concerns. It'd be awesome to have a canonical grammar, but maintaining consistency seems like a major challenge.

I'm not sure how we could actually go about testing it in a rigorous manner. Since the compiler's parser doesn't have to parse in the same way a generated parser (from the grammar) would, figuring out whether or not they are consistent with each other seems difficult.

That said, less-rigorous testing could be done with a combination of fuzzing and regression tests. The grammar could be used to generate syntactically-correct (if nonsensical) code which is run through the compiler's parser. If the sample fails to parse, we can look at it and decide if it's a "real" failure or not. If it is, we can fix the bug and use the sample (or a reduced version) as the regression test. The downside here is that it doesn't enforce the same parse. The compiler could parse a code fragment without error, but not actually be correct. I can't see this being a problem though.

With all that said, some discussion around testing the grammar and parser would be nice to see in the RFC proper.

@steveklabnik

This comment has been minimized.

Copy link
Member

steveklabnik commented Nov 6, 2015

@matklad

This comment has been minimized.

Copy link
Member

matklad commented Nov 6, 2015

@steveklabnik I thought those are only lexer tests

@steveklabnik

This comment has been minimized.

Copy link
Member

steveklabnik commented Nov 6, 2015

They at least used to be referred to as "grammar tests", though maybe I cleared some of them out and they're not there anymore. We were like, using antlr and such...


This process is sure to become ambiguous over time as syntax is increasingly adjusted (it is harder
to “blame” syntax changes compared to syntax additions), therefore the resolution process of
discrepancies will also depend more on a decision from the Rust team.

This comment has been minimized.

@nikomatsakis

nikomatsakis Nov 6, 2015

Contributor

Nit: this should be the final decision of the language subteam, I think. Seems like the most logical decision making body for this sort of question. (And it's already what we do in practice.)

# Unresolved questions
[unresolved]: #unresolved-questions

How much time should pass between [src/grammar] becoming semi-canonical and fully-canonical?

This comment has been minimized.

@nikomatsakis

nikomatsakis Nov 6, 2015

Contributor

I feel there does not have to be a defined time period per se. We'll work it out.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 6, 2015

@nagisa I am very positive on this idea. For some time now I have been wanting to have a grammar changes written up more rigorously. I think your plan is pretty coherent -- basically there's going to be a messy period where we have to make ad-hoc judgement calls on what is the 'canonical' grammar for a bit of time, but eventually we'll get to a clean state.

I do think it's important that we add automatic checking of the Rust grammar against the in-tree sources (at least). Ideally we'd maybe even do crater runs or something, but you gotta run before you can walk.

One question that I am wondering is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness? (Also, is one project based on the other?) I know that @rprichard has been filing various issues about discrepancies between what rustc does and what rust-grammar does, so they must have some automated means of testing it.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 6, 2015

cc @rprichard @bleibig @dgrunwald as people who seem likely to have an opinion on this RFC and related matters.

@nagisa

This comment has been minimized.

Copy link
Contributor Author

nagisa commented Nov 6, 2015

One question that I am wondering is how the in-tree grammar compares to e.g. the rust-grammar project in terms of completeness?

As far as I can see our in-tree LARL grammar is pretty much a copy of this project (sans the changes that have been made to the grammar). I suspect, without much digging, that it was imported from the rust-grammar project itself.

@nagisa

This comment has been minimized.

Copy link
Contributor Author

nagisa commented Nov 6, 2015

I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there’s no existing attempts to do this as far as I know. While that sounds like a nice way to test the parser conformance, I’m not even sure how complex the implementation of such tester would be.

Answers to similar question are also pretty hand-wavy.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 11, 2015

@nagisa

I’m aware it is usually possible to generate valid inputs from a given grammar (or its state machine), but there’s no existing attempts to do this as far as I know. While that sounds like a nice way to test the parser conformance, I’m not even sure how complex the implementation of such tester would be.

ISTM that the rust compiler, libs, and crates.io provide plenty of sample inputs :)

@dgrunwald

This comment has been minimized.

Copy link
Contributor

dgrunwald commented Nov 11, 2015

ISTM that the rust compiler, libs, and crates.io provide plenty of sample inputs :)

Different grammar productions can interact in surprising ways.

For example, the operator as allows invoking a lambda expression without parenthesizing the lambda expression:

fn main() {
    || println!("Hello World") as ()()
}

It's not obvious that this is even possible (and it probably shouldn't be), so I'd be surprised if you can find such code in the wild.

@matklad

This comment has been minimized.

Copy link
Member

matklad commented Nov 11, 2015

@nikomatsakis sample inputs will allow to verify that the formal grammar accepts what rustc accepts. But even the trivial grammar can do this, namely, the grammar of all token sequences.

The hard problem is to make sure that the formal grammar does not accept invalid Rust. As I understand @Aatch suggestion was about this direction of verification.

And the second hard problem is to make sure that both the grammar and the rustc produce similar parse trees: you can easily mix up operator precedence, for a start.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 11, 2015

@dgrunwald @matklad yes, ok, right, forgot about @Aatch's question. Nonetheless, I think that comparing the two against all the Rust code we can get our hands on (and yes, comparing the parse trees, which seems eminently doable) provides a fairly high degree of confidence that we are parsing what we want to, despite the theoretical disparities. Certainly generating inputs would let us go further (though we'd presumably be filtering these inputs against the rustc impl, which still means that there may be cases where both do the wrong thing in the same way).

@graydon

This comment has been minimized.

Copy link

graydon commented Nov 12, 2015

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 12, 2015

@graydon

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

It is, of course. =)

@nagisa

This comment has been minimized.

Copy link
Contributor Author

nagisa commented Nov 12, 2015

I'd be interested in taking this a step further and experimenting with a grammar-generated frontend. I honestly assumed that was part of Niko's plan with LALRPOP.

To me it sounds like generating parser is a natural step to take after we have a proper (canonical and verified to be correct) grammar. We need some way to ensure a grammar matches current implementation (and/or becomes the canonical representation of the Rust’s grammar) before we migrate rustc to a parser generator of any sort, don’t we?

However, that sounds like something that would need to be discussed once we have a generator itself ready, right?

@dgrunwald

This comment has been minimized.

Copy link
Contributor

dgrunwald commented Nov 12, 2015

Why would we migrate rustc to a generated parser?

Good messages for syntax errors are so much easier in a hand-written recursive descend parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a (partial) syntax tree for code completion, you'll have to puzzle it together from the partially matched productions remaining on the parser's stack. Or (equivalently) introduce a whole bunch of error productions, and you'll always forget some and have cases where nodes aren't integrated into the syntax tree and code completion suffers. This stuff will be much much easier with the existing hand-written parser (since it's LL-style, there's always an obvious choice for connecting the current node to the root of the tree when you encounter an error).

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 12, 2015

@dgrunwald Note that we did not necessarily say an LR parser. I was hoping
for LL(k) or maybe at worst GLL. But I agree that the benefits of an
automatic parser do not necessarily outweigh the downsides.

On Thu, Nov 12, 2015 at 12:30 PM, Daniel Grunwald notifications@github.com
wrote:

Why would we migrate rustc to a generated parser?

Good messages for syntax errors are so much easier in a hand-written
recursive descend parser.

Also, LR parsers are especially problematic for use in IDEs: if you need a
(partial) syntax tree for code completion, you'll have to puzzle it
together from the partially matched productions remaining on the parser's
stack. Or (equivalently) introduce a whole bunch of error productions, and
you'll always forget some and have cases where nodes aren't integrated into
the syntax tree and code completion suffers. This stuff will be much much
easier with the existing hand-written parser (since it's LL-style, there's
always an obvious choice for connecting the current node to the root of the
tree when you encounter an error).


Reply to this email directly or view it on GitHub
#1331 (comment).

@dgrunwald

This comment has been minimized.

Copy link
Contributor

dgrunwald commented Nov 13, 2015

comparing the parse trees, which seems eminently doable

Careful there; you seem to be assuming that it will be possible to specify an unambiguous formal grammar that matches rustc's behavior (modulo rustc bugs that we can fix without breaking the world).

I'm not sure if this is possible. Some things are easy to specify in English (and in a hand-written parser), but hard in a formal grammar.

For example, take this small grammar, which looks like a tiny subset of rust's expression grammar (allowing integer literals, the "as" operator, and less than operator):

type ::= identifier | identifier "<" type ">"
expr ::= int_literal | expr "as" type | expr "<" expr

But this grammar already accepts some expressions that rustc does not: e.g. 1 as i32 < 2.

It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:

type ::= identifier | identifier "<" type ">"
type_LT_follow ::= identifier "<" type ">"
expr ::= int_literal
    | expr "as" type
    | expr_LT_follow "<" expr
expr_LT_follow ::= int_literal
    | expr "as" type_LT_follow
    | expr_LT_follow "<" expr_LT_follow

Note that we just duplicated the number of productions! A few more such disambiguation rules, and our grammar increases exponentially in size. It becomes unreadable and unmaintainable.

Keeping the disambiguation rule in plain English is much better. This is also the option taken in the C# language specification: the formal grammar in the appendix is just a rough approximation, and sections like "7.6.4.2 Grammar ambiguities" explain how the parser should handle specific tricky cases.

But if the formal grammar is just an approximation, that makes automated consistency checks rather tricky...

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Nov 13, 2015

@dgrunwald

Note that we just duplicated the number of productions! A few more such
disambiguation rules, and our grammar increases exponentially in size. It
becomes unreadable and unmaintainable.

We'll have to see how it goes. I'll note though that LALRPOP's macros
include an explicit feature for just this case, but I've not tried using it
"at scale" yet. This is the kind of thing I was thinking of when designing
it, though. (In particular, they let you specify flags that can filter
which productions are used, etc.) I seem to recall that JavaScript uses a
similar mechanism for cases like this in their spec, but I don't recall for
certain.

On Thu, Nov 12, 2015 at 7:37 PM, Daniel Grunwald notifications@github.com
wrote:

comparing the parse trees, which seems eminently doable

Careful there; you seem to be assuming that it will be possible to specify
an unambiguous formal grammar that matches rustc's behavior (modulo rustc
bugs that we can fix without breaking the world).

I'm not sure if this is possible. Some things are easy to specify in
English (and in a hand-written parser), but hard in a formal grammar.

For example, take this small grammar, which looks like a tiny subset of
rust's expression grammar (allowing integer literals, the "as" operator,
and less than operator):

type ::= identifier | identifier "<" type ">"
expr ::= int_literal | expr "as" type | expr "<" expr

But this grammar already accepts some expressions that rustc does not:
e.g. 1 as i32 < 2.

It's easy to say: "A '<' token after a type name is always considered to
be part of the type." But doing this in the formal grammar is not so easy:

type ::= identifier | identifier "<" type ">"
type_LT_follow ::= identifier "<" type ">"
expr ::= int_literal
| expr "as" type
| expr_LT_follow "<" expr
expr_LT_follow ::= int_literal
| expr "as" type_LT_follow
| expr_LT_follow "<" expr_LT_follow

Note that we just duplicated the number of productions! A few more such
disambiguation rules, and our grammar increases exponentially in size. It
becomes unreadable and unmaintainable.

Keeping the disambiguation rule in plain English is much better. This is
also the option taken in the C# language specification: the formal grammar
in the appendix is just a rough approximation, and sections like "7.6.4.2
Grammar ambiguities" explain how the parser should handle specific tricky
cases.

But if the formal grammar is just an approximation, that makes automated
consistency checks rather tricky...


Reply to this email directly or view it on GitHub
#1331 (comment).

@rprichard

This comment has been minimized.

Copy link

rprichard commented Nov 13, 2015

@dgrunwald

It's easy to say: "A '<' token after a type name is always considered to be part of the type." But doing this in the formal grammar is not so easy:

FWIW, the parser-lalr grammar does seem to handle this ambiguity correctly using precedence directives rather than duplicating the expression non-terminal. I'm guessing a precedence directive is also the reason this parser-lalr bug isn't caught at compile-time.

@glaebhoerl

This comment has been minimized.

Copy link
Contributor

glaebhoerl commented Nov 18, 2015

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Dec 21, 2015

(Nominating for final comment period discussion, since it seems like we've reached a fixed point here in this comment thread.)

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Jan 8, 2016

Hear ye, hear ye! This RFC is now entering final comment period.

@jarcane

This comment has been minimized.

Copy link

jarcane commented Jan 12, 2016

This is potentially very far off in future times, but would having a canonical grammar like this make it easier for people to develop alternate implementations?

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Jan 15, 2016

@jarcane

This is potentially very far off in future times, but would having a canonical grammar like this make it easier for people to develop alternate implementations?

It would certainly help, though probably not by THAT much.

@nikomatsakis

This comment has been minimized.

Copy link
Contributor

nikomatsakis commented Jan 15, 2016

Huzzah! This RFC has been accepted by the language design subteam.

nikomatsakis added a commit to nikomatsakis/rfcs that referenced this pull request Jan 15, 2016

@nikomatsakis nikomatsakis merged commit 44e88d3 into rust-lang:master Jan 15, 2016

@chriskrycho chriskrycho referenced this pull request Mar 29, 2017

Closed

Document all features #9

18 of 48 tasks complete

@Centril Centril added the A-syntax label Nov 23, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.