Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grammar ambiguity #2954

Closed
andersfr opened this issue Jul 27, 2019 · 6 comments
Closed

Grammar ambiguity #2954

andersfr opened this issue Jul 27, 2019 · 6 comments

Comments

@andersfr
Copy link
Contributor

In PtrTypeStart we have the following syntax:
KEYWORD_align LPAREN Expr (COLON INTEGER COLON INTEGER)? RPAREN

Or as simplified example: align(Expr:1:32)

This conflicts with the block expression syntax: Identifier: {...}

The issue arises because Identifier is a valid Expr production and thus it becomes impossible to deduce the meaning of the colon token without consuming additional input. In more theoretical terms this is an instance of a shift-reduce conflict.
The simplest solution is to introduce a :: token and use it instead of the leading colon. It looks almost identical.
KEYWORD_align LPAREN Expr (COLONCOLON INTEGER COLON INTEGER)? RPAREN

Or as simplified example: align(Expr::1:32)

@hryx
Copy link
Sponsor Contributor

hryx commented Jul 27, 2019

To be precise, a Parsing Expression Grammars (PEG) is by definition unambiguous -- the first match is the correct match.

A simple solution is to enclose the identifier in parentheses:

align((Expr):1:32)

And just for fun, to demonstrate the grammar precedence (not an honest suggestion):

align(blk: {
    break :blk Expr;
}:1:32)

@andersfr
Copy link
Contributor Author

andersfr commented Jul 28, 2019

The problem is not that it cannot eventually be disambiguated. It is that this issue breaks the possibility of making a context-free grammar, e.g. using LALR(1) which I happen to be using.
Your conclusion on first match is also incorrect. Upon seeing Identifier: the first match is to assume it is a block label that must be followed by a block, because we are currently parsing an expression. The parser then either throws an error upon seeing an IntegerLiteral or must resort to backtracking to resolve the conflict.
The Zig compiler indeed breaks on this problem as explained:

pub fn main() void {
    const identifier: usize = 8;
    const test1 = *align(label: { break :label 8; }:1:32) u8;
    const test2 = *align(identifier:1:32) u8;
}

issue.zig:4:37: error: invalid token: '1'
const test2 = *align(identifier:1:32) u8;

@hryx
Copy link
Sponsor Contributor

hryx commented Jul 28, 2019

using LALR(1) which I happen to be using

If your goal is to write a CFG LL/LR/LALR for Zig, that may not be possible. I'm not an expert, but from Wikipedia:

[...] but the unlimited lookahead capability of the grammar formalism is then lost. Therefore, not all languages that can be expressed using parsing expression grammars can be parsed by LL or LR parsers.

Regarding the error from your example, that is the expected result with today's grammar. identifier:1:32 will not work. Of course this won't help if you're writing a parser, but as a user, enclosing identifier in parentheses makes the intention possible.

Your conclusion on first match is also incorrect.

What I meant was to describe the rule-matching behavior of a PEG in general.

Note that syntax ambiguities are precisely what inspired moving to the new grammar type. Hope that helps or clears some things up.

EDIT: I mistakenly said that a CFG might not be possible to write, when I should have said LR

@andersfr
Copy link
Contributor Author

A Parsing Expression Grammar (PEG) will indeed be able to parse identifier:1:32 because it has unlimited lookahead and this is only ambiguous with lookahead less than 2. It is not an instance of prioritization between rules (they are mutually exclusive).

Increasing lookahead or context dependence slows down compilers. Either by having excessive branching in the parser logic or postponing the problem to semantic analysis.

It is trivial to fix the bug in the Zig compiler (after ALIGN LPAREN check for the edge case before branching into the ast_parse_expression). But let's rather fix the syntax instead of forcing the compiler into submission with manual edge case handling.

@hryx
Copy link
Sponsor Contributor

hryx commented Jul 29, 2019

From what I can tell, this is a proposal and not a bug. Maybe I'm conflating the spec and its implementation. But I should disengage from the conversation since I seem to be out of my depth. Apologies for convoluting the issue.

@andersfr
Copy link
Contributor Author

No apologies needed. It gave me chance to discover the zig-spec PEG implementation and give it a test run.
I will close the issue and resubmit as proposal for changing the align syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants