Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flex-style named definitions cause ambiguity in re2c grammar #115

Closed
skvadrik opened this issue Jul 28, 2015 · 1 comment

Comments

@skvadrik
Copy link
Owner

commented Jul 28, 2015

Bison reports 10 shift/reduce conflicts when compiling re2c parser. Turns out that all of them are caused by one unfortunate production in grammar:

decl -> FID expr

which stands for flex-style named definitions of the form:

name regular-expression

re2c tries to partially support flex syntax with '-F' flag. Native re2c named definitions are of the form:

name = regular-expression ;

Another notable difference is that re2c allows newlines inside of regular expressions in named definitions, while flex doesn't.
Both re2c and flex have rules of the form:

regular-expression action

re2c syntax allows to mix named definitions with rules. With native re2c named definitions that's ok: they have an ending semicolon that allows to distinguish them from rules. However, flex-style named definitions don't have an ending semicolon (newline acts as a delimiter in flex, but not in re2c), so mixing them with rules introduces parsing ambiguity. Consider the following example:

/*!re2c
    name "a"
    "b" "c" {}
*/

One can interpret this fragment in two different ways:

definition -> name "a" 
rule -> "b" "c" {}

and

definition -> name "a" "b"
rule -> "c" {}

both ways are valid, so there's a real ambiguity in grammar, not just some stupid LALR(1) conflict.

In flex, there's no parsing problem: it has newline as a delimiter and doesn't allow to mix named definitions with rules. Named definitions must all come together in a separate section delimited by "%%" :

definitions
%%
rules

As of now, re2c will fail to parse the example above. However, parsing problem vanishes in '-c' mode, because with '-c' rules have different form:

condition regular-expression action

Some re2c users (and notably, PHP team) use '-F' together with '-c' and don't face the parsing problem.

So what should we do? I see the following options:

  1. drop flex-style syntax and -F option completely
    • breaks backwards compatibility for all -F users (users have to rewrite code to use re2c native named definitions), but at least the new code is compatible with older versions of re2c
    • easy to implement, re2c parser will become mush tidier
  2. allow flex-style named definitions only in a separate section delimited by "%%", as flex does
    • breaks backwards compatibility for all -F users (users have to add "%%" and group named definitions together), the new code is incompatible with older re2c versions
    • not hard to implement
  3. forbid newlines everywhere in regular expressions (both in named definitions and in rules)
    • breaks backwards compatibility in a nasty way for some unfortunate users
    • moderately hard to implement
  4. forbid newlines only in flex-style named definitions, demand an ending newline in flex-style named definitions:
    • still may break some code, though conforming to flex syntax
    • very hard to implement, re2c parser will bloat immensely (I tried it and I really wouldn't recommend it)

I vote for (1) for the following reasons:

  • Painful as it is, it is better than making new code incompatible with old re2c (2), breaking old code in obscure ways (3) or making re2c unmaintainable (4).
  • I think that flex-style syntax is not very popular among re2c users. One notable exception is PHP team, but you folks are rapidly developing and responsive, and some syntax de-sugaring must be an easy job for you (or I can send a patch if you're short of time) :) You won't have to care for re2c version, both old and new will do.
  • I think that two different syntaxes will never play well together and trying to merge them is a silly thing to do.

If (1) raises no objections, what should we do with -F option (remove or leave deprecated)?

Opinions welcome.

@skvadrik skvadrik self-assigned this Jul 28, 2015

skvadrik added a commit that referenced this issue Jul 29, 2015

Commented parser conflict.
See #115

orts the commit.

skvadrik added a commit that referenced this issue Jul 30, 2015

Fixed bug #115 "flex-style named definitions cause ambiguity in re2c …
…grammar".

This commit removes 10 shift/reduce conflicts in bison grammar for re2c.
These conflicts are caused by allowing flex-style named definitions
    name regular-expression
to contain newlines and to be mixed with rules. It's not just some
conflicts in LALR(1) grammar, it is genuine ambiguity as can be observed
from the following example:
    /*!re2c
        name "a"
        "b" "c" {}
    */
which can be parsed in two ways:
    definition -> name "a"
    rule -> "b" "c" {}
and
    definition -> name "a" "b"
    rule -> "c" {}
, both ways being perfectly valid.

This commit resolves ambiguity by forbidding newlines in flex-style
named definitions (conforming to flex syntax). Newline in these
definitions is treated in a special way: lexer emits token 'FID_END',
which marks the end of flex-style named definition in parser.
@skvadrik

This comment has been minimized.

Copy link
Owner Author

commented Jul 30, 2015

Fixed by forbidding newlines in flex-style named definitions. Turned out to be quite easy due to a hack in lexer.

e02b761#diff-9351147abfd7dc33744eaa3f4f30f371R199

@skvadrik skvadrik closed this Jul 30, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.