Bison reports 10 shift/reduce conflicts when compiling the re2c parser. It turns out that all of them are caused by one unfortunate production in the grammar:
decl -> FID expr
which stands for flex-style named definitions of the form:
name regular-expression
re2c tries to partially support flex syntax with the '-F' flag. Native re2c named definitions are of the form:
name = regular-expression ;
Another notable difference is that re2c allows newlines inside of regular expressions in named definitions, while flex doesn't.
Both re2c and flex have rules of the form:
regular-expression action
re2c syntax allows mixing named definitions with rules. With native re2c named definitions that's ok: they have an ending semicolon that distinguishes them from rules. However, flex-style named definitions don't have an ending semicolon (newline acts as a delimiter in flex, but not in re2c), so mixing them with rules introduces a parsing ambiguity. Consider the following example:
/*!re2c
name "a"
"b" "c" {}
*/
One can interpret this fragment in two different ways:
definition -> name "a"
rule -> "b" "c" {}
and
definition -> name "a" "b"
rule -> "c" {}
Both ways are valid, so there's a real ambiguity in the grammar, not just some stupid LALR(1) conflict.
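To make the ambiguity concrete, here is a minimal Python sketch that enumerates the possible readings of the token sequence above. It is not the re2c parser: the `parses` function and the flat token list are hypothetical, and it assumes newlines are ignored and that both a definition and a rule need at least one regular-expression atom.

```python
def parses(tokens):
    """Enumerate every way to read `tokens` as one flex-style named
    definition followed by one rule, with newlines ignored."""
    name, *atoms, action = tokens
    results = []
    # The definition must take at least one atom and leave at least one
    # atom for the rule, so every split point in between is a valid parse.
    for k in range(1, len(atoms)):
        definition = [name] + atoms[:k]
        rule = atoms[k:] + [action]
        results.append((definition, rule))
    return results


# Hypothetical token stream for the fragment: name "a" / "b" "c" {}
tokens = ['name', '"a"', '"b"', '"c"', '{}']

for definition, rule in parses(tokens):
    print("definition ->", " ".join(definition))
    print("rule       ->", " ".join(rule))
    print()
```

With three atoms between the name and the action there are two split points, which are exactly the two readings listed above. Every extra atom adds one more reading.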
In flex, there's no parsing problem: it has newline as a delimiter and doesn't allow mixing named definitions with rules. Named definitions must all come together in a separate section delimited by "%%":
definitions
%%
rules
As of now, re2c will fail to parse the example above. However, the parsing problem vanishes in '-c' mode, because with '-c' rules have a different form:
condition regular-expression action
Some re2c users (notably the PHP team) use '-F' together with '-c' and don't face the parsing problem.
So what should we do? I see the following options:
1. Drop flex-style syntax and the -F option completely.
   - Breaks backwards compatibility for all -F users (they have to rewrite their code to use native re2c named definitions), but at least the new code is compatible with older versions of re2c.
   - Easy to implement; the re2c parser will become much tidier.
2. Allow flex-style named definitions only in a separate section delimited by "%%", as flex does.
   - Breaks backwards compatibility for all -F users (they have to add "%%" and group named definitions together), and the new code is incompatible with older re2c versions.
   - Not hard to implement.
3. Forbid newlines everywhere in regular expressions (both in named definitions and in rules).
   - Breaks backwards compatibility in a nasty way for some unfortunate users.
   - Moderately hard to implement.
4. Forbid newlines only in flex-style named definitions and demand an ending newline in flex-style named definitions.
   - Still may break some code, though it conforms to flex syntax.
   - Very hard to implement; the re2c parser will bloat immensely (I tried it and I really wouldn't recommend it).
I vote for (1) for the following reasons:
- Painful as it is, it is better than making new code incompatible with old re2c (2), breaking old code in obscure ways (3) or making re2c unmaintainable (4).
- I think that flex-style syntax is not very popular among re2c users. One notable exception is the PHP team, but you folks are rapidly developing and responsive, and some syntax de-sugaring must be an easy job for you (or I can send a patch if you're short of time) :) You won't have to care about the re2c version: both old and new will do.
- I think that two different syntaxes will never play well together and trying to merge them is a silly thing to do.
If (1) raises no objections, what should we do with the -F option (remove it or leave it deprecated)?
Opinions welcome.
…grammar".
This commit removes 10 shift/reduce conflicts in bison grammar for re2c.
These conflicts are caused by allowing flex-style named definitions
name regular-expression
to contain newlines and to be mixed with rules. It's not just some
conflicts in LALR(1) grammar, it is genuine ambiguity as can be observed
from the following example:
/*!re2c
name "a"
"b" "c" {}
*/
which can be parsed in two ways:
definition -> name "a"
rule -> "b" "c" {}
and
definition -> name "a" "b"
rule -> "c" {}
, both ways being perfectly valid.
This commit resolves ambiguity by forbidding newlines in flex-style
named definitions (conforming to flex syntax). Newline in these
definitions is treated in a special way: lexer emits token 'FID_END',
which marks the end of flex-style named definition in parser.
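The idea of a newline-triggered end marker can be sketched in Python as follows. This is a toy illustration, not the actual re2c lexer: the `tokenize` function, the token kinds other than `FID_END`, and the rule "a line starting with a bare identifier is a flex-style definition" are all assumptions made for the example.

```python
import re

# Toy token pattern: quoted string, braced action, or bare identifier.
TOKEN = re.compile(r'"[^"]*"|\{[^}]*\}|[A-Za-z_]\w*')


def tokenize(source):
    """Tokenize line by line. A line that starts with a bare identifier
    is assumed to be a flex-style named definition, and the newline that
    ends it is reported as an explicit FID_END token."""
    tokens = []
    for line in source.splitlines():
        parts = TOKEN.findall(line)
        if not parts:
            continue
        in_definition = parts[0][0].isalpha() or parts[0][0] == "_"
        for i, text in enumerate(parts):
            if i == 0 and in_definition:
                tokens.append(("FID", text))
            elif text.startswith("{"):
                tokens.append(("ACTION", text))
            else:
                tokens.append(("ATOM", text))
        if in_definition:
            # The newline terminates the definition unambiguously.
            tokens.append(("FID_END", ""))
    return tokens


source = 'name "a"\n"b" "c" {}\n'
kinds = [kind for kind, _ in tokenize(source)]
print(kinds)
```

With the end marker in the token stream, a production along the lines of `decl -> FID expr FID_END` can only match `name "a"`, so the second reading from the example is no longer possible.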