empty character class [] matches empty string #59

Closed
skvadrik opened this issue Jun 11, 2015 · 7 comments

Comments

@skvadrik
Owner

commented Jun 11, 2015

E.g. the following source:

/*!re2c
    [] {}
*/

compiles to the following code:

{
        YYCTYPE yych;

        {}
}

Reproducible with 0.13.6, 0.14 and HEAD (and it seems to have always been that way). An empty character class should of course never match, and re2c should preferably report a warning.

Original comment by: skvadrik

@skvadrik

Owner Author

commented Jun 11, 2015

I'm thinking of completely forbidding empty character classes. That seems to be consistent with Perl and POSIX BRE/ERE regular expressions.

Original comment by: skvadrik

@skvadrik

Owner Author

commented Jun 16, 2015

Further analysis of the following cases:

/*!re2c
    [] {}
*/

/*!re2c
    [^\x00-\xFF] {}
*/

/*!re2c
    [\x00-\xFF]\[\x00-\xFF] {}
*/

showed that versions <=0.13.6 and >=0.13.7 behave differently.

Up to 0.13.6, re2c consistently treated an empty range as matching the empty string:

$ re2c -i --no-generation-date 1.re 
/* Generated by re2c 0.13.6 */

{
        YYCTYPE yych;

        {}
}



{
        YYCTYPE yych;
        {}
}



{
        YYCTYPE yych;
        {}
}

Starting from 0.13.7 this consistency was broken (the faulty commit is, unsurprisingly, the big one that added UTF-8 support). An empty positive range [] and an empty difference (e.g. [a-z]\[a-z]) still match the empty string, but an empty negative range (e.g. [^\x00-\xFF]) matches nothing (always fails):

$ re2c -i --no-generation-date 1.re 
/* Generated by re2c 0.14.1.dev */

{
        YYCTYPE yych;

        {}
}



{
        YYCTYPE yych;
}



{
        YYCTYPE yych;
        {}
}

Whether we choose to match the empty string or to match nothing on empty ranges, the behaviour must be consistent (it must apply to every way of constructing a range).
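
To illustrate the consistency requirement, here is a minimal C++ sketch, assuming a simplified model in which a character class is a set of 8-bit code units. The names `CharClass`, `make_class`, `negate`, `difference` and `is_empty` are hypothetical and are not taken from the re2c sources. The point is only that all three constructions from the test case end up in one shared emptiness check, so whatever behaviour is chosen can be applied in a single place.

```cpp
#include <bitset>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model (not re2c's internals): a class over 8-bit code units.
using CharClass = std::bitset<256>;

// Positive class such as [a-z0-9]; an empty range list models [].
CharClass make_class(const std::vector<std::pair<unsigned, unsigned>>& ranges) {
    CharClass cls;
    for (const auto& r : ranges)
        for (unsigned c = r.first; c <= r.second && c < 256; ++c)
            cls.set(c);
    return cls;
}

// Negated class such as [^\x00-\xFF].
CharClass negate(const CharClass& cls) { return ~cls; }

// Class difference such as [a-z]\[a-z].
CharClass difference(const CharClass& a, const CharClass& b) { return a & ~b; }

// The one emptiness test that every construction should go through.
bool is_empty(const CharClass& cls) { return cls.none(); }
```

With this model, `make_class({})`, the negation of the full range, and the difference of a class with itself are all empty, so a single policy keyed on `is_empty` would cover all three cases.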

Original comment by: skvadrik

@skvadrik

Owner Author

commented Jun 16, 2015

I vote for making it an error (or at least invalid according to the re2c grammar). As far as my imagination goes, it's a pointless thing to do.

Looking at the POSIX standard for regexes, empty bracket expressions are not allowed: "A bracket expression (an expression enclosed in square brackets, "[]" ) is an RE that shall match a single collating element contained in the non-empty set of collating elements represented by the bracket expression." http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html

Original comment by: nuffer

@skvadrik

Owner Author

commented Jun 16, 2015

I agree entirely, but you know, re2c users are quite conservative ;)

I think I'll add an option "--empty-class <match-empty|match-none|error>" and default to "match-empty" (match empty input, as in <=0.13.6) in 0.15, then switch the default to "error" in 0.16 (if nobody objects to it).

Anyway, a warning will be useful.
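
A rough C++ sketch of how the three option values could map onto behaviour. This is an assumption made for illustration, not the actual re2c implementation, and every identifier in it (`EmptyClassPolicy`, `EmptyClassOutcome`, `parse_empty_class`, `apply_policy`) is hypothetical. Only the option name and its three values come from the proposal above.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; not taken from the re2c sources.
enum class EmptyClassPolicy { MatchEmpty, MatchNone, Error };

// What happens to a rule whose character class turns out to be empty.
enum class EmptyClassOutcome { MatchesEmptyString, NeverMatches, CompileError };

// Parse the argument of --empty-class; returns false for an unknown value
// and leaves 'policy' untouched in that case.
bool parse_empty_class(const std::string& arg, EmptyClassPolicy& policy) {
    if (arg == "match-empty") { policy = EmptyClassPolicy::MatchEmpty; return true; }
    if (arg == "match-none")  { policy = EmptyClassPolicy::MatchNone;  return true; }
    if (arg == "error")       { policy = EmptyClassPolicy::Error;      return true; }
    return false;
}

// Map the selected policy to the behaviour for an empty class.
EmptyClassOutcome apply_policy(EmptyClassPolicy policy) {
    switch (policy) {
        case EmptyClassPolicy::MatchEmpty: return EmptyClassOutcome::MatchesEmptyString;
        case EmptyClassPolicy::MatchNone:  return EmptyClassOutcome::NeverMatches;
        case EmptyClassPolicy::Error:      return EmptyClassOutcome::CompileError;
    }
    return EmptyClassOutcome::CompileError; // not reached; silences compiler warnings
}
```

Keeping the policy as one enum makes it cheap to change the default later, as proposed above, without touching the code paths that construct ranges.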

Original comment by: skvadrik

@skvadrik

Owner Author

commented Jun 16, 2015

This commit adds the "--empty-class" option.

Dan, does it look good to you?

Original comment by: skvadrik

@skvadrik

Owner Author

commented Jun 16, 2015

+1 Looks good.

Original comment by: nuffer

@skvadrik

Owner Author

commented Jun 17, 2015

  • status: accepted --> closed-fixed

Original comment by: skvadrik

@skvadrik skvadrik self-assigned this Jul 23, 2015

@skvadrik skvadrik closed this Jul 23, 2015

skvadrik added a commit that referenced this issue Nov 21, 2015

Fixed bug #59 "bogus 'yyaccept' in '-c' mode".
We have one 'yyaccept' initialization per re2c block. Each block
consists of one or more DFA (multiple DFA in '-c' mode in case of
multiple conditions). Each DFA may or may not use 'yyaccept'
(that is, save 'yyaccept' in some states and have a dispatch state
based on saved 'yyaccept' value).

Description of the bug: in '-c' mode, sometimes a DFA would have
states that save 'yyaccept', but no dispatch state that uses the
saved value. Such a DFA didn't actually need 'yyaccept' (all the
assignments vanished if other conditions that need 'yyaccept' were
removed).

The essence of the bug: re2c decided whether to output 'yyaccept'
related stuff on a per-block basis: for multiple conditions in the
same block, the same decision was made (if any condition needed
'yyaccept', all of them would output it).

The fix: 'yyaccept' initialization should be done on a per-block
basis, while assignments to 'yyaccept' should be done on a per-DFA
basis. Also, 'yyaccept' initialization must be delayed, while
assignments to 'yyaccept' must not.

Note: we may consider per-DFA 'yyaccept' initialization (have a
local 'yyaccept' variable per DFA). This wouldn't conflict with '-f'
switch (as it might seem) as long as we name all the variables
'yyaccept' and don't generate any 'yyaccept' initializations with '-f'.
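
The split described in the commit message can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the code from the commit: `DfaInfo`, `emit_assignments` and `emit_initialization` are hypothetical names, and each DFA is reduced to two flags.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one DFA (one condition in '-c' mode).
struct DfaInfo {
    bool has_save_states;     // some states would assign 'yyaccept'
    bool has_dispatch_state;  // some state switches on the saved 'yyaccept'
};

// Per-DFA decision: emit assignments only if this DFA itself
// dispatches on the saved value.
bool emit_assignments(const DfaInfo& dfa) {
    return dfa.has_save_states && dfa.has_dispatch_state;
}

// Per-block decision: one initialization if any DFA in the block emits
// assignments. It needs the whole block, which is one way to read the
// remark that initialization must be delayed.
bool emit_initialization(const std::vector<DfaInfo>& block) {
    for (const DfaInfo& dfa : block)
        if (emit_assignments(dfa)) return true;
    return false;
}
```

In this model, a DFA that saves 'yyaccept' but never dispatches on it produces no assignments, regardless of what the other conditions in the same block need.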