Skip to content

trailing contexts are fundamentally broken #121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
skvadrik opened this issue Oct 1, 2015 · 3 comments
Closed

trailing contexts are fundamentally broken #121

skvadrik opened this issue Oct 1, 2015 · 3 comments
Assignees

Comments

@skvadrik
Copy link
Owner

skvadrik commented Oct 1, 2015

When there are multiple overlapping trailing contexts, re2c-generated code may set incorrect YYCTXMARKER value. The following example fails on input string that matches pattern "abb" [^c]: rule 1 matches and should consume single character a, but it consumes ab because YYCTXMARKER gets overwritten when trying to match rule 0.

Source code (1.re):

/*!re2c
    "abb" / "c" { 0 }
    "a"   / "b" { 1 }
*/

Generated code:

/* Generated by re2c 0.13.6 on Thu Oct  1 21:57:46 2015 */

{
    YYCTYPE yych;

    if ((YYLIMIT - YYCURSOR) < 4) YYFILL(4);
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':    goto yy3;
    default:    goto yy2;
    }
yy2:
    YYCURSOR = YYMARKER;
    goto yy5;
yy3:
    YYCTXMARKER = YYCURSOR + 1;
    yych = *++YYCURSOR;
    switch (yych) {
    case 'b':    goto yy4;
    default:    goto yy2;
    }
yy4:
    yych = *(YYMARKER = ++YYCURSOR);
    switch (yych) {
    case 'b':    goto yy6;
    default:    goto yy5;
    }
yy5:
    YYCURSOR = YYCTXMARKER;
    { 1 }
yy6:
    YYCTXMARKER = YYCURSOR + 1;
    yych = *++YYCURSOR;
    switch (yych) {
    case 'c':    goto yy7;
    default:    goto yy2;
    }
yy7:
    ++YYCURSOR;
    YYCURSOR = YYCTXMARKER;
    { 0 }
}
@skvadrik skvadrik self-assigned this Oct 1, 2015
@skvadrik
Copy link
Owner Author

skvadrik commented Oct 1, 2015

This is because re2c tries to implement variable-length trailing contexts using single pointer (YYCTXMARKER). Initially re2c supported only fixed-length trailing contexts, but this was changed by commit 9c2ee72734b2965a16d88d93a9fe14b7dd2dc6e3: "- Applied #1411087 variable length trailing context.".

Of all re2c tests, only very few are affected (re2c own lexer and one of example lexers).

The problem can be fixed by adding multiple pointers (one pointer per rule) and setting YYCTXMARKER to the right value in the end. Should be simple.

@skvadrik
Copy link
Owner Author

skvadrik commented Oct 2, 2015

This problem is related to the general problem of extracting submatch in regular expressions, which requires either NFA or TDFA (DFA with tagged transitions).

The problem is exposed by the following example:

/*!re2c
    "a"+ / "a" {}
*/

For which re2c generates this code:

/* Generated by re2c 0.13.6 on Fri Oct  2 09:34:09 2015 */

{
        YYCTYPE yych;

        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        YYCTXMARKER = YYCURSOR + 1;
        yych = *++YYCURSOR;
        switch (yych) {
        case 'a':       goto yy4;
        default:        goto yy2;
        }
yy4:
        YYCTXMARKER = YYCURSOR + 1;
        ++YYCURSOR;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy4;
        default:        goto yy6;
        }
yy6:
        YYCURSOR = YYCTXMARKER;
        {}
}

Clearly, this code sets YYCTXMARKER incorrectly.

I can think of two ways to solve this problem: either implement TDFA or report error in cases where greedy match is insufficient to correctly extract submatch.

@skvadrik skvadrik added this to the 0.17 milestone May 10, 2016
@skvadrik
Copy link
Owner Author

skvadrik commented May 10, 2016

Fixed in devel: main fix is by commit 7db4bab (added multiple context markers for overlapping trailing contexts), another important change is by commit 5958a23 (bind tags to DFA transitions, not states).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant