Difference operator sometimes doesn't work with utf-8 #186

Closed
binji opened this issue Jun 9, 2017 · 5 comments

@binji
commented Jun 9, 2017

Here's a simple example:

$ cat test.c
/*!re2c
   test = [\x20-\x7e] \ [()"; ];
*/
$ .build/re2c test.c -8
re2c: error: line 2, column 32: can only difference char sets

Interestingly, it works if I remove characters from the second set:

   test = [\x20-\x7e] \ [()];
@skvadrik

Owner

commented Jun 10, 2017

Fixed in devel branch (at least this very example).

In general, re2c can only take the difference of code unit classes. In UTF-8, a code point may occupy multiple code units, so some character ranges have a non-trivial structure (a directed acyclic graph) rather than being a simple set. So, for example, [^] \ [a] won't work in devel either. For a thorough explanation, see the section "Step 3: Compile" of this article: https://swtch.com/~rsc/regexp/regexp3.html, or just compile /*!re2c [^] {} */ with re2c -8s and examine the structure of the range.

However, in your example the character class [\x20-\x7e] is a simple range: all of its code points are 1 byte long, so the range can be represented as a set of code units. Therefore the difference should work.
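
To illustrate why the single-byte case is easy, here is a minimal sketch of set difference over 8-bit code units. It is not re2c's implementation; the `ByteSet` type and the `byteset_*` helpers are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper type: one membership flag per 8-bit code unit. */
typedef struct { bool has[256]; } ByteSet;

/* Add every code unit in [lo, hi] to the set. */
void byteset_add_range(ByteSet *s, unsigned lo, unsigned hi) {
    assert(lo <= hi && hi < 256);
    for (unsigned c = lo; c <= hi; ++c) s->has[c] = true;
}

/* Add every byte of a NUL-terminated string to the set. */
void byteset_add_chars(ByteSet *s, const char *chars) {
    for (; *chars; ++chars) s->has[(unsigned char)*chars] = true;
}

/* Compute a \ b: members of a that are not members of b. */
ByteSet byteset_difference(const ByteSet *a, const ByteSet *b) {
    ByteSet r;
    for (int c = 0; c < 256; ++c) r.has[c] = a->has[c] && !b->has[c];
    return r;
}

/* Count the members of the set. */
int byteset_count(const ByteSet *s) {
    int n = 0;
    for (int c = 0; c < 256; ++c) n += s->has[c] ? 1 : 0;
    return n;
}
```

Building [\x20-\x7e] with `byteset_add_range`, building [()"; ] with `byteset_add_chars`, and passing both to `byteset_difference` leaves 90 of the 95 printable ASCII code units. Nothing this simple works once a class contains code points that need two or more code units, because the class is then no longer a flat set of bytes.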

A proper difference operator for regular expressions is more difficult to implement; maybe some day we will get to it, but this is not a one-off fix.

@skvadrik

Owner

commented Jun 10, 2017

There is an even better way to look at the structure of [^]: http://re2c.org/manual/features/dot/dot.html

@binji

Author

commented Jun 10, 2017

Thanks for the fix! It makes sense that, in general, this wouldn't work for utf-8, but I found this case surprising for the reasons you've described.

You mention code unit classes; is there a way to specify these? For example if you did want to match specific invalid utf-8 and not just catch-all with *?

@skvadrik

Owner

commented Jun 10, 2017

I don't think so; re2c assumes that all character literals refer to code points. For example, if you write /*!re2c "\x7f" {} */ and compile it with re2c -8i, you'll get the following:

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 0x7F:      goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
}

If you take the next code point /*!re2c "\x80" {} */, you'll get:

{
        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case 0xC2:      goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x80:      goto yy4;
        default:        goto yy2;
        }
yy4:
        ++YYCURSOR;
        {}
}

There's no way to write \x80 as a 1-byte code unit.
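
The jump from one code unit to two between the listings above follows directly from the UTF-8 encoding rules. Here is a minimal sketch of a standard UTF-8 encoder; the function name `utf8_encode` is made up for this example and is not part of re2c.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 code units).
 * Returns the number of code units written, or 0 if the value is not
 * encodable (a surrogate or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {                      /* U+0000..U+007F: 1 code unit */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* U+0080..U+07FF: 2 code units */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* U+0800..U+FFFF: 3 code units */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* U+10000..U+10FFFF: 4 code units */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Code point 0x7F encodes to the single code unit 0x7F, while 0x80 encodes to 0xC2 0x80, which are exactly the case labels in the two generated listings.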

Why do you need this?
As a workaround, it is possible to match the beginning of any invalid code unit sequence with default rule * and do manual post-processing.
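
One possible shape for that post-processing, as a rough sketch only: it assumes the default rule has already consumed the first offending code unit, and it then skips any continuation bytes (0x80..0xBF) that follow, so that lexing resumes at the next plausible start of a character. The function name `skip_invalid_tail` and this particular resynchronization policy are assumptions for the example, not something re2c provides.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, i.e. bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* cursor points just past the code unit matched by the default rule;
 * limit points one past the end of the input.
 * Returns the position where lexing should resume. */
const unsigned char *skip_invalid_tail(const unsigned char *cursor,
                                       const unsigned char *limit) {
    assert(cursor <= limit);
    while (cursor < limit && is_continuation(*cursor)) {
        ++cursor;
    }
    return cursor;
}
```

In the semantic action of the default rule one could then write something like `YYCURSOR = skip_invalid_tail(YYCURSOR, YYLIMIT);`, given that both are defined as `const unsigned char *`, before reporting the error.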

@binji

Author

commented Jun 10, 2017

> Why do you need this?

I don't think I need this, actually. Ultimately, this all came about because I was trying to match this ocaml lexer rule with re2c. It appears to handle utf-8 manually, and this rule captures valid and invalid utf-8 as reserved. It throws an error either way, so I just limited this to ASCII to get the same effect, which led me to the original bug. If I really wanted to match it, I guess it would be best to just handle utf-8 manually like the ocaml lexer does, but I found the clarity of using -8 nicer.

Anyway, thanks again! I probably won't be able to use the fix just yet since I'd rather not force everyone to re2c devel (some folks are still on 0.13, so forcing them to 0.16 was not pleasant either!) but I figured I'd report it.

@binji closed this Jun 10, 2017
