Handle non-ASCII encoded characters in regular expressions. #237

Closed
terpstra opened this issue Dec 25, 2018 · 5 comments

Comments

@terpstra
commented Dec 25, 2018

For example,
[*\xD7] { return op_type(10, 1); } // multiply
is not the same as
[*×] { return op_type(10, 1); }

The latter is clearly more reader-friendly.

@terpstra terpstra changed the title Unicode characters in classes does not work as expected Unicode characters in classes do not work as expected Dec 25, 2018

@skvadrik

Owner

commented Dec 25, 2018

Hmm... at first I confused Unicode × with ASCII x.

As of now, re2c expects the input file to be in ASCII. It will parse and compile non-ASCII input as well (e.g. UTF-8), but it will treat it as a stream of bytes, so [×] will be processed as a 2-character range [\xC3\x97]. If the same character occurred in a part of the code not transformed by re2c, it would be pasted verbatim into the output file, so that the next tool in the pipeline (presumably a C/C++ compiler) would see it undamaged.
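To make the byte-level view concrete, here is a small Python sketch (not part of re2c; the variable names are only illustrative) that shows how the multiplication sign is stored in a UTF-8 file versus a Latin-1 file:

```python
import unicodedata

# The character from the issue, written as an escape so that the snippet
# does not depend on the encoding of the file it is saved in.
ch = "\u00d7"

name = unicodedata.name(ch)          # official Unicode character name
code_point = ord(ch)                 # the single code point, 0xD7

utf8_bytes = ch.encode("utf-8")      # what a UTF-8 encoded input file contains
latin1_bytes = ch.encode("latin-1")  # what a Latin-1 encoded input file contains

print(name, hex(code_point))
print("UTF-8:  ", " ".join(f"{b:02x}" for b in utf8_bytes))
print("Latin-1:", " ".join(f"{b:02x}" for b in latin1_bytes))
```

A byte-oriented parser reading the UTF-8 file sees two bytes where the author typed one character, which is why [×] and [\xD7] end up as different classes.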

The problem with handling non-ASCII input is not in decoding it (re2c supports that already), but in deciding the following questions: (1) how to guess the input encoding, and (2) which parts of the syntax should allow non-ASCII characters (it probably doesn't make sense to allow Unicode spaces between rules, for example).

@skvadrik skvadrik changed the title Unicode characters in classes do not work as expected Handle non-ASCII encoded characters in regular expressions. Dec 25, 2018

@terpstra

Author

commented Dec 26, 2018

(1) Why do you need to guess the input encoding? Users would be happy to tell re2c on the command-line.
(2) I can't imagine wanting this for anything other than inside ""s or []s.

@terpstra

Author

commented Dec 26, 2018

If I understand correctly, when I put "×" as a string into a re2c input file right now, it is only working because both my re2c input file encoding is utf-8 and my generated parser is utf-8. If they did not match, re2c would do the wrong thing? i.e. strings are correct by coincidence, while []s are always broken (because the code point is treated as its elements).

@skvadrik

Owner

commented Dec 26, 2018

> (1) Why do you need to guess the input encoding? Users would be happy to tell re2c on the command-line.
> (2) I can't imagine wanting this for anything other than inside ""s or []s.

Ok, makes sense.

> If I understand correctly, when I put "×" as a string into a re2c input file right now, it is only working because both my re2c input file encoding is utf-8 and my generated parser is utf-8. If they did not match, re2c would do the wrong thing? i.e. strings are correct by coincidence, while []s are always broken (because the code point is treated as its elements).

No. Both strings and classes "don't work", meaning that they are not handled the way you expect. Let's see in detail what happens with character classes, and then with strings. First, re2c has to parse the input. The parser is encoding-insensitive: it handles everything as raw bytes (ASCII extended to the 8-bit range). When you write [*×], the parser sees this byte sequence (assuming that your input file is UTF-8 encoded, as on my system):

$ echo -n [*×] | hexdump -C
00000000  5b 2a c3 97 5d                                    |[*..]|

So, the parser sees the opening bracket 5b and understands that this is the beginning of a character class. In this mode it can recognize certain escape sequences starting with \. Everything else is treated as a standalone code point, so we get a range consisting of three code points: 2a, c3 and 97. Next comes the closing bracket 5d.

After parsing, re2c performs an encoding-sensitive transformation of code points into sequences of code units. This is where the -8 option first comes into play. Without it, re2c assumes 8-bit ASCII and transforms code points 2a, c3 and 97 into code units 2a, c3 and 97, producing this lexer:

$ echo '/*!re2c [*×] {} */' | ./re2c -i -
/* Generated by re2c 1.1.1 on Wed Dec 26 10:39:01 2018 */

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case '*':
        case 0x97:
        case 0xC3:      goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
}

Now that you have set -8, re2c treats each code point as a Unicode symbol, so 2a is U+002A (*, asterisk), c3 is U+00C3 (Ã, Latin capital letter A with tilde) and 97 is U+0097 (a control character). Next, the -8 option tells re2c to encode these symbols in UTF-8, so 2a becomes 2a, c3 becomes c3 83 and 97 becomes c2 97. The resulting lexer is:
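The two stages described above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not re2c's actual code: the function name class_code_units is made up, escape sequences are ignored, and Python's "latin-1" codec stands in for the default 8-bit mode.

```python
from typing import List

def class_code_units(class_body: bytes, utf8: bool) -> List[bytes]:
    """Hypothetical model of the pipeline described above (not re2c code)."""
    # Stage 1: encoding-insensitive parse, every input byte is one code point.
    code_points = list(class_body)
    # Stage 2: encode each code point into code units.
    # Without -8 each code point maps to the same single byte;
    # with -8 each code point is encoded as UTF-8.
    encoding = "utf-8" if utf8 else "latin-1"
    return [chr(cp).encode(encoding) for cp in code_points]

# Contents of the class [*×] as stored in a UTF-8 file: bytes 2a c3 97.
body = "*\u00d7".encode("utf-8")

without_8 = class_code_units(body, utf8=False)
with_8 = class_code_units(body, utf8=True)

print([u.hex() for u in without_8])
print([u.hex() for u in with_8])
```

The second list corresponds to the three alternatives in the generated switch: '*', the pair c3 83, and the pair c2 97.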

$ echo '/*!re2c [*×] {} */' | ./re2c -i8 -
/* Generated by re2c 1.1.1 on Wed Dec 26 10:38:25 2018 */

{
        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case '*':       goto yy3;
        case 0xC2:      goto yy5;
        case 0xC3:      goto yy6;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
yy5:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x97:      goto yy3;
        default:        goto yy2;
        }
yy6:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x83:      goto yy3;
        default:        goto yy2;
        }
}

This is totally different from what you meant with [*×], namely a range consisting of code points 2a and d7 (multiplication sign), corresponding to this lexer:

$ echo '/*!re2c [*\xd7] {} */' | ./re2c -i8 -
/* Generated by re2c 1.1.1 on Wed Dec 26 10:37:29 2018 */

{
        YYCTYPE yych;
        if ((YYLIMIT - YYCURSOR) < 2) YYFILL(2);
        yych = *YYCURSOR;
        switch (yych) {
        case '*':       goto yy3;
        case 0xC3:      goto yy5;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        {}
yy5:
        yych = *++YYCURSOR;
        switch (yych) {
        case 0x97:      goto yy3;
        default:        goto yy2;
        }
}

A similar thing happens with strings: if you write "×", it will be transformed to the code point sequence c3 97, which will be further UTF-8 encoded to the 4-byte string c3 83 c2 97, while you expected the code point sequence d7 and the 2-byte string c3 97.
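The string case can be reproduced outside re2c with a short Python sketch. It models the behaviour described in this comment by decoding the source bytes as Latin-1 (one code point per byte) and then encoding the result as UTF-8; the variable names are only illustrative.

```python
# "×" as stored in a UTF-8 encoded input file: bytes c3 97.
source = "\u00d7".encode("utf-8")

# Byte-wise parse: each byte becomes its own code point (U+00C3, U+0097).
misread = source.decode("latin-1")

# UTF-8 encoding of those two code points: the string that is actually matched.
actual = misread.encode("utf-8")

# What the author intended: the single code point U+00D7 encoded as UTF-8.
expected = "\u00d7".encode("utf-8")

print("code points:", [hex(ord(c)) for c in misread])
print("actual:     ", " ".join(f"{b:02x}" for b in actual))
print("expected:   ", " ".join(f"{b:02x}" for b in expected))
```

So the lexer generated with -8 matches a 4-byte sequence that never occurs in input containing a plain UTF-8 encoded ×.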

Neither classes nor strings are broken: they behave as they are supposed to, but I surely agree that the default behaviour is confusing.

@skvadrik skvadrik referenced this issue May 22, 2019
@skvadrik

Owner

commented May 24, 2019

@terpstra Now it is possible to use UTF-8 encoded strings in regular expressions (in string literals and character classes). The new behaviour is enabled with option --input-encoding utf8. See #250.
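Conceptually, the difference an input-encoding setting makes can be sketched in Python as decoding the source bytes before extracting code points. This is an assumption-laden model, not re2c's implementation: the helper names are made up, and "latin-1" and "utf-8" are Python codec names, not re2c option values.

```python
from typing import List

def class_code_points(class_body: bytes, input_encoding: str) -> List[int]:
    """Hypothetical helper: decode the class body, then list its code points."""
    return [ord(c) for c in class_body.decode(input_encoding)]

def to_utf8_units(code_points: List[int]) -> List[bytes]:
    """Encode each code point as UTF-8, as the -8 option is described to do."""
    return [chr(cp).encode("utf-8") for cp in code_points]

# Contents of the class [*×] as stored in a UTF-8 file.
body = "*\u00d7".encode("utf-8")

bytewise = class_code_points(body, "latin-1")  # old behaviour: one code point per byte
decoded = class_code_points(body, "utf-8")     # input decoded as UTF-8 first

print([hex(cp) for cp in bytewise], [u.hex() for u in to_utf8_units(bytewise)])
print([hex(cp) for cp in decoded], [u.hex() for u in to_utf8_units(decoded)])
```

With the input decoded as UTF-8, the class contains code points 2a and d7, the same as writing [*\xd7] by hand.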

@skvadrik skvadrik closed this Jun 10, 2019
