UTF8 encoding #250

Closed
dtp555-1212 opened this issue May 22, 2019 · 9 comments

@dtp555-1212

commented May 22, 2019

It appears there is a bug in the UTF8 encoding (at least for some characters)...

utf8bug.zip

In the attached file there is a 2-byte UTF-8 character which should be encoded as C3 A9 (if you copy/paste the character into a file by itself and then run od -t x1, you will see that it is indeed C3 A9). The C3 in the generated parser is correct, but the second byte the parser matches is 83. I am using -8 on the command line. (If there is something I am doing wrong, or if there is a workaround, please let me know.)
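For readers without od at hand, the same check can be sketched in Python. This is only an illustration of the expected encoding of é (U+00E9); the variable names are made up for the example.

```python
# Hex dump of the UTF-8 encoding of U+00E9 (é), similar to running
# `od -t x1` on a file that contains only that character.
char = "\u00e9"
encoded = char.encode("utf-8")
hex_dump = " ".join(f"{byte:02X}" for byte in encoded)
print(hex_dump)  # C3 A9
```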

@skvadrik

Owner

commented May 22, 2019

Eh, it's a duplicate of #237. The problem is that the re2c -8 option does not give you source-level Unicode support: if you write characters like é in regexp definitions, re2c interprets them as a plain byte sequence (each byte as a single character), not as one Unicode symbol. You have to use "\u00e9" instead.

I realize this is very ugly, difficult to use, confusing and needs fixing.

What exactly happens in case of é and how re2c ends up with C3 83 byte sequence is explained in great detail in #237 (let me know if you need more clarifications).
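The effect described above can be reproduced outside re2c. The following Python sketch is not re2c code; it only mimics the described behaviour under the assumption that each byte of the UTF-8 source literal is taken as a separate code point and then encoded to UTF-8 again. The function names are hypothetical.

```python
def misread_as_code_points(literal: str) -> bytes:
    """Mimic the described behaviour: treat every byte of the UTF-8
    source literal as its own code point, then encode to UTF-8."""
    source_bytes = literal.encode("utf-8")            # bytes as stored in the .re file
    as_code_points = "".join(chr(b) for b in source_bytes)
    return as_code_points.encode("utf-8")             # what the lexer ends up matching


def to_hex(data: bytes) -> str:
    return " ".join(f"{byte:02X}" for byte in data)


print(to_hex("\u00e9".encode("utf-8")))               # C3 A9 (intended)
print(to_hex(misread_as_code_points("\u00e9")))       # C3 83 C2 A9 (byte-wise reading)
```

The first two bytes of the second line, C3 83, are the encoding of U+00C3, which is why the generated lexer appears to get the first byte right and the second one wrong.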

@dtp555-1212

Author

commented May 23, 2019

@skvadrik

Owner

commented May 23, 2019

Do I understand you correctly that I cannot provide the escaped hex byte sequence, and that I must use a Unicode equivalent?

Yes, it won't work. If you try the regular expression \xC3\xA9 in -8 mode, re2c will interpret it as "code point C3 followed by code point A9", both of which translate into 2-byte code unit sequences in UTF-8. The same happens when instead of \xC3\xA9 you write é (only re2c doesn't have to unescape bytes).

Will this work for the 3 & 4 byte unicode values as well?

Escaped sequences will work for all Unicode code points (re2c supports escapes with 2, 4 and 8 hexadecimal digits: \xhh, \uhhhh and \Uhhhhhhhh).

It sounds like I will have to preprocess the input strings to substitute the appropriate Unicode encoding prior to processing with re2c. Do you have a suggested tool for that?

No, unfortunately I don't. In a similar issue #235 we ended up with a pre-defined set of Unicode categories, but it's not good enough for your case.
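As a stopgap, such a preprocessing step could look roughly like the Python sketch below. It is not an existing tool; escape_non_ascii is a hypothetical helper that replaces every non-ASCII character with the \uhhhh or \Uhhhhhhhh escape mentioned above and leaves ASCII untouched.

```python
def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a \\uhhhh or \\Uhhhhhhhh escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII is kept as-is
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")     # 4 hex digits for the BMP
        else:
            out.append(f"\\U{cp:08x}")     # 8 hex digits beyond the BMP
    return "".join(out)


print(escape_non_ascii('"caf\u00e9"'))     # "caf\u00e9"
```

Note that this escapes everything it is given, so it should be applied only to the regexp literals, not to the surrounding C code or comments, where such escapes may mean something different.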

P.S. as you have acknowledged that this needs addressing, do you have a timeframe that it might be implemented?

I might be able to fix this in a few days. I have a sketch of the fix already, but it requires some prerequisite work in order to make it more elegant. It's a matter of using -8 in re2c's own lexer (which is written in re2c) and switching between two different lexers (ASCII and UTF-8). The new behavior will be guarded by an option, something like --input-encoding <ascii | utf8>.

@skvadrik

Owner

commented May 24, 2019

Pushed a fix: 29a6d01.

Now it is possible to use UTF-8 encoded strings in regular expressions (in string literals and character classes). The new behaviour is enabled with the option --input-encoding utf8. By default re2c assumes --input-encoding ascii; in the future it may be possible to flip the default behaviour (if it keeps confusing people).

It was necessary to use a new option instead of reusing -8, because one may wish to generate multiple lexers with different output encoding from the same set of UTF-8 encoded rules. That is, one may need to combine --input-encoding utf8 with one of the options -u, -x, -w, etc., and not necessarily -8.
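To illustrate why input and output encoding are independent, the Python sketch below prints the code units that one and the same source character, é (U+00E9), turns into under different output encodings. The mapping of -x to UTF-16 and -u to UTF-32 is my reading of the re2c options and should be checked against the manual; big-endian byte order is used only to make the dump readable.

```python
# One source character, several possible output encodings.
char = "\u00e9"  # é, written as UTF-8 in the rules file

outputs = {
    "-8 (UTF-8)": char.encode("utf-8"),
    "-x (UTF-16, assumed)": char.encode("utf-16-be"),
    "-u (UTF-32, assumed)": char.encode("utf-32-be"),
}

for option, data in outputs.items():
    print(option, " ".join(f"{byte:02X}" for byte in data))
```

The rules stay the same in all three cases; only the generated lexer differs.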

I deliberately chose a broad name for the new option (as opposed to a more precise --utf-8-literals or some such) so that it can be extended in the future, for example to support UTF-8 encoded variable names (I do not see any good in that so far though).

@skvadrik

Owner

commented May 24, 2019

@dtp555-1212 If you can, please send me your real-world test. If it's closed-source, I only need the grammar rules (though a working self-contained example is always great).

@dtp555-1212

Author

commented May 24, 2019

@skvadrik

Owner

commented May 24, 2019

Thanks! I added a test (it returns 0 for all the names on the list): https://github.com/skvadrik/re2c/blob/a00dc4871106ea39ef84f47bb840a018b17cea25/test/encodings/utf8_names.i8--input-encoding(utf8).re

There is an error in the name Ibargüen: it has a strange C2 byte right before the C3 BC that represents ü. It doesn't look like valid UTF-8 to me. After deleting C2 from both places everything works fine.
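A stray byte like this can be located before feeding the rules to re2c. The Python sketch below uses the strict UTF-8 decoder to report the offset of the first invalid byte; first_invalid_offset is a hypothetical helper and the byte string is reconstructed from the description above, not copied from the original test file.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


broken = b"Ibarg\xc2\xc3\xbcen"   # stray C2 before C3 BC (ü)
fixed = b"Ibarg\xc3\xbcen"

print(first_invalid_offset(broken))  # 5
print(first_invalid_offset(fixed))   # None
```

C2 is a lead byte of a 2-byte sequence and must be followed by a continuation byte in the range 80..BF; C3 is not one, so decoding fails at the C2.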

@terpstra

commented May 24, 2019

This is great! When can we expect the next re2c release? I can't wait to re2c:include a Unicode character classes library and define character classes with literal UTF8 strings in them!

@skvadrik

Owner

commented May 24, 2019

Soon, soon, really soon! I know I said this a couple of times before, such a shame... /o\ Realistically, not earlier than in 2 weeks, not later than the end of July. Thanks for asking, it gives me the inspiration to start writing changelog. :)

@skvadrik skvadrik closed this Jun 10, 2019
