Handle non-ASCII encoded characters in regular expressions. #237
Hmm... at first I confused Unicode
As of now, re2c expects the input file to be in ASCII. It will parse and compile non-ASCII input as well (e.g. UTF-8), but it will treat it as a stream of bytes, so
The problem with handling non-ASCII input is not in decoding it (re2c supports that already), but in deciding the following questions: (1) how to guess the input encoding, and (2) which parts of the syntax should allow non-ASCII characters (it probably doesn't make sense to allow Unicode spaces between rules, for example).
If I understand correctly, when I put "×" as a string into a re2c input file right now, it only works because both my re2c input file encoding and my generated parser are UTF-8. If they did not match, re2c would do the wrong thing? I.e., strings are correct by coincidence, while classes are always broken (because the code point is treated as its elements).
Ok, makes sense.
No. Both strings and classes "don't work", meaning that they are not handled the way you expect. Let's see in detail what happens with character classes, and then with strings. First, re2c has to parse the input. The parser is encoding-insensitive: it handles everything as raw bytes (ASCII extended to the 8-bit range). When you write
So, the parser sees opening bracket
After parsing, re2c performs an encoding-sensitive transformation of code points into sequences of code units. This is where
Now that you have set
This is totally different from what you meant with
A similar thing happens with strings: if you write
Neither classes nor strings are broken: they behave as they are supposed to, but I certainly agree that the default behaviour is confusing.
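To make the two stages described above concrete, here is a rough Python sketch of the model: an encoding-insensitive parse that sees raw bytes, followed by an encoding-sensitive step that turns each code point into UTF-8 code units. This is not re2c code. The function names are hypothetical, and the assumption that each raw byte becomes one code point in the 8-bit range is my reading of the explanation in this thread.

```python
# Illustrative sketch only; the function names are hypothetical, not re2c internals.

def parse_as_raw_bytes(source: str) -> list[int]:
    """Mimic an encoding-insensitive parser: a UTF-8 source is seen as bytes,
    and each byte is taken as one code point in the 8-bit range (assumption)."""
    return list(source.encode("utf-8"))


def encode_code_points(code_points: list[int]) -> list[bytes]:
    """Mimic the later encoding-sensitive step: each code point is turned
    into its own UTF-8 code unit sequence."""
    return [chr(cp).encode("utf-8") for cp in code_points]


seen = parse_as_raw_bytes("\u00d7")   # U+00D7, the multiplication sign
print([hex(b) for b in seen])         # the parser's view: two code points
print(encode_code_points(seen))       # code units produced under this model
print("\u00d7".encode("utf-8"))       # code units of the single intended code point
```

Under this model the single character × reaches the encoding step as two separate code points (0xC3 and 0x97), so the code units that come out differ from the UTF-8 encoding of U+00D7 itself.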