Difference operator sometimes doesn't work with utf-8 #186
In general, re2c can only take the difference of code unit classes. In UTF-8, code points may occupy multiple code units, so some character ranges have a non-trivial structure (a directed acyclic graph) rather than being a simple set. So, for example,
However, in your example character
A proper difference operator for regular expressions is more difficult to implement; maybe some day we will get to it, but this is not a one-off fix.
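To make the point about multi-unit code points concrete, here is a minimal Python sketch (not re2c's actual implementation; `utf8_units` and `split_by_length` are hypothetical helper names). It shows that a single code point range can cover encodings of different lengths, so it cannot be treated as one flat set of code units:

```python
def utf8_units(cp):
    """Return the UTF-8 code units (bytes) of one code point as a list of ints."""
    return list(chr(cp).encode("utf-8"))


def split_by_length(lo, hi):
    """Split the code point range [lo, hi] into sub-ranges whose UTF-8
    encodings all have the same number of code units.

    Returns a list of (start, end, number_of_units) tuples.
    """
    # Largest code point encodable in 1, 2, 3 and 4 UTF-8 code units.
    bounds = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]
    parts = []
    start = lo
    for index, bound in enumerate(bounds):
        if start > hi:
            break
        if start <= bound:
            end = min(hi, bound)
            parts.append((start, end, index + 1))
            start = end + 1
    return parts


if __name__ == "__main__":
    # 'A' is one code unit, U+00E9 is two, U+20AC is three.
    for cp in (0x41, 0xE9, 0x20AC):
        print(hex(cp), [hex(u) for u in utf8_units(cp)])
    # A range crossing the 1/2/3-unit boundaries falls apart into three pieces.
    print(split_by_length(0x70, 0x900))
```

Each piece would still need to be broken down further by leading byte to get actual byte-level ranges, which is where the graph-like structure mentioned above comes from.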
Thanks for the fix! It makes sense that, in general, this wouldn't work for UTF-8, but I found this case surprising for the reasons you've described.
You mention code unit classes; is there a way to specify these? For example, if you did want to match specific invalid UTF-8 and not just use a catch-all with
I think no; re2c assumes that all character literals refer to code points. For example, if you write
If you take the next code point
There's no way to write
Why do you need this?
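As a rough illustration of the code point versus code unit distinction discussed here, the Python sketch below (independent of re2c; `is_valid_utf8` is a hypothetical helper) contrasts the code point U+0080 with the single raw byte 0x80. The former encodes to two code units, while the latter on its own is not valid UTF-8:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The code point U+0080 is encoded as two code units: 0xC2 0x80.
as_code_point = chr(0x80).encode("utf-8")

# The single code unit 0x80 is a lone continuation byte.
as_code_unit = bytes([0x80])

if __name__ == "__main__":
    print(as_code_point.hex(), is_valid_utf8(as_code_point))
    print(as_code_unit.hex(), is_valid_utf8(as_code_unit))
```

A literal that is interpreted as a code point always expands to a complete, valid encoding, which is why it cannot by itself describe a specific ill-formed byte sequence.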
I don't think I need this, actually. Ultimately, this all came about because I was trying to match this ocaml lexer rule with re2c. It appears to handle utf-8 manually, and this rule captures valid and invalid utf-8 as reserved. It throws an error either way, so I just limited this to ascii for the same thing which led me to the original bug. If I really wanted to match it, I guess it would be best to just handle utf-8 manually like the ocaml lexer does, but I found the clarity of using
Anyway, thanks again! I probably won't be able to use the fix just yet since I'd rather not force everyone to re2c devel (some folks are still on 0.13, so forcing them to 0.16 was not pleasant either!) but I figured I'd report it.