Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support range difference with variable-length encodings. #236

Closed
skvadrik opened this issue Dec 25, 2018 · 4 comments

Comments

@skvadrik
Copy link
Owner

commented Dec 25, 2018

Range difference doesn't work in cases when the ranges include multi-code-unit characters, because re2c applies the difference operator \ after it transforms character ranges into graphs representing these ranges (for example, the whole Unicode range [^] looks like this: http://re2c.org/_images/utf8_any.png --- it's not a simple matter of OR-ing all sub-ranges).

Discussion started in #235.

To fix this, re2c would need to delay range expansion until it parses the whole regular expression.

@skvadrik skvadrik changed the title Support range difference with variable-lenght encodings. Support range difference with variable-length encodings. Dec 25, 2018

@terpstra

This comment has been minimized.

Copy link

commented Dec 28, 2018

Is there a reason you don't just support A \ B for regular expressions? I don't know the internals of re2c, but at least my own ancient regular expression compiler could do this. Set union, intersection, and difference of DFAs can all be implemented the same way: you take the cartesian product of their states and mark the combined state accepting or not based on the set operation you want. I don't know at what stage you convert NFA=>DFA, though, so maybe this is hard to do in re2c.

@skvadrik

This comment has been minimized.

Copy link
Owner Author

commented Dec 28, 2018

No particular reason, just never got to that. The main goal of re2c is performance of the generated code, not feature support (quex I believe supports more features, in particular negation and intersection). Set operations would be nice to have, and I think the feature is quite fundamental, but it will take some effort. From the theoretical point, re2c uses TNFA/TDFA as internal representation (T for "tagged"), and it's not immediately obvious (to me at least) if tags will be a problem. Also, should set operations be done before or after determinization? From the practical perspective, doing set operations via cartesian product might be too inefficient.

It's definitely worth the effort, but it's not trivial.

@terpstra

This comment has been minimized.

Copy link

commented Jan 2, 2019

If I recall correctly, the way I did this was to only create the product nodes as I explored the DAG from the start node. Most product nodes are unreachable in practice, even without running DFA minimization. I don't recall when I did DFA minimization, but I know I did it at multiple stages.

That said, it's the performance of the generated code that matters, and is why I'm using re2c. After DFA minimization it makes no difference if your set subtraction came from character sets or DFAs. The output DFA is the same. I don't think most people mind if re2c itself takes a few seconds to run.

@skvadrik

This comment has been minimized.

Copy link
Owner Author

commented Jan 4, 2019

@terpstra To start small, I fixed range difference with variable-length encodings (last relevant commit is c6f23fb). Now things like [*\xd7] \ [\xd7] in UTF-8 should work as expected.

I don't think most people mind if re2c itself takes a few seconds to run.

Generation speed is unlikely to be the bottleneck for most users, but still it's better to keep an eye on it (e.g in case we want to use re2c as a library, which may be of interest to compare re2c with other libraries). Surely this doesn't interfere with experimental features, as they can be made optional (at re2c compile-time or run-time).

@skvadrik skvadrik closed this Mar 19, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.