Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement (at least part of) UTS#18 RL1.3 - Operators in character sets #341

Closed
trishume opened this issue Feb 16, 2017 · 8 comments
Closed

Comments

@trishume
Copy link

I'm working on a syntax highlighting engine in Rust that requires an Oniguruma-compatible regex engine. I'm trying to port it from the onig crate to fancy-regex, but there's some features it doesn't support yet (see trishume/syntect#34).

One of these features is the && operator and nesting in character sets, for example [a-w&&[^c-g]z]. I was thinking this would be added to fancy-regex but @robinst pointed out this comment which suggests that you plan for them to be in the regex crate.

It would be nice if the regex crate supported UTS#18 RL1.3 in full, but the && operator and nesting are all that Oniguruma-compatibility of fancy-regex requires.

I imagine this would take some changes to regex-syntax and then a pass to convert the fancy character sets down to basic character sets. I haven't thought enough about it to know if there are any unicode-related issues that might make this more complex, perhaps by making a tiny fancy character set compile to an enormous basic character set.

@BurntSushi do you have any insight on how difficult you think this would be to add for a contributor not familiar with the internals of regex?

@BurntSushi
Copy link
Member

BurntSushi commented Feb 16, 2017

do you have any insight on how difficult you think this would be to add for a contributor not familiar with the internals of regex?

My belief is that the change would be limited to regex-syntax and shouldn't bleed into anything else. If that's true, then that makes this task "medium" instead of "hard" because it reduces the amount of code you need to be familiar with. In fact, I don't think it even requires changing the existing AST.

I'd still consider this to be a tricky task. I think the parser itself seems a bit complex (the char class parser is already complex, so spending some time trying to make that nicer is probably a worthwhile investment). I also think there could be some issue with computing the various set operations because Unicode makes the sets quite large. For example, [^c-g] is every Unicode codepoint other than c-g.

In general, parsing takes an insignificant time in regex construction compared to the time it takes to compile the regex. I would hope that even with this change, it would stay that way. Maybe set operations on Unicode sets aren't as bad as I think, but if they wind up slowing down parsing a lot, then we might need to be more clever. And that makes this task "medium" instead of "easyish."

At a higher level, I do have plans to rewrite the parser from scratch (again), but I have no idea when that will happen and you shouldn't plan around it. My original intention was to add UTS#18 RL1.3 when I did that. But doing it before then is fine by me.

@BurntSushi
Copy link
Member

Also, I would expect test coverage to remain excellent. Every single branch should be covered by a test. (I believe this is true today.)

@trishume
Copy link
Author

@BurntSushi awesome, thanks for the insight.

@BurntSushi
Copy link
Member

Oh and I'd be happy to answer questions and mentor anyone who works on this. I'm on the usual IRC channels, typically 4-8pm EST and sporadically on the weekends. Email also works well, which I'm on around the clock.

@robinst
Copy link
Contributor

robinst commented Feb 17, 2017

I started looking into this after I made the comment on the syntect issue :). I too think it can be entirely implemented in regex-syntax. Looks like the work is roughly:

  1. Make parse_class handle nested classes
  2. Make it handle &&
  3. Add a intersect method to CharClass

We should probably focus on && as a first step, I'm not sure the other operations are in wide use (e.g. Oniguruma only has &&).

The first change I have is extracting a parse_class_escape method to make parse_class a bit smaller. Would you prefer individual pull requests for steps along the way, or a bigger one with multiple commits at the end?

@BurntSushi
Copy link
Member

BurntSushi commented Feb 17, 2017 via email

@robinst
Copy link
Contributor

robinst commented Feb 21, 2017

Quick status update: I have support for nested classes and intersection using &&, currently writing all the tests to make sure all things are covered.

robinst added a commit to robinst/regex that referenced this issue Feb 22, 2017
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
@robinst
Copy link
Contributor

robinst commented Feb 22, 2017

Pull request for nested classes and intersection: #346

robinst added a commit to robinst/regex that referenced this issue Feb 25, 2017
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
robinst added a commit to robinst/regex that referenced this issue Mar 13, 2017
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
robinst added a commit to robinst/regex that referenced this issue May 5, 2017
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
robinst added a commit to robinst/regex that referenced this issue May 21, 2017
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
robinst added a commit to robinst/regex that referenced this issue May 21, 2017
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
bors added a commit that referenced this issue May 21, 2017
…ction, r=BurntSushi

Support nested character classes and intersection with `&&`

This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes #341
@bors bors closed this as completed in #346 May 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants