[WIP] Kinda-working fancy-regex support #34

Open

wants to merge 10 commits

Conversation

6 participants
@trishume
Owner

trishume commented Feb 9, 2017

This branch switches the regex engine to fancy-regex, or more specifically my fork of it.

Currently it only works for a few syntaxes because of a few different features fancy-regex doesn't support:

  • The \n escape (Everything, but fixed in my fork)
  • Unnecessary escapes in character classes like [\<]
  • The \h escape in character classes (Rust)
  • Fix #76 so nonewlines mode doesn't produce weird regexes.
  • Named backrefs \k<marker> (Markdown)
  • Fancy character class syntax [a-w&&[^c-g]z]
  • Add support for match limit to fancy-regex: google/fancy-regex#44

The jQuery highlighting benchmark now takes 1s instead of 0.66s, which is super unfortunate given that I'd hoped it would be faster than Oniguruma. I have no idea why it is substantially slower.

@raphlinus @robinst

@raphlinus
Contributor

raphlinus commented Feb 9, 2017

I took a quick look at this, profiling the highlighting of jquery. It's promising but clearly not compelling yet. It seems to be spending most of its time delegating to regex, but in the VM. This suggests that it's doing backtracking, and might not even be using the NFA (it delegates just to get classes). I have a bunch of ideas on how to optimize more, but don't have insight into specifically what's slow now. The best case would be something like a(?=b), which currently throws a into fancy mode, but could be optimized to just (a)b with fixup of the captures.

The way to make progress here is to capture which regexes are consuming the most time. I'd add some profiling, something like a lazy_static hash table on the side, so that every time the VM runs it increments a count for that regex, and accumulates the time. Then just go down the list in terms of which regexes burn the most time.

I'd be tempted to investigate myself, but am currently trying to really focus on incremental update in xi. Thanks for pushing this forward!
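A rough sketch of that instrumentation idea (assuming the lazy_static crate and a Mutex<HashMap> keyed by the pattern string; the names here are illustrative, not code from this branch):

use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

lazy_static::lazy_static! {
    // pattern -> (number of runs, cumulative time)
    static ref REGEX_TIMINGS: Mutex<HashMap<String, (u64, Duration)>> =
        Mutex::new(HashMap::new());
}

// Wrap any regex search and record how long it took under its pattern string.
fn record_regex_time<T>(pattern: &str, search: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = search();
    let elapsed = start.elapsed();
    let mut table = REGEX_TIMINGS.lock().unwrap();
    let entry = table
        .entry(pattern.to_string())
        .or_insert((0, Duration::new(0, 0)));
    entry.0 += 1;
    entry.1 += elapsed;
    result
}

// Print the table sorted by cumulative time, worst offenders first.
fn dump_regex_times() {
    let table = REGEX_TIMINGS.lock().unwrap();
    let mut rows: Vec<(String, u64, Duration)> = table
        .iter()
        .map(|(p, &(count, total))| (p.clone(), count, total))
        .collect();
    rows.sort_by(|a, b| b.2.cmp(&a.2));
    for (pattern, count, total) in rows {
        println!("{:?} total over {} runs: {}", total, count, pattern);
    }
}

Every call site that runs a regex would go through record_regex_time, and dump_regex_times would be called once at the end of the benchmark.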


@trishume
Owner

trishume commented Feb 9, 2017

@raphlinus Yeah, that was my thought on what to investigate as well. Since it's single-threaded I can do even better than a count for each regex: I can actually measure the total elapsed time and count per regex to figure out which ones are slow. Then I can run it again with Oniguruma and see which regexes are faster with fancy-regex and which are slower.

Unfortunately, I'm back to being busy with school work and I'm not sure when/if I'll have time to do this. The perf regression combined with missing features means it's going to be a bunch of work. Not an undoable amount, but still substantial.


@TimNN

TimNN commented Feb 10, 2017

@trishume: I'm trying to collect some per-regex timings; however, trying to run the jquery highlighting benchmark fails because of the "Ability to use \z in a character class" problem. How did you get around that?


@TimNN

TimNN commented Feb 10, 2017

So I did some initial benchmarking of the jquery benchmark (measuring how long each regex matching took), the result are in this gist: https://gist.github.com/b6bb756f96b58e52b3299b709fa785dd

CUM is the cumulative time spent on a regex and AVG is the average time; the lists are sorted by average time (times are in seconds).

The code is available in the respective branches in the TimNN/syntect repo.

The worst offenders by far (based on both AVG and CUM) are the following:

CUM: PT7.310148847S AVG: PT0.000004021S REGEX: [_$[:alpha:]][_$[:alnum:]]*(?=\s*[\[.])
CUM: PT17.180939801S AVG: PT0.000004806S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)(?=\s*\()
CUM: PT21.358124652S AVG: PT0.000008019S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)\s*(\.)\s*(prototype)\s*(\.)\s*(?=[_$[:alpha:]][_$[:alnum:]]*\s*=\s*(\s*\b(async\s+)?function\b|\s*(\basync\s*)?([_$[:alpha:]][_$[:alnum:]]*|\(([^()]|\([^()]*\))*\))\s*=>))
CUM: PT22.522137730S AVG: PT0.000008456S REGEX: ([_$[:alpha:]][_$[:alnum:]]*)\s*(\.)\s*(prototype)(?=\s*=\s*(\s*\b(async\s+)?function\b|\s*(\basync\s*)?([_$[:alpha:]][_$[:alnum:]]*|\(([^()]|\([^()]*\))*\))\s*=>))

They seem to match the "best case" mentioned by @raphlinus, which I guess is a good thing?
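For illustration, the rewrite @raphlinus suggested for these trailing-lookahead patterns could look like this against the regex crate (a sketch of the principle only, not the fancy-regex implementation): a pattern like a(?=b) runs as (a)b, and the bounds of group 1 are reported as the overall match.

use regex::Regex;

// `a(?=b)` rewritten as `(a)b`; group 1's bounds stand in for the match bounds.
fn match_a_followed_by_b(haystack: &str) -> Option<(usize, usize)> {
    let re = Regex::new(r"(a)b").unwrap();
    re.captures(haystack)
        .and_then(|caps| caps.get(1))
        .map(|m| (m.start(), m.end()))
}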


@trishume
Owner

trishume commented Feb 10, 2017

@TimNN awesome thank you! That's definitely useful information since it does indeed match up with the case @raphlinus said could be optimized without too much difficulty. Thanks for the help.

And yes the jQuery benchmark breaks because of a substitution I perform for nonewlines mode. I fixed the benchmark to use line strings with newline characters, but didn't end up committing it, sorry.


@TimNN

TimNN commented Feb 11, 2017

So I've been hacking a bit on fancy-regex and managed to get the optimisation mentioned above working (at least I think so -- the code is very hacky and desperately needs cleaning up).

The results however look very promising so far: On my machine, highlighting jquery went from 1,228,132,968 ns/iter (+/- 63,891,621) down to 858,410,622 ns/iter (+/- 83,053,993), thus an improvement of about 30%.

The code is in the trailing-la-opt branch in my fancy-regex fork, if you want to give it a go. I'll try to clean it up a bit over the weekend and send a PR to get some feedback from @raphlinus.

Edit: It's probably going to take a bit longer until I find the time to clean up / send a PR.


@trishume
Owner

trishume commented Feb 11, 2017

@TimNN that's awesome! 858ms is still more time than it takes Oniguruma to highlight jQuery on my computer, but my computer also takes less than 1,228ms to highlight with fancy-regex, so it's possible that on my machine fancy-regex will be just as fast. I'll try to test your branch on my machine at some point.

I was hoping it would eventually lead to a significant performance increase, but merely matching Oniguruma's performance is enough for me to make it the default once the compatibility issues are fixed, since it will fix #33 and make all dependencies pure Rust.

I may be able to find the time to fix some of the smaller compatibility issues I listed. Specifically the first two unfinished ones listed (I sorted by estimated difficulty). Some of the issues look difficult though, specifically a full expression parser/rewriter for the character class operators.


@TimNN

TimNN commented Feb 11, 2017

I ran the jquery benchmark again with the oniguruma version and got 834,134,774 ns/iter (+/- 101,552,378), so the patch brings us at least to the same level as oniguruma on my machine for this benchmark.

(Note that my per regex benchmarking code is currently not very efficient since I had planned on collecting more stats than average & total time, so this may slow everything down a bit).

Also, using RUSTFLAGS=-Ctarget-cpu=native improved the runtime of the optimised fancy-regex version by about 50ms on my machine.


@trishume
Owner

trishume commented Feb 11, 2017

@TimNN Excellent. There's probably more optimization possible but that's great for now.

In theory it should be faster than Oniguruma, at least on syntaxes that have been optimized for Sublime's sregex engine to avoid most fancy regex features; those should let the Rust regex crate do everything, and that engine should be faster than Oniguruma.

I'm not sure even that would match Sublime Text's performance though. I think it uses something like https://doc.rust-lang.org/regex/regex/struct.RegexSet.html to match many regexes at once in time proportional only to the number of characters, but with support for extracting captures and match positions (unlike the regex crate).

At the moment fancy-regex supports substantially more regexes than Sublime's sregex, which I think only supports the things in the regex crate plus the optimization you just implemented for translating lookaheads and possibly lookbehinds. Otherwise it falls back to Oniguruma. But the things that sregex supports should end up almost entirely delegating to regex under fancy-regex, so the syntaxes optimized for it should be fast under fancy-regex.


@robinst
Collaborator

robinst commented Feb 16, 2017

Should we create issues in fancy-regex for the unsupported syntax? Seems like the better place to discuss these.

For fancy character classes ([a-w&&[^c-g]z]), it looks like that should be added to the regex crate. There's a comment in the source hinting that adding support for these is planned/welcome:

https://github.com/rust-lang/regex/blob/52fdae7169ec619530985a019184319ac4bbee5a/regex-syntax/src/lib.rs#L1408-L1410

(see UTS#18 RL1.3)

I think that would also help with implementing things such as \H in character classes (which is [^0-9A-Fa-f]).


@trishume
Owner

trishume commented Feb 16, 2017

@robinst Good point. I guess fancy-regex would be a better place.

I created an issue in regex (rust-lang/regex#341), haven't created any in fancy-regex yet though.


@BurntSushi

BurntSushi commented Feb 18, 2017

I would be interested in lending some insight here if you folks wind up seeing bottlenecks inside the regex crate. The regex crate is fast in a lot of cases, but that doesn't mean it's fast in every case, so don't assume that the regex crate will always bail you out. :-) I'll get the ball rolling by throwing some things against the wall and seeing what sticks.

If a regex is particularly large, then it's possible that the DFA will be forced to bail out because it's thrashing its cache. When the DFA bails, it falls back to a much slower (by an ~order of magnitude) regex engine. You can tweak how much cache space is available with the dfa_size_limit option. By default, it's 2MB. The surefire indicators of regexes that might thrash the cache are large counted repetitions (e.g., (foo){1000}) or lots of Unicode classes. A few Unicode classes here and there aren't going to hurt, but if you combine large classes with counted repetitions, e.g., \pL{100}, then you're in for some pain.

RegexSet may be a little tricky to use since it is very limited in what it can do. All it can really tell you is which regex matches, but it doesn't give you any position information, which means you need to re-run the regexes in the set that matched to get that position information. It is worth thinking about and possibly even trying a RegexSet, but I wouldn't get your hopes up. :-)

Finally, have you folks seen any problems with performance for compiling the regexes? Does it add any noticeable overhead?
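For reference, the cache budget mentioned above is exposed on RegexBuilder; a small sketch (the 10 MB figure is just an arbitrary example):

use regex::RegexBuilder;

// Raise the lazy DFA's cache budget from the default 2 MB so large patterns are
// less likely to fall back to the slower engines.
fn build_with_bigger_dfa_cache(pattern: &str) -> Result<regex::Regex, regex::Error> {
    RegexBuilder::new(pattern)
        .dfa_size_limit(10 * (1 << 20)) // 10 MB
        .build()
}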


@trishume
Owner

trishume commented Feb 18, 2017

@BurntSushi cool thank you. If we get around to optimizing it more than @TimNN already has, we could probably use your advice.

From the very basic benchmarks I did on this branch I didn't see anything noticeable from Regex compilation. It didn't seem to be very different from Oniguruma. I do make sure to compile each regex at most once and only compile them if they are actually needed.


@keith-hall
Collaborator

keith-hall commented Mar 1, 2017

Hi, I just wanted to make a quick note that I've submitted a PR to the ST Packages repo that removes the named backrefs compatibility issue from the Markdown syntax, so depending on whether any other syntaxes use named backreferences, you may get away without this support.


@raphlinus
Contributor

raphlinus commented Mar 1, 2017

Some comments.

First, I took a look at @TimNN 's optimization. It's definitely the optimization I had in mind, but is not quite suitable for merging yet (it changes the output, specifically adding more capture groups than were originally present). I am a bit surprised it's only a 30% gain, I would have expected more.

The instrumentation for total time spent, number of invocations, etc., sounds extremely useful, and I recommend that gets checked in. We'll want to track performance on an ongoing basis (assuming we go ahead with fancy-regex, and even then it's extremely useful for making that decision). Where is the time going after the optimization is in place?

In my (admittedly not very thorough) testing, the time impact of regex compilation was minimal. For large files, it's spending seconds computing the highlighting.

What's the secret to super-fast performance? Is it running multiple regexes in parallel? Is it using RegexSet? I think the latter is promising. One idea that might be worth pursuing is the idea of an approximate (conservative) pure regex, which is guaranteed to match if the source regex does, but not conversely. This would be useful primarily in pruning the regexes that need to be evaluated. One downside, though: if there's a lot of pushing and popping, then it will run multiple sets over (almost) the entire source line.
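A sketch of the RegexSet-as-prefilter idea (a hypothetical helper, not code from this branch): ask the set which patterns match at all, then re-run only those to recover positions.

use regex::{Regex, RegexSet};

// Returns (pattern index, match start, match end) for every pattern that matches.
fn prefiltered_matches(patterns: &[&str], line: &str) -> Vec<(usize, usize, usize)> {
    let set = RegexSet::new(patterns).unwrap();
    let regexes: Vec<Regex> = patterns.iter().map(|p| Regex::new(p).unwrap()).collect();
    set.matches(line)
        .iter()
        .filter_map(|i| regexes[i].find(line).map(|m| (i, m.start(), m.end())))
        .collect()
}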


@trishume trishume referenced this pull request Mar 2, 2017

Open

Support advanced scope selectors #36

3 of 5 tasks complete
@trishume
Owner

trishume commented Mar 16, 2017

I just learned something from this conversation on Reddit with @BurntSushi and @raphlinus that may be part of the cause of the lack of performance gains.

syntect currently always does a regex search with captures even when it doesn't actually need them. Given what @BurntSushi said on Reddit, with the regex crate (which fancy-regex will call into a lot) this has a substantial performance penalty over just finding the bounds of the match.

I think an optimization where it keeps track of whether a certain rule needs the captures or not may help performance with fancy-regex and possibly with Oniguruma as well, although I'm not sure there's much of a penalty to getting captures with Oniguruma, especially since it lets you re-use the capture regions struct between calls.

Not sure exactly how much difference it would make, but it would probably help a bit.
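A sketch of that optimization against the regex crate (the rule_needs_captures flag is hypothetical, not syntect's actual data model):

use regex::Regex;

fn search(re: &Regex, line: &str, rule_needs_captures: bool) -> Option<(usize, usize)> {
    if rule_needs_captures {
        // Slower path: runs the engine that tracks capture group positions.
        re.captures(line)
            .and_then(|caps| caps.get(0))
            .map(|m| (m.start(), m.end()))
    } else {
        // Faster path: only the overall match bounds are computed.
        re.find(line).map(|m| (m.start(), m.end()))
    }
}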


@BurntSushi

BurntSushi commented Mar 16, 2017

> syntect currently always does a regex search with captures even when it doesn't actually need them. Given what @BurntSushi said on Reddit, with the regex crate (which fancy-regex will call into a lot) this has a substantial performance penalty over just finding the bounds of the match.

There's a lot of subtlety here. There are various factors at play:

  1. Even if you use the captures method call, if the regex itself has no captures, then the regex engine notices this and doesn't run the NFA. See: https://github.com/rust-lang/regex/blob/d813518e2a199884cd38a4e32497a7453db79697/src/exec.rs#L523-L535
  2. Does fancy-regex ever call the captures method? Or does it handle captures on its own?
  3. What do your regexes actually look like? (I tried looking once but I got lost. :-()
  4. The performance guide for the regex crate is something you should definitely check out.

> although I'm not sure there's much of a penalty to getting captures with Oniguruma

Right. Classical backtracking engines, IME, typically don't impose a penalty for extracting captures.


@BurntSushi

BurntSushi commented Mar 16, 2017

FWIW, there are plans (in my head, anyway) to make capture extraction faster, but I can't commit to a timeline.


@robinst
Collaborator

robinst commented May 22, 2017

I rebased this branch and have been implementing missing features in fancy-regex, see my pull requests.

Also, the just released regex 0.2.2 now supports nested character classes and intersections, which means the "Fancy character class syntax [a-w&&[^c-g]z]" task is mostly done.

So I think the only task that doesn't have a pull request yet is "Ability to use \z in a character class (Using the nonewlines option)", which I haven't looked at yet.

I also found a regex that fancy-regex currently has trouble with (haven't investigated why yet): google/fancy-regex#14


@robinst
Collaborator

robinst commented Jun 9, 2017

I looked into "Ability to use \z in a character class (Using the nonewlines option)" now. I'm pretty sure it doesn't work as expected in the current implementation, but it wasn't detected because onig doesn't complain.

With the following:

let re = onig::Regex::new(r"^a[\z]").unwrap();
println!("{}", re.is_match("a"));
println!("{}", re.is_match("az"));

The regex is compiled, but it prints false, then true. So it's equivalent to ^az, not ^a\z (which is what syntect would want).


@trishume trishume referenced this pull request Jun 9, 2017

Closed

Fix nonewlines mode #76

0 of 2 tasks complete
@trishume
Owner

trishume commented Jun 9, 2017

@robinst good catch, I created an issue: #76 and updated the to-do list in this issue.


@robinst
Collaborator

robinst commented Jul 27, 2017

Update: Fixed #76 now. With that, all of the check boxes in the description are done.

I've rebased @trishume's fancy-regex branch here: https://github.com/robinst/syntect/tree/fancy-regex

Now there are the following failing tests left, and they fail on the assertions (instead of panicking while compiling regexes):

html::tests::strings
html::tests::tokens
parsing::parser::tests::can_parse_yaml

The last one I added here; it shows that the YAML syntax is not working yet:

 thread 'parsing::parser::tests::can_parse_yaml' panicked at 'assertion failed: `(left == right)`
-  left: `[(0, Push(<source.yaml>)), (0, Push(<string.unquoted.plain.out.yaml>)), (1, Pop(1)), (1, Push(<string.unquoted.plain.out.yaml>)), (2, Pop(1)), (2, Push(<constant.language.boolean.yaml>)), (3, Pop(1)), (3, Push(<punctuation.separator.key-value.mapping.yaml>)), (4, Pop(1)), (5, Push(<string.unquoted.plain.out.yaml>)), (10, Pop(1))]`,
+ right: `[(0, Push(<source.yaml>)), (0, Push(<string.unquoted.plain.out.yaml>)), (0, Push(<entity.name.tag.yaml>)), (3, Pop(2)), (3, Push(<punctuation.separator.key-value.mapping.yaml>)), (4, Pop(1)), (5, Push(<string.unquoted.plain.out.yaml>)), (10, Pop(1))]`', src/parsing/parser.rs:482:8

I had a look at the syntax but it's pretty complex. If someone who knows the syntax wants to track down the problem, that would be cool. (I guess I should learn how to use a debugger for Rust :)).


@keith-hall
Collaborator

keith-hall commented Jul 27, 2017

I don't know YAML syntax very well, but I cut down the syntax definition to the following, and I think it should still have the same behavior with the key: value\n example (it still gets the same scopes in ST as the full YAML syntax def - untested in syntect though, sorry!), so it may help with debugging.

%YAML 1.2
---
# See http://www.sublimetext.com/docs/3/syntax.html
scope: source.yaml-test
name: YAML-Test
variables:
  c_indicator: '[-?:,\[\]{}#&*!|>''"%@`]'
  # plain scalar begin and end patterns
  ns_plain_first_plain_out: |- # c=plain-out
    (?x:
        [^\s{{c_indicator}}]
      | [?:-] \S
    )

  _flow_scalar_end_plain_out: |- # kind of the negation of nb-ns-plain-in-line(c) c=plain-out
    (?x:
      (?=
          \s* $
        | \s+ \#
        | \s* : (\s|$)
      )
    )
contexts:
  main:
    - include: block-mapping
    - include: flow-scalar-plain-out

  block-mapping:
    - match: |
        (?x)
        (?=
          {{ns_plain_first_plain_out}}
          (
              [^\s:]
            | : \S
            | \s+ (?![#\s])
          )*
          \s*
          :
          (\s|$)
        )
      push:
        #- include: flow-scalar-plain-out-implicit-type
        - match: '{{_flow_scalar_end_plain_out}}'
          pop: true
        - match: '{{ns_plain_first_plain_out}}'
          set:
            - meta_scope: string.unquoted.plain.out.yaml entity.name.tag.yaml
              meta_include_prototype: false
            - match: '{{_flow_scalar_end_plain_out}}'
              pop: true
    - match: :(?=\s|$)
      scope: punctuation.separator.key-value.mapping.yaml

  flow-scalar-plain-out:
    # http://yaml.org/spec/1.2/spec.html#style/flow/plain
    # ns-plain(n,c) (c=flow-out, c=block-key)
    #- include: flow-scalar-plain-out-implicit-type
    - match: '{{ns_plain_first_plain_out}}'
      push:
        - meta_scope: string.unquoted.plain.out.yaml
          meta_include_prototype: false
        - match: '{{_flow_scalar_end_plain_out}}'
          pop: true
@robinst
Collaborator

robinst commented Jul 28, 2017

Thanks @keith-hall! That helped, I've noticed a difference with this pattern (narrowed down):

let regex = r"(?=\s*$|\s*:(\s|$))";
let s = "key: value";
println!("{:?}", onig::Regex::new(regex).unwrap().find(s));
println!("{:?}", fancy_regex::Regex::new(regex).unwrap().find(s));

onig returns Some((3, 3)), whereas fancy-regex returns Some((0, 0)). Removing the \s*$| part of the pattern makes it work. Ran out of time now, but looks like fancy-regex doesn't handle the $ in the positive lookahead correctly?


@robinst
Collaborator

robinst commented Aug 3, 2017

Ok, tracked down the problem and have a fix here: google/fancy-regex#21

With that fix, cargo test works! syntest panics compiling one of the regexes though :).


@robinst
Collaborator

robinst commented Sep 5, 2017

FYI, the regex that currently fails to compile is (?<=$|.). Oniguruma handles it by transforming it to (?<=$)|(?<=.). fancy-regex can do the same. I'm looking into that, but don't have a PR yet.
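Illustrative only, the shape of that rewrite as a plain string transformation (not the fancy-regex implementation):

// Turns the branches ["$", "."] into "(?<=$)|(?<=.)".
fn split_lookbehind_alternation(branches: &[&str]) -> String {
    branches
        .iter()
        .map(|b| format!("(?<={})", b))
        .collect::<Vec<_>>()
        .join("|")
}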


@trishume
Owner

trishume commented Sep 5, 2017

@robinst awesome, great progress, thanks for all the work you've been doing on this. Seems like the goal is in sight.


@keith-hall
Collaborator

keith-hall commented Sep 5, 2017

Out of interest, why does that cause a panic? Does fancy-regex not support variable-width lookbehind?
I imagine that if I were improving that syntax definition to be compatible with ST's sregex engine, I would transform that pattern to (?:$|(?!^)) to avoid look-behinds altogether for better performance. But I guess if you need a generic fix for variable-width look-behind then transforming it the way Oniguruma does is the only realistic way to do it.


@robinst
Collaborator

robinst commented Sep 6, 2017

That case is just not implemented yet, yeah. I haven't checked if there's other patterns that have the same problem, but it's something that fancy-regex will need anyway, so I'm looking into adding it.

The pattern comes from here btw (there's another one further down): https://github.com/sublimehq/Packages/blob/master/Clojure/Clojure.sublime-syntax#L834

(And I agree, the pattern can be improved.)


@robinst
Collaborator

robinst commented Sep 14, 2017

PR for supporting look-behind with variable sized alternative in fancy-regex: google/fancy-regex#25


@robinst
Collaborator

robinst commented Sep 19, 2017

So with the latest two fixes (google/fancy-regex#28 and google/fancy-regex#29), all the regexes now compile 🎉.

But there are still quite a few assertion failures when running syntest; here's a diff of running it on master (old) against fancy-regex (new): https://gist.github.com/robinst/5b18d47cf45697bacd827fee845795bb#file-syntest-onig-to-fancy-diff

So now we have to track down why they fail.


@keith-hall
Collaborator

keith-hall commented Sep 19, 2017

I had a quick look at the errors from ./testdata/Packages/C#/tests/syntax_test_Generics.cs, as there aren't many and they all seem related, and discovered that the following test, if added to fancy-regex, fails:

assert_match(r"$\n", "\n");

and regarding the XML failure, this regex also seems to fail:

assert_match(r"[[:alpha:]:_][[:alnum:]:_.-]*", "таĝñäᴹə");
@robinst
Collaborator

robinst commented Sep 19, 2017

Thanks Keith, that's very helpful! Will look into those.


@robinst
Collaborator

robinst commented Sep 21, 2017

Had a quick look, both have simple causes:

  • $ means end of text (as opposed to end of line) in the regex crate, and in fancy-regex too. We can consider changing it in fancy-regex, but I'm not sure it's a good idea to deviate too much from the regex crate. We can instead compile regexes in syntect with (?m) prepended, to enable multi-line mode.

  • [[:alpha:]] and others are ASCII-only in the regex crate. I'm not sure why they work with onig, as they require Unicode mode, but haven't looked into that yet. Anyway, I think they should all be changed in the syntaxes to use \p style classes instead, which are unambiguously Unicode in all engines. In the meantime, we could just do replacement in syntect.
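A sketch of the kind of rewriting described above (the POSIX-class mapping shown is an assumption for illustration, not necessarily what the branch does):

// Prepend (?m) so ^ and $ match at line boundaries, and swap ASCII-only POSIX
// classes for Unicode-aware \p classes. Purely textual; real code would need to
// be more careful (e.g. not rewriting inside escapes).
fn adapt_pattern(pattern: &str) -> String {
    let mut p = format!("(?m){}", pattern);
    p = p.replace("[:alpha:]", r"\p{L}");
    p = p.replace("[:alnum:]", r"\p{L}\p{N}");
    p
}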


@robinst
Collaborator

robinst commented Sep 22, 2017

Ok, I've done the fixes in syntect here: https://github.com/robinst/syntect/tree/fancy-regex

And another fix for fancy-regex: google/fancy-regex#30

Now the diff looks much better: https://gist.github.com/robinst/5b18d47cf45697bacd827fee845795bb


@robinst
Collaborator

robinst commented Oct 6, 2017

FYI I've narrowed the failure in Markdown (line 12) to this:

Finding the regex r"(?m)^\s*$" in the input "test\n", Oniguruma doesn't match, but fancy-regex returns position 5. Same problem if we simplify the regex to r"(?m)^$".

Note that fancy-regex just delegates to the regex crate for this, so I'll either need to try to fix it there or we work around it. I'll open an issue in the regex crate first.
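A direct reproduction against the regex crate (which fancy-regex delegates this pattern to):

// In multi-line mode, ^ matches after the trailing newline and $ matches at the
// end of the text, so this reports an empty match at offset 5; Oniguruma finds
// no match here.
let re = regex::Regex::new(r"(?m)^\s*$").unwrap();
println!("{:?}", re.find("test\n").map(|m| (m.start(), m.end())));
// prints Some((5, 5))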


@robinst robinst referenced this pull request in rust-lang/regex Oct 9, 2017

Closed

In multi-line mode, ^ matches after a trailing newline #406

@keith-hall
Collaborator

keith-hall commented Oct 9, 2017

In terms of a workaround, would it help if syntect replaces ^ with \A and \n with \n\Z?


@robinst
Collaborator

robinst commented Oct 9, 2017

Yeah. I think the way to work around it would be to replace ^ with \A but keep $ as-is and keep multi-line mode. \A should work because we're only matching on individual lines anyway. We can't use \n\Z because that wouldn't work on the last lines of files that are not terminated by newlines.

I'll give that a try. It means we'll also have to parse the regex for that case because we don't want to replace all the ^ characters in the pattern.


@robinst
Collaborator

robinst commented Oct 13, 2017

Another fix: google/fancy-regex#36

And with replacing ^ with \A, we're now really close :): https://gist.github.com/robinst/5b18d47cf45697bacd827fee845795bb


@trishume
Owner

trishume commented Oct 13, 2017

@robinst that's awesome! Maybe if I find myself with some free time I'll try to fix the last few bugs that cause ASP syntax test failures. That should make it easier to figure out which problems are syntect and which are fancy-regex. In the meantime I guess you can look at the Markdown failures.


@robinst
Collaborator

robinst commented Nov 13, 2017

FYI, I've narrowed down the failure to this difference between onig and fancy-regex:

let regex = r"(?!`(?:[^`]+(?=`)|x)`)";
let s = "`a`";
println!("{:?}", onig::Regex::new(regex).unwrap().find(s));
println!("{:?}", fancy_regex::Regex::new(regex).unwrap().find(s));

onig returns (1, 1) but fancy-regex returns (0, 0). Haven't tracked down the bug in fancy-regex yet.


@robinst
Collaborator

robinst commented Jan 22, 2018

Found the problem for the above, fix here: google/fancy-regex#42

That fixes the Markdown failures: https://gist.github.com/robinst/5b18d47cf45697bacd827fee845795bb

It's interesting that some tests now have fewer failures with fancy-regex. But that could also be related to the changes in how we transform regexes.

I'll probably look into the Java failures next.


@trishume
Owner

trishume commented Jan 22, 2018

@robinst Wow, awesome work! If there are fewer failures with fancy-regex than Oniguruma now, then it sounds like there are only a few things left to do before we can make fancy-regex the default or only option:

  • Do a bunch of benchmarking to make sure it isn't a performance regression. Posting some results here would help, and if you (or anyone else) have your own corpus/server you'd like to try it out on, tell me how the speed is and I'll take your word for it. I'm willing to merge as long as it isn't a significant regression; if necessary we can work on making it faster later.
  • Test for backtracking explosions that Oniguruma doesn't hit, I seem to recall we found some of these before. Again anyone with a private corpus/service can be helpful here.
  • I review your branch before merging, but that should be fairly easy.
@BurntSushi

BurntSushi commented Jan 22, 2018

If y'all come across performance problems in the regex crate, I'm happy to diagnose them if there's a simple example I can run. :-)


@keith-hall
Collaborator

keith-hall commented Jan 24, 2018

this regex seems to cause catastrophic backtracking with fancy-regex, and takes 30 minutes to find a match against this line. I wonder how feasible it would be for fancy-regex to automatically unroll the loop - i.e. convert (?m)([^}"\\]+(\\.)*)+(?="(?!")) to (?m)((?:[^}"\\]+)(?:(\\.)+(?:[^}"\\]*))*)(?="(?!")), or whether some other mitigation would be a better solution.

Collaborator

keith-hall commented Jan 24, 2018

This regex seems to cause catastrophic backtracking with fancy-regex, and takes about 30 minutes to search this line. I wonder how feasible it would be for fancy-regex to automatically unroll the loop - i.e. convert (?m)([^}"\\]+(\\.)*)+(?="(?!")) to (?m)((?:[^}"\\]+)(?:(\\.)+(?:[^}"\\]*))*)(?="(?!")) - or whether some other mitigation would be a better solution.

Collaborator

robinst commented Jan 25, 2018

@keith-hall I tried reproducing that but I can't; it runs fine for me. All of syntest completes in reasonable time here. Are you running my branch? Did you run make packs beforehand? The problem sounds similar to google/fancy-regex#14.

@trishume Yeah that sounds good to me. I'd also like to polish fancy-regex a bit before:

  • It needs a good test suite. I was thinking of parsing Oniguruma's, but haven't looked into that yet. We'd probably need to ignore some of those tests because of different behavior, but maybe it's a good base. If someone knows a good test suite (that includes lookaround assertions), that'd be interesting.
  • The API should be changed to follow the API changes that happened in the regex crate. I think it was modeled on the 0.1 regex crate API.
  • It needs good documentation for the things where the behavior differs from other engines like Oniguruma. The differences mostly come from being based on the regex crate.

Collaborator

keith-hall commented Jan 25, 2018

Yes, running your branch - I tried make packs both with the fixed C# syntax (https://github.com/sublimehq/Packages/pull/1359) and without, and it seems to happen in both cases when running syntest.

However, trying fancy-regex directly seems to work fine/fast, so I'm not sure what's going on - maybe something in syntect I can try to track down:
cargo run --example toy run '(?m)([^}"\\]+(\\.)*)+(?="(?!"))' ' Console.WriteLine("{0,-20} {1,5:N1}", names[ctr], hours[ctr]);'

It'd be great to have some easy way to profile regex pattern matching performance in syntect.

Owner

trishume commented Jan 26, 2018

@keith-hall In the past I've inserted code that prints every regex before it tries to search for it, and then when it hangs the last thing printed is the cause.

Collaborator

keith-hall commented Jan 26, 2018

it's definitely this regex - I updated the parser to show timings:

--- a/src/parsing/parser.rs
+++ b/src/parsing/parser.rs
@@ -216,8 +216,13 @@ impl ParseState {
                     } else {
                         match_pat.regex.as_ref().unwrap()
                     };
+                    use std::time::Instant;
+                    let before = Instant::now();
                     // TODO don't panic on regex error
                     let matched = regex.captures_from_pos(line, *start).unwrap();
+                    let duration = Instant::now().duration_since(before);
+                    println!("regex {:?} took {:?} to find {} starting at position {:?}", regex, duration, if matched.is_some() { "a match" } else { "nothing" }, start);
+
                     if let Some(captures) = matched {
                         let match_start = captures.pos(0).unwrap().0;
                         let match_end = captures.pos(0).unwrap().1;

and this was the output:

regex ([^}"\\]+(\\.)*)+(?="(?!")) took Duration { secs: 1683, nanos: 255688848 } to find nothing starting at position 44

That's 28 minutes, and it shows that the search starts at position 44, which is probably why I couldn't reproduce the problem when testing the pattern in fancy-regex against the whole line.

Keats referenced this pull request in Keats/gutenberg on Jan 29, 2018: Can not `cargo build --release` on Windows 10. #198 (Closed)

trishume referenced this pull request on Feb 2, 2018: Wasm support #135 (Open)

Collaborator

robinst commented Feb 5, 2018

Ok, so I can reproduce it with (note the 44 at the end of the command, to start from that position):

cargo run --release --example toy run '(?m)([^}"\\]+(\\.)*)+(?="(?!"))' \
'            Console.WriteLine("{0,-20} {1,5:N1}", names[ctr], hours[ctr]);' 44

or:

cargo run --release --example toy run '(?m)([^}"\\]+(\\.)*)+(?="(?!"))' ':N1}", names[ctr], hours[ctr]);'

It takes 25 seconds here, which is too slow. Not sure if you were running with --release or not, but --release makes a much bigger difference with fancy-regex than with onig, because with fancy-regex all the regex matching runs as Rust code that is compiled without optimizations in a debug build.
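Aside (not part of this comment or the PR): a rough sketch for timing the original pattern against the hand-unrolled variant keith-hall proposed above, on the shortened input from the second command. It uses the same fancy_regex calls as the toy example and doesn't assert which pattern is faster; it just makes the comparison easy to rerun. Build with --release, since the original pattern already takes around 25 seconds there and much longer in a debug build.

// Hedged sketch: compare the two patterns on the shortened repro input.
use std::time::Instant;

fn time_pattern(pattern: &str, input: &str) {
    let regex = fancy_regex::Regex::new(pattern).unwrap();
    let before = Instant::now();
    let result = regex.find(input);
    println!("{} -> {:?} after {:?}", pattern, result, before.elapsed());
}

fn main() {
    // Shortened input from the second repro command above.
    let input = r#":N1}", names[ctr], hours[ctr]);"#;

    // Original pattern from the C# syntax.
    time_pattern(r#"(?m)([^}"\\]+(\\.)*)+(?="(?!"))"#, input);
    // Hand-unrolled variant proposed earlier in the thread.
    time_pattern(r#"(?m)((?:[^}"\\]+)(?:(\\.)+(?:[^}"\\]*))*)(?="(?!"))"#, input);
}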

Collaborator

robinst commented Feb 9, 2018

@keith-hall I've investigated this now, see my comment here: google/fancy-regex#14 (comment)

Summary: Oniguruma happens to not run into catastrophic backtracking because it has an optimization which makes it check whether the input contains a `"` (the literal required by the look-ahead) before it does any matching.

I don't think we'll implement optimizations such as that in fancy-regex in the short term (and it only helps in some cases), although it would be an interesting project.

Can we change the regex instead?
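Aside (an illustration only, not Oniguruma's actual algorithm or anything fancy-regex does today): the kind of optimization described above can be sketched by hand. Because the pattern ends in a look-ahead for a literal `"`, a match is impossible unless the input contains one, so a cheap substring scan can skip the backtracking search entirely for inputs that lack it. The input string below is made up purely for illustration.

// Hypothetical prescan wrapped around fancy_regex by hand.
fn main() {
    let pattern = r#"(?m)([^}"\\]+(\\.)*)+(?="(?!"))"#;
    let input = ":N1}, names[ctr], hours[ctr]);"; // contains no double quote

    // The trailing look-ahead (?=") requires a literal '"', so without one the
    // pattern can never match and the expensive search can be skipped.
    if !input.contains('"') {
        println!("prescan: no '\"' in input, no match possible");
    } else {
        let regex = fancy_regex::Regex::new(pattern).unwrap();
        println!("{:?}", regex.find(input));
    }
}

Building this kind of check automatically from the pattern is the "interesting project" mentioned above.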

jmacdonald referenced this pull request in jmacdonald/amp on Mar 22, 2018: shift r causing Segmentation fault 11 on macOS #77 (Open)

Collaborator

keith-hall commented Apr 28, 2018

It'd be great if this PR could be updated with the latest changes (from @robinst's fancy-regex branch and the fixes from master), if anyone has time to resolve the merge conflicts :)

Collaborator

robinst commented Apr 28, 2018

@keith-hall: I rebased my branch here: https://github.com/robinst/syntect/tree/fancy-regex

If @trishume is ok with it, I can also force-push that to the branch that this PR is based on.

Owner

trishume commented Apr 28, 2018

@robinst yah a force push sounds good

trishume and others added some commits Feb 9, 2017

Add a simple parsing test for YAML
A related test for highlighting YAML currently fails with the fancy-regex changes, so this simpler test may make it easier to track down the problem.

Compile regexes in multi-line mode for the "newlines" syntaxes
Some of the regexes include `$` and expect it to match end of line. In fancy-regex, `$` means end of text by default. Adding `(?m)` activates multi-line mode, which changes `$` to match end of line. This fixes a large number of the failed assertions with syntest.

Replace POSIX character classes so that they match Unicode as well
In fancy-regex, POSIX character classes only match ASCII characters. Sublime's syntaxes expect them to match Unicode characters as well, so transform them to the corresponding Unicode character classes.

Replace ^ with \A in multi-line mode regexes
With the regex crate and fancy-regex, `^` in multi-line mode also matches at the end of a string like "test\n". There are some regexes in the syntax definitions like `^\s*$`, which are intended to match a blank line only. So change `^` to `\A`, which only matches at the beginning of the text.

Fix code that skips a character to work with Unicode
Note that this wasn't a problem with Oniguruma because it works on UTF-8 bytes, but fancy-regex works on characters.
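Aside (a standalone sketch, not code from the branch): the two semantic differences described in the commit messages above are easy to see directly, assuming the haystack for a single line includes its trailing newline the way syntect's newlines syntaxes expect. The expected results in the comments follow from regex/fancy-regex semantics, not from output captured in this PR.

fn main() {
    // 1) `$` means end of text by default, so `foo$` does not match the line
    //    "foo\n"; with (?m) it matches just before the newline.
    println!("{:?}", fancy_regex::Regex::new(r"foo$").unwrap().find("foo\n"));        // no match
    println!("{:?}", fancy_regex::Regex::new(r"(?m)foo$").unwrap().find("foo\n"));    // match at 0..3

    // 2) In multi-line mode `^` also matches right after the trailing "\n", so a
    //    blank-line pattern like ^\s*$ finds a bogus empty match at the end of
    //    "test\n"; anchoring with \A avoids that.
    println!("{:?}", fancy_regex::Regex::new(r"(?m)^\s*$").unwrap().find("test\n"));  // empty match at 5..5
    println!("{:?}", fancy_regex::Regex::new(r"(?m)\A\s*$").unwrap().find("test\n")); // no match
}
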
Collaborator

robinst commented May 1, 2018

Done! Note that you might have to cargo update locally before trying to run this (because Cargo.lock is not checked in). Updates:

  • A small change to make fancy-regex work with a newer regex crate.
  • A bug fix in syntect for code that I wrote a couple of days ago that fails with fancy-regex (but is fine with Oniguruma). += 1 on a string index is almost never the right thing to do. I walked into this one before, but apparently forgot about it again :).
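Aside (my own illustration, not the actual syntect fix): string indices in Rust are byte offsets, so `start += 1` can land in the middle of a multi-byte character and make the next slice panic. Advancing by the width of the character at the index is one safe alternative.

fn main() {
    let line = "héllo";
    let start = 1; // byte offset of 'é', which is 2 bytes long in UTF-8

    // let skipped = start + 1; // &line[skipped..] would panic: 2 is not a char boundary

    // Advance by the width of the character at `start` instead.
    let width = line[start..].chars().next().map_or(0, |c| c.len_utf8());
    let skipped = start + width;
    println!("skipping one character moves the index from {} to {}", start, skipped); // 1 -> 3
    println!("rest of the line: {:?}", &line[skipped..]); // "llo"
}
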
Collaborator

keith-hall commented May 2, 2018

Small note for people struggling to get syntect to build on Windows: using the branch from this PR, you can edit Cargo.toml to remove the onig dependency and everything will still work fine.

Side note: this PR should probably be updated so that fancy-regex is an optional dependency of the parsing feature, like onig is, no?

--- Cargo.toml
+++ [new] Cargo.toml
@@ -15,7 +15,7 @@
 
 [dependencies]
 yaml-rust = { version = "0.4", optional = true }
-onig = { version = "3.2.1", optional = true }
+#onig = { version = "3.2.1", optional = true }
 walkdir = "2.0"
 regex-syntax = { version = "0.4", optional = true }
 lazy_static = "1.0"
@@ -25,7 +25,7 @@
 flate2 = { version = "1.0", optional = true, default-features = false }
 fnv = { version = "1.0", optional = true }
 regex = "*"
-fancy-regex = { git = "https://github.com/google/fancy-regex.git" }
+fancy-regex = { git = "https://github.com/google/fancy-regex.git", optional = true }
 serde = { version = "1.0", features = ["rc"] }
 serde_derive = "1.0"
 serde_json = "1.0"
@@ -51,7 +51,7 @@
 # Pure Rust dump creation, worse compressor so produces larger dumps than dump-create
 dump-create-rs = ["flate2/rust_backend", "bincode"]
 
-parsing = ["onig", "regex-syntax", "fnv"]
+parsing = ["fancy-regex", "regex-syntax", "fnv"]
 # The `assets` feature enables inclusion of the default theme and syntax packages.
 # For `assets` to do anything, it requires one of `dump-load-rs` or `dump-load` to be set.
 assets = []

Collaborator

robinst commented May 2, 2018

@keith-hall Pushed a commit with those changes, thanks!

src/parsing/syntax_definition.rs
- RegexOptions::REGEX_OPTION_CAPTURE_GROUP,
- Syntax::default())
- .unwrap();
+ println!("compiling {:?}", self.regex_str);

keith-hall May 3, 2018

Collaborator

Should this println be here, as it generates a lot of noise? If it's useful for debugging, maybe it would be best to hide it behind a feature flag, as discussed at #146 (comment).

trishume May 3, 2018

Owner

Just commenting it out is fine, see my reply on the comment

robinst May 6, 2018

Collaborator

@keith-hall I pushed a commit that changes this to only print in case it fails.

Replace println! and plain unwrap with nicer panic!
I think the println! was only there to see the regex that failed to
compile.
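Aside (a hedged sketch of the shape of that change, not the exact commit): keep the failure information, but only build the message when compilation actually fails.

// Compile a pattern and include the offending regex string in the panic message,
// instead of println!-ing every pattern and then calling plain unwrap().
fn compile_regex(regex_str: &str) -> fancy_regex::Regex {
    fancy_regex::Regex::new(regex_str)
        .unwrap_or_else(|e| panic!("regex string {:?} failed to compile: {:?}", regex_str, e))
}

fn main() {
    let regex = compile_regex(r"\w+(?=!)");
    println!("{:?}", regex.find("hello!"));
}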