Cleanup (?:) from beginning/end of groups #164

josephfrazier · 2017-02-19T18:56:32Z

This simplifies the compiled expressions by ensuring that groups don't
start or end with (?:). For instance, this code:

// Using named capture and flag x (free-spacing and line comments)
date = XRegExp('(?<year>  [0-9]{4} ) -?  # year  \n' +
               '(?<month> [0-9]{2} ) -?  # month \n' +
               '(?<day>   [0-9]{2} )     # day     ', 'x');

now compiles to this pattern:

([0-9]{4})(?:)-?(?:)([0-9]{2})(?:)-?(?:)([0-9]{2})(?:)

instead of

((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)

Here are the two patterns side by side, with whitespace inserted into the new one to illustrate the differences:

(    [0-9]{4}    )(?:)-?(?:)(    [0-9]{2}    )(?:)-?(?:)(    [0-9]{2}    )(?:)
((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)
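As a quick sanity check that dropping the leading and trailing `(?:)` inside each group does not change matching behavior, the two compiled sources quoted above can be compared with the native RegExp constructor. This is only an illustrative sketch: the sample inputs and the `describe` helper are made up for this example and are not part of the XRegExp test suite.

```javascript
// The old and new compiled sources quoted above, as native RegExp objects.
const before = new RegExp(
  '((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)'
);
const after = new RegExp('([0-9]{4})(?:)-?(?:)([0-9]{2})(?:)-?(?:)([0-9]{2})(?:)');

// Hypothetical helper: reduce a match to its index and captured strings.
function describe(regex, input) {
  const m = regex.exec(input);
  return m === null ? null : { index: m.index, groups: Array.from(m) };
}

// Arbitrary sample inputs chosen for illustration.
const samples = ['2017-02-19', '20170219', 'on 2017-0219', 'not a date'];
const comparison = samples.map((input) => ({
  input,
  same: JSON.stringify(describe(before, input)) === JSON.stringify(describe(after, input)),
  captures: (describe(after, input) || { groups: [] }).groups.slice(1),
}));

console.log(comparison.every((row) => row.same)); // true
console.log(comparison[0].captures); // [ '2017', '02', '19' ]
```

An empty non-capturing group always matches the empty string, so both sources accept the same inputs with the same captures; only the length of the source differs.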

slevithan · 2017-03-28T06:18:39Z

It looks like this replaces a few patterns:

(?:)) -> )
((?:) -> (
(?:(?:) -> (?:

This is in addition to (?:)(?:) -> (?:), which was already done prior to these changes.

I agree with these improvements in principle but have two concerns in practice: impact on perf, and correctness.

For perf, what is the effect seen in /tests/perf/index.html, specifically on the "Constructor with short pattern" for the "XRegExp with pattern cache flush" test? (This is a relatively minor concern.)

As for correctness, it's possible to break this in cases like (.(\(?:)), which should match e.g. 'x:', but would be converted to (.(\) and produce a syntax error. Avoiding this problem would likely lead to greater perf/complexity impact.
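The failure mode described above can be reproduced with plain string replacement and the native RegExp constructor. The snippet below is a minimal sketch of a naive textual cleanup, not the code from this PR: it applies only the `(?:))` -> `)` rule to the example pattern.

```javascript
// The pattern from the comment above: any character, an optional literal "(", then ":".
const source = String.raw`(.(\(?:))`;
const original = new RegExp(source);
console.log(original.test('x:')); // true

// Naive textual cleanup: treat the first "(?:))" as an empty group followed by ")".
// Here that substring is really an escaped "\(" with a "?" quantifier, then ":" and "))".
const naive = source.replace('(?:))', ')');
console.log(naive); // (.(\)

// The rewritten source no longer has balanced groups, so compiling it throws.
let error = null;
try {
  new RegExp(naive);
} catch (e) {
  error = e;
}
console.log(error instanceof SyntaxError); // true
```

A purely textual rewrite cannot tell an escaped `\(` from a group opener without tracking escapes, which is the extra complexity the comment warns about.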

slevithan · 2017-03-28T07:42:38Z

An alternative strategy for keeping generated regexes clean might be to give token handler functions access to the preceding generated regex token so they can more smartly return (?:) only when really needed. (Search xregexp.js for '(?:)' to see where these are inserted, currently only with a knowledge of what follows and not what precedes.)
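To make the idea concrete, here is a toy free-spacing compiler that remembers the previously emitted token and only inserts `(?:)` when both neighbors could otherwise fuse. Everything in it is hypothetical: `compileFreeSpacing` is not an XRegExp function, and the sketch ignores character classes, comments, and group openers such as `(?:`, all of which a real implementation would have to handle.

```javascript
// Hypothetical sketch, not XRegExp's implementation.
function compileFreeSpacing(pattern) {
  // Tokens: an escaped pair, a run of whitespace, or any other single character.
  const tokens = pattern.match(/\\[\s\S]|\s+|[^\\\s]/g) || [];
  let output = '';
  let prevToken = null; // last non-whitespace token emitted, if any
  let pendingSeparator = false; // whitespace seen since prevToken

  for (const token of tokens) {
    if (/^\s+$/.test(token)) {
      pendingSeparator = true;
      continue;
    }
    const prevNeedsIt = prevToken !== null && prevToken !== '(' && prevToken !== '|';
    const nextNeedsIt = token !== ')' && token !== '|';
    if (pendingSeparator && prevNeedsIt && nextNeedsIt) {
      output += '(?:)'; // keep the two neighboring tokens from fusing
    }
    pendingSeparator = false;
    output += token;
    prevToken = token;
  }
  return output;
}

console.log(compileFreeSpacing('( [0-9]{4} ) -? ( [0-9]{2} )'));
// ([0-9]{4})(?:)-?(?:)([0-9]{2})
console.log(compileFreeSpacing(String.raw`(a) \1 0`));
// (a)(?:)\1(?:)0
console.log(compileFreeSpacing(String.raw`(.(\(?:))`));
// (.(\(?:))
```

Because an escaped `\(` is a different token from a bare `(`, this approach never has to re-parse its own output, which sidesteps the correctness problem with after-the-fact replacement.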

This test currently fails. Here are the actual and expected patterns, with whitespace inserted to illustrate the difference:

'((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)'
'(    [0-9]{4}    )(?:)-?(?:)(    [0-9]{2}    )(?:)-?(?:)(    [0-9]{2}    )(?:)'

This passes the tests in the previous commit, using the new isPatternNext function to determine if the match is at the end of a group. Checking if the match is at the beginning of a group is a little more naive, since it only looks at the previous character, rather than ignoring comments and whitespace, but I haven't found a good way to improve on that.
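For readers who have not opened the diff, a lookahead helper of the kind described here might look roughly like the following. This is a sketch under assumptions: the signature, the sub-pattern names, and the exact comment syntax handled are guesses for illustration, and the actual `isPatternNext` in xregexp.js may differ.

```javascript
// Sketch only: signature and sub-patterns are assumptions, not copied from xregexp.js.
function isPatternNext(pattern, pos, flags, needlePattern) {
  const inlineComment = '\\(\\?#[^)]*\\)';
  const lineComment = '#[^\\n]*';
  // With flag x, whitespace and line comments are insignificant and can be skipped.
  const ignorable = flags.includes('x')
    ? ['\\s', lineComment, inlineComment]
    : [inlineComment];
  // Built by string concatenation on every call, from separately readable subpatterns.
  const lookahead = new RegExp(
    '^(?:' + ignorable.join('|') + ')*(?:' + needlePattern + ')'
  );
  return lookahead.test(pattern.slice(pos));
}

const closesGroup = '\\)';
console.log(isPatternNext('(a  # note\n )', 2, 'x', closesGroup)); // true
console.log(isPatternNext('(a b)', 2, 'x', closesGroup)); // false
console.log(isPatternNext('(a )', 2, '', closesGroup)); // false
```

Looking forward is straightforward because the remaining pattern text is available as a string; looking backward at the already generated output is harder, which is why the start-of-group check only inspects the previous character.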

josephfrazier · 2017-03-28T17:40:36Z

Thanks for the thorough review! I see that my original solution wasn't correct, with the (.(\(?:)) case you pointed out, and I agree that it'd be better to avoid inserting the extra (?:) in the first place, rather than removing them after the fact.

Accordingly, I've force-pushed a different set of commits into this branch that uses the alternative strategy, and additionally tests the (.(\(?:)) case for good measure. This required some refactoring before the meaningful changes, and I also did a small refactoring after, so it's probably easiest to review the commits one-by-one. I tried to make the commit messages informative as well. Let me know what you think!

EDIT: Oh, as far as perf goes: I ran the test page several times with both this (updated) version of the code, as well as version 3.1.1 using the ?version=3.1.1 parameter, and I got between 115 and 120 thousand ops/sec on each run of the Constructor with short pattern - XRegExp with pattern cache flush test, so I don't think perf is significantly impacted by these changes.

slevithan · 2017-04-11T04:25:41Z

Thanks! I still need to look over the new set of diffs closely, but I love the direction of no longer inserting (?:) where it isn't needed.

Aside: I should check if these lines in build.js are still needed after the changes here.

slevithan · 2017-04-16T05:44:32Z

tests/spec/s-xregexp.js

+        var regex = XRegExp('( [0-9]{4} ) -?  # year  \n' +
+                            '( [0-9]{2} ) -?  # month \n' +
+                            '( [0-9]{2} )     # day     ', 'x');
+        expect(regex.source).toEqual('([0-9]{4})(?:)-?(?:)([0-9]{2})(?:)-?(?:)([0-9]{2})(?:)');


This is enforcing the inclusion of multiple (?:) empty groups that aren't needed for this regex to operate correctly. The test should be re-framed to not enforce anything that is unnecessary.

josephfrazier · 2017-04-16

Oops, yeah, it is a bit brittle. We could change it to reject certain substrings, but then we might end up duplicating some of the logic in the non-test code. What if we changed it to something a little simpler, but still future-proof, like this?

expect(regex.source.length <= 54).toBe(true); // 54 is the length of the current result

slevithan · 2017-04-16T10:38:19Z

Testing string length wouldn't verify that it's working. I've gone ahead and updated this in 622aaf3 to use a reduced test case.

As a result of these changes, the "Constructor with x flag, whitespace, and comments" perf test is now meaningfully slower than in v3.1.1. It would be easy to create examples that are even more affected, since each regex token that triggers the new code will be slower.

I'll try to look at speeding this back up later, probably after v3.2.0. A couple ideas: avoid the string concatenation in isPatternNext (possibly going back to regex literals and making the function specific to quantifiers again even though the current code is more readable/maintainable, since this isn't needed to handle simple cases with whitespace followed by )), and update runTokens to make already-processed tokens available to token handler functions. The latter change would add a generally useful feature to custom XRegExp token handlers and also make it easier to add support for more cases where (?:) shouldn't be inserted, e.g. after an opening (?: or (?<n>.

This makes the "Constructor with x flag, whitespace, and comments" test fast again.

From slevithan#164 (comment):

> A couple ideas: avoid the string concatenation in `isPatternNext`
> (possibly going back to regex literals and making the function specific
> to quantifiers again even though the current code is more
> readable/maintainable, since this isn't needed to handle simple cases
> with whitespace followed by `)`)

Since babel-plugin-transform-xregexp automatically compiles the `new RegExp()` calls into literals, we get (most of) the performance back without sacrificing the readability of having separate subpatterns.

Following up on slevithan#164, this change prevents a `(?:)` from being inserted in the following places:

* At the beginning of a non-capturing group (the end is already handled)
* Before or after a `|`
* At the beginning or the end of the pattern

This solution isn't as complete as the one suggested in slevithan#179, but it's a decent stopgap.
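The three positions listed in this commit message are ones where an empty group cannot affect the match, which can be checked with native regexes. The patterns, inputs, and the `sameMatches` helper below are invented for illustration. The last two lines show the contrasting case where the separator still matters: between a backreference and a following digit.

```javascript
// Hypothetical helper: do two sources give the same first match on every input?
function sameMatches(sourceA, sourceB, inputs) {
  return inputs.every((input) => {
    const a = new RegExp(sourceA).exec(input);
    const b = new RegExp(sourceB).exec(input);
    if (a === null || b === null) return a === b;
    return a.index === b.index && JSON.stringify(a) === JSON.stringify(b);
  });
}

const inputs = ['ab', 'abab', 'a', 'b', 'xaby', ''];
const redundant = [
  ['(?:(?:)ab)+', '(?:ab)+'], // beginning of a non-capturing group
  ['a(?:)|(?:)b', 'a|b'],     // before or after |
  ['(?:)ab(?:)', 'ab'],       // beginning or end of the pattern
];
const allEquivalent = redundant.every(
  ([withSeparator, without]) => sameMatches(withSeparator, without, inputs)
);
console.log(allEquivalent); // true

// Where the separator is still required: \1 followed by a literal 0.
const separated = /(a)\1(?:)0/.test('aa0'); // backreference 1, then "0"
const fused = /(a)\10/.test('aa0');         // \10 is read as a single escape
console.log(separated, fused); // true false
```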

josephfrazier added 5 commits March 28, 2017 13:08

Use subpatterns in isQuantifierNext

cddec76

This doesn't change its behavior, but makes it more readable and easier to modify.

Extract isPatternNext helper from isQuantifierNext

6ec37f8

This will allow us to use it for matching other patterns than just quantifiers.

Extract getCommentOrWhitespace helper from token handlers

68abf55

I realized the token handlers are equivalent, so I made them a named function instead.

josephfrazier force-pushed the strip-useless-groups branch from d01e253 to 68abf55 Compare March 28, 2017 17:33

fixup! Extract isPatternNext helper from isQuantifierNext

5d8716e

Use `new` with RegExp constructor, as is done everywhere else.

slevithan reviewed Apr 16, 2017

slevithan merged commit cd0d192 into slevithan:master Apr 16, 2017

slevithan mentioned this pull request Apr 17, 2017

Reduce cases where unnecessary (?:) separator is added to pattern #179

Open

josephfrazier mentioned this pull request Apr 25, 2017

Revert isPatternNext to isQuantifierNext #183

Merged

josephfrazier deleted the strip-useless-groups branch May 1, 2017 13:03

josephfrazier mentioned this pull request May 1, 2017

Cleanup more (?:) from patterns #196

Merged
