Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup (?:) from beginning/end of groups #164

Merged
merged 6 commits into from Apr 16, 2017

Conversation

josephfrazier
Copy link
Collaborator

This simplifies the compiled expressions by ensuring that groups don't
start or end with (?:). For instance, this code:

// Using named capture and flag x (free-spacing and line comments)
date = XRegExp('(?<year>  [0-9]{4} ) -?  # year  \n' +
               '(?<month> [0-9]{2} ) -?  # month \n' +
               '(?<day>   [0-9]{2} )     # day     ', 'x');

now compiles to this pattern:

([0-9]{4})(?:)-?(?:)([0-9]{2})(?:)-?(?:)([0-9]{2})(?:)

instead of

((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)

Here are the two patterns side by side, with whitespace inserted into the new one to illustrate the differences:

(    [0-9]{4}    )(?:)-?(?:)(    [0-9]{2}    )(?:)-?(?:)(    [0-9]{2}    )(?:)
((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)

@slevithan
Copy link
Owner

slevithan commented Mar 28, 2017

It looks like this replaces a few patterns:

  • (?:)) -> )
  • ((?:) -> (
  • (?:(?:) -> (?:

In addition to (?:)(?:) -> (?:), which is already done prior to these changes.

I agree with these improvements in principle but have two concerns in practice: impact on perf, and correctness.

For perf, what is the effect seen in /tests/perf/index.html, specifically on the "Constructor with short pattern" for the "XRegExp with pattern cache flush" test? (This is a relatively minor concern.)

As for correctness, it's possible to break this in cases like (.(\(?:)), which should match e.g. 'x:', but would be converted to (.(\) and produce a syntax error. Avoiding this problem would likely lead to greater perf/complexity impact.

@slevithan
Copy link
Owner

slevithan commented Mar 28, 2017

An alternative strategy for keeping generated regexes clean might be to give token handler functions access to the preceding generated regex token so they can more smartly return (?:) only when really needed. (Search xregexp.js for '(?:)' to see where these are inserted, currently only with a knowledge of what follows and not what precedes.)

This doesn't change its behavior, but makes it more readable and easier
to modify.
This will allow us to use it for matching other patterns than just
quantifiers.
This test currently fails. Here's the actual and expected patterns, with
whitespace inserted to illustrate the difference:

'((?:)[0-9]{4}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)-?(?:)((?:)[0-9]{2}(?:))(?:)'
'(    [0-9]{4}    )(?:)-?(?:)(    [0-9]{2}    )(?:)-?(?:)(    [0-9]{2}    )(?:)'
This passes the tests in the previous commit, using the new
isPatternNext function to determine if the match is at the end of a
group.

Checking if the match is at the beginning of a group is a little
more naive, since it only looks at the previous character, rather than
ignoring comments and whitespace, but I haven't found a good way to
improve on that.
I realized the token handlers are equivalent, so I made them a named
function instead.
@josephfrazier
Copy link
Collaborator Author

josephfrazier commented Mar 28, 2017

Thanks for the thorough review! I see that my original solution wasn't correct, with the (.(\(?:)) case you pointed out, and I agree that it'd be better to avoid inserting the extra (?:) in the first place, rather than removing them after the fact.

Accordingly, I've force-pushed a different set of commits into this branch that uses the alternative strategy, and additionally tests the (.(\(?:)) case for good measure. This required some refactoring before the meaningful changes, and I also did a small refactoring after, so it's probably easiest to review the commits one-by-one. I tried to make the commit messages informative as well. Let me know what you think!

EDIT: Oh, as far as perf goes: I ran the test page several times with both this (updated) version of the code, as well as version 3.1.1 using the ?version=3.1.1 parameter, and I got between 115 and 120 thousand ops/sec on each run of the Constructor with short pattern - XRegExp with pattern cache flush test, so I don't think perf is significantly impacted by these changes.

Use `new` with RegExp constructor, as is done everywhere else.
@slevithan
Copy link
Owner

slevithan commented Apr 11, 2017

Thanks! I still need to look over the new set of diffs closely, but I love the direction of no longer inserting (?:) where it isn't needed.

Aside: I should check if these lines in build.js are still needed after the changes here.

var regex = XRegExp('( [0-9]{4} ) -? # year \n' +
'( [0-9]{2} ) -? # month \n' +
'( [0-9]{2} ) # day ', 'x');
expect(regex.source).toEqual('([0-9]{4})(?:)-?(?:)([0-9]{2})(?:)-?(?:)([0-9]{2})(?:)');
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is enforcing the inclusion of multiple (?:) empty groups that aren't needed for this regex to operate correctly. The test should be re-framed to not enforce anything that is unnecessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yeah it is a bit brittle. We could change it to reject certain substrings, but then we might end up duplicating some of the logic in the non-test code. What if we changed it to something a little simpler, but still future-proof, like this?

expect(regex.source.length <= 54).toBe(true); // 54 is the length of the current result

@slevithan slevithan merged commit cd0d192 into slevithan:master Apr 16, 2017
@slevithan
Copy link
Owner

slevithan commented Apr 16, 2017

Testing string length wouldn't verify that it's working. I've gone ahead and updated this in 622aaf3 to use a reduced test case.

As a result of these changes, the "Constructor with x flag, whitespace, and comments" perf test is now meaningfully slower than in v3.1.1. It would be easy to create examples that are even more affected, since each regex token that triggers the new code will be slower. I'll try to look at speeding this back up later, probably after v3.2.0. A couple ideas: avoid the string concatenation in isPatternNext (possibly going back to regex literals and making the function specific to quantifiers again even though the current code is more readable/maintainable, since this isn't needed to handle simple cases with whitespace followed by )), and update runTokens to make already-processed tokens available to token handler functions. The latter change would add a generally useful feature to custom XRegExp tokens handlers and also make it easier to add support for more cases where (?:) shouldn't be inserted--e.g. after an opening (?: or (?<n>.

josephfrazier added a commit to josephfrazier/xregexp that referenced this pull request Apr 25, 2017
This makes the "Constructor with x flag, whitespace, and comments" test
fast again. From slevithan#164 (comment):

> A couple ideas: avoid the string concatenation in `isPatternNext`
> (possibly going back to regex literals and making the function specific
> to quantifiers again even though the current code is more
> readable/maintainable, since this isn't needed to handle simple cases
> with whitespace followed by `)`)

Since babel-plugin-transform-xregexp automatically compiles the `new
RegExp()` calls into literals, we get (most of) the performance back
without sacrificing the readability of having separate subpatterns.
slevithan pushed a commit that referenced this pull request Apr 25, 2017
This makes the "Constructor with x flag, whitespace, and comments" test
fast again. From #164 (comment):

> A couple ideas: avoid the string concatenation in `isPatternNext`
> (possibly going back to regex literals and making the function specific
> to quantifiers again even though the current code is more
> readable/maintainable, since this isn't needed to handle simple cases
> with whitespace followed by `)`)

Since babel-plugin-transform-xregexp automatically compiles the `new
RegExp()` calls into literals, we get (most of) the performance back
without sacrificing the readability of having separate subpatterns.
@josephfrazier josephfrazier deleted the strip-useless-groups branch May 1, 2017 13:03
josephfrazier added a commit to josephfrazier/xregexp that referenced this pull request May 1, 2017
Following up on slevithan#164, this
change prevents a `(?:)` from being inserted in the following places:

* At the beginning of a non-capturing group (the end is already handled)
* Before or after a `|`
* At the beginning or the end of the pattern

This solution isn't as complete as the one suggested in
slevithan#179, but it's a decent
stopgap.
speecyy added a commit to speecyy/xregexp that referenced this pull request Jan 6, 2018
This makes the "Constructor with x flag, whitespace, and comments" test
fast again. From slevithan/xregexp#164 (comment):

> A couple ideas: avoid the string concatenation in `isPatternNext`
> (possibly going back to regex literals and making the function specific
> to quantifiers again even though the current code is more
> readable/maintainable, since this isn't needed to handle simple cases
> with whitespace followed by `)`)

Since babel-plugin-transform-xregexp automatically compiles the `new
RegExp()` calls into literals, we get (most of) the performance back
without sacrificing the readability of having separate subpatterns.
3590212051 added a commit to 3590212051/POPmotion that referenced this pull request Jan 8, 2018
This makes the "Constructor with x flag, whitespace, and comments" test
fast again. From slevithan/xregexp#164 (comment):

> A couple ideas: avoid the string concatenation in `isPatternNext`
> (possibly going back to regex literals and making the function specific
> to quantifiers again even though the current code is more
> readable/maintainable, since this isn't needed to handle simple cases
> with whitespace followed by `)`)

Since babel-plugin-transform-xregexp automatically compiles the `new
RegExp()` calls into literals, we get (most of) the performance back
without sacrificing the readability of having separate subpatterns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants