Skip to content
This repository has been archived by the owner on Jan 25, 2022. It is now read-only.

Always add "groups" to regexp-result to make it easier to speed-up the common case? #34

Closed
anba opened this issue Oct 23, 2017 · 28 comments

Comments

@anba
Copy link

anba commented Oct 23, 2017

I wonder if the following change for RegExpBuiltinExec will make it easier to improve the performance of GetSubstitution for the common case when only normal RegExp objects are used in RegExp.prototype[@@replace].

  24. If R contains any GroupName,
    a. Let groups be ObjectCreate(null).
    b. Perform ! CreateDataProperty(A, "groups", groups).
+ 25. Else,
+   a. Perform ! CreateDataProperty(A, "groups", undefined).

With the current semantics, implementations always need to lookup "groups" on %ArrayPrototype% and %ObjectPrototype% before using an optimized fast path. For example this test case

Object.prototype.groups = {
    foo: "bar"
};
print("--abc--".replace(/abc/, "$<foo>"));

prints "--bar--" with the current spec proposal.

@littledan
Copy link
Member

littledan commented Oct 23, 2017

I believe we discussed this possibility at some point, and decided to not do it in favor of keeping semantics entirely the same when there are no groups.

Implementers, would this change affect performance of your engines? cc @jgruber @bterlson @msaboff @tschneidereit

@getify
Copy link

getify commented Nov 11, 2017

Setting aside performance reasons, I would strongly urge inclusion of groups in every result object, with any other value besides null or undefined (or just simply not present), so that we don't have to do ternary/|| checks to access it (or wait for ?. optional chaining), nor do we have to do destructuring defaults.

Even if it was false (which has a decent semantic to it), that would mean you could say v.groups.foo or ({ groups: { foo } } = v) without fear of runtime failure.

Please don't make this an unnecessary hazard.

@littledan
Copy link
Member

@getify Could you mention a use case where you'd really want to read a group of a particular name, but while not knowing what the RegExp is?

@getify
Copy link

getify commented Nov 17, 2017

3 use cases:

  1. I built a UI for a web app once where we let users pick from (click on) any of various items in a word cloud, and whichever ones selected were adding elements to a list, which would then be processed to make a dynamic regex, that was then applied to filter a result set. The UI also let them type their own word in.

    I can imagine, for example, that the preset words add an element that, in the final regex, does a named capture group, whereas any user-typed in words would be in an unnamed group (or vice versa).

    It's therefore impossible to predict ahead of time if the regex you're processing with will have named groups or not. You'd like to be able to process, say with destructuring and some defaults, whether or not it has named groups, or falling back to indexed groups, etc.

  2. When the regex is used to produce a result set, and a different piece of code processes the result set from where the result is created:

    results
    .map( processWithRegex )
    .filter( filterOutSomeStuff );

    Say processWithRegex(..) has different conditions where it uses different regexes, some with named groups, some without, and filterOutSomeStuff(..) simply takes regex-results and checks them for certain contents to decide if they're allowed in the final list or not... at the moment it's running, it has no idea which regex was used to produce it, so it would like to be able to gracefully check for named groups in the result, or fall back to checking indexed groups if not present.

  3. When the regex is specified as an input to a function that will simply use it to process, the processor has relatively little guarantee what kind of regex it receives:

    var filterOutDateMonth = (re,monthNum) => list => list.filter( v => {
         var res = re.exec(v);
         if (res.groups.month == monthNum or res[1] == monthNum) return false;
         return true;
      } );
    
    pipe(
       filterOutDateMonth( /^(?<month>\d{2}/, 10 ),
       filterOutDateMonth( /^(\d{1,2}(?=\/))?/, 7 )
    )( ["7/19/80", "07-4-521431", "10-31-2017" ] );

@mathiasbynens
Copy link
Member

Avoiding prototype chain lookups seems sensible to me, and this is indeed what both V8 and JSC implement. I’m in favor of this change. WDYT, @schuay @hashseed @msaboff @tschneidereit @bterlson?

@ljharb
Copy link
Member

ljharb commented Dec 8, 2017

So it would be an own property on regexes with groups, and there'd be a prototype fallback that returned what, an empty object? If so, would it be the same one every time, would it be frozen, etc?

@getify
Copy link

getify commented Dec 8, 2017

As a developer, I think it'd be nice if there was a way to tell cleanly that no named groups were matched (other than checking for Object.keys() being empty). If groups was always an owned property on the result object, and it was either an object if there were named groups matched, or false if none were used/matched, that would give a decent way of safely checking. results.groups would be falsy if no match, but results.groups.foo would safely work and also be falsy. The problem with null or undefined is that the .foo property access would be an error.

Edit: Unfortunately, in is not allowed against false, so "foo" in results.groups would still be unsafe.

@tschneidereit
Copy link
Member

I agree that avoiding the prototype chain lookups makes sense, yes.

I don't have a particularly strong opinion on whether the value of the property should be undefined or something else as @getify requests. If the latter, it should probably be an empty null-__proto__ object.

@jandem, please speak up if for whatever reason this wouldn't be good for SpiderMonkey.

@ljharb
Copy link
Member

ljharb commented Dec 9, 2017

@tschneidereit if it's an empty null object, should it be frozen and always the same object? or newly created every time?

@tschneidereit
Copy link
Member

@tschneidereit if it's an empty null object, should it be frozen and always the same object? or newly created every time?

I don't have a particularly strong opinion on this. It being the same object would certainly be more performant, but I don't know how much it'd matter, or whether there might be downsides I'm not immediately seeing.

@msaboff
Copy link

msaboff commented Dec 9, 2017

Avoiding prototype chain lookups seems sensible to me, and this is indeed what both V8 and JSC implement. I’m in favor of this change. WDYT, @schuay @hashseed @msaboff @tschneidereit @bterlson?

I'm fine with this spec change.

Concerning @getify's suggestion of a group being set to false when none of the groups match, it is somewhat late to make this type of change. This suggestion doesn't provide improvements to the more likely case of some groups matching while others don't. I also don't like the the duality of results.group depending on access.

The suggestion to always provide group, even for match results of expressions without named captures would be a breaking change.

@littledan
Copy link
Member

littledan commented Dec 10, 2017

For @getify 's concern, we discussed this earlier in #12 . Would the use cases you mention be satisfied by the proposal for the ?. operator, if such an operator is used with the groups property?

For @anba's suggestion, it seems reasonable to me. I didn't understand the fast path correctness issue on the first reading--that does add some kinds of motivation, though at the same time, it's not the completely unsolvable--in theory, the RegExp fastpath could be skipped when a groups property is added (though I am not happy about this requirement to check and resulting performance cliff at all).

To clarify with respect to @mathiasbynens 's comment, @anba 's proposal does not describe V8's current implementation. Actually, this is an edge case where V8 doesn't currently have consistent behavior. You have to somehow trick V8 into taking the slow path to get the spec's behavior if you want groups higher in the prototype chain to show up, as in the following transcript:

d8> Array.prototype.groups = { foo: "bar" }
{foo: "bar"}
d8> print("--abc--".replace(/abc/, "$<foo>"));
--$<foo>--
undefined
d8> RegExp.prototype.x = 1;
1
d8> print("--abc--".replace(/abc/, "$<foo>"));
--bar--
undefined

I would've hoped that we could only look at the groups property for RegExps which use groups. However, given the generality made to support RegExp subclassing, the only way I could think to make this work is with another flag-like thing, which just seems unnecessarily complicated. So @anba 's suggestion seems decent; I'm just slightly worried that there will be some measurable performance cost to adding this extra property to all existing RegExps.

@schuay @hashseed what do you think?

@hashseed
Copy link

hashseed commented Dec 11, 2017

FWIW, in V8's fast path for RegExp.prototype.@@replace, we do not actually materialize the matched object, so

Object.prototype.groups = {
    foo: "bar"
};
print("--abc--".replace(/(abc)/, "$<foo>"));

prints --$<foo>--

However, in the slow path, we look up the prototype chain as per spec, and print --bar--. I filed a bug in V8 for this.

Due to this, I think it makes sense to either

  1. always create the groups property.
  2. only do a own-property look up for groups.

I slightly lean towards (2) because (1) affects the behavior of existing regexps that do not use named capture group syntax.

@littledan
Copy link
Member

2, is interesting. It'd be new ground--I don't know about another place in the spec that we do an own-property lookup. Should a getter work there?

@hashseed
Copy link

Option 3: groups has a null prototype?

@schuay
Copy link

schuay commented Dec 11, 2017

Nice find!

I'd be ok with both 1. and 2. as proposed above. 1. would give us the additional advantage of having not having two different RegExpResult shapes (one with, one without the groups property).

Note that with 1., RegExp.p.exec could still be overridden and return result objects of arbitrary shape, leading to a possible prototype lookup later on.

@littledan there are one or two spots that check HasOwnProperty before calling Get, e.g.: https://tc39.github.io/ecma262/#sec-function.prototype.bind.

@schuay
Copy link

schuay commented Dec 11, 2017

Option 3: groups has a null prototype?

It already does. But that's a separate topic (prototype-lookup for properties on the groups object vs. the 'groups' property on the RegExpResult).

@ljharb
Copy link
Member

ljharb commented Dec 11, 2017

@schuay with option 1, what value would you expect when there’s no named captures in the regex?

@schuay
Copy link

schuay commented Dec 11, 2017

@schuay with option 1, what value would you expect when there’s no named captures in the regex?

My gut feeling would be undefined as suggested by @anba.

@getify
Copy link

getify commented Dec 11, 2017

@littledan the problem, as I mentioned in my first post of this thread, is that the ?. operator is much earlier in its stage progression, which means there will likely be a significant gap between named captures and that operator... maybe a year or two. If the two features were held and shipped simultaneously, that'd be no concern, but that seems unlikely.

In practice, my prediction is this will mean the suboptimal solutions (like || {}) will become entrenched in a lot of code (code which could take many years before it's updated to use ?. )... that concern is doubly so for code which doesn't actually need these hacks but may suffer them anyway, as this hazard seems similar to others before it that have spread more as cult-myth rather than based on actual justification.

Of course, that also means future optimization hurdles for engines, as those kinds of hacks are harder to statically detect.

I'm just basically saying that it's a hazard/footgun we can reasonably predict, so designing it into the language is unfortunate when we could tweak just a little bit and avoid/minimize it.

Regarding the "breaking change" claims... can someone provide any real example/use-case where code will change/break if 'groups' shows up on all regex result objects (not the regexp objects themselves)? And specifically, if the worry is that code (libraries/etc) annotates regexp results with extra properties and may collide with this new "groups" property, how is that concern unique only to existing code and NOT a hazard if named-group regexes start being passed around into those utilities? Seems like the hazard would be universal (and quite low).

I'm failing to imagine any such legitimate "breaking" case, but I provided 3 real non-imagined counter cases above.

@littledan
Copy link
Member

@getify I see the concern with || getting used. I'm not so happy about this either. But if ?. will solve the entire problem, I'm still a little skeptical. Do you recommend that JS programmers refrain from using null or undefined in their libraries, JSON, etc and instead use empty objects for a similar reason? I'm not so concerned about the compatibility aspect of using {}, more the performance impact and complexity of the solution.

@schuay Oh, I see your point about HasOwnProperty/Get; I was picturing somehow using GetOwnProperty. The two are observably different, and I don't think anyone does the latter pattern, but yes, we could do that. Does anyone have a preference between option 1 and 2 then? If no one has any preference at all, I'll do 1 since it seems very marginally simpler.

@bakkot
Copy link

bakkot commented Dec 11, 2017

@littledan FWIW I am also skeptical of designing things on the assumption that ?. or something like it will eventually make it into the language. I am not convinced that it will, personally, given how difficult it's been to come up with any kind of consistent semantics for it.

That said, I agree that it would be weird to design this specific API to avoid non-ObjectCoercible values to make it easier for programmers to not worry about whether or not the object exists before trying to read properties of it, since the vast majority of the APIs in JavaScript and the web platform were not designed that way - including notably String.prototype.match. I don't think it's that much of a hazard: undefined is just how the language indicates "not present".

@getify
Copy link

getify commented Dec 11, 2017

@littledan

But if ?. will solve the entire problem, I'm still a little skeptical.

I understand. I wish ?. was further along, or could be pushed forward faster, to alleviate these kinds of concerns.

But I also think we shouldn't design one feature contingent on another entirely unrelated proposal-feature working perfectly and shipping cleanly. Hard to predict/correlate unrelated paths through this process.

Do you recommend that JS programmers refrain from using null or undefined in their libraries, JSON, etc and instead use empty objects for a similar reason?

I indeed would consider it very sketchy (and discouraged) if someone designed an API where, based on run-time conditions/inputs, an object on the public API was either present or not present.

Imagine for a moment that jQuery had done something like this with their default Ajax settings... like instead of their ajaxSetup(..) function, if they'd said $.ajaxOptions is by default undefined, but if you set it to an object (or call the function), it gets populated with an object holding those defaults. Wouldn't that seem like an entirely strange/suspect/error-prone API design decision?

Think how many jQuery users could potentially trip over trying to make assignments like $.ajaxOptions.method = "POST"; and having those assignments potentially fail? And imagine all the crufty hackery that would have set in (and been in code bases for a decade+), like $.ajaxOptions = ($.ajaxOptions || {}).method = "POST";, etc. The far more sane/robust thing would be for $.ajaxOptions to always start off as an empty object.

And my sentiment here is a hundred fold stronger when we're talking about built-in language stuff, which can never change. jQuery could theoretically realize this mistake later and deprecate/fix it. JS can't.

What if function foo() { .. } made a function who's .prototype was undefined, and where you had to make/assign it before you could have added to it? Same deal, that would have been a much stronger hazard than just defaulting it to an empty object.

@ljharb
Copy link
Member

ljharb commented Dec 11, 2017

@littledan fwiw, the spec doesn't tend to use GetOwnProperty, but things like Object.assign and Object.{keys,values,entries} and Object.freeze all look at only own properties, which is effectively the same.

@getify
Copy link

getify commented Dec 11, 2017

@bakkot

since the vast majority of the APIs in JavaScript and the web platform were not designed that way

Can you give any examples from JS or the web platform, as precedent, where an object is either present or absent on some API/namespace, depending on how the API is used (or other runtime conditions)? Not return values from functions (match(..)), but actual API members. I've tried to think of such examples, and haven't come up with any.

@ljharb
Copy link
Member

ljharb commented Dec 11, 2017

@getify something could primarily be conditional if it's the return value from a function (including a getter); but sure, arrow functions lack .prototype while normal functions have it set to an empty object.

@bakkot
Copy link

bakkot commented Dec 11, 2017

@getify

Not return values from functions (match(..)), but actual API members.

I'm not sure I understand why this distinction is important. Given that we don't have proper support for returning multiple values and therefore get by with returning objects with multiple properties, a property being present or absent on the single value returned from a function amounts to the same thing as the function returning multiple values one of which may be present or absent.

That is to say, I don't really understand why str.match(r).groups only conditionally being an object (vs undefined or null) is significantly different from str.match(r) only conditionally being an object (vs undefined or null).

In any case: the body property of Response objects produced by fetch is null if the Response is opaque (e.g. if it was from a cross-origin no-cors fetch), and is otherwise a ReadableStream.

@littledan
Copy link
Member

littledan commented Dec 12, 2017

Can you give any examples from JS or the web platform, as precedent, where an object is either present or absent on some API/namespace, depending on how the API is used (or other runtime conditions)?

Things which might be either null or another value is so core to the web platform that WebIDL's type system has nullable types built in for that use case. If you want to see some places where it's used, search around in the HTML spec for a question mark. If it's in an IDL block, that's a case that might be null or might be an object or other value, depending on runtime conditions.

littledan added a commit that referenced this issue Dec 12, 2017
If the RegExp does not have named groups, set a groups property
of the result to undefined. This change is made to facilitate
optimizing RegExp implementations, which often take a "fast path"
for built-in, unmodified to not incur the overhead of the full
subclassable semantics. If the "groups" property may be read
from higher up on the prototype chain, the fast path for
RegExp.prototype.replace would become more complex or slower, or
alternatively, more conditions about the environment might need
to be checked as a precondition to apply the fast path.

An alternate approach would be to only read an own groups property
based on a HasOwnProperty test followed by a Get. I don't see big
advantages or disadvantages to that approach vs this one, and I'd
be fine to revisit this patch if more differentiating factors
are raised before Stage 4.

Closes #34
littledan added a commit that referenced this issue Dec 12, 2017
If the RegExp does not have named groups, set a groups property
of the result to undefined. This change is made to facilitate
optimizing RegExp implementations, which often take a "fast path"
for built-in, unmodified to not incur the overhead of the full
subclassable semantics. If the "groups" property may be read
from higher up on the prototype chain, the fast path for
RegExp.prototype.replace would become more complex or slower, or
alternatively, more conditions about the environment might need
to be checked as a precondition to apply the fast path.

An alternate approach would be to only read an own groups property
based on a HasOwnProperty test followed by a Get. I don't see big
advantages or disadvantages to that approach vs this one, and I'd
be fine to revisit this patch if more differentiating factors
are raised before Stage 4.

Closes #34
littledan added a commit that referenced this issue Dec 12, 2017
If the RegExp does not have named groups, set a groups property
of the result to undefined. This change is made to facilitate
optimizing RegExp implementations, which often take a "fast path"
for built-in, unmodified to not incur the overhead of the full
subclassable semantics. If the "groups" property may be read
from higher up on the prototype chain, the fast path for
RegExp.prototype.replace would become more complex or slower, or
alternatively, more conditions about the environment might need
to be checked as a precondition to apply the fast path.

An alternate approach would be to only read an own groups property
based on a HasOwnProperty test followed by a Get. I don't see big
advantages or disadvantages to that approach vs this one, and I'd
be fine to revisit this patch if more differentiating factors
are raised before Stage 4.

Closes #34
pull bot pushed a commit to sunwenli/webkit that referenced this issue Nov 12, 2019
https://bugs.webkit.org/show_bug.cgi?id=204067

Patch by Alexey Shvayka <shvaikalesh@gmail.com> on 2019-11-12
Reviewed by Ross Kirsling.

JSTests:

* test262/expectations.yaml: Mark 4 test cases as passing.

Source/JavaScriptCore:

After RegExp named capture groups were initially implemented in JSC, the spec was changed
to unconditionally create "groups" property.
(tc39/proposal-regexp-named-groups#34)

This patch implements the change (that was shipped by V8), reducing number of structures
we use for RegExpMatchesArray, and also sets [[Prototype]] of "groups" object to `null`.
(step 24 of https://tc39.es/ecma262/#sec-regexpbuiltinexec)

* dfg/DFGAbstractInterpreterInlines.h:
(JSC::DFG::AbstractInterpreter<AbstractStateType>::executeEffects):
* dfg/DFGStrengthReductionPhase.cpp:
(JSC::DFG::StrengthReductionPhase::handleNode):
* runtime/JSGlobalObject.cpp:
(JSC::JSGlobalObject::init):
(JSC::JSGlobalObject::fireWatchpointAndMakeAllArrayStructuresSlowPut):
(JSC::JSGlobalObject::visitChildren):
* runtime/JSGlobalObject.h:
(JSC::JSGlobalObject::regExpMatchesArrayStructure const):
(JSC::JSGlobalObject::regExpMatchesArrayWithGroupsStructure const): Deleted.
* runtime/RegExpMatchesArray.cpp:
(JSC::createStructureImpl):
(JSC::createRegExpMatchesArrayWithGroupsStructure): Deleted.
(JSC::createRegExpMatchesArrayWithGroupsSlowPutStructure): Deleted.
* runtime/RegExpMatchesArray.h:
(JSC::createRegExpMatchesArray):
* runtime/StringPrototype.cpp:
(JSC::replaceUsingRegExpSearch):

git-svn-id: http://svn.webkit.org/repository/webkit/trunk@252374 268f45cc-cd09-0410-ab3c-d52691b4dbfc
ryanhaddad pushed a commit to WebKit/WebKit that referenced this issue Dec 22, 2020
https://bugs.webkit.org/show_bug.cgi?id=204067

Patch by Alexey Shvayka <shvaikalesh@gmail.com> on 2019-11-12
Reviewed by Ross Kirsling.

JSTests:

* test262/expectations.yaml: Mark 4 test cases as passing.

Source/JavaScriptCore:

After RegExp named capture groups were initially implemented in JSC, the spec was changed
to unconditionally create "groups" property.
(tc39/proposal-regexp-named-groups#34)

This patch implements the change (that was shipped by V8), reducing number of structures
we use for RegExpMatchesArray, and also sets [[Prototype]] of "groups" object to `null`.
(step 24 of https://tc39.es/ecma262/#sec-regexpbuiltinexec)

* dfg/DFGAbstractInterpreterInlines.h:
(JSC::DFG::AbstractInterpreter<AbstractStateType>::executeEffects):
* dfg/DFGStrengthReductionPhase.cpp:
(JSC::DFG::StrengthReductionPhase::handleNode):
* runtime/JSGlobalObject.cpp:
(JSC::JSGlobalObject::init):
(JSC::JSGlobalObject::fireWatchpointAndMakeAllArrayStructuresSlowPut):
(JSC::JSGlobalObject::visitChildren):
* runtime/JSGlobalObject.h:
(JSC::JSGlobalObject::regExpMatchesArrayStructure const):
(JSC::JSGlobalObject::regExpMatchesArrayWithGroupsStructure const): Deleted.
* runtime/RegExpMatchesArray.cpp:
(JSC::createStructureImpl):
(JSC::createRegExpMatchesArrayWithGroupsStructure): Deleted.
(JSC::createRegExpMatchesArrayWithGroupsSlowPutStructure): Deleted.
* runtime/RegExpMatchesArray.h:
(JSC::createRegExpMatchesArray):
* runtime/StringPrototype.cpp:
(JSC::replaceUsingRegExpSearch):

Canonical link: https://commits.webkit.org/217426@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@252374 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants