Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplification of ES2015 RegExp semantics #89

Closed
wants to merge 3 commits into from

Conversation

littledan
Copy link
Member

ES2015 made RegExps more generic for subclassing and monkeypatching in a few ways:

  • RegExp.prototype[Symbol.{split,search,replace,match}] can be overridden
  • RegExp.prototype.exec can be overridden to change behavior
  • Flag getters can be overridden (for some functions, but that doesn't take effect for the main RegExp).

This patch series makes the first mechanism the sole one which can be used. The main advantages are simplicity and performance. More detail is written here: https://github.com/littledan/conservative-regexp .

This is a moderately large normative change, so I'd like to discuss whether this makes sense at the November TC39 meeting.

In ES2015, regular expression methods like RegExp.prototype[Symbol.search]
invoked the receiver's "exec" method to get at the core of matching.
This patch removes that extra degree of genericity to call out to
the builtin RegExp exec implementation directly. Type checks are
added where appropriate to ensure that this direct call is safe.
In some places in the ES2015 RegExp specification, flag values are
read from [[OriginalFlags]], and in other places, they are read
by invoking Get() on the RegExp to get the flag value.
This inconsistency does not have any obvious benefits for subclassing
and may have some performance cost. This patch standardizes on
reading flag values from [[OriginalFlags]].
Previously, the ES spec matched RegExps using a RegExpExec internal
algorithm. This algorithm has a couple properties which don't
completely with its usages, leading to awkward workarounds:

- It modifies lastIndex
- There is no mechanism for passing additional flags

Now that "exec" is not called as a method, but rather RegExpExec
is directly invoked from methods like RegExp.prototype[Symbol.split],
it is possible to refactor RegExpExec to create InnerRegExpExec
which does not have these issues.

This patch does that refactoring and takes advantage of it to remove
the lastIndex "save and restore" from RegExp.prototype[Symbol.search]
and, more importantly, the extra RegExp allocation from
RegExp.prototype[Symbol.split]. This is a normative change, since
it affects when the lastIndex property is read and written.
@littledan
Copy link
Member Author

@bterlson @ljharb @allenwb What do you think?

@bterlson
Copy link
Member

/cc @goyakin

Will take a close look soon.

@allenwb
Copy link
Member

allenwb commented Oct 17, 2015

This isn't just a bug fix, it is a major functional change to the ES6 design for subclassing regular expressions. As such it should be presented through the staged TC39 proposal process.so it can have full TC39 visibility and scrutiny before ibeing present to integration into Ecma-262

@allenwb
Copy link
Member

allenwb commented Oct 17, 2015

I'll probably go into more details in additional comments, but to start this change proposal seems to reflect a fundamental misunderstanding of the ES6 RegExp subclass extensibility design.

An important feature of the overall design, is that subclasses typically only need to over-ride the exec method. All the other methods including including the algorithmically complex @@split, @@search, @@replace, @@match, do not normally need to be over-ridden.

This is a key part of the design. It means that the actually RegExp matching semantics are isolated in the exec method. That means that subclass that need to change the matching language (for example, by adding additional pattern features) or want to replace the actual pattern match engine only need to over-ride exec. In your proposal, such a subclass would not only have to over-ride exec but all current and future methods that internally need to do a pattern match.

This makes it significantly harder (and more error prone) to introduce RegExp subclasses that extends the pattern matching language and semantics. In addition, it will make it the future addition of new RegExp.prototype methods (for example matchAll) more problematic as it means that already deployed RegExp subclasses would inherit incorrect versions of such methods.

I'm not saying that the current design can't be improved, but any changes should preserve the basic idea that only a single method needs to be over-ridden to extend/modify/replace the regular expression pattern language and/or semantics. The RegExp instance copying and manipulation that is preformed within the algorithms arises from the use of exec as the primary over-ride point and its legacy API details in particular its access and modification of the state of the actual RegExp instance. Those manipulations are necessary to enable the other algorithms to internally use exec in place of directly calling the internal [[Matcher]] as they did in previous editions . On way to improve the design would be to add a new method (for now, let's call it Symbol.basicMatch) that takes flags and start position as parameters (instead of accessing instance state) and returns match state without performing any mutation of the actual RegExp instance. Arguably, it would have been better for ES6 to have been specified in the manner, but there was only so much that could be done in the the available development time and there were plenty of bigger fish to fry.

This specific proposal seems like a bad idea. It seems to trade away the long term utility of RegExp extensiblity for short term implementation considerations. Rather than doing that, it would be better to put the effort into refining the design in a way that preserves the original goals and address any implementation concerns you have.

@littledan
Copy link
Member Author

@allenwb Agreed, this is a normative change and needs to come to consensus at TC39. It's not clear exactly how/whether pull requests should move through the stages or not. Some pull requests implementing normative are being taken up without this process, just with a 1-time consensus at a TC39 meeting, for example my pull request about Number.prototype. But maybe it should be a little more formal, especially for a larger change like this. Maybe we should discuss this process question at the next meeting.

For goals, I agree it'd be good to make it easier for users to subclass RegExps easily if possible. However, it will also be important to not regress performance. Do you have particular use cases in mind for subclassing RegExp? This could help guide the design. I was imagining that it would be big, well-thought-out libraries that subclass RegExp, in which case it wouldn't be such a big problem to write a couple hundred lines of code to do split, etc. properly from scratch.

As for Symbol.basicMatch: I like this idea at a high level. What's a bit difficult to work out is the details. The important thing is that the 'y' flag needs to be passed in at some points, and the new lastIndex needs to be passed out (sometimes). Here's a strawman for how it could work:

Symbol.basicMatch takes two arguments, the string and whether the 'y' flag is forced, and returns the typical match object. The match object grows a new field, lastIndex, which contains either the new lastIndex value or null (to indicate that the lastIndex should not be changed).

Basically, this Symbol.basicMatch would correspond in behavior to InnerRegExpExec, except that there's a user-visible API change (that lastIndex is in the match object that users get back from all sorts of ReExp functions) and it's restricted to turning on the y flag (since that is the most practical one to manipulate later).

If we go down this path, then I'd argue that we don't need the other things like Symbol.match, Symbol.split, etc. If users can override the core of exec, then it makes sense for the higher things to be shared. I don't see any other places in Javascript where there are multiple nested override points in a new API like this.

What do you think?

@bterlson bterlson added the needs consensus This needs committee consensus before it can be eligible to be merged. label Oct 22, 2015
@littledan
Copy link
Member Author

Actually, I believe Symbol.basicMatch would have negative performance implications for V8. In our implementation of String.prototype.replace, for example, we make a single call into the RegExp engine to get all the matches. Making multiple calls to the Symbol.basicMatch method would be slower because the RegExp engine cannot find all matches faster through its own strategies. For example, some RegExps would lend themselves to Boyer-Moore style search, but if Symbol.basicMatch is observably called at all successive indices, then such a strategy would be much more difficult to take (where the engine would have to prove that no one was watching it do this).

@allenwb
Copy link
Member

allenwb commented Oct 25, 2015

@littledan

First, WRT use cases. There are many possible reasons for subclassing RegExp. You might want to extend (or restrict) the pattern syntax or added new flags. In doing so, you you might choose to translate the new pattern syntax (for some or all cases) so it works with the built-in match engine. Or, you might decide to implement your own matcher. These are all plausible and not unlike things that packages such as XRegExp have done in the past.

Reading both what you have actually said above and reading between the lines, I gather that your issue is that you have benchmarks (micro? synthetic?) where a naive ES6 spec-complaint implementation of String split/search/replace/match is performing worse than you current implementation and you are attributing this to the fact that in the ES6 spec these function dynamically dispatch (via property lookup of the exec method) of the regexp rather than directly calling an internal [[Match]] abstract operation, as in previous editions.

So, why haven't you optimized for the most common case? Isn't specialization (either static or dynamic) generally the key to creating a high perf JS engine? There is nothing stopping you from recognizing opportunities where @@replace (and the other similar RegExp methods) can be inlined into the String. replace method (etc.) and similar recognizing that the exec method can also be in-lined into those. In fact, with an appropriate set of specialization guards, doesn't that get you back to exactly what you have with your current ES5 level implementation?

I'd be much more sympathetic to issues that relating to parts of the spec. that you are finding get in the way of such optimizations. For example, I just noticed that the RegExp @@match, @@replace, and @@split methods all have a call to RegExpExec (the dynamic dispatch to the exec method) that occurs within loops. I'd agree with you if you were to argue those methods should probably only lookup exec once and then use that result within their loops.

@littledan
Copy link
Member Author

For use cases: XRegExp looks to me like a perfect example of a use case for overwriting everything. That's what my proposal provides support for. The proposal should also allow translating the pattern in the constructor and, after that, acting like the builtin RegExp.

We used to make a separate call out to the underlying engine for each time exec'ing the RegExp. Making a single call out to the RegExp engine for several matches improved performance by 20%. That optimization would be much more difficult to dynamically prove correct when it is calling out to an .exec method.

@allenwb
Copy link
Member

allenwb commented Oct 30, 2015

Requiring that everything has to be rewritten if you change the internals of the pattern matching engine would be a gross violation of the DRY principle and contrary to OO best practices for designing subclass extensible base classes.

RegExp in ES6 was designed for dual roles in a manner that is common in dynamic OO languages. It is an abstract base class in the sense that it defines an abstract interface along with generic implementations of the methods of that interface. It is also a concrete base class that provides a specific internal implementation of regular expression pattern matching that is exposed to the generic methods. A subclass may be created for the purpose of extending/modify the public interface while still using the inherited pattern matching implementation. A different subclass might be created for the purpose using a new or different internal pattern matcher while preserving the inherited interface and reusing the inherited generic method implementations. The ES6 RegExp supports both modes of subclassing. Your proposal only supports one of those use cases rather than both. That would be a significant regression of functionality from what was approved by TC39 for ES6.

We should not start from the premise that we need to change the set of use cases supported by RegExp in order to support your desired optimizations. Instead, we should look at what (if anything) in the current algorithms makes it difficult for you to optimize and try to find use case preserving fixes to the algorithms. I'm happy to work with you on that. I've already made some suggestions in previous comments, but I'll take another stab at it in another separate comment that I'll post shortly.

@domenic
Copy link
Member

domenic commented Oct 30, 2015

Allen, can you point to the minutes where TC39 discussed and approved the ES6 RegExp design? There wasn't any wiki proposal page for it that I recall. Maybe there was a presentation, with slides that were preserved? It was a pretty drastic increase in complexity and surface area for the API so it would be good to review the meeting minutes where all of these points were previously discussed and agreed upon.

@allenwb
Copy link
Member

allenwb commented Oct 30, 2015

The refactoring of the String match/replace/search/split and RegExp methods in ES6 was done to provide extension points that allow new classes (include RegExp subclasses) to easily integrate into the existing string APIs. However, it was also the intent that the refactoring should not change the observable behavior of String match/replace/search/split when used with standard built-in RegExp instances and the intrinsic REgExp.prototype methods. In that latter goal was actually achieved then all that should be needed to continue to use optimized ES5 implementations of String match/replace/search/split is the addition of guards that detect when the ES5 implementations are no longer applicable. In those cases, String match/replace/search need to fall back to the more general ES6 algorithms.

So, what sort of guards are needed. Lets look at String.prototype.replace. Assuming that we have an existing optimized version of that method in our ES5 implementation, what are the guards we need to identify situations where we can continue to use it. Here are the things we need to check for:

  • the searchValue object needs to be checked that it has the internal state of built-in RegExp instance (this is an implementation dependent check)
  • Accessing the "exec" property of searchValue must return the built-in RegExp.prototype.exec intrinsic. (this makes sure the standard match semantics are being used)
  • Accessing the @@replace property of searchValue must return the built-in RegExp.prototype.@@replace intrinsic. (this makes sure that the standard @@replace method semantics are intended to be used)

If those three conditions are meet, then you should be able to use you existing implementation. They allow you to essentially re-inline into String search the algorithm that was factored out to create RegExp @@search and to also use any direct calls to or inlining of the matching algorithm that you may have in your ES5 implementation.

Can you identify anything in the ES6 RegExp algorithms. that would make that inlining invalid?

@allenwb
Copy link
Member

allenwb commented Oct 30, 2015

@domenic I'm sure you aren't suggesting that TC39 didn't have an opportunity to review or approve the ES6 RegExp subclassing materials. Certainly that can't be the case, because spec. language for the basic RegExp refactoring existed literally for years within ES6 drafts and all member organizations voted at least twice (and presumably after reviewing the complete spec. to their satisfaction) to advance the ES6 spec. fo GA approval.

I'm not going to try to backtrace discussions of this through years of meeting notes es-discuss threads. But I will point out, the the basic approach to making subclassable built-ins for ES6 trace back to at least 2011. See http://web.archive.org/web/20141214075422/http://wiki.ecmascript.org/doku.php?id=strawman:es5_internal_nominal_typing

@domenic
Copy link
Member

domenic commented Oct 30, 2015

So there was no presentation of or discussion of the RegExp design? And there was no consensus on it besides the usual ability to vote against the whole spec in the GA?

@domenic
Copy link
Member

domenic commented Oct 30, 2015

I'll do the research later today, tracking their first introduction to the spec with contemporary meeting agendas where the proposal might have been presented.

@allenwb
Copy link
Member

allenwb commented Oct 30, 2015

@domenic no, the RegExp design was discussed many times. I don't recall whether or not there was a standalone presentation on it. Why is that relevant to this discussion.

@littledan
Copy link
Member Author

@allenwb I see how inlining could be done in principle, but this is rather difficult for real implementations to do in a situation like this. The V8 RegExp engine (which is shared by Firefox) compiles the RegExp into an NFA implemented in native code. Calls into this code have a certain amount of cost to set up registers for its own calling convention. No inlining is done across this boundary between the two distinct compilers--inlining is done in V8 only for JavaScript code and a fixed set of intrinisics currently. This behavior works out well for us for now, but ES2015 would require a much more complicated implementation strategy to do correctly without a performance regression.

@allenwb
Copy link
Member

allenwb commented Nov 2, 2015

@littledan Just to be clear, I wasn't suggesting that you necessarily need to do dynamic inlining. You have a version of the ES5-level string methods that apparently already have the level of integration you want with your RegExp engine. For the default cases, the ES6 spec. should produce the same results as those existing methods. So, can't you just hand code the necessary guards such that in the default situations your ES6 implementations of String.prototype.match/replace/earch/split just call your existing optimized versions and only fall back to the full ES6 semantics when the guards fail.

Obviously there is a small cost associated with such guards but I've have to see hard evidence before I'll believe that that cost is in any way significant within real applications.

@wycats
Copy link
Contributor

wycats commented Nov 11, 2015

So there was no presentation of or discussion of the RegExp design?

I remember this particular detail coming up a number of times. I also remember general consensus on the idea of exec as a "kernel" method for RegExp subclassing.

@BrendanEich
Copy link

V8 / Google folks: we certainly discussed this, I made a point of airing all the issues. ES6 is done. I don't think you can argue "errata" to make this kind of change.

/be

@littledan
Copy link
Member Author

@littledan
Copy link
Member Author

This was proposal rejected at TC39.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs consensus This needs committee consensus before it can be eligible to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants