Editorial: Formally disambiguate the non-Annex-B grammar #1727

Open
wants to merge 2 commits into
base: master
from

Conversation

@gibson042
Contributor

commented Oct 8, 2019

@waldemarhorwat recently expressed objections to moving the Annex B regular expression syntax into the main grammar because of its order-dependent productions. However, I found an example of the same already in the main grammar, for surrogate pairs in codepoint-based "Unicode" regular expressions. This PR inserts a lookahead assertion to correct that ambiguity, and also applies the same treatment to replace some NumericLiteral prose with a formal assertion.
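The order dependence for surrogate pairs can be observed from script. The snippet below is a minimal sketch, not part of the PR; the variable names are illustrative. It relies only on standard `u`-flag behavior, where a lead-surrogate escape immediately followed by a trail-surrogate escape denotes a single code point.

```javascript
// U+1D306 TETRAGRAM FOR CENTRE is encoded in UTF-16 as the pair D834 DF06.
const tetragram = "\u{1D306}";

// With the `u` flag, the two escapes form one atom (a single code point).
const matchesWholeCodePoint = /^\uD834\uDF06$/u.test(tetragram);

// A quantifier therefore repeats the whole code point, not just the trail.
const quantified = /^\uD834\uDF06{2}$/u;
const matchesRepeatedCodePoint = quantified.test(tetragram + tetragram);
const matchesRepeatedTrail = quantified.test("\uD834\uDF06\uDF06");

// Without the `u` flag, each escape is its own atom (a UTF-16 code unit),
// so the quantifier applies to the trail surrogate alone.
const legacyMatchesRepeatedTrail = /^\uD834\uDF06{2}$/.test("\uD834\uDF06\uDF06");

console.log({
  matchesWholeCodePoint,
  matchesRepeatedCodePoint,
  matchesRepeatedTrail,
  legacyMatchesRepeatedTrail,
});
```

If the pair could instead be derived as a lone lead surrogate followed by a lone trail surrogate, the quantified `u`-flag pattern would have a second reading; that is the ambiguity the lookahead is meant to rule out.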

[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate
[+U] RegExpUnicodeSurrogatePair
[+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits

@jmdyck

jmdyck Oct 8, 2019

Collaborator

RegExpUnicodeSurrogatePair generates a set of terminal sequences, each of length 11. This doesn't fit with the current definition of lookahead-restrictions.

[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate
[+U] RegExpUnicodeSurrogatePair
[+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits

@bakkot

bakkot Oct 8, 2019

Contributor

lookahead ∉ RegExpUnicodeSurrogatePair seems a little suspicious to me, since RegExpUnicodeSurrogatePair describes a set of sequences of terminals rather than just a set of terminals. As far as I know the only other place where there are sequences of length greater than 1 in a lookahead-restriction-set is the async function restriction in ExpressionStatement, and there the sequence is of length precisely two (and also it's written out more explicitly).

Could this instead be written as

[+U] `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate]
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate

? That feels clearer to me, if it's equivalent. (And if it's not, I'm confused.)

@gibson042

gibson042 Oct 8, 2019

Author Contributor

That is invalid (see below), but even with refactoring would still be a lookahead of six code points. And it's not more clear to me, but I would be willing to switch to it if there's consensus.

@gibson042

Contributor Author

commented Oct 8, 2019

There are currently three flavors of negative lookahead assertions, all defined in Grammar Notation (emphasis mine):

  • [lookahead ∉ set], in which "set can be written as a comma separated list of one or two element terminal sequences enclosed in curly brackets"
  • [lookahead ∉ set], in which—"for convenience"—set can be "written as a nonterminal, in which case it represents the set of all terminals to which that nonterminal could expand"
  • [lookahead ≠ terminal]

So `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate] would not be valid, because the ≠ notation is only allowed with a single terminal input element.
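To make the three flavors concrete, here is a rough sketch of how a parser might evaluate such a restriction over a list of terminals. The helper names (`lookaheadNotIn`, `lookaheadNotEqual`) and the sample set are hypothetical, chosen for illustration rather than taken from the specification.

```javascript
// Hypothetical helper: true when the upcoming terminals do NOT begin with
// any forbidden sequence, i.e. when the lookahead restriction is satisfied.
function lookaheadNotIn(terminals, position, forbiddenSequences) {
  return !forbiddenSequences.some((sequence) =>
    sequence.every((terminal, offset) => terminals[position + offset] === terminal)
  );
}

// [lookahead ≠ terminal] is the special case of a single one-element sequence.
function lookaheadNotEqual(terminals, position, terminal) {
  return lookaheadNotIn(terminals, position, [[terminal]]);
}

// An illustrative set written as one- and two-element terminal sequences.
const forbidden = [["{"], ["function"], ["class"], ["let", "["]];

const identifierAllowed = lookaheadNotIn(["x", "=", "1"], 0, forbidden);
const letBracketAllowed = lookaheadNotIn(["let", "[", "a", "]"], 0, forbidden);
const letAloneAllowed = lookaheadNotIn(["let", "=", "1"], 0, forbidden);

console.log({ identifierAllowed, letBracketAllowed, letAloneAllowed });
```

The cost of a check is bounded by the longest sequence in the set, which is why the length of the sequences a set may contain matters to the discussion.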

But addressing the more substantive point, I agree with @jmdyck that the intent is to bound lookahead, in particular limiting it to two terminals (corresponding with an LR(2) grammar). However, that is already not the case, for two reasons.

The first has to do with the preservation of LineTerminator input elements: a strict reading of section 5.1.2 requires unbounded lookahead for [lookahead ≠ `let [`], because let and [ could be separated by an arbitrary number of LineTerminator-replaced MultiLineComment sequences. Even a loose reading in which consecutive LineTerminator elements were collapsed would still require a lookahead of up to three elements (let, LineTerminator, [).

The second reason is the very section that I am changing. \uD834\uDF06 in a Unicode regular expression must be parsed as a single U+1D306 TETRAGRAM FOR CENTRE code point expanded from RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate `\u` TrailSurrogate, rather than as a U+D834 code point expanded from RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate followed by a U+DF06 code point expanded from RegExpUnicodeEscapeSequence_U :: `u` TrailSurrogate. The content enforcing that restriction is currently prose rather than formal lookahead semantics, even though an actual implementation is not permitted to recognize a RegExpUnicodeEscapeSequence_U :: `u` LeadSurrogate expansion without confirming that the following six code points of source text do not match `\u` TrailSurrogate, which is exactly equivalent to a lookahead assertion!
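The six-code-point check can be sketched as a small parsing step. This is an illustrative assumption about how an implementation might do it, not code from any engine; `readHex4Escape` and `parseUnicodeEscape` are hypothetical names, and the `\u{...}` form is deliberately left out.

```javascript
const isLeadSurrogate = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isTrailSurrogate = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Reads a `\uXXXX` escape starting at `index` in the pattern source text.
// Returns the numeric value, or null when no such escape is present there.
function readHex4Escape(source, index) {
  const match = /^\\u([0-9a-fA-F]{4})/.exec(source.slice(index, index + 6));
  return match ? parseInt(match[1], 16) : null;
}

// Hypothetical parser step for a `u`-flag pattern: a lead surrogate is only
// accepted on its own after looking six characters ahead and confirming that
// they are not `\u` TrailSurrogate.
function parseUnicodeEscape(source, index) {
  const first = readHex4Escape(source, index);
  if (first === null) return null;
  if (isLeadSurrogate(first)) {
    const second = readHex4Escape(source, index + 6);
    if (second !== null && isTrailSurrogate(second)) {
      const codePoint = (first - 0xd800) * 0x400 + (second - 0xdc00) + 0x10000;
      return { codePoint, length: 12 };
    }
  }
  return { codePoint: first, length: 6 };
}

// String.raw keeps the backslashes, so these are pattern source texts.
const pair = parseUnicodeEscape(String.raw`\uD834\uDF06`, 0);
const loneLead = parseUnicodeEscape(String.raw`\uD834\u0041`, 0);
console.log(pair, loneLead);
```

Here the pair consumes twelve characters and yields U+1D306, while the lead surrogate followed by `\u0041` consumes six and yields U+D834; either way, the branch taken depends on text beyond the escape being parsed.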

It is worth noting, but not necessarily compelling, that this applies to the Regular Expression grammar rather than to the syntactic grammar. Nevertheless, I believe that these particulars should be formally expressed where possible, even if that means admitting that ECMAScript is not as friendly to parse as it might otherwise appear to be (🙄). I am willing to update the Grammar Notation section accordingly if that is the consensus. But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out.

@ljharb requested a review from waldemarhorwat Oct 8, 2019
@jmdyck

Collaborator

commented Oct 8, 2019

> But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out.

I have a feeling that would be a bad idea: too much chance of unintended consequences. (Might depend on the exact formulation, though.) Currently, I think using a lookahead-restriction is the best option, along with any necessary changes to the Grammar Notation section.
