Editorial: Formally disambiguate the non-Annex-B grammar #1727

Open
wants to merge 2 commits into
base: main
65 changes: 21 additions & 44 deletions spec.html
@@ -10858,10 +10858,13 @@ <h2>Syntax</h2>
CommonToken ::
IdentifierName
Punctuator
NumericLiteral
NumericLiteral [lookahead &notin; DecimalDigit] [lookahead &notin; IdentifierStart]
StringLiteral
Template
</emu-grammar>
<emu-note>
<p>The lookahead restrictions for |NumericLiteral| require that source text like `3in` is rejected rather than processed as the two input elements `3` and `in`.</p>
</emu-note>
<emu-note>
<p>The |DivPunctuator|, |RegularExpressionLiteral|, |RightBracePunctuator|, and |TemplateSubstitutionTail| productions derive additional tokens that are not included in the |CommonToken| production.</p>
</emu-note>
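Reviewer aside on the `3in` note in this hunk: the behaviour it describes can be observed in an existing engine. The following is a minimal sketch, assuming a host whose `Function` constructor parses its argument as a function body. The helper name `parsesAsFunctionBody` is hypothetical and is not part of the spec or of this PR.

```typescript
// Hypothetical helper: reports whether `body` is accepted as a function body.
// It only parses the text; the resulting function is never called.
function parsesAsFunctionBody(body: string): boolean {
  try {
    new Function(body);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything other than a syntax error is unexpected here
  }
}

// With whitespace, `3` and `in` are separate input elements.
console.log(parsesAsFunctionBody("return 3 in {};")); // true

// Without whitespace, the NumericLiteral is immediately followed by an
// IdentifierStart, so the source text is rejected instead of being split.
console.log(parsesAsFunctionBody("return 3in {};")); // false
```

The trailing `{}` operand matters: without it, `3in` would be rejected even if it were split into `3` and `in`, so the check would not isolate the lookahead restriction.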
@@ -11164,10 +11167,6 @@ <h2>Syntax</h2>
HexDigit :: one of
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `a` `b` `c` `d` `e` `f` `A` `B` `C` `D` `E` `F`
</emu-grammar>
<p>The |SourceCharacter| immediately following a |NumericLiteral| must not be an |IdentifierStart| or |DecimalDigit|.</p>
<emu-note>
<p>For example: `3in` is an error and not the two input elements `3` and `in`.</p>
</emu-note>
<p>A conforming implementation, when processing strict mode code, must not extend, as described in <emu-xref href="#sec-additional-syntax-numeric-literals"></emu-xref>, the syntax of |NumericLiteral| to include <emu-xref href="#prod-annexB-LegacyOctalIntegerLiteral"></emu-xref>, nor extend the syntax of |DecimalIntegerLiteral| to include <emu-xref href="#prod-annexB-NonOctalDecimalIntegerLiteral"></emu-xref>.</p>

<emu-clause id="sec-static-semantics-mv">
@@ -30493,24 +30492,20 @@ <h2>Syntax</h2>
&lt;ZWJ&gt;

RegExpUnicodeEscapeSequence[U] ::
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` LeadSurrogate
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate
[+U] RegExpUnicodeSurrogatePair
[+U] [lookahead &notin; RegExpUnicodeSurrogatePair] `u` Hex4Digits
Collaborator

RegExpUnicodeSurrogatePair generates a set of terminal sequences, each of length 11. This doesn't fit with the current definition of lookahead-restrictions.

Contributor

lookahead &notin; RegExpUnicodeSurrogatePair seems a little suspicious to me, since RegExpUnicodeSurrogatePair describes a set of sequences of terminals rather than just a set of terminals. As far as I know the only other place where there are sequences of length greater than 1 in a lookahead-restriction-set is the async function restriction in ExpressionStatement, and there the sequence is of length precisely two (and also it's written out more explicitly).

Could this instead be written as

[+U] `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate]
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate

? That feels clearer to me, if it's equivalent. (And if it's not, I'm confused.)
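To make the suggested four alternatives concrete, here is a minimal sketch of a matcher that picks one alternative for the text following the backslash. It is only an illustration of the proposal as written in this comment, not spec text; the names `hex4At`, `matchEscape`, and the `alternative` labels are hypothetical.

```typescript
// Reads four hex digits starting at index i, or returns null.
function hex4At(s: string, i: number): number | null {
  const digits = s.slice(i, i + 4);
  return /^[0-9a-fA-F]{4}$/.test(digits) ? parseInt(digits, 16) : null;
}

const isLead = (v: number): boolean => v >= 0xd800 && v <= 0xdbff;
const isTrail = (v: number): boolean => v >= 0xdc00 && v <= 0xdfff;

interface EscapeMatch {
  alternative: "pair" | "lead" | "trail" | "nonSurrogate";
  length: number; // number of characters consumed after the backslash
}

// `s` is the text immediately after the backslash that starts the escape.
function matchEscape(s: string): EscapeMatch | null {
  if (s[0] !== "u") return null;
  const first = hex4At(s, 1);
  if (first === null) return null;
  if (isLead(first)) {
    // Does the text after `u` LeadSurrogate start with `\u` TrailSurrogate?
    const second = s.slice(5, 7) === "\\u" ? hex4At(s, 7) : null;
    if (second !== null && isTrail(second)) {
      return { alternative: "pair", length: 11 };
    }
    // [lookahead ≠ `\u` TrailSurrogate] holds, so the lone lead applies.
    return { alternative: "lead", length: 5 };
  }
  return { alternative: isTrail(first) ? "trail" : "nonSurrogate", length: 5 };
}

console.log(matchEscape("uD83D\\uDE00")); // { alternative: "pair", length: 11 }
console.log(matchEscape("uD83D\\u0041")); // { alternative: "lead", length: 5 }
```

The lookahead check inspects six characters (`\u` plus four hex digits), which is the part that does not fit the current single-terminal style of lookahead sets.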

Contributor Author
@gibson042 Oct 8, 2019

That is invalid (see below), and even after refactoring it would still be a lookahead of six code points. It's not clearer to me, but I would be willing to switch to it if there's consensus.

Collaborator

Currently, in the ES grammars, a lookahead-constraint either:
(a) occurs at the end of a right-hand-side, or
(b) occurs before a nonterminal, where that nonterminal derives phrases that begin with the disallowed sequences (i.e. the constraint is 'scoped' to that nonterminal).

So the right-hand-side:

[lookahead &notin; RegExpUnicodeSurrogatePair] `u` Hex4Digits

would be quite unusual, in that the lookahead-constraint has to "look through" the terminal `u` and nonterminal Hex4Digits, and then look past them to a potential `\u` Hex4Digits following. For this reason, I prefer @bakkot's suggestion:

LeadSurrogate [lookahead != `\u` TrailSurrogate]

as it eliminates the "look through" and winds up with a fairly standard end-of-RHS constraint. (But yes, the nature of its lookahead-sequence would require tweaking 5.1.5 Grammar Notation.)

Also, I think I'd prefer it to come after the Lead+Trail right-hand-side:

[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` LeadSurrogate [lookahead != `\u` TrailSurrogate]

(The other thing I like about this solution is that just those two lines make it really obvious why the lookahead-constraint is needed.)
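As a rough illustration of why the constraint is needed, the sketch below counts the ways a run of `\uXXXX` escapes can be split into escape sequences under the [+U] alternatives, with and without the lookahead-constraint on the lone-lead alternative. It assumes each escape has already been reduced to its 16-bit value; `countParses` is a hypothetical name, not something from the spec.

```typescript
const isLead = (v: number): boolean => v >= 0xd800 && v <= 0xdbff;
const isTrail = (v: number): boolean => v >= 0xdc00 && v <= 0xdfff;

// Counts the derivations of units[i..] as a sequence of escape sequences.
function countParses(units: number[], useLookahead: boolean, i = 0): number {
  const unit = units[i];
  if (unit === undefined) return 1; // end of input: one complete derivation
  const next = units[i + 1];
  const pairPossible = isLead(unit) && next !== undefined && isTrail(next);

  let total = 0;
  // `u` LeadSurrogate `\u` TrailSurrogate consumes two escapes.
  if (pairPossible) total += countParses(units, useLookahead, i + 2);
  // The single-escape alternatives consume one escape. With the constraint,
  // a lone lead is not allowed when `\u` TrailSurrogate follows.
  const loneLeadBlocked = useLookahead && pairPossible;
  if (!loneLeadBlocked) total += countParses(units, useLookahead, i + 1);
  return total;
}

console.log(countParses([0xd83d, 0xde00], false)); // 2
console.log(countParses([0xd83d, 0xde00], true)); // 1
```

Without the constraint, `\uD83D\uDE00` derives both as one pair and as a lone lead followed by a lone trail; with it, only the pair remains.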

[~U] `u` Hex4Digits
[+U] `u{` CodePoint `}`
</emu-grammar>
<p>Each `\\u` |TrailSurrogate| for which the choice of associated `u` |LeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |LeadSurrogate| that would otherwise have no corresponding `\\u` |TrailSurrogate|.</p>
<emu-grammar type="definition">

RegExpUnicodeSurrogatePair ::
`u` LeadSurrogate `\u` TrailSurrogate

LeadSurrogate ::
Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF]

TrailSurrogate ::
Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF]

NonSurrogate ::
Hex4Digits [> but only if the SV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF]

IdentityEscape[U] ::
[+U] SyntaxCharacter
[+U] `/`
@@ -30866,42 +30861,26 @@ <h1>Static Semantics: CharacterValue</h1>
<emu-alg>
1. Return the numeric value of the code unit that is the SV of |HexEscapeSequence|.
</emu-alg>
<emu-grammar>RegExpUnicodeEscapeSequence :: `u` LeadSurrogate `\u` TrailSurrogate</emu-grammar>
<emu-alg>
1. Let _lead_ be the CharacterValue of |LeadSurrogate|.
1. Let _trail_ be the CharacterValue of |TrailSurrogate|.
1. Let _cp_ be UTF16Decode(_lead_, _trail_).
1. Return the code point value of _cp_.
</emu-alg>
<emu-grammar>RegExpUnicodeEscapeSequence :: `u` LeadSurrogate</emu-grammar>
<emu-alg>
1. Return the CharacterValue of |LeadSurrogate|.
</emu-alg>
<emu-grammar>RegExpUnicodeEscapeSequence :: `u` TrailSurrogate</emu-grammar>
<emu-alg>
1. Return the CharacterValue of |TrailSurrogate|.
</emu-alg>
<emu-grammar>RegExpUnicodeEscapeSequence :: `u` NonSurrogate</emu-grammar>
<emu-alg>
1. Return the CharacterValue of |NonSurrogate|.
</emu-alg>
<emu-grammar>RegExpUnicodeEscapeSequence :: `u` Hex4Digits</emu-grammar>
<emu-alg>
1. Return the Number value for the MV of |Hex4Digits|.
</emu-alg>
<emu-grammar>RegExpUnicodeEscapeSequence :: `u{` CodePoint `}`</emu-grammar>
<emu-alg>
1. Return the Number value for the MV of |CodePoint|.
</emu-alg>
<emu-grammar>
RegExpUnicodeEscapeSequence :: `u` Hex4Digits

LeadSurrogate :: Hex4Digits

TrailSurrogate :: Hex4Digits

NonSurrogate :: Hex4Digits
</emu-grammar>
<emu-alg>
1. Return the Number value for the MV of |HexDigits|.
1. Return the Number value for the MV of |Hex4Digits|.
</emu-alg>
<emu-grammar>RegExpUnicodeSurrogatePair :: `u` LeadSurrogate `\u` TrailSurrogate</emu-grammar>
<emu-alg>
1. Let _lead_ be the CharacterValue of |LeadSurrogate|.
1. Let _trail_ be the CharacterValue of |TrailSurrogate|.
1. Let _cp_ be UTF16Decode(_lead_, _trail_).
1. Return the code point value of _cp_.
</emu-alg>
<emu-grammar>CharacterEscape :: IdentityEscape</emu-grammar>
<emu-alg>
@@ -41454,11 +41433,9 @@ <h1>Regular Expressions</h1>
<emu-prodref name=RegExpIdentifierStart></emu-prodref>
<emu-prodref name=RegExpIdentifierPart></emu-prodref>
<emu-prodref name=RegExpUnicodeEscapeSequence></emu-prodref>
<p>Each `\\u` |TrailSurrogate| for which the choice of associated `u` |LeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |LeadSurrogate| that would otherwise have no corresponding `\\u` |TrailSurrogate|.</p>
<p>&nbsp;</p>
<emu-prodref name=RegExpUnicodeSurrogatePair></emu-prodref>
<emu-prodref name=LeadSurrogate></emu-prodref>
<emu-prodref name=TrailSurrogate></emu-prodref>
<emu-prodref name=NonSurrogate></emu-prodref>
<emu-prodref name=IdentityEscape></emu-prodref>
<emu-prodref name=DecimalEscape></emu-prodref>
<emu-prodref name=CharacterClassEscape></emu-prodref>