From da244abb353773f0ecb8d15ae499ca8cd525ef59 Mon Sep 17 00:00:00 2001
From: Richard Gibson
Date: Mon, 7 Oct 2019 20:17:59 -0400
Subject: [PATCH 1/2] Editorial: Formalize consumption of surrogate pairs in Unicode regular expressions

---
 spec.html | 56 +++++++++++++++++-------------------------------------
 1 file changed, 17 insertions(+), 39 deletions(-)

diff --git a/spec.html b/spec.html
index 3912d02f49..399807eb74 100644
--- a/spec.html
+++ b/spec.html
@@ -30493,24 +30493,20 @@

Syntax

 <ZWJ>

 RegExpUnicodeEscapeSequence[U] ::
-  [+U] `u` LeadSurrogate `\u` TrailSurrogate
-  [+U] `u` LeadSurrogate
-  [+U] `u` TrailSurrogate
-  [+U] `u` NonSurrogate
+  [+U] RegExpUnicodeSurrogatePair
+  [+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits
   [~U] `u` Hex4Digits
   [+U] `u{` CodePoint `}`
-

Each `\\u` |TrailSurrogate| for which the choice of associated `u` |LeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |LeadSurrogate| that would otherwise have no corresponding `\\u` |TrailSurrogate|.
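
The association rule above, together with the surrogate pair decoding used by CharacterValue, can be observed from script. The following is a minimal sketch rather than spec text: `utf16Decode` is a hypothetical helper that only mirrors the arithmetic of the spec's UTF16Decode operation, and the chosen code point (U+1F600) is an arbitrary example.

```javascript
// Hypothetical helper mirroring the arithmetic of the spec's UTF16Decode.
// Assumes lead is in 0xD800..0xDBFF and trail is in 0xDC00..0xDFFF.
function utf16Decode(lead, trail) {
  return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
}

const cp = utf16Decode(0xD83D, 0xDE00);
const face = String.fromCodePoint(cp);

// With the u flag, `\uD83D\uDE00` is consumed as one surrogate pair.
const pairMatches = /^\uD83D\uDE00$/u.test(face);

// Two leads followed by one trail: the trail pairs with the nearest lead,
// so the first `\uD83D` stands alone as a lone surrogate.
const nearestLead = /^\uD83D\uD83D\uDE00$/u.test("\uD83D" + face);

// Inside a character class, the pair is a single member with the u flag
// and two separate code unit members without it.
const classWithU = /^[\uD83D\uDE00]$/u.test(face);
const classWithoutU = /^[\uD83D\uDE00]$/.test(face);

console.log(cp.toString(16), pairMatches, nearestLead, classWithU, classWithoutU);
```

The character class case is where the difference is easiest to see: without the `u` flag the class matches a single code unit, so it cannot match the two-unit string.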

-
+
+ RegExpUnicodeSurrogatePair ::
+   `u` LeadSurrogate `\u` TrailSurrogate
+
 LeadSurrogate ::
   Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xD800 to 0xDBFF]

 TrailSurrogate ::
   Hex4Digits [> but only if the SV of |Hex4Digits| is in the inclusive range 0xDC00 to 0xDFFF]

- NonSurrogate ::
-   Hex4Digits [> but only if the SV of |Hex4Digits| is not in the inclusive range 0xD800 to 0xDFFF]
-
 IdentityEscape[U] ::
   [+U] SyntaxCharacter
   [+U] `/`
@@ -30866,42 +30862,26 @@

Static Semantics: CharacterValue

 1. Return the numeric value of the code unit that is the SV of |HexEscapeSequence|.
- RegExpUnicodeEscapeSequence :: `u` LeadSurrogate `\u` TrailSurrogate
-
- 1. Let _lead_ be the CharacterValue of |LeadSurrogate|.
- 1. Let _trail_ be the CharacterValue of |TrailSurrogate|.
- 1. Let _cp_ be UTF16Decode(_lead_, _trail_).
- 1. Return the code point value of _cp_.
-
- RegExpUnicodeEscapeSequence :: `u` LeadSurrogate
-
- 1. Return the CharacterValue of |LeadSurrogate|.
-
- RegExpUnicodeEscapeSequence :: `u` TrailSurrogate
-
- 1. Return the CharacterValue of |TrailSurrogate|.
-
- RegExpUnicodeEscapeSequence :: `u` NonSurrogate
-
- 1. Return the CharacterValue of |NonSurrogate|.
-
- RegExpUnicodeEscapeSequence :: `u` Hex4Digits
-
- 1. Return the Number value for the MV of |Hex4Digits|.
-
 RegExpUnicodeEscapeSequence :: `u{` CodePoint `}`
 1. Return the Number value for the MV of |CodePoint|.
+ RegExpUnicodeEscapeSequence :: `u` Hex4Digits
+
 LeadSurrogate :: Hex4Digits
 TrailSurrogate :: Hex4Digits
-
- NonSurrogate :: Hex4Digits
- 1. Return the Number value for the MV of |HexDigits|.
+ 1. Return the Number value for the MV of |Hex4Digits|.
+
+ RegExpUnicodeSurrogatePair :: `u` LeadSurrogate `\u` TrailSurrogate
+
+ 1. Let _lead_ be the CharacterValue of |LeadSurrogate|.
+ 1. Let _trail_ be the CharacterValue of |TrailSurrogate|.
+ 1. Let _cp_ be UTF16Decode(_lead_, _trail_).
+ 1. Return the code point value of _cp_.
 CharacterEscape :: IdentityEscape
@@ -41454,11 +41434,9 @@

Regular Expressions

-

Each `\\u` |TrailSurrogate| for which the choice of associated `u` |LeadSurrogate| is ambiguous shall be associated with the nearest possible `u` |LeadSurrogate| that would otherwise have no corresponding `\\u` |TrailSurrogate|.

-

 

+
-

From a222ba635a0b666995e58048c26943b058be7813 Mon Sep 17 00:00:00 2001
From: Richard Gibson
Date: Mon, 7 Oct 2019 20:44:05 -0400
Subject: [PATCH 2/2] Editorial: Replace post-NumericLiteral lookahead prose with formal assertions

---
 spec.html | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/spec.html b/spec.html
index 399807eb74..bc6f7c56bc 100644
--- a/spec.html
+++ b/spec.html
@@ -10858,10 +10858,13 @@

Syntax

 CommonToken ::
   IdentifierName
   Punctuator
-  NumericLiteral
+  NumericLiteral [lookahead ∉ DecimalDigit] [lookahead ∉ IdentifierStart]
   StringLiteral
   Template
+
+

The lookahead restrictions for |NumericLiteral| require that source text like `3in` is rejected rather than processed as the two input elements `3` and `in`.

+
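
The effect of the lookahead restrictions described in the note can be checked from script. This is a minimal sketch, not part of the patch: `parses` is a hypothetical helper, and it relies on `new Function` parsing its argument as a function body, which is assumed here to be close enough to a Script for this purpose.

```javascript
// Hypothetical helper: reports whether the source text parses without a
// SyntaxError. `new Function` only parses the text; nothing is executed.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything else is unexpected, so surface it
  }
}

// `3in` is rejected rather than split into the input elements `3` and `in`.
const withoutSpace = parses("3in [3]");

// With white space between the tokens, the same expression is accepted.
const withSpace = parses("3 in [3]");

console.log(withoutSpace, withSpace);
```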

The |DivPunctuator|, |RegularExpressionLiteral|, |RightBracePunctuator|, and |TemplateSubstitutionTail| productions derive additional tokens that are not included in the |CommonToken| production.

@@ -11164,10 +11167,6 @@

Syntax

 HexDigit :: one of
   `0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `a` `b` `c` `d` `e` `f` `A` `B` `C` `D` `E` `F`
-

The |SourceCharacter| immediately following a |NumericLiteral| must not be an |IdentifierStart| or |DecimalDigit|.

-
-

For example: `3in` is an error and not the two input elements `3` and `in`.

-

A conforming implementation, when processing strict mode code, must not extend, as described in , the syntax of |NumericLiteral| to include , nor extend the syntax of |DecimalIntegerLiteral| to include .