Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normative: Fully specify legal escape sequences in RegExp capture group names #1869

Merged
merged 1 commit into from
Apr 2, 2020

Conversation

bakkot
Copy link
Contributor

@bakkot bakkot commented Feb 10, 2020

Fixes #1861. See that issue for context. Marked as normative despite fixing a spec bug because there are two sensible behaviors possible here, either of which is compatible with the current (incomplete) specification.

This PR takes the approach of making /(?<\ud835\udc9c>.)/u legal (in addition to making /(?<\u{1d49c}>.)/u legal, which is more obviously the intent of the current incomplete spec text).

@bakkot bakkot added normative change Affects behavior required to correctly evaluate some ECMAScript source text needs consensus This needs committee consensus before it can be eligible to be merged. spec bug labels Feb 10, 2020
@mathiasbynens
Copy link
Member

Thanks for fixing the spec bug that accidentally made \u{1D49C} illegal!

As explained in #1861, I'd prefer making /(?<\uD835\uDC9C>.)/u remain illegal, since group names are thought of as IdentifierNames, and because var \uD835\uDC9C or groups.\uD835\uDC9C aren't valid either.

@bakkot
Copy link
Contributor Author

bakkot commented Feb 10, 2020

Yup, I'm happy either way. Mostly I just wanted to have a PR to discuss next meeting, and this behavior was easier to implement.

@bakkot bakkot added the es2020 label Mar 4, 2020
Copy link
Member

@mathiasbynens mathiasbynens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as "request changes" per my earlier comment. Let's discuss this in plenary!

Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix to my incoherent spec text!

@littledan
Copy link
Member

To echo my other comment, I'm happy with the change in grammar here, as it seems most consistent with how RegExps deal with literals in general.

@littledan littledan added the needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 label Mar 25, 2020
@mathiasbynens
Copy link
Member

Goals

  • Tooling: The solution should not prevent tools from transforming valid source text to an equivalent ASCII-safe version (“an ASCIIfier”), and should ideally not force such tools to implement a regular expression tokenizer/parser.
  • Symmetry: I posit that people think of foo in /(?<foo>.)/u and in match.groups.foo as “the same thing”, and today’s syntax matches that expectation: any source text foo that’s valid in the former context can be copied to the latter context without introducing a syntax error. Ideally, we preserve this symmetry: there should be no mismatch between how groups can be named in the pattern vs. how they’re accessed.
  • Mental model: The solution should avoid introducing confusing edge cases if possible, and should ideally lead to a clear and simple mental model.

Guiding example

An ASCIIfier gets the following source code as input:

'a'.match(/(?<𐊧>🎈{2})/).groups.𐊧;
'b'.match(/(?<𐊧>🎈{2})/u).groups.𐊧;
'c'.match(RegExp('(?<𐊧>🎈{2})', 'u')).groups.𐊧;
const 𐊧 = 42;

Note the following:

  • The capture group name is 𐊧 which is U+102A7 CARIAN LETTER A2.
  • 🎈 is U+1F388 BALLOON.
  • The non-u pattern is equivalent to \uD83C\uDF88{2} and thus matches '🎈\uDF88'.
  • The u pattern matches '🎈🎈'.
  • We expect the ASCIIfier to preserve the match semantics of the input. (There is no way to ASCIIfy without changing any semantics, since the source text transformation is inherently observable through e.g. re.source, re.toString, and the containing Function.prototype.toString.)

Proposal 1

The current proposal is to allow escaping astral identifiers as individually-escaped surrogate halves in group names.

  • Tooling: ✅ This transformation preserves the important semantics, and doesn’t require RegExp tokenizing/parsing to produce correct output:
    • Within any literal (RegExp literal, string literal, template literal, tagged template), escape any astral symbol as \uXXXX\uXXXX regardless of the context.
    • Within any identifier, escape any astral symbol as \u{…}.
  • Symmetry:
  • Mental model: ❌ “In RegExp patterns, capture group names follow identifier syntax, except unlike identifiers individually-escaped surrogate halves that combine into a valid astral identifier symbol are also accepted.”

This proposal allows the ASCIIfier to output:

'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups.\u{102A7};
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups.\u{102A7};
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups.\u{102A7};
const \u{102A7} = 42;

// …or, if the ASCIIfier is slightly more advanced…

'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups['\u{102A7}'];
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups['\u{102A7}'];
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups['\u{102A7}'];
const \u{102A7} = 42;

// …or…

'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups['\uD800\uDEA7'];
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups['\uD800\uDEA7'];
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups['\uD800\uDEA7'];
const \u{102A7} = 42;

Proposal 2

Continue disallowing individually-escaped surrogates in group names, and instead allow \u{…} in group names even in non-u RegExps.

  • Tooling: ❌ Still possible, but requires tokenizing regular expression patterns to produce correct output in all cases. (For non-u RegExps, astral symbols can only safely be converted into \u{…} within capture group names.)
    • Within any regular expression literal with the u flag, escape any astral symbol as \u{…} regardless of the context.
    • Within any regular expression literal without the u flag, escape any astral symbol within a capture group name as \u{…}, and any other astral symbols as \uXXXX. (This requires tokenizing the pattern.)
    • Within any remaining literal (string literal, template literal, tagged template), escape any astral symbol as \u{…}.
    • Within any identifier, escape any astral symbol as \u{…}.
  • Symmetry:
  • Mental model: ❌ “Capture group names support \u{…} syntax for astral identifier symbols, even in non-u RegExp patterns. \u{…} now has two different meanings in non-u RegExps: /(?<\u{102A7}>\u{102A7})/.”

This proposal allows the ASCIIfier to output:

'a'.match(/(?<\u{102A7}>\uD83C\uDF88{2})/).groups.\u{102A7};
'b'.match(/(?<\u{102A7}>\uD83C\uDF88{2})/u).groups.\u{102A7};
'c'.match(RegExp('(?<\u{102A7}>\u{1F388}{2})', 'u')).\u{102A7};
const \u{102A7} = 42;

Proposal 3

Ban astral group names in non-u RegExps, i.e. make them throw early SyntaxError exceptions.

  • Tooling: ✅ This transformation preserves the important semantics, and doesn’t require RegExp tokenizing/parsing to produce correct output:
    • Within any regular expression literal with the u flag, escape any astral symbol as \u{…} regardless of the context.
    • Within any regular expression literal without the u flag, escape any astral symbol as \uXXXX regardless of the context.
    • Within any remaining literal (string literal, template literal, tagged template), escape any astral symbol as \u{…}.
    • Within any identifier, escape any astral symbol as \u{…}.
  • Symmetry:
  • Mental model: ✅ Remains what it has been since ES2015: “The u flag improves support for astral symbols in regular expression patterns. \u{…} only has special behavior in u RegExps.”
  • Additional pro: ✅ Engines already do this, so we know it’s web-compatible.

With this change, the first line of the source program remains invalid:

'a'.match(/(?<𐊧>🎈{2})/).groups.𐊧; // throws early SyntaxError
'b'.match(/(?<𐊧>🎈{2})/u).groups.𐊧;
'c'.match(RegExp('(?<𐊧>🎈{2})', 'u')).groups.𐊧;
const 𐊧 = 42;

This proposal allows an ASCIIfier to output:

'a'.match(/(?<\uD83C\uDF88>\uD83C\uDF88{2})/).groups.\u{102A7}; // still throws early SyntaxError
'b'.match(/(?<\u{102A7}>\u{1F388}{2})/u).groups.\u{102A7};
'c'.match(RegExp('(?<\u{102A7}>\u{1F388}{2})', 'u')).\u{102A7};
const \u{102A7} = 42;

The only “downside” to this proposal is that it introduces one more difference between non-u and u RegExps. But then again, they’re different by design, and since we have no choice but to introduce some inconsistency to resolve this issue, I propose we do it here.

TL;DR

I believe proposal 3 is the way to go. It addresses the concerns that were raised: it’s tooling-friendly, preserves symmetry, and gates the changes on the presence of the u flag, following the mental model that was established with ES2015. Please consider it.

@bmeck
Copy link
Member

bmeck commented Apr 1, 2020

I believe there is a 4th way that was missed for discussion: allow surrogate pair syntax outside of RegExp grammar as an Identifier and inside of named capture group Ids:

var \uD800\uDEA7;
\uD800\uDEA7 = 'a'.match(/(?<\uD83C\uDF88>.)/).groups./.\uD800\uDEA7

This would satisfy the thinking #1869 (comment)

@gibson042
Copy link
Contributor

gibson042 commented Apr 1, 2020

  • Symmetry: I posit that people think of foo in /(?<foo>.)/u and in match.groups.foo as “the same thing”, and today’s syntax matches that expectation: any source text foo that’s valid in the former context can be copied to the latter context without introducing a syntax error. Ideally, we preserve this symmetry: there should be no mismatch between how groups can be named in the pattern vs. how they’re accessed.

I think you're overvaluing this particular narrow symmetry... for the sake of argument (and echoing @bmeck), what about the symmetry of \u{…} with \u𝘏𝘌𝘈𝘋\u𝘛𝘈𝘐𝘓?

Proposal 4

Interpret surrogate pairs as single code points in IdentifierName, just as they are in string, template, and regular expression literals (including in capture group names). Apply code point semantics to \u{…} everywhere in Unicode regular expressions, and nowhere in non-Unicode regular expressions (where it remains identical to u{…}).

  • Tooling: ✅ This transformation preserves the important semantics, and doesn’t require RegExp tokenizing/parsing to produce correct output:
    • Within any literal (RegExp literal, string literal, template literal, tagged template) or identifier, escape any astral symbol as \uXXXX\uXXXX regardless of the context.
  • Symmetry: ✅ ➕
  • Mental model: ✅ “An astral code point can always be represented as \u𝘏𝘌𝘈𝘋\u𝘛𝘈𝘐𝘓, and additionally as the more convenient \u{…} except in old-style non-u regular expressions.”

This proposal allows an ASCIIfier to output:

'a'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/).groups.\uD800\uDEA7;
'b'.match(/(?<\uD800\uDEA7>\uD83C\uDF88{2})/u).groups.\uD800\uDEA7;
'c'.match(RegExp('(?<\uD800\uDEA7>\uD83C\uDF88{2})', 'u')).groups.\uD800\uDEA7;
const \uD800\uDEA7 = 42;

@mathiasbynens
Copy link
Member

Proposal 4 is another option, indeed. I would still prefer proposal 3 since a) it doesn't introduce the unfortunate concept of surrogates in more places in the language and b) it doesn't introduce a new mental model, but rather fits within a pre-existing one. Proposal 4 seems like a larger change.

@mathiasbynens
Copy link
Member

mathiasbynens commented Apr 1, 2020

Here’s why I think expecting astral symbols to work within capture group names in non-u RegExps does not align with ES2015 history, and why proposal 3 aligns best with the precedent set in ES2015.

This is a list of places where astral symbols don’t work as expected in non-u regexps:

  • .
  • quantifiers
  • character classes
  • character class escapes

The ES2015 u flag was introduced to fix astral support for the above. It also enables the \u{…} syntax (as well as some other things like enable \p{…} syntax, but those aren’t relevant to the discussion).

Given the above, I assert that there is no reasonable expectation that astral symbols are supported in any context within a regular expression pattern, unless the u flag is set. I’d expect /(?<𐊧>.)/ to throw, and /(?<𐊧>.)/u to be valid (matching implementation reality).

The same thing then applies for the \u{…}-escaped versions of these regular expressions: I’d expect /(?<\u{102A7}>.)/ to throw, and /(?<\u{102A7}>.)/u to be valid (matching implementation reality).

@bakkot
Copy link
Contributor Author

bakkot commented Apr 1, 2020

@mathiasbynens Another downside of proposal three is that Chinese-language users will be unable to use certain nouns in their language as group names in certain regexes, for reasons which will appear totally opaque. (Is anyone really going to already know that "tungsten" (钨) is a BMP character but "seaborgium" (𨭎) is astral?) That seems like it is clearly unacceptable.

@mathiasbynens
Copy link
Member

@mathiasbynens Another downside of proposal three is that Chinese-language users will be unable to use certain nouns in their language as group names in certain regexes, for reasons which will appear totally opaque. (Is anyone really going to already know that "tungsten" (钨) is a BMP character but "seaborgium" (𨭎) is astral?) That seems like it is clearly unacceptable.

This is already the case for everything else in non-u RegExps. 𨭎 already does not work as expected in character classes, with quantifiers, etc. It seems weird to accept it doesn't work in all these other cases but simultaneously expect it to work in this one very specific case. The solution to this problem was introduced in ES2015.

@bakkot
Copy link
Contributor Author

bakkot commented Apr 1, 2020

This is already the case for everything else in non-u RegExps.

In that non-u RegExps have weird behavior when matching against non-BMP text, yes. But that is a fact about their behavior. A user who knows that such RegExps operate one code unit at a time could still justifiably be surprised that capture group names, which have nothing to do with the behavior of the RegExp, also have this strange limitations.

@mathiasbynens
Copy link
Member

A user who knows that such RegExps operate one code unit at a time could still justifiably be surprised that capture group names, which have nothing to do with the behavior of the RegExp, also have this strange limitations.

Strange limitations are inherent to non-u RegExps. I doubt people think "capture group names have nothing to do with the behavior of the RegExp" — clearly, they affect the resulting match object. The confusion could also happen the other way around: the fact that a group name can be used within such a RegExp might cause one to believe its characters are handled correctly elsewhere in the pattern as well. In current implementation reality, as well as with proposal 3, trying to use a valid identifier astral symbol in the name triggers an early error, which could be seen as a warning: "oh right, I meant to use the u flag here".

@bakkot
Copy link
Contributor Author

bakkot commented Apr 1, 2020

Strange limitations are inherent to non-u RegExps.

Yes, but that doesn't mean that creating more of them is costless. When weighed against the (to my mind) quite small benefits of ensuring the symmetry and mental model you would like to have are unbroken even in this edge case, which will be exposed to vanishingly few users, I think the cost of propagating this particular strange limitation (which will be exposed to far more users) dominates.

@bakkot bakkot force-pushed the sv-regexp-unicode-escape-sequence branch from 7644ce6 to 77e0f8b Compare April 1, 2020 23:41
@bakkot
Copy link
Contributor Author

bakkot commented Apr 1, 2020

Per consensus today, all of

/(?<\ud835\udc9c>.)/
/(?<\ud835\udc9c>.)/u
/(?<\u{1d49c}>.)/
/(?<\u{1d49c}>.)/u
/(?<𝒜>.)/
/(?<𝒜>.)/u

will be legal. I have updated this PR to implement those semantics.

@ljharb ljharb added has consensus This has committee consensus. and removed needs consensus This needs committee consensus before it can be eligible to be merged. labels Apr 1, 2020
@ljharb ljharb dismissed mathiasbynens’s stale review April 1, 2020 23:53

discussed in plenary

@bakkot bakkot force-pushed the sv-regexp-unicode-escape-sequence branch from 77e0f8b to d374ff9 Compare April 2, 2020 00:04
@bakkot bakkot changed the title Normative: Use CharacterValue for RegExpUnicodeEscapeSequence Normative: Fully specify legal escape sequences in RegExp capture group names Apr 2, 2020
ljharb pushed a commit that referenced this pull request Apr 2, 2020
@ljharb ljharb force-pushed the sv-regexp-unicode-escape-sequence branch from d374ff9 to 84c683b Compare April 2, 2020 02:34
ljharb pushed a commit that referenced this pull request Apr 2, 2020
…up names now (#1869)

This commit makes legal all of the following:
 - `/(?<\ud835\udc9c>.)/`
 - `/(?<\ud835\udc9c>.)/u`
 - `/(?<\u{1d49c}>.)/`
 - `/(?<\u{1d49c}>.)/u`
 - `/(?<𝒜>)/`
 - `/(?<𝒜>)/u`

Fixes #1861
@ljharb ljharb force-pushed the sv-regexp-unicode-escape-sequence branch from 84c683b to 87ff636 Compare April 2, 2020 02:36
ljharb pushed a commit that referenced this pull request Apr 2, 2020
…up names (#1869)

This commit makes legal all of the following:
 - `/(?<\ud835\udc9c>.)/`
 - `/(?<\ud835\udc9c>.)/u`
 - `/(?<\u{1d49c}>.)/`
 - `/(?<\u{1d49c}>.)/u`
 - `/(?<𝒜>)/`
 - `/(?<𝒜>)/u`

Fixes #1861
@ljharb ljharb force-pushed the sv-regexp-unicode-escape-sequence branch from 87ff636 to e61ddcb Compare April 2, 2020 02:36
ljharb pushed a commit that referenced this pull request Apr 2, 2020
…up names (#1869)

This commit makes the Early Errors for RegExpIdentifierStart and RegExpIdentifierPart fully specified, with the semantics that Unicode escape sequences of the form `\u LeadSurrogate \u TrailSurrogate` as well as \u { CodePoint }` are legal in named capture group names for both Unicode and non-Unicode regular expressions.

This commit thus makes legal all of the following:
 - `/(?<\ud835\udc9c>.)/`
 - `/(?<\ud835\udc9c>.)/u`
 - `/(?<\u{1d49c}>.)/`
 - `/(?<\u{1d49c}>.)/u`
 - `/(?<𝒜>)/`
 - `/(?<𝒜>)/u`

Fixes #1861
@ljharb ljharb force-pushed the sv-regexp-unicode-escape-sequence branch from e61ddcb to d1d466c Compare April 2, 2020 02:39
…up names (#1869)

This commit makes the Early Errors for RegExpIdentifierStart and RegExpIdentifierPart fully specified, with the semantics that Unicode escape sequences of the form `\u LeadSurrogate \u TrailSurrogate` as well as `\u { CodePoint }` are legal in named capture group names for both Unicode and non-Unicode regular expressions.

Fixes #1861
@ljharb ljharb force-pushed the sv-regexp-unicode-escape-sequence branch from d1d466c to 249c466 Compare April 2, 2020 02:41
@ljharb ljharb merged commit 249c466 into master Apr 2, 2020
@ljharb ljharb deleted the sv-regexp-unicode-escape-sequence branch April 2, 2020 02:45
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Apr 2, 2020
PR tc39#1869 removed/changed the [U] parameter in certain defining productions,
but missed some collateral changes in referring productions.
@mysticatea
Copy link
Contributor

Hi.

I guess that this change should update the second and third paragraphs of 21.2.2 Pattern Semantics.

A Pattern is either a BMP pattern or a Unicode pattern depending upon whether or not its associated flags contain a u. A BMP pattern matches against a String interpreted as consisting of a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane. ...(omission)... If a BMP pattern contains a non-BMP SourceCharacter the entire pattern is encoded using UTF-16 and the individual code units of that encoding are used as the elements of the List.

This means, in /(?<𝒜>)/, the group name is two characters U+D835 and U+DC9C, then the U+D835 doesn't match UnicodeIDStart production, so it will be illegal. If we want /(?<𝒜>)/ to be legal, I think that we have to update the meaning of "character" in the BMP pattern.

mysticatea added a commit to mysticatea/acorn that referenced this pull request Apr 2, 2020
marijnh pushed a commit to acornjs/acorn that referenced this pull request Apr 2, 2020
@bakkot
Copy link
Contributor Author

bakkot commented Apr 2, 2020

@mysticatea Nice catch! #1932 has a fix.

ljharb pushed a commit that referenced this pull request Apr 2, 2020
This is a followup to #1869.
Per consensus, `/(?<𝒜>.)/` should be legal -
but the rules for parsing non-u patterns are such that the source text is parsed by treating each half of a surrogate pair as an individual code point,
so the rules for RegExpIdentifierStart and RegExpIdentifierPart need to be tweaked to allow surrogate pairs for that case.
jmdyck added a commit to jmdyck/ecma262 that referenced this pull request Apr 2, 2020
... in productions for RegExpIdentifierStart and RegExpIdentifierPart.
(That is, revert PR tc39#1869's change from '?U' to '+U' here.)

(fixup for PR tc39#1932)
ryanhaddad pushed a commit to WebKit/WebKit that referenced this pull request Dec 22, 2020
…de regular expressions

https://bugs.webkit.org/show_bug.cgi?id=210309

Reviewed by Ross Kirsling.

JSTests:

* stress/regexp-named-capture-groups.js: New test added.
(shouldBe):
(shouldThrowInvalidGroupSpecifierName):
* test262/expectations.yaml: Updated for now failing tests.
When test262 gets updated for this change, this can be reverted.

Source/JavaScriptCore:

Update YARR pattern processing to allow for non-BMP unicode identifier characters in named capture groups.

This change was discussed and approved at the March/April 2020 TC-39 meeting.
See tc39/ecma262#1869 for the discussion and change.

Updated tryConsumeUnicodeEscape() to allow for unicode escapes in non-unicode flagged regex's.
Added the same support to consumePossibleSurrogatePair().

* yarr/YarrParser.h:
(JSC::Yarr::Parser::consumePossibleSurrogatePair):
(JSC::Yarr::Parser::parseCharacterClass):
(JSC::Yarr::Parser::parseTokens):
(JSC::Yarr::Parser::tryConsumeUnicodeEscape):
(JSC::Yarr::Parser::tryConsumeIdentifierCharacter):


Canonical link: https://commits.webkit.org/223334@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@260033 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has consensus This has committee consensus. needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 normative change Affects behavior required to correctly evaluate some ECMAScript source text spec bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the SV of RegExpUnicodeEscapeSequence is used but not defined
8 participants