Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normative: Update GetSubstitution to match reality #1732

Closed

Conversation

gibson042
Copy link
Contributor

@gibson042 gibson042 commented Oct 9, 2019

THIS PR HAS BEEN REPLACED BY #3157

$nn patterns fall back to $n when there aren't at least nn captures

Fixes gh-1426

Preview: https://deploy-preview-1732--ecma262-snapshots.netlify.com/#sec-getsubstitution

@gibson042 gibson042 added normative change Affects behavior required to correctly evaluate some ECMAScript source text needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 web reality labels Oct 9, 2019
spec.html Outdated Show resolved Hide resolved
spec.html Outdated
1. If _capture_ is *undefined*, replace the text through `>` with the empty string.
1. Otherwise, replace the text through `>` with ? ToString(_capture_).
</emu-alg>
If _nn_ is `00` or the MV of _nn_ &gt; _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code units (i.e., a single |DecimalDigit|).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If _nn_ is `00` or the MV of _nn_ &gt; _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code units (i.e., a single |DecimalDigit|).
If _nn_ is `00` or the MV of _nn_ &gt; _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code unit (i.e., a single |DecimalDigit|).

("units" -> "unit")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to leave this, because the intent is to split the matched code units into two subsequences, replacing the first and preserving the second (which happens to be a single code unit, but is not required to be so for the logic to work).

spec.html Outdated Show resolved Hide resolved
spec.html Outdated
1. Let _k_ be 0.
1. Repeat, while _k_ &lt; _replacementLength_
1. Let _replaceablePattern_ be the longest sequence of consecutive code units from _replacement_ beginning with the code unit at index _k_ such that _replaceablePattern_ matches the &ldquo;Code units&rdquo; column of a row in <emu-xref href="#table-45"></emu-xref>.
1. If _replaceablePattern_ is not empty,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. If _replaceablePattern_ is not empty,
1. If _replaceablePattern_ is not empty, then

Also, this condition assumes that _replaceablePattern_ is empty if there's no match, but there's nothing that actually makes it so. If might be better to say If no such match is found, then (and swap the two arms).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic itself makes it so. "The longest sequence of consecutive code units from replacement beginning with the code unit at index k…" is either zero code units or more than zero code units. If it is more than zero, then replaceablePattern is not empty and we will append the corresponding replacement text to result. Otherwise, we will append the code unit at index k.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The longest sequence of consecutive code units from replacement beginning with the code unit at index k…" is either zero code units or more than zero code units.

Based on that quote, it could be, but the part you elided is such that _replaceablePattern_ matches the "Code units" column of a row in [Table 53], and an empty sequence of code units does not match the "Code units" column of any row in the table.

spec.html Outdated
Comment on lines 29953 to 29957
If the MV of _n_ &gt; _m_, the replacement is the matched code units (equivalent to no replacement).
<br>
Otherwise, if the element in _captures_ at index equal to the MV of _n_ minus 1 is *undefined*, the replacement is the empty String.
<br>
Otherwise, the replacement is the element in _captures_ at index equal to the MV of _n_ minus 1.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you were to use an <emu-alg>, that would allow (something like) Let _nmv_ be the MV of _n_, which you could then use at 3 points.

It would also let you replace the element in _captures_ at index equal to the MV of _n_ minus 1 with _captures_[_nmv_ - 1].

For actual names, I think I'd recommend _N_ for the code unit (which matches the variable in column 1), and then _n_ for its MV.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered <emu-alg>, but ultimately preferred the appearance without it: https://deploy-preview-1732--ecma262-snapshots.netlify.com/#table-45 (and I'd get rid of the <emu-alg> for $<…> if I could find a good way to do so). As always, though, I'm willing to change if there's consensus.

As for N vs. n, we must use the latter because the former is not defined as the expansion of a lexical grammar production and thus has no MV.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for N vs. n, we must use the latter because the former is not defined as the expansion of a lexical grammar production and thus has no MV.

Well, of course, you'd need to change the name in column 2 as well. Sorry that wasn't clear. Anyhow, the point there was to free up the name _n_ for the MV of _N_, but that isn't needed if you don't convert to <emu-alg>.

spec.html Outdated
<br>
Otherwise, if the element in _captures_ at index equal to the MV of _nn_ minus 1 is *undefined*, the replacement is the empty String.
<br>
Otherwise, the replacement is the element in _captures_ at index equal to the MV of _nn_ minus 1.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto above comment re using <emu-alg>.

Copy link
Contributor Author

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, @jmdyck. I've updated the PR with some of your suggestions and added explanations for why I didn't accept the others.

spec.html Outdated
1. Let _k_ be 0.
1. Repeat, while _k_ &lt; _replacementLength_
1. Let _replaceablePattern_ be the longest sequence of consecutive code units from _replacement_ beginning with the code unit at index _k_ such that _replaceablePattern_ matches the &ldquo;Code units&rdquo; column of a row in <emu-xref href="#table-45"></emu-xref>.
1. If _replaceablePattern_ is not empty,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic itself makes it so. "The longest sequence of consecutive code units from replacement beginning with the code unit at index k…" is either zero code units or more than zero code units. If it is more than zero, then replaceablePattern is not empty and we will append the corresponding replacement text to result. Otherwise, we will append the code unit at index k.

spec.html Outdated
1. If _capture_ is *undefined*, replace the text through `>` with the empty string.
1. Otherwise, replace the text through `>` with ? ToString(_capture_).
</emu-alg>
If _nn_ is `00` or the MV of _nn_ &gt; _m_, the replacement is the string-concatenation of the replacement for the first two matched code units (i.e., a replacement specified by one of the two preceding rows) and the remaining matched code units (i.e., a single |DecimalDigit|).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to leave this, because the intent is to split the matched code units into two subsequences, replacing the first and preserving the second (which happens to be a single code unit, but is not required to be so for the logic to work).

spec.html Outdated
Comment on lines 29953 to 29957
If the MV of _n_ &gt; _m_, the replacement is the matched code units (equivalent to no replacement).
<br>
Otherwise, if the element in _captures_ at index equal to the MV of _n_ minus 1 is *undefined*, the replacement is the empty String.
<br>
Otherwise, the replacement is the element in _captures_ at index equal to the MV of _n_ minus 1.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered <emu-alg>, but ultimately preferred the appearance without it: https://deploy-preview-1732--ecma262-snapshots.netlify.com/#table-45 (and I'd get rid of the <emu-alg> for $<…> if I could find a good way to do so). As always, though, I'm willing to change if there's consensus.

As for N vs. n, we must use the latter because the former is not defined as the expansion of a lexical grammar production and thus has no MV.

spec.html Outdated Show resolved Hide resolved
@gibson042 gibson042 force-pushed the gh-1426-GetSubstitution-dollar-nn branch from 87592e6 to e5941f0 Compare November 13, 2019 21:20
spec.html Show resolved Hide resolved
philn pushed a commit to philn/old-webkit that referenced this pull request Nov 24, 2019
…replace]

https://bugs.webkit.org/show_bug.cgi?id=204490

Reviewed by Mark Lam.

JSTests:

* stress/replace-indexed-backreferences.js: Added.
* test262/expectations.yaml: Mark four test cases as passing.

Source/JavaScriptCore:

String.prototype.replace and RegExp.prototype[Symbol.replace] are meant to perform the same substitution
of $-backreferences (called GetSubstitution in the spec: https://tc39.es/ecma262/#sec-getsubstitution).

The implementation of this in StringPrototype.cpp is correct but the one in RegExpPrototype.js is not.
In particular, the latter *removes* backreferences with out-of-range indices, instead of leaving them as-is.

One thing that is *not* broken in either implementation and thus maintained here is the fact $10 is interpreted
as $1 followed by a 0 when we have 1 <= n < 10 captures (and analogously for other invalid $nn backreferences).
This behavior is consistent across all engines but currently described incorrectly in the spec; this patch thus
aligns with the spec PR currently open to correct this (tc39/ecma262#1732).

* builtins/RegExpPrototype.js:
(getSubstitution): Ensure that invalid backreferences remain untouched in the output string.
(replace): Fix off-by-one error when populating captures list. We shouldn't be reserving a slot for the full match.


git-svn-id: http://svn.webkit.org/repository/webkit/trunk@252836 268f45cc-cd09-0410-ab3c-d52691b4dbfc
Copy link
Member

@littledan littledan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks editorially correct to me (and I'm in favor of it on that basis), but I don't understand why it's normative. I thought the no-substitution case was covered by clauses like, "If n > m, no replacement is done." Edit: And seems like a critical fix, given how confusing my previous wording was. LGTM!

@ljharb ljharb requested review from syg, michaelficarra, a team and bakkot March 5, 2020 21:15
@devsnek
Copy link
Member

devsnek commented Aug 16, 2020

is this still a thing?

@ljharb
Copy link
Member

ljharb commented Aug 17, 2020

I believe so - it needs test262 tests, a rebase, and an updated review.

@gibson042
Copy link
Contributor Author

I'm planning to rebase and update this PR after #2021 lands.

@michaelficarra
Copy link
Member

@gibson042 #2021 has landed, do you want to rebase this?

@gibson042
Copy link
Contributor Author

Yep, it should be done this week.

ryanhaddad pushed a commit to WebKit/WebKit that referenced this pull request Dec 22, 2020
…replace]

https://bugs.webkit.org/show_bug.cgi?id=204490

Reviewed by Mark Lam.

JSTests:

* stress/replace-indexed-backreferences.js: Added.
* test262/expectations.yaml: Mark four test cases as passing.

Source/JavaScriptCore:

String.prototype.replace and RegExp.prototype[Symbol.replace] are meant to perform the same substitution
of $-backreferences (called GetSubstitution in the spec: https://tc39.es/ecma262/#sec-getsubstitution).

The implementation of this in StringPrototype.cpp is correct but the one in RegExpPrototype.js is not.
In particular, the latter *removes* backreferences with out-of-range indices, instead of leaving them as-is.

One thing that is *not* broken in either implementation and thus maintained here is the fact $10 is interpreted
as $1 followed by a 0 when we have 1 <= n < 10 captures (and analogously for other invalid $nn backreferences).
This behavior is consistent across all engines but currently described incorrectly in the spec; this patch thus
aligns with the spec PR currently open to correct this (tc39/ecma262#1732).

* builtins/RegExpPrototype.js:
(getSubstitution): Ensure that invalid backreferences remain untouched in the output string.
(replace): Fix off-by-one error when populating captures list. We shouldn't be reserving a slot for the full match.


Canonical link: https://commits.webkit.org/217811@main
git-svn-id: https://svn.webkit.org/repository/webkit/trunk@252836 268f45cc-cd09-0410-ab3c-d52691b4dbfc
@ljharb ljharb force-pushed the master branch 3 times, most recently from 3d0c24c to 7a79833 Compare June 29, 2021 02:21
@bakkot
Copy link
Contributor

bakkot commented Aug 17, 2021

I think this already has tests, actually, or at least one test, as pointed out in the original issue.

@gibson042 gibson042 marked this pull request as draft April 14, 2022 18:55
@gibson042
Copy link
Contributor Author

Replaced by #3157.

@gibson042 gibson042 closed this Aug 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs test262 tests The proposal should specify how to test an implementation. Ideally via github.com/tc39/test262 normative change Affects behavior required to correctly evaluate some ECMAScript source text web reality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible bug in GetSubstitution or test262
7 participants