Skip to content
This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to eval #8

Merged
merged 1 commit into from
Aug 29, 2018

Conversation

jswalden
Copy link
Contributor

@jswalden jswalden commented Aug 8, 2018

There's already spec algorithm/text for interpreting JS strings that contain lone surrogates, in the manner that QuoteJSONString wants to interpret them -- as code points, with lone surrogates interpreted as whole code points. We should reuse that.

Additionally, the main spec already has definitions for leading/trailing surrogate that can be reused. That seems better than introducing new definitions that cross-reference elsewhere.

I have no idea what the custom is for dels/inses or how exactly this is supposed to be written -- the existing dels/inses do not match up with the spec text in https://github.com/tc39/ecma262 right now, so I'm not sure what they're acting against -- so I mostly just used them with respect to the text currently there.

@jswalden jswalden mentioned this pull request Aug 8, 2018
4 tasks
spec.emu Outdated
@@ -14,15 +14,14 @@ contributors: Richard Gibson
<!-- es6num="24.3.2.2" -->
<emu-clause id="sec-quotejsonstring" aoid="QuoteJSONString">
<h1>Runtime Semantics: QuoteJSONString ( _value_ )</h1>
<p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code units within it.</p>
<p>As defined in the Unicode Standard, a <dfn id="high-surrogate-code-unit">high-surrogate code unit</dfn> is a code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF and a <dfn id="low-surrogate-code-unit">low-surrogate code unit</dfn> is a code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF.</p>
<p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code <del>units</del><ins>points</ins> within it.</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s a little weird to mix “code units” and “code points” in the same sentence, especially with the word “other“ in there:

code units and escapes certain other code unitspoints

Copy link
Collaborator

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to the approach, but I think we first need to update this proposal to account for changes as of tc39/ecma262@557f2c1 and tc39/ecma262@04af92d . Would you like to do that @jswalden, or should I?

spec.emu Outdated
1. For each code unit _C_ in _value_, do
1. If _C_ is the code unit 0x0022 (QUOTATION MARK) or the code unit 0x005C (REVERSE SOLIDUS), then
1. For each code <del>unit</del><ins>point</ins> _C_ in _value_ <ins>when interpreted as UTF-16 encoded Unicode text as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, do
1. If _C_ is the code <del>unit</del><ins>point</ins> 0x0022 (QUOTATION MARK) or the code <del>unit</del><ins>point</ins> 0x005C (REVERSE SOLIDUS), then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looks like this needs to be rebased on top of current master (which as of tc39/ecma262@557f2c1 has removed this special case in favor of a general replacements table) before making further changes.

spec.emu Outdated
@@ -70,18 +69,14 @@ contributors: Richard Gibson
</tbody>
</table>
1. Set _product_ to the string-concatenation of _product_ and _abbrev_.
1. Else if _C_ has a numeric value less than 0x0020 (SPACE), then
1. Else if _C_ has a numeric value less than 0x0020 (SPACE)<ins>, or _C_ is a <emu-xref href="#leading-surrogate"></emu-xref> or <emu-xref href="#trailing-surrogate"></emu-xref></ins>, then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is semantically invalid; C is defined above as a code point but the definitions of "leading surrogate" and "trailing surrogate" apply to code units (and for that matter, the input to UnicodeEscape as currently specified must also be a code unit).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picky but also quite fair. I've massaged the wording around to "or C has the same numeric value as a leading-surrogate code unit or trailing-surrogate code unit". It's a little awkward because I have to apply manual type conversions to both left-hand and right-hand sides, but maybe it does the trick?

@michaelficarra
Copy link
Member

I'm very confused by this PR given we are so out of sync with what's in ecma262 today. @jswalden what "algorithm used to interpret strings passed to eval" are you talking about?

@jswalden
Copy link
Contributor Author

@michaelficarra https://tc39.github.io/ecma262/#sec-performeval step 6 begins, "Let script be the ECMAScript code that is the result of parsing x, interpreted as UTF-16 encoded Unicode text as described in 6.1.4, for the goal symbol Script."

What's happening there is a JavaScript string of 16-bit code units is being converted into the stream of Unicode code points suitable for interpretation with respect to the Script production -- interpreting valid code points as the same values and lone surrogates as code points of those same values. This is done by reference to the algorithm in 6.1.4.

This conversion algorithm is exactly the same as what you want to do here, in determining how to quote each portion of the input string into a JSON literal. Does that clarify things?

@jswalden jswalden force-pushed the interpret-as-utf16 branch 2 times, most recently from fb003fb to 6720a7b Compare August 10, 2018 01:29
@jswalden
Copy link
Contributor Author

Okay, rebased against the main spec -- which I assume is done by copying the main spec text into spec.emu, then making ins/del changes against it? If there's something more to do here, let me know.

Incidentally, I would probably make that table have only two columns -- and put U+0008 BACKSPACE in one column and \b in the other -- but I'm not sure how to mark up the inses/dels for that, nor how well the rendered view of it would actually display if I tried.

@gibson042
Copy link
Collaborator

@jswalden I think you misunderstood; I meant that we had to get this proposal defined on top of the spec master and then open a PR that changes it from iterating over code units to iterating over code points. Trying to both at the same time just confuses matters.

@gibson042
Copy link
Collaborator

@jswalden I just landed the proposal rebase, so you're clear to rebase this PR on top of the new proposal master.

@jswalden
Copy link
Contributor Author

@gibson042 Done, I think.

Copy link
Collaborator

@gibson042 gibson042 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the iteration change. A bit more cleanup and this is good to land.

spec.emu Outdated
@@ -16,17 +16,13 @@ contributors: Richard Gibson
<p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code units within it.</p>
<emu-alg>
1. Let _product_ be the String value consisting solely of the code unit 0x0022 (QUOTATION MARK).
1. For each code unit _C_ in _value_, do
1. If the numeric value of _C_ is listed in the Code Unit Value column of <emu-xref href="#table-json-single-character-escapes"></emu-xref>, then
1. For each code <del>unit</del><ins>point</ins> _C_ in _value_ <ins>when interpreted as UTF-16 encoded Unicode text as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, do
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing </ins> end tag.

spec.emu Outdated
@@ -46,7 +42,7 @@ contributors: Richard Gibson
</tr>
<tr>
<td>
`0x0008`
<del>0x</del><ins>U+</ins>0008
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's cover these changes a little more clearly:

<del>`0x0008`</del><ins>U+0008</ins>

spec.emu Outdated
1. Set _product_ to the string-concatenation of _product_ and the Escape Sequence for _C_ as specified in <emu-xref href="#table-json-single-character-escapes"></emu-xref>.
1. Else if _C_ has a numeric value less than 0x0020 (SPACE), then
1. Else if _C_ has a numeric value less than 0x0020 (SPACE), <ins>or _C_ has the same numeric value as a <emu-xref href="#leading-surrogate-code-unit"></emu-xref> or <emu-xref href="#trailing-surrogate-code-unit"></emu-xref>,</ins> then
1. Set _product_ to the string-concatenation of _product_ and UnicodeEscape(_C_).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnicodeEscape is documented to accept a code unit, not a code point, so either that operation or invocations thereof need to updated as well.

@jswalden
Copy link
Contributor Author

Updates done. If I were touching the main spec directly, using the usual diff-tools, I would change UnicodeEscape (which is only used by QuoteJSONString) to accept code points. But I am not, so I will not.

@gibson042 gibson042 merged commit 0ca7553 into tc39:master Aug 29, 2018
@jswalden jswalden deleted the interpret-as-utf16 branch August 30, 2018 20:16
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants