Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to `eval` #8

jswalden · 2018-08-08T21:33:50Z

There's already spec algorithm/text for interpreting JS strings that contain lone surrogates, in the manner that QuoteJSONString wants to interpret them -- as code points, with lone surrogates interpreted as whole code points. We should reuse that.

Additionally, the main spec already has definitions for leading/trailing surrogate that can be reused. That seems better than introducing new definitions that cross-reference elsewhere.

I have no idea what the custom is for dels/inses or how exactly this is supposed to be written -- the existing dels/inses do not match up with the spec text in https://github.com/tc39/ecma262 right now, so I'm not sure what they're acting against -- so I mostly just used them with respect to the text currently there.

mathiasbynens · 2018-08-09T10:51:29Z

spec.emu

@@ -14,15 +14,14 @@ contributors: Richard Gibson
 <!-- es6num="24.3.2.2" -->
 <emu-clause id="sec-quotejsonstring" aoid="QuoteJSONString">
  <h1>Runtime Semantics: QuoteJSONString ( _value_ )</h1>
-  <p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code units within it.</p>
-  <p>As defined in the Unicode Standard, a <dfn id="high-surrogate-code-unit">high-surrogate code unit</dfn> is a code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF and a <dfn id="low-surrogate-code-unit">low-surrogate code unit</dfn> is a code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF.</p>
+  <p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code <del>units</del><ins>points</ins> within it.</p>


It’s a little weird to mix “code units” and “code points” in the same sentence, especially with the word “other“ in there:

code units and escapes certain other code ~~units~~points

gibson042

I'm not opposed to the approach, but I think we first need to update this proposal to account for changes as of tc39/ecma262@557f2c1 and tc39/ecma262@04af92d . Would you like to do that @jswalden, or should I?

gibson042 · 2018-08-09T20:50:34Z

spec.emu

-    1. For each code unit _C_ in _value_, do
-      1. If _C_ is the code unit 0x0022 (QUOTATION MARK) or the code unit 0x005C (REVERSE SOLIDUS), then
+    1. For each code <del>unit</del><ins>point</ins> _C_ in _value_ <ins>when interpreted as UTF-16 encoded Unicode text as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, do
+      1. If _C_ is the code <del>unit</del><ins>point</ins> 0x0022 (QUOTATION MARK) or the code <del>unit</del><ins>point</ins> 0x005C (REVERSE SOLIDUS), then


Hmm, looks like this needs to be rebased on top of current master (which as of tc39/ecma262@557f2c1 has removed this special case in favor of a general replacements table) before making further changes.

gibson042 · 2018-08-09T20:58:22Z

spec.emu

@@ -70,18 +69,14 @@ contributors: Richard Gibson
            </tbody>
          </table>
        1. Set _product_ to the string-concatenation of _product_ and _abbrev_.
-      1. Else if _C_ has a numeric value less than 0x0020 (SPACE), then
+      1. Else if _C_ has a numeric value less than 0x0020 (SPACE)<ins>, or _C_ is a <emu-xref href="#leading-surrogate"></emu-xref> or <emu-xref href="#trailing-surrogate"></emu-xref></ins>, then


This is semantically invalid; C is defined above as a code point but the definitions of "leading surrogate" and "trailing surrogate" apply to code units (and for that matter, the input to UnicodeEscape as currently specified must also be a code unit).

Picky but also quite fair. I've massaged the wording around to "or C has the same numeric value as a leading-surrogate code unit or trailing-surrogate code unit". It's a little awkward because I have to apply manual type conversions to both left-hand and right-hand sides, but maybe it does the trick?

michaelficarra · 2018-08-09T21:46:22Z

I'm very confused by this PR given we are so out of sync with what's in ecma262 today. @jswalden what "algorithm used to interpret strings passed to eval" are you talking about?

jswalden · 2018-08-10T00:54:24Z

@michaelficarra https://tc39.github.io/ecma262/#sec-performeval step 6 begins, "Let script be the ECMAScript code that is the result of parsing x, interpreted as UTF-16 encoded Unicode text as described in 6.1.4, for the goal symbol Script."

What's happening there is a JavaScript string of 16-bit code units is being converted into the stream of Unicode code points suitable for interpretation with respect to the Script production -- interpreting valid code points as the same values and lone surrogates as code points of those same values. This is done by reference to the algorithm in 6.1.4.

This conversion algorithm is exactly the same as what you want to do here, in determining how to quote each portion of the input string into a JSON literal. Does that clarify things?

jswalden · 2018-08-10T01:32:58Z

Okay, rebased against the main spec -- which I assume is done by copying the main spec text into spec.emu, then making ins/del changes against it? If there's something more to do here, let me know.

Incidentally, I would probably make that table have only two columns -- and put U+0008 BACKSPACE in one column and \b in the other -- but I'm not sure how to mark up the inses/dels for that, nor how well the rendered view of it would actually display if I tried.

gibson042 · 2018-08-14T18:13:39Z

@jswalden I think you misunderstood; I meant that we had to get this proposal defined on top of the spec master and then open a PR that changes it from iterating over code units to iterating over code points. Trying to both at the same time just confuses matters.

gibson042 · 2018-08-14T18:54:54Z

@jswalden I just landed the proposal rebase, so you're clear to rebase this PR on top of the new proposal master.

jswalden · 2018-08-14T22:47:03Z

@gibson042 Done, I think.

gibson042

I like the iteration change. A bit more cleanup and this is good to land.

gibson042 · 2018-08-25T00:56:45Z

spec.emu

@@ -16,17 +16,13 @@ contributors: Richard Gibson
  <p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code units within it.</p>
  <emu-alg>
    1. Let _product_ be the String value consisting solely of the code unit 0x0022 (QUOTATION MARK).
-    1. For each code unit _C_ in _value_, do
-      1. If the numeric value of _C_ is listed in the Code Unit Value column of <emu-xref href="#table-json-single-character-escapes"></emu-xref>, then
+    1. For each code <del>unit</del><ins>point</ins> _C_ in _value_ <ins>when interpreted as UTF-16 encoded Unicode text as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, do


Missing </ins> end tag.

gibson042 · 2018-08-25T01:05:21Z

spec.emu

@@ -46,7 +42,7 @@ contributors: Richard Gibson
      </tr>
      <tr>
        <td>
-          `0x0008`
+          <del>0x</del><ins>U+</ins>0008


Let's cover these changes a little more clearly:

<del>`0x0008`</del><ins>U+0008</ins>

gibson042 · 2018-08-25T01:11:19Z

spec.emu

        1. Set _product_ to the string-concatenation of _product_ and the Escape Sequence for _C_ as specified in <emu-xref href="#table-json-single-character-escapes"></emu-xref>.
-      1. Else if _C_ has a numeric value less than 0x0020 (SPACE), then
+      1. Else if _C_ has a numeric value less than 0x0020 (SPACE), <ins>or _C_ has the same numeric value as a <emu-xref href="#leading-surrogate-code-unit"></emu-xref> or <emu-xref href="#trailing-surrogate-code-unit"></emu-xref>,</ins> then
        1. Set _product_ to the string-concatenation of _product_ and UnicodeEscape(_C_).


UnicodeEscape is documented to accept a code unit, not a code point, so either that operation or invocations thereof need to updated as well.

… to interpret strings passed to `eval`.

jswalden · 2018-08-27T23:10:48Z

Updates done. If I were touching the main spec directly, using the usual diff-tools, I would change UnicodeEscape (which is only used by QuoteJSONString) to accept code points. But I am not, so I will not.

jswalden mentioned this pull request Aug 8, 2018

Advance to Stage 3 #6

Closed

4 tasks

mathiasbynens reviewed Aug 9, 2018

View reviewed changes

gibson042 suggested changes Aug 9, 2018

View reviewed changes

jswalden force-pushed the interpret-as-utf16 branch 2 times, most recently from fb003fb to 6720a7b Compare August 10, 2018 01:29

jswalden force-pushed the interpret-as-utf16 branch from 6720a7b to d778fe2 Compare August 14, 2018 22:45

jswalden force-pushed the interpret-as-utf16 branch from d778fe2 to 9087749 Compare August 15, 2018 01:02

gibson042 suggested changes Aug 25, 2018

View reviewed changes

Iterate over the code points of _value_ using the same algorithm used…

cebbe18

… to interpret strings passed to `eval`.

jswalden force-pushed the interpret-as-utf16 branch from 9087749 to cebbe18 Compare August 27, 2018 23:07

michaelficarra approved these changes Aug 28, 2018

View reviewed changes

gibson042 approved these changes Aug 29, 2018

View reviewed changes

gibson042 merged commit 0ca7553 into tc39:master Aug 29, 2018

jswalden deleted the interpret-as-utf16 branch August 30, 2018 20:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to `eval` #8

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to `eval` #8

jswalden commented Aug 8, 2018

mathiasbynens Aug 9, 2018

gibson042 left a comment

gibson042 Aug 9, 2018

gibson042 Aug 9, 2018

jswalden Aug 10, 2018

michaelficarra commented Aug 9, 2018

jswalden commented Aug 10, 2018

jswalden commented Aug 10, 2018

gibson042 commented Aug 14, 2018

gibson042 commented Aug 14, 2018

jswalden commented Aug 14, 2018

gibson042 left a comment

gibson042 Aug 25, 2018

gibson042 Aug 25, 2018

gibson042 Aug 25, 2018

jswalden commented Aug 27, 2018

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to eval #8

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to eval #8

Conversation

jswalden commented Aug 8, 2018

mathiasbynens Aug 9, 2018

Choose a reason for hiding this comment

gibson042 left a comment

Choose a reason for hiding this comment

gibson042 Aug 9, 2018

Choose a reason for hiding this comment

gibson042 Aug 9, 2018

Choose a reason for hiding this comment

jswalden Aug 10, 2018

Choose a reason for hiding this comment

michaelficarra commented Aug 9, 2018

jswalden commented Aug 10, 2018

jswalden commented Aug 10, 2018

gibson042 commented Aug 14, 2018

gibson042 commented Aug 14, 2018

jswalden commented Aug 14, 2018

gibson042 left a comment

Choose a reason for hiding this comment

gibson042 Aug 25, 2018

Choose a reason for hiding this comment

gibson042 Aug 25, 2018

Choose a reason for hiding this comment

gibson042 Aug 25, 2018

Choose a reason for hiding this comment

jswalden commented Aug 27, 2018

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to `eval` #8

Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to `eval` #8