-
Notifications
You must be signed in to change notification settings - Fork 15
Iterate over the code points of _value_ using the same algorithm used to interpret strings passed to eval
#8
Conversation
spec.emu
Outdated
@@ -14,15 +14,14 @@ contributors: Richard Gibson | |||
<!-- es6num="24.3.2.2" --> | |||
<emu-clause id="sec-quotejsonstring" aoid="QuoteJSONString"> | |||
<h1>Runtime Semantics: QuoteJSONString ( _value_ )</h1> | |||
<p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code units within it.</p> | |||
<p>As defined in the Unicode Standard, a <dfn id="high-surrogate-code-unit">high-surrogate code unit</dfn> is a code unit with a numeric value in the inclusive range 0xD800 to 0xDBFF and a <dfn id="low-surrogate-code-unit">low-surrogate code unit</dfn> is a code unit with a numeric value in the inclusive range 0xDC00 to 0xDFFF.</p> | |||
<p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code <del>units</del><ins>points</ins> within it.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s a little weird to mix “code units” and “code points” in the same sentence, especially with the word “other“ in there:
code units and escapes certain other code
unitspoints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to the approach, but I think we first need to update this proposal to account for changes as of tc39/ecma262@557f2c1 and tc39/ecma262@04af92d . Would you like to do that @jswalden, or should I?
spec.emu
Outdated
1. For each code unit _C_ in _value_, do | ||
1. If _C_ is the code unit 0x0022 (QUOTATION MARK) or the code unit 0x005C (REVERSE SOLIDUS), then | ||
1. For each code <del>unit</del><ins>point</ins> _C_ in _value_ <ins>when interpreted as UTF-16 encoded Unicode text as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, do | ||
1. If _C_ is the code <del>unit</del><ins>point</ins> 0x0022 (QUOTATION MARK) or the code <del>unit</del><ins>point</ins> 0x005C (REVERSE SOLIDUS), then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, looks like this needs to be rebased on top of current master (which as of tc39/ecma262@557f2c1 has removed this special case in favor of a general replacements table) before making further changes.
spec.emu
Outdated
@@ -70,18 +69,14 @@ contributors: Richard Gibson | |||
</tbody> | |||
</table> | |||
1. Set _product_ to the string-concatenation of _product_ and _abbrev_. | |||
1. Else if _C_ has a numeric value less than 0x0020 (SPACE), then | |||
1. Else if _C_ has a numeric value less than 0x0020 (SPACE)<ins>, or _C_ is a <emu-xref href="#leading-surrogate"></emu-xref> or <emu-xref href="#trailing-surrogate"></emu-xref></ins>, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is semantically invalid; C is defined above as a code point but the definitions of "leading surrogate" and "trailing surrogate" apply to code units (and for that matter, the input to UnicodeEscape as currently specified must also be a code unit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Picky but also quite fair. I've massaged the wording around to "or C has the same numeric value as a leading-surrogate code unit or trailing-surrogate code unit". It's a little awkward because I have to apply manual type conversions to both left-hand and right-hand sides, but maybe it does the trick?
I'm very confused by this PR given we are so out of sync with what's in ecma262 today. @jswalden what "algorithm used to interpret strings passed to |
@michaelficarra https://tc39.github.io/ecma262/#sec-performeval step 6 begins, "Let script be the ECMAScript code that is the result of parsing x, interpreted as UTF-16 encoded Unicode text as described in 6.1.4, for the goal symbol Script." What's happening there is a JavaScript string of 16-bit code units is being converted into the stream of Unicode code points suitable for interpretation with respect to the Script production -- interpreting valid code points as the same values and lone surrogates as code points of those same values. This is done by reference to the algorithm in 6.1.4. This conversion algorithm is exactly the same as what you want to do here, in determining how to quote each portion of the input string into a JSON literal. Does that clarify things? |
fb003fb
to
6720a7b
Compare
Okay, rebased against the main spec -- which I assume is done by copying the main spec text into spec.emu, then making ins/del changes against it? If there's something more to do here, let me know. Incidentally, I would probably make that table have only two columns -- and put |
@jswalden I think you misunderstood; I meant that we had to get this proposal defined on top of the spec master and then open a PR that changes it from iterating over code units to iterating over code points. Trying to both at the same time just confuses matters. |
@jswalden I just landed the proposal rebase, so you're clear to rebase this PR on top of the new proposal master. |
6720a7b
to
d778fe2
Compare
@gibson042 Done, I think. |
d778fe2
to
9087749
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the iteration change. A bit more cleanup and this is good to land.
spec.emu
Outdated
@@ -16,17 +16,13 @@ contributors: Richard Gibson | |||
<p>The abstract operation QuoteJSONString with argument _value_ wraps a String value in QUOTATION MARK code units and escapes certain other code units within it.</p> | |||
<emu-alg> | |||
1. Let _product_ be the String value consisting solely of the code unit 0x0022 (QUOTATION MARK). | |||
1. For each code unit _C_ in _value_, do | |||
1. If the numeric value of _C_ is listed in the Code Unit Value column of <emu-xref href="#table-json-single-character-escapes"></emu-xref>, then | |||
1. For each code <del>unit</del><ins>point</ins> _C_ in _value_ <ins>when interpreted as UTF-16 encoded Unicode text as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing </ins>
end tag.
spec.emu
Outdated
@@ -46,7 +42,7 @@ contributors: Richard Gibson | |||
</tr> | |||
<tr> | |||
<td> | |||
`0x0008` | |||
<del>0x</del><ins>U+</ins>0008 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's cover these changes a little more clearly:
<del>`0x0008`</del><ins>U+0008</ins>
spec.emu
Outdated
1. Set _product_ to the string-concatenation of _product_ and the Escape Sequence for _C_ as specified in <emu-xref href="#table-json-single-character-escapes"></emu-xref>. | ||
1. Else if _C_ has a numeric value less than 0x0020 (SPACE), then | ||
1. Else if _C_ has a numeric value less than 0x0020 (SPACE), <ins>or _C_ has the same numeric value as a <emu-xref href="#leading-surrogate-code-unit"></emu-xref> or <emu-xref href="#trailing-surrogate-code-unit"></emu-xref>,</ins> then | ||
1. Set _product_ to the string-concatenation of _product_ and UnicodeEscape(_C_). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnicodeEscape is documented to accept a code unit, not a code point, so either that operation or invocations thereof need to updated as well.
… to interpret strings passed to `eval`.
9087749
to
cebbe18
Compare
Updates done. If I were touching the main spec directly, using the usual diff-tools, I would change UnicodeEscape (which is only used by QuoteJSONString) to accept code points. But I am not, so I will not. |
There's already spec algorithm/text for interpreting JS strings that contain lone surrogates, in the manner that QuoteJSONString wants to interpret them -- as code points, with lone surrogates interpreted as whole code points. We should reuse that.
Additionally, the main spec already has definitions for leading/trailing surrogate that can be reused. That seems better than introducing new definitions that cross-reference elsewhere.
I have no idea what the custom is for dels/inses or how exactly this is supposed to be written -- the existing dels/inses do not match up with the spec text in https://github.com/tc39/ecma262 right now, so I'm not sure what they're acting against -- so I mostly just used them with respect to the text currently there.