Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editorial: Reference leading/trailing surrogate definitions more #1532

Open
wants to merge 10 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@gibson042
Copy link
Contributor

commented May 9, 2019

No description provided.

@ljharb

ljharb approved these changes May 10, 2019

spec.html Outdated
1. Let _second_ be the numeric value of the code unit at index _position_ + 1 within the String _s_.
1. If _second_ < 0xDC00 or _second_ > 0xDFFF, let _resultString_ be the String value consisting of the single code unit _first_.
1. Let _second_ be the code unit at index _position_ + 1 within the String _s_.
1. If _second_ is not a <emu-xref href="#trailing-surrogate"></emu-xref>, let _resultString_ be the String value consisting of the single code unit _first_.
1. Else, let _resultString_ be the string-concatenation of the code unit _first_ and the code unit _second_.

This comment has been minimized.

Copy link
@ljharb

ljharb May 10, 2019

Member

it kind of seems like we could benefit from an abstract operation that takes s, and position, and returns either the concatenation, or a List of first, second - then we could use it both here, and in codePointAt (and potentially other places)

This comment has been minimized.

Copy link
@ljharb

ljharb May 12, 2019

Member

It’d be nice to get that in this PR, if possible ;-)

This comment has been minimized.

Copy link
@gibson042

gibson042 May 12, 2019

Author Contributor

Added in 9105de5, though I'm not sure whether the new operation belongs with UTF16Encoding and UTF16Decode under Section 10.1: Source Text or somewhere else, or even whether all three should move to something like a new "Code Point Operations" section under Section 7: Abstract Operations. Please share your thoughts.

This comment has been minimized.

Copy link
@ljharb

ljharb May 12, 2019

Member

Thanks, this is great. i don’t have an opinion on where it lives; alongside the utf16 ops seems fine for now.

@ljharb ljharb requested review from zenparsing, tc39/ecma262-editors and mathiasbynens May 10, 2019

Show resolved Hide resolved spec.html Outdated

@gibson042 gibson042 force-pushed the gibson042:2019-05-surrogate-refs branch from af74864 to 9105de5 May 12, 2019


<emu-clause id="sec-codepointat" aoid="CodePointAt">
<h1>Static Semantics: CodePointAt ( _string_, _position_ )</h1>
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>

This comment has been minimized.

Copy link
@mathiasbynens

mathiasbynens May 12, 2019

Member

Nit: UTF-16 encodedUTF-16-encoded

Suggested change
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16 encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>
<p>The abstract operation CodePointAt interprets a String _string_ as a sequence of UTF-16-encoded code points, as described in <emu-xref href="#sec-ecmascript-language-types-string-type"></emu-xref>, and reads from it a single code point starting with the code unit at index _position_. When called, the following steps are performed:</p>

This comment has been minimized.

Copy link
@gibson042

gibson042 May 13, 2019

Author Contributor

This one I will leave for now, because the "UTF-16 encoded" text is common to several other parts of the spec (e.g., ToNumber and many subsections of String objects).

@jmdyck

This comment has been minimized.

Copy link
Collaborator

commented May 13, 2019

I generally like the refactoring here, as it increases the encapsulation of ECMAScript's version of UTF-16 decoding. However, the lines that compare the numeric value of _cp_ to 0xFFFF are a bit of a leak in the abstraction.

Also, it seems a bit odd that, because CodePointAt calls UTF16Decode, %StringIteratorPrototype%.next has to then immediately call UTF16Encoding because it's only interested in the code units.

I like @ljharb's original suggestion where the abstract op returned the concatenation or a List of the contributing code units. (Not sure what you'd call it, maybe UTF16CodeUnitsAt.) We could still have CodePointAt().

@gibson042

This comment has been minimized.

Copy link
Contributor Author

commented May 14, 2019

The operations have similar but distinct requirements.

  • %StringIteratorPrototype%.next needs the count of code units to slice
  • AdvanceStringIndex needs the count for index advancement
  • String.prototype.codePointAt needs a numeric code point value
  • Encode needs everything (the count for index advancing and a numeric value for unpaired surrogate checking and UTF-8 encoding).

The 1) CodePointAt operation I've just added reports the code point and lets the operations that need count infer it by comparison to the maximum single-code-unit value, which is a bit messy but IMO no worse than 2) $hardToName returning a List of code units and letting the operations that need code unit count get it from the number of elements and the operations that need a code point get it from UTF16Decode. I suppose the other alternatives would be 3) CodePointAt returning a Record with both kinds of data, or 4) $hardToName2 returning just a count of code units.

I like option 3, but it seems like overkill to me so I have fallen back on 1. If there is consensus on another option, though, I'll make the change.

@ljharb

This comment has been minimized.

Copy link
Member

commented May 14, 2019

Option 3 does sound nice.

@jmdyck

This comment has been minimized.

Copy link
Collaborator

commented May 15, 2019

Encode needs everything (the count for index advancing and a numeric value for unpaired surrogate checking and UTF-8 encoding).

Yeah, Encode is the real problem, because it's the only one that needs information at multiple "levels".

I like option 3 [returning a Record], but it seems like overkill to me

I suspect it wouldn't be that bad.

You could also fold in Encode's lone surrogate checking (another leak in the abstraction): a field of the record could indicate whether a properly-encoded code point was found. That would almost completely encapsulate ECMAScript's version of UTF-16 decoding.

Show resolved Hide resolved spec.html
Show resolved Hide resolved spec.html
@gibson042

This comment has been minimized.

Copy link
Contributor Author

commented May 18, 2019

OK, updated CodePointAt to return a Record: 21e0abb...f2f6a60

The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:

  1. If first is a trailing surrogate or at the end of the string, return $unpaired.
  2. Read second.
  3. If second is not a trailing surrogate, return $unpaired.
  4. Let cp be UTF16Decode(first, second).
  5. Return $surrogatePair.
Show resolved Hide resolved spec.html
Show resolved Hide resolved spec.html
Show resolved Hide resolved spec.html
@jmdyck

This comment has been minimized.

Copy link
Collaborator

commented May 21, 2019

OK, updated CodePointAt to return a Record:

Yeah, I like this approach.

The logic now uses multi-statement conditional blocks to avoid duplicating the unpaired surrogate return value, but could also be re-linearized upon request:

Personally, I think the suggested linearization would make the operation easier to understand. Duplicating the unpaired return doesn't bother me.

One thing that I think might increase readability would be to put each 'return' on its own (sub)step, because then:

  • depending on window width, you're more likely to get each Record constructor on a single line, and
  • the Record constructors will (mostly) line up vertically, making it easier to compare them.
@jmdyck

This comment has been minimized.

Copy link
Collaborator

commented May 21, 2019

Rename [[CodeUnitCount]] to [[Length]] for better small-window rendering

I've got mixed feelings about this, because CodeUnitCount was really clear about its meaning. Could you add a sentence to the preamble for CodePointAt, defining [[Length]]?

@ljharb

This comment has been minimized.

Copy link
Member

commented May 22, 2019

I’m not super concerned about small window rendering, but I’m very concerned about self-documenting names - maybe we could flip that one back?

@gibson042

This comment has been minimized.

Copy link
Contributor Author

commented May 22, 2019

OK, fair enough. Reverted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.