Conversion to scalar value string #345

Open
wlammen opened this issue Oct 9, 2020 · 7 comments
Comments

wlammen commented Oct 9, 2020

In Section 4.6 I read

To convert a string into a scalar value string, replace any surrogates with U+FFFD.

This is technically correct, but I suspect it is not what you meant to express, especially in light of the note that follows.

Let's look at the example given before, where a string contains 3 surrogates:

0xD83D, 0xDCA9, and 0xD800

The first two combine in the ordinary way into the character named PILE OF POO. Only the third (U+D800) is not part of a surrogate pair and is thus subject to substitution by U+FFFD, as suggested by Unicode chapter 5.22. The wording chosen in the Infra Standard instead demands substituting all 3 characters. Is that really what you want? And to what end?

I suggest changing the cited text to

To convert a string into a scalar value string, replace any unpaired surrogates with U+FFFD.

Wolf Lammen

PS: @andreubotella pointed out that I mixed up Infra's definition of surrogate with Unicode's. So read with care!

Member

andreubotella commented Oct 9, 2020

Since a surrogate is defined as being a code point, not a code unit, the very presence of the word "surrogate" in that algorithm implies that the string must first be converted to a code point representation (as per the second normative paragraph in the "strings" section), which itself transforms surrogate (code unit) pairs into the corresponding non-BMP scalar values. Therefore, all remaining surrogates are unpaired surrogates.

But I think that's made very clear in the note immediately below that paragraph.
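
As an illustration of the pair-to-scalar-value transformation described above, here is a minimal sketch of the standard UTF-16 surrogate pair arithmetic. The function name `combine_surrogate_pair` is hypothetical and does not come from Infra; Python is used only for illustration.

```python
def combine_surrogate_pair(lead: int, trail: int) -> int:
    """Combine a lead/trail surrogate code unit pair into one scalar value.

    Hypothetical helper for illustration; not an Infra algorithm name.
    """
    if not (0xD800 <= lead <= 0xDBFF):
        raise ValueError(f"not a lead surrogate: {lead:#06x}")
    if not (0xDC00 <= trail <= 0xDFFF):
        raise ValueError(f"not a trail surrogate: {trail:#06x}")
    # Each surrogate carries 10 payload bits; the result is offset by 0x10000.
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)


print(f"U+{combine_surrogate_pair(0xD83D, 0xDCA9):04X}")  # U+1F4A9
```

With the first two code units of the example string consumed this way, only 0xD800 is left to become a surrogate code point.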

Author

wlammen commented Oct 9, 2020

You are right. I followed the link to the definition of surrogate, and mixed it up with code unit. Thanks for the explanation.

I still feel uneasy, though, since what Infra calls a surrogate is a surrogate pair in Unicode (chapter 3.8). And a single surrogate (which exists only as a high or low surrogate in Unicode) has a Unicode 32-bit code point, which it does not have in Infra. It lives just as a code unit there. People familiar with Unicode will have many occasions to get lost in the differing definitions.

Author

wlammen commented Oct 9, 2020

I don't get this right.

Infra defines a surrogate as being a code point between U+D800 and U+DFFF. And a code point is a Unicode code point. Which means it is not a character? Because in Unicode a code point is a number https://www.unicode.org/versions/Unicode13.0.0/ch01.pdf#I1.8857 .

So it is a number between 0xD800 and 0xDFFF. OK fine. Then it is a code unit, too. And a string is an array of code units.

Then why on earth does my example string

0xD83D, 0xDCA9, and 0xD800

not consist of three surrogates that are subject to replacement during conversion to a scalar value string?

I don't understand this.

PS. @andreubotella responded to this post while I was still editing. Please look at previous versions (available under ...) to fully understand his answer.

Member

andreubotella commented Oct 9, 2020

Infra's "surrogate" corresponds to Unicode's "surrogate code point" (not to Unicode's "surrogate pair"). Now, Infra doesn't have a definition of "surrogate pair", but the usages in non-normative text correspond to Unicode's definition, which refers to code units. Those uses are somewhat informal, and maybe should be linked to Unicode's definition.

Infra defines a surrogate as being a code point between U+D800 and U+DFFF. And a code point is a Unicode code point. Which means it is not a character? Because in Unicode a code point is a number https://www.unicode.org/versions/Unicode13.0.0/ch01.pdf#I1.8857 .

Unicode defines a mapping between numbers ("code points") and meanings (...sorta). The term "character" is often used to refer to those meanings, but since the mapping is one-to-one, you can use "code point" and "character" interchangeably, as we do in Infra:

Code points are sometimes referred to as characters and in certain contexts are prefixed with "0x" rather than "U+".


So it is a number between 0xD800 and 0xDFFF. OK fine. Then it is a code unit, too. And a string is an array of code units.

A code unit is its own specific type of number which is limited to the range between 0 and 0xFFFF, inclusive (this definition of code unit is different from Unicode's, see Unicode 2.5). A code point is its own specific type of number which is limited to the range between 0 and 0x10FFFF, inclusive. An Infra string is defined as a sequence of code units, whether they be valid UTF-16 or not. Additionally, an Infra string can be converted (implicitly, even) into a sequence of code points – a conversion which transforms each code unit to its corresponding code point, except for surrogate code unit pairs.

So, a string containing the following code units:

0xD83D 0xDCA9 0xD800

would be converted to a sequence of code points:

U+1F4A9 U+D800
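
A minimal sketch of that conversion, assuming the string is modelled as a plain list of 16-bit integers. The function name `code_units_to_code_points` is hypothetical (it is not an Infra term), and Python is used only for illustration.

```python
from typing import List


def code_units_to_code_points(units: List[int]) -> List[int]:
    """Interpret a sequence of 16-bit code units as code points.

    A lead surrogate followed by a trail surrogate becomes one scalar
    value; every other code unit maps to the code point of the same value.
    """
    points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            trail = units[i + 1]
            points.append(0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00))
            i += 2
        else:
            # Includes unpaired surrogates, which pass through unchanged.
            points.append(unit)
            i += 1
    return points


result = code_units_to_code_points([0xD83D, 0xDCA9, 0xD800])
print([f"U+{p:04X}" for p in result])  # ['U+1F4A9', 'U+D800']
```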

You might argue that having U+D800 there is a violation of Unicode, but the web started using Unicode before surrogates were a thing, and changing this behavior might break existing sites.

And finally, if you were to transform that string to a scalar value string, you'd start from the code point conversion and replace any remaining surrogate code points:

U+1F4A9 U+FFFD
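
As a sketch of this last step, the replacement can be written over a list of code points. The function name `to_scalar_values` is hypothetical. The second call shows the reading this issue started from, where the replacement is applied directly to code units and all three are substituted.

```python
from typing import List

REPLACEMENT_CHARACTER = 0xFFFD


def to_scalar_values(code_points: List[int]) -> List[int]:
    """Replace every surrogate code point (U+D800 to U+DFFF) with U+FFFD."""
    return [
        REPLACEMENT_CHARACTER if 0xD800 <= p <= 0xDFFF else p
        for p in code_points
    ]


# Applied after the code point conversion: only the unpaired surrogate changes.
print([f"U+{p:04X}" for p in to_scalar_values([0x1F4A9, 0xD800])])
# ['U+1F4A9', 'U+FFFD']

# Applied to the raw code units (the misreading): all three are replaced.
print([f"U+{p:04X}" for p in to_scalar_values([0xD83D, 0xDCA9, 0xD800])])
# ['U+FFFD', 'U+FFFD', 'U+FFFD']
```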

Edit: The reason you'd start from the code point conversion is that "replace any surrogates with U+FFFD" talks about "surrogates", which Infra defines as code points, and "U+FFFD", which is the representation commonly used to refer to code points. The conversion is implicit, and maybe we should work on that (on #319 perhaps?), but the usage of code point in that algorithm forces that interpretation.

Author

wlammen commented Oct 9, 2020

@andreubotella First I thought one should close this issue. But I think one should leave it open. There are some details to clarify.

Author

wlammen commented Oct 10, 2020

I revisited this issue to analyze what I found difficult to understand.
Let's first look at my post #345 (comment). I wrote

...a number between 0xD800 and 0xDFFF ... is a code unit, too.

No. A number (value in general) is not the same as the memory unit (or abstraction thereof) containing it. This breaks the argument chain.

But what holds for me holds likewise for the standard. If it declares a surrogate to be a value, it should not use the term for the (code) units containing such values. Both senses are mixed in a single phrase here:

The replaced surrogates are always isolated surrogates..

Things like surrogate pair or isolated surrogate do not make any sense given Infra's definition. IMO the lack of clear wording here is particularly misleading, because other leading standards, such as Unicode or ECMA-262, identify a surrogate with a unit containing a value from the surrogates range; see for example https://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#M9.27720.Heading.132.Surrogates.Area. To illustrate how far Unicode differs from Infra in this respect, note that a string containing a surrogate (Infra) is ill-formed in Unicode, while surrogates (Unicode) are common and most often correctly used characters in UTF-16. I wonder whether one would be better off aligning one's terms consistently with these big documents. Of course, technically, one can establish one's own language, for whatever reasons.
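
The contrast drawn here can be observed in CPython, whose `str` type is a sequence of code points. This is only an illustration of Python's built-in codecs, not of anything Infra specifies, and the variable names are made up for the example.

```python
# A supplementary character is encoded in UTF-16 as a pair of surrogate
# code units, which is well-formed.
poo = "\U0001F4A9"
utf16 = poo.encode("utf-16-be")
print(utf16.hex())  # d83ddca9

# A lone surrogate code point cannot be encoded by the strict UTF-8 codec.
lone = "\ud800"
rejected = False
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    rejected = True
print(rejected)  # True
```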

Member

annevk commented Oct 12, 2020

It declares a surrogate as being a code point within a specific range of code points (code points are also signaled using the U+ syntax). And it defines that when a string is discussed in the context of code points it does not contain surrogate pairs. This is further explained in the non-normative text.

I think it holds together well.

I suppose we might take editorial PRs here, but it really depends on what they would say. I'm personally quite happy with the model we have here as it works really well for strings within the context of the web platform.
