Allow surrogate codepoints when escaped with \u #20270

Open
mnemnion opened this issue Jun 11, 2024 · 28 comments
Labels: proposal
@mnemnion

mnemnion commented Jun 11, 2024

Update: the first post (which you're reading) is imprecise in some ways, especially the use of the word "valid". Please skip ahead to this comment for a better take on the aims and goals here.

In generating some Unicode test data, I discovered that Zig doesn't allow surrogates, even when encoded with \u.

Minimal reproduction:

const invalid_str = "\u{d800}";

The error (Zig 0.12.0): "error: unicode escape does not correspond to a valid codepoint".

The error message is not correct: the UTF-16 surrogates are valid codepoints, in category Other: Surrogate. Here's a property page for U+D800.

It makes sense to me that the parser, on finding random surrogates in a string, should reject them as not well-formed, just as it would balk at a control character or a bare \xff. Such random garbage is most likely a mistake. The same applies to overlong encodings: you could argue that they're "well encoded" in that they match the necessary pattern for a UTF-8 encoded codepoint, but the standard specifically forbids them. Unlike the surrogates, these are not codepoints.

But Zig does not demand that string data be well-formed UTF-8; the \x byte escape can represent arbitrary bytes within a string. In fact, "\u{01}" is valid, when the same byte would not be if embedded raw in the string.

It doesn't make sense to me that a codepoint specifically encoded as e.g. \u{d800} should create a compile error. That's the author affirmatively requesting that said codepoint be encoded into UTF-8 and added to the sequence; it's not the sort of thing which happens by accident. It has an exact interpretation, .{0xed, 0xa0, 0x80}, which follows the UTF-8 encoding pattern. I'll contrast this with \u{80000000}, which can't be turned into bytes: this is genuinely invalid. .{0xc0, 0xaf} is also invalid, despite having the superficial qualities of a UTF-8 encoding, since it's overlong. There's no way to represent those in U+ notation, so this is a bit off topic; the point is that I could stuff it into a string with \xc0\xaf and that would be fine with the compiler.

There's no reason for the compiler to contain extra logic to prevent surrogates from being represented with \u notation when they are specifically requested. In my case, it means that if I want test data which covers the entire three-byte range of Unicode, I must detect the surrogate range and special-case it as \x sequences. Or perhaps someone is writing a fixup tool which detects an invalid encoding of surrogate pairs and produces correct UTF-8; there, too, the test data will have surrogates in it, and this behavior is an arbitrary limitation to work around. TL;DR: the compiler should accept \u notation for all codepoints in Unicode, surrogates included.
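To make the workaround concrete, here's a minimal sketch of what that test data looks like today (the sample values are mine, purely illustrative); the surrogate block has to drop down to \x bytes while everything around it can use \u:

    // Status quo: sampling the three-byte range of Unicode.
    const three_byte_samples = [_][]const u8{
        "\u{800}", // first three-byte codepoint
        "\u{d7ff}", // last codepoint before the surrogate block
        "\xed\xa0\x80", // U+D800, must be spelled as bytes today
        "\xed\xbf\xbf", // U+DFFF, must be spelled as bytes today
        "\u{e000}", // first codepoint after the surrogate block
        "\u{ffff}", // last three-byte codepoint
    };

Under the proposal, the two \x entries could simply be written "\u{d800}" and "\u{dfff}".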

@mlugg added the proposal label Jun 12, 2024
@squeek502
Collaborator

squeek502 commented Jun 12, 2024

My current understanding is that there is no canonical UTF-8 representation of surrogate code points, hence WTF-8.

From the latest Unicode Standard:

3.8 Surrogates
D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
D72 High-surrogate code unit: A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair.
D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
D74 Low-surrogate code unit: A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair.
• High-surrogate and low-surrogate code points are designated only for that use.
• High-surrogate and low-surrogate code units are used only in the context of the UTF-16 character encoding form.

3.9 Unicode Encoding Forms
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

UTF-8
• Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

So, the most precise error for current Zig would be "unicode escape does not correspond to a valid unicode scalar value", and perhaps this proposal should instead be advocating for \u to encode the code points as WTF-8?

@mnemnion
Author

Yes, this gets at the distinction between a Unicode character and a Unicode codepoint. The surrogates are not valid Unicode characters, and neither are the noncharacter codepoints. They are codepoints, however: they have an assigned value, character properties, and a valid (let's say: specific, or: singular) encoding in every Unicode encoding. Scalar value is the modern term the standard uses for characters, because what civilians think of as a character includes Farmer Bob here 👨🏻‍🌾, who is made up of several Unicode scalar values.

The notation Unicode uses for codepoints is U+abcdef, which is reflected in the use of \u style escape sequences in most programming languages. When the standard says "surrogate pairs have no interpretation in UTF-8", that's in contrast to UTF-16, where a pair of surrogates stands for a scalar value from a higher plane.

But they do have an encoding in UTF-8. What the standard is getting at is that it's forbidden to interpret surrogate pairs in UTF-8 as the abstract character / scalar value which they would represent if found in UTF-16. This is invalid for the same reason that overlong encodings are invalid: it serves no purpose, complicates software, and has security implications which are decidedly negative.

All of these are good reasons to reject bare surrogates in string data! But they aren't good reasons to treat \u{d800} as a syntax error. If you look at the property page I linked to, it shows the UTF-8 encoding for that codepoint. The page isn't canonical, but it's generated from canonical data. Both the codepoint and its encoding are official parts of Unicode and UTF-8, respectively.

A string containing these codepoints is ill-formed UTF-8. WTF-8 doesn't have to be brought into the picture, because once turned into binary data, Zig strings aren't WTF-8 either, they allow all sorts of byte sequences which aren't valid in either encoding.

As a sort of Zen koan to convey my meaning: if the surrogates didn't have an encoding in UTF-8, then software wouldn't be able to reject those byte sequences as invalid. That's also the most pragmatic reason I'm advocating for their inclusion in the \u notation: representing invalid data is an important thing to be able to do.

It's invalid to have the surrogate codepoints anywhere in a UTF-8 string, just as it's invalid to have only one in a UTF-16 string, or a low followed by a high, but in both cases it is a codepoint and it does have an encoding.

There are all sorts of ways for UTF-8 to be ill-formed, all of them are supported through the \x sequences in Zig strings. Some of those ill-formed sequences represent Unicode codepoints: my case is that we should be able to encode those codepoints using the \u sequence, which exists for codepoints, and which, stripped of the extraneous requirement for well-formedness (which does not exist in Zig strings), would have a one-to-one correspondence to Unicode codepoints.

So we all agree: if the Zig parser is chugging along through a string, and sees these three bytes: 0xed, 0xa0, 0x80, that should of course be a syntax error. But if it sees \u{d800}, it should do what Julia (for an example I'm familiar with) does with \ud800: generate those three bytes for the in-memory representation of the string.
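For concreteness, here is the arithmetic behind those three bytes (just the standard three-byte bit pattern applied to U+D800, nothing Zig-specific):

    // U+D800 = 0b1101_1000_0000_0000, split 4/6/6 into the pattern
    // 1110xxxx 10xxxxxx 10xxxxxx:
    //   1110_1101  10_100000  10_000000
    const d800_bytes = [3]u8{ 0xED, 0xA0, 0x80 };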

@squeek502
Collaborator

squeek502 commented Jun 12, 2024

It would be helpful if you could link to the relevant parts of the Unicode Standard and/or UCD. I'm unable to find anything that specifically talks about/declares a canonical encoding of surrogate code points for UTF-8.

From what I can tell, U+D800 is treated roughly the same as e.g. code point U+110000. Both are technically representable using <= 4 bytes with the normal UTF-8 encoding algorithm (see RFC2279 where encoding up to 7FFF FFFF was allowed), but are now defined by the standard to be unrepresentable/invalid for historical UTF-16 related reasons. It's unclear to me why \u{d800} would be allowed but \u{110000} wouldn't be on that basis (in contrast to defining \u to encode the code point as WTF-8, which would make a clear distinction).

As for my position on this proposal:

  • Zig enforcing valid unicode scalar values for \u seems like a reasonable/defensible choice (and the error message should be updated to be more accurate if this behavior is kept)
  • Zig allowing surrogate code points in \u escapes and encoding them as WTF-8 seems like a reasonable/defensible choice, but I'm not sure it's defensible using the Unicode Standard, and instead the argument would be about the practicality/use cases (e.g. that this test case and other WTF-8 test cases would be more understandable, that being able to use \u with surrogate code points would make working with WTF-8 paths using string literals nicer, etc)

EDIT: To be clear: I'm largely in favor of this proposal. My only reservation would be the potential for users to accidentally/unexpectedly create a sequence of ill-formed UTF-8, but the risk of that does seem pretty minor.


EDIT#2: Some clearer statements:

RFC3629:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

Unicode Standard Ch 2.4:

Restricted Interchange. Code points that are not assigned to abstract characters are subject to restrictions in interchange.
• Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form.

squeek502 added a commit to squeek502/zig that referenced this issue Jun 12, 2024
The surrogate code points U+D800 to U+DFFF are valid code points but are not Unicode scalar values. This commit makes the error message more accurately reflect what is actually allowed in `\u` escape sequences.

From https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf:

> D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
> D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
>
> 3.9 Unicode Encoding Forms
> D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

Related: ziglang#20270
@zzo38

zzo38 commented Jun 12, 2024

UTF-8 can also be treated as an encoding of 31-bit numbers rather than of Unicode characters (although if it is a valid Unicode code point number then it is a valid encoding of a Unicode character (or at least a part of one)); and in that interpretation, \u{d800} makes sense as producing the UTF-8 encoding of the number 55296, and this encoding is well-defined (even though there is no Unicode character with code point number 55296). (Similarly, \u{7fffffff} is also valid if it would produce the UTF-8 encoding of the number 2147483647, although of course it should be an error if you tried to include such a code in a UTF-16 string.)

My only reservation would be the potential for users to accidentally/unexpectedly create a sequence of ill-formed UTF-8, but the risk of that does seem pretty minor.

If you are concerned about that, then a (suppressable) warning message might make sense.

@squeek502
Collaborator

squeek502 commented Jun 12, 2024

the UTF-8 encoding of the number 55296
the UTF-8 encoding of the number 2147483647

My reading of UTF-8 as defined by RFC3629 and the Unicode Standard is that there are no such encodings, i.e. U+D800-U+DFFF and > U+10FFFF are explicitly disallowed from being encoded.

Hence why I think this proposal should be focused on defining \u as being WTF-8 encoded instead.

(side note: 7FFFFFFF, if legal, would actually be encoded as 6 bytes as detailed in the obsoleted RFC2279; the highest possible value encoded by 4 bytes is 1FFFFF)

(suppressable) warning message

Zig doesn't have warnings

squeek502 added a commit to squeek502/zig that referenced this issue Jun 12, 2024
@Vexu added this to the 0.14.0 milestone Jun 12, 2024
@mnemnion
Author

It's possible to describe the proposal without reference to WTF-8. I think that's generally valuable, because the goal isn't the same as that of WTF-8. WTF-8 defines a more permissive encoding, whereas Zig strings don't define an encoding at all.

I'm going to use bold and italics here to reference terms defined by, or used in, the standard, and try to draw out some precise language, which will satisfy the desire for the proposal to be a) expressed according to the language of the standard and b) without reference to the separate WTF-8 standard.

The proposal relates to codepoints. Unicode defines U+0 through U+10ffff as codepoints. In Zig, with the exception we're discussing, these may all be represented as \u{0} through \u{10ffff}.

Currently, without said exception, Zig turns these escape sequences into what the standard calls code unit sequences, specifically those of UTF-8. In UTF-8, the surrogate codepoints are not valid code unit sequences, because they don't correspond to a valid abstract character or, equivalently, scalar value. In UTF-16, this is true of mismatched surrogates: low-high, or one or the other encountered without a pair.

Section C8 reads:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall interpret that code unit sequence according to the corresponding code point sequence.

So now we introduce code point sequences as well. These are the specific byte patterns representing a given codepoint in a given encoding. Corollary: every codepoint, without exception, has a code point sequence in every Unicode encoding.

Are we sure that surrogates are codepoints? We are:

D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

Also, D15:

Surrogate code points and noncharacters are considered assigned code points, but not assigned characters.

And we have this:

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.

Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.

Defining a scalar value thus:

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

Referring to UTF-8 specifically, the standard dismisses surrogates as follows:

Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

So here's the synthesis: in order to reject ill-formed values as defined by D84, a UTF-8 decoder first recognizes that code point sequence, as it is directed to do by C8, then rejects it as an invalid code unit sequence.

The proposal

Current behavior: Zig interprets a sequence \u{xxxxxx} not (as the documentation says) as a "hexadecimal Unicode code point UTF-8 encoded (1 or more digits)", but rather as a hexadecimal Unicode scalar value, and translates escape sequences for which that interpretation is valid into UTF-8 code unit sequences.

Proposed behavior: Zig interprets a sequence \u{xxxxx} as corresponding to a Unicode codepoint, and generates the corresponding UTF-8 code point sequence, without reference to its validity as UTF-8.

I believe that here we have a succinct description of the proposal, which uses terminology in the way in which the Unicode standard uses it.
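For anyone who prefers code to standardese, here's a minimal sketch of the encoder the proposed behavior implies (the name and shape are illustrative, not the compiler's actual implementation): an ordinary UTF-8-style encoder with the surrogate rejection dropped, while values above U+10FFFF are still rejected.

    /// Encode any codepoint 0x0..0x10FFFF into its code point sequence,
    /// surrogates included; returns the number of bytes written.
    fn encodeCodePoint(cp: u21, out: *[4]u8) error{CodepointTooLarge}!u3 {
        if (cp < 0x80) {
            out[0] = @intCast(cp);
            return 1;
        } else if (cp < 0x800) {
            out[0] = @intCast(0xC0 | (cp >> 6));
            out[1] = @intCast(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            // Note: no `if (cp >= 0xD800 and cp <= 0xDFFF) return error...` branch;
            // U+D800..U+DFFF get their three-byte sequence like any other codepoint.
            out[0] = @intCast(0xE0 | (cp >> 12));
            out[1] = @intCast(0x80 | ((cp >> 6) & 0x3F));
            out[2] = @intCast(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp < 0x110000) {
            out[0] = @intCast(0xF0 | (cp >> 18));
            out[1] = @intCast(0x80 | ((cp >> 12) & 0x3F));
            out[2] = @intCast(0x80 | ((cp >> 6) & 0x3F));
            out[3] = @intCast(0x80 | (cp & 0x3F));
            return 4;
        }
        return error.CodepointTooLarge;
    }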

@mnemnion
Author

Small addendum: this proposal, as expressed above, rejects sequences such as \u{110000} and higher values.

While one can follow the rules elucidated in the FSS UTF-8 standard of 1993, and turn that into bytes, U+110000 is not a codepoint, and those bytes are therefore not a code point sequence. So those remain invalid.

andrewrk pushed a commit that referenced this issue Jun 12, 2024
@squeek502
Collaborator

squeek502 commented Jun 12, 2024

My problems with that proposal:

  • You're justifying it using spec language around parsing instead of encoding. Currently, the Zig language reference says that \u will be UTF-8 encoded (and your intention is to keep it that way), but my understanding is still that there is no specified way to UTF-8 encode U+D800-U+DFFF.
  • I don't think this is ultimately very relevant to making a decision about this proposal and only serves to obscure the effects. I think biting the bullet and proposing to define \u to encode as WTF-8 (and therefore accept surrogate code points) makes for a clearer/better proposal in terms of the tradeoffs being made.

Here's my attempt at this proposal:

Status quo

Currently, Zig defines \u{NNNNNN} as

hexadecimal Unicode scalar value UTF-8 encoded (1 or more digits)

This means that:

  • All code points except the surrogate code points D800-DFFF are allowed
  • By using only \u escapes, there is no possible way to end up with ill-formed UTF-8.

However, with Zig now commonly returning and accepting WTF-8, being able to use \u for surrogate code points has practical use cases.

The proposal

This proposal is to instead define \u{NNNNNN} as

hexadecimal Unicode code point WTF-8 encoded (1 or more digits)

With potentially a clarifying note:

WTF-8 is a superset of UTF-8 that allows the codepoints U+D800 to U+DFFF (surrogate codepoints) to be encoded using the normal UTF-8 encoding algorithm

Use cases

The main use case here would be avoiding the need to manually specify the bytes for the WTF-8 encoding of surrogate code points in string literals. For example,

zig/lib/std/unicode.zig

Lines 2039 to 2047 in d9bd34f

try testRoundtripWtf8("\xed\x9f\xbf"); // not a surrogate half
try testRoundtripWtf8("\xed\xa0\xbd"); // high surrogate
try testRoundtripWtf8("\xed\xb2\xa9"); // low surrogate
try testRoundtripWtf8("\xed\xa0\xbd \xed\xb2\xa9"); // <high surrogate><space><low surrogate>
try testRoundtripWtf8("\xed\xa0\x80\xed\xaf\xbf"); // <high surrogate><high surrogate>
try testRoundtripWtf8("\xed\xa0\x80\xee\x80\x80"); // <high surrogate><not surrogate>
try testRoundtripWtf8("\xed\x9f\xbf\xed\xb0\x80"); // <not surrogate><low surrogate>
try testRoundtripWtf8("a\xed\xb0\x80"); // <not surrogate><low surrogate>
try testRoundtripWtf8("\xf0\x9f\x92\xa9"); // U+1F4A9, encoded as a surrogate pair in WTF-16

Could instead look like:

    try testRoundtripWtf8("\u{D7FF}"); // not a surrogate half
    try testRoundtripWtf8("\u{D83D}"); // high surrogate
    try testRoundtripWtf8("\u{DCA9}"); // low surrogate
    try testRoundtripWtf8("\u{D83D} \u{DCA9}"); // <high surrogate><space><low surrogate>
    try testRoundtripWtf8("\u{D800}\u{DBFF}"); // <high surrogate><high surrogate>
    try testRoundtripWtf8("\u{D800}\u{E000}"); // <high surrogate><not surrogate>
    try testRoundtripWtf8("\u{D7FF}\u{DC00}"); // <not surrogate><low surrogate>
    try testRoundtripWtf8("a\u{DC00}"); // <not surrogate><low surrogate>
    try testRoundtripWtf8("\u{1F4A9}"); // U+1F4A9, encoded as a surrogate pair in WTF-16

(this test [and others in std.unicode] was somewhat of a pain to write initially due to having to specify the WTF-8 encoding as bytes)

In the event that you're specifying a path/environment variable that happens to contain an unpaired surrogate, then not needing to manually specify it as WTF-8 encoded bytes would make things much nicer:

const path = "some/wtf8path/with\u{d83d}surrogates\u{dc00}";
var file = try std.fs.cwd().openFile(path, .{});
defer file.close();

const expected_val = "some wtf8 value with \u{d83d} surrogates \u{dc00}";
const actual_val = try std.process.getEnvVarOwned(allocator, "FOO");
defer allocator.free(actual_val);

if (std.mem.eql(u8, expected_val, actual_val)) {
    // ...
}

With status quo, you'd have to look up/calculate the WTF-8 encoded form of the surrogate code points and specify them as bytes.

Potential drawbacks

The biggest drawback would be that \u escapes would lose the property of always resulting in well-formed UTF-8, which may be surprising/unexpected.

However, using \u with surrogate code points seems somewhat hard to do by accident. The most "realistic" accidental usage I can think of would come from doing something like looping through all values 0 through 10FFFF and generating Zig code with string literals that have \u escapes with that value.

WTF-8 can be ill-formed, too

Another thing to keep in mind is that WTF-8 is intended to only encode unpaired surrogates. Doing something like:

"\u{D83D}\u{DCA9}" // <high surrogate><low surrogate>

would result in ill-formed WTF-8, since the encoded surrogate code points form a surrogate pair.

  • Ill-formed WTF-8 will not roundtrip when converting to WTF-16 and back to WTF-8 (the surrogate pair, when converted to WTF-16, will be indistinguishable from any other surrogate pair, so when converting back to WTF-8, it will end up as the UTF-8 encoding of the code point that is encoded by the surrogate pair; see the sketch after this list)
  • This is not necessarily a huge deal, since most of the time WTF-8 is just used as an interchange format and will be converted to WTF-16 before being used in syscalls. It is a problem if you're e.g. comparing two WTF-8 strings, though.
  • Note also that the std.unicode WTF-8 functions do not attempt to detect/enforce WTF-8 well-formedness; it is up to the user to ensure well-formedness if they have reason to care about it (see the notes in Fix handling of Windows (WTF-16) and WASI (UTF-8) paths, etc #19005)
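Here's the sketch referenced above. It uses only byte-string literals (so it compiles under status quo) and spells out the WTF-16 roundtrip as comments rather than std.unicode calls, to keep it independent of the exact API:

    const std = @import("std");

    test "paired surrogates do not roundtrip through WTF-16" {
        // What "\u{D83D}\u{DCA9}" would encode to under this proposal:
        const ill_formed = "\xed\xa0\xbd\xed\xb2\xa9";
        // WTF-8 -> WTF-16 yields the code units 0xD83D 0xDCA9, a valid surrogate
        // pair, which decodes as U+1F4A9; converting back to WTF-8 therefore
        // produces the UTF-8 encoding of U+1F4A9 rather than the original bytes:
        const roundtripped = "\u{1F4A9}"; // i.e. "\xf0\x9f\x92\xa9"
        try std.testing.expect(!std.mem.eql(u8, ill_formed, roundtripped));
    }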

@mnemnion
Author

WTF-8 can be ill-formed, too

This is one of a few reasons why I don't believe that bringing WTF-8 into the picture adds much to the discussion. UTF-8 is an encoding, and it's the one which Zig source code uses, but Zig isn't a language which validates the encoding property of strings.

I believe I've demonstrated adequately that the standard has a concept of a code point encoding, which the proposal for \u{xxxxxx} escape sequences precisely matches.

The standard devotes a great deal of energy to making sure that no one gets the idea that invalid use of codepoints is in any sense valid Unicode. It doesn't spend a lot of energy discussing the ins and outs of invalid sequences. Nonetheless, we get things like the official Unicode name for overlong encodings, a "non-shortest form".

I draw your attention again to this paragraph:

Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

So I'm not sure what we're supposed to do with this claim:

but my understanding is still that there is no specified way to UTF-8 encode U+D800-U+DFFF.

There is a well-understood way to emit a code point sequence corresponding to those codepoints. It's given in table 3.6 of the standard. No one, at any point in discussing this proposal, has had the misapprehension that the result would be well-formed UTF-8.

As a result, there's no concrete difference being discussed here.

Proposed behavior: Zig interprets a sequence \u{xxxxx} as corresponding to a Unicode codepoint, and generates the corresponding UTF-8 code point sequence, without reference to its validity as UTF-8.

This is accurate, and actionable, and uses the language of the standard.

What are you trying to accomplish by rephrasing it? I could tweak it a bit, actually; it has a bit of the flavor of standardese, given what I was reading right before I wrote it:

Proposed behavior: Zig interprets a sequence \u{xxxxx} as corresponding to a Unicode codepoint, and generates the corresponding UTF-8 code point sequence, whether or not this is a valid UTF-8 scalar value.

"the UTF-8 code point sequence corresponding to a Unicode codepoint" is, as I took pains to illustrate, a well-defined concept according to the standard. There's no need to bring a second standard along for the ride.

@mnemnion
Author

To try and keep this issue on track, I'll offer some other examples of when this might be useful.

A regex to detect surrogate code point sequences: r"[\u{d800}-\u{dfff}]". This would be annoying to express using \x escape sequences. Similarly, the range of all three-byte code point sequences can be represented as the pair '\u{800}', '\u{ffff}'. If the goal is to cover only the well-formed three-byte sequences, that can be accomplished with two ranges rather than one. But going in the other direction, covering all of the three-byte code point sequences, surrogates included, without access to the \u{xxxx} notation for the surrogates, is a slog.

Both of these are pretty useful when writing programs which are expected to detect invalid code unit sequences and handle them correctly, where "correctly" can mean treating the data as WTF-8, replacing the sequence with \u{fffd}, or whatever makes sense for the application. Making it difficult to handle these code point sequences makes it more likely that their existence will be ignored.
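As a small illustration of the detection side (the helper here is mine, not from std or any library): the encoded surrogates are exactly the three-byte sequences whose first byte is 0xED and whose second byte falls in 0xA0..0xBF, so a scanner can flag them directly:

    const std = @import("std");

    /// True if `bytes[i..]` begins the code point sequence of a surrogate
    /// (U+D800..U+DFFF): 0xED, then 0xA0..0xBF, then a continuation byte.
    fn startsWithEncodedSurrogate(bytes: []const u8, i: usize) bool {
        if (i + 3 > bytes.len) return false;
        return bytes[i] == 0xED and
            bytes[i + 1] >= 0xA0 and bytes[i + 1] <= 0xBF and
            bytes[i + 2] >= 0x80 and bytes[i + 2] <= 0xBF;
    }

    test "surrogate detection" {
        try std.testing.expect(startsWithEncodedSurrogate("\xed\xa0\x80", 0)); // U+D800
        try std.testing.expect(!startsWithEncodedSurrogate("\xed\x9f\xbf", 0)); // U+D7FF
    }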

@squeek502
Collaborator

squeek502 commented Jun 13, 2024

Proposed behavior: Zig interprets a sequence \u{xxxxx} as corresponding to a Unicode codepoint, and generates the corresponding UTF-8 code point sequence, whether or not this is a valid [Unicode] scalar value.

This is a description of what the WTF-8 spec calls generalized UTF-8, distinct from UTF-8 (see the difference from the Unicode Standard for the table of well-formed byte sequences).

What are you trying to accomplish by rephrasing it?

I think it makes things more clear, but I also think the particular language used is largely a distraction, apologies for dragging it out. The use cases/drawbacks are the most relevant thing, and are unrelated to the language used to define \u.

@hippietrail

The terminology "surrogate codepoints" is a bit ambiguous. It could conceivably refer to either "surrogate pairs" or "unpaired surrogates". I haven't read the entire thread but it seems to be about the latter, so might I suggest changing the title of the issue to Allow unpaired surrogate codepoints when escaped with \u.
Apologies if I misinterpreted.

@squeek502
Collaborator

squeek502 commented Jun 13, 2024

It could conceivably refer to either "surrogate pairs" or "unpaired surrogates"

It refers to neither in this case; a surrogate code point is just a code point between U+D800 and U+DFFF (see the quotes from the standard in this comment). The terms "surrogate pairs" and "unpaired surrogates" are only relevant when talking about particular encodings (UTF-16/WTF-16/WTF-8).

@mnemnion
Author

Allow unpaired surrogate codepoints when escaped with \u.

This is, strictly speaking, more restrictive than the actual proposal, because allowing surrogate codepoints (which is a complete idea which composes in certain ways) means allowing paired ones also, by using two escape sequences. That's some of the whole hoopla about WTF-8: it defines generalized UTF-8, which allows both lone surrogate codepoints and surrogate pairs, and WTF-8 proper, which only allows the former. I think we're past all the litigation on that front; there are a couple of ways to describe the proposal which amount to the same thing.

The term "surrogate codepoint" isn't ambiguous, however, no matter what encoding of Unicode we're referring to. In UTF-16 (only), paired surrogate codepoints are used to encode codepoints which correspond to scalar values, but those latter codepoints are not "surrogate codepoints". So in UTF-16, we would say that two code point sequences encoding two surrogate codepoints, make up one code unit sequence corresponding to one codepoint, which is a scalar value or abstract character. In UTF-8 and UTF-32, the surrogate code point sequences are ill-formed, in UTF-32, each of these is a single code unit.

In UTF-8, the code unit is a byte, so any byte sequence is a code unit sequence, which is either well-formed or ill-formed. Surrogate code point sequences are an example of an ill-formed sequence, and the one this proposal concerns itself with. "\xff\xff\xff" is an example of another ill-formed code unit sequence, one which is out of scope.

It's also true that if we take generalized UTF-8 from the WTF-8 standard, and subtract every valid sequence in Unicode's UTF-8, we end up with the same code unit sequences: every UTF-8-encoded code point sequence which does not encode a scalar value. This is an equivalent way of describing things.

@exxjob

exxjob commented Jun 15, 2024

Half slabs and teapots. Related:

@Paul-Dempsey

IMO: \u, as suggested by its name, represents a Unicode scalar (not UTF-16) (avoiding the confusing term "codepoint"). If you want to encode arbitrary bytes of test data that aren't valid UTF-8 (such as unpaired UTF-16 surrogates), \x is already available and easy to read: "\xD8\x00". I find the status quo perfectly reasonable and capable of readably representing these use cases, without inserting ambiguity into the interpretation of \u escapes.

@nektro
Contributor

nektro commented Jun 17, 2024

codepoint isnt a confusing term. imo \u should only be converting the codepoint number to utf-8 bytes, not also caring about whether the output is a valid utf-8 string.

@Paul-Dempsey

Paul-Dempsey commented Jun 17, 2024

I'm fine with extending \x to allow longer sequences of hex digits, which easily accomplishes the goal of being able to represent arbitrary data. An unpaired surrogate is not valid utf-8 bytes -- it is wtf-8 bytes. I'd be happy with \w for 16-bit codepoints converted to wtf-8 bytes. (perhaps better in bikeshed).

@nektro
Contributor

nektro commented Jun 17, 2024

codepoints are not 16 bit and the utf-8 encoding algorithm is not limited by the same bound codepoints are

@hippietrail

I'd be happy with \w for 16-bit codepoints converted to wtf-8 bytes.

I'm assuming you mean codepoints in the BMP (Basic Multilingual Plane) and you're not mixing up Unicode with old UCS-2? The wording "16-bit codepoint" should be avoided due to this ambiguity.

The ASCII subset, the Latin 1 subset, and the BMP subset are all kinda special, and more common, in decreasing order, than the rest of Unicode. 7-bit ASCII is represented exactly the same in UTF-8, Latin 1 shares codepoints from 0-255 but 128-255 need more than one byte, and the BMP covers 0-65535 so can be represented in 16 bits. Full Unicode coverage requires 21 bits.

Even though the Latin 1 range is a bit special does it really warrant a special escape convention?

@mnemnion
Author

mnemnion commented Jun 18, 2024

IMO: \u, as suggested by its name, represents a Unicode scalar (not UTF-16) (avoiding the confusing term "codepoint").

My opinion is that this is, in fact, the confusing read. This proposal would mean that for every codepoint which can be referred to as U+abcd, the sequence \u{abcd} will also refer to that codepoint, as encoded using the UTF-8 encoding rules (Unicode Standard 15.0, table 3.6).

Your reference to UTF-16 confuses me. This is entirely about UTF-8. Yes, many things which a Zig string can contain are invalid in a UTF-8 string, including code point sequences which do not represent scalar values. The question is whether that distinction creates a valuable restriction on the use of \u, which, I think, it does not.

An unpaired surrogate is not valid utf-8 bytes -- it is wtf-8 bytes.

We know this.

I'd be happy with \w for 16-bit codepoints converted to wtf-8 bytes. (perhaps better in bikeshed)

This proposal takes no stance on encoding UTF-16. "16-bit codepoints converted to wtf-8 bytes" is not, so far as I can determine, a sentence which makes any sense.

@andersen

andersen commented Jun 18, 2024

in order to reject ill-formed values [...], a UTF-8 decoder first recognizes that code point sequence [...], then rejects it as an invalid code unit sequence.

[Mostly tangential: This is obviously not true for octet sequences that cannot be mapped to code points (e.g., "\xFF"). More to the point, UTF-8 decoders do not necessarily do this for 'byte sequence[s] that would otherwise map to code points U+D800..U+DFFF' or values beyond U+10FFFF either. See, e.g., https://encoding.spec.whatwg.org/#utf-8-decoder for a decoding algorithm that rejects such code unit sequences without first mapping them to invalid code points.]

for every codepoint which can be referred to as U+abcd, the sequence \u{abcd} will also refer to that codepoint, as encoded using the UTF-8 encoding rules (Unicode Standard 15.0, table 3.6).

Sorry for playing the devil's advocate, but, according to its caption and heading, that table technically only provides a mapping for Unicode scalar values, specifically excluding the range from U+D800 to U+DFFF.

As pointed out by @squeek502, adding, without further restriction, surrogate code points to UTF-8 gives what WTF-8 refers to as generalised UTF-8, which can also be defined by changing the first column heading of Table 3-6 in the Unicode Standard from 'Scalar Value' to 'Code Point' (in the same way as CESU-8 changed it to 'UTF-16 Code Unit'), thereby providing a natural extension to the UTF-8 standard that allows both paired and unpaired surrogate code points to be encoded.

@mnemnion
Author

There are three ways a UTF-8 string can be wrong: it can have invalid bytes, it can encode a code point sequence which is not a valid scalar value, or it can be a non-shortest encoding. This proposes interpreting all \u{abcd} as encoding a code point sequence whether it's a scalar value or not.

If it helps you to think of that in terms of "generalized UTF-8" from the WTF-8 standard, knock yourself out. These are two ways of talking about the same thing. I don't think there was ever a point where this tangent added clarity to the proposal, but if there was such a point, we are well past it.

@mnemnion
Author

Also, the header of 3.6 is "UTF-8 Bit Sequence", and the bit patterns captioned "scalar value" are able to include all of the non-scalar-value codepoints. So if pedantry on the subject is that important to you, then we're referring to bits encoded as in table 3.6, with the word "scalar values" struck out, and replaced with "codepoint".

There's no actual ambiguity here. This is all wasted effort.

@andersen

we're referring to bits encoded as in table 3.6, with the word "scalar values" struck out, and replaced with "codepoint".

:-)

This is all wasted effort.

Maybe. I was hoping that if we could all agree that UTF-8-style encoding of surrogates constitutes a small extension/generalisation of the current official standard, then future unproductive discussion concerning the exact definition of UTF-8 might be avoided. Sorry if the result was the opposite of what I intended. I do actually support your proposal.

@mnemnion
Author

What I should have done is added an update to the first comment linking to the second draft of the proposal. I used the term "valid codepoint" there, and if there's something we all agree on, it's that an encoded surrogate code point sequence is not valid UTF-8. So I went ahead and did so.

ryoppippi pushed a commit to ryoppippi/zig that referenced this issue Jul 5, 2024
SammyJames pushed a commit to SammyJames/zig that referenced this issue Aug 7, 2024
@mnemnion
Author

mnemnion commented Sep 3, 2024

Another argument in favor of this proposal, which I just discovered, is this inconsistency:

const lit_char = '\u{d801}'; // This is legal
const lit_str = "\u{d801}"; // doesn't compile: 
// error: unicode escape does not correspond to a valid unicode scalar value

This is good news for what I'm working on, because it's one less special case to consider, but I don't see a strong argument for the same escape sequence working differently in a character or a string context. Working for strings as well would be less surprising, more consistent, and more useful.

@mnemnion
Author

mnemnion commented Sep 3, 2024

As a follow-up observation, C1 control codes are allowed 'raw' in string and codepoint literals, and perhaps should not be.

They pose some of the same risks as the C0 sequences. I encourage the interested to try this in their terminal:

> printf "abc\xc2\x9b31mdef"

In WezTerm, this prints the "def" in red, because \xc2\x9b is the C1 for CSI, usually spelled \x1b[. iTerm2 doesn't cooperate, replacing it with a single space. It's possible to add those bytes to a Zig string or character literal with a hex editor, and the parser accepts them as valid.

If pressed to make a decision one way or the other, I would say that no Unicode control characters from the Cc category should be present unescaped in a single-line string or a character literal, and only \n and \r in multiline strings; this is the same policy as status quo, just with the C1 sequences for U+80-U+9F also included.
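For what it's worth, the escaped spellings would remain available under that policy; only the raw bytes in source would be rejected. A small example of the same CSI sequence written out explicitly (illustrative only):

    // U+009B (C1 CSI) written as an escape; the bytes are 0xC2 0x9B either way.
    const red_def = "abc\u{9b}31mdef";
    const red_def_bytes = "abc\xc2\x9b31mdef";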
