Allow surrogate codepoints when escaped with \u #20270
Comments
My current understanding is that there is no canonical UTF-8 representation of surrogate code points, hence WTF-8. From the latest Unicode Standard:
So, the most precise error for current Zig would be something like "unicode escape does not correspond to a valid Unicode scalar value".
Yes, this gets at the distinction between a Unicode character and a Unicode codepoint. The surrogates are not valid Unicode characters, and neither are the noncharacter codepoints. They are codepoints, however: they have an assigned value, character properties, and a valid (let's say: specific, or: singular) encoding in every Unicode encoding. Scalar value is the modern way the standard refers to characters, because what civilians think of as a character includes Farmer Bob here 👨🏻🌾, which is made up of several Unicode scalar values. The standard Unicode uses for codepoints is:

But they do have an encoding in UTF-8. What the standard is getting at is that it's forbidden to interpret surrogate pairs in UTF-8 as the abstract character / scalar value which they would represent if found in UTF-16. This is invalid for the same reason that overlong encodings are invalid: it serves no purpose, complicates software, and has security implications which are decidedly negative. All of these are good reasons to reject bare surrogates in string data! But they aren't good reasons to treat an explicit \u{d800} escape as a compile error.

A string containing these codepoints is ill-formed UTF-8. WTF-8 doesn't have to be brought into the picture, because once turned into binary data, Zig strings aren't WTF-8 either: they allow all sorts of byte sequences which aren't valid in either encoding. As a sort of Zen koan to convey my meaning: if the surrogates didn't have an encoding in UTF-8, then software wouldn't be able to reject those byte sequences as invalid. That's also the most pragmatic reason I'm advocating for their inclusion in the \u escape syntax.

It's invalid to have the surrogate codepoints anywhere in a UTF-8 string, just as it's invalid to have only one in a UTF-16 string, or a low followed by a high, but in both cases it is a codepoint and it does have an encoding. There are all sorts of ways for UTF-8 to be ill-formed, and all of them are already expressible through the \x byte escape.

So we all agree: if the Zig parser is chugging along through a string, and sees these three bytes:
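To make "they do have an encoding" concrete, here is a small sketch (my own illustration, not from the comment) that applies the ordinary three-byte UTF-8 bit pattern to a surrogate code point; encodeGeneralized3 is a hypothetical helper, since std.unicode.utf8Encode rejects surrogates:

const std = @import("std");

/// Encode any code point in the range U+0800..U+FFFF (surrogates included)
/// using the ordinary three-byte UTF-8 bit pattern: 1110xxxx 10xxxxxx 10xxxxxx.
fn encodeGeneralized3(cp: u21, out: *[3]u8) void {
    std.debug.assert(cp >= 0x0800 and cp <= 0xFFFF);
    out[0] = 0xE0 | @as(u8, @intCast(cp >> 12));
    out[1] = 0x80 | @as(u8, @intCast((cp >> 6) & 0x3F));
    out[2] = 0x80 | @as(u8, @intCast(cp & 0x3F));
}

test "U+D800 has a three-byte code point sequence" {
    var buf: [3]u8 = undefined;
    encodeGeneralized3(0xD800, &buf);
    try std.testing.expectEqualSlices(u8, &.{ 0xED, 0xA0, 0x80 }, &buf);
}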
It would be helpful if you could link to the relevant parts of the Unicode Standard and/or UCD. I'm unable to find anything that specifically talks about/declares a canonical encoding of surrogate code points for UTF-8. From what I can tell, the standard only defines UTF-8 encodings for Unicode scalar values. As for my position on this proposal:
EDIT: To be clear: I'm largely in favor of this proposal. My only reservation would be the potential for users to accidentally/unexpectedly create a sequence of ill-formed UTF-8, but the risk of that does seem pretty minor. EDIT#2: Some clearer statements:
The surrogate code points U+D800 to U+DFFF are valid code points but are not Unicode scalar values. This commit makes the error message more accurately reflect what is actually allowed in `\u` escape sequences. From https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf:

> D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.
> D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.
>
> 3.9 Unicode Encoding Forms
> D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

Related: ziglang#20270
UTF-8 can also be treated as an encoding of 31-bit numbers rather than of Unicode characters (although if it is a valid Unicode code point number then it is a valid encoding of a Unicode character (or at least a part of one)); and in that interpretation, a surrogate code point is encodable like any other number in that range.
If you are concerned about that, then a (suppressible) warning message might make sense.
My reading of UTF-8 as defined by RFC 3629 and the Unicode Standard is that there are no such encodings, i.e. U+D800-U+DFFF and anything above U+10FFFF are explicitly disallowed from being encoded. Hence why I think this proposal should be focused on defining \u in terms of WTF-8 (side note: Zig doesn't have warnings).
It's possible to describe the proposal without reference to WTF-8. I think that's generally valuable, because the goal isn't the same as that of WTF-8: WTF-8 defines a more permissive encoding, whereas Zig strings don't define an encoding at all. I'm going to use bold and italics here to reference terms defined by, or used in, the standard, and try to draw out some precise language, which will satisfy the desire for the proposal to be a) expressed according to the language of the standard and b) without reference to the separate WTF-8 standard.

The proposal relates to codepoints. Unicode defines:

Currently, without said exception, Zig turns these escape sequences into what the standard calls code unit sequences, specifically those of UTF-8. In UTF-8, the surrogate codepoints do not correspond to well-formed code unit sequences, because they don't correspond to a valid abstract character or, equivalently, scalar value. In UTF-16, this is true of mismatched surrogates: low-high, or one or the other encountered without a pair. Section C8 reads:
So now we introduce code point sequences as well. These are a specific byte pattern representing any codepoint in any encoding. Corollary: every codepoint, without exception, has a code point sequence in all Unicode encodings. Are we sure that surrogates are codepoints? We are:
Also, D15:
And we have this:
Defining a scalar value thus:
Referring to UTF-8 specifically, the standard dismisses surrogates as follows:
So here's the synthesis: in order to reject ill-formed values as defined by the standard, software has to be able to recognize the code point sequences which would encode the surrogates, so those sequences are well-defined even though a string containing them is not well-formed UTF-8.

The proposal

Current behavior: Zig interprets a sequence \u{XXXXXX} as the UTF-8 encoding of a Unicode scalar value, and rejects surrogate codepoints with a compile error.

Proposed behavior: Zig interprets a sequence \u{XXXXXX} as the UTF-8 code point sequence corresponding to the given Unicode codepoint, surrogates included.

I believe that here we have a succinct description of the proposal, which uses terminology in the way in which the Unicode standard uses it.
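To make the difference concrete, a sketch in test form (my own illustration, not from the thread); under current Zig the \u{d800} line is the compile error quoted earlier, while under the proposal it would equal the three bytes that \x escapes can already spell out:

const std = @import("std");

test "proposed meaning of a surrogate \\u escape" {
    // Status quo: writing the code point sequence for U+D800 requires \x bytes.
    const via_x = "\xed\xa0\x80";
    // Proposal: the same three bytes, written as a \u escape.
    // (This line is a compile error in current Zig.)
    const via_u = "\u{d800}";
    try std.testing.expectEqualStrings(via_x, via_u);
}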
Small addendum: this proposal, as expressed above, rejects sequences such as \u{80000000}. While one can follow the rules elucidated in the FSS UTF-8 standard of 1993 and turn that into bytes, values beyond U+10FFFF are outside the Unicode codespace, so there is no codepoint for such a sequence to correspond to.
My problems with that proposal:
Here's my attempt at this proposal:

Status quo

Currently, Zig defines \u{NNNNNN} as the UTF-8 encoding of the given Unicode code point, which must be a Unicode scalar value.
This means that:
However, with Zig now commonly returning and accepting WTF-8, being able to use \u escapes for surrogate code points would be useful.

The proposal

This proposal is to instead define \u{NNNNNN} as the WTF-8 encoding of the given Unicode code point.
With potentially a clarifying note:
Use cases

The main use case here would be avoiding the need to manually specify the bytes for the WTF-8 encoding of surrogate code points in string literals. For example, lines 2039 to 2047 in d9bd34f
Could instead look like:

try testRoundtripWtf8("\u{D7FF}"); // not a surrogate half
try testRoundtripWtf8("\u{D83D}"); // high surrogate
try testRoundtripWtf8("\u{DCA9}"); // low surrogate
try testRoundtripWtf8("\u{D83D} \u{DCA9}"); // <high surrogate><space><low surrogate>
try testRoundtripWtf8("\u{D800}\u{DBFF}"); // <high surrogate><high surrogate>
try testRoundtripWtf8("\u{D800}\u{E000}"); // <high surrogate><not surrogate>
try testRoundtripWtf8("\u{D7FF}\u{DC00}"); // <not surrogate><low surrogate>
try testRoundtripWtf8("a\u{DC00}"); // <not surrogate><low surrogate>
try testRoundtripWtf8("\u{1F4A9}"); // U+1F4A9, encoded as a surrogate pair in WTF-16 (this test [and others in In the event that you're specifying a path/environment variable that happens to contain an unpaired surrogate, then not needing to manually specify it as WTF-8 encoded bytes would make things much nicer: const path = "some/wtf8path/with\u{d83d}surrogates\u{dc00}";
var file = try std.fs.cwd().openFile(path, .{});
defer file.close();
const expected_val = "some wtf8 value with \u{d83d} surrogates \u{dc00}";
const actual_val = try std.process.getEnvVarOwned(allocator, "FOO");
defer allocator.free(actual_val);
if (std.mem.eql(u8, expected_val, actual_val)) {
// ...
}

With status quo, you'd have to look up/calculate the WTF-8 encoded form of the surrogate code points and specify them as bytes.

Potential drawbacks

The biggest drawback would be that \u escape sequences would then be able to produce ill-formed UTF-8. However, using \x escapes it is already possible to put arbitrary (and therefore ill-formed) bytes into a string literal.

WTF-8 can be ill-formed, too

Another thing to keep in mind is that WTF-8 is intended to only encode unpaired surrogates. Doing something like:

"\u{D83D}\u{DCA9}" // <high surrogate><low surrogate>

would result in ill-formed WTF-8, since the encoded surrogate code points form a surrogate pair.
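To make that last point concrete (byte values worked out for this illustration, not quoted from the comment): under the proposal the two escapes would produce the two encoded surrogates back to back, which is not the same byte sequence as the UTF-8 encoding of U+1F4A9, the scalar value the pair would denote in WTF-16:

const std = @import("std");

test "an encoded surrogate pair is not the encoded scalar value" {
    // What "\u{D83D}\u{DCA9}" would mean under the proposal:
    // two three-byte code point sequences, ill-formed as WTF-8 and as UTF-8.
    const paired_surrogates = "\xed\xa0\xbd\xed\xb2\xa9";
    // The well-formed UTF-8 encoding of U+1F4A9 itself.
    const pile_of_poo = "\u{1F4A9}";
    try std.testing.expectEqualStrings("\xf0\x9f\x92\xa9", pile_of_poo);
    try std.testing.expect(!std.mem.eql(u8, paired_surrogates, pile_of_poo));
}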
This is one of a few reasons why I don't believe that bringing WTF-8 into the picture adds much to the discussion. UTF-8 is an encoding, and it's the one which Zig source code uses, but Zig isn't a language which validates the encoding property of strings. I believe I've demonstrated adequately that the standard has a concept of a code point encoding, which the proposal for \u makes use of.

The standard devotes a great deal of energy to making sure that no one gets the idea that invalid use of codepoints is in any sense valid Unicode. It doesn't spend a lot of energy discussing the ins and outs of invalid sequences. Nonetheless, we get things like the official Unicode name for overlong encodings, a "non-shortest form". I draw your attention again to this paragraph:
So I'm not sure what we're supposed to do with this claim:
There is a well-understood way to emit a code point sequence corresponding to those characters. It's given in table 3.6 of the standard. No one, at any point, in discussing this proposal, has ever had the misapprehension that the result would be well-formed UTF-8. As a result, there's no concrete difference being discussed here.
This is accurate, and actionable, and uses the language of the standard. What are you trying to accomplish by rephrasing it? I could tweak it a bit, actually; it has a bit of the flavor of standardese, given what I was reading right before I wrote it:
"the UTF-8 code point sequence corresponding to a Unicode codepoint" is, as I took pains to illustrate, a well-defined concept according to the standard. There's no need to bring a second standard along for the ride. |
To try and keep this issue on track, I'll offer some other examples of when this might be useful. A regex to detect surrogate codepoint sequences:

Both of these are pretty useful when writing programs which are expected to detect invalid code unit sequences and handle them correctly. Where "correctly" can mean treating the data as WTF-8, or replacing the sequence with U+FFFD REPLACEMENT CHARACTER.
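The regex itself was lost in extraction, but the idea can be shown directly in Zig (a sketch of mine, not the comment's original code): encoded surrogates are exactly the byte pattern 0xED 0xA0–0xBF 0x80–0xBF, so detecting them is a three-byte scan.

const std = @import("std");

/// Returns the index of the first encoded surrogate code point (U+D800..U+DFFF)
/// in `bytes`, or null if none is present.
fn findSurrogateSequence(bytes: []const u8) ?usize {
    var i: usize = 0;
    while (i + 3 <= bytes.len) : (i += 1) {
        if (bytes[i] == 0xED and
            bytes[i + 1] >= 0xA0 and bytes[i + 1] <= 0xBF and
            bytes[i + 2] >= 0x80 and bytes[i + 2] <= 0xBF)
        {
            return i;
        }
    }
    return null;
}

test "detects an encoded unpaired surrogate" {
    try std.testing.expectEqual(@as(?usize, 4), findSurrogateSequence("abcd\xed\xa0\x80ef"));
    try std.testing.expectEqual(@as(?usize, null), findSurrogateSequence("plain utf-8 \xf0\x9f\x92\xa9"));
}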
This is a description of what the WTF-8 spec calls generalized UTF-8, distinct from UTF-8 (compare its table of well-formed byte sequences with the Unicode Standard's).
I think it makes things more clear, but I also think the particular language used is largely a distraction; apologies for dragging it out. The use cases/drawbacks are the most relevant thing, and are unrelated to the language used to define \u.
The terminology "surrogate codepoints" is a bit ambiguous. It could conceivably refer to either "surrogate pairs" or "unpaired surrogates". I haven't read the entire thread but it seems to be about the latter, so might I suggest changing the title of the issue to |
It refers to neither in this case; a surrogate code point is just a code point between U+D800 and U+DFFF (see the quotes from the standard in this comment). The terms "surrogate pairs" and "unpaired surrogates" are only relevant when talking about particular encodings (UTF-16/WTF-16/WTF-8).
This is, strictly speaking, more restrictive than the actual proposal, because allowing surrogate codepoints (which is a complete idea which composes in certain ways) means allowing paired ones also, by using two escape sequences. That's some of the whole hoopla about WTF-8: it defines generalized UTF-8, which allows surrogate codepoints and surrogate pairs, and WTF-8, which only allows the former. I think we're past all the litigation on that front; there are a couple of ways to describe the proposal which amount to the same thing.

The term "surrogate codepoint" isn't ambiguous, however, no matter what encoding of Unicode we're referring to. In UTF-16 (only), paired surrogate codepoints are used to encode codepoints which correspond to scalar values, but those latter codepoints are not "surrogate codepoints". So in UTF-16, we would say that two code point sequences encoding two surrogate codepoints make up one code unit sequence corresponding to one codepoint, which is a scalar value or abstract character. In UTF-8 and UTF-32, the surrogate code point sequences are ill-formed; in UTF-32, each of these is a single code unit. In UTF-8, the code unit is a byte, so any byte sequence is a code unit sequence, which is either well-formed or ill-formed. Surrogate code point sequences are an example of an ill-formed sequence, and the one this proposal concerns itself with.

It's also true that if we take generalized UTF-8 from the WTF-8 standard, and subtract every valid sequence in Unicode's UTF-8, we end up with the same code unit sequences: every UTF-8-encoded code point sequence which does not encode a scalar value. This is an equivalent way of describing things.
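As a small illustration of that UTF-16 pairing (my example, assuming std.unicode.utf8ToUtf16LeStringLiteral): the single scalar value U+1F4A9 becomes two surrogate code units in UTF-16, neither of which is itself the codepoint U+1F4A9.

const std = @import("std");

test "U+1F4A9 becomes a surrogate pair in UTF-16" {
    const utf16 = std.unicode.utf8ToUtf16LeStringLiteral("\u{1F4A9}");
    try std.testing.expectEqualSlices(u16, &.{ 0xD83D, 0xDCA9 }, utf16);
}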
Half slabs and teapots. Related:
IMO: \u as suggested by its name represents a Unicode scalar (not UTF-16) (to avoid the confusing term "codepoint"). If you want to encode arbitrary bytes of test data that aren't valid UTF-8 (such as unpaired UTF-16 surrogates), \x is already available and easy to read: "\xD8\x00". I find the status quo perfectly reasonable and capable of readably representing these use cases, without inserting ambiguity into the interpretation of \u escapes.
codepoint isn't a confusing term. imo
I'm fine with extending \x to allow longer sequences of hex digits, which easily accomplishes the goal of being able to represent arbitrary data. An unpaired surrogate is not valid utf-8 bytes -- it is wtf-8 bytes. I'd be happy with \w for 16-bit codepoints converted to wtf-8 bytes. (perhaps better in bikeshed).
codepoints are not 16 bit, and the utf-8 encoding algorithm is not limited by the same bound that codepoints are
I'm assuming you mean codepoints in the BMP (Basic Multilingual Plane) and you're not mixing up Unicode with the old UCS-2? The wording "16-bit codepoint" should be avoided due to this ambiguity. The ASCII subset, the Latin-1 subset, and the BMP subset are all somewhat special, and (in decreasing order) more common than the rest of Unicode. 7-bit ASCII is represented exactly the same in UTF-8; Latin-1 shares codepoints 0-255, but 128-255 need more than one byte; the BMP covers 0-65535, so it can be represented in 16 bits. Full Unicode coverage requires 21 bits. Even though the Latin-1 range is a bit special, does it really warrant a special escape convention?
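A quick check of those byte-length boundaries, using std.unicode.utf8CodepointSequenceLength (an added illustration, not part of the original comment):

const std = @import("std");

test "UTF-8 byte lengths for the ranges mentioned above" {
    // 7-bit ASCII: one byte, identical to the ASCII encoding.
    try std.testing.expectEqual(@as(u3, 1), try std.unicode.utf8CodepointSequenceLength(0x41));
    // Latin-1 beyond ASCII (U+0080..U+00FF): two bytes.
    try std.testing.expectEqual(@as(u3, 2), try std.unicode.utf8CodepointSequenceLength(0xFF));
    // Rest of the BMP (up to U+FFFF): at most three bytes.
    try std.testing.expectEqual(@as(u3, 3), try std.unicode.utf8CodepointSequenceLength(0xFFFD));
    // Full Unicode range (21 bits, up to U+10FFFF): four bytes.
    try std.testing.expectEqual(@as(u3, 4), try std.unicode.utf8CodepointSequenceLength(0x10FFFF));
}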
My opinion is that this is, in fact, the confusing read. This proposal would mean that for every codepoint which can be referred to as U+NNNN, the escape \u{NNNN} produces the corresponding code point sequence. Your reference to UTF-16 confuses me. This is entirely about UTF-8. Yes, it's invalid for many things which a Zig string can contain to be in a UTF-8 string, including code point sequences which do not represent scalar values. The question is whether that distinction creates a valuable restriction on the use of \u escapes.
We know this.
This proposal takes no stance on encoding UTF-16. "16-bit codepoints converted to wtf-8 bytes" is not, so far as I can determine, a sentence which makes any sense.
[Mostly tangential: This is obviously not true for octet sequences that cannot be mapped to code points (e.g., "\xFF"). More to the point, UTF-8 decoders do not necessarily do this for 'byte sequence[s] that would otherwise map to code points U+D800..U+DFFF' or values beyond U+10FFFF either. See, e.g., https://encoding.spec.whatwg.org/#utf-8-decoder for a decoding algorithm that rejects such code unit sequences without first mapping them to invalid code points.]
Sorry for playing the devil's advocate, but, according to its caption and heading, that table technically only provides a mapping for Unicode scalar values, specifically excluding the range from U+D800 to U+DFFF. As pointed out by @squeek502, adding, without further restriction, surrogate code points to UTF-8 gives what WTF-8 refers to as generalised UTF-8, which can also be defined by changing the first column heading of Table 3-6 in the Unicode Standard from 'Scalar Value' to 'Code Point' (in the same way as CESU-8 changed it to 'UTF-16 Code Unit'), thereby providing a natural extension to the UTF-8 standard that allows both paired and unpaired surrogate code points to be encoded.
There are three ways a UTF-8 string can be wrong: it can have invalid bytes, it can encode a code point sequence which is not a valid scalar value, or it can be a non-shortest encoding. This proposes interpreting all \u escapes of codepoints up to U+10FFFF as the corresponding code point sequence. If it helps you to think of that in terms of "generalized UTF-8" from the WTF-8 standard, knock yourself out. These are two ways of talking about the same thing. I don't think there was ever a point where this tangent added clarity to the proposal, but if there was such a point, we are well past it.
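To illustrate those three failure modes with concrete byte strings (examples chosen for this illustration, assuming std.unicode.utf8ValidateSlice):

const std = @import("std");

test "three ways a byte string can fail UTF-8 validation" {
    // 1. Invalid bytes: 0xFF can never appear in UTF-8.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("abc\xff"));
    // 2. A code point sequence that is not a scalar value: U+D800 as ED A0 80.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xed\xa0\x80"));
    // 3. A non-shortest ("overlong") encoding: '/' (U+002F) spelled as two bytes.
    try std.testing.expect(!std.unicode.utf8ValidateSlice("\xc0\xaf"));
    // For contrast, well-formed UTF-8 passes.
    try std.testing.expect(std.unicode.utf8ValidateSlice("héllo"));
}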
Also, the header of Table 3-6 doesn't change any of this. There's no actual ambiguity here. This is all wasted effort.
:-)
Maybe. I was hoping that if we could all agree that UTF-8-style encoding of surrogates constitutes a small extension/generalisation of the current official standard, then future unproductive discussion concerning the exact definition of UTF-8 might be avoided. Sorry if the result was the opposite of what I intended. I do actually support your proposal.
What I should have done is add an update to the first comment linking to the second draft of the proposal. I used the term "valid codepoint" there, and if there's something we all agree on, it's that an encoded surrogate code point sequence is not valid UTF-8. So I went ahead and did so.
Another argument in favor of this proposal, which I just discovered, is this inconsistency:

const lit_char = '\u{d801}'; // This is legal
const lit_str = "\u{d801}"; // doesn't compile:
//                          error: unicode escape does not correspond to a valid unicode scalar value

This is good news for what I'm working on, because it's one less special case to consider, but I don't see a strong argument for the same escape sequence working differently in a character or a string context. Working for strings as well would be less surprising, more consistent, and more useful.
As a follow-up observation, C1 control codes are allowed 'raw' in string and codepoint literals, and perhaps should not be. They pose some of the same risks as the C0 sequences. I encourage the interested to try this in their terminal:

> printf "abc\xc2\x9b31mdef"

In WezTerm, this prints the "def" in red, because 0xC2 0x9B is the UTF-8 encoding of U+009B, the C1 control CSI (Control Sequence Introducer), and "31m" is then read as the SGR code for red text. If pressed to make a decision one way or the other, I would say that no Unicode control characters from the Cc category should be present unescaped in a single-line string, or a literal character, and only
Update: the first post (which you're reading) is imprecise in some ways, especially the use of the word "valid". Please skip ahead to this comment for a better take on the aims and goals here.
In generating some Unicode test data, I discovered that Zig doesn't allow surrogates, even when encoded with \u.

Minimal reproduction:
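The original snippet was lost in extraction; a minimal reproduction consistent with the reported error would be along these lines (reconstructed, not the author's exact code):

// repro.zig — fails to compile with Zig 0.12.0:
// error: unicode escape does not correspond to a valid codepoint
const surrogate = "\u{d800}";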
The error (Zig 0.12.0): "error: unicode escape does not correspond to a valid codepoint".
The error message is not correct: the UTF-16 surrogates are valid codepoints, in category Other: surrogate. Here's a property page for U+D800.

It makes sense to me that the parser finding random surrogates in the string should reject that as not well-formed, just like it would balk on a control character, or a bare \xff. Such random garbage is most likely a mistake. The same applies to overlong encodings: you could argue that they're "well encoded" in that they match the necessary pattern for a UTF-8 encoded point, but the standard specifically forbids them. Unlike the surrogates, these are not codepoints.

But Zig does not demand that string data is well-formed UTF-8; the \x byte encoding can represent arbitrary bytes within a string. In fact, "\u{01}" is valid, when it would not be if embedded raw in the string.

It doesn't make sense to me that a codepoint specifically encoded as e.g. \u{d800} would create a compile error. That's the author affirmatively requesting that said codepoint be decoded into UTF-8 and added to the sequence; it's not the sort of thing which happens by accident. It has an exact interpretation, .{0xed, 0xa0, 0x80}, which is validly-encoded UTF-8. I'll contrast this with \u{80000000}, which can't be turned into bytes: this is genuinely invalid. .{0xc0, 0xaf} is also invalid, despite having the superficial qualities of a UTF-8 codepoint, since it's overlong. There's no way to represent those in U+ notation, so this is a bit off topic; the point is that I could stuff it into a string with \xc0\xaf and that would be fine with the compiler.

There's no reason for the compiler to contain extra logic to prevent surrogates from being represented using \u notation, when it's specifically requested. In my case, it means that if I want test data which covers the entire three-byte range of Unicode, I must detect the surrogate ranges and special-case encode them as \x sequences. Or perhaps someone might be writing a fixup tool which detects an invalid encoding of surrogate pairs and produces correct UTF-8. There, too, the test data will have surrogates in it, and this behavior is an arbitrary limitation to work around. TL;DR: the compiler should accept \u notation for all codepoints in Unicode, surrogates included.