RFC: Make \xNN mean utf8 code unit, not unicode codepoint. #2800

Closed
graydon opened this Issue Jul 4, 2012 · 19 comments

graydon commented Jul 4, 2012

There's not a lot of consensus on this between languages, but the C and C++ paths (also perl, go, and at least python3 'bytes' literals, though not 'string') treat this escape as a code unit, not a codepoint.

ghost commented Jul 5, 2012

I think Unicode code points are much more intuitive to work with, not having to deal with implementation details of some specific encoding.
If a string consists of UTF-8 code units, then a single character may consist of one to four code units.
So I can have a ten-character string with a length of 40.
Operations like getting a substring can leave you with broken characters, by extracting fewer than all of the code units of a character.
As for other languages, Python used to do different things depending on how it was compiled.
This is fixed as of Python 3.3, and it now supports the full Unicode range without having to deal with surrogate pairs, and string operations are much more intuitive for it.
Can’t think of many examples off-hand, but one other language that defines characters in terms of code points, rather than code units in some specific encoding, is Haskell, at least since Haskell 98.
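The code-unit/code-point mismatch described above can be seen directly in modern Rust (syntax that postdates this thread): `len()` counts UTF-8 code units (bytes), while `chars()` counts code points. A minimal sketch:

```rust
fn main() {
    // Four code points taking 1, 2, 3, and 4 UTF-8 code units respectively.
    let s = "aé☺🦀";
    assert_eq!(s.chars().count(), 4); // code points
    assert_eq!(s.len(), 10);          // UTF-8 code units (bytes)

    // Byte index 2 falls inside the two-byte encoding of 'é', so slicing
    // there would split a character; Rust checks char boundaries at runtime.
    assert!(!s.is_char_boundary(2));
}
```

This is exactly the "ten-character string with a length of 40" worst case in miniature: a string's byte length can be up to four times its character count.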

graydon commented Jul 5, 2012

Our strings definitely are utf8, it's not just "some specific encoding". We're very much exposing that and expecting programmers to know what that means. As they know what two's-complement integers (not auto-expanding-to-bignums) and IEEE 754 floating point (not rationals) are and what they do. If you want an array of unicode codepoints, that's [char], not str. Likewise if you want a utf16 array, that's a different thing too. Python is actually the wrong precedent here; we're a systems language and users frequently flip into "I know about the in-memory implementation" assumptions, even rely on them.

That said, I'm somewhat sympathetic to the arguments about which way to do this. Followup is on the list, over in the thread that created this bug: https://mail.mozilla.org/pipermail/rust-dev/2012-July/002024.html

Yes, this is part of the "utf8 monoculture" some people despise, but I am somewhat unrepentant about it. I think it's as stable, flexible and long-term an encoding as we're likely to see for years; the only plausible competitor on the horizon is GB18030 and it even covers different codepoints, so it's not really fair to consider it a "different encoding", it's a whole different charset. And, in any case, my experience is that the harm done to language users, especially systems-language users, by being ambiguous about the in-memory meaning of literals in program text far outweighs the harm done by picking some particular unambiguous interpretation. IOW on this topic I think the risk of underspecification is higher than the risk of overspecification. It would be more useful to support multiple-explicit-encodings -- even permit tagging a whole file as written-to-a-different-default-encoding -- if that ever becomes a real concern, than to throw our hands up about the encoding and say "strings are implementation-specific!"

Incidentally, it should be trivial to write a syntax extension that maps from encoding-to-encoding at compile time, i.e. one that lets you write utf16! "hello \U0010f0B1" and have it expand at compile time to [0xfeff_u16, 0x0068_u16, 0x0065_u16, 0x006c_u16, 0x006c_u16, 0x006f_u16, 0xdbfc_u16, 0xdcb1_u16], or similar. Just note that this has a different type from str.
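The hypothetical utf16! syntax extension never materialized, but today's standard library performs the same expansion at runtime via `str::encode_utf16` (modern Rust, shown here as a sketch; the correct surrogate pair for U+10F0B1 is 0xDBFC, 0xDCB1):

```rust
fn main() {
    // UTF-16 code units for "hello \u{10F0B1}". Code points above U+FFFF
    // become surrogate pairs: with v = 0x10F0B1 - 0x10000 = 0xFF0B1, the
    // high surrogate is 0xD800 + (v >> 10) = 0xDBFC and the low surrogate
    // is 0xDC00 + (v & 0x3FF) = 0xDCB1.
    let units: Vec<u16> = "hello \u{10F0B1}".encode_utf16().collect();
    assert_eq!(
        units,
        [0x0068, 0x0065, 0x006C, 0x006C, 0x006F, 0x0020, 0xDBFC, 0xDCB1]
    );
}
```

Note that `encode_utf16` emits no BOM; a compile-time macro could prepend 0xFEFF itself if it wanted one.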

ghost commented Jul 6, 2012

Thank you for the explanation, it makes a lot of sense.
I assumed too much about the str type and its purpose, and appreciate the clarification.
Perhaps there is room in the standard library for a text module of some sort, for doing more high-level work with text?

graydon commented Jul 6, 2012

Definitely. Some machinery exists in core for handling basic tasks associated with strings in the various operating-system-required encodings; more will wind up in libstd, likely a binding to libicu.

pcwalton commented May 9, 2013

Nominated for backwards compatible

graydon commented Jun 6, 2013

accepted for backwards-compatible milestone

cmr commented Aug 5, 2013

I agree that \xNN should be a code unit and \u... should be code point.

pnkfelix commented Sep 26, 2013

cc me

SimonSapin commented Jan 2, 2014

I disagree. I don’t see a reason to use code units in (Unicode) literal strings. Why would you want "\xEF\xBF\xBD" rather than "\uFFFD"? (Of course, byte string literals are a different story.)

If the concern is that "\xNN" looks like a byte while it’s not, we could only allow it for values in the ASCII range (\x00 to \x7F) i.e. for code points that are represented as one byte in UTF-8.

If \xNN is still changed to represent code units, literals that contain invalid UTF-8 like "\x80" should be compile-time errors so as to not break str’s promise to contain valid UTF-8.
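The ASCII-range restriction proposed here is essentially what Rust later adopted for string literals, with \xNN beyond \x7F allowed only in byte strings. A sketch in modern Rust (the b"…" byte-string and \u{…} escape syntax postdates this thread) showing both halves of the argument:

```rust
fn main() {
    // In a byte string, \xNN is a raw byte: EF BF BD is the UTF-8
    // encoding of U+FFFD, the replacement character.
    let bytes: &[u8] = b"\xEF\xBF\xBD";
    assert_eq!(std::str::from_utf8(bytes).unwrap(), "\u{FFFD}");

    // A lone 0x80 byte is not valid UTF-8, so permitting "\x80" in a str
    // literal would break str's promise to contain valid UTF-8.
    assert!(std::str::from_utf8(b"\x80").is_err());
}
```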

cmr commented Jan 2, 2014

I agree with that, @SimonSapin. The reason I wanted it was for byte string literals, but at the time bytes!(...) didn't exist and IMO that allows for much nicer literals than \xNN style stuff. I no longer agree with this change.

SimonSapin commented Jan 2, 2014

Yes, I also want byte literals (and found this when searching for that.)

brson commented Jan 3, 2014

If \xFF indicates a code unit does that mean that character literals need to support multiple \x escapes? I can't tell yet if there's any precedent for that in other languages.

brson commented Jan 3, 2014

I guess the strongest argument in favor of code units is Behdad's, that it would make our string escapes compatible with C and Python.

brson commented Jan 3, 2014

Behdad's full argument:

Here: "\xHH, \uHHHH, \UHHHHHHHH Unicode escapes", I strongly suggest that
\xHH be modified to allow inputting direct UTF-8 bytes. For ASCII it doesn't
make any difference. For Latin1, it gives the impression that strings are
stored in Latin1, which is not the case. It would also make C / Python
escaped strings directly usable in Rust. I.e. '\xE2\x98\xBA' would be a single
character equivalent to '\u263a', not three Latin1 characters.
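Behdad's central claim checks out: decoded as UTF-8, the bytes E2 98 BA form the single code point U+263A, whereas reading each byte as a Latin-1 character gives three code points. A sketch in modern Rust:

```rust
fn main() {
    // Decoded as UTF-8, the three bytes form one code point: U+263A (☺).
    let utf8 = String::from_utf8(vec![0xE2, 0x98, 0xBA]).unwrap();
    assert_eq!(utf8, "\u{263A}");
    assert_eq!(utf8.chars().count(), 1);

    // Read as Latin-1 (each byte mapped to the code point of the same
    // value), the same bytes would be three characters: U+00E2 U+0098 U+00BA.
    let latin1: String = [0xE2u8, 0x98, 0xBA].iter().map(|&b| b as char).collect();
    assert_eq!(latin1.chars().count(), 3);
}
```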

SimonSapin commented Jan 3, 2014

I don’t know about C, but Behdad’s argument does not apply to Python. Python (both in 2.x and 3.x) has two types of strings: byte strings, where \xHH is a byte and \uHHHH is not an escape sequence; and Unicode strings where \xHH is a code point and u'\xE2\x98\xBA' is indeed three code points in the Latin1 range.

brson commented Jan 18, 2014

There doesn't seem to be a definitive argument for either side, and since changing these to be code units makes their validation slightly harder, I'm inclined to just leave as-is and call it done.

pnkfelix commented Jan 18, 2014

@SimonSapin indeed, I think graydon said the same thing in his initial response to Behdad, along with providing a more complete table of what different languages do here

(Though that table is missing C# it seems.)

So is Rust going to be more like python and scheme, or more like perl, go, C, C++, ruby... ?

pnkfelix commented Jan 18, 2014

(having said that, I'm fine with brson's suggestion to leave things as they are.)

brson commented Jan 21, 2014

In today's meeting we decided to leave this as is.
