Implement std::convert traits for char #35755

Merged
merged 2 commits into from Sep 1, 2016

Conversation

Projects
None yet
8 participants
@SimonSapin
Contributor

SimonSapin commented Aug 17, 2016

This is motivated by avoiding the as operator, which sometimes silently truncates, and instead use conversions that are explicitly lossless and infallible.

I’m less certain that From<u8> for char should be implemented: while it matches an existing behavior of as, it’s not necessarily the right thing to use for non-ASCII bytes. It effectively decodes bytes as ISO/IEC 8859-1 (since Unicode designed its first 256 code points to be compatible with that encoding), but that is not apparent in the API name.

@rust-highfive

This comment has been minimized.

Show comment
Hide comment
@rust-highfive

rust-highfive Aug 17, 2016

Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

Collaborator

rust-highfive commented Aug 17, 2016

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@sfackler sfackler added the T-libs label Aug 17, 2016

src/libcore/char.rs
+ ///
+ /// Surrogates are used in the UTF-16 encoding, and therefore are not characters.
+ SurrogateCodePoint,
+}

This comment has been minimized.

@sfackler

sfackler Aug 17, 2016

Member

We're never going to need to add any extra cases to this enum, right? Should we stick a __ForExtensibility variant just in case?

@sfackler

sfackler Aug 17, 2016

Member

We're never going to need to add any extra cases to this enum, right? Should we stick a __ForExtensibility variant just in case?

@sfackler

This comment has been minimized.

Show comment
Hide comment
@sfackler

sfackler Aug 17, 2016

Member

I personally feel okay about the u8 -> char impl as it seems like what everyone would expect to happen, but am interested to see what other people think.

cc @rust-lang/libs

Member

sfackler commented Aug 17, 2016

I personally feel okay about the u8 -> char impl as it seems like what everyone would expect to happen, but am interested to see what other people think.

cc @rust-lang/libs

@alexcrichton

This comment has been minimized.

Show comment
Hide comment
@alexcrichton

alexcrichton Aug 18, 2016

Member

Seems reasonable to me, but I'd prefer to use an opaque struct with optional method accessors rather than an enum for the error type in TryFrom

Member

alexcrichton commented Aug 18, 2016

Seems reasonable to me, but I'd prefer to use an opaque struct with optional method accessors rather than an enum for the error type in TryFrom

@SimonSapin

This comment has been minimized.

Show comment
Hide comment
@SimonSapin

SimonSapin Aug 18, 2016

Contributor

@sfackler Re extensibility, I don’t expect this to ever be needed. The range of Unicode Scalar Values changed exactly once in the history of Unicode. At first it was “16 bits ought to be enough for everybody” 0x0000...0xFFFF. When that turned out not to be enough and a lot of systems were already using u16 as a code unit, UTF-16 was introduced. The current range, 0x0000...0xD7FF | 0xE000...0x10FFFF, is exactly what can be encoded by UTF-16. (UTF-8’s original design can support up to 0x7FFF_FFFF with 6-bytes sequences. It is now artificially restricted to match.)

In Unicode 9.0, 76% of the million and some code points are unassigned, so they’re not expected to run out. And breaking compatibility with UTF-16 is such a breaking change that I imagine it’s not even considered.

And this concern disappears with…

@alexcrichton Yeah, I also considered an opaque struct. Even without accessor method since I can’t think of a use case for it. (Code like a WTF-8 implementation that wants to deal with surrogate code points will likely do its own code point arithmetic anyway.) And a method cal always be added later. I’ve updated the PR.

Contributor

SimonSapin commented Aug 18, 2016

@sfackler Re extensibility, I don’t expect this to ever be needed. The range of Unicode Scalar Values changed exactly once in the history of Unicode. At first it was “16 bits ought to be enough for everybody” 0x0000...0xFFFF. When that turned out not to be enough and a lot of systems were already using u16 as a code unit, UTF-16 was introduced. The current range, 0x0000...0xD7FF | 0xE000...0x10FFFF, is exactly what can be encoded by UTF-16. (UTF-8’s original design can support up to 0x7FFF_FFFF with 6-bytes sequences. It is now artificially restricted to match.)

In Unicode 9.0, 76% of the million and some code points are unassigned, so they’re not expected to run out. And breaking compatibility with UTF-16 is such a breaking change that I imagine it’s not even considered.

And this concern disappears with…

@alexcrichton Yeah, I also considered an opaque struct. Even without accessor method since I can’t think of a use case for it. (Code like a WTF-8 implementation that wants to deal with surrogate code points will likely do its own code point arithmetic anyway.) And a method cal always be added later. I’ve updated the PR.

@alexcrichton

This comment has been minimized.

Show comment
Hide comment
@alexcrichton

alexcrichton Aug 23, 2016

Member

Discussed during @rust-lang/libs triage today, conclusion was to merge. Thanks for the update @SimonSapin!

@bors: r+

Member

alexcrichton commented Aug 23, 2016

Discussed during @rust-lang/libs triage today, conclusion was to merge. Thanks for the update @SimonSapin!

@bors: r+

@bors

This comment has been minimized.

Show comment
Hide comment
@bors

bors Aug 23, 2016

Contributor

📌 Commit 82678c5 has been approved by alexcrichton

Contributor

bors commented Aug 23, 2016

📌 Commit 82678c5 has been approved by alexcrichton

eddyb added a commit to eddyb/rust that referenced this pull request Aug 23, 2016

Rollup merge of #35755 - SimonSapin:char_convert, r=alexcrichton
Implement std::convert traits for char

This is motivated by avoiding the `as` operator, which sometimes silently truncates, and instead use conversions that are explicitly lossless and infallible.

I’m less certain that `From<u8> for char` should be implemented: while it matches an existing behavior of `as`, it’s not necessarily the right thing to use for non-ASCII bytes. It effectively decodes bytes as ISO/IEC 8859-1 (since Unicode designed its first 256 code points to be compatible with that encoding), but that is not apparent in the API name.

@eddyb eddyb referenced this pull request Aug 23, 2016

Closed

Rollup of 15 pull requests #35944

@bors

This comment has been minimized.

Show comment
Hide comment
@bors

bors Aug 23, 2016

Contributor

☔️ The latest upstream changes (presumably #35656) made this pull request unmergeable. Please resolve the merge conflicts.

Contributor

bors commented Aug 23, 2016

☔️ The latest upstream changes (presumably #35656) made this pull request unmergeable. Please resolve the merge conflicts.

src/libcore/char.rs
@@ -176,6 +172,41 @@ pub unsafe fn from_u32_unchecked(i: u32) -> char {
transmute(i)
}
+#[stable(feature = "char_convert", since = "1.12.0")]

This comment has been minimized.

@ollie27

ollie27 Aug 27, 2016

Contributor

Should these not be "1.13.0"?

@ollie27

ollie27 Aug 27, 2016

Contributor

Should these not be "1.13.0"?

This comment has been minimized.

@SimonSapin

SimonSapin Aug 29, 2016

Contributor

Not at the time I first opened this PR, but now yes. Fixed.

@SimonSapin

SimonSapin Aug 29, 2016

Contributor

Not at the time I first opened this PR, but now yes. Fixed.

src/libcore/char.rs
+#[stable(feature = "char_convert", since = "1.12.0")]
+impl From<u8> for char {
+ #[inline]
+ fn from(i: u8) -> Self {

This comment has been minimized.

@ollie27

ollie27 Aug 27, 2016

Contributor

I think it would be a good idea to add documentation to this method as it's not completely obvious how it's implemented. Specifically it should mention the code page.

@ollie27

ollie27 Aug 27, 2016

Contributor

I think it would be a good idea to add documentation to this method as it's not completely obvious how it's implemented. Specifically it should mention the code page.

This comment has been minimized.

@SimonSapin

SimonSapin Aug 29, 2016

Contributor

I disagree with using the term "code page", but I’ve added a detailed doc-comment.

@SimonSapin

SimonSapin Aug 29, 2016

Contributor

I disagree with using the term "code page", but I’ve added a detailed doc-comment.

+/// The error type returned when a conversion from u32 to char fails.
+#[unstable(feature = "try_from", issue = "33417")]
+#[derive(Copy, Clone, Debug, PartialEq, Eq)]
+pub struct CharTryFromError(());

This comment has been minimized.

@ollie27

ollie27 Aug 27, 2016

Contributor

This should probably implement std::error::Error.

@ollie27

ollie27 Aug 27, 2016

Contributor

This should probably implement std::error::Error.

This comment has been minimized.

@SimonSapin

SimonSapin Aug 29, 2016

Contributor

Done.

@SimonSapin

SimonSapin Aug 29, 2016

Contributor

Done.

@bors

This comment has been minimized.

Show comment
Hide comment
@bors

bors Aug 29, 2016

Contributor

🔒 Merge conflict

Contributor

bors commented Aug 29, 2016

🔒 Merge conflict

SimonSapin added some commits Aug 17, 2016

Implement From<char> for u32, and From<u8> for char
These fit with other From implementations between integer types.

This helps the coding style of avoiding the 'as' operator that sometimes
silently truncates, and signals that these specific conversions are
lossless and infaillible.
Implement TryFrom<u32> for char
For symmetry with From<char> for u32.
@alexcrichton

This comment has been minimized.

Show comment
Hide comment
Member

alexcrichton commented Aug 29, 2016

@bors

This comment has been minimized.

Show comment
Hide comment
@bors

bors Sep 1, 2016

Contributor

⌛️ Testing commit f040208 with merge b2799a5...

Contributor

bors commented Sep 1, 2016

⌛️ Testing commit f040208 with merge b2799a5...

bors added a commit that referenced this pull request Sep 1, 2016

Auto merge of #35755 - SimonSapin:char_convert, r=alexcrichton
Implement std::convert traits for char

This is motivated by avoiding the `as` operator, which sometimes silently truncates, and instead use conversions that are explicitly lossless and infallible.

I’m less certain that `From<u8> for char` should be implemented: while it matches an existing behavior of `as`, it’s not necessarily the right thing to use for non-ASCII bytes. It effectively decodes bytes as ISO/IEC 8859-1 (since Unicode designed its first 256 code points to be compatible with that encoding), but that is not apparent in the API name.

@bors bors merged commit f040208 into rust-lang:master Sep 1, 2016

1 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
homu Test successful
Details

@bluss bluss added the relnotes label Sep 8, 2016

@bluss

This comment has been minimized.

Show comment
Hide comment
@bluss

bluss Sep 8, 2016

Contributor

Mini-reminder: let's tag more user-visible stuff with relnotes

Contributor

bluss commented Sep 8, 2016

Mini-reminder: let's tag more user-visible stuff with relnotes

@SimonSapin SimonSapin deleted the SimonSapin:char_convert branch Sep 15, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment