Tracking issue for UTF-16 decoding iterators #27830

SimonSapin · 2015-08-14T10:47:16Z

#27808 proposes exposing in std::char two iterator adaptors Utf16Decoder and Utf16LossyDecoder. This functionality was previously only available with an API that require allocation (String::from_utf16{,_lossy}) or using the unstable rustc_unicode crate directly.

They are exposed unstable with a new utf16_decoder feature name. I’d like to stabilize them when we’re confident with the naming and API.

The text was updated successfully, but these errors were encountered:

alexcrichton · 2016-03-11T18:01:57Z

🔔 This issue is now entering its cycle-long final comment period for stabilization in 1.9 🔔

alexcrichton · 2016-03-11T23:00:36Z

My personal opinion:

Ok, so the APIs that this is talking about are:

std::char::decode_utf16
std::char::DecodeUtf16
impl Iterator<Item=Result<char, u16>> for DecodeUtf16
std::char::REPLACEMENT_CHARACTER

I would not personally be a huge fan of stabilizing this cycle, some concerns being:

If we have decode_utf16, shouldn't we have decode_utf8? We have String::from_utf16{,_lossy} to mirror str::from_utf8{,_lossy}.
Do we want to support decode_utfXX_lossy? Are they easy enough to do with a .map(|e| e.unwrap_or(REPLACEMENT_CHARACTER))?
I don't think we have many other top-level functions in the standard library that take iterators and return iterators?
Having only one char constant in the module seems kinda odd, surely there are more constant-like things (like newline?) which we may wish to have?
The error type of the iterator may be a little unidiomatic, we probably want to wrap it up in something to provide extra contextual information in the future.

I think I'd be more comfortable with stabilizing given some utf-8 decoding functions as well, but it's probably worth also looking at the matrix of conversions we have:

A	B	A -> B	B -> A
`Iterator<u8>`	`Iterator<char>`	not implemented	`flat_map(
`Iterator<u16>`	`Iterator<char>`	`char::decode_utf16`	`flat_map(
`&[u8]`	`&str`	`str::from_utf8`	`str::as_bytes`
`&[u16]`	`String`	`String::from_utf16`	impossible
`Iterator<u16>`	`str`	impossible	`str::encode_utf16`

I guess in that sense we'd be "complete" with decode_utf8? Is that all we need to put this issue to rest?

aturon · 2016-04-06T17:19:44Z

@alexcrichton

I don't think we have many other top-level functions in the standard library that take iterators and return iterators?

I don't see this as a concern, really. The more we work with iterators, the more natural things like "iterator transformers" such as this are. It's a fine ball to get rolling in my opinion.

The error type of the iterator may be a little unidiomatic, we probably want to wrap it up in something to provide extra contextual information in the future.

Agreed.

Having only one char constant in the module seems kinda odd, surely there are more constant-like things (like newline?) which we may wish to have?

Perhaps so -- but IMO this should influence organization more than anything. That is, we might want to think about a submodule for constants, if we do anticipate adding more over time.

Most of the other points have the flavor of: why stabilize just this one piece? I agree that I'd really like to have an overall vision here; I feel like every cycle we stabilize a couple of related methods. That said, your matrix is pretty useful, and does indeed suggest we should land both this and an analogous utf8 decoder.

BurntSushi · 2016-04-06T21:00:06Z

I would be in favor of postponing until we have a more complete vision here as well (I'm a consumer of UTF-8 functions like this, which I typically just implement in-crate). There may be other routines we want to consider as well, for example, decoding a UTF-8 sequence in reverse can often be useful.

SimonSapin · 2016-04-07T11:25:07Z

Do we want to support decode_utfXX_lossy? Are they easy enough to do with a .map(|e| e.unwrap_or(REPLACEMENT_CHARACTER))?

Since decode_utf16() has a dedicated return type this could also be decode_utf16().lossy(). IMO this is nicer than a function that would need to be imported separately. FWIW I had this in the PR originally and removed it at your suggestion :) #27808 (comment)

Having only one char constant in the module seems kinda odd, surely there are more constant-like things (like newline?) which we may wish to have?

We can remove the constant if lossy decoding is built-in.

I guess in that sense we'd be "complete" with decode_utf8? Is that all we need to put this issue to rest?

Yes, I do think we’re missing something lower-level than we currently have in std for decoding UTF-8. Something that does not allocate (could potentially be in libcore) and is more flexible than str::from_utf8 (which is strict (not lossy) and requires the entire input to be in a single slice).

But since &str and String use UTF-8 internally I don’t think decoding to Iterator<char> is ideal. We’d like to use slices of the input when possible.

I have some experiments at https://github.com/SimonSapin/rust-utf8. (In you’re interested, the commit history shows a number of different APIs I tried.) It supports "incremental" lossy decoding: input is a number of &[u8] chunks and a single code point may span multiple chunks. Output borrows input when possible and never allocates.

But this is significantly more API surface than, say, an iterator adaptor. And there’s probably a wide variety of use cases with slightly different constraints (@BurntSushi mentioned decoding in reverse), so I don’t know if it makes sense to try and support all of them in std.

Still, it’d be nice to have a single UTF-8 decoding primitive std that everything else could be based on. That way future performance improvements like #30740 (more may be possible with SIMD) would apply automatically, ideally. Though I don’t know yet what that primitive should be exactly, or if it’s even possible.

BurntSushi · 2016-04-07T11:34:25Z

@SimonSapin Would fn decode_utf8(src: &[u8]) -> Result<(char, usize), ::std::str::Utf8Error> be enough? See, e.g., https://github.com/rust-lang-nursery/regex/blob/master/src/utf8.rs#L53-L64 and https://github.com/rust-lang-nursery/regex/blob/master/src/utf8.rs#L115-L117 for the reverse case. (They are both unoptimized and return an Option instead of a Result.)

SimonSapin · 2016-04-07T11:59:13Z

@BurntSushi Building something that yields &str slices borrowing the input on top of this API is a fair amount of extra logic, even more so to support code points spanning chunks. And it would do more work than necessary (compute actual 32-bit values rather than just figure out which byte sequences are well-formed), I don’t know if the compile could optimize that away. And this API does not enable optimizations like #30740 that decode more than one code point at once.

alexcrichton · 2016-04-07T17:19:46Z

The libs team discussed this during triage yesterday and the conclusion was to stabilize essentially everything as-is modulo changing the error returned by the iterator.

We felt that there's room for decoding an iterator of u8 to char, but we can always add that later.

This commit applies all stabilizations, renamings, and deprecations that the library team has decided on for the upcoming 1.9 release. All tracking issues have gone through a cycle-long "final comment period" and the specific APIs stabilized/deprecated are: Stable * `std::panic` * `std::panic::catch_unwind` (renamed from `recover`) * `std::panic::resume_unwind` (renamed from `propagate`) * `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`) * `std::panic::UnwindSafe` (renamed from `RecoverSafe`) * `str::is_char_boundary` * `<*const T>::as_ref` * `<*mut T>::as_ref` * `<*mut T>::as_mut` * `AsciiExt::make_ascii_uppercase` * `AsciiExt::make_ascii_lowercase` * `char::decode_utf16` * `char::DecodeUtf16` * `char::DecodeUtf16Error` * `char::DecodeUtf16Error::unpaired_surrogate` * `BTreeSet::take` * `BTreeSet::replace` * `BTreeSet::get` * `HashSet::take` * `HashSet::replace` * `HashSet::get` * `OsString::with_capacity` * `OsString::clear` * `OsString::capacity` * `OsString::reserve` * `OsString::reserve_exact` * `OsStr::is_empty` * `OsStr::len` * `std::os::unix::thread` * `RawPthread` * `JoinHandleExt` * `JoinHandleExt::as_pthread_t` * `JoinHandleExt::into_pthread_t` * `HashSet::hasher` * `HashMap::hasher` * `CommandExt::exec` * `File::try_clone` * `SocketAddr::set_ip` * `SocketAddr::set_port` * `SocketAddrV4::set_ip` * `SocketAddrV4::set_port` * `SocketAddrV6::set_ip` * `SocketAddrV6::set_port` * `SocketAddrV6::set_flowinfo` * `SocketAddrV6::set_scope_id` * `<[T]>::copy_from_slice` * `ptr::read_volatile` * `ptr::write_volatile` * The `#[deprecated]` attribute * `OpenOptions::create_new` Deprecated * `std::raw::Slice` - use raw parts of `slice` module instead * `std::raw::Repr` - use raw parts of `slice` module instead * `str::char_range_at` - use slicing plus `chars()` plus `len_utf8` * `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8` * `str::char_at` - use slicing plus `chars()` * `str::char_at_reverse` - use slicing plus `chars().rev()` * `str::slice_shift_char` - use `chars()` plus `Chars::as_str` * `CommandExt::session_leader` - use `before_exec` instead. Closes rust-lang#27719 cc rust-lang#27751 (deprecating the `Slice` bits) Closes rust-lang#27754 Closes rust-lang#27780 Closes rust-lang#27809 Closes rust-lang#27811 Closes rust-lang#27830 Closes rust-lang#28050 Closes rust-lang#29453 Closes rust-lang#29791 Closes rust-lang#29935 Closes rust-lang#30014 Closes rust-lang#30752 Closes rust-lang#31262 cc rust-lang#31398 (still need to deal with `before_exec`) Closes rust-lang#31405 Closes rust-lang#31572 Closes rust-lang#31755 Closes rust-lang#31756

std: Stabilize APIs for the 1.9 release This commit applies all stabilizations, renamings, and deprecations that the library team has decided on for the upcoming 1.9 release. All tracking issues have gone through a cycle-long "final comment period" and the specific APIs stabilized/deprecated are: Stable * `std::panic` * `std::panic::catch_unwind` (renamed from `recover`) * `std::panic::resume_unwind` (renamed from `propagate`) * `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`) * `std::panic::UnwindSafe` (renamed from `RecoverSafe`) * `str::is_char_boundary` * `<*const T>::as_ref` * `<*mut T>::as_ref` * `<*mut T>::as_mut` * `AsciiExt::make_ascii_uppercase` * `AsciiExt::make_ascii_lowercase` * `char::decode_utf16` * `char::DecodeUtf16` * `char::DecodeUtf16Error` * `char::DecodeUtf16Error::unpaired_surrogate` * `BTreeSet::take` * `BTreeSet::replace` * `BTreeSet::get` * `HashSet::take` * `HashSet::replace` * `HashSet::get` * `OsString::with_capacity` * `OsString::clear` * `OsString::capacity` * `OsString::reserve` * `OsString::reserve_exact` * `OsStr::is_empty` * `OsStr::len` * `std::os::unix::thread` * `RawPthread` * `JoinHandleExt` * `JoinHandleExt::as_pthread_t` * `JoinHandleExt::into_pthread_t` * `HashSet::hasher` * `HashMap::hasher` * `CommandExt::exec` * `File::try_clone` * `SocketAddr::set_ip` * `SocketAddr::set_port` * `SocketAddrV4::set_ip` * `SocketAddrV4::set_port` * `SocketAddrV6::set_ip` * `SocketAddrV6::set_port` * `SocketAddrV6::set_flowinfo` * `SocketAddrV6::set_scope_id` * `<[T]>::copy_from_slice` * `ptr::read_volatile` * `ptr::write_volatile` * The `#[deprecated]` attribute * `OpenOptions::create_new` Deprecated * `std::raw::Slice` - use raw parts of `slice` module instead * `std::raw::Repr` - use raw parts of `slice` module instead * `str::char_range_at` - use slicing plus `chars()` plus `len_utf8` * `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8` * `str::char_at` - use slicing plus `chars()` * `str::char_at_reverse` - use slicing plus `chars().rev()` * `str::slice_shift_char` - use `chars()` plus `Chars::as_str` * `CommandExt::session_leader` - use `before_exec` instead. Closes #27719 cc #27751 (deprecating the `Slice` bits) Closes #27754 Closes #27780 Closes #27809 Closes #27811 Closes #27830 Closes #28050 Closes #29453 Closes #29791 Closes #29935 Closes #30014 Closes #30752 Closes #31262 cc #31398 (still need to deal with `before_exec`) Closes #31405 Closes #31572 Closes #31755 Closes #31756

This commit applies all stabilizations, renamings, and deprecations that the library team has decided on for the upcoming 1.9 release. All tracking issues have gone through a cycle-long "final comment period" and the specific APIs stabilized/deprecated are: Stable * `std::panic` * `std::panic::catch_unwind` (renamed from `recover`) * `std::panic::resume_unwind` (renamed from `propagate`) * `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`) * `std::panic::UnwindSafe` (renamed from `RecoverSafe`) * `str::is_char_boundary` * `<*const T>::as_ref` * `<*mut T>::as_ref` * `<*mut T>::as_mut` * `AsciiExt::make_ascii_uppercase` * `AsciiExt::make_ascii_lowercase` * `char::decode_utf16` * `char::DecodeUtf16` * `char::DecodeUtf16Error` * `char::DecodeUtf16Error::unpaired_surrogate` * `BTreeSet::take` * `BTreeSet::replace` * `BTreeSet::get` * `HashSet::take` * `HashSet::replace` * `HashSet::get` * `OsString::with_capacity` * `OsString::clear` * `OsString::capacity` * `OsString::reserve` * `OsString::reserve_exact` * `OsStr::is_empty` * `OsStr::len` * `std::os::unix::thread` * `RawPthread` * `JoinHandleExt` * `JoinHandleExt::as_pthread_t` * `JoinHandleExt::into_pthread_t` * `HashSet::hasher` * `HashMap::hasher` * `CommandExt::exec` * `File::try_clone` * `SocketAddr::set_ip` * `SocketAddr::set_port` * `SocketAddrV4::set_ip` * `SocketAddrV4::set_port` * `SocketAddrV6::set_ip` * `SocketAddrV6::set_port` * `SocketAddrV6::set_flowinfo` * `SocketAddrV6::set_scope_id` * `<[T]>::copy_from_slice` * `ptr::read_volatile` * `ptr::write_volatile` * The `#[deprecated]` attribute * `OpenOptions::create_new` Deprecated * `std::raw::Slice` - use raw parts of `slice` module instead * `std::raw::Repr` - use raw parts of `slice` module instead * `str::char_range_at` - use slicing plus `chars()` plus `len_utf8` * `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8` * `str::char_at` - use slicing plus `chars()` * `str::char_at_reverse` - use slicing plus `chars().rev()` * `str::slice_shift_char` - use `chars()` plus `Chars::as_str` * `CommandExt::session_leader` - use `before_exec` instead. Closes rust-lang#27719 cc rust-lang#27751 (deprecating the `Slice` bits) Closes rust-lang#27754 Closes rust-lang#27780 Closes rust-lang#27809 Closes rust-lang#27811 Closes rust-lang#27830 Closes rust-lang#28050 Closes rust-lang#29453 Closes rust-lang#29791 Closes rust-lang#29935 Closes rust-lang#30014 Closes rust-lang#30752 Closes rust-lang#31262 cc rust-lang#31398 (still need to deal with `before_exec`) Closes rust-lang#31405 Closes rust-lang#31572 Closes rust-lang#31755 Closes rust-lang#31756

lambda-fairy · 2018-03-24T01:19:28Z

Is there any reason that these items are defined in std_unicode and not core?

I was investigating #49319 and ended up here.

SimonSapin · 2018-03-24T01:23:48Z

I’ll respond in #49319 since this thread has been closed for two years :)

SimonSapin mentioned this issue Aug 14, 2015

Refactor low-level UTF-16 decoding and move Borrow(Mut) to libcore. #27808

Merged

steveklabnik added the A-libs label Aug 17, 2015

alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Mar 8, 2016

aturon added the I-nominated label Mar 9, 2016

alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Mar 11, 2016

alexcrichton mentioned this issue Apr 7, 2016

std: Stabilize APIs for the 1.9 release #32804

Merged

alexcrichton mentioned this issue Apr 7, 2016

Tracking issue for char encoding methods #27784

Closed

bors closed this as completed in #32804 Apr 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking issue for UTF-16 decoding iterators #27830

Tracking issue for UTF-16 decoding iterators #27830

SimonSapin commented Aug 14, 2015

alexcrichton commented Mar 11, 2016

alexcrichton commented Mar 11, 2016

aturon commented Apr 6, 2016

BurntSushi commented Apr 6, 2016

SimonSapin commented Apr 7, 2016

BurntSushi commented Apr 7, 2016

SimonSapin commented Apr 7, 2016

alexcrichton commented Apr 7, 2016

lambda-fairy commented Mar 24, 2018

SimonSapin commented Mar 24, 2018

Tracking issue for UTF-16 decoding iterators #27830

Tracking issue for UTF-16 decoding iterators #27830

Comments

SimonSapin commented Aug 14, 2015

alexcrichton commented Mar 11, 2016

alexcrichton commented Mar 11, 2016

aturon commented Apr 6, 2016

BurntSushi commented Apr 6, 2016

SimonSapin commented Apr 7, 2016

BurntSushi commented Apr 7, 2016

SimonSapin commented Apr 7, 2016

alexcrichton commented Apr 7, 2016

lambda-fairy commented Mar 24, 2018

SimonSapin commented Mar 24, 2018