Tracking issue for char encoding methods #27784

Closed
alexcrichton opened this Issue Aug 13, 2015 · 71 comments

@alexcrichton
Member

alexcrichton commented Aug 13, 2015

This is a tracking issue for the unstable unicode feature and the char::encode_utf{8,16} methods.

The interfaces here are a little wonky but are done for performance. It's not clear whether these need to be exported or not or if there's a better method to do so through iterators.

@SimonSapin

Contributor

SimonSapin commented Aug 13, 2015

How about returning enums like enum OneOrTwo { One(u16), Two(u16, u16) } or enum Utf16Encoding { SingleCodeUnit(u16), SurrogatePair(u16, u16) }?
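
A minimal sketch of what the second enum shape could look like in practice. The function name `encode_utf16_enum` is hypothetical and is not a standard library API; only the enum and variant names come from the suggestion above.

```rust
/// Enum shape taken from the suggestion above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Utf16Encoding {
    SingleCodeUnit(u16),
    SurrogatePair(u16, u16),
}

/// Hypothetical helper (not part of std): encode one `char` as UTF-16
/// without needing a caller-provided buffer.
fn encode_utf16_enum(c: char) -> Utf16Encoding {
    let code = c as u32;
    if code < 0x1_0000 {
        // Basic Multilingual Plane: a single code unit.
        // A `char` is never a surrogate, so this value is always valid.
        Utf16Encoding::SingleCodeUnit(code as u16)
    } else {
        // Supplementary planes: split the 20-bit offset into two halves.
        let offset = code - 0x1_0000;
        Utf16Encoding::SurrogatePair(
            0xD800 | (offset >> 10) as u16,
            0xDC00 | (offset & 0x3FF) as u16,
        )
    }
}
```

The ergonomics question raised in the reply still applies: the caller has to `match` on the result before the code units can be written anywhere.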

@alexcrichton

Member Author

alexcrichton commented Aug 13, 2015

Certainly possible, but there's also the question of ergonomics here in terms of what to do with that after you've got the information.

@SimonSapin

Contributor

SimonSapin commented Aug 13, 2015

I think that one form or another of this functionality that doesn’t require allocation should be exposed. Returning an iterator is nicer than taking &mut [_], but I don’t know about performance.

@nagisa

Contributor

nagisa commented Aug 13, 2015

I’ve suggested taking &mut [u8; 4]/&mut [u16; 2] and returning usize once. The major downside is inability to convert from slice to array.

Taking anything other than a slice also makes the following use case less elegant than it is now:

let mut buffer = Vec::with_capacity(alot);
let mut idx = 0;
loop {
    idx += some_char().encode_utf8(&mut buffer[idx..]).unwrap();
}

bors added a commit that referenced this issue Aug 27, 2015

Auto merge of #27808 - SimonSapin:utf16decoder, r=alexcrichton
* Rename `Utf16Items` to `Utf16Decoder`. "Items" is meaningless.
* Generalize it to any `u16` iterator, not just `[u16].iter()`
* Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enables using the `FromIterator for Result` impl.
* Replace `Utf16Item::to_char_lossy` with a `Utf16Decoder::lossy` iterator adaptor.

This is a [breaking change], but only for users of the unstable `rustc_unicode` crate.

I’d like this functionality to be stabilized and re-exported in `std` eventually, as the "low-level equivalent" of `String::from_utf16` and `String::from_utf16_lossy` like #27784 is the low-level equivalent of #27714.

CC @aturon, @alexcrichton
@SimonSapin

Contributor

SimonSapin commented Oct 28, 2015

How about returning something that both is an iterator and dereferences to a slice?

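use std::ops::Deref;

// `bytes` holds the encoded form right-aligned in the array;
// `position` is the index of the first byte not yet yielded.
struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Deref for Utf8Char {
    type Target = [u8];
    // The slice view covers only the bytes the iterator has not consumed.
    fn deref(&self) -> &[u8] { &self.bytes[self.position..] }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.position < self.bytes.len() {
            let byte = self.bytes[self.position];
            self.position += 1;
            Some(byte)
        } else {
            None
        }
    }
}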

(“Short” code points have zeros as padding at the start of the array.)

… and similarly for UTF-16, but with [u16; 2] instead of [u8; 4].
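
To make the front-padding concrete, here is a rough sketch of how such a value could be built. The constructor `Utf8Char::new` is hypothetical, and it leans on the slice-taking `char::encode_utf8` that was stabilized later, so it illustrates the layout rather than the implementation being proposed.

```rust
use std::ops::Deref;

struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Utf8Char {
    /// Hypothetical constructor: right-align the encoded bytes so that
    /// `position` starts at the first meaningful byte.
    fn new(c: char) -> Utf8Char {
        let mut bytes = [0u8; 4];
        let position = 4 - c.len_utf8();
        // Write into the tail of the array; the head stays zero padding.
        c.encode_utf8(&mut bytes[position..]);
        Utf8Char { bytes, position }
    }
}

impl Deref for Utf8Char {
    type Target = [u8];
    // Only the bytes not yet yielded by the iterator are visible.
    fn deref(&self) -> &[u8] {
        &self.bytes[self.position..]
    }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        let byte = *self.bytes.get(self.position)?;
        self.position += 1;
        Some(byte)
    }
}
```

With this layout, advancing the iterator and shrinking the dereferenced slice are the same operation: both only move `position` forward.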

@BurntSushi

Member

BurntSushi commented Nov 8, 2015

@SimonSapin That looks really sweet to me!

kkawakam referenced this issue Jan 13, 2016: Compile against Rust Stable #17 (closed, 3 of 5 tasks complete)
@BurntSushi

Member

BurntSushi commented Jan 20, 2016

@SimonSapin In your deref method, I think that should be &self.bytes[..self.position], right?

@BurntSushi

Member

BurntSushi commented Jan 20, 2016

@SimonSapin What do you think about also exposing decode_{utf8,utf16} methods? Basically, if you have some bytes and want the next encoded char out of it, today I think you need to decode into a string and then call chars, which is a bit roundabout (and does extra work I believe).

@SimonSapin

Contributor

SimonSapin commented Jan 21, 2016

No, deref returns the slice that hasn’t been consumed by the iterator yet. For code points that take fewer than 4 bytes to begin with, the padding is at the start of the array, not the end.

We already have char::decode_utf16 that takes and returns iterators.

For UTF-8 I do want to expose a decoder that’s more low-level than what we currently have, but I’m not sure what it should look like. I have some experiments at https://github.com/SimonSapin/rust-utf8
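
For reference, a short sketch of the iterator-in, iterator-out shape of `char::decode_utf16` mentioned above. The two wrapper functions are hypothetical helpers written for illustration; `decode_utf16`, `DecodeUtf16Error` and `REPLACEMENT_CHARACTER` are the standard library items.

```rust
use std::char::{decode_utf16, DecodeUtf16Error, REPLACEMENT_CHARACTER};

/// Hypothetical helper: decode UTF-16, replacing unpaired surrogates
/// with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().cloned())
        .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}

/// Hypothetical helper: decode UTF-16, stopping at the first unpaired
/// surrogate. This relies on `FromIterator` for `Result`.
fn decode_strict(units: &[u16]) -> Result<String, DecodeUtf16Error> {
    decode_utf16(units.iter().cloned()).collect()
}
```

Neither helper needs an intermediate `Vec<char>`; each `char` is produced on demand from the `u16` iterator.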

@BurntSushi

Member

BurntSushi commented Jan 21, 2016

> padding is at the start of the array, not the end.

Ah! That was what I missed. Thanks for the clarification.

> I have some experiments at https://github.com/SimonSapin/rust-utf8

Interesting. That is much more complex than I had thought it would be! (I hadn't considered returning additional info about incomplete sequences.)

@SimonSapin

Contributor

SimonSapin commented Jan 21, 2016

Most of the complexity comes from self-imposed constraints:

  • Support “chunked” decoding so you can start processing, say, an HTML document before it’s finished downloading from the network. The bytes for a single char can be split across chunks.
  • Make it possible to emit &str slices that borrow &[u8] input bytes whenever possible, to avoid copying too many bytes.

I don’t know how much of that should be in the standard library.

But when the standard library gets performance improvement like #30740 (and perhaps more in the future with SIMD or something?), ideally they’d be in a low-level algorithm that everything else builds on top of.
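
The “chunked” constraint can be illustrated with a small sketch built only on `std::str::from_utf8` and `Utf8Error`. This is not the rust-utf8 API: `ChunkedDecoder`, `feed` and `finish` are hypothetical names, and for simplicity the sketch copies every chunk into an internal buffer, so it does not meet the second (borrowing) constraint at all.

```rust
use std::str;

/// Hypothetical chunked decoder; invalid input becomes U+FFFD.
struct ChunkedDecoder {
    /// Bytes of an incomplete sequence carried over from earlier chunks.
    pending: Vec<u8>,
}

impl ChunkedDecoder {
    fn new() -> Self {
        ChunkedDecoder { pending: Vec::new() }
    }

    /// Decode one chunk and append the result to `out`.
    fn feed(&mut self, chunk: &[u8], out: &mut String) {
        self.pending.extend_from_slice(chunk);
        let mut start = 0;
        loop {
            match str::from_utf8(&self.pending[start..]) {
                Ok(s) => {
                    out.push_str(s);
                    start = self.pending.len();
                    break;
                }
                Err(e) => {
                    // Everything before `valid_end` is known to be valid.
                    let valid_end = start + e.valid_up_to();
                    out.push_str(str::from_utf8(&self.pending[start..valid_end]).unwrap());
                    match e.error_len() {
                        // Definitely invalid bytes: replace and continue.
                        Some(n) => {
                            out.push('\u{FFFD}');
                            start = valid_end + n;
                        }
                        // Sequence cut off by the chunk boundary: keep it
                        // for the next call.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
    }

    /// Signal end of input; a dangling partial sequence is an error.
    fn finish(&mut self, out: &mut String) {
        if !self.pending.is_empty() {
            out.push('\u{FFFD}');
            self.pending.clear();
        }
    }
}
```

The `error_len() == None` case is the “additional info about incomplete sequences” mentioned earlier: it distinguishes input that is invalid from input that merely has not finished arriving.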

@alexcrichton

Member Author

alexcrichton commented Mar 11, 2016

🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔

The API proposed by @SimonSapin seems reasonable, perhaps in the form of:

impl char {
    fn encode_utf8(&self) -> EncodeUtf8;
}

struct EncodeUtf8 {
    // ...
}

impl Iterator for EncodeUtf8 {
    type Item = u8;
    // ...
}

impl EncodeUtf8 {
    #[unstable(...)]
    pub fn as_slice(&self) -> &[u8] { /* ... */ }
}

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 11, 2016

std: Change `encode_utf{8,16}` to return iterators
Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc rust-lang#27784

bors added a commit that referenced this issue Mar 22, 2016

Auto merge of #32204 - alexcrichton:redesign-char-encoding-types, r=aturon

std: Change `encode_utf{8,16}` to return iterators

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784

@Stebalien

Contributor

Stebalien commented Mar 23, 2016

Why add an as_slice method instead of implementing Deref and Index? I was under the impression that &thing[..] should always be preferred to thing.as_slice().

@daschl

daschl commented Oct 10, 2016

@alexcrichton not sure if this is in scope, but I have the use case where I want to write a "char" into a u8 array at a given offset. Right now I've copied the main part from stdlib and do it like so:

#[inline]
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    if (ch.len_utf8() + offset) > array.len() {
        return false;
    }

    let code = *ch as u32;
    if code < MAX_ONE_B {
        array[offset] = code as u8;
    } else if code < MAX_TWO_B {
        array[offset] = (code >> 6 & 0x1F) as u8 | TAG_TWO_B;
        array[offset + 1] = (code & 0x3F) as u8 | TAG_CONT;
    } else if code < MAX_THREE_B {
        array[offset] = (code >> 12 & 0x0F) as u8 | TAG_THREE_B;
        array[offset + 1] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code & 0x3F) as u8 | TAG_CONT;
    } else {
        array[offset] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
        array[offset + 1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 3] = (code & 0x3F) as u8 | TAG_CONT;
    }
    true
}

Again not sure if this is in scope, but I wanted to bring it up in case such a use case makes sense to integrate.

@SimonSapin

Contributor

SimonSapin commented Oct 10, 2016

@daschl you can deal with the offset by slicing:

fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    let slice = &mut array[offset..];
    if ch.len_utf8() > slice.len() {
        return false
    }
    ch.encode_utf8(slice);
    true
}
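
A self-contained variant of the same idea, written as a free function against the stabilized `char::encode_utf8`. Dropping `&self`, taking `ch` by value, and using `get_mut` so that an out-of-range `offset` returns `false` instead of panicking are all assumptions made for this sketch, not part of the snippet above.

```rust
/// Hypothetical free-function version: write `ch` at `offset`, or
/// return `false` if it does not fit.
fn write_char_into_array(offset: usize, ch: char, array: &mut [u8]) -> bool {
    // `get_mut(offset..)` yields `None` when `offset > array.len()`,
    // whereas indexing with `array[offset..]` would panic.
    match array.get_mut(offset..) {
        Some(slice) if ch.len_utf8() <= slice.len() => {
            ch.encode_utf8(slice);
            true
        }
        _ => false,
    }
}
```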
@bluss

Contributor

bluss commented Oct 10, 2016

@alexcrichton All of your suggestions are good. &'a mut str seems like the most powerful one, and it allows recovering all the other ones.

I don't know if this data point is interesting, but ArrayString ended up copying this code too. When I shaped it after how it's used it ends up looking like

pub fn encode_utf8(ch: char, buf: &mut [u8]) -> Result<usize, EncodeError>

impl: https://github.com/bluss/arrayvec/blob/d43c959fa8afa912c497104b60a64892e1178f9d/src/char_ext.rs#L28-L51

use: https://github.com/bluss/arrayvec/blob/d43c959fa8afa912c497104b60a64892e1178f9d/src/array_string.rs#L114-L119
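
For readers who do not want to follow the links, a rough sketch of a function with that signature, layered on the standard `char::encode_utf8` rather than on hand-written bit manipulation. The body and the unit-struct definition of `EncodeError` are assumptions for illustration; the linked arrayvec code may well differ.

```rust
/// Hypothetical error type; only its name comes from the signature above.
#[derive(Debug, PartialEq, Eq)]
struct EncodeError;

/// Encode `ch` at the start of `buf` and return the number of bytes
/// written, or an error (instead of a panic) if `buf` is too short.
fn encode_utf8(ch: char, buf: &mut [u8]) -> Result<usize, EncodeError> {
    let len = ch.len_utf8();
    match buf.get_mut(..len) {
        Some(dst) => {
            // `dst` is exactly `len` bytes long, so this cannot panic.
            ch.encode_utf8(dst);
            Ok(len)
        }
        None => Err(EncodeError),
    }
}
```

This is the shape a fixed-capacity string wants: running out of room is an expected outcome that the caller handles, not a bug.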

@daschl

daschl commented Oct 10, 2016

@SimonSapin ah good to know, thanks! Let's hope it gets stable soon then

@bluss

Contributor

bluss commented Oct 10, 2016

I would vote for no panic. That also simplifies the bounds checking (the impl linked above has all bounds checks elided, since buf.len() is being tested manually).

@alexcrichton

Member Author

alexcrichton commented Nov 1, 2016

@rfcbot fcp merge

These methods have been around for a while, I propose we merge!

@alexcrichton

Member Author

alexcrichton commented Nov 1, 2016

er, stabilize*

@rfcbot

rfcbot commented Nov 1, 2016

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams:

Concerns:

Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@brson

Contributor

brson commented Nov 4, 2016

It's not obvious to me that the current API is the one we want since there has been ongoing discussion. @bluss recently suggested it should not panic, and the current implementation does panic.

@brson

Contributor

brson commented Nov 4, 2016

@rfcbot concern panic-vs-not-panic

@alexcrichton

Member Author

alexcrichton commented Nov 4, 2016

I'm personally convinced by @SimonSapin's points above, which is that there is a statically known size (4 and 2) to encode all utf8 and utf16 characters. Callers basically always need to pass in that size buffer, so it seems more like a programmer error if you pass in a small buffer than a runtime error that should be handled.

@SimonSapin

Contributor

SimonSapin commented Nov 4, 2016

To reiterate: you can use 2 or 4 to keep things simple or to use a statically-sized array, but you can also use char::len_utf8 or char::len_utf16 to find the precise length that is needed.
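
As a small illustration of the second option, the sketch below sizes each buffer with `len_utf8` / `len_utf16` so that it is exactly as long as the encoding needs. The helper names are hypothetical, and a `Vec` is used only to make the exact length visible; a stack array of 4 or 2 elements avoids the allocation.

```rust
/// Hypothetical helper: the UTF-8 bytes of `c`, in an exactly sized buffer.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = vec![0u8; c.len_utf8()];
    // The buffer is precisely long enough, so this never panics.
    c.encode_utf8(&mut buf);
    buf
}

/// Hypothetical helper: the UTF-16 code units of `c`, exactly sized.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = vec![0u16; c.len_utf16()];
    c.encode_utf16(&mut buf);
    buf
}
```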

@bluss

Contributor

bluss commented Nov 4, 2016

I don't want to cause a gridlock, it was just what my perspective was in that moment. The fixed capacity use case is more uncommon, and the motivation for panic is following the usual conventions, so it certainly makes sense.

@brson

Contributor

brson commented Nov 10, 2016

OK, it sounds like there's no major opposition to the api as is.

@brson

Contributor

brson commented Nov 10, 2016

@rfcbot resolved panic-vs-not-panic

@Kimundi

Member

Kimundi commented Nov 11, 2016

Apart from the panicking, I'm a bit confused right now about what the actual API/signature is going to be. The one that returns &mut str?

@alexcrichton

Member Author

alexcrichton commented Nov 11, 2016

@Kimundi

Yeah encode_utf8 looks like:

fn encode_utf8(self, dst: &mut [u8]) -> &mut str

and encode_utf16 looks like:

fn encode_utf16(self, dst: &mut [u16]) -> &mut [u16]
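
A brief sketch of how these two signatures tend to be used: a stack buffer of the maximum size goes in, and the returned sub-slice covers exactly the part that was written. The function names `write_chars` and `push_utf16` are hypothetical; only `encode_utf8` and `encode_utf16` are the methods discussed here.

```rust
use std::io::{self, Write};

/// Hypothetical helper: stream chars to any writer without allocating.
fn write_chars<W: Write>(out: &mut W, chars: impl Iterator<Item = char>) -> io::Result<()> {
    // 4 bytes is enough for any `char`, so `encode_utf8` cannot panic.
    let mut buf = [0u8; 4];
    for c in chars {
        // The returned `&mut str` borrows only the bytes that were written.
        let encoded: &mut str = c.encode_utf8(&mut buf);
        out.write_all(encoded.as_bytes())?;
    }
    Ok(())
}

/// Hypothetical helper: append one char to a UTF-16 buffer.
fn push_utf16(out: &mut Vec<u16>, c: char) {
    // 2 code units is enough for any `char`.
    let mut buf = [0u16; 2];
    out.extend_from_slice(c.encode_utf16(&mut buf));
}
```

Because the return value is already trimmed to the encoded length, callers do not need to track a separate byte count.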
@Kimundi

Member

Kimundi commented Nov 11, 2016

Alright!

@rfcbot

rfcbot commented Nov 12, 2016

🔔 This is now entering its final comment period, as per the review above. 🔔

psst @alexcrichton, I wasn't able to add the final-comment-period label, please do so.

@rfcbot

rfcbot commented Nov 22, 2016

The final comment period is now complete.

bors added a commit that referenced this issue Dec 16, 2016

Auto merge of #38369 - aturon:stab-1.15, r=alexcrichton
Library stabilizations/deprecations for 1.15 release

Stabilized:

- `std::iter::Iterator::{min_by, max_by}`
- `std::os::*::fs::FileExt`
- `std::sync::atomic::Atomic*::{get_mut, into_inner}`
- `std::vec::IntoIter::{as_slice, as_mut_slice}`
- `std::sync::mpsc::Receiver::try_iter`
- `std::os::unix::process::CommandExt::before_exec`
- `std::rc::Rc::{strong_count, weak_count}`
- `std::sync::Arc::{strong_count, weak_count}`
- `std::char::{encode_utf8, encode_utf16}`
- `std::cell::Ref::clone`
- `std::io::Take::into_inner`

Deprecated:

- `std::rc::Rc::{would_unwrap, is_unique}`
- `std::cell::RefCell::borrow_state`

Closes #23755
Closes #27733
Closes #27746
Closes #27784
Closes #28356
Closes #31398
Closes #34931
Closes #35601
Closes #35603
Closes #35918
Closes #36105

@bors bors closed this in #38369 Dec 18, 2016

gwenn added a commit to gwenn/rustyline that referenced this issue Dec 26, 2016
