Tracking issue for char encoding methods #27784
Comments
alexcrichton added the T-libs and B-unstable labels on Aug 13, 2015
How about returning enums like |
Certainly possible, but there's also the question of ergonomics here in terms of what to do with that after you've got the information. |
This was referenced Aug 13, 2015
I think that one form or another of this functionality that doesn’t require allocation should be exposed. Returning an iterator is nicer than taking |
I’ve suggested taking Taking anything other than a slice also makes the following use case less elegant than it is now:
let mut buffer = Vec::with_capacity(alot);
let mut idx = 0;
loop {
    idx += some_char().encode_utf8(&mut buffer[idx..]).unwrap();
}
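For readers trying the loop above today, here is a minimal sketch of the same idea against the signature that was eventually stabilized (`encode_utf8` returning `&mut str`). The function name `encode_all` is hypothetical. Note one assumption that differs from the comment: the buffer is zero-filled rather than created with `Vec::with_capacity`, because a vector with capacity but zero length yields an empty slice.

```rust
/// Encodes every char into one contiguous byte buffer.
/// `encode_all` is a hypothetical helper, not part of the standard library.
fn encode_all(chars: &[char]) -> Vec<u8> {
    // Zero-fill to the worst case (4 bytes per char) so that
    // `&mut buffer[idx..]` always has room for the next char.
    let mut buffer = vec![0u8; chars.len() * 4];
    let mut idx = 0;
    for &c in chars {
        // The returned `&mut str` covers exactly the bytes just written.
        idx += c.encode_utf8(&mut buffer[idx..]).len();
    }
    // Drop the unused tail.
    buffer.truncate(idx);
    buffer
}
```

Because the destination is a plain slice, advancing `idx` by the encoded length is all the bookkeeping the caller needs.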
Ms2ger referenced this issue on Aug 16, 2015: Tracking: Unstable Rust feature gates used by Servo #5286 (Open)
bors added a commit that referenced this issue on Aug 27, 2015
How about returning something that is both an iterator and dereferences to a slice?
use std::ops::Deref;

struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Deref for Utf8Char {
    type Target = [u8];
    // The not-yet-consumed bytes, viewed as a slice.
    fn deref(&self) -> &[u8] { &self.bytes[self.position..] }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.position < self.bytes.len() {
            let byte = self.bytes[self.position];
            self.position += 1;
            Some(byte)
        } else {
            None
        }
    }
}
(“Short” code points have zeros as padding at the start of the array.) … and similarly for UTF-16, but with
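The proposal above does not show how such a value would be built. As a rough sketch only, the version below adds a hypothetical constructor `Utf8Char::new` that right-aligns the encoded bytes so the zero padding sits at the start of the array, as the comment describes. It relies on the later, stabilized `char::encode_utf8(&mut [u8]) -> &mut str` to do the actual encoding, which is an assumption and not part of the original suggestion.

```rust
use std::ops::Deref;

/// UTF-8 encoding of one `char`, right-aligned in a 4-byte array.
struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Utf8Char {
    /// Hypothetical constructor, not part of the original comment.
    fn new(c: char) -> Utf8Char {
        let mut scratch = [0u8; 4];
        let len = c.encode_utf8(&mut scratch).len();
        // Short encodings get zero padding at the start of the array.
        let position = 4 - len;
        let mut bytes = [0u8; 4];
        bytes[position..].copy_from_slice(&scratch[..len]);
        Utf8Char { bytes, position }
    }
}

impl Deref for Utf8Char {
    type Target = [u8];
    // The bytes that have not been yielded by the iterator yet.
    fn deref(&self) -> &[u8] {
        &self.bytes[self.position..]
    }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        // `get` returns None once `position` reaches the end of the array.
        let byte = *self.bytes.get(self.position)?;
        self.position += 1;
        Some(byte)
    }
}
```

With this layout the slice view shrinks as the iterator advances, so `&*value` always shows the remaining bytes.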
@SimonSapin That looks really sweet to me! |
@SimonSapin In your |
@SimonSapin What do you think about also exposing |
No, We already have For UTF-8 I do want to expose a decoder that’s more low-level than what we currently have, but I’m not sure what it should look like. I have some experiments at https://github.com/SimonSapin/rust-utf8 |
Ah! That was what I missed. Thanks for the clarification.
Interesting. That is much more complex than I had thought it would be! (I hadn't considered returning additional info about incomplete sequences.) |
Most of the complexity comes from self-imposed constraints:
I don’t know how much of that should be in the standard library. But when the standard library gets performance improvements like #30740 (and perhaps more in the future with SIMD or something?), ideally they’d be in a low-level algorithm that everything else builds on top of.
alexcrichton added the I-nominated label on Mar 9, 2016
The API proposed by @SimonSapin seems reasonable, perhaps in the form of:
impl char {
    fn encode_utf8(&self) -> EncodeUtf8;
}
struct EncodeUtf8 {
    // ...
}
impl Iterator for EncodeUtf8 {
    type Item = u8;
    // ...
}
impl EncodeUtf8 {
    #[unstable(...)]
    pub fn as_slice(&self) -> &[u8] { /* ... */ }
}
alexcrichton added the final-comment-period label and removed the I-nominated label on Mar 11, 2016
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Mar 11, 2016
alexcrichton referenced this issue on Mar 11, 2016: std: Change `encode_utf{8,16}` to return iterators #32204 (Merged)
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Mar 11, 2016
bors added a commit that referenced this issue on Mar 22, 2016
alexcrichton added a commit to alexcrichton/rust that referenced this issue on Mar 22, 2016
bors added a commit that referenced this issue on Mar 22, 2016
bors added a commit that referenced this issue on Mar 22, 2016
Why add an |
daschl commented Oct 10, 2016
@alexcrichton not sure if this is in scope, but I have the use case where I want to write a "char" into a u8 array at a given offset. Right now I've copied the main part from stdlib and do it like so:
#[inline]
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    if (ch.len_utf8() + offset) > array.len() {
        return false;
    }
    let code = *ch as u32;
    if code < MAX_ONE_B {
        array[offset] = code as u8;
    } else if code < MAX_TWO_B {
        array[offset] = (code >> 6 & 0x1F) as u8 | TAG_TWO_B;
        array[offset + 1] = (code & 0x3F) as u8 | TAG_CONT;
    } else if code < MAX_THREE_B {
        array[offset] = (code >> 12 & 0x0F) as u8 | TAG_THREE_B;
        array[offset + 1] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code & 0x3F) as u8 | TAG_CONT;
    } else {
        array[offset] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
        array[offset + 1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 3] = (code & 0x3F) as u8 | TAG_CONT;
    }
    true
}
Again, not sure if this is in scope, but I wanted to bring it up in case such a use case makes sense to integrate.
@daschl you can deal with the offset by slicing:
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    let slice = &mut array[offset..];
    if ch.len_utf8() > slice.len() {
        return false
    }
    ch.encode_utf8(slice);
    true
}
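One edge case is worth spelling out: `&mut array[offset..]` panics when `offset` is greater than `array.len()`, whereas the hand-rolled version earlier in the thread returns `false` in that situation. The sketch below is a free-function variant (the name `write_char_at` and the argument order are assumptions made for illustration) that keeps the slicing approach but uses `get_mut` so an out-of-range offset is reported as `false` as well.

```rust
/// Writes the UTF-8 encoding of `ch` into `array` starting at `offset`.
/// Returns false, leaving `array` untouched, when the char does not fit.
/// `write_char_at` is a hypothetical name used only for this sketch.
fn write_char_at(array: &mut [u8], offset: usize, ch: char) -> bool {
    // `get_mut` yields None when `offset` is past the end of the array,
    // where plain indexing with `offset..` would panic.
    let slice = match array.get_mut(offset..) {
        Some(slice) => slice,
        None => return false,
    };
    if ch.len_utf8() > slice.len() {
        return false;
    }
    // The length check above guarantees this call cannot panic.
    ch.encode_utf8(slice);
    true
}
```

Whether an out-of-range offset should be a `false` result or a panic is a design choice; this sketch simply mirrors the behaviour of the copied stdlib code.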
@alexcrichton All of your suggestions are good. I don't know if this data point is interesting, but ArrayString ended up copying this code too. When I shaped it after how it's used it ends up looking like
daschl commented Oct 10, 2016
@SimonSapin ah, good to know, thanks! Let's hope it gets stable soon then.
I would vote for no panic. That also simplifies the bounds checking (the impl linked above has all bounds checks elided, since |
@rfcbot fcp merge These methods have been around for a while, I propose we merge!
er, stabilize* |
rfcbot commented Nov 1, 2016
Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams: Concerns:
Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
It's not obvious to me that the current API is the one we want since there has been ongoing discussion. @bluss recently suggested it should not panic, and the current implementation does panic. |
@rfcbot concern panic-vs-not-panic |
I'm personally convinced by @SimonSapin's points above, which is that there is a statically known size (4 and 2) to encode all utf8 and utf16 characters. Callers basically always need to pass in that size buffer, so it seems more like a programmer error if you pass in a small buffer than a runtime error that should be handled. |
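To make the "statically known size" point concrete, here is a small sketch using the stabilized slice-based methods. The helper name `encoded_lengths` is hypothetical. It encodes into a 4-byte and a 2-unit stack buffer, which is enough for any `char`, so the panic on a too-small buffer can never trigger here.

```rust
/// Returns (UTF-8 length in bytes, UTF-16 length in code units) for `c`.
/// `encoded_lengths` is a hypothetical helper for illustration.
fn encoded_lengths(c: char) -> (usize, usize) {
    // 4 bytes and 2 code units are the maximum any char can need.
    let mut utf8 = [0u8; 4];
    let mut utf16 = [0u16; 2];
    let n8 = c.encode_utf8(&mut utf8).len();
    let n16 = c.encode_utf16(&mut utf16).len();
    (n8, n16)
}
```

The results agree with `char::len_utf8` and `char::len_utf16`, which a caller can use instead when an exactly sized buffer is preferred.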
To reiterate: you can use 2 or 4 to keep things simple or to use a statically-sized array, but you can also use |
I don't want to cause a gridlock, it was just what my perspective was in that moment. The fixed capacity use case is more uncommon, and the motivation for panic is following the usual conventions, so it certainly makes sense. |
OK, it sounds like there's no major opposition to the api as is. |
@rfcbot resolved panic-vs-not-panic |
Apart from the panicking, I'm a bit confused right now about what the actual API/signature is going to be. The one that returns
Yeah:
fn encode_utf8(self, dst: &mut [u8]) -> &mut str
and
fn encode_utf16(self, dst: &mut [u16]) -> &mut [u16]
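As an illustration of how these two signatures read at a call site, here is a brief sketch. The wrapper names `utf8_bytes` and `utf16_units` are hypothetical; only `encode_utf8` and `encode_utf16` come from the standard library.

```rust
/// Collects the UTF-8 bytes of `c` (hypothetical helper).
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    // The returned `&mut str` borrows the prefix of `buf` that was written.
    let s: &mut str = c.encode_utf8(&mut buf);
    s.as_bytes().to_vec()
}

/// Collects the UTF-16 code units of `c` (hypothetical helper).
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    // The returned slice holds one unit, or two for a surrogate pair.
    c.encode_utf16(&mut buf).to_vec()
}
```

Returning a sub-slice of the caller's buffer means no allocation happens inside the encoding methods themselves; the `Vec` here exists only to hand the result back from the helper.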
Alright! |
rfcbot commented Nov 12, 2016
psst @alexcrichton, I wasn't able to add the |
alexcrichton added the final-comment-period label on Nov 12, 2016
rfcbot commented Nov 22, 2016
The final comment period is now complete. |
alexcrichton commented Aug 13, 2015
This is a tracking issue for the unstable unicode feature and the char::encode_utf{8,16} methods.
The interfaces here are a little wonky but are done for performance. It's not clear whether these need to be exported or not or if there's a better method to do so through iterators.