Tracking issue for Read::chars #27802
Comments
alexcrichton
added
A-io
T-libs
B-unstable
labels
Aug 13, 2015
|
It would be nice if |
gyscos
commented
Sep 10, 2015
|
If we want to make

let mut chars = reader.chars();
loop {
    // Borrow the iterator so it can still be queried after the inner loop.
    for c in &mut chars {
        // ...
    }
    if chars.was_valid_utf8() { break; }
    println!("Encountered invalid byte sequence.");
}

Or it could provide a more informative message, similar to the current CharsError. On the other hand, it is not so difficult to treat the |
|
Nominating for 1.6 discussion |
alexcrichton
added
the
I-nominated
label
Nov 4, 2015
|
These functions seem like excellent candidates to move out-of-tree into an |
alexcrichton
added
final-comment-period
and removed
I-nominated
labels
Nov 5, 2015
|
Perhaps a |
|
I’m in favor of stabilizing It’s unstable because
(The same would apply to This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630

Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only

Here are some failing tests.

let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert_eq!(chars.next().unwrap().unwrap(), 'b');
assert_eq!(chars.next().unwrap().unwrap(), 'c');
assert!(chars.next().is_none());
let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());
let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
let mut chars = buf.chars();
assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
assert_eq!(chars.next().unwrap().unwrap(), 'a');
assert!(chars.next().is_none());

I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from |
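For readers who want to see the "stop at the first unexpected byte" rule in action, here is a minimal sketch on stable Rust. It does not use `Read::chars`; it assumes that `std::str::from_utf8` and `String::from_utf8_lossy` are acceptable stand-ins for illustrating the rule, and `first_error` is a hypothetical helper name. Only the first test input from the comment above is exercised.

```rust
use std::str;

/// Hypothetical helper: returns (valid_up_to, error_len) for the first
/// decoding error in `input`, or `None` if the input is valid UTF-8.
fn first_error(input: &[u8]) -> Option<(usize, Option<usize>)> {
    match str::from_utf8(input) {
        Ok(_) => None,
        Err(e) => Some((e.valid_up_to(), e.error_len())),
    }
}

fn main() {
    // Same bytes as the first failing test above: the start of a
    // 4-byte sequence (F0 9F) followed directly by ASCII.
    let input = b"\xf0\x9fabc";

    // The error covers only the two bytes that could have started a char.
    assert_eq!(first_error(input), Some((0, Some(2))));

    // Decoding resumes at 'a', so no valid character is swallowed.
    assert_eq!(String::from_utf8_lossy(input), "\u{FFFD}abc");

    // At the very end of a chunk the same prefix is reported as
    // incomplete (error_len is None) rather than as a definite error.
    assert_eq!(first_error(b"\xf0\x9f"), Some((0, None)));
}
```

The `None` case is what makes chunk boundaries hard for a `Read`-based iterator: the decoder cannot tell an incomplete sequence from an invalid one until it has seen more input or reached the end of the stream.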
|
Moving |
|
While I agree that |
|
How does |
|
@SimonSapin there were some perf numbers reading from a file with |
|
I think |
softprops
commented
Nov 6, 2015
|
I may be doing something hugely inefficient here, but I really needed read.chars for a crate I was working on that attempted to produce an iterator of rustc serialize json items from a read stream. I had to fall back on vendoring most of the source around read.chars as a workaround to get it working on stable a while back. It was almost a requirement, as the rustc builder for json requires an iterable of chars. It would be really nice to have this stabilized, even if in some form of method name that implied lossiness. |
|
I would expect it to perform much better than on a raw @cybergeek94 Yeah I suspect the performance isn't abysmal, but when compared to Yeah that's actually a case where I believe |
gyscos
commented
Nov 6, 2015
|
@alexcrichton |
softprops
commented
Nov 6, 2015
In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, in which JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string. |
|
@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs |
|
@gyscos hm yes, I guess it would! That would definitely mean that @softprops in theory you can transform an iterator of |
gyscos
commented
Nov 6, 2015
|
So a solution is to move |
|
Yeah I have a feeling that would be sufficient, there's more fine-grained error information we can give (such as the bytes that were read, if any), but iterators like It does raise a question though that if |
|
I think the reasoning to move |
|
The performance issue being alleviated is only a benevolent side-effect. |
|
Hm I don't think that this is a correctness problem that can be solved by "just moving to |
gyscos
commented
Nov 7, 2015
|
@SimonSapin was saying that |
|
Does this sound correct? The issue could be summarised as: how to handle reading chars over a data stream, specifically how to handle both incomplete UTF-8 byte sequences and incorrect bytes (w.r.t. UTF-8). It's hard because when you encounter a sequence of bytes that is invalid as it stands but may become valid with more data, it could either be an error or just mean that you need to read more bytes (it's ambiguous). If you know it's incomplete, you may want to call Read and try again with the incomplete part prepended, but if it's incomplete because something has errored, you want to return the error with the offending bytes. It seems the consensus is that for errors and incomplete bytes, you return a struct that contains an enum variant saying whether it is an error or a possibly incomplete set of bytes, along with the bytes. It's then the responsibility of a higher-level iterator to decide how to handle these cases (as not all use cases will want to handle them the same way). |
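As a rough illustration of that summary, the sketch below classifies a single chunk of bytes as valid, definitely invalid, or possibly incomplete, and hands the relevant bytes back to the caller. The `Chunk` type and `classify` function are hypothetical names invented here; nobody in this thread proposed this exact shape. It relies only on `std::str::from_utf8` and the `valid_up_to` and `error_len` methods of `Utf8Error`.

```rust
use std::str;

/// Hypothetical result type, named here for illustration only.
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// The whole input was valid UTF-8.
    Valid(&'a str),
    /// A valid prefix, then bytes that can never be valid, then the rest.
    Invalid { valid: &'a str, bad: &'a [u8], rest: &'a [u8] },
    /// A valid prefix, then a sequence that more input might complete.
    Incomplete { valid: &'a str, partial: &'a [u8] },
}

fn classify(input: &[u8]) -> Chunk<'_> {
    match str::from_utf8(input) {
        Ok(s) => Chunk::Valid(s),
        Err(e) => {
            let (valid, after) = input.split_at(e.valid_up_to());
            // The prefix was already validated, so this cannot fail.
            let valid = str::from_utf8(valid).expect("prefix is valid UTF-8");
            match e.error_len() {
                // A definite error: `len` bytes are invalid whatever follows.
                Some(len) => {
                    let (bad, rest) = after.split_at(len);
                    Chunk::Invalid { valid, bad, rest }
                }
                // Input ended mid-sequence: ambiguous until more bytes arrive.
                None => Chunk::Incomplete { valid, partial: after },
            }
        }
    }
}

fn main() {
    // "é" is C3 A9; cutting the chunk after C3 leaves an incomplete sequence.
    assert_eq!(
        classify(b"caf\xc3"),
        Chunk::Incomplete { valid: "caf", partial: b"\xc3" }
    );
    // FF never appears in UTF-8, so this is an error whatever follows.
    assert_eq!(
        classify(b"a\xffb"),
        Chunk::Invalid { valid: "a", bad: b"\xff", rest: b"b" }
    );
    assert_eq!(classify("café".as_bytes()), Chunk::Valid("café"));
}
```

A higher-level iterator could then decide what to do with each variant, for example prepending `partial` to the next read, or treating it as an error once the stream has ended.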
lotabout
added a commit
to lotabout/skim
that referenced
this issue
Jan 19, 2017
brookst
added a commit
to brookst/skim
that referenced
this issue
Jan 19, 2017
|
TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use. I’m in favor of deprecating and eventually removing this with no in- I think that anything that looks at one I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this:

pub fn decode_utf8(input: &[u8]) -> DecodeResult { /* ... */ }
pub enum DecodeResult<'a> {
Ok(&'a str),
/// These three slices cover all of the original input.
/// `decode` should be called again with the third one as the new input.
Error(&'a str, InvalidSequence<'a>, &'a [u8]),
Incomplete(&'a str, IncompleteChar),
}
pub struct InvalidSequence<'a>(pub &'a [u8]);
pub struct IncompleteChar {
// Fields are private. They include a [u8; 4] buffer.
}
impl IncompleteChar {
pub fn try_complete<'char, 'input>(&'char mut self, mut input: &'input [u8])
-> TryCompleteResult<'char, 'input> { /* ... */ }
}
pub enum TryCompleteResult<'char, 'input> {
Ok(&'char str, &'input [u8]), // str.chars().count() == 1
Error(InvalidSequence<'char>, &'input [u8]),
StillIncomplete,
}

It’s complicated. It requires the user to think about a lot of corner cases, especially around We can hide some of the details with a stateful decoder:

pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }
impl Decoder {
pub fn new() -> Self;
pub fn decode<'decoder, 'input>(&'decoder mut self, input: &'input [u8])
-> DecoderResult<'decoder, 'input>;
/// Signal that there is no more input.
/// The decoder might contain a partial `char` which becomes an error.
pub fn end<'decoder>(&'decoder self) -> Result<(), InvalidSequence<'decoder>>;
}
/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
/// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,
/// Up to the first error, if any
pub decoded: &'input str,
/// Whether we did find an error
pub error: Result<(), InvalidSequence<'input>>,
/// Call `decoder.decode()` again with this, if non-empty
pub remaining_input_after_error: &'input [u8],
}
/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]);

Even so, it’s very easy to misuse, for example by ignoring part of Either of these is complicated enough that I don’t think it belongs in |
|
I would support deprecating |
|
Another attempt turned out almost nice:

pub struct Decoder {
buffer: [u8; 4],
/* ... */
}
impl Decoder {
pub fn new() -> Self { /* ... */ }
pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}
pub struct DecoderIter<'a> {
decoder: &'a mut Decoder,
/* ... */
}
impl<'a> Iterator for DecoderIter<'a> {
type Item = Result<&'a str, &'a [u8]>;
}

Except it doesn’t work.

impl<'a> DecoderIter<'a> {
pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
}

let mut iter = decoder.next_chunk(input);
while let Some(result) = iter.next() {
// ...
}

This compiles, but something like We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no.

pub struct Decoder { /* ... */ }
impl Decoder {
pub fn new() -> Self { /* ... */ }
pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
-> DecoderIter<'decoder, 'input> { /* ... */ }
pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
-> DecoderIter<'decoder, 'input> { /* ... */ }
}
pub struct DecoderIter<'decoder, 'input> { /* ... */ }
impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
pub fn next<'buffer>(&'buffer mut self)
-> Option<Result<EitherLifetime<'buffer, 'input, str>,
EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}
pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
Buffer(&'buffer T),
Input(&'input T),
}
impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
match *self {
EitherLifetime::Input(x) => x,
EitherLifetime::Buffer(x) => x,
}
}
} |
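The explanation of why the `Iterator` version fails did not survive in the text above. A plausible reading is the usual limitation that `Iterator::Item` cannot borrow from the iterator itself, while a decoder with an internal `[u8; 4]` buffer sometimes needs to hand out a `&str` pointing into that buffer. The sketch below assumes that reading. `Scratch` and `lend` are hypothetical names, and the type is only a stand-in for the decoder's buffer, not a UTF-8 decoder.

```rust
use std::str;

/// Hypothetical stand-in for a decoder that owns a small scratch buffer.
struct Scratch {
    buffer: [u8; 4],
    len: usize,
}

impl Scratch {
    /// Stores one char's UTF-8 bytes in the buffer and lends them out.
    /// The returned `&str` borrows from `self`, so it must be dropped
    /// before `lend` can be called again.
    fn lend(&mut self, c: char) -> &str {
        self.len = c.encode_utf8(&mut self.buffer).len();
        str::from_utf8(&self.buffer[..self.len]).expect("encode_utf8 wrote valid UTF-8")
    }
}

fn main() {
    let mut scratch = Scratch { buffer: [0; 4], len: 0 };
    let mut out = String::new();
    for c in "aé€".chars() {
        // Each borrow ends within the loop body, so this compiles.
        out.push_str(scratch.lend(c));
    }
    assert_eq!(out, "aé€");

    // Keeping two results alive at once is rejected by the borrow checker:
    // let a = scratch.lend('a');
    // let b = scratch.lend('b'); // second mutable borrow of `scratch`
    // println!("{a}{b}");
}
```

An inherent method with this signature works in a `while let` loop, but it cannot be expressed as `Iterator::next`, because `Item` is a single type that cannot mention the lifetime of each `&mut self` borrow. That would explain why the workaround above needs two lifetimes and an enum to say which one a given slice borrows from.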
Can you elaborate? I don't follow here. |
Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change |
Mark-Simulacrum
removed
the
A-io
label
Jun 25, 2017
Mark-Simulacrum
added
the
C-tracking-issue
label
Jul 22, 2017
wilysword
commented
Sep 5, 2017
|
I tend to agree that this should be removed from |
|
hsivonen/encoding_rs#8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings. |
|
The libs team discussed this and consensus was to deprecate the

@rfcbot fcp close

Code that does not care about processing data incrementally can use

Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the |
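For code that does need to process UTF-8 incrementally, one possible shape is sketched below. It is an illustration, not the replacement endorsed in this thread: `decode_lossy` is a hypothetical function, the 4096-byte buffer is an arbitrary choice, and replacing invalid sequences with U+FFFD is an assumption about what the caller wants. It uses only `Read::read`, `std::str::from_utf8`, and the `valid_up_to` and `error_len` methods of `Utf8Error`.

```rust
use std::io::{self, Read};
use std::str;

/// Decodes `reader` chunk by chunk, passing each valid `&str` slice to
/// `sink` and substituting U+FFFD for every invalid sequence.
fn decode_lossy<R: Read>(mut reader: R, mut sink: impl FnMut(&str)) -> io::Result<()> {
    let mut buf = [0u8; 4096];
    let mut filled = 0; // bytes carried over from the previous read
    loop {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            // End of input: a leftover partial sequence becomes an error.
            if filled > 0 {
                sink("\u{FFFD}");
            }
            return Ok(());
        }
        filled += n;
        let mut start = 0;
        loop {
            match str::from_utf8(&buf[start..filled]) {
                Ok(s) => {
                    sink(s);
                    start = filled;
                    break;
                }
                Err(e) => {
                    let valid_end = start + e.valid_up_to();
                    sink(str::from_utf8(&buf[start..valid_end]).expect("validated prefix"));
                    match e.error_len() {
                        // Definitely invalid: substitute and keep decoding.
                        Some(len) => {
                            sink("\u{FFFD}");
                            start = valid_end + len;
                        }
                        // Possibly incomplete: wait for the next read.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        // Move the incomplete tail (at most 3 bytes) to the front.
        buf.copy_within(start..filled, 0);
        filled -= start;
    }
}

fn main() -> io::Result<()> {
    // Two chunks that split "é" (C3 A9) across a read boundary.
    let input = (&b"caf\xc3"[..]).chain(&b"\xa9 \xff!"[..]);
    let mut out = String::new();
    decode_lossy(input, |s| out.push_str(s))?;
    assert_eq!(out, "café \u{FFFD}!");
    Ok(())
}
```

The sketch propagates every I/O error, including `ErrorKind::Interrupted`, which a production loop would probably retry. When incremental processing is not needed, calling `Read::read_to_string` on the reader is far simpler.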
rfcbot
commented
Mar 30, 2018
•
|
Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams: No concerns currently listed. Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
rfcbot
added
the
proposed-final-comment-period
label
Mar 30, 2018
rfcbot
commented
Apr 4, 2018
|
|
rfcbot
added
final-comment-period
and removed
proposed-final-comment-period
labels
Apr 4, 2018
rfcbot
commented
Apr 14, 2018
|
The final comment period is now complete. |
SimonSapin
added a commit
to SimonSapin/rust
that referenced
this issue
Apr 14, 2018
SimonSapin
referenced this issue
Apr 14, 2018
Merged
Deprecate Read::chars and char::decode_utf8 #49970
SimonSapin
added a commit
to SimonSapin/rust
that referenced
this issue
Apr 15, 2018
kennytm
added a commit
to kennytm/rust
that referenced
this issue
Apr 24, 2018
|
Deprecated in #49970 |
alexcrichton commented Aug 13, 2015 • edited by Mark-Simulacrum
This is a tracking issue for the deprecated std::io::Read::chars API.