Tracking issue for Read::chars #27802

Closed
alexcrichton opened this Issue Aug 13, 2015 · 88 comments

alexcrichton (Member) commented Aug 13, 2015

This is a tracking issue for the deprecated std::io::Read::chars API.

eminence (Contributor) commented Aug 25, 2015

It would be nice if std::io::Chars was an Iterator<Item=char>, just like std::str::Chars. I don't have a proposal for how decoding errors should be handled, though.


gyscos commented Sep 10, 2015

If we want to make std::io::Chars a simple iterator on char, a solution is to have it return None on a UTF-8 error, and set an error flag in the Chars itself (with an associated was_valid_utf8() method or something). An empty sequence is considered valid UTF-8.
It does make error detection slightly less convenient, as writing for c in reader.chars() does not keep the Chars for later verification.
Here is an example use-case where we try to recover after an invalid sub-sequence:

let mut chars = reader.chars();
loop {
    // Iterate by reference so `chars` can still be inspected after the loop.
    for c in chars.by_ref() {
        // ...
    }
    if chars.was_valid_utf8() { break; }
    println!("Encountered invalid byte sequence.");
}

Or it could provide a more informative message, similar to the current CharsError.
Maybe this could apply to the other adapters as well? Or is this a bad pattern?

On the other hand, it is not so difficult to treat the Result item explicitly, or to wrap the current Chars to get the behavior I described (an unwrapper wrapping, interesting notion), so maybe the current situation is acceptable as is.


alexcrichton (Member) commented Nov 4, 2015

Nominating for 1.6 discussion


alexcrichton (Member) commented Nov 5, 2015

🔔 This issue is now entering its cycle-long final comment period for deprecation 🔔

These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.


abonander (Contributor) commented Nov 5, 2015

.chars() is a nice operation to have, it'd be a shame to have to pull in a separate crate just to get such a useful iterator. The error type doesn't seem problematic; you'd have to handle an io::Error anyways.

Perhaps a .chars_lossy() iterator that yields the UTF-8 replacement character on a UTF-8 error and stops on the first io::Error.
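A minimal sketch of what such a chars_lossy adapter could look like, wrapping the era's unstable io::Chars iterator and assuming its Result<char, CharsError> item type; the CharsLossy type and the approach are hypothetical, not an actual std API:

#![feature(io)] // io::CharsError is unstable at the time of writing

use std::io::CharsError;

// Invalid UTF-8 becomes U+FFFD; the first I/O error ends the iteration.
struct CharsLossy<I> {
    inner: I,
    done: bool,
}

impl<I> Iterator for CharsLossy<I>
where
    I: Iterator<Item = Result<char, CharsError>>,
{
    type Item = char;

    fn next(&mut self) -> Option<char> {
        if self.done {
            return None;
        }
        match self.inner.next() {
            Some(Ok(c)) => Some(c),
            // Decoding error: substitute the replacement character and keep going.
            Some(Err(CharsError::NotUtf8)) => Some('\u{FFFD}'),
            // Underlying I/O error: stop at the first one.
            Some(Err(CharsError::Other(_))) => {
                self.done = true;
                None
            }
            None => None,
        }
    }
}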


SimonSapin (Contributor) commented Nov 5, 2015

I’m in favor of stabilizing Read::chars eventually, but it’s not ready yet:

It’s unstable because

the semantics of a partial read/write of where errors happen is currently unclear and may change

(The same would apply to chars_lossy.)

This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630

Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only Read (as opposed to, say, BufRead), you can’t put a byte "back" in the stream after reading it.

Here are some failing tests.

        let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert_eq!(chars.next().unwrap().unwrap(), 'b');
        assert_eq!(chars.next().unwrap().unwrap(), 'c');
        assert!(chars.next().is_none());

        let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert!(chars.next().is_none());

        let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert!(chars.next().is_none());

I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from str::from_utf8, which I’m not really happy with. (That many more tests would need to be added.) I’ll try to think of some way to have a more basic decoder that can be used by both.


sfackler (Member) commented Nov 5, 2015

Moving chars to BufRead does not seem unreasonable.


abonander (Contributor) commented Nov 5, 2015

@sfackler I concur. I was thinking the exact same thing. In fact, it is unfortunate that Read::bytes() is already stabilized because, like chars(), it is almost always preferable to have it on a buffered source. A lot of the Read types really do not tolerate small, frequent reads well (#28073)


alexcrichton (Member) commented Nov 6, 2015

While I agree that chars is useful, questions like those brought up by @SimonSapin make me think that it's too weighty/controversial to be in the standard library. I'm not sure that this is really ever going to be seriously used either due to the performance concerns, and I agree that bytes is a bit of a misnomer in that it's super non-performant much of the time as well.


SimonSapin (Contributor) commented Nov 6, 2015

How does bytes perform when used with BufReader<_>?


abonander (Contributor) commented Nov 6, 2015

@SimonSapin there were some perf numbers reading from a file with bytes in #28073. The difference is pretty significant.


abonander (Contributor) commented Nov 6, 2015

I think chars would work pretty well if implemented on top of a buffer. It doesn't have to consume any bytes it doesn't need to, so no unnecessary loss of data.


softprops commented Nov 6, 2015

I may be doing something hugely inefficient here, but I really needed Read::chars for a crate I was working on that attempted to produce an iterator of rustc-serialize JSON items from a Read stream. I had to fall back on vendoring most of the source around Read::chars as a workaround to get it working on stable a while back. It was almost a requirement, as the rustc-serialize JSON builder requires an iterable of chars. It would be really nice to have this stabilized, even if under some method name that implied lossiness.


alexcrichton (Member) commented Nov 6, 2015

@SimonSapin

I would expect it to perform much better than on a raw File because the number of syscalls issued is far less, but on the other hand it is likely much slower than another equivalent operation such as mmap + iterate the bytes via a slice.


@cybergeek94

Yeah I suspect the performance isn't abysmal, but when compared to str::chars it's likely still much slower (just more stuff that has to be checked during I/O)


@softprops

Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string". For example a json deserializer may want to take an iterator of characters, but the actual iterator does something like BufReader where it reads a huge string, then literally uses str::chars, only reading a new string once it's exhausted (handling the boundary).


gyscos commented Nov 6, 2015

@alexcrichton
Wouldn't a chars method on a BufReader-backed Read serve this exact purpose? It reads a big slice of bytes, hopefully valid UTF-8, which is exactly what a string would be; shouldn't iterating chars over those bytes then cost about the same as using str::chars?


softprops commented Nov 6, 2015

Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string"

In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, where JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string.


SimonSapin (Contributor) commented Nov 6, 2015

@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs


alexcrichton (Member) commented Nov 6, 2015

@gyscos hm yes, I guess it would! That would definitely mean that chars could only be on BufReader.

@softprops in theory you can transform an iterator of &str slices into an iterator of char values, which is kinda what BufReader would be doing
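A sketch of that transformation, assuming the chunk producer already splits its &str slices on UTF-8 boundaries (which is exactly the hard part a BufReader-based implementation would have to handle); chars_of is a hypothetical helper, not a std API:

fn chars_of<'a>(chunks: impl Iterator<Item = &'a str> + 'a) -> impl Iterator<Item = char> + 'a {
    // Flatten each string chunk into its chars.
    chunks.flat_map(str::chars)
}

All of the interesting work then lives in producing chunks that end on char boundaries.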


gyscos commented Nov 6, 2015

So a solution is to move chars to BufRead, as @sfackler mentioned. I was worried this would force us to needlessly wrap some Readers, but I just noticed Cursor<Vec<u8>> and Stdin already provide BufRead access.
But all of that is about the performance aspect.
Concerning error handling, BufRead::lines returns an iterator over io::Result<String> to properly propagate errors, including invalid UTF-8 (reported as ErrorKind::InvalidData). Couldn't chars do the same, instead of using a custom error type?


alexcrichton (Member) commented Nov 6, 2015

Yeah, I have a feeling that would be sufficient. There's more fine-grained error information we could give (such as the bytes that were read, if any), but iterators like lines are already lossy in that you lose access to the underlying data if there's an error, so it's probably not too bad.

It does raise a question, though: if chars exists only on BufRead for performance reasons, why doesn't bytes exist only on BufRead as well? Just a minor inconsistency.


sfackler (Member) commented Nov 6, 2015

I think the reasoning to move chars to BufRead is for correctness wrt @SimonSapin's comment - to be able to peek at the next byte without consuming it.


abonander (Contributor) commented Nov 7, 2015

The performance issue being alleviated is only a beneficial side effect; bytes doesn't require a buffer for correctness.


alexcrichton (Member) commented Nov 7, 2015

Hm, I don't think this is a correctness problem that can be solved by "just moving to BufRead"; it's always possible that the buffer is 1 byte in size, so it behaves the same as reading one byte at a time. Basically, BufRead just reduces the number of calls to read, but it essentially always has the same problems w.r.t. partial reads, short reads, and consuming data.


gyscos commented Nov 7, 2015

@SimonSapin was saying that BufReader could potentially put things back in the buffer, but it's not actually exposed by the BufRead trait.


sfackler (Member) commented Nov 7, 2015

You don't put things back in the buffer, you just don't call consume on things you don't want to, right?
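A tiny sketch of that pattern, using only the stable BufRead API (peek_byte is an illustrative helper, not a std method):

use std::io::{self, BufRead};

fn peek_byte<R: BufRead>(reader: &mut R) -> io::Result<Option<u8>> {
    let buf = reader.fill_buf()?;  // look at the buffered bytes without consuming them
    Ok(buf.first().copied())       // the caller decides whether to call reader.consume(1)
}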


abonander (Contributor) commented Nov 7, 2015

it's always a possibility that the buffer is 1 byte in size so it behaves the same as reading one byte at a time.

If only BufRead had an API to force an early read into the buffer... but I digress. I closed that RFC because I didn't have the time or the energy to argue for it.

The iterator can always consume the last bytes out of the buffer to force it to read more. If the next read is too short (a partial read returning < 4 bytes? honestly, does that even happen?), it can be considered an error just like if we hit EOF while expecting more bytes.


gyscos commented Nov 10, 2015

Compared to a Read, the BufRead makes it at least possible to be correct: since it should only consume the longest correct subsequence, it could read the bytes one by one and consume them only when they keep the subsequence valid. Only, as @SimonSapin said, this would mean duplicating some code for the checks.
So it is possible to implement the correct behaviour for BufRead (while it is not for Read), but it is inconvenient if the reader has too short a buffer, or only returns too short reads.
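A sketch of that byte-by-byte approach on top of BufRead, assuming str::from_utf8 plus today's Utf8Error::error_len (which returns None when the input merely ends too early). It consumes only the bytes of the char it decodes, reports invalid input as ErrorKind::InvalidData, and for simplicity gives up on sequences that straddle a fill_buf boundary; read_char is an illustrative helper, not a std method:

use std::io::{self, BufRead, ErrorKind};
use std::str;

fn read_char<R: BufRead>(reader: &mut R) -> io::Result<Option<char>> {
    let (c, len) = {
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            return Ok(None); // clean EOF
        }
        let mut found = None;
        // A char is at most 4 bytes: try successively longer prefixes.
        for len in 1..=buf.len().min(4) {
            match str::from_utf8(&buf[..len]) {
                Ok(s) => {
                    found = Some((s.chars().next().unwrap(), len));
                    break;
                }
                // error_len() == None means "incomplete": try one more byte.
                Err(e) if e.error_len().is_none() => continue,
                // Anything else is definitely invalid UTF-8; note that the bad
                // bytes are left unconsumed in the buffer.
                Err(_) => return Err(io::Error::new(ErrorKind::InvalidData, "invalid UTF-8")),
            }
        }
        match found {
            Some(pair) => pair,
            // The sequence continues past this buffer chunk (the hard case).
            None => return Err(io::Error::new(ErrorKind::UnexpectedEof, "incomplete UTF-8 sequence")),
        }
    };
    reader.consume(len); // consume exactly the bytes of the decoded char
    Ok(Some(c))
}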


sfackler (Member) commented Nov 10, 2015

The too-short read problem is independent of using a BufRead or a Read. The underlying Read could return data 1 byte at a time itself if it was feeling particularly pathological. A BufReader that only has 1 byte remaining in its buffer is only going to return that one byte in the next call.


gyscos commented Nov 10, 2015

I'm not sure I understand the too-short read problem.

If the reader stops returning anything in the middle of a codepoint, it's obviously an error.
If it doesn't, we can read the bytes from the BufRead one by one, and consume them until we find an invalid byte (which we don't consume). This works even with a BufReader using a 1-byte buffer.
This is completely impossible with a simple Read, since we must consume each byte as we read it, preventing us from leaving the first invalid byte untouched.
In this regard, BufRead vs Read does change what is possible to achieve.


sfackler (Member) commented Nov 10, 2015

Yeah, I was commenting on "but it is inconvenient if the reader has too short a buffer, or only returns too short reads." That's just a thing you have to deal with.


gyscos commented Nov 10, 2015

I see.

To recap, concerning Chars, the options are (apart from taking it out of tree):

  1. Not follow the UTF-8 best practices on short reads, and either:
    i. sometimes consume too much from the BufRead in order to force a new read, or
    ii. return an error on a short read.
  2. Follow the UTF-8 best practices, with proper byte-by-byte verification, by either:
    i. duplicating code from str::from_utf8 (actually str::run_utf8_validation_iterator) and using it from here, or
    ii. sharing code with str, by:
      a. exposing code from the str module (making it public?) - this will need some modification in the str module, or
      b. moving the Chars functionality to the str module (it is basically a streamed UTF-8 decoder; it could be more general than just Read?).

Note that not following the UTF-8 best practices means BufRead does not add any correctness, "just" some performance improvements.

I like option 2.ii.a, but maybe it's too much of an impact for this Chars problem?


BurntSushi (Member) commented Dec 1, 2015

Given the number of details here, I think stabilizing a chars method on BufRead/Read should probably require an RFC, especially if we start messing with the str module API. It seems like that makes it (and the other methods in this tracking issue) an ideal candidate to evolve out of tree.


alexcrichton (Member) commented Dec 3, 2015

The libs team discussed this during triage today and the conclusion was to deprecate tee and broadcast while leaving chars unstable. Further action there can hold off on an RFC.


ArtemGr (Contributor) commented Dec 13, 2015

These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.

So is there a crate? Sorry if I missed one.


taralx (Contributor) commented Jun 21, 2016

I still have one major objection to #33801 that I've noted in that PR.


jwilm commented Jul 1, 2016

My concern with #33801 relates to partial reads. As it stands, there's no way to know how much data was consumed in the case of a Read source like &[u8]. Some ways to fix that are: returning the number of consumed bytes in Utf8Error::IncompleteUtf8 (not great), returning the number of bytes left over in that variant, or returning the specific leftover bytes in that variant.


jwilm commented Jul 1, 2016

After thinking on InvalidUtf8 and IncompleteUtf8 errors some more, I've come to the conclusion they need to include the problematic bytes. This nicely handles the case of a partial read when using a &[u8] (a consumer can just stick the bytes in the front of the buffer before reading again), and an InvalidUtf8 could be handled however the consumer sees fit.
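A sketch of that carry-over step, assuming the caller tracks how many trailing bytes of the buffer were left undecoded; refill and its parameters are illustrative, not an existing API:

use std::io::{self, Read};

// Assumes the `leftover` undecoded bytes currently sit at the very end of `buf`.
fn refill<R: Read>(reader: &mut R, buf: &mut [u8], leftover: usize) -> io::Result<usize> {
    let len = buf.len();
    // Move the undecoded tail to the front of the buffer...
    buf.copy_within(len - leftover..len, 0);
    // ...then append fresh data after it.
    let n = reader.read(&mut buf[leftover..])?;
    Ok(leftover + n) // bytes now available for decoding
}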


Manishearth (Member) commented Jan 5, 2017

🔔 This issue is now entering its final comment period 🔔

@alexcrichton Looks like this was forgotten? Anything that needs to be done here to stabilize this?


jwilm commented Jan 5, 2017

I've published a crate called utf8parse which has a dramatically different API for parsing a &[u8] as UTF-8. I don't believe it's a complete solution to the problems discussed here, but perhaps showing it here may be helpful. Let me give you a quick summary of the utf8parse API, and then I'll compare it with the issues being discussed with the Utf8Chars API.

There are two types in utf8parse, a Parser and a Receiver. First, the Receiver type looks like this:

/// Handles codepoint and invalid sequence events from the parser.
pub trait Receiver {
    /// Called whenever a codepoint is parsed successfully
    fn codepoint(&mut self, char);

    /// Called when an invalid_sequence is detected
    fn invalid_sequence(&mut self);
}

After creating a Parser, it's only got one useful method:

fn advance<R>(&mut self, receiver: &mut R, byte: u8)
    where R: Receiver

One byte is pushed at a time into the parser, and occasionally codepoints/errors are provided to the Receiver. Driving the parser looks like this:

let bytes = bytes_from_somewhere();
let mut receiver = make_my_receiver();
let mut parser = Parser::new();

for &byte in &bytes {
    parser.advance(&mut receiver, byte);
}

Not as nice as the Utf8Chars iterator, but it has some advantages. First, the code driving the parser can know exactly where an error is encountered. For example, it would be possible to keep track of indices when a valid codepoint is encountered and when an invalid codepoint is identified. In this way, it's possible to know the bad bytes. Additionally, if the end of the buffer is reached without receiving a codepoint at the same time, the incomplete bytes would be known based on the byte index of the last valid codepoint.

In summary, this solves including bytes in the Incomplete/InvalidUtf8 error types at the cost of a non-iterator API. It separates out fetching bytes from some Read type and actually parsing those bytes as UTF-8.

Maybe it makes sense for the standard library to include a low level UTF-8 parser like this that can be tailored for different situations - instead of this use-case specific iterator based solution.


alexcrichton (Member) commented Jan 6, 2017

@Manishearth that was basically 6 months ago at this point, and unfortunately I've forgotten the context of this in the interim.


Manishearth (Member) commented Jan 6, 2017

cc @rust-lang/libs reviving this


derekdreery (Contributor) commented Jan 6, 2017

Does this sound correct?:

The issue could be summarised as: how to handle reading chars over a data stream, specifically how to handle incomplete UTF-8 byte sequences and invalid bytes (w.r.t. UTF-8).

It's hard because when you encounter an incomplete sequence of bytes that may become valid with more data, it could either be an error or simply mean you need to read more bytes (it's ambiguous).

If you know it's incomplete, you may want to call Read again and retry with the incomplete part prepended, but if it's incomplete because something has errored, you want to return the error along with the offending bytes.

It seems the consensus is that for both errors and incomplete bytes, you return a value whose enum variant says whether it is an error or a possibly incomplete sequence, along with the bytes. It's then the responsibility of a higher-level iterator to decide how to handle these cases (as not all use cases will want to handle them the same way).
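Something along these lines, perhaps (the names are illustrative, not an actual std API):

enum CharsReadError {
    /// Bytes that can never begin a valid UTF-8 sequence.
    Invalid(Vec<u8>),
    /// A prefix that might still become valid once more input arrives.
    Incomplete(Vec<u8>),
    /// An error from the underlying reader.
    Io(std::io::Error),
}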


lotabout added a commit to lotabout/skim that referenced this issue Jan 19, 2017

close #49, compile on stable
Previously, skim relied on nightly Rust for `io::chars`.
Now it uses the utf8parse crate instead.
Check rust-lang/rust#27802 (comment)

brookst added a commit to brookst/skim that referenced this issue Jan 19, 2017

Replace unstable chars usage
Use private char iterator as done in kkawakam/rustyline#38 while waiting
for stabilisation of the chars method per rust-lang/rust#27802
This removes the need for `#[feature(io)]` letting skim compile on rust
stable.

SimonSapin referenced this issue Feb 19, 2017: Tracking issue for 1.0.0 tracking issues #39954 (open)
SimonSapin (Contributor) commented Mar 4, 2017

TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use.

I’m in favor of deprecating and eventually removing this with no in-std replacement.


I think that anything that looks at one u8 or one char at a time is gonna have abysmal performance. Instead we probably want &str slices that reference fragments of some [u8] buffer.

I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this:

pub fn decode_utf8(input: &[u8]) -> DecodeResult { /* ... */ }

pub enum DecodeResult<'a> {
    Ok(&'a str),

    /// These three slices cover all of the original input.
    /// `decode` should be called again with the third one as the new input.
    Error(&'a str, InvalidSequence<'a>, &'a [u8]),

    Incomplete(&'a str, IncompleteChar),
}

pub struct InvalidSequence<'a>(pub &'a [u8]);

pub struct IncompleteChar {
    // Fields are private. They include a [u8; 4] buffer.
}

impl IncompleteChar {
    pub fn try_complete<'char, 'input>(&'char mut self, mut input: &'input [u8])
                                       -> TryCompleteResult<'char, 'input> { /* ... */ }
}

pub enum TryCompleteResult<'char, 'input> {
    Ok(&'char str, &'input [u8]),  // str.chars().count() == 1
    Error(InvalidSequence<'char>, &'input [u8]),
    StillIncomplete,
}

It’s complicated. It requires the user to think about a lot of corner cases, especially around IncompleteChar. Explaining how to properly use it takes several paragraphs of docs.

We can hide some of the details with a stateful decoder:

pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }

impl Decoder {
    pub fn new() -> Self;

    pub fn decode<'decoder, 'input>(&'decoder mut self, &'input [u8])
                                    -> DecoderResult<'decoder, 'input>;

    /// Signal that there is no more input.
    /// The decoder might contain a partial `char` which becomes an error.
    pub fn end<'decoder>(&'decoder self) -> Result<(), InvalidSequence<'decoder>>;
}

/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
    /// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
    pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,

    /// Up to the first error, if any
    pub decoded: &'input str,

    /// Whether we did find an error
    pub error: Result<(), InvalidSequence<'input>>,

    /// Call `decoder.decode()` again with this, if non-empty
    pub remaining_input_after_error: &'input [u8],
}

/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]);

Even so, it’s very easy to misuse, for example by ignoring part of DecoderResult. Using a tuple instead of a struct makes it more visible when a field is ignored, but then we can’t name the fields to help explain which is which.

Either of these is complicated enough that I don’t think it belongs in libcore or libstd.


SimonSapin (Contributor) commented Mar 4, 2017

By the way, I’ve submitted #40212 to make something like the above (possibly specialized to one use case) easier to implement outside of std based on std::str::from_utf8. Doing so avoids re-implementing most of the decoding logic, and allows taking advantage of optimizations in std like #30740.


taralx (Contributor) commented Mar 5, 2017

I would support deprecating Read::chars in favor of seeing what the community can produce.


SimonSapin (Contributor) commented Mar 6, 2017

Another attempt turned out almost nice:

pub struct Decoder { 
    buffer: [u8; 4],
    /* ... */
}

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
    pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}

pub struct DecoderIter<'a> {
    decoder: &'a mut Decoder,
    /* ... */ 
}

impl<'a> Iterator for DecoderIter<'a> {
    type Item = Result<&'a str, &'a [u8]>;
}

Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator. (It is sometimes backed by buffer in the decoder.) The result needs to borrow the iterator, which means the Iterator trait can’t be implemented, and for loops can’t be used:

impl<'a> DecoderIter<'a> {
    pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
}
    let mut iter = decoder.next_chunk(input);
    while let Some(result) = iter.next() {
         // ...
    }

This compiles, but something like String::from_utf8_lossy(&[u8]) -> Cow<str> can’t be implemented on top of it because str fragments always borrow the short-lived decoder (and iterator), not only the original input.

We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no.

pub struct Decoder { /* ... */ }

impl Decoder {
    pub fn new() -> Self { /* ... */ }

    pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
                                        -> DecoderIter<'decoder, 'input> { /* ... */ }

    pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
                                        -> DecoderIter<'decoder, 'input> { /* ... */ }
}

pub struct DecoderIter<'decoder, 'input> { /* ... */ }

impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
    pub fn next<'buffer>(&'buffer mut self) 
        -> Option<Result<EitherLifetime<'buffer, 'input, str>,
                         EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}

pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
    Buffer(&'buffer T),
    Input(&'input T),
}

impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
    pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
        match *self {
            EitherLifetime::Input(x) => x,
            EitherLifetime::Buffer(x) => x,
        }
    }
}
taralx (Contributor) commented Mar 7, 2017

Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator.

Can you elaborate? I don't follow here.


SimonSapin (Contributor) commented Mar 7, 2017

@taralx

  • DecoderIter<'a> contains &'a mut Decoder.
  • Decoder contains a [u8; 4] buffer.
  • The DecoderIter::next method takes &'b mut self. This implies 'a: 'b ('a outlives 'b).
  • In some cases, the next method wants to return a borrow of the buffer with (simplified) std::str::from_utf8_unchecked(&self.decoder.buffer).
  • Since this borrow is going through &'b mut self, its lifetime cannot be longer than 'b. This conflicts with the 'a: 'b requirement. If we could somehow make that borrow (such as with unsafe), then &mut Decoder in DecoderIter would no longer be exclusive since there’s another borrow of part of it (the buffer).
  • The solution is to return &'b str instead of &'a str. But the Iterator trait cannot express that since the return type of next is Option<Self::Item>, and there is no way to include the lifetime of the iterator itself in the associated type Item.

Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change Result<&'a str, &'a [u8]> to Result<&str, &[u8]> (borrowing self) and make next an inherent method instead of an Iterator impl.


wilysword commented Sep 5, 2017

I tend to agree that this should be removed from std, though for a slightly different reason: I'd like the interface for "returning a stream of chars from an io::Reader" to work with any encoding, not just UTF-8. If the general purpose of io::Read is to represent a stream of arbitrary bytes, then some other trait should represent a stream of decoded characters (though for completeness, I would probably also have such a trait implement io::Read). This would be similar to how you can chain streams in other languages: Decoder(BufferedReader(FileReader("somefile.txt"))). Unless/until general encoding functionality lives in std, I don't see much benefit in such a half-hearted implementation.


SimonSapin (Contributor) commented Sep 5, 2017

hsivonen/encoding_rs#8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings.


SimonSapin (Contributor) commented Mar 30, 2018

The libs team discussed this and consensus was to deprecate the Read::chars method and its Chars iterator.

@rfcbot fcp close

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
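A minimal sketch of that recipe (not the utf-8 crate's API; for_each_str_chunk, the fixed 8 KiB buffer, and the lossy replacement of invalid bytes are illustrative choices, and it leans on today's slice::copy_within for the carry-over): decode each chunk with str::from_utf8, hand over the prefix reported by valid_up_to(), replace definitely-invalid bytes, and carry a possibly-incomplete tail (error_len() == None) into the next read.

use std::io::{self, Read};
use std::str;

fn for_each_str_chunk<R, F>(mut reader: R, mut f: F) -> io::Result<()>
where
    R: Read,
    F: FnMut(&str),
{
    let mut buf = [0u8; 8192];
    let mut filled = 0; // number of bytes currently held in `buf`
    loop {
        let n = reader.read(&mut buf[filled..])?;
        let eof = n == 0;
        filled += n;
        let mut consumed = 0;
        loop {
            match str::from_utf8(&buf[consumed..filled]) {
                Ok(s) => {
                    if !s.is_empty() {
                        f(s);
                    }
                    consumed = filled;
                    break;
                }
                Err(e) => {
                    // Hand over the prefix that did decode.
                    let valid = e.valid_up_to();
                    if valid > 0 {
                        f(str::from_utf8(&buf[consumed..consumed + valid]).unwrap());
                    }
                    consumed += valid;
                    match e.error_len() {
                        // Definitely invalid bytes: replace and skip them.
                        Some(len) => {
                            f("\u{FFFD}");
                            consumed += len;
                        }
                        // The tail *might* be the start of a char that continues
                        // in the next chunk: keep it and read more.
                        None => break,
                    }
                }
            }
        }
        // Carry the undecoded tail (at most 3 bytes) to the front of the buffer.
        buf.copy_within(consumed..filled, 0);
        filled -= consumed;
        if eof {
            if filled > 0 {
                f("\u{FFFD}"); // an incomplete char at EOF can never be completed
            }
            return Ok(());
        }
    }
}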


rfcbot commented Mar 30, 2018

Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.


rfcbot commented Apr 4, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔


rfcbot commented Apr 14, 2018

The final comment period is now complete.


kennytm added a commit to kennytm/rust that referenced this issue Apr 24, 2018

SimonSapin (Contributor) commented Apr 24, 2018

Deprecated in #49970
