Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There's no good way to iterate over all newlines with BufRead #55743

Open
spease opened this issue Nov 7, 2018 · 9 comments
Open

There's no good way to iterate over all newlines with BufRead #55743

spease opened this issue Nov 7, 2018 · 9 comments
Labels
A-io Area: std::io, std::fs, std::net and std::path A-str Area: str and String T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@spease
Copy link

spease commented Nov 7, 2018

If you have files with a newline of '\r', the only way to handle them is to use split. But if you have files with a newline of '\n' or '\r\n', then you can use lines().

This is counterintuitive and problematic from the perspective of the BufRead interface - there's no way to use it to iterate over any of the three types. I'd typically expect that lines() would handle this transparently.

I'm not sure if this is the right place, but I don't want it to get lost because it cost me major hassle.

@Stargateur
Copy link
Contributor

That clearly specified in the doc:

The iterator returned from this function will yield instances of io::Result. Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.

https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines

So I don't agree about "This is counterintuitive".

Also this is only used on very old system according to wikipedia:

Commodore 8-bit machines (C64, C128), Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9

https://en.wikipedia.org/wiki/Newline#Representation

The only one "recent" is OS-9, I think it's not useful to add \r to lines() function.

@hellow554
Copy link
Contributor

Some embedded systems use \r for their newlines as well. Sad, but true.

But I would not mind to use split instead of lines TBH.

@estebank estebank added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Nov 7, 2018
@spease
Copy link
Author

spease commented Nov 8, 2018

@Stargateur Maybe it's an age thing, but I've thought of "\r", "\n", and "\r\n" as being the trio of possible newlines to be truly crossplatform.

@hellow554 The problem is that split('\r') means that I no longer split on '\n', so it won't work for files using that newline separator. And since BufRead only appears to provide either read_line (which doesn't cover '\r') or read_until (which only covers 1 byte character) there's no way to cover all the bases with the basic BufRead interface.

If I read the data into a string and then split, I can use a closure that allows me to solve the problem. However, then I can't provide an iterator chain as the function reading the file has ownership of the string, and the rest of the iterator chain uses &str. This approach also prevents true stream processing of the data (although in this case I didn't notice a practical difference when I modified the code in other ways to be more streaming and less read-buffer-write, so the ownership issue was the only dealbreaker).

The confusing part is that if you load the data into most advanced text editors, they'll silently split the newlines correctly.

And of course all of this is a lot of thought to put into just splitting a file based on newlines.

@Stargateur
Copy link
Contributor

Stargateur commented Nov 8, 2018

Well, you need a more low level feature, and BufRead actually have this feature:

use std::io::{BufRead, BufReader, Error};
use std::str;

fn main() -> Result<(), Error> {
    let mut f = BufReader::new(
        "Hello\nI'm a test\rThis is a second test\r\nand a third\n\rand the last one".as_bytes(),
    );

    loop {
        let length = {
            let buffer = f.fill_buf()?;
            let line_size = buffer
                .iter()
                .take_while(|c| **c != b'\n' && **c != b'\r')
                .count();
            if buffer.len() == 0 {
                break Ok(());
            }

            println!("{:?}", str::from_utf8(&buffer[..line_size]));

            line_size
                + if line_size < buffer.len() {
                    // we found a delimiter
                    if line_size + 1 < buffer.len() // we look if we found two delimiter
                    && buffer[line_size] == b'\r'
                    && buffer[line_size + 1] == b'\n'
                    {
                        2
                    } else {
                        1
                    }
                } else {
                    0
                }
        };

        f.consume(length);
    }
}

But that indeed require some work. You can also see that handle '\r' require the double of comparaison, to handle a very old and obsolete feature not required by 99% (random stats ;)) of use case that expensive.

@spease
Copy link
Author

spease commented Dec 25, 2018

That's an unacceptable amount of code to have to write for such a simple operation. Somebody could easily take one look at that and be immediately dissuaded from using the language. Consider Python's splitlines method:
https://docs.python.org/3/library/stdtypes.html#str.splitlines

Not only does it handle \r, but it also handles several additional line delimiter characters, including unicode. Even Rust's string lines() functions don't handle these:
https://doc.rust-lang.org/std/primitive.str.html#method.lines

Rust is supposed to be about correctness-first, not performance at the cost of cut corners and exposed footguns.

@Stargateur
Copy link
Contributor

Stargateur commented Dec 26, 2018

Rust is supposed to be about correctness-first, not performance at the cost of cut corners and exposed footguns.

Where did you read that ? Rust is intent to be safe and as fast as possible. Also if you want code in python code in it. Rust is not python. Every language has its pros and its cons. If you judge a language by the number of line you need to write, yes Rust is not the best candidate.

That's an unacceptable amount of code to have to write for such a simple operation.

Ah, just do a custum trait and add it. The fact that std don't have it builtin is not a big problem. You can copy my naif exemple. Just create your custom iterator, and you will do all of this with one line in you app code.

@spease
Copy link
Author

spease commented Dec 26, 2018

Where did you read that ? Rust is intent to be safe and as fast as possible. Also if you want code in python code in it. Rust is not python. Every language has its pros and its cons. If you judge a language by the number of line you need to write, yes Rust is not the best candidate.

I thought it was on the website, but I think the initial point of entry was a discussion. Possibly about parsing floats. But it looks like the website has changed, so this could do with some clarification. As this seems to be one of the few cases where safety and performance are mutually exclusive.

It's one thing to judge a language by the number of lines you have to write. It's quite another for it to look like it's cutting corners enough in the name of performance that you have to be careful about the sort of data you pass to functions in the standard library because they're designed to only work with the "usual" data. For instance, if you could pass "143.12" to a float, but "4235e-15" would simply get ignored.

This is exactly the kind of brick wall that deters people from using Rust. If you're on the common path it works fine, but deviate ever so slightly and suddenly you're stuck spending orders of magnitudes more time to completely rewrite part of the standard lib. For this to happen with something as mundane as splitting a function by newlines would be a red flag to somebody that the language is not mature enough to use for production software.

EDIT: I mean, thanks for writing the example, but if this were a third-party library then my boss would additionally want to check its license and that of its dependencies. Keep in mind that people coming from a C++ background have a developed instinct to flinch at the idea of pulling in dependencies because they expect it to be a constant hassle to build and integrate etc. And it also isn't as discoverable as the standard library, so it raises the possibility of somebody giving up then and there. It's amazing how many problems can be solved with a well-orchestrated google search sometimes.

@theronic
Copy link

Posted an SO question related to this.

@hellow554
Copy link
Contributor

hellow554 commented Mar 13, 2019

Posted an SO question related to this.

But why @theronic? This post is already clear enough IMHO.

@workingjubilee workingjubilee added A-io Area: std::io, std::fs, std::net and std::path A-str Area: str and String labels Jul 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-io Area: std::io, std::fs, std::net and std::path A-str Area: str and String T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants