There's no good way to iterate over all newlines with BufRead #55743

spease · 2018-11-07T07:41:09Z

If you have files with a newline of '\r', the only way to handle them is to use split. But if you have files with a newline of '\n' or '\r\n', then you can use lines().

This is counterintuitive and problematic from the perspective of the BufRead interface - there's no way to use it to iterate over any of the three types. I'd typically expect that lines() would handle this transparently.

I'm not sure if this is the right place, but I don't want it to get lost because it cost me major hassle.

Stargateur · 2018-11-07T11:44:41Z

That clearly specified in the doc:

The iterator returned from this function will yield instances of io::Result. Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.

https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines

So I don't agree about "This is counterintuitive".

Also this is only used on very old system according to wikipedia:

Commodore 8-bit machines (C64, C128), Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9

https://en.wikipedia.org/wiki/Newline#Representation

The only one "recent" is OS-9, I think it's not useful to add \r to lines() function.

hellow554 · 2018-11-07T12:51:55Z

Some embedded systems use \r for their newlines as well. Sad, but true.

But I would not mind to use split instead of lines TBH.

spease · 2018-11-08T03:29:54Z

@Stargateur Maybe it's an age thing, but I've thought of "\r", "\n", and "\r\n" as being the trio of possible newlines to be truly crossplatform.

@hellow554 The problem is that split('\r') means that I no longer split on '\n', so it won't work for files using that newline separator. And since BufRead only appears to provide either read_line (which doesn't cover '\r') or read_until (which only covers 1 byte character) there's no way to cover all the bases with the basic BufRead interface.

If I read the data into a string and then split, I can use a closure that allows me to solve the problem. However, then I can't provide an iterator chain as the function reading the file has ownership of the string, and the rest of the iterator chain uses &str. This approach also prevents true stream processing of the data (although in this case I didn't notice a practical difference when I modified the code in other ways to be more streaming and less read-buffer-write, so the ownership issue was the only dealbreaker).

The confusing part is that if you load the data into most advanced text editors, they'll silently split the newlines correctly.

And of course all of this is a lot of thought to put into just splitting a file based on newlines.

Stargateur · 2018-11-08T09:54:32Z

Well, you need a more low level feature, and BufRead actually have this feature:

use std::io::{BufRead, BufReader, Error};
use std::str;

fn main() -> Result<(), Error> {
    let mut f = BufReader::new(
        "Hello\nI'm a test\rThis is a second test\r\nand a third\n\rand the last one".as_bytes(),
    );

    loop {
        let length = {
            let buffer = f.fill_buf()?;
            let line_size = buffer
                .iter()
                .take_while(|c| **c != b'\n' && **c != b'\r')
                .count();
            if buffer.len() == 0 {
                break Ok(());
            }

            println!("{:?}", str::from_utf8(&buffer[..line_size]));

            line_size
                + if line_size < buffer.len() {
                    // we found a delimiter
                    if line_size + 1 < buffer.len() // we look if we found two delimiter
                    && buffer[line_size] == b'\r'
                    && buffer[line_size + 1] == b'\n'
                    {
                        2
                    } else {
                        1
                    }
                } else {
                    0
                }
        };

        f.consume(length);
    }
}

But that indeed require some work. You can also see that handle '\r' require the double of comparaison, to handle a very old and obsolete feature not required by 99% (random stats ;)) of use case that expensive.

spease · 2018-12-25T23:28:22Z

That's an unacceptable amount of code to have to write for such a simple operation. Somebody could easily take one look at that and be immediately dissuaded from using the language. Consider Python's splitlines method:
https://docs.python.org/3/library/stdtypes.html#str.splitlines

Not only does it handle \r, but it also handles several additional line delimiter characters, including unicode. Even Rust's string lines() functions don't handle these:
https://doc.rust-lang.org/std/primitive.str.html#method.lines

Rust is supposed to be about correctness-first, not performance at the cost of cut corners and exposed footguns.

Stargateur · 2018-12-26T07:23:54Z

Rust is supposed to be about correctness-first, not performance at the cost of cut corners and exposed footguns.

Where did you read that ? Rust is intent to be safe and as fast as possible. Also if you want code in python code in it. Rust is not python. Every language has its pros and its cons. If you judge a language by the number of line you need to write, yes Rust is not the best candidate.

That's an unacceptable amount of code to have to write for such a simple operation.

Ah, just do a custum trait and add it. The fact that std don't have it builtin is not a big problem. You can copy my naif exemple. Just create your custom iterator, and you will do all of this with one line in you app code.

spease · 2018-12-26T15:58:39Z

Where did you read that ? Rust is intent to be safe and as fast as possible. Also if you want code in python code in it. Rust is not python. Every language has its pros and its cons. If you judge a language by the number of line you need to write, yes Rust is not the best candidate.

I thought it was on the website, but I think the initial point of entry was a discussion. Possibly about parsing floats. But it looks like the website has changed, so this could do with some clarification. As this seems to be one of the few cases where safety and performance are mutually exclusive.

It's one thing to judge a language by the number of lines you have to write. It's quite another for it to look like it's cutting corners enough in the name of performance that you have to be careful about the sort of data you pass to functions in the standard library because they're designed to only work with the "usual" data. For instance, if you could pass "143.12" to a float, but "4235e-15" would simply get ignored.

This is exactly the kind of brick wall that deters people from using Rust. If you're on the common path it works fine, but deviate ever so slightly and suddenly you're stuck spending orders of magnitudes more time to completely rewrite part of the standard lib. For this to happen with something as mundane as splitting a function by newlines would be a red flag to somebody that the language is not mature enough to use for production software.

EDIT: I mean, thanks for writing the example, but if this were a third-party library then my boss would additionally want to check its license and that of its dependencies. Keep in mind that people coming from a C++ background have a developed instinct to flinch at the idea of pulling in dependencies because they expect it to be a constant hassle to build and integrate etc. And it also isn't as discoverable as the standard library, so it raises the possibility of somebody giving up then and there. It's amazing how many problems can be solved with a well-orchestrated google search sometimes.

theronic · 2019-03-13T14:31:17Z

Posted an SO question related to this.

hellow554 · 2019-03-13T15:35:58Z

Posted an SO question related to this.

But why @theronic? This post is already clear enough IMHO.

estebank added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Nov 7, 2018

workingjubilee added A-io Area: std::io, std::fs, std::net and std::path A-str Area: str and String labels Jul 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

There's no good way to iterate over all newlines with BufRead #55743

There's no good way to iterate over all newlines with BufRead #55743

spease commented Nov 7, 2018

Stargateur commented Nov 7, 2018

hellow554 commented Nov 7, 2018

spease commented Nov 8, 2018

Stargateur commented Nov 8, 2018 •

edited

spease commented Dec 25, 2018

Stargateur commented Dec 26, 2018 •

edited

spease commented Dec 26, 2018 •

edited

theronic commented Mar 13, 2019

hellow554 commented Mar 13, 2019 •

edited

There's no good way to iterate over all newlines with BufRead #55743

There's no good way to iterate over all newlines with BufRead #55743

Comments

spease commented Nov 7, 2018

Stargateur commented Nov 7, 2018

hellow554 commented Nov 7, 2018

spease commented Nov 8, 2018

Stargateur commented Nov 8, 2018 • edited

spease commented Dec 25, 2018

Stargateur commented Dec 26, 2018 • edited

spease commented Dec 26, 2018 • edited

theronic commented Mar 13, 2019

hellow554 commented Mar 13, 2019 • edited

Stargateur commented Nov 8, 2018 •

edited

Stargateur commented Dec 26, 2018 •

edited

spease commented Dec 26, 2018 •

edited

hellow554 commented Mar 13, 2019 •

edited