-
Notifications
You must be signed in to change notification settings - Fork 12.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
There's no good way to iterate over all newlines with BufRead #55743
Comments
That clearly specified in the doc:
https://doc.rust-lang.org/std/io/trait.BufRead.html#method.lines So I don't agree about "This is counterintuitive". Also this is only used on very old system according to wikipedia:
https://en.wikipedia.org/wiki/Newline#Representation The only one "recent" is OS-9, I think it's not useful to add |
Some embedded systems use But I would not mind to use |
@Stargateur Maybe it's an age thing, but I've thought of "\r", "\n", and "\r\n" as being the trio of possible newlines to be truly crossplatform. @hellow554 The problem is that split('\r') means that I no longer split on '\n', so it won't work for files using that newline separator. And since BufRead only appears to provide either read_line (which doesn't cover '\r') or read_until (which only covers 1 byte character) there's no way to cover all the bases with the basic BufRead interface. If I read the data into a string and then split, I can use a closure that allows me to solve the problem. However, then I can't provide an iterator chain as the function reading the file has ownership of the string, and the rest of the iterator chain uses &str. This approach also prevents true stream processing of the data (although in this case I didn't notice a practical difference when I modified the code in other ways to be more streaming and less read-buffer-write, so the ownership issue was the only dealbreaker). The confusing part is that if you load the data into most advanced text editors, they'll silently split the newlines correctly. And of course all of this is a lot of thought to put into just splitting a file based on newlines. |
Well, you need a more low level feature, and use std::io::{BufRead, BufReader, Error};
use std::str;
fn main() -> Result<(), Error> {
let mut f = BufReader::new(
"Hello\nI'm a test\rThis is a second test\r\nand a third\n\rand the last one".as_bytes(),
);
loop {
let length = {
let buffer = f.fill_buf()?;
let line_size = buffer
.iter()
.take_while(|c| **c != b'\n' && **c != b'\r')
.count();
if buffer.len() == 0 {
break Ok(());
}
println!("{:?}", str::from_utf8(&buffer[..line_size]));
line_size
+ if line_size < buffer.len() {
// we found a delimiter
if line_size + 1 < buffer.len() // we look if we found two delimiter
&& buffer[line_size] == b'\r'
&& buffer[line_size + 1] == b'\n'
{
2
} else {
1
}
} else {
0
}
};
f.consume(length);
}
} But that indeed require some work. You can also see that handle '\r' require the double of comparaison, to handle a very old and obsolete feature not required by 99% (random stats ;)) of use case that expensive. |
That's an unacceptable amount of code to have to write for such a simple operation. Somebody could easily take one look at that and be immediately dissuaded from using the language. Consider Python's splitlines method: Not only does it handle \r, but it also handles several additional line delimiter characters, including unicode. Even Rust's string lines() functions don't handle these: Rust is supposed to be about correctness-first, not performance at the cost of cut corners and exposed footguns. |
Where did you read that ? Rust is intent to be safe and as fast as possible. Also if you want code in python code in it. Rust is not python. Every language has its pros and its cons. If you judge a language by the number of line you need to write, yes Rust is not the best candidate.
Ah, just do a custum trait and add it. The fact that std don't have it builtin is not a big problem. You can copy my naif exemple. Just create your custom iterator, and you will do all of this with one line in you app code. |
I thought it was on the website, but I think the initial point of entry was a discussion. Possibly about parsing floats. But it looks like the website has changed, so this could do with some clarification. As this seems to be one of the few cases where safety and performance are mutually exclusive. It's one thing to judge a language by the number of lines you have to write. It's quite another for it to look like it's cutting corners enough in the name of performance that you have to be careful about the sort of data you pass to functions in the standard library because they're designed to only work with the "usual" data. For instance, if you could pass "143.12" to a float, but "4235e-15" would simply get ignored. This is exactly the kind of brick wall that deters people from using Rust. If you're on the common path it works fine, but deviate ever so slightly and suddenly you're stuck spending orders of magnitudes more time to completely rewrite part of the standard lib. For this to happen with something as mundane as splitting a function by newlines would be a red flag to somebody that the language is not mature enough to use for production software. EDIT: I mean, thanks for writing the example, but if this were a third-party library then my boss would additionally want to check its license and that of its dependencies. Keep in mind that people coming from a C++ background have a developed instinct to flinch at the idea of pulling in dependencies because they expect it to be a constant hassle to build and integrate etc. And it also isn't as discoverable as the standard library, so it raises the possibility of somebody giving up then and there. It's amazing how many problems can be solved with a well-orchestrated google search sometimes. |
Posted an SO question related to this. |
But why @theronic? This post is already clear enough IMHO. |
If you have files with a newline of '\r', the only way to handle them is to use split. But if you have files with a newline of '\n' or '\r\n', then you can use lines().
This is counterintuitive and problematic from the perspective of the BufRead interface - there's no way to use it to iterate over any of the three types. I'd typically expect that lines() would handle this transparently.
I'm not sure if this is the right place, but I don't want it to get lost because it cost me major hassle.
The text was updated successfully, but these errors were encountered: