Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libs] add recursive variant of fs::read_dir #69684

Open
yoshuawuyts opened this issue Mar 3, 2020 · 6 comments
Open

[libs] add recursive variant of fs::read_dir #69684

yoshuawuyts opened this issue Mar 3, 2020 · 6 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@yoshuawuyts
Copy link
Member

Motivation

As part of the fs module there exist various recursive and non-recursive operations. create_dir and create_dir_all. remove_dir and remove_dir_all. But to read the contents of a directory there currently only exists read_dir, but no recursive counterpart:

Action Kind Base function Recursive variant
Create a directory fs::create_dir fs::create_dir_all
Remove a directory fs::remove_dir fs::remove_dir_all
Read a directory's contents fs::read_dir ...

I noticed the omission of this method when trying to read out a directory recursively, and discovered this was the only directory operation that doesn't have a recursive counterpart.

Usage overview

I'd imagine fs::read_dir_all would be used in much the same way as fs::read_dir:

for entry in fs::read_dir_all(dir)? {
    let entry = entry?;
    dbg!(entry.path());
}

Implementation overview

Roughly the API additions to std::fs would look like this:

#[derive(Debug)]
pub struct ReadDirAll;

impl Iterator for ReadDirAll {
    type Item = Result<DirEntry>;
}

pub fn read_dir_all<P: AsRef<Path>>(path: P) -> Result<ReadDirAll>;

Potential drawbacks

I remember reading something about the readdir(3) syscall, and a file filtering as part of the kernel. But I can't seem to find any reference to that anymore. I also thought this had been referenced in a prior RFC, but that too I cannot find.

But for argument's sake, even if there was a variant for readdir variant that took a filter I'd argue that since std::fs::read_dir doesn't provide filtering, neither should std::fs_read_dir_all. And if we'd want a variant that does provide filtering, it would not be a clean counterpart to fs::read_dir, so it would warrant introducing under a new name.

These functions should be able to co-exist with each other, much the same way fs::read_to_string and io::Read::read_to_string co-exist (both useful, but is a simplified version of the other).

Similar functionality in other languages

Conclusion

I think fs::read_dir_all makes for a straight forward addition to the stdlib. People expect this functionality to exist by default, and the shape and name of this function doesn't seem controversial. Even if filtering supersets of this functionality could possibly exist, they would likely be a different shape, and could co-exist with this function. Thanks!

@jonas-schievink jonas-schievink added C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Mar 3, 2020
@yoshuawuyts
Copy link
Member Author

Oh I guess an obvious omission is burntsushi/walkdir, which provides very detailed control of how to walk directories, max depth, strategies and filtering. I'm thinking this might be what I remembered seeing in the past.

I would imagine functionality as provided by walkdir would likely also have a place in std; and the existence of one should preclude the existence of the other. It's nice to have shorthand APIs that are intuitive to use. But so are highly configurable constructs that can be adapted to every need.

@the8472
Copy link
Member

the8472 commented Mar 4, 2020

Imo there is no obvious, right solution for recursive directory walking. Do you follow symlinks or not? Do you respect device boundaries? Should inaccessible directories communicated as errors or be silently skipped? Should directories themselves be part of the resulting entries or not? How to avoid descending into potentially huge hidden dirs (e.g. cache dirs, containing git objects)? How do you handle recursion depth (which again is a bigger issue when symlinks are involved)?

For remove_dir_all those decisions are much easier to make and create_dir_all is simpler yet as it only walks down a single branch, not the whole tree. So they're not good analogies.

bash has the find command generally available.

That is not a bash built-in. but standard unix tool with multiple implementations (gnu, busybox, ...). And it has a ton of options, which also indicates that there's no one size fits all solution.

Oh I guess an obvious omission is burntsushi/walkdir,

That's not the only one! For example there's ignore which implements a parallel walker which can be necessary because walking directories is IO and thus slow. In my experience naive directory walking is agonizingly slow when performed on network filesystems.

Speaking of which, async getdents might become a thing at some point and then you'll want to steer users towards that instead of a slow synchronous implementation which alternates between IO and consuming results.

There are some other crates too.

about the readdir(3) syscall, and a file filtering as part of the kernel.

That's more of a a libc function. On linux it's using the getdents(2) syscall under the hood. Perhaps you were thinking of ftw(3), which does the filtering in userspace.

So given all these issues maybe that's one of those cases where batteries shouldn't be included.

@Boscop
Copy link

Boscop commented Mar 4, 2020

I agree with @the8472, there are many ways to skin this cat, and I never expected std to provide this.
(I've been using walkdir but recently discovered jwalk which claims to be faster (parallel).)

@luser
Copy link
Contributor

luser commented Mar 5, 2020

I would imagine functionality as provided by walkdir would likely also have a place in std; and the existence of one should preclude the existence of the other. It's nice to have shorthand APIs that are intuitive to use. But so are highly configurable constructs that can be adapted to every need.

I think this is a reasonable statement. Even the walkdir README shows a very simple example with no options first:

use walkdir::WalkDir;

for entry in WalkDir::new("foo") {
    let entry = entry.unwrap();
    println!("{}", entry.path().display());
}

That's virtually identical to your proposed simple API. For more complex operations in which you wanted to set options we could provide a builder type. This would mirror the simple APIs of std::fs::File::{create,open} and the configurability of std::fs::OpenOptions.

@matu3ba
Copy link

matu3ba commented Feb 5, 2022

Quote BurntSushi:

On my system, the performance of walkdir, find and nftw is comparable.

Note, that walkdir, find and nftw do quite some book-keeping, so they dont provide optimal performance.

Such a function would require several options with compilation time cost to be a good solution for all cases and further requires hand-tuning per OS to be optimal (the syscall wrapper cost becomes significant with amount of calls): https://github.com/romkatv/gitstatus/blob/master/docs/listdir.md

If being (performance-) optimal wrt the use case is still a goal of Rust, then I would advice against this and also suggest to close this issue, as providing fine-granular configurability for applications at scale will lead to bloat and hard maintenance (compatibility hazard of C++).

@the8472
Copy link
Member

the8472 commented Feb 5, 2022

Regarding performance, io_uring will gain getdents support in 5.17. This could even benefit single-threaded, synchronous code by queuing up requests that will be fulfilled in the background while user code processes the results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants