Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std::io::fs::walk_dir has O(n^2) behaviour #13411

Closed
huonw opened this issue Apr 8, 2014 · 10 comments · Fixed by #13720
Closed

std::io::fs::walk_dir has O(n^2) behaviour #13411

huonw opened this issue Apr 8, 2014 · 10 comments · Fixed by #13720
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@huonw
Copy link
Member

huonw commented Apr 8, 2014

It calls .shift on a vector repeatedly.

Example:

// walk_dir.rs
fn main() {
    let dir = Path::new(std::os::args()[1]);
    for _ in std::io::fs::walk_dir(&dir).unwrap() {}
}
$ n=50000; mkdir -p $n; bash -c "touch $n/{1..$n}"
$ time ./walk_dir 50000/

real    0m1.442s
user    0m1.404s
sys     0m0.032s

$ n=100000; mkdir -p $n; bash -c "touch $n/{1..$n}"
$ time ./walk_dir 100000/

real    0m6.228s
user    0m6.096s
sys     0m0.132s

50000 and 100000 are directories with that many files in them. Theoretically this wants a deque.

@pongad
Copy link
Contributor

pongad commented Apr 9, 2014

@huonw Would keeping an index be sufficient? Just return Some(stack[pos]) and bump pos each time. It seems to me we don't need any fancy data structure. I can work on this.

@huonw
Copy link
Member Author

huonw commented Apr 10, 2014

That might work, but it does mean that the size of the internal vector grows to be the total number of elements ever touched, rather than just the number of children of recently-walked directories. It would also require cloning each of them. (Maybe not, if it stored Option<Path> internally and returned stack[pos].take()...)

@pongad
Copy link
Contributor

pongad commented Apr 10, 2014

What about this. When we run out of space at the end, check if we are using
more than half of the vector. If we are, allocate a new one at least twice
as large. If we are not, swap all the values to the beginning of the vector.
This way we have amortized linear time without needing to use a full blown
deque.
On Apr 10, 2014 2:14 AM, "Huon Wilson" notifications@github.com wrote:

That might work, but it does mean that the size of the internal vector
grows to be the total number of elements ever touched, rather than just the
number of children of recently-walked directories. It would also require
cloning each of them. (Maybe not, if it stored Option internally
and returned stack[pos].take()...)


Reply to this email directly or view it on GitHubhttps://github.com//issues/13411#issuecomment-40046273
.

@huonw
Copy link
Member Author

huonw commented Apr 10, 2014

If you can get that to work, I guess so. But really, it would be better/more correct to just use a proper Deque... if std had access to one.

@pongad
Copy link
Contributor

pongad commented Apr 12, 2014

I am going to guess that taking walk_dir out of std is undesirable. I can't really think of another way to do this rather than creating a simple deque. I'll work on it if you think this is acceptable.

@liigo
Copy link
Contributor

liigo commented Apr 13, 2014

"It calls .shift on a vector repeatedly." but it's very slow.
#12884 Vec/Str need an efficient way to shift() multi-items

@huonw
Copy link
Member Author

huonw commented Apr 13, 2014

This isn't multiple items, it needs to shift the Paths off one at a time.

@huonw
Copy link
Member Author

huonw commented Apr 13, 2014

(@pongad I would be OK with moving deque into a "secret" module inside std and reexporting it from libcollections, however rustdoc doesn't support this nicely just yet.)

@Thiez
Copy link
Contributor

Thiez commented Apr 13, 2014

How about something along these lines:

/// An iterator which walks over a directory
pub struct Directories {
    stack: Vec<std::vec::MoveItems<Path>>,
}

impl Iterator<Path> for Directories {
  fn next(&mut self) -> Option<Path> {
    enum LoopResult {
        AddAndReturn(std::vec::MoveItems<Path,Path),
        Return(Path),
        Pop,
    }
    loop {
      let loop_result;
      match self.stack.last() {
        Some(ref mut path_iter.next()) => {
          Some(path) {
            if (path.is_dir() {
              match readdir(&path) {
                Ok(dirs) => {
                  loop_result = AddAndReturn(dirs.move_iter(), path)
                },
                Err(..) => {
                  loop_result = Return(path)
                },
              }
            } else {
              loop_result = Return(path)
            }
          },
          None => {
            loop_result = Pop
          },
        }
        None => {
          return None,
        }
      };
      match loop_result {
        AddAndReturn(dirs,path) => {
          self.stack.push(dirs);
          return path
        },
        Return(path) => {
          return path
        },
        Pop => {
          self.stack.pop()
        },
      };
    }
  }
}

I didn't test the above, removing syntax errors is left as an exercise to the reader.

Edit: Never mind, I see this changes the order that walk_dir guarantees. I'll try and rewrite my proposal tonight to make it emulate the current behaviour.

@huonw
Copy link
Member Author

huonw commented Apr 13, 2014

I'm not sure the current order is important, I don't know though (if it's not important we can probably just use Vec as a stack, which is efficient).

However, if we wish to retain the current order, I really do think the correct solution is to just use a deque, rather than write a complicated specialised data structure for this relatively simple task. (Unless that data structure has other significant benefits, of course. Maybe avoiding the copies from each readdir vec to main deque will be noticably more efficient?)

aturon added a commit to aturon/rust that referenced this issue Apr 24, 2014
The `walk_dir` iterator was simulating a queue using a vector (in particular, using `shift`),
leading to O(n^2) performance. Since the order was not well-specified (see issue rust-lang#13411),
the simplest fix is to use the vector as a stack (and thus yield a depth-first traversal).
This patch does exactly that.  It leaves the order as originally specified -- "some top-down
order" -- and adds a test to ensure a top-down traversal.

Note that the underlying `readdir` function does not specify any particular order, nor
does the system call it uses.

Closes rust-lang#13411.
bors added a commit that referenced this issue Apr 24, 2014
The `walk_dir` iterator was simulating a queue using a vector (in particular, using `shift`),
leading to O(n^2) performance. Since the order was not well-specified (see issue #13411),
the simplest fix is to use the vector as a stack (and thus yield a depth-first traversal).
This patch does exactly that, and adds a test checking for depth-first behavior.

Note that the underlying `readdir` function does not specify any particular order, nor
does the system call it uses.

Closes #13411.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants