
fs.Dir.deleteTree: Optimize for non-deeply-nested directories #13073

Merged
merged 9 commits into ziglang:master on Oct 12, 2022

Conversation

squeek502 (Collaborator) commented Oct 5, 2022

From @andrewrk in #13070 (comment):

I'd like to fully exhaust the non-recursive, non-heap-allocating implementation options before resorting to this. Could you try a version that uses a fixed-size stack buffer (and resorts to redundant syscall operations in the case that the buffer is exhausted)?

I think we should have only one implementation in the standard library. Hopefully it can accomplish all the characteristics we want:

  • Equivalent or better perf of rm -rf.
  • No possibility of stack overflow due to "user input" of a deeply nested file system tree
  • No allocator parameter required.

This PR implements this suggestion and (I think) accomplishes everything in the list. It is just as fast as the recursive implementation in #13070 for directories that stay within the fixed-size stack buffer, and only slows down (temporarily) for directories that exceed that level of nesting.
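To illustrate the shape of the approach, here is an explanatory sketch over a synthetic in-memory tree rather than the actual std.fs code (Node, Frame, deleteSlow, and max_depth are made-up names for the example, and the real fallback path is itself iterative and keeps only one directory open at a time rather than recursing):

const std = @import("std");

// Hypothetical stand-in for a directory entry; leaves play the role of files.
const Node = struct {
    name: []const u8,
    children: []const Node = &[_]Node{},
};

const max_depth = 4; // the real deleteTree uses a bound of 16

// Stand-in for the pre-existing fallback path. Written recursively here only
// to keep the sketch short; the real fallback trades redundant re-opening of
// directories for constant stack usage.
fn deleteSlow(node: Node) void {
    for (node.children) |child| deleteSlow(child);
    std.debug.print("slow-path delete: {s}\n", .{node.name});
}

const Frame = struct {
    node: Node,
    next_child: usize,
};

fn deleteFast(root: Node) void {
    var stack: [max_depth]Frame = undefined;
    var len: usize = 1;
    stack[0] = .{ .node = root, .next_child = 0 };

    while (len > 0) {
        const top = &stack[len - 1];
        if (top.next_child == top.node.children.len) {
            // Finished iterating this "directory"; delete it and pop.
            std.debug.print("fast-path delete: {s}\n", .{top.node.name});
            len -= 1;
            continue;
        }
        const child = top.node.children[top.next_child];
        top.next_child += 1;
        if (child.children.len == 0) {
            std.debug.print("fast-path delete: {s}\n", .{child.name});
        } else if (len < max_depth) {
            // Room left in the fixed-size stack: descend without extra work.
            stack[len] = .{ .node = child, .next_child = 0 };
            len += 1;
        } else {
            // Stack exhausted: handle just this subtree with the slower path.
            deleteSlow(child);
        }
    }
}

pub fn main() void {
    const js_files = [_]Node{ .{ .name = "index.js" }, .{ .name = "App.js" } };
    const children = [_]Node{ .{ .name = "src", .children = &js_files }, .{ .name = "README.md" } };
    deleteFast(.{ .name = "my-app", .children = &children });
}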

See #13048 for more context. Benchmarking is done with the same folder structure mentioned here:

tested with the folder that resulted from npx create-react-app my-app; 41209 files, 344MiB

Linux benchmarks:

$ hyperfine "./rm-fast my-app-copy" "./rm-master my-app-copy" --prepare="cp -r my-app my-app-copy" --warmup 1 --runs 20
Benchmark #1: ./rm-fast my-app-copy
  Time (mean ± σ):     382.7 ms ±   7.8 ms    [User: 13.6 ms, System: 358.9 ms]
  Range (min … max):   372.2 ms … 395.9 ms    20 runs
 
Benchmark #2: ./rm-master my-app-copy
  Time (mean ± σ):     701.2 ms ±  28.3 ms    [User: 36.8 ms, System: 646.2 ms]
  Range (min … max):   673.4 ms … 799.4 ms    20 runs
 
Summary
  './rm-fast my-app-copy' ran
    1.83 ± 0.08 times faster than './rm-master my-app-copy'

(note: the test directory's nesting level stays within the fixed-size buffer)

Benchmarks against the recursive version from #13070 and rm -rf:

$ hyperfine "./rm-recursive my-app-copy" "./rm-fast my-app-copy" "rm -rf my-app-copy" --prepare="cp -r my-app my-app-copy" --warmup 1 --runs 20
Benchmark #1: ./rm-recursive my-app-copy
  Time (mean ± σ):     389.1 ms ±  12.9 ms    [User: 19.5 ms, System: 360.2 ms]
  Range (min … max):   370.5 ms … 430.6 ms    20 runs
 
Benchmark #2: ./rm-fast my-app-copy
  Time (mean ± σ):     392.6 ms ±  17.0 ms    [User: 17.4 ms, System: 359.5 ms]
  Range (min … max):   373.0 ms … 444.8 ms    20 runs
 
Benchmark #3: rm -rf my-app-copy
  Time (mean ± σ):     423.9 ms ±  35.4 ms    [User: 17.5 ms, System: 379.4 ms]
  Range (min … max):   391.6 ms … 518.9 ms    20 runs
 
Summary
  './rm-recursive my-app-copy' ran
    1.01 ± 0.05 times faster than './rm-fast my-app-copy'
    1.09 ± 0.10 times faster than 'rm -rf my-app-copy'

The strace output is identical to the recursive version as long as the nesting stays within the fixed-size buffer:

$ strace -c ./rm-fast my-app-copy
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.38    0.435533          10     41210         1 unlinkat
  8.21    0.044482           4      9744           getdents64
  7.89    0.042770           8      4863           close
  3.51    0.019030           3      4863           openat
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    0.541815                 60683         1 total

And the syscall count doesn't increase much if the fallback is only needed for a few levels of nesting. The directory I'm testing with has 10 nested directories max; here's the strace -c output after setting deleteTree's BoundedArray size to 8:

$ strace -c ./rm-fast my-app-copy
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 80.95    0.456681          11     41210         1 unlinkat
  8.05    0.045412           4      9754           getdents64
  7.56    0.042634           8      4873           close
  3.44    0.019404           3      4873           openat
  0.00    0.000000           0         4           rt_sigaction
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    0.564131                 60717         1 total

and here it is with the BoundedArray size set to 4:

$ strace -c ./rm-fast my-app-copy
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.64    0.463573          11     41210         1 unlinkat
  9.61    0.055929           4     12428           getdents64
  6.02    0.035049           4      7548           close
  4.74    0.027564           3      7548           openat
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    0.582115                 68737         1 total

This PR also includes some optimizations for what is now the fallback implementation of deleteTree: it eliminates some redundant deleteFile calls without changing the implementation's functionality (see #13048 (comment) and the relevant commit message below for more info). This leads to a decent reduction in syscalls but a fairly insignificant decrease in wall-clock time.

Before:

$ strace -c ./rm-master my-app-copy
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.82    0.580480           8     65368     24159 unlinkat
 33.35    0.404906          13     29036           getdents64
  7.55    0.091704           3     24159           close
  6.73    0.081669           3     24159           openat
  4.55    0.055245           2     24159           lseek
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    1.214004                166884     24159 total

After:

$ strace -c ./rm-updated my-app-copy
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.90    0.496441          12     41210         1 unlinkat
 37.58    0.406455          13     29036           getdents64
  8.60    0.093017           3     24159           close
  7.93    0.085720           3     24159           openat
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           prlimit64
------ ----------- ----------- --------- --------- ----------------
100.00    1.081633                118567         1 total

(Note that the lseek syscalls are gone as well; this comes from the introduction of IterableDir.iterateAssumeFirstIteration, which the new deleteTree implementation also takes advantage of.)

Benchmark results:

$ hyperfine "./rm-master my-app-copy" "./rm-updated my-app-copy" --prepare="cp -r my-app my-app-copy" --warmup 1 --runs 20
Benchmark #1: ./rm-master my-app-copy
  Time (mean ± σ):     695.0 ms ±  11.9 ms    [User: 28.7 ms, System: 654.1 ms]
  Range (min … max):   678.5 ms … 718.0 ms    20 runs
 
Benchmark #2: ./rm-updated my-app-copy
  Time (mean ± σ):     683.6 ms ±  28.2 ms    [User: 26.1 ms, System: 639.1 ms]
  Range (min … max):   662.6 ms … 793.6 ms    20 runs
 
Summary
  './rm-updated my-app-copy' ran
    1.02 ± 0.05 times faster than './rm-master my-app-copy'

This also adds IterableDir.Iterator.reset, which was helpful here: it allows resetting the iterator without needing to store the IterableDir just to create a new iterator.


Closes #13048 if merged.

This allows for avoiding an unnecessary lseek (or equivalent) call in places where it can be known that the fd has not had its cursor modified yet.
There are two parts to this:

1. The deleteFile call on the sub_path has been moved outside the loop, since if the first call fails with `IsDir` then it's very likely that all the subsequent calls will do the same. Instead, if the `openIterableDir` call ever hits `NotDir` after the `deleteFile` hit `IsDir`, then we assume that the tree was deleted at some point and can consider the deleteTree a success. (A minimal sketch of this follows the list.)

2. Inside the `dir_it.next()` loop, we look at entry.kind and only try doing the relevant (deleteFile/openIterableDir) operation, but always fall back to the other if we get the relevant error (NotDir/IsDir).
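For illustration, a minimal compilable sketch of part 1 (the function name deleteTreeSketch and the throwaway file in main are made up for the example; the real code handles the full error sets and continues into the iteration loop rather than stopping at the comment):

const std = @import("std");

// Part 1 in isolation: try deleteFile on sub_path exactly once, before any
// loop, and only fall through to directory handling when it reports IsDir.
fn deleteTreeSketch(parent: std.fs.Dir, sub_path: []const u8) !void {
    if (parent.deleteFile(sub_path)) |_| {
        return; // sub_path was a plain file; nothing more to do
    } else |err| switch (err) {
        error.IsDir => {}, // fall through to directory handling below
        error.FileNotFound => return, // already gone; treat as success
        else => |e| return e,
    }
    // The real implementation opens sub_path as a directory here and iterates.
    // If that open ever fails with error.NotDir after deleteFile said IsDir,
    // the tree was deleted out from under us and deleteTree reports success.
}

pub fn main() !void {
    // Create a throwaway file so the sketch has something to act on.
    (try std.fs.cwd().createFile("delete-tree-sketch.tmp", .{})).close();
    try deleteTreeSketch(std.fs.cwd(), "delete-tree-sketch.tmp");
}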
`deleteTree` now uses a stack-allocated stack for the first 16 nested directories, and then falls back to the previous implementation (which only keeps 1 directory open at a time) when it runs out of room in its stack. This allows the function to perform as well as a recursive implementation for most use-cases without needing allocation or introducing the possibility of stack overflow.
iter: IterableDir.Iterator,
};

var stack = std.BoundedArray(StackItem, 16){};
squeek502 (Collaborator, Author) commented Oct 5, 2022

The 16 here is totally arbitrary; I have no idea what should be taken into account when determining what this size should be.

Might be worth noting that @sizeOf(StackItem) is 8248 on Linux (all platforms will have similar sizes because of the buf field of IterableDir.Iterator), so @sizeOf(std.BoundedArray(StackItem, 16)) ends up being 131976 or ~128.9KiB
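(As a consistency check on those numbers, not something stated in the thread: std.BoundedArray stores its items inline plus a usize length field, so 16 × 8248 + 8 = 131976 bytes; with the later 1024-byte buf the per-item size drops by 7168 bytes to 1080, giving 16 × 1080 + 8 = 17288, the ~16.9 KiB figure quoted further down.)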

Member

I think 128.9 KiB is significantly more than a standard library function should expect to be available on the stack. For example, the total thread stack size on alpine linux is 128 KiB: https://ariadne.space/2021/06/25/understanding-thread-stack-sizes-and-how-alpine-is-different/

Contributor

128.9 KiB sounds like a whopping lot for deleting files, when the default stack size can be below 328 KiB (see docker-library/cassandra#192) for CI/desktop setups.

I don't think there is any good way to automatically use "more stack", so defaulting to minimal stack usage and providing tools to measure stack sizes for the targets (for common desktop stuff) sounds like the best option here.

This would be nice to document as a comparison against other common deleteTree implementations. Other implementations commonly keep up to 10 nesting levels of file descriptors open to prevent other processes from doing weird things with the tree currently being deleted, at a performance cost.
It would be nice to document that this function does not prevent that.

Member

It seems to me that 8 KiB is quite large for the buffer used by IterableDir.Iterator. Perhaps that should be reduced?

squeek502 (Collaborator, Author) commented Oct 5, 2022

What do you think a reasonable limit would be here? A BoundedArray size of 4 would be ~32KiB for reference.

EDIT: Yeah, I am unsure why buf is so large for IterableDir.Iterator.
EDIT#2: Seems to be to allow for reading a lot of dirents per getdents64 (or the equivalent) syscall.
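(Rough arithmetic, as an editorial aside rather than something from the thread: a linux_dirent64 record is a 19-byte fixed header plus the NUL-terminated name, rounded up for alignment, so typical records are a few dozen bytes. Even a 1024-byte buffer therefore returns on the order of 20-30 entries per getdents64 call, which lines up with the small differences in the benchmarks below.)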

Member

EDIT#2: Seems to be to allow for reading a lot of dirents per getdents64 (or the equivalent) syscall.

Yes, that seems to be the idea, though my sense is that 1-2 KiB would be pretty much as effective as the current 8 in practice.

squeek502 (Collaborator, Author) commented Oct 5, 2022

my sense is that 1-2 KiB would be pretty much as effective as the current 8 in practice.

You appear to be correct. 1024 and 2048 are pretty much the same from my minimal testing on Linux, and 8192 only wins out on directories that have huge numbers of entries (e.g. 25,000 files in a single directory):

Benchmarks on the dir I've been using (41209 files from npx create-react-app my-app):

Benchmark #1: ./walk-1024 my-app
  Time (mean ± σ):      29.5 ms ±   2.1 ms    [User: 4.5 ms, System: 24.8 ms]
  Range (min … max):    24.6 ms …  36.6 ms    94 runs
 
Benchmark #2: ./walk-2048 my-app
  Time (mean ± σ):      29.8 ms ±   1.9 ms    [User: 4.9 ms, System: 24.7 ms]
  Range (min … max):    24.9 ms …  34.5 ms    98 runs
 
Benchmark #3: ./walk-4096 my-app
  Time (mean ± σ):      30.2 ms ±   2.5 ms    [User: 4.8 ms, System: 25.2 ms]
  Range (min … max):    25.7 ms …  37.4 ms    107 runs
 
Benchmark #4: ./walk-8192 my-app
  Time (mean ± σ):      30.6 ms ±   2.6 ms    [User: 5.7 ms, System: 24.6 ms]
  Range (min … max):    26.7 ms …  41.1 ms    107 runs
 
Summary
  './walk-1024 my-app' ran
    1.01 ± 0.10 times faster than './walk-2048 my-app'
    1.03 ± 0.11 times faster than './walk-4096 my-app'
    1.04 ± 0.11 times faster than './walk-8192 my-app'

Benchmark on a single dir that contains 25,000 files in it:

Benchmark #1: ./walk-1024 dir-with-tons-of-entries
  Time (mean ± σ):       9.1 ms ±   2.0 ms    [User: 0.8 ms, System: 8.3 ms]
  Range (min … max):     5.5 ms …  13.2 ms    270 runs
 
Benchmark #2: ./walk-2048 dir-with-tons-of-entries
  Time (mean ± σ):       9.2 ms ±   2.0 ms    [User: 0.8 ms, System: 8.4 ms]
  Range (min … max):     5.4 ms …  12.5 ms    475 runs
 
Benchmark #3: ./walk-4096 dir-with-tons-of-entries
  Time (mean ± σ):       9.5 ms ±   1.8 ms    [User: 0.7 ms, System: 8.9 ms]
  Range (min … max):     5.4 ms …  12.7 ms    255 runs
 
Benchmark #4: ./walk-8192 dir-with-tons-of-entries
  Time (mean ± σ):       8.9 ms ±   2.1 ms    [User: 1.0 ms, System: 7.9 ms]
  Range (min … max):     5.3 ms …  13.0 ms    253 runs
 
Summary
  './walk-8192 dir-with-tons-of-entries' ran
    1.02 ± 0.34 times faster than './walk-1024 dir-with-tons-of-entries'
    1.04 ± 0.33 times faster than './walk-2048 dir-with-tons-of-entries'
    1.08 ± 0.33 times faster than './walk-4096 dir-with-tons-of-entries'

EDIT: Went ahead and reduced buf to 1024. This means that @sizeOf(std.BoundedArray(StackItem, 16)) is 17288 or ~16.9KiB. Still unsure if that's too large or what a reasonable size to shoot for would be.

Member

I think 16.9 KiB is quite reasonable as a default. Thanks for taking the time to benchmark this!

ifreund (Member) commented Oct 5, 2022

Very nice work on this PR, I think this is the right direction for the standard library to take!

matu3ba (Contributor) left a comment

The naming of the fallback method is not nice for targets with very limited stack size.

I don't think we should be this greedy with the user's stack; let the user opt in so we don't break usage on Docker/low-stack-size platforms.

If recursion is necessary, please add a limit so that other processes doing weird things, like moving or adding files into the tree being deleted, can at least be detected.
Other implementations guard against that (call it limited safety) at a significant performance cost.

@@ -2035,54 +2076,215 @@ pub const Dir = struct {
/// this function recursively removes its entries and then tries again.
/// This operation is not atomic on most file systems.
Contributor

missing: does not follow symlinks.


lib/std/fs.zig Outdated
top.parent_dir.deleteDir(top.name) catch |err| switch (err) {
    error.FileNotFound => {},
    error.DirNotEmpty => {
        // reset the iterator and try again
Contributor

This should have a default and user-configurable upper bound on the number of deletion attempts.
Otherwise, one cannot detect other buggy processes continuously inserting directories or files.

squeek502 (Collaborator, Author) commented Oct 5, 2022

Could you open a new issue for this? I believe the problem you're describing exists with the current master implementation as well, so I'd like to consider it out of the scope of this PR as I'm not trying to alter functionality, just increase the speed.

This was sized large so that `getdents` (and other platforms' equivalents) could provide many entries per syscall, but some benchmarking seems to indicate that the larger 8192 sizing doesn't actually lead to performance gains outside of edge cases like extremely large numbers of entries within a single directory (e.g. 25,000 files in one directory), and even then the gains are minimal ('./walk-8192 dir-with-tons-of-entries' ran 1.02 ± 0.34 times faster than './walk-1024 dir-with-tons-of-entries').

Note: Sizes 1024 and 2048 had similar performance characteristics, so the smaller of the two was chosen.
We don't control sub_path so it may contain directory components; therefore, NotDir is a potential error when acting on sub_path.
Windows requires the directory handle to be closed before attempting to delete the directory, so now we do that and then re-open it if we need to retry (from getting DirNotEmpty when trying to delete).
squeek502 (Collaborator, Author) commented Oct 6, 2022

A summary of the recent changes and some explanations:

  • I reduced the Iterator buf size to 1024 for every platform.
    • On Linux, this means that @sizeOf(std.BoundedArray(StackItem, 16)) is 17288 or ~16.9KiB. Still unsure if that's too large or what a reasonable size to shoot for would be.
    • On Linux this doesn't seem to have much of a performance impact (see #13073 (comment)).
    • On Windows, the larger buf size is more performant, not totally sure what tradeoff should be made there:
      • 'walk-8192.exe' ran
        1.02 ± 0.01 times faster than 'walk-2048.exe'
        1.03 ± 0.01 times faster than 'walk-1536.exe'
        1.07 ± 0.01 times faster than 'walk-1024.exe'
  • I renamed deleteTreeFallback to deleteTreeMinStackSize and made it pub again as per @matu3ba's comment. Let me know if there's a better name for it.
  • I fixed the new deleteTree implementation failing on Windows (with FileBusy), since I was closing the directory handle after deleting the directory, which is a no-go on Windows. I applied the Windows fix to all platforms, which introduces some unnecessary complexity on platforms that don't need to close the handle before deletion. I'm unsure about this choice.

Also did some basic benchmarking of the new implementation on Windows:

Benchmark 1: rm-updated.exe
  Time (mean ± σ):      5.612 s ±  0.173 s    [User: 0.193 s, System: 5.162 s]
  Range (min … max):    5.435 s …  5.902 s    10 runs

Benchmark 2: rm-master.exe
  Time (mean ± σ):      8.227 s ±  0.109 s    [User: 0.338 s, System: 7.621 s]
  Range (min … max):    8.075 s …  8.384 s    10 runs

Summary
  'rm-updated.exe' ran
    1.47 ± 0.05 times faster than 'rm-master.exe'

(rm-updated.exe is this branch with all the changes so far)

EDIT: rmdir on Windows for context:

Benchmark 1: rmdir /s /q my-app-copy
  Time (mean ± σ):      6.347 s ±  0.134 s    [User: 0.170 s, System: 5.863 s]
  Range (min … max):    6.180 s …  6.564 s    10 runs

andrewrk (Member) left a comment

Thanks for the patch! This looks almost ready to merge; however, I believe I did find a regression.

@@ -301,7 +301,7 @@ pub const IterableDir = struct {
     .macos, .ios, .freebsd, .netbsd, .dragonfly, .openbsd, .solaris => struct {
         dir: Dir,
         seek: i64,
-        buf: [8192]u8, // TODO align(@alignOf(os.system.dirent)),
+        buf: [1024]u8, // TODO align(@alignOf(os.system.dirent)),
andrewrk (Member) commented Oct 12, 2022

I believe these bufs need to be at least PATH_MAX. Consider if a directory contains a file with the longest possible name; in that case, getdents would fail with EINVAL.

I suggest putting this code somewhere:

comptime assert(buf.len >= @sizeOf(os.system.dirent) + MAX_PATH_BYTES);

squeek502 (Collaborator, Author) commented Oct 12, 2022

I believe the relevant max length here is the max filename length (i.e. an individual component) rather than PATH_MAX. From a cursory search, the max filename length is 255 on Linux and 256 on Windows, so the reduced buf size should (hopefully) not cause an issue.

I'll investigate this more to be sure though, as I'm planning on working on #8268 as well, which is directly related.

Member

Great point - in this case I think it's fine to merge this as is.

Contributor

does that mean lowering it to even 512 or 256 should be fine?

squeek502 (Collaborator, Author)

@nektro 256 would be a problem since @sizeOf(os.system.dirent) is > 256 on Linux. 512 would work but would need to be benchmarked to ensure it doesn't regress performance (see #13073 (comment)).

@andrewrk andrewrk merged commit c23b3e6 into ziglang:master Oct 12, 2022
andrewrk (Member)

Thanks for doing this work!

Successfully merging this pull request may close these issues.

on macOS, std.fs.deleteTree() is 40% slower than rm -rf