shutil.rmtree is inefficient due to listdir() instead of scandir() #72750
Comments
The use of os.listdir severely limits the speed of this function on [...]. Using the new-in-Python 3.5 os.scandir should eliminate this [...]
You need to cache the names up front because the loop is unlinking entries, and readdir isn't consistent when the directory entries are mutated between calls (see emscripten-core/emscripten#2528). FindFirstFile/FindNextFile likely has similar issues, even if they're not consistently seen (due to DeleteFile itself not guaranteeing deletion until the last handle to the file is closed). scandir might save some stat calls, but you'd need to convert it from a generator to a list before the loop begins, which would limit the savings a bit.
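To illustrate the point about caching names up front, here is a minimal sketch that snapshots the os.scandir iterator into a list before unlinking anything. The helper name `remove_dir_contents` is hypothetical, and it only handles regular files directly inside one directory. It is not the code from the patch discussed in this issue.

```python
import os


def remove_dir_contents(path):
    """Delete the regular files directly inside `path` (non-recursive sketch).

    Hypothetical helper for illustration only; not the shutil.rmtree code.
    """
    # Materialize the iterator into a list *before* unlinking anything, so
    # that deletions cannot disturb a directory read that is still in progress.
    with os.scandir(path) as it:
        entries = list(it)

    removed = 0
    for entry in entries:
        # is_file() can often be answered from information cached by the
        # directory scan, which avoids a separate stat() call per entry.
        if entry.is_file(follow_symlinks=False):
            os.unlink(entry.path)
            removed += 1
    return removed
```

The list costs memory proportional to the number of entries, which is the limit on the savings mentioned above.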
The main issue on *nix is more likely that by using listdir you get directory order, while what you really need is inode ordering. scandir allows for that, since you get the inode from the DirEntry with no extra syscalls - especially without an open() or stat(). Other optimizations are also possible. For example opening the directory and using unlinkat() would likely shave off a bit of CPU. But the dominating factor here is likely the bad access pattern.
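The inode-ordering idea can be sketched roughly as follows, assuming a platform where DirEntry.inode() is served from the directory scan itself. The function name `entries_in_inode_order` is hypothetical, and whether this ordering actually helps depends on the filesystem and storage device.

```python
import os


def entries_in_inode_order(path):
    """Return the DirEntry objects of `path`, sorted by inode number.

    Hypothetical helper for illustration; the idea is that touching files
    in inode order may give a friendlier on-disk access pattern than the
    order in which the directory happens to list them.
    """
    with os.scandir(path) as it:
        # sorted() consumes the whole iterator, so the result is also a
        # stable snapshot that is safe to walk while deleting entries.
        return sorted(it, key=lambda entry: entry.inode())
```

A caller could then loop over the returned list and unlink each entry in that order.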
The proposed patch implements shutil.rmtree using os.scandir. It needs file descriptor support in os.scandir (bpo-25996). I did not test how this affects the performance of shutil.rmtree.
Benchmarks show about 20% speed up.
Following Antoine's suggestion, the patch now makes shutil.rmtree() use os.scandir() on all platforms. I have doubts about one thing. This patch changes the function passed to the onerror handler from os.listdir to os.scandir. This can break user code that checks whether the first argument in onerror is os.listdir. If we keep this change, it should be documented in the "Porting to 3.7" section. Alternatively, we can continue passing os.listdir if os.scandir() fails, despite the fact that os.listdir is no longer used.
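For illustration, this is the kind of user code affected by that change: an onerror-style callback that inspects its first argument. The sketch below accepts both os.listdir and os.scandir so it behaves the same before and after the change. The names `make_handler` and `skipped` are hypothetical.

```python
import os


def make_handler():
    """Build an onerror-style callback plus the list it records into.

    Hypothetical helper for illustration. The callback takes the usual
    (func, path, exc_info) arguments of an rmtree onerror handler.
    """
    skipped = []

    def onerror(func, path, exc_info):
        # Treat a failed directory listing the same way whichever function
        # reports it: os.listdir (before the change) or os.scandir (after).
        if func in (os.listdir, os.scandir):
            skipped.append(path)
            return
        # Any other failure (for example from os.unlink) is re-raised.
        raise exc_info[1]

    return onerror, skipped
```

A handler that compares only against os.listdir would stop recording listing failures once rmtree reports os.scandir instead.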
I think we should change to os.scandir. No need to accumulate compatibility baggage like that.
I've filed https://bugs.python.org/issue32453, which is about O(n^2) deletion behaviour for large directories.