Use scandir() to speed up pathlib globbing #70220
The globbing functionality in pathlib (Path.glob() and Path.rglob()) might benefit from using the new optimized os.scandir() interface. It currently just uses os.listdir(). The Path.iterdir() method might also benefit (though less so). There's also a sideways connection with http://bugs.python.org/issue26031 (adding an optional stat cache) -- the cache could possibly keep the DirEntry objects and use their (hopefully cached) attributes. This is more speculative though (and what if the platform's DirEntry doesn't cache?)
Related issue: bpo-25596 "regular files handled as directories in the glob module".
As I recall, if the platform's DirEntry doesn't provide the cacheable attributes when first called, those attributes will be looked up (and cached) on first access.
scandir() is not magic. It simply provides the info given by the OS: see readdir() on UNIX and FindFirstFile()/FindNextFile() on Windows. DirEntry calls os.stat() if needed, but it caches the result. The DirEntry doc tries to explain when syscalls are required or not, depending on the requested information and the platform:
The DirEntry docs say for most methods "In most cases, no system call is |
Guido, it's true that in almost all cases you get the speedup (no system call), and it's very much worth using. The docs are non-committal because being specific would make them fairly complex. I believe it's as follows for is_file/is_dir/is_symlink:
Do you think the docs should try to make this more specific?
Ben, I think it's worth calling out what the rules are around symlinks. I'm Another question: for symlinks, there are two different possible stat Related, "this method always requires a system call", that remark seems to I'd be happy to review a doc update patch if you make one.
"Another question: for symlinks, there are two different possible stat Hopefully, both are cached. It's directly the result of stat() and
The proposed minimal patch implements globbing in pathlib using os.scandir(). Here are the results of microbenchmarks:
$ ./python -m timeit -s "from pathlib import Path; p = Path()" -- "list(p.glob('**/*'))"
Unpatched: 598 msec per loop
Patched: 372 msec per loop
$ ./python -m timeit -s "from pathlib import Path; p = Path('/usr/')" -- "list(p.glob('lib*/**/*'))"
Unpatched: 1.33 sec per loop
Patched: 804 msec per loop
$ ./python -m timeit -s "from pathlib import Path; p = Path('/usr/')" -- "list(p.glob('lib*/**/'))"
Unpatched: 750 msec per loop
Patched: 180 msec per loop
See msg257954 in bpo-25596 for comparison with the glob module.
Guido, I've made some tweaks and improvements to the DirEntry docs here: http://bugs.python.org/issue26248 -- the idea is to fix the issues you mentioned: clarifying when system calls are required with symlinks, mentioning that the results are cached separately for follow_symlinks True and False, etc.
New changeset 927665c4aaab by Serhiy Storchaka in branch 'default':
pathlib._Accessor
#25701
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.