-
-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use scandir() to speed up the glob module #69782
Comments
The glob module happily joins names of regular files together with os.path.join() or attempts to list the files contained into a regular file (sic). The same 'except os.error' statement is used to handle both these cases and the case of a non readable directory. The attached patch makes the code more correct and easier to understand. |
This second patch rewrites the conditionals that decide whether to call _iglob() recursively, in a more natural way and without changing the behavior from the first patch. The patch also refactors the code with a new _listdir() function. |
In general the patch LGTM. It allows to speed up glob by 5-10% in warn test. But much more gain we can achieve by using os.scandir(). Here are results of microbenchmarks: $ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
Unpatched: 201 msec per loop
Using isdir(): 181 msec per loop
Using scandir(): 65.2 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
Unpatched: 2.06 sec per loop
Using isdir(): 1.92 sec per loop
Using scandir(): 820 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
Unpatched: 1.77 sec per loop
Using isdir(): 1.61 sec per loop
Using scandir(): 431 msec per loop Yet one benefit is that iglob() now yields path names without the delay for reading the full content of a directory (see bpo-22167). |
I don't think it is a good idea to break backward compatibility here even if the globN are internal functions. Better would be to continue to have globN functions that support the existing API and emit deprecation warnings. |
I think so too. I just wanted someone to confirmed that it is not overcautiousness. For now glob1() is used only in one place in the stdlib (bpo-16620). But there was other problem with previous patch, the same as with current implementation of os.walk() (bpo-25995). It makes glob to use a lot of file descriptors. Updated patch lefts deprecated glob1() and glob2() and makes glob to consume all scandir iterator before starting to yield values (but the problem with fd leaks on non-refcounted implementations is still left, bpo-25994). This doesn't affect performance, but lefts the issue with delaying (bpo-22167). |
os.scandir() is called recursively in the last patch and the file descriptors are not closed until returning from the recursion. |
I may be missing something, anyway here are few comments on the patch: |
No, os.scandir() is called non-recursively in _iterdir(), and since _iterdir() results are always accumulated in a list, a recursion starts only after exhausting the os.scandir() iterator and closing the file descriptor. We need first to resolve bpo-25994 to close the file descriptor explicitly.
Patched code passes existing tests. Do you have additional tests?
Ah, I thought is_dir() doesn't follow symlinks. Yes, now the call to entry.is_symlink() is not necessary.
Yes, but this can complicate the rest of the code. _rlistdir() is called with dironly=False only when the pattern ends with '**'. I'm not sure this is enough important case for optimization. In most cases '**' is used in the middle of the pattern, and all names yielded by _iterdir() are directory names (or broken symlinks). |
Right, I failed to note that point. And so, since the file descriptor opened by os.scandir() is closed within each call to recursive _rlistdir(), then my other comment about EMFILE does not stand as well. |
Related issue: bpo-26032 "Use scandir() to speed up pathlib globbing". |
(IOW once this patch has been applied maybe you can do the same for globbing in pathlib as requested in issue bpo-26032.) |
Adding a doc patch. |
FWIW I have followed the idea of having _iterdir() yielding the DirEntry entry instead of the name in Serhiy's patch. There is a slight performance gain. Now _glob0() and _glob1() do return a list of directories when dironly is true but there is now another place where OSError must be tracked, so it is not clear if this is worth it. glob_scandir_2_diff.patch is the differential patch between glob_scandir_2.patch and glob_scandir_3.patch. Here are the performance tests run with both patches. $ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
glob_scandir_2.patch: 33.1 msec per loop
glob_scandir_3.patch: 33.8 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
glob_scandir_2.patch: 927 msec per loop
glob_scandir_3.patch: 850 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
glob_scandir_2.patch: 423 msec per loop
glob_scandir_3.patch: 337 msec per loop |
Oops, the list comprehension in _glob1() of glob_scandir_3.patch should be instead: names = [x.name for x in _iterdir(dirname, dironly)] |
Actually the microbenchmarks comparison between glob_scandir_2.patch and glob_scandir_3.patch must be made by removing the call to entry.is_symlink() also in glob_scandir_2.patch. |
@serhiy, I assume this is in your capable hands? Personally I'd like to have seen the refactoring in a separate diff from the switch to scandir(), just to ensure that the refactoring does not change anything and to make it easier to read the changes for the scandir() switch. |
Thank you Xavier for correcting the documentation. Here is minimal patch that switches the glob module to scandir(). |
Results of microbenchmarks for glob_scandir_4.patch: $ ./python -m timeit -s "from glob import glob" -- "glob('**/*', recursive=True)"
Unpatched: 275 msec per loop
Patched: 79.4 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/*', recursive=True)"
Unpatched: 2.11 sec per loop
Patched: 624 msec per loop
$ ./python -m timeit -s "from glob import glob" -- "glob('/usr/lib*/**/', recursive=True)"
Unpatched: 1.79 sec per loop
Patched: 281 msec per loop |
Nice! Good to see scandir() getting used to speed up other functions, functions that folks are using a ton already, like glob(). |
Looks like switching to scandir will also resolve http://bugs.python.org/issue22167 without further action. |
Oh well. Too bad. Never mind then. |
The patch is completed now. The with statement is used to avoid FD leaks. |
Who can review Serhiy's patch? P.S. core of the patch seems good to me, but I'm not an expert on stdlib. |
I think the change in general is approved by GvR. If there are some implementation bugs, we can fix them later. |
New changeset cb7ee9d9cddd by Serhiy Storchaka in branch 'default': |
Thanks Serhiy! --Guido (mobile) On Sep 6, 2016 12:35 PM, "Roundup Robot" <report@bugs.python.org> wrote:
|
New changeset 48e573e0a610 by Serhiy Storchaka in branch 'default': |
@serhiy please comment the implications / limitations of the fallback on Windows. Is it that scandir cannot handle bytes argument only? |
os.scandir() cannot handle bytes argument on Windows. If an argument is string, os.scandir() yields entries with string names, if an argument is bytes object, os.scandir() yields entries with bytes names. Opened bpo-27998 for adding support of bytes paths in os.scandir(). After this will be implemented, a workaround cb7ee9d9cddd can be reverted. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: