-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support recursive globs #58176
Comments
This is a feature I've wanted to use in too many times to remember. I've made a patch with an implementation, docs and a test. I've named the function rglob and tried to stay within the conventions of the glob package. |
I'm inclined to close this as a functional duplicate of http://bugs.python.org/issue13229 |
I'd say it's very close to a duplicate but maybe isn't so. If walkdir is added then rglob can be implemented using it. I'd say "rglob" to "walkdir" is like "urlopen" to "http.client". One is the stupid and simple function (that still has a bazillion use cases) and the other is the heavy lifting swiss army knife. "file_paths(filtered_walk('.', included_files=['*.py']))" is a lot longer than "rglob('*.py')". |
Agreed. |
A fair point indeed. To follow the shutil naming convention (rmtree, copytree, and likely chmodtree, chowntree), a more appropriate name might be "globtree". (Thanks to string methods, the 'r' prefix doesn't read correctly to me: what does "globbing from the right" mean?) |
Well, if you put it in the glob module, it doesn't have to follow the |
I can live with it either way - I just wanted to point out that our current examples of this kind of recursive filesystem access use a 'tree' suffix rather than an 'r' prefix. |
"file_paths(filtered_walk('.', included_files=['*.py']))" is a lot longer than "rglob('*.py')". It is, but is that a good enough reason to have both? It can also be achieved with just a bit more code using the simple That it adds additional maintenance burden on the coredevs goes without saying :-) Each such new burden should have a very good reason. To conclude, personally I'm -1 on this, especially if |
It is. globbing is a well-known operation that many people expect to be
Which "Python spirit" are you talking about? We have many high-level |
There is an alternative: supporting ** syntax, e.g. '**/.py', which should find all *.py files in the current directory and all descendents. At present glob('**/.py') is equivalent to glob('*/*.py'), but we would say this behavior was undefined and the new behavior would be a new feature. |
According to Wikipedia (http://en.wikipedia.org/wiki/Glob_%28programming%29) - "The noun "glob" is used to refer to a particular pattern, e.g. "use the glob *.log to match all those log files"". IOW, globbing is usually understood as the act of expanding a pattern to the files it matches. Nothing in that implies recursive traversal of a directory tree. On the other hand, os.walk and/or walkdir suggest that in their name.
There should be one -- and preferably only one -- obvious way to do it. Admittedly, we already have more than one, and a high-level tool is proposed with Nick's walkdir. Why add *yet another* high-level tool? |
Still, that's a common need. "I want all Python files in a subtree".
I don't know why "walk" is supposedly more recursive than "glob".
Because the walkdir spelling (IIUC) is longish, tedious and awkward. |
Google "walk directory". First hit is a Rosetta code page with
walkdir is a new module proposal. If its API is tedious and awkward, |
Feedback from Antoine on IRC about my syntax proposal: “The "**" meaning is not really universal like other quantifiers are. [...] (also, it would be quite harder to implement, I think)” That and the compat issue makes me go in favor of a new function. I’m not sure glob is the right place: when you use glob.glob, the search is rooted in the current directory, and you may have sub-directories in your pattern, e.g. 'Lib/*/main.py'. A function meaning “look for this file pattern recursively” would be IMO more at home in fnmatch. |
You still haven't explained what your problem is with the idea of an
walkdir is not yet a module proposal, there's not even a PEP for it, and This issue has a working patch for rglob(), which is a single, obvious, (and, yes, rglob() can be reimplemented using walkdir later, if there is |
The problem is that I prefer the walkdir approach, because it solves a Anyway, I'm not sure what else I can add to the discussion. I'm I've stated my preference, and I understand and respect yours. So |
I'm trying the patch and its behaviour is strange: >>> list(glob.rglob('setup.py'))
['setup.py']
>>> list(glob.rglob('setu*.py'))
[]
>>> list(glob.rglob('*/setu*.py'))
['./setup.py', './Mac/Tools/Doc/setup.py', './Tools/test2to3/setup.py', './Doc/includes/setup.py', './PC/example_nt/setup.py'] I can understand the first example (although that makes the documentation slightly incorrect, since you need an explicit "*" path component for the search to be recursive), but the second one looks straight wrong. |
It looks quite strange to me that '/' should be allowed in a function that recurses down directories (see my messages above). OTOH fnmatch is not really appropriate, contrary to my earlier feeling. (Restoring my title change: as my messages were apparently overlooked, I assume that Eli did not revert my change on purpose but by replying to older email) |
Oops, Éric, sorry about the title. I didn't even notice :) |
I think it's important to be clear on what the walkdir API aims to be: a composable toolkit of utilities for directory tree processing. It's overall design is inspired directly by the itertools module. Yes, it started life as a simple proposal to add shutil.filtered_walk (http://bugs.python.org/issue13229), but I soon realised that implementing this solely as a monolothic function would be foolish, since that approach isn't composable. What if you just wanted file filtering? Or depth limiting? Having it as a filtering toolkit lets you choose the exact filters you need for a given use case. walkdir.filtered_walk() is just an API for composing filtering pipelines without needing to pass the same information to multiple pipeline stages. However, along with that itertools inspired iterator pipeline based design, I've also inherited Raymond's preference that particular *use cases* start life as recipes in the documentation. A recursive glob is just a basic walkdir pipeline composition: >>> from walkdir import file_paths, include_files
>>> def globtree(pattern, path='.'):
... return file_paths(include_files(os.walk(path), pattern))
Since filtered_walk() is just a pipeline builder, the composition can also be written:
>>> from walkdir import file_paths, filtered_walk
>>> def globtree(pattern, path='.'):
... return file_paths(filtered_walk(path, included_files=[pattern])) That latter approach then suggests an alternative signature for globtree: def globtree(*patterns, **kwds):
kwds.setdefault("top", ".")
return file_paths(filtered_walk(included_files=patterns, **kwds)) >>> print '\n'.join(sorted(globtree('*.rst')))
./index.rst
./py3k_binary_protocols.rst
./venv_bootstrap.rst
>>> print '\n'.join(sorted(globtree('*.rst', '*.py')))
./conf.py
./index.rst
./py3k_binary_protocols.rst
./venv_bootstrap.rst On a somewhat related note, I'd also like to see us start concentrating higher level shell utilities in the shutil namespace so users don't have to check multiple locations for shell-related functionality quite so often (that is, I'd prefer shutil.globtree over glob.rglob). |
I think it's important to remember where we are coming from. Many people So I'm not against walkdir *per se*, but I'm -1 on the idea that walkdir
I think it's rather nice, but it should be available as a stdlib Recipes are really overrated: they aren't tested, they aren't
Well, if glob() already lived in shutil, this decision would be a |
We do have the option of aliasing glob.iglob as shutil.glob... |
Would it be feasible to deprecate the 'glob' module, moving its functionality to shutil? In some future Python version, then, the module can be removed. The same fate would go for fnmatch, I guess. There are too many modules lying around dealing with the same problems. On a related note, the doc of glob explicitly mentions that it is implemented with os.listdir and fnmatch. Similarly, *if* the recursive glob gets accepted it should be implemented with walkdir (once that's in). |
This discussion (particularly my final globtree recipe) made me realise that the exact same approach would greatly improve the usability of the all_paths, file_paths and dir_paths iterators in walkdir [1]. Accordingly, walkdir 0.4 will let you write a recursive grep for ReST and Python source files as: file_paths(top, included_files="*.py *.rst".split()) Scanning multiple directories will be as simple as: file_paths(dir1, dir2, included_files="*.py *.rst".split()) |
Thanks for the bug find Antoine, I worked surprisingly hard trying to make this right in more edge cases and while fixing it I noticed rglob/globtree has 3 options:
Note that absolute paths with/without wildcards don't have this ambiguity. In both rel/abs wildcards should match directories and files alike. Which option do you guys think would be best? I already have a fixed patch for option 1 and 3 but I'd rather hear your thoughts before I introduce either. P.s. another slight issue I ran into is the fact that fnmatch doesn't ignore os.curdir: >>> fnmatch.fnmatch('./a', 'a')
False |
I think it should be the existing glob.glob(). We won't introduce a new |
I don't have a strong opinion on "rglob vs glob" so whichever way the majority here thinks is fine by me. |
For "**" globbing see http://ant.apache.org/manual/dirtasks.html#patterns . If we extend pattern syntax of templates, why not implement Perl, Tcl or Bash extensions? |
On Sun, Apr 1, 2012 at 4:42 PM, Serhiy Storchaka <report@bugs.python.org> wrote:
They mention that "mypackage/test/ is interpreted as if it were mypackage/test/**" so that's not really an option. I'm pretty sure we should only recurse if "**" appears explicitly.
I'm not sure what you mean here but if it's that ##{} stuff then it should probably be discussed in a separate issue as it's not related to recursive globs. |
I added the doublestar functionality to iglob and updated the docs and tests. Also, a few readability renames in that module were a long time coming. I'd love to hear your feedback. |
So, anybody for or against this patch? I'd really like to see this feature make its way in... |
I think the feature is useful, but someone needs to review the patch. |
I'm looking at the tests and I don't understand why '**/bcd/*' should match 'a/bcd/efg/ha'. Am I missing something? |
Here is a patch which implements recursive globbing which conforms to Bash globbing with "globstar" option. For backward compatibility recursive globbing off by default and works only if new argument "recursive" is true (default is False). I am not sure this is a better variant. Possible the default should be True. '**' pattern is very unlikely in old code. However recursive globbing on arbitrary pattern and arbitrary tree is not safe, it can hang on recursive symlinks. The patch contains changes from bpo-16618. |
Patch updated for current tip. |
I should add a symlink loop detecting to _rlistdir() as Antoine advised me on IRC. |
In fact glob() is already protected against an endless recursion (in the same way as Bash). The level of recursion is simply limited by the maximum length of the path. So I did not change the implementation, I have just added a test for symlink loop. I also corrected the other new tests so that they should not fail on the platform without symlinks. |
In updated patch fixed warning/errors when ran with -b or -bb options. |
Oops, Python 3.4 has ** support in pathlib, but we missed Serhiy's patch for the glob module itself. We should resolve that discrepancy for 3.5 :) |
Could you make a review Nick? |
Mostly looks good to me, just two comments:
|
Thank you Nick.
Only consistency with other helper functions (glob0, glob1).
Of course. This is just a typo. |
Looks good to me! |
New changeset ff4b9d654691 by Serhiy Storchaka in branch 'default': |
Thank you for your review Nick. |
The test failed on a buildbot, I reopen the issue. ====================================================================== Traceback (most recent call last):
File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_glob.py", line 284, in test_selflink
self.assertIn(path, results)
AssertionError: '@test_23056_tmp_dir/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/file' not found in {'noodly2', '@test_23056_tmp-\udcff.py', '__pycache__'} |
New changeset 180f5bf7d1b9 by Serhiy Storchaka in branch 'default': |
Thank you Victor. The test was failed also when run it directly, omitting the
Now it is fixed. However perhaps we should consider as a bug if a test ran by regrtest doesn't |
=> yes, I opened the issue bpo-22390. |
Please review one word documentation change at #12918 to clarify that recursive glob ** follows symbolic links to directories. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: