add doublestar '**' support #79
This needs tests that actually walk the filesystem, not only call MatchPrefix. Please add a test for the matched empty directory case that you changed as well.
The current implementation only changes IncludePatterns. Usually (e.g. for Dockerfile) IncludePatterns is completely unused and FollowPaths is used instead so that symlinks continue to work. It would be weird to have an exception like this. OTOH, I wouldn't like this to grow very large with multiple implementations. The decision to use what the Go stdlib provides was for simplicity.
if pattern == "**" || pattern == "**/" {
	return true, isDir
}
pattParts := strings.Split(pattern, string(filepath.Separator))
This looks like a function better avoided in a function called for every scanned file (times the number of patterns). Although I guess filepath.Walk isn't very memory efficient either.
Yeah, it is probably not the most efficient, but at least the cost is only paid when ** is used; otherwise it will be the same cost (plus a strings.Contains) as before.
I can look into adding some benchmarking tests to try to see what the relative cost is, and perhaps set a goal to beat in the future.
}
origDir := filepath.Join(root, dir)
if _, ok := seenDirs[origDir]; !ok {
	fi, err := os.Stat(origDir)
What is this stat call for?
The fn called later requires an os.FileInfo. The updated code will skip the fn call the first time we see a partially matched directory, so we lose the original stat results. Later, when we determine the directory actually needs to exist (via a non-partial match in a subdirectory), we re-stat the original directory so we can pass the correct os.FileInfo back to the fn callback. An alternate approach would be to capture the original stat results in a map to reuse in the case we find a non-partial match, but I was concerned about the memory cost of preserving all the stats for a large directory tree and figured the cost of re-statting the directory was a better trade-off.
I overlooked that fi was used and only saw the error handling (that I thought was already handled before).
> but I was concerned about the memory cost of preserving all the stats for a large directory tree
I guess you only need to keep a stack of the current parents and can reuse the space once walking one directory has been completed.
I will look into that, it seems plausible that we can just keep a small stack of FileInfo and then pop/push as we traverse the tree.
walker.go
if strings.Contains(pattern, "**") {
	// short-circuit for single "**"
	if pattern == "**" || pattern == "**/" {
		return true, isDir
The isDir handling here is quite confusing. Ideally, this should be done in a wrapper layer outside matchPrefix and let this function work on strings, not files. Seems like the meaning of the "partial" result here has changed.
The isDir usage here is probably not needed; adding Walk tests like you suggest should prove that out.
I didn't intend to change the meaning of partial results, it is just that partial is ambiguous in the context of ** since we don't know how deep the pattern should match.
For example: **/baz should return partial=true for foo/bar if bar is a directory (since bar might have children that match our pattern). But if bar is a file, then it cannot possibly contain children so we can return a negative non-partial match.
Thanks! Will add Walk tests.
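To make the **/baz example concrete, here is a minimal sketch of a string-only matcher with the directory check kept in a thin wrapper, as suggested in the review. It is not the implementation in this PR: matchSegments and matchPath are hypothetical names, paths are assumed to be slash-separated, and a pattern is assumed to match only the exact path (not everything beneath a matched directory).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchSegments works purely on strings. match reports whether name matches
// pat exactly; partial reports whether some path below name could still match.
func matchSegments(pat, name []string) (match, partial bool) {
	if len(pat) == 0 {
		return len(name) == 0, false
	}
	if len(name) == 0 {
		// The path is used up but pattern segments remain, so a deeper path
		// may match. It is already a match only if all that remains is "**".
		onlyStars := true
		for _, p := range pat {
			if p != "**" {
				onlyStars = false
			}
		}
		return onlyStars, true
	}
	if pat[0] == "**" {
		// "**" either matches zero segments or swallows one and stays active.
		m1, p1 := matchSegments(pat[1:], name)
		m2, p2 := matchSegments(pat, name[1:])
		return m1 || m2, p1 || p2
	}
	ok, err := path.Match(pat[0], name[0])
	if err != nil || !ok {
		return false, false
	}
	return matchSegments(pat[1:], name[1:])
}

// matchPath is the wrapper layer: only a directory can have children, so a
// partial result is dropped for regular files.
func matchPath(pattern, p string, isDir bool) (match, partial bool) {
	m, part := matchSegments(strings.Split(pattern, "/"), strings.Split(p, "/"))
	return m, part && isDir
}

func main() {
	fmt.Println(matchPath("**/baz", "foo/bar", true))      // bar is a directory
	fmt.Println(matchPath("**/baz", "foo/bar", false))     // bar is a file
	fmt.Println(matchPath("**/baz", "foo/bar/baz", false)) // baz is a file
}
```

With these assumptions, foo/bar as a directory gives match=false, partial=true; foo/bar as a file gives false, false; and foo/bar/baz gives match=true. The recursion on ** can be exponential for patterns with several ** segments, so a real implementation would likely want memoization or an iterative form.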
Yeah, in our buildkit/llb usage we almost entirely use I suppose another option might be to plumb a callback down via llb to fsutil so that we can provide our own filter logic? I imagine we would have to allow for an interface similar to the matchPrefix function here. Then we can provide the doublestar logic via a 3rd party library.
Another option would be to provide a
The regex Filter is an interesting idea. I have not had time today to work on the Walk tests (hopefully tomorrow). After I have some tests that match the doubleglob behavior I want I will try those tests against a regex filter list as well to make sure we can get the same results. I would imagine that
I guess in that case you would do
I have updated the tests to include Walk tests. Also I have added a Benchmark test to allow comparing some
I have not had time to look into regex filtering yet. I have also not been able to look into the cache/stack of
I have done more investigation, on both a regex filter and optimizing the os.Stat ... in the end, I think it is not worth pursuing. I first optimized the Next, I started looking into the implementation side of a regex matching and realized it was going to be "complex". Before I continued down that road I wanted to see what the best-case scenario was for performance. So I hacked
This is optimized to be best-case scenario for the benchmark tests and these are the results:
The optimized cheat is fractionally faster, but clearly the filesystem walking is consuming the vast majority of the time we are spending. Given that, I don't think it makes much sense to optimize
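For readers unfamiliar with the regex filter idea discussed in this thread, one possible shape is to translate each doublestar glob into an anchored regular expression and test the full relative path against it. The sketch below is an assumption-laden illustration, not the experiment described above: globToRegexp is a hypothetical helper, paths are slash-separated, ** is assumed to occupy a whole path segment, and character classes such as [a-z] are not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// globToRegexp (hypothetical) converts a slash-separated glob into an
// anchored regexp. Supported: "**/" (any number of leading directories),
// "**" (anything), "*" (anything but "/"), "?" (one non-"/" character).
func globToRegexp(glob string) (*regexp.Regexp, error) {
	var b strings.Builder
	b.WriteString("^")
	for i := 0; i < len(glob); {
		switch {
		case strings.HasPrefix(glob[i:], "**/"):
			b.WriteString("(?:.*/)?") // zero or more whole directories
			i += 3
		case strings.HasPrefix(glob[i:], "**"):
			b.WriteString(".*")
			i += 2
		case glob[i] == '*':
			b.WriteString("[^/]*")
			i++
		case glob[i] == '?':
			b.WriteString("[^/]")
			i++
		default:
			// Copy one literal rune, escaping regexp metacharacters.
			r, size := utf8.DecodeRuneInString(glob[i:])
			b.WriteString(regexp.QuoteMeta(string(r)))
			i += size
		}
	}
	b.WriteString("$")
	return regexp.Compile(b.String())
}

func main() {
	re, err := globToRegexp("**/*.go")
	if err != nil {
		panic(err)
	}
	for _, p := range []string{"main.go", "a/b/c.go", "a/b/c.txt"} {
		fmt.Println(p, re.MatchString(p))
	}
}
```

A filter like this only answers whether a complete path matches; it does not say whether a directory might contain matches, so the walker would have to descend into every directory, which is the brute-force aspect raised in the review.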
Hi @tonistiigi, could you take a look at this again?
I didn't suggest the filter method for performance. It is brute-force and therefore I don't think it is very optimal. My main issue is still that this is modifying If we can solve these issues I'm fine with continuing with the custom
@coryb you still working on this? (also looks like it needs a rebase)
Superseded by #108
Hi @tonistiigi, hoping for your thoughts on this PR
When dealing with llb.Local in buildkit I generally want to use doublestar ** syntax for larger projects. For instance, to collect the source for compiling Go code I want to use llb.Local(".", llb.IncludePatterns([]string{"**/*.go"})). Typically the workaround is to just skip the IncludePatterns but of course, that will invalidate the caching if any non-Go files change, which is less than ideal. The implementation is a bit tricky, but it is only used when we find a ** pattern, otherwise your original implementation is used.
Additionally there is a change in Walk that will prevent empty directories from being generated when those directories were only partial matches. Perhaps there is something I am missing here but if you have pattern foo/*/bar with directory:
you would end up with:
I would not expect the foo/a and foo/c directories to show up in the result since they don't match my pattern, so I have fixed this behavior. If this is intended behavior, I am curious what the reasoning is.
Trying to add this functionality for related issues/work in our HLB project
[cc @hinshun]