Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add doublestar '**' support #79

Closed
wants to merge 1 commit into from
Closed

Conversation

coryb
Copy link
Collaborator

@coryb coryb commented May 8, 2020

Hi @tonistiigi, hoping for your thoughts on this PR

When dealing with llb.Local in buildkit I generally want to use doublestar ** syntax for larger projects. For instance, to collect the source for compiling Go code I want to use llb.Local(".", llb.IncludePatterns([]string{"**/*.go"})). Typically the workaround is to just skip the IncludePatterns but of course, that will invalidate the caching if any non-Go files change, which is less than ideal. The implementation is a bit tricky, but it is only used when we find a ** pattern, otherwise your original implementation is used.

Additionally there is a change in Walk that will prevent empty directories from being generated when those directories were only partial matches. Perhaps there is something I am missing here but if you have pattern foo/*/bar with directory:

foo/a/b
foo/c/d
foo/f/bar

you would end up with:

foo/a
foo/c
foo/f/bar

I would not expect the foo/a and foo/c directories to show up in the result since they don't match my pattern, so I have fixed this behavior. If this is intended behavior, I am curious what the reasoning is.

Trying to add this functionality for related issues/work in our HLB project

[cc @hinshun]

Copy link
Owner

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests that actually walk the filesystem, not only call MatchPrefix. Please add a test for the matched empty directory case that you changed as well.

The current implementation only changes IncludePatterns. Usually (eg. for Dockerfile) IncludePatterns is completely unused and FollowPaths is used instead so that symlinks continue to work. It would be weird to have an exception like this. Otoh, I wouldn't like this to grow very large with multiple implementations. The decision to use what go stdlib provides was for simplicity.

if pattern == "**" || pattern == "**/" {
return true, isDir
}
pattParts := strings.Split(pattern, string(filepath.Separator))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a function better avoided in a function called for every scanned file(times number of patterns). Although I guess filepath.Walk isn't very memory efficient either.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is probably not the most efficient, but at least the cost is only paid when ** is used, otherwise it will be the same cost (plus a strings.Contains) as before.

I can look into adding some benchmarking tests to try to see what the relative cost is, and perhaps set a goal to beat in the future.

}
origDir := filepath.Join(root, dir)
if _, ok := seenDirs[origDir]; !ok {
fi, err := os.Stat(origDir)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this stat call for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fn called later requires an os.FileInfo. The updated code will skip the fn call the first time we see a partial matched directory, so we lose the original stat results. Later when we determine the directory actually needs to exist (via non partial match in subdirectory) then we re-stat the original directory so we can pass the correct os.FileInfo back to the fn callback. An alternate approach would be to capture the original stat calls in a map to reuse in the case we find a non-partial match, but I was concerned about the memory cost of preserving all the stats for a large directory tree and figured the cost of res-stating the directory was a better trade off.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I overlooked that fi was used and only saw the error handling (that I thought was already handled before).

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I was concerned about the memory cost of preserving all the stats for a large directory tree

I guess you only need to keep a stack of the current parents and can reuse the space once walking one directory has been completed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look into that, it seems plausible that we can just keep a small stack of FileInfo and then pop/push as we traverse the tree.

walker.go Outdated
if strings.Contains(pattern, "**") {
// short-circuit for single "**"
if pattern == "**" || pattern == "**/" {
return true, isDir
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isDir handling here is quite confusing. Ideally, this should be done in a wrapper layer outside matchPrefix and let this function work on strings, not files. Seems like the meaning of the "partial" result here has changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isDir usage here is probably not needed, adding Walk tests like you suggest should prove that out.

I didn't intend to change the meaning of partial results, it is just that partial is ambiguous in the context of ** since we don't know how deep the pattern should match.

For example: **/baz should return partial=true for foo/bar if bar is a directory (since bar might have children that match our pattern). But if bar is a file, then it cannot possibly contain children so we can return a negative non-partial match.

@coryb
Copy link
Collaborator Author

coryb commented May 11, 2020

This needs tests that actually walk the filesystem, not only call MatchPrefix. Please add a test for the matched empty directory case that you changed as well.

Thanks! Will add Walk tests.

The current implementation only changes IncludePatterns. Usually (eg. for Dockerfile) IncludePatterns is completely unused and FollowPaths is used instead so that symlinks continue to work. It would be weird to have an exception like this. Otoh, I wouldn't like this to grow very large with multiple implementations. The decision to use what go stdlib provides was for simplicity.

Yeah, in our buildkit/llb usage we almost entirely use IncludePatterns rather than FollowPaths, I have not actually looked into the implementation of FollowPaths. I am also loath to add this sort of complexity in general, I wish it was handled in the stdlib, but I also really want to use this syntax to describe my inputs in llb. It is much more clear what I am intended with: **/*.go and **/*_test.go.

I suppose another option might be to plumb a callback down via llb to fsutil so that we can provide our own filter logic? I imagine we would have to allow for an interface similar to the matchPrefix function here. Then we can provide the doublestar logic via a 3rd party library.

@tonistiigi
Copy link
Owner

Another option would be to provide a Filter option that could take a full regexp and is just applied on every file with no requirement that the parent needs to match (and no optimization). This would probably be simpler but also force a new local every time this functionality is needed (that might be ok), compared to for example today where in Dockerfile all paths used in COPY are combined into single local.

@coryb
Copy link
Collaborator Author

coryb commented May 12, 2020

The regex Filter is an interesting idea. I have not had time today to work on the Walk tests (hopefully tomorrow). After I have some tests that match the doubleglob behavior I want I will try those tests against a regex filter list as well to make sure we can get the same results.

I would imagine that **/*.go might be replaced by .*[.]go$ but we would likely need the isDir logic so we can return partial results on directories. Although thinking more, it might have similar complexities in needing to understand the pattern so we can not recurse into unwanted directories. For instance if we have pattern foo/.*/.*[.]go$ then we don't want to recurse into bar/... (which might be very large and slow to recurse). I suppose we could rewrite the regex dynamically to something like (foo)?/(.*)?/(.*[.].go)?$ and then look for matches to see if any groups match (for partial) or all groups match (full match). Anyway, something to investigate.

@tonistiigi
Copy link
Owner

For instance if we have pattern foo/./.[.]go$ then we don't want to recurse into bar/...

I guess in that case you would do includepatterns=foo && filter=*.go

@coryb
Copy link
Collaborator Author

coryb commented May 13, 2020

I have updated the tests to include Walk tests. Also I have added a Benchmark test to allow comparing some */*/target vs **/target with varying tree depth/breadth. Surprisingly it seems the doublestar logic is on-par (a little slower) than the basic glob implementation:

go test -bench BenchmarkWalker
goos: darwin
goarch: amd64
pkg: github.com/tonistiigi/fsutil
BenchmarkWalker/[1]-target-12               4900            218188 ns/op
BenchmarkWalker/[1]-**/target-12            5067            230914 ns/op
BenchmarkWalker/[2]-*/target-12              190           6022956 ns/op
BenchmarkWalker/[2]-**/target-12             193           6352133 ns/op
BenchmarkWalker/[3]-*/*/target-12              6         177068504 ns/op
BenchmarkWalker/[3]-**/target-12               6         182745666 ns/op
BenchmarkWalker/[4]-*/*/*/target-12            6         191665884 ns/op
BenchmarkWalker/[4]-**/target-12               5         201749391 ns/op
BenchmarkWalker/[5]-*/*/*/*/target-12         18          68218440 ns/op
BenchmarkWalker/[5]-**/target-12              18          73289431 ns/op
BenchmarkWalker/[6]-*/*/*/*/*/target-12                       10         114470129 ns/op
BenchmarkWalker/[6]-**/target-12                               9         113529333 ns/op
PASS
ok      github.com/tonistiigi/fsutil    46.156s

@coryb
Copy link
Collaborator Author

coryb commented May 13, 2020

I have not had time to look into regex filtering yet. I have also not been able to look into the cache/stack of os.FileInfo to prevent the redundant os.Stat. I hope to get time for that tomorrow.

@coryb
Copy link
Collaborator Author

coryb commented May 14, 2020

I have done more investigation, on both a regex filter and optimizing the os.Stat ... in the end, I think it is not worth pursuing.

I first optimized the os.Stat to use a naive "keep everything in a map" solution then compared the benchmarks. There was no discernable difference (~0% statistical difference). So I don't think a more complicated solution keeping track of the stack of FileInfo's is warranted.

Next, I started looking into the implementation side of a regex matching and realized it was going to be "complex". Before I continued down that road I wanted to see what the best-case scenario was for performance. So I hacked matchPrefix to look for a magic cheat: string in the pattern and simply return the results we were looking for, no path inspection, matching etc:

func matchPrefix(pattern, name string, isDir bool) (bool, bool) {
	if strings.HasPrefix(pattern, "cheat:") {
		return isDir || strings.HasSuffix(name, "target"), true
	}
	...

This is optimized to be best-case scenario for the benchmark tests and these are the results:

$ go test -bench BenchmarkWalker
goos: darwin
goarch: amd64
pkg: github.com/tonistiigi/fsutil
BenchmarkWalker/[1]-target-12               4686            223169 ns/op
BenchmarkWalker/[1]-**/target-12            5096            225976 ns/op
BenchmarkWalker/[1]-cheat:-12               5839            228134 ns/op
BenchmarkWalker/[2]-*/target-12              207           5839738 ns/op
BenchmarkWalker/[2]-**/target-12             199           6041092 ns/op
BenchmarkWalker/[2]-cheat:-12                194           5839930 ns/op
BenchmarkWalker/[3]-*/*/target-12              7         170048875 ns/op
BenchmarkWalker/[3]-**/target-12               6         175851256 ns/op
BenchmarkWalker/[3]-cheat:-12                  7         159818388 ns/op
BenchmarkWalker/[4]-*/*/*/target-12            6         192166106 ns/op
BenchmarkWalker/[4]-**/target-12               5         203263314 ns/op
BenchmarkWalker/[4]-cheat:-12                  6         182275727 ns/op
BenchmarkWalker/[5]-*/*/*/*/target-12         16          62743312 ns/op
BenchmarkWalker/[5]-**/target-12              18          68184740 ns/op
BenchmarkWalker/[5]-cheat:-12                 19          63127306 ns/op
BenchmarkWalker/[6]-*/*/*/*/*/target-12                       12         106073111 ns/op
BenchmarkWalker/[6]-**/target-12                              10         109747113 ns/op
BenchmarkWalker/[6]-cheat:-12                                 10         104327820 ns/op
PASS
ok      github.com/tonistiigi/fsutil    62.559s

The optimized cheat is fractionally faster, but clearly the filesystem walking is consuming a vast majority of the time we are spending. Given that, I dont think it makes much sense to optimize matchPrefix and I think doublestar is probably the most straightforward solution (from a user perspective).

@hinshun
Copy link

hinshun commented Jun 11, 2020

Hi @tonistiigi, could you take a look at this again?

@tonistiigi
Copy link
Owner

I didn't suggest the filter method for performance. It is brute-force and therefore don't think very optimal. My main issue is still that this is modifying IncludePatterns that is unused by 99% of people using BuildKit that use FollowPaths instead to set the files that are needed. The other issue is that this is used together with copy implementation (in Dockerfile for example). So if this gets merged ** would work for transferring context there but not for actual copy implementation. With filter, it is a new option that is not used by Dockerfile until it wants to add new features as well.

If we can solve these issues I'm fine with continuing with the custom ** approach.

@thaJeztah
Copy link
Collaborator

@coryb you still working on this? (also looks like it needs a rebase)

@aaronlehmann
Copy link
Collaborator

Superseded by #108

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants