RFE: child/parent operators #141

Open
intelfx opened this issue Jun 17, 2024 · 3 comments

Comments

@intelfx

intelfx commented Jun 17, 2024

It would be an interesting addition to the GNU find syntax to have some way to evaluate parts of the find expression in the context of a child/parent file.

If that's too confusing, a few examples in pseudo-find syntax with the proposed extension:

  1. Exclude all directories containing a CACHEDIR.TAG file:

    find . -type d -child \( -type f -name CACHEDIR.TAG \) -prune -or ...
  2. Find all directories that look like a Borg repository:

    find . -type d -child \( -type f -name config -execdir grep -q -Fx '[repository]' {} \; \) -child \( -type d -name data \)

If this syntax is infeasible to implement efficiently because it requires nested iterations in the general case, I can imagine another variant of this syntax:

find . -type d -child config \( -type f -execdir grep -q -Fx '[repository]' {} \; \) -child data \( -type d \)

In this case, the -child operator has two operands: (1) a string representing a specific child file name to examine, and (2) a subexpression that is evaluated at most one time against the specific file named by the first operand (or not at all if there is no such file).

@tavianator
Owner

This is an interesting idea, and the kind of thing that's come up before (e.g. #92), where you want to match directories or not based on their contents.

There are a couple tricks that currently work:

  • You can sometimes match the child itself and then print the parent path. E.g. to match directories that do contain CACHEDIR.TAG:

    tavianator@graphene $ bfs -name CACHEDIR.TAG -printf '%h\n'
    ./.cargo/registry
    ./.cache/fontconfig
    ./.cache/pipx

    To do further processing of these directories, you could use -printf '%h\0' | xargs -0 .... However, this doesn't help find directories that do not contain a matching file [1].

  • Use -exec bfs ... -exit 1 as a filter. This is super inefficient, but it works:

    tavianator@graphene $ bfs -type d -exec bfs -f {} -mindepth 1 -maxdepth 1 -name CACHEDIR.TAG -exit 1 \; -print
    .
    ./Desktop
    ./Downloads
    ...

    You could use other commands as filters too, e.g.

    tavianator@graphene $ bfs -type d -exec sh -c '! test -e "$1/CACHEDIR.TAG"' sh {} \; -print
    ...
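
As a rough sketch of the first trick with post-processing: the snippet below uses GNU find in place of bfs, purely as an assumption so that it runs where bfs is not installed (the expression is the same one shown above). The helper name `tagged_dirs` and the `du` step are illustrative only.

```shell
#!/bin/sh
# List every directory under "$1" that directly contains a CACHEDIR.TAG entry.
# Output is NUL-delimited so that unusual file names survive the pipeline.
# Assumes GNU find (or bfs) for -printf; tagged_dirs is a hypothetical name.
tagged_dirs() {
    find "$1" -name CACHEDIR.TAG -printf '%h\0'
}

# Example post-processing (illustrative): disk usage of each tagged directory.
# tagged_dirs . | xargs -0 du -sh
```

The NUL delimiter is what makes the `xargs -0` hand-off safe for paths containing spaces or newlines.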

This kinda reminds me of the :has() selector in CSS. It would be theoretically possible to implement

$ bfs -not -has \( -maxdepth 1 -name CACHEDIR.TAG \)

via a recursive bftw() call, and it could even share resources (ioq thread pool, open fd cache) with the parent call to be more efficient. But I'm not sure the complexity is worth it.

Footnotes

  1. You could use something like comm -z -23 <(bfs ... -print0 | sort -z) <(bfs ... -printf '%h\0' | sort -z) I guess, but that's pretty gross and still doesn't let you prune directories easily.
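
A minimal sketch of that set-difference idea, under the assumption that GNU find, sort and comm are available (for -printf and -z). It uses temporary files instead of process substitution so it also runs under plain sh; the helper name `untagged_dirs` is hypothetical.

```shell
#!/bin/sh
# Print (NUL-delimited) the directories under "$1" that do NOT directly
# contain CACHEDIR.TAG: all directories minus the parents of every tag file.
# Assumes GNU find/sort/comm; untagged_dirs is a hypothetical helper name.
untagged_dirs() {
    all=$(mktemp) && tagged=$(mktemp) || return 1
    # Both lists must be sorted in the same collation for comm to work.
    find "$1" -type d -print0 | LC_ALL=C sort -z > "$all"
    find "$1" -name CACHEDIR.TAG -printf '%h\0' | LC_ALL=C sort -z > "$tagged"
    # -23 keeps only the records unique to the first list (all directories).
    LC_ALL=C comm -z -23 "$all" "$tagged"
    rm -f "$all" "$tagged"
}
```

Subdirectories of a tagged directory still show up in the output, which is the pruning limitation mentioned above.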

@intelfx
Author

intelfx commented Jun 17, 2024

  • You can sometimes match the child itself and then print the parent path. E.g. to match directories that do contain CACHEDIR.TAG:

    tavianator@graphene $ bfs -name CACHEDIR.TAG -printf '%h\n'
    ./.cargo/registry
    ./.cache/fontconfig
    ./.cache/pipx

That's precisely what I'm doing now; however, it doesn't let me -prune these directories, unless I'm mistaken?

  • You could use other commands as filters too, e.g.

    tavianator@graphene $ bfs -type d -exec sh -c '! test -e "$1/CACHEDIR.TAG"' sh {} \; -print

Yes, -exec test ... is also something I tried, but the fork/exec overhead becomes pretty prohibitive.
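
For reference, a sketch of how such an external filter can be combined with -prune, written against plain find rather than bfs (an assumption for portability). It still pays one fork/exec per directory visited, so it illustrates the semantics rather than solving the overhead problem. The helper name `find_skip_cachedirs` is hypothetical.

```shell
#!/bin/sh
# Print everything under "$1", skipping (and not descending into) any
# directory that directly contains a CACHEDIR.TAG entry.
# find_skip_cachedirs is a hypothetical helper name.
find_skip_cachedirs() {
    # For directories, run the test; when it succeeds, -prune stops descent
    # and the -o branch (-print) is not evaluated for that directory.
    find "$1" -type d -exec sh -c 'test -e "$1/CACHEDIR.TAG"' sh {} \; -prune -o -print
}
```

The path is passed to `sh -c` as a positional argument rather than spliced into the script text, so unusual directory names are not interpreted by the shell.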


It would be theoretically possible to implement

$ bfs -not -has \( -maxdepth 1 -name CACHEDIR.TAG \)

via a recursive bftw() call, and it could even share resources (ioq thread pool, open fd cache) with the parent call to be more efficient

Nice! Yes, this is exactly what I was proposing. Thanks for the hint, I might actually try doing this, because the lengths I have to go to in order to work around the lack of this feature are not really pleasant.

But I'm not sure the complexity is worth it.

If that's too complex, perhaps the second, more limited form of this proposal (one that basically lets you do a -exec test ... in-process) would be acceptable?

$ bfs -not -has CACHEDIR.TAG \( -type f \)

@tavianator
Owner

I think a nicer middle-ground might be

$ bfs -exclude -has-child \( -type f -name CACHEDIR.TAG \)

which would behave semantically like

$ bfs -exclude -has \( -mindepth 1 -maxdepth 1 -type f -name CACHEDIR.TAG \)

-has-child avoids the exponential complexity of unrestricted -has and can be implemented without patching bftw() at all, I believe. The cost of the extra flexibility is an extra readdir(), but I think it's probably worth it. (We could even have the optimizer detect a non-wildcard -name and convert the readdir() into stat(".../CACHEDIR.TAG") if it makes a big difference.)
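
To make the proposed semantics concrete, here is a shell emulation of the two routes for a single directory. It is only a sketch: -has-child does not exist, the helper names are hypothetical, and -mindepth, -maxdepth and -quit are assumed to be available (they are extensions supported by GNU find and bfs).

```shell
#!/bin/sh
# Emulate the proposed  -has-child \( -type f -name NAME \)  for one directory.

# Route 1: scan the immediate children (roughly the readdir() approach).
# "$2" is treated as a -name pattern, so wildcards are allowed here.
has_child_scan() {
    [ -n "$(find "$1" -mindepth 1 -maxdepth 1 -type f -name "$2" -print -quit)" ]
}

# Route 2: for a literal (non-wildcard) name, test the single path directly
# (roughly the stat() shortcut). The -L check mirrors find's -type f, which
# does not match a symlink to a regular file by default.
has_child_stat() {
    [ -f "$1/$2" ] && [ ! -L "$1/$2" ]
}
```

For literal names both helpers should agree; only the scanning variant can handle patterns such as `'*.TAG'`.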

Btw for correct CACHEDIR.TAG detection you should also be checking that the contents of the file start with Signature: 8a477f597d28d172789f06886806bc55, according to https://bford.info/cachedir/, but that may not be worth it in practice.
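
A minimal sketch of that signature check in shell, assuming a `head` that supports `-c` (widely available, though not strictly POSIX). The helper name `is_cachedir_tag` is hypothetical.

```shell
#!/bin/sh
# Succeed if "$1" is a file whose contents begin with the CACHEDIR.TAG
# signature. Note that test -f follows symlinks.
# is_cachedir_tag is a hypothetical helper name.
is_cachedir_tag() {
    sig='Signature: 8a477f597d28d172789f06886806bc55'
    # Compare exactly as many leading bytes as the signature is long.
    [ -f "$1" ] && [ "$(head -c "${#sig}" "$1")" = "$sig" ]
}

# Example use as a find filter (illustrative):
# find . -name CACHEDIR.TAG -exec sh -c '...' sh {} \;
```

Like the other external filters in this thread, calling this per file costs a process each time, so it is probably only worth it when false positives matter.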

2 participants