New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misbehavior in fdupes with tagged #244

Closed
amerlyq opened this Issue Jul 16, 2017 · 5 comments

Comments

Projects
None yet
3 participants
@amerlyq

amerlyq commented Jul 16, 2017

I abuse rmlint to search exported headers from build tree into ./build/staging_dir/usr/include.
General usage looks like this:

echo "${whole_list_of_package_headers}" | sort | find_dups ./build/staging_dir/usr/include 

Contents of find_dups is as follow:

cmd=( rmlint --hidden --types=df --output=fdupes --rank-by=pdlma --match-without-extension )

## Keep ./usr/include, remove package/*.h
cmd+=( -kM - // -- "$@" )   #1 BUG: -kM don't work as expected
# cmd+=( -Km -- "$@" // - ) #2 OK

## Keep package/*.h, remove ./usr/include
# cmd+=( -KM - // -- "$@" ) #3 OK
# cmd+=( -km -- "$@" // - ) #4 OK

exec "${cmd[@]}"

Results are next (some minor differences are visible only on big output, sorry for that)

#1

./build/jpeg-6b/jconfig.h

./build/staging_dir/usr/include/jconfig.h

./build/jpeg-6b/jpegint.h

./build/staging_dir/usr/include/jpegint.h

./build/jpeg-6b/jerror.h

./build/staging_dir/usr/include/jerror.h

./build/jpeg-6b/jmorecfg.h

./build/staging_dir/usr/include/jmorecfg.h

./build/jpeg-6b/jpeglib.h

./build/staging_dir/usr/include/jpeglib.h
#2

./build/staging_dir/usr/include/jconfig.h
./build/jpeg-6b/jconfig.h

./build/staging_dir/usr/include/jerror.h
./build/jpeg-6b/jerror.h

./build/staging_dir/usr/include/jmorecfg.h
./build/jpeg-6b/jmorecfg.h

./build/staging_dir/usr/include/jpegint.h
./build/jpeg-6b/jpegint.h

./build/staging_dir/usr/include/jpeglib.h
./build/jpeg-6b/jpeglib.h
#3

./build/jpeg-6b/jconfig.h
./build/staging_dir/usr/include/jconfig.h

./build/jpeg-6b/jpegint.h
./build/staging_dir/usr/include/jpegint.h

./build/jpeg-6b/jmorecfg.h
./build/staging_dir/usr/include/jmorecfg.h

./build/jpeg-6b/jerror.h
./build/staging_dir/usr/include/jerror.h

./build/jpeg-6b/jpeglib.h
./build/staging_dir/usr/include/jpeglib.h
#4

./build/jpeg-6b/jconfig.h
./build/staging_dir/usr/include/jconfig.h

./build/jpeg-6b/jpegint.h
./build/staging_dir/usr/include/jpegint.h

./build/jpeg-6b/jmorecfg.h
./build/staging_dir/usr/include/jmorecfg.h

./build/jpeg-6b/jpeglib.h
./build/staging_dir/usr/include/jpeglib.h

./build/jpeg-6b/jerror.h
./build/staging_dir/usr/include/jerror.h

I assume results #1 and #2 must be similar to each other, as do #3 and #4.
But you can see, that #1 is strictly different -- fdupes treats all files as completely equal!
Is it a bug, or fdupes spec requires that strange behaviour?

@SeeSpotRun

This comment has been minimized.

Contributor

SeeSpotRun commented Jul 17, 2017

Interesting; I never really considered -kM or -Km as valid options.

Case #1, -kM, should mean only report duplicates of untagged files but don't delete any tagged files. So it should only report untagged files that have untagged duplicates. In your case it shouldn't report any duplicates; instead it's reporting pairs of files but reporting both as originals, which is kind of correct.

Case #2, -Km shouldn't report any duplicates but unfortunately it is.

I think my preferred solution for this would be to make -kM and -Km invalid option combinations, unless you can convince me that these are useful combinations.

@amerlyq

This comment has been minimized.

amerlyq commented Jul 17, 2017

Two situations

  • Many files several destinations
  • Several files many destinations

Three usecases

  • Find which files from original list (subset) were exported to destinations
  • Find which destinations were used when exporting files
  • Compose pairs file--destination (per each duplicate with same content)

Options -k/-K/-m/-M/-S looked like solution exactly to my usecase (with little post-processing of fdupes output).
I don't have lingering attachment to these options, but man creates impression of them being independent.

       -k --keep-all-tagged / -K --keep-all-untagged
              Don’t delete any duplicates that are in tagged paths (-k) or that are in non-tagged paths (-K).  (Tagged paths are those that were named after //).
       -m --must-match-tagged / -M --must-match-untagged
              Only look for duplicates of which at least one is in one of the tagged paths.  (Paths that were named after //).
       -S --rank-by=criteria (default: pOma)
              ... Note: original path criteria (specified using //) will always take first priority over -S options.

Actual usage notes resulting from man of those three options:

  • whichever untagged or tagged list bigger -- it's supplied through stdin => therefore I respectively use -- "$@" // - or `- // -- "$@" form. However, I always prefer second form -- because only that form has "free tail w/o other syntax" for positional arguments. Which is more clean and neat style for bash and functional languages.
  • According to man whichever -S rank sorting is, tagged are always shown first, so if I use only second form as described above, anything supplied through arguments must be ordered first.
  • But using -KM with second form actually inverts this logic of -S, ordering untagged list from - first anyways (inconsistent man?)
  • So, in my both situations (whichever files or destinations have more files), using only my beloved second form - // -- "$@" and switching between -KM/-km I can control to output original files or destinations first.
    Problem resolved.

But, in the next step I assumed algorithms for searching/comparison (w/o delving into code, of course).
If my assumption correct, performance would be better if comparison subset for -m/-M is small.
In such way invalid (by your logic) forms -kM/-Km were formed.
-k/-Kstill controls which side of files/destinations to output, but -m/-M used to prefer known in advance smaller subset of files between sides.
It's only my assumption about algorithms performance. So, if my assumptions are irrelevant to actual code, those option combinations are better to remove as you wished to avoid confusion similar to mine (however then we don't need two different options -- one would be enough to represent -km/-KM).
Otherwise them must be fixed in suggestion of choosing lesser subset of files for comparison algorithms.

@amerlyq

This comment has been minimized.

amerlyq commented Jul 18, 2017

Actually, I'm beginning to incline to your proposal to simply prohibit usage of -kM/-Km.
Which leaves us with the only independently valid variants: -k/-K/-m/-M/-km/-KM.
It will be the most reasonable solution (of course, if the inner implementation itself auto-chooses smaller subset for comparison ops).

@SeeSpotRun

This comment has been minimized.

Contributor

SeeSpotRun commented Jul 22, 2017

Ok I'll go ahead and add the code to prohibit kM|Km

sahib added a commit that referenced this issue Nov 20, 2017

@sahib

This comment has been minimized.

Owner

sahib commented Nov 20, 2017

I'll added a check to prohibit -kM and -Km as of af47630. If tests passes, I'll close this ticket.

@sahib sahib closed this Nov 20, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment