New sort criteria (for original detection): "maximize link count" #196

Closed
Awerick opened this issue Nov 11, 2016 · 14 comments

@Awerick

Awerick commented Nov 11, 2016

For ensuring minimal/decreased disk usage, I would like to request a new sort criteria for the -S/--rank-by option: "maximize link count".
This could decrease disk usage when running rmlint on duplicates that are already hardlinked from another path (see examples below).
(The tool hardlink (for Debian-based distros) provides a similar capability with its -m/--maximize option; see its manpage.)

Simple example

  • Consider a path A that contains a file foo (at inode 1001).
  • Path B contains the same file foo twice: once hardlinked to the file foo in path A (i.e. the same inode number 1001, link count 2) and once as a separate copy (say inode 1002, link count 1).

Now consider that rmlint -c sh:hardlink is executed on path B.
(I consider -c sh:hardlink, i.e. the mode in which duplicates are replaced with hardlinks; however, the other use cases, e.g. removing duplicates, would probably benefit as well.)
Current behavior: It might happen that the separate file (at inode 1002) is treated as the original and both files are hardlinked to it. This is not optimal, because in the end the data still exists twice: once for the file in path A and once for the files in path B.
With the requested criteria: It would be better to treat the file with the "highest link count" as the original (the file at inode 1001), so that all files get hardlinked to it. That way the data exists only once on disk.
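For illustration, the simple example can be reproduced with plain shell tools (a hypothetical sketch, not rmlint itself; the temporary directory and GNU `stat -c %h` for printing link counts are my assumptions):

```shell
# Recreate the simple example: A/foo, one hardlink to it in B,
# plus a standalone copy in B.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/A" "$tmp/B"
echo data > "$tmp/A/foo"
ln "$tmp/A/foo" "$tmp/B/foo-linked"   # same inode as A/foo, link count 2
echo data > "$tmp/B/foo-separate"     # own inode, link count 1

# "Maximize link count" would pick B/foo-linked as the original,
# since its link count (2) is higher than foo-separate's (1):
stat -c '%h %n' "$tmp/B/foo-linked" "$tmp/B/foo-separate"
```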

Advanced Approach

The approach above, "keep the highest link count", may still not be optimal in all cases; a better approach would be to "maximize the link count". Advanced example:

  • Consider that path B contains the file foo four times: One time hardlinked to the file foo at path A (link count 2) and three times hardlinked "with each other" (link count 3).

If all files were hardlinked to the one with the "highest link count" (as suggested above), the data would still exist twice: once for the file in path A (link count 1) and once for the files in path B (link count 4).
The advanced approach avoids this: link the files in such a way that, in the end, the "link count is maximized". In our example: the data exists only once for all files in paths A and B (link count 5).
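The arithmetic of the advanced example can be checked in a shell (again a sketch with assumed paths, not rmlint output):

```shell
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/A" "$tmp/B"
echo data > "$tmp/A/foo"
ln "$tmp/A/foo" "$tmp/B/foo1"    # group 1: shared with A/foo (link count 2)
echo data > "$tmp/B/foo2"        # group 2: three names for one inode
ln "$tmp/B/foo2" "$tmp/B/foo3"
ln "$tmp/B/foo2" "$tmp/B/foo4"   # group 2 now has link count 3

# Linking everything to group 2 ("highest link count") would leave
# A/foo as a second copy of the data. Linking everything to group 1
# instead merges all five names onto a single inode:
for f in foo2 foo3 foo4; do
    ln -f "$tmp/B/foo1" "$tmp/B/$f"
done
stat -c %h "$tmp/A/foo"          # prints 5: the data now exists only once
```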

(Possible workaround with the current rmlint version: If all the files with the maximum link count are somehow guaranteed to be in one path, the flag separator // can be used to tag these as originals.)

Sorry for the long text, but I hope it makes the desired behavior (and the benefits) clear.

@sahib
Owner

sahib commented Nov 11, 2016

Hey @Awerick,

the simple approach sounds very reasonable to me, and it should be easy to implement.
This seems to be the same thing that hardlink is doing.

The advanced approach is probably out of scope for rmlint. Reasoning: I don't think it should traverse files outside of the paths it was directly given (which would be needed to find the other links to a given file, and would be a quite expensive operation, since / would need to be traversed to find them all).

@sahib
Owner

sahib commented Nov 11, 2016

I barely tested it, but I think e4a99c0 does what you want.
Will come back later with more testing.

Edit: Use this to test the feature:

$ rmlint -S H A/ B/

@Awerick
Author

Awerick commented Nov 11, 2016

Thanks for the feedback and the fast commit!

Regarding the advanced approach:
I totally agree that rmlint should not traverse any files outside the given paths!

However I think it could be implemented without the need to traverse any other paths:

  • For the duplicates found by rmlint (e.g. file foo / all files that share the same content), one can consider all duplicates that are hardlinked with each other as one "hardlink group": all duplicates in a "hardlink group" share the same inode number.

  • The goal of the advanced approach is: »Hardlink all found duplicates towards the "hardlink group" that yields the highest link count in the end.«
    To detect this group, we need to know how many hardlinks each "hardlink group" has outside the given paths.
    The "hardlink group" with the most "outside hardlinks" is the wanted group: if we link all duplicates towards this group, we achieve the highest possible link count!

  • The number of "outside hardlinks" can be computed without traversing any outside paths: we can simply subtract the number of files found in each "hardlink group" from the link count of that "hardlink group". Pseudocode idea:

    foreach hlgroup:
        hlgroup.outsidefiles = hlgroup.linkcount - hlgroup.foundfiles
    

    If the "hardlink group" with the largest hlgroup.outsidefiles is considered the original, the link count is maximized!

If I am not mistaken... ;-)
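For what it's worth, the per-group computation above can be sketched with GNU find and awk (a hypothetical helper, not rmlint code; `-printf '%i %n'` prints inode number and link count and is GNU-specific):

```shell
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/inside" "$tmp/outside"
echo x > "$tmp/inside/a1"; ln "$tmp/inside/a1" "$tmp/inside/a2"   # no outside links
echo y > "$tmp/inside/b1"; ln "$tmp/inside/b1" "$tmp/outside/b2"  # one outside link

# For every inode found inside the path: outsidefiles = linkcount - foundfiles
find "$tmp/inside" -type f -printf '%i %n\n' | awk '
    { nlink[$1] = $2; found[$1]++ }
    END { for (i in nlink) print i, nlink[i] - found[i] }'
# One line per hardlink group: 0 outside links for the a-group,
# 1 outside link for the b-group.
```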

@sahib
Owner

sahib commented Nov 12, 2016

Ah, it seems I misunderstood your approach then. Thanks for the detailed explanation; it looks promising.
This might be a bit harder to fit into the concept of -S, but I'll try to put together a prototype.

@sahib
Owner

sahib commented Nov 12, 2016

If I understood you correctly (that's never guaranteed...), 648c0ac should include the advanced approach as -S O. -S H is still there and may be useful for sorting by hardlinks when there are no outlyers (see example below). Also, I somehow like the OH in there. 😄 The changes are in the feature/outlyer-sortcriteria branch. Documentation and tests are still missing.

$ mkdir links/a links/b -p
$ echo x > links/a/foo
$ ln links/a/foo links/b/foo-from-a
$ echo x > links/b/foo
$ ln links/b/foo links/b/foo-link-1
$ ln links/b/foo links/b/foo-link-2
$ tree links
links
├── a
│   └── foo
└── b
    ├── foo
    ├── foo-from-a
    ├── foo-link-1
    └── foo-link-2

2 directories, 5 files
$ ./rmlint -S O links/b

# Duplicate(s):
    ls '/home/sahib/dev/rmlint/links/b/foo-from-a'    # link to highest outlyer.
    rm '/home/sahib/dev/rmlint/links/b/foo'
    rm '/home/sahib/dev/rmlint/links/b/foo-link-1'
    rm '/home/sahib/dev/rmlint/links/b/foo-link-2'

[...]
$ ./rmlint -S OH links

# Duplicate(s):
    ls '/home/sahib/dev/rmlint/links/b/foo'    # link to highest link count (no outlyers)
    rm '/home/sahib/dev/rmlint/links/b/foo-link-1'
    rm '/home/sahib/dev/rmlint/links/b/foo-link-2'
    rm '/home/sahib/dev/rmlint/links/a/foo'
    rm '/home/sahib/dev/rmlint/links/b/foo-from-a'

[...]

@Awerick
Author

Awerick commented Nov 12, 2016

Wow, you are quick – and everything seems to behave correctly!
Thank you!

(I have some more considerations for documenting/describing the behavior that I will comment on Monday or so.)

When looking at your first example, I fooled myself: I thought rmlint should link to links/b/foo – but no, linking to links/b/foo-from-a is of course correct! (edit: fixed wrong path)

In the example you used the option -S OH – however, -S O can already lead to the same result (since it is a tie, probably by chance?) – in fact it did for me :) – I think an even more convincing test to show the difference between H and O would be to run both on the same path, links/b.

Interesting: I intuitively thought that in -S OH the H would have no effect, but when I thought about it, I realized it has a different meaning and might randomly lead to a different result!

To understand and test the behavior of the new options, I came up with a setup where the options H, O, OH and HO all have a different meaning.
(In the following scenario H (and also O) has two possible results because the criteria results in a tie – so I set the second criteria for H to a (resp. for O to A) to ensure that the result differs from HO (resp. OH).)

Setup overview

  • file fooA*:
      • link count ("all"): 5, outside links: 0
      • should be considered original with sorting criteria Ha
  • file fooB*:
      • link count ("all"): 5, outside links: 1
      • should be considered original with sorting criteria HO
  • file fooC*:
      • link count ("all"): 4, outside links: 2
      • should be considered original with sorting criteria OH
  • file fooD*:
      • link count ("all"): 3, outside links: 2
      • should be considered original with sorting criteria OA
  • (Info:
      • sorting criteria H would randomly consider fooA or fooB as the original.
      • sorting criteria O would randomly consider fooC or fooD as the original.)

Create setup

$ mkdir -p links/inside links/outside
$ echo foo > 'links/inside/fooA (1)'
$ echo foo > 'links/inside/fooB (1)'
$ echo foo > 'links/inside/fooC (1)'
$ echo foo > 'links/inside/fooD (1)'
  # fooA: create another 4 (inside) hardlinks
$ ln 'links/inside/fooA (1)' 'links/inside/fooA (2)'
$ ln 'links/inside/fooA (1)' 'links/inside/fooA (3)'
$ ln 'links/inside/fooA (1)' 'links/inside/fooA (4)'
$ ln 'links/inside/fooA (1)' 'links/inside/fooA (5)'
  # fooB: create another 3 (inside) hardlinks + 1 outside hard link
$ ln 'links/inside/fooB (1)' 'links/inside/fooB (2)'
$ ln 'links/inside/fooB (1)' 'links/inside/fooB (3)'
$ ln 'links/inside/fooB (1)' 'links/inside/fooB (4)'
$ ln 'links/inside/fooB (1)' 'links/outside/fooB (5)'
  # fooC: create another 1 (inside) hardlink  + 2 outside hard links
$ ln 'links/inside/fooC (1)' 'links/inside/fooC (2)'
$ ln 'links/inside/fooC (1)' 'links/outside/fooC (3)'
$ ln 'links/inside/fooC (1)' 'links/outside/fooC (4)'
  # fooD: create                                2 outside hard links
$ ln 'links/inside/fooD (1)' 'links/outside/fooD (2)'
$ ln 'links/inside/fooD (1)' 'links/outside/fooD (3)'

Show setup

$ ls -ilR links/   # -i for inode number, -l for link count
links/:
[...]
10773 d[…]  2  […]  inside
10774 d[…]  2  […]  outside

links/inside:
[...]
10775  […]  5  […]  fooA (1)
10775  […]  5  […]  fooA (2)
10775  […]  5  […]  fooA (3)
10775  […]  5  […]  fooA (4)
10775  […]  5  […]  fooA (5)
10776  […]  5  […]  fooB (1)
10776  […]  5  […]  fooB (2)
10776  […]  5  […]  fooB (3)
10776  […]  5  […]  fooB (4)
10777  […]  4  […]  fooC (1)
10777  […]  4  […]  fooC (2)
10778  […]  3  […]  fooD (1)

links/outside:
[...]
10776  […]  5  […]  fooB (5)
10777  […]  4  […]  fooC (3)
10777  […]  4  […]  fooC (4)
10778  […]  3  […]  fooD (2)
10778  […]  3  […]  fooD (3)

Test four different sort criteria combinations for rmlint

(The ? in the following output was normally 1 for me, but for others it might also be 2, 3, 4 or 5.)

$ rmlint -S Ha links/inside/
[...]
ls '/home/awerick/links/inside/fooA (?)'
[...]

$ rmlint -S HO links/inside/
[...]
ls '/home/awerick/links/inside/fooB (?)'
[...]

$ rmlint -S OH links/inside/
[...]
ls '/home/awerick/links/inside/fooC (?)'
[...]

$ rmlint -S OA links/inside/
[...]
ls '/home/awerick/links/inside/fooD (?)'
[...]

Result of the Test

Great: When I ran the four tests, they all returned what they were supposed to! :-)

@sahib
Owner

sahib commented Nov 13, 2016

Good to hear it fits your use case.

I allowed myself to copy your test case as an automated one (see 16254c4). I'm still wondering if OH would be a good default whenever any link option is given that includes hardlinks. Regarding the (?) in the last box: you can fix that by appending an a to HO and OH. The rest of the output should already have a fixed ordering.

Thanks for your detailed and well-thought-out responses; I'm really not used to that.

@Awerick
Author

Awerick commented Nov 14, 2016

I'm glad the test case is usable for your test suite!

Using OH as a default: I am also not sure about it – probably in most cases the H does not make a big difference, with O already being the first criteria. However, if a lot of the files have a huge number of hardlinks, it might speed up the process...

Well, this is my first GitHub issue, so I wanted to be as understandable as possible – if I comment more regularly in the future, my comments might become a little rougher ;-)

@Awerick
Author

Awerick commented Nov 14, 2016

As promised, some thoughts and suggestions for the manual/documentation:

I think the manual now describes the behavior of O very well ("keep file with most outside hardlinks" or so)! But it still might be difficult for users to decide whether they want H or O, because the overlaps and differences between the two can be confusing...

Maybe some of the following remarks could be helpful in the manual to understand the effects of H and O more easily?

  1. Both options are actually only useful/relevant if the traversed path contains files with hardlinks (i.e. link count > 1).
  2. Especially when using the hardlink handler with -c sh:hardlink (but also with the default remove handler), the criteria O might be valuable, as it can lead to more freed disk space (= minimal disk usage) than H, because it considers whether a file is also hardlinked from outside the paths traversed by rmlint.
     (In the case of the hardlink handler (-c sh:hardlink), it might be helpful to understand that the criteria H only considers which file currently has the highest link count, whereas the criteria O maximizes the link count in the end.)
  3. (OH or OHa(m?) might be a good combination to consider when using -c sh:hardlink.)

Some details on point 2:

  • I realized that handlers other than the hardlink handler (-c sh:hardlink) might also benefit from O with minimal disk usage; this is why I mentioned the remove handler here.
  • The second part of point 2 (in parentheses) seems not to apply to the remove handler, but only to the hardlink handler.
  • I think the second part of point 2 might be helpful for users who know the formulation "maximizes the link count" from the hardlink manual and wonder whether H or O achieves this in rmlint. (Although I am not sure whether hardlink's option --maximize really implements the O criteria or actually only the H criteria.)

For dealing with hardlinked files in general (even without criteria O and H), maybe one of the following comments could also be useful in the manual?

  • For investigating which files are hardlinked, the (Linux) commands ls -i (shows file inode numbers) and find / -samefile FILE (lists all files with the same inode number as FILE) might be helpful.
  • Warning: When running rmlint on files that are also hardlinked from outside of the paths traversed by rmlint, keep in mind not to change those outside files. (Maybe this is covered by point 2 in the PROBLEMS section? Or is it too obvious/negligible to mention?)
  • rmlint's output of how many bytes would be saved by deleting the duplicates does not take into account whether some of the duplicates are hardlinked. (Or, advanced ;D: "have outside hardlinks".) (I am not sure whether this is something that can be calculated easily/reliably.)

@sahib
Owner

sahib commented Nov 15, 2016

  1. Especially when using the hardlink handler with -c sh:hardlink (but
    also with the default remove handler), the criteria O might be valuable as
    it can lead to more freed disk space (=minimal disk usage) than H, because
    it considers whether a file is also hardlinked from outside the paths
    traversed by rmlint.

(In case of the hardlink handler (-c sh:hardlink), it might be helpful
to understand that the criteria H only considers which file currently
has the highest link count, whereas the criteria O maximizes the link
count in the end.)

I tried to reword some things in the manpage. See 3584976.

  1. ("OH" or "OHa(m?)" might be a good combination to consider when using
    -c sh:hardlink.)

I changed the default to pOHma. The former default was pm.
See also the explanation in the commit.

  • I think, the second part of point 2 might be helpful for users that know
    the formulation "maximizes the link count" from the hardlink manual and
    wonder, whether H or O achieves this in rmlint. (Although I am not
    sure whether hardlink's option --maximize really implements the O
    criteria or actually only the H criteria.)

hardlink does the same as H does.

  • For investigating which files are hardlinked, the (Linux) commands ls -i
    (shows file inode numbers) and find / -samefile FILE (list all files with
    the same inode number as FILE) might be helpful.

I added it to the -c sh:hardlink explanation.

  • Warning: When running rmlint on files that are also hardlinked from
    outside of the paths traversed by rmlint, keep in mind to not change
    those outside files. (Maybe this is covered by point 2. in the
    PROBLEMS section? or is it too obvious/negligible to mention it?):

I extended the PROBLEMS section a bit.

  • The rmlint output how many bytes would be saved by deleting the
    duplicates does not take into account, if some of the duplicates are
    hardlinked. (Or advanced ;D : "Have outside hardlinks") (I am not sure
    whether this is something that can be calculated easily / reliably.)

Actually, rmlint does take hardlinks into account for the size calculation:

$ echo xxx > hello
$ ln hello world
$ rmlint hello world
...
==> In total 2 files, whereof 1 are duplicates in 1 groups.
==> This equals 0 B of duplicates which could be removed.

However, this does not work (yet) when having outside hardlinks.

$ echo xxx > hello
$ echo xxx > world
$ ln hello hello_link
$ ln world world_link
$ rmlint *_link
==> In total 2 files, whereof 1 are duplicates in 1 groups.
==> This equals 4 B of duplicates which could be removed.

This would require us to take the link count of each removed file into account.
I guess this would be doable (I'm too lazy to think it through right now), but the logic
might be a bit fiddly.

@Awerick
Author

Awerick commented Nov 15, 2016

Thanks again for your quick answers!

I changed the default to pOHma. The former default was pm.

Is this not a little bit radical? I think pm was a better default for the other handlers.

Maybe I misunderstood your previous comment ("I'm still wondering if OH would be a good default if any link option is given that includes hardlinks."). My answer ("I am also not sure about it...") only applied to the hardlink handler:

  • Only for the hardlink handler (-c sh:hardlink) I think O or OH might be a good default.

  • However in general (for the other handlers), pm is probably still the better default – even if hardlinked files are involved. I imagine the following example (no "outside hardlinks" are involved):

    fooA     -- oldest mtime (same content as fooB but not hardlinked to them)
    fooB (1) -- hardlinked with fooB (2)
    fooB (2) -- hardlinked with fooB (1)
    

    Previously the default behavior was to keep fooA due to the oldest mtime and remove all other duplicates. Now it would keep fooB (1), because it currently has the highest link count (although the remaining file will have link count 1 anyway), and remove the other two files. Probably in this case (with the remove handler being executed) the user does not care about the hardlinks/link count and would rather have the mtime considered, as with the former default.

hardlink does the same as H does [Sourcecode link].

Ah, very interesting – thanks for digging this up!

Actually, rmlint does take in account hardlinks for the size calculation: [...]

Oh yes! Probably I interpreted the result incorrectly when I tested it.

A rough idea of how outside hardlinks could be taken into account: every duplicate that has at least one outside hardlink (outside_hardlinks > 0) will not save any space when removed and can therefore be omitted from the size calculation.
However, I am not sure whether this is robust, especially for the other handlers, and correct for all edge cases.
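A minimal sketch of that rule in shell, mirroring the *_link example above (hypothetical accounting, not rmlint's actual code; `found_inside` stands for the number of links the scan saw for this inode):

```shell
set -e
tmp=$(mktemp -d)
echo xxx > "$tmp/hello"
ln "$tmp/hello" "$tmp/hello_link"

# Scan only hello_link (as with `rmlint *_link`): one file found inside,
# but its inode has two links, so one link lives outside the scan.
dup="$tmp/hello_link"
found_inside=1
outside=$(( $(stat -c %h "$dup") - found_inside ))
if [ "$outside" -gt 0 ]; then
    saved=0                        # removing this name frees no data
else
    saved=$(stat -c %s "$dup")
fi
echo "$saved"                      # prints 0 instead of the naive 4
```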

@sahib
Owner

sahib commented Nov 15, 2016

Is this not a little bit radical? I think pm was a better default for the other handlers.

Maybe. 😄

Maybe I misunderstood your previous comment [...]

No, you didn't misunderstand that, I just made up my mind afterwards.

I'm still thinking about it, but I kind of like it. The default sort criteria are a heuristic anyway; you can always
find cases where they won't work optimally. You seem to be more knowledgeable than most users, so
I suspect you'd change the default anyway to fit your dataset. My reasoning late yesterday was roughly this:

  • p is to maintain the path ordering.
  • OH might lead to a little less "fragmentation" by choosing files to delete that tend to have
    a lower outer hardlink count (or total hardlink count if tied).
  • m will have the same effect on files when there are no hardlinks involved. I consider this
    to be the "default" case for most average datasets like music or photo collections.
    This is really the core argument for me: adding OH only has an effect when hardlinks are involved.
  • a was added to maintain a stable ordering. Until now the ordering could vary if several files had
    the same mtime. No big deal, but as a default it's probably more intuitive if the ordering is fixed.

The developer's perspective is always a bit biased and different from a user's perspective.
Therefore I guess I give it a few days before I make that final decision.

A rough idea how outside hardlinks could be taken into account: Every duplicate that has at least one outside hardlink (outside_hardlinks > 0) will not save any space when getting removed and can therefore be omitted from the size calculation. However I am not sure whether this is robust, especially for the other handlers, and is correct for all edge cases.

True. 4cc9fa7 implements this approach. I'm not sure either if this approach handles all edge cases (well, I can't really think of one, but that's the problem with edge cases...), but it should be better than before.

@Awerick
Author

Awerick commented Nov 15, 2016

Ah ok :-)
It's interesting to have your reasoning about the sorting criteria defaults!

Well, if average users will benefit from the new default, this is certainly good. I just thought OH would rather be for hardlink-specific use cases. But as I only got to know rmlint very recently and only use its hardlink capabilities, I am probably (as you also mentioned) not the main target of the defaults 😄

But for changes to the defaults, feedback from other users/developers (possibly less biased) would probably be valuable.
(And of course: For users that might be affected by this or have to adapt their settings, it would be desirable to be informed about this early on / before the change gets released.)
(Edit: Ah, #198 probably addresses this already...)

Phew! All those different hardlink constellations, sorting criteria, handler modes and discussions of the last days can make one slightly dizzy %)
But I am really happy with how you addressed all my considerations and wishes! 😃
(Also, sorry if my comments were often so long, but I think this subject can be somewhat complex.)

@sahib
Owner

sahib commented Nov 16, 2016

Phew! All those different hardlink constellations, sorting criteria, handler modes and discussions of the last days can make one slightly dizzy %)

Better don't open an issue including symbolic links then. 😄 (j/k)

I think this issue has reached its end of life. Closing now.
If there are remaining issues, feel free to reopen.

@sahib sahib closed this as completed Nov 16, 2016