--hash-unmatched seems to scan the whole dataset, like --hash-uniques #614

Open
intelfx opened this issue Mar 8, 2023 · 2 comments

intelfx commented Mar 8, 2023

rmlint version

dataset

I have a 30-something TB dataset that consists of ~20 TB of uniques and ~11 TB of size twins:

$ du -hs /mnt/data
32T     /mnt/data

$ find /mnt/data -type f -printf '%s\n' | sort | uniq -c | awk -c '
function bscalc(_in) { "bscalc -H " _in | getline _out; return _out; }
$1 == 1 { nr_uniqs += $1; size_uniqs += $1 * $2; }
$1 != 1 { nr_twins += $1; size_twins += $1 * $2; }
END { 
  printf "Uniques: total %d size %s\n", nr_uniqs, bscalc(size_uniqs);
  printf "Twins: total %d size %s\n", nr_twins, bscalc(size_twins);
}'
Uniques: total 202799 size 19.76 TiB
Twins: total 3074218 size 11.78 TiB

actual behavior

Basic rmlint invocation without --hash-unmatched (ignore --without-fiemap, it is only there to speed up preprocessing; the progress bars are trimmed from the output):

$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3034739 / found 33504 other lint)
Matching (100 dupes of 63 originals; 12058,91 GB to scan in 3067241 files, ETA:  7d 14h 55m 44s)
^C

Control rmlint invocation with --hash-uniques:

$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap --hash-uniques /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3237535 / found 33504 other lint)
Matching (7 dupes of 7 originals; 32301,25 GB to scan in 3270955 files, ETA: 108d  8h 40m 45s)
^C

Now, --hash-unmatched:

$ rmlint -T df,dd -j --progress -o pretty -c sh:handler=clone --hidden --without-fiemap --hash-unmatched /mnt/data
Traversing (3276566 usable files / 0 + 0 ignored files / folders)
Preprocessing (reduces files to 3237535 / found 33504 other lint)
Matching (7 dupes of 7 originals; 32301,25 GB to scan in 3270955 files, ETA: 120d  9h 31m 56s)
^C

expected behavior

Isn't --hash-unmatched supposed to scan only size twins (i.e. 12 TB at most)?
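
For reference, the expected upper bound can be computed the same way the awk pipeline in the dataset section does: group files by size and sum only the groups with two or more members. The following is a minimal sketch of that arithmetic, not rmlint code; `size_twin_bytes` and `cmp_size` are hypothetical helper names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two file sizes for qsort(). */
static int cmp_size(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sum the bytes of all files that share their size with at least one
 * other file ("size twins"). Sorts `sizes` in place.
 * Hypothetical helper for illustration; not part of rmlint. */
uint64_t size_twin_bytes(uint64_t *sizes, size_t n) {
    uint64_t total = 0;
    qsort(sizes, n, sizeof *sizes, cmp_size);
    for (size_t i = 0; i < n;) {
        size_t j = i + 1;
        while (j < n && sizes[j] == sizes[i]) {
            j++; /* extend the run of equal sizes */
        }
        if (j - i >= 2) {
            /* only groups with two or more files count as twins */
            total += (uint64_t)(j - i) * sizes[i];
        }
        i = j;
    }
    return total;
}
```

Files with a unique size contribute nothing to the total, which is why the figure should stay near the "Twins" line of the awk output rather than the size of the whole dataset.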

intelfx added a commit to intelfx/rmlint that referenced this issue Mar 8, 2023

intelfx commented Mar 8, 2023

I can make --hash-unmatched do what it says on the tin with this code, but it feels hacky:

rmlint/lib/shredder.c

Lines 839 to 842 in 675089d

if (!(group->num_files >= 2) && !group->session->cfg->hash_uniques) {
    // no hashing required (yet)
    return;
}

I wonder if there is something else subtly wrong in the code.


It appears that when --hash-unmatched is used in an unmodified rmlint, this condition is responsible for hashing all the single-file groups:

rmlint/lib/shredder.c

Lines 855 to 859 in 675089d

} else if(group->n_inodes == 1 && group->n_unhashed_clusters > 0 &&
          group->session->cfg->merge_directories) {
    /* special case of hardlinked files that still need hashing to help identify
     * matching directories */
    group->status = RM_SHRED_GROUP_START_HASHING;

Could someone please explain what exactly is being done here and what the idea behind this special case is?

intelfx added a commit to intelfx/rmlint that referenced this issue Mar 9, 2023
`group->n_inodes == 1` is also true for groups that simply consist of a
single file. This condition will cause all single-file groups to be
hashed if `--merge-directories` is also set.

Thus, check that we are actually dealing with a group of hardlinks (and
not just a single-file group that did not cause an early return because
we are also doing `--hash-unmatched`).

Fixes sahib#614.
intelfx added a commit to intelfx/rmlint that referenced this issue Mar 9, 2023
`group->n_inodes == 1` is also true for groups that simply consist of a
single file. This condition will cause all single-file groups to be
hashed if `--merge-directories` is also set.

Additionally, the whole `group->n_inodes == 1` condition is redundant
because not following on the branch means that `group->n_clusters == 1`
and therefore `group->n_inodes == 1`.

Thus, check that we are actually dealing with a group of hardlinks (and
not just a single-file group that did not cause an early return because
we are also doing `--hash-unmatched`).

Fixes sahib#614.
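
The ambiguity these commit messages describe can be illustrated with a small sketch. It assumes a simplified, hypothetical `Group` struct in which `num_files` counts every path (hardlinks included) and `n_inodes` counts distinct inodes; the real struct in lib/shredder.c has more fields and may count differently, so treat this as an illustration of the reasoning rather than the actual fix.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a shredder group; the real
 * struct in lib/shredder.c has many more fields. */
typedef struct {
    size_t num_files; /* assumed: paths in the group, hardlinks included */
    size_t n_inodes;  /* assumed: distinct inodes behind those paths */
} Group;

/* True for a lone file AND for a set of hardlinks to one inode,
 * so this test alone cannot tell the two cases apart. */
bool single_inode(const Group *g) {
    return g->n_inodes == 1;
}

/* True only when several paths share the one inode. */
bool is_hardlink_group(const Group *g) {
    return g->n_inodes == 1 && g->num_files >= 2;
}
```

Under these assumptions, a group holding a single unique file passes `single_inode` but not `is_hardlink_group`, which matches the observation that single-file groups were being sent to hashing.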

intelfx commented Mar 10, 2023

Disregard the comment above (the suggested fix is wrong), see proper analysis in the linked PR.
