Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd: add hashSUM file support #5352

Merged
merged 16 commits into from Jul 7, 2021
Merged

cmd: add hashSUM file support #5352

merged 16 commits into from Jul 7, 2021

Conversation

ivandeex
Copy link
Member

@ivandeex ivandeex commented May 22, 2021

What is the purpose of this change?

Currently rclone check supports matching two file trees by sizes and hashes.
However, rclone does not support so called hash-SUM files.
These are produced by the de-facto standard family of GNU core utilities like sha1sum.

Use Cases

SUM files provide end-to-end checksums, an invaluable tool where storage sums don't exist.
They serve the following use cases:

  • User-side "cache" of remote hashsums.
  • As a measure to detect bit-rot timely.
  • Tracking integrity across multiple data migrations.
  • Migration of ancient data corpus to incompatible provider.
  • And more...

File Naming

There exist two widely used naming schemes for SUM files:

  1. MD5SUM, SHA1SUM, etc - uppercase name without extension carrying the hash digest type right in the name,
    usually located right beside the files to check.
  2. something.md5, somethingelse.sha1, etc - having hash type in extension.

File Format

The SUM file format is very simple, has been stable for years and supported by a wide range of software. However, there is no official standard. GNU man pages (md5sum et al) is the de-facto standard.

SUM files are line oriented. They don't record the hash digest type - it should be known in advance.
Each line contains:

  1. Hash digest
  2. Space character
  3. Another space or * (asterisk, see below)
  4. File name, up to the line end.

The asterisk modifier is a legacy artifact of GNU coreutils builds for DOS/Windows where it used to denote that the file is binary (with Unix line endings). It means nothing on Unix. The proposed implementation of SUM file support simply ignores asterisks and treats all files as binary (no DOS line feed translation), following the SUM file parser in the Go coreutils port.

The "distance" between digest and file name must always be 2 characters, be it " *" for binary files or " " (two spaces) for ex text files. The 3rd space, if any, will belong to the file name.

Note: I could not find any real world use of the " |" modifier mentioned by @klauspost in #157 (comment).

Unimplemented Features

Besides ignoring asterisks, this implementation will not:

  • perform backslash unescaping (ancient hack to tackle extended unicode and whitespace in file names)
  • support DOS-related flags --binary and --text (we treat files as binary, use them as is and never translate EOLs)
  • handle logic of rarely used flags --strict and --zero from the GNU manual

These features could be implemented by future PRs, but I guess nobody will ever need them.

Note that integration with rclone fs.Hash API (e.g. using SUM files as a transparent hashsum source or cache) is also out of scope for this PR.

Making Sum Files

rclone can produce valid SUM files with command
rclone hashsum HASH REMOTE:PATH --output-file FILE [--download]
(one minor feature that I miss sometimes is sorting output by file name - one day I'm gonna submit a PR)

Standard coreutils normally print SUM data to STDOUT so the file can be produced by redirection sha1sum /files > SHA1SUM. To check file tree against a SUM file, use the -c flag like sha1sum -c SHA1SUM /files. The hash digest used is defined by the name of the executable but not recorded anywhere in the sum file. This pull request handles only the check case above.

CLI: Constraints

This patch tries to enroll the -c GNU convention into the rclone CLI flag family respecting the following constraints:

  1. The -c short flag is already taken, so we use upper-case -C to feel as close as possible.
  2. The --checksum long flag is also taken, so we use --checkfile (no dash since it's a noun).
    Probably --sum-file would be closer to the file format name, but the first letter c better correlates with the short flag. See a note below.

CLI: Variants

The patch introduces a single internal function operations.CheckSum and the following command-line interfaces for it:

  1. The hashsum command gains new flag --checkfile SUMFILE, ex.:
    rclone hashsum CRC32 -C sumfile.crc32 remote:path/to/files
  2. Commands sha1sum and md5sum also gain this flag, ex.:
    rclone sha1sum --checkfile SHA1SUM remote:path/tofiles
  3. New command checksum HASHNAME SUMFILE REMOTE:PATH is added, ex.:
    rclone checksum md5 remote:path/to/MD5SUM remote:path/to --exclude MD5SUM
  4. The check command gains a new form with 3 arguments, ex.:
    rclone check HASHNAME SUMFILE REMOTE:PATH
  5. The check command in the current 2-argument form
    gains the new flag --checkfile HASHNAME.

As the last case feels slightly out of sync, let me bring an example:

rclone check REMOTE1:PATH1 REMOTE2:PATH2     # match two file trees by hash/size
rclone check -C sha1 SUMFILE REMOTE:PATH     # check file tree against a SUM file by hash

All the variants are placed in separate commits within this PR so reviewer can decide which are good for final squash/merge and which are doomed.

Notes

  1. In all cases above the SUM file can point to a remote object using the REMOTE:PATH notation as well as be a local file
    (SUM files are frequently distributed within the same tree as the data).
  2. The new rclone checksum command supports all extended reporting features from rclone check,
    like --combined, --missing-... and so on.
  3. The md5sum, sha1sum and hashsum commands do not support extended check reporting features.
    Instead, they work as if --combined - was supplied (i.e. all sigils are reported to stdout, unless --progress is enabled).
    This behavior is compatible with -c in GNU utilities.
  4. All proposed commands support filtering, which can be useful if a sum file is located beside the files to be checked.
  5. All variants support the --download flag and respect --checkers.
  6. The --checkfile option might be named as --check-from or --check-by (with short name -C),
    or just --check if not taken. I'm leaving the decision up to reviewer.

Was the change discussed in an issue or in the forum before?

Approved by #157 (comment)
Fixes #1005

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-Q

@albertony
Copy link
Contributor

Good stuff!
How does the various command variants handle non-existent and added files? If checksum file contains information for a file that no longer exists in the remote, I'm guessing this will be reported by all variants? But what about the opposite, if remote contains a file that is not represented in checksum file, is there a difference between how, say hashsum and check commands, handle this?

@ivandeex
Copy link
Member Author

ivandeex commented May 22, 2021

How does the various command variants handle non-existent and added files? If checksum file contains information for a file that no longer exists in the remote, I'm guessing this will be reported by all variants? But what about the opposite, if remote contains a file that is not represented in checksum file, is there a difference between how, say hashsum and check commands, handle this?

All variants follow the check command convention, where a file absent in source tree is marked by the - sigil and a file absent on destination is marked by +, by declaring the sum file as source and the file tree under check as destination.
That's all.

Notes:

  1. SUM file indeed represents an ancient/faraway/gone source of a file tree.
  2. Filters can be used to prevent failure exit due to missing files or missing SUM digests.
  3. The flag --missing-on-src will output missing hashes, --missing-on-dst will output missing files

@ivandeex ivandeex requested a review from ncw May 22, 2021 19:58
@ivandeex
Copy link
Member Author

ivandeex commented May 24, 2021

Find beta build here:
https://github.com/ivandeex/rclone/releases/tag/v1.55.1-checksum06

Integration tests pass for me.

@ivandeex
Copy link
Member Author

@ncw Please review!

@ncw
Copy link
Member

ncw commented May 28, 2021

This is a very nice bit of work :-)

I'll comment on your initial doc - I've deleted everything I agree with which is most of it!

I'm very fond of storing MD5SUM files at the root of my archives so this is great for that!

SUM files are line oriented. They don't record the hash digest type - it should be known in advance.

I'll just note there is another more sophisticated format invented by the BSDs which you enable with the --tag command line switch. This does have the hash type in. However it is quite uncommon and I much prefer the simpler format you've described.

rclone can produce valid SUM files with command
rclone hashsum HASH REMOTE:PATH --output-file FILE [--download]
(one minor feature that I miss sometimes is sorting output by file name - one day I'm gonna submit a PR)

A PR to list everything in some defined order would be great! It will slow down listings I think but it could apply to anything which calls fs.ListFn

This patch tries to enroll the -c GNU convention into the rclone CLI flag family respecting the following constraints:

  1. The -c short flag is already taken, so we use upper-case -C to feel as close as possible.

Seems OK

  1. The --checksum long flag is also taken, so we use --checkfile (no dash since it's a noun).
    Probably --sum-file would be closer to the file format name, but the first letter c better correlates with the short flag. See a note below.

md5sum calls this --check if we want to engage prior art

-c, --check          read MD5 sums from the FILEs and check them

Though --checkfile is probably more general purpose. It could even be --checksumfile

The patch introduces a single internal function operations.CheckSum and the following command-line interfaces for it:

  1. The hashsum command gains new flag --checkfile SUMFILE, ex.:
    rclone hashsum CRC32 -C sumfile.crc32 remote:path/to/files

Great

  1. Commands sha1sum and md5sum also gain this flag, ex.:
    rclone sha1sum --checkfile SHA1SUM remote:path/tofiles

Super

  1. New command checksum HASHNAME SUMFILE REMOTE:PATH is added, ex.:
    rclone checksum md5 remote:path/to/MD5SUM remote:path/to --exclude MD5SUM

Is there anything this command does that can't be done with 1-3,5? Reading below, it is the rclone check outputs that make it unique.

I wonder whether this command should be able to generate the checksum files too? Effectively obsoleting the badly named rclone hashsum?

  1. The check command gains a new form with 3 arguments, ex.:
    rclone check HASHNAME SUMFILE REMOTE:PATH

I think I prefer --checkfile over this as in 5, then it leaves the
door open for rclone check to take multiple arguments.

  1. The check command in the current 2-argument form
    gains the new flag --checkfile HASHNAME.

Great

All the variants are placed in separate commits within this PR so reviewer can decide which are good for final squash/merge and which are doomed.

Decisions, decisions!

  1. In all cases above the SUM file can point to a remote object using the REMOTE:PATH notation as well as be a local file
    (SUM files are frequently distributed within the same tree as the data).

Nice.

  1. The new rclone checksum command supports all extended reporting features from rclone check,
    like --combined, --missing-... and so on.

I see...

  1. The md5sum, sha1sum and hashsum commands do not support extended check reporting features.
    Instead, they work as if --combined - was supplied (i.e. all sigils are reported to stdout, unless --progress is enabled).
    This behavior is compatible with -c in GNU utilities.

OK

Except for how do they report with --progress - to the log?

  1. The --checkfile option might be named as --check-from or --check-by (with short name -C),
    or just --check if not taken. I'm leaving the decision up to reviewer.

I think --checkfile is fine. Or possibly --checksumfile but maybe that is too long?

Copy link
Member

@ncw ncw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work :-)

I put some comments inline but I think this is nearly ready to merge :-)

fs/operations/check.go Show resolved Hide resolved
cmd/check/check.go Outdated Show resolved Hide resolved
cmd/check/check.go Outdated Show resolved Hide resolved
fs/operations/check.go Outdated Show resolved Hide resolved
fs/operations/check.go Show resolved Hide resolved
@ivandeex
Copy link
Member Author

ivandeex commented May 29, 2021

md5sum calls this --check if we want to engage prior art

-c, --check          read MD5 sums from the FILEs and check them

This is not present on the master branch. Probably make commanddocs has not been run for long. This PR reruns it for affected commands so this legacy will disappear from docs.
PR #5383 will rerun it for the rest.

I think --checkfile is fine. Or possibly --checksumfile but maybe that is too long?

Feels too long IMO.

A PR to list everything in some defined order would be great! It will slow down listings I think but it could apply to anything which calls fs.ListFn

If output should go to stdout, it will not slow down... It will postpone everything up until the very end when we sort accumulated buffer and spit it out.

If output should go to a file, it will not slow down either.
We just output like before and in parallel collect data in buffer.
In the end we sort it and overwrite yet unclosed file....

I will submit a PR with more details :-)

I wonder whether this command (checksum) should be able to generate the checksum files too? Effectively obsoleting the badly named rclone hashsum?

Then there should be a way to "drop" 2nd argument.
Maybe a 2-argument form of checksum or -?

Before trying to imagine it let's see how things will be looking right after this PR merged:

hashsum sha1 path                    # dump
checksum sha1 sumfile path           # check
hashsum sha1 -C sumfile path         # check 
hashsum sha1 path --output-file out  # dump

Back to future... checksum dumping sums:

hashsum sha1 path --output-file -       # dump to stdout
checksum sha1 path                      # dump to stdout
checksum sha1 path --output-file out    # dump to file
checksum sha1 - path                    # dump to stdout
checksum sha1 - path --output-file out  # dump to file

2-args or - or what?
Next PR or this PR?
Or save revolutions for later and keep things as is: hashsum dumps, checksum checks?

I assume that making "checksum generate the checksum files" is just an idea for future.

@ivandeex
Copy link
Member Author

ivandeex commented May 29, 2021

@ncw
I hope I addressed your review comments.

Please take a look at inline threads and mark as resolved if you are satisfied.

Postponed changes were copied to #2749 (comment)

Hashing ideas for future: #949 (comment), #157 (comment), #626 (comment)

Summary of introduced command forms after 1st review:

rclone check sums.sha1 remote:path --checkfile sha1    # note: flag is named "--checkfile" but carries hash name!
rclone checksum sha1 sums.sha1 remote:path             # checksum by default checks, hashsum by default prints sums
rclone hashsum sha1 remote:path --checkfile sums.sha1
rclone sha1sum remote:path --checkfile sums.sha1
rclone md5sum remote:path --checkfile sums.md5

I assume that making "checksum generate the checksum files" is just an idea for future.

Looking forward for merging this.

PTAL 🙏

@ivandeex ivandeex requested a review from ncw May 31, 2021 15:01
This was referenced Jun 7, 2021
@ivandeex
Copy link
Member Author

@ncw PTAL

@ivandeex
Copy link
Member Author

ivandeex commented Jul 7, 2021

@ncw IMHO this can be merged for 1.56

@ivandeex ivandeex mentioned this pull request Jul 7, 2021
5 tasks
@ivandeex ivandeex added this to the v1.56 milestone Jul 7, 2021
Copy link
Member

@ncw ncw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this can be merged for 1.56

Great!

I think there are some commits that need removing from the commit log - these are noted in our discussion, but other than those this looks great!

So please remove those commits and merge :-)

Thanks

Nick

@ivandeex
Copy link
Member Author

ivandeex commented Jul 7, 2021

squashing to single commit. reverted commits will annihilate.

@ivandeex ivandeex merged commit b40d9bd into rclone:master Jul 7, 2021
@ivandeex ivandeex deleted the pr-checksum branch July 7, 2021 15:34
@tb582
Copy link

tb582 commented Jul 13, 2021

am I reading this right that its not possible to supply rclone your own already existing file of SUM's ?

@ncw
Copy link
Member

ncw commented Jul 14, 2021

@tb582 you can now, either with rclone md5sum -C or rclone checksum - check out the latest beta. You can't yet supply checksums for use in a sync, but we are thinking about that.

negative0 pushed a commit to negative0/rclone that referenced this pull request Aug 13, 2021
Currently rclone check supports matching two file trees by sizes and hashes.
This change adds support for SUM files produced by GNU utilities like sha1sum.

Fixes rclone#1005 

Note: checksum by default checks, hashsum by default prints sums.
New flag is named "--checkfile" but carries hash name.
Summary of introduced command forms:

```
rclone check sums.sha1 remote:path --checkfile sha1
rclone checksum sha1 sums.sha1 remote:path             
rclone hashsum sha1 remote:path --checkfile sums.sha1
rclone sha1sum remote:path --checkfile sums.sha1
rclone md5sum remote:path --checkfile sums.md5
```
@ivandeex ivandeex mentioned this pull request Sep 9, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rclone md5sum: implement an Unix md5sum-like "-c" option
4 participants