cmd: add hashSUM file support #5352
Good stuff! |
All variants follow the Notes:
Find beta build here: Integration tests pass for me. |
@ncw Please review! |
This is a very nice bit of work :-) I'll comment on your initial doc - I've deleted everything I agree with which is most of it! I'm very fond of storing MD5SUM files at the root of my archives so this is great for that!
I'll just note there is another more sophisticated format invented by the BSDs which you enable with the
A PR to list everything in some defined order would be great! It will slow down listings I think but it could apply to anything which calls
Seems OK
Though
Great
Super
Is there anything this command does that can't be done with 1-3,5? Reading below, it is the I wonder whether this command should be able to generate the checksum files too? Effectively obsoleting the badly named
I think I prefer
Great
Decisions, decisions!
Nice.
I see...
OK Except for how do they report with --progress - to the log?
I think |
Fantastic work :-)
I put some comments inline but I think this is nearly ready to merge :-)
This reverts commit 6b51dda.
This is not present on the master branch. Probably
Feels too long IMO.
If output should go to stdout, it will not slow down... It will postpone everything up until the very end when we sort accumulated buffer and spit it out. If output should go to a file, it will not slow down either. I will submit a PR with more details :-)
Then there should be a way to "drop" the 2nd argument. Before trying to imagine it, let's see how things will look right after this PR is merged:
Back to future... checksum dumping sums:
2-args or I assume that making "checksum generate the checksum files" is just an idea for future. |
@ncw Please take a look at inline threads and mark as resolved if you are satisfied. Postponed changes were copied to #2749 (comment) Hashing ideas for future: #949 (comment), #157 (comment), #626 (comment) Summary of introduced command forms after 1st review:
I assume that making "checksum generate the checksum files" is just an idea for the future. Looking forward to merging this. PTAL 🙏
@ncw PTAL |
@ncw IMHO this can be merged for 1.56 |
IMHO this can be merged for 1.56
Great!
I think there are some commits that need removing from the commit log - these are noted in our discussion, but other than those this looks great!
So please remove those commits and merge :-)
Thanks
Nick
Squashing to a single commit. The reverted commits will annihilate each other.
Am I reading this right that it's not possible to supply rclone with your own already existing file of SUMs?
@tb582 you can now, either with |
Currently rclone check supports matching two file trees by sizes and hashes. This change adds support for SUM files produced by GNU utilities like sha1sum.
Fixes rclone#1005
Note: checksum by default checks, hashsum by default prints sums. New flag is named "--checkfile" but carries hash name.
Summary of introduced command forms:
```
rclone check sums.sha1 remote:path --checkfile sha1
rclone checksum sha1 sums.sha1 remote:path
rclone hashsum sha1 remote:path --checkfile sums.sha1
rclone sha1sum remote:path --checkfile sums.sha1
rclone md5sum remote:path --checkfile sums.md5
```
What is the purpose of this change?
Currently `rclone check` supports matching two file trees by sizes and hashes.
However, rclone does not support so called hash-SUM files.
These are produced by the de-facto standard family of GNU core utilities like sha1sum.
Use Cases
SUM files provide end-to-end checksums, an invaluable tool where storage sums don't exist.
They serve the following use cases:
File Naming
There exist two widely used naming schemes for SUM files:
- `MD5SUM`, `SHA1SUM`, etc - uppercase name without extension carrying the hash digest type right in the name, usually located right beside the files to check.
- `something.md5`, `somethingelse.sha1`, etc - having hash type in extension.
File Format
The SUM file format is very simple, has been stable for years and is supported by a wide range of software. However, there is no official standard. The GNU man pages (md5sum et al) are the de-facto standard.
SUM files are line oriented. They don't record the hash digest type - it should be known in advance.
Each line contains:
- `*` (asterisk, see below)
The asterisk modifier is a legacy artifact of GNU coreutils builds for DOS/Windows where it used to denote that the file is binary (with Unix line endings). It means nothing on Unix. The proposed implementation of SUM file support simply ignores asterisks and treats all files as binary (no DOS line feed translation), following the SUM file parser in the Go coreutils port.
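The line layout can be sketched in a few lines of Python. This is only an illustration, not the parser from this PR or from the Go coreutils port. It assumes a hex digest followed by exactly two separator characters (a space, then either a space or `*`), with everything after that taken as the file name. The names `parse_sum_line` and `parse_sum_text` are hypothetical. Lowercasing the digest and silently skipping unparseable lines are choices made for the sketch.

```python
import re

# Hex digest, one space, one mode character (space or '*'), then the name.
# Anything after the 2-character separator, including extra leading spaces,
# is treated as part of the file name.
_LINE_RE = re.compile(r"^([0-9A-Fa-f]+) [ *](.+)$")


def parse_sum_line(line):
    """Return (digest, name) for one SUM line, or None if it does not match."""
    m = _LINE_RE.match(line.rstrip("\r\n"))
    if m is None:
        return None
    # Assumption of this sketch: digests are compared case-insensitively,
    # so they are normalized to lower case here.
    return m.group(1).lower(), m.group(2)


def parse_sum_text(text):
    """Parse a whole SUM file into a {name: digest} dict.

    Lines that do not match are skipped (an assumption of this sketch).
    The asterisk is ignored: every file is treated as binary.
    """
    sums = {}
    for line in text.splitlines():
        parsed = parse_sum_line(line)
        if parsed is not None:
            digest, name = parsed
            sums[name] = digest
    return sums
```

Because the format does not record the digest type, the caller still has to know which hash the file contains, for example from the file name.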
The "distance" between digest and file name must always be 2 characters, be it `" *"` for binary files or `"  "` (two spaces) for text files. The 3rd space, if any, will belong to the file name.
Note: I could not find any real world use of the `" |"` modifier mentioned by @klauspost in #157 (comment).
Unimplemented Features
Besides ignoring asterisks, this implementation will not:
These features could be implemented by future PRs, but I guess nobody will ever need them.
Note that integration with rclone `fs.Hash` API (e.g. using SUM files as a transparent hashsum source or cache) is also out of scope for this PR.
Making Sum Files
rclone can produce valid SUM files with command
rclone hashsum HASH REMOTE:PATH --output-file FILE [--download]
(one minor feature that I miss sometimes is sorting output by file name - one day I'm gonna submit a PR)
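For illustration only, here is a small Python sketch of what producing a SUM file with name-sorted output could look like for a local directory. It is not how rclone implements `hashsum`. The function name `make_sum_lines` is hypothetical, and the sketch assumes the two-space separator and forward-slash relative paths.

```python
import hashlib
import os


def make_sum_lines(root, hash_name="sha1"):
    """Return SUM lines ("digest  relative/path") for all files under root,
    sorted by relative path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            h = hashlib.new(hash_name)
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files are not loaded at once.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            # Use forward slashes so the file is portable across platforms.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, h.hexdigest()))
    # Everything is accumulated first and sorted at the very end.
    entries.sort()
    return [f"{digest}  {rel}" for rel, digest in entries]
```

Writing `"\n".join(make_sum_lines("/files")) + "\n"` to a file named `SHA1SUM` would give a file in the format described above.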
Standard coreutils normally print SUM data to STDOUT, so the file can be produced by redirection: `sha1sum /files > SHA1SUM`. To check a file tree against a SUM file, use the `-c` flag, like `sha1sum -c SHA1SUM /files`. The hash digest used is defined by the name of the executable but not recorded anywhere in the sum file. This pull request handles only the check case above.
CLI: Constraints
This patch tries to enroll the `-c` GNU convention into the rclone CLI flag family respecting the following constraints:
- `-c` short flag is already taken, so we use upper-case `-C` to feel as close as possible.
- `--checksum` long flag is also taken, so we use `--checkfile` (no dash since it's a noun).
Probably `--sum-file` would be closer to the file format name, but the first letter `c` better correlates with the short flag. See a note below.
CLI: Variants
The patch introduces a single internal function `operations.CheckSum` and the following command-line interfaces for it:
- `hashsum` command gains new flag `--checkfile SUMFILE`, e.g.: `rclone hashsum CRC32 -C sumfile.crc32 remote:path/to/files`
- `sha1sum` and `md5sum` also gain this flag, e.g.: `rclone sha1sum --checkfile SHA1SUM remote:path/to/files`
- `checksum HASHNAME SUMFILE REMOTE:PATH` is added, e.g.: `rclone checksum md5 remote:path/to/MD5SUM remote:path/to --exclude MD5SUM`
- `check` command gains a new form with 3 arguments, e.g.: `rclone check HASHNAME SUMFILE REMOTE:PATH`
- `check` command in the current 2-argument form gains the new flag `--checkfile HASHNAME`.
As the last case feels slightly out of sync, let me give an example:
All the variants are placed in separate commits within this PR so the reviewer can decide which are good for the final squash/merge and which are doomed.
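Whatever the CLI form, the underlying operation is the same: compare the digests listed in a SUM file with the digests of the files in a tree. A minimal Python sketch of that idea for a local directory follows. It is not the `operations.CheckSum` function from this PR. The names `file_digest` and `check_tree` and the three report categories are made up for the illustration. It only looks at names listed in the SUM file, so files present in the tree but absent from the SUM file are not reported.

```python
import hashlib
import os


def file_digest(path, hash_name):
    """Hash one local file in chunks and return the hex digest."""
    h = hashlib.new(hash_name)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_tree(sum_text, root, hash_name="sha1"):
    """Check files under root against SUM lines; return a report dict."""
    report = {"match": [], "differ": [], "missing": []}
    for line in sum_text.splitlines():
        sep = line.find(" ")
        if sep <= 0 or len(line) < sep + 3:
            continue  # not "digest + 2 chars + name"; skipped in this sketch
        # The 2 characters after the digest are the separator ("  " or " *").
        expected, name = line[:sep].lower(), line[sep + 2:]
        path = os.path.join(root, *name.split("/"))
        if not os.path.isfile(path):
            report["missing"].append(name)
        elif file_digest(path, hash_name) == expected:
            report["match"].append(name)
        else:
            report["differ"].append(name)
    return report
```

The hash name has to be passed in explicitly, which mirrors why every command form in this PR takes either a `HASHNAME` argument or a hash-specific command name.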
Notes
- `REMOTE:PATH` notation as well as be a local file (SUM files are frequently distributed within the same tree as the data).
- `rclone checksum` command supports all extended reporting features from `rclone check`, like `--combined`, `--missing-...` and so on.
- `md5sum`, `sha1sum` and `hashsum` commands do not support extended check reporting features. Instead, they work as if `--combined -` was supplied (i.e. all sigils are reported to stdout, unless `--progress` is enabled). This behavior is compatible with `-c` in GNU utilities.
- `--download` flag and respect `--checkers`.
- `--checkfile` option might be named as `--check-from` or `--check-by` (with short name `-C`), or just `--check` if not taken. I'm leaving the decision up to reviewer.
Was the change discussed in an issue or in the forum before?
Approved by #157 (comment)
Fixes #1005
Checklist