Support blake3 / b3sum as hash #7765

Open
nh2 opened this issue Apr 15, 2024 · 5 comments

Comments

@nh2
Contributor

nh2 commented Apr 15, 2024

Some parts of rclone, such as the SFTP checksum, currently support only md5sum and sha1sum. These are both very slow, necessarily sequential hashes.

BLAKE3 with b3sum is a tree hash and thus scales with CPUs, parallel single disk access (SSDs), and multi-disk array access (RAID, striped networked drives), e.g. > 6 GB/s single threaded from the official benchmark:

[Image: b3sum benchmark]

It would be nice if rclone could support b3sum as an alternative to md5sum and sha1sum.
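To illustrate why a tree-structured hash can use several CPUs while md5sum and sha1sum cannot, here is a minimal Go sketch. It is not BLAKE3: it is a simplified two-level scheme built on the standard library's SHA-256, where each block is hashed independently and the block digests are then hashed together. The function name `blockTreeSum` and the block size are hypothetical choices made for this example only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// demoBlockSize is an arbitrary, tiny block size chosen for illustration.
// It is not a value taken from BLAKE3 or from rclone.
const demoBlockSize = 4

// blockTreeSum hashes every block of data independently (one goroutine per
// block), then hashes the concatenation of the block digests.
// Assumption: blockSize > 0. A real implementation would bound the number
// of workers instead of starting one goroutine per block.
func blockTreeSum(data []byte, blockSize int) string {
	n := (len(data) + blockSize - 1) / blockSize
	digests := make([][sha256.Size]byte, n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := i * blockSize
			end := start + blockSize
			if end > len(data) {
				end = len(data)
			}
			// Each block digest depends only on its own bytes,
			// so the blocks can be processed in any order.
			digests[i] = sha256.Sum256(data[start:end])
		}(i)
	}
	wg.Wait()

	// The root digest is computed over the block digests in block order.
	root := sha256.New()
	for i := range digests {
		root.Write(digests[i][:])
	}
	return hex.EncodeToString(root.Sum(nil))
}

func main() {
	fmt.Println(blockTreeSum([]byte("hello world"), demoBlockSize))
}
```

Because no block digest depends on any other block, the blocks could also be read from different disks or different file offsets at the same time, which is the property that a purely sequential hash lacks.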

There are other planned uses of it in rclone, e.g.:

And rclone already indirectly depends on the blake3 Go package:

rclone/go.mod

Line 173 in cc3ae93

github.com/zeebo/blake3 v0.2.3 // indirect

@albertony
Contributor

albertony commented Apr 15, 2024

Do you specifically want sftp to support executing b3sum, the same way as md5sum and sha1sum, because you intend to use it for supporting checksums on copy etc. on this specific backend? Or do you suggest support for blake3 more in general? I think adding support in the rclone hashsum command would probably be relevant, as it can already be used with the sftp backend on the remote end as the checksum command, i.e. if neither md5sum nor sha1sum is available but having the rclone binary on the server is allowed.

I've seen the same as you. A while ago I played around with using this as the hash for the local filesystem backend in rclone, but did not get consistently better performance results that would have led me to finalize a PR for it. The IO contribution, caching etc. seemed to affect the results far more than the actual hash calculation. There might be niche cases where it could be relevant, though; I just didn't spend more time on it.

When speaking of hash performance, xxHash (XXH3) is also often part of the discussion, and is normally even faster - probably the fastest around currently? In contrast to blake3 it is not a cryptographic hash, and is therefore in another league sort of, however for file checksumming it may not be a requirement.
Edit: It was also briefly discussed in forum 3 years ago: https://forum.rclone.org/t/faster-non-cryptographic-hashing-algorithm-for-faster-file-comparison/23601

As a curiosity, some do even use a combination of both:

Ccache uses BLAKE3, a very fast cryptographic hash algorithm, for the hashing. On a cache hit, ccache is able to supply all of the correct compiler outputs (including all warnings, dependency file, etc) from the cache. Data stored in the cache is checksummed with XXH3, an extremely fast non-cryptographic algorithm, to detect corruption.

(https://ccache.dev/manual/4.9.html#_how_ccache_works)

@albertony
Contributor

I just updated my previous experimental implementation, and pushed a draft #7767, which will create beta builds at https://beta.rclone.org/branch/add-xxh-blake-hash/ in case anyone feels like testing it out.

@ncw
Member

ncw commented Apr 15, 2024

Having a tree-based hash is a very interesting idea, and one which, for example, the dropboxhash is emulating in a simplistic way. The rclone internals aren't currently optimized for tree-based hashes though; they expect sequential hashes. I'm not sure the Go interface supports non-sequential hashes.

However getting sftp to support b3sum will work well in conjunction.
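For context on the "sequential" point, the standard library's `hash.Hash` interface is an `io.Writer`: data is fed in as an ordered stream and there is no way to hand it a block at a given offset. The sketch below shows that streaming pattern with MD5 and SHA-1. The helpers `sumReader` and `sumString` are hypothetical names for this example and are not rclone functions.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"hash"
	"io"
	"strings"
)

// sumReader streams r through h in order and returns the hex digest.
// hash.Hash only exposes Write, so bytes must arrive sequentially.
func sumReader(h hash.Hash, r io.Reader) (string, error) {
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sumString picks a hash by name and digests s.
// Only "md5" and "sha1" are wired up in this sketch.
func sumString(name, s string) (string, error) {
	var h hash.Hash
	switch name {
	case "md5":
		h = md5.New()
	case "sha1":
		h = sha1.New()
	default:
		return "", fmt.Errorf("unsupported hash %q", name)
	}
	return sumReader(h, strings.NewReader(s))
}

func main() {
	for _, name := range []string{"md5", "sha1"} {
		sum, err := sumString(name, "hello")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(name, sum)
	}
}
```

Any hash that satisfies `hash.Hash` can be dropped into `sumReader`, but the caller still feeds it one ordered stream, so the parallelism of a tree hash would have to live inside the implementation or behind a different interface.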

I have a slight concern about sftp startup times. Lots of people use sftp without a config file which means that it probes for shells/supported hashes each time it is used. Perhaps we should delay hash support probing until it is asked for?
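One way to delay probing until it is asked for is to wrap the probe in a `sync.Once`, so the expensive remote check runs on the first request for supported hashes rather than at startup. This is only a sketch of the idea, not rclone's actual sftp backend code: the type `hashProber`, its fields, and the fake probe function are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// hashProber defers an expensive capability check until first use.
// probe stands in for running commands on the remote server.
type hashProber struct {
	once      sync.Once
	probe     func() []string
	supported []string
}

// Hashes returns the supported hash names, running probe at most once.
// It is safe to call from multiple goroutines.
func (p *hashProber) Hashes() []string {
	p.once.Do(func() {
		p.supported = p.probe()
	})
	return p.supported
}

func main() {
	p := &hashProber{
		probe: func() []string {
			fmt.Println("probing remote for hash commands...")
			// Hypothetical result of the probe.
			return []string{"md5", "sha1"}
		},
	}
	// Nothing has been probed yet; the cost is paid on the first call only.
	fmt.Println(p.Hashes())
	fmt.Println(p.Hashes())
}
```

With this shape, operations that never need checksums would not pay for the probe at all, at the price of a small delay on the first operation that does.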

@albertony
Contributor

I have a slight concern about sftp startup times. Lots of people use sftp without a config file which means that it probes for shells/supported hashes each time it is used. Perhaps we should delay hash support probing until it is asked for?

Good point, I agree we need to look into that if/when additional hash is added to sftp backend.

@albertony
Contributor

albertony commented Apr 16, 2024

On second thought... I assumed you meant it probes hashes on each NewFs or similar, but I don't think it does? Don't have an sftp server to test against right now. Edit: based on reading the code, and quick testing against rclone serve sftp, I think it probes for shell type, but not hashes.
