Add hasher backend and KV store #5587
Conversation
This is a great implementation @ivandeex. I was imagining that we'd patch all the backends but this is much better. I wonder whether we should support easy integration of this as nested remotes are a bit of a pain to set up... I had the idea that wrapping backends should support doing
I'll give it a quick once over now.
Reviewed :-)
I'd like to get all devs on the same page with releasing test binaries to users. At the moment if you push to a branch on the main repo you'll get betas built etc. What I'd like to do is make it so that if a team member makes a PR then this also makes a beta - what do you think of that idea? Ideally we'd make some actions bot which posted a link in the PR once it was built.
👍 👍 👍
👍 An API like
seems reasonable and straightforward in the case when the wrapping remote has an empty root relative to the inner remote (we could even stuff the chain builder right into good old
👍 FS chains with an empty wrapper root should be straightforward on the command line too.
👎 I can't imagine INI syntax for FS chains, even with an empty wrapper root. It probably needs TOML, not INI?
👎 I can't imagine what the extended remote path would be for wrappers with a non-empty root, like
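For illustration only, here is a purely hypothetical TOML rendering of a nested FS chain, showing why TOML's nested tables could express wrapper chains that flat INI sections cannot. This is invented syntax, not something rclone supports or that this PR proposes:

```toml
# Hypothetical: a wrapper chain with per-layer options expressed as
# nested TOML tables. Names and keys are illustrative assumptions.
[remotes.mychain]
type = "hasher"
hashes = ["md5", "sha1"]

# The inner remote is nested inline instead of referenced by name,
# which flat INI sections have no natural way to express.
[remotes.mychain.remote]
type = "crypt"
remote = "s3base:bucket/path"
```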
I will add docs tomorrow. I'm not nearly as lightning fast :)
I'd be happy to make a beta branch for bisync!
I don't know. I can just test out whatever you make.
Force-pushed from 55b93aa to 4709c9c.
I've given this a quick once over - looking great!
I'd like to have a discussion about the use of bolt. I'd like to only have one KV database in rclone (once we have disabled the cache backend). Is bolt the right one?
I was thinking that if you called wrapper.NewFs with remote="" it would say: ah, no remote, let's use root instead. This would enable this syntax to just work...
It would then decide, since its config remote was empty, to set remote = remote,inner_opt1=val2:mypath and root="", and then call remote.NewFs with inner_opt and root=mypath. So I don't think a new API is needed, as long as we are happy with that syntax.
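A rough Go sketch of that decision. The helper name and behaviour are illustrative guesses at the idea, not rclone's actual NewFs code:

```go
package main

import (
	"fmt"
	"strings"
)

// parseWrapperTarget is a hypothetical helper illustrating the idea:
// if the wrapper's configured remote is empty, interpret the root
// passed on the command line as the inner "remote:path" instead.
func parseWrapperTarget(configRemote, root string) (innerRemote, innerRoot string) {
	if configRemote != "" {
		// Normal case: wrap the configured remote, keep root as-is.
		return configRemote, root
	}
	// No configured remote: split root into "remote:" and "path".
	// A connection string like "remote,opt=val:path" splits at the
	// first colon, so the options stay with the remote part.
	if i := strings.Index(root, ":"); i >= 0 {
		return root[:i+1], root[i+1:]
	}
	// A bare path means the local filesystem.
	return "", root
}

func main() {
	r, p := parseWrapperTarget("", "remote,inner_opt1=val2:mypath")
	fmt.Println(r, p)
}
```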
Yes, that is a potential problem. So we are saying: if we had a hasher set up on /backup, how do we access the path /backup/current? What would hasher do if you set up a hasher for /backup and then tried to use /backup/current? Is that effectively a new hasher? The VFS cache is designed so that you can do that, but it is quite a heavy load to impose on all backends. Here are some ideas
Great :-)
I figured out how to do everything except work out if the PR is from a team member, but I'll carry on! [ivandeex] I intended to edit a quoted reply but edited your post by mistake. I tried my best to restore it.
Currently only hasher plans to use it. We could however make it a common API:
I'd be happy to factor the KV store away and make the backend just use an API.
Bolt is a satisfactory one.
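To make the discussion concrete, here is a minimal sketch of what a shared KV API could look like. The names (`Op`, `Bucket`, `DB`) and the in-memory map standing in for bolt are assumptions for illustration, not the PR's actual code:

```go
package main

import "fmt"

// Op is one atomic operation executed against the store.
type Op func(b Bucket) error

// Bucket is the minimal view an Op gets of the database.
type Bucket interface {
	Get(key string) (string, bool)
	Put(key, value string)
	Delete(key string)
}

// DB serializes Ops so callers never touch the engine directly;
// a real implementation would delegate to bolt transactions.
type DB struct {
	data map[string]string
}

func NewDB() *DB { return &DB{data: map[string]string{}} }

// Do runs one operation against the store.
func (db *DB) Do(op Op) error { return op(mapBucket(db.data)) }

type mapBucket map[string]string

func (m mapBucket) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m mapBucket) Put(k, v string)             { m[k] = v }
func (m mapBucket) Delete(k string)             { delete(m, k) }

func main() {
	db := NewDB()
	_ = db.Do(func(b Bucket) error {
		b.Put("file.txt", "sha1:deadbeef")
		return nil
	})
	_ = db.Do(func(b Bucket) error {
		v, _ := b.Get("file.txt")
		fmt.Println(v) // prints the cached value back
		return nil
	})
}
```

The point of funnelling everything through `Do` is that the engine (bolt or otherwise) stays an implementation detail behind one narrow interface.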
Hasher will jump in if you set up
No.
VFS jumps in when you
Currently, wrapping (transient) rclone backends only provide a means for wrapping other remotes. You are talking about "injecting" (or mixing in) behaviors of transient backends into other remotes. I didn't set such a goal; rclone lacks internal APIs for such mixins. It's out of scope for this PR.
Let me experiment with your sketch for some time. Maybe I can deliver a PR with a new rpath parser in 2-3 months. @ncw
Probably adding optional Update: |
:(
This should probably have a separate name for each user
Bolt is really the right solution for the VFS metadata - when there gets to be a lot of it, scanning lots of little JSON files is very slow. However, I didn't want to commit to Bolt at the time and was wary of the multiple-process limitations. Another place I'd like a KV database is doing syncs larger than memory - e.g. syncing two 20 million entry directories - without using 20 GB of RAM. I think your idea to stick it in
I was also looking at badger, however my main concern is longevity and stability rather than absolute performance.
Yes, let's discuss this later. I just wanted to share my ideas.
It would be a good idea to finish some of the things we've started. I've got a list of things too, sitting in branches here.
Alternatively we could use multiple bolt buckets per database, but the more files the better due to fine-grained locking. The "user" approach also calls for:
I also had an idea to use a
If we make it an API then "procedural users" will accept anything.
I'll create a dedicated ticket and copy the discussion there.
Very true.
I like the idea of one file per "purpose" if you see what I mean.
There is the stuff in the VFS cache which has had many patches and which we should probably start from! This could be factored out (see lines 78 to 100 in b35db61).
Interesting idea.
Agreed
👍
:-)
We can consider various db file naming schemes below
Note to myself: look at what was factored, in reply to
I think you know it better...
A quick note about the last two commits...
Let's merge this for 1.57
I see new backends as relatively low risk. This one is marked experimental so if we need to change the config/commands/interface the users have been warned!
Before this change we checked that features.ReadMimeType was set if and only if the Object.MimeType method was implemented. However this test is overly general - we don't care if Objects advertise MimeType when features.ReadMimeType is unset, provided that they always return an empty string (which is what a wrapping backend might do). This patch implements that logic.
Add bolt-based key-value database support. Quick API description: rclone#5587 (comment)
After testing concurrent calling of `kv.Start` and `db.Stop` I had to put more parts of these under the mutex to make results deterministic without Sleeps in the test body. It's safer, but it can potentially lock Start for up to 2 seconds due to `db.open`.
Merging
@ncw
@Sarke
What is the purpose of this change?
This PR adds a new `hasher` backend after the preliminary design proposal in #949 (comment). The actual design is described below and solves the following goals:

Getting Started
So you want to cache or otherwise handle checksums for an existing `myRemote:path` or `/local/path`?

1. Install patched rclone.
2. Read documentation.
3. Open `~/.config/rclone.conf` in a text editor and add new section(s).

The backend takes basically the following parameters:

- `remote` (like `alias`, required)
- `hashes` - comma separated list (by default `md5,sha1`)
- `max_age` - maximum time to keep a checksum value, e.g. `1h30m` or `30d`; `0` will disable caching completely, `off` will cache "forever" (i.e. until the files get changed)

UPDATE: `hash_names` was renamed to `hashes`; values can be separated by comma only (blanks are not allowed anymore).
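For example, a config section might look like this (the section name `Hasher2` and the base remote `myRemote:path` are illustrative values, not part of the PR):

```ini
[Hasher2]
type = hasher
remote = myRemote:path
hashes = md5,sha1
max_age = off
```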
Use it as `Hasher2:subdir/file` instead of the base remote. Hasher will transparently update the cache with new checksums when a file is fully read or overwritten, like:

The way to refresh all cached checksums (even those unsupported by the base backend) for a subtree is to re-download all files in the subtree. For example, use `hashsum --download` with any supported hashsum on the command line (we just care to re-read):

How It Works
`rclone hashsum` (or `md5sum` or `sha1sum`):
- if the file size is below `auto_size`, download the object and calculate the requested hashes on the fly
- cache entries are bound to the file `fingerprint` (including size, modtime if supported, and the first-found other hash, if any)

Other operations:
- `move` will update keys of existing cache entries
- `delete` will remove a cache entry
- `purge` will remove all cache entries underneath
- cache entries expire (if `max_age` is not `off`); this helps against bit-rot or database thrashing when a file got removed externally
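Putting the Getting Started setup and the workflow above together, a session might look like this (remote and path names are illustrative):

```sh
# First read computes checksums through the hasher remote and caches them
rclone hashsum md5 Hasher2:subdir

# Force a full re-read so even hashes unsupported by the base backend
# get refreshed in the cache
rclone hashsum sha1 --download Hasher2:subdir
```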
Pre-Seed from a SUM File
Hasher supports two backend commands: the generic SUM file `import` and the faster but less consistent `stickyimport`.

Instead of SHA1 it can be any hash supported by the remote. The last argument can point to either a local or an `other-remote:path` text file in SUM format. The command will parse the SUM file, then walk down the path given by the first argument, snapshot current fingerprints and fill in the cache entries correspondingly.

- Paths in the SUM file are treated as relative to `hasher:dir/subdir`.
- Use `--checkers` to make it faster. Or use `stickyimport` if you don't care about fingerprints and consistency.

`stickyimport` is similar to `import` but works much faster because it does not need to stat existing files and skips the initial tree walk. Instead of binding cache entries to file fingerprints, it creates sticky entries bound to the file name alone, ignoring size, modification time etc. Such hash entries can be replaced only by `purge`, `delete`, `backend drop` or by a full re-read/re-write of the files.

Cache Storage
Note: Details updated after recent KV redesign
Cached checksums are stored as `bolt` database files under `~/.cache/rclone/kv/`, one per base backend, named like `BaseRemote~hasher.bolt`. Checksums for multiple `alias`-es into a single base backend will be stored in a single database. All local paths are treated as aliases into the `local` backend (unless crypted or chunked) and stored in `~/.cache/rclone/kv/local~hasher.bolt`.

By default the database is opened in read-only mode and can be shared by multiple rclone instances. When a change is needed, rclone will reopen the database in exclusive read-write mode for a moment, then reopen it for sharing after the change, to let other instances access it.

The `bolt` engine shows good performance for databases with up to a million entries on ordinary average hardware. Reopening is cheap and takes less than a millisecond.

You can print or drop the database using custom backend commands:
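For example, assuming the illustrative `Hasher2` remote from above (command names follow the backend commands mentioned in this PR; check `rclone backend help hasher` for the exact set):

```sh
# Print the cached checksum entries for this remote
rclone backend dump Hasher2:

# Drop the cache database for this remote
rclone backend drop Hasher2:
```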
Boilerplate
Was the change discussed in an issue or in the forum before?
Fixes #949
Fixes #157
Fixes #626
Checklist