RWFileHandle.Flush error: failed to transfer file from cache to remote #2095
gocryptfs on top of rclone mount. I updated to current git and this started happening.
rclone: Version "v1.39-DEV" starting with parameters ["/usr/local/bin/rclone" "mount" "-vv" "--stats" "5m" "--allow-other" "--default-permissions" "--vfs-cache-mode" "writes" "--low-level-retries=200" "--bwlimit=250K" "--log-file=/var/log/rclone/rclone.log" "gd:" "/Drive/.remote"]
The text was updated successfully, but these errors were encountered:
Initial test on:
Moving a directory which contains a 2gb 7z archive that always reproduced the error went through just fine. I will leave it open though while I test a bit more because other directories produced the error randomly while moving big queues of directories containing smaller files with midnight commander.
What I noticed immediately though is a slight delay cding into directories on the remote with midnight commander, although no error in logs and network conditions are nominal. I can replicate this latency with ls in the shell. Something related to the caching of the file tree maybe?
EDIT: For clarification, there was always a tiny amount of latency there, now it's just a lot more noticeable. This the result of ls without arguments on a directory with 746 directories inside.
Unfortunately I did not take a measurement with the previous binary.
That is interesting, thanks for reporting it.
It is almost certainly to do with fc32fee where I've set the --attr-timeout to 0 by default. That may have been too conservative though (it is the most accurate setting but also the slowest), so I'm wondering whether the default should be maybe
Further testing on the original issue looks good. Can't reproduce so far.
On the latency side of things. The previous posted measurement was while uploading was going on (limited bw, nowhere near saturating upstream):
How dangerous is 100ms?
Great testing thank you.
What it means is that there is a 100ms window when the kernel is hanging onto a cached value where the real value has changed externally.
Yes that is correct.
No, you need to actually change the file to change the modification time. I think you are thinking of access time which rclone doesn't track.
It is no big deal if the modification time is wrong, what does cause problems is if the length is wrong. So the kernel thinks the file is 100 bytes long, but it is actually 200 bytes long after an update. Those 100 bytes get chopped off with no errors.
Do you think I should change the default to 100ms - that is a very small window of opportunity for badness to happen and it does increase the performance dramatically?
I was not thinking of access time but modification time indeed. By "any change whatsoever" I meant any write op. Commit did not describe the exact conditions for corruption or the kind of corruption and I was thinking that any write operation inside a directory (such as simply writing a new file) will change the modification timestamp of said directory. My main concern is corruption while operating on the remote via another instance of rclone running elsewhere which is something I am currently doing as I sync some directory trees between other linux systems.
Thank you for the clarification.
This is a tough question. In my eyes, data reliability > performance but this only affects files which are modified by different sources and the time window is rather small. Granted that the most common use case is a single mount for data storage and that the difference in performance is huge, it could make sense to change the default BUT on the other hand any user who decides to setup something more complicated will be oblivious of this issue. I am also assuming that this is impossible to handle as there's no mechanism similar to inotify etc provided by the remote APIs.
My suggestion would be to leave the default as it is, to ensure data reliability and add a "Performance Tweaks" section to the documentation that lists this parameter with a caveat clearly describing the limitation so someone can make an informed decision on their own on if they should change the default or not. Other performance related parameters could be listed there as well with any possible quirks.
For my use case, I will just isolate the directory trees used for syncing into their own rclone instance with the default 0 setting and use 100ms for everything else.
I just had an instance of the original error again. Unfortunately I had turned off -vv
rclone doesn't (yet) track directory modification times, so you are safe there.
If you are only uploading new files then I think everything will be fine. If you are changing existing files, then there are windows of opportunity to get inconsistent results. However I've been trying to minimise these as this is exactly one of the use cases I've been working on. https://beta.rclone.org is served from an rclone mount which mounts a storage bucket. I hope with the latest changes that I've reduced that windows down to zero (provided you don't set --attr-timeout) but since the cloud storage systems are only eventually consistent I expect there are still some ways you could get inconsistent data. These inconsistent results should only ever be temporary though.
I think I'm going to take your suggestion. I wrote this in the docs
You can use the flag --attr-timeout to set the time the kernel caches
The default is 0s - no caching - which is recommended for filesystems
If you set it higher ('1s' or '1m' say) then the kernel will call back
This is the same as setting the attr_timeout option in mount.fuse.
That looks like a different problem, as if the cached file was being written to while it was being uploaded (which shouldn't be possible) or possibly it was a genuine corrupted upload (unlikely).
Any chance you could replicate with with a
I did not pay full attention to the log as it looked like the same on first glance and frankly I was not expecting something new heh. I did eventually notice the error was different but I was trying to replicate with -vv before providing further feedback. I have been running uploads non-stop for 4 days now with -vv, including rerunning the queue that first produced this but nothing so far.