Large files uploaded via crypt remote to B2 are missing SHA1SUMs #1767
I'm not sure this is possible with the B2 API. To store an SHA1 on a large file requires us to store it in the file metadata, and we need to supply that when we create the upload session, before we've read any data. However, we can't know the SHA1 until the end of the session, once we've actually read all the file bytes, and unfortunately I don't think b2 has an API for changing the metadata afterwards. Is that correct @breunigs or have I missed something?
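The constraint can be seen in miniature: a whole-file SHA1 only exists after every byte has been consumed, yet B2 wants it in the file metadata when the upload session is created. A minimal Python sketch (the helper name is made up for illustration):

```python
import hashlib

def sha1_of_stream(chunks):
    """The digest is only available after the last chunk has been read --
    which is exactly the problem: B2 wants it when the upload session is
    created, before any data has been read."""
    h = hashlib.sha1()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

# Simulated file delivered as a stream of parts.
parts = [b"hello ", b"large ", b"file"]
whole_file_sha1 = sha1_of_stream(parts)
```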
@ncw, thanks for the explanation. Can you elaborate a bit more on how the crypt remote comes into play?
For a large file upload to b2 we:

1. query the object for its SHA1
2. create the upload session, storing the SHA1 in the file metadata
3. read the file data and upload it in parts

Now the trouble comes in step 1. For a local object, querying the SHA1 causes the file to be read and the SHA1 to be computed. This means we read the file again in step 3, which is unfortunate but not the end of the world. However, when uploading a crypted file the object does not know its SHA1. To find it, crypt would have to read the entire file, encrypt it and SHA1 it. The encryption would then be repeated in step 3. The repeat encryption would have to use the same nonce (which will drive any cryptographers nuts!), and the crypt module would have to persist objects (which it doesn't at the moment). So to fix this we would need to do one of several options.
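The nonce problem above can be illustrated with a toy sketch. This is NOT rclone's real cipher (crypt uses NaCl secretbox); here HMAC-SHA256 in counter mode stands in as a keystream. The point it demonstrates: the ciphertext SHA1 computed in a first pass only matches the second, uploading pass if the same nonce is reused.

```python
import hashlib, hmac, os

def keystream(key, nonce, n):
    """Toy deterministic keystream (NOT a real cipher): HMAC-SHA256 in
    counter mode, standing in for crypt's real NaCl secretbox scheme."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"),
                        hashlib.sha256).digest()
        counter += 1
    return out[:n]

def encrypt(key, nonce, data):
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

key, nonce, plain = os.urandom(32), os.urandom(16), b"some large file body"

# Pass 1 (step 1 above): encrypt once just to learn the ciphertext's SHA1.
sha1_pass1 = hashlib.sha1(encrypt(key, nonce, plain)).hexdigest()

# Pass 2 (step 3 above): encrypt again while uploading. Only if the SAME
# nonce is reused does the ciphertext -- and therefore its SHA1 -- come
# out identical: the nonce reuse that "drives cryptographers nuts".
sha1_pass3 = hashlib.sha1(encrypt(key, nonce, plain)).hexdigest()
assert sha1_pass1 == sha1_pass3
```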
I'll email my contact at b2 about the last two options, see what he says!
@ncw, thanks for this thorough explanation. I understand the difficulty of the problem now.
Thanks, that's much appreciated! Indeed, the last two options would make it easy to support this on rclone's side (and probably other tools using B2's API).
Sorry for my late answer, but yes, I have the same understanding as you do @ncw. That being said, we do calculate the SHA1s for each part being uploaded, which offers at least some safeguard against corruption. This should also happen through crypt, even if it's not visible from their APIs (I believe).
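The per-part safeguard @breunigs describes can be sketched like this (a toy illustration with a made-up tiny part size; real B2 parts are megabytes in size). Each part carries its own SHA1, so corruption of any single part is caught even when no whole-file SHA1 is stored:

```python
import hashlib

PART_SIZE = 4  # tiny for the example; real B2 parts are far larger

def upload_parts(data, part_size=PART_SIZE):
    """Split data into parts and attach each part's own SHA1,
    as the B2 large-file protocol does per uploaded part."""
    parts = []
    for i in range(0, len(data), part_size):
        chunk = data[i:i + part_size]
        parts.append((chunk, hashlib.sha1(chunk).hexdigest()))
    return parts

def verify_parts(parts):
    return all(hashlib.sha1(chunk).hexdigest() == digest
               for chunk, digest in parts)

parts = upload_parts(b"encrypted-bytes-go-here")
assert verify_parts(parts)

# Flip one byte in a part: that part's SHA1 no longer matches.
chunk, digest = parts[0]
parts[0] = (bytes([chunk[0] ^ 1]) + chunk[1:], digest)
assert not verify_parts(parts)
```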
@breunigs thanks for the confirmation
He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so fingers crossed they will implement a fix. I've always found Backblaze very helpful in this regard.
@ncw, thanks for pushing this. I concur, it's great to hear Backblaze is very helpful and willing to discuss extending their API to support such a use case. Fingers crossed!
@ncw, I've read from https://rclone.org/b2/ that "When using B2 with crypt files are encrypted into a temporary location and streamed from there. This is required to calculate the encrypted file's checksum before beginning the upload."
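The temporary-file approach quoted above can be sketched roughly as follows. This is a simplification, not rclone's actual code: `encrypt` here is a placeholder for crypt's real cipher, and the "upload" is just a read-back. The key point is that the SHA1 is known before the upload to B2 begins:

```python
import hashlib, os, tempfile

def spool_encrypt_then_upload(plaintext_chunks, encrypt):
    """Write the encrypted stream to a temporary file while hashing it,
    so the SHA1 is known before the B2 upload starts; then stream the
    temp file up. `encrypt` is a stand-in for crypt's real encryption."""
    h = hashlib.sha1()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for chunk in plaintext_chunks:
            ct = encrypt(chunk)
            h.update(ct)
            tmp.write(ct)
        path = tmp.name
    try:
        sha1 = h.hexdigest()
        # Here rclone would start the large-file upload, supplying sha1
        # in the metadata, then stream the spooled ciphertext in parts.
        with open(path, "rb") as f:
            uploaded = f.read()
        return sha1, uploaded
    finally:
        os.unlink(path)

fake_encrypt = lambda b: bytes(x ^ 0x5A for x in b)  # placeholder "cipher"
sha1, data = spool_encrypt_then_upload([b"part1", b"part2"], fake_encrypt)
assert hashlib.sha1(data).hexdigest() == sha1
```

The obvious downside, and presumably why the docs changed later, is the extra disk space and I/O for the temporary copy.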
@ya-isakov those docs are for 1.39 - the docs for the latest beta don't include that; they say this instead:
If I'm reading B2's Large Files docs correctly, they still don't support setting the SHA1 checksum after uploading all the parts, but only at the call to b2_start_large_file: "If the caller knows the SHA1 of the entire large file being uploaded, Backblaze recommends specifying the SHA1 in the fileInfo during the call to b2_start_large_file."

@ncw, could you ping your B2 contact to see if there has been any internal progress on the issue?
I asked him a few months ago and he said it was in progress, I think. Unfortunately it isn't as easy as you might think, as the whole of b2 is built on immutable objects...
Great, nice to hear it's in-progress.
Yes, I can imagine...
@ncw, following #3210, I saw that you've implemented SetModTime using server side copy:
Could we also leverage server side copy to set the SHA1 checksum after uploading the whole large file via the crypt remote? As far as I understand it, we could do something like:
Yes this would be possible.
Yes, though I'd just use b2_copy_file as it doesn't have any size limits. I think there may be one disadvantage to doing this. Doing a server side copy creates another version which doubles the storage. I suppose rclone could delete the source version once the copy is complete... Maybe rclone should be doing this in the SetModTime call too?
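The copy-then-delete idea discussed here can be sketched against an in-memory stand-in for a bucket. The method names echo the real B2 calls b2_copy_file and b2_delete_file_version, but the behaviour is deliberately simplified:

```python
import hashlib, itertools

class FakeB2Bucket:
    """In-memory stand-in for a B2 bucket, only to sketch the proposed
    flow; it is NOT a B2 client. B2 objects are immutable, so changing
    metadata means creating a new file version."""
    def __init__(self):
        self.versions = {}          # file_id -> (name, data, file_info)
        self.ids = itertools.count(1)

    def upload_large(self, name, data):
        # Large-file upload: the whole-file SHA1 is unknown up front,
        # so the stored file info carries no checksum.
        fid = next(self.ids)
        self.versions[fid] = (name, data, {})
        return fid

    def copy_file(self, src_id, new_file_info):
        # Server-side copy: a NEW version with replaced file info.
        name, data, _ = self.versions[src_id]
        fid = next(self.ids)
        self.versions[fid] = (name, data, dict(new_file_info))
        return fid

    def delete_file_version(self, fid):
        del self.versions[fid]

bucket = FakeB2Bucket()
data = b"big encrypted blob"
src = bucket.upload_large("file.bin", data)

# After the upload rclone finally knows the SHA1: copy to attach it,
# then delete the source version so storage is not doubled.
sha1 = hashlib.sha1(data).hexdigest()
dst = bucket.copy_file(src, {"large_file_sha1": sha1})
bucket.delete_file_version(src)

assert bucket.versions[dst][2]["large_file_sha1"] == sha1
assert src not in bucket.versions
```

The delete step is what addresses the doubled-storage disadvantage ncw raises above.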
Maybe it would be better to use b2_copy_part, as it returns the SHA1 sum. Without the range option, it should return the checksum of the whole file. And I'm voting for deleting the old version.
This is great news.
I couldn't find any size limits mentioned in b2_copy_part's documentation, but I haven't actually used this myself.
Yes, I agree with @ya-isakov, I think the source version should be deleted afterwards.
I haven't looked at the source code but if it only uses B2's copy API to set a different modification time, then it would make sense to delete the original afterwards.
What is the largest file size I can upload to B2 via crypt without encountering this issue?
Up to the limit set by
I'm not quite clear on the practical impact of this issue. If I understand correctly, uploads are still checksummed and guaranteed to be correct, but corruption to the files at rest on B2, and when downloading, wouldn't be detected? But in that case they wouldn't decrypt (properly), so the user - and presumably
Correct
Also correct
Yes crypt will notice any corruption in files. So the actual effect is probably quite small.
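Why crypt notices corruption even without a stored SHA1: its on-disk format authenticates the ciphertext, so a flipped byte makes decryption fail. A toy illustration follows; it is HMAC-based and NOT rclone's real format (crypt actually uses NaCl secretbox over 64 KiB blocks), but the detection property is the same:

```python
import hashlib, hmac, os

def seal(key, data):
    """Toy authenticated container (NOT rclone's real crypt format):
    prepend a MAC tag over the data."""
    tag = hmac.new(key, data, hashlib.sha256).digest()
    return tag + data

def open_sealed(key, blob):
    """Refuse to return data whose tag does not verify."""
    tag, data = blob[:32], blob[32:]
    if not hmac.compare_digest(tag, hmac.new(key, data, hashlib.sha256).digest()):
        raise ValueError("corrupted data: authentication failed")
    return data

key = os.urandom(32)
blob = seal(key, b"file contents")
assert open_sealed(key, blob) == b"file contents"

# Corrupt one byte at rest: opening refuses, so the damage is noticed
# even though no whole-file SHA1 was stored on B2.
corrupt = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    open_sealed(key, corrupt)
    raise AssertionError("corruption went unnoticed")
except ValueError:
    pass
```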
Sorry for noise!
You need to use
@ncw Quick question, if the SHA1 is missing on any large file uploads to B2, what would be the expected output on
Try rclone sha1sum on the underlying files to see whether they have sha1sums or not. |
I deleted my previous response because I figured out I was confused about what you were actually asking me to do. To confirm, you wanted me to run

If so, I checked with both large and small files. In the case of small files, the SHA1 is reported as found on B2's file browser. In the case of large files, the SHA1 is also reported as found on B2's file browser, but there are two values listed in their interface: one labeled

So, does this mean that this issue is actually solved? Is the

Edit: I have downloaded a large encrypted file from my B2 remote. When I use

Edit 2: It occurs to me that I updated Rclone between uploading these files to the crypted B2 and running

Edit 3: Out of curiosity, I tracked down a large file that was uploaded after I updated Rclone to 1.59.2. It's about 30GB, and it does have
Looking at the code, I think this issue was fixed in 1648c1a, which was first released in v1.52.0. This allows, for the local disk only, large file uploads to b2 wrapped in crypt to have the

So I think this issue is fixed now and I will close it.
Fedora 26 (Workstation Edition), 64 bit
B2 Backblaze
An example of a large file uploaded via crypt remote to B2 that is missing its SHA1SUM:

An example of the same large file uploaded directly to the B2 remote that has a SHA1SUM: