Large files uploaded via crypt remote to B2 are missing SHA1SUMs #1767

Open
tjanez opened this issue Oct 22, 2017 · 16 comments

@tjanez commented Oct 22, 2017

What is your rclone version (eg output from rclone -V)

rclone v1.38
- os/arch: linux/amd64
- go version: go1.9

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Fedora 26 (Workstation Edition), 64 bit

Which cloud storage system are you using? (eg Google Drive)

B2 Backblaze

The command you were trying to run (eg rclone copy /tmp remote:tmp)

An example of a large file uploaded via the crypt remote to B2 that is missing its SHA1 checksum:

$ rclone mkdir b2:crypted-missing-sha1sum-test
$ rclone config
Configure a new crypted remote named crypted-missing-sha1sum-test:

Remote config
--------------------
[crypted-missing-sha1sum-test]
remote = b2:crypted-missing-sha1sum-test
filename_encryption = off
password = *** ENCRYPTED ***
password2 = *** ENCRYPTED ***
--------------------

$ ls -lh ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso 
-rw-rw-r--. 1 tadej tadej 1.5G Jul  6 00:34 /home/tadej/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso
$ rclone copy -v ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso crypted-missing-sha1sum-test:
2017/10/22 15:06:20 INFO  : Encrypted drive 'crypted-missing-sha1sum-test:': Modify window is 1ms
2017/10/22 15:06:20 INFO  : Encrypted drive 'crypted-missing-sha1sum-test:': Waiting for checks to finish
2017/10/22 15:06:20 INFO  : Encrypted drive 'crypted-missing-sha1sum-test:': Waiting for transfers to finish
2017/10/22 15:07:20 INFO  : 
Transferred:   383.938 MBytes (6.041 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            0
Elapsed time:      1m3.5s
Transferring:
 *     Fedora-Workstation-Live-x86_64-26-1.5.iso: 25% done, 8.578 MBytes/s, ETA: 2m9s

[... output trimmed... ]

2017/10/22 15:23:43 INFO  : Fedora-Workstation-Live-x86_64-26-1.5.iso: Copied (new)
2017/10/22 15:23:43 INFO  : 
Transferred:   1.456 GBytes (1.424 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            1
Elapsed time:    17m26.8s
$ rclone sha1sum b2:crypted-missing-sha1sum-test
                                          Fedora-Workstation-Live-x86_64-26-1.5.iso.bin
$ rclone sha1sum crypted-missing-sha1sum-test:
                                          Fedora-Workstation-Live-x86_64-26-1.5.iso

An example of the same large file uploaded directly to the B2 remote, which has its SHA1 checksum:

$ rclone mkdir b2:missing-sha1sum-test
$ ls -lh ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso 
-rw-rw-r--. 1 tadej tadej 1.5G Jul  6 00:34 /home/tadej/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso
$ rclone copy -v ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso b2:missing-sha1sum-test/
2017/10/22 13:37:22 INFO  : B2 bucket missing-sha1sum-test: Modify window is 1ms
2017/10/22 13:37:23 INFO  : B2 bucket missing-sha1sum-test: Waiting for checks to finish
2017/10/22 13:37:23 INFO  : B2 bucket missing-sha1sum-test: Waiting for transfers to finish
2017/10/22 13:37:37 INFO  : 
Transferred:   7.346 MBytes (467.081 kBytes/s)
Errors:                 0
Checks:                 0
Transferred:            0
Elapsed time:       16.1s
Transferring:
 *     Fedora-Workstation-Live-x86_64-26-1.5.iso:  0% done, 1.299 MBytes/s, ETA: 19m2s

[... output trimmed... ]

2017/10/22 13:54:56 INFO  : Fedora-Workstation-Live-x86_64-26-1.5.iso: Copied (new)
2017/10/22 13:54:56 INFO  : 
Transferred:   1.456 GBytes (1.413 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            1
Elapsed time:    17m35.3s

$ rclone sha1sum b2:missing-sha1sum-test
65880b7e61f995df0009ecf556d736552526d2e0  Fedora-Workstation-Live-x86_64-26-1.5.iso

@ncw (Collaborator) commented Oct 23, 2017

I'm not sure this is possible with the B2 API.

Storing an SHA1 on a large file requires us to put it in the file metadata, and we need to know it when we create the upload session, before we've read any data.

However, we can't know the SHA1 until the end of the session, once we've actually read all the file bytes, and unfortunately I don't think b2 has an API for changing the metadata.
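
For illustration only (this is not rclone's code), here's a minimal Go sketch of the constraint, assuming `apiURL`, `authToken` and `bucketID` have come from `b2_authorize_account`: the whole-file SHA1 can only be recorded at `b2_start_large_file`, via the `large_file_sha1` key in `fileInfo`.

```go
package b2sketch

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// startLargeFile shows where the whole-file SHA1 has to go: into the
// fileInfo of b2_start_large_file, i.e. before any part is uploaded.
func startLargeFile(apiURL, authToken, bucketID, fileName, sha1Hex string) (*http.Response, error) {
	body, err := json.Marshal(map[string]interface{}{
		"bucketId":    bucketID,
		"fileName":    fileName,
		"contentType": "b2/x-auto",
		// Omitting this key here means the finished large file will
		// never carry a whole-file SHA1 -- there is no later call to add it.
		"fileInfo": map[string]string{"large_file_sha1": sha1Hex},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", apiURL+"/b2api/v2/b2_start_large_file", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", authToken)
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}
```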

Is that correct @breunigs or have I missed something?

@tjanez (Author) commented Oct 23, 2017

> Storing an SHA1 on a large file requires us to put it in the file metadata, and we need to know it when we create the upload session, before we've read any data.
>
> However, we can't know the SHA1 until the end of the session, once we've actually read all the file bytes, and unfortunately I don't think b2 has an API for changing the metadata.

@ncw, thanks for the explanation.

Can you elaborate a bit more on how the crypt remote comes into play?
More precisely, how is the SHA1 computed when a large file is uploaded directly to the B2 remote, and how is it computed when it is uploaded via a crypt remote (to the B2 remote)?

@ncw (Collaborator) commented Oct 24, 2017

For a large file upload to b2 we:

  1. query the source object to see if it has an SHA1
  2. if it does then put it in the metadata for the create upload session
  3. upload the file in chunks
  4. finalise the upload session

Now the trouble comes in step 1. For a local object, querying the SHA1 causes the file to be read and the SHA1 to be computed. This means we read the file again in step 3 which is unfortunate but not the end of the world.

However when uploading a crypted file, the object does not know its SHA1. To find it crypt would have to read the entire file, encrypt it and SHA1 it. The encryption would then be repeated in step 3. The encryption would have to use the same nonce (which will drive any cryptographers nuts!), and the crypt module would have to persist objects (which it doesn't at the moment).
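
To make the timing problem concrete, here's a tiny illustrative Go sketch (not crypt's actual code): a streaming SHA1 only exists once the stream has been fully consumed, so an encrypting source has to do a complete encryption pass just to learn the digest.

```go
package b2sketch

import (
	"crypto/sha1"
	"encoding/hex"
	"io"
)

// streamSHA1 consumes r to EOF and returns the hex digest. The sum is
// only available after the last byte has been read -- which for crypt
// means one full encryption pass before the upload pass even starts.
func streamSHA1(r io.Reader) (string, error) {
	h := sha1.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```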

So to fix this we would need to do one of the following:

  • spool large files to disk
  • encrypt large files twice (and complicate the internals of crypt even more)
  • get b2 to add an update-the-metadata API call (would be useful for setting the mod time too)
  • get b2 to make the large file upload finalize take the total SHA1 for the metadata

I'll email my contact at b2 about the last two options, see what he says!

@tjanez (Author) commented Oct 24, 2017

> For a large file upload to b2 we:
>
>   1. query the source object to see if it has an SHA1
>   2. if it does then put it in the metadata for the create upload session
>   3. upload the file in chunks
>   4. finalise the upload session
>
> Now the trouble comes in step 1. For a local object, querying the SHA1 causes the file to be read and the SHA1 to be computed. This means we read the file again in step 3 which is unfortunate but not the end of the world.
>
> However when uploading a crypted file, the object does not know its SHA1. To find it crypt would have to read the entire file, encrypt it and SHA1 it. The encryption would then be repeated in step 3. The encryption would have to use the same nonce (which will drive any cryptographers nuts!), and the crypt module would have to persist objects (which it doesn't at the moment).

@ncw, thanks for this thorough explanation. I understand the difficulty of the problem now.

> So to fix this we would need to do one of the following:
>
>   • spool large files to disk
>   • encrypt large files twice (and complicate the internals of crypt even more)
>   • get b2 to add an update-the-metadata API call (would be useful for setting the mod time too)
>   • get b2 to make the large file upload finalize take the total SHA1 for the metadata
>
> I'll email my contact at b2 about the last two options, see what he says!

Thanks, that's much appreciated!

Indeed, the last two options would make it easy to support this on rclone's side (and probably in other tools using B2's API).

@breunigs (Collaborator) commented Oct 26, 2017

Sorry for my late answer, but yes, I have the same understanding as you do @ncw. That being said, we do calculate the SHA1s for each part being uploaded, which offers at least some safeguard against corruption. This should also happen when going through crypt, even if it's not visible from its API (I believe).
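
For anyone curious what that per-part safeguard looks like, a rough Go sketch (placeholders, not rclone's code; `uploadURL` and `authToken` would come from `b2_get_upload_part_url`): each `b2_upload_part` call carries the SHA1 of just that part in the `X-Bz-Content-Sha1` header, so B2 can verify every chunk even though the whole-file SHA1 is unknown.

```go
package b2sketch

import (
	"bytes"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"net/http"
)

// uploadPart uploads one chunk with its own SHA1, which B2 verifies
// on receipt -- this is the per-part corruption safeguard.
func uploadPart(uploadURL, authToken string, partNumber int, part []byte) (*http.Response, error) {
	sum := sha1.Sum(part)
	req, err := http.NewRequest("POST", uploadURL, bytes.NewReader(part))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", authToken)
	req.Header.Set("X-Bz-Part-Number", fmt.Sprint(partNumber))
	req.Header.Set("X-Bz-Content-Sha1", hex.EncodeToString(sum[:]))
	return http.DefaultClient.Do(req)
}
```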

@ncw (Collaborator) commented Nov 1, 2017

@breunigs thanks for the confirmation

> I'll email my contact at b2 about the last two options, see what he says!

He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so fingers crossed they will implement a fix. I've always found Backblaze very helpful in this regard.

@tjanez (Author) commented Nov 1, 2017

> He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so fingers crossed they will implement a fix. I've always found Backblaze very helpful in this regard.

@ncw, thanks for pushing this.

I concur, it's great to hear Backblaze is very helpful and willing to discuss extending their API to support such a use case. Fingers crossed!

@ya-isakov commented Dec 7, 2017

@ncw, I've read from https://rclone.org/b2/ that "When using B2 with crypt files are encrypted into a temporary location and streamed from there. This is required to calculate the encrypted file's checksum before beginning the upload."
So why does the file need to be encrypted twice to calculate the SHA1? Why can't it be fully encrypted to /tmp first, with the SHA1 sum calculated in the process, and the file then split into chunks and uploaded?
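
A minimal Go sketch of the spooling approach those 1.39 docs describe (illustrative only): stream the encrypted data into a temporary file while computing the SHA1 in the same pass, so the checksum is known before the chunked upload begins.

```go
package b2sketch

import (
	"crypto/sha1"
	"encoding/hex"
	"io"
	"os"
)

// spoolAndHash writes the encrypted stream to a temp file and hashes it
// in one pass; the upload can then read the parts back from disk.
func spoolAndHash(encrypted io.Reader) (tmpPath, sha1Hex string, err error) {
	tmp, err := os.CreateTemp("", "rclone-spool-")
	if err != nil {
		return "", "", err
	}
	h := sha1.New()
	// Every byte written to the temp file is also fed into the hash.
	if _, err := io.Copy(io.MultiWriter(tmp, h), encrypted); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return "", "", err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return "", "", err
	}
	return tmp.Name(), hex.EncodeToString(h.Sum(nil)), nil
}
```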

@ncw (Collaborator) commented Dec 12, 2017

@ya-isakov those docs are for 1.39 - the docs for the latest beta don't include that; they say this instead:

Sources which don't support SHA1, in particular `crypt` will upload
large files without SHA1 checksums.  This may be fixed in the future
(see [#1767](https://github.com/ncw/rclone/issues/1767)).

@tjanez (Author) commented Jul 15, 2018

> I'll email my contact at b2 about the last two options, see what he says!
>
> He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so fingers crossed they will implement a fix. I've always found Backblaze very helpful in this regard.

If I'm reading B2's Large Files docs correctly, they still don't support setting the SHA1 checksum after all the parts have been uploaded, only at the call to b2_start_large_file:

> If the caller knows the SHA1 of the entire large file being uploaded, Backblaze recommends specifying the SHA1 in the fileInfo during the call to b2_start_large_file. Inside the fileInfo specify one of the keys as large_file_sha1 and for the value use a 40 byte hex string representing the SHA1.

@ncw, could you ping your B2 contact to see if there has been any internal progress on the issue?

@ncw (Collaborator) commented Jul 15, 2018

I asked him a few months ago and he said it was in progress, I think. Unfortunately it isn't as easy as you might think, as the whole of b2 is built on immutable objects...

@tjanez (Author) commented Jul 15, 2018

> I asked him a few months ago and he said it was in progress, I think.

Great, nice to hear it's in progress.

> Unfortunately it isn't as easy as you might think, as the whole of b2 is built on immutable objects...

Yes, I can imagine...

@tjanez (Author) commented May 29, 2019

@ncw, following #3210, I saw that you've implemented SetModTime using server side copy:

> This creates a new version using a server side copy while updating the metadata.

Could we also leverage server side copy to set the SHA1 checksum after uploading the whole large file via the crypt remote?

As far as I understand it, we could do something like:

  • perform upload as we do it now
  • then create a server-side copy by:
    • specifying the correct SHA1 in the call to b2_start_large_file
    • repeating b2_copy_part
    • finishing with b2_finish_large_file

@ncw (Collaborator) commented Jun 6, 2019

> @ncw, following #3210, I saw that you've implemented SetModTime using server side copy:
>
> > This creates a new version using a server side copy while updating the metadata.
>
> Could we also leverage server side copy to set the SHA1 checksum after uploading the whole large file via the crypt remote?

Yes this would be possible.

> As far as I understand it, we could do something like:
>
>   • perform upload as we do it now
>   • then create a server-side copy by:
>     • specifying the correct SHA1 in the call to b2_start_large_file
>     • repeating b2_copy_part
>     • finishing with b2_finish_large_file

Yes, though I'd just use b2_copy_file as it doesn't have any size limits.

I think there may be one disadvantage to doing this. Doing a server side copy creates another version which doubles the storage.

I suppose rclone could delete the source version once the copy is complete... Maybe rclone should be doing this in the SetModTime call too?
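
A rough Go sketch of that fix-up (field names taken from the public `b2_copy_file` docs; this is a sketch of the idea, not rclone's implementation): once the large file upload has finished and the SHA1 is known, a server-side copy with `metadataDirective: REPLACE` can write it into `fileInfo`, after which the source version would be deleted with `b2_delete_file_version` to avoid doubling the storage.

```go
package b2sketch

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// copyWithSHA1 re-creates the just-uploaded file server-side, this time
// with the whole-file SHA1 in its metadata. The caller would delete the
// original version afterwards (b2_delete_file_version).
func copyWithSHA1(apiURL, authToken, sourceFileID, fileName, sha1Hex string) (*http.Response, error) {
	body, err := json.Marshal(map[string]interface{}{
		"sourceFileId": sourceFileID,
		"fileName":     fileName,
		// REPLACE lets us supply new contentType and fileInfo.
		"metadataDirective": "REPLACE",
		"contentType":       "b2/x-auto",
		"fileInfo":          map[string]string{"large_file_sha1": sha1Hex},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", apiURL+"/b2api/v2/b2_copy_file", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", authToken)
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}
```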

@ya-isakov commented Jun 6, 2019

Maybe it would be better to use b2_copy_part, as it returns the SHA1 sum. Without the range option, it should probably return the checksum of the whole file. And I'm voting for deleting the old version.

@tjanez (Author) commented Jun 7, 2019

> Yes this would be possible.

This is great news.

> Yes, though I'd just use b2_copy_file as it doesn't have any size limits.

I couldn't find any size limits mentioned in b2_copy_part's documentation, but I haven't actually used this myself.

> I think there may be one disadvantage to doing this. Doing a server side copy creates another version which doubles the storage.
>
> I suppose rclone could delete the source version once the copy is complete...

Yes, I agree with @ya-isakov, I think the source version should be deleted afterwards.

> Maybe rclone should be doing this in the SetModTime call too?

I haven't looked at the source code but if it only uses B2's copy API to set a different modification time, then it would make sense to delete the original afterwards.
