
azureblob: occasional corrupted uploads when using --checksum flag #7590

Closed
ncw opened this issue Jan 23, 2024 · 3 comments

ncw (Member) commented Jan 23, 2024

It was reported that rclone occasionally uploaded corrupted data to azure blob when using rclone sync/copy/move.

This turned out to be a race condition updating the block count which caused blocks to be duplicated.

This bug was introduced in the following commit in v1.64.0 and has been fixed in v1.65.2:

0427177 azureblob: implement OpenChunkWriter and multi-thread uploads #7056
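The race can be sketched as follows. This is a hypothetical Python simulation of the unsynchronized pattern, not rclone's actual Go code: each uploader reads the shared block counter to derive its block ID and only afterwards increments it, so two concurrent uploaders can observe the same value and emit the same block ID.

```python
import base64

def block_id(n):
    # Azure block IDs must be base64 strings of equal length within a blob.
    return base64.b64encode(b"%08d" % n).decode()

counter = 0
ids = []

def racy_upload_step(observed):
    # 'observed' is the possibly-stale counter value a writer read before
    # any concurrent writer got to increment it.
    ids.append(block_id(observed))

# Simulated interleaving: both writers read counter == 0 before either
# increments it, so they build the same block ID.
racy_upload_step(counter)   # writer A reads 0
racy_upload_step(counter)   # writer B also reads 0 -> duplicate block ID
counter += 2

assert ids[0] == ids[1]     # duplicate ID: one block silently replaces the other
```

Because block IDs name the blocks in the committed block list, a duplicated ID means one uploaded block overwrites another rather than being appended.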

Unfortunately Azure blob does not check the MD5 that we send it, so despite our sending incorrect data this corruption is not detected at upload time.

The corruption is detected when rclone tries to download the file, so attempting to copy the files back to local disk will result in errors such as:

ERROR : file.bin.XXX.partial: corrupted on transfer: md5 hash differ "XXX" vs "YYY"

Note that Microsoft Azure Storage Explorer does check the checksum when downloading, so downloads of corrupted files will fail there.

When can the problem happen

The problem can only happen when uploading files larger than --azureblob-chunk-size (4MiB by default) to Azure blob storage.

The problem can happen when using:

  • rclone sync, rclone copy or rclone move to upload files to Azure blob storage
  • rclone mount with --vfs-cache-mode writes or --vfs-cache-mode full to upload files to Azure blob storage

The problem is unlikely to happen when using:

  • rclone rcat to upload files to Azure blob storage
  • rclone mount with --vfs-cache-mode off (the default) to upload files to Azure blob storage

Mitigating circumstances

Things which decrease the probability of the problem:

  • Not using the --checksum flag (the race is very rare without this flag)
  • --azureblob-chunk-size set larger than the default of 4MiB
  • --azureblob-concurrency set smaller than the default of 16

The race can be mitigated with:

  • --azureblob-concurrency 1

When uploading from local disk to Azure blob the --checksum flag makes the race much more likely. Without the --checksum flag the race happens only rarely (we haven't observed it in testing, but it could happen). We think --checksum makes the race more likely because the kernel has already cached the file from reading it to compute the checksum, so the blocks are submitted much more quickly.

How the data is corrupted

If you have corrupted data, then it will be corrupted in a very specific way.

What you will see is 1 or more duplicated blocks. The blocks are of size --azureblob-chunk-size (4MiB by default) and the duplications will happen on a block boundary. The duplicated block will overwrite some other block.

For example, assuming your file had the following 4MiB blocks

A B C D E F G H

If one block is corrupted it could look like this instead

A B C D C F G H
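If you still have the original data, this duplicated-block pattern can be detected mechanically by comparing a downloaded copy against the original, chunk by chunk. A minimal sketch (using 1-byte "blocks" for illustration; substitute 4 * 1024 * 1024 to match the default --azureblob-chunk-size):

```python
# Compare a corrupted copy against the original block by block and report
# any block that was overwritten by a duplicate of another block.
CHUNK = 1  # illustration only; use 4 * 1024 * 1024 for real files

def blocks(data, chunk):
    return [data[i:i+chunk] for i in range(0, len(data), chunk)]

original  = b"ABCDEFGH"
corrupted = b"ABCDCFGH"   # block E overwritten by a duplicate of block C

orig = blocks(original, CHUNK)
corr = blocks(corrupted, CHUNK)
for i, (a, b) in enumerate(zip(orig, corr)):
    if a != b and b in orig:
        print(f"block {i} overwritten by duplicate of block {orig.index(b)}")
# -> block 4 overwritten by duplicate of block 2
```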

How to check for corruptions

Unfortunately it is possible to have files in Azure blob storage whose data does not match their stored MD5 - Azure blob does not verify this. It does check the MD5 of each individual chunk of the upload, but not of the entire file. From the Azure docs:

x-ms-blob-content-md5

Optional. An MD5 hash of the blob content. This hash isn't validated, because the hashes for the individual blocks were validated when each was uploaded.

The Get Blob operation returns the value of this header in the Content-MD5 response header.

If this property isn't specified with the request, it's cleared for the blob if the request is successful.

This means that rclone check won't discover the corruptions. It is necessary to use rclone check --download to actually download the file data to detect the corruptions.
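The asymmetry can be seen in a small hashlib sketch (again using tiny illustrative "blocks"): every block of the corrupted file is individually a valid block, so Azure's per-block MD5 checks all pass, while the whole-file MD5 differs, which only downloading and hashing the data can reveal.

```python
import hashlib

CHUNK = 1  # illustration only; real blocks are --azureblob-chunk-size
original  = b"ABCDEFGH"
corrupted = b"ABCDCFGH"   # one block duplicated on a block boundary

def md5(data):
    return hashlib.md5(data).hexdigest()

# Every block of the corrupted file is a genuine block with a valid MD5;
# it is just committed in the wrong position.
orig_block_md5s = {md5(original[i:i+CHUNK]) for i in range(0, len(original), CHUNK)}
corr_block_md5s = [md5(corrupted[i:i+CHUNK]) for i in range(0, len(corrupted), CHUNK)]
assert all(h in orig_block_md5s for h in corr_block_md5s)

# ...but the whole-file hash no longer matches, which only a
# download-and-hash check detects.
assert md5(original) != md5(corrupted)
```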

If you don't have a copy of the original data to validate against, you can use rclone copy to copy the data back to local disk to detect corruptions. For every corrupted file you will get an error like:

ERROR : file.bin.XXX.partial: corrupted on transfer: md5 hash differ "XXX" vs "YYY"

If you want rclone to download the file regardless of corruptions then use the --ignore-checksum flag.

The fix

The fix for this problem removes the race on creating the block ID.

To make sure it does not happen again, when finalising the transfer we explicitly check the block ID list to verify that it has all the blocks we are expecting and that the IDs are in the right order. We tested that if this check had been in place before, it would have caught the problem. This check completes the verification that the MD5 we upload matches the data we upload, so it should prevent other classes of corruption too.
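The finalisation check can be sketched like this (hypothetical names, a Python simplification of the Go fix): before committing the block list, verify both the block count and that each ID is the expected one for its position.

```python
import base64

def block_id(n):
    # Same deterministic ID scheme as the uploader is assumed to use.
    return base64.b64encode(b"%08d" % n).decode()

def verify_block_list(block_list, expected_blocks):
    # Reject a block list with the wrong number of blocks...
    if len(block_list) != expected_blocks:
        raise ValueError(f"expected {expected_blocks} blocks, got {len(block_list)}")
    # ...or with any ID out of place (e.g. a duplicated block ID).
    for i, got in enumerate(block_list):
        want = block_id(i)
        if got != want:
            raise ValueError(f"block {i}: expected ID {want!r}, got {got!r}")

verify_block_list([block_id(i) for i in range(4)], 4)   # complete upload: passes
try:
    # A duplicated block ID, as produced by the race, is caught here.
    verify_block_list([block_id(0), block_id(2), block_id(2), block_id(3)], 4)
except ValueError as e:
    print("corruption caught:", e)
```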

How to use GitHub

  • Please use the 👍 reaction to show that you are affected by the same issue.
  • Please don't comment if you have no relevant information to add. It's just extra noise for everyone subscribed to this issue.
  • Subscribe to receive notifications on status change and new comments.
ncw added a commit that referenced this issue Jan 23, 2024
It was reported that rclone copy occasionally uploaded corrupted data
to azure blob.

This turned out to be a race condition updating the block count which
caused blocks to be duplicated.

This bug was introduced in this commit in v1.64.0 and will be fixed in v1.65.2

0427177 azureblob: implement OpenChunkWriter and multi-thread uploads #7056

This race only seems to happen if `--checksum` is used but can happen otherwise.

Unfortunately Azure blob does not check the MD5 that we send them so
despite sending incorrect data this corruption is not detected. The
corruption is detected when rclone tries to download the file, so
attempting to copy the files back to local disk will result in errors
such as:

    ERROR : file.pokosuf5.partial: corrupted on transfer: md5 hash differ "XXX" vs "YYY"

This adds a check to test the blocklist we upload is as we expected
which would have caught the problem had it been in place earlier.
ncw (Member, Author) commented Jan 23, 2024

This fixes the problem - an overnight test has revealed no corruptions.

v1.66.0-beta.7667.4ab81ad84.fix-azureblob-corruption on branch fix-azureblob-corruption

ncw added a commit that referenced this issue Jan 24, 2024
ncw (Member, Author) commented Jan 24, 2024

I've merged this fix to master now which means it will be in the latest beta in 15-30 minutes and released in v1.66 and v1.65.2

ncw added a commit that referenced this issue Jan 24, 2024
ncw (Member, Author) commented Jan 24, 2024

This is now released in v1.65.2 also.

@ncw ncw closed this as completed Jan 24, 2024
@ncw ncw changed the title azureblob: occasionally corrupted uploads when using --checksum flag azureblob: occasional corrupted uploads when using --checksum flag Jan 24, 2024
WuTofu pushed a commit to WuTofu/rclone that referenced this issue Feb 24, 2024