azureblob: occasional corrupted uploads when using --checksum flag #7590
Comments
ncw added a commit that referenced this issue on Jan 23, 2024
It was reported that rclone copy occasionally uploaded corrupted data to azure blob. This turned out to be a race condition updating the block count which caused blocks to be duplicated.

This bug was introduced in v1.64.0 by commit 0427177 (azureblob: implement OpenChunkWriter and multi-thread uploads, #7056) and will be fixed in v1.65.2.

This race only seems to happen if `--checksum` is used but can happen otherwise. Unfortunately Azure blob does not check the MD5 that we send them so despite sending incorrect data this corruption is not detected. The corruption is detected when rclone tries to download the file, so attempting to copy the files back to local disk will result in errors such as:

ERROR : file.pokosuf5.partial: corrupted on transfer: md5 hash differ "XXX" vs "YYY"

This adds a check that the block list we upload is as expected, which would have caught the problem had it been in place earlier.
This fixes the problem - an overnight test has revealed no corruptions. The fix is in v1.66.0-beta.7667.4ab81ad84.fix-azureblob-corruption on branch fix-azureblob-corruption.
ncw added a commit that referenced this issue on Jan 24, 2024
I've merged this fix to master now which means it will be in the latest beta in 15-30 minutes and released in v1.66 and v1.65.2
ncw added a commit that referenced this issue on Jan 24, 2024
This is now released in v1.65.2 also.
ncw changed the title from "azureblob: occasionally corrupted uploads when using --checksum flag" to "azureblob: occasional corrupted uploads when using --checksum flag" on Jan 24, 2024
WuTofu pushed a commit to WuTofu/rclone that referenced this issue on Feb 24, 2024
It was reported that rclone occasionally uploaded corrupted data to Azure blob storage when using `rclone sync`/`rclone copy`/`rclone move`. This turned out to be a race condition updating the block count which caused blocks to be duplicated.

This bug was introduced in v1.64.0 by commit 0427177 (azureblob: implement OpenChunkWriter and multi-thread uploads, #7056) and has been fixed in v1.65.2.
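To make the failure mode easier to picture, here is a minimal, hypothetical sketch (invented names, not rclone's actual code) of how an unsynchronised block counter shared between concurrent chunk writers can hand two chunks the same block ID:

```go
package main

import (
	"fmt"
	"sync"
)

// uploader hands out block IDs to concurrent chunk writers. blockCount is
// updated with no synchronisation, so two goroutines can read the same value
// and claim the same block ID - the class of race described above.
type uploader struct {
	blockCount int
}

func (u *uploader) nextBlockID() string {
	id := u.blockCount // racy read...
	u.blockCount++     // ...and racy write: another chunk writer may interleave here
	return fmt.Sprintf("block-%06d", id)
}

func main() {
	u := &uploader{}
	const chunks = 8
	ids := make([]string, chunks)
	var wg sync.WaitGroup
	for i := 0; i < chunks; i++ {
		wg.Add(1)
		go func(chunk int) { // one goroutine per chunk, as in a multi-thread upload
			defer wg.Done()
			ids[chunk] = u.nextBlockID()
		}(i)
	}
	wg.Wait()
	// Run with `go run -race .` to see the race reported; with unlucky
	// scheduling two entries of ids end up identical.
	fmt.Println(ids)
}
```

If two chunks end up with the same ID, the data staged under that ID appears at both positions in the committed block list, which lines up with the duplicated-block corruption described below. One way to remove such a race is to generate IDs with proper synchronisation (for example `sync/atomic` or a mutex) so each chunk gets a unique, correctly ordered ID.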
Unfortunately Azure blob does not check the MD5 that we send it, so despite us sending incorrect data this corruption is not detected at upload time. The corruption is detected when rclone tries to download the file, so attempting to copy the files back to local disk will result in errors such as:

ERROR : file.pokosuf5.partial: corrupted on transfer: md5 hash differ "XXX" vs "YYY"

Note that Microsoft Azure Storage Explorer does check the checksum when downloading, and the download will fail for corrupted files.
When can the problem happen
The problem can only happen when uploading files greater than `--azureblob-chunk-size` (4MiB by default) to Azure blob storage.

The problem can happen when using:
- `rclone sync`, `rclone copy` or `rclone move` to upload files to Azure blob storage
- `rclone mount` with `--vfs-cache-mode writes` or `--vfs-cache-mode full` to upload files to Azure blob storage

The problem is unlikely to happen when using:
- `rclone rcat` to upload files to Azure blob storage
- `rclone mount` with `--vfs-cache-mode off` (the default) to upload files to Azure blob storage

Mitigating circumstances
Things which decrease the probability of the problem:
- not using the `--checksum` flag (the race is very rare without this flag)
- `--azureblob-chunk-size` set larger than the default of 4MiB
- `--azureblob-concurrency` set smaller than the default of 16

The race can be mitigated with `--azureblob-concurrency 1`.

When uploading from local disk to Azure blob the `--checksum` flag makes the race much more likely. Without the `--checksum` flag the race happens rarely (we haven't observed it in testing, but it could happen). We think the race gets more likely with `--checksum` because the kernel has cached the file when reading the checksum, so the blocks are submitted much more quickly.

How the data is corrupted
If you have corrupted data, then it will be corrupted in a very specific way. What you will see is one or more duplicated blocks. The blocks are of size `--azureblob-chunk-size` (4MiB by default) and the duplications will happen on a block boundary. The duplicated block will overwrite some other block.

For example, assuming your file had the following 4MiB blocks:
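(The block labels below are purely illustrative.)

```
block1 block2 block3 block4 block5 block6
```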
If one block is corrupted it could look like this instead:
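```
block1 block2 block3 block3 block5 block6
```

Here `block3` has been duplicated and has overwritten `block4`: the file is still the expected length, but one 4MiB block contains the wrong data.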
How to check for corruptions
Unfortunately it is possible to have files in Azure blob storage which do not match their MD5 - Azure blob does not check this. It does check the MD5 for each individual chunk of the upload, but not for the entire file.

This means that `rclone check` won't discover the corruptions. It is necessary to use `rclone check --download` to actually download the file data to detect the corruptions.

If you don't have a copy of the original data to validate against, you can use `rclone copy` to copy the data back to local disk to detect corruptions, and for every corrupted file you will get an error like:

ERROR : file.pokosuf5.partial: corrupted on transfer: md5 hash differ "XXX" vs "YYY"

If you want rclone to download the file regardless of corruptions then use the `--ignore-checksum` flag.

The fix
The fix for this problem removes the race on creating the block ID.

To make sure it does not happen again, when finalising the transfer we explicitly check the block ID list to make sure it has all the blocks we are expecting and that the IDs are in the expected order. We tested that if this check had been in place before, it would have caught the problem. This check completes the verification that the MD5 we upload matches the data we upload, so it should prevent other classes of corruption too.
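As an illustration only - a minimal sketch with invented names, not the actual rclone implementation - the pre-commit check can be as simple as comparing the block IDs we believe we uploaded with the block IDs about to be committed, in order:

```go
package main

import "fmt"

// verifyBlockList is an illustrative sketch of the kind of pre-commit check
// described above: the list of block IDs about to be committed must contain
// exactly the IDs we uploaded, in the order we uploaded them.
func verifyBlockList(expected, toCommit []string) error {
	if len(toCommit) != len(expected) {
		return fmt.Errorf("block list length mismatch: uploaded %d blocks, committing %d", len(expected), len(toCommit))
	}
	for i, want := range expected {
		if toCommit[i] != want {
			return fmt.Errorf("block %d mismatch: uploaded ID %q, committing %q", i, want, toCommit[i])
		}
	}
	return nil
}

func main() {
	uploaded := []string{"b0", "b1", "b2"}
	// A duplicated block ID (the symptom of this bug) fails the check before commit.
	if err := verifyBlockList(uploaded, []string{"b0", "b1", "b1"}); err != nil {
		fmt.Println("refusing to commit:", err)
	}
}
```

A duplicated or missing block ID fails this comparison, so the upload is aborted instead of silently committing corrupted data.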