
Fatal: number of used blobs is larger than number of available blobs! #3268

Closed · kmwoley opened this issue Feb 6, 2021 · 30 comments

Labels: backend: b2 · state: need feedback (waiting for feedback, e.g. from the submitter) · state: need investigating (cause unknown, need investigating/troubleshooting)

Comments

kmwoley commented Feb 6, 2021

Restic Version

restic 0.9.6 compiled with go1.13.4 on linux/amd64

Command(s)

restic prune
restic check --read-data-subset=05/53
(the commands are automated as part of a weekly backup maintenance script)
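For illustration, a weekly maintenance script along these lines would produce the commands above (a hypothetical sketch, not the actual script; repository, password file and retention flags are placeholders):

#!/bin/bash
# hypothetical weekly restic maintenance script
set -euo pipefail
export RESTIC_REPOSITORY="b2:my-bucket:backups"       # placeholder
export RESTIC_PASSWORD_FILE="/etc/restic/password"    # placeholder

restic forget --keep-daily 7 --keep-weekly 5 --keep-monthly 12   # placeholder retention policy
restic prune
# read-verify a different 1/53rd of the pack data each week (ISO week number is 01..53)
restic check --read-data-subset="$(date +%V)/53"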

What did you expect?

An error free prune followed by an error free check.

What happened instead?

Both the prune and the check failed.

Output of prune:

counting files in repo
building new index for repo
incomplete pack file (will be removed): 056c60842c58dd17d452aacbd8a3a2a28b7729a55df17862c0a9715205178ded
[5:30:10] 100.00%  148048 / 148048 packs

incomplete pack file (will be removed): 173d2b9e645bd13eb794149b3f6f7b938940272f5a3b10cc0d48b13b61f0f73d
incomplete pack file (will be removed): 25be7e6fc725d5c5354e56133eda461d4fc2189ef90a458db837f1580eedffe6
incomplete pack file (will be removed): 429b728243cd9bf1777fe87eba0717ec3d7e04708692972db539ec168bdf35a2
incomplete pack file (will be removed): 54dcb5d83bd0ddeecc49a73f1e48e817b4cc0d5999a0663103455a0335a6116f
incomplete pack file (will be removed): 581b0af89c5ea7ba0bf47d8cfe36463001c08c46a534263c44c779185eb8ad80
incomplete pack file (will be removed): 5908682c326af00157838c8335e15414c7a7e79b0450148edbfa9d32b6eb2989
incomplete pack file (will be removed): 745354f809468d971a5488567f73461348fdc92468e8cc6140c9a50b53f8a265
incomplete pack file (will be removed): 79f599e46864df4a3c94a7b515939998045d7c4984d12f5733a91324e22774ed
incomplete pack file (will be removed): 7ee35edaa9c22be3855a54f67fff32ab88e1b90c0f65a3c7c27852fd55236292
incomplete pack file (will be removed): 8da7e3200bb309debf0ad338387c26dafc97de1afc7ba3011c0ba1241c6d0d08
incomplete pack file (will be removed): 8f74dab11a8de782759efd25db340326c3524851a7c46a4cb2df23d175546328
incomplete pack file (will be removed): a044002799ae0a4c8b0caeed64a3c1c2314f3bd5652f52e4c92af1f47a1ecc9f
incomplete pack file (will be removed): a60b3a2a76f225256377731107b3889c924413a1db8ac212db59c5f2b6efb87d
incomplete pack file (will be removed): b0404632536e3fa1851066a783ba08d9a871fa45b6037e1471ac8b0cfc3fa34a
incomplete pack file (will be removed): b492fd1c892f2b5e9ed63df629aa084d220730686de252b3030484989254f8c1
incomplete pack file (will be removed): b5bbee347e45875217ad33288e03899cbee0c2fcc0fcc7edb2892eb50ef89990
incomplete pack file (will be removed): c05b64748d945517114f6fede76c95f9336ddf03e13994c5c04f34b7bcd43d95
incomplete pack file (will be removed): ce95f9d896c44eb9da8717276d3959242edff24fcf7f8c79efc2908d34f26645
incomplete pack file (will be removed): d1a46a0406973a86eceed8314470ed6c3a25642ba810c1af3bb2f0e1e275a469
incomplete pack file (will be removed): d2fa90492bac4743f184dc24137247de74ee56afd122e66318c511e39f43f013
incomplete pack file (will be removed): d4c832432f9addf84646adc3b340853b65dcfc97d49d0d74ed9e3bba2a2c7cc9
incomplete pack file (will be removed): d542d0c8a92b5ad1c37169c8fcfdff9ce810df04a2804c9e0e4e8d4db512246b
incomplete pack file (will be removed): e57d166ccd9b367380139b8e563add2e6069af33f7a6f26d68969623e7b493a1
incomplete pack file (will be removed): f4507052a2d20fbea4db14c8c4ce8d1e68af912f3efa77ae42b6c60a015db480
incomplete pack file (will be removed): f51a455d0dbde2fd452bd607f5d1e5b2cf79cca0bca2ae3d6b996ac7bbaeeb19
incomplete pack file (will be removed): f5abb13ab88935727bf60e3032cf28176897bd2f807e53a19c50a2fc6817dd8e
incomplete pack file (will be removed): fad51e77a344f1cf9e38fc7073b9c638d1d407c2d505e6847620d3577084a865
incomplete pack file (will be removed): faf4d42c802b96b95017b492793d174dc14da7a177c40deb049aa16b24c1aae0
incomplete pack file (will be removed): fe735339b138268beb4aadd0ae1e9fa1bcfc79e1b741fd1fb152613e594f66ed
repository contains 148018 packs (826578 blobs) with 671.196 GiB
processed 826578 blobs: 0 duplicate blobs, 0 B duplicate
load all snapshots
find data that is still in use for 213 snapshots
[0:32] 100.00%  213 / 213 snapshots

Fatal: number of used blobs is larger than number of available blobs!
Please report this error (along with the output of the 'prune' run) at
https://github.com/restic/restic/issues/new

Output of Check

using temporary cache in /tmp/restic-check-cache-061487687
created new cache in /tmp/restic-check-cache-061487687
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read group #5 of 2874 data packs (out of total 148048 packs in 53 groups)
Pack ID does not match, want 39186bde, got 1da9e122
Pack ID does not match, want 04efce75, got a76690ba
[46:52] 100.00%  2874 / 2874 items
Fatal: repository contains errors

Other details

  • Backend: B2
  • This machine spent at least a week with its clock set incorrectly after a power failure/reboot. There were several backups, prunes and checks performed while the local machine had a clock that was set to a time roughly 30 days in the past. Backups, prunes and checks since the clock was fixed have been successful.
kmwoley (Author) commented Feb 6, 2021

FYI - I'm in the process of running another check to see if I can reproduce the check errors. Some backups have occurred between the check that produced the error and now. It's not done, but I already see new and different errors (e.g. Pack ID does not match, want d8df39a1, got 551ce046, which is different from what was reported above).

How worrisome are the Pack ID does not match, incomplete pack file, and the number of used blobs is larger than number of available blobs errors?

I would love to get some advice on how to check to ensure that my backup data is not corrupt and is still reliable. Any suggestions on next steps would be greatly appreciated.

MichaelEischer (Member) commented:

@kmwoley You're using an old restic version, the current version is 0.11.0, please update it. check is much faster since restic 0.10.0.

Did restic report errors during backup in the time between the last successful check run and the now failing check run? Is there anything suspicious in the log of the last working prune run?

Is the repository used by multiple hosts or just a single host? Was the repository always stored on B2 or did you create it locally and upload it to B2? I'm asking because the "incomplete pack file" shouldn't show up for B2.

Did you run any commands previously to try to repair the damage? I'm especially interested in whether you ran rebuild-index. If the answer is no, then please create a backup of the repository index (the index/ folder in the repository). Or at least do NOT run rebuild-index for now.

The Pack ID does not match is a serious problem as it indicates that some parts of the backup repository are corrupted. Please use restic list packs | grep -e 39186bde -e 04efce75 to get the full pack ids. Then for both files follow the following steps:

  • Please download the file manually and run sha256sum on it. For example using restic cat --quiet --no-cache pack 39186bde[...] | shasum -a256.
  • Does restic cat --quiet pack 39186bde[...] | shasum -a256 return the same or a different result? This will let restic use a cached copy of the file if available. You could also look into the cache yourself (usually located at ~/.cache/restic/<repo_id>/).
  • Next, please use the Backblaze Web UI or a B2 client to get some metadata about the file. The most interesting would be creation time and modification time if available and the file size. Are there maybe several versions of the file?

Please also try these steps for some of the files reported as "incomplete pack file".
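Putting those steps together, something like the following should work (a sketch only; it assumes the repository environment variables are set and reuses the short IDs from the check output above):

# resolve the short IDs reported by check to full pack IDs, then hash each pack
# once bypassing the local cache and once allowing it
for id in $(restic list packs | grep -e 39186bde -e 04efce75); do
    echo "== pack $id"
    restic cat --quiet --no-cache pack "$id" | shasum -a256   # fresh download from B2
    restic cat --quiet pack "$id" | shasum -a256               # may be served from ~/.cache/restic/<repo_id>/
done
# a healthy pack hashes to its own ID; anything else means the downloaded content is corrupt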

FYI - I'm in the process of running another check to see if I can reproduce the check errors. Some backups have occurred between the check that produced the error and now. It's not done, but I already see new and different errors (i.e. Pack ID does not match, want d8df39a1, got 551ce046 which is different than reported above).

You ran the exact same restic check --read-data-subset=05/53 command? If yes, then the second check run should report the exact same errors, or at least all already reported errors plus maybe a few new ones for new files. If the errors keep changing then it could also be a hardware problem on the host creating the backup.

How worrisome are the Pack ID does not match, incomplete pack file, and the number of used blobs is larger than number available blobs errors?

Pack ID does not match and number of used blobs is larger than number of available blobs always indicate that the backup repository has been damaged in some way. It's usually possible to recover by removing / replacing the damaged parts. incomplete pack file isn't something I'd expect to see on B2.

kmwoley (Author) commented Feb 7, 2021

Thank you so much for the detailed response. I'll do my best to answer your questions. I apologize if this is too much detail.

Background

Did restic report errors during backup in the time between the last successful check run and the now failing check run? Is there anything suspicious in the log of the last working prune run?

There's nothing suspicious in the last successful prune run log nor the backups between the last successful prune/check and now. A short history:

  • Successful Forget, Prune & Check: 2021-01-25 (When the local clock was mistakenly set to 2021-01-09)
  • Successful Forget, Prune & Check: 2021-01-30 (After the local clock has been fixed)
  • Roughly 80 successful backups occurred between 2021-01-30 and 2021-02-05 (after the last successful prune & check)
  • Successful Forget, but Failed Prune & Check: 2021-02-05

I've got full logs for all of the above if it's useful, but I didn't see any errors reported in any of those until what I posted in the bug report.

Is the repository used by multiple hosts or just a single host? Was the repository always stored on B2 or did you create it locally and upload it to B2? I'm asking because the "incomplete pack file" shouldn't show up for B2.

  • The repository is used by a single host, first created over a year ago (in 2/2020)
  • The repo was created originally on B2, has always been on B2
  • Current size is 721GB according to Backblaze
  • There are multiple backups run every week (some every other hour, some daily, some weekly for different data sets - each tagged uniquely).
  • Over the past year it has passed partial checks weekly (iterating through subsets via check --read-data-subset=n/53) until now.
  • The repo has been pruned successfully once a week until now.

Did you run any commands previously to try to repair the damage? ...

No, nothing has been done yet to attempt repair. I have not run rebuild index.

Testing hashes - all sha256sums are okay when tested manually

So, this is where it gets interesting.

The packs that check reported in error don't appear to have errors:

# restic cat --quiet pack 04efce75d571f83548156e0e3c3be03a60a454b9ba822356d9b9eccf3ddf04f2 | shasum -a256
04efce75d571f83548156e0e3c3be03a60a454b9ba822356d9b9eccf3ddf04f2  -
# restic cat --quiet --no-cache pack 04efce75d571f83548156e0e3c3be03a60a454b9ba822356d9b9eccf3ddf04f2 | shasum -a256
04efce75d571f83548156e0e3c3be03a60a454b9ba822356d9b9eccf3ddf04f2  -
# restic cat --quiet pack 39186bde83c0a31e89315f268904d3d809d193243fdd0853d9402fd0fed92710 | shasum -a256 
39186bde83c0a31e89315f268904d3d809d193243fdd0853d9402fd0fed92710  -
# restic cat --quiet --no-cache pack 39186bde83c0a31e89315f268904d3d809d193243fdd0853d9402fd0fed92710 | shasum -a256
39186bde83c0a31e89315f268904d3d809d193243fdd0853d9402fd0fed92710  -

Looking on B2, via the Web UI

  • 04efce75[...] was uploaded on 02/20/2020 (almost a year ago)
  • 39186bde[...] was uploaded on 02/28/2020 (almost a year ago)

I downloaded both files directly from B2 via the web UI and checked their hashes on a different machine - both passed there as well.

I tried these steps on a few of the incomplete pack files as well - all of them passed with matching hashes.

Running check --read-data multiple times gives different results

In the past 24 hours I ran check --read-data twice - once from the original host running the older version of restic, and once from a Windows 10 machine running the latest restic (version 0.11). I did not let them run until completion. I observed that they both showed errors, but different errors (i.e. the pack IDs in error were not the same).

From Linux (restic 0.9.6):

# restic check --read-data
using temporary cache in /tmp/restic-check-cache-129095016
repository e07618ce opened successfully, password is correct
created new cache in /tmp/restic-check-cache-129095016
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want 173d2b9e, got 8ca8dda5
Pack ID does not match, want c8726586, got d11f8da2
Pack ID does not match, want 1fb4983c, got 980d560b
Pack ID does not match, want 8f94b769, got f6a17209
Pack ID does not match, want b19cad00, got 073a64df
Pack ID does not match, want dcf2858c, got 460a15a5
Pack ID does not match, want 881e29ef, got 39afd135
Pack ID does not match, want 7619326b, got 89ddd788
Pack ID does not match, want e25d9b8e, got 11bf3dc1
Pack ID does not match, want c5e24039, got 1b0f24b0
Pack ID does not match, want b820e106, got 8027b4f5
Pack ID does not match, want 1f7a1456, got c054ec8d
Pack ID does not match, want 906ab349, got 0a14a085
Pack ID does not match, want b7a05220, got 4ab60bf0
Pack ID does not match, want 17f4b6fb, got 7162cfda
Pack ID does not match, want 755a9521, got 0ad383b3
Pack ID does not match, want 5908682c, got 67eba940
[12:11:12] 29.79%  44106 / 148050 items

Aborted via CTRL-C at this point.

From Windows (restic 0.11):

PS C:\restic> .\restic check --read-data
using temporary cache in C:\Users\Kevin\AppData\Local\Temp\restic-check-cache-019862051
repository e07618ce opened successfully, password is correct
created new cache in C:\Users\Kevin\AppData\Local\Temp\restic-check-cache-019862051
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want b820e106, got d29cab67
Pack ID does not match, want 09adece9, got 76c2b314
Pack ID does not match, want ed471b41, got 6f2342e8
Pack ID does not match, want 8ddad374, got 60da8dec
Pack ID does not match, want 28922947, got b70c1408
Pack ID does not match, want 2872560b, got e9a9dd61
Pack ID does not match, want 36f30e23, got 09738f65
Pack ID does not match, want fdd3c3c5, got d84260ac
Pack ID does not match, want 201bccd6, got 0ac7e95f
Pack ID does not match, want c8b4317a, got 397e889a
Pack ID does not match, want 155b0238, got 9f0ecb31
[11:55:21] 21.02%  31120 / 148050 packs

Aborted via CTRL-C at this point.

Spot checking each of these packs for hash errors has shown that NONE of them actually have errors!

I would assume that the check checks the same packs in the same order, so that result is odd to me. Am I making a fair assumption there?

Next steps?

My next step is to run the shorter --read-data-subset=05/53 check to completion multiple times on different machines to try to eliminate a hardware problem on any one machine.

But, honestly, that feels silly at this point since each time I spot check any apparent errors the files appear to have hashes that match.

@MichaelEischer - do you have any other ideas of things to try/check to diagnose this? Thank you in advance for your help.

MichaelEischer (Member) commented:

@kmwoley Thanks a lot for the detailed reply. Judging from the prune/check schedule, the problem must have appeared somewhere between 2021-01-30 and 2021-02-05, assuming restic was responsible. But pack files that seem to have different hashes depending on when you look at them suggest that something else is going on.

I would assume that the check checks the same packs in the same order, so that result is odd to me. Am I making a fair assumption there?

The order in which pack files are checked is effectively random and is very likely different each time. But I've noticed that a few pack file IDs have shown up multiple times in different logs:

incomplete pack file (will be removed): 5908682c326af00157838c8335e15414c7a7e79b0450148edbfa9d32b6eb2989
Pack ID does not match, want 5908682c, got 67eba940

incomplete pack file (will be removed): 173d2b9e645bd13eb794149b3f6f7b938940272f5a3b10cc0d48b13b61f0f73d
Pack ID does not match, want 173d2b9e, got 8ca8dda5

Pack ID does not match, want b820e106, got 8027b4f5
Pack ID does not match, want b820e106, got d29cab67

The last file seems to be particularly interesting: apparently restic ended up with a different file content both times. Could you run restic list packs | grep b820e106 to get the full ID and then run restic cat --quiet pack b820e106[...] | shasum -a256 for that file? I really wonder whether that will give a correct result or yet another different unexpected hash of the file content. If the hash matches then it looks like the file itself is intact on B2 but somehow restic ends up with a wrong result when checking lots of files. I don't have any good leads on how that could happen. Maybe a problem during the download, maybe some hardware problem on the host, maybe something in restic or the operating system? (although you've tried Windows/Linux and different hardware by now?)

To make really sure that the pack file content is actually intact, please build the branch debug-1999 which contains some code to debug (corrupted) files (via go build -tags debug ./cmd/restic). If you need more detailed instructions, feel free to ask. Then run:

$ ./restic debug examine b820e106[...]

That should give us information about the internal structure of the file and the damage that it may have suffered.
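Roughly, the build steps would be (assuming a Go toolchain is installed; the repository location and full pack ID are placeholders that need to be filled in):

git clone https://github.com/restic/restic.git
cd restic
git checkout debug-1999              # branch containing the debug examine code
go build -tags debug ./cmd/restic    # builds ./restic with the debug command enabled
./restic -r b2:my-bucket:backups debug examine b820e106[...]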

Just to be sure: the restic check runs on windows and linux were using a different machine?

But, honestly, that feels silly at this point since each time I spot check any apparent errors the files appear to have hashes that match.

That is very, very strange indeed. Is there anything which might tamper with the network connection between restic and B2?

MichaelEischer added the backend: b2, state: need feedback, and state: need investigating labels on Feb 9, 2021
kmwoley (Author) commented Feb 10, 2021

Thanks for the response - it'll be a few days before I have time to run the next diagnosis steps. I should have a response back to you by this weekend (2021-02-14).

fd0 (Member) commented Feb 10, 2021

ping @kurin, any idea what's going on here? It looks like restic gets different data during check than when only a single file is downloaded via cat; what may have caused that?

kurin (Contributor) commented Feb 10, 2021

How big are these files? As I recall, restic chunks data before sending it to any backends. If these chunks are less than the size at which blazer switches to the B2 large file API, which by default is at 100M, then every object in B2 will have a SHA1 metadata tag, which is verified by the server on upload. This can be verified by a client on download: https://pkg.go.dev/github.com/kurin/blazer/b2#Reader.Verify
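For example, a manually downloaded pack could be checked against both hashes like this (a sketch; the file name is a placeholder for a pack fetched via the B2 web UI or API):

f=./downloaded-pack            # placeholder
shasum -a1 "$f"     # compare with the SHA-1 metadata B2 stores and verifies on upload
shasum -a256 "$f"   # compare with the pack's restic ID (packs are named after their SHA-256)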

That said, I haven't really done much with this project in a couple years. B2 has a v2 API and also an S3-emulated API, which is probably the best bet for everybody.

MichaelEischer (Member) commented:

@kurin In #3271 (comment) this data corruption affects files of only a few MB. And apparently the file content may even change between subsequent downloads. Such small files should also be downloaded as a single chunk? So there's not too much happening which could go wrong in the library. I'm starting to think that this could be a problem on the B2 server side...

kurin (Contributor) commented Feb 10, 2021

If you download the blobs and hash them in a loop, do they give consistent results? It would surprise me if there were a bug in blazer we're only just seeing now, but not a whole lot, and not more than it would surprise me that B2 is serving corrupted data.
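Something along these lines would do it (a sketch; it assumes the restic repository environment variables are set and the full pack ID is filled in):

pack=b820e106[...]    # placeholder: full ID from 'restic list packs | grep b820e106'
for i in $(seq 1 100); do
    got=$(restic cat --quiet --no-cache pack "$pack" | shasum -a256 | cut -d' ' -f1)
    if [ "$got" != "$pack" ]; then
        echo "iteration $i: mismatch, got $got"
    fi
done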

fd0 (Member) commented Feb 12, 2021

I've reached out to Backblaze and asked for someone from engineering to help us debug this :)

kmwoley (Author) commented Feb 13, 2021

Summary

I believe this to be an issue independent of Restic. I have downloaded one pack directly from B2's web UI that resulted in a checksum mismatch. I have run two full data checks from different machines, which appear to give random checksum results. Every single manual pack checksum check I have done passes, except for the one documented below.

Manual Checksum Failure - 49791035

In the full --read-data checks, there were 5 packs that showed up as in error in both checks. I manually verified each of those packs by downloading it from B2's Web UI. One resulted in a corrupt file the first time I downloaded it (on Windows, Chrome). On the second download, the file was correct.

Here are the SHA-1 results (to compare against the Backblaze UI):

[screenshot: SHA-1 checksums of the corrupt and correct downloads]

Binary Diff

I was curious if this was a single bit flip, or something else. So I put the corrupt file and the correct file into a binary comparison tool. The result was that about 12% of the file differed... in 5 different locations... not just the start or end.

[screenshot: binary diff of the corrupt vs. correct file]

I also downloaded the file through the UI 5 more times... each of those subsequent downloads through the UI was correct.

Checking the single file in a loop

I'm currently running the download / shasum of the single file in a loop - I'll report back on what I find there.

Summary of --read-data testing

  • I have run two full data checks from two different machines using Restic 0.11. The detailed results are posted below.
  • The rate of errors is different between the two machines. The Linux machine produced ~0.84 errors/hour of running check; the Windows machine produced ~0.56 errors/hour.
  • Of the 67 packs in error total (out of 148050), only 5 of the same packs showed up as in error from both machines.

Full Check Results

Windows Machine

Windows 10, Restic 0.11, Wireless Network

PS C:\restic> .\restic check --read-data
using temporary cache in C:\Users\Kevin\AppData\Local\Temp\restic-check-cache-954866003
repository e07618ce opened successfully, password is correct
created new cache in C:\Users\Kevin\AppData\Local\Temp\restic-check-cache-954866003
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want 8ddad374, got b4237d17
Pack ID does not match, want f5ef3b30, got 957945b4
Pack ID does not match, want 76512646, got d983233e
Pack ID does not match, want eefb18c5, got 28e6b164
Pack ID does not match, want 202597c6, got 6976c331
Pack ID does not match, want a0047d1a, got 3135cc61
Pack ID does not match, want f72d22f6, got bd0050a0
Pack ID does not match, want f37288d6, got 1caab1ba
Pack ID does not match, want 581b0af8, got 05113ec4
Pack ID does not match, want 8e21bea4, got f6ace65a
Pack ID does not match, want 27b07e9e, got 98c3ded8
Pack ID does not match, want 81870b82, got cdee9d5d
Pack ID does not match, want eb228e74, got 736f0e8f
Pack ID does not match, want af8d336b, got 3e076569
Pack ID does not match, want 99fe9e94, got 70e33c03
Pack ID does not match, want 02080be7, got b2514977
Pack ID does not match, want 4ac9361f, got 5fde1c78
Pack ID does not match, want 7cd68700, got ea422a27
Pack ID does not match, want ba75213e, got be695407
Pack ID does not match, want 49791035, got a1baa971
Pack ID does not match, want ca8e84cc, got a0cee332
Pack ID does not match, want f900df97, got 2960de63
Pack ID does not match, want 6db65d0a, got 2c4792af
Pack ID does not match, want 353ae954, got 87130773
Pack ID does not match, want 1fb4983c, got 4a49100a
Pack ID does not match, want ce95f9d8, got 4ec61790
Pack ID does not match, want 31b6ede2, got c7f7b114
[48:20:01] 100.00%  148050 / 148050 packs
Fatal: repository contains errors
PS C:\restic>

Linux Machine

Unraid Linux (backup host). Restic 0.11. Wired network.

using temporary cache in /tmp/restic-check-cache-339533226
repository e07618ce opened successfully, password is correct
created new cache in /tmp/restic-check-cache-339533226
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
Pack ID does not match, want c29bc1a4, got b9d5d04b
Pack ID does not match, want bd505916, got 9861462a
Pack ID does not match, want 176ccc63, got 8a69d9fd
Pack ID does not match, want 287d3c84, got 7c6bee66
Pack ID does not match, want 4ac9361f, got 4c2b16cd
Pack ID does not match, want 316d4bf5, got 5ede609f
Pack ID does not match, want df72f5a1, got 6aa414b7
Pack ID does not match, want ad74edf2, got 03632064
Pack ID does not match, want ba2b9894, got 63c0f8cb
Pack ID does not match, want f4f93c0e, got 09f69c2f
Pack ID does not match, want 082c2dcd, got bf4f50d7
Pack ID does not match, want aeadd833, got b21a1f58
Pack ID does not match, want 4199ff0a, got 574624b6
Pack ID does not match, want f5ef3b30, got dff3c6a0
Pack ID does not match, want 35f93620, got f18216a5
Pack ID does not match, want 27b07e9e, got 735d923f
Pack ID does not match, want 1dfbc37c, got d32bdfe4
Pack ID does not match, want 429b7282, got 54890386
Pack ID does not match, want 7cd68700, got cbf88637
Pack ID does not match, want 7ef7ad06, got fde1949d
Pack ID does not match, want e15cdbeb, got 16621489
Pack ID does not match, want 119cd505, got de561fff
Pack ID does not match, want 6f0f9d33, got 537b8924
Pack ID does not match, want 864f3137, got 37590313
Pack ID does not match, want 97bccb43, got cf730317
Pack ID does not match, want aad302e6, got 29f03512
Pack ID does not match, want aab71342, got 24eec66d
Pack ID does not match, want 47336b30, got 8f871085
Pack ID does not match, want 5adc97ac, got 6c59e41a
Pack ID does not match, want 49791035, got 36db95e9
Pack ID does not match, want 201bccd6, got bd0c9dd6
Pack ID does not match, want aa03c2b8, got 4a52996d
Pack ID does not match, want 79f599e4, got 1d5c3fa6
Pack ID does not match, want a463af1a, got b9cc48da
Pack ID does not match, want 83cf9760, got bfc67dca
Pack ID does not match, want 66a803f6, got fb373c39
Pack ID does not match, want c51332bd, got 3be73f0b
Pack ID does not match, want 313d8b37, got 6b2d8184
Pack ID does not match, want 2ebd7b8f, got c40c820f
Pack ID does not match, want eb9c1d4a, got 15322491
[47:54:31] 100.00%  148050 / 148050 packs
Fatal: repository contains errors

Packs with errors from both runs

Want Got Env Size (KB)
49791035 a1baa971 Windows 5,182
49791035 36db95e9 Linux 5,182
27b07e9e 98c3ded8 Windows 4,859
27b07e9e 735d923f Linux 4,859
4ac9361f 5fde1c78 Windows 4,711
4ac9361f 4c2b16cd Linux 4,711
7cd68700 ea422a27 Windows 4,215
7cd68700 cbf88637 Linux 4,215
f5ef3b30 957945b4 Windows 4,300
f5ef3b30 dff3c6a0 Linux 4,300

I downloaded each of these packs from B2 using the GUI to validate their checksums. All of them passed, except 49791035 as noted above.

Other notes

  • I had recently upgraded my router firmware. I downgraded it to the previous version and ran a --read-data check to see if it made a difference. It did not - I still got errors. :/

kmwoley (Author) commented Feb 13, 2021

Looking at #3271 - this looks like the same behavior.
Hopefully @fd0's contacts at Backblaze can give us some joy.

kmwoley (Author) commented Feb 14, 2021

Update - Loop Testing Completed

@kurin, as suggested, I ran loops to repeatedly download the file to see if I could force/reproduce the download error.

I used different techniques (1,500 iterations each).

I was unable to reproduce the error by just looping over and downloading the same pack repeatedly.

I'm at a loss, folks.
I submitted a ticket to Backblaze support.
Any suggested next steps?

fd0 (Member) commented Feb 14, 2021

@kmwoley thanks a lot for the comprehensive debugging! Especially downloading the file via the Backblaze web UI and diffing it is helpful (I hope)! I'll contact Backblaze again and show them what you found. It's really odd behavior that should not happen...

nilayp commented Feb 14, 2021

Nilay from Backblaze here. @fd0 has indeed contacted us and we are investigating. And thanks to those of you who opened support tickets with us too.

For those reporting this problem, did you all start noticing these errors on about the same day, Friday, Feb 5th (when this issue was opened)?

kmwoley (Author) commented Feb 14, 2021

@nilayp - thanks for investigating!

I first experienced the issue at the earliest at 2021-02-05 22:08:08 PST, or possibly up to 5.5 hours later, based on my logs.

@adworacz filed #3266 which looks like it might have occurred before I saw the issue, assuming it's related.

@cmeyer23 filed #3271 which looks like it occurred about the same time as I reported, assuming it's related.

adworacz commented:

@nilayp I noticed this issue prior to Feb 5th, at least Feb 4th. I opened #3266 after I had run several "expensive" operations that took ~1 day to run per operation. I don't have better logs from then (something that I've since remedied going forwards), so it's difficult to say what day exactly I first noticed the issue, but it's right around the time of Feb 4th.

That said, I hadn't run a backup or prune for a few weeks before then, so it's a rather large time window prior.

nilayp commented Feb 16, 2021

Thanks to @kmwoley for reproducing this issue without restic. Our engineering team has the details he provided and we are currently investigating.

tomm87 commented Feb 17, 2021

I'm the user who opened the apparently related issue #3272, with an answer for @nilayp: I have a Restic repository in B2 for which Restic first reported missing blobs during a prune operation on February 6th. The previous prune was January 2nd; no problems were found at that time. Thank you!

kmwoley (Author) commented Feb 20, 2021

Hey folks, I'm still waiting to hear back from Backblaze.
In the meantime, I updated to Restic 0.12 and kicked off a data check to see what would happen. (Who knows, right?)

Is it expected that the check does not show progress for a long period of time (i.e. hours) at this stage?

# restic check --read-data | tee 2021.02.20.log
using temporary cache in /tmp/restic-check-cache-740384393
created new cache in /tmp/restic-check-cache-740384393
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
[1:51] 100.00%  215 / 215 snapshots


I can see from my system monitor that it's downloading at full-speed and using a core of CPU, and has been for about an hour or more without reporting any progress.

cmeyer23 commented:

Thanks for the update @kmwoley. I also posted a Backblaze support ticket last weekend and the only update was on Monday that they would start to look into the issue on Tuesday. I also ran a restic check today, but only on a ~100GB subset of the 1.3T repo. The check encountered no errors. I am still running restic 0.11 and always log the output. What you are seeing is normal, except I have never seen the line about checking the snapshots (I guess that is new to 0.12?). I presume your next line will be [elapsed time] 100.00% X / X packs when the check has completed.

kmwoley (Author) commented Feb 21, 2021

restic check --read-data still throwing errors. Not really surprising.
Looks like the progress indicator has changed in 0.12; I normally see some progress indicated as it's running.

MichaelEischer (Member) commented:

@kmwoley

Is it expected that the check does not show progress for a long period of time (i.e. hours) at this stage?

# restic check --read-data | tee 2021.02.20.log
using temporary cache in /tmp/restic-check-cache-740384393
created new cache in /tmp/restic-check-cache-740384393
create exclusive lock for repository
load indexes
check all packs
check snapshots, trees and blobs
read all data
[1:51] 100.00%  215 / 215 snapshots

I can see from my system monitor that it's downloading at full-speed and using a core of CPU, and has been for about an hour or more without reporting any progress.

As you pipe the output of check into another command, this is considered non-interactive usage, and in that case restic doesn't output any live progress. With restic 0.12.0 it's now possible to configure this using the RESTIC_PROGRESS_FPS environment variable or by sending restic a SIGUSR1 signal.
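For example (values are illustrative; RESTIC_PROGRESS_FPS takes updates per second, so 0.016 is roughly one update per minute):

# print periodic progress even though stdout is piped
RESTIC_PROGRESS_FPS=0.016 restic check --read-data | tee 2021.02.20.log

# or request a one-off progress line from an already running restic process
kill -USR1 "$(pidof restic)"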

nilayp commented Feb 22, 2021

After a multi-day exploration involving a number of our engineers, we believe we have determined the root cause of this issue.

When data is uploaded to our infrastructure, each file is erasure coded across 20 storage servers, which we call a Backblaze Vault. The files identified by @kmwoley and other Restic customers were all stored in vault-1129.

This vault happened to have disk errors on a number of hard drives. Therefore, it was possible for a download to intermittently be corrupt, as reported. However, because of the way files are stored across multiple servers, the file is still correct and available. No data has been lost.

Backblaze has an integrity checker which slowly runs over the disks looking for issues. In fact, by the time engineering started looking into the problem last week, some of the corrupt parts were already detected, moved aside, and queued to be rebuilt. This led to it taking longer to understand the problem. Going forward, our tools to determine the state of files will be improved. Also, there is a project that will be going live in the next month or so to better detect this issue while reading a file for download.

At this point - we believe the issue has been resolved; vault-1129 is healthy again. If this is not the case, please do let us know by opening a support ticket and referencing this post.

It is a mystery why only restic customers reported this issue to Backblaze and, further, why the data was stored on vault-1129. (For context, each vault has 1,200 drives and as of Dec. 31, 2020, we have 165,530 drives in production, or ~137 vaults.) This was probably just a coincidence.

kmwoley (Author) commented Feb 22, 2021

@nilayp - thank you for the details and the response! I appreciate how responsive and transparent you and the folks at Backblaze have been. I do have to say that I'm both surprised and disappointed that the B2 architecture allows files which are detectably corrupt to be served without warning, and that the architecture didn't detect & heal itself for around 13 days (I was getting errors as recently as yesterday). I'm glad to hear you all are focusing on making improvements in this area - I look forward to hearing about what you do here, if you end up blogging publicly about it. Again - thank you for being so responsive while working with me to get to the bottom of this issue and for sharing the root cause publicly.

As for why only restic users reported this issue to Backblaze - coincidence? Or maybe restic is just that good? 😉

For my part, I'm re-running a full restic check --read-data to verify that there are not any reported corruptions. I'll report back in a couple of days when that completes.

kmwoley (Author) commented Feb 22, 2021

Welp. Bad news. I'm still seeing corruption when I run restic check --read-data... I got an error within 10 minutes of running --read-data.

I've sent details to Backblaze support to continue the investigation.

nilayp commented Feb 23, 2021

I misspoke above. The issue has only been resolved for the corrupt files that were reported to Backblaze via support tickets. There are likely other files that are caught up in this issue that may still be corrupted. As I mentioned, we have an integrity checker that is slowly cleaning this up. There are other countermeasures being developed and I'll update this issue with a status when I have one.

I apologize for stating this issue was resolved. It is not and continues to be something we are actively working on.

kmwoley (Author) commented Mar 1, 2021

Hey folks - I can confirm that Backblaze deployed a fix on Feb 26, 2021 (around or before 12:29 PST) that resolved the data reading errors I was experiencing. Many thanks to all of you who helped diagnose the issue and get it in front of Backblaze for attention.

kmwoley closed this as completed on Mar 1, 2021
nilayp commented Mar 1, 2021

@kmwoley is correct. We deployed a fix that should finally resolve this issue. Thanks to @kmwoley and everyone who reported the issue. If you continue to see a problem, please file a ticket and reference this issue.

Thanks to everyone who helped diagnose this issue, and thank you for your patience.

fd0 (Member) commented Mar 1, 2021

Awesome, thanks a lot!
