Fatal: number of used blobs is larger than number of available blobs! #3268
Comments
FYI - I'm in the process of running another `check` to see if I can reproduce the errors. How worrisome are they? I would love some advice on how to verify that my backup data is not corrupt and is still reliable. Any suggestions on next steps would be greatly appreciated. |
@kmwoley You're using an old restic version; the current version is 0.11.0, please update. Did restic report errors during earlier runs? Is the repository used by multiple hosts or just a single host? Was the repository always stored on B2, or did you create it locally and upload it to B2? I'm asking because the "incomplete pack file" error shouldn't show up for B2. Did you run any commands previously to try to repair the damage? I'm especially interested in whether you ran `rebuild-index`. Pack files in a restic repository are named after the SHA-256 hash of their content, so an intact pack must hash to its own file name.
Please also try these steps (download the file, hash it, compare the hash against the file name) for some of the files reported as "incomplete pack file". You ran the exact same `check` command both times?
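A minimal sketch of that verification, assuming the b2 CLI and a hypothetical pack ID (the repository path layout shown is illustrative):

```sh
# Pack ID as reported by restic check (hypothetical).
ID=3b25ce8f...

# Fetch the pack with any download tool; the b2 CLI is one option.
b2 download-file-by-name my-bucket "restic-repo/data/$ID" "$ID"

# restic names every pack after the SHA-256 of its content,
# so the computed digest must equal the file name.
sha256sum "$ID"
```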
|
Thank you so much for the detailed response. I'll do my best to answer your questions. I apologize if this is too much detail.

**Background**
There's nothing suspicious in the log of the last successful prune, nor in the backups between the last successful prune/check and now. A short history:
I've got full logs for all of the above if it's useful, but I didn't see any errors reported in any of them until the ones I posted in the bug report.
No, nothing has been done yet to attempt repair. I have not run `rebuild-index`.

**Testing Hashes - all sha256sums are okay when testing manually**

So, this is where it gets interesting. The packs that `check` reported as damaged pass manual hash verification.
**Looking on B2, via the Web UI**
I downloaded both files directly from B2 via the web UI and checked their hashes on a different machine; both passed there as well. I tried these steps on a few of the incomplete pack files too, and all of them passed with matching hashes. Running `check` again, however, still reported errors. |
|
@kmwoley Thanks a lot for the detailed reply. Judging from the prune/check schedule, the problem must have appeared somewhere between 2021-01-30 and 2021-02-05, assuming that restic was responsible. But pack files that appear to have different hashes depending on when you look at them suggest that something else is going on.
The order in which pack files are checked is effectively random and very likely different each time. But I've noticed that a few pack file IDs have shown up multiple times in different logs:
The last file seems particularly interesting: apparently restic ended up with different file content both times. Could you run `sha256sum` on that file a few more times? To make really sure that the pack file content is actually intact, please build the branch debug-1999, which contains some code to debug (corrupted) pack files.
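A sketch of building and running such a branch (assuming the branch is reachable from the main restic repository and that it exposes a `debug examine` subcommand; both are assumptions, not confirmed here):

```sh
# Build restic from the debugging branch (requires a Go toolchain).
git clone https://github.com/restic/restic.git
cd restic
git checkout debug-1999
go build ./cmd/restic            # produces ./restic in the current directory

# Hypothetical invocation: examine one pack file by ID.
./restic -r b2:my-bucket:restic-repo debug examine 3b25ce8f...
```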
That should give us information about the internal structure of the file and the damage it may have suffered. Just to be sure: were the restic check runs on Windows and Linux done on different machines?
That is very, very strange indeed. Is there anything which might tamper with the network connection between restic and B2? |
Thanks for the response - it'll be a few days before I have time to run the next diagnosis steps. I should have a response back to you by this weekend (2021-02-14). |
ping @kurin, any idea what's going on here? It looks like restic, during `check`, sometimes receives corrupted data when downloading pack files from B2. |
How big are these files? As I recall, restic chunks data before sending it to any backend. If these chunks are smaller than the size at which blazer switches to the B2 large-file API, which by default is 100M, then every object in B2 will have a SHA1 metadata tag, which is verified by the server on upload. This can be verified by a client on download: https://pkg.go.dev/github.com/kurin/blazer/b2#Reader.Verify That said, I haven't really done much with this project in a couple of years. B2 has a v2 API and also an S3-compatible API, which is probably the best bet for everybody.
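For what it's worth, that server-reported SHA1 can also be checked by hand outside of blazer. A rough sketch with curl (hypothetical bucket, path, and download host; an authorization token is assumed for private buckets):

```sh
# Download the object and capture the response headers.
curl -s -D headers.txt \
  -H "Authorization: $B2_AUTH_TOKEN" \
  -o pack.bin \
  "https://f000.backblazeb2.com/file/my-bucket/restic-repo/data/3b25ce8f..."

# B2 reports the stored checksum in the X-Bz-Content-Sha1 header;
# it should match the SHA1 of the bytes we actually received.
grep -i '^x-bz-content-sha1' headers.txt
sha1sum pack.bin
```
|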
@kurin In #3271 (comment) this data corruption affects files of only a few MB, and apparently the file content may even change between subsequent downloads. Such small files should be downloaded as a single chunk, so there's not much happening in the library that could go wrong. I'm starting to think that this could be a problem on the B2 server side... |
If you download the blobs and hash them in a loop, do they give consistent results? It would surprise me if there were a bug in blazer we're only just seeing now, but not a whole lot, and not more than it would surprise me that B2 is serving corrupted data. |
I've reached out to Backblaze and asked for someone from engineering to help us debug this :) |
**Update - Loop Testing Completed**

@kurin, as suggested, I ran loops to repeatedly download the file to see if I could force/reproduce the download error. I used different techniques (1500 iterations each):
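A minimal sketch of one such loop, assuming the b2 CLI and hypothetical bucket/pack names:

```sh
# Download the same pack 1500 times and tally the hashes observed.
# A single line of output means every download was byte-identical.
for i in $(seq 1 1500); do
  b2 download-file-by-name my-bucket \
    "restic-repo/data/3b25ce8f..." /tmp/pack.bin >/dev/null
  sha256sum /tmp/pack.bin
done | sort | uniq -c
```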
I was unable to reproduce the error by just looping over and downloading the same pack repeatedly. I'm at a loss, folks. |
@kmwoley thanks a lot for the comprehensive debugging! Especially downloading the file via the Backblaze web UI and diffing it is helpful (I hope)! I'll contact Backblaze again and show them what you found. It's really odd behavior that should not happen... |
Nilay from Backblaze here. @fd0 has indeed contacted us and we are investigating. And thanks to those of you who opened support tickets with us too. For those reporting this problem: did you all start noticing these errors around the same day, Friday, Feb 5th (when this issue was opened)? |
@nilayp - thanks for investigating! I first experienced the issue at the earliest at 2021-02-05 22:08:08 PST, or possibly up to 5.5 hours later, based on my logs. @adworacz filed #3266 which looks like it might have occurred before I saw the issue, assuming it's related. @cmeyer23 filed #3271 which looks like it occurred about the same time as I reported, assuming it's related. |
@nilayp I noticed this issue prior to Feb 5th, at least as early as Feb 4th. I opened #3266 after I had run several "expensive" operations that took ~1 day each. I don't have better logs from then (something I've since remedied), so it's difficult to say exactly what day I first noticed the issue, but it was right around Feb 4th. That said, I hadn't run a backup or prune for a few weeks before then, so the window prior is rather large. |
Thanks to @kmwoley for reproducing this issue without restic. Our engineering team has the detail he provided and we are currently investigating. |
Hey folks, I'm still waiting to hear back from Backblaze. Is it expected that the check does not show progress for a long period of time (i.e. hours) at this stage?
I can see from my system monitor that it's downloading at full-speed and using a core of CPU, and has been for about an hour or more without reporting any progress. |
Thanks for the update @kmwoley. I also posted a Backblaze support ticket last weekend, and the only update was on Monday that they would start to look into the issue on Tuesday. I also ran a `check`, piping its output to a log file. |
As you pipe the output of `check` into another command, that is considered non-interactive usage, and in that case restic doesn't output any live progress. With restic 0.12.0 it's now possible to configure this using the `RESTIC_PROGRESS_FPS` environment variable.
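A sketch of how that might look (the FPS value is illustrative; 1/60 of an update per second is roughly one progress line per minute):

```sh
# RESTIC_PROGRESS_FPS controls how often progress is printed, in
# updates per second, even when stdout is piped (non-interactive).
RESTIC_PROGRESS_FPS=0.016666 restic check --read-data | tee check.log
```
|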
After a multi-day investigation involving a number of our engineers, we believe we have determined the root cause of this issue. When data is uploaded to our infrastructure, each file is erasure coded across 20 storage servers, which we call a Backblaze Vault. The files identified by @kmwoley and other restic customers were all stored in vault-1129. This vault happened to have disk errors on a number of hard drives. Therefore, it was possible for a download to intermittently be corrupt, as reported. However, because of the way files are stored across multiple servers, the files are still correct and available. No data has been lost.

Backblaze has an integrity checker which slowly runs over the disks looking for issues. In fact, by the time engineering started looking into the problem last week, some of the corrupt parts were already detected, moved aside, and queued to be rebuilt. This made the problem take longer to understand. Going forward, our tools for determining the state of files will be improved. Also, there is a project going live in the next month or so to better detect this issue while reading a file for download.

At this point we believe the issue has been resolved; vault-1129 is healthy again. If this is not the case, please do let us know by opening a support ticket and referencing this post.

It is a mystery why only restic customers reported this issue, and further why the data was stored on vault-1129. (For context, each vault has 1,200 drives and as of Dec. 31, 2020, we had 165,530 drives in production, or ~137 vaults.) This was probably just a coincidence. |
@nilayp - thank you for the details and the response! I appreciate how responsive and transparent you and the folks at Backblaze have been. I do have to say that I'm both surprised and disappointed that the B2 architecture allows files which are detectably corrupt to be served without warning, and that the architecture didn't detect and heal itself for around 13 days (I was getting errors as recently as yesterday). I'm glad to hear you all are focusing on making improvements in this area - I look forward to hearing about what you do here, if you end up blogging publicly about it. Again - thank you for being so responsive while working with me to get to the bottom of this issue and for sharing the root cause publicly. As for why only restic users noticed: restic actually verifies the integrity of the data it reads back, which most clients never do. For my part, I'm re-running a full `check --read-data` to confirm the repository is healthy. |
Welp. Bad news. I'm still seeing corruption when I run `check --read-data`. I've sent details to Backblaze support to continue the investigation. |
I misspoke above. The issue has only been resolved for the corrupt files that were reported to Backblaze via support tickets. There are likely other files caught up in this issue that may still be corrupted. As I mentioned, we have an integrity checker that is slowly cleaning this up. There are other countermeasures being developed, and I'll update this issue with a status when I have one. I apologize for stating this issue was resolved. It is not, and it continues to be something we are actively working on. |
Hey folks - I can confirm that Backblaze deployed a fix on Feb 26, 2021 (around or before 12:29 PST) that resolved the data reading errors I was experiencing. Many thanks to all of you who helped diagnose the issue and get it in front of Backblaze for attention. |
Awesome, thanks a lot! |
Restic Version
restic 0.9.6 compiled with go1.13.4 on linux/amd64
Command(s)
```
restic prune
restic check --read-data-subset=05/53
```
(the commands are automated as part of a weekly backup maintenance script)
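A sketch of what such a weekly script might look like (hypothetical repository, paths, and week-number mapping; `--read-data-subset=n/53` read-verifies one fifty-third of the pack files, so rotating n covers the whole repository over about a year):

```sh
#!/bin/sh
# Hypothetical weekly maintenance job, e.g. run from cron.
set -e
export RESTIC_REPOSITORY="b2:my-bucket:restic-repo"   # hypothetical
export RESTIC_PASSWORD_FILE=/etc/restic/password

# Map the ISO week number (01..53) onto subsets 1..53.
WEEK=$(( ($(date +%V | sed 's/^0//') % 53) + 1 ))

restic prune
restic check --read-data-subset="$WEEK/53"
```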
What did you expect?
An error-free prune followed by an error-free check.
What happened instead?
Both the `prune` and the `check` failed.

Output of prune:
Output of check:
Other details