Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

restorer: allow writing target file blobs out of order #2195

Merged
merged 2 commits into from Feb 26, 2020

Conversation

@ifedorenko
Copy link
Contributor

@ifedorenko ifedorenko commented Mar 3, 2019

What is the purpose of this change? What does it change?

Much simpler implementation that guarantees each required pack
is downloaded only once (and hence does not need to manage
pack cache). Also improves large file restore performance.

Was the change discussed in an issue or in the forum before?

See #2074
https://forum.restic.net/t/restore-using-rclone-gdrive-backend-is-slow/1112/8
https://forum.restic.net/t/degraded-restore-performance-s3-backend/1400

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review
@ifedorenko ifedorenko mentioned this pull request Mar 3, 2019
5 of 7 tasks
@codecov-io
Copy link

@codecov-io codecov-io commented Mar 3, 2019

Codecov Report

No coverage uploaded for pull request base (master@af20015). Click here to learn what that means.
The diff coverage is 67.7%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #2195   +/-   ##
=========================================
  Coverage          ?   45.82%           
=========================================
  Files             ?      174           
  Lines             ?    14290           
  Branches          ?        0           
=========================================
  Hits              ?     6549           
  Misses            ?     6730           
  Partials          ?     1011
Impacted Files Coverage Δ
internal/restorer/restorer.go 42.85% <100%> (ø)
internal/restorer/filerestorer.go 66.02% <63.84%> (ø)
internal/restorer/fileswriter.go 72.09% <82.75%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af20015...c83044f. Read the comment docs.

@ifedorenko
Copy link
Contributor Author

@ifedorenko ifedorenko commented Mar 3, 2019

@fd0 @mholt can you review this PR when you get a chance? It is the code discussed in #2074, and should be ready or almost ready to be merged to master.

Compared to master, this PR does not guarantee target files are written sequentially, start to finish, which allowed much simpler implementation and faster multithreaded restore of very large files.

@mholt
Copy link
Contributor

@mholt mholt commented Mar 3, 2019

Sure -- I just barely finished my masters thesis so I'm in major catch-up mode, but at a glance I really like the ~1K LoC reduction. Don't wait up on me, but I will try to give it a high-level once over soon.

@ifedorenko ifedorenko force-pushed the ifedorenko:out-of-order-restore-no-progress branch from f4c7335 to 9491b27 Mar 17, 2019
@ifedorenko ifedorenko force-pushed the ifedorenko:out-of-order-restore-no-progress branch from 9491b27 to 5db73a6 May 1, 2019
@DurvalMenezes
Copy link

@DurvalMenezes DurvalMenezes commented May 22, 2019

Hello everyone, and specially @ifedorenko.

Just to make sure: this PR objective is speed optimization, not memory usage, right?

I ask because it has been implied in the forum that it could help me save some memory during the restore, but I'm seeing the exact opposite (about double memory usage as compared to "standard" v0.9.5 restic).

PS: if you need anything, I compiled it with --tags profile, so I can provide anything the pprof runtime webserver makes available.

Cheers,
-- Durval.

@ifedorenko
Copy link
Contributor Author

@ifedorenko ifedorenko commented May 22, 2019

This PR is supposed to use little less memory compared to 0.9.5, can't think of why it takes 2x memory instead. I'd need to know what data structures use the memory in your tests to suggest anything specific.

@DurvalMenezes
Copy link

@DurvalMenezes DurvalMenezes commented May 22, 2019

Hello @ifedorenko,

This PR is supposed to use little less memory compared to 0.9.5, can't think of why it takes 2x memory instead.

Perhaps my assertion of 2X memory was premature: it used 2x the memory in the first 15m, but that could have been because it just progressed farther in that interval than standard 0.9.5.

I'm monitoring it as it runs, and after ~1h05m since then, it seems to have plateaued at about the same 36GB of RAM it reached in those first 15 mins. If it has really plateaued, it would be using only 90% (ie, 10% less) than the ~40GB that standard 0.9.5 ended up using up to the point I had to kill it.

I will continue to monitor it and report back on its memory use.

I'd need to know what data structures use the memory in your tests to suggest anything specific.

Would the output of curl http://localhost:6060/debug/pprof/heap provide what you need? Please let me know so I can be prepared.

Cheers,
-- Durval.

@DurvalMenezes
Copy link

@DurvalMenezes DurvalMenezes commented May 22, 2019

I will continue to monitor it and report back on its memory use.

The current out-of-order-restore-no-progress restore just blew past the point where I had to kill the previous one (at ~59GB restored) and the RSS memory usage stays at ~36GB, so my initial claim of 2x the memory usage was indeed premature: it seems this PR actually uses ~10% less memory than standard restic, as @ifedorenko expected.

Sorry for the false alarm.

@DurvalMenezes
Copy link

@DurvalMenezes DurvalMenezes commented May 23, 2019

The restore finished a few hours later and, in a word: WOW! :-)

In more words: it finished in 4h08m the same restore that standard 0.9.5 restic took over 72 hours and reached only 2/3 -- so, performance-wise, congratulations @ifedorenko! Thanks to your work, I escaped having to junk restic and all the hard work I invested implementing and tuning it here -- it would not have been viable to keep using a backup method that was so slow when restoring things.

On a less positive note, I detected two corrupted files -- more details on my forum post. To sum it up, I think there's a good chance it wasn't caused by restic -- I will do some more tests to try and locate the issue, and then come back to report.

Cheers,
-- Durval.

@DurvalMenezes
Copy link

@DurvalMenezes DurvalMenezes commented May 23, 2019

RSS memory usage stays at ~36GB, so my initial claim of 2x the memory usage was indeed premature: it seems this PR actually uses ~10% less memory than standard restic, as @ifedorenko expected.

More info memory-wise: when I re-run the exact same restic restore command, just cd'ing to a new directory (effectivelly changing the restore destination, as I explicitly use --target=.), the maximum RSS usage was only ~24GB, ie, ~33% less than the first time I ran that same command! Does this make any sense to you?

// of concurrent writeToFile invocations plus cacheCap.
// writes blobs to target files.
// multiple files can be written to concurrently.
// multiple blobs can be concurrently written to the same file.

This comment has been minimized.

@saikpr

saikpr Jun 18, 2019

This might cause issues with FUSE based systems, especially FUSE mounted high-latency systems.

This comment has been minimized.

@rawtaz

rawtaz Feb 13, 2020
Contributor

@saikpr Can you please elaborate a bit on this? In what way can it cause issues, for what reason in FUSE?

@hyperfekt
Copy link

@hyperfekt hyperfekt commented Jul 30, 2019

This code restored my ~60GB backup in half an hour, whereas the old restorer code would have finished in months (no hyperbole) at the rate it was going. In both cases the entire repo was already downloaded. As a user, I'd highly recommend merging this.
EDIT: For context, I assume I'm hitting a bit of a degenerate case in the old restorer code because I run a backup every 5 minutes.

@saikpr
Copy link

@saikpr saikpr commented Jul 30, 2019

This did improve the RAM and speed (but not that much if you increase the Chunk Size to 128MB, the difference was significant at low chunk sizes)

But Rclone was throwing a lot more Didn't finish writing GET request (wrote 135396258/135407930 bytes) errors than without it.

Eventually both the restores were exactly same, byte-by-byte only the stderr was polluted with the above error.

You can find the logs here out_of_band.txt

@ruiztulio
Copy link

@ruiztulio ruiztulio commented Aug 6, 2019

Hi,

I was having an issue with this:

image

The last error was because I cancelled the restore process. And also sometimes the restore fail, those backups are about 45G, sometimes the restore only does 35G

But after using the branch from @ifedorenko it worked perfectly:

image

As you may notice we are using rest-server (I had to blur url and customer name). Hope this add any value to the discussion.

Regards

@saikpr
Copy link

@saikpr saikpr commented Aug 7, 2019

@Church-
Copy link

@Church- Church- commented Sep 17, 2019

Having already demo'd this branch at work, I definitely think it needs to be merged in.

Allowed me to circumvent a ton of out of cache capacity errors I was running into.

@ifedorenko ifedorenko force-pushed the ifedorenko:out-of-order-restore-no-progress branch from 5db73a6 to ab23ab9 Nov 27, 2019
@ifedorenko
Copy link
Contributor Author

@ifedorenko ifedorenko commented Nov 27, 2019

fyi, rebased my branch on the current master, also split vendor/... changes to a separate commit

@bkmeneguello
Copy link

@bkmeneguello bkmeneguello commented Nov 28, 2019

Really waiting this become production to start using restic in my company. We have some huge VM disks to synchronize to GDrive and actually they don't perform well to restore. I've tested this branch and worked perfectly, but until become stable I can't use in production.

@rawtaz rawtaz requested a review from fd0 Nov 28, 2019
@StrikeLines
Copy link

@StrikeLines StrikeLines commented Dec 2, 2019

This branch is an absolute life saver. I had to use PR2195 in order to successfully restore about 50 TB of business data. The stock restic restore had been running for weeks with no end in sight. This version restored 9 TB of files overnight!

imo, in it's current state, restic-master is not a usable backup solution. It's terrible restore performance is unacceptable, making it useless on large repositories.

This pull request should be merged to master as soon as possible. This deserves to be prioritized above all other pending requests.

Thanks ifedorenko. This saved my bacon.

@ifedorenko
Copy link
Contributor Author

@ifedorenko ifedorenko commented Dec 2, 2019

For the record, I wrote the current broken implementation too, so I am to blame for the current state of affairs. There is O(N^2) loop there, which gets progressively slower with number of blobs in a file. It can be fixed, but the fix would add complexity to already complex code and I think this PR is overall better deal.

@rawtaz rawtaz force-pushed the ifedorenko:out-of-order-restore-no-progress branch from c83044f to 9dbafbf Feb 26, 2020
@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Feb 26, 2020

@ifedorenko I'm terribly sorry for saying there were unresolved lines. You indeed fixed that a while ago. GitHub for some reason still showed them as unresolved, so I thought they weren't. I clicked the resolve button now, so no more confusion on that part :P

I took the liberty of updating the changelog file a bit and squashed some commits. From where I stand this PR is ready for merging, just going to get @fd0 on it in a short while before we push the button. I have no doubts it's going to be merged soon.

@ifedorenko
Copy link
Contributor Author

@ifedorenko ifedorenko commented Feb 26, 2020

@rawtaz I would prefer two commits, to be honest, one for the code changes including changelog, and the other for go mod ... generated noise.

ifedorenko added 2 commits Nov 27, 2019
Much simpler implementation that guarantees each required pack
is downloaded only once (and hence does not need to manage
pack cache). Also improves large file restore performance.

Signed-off-by: Igor Fedorenko <igor@ifedorenko.com>
Signed-off-by: Igor Fedorenko <igor@ifedorenko.com>
@rawtaz rawtaz force-pushed the ifedorenko:out-of-order-restore-no-progress branch from 9dbafbf to c52198d Feb 26, 2020
@fd0
fd0 approved these changes Feb 26, 2020
Copy link
Member

@fd0 fd0 left a comment

Awesome work, thank you very much @ifedorenko! And I'm very sorry it took us so long to get this merged :)

@rawtaz rawtaz merged commit b5c7778 into restic:master Feb 26, 2020
2 checks passed
2 checks passed
@travis-ci
Travis CI - Pull Request Build Passed
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
@ifedorenko ifedorenko deleted the ifedorenko:out-of-order-restore-no-progress branch Apr 2, 2020
@buhkguj
Copy link

@buhkguj buhkguj commented Apr 15, 2020

@fd0: We have the same problem. Restoring from 1TB repository brings up the caching errors on serveral files. We are using 0.9.6. Please release this fix ASAP if not done already!

@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Apr 15, 2020

We have the same problem. Restoring from 1TB repository brings up the caching errors on serveral files. We are using 0.9.6.

@buhkguj Please be more specific. The same problem as what? Is the problem you are referring to already documented somewhere? If so, please provide a link to it. Thanks.

@buhkguj
Copy link

@buhkguj buhkguj commented Apr 15, 2020

@buhkguj Please be more specific. The same problem as what? Is the problem you are referring to already documented somewhere? If so, please provide a link to it. Thanks.

We got restore errors like this: "not enough cache capacity: requested 9976743, available 6836375"

Issue: #2244
This PR shoud fix, but I think it is not included in latest release yet. (0.9.6)
Thanks for your great work.

@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Apr 15, 2020

@buhkguj You're correct in that this PR is not included in 0.9.6 (it will be in the next release, of course). If you want to try this PR, you can download the latest master build from here. If you do, please let us know if it helps with the problems you're having!

@i0x71
Copy link

@i0x71 i0x71 commented Apr 15, 2020

Guys,
Its been 11 months since this fix was made available, the version available on the main site this doesn't Include it. This really brings down the morale and trust in the team, I would imagine there are lots of people out there who faced this issue (me included) only to find themselves on this page. 11 months the after fix was available.

@saikpr
Copy link

@saikpr saikpr commented Apr 15, 2020

@i0x71 @rawtaz You can find the binaries on the beta website here : https://beta.restic.net/restic-v0.9.6-160-gf033850a/
I am pretty sure most of the maintainers have been busy given the present situation, and they would have a lot on their mind.

@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Apr 16, 2020

@i0x71 That is 11 months during which you and everyone else who had this specific problem has had all the opportunity in the world to try/use this fix, and in fact some users who wanted to try it actually did so and have been using it ever since.

All you had to do was build restic yourself (which is just one command assuming you have Go installed), and if you didn't know about this you just had to ask. We'd be happy to help you in case you didn't know how to do it.

That said, can you tell me in what way the team that maintains restic is less trustworthy due to it taking a long time to get this fix into master (and released)? In what way does us having a life to manage, sometimes with severe issues to deal with that you cannot even imagine, and having kids, family, relatives and a full time job including travels to work out during the approximately 15 hours we're awake, make us less trustworthy?

On the contrary I would suggest that we'd be less trustworthy if we merged this and furthermore put it out as an official release the day after the PR was created. I hope you understand why that would be more of an issue.

FWIW, there will always be changes and fixes that are either waiting to be merged or that have been merged but not put into a release, this was just one of them. I agree it took too long, but there are reasons for that and anyone who needed this fix had the possibility to use it, you don't need a release for that. At the very least just ask if you can try it beforehand. Thanks!

@buhkguj
Copy link

@buhkguj buhkguj commented Apr 16, 2020

I tested the beta bin and the restore was done without errors!
Thanks for the tip. I realy understand that it takes time, specially during this strange period we all live in right now. Anyway I hope there is a way to release it soon, so we can use it on production.
Thanks again for this great tool!

@julijane
Copy link

@julijane julijane commented May 22, 2020

I'm sorry but I sort of have to agree with the criticism. Today I had to panic-restore some data and then stumbled across the errors and that really did not lighten my day as it increased my panic even more to find that the data was not fully restored. It took me some googling and clicking through issues here to find this and the link to the beta binaries which then could properly restore the data.

While I understand some of the reasoning I really think that if there is an issue with the released version not restoring (all) data it is a good idea to release the fixed version in time. People in my situation will clearly be thankful because when you are in need of a restore it really does not lighten the day when the restore fails. At minimum imho there should be a note about the issue and pointing to the beta release.

And yes, my initial reaction was along the lines of "ok, forget about restic, need to check for a better backup tool later". I'm glad that in the end it turned out all good, but still. I think some might be underestimating the weight of this issue and how it makes the endusers feel.

And I certainly feel that "there will always be changes and fixes that are either waiting to be merged or that have been merged but not put into a release, this was just one of them" seems like a odd statement considering that we are not talking about some oddball feature but the core functionality of a backup tool: restoring data.

Just my 2ct, thanks for considering my thoughts and of course thanks for all the work behind this project. I really do not mean this in a rude way please take it as constructive criticism.

@dhilgarth
Copy link

@dhilgarth dhilgarth commented Jun 20, 2020

It might help if there would be some kind of roadmap where users would be able to see when the new version will be released.
Does something like that exist?

@darkdragon-001
Copy link
Contributor

@darkdragon-001 darkdragon-001 commented Jun 20, 2020

@dhilgarth One it's released, you will see the release tag for the commits in this pull request (StackOverflow). In addition, some roadmap to see which issues are blocking a release would be nice.

@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Jun 21, 2020

Does something like that exist?

No. And as you can see there's a huge flow of issues, comments and PRs taking up a lot of time for the maintainers. It's hard to manage all that and we need the flexibility in releasing when we think that we are able to. Having to maintain a roadmap would in my personal opinion only be counter-productive, it would not help.

@darkdragon-001
Copy link
Contributor

@darkdragon-001 darkdragon-001 commented Jun 21, 2020

@rawtaz All these project management tools help to stay focused and solve the important issues first instead of being overwhelmed by the huge flow of issues. Overall productivity will increase. If it was my project, I would make use of them. But the team probably knows best what works for them...

@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Jun 21, 2020

I spent some time with one of them and it turned to not work at all. It was a complete waste of time, and I was honestly a bit annoyed by the fact that it advertises itself as this great thing, yet when you try it it's utter garbage. I guess this hasn't helped raising my love for such tools :D To be fair, it wasn't a project management tool specifically, more like a GitHub issues+PR management tool. Anyway, I hear you, and my comment was only about how I feel about it, perhaps others in the team want to or use such tools. Regardless, I truly don't think having a roadmap would help us. A release will come when it's ready.

@darkdragon-001
Copy link
Contributor

@darkdragon-001 darkdragon-001 commented Jun 21, 2020

@rawtaz To get me right, I am not requesting some sort of timeline when to expect the next release. It would be more to assign issues to milestones (which represent the next minor, major and future releases) and cheer for your great work for every solved issue. This way, users can see which issues are blocking a release and estimate how much work is left before a release can happen (it will also avoid users posting questions like this). Using such tools allows to separate communication on (new) issues and working on / fixing issues. This way one can stay focused while working on issues which usually increases productivity a lot.

@darkdragon-001
Copy link
Contributor

@darkdragon-001 darkdragon-001 commented Jun 21, 2020

The dotnet team uses them extensively to organize the 5000+ open issues. Other examples for inspiration: docker, mono, OpenZFS.

@rawtaz
Copy link
Contributor

@rawtaz rawtaz commented Jun 21, 2020

That's great. I am not going to add any further comments on this. Thanks for your suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet