
Reimplementation of prune #2718

Merged
merged 13 commits into from
Nov 5, 2020

Conversation

aawsome
Contributor

@aawsome aawsome commented May 1, 2020

What is the purpose of this change? What does it change?

Reimplement prune such that the main problems are solved:

  • prune is slow, especially for remote repositories
  • prune uses too much memory
  • prune is not configurable

Features are:

  • Cancels if the repository is in an invalid state. If canceled at any time while running, it leaves the repository in a working state. Re-pruning a repo left behind by an aborted prune works safely. This makes the pruning process really safe.
  • Find out what to do by only reading snapshots, the in-memory index and the list of pack files (-> this is very fast and uses almost no bandwidth!)
  • User options that are easy to understand but still allow fine-tuning the pruning result
  • Optimizes such that the size of packs-to-repack is minimal w.r.t. the given user parameters. This minimizes the necessary download of pack files (improving speed and bandwidth usage a lot for remote repos) and even allows pruning using only data that is already present in the repo cache (interesting e.g. for cold storage)
  • Builds the new index without needing to read all packs (=> improves speed and bandwidth a lot for remote repos) -> This feature is no longer included; it has been moved to Rebuild index in prune by using in-memory index #2842
  • Flag for dry run to see what would be done

Output looks like this:

enter password for repository: 
repository 28fa4257 opened successfully, password is correct
get all snapshots
load indexes
find data that is still in use for 2 snapshots
[0:00] 100.00%  2 / 2 snapshots
find packs in index and calculate used size...
collect packs for deletion and repacking...
[0:00] 100.00%  5 / 5 packs processed

to repack:           69 blobs / 1.078 MiB
this removes         67 blobs / 1.047 MiB
to delete:            7 blobs / 25.726 KiB
total prune:         74 blobs / 1.072 MiB
unused size after prune: 0.00% of total size

repacking packs...
[0:00] 100.00%  2 / 2 packs repacked
counting files in repo
[0:00] 100.00%  3 / 3 packs
finding old index files
saved new indexes as [622627ee]
remove 4 old index files
[0:00] 100.00%  4 / 4 files deleted
remove 3 old packs
[0:00] 100.00%  3 / 3 files deleted
done.

Moreover forget is enhanced:

  • uses re-implemented prune, of course
  • allow real prune simulation with restic forget --dry-run --prune

This PR has the following prerequisites:

(#2839 is not a must any longer; #2842 is now independent; #2843 is included here)

Was the change discussed in an issue or in the forum before?

see #2547

Prune issues have been widely discussed, e.g. #1140 #1599 #1723 #1985 #2162 #2227 #2305
There are also other PRs trying to improve the situation, see #1994 #2340 #2507.

closes #1140
closes #1723
closes #1985
closes #2112
closes #2227
closes #2305

Checklist

  • I have read the Contribution Guidelines
  • I have enabled maintainer edits for this PR
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review

@aawsome
Contributor Author

aawsome commented May 1, 2020

Open points are:

  • see TODOs when using repository/repack.Repack and repository/master_index.RebuildIndex (will remain open within this PR)
  • what happens when cleanup is cancelled and re-started? -> should work now
  • how does it work with broken repos? (e.g. the SHA sum of a repacked blob does not fit) -> works as well as the recent version of prune. Still potential for improvement, but not in this PR.
  • testing -> I finished the tests for most settings. Tests for "edge cases" (duplicates, mixed packs) are open.
  • add documentation
  • add automated tests

@MichaelEischer
Member

MichaelEischer commented May 2, 2020

I did not have time yet for a close look at the code, but I'd suggest leaving the handling of broken repos (which probably requires some tricky code changes) for a later, separate PR. Adding/changing a few hundred lines of critical code (i.e. code that could permanently damage repositories) is enough for one PR. The current PR should just check that a repository seems intact before making any changes, like the existing prune implementation.

The added feature set compared to the current prune implementation is already rather large: it adds parallelization of several steps, probably uses a lot less memory, adds thresholds for rewriting and rebuilds the index in memory.

Member

@MichaelEischer MichaelEischer left a comment


As a short summary of the below comments:

  • I'd suggest to use a set of command line options like max-unused-percentage=20, repack-small=true, repack-mixed=true and repack-trees-only=false (the assignment denotes the default values)
  • The handling of duplicate blobs is inconsistent
  • For correctness/robustness a few additional sanity checks are necessary (I hope that I've found them all), but the most important checks are already in place
  • I think that rebuilding the index in memory is the way to go, but that probably requires changes to the archiver/indexUploader which in the end should be split into a separate PR

I'd suggest to defer further parallelization related changes of existing code to a later PR as it's probably a good idea to salvage #2340 .

What is the reason why the command is called cleanup instead of prune?

(10 resolved review comments on cmd/restic/cmd_cleanup.go, now outdated)
@aawsome
Contributor Author

aawsome commented May 11, 2020

@MichaelEischer Thanks a lot for your very valuable comments!

* I'd suggest to use a set of command line options like `max-unused-percentage=20`, `repack-small=true`, `repack-mixed=true` and `repack-trees-only=false` (the assignment denotes the default values)

This has been changed. I used --max-unused-percent=1.5 (it can be specified as a float), as this is now a global limit and I would suggest a much smaller default value. If there are better options for the name or default value, I'm open to suggestions.

I also added --repack-duplicates=true and --no-rebuild-index=false.
The latter will (if set to true) use the repository.Rebuild-Index function, resulting in a single - maybe large - index file. If not set (the default), the rebuild-index logic from the present command is used, which reads all pack files to rebuild the index. Neither option is optimal, but both are possible with minimal code change and hence IMO worth keeping. An optimal solution for the index treatment would be the subject of another PR.

* The handling of duplicate blobs is inconsistent

I hope this is fixed now, see --repack-duplicates

* For correctness/robustness a few additional sanity checks are necessary (I hope that I've found them all), but the most important checks are already in place

Most are added. I just realized that I forgot one, this will come soon.

* I think that rebuilding the index in memory is the way to go, but that probably requires changes to the archiver/indexUploader which in the end should be split into a separate PR

I agree, see above.

I'd suggest to defer further parallelization related changes of existing code to a later PR as it's probably a good idea to salvage #2340 .

👍

What is the reason why the command is called cleanup instead of prune?

It is for historical reasons: I first had two, then three commands. I still used a new command to make comparing the old prune call with the new one easy.
I'll change the name to prune if we agree that this will replace it.

@aawsome
Contributor Author

aawsome commented May 12, 2020

For anyone who wants to try out this PR (please use a copy of your repo!):

  • I don't know if I can easily find/generate test settings having duplicate blobs or mixed blobs within a pack file. If there is someone willing to test these "edge cases", I would be very happy!
  • use -n (=--dry-run) first. It allows you to see exactly what would be done. This especially helps if you want to play around with the flags (e.g. --max-unused-percent).
  • using --max-unused-percent=100 --no-rebuild-index --repack-duplicates=false --repack-mixed=false --repack-small=false never repacks anything but still deletes 100% of the unused files.
  • using --repack-trees-only will most likely not achieve the desired unused ratio. However, it ensures that only data that should already be in the cache is read (allowing usage with "cold storage"). Also, with this flag not many packs will be marked for repacking, so it gives a big speedup, especially for remote repositories.
  • --no-rebuild-index=false (the default) essentially executes a restic rebuild-index after repacking. I prefer --no-rebuild-index=true

Hoping to get feedback from many users!

@seqizz

seqizz commented May 13, 2020

Thanks for this PR @aawsome

For anyone who wants to try out this PR (please use a copy of your repo!)

I'm now running some tests on one of my trashy repos, which is ~750 GB and has a lot of non-pruned chunks/snapshots in it. I will compare the memory/timing values with a dry run of prune.

Since there are a lot of features under this, I have a small question in the meantime:

option to choose an index rebuild without needing to load all packs (=> improves speed and bandwidth a lot for remote repos)

Can't clearly see the option for this, is it --repack-trees-only?

I am planning to run the faster stuff (e.g. index/tree cleanup) more often and save the IO-heavy stuff (e.g. clearing completely unused chunks) for later. Of course I am assuming "completely unused" stuff should not require an exclusive lock, since it won't be referred to anywhere after the other clean actions; does this sound right?

@aawsome
Contributor Author

aawsome commented May 13, 2020

option to choose an index rebuild without needing to load all packs (=> improves speed and bandwidth a lot for remote repos)

Can't clearly see the option for this, is it --repack-trees-only?

I was talking about --no-rebuild-index. If you do not specify this flag, it essentially runs a restic rebuild-index, which reads all pack files. If you specify it, it uses the in-memory index, but (this is an open shortcoming that will not be fixed in this PR) only produces one (possibly large) index file.

I am planning to run the faster stuff (e.g. index/tree cleanup) more often and save the IO-heavy stuff (e.g. clearing completely unused chunks) for later. Of course I am assuming "completely unused" stuff should not require an exclusive lock, since it won't be referred to anywhere after the other clean actions; does this sound right?

ATM there are no lock-free actions. The problem is that the repository must be locked between looking for used blobs (by scanning the snapshots and the corresponding trees) and the actual deletion of files. A two-phase deletion could be implemented independently of this PR.

I also do not think that your expectations about "faster" and "slower" will hold. If you can tolerate some unused space in your repository, I expect that a "standard" prune run will be fast enough that you will not need to think about two runs with different parameters.
The flag --max-unused-percent works on a repository-wide level and does exactly what you expect:
If you prune with --max-unused-percent=1.5 and your repository size after pruning is 10 GB, you might have up to 150 MB of unused data still in your repository - and those 150 MB are chosen such that minimal IO is needed. The more "wasted" space you can afford, the faster your prune run will be.
Of course, if there is unused data that can be removed by just deleting files in the repo, those files will always be removed. The flag only gives an upper limit.
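As a back-of-the-envelope sketch, the repository-wide limit that --max-unused-percent implies can be expressed like this (illustrative Go with a hypothetical helper name, not restic's actual code):

```go
package main

import "fmt"

// maxUnusedBytes computes the upper bound on unused data that a
// --max-unused-percent setting allows to remain after pruning: at
// most `percent` percent of the repository size. (Hypothetical
// helper for illustration only.)
func maxUnusedBytes(repoSize uint64, percent float64) uint64 {
	return uint64(float64(repoSize) * percent / 100)
}

func main() {
	// A 10 GB repository pruned with --max-unused-percent=1.5
	// may keep up to 150 MB of unused data.
	fmt.Println(maxUnusedBytes(10_000_000_000, 1.5)) // prints 150000000
}
```

Everything below that bound never needs to be read or rewritten, which is where the IO savings come from.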

Of course I am very curious about the results of your tests! Make sure to use -n to test out the effect of your chosen parameters.

@seqizz

seqizz commented May 13, 2020

Thanks for the clarification. I guess that'd be helpful, since I can spare some disk space and 1.5 seems to be a good compromise.

Some notes:

  • The no-rebuild-index log unsurprisingly shows no difference, since that option can't do anything with dry-run
  • I tried to set --max-unused-percent to 0.5, 10 and 10.0; nothing changed in the output. It gave unused size after prune: 0.16% of total size every time, and all other numbers were also exactly the same.

Logs:
cleanup-maxunused10.txt
cleanup-norebuildindex.txt
cleanup-standard.txt

@aawsome
Contributor Author

aawsome commented May 13, 2020

* The `no-rebuild-index` log unsurprisingly shows no difference, since this can't do anything with `dry-run`

Ah, of course - I forgot to mention that the index rebuilding only takes place if not called with -n.

* I tried to set `--max-unused-percent` to 0.5, 10 and 10.0; nothing changed in the output. It gave `unused size after prune: 0.16% of total size` every time, and all other numbers were also exactly the same.

The reason is that you have a lot of completely unused pack files. Those will always be deleted. You have just 229 partly used packs, and keeping them all results in 0.16% wasted size. To see a difference you would have to specify a value smaller than 0.16; then some packs would be repacked.
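The pack handling described here (completely unused packs are always deleted; partly used packs are repacked only while the remaining unused size still exceeds the limit) can be sketched roughly like this - a simplified illustration with hypothetical types, not restic's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// pack is a simplified stand-in for a restic pack file
// (hypothetical struct, not restic's actual type).
type pack struct {
	usedSize   uint64 // bytes of blobs still referenced by snapshots
	unusedSize uint64 // bytes of blobs no longer referenced
}

// planPrune classifies packs: fully unused packs are always deleted,
// and partly used packs are repacked (an expensive read+rewrite)
// only while the total remaining unused size exceeds maxUnused.
func planPrune(packs []pack, maxUnused uint64) (del, repack, keep int) {
	var partly []pack
	for _, p := range packs {
		if p.usedSize == 0 {
			del++ // completely unused: just delete the file
		} else {
			partly = append(partly, p)
		}
	}
	// Repack the packs with the most unused data first, so the
	// limit is reached with minimal IO.
	sort.Slice(partly, func(i, j int) bool {
		return partly[i].unusedSize > partly[j].unusedSize
	})
	var unused uint64
	for _, p := range partly {
		unused += p.unusedSize
	}
	for _, p := range partly {
		if unused <= maxUnused {
			keep++ // tolerated as "wasted" space
			continue
		}
		repack++
		unused -= p.unusedSize
	}
	return del, repack, keep
}

func main() {
	packs := []pack{
		{usedSize: 0, unusedSize: 400},  // fully unused -> deleted
		{usedSize: 100, unusedSize: 90}, // most waste -> repacked first
		{usedSize: 100, unusedSize: 10}, // little waste -> kept
	}
	d, r, k := planPrune(packs, 20)
	fmt.Println(d, r, k) // prints 1 1 1
}
```

Sorting by unused share means the limit is met with the fewest bytes read and rewritten, which is what keeps the necessary IO minimal.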

Logs:

Thanks a lot for the logs. It is a bit surprising to me that most of the time is spent in "collect packs for deletion and repacking" - maybe I should add a progress bar there...

Can you make a copy of your repo to run without --dry-run and to do a comparison with the actual prune?

@seqizz

seqizz commented May 14, 2020

Can you make a copy of your repo to run without --dry-run and to do a comparison with the actual prune?

Sorry for the late answer; I'll report back in a day (there is some extra IO on the host currently, which would compromise the tests).

But I've copied the repo & have a small question: What would be a good comparison?

  • Prune vs cleanup
  • Prune vs cleanup --no-rebuild-index
  • Prune + rebuild-index vs cleanup

@aawsome
Contributor Author

aawsome commented May 14, 2020

What would be a good comparison?

A good comparison for your repo would be: prune vs cleanup vs cleanup --no-rebuild-index

EDIT: prune already runs rebuild-index implicitly, i.e. with respect to the index changes, prune behaves exactly like cleanup without --no-rebuild-index.
So I'm interested in the effect of --no-rebuild-index and the difference between prune and cleanup without regarding the index changes.

Playing around with the other flags would not affect the result too much, as you have already noticed (prune will always delete all completely unused pack files).

@seqizz

seqizz commented May 14, 2020

prune already runs rebuild-index implicitly

Oh, I didn't know that. I hope there is some difference between the commands though: sometimes I got prune failures with messages like Error output: tree XX not found in repository, and running rebuild-index explicitly somehow fixes it.

Anyway sounds like you need 3-way comparison so I'll copy the repo once more :)

@MichaelEischer
Member

@seqizz The current prune command actually rebuilds the repository index twice: the first time in memory, and the second time once all unused blobs have been removed. So my guess is that cleanup is a lot faster than prune even if it is configured to remove every unused blob.

Error output: tree XX not found in repository errors are also fatal for the cleanup command. Some situations where rebuild-index fixes the error might be handled better by a future PR. For now, the cleanup implementation actually has slightly stricter repository integrity checks in place than the current prune implementation.

@aawsome aawsome changed the title Add new command cleanup Reimplementation of prune May 15, 2020
@aawsome
Contributor Author

aawsome commented May 15, 2020

Starting with commit 9bef5ee I renamed the command to prune, i.e. this PR now reimplements the old prune command.
The integration tests have also been extended to cover various parameter specifications.

@seqizz

seqizz commented May 15, 2020

Alright, I've used the last commit, a.k.a. the "new prune", versus the old one. Both time- and memory-wise it looks very good. Attaching the logs.

aawsome-prune.txt
aawsome-prune-noindex.txt
upstream-prune.txt

@aawsome
Contributor Author

aawsome commented May 15, 2020

@seqizz Thanks for your tests!
And wow - runtime went down from 1h34min to 28min, and to 7min with --no-rebuild-index. This is what I call a huge performance gain!
I'm also quite satisfied with the memory usage. 👍

I suspect that your storage has pretty high bandwidth and low latency. The effect should be even greater on real "remote" storages.

Did you run a restic check after the prune to ensure that the repo is in a sane state?

If you wish to do further tests, you could do:

  • try the new prune with --max-unused-percent=0. This should do what the old prune does. Times and memory, however, will be quite similar to your current tests
  • rerun the old and the new prune on the resulting, already-pruned repo. I expect the old prune to run for more than 1h and the new one to finish in about 1 min. In other words: if there is not much to delete, the new prune should be really fast (especially with --no-rebuild-index). I think this can allow you to run the new prune much more often.

@aawsome
Contributor Author

aawsome commented May 16, 2020

I'm quite confident that the actual state of this PR is ready for a review and can be a valuable improvement to restic!

Thanks a lot @MichaelEischer for all your ideas, discussions and comments on early versions! Also thanks @seqizz for doing the performance tests!

Now, who do we need to review this?
I hope @MichaelEischer can have a view at these changes? Do we also need a review from @fd0?

I'm happy to get remarks or ideas what to change from code reviewers!
Of course, after review and maybe fixing some things, I'm going to squash the changes such that they can be nicely merged.

@aawsome aawsome marked this pull request as ready for review May 16, 2020 19:57
(2 resolved review comments on cmd/restic/cmd_prune.go, now outdated)
@aawsome
Contributor Author

aawsome commented May 22, 2020

I reviewed it myself after leaving the topic for a couple of days. I realized that the --no-rebuild-index flag was confusing and its shortcoming easy to fix. So the flag is now removed, and prune uses the in-memory index to create the new index files by default. IMO this again improves this change a lot!

Member

@MichaelEischer MichaelEischer left a comment


I just took a quick look at the index rebuilding changes, so this is at best a partial review.

(3 resolved review comments on internal/repository/master_index.go and cmd/restic/cmd_prune.go, now outdated)
@MichaelEischer MichaelEischer self-requested a review May 23, 2020 17:15
@fd0
Member

fd0 commented Nov 3, 2020

I'd like to leave one other thing here (while you're all subscribed to this issue):

You all did a tremendous job! Not only writing the code, but also trying it out, giving feedback, improving the code, adding documentation, and even thinking about the user interface and which options and knobs to make available to users! While reading through the comments, at one point I thought "oh, this is not a good idea to expose directly to end users, it's way too technical and nobody will understand without reading the code what the option does", and a few comments later there was a discussion and you came to the same conclusion.

Unfortunately, I have less spare time to work on restic than I used to have. I'm sure it'll return eventually, but for now I need to select the issues and PRs that I look at in order to advance the project.

Just to give you an idea: It literally took me the whole day to read all the comments and dig into the code. I was very thorough because it touches the one place within restic which removes data. I have no idea how long that would have taken if it were not for you all doing the bulk of the work!

I'm humbled to be part of such an awesome community!

Since @MichaelEischer has requested changes (and now I'm a co-author of this PR), we'll wait for him to look at the new code and approve it, then it can be merged. I forgot that earlier :)

Member

@MichaelEischer MichaelEischer left a comment


I'm happy with the current code and the latest changes. Thanks a lot everyone for the discussions and @aawsome for the dozens of code iterations.

@aawsome
Contributor Author

aawsome commented Nov 4, 2020

Is there anything still open here? Just asking, as this PR is not merged yet and I'm kind of waiting to check whether everything is fine with the dependent PRs once this one is merged...

@rawtaz
Contributor

rawtaz commented Nov 4, 2020

@aawsome No, it's fine. Our Gophers are currently relocating themselves in order to summon the merge gods, it'll be done in not too long (although perhaps not today). Stay tuned, and thanks from me too! 🚀

@fd0
Member

fd0 commented Nov 5, 2020

I've just released a new version of restic (so we have a release with the VSS code from #2274 included). I'll now merge this code to master, then we can extensively test it and resolve all remaining bugs over the next weeks :)

@fd0 fd0 merged commit 5144141 into restic:master Nov 5, 2020
@zx2c4

zx2c4 commented Nov 5, 2020

Hurrah! Thanks so much for spending the time on this, @aawsome and @fd0 and @MichaelEischer.

I'll now merge this code to master, then we can extensively test it and resolve all remaining bugs over the next weeks :)

With the increased time for testing, it might make sense to get @aawsome's other PRs in there now-ish too, so these can go out together.

@nunoperalta
Contributor

nunoperalta commented Feb 14, 2021

Hey everyone, @aawsome @fd0 @MichaelEischer
I see that this was finally released! Been waiting for this.

I see the change log says: "By default, the prune command no longer removes all unused data."

As an AWS S3 user, I pay per space used in S3, and the idea of pruning old snapshots is to (obviously) save costs.
What exactly is "prune" doing, if not removing unused data?
Just want to understand what this means, as while I'm glad prune will be faster, I definitely don't want to keep old unused data in the bucket forever.

Thank you very much!

@rawtaz
Contributor

rawtaz commented Feb 14, 2021

@nunoperalta It's referring to the things you can read about at https://restic.readthedocs.io/en/stable/060_forget.html#customize-pruning , in particular the --max-unused option.

By default it allows for leaving 5% of unused data instead of repacking that last part, but you can change it to 0 if you want. But please note that it's not just storage costs you have to pay, you also pay for transactions with your S3, right?

@nunoperalta
Contributor

But please note that it's not just storage costs you have to pay, you also pay for transactions with your S3, right?

Yep, exactly. Transactions and data transfer are what I found to be the most expensive part of prune.

So you find that leaving at 5% is "ideal" for a balance/trade-off?

Thank you very much.

@rawtaz
Contributor

rawtaz commented Feb 14, 2021

So you find that leaving at 5% is "ideal" for a balance/trade-off?

Smarter people than me came to that conclusion, and I think it seems reasonable.

@aawsome
Contributor Author

aawsome commented Feb 14, 2021

Yep, exactly. Transactions & Data Transfer is what I found to be the most expensive about Prune.

Both are dramatically reduced by this PR compared to the "old" prune. But the best value for --max-unused depends on the individual costs of your storage as well as on the actual data in your repo. You can also play around with the dry-run option to find a suitable value for your case. Also feel free to discuss or share good settings for specific storage backends in the forum!

@vrusinov
Contributor

Hey, I was quietly following this PR and have been using it for a good while without any major issues. Glad to see 0.12 was released recently and contains this PR.

In a nice coincidence, I also got a notification that my employer's open-source program has accepted some of the people working on this for the Open Source Peer Bonus Program. You should have gotten an e-mail about this - it is real and not a fraud. :)

You can read more about the program here.

@aawsome
Contributor Author

aawsome commented Mar 19, 2021

@vrusinov Thanks a lot for nominating me - I'm feeling very honored! I just received the bonus. I will donate a part of it to the Free Software Foundation Europe - and buy some nice gifts for my wife and children 😉
