Prune issues: add new commands 'cleanup-index', 'cleanup-packs' and 'repack-index' #2513

Status: Open (8 commits to be merged into base: master)

@aawsome

aawsome commented Dec 13, 2019

What is the purpose of this change? What does it change?

The prune command has many known issues, and it does a lot of different things:

  • rebuild the index from pack files (is this pruning?)
  • remove blobs that are not used from the index
  • repack packs that contain unused blobs (and to do so again rebuild the index)
  • remove packs that are not used

This PR adds three new commands:

  • cleanup-index removes all blobs not used in snapshots from the index
  • cleanup-packs removes all packs that are not referenced by the index
  • repack-index repacks the index to get rid of small index files

With these three commands, prune functionality is available for repositories in a normal (i.e. non-broken) state.
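
How cleanup-index and cleanup-packs split the work can be sketched as two set operations. This is a minimal sketch, not the code from this PR: the `Index` type, the string IDs, and both function names are assumptions made for illustration, and the real commands work on restic's own index structures.

```go
package main

import (
	"fmt"
	"sort"
)

// Index maps a blob ID to the ID of the pack that stores it.
// Plain strings stand in for restic's real ID types (an assumption).
type Index map[string]string

// cleanupIndex returns a copy of idx that keeps only the blobs
// that are referenced by at least one snapshot.
func cleanupIndex(idx Index, used map[string]bool) Index {
	out := make(Index)
	for blob, pack := range idx {
		if used[blob] {
			out[blob] = pack
		}
	}
	return out
}

// unreferencedPacks lists the packs that no index entry points to.
// These are the packs a cleanup-packs step could delete without reading them.
func unreferencedPacks(packs []string, idx Index) []string {
	referenced := make(map[string]bool)
	for _, pack := range idx {
		referenced[pack] = true
	}
	var out []string
	for _, p := range packs {
		if !referenced[p] {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := Index{"b1": "p1", "b2": "p1", "b3": "p2"}
	used := map[string]bool{"b1": true} // only b1 is still used by a snapshot

	cleaned := cleanupIndex(idx, used)
	fmt.Println(len(cleaned), unreferencedPacks([]string{"p1", "p2"}, cleaned))
}
```

In this toy example pack p1 stays because it still holds a used blob, while p2 becomes unreferenced once the index is cleaned. Neither step has to read pack contents, which is consistent with the speed claims made for these commands.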

All three commands are meant to be fast and to use no more memory than 'backup' or 'check'.
Maybe a future rewrite of 'prune' can use these commands. They only use the index implementation from either internal/repository or internal/index, and only read index and metadata from the repository (which should already be in the cache).

In the meantime, the new commands can mitigate the situation and make it possible to clean up repositories that currently cannot be pruned, especially large remote repositories.

Was the change discussed in an issue or in the forum before?

Prune issues have been widely discussed, e.g. #1140 #1599 #1723 #1985 #2162 #2227 #2305
There are also other PRs trying to improve the situation, see #1994 #2340 #2507.

Maybe this pull request can be merged pretty fast as there is no change to existing functionality.
I'm looking forward to getting feedback from code reviewers 😄

closes #1599
closes #1985
closes #2162
closes #2227
closes #2305

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review
Add new commands:
'cleanup-index' removes all blobs not used in snapshots from index
'cleanup-packs' removes all packs that are not referenced by the index
Alexander Weiss
Changelog was added
@aawsome


aawsome commented Dec 15, 2019

I do not know why the build on macOS failed after adding the changelog (commit 7fcd225)...
Can anybody help?

Alexander Weiss added 2 commits Dec 22, 2019
- optimize cleanup-index
- add repacking of packs to cleanup-packs (WIP!)
- cleanup-packs can now be used to repack packs
- cleanup-index also checks for used blobs not in index
@irasnyd


irasnyd commented Jan 2, 2020

I can report that I used the master restic + this PR built on December 19th, 2019 to prune a very large (12M object / ~55TB) AWS S3 backed repository. It was very useful, and worked perfectly as far as I can tell.

@aawsome


aawsome commented Jan 2, 2020

@irasnyd Thank you for your feedback. I'm very pleased that this PR could help you!

I guess you used the version where only completely unused packs are deleted?
In cleanup-packs I added an option to allow repacking of packs that are partly used or small. This could save some more space, but it is slow, as it needs to download all relevant packs and upload them again.

I just realized that the command is not really verbose - I'll change this so that you can use --dry-run to play around with the new parameters --unused-percent, --unused-space and --used-space.

Alexander Weiss
@irasnyd


irasnyd commented Jan 2, 2020

> @irasnyd Thank you for your feedback. I'm very pleased that this PR could help you!
>
> I guess you used the version where only completely unused packs are deleted?

Yes, that is correct. I merged this PR up to commit 7fcd225.

> In cleanup-packs I added an option to allow repacking of packs that are partly used or small. This could save some more space, but is slow as it needs to download all relevant packs and re-upload them again.
>
> I just realized that the command is not really verbose - I'll change this so that you can use --dry-run to play around with the new parameters --unused-percent, --unused-space and --used-space.

Thanks for the additions. I think they are valuable improvements to this new functionality.

I won't be able to test them anytime very soon. I lifecycle my data into the AWS S3 Infrequent Access tier (a "colder storage" tier) to save costs, and I don't want to risk increased charges by repacking. It isn't worth it to me at the current time.

@irasnyd irasnyd mentioned this pull request Jan 3, 2020
3 of 7 tasks complete
@seqizz


seqizz commented Jan 27, 2020

I tried these commands and they seem to work pretty well for a 500 GB repository (5 min for index cleanup, 35 min for packs, although the default verbosity was a bit much on packs). Good job 👍

Alexander Weiss
Added the command `repack-index`. With this command the index files are
repacked so that small index files are put together into larger ones.
@aawsome


aawsome commented Feb 5, 2020

I've added a new command repack-index. With this command, small index files can be merged into larger index files.
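
The idea of merging small index files can be sketched as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: the `IndexFile` type, the `repackIndex` function, and the `minPacks`/`maxPacks` thresholds are all hypothetical and only illustrate the grouping logic.

```go
package main

import "fmt"

// IndexFile is a simplified stand-in for one index file:
// a name plus the IDs of the packs it lists (hypothetical type).
type IndexFile struct {
	Name  string
	Packs []string
}

// repackIndex leaves files listing at least minPacks packs untouched and
// merges the entries of all smaller files into new files that hold up to
// maxPacks entries each. Both thresholds are illustrative assumptions.
func repackIndex(files []IndexFile, minPacks, maxPacks int) []IndexFile {
	var out []IndexFile
	var pending []string
	merged := 0

	// flush writes the collected entries as one new, larger index file.
	flush := func() {
		if len(pending) == 0 {
			return
		}
		merged++
		out = append(out, IndexFile{Name: fmt.Sprintf("merged-%d", merged), Packs: pending})
		pending = nil
	}

	for _, f := range files {
		if len(f.Packs) >= minPacks {
			out = append(out, f) // already large enough, keep as is
			continue
		}
		for _, p := range f.Packs {
			pending = append(pending, p)
			if len(pending) == maxPacks {
				flush()
			}
		}
	}
	flush()
	return out
}

func main() {
	files := []IndexFile{
		{Name: "a", Packs: []string{"p1", "p2", "p3"}},
		{Name: "b", Packs: []string{"p4"}},
		{Name: "c", Packs: []string{"p5"}},
		{Name: "d", Packs: []string{"p6"}},
	}
	for _, f := range repackIndex(files, 2, 2) {
		fmt.Println(f.Name, f.Packs)
	}
}
```

Only index data is rewritten in such a scheme; pack files are not touched, so the step needs no pack downloads.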

With the commands cleanup-index, cleanup-packs and repack-index, all prune functionality in terms of deleting unused data and repacking the resulting small files is now present. These new commands use less memory and are faster than the current prune implementation.

Only recovery actions like rebuilding the index from pack files are not covered by the commands in this PR. A next step could be to combine these parts into a single new prune or forget functionality.

@aawsome aawsome changed the title from "Prune issues: add new commands 'cleanup-index' and 'cleanup-packs'" to "Prune issues: add new commands 'cleanup-index', 'cleanup-packs' and 'repack-index'" on Feb 5, 2020
@aawsome


aawsome commented Feb 19, 2020

@tscs37 about your comment (in #2473)

> Keep in mind that Prune took almost 17 days to complete for me (>400 hours) so unless the optimizations in the linked issue can improve on prune runtimes by two orders of magnitudes, it's not going to solve the underlying issue for the amount of data I back up

I would like to see the timing results of using this PR, if you can try it out. I assume you will get a HUGE improvement (and also reduced B2 costs), as the standard prune downloads and scans your whole repository while these commands mainly work only on the index.

@tscs37


tscs37 commented Feb 20, 2020

Both cleanup commands were indeed a lot faster and ran for about 8 hours. However, the repack-index command crashed fairly early into processing, so I'm currently rebuilding the index (restic check on a subset of the backup gave a green light) and can't say anything conclusive about its performance (though it ran fairly fast, at least as far as it got).

Stacktrace:


repository 2dfcbd9a opened successfully, password is correct                              
load all index files                                                                      
[0:22] 73.26%  400 / 546 index files loaded                                               
pack 94487184 already present in the index                                                
github.com/restic/restic/internal/index.(*Index).AddPack                                  
        internal/index/index.go:262                                                       
github.com/restic/restic/internal/index.Load.func1                                        
        internal/index/index.go:229                                                       
github.com/restic/restic/internal/repository.(*Repository).List.func1                     
        internal/repository/repository.go:643                                             
github.com/restic/restic/internal/backend.(*RetryBackend).List.func1.1                    
        internal/backend/backend_retry.go:133                                             
github.com/restic/restic/internal/backend/b2.(*b2Backend).List                            
        internal/backend/b2/b2.go:289                                                     
github.com/restic/restic/internal/backend.(*RetryBackend).List.func1                      
        internal/backend/backend_retry.go:127                                             
github.com/cenkalti/backoff.RetryNotify                                                   
        vendor/github.com/cenkalti/backoff/retry.go:37                                    
github.com/restic/restic/internal/backend.(*RetryBackend).retry                           
        internal/backend/backend_retry.go:36
github.com/restic/restic/internal/backend.(*RetryBackend).List
        internal/backend/backend_retry.go:126 
github.com/restic/restic/internal/repository.(*Repository).List
        internal/repository/repository.go:637 
github.com/restic/restic/internal/index.Load
        internal/index/index.go:201
main.RepackIndex
        cmd/restic/cmd_repack_index.go:68
main.runRepackIndex
        cmd/restic/cmd_repack_index.go:48
main.glob..func23
        cmd/restic/cmd_repack_index.go:18
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:762
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:852
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:800
main.main
        cmd/restic/main.go:86
runtime.main
        /usr/lib/go/src/runtime/proc.go:203
runtime.goexit
        /usr/lib/go/src/runtime/asm_amd64.s:1357

@aawsome


aawsome commented Feb 20, 2020

@tscs37 Thank you for testing and reporting the error.

repack-index failed while reading the index and did not change anything. cleanup-index and cleanup-packs both leave the repository in a sane state, so if you only use those two commands, you can still clean up your repository completely; it just may happen that a lot of index files accumulate over time (which does not affect functionality but may lead to performance issues).

I'll look into this issue in repack-index...

- Add handling to AddPack when packs are present in more than one
  index file
@aawsome


aawsome commented Feb 23, 2020

@tscs37 The issue you reported should now be fixed with the last commit.
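
The crash above came from a pack being listed in more than one index file. One plausible way to tolerate that, shown here as a minimal sketch, is to merge the blob lists instead of failing. The `Index` type, its fields, and this `AddPack` signature are simplified assumptions for illustration; they are not the signatures from internal/index and not necessarily what the commit does.

```go
package main

import (
	"fmt"
	"sort"
)

// Index is a simplified, hypothetical in-memory index:
// pack ID -> set of blob IDs stored in that pack.
type Index struct {
	packs map[string]map[string]bool
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{packs: make(map[string]map[string]bool)}
}

// AddPack records the blobs of a pack. If the pack is already present,
// for example because two index files list it, the blob sets are merged
// instead of reporting "already present in the index".
func (idx *Index) AddPack(pack string, blobs []string) {
	set, ok := idx.packs[pack]
	if !ok {
		set = make(map[string]bool)
		idx.packs[pack] = set
	}
	for _, b := range blobs {
		set[b] = true
	}
}

// Blobs returns the sorted blob IDs known for a pack.
func (idx *Index) Blobs(pack string) []string {
	var out []string
	for b := range idx.packs[pack] {
		out = append(out, b)
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := NewIndex()
	// The same pack appears in two index files with overlapping contents.
	idx.AddPack("94487184", []string{"b1", "b2"})
	idx.AddPack("94487184", []string{"b2", "b3"})
	fmt.Println(idx.Blobs("94487184"))
}
```

Merging is idempotent, so loading the same index file twice or loading overlapping index files gives the same result.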

Alexander Weiss
- update Packs after merging
@tscs37


tscs37 commented Feb 23, 2020

Thanks. After rebuilding the index and running a check, everything seems to be running fine with all three commands. They do seem to clean up quite a bit of data, though from the looks of it, a proper prune can still reclaim a tiny bit more data overall.

@aawsome


aawsome commented Feb 23, 2020

@tscs37: by default, cleanup-packs only removes packs that are completely unused. This can be done without reading any pack from the repo and hence is very fast. My experience with my own repos is good: the "overhead" (packs that are partly used) is in my case less than 1% of the repo size.

If you also want to repack partly used packs (as prune does), you can use the optional flags --unused-percent, --unused-size, --used-size.
Using --unused-size=1 will basically repack all partly used packs. However, this means that all packs to be repacked have to be read from the repo, which can be quite slow for remote repos. (But it is still faster than prune, which reads all packs.)
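
The per-pack decision described above can be sketched like this. It is a minimal sketch and the exact semantics of the three flags are assumptions: here a pack is repacked if its unused bytes reach the `UnusedSize` threshold, if its unused share reaches `UnusedPercent`, or if its used bytes fall below `UsedSize`, and a threshold of 0 disables that rule. All type and function names are hypothetical.

```go
package main

import "fmt"

// Pack holds the used and unused byte counts of one pack (hypothetical type).
type Pack struct {
	ID     string
	Used   uint64 // bytes of blobs still referenced by snapshots
	Unused uint64 // bytes of blobs no longer referenced
}

// Action is what a cleanup run would do with a pack.
type Action int

const (
	Keep Action = iota
	Delete
	Repack
)

var actionNames = [...]string{"keep", "delete", "repack"}

// Thresholds mirrors the three optional flags; the semantics are assumed.
type Thresholds struct {
	UnusedPercent float64 // repack if at least this percentage is unused
	UnusedSize    uint64  // repack if at least this many bytes are unused
	UsedSize      uint64  // repack if fewer than this many bytes are used
}

// decide classifies a pack. With all thresholds at 0 it only deletes
// completely unused packs, which needs no pack data to be read.
func decide(p Pack, t Thresholds) Action {
	switch {
	case p.Used == 0:
		return Delete
	case t.UnusedSize > 0 && p.Unused >= t.UnusedSize:
		return Repack
	case t.UnusedPercent > 0 &&
		float64(p.Unused)*100/float64(p.Used+p.Unused) >= t.UnusedPercent:
		return Repack
	case t.UsedSize > 0 && p.Used < t.UsedSize:
		return Repack
	}
	return Keep
}

func main() {
	packs := []Pack{
		{ID: "full", Used: 100, Unused: 0},
		{ID: "partial", Used: 60, Unused: 40},
		{ID: "empty", Used: 0, Unused: 100},
	}
	t := Thresholds{UnusedSize: 1} // comparable to --unused-size=1
	for _, p := range packs {
		fmt.Println(p.ID, actionNames[decide(p, t)])
	}
}
```

Under these assumptions a threshold of one unused byte selects every partly used pack for repacking, while fully used packs are kept and fully unused packs are deleted.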
