
Prune repack threshold #1994

Open
teknynja wants to merge 11 commits into master from teknynja:prune-repack-threshold

Conversation

teknynja commented Sep 10, 2018

What is the purpose of this change? What does it change?

This change adds the --repack-threshold flag to the prune command, allowing users to specify how much unused space to tolerate in packs when removing data from the repository.
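
As an illustrative invocation (the exact value syntax is an assumption, not quoted from this PR), restic prune --repack-threshold 100 would remove only packs consisting entirely of unused data, while lower values also rewrite packs whose share of unused data meets the threshold.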

Was the change discussed in an issue or in the forum before?

Discussed in Issue #1985

Closes #348

Checklist

  • I have read the Contribution Guidelines
  • I have added tests for all changes in this PR
  • I have added documentation for the changes (in the manual)
  • There's a new file in changelog/unreleased/ that describes the changes for our users (template here)
  • I have run gofmt on the code in all commits
  • All commit messages are formatted in the same style as the other commits in the repo
  • I'm done, this Pull Request is ready for review
stats.totalPacks += 1
stats.totalBytes += uint64(pack.Size)
for _, blob := range pack.Entries {
stats.totalBlobs += 1

houndci-bot Sep 10, 2018

should replace stats.totalBlobs += 1 with stats.totalBlobs++

if blobCount[h] > 1 {
duplicateBlobs++
duplicateBytes += uint64(entry.Length)
stats.totalPacks += 1

houndci-bot Sep 10, 2018

should replace stats.totalPacks += 1 with stats.totalPacks++

Resolved review comments in cmd/restic/cmd_prune.go
teknynja (Author) commented Sep 10, 2018

The command gofmt -w **/*.go shown in CONTRIBUTING.md does not work for me when run in bash at the root of the project (and it doesn't show any errors either), so I assumed there was nothing to format.

The CI builds were failing, indicating that gofmt hadn't been run on the files I modified, so first I tried gofmt -w ., which formatted not only my files but also everything under ./vendor. I did not want to be messing with those files, so I finally came up with this:

find . -path ./vendor -prune -o -name "*.go" -exec gofmt -w {} \;

which is way more complicated than it should be.

What command should I use to make sure the project files are properly formatted before I commit?

teknynja (Author) commented Oct 13, 2018

Sorry I went dark there for a few weeks; I got pulled off onto something else! So, is there anything else we need to cover here? I think we were discussing the need to add tests, but I wasn't able to find many existing tests for cmd_prune.

fd0 (Member) commented Oct 13, 2018

Now that 0.9.3 is released I'll try to review this ASAP, sorry for the delay!

codecov-io commented Oct 17, 2018

Codecov Report

Merging #1994 into master will decrease coverage by 3.73%.
The diff coverage is 78.94%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #1994      +/-   ##
=========================================
- Coverage   50.74%     47%   -3.74%     
=========================================
  Files         176     176              
  Lines       14143   14078      -65     
=========================================
- Hits         7177    6618     -559     
- Misses       5920    6469     +549     
+ Partials     1046     991      -55
Impacted Files Coverage Δ
cmd/restic/cmd_forget.go 42.53% <0%> (-0.32%) ⬇️
cmd/restic/cmd_prune.go 66.12% <81.08%> (+2.96%) ⬆️
internal/backend/b2/b2.go 0% <0%> (-80.69%) ⬇️
internal/backend/swift/swift.go 0% <0%> (-78.45%) ⬇️
internal/backend/gs/gs.go 0% <0%> (-73.61%) ⬇️
internal/backend/azure/azure.go 0% <0%> (-69.46%) ⬇️
internal/backend/swift/config.go 36.95% <0%> (-54.35%) ⬇️
internal/index/index.go 77.18% <0%> (-5.12%) ⬇️
internal/restic/testing.go 79.8% <0%> (-0.2%) ⬇️
cmd/restic/cmd_recover.go
... and 5 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9310cd0...abc1572.

teknynja (Author) commented Oct 17, 2018

Just merged 0.9.3 into this pull request

fd0 (Member) commented Oct 17, 2018

Oh, unfortunately that makes the review a lot harder (and the git history even more confusing) 😦

Can you please remove the merge commit? If you'd like to base your code on the latest master branch, please use git rebase instead. Thanks! 😄

teknynja force-pushed the teknynja:prune-repack-threshold branch from 88a0fb9 to b6ec889 on Oct 17, 2018

teknynja (Author) commented Oct 17, 2018

Yikes! Sorry - I'm pretty much a noob when it comes to collaborative git! (I've been using it for the most part exclusively for private development so far)

Hopefully I removed the merge commit and did the rebase correctly...

fd0 (Member) commented Oct 17, 2018

Uh, it looks odd now. Do you need help? I can try to resolve this for you and force-push the branch, but then you need to remove the branch, fetch the changes from github and checkout the branch locally again...

teknynja (Author) commented Oct 17, 2018

Ugh, Sorry! I don't mean to create more work for you! But yes, I guess I do need more hand-holding on this - just let me know what I need to do to fix this - Thanks!

mlissner (Contributor) commented Oct 26, 2018

Two thoughts:

  1. Do we want to choose a more reasonable default for this than the current 100%? Or if not a default, some kind of recommendation in the documentation? I'm thinking somewhere around 80% might make sense?

    Never mind! I just saw in the updated docs that 80% is indeed the new default. Wow, I guess that's some level of confirmation on that choice.

  2. Have we done any performance testing to see the impact of this? I'd love to know how big an impact the new default makes. For example, do we think this will make prune usable again on big backups (in the low TB range)?

Myk Shire and others added some commits Sep 8, 2018

  • Add repack-threshold flag to prune command
    Add a flag to the prune command that is used to specify the threshold
    percentage of unused data in a pack to trigger a rewrite of the
    blobs in that pack.
  • Change default Repack Threshold
    The default --repack-threshold for the `prune` command (and pruning from
    the `forget` command) is set by the DefaultRepackThreshold constant in
    cmd_prune.go. The default value is now 20 instead of 0 (see the sketch
    after this commit list).
  • Implement repack-threshold in prune
    Adds a feature to the `prune` command that allows the user to specify the
    minimum percentage of unused data in a pack before that pack will be
    rewritten during a prune operation.
    The implementation removed the second pack-processing loop as it was no
    longer needed, refactored more variables into the `stats` structure, and
    combined some loops to speed up processing.
  • Revert bash-completion.sh
    Oops - doc/bash-completion.sh is a generated file, so switch it back to
    the original version for now.
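
To make the commit descriptions above concrete, here is a rough sketch of how such a constant and flag could be wired up with cobra. Apart from the flag name and the DefaultRepackThreshold constant mentioned in the commits, the identifiers and structure are illustrative assumptions, not restic's actual code:

package main

import "github.com/spf13/cobra"

// DefaultRepackThreshold is the default percentage of unused data a pack must
// contain before it is rewritten (20, per the commit message above).
const DefaultRepackThreshold = 20

// repackThreshold receives the value of the --repack-threshold flag.
var repackThreshold uint

var cmdPrune = &cobra.Command{
	Use: "prune",
	RunE: func(cmd *cobra.Command, args []string) error {
		// The real prune logic lives in restic's cmd_prune.go and is omitted here.
		return nil
	},
}

func init() {
	cmdPrune.Flags().UintVar(&repackThreshold, "repack-threshold",
		DefaultRepackThreshold,
		"minimum percentage of unused data in a pack before it is rewritten")
}

func main() {
	_ = cmdPrune.Execute()
}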

fd0 force-pushed the teknynja:prune-repack-threshold branch from e106a6a to abc1572 on Oct 28, 2018

fd0 (Member) commented Oct 28, 2018

So, I just spent half an hour untangling the mess of several merge commits; it's in good shape now, and I made sure that the code the commits add represents your latest state. It's a bit odd that you used two different email addresses for the commits.

For you, it's probably easiest to delete the branch locally, then fetch from the repo (e.g. git fetch origin) and check out the fresh branch (git checkout -b prune-repack-threshold origin/prune-repack-threshold), since this was a so-called non-fast-forward push.

I'll look at the code itself next.

fd0 (Member) commented Oct 29, 2018

Ah, re-reading my text this morning: "the mess of several merge commits" was not meant in any negative way, just as a statement. I know (very well) that Git can be a real pain to use, even for me, and I've been using Git since 2006. I'm always happy to help :)

Going forward, please don't use git merge for feature branches and pull requests, use git rebase master instead.

teknynja (Author) commented Oct 29, 2018

@fd0 - No worries, messes happen when I thrash around like I did! I'm glad I learned about the rebase command, just sorry it was at your expense. As for the multiple emails, I was working on different machines and apparently I never re-set the user.name and user.email on one of them. Anyway, thank you very much for taking the time to help me out and clean up my mess.

Out of an abundance of caution, I just got rid of the local working repo of my fork and re-cloned it, and checked out my "prune" branch so hopefully I'm on the same page as you now. (I didn't know you (upstream) could fix up my fork like that! Learned something else today too.)

@mlissner - I ran this against my "virtual machine" repository (mostly >20GB files) which was at just a bit over 1TB. I ran the forget command to remove all but the latest snapshot (which ran very quickly) and then did a prune with the threshold set at 100% (only remove packs that are completely unused). The prune still took a very long time (> 12 hours) but removed 231GB and left 188GB of unused blobs remaining, reducing my repository from 1018GB and 202777 files to 770GB and 154625 files. Looking at the B2 transactions it looks like it was mostly "b2_download_file_by_name" and a lot of "b2_delete_file" and not a lot of uploading, just as expected.

The nice thing about the --repack-threshold on the prune command is that you can start out with a higher percentage and if that doesn't remove enough data for your taste, you can re-run it with a lower percentage to remove more partially empty packs (provided you don't mind waiting for the prune command to run again!)

mlissner (Contributor) commented Oct 30, 2018

Hm, I sense testing the performance of this is going to be tough since we can't compare two thresholds easily. What I'd find really valuable is if there was a way to redo that test, @teknynja, except setting the threshold to 80% and seeing what the difference is.

teknynja (Author) commented Oct 30, 2018

@mlissner - We might be able to set up some tests using a local repository instead of an internet-backed one - the speed is much better and we can try out different scenarios. What kind of data are you interested in trying? Mostly large or small files? Are the file contents mostly static or do they change a lot? Are there a lot of new files created/old files deleted?

While I was working on the code I tested against a small local repository with a few dozen small (20MB) to large (1500MB) files, where I would delete a few at a time, create a snapshot, remove the old snapshots, and then prune (I know, not very thorough). Under those circumstances there seemed to be a fairly linear relationship between the threshold and what was removed, from 100% down through 0%.

I was holding off on live testing until the code could be reviewed, but curiosity got the better of me. I've only been using this patch against my live data for the last week or so but only at 100% (which is the use case I was interested in to prevent a bunch of writes when pruning). Pruning takes so long on large backblaze repositories that I haven't invested the time in seeing what lower percentages are like.

Unfortunately, I've already done another backup against my VM repository since pruning it, so a direct comparison probably isn't going to work now. I've got a repository that contains more "working" files so perhaps I'll give that one a try soon with varying prune levels and report back.

mlissner (Contributor) commented Oct 30, 2018

Yeah, I use backblaze for my storage too, so my big question is: Will this fix prune so that I can use it?

We have millions of fairly small files (1-10 page PDFs, mostly), coming to a total of a couple TB of data. We mostly only add data, so prune isn't super critical, but right now it's so slow I have it completely disabled.

I'm a bit confused though. What's the difference between 100% and the old default behavior? I thought that was the same?

teknynja (Author) commented Oct 30, 2018

@mlissner - I'm not sure this change will do much for you then - it still takes prune a long time to build the index (it does it twice, but I'm not sure it's absolutely required the second time). It will save the time it takes to rebuild and rewrite packs, which takes a lot of time as well.

As for the percentage value, it represents the percentage of empty space in a pack required to consider repacking/deleting it. So 100% means the pack must be completely empty to be removed, whereas 0% means that any amount of empty space in the pack will cause it to be rewritten (the old behavior). @fd0 suggested setting the default for the threshold to 20% (so only packs with 20% or more empty space will be repacked), which is what I've done, although for my own use I'd pretty much only use 0% or 100%. My concern with prune and backblaze is that rewriting the packs uses transactions that incur fees past a certain threshold, and I'm trying to avoid those.

I would love to figure out how to speed up indexing so that pruning (and other operations) would be significantly faster!
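
To make the percentage semantics described above concrete, here is a minimal sketch of the check, assuming the threshold is compared against the share of unused bytes in each pack; shouldRepack and its parameter names are illustrative, not the identifiers actually used in cmd_prune.go:

package main

import "fmt"

// shouldRepack reports whether a pack should be rewritten (or deleted, if
// nothing in it is still referenced), given the threshold percentage.
func shouldRepack(unusedBytes, packBytes uint64, thresholdPercent float64) bool {
	if packBytes == 0 || unusedBytes == 0 {
		return false // nothing to reclaim from an empty or fully used pack
	}
	unusedPercent := float64(unusedBytes) / float64(packBytes) * 100
	// 0 reproduces the old behavior (any unused space triggers a rewrite),
	// 20 is the new default, and 100 only touches packs that are entirely unused.
	return unusedPercent >= thresholdPercent
}

func main() {
	fmt.Println(shouldRepack(3<<20, 4<<20, 20))  // 75% unused, threshold 20: true
	fmt.Println(shouldRepack(3<<20, 4<<20, 100)) // 75% unused, threshold 100: false
}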

mlissner (Contributor) commented Oct 30, 2018

Super helpful, thank you @teknynja. (I had my understanding of the percentage backwards in my mind.) I guess this will help things then, but it isn't the holy grail I've been waiting for that would make prune workable.

fd0 (Member) commented Oct 31, 2018

So 100% means the pack must be completely empty to be removed

Small correction: "100%" means only packs which consist of completely unneeded data are deleted. This is a good setting for backends like B2, where the storage does not cost that much, but the transfer/API calls cost a lot more.
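
To put illustrative numbers on that: a pack holding 4 MiB of blobs of which 3 MiB are no longer referenced is 75% unused, so it would be rewritten at a threshold of 20 or 50 but left untouched at 80 or 100; only packs whose blobs are all unreferenced are deleted outright at 100.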

ifelsefi commented Dec 19, 2018

Hi

Will this be merged soon?

ifelsefi commented Dec 20, 2018

Hi

I am using this code; however, it takes many hours to even count the files in my repo when doing restic prune. Pruning then takes several days, though I have never actually completed a prune operation.

I noticed that only one thread is doing this.

Could making this multithreaded help?

My repo is around 130TB. Given restic cannot really prune it efficiently we may need to stop using restic.

fd0 (Member) commented Dec 21, 2018

Could making this multithreaded help?

Yes, among other things. I have plenty of ideas for how to improve the situation (but not much time right now). I still need time to look at this code, sorry for the delay! I hope to find some time during the holidays.

In general, we need to be very careful touching this code here, because it's the one place where data is removed permanently. We started with a very basic (and slow) implementation, and we'll improve from there.

My repo is around 130TB. Given restic cannot really prune it efficiently we may need to stop using restic.

I totally understand, this is one of the areas where restic is still lacking. I hope you find something in the meantime which works for you! :)

ifelsefi commented Dec 21, 2018

Hi @fd0

Thank you for putting together a great program. Appreciate any effort on improving this portion of the code.

jsreynolds commented Feb 6, 2019

@ifelsefi, just curious: you have a lot of data there, and I'm only at about 1.8TB. I've been using Duplicacy but keep looking at (and testing) Restic periodically to see if I could move to it at some point. Did you happen to stick with Restic or move to another product?

My repo is around 130TB. Given restic cannot really prune it efficiently we may need to stop using restic.

return nil
})
if err != nil {
return err
}

Verbosef("building new index for repo\n")

- bar := newProgressMax(!gopts.Quiet, uint64(stats.packs), "packs")
+ bar := newProgressMax(!gopts.Quiet, uint64(stats.totalFiles), "packs")

BenoitKnecht (Contributor) commented Mar 27, 2019

I think you meant stats.totalPacks here.
