
prune: Avoid running the "find data that is still in use" step when not needed #812

Closed
dsommers opened this issue Feb 17, 2017 · 4 comments


dsommers commented Feb 17, 2017

Output of restic version

restic 0.4.0 (v0.4.0-46-gc83e608)
compiled with go1.6.3 on linux/amd64

This is a suggestion for an enhancement ... I'm not 100% sure it is a valid case, but let's hear what you think.

When re-running a prune job on a repository which has already been pruned, it still does this:

counting files in repo
building new index for repo
[0:37] 100.00%  2828 / 2828 packs
repository contains 2828 packs (685497 blobs) with 12.118 GiB bytes
processed 685497 blobs: 0 duplicate blobs, 0B duplicate
load all snapshots
find data that is still in use for 132 snapshots
...

Is it truly necessary to run the 'find data that is still in use...' step (and the steps after it) when the preliminary check already reports '0 duplicate blobs, 0B duplicate'?

I can see that in some cases such maintenance may still make sense, so I would suggest adding a --force argument to the prune command to keep the current behaviour.

The reason for this request is that I have a script which runs regularly in the background when I log into my computer. The first thing it does is run 'restic prune' to do a clean-up at least once a day. Then it enters a loop which runs restic forget and restic backup at certain intervals throughout the day, until I log out and shut down my computer. As the restic prune job can easily take 30-45 minutes in my setup (even longer when I'm connected to a VPN), it would be great to speed up this pruning step when it is not strictly needed. Currently my script skips the prune step when I'm on the VPN / not on my local LAN, to reduce both CPU and network load.
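For context, a minimal sketch of the wrapper script described above could look like this. Everything here is illustrative: the stamp-file path, the function names, and the on_vpn check (which assumes the VPN shows up as a tun0 interface) are my own, not anything provided by restic.

```shell
#!/bin/sh
# Sketch of a login-time maintenance wrapper (hypothetical names).

STAMP="$HOME/.cache/restic-last-prune"

# True when the VPN is up. Assumption: the VPN creates a tun0 interface.
on_vpn() {
    ip link show tun0 >/dev/null 2>&1
}

# True when no stamp file exists, or the last prune was over a day ago.
needs_prune() {
    [ ! -e "$STAMP" ] || [ -n "$(find "$STAMP" -mtime +0 2>/dev/null)" ]
}

# Run the expensive prune only off-VPN and at most once per day.
daily_maintenance() {
    if needs_prune && ! on_vpn; then
        restic prune && touch "$STAMP"
    fi
}
```

The login script would call daily_maintenance once, then enter the usual forget/backup loop.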

Any thoughts?

@zcalusic (Member)

You don't know how lucky you are. ;) The prune stage on my personal repo runs about 15h. :(

But, back to the subject: '0 duplicate blobs, 0B duplicate' is informative, but it is not an estimate of how much will be pruned. Finding the packs to delete or rewrite happens later, in the "find data" stage, which is the slowest stage. As far as I know, duplicate blobs can only occur if you back up to the same repo simultaneously. But prune's main task is to find what you have "forgotten" before, and unfortunately that takes time, because the whole repo must be scanned. @fd0 promised to optimize it in the future.

@dsommers (Author)

Thanks, @zcalusic, for explaining these details. I now understand that things are not how I thought they were. Unless there is a reason to keep this ticket open, it may be closed.


fd0 commented Feb 18, 2017

Thanks @zcalusic for the correct explanation. This will be solved when I get around to implementing the local metadata cache; prune should then be greatly sped up. In addition, I'm thinking about adding the list of blobs referenced by a snapshot to the local cache, so that this scan needs to run only once per snapshot.

I'm going to close this issue for now, #29 tracks implementing this metadata cache.

@fd0 fd0 closed this as completed Feb 18, 2017
@fd0 fd0 changed the title [RFE] prune: Avoid running the "find data that is still in use" step when not needed prune: Avoid running the "find data that is still in use" step when not needed Feb 18, 2017

fd0 commented Feb 21, 2017

PR #817 may also be interesting: it adds a --prune switch to the forget command which automatically runs prune if snapshots have been removed (i.e. when a prune is actually needed).
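Assuming PR #817 is merged, the combined invocation might look like the following (the --keep-* policy values are examples only, not a recommendation):

restic forget --keep-daily 7 --keep-weekly 4 --prune

With --prune, restic would run the prune step afterwards only if this forget invocation actually removed snapshots, so an idle run stays cheap.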
