
prune: Avoid running the "find data that is still in use" step when not needed #812

Closed
dsommers opened this issue Feb 17, 2017 · 4 comments


dsommers commented Feb 17, 2017

Output of restic version

restic 0.4.0 (v0.4.0-46-gc83e608)
compiled with go1.6.3 on linux/amd64

This is a suggestion for an enhancement ... I'm not 100% sure it is a valid case, but let's hear what you think.

When re-running a prune job on a repository which has already been pruned, it still does this:

counting files in repo
building new index for repo
[0:37] 100.00%  2828 / 2828 packs
repository contains 2828 packs (685497 blobs) with 12.118 GiB bytes
processed 685497 blobs: 0 duplicate blobs, 0B duplicate
load all snapshots
find data that is still in use for 132 snapshots
...

Is it truly necessary to run the 'find data that is still in use...' step (and the steps after it) when the preliminary check already reports '0 duplicate blobs, 0B duplicate'?

I can see that in some cases such maintenance may still make sense, so I would suggest adding a --force argument to the prune command to keep the current behaviour.

The reason for this request is that I have a script which runs regularly in the background when I log into my computer. The first thing it does is run 'restic prune' to do a clean-up at least once a day. Then it enters a loop which runs restic forget and restic backup at certain intervals throughout the day, until I log out and shut down my computer. As the restic prune job can easily take 30-45 minutes in my setup (even longer when I'm connected to a VPN), it would be great to speed up this pruning step when it is not strictly needed. Currently my script skips the prune step when I'm on the VPN / not on my local LAN, to reduce both CPU and network load.
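For context, a minimal sketch of the wrapper script described above could look like this. Everything here is illustrative: the stamp-file path, the function names, and the on_vpn check (which assumes the VPN shows up as a tun0 interface) are my own, not anything provided by restic.

```shell
#!/bin/sh
# Sketch of a login-time maintenance wrapper (hypothetical names).

STAMP="$HOME/.cache/restic-last-prune"

# True when the VPN is up. Assumption: the VPN creates a tun0 interface.
on_vpn() {
    ip link show tun0 >/dev/null 2>&1
}

# True when no stamp file exists, or the last prune was over a day ago.
needs_prune() {
    [ ! -e "$STAMP" ] || [ -n "$(find "$STAMP" -mtime +0 2>/dev/null)" ]
}

# Run the expensive prune only off-VPN and at most once per day.
daily_maintenance() {
    if needs_prune && ! on_vpn; then
        restic prune && touch "$STAMP"
    fi
}
```

The login script would call daily_maintenance once, then enter the usual forget/backup loop.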

Any thoughts?

@zcalusic (Member)

You don't know how lucky you are. ;) The prune stage on my personal repo runs about 15h. :(

But, back to the subject: '0 duplicate blobs, 0B duplicate' is informative, but it is not an estimate of how much will be pruned. Finding the packs to delete or rewrite happens later, in the "find data" stage, which is the slowest stage. As far as I know, duplicate blobs can only occur if you back up to the same repo simultaneously. But prune's main task is to find what you have "forgotten" before, and unfortunately that takes time, because the whole repo must be scanned. @fd0 promised to optimize it in the future.

@dsommers (Author)

Thanks, @zcalusic, for explaining these details. I now understand that things are not how I thought they were. Unless there is a reason to keep this ticket open, it may be closed.


fd0 commented Feb 18, 2017

Thanks @zcalusic for the correct explanation. This will be solved when I get around to implementing the local metadata cache; prune should then be greatly sped up. In addition, I'm thinking about adding the list of blobs referenced by a snapshot to the local cache, so that this scan needs to run only once per snapshot.

I'm going to close this issue for now, #29 tracks implementing this metadata cache.

@fd0 fd0 closed this as completed Feb 18, 2017
@fd0 fd0 changed the title [RFE] prune: Avoid running the "find data that is still in use" step when not needed prune: Avoid running the "find data that is still in use" step when not needed Feb 18, 2017

fd0 commented Feb 21, 2017

PR #817 may also be interesting: it adds a --prune switch to the forget command which automatically runs prune if snapshots have been removed (i.e. when a prune is actually needed).
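Assuming PR #817 is merged, the combined invocation might look like the following (the --keep-* policy values are examples only, not a recommendation):

restic forget --keep-daily 7 --keep-weekly 4 --prune

With --prune, restic would run the prune step afterwards only if this forget invocation actually removed snapshots, so an idle run stays cheap.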
