option to convert step-holds to step-bookmarks in the pruner when it tries to destroy a step-held snapshot #288
I figure the best way to address this need is an option on the pruner that converts any existing replication holds to bookmarks. Converting means
Note that the
Would this address your needs?
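For illustration, the manual equivalent of such a conversion with plain zfs commands would look roughly like the sketch below. All dataset, snapshot, and hold-tag names are made up; zrepl's actual hold tags and bookmark names differ.

```sh
# Hypothetical names for illustration only.
# 1. Preserve the snapshot's state as a bookmark (bookmarks cost no space beyond
#    metadata and remain usable as the "from" side of incremental sends).
zfs bookmark tank/data@zrepl_20200101_000000 tank/data#zrepl_20200101_000000

# 2. Release the hold that was placed on the snapshot.
zfs release zrepl_STEP_example_tag tank/data@zrepl_20200101_000000

# 3. The pruner (or an admin) can now destroy the snapshot; only the bookmark remains.
zfs destroy tank/data@zrepl_20200101_000000
```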
It is used as the incremental "from", and it might also simply be not accessible and offline by that time. Now, is it possible to resume a resumable send from a bookmark?
Hmm ok, if resuming a send from a bookmark works, then your strategy of creating a bookmark for every hold and then releasing that hold, enabling the pruner to delete the snapshot, would solve this problem. It might even make sense to make this the default (I don't really see any downsides). "it" was referring to snapshots/resume tokens residing on the sink pool. As I regularly detach my sink pool, "it" would not be accessible most of the time.
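For reference, a plain incremental send can use a bookmark as its source, as opposed to a token-based resume. A minimal sketch with made-up dataset and host names:

```sh
# Incremental send whose "from" side is a bookmark rather than a snapshot
# (names are made up). `zfs receive -s` saves resumable state on the receiver.
zfs send -i tank/data#zrepl_20200101_000000 tank/data@zrepl_20200108_000000 \
  | ssh backuphost zfs receive -s backup/data
```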
One more idea about this feature, though possibly this is a separate feature: the implementation would
zrepl only ever keeps at most two holds per send-side dataset. But, if I understand you correctly, deferred destroys wouldn't address your problem, since you don't want the snapshot to stick around any longer, right?
zrepl currently holds around 200 snapshots per dataset for me, so this might also be a separate bug. But yes, you understood correctly: I effectively want to make an "offsite" backup which I connect to semi-regularly, and because I only connect semi-regularly, I don't want to waste the space on my main machine. That's why I want the snapshots to be destroyed and only bookmarks to be left.
All right, with regard to the 200 holds, keep an eye on #282 - I'll be pushing some code there soon, although @InsanePrawn's WIP implementation in that branch already works.
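As an aside, the holds that have piled up can be inspected with plain zfs; a rough sketch (pool name is made up):

```sh
# List every hold on every snapshot under the pool "tank" (name made up).
zfs list -H -t snapshot -o name -r tank | xargs zfs holds
```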
@JMoVS in #293, there are now subcommands for removing stale holds. There should also be some minor performance improvements (fewer zfs CLI invocations). Would you mind testing that PR so we can get it merged soon? Please provide any feedback on the PR in the comments there, unless it relates to this issue.
We discussed this in voice chat tonight:
To clarify: the current situation is a bug: if "keep-non-replicated" is not set, step-holds created by zrepl must be destroyed by zrepl, because the user really didn't want snapshots to accumulate on the sending side. A single hold violating this policy might not be all that bad, but hold leaks such as #316 exacerbate the problem.
I think ideally we need something better than 'keep all not-replicated snapshots': a different set of pruning rules. Also, not_replicated, or whatever succeeds it, should optionally take a list of JobIDs to support unconventional/unsupported setups, e.g. pruning in snap jobs or multiple jobs replicating from one source dataset.
That's what a configuration that specifies
Huh, actually, I'm not sure what your (@InsanePrawn) setup looks like. Which job does the pruning? The snap job or the replicating job?
Both. It depends. It's complicated! My abbreviated experimental configs for the three instances I'm using to test the 0.3 work are the following, as-is, on disk right now. I've played with the interval lengths a couple of times; these might be stupid for a number of reasons, I can't remember. node 1 -> node 2 -> node 3
(Yes, this replicates from the source to the primary backup and from the primary to the secondary. Some blogs might tell you that's a bad thing to do, because corruption occurring on the primary replicates onto the secondary, etc. Still unsure whether you can have one of those nasty uncorrectable blocks be successfully sent and received by ZFS.)
@JMoVS : I was talking to @InsanePrawn yesterday and he suggested that, for the 0.3 release, it might be sufficient to have a per-job option that disables step holds for incremental sends:
Suppose you yank the external drive during an incremental
Do you agree?
Yes, that would work, I think. As long as at all times there is a bookmark to increment from. ;-)
This is a stop-gap solution until we re-write the pruner to support rules for removing step holds.

Note that disabling step holds for incremental sends does not affect zrepl's guarantee that incremental replication is always possible. Suppose you yank the external drive during an incremental @from -> @to step:

* restarting that step, or future incrementals @from -> @to_later, will be possible because the replication cursor bookmark points to @from until the step is complete
* resuming @from -> @to will work as long as the pruner on your internal pool doesn't come around to destroy @to
* in that case, the replication algorithm should determine that the resumable state on the receiving side is useless because @to no longer exists on the sending side, and consequently clear it and restart an incremental step @from -> @to_later

refs #288
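Spelled out as plain zfs commands, the recovery path in the last bullet would look roughly like the sketch below. Dataset, host, and bookmark names are made up; zrepl performs the equivalent steps itself.

```sh
# @to was pruned on the sender, so the partially received state on the
# receiver can no longer be completed.

# 1. Abort (clear) the partial receive state on the receiving side.
zfs receive -A backup/data

# 2. Restart an incremental step from the replication cursor bookmark
#    (which still points at @from) to a newer snapshot @to_later.
zfs send -i tank/data#zrepl_CURSOR_from tank/data@to_later \
  | ssh backuphost zfs receive -s backup/data
```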
@JMoVS
EDIT:
Side note on the actual topic of this issue: the WIP branch
I'm not sure the problem I've just noticed is the same, but I'll report it here anyway just in case. Problem as seen by zrepl status (for several datasets, this is one example):
Reproducing the problem manually:
I found the clue to the solution here:
I wondered if 17th Feb 2022 was around the time I last upgraded zrepl. But it was actually over a month earlier that I did so, according to
@candlerb I think @JMoVS is asking for a feature, but your writing suggests that you think something is not working right. If so, what do you feel is not working right? Step holds are an integral part of zrepl, see https://zrepl.github.io/configuration/overview.html#how-replication-works
I think something's not working right: a stale hold was left in place for an ancient (17 Feb 2022) snapshot, meaning that snapshot pruning gave an error when trying to destroy the snapshot, so this snapshot was left around indefinitely.
I don't know why this hold was left there. Perhaps zrepl was prematurely aborted mid-replication. However, in such situations, I would expect it to be picked up and removed either by a subsequent replication or by the snapshot pruning process.
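For anyone hitting the same symptom, the stale hold can also be inspected and released by hand with plain zfs; a sketch with made-up names (the real tag is whatever `zfs holds` prints):

```sh
# Show which tags hold the old snapshot (names are made up).
zfs holds tank/data@zrepl_20220217_000000
# NAME                              TAG                  TIMESTAMP
# tank/data@zrepl_20220217_000000   zrepl_STEP_example   Thu Feb 17 ...

# Release the stale hold using the exact tag printed above; after that,
# pruning (or a manual destroy) can proceed.
zfs release zrepl_STEP_example tank/data@zrepl_20220217_000000
```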
As requested on IRC, here's my use case:
I have a laptop with an internal disk and zpool. This zpool is configured in zrepl as a push job, pushing to my USB-connected 2-way mirror zpool. Now, I primarily care about incremental backup always being possible, which means that a bookmark to increment from must always be available.
With the new holding logic, I would need the pruner to also clear all holds (as a setting).
The reason is as follows:
If replication happens, it creates the holds. Now, when I disconnect and export the external mirror zpool, zrepl will still hold those snapshots. I have configured zrepl to keep only 20 snapshots on my laptop to conserve space, and it happens that I'm away for a week or two.
In this use case, I would prefer for zrepl to release the holds, destroy the snapshots, and then, when the mirror target pool becomes available again, delete the resumable send state and send an incremental backup from the bookmark. This way, my local machine would only ever have 20 snapshots.
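To make the desired end state concrete, here is roughly what the laptop pool should look like after such a prune run (all names are made up):

```sh
# At most 20 recent snapshots remain locally, but bookmarks for replicated
# steps are kept, so an incremental send is still possible once the USB
# mirror pool is imported again.
zfs list -t snapshot,bookmark -o name -r tank/data
# tank/data@zrepl_20200115_000000   <- newest ~20 snapshots kept locally
# ...
# tank/data#zrepl_20200101_000000   <- bookmark of the last replicated state
```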