panic if tentative & regular replication cursors present for the same snapshot #666
Comments
I still have zfs-auto-snapshot running and use zrepl for migrating hardware in the future.
Please provide the debug log output starting from ca. 5 seconds before the panic until the panic.
Sorry for not getting back to this earlier. What's weird here is that zrepl sees the following abstractions before the send attempt starts:
According to Lines 274 to 279 in 498af6b
it should only try to delete step holds and tentative replication cursor bookmarks. However, the panic indicates otherwise. One of the tentative cursors is for the same snapshot as the replication cursor. Trying to reproduce this locally by creating a fake CURSORTENTATIVE bookmark yields your panic. Next step will be to debug it.
In the meantime, deleting the tentative cursors should fix the crash for you. However, only do it if you have a regular replication cursor for the same snapshot. Otherwise, incremental replication might break and you'd have to re-replicate the dataset. To list replication cursors and tentative replication cursors, use
Then check, on a per-filesystem basis, whether there is a regular replication cursor for the same snapshot as each tentative cursor, i.e., whether both a tentative cursor bookmark and a regular replication cursor bookmark exist for that snapshot.
If that is the case, you can delete the tentative cursor like so
After doing this for all filesystems, restart the zrepl daemon and retry replication.
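The per-filesystem comparison described above can be sketched in Go. This is a rough illustration, not part of zrepl: it assumes bookmark names of the form `<filesystem>#zrepl_CURSOR_G_<guid>_J_<job>` and `<filesystem>#zrepl_CURSORTENTATIVE_G_<guid>_J_<job>`, so verify the pattern against your own bookmark listing first. The function name `SafeToDelete` and the sample bookmark names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// cursorRE matches the ASSUMED naming scheme
// <filesystem>#zrepl_CURSOR[TENTATIVE]_G_<guid>_J_<job>.
// Check it against the bookmarks on your system before relying on it.
var cursorRE = regexp.MustCompile(
	`^(.+)#zrepl_CURSOR(TENTATIVE)?_G_([0-9A-Fa-f]+)_J_(.+)$`)

// cursorKey identifies "the same snapshot" for one filesystem and job.
type cursorKey struct{ fs, guid, job string }

// SafeToDelete (hypothetical helper) returns the tentative cursor bookmarks
// for which a regular replication cursor with the same filesystem, GUID, and
// job name also exists. Requiring the same job is a deliberately strict choice.
func SafeToDelete(bookmarks []string) []string {
	regular := map[cursorKey]bool{}
	tentative := map[cursorKey]string{}
	for _, b := range bookmarks {
		m := cursorRE.FindStringSubmatch(b)
		if m == nil {
			continue // not a zrepl cursor bookmark
		}
		k := cursorKey{fs: m[1], guid: m[3], job: m[4]}
		if m[2] == "TENTATIVE" {
			tentative[k] = b
		} else {
			regular[k] = true
		}
	}
	var out []string
	for k, b := range tentative {
		if regular[k] {
			out = append(out, b)
		}
	}
	sort.Strings(out) // deterministic order
	return out
}

func main() {
	// Made-up bookmark names for illustration only.
	bookmarks := []string{
		"pool/a#zrepl_CURSOR_G_00000000000000aa_J_push",
		"pool/a#zrepl_CURSORTENTATIVE_G_00000000000000aa_J_push",
		"pool/b#zrepl_CURSORTENTATIVE_G_00000000000000bb_J_push",
		"pool/b#manual_bookmark",
	}
	for _, b := range SafeToDelete(bookmarks) {
		fmt.Println("has a matching regular cursor:", b)
	}
}
```

In this example only the tentative cursor on `pool/a` is reported. The one on `pool/b` has no regular counterpart, so deleting it could break incremental replication.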
So, the problem is that the check is overly cautious. Lines 240 to 280 in 498af6b
and following lines ensure that the underlying snapshot or bookmark is not the send step's But that's actually too conservative: if a replication cursor bookmark is used as Now, the problem is: the replication planner could just as easily use the tentative cursor instead of the replication cursor. In that situation, the code snippet above would still try to delete the tentative cursor, and the So, I guess to solve this properly, we need to shift more of the logic into the
…'t use it (#714)
Before this PR, we would panic in the `check` phase of `endpoint.Send()`'s `TryBatchDestroy` call in the following cases: the current protection strategy does NOT produce a tentative replication cursor AND
* `FromVersion` is a tentative cursor bookmark
* `FromVersion` is a snapshot, and there exists a tentative cursor bookmark for that snapshot
* `FromVersion` is a bookmark != tentative cursor bookmark, but there exists a tentative cursor bookmark for the same snapshot as the `FromVersion` bookmark
In those cases, the `check` concluded that we would delete `FromVersion`. It came to that conclusion because the tentative cursor isn't part of `obsoleteAbs` if the protection strategy doesn't produce a tentative replication cursor.
The scenarios above can happen if the user changes the protection strategy from "with tentative cursor" to one "without tentative replication cursor", while there is a tentative replication cursor on disk. The workaround was to rename the tentative cursor.
In all cases above, `TryBatchDestroy` would have destroyed the tentative cursor.
In case 1, that would fail the `Send` step and potentially break replication if the cursor is the last common bookmark. The `check` conclusion was correct.
In cases 2 and 3, deleting the tentative cursor would have been fine because `FromVersion` was a different entity than the tentative cursor. So, destroying the tentative cursor would be the right call.
The solution in this PR is as follows:
* add the `FromVersion` to the `liveAbs` set of live abstractions
* rewrite the `check` closure to use the full dataset path (`fullpath`) to identify the concrete ZFS object instead of the `zfs.FilesystemVersionEqualIdentity`, which is only identified by matching GUID.
* Holds have no dataset path and are not the `FromVersion` in any case, so disregard them.
fixes #666
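To illustrate why matching by GUID is too conservative while matching by full dataset path is not, here is a minimal Go sketch. It is not zrepl's actual code: the `Abstraction` type, both function names, and the sample paths are hypothetical simplifications. It relies on the fact stated above that a snapshot and the bookmarks created from it share one GUID.

```go
package main

import "fmt"

// Abstraction is a hypothetical, simplified stand-in for an on-disk object
// that a batch destroy might remove. Holds are modeled with an empty FullPath.
type Abstraction struct {
	FullPath string // e.g. "pool/fs@snap" or "pool/fs#bookmark"; "" for holds
	GUID     uint64 // shared by a snapshot and every bookmark of it
}

// WouldDestroyByGUID mimics the overly cautious check: any destroy candidate
// with the same GUID as `from` is treated as if it were `from` itself.
func WouldDestroyByGUID(candidates []Abstraction, from Abstraction) bool {
	for _, c := range candidates {
		if c.GUID == from.GUID {
			return true
		}
	}
	return false
}

// WouldDestroyByFullPath identifies the concrete ZFS object by its full
// dataset path, and disregards holds because they have no dataset path.
func WouldDestroyByFullPath(candidates []Abstraction, from Abstraction) bool {
	for _, c := range candidates {
		if c.FullPath == "" {
			continue // a hold can never be the send step's source version
		}
		if c.FullPath == from.FullPath {
			return true
		}
	}
	return false
}

func main() {
	// Made-up names; the tentative cursor and the snapshot share GUID 42.
	tentative := Abstraction{FullPath: "pool/fs#zrepl_CURSORTENTATIVE_example", GUID: 42}
	snapshot := Abstraction{FullPath: "pool/fs@s1", GUID: 42}
	candidates := []Abstraction{tentative, {FullPath: "", GUID: 42}}

	// Case 2 from the list above: the source is a snapshot, and a tentative
	// cursor exists for the same snapshot.
	fmt.Println("by GUID:", WouldDestroyByGUID(candidates, snapshot))
	fmt.Println("by full path:", WouldDestroyByFullPath(candidates, snapshot))

	// Case 1: the source is the tentative cursor itself.
	fmt.Println("case 1, by full path:", WouldDestroyByFullPath(candidates, tentative))
}
```

With the GUID comparison, case 2 is flagged as a conflict even though the snapshot itself would survive. With the path comparison, only case 1 is flagged, which is the case where destroying the cursor would really remove the send source.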
After this error:
I changed from
replication:
  protection:
    initial: guarantee_incremental
    incremental: guarantee_incremental
to
replication:
  protection:
    initial: guarantee_resumability
    incremental: guarantee_resumability
After this I got this panic:
My config is: