
panic if tentative & regular replication cursors present for the same snapshot #666

Closed
ggzengel opened this issue Jan 3, 2023 · 7 comments · Fixed by #714

ggzengel commented Jan 3, 2023

After this error:

Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][rpc.ctrl][Uv38$tUfl$tUfl.Vn2w./7rT]: cannot connect err="dial tcp 10.19.80.211:1213: connect: connection refused"
Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tUfl.Vn2w./7rT./cvn.ixI5]: error listing receiver filesystems err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.19.80.211:1213: connect: connection refused\"" errType="*status.Error"
Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tUfl.Vn2w./7rT./cvn]: most recent error in this attempt attempt_number="0" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.19.80.211:1213: connect: connection refused\""
Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tUfl.Vn2w./7rT./cvn]: temporary connectivity-related error identified, start waiting for reconnect attempt_number="0" deadline="2023-01-03 18:28:40.28186702 +0000 UTC m=+1989903.101960353"
Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][rpc.data][Uv38$tUfl$dtxN$dtxN.Hkko]: ping failed err="dial tcp 10.19.80.211:1213: connect: connection refused"
Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][rpc.ctrl][Uv38$tUfl$dtxN$dtxN.Hkko]: ping failed err="rpc error: code = Canceled desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 10.19.80.211:1213: connect: connection refused\""
Jan  3 18:18:40 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tUfl.Vn2w./7rT./cvn]: reconnecting failed, aborting run err="receiver is not reachable: control and data rpc failed to respond to ping rpcs" attempt_number="0"
Jan  3 18:18:41 zfs1 zrepl[30336]: [Backup][rpc.ctrl][Uv38$tUfl$tUfl.Vn2w./7rT]: cannot connect err="dial tcp 10.19.80.211:1213: connect: connection refused"
Jan  3 18:18:43 zfs1 zrepl[30336]: [Backup][rpc.ctrl][Uv38$tUfl$tUfl.Vn2w./7rT]: cannot connect err="dial tcp 10.19.80.211:1213: connect: connection refused"
Jan  3 18:18:45 zfs1 zrepl[30336]: [Backup][rpc.ctrl][Uv38$tUfl$tUfl.Vn2w./7rT]: cannot connect err="dial tcp 10.19.80.211:1213: connect: connection refused"
Jan  3 18:18:48 zfs1 zrepl[30336]: [Backup][rpc.ctrl][Uv38$tUfl$tUfl.Vn2w./7rT]: cannot connect err="dial tcp 10.19.80.211:1213: connect: connection refused"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$0nL5$0nL5]: dry run send request failed err="validate send arguments: `To` invalid: dataset \"zpool1/store/ImagePools/drive_I/images03@zfs-auto-snap_frequent-2023-01-03-1730\" does not exist"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$0nL5$0nL5]: error computing size estimate err="validate send arguments: `To` invalid: dataset \"zpool1/store/ImagePools/drive_I/images03@zfs-auto-snap_frequent-2023-01-03-1730\" does not exist" step="zpool1/store/ImagePools/drive_I/images03(@zrepl_20230103_172838_000 => @zfs-auto-snap_frequent-2023-01-03-1730)" filesystem="zpool1/store/ImagePools/drive_I/images03"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$dST9$dST9]: dry run send request failed err="validate send arguments: `To` invalid: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$dST9$dST9]: error computing size estimate err="validate send arguments: `To` invalid: zfs exited with error: signal: killed\nstderr:\n" step="zpool1/store/ImagePools/drive_I/images03(@zrepl_20230103_181838_000 => @zrepl_20230103_182839_000)" filesystem="zpool1/store/ImagePools/drive_I/images03"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$s+Gb$s+Gb]: dry run send request failed err="validate send arguments: `To` invalid: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$s+Gb$s+Gb]: error computing size estimate err="validate send arguments: `To` invalid: zfs exited with error: signal: killed\nstderr:\n" step="zpool1/store/ImagePools/drive_I/images03(@zrepl_20230103_084837_000 => @zfs-auto-snap_hourly-2023-01-03-0917)" filesystem="zpool1/store/ImagePools/drive_I/images03"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$L9uy$L9uy]: dry run send request failed err="validate send arguments: `To` invalid: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 18:31:17 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$tEwW$L9uy$L9uy]: error computing size estimate filesystem="zpool1/store/ImagePools/drive_I/images03" err="validate send arguments: `To` invalid: zfs exited with error: signal: killed\nstderr:\n" step="zpool1/store/ImagePools/drive_I/images03(@zfs-auto-snap_frequent-2023-01-03-1815 => @zfs-auto-snap_hourly-2023-01-03-1817)"
Jan  3 18:38:40 zfs1 zrepl[30336]: [Backup][snapshot][Uv38$tUfl$p5sD$p5sD.bCOG]: callback channel is full, discarding snapshot update event
Jan  3 18:45:39 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$jucS$8QEb$8QEb]: dry run send request failed err="validate send arguments: `From` invalid: dataset \"zpool1/store/zmt-backup/ShortTerm/px-images@zfs-auto-snap_frequent-2023-01-03-1745\" does not exist"
Jan  3 18:45:39 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$jucS$8QEb$8QEb]: error computing size estimate filesystem="zpool1/store/zmt-backup/ShortTerm/px-images" err="validate send arguments: `From` invalid: dataset \"zpool1/store/zmt-backup/ShortTerm/px-images@zfs-auto-snap_frequent-2023-01-03-1745\" does not exist" step="zpool1/store/zmt-backup/ShortTerm/px-images(@zfs-auto-snap_frequent-2023-01-03-1745 => @zrepl_20230103_174839_000)"
Jan  3 18:45:39 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$jucS$U9X+$U9X+]: dry run send request failed err="validate send arguments: cannot check whether filesystem \"zpool1/store/zmt-backup/ShortTerm/px-images\" is encrypted: zfs get encryption enabled fs=\"zpool1/store/zmt-backup/ShortTerm/px-images\": cannot get `encryption` property: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 18:45:39 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$jucS$U9X+$U9X+]: error computing size estimate filesystem="zpool1/store/zmt-backup/ShortTerm/px-images" err="validate send arguments: cannot check whether filesystem \"zpool1/store/zmt-backup/ShortTerm/px-images\" is encrypted: zfs get encryption enabled fs=\"zpool1/store/zmt-backup/ShortTerm/px-images\": cannot get `encryption` property: zfs exited with error: signal: killed\nstderr:\n" step="zpool1/store/zmt-backup/ShortTerm/px-images(@zrepl_20230103_142837_000 => @zfs-auto-snap_hourly-2023-01-03-1517)"
Jan  3 18:45:39 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$jucS$ab4D$ab4D]: dry run send request failed err="validate send arguments: `From` invalid: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 18:45:39 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$jucS$ab4D$ab4D]: error computing size estimate filesystem="zpool1/store/zmt-backup/ShortTerm/px-images" err="validate send arguments: `From` invalid: zfs exited with error: signal: killed\nstderr:\n" step="zpool1/store/zmt-backup/ShortTerm/px-images(@zfs-auto-snap_frequent-2023-01-03-1800 => @zrepl_20230103_180839_000)"
Jan  3 18:48:40 zfs1 zrepl[30336]: [Backup][snapshot][Uv38$tUfl$p5sD$p5sD.bCOG]: callback channel is full, discarding snapshot update event
Jan  3 18:58:40 zfs1 zrepl[30336]: [Backup][snapshot][Uv38$tUfl$p5sD$p5sD.bCOG]: callback channel is full, discarding snapshot update event
Jan  3 19:00:09 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$siLN$GNFz$GNFz]: dry run send request failed err="validate send arguments: `To` invalid: dataset \"zpool1/store/zmt-backup/LongTerm@zfs-auto-snap_frequent-2023-01-03-1800\" does not exist"
Jan  3 19:00:09 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$siLN$GNFz$GNFz]: error computing size estimate step="zpool1/store/zmt-backup/LongTerm(@zrepl_20230103_175837_000 => @zfs-auto-snap_frequent-2023-01-03-1800)" filesystem="zpool1/store/zmt-backup/LongTerm" err="validate send arguments: `To` invalid: dataset \"zpool1/store/zmt-backup/LongTerm@zfs-auto-snap_frequent-2023-01-03-1800\" does not exist"
Jan  3 19:00:09 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$siLN$dsXD$dsXD]: dry run send request failed err="validate send arguments: cannot check whether filesystem \"zpool1/store/zmt-backup/LongTerm\" is encrypted: zfs get encryption enabled fs=\"zpool1/store/zmt-backup/LongTerm\": cannot get `encryption` property: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 19:00:09 zfs1 zrepl[30336]: [Backup][repl][Uv38$tUfl$siLN$dsXD$dsXD]: error computing size estimate filesystem="zpool1/store/zmt-backup/LongTerm" step="zpool1/store/zmt-backup/LongTerm(@zfs-auto-snap_hourly-2023-01-03-1717 => @zrepl_20230103_171839_000)" err="validate send arguments: cannot check whether filesystem \"zpool1/store/zmt-backup/LongTerm\" is encrypted: zfs get encryption enabled fs=\"zpool1/store/zmt-backup/LongTerm\": cannot get `encryption` property: zfs exited with error: signal: killed\nstderr:\n"
Jan  3 19:08:40 zfs1 zrepl[30336]: [Backup][snapshot][Uv38$tUfl$p5sD$p5sD.bCOG]: callback channel is full, discarding snapshot update event
Jan  3 19:18:40 zfs1 zrepl[30336]: [Backup][snapshot][Uv38$tUfl$p5sD$p5sD.bCOG]: callback channel is full, discarding snapshot update event

I changed from

replication:
  protection:
    initial: guarantee_incremental
    incremental: guarantee_incremental

to

replication:
  protection:
    initial: guarantee_resumability
    incremental: guarantee_resumability

After this I got this panic:

Jan  3 19:27:51 zfs1 zrepl[8373]: panic: impl error: "#zrepl_CURSOR_G_c8647f14c5e6e62f_J_Backup" would be destroyed because it is considered stale but it is part of of sendArgs=zfs.ZFSSendArgsValidated{
Jan  3 19:27:51 zfs1 zrepl[8373]:     ZFSSendArgsUnvalidated: zfs.ZFSSendArgsUnvalidated{
Jan  3 19:27:51 zfs1 zrepl[8373]:         FS:           "zpool1/store/zmt-backup/LongTerm/server-v5",
Jan  3 19:27:51 zfs1 zrepl[8373]:         From:         &zfs.ZFSSendArgVersion{RelName:"#zrepl_CURSOR_G_c8647f14c5e6e62f_J_Backup", GUID:0xc8647f14c5e6e62f},
Jan  3 19:27:51 zfs1 zrepl[8373]:         To:           &zfs.ZFSSendArgVersion{RelName:"@zfs-auto-snap_hourly-2023-01-03-0717", GUID:0x177845bf6975809b},
Jan  3 19:27:51 zfs1 zrepl[8373]:         ZFSSendFlags: zfs.ZFSSendFlags{
Jan  3 19:27:51 zfs1 zrepl[8373]:             Encrypted:        &nodefault.Bool{},
Jan  3 19:27:51 zfs1 zrepl[8373]:             Properties:       false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             BackupProperties: false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             Raw:              false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             LargeBlocks:      false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             Compressed:       false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             EmbeddedData:     false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             Saved:            false,
Jan  3 19:27:51 zfs1 zrepl[8373]:             ResumeToken:      "",
Jan  3 19:27:51 zfs1 zrepl[8373]:         },
Jan  3 19:27:51 zfs1 zrepl[8373]:     },
Jan  3 19:27:51 zfs1 zrepl[8373]:     FromVersion: &zfs.FilesystemVersion{
Jan  3 19:27:51 zfs1 zrepl[8373]:         Type:      "bookmark",
Jan  3 19:27:51 zfs1 zrepl[8373]:         Name:      "zrepl_CURSOR_G_c8647f14c5e6e62f_J_Backup",
Jan  3 19:27:51 zfs1 zrepl[8373]:         Guid:      0xc8647f14c5e6e62f,
Jan  3 19:27:51 zfs1 zrepl[8373]:         CreateTXG: 0x3230bb0,
Jan  3 19:27:51 zfs1 zrepl[8373]:         Creation:  time.Time{
Jan  3 19:27:51 zfs1 zrepl[8373]:             wall: 0x0,
Jan  3 19:27:51 zfs1 zrepl[8373]:             ext:  63808324117,
Jan  3 19:27:51 zfs1 zrepl[8373]:             loc:  &time.Location{
Jan  3 19:27:51 zfs1 zrepl[8373]:                 name: "Local",
Jan  3 19:27:51 zfs1 zrepl[8373]:                 zone: {
Jan  3 19:27:51 zfs1 zrepl[8373]:                     {name:"UTC", offset:0, isDST:false},
Jan  3 19:27:51 zfs1 zrepl[8373]:                 },
Jan  3 19:27:51 zfs1 zrepl[8373]:                 tx: {
Jan  3 19:27:51 zfs1 zrepl[8373]:                     {when:-576460752303423488, index:0x0, isstd:false, isutc:false},
Jan  3 19:27:51 zfs1 zrepl[8373]:                 },
Jan  3 19:27:51 zfs1 zrepl[8373]:                 extend:     "UTC0",
Jan  3 19:27:51 zfs1 zrepl[8373]:                 cacheStart: 9223372036854775807,
Jan  3 19:27:51 zfs1 zrepl[8373]:                 cacheEnd:   9223372036854775807,
Jan  3 19:27:51 zfs1 zrepl[8373]:                 cacheZone:  &time.zone{(CYCLIC REFERENCE)},
Jan  3 19:27:51 zfs1 zrepl[8373]:             },
Jan  3 19:27:51 zfs1 zrepl[8373]:         },
Jan  3 19:27:51 zfs1 zrepl[8373]:         UserRefs: zfs.OptionUint64{},
Jan  3 19:27:51 zfs1 zrepl[8373]:     },
Jan  3 19:27:51 zfs1 zrepl[8373]:     ToVersion: zfs.FilesystemVersion{
Jan  3 19:27:51 zfs1 zrepl[8373]:         Type:      "snapshot",
Jan  3 19:27:51 zfs1 zrepl[8373]:         Name:      "zfs-auto-snap_hourly-2023-01-03-0717",
Jan  3 19:27:51 zfs1 zrepl[8373]:         Guid:      0x177845bf6975809b,
Jan  3 19:27:51 zfs1 zrepl[8373]:         CreateTXG: 0x3231576,
Jan  3 19:27:51 zfs1 zrepl[8373]:         Creation:  time.Time{
Jan  3 19:27:51 zfs1 zrepl[8373]:             wall: 0x0,
Jan  3 19:27:51 zfs1 zrepl[8373]:             ext:  63808327030,
Jan  3 19:27:51 zfs1 zrepl[8373]:             loc:  &time.Location{
Jan  3 19:27:51 zfs1 zrepl[8373]:                 name: "Local",
Jan  3 19:27:51 zfs1 zrepl[8373]:                 zone: {
Jan  3 19:27:51 zfs1 zrepl[8373]:                     {name:"UTC", offset:0, isDST:false},
Jan  3 19:27:51 zfs1 zrepl[8373]:                 },
Jan  3 19:27:51 zfs1 zrepl[8373]:                 tx: {
Jan  3 19:27:51 zfs1 zrepl[8373]:                     {when:-576460752303423488, index:0x0, isstd:false, isutc:false},
Jan  3 19:27:51 zfs1 zrepl[8373]:                 },
Jan  3 19:27:51 zfs1 zrepl[8373]:                 extend:     "UTC0",
Jan  3 19:27:51 zfs1 zrepl[8373]:                 cacheStart: 9223372036854775807,
Jan  3 19:27:51 zfs1 zrepl[8373]:                 cacheEnd:   9223372036854775807,
Jan  3 19:27:51 zfs1 zrepl[8373]:                 cacheZone:  &time.zone{(CYCLIC REFERENCE)},
Jan  3 19:27:51 zfs1 zrepl[8373]:             },
Jan  3 19:27:51 zfs1 zrepl[8373]:         },
Jan  3 19:27:51 zfs1 zrepl[8373]:         UserRefs: zfs.OptionUint64{Value:0x0, Valid:true},
Jan  3 19:27:51 zfs1 zrepl[8373]:     },
Jan  3 19:27:51 zfs1 zrepl[8373]: }

My config is:

global:
  logging:
    # use syslog instead of stdout because it makes journald happy
    - type: syslog
      format: human
      level: warn

jobs:
  - name: Backup
    type: push
    filesystems: {
      "zpool1/store<": true
    }
    send:
      encrypted: false
      raw: false
    conflict_resolution:
      initial_replication: all
    replication:
      protection:
        initial:     guarantee_resumability
        #initial:     guarantee_nothing
        incremental: guarantee_resumability
        #incremental: guarantee_nothing
    connect:
      type: tcp
      address: "zfs2:1213"
    snapshotting:
      type: periodic
      prefix: zrepl_
      interval: 10m
    pruning:
      keep_sender:
        - type: regex
          negate: true
          regex: "^zrepl_.*"
        - type: grid
          grid: 1x1h(keep=all) | 23x1h | 14x1d | 4x1w
          regex: "^zrepl_.*"
      keep_receiver:
        - type: regex
          negate: true
          regex: "^zrepl_.*"
        - type: grid
          grid: 1x1h(keep=all) | 23x1h | 14x1d | 4x1w
          regex: "^zrepl_.*"

ggzengel commented Jan 3, 2023

zrepl version
client: zrepl version=v0.6.0 go=go1.19.2 GOOS=linux GOARCH=amd64 Compiler=gc

I still have zfs-auto-snapshot running and use zrepl to migrate hardware in the future.


problame commented Jan 4, 2023

Please provide the debug log output from roughly 5 seconds before the panic up to the panic.


ggzengel commented Jan 4, 2023

log.txt

problame added the bug label Jan 29, 2023

problame commented Jan 29, 2023

Sorry for not getting back to this earlier.

What's weird here is that zrepl sees the following abstractions before the send attempt starts:

Jan  4 09:51:23 zfs1 zrepl[17148]: [Backup][endpoint][Uv38$hW91$RRYM$RRYM.krk0.Dz3s.wX2L.2c6b.778b.juJD]: loaded step abstractions for filesystem abstractions="[tentative-replication-cursor-bookmark-v2 zpool1/store/zmt-backup/LongTerm#zrepl_CURSORTENTATIVE__G_9b923fe5d4c2cd60_J_Backup tentative-replication-cursor-bookmark-v2 zpool1/store/zmt-backup/LongTerm#zrepl_CURSORTENTATIVE__G_cf65c2737871e83f_J_Backup replication-cursor-bookmark-v2 zpool1/store/zmt-backup/LongTerm#zrepl_CURSOR_G_9b923fe5d4c2cd60_J_Backup step-hold \"zrepl_STEP_J_Backup\" on zpool1/store/zmt-backup/LongTerm@zfs-auto-snap_daily-2023-01-03-0747]" fs="zpool1/store/zmt-backup/LongTerm"

According to zrepl/endpoint/endpoint.go, lines 274 to 279 at 498af6b:

    destroyTypes := AbstractionTypeSet{
        AbstractionStepHold:                           true,
        AbstractionTentativeReplicationCursorBookmark: true,
    }
    abstractionsCacheSingleton.TryBatchDestroy(ctx, s.jobId, sendArgs.FS, destroyTypes, keep, check)
}()

it should only try to delete step holds and tentative replication cursor bookmarks.

However, as indicated by the panic, TryBatchDestroy decided to destroy the regular replication cursor #zrepl_CURSOR_G_9b923fe5d4c2cd60_J_Backup.

One of the tentative cursors is for the same snapshot as the replication cursor, namely #zrepl_CURSORTENTATIVE__G_9b923fe5d4c2cd60_J_Backup (same GUID as the regular cursor).

Trying to reproduce this locally by creating a fake CURSORTENTATIVE bookmark yields your panic.
So, I have a reproducer for this.

Next step will be to debug it.


problame commented Jan 29, 2023

In the meantime, deleting the tentative cursors should fix the crash for you.

However, only do it if you have a regular replication cursor for the same snapshot. Otherwise, incremental replication might break and you'd have to re-replicate the dataset.

To list replication cursors and tentative replication cursors, use

root@zrepl-1:/etc/zrepl# zrepl zfs-abstraction list --type replication-cursor-bookmark-v1,replication-cursor-bookmark-v2
replication-cursor-bookmark-v2 p1/s#zrepl_CURSOR_G_baac25f2cda22016_J_backup
root@zrepl-1:/etc/zrepl# zrepl zfs-abstraction list --type tentative-replication-cursor-bookmark-v2
tentative-replication-cursor-bookmark-v2 p1/s#zrepl_CURSORTENTATIVE__G_baac25f2cda22016_J_backup

Then compare, per filesystem, whether you have a regular replication cursor for each tentative cursor.
In the example above, that is the case, because there is both a

p1/s#zrepl_CURSORTENTATIVE__G_baac25f2cda22016_J_backup

and a

p1/s#zrepl_CURSOR_G_baac25f2cda22016_J_backup
                    ^ same GUID (baac25f2cda22016) as in the tentative cursor

If that is the case, you can delete the tentative cursor like so:

zfs destroy p1/s#zrepl_CURSORTENTATIVE__G_baac25f2cda22016_J_backup

After doing this for all filesystems, restart the zrepl daemon and retry replication.
It should work.
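
If you have many filesystems, a small helper can list the candidates. The following is a rough, self-contained sketch (not part of zrepl, written here for illustration) that shells out to zfs list and matches tentative cursors against regular cursors purely by the GUID embedded in the bookmark names shown above; double-check its output against zrepl zfs-abstraction list before destroying anything.

// tentative-with-regular.go: rough helper sketch (not part of zrepl).
// Lists tentative replication cursor bookmarks that have a regular replication
// cursor with the same GUID on the same filesystem, based purely on the
// zrepl_CURSOR_G_<guid>_J_<job> / zrepl_CURSORTENTATIVE__G_<guid>_J_<job>
// naming visible above. Verify candidates with `zrepl zfs-abstraction list`
// before running `zfs destroy` on them.
package main

import (
    "bufio"
    "bytes"
    "fmt"
    "os"
    "os/exec"
    "strings"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: tentative-with-regular <pool>")
        os.Exit(1)
    }
    out, err := exec.Command("zfs", "list", "-t", "bookmark", "-H", "-o", "name", "-r", os.Args[1]).Output()
    if err != nil {
        fmt.Fprintln(os.Stderr, "zfs list failed:", err)
        os.Exit(1)
    }

    regular := map[string]bool{}     // key: "<fs>|<guid>_J_<job>"
    tentative := map[string]string{} // same key -> full tentative bookmark name

    sc := bufio.NewScanner(bytes.NewReader(out))
    for sc.Scan() {
        name := strings.TrimSpace(sc.Text())
        fs, bm, ok := strings.Cut(name, "#")
        if !ok {
            continue
        }
        switch {
        case strings.HasPrefix(bm, "zrepl_CURSORTENTATIVE__G_"):
            tentative[fs+"|"+strings.TrimPrefix(bm, "zrepl_CURSORTENTATIVE__G_")] = name
        case strings.HasPrefix(bm, "zrepl_CURSOR_G_"):
            regular[fs+"|"+strings.TrimPrefix(bm, "zrepl_CURSOR_G_")] = true
        }
    }

    for key, name := range tentative {
        if regular[key] {
            fmt.Println(name) // tentative cursor that has a matching regular cursor
        }
    }
}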

problame changed the title from "panic: impl error" to "panic if tentative & regular replication cursors present for the same snapshot" Jan 29, 2023
problame added this to the 0.7: Snapshot Management milestone Jan 29, 2023
problame commented

So, the problem is that the check is overly cautious.
Specifically,

zrepl/endpoint/endpoint.go, lines 240 to 280 at 498af6b:

    keep := func(a Abstraction) (keep bool) {
        keep = false
        for _, k := range liveAbs {
            keep = keep || AbstractionEquals(a, k)
        }
        return keep
    }
    check := func(obsoleteAbs []Abstraction) {
        // last line of defense: check that we don't destroy the incremental `from` and `to`
        // if we did that, we might be about to blow away the last common filesystem version between sender and receiver
        mustLiveVersions := []zfs.FilesystemVersion{sendArgs.ToVersion}
        if sendArgs.FromVersion != nil {
            mustLiveVersions = append(mustLiveVersions, *sendArgs.FromVersion)
        }
        for _, staleVersion := range obsoleteAbs {
            for _, mustLiveVersion := range mustLiveVersions {
                isSendArg := zfs.FilesystemVersionEqualIdentity(mustLiveVersion, staleVersion.GetFilesystemVersion())
                stepHoldBasedGuaranteeStrategy := false
                k := replicationGuaranteeStrategy.Kind()
                switch k {
                case ReplicationGuaranteeKindResumability:
                    stepHoldBasedGuaranteeStrategy = true
                case ReplicationGuaranteeKindIncremental:
                case ReplicationGuaranteeKindNone:
                default:
                    panic(fmt.Sprintf("this is supposed to be an exhaustive match, got %v", k))
                }
                isSnapshot := mustLiveVersion.IsSnapshot()
                if isSendArg && (!isSnapshot || stepHoldBasedGuaranteeStrategy) {
                    panic(fmt.Sprintf("impl error: %q would be destroyed because it is considered stale but it is part of of sendArgs=%s", mustLiveVersion.String(), pretty.Sprint(sendArgs)))
                }
            }
        }
    }
    destroyTypes := AbstractionTypeSet{
        AbstractionStepHold:                           true,
        AbstractionTentativeReplicationCursorBookmark: true,
    }
    abstractionsCacheSingleton.TryBatchDestroy(ctx, s.jobId, sendArgs.FS, destroyTypes, keep, check)
}()

These lines ensure that the underlying snapshot or bookmark of any abstraction about to be destroyed is not the send step's from or to version.

But that's actually too conservative: if a replication cursor bookmark is used as from and there is also a tentative cursor for the same snapshot (same GUID), then FilesystemVersionEqualIdentity() will return true for that tentative cursor, even though the send would succeed if we deleted it.

Now, the problem is: the replication planner could just as easily use the tentative cursor instead of the replication cursor. In that situation, the code snippet above would still try to delete the tentative cursor, and the check would prevent that from happening (good!).

So, I guess to solve this properly, we need to shift more of the logic into the keep filter, and remove the check filter. keep needs to include the from and to args if they are zrepl abstractions, regardless of whether the type of abstraction is in destroyTypes or not.
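
To make that idea concrete, here is a minimal, self-contained sketch of such a keep filter. The types and names are invented for illustration (this is not zrepl's actual API), and matching by full path is just one possible way to identify the from/to abstractions.

// Toy sketch (invented types, not zrepl code) of a keep filter that, in
// addition to the live abstractions, always retains anything that is the
// send step's `from` or `to`, regardless of the abstraction's type.
package main

import "fmt"

// abstraction stands in for a zrepl ZFS abstraction, identified here only by
// its full bookmark/snapshot path, e.g. "p1/s#zrepl_CURSOR_G_baac25f2cda22016_J_backup".
type abstraction struct{ path string }

func makeKeep(liveAbs []abstraction, fromPath, toPath string) func(abstraction) bool {
    return func(a abstraction) bool {
        for _, l := range liveAbs {
            if a.path == l.path {
                return true
            }
        }
        // keep the send step's from/to even if their type is in destroyTypes
        return a.path == fromPath || a.path == toPath
    }
}

func main() {
    from := "p1/s#zrepl_CURSOR_G_baac25f2cda22016_J_backup"
    stale := "p1/s#zrepl_CURSORTENTATIVE__G_baac25f2cda22016_J_backup"
    keep := makeKeep(nil, from, "p1/s@zrepl_20230103_182839_000")

    fmt.Println(keep(abstraction{from}))  // true: the cursor used as `from` is never destroyed
    fmt.Println(keep(abstraction{stale})) // false: the stale tentative cursor may be batch-destroyed
}

Matching on the full path rather than on the GUID is what lets such a filter tell the regular cursor apart from a tentative cursor for the same snapshot.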


problame commented Jul 4, 2023

@ggzengel sorry for the incredibly late turnaround.

The fix is done: #714

problame added a commit that referenced this issue Jul 4, 2023
…'t use it (#714)

Before this PR, we would panic in the `check` phase of `endpoint.Send()`'s `TryBatchDestroy` call in the following cases: the current protection strategy does NOT produce a tentative replication cursor AND
  * `FromVersion` is a tentative cursor bookmark
  * `FromVersion` is a snapshot, and there exists a tentative cursor bookmark for that snapshot
  * `FromVersion` is a bookmark != tentative cursor bookmark, but there exists a tentative cursor bookmark for the same snapshot as the `FromVersion` bookmark

In those cases, the `check` concluded that we would delete `FromVersion`.
It came to that conclusion because the tentative cursor isn't part of `obsoleteAbs` if the protection strategy doesn't produce a tentative replication cursor.

The scenarios above can happen if the user changes the protection strategy from one that produces a tentative cursor to one that does not, while a tentative replication cursor is still on disk.
The workaround was to rename the tentative cursor.

In all cases above, `TryBatchDestroy` would have destroyed the tentative cursor.

In case 1, that would have failed the `Send` step and potentially broken replication if the cursor is the last common bookmark. Here the `check`'s conclusion was correct.

In cases 2 and 3, deleting the tentative cursor would have been fine because `FromVersion` was a different entity than the tentative cursor. So, destroying the tentative cursor would have been the right call.

The solution in this PR is as follows:
* add the `FromVersion` to the `liveAbs` set of live abstractions
* rewrite the `check` closure to identify the concrete ZFS object by its full dataset path (`fullpath`) instead of via `zfs.FilesystemVersionEqualIdentity`, which matches by GUID only.
  * Holds have no dataset path and are not the `FromVersion` in any case, so disregard them.

fixes #666
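
To illustrate the GUID-based vs. path-based identity that the commit message describes, here is a tiny self-contained example using the bookmark names from the panic above. The types are toys invented for illustration, not zrepl's actual zfs package.

// Toy illustration of why the old GUID-only check conflated the `from`
// bookmark with a tentative cursor for the same snapshot, while a
// full-path comparison does not.
package main

import "fmt"

// version is a stand-in for zfs.FilesystemVersion: a bookmark with a name
// and the GUID it inherited from its snapshot.
type version struct {
    fs   string
    name string // without the leading '#'
    guid uint64
}

func (v version) fullPath() string { return v.fs + "#" + v.name }

func main() {
    fs := "zpool1/store/zmt-backup/LongTerm/server-v5"
    from := version{fs, "zrepl_CURSOR_G_c8647f14c5e6e62f_J_Backup", 0xc8647f14c5e6e62f}
    tentative := version{fs, "zrepl_CURSORTENTATIVE__G_c8647f14c5e6e62f_J_Backup", 0xc8647f14c5e6e62f}

    // GUID-only identity (the old check): the tentative cursor looks identical
    // to the `from` bookmark, so destroying it appears to destroy `from` -> panic.
    fmt.Println("guid identity:", from.guid == tentative.guid) // true

    // Full-path identity (the fixed check): two clearly different bookmarks,
    // so the tentative cursor may safely be destroyed.
    fmt.Println("path identity:", from.fullPath() == tentative.fullPath()) // false
}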