endpoint: refactor, fix stale holds on initial replication, holds release subcmds #293

problame · 2020-03-26T23:00:06Z

endpoint: refactor, fix stale holds on initial replication, holds release subcmds

- endpoint abstractions now share an interface `Abstraction`
- pkg endpoint now has a query facility (`ListAbstractions`) which is used to find on-disk
  - step holds and bookmarks
  - replication cursors (v1, v2)
  - last-received-holds
- the `zrepl holds list` command consumes `endpoint.ListAbstractions`
- the new `zrepl holds release-{all,stale}` commands can be used to remove abstractions of package endpoint

supersedes #282
fixes #280
fixes #278
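
To make the idea of a shared interface plus a struct-described query concrete, here is a minimal sketch. It is not the code from this PR: the type names, the `record` helper, the fields of `Query`, and the in-memory candidate list are all assumptions for illustration. The real `ListAbstractions` gets its candidates from the ZFS CLI.

```go
package main

import "fmt"

// AbstractionType names a kind of zrepl-managed ZFS object.
// The constant names below are illustrative, not taken from zrepl.
type AbstractionType string

const (
	StepHold          AbstractionType = "step-hold"
	StepBookmark      AbstractionType = "step-bookmark"
	ReplicationCursor AbstractionType = "replication-cursor"
	LastReceivedHold  AbstractionType = "last-received-hold"
)

// Abstraction is a hypothetical, reduced form of the shared interface.
type Abstraction interface {
	Type() AbstractionType
	Filesystem() string
	CreateTXG() uint64
}

// record is a trivial in-memory implementation used only in this sketch.
type record struct {
	typ       AbstractionType
	fs        string
	createTXG uint64
}

func (r record) Type() AbstractionType { return r.typ }
func (r record) Filesystem() string    { return r.fs }
func (r record) CreateTXG() uint64     { return r.createTXG }

// Query describes which abstractions the caller wants (assumed shape).
type Query struct {
	Filesystem string                   // empty matches every filesystem
	Types      map[AbstractionType]bool // set of wanted types
}

// ListAbstractions filters an in-memory candidate list according to q.
func ListAbstractions(candidates []Abstraction, q Query) []Abstraction {
	var out []Abstraction
	for _, a := range candidates {
		if q.Filesystem != "" && a.Filesystem() != q.Filesystem {
			continue
		}
		if !q.Types[a.Type()] {
			continue
		}
		out = append(out, a)
	}
	return out
}

func main() {
	onDisk := []Abstraction{
		record{StepHold, "pool/a", 10},
		record{ReplicationCursor, "pool/a", 8},
		record{LastReceivedHold, "pool/b", 12},
	}
	q := Query{
		Filesystem: "pool/a",
		Types:      map[AbstractionType]bool{StepHold: true, StepBookmark: true},
	}
	for _, a := range ListAbstractions(onDisk, q) {
		fmt.Println(a.Type(), a.Filesystem(), a.CreateTXG())
	}
}
```

Because every caller describes what it wants in one struct, any optimization of the lookup only has to be implemented once, behind `ListAbstractions`.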

problame · 2020-03-26T23:08:52Z

@InsanePrawn would you mind testing this commit?

problame · 2020-03-26T23:09:03Z

(Also, any code review is appreciated!)

…ease subcmds - endpoint abstractions now share an interface `Abstraction` - pkg endpoint now has a query facitilty (`ListAbstractions`) which is used to find on-disk - step holds and bookmarks - replication cursors (v1, v2) - last-received-holds - the `zrepl holds list` command consumes endpoint.ListAbstractions - the new `zrepl holds release-{all,stale}` commands can be used to remove abstractions of package endpoint Co-authored-by: InsanePrawn <insane.prawny@gmail.com> supersedes #282 fixes #280 fixes #278


problame · 2020-03-27T20:16:50Z

Run this on my setup for a day, see if there are any memory leaks we didn't spot.
ListStale duplicates the staleness criteria for each abstraction. That should be refactored.

JMoVS · 2020-03-28T10:09:35Z

I ran it and cleared some holds (incidentally while zrepl was running), it then complained that it couldn't replicate because the last-received-hold couldn't be moved because it has to be a snapshot.

I then cleared all holds, installed the master branch (which worked again), then tried the branch again. Even then, it currently always fails with:

server error: cannot move last-received-hold: last-received-hold: target must be a snapshot: seaOfTime/encRoot/SergeantPepperSenior/ocean/Users@zrepl_20200328_045822_000

like errors for every dataset.

The master branch works though

problame · 2020-03-28T12:47:19Z

Which commit were you running? I pushed 8755847 to this branch yesterday evening which, despite its wrong commit message, fixes the problem with the last-received-hold. I'd appreciate it if you could once again compare the most recent version of this branch and continue testing it.

JMoVS · 2020-03-29T08:11:55Z

Weird, I had thought that I tried that. Nevertheless, I tried again and it works now. It's currently nowhere near as fast as the non-hold-based version, though.

problame · 2020-03-29T08:20:17Z

Please let it do a few runs, then send the debug (!) logs. We are logging execution time now, I'll run some stats and see what's the bottleneck in your case.

JMoVS · 2020-03-29T08:28:02Z

Is setting the config option log level to "debug" sufficient?

problame · 2020-03-29T08:35:18Z

Yes, just make sure the thing you are logging to doesn't throw away log messages, that would skew the stats. (I suppose it doesn't)


problame · 2020-03-29T18:13:18Z

~~[x] only call zfs.ZFSHolds if we are actually requesting hold-based abstractions~~ we already do that
only do `zfs list -t bookmark` or `zfs list -t snapshot`, not both, if the query contains only bookmark or only hold extractors
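
A rough sketch of that second item, assuming a hypothetical `Query` with two boolean fields: holds live on snapshots, so hold-based abstractions need `-t snapshot`, while bookmark-based ones only need `-t bookmark`. The helper names and the exact argument list are assumptions, not the code in this branch.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is a hypothetical, reduced description of what the caller asks for.
type Query struct {
	BookmarkBased bool // e.g. step bookmarks, replication cursors
	HoldBased     bool // e.g. step holds, last-received-holds
}

// zfsListTypes returns only the dataset types the query actually needs.
func zfsListTypes(q Query) []string {
	var types []string
	if q.BookmarkBased {
		types = append(types, "bookmark")
	}
	if q.HoldBased {
		types = append(types, "snapshot")
	}
	return types
}

// zfsListArgs builds the argument list for a single `zfs list` call,
// or nil if the query needs no listing at all.
func zfsListArgs(q Query, fs string) []string {
	types := zfsListTypes(q)
	if len(types) == 0 {
		return nil
	}
	return []string{"list", "-H", "-t", strings.Join(types, ","), "-d", "1", fs}
}

func main() {
	fmt.Println(zfsListArgs(Query{BookmarkBased: true}, "pool/a"))
	fmt.Println(zfsListArgs(Query{BookmarkBased: true, HoldBased: true}, "pool/a"))
}
```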


problame · 2020-04-05T13:42:59Z

Use `userrefs` in ZFSListFilesystemVersions and only visit those snapshots with `userrefs` > 0
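
As a sketch of why this helps: `userrefs` is the number of user holds on a snapshot, so a snapshot with `userrefs` of 0 cannot carry a zrepl hold and `zfs holds` does not need to be run for it. The struct below is a simplified stand-in and `snapshotsWithHolds` is a hypothetical helper, not the function in this branch.

```go
package main

import "fmt"

// FilesystemVersion is a simplified stand-in for a snapshot or bookmark.
type FilesystemVersion struct {
	Name       string
	IsSnapshot bool
	UserRefs   uint64 // value of the ZFS `userrefs` property
}

// snapshotsWithHolds keeps only the snapshots that have at least one hold,
// i.e. the only ones for which `zfs holds` could return anything.
func snapshotsWithHolds(versions []FilesystemVersion) []FilesystemVersion {
	var out []FilesystemVersion
	for _, v := range versions {
		if v.IsSnapshot && v.UserRefs > 0 {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	versions := []FilesystemVersion{
		{Name: "pool/a@s1", IsSnapshot: true, UserRefs: 0},
		{Name: "pool/a@s2", IsSnapshot: true, UserRefs: 2},
		{Name: "pool/a#b1", IsSnapshot: false, UserRefs: 0},
	}
	for _, v := range snapshotsWithHolds(versions) {
		fmt.Println("would run zfs holds on", v.Name)
	}
}
```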

problame · 2020-04-05T15:14:51Z

zrepl holds release-stale should consider the replication cursors if available as an Until bound by default (and maybe have an option to disable it).
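
A minimal sketch of what such an Until bound could look like. The assumptions here are mine: step holds are compared by `createtxg`, the bound is exclusive, and without a replication cursor nothing is treated as stale. `StepHold` and `staleStepHolds` are hypothetical names.

```go
package main

import "fmt"

// StepHold is a simplified stand-in for a step hold on a snapshot.
type StepHold struct {
	Snapshot  string
	CreateTXG uint64
}

// staleStepHolds returns the holds whose CreateTXG lies strictly below the
// replication cursor's createtxg (the Until bound). If no cursor is known
// (until == nil), this sketch conservatively reports nothing as stale.
func staleStepHolds(holds []StepHold, until *uint64) []StepHold {
	if until == nil {
		return nil
	}
	var stale []StepHold
	for _, h := range holds {
		if h.CreateTXG < *until {
			stale = append(stale, h)
		}
	}
	return stale
}

func main() {
	holds := []StepHold{
		{Snapshot: "pool/a@s1", CreateTXG: 5},
		{Snapshot: "pool/a@s2", CreateTXG: 9},
		{Snapshot: "pool/a@s3", CreateTXG: 14},
	}
	cursor := uint64(9)
	for _, h := range staleStepHolds(holds, &cursor) {
		fmt.Println("stale:", h.Snapshot)
	}
}
```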


@JMoVS

…olds release subcmds, more efficient ZFS queries

The motivation for this refactoring is based on two independent issues:

- @JMoVS found that the changes merged as part of #259 slowed his OS X based installation down significantly. Analysis of the zfs command logging introduced in #296 showed that `zfs holds` took most of the execution time, and they pointed out that not all of those `zfs holds` invocations were actually necessary. I.e.: zrepl was inefficient about retrieving information from ZFS.
- @InsanePrawn found that failures on initial replication would lead to step holds accumulating on the sending side, i.e. they would never be cleaned up in the HintMostRecentCommonAncestor RPC handler. That was because we only sent that RPC if there was a most recent common ancestor detected during replication planning. @InsanePrawn prototyped an implementation of a `zrepl holds release` command to mitigate the situation. As part of that development work and back-and-forth with @problame, it became evident that the abstractions that #259 built on top of zfs in package endpoint (step holds, replication cursor, last-received-hold) were not well-represented for re-use in the `zrepl holds release` subcommand.

This commit refactors package endpoint to address both of these issues:

- endpoint abstractions now share an interface `Abstraction` that, among other things, provides a uniform `Destroy()` method. However, that method should not be called directly; instead, the package-level `BatchDestroy` function should be used in order to allow for a migration to zfs channel programs in the future.
- endpoint now has a query facility (`ListAbstractions`) which is used to find on-disk
  - step holds and bookmarks
  - replication cursors (v1, v2)
  - last-received-holds

  By describing the query in a struct, we can centralize the retrieval of information via the ZFS CLI and only have to be clever once. We are "clever" in the following ways:
  - When asking for hold-based abstractions, we only run `zfs holds` on snapshots that have `userrefs` > 0
    - To support this functionality, add field `UserRefs` to zfs.FilesystemVersion and retrieve it anywhere we retrieve zfs.FilesystemVersion from ZFS.
  - When asking only for bookmark-based abstractions, we only run `zfs list -t bookmark`, not with snapshots.
  - Currently unused (except for CLI) per-filesystem concurrent lookup
  - Option to only include abstractions with CreateTXG in a specified range
- refactor `endpoint`'s various ZFS info retrieval methods to use `ListAbstractions`
- change the `zrepl holds list` command to consume endpoint.ListAbstractions
- Add a `ListStale` method which, given a query template, lists stale holds and bookmarks.
  - it uses replication cursors and has different modes
- the new `zrepl holds release-{all,stale}` commands can be used to remove abstractions of package endpoint
- Adjust HintMostRecentCommonAncestor RPC for stale-holds cleanup:
  - send it also if no most recent common ancestor exists between sender and receiver
  - have the sender clean up its abstractions when it receives the RPC with no most recent common ancestor, using `ListStale`
  - Due to changed semantics, bump the protocol version.
- Adjust HintMostRecentCommonAncestor RPC for performance problems encountered by @JMoVS
  - by default, per (job,fs)-combination, only consider cleaning step holds in the createtxg range `[last replication cursor,conservatively-estimated-receive-side-version)`
  - this behavior ensures resumability at cost proportional to the time that replication was down
  - however, as explained in a comment, we might leak holds if the zrepl daemon stops running
  - that trade-off is acceptable because in the presumably rare cases where this might happen the user has two tools at their hand:
    - Tool 1: run `zrepl holds release-stale`
    - Tool 2: use env var `ZREPL_ENDPOINT_SENDER_HINT_MOST_RECENT_STEP_HOLD_CLEANUP_MODE` to adjust the lower bound of the createtxg range (search for it in the code). The env var can also be used to disable hold-cleanup on the send-side entirely.

supersedes closes #293
supersedes closes #282
fixes #280
fixes #278

Additionally, we fixed a couple of bugs:

- zfs: fix half-nil error reporting of dataset-does-not-exist for ZFSListChan and ZFSBookmark
- endpoint: Sender's `HintMostRecentCommonAncestor` handler would not check whether access to the specified filesystem was allowed.

problame · 2020-04-07T17:00:52Z

final revision of this PR will be reviewed here #300

@JMoVS

…fs-abstractions subcmd, more efficient ZFS queries

The motivation for this refactoring is based on two independent issues:

- @JMoVS found that the changes merged as part of #259 slowed his OS X based installation down significantly. Analysis of the zfs command logging introduced in #296 showed that `zfs holds` took most of the execution time, and they pointed out that not all of those `zfs holds` invocations were actually necessary. I.e.: zrepl was inefficient about retrieving information from ZFS.
- @InsanePrawn found that failures on initial replication would lead to step holds accumulating on the sending side, i.e. they would never be cleaned up in the HintMostRecentCommonAncestor RPC handler. That was because we only sent that RPC if there was a most recent common ancestor detected during replication planning. @InsanePrawn prototyped an implementation of a `zrepl zfs-abstractions release` command to mitigate the situation. As part of that development work and back-and-forth with @problame, it became evident that the abstractions that #259 built on top of zfs in package endpoint (step holds, replication cursor, last-received-hold) were not well-represented for re-use in the `zrepl zfs-abstractions release` subcommand prototype.

This commit refactors package endpoint to address both of these issues:

- endpoint abstractions now share an interface `Abstraction` that, among other things, provides a uniform `Destroy()` method. However, that method should not be called directly; instead, the package-level `BatchDestroy` function should be used in order to allow for a migration to zfs channel programs in the future.
- endpoint now has a query facility (`ListAbstractions`) which is used to find on-disk
  - step holds and bookmarks
  - replication cursors (v1, v2)
  - last-received-holds

  By describing the query in a struct, we can centralize the retrieval of information via the ZFS CLI and only have to be clever once. We are "clever" in the following ways:
  - When asking for hold-based abstractions, we only run `zfs holds` on snapshots that have `userrefs` > 0
    - To support this functionality, add field `UserRefs` to zfs.FilesystemVersion and retrieve it anywhere we retrieve zfs.FilesystemVersion from ZFS.
  - When asking only for bookmark-based abstractions, we only run `zfs list -t bookmark`, not with snapshots.
  - Currently unused (except for CLI) per-filesystem concurrent lookup
  - Option to only include abstractions with CreateTXG in a specified range
- refactor `endpoint`'s various ZFS info retrieval methods to use `ListAbstractions`
- rename the `zrepl holds list` command to `zrepl zfs-abstractions list`
- make `zrepl zfs-abstractions list` consume endpoint.ListAbstractions
- Add a `ListStale` method which, given a query template, lists stale holds and bookmarks.
  - it uses replication cursors and has different modes
- the new `zrepl zfs-abstractions release-{all,stale}` commands can be used to remove abstractions of package endpoint
- Adjust HintMostRecentCommonAncestor RPC for stale-holds cleanup:
  - send it also if no most recent common ancestor exists between sender and receiver
  - have the sender clean up its abstractions when it receives the RPC with no most recent common ancestor, using `ListStale`
  - Due to changed semantics, bump the protocol version.
- Adjust HintMostRecentCommonAncestor RPC for performance problems encountered by @JMoVS
  - by default, per (job,fs)-combination, only consider cleaning step holds in the createtxg range `[last replication cursor,conservatively-estimated-receive-side-version)`
  - this behavior ensures resumability at cost proportional to the time that replication was down
  - however, as explained in a comment, we might leak holds if the zrepl daemon stops running
  - that trade-off is acceptable because in the presumably rare cases where this might happen the user has two tools at their hand:
    - Tool 1: run `zrepl zfs-abstractions release-stale`
    - Tool 2: use env var `ZREPL_ENDPOINT_SENDER_HINT_MOST_RECENT_STEP_HOLD_CLEANUP_MODE` to adjust the lower bound of the createtxg range (search for it in the code). The env var can also be used to disable hold-cleanup on the send-side entirely.

supersedes closes #293
supersedes closes #282
fixes #280
fixes #278

Additionally, we fixed a couple of bugs:

- zfs: fix half-nil error reporting of dataset-does-not-exist for ZFSListChan and ZFSBookmark
- endpoint: Sender's `HintMostRecentCommonAncestor` handler would not check whether access to the specified filesystem was allowed.
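
The point of routing destruction through a package-level `BatchDestroy` rather than calling `Destroy()` directly can be sketched as follows. This is an illustration only: the reduced `Abstraction` interface, the `DestroyResult` type, the slice return value, and the `fakeHold` helper are assumptions, and the real signature in package endpoint may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Abstraction is a hypothetical, reduced form of the shared interface.
type Abstraction interface {
	Name() string
	Destroy() error
}

// DestroyResult pairs an abstraction with the outcome of destroying it.
type DestroyResult struct {
	Abstraction Abstraction
	Err         error
}

// BatchDestroy is the single entry point callers use. Today it simply loops;
// keeping callers on this function would let the loop be replaced by one
// batched operation later without touching them.
func BatchDestroy(abs []Abstraction) []DestroyResult {
	results := make([]DestroyResult, 0, len(abs))
	for _, a := range abs {
		results = append(results, DestroyResult{Abstraction: a, Err: a.Destroy()})
	}
	return results
}

// fakeHold is an in-memory stand-in used to exercise BatchDestroy.
type fakeHold struct {
	name string
	busy bool
}

func (h fakeHold) Name() string { return h.name }

func (h fakeHold) Destroy() error {
	if h.busy {
		return errors.New("dataset is busy")
	}
	return nil
}

func main() {
	abs := []Abstraction{
		fakeHold{name: "pool/a@s1", busy: false},
		fakeHold{name: "pool/a@s2", busy: true},
	}
	for _, r := range BatchDestroy(abs) {
		if r.Err != nil {
			fmt.Println("failed:", r.Abstraction.Name(), r.Err)
			continue
		}
		fmt.Println("destroyed:", r.Abstraction.Name())
	}
}
```

Reporting one result per abstraction lets a CLI command continue past individual failures and print them all at the end.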

problame force-pushed the problame/holds-release-and-hold-leak-fix-v2 branch from d026b62 to 3b74b0c Compare March 26, 2020 23:04

problame mentioned this pull request Mar 26, 2020

[wip] holds release subcmd #282

Closed

problame added 2 commits March 27, 2020 00:12

endpoint: add filesystem path validation for HintMostRecentCommonAnce…

4a4ec4d

…stor and SendCompleted RPCs

problame force-pushed the problame/holds-release-and-hold-leak-fix-v2 branch 2 times, most recently from 8b74e9c to 5749103 Compare March 26, 2020 23:25

problame mentioned this pull request Mar 26, 2020

option to convert step-holds to step-bookmarks in the pruner when it tries to destroy a step-held snapshot #288

Open

build: go1.14 + address tlsconf deprecation notice

47d7bba

fixes #286

problame force-pushed the problame/holds-release-and-hold-leak-fix-v2 branch from 5749103 to 47d7bba Compare March 26, 2020 23:29

zfs: introduce pkg zfs/zfscmd for command logging, status, prometheus…

4e13ea4

… metrics refs #196

problame mentioned this pull request Mar 27, 2020

zfs command logging, status, prometheus metrics #296

Merged

problame added 3 commits March 27, 2020 17:16

SQUASH THIS MERGE branch 'problame/zfs-command-logging-and-status' in…

586a4ff

…to problame/holds-release-and-hold-leak-fix-v2

fix move replication cursor

8755847

endpoint: concurrent queries

0b44e25

problame added 6 commits March 29, 2020 14:01

zfs: introduce pkg zfs/zfscmd for command logging, status, prometheus…

815b432

… metrics refs #196

WIP SQUASH MERGE Merge branch 'problame/zfs-command-logging-and-statu…

8bfaba1

…s' into problame/holds-release-and-hold-leak-fix-v2

zfs: introduce pkg zfs/zfscmd for command logging, status, prometheus…

deeca76

… metrics refs #196

Merge branch 'problame/zfs-command-logging-and-status' into problame/…

bc291e6

…holds-release-and-hold-leak-fix-v2

zfs: introduce pkg zfs/zfscmd for command logging, status, prometheus…

bab4240

… metrics refs #196

Merge branch 'problame/zfs-command-logging-and-status' into problame/…

d09ca24

…holds-release-and-hold-leak-fix-v2

problame added 4 commits March 30, 2020 00:57

remove outdated FIXME

4ecbb32

endpoint: environment variable for disabling cleanup of stale step ho…

8c27e57

…lds on HintMostRecentCommonAncestor

FIXUP typo in environment variable

86698c2

range-based createtxg queries

e82aea5

problame added 16 commits April 5, 2020 19:19

zfs: userrefs, platformtests for ListFilesystemVersions and ListMappi…

b16a9ed

…ng (likely needs fixup from next commit)

endpoint: zfs abstraction: use new api (fixup)

f703933

fixup zfs changes

568a112

endpoint abstr: use userrefs to only issue zfs holds if there is at…

8cab6e9

… least one hold

envconst.Var

4e0574e

fixup e82aea5: gofmt

05852bd

zfs changes platformtest

4f255ce

endpoint: step hold cleanup based on replication cursor bookmark

a02b4c1

Merge branch 'problame/zfs-command-logging-and-status' into problame/…

1fc4e62

…holds-release-and-hold-leak-fix-v2

range bounds fixup

2c15db9

zfs changes fixup

da8b168

consider replication cursor when determining stale step-holds and boo…

4fd369b

…kmarks

endpoint: rename abstraction super-"classes"

7061282

zfs: fix half-nil error reporting of dataset-does-not-exist for ZFSLi…

ac2eb9f

…stChan and ZFSBookmark

endpoint: fix ListStale for step-* abstractions by using replication …

f5f9421

…cursors and falling back to step holds for initial replication

endpoing: better log message for v1 replication cursors

515ddb8

problame mentioned this pull request Apr 7, 2020

endpoint: refactor, fix stale holds on initial replication failure, holds release subcmds, more efficient ZFS queries #300

Merged

problame closed this Apr 7, 2020
