
Config questions, and taking over existing replicas #337

Closed · candlerb opened this issue Jun 20, 2020 · 8 comments
@candlerb (Contributor)

(Sorry if this is not the right place to post, but I couldn't find a linked forum/community group)

I am looking to migrate from an existing ZFS replication tool (syncoid) to zrepl, but after a first pass of reading the docs I still had some questions.

  1. In the case of backing up multiple hosts, where each host has a push job, is it OK for them all to communicate with the same sink job? And do the push job names have to be distinct, e.g. "host1_backup", "host2_backup" etc?
    • I think I am able to answer this after re-reading the docs

    • At https://zrepl.github.io/v0.3.0-rc1/configuration/overview.html#n-push-jobs-to-1-sink it says:

      It is thus safe to push to one sink job with different client identities.

      Then I had to dig further: "client identity" is transport-specific. e.g. if using the TCP transport then it's the originating IP address mapped to an identity via a configuration table in the sink. (See the configuration sketch after this list.)

    • It's still not immediately clear if the push job names on different push hosts must be distinct. If the job names are only of local significance (i.e. not sent over the wire) then they could have the same job name.

  2. Can a zrepl sink take over an existing replica, copied by another tool? Or does it have to do the first replication from scratch? I will experiment (below).
  3. Can you strip out path components from the source before appending them to root_fs?
    • It appears not: here it says:

      ZFS filesystems are received to $root_fs/$client_identity/$source_path

      and I don't see a way in a push job to strip the path prefix.
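
For concreteness, here is roughly how I picture the sink side from the transport docs: one sink job shared by several clients, with the originating-IP-to-identity mapping in the serve section (the IPs, port and names below are made-up examples):

jobs:
  - type: sink
    name: backup_sink
    root_fs: "storage1/backup"
    serve:
      type: tcp
      listen: ":8888"
      clients:
        "192.0.2.11": "nuc1"   # originating IP -> client identity
        "192.0.2.12": "nuc2"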

Let me give a specific example. I am currently doing syncoid push jobs to replicate lxd containers like this:

/usr/local/bin/syncoid --sendoptions="Lce" --recvoptions="o compression=lz4" --quiet -r --skip-parent \
   zfs/lxd/containers root@storage1:storage1/backup/containers

At the source side (host nuc1), I have datasets like this:

zfs/lxd
zfs/lxd/containers
zfs/lxd/containers/builder
zfs/lxd/containers/ca
zfs/lxd/containers/cache1
...etc

and snapshots like this:

...
zfs/lxd/containers/builder@autosnap_2020-06-20_10:00:08_hourly
zfs/lxd/containers/builder@autosnap_2020-06-20_11:00:07_hourly
zfs/lxd/containers/builder@syncoid_nuc1_2020-06-20:11:59:39
zfs/lxd/containers/builder@autosnap_2020-06-20_12:00:09_hourly
zfs/lxd/containers/builder@autosnap_2020-06-20_13:00:07_hourly
zfs/lxd/containers/builder@autosnap_2020-06-20_14:00:08_hourly
zfs/lxd/containers/builder@autosnap_2020-06-20_15:00:07_hourly
zfs/lxd/containers/builder@autosnap_2020-06-20_16:00:09_hourly
...

The @syncoid... snapshot is effectively a bookmark, and the others are just periodic snapshots, which are also replicated.

On another source (host nuc2), I have datasets like this:

zfs/lxd
zfs/lxd/containers
zfs/lxd/containers/apt-cacher
zfs/lxd/containers/cache2
...etc

At the destination side I have:

storage1/backup/containers
storage1/backup/containers/apt-cacher
storage1/backup/containers/builder
storage1/backup/containers/ca
storage1/backup/containers/cache1
storage1/backup/containers/cache2
...

That is, the containers from both hosts are replicated into the same target parent dataset - the idea being that I can move containers between hosts without ending up with two backups.

Also, note that the original path prefix "zfs/lxd" is not present in the destination.

As far as I can tell, if I make a zrepl sink job with root_fs "storage1/backup", then on replication I will be forced to deliver to

storage1/backup/nuc1/zfs/lxd/containers/builder
^^^^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^
    root_fs    cli_id      source dataset

Is that correct?

This is not too big a deal: I can rename all my existing replica datasets to match the new schema. Also, if a container moves, I can do a corresponding rename on the backup server.

After this renaming, I'll test whether zrepl is able to pick up the existing target dataset and apply further incremental replication, or whether it's going to insist on re-replicating from scratch. I will report back here - I think it's a use case which is worth documenting.

@candlerb (Contributor, Author)

Supplementary question: if you want to have sinks with different root_fs for different clients, do you have to make different sink jobs listening on different ports?

Or can you make multiple sink jobs which listen on the same port, and the client identity is used to bind to the correct sink job?

@candlerb (Contributor, Author)

I renamed all the existing destination snapshots to new names. At first it didn't appear they could be taken over:

        Problem: one or more of the filesystems encountered errors
        Progress: [=>-------------------------------------------------] 22.2 MiB / 767.9 MiB @ 0 B/s
          zfs/lxd/containers/builder                            STEP-ERROR (step 0/6, 24.8 KiB/2.2 MiB)
            server error: zfs exited with error: exit status 1
            stderr:
            cannot receive incremental stream: destination storage1/backup/nuc1/zfs/lxd/containers/builder has been modified
            since most recent snapshot

However I did a "zfs rollback " on each filesystem, and after that zrepl was happy - performing incremental updates on top of what was already there. Success!

It would be nice to rename the sanoid snapshots to match the zrepl naming scheme:

storage1/backup/nuc1/zfs/lxd/containers/builder@autosnap_2020-06-20_16:00:09_hourly
storage1/backup/nuc1/zfs/lxd/containers/builder@zrepl_20200620_161754_000

The date_time is obvious, but I don't know what the _000 represents. Still, this isn't important - I can just let the sanoid snapshots age out, and then delete them manually. Actually: I will just continue to run sanoid with autosnap = no, autoprune = yes and it will age them out for me.

Many thanks for releasing such an excellent tool!

(Aside: whilst syncoid does work, I have a requirement for replicating huge filesystems over TCP without ssh. Syncoid only works over ssh, so I had to compile ssh-hpn with crypto disabled to get decent performance - that was painful)

@InsanePrawn (Contributor)

Hi,
I bet @problame got some great answers to your sink questions!

can just let the sanoid snapshots age out, and then delete them manually. Actually: I will just continue to run sanoid with autosnap = no, autoprune = yes and it will age them out for me.

zrepl's pruning grid works on the snapshot's creation times as tracked by the snapshot's creation property, not the snapshot name, so in theory you should be able to prune all your snapshots with zrepl's grid.
Either way should work fine; just make sure that zrepl's pruning regexes are properly set up to [not] include the sanoid snaps [and vice versa].
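
Something along these lines, for example, in the push job's pruning section (untested; the grids and regexes are only placeholders, and remember that snapshots not kept by any rule get destroyed):

pruning:
  keep_sender:
    - type: not_replicated      # never prune what hasn't been replicated yet
    - type: grid
      grid: 1x1h(keep=all) | 24x1h | 14x1d
      regex: "^(zrepl|autosnap)_.*"
  keep_receiver:
    - type: grid
      grid: 48x1h | 28x1d | 3x30d
      regex: "^(zrepl|autosnap)_.*"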

(Aside: whilst syncoid does work, I have a requirement for replicating huge filesystems over TCP without ssh. Syncoid only works over ssh, so I had to compile ssh-hpn with crypto disabled to get decent performance - that was painful)

Out of curiosity: does your CPU have AES instructions (AES-NI)? Have you tried running vanilla openssh (whatever your distro offers by default) but forcing AES cipher suites, especially AES-CTR if your openssh version allows that mode? Many people seem to try regular openssh for zfs send and get caught off-guard by the default chacha20-poly1305 cipher's performance, which can often be outperformed by AES on x86 hardware with AES-NI instructions. Of course, no encryption is still faster than hardware-accelerated AES on its own, so it comes down to your other bottlenecks and security needs. (If AES performs well and you're interested in encryption, using TLS might be a good middle-of-the-road solution, even though certificate management sucks.)

[...]

After some dirty benchmarking on my haswell desktop (avx2!) i7-4790, aes128-gcm@openssh.com seems to be faster than aes128-ctr on my machine, which was a little surprising to me; YMMV. Results on my server were similar but somewhat unsteady.

$ for i in `ssh -Q cipher`; do dd if=/dev/zero bs=1M count=1000 2> /dev/null | ssh -c "$i" localhost "bash -c 'time -p cat' > /dev/null" 2>&1 | grep real | awk '{print "'$i': "1000 / $2" MB/s" }' && sleep 1; done


aes128-ctr: 625 MB/s
aes192-ctr: 602.41 MB/s
aes256-ctr: 617.284 MB/s
aes128-gcm@openssh.com: 729.927 MB/s
aes256-gcm@openssh.com: 694.444 MB/s
chacha20-poly1305@openssh.com: 480.769 MB/s

@candlerb (Contributor, Author)

zrepl's pruning grid works on the snapshot's creation times as tracked by the snapshot's creation property, not the snapshot name, so in theory you should be able to prune all your snapshots with zrepl's grid.

That's very cool - I will test now.

      keep_receiver:
        - type: grid
          grid: 48x1h | 28x1d | 3x30d
          regex: "^(zrepl|syncoid|autosnap)_.*"
        - type: regex
          negate: true
          regex: "^(zrepl|syncoid|autosnap)_.*"

I hadn't realised that zrepl would be able to do time-based pruning of any snapshot, not just ones created by zrepl - and that it doesn't rely on the format of the snapshot name. I think that's worth saying explicitly in the prune page.

After some dirty benchmarking on my haswell desktop (avx2!) i7-4790, aes128-gcm@openssh.com seems to be faster than aes128-ctr on my machine, which was a little surprising to me; YMMV. Results on my server were similar but somewhat unsteady.

Thank you for the suggestion, I hadn't tried changing ciphers. I went back to a couple of these machines - they are Xeon Bronze 3104 (6 cores/12 threads, 1.7GHz). I see the "aes" flag does exist in /proc/cpuinfo.

Trying your benchmark without any cipher selection:

root@monster15:~# dd if=/dev/zero bs=1M count=1000 2> /dev/null | ssh monster16-10g "bash -c 'time -p cat' > /dev/null" 2>&1 | grep real | awk '{print "'$i': "1000 / $2" MB/s" }'
: 135.501 MB/s

That was the sort of bottleneck I was seeing. ssh -v says it chose chacha20-poly1305@openssh.com

Trying your sweep across ciphers:

aes128-ctr: 324.675 MB/s
aes192-ctr: 340.136 MB/s
aes256-ctr: 285.714 MB/s
aes128-gcm@openssh.com: 381.679 MB/s
aes256-gcm@openssh.com: 350.877 MB/s
chacha20-poly1305@openssh.com: 116.414 MB/s

So, aes128 is better, but still slower than the disk arrays can generate. ssh is single-threaded, and these Xeon Bronze processors have a low clock speed per core. Your older i7-4790 has a base speed of 3.6GHz, and turbo 4.0GHz - more than twice as fast.

It would be interesting to see what zrepl's tls transport is capable of - although I don't see any settings for selecting the cipher. I suppose it will pick the crypto/tls library's preferred default?
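
For reference, the tls transport's serve section looks roughly like this (the paths and CNs below are placeholders I made up), and I don't see any cipher-related field:

serve:
  type: tls
  listen: ":8888"
  ca: "/etc/zrepl/ca.crt"
  cert: "/etc/zrepl/storage1.crt"
  key: "/etc/zrepl/storage1.key"
  client_cns:
    - "nuc1"   # client identity is taken from the client certificate's CN
    - "nuc2"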

@problame (Member)

@InsanePrawn thanks for jumping in so quickly!


In the case of backing up multiple hosts, where each host has a push job, is it OK for them all to communicate with the same sink job? And do the push job names have to be distinct, e.g. "host1_backup", "host2_backup" etc?

I think I am able to answer this after re-reading the docs

At https://zrepl.github.io/v0.3.0-rc1/configuration/overview.html#n-push-jobs-to-1-sink it says:

    It is thus safe to push to one sink job with different client identities.

Then I had to dig further: "client identity" is transport-specific. e.g. if using the TCP transport then it's the originating IP address mapped to an identity via a configuration table in the sink.

It's still not immediately clear if the push job names on different push hosts must be distinct. If the job names are only of local significance (i.e. not sent over the wire) then they could have the same job name.

Job names are not currently (and hopefully never) transmitted over the wire. Maybe at some point we are going to assign GUIDs to jobs and transfer those, but not ATM.
Thus, you can name all push jobs the same.
The sink distinguishes them solely by their client identity.
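
In other words, a push fragment like the following (abridged sketch; the address and filesystem filter are just examples, and snapshotting/pruning are omitted) can be deployed verbatim on both nuc1 and nuc2; only the sink's clients mapping tells them apart:

jobs:
  - type: push
    name: backup                 # same name on every host is fine, it never goes over the wire
    connect:
      type: tcp
      address: "storage1:8888"
    filesystems:
      "zfs/lxd/containers<": true
    # snapshotting and pruning sections as usual ...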

Can a zrepl sink take over an existing replica, copied by another tool? Or does it have to do the first replication from scratch? I will experiment (below).

Thanks for your experiment; I imagine this thread could be quite insightful to other syncoid users who are considering switching.
Could you find the time to write up a Quick-Start-style Guide for those converting from syncoid to zrepl? And as part of that guide, document the constraints ("what works, what doesn't") you learned about in this PR? (The special :issue: ref allows you to refer to existing GitHub issues.)

Can you strip out path components from the source before appending them to root_fs?

It appears not: here it says:

    ZFS filesystems are received to $root_fs/$client_identity/$source_path

and I don't see a way in a push job to strip the path prefix.

...

As far as I can tell, if I make a zrepl sink job with root_fs "storage1/backup", then on replication I will be forced to deliver to

storage1/backup/nuc1/zfs/lxd/containers/builder
^^^^^^^^^^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^
    root_fs    cli_id      source dataset

Is that correct?

Yes, I saw your comment on #253; I'll continue that discussion there.

However I did a "zfs rollback " on each filesystem, and after that zrepl was happy - performing incremental updates on top of what was already there. Success!

Yeah, that's sometimes necessary; I have not fully figured out why. Is mountpoint=none set on those datasets? That's usually the culprit.

Many thanks for releasing such an excellent tool!

(Aside: whilst syncoid does work, I have a requirement for replicating huge filesystems over TCP without ssh. Syncoid only works over ssh, so I had to compile ssh-hpn with crypto disabled to get decent performance - that was painful)

Thank you very much, compliments like this make the whole thing worthwhile :)

I hadn't realised that zrepl would be able to do time-based pruning of any snapshot, not just ones created by zrepl - and that it doesn't rely on the format of the snapshot name. I think that's worth saying explicitly in the prune page.

Would you mind opening up a small doc PR for this as well? Fresh eyes are usually the best doc writers ;)


Meta

(Sorry if this is not the right place to post, but I couldn't find a linked forum/community group)

I have considered setting up Discourse a few times, but I guess unless we have some SSO options, the bar of creating a new account is quite high.
And even if this was solved (e.g. OpenZFS hosting a Discourse for the entire ecosystem), I think that most 'support' issues such as this one eventually turn into actual code issues + PRs, so maybe we're just not stable enough yet for a non-GitHub support forum.

We have a #zrepl IRC channel on Freenode, but it's pretty silent and the ephemerality of IRC (even with public logs) makes it a non-solution for the "support-forum" use case.

@candlerb (Contributor, Author)

Thank you. I'll see if I can find time for PRs, won't be this week though.

There was one other question I had. Suppose you have multiple clients pushing to the same server, but you want different clients to use a different root_fs.

Do these have to be different sink jobs listening on different ports? Or can you have multiple sink jobs bound to the same port, and server would use the client identity to decide which sink job to associate each session with?

@problame (Member)

There was one other question I had. Suppose you have multiple clients pushing to the same server, but you want different clients to use a different root_fs.

Do these have to be different sink jobs listening on different ports? Or can you have multiple sink jobs bound to the same port, and server would use the client identity to decide which sink job to associate each session with?

Yes, you need to create two sinks ATM.
(I thought this was related to #253? I answered there.)
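
Concretely, that currently means two sink jobs with separate listeners, roughly like this (the ports, IPs and root_fs values are made-up examples):

jobs:
  - type: sink
    name: sink_nuc1
    root_fs: "storage1/backup_nuc1"
    serve:
      type: tcp
      listen: ":8888"
      clients:
        "192.0.2.11": "nuc1"
  - type: sink
    name: sink_nuc2
    root_fs: "storage1/backup_nuc2"
    serve:
      type: tcp
      listen: ":8889"
      clients:
        "192.0.2.12": "nuc2"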

Anyway, this limitation exists only because I had to pick some config model at the time of zrepl's initial release, based on pure gut feeling.
I guess the main pain point is that you cannot share a common listener?
Feel free to post suggestions for config syntax that would better support your use case.
(Please open a separate issue for this though, otherwise I'm likely to lose track :/ )

@problame (Member)

I'll close this issue since all questions seem to have been answered.

The open, unassigned issue for a quick-start guide is here: #368
