Feature Request - online split clone #2105

Open
greg-hydrogen opened this Issue Feb 3, 2014 · 27 comments

@greg-hydrogen commented Feb 3, 2014

Hello Everyone,

I wasn't sure where the correct place to post a request is, as this is not an issue, so feel free to close this if this is not the right place.

I have a feature request that might be useful to others. I am looking for the capability to split a clone while online, pretty much the same as NetApp's vol clone split.

There are certain times when a clone has completely diverged from the parent and it doesn't make sense to keep the two filesystems linked. The only way I can think of to do this today is to perform a zfs send/recv, but that will likely require some downtime to ensure consistency.
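
For reference, that workaround looks roughly like this (hypothetical names); the downtime comes from having to quiesce the data and switch consumers over to the copy:

zfs snapshot tank/clone@split
zfs send tank/clone@split | zfs recv tank/clone-independent
# switch whatever uses tank/clone over to tank/clone-independent,
# then the old clone (and its dependency on the parent snapshot) can go away
zfs destroy -r tank/clone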

What I am proposing is that, since ZFS knows which blocks are associated with the parent filesystem, it could copy those blocks to a new area and repoint the clone to use those copies instead (hopefully I have explained that properly). The end state would be a split clone, produced while the filesystem is online and active.

@behlendorf (Member) commented Feb 3, 2014

It sounds like zfs promote may already do what you need.

       zfs promote clone-filesystem

           Promotes a clone file system to no longer be dependent on its
           "origin" snapshot. This makes it possible to destroy the file
           system that the clone was created from. The clone parent-child
           dependency relationship is reversed, so that the origin file
           system becomes a clone of the specified file system.

           The snapshot that was cloned, and any snapshots previous to this
           snapshot, are now owned by the promoted clone. The space they use
           moves from the origin file system to the promoted clone, so enough
           space must be available to accommodate these snapshots. No new
           space is consumed by this operation, but the space accounting is
           adjusted. The promoted clone must not have any conflicting snapshot
           names of its own. The rename subcommand can be used to rename any
           conflicting snapshots.
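
For illustration, a typical sequence with hypothetical dataset names; note that promote only reverses the origin dependency, it does not remove it:

zfs clone tank/template@base tank/work
zfs promote tank/work        # tank/template becomes a clone of tank/work@base
zfs destroy tank/template    # now allowed, since @base has moved to tank/work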

behlendorf added this to the 0.7.0 milestone Feb 3, 2014

@greg-hydrogen (Author) commented Feb 3, 2014

I was looking at zfs promote, but this appears to just flip the parent-child relationship...

What I was thinking of is an end state where both filesystems are completely independent of each other...

Some use cases for this could be:
cloning VM templates - having a base image that is cloned to create other VMs, which are in turn split from the template so the template can be updated/destroyed/recreated
database clones - cloning a prod DB for dev use, where the dev copy will undergo a lot of changes and might in turn become the base for a testing clone itself; in that case it would be nice to split dev from prod, as the base snapshot might grow to cost more than an independent filesystem for dev would

@GregorKopka (Contributor) commented Aug 26, 2014

After you clone original@snapshot you can modify both the original dataset and the clone freely; they won't affect each other, except that they still share the data that remains common to both on disk.

If you want to destroy/recreate the template (original) you can simply destroy all snapshots on it (except the one(s) used as origins of clones), zfs rename the original, and zfs create a new one with the same name (the origin property of clones isn't bound to the name of the original dataset, so you can rename both freely).

The only downside to that is that all unique data held in original@snapshot (= the base of the clone) can't be released unless you are willing to destroy either the clone(s) or (after a promote of the clone) the original.
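
A minimal sketch of that rename/recreate workflow, with hypothetical dataset names:

zfs destroy tank/template@old1            # drop snapshots not used as clone origins
zfs rename tank/template tank/template-retired
zfs create tank/template                  # fresh dataset under the old name
# tank/vm1 keeps working: its origin property still points at
# tank/template-retired@base, regardless of the rename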

behlendorf removed this from the 0.7.0 milestone Oct 30, 2014

@behlendorf (Member) commented Oct 30, 2014

@greg-hydrogen in the end, did you determine whether zfs promote meets your needs? Or is there still a possible feature request here?

@GregorKopka (Contributor) commented Oct 31, 2016

To comment on this: it would be nice if there were functionality to transform an origin-clone relationship into a deduplicated one, removing the logical links that keep the datasets from being destroyed individually at will, while maintaining only one copy of the still-shared data.

@jsoref (Contributor) commented Sep 14, 2017

@behlendorf: it almost certainly doesn't meet the need.
http://jrs-s.net/2017/03/15/zfs-clones-probably-not-what-you-really-want/
does a good job of explaining the problem.

@jsoref (Contributor) commented Sep 15, 2017

Here's what I'm trying to do conceptually:

user@backup:

  1. generate a date based snapshot
  2. send it from backup to test as a mirror
zfs snapshot backup/prod@20170901
zfs send -R backup/prod@20170901 | ssh test zfs recv ... test/mirror

user@test:

  1. create a place to sanitize the mirror
  2. sanitize it
  3. snapshot it
  4. clone the sanitized version for use
  5. use it
zfs clone test/mirror@20170901 test/sanitizing
sanitize sanitizing
zfs snapshot test/sanitizing@sanitized
zfs clone test/sanitizing@sanitized test/test
dirty test/test

user@backup:

  1. having used production further...
  2. create an updated snapshot
  3. send the incremental changes from prod to test
  4. delete the previous incremental marker (which in my case frees 70GB)
dirty prod/prod
zfs snapshot backup/prod@20170908
zfs send -I backup/prod@20170901 backup/prod@20170908 | ssh test zfs recv test/mirror
zfs destroy backup/prod@20170901

user@test:

  • this is where problems appear.
  • with some amount of cajoling, one can destroy the sanitizing volumes.
  • But, I'm left with test/mirror@20170901 which is the origin for the two remaining things: test/mirror@20170908 and test/test.
  • I could destroy the updated mirror (test/mirror@20170908) if I wanted to, but that doesn't do me any good (since my goal is to use that data).

In order for me to make progress, I actually have to run through sanitize, stop the thing that's using test, destroy test (completely), clone mirror as test, restart the thing using test, and then I can finally try to destroy the original snapshot. Or, I can decide that I'm going to take a pass, trigger a new snapshot on backup later, send its increment over, delete the snapshot that was never mirrored to test, and try again.
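
For concreteness, a sketch of that refresh cycle (same dataset names as above; the sanitize and service start/stop steps are placeholders):

zfs clone test/mirror@20170908 test/sanitizing2     # hypothetical name
sanitize test/sanitizing2
zfs snapshot test/sanitizing2@sanitized
# ... stop the thing that's using test/test ...
zfs destroy -r test/test
zfs clone test/sanitizing2@sanitized test/test
# ... restart the thing using test/test ...
zfs destroy test/mirror@20170901                    # only now can the old branch point go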

Fwiw, to get a taste of this...:

zfs list -t all -o used,refer,name,origin -r test/mirror test/test
 USED   REFER  NAME                              ORIGIN
 161G   1.57M  test/mirror                       -
  65G   82.8G  test/mirror@2017081710            -
    0   82.4G  test/mirror@201709141737          -
3.25G   82.8G  test/test                         test/mirror@2017081710

(the numbers are really wrong, I actually have 1 volume with 4 contained volumes, hence the recursive flags...)

Now, I understand that I can use zfs send | zfs recv to break dependencies, and for small things that's fine. But this portion of my pool is roughly twice the available space in the pool, and one half is probably larger than that, which makes performing that operation problematic. It's also a huge number of bytes to reprocess. My hope in using snapshots was to benefit from COW, but instead I'm being charged for COW, because the branch point - which will eventually hold data used by neither side of my branching tree - must still be paid for.

@shodanshok commented Aug 28, 2018

@behlendorf Hi, any progress on this? Splitting a clone from its original filesystem would be really great for VM templates and/or big file-level restores. See the link @jsoref pasted above for a practical example.

@kpande (Contributor) commented Feb 15, 2019

I don't believe any work has begun because it's not clear that there is an actual need for any changes when zfs send | zfs recv works as you're requesting - especially when redacted send/recv is implemented, negating the need to 'sanitize' this way.

can you elaborate on why send|recv is not suitable? that's what everyone else has been doing all along.

@jsoref (Contributor) commented Feb 15, 2019

@kpande: the goal is to pay (in space and data transfer) for what has changed (COW), not for the entire dataset (each time this operation happens).

If I had a 10TB moving dataset, and a variation of that dataset that I want to establish, sure, I could copy the 10TB, apply the variation, and pay for 20TB (if I have 20TB available). But if my variation is really only 10MB different from the original 10TB, why shouldn't I be able to pay for just 10TB+10MB? -- snapshots + clones give me that. Until the 10TB moves sufficiently that I'm now paying for 30TB (10TB live + 10TB snapshot + 10TB diverged) and my 10MB variation moves so that it's now its own 10TB (diverged from both live and snapshot). In the interim, to "fix" my 30TB problem, I have to spend another 10TB (=40TB -- via your zfs send+zfs recv). That isn't ideal. Sure, it will "work", but it is neither "fast" nor remotely space efficient.

Redacted send/recv sounds interesting (since it more or less matches my use case) -- but while I can find it mentioned in a bunch of places, I can't find any useful explanation of what it's actually redacting.

Fwiw, for our system, I switched so that the sanitizing happens on the sending side (which is also better from a privacy perspective), which mostly got us out of the woods.

There are instances where the data variation isn't "redacting", and where the system has the resources for zfs snapshot+zfs send but doesn't really want to allocate the resources to host a second database to do the "mutation" -- and doesn't want to have to pay to send the entire volume between primary and secondary (i.e. it would rather send an incremental snapshot to a system which already has the previous snapshot).

@kpande (Contributor) commented Feb 15, 2019

you could always experiment with a fast dedicated dedup device via #5182 so that you don't need to use clones, but instead use DDT to keep track of duplication.
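
A rough sketch of that alternative, assuming the template data itself is written with dedup already enabled (names are hypothetical):

zfs set dedup=on tank/images                   # only data written from now on enters the DDT
zfs create tank/images/vm01
cp -a /tank/images/gold/. /tank/images/vm01/   # identical blocks dedup against the template
# vm01 has no origin snapshot, so either dataset can be destroyed independently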

@jsoref (Contributor) commented Feb 15, 2019

Yes, I'm aware I could use dedup. We're paying for our CPUs/RAM, so dedicating constant CPU+RAM to make a rare task (refreshing a mutated clone) fast felt like a poor tradeoff (I'd rather pay for a bit more disk space).

@kpande (Contributor) commented Feb 15, 2019

the memory requirements of dedup are overstated compared to the IOPS requirement of dedup table read/write overhead from sync writes. that is, if you have a fast enough DDT, the memory constraint is less problematic. conversely, even with plenty of memory, if your DDT is slow, deleting data will take forever. the reason I mentioned it is that #5182 allows offloading DDT operations to dedicated storage.

@shodanshok commented Feb 16, 2019

@kpande this link quite clearly shows the problem with current clones. After all, if a clone diverges that much from the base snapshot, the permanent parent->child relation between the two is a source of confusion. Splitting the clone would be a clear indication that they have diverged too much to be considered tied anymore.

But let me give a more practical example.

Let kvm/vmimages be a datastore container for multiple virtual disk images, with snapshots taken on a daily basis. I know the default answer would be "use a dataset for each disk", but libvirt pools do not play well with that. So we have something like:

kvm/vmimages
kvm/vmimages@snap1
kvm/vmimages@snap2
kvm/vmimages@snap3

At some point, something bad happens to a VM disk (i.e. serious guest filesystem corruption), but in the meantime other users are actively storing new, important data on the other disks. You basically have some conflicting requirements: a) revert to yesterday's old, uncorrupted data, b) preserve any newly uploaded data, which is not found in any snapshot, and c) cause minimal service interruption.

Clones come to mind as a possible solution: you can clone kvm/vmimages@snap3 as kvm/restored to immediately restore service for the affected VM. So you now have:

kvm/vmimages
kvm/vmimages@snap1
kvm/vmimages@snap2
kvm/vmimages@snap3
kvm/restored   # it is a clone of snap3
kvm/restored@snap1
...

The affected VM runs from kvm/restored, while all the others remain on kvm/vmimages. At this point, you delete all the extra disks from kvm/restored and the original, corrupted disk from kvm/vmimages. All seems well, until you realize that the old corrupted disk image is still using real disk space, and that any overwrite in kvm/restored consumes additional space due to the old, undeletable kvm/vmimages@snap3. You cannot remove this old snapshot without also removing your clone, and you cannot simply promote kvm/restored and delete kvm/vmimages, because neither is the single "authoritative" data source (i.e. real data is stored inside both datasets).
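
As a sketch, the recovery steps above boil down to the following; the final destroy is the part that ZFS currently refuses:

zfs clone kvm/vmimages@snap3 kvm/restored
# repoint the affected VM to kvm/restored, delete the unneeded images on each side
zfs destroy kvm/vmimages@snap3
# fails: the snapshot still has a dependent clone (kvm/restored)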

Splitting a clone from its source would completely solve the problem above. It is not clear to me how redacted send/recv would help in this case.

@kpande (Contributor) commented Feb 16, 2019

well, it describes an issue with clones that happens if you configure your system poorly.

in my scenario, I use Gentoo Linux to run VMs. this is classically irritating because if I run several copies of Gentoo (I do) then they each compile world updates or I need to set up a BINHOST and mess around with compiling/installing binaries on only hosts that need it. I am lazy.

if I create a tank/vm/gentoo zvol, install the OS there with all the utilities needed by each cloned VM, and snapshot it, I can now clone it to tank/instance/02-02-03-03-04-04. great, the clone uses 0 space. but it must be configured - it has its own hostname and config files. these would cause the clone to diverge from its origin, though.

so I'll create tank/overlay/02-02-03-03-04-04 (as an ext4 or xfs zvol via iSCSI, because overlayfs doesn't work with NFS directories; NFS can't store the needed attributes) and then mount it in the Gentoo guest during boot, in the initrd, using overlayfs over /.

now I can configure everything in that VM and the config files are written to the overlay instead of the clone. I take significant care to avoid installing any software into the guest's overlay that would diverge it too significantly from its clone origin. you can avoid this issue by creating separate overlays ONLY for the paths you need, like /etc.
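
A rough sketch of the overlay mount as it might appear in the guest's initrd (device names and paths are hypothetical):

mount /dev/sdb /overlay                      # the per-instance ext4/xfs zvol, attached via iSCSI
mkdir -p /overlay/upper /overlay/work
mount -t overlay overlay \
    -o lowerdir=/sysroot,upperdir=/overlay/upper,workdir=/overlay/work \
    /newroot                                 # /sysroot is the read-only cloned root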

when it comes time to update the VM, I just start a new clone fresh from the most recent gentoo snapshot, compile and install updates for all clones, and create a new snapshot.

the fun part comes next. when I shut down and promote this dataset, the gentoo template now has its new snapshot. we can destroy the tank/instance/... dataset for the temporary VM we used to create the new snapshot; it's not needed anymore.

so then the other clones are rebooted and a libvirt hook reprovisions their clones automatically.

I don't ever have to worry about divergence because the VMs have a freshly reprovisioned OS each time they boot.

you can do this with Windows guests using folder redirection and roaming profiles via CIFS.

if you're interested, maybe I can make some of this work available to you.

@shodanshok commented Feb 16, 2019

@kpande first, thanks for sharing your view and your solution (which is interesting!). I totally agree that a careful, and very specific, guest configuration (and host dataset tree) can avoid the problem described above.

That said, libvirt (and its implementation of storage pools) does not play very well with this approach, especially when managing mixed environments with Windows virtual machines. Moreover, this was only a single example. Splittable clones would be very useful, for instance, to create a "gold master" / base image which can be instantiated at will to create "real" virtual machines.

With the current state of affairs, doing that will tax you heavily in allocated space, as you will never be able to remove the original, potentially obsolete, snapshot. What surprises me is that, ZFS being a CoW filesystem, this should be a relatively simple operation: when deleting the original snapshot, "simply" mark as free any block not referenced elsewhere and remove the parent/child relation. In other words, let the clone become a real filesystem, untangled from any source snapshot.

Note that I put the word "simply" inside quotes: while it is indeed a simple logical operation, I am not sure if/how well it maps onto the underlying ZFS on-disk structures.

@kpande (Contributor) commented Feb 16, 2019

snapshots are immutable, period. there is no way to modify them without rewriting a ton of block pointers, which would be like trying to change your pants while running at full speed.

@shodanshok commented Feb 16, 2019

@kpande ok, fair enough - if a real technical problem exists, I must accept it. But this is different from stating that a specific use case is invalid.

If this view (i.e. the impossibility of splitting a clone from its original parent snapshot without involving the "mythical" BPR) is shared by the ZFS developers, I think this FR can be closed.

Thanks.

@kpande (Contributor) commented Feb 16, 2019

there have been many things that ZFS couldn't do in the past - removing a top level vdev is now possible, with significant restrictions - it won't work for raidz, for example. however, removing a vdev causes ZFS to also "rebalance data" among remaining vdevs. this was considered impossible for a long time.

it's not that your request can't be done, but it will probably be functionally restricted and slow. it would be a neat feature to have a temporary indirection table for "unlinked clones", but this could grow forever.

maybe I'm wrong here and the piece of data we have to change isn't going to affect our block pointers.

@helamonster commented Mar 22, 2019

+1 on needing this feature. Yes, send/recv could be used, but that would require downtime of whatever is using that dataset to switch from the old (clone) to the new dataset.

I've run into situations with LXD where a container is copied (cloned), but that causes problems with my separately managed snapshotting.

@jsoref (Contributor) commented Mar 22, 2019

@kpande: again, my use case has the entire dataset being a database, and a couple of variations of the database.

From what I've seen, it doesn't look like overlayfs plays nicely w/ zfs as the file system (it seems happy w/ zvols and ext4/xfs according to your notes). It sounds like this approach would cover most cases, in which case documentation explaining how to set up overlayfs w/ ext4/xfs would be welcome.

That said, some of us are using zfs not just for the volume management but also for the acl/allow/snapshot behavior/browsing, and would like to be able to use overlayfs w/ zfs instead of ext4/xfs. If that isn't possible, is there a bug for it? If there is, it'd be good to highlight it from here; if not, and you're endorsing the overlayfs approach, maybe you could file it (if you insist, I could probably write it, but I don't know anything about overlayfs, and that seems like a key technology in the writeup).

@kpande (Contributor) commented Mar 22, 2019

@jsoref I mentioned an issue with NFS, not with zfs and overlayfs. if there is any issue, be sure to file one.

@shodanshok commented Mar 23, 2019

From what I've seen, it doesn't look like overlayfs plays nicely w/ zfs as the file system (it seems happy w/ zvols and ext4/xfs according to your notes). It sounds like this approach would cover most cases, in which case documentation explaining how to set up overlayfs w/ ext4/xfs would be welcome.

The overlayfs approach will not work for an extremely important, and common, use case: cloning a virtual image starting from another one (or from a "gold master" template). In such a case, splitting the clone would be key to avoiding wasted space as the original and cloned images diverge.
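
For reference, the gold-master pattern in question (hypothetical names); today the @gold snapshot can never be destroyed while any VM clone exists, which is exactly what a clone split would fix:

zfs snapshot tank/templates/gold-image@gold
zfs clone tank/templates/gold-image@gold tank/vms/vm01
zfs clone tank/templates/gold-image@gold tank/vms/vm02
# every block vm01/vm02 later overwrite is still charged on top of @gold,
# and @gold (plus the obsolete template data it holds) cannot be freed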

@ptx0 commented Mar 23, 2019

actually, overlayfs as described above works great for VMs to avoid tightly coupled clones. you need to adjust your infrastructure instead of expecting this feature.

@shodanshok commented Mar 23, 2019

@ptx0 this only works if the guest OS supports overlayfs (so no Windows VM support) and if the end users (i.e. our customers) are willing to significantly change their VM image provisioning/installation. As a side note, while I completely understand - and accept - this FR being closed on a technical basis (e.g. if it involves BPR), it is quite frustrating to have a legitimate use case stamped as "invalid". If it is not your use case, fine. But please do not assume that no one has a valid use case for this feature.

@kpande (Contributor) commented Mar 23, 2019

Windows doesn't need overlayfs, it has built-in folder redirection and roaming profiles.

@GregorKopka (Contributor) commented Mar 25, 2019

Windows doesn't need overlayfs, it has built-in folder redirection and roaming profiles.

Folder redirection, while it has existed since NT, doesn't always work reliably, as software exists that (for obscure reasons) doesn't handle redirected folders correctly and simply fails when confronted with a redirected Desktop or Documents folder. Apart from that, clones of Windows installations diverge from the origin massively and quite quickly all by themselves, courtesy of Windows Update - having different users logging on and off only speeds this up.
