COW cp (--reflink) support #405

Open · torn5 opened this issue Sep 21, 2011 · 50 comments

@torn5 commented Sep 21, 2011

This is a feature request for an implementation of BTRFS_IOC_CLONE in zfsonlinux, or something similar, so that it's possible to COW-copy a single file in zero space, zero RAM, and zero time, without having to enable super-expensive features like filesystem-wide deduplication (which, by the way, is not zero RAM or zero time).

If it can be done at the directory level, so as to clone entire directory trees with one call, even better.

On the mailing list, doubts were raised regarding semantics on:

  1. Quotas, especially when the "copy" starts to be modified.
  2. Source and destination datasets don't use the same compression.
  3. Source and destination datasets don't use the same encryption key.
  4. Source and destination datasets don't have the same "copies" attribute.
  5. Source and destination datasets don't have the same recordsize.

Firstly, I don't expect this to work across datasets. Secondly, I'd suggest using the same semantics as deduplication. It should just be a shortcut for 1) enabling deduplication, then 2) copying the file by reading it byte-by-byte and writing it byte-by-byte elsewhere.

If you can implement exactly the BTRFS_IOC_CLONE interface, the same cp with --reflink that works for btrfs could be used here. If the ioctl is different, we will also need patches to the cp program, or a separate cp-like program.
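
For context, here is roughly what the requested interface looks like from userspace (a sketch; `/btrfs` and `/tank` are hypothetical mount points):

```sh
# On btrfs today, coreutils cp issues the clone ioctl on our behalf:
cp --reflink=always /btrfs/huge.img /btrfs/huge-copy.img  # instant, shares all blocks

# The request is for the same invocation to work on a ZFS dataset.
# Without the feature, --reflink=always fails outright, while
# --reflink=auto silently falls back to a full data copy:
cp --reflink=always /tank/huge.img /tank/huge-copy.img    # currently errors out
cp --reflink=auto   /tank/huge.img /tank/huge-copy.img    # currently a full copy
```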

@darthur commented Oct 1, 2011

I have my doubts regarding whether this is needed -- the functionality is already available at the dataset (filesystem) level, and ZFS is intended to be used with fine-grained datasets (filesystems).

  1. I see no reason why quotas should be special for this feature. Already, if you turn on ZFS copies, any new writes will write 2 or 3 times the number of blocks an application would expect. ZFS-based systems are intended to have enough free space, and quotas are both difficult to calculate/understand and more forgiving than quotas on systems that do not have COW.
  2. Same compression does not matter. If you have a cloned dataset and change the compression on one of them -- existing data is unchanged, the new compression setting only matters for new writes.
  3. Clearly if encryption keys are different, then the data may not be stored in the same blocks and COW does not apply.
  4. The copies attribute, like the compression settings, only applies to new writes. This is no different semantically than cloning a dataset (or promoting a snapshot).
  5. This issue is similar to 3.

COW is already implemented across specific datasets -- e.g. clones, or datasets with snapshots (promoted or otherwise). Therefore, I propose a more generally useful version of this request: implement or allow COW for all copy operations in the same pool, based on a setting implemented at both the dataset and pool level.
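
For reference, the dataset-level COW being referred to looks like this today (real commands; `tank/projects` is a hypothetical dataset):

```sh
# Snapshot + clone: the clone shares every block with the snapshot it
# came from, and only diverging writes allocate new space.
zfs snapshot tank/projects@base
zfs clone tank/projects@base tank/projects-copy

# If the copy later becomes the "real" one, invert the dependency:
zfs promote tank/projects-copy
```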

@akorn (Contributor) commented Sep 20, 2012

Just to reiterate here what has been discussed in https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups=#!msg/zfs-discuss/mvGB7QEpt3w/r4xZ3eD7nn0J -- snapshots, dedup and so on already provide most of the benefits of COW hardlink breaking, but not all.

Specifically, binaries and libs with the same inode number get mapped to the same memory locations for execution, which results in memory savings. This matters a lot with container-based virtualisation, where you may have dozens or hundreds of identical copies of libc6 or sshd or apache in memory. KSM helps there, but it needs madvise() and is not free (it costs CPU at runtime).

COW hardlinks are essentially free.

linux-vserver (http://linux-vserver.org/) already patches ext[234], and I think jfs, to enable this sort of COW hardlink-breaking functionality by abusing the immutable attribute plus an otherwise unused attribute that linux-vserver repurposes as "break-hardlink-and-remove-immutable-bit-when-opened-for-writing".

It would be great to have the same feature in zfsonlinux; its lack is one of the only reasons I can't put my vservers volume on ZFS (the lack of POSIX or NFSv4 ACL support is the other reason).

So, to clarify, I'm requesting the following semantics:

  1. create a hardlink to a file and set some xattr(?) on it.
  2. if the file is opened for reading, everything proceeds as normal.
  3. if the file is opened for writing, the instance of the file that was to be opened is removed and a new copy of the original file is placed there; the open() succeeds and the resulting file handle refers to the new copy.

This wouldn't break existing applications because the feature would not be enabled by default (you'd have to set the special xattr on a file to use it).
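
A sketch of those semantics from userland, assuming a hypothetical `user.cowbreak` xattr as the opt-in marker (paths are made up; only step 1 is expressible today, steps 2 and 3 would be filesystem behavior):

```sh
# Step 1: share one inode across guests and mark it for COW breaking.
ln /srv/template/lib/libc-2.17.so /srv/guest1/lib/libc-2.17.so
setfattr -n user.cowbreak -v 1 /srv/guest1/lib/libc-2.17.so

# Step 2: reads and mmap-for-exec keep hitting the shared inode, so all
# guests share a single copy of the mapped pages in memory.

# Step 3: the first open() for writing would transparently replace
# /srv/guest1/lib/libc-2.17.so with a private copy and open that instead.
```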

@cburroughs (Contributor) commented Oct 8, 2012

As was brought up in the thread, we are currently using http://www.xmailserver.org/flcow.html on ext4 for file/dir level COW. This works, but if we were using ZFS we would much prefer to have the filesystem take care of the COW goodness. (For our narrow use case we can probably do everything we need with a filesystem per directory, but having our code just work with `cp` would be nice to have.)

@torn5 (Author) commented Aug 2, 2013

I would like to bump this feature. When I submitted it 2 years ago there were around 30 issues ahead of it; now there are around 500. It keeps moving farther and farther away as new issues are created ahead of it, like the expansion of the universe. I do understand this is a feature request and not a bugfix, but it would make ZFS a helluva lot more appealing to people.
Snapshotting the whole ZFS filesystem to achieve the clone of one file is definitely overkill and cannot substitute for this.

@ryao (Member) commented Sep 25, 2013

One of the problems with implementing this is that directory entries are implemented as name-value pairs, which, at a glance, provides no obvious way of doing this. I just noticed today that the value is divided into 3 sections: the top 4 bits indicate the file type, the bottom 48 bits are the actual object number, and the middle 12 bits are unused. One of those unused bits could be repurposed to implement reflinks.

Implementing reflinks would require much more than marking a bit, but the fact that we have spare bits available should be useful information to anyone who decides to work on this.
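
To make that layout concrete, a toy illustration in shell arithmetic (the packed value is invented; the field positions are as described above):

```sh
de=$(( (8 << 60) | 123456 ))       # hypothetical entry: type 8, object 123456
type=$((  (de >> 60) & 0xF ))      # top 4 bits: file type
spare=$(( (de >> 48) & 0xFFF ))    # middle 12 bits: currently unused
obj=$((   de & 0xFFFFFFFFFFFF ))   # bottom 48 bits: object number
printf 'type=%d spare=%d obj=%d\n' "$type" "$spare" "$obj"
# -> type=8 spare=0 obj=123456
```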

@torn5 (Author) commented Sep 25, 2013

Hi ryao, thanks for noticing this :-) If deduplication does not need such bits or a different directory structure, then reflink should not need them either. I see reflink as a way to copy+deduplicate a specific file without the costs associated with copying and with deduplication, but with the same final result... Is that not so?

@behlendorf (Member) commented Oct 11, 2013

@torn5 I believe you're correct; this could be done relatively easily by leveraging dedup. Basically, the reflink ioctl() would provide a user interface for per-file deduplication. As long as we're talking about a relatively small number of active files, the entire deduplication table should be easily cacheable and performance will be good. If implemented this way, we'd inherit the existing dedup behavior for quotas and such. This makes the most sense for individual files in larger filesystems; for directories, creating a new dataset would still be best.
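
Today's closest approximation of that behavior is per-dataset dedup plus an ordinary copy: every byte is still read and rewritten, but the rewritten blocks dedup against the originals (real commands; `tank/data` is a hypothetical dataset):

```sh
zfs set dedup=on tank/data
cp /tank/data/huge.img /tank/data/huge-copy.img  # full read/write, ~zero extra space
zpool get dedupratio tank                        # ratio rises as the copy dedups
```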

@hufman commented Jan 17, 2014

Here is a scenario that I think this feature would be very helpful for:

I take regular snapshots of my video collection. Because of COW, these snapshots do not take any space. However, a (young relative|virus|hostile alien) comes for a visit and deletes some videos from my collection, and I would like to recover them from my handy snapshots. If I use cp normally, each recovered video is duplicated in the snapshots and in the active space. With cp --reflink, the file system would be signaled to COW the file to a new name, taking no additional space and making recovery instantaneous.

As an aside, is there a way to scan a ZFS volume and run an offline deduplication? If I had copied the data, is there a way to recover the space other than deleting all snapshots that contained the data?
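
Written out, the recovery step this scenario asks for would look like the following (hypothetical on ZFS; snapshots really are reachable read-only under `.zfs`, but the snapshot name here is made up):

```sh
cp --reflink=always \
   /tank/videos/.zfs/snapshot/weekly/holiday.mkv \
   /tank/videos/holiday.mkv
# Without reflink support this has to be a full data copy, duplicating
# blocks the snapshot already holds.
```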

@akorn (Contributor) commented Jan 17, 2014

On Thu, Jan 16, 2014 at 05:16:38PM -0800, hufman wrote:

> I take regular snapshots of my video collection. Because of COW, these
> snapshots do not take any space. However, a (young relative|virus|hostile
> alien) comes for a visit and deletes some videos from my collection, and I
> would like to recover them from my handy snapshots. If I use cp normally,
> each recovered video is duplicated in snapshots and in the active space.
> With cp --reflink, the file system would be signaled to COW the file to a
> new name, without taking any additional space, along with making recovery
> instantaneous.

I'm not sure I see how that would work; it would need cross-filesystem
reflink support (since you'd be copying out of a snapshot and into a real
fs).

Normally, to recover from such situations, you'd just roll back to the
latest snapshot that still has the missing data. Of course, if you'd like to
keep some of the changes made since then, this is less than ideal.

If this is a frequent occurrence, maybe you'd like to turn on deduplication.
In that case, copying the files out of the snapshot will not use extra
space.

> As an aside, is there a way to scan a ZFS volume and run an offline
> deduplication?

None that I know of. What you could do is: enable dedup, then copy each file
to a new name, remove the original file and rename the new file to the old
name. This obviously does nothing to deduplicate existing snapshots.

> If I had copied the data, is there a way to recover the
> space other than deleting all snapshots that contained the data?

Other than rolling back to a snapshot, no, I don't think so.

Andras
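
A rough sketch of that rewrite trick: with dedup enabled, copy each file aside and rename it back so its blocks are rewritten through the dedup table (`tank/data` and its mountpoint are hypothetical; this is neither atomic nor snapshot-aware):

```sh
zfs set dedup=on tank/data
find /tank/data -type f -print0 |
while IFS= read -r -d '' f; do
    cp -p -- "$f" "$f.tmp.$$" && mv -- "$f.tmp.$$" "$f"
done
```
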
@hufman commented Jan 19, 2014

Thank you for your response!

@rocallahan commented Feb 25, 2014

My use-case for reflink is that we are building a record and replay debugging tool (https://github.com/mozilla/rr) and every time a debuggee process mmaps a file, we need to make a copy of that file so we can map it again later unchanged. Reflink makes that free. Filesystem snapshots won't work for us; they're heavier-weight, clumsy for this use-case, and far more difficult to implement in our tool.

@ThisGuyCodes commented Mar 11, 2014

I also have a use for this. I use different ZFS filesystems to control differing IO requirements. Currently, for my app to move items between these filesystems, it must do a full copy, which works but makes things fairly unresponsive fairly regularly, as it copies dozens of gigabytes of data. I would consider using deduplication, but I'm fairly resource-constrained as it is.

@yshui (Contributor) commented Mar 11, 2014

I would like to do this as my GSoC 2014 project, but I don't know whether ZFSOnLinux participates in GSoC.

@ryao (Member) commented Mar 12, 2014

ZFSOnLinux is not participating, but Gentoo is. Put together a proposal and I and others will review it. If it looks good, I am willing to mentor this.

@aarcane commented Mar 12, 2014

As much as I want this feature, I worry that this might constitute GSoC abuse in some way. Please be sure to read the rules carefully.

@yshui (Contributor) commented Mar 12, 2014

@ryao Hmm, the illumos ideas page says I could also propose ideas for OpenZFS. Isn't ZoL part of OpenZFS?

@ryao (Member) commented Mar 12, 2014

@yshui OpenZFS is an umbrella project for the various ZFS platforms. It is not directly participating in GSoC, but some of its member platforms are. Illumos and FreeBSD are participating. ZoL is not, but can participate indirectly through Gentoo, which is also participating in GSoC.

@yshui (Contributor) commented Mar 12, 2014

@ryao I see. Sorry, I didn't read your comment before replying to your email.

@kpande (Contributor) commented Mar 12, 2014

Gentoo maintains its own patchset for ZoL through portage (Richard Yao and others).

@behlendorf (Member) commented Mar 21, 2014

If someone does decide to work on this I'm happy to help point them in the right direction.

@ryao (Member) commented Jul 31, 2014

I had jotted down a possible way of doing this in #2554 in response to an inquiry about btrfs-style deduplication via reflinks. I am duplicating it here for future reference:

Directories in ZFS are name-value pairs. Adding reflinks to that is non-trivial. One idea that might work would be to replace the block pointers in the indirection tree with object identifiers and utilize ZAP to store mappings between the object id and a (reference count, block pointer) tuple. That would involve a disk format change and would only apply to newly created files. Each block access would suffer an additional indirection. Making reflinks under this scheme would require duplicating the entire indirection tree and updating the corresponding ZAP entries in a single transaction group.

Thinking about it, the indirection tree itself is always significantly smaller than the data itself, so the act of writing would be bounded to some percentage of the data. We already suffer from this sort of penalty with the existing deduplication implementation, but at a much higher cost, as we must perform a random seek on each data access. Letting users copy only metadata through reflinks seems preferable to direct data copies that shuffle data through userland. This could avoid the penalties in the presence of dedup=on because all of the data has been preprocessed by our deduplication algorithm.

That being said, the DDT has its own reference counts for each block, so we would either need to implement this in a way compatible with that or change it. We also need to consider the interaction with snapshots.

@ryao (Member) commented Sep 29, 2014

Here is a technique that might be possible to apply:

https://www.usenix.org/legacy/event/lsf07/tech/rodeh.pdf

There are a few caveats:

  1. Care needs to be taken with the interaction between this and dataset snapshots. The snapshots contain a list of referenced blocks that might need to be updated.
  2. As I said in July, the DDT has reference counts and we would need to implement this in a way that is compatible.
  3. The technique described cannot be applied to existing ZFS datasets. The manner in which Jeff designed the indirection tree exploits some properties of numbers that keep us from suddenly repurposing two entries, which the algorithm requires. It might also limit our maximum file size (I need to check), because the loss of 2 entries in the indirection tree means that the total addressable storage is reduced. This means we could only support reflinks on new datasets created with this feature, if it is implemented. Old datasets could not be upgraded, although send/recv could likely be used to create new ones in the new format.

@batrick commented Apr 6, 2015

I'm late to this party but I want to give a definitive and real use-case for this that is not satisfied by clones.

We have a process sandbox which imports immutable files. Ideally, each imported file may be modified by the process, but those modifications shouldn't change the immutable originals. Okay, that could be solved with a cloned file system and COW. However, from this sandbox we also want to extract (possibly very large) output files. This throws a wrench in our plans: we can't trivially import data from a clone back into the parent file system without doing a byte-by-byte copy.

I think this is a problem worth solving. Clones (at least, as they currently exist) are not the right answer.

@dioni21 (Contributor) commented Jan 31, 2016

Sorry if I lost something in this long-running discussion, but it seems to me that everybody is thinking in terms of snapshots, clones, dedup, etc.

I am, personally, a big fan of dedup. But I know it has many memory drawbacks, because it is done online. In BTRFS and WAFL, dedup is done offline, so all that memory is only used during the dedup process.

I think the original intent of this request is to add "clone/dedup" functionality to cp -- but not by enabling online dedup on the filesystem, nor by first copying and then deduping the file. Let the filesystem just create another "file" instance whose data is a set of CoW sectors from another file.

Depending on how ZFS arranges data internally, I can imagine this even being used to "move" files between filesystems in the same pool. No payload disk blocks need to be touched, only metadata.

Ok, there are cases in which blocksize, compression, etc. are set up differently. But IIRC, these only take effect for newer files; old files keep what is already on disk. So it appears to me that this is not a problem. Maybe crypto is, as @darthur already noted in 2011, but that is not even a real feature yet...

There's already such a feature in WAFL. Please take a look at this: http://community.netapp.com/t5/Microsoft-Cloud-and-Virtualization-Discussions/What-is-SIS-Clone/td-p/6462

Let's start cloning single files!!!

@ilovezfs (Contributor) commented Jan 31, 2016

@dioni21 perhaps you should consider a zvol formatted with btrfs.
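
In practice that suggestion looks like this (real commands; names and sizes are hypothetical, and it nests two CoW filesystems, with the performance caveats that implies):

```sh
zfs create -V 100G tank/btrfs-vol
mkfs.btrfs /dev/zvol/tank/btrfs-vol
mount /dev/zvol/tank/btrfs-vol /mnt/btrfs
cp --reflink=always /mnt/btrfs/a.img /mnt/btrfs/b.img  # reflinks work inside the zvol
```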

@dasJ commented Dec 8, 2016

I'm sorry to be the annoying guy who brings up old threads, but I found this issue and it seems really useful.
So I just wanted to ask: is this scheduled for any release? It may not make 0.7, but I'd love to see it after that. Is anyone else working on this feature? And are there even plans to implement it? Now that encryption will make it into 0.7 (I think), that's another factor to think of.

@galt commented Sep 27, 2017

This feature would be perfect for our needs. A user prepares large files for a project and submits them. Those files are currently copied to a directory for QA and other processing. Most of the time the files are OK and do not have to be re-submitted; however, because of the copy, we have massive duplication. Using ZFS snapshots does not really work for us: we would have to make a huge number of them and keep them around forever. A reflink copy would be perfect.

@mschilli87 commented Sep 28, 2017

@galt: While I agree that the COW feature outlined here would be great to have, it seems that for your use case, turning on deduplication on the ZFS dataset containing the originals and copies would already effectively save you the space currently occupied by the 'un-needed' identical copies. Or did I miss anything?

@nagisa commented Sep 28, 2017

Isn’t deduplication extremely RAM intensive?

@ilovezfs (Contributor) commented Sep 28, 2017

Yes.

@galt commented Sep 28, 2017

Instead of just copying the file, I would call it with

cp --reflink=always source target

which would mean that it operates at the whole-file level.

This is quite different from turning on the ZFS dedup feature and spending 1 GB of RAM per TB on the dedup hash. Also, it is a super fast operation, since only some metadata need be copied to make the reflink.

The feature has already been implemented on XFS (at least for testing) since January 2017.
http://strugglers.net/~andy/blog/2017/01/10/xfs-reflinks-and-deduplication/
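
Trying the XFS implementation mentioned above looks like this (reflink support must be chosen at mkfs time; device and mountpoint are hypothetical):

```sh
mkfs.xfs -m reflink=1 /dev/sdb1
mount /dev/sdb1 /mnt/xfs
cp --reflink=always /mnt/xfs/big.iso /mnt/xfs/big-clone.iso
df -h /mnt/xfs   # free space is essentially unchanged by the clone
```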

@galt commented Sep 28, 2017

APFS (Apple File System) was just released on 25 Sept 2017 with macOS High Sierra and iOS. It has many of the great features of ZFS, btrfs, and OCFS2, like copy-on-write, reflinks, snapshots, and clones.

@galt commented Sep 28, 2017

Quote:

> [...] one I know that works also in XFS is duperemove.
> https://github.com/markfasheh/duperemove
> You will need to use a git checkout of duperemove for this to work.
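
A typical duperemove invocation on a filesystem that implements the dedupe ioctl (btrfs, or XFS as above; ZFS does not) would be:

```sh
duperemove -dhr /mnt/xfs
#   -d  actually submit dedupe requests (otherwise it is a dry run)
#   -h  print human-readable sizes
#   -r  recurse into subdirectories
```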

@rouben commented Sep 29, 2017

Shouldn’t this feature technically be implemented upstream, via the OpenZFS project? This doesn’t seem like a ZFS on Linux issue...

Second question: is it even practical to implement given ZFS’s design? Based on what I’ve read, reflinks were never really part of the ZFS design to begin with and may therefore not be practical to implement...

@galt commented Sep 29, 2017

First question: good point. OpenZFS it should be.

Second question: most of the copy-on-write machinery needed is already in place. It is fairly simple to make a reflink copy, compared to making snapshots and clones: it basically just has to duplicate the inode and its associated metadata (but not the actual data blocks of the file). Maybe it needs to update some flags or counts on the inodes and blocks.

@galt commented Sep 29, 2017

Can somebody point me to the OpenZFS repository? I do not see one.

This page implies there isn't one:
http://open-zfs.org/wiki/FAQ#Do_you_plan_to_release_OpenZFS_under_a_license_other_than_the_CDDL.3F

Quote:

> Why are there four different repositories?
>
> Each repository supports a different operating system. Even though the core of OpenZFS is platform-independent, there is a significant number of platform-specific changes that need to be maintained for the parts of ZFS which interact with the rest of the operating system (VFS, memory management, disk I/O, etc.).
>
> Are new features and improvements shared between the different repositories?
>
> Yes. Each implementation regularly ports platform-independent changes from the other implementations. One of the goals of OpenZFS is to simplify this porting process.

@galt commented Sep 29, 2017

ryao wanted to add reflink to zfs in 2014.
https://lists.gt.net/gentoo/dev/285286?do=post_view_threaded

@rouben commented Sep 29, 2017

@galt Looks like there is no "main" OpenZFS repository. I suspect the closest thing you'd find to that would be the original OpenSolaris ZFS build, which would be the Illumos and/or OpenIndiana flavours. Having said that, it looks like there is an effort (or interest thereof) to unify the code (core ZFS code and various platform-specific porting layers): http://open-zfs.org/wiki/Reduce_code_differences

That's bad... imagine the ZoL team figures out reflinks; I would assume it would be non-trivial to port that feature to Illumos or FreeBSD, for example. I hope the above unification initiative takes off, otherwise the ZFS codebase will become too fragmented and difficult to support in the long run... 😞

@gmelikov (Member) commented Sep 29, 2017

@galt @rouben there is no problem at all with different repos, etc.; it's just non-trivial to add this functionality in ZFS.

@behlendorf (Member) commented Sep 29, 2017

The upstream OpenZFS repository is located at https://github.com/openzfs/openzfs. Features developed on Linux, FreeBSD, and Illumos are fed back upstream to this repository as appropriate. Each platform then pulls back down the changes it needs.

Regarding reflink support, this is something which has been discussed at previous OpenZFS developer summits, and several possible designs have been proposed. It's definitely doable, but we want to do it in an efficient, portable way which ideally all the platforms can take advantage of.

@behlendorf (Member) commented Sep 29, 2017

I should have added that anyone who is interested is more than welcome to join us at the upcoming 2017 OpenZFS Developer Summit where we'll be discussing all things OpenZFS!

@cwedgwood (Contributor) commented Nov 23, 2017

FWIW, Oracle has done this internally; see https://www.youtube.com/watch?v=c1ek1tFjhH8#t=18m55s for a slide referring to it (there is also a Q&A later in the talk confirming this is reflink).

@galt commented Nov 23, 2017 (three comments, since minimized)

@kpande (Contributor) commented Nov 23, 2017

Can you please refrain from flooding the issue comments? It unnecessarily causes spam.

@mtalexan commented Jun 5, 2018

Bumping again.
With the semi-recent push by Ubuntu of Kubernetes based on LXD, which recommends a ZFS pool for performance, ZFS in general is getting a lot more attention. While snapshots/dedup are useful when I spin up a copy of a whole container file system, copying large files from inside a container to outside it, or between containers, is still as slow as on any other file system. If --reflink=always behavior were available for the LXD tools to build on, that would be a significant end-user-visible improvement.

In my individual use case, we have automated build servers for CI that need to build many different variants of a very large code base. We clone once and do basic setup, then make many copies of the directory to run different variant builds on. We don't touch any of the existing files during the builds, we only add to them, but since disk IO is a major factor we can't use overlay-type mounting options. We also won't tie ourselves to functioning on only one file system, as would be necessary if this were implemented as snapshots for each variant. The useful implementation would be a transparent one, something like a default --reflink=always behavior in the cp command.

I think it's pretty clear that a --reflink=always cp would be the most common use of COW for most ZFS users if it were available, as this issue, the number of issues that have been merged into this one, and a simple web search for questions related to COW in file systems all indicate.

@shodanshok commented Aug 30, 2018

Bumping again. Any news on the matter? Given the CoW nature of ZFS, reflink is an obvious (but not necessarily trivial to implement) feature...

@kpande (Contributor) commented Aug 30, 2018

I've asked nicely before to stop flooding this ticket with useless "me too" comments. If you want it, implement it and open a PR. This one is getting locked, and any other "requests" opened for the same thing will be closed, too.

@zfsonlinux locked as off-topic and limited conversation to collaborators Aug 30, 2018
