
COW cp (--reflink) support #405

Closed
torn5 opened this issue Sep 21, 2011 · 132 comments
Labels
Type: Feature Feature request or new feature

Comments

@torn5

torn5 commented Sep 21, 2011

This is a feature request for an implementation of BTRFS_IOC_CLONE in zfsonlinux, or something similar, so that it's possible to COW-copy a single file in zero space, zero RAM and zero time, without having to enable super-expensive features like filesystem-wide deduplication (which, by the way, is not zero RAM or zero time).

If it can be done at the directory level, so that entire directory trees can be cloned with one call, even better.

On the mailing list, doubts were raised regarding semantics on:

  1. Quotas, especially when the "copy" starts to be modified.
  2. Source and destination datasets don't use the same compression.
  3. Source and destination datasets don't use the same encryption key.
  4. Source and destination datasets don't have the same "copies" attribute.
  5. Source and destination datasets don't have the same recordsize.

Firstly, I don't expect this to work across datasets. Secondly, I'd suggest using the same semantics as deduplication: it should just be a shortcut for 1) enabling deduplication and 2) copying the file by reading it byte-by-byte and writing it byte-by-byte elsewhere.

If you can implement exactly the BTRFS_IOC_CLONE, the same cp with --reflink that works for btrfs could be used here. If the IOC is different, we will also need patches to the cp program or another cp program.
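
For illustration, this is the kind of invocation being requested (hypothetical paths; a sketch of the desired interface, not something ZFS supported at the time):

cp --reflink=always /tank/data/vm.img /tank/data/vm-clone.img   # instant, no data blocks copied; later writes COW only the touched blocks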

@behlendorf
Contributor

Link to zfs-discuss thread for this feature request:
http://groups.google.com/a/zfsonlinux.org/group/zfs-discuss/browse_thread/thread/9af181ed0129b77c#

@darthur

darthur commented Oct 1, 2011

I have my doubts regarding whether this is needed -- because the functionality is already available at the dataset (filesystem) level, and ZFS is intended to be deployed with fine-grained datasets (filesystems).

  1. I see no reason why quotas should be special for this feature. If you turn on ZFS copies, any new write already writes 2 or 3 times the number of blocks an application would expect. ZFS-based systems are intended to have enough free space, and quotas are both difficult to calculate/understand and more forgiving than quotas on systems that do not have COW.
  2. Same compression does not matter. If you have a cloned dataset and change the compression on one of them -- existing data is unchanged, the new compression setting only matters for new writes.
  3. Clearly if encryption keys are different, then the data may not be stored in the same blocks and COW does not apply.
  4. The copies attribute, like the compression settings, only applies to new writes. This is no different semantically than cloning a dataset (or promoting a snapshot).
  5. This issue is similar to 3.

COW is already implemented across specific datasets -- e.g. clones or datasets with snapshots (promoted or otherwise). Therefore, I propose a more generally useful version of this request: implement or allow COW for all copy operations in the same pool, based on a setting implemented at both the dataset and pool level.

@akorn
Contributor

akorn commented Sep 20, 2012

Just to reiterate here what has been discussed in https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups=#!msg/zfs-discuss/mvGB7QEpt3w/r4xZ3eD7nn0J -- snapshots, dedup and so on already provide most of the benefits of COW hardlink breaking, but not all.

Specifically, binaries and libs with the same inode number get mapped to the same memory locations for execution, which results in memory savings. This matters a lot with container based virtualisation where you may have dozens or hundreds of identical copies of libc6 or sshd or apache in memory. KSM helps there, but it needs madvise() and is not free (it costs CPU at runtime).

COW hardlinks are essentially free.

linux-vserver (http://linux-vserver.org/) already patches ext[234] and I think jfs to enable this sort of COW hardlink breaking functionality by abusing the immutable attribute and an otherwise unused attribute linux-vserver repurposes as "break-hardlink-and-remove-immutable-bit-when-opened-for-writing".

It would be great to have the same feature in zfsonlinux; its lack is one of the only reasons I can't put my vservers volume on zfs (the lack of posix or nfsv4 acl support is the other reason).

So, to clarify, I'm requesting the following semantics:

  1. create a hardlink to a file and set some xattr(?) on it.
  2. if the file is opened for reading, everything proceeds as normal.
  3. if the file is opened for writing, the instance of the file that was to be opened gets removed, and a new copy of the original file placed there; the open() succeeds and the resulting file handle refers to the new copy.

This wouldn't break existing applications because the feature would not be enabled by default (you'd have to set the special xattr on a file to use it).
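
A hypothetical illustration of those semantics in shell terms (the attribute name is made up; nothing like this exists in ZFS today):

ln /srv/base/usr/lib/libc.so.6 /srv/guest1/usr/lib/libc.so.6
setfattr -n user.cow-break-on-write -v 1 /srv/guest1/usr/lib/libc.so.6
# reads and mmaps still share the same inode and page cache;
# the first open for writing would transparently replace this link with a private copy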

@cburroughs
Contributor

As was brought up in the thread, we are currently using http://www.xmailserver.org/flcow.html on ext4 for file/dir level COW. This works, but if we were using ZFS we would much prefer to have the filesystem take care of the COW goodness. (For our narrow use case we can probably do everything we need with a filesystem per directory, but having our code just work with `cp` would be nice to have.)

@torn5
Author

torn5 commented Aug 2, 2013

I would like to bump this feature. When I submitted it 2 years ago there were about 30 issues ahead of it; now there are about 500. It keeps moving farther and farther back as new issues are created ahead of it, like the expansion of the universe. I do understand this is a feature request and not a bugfix, but it would make ZFS a helluva lot more appealing for people.
Snapshotting the whole ZFS filesystem to achieve a clone of one file is definitely overkill and cannot substitute for this.

@ryao
Contributor

ryao commented Sep 25, 2013

One of the problems with implementing this is that directory entries are implemented as name-value pairs, which, at a glance, provides no obvious way of doing this. I just noticed today that the value is divided into 3 sections: the top 4 bits indicate the file type, the bottom 48 bits are the actual object, and the middle 12 bits are unused. One of those unused bits could be repurposed to implement reflinks.

Implementing reflinks would require much more than marking a bit, but the fact that we have spare bits available should be useful information to anyone who decides to work on this.

@torn5
Author

torn5 commented Sep 25, 2013

Hi Ryao, thanks for noticing this :-) If deduplication does not need such bits or a different directory structure, then reflink should not need them either. I see reflink as a way to copy+deduplicate a specific file without the costs associated with copying and with deduplication, but with the same final result... Is it not so?

@behlendorf
Contributor

@torn5 I believe you're correct, this could be done relatively easily by leveraging dedup. Basically, the reflink ioctl() would provide a user interface for per-file deduplication. As long as we're talking about a relatively small number of active files, the entire deduplication table should be easily cacheable and performance will be good. If implemented this way we'd inherit the existing dedup behavior for quotas and such. This makes the most sense for individual files in larger filesystems; for directories, creating a new dataset would still be best.
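
A rough manual approximation of what is described above, for anyone who wants the effect today (hypothetical dataset and file names; this is ordinary dedup, not a real reflink):

zfs set dedup=on tank/data
cp /tank/data/big.img /tank/data/big-copy.img   # data is re-read and re-written,
                                                # but the new blocks dedup against the source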

@hufman
Contributor

hufman commented Jan 17, 2014

Here is a scenario that I think this feature would be very helpful for:

I take regular snapshots of my video collection. Because of COW, these snapshots do not take any extra space. However, a (young relative|virus|hostile alien) comes for a visit and deletes some videos from my collection, and I would like to recover them from my handy snapshots. If I use cp normally, each recovered video is duplicated in the snapshots and in the active space. With cp --reflink, the filesystem would be signaled to COW the file to a new name without taking any additional space, and recovery would be instantaneous.
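
A sketch of the invocation this scenario envisions (hypothetical paths; later comments in this thread discuss why the snapshot case turned out to be harder than it looks):

cp --reflink=auto /tank/videos/.zfs/snapshot/weekly-2014-01-12/holiday.mkv /tank/videos/holiday.mkv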

As an aside, is there a way to scan a ZFS volume and run an offline deduplication? If I had copied the data, is there a way to recover the space other than deleting all snapshots that contained the data?

@akorn
Contributor

akorn commented Jan 17, 2014

On Thu, Jan 16, 2014 at 05:16:38PM -0800, hufman wrote:

I take regular snapshots of my video collection. Because of COW, these
snapshots do not take any space. However, a (young relative|virus|hostile
alien) comes for a visit and deletes some videos from my collection, and I
would like to recover them from my handy snapshots. If I use cp normally,
each recovered video is duplicated in snapshots and in the active space.
With cp --reflink, the file system would be signaled to COW the file to a
new name, without taking any additional space, along with making recovery
instantaneous.

I'm not sure I see how that would work; it would need cross-filesystem
reflink support (since you'd be copying out of a snapshot and into a real
fs).

Normally, to recover from such situations, you'd just roll back to the
latest snapshot that still has the missing data. Of course, if you'd like to
keep some of the changes made since then, this is less than ideal.

If this is a frequent occurrence, maybe you'd like to turn on deduplication.
In that case, copying the files out of the snapshot will not use extra
space.

As an aside, is there a way to scan a ZFS volume and run an offline
deduplication?

None that I know of. What you could do is: enable dedup, then copy each file
to a new name, remove the original file and rename the new file to the old
name. This obviously does nothing to deduplicate existing snapshots.
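
A rough sketch of that workaround (hypothetical paths, regular files only; assumes dedup was already enabled on the dataset):

for f in /tank/data/*; do
    cp "$f" "$f.dd" && mv "$f.dd" "$f"   # copy to a new name, then replace the original
done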

If I had copied the data, is there a way to recover the
space other than deleting all snapshots that contained the data?

Other than rolling back to a snapshot, no, I don't think so.

Andras

                            E = mc^2 + 3d6

@hufman
Contributor

hufman commented Jan 19, 2014

Thank you for your response!

@rocallahan

My use-case for reflink is that we are building a record and replay debugging tool (https://github.com/mozilla/rr) and every time a debuggee process mmaps a file, we need to make a copy of that file so we can map it again later unchanged. Reflink makes that free. Filesystem snapshots won't work for us; they're heavier-weight, clumsy for this use-case, and far more difficult to implement in our tool.

@ThisGuyCodes

I also have a use for this. I use different ZFS filesystems to control differing IO requirements. Currently, for my app to move items between these filesystems it must do a full copy, which works, but makes things fairly unresponsive fairly regularly as it copies dozens of gigabytes of data. I would consider using deduplication, but I'm fairly resource constrained as it is.

(Several comments from @yshui, @ryao, and @aarcane have been minimized.)

@behlendorf
Contributor

If someone does decide to work on this I'm happy to help point them in the right direction.

@ryao
Contributor

ryao commented Jul 31, 2014

I had jotted down a possible way of doing this in #2554 in response to an inquiry about btrfs-style deduplication via reflinks. I am duplicating it here for future reference:

Directories in ZFS are name-value pairs. Adding reflinks to that is non-trivial. One idea that might work would be to replace the block pointers in the indirection tree with object identifiers and utilize ZAP to store mappings between the object id and a (reference count,block pointer) tuple. That would involve a disk format change and would only apply to newly created files. Each block access would suffer an additional indirection. Making reflinks under this scheme would require duplicating the entire indirection tree and updating the corresponding ZAP entries in a single transaction group.

Thinking about it, the indirection tree itself is always significantly smaller than the data itself, so the act of writing would be bounded to some percentage of the data. We already suffer from this sort of penalty with the existing deduplication implementation, but at a much higher cost, as we must perform a random seek on each data access. Letting users copy only metadata through reflinks seems preferable to direct data copies by shuffling data through userland. This could avoid the penalties in the presence of dedup=on because all of the data has been preprocessed by our deduplication algorithm.

That being said, the DDT has its own reference counts for each block, so we would either need to implement this in a way compatible with that or change it. We also need to consider the interaction with snapshots.

@ryao
Contributor

ryao commented Sep 29, 2014

Here is a technique that might be possible to apply:

https://www.usenix.org/legacy/event/lsf07/tech/rodeh.pdf

There are a few caveats:

  1. Care needs to be taken with the interaction between this and dataset snapshots. The snapshots contain a list of referenced blocks that might need to be updated.
  2. As I said in July, the DDT has reference counts and we would need to implement this in a way that is compatible.
  3. The technique described cannot be applied to existing ZFS datasets. The manner in which Jeff designed the indirection tree exploits some properties of numbers that keep us from suddenly repurposing two entries, which the algorithm requires. It might also limit our maximum file size (I need to check), because the loss of 2 entries in the indirection tree means that the total addressable storage is reduced. This means that we could only support reflinks on new datasets created with this feature if it is implemented. Old datasets could not be upgraded, although send/recv likely could be used to create new ones in the new format.

@batrick

batrick commented Apr 6, 2015

I'm late to this party but I want to give a definitive and real use-case for this that is not satisfied by clones.

We have a process sandbox which imports immutable files. Ideally, each imported file may be modified by the process, but those modifications shouldn't change the immutable originals. Okay, that could be solved with a cloned filesystem and COW. However, from this sandbox we also want to extract (possibly very large) output files. This throws a wrench in our plans: we can't trivially import data from a clone back into the parent filesystem without doing a byte-by-byte copy.

I think this is a problem worth solving. Clones (at least, as they currently exist) are not the right answer.

@dioni21
Contributor

dioni21 commented Jan 31, 2016

Sorry if I lost something in this long-running discussion, but it seems to me that everybody is thinking in terms of snapshots, clones, dedup, etc.

I am, personally, a big fan of dedup. But I know it has many memory drawbacks, because it is done online. In BTRFS and WAFL, dedup is done offline, so all that memory is only used during the dedup process.

I think the original intent of this request is to add "clone/dedup" functionality to cp. Not by enabling online dedup on the filesystem, nor by first copying and then deduping the file: let the filesystem just create another "file" instance whose data is a set of CoW sectors from another file.

Depending on how ZFS arranges data internally, I can imagine this even being used to "move" files between filesystems on the same pool. No payload disk block would need to be touched, only metadata.

Ok, there are cases in which blocksize, compression, etc. are set up differently. But IIRC, those settings only apply to newly written files; old files keep what is already on disk. So it appears to me that this is not a problem. Maybe crypto is, as @darthur already noted back in 2011, but that is not even a real feature yet...

There's already such a feature in WAFL. Please take a look at this: http://community.netapp.com/t5/Microsoft-Cloud-and-Virtualization-Discussions/What-is-SIS-Clone/td-p/6462

Let's start cloning single files!!!

@ilovezfs
Contributor

@dioni21 perhaps you should consider a zvol formatted with btrfs.

@robn
Contributor

robn commented Oct 3, 2023

I've tried to reflink a big file with --reflink=auto across different datasets and it took 3 seconds compared to 0.1 seconds when the dataset was the same. reflinking clearly didn't work across datasets, even with auto.

Yeah, this is on me - in my haste I said auto will clone; I should have said, it has the chance to clone. There's plenty of reasons that copy_file_range might not create a clone.

(also, run time is not a great indicator of whether or not a clone happened, but that's by the by).

Basically: auto is "I don't care how you do this". always is "this must be a clone".
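
In cp terms (hypothetical file names):

cp --reflink=always src.img dst.img   # fails with an error if the kernel refuses to clone
cp --reflink=auto   src.img dst.img   # clones when it can, silently falls back to a normal copy otherwise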

@shodanshok
Contributor

OpenZFS implements both of these. However, Linux places additional restrictions on ioctl(FICLONE) that it does not with copy_file_range - the former requires both source and destination to have the same superblock (and before 5.19, the same mountpoint) while the latter only requires the same filesystem driver. If Linux's own checks do not pass, then it will not even call into OpenZFS to do the work.

As kernel 5.19+ removed the restriction about mountpoints, what now prevents FICLONE from working between datasets?

Thanks.

@robn
Contributor

robn commented Oct 3, 2023

They have to have the same superblock. Btrfs subvolumes share a superblock, so this works there, but ZFS datasets have separate superblocks and so fail that check.
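
One way to see this from userspace (hypothetical paths on two different datasets): each dataset is its own mount with its own superblock, which shows up as a distinct device number, and FICLONE refuses to cross that boundary.

stat -c '%d %n' /tank/a/file.img /tank/b/file.img   # different st_dev values, hence different superblocks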

@darkbasic

darkbasic commented Oct 3, 2023

Would it be possible to use a similar trick as btrfs does or to remove the limitation upstream?

@rouben

rouben commented Oct 3, 2023

Would it be possible to use a similar trick as btrfs does or to remove the limitation upstream?

I don't think that's a "trick," more like a fundamental design difference. I may be wrong, of course, and defer to more knowledgeable people on this one.

@robn
Contributor

robn commented Oct 3, 2023

Yeah, no trick. The superblock structure is where a lot of the Linux-side metadata and accounting are held for the mount, as well as being the "key" in a superblock->dataset mapping, so when Linux calls into OpenZFS we can find the right dataset. Trying to share a single superblock across multiple datasets would significantly complicate things inside OpenZFS, assuming it's even possible to do without causing real issues.

("how does Btrfs do it then?", you'll ask. And I don't know, but I do know it is a fundamentally different system with a very different scope, so I can easily imagine its needs are quite different internally. I don't think its useful to compare).

Removing the limitation upstream is technically trivial, but working with Linux upstream takes a lot of time and energy that I'm not sure many have available (I don't). It's also of limited benefit to anyone not on bleeding-edge kernels, which is the majority of OpenZFS users.

(Fun fact: if you install 2.2-rc4 on RHEL 7, ioctl(FICLONE) can do cross-dataset clones just fine, because that ancient kernel doesn't even know what FICLONE is and just passes it through untouched to OpenZFS.)

@darkbasic

Its also of limited benefit to anyone not on the bleeding-edge kernels, which is the majority of OpenZFS users.

I can assure you that soon enough LTS users will face the very same issue: bleeding edge doesn't remain so forever :)

@robn
Contributor

robn commented Oct 4, 2023

I can assure you that soon enough LTS users will face the very same issue: bleeding edge doesn't remain so forever :)

Be that as it may, it doesn't change anything. There isn't one neat trick to solve this particular issue, and the block cloning feature is still very useful even with this shortcoming.

@darkbasic

and the block cloning feature is still very useful even with this shortcoming

Unfortunately not so much for my use case, and there is nothing I can do to work around it because the dataset is actually the same:

[niko@arch-phoenix ~]$ cp --reflink=always /home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz .
cp: failed to clone './linux-mainline.tar.gz' from '/home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz': Invalid cross-device link
[niko@arch-phoenix ~]$ mount | grep home
rpool/home on /home type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)
rpool/home@zrepl_20231004_074556_000 on /home/.zfs/snapshot/zrepl_20231004_074556_000 type zfs (ro,relatime,xattr,posixacl,casesensitive)

I want to use reflinks to restore data from snapshots (mainly virtual machine images) without having to roll back the dataset. Rolling back I would lose subsequent snapshots, and I cannot use clones because that would break zrepl replication.

@Atemu

Atemu commented Oct 4, 2023

Could you create a new issue on cross-dataset reflinks? I think that's far past this issue's scope.

@robn
Contributor

robn commented Oct 4, 2023

A filesystem and a snapshot are not the same dataset.

@darkbasic

@Atemu I already did: #15345

@shodanshok
Contributor

cp --reflink=always

Have you tried using cp --reflink=auto? It should work without issues.

@darkbasic

darkbasic commented Oct 4, 2023

Have you tried using cp --reflink=auto? It should work without issues.

Yes I did, but unfortunately it doesn't work. I tried timing how long it takes to copy the file and even checking the pool's available space before and after the copy.
In both cases it looked like it didn't clone anything.
Do you have any idea why auto wouldn't work on my system? I'm on Linux 6.6 + zfs-2.2-release.

@shodanshok
Contributor

shodanshok commented Oct 4, 2023

I am on stable LTS distros such as Debian 12 and Rocky 9, so I don't have first-hand experience with such recent kernels. On a Debian 12 box, strace cp --reflink=auto /tank/test/.zfs/snapshot/snap1/test.img test2.img shows the following:

ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = -1 EXDEV (Invalid cross-device link)
newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 1048576
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0

Notice how FICLONE was tried, failed, and cp switched to copy_file_range. Can you try the same on your system?
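
To reproduce with just the interesting calls (assuming GNU coreutils cp and hypothetical file names):

strace -f -e trace=ioctl,copy_file_range cp --reflink=auto srcfile dstfile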

@darkbasic

@shodanshok it seems like --reflink=auto falls back to copy_file_range on my system as well. The problem is that copy_file_range does not create a clone!

[niko@arch-phoenix ~]$ strace cp --reflink=auto ~/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz .
execve("/usr/bin/cp", ["cp", "--reflink=auto", "/home/niko/.cache/yay/chromium-w"..., "."], 0x7fffecdf1308 /* 54 vars */) = 0
brk(NULL)                               = 0x55de49dc3000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffdfbebaf90) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=132935, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 132935, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb49653a000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=34688, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb496538000
mmap(NULL, 32800, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb49652f000
mmap(0x7fb496531000, 16384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fb496531000
mmap(0x7fb496535000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7fb496535000
mmap(0x7fb496536000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7fb496536000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=26496, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 28696, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb496527000
mmap(0x7fb496529000, 12288, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7fb496529000
mmap(0x7fb49652c000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7fb49652c000
mmap(0x7fb49652d000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7fb49652d000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220~\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=1948832, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1973104, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb496345000
mmap(0x7fb49636b000, 1417216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7fb49636b000
mmap(0x7fb4964c5000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x180000) = 0x7fb4964c5000
mmap(0x7fb496519000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d3000) = 0x7fb496519000
mmap(0x7fb49651f000, 31600, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fb49651f000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb496342000
arch_prctl(ARCH_SET_FS, 0x7fb496342740) = 0
set_tid_address(0x7fb496342a10)         = 20991
set_robust_list(0x7fb496342a20, 24)     = 0
rseq(0x7fb496343060, 0x20, 0, 0x53053053) = 0
mprotect(0x7fb496519000, 16384, PROT_READ) = 0
mprotect(0x7fb49652d000, 4096, PROT_READ) = 0
mprotect(0x7fb496536000, 4096, PROT_READ) = 0
mprotect(0x55de485f4000, 4096, PROT_READ) = 0
mprotect(0x7fb49658c000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7fb49653a000, 132935)          = 0
getrandom("\x7e\x43\xdf\x71\xda\x7c\xa9\x79", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x55de49dc3000
brk(0x55de49de4000)                     = 0x55de49de4000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=3052896, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 3052896, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb496058000
close(3)                                = 0
geteuid()                               = 1000
newfstatat(AT_FDCWD, "/home/niko/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz", {st_mode=S_IFREG|0644, st_size=3141005880, ...}, 0) = 0
newfstatat(AT_FDCWD, "chromium-117.0.5938.132.tar.xz", 0x7ffdfbebaaa0, 0) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/niko/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=3141005880, ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "chromium-117.0.5938.132.tar.xz", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = -1 EXDEV (Invalid cross-device link)
newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
uname({sysname="Linux", nodename="arch-phoenix", ...}) = 0
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 2147479552
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 993526328
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0
close(4)                                = 0
close(3)                                = 0
lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
close(0)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

@scineram since you've put a "thumbs down" on my previous comment, may I ask why you're against using reflinks to restore data from snapshots? Is there any reason why that would be considered bad practice?

@shodanshok
Contributor

shodanshok commented Oct 5, 2023

It is perfectly normal for cp --reflink=auto to fall back to copy_file_range; otherwise (since cross-dataset FICLONE is unavailable) it would not clone anything at all. Basically, copy_file_range tells the kernel "please copy that, with or without reflinks". So the key question is why, on your system, copy_file_range can't use reflinks.

Please provide the output of zpool get all | grep clone; zfs get all src/dataset; zfs get all dst/dataset; zdb -Ovv src/dataset srcfile; zdb -Ovv dst/dataset dstfile (after a failed cp --reflink=auto attempt).

@darkbasic

Unfortunately zdb -O doesn't work:

[root@arch-phoenix ~]# zfs list | grep home
rpool/home                                                                      124G  3.29T  2.08G  /home
rpool/home/niko                                                                 101G  3.29T   192K  /home/niko
rpool/home/niko/.cache                                                         10.8G  3.29T  10.8G  /home/niko/.cache
rpool/home/root                                                                12.0M  3.29T  10.9M  /root

[root@arch-phoenix ~]# mount | grep home
rpool/home on /home type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)
rpool/home/root on /root type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)
rpool/home/niko/.cache on /home/niko/.cache type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)

[root@arch-phoenix ~]# zpool get all | grep clone; \
zfs get all rpool/home/niko/.cache; \
zfs get all rpool/home; \
zdb -Ovv rpool/home/niko/.cache /home/niko/.cacheyay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz; \
cp --reflink=auto /home/niko/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz /home/niko/; \
zdb -Ovv rpool/home niko/chromium-117.0.5938.132.tar.xz
rpool  bcloneused                     206M                           -
rpool  bclonesaved                    227M                           -
rpool  bcloneratio                    2.10x                          -
NAME                    PROPERTY                   VALUE                      SOURCE
rpool/home/niko/.cache  type                       filesystem                 -
rpool/home/niko/.cache  creation                   Sun Oct  1  9:58 2023      -
rpool/home/niko/.cache  used                       10.8G                      -
rpool/home/niko/.cache  available                  3.29T                      -
rpool/home/niko/.cache  referenced                 10.8G                      -
rpool/home/niko/.cache  compressratio              1.32x                      -
rpool/home/niko/.cache  mounted                    yes                        -
rpool/home/niko/.cache  quota                      none                       default
rpool/home/niko/.cache  reservation                none                       default
rpool/home/niko/.cache  recordsize                 128K                       default
rpool/home/niko/.cache  mountpoint                 /home/niko/.cache          inherited from rpool/home
rpool/home/niko/.cache  sharenfs                   off                        default
rpool/home/niko/.cache  checksum                   on                         default
rpool/home/niko/.cache  compression                zstd                       inherited from rpool
rpool/home/niko/.cache  atime                      on                         default
rpool/home/niko/.cache  devices                    off                        inherited from rpool
rpool/home/niko/.cache  exec                       on                         default
rpool/home/niko/.cache  setuid                     on                         default
rpool/home/niko/.cache  readonly                   off                        default
rpool/home/niko/.cache  zoned                      off                        default
rpool/home/niko/.cache  snapdir                    hidden                     default
rpool/home/niko/.cache  aclmode                    discard                    default
rpool/home/niko/.cache  aclinherit                 restricted                 default
rpool/home/niko/.cache  createtxg                  491                        -
rpool/home/niko/.cache  canmount                   on                         default
rpool/home/niko/.cache  xattr                      sa                         inherited from rpool
rpool/home/niko/.cache  copies                     1                          default
rpool/home/niko/.cache  version                    5                          -
rpool/home/niko/.cache  utf8only                   on                         -
rpool/home/niko/.cache  normalization              formD                      -
rpool/home/niko/.cache  casesensitivity            sensitive                  -
rpool/home/niko/.cache  vscan                      off                        default
rpool/home/niko/.cache  nbmand                     off                        default
rpool/home/niko/.cache  sharesmb                   off                        default
rpool/home/niko/.cache  refquota                   none                       default
rpool/home/niko/.cache  refreservation             none                       default
rpool/home/niko/.cache  guid                       17373078615024374509       -
rpool/home/niko/.cache  primarycache               all                        default
rpool/home/niko/.cache  secondarycache             all                        default
rpool/home/niko/.cache  usedbysnapshots            0B                         -
rpool/home/niko/.cache  usedbydataset              10.8G                      -
rpool/home/niko/.cache  usedbychildren             0B                         -
rpool/home/niko/.cache  usedbyrefreservation       0B                         -
rpool/home/niko/.cache  logbias                    latency                    default
rpool/home/niko/.cache  objsetid                   1297                       -
rpool/home/niko/.cache  dedup                      off                        default
rpool/home/niko/.cache  mlslabel                   none                       default
rpool/home/niko/.cache  sync                       standard                   default
rpool/home/niko/.cache  dnodesize                  auto                       inherited from rpool
rpool/home/niko/.cache  refcompressratio           1.32x                      -
rpool/home/niko/.cache  written                    10.8G                      -
rpool/home/niko/.cache  logicalused                14.2G                      -
rpool/home/niko/.cache  logicalreferenced          14.2G                      -
rpool/home/niko/.cache  volmode                    default                    default
rpool/home/niko/.cache  filesystem_limit           none                       default
rpool/home/niko/.cache  snapshot_limit             none                       default
rpool/home/niko/.cache  filesystem_count           none                       default
rpool/home/niko/.cache  snapshot_count             none                       default
rpool/home/niko/.cache  snapdev                    hidden                     default
rpool/home/niko/.cache  acltype                    posix                      inherited from rpool
rpool/home/niko/.cache  context                    none                       default
rpool/home/niko/.cache  fscontext                  none                       default
rpool/home/niko/.cache  defcontext                 none                       default
rpool/home/niko/.cache  rootcontext                none                       default
rpool/home/niko/.cache  relatime                   on                         inherited from rpool
rpool/home/niko/.cache  redundant_metadata         all                        default
rpool/home/niko/.cache  overlay                    on                         default
rpool/home/niko/.cache  encryption                 aes-256-gcm                -
rpool/home/niko/.cache  keylocation                none                       default
rpool/home/niko/.cache  keyformat                  passphrase                 -
rpool/home/niko/.cache  pbkdf2iters                350000                     -
rpool/home/niko/.cache  encryptionroot             rpool                      -
rpool/home/niko/.cache  keystatus                  available                  -
rpool/home/niko/.cache  special_small_blocks       0                          default
rpool/home/niko/.cache  org.zfsbootmenu:keysource  roool/ROOT/archlinux       inherited from rpool
NAME        PROPERTY                   VALUE                      SOURCE
rpool/home  type                       filesystem                 -
rpool/home  creation                   Sun Oct  1  9:26 2023      -
rpool/home  used                       124G                       -
rpool/home  available                  3.29T                      -
rpool/home  referenced                 2.08G                      -
rpool/home  compressratio              1.72x                      -
rpool/home  mounted                    yes                        -
rpool/home  quota                      none                       default
rpool/home  reservation                none                       default
rpool/home  recordsize                 128K                       default
rpool/home  mountpoint                 /home                      local
rpool/home  sharenfs                   off                        default
rpool/home  checksum                   on                         default
rpool/home  compression                zstd                       inherited from rpool
rpool/home  atime                      on                         default
rpool/home  devices                    off                        inherited from rpool
rpool/home  exec                       on                         default
rpool/home  setuid                     on                         default
rpool/home  readonly                   off                        default
rpool/home  zoned                      off                        default
rpool/home  snapdir                    hidden                     default
rpool/home  aclmode                    discard                    default
rpool/home  aclinherit                 restricted                 default
rpool/home  createtxg                  14                         -
rpool/home  canmount                   on                         default
rpool/home  xattr                      sa                         inherited from rpool
rpool/home  copies                     1                          default
rpool/home  version                    5                          -
rpool/home  utf8only                   on                         -
rpool/home  normalization              formD                      -
rpool/home  casesensitivity            sensitive                  -
rpool/home  vscan                      off                        default
rpool/home  nbmand                     off                        default
rpool/home  sharesmb                   off                        default
rpool/home  refquota                   none                       default
rpool/home  refreservation             none                       default
rpool/home  guid                       18340805338908695946       -
rpool/home  primarycache               all                        default
rpool/home  secondarycache             all                        default
rpool/home  usedbysnapshots            21.2G                      -
rpool/home  usedbydataset              2.08G                      -
rpool/home  usedbychildren             101G                       -
rpool/home  usedbyrefreservation       0B                         -
rpool/home  logbias                    latency                    default
rpool/home  objsetid                   774                        -
rpool/home  dedup                      off                        default
rpool/home  mlslabel                   none                       default
rpool/home  sync                       standard                   default
rpool/home  dnodesize                  auto                       inherited from rpool
rpool/home  refcompressratio           2.21x                      -
rpool/home  written                    32.1M                      -
rpool/home  logicalused                202G                       -
rpool/home  logicalreferenced          3.91G                      -
rpool/home  volmode                    default                    default
rpool/home  filesystem_limit           none                       default
rpool/home  snapshot_limit             none                       default
rpool/home  filesystem_count           none                       default
rpool/home  snapshot_count             none                       default
rpool/home  snapdev                    hidden                     default
rpool/home  acltype                    posix                      inherited from rpool
rpool/home  context                    none                       default
rpool/home  fscontext                  none                       default
rpool/home  defcontext                 none                       default
rpool/home  rootcontext                none                       default
rpool/home  relatime                   on                         inherited from rpool
rpool/home  redundant_metadata         all                        default
rpool/home  overlay                    on                         default
rpool/home  encryption                 aes-256-gcm                -
rpool/home  keylocation                none                       default
rpool/home  keyformat                  passphrase                 -
rpool/home  pbkdf2iters                350000                     -
rpool/home  encryptionroot             rpool                      -
rpool/home  keystatus                  available                  -
rpool/home  special_small_blocks       0                          default
rpool/home  snapshots_changed          Thu Oct  5 21:12:52 2023   -
rpool/home  org.zfsbootmenu:keysource  roool/ROOT/archlinux       inherited from rpool
failed to hold dataset 'rpool/home/niko/.cache': No such file or directory
failed to hold dataset 'rpool/home': No such file or directory

@robn
Contributor

robn commented Oct 5, 2023

Add zpool sync after the cp; the change is not on disk yet, so zdb can't see it.

You also want the path relative to dataset on the calls to zdb, not the absolute paths.

zdb -Ovv rpool/home/niko/.cache yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz
zdb -Ovv rpool/home niko/chromium-117.0.5938.132.tar.xz

@robn
Contributor

robn commented Oct 5, 2023

Anyway, the problem is encryption. We can't clone encrypted blocks across datasets because the key material is partially bound to the source dataset (actually its encryption root). #14705 has a start on this.
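
For anyone checking their own setup, the key binding mentioned above can be inspected with (dataset names taken from the output earlier in this thread):

zfs get encryption,encryptionroot,keystatus rpool/home rpool/home/niko/.cache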

@darkbasic

You also want the path relative to dataset on the calls to zdb, not the absolute paths.

I already tried relative paths, but ended up copy-pasting the absolute ones in the end.

Add zpool sync after the cp; the change is not on disk yet, so zdb can't see it.

Unfortunately that's not it, because the same thing happens for the source file, which has been on disk for a very long time:

[niko@arch-phoenix ~]$ sudo sync
[niko@arch-phoenix ~]$ sudo zdb -Ovv rpool/home/niko/.cache yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz
failed to hold dataset 'rpool/home/niko/.cache': No such file or directory

Maybe this has something to do with encryption as well?

Anyway, the problem is encryption. We can't clone encrypted blocks across datasets because the key material is partially bound to the source dataset (actually its encryption root). #14705 has a start on this.

Thanks, I will follow that issue.

@robn
Contributor

robn commented Oct 6, 2023

Ahh yeah, you'll need zdb -K to get access to the encrypted parts (I noticed the encryption after I wrote that part).

lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Dec 12, 2023
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kay Pedersen <mail@mkwg.de>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Closes openzfs#15050
Closes openzfs#405
Closes openzfs#13349
@nrdxp

nrdxp commented Jan 7, 2024

Is there something special needed to enable reflinks? I ran a zfs upgrade -a on my pool to enable the block_cloning feature, but cp --reflink=always still failed with Operation not supported. Am I missing something? The file I attempted to copy is on the same dataset as the destination, for simplicity.

@ixhamza
Contributor

ixhamza commented Jan 8, 2024

@nrdxp - you would need to enable zfs_bclone_enabled tunable as well if you're using 2.2 release: #15529.
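
For reference, a sketch of how that tunable can be flipped at runtime (assumes the standard ZFS module parameter path; see the caution in the next comment before enabling it):

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_bclone_enabled
# to persist across reboots (hypothetical file name):
echo 'options zfs zfs_bclone_enabled=1' | sudo tee /etc/modprobe.d/zfs-bclone.conf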

@adamdmoss
Contributor

@nrdxp - you would need to enable zfs_bclone_enabled tunable as well if you're using 2.2 release: #15529.

I wouldn't though, personally. Yet. It was retroactively disabled by default for a reason.
It's a lot more robust in the master branch now, but those fixes aren't in 2.2 so far.

@nrdxp

nrdxp commented Jan 8, 2024

Yeah, I didn't realize it was still so unstable; definitely gonna hold off for a bit, but thanks for clarifying my confusion.
