Support cp --reflink for filesystems that support it. #132

Closed
sahib opened this issue May 11, 2015 · 24 comments

Comments

@sahib
Owner

sahib commented May 11, 2015

Support reflinks for filesystems that support it. We had this previously in #93.
But it was in a rough unfinished state as far as I know.

@sahib sahib added the feature label May 11, 2015
@sahib sahib added this to the 2.3.0 milestone May 11, 2015
@phiresky

Yes, this would be great (what is the current state of it?). Another option (for BTRFS) would be to use the btrfs-extent-same ioctl to handle duplicates like duperemove does, but that would probably be more work to implement.

@SeeSpotRun
Collaborator

@phiresky current state is that this is implemented in both master and develop branches, but has not been thoroughly tested. Usage:

rmlint -c sh:link path1 [path2...]

or more fully:

rmlint -o sh:foo.sh -c sh:link -o progressbar -o summary path1 [path2...]

The algorithm doesn't use the btrfs-extent-same ioctl at the moment; instead it's a bit clunky:

If we used BTRFS_IOC_FILE_EXTENT_SAME then the reflinking would happen during rmlint execution rather than during post-processing by the script. That's slightly misaligned with the current rmlint philosophy of doing all the actual deduplication via post-processing.

For btrfs deduplication there is already https://github.com/markfasheh/duperemove but maybe there's a case for a btrfs-specific rmlint option:

rmlint --btrfs-dedupe path1 [path2...]

...this would search for duplicates, call BTRFS_IOC_FILE_EXTENT_SAME for each, log successful calls, and write a shell script for post-processing of duplicates that couldn't be reflinked (i.e. failed calls to the ioctl). This should be a faster alternative to duperemove (which works at a block level and hashes all blocks), since rmlint works at a file level and has criteria to avoid unnecessary reads. Of course duperemove has the advantage of being able to partially deduplicate partially matched files.
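For reference, here's roughly what a single call to that ioctl looks like from C. This is only a hedged sketch, not rmlint code: the struct and constant names come from linux/btrfs.h on kernels that ship BTRFS_IOC_FILE_EXTENT_SAME, and error handling is kept minimal.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>   /* BTRFS_IOC_FILE_EXTENT_SAME, struct btrfs_ioctl_same_args */

/* Try to dedupe `length` bytes at offset 0 of src_fd against dest_fd.
 * Returns 0 on success, the per-destination status or -1 otherwise. */
static int dedupe_range(int src_fd, int dest_fd, __u64 length)
{
    struct btrfs_ioctl_same_args *args =
        calloc(1, sizeof(*args) + sizeof(struct btrfs_ioctl_same_extent_info));
    if (!args)
        return -1;

    args->logical_offset = 0;        /* where to start in the source file */
    args->length = length;           /* how many bytes to try to share    */
    args->dest_count = 1;            /* one destination file              */
    args->info[0].fd = dest_fd;
    args->info[0].logical_offset = 0;

    int ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
    if (ret == 0) {
        if (args->info[0].status == BTRFS_SAME_DATA_DIFFERS)
            fprintf(stderr, "contents differ; nothing was shared\n");
        else if (args->info[0].status == 0)
            printf("deduped %llu bytes\n",
                   (unsigned long long)args->info[0].bytes_deduped);
        ret = args->info[0].status;
    }
    free(args);
    return ret;
}

The kernel locks both ranges and compares the data itself before sharing extents, which is why a failed or mismatched call can simply be logged and left for the post-processing script.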

@phiresky

Thanks for the explanation!

If we used BTRFS_IOC_FILE_EXTENT_SAME then the reflinking would happen during rmlint execution rather than during post-processing by the script

Ah, I didn't think of that. But it could be done from the script; there just doesn't seem to be a userspace tool for it yet. (It could call duperemove $file $duplicate though ;)

@SeeSpotRun
Collaborator

Note that BTRFS_IOC_FILE_EXTENT_SAME has its own internal checks and locks so it should be inherently safe to call, even for non-matching files.

@SeeSpotRun
Collaborator

Experimental option --btrfs-dedup added at https://github.com/SeeSpotRun/rmlint/tree/feature-btrfs-extent-same
Usage:

rmlint --btrfs-dedup ...

This will run rmlint as usual but will also attempt to reflink duplicates to originals. If you don't want it to do anything else, use:

rmlint --btrfs-dedup -o progressbar -o summary [path]

@SeeSpotRun
Collaborator

Some follow-up:
Firstly, a bugfix: SeeSpotRun@03ab176

Secondly, however, it appears that the BTRFS_IOC_FILE_EXTENT_SAME ioctl is a bit hit-and-miss; it works for some file pairs and not others. I haven't spotted a pattern yet.

@sahib
Owner Author

sahib commented Aug 9, 2015

Would it be an idea to add a (hidden?) --btrfs-same-extent option to rmlint that calls BTRFS_IOC_FILE_EXTENT_SAME, so we become the userspace tool @phiresky was thinking of?

Edit: Clarification: --btrfs-same-extent would take the path of two files it should check for same extents.

@SeeSpotRun
Collaborator

That's basically what --btrfs-dedup is trying to be. I'm just trying to figure out how to debug it. I think I need to customise & compile the btrfs kernel module so that it gives more meaningful errors than -EINVAL.

@SeeSpotRun
Collaborator

OK, I'm giving up on btrfs-extent-same for now, mainly due to an unresolved bug: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/42512.
Instead, new experimental branch https://github.com/SeeSpotRun/rmlint/tree/feature-btrfs-clone uses BTRFS_IOC_CLONE (which is the same as used by cp --reflink). This doesn't have the built-in safety features of BTRFS_IOC_FILE_EXTENT_SAME (which locks both files, then verifies that content matches, before cloning) so use at own risk.
Usage:

$ rmlint --btrfs-clone file1 file2  # silently compare file1 to file2 and reflink them if they match
$ rmlint --btrfs-clone path1    # silently look for duplicates in path1 and reflink them if they match
$ rmlint --btrfs-clone -o progressbar -o summary path1    # as above but with progressbar and summary
$ rmlint --btrfs-clone -o pretty -o summary path1    # as above but with normal output and summary
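
For comparison, the clone ioctl itself is about as simple as it gets from C. The sketch below is not the branch code, just an illustration (assuming linux/btrfs.h provides BTRFS_IOC_CLONE on your kernel) of why there is no built-in safety net: the kernel swaps in references to the source extents without ever comparing the data.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>   /* BTRFS_IOC_CLONE */

/* Make dest_path a full reflink copy of src_path.  Unlike the
 * extent-same ioctl, the kernel does NOT compare contents first, so the
 * caller must already know the files are identical. */
static int reflink_clone(const char *src_path, const char *dest_path)
{
    int src = open(src_path, O_RDONLY);
    if (src < 0) {
        perror(src_path);
        return -1;
    }
    int dest = open(dest_path, O_WRONLY);
    if (dest < 0) {
        perror(dest_path);
        close(src);
        return -1;
    }

    /* The same ioctl that cp --reflink uses: dest's extents become
     * references to src's extents. */
    int ret = ioctl(dest, BTRFS_IOC_CLONE, src);
    if (ret < 0)
        perror("BTRFS_IOC_CLONE");

    close(src);
    close(dest);
    return ret;
}

That missing content check is why --btrfs-clone only calls it on files rmlint has already identified as duplicates.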

@SeeSpotRun
Collaborator

Well I gave it one more try (7bbce46) and got it working with BTRFS_IOC_FILE_EXTENT_SAME, albeit with the bug / limitation flagged in http://permalink.gmane.org/gmane.comp.file-systems.btrfs/42512.

What this means:

  1. rmlint --btrfs-clone should be inherently safe (the kernel does the double-checking during the reflink operation).
  2. rmlint --btrfs-clone won't reflink the last block of each file unless it is exactly 4096 bytes long (at least not until the patch in the above-referenced link has been applied). In the meantime you will get most of the space savings but will create some fragmentation (note that duperemove suffers the same limitation). Sadly this also means that if you defragment the files, they will un-reflink.
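
To make limitation 2 concrete: until the kernel patch lands, a caller effectively has to clamp the dedupe request to whole 4096-byte blocks, something like the hypothetical helper below (illustration only, not the actual rmlint code):

#include <stdint.h>
#include <sys/stat.h>

/* Pre-patch kernels only dedupe whole filesystem blocks, so round the
 * request down to a 4096-byte boundary; the trailing partial block
 * stays un-reflinked (and defragmenting later undoes the sharing). */
static uint64_t clamped_dedupe_length(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 0;
    /* e.g. a 10000-byte file: 8192 bytes get shared, 1808 do not */
    return ((uint64_t)st.st_size / 4096) * 4096;
}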

Usage is still as per #132 (comment). Note that --btrfs-clone automatically disables all "other lint" searching and turns off all outputs.

@phiresky

That bug will apparently be fixed in 4.2 (this commit in upstream fixes the same bug)

@SeeSpotRun
Collaborator

... as will the annoying mtime change (http://www.spinics.net/lists/linux-btrfs/msg44919.html)

@SeeSpotRun
Collaborator

An update on this:
(1) Although kernel 4.2 is out, I haven't rushed out and installed it... while it does fix the clone ioctl, there seems to be more than the usual level of issue reporting and patching at http://www.mail-archive.com/linux-btrfs@vger.kernel.org/. So I'm going to wait a few weeks before jumping in.
(2) Meanwhile SeeSpotRun@7d4cc22 introduces the command-line utility rmlint --btrfs-clone source dest, which is then used by the rmlint-generated shell script to do the cloning. On kernels below 4.2 this works fine for files whose sizes are exact multiples of 4096 bytes; otherwise it generates an error message suggesting the user might need kernel >= 4.2.
(3) Chris pointed out that SeeSpotRun@7d4cc22 breaks things for other reflink-capable fs's (only ocfs2 for now, but bcachefs will probably join that party soon) so the plan is to have a couple of command-line options for rmlint:
rmlint -c sh:link will generate shell scripts which use cp --reflink source dest
rmlint -c sh:clone will generate shell scripts which use rmlint --btrfs-clone source dest

@ghost

ghost commented Nov 18, 2015

Using "link" option, it will default to reflink on 3.19 btrfs filesystem. Which works. However, if I run rmlint again on the filesystem, using both hardlink options, it still marks previously deduced reflinks as duplicates.

Expected behavior? Am I missing something? Should I be using hardlinks instead for now?

@SeeSpotRun
Collaborator

The problem with reflinks is that they are fairly invisible from userland, so it's hard to detect existing ones. The code at https://github.com/sahib/rmlint/blob/develop/lib/formats/sh.c.in#L89-L96 tries to detect existing reflinks by comparing their fiemaps, but that's not foolproof, particularly for small files which use "inline extents".
Hardlinks are easier to detect, e.g. via stat file, but have some disadvantages, e.g. future changes to one file may (depending on how the change is made) get reflected in the other file.
Worst case, rmlint may re-reflink files that were already reflinked; other than a slight time cost, this should have no undesirable impacts. Using rmlint's timestamp feature (http://rmlint.readthedocs.org/en/latest/tutorial.html#limit-files-by-their-modification-time) should avoid this.
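
For anyone curious what "comparing their fiemaps" means in practice, here's a simplified sketch (assumptions: a fixed extent buffer, whole-file comparison, and none of the corner-case handling the real sh.c.in code needs):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */

#define MAX_EXTENTS 512

/* Fill `fm` (sized for MAX_EXTENTS) with the file's extent map;
 * returns the number of extents actually mapped, 0 on error. */
static unsigned int get_fiemap(int fd, struct fiemap *fm)
{
    memset(fm, 0, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
    fm->fm_length = ~0ULL;               /* whole file */
    fm->fm_extent_count = MAX_EXTENTS;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
        return 0;
    return fm->fm_mapped_extents;
}

/* Heuristic: the files look reflinked if every extent matches in logical
 * offset, physical address and length.  Inline extents report a physical
 * address of 0, so small files defeat this check and are treated as
 * not reflinked. */
static bool probably_reflinked(int fd_a, int fd_b)
{
    size_t sz = sizeof(struct fiemap) + MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *a = malloc(sz);
    struct fiemap *b = malloc(sz);
    bool same = false;

    if (a && b) {
        unsigned int na = get_fiemap(fd_a, a);
        unsigned int nb = get_fiemap(fd_b, b);
        if (na > 0 && na == nb) {
            same = true;
            for (unsigned int i = 0; i < na; i++) {
                if (a->fm_extents[i].fe_logical  != b->fm_extents[i].fe_logical  ||
                    a->fm_extents[i].fe_physical != b->fm_extents[i].fe_physical ||
                    a->fm_extents[i].fe_length   != b->fm_extents[i].fe_length   ||
                    (a->fm_extents[i].fe_flags & FIEMAP_EXTENT_INLINE)) {
                    same = false;
                    break;
                }
            }
        }
    }
    free(a);
    free(b);
    return same;
}

It errs on the side of caution: anything with inline extents is treated as not reflinked, so at worst those files just get processed again.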

@phiresky

phiresky commented Apr 4, 2016

Just a note: if/when this is done and stable, rmlint should be added to the BTRFS wiki at https://btrfs.wiki.kernel.org/index.php/Deduplication, which is also what you find when you google "btrfs deduplication".

@saintger

I also started a discussion here about this topic:
https://www.spinics.net/lists/linux-btrfs/msg60081.html

Here is the current answer:

Inline extents have no physical address (FIEMAP returns 0 in that field).
You can't dedup them and each file can have only one, so if you see
the FIEMAP_EXTENT_INLINE bit set, you can just skip processing the entire
file immediately.

You can create a separate non-inline extent in a temporary file then
use dedup to replace _both_ copies of the original inline extent.
Or don't bother, as the savings are negligible.

> Is there another way that deduplication programs can easily use ?

The problem is that it's not files that are reflinked--individual extents
are.  "reflink file copy" really just means "a file whose extents are
100% shared with another file." It's possible for files on btrfs to have
any percentage of shared extents from 0 to 100% in increments of the
host page size.  It's also possible for the blocks to be shared with
different extent boundaries.

The quality of the result therefore depends on the amount of effort
put into measuring it.  If you look for the first non-hole extent in
each file and use its physical address as a physical file identifier,
then you get a fast reflink detector function that has a high risk of
false positives.  If you map out two files and compare physical addresses
block by block, you get a slow function with a low risk of false positives
(but maybe a small risk of false negatives too).

If your dedup program only does full-file reflink copies then the first
extent physical address method is sufficient.  If your program does
block- or extent-level dedup then it shouldn't be using files in its
data model at all, except where necessary to provide a mechanism to
access the physical blocks through the POSIX filesystem API.

FIEMAP will tell you about all the extents (physical address for extents
that have them, zero for other extent types).  It's also slow and has
assorted accuracy problems especially with compressed files.  Any user
can run FIEMAP, and it uses only standard structure arrays.

SEARCH_V2 is root-only and requires parsing variable-length binary
btrfs data encoding, but it's faster than FIEMAP and gives more accurate
results on compressed files.
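
A hedged sketch of the fast "first non-hole extent as file identifier" approach described above (a hypothetical helper, not taken from rmlint or duperemove; a matching non-zero value only suggests two files are reflinked, and the slower comparison is still needed to confirm):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_EXTENT_INLINE */

/* Physical address of the file's first mapped extent, or 0 if the file is
 * empty, stored as an inline extent (no physical address) or FIEMAP fails. */
static __u64 first_extent_physical(int fd)
{
    size_t sz = sizeof(struct fiemap) + sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    __u64 phys = 0;

    if (!fm)
        return 0;
    fm->fm_length = ~0ULL;       /* map from offset 0 to EOF */
    fm->fm_extent_count = 1;     /* we only care about the first extent */
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 &&
        fm->fm_mapped_extents > 0 &&
        !(fm->fm_extents[0].fe_flags & FIEMAP_EXTENT_INLINE))
        phys = fm->fm_extents[0].fe_physical;
    free(fm);
    return phys;
}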

@SeeSpotRun
Collaborator

@phiresky

@SeeSpotRun Nice. You should maybe add the option -T df to the example, or at least mention it, because I wouldn't expect it to also remove empty files and broken symlinks. Also maybe mention --no-crossdev, because I would expect it to stay on one FS for this purpose. And should the sh:handler=clone option maybe fail on non-btrfs filesystems? It just seems to produce an sh file that only prints "Keeping..." for me but doesn't do anything when run on non-btrfs.

@SeeSpotRun
Collaborator

@saintger we currently check for existing reflinks after we find duplicates (https://github.com/sahib/rmlint/blob/master/lib/formats/sh.c.in#L89-L96) by comparing fiemaps.

I have made a couple of false starts at an algorithm to speed up duplicate detection by pre-matching existing reflinks but am not currently working on that.

@SeeSpotRun
Collaborator

@phiresky thanks, updated with -T df

--no-crossdev may be a problem since subvolumes have different dev numbers, so it may fail...

The behaviour on non-reflink filesystems (do nothing, rather than fail) is reasonable, although maybe not so helpful to the user ("why didn't it do anything?"). Maybe it should output:

echo 'Keeping: /home/file1'
echo 'Warning cannot clone: /home/file2'

@saintger

@SeeSpotRun On my BTRFS system I am using the following options, which seem to be a reasonable starting point for deduplicating on BTRFS:

rmlint --algorithm=xxhash --types="duplicates" --hidden --config=sh:handler=clone --no-hardlinked

Perhaps you can also add that to the BTRFS wiki.

@sahib
Owner Author

sahib commented Feb 21, 2017

Can this issue be closed?

@SeeSpotRun
Collaborator

Yes I think so

@sahib sahib closed this as completed Feb 22, 2017
@sahib sahib removed the In Progress label Feb 22, 2017