Support cp --reflink for filesystems that support it. #132
Comments
|
Yes, this would be great (what is the current state of it?). Another option (for BTRFS) would be to use the btrfs-extent-same ioctl to handle duplicates like duperemove does, but that would probably be more work to implement. |
|
@phiresky current state is that this is implemented in both master and develop branches, but has not been thoroughly tested. Usage:
or more fully:
The algorithm doesn't use btrfs-extent-same ioctl at the moment, instead it's a bit clunky:
If we used BTRFS_IOC_FILE_EXTENT_SAME then the reflinking would happen during rmlint execution rather than during post-processing by the script. That's slightly misaligned with the current rmlint philosophy of doing all the actual deduplication via post-processing. For btrfs deduplication there is already https://github.com/markfasheh/duperemove but maybe there's a case for a btrfs-specific rmlint option:
...this would search for duplicates, call BTRFS_IOC_FILE_EXTENT_SAME for each, log successful calls, and write a shell script for post-processing of duplicates that couldn't be reflinked (i.e. failed calls to the ioctl). This should be a faster alternative to duperemove (which works at a block level and hashes all blocks) since rmlint works at a file level and has criteria to avoid unnecessary reads. Of course, duperemove has the advantage of being able to partially deduplicate partially matched files. |
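A minimal sketch of the flow described above, under stated assumptions: `plan_dedupe` and the `try_dedupe` callback are hypothetical names (not rmlint functions), and the `rm -f` lines stand in for whatever the generated script would actually do with leftover duplicates.

```python
import shlex
from typing import Callable, Iterable, List, Tuple


def plan_dedupe(
    groups: Iterable[Tuple[str, List[str]]],
    try_dedupe: Callable[[str, str], bool],
) -> Tuple[List[Tuple[str, str]], str]:
    """Try to dedupe each duplicate against its original.

    Returns (log of successful pairs, shell script text for the failures).
    `try_dedupe(original, duplicate)` is a hypothetical callback that would
    wrap the extent-same ioctl and return True on success.
    """
    succeeded: List[Tuple[str, str]] = []
    script = ["#!/bin/sh", "# duplicates that could not be reflinked"]
    for original, duplicates in groups:
        for dup in duplicates:
            if try_dedupe(original, dup):
                succeeded.append((original, dup))
            else:
                # Assumption: leftover duplicates are simply removed.
                script.append("rm -f -- " + shlex.quote(dup))
    return succeeded, "\n".join(script) + "\n"
```

Keeping the ioctl behind a callback means the failure path (the post-processing script) can be exercised on any filesystem.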
|
Thanks for the explanation!
Ah, I didn't think of that. But it could be done from the script, there just doesn't seem to be a userspace tool for it yet. (Could call |
|
Note that BTRFS_IOC_FILE_EXTENT_SAME has its own internal checks and locks so it should be inherently safe to call, even for non-matching files. |
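For illustration, a hedged sketch of calling the ioctl directly from Python. It assumes a Linux kernel where the request is exposed as FIDEDUPERANGE (which, as far as I know, shares its request number and argument layout with BTRFS_IOC_FILE_EXTENT_SAME); `dedupe_range` is a made-up helper name. The kernel compares the two ranges itself and reports a per-destination status, so non-matching files come back as "differs" instead of being merged.

```python
import fcntl
import struct

# _IOWR(0x94, 54, struct file_dedupe_range)
FIDEDUPERANGE = 0xC0189436
FILE_DEDUPE_RANGE_SAME = 0
FILE_DEDUPE_RANGE_DIFFERS = 1

# struct file_dedupe_range: src_offset, src_length, dest_count, reserved1, reserved2
DEDUPE_HEADER = struct.Struct("=QQHHI")
# struct file_dedupe_range_info: dest_fd, dest_offset, bytes_deduped, status, reserved
DEDUPE_INFO = struct.Struct("=qQQiI")


def dedupe_range(src_fd, dest_fd, length, src_offset=0, dest_offset=0):
    """Ask the kernel to share `length` bytes of src with dest.

    Returns (status, bytes_deduped). status is FILE_DEDUPE_RANGE_SAME,
    FILE_DEDUPE_RANGE_DIFFERS, or a negative errno for this destination.
    Raises OSError if the ioctl call itself is rejected.
    """
    buf = bytearray(
        DEDUPE_HEADER.pack(src_offset, length, 1, 0, 0)
        + DEDUPE_INFO.pack(dest_fd, dest_offset, 0, 0, 0)
    )
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)  # kernel fills in the info record
    _, _, bytes_deduped, status, _ = DEDUPE_INFO.unpack_from(buf, DEDUPE_HEADER.size)
    return status, bytes_deduped
```

The source would typically be opened for reading and the destination for writing; a status other than `FILE_DEDUPE_RANGE_SAME` means the duplicate should be left for post-processing.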
|
Experimental option --btrfs-dedup added at https://github.com/SeeSpotRun/rmlint/tree/feature-btrfs-extent-same
This will run rmlint as usual but will also attempt to reflink duplicates to originals. If you don't want to do anything else, use:
|
|
Some follow-up: Secondly, however, it appears that the BTRFS_IOC_FILE_EXTENT_SAME ioctl is a bit hit-and-miss; it works for some file pairs and not others. I haven't spotted a pattern yet. |
|
Would it be an idea to add a (hidden?) Edit: Clarification: |
|
That's basically what --btrfs-dedup is trying to be. Just trying to figure out how to debug. I think I need to customise & compile the btrfs kernel module so that it gives more meaningful errors than -EINVAL. |
|
Ok I'm giving up on btrfs-extent-same for now, mainly due to unresolved bug http://permalink.gmane.org/gmane.comp.file-systems.btrfs/42512.
|
|
Well I gave it one more try (7bbce46) and got it working with BTRFS_IOC_FILE_EXTENT_SAME, albeit with the bug / limitation flagged in http://permalink.gmane.org/gmane.comp.file-systems.btrfs/42512. What this means:
Usage is still as per #132 (comment). Note that --btrfs-clone automatically disables all "other lint" searching and turns off all outputs. |
|
That bug will apparently be fixed in 4.2 (this commit in upstream fixes the same bug) |
|
... as will the annoying mtime change (http://www.spinics.net/lists/linux-btrfs/msg44919.html) |
|
An update on this: |
|
Using the "link" option, it defaults to reflink on a 3.19 btrfs filesystem, which works. However, if I run rmlint again on the filesystem, using both hardlink options, it still marks previously deduplicated reflinks as duplicates. Is this expected behavior? Am I missing something? Should I be using hardlinks instead for now? |
|
The problem with reflinks is they are fairly invisible from userland, so it's hard to detect existing reflinks. The code at https://github.com/sahib/rmlint/blob/develop/lib/formats/sh.c.in#L89-L96 tries to detect existing reflinks by comparing their fiemaps, but that's not foolproof, particularly for small files which use "inline extents". |
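A rough sketch of the fiemap-comparison idea, not a transcription of the rmlint code linked above. `read_extents` and `looks_reflinked` are illustrative names; the struct layouts and the FS_IOC_FIEMAP request number are taken from the Linux fiemap interface as I understand it. Two files are treated as reflinked only if every extent has the same logical offset, physical offset and length, and inline extents are treated as "cannot tell".

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B  # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1
FIEMAP_EXTENT_DATA_INLINE = 0x200

# struct fiemap header: start, length, flags, mapped_extents, extent_count, reserved
FIEMAP_HEADER = struct.Struct("=QQIIII")
# struct fiemap_extent: logical, physical, length, reserved64[2], flags, reserved[3]
FIEMAP_EXTENT = struct.Struct("=QQQQQIIII")


def read_extents(fd, max_extents=16):
    """Return up to max_extents (logical, physical, length, flags) tuples for fd."""
    buf = bytearray(FIEMAP_HEADER.size + max_extents * FIEMAP_EXTENT.size)
    FIEMAP_HEADER.pack_into(
        buf, 0, 0, 0xFFFFFFFFFFFFFFFF, FIEMAP_FLAG_SYNC, 0, max_extents, 0
    )
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)  # raises OSError if unsupported
    mapped = FIEMAP_HEADER.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        fields = FIEMAP_EXTENT.unpack_from(buf, FIEMAP_HEADER.size + i * FIEMAP_EXTENT.size)
        extents.append((fields[0], fields[1], fields[2], fields[5]))
    return extents


def looks_reflinked(extents_a, extents_b):
    """Heuristic: True only if both extent lists map to identical physical ranges."""
    if not extents_a or len(extents_a) != len(extents_b):
        return False
    for (la, pa, na, fa), (lb, pb, nb, fb) in zip(extents_a, extents_b):
        if (fa | fb) & FIEMAP_EXTENT_DATA_INLINE:
            return False  # inline data: physical offsets are not comparable
        if (la, pa, na) != (lb, pb, nb):
            return False
    return True
```

Returning False for inline extents errs on the side of reporting small files as duplicates again, which matches the limitation mentioned above.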
|
Just a note: If/When this is done and stable rmlint should be added to the BTRFS wiki at https://btrfs.wiki.kernel.org/index.php/Deduplication which is also what you find when you google "btrfs deduplication" |
|
I also started a discussion here about this topic: Here is the current answer:
|
|
@phiresky I've added rmlint to https://btrfs.wiki.kernel.org/index.php/Deduplication |
|
@SeeSpotRun Nice. You should maybe add the option |
|
@saintger we currently check for existing reflinks after we find duplicates (https://github.com/sahib/rmlint/blob/master/lib/formats/sh.c.in#L89-L96) by comparing fiemaps. I have made a couple of false starts at an algorithm to speed up duplicate detection by pre-matching existing reflinks but am not currently working on that. |
|
@phiresky thanks, updated with
The behaviour on non-reflink filesystems (do nothing, rather than fail) is reasonable although maybe not so helpful to the user ("why didn't it do anything?"). Maybe it should output:
|
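To illustrate the "do nothing, but say so" behaviour discussed in the previous comment, here is a small sketch, assuming Linux and the FICLONE ioctl (the mechanism behind `cp --reflink`). `reflink_or_report` and the wording of the message are invented for this example and are not rmlint output. The clone is made into a temporary file first, so the duplicate is left untouched when the filesystem cannot reflink.

```python
import fcntl
import os
import sys
import tempfile

FICLONE = 0x40049409  # _IOW(0x94, 9, int)


def reflink_or_report(src, dest):
    """Replace dest with a reflink of src; on failure leave dest alone and say why."""
    # Temporary file in the same directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(dest)))
    try:
        with open(src, "rb") as fsrc:
            fcntl.ioctl(fd, FICLONE, fsrc.fileno())
        os.replace(tmp, dest)
        return True
    except OSError as err:
        os.unlink(tmp)
        # Hypothetical message text, for illustration only.
        print(f"{dest}: not reflinked ({err.strerror})", file=sys.stderr)
        return False
    finally:
        os.close(fd)
```

Note that this sketch does not carry over the permissions or timestamps of the original duplicate; a real tool would need to.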
|
@SeeSpotRun On my BTRFS system I am using the following options, which seem to be a reasonable starting point for deduplicating on BTRFS:
Perhaps you can also add that to the BTRFS wiki. |
|
Can this issue be closed? |
|
Yes I think so |
Support reflinks for filesystems that support them. We had this previously in #93.
But it was in a rough unfinished state as far as I know.