Support cp --reflink for filesystems that support it. #132
Comments
Yes, this would be great (what is the current state of it?). Another option (for BTRFS) would be to use the btrfs-extent-same ioctl to handle duplicates like duperemove does, but that would probably be more work to implement.
@phiresky current state is that this is implemented in both master and develop branches, but has not been thoroughly tested. Usage:
or more fully:
The algorithm doesn't use btrfs-extent-same ioctl at the moment, instead it's a bit clunky:
If we used BTRFS_IOC_FILE_EXTENT_SAME then the reflinking would happen during rmlint execution rather than during post-processing by the script. That's slightly misaligned with the current rmlint philosophy of doing all the actual deduplication via post-processing. For btrfs deduplication there is already https://github.com/markfasheh/duperemove but maybe there's a case for a btrfs-specific rmlint option:
...this would search for duplicates, call BTRFS_IOC_FILE_EXTENT_SAME for each, log successful calls, and write a shell script for post-processing of duplicates that couldn't be reflinked (i.e. failed calls to the ioctl). This should be a faster alternative to duperemove (which works at a block level and hashes all blocks) since rmlint works at a file level and has criteria to avoid unnecessary reads. Of course duperemove has the advantage of being able to partially deduplicate partially matched files.
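The flow proposed above (try the ioctl per pair, log successes, script the failures) could look roughly like the sketch below. This is not rmlint code: `try_dedupe` is a hypothetical callable standing in for the BTRFS_IOC_FILE_EXTENT_SAME call, and `handle_dupe` is a made-up shell function name used only as a placeholder for whatever the generated script would do with leftovers.

```python
import shlex
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (original, duplicate)


def dedupe_with_fallback(
    pairs: Iterable[Pair],
    try_dedupe: Callable[[str, str], bool],  # hypothetical: wraps the ioctl
) -> Tuple[List[Pair], str]:
    """Try in-place dedupe for each pair; script whatever could not be done.

    Returns (pairs that were deduplicated, shell script text for the rest).
    """
    done: List[Pair] = []
    script = ["#!/bin/sh"]
    for original, duplicate in pairs:
        try:
            ok = try_dedupe(original, duplicate)
        except OSError:
            # A failed ioctl (e.g. EINVAL) is treated like "not deduplicated".
            ok = False
        if ok:
            done.append((original, duplicate))
        else:
            # handle_dupe is a placeholder name, not a real rmlint function.
            script.append(
                "handle_dupe %s %s"
                % (shlex.quote(original), shlex.quote(duplicate))
            )
    return done, "\n".join(script) + "\n"
```

Passing the ioctl wrapper in as a parameter keeps the control flow testable on filesystems without reflink support.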
Thanks for the explanation!
Ah, I didn't think of that. But it could be done from the script, there just doesn't seem to be a userspace tool for it yet. (Could call |
Note that BTRFS_IOC_FILE_EXTENT_SAME has its own internal checks and locks so it should be inherently safe to call, even for non-matching files.
Experimental option --btrfs-dedup added at https://github.com/SeeSpotRun/rmlint/tree/feature-btrfs-extent-same
This will run rmlint as usual but will also attempt to reflink duplicates to originals. If you don't want to do anything else, use:
Some follow-up: Secondly, however, it appears that the BTRFS_IOC_FILE_EXTENT_SAME ioctl is a bit hit-and-miss; it works for some file pairs and not others. I haven't spotted a pattern yet.
Would it be an idea to add a (hidden?) Edit: Clarification: |
That's basically what --btrfs-dedup is trying to be. Just trying to figure out how to debug. I think I need to customise & compile the btrfs kernel module so that it gives more meaningful errors than -EINVAL.
Ok I'm giving up on btrfs-extent-same for now, mainly due to unresolved bug http://permalink.gmane.org/gmane.comp.file-systems.btrfs/42512.
Well I gave it one more try (7bbce46) and got it working with BTRFS_IOC_FILE_EXTENT_SAME, albeit with the bug / limitation flagged in http://permalink.gmane.org/gmane.comp.file-systems.btrfs/42512. What this means:
Usage is still as per #132 (comment). Note that --btrfs-clone automatically disables all "other lint" searching and turns off all outputs.
That bug will apparently be fixed in 4.2 (this commit in upstream fixes the same bug)
... as will the annoying mtime change (http://www.spinics.net/lists/linux-btrfs/msg44919.html)
An update on this:
Using the "link" option, it will default to reflink on a 3.19 btrfs filesystem, which works. However, if I run rmlint again on the filesystem, using both hardlink options, it still marks previously deduplicated reflinks as duplicates. Is this expected behavior? Am I missing something? Should I be using hardlinks instead for now?
The problem with reflinks is they are fairly invisible from userland, so it's hard to detect existing reflinks. The code at https://github.com/sahib/rmlint/blob/develop/lib/formats/sh.c.in#L89-L96 tries to detect existing reflinks by comparing their fiemaps, but that's not foolproof, particularly for small files which use "inline extents".
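To illustrate the idea of comparing fiemaps (this is a simplified sketch, not the code in sh.c.in), assume the extent lists have already been fetched, e.g. via the FIEMAP ioctl. The `Extent` type and `same_storage` function are hypothetical names; only the `FIEMAP_EXTENT_DATA_INLINE` flag value comes from `linux/fiemap.h`. Inline extents make the comparison inconclusive, which is the small-file caveat mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

FIEMAP_EXTENT_DATA_INLINE = 0x200  # flag value from linux/fiemap.h


@dataclass(frozen=True)
class Extent:
    """One extent as reported by FIEMAP (illustrative subset of fields)."""
    logical: int   # byte offset within the file
    physical: int  # byte offset on the underlying device
    length: int    # extent length in bytes
    flags: int = 0


def same_storage(a: Sequence[Extent], b: Sequence[Extent]) -> Optional[bool]:
    """Compare two extent maps.

    True  -> every extent maps to the same physical range (likely reflinked)
    False -> the maps differ
    None  -> cannot tell (no extents, or data stored inline in metadata)
    """
    if not a or not b:
        return None
    if any(e.flags & FIEMAP_EXTENT_DATA_INLINE for e in (*a, *b)):
        return None
    layout_a = [(e.logical, e.physical, e.length) for e in a]
    layout_b = [(e.logical, e.physical, e.length) for e in b]
    return layout_a == layout_b
```

Returning `None` rather than `False` for inline extents lets a caller fall back to a content comparison instead of wrongly concluding the files are not reflinked.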
Just a note: If/When this is done and stable rmlint should be added to the BTRFS wiki at https://btrfs.wiki.kernel.org/index.php/Deduplication which is also what you find when you google "btrfs deduplication"
I also started a discussion here about this topic: Here is the current answer:
@phiresky I've added rmlint to https://btrfs.wiki.kernel.org/index.php/Deduplication
@SeeSpotRun Nice. You should maybe add the option |
@saintger we currently check for existing reflinks after we find duplicates (https://github.com/sahib/rmlint/blob/master/lib/formats/sh.c.in#L89-L96) by comparing fiemaps. I have made a couple of false starts at an algorithm to speed up duplicate detection by pre-matching existing reflinks but am not currently working on that.
@phiresky thanks, updated with
The behaviour on non-reflink filesystems (do nothing, rather than fail) is reasonable although maybe not so helpful to the user ("why didn't it do anything?"). Maybe it should output:
@SeeSpotRun On my BTRFS system I am using the following options, which seem to be a reasonable starting point for deduplicating on BTRFS:
Perhaps you can also add that to the BTRFS wiki.
Can this issue be closed?
Yes I think so
Support reflinks for filesystems that support it. We had this previously in #93.
But it was in a rough, unfinished state as far as I know.
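For readers unfamiliar with what a reflink request involves, here is a minimal sketch of the "do nothing rather than fail" behaviour on filesystems without reflink support. It uses the Linux `FICLONE` ioctl directly instead of `cp --reflink`; `try_reflink` and the `.reflink-tmp` suffix are illustrative names, not anything rmlint defines.

```python
import fcntl
import os

FICLONE = 0x40049409  # _IOW(0x94, 9, int), from linux/fs.h


def try_reflink(src: str, dst: str) -> bool:
    """Replace dst with a reflink clone of src if the filesystem allows it.

    Returns False (leaving dst untouched) when cloning is not supported,
    instead of raising.
    """
    tmp = dst + ".reflink-tmp"  # illustrative temp name next to dst
    with open(src, "rb") as s, open(tmp, "wb") as d:
        try:
            # Ask the kernel to make tmp share src's extents.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            ok = True
        except OSError:
            # e.g. EOPNOTSUPP, EXDEV or EINVAL on non-reflink filesystems
            ok = False
    if ok:
        os.replace(tmp, dst)  # atomically swap the clone into place
    else:
        os.unlink(tmp)
    return ok
```

Cloning into a temporary file first means the duplicate is only replaced once the clone has succeeded, so a failed call leaves the original duplicate intact.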