yafd is a (yet another) file deduplicator.
For detailed info, see USAGE or the manpage (man yafd).
The easiest way to use yafd is to pass a directory or a set of directories to it.
jason@io ~ yafd .
It can recurse as well.
jason@io ~ yafd -r .
Another easy way to use yafd is to pass the files to check as arguments. Shell globbing helps.
jason@io ~ yafd **/*.c **/*.h
You can also pipe paths to yafd via stdin. This makes it easy to limit the sets of files to check.
jason@io ~ find . -size +1M | yafd
The output can also be piped to other commands to do things with the duplicate files.
jason@io ~ find /usr/src -size +1M | yafd | xargs du -b | awk '{ x+=$1; } END { print x; }'
12659698
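The same total can be computed without du and awk. The sketch below is a Python stand-in for the last two stages of that pipeline. It assumes yafd prints one path per line, which is an assumption about the output format rather than something documented here, and `total_size` is a hypothetical helper name.

```python
import os


def total_size(paths):
    """Sum the apparent sizes (in bytes) of the given paths, like `du -b` plus awk."""
    total = 0
    for path in paths:
        path = path.rstrip("\n")
        if not path:
            continue  # skip blank lines between groups, if there are any
        try:
            total += os.path.getsize(path)
        except OSError:
            pass  # the file vanished or is unreadable; leave it out of the sum
    return total
```

Calling `total_size(sys.stdin)` at the end of a script would let it sit at the end of the pipeline in place of `xargs du -b | awk ...`.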
yafd is not yet always the fastest deduplicator (see the HDD performance).
If performance is a concern, it may be worth considering another deduplicator
like rmlint. Performance can be tuned using command arguments
(--bytes, --blocksize, --threads, etc.), although yafd with defaults should be
usable for most tasks.
Here are some metrics for reference.
SSD (btrfs)
tool | time | throughput | throughput (dup) |
---|---|---|---|
yafd | 4.30s | 267.88 MiB/s | 175.70 MiB/s |
rmlint | 7.43s | 155.13 MiB/s | 101.74 MiB/s |
fdupes | 30.34s | 37.99 MiB/s | 24.92 MiB/s |
duff | 25.14s | 45.86 MiB/s | 30.08 MiB/s |
yafd (cached) | 0.61s | 1.84 GiB/s | 1.20 GiB/s |
rmlint (cached) | 2.46s | 466.21 MiB/s | 307.40 MiB/s |
fdupes (cached) | 12.27s | 93.94 MiB/s | 61.61 MiB/s |
duff (cached) | 6.51s | 176.17 MiB/s | 116.12 MiB/s |
HDD (ext4)
tool | time | throughput | throughput (dup) |
---|---|---|---|
yafd | 1087.59s | 1.05 MiB/s | 711.99 KiB/s |
rmlint | 65.03s | 163.46 MiB/s | 107.21 MiB/s |
fdupes | 322.57s | 3.57 MiB/s | 2.34 MiB/s |
duff | 954.70s | 1.20 MiB/s | 811.10 KiB/s |
yafd (cached) | 7.05s | 163.46 MiB/s | 107.21 MiB/s |
rmlint (cached) | 2.84s | 406.37 MiB/s | 266.53 MiB/s |
fdupes (cached) | 12.44s | 92.64 MiB/s | 60.76 MiB/s |
duff (cached) | 6.56s | 175.76 MiB/s | 115.28 MiB/s |
NFS (v4)
tool | time | throughput | throughput (dup) |
---|---|---|---|
yafd | 197.08s | 5.85 MiB/s | 3.83 MiB/s |
rmlint | 461.26s | 2.49 MiB/s | 1.63 MiB/s |
fdupes | 648.24s | 1.77 MiB/s | 1.16 MiB/s |
duff | 466.69s | 2.47 MiB/s | 1.62 MiB/s |
yafd (cached) | 95.04s | 12.13 MiB/s | 7.95 MiB/s |
rmlint (cached) | 423.90s | 2.71 MiB/s | 1.78 MiB/s |
fdupes (cached) | 611.19s | 1.88 MiB/s | 1.23 MiB/s |
duff (cached) | 403.72s | 2.85 MiB/s | 1.87 MiB/s |
(1) The Linux sources (versions 4.3 and 4.4) were searched for identical files.
(2) For an equivalent comparison, the following command arguments were used:
yafd --recurse --zero
rmlint --algorithm=paranoid --hidden -o fdupes:stdout
fdupes --recurse
duff -rpta -f#
(3) Linux 4.4.0 and an Intel Ivy Bridge CPU (i7-3632QM) were used for the benchmarks.
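The throughput columns appear to be the amount of data divided by wall-clock time. The tables do not define the columns, so that relationship is an assumption, but it can be sanity-checked: multiplying time by throughput should give roughly the same data size for every tool run on the same data set. A small sketch, with a hypothetical helper name:

```python
def implied_mib(seconds, mib_per_second):
    """Data size in MiB implied by a (time, throughput) pair."""
    return seconds * mib_per_second


# (time in s, throughput in MiB/s) copied from the uncached SSD rows above
ssd_rows = {
    "yafd": (4.30, 267.88),
    "rmlint": (7.43, 155.13),
    "fdupes": (30.34, 37.99),
    "duff": (25.14, 45.86),
}

for tool, (seconds, rate) in ssd_rows.items():
    print(f"{tool}: {implied_mib(seconds, rate):.1f} MiB")
```

All four rows come out at roughly 1152 MiB, which is consistent with the assumption.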
You can download a copy of the source here or you can clone the repository using git.
jason@io ~ git clone https://github.com/uxcn/yafd.git
It's a good idea to check out a specific release.
jason@io ~/yafd git checkout v0.1
In the project directory, run the autoconf script.
jason@io ~/yafd ./autoconf.sh CFLAGS='-march=native -mtune=native -O2'
Adding the architecture allows algorithms that rely on architecture-specific
implementations to be used. The easiest way to do this is normally
-march=native. You can also explicitly enable instruction sets via autoconf.
jason@io ~/yafd ./autoconf.sh --enable-sse4_2
To install to a directory other than /usr/local, you can manually configure the
prefix. If you do, make sure your PATH and MANPATH are set correctly.
jason@io ~/yafd ./autoconf.sh --prefix=$HOME
Run make install to compile and install.
jason@io ~/yafd $ make install
Currently, yafd compiles and is tested on Linux, FreeBSD, OS X, and Windows. Patches and pull requests for other platforms are definitely welcome.
0.1 - alpha release
Why write another file deduplicator?
A lot of the current ones were more complicated than I wanted, didn't perform well, or weren't portable.
Why doesn't yafd do X?
Most likely nobody has asked for X yet. If you think something's missing, send a feature request or, even better, a pull request.
How does yafd work?
The basic algorithm is to group files by their sizes, compute a hash on a small (random) chunk of each file, and then compare files that have the same hash. This is a bit of an oversimplification though. For a better understanding, it may help to read the code.
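As a rough illustration of those three steps, here is a minimal Python sketch. It is not yafd's implementation: the function names, the use of SHA-1, the 4096-byte chunk, and the choice of one random offset per size group are all assumptions made for the example.

```python
import filecmp
import hashlib
import os
import random
from collections import defaultdict


def chunk_hash(path, offset, length):
    """Hash up to `length` bytes of the file starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha1(f.read(length)).digest()


def find_duplicates(paths, chunk=4096):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: group by size; files of different sizes cannot be equal.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        # Step 2: hash one small chunk at a random offset. The offset is
        # shared by the whole group so that the hashes are comparable.
        offset = random.randint(0, max(0, size - chunk))
        by_hash = defaultdict(list)
        for path in group:
            by_hash[chunk_hash(path, offset, chunk)].append(path)

        # Step 3: fully compare files whose chunk hashes collide.
        for candidates in by_hash.values():
            clusters = []  # each cluster holds files with identical contents
            for path in candidates:
                for cluster in clusters:
                    if filecmp.cmp(cluster[0], path, shallow=False):
                        cluster.append(path)
                        break
                else:
                    clusters.append([path])
            duplicates.extend(c for c in clusters if len(c) > 1)
    return duplicates
```

The chunk hash is only a cheap filter. Two files that differ outside the sampled chunk still share a hash, which is why the final full comparison is needed before calling them duplicates.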