GitHub - rburchell/fastdup: FastDup is a tool to find copies of the same file within directory tree(s)

rburchell / fastdup Public

Notifications You must be signed in to change notification settings
Fork 5
Star 9

FastDup is a tool to find copies of the same file within directory tree(s)

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
include		include
src		src
LICENSE		LICENSE
Makefile		Makefile
README		README

Repository files navigation

FastDup - http://github.com/rburchell/fastdup

The current maintainer is:
Robin Burchell <robin+git@viroteck.net>

However, most(/all) of the blood, sweat, and tears in FastDup - and all of the credit belongs to:
John Brooks <special@dereferenced.net>

-- INTRODUCTION --

FastDup is a tool to find copies of the same file within directory tree(s), designed
for maximum speed and efficiency unlike most similar tools. Where many similar tools
rely on checksums or hashes of files, or simple comparisons, fastdup uses a number
of cleverly optimized tricks to reduce the number of actual comparisons necessary,
and as a result can scan large sets of data extremely quickly compared to
alternatives.

-- INSTALLATION --

make
make install

-- NOTES --

FastDup will currently only work on OS X and modern Linux platforms with GNU make.

Please report build problems (on non-BSD systems), bugs, or ideas at:

http://github.com/rburchell/fastdup/issues

FastDup *may* exhibit high memory usage on data sets with large numbers of
potentially identical files.

-- TECHNICAL NOTES --

The method of comparing files falls in two steps; scanning and comparison.
Scanning is a simple(ish) walk of directory trees searching out files and indexing
them by their filesize. Multiple files with the same size are selected for detailed
comparison. Special efforts are taken to properly handle symbolic links and other
strange situations.

Comparison is, at the core, simple byte-by-byte comparison of files in blocks, but
there are many tricks and optimizations to reduce this from O(n^n) comparisons.
The exact and numerous methods used are too extensive to list here (see compare.cpp),
and are always open to improvement. This method is used in favor of checksums or
hashes to make non-matching files less expensive; these are usually eliminated
almost immediately. The most time consuming comparisons are those on files that are
identical, because the entire files must be compared.

There are many planned changes to the method for reading from files and general
memory usage.