Duplicate file detector. Incremental-block style.

Incrementally checks blocks of content, which is pretty fast when you have largish, mostly-unique files, because the larger the (unique) files, the more reading you avoid.

Motivated by the observation that duplicate detection is largely IO-bound, and that unique files are usually unique in the first few dozen KB.

When there are large duplicates it'll spend a while verifying them, and clobber your page cache in the process, so you can make it assume equality after some amount of identical bytes.
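
Roughly, the approach is the sketch below (a simplified illustration with made-up helper names, not duppy's actual code): group candidate files by size, read each group's members block by block in lockstep, split a group as soon as blocks differ, and optionally stop comparing once assume_after identical bytes have matched.

    import os
    from collections import defaultdict

    BLOCK = 64 * 1024   # bytes per read; a real implementation might grow this geometrically

    def find_duplicates(paths, assume_after=None):
        """Return groups of paths with identical content, reading incrementally."""
        # Only files of equal size can be duplicates, so group on size first.
        by_size = defaultdict(list)
        for p in paths:
            by_size[os.path.getsize(p)].append(p)

        duplicates = []
        for size, group in by_size.items():
            if size > 0 and len(group) > 1:
                duplicates.extend(_split_group(group, size, assume_after))
        return duplicates

    def _split_group(group, size, assume_after):
        """Read files block by block, splitting the group wherever blocks differ."""
        handles = {p: open(p, 'rb') for p in group}
        groups, offset = [group], 0
        try:
            while offset < size and groups:
                if assume_after is not None and offset >= assume_after:
                    break   # the -a style shortcut: assume the rest is identical
                next_groups = []
                for g in groups:
                    blocks = defaultdict(list)
                    for p in g:
                        blocks[handles[p].read(BLOCK)].append(p)
                    # Sub-groups of one are unique files; dropping them here is where
                    # the "avoid reading the rest of unique files" saving comes from.
                    next_groups.extend(sub for sub in blocks.values() if len(sub) > 1)
                groups = next_groups
                offset += BLOCK
        finally:
            for f in handles.values():
                f.close()
        return groups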

If you want a fast estimate of possible space savings, make that assumption and also have it ignore the bulk of small files. For example:

    # duppy -v -a 32M -s 500K /dosgames
    NOTE: Assuming files are identical after 32MB

    Creating list of files to check...
       no file arguments given, using current directory
    Done scanning, found 8423 files to check.

    Looking for duplicates...
    Assuming 2 files ('/dosgames/Full Throttle/image.img', '/dosgames/Full Throttle.img') are identical after 32MB (of 614MB)

    [bunch of stuff omitted for brevity]

    1.8MB each:

    29MB each:

    614MB each:
    '/dosgames/Full Throttle.img'
    '/dosgames/Full Throttle/image.img'

      Found 567 sets of duplicate files   (Warning: 1 were ASSUMED to be equal after 32MB)
        684 files could be removed,
        to save 2.4GB
      Spent 93 seconds    reading 3.0GB  (2.7% of 111GB)

Notes / warnings:

  • Should be safe around symlinks (avoids following them) and hardlinks (avoids adding the same inode twice - though it wouldn't matter much if we did); see the sketch after this list.
  • ...but does not consider that a symlink elsewhere may be pointing at files we are deleting (because to guarantee that, you would have to scan the entire filesystem)
  • I have done basic sanity tests, but don't trust this blindly on files you haven't backed up.
  • Think about what your delete rules mean. It'll refuse to delete every copy, but you can still make a mess for yourself.
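
The symlink/hardlink handling amounts to roughly the sketch below (illustrative only; collect_files is a made-up helper, not duppy's real code): walk without following symlinks, skip the symlinks themselves, and remember (st_dev, st_ino) pairs so each hardlinked inode is only picked up once.

    import os

    def collect_files(root):
        """Yield (path, size) for regular files, skipping symlinks and repeated inodes."""
        seen_inodes = set()
        for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue                      # don't treat the link itself as content
                st = os.stat(path)
                key = (st.st_dev, st.st_ino)
                if key in seen_inodes:
                    continue                      # hardlink to an inode we already have
                seen_inodes.add(key)
                yield path, st.st_size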


    Usage: duppy [options] path [path...]

       Finds duplicate files.

       By default only prints a summary of wasted space.
       Use -v for a full list of duplicates (I feel more comfortable manually deleting based on this output)
        or -d and some rules to automate deletion.

          -q             be quiet while working.   Access errors are still reported.
          -v             be verbose about progress, report all duplicate file sets. Specify twice for even more.

          -R             Non-recursive.
                         (To run on only the files in the current directory, use -R and * (not .) as the argument)

          -s size    minimum size. Ignore files smaller than this size.  Default is 1, only ignores zero-length files.
                     (useful to find major space saving in less time)
          -S size    maximum size
                     (sometimes useful in a "now look at just the files between 100KB and 200KB",
                      which makes repeated runs faster since most data will still be cached)

          -a size    Assume a set is identical after this many matching bytes, e.g.  -a 32MB
                     Meant for dry-runs, for a faster _estimation_ of wasted space,
                     but WARNING: it will also apply to real runs. While this _usually_ makes sense
                     (most files that are distinct are so within ~150KB), *know for sure* that it makes sense for your case.

       Rules and deleting:
          --rule-help            List of rules, and examples of use.
          -n                     Apply rules - in a dry run. Will say what it would do, but delete nothing.
          -d                     Apply rules, and delete files from sets that are marked DELETE (and that also have at least one KEEP).

          # Just list what is duplicate
          duppy -v .

          # faster estimation of bulk space you could probably free  (note: not a full check)
          duppy -v -s 10M -a 20M /data/Video

          # work on just the specific files we mention, and no recursion
          duppy -vR *

Delete rule options (still sort of an experiment)

        The current set of rules is:
            --elect-one-random                      keep one, the easiest and most arbitrary way to go.
            --(keep,delete,lose)-path=SUBSTR        match absolute path by substring
            --(keep,delete,lose)-path-re=REGEXP     match absolute path with a regular expression
            --(keep,delete,lose)-newest             looks for file(s) with the most recent time  (and for each file uses max(mtime,ctime))
            --(keep,delete,lose)-deepest            matches the file(s) in the deepest directory.   Useful when you typically organize into directories.
            --(keep,delete,lose)-shallowest         matches the file(s) in the shallowest directory. Useful in similar cases.
            --(keep,delete,lose)-longest-path       considers string length of full path
            --(keep,delete,lose)-longest-basename   considers string length of just the filename (not the path to the directory it is in)

        lose-*   rules mark matches DELETE, non-matches as KEEP     so by themselves they are decisive
        delete-* rules mark matches DELETE, non-matches UNKNOWN     e.g. useful if combining a detailed set of keep- and delete-
        keep-*   rules mark matches KEEP, non-matches UNKNOWN,      e.g. useful to mark exceptions to a main lose- rule

        We apply all specified rules to all sets, which marks each filename as KEEP, DELETE, or UNKNOWN.
          These are combined conservatively, e.g. KEEP+DELETE = KEEP

        We only delete files from decided sets,  which are those with at least one KEEP and no UNKNOWN
          Note that we *never* delete everything: an all-DELETE set is considered undecided.


          # "I don't care which version you keep"
          duppy -v -d -n --keep-one-random /data/Video

          # "Right, I just downloaded a load of images, and I have most already. Clean from this new dir"
          duppy -v -d -n --delete-path '/data/images/justdownloaded/' /data/images

          # "assume that anything sorted into a deeper directory, and anything named archive, needs to stay"
          # This may well be indecisive for some sets
          duppy -v -d -n --keep-deepest --keep-path-re '[Aa]rchive' /data/Video
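
For concreteness, the marking and combining described above could work roughly as in the sketch below (hypothetical names, not the actual implementation; keep_deepest is shown as one example of a rule):

    import os

    KEEP, DELETE, UNKNOWN = 'KEEP', 'DELETE', 'UNKNOWN'

    def keep_deepest(fileset):
        """Example rule: mark the file(s) in the deepest directory KEEP, the rest UNKNOWN."""
        deepest = max(path.count(os.sep) for path in fileset)
        return {path: KEEP if path.count(os.sep) == deepest else UNKNOWN
                for path in fileset}

    def combine(marks):
        """Combine one file's verdicts conservatively: KEEP + DELETE = KEEP."""
        if KEEP in marks:
            return KEEP
        if DELETE in marks:
            return DELETE
        return UNKNOWN

    def deletable(fileset, rules):
        """Files we may delete from one duplicate set, or [] if the set is undecided."""
        per_rule = [rule(fileset) for rule in rules]
        verdict = {path: combine([marks[path] for marks in per_rule]) for path in fileset}
        values = set(verdict.values())
        # Decided means at least one KEEP and no UNKNOWN; an all-DELETE set is left alone.
        if UNKNOWN in values or KEEP not in values:
            return []
        return [path for path, v in verdict.items() if v == DELETE]

Note how keep_deepest alone leaves the shallower copies UNKNOWN, so such a set stays undecided until some other rule marks them - which is exactly the "may well be indecisive" caveat in the last example.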


TODO:

  • consider verbosity by default (I always use -v)

  • More sanity checks, and regression tests. I really don't want to have to explain that we deleted all your files due to a silly bug :)

  • figure out why the 'total read' sum is incorrect

  • cleanup. Until now I mainly didn't release it because it needed cleaning. Typical :)

  • think about the clarity of the rule system

  • maybe rip out the rules after all? (I usually look at the -v output and delete manually)

  • test on windows


  • page-cache non-clobbering

  • consider a homedir config of permanent rules (for things like "always keep stuff from this dir")

  • consider having an IO thread (minor speed gains?)

  • storing a cache with (fullpath,mtime,size,hash(first64kB)) or so in your homedir, for incremental checks on slowly growing directories (storage should be on the order of ~35MB per 100k files, acceptable for most)
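
A minimal sketch of what that cache could look like (hypothetical location and format, nothing duppy does yet): store (mtime, size, hash of the first 64kB) per path and only rehash files whose mtime or size changed. At a few hundred bytes per record, 100k files comes out in the tens of megabytes, consistent with the estimate above.

    import hashlib, os, pickle

    CACHE = os.path.expanduser('~/.duppy_cache')    # hypothetical location

    def head_hash(path, nbytes=64 * 1024):
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read(nbytes)).digest()

    def update_cache(paths):
        """Remember (mtime, size, hash-of-first-64kB) per path; rehash only changed files."""
        try:
            with open(CACHE, 'rb') as f:
                cache = pickle.load(f)
        except (OSError, EOFError):
            cache = {}
        for path in paths:
            st = os.stat(path)
            entry = cache.get(path)
            if entry is None or entry[:2] != (st.st_mtime, st.st_size):
                cache[path] = (st.st_mtime, st.st_size, head_hash(path))
        with open(CACHE, 'wb') as f:
            pickle.dump(cache, f)
        return cache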