
Tool to remove duplicates and other lint, being much faster than fdupes


RMLINT

rmlint is a command-line tool that cleans your filesystem of various sorts of lint (unused files, twins, etc.).
It was written mainly for Unix-like operating systems, but should also work on Mac OS X (not yet tested!).

DISCLAIMER:
THERE IS NO WARRANTY THAT THIS SOFTWARE WON’T KILL YOUR FILES,
NOR THAT IT WON’T KILL YOUR KITTEN, BURN YOUR HOUSE, OR WHATEVER.
NONETHELESS, IT WAS WRITTEN VERY CAREFULLY AND SHOULD DO WHAT IT IS SUPPOSED TO DO.

INSTALL

Download

Download the sources in one of the following ways:

Git way

git clone https://github.com/sahib/rmlint.git

Zip way

wget --no-check-certificate http://github.com/sahib/rmlint/zipball/master -O rmlint.zip
unzip rmlint.zip

Tar way

wget --no-check-certificate http://github.com/sahib/rmlint/tarball/master -O rmlint.tar
tar xfv rmlint.tar

Now compile

cd <name_of_the_dir>
make -j 2  && sudo make install

Packages

A PKGBUILD is available in the AUR: rmlint-git

HELP:

Use:

  • 'rmlint -h'
  • 'man rmlint'

FEATURES

  • Very fast (written in pure C, in many cases faster than rdfind, and always orders of magnitude faster than fdupes).
  • Outputs both a ready-to-use script to handle finds and an easy-to-parse logfile.
  • Tries to minimize I/O as much as possible (focus on CPU-usage).
  • Finds duplicates, nonstripped binaries, files with same basenames (nameclusters), empty files/directories, old tempdata, strange filenames and bad links.
  • Displays finds in realtime. (like ‘duff’ or ‘fdupes’)
  • Safely abortable at any time (will write log & script).
  • No extra dependencies at all (glibc2 and pthread are something you already have).
  • Colorful output (can be disabled via -B).
  • Regex filter for both files and directories.
  • It has been tested with very large filesets, with a record of finding 166GB of dupes and a logsize of 82MB (cheers Christoph).
  • Handles the files the way you want:
    • Replaces the duplicate file with a symlink. (-m link)
    • Removes the file without asking you. (-m noask)
    • Simply lists all files without doing anything dangerous. (-m list)
    • Executes a user-specified command for each file. (-m cmd)
    • Asks you for each file what you want to do (added for convenience only, to be honest). (-m ask)
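
Some of the simpler lint types from the feature list (empty files, empty directories, bad links) can be found with plain filesystem calls. The following is a minimal Python sketch of that idea, not rmlint's C implementation; the function name `find_simple_lint` and the result layout are assumptions made for illustration.

```python
import os

def find_simple_lint(root):
    """Collect empty files, empty directories and dangling symlinks below root."""
    lint = {"empty_files": [], "empty_dirs": [], "bad_links": []}
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory with no entries at all counts as empty.
        if not dirnames and not filenames:
            lint["empty_dirs"].append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # exists() follows the link, so False means the target is gone.
                if not os.path.exists(path):
                    lint["bad_links"].append(path)
            elif os.path.getsize(path) == 0:
                lint["empty_files"].append(path)
    return lint
```

Calling `find_simple_lint("/some/dir")` returns three lists of paths. Nothing is deleted, which roughly corresponds to the spirit of `-m list`.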

ALGORITHM

The algorithm tries to minimize I/O as far as possible, thus focusing on CPU usage (which can reach up to 390% on a quadcore).

  1. Go through all directories and collect all files conforming to regexpattern / dirpattern / hiddenstatus.
  2. Lint other than duplicates is detected here on the fly (like nonstripped binaries; every file is checked).
  3. The rest of the list (all files except those from step 2) is sorted by filesize.
  4. Elements with a unique filesize are kicked out (because they can’t have a twin).
  5. The list is divided into sublists, one sublist per size.
  6. Each sublist is sorted by inode (to speed up reading from HD).
  7. Each group is processed separately:
    1. If the size of the group exceeds a certain limit, it is processed in its own thread.
    2. Otherwise the group is processed within the main thread.
  8. Processing: for each file of a group..
    1. A short fingerprint from the start/end plus some bytes in the middle of the file is read and stored.
    2. Nonmatching files are kicked out; if the group consists of one element or fewer, rmlint forgets about it.
    3. md5sums are calculated for the rest of the group (only the part of the file that hasn’t been read yet is used for the md5sum calculation).
    4. If the group size exceeds a certain limit, the group is split into several equal-sized subgroups:
      1. The whole file is read blockwise while other threads have to wait (so no useless jumping is done).
      2. After a block is read (blocksize is about 2MB), the md5 is updated while another thread reads at the same time; then back to the previous step.
    5. md5sums, filesize, fingerprint and the bytes in the middle are checked against each other (to double-check and prevent false positives).
    6. The result is logged/handled to script / log / screen (other threads wait for this short time, so no chaos is created).
  9. Repeat for every group, and print statistics.
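
The narrowing idea behind the steps above (size, then fingerprint, then md5) can be sketched in a few lines of Python. This is a simplified, single-threaded illustration and not rmlint's actual C code: the 4096-byte sample size and all function names are assumptions, and unlike rmlint the sketch hashes the whole file instead of only the part not yet read.

```python
import hashlib
import os
from collections import defaultdict

FINGERPRINT_BYTES = 4096  # assumed sample size; rmlint's real value may differ

def fingerprint(path, size):
    """Read a few bytes from the start, the middle and the end of the file."""
    n = min(FINGERPRINT_BYTES, size)
    with open(path, "rb") as f:
        head = f.read(n)
        f.seek(max(0, size // 2 - n // 2))
        middle = f.read(n)
        f.seek(max(0, size - n))
        tail = f.read(n)
    return head, middle, tail

def md5sum(path, blocksize=2 * 1024 * 1024):
    """Hash the file blockwise (about 2MB per block, as in step 8.4)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()

def split_by(paths, key):
    """Group paths by key(path); groups with fewer than two members are dropped."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]

def find_duplicates(root):
    """Return a list of groups; each group is a list of paths with equal content."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                files.append(path)

    duplicates = []
    for same_size in split_by(files, os.path.getsize):
        size = os.path.getsize(same_size[0])
        if size == 0:
            continue  # empty files are a separate kind of lint
        # Sorting by inode number approximates on-disk order (step 6).
        same_size.sort(key=lambda p: os.stat(p).st_ino)
        for same_print in split_by(same_size, lambda p: fingerprint(p, size)):
            duplicates.extend(split_by(same_print, md5sum))
    return duplicates
```

Each stage only touches files that survived the previous, cheaper one, so most files are never read in full.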

RMLINT IS KNOWN TO RUN FINE ON THESE PLATFORMS:

  • Linux 32/64
  • Solaris

Note1: It is written in ANSI C, so every ANSI C compiler should happily compile it.
Note2: rmlint uses alloca(); if you want to port it, you may need to replace it with malloc() (and a corresponding free()).

NOTE ABOUT FALSE POSITIVES

Short: False positives are actually possible, but very, very, very unlikely.
Longer: Files would need to have the same size, fingerprint and checksums to be marked as twins.
md5 is not perfect, but the probability of getting false positives on a normal set of data is the same as lim(1/x) : x → +inf = 0 + h; where h ~ 0
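
To put a rough number on "h ~ 0": if md5 is idealized as a uniformly random 128-bit function (an assumption that only holds for non-adversarial data, since deliberately crafted md5 collisions exist), the birthday estimate bounds the chance that any two of n different files share a checksum:

```latex
P(\text{collision}) \;\le\; \binom{n}{2}\, 2^{-128} \;\approx\; \frac{n^{2}}{2^{129}}
```

For n = 10^6 files this is about 1.5 × 10^-27, and the additional size and fingerprint checks can only lower it further.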

But isn’t there a way to be 100% sure? Yes, there is: the --paranoid/-p option. It does a
true byte-by-byte comparison of each(!) file. Be warned, it’s incredibly slow.
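
What a byte-by-byte comparison means can be shown with a short Python sketch. It only illustrates the idea and is not taken from rmlint's source; the function name and the 64KB block size are assumptions.

```python
def same_bytes(path_a, path_b, blocksize=64 * 1024):
    """Return True only if both files have exactly the same content."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False  # first difference (or different length) found
            if not block_a:
                return True   # both files ended without any difference
```

Every candidate pair has to be read completely, which is why this kind of check is so much slower than comparing checksums.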

If you find false positives, they are most likely caused by a bug in rmlint. In this case please send a bug report to sahib@online.de,
so others won’t suffer from it.

COMPARISON TO OTHER TOOLS

(this list could get very, very long, but never accurate)

compared to…

  • ..fdupes / duff:
    • + LOTS faster
    • + more options
    • + also finds bad links and other stuff
    • + logging
    • - did find one file more once. :-)
  • ..rdfind:
    • + Live output of finds
    • + mostly as fast, or faster
    • + a little less buggy ;-)
    • - rdfind is faster with many small files (like Sourcedirs)
  • ..fslint:
    • + Faster / more options
    • - fslint also finds broken UID/GIDs (this never happened to me, anyone?)
  • ..[other disadvantages]
    • - does not look into archives (things like those could be performed by some bash script easier)
    • - no gui (some people mark this as a clear ‘+’)

Pseudobenchmark


The machine was a regular quadcore with an even more regular HDD and an absolutely regular Linux x86_64.
Measurements were taken with the 2nd, 3rd & 4th run of the programs.

|                 | rdfind  | fdupes  | rmlint | notes                                   |
| 88GB Documents  | 1,430s  | 8,137s  | 0.656s | rmlint CPU usage was 310%, rdfind’s 99% |
| 2.2GB of Source | 12.030s | 30.552s | 1.641s | rdfind was faster on the first run.     |
| 50GB of Music   | 0.089s  | 1:54min | 0.097s | Dir did not contain any twins. :-)      |

FAQ

Q: I want only the found files to be displayed! Like fdupes does!
A: “-v 1” is your friend.

Q: I guess I found a bug, what now?
A: Great! Write me an email (sahib@online.de), with a nifty problem description and/or
a patch and/or any suggestions and/or a backtrace (if it was a crash)

Q: Can I set a ‘preferred dir’ when specifying more than one dir, which picks the orig from the preferred one?
A: Yes, prefix the preferred directory with ‘//’; the first file found in this directory is tagged as the original.
Future versions might tag all files in the //-dir as original.

Q: None of the -m options satisfy my needs. What can I do?
A: You can specify your own commands with -c/-C; they also replace the default commands in the script.
For example, you can pipe rmlint’s finds directly to sh:
rmlint -c "echo '<dupl>' # same as '<orig>'" -C "ls -la '<orig>'" -v 5 | sh

Q: I want hardlinks instead of symlinks (as with -m link), how can I do this?
A: rmlint -c "rm -f '<dupl>' && ln '<orig>' '<dupl>'" YOUR_TARGET_DIR_HERE
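
If you prefer to do the relinking from your own script, the same effect can be sketched in Python. This is a variation on the shell command above, not something rmlint ships: the function name and the `.rmlint-tmp` suffix are made up for the example. It creates the link under a temporary name first and then renames it over the duplicate, so the duplicate's path never disappears in between.

```python
import os

def replace_with_hardlink(orig, dupl):
    """Replace dupl with a hard link to orig."""
    tmp = dupl + ".rmlint-tmp"  # hypothetical temporary name
    os.link(orig, tmp)          # fails if both paths are on different filesystems
    os.replace(tmp, dupl)       # atomic rename over the duplicate
```

Hard links only work within a single filesystem; across filesystems a symlink (-m link) is the only option.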

Q: Are there bugs I should know of?
A: None that are really known, only some strange output formatting sometimes.

Q: Room for improvement?
A: Of course. Finding non-stripped binaries is slow (should use libmagic), a progress bar should be added maybe.

DONATE

You might also consider a small donation (CS students are already motivated by 1 cent) if you use any of my software in a commercial environment:


Flattr this

Alternatively you can use PayPal: http://sahib.github.com/donate.html
