# duplikaatti


Remove duplicate files, and do it fast. duplikaatti is designed to go through 50 TiB+ of data and hundreds of thousands of files, finding the duplicates in a few minutes.

## Algorithm

- Build a file list from the given directories
  - do not add files whose identifier is already on the list (Windows: file ID, *nix: inode)
  - do not add zero-byte files
  - directories listed first have a higher priority than those listed last
- Remove every file whose size is unique in the list (e.g. if there is only one 1000-byte file, remove it from the list)
- Read the first bytes of each file and compute a SHA-256 sum of those bytes (the sketch after this list illustrates these winnowing passes)
- Remove all hashes that occurred only once
- Read the last bytes of each file and compute a SHA-256 sum of those bytes
- Remove all hashes that occurred only once
- Hash the whole contents of the files that are left
- Remove all hashes that occurred only once
- Build the list of which files to keep and which to remove
  - use directory priority and file age to decide what to keep
    - the oldest, highest-priority files are kept
- Finally, remove the duplicate files from the filesystem(s)
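The passes above can all be expressed as one generic "group by key, drop singleton groups" step applied with progressively more expensive keys. Below is a minimal Go sketch of that idea plus the keeper selection; every name in it (`candidate`, `winnow`, `hashAt`, `pickKeeper`) and the 4096-byte probe size are illustrative assumptions, not duplikaatti's actual internals.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"sort"
	"time"
)

// probeSize is an assumed probe length; the real tool's value may differ.
const probeSize = 4096

type candidate struct {
	path     string
	size     int64
	priority int // lower = listed earlier on the command line
	modTime  time.Time
}

// winnow groups files by a key and keeps only groups with two or more
// members, so each pass shrinks the work for the next, more expensive one.
func winnow(files []candidate, key func(candidate) (string, error)) []candidate {
	groups := map[string][]candidate{}
	for _, f := range files {
		k, err := key(f)
		if err != nil {
			continue // unreadable file: drop it from consideration
		}
		groups[k] = append(groups[k], f)
	}
	var out []candidate
	for _, g := range groups {
		if len(g) > 1 {
			out = append(out, g...)
		}
	}
	return out
}

// hashAt returns the SHA-256 of up to n bytes read from the start of the
// file, or from its end when fromEnd is set.
func hashAt(path string, n int64, fromEnd bool) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	if fromEnd {
		if fi, err := f.Stat(); err == nil && fi.Size() > n {
			f.Seek(fi.Size()-n, io.SeekStart)
		}
	}
	h := sha256.New()
	io.CopyN(h, f, n) // a short read is fine for files smaller than n
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

// pickKeeper sorts a duplicate group so the highest-priority, oldest file
// comes first; everything after index 0 is a removal candidate.
func pickKeeper(group []candidate) (keep candidate, remove []candidate) {
	sort.Slice(group, func(i, j int) bool {
		if group[i].priority != group[j].priority {
			return group[i].priority < group[j].priority
		}
		return group[i].modTime.Before(group[j].modTime)
	})
	return group[0], group[1:]
}

func main() {
	var files []candidate // filled by a directory walk in the real tool
	files = winnow(files, func(c candidate) (string, error) {
		return fmt.Sprint(c.size), nil // pass 1: size only, no I/O
	})
	files = winnow(files, func(c candidate) (string, error) {
		h, err := hashAt(c.path, probeSize, false) // pass 2: first bytes
		return fmt.Sprintf("%d/%s", c.size, h), err
	})
	files = winnow(files, func(c candidate) (string, error) {
		h, err := hashAt(c.path, probeSize, true) // pass 3: last bytes
		return fmt.Sprintf("%d/%s", c.size, h), err
	})
	// Pass 4 would hash whole file contents the same way, after which
	// pickKeeper splits each duplicate group into a keeper and removals.
	fmt.Println(len(files), "files remain as duplicate candidates")
}
```

Note that the partial-hash keys include the file size, so files are only ever compared against others of the same length, consistent with the size filter in the first step.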

## Usage

```
Duplicate file remover (version 1.0.0)
Removes duplicate files. Algorithm idea from rdfind.

Usage of duplikaatti [options] <directories>:

Parameters:
  -remove
    	Actually remove files.

Examples:
  Test what would be removed:
    duplikaatti /home/raspi/storage /mnt/storage
  Remove files:
    duplikaatti -remove /home/raspi/storage /mnt/storage
```

Idea inspired by https://github.com/pauldreik/rdfind