
tomventa/dedupe

Dedupe

Dedupe is a simple tool to deduplicate lines in text files. It processes files and directories, removing duplicate lines and optionally sorting the content.

Why

When I worked with large text files, I often found myself needing to remove duplicate lines. I tried several existing tools, but they either lacked features I needed or were too slow for my use case. So I decided to create my own tool that would be fast and have the features I wanted.

Features

  • Deduplication of lines in text files
  • Sorting of output file content
  • Processing of directories (even recursively)
  • Configurable memory usage
  • Concurrent processing with multiple workers
  • Progress indication and verbose logging
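
To make the first two features concrete, here is a minimal sketch of line deduplication with optional sorting. It is not the tool's actual source: the names `dedupeLines` and `dedupeString` are hypothetical, and it assumes the whole set of unique lines fits in memory.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"sort"
	"strings"
)

// dedupeLines reads r line by line and returns each distinct line once.
// With sortOutput false, lines keep their first-seen order.
// (Hypothetical helper for illustration, not taken from the dedupe source.)
func dedupeLines(r io.Reader, sortOutput bool) ([]string, error) {
	seen := make(map[string]struct{})
	var unique []string

	scanner := bufio.NewScanner(r)
	// Raise the default 64 KiB token limit so long lines do not abort the scan.
	scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for scanner.Scan() {
		line := scanner.Text()
		if _, dup := seen[line]; dup {
			continue
		}
		seen[line] = struct{}{}
		unique = append(unique, line)
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	if sortOutput {
		sort.Strings(unique)
	}
	return unique, nil
}

// dedupeString is a convenience wrapper for in-memory text.
func dedupeString(text string, sortOutput bool) ([]string, error) {
	return dedupeLines(strings.NewReader(text), sortOutput)
}

func main() {
	lines, err := dedupeString("pear\napple\npear\nbanana\napple\n", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(lines, ",")) // prints "apple,banana,pear"
}
```

Using a `map[string]struct{}` as the set keeps the lookup cost per line roughly constant. The trade-off is that every unique line is held in memory twice (once as a map key, once in the slice header), which is why a memory limit matters for large inputs.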

Help

Usage of dedupe:
  -input string
    	Input file or directory to process
  -max-memory uint
    	Maximum total memory usage in Megabytes (default 2048)
  -nologo
    	Disable printing the logo
  -output string
    	Output directory for deduplicated files (if not overwriting originals)
  -overwrite
    	Overwrite original files with deduplicated versions
  -recursive
    	Recursively process directories
  -sort
    	Sort output file content alphabetically (default true)
  -verbose
    	Enable verbose logging
  -workers int
    	Number of concurrent workers for processing (default <number of CPU cores>)
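
The layout of the usage text above matches what Go's standard `flag` package prints, so the sketch below assumes the tool parses its options that way. The `options` struct and `parseOptions` function are hypothetical, and only a subset of the flags is reproduced.

```go
package main

import (
	"flag"
	"fmt"
	"runtime"
)

// options mirrors a subset of the flags listed in the usage text.
// (Hypothetical type for illustration, not taken from the dedupe source.)
type options struct {
	input     string
	maxMemory uint
	sortLines bool
	workers   int
}

// parseOptions parses args (without the program name) into options.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("dedupe", flag.ContinueOnError)
	fs.StringVar(&o.input, "input", "", "Input file or directory to process")
	fs.UintVar(&o.maxMemory, "max-memory", 2048, "Maximum total memory usage in Megabytes")
	fs.BoolVar(&o.sortLines, "sort", true, "Sort output file content alphabetically")
	fs.IntVar(&o.workers, "workers", runtime.NumCPU(), "Number of concurrent workers for processing")
	err := fs.Parse(args)
	return o, err
}

func main() {
	// Equivalent to: dedupe -input data.txt -sort=false -max-memory 512
	o, err := parseOptions([]string{"-input", "data.txt", "-sort=false", "-max-memory", "512"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", o)
}
```

If the tool does use the `flag` package, a boolean that defaults to true, such as `-sort`, has to be turned off with the `-sort=false` form. Writing `-sort false` would leave sorting enabled and treat `false` as a positional argument.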

Next Steps

I don't plan to add many more features to this tool, as it already serves my needs. The only feature I want to add is the ability to detect duplicate lines across multiple files in a directory, but this will take some time to implement properly, since in my use case I can't load all files into memory at once. I also want to handle files larger than available memory more gracefully, but this is a low priority for the moment.
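
One possible direction for cross-file detection, offered only as a sketch and not as the planned design, is to remember a fixed-size digest of each line instead of the line itself. Files can then be streamed one at a time while the set of digests is shared between them. The type `crossFileDeduper` and its methods are hypothetical; the approach assumes that 32 bytes per unique line fits in memory and only saves space when lines are longer than the digest.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"fmt"
	"io"
	"strings"
)

// crossFileDeduper remembers a SHA-256 digest of every line it has seen,
// so the same line appearing in a later input is recognised as a duplicate.
// (Hypothetical type for illustration, not taken from the dedupe source.)
type crossFileDeduper struct {
	seen map[[sha256.Size]byte]struct{}
}

func newCrossFileDeduper() *crossFileDeduper {
	return &crossFileDeduper{seen: make(map[[sha256.Size]byte]struct{})}
}

// process copies lines from r to w, skipping any line whose digest was
// already recorded by this call or by an earlier call on the same deduper.
func (d *crossFileDeduper) process(r io.Reader, w io.Writer) error {
	scanner := bufio.NewScanner(r)
	out := bufio.NewWriter(w)
	for scanner.Scan() {
		sum := sha256.Sum256(scanner.Bytes())
		if _, dup := d.seen[sum]; dup {
			continue
		}
		d.seen[sum] = struct{}{}
		if _, err := out.Write(scanner.Bytes()); err != nil {
			return err
		}
		if err := out.WriteByte('\n'); err != nil {
			return err
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return out.Flush()
}

// processString is a convenience wrapper for in-memory text.
func (d *crossFileDeduper) processString(text string) (string, error) {
	var sb strings.Builder
	err := d.process(strings.NewReader(text), &sb)
	return sb.String(), err
}

func main() {
	d := newCrossFileDeduper()
	// Two strings stand in for two files in the same directory.
	for _, content := range []string{"a\nb\na\n", "b\nc\n"} {
		out, err := d.processString(content)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q\n", out)
	}
}
```

This sketch preserves input order and does not sort. Sorting across files that do not fit in memory would need a different technique, such as an external merge sort.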

About

Tool to deduplicate file contents
