Dedupe is a simple tool to deduplicate lines in text files. It processes files and directories, removing duplicate lines and optionally sorting the content.
When working with large text files, I often found myself needing to remove duplicate lines. I tried several existing tools, but they either lacked features I needed or were too slow for my use case, so I decided to write my own tool that would be fast and have exactly the features I wanted.
- Deduplication of lines in text files
- Sorting of output file content
- Processing of directories (even recursively)
- Configurable memory usage
- Concurrent processing with multiple workers
- Progress indication and verbose logging
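The core of line deduplication can be sketched in a few lines of Go. Note this is an illustrative sketch, not the tool's actual implementation: `dedupLines` is a hypothetical helper, and the real tool additionally streams input, limits memory, and distributes work across workers.

```go
package main

import (
	"fmt"
	"sort"
)

// dedupLines removes duplicate lines while preserving first-seen order,
// and optionally sorts the result alphabetically (like the -sort flag).
func dedupLines(lines []string, sortOutput bool) []string {
	seen := make(map[string]struct{}, len(lines))
	out := make([]string, 0, len(lines))
	for _, line := range lines {
		if _, ok := seen[line]; ok {
			continue // skip lines we have already emitted
		}
		seen[line] = struct{}{}
		out = append(out, line)
	}
	if sortOutput {
		sort.Strings(out)
	}
	return out
}

func main() {
	lines := []string{"banana", "apple", "banana", "cherry", "apple"}
	fmt.Println(dedupLines(lines, true)) // [apple banana cherry]
}
```

A `map[string]struct{}` is the idiomatic Go set here: the empty struct occupies zero bytes, so memory is spent only on the line keys themselves.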
```
Usage of dedupe:
  -input string
        Input file or directory to process
  -max-memory uint
        Maximum total memory usage in Megabytes (default 2048)
  -nologo
        Disable printing the logo
  -output string
        Output directory for deduplicated files (if not overwriting originals)
  -overwrite
        Overwrite original files with deduplicated versions
  -recursive
        Recursively process directories
  -sort
        Sort output file content alphabetically (default true)
  -verbose
        Enable verbose logging
  -workers int
        Number of concurrent workers for processing (default <number of CPU cores>)
```
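A couple of typical invocations (the paths here are illustrative placeholders, not examples from the tool's own documentation):

```shell
# Deduplicate and sort every file under ./logs, recursively,
# writing the cleaned copies to ./clean
dedupe -input ./logs -output ./clean -recursive -verbose

# Overwrite a single file in place, capping memory at 1 GB
dedupe -input big.txt -overwrite -max-memory 1024
```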
I don't plan to add many more features to this tool, as it already serves my needs. The only feature I still want to add is the ability to detect duplicate lines across multiple files in a directory, but this will take some time to implement properly, since in my use case I can't load all files into memory at once. I also want to handle files larger than available memory more gracefully, but that is a low priority for the moment.