Simple command line utility for removing duplicate lines from input. Like `uniq`, but not limited to adjacent lines.
The standard advice for removing duplicates from a pipeline is `sort | uniq`, or just `sort -u`. That works, but can be suboptimal: `sort` must read all of its input before it can emit any output, which serializes the pipeline and can prevent a multi-CPU machine from keeping all of its cores busy. Sorting also isn't free, imposing an O(n log n) cost.
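For comparison, the same one-pass, order-preserving filtering can be sketched with the classic awk idiom, which keeps a hash of seen lines instead of sorting (shown only to illustrate the approach, not as a replacement for the tool):

```shell
# Print each line only the first time it appears:
# seen[$0]++ is 0 (falsy) on first sight, so !seen[$0]++ is true once per line.
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'
# prints:
# b
# a
# c
```

Note that, unlike `sort -u`, this preserves the original order of first occurrences.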
`dedup` scans for duplicates incrementally, keeping a hash set of the lines it has seen. Hash look-up and insertion are O(1) on average. And because it processes one line at a time, `dedup` can emit each new line immediately, silently dropping subsequent duplicates as they appear.
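A minimal sketch of that approach in Rust (this is illustrative, not the actual `dedup` source; the names here are made up):

```rust
use std::collections::HashSet;
use std::io::{self, BufRead, Write};

/// Drop every line after its first occurrence, preserving input order.
/// (Hypothetical helper for illustration, not dedup's real API.)
fn dedup_lines<'a>(lines: impl IntoIterator<Item = &'a str>) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    // `insert` returns true only when the line was not already present,
    // so the filter keeps exactly the first occurrence of each line.
    lines.into_iter().filter(|line| seen.insert(*line)).collect()
}

fn main() {
    // Streaming variant: emit each previously unseen line as soon as it
    // is read, instead of collecting and sorting the whole input first.
    let stdin = io::stdin();
    let mut out = io::stdout().lock();
    let mut seen: HashSet<String> = HashSet::new();
    for line in stdin.lock().lines() {
        let line = line.expect("read failed");
        if seen.insert(line.clone()) {
            writeln!(out, "{line}").expect("write failed");
        }
    }
}
```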
Currently, the easiest way to install is via cargo. Check out the project and run `cargo install`:

```shell
cargo install --path .
```
```shell
producer | dedup | consumer
```
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.