yuniq is a high-performance, stable line deduplicator: each line's first occurrence is kept in its original position. Unlike the standard `uniq` utility, yuniq does not require its input to be sorted.
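To see why that matters, note that coreutils `uniq` only collapses *adjacent* duplicates, so repeats in unsorted input slip through, while `sort -u` catches them but destroys the original line order:

```sh
printf 'a\nb\na\n' | uniq     # prints a, b, a -- the repeated "a" is not adjacent, so it survives
printf 'a\nb\na\n' | sort -u  # prints a, b -- deduplicated, but input order is gone
```

yuniq, by contrast, drops the repeated line while preserving input order, with no sorting pass required.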
```text
Hyperfast line deduplicator

Usage: yuniq [OPTIONS]

Options:
      --fast                       Use 64-bit hashing (faster, negligible collision risk)
  -c, --count                      Prefix each line with its global occurrence count, sorted by count
  -r, --reverse                    Reverse sort order (requires --count, incompatible with --no-sort)
  -S, --no-sort                    Preserve insertion order instead of sorting by count (requires --count)
      --size-hint <SIZE_HINT>      Expected number of unique lines (used to pre-size internal structures) [default: 1048576]
  -w, --check-chars <CHECK_CHARS>  Only compare the first N characters of each line
  -s, --skip-chars <SKIP_CHARS>    Skip the first N characters of each line before comparing
  -f, --skip-fields <SKIP_FIELDS>  Skip the first N whitespace-delimited fields of each line before comparing
  -h, --help                       Print help
```
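As an illustration of the `--count` semantics, the awk one-liner below is a conceptual analogue of `-c -S` (per-line counts emitted in first-seen order); it is only a sketch of the behavior, not yuniq's implementation, and yuniq's exact output format may differ:

```sh
# Hypothetical analogue of `yuniq -c -S`: count every occurrence of each line,
# then print "count line" in the order lines were first seen.
printf 'b\na\nb\nb\n' | awk '
  !($0 in n) { order[++k] = $0 }   # remember first-seen order
  { n[$0]++ }                      # count every occurrence
  END { for (i = 1; i <= k; i++) print n[order[i]], order[i] }'
# 3 b
# 1 a
```

Adding `-r` or dropping `-S` would reorder this output by count rather than by insertion order.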
Detailed benchmark results can be generated by running the `./bench.sh` script.
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| `yuniq --fast < data.txt > /dev/null` | 87.2 ± 4.9 | 80.0 | 96.9 | 1.00 |
| `yuniq < data.txt > /dev/null` | 214.9 ± 3.2 | 210.6 | 220.4 | 2.46 ± 0.14 |
| `xuniq < data.txt > /dev/null` | 228.2 ± 13.2 | 206.2 | 256.6 | 2.62 ± 0.21 |
| `hist -u < data.txt > /dev/null` | 261.8 ± 8.6 | 243.3 | 275.2 | 3.00 ± 0.20 |
| `ripuniq < data.txt > /dev/null` | 368.5 ± 4.8 | 363.2 | 377.2 | 4.23 ± 0.24 |
| `runiq < data.txt > /dev/null` | 593.7 ± 3.1 | 589.4 | 598.9 | 6.81 ± 0.38 |
| `huniq < data.txt > /dev/null` | 596.9 ± 4.7 | 589.6 | 604.6 | 6.85 ± 0.39 |
| `xuniq --safe < data.txt > /dev/null` | 619.8 ± 8.9 | 609.8 | 634.4 | 7.11 ± 0.41 |
| `perl -ne 'print if !$seen{$_}++' data.txt > /dev/null` | 1791.1 ± 14.3 | 1778.3 | 1827.2 | 20.54 ± 1.16 |
| `awk '!seen[$0]++' data.txt > /dev/null` | 3637.1 ± 10.3 | 3620.6 | 3650.1 | 41.71 ± 2.35 |
| `sort -u data.txt > /dev/null` | 6904.1 ± 52.1 | 6862.5 | 7024.8 | 79.18 ± 4.48 |
| `sort data.txt \| uniq > /dev/null` | 7471.6 ± 18.2 | 7451.4 | 7499.2 | 85.69 ± 4.82 |
The benchmark data (5 million lines, of which 3.75 million are unique) is generated with:

```sh
{
  seq 1 1250000 | awk '{print "dup_"$1; print "dup_"$1}'  # duplicated lines
  seq 1250001 3750000 | awk '{print "uniq_"$1}'           # unique lines
} | shuf > data.txt
```
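A scaled-down run of the same generator (5 values per branch instead of millions; the file name `mini.txt` is illustrative) makes the dataset's shape easy to check: every `dup_N` line appears exactly twice and every `uniq_N` line once.

```sh
{
  seq 1 5 | awk '{print "dup_"$1; print "dup_"$1}'   # 10 lines, 5 distinct
  seq 6 10 | awk '{print "uniq_"$1}'                 # 5 unique lines
} | shuf > mini.txt
wc -l < mini.txt          # 15 total lines
sort -u mini.txt | wc -l  # 10 distinct lines
```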
The timings above are collected with `hyperfine`:

```sh
hyperfine --warmup 3 \
  'yuniq --fast < data.txt > /dev/null' \
  'yuniq < data.txt > /dev/null' \
  'xuniq < data.txt > /dev/null' \
  'xuniq --safe < data.txt > /dev/null' \
  'hist -u < data.txt > /dev/null' \
  'ripuniq < data.txt > /dev/null' \
  'runiq < data.txt > /dev/null' \
  'huniq < data.txt > /dev/null' \
  'perl -ne '\''print if !$seen{$_}++'\'' data.txt > /dev/null' \
  'awk '\''!seen[$0]++'\'' data.txt > /dev/null' \
  'sort -u data.txt > /dev/null' \
  'sort data.txt | uniq > /dev/null' \
  --export-markdown bench.md
```