wyfo/yuniq

yuniq: hyperfast line deduplicator

yuniq is a high-performance line deduplicator that is stable: output preserves the order in which lines first appear.

Unlike the standard uniq utility, yuniq does not require your input to be sorted.
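To illustrate the difference, here is a Python sketch (an illustration only, not yuniq's implementation): standard uniq collapses only *adjacent* duplicates, so it needs sorted input, whereas a hash-based dedup like yuniq's removes duplicates anywhere in the stream.

```python
def uniq_adjacent(lines):
    # Classic `uniq`: drops a line only if it equals the previous line.
    prev = object()  # sentinel that never equals a string
    for line in lines:
        if line != prev:
            yield line
        prev = line

def dedup_global(lines):
    # Hash-set dedup: keeps the first occurrence of each line, no sorting needed.
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

data = ["a", "b", "a", "a", "c"]
print(list(uniq_adjacent(data)))  # ['a', 'b', 'a', 'c'] -- non-adjacent "a" survives
print(list(dedup_global(data)))   # ['a', 'b', 'c']
```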

Usage

Hyperfast line deduplicator

Usage: yuniq [OPTIONS]

Options:
      --fast                       Use 64-bit hashing (faster, negligible collision risk)
  -c, --count                      Prefix each line with its global occurrence count, sorted by count
  -r, --reverse                    Reverse sort order (requires --count, incompatible with --no-sort)
  -S, --no-sort                    Preserve insertion order instead of sorting by count (requires --count)
      --size-hint <SIZE_HINT>      Expected number of unique lines (used to pre-size internal structures) [default: 1048576]
  -w, --check-chars <CHECK_CHARS>  Only compare the first N characters of each line
  -s, --skip-chars <SKIP_CHARS>    Skip the first N characters of each line before comparing
  -f, --skip-fields <SKIP_FIELDS>  Skip the first N whitespace-delimited fields of each line before comparing
  -h, --help                       Print help
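As an illustration of how these options combine, here is a hedged Python sketch (function names, GNU-uniq-style field skipping, and the ascending sort direction are my assumptions, not yuniq internals): -f, -s, and -w transform each line into a comparison key, and --count tallies occurrences per line.

```python
from collections import Counter

def compare_key(line, skip_fields=0, skip_chars=0, check_chars=None):
    # Build the comparison key, mirroring -f, -s, -w as documented above.
    s = line
    for _ in range(skip_fields):      # -f: skip N whitespace-delimited fields
        s = s.lstrip()
        i = 0
        while i < len(s) and not s[i].isspace():
            i += 1
        s = s[i:]
    s = s[skip_chars:]                # -s: skip N leading characters
    if check_chars is not None:       # -w: compare only the first N characters
        s = s[:check_chars]
    return s

def count_sorted(lines):
    # --count: prefix each line with its occurrence count, sorted by count.
    # (Ascending order is assumed here; --reverse would flip it.)
    counts = Counter(lines)
    for line, n in sorted(counts.items(), key=lambda kv: kv[1]):
        yield f"{n} {line}"

print(compare_key("id_42 hello", skip_fields=1))  # -> " hello"
print(list(count_sorted(["a", "b", "b"])))        # -> ['1 a', '2 b']
```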

Benchmarks

Detailed benchmark results can be generated by running the ./bench.sh script.

Old benchmark

Benchmark Results

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `yuniq --fast < data.txt > /dev/null` | 87.2 ± 4.9 | 80.0 | 96.9 | 1.00 |
| `yuniq < data.txt > /dev/null` | 214.9 ± 3.2 | 210.6 | 220.4 | 2.46 ± 0.14 |
| `xuniq < data.txt > /dev/null` | 228.2 ± 13.2 | 206.2 | 256.6 | 2.62 ± 0.21 |
| `hist -u < data.txt > /dev/null` | 261.8 ± 8.6 | 243.3 | 275.2 | 3.00 ± 0.20 |
| `ripuniq < data.txt > /dev/null` | 368.5 ± 4.8 | 363.2 | 377.2 | 4.23 ± 0.24 |
| `runiq < data.txt > /dev/null` | 593.7 ± 3.1 | 589.4 | 598.9 | 6.81 ± 0.38 |
| `huniq < data.txt > /dev/null` | 596.9 ± 4.7 | 589.6 | 604.6 | 6.85 ± 0.39 |
| `xuniq --safe < data.txt > /dev/null` | 619.8 ± 8.9 | 609.8 | 634.4 | 7.11 ± 0.41 |
| `perl -ne 'print if !$seen{$_}++' data.txt > /dev/null` | 1791.1 ± 14.3 | 1778.3 | 1827.2 | 20.54 ± 1.16 |
| `awk '!seen[$0]++' data.txt > /dev/null` | 3637.1 ± 10.3 | 3620.6 | 3650.1 | 41.71 ± 2.35 |
| `sort -u data.txt > /dev/null` | 6904.1 ± 52.1 | 6862.5 | 7024.8 | 79.18 ± 4.48 |
| `sort data.txt \| uniq > /dev/null` | 7471.6 ± 18.2 | 7451.4 | 7499.2 | 85.69 ± 4.82 |

Commands to reproduce

{
  seq 1 1250000 | awk '{print "dup_"$1; print "dup_"$1}'; # Duplicated lines
  seq 1250001 3750000 | awk '{print "uniq_"$1}';          # Unique lines
} | shuf > data.txt

hyperfine --warmup 3 \
  'yuniq --fast < data.txt > /dev/null' \
  'yuniq < data.txt > /dev/null' \
  'xuniq < data.txt > /dev/null' \
  'xuniq --safe < data.txt > /dev/null' \
  'hist -u < data.txt > /dev/null' \
  'ripuniq < data.txt > /dev/null' \
  'runiq < data.txt > /dev/null' \
  'huniq < data.txt > /dev/null' \
  'perl -ne '\''print if !$seen{$_}++'\'' data.txt > /dev/null' \
  'awk '\''!seen[$0]++'\'' data.txt > /dev/null' \
  'sort -u data.txt > /dev/null' \
  'sort data.txt | uniq > /dev/null' \
  --export-markdown bench.md

About

yuniq is a blazing-fast utility to remove duplicate lines from input.
