
cut: optimizations #2111


Merged · 2 commits merged into uutils:master on Apr 28, 2021

Conversation

@cbjadwani (Contributor) commented Apr 24, 2021

  • Use buffered stdout to reduce write syscalls.

    This simple change yielded the biggest performance gain (a minimal BufWriter sketch follows this list).

  • Use for_byte_record_with_terminator from the bstr crate.

    This minimizes the per-line copying needed by BufReader::read_until. The cut_fields and cut_fields_delimiter functions used read_until to iterate over lines, which required copying each input line into the line buffer. With for_byte_record_with_terminator, copying is minimized: most of the time it calls our closure with a reference directly into BufReader's internal buffer, and it only needs to copy (internally) to handle a line that is split across the end of the buffer (see the record-iteration sketch after this list).

  • Rewrite Searcher to use memchr.

    Switch from the naive byte-by-byte implementation to one backed by memchr (see the Searcher sketch after this list).

  • Rewrite cut_bytes almost entirely.

    This was already well optimized, and the gain here is not from avoiding copying. In fact, the old code needed zero copying, whereas the new implementation introduces some copying, similar to cut_fields described above. But that occasional copying cost is more than offset by the very fast memchr used inside for_byte_record_with_terminator. The change also simplifies the code significantly; the buffer module has been removed (see the cut_bytes sketch after this list).
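
The four changes above are easiest to see in code. First, a minimal sketch of the buffered-stdout idea (illustrative code, not the PR's actual diff): wrapping the locked stdout in std::io::BufWriter turns one write syscall per line into roughly one per 8 KiB of output.

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Without the BufWriter, each writeln! issues its own write(2);
    // with it, output is flushed roughly once per 8 KiB buffer.
    let mut out = BufWriter::new(stdout.lock());
    for i in 0..1_000_000 {
        writeln!(out, "line {}", i)?;
    }
    out.flush() // push out the final partial buffer before exiting
}
```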
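
Next, a sketch of record iteration with for_byte_record_with_terminator, assuming bstr's bstr::io::BufReadExt trait is in scope; the line-counting closure is just a stand-in for real processing:

```rust
use std::io;
use bstr::io::BufReadExt; // provides for_byte_record_with_terminator

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut lines: u64 = 0;
    // Most records are handed to the closure as slices borrowed
    // directly from the reader's internal buffer; only a record that
    // straddles a buffer boundary is copied internally.
    stdin.lock().for_byte_record_with_terminator(b'\n', |record| {
        let _ = record; // `record` includes the trailing b'\n' if present
        lines += 1;
        Ok(true) // keep going; Ok(false) stops the iteration
    })?;
    println!("{} lines", lines);
    Ok(())
}
```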
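
A memchr-backed Searcher can look roughly like the following; the struct and field names are guesses for illustration, not the PR's exact code:

```rust
use memchr::memchr;

/// Iterator over the positions of a delimiter byte in a haystack.
struct Searcher<'a> {
    haystack: &'a [u8],
    needle: u8,
    pos: usize,
}

impl<'a> Searcher<'a> {
    fn new(haystack: &'a [u8], needle: u8) -> Self {
        Searcher { haystack, needle, pos: 0 }
    }
}

impl<'a> Iterator for Searcher<'a> {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        // memchr scans with SIMD where available, far faster than a
        // naive byte-at-a-time loop.
        let rel = memchr(self.needle, &self.haystack[self.pos..])?;
        let abs = self.pos + rel;
        self.pos = abs + 1; // resume just past this delimiter
        Some(abs)
    }
}
```

With this shape, Searcher::new(b"a b c", b' ') yields the delimiter positions 1 and 3.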
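
Finally, a hypothetical cut_bytes-style loop that ties the pieces together, approximating `cut -c2-4`; this is a simplification, not the PR's real implementation:

```rust
use std::io::{self, Write};
use bstr::io::BufReadExt;

/// Write bytes 2-4 (1-indexed, as in `cut -c2-4`) of each line of
/// `reader` to `out`; `out` should be a BufWriter for best results.
fn cut_bytes_sketch<R: io::BufRead, W: Write>(reader: R, out: &mut W) -> io::Result<()> {
    reader.for_byte_record_with_terminator(b'\n', |record| {
        // Strip the terminator that bstr leaves on complete records.
        let line = match record.last() {
            Some(&b'\n') => &record[..record.len() - 1],
            _ => record,
        };
        // Clamp the 1-indexed range 2-4 to the line's length.
        let start = line.len().min(1);
        let end = line.len().min(4);
        out.write_all(&line[start..end])?;
        out.write_all(b"\n")?;
        Ok(true)
    })
}
```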

Benchmarks

  • $ hyperfine -w3 "./target/release/cut -c2-4,8 test-large.txt" "cut -c2-4,8 test-large.txt"

Before

Benchmark #1: ./target/release/cut -c2-4,8 test-large.txt
  Time (mean ± σ):     159.6 ms ±   2.4 ms    [User: 112.5 ms, System: 46.9 ms]
  Range (min … max):   156.3 ms … 163.7 ms    18 runs
 
Benchmark #2: cut -c2-4,8 test-large.txt
  Time (mean ± σ):      29.5 ms ±   1.4 ms    [User: 26.7 ms, System: 2.8 ms]
  Range (min … max):    28.3 ms …  34.2 ms    99 runs
 
Summary
  'cut -c2-4,8 test-large.txt' ran
    5.41 ± 0.27 times faster than './target/release/cut -c2-4,8 test-large.txt'

After

Benchmark #1: ./target/release/cut -c2-4,8 test-large.txt
  Time (mean ± σ):      13.9 ms ±   1.4 ms    [User: 11.0 ms, System: 2.9 ms]
  Range (min … max):    12.7 ms …  19.4 ms    208 runs
 
Benchmark #2: cut -c2-4,8 test-large.txt
  Time (mean ± σ):      29.8 ms ±   1.6 ms    [User: 26.8 ms, System: 3.0 ms]
  Range (min … max):    28.1 ms …  34.1 ms    99 runs
 
Summary
  './target/release/cut -c2-4,8 test-large.txt' ran
    2.14 ± 0.25 times faster than 'cut -c2-4,8 test-large.txt'
  • $ hyperfine -w3 "./target/release/cut -f2-4,8 -d' ' test-large.txt" "cut -f2-4,8 -d' ' test-large.txt"

Before

Benchmark #1: ./target/release/cut -f2-4,8 -d' ' test-large.txt
  Time (mean ± σ):     165.4 ms ±   2.4 ms    [User: 116.6 ms, System: 48.6 ms]
  Range (min … max):   161.6 ms … 170.6 ms    17 runs
 
Benchmark #2: cut -f2-4,8 -d' ' test-large.txt
  Time (mean ± σ):      64.0 ms ±   1.5 ms    [User: 61.1 ms, System: 2.8 ms]
  Range (min … max):    62.2 ms …  68.8 ms    44 runs
 
Summary
  'cut -f2-4,8 -d' ' test-large.txt' ran
    2.59 ± 0.07 times faster than './target/release/cut -f2-4,8 -d' ' test-large.txt'

After

Benchmark #1: ./target/release/cut -f2-4,8 -d' ' test-large.txt
  Time (mean ± σ):      26.8 ms ±   1.6 ms    [User: 23.7 ms, System: 3.1 ms]
  Range (min … max):    24.9 ms …  31.1 ms    103 runs
 
Benchmark #2: cut -f2-4,8 -d' ' test-large.txt
  Time (mean ± σ):      63.9 ms ±   1.6 ms    [User: 60.5 ms, System: 3.4 ms]
  Range (min … max):    62.3 ms …  69.1 ms    46 runs
 
Summary
  './target/release/cut -f2-4,8 -d' ' test-large.txt' ran
    2.38 ± 0.16 times faster than 'cut -f2-4,8 -d' ' test-large.txt'
  • $ hyperfine -w3 "./target/release/cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt" "cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt"

Before

Benchmark #1: ./target/release/cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt
  Time (mean ± σ):     190.6 ms ±   6.1 ms    [User: 140.5 ms, System: 49.9 ms]
  Range (min … max):   185.5 ms … 210.5 ms    15 runs
 
Benchmark #2: cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt
  Time (mean ± σ):      63.8 ms ±   1.8 ms    [User: 60.9 ms, System: 2.8 ms]
  Range (min … max):    62.2 ms …  69.5 ms    46 runs
 
Summary
  'cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt' ran
    2.99 ± 0.13 times faster than './target/release/cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt'

After

Benchmark #1: ./target/release/cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt
  Time (mean ± σ):      34.2 ms ±   1.8 ms    [User: 31.4 ms, System: 2.8 ms]
  Range (min … max):    32.2 ms …  39.1 ms    87 runs
 
Benchmark #2: cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt
  Time (mean ± σ):      63.4 ms ±   1.3 ms    [User: 60.6 ms, System: 2.8 ms]
  Range (min … max):    62.3 ms …  68.0 ms    46 runs
 
Summary
  './target/release/cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt' ran
    1.85 ± 0.10 times faster than 'cut -f2-4,8 -d' ' --output-delimiter=- test-large.txt'

@sylvestre (Contributor) commented:

Well done for the huge wins!

Could you please create a benchmarking doc like
https://github.com/uutils/coreutils/blob/master/src/uu/sort/BENCHMARKING.md
so we have a documented way to benchmark this program? Thanks.

@cbjadwani (Contributor, Author) commented:

Added a benchmarking doc. I think it would be more accessible if, instead of repeating all the details for each utility separately, the common details were extracted and kept at the project root level.

@sylvestre (Contributor) commented:

Great doc, thanks

@sylvestre sylvestre merged commit 1675200 into uutils:master Apr 28, 2021
@cbjadwani cbjadwani deleted the cut_optimizations branch April 29, 2021 03:25