wc: specialize scanning loop on settings. #3708
Conversation
The primary computational loop in wc (iterating over all the characters, computing word lengths, etc.) is configured by a number of boolean options that control the text-scanning behavior. If we monomorphize the loop for each possible combination of scanning configurations, rustc is able to generate better code for each instantiation, at minimum by removing the conditional checks on each iteration, and possibly by enabling things like vectorization. On my computer (aarch64/macOS), I am seeing at least a 5% performance improvement in release builds on all wc flag configurations (other than those that were already specialized) against odyssey1024.txt, with wc -l showing the greatest improvement at 15%.
Excellent! Could you please document how to run this benchmark (or this finding) in https://github.com/uutils/coreutils/blob/main/src/uu/wc/BENCHMARKING.md if it makes sense?
src/uu/wc/src/wc.rs
Outdated
    settings.show_max_line_length,
    settings.show_words,
) {
    (false, false, false, false, false) => {
Sorry, but this is making the code less readable. Would it be possible to make it a bit better?
Do you have any recommendations on how to make it more readable? The only thing I can think of is to write a proc macro, but I'm not sure it would end up being any shorter, and proc macros themselves aren't very readable.
I think rustc will probably also optimize it if you just pass the booleans as normal parameters? It's at least worth a try to remove the giant match statement.
Another thing to investigate is what would happen if we removed the boolean checks altogether in the for loop. Maybe the branching itself is the overhead.
Passing the individual configuration booleans as normal parameters is not sufficient: as far as I know, LLVM won't perform loop unswitching as complex as would be required here. This is confirmed by benchmarking wc -m odyssey1024.txt:
Normal parameters:

```
Benchmark 1: ./target/release/coreutils wc -m odyssey1024.txt
  Time (mean ± σ):     880.2 ms ±  5.2 ms    [User: 841.3 ms, System: 37.8 ms]
  Range (min … max):   873.9 ms … 886.8 ms    10 runs
```

Const generic parameters:

```
Benchmark 1: ./target/release/coreutils wc -m odyssey1024.txt
  Time (mean ± σ):     740.7 ms ±  7.9 ms    [User: 705.0 ms, System: 35.1 ms]
  Range (min … max):   729.2 ms … 749.3 ms    10 runs
```
I suspect that manually removing the boolean checks from the loop would achieve the same performance gains, but it would require writing 16 different versions of word_count_from_reader. IMO one large dispatch table is better than 16 fully expanded copies of the function from a readability perspective?
Thanks for checking the normal params. With removing the checks, I really meant removing the logic. For example, we could still compute the total number of characters, even if show_chars is set to false. The overhead of branching seems much more significant to me than the overhead of doing a single increment. So we would essentially remove all specialization.
Here are benchmark results for wc -L using that approach:
Original:

```
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.588 s ±  0.009 s    [User: 2.527 s, System: 0.060 s]
  Range (min … max):    2.580 s …  2.610 s    10 runs
```

Unconditional character counting:

```
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.617 s ±  0.008 s    [User: 2.549 s, System: 0.067 s]
  Range (min … max):    2.608 s …  2.628 s    10 runs
```

Fully specialized:

```
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.461 s ±  0.009 s    [User: 2.397 s, System: 0.063 s]
  Range (min … max):    2.451 s …  2.474 s    10 runs
```
I stand corrected again! Thanks!
I actually used the guidance from that file, with the only caveat that I used an even longer copy of the Odyssey file.
By extracting the handling of hand-written fast-paths to the same dispatch as the automatic specializations, we can avoid needing to pass `show_bytes` as a const generic to `word_count_from_reader_specialized`. Eliminating this parameter halves the number of arms in the dispatch.
I figured out that I could reduce the size of the dispatch by half by handling the manually-written fast paths in the same way as the automatic ones. This eliminates the need to pass `show_bytes` as a const generic.
bravo :)