
wc: specialize scanning loop on settings. #3708

Merged
sylvestre merged 6 commits into uutils:main from resistor:main
Jul 18, 2022

wc: specialize scanning loop on settings.#3708
sylvestre merged 6 commits intouutils:mainfrom
resistor:main

Conversation

@resistor
Contributor

The primary computational loop in wc (iterating over all the
characters and computing word lengths, etc.) is configured by a
number of boolean options that control the text-scanning behavior.
If we monomorphize this loop for each possible combination of
scanning configurations, rustc is able to generate better code
for each instantiation, at a minimum by removing the conditional
checks on each iteration, and possibly by enabling optimizations
like vectorization.

On my computer (aarch64/macos), I am seeing at least a 5% performance
improvement in release builds on all wc flag configurations
(other than those that were already specialized) against
odyssey1024.txt, with wc -l showing the greatest improvement at 15%.
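
A minimal sketch of the idea (illustrative names, not the actual uutils code): the scanning loop takes its flags as const generic parameters, so rustc compiles a separate copy of the loop for each flag combination, and the per-byte flag checks disappear from every instantiation.

```rust
// Illustrative only: the real wc loop also handles UTF-8 decoding and
// max line length; this sketch keeps just the shape of the technique.
fn scan<const COUNT_LINES: bool, const COUNT_WORDS: bool, const COUNT_CHARS: bool>(
    input: &[u8],
) -> (usize, usize, usize) {
    let (mut lines, mut words, mut chars) = (0, 0, 0);
    let mut in_word = false;
    for &b in input {
        // Each test is on a const parameter, so every monomorphized copy
        // keeps only the work its flag combination actually needs.
        if COUNT_LINES && b == b'\n' {
            lines += 1;
        }
        if COUNT_WORDS {
            let is_space = b.is_ascii_whitespace();
            if !is_space && !in_word {
                words += 1;
            }
            in_word = !is_space;
        }
        if COUNT_CHARS {
            chars += 1;
        }
    }
    (lines, words, chars)
}
```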

@sylvestre
Contributor

Excellent, could you please document how to run this benchmark (or finding) in https://github.com/uutils/coreutils/blob/main/src/uu/wc/BENCHMARKING.md if it makes sense?

settings.show_max_line_length,
settings.show_words,
) {
(false, false, false, false, false) => {
Contributor

Sorry, but this is making the code less readable.
Would it be possible to make it a bit better?

Contributor Author

Do you have any recommendations on how to make it more readable? The only thing I can think of is to write a proc_macro, but I'm not sure it would end up being any shorter, and proc_macros themselves aren't very readable.

Member

I think rustc will probably also optimize it if you just pass the booleans as normal parameters? It's at least worth a try to remove the giant match statement.

Another thing to investigate is what would happen if we removed the boolean checks altogether in the for loop. Maybe the branching itself is the overhead.

Contributor Author

Passing the individual configuration booleans as normal parameters is not sufficient. Based on my knowledge of LLVM's internals, it won't do loop unswitching as complex as what would be required here. This is also confirmed by benchmarking wc -m odyssey1024.txt:

Normal parameters:

Benchmark 1: ./target/release/coreutils wc -m odyssey1024.txt
  Time (mean ± σ):     880.2 ms ±   5.2 ms    [User: 841.3 ms, System: 37.8 ms]
  Range (min … max):   873.9 ms … 886.8 ms    10 runs

Const generic parameters:

Benchmark 1: ./target/release/coreutils wc -m odyssey1024.txt
  Time (mean ± σ):     740.7 ms ±   7.9 ms    [User: 705.0 ms, System: 35.1 ms]
  Range (min … max):   729.2 ms … 749.3 ms    10 runs
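
An illustrative contrast of what the two benchmarked variants look like at the source level (toy functions, not the real wc code): with a runtime bool, one copy of the loop is compiled and LLVM has to hoist or re-test the flag; with a const generic, each call site gets its own copy with the branch resolved at compile time. For a loop this trivial LLVM would unswitch it anyway; the point above is that the real loop combines several flags plus extra state, which is where that stops happening.

```rust
// Toy contrast only, not the real wc code.
fn count_chars_runtime(input: &[u8], count_chars: bool) -> usize {
    let mut chars = 0;
    for &_b in input {
        if count_chars {
            chars += 1; // branch on a runtime value, inside the hot loop
        }
    }
    chars
}

fn count_chars_const<const COUNT_CHARS: bool>(input: &[u8]) -> usize {
    let mut chars = 0;
    for &_b in input {
        if COUNT_CHARS {
            chars += 1; // branch resolved during monomorphization
        }
    }
    chars
}
```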

Contributor Author

I suspect that manually removing the boolean checks from the loop would achieve the same performance gains, but it would require writing 16 different versions of word_count_from_reader. IMO one large dispatch table is better than 16 fully expanded copies of the function from a readability perspective?
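
For reference, a compact illustration of what such a dispatch table looks like, shrunk to two flags (the real match covers every combination of the wc settings, and the names here are illustrative): the runtime flags are matched once, each arm names a distinct monomorphized instantiation, and the hot loop never re-checks them.

```rust
// Shrunk to two flags for readability; names are illustrative.
fn scan<const COUNT_LINES: bool, const COUNT_WORDS: bool>(_input: &[u8]) {
    // stands in for the const-generic scanning loop
}

fn dispatch(show_lines: bool, show_words: bool, input: &[u8]) {
    // The flags are checked once here instead of once per byte in the loop.
    match (show_lines, show_words) {
        (false, false) => scan::<false, false>(input),
        (false, true) => scan::<false, true>(input),
        (true, false) => scan::<true, false>(input),
        (true, true) => scan::<true, true>(input),
    }
}
```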

Member
@tertsdiepraam Jul 12, 2022

Thanks for checking the normal params. With removing the checks, I really meant removing the logic. For example, we could still compute the total number of characters, even if show_chars is set to false. The overhead of branching seems much more significant to me than the overhead of doing a single increment. So we would essentially remove all specialization.
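
A minimal sketch of that suggestion, for contrast (illustrative, not actual code): drop the flags entirely and count everything unconditionally, so there is no branching on settings and no specialization at all.

```rust
// Count everything on every run, regardless of which flags were requested.
fn scan_everything(input: &[u8]) -> (usize, usize, usize) {
    let (mut lines, mut words, mut chars) = (0, 0, 0);
    let mut in_word = false;
    for &b in input {
        chars += 1;
        if b == b'\n' {
            lines += 1;
        }
        let is_space = b.is_ascii_whitespace();
        if !is_space && !in_word {
            words += 1;
        }
        in_word = !is_space;
    }
    (lines, words, chars)
}
```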

Contributor Author
@resistor Jul 13, 2022

Here are benchmark results for wc -L using that approach:

Original:
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.588 s ±  0.009 s    [User: 2.527 s, System: 0.060 s]
  Range (min … max):    2.580 s …  2.610 s    10 runs

Unconditional character counting:
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.617 s ±  0.008 s    [User: 2.549 s, System: 0.067 s]
  Range (min … max):    2.608 s …  2.628 s    10 runs

Fully specialized:
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.461 s ±  0.009 s    [User: 2.397 s, System: 0.063 s]
  Range (min … max):    2.451 s …  2.474 s    10 runs

Member

I stand corrected again! Thanks!

@resistor
Contributor Author

> Excellent, could you please document how to run this benchmark (or finding) in https://github.com/uutils/coreutils/blob/main/src/uu/wc/BENCHMARKING.md if it makes sense?

I actually used the guidance from that file, with the only caveat that I used an even longer copy of the Odyssey file.

By extracting the handling of hand-written fast-paths to the
same dispatch as the automatic specializations, we can avoid
needing to pass `show_bytes` as a const generic to
`word_count_from_reader_specialized`. Eliminating this parameter
halves the number of arms in the dispatch.
@resistor
Contributor Author

I figured out that I could handle the manually-written fast paths in the same way as the automatic ones, which eliminates show_bytes as a const generic parameter and halves the size of the dispatch. Please take a look!
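
A hedged sketch of what that restructuring might look like (an assumed shape based on the commit message, with illustrative names, not the exact uutils code): the flag combinations that have hand-written fast paths become arms of the same match, and the byte total is cheap enough to accumulate unconditionally inside the generic loop, so `show_bytes` never needs to reach it as a const generic.

```rust
// Assumed structure, shrunk to three flags; all names are illustrative.
fn count_bytes_fast(input: &[u8]) -> usize {
    // hand-written fast path: byte count needs no per-byte scanning
    input.len()
}

fn count_bytes_and_lines_fast(input: &[u8]) -> (usize, usize) {
    // hand-written fast path: a plain newline counter
    (input.len(), input.iter().filter(|&&b| b == b'\n').count())
}

fn scan<const LINES: bool, const WORDS: bool, const CHARS: bool>(_input: &[u8]) {
    // stands in for the const-generic scanning loop
}

fn dispatch(show_lines: bool, show_words: bool, show_chars: bool, input: &[u8]) {
    match (show_lines, show_words, show_chars) {
        // fast-path arms sit in the same match as the automatic specializations
        (false, false, false) => { count_bytes_fast(input); }
        (true, false, false) => { count_bytes_and_lines_fast(input); }
        // every other combination gets a monomorphized copy of the loop
        (false, false, true) => scan::<false, false, true>(input),
        (false, true, false) => scan::<false, true, false>(input),
        (false, true, true) => scan::<false, true, true>(input),
        (true, false, true) => scan::<true, false, true>(input),
        (true, true, false) => scan::<true, true, false>(input),
        (true, true, true) => scan::<true, true, true>(input),
    }
}
```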

@sylvestre sylvestre merged commit 735db78 into uutils:main Jul 18, 2022
@sylvestre
Contributor

bravo :)
