
wc: specialize scanning loop on settings. #3708

Merged
sylvestre merged 6 commits into uutils:main from resistor:main
Jul 18, 2022

wc: specialize scanning loop on settings.#3708
sylvestre merged 6 commits intouutils:mainfrom
resistor:main

Conversation

@resistor
Contributor

The primary computational loop in wc (iterating over all the
characters and computing word lengths, etc.) is configured by a
number of boolean options that control the text-scanning behavior.
If we monomorphize this loop for each possible combination of
scanning configurations, rustc is able to generate better code
for each instantiation, at a minimum by removing the conditional
checks on each iteration, and possibly by enabling optimizations
like vectorization.

On my computer (aarch64/macos), I am seeing at least a 5% performance
improvement in release builds on all wc flag configurations
(other than those that were already specialized) against
odyssey1024.txt, with wc -l showing the greatest improvement at 15%.
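
A minimal sketch of the idea (illustrative names, not the actual uutils code): the scanning loop takes its flags as const generic parameters, so rustc compiles a separate copy of the loop for each flag combination, and the per-byte flag checks disappear from every instantiation.

```rust
// Illustrative only: the real wc loop also handles UTF-8 decoding and
// max line length; this sketch keeps just the shape of the technique.
fn scan<const COUNT_LINES: bool, const COUNT_WORDS: bool, const COUNT_CHARS: bool>(
    input: &[u8],
) -> (usize, usize, usize) {
    let (mut lines, mut words, mut chars) = (0, 0, 0);
    let mut in_word = false;
    for &b in input {
        // Each test is on a const parameter, so every monomorphized copy
        // keeps only the work its flag combination actually needs.
        if COUNT_LINES && b == b'\n' {
            lines += 1;
        }
        if COUNT_WORDS {
            let is_space = b.is_ascii_whitespace();
            if !is_space && !in_word {
                words += 1;
            }
            in_word = !is_space;
        }
        if COUNT_CHARS {
            chars += 1;
        }
    }
    (lines, words, chars)
}
```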

@sylvestre
Contributor

Excellent, could you please document how to run this benchmark (or finding) in https://github.com/uutils/coreutils/blob/main/src/uu/wc/BENCHMARKING.md if it makes sense?

settings.show_max_line_length,
settings.show_words,
) {
(false, false, false, false, false) => {
Contributor

Sorry, but this is making the code less readable.
Would it be possible to make it a bit better?

Contributor Author

Do you have any recommendations on how to make it more readable? The only thing I can think of is to write a proc_macro, but I'm not sure it would end up being any shorter, and proc_macros themselves aren't very readable.

Member

I think rustc will probably also optimize it if you just pass the booleans as normal parameters? It's at least worth a try to remove the giant match statement.

Another thing to investigate is what would happen if we removed the boolean checks altogether in the for loop. Maybe the branching itself is the overhead.

Contributor Author

Passing the individual configuration booleans as normal parameters is not sufficient. Based on my knowledge of LLVM's internals, it won't do loop unswitching as complex as what would be required here. This is also confirmed by benchmarking wc -m odyssey1024.txt:

Normal parameters:

Benchmark 1: ./target/release/coreutils wc -m odyssey1024.txt
  Time (mean ± σ):     880.2 ms ±   5.2 ms    [User: 841.3 ms, System: 37.8 ms]
  Range (min … max):   873.9 ms … 886.8 ms    10 runs

Const generic parameters:

Benchmark 1: ./target/release/coreutils wc -m odyssey1024.txt
  Time (mean ± σ):     740.7 ms ±   7.9 ms    [User: 705.0 ms, System: 35.1 ms]
  Range (min … max):   729.2 ms … 749.3 ms    10 runs
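
An illustrative contrast of what the two benchmarked variants look like at the source level (toy functions, not the real wc code): with a runtime bool, one copy of the loop is compiled and LLVM has to hoist or re-test the flag; with a const generic, each call site gets its own copy with the branch resolved at compile time. For a loop this trivial LLVM would unswitch it anyway; the point above is that the real loop combines several flags plus extra state, which is where that stops happening.

```rust
// Toy contrast only, not the real wc code.
fn count_chars_runtime(input: &[u8], count_chars: bool) -> usize {
    let mut chars = 0;
    for &_b in input {
        if count_chars {
            chars += 1; // branch on a runtime value, inside the hot loop
        }
    }
    chars
}

fn count_chars_const<const COUNT_CHARS: bool>(input: &[u8]) -> usize {
    let mut chars = 0;
    for &_b in input {
        if COUNT_CHARS {
            chars += 1; // branch resolved during monomorphization
        }
    }
    chars
}
```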

Contributor Author

I suspect that manually removing the boolean checks from the loop would achieve the same performance gains, but it would require writing 16 different versions of word_count_from_reader. IMO one large dispatch table is better than 16 fully expanded copies of the function from a readability perspective?
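
For reference, a compact illustration of what such a dispatch table looks like, shrunk to two flags (the real match covers every combination of the wc settings, and the names here are illustrative): the runtime flags are matched once, each arm names a distinct monomorphized instantiation, and the hot loop never re-checks them.

```rust
// Shrunk to two flags for readability; names are illustrative.
fn scan<const COUNT_LINES: bool, const COUNT_WORDS: bool>(_input: &[u8]) {
    // stands in for the const-generic scanning loop
}

fn dispatch(show_lines: bool, show_words: bool, input: &[u8]) {
    // The flags are checked once here instead of once per byte in the loop.
    match (show_lines, show_words) {
        (false, false) => scan::<false, false>(input),
        (false, true) => scan::<false, true>(input),
        (true, false) => scan::<true, false>(input),
        (true, true) => scan::<true, true>(input),
    }
}
```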

Member
@tertsdiepraam Jul 12, 2022

Thanks for checking the normal params. With removing the checks, I really meant removing the logic. For example, we could still compute the total number of characters, even if show_chars is set to false. The overhead of branching seems much more significant to me than the overhead of doing a single increment. So we would essentially remove all specialization.
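
A minimal sketch of that suggestion, for contrast (illustrative, not actual code): drop the flags entirely and count everything unconditionally, so there is no branching on settings and no specialization at all.

```rust
// Count everything on every run, regardless of which flags were requested.
fn scan_everything(input: &[u8]) -> (usize, usize, usize) {
    let (mut lines, mut words, mut chars) = (0, 0, 0);
    let mut in_word = false;
    for &b in input {
        chars += 1;
        if b == b'\n' {
            lines += 1;
        }
        let is_space = b.is_ascii_whitespace();
        if !is_space && !in_word {
            words += 1;
        }
        in_word = !is_space;
    }
    (lines, words, chars)
}
```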

Contributor Author
@resistor Jul 13, 2022

Here are benchmark results for wc -L using that approach:

Original:
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.588 s ±  0.009 s    [User: 2.527 s, System: 0.060 s]
  Range (min … max):    2.580 s …  2.610 s    10 runs

Unconditional character counting:
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.617 s ±  0.008 s    [User: 2.549 s, System: 0.067 s]
  Range (min … max):    2.608 s …  2.628 s    10 runs

Fully specialized:
Benchmark 1: ./target/release/coreutils wc -L odyssey1024.txt
  Time (mean ± σ):      2.461 s ±  0.009 s    [User: 2.397 s, System: 0.063 s]
  Range (min … max):    2.451 s …  2.474 s    10 runs

Member

I stand corrected again! Thanks!

@resistor
Contributor Author

> Excellent, could you please document how to run this benchmark (or finding) in https://github.com/uutils/coreutils/blob/main/src/uu/wc/BENCHMARKING.md if it makes sense?

I actually used the guidance from that file, with the only caveat that I used an even longer copy of the Odyssey file.

By extracting the handling of hand-written fast-paths to the
same dispatch as the automatic specializations, we can avoid
needing to pass `show_bytes` as a const generic to
`word_count_from_reader_specialized`. Eliminating this parameter
halves the number of arms in the dispatch.
@resistor
Contributor Author

I figured out that I could handle the manually-written fast paths in the same way as the automatic ones, which eliminates show_bytes as a const generic parameter and halves the size of the dispatch. Please take a look!
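
A hedged sketch of what that restructuring might look like (an assumed shape based on the commit message, with illustrative names, not the exact uutils code): the flag combinations that have hand-written fast paths become arms of the same match, and the byte total is cheap enough to accumulate unconditionally inside the generic loop, so `show_bytes` never needs to reach it as a const generic.

```rust
// Assumed structure, shrunk to three flags; all names are illustrative.
fn count_bytes_fast(input: &[u8]) -> usize {
    // hand-written fast path: byte count needs no per-byte scanning
    input.len()
}

fn count_bytes_and_lines_fast(input: &[u8]) -> (usize, usize) {
    // hand-written fast path: a plain newline counter
    (input.len(), input.iter().filter(|&&b| b == b'\n').count())
}

fn scan<const LINES: bool, const WORDS: bool, const CHARS: bool>(_input: &[u8]) {
    // stands in for the const-generic scanning loop
}

fn dispatch(show_lines: bool, show_words: bool, show_chars: bool, input: &[u8]) {
    match (show_lines, show_words, show_chars) {
        // fast-path arms sit in the same match as the automatic specializations
        (false, false, false) => { count_bytes_fast(input); }
        (true, false, false) => { count_bytes_and_lines_fast(input); }
        // every other combination gets a monomorphized copy of the loop
        (false, false, true) => scan::<false, false, true>(input),
        (false, true, false) => scan::<false, true, false>(input),
        (false, true, true) => scan::<false, true, true>(input),
        (true, false, true) => scan::<true, false, true>(input),
        (true, true, false) => scan::<true, true, false>(input),
        (true, true, true) => scan::<true, true, true>(input),
    }
}
```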

@sylvestre sylvestre merged commit 735db78 into uutils:main Jul 18, 2022
@sylvestre
Contributor

bravo :)
