
Make ASCII case conversions more than 4× faster #59283

Merged: 11 commits on Mar 28, 2019

Conversation

SimonSapin (Contributor) commented Mar 18, 2019

Reformatted output of ./x.py bench src/libcore --test-args ascii below. The libcore benchmark calls [u8]::make_ascii_lowercase. lookup has code (effectively) identical to that before this PR, and mask_shifted_bool_match_range to the code after this PR.

See code comments in u8::to_ascii_uppercase in src/libcore/num/mod.rs for an explanation of the branchless algorithm.
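For illustration, the add-then-mask branchless test can be sketched as follows (the function name is mine and the exact libcore code may differ, but the trick is the one described in those comments):

```rust
// Sketch of the add-then-mask branchless uppercase conversion.
fn branchless_to_ascii_uppercase(byte: u8) -> u8 {
    // `byte + 0x1f` has bit 7 set for bytes >= b'a' (0x61), and
    // `!(byte + 0x05)` has bit 7 set for bytes <= b'z' (0x7a), so the
    // combined flag is 0x80 exactly when byte is b'a'..=b'z', and 0 for
    // every other u8 value.
    let is_lower = byte.wrapping_add(0x1f) & !byte.wrapping_add(0x05) & 0x80;
    // Shift the flag from bit 7 down to bit 5 (0x20, the ASCII case bit)
    // and clear that bit when the byte is a lowercase letter.
    byte & !(is_lower >> 2)
}

fn main() {
    assert_eq!(branchless_to_ascii_uppercase(b'a'), b'A');
    assert_eq!(branchless_to_ascii_uppercase(b'z'), b'Z');
    assert_eq!(branchless_to_ascii_uppercase(b'A'), b'A');
    assert_eq!(branchless_to_ascii_uppercase(b'{'), b'{');
    println!("ok");
}
```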

Update: the algorithm was simplified while keeping the performance. See the branchless vs. mask_shifted_bool_match_range benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on u32 to convert four bytes at a time. The fake_simd_u32 benchmark implements this with let (before, aligned, after) = bytes.align_to_mut::<u32>(). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).

This could be fixed (to optimize [u8]::make_ascii_lowercase and [u8]::make_ascii_uppercase in src/libcore/slice/mod.rs) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this however because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.
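As an illustration of the SWAR idea (my own sketch, with a name of my choosing; per the caveat above it is correct only when every input byte is ASCII):

```rust
// "Fake SIMD": apply the branchless test to four bytes at once inside a
// u32. This assumes every byte is ASCII (< 0x80); otherwise the per-lane
// additions can carry into the neighbouring byte, as noted above.
fn fake_simd_to_ascii_uppercase(word: u32) -> u32 {
    // Per-lane flag: 0x80 in each byte lane that holds b'a'..=b'z'.
    let is_lower = word.wrapping_add(0x1f1f_1f1f)
        & !word.wrapping_add(0x0505_0505)
        & 0x8080_8080;
    // Clear bit 5 (the ASCII case bit) in each flagged lane.
    word & !(is_lower >> 2)
}

fn main() {
    let word = u32::from_le_bytes(*b"ab-Z");
    assert_eq!(fake_simd_to_ascii_uppercase(word).to_le_bytes(), *b"AB-Z");
    println!("ok");
}
```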

Benchmark results on Linux x64 with Intel i7-7700K: (updated from #59283 (comment))

6830 bytes string:

alloc_only                          ... bench:    112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench:  1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench:  1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:    417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:    401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:    365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:    367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:    361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:    361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench:  6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench:  4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:    339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:    339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:


alloc_only                          ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench:     29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench:     24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench:     35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:

alloc_only                          ... bench:     14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench:     18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench:     21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench:     19 ns/iter (+/- 0) = 368 MB/s
rust-highfive (Collaborator) commented Mar 18, 2019

r? @joshtriplett

(rust_highfive has picked a reviewer for you, use r? to override)

rust-highfive (Collaborator) commented Mar 18, 2019

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

tidy check
[00:05:09] tidy error: /checkout/src/libcore/benches/ascii_case.rs:148: line longer than 100 chars
[00:05:09] tidy error: /checkout/src/libcore/benches/ascii_case.rs:150: line longer than 100 chars
[00:05:11] some tidy checks failed
[00:05:11] 
[00:05:11] 
[00:05:11] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:05:11] 
[00:05:11] 
[00:05:11] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:05:11] Build completed unsuccessfully in 0:00:43
[00:05:11] Build completed unsuccessfully in 0:00:43
[00:05:11] make: *** [tidy] Error 1
[00:05:11] Makefile:67: recipe for target 'tidy' failed
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

raphlinus (Contributor) left a comment

This looks good to me, modulo the tidy warnings.

I like that the explanation is so much longer than the code :)

joshtriplett (Member) commented Mar 18, 2019

Looks good to me. r=me as soon as CI passes.

ollie27 (Contributor) commented Mar 18, 2019

Might this be slower on platforms without SIMD, which can't take advantage of auto-vectorization, or does that not matter?

raphlinus (Contributor) commented Mar 18, 2019

It's probably still faster than the status quo on those platforms because it does the computation without branches. If one cared deeply about those platforms, then the pseudo-SIMD approach could be resurrected. However, I think this is a pretty good compromise.

SimonSapin (Author, Contributor) commented Mar 18, 2019

I guess it depends on whether LLVM can auto-vectorize based on "classic" u32 operations. But either way it’s likely still faster than the current lookup table.

I also just realized that when doing one byte at a time, instead of a convoluted add-then-mask to emulate comparison, we can use an actual comparison to obtain a bool, cast to u8 to obtain a 1 or 0, then multiply that by a mask:

byte &= !(0x20 * (b'a' <= byte && byte <= b'z') as u8)

This even turns out to be slightly faster! I’ll update the PR.
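Written out as a standalone sketch (the function name and slice-level loop are mine), the comparison-then-multiply form looks like this:

```rust
// Comparison-then-multiply form of make_ascii_uppercase on a slice.
fn make_ascii_uppercase(bytes: &mut [u8]) {
    for byte in bytes {
        // The bool comparison casts to 1 or 0; multiplying by 0x20 yields
        // the ASCII case bit only for lowercase letters, which is then
        // cleared from the byte.
        *byte &= !(0x20 * (b'a' <= *byte && *byte <= b'z') as u8);
    }
}

fn main() {
    let mut s = *b"Hello, World!";
    make_ascii_uppercase(&mut s);
    assert_eq!(&s, b"HELLO, WORLD!");
    println!("ok");
}
```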

SimonSapin (Author, Contributor) commented Mar 18, 2019

If instead of b'a' <= byte && byte <= b'z' in the above I use byte.is_ascii_lowercase(), the performance is completely destroyed and becomes several times slower than before this PR. So I also changed the implementations of all u8::is_ascii_* methods to use match expressions with range patterns instead of the ASCII_CHARACTER_CLASS lookup table. When benchmarking black_box(bytes.iter().all(u8::is_ascii_FOO)), the change is small, possibly noise.
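The range-pattern style mentioned here can be sketched as follows (illustrative free functions with names of my choosing; the real methods live on u8 in libcore):

```rust
// Match expressions with range patterns, replacing a 256-entry
// lookup table; the compiler lowers these to simple comparisons.
fn is_ascii_lowercase(byte: u8) -> bool {
    match byte {
        b'a'..=b'z' => true,
        _ => false,
    }
}

fn is_ascii_alphanumeric(byte: u8) -> bool {
    match byte {
        b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z' => true,
        _ => false,
    }
}

fn main() {
    assert!(is_ascii_lowercase(b'q'));
    assert!(!is_ascii_lowercase(b'Q'));
    assert!(is_ascii_alphanumeric(b'7'));
    assert!(!is_ascii_alphanumeric(b'_'));
    println!("ok");
}
```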

Benchmark results as a GIF “visual diff” (image not included).

Benchmark results in text

Before:

test ascii::long::is_ascii                                 ... bench:         187 ns/iter (+/- 0) = 37379 MB/s
test ascii::long::is_ascii_alphabetic                      ... bench:          94 ns/iter (+/- 0) = 74361 MB/s
test ascii::long::is_ascii_alphanumeric                    ... bench:         125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_control                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit                           ... bench:         125 ns/iter (+/- 0) = 55920 MB/s
test ascii::long::is_ascii_graphic                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit                        ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation                     ... bench:         124 ns/iter (+/- 1) = 56370 MB/s
test ascii::long::is_ascii_uppercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace                      ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::medium::is_ascii                               ... bench:          28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic                    ... bench:          24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_alphanumeric                  ... bench:          24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_control                       ... bench:          23 ns/iter (+/- 1) = 1391 MB/s
test ascii::medium::is_ascii_digit                         ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_graphic                       ... bench:          24 ns/iter (+/- 0) = 1333 MB/s
test ascii::medium::is_ascii_hexdigit                      ... bench:          23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_lowercase                     ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_punctuation                   ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase                     ... bench:          22 ns/iter (+/- 2) = 1454 MB/s
test ascii::medium::is_ascii_whitespace                    ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::short::is_ascii                                ... bench:          23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_alphabetic                     ... bench:          24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_alphanumeric                   ... bench:          24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_control                        ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_digit                          ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_graphic                        ... bench:          25 ns/iter (+/- 0) = 280 MB/s
test ascii::short::is_ascii_hexdigit                       ... bench:          24 ns/iter (+/- 0) = 291 MB/s
test ascii::short::is_ascii_lowercase                      ... bench:          23 ns/iter (+/- 1) = 304 MB/s
test ascii::short::is_ascii_punctuation                    ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase                      ... bench:          24 ns/iter (+/- 1) = 291 MB/s
test ascii::short::is_ascii_whitespace                     ... bench:          22 ns/iter (+/- 0) = 318 MB/s

After:

test ascii::long::is_ascii                                 ... bench:         186 ns/iter (+/- 0) = 37580 MB/s
test ascii::long::is_ascii_alphabetic                      ... bench:          96 ns/iter (+/- 0) = 72812 MB/s
test ascii::long::is_ascii_alphanumeric                    ... bench:         119 ns/iter (+/- 0) = 58739 MB/s
test ascii::long::is_ascii_control                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_digit                           ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_graphic                         ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_hexdigit                        ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_lowercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_punctuation                     ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_uppercase                       ... bench:         124 ns/iter (+/- 0) = 56370 MB/s
test ascii::long::is_ascii_whitespace                      ... bench:         134 ns/iter (+/- 0) = 52164 MB/s
test ascii::medium::is_ascii                               ... bench:          28 ns/iter (+/- 0) = 1142 MB/s
test ascii::medium::is_ascii_alphabetic                    ... bench:          23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_alphanumeric                  ... bench:          23 ns/iter (+/- 0) = 1391 MB/s
test ascii::medium::is_ascii_control                       ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_digit                         ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_graphic                       ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_hexdigit                      ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_lowercase                     ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::medium::is_ascii_punctuation                   ... bench:          22 ns/iter (+/- 0) = 1454 MB/s
test ascii::medium::is_ascii_uppercase                     ... bench:          21 ns/iter (+/- 0) = 1523 MB/s
test ascii::medium::is_ascii_whitespace                    ... bench:          20 ns/iter (+/- 0) = 1600 MB/s
test ascii::short::is_ascii                                ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphabetic                     ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_alphanumeric                   ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_control                        ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_digit                          ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_graphic                        ... bench:          23 ns/iter (+/- 0) = 304 MB/s
test ascii::short::is_ascii_hexdigit                       ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_lowercase                      ... bench:          20 ns/iter (+/- 0) = 350 MB/s
test ascii::short::is_ascii_punctuation                    ... bench:          22 ns/iter (+/- 0) = 318 MB/s
test ascii::short::is_ascii_uppercase                      ... bench:          21 ns/iter (+/- 0) = 333 MB/s
test ascii::short::is_ascii_whitespace                     ... bench:          20 ns/iter (+/- 0) = 350 MB/s
SimonSapin (Author, Contributor) commented Mar 19, 2019

Benchmark results from the original PR description, in case they end up being relevant:

6830 bytes string:

alloc_only                ... bench:    109 ns/iter (+/- 0) = 62660 MB/s
black_box_read_each_byte  ... bench:  1,708 ns/iter (+/- 5) = 3998 MB/s
lookup                    ... bench:  1,725 ns/iter (+/- 2) = 3959 MB/s
branch_and_subtract       ... bench:    413 ns/iter (+/- 1) = 16537 MB/s
branch_and_mask           ... bench:    411 ns/iter (+/- 2) = 16618 MB/s
branchless                ... bench:    377 ns/iter (+/- 2) = 18116 MB/s
libcore                   ... bench:    378 ns/iter (+/- 2) = 18068 MB/s
fake_simd_u32             ... bench:    373 ns/iter (+/- 1) = 18310 MB/s
fake_simd_u64             ... bench:    374 ns/iter (+/- 0) = 18262 MB/s

32 bytes string:

alloc_only                ... bench:     13 ns/iter (+/- 0) = 2461 MB/s
black_box_read_each_byte  ... bench:     28 ns/iter (+/- 0) = 1142 MB/s
lookup                    ... bench:     25 ns/iter (+/- 0) = 1280 MB/s
branch_and_subtract       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask           ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
libcore                   ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
fake_simd_u32             ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64             ... bench:     17 ns/iter (+/- 0) = 1882 MB/s

7 bytes string:

alloc_only                ... bench:     13 ns/iter (+/- 0) = 538 MB/s
black_box_read_each_byte  ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup                    ... bench:     17 ns/iter (+/- 0) = 411 MB/s
branch_and_subtract       ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask           ... bench:     17 ns/iter (+/- 0) = 411 MB/s
branchless                ... bench:     21 ns/iter (+/- 0) = 333 MB/s
libcore                   ... bench:     21 ns/iter (+/- 0) = 333 MB/s
fake_simd_u32             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u64             ... bench:     23 ns/iter (+/- 0) = 304 MB/s
@@ -3794,7 +3794,8 @@ impl u8 {
#[stable(feature = "ascii_methods_on_intrinsics", since = "1.23.0")]
#[inline]
pub fn to_ascii_uppercase(&self) -> u8 {
ASCII_UPPERCASE_MAP[*self as usize]
// Unset the fifth bit if this is a lowercase letter
*self & !((self.is_ascii_lowercase() as u8) << 5)

ollie27 (Contributor) commented Mar 19, 2019

Suggested change:
*self & !((self.is_ascii_lowercase() as u8) << 5)
*self - ((self.is_ascii_lowercase() as u8) << 5)

Using subtract is slightly faster for me:

test long::case12_mask_shifted_bool_match_range         ... bench:         776 ns/iter (+/- 26) = 9007 MB/s
test long::case13_sub_shifted_bool_match_range          ... bench:         734 ns/iter (+/- 49) = 9523 MB/s

SimonSapin (Author) commented Mar 19, 2019

This is also an improvement for me, but smaller:

test ascii::long::case12_mask_shifted_bool_match_range         ... bench:         352 ns/iter (+/- 0) = 19857 MB/s
test ascii::long::case13_subtract_shifted_bool_match_range     ... bench:         350 ns/iter (+/- 1) = 19971 MB/s
test ascii::medium::case12_mask_shifted_bool_match_range       ... bench:          15 ns/iter (+/- 0) = 2133 MB/s
test ascii::medium::case13_subtract_shifted_bool_match_range   ... bench:          15 ns/iter (+/- 0) = 2133 MB/s
test ascii::short::case12_mask_shifted_bool_match_range        ... bench:          19 ns/iter (+/- 0) = 368 MB/s
test ascii::short::case13_subtract_shifted_bool_match_range    ... bench:          18 ns/iter (+/- 0) = 388 MB/s
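The mask and subtract variants being compared here can be sketched side by side (names mine); they agree for every input byte, because bit 5 is always set in the bytes for which the subtraction actually happens:

```rust
// Mask variant: clear bit 5 when the byte is a lowercase letter.
fn upper_mask(byte: u8) -> u8 {
    byte & !(((b'a' <= byte && byte <= b'z') as u8) << 5)
}

// Subtract variant: subtracting the shifted flag is equivalent, since
// every lowercase ASCII letter has bit 5 (0x20) set, so the subtraction
// never borrows.
fn upper_sub(byte: u8) -> u8 {
    byte - (((b'a' <= byte && byte <= b'z') as u8) << 5)
}

fn main() {
    // Exhaustively check that the two variants agree on all 256 bytes.
    for b in 0u8..=255 {
        assert_eq!(upper_mask(b), upper_sub(b));
    }
    assert_eq!(upper_sub(b'm'), b'M');
    println!("ok");
}
```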
ollie27 (Contributor) commented Mar 19, 2019

A quick benchmark using i586-pc-windows-msvc target gets me:

test long::case00_alloc_only                            ... bench:         291 ns/iter (+/- 46) = 24020 MB/s
test long::case01_black_box_read_each_byte              ... bench:       4,214 ns/iter (+/- 163) = 1658 MB/s
test long::case02_lookup_table                          ... bench:       6,158 ns/iter (+/- 226) = 1135 MB/s
test long::case03_branch_and_subtract                   ... bench:      17,402 ns/iter (+/- 641) = 401 MB/s
test long::case04_branch_and_mask                       ... bench:      17,748 ns/iter (+/- 1,242) = 393 MB/s
test long::case05_branchless                            ... bench:      10,757 ns/iter (+/- 390) = 649 MB/s
test long::case06_libcore                               ... bench:       6,165 ns/iter (+/- 401) = 1133 MB/s
test long::case07_fake_simd_u32                         ... bench:       2,790 ns/iter (+/- 138) = 2505 MB/s
test long::case08_fake_simd_u64                         ... bench:       2,816 ns/iter (+/- 166) = 2482 MB/s
test long::case09_mask_mult_bool_branchy_lookup_table   ... bench:      11,366 ns/iter (+/- 353) = 614 MB/s
test long::case10_mask_mult_bool_lookup_table           ... bench:       9,793 ns/iter (+/- 486) = 713 MB/s
test long::case11_mask_mult_bool_match_range            ... bench:       8,949 ns/iter (+/- 330) = 781 MB/s
test long::case12_mask_shifted_bool_match_range         ... bench:       8,938 ns/iter (+/- 478) = 782 MB/s
test long::case13_sub_shifted_bool_match_range          ... bench:       8,136 ns/iter (+/- 363) = 859 MB/s
test medium::case00_alloc_only                          ... bench:          64 ns/iter (+/- 1) = 500 MB/s
test medium::case01_black_box_read_each_byte            ... bench:          73 ns/iter (+/- 2) = 438 MB/s
test medium::case02_lookup_table                        ... bench:          66 ns/iter (+/- 4) = 484 MB/s
test medium::case03_branch_and_subtract                 ... bench:          63 ns/iter (+/- 2) = 507 MB/s
test medium::case04_branch_and_mask                     ... bench:          64 ns/iter (+/- 2) = 500 MB/s
test medium::case05_branchless                          ... bench:         110 ns/iter (+/- 3) = 290 MB/s
test medium::case06_libcore                             ... bench:          62 ns/iter (+/- 4) = 516 MB/s
test medium::case07_fake_simd_u32                       ... bench:          79 ns/iter (+/- 2) = 405 MB/s
test medium::case08_fake_simd_u64                       ... bench:          80 ns/iter (+/- 2) = 400 MB/s
test medium::case09_mask_mult_bool_branchy_lookup_table ... bench:         118 ns/iter (+/- 5) = 271 MB/s
test medium::case10_mask_mult_bool_lookup_table         ... bench:          64 ns/iter (+/- 5) = 500 MB/s
test medium::case11_mask_mult_bool_match_range          ... bench:          62 ns/iter (+/- 3) = 516 MB/s
test medium::case12_mask_shifted_bool_match_range       ... bench:          62 ns/iter (+/- 2) = 516 MB/s
test medium::case13_sub_shifted_bool_match_range        ... bench:          61 ns/iter (+/- 3) = 524 MB/s
test short::case00_alloc_only                           ... bench:          62 ns/iter (+/- 3) = 112 MB/s
test short::case01_black_box_read_each_byte             ... bench:          65 ns/iter (+/- 4) = 107 MB/s
test short::case02_lookup_table                         ... bench:          61 ns/iter (+/- 1) = 114 MB/s
test short::case03_branch_and_subtract                  ... bench:          63 ns/iter (+/- 4) = 111 MB/s
test short::case04_branch_and_mask                      ... bench:          61 ns/iter (+/- 1) = 114 MB/s
test short::case05_branchless                           ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case06_libcore                              ... bench:          61 ns/iter (+/- 4) = 114 MB/s
test short::case07_fake_simd_u32                        ... bench:          74 ns/iter (+/- 4) = 94 MB/s
test short::case08_fake_simd_u64                        ... bench:          74 ns/iter (+/- 3) = 94 MB/s
test short::case09_mask_mult_bool_branchy_lookup_table  ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case10_mask_mult_bool_lookup_table          ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case11_mask_mult_bool_match_range           ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case12_mask_shifted_bool_match_range        ... bench:          61 ns/iter (+/- 2) = 114 MB/s
test short::case13_sub_shifted_bool_match_range         ... bench:          61 ns/iter (+/- 2) = 114 MB/s

Which shows that this can be slower than the lookup table for a target without SIMD.

SimonSapin (Author) commented Mar 19, 2019

What commit were these i586 results on? Because libcore performs exactly like lookup_table, which seems surprising.

ollie27 (Contributor) commented Mar 19, 2019

What commit were these i586 results on? Because the libcore performs exactly like lookup_table, which seems surprising.

It was just a recent nightly, so that's why libcore is the same as lookup_table.

SimonSapin (Author) commented Mar 22, 2019

@joshtriplett I pushed several changes since your review, could you have another look?

joshtriplett (Member) commented Mar 26, 2019

@bors r+

bors (Contributor) commented Mar 26, 2019

📌 Commit 7fad370 has been approved by joshtriplett

Centril added a commit to Centril/rust that referenced this pull request Mar 27, 2019

Rollup merge of rust-lang#59283 - SimonSapin:branchless-ascii-case, r=joshtriplett

Make ASCII case conversions more than 4× faster

Reformatted output of `./x.py bench src/libcore --test-args ascii` below. The `libcore` benchmark calls `[u8]::make_ascii_lowercase`. `lookup` has code (effectively) identical to that before this PR, and ~~`branchless`~~ `mask_shifted_bool_match_range` after this PR.

~~See [code comments](rust-lang@ce933f7#diff-01076f91a26400b2db49663d787c2576R3796) in `u8::to_ascii_uppercase` in `src/libcore/num/mod.rs` for an explanation of the branchless algorithm.~~

**Update:** the algorithm was simplified while keeping the performance. See `branchless` v.s. `mask_shifted_bool_match_range` benchmarks.

Credits to @raphlinus for the idea in https://twitter.com/raphlinus/status/1107654782544736261, which extends this algorithm to “fake SIMD” on `u32` to convert four bytes at a time. The `fake_simd_u32` benchmarks implements this with [`let (before, aligned, after) = bytes.align_to_mut::<u32>()`](https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut). Note however that this is buggy when addition carries/overflows into the next byte (which does not happen if the input is known to be ASCII).

This could be fixed (to optimize `[u8]::make_ascii_lowercase` and `[u8]::make_ascii_uppercase` in `src/libcore/slice/mod.rs`) either with some more bitwise trickery that I didn’t quite figure out, or by using “real” SIMD intrinsics for byte-wise addition. I did not pursue this however because the current (incorrect) fake SIMD algorithm is only marginally faster than the one-byte-at-a-time branchless algorithm. This is because LLVM auto-vectorizes the latter, as can be seen on https://rust.godbolt.org/z/anKtbR.

Benchmark results on Linux x64 with Intel i7-7700K: (updated from rust-lang#59283 (comment))

```rust
6830 bytes string:

alloc_only                          ... bench:    112 ns/iter (+/- 0) = 62410 MB/s
black_box_read_each_byte            ... bench:  1,733 ns/iter (+/- 8) = 4033 MB/s
lookup_table                        ... bench:  1,766 ns/iter (+/- 11) = 3958 MB/s
branch_and_subtract                 ... bench:    417 ns/iter (+/- 1) = 16762 MB/s
branch_and_mask                     ... bench:    401 ns/iter (+/- 1) = 17431 MB/s
branchless                          ... bench:    365 ns/iter (+/- 0) = 19150 MB/s
libcore                             ... bench:    367 ns/iter (+/- 1) = 19046 MB/s
fake_simd_u32                       ... bench:    361 ns/iter (+/- 2) = 19362 MB/s
fake_simd_u64                       ... bench:    361 ns/iter (+/- 1) = 19362 MB/s
mask_mult_bool_branchy_lookup_table ... bench:  6,309 ns/iter (+/- 19) = 1107 MB/s
mask_mult_bool_lookup_table         ... bench:  4,183 ns/iter (+/- 29) = 1671 MB/s
mask_mult_bool_match_range          ... bench:    339 ns/iter (+/- 0) = 20619 MB/s
mask_shifted_bool_match_range       ... bench:    339 ns/iter (+/- 1) = 20619 MB/s

32 bytes string:

alloc_only                          ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
black_box_read_each_byte            ... bench:     29 ns/iter (+/- 0) = 1103 MB/s
lookup_table                        ... bench:     24 ns/iter (+/- 4) = 1333 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
branchless                          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
libcore                             ... bench:     15 ns/iter (+/- 0) = 2133 MB/s
fake_simd_u32                       ... bench:     17 ns/iter (+/- 0) = 1882 MB/s
fake_simd_u64                       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     42 ns/iter (+/- 0) = 761 MB/s
mask_mult_bool_lookup_table         ... bench:     35 ns/iter (+/- 0) = 914 MB/s
mask_mult_bool_match_range          ... bench:     16 ns/iter (+/- 0) = 2000 MB/s
mask_shifted_bool_match_range       ... bench:     16 ns/iter (+/- 0) = 2000 MB/s

7 bytes string:

alloc_only                          ... bench:     14 ns/iter (+/- 0) = 500 MB/s
black_box_read_each_byte            ... bench:     22 ns/iter (+/- 0) = 318 MB/s
lookup_table                        ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_subtract                 ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branch_and_mask                     ... bench:     16 ns/iter (+/- 0) = 437 MB/s
branchless                          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
libcore                             ... bench:     20 ns/iter (+/- 0) = 350 MB/s
fake_simd_u32                       ... bench:     18 ns/iter (+/- 0) = 388 MB/s
fake_simd_u64                       ... bench:     21 ns/iter (+/- 0) = 333 MB/s
mask_mult_bool_branchy_lookup_table ... bench:     20 ns/iter (+/- 0) = 350 MB/s
mask_mult_bool_lookup_table         ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_mult_bool_match_range          ... bench:     19 ns/iter (+/- 0) = 368 MB/s
mask_shifted_bool_match_range       ... bench:     19 ns/iter (+/- 0) = 368 MB/s
```
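
The branchless transform benchmarked above as `mask_shifted_bool_match_range` can be sketched as a standalone snippet. This is a hedged reconstruction of the idea (free functions mirroring the `u8::to_ascii_uppercase`/`to_ascii_lowercase` methods, not the exact libcore source): compute "is this byte in the lowercase/uppercase range" as a `bool`, cast it to `0` or `1`, shift it into the ASCII case bit (`0x20`), and apply it with a mask — no branch, so LLVM can auto-vectorize the per-byte loop.

```rust
/// Branchless ASCII uppercase: clear the case bit (0x20) iff the byte
/// is in b'a'..=b'z'. Non-letters pass through unchanged.
fn to_ascii_uppercase(b: u8) -> u8 {
    // `bool as u8` is 0 or 1; shifting by 5 gives 0x00 or 0x20.
    let is_lower = (b >= b'a' && b <= b'z') as u8;
    b & !(is_lower << 5)
}

/// Branchless ASCII lowercase: set the case bit iff the byte
/// is in b'A'..=b'Z'.
fn to_ascii_lowercase(b: u8) -> u8 {
    let is_upper = (b >= b'A' && b <= b'Z') as u8;
    b | (is_upper << 5)
}

fn main() {
    assert_eq!(to_ascii_uppercase(b'a'), b'A');
    // b'{' is b'z' + 1: the range check must reject it, leaving it unchanged.
    assert_eq!(to_ascii_uppercase(b'{'), b'{');
    assert_eq!(to_ascii_lowercase(b'Z'), b'z');
    assert_eq!(to_ascii_lowercase(b'@'), b'@');
    println!("ok");
}
```

Because the hot path is a compare, a shift, and an AND/OR with no data-dependent branch, a plain `for byte in bytes { ... }` loop over a slice vectorizes well, which is what the `libcore` and `branchless` rows above reflect.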

bors added a commit that referenced this pull request Mar 27, 2019

Auto merge of #59457 - Centril:rollup, r=Centril
Rollup of 13 pull requests

Successful merges:

 - #57293 (Make some lints incremental)
 - #57565 (syntax: Remove warning for unnecessary path disambiguators)
 - #58253 (librustc_driver => 2018)
 - #58581 (Refactor generic parameter encoder functions)
 - #58717 (Add FromStr impl for NonZero types)
 - #59283 (Make ASCII case conversions more than 4× faster)
 - #59284 (adjust MaybeUninit API to discussions)
 - #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
 - #59421 (Reject integer suffix when tuple indexing)
 - #59429 (When moving out of a for loop head, suggest borrowing it in nll mode)
 - #59430 (Renames `EvalContext` to `InterpretCx`)
 - #59436 (Update jemalloc-sys to version 0.3.0)
 - #59451 (Add `Default` to `std::alloc::System`)

Failed merges:

r? @ghost

Centril added a commit to Centril/rust that referenced this pull request Mar 27, 2019

Rollup merge of rust-lang#59283 - SimonSapin:branchless-ascii-case, r…
…=joshtriplett

Make ASCII case conversions more than 4× faster


Centril added a commit to Centril/rust that referenced this pull request Mar 27, 2019

Rollup merge of rust-lang#59283 - SimonSapin:branchless-ascii-case, r…
…=joshtriplett

Make ASCII case conversions more than 4× faster


bors added a commit that referenced this pull request Mar 27, 2019

Auto merge of #59466 - Centril:rollup, r=Centril
Rollup of 17 pull requests

Successful merges:

 - #57293 (Make some lints incremental)
 - #57565 (syntax: Remove warning for unnecessary path disambiguators)
 - #58253 (librustc_driver => 2018)
 - #58717 (Add FromStr impl for NonZero types)
 - #58837 (librustc_interface => 2018)
 - #59268 (Add suggestion to use `&*var` when `&str: From<String>` is expected)
 - #59283 (Make ASCII case conversions more than 4× faster)
 - #59284 (adjust MaybeUninit API to discussions)
 - #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
 - #59393 (Refactor tuple comparison tests)
 - #59421 (Reject integer suffix when tuple indexing)
 - #59430 (Renames `EvalContext` to `InterpretCx`)
 - #59439 (Generalize diagnostic for `x = y` where `bool` is the expected type)
 - #59449 (fix: Make incremental artifact deletion more robust)
 - #59451 (Add `Default` to `std::alloc::System`)
 - #59459 (Add some tests)
 - #59460 (Include id in Thread's Debug implementation)

Failed merges:

r? @ghost

cuviper added a commit to cuviper/rust that referenced this pull request Mar 28, 2019

Rollup merge of rust-lang#59283 - SimonSapin:branchless-ascii-case, r…
…=joshtriplett

Make ASCII case conversions more than 4× faster


bors added a commit that referenced this pull request Mar 28, 2019

Auto merge of #59471 - cuviper:rollup, r=cuviper
Rollup of 18 pull requests

Successful merges:

 - #57293 (Make some lints incremental)
 - #57565 (syntax: Remove warning for unnecessary path disambiguators)
 - #58253 (librustc_driver => 2018)
 - #58837 (librustc_interface => 2018)
 - #59268 (Add suggestion to use `&*var` when `&str: From<String>` is expected)
 - #59283 (Make ASCII case conversions more than 4× faster)
 - #59284 (adjust MaybeUninit API to discussions)
 - #59372 (add rustfix-able suggestions to trim_{left,right} deprecations)
 - #59390 (Make `ptr::eq` documentation mention fat-pointer behavior)
 - #59393 (Refactor tuple comparison tests)
 - #59420 ([CI] record docker image info for reuse)
 - #59421 (Reject integer suffix when tuple indexing)
 - #59430 (Renames `EvalContext` to `InterpretCx`)
 - #59439 (Generalize diagnostic for `x = y` where `bool` is the expected type)
 - #59449 (fix: Make incremental artifact deletion more robust)
 - #59451 (Add `Default` to `std::alloc::System`)
 - #59459 (Add some tests)
 - #59460 (Include id in Thread's Debug implementation)

Failed merges:

r? @ghost

@bors bors merged commit 7fad370 into rust-lang:master Mar 28, 2019
