
is_whitespace() performance improvements #99487

Merged
merged 4 commits into rust-lang:master
Aug 26, 2022

Conversation

bmacnaughton
Contributor

bmacnaughton commented Jul 20, 2022

This is my first rust PR, so if I miss anything obvious please let me know and I'll do my best to fix it.

This was a bit more of a challenge than I expected: while I had working code locally and had tested it against the native is_whitespace(), this PR also required changing src/tools/unicode-table-generator, the tool that generates this code.

I have benchmarked this locally, using criterion, and have seen meaningful performance improvements. I can add those outputs to this if you'd like, but am guessing that the perf run that @fmease recommended is what's needed.

I have run ./x.py test --stage 0 library/std after building locally with ./x.py build library. I didn't try to build the whole compiler, but maybe I should have - any guidance would be appreciated.

If this general approach makes sense, I'll take a look at some other candidate categories, e.g., Cc, in the future.

Oh, and I wasn't sure whether the generated code should be included in this PR or not. I did include it.

@rustbot added the T-libs label (Relevant to the library team, which will review and decide on the PR/issue.) on Jul 20, 2022
@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @kennytm (or someone else) soon.

Please see the contribution instructions for more information.

@rustbot
Collaborator

rustbot commented Jul 20, 2022

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs, then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs, please edit the PR description to add a link to the relevant API Change Proposal, or create one if you haven't already. If you're unsure where your change falls, no worries: just leave it as is and the reviewer will take a look and decide whether to forward it on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Jul 20, 2022
@the8472
Member

the8472 commented Jul 20, 2022

I have benchmarked this locally, using criterion, and have seen meaningful performance improvements. I can add those outputs to this if you'd like, but am guessing that the perf run that @fmease recommended is what's needed.

Those aren't mutually exclusive, so please add them.

Also, did you measure the code-size impact? The unicode tables are optimized to avoid bloating compiled binaries too much.
Edit: Ah, I see the generated code is included and not much larger.

@bors try
@rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot added the S-waiting-on-perf label (Status: Waiting on a perf run to be completed.) on Jul 20, 2022
@bors
Contributor

bors commented Jul 20, 2022

⌛ Trying commit e5d4de3 with merge d1258d956ce049f81e01a273655c3b5476ef7d2f...


@leonardo-m

There's also the question about increased CPU cache pressure.

@bors
Contributor

bors commented Jul 20, 2022

☀️ Try build successful - checks-actions
Build commit: d1258d956ce049f81e01a273655c3b5476ef7d2f

@rust-timer
Collaborator

Queued d1258d956ce049f81e01a273655c3b5476ef7d2f with parent 03d488b, future comparison URL.

@rust-timer
Collaborator

Finished benchmarking commit (d1258d956ce049f81e01a273655c3b5476ef7d2f): comparison url.

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results
  • Primary benchmarks: mixed results
  • Secondary benchmarks: mixed results
                              mean¹    max      count²
Regressions 😿 (primary)       7.3%     7.3%     1
Regressions 😿 (secondary)     3.9%     4.1%     2
Improvements 🎉 (primary)     -1.4%    -1.4%     1
Improvements 🎉 (secondary)   -7.0%    -7.0%     1
All 😿🎉 (primary)              3.0%     7.3%     2

Cycles

Results
  • Primary benchmarks: 🎉 relevant improvements found
  • Secondary benchmarks: 🎉 relevant improvement found
                              mean¹    max      count²
Regressions 😿 (primary)       N/A      N/A      0
Regressions 😿 (secondary)     N/A      N/A      0
Improvements 🎉 (primary)     -3.0%    -3.9%     2
Improvements 🎉 (secondary)   -2.2%    -2.2%     1
All 😿🎉 (primary)             -3.0%    -3.9%     2

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf -perf-regression

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes

@rustbot removed the S-waiting-on-perf label (Status: Waiting on a perf run to be completed.) on Jul 20, 2022
@bmacnaughton
Contributor Author

bmacnaughton commented Jul 20, 2022

Here are the benchmarks I ran with criterion; they were focused only on checking character sequences for whitespace. All multiple-character benchmarks, other than issue-38, loop through sequentially increasing characters. issue-38 is a real-world data file. (I edited out the statistical reporting noise criterion emits so the comparisons are easier to see; multiple runs have all resulted in roughly the same results.)

The key to the naming of each benchmark:

The first element is always whitespace, a common prefix.

The second element:
is_whitespace - uses the current char::is_whitespace()
uni_ws - uses the new code
is_ascii_whitespace - uses the current char::is_ascii_whitespace() - only one instance of this

The last element indicates what data was tested:
extended - characters from 0..4000
ascii - characters from 0..=0x7f
first-256 - characters from 0..=0xff
issue-38 - a json file of 388619 bytes with many unicode characters
single-space - the space character
single-a - the lowercase a character
single-tab - the tab character

whitespace/is_whitespace/extended
                        time:   [99.715 us 100.44 us 101.30 us]
whitespace/uni_ws/extended
                        time:   [3.7047 us 3.7227 us 3.7422 us]


whitespace/is_whitespace/ascii
                        time:   [70.171 ns 70.969 ns 71.770 ns]
whitespace/uni_ws/ascii
                        time:   [28.764 ns 28.867 ns 28.982 ns]
whitespace/is_ascii_whitespace/ascii
                        time:   [28.504 ns 28.612 ns 28.730 ns]


whitespace/is_whitespace/first-256
                        time:   [993.37 ns 997.58 ns 1.0018 us]
whitespace/uni_ws/first-256
                        time:   [57.510 ns 57.686 ns 57.880 ns]

whitespace/is_whitespace/issue-38
                        time:   [479.65 us 481.29 us 483.16 us]
whitespace/uni_ws/issue-38
                        time:   [153.03 us 153.65 us 154.33 us]


whitespace/is_whitespace/single-space
                        time:   [462.76 ps 472.56 ps 486.08 ps]
whitespace/uni_ws/single-space
                        time:   [234.87 ps 236.92 ps 238.90 ps]


whitespace/is_whitespace/single-a
                        time:   [473.38 ps 480.89 ps 489.09 ps]
whitespace/uni_ws/single-a
                        time:   [226.44 ps 227.59 ps 228.93 ps]


whitespace/is_whitespace/single-tab
                        time:   [468.98 ps 474.41 ps 479.54 ps]
whitespace/uni_ws/single-tab
                        time:   [225.10 ps 226.04 ps 227.04 ps]
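For illustration, a minimal criterion harness in the shape described above might look like the sketch below. The group and dataset names mirror the key; the input generation and black_box usage are assumptions for the sketch, not the exact benchmark code that produced these numbers.

// Minimal criterion sketch, assuming the naming in the key above; harness
// details (input generation, black_box) are illustrative assumptions.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_whitespace(c: &mut Criterion) {
    // "extended": sequentially increasing characters from 0..4000
    let extended: Vec<char> = (0u32..4000).filter_map(char::from_u32).collect();

    let mut group = c.benchmark_group("whitespace");
    group.bench_function("is_whitespace/extended", |b| {
        b.iter(|| extended.iter().filter(|ch| black_box(ch.is_whitespace())).count())
    });
    group.finish();
}

criterion_group!(benches, bench_whitespace);
criterion_main!(benches);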

@bmacnaughton
Contributor Author

Edit: Ah, I see the generated code is included and not much larger. @the8472

i was aware that this made the code slightly larger because the 256-byte table is larger than the 37 bytes in the skip list, but both the code itself and the table are small. i thought the performance gain was worth it, but i don't know how that is weighed in your decision.

@thomcc
Member

thomcc commented Jul 23, 2022

How does this compare to implementing is_whitespace as a match of just the codepoints that are relevant? (It's a very small set of them). I'd expect that to have much better behavior, at least under cold cache situations (I've also been meaning to do that for is_whitespace and is_control).

@bmacnaughton
Contributor Author

How does this compare to implementing is_whitespace as a match of just the codepoints that are relevant? (It's a very small set of them). I'd expect that to have much better behavior, at least under cold cache situations (I've also been meaning to do that for is_whitespace and is_control).

that's a good question; i just ran a quick comparison by changing the uni_ws function to a version like you suggested. the match approach is either statistically the same or just slightly slower (4-5%) than the code i submitted. i'll do some code-size comparisons this week and post results (i'll also run benchmarks on a dedicated benchmark machine).

thanks for the suggestion!
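For reference, a match over just the relevant codepoints (the variant being discussed) could look roughly like the sketch below. The codepoint set is the Unicode White_Space property; this is illustrative only and not necessarily the exact code compared in the numbers later in the thread.

// Sketch of the match-based variant: one match over the White_Space codepoints.
#[inline]
pub fn is_whitespace_match(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}'       // tab, line feed, vertical tab, form feed, carriage return
            | '\u{0020}'              // space
            | '\u{0085}'              // next line
            | '\u{00A0}'              // no-break space
            | '\u{1680}'              // ogham space mark
            | '\u{2000}'..='\u{200A}' // en quad .. hair space
            | '\u{2028}'              // line separator
            | '\u{2029}'              // paragraph separator
            | '\u{202F}'              // narrow no-break space
            | '\u{205F}'              // medium mathematical space
            | '\u{3000}'              // ideographic space
    )
}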

@bmacnaughton
Contributor Author

bmacnaughton commented Jul 28, 2022

updated benchmarks with a new key. my quick-and-dirty benchmark was a little bit off, it appears. my interpretation: the mapped-if (see below) approach is the overall best, losing to match only in the case of the first match value, '\t'.

i'll have some code-size comparisons before monday.

The key to the naming of each benchmark:

The first element is always ws, a common prefix.

The second element:
is_whitespace - uses the current char::is_whitespace()
is_ascii - uses char::is_ascii_whitespace()
match - uses the match-based implementation suggested by @thomcc
mapped - uses the original implementation (match arms)
mapped-if - modified original implementation using if rather than match arms.

(the code for the last 3 is in src/map_lookup.rs)

The last element indicates what data was tested:
issue-38 - a json file of 388619 bytes with many unicode characters
addresses - 500 random u.s. addresses (29986 bytes)
first-128 - 0..0x80
first-256 - 0..0x100
first-4000 - 0..0x4000 (really 16384)
30K-blanks - 30_000 ascii blanks
single-space - one ascii blank
single-a - one 'a'
single-tab - one '\t'

The result with the best performance for each dataset is marked with an *; more than one is marked for statistical ties. is_ascii_whitespace() is excluded from "best performance" because these changes are aimed at improving is_whitespace(). it's in the benchmark for comparison purposes (looking at the code, there's not much more to squeeze out of it).

ws/is_whitespace/issue-38        time:   [527.84 µs 528.70 µs 529.89 µs]
ws/is_ascii/issue-38             time:   [324.65 µs 325.59 µs 326.42 µs]
ws/match/issue-38                time:   [623.96 µs 624.12 µs 624.30 µs]
ws/mapped/issue-38               time:   [468.43 µs 468.51 µs 468.62 µs]
*ws/mapped-if/issue-38            time:   [259.96 µs 260.03 µs 260.11 µs]

ws/is_whitespace/addresses       time:   [45.766 µs 45.776 µs 45.788 µs]
ws/is_ascii/addresses            time:   [38.286 µs 38.305 µs 38.325 µs]
ws/match/addresses               time:   [50.122 µs 50.139 µs 50.159 µs]
ws/mapped/addresses              time:   [36.143 µs 36.153 µs 36.165 µs]
*ws/mapped-if/addresses           time:   [18.072 µs 18.076 µs 18.079 µs]

ws/is_whitespace/first-128       time:   [139.00 ns 139.03 ns 139.05 ns]
ws/is_ascii/first-128            time:   [114.00 ns 114.12 ns 114.27 ns]
ws/match/first-128               time:   [206.35 ns 206.40 ns 206.45 ns]
ws/mapped/first-128              time:   [159.31 ns 159.34 ns 159.38 ns]
*ws/mapped-if/first-128           time:   [82.443 ns 82.460 ns 82.478 ns]

ws/is_whitespace/first-256       time:   [1.3559 µs 1.3563 µs 1.3566 µs]
ws/is_ascii/first-256            time:   [254.32 ns 254.48 ns 254.67 ns]
ws/match/first-256               time:   [501.59 ns 501.72 ns 501.87 ns]
ws/mapped/first-256              time:   [347.05 ns 347.16 ns 347.28 ns]
*ws/mapped-if/first-256           time:   [269.13 ns 269.19 ns 269.26 ns]

ws/is_whitespace/first-4000      time:   [126.43 µs 126.58 µs 126.72 µs]
ws/is_ascii/first-4000           time:   [20.343 µs 20.369 µs 20.402 µs]
ws/match/first-4000              time:   [39.066 µs 39.079 µs 39.095 µs]
*ws/mapped/first-4000             time:   [24.187 µs 24.194 µs 24.203 µs]
ws/mapped-if/first-4000          time:   [31.897 µs 31.911 µs 31.925 µs]

ws/is_whitespace/30K-blanks      time:   [31.680 µs 31.686 µs 31.692 µs]
ws/is_ascii/30K-blanks           time:   [24.103 µs 24.111 µs 24.120 µs]
*ws/match/30K-blanks              time:   [18.106 µs 18.117 µs 18.130 µs]
ws/mapped/30K-blanks             time:   [36.161 µs 36.175 µs 36.191 µs]
*ws/mapped-if/30K-blanks          time:   [18.084 µs 18.089 µs 18.095 µs]

ws/is_whitespace/single-space    time:   [1.2049 ns 1.2054 ns 1.2062 ns]
ws/is_ascii/single-space         time:   [1.4057 ns 1.4060 ns 1.4064 ns]
*ws/match/single-space            time:   [1.0038 ns 1.0043 ns 1.0049 ns]
ws/mapped/single-space           time:   [2.0073 ns 2.0078 ns 2.0084 ns]
*ws/mapped-if/single-space        time:   [1.0042 ns 1.0046 ns 1.0052 ns]

ws/is_whitespace/single-a        time:   [1.2045 ns 1.2049 ns 1.2053 ns]
ws/is_ascii/single-a             time:   [1.0042 ns 1.0049 ns 1.0058 ns]
ws/match/single-a                time:   [1.8074 ns 1.8078 ns 1.8083 ns]
ws/mapped/single-a               time:   [2.0078 ns 2.0082 ns 2.0085 ns]
*ws/mapped-if/single-a            time:   [1.0040 ns 1.0042 ns 1.0043 ns]

ws/is_whitespace/single-tab      time:   [1.2049 ns 1.2055 ns 1.2062 ns]
ws/is_ascii/single-tab           time:   [1.4057 ns 1.4061 ns 1.4065 ns]
*ws/match/single-tab              time:   [806.82 ps 809.74 ps 813.16 ps]
ws/mapped/single-tab             time:   [2.0080 ns 2.0084 ns 2.0088 ns]
ws/mapped-if/single-tab          time:   [1.0042 ns 1.0047 ns 1.0053 ns]

note - these benchmarks were run on a dedicated Xeon E-2278G CPU @ 3.40GHz with 32GB of memory with no other user loads running (other than my ssh connection).
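To make the naming concrete, the sketch below shows the general shape of a table-plus-if ("mapped-if") check: a small lookup table for low codepoints plus a handful of comparisons for the higher whitespace codepoints. The table name, size, and cutoffs here are assumptions for illustration, not the code generated by this PR.

// Illustrative "mapped-if" sketch; table name, layout, and cutoffs are assumptions.
const WS_MAP: [bool; 256] = {
    let mut map = [false; 256];
    let mut i = 0x09;
    while i <= 0x0D {
        map[i] = true; // tab .. carriage return
        i += 1;
    }
    map[0x20] = true; // space
    map[0x85] = true; // next line
    map[0xA0] = true; // no-break space
    map
};

#[inline]
pub fn is_whitespace_mapped_if(c: char) -> bool {
    let cp = c as u32;
    if cp < 0x100 {
        WS_MAP[cp as usize]
    } else if cp < 0x2000 {
        cp == 0x1680 // ogham space mark
    } else if cp <= 0x200A {
        true // en quad .. hair space
    } else {
        cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F || cp == 0x3000
    }
}

The point of this shape is that the common low-codepoint path is one bounds check plus one table load, while the rare higher codepoints fall through to a few comparisons.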

@bjorn3
Member

bjorn3 commented Jul 29, 2022

I think benchmarking this in isolation is not realistic. The increased table size could slow down other code by increasing cache misses, but that isn't measured when benchmarking this code in isolation. At the very least I think you should benchmark calling is_whitespace on a string large enough to not fit in the caches, and on one that fits in the L2 cache but not the L1D cache.

@leonardo-m

Thank you bjorn3, that's what I was trying to say.

@thomcc
Member

thomcc commented Jul 29, 2022

Right, that's why I'm in favor of the match, which is the smallest in both icache and dcache. If you benchmark, you should also benchmark something with realistic branch prediction rates, like a big blob of text, json, html, ...

@bmacnaughton
Contributor Author

i'm not sure i follow - issue-38 is 388619 bytes of json, addresses is 29986 bytes of json.

what size are you suggesting is the minimum that makes sense?

@bmacnaughton
Contributor Author

ok, i've found this information, does it align with what you're thinking?

l1i - 8x32kb (262144 bytes)
l1d - 8x32kb (262144 bytes)
l2 - 8x256kb (2_097_152 bytes)
l3 - 16mb (16_777_216 bytes)

@bjorn3 - issue-38 does not fit in the l1 cache but does fit in the l2 cache. would you like to see a json file that doesn't fit in the l2 cache? and one that doesn't fit in the l3 cache? i'm happy to run any benchmarks you suggest. it's easy enough to clone what's there to achieve any size necessary.

@bmacnaughton
Contributor Author

bmacnaughton commented Jul 29, 2022

@thomcc - the table definitely adds to the size of mapped-if but it's 23 instructions while match is 28 instructions. i'm not familiar with the details of the icache, but am presuming that data, whether static or not, doesn't go through the icache.

mapped-if - https://godbolt.org/z/c3ajqhYWa
match - https://godbolt.org/z/W3nhMvxq5

@bmacnaughton
Contributor Author

bmacnaughton commented Jul 29, 2022

that isn't measured when benchmarking this code in isolation.

@bjorn3 - i could use a bit of guidance here. i'm not sure how to balance the cache costs of one approach vs. another in a benchmark. if the comparison we want to see is buried in too much code, then the impact of the changes is lost in the noise; the other end of the spectrum is what's here - the performance of each approach is completely isolated.

i'm not familiar with measuring the impact of the 128 byte map on general cache performance. i could benchmark is_whitespace using something like a json parser, which would approximate one real-world use case. i could modify the parser to just check every character for being whitespace, but throw away the result. it's enough work that i'd like your thoughts before going there.

do you have any good examples that you could point me at?

@bjorn3
Member

bjorn3 commented Jul 29, 2022

do you have any good examples that you could point me at?

Unfortunately not.

@thomcc
Member

thomcc commented Jul 29, 2022

@thomcc - the table definitely adds to the size of mapped-if but it's 23 instructions while match is 28 instructions. i'm not familiar with the details of the icache, but am presuming that data, whether static or not, doesn't go through the icache.

Instruction counting isn't the way to measure this, as instructions are variable length on x86 (you can measure it by doing multiple builds and comparing their sizes directly, though this is a huge pain), and you may be right that they're roughly the same, so don't feel obligated to do this.

I was more referring to the current implementation, which is significantly larger than either in icache.

@bmacnaughton
Contributor Author

I was more referring to the current implementation, which is significantly larger than either in icache.

that's what got me started on this. i wanted to understand how the current implementation worked; i didn't realize it was machine-generated at first. i just kept digging and ultimately found myself thinking "there has to be a better way" for small sets of codepoints, like whitespace and control characters.

@bmacnaughton
Contributor Author

bmacnaughton commented Jul 30, 2022

Instruction counting isn't the way to measure this, as instructions are variable length on x86

makes sense; i figured the count was a proxy for the instruction-bytes, but precision is always better. i copied the code from godbolt, inserted it into a simple .s file ending with:

.LEND:  .long LENGTH
LENGTH = .LEND - _start

and then objdump'd the files and looked at the LENGTH symbol.

  • mapped-if: 0x42 = 66 bytes (plus 256 non-executable bytes for the table)
  • match: 0x62 = 98 bytes
  • is_whitespace (ascii check then jump to unicode check): 0x25 = 37 bytes
  • is_whitespace (including skip_search, decode_length, decode_prefix_sum): 0x14c = 332 bytes

this same approach could work for control characters and it's possible (size-wise) to use the same table as whitespace. it would require fiddling with the code-writing code, or possibly whitespace and control-codes could just be hardcoded. i don't know how likely they are to change, but they appear relatively stable.
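Purely as a hypothetical illustration of sharing one table between the two properties, a flags-style layout could look like the sketch below; the names, flag bits, and layout are assumptions, not something proposed in this PR.

// Hypothetical shared-table sketch: bit 0 marks White_Space, bit 1 marks Cc.
// Cc is exactly U+0000..=U+001F and U+007F..=U+009F, so it fits entirely in a
// 256-entry table; names and layout here are illustrative assumptions.
const FLAG_WS: u8 = 1 << 0;
const FLAG_CC: u8 = 1 << 1;

const PROPS: [u8; 256] = {
    let mut t = [0u8; 256];
    let mut i = 0x00;
    while i <= 0x1F {
        t[i] |= FLAG_CC;
        i += 1;
    }
    let mut i = 0x7F;
    while i <= 0x9F {
        t[i] |= FLAG_CC;
        i += 1;
    }
    let mut i = 0x09;
    while i <= 0x0D {
        t[i] |= FLAG_WS;
        i += 1;
    }
    t[0x20] |= FLAG_WS;
    t[0x85] |= FLAG_WS;
    t[0xA0] |= FLAG_WS;
    t
};

#[inline]
pub fn is_control_cc(c: char) -> bool {
    (c as u32) < 0x100 && PROPS[c as usize] & FLAG_CC != 0
}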

@bmacnaughton
Contributor Author

bmacnaughton commented Aug 5, 2022

is there anything i can do to move this along?

it might make sense to hardcode the whitespace and control character checks as opposed to generating them from the downloaded unicode tables; using if-then-else does perform better than match.

or, if there is no interest, i'm ok with closing this. you have plenty of PRs to wrestle with. this improvement is very narrow in scope.

@thomcc
Member

thomcc commented Aug 8, 2022

is there anything i can do to move this along?

Sorry, RustConf was this week, which took up some time. This is still on my radar, and I will get back to you shortly.

@thomcc
Member

thomcc commented Aug 25, 2022

Taking a bunch of kenny's old PRs since he probably will not get to them.

r? @thomcc

@rust-highfive assigned thomcc and unassigned kennytm on Aug 25, 2022
@thomcc
Member

thomcc commented Aug 25, 2022

I think there are further improvements possible here, but we shouldn't use this to block the current version, which I do think is an improvement. This is a perf change and it's plausible that it's hot code, so it shouldn't be in rollup.

@bors r+ rollup=never

@bors
Contributor

bors commented Aug 25, 2022

📌 Commit 5d048eb has been approved by thomcc

It is now in the queue for this repository.

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion.) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) on Aug 25, 2022
@bors
Contributor

bors commented Aug 26, 2022

⌛ Testing commit 5d048eb with merge 76f3b89...

@bmacnaughton
Contributor Author

let me know if there's anything you need from me. thanks for the updates.

@bors
Contributor

bors commented Aug 26, 2022

☀️ Test successful - checks-actions
Approved by: thomcc
Pushing 76f3b89 to master...

@bors added the merged-by-bors label (This PR was explicitly merged by bors.) on Aug 26, 2022
@bors merged commit 76f3b89 into rust-lang:master on Aug 26, 2022
@rustbot added this to the 1.65.0 milestone on Aug 26, 2022
@rust-timer
Collaborator

Finished benchmarking commit (76f3b89): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

                              mean¹    range              count²
Regressions ❌ (primary)        -        -                  0
Regressions ❌ (secondary)      0.2%     [0.2%, 0.2%]       1
Improvements ✅ (primary)       -        -                  0
Improvements ✅ (secondary)    -1.0%    [-1.3%, -0.6%]      6
All ❌✅ (primary)               -        -                  0

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                              mean¹    range              count²
Regressions ❌ (primary)        -        -                  0
Regressions ❌ (secondary)      1.6%     [1.2%, 1.8%]       3
Improvements ✅ (primary)       -        -                  0
Improvements ✅ (secondary)     -        -                  0
All ❌✅ (primary)               -        -                  0

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                              mean¹    range              count²
Regressions ❌ (primary)        4.3%     [2.5%, 6.3%]       3
Regressions ❌ (secondary)      -        -                  0
Improvements ✅ (primary)      -3.5%    [-3.5%, -3.5%]      1
Improvements ✅ (secondary)    -3.1%    [-3.1%, -3.1%]      1
All ❌✅ (primary)               2.4%     [-3.5%, 6.3%]      4

Footnotes

  1. the arithmetic mean of the percent change

  2. number of relevant changes
