is_whitespace() performance improvements #99487
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @kennytm (or someone else) soon. Please see the contribution instructions for more information.
Hey! It looks like you've submitted a new PR for the library teams! If this PR contains changes to any … Examples of …
Those aren't mutually exclusive, so please add them. Also, did you measure the code-size impact? The unicode tables are optimized to avoid bloating compiled binaries too much. @bors try @rust-timer queue
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit e5d4de3 with merge d1258d956ce049f81e01a273655c3b5476ef7d2f...
There's also the question about increased CPU cache pressure.
☀️ Try build successful - checks-actions
Queued d1258d956ce049f81e01a273655c3b5476ef7d2f with parent 03d488b, future comparison URL.
Finished benchmarking commit (d1258d956ce049f81e01a273655c3b5476ef7d2f): comparison URL.

Instruction count: this benchmark run did not return any relevant results for this metric.
Max RSS (memory usage): results
Cycles: results

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never
Here are the benchmarks I ran with criterion; they were focused only on checking character sequences for whitespace. All multiple-character benchmarks, other than issue-38, loop through sequentially increasing characters. issue-38 is a real-world data file. (I edited out the statistical reporting noise criterion emits so the comparisons are easier to see; multiple runs have all resulted in roughly the same results.)

The key to the naming of each benchmark:
- The first element is always …
- The second element: …
- The last element indicates what data was tested: …
the benchmark code: https://github.com/bmacnaughton/is_whitespace/tree/main/benches
i was aware that this made the code slightly larger because the 256-byte table is larger than the 37 bytes in the skip list, but the code itself and the table are small. i thought the performance gain was worth it, but don't know how that is weighed in your decision.
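For readers following along, here is a minimal sketch of the table-lookup idea under discussion. The constant name, the table construction, and the non-Latin-1 fallback are my illustration, not the PR's actual generated code; the codepoint list comes from Unicode's White_Space property.

```rust
// Hypothetical sketch: a 256-entry table covering the Latin-1 range,
// with a fallback match for the rare higher whitespace codepoints.
const WHITESPACE_MAP: [bool; 256] = {
    let mut map = [false; 256];
    map[0x09] = true; // tab
    map[0x0A] = true; // line feed
    map[0x0B] = true; // vertical tab
    map[0x0C] = true; // form feed
    map[0x0D] = true; // carriage return
    map[0x20] = true; // space
    map[0x85] = true; // next line (NEL)
    map[0xA0] = true; // no-break space
    map
};

pub fn is_whitespace_table(c: char) -> bool {
    let cp = c as u32;
    if cp < 256 {
        // a single indexed load for the hot Latin-1 path
        WHITESPACE_MAP[cp as usize]
    } else {
        // the remaining White_Space codepoints, from Unicode's PropList.txt
        matches!(
            cp,
            0x1680 | 0x2000..=0x200A | 0x2028 | 0x2029 | 0x202F | 0x205F | 0x3000
        )
    }
}
```

The trade-off described in the comment above is visible here: the table costs 256 bytes of read-only data, but the common path is a bounds-free load rather than a chain of compares.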
How does this compare to implementing is_whitespace as a match of just the codepoints that are relevant? (It's a very small set of them.) I'd expect that to have much better behavior, at least under cold-cache situations. (I've also been meaning to do that for is_whitespace and is_control.)
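A minimal sketch of the match-based version being suggested here, with the codepoint list taken from Unicode's White_Space property (illustrative, not the actual std implementation):

```rust
// Hypothetical sketch: is_whitespace as a single match over the 25
// White_Space codepoints. The compiler can lower this to compares and
// range checks with no lookup table at all.
pub fn is_whitespace_match(c: char) -> bool {
    matches!(
        c,
        '\u{0009}'..='\u{000D}' // tab, LF, VT, FF, CR
            | '\u{0020}'        // space
            | '\u{0085}'        // next line (NEL)
            | '\u{00A0}'        // no-break space
            | '\u{1680}'        // ogham space mark
            | '\u{2000}'..='\u{200A}' // en quad .. hair space
            | '\u{2028}'        // line separator
            | '\u{2029}'        // paragraph separator
            | '\u{202F}'        // narrow no-break space
            | '\u{205F}'        // medium mathematical space
            | '\u{3000}'        // ideographic space
    )
}
```

As the comment suggests, this needs no dcache footprint, which is why it may win under cold-cache conditions even if a table lookup wins in a hot micro-benchmark.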
that's a good question; i just ran a quick compare by changing the …

thanks for the suggestion!
updated benchmarks with new key. my quick-and-dirty benchmark was a little bit off, it appears. my interpretation: the … i'll have some code-size comparisons before monday.

The key to the naming of each benchmark:
- The first element is always …
- The second element: … (the code for the last 3 is in src/map_lookup.rs)
- The last element indicates what data was tested: …

The result with the best performance for each dataset is marked with an …

note - these benchmarks were run on a dedicated Xeon E-2278G CPU @ 3.40GHz with 32GB of memory with no other user loads running (other than my ssh connection).
I think benchmarking this in isolation is not realistic. The increased table size could slow down other code by increasing cache misses, but that isn't measured when benchmarking this code in isolation. At the very least I think you should benchmark calling is_whitespace on a string large enough to not fit in caches, and one that fits in the L2 cache but not the L1D cache.
Thank you bjorn3, that's what I was trying to say.
Right, that's why I'm in favor of the match, which is the smallest in both icache and dcache. If you benchmark, you should also benchmark something with realistic branch prediction rates, like a big blob of text, json, html, ...
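A rough sketch of the kind of benchmark being asked for: scan a buffer much larger than the L2 cache rather than a tiny hot loop. The 8 MiB size, the sample text, and the timing harness are illustrative assumptions, not the PR's benchmark code.

```rust
use std::time::Instant;

// Count whitespace the same way the benchmarks do: one is_whitespace
// call per char over the whole input.
fn count_whitespace(s: &str) -> usize {
    s.chars().filter(|c| c.is_whitespace()).count()
}

fn main() {
    // JSON-ish text with realistic (i.e. mostly-false) branch outcomes.
    let chunk = "{\"key\": \"value\",\n  \"count\": 42}\n";
    // Repeat until well past a typical L2 cache (commonly 256 KiB - 2 MiB),
    // so the scan streams through memory instead of staying cache-resident.
    let big: String = chunk.repeat(8 * 1024 * 1024 / chunk.len());

    let start = Instant::now();
    let n = count_whitespace(&big);
    println!("{n} whitespace chars in {:?}", start.elapsed());
}
```

Timing a single streaming pass like this still doesn't capture the cost the table imposes on *other* code's cache residency, which is the harder effect raised above, but it at least avoids measuring a purely L1-resident loop.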
i'm not sure i follow - issue-38 is 388619 bytes of json, addresses is 29986 bytes of json. what size are you suggesting is the minimum that makes sense?
ok, i've found this information, does it align with what you're thinking?

l1i - 8x32kb (262144)
…

@bjorn3 - issue-38 does not fit in the l1 cache but does fit in the l2 cache. would you like to see a json file that doesn't fit in the l2 cache? and one that doesn't fit in the l3 cache? i'm happy to run benchmarks you suggest. it's easy enough to clone what's there to achieve any size necessary.
@thomcc - the table definitely adds to the size of mapped-if - https://godbolt.org/z/c3ajqhYWa
@bjorn3 - i could use a bit of guidance here. i'm not sure how to balance the cache costs of one approach vs. another in a benchmark. if the comparison we want to see is buried in too much code, then the impact of the changes is lost in the noise; the other end of the spectrum is what's here - the performance of each approach is completely isolated. i'm not familiar with measuring the impact of the 128 byte map on general cache performance. i could benchmark …

do you have any good examples that you could point me at?
Unfortunately not.
Instruction counting isn't the way to measure this, as instructions are variable length on x86 (you can measure it: do multiple builds and compare their sizes directly, though this is a huge pain), and you may be right that they're roughly the same, so don't feel obligated to do this. I was more referring to the current implementation, which is significantly larger than either in icache.
that's what got me started on this. i wanted to understand how the current implementation worked; i didn't realize it was machine-generated at first. i just kept digging and ultimately found myself thinking "there has to be a better way" for small sets of codepoints, like whitespace and control characters.
makes sense; i figured the count was a proxy for the instruction-bytes, but precision is always better. i copied the code from godbolt, inserted it into a simple .s file ending with:

…

and then objdump'd the files and looked at the LENGTH symbol.

mapped-if 0x42 = 66 (but has 256 bytes not-executable for the table)

this same approach could work for control characters, and it's possible (size-wise) to use the same table as whitespace. it would require fiddling with the code-writing code, or possibly whitespace and control-codes could just be hardcoded. i don't know how likely they are to change, but they appear relatively stable.
is there anything i can do to move this along? it might make sense to hardcode the whitespace and control character checks as opposed to generating them from the downloaded unicode tables; using if-then-else does perform better than match. or, if there is no interest, i'm ok with closing this. you have plenty of PRs to wrestle with, and this improvement is very narrow in scope.
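A minimal sketch of what a hardcoded if-then-else chain might look like, ordered so the overwhelmingly common ASCII range is checked first. The structure and the threshold values are my illustration, not the PR's code; the codepoints are Unicode's White_Space set.

```rust
// Hypothetical if-then-else version: cheap range tests first, so ASCII
// input resolves in one or two compares before the rare cases are tried.
pub fn is_whitespace_if(c: char) -> bool {
    let cp = c as u32;
    if cp <= 0x20 {
        // space, tab, LF, VT, FF, CR - the overwhelmingly common cases
        cp == 0x20 || (0x09..=0x0D).contains(&cp)
    } else if cp < 0x1680 {
        // next line (NEL) and no-break space
        cp == 0x85 || cp == 0xA0
    } else {
        // the remaining, rare whitespace codepoints
        cp == 0x1680
            || (0x2000..=0x200A).contains(&cp)
            || cp == 0x2028
            || cp == 0x2029
            || cp == 0x202F
            || cp == 0x205F
            || cp == 0x3000
    }
}
```

The explicit ordering is the point: a match leaves the branch layout to the compiler, while a chain like this bakes in the assumption that inputs are mostly ASCII.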
Sorry, RustConf was this week, which took up some time. This is still on my radar, and I will get back to you shortly.
Taking a bunch of kenny's old PRs since he probably will not get to them. r? @thomcc
I think there are further improvements possible here, but we shouldn't use that to block the current version, which I do think is an improvement. This is a perf change and it's plausible that it's hot code, so it shouldn't be in a rollup. @bors r+ rollup=never
let me know if there's anything you need from me. thanks for the updates.
☀️ Test successful - checks-actions
Finished benchmarking commit (76f3b89): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count: this is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
This is my first rust PR, so if I miss anything obvious please let me know and I'll do my best to fix it.

This was a bit more of a challenge than I realized because, while I made working code locally and tested it against the native `is_whitespace()`, this PR required changing `src/tools/unicode-table-generator`, the code that generates the code.

I have benchmarked this locally, using criterion, and have seen meaningful performance improvements. I can add those outputs to this if you'd like, but am guessing that the perf run that @fmease recommended is what's needed.

I have run `./x.py test --stage 0 library/std` after building it locally by executing `./x.py build library`. I didn't try to build the whole compiler, but maybe I should have - any guidance would be appreciated.

If this general approach makes sense, I'll take a look at some other candidate categories, e.g., `Cc`, in the future.

Oh, and I wasn't sure whether the generated code should be included in this PR or not. I did include it.