Add a check for ASCII characters in to_upper and to_lower #81358

mcastorina · 2021-01-24T22:14:21Z

This extra check has better performance. See discussion here:
https://internals.rust-lang.org/t/to-upper-speed/13896

Thanks to @gilescope for helping discover and test this.
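
For readers who want the idea in code form, here is a minimal sketch of an ASCII fast path placed in front of a slower Unicode lookup. It is an illustration only, not the code from this PR: the name `to_upper_sketch` is hypothetical, and the non-ASCII branch simply defers to the standard library's `char::to_uppercase` as a stand-in for the internal table lookup.

```rust
/// Hypothetical sketch: check for ASCII first, fall back to the full
/// Unicode mapping otherwise. Returns up to three chars, padded with '\0'.
fn to_upper_sketch(c: char) -> [char; 3] {
    if c.is_ascii() {
        // Fast path: ASCII case mapping needs no table lookup.
        [c.to_ascii_uppercase(), '\0', '\0']
    } else {
        // Slow path (assumption): reuse the standard library's mapping
        // in place of the generated lookup table.
        let mut out = ['\0'; 3];
        for (slot, up) in out.iter_mut().zip(c.to_uppercase()) {
            *slot = up;
        }
        out
    }
}

fn main() {
    println!("{:?}", to_upper_sketch('a'));
    println!("{:?}", to_upper_sketch('ß'));
}
```

The extra branch costs little for non-ASCII input because it is a single comparison, while ASCII input skips the lookup entirely.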

rust-highfive · 2021-01-24T22:14:24Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

nagisa · 2021-01-24T23:34:21Z

This should be accompanied by some benchmark outputs.

mcastorina · 2021-01-25T02:35:15Z

I didn't see to_upper / to_lower being benchmarked anywhere, so I added two more benches for them. Below are the results before and after this PR's changes.

EDIT: To be clear, these benchmarks test both ASCII and non-ASCII characters combined. Maybe it would be better to separate them, but I'll wait for feedback before doing so.

Before

test char::methods::bench_ascii_to_lowercase  ... bench:     248,590 ns/iter (+/- 1,181)
test char::methods::bench_ascii_to_uppercase  ... bench:     250,286 ns/iter (+/- 513)

After

test char::methods::bench_ascii_to_lowercase  ... bench:     152,079 ns/iter (+/- 84)
test char::methods::bench_ascii_to_uppercase  ... bench:     154,737 ns/iter (+/- 680)

gilescope · 2021-01-26T12:26:55Z

Just linking this up with the internals chat on this subject: https://internals.rust-lang.org/t/to-upper-speed/13896/12

bluss · 2021-02-02T16:58:57Z

library/core/benches/char/methods.rs

+
+#[bench]
+fn bench_ascii_to_uppercase(b: &mut Bencher) {
+    b.iter(|| (0..=255).cycle().take(10_000).map(|b| char::from(b).to_uppercase()).count())


bluss · Feb 2, 2021

Would it be possible to say anything about benchmark results on mixed and non-ASCII text? I expect branch prediction would mean that on pure non-ASCII input there is virtually no change(?)

mcastorina · Feb 3, 2021

I would expect so. What do you think of adding two more benchmarks for a total of three?

- all ASCII characters
- no ASCII characters
- 50/50 mixed characters

gilescope · Feb 3, 2021

We need to be slightly careful here. The non-ASCII benchmark is always one-byte Unicode, is it not? It is non-ASCII, but above 192 a byte indicates the start of another code point, so there's going to be plenty of malformed Unicode in here.
It might be slightly better to create a random slice of u8 and then use String::from_utf8_lossy() to make it valid first?
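
A detail that may help when reading this exchange: `char::from(u8)` maps a byte to the code point with the same numeric value, so the benchmark's `0..=255` range yields the chars U+0000 through U+00FF rather than raw UTF-8 bytes. The small sketch below contrasts that with decoding the same byte as UTF-8; the byte value 200 is just an arbitrary example.

```rust
fn main() {
    // char::from(u8) treats the byte as a code point, so 200 becomes
    // U+00C8, a valid char outside the ASCII range.
    let c = char::from(200u8);
    println!("{:?} is U+{:04X}, ascii: {}", c, c as u32, c.is_ascii());

    // Decoding the same lone byte as UTF-8 is a different operation:
    // 0xC8 is a lead byte with no continuation byte, so lossy decoding
    // substitutes the replacement character U+FFFD.
    let s = String::from_utf8_lossy(&[200u8]);
    println!("{:?}", s);
}
```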

mcastorina · Feb 3, 2021

Since it's a benchmark, I would want to make the "random chars" constant so runs can be compared.

What do you think of this? Playground: each test case can use the same list of chars and filter for what it is testing (where mixed does no filtering).

gilescope · Feb 10, 2021

Random was the wrong choice of words here, sorry! It looks like a fair few other benchmarks / tests don't worry about putting in malformed UTF sequences, so I think it's fine to define the benchmarks this way for now.
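
The approach discussed in this thread (one fixed list of chars, filtered per test case) could look roughly like the sketch below. The Playground contents are not reproduced here, so this is an assumption about the shape of the idea, not the benchmark code that was merged. It uses `std::time::Instant` instead of the nightly `#[bench]` harness so it runs on stable Rust, and the names `sample_chars` and `run` are hypothetical.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Fixed, deterministic input: every code point from U+0000 to U+00FF.
/// (Assumption: mirrors the 0..=255 range of the bench quoted above.)
fn sample_chars() -> Vec<char> {
    (0u8..=255).map(char::from).collect()
}

/// Cycle through `chars`, upper-case `n` of them, and return how many
/// chars the conversions produced in total.
fn run(chars: &[char], n: usize) -> usize {
    chars
        .iter()
        .cycle()
        .take(n)
        .map(|&c| black_box(c).to_uppercase().count())
        .sum()
}

fn main() {
    let all = sample_chars();
    // Each case filters the same list; "mixed" does no filtering.
    let ascii: Vec<char> = all.iter().copied().filter(|c| c.is_ascii()).collect();
    let non_ascii: Vec<char> = all.iter().copied().filter(|c| !c.is_ascii()).collect();

    for (name, set) in [("ascii", &ascii), ("non_ascii", &non_ascii), ("mixed", &all)] {
        let start = Instant::now();
        let produced = black_box(run(set, 10_000));
        println!("{name}: {produced} chars in {:?}", start.elapsed());
    }
}
```

A single timed pass like this is much noisier than a real benchmark harness, so it is only useful for a rough comparison between the three cases.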

mcastorina · 2021-02-03T02:33:11Z

Additional Benchmarks

Before

test char::methods::bench_ascii_mix_to_lowercase                ... bench:     248,212 ns/iter (+/- 235)
test char::methods::bench_ascii_mix_to_uppercase                ... bench:     251,416 ns/iter (+/- 135)
test char::methods::bench_ascii_to_lowercase                    ... bench:     247,025 ns/iter (+/- 204)
test char::methods::bench_ascii_to_uppercase                    ... bench:     246,656 ns/iter (+/- 108)
test char::methods::bench_non_ascii_to_lowercase                ... bench:     248,828 ns/iter (+/- 121)
test char::methods::bench_non_ascii_to_uppercase                ... bench:     253,036 ns/iter (+/- 1,798)

After

test char::methods::bench_ascii_mix_to_lowercase                ... bench:     151,013 ns/iter (+/- 248)
test char::methods::bench_ascii_mix_to_uppercase                ... bench:     154,812 ns/iter (+/- 475)
test char::methods::bench_ascii_to_lowercase                    ... bench:      53,337 ns/iter (+/- 197)
test char::methods::bench_ascii_to_uppercase                    ... bench:      56,142 ns/iter (+/- 104)
test char::methods::bench_non_ascii_to_lowercase                ... bench:     247,925 ns/iter (+/- 295)
test char::methods::bench_non_ascii_to_uppercase                ... bench:     251,799 ns/iter (+/- 503)
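
As a quick reading aid, the ratios implied by the `to_lowercase` rows above can be computed as in the sketch below: roughly 4.6x for pure ASCII, roughly 1.6x for the mixed case, and essentially unchanged for non-ASCII. The helper name `speedup` is hypothetical; the numbers are copied from the tables.

```rust
/// Ratio of the "before" time to the "after" time (both in ns/iter).
fn speedup(before_ns: f64, after_ns: f64) -> f64 {
    before_ns / after_ns
}

fn main() {
    // (name, before, after) copied from the to_lowercase rows above.
    let rows = [
        ("ascii", 247_025.0, 53_337.0),
        ("mixed", 248_212.0, 151_013.0),
        ("non_ascii", 248_828.0, 247_925.0),
    ];
    for (name, before, after) in rows {
        println!("{name}: {:.2}x", speedup(before, after));
    }

    // A 50/50 mix should cost about the average of the two pure cases.
    let expected_mixed = (53_337.0 + 247_925.0) / 2.0;
    println!("expected mixed: {expected_mixed:.0} ns/iter, measured: 151013");
}
```

The measured mixed result sits close to the average of the two pure cases, which is what one would expect if the ASCII half takes the fast path and the other half is unaffected.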

joshtriplett · 2021-02-03T05:40:35Z

That's quite definitive! It's a performance win for strings containing ASCII, even if they're not entirely ASCII, and it appears entirely neutral for strings containing non-ASCII.

There are likely other potential performance wins to be had here, but this is an obvious win with no apparent downside.

Let's see if it affects any of the existing benchmarks.

@bors try @rust-timer queue

rust-timer · 2021-02-03T05:40:36Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2021-02-03T05:40:47Z

⌛ Trying commit c6dae69c489b509f675035e9a255112a66587266 with merge cd94273159f60d0494f0803385204b8aa90b0e55...

bors · 2021-02-03T06:41:10Z

☀️ Try build successful - checks-actions
Build commit: cd94273159f60d0494f0803385204b8aa90b0e55 (cd94273159f60d0494f0803385204b8aa90b0e55)

rust-timer · 2021-02-03T06:41:11Z

Queued cd94273159f60d0494f0803385204b8aa90b0e55 with parent d95d4f0, future comparison URL.

rust-timer · 2021-02-03T09:10:43Z

Finished benchmarking try commit (cd94273159f60d0494f0803385204b8aa90b0e55): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

joshtriplett · 2021-02-04T05:59:09Z

Looks reasonable to merge, to me.

pickfire · 2021-02-07T14:24:18Z

Does this have an impact on String::to_uppercase?

gilescope · 2021-02-07T17:10:44Z

Yes - a positive one. At the moment the string to_uppercase calls the char fn for each char, and though I think there's room for optimization there too, that's a tale for another PR.
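
To illustrate why a faster per-char conversion carries over to strings, here is a minimal sketch of upper-casing a string one char at a time. It is not the standard library's implementation; the helper name `upper_per_char` is hypothetical, and the check in `main` only shows that the per-char result agrees with `str::to_uppercase` for this input.

```rust
/// Hypothetical helper: upper-case a string by converting each char.
fn upper_per_char(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        // One char may expand to several (for example 'ß' becomes "SS").
        out.extend(c.to_uppercase());
    }
    out
}

fn main() {
    let s = "Grüße, world";
    assert_eq!(upper_per_char(s), s.to_uppercase());
    println!("{}", upper_per_char(s));
}
```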

gilescope · 2021-02-07T17:12:53Z

FYI: Also in the same area: #81837
(compounding speedups)


JohnTitor · 2021-03-17T03:09:31Z

r? @joshtriplett

joshtriplett · 2021-03-17T07:08:35Z

@bors r+

bors · 2021-03-17T07:08:36Z

📌 Commit 229fdf8 has been approved by joshtriplett

bors · 2021-03-17T11:17:23Z

⌛ Testing commit 229fdf8 with merge 0ce0fed...

bors · 2021-03-17T14:01:43Z

☀️ Test successful - checks-actions
Approved by: joshtriplett
Pushing 0ce0fed to master...

cuviper · 2021-10-07T01:18:39Z

Please note, unicode_data.rs is a generated file, not meant to be changed directly, as it says at the top. I have reapplied your changes in the tool in #89614 so we don't lose this optimization.

Update to Unicode 14.0

The Unicode Standard [announced Version 14.0](https://home.unicode.org/announcing-the-unicode-standard-version-14-0/) on September 14, 2021, and this pull request updates the generated tables in `core` accordingly.

This did require a little prep-work in `unicode-table-generator`. First, rust-lang#81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.

rust-highfive assigned sfackler Jan 24, 2021

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 24, 2021

bluss reviewed Feb 2, 2021


rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 3, 2021

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 3, 2021

mcastorina added 3 commits February 26, 2021 11:39

Add a check for ASCII characters in to_upper and to_lower

e48c684

This extra check has better performance. See discussion here: https://internals.rust-lang.org/t/to-upper-speed/13896

Add to_lowercase and to_uppercase char benchmarks

8acb566

Add two more benchmarks for strictly ASCII and non ASCII cases

229fdf8

mcastorina force-pushed the to-upper-lower-speed branch from c6dae69 to 229fdf8 Compare February 26, 2021 17:43

JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 16, 2021

rust-highfive assigned joshtriplett and unassigned sfackler Mar 17, 2021

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 17, 2021

bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 17, 2021

bors merged commit 0ce0fed into rust-lang:master Mar 17, 2021

rustbot added this to the 1.52.0 milestone Mar 17, 2021

cuviper added a commit to cuviper/rust that referenced this pull request Oct 7, 2021

Redo rust-lang#81358 in unicode-table-generator

e159d42

cuviper mentioned this pull request Oct 7, 2021

Update to Unicode 14.0 #89614

Merged
