Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a check for ASCII characters in to_upper and to_lower #81358

Merged
merged 3 commits into from Mar 17, 2021

Conversation

mcastorina
Copy link
Contributor

This extra check has better performance. See discussion here:
https://internals.rust-lang.org/t/to-upper-speed/13896

Thanks to @gilescope for helping discover and test this.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 24, 2021
@nagisa
Copy link
Member

nagisa commented Jan 24, 2021

This should be accompanied with some benchmark outputs.

@mcastorina
Copy link
Contributor Author

mcastorina commented Jan 25, 2021

I didn't see to_upper / to_lower being benchmarked anywhere, so I added two more benches for them. Below are the results before and after this PR's changes.

EDIT: To be clear, these benchmarks test both ASCII and non-ASCII characters combined. Maybe it would be better to separate them, but I'll wait for feedback before doing so.

Before

test char::methods::bench_ascii_to_lowercase  ... bench:     248,590 ns/iter (+/- 1,181)
test char::methods::bench_ascii_to_uppercase  ... bench:     250,286 ns/iter (+/- 513)

After

test char::methods::bench_ascii_to_lowercase  ... bench:     152,079 ns/iter (+/- 84)
test char::methods::bench_ascii_to_uppercase  ... bench:     154,737 ns/iter (+/- 680)

@gilescope
Copy link
Contributor

Just linking this up with the internals chat on this subject: https://internals.rust-lang.org/t/to-upper-speed/13896/12


#[bench]
fn bench_ascii_to_uppercase(b: &mut Bencher) {
b.iter(|| (0..=255).cycle().take(10_000).map(|b| char::from(b).to_uppercase()).count())

This comment was marked as outdated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to say anything about benchmarks results on mixed and non-ascii text? I expect branch prediction would mean that on pure non-ascii, there is virtually no change(?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect so. What do you think of adding two more benchmarks for a total of three?

  • all ASCII characters
  • no ASCII characters
  • 50/50 mixed characters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be slightly careful here. The non-ascii benchmark is always one byte unicode is it not? It is non-ascii but above 192 it's going to be indicating another code point so there's going to be pleanty of malformed unicode in here.
It might be slightly better to create a random slice of u8 and then use String::from_utf8_lossy() to make it valid first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a benchmark I would want to make the "random chars" constant so runs can be compared.

What do you think of this? Playground Each test case can use the same list of chars and filter for what it's testing for (where mixed does no filtering).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random was the wrong choice of words here, sorry! It looks like a fair few other benchmarks / tests don't worry about putting in malformed utf sequences so I think it's fine to define the benchmarks this way for now.

@mcastorina
Copy link
Contributor Author

Additional Benchmarks

Before

test char::methods::bench_ascii_mix_to_lowercase                ... bench:     248,212 ns/iter (+/- 235)
test char::methods::bench_ascii_mix_to_uppercase                ... bench:     251,416 ns/iter (+/- 135)
test char::methods::bench_ascii_to_lowercase                    ... bench:     247,025 ns/iter (+/- 204)
test char::methods::bench_ascii_to_uppercase                    ... bench:     246,656 ns/iter (+/- 108)
test char::methods::bench_non_ascii_to_lowercase                ... bench:     248,828 ns/iter (+/- 121)
test char::methods::bench_non_ascii_to_uppercase                ... bench:     253,036 ns/iter (+/- 1,798)

After

test char::methods::bench_ascii_mix_to_lowercase                ... bench:     151,013 ns/iter (+/- 248)
test char::methods::bench_ascii_mix_to_uppercase                ... bench:     154,812 ns/iter (+/- 475)
test char::methods::bench_ascii_to_lowercase                    ... bench:      53,337 ns/iter (+/- 197)
test char::methods::bench_ascii_to_uppercase                    ... bench:      56,142 ns/iter (+/- 104)
test char::methods::bench_non_ascii_to_lowercase                ... bench:     247,925 ns/iter (+/- 295)
test char::methods::bench_non_ascii_to_uppercase                ... bench:     251,799 ns/iter (+/- 503)

@joshtriplett
Copy link
Member

That's quite definitive! It's a performance win for strings containing ASCII, even if they're not entirely ASCII, and it appears entirely neutral for strings containing non-ASCII.

There are likely other potential performance wins to be had here, but this is an obvious win with no apparent downside.

Let's see if it affects any of the existing benchmarks.

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 3, 2021
@bors
Copy link
Contributor

bors commented Feb 3, 2021

⌛ Trying commit c6dae69c489b509f675035e9a255112a66587266 with merge cd94273159f60d0494f0803385204b8aa90b0e55...

@bors
Copy link
Contributor

bors commented Feb 3, 2021

☀️ Try build successful - checks-actions
Build commit: cd94273159f60d0494f0803385204b8aa90b0e55 (cd94273159f60d0494f0803385204b8aa90b0e55)

@rust-timer
Copy link
Collaborator

Queued cd94273159f60d0494f0803385204b8aa90b0e55 with parent d95d4f0, future comparison URL.

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (cd94273159f60d0494f0803385204b8aa90b0e55): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 3, 2021
@joshtriplett
Copy link
Member

Looks reasonable to merge, to me.

@pickfire
Copy link
Contributor

pickfire commented Feb 7, 2021

Does this have an impact on String::to_uppercase?

@gilescope
Copy link
Contributor

gilescope commented Feb 7, 2021

Yes - a positive one. At the moment the string to_uppercase calls the char fn for each char and though I think there's room for optimistion there too, that's a tail for another PR.

@gilescope
Copy link
Contributor

gilescope commented Feb 7, 2021

FYI: Also in the same area: #81837
(compounding speedups)

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 16, 2021
@JohnTitor
Copy link
Member

r? @joshtriplett

@joshtriplett
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented Mar 17, 2021

📌 Commit 229fdf8 has been approved by joshtriplett

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 17, 2021
@bors
Copy link
Contributor

bors commented Mar 17, 2021

⌛ Testing commit 229fdf8 with merge 0ce0fed...

@bors
Copy link
Contributor

bors commented Mar 17, 2021

☀️ Test successful - checks-actions
Approved by: joshtriplett
Pushing 0ce0fed to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 17, 2021
@bors bors merged commit 0ce0fed into rust-lang:master Mar 17, 2021
@rustbot rustbot added this to the 1.52.0 milestone Mar 17, 2021
cuviper added a commit to cuviper/rust that referenced this pull request Oct 7, 2021
@cuviper cuviper mentioned this pull request Oct 7, 2021
@cuviper
Copy link
Member

cuviper commented Oct 7, 2021

Please note, unicode_data.rs is a generated file, not meant to be changed directly, as it says at the top. I have reapplied your changes in the tool in #89614 so we don't lose this optimization.

GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Oct 9, 2021
Update to Unicode 14.0

The Unicode Standard [announced Version 14.0](https://home.unicode.org/announcing-the-unicode-standard-version-14-0/) on September 14, 2021, and this pull request updates the generated tables in `core` accordingly.

This did require a little prep-work in `unicode-table-generator`. First, rust-lang#81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Oct 9, 2021
Update to Unicode 14.0

The Unicode Standard [announced Version 14.0](https://home.unicode.org/announcing-the-unicode-standard-version-14-0/) on September 14, 2021, and this pull request updates the generated tables in `core` accordingly.

This did require a little prep-work in `unicode-table-generator`. First, rust-lang#81358 had modified the generated file instead of the tool, so that change is now reflected in the tool as well. Next, I found that the "Alphabetic" property in version 14 was panicking when generating a bitset, "cannot pack 264 into 8 bits". We've been using the skiplist for that anyway, so I changed this to fail gracefully. Finally, I confirmed that the tool still created the exact same tables for 13 before moving to 14.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet