Optimize core::str::Lines::count #123606

Draft
wants to merge 3 commits into
base: master

Conversation

@thomcc (Member) commented Apr 7, 2024

`s.lines().count()+1` is somewhat common as a way to find the line number given a byte position, so it'd be nice if it were faster.

This just generalizes the SWAR-optimized char-counting code so that it can also be used for SWAR-optimized line counting, so it's actually not a very complex PR.
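
For illustration only, here is a minimal, self-contained sketch of the SWAR idea (counting newline bytes one usize word at a time). This is not the PR's actual code, which generalizes the existing char-counting machinery in `core::str`; the function names and driver loop here are made up:

```rust
// Minimal SWAR sketch (not this PR's actual code): count b'\n' bytes by
// examining one usize word at a time instead of one byte at a time.
const LO: usize = usize::MAX / 0xFF; // 0x0101...01
const HI: usize = LO << 7;           // 0x8080...80

fn count_newlines_in_word(w: usize) -> u32 {
    // XOR with 0x0A0A...0A so that bytes equal to b'\n' become 0x00.
    let x = w ^ (LO * b'\n' as usize);
    // Per-byte zero test with no cross-byte carries: adding 0x7F to the low
    // seven bits of a byte sets its high bit iff those bits were nonzero;
    // OR-ing in the byte's own high bit then marks every nonzero byte.
    let nonzero = ((x & !HI) + !HI) | x;
    // Invert and mask: each byte that was b'\n' contributes exactly one bit.
    (!nonzero & HI).count_ones()
}

fn count_newlines(s: &str) -> usize {
    const N: usize = core::mem::size_of::<usize>();
    let bytes = s.as_bytes();
    let mut chunks = bytes.chunks_exact(N);
    let mut count = 0usize;
    for chunk in &mut chunks {
        let w = usize::from_ne_bytes(chunk.try_into().unwrap());
        count += count_newlines_in_word(w) as usize;
    }
    // Handle the tail that doesn't fill a whole word with a plain byte loop.
    count + chunks.remainder().iter().filter(|&&b| b == b'\n').count()
}
```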

TODO

  • [x] benchmarks
  • [x] adjust comments
  • [ ] more tests

Benchmarks

`case00_libcore` is the new version, and `case01_fold_increment` is the previous implementation (the default impl of `Iterator::count()` is a fold that increments a counter).
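
For reference, `case01_fold_increment` amounts to roughly the following sketch (the function name here is made up, not part of the benchmark harness):

```rust
// Roughly what the default `Iterator::count` does: fold an increment over
// every line the iterator yields, so the cost grows with the number of lines.
fn fold_increment_count(s: &str) -> usize {
    s.lines().fold(0, |n, _| n + 1)
}
```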

    str::line_count::all_newlines_32kib::case00_libcore           4.35µs/iter  +/- 11.00ns
    str::line_count::all_newlines_32kib::case01_fold_increment  779.99µs/iter   +/- 8.43µs
    str::line_count::all_newlines_4kib::case00_libcore          562.00ns/iter   +/- 5.00ns
    str::line_count::all_newlines_4kib::case01_fold_increment    97.81µs/iter   +/- 1.48µs
    str::line_count::all_newlines_64b::case00_libcore            21.00ns/iter   +/- 0.00ns
    str::line_count::all_newlines_64b::case01_fold_increment      1.49µs/iter  +/- 32.00ns

    str::line_count::en_huge::case00_libcore                     45.58µs/iter +/- 122.00ns
    str::line_count::en_huge::case01_fold_increment             167.62µs/iter +/- 609.00ns
    str::line_count::en_large::case00_libcore                   734.00ns/iter   +/- 6.00ns
    str::line_count::en_large::case01_fold_increment              2.62µs/iter   +/- 9.00ns
    str::line_count::en_medium::case00_libcore                  100.00ns/iter   +/- 0.00ns
    str::line_count::en_medium::case01_fold_increment           347.00ns/iter   +/- 0.00ns
    str::line_count::en_small::case00_libcore                    18.00ns/iter   +/- 1.00ns
    str::line_count::en_small::case01_fold_increment             60.00ns/iter   +/- 2.00ns
    str::line_count::en_tiny::case00_libcore                      6.00ns/iter   +/- 0.00ns
    str::line_count::en_tiny::case01_fold_increment              60.00ns/iter   +/- 0.00ns

    str::line_count::zh_huge::case00_libcore                     40.63µs/iter  +/- 85.00ns
    str::line_count::zh_huge::case01_fold_increment             205.10µs/iter   +/- 1.62µs
    str::line_count::zh_large::case00_libcore                   655.00ns/iter   +/- 1.00ns
    str::line_count::zh_large::case01_fold_increment              3.21µs/iter  +/- 21.00ns
    str::line_count::zh_medium::case00_libcore                   92.00ns/iter   +/- 0.00ns
    str::line_count::zh_medium::case01_fold_increment           420.00ns/iter   +/- 2.00ns
    str::line_count::zh_small::case00_libcore                    20.00ns/iter   +/- 1.00ns
    str::line_count::zh_small::case01_fold_increment             63.00ns/iter   +/- 1.00ns
    str::line_count::zh_tiny::case00_libcore                      6.00ns/iter   +/- 0.00ns
    str::line_count::zh_tiny::case01_fold_increment              21.00ns/iter   +/- 0.00ns

This is a speedup of around 2x-4x most of the time, but in some highly unrealistic scenarios (32KiB of nothing but newlines) it's up to almost 200x faster, because the time taken by the version in this PR does not depend on the number of newlines in the input, whereas the old version gets slower the more newlines are present. It's also much faster for small inputs, especially ones that contain newlines (10x faster for en_tiny).

Real-world cases will vary, so don't read too much into these numbers; I would expect a 2x-4x speedup in general, since that's what the most realistic examples show.

Obviously a SIMD impl will beat this, but users who are really bottlenecked on this operation should probably just reach for crates.io (even if we provided a SIMD version, libcore can't use runtime CPU feature detection, so they'd still be better off with something from crates.io).

@rustbot (Collaborator) commented Apr 7, 2024

r? @Amanieu

rustbot has assigned @Amanieu.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) and T-libs (Relevant to the library team, which will review and decide on the PR/issue.) labels on Apr 7, 2024
@thomcc (Member, Author) commented Apr 7, 2024

Clearing the assignee since this is still a draft.

@thomcc added the S-waiting-on-author (Status: This is awaiting some action (such as code changes or more information) from the author.) label and removed the S-waiting-on-review label on Apr 7, 2024
@rust-log-analyzer: This comment has been minimized.

@thomcc (Member, Author) commented Apr 9, 2024

I've updated the PR description to include benchmarks. This PR still needs more tests, though; I almost missed that I got the logic wrong for going from newline count to line count.
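
For context, the newline-count-to-line-count relationship that needed care looks roughly like this (an illustrative sketch with a made-up helper name, not necessarily the PR's exact code):

```rust
// `str::lines` semantics: an empty string has zero lines, a trailing '\n'
// (including "\r\n") does not start a new line, and otherwise the final
// unterminated line adds one.
fn line_count_from_newlines(s: &str, newline_count: usize) -> usize {
    if s.is_empty() {
        0
    } else if s.ends_with('\n') {
        newline_count
    } else {
        newline_count + 1
    }
}
```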

@thomcc (Member, Author) commented Apr 17, 2024

I don't have time/energy to poke at this right now (maybe over the weekend), but it seems it's used in the compiler, so curiosity is getting the better of me.

@bors try @rust-timer queue

@rust-timer: This comment has been minimized.

@rustbot added the S-waiting-on-perf (Status: Waiting on a perf run to be completed.) label on Apr 17, 2024
@bors (Contributor) commented Apr 17, 2024

⌛ Trying commit ef27373 with merge eb24f76...

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 17, 2024
Optimize core::str::Lines::count

@bors (Contributor) commented Apr 17, 2024

☀️ Try build successful - checks-actions
Build commit: eb24f76 (eb24f76fec39eaf9264ad15cf44aed176968e12e)

1 similar comment

@rust-timer: This comment has been minimized.

@rust-timer (Collaborator) commented

Finished benchmarking commit (eb24f76): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|                           | mean  | range          | count |
|---------------------------|-------|----------------|-------|
| Regressions ❌ (primary)   | 0.4%  | [0.3%, 0.4%]   | 5     |
| Regressions ❌ (secondary) | -     | -              | 0     |
| Improvements ✅ (primary)  | -0.2% | [-0.2%, -0.2%] | 1     |
| Improvements ✅ (secondary)| -0.4% | [-0.7%, -0.2%] | 2     |
| All ❌✅ (primary)          | 0.3%  | [-0.2%, 0.4%]  | 6     |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                           | mean  | range          | count |
|---------------------------|-------|----------------|-------|
| Regressions ❌ (primary)   | 7.8%  | [0.2%, 15.3%]  | 2     |
| Regressions ❌ (secondary) | -     | -              | 0     |
| Improvements ✅ (primary)  | -1.9% | [-2.3%, -1.3%] | 3     |
| Improvements ✅ (secondary)| -     | -              | 0     |
| All ❌✅ (primary)          | 2.0%  | [-2.3%, 15.3%] | 5     |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                           | mean  | range          | count |
|---------------------------|-------|----------------|-------|
| Regressions ❌ (primary)   | -     | -              | 0     |
| Regressions ❌ (secondary) | -     | -              | 0     |
| Improvements ✅ (primary)  | -     | -              | 0     |
| Improvements ✅ (secondary)| -3.5% | [-5.4%, -2.3%] | 17    |
| All ❌✅ (primary)          | -     | -              | 0     |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                           | mean  | range          | count |
|---------------------------|-------|----------------|-------|
| Regressions ❌ (primary)   | 0.2%  | [0.1%, 0.6%]   | 10    |
| Regressions ❌ (secondary) | 0.8%  | [0.2%, 1.4%]   | 2     |
| Improvements ✅ (primary)  | -0.0% | [-0.2%, -0.0%] | 18    |
| Improvements ✅ (secondary)| -0.0% | [-0.0%, -0.0%] | 18    |
| All ❌✅ (primary)          | 0.0%  | [-0.2%, 0.6%]  | 28    |

Bootstrap: 676.389s -> 679.092s (0.40%)
Artifact size: 316.13 MiB -> 316.09 MiB (-0.01%)

@rustbot added the perf-regression (Performance regression.) label and removed the S-waiting-on-perf label on Apr 17, 2024
Labels: perf-regression, S-waiting-on-author, T-libs