Implement wc fast paths that skip Unicode decoding. by resistor · Pull Request #3740 · uutils/coreutils

resistor · 2022-07-23T04:52:57Z

Byte, character, and line counting can all be done on the raw bytes
of the incoming stream without decoding the Unicode characters. This
fact was previously exploited in specific fast paths for counting
characters and counting lines. This change unifies those fast paths into
a single shared fast path, using const generics to specialize the
function for each use case. This has the benefit of making sure that all
combinations of these Unicode-oblivious fast paths benefit from the same
optimization.

On my laptop, this speeds up `wc -clm odyssey1024.txt` from 840ms to
120ms. I experimented with using a filter loop for line counting, but
continuing to use the bytecount crate came out ahead by a significant
margin.
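
For readers following along, the idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Counts` struct and the `count_fast` function are hypothetical names, the sketch works on an in-memory slice rather than a streamed reader, and it uses plain iterator filters instead of the bytecount crate so that it only depends on the standard library. It also assumes the input is valid UTF-8 when counting characters.

```rust
/// Hypothetical result type: counts gathered from raw bytes only.
#[derive(Debug, Default, PartialEq, Eq)]
struct Counts {
    bytes: usize,
    chars: usize,
    lines: usize,
}

/// Hypothetical unified fast path. Each `const bool` parameter selects one
/// kind of count; the compiler emits a separate specialization for every
/// combination, so disabled counts cost nothing at run time.
fn count_fast<const BYTES: bool, const CHARS: bool, const LINES: bool>(
    data: &[u8],
) -> Counts {
    let mut counts = Counts::default();
    if BYTES {
        // The byte count is just the length of the buffer.
        counts.bytes = data.len();
    }
    if CHARS {
        // In valid UTF-8, every character has exactly one byte that is not
        // a continuation byte (continuation bytes look like 0b10xx_xxxx),
        // so counting those bytes counts characters without decoding.
        counts.chars = data.iter().filter(|&&b| (b & 0xC0) != 0x80).count();
    }
    if LINES {
        // Lines are counted as newline bytes; a newline byte never occurs
        // inside a multi-byte UTF-8 sequence.
        counts.lines = data.iter().filter(|&&b| b == b'\n').count();
    }
    counts
}
```

With this sketch, `count_fast::<true, true, true>("héllo\n".as_bytes())` reports 7 bytes, 6 characters, and 1 line, while `count_fast::<false, false, true>` on the same input fills in only the line count. A caller would pick the instantiation from the command-line flags once, outside the hot loop.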

anastygnome · 2022-07-23T08:46:26Z

@resistor good job, but needs a cargo fmt run ;)

resistor · 2022-07-23T17:45:49Z

> @resistor good job, but needs a cargo fmt run ;)

Done

sylvestre · 2022-07-24T12:59:27Z

the coverage indicates that we are missing some tests, could you please add them ?
thanks

resistor · 2022-07-25T04:56:38Z

> the coverage indicates that we are missing some tests, could you please add them ?
> thanks

Done

anastygnome · 2022-07-25T09:51:39Z

@sylvestre this is now good to merge (tail-related failures) ;)
Congrats @resistor !

sylvestre · 2022-07-25T19:15:09Z

indeed, thanks :)

resistor force-pushed the main branch from 7ebe57f to d5f59f2 on July 23, 2022 17:45

Add missing testcases for wc.

0e7f1c4

sylvestre merged commit 2fa4d6a into uutils:main Jul 25, 2022

sylvestre merged 2 commits into uutils:main from resistor:main