
Speed up YUV->RGBA conversion by utilizing iterators, and processing 2x2 group of pixels at once #9

Closed
wants to merge 4 commits

Conversation


@torokati44 torokati44 commented Sep 12, 2021

Instead of iterating over plain indices and computing array offsets manually (which forces continuous bounds checks everywhere), a parallel iteration scheme is set up over two consecutive lines of all three input planes and the output array at once. Each of these iterators yields two neighboring samples/pixels to work with (in a nice, symmetric, square arrangement), so every iteration has 4 of each operand.
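A minimal sketch (not the PR's actual code) of the "two rows at a time" idea: each step yields a 2x2 square of luma samples, the group that shares a single chroma sample in 4:2:0 subsampling. All names here are hypothetical.

```rust
fn for_each_2x2_block(y_plane: &[u8], width: usize, mut f: impl FnMut([u8; 4])) {
    // Pair up consecutive rows, then consecutive samples within each row;
    // with iterators, the compiler can hoist or elide the bounds checks
    // that manual index arithmetic would force on every access.
    for two_rows in y_plane.chunks_exact(2 * width) {
        let (top, bottom) = two_rows.split_at(width);
        for (t, b) in top.chunks_exact(2).zip(bottom.chunks_exact(2)) {
            // One symmetric 2x2 square of samples per iteration.
            f([t[0], t[1], b[0], b[1]]);
        }
    }
}
```

With `chunks_exact`, the odd trailing row/column (if any) is simply not yielded, which matches the bulk/edge/corner split described below: the leftover edges get their own, separately indexed pass.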

There is no clipping of coordinates anywhere, as it is not needed: the pixels are split into 3 different "classes" (bulk, edge, and corner; with bilinear, linear, and no interpolation, respectively), and indexing for each is handled appropriately. There is also not a single `unsafe` in sight, yay!

The two-stage bilinear interpolation (with wider intermediates) results in only one division per pixel, instead of two, which is nice.
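A hedged sketch of the "two-stage bilinear with wider intermediates" idea: both horizontal blends and the vertical blend accumulate in `u16`, and the single division (by the combined weight) happens only once at the end. The 3:1 weights here are illustrative for 4:2:0 chroma siting, not taken from the PR.

```rust
fn bilinear_3_1(a: u8, b: u8, c: u8, d: u8) -> u8 {
    // Stage 1: horizontal blends, kept as 16-bit sums with no division yet.
    let top = 3 * a as u16 + b as u16; // weights 3:1
    let bottom = 3 * c as u16 + d as u16; // weights 3:1
    // Stage 2: vertical blend plus rounding, then ONE division covering
    // both stages (total weight 16), instead of dividing per stage.
    (((3 * top + bottom) + 8) / 16) as u8
}
```

The widest intermediate here is 3 * 1020 + 1020 + 8 = 4088, comfortably within `u16`, so precision is never thrown away before the final division.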

Please don't look at the commit history just yet, I'll squash it neatly once everything works well.
Benchmark numbers and fancy explainer diagrams will follow. I expect around an overall 15% speedup yet again, as measured on three hand-picked z0r loops.

EDIT: Commit history cleaned up, benchmarks and explainers added.

Note that the indexing in process_edge_col is simply WRONG right now, so the left and right (one or two) columns of pixels are incorrect in the output.
EDIT: This is fixed.

The processing of 4 pixels at a time in an identical manner (just in different "directions") also lends itself nicely to SIMD in the future (perhaps even to autovectorization right now, though I haven't checked). That said, 16-bit operands on 4 parallel lanes is still only 64 bits total, half of the 128 bits available in WASM SIMD.

This supersedes #6.

@torokati44
Member Author

I have fixed the indexing, removed all the TODOs and added a bunch more comments.
I think it works fine now in all cases.
Here is a bunch of drawings I put together to better explain the intent behind all the iteration: https://imgur.com/a/yF2uIkz
I have squashed the messy commit history into a single rewrite commit. The commit message could use some work, and I plan to add some tests too.
Now, on to benchmarks.

@torokati44
Member Author

torokati44 commented Sep 13, 2021

And the benchmarks say:

An almost 16% overall improvement!

As measured by the time it takes for run_frame() to finish, in milliseconds, on Chromium 94 and a Ryzen 2700X.
Performed on 10 hand-picked z0r loops (ones with relatively higher-resolution H.263 videos), identified by their number. The "mean" result is the cross-loop average frame time.

@torokati44
Member Author

While adding tests, I also discovered some off-by-one errors in the results. These were caused by rounding more than necessary.

So I switched to a more precise formula for precomputing the linear functions, with coefficients taken exactly from the BT.601 standard. (I'm much happier with these than with the 5 float literals copied from a random piece of code somewhere.)

Now that all of this goes into a lookup table anyway, the cost of all of these additional computations shouldn't matter that much. (They are only done once, and only on the 256 values of a u8.)
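An illustrative sketch of what deriving the coefficients exactly from the BT.601 constants (Kr = 0.299, Kb = 0.114) and baking them into per-`u8` lookup tables could look like. The table layout and names are hypothetical, and full-range chroma centered on 128 is assumed, so this is not the crate's actual code.

```rust
const KR: f64 = 0.299;
const KB: f64 = 0.114;

struct Bt601Luts {
    cr_to_r: [i32; 256], // R = Y + cr_to_r[cr]
    cb_to_b: [i32; 256], // B = Y + cb_to_b[cb]
    cr_to_g: [i32; 256], // G = Y - cr_to_g[cr] - cb_to_g[cb]
    cb_to_g: [i32; 256],
}

fn make_luts() -> Bt601Luts {
    let kg = 1.0 - KR - KB; // 0.587, derived rather than hard-coded
    let mut luts = Bt601Luts {
        cr_to_r: [0; 256],
        cb_to_b: [0; 256],
        cr_to_g: [0; 256],
        cb_to_g: [0; 256],
    };
    // Only done once, over the 256 possible values of a u8, so the extra
    // floating-point math at table-build time is essentially free.
    for v in 0..256 {
        let c = v as f64 - 128.0; // chroma is centered on 128
        luts.cr_to_r[v] = ((2.0 - 2.0 * KR) * c).round() as i32;
        luts.cb_to_b[v] = ((2.0 - 2.0 * KB) * c).round() as i32;
        luts.cr_to_g[v] = ((2.0 - 2.0 * KR) * KR / kg * c).round() as i32;
        luts.cb_to_g[v] = ((2.0 - 2.0 * KB) * KB / kg * c).round() as i32;
    }
    luts
}
```

The familiar literals fall out of the exact constants: (2 - 2·Kr) = 1.402 for Cr→R and (2 - 2·Kb) = 1.772 for Cb→B, so nothing needs to be copied from elsewhere.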

@torokati44 torokati44 marked this pull request as ready for review September 13, 2021 23:39
@torokati44
Member Author

Okay, with even more tests added, I think this is ready for review now!

@torokati44
Member Author

Another point about the performance impact of this PR, this time tested with a VP6 video, with the same methodology as above:

  • The top row is with the colorspace conversion done by libswscale.
  • The middle one is with the current master branch of this crate.
  • The bottom one is with this PR applied to this crate.

I think the difference is fairly significant: just +32% overall runtime compared to libswscale, instead of +122%.

@torokati44
Member Author

More work needed, see: ruffle-rs/ruffle#3004 (comment)

@torokati44
Member Author

torokati44 commented Sep 25, 2021

Replaced by #13, in accordance with the discoveries discussed therein.

@torokati44 torokati44 closed this Sep 25, 2021