
Speed up YUV->RGBA conversion by utilizing iterators, and processing 2x2 group of pixels at once #9

Closed
wants to merge 4 commits

Conversation


@torokati44 torokati44 commented Sep 12, 2021

Instead of iterating over plain indices and computing array offsets manually (which forces continuous bounds checks everywhere), a parallel iteration scheme is set up over two consecutive lines of all three input planes and the output array at once. Each of these iterators yields two neighboring samples/pixels to work with (in a nice, symmetric, square arrangement), so every iteration has 4 of each operand.
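A minimal sketch (not the PR's actual code) of the "two rows at a time" idea: each step yields a 2x2 square of luma samples, the group that shares a single chroma sample in 4:2:0 subsampling. All names here are hypothetical.

```rust
fn for_each_2x2_block(y_plane: &[u8], width: usize, mut f: impl FnMut([u8; 4])) {
    // Pair up consecutive rows, then consecutive samples within each row;
    // with iterators, the compiler can hoist or elide the bounds checks
    // that manual index arithmetic would force on every access.
    for two_rows in y_plane.chunks_exact(2 * width) {
        let (top, bottom) = two_rows.split_at(width);
        for (t, b) in top.chunks_exact(2).zip(bottom.chunks_exact(2)) {
            // One symmetric 2x2 square of samples per iteration.
            f([t[0], t[1], b[0], b[1]]);
        }
    }
}
```

With `chunks_exact`, the odd trailing row/column (if any) is simply not yielded, which matches the bulk/edge/corner split described below: the leftover edges get their own, separately indexed pass.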

There is no clipping of coordinates anywhere, as it is not needed: the pixels are split into 3 different "classes" (bulk, edge, and corner; with bilinear, linear, and no interpolation, respectively), and indexing for each is handled appropriately. There is also not a single `unsafe` in sight, yay!

The two-stage bilinear interpolation (with wider intermediates) results in only one division per pixel, instead of two, which is nice.
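A hedged sketch of the "two-stage bilinear with wider intermediates" idea: both horizontal blends and the vertical blend accumulate in `u16`, and the single division (by the combined weight) happens only once at the end. The 3:1 weights here are illustrative for 4:2:0 chroma siting, not taken from the PR.

```rust
fn bilinear_3_1(a: u8, b: u8, c: u8, d: u8) -> u8 {
    // Stage 1: horizontal blends, kept as 16-bit sums with no division yet.
    let top = 3 * a as u16 + b as u16; // weights 3:1
    let bottom = 3 * c as u16 + d as u16; // weights 3:1
    // Stage 2: vertical blend plus rounding, then ONE division covering
    // both stages (total weight 16), instead of dividing per stage.
    (((3 * top + bottom) + 8) / 16) as u8
}
```

The widest intermediate here is 3 * 1020 + 1020 + 8 = 4088, comfortably within `u16`, so precision is never thrown away before the final division.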

Please don't look at the commit history just yet, I'll squash it neatly once everything works well.
Benchmark numbers and fancy explainer diagrams will follow. I expect around an overall 15% speedup yet again, as measured on three hand-picked z0r loops.

EDIT: Commit history cleaned up, benchmarks and explainers added.

Note that the indexing in process_edge_col is simply WRONG right now, so the left and right (one or two) columns of pixels are incorrect in the output.
EDIT: This is fixed.

The processing of 4 pixels at a time in an identical manner (just in different "directions") also lends itself nicely to SIMD in the future (perhaps even to autovectorization right now, though I haven't checked). That said, 16-bit operands on 4 parallel lanes is still only 64 bits total, half of the 128 bits available in WASM SIMD.

This supersedes #6.

@torokati44
Member Author

I have fixed the indexing, removed all the TODOs and added a bunch more comments.
I think it works fine now in all cases.
Here is a bunch of drawings I put together to better explain the intent behind all the iteration: https://imgur.com/a/yF2uIkz
I have squashed the messy commit history into a single rewrite commit. The commit message could use some work, and I plan to add some tests too.
Now, on to benchmarks.

@torokati44
Member Author

torokati44 commented Sep 13, 2021

And the benchmarks say:

An almost 16% overall improvement!

As measured by the time it takes for run_frame() to finish, in milliseconds, on Chromium 94 and a Ryzen 2700X.
Performed on 10 hand-picked z0r loops (ones with relatively higher-resolution H.263 videos), identified by their number. The "mean" result is the cross-loop average frame time.

@torokati44
Member Author

While adding tests, I also discovered some off-by-one errors in the results. These were caused by rounding more than necessary.

So I switched to a more precise formula for precomputing the linear functions, with coefficients taken exactly from the BT.601 standard. (I'm much happier with these than with the 5 float literals copied from a random piece of code somewhere.)

Now that all of this goes into a lookup table anyway, the cost of all of these additional computations shouldn't matter that much. (They are only done once, and only on the 256 values of a u8.)
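An illustrative sketch of what deriving the coefficients exactly from the BT.601 constants (Kr = 0.299, Kb = 0.114) and baking them into per-`u8` lookup tables could look like. The table layout and names are hypothetical, and full-range chroma centered on 128 is assumed, so this is not the crate's actual code.

```rust
const KR: f64 = 0.299;
const KB: f64 = 0.114;

struct Bt601Luts {
    cr_to_r: [i32; 256], // R = Y + cr_to_r[cr]
    cb_to_b: [i32; 256], // B = Y + cb_to_b[cb]
    cr_to_g: [i32; 256], // G = Y - cr_to_g[cr] - cb_to_g[cb]
    cb_to_g: [i32; 256],
}

fn make_luts() -> Bt601Luts {
    let kg = 1.0 - KR - KB; // 0.587, derived rather than hard-coded
    let mut luts = Bt601Luts {
        cr_to_r: [0; 256],
        cb_to_b: [0; 256],
        cr_to_g: [0; 256],
        cb_to_g: [0; 256],
    };
    // Only done once, over the 256 possible values of a u8, so the extra
    // floating-point math at table-build time is essentially free.
    for v in 0..256 {
        let c = v as f64 - 128.0; // chroma is centered on 128
        luts.cr_to_r[v] = ((2.0 - 2.0 * KR) * c).round() as i32;
        luts.cb_to_b[v] = ((2.0 - 2.0 * KB) * c).round() as i32;
        luts.cr_to_g[v] = ((2.0 - 2.0 * KR) * KR / kg * c).round() as i32;
        luts.cb_to_g[v] = ((2.0 - 2.0 * KB) * KB / kg * c).round() as i32;
    }
    luts
}
```

The familiar literals fall out of the exact constants: (2 - 2·Kr) = 1.402 for Cr→R and (2 - 2·Kb) = 1.772 for Cb→B, so nothing needs to be copied from elsewhere.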

@torokati44 torokati44 marked this pull request as ready for review September 13, 2021 23:39
@torokati44
Member Author

Okay, with even more tests added, I think this is ready for review now!

@torokati44
Member Author

Another point about the performance impact of this PR, this time tested with a VP6 video, with the same methodology as above:

  • The top row is with the colorspace conversion done by libswscale.
  • The middle one is with the current master branch of this crate.
  • The bottom one is with this PR applied to this crate.

I think the difference is fairly significant: just +32% overall runtime compared to libswscale, instead of +122%.

@torokati44
Member Author

More work needed, see: ruffle-rs/ruffle#3004 (comment)

@torokati44
Member Author

torokati44 commented Sep 25, 2021

Replaced by #13, in accordance with the discoveries discussed therein.

@torokati44 torokati44 closed this Sep 25, 2021