yuv: Drop the LUTs, do integer-only (fixed-point), SIMD-capable arithmetic instead #15
Somewhat surprisingly, this alone already speeds up the conversion function by about 10% on the web target.
But the more important thing is that it's SIMD-capable. And that target feature (SSE2) is enabled by default on desktop targets, AFAIK.
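For reference, here is a minimal scalar sketch of the LUT-free, integer-only idea. The coefficients below are BT.601 full-range values scaled by 2^8 and are illustrative assumptions; the actual PR may use different constants and intermediate precision:

```rust
/// Convert one YUV sample to RGB using only integer (fixed-point) math.
/// Coefficients are round(c * 256) for the BT.601 full-range matrix.
fn yuv_to_rgb(y: u8, u: u8, v: u8) -> (u8, u8, u8) {
    let y = y as i32;
    let u = u as i32 - 128; // center chroma around zero
    let v = v as i32 - 128;

    // Fixed-point multiply-accumulate, then shift back down with rounding.
    let r = (y * 256 + 359 * v + 128) >> 8;
    let g = (y * 256 - 88 * u - 183 * v + 128) >> 8;
    let b = (y * 256 + 454 * u + 128) >> 8;

    (clamp_u8(r), clamp_u8(g), clamp_u8(b))
}

/// Saturate an intermediate result into the 0..=255 range.
fn clamp_u8(x: i32) -> u8 {
    x.clamp(0, 255) as u8
}
```

Because every operation is a plain integer multiply, add, and shift applied identically to each pixel, the same code vectorizes naturally: the `i32` scalars become `i32x4` lanes and four pixels are converted at once.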
All the tests still pass (in fact I had to add a couple more), and the sample videos I tested still look completely fine to me.
Once ruffle-rs/ruffle#5834 is merged, it should speed it up (on web, in capable browsers) by an additional factor of 2x. Why only 2x, and not 4x, you ask? Well, I don't know. Maybe we should ask our friend, Amdahl. I'm not complaining though, it's still a nice uplift.
I've also experimented with 16-bit intermediate precision, but it's just barely not accurate enough for my taste, and it isn't any faster either. And using `i32x4` also allows the neat transpose trick at the end. Nor is processing a 2x2 group of pixels together any faster: even though the chroma samples can simply be splatted across all lanes, the additional shuffling in memory and the more complicated iteration most probably negate that.
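To illustrate what I mean by the transpose trick (modeled here with plain `[i32; 4]` arrays standing in for `i32x4` vectors, so this is a sketch, not the actual kernel): after the conversion, the four lanes of each vector hold the R, G, B and A values of four *different* pixels, and a 4x4 transpose regroups them into one RGBA quadruple per pixel, ready to be stored interleaved.

```rust
/// Transpose a 4x4 matrix of lanes: rows of per-channel values
/// (R, G, B, A across four pixels) become rows of per-pixel values
/// (R, G, B, A of one pixel each).
fn transpose4(m: [[i32; 4]; 4]) -> [[i32; 4]; 4] {
    let mut t = [[0i32; 4]; 4];
    for (i, row) in m.iter().enumerate() {
        for (j, &val) in row.iter().enumerate() {
            t[j][i] = val;
        }
    }
    t
}
```

With real `i32x4` vectors this is a handful of shuffle instructions rather than loops, but the data movement is the same.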
Additionally, if we were really serious about performance, we could also use `bytemuck::pod_align_to`, but since the H.263 decoder chops off the luma samples to odd widths in some cases, lining up the chroma samples with it would be kinda awkward. And the WASM VM might not even JIT these loads into the faster aligned versions of the native instructions anyway...
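For anyone unfamiliar with the aligned-middle idea: it splits a byte slice into an unaligned prefix, an aligned run of wider values, and an unaligned suffix, so the bulk of the work can use aligned loads. The sketch below uses std's `slice::align_to` (unsafe, but sound when reinterpreting bytes as plain integers); `bytemuck::pod_align_to` is the safe equivalent for `Pod` types. The summing here is just a stand-in workload, not anything from the PR:

```rust
/// Sum all bytes of a slice, reading the aligned middle as u32 words.
fn sum_bytes_aligned(bytes: &[u8]) -> u32 {
    // Safety: any bit pattern is a valid u32, so reinterpreting
    // aligned bytes as u32 words is sound.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<u32>() };
    let mut sum: u32 = prefix.iter().map(|&b| b as u32).sum();
    for &word in middle {
        // Process four bytes per aligned load.
        sum += word.to_ne_bytes().iter().map(|&b| b as u32).sum::<u32>();
    }
    sum + suffix.iter().map(|&b| b as u32).sum::<u32>()
}
```

The awkwardness mentioned above is that the prefix/middle/suffix split points of the luma and chroma planes won't generally coincide when the plane widths are odd, so the two iterations can't share one loop structure cleanly.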