Eliminating bounds checks in the YUV->RGBA conversion #6

torokati44 · 2021-09-01T23:39:37Z

~~These changes together yield an overall ∼21% reduction of the time run_frame() takes, on average.~~ (outdated)
Measured on the self-hosted web build, running the first 100 frames of z0r loops 3664, 4449, and 7081, but discarding the first 20 frame times of each loop before averaging, to let the WASM JIT warm up and the numbers stabilize a little.

The results:

The first and last rows are about the current state, the middle rows are some "milestone" commits of this PR. The X axis is milliseconds. Every measurement was done four times.
The benchmarks were done in Chromium 94, on a Ryzen 2700X (using playwright and the temci tool).

~~The first two commits I'm fairly confident in being "good", the middle one is just "okay", and the last two "should be fine, I think".~~ (outdated)

More details about the changes in the added comments.

kmeisthax

I found a number of pathological cases that could trigger unsafety - they all have to do with mismatches between the chroma and luma input sizes. We don't have type-level assurances that the three arrays are compatibly sized, so we'll need some additional bounds checks.

I would also like to see some of the internal functions documented for their safety requirements; I've flagged those as well.

kmeisthax · 2021-09-02T21:44:38Z

yuv/src/bt601.rs

@@ -131,62 +180,60 @@ pub fn yuv420_to_rgba(
    let y_height = y.len() / y_width;
    let br_height = chroma_b.len() / br_width;


It is possible to hand this function a chroma_r that is too short for its given geometry. Since this is the entry point for the whole crate, it's unsound as-is. (For the record, this function's use of y and chroma_b appears to be sound, since we derive separate bounds for each based on its length.)

We should either have separate b_height and r_height calculations and variables, or return an error if chroma_r is too short. I'd prefer the former over the latter but I'm not sure what the performance impact is in this case.
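A minimal sketch of the first option, assuming both chroma planes are tightly packed byte slices sharing the same br_width. The helper name chroma_heights is hypothetical and not part of the PR:

```rust
/// Hypothetical helper (not from the PR): derives a height for each chroma
/// plane from that plane's own length, so neither plane is assumed to be as
/// long as the other.
fn chroma_heights(
    chroma_b: &[u8],
    chroma_r: &[u8],
    br_width: usize,
) -> Option<(usize, usize)> {
    if br_width == 0 {
        // A zero width would make the divisions below panic.
        return None;
    }
    let b_height = chroma_b.len() / br_width;
    let r_height = chroma_r.len() / br_width;
    Some((b_height, r_height))
}
```

Each plane would then be sampled against its own height, so a short chroma_r can only shrink the region read from chroma_r.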

kmeisthax · 2021-09-02T21:54:53Z

yuv/src/bt601.rs

@@ -40,31 +38,31 @@ fn sample_chroma_for_luma(
            (luma_y as i32 - 1) / 2


sample_chroma_for_luma needs to be marked as unsafe, as it has several invariants that must be fulfilled in order for code that uses it to remain sound:

The chroma_width and chroma_height parameters must be derived from the length of chroma in such a way that their product cannot exceed that length; otherwise clamping will fail to prohibit out-of-bounds reads.

If clamp is off, then luma_x and luma_y must already be bounds-checked to twice the chroma_width and chroma_height (respectively) in order to prohibit out-of-bounds reads.

These invariants should also be documented in the function's doc comment.
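As an illustration of what such a doc comment could look like, here is a simplified sketch. It is not the PR's sample_chroma_for_luma: the function name sample_chroma_nearest is made up, it uses plain nearest-neighbour sampling, and it adds a non-zero-dimension requirement that this simplified version needs.

```rust
/// Simplified, hypothetical stand-in for the PR's chroma sampler: returns the
/// chroma sample covering the luma pixel at (`luma_x`, `luma_y`).
///
/// # Safety
///
/// * `chroma_width` and `chroma_height` must both be non-zero, and
///   `chroma_width * chroma_height` must not exceed `chroma.len()`.
/// * If `clamp` is `false`, the caller must guarantee that
///   `luma_x < 2 * chroma_width` and `luma_y < 2 * chroma_height`.
unsafe fn sample_chroma_nearest(
    chroma: &[u8],
    chroma_width: usize,
    chroma_height: usize,
    luma_x: usize,
    luma_y: usize,
    clamp: bool,
) -> u8 {
    let mut chroma_x = luma_x / 2;
    let mut chroma_y = luma_y / 2;
    if clamp {
        // Pull out-of-range coordinates back onto the last column or row.
        chroma_x = chroma_x.min(chroma_width - 1);
        chroma_y = chroma_y.min(chroma_height - 1);
    }
    // SAFETY: the caller upholds the invariants listed under `# Safety`,
    // so the computed index is below `chroma.len()`.
    unsafe { *chroma.get_unchecked(chroma_x + chroma_y * chroma_width) }
}
```

Marking the function unsafe forces every call site into an unsafe block, which is where the justification for each invariant can be written down.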

kmeisthax · 2021-09-02T22:01:38Z

yuv/src/bt601.rs

-    rgba[base + 2] = clamp(b);
-    rgba[base + 3] = 255;
+#[inline]
+fn convert_and_write_pixel(yuv: (u8, u8, u8), rgba: &mut Vec<u8>, base: usize, luts: &LUTs) {


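convert_and_write_pixel should also be marked as unsafe, as it has an invariant that must be fulfilled for callers to remain sound:

base must be at most the length of rgba minus four (that is, base + 4 must not exceed the length), since four bytes are written starting at base.

This invariant should also be documented in the function's doc comment.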

Furthermore, rgba itself is not being reallocated, so it should be passed as &mut [u8] instead of &mut Vec<u8>.
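A rough sketch of the suggested signature and doc comment follows. The name write_pixel_unchecked is hypothetical and the LUT-based conversion step is left out; the bound is expressed as base + 4 <= rgba.len(), which is the condition under which all four writes are in bounds:

```rust
/// Hypothetical sketch: writes one opaque RGBA pixel starting at `rgba[base]`.
///
/// # Safety
///
/// `base + 4 <= rgba.len()` must hold, because the four bytes
/// `rgba[base]` through `rgba[base + 3]` are written without bounds checks.
unsafe fn write_pixel_unchecked(rgb: (u8, u8, u8), rgba: &mut [u8], base: usize) {
    // Checked only in debug builds; release builds rely on the caller.
    debug_assert!(base + 4 <= rgba.len());
    // SAFETY: the caller guarantees that all four indices are in bounds.
    unsafe {
        *rgba.get_unchecked_mut(base) = rgb.0;
        *rgba.get_unchecked_mut(base + 1) = rgb.1;
        *rgba.get_unchecked_mut(base + 2) = rgb.2;
        *rgba.get_unchecked_mut(base + 3) = 255;
    }
}
```

Taking &mut [u8] still accepts a &mut Vec<u8> at the call site through deref coercion, so callers do not need to change.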

kmeisthax · 2021-09-02T22:14:02Z

yuv/src/bt601.rs

            let b_sample =
-                sample_chroma_for_luma(chroma_b, br_width, br_height, x_pos, y_pos, false) as f32;
+                sample_chroma_for_luma(chroma_b, br_width, br_height, x_pos, y_pos, false);


Pathologically small chroma_b or chroma_r inputs to this function are unsound, as the unclamped sampling will read outside their bounds. x_pos and y_pos are derived from the width and height of luma, but there is nothing to ensure that chroma_b or chroma_r's geometry is compatible with luma's. As a result, this code can read out of bounds.

kmeisthax · 2021-09-02T22:14:12Z

yuv/src/bt601.rs

        }
+        base += 8; // skipping the rightmost pixel, and the leftmost pixel in the next row
    }

    // doing the sides with clamping
    for y_pos in 0..y_height {
        for x_pos in [0, y_width - 1].iter() {


Pathologically small inputs (specifically, 1px wide videos) will cause a panic here. This should instead be a saturating_sub, so that in this case, the conversion will merely harmlessly run twice.

As far as I can tell, the use of the unsafe functions below is still sound, since clamping is on. We may still want to test this somehow.

This may impact performance, depending on how LLVM's optimizer is feeling about loop-invariant code motion today.
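For reference, a small sketch of the saturating_sub variant. The helper name edge_columns is made up for illustration; in the PR the array is built inline in the for loop:

```rust
/// Hypothetical helper: the leftmost and rightmost column indices of a row
/// that is `width` pixels wide.
fn edge_columns(width: usize) -> [usize; 2] {
    // A plain `width - 1` underflows when `width` is 0, which panics when
    // overflow checks are enabled; `saturating_sub` stops at 0 instead.
    [0, width.saturating_sub(1)]
}
```

For degenerate widths both entries are 0, so the same column is simply processed twice.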

kmeisthax · 2021-09-02T22:14:32Z

yuv/src/bt601.rs

-    for x_pos in 0..y_width {
-        for y_pos in [0, y_height - 1].iter() {
-            let y_sample = y.get(x_pos + y_pos * y_width).copied().unwrap_or(0) as f32;
+    for y_pos in [0, y_height - 1].iter() {


This can also panic on pathologically small video heights.

torokati44 · 2021-09-03T23:25:36Z

Before addressing the comments, let me just add a note that the H.263 ITU Recommendation has this paragraph in it:

Custom picture formats can have a custom pixel aspect ratio as described in Table 3, if the custom
pixel aspect ratio use is first negotiated by external means. Custom picture formats can have any
number of lines and any number of pixels per line, provided that the number of lines is divisible by
four and is in the range [4, ... , 1152], and provided that the number of pixels per line is also
divisible by four and is in the range [4, ... , 2048]. For picture formats having a width or height that
is not divisible by 16, the picture is decoded in the same manner as if the width or height had the
next larger size that would be divisible by 16 and then the picture is cropped at the right and the
bottom to the proper width and height for display purposes only.

Adhering to the "decoded in the same manner as if the width or height had the next larger size that would be divisible by 16" part could simplify some loops in the IDCT and gather phases - but is not relevant for this PR.

The last part of that sentence is relevant, however: "and then the picture is cropped at the right and the bottom to the proper width and height for display purposes only".
It's not clear to me whether this cropping is to be done before or after the colorspace conversion.
If after, then many of the comments about both odd and <4 picture dimensions become moot in this code.

kmeisthax · 2021-09-06T02:55:18Z

My guess would be that the cropping is supposed to happen after the YUV 420 conversion, because that removes ambiguity about how to handle odd-sized luma pictures.

That being said, while the H.263 specification prohibits the pathological inputs I mentioned, the PR doesn't. It's entirely possible this code winds up getting reused for other video formats that are less restrictive with custom picture sizes. We don't need to correctly decode all invalid picture sizes, but we do need to reject them before any unsafe code runs. Doing this as early as possible in yuv420_to_rgba will have the least performance impact and still ensure this code is sound.
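One possible shape for such an early check, as a sketch only. The function name, the error type, and the exact policy (each chroma plane must cover half the luma dimensions, rounded up) are assumptions and not taken from the PR:

```rust
/// Hypothetical error type for rejected plane geometry.
#[derive(Debug, PartialEq)]
enum GeometryError {
    ZeroWidth,
    ChromaTooNarrow,
    ChromaTooShort,
}

/// Hypothetical up-front check: verifies that both chroma planes are large
/// enough for the luma plane before any unchecked indexing happens.
fn check_yuv420_geometry(
    y: &[u8],
    chroma_b: &[u8],
    chroma_r: &[u8],
    y_width: usize,
    br_width: usize,
) -> Result<(), GeometryError> {
    if y_width == 0 || br_width == 0 {
        return Err(GeometryError::ZeroWidth);
    }
    let y_height = y.len() / y_width;
    // Assumed policy: 4:2:0 chroma covers half the luma size, rounded up.
    let needed_width = y_width / 2 + y_width % 2;
    let needed_height = y_height / 2 + y_height % 2;
    if br_width < needed_width {
        return Err(GeometryError::ChromaTooNarrow);
    }
    // A product that overflows `usize` cannot fit in any slice.
    let needed_len = br_width
        .checked_mul(needed_height)
        .ok_or(GeometryError::ChromaTooShort)?;
    if chroma_b.len() < needed_len || chroma_r.len() < needed_len {
        return Err(GeometryError::ChromaTooShort);
    }
    Ok(())
}
```

Running a check like this once per frame keeps the cost outside the per-pixel loops.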

torokati44 · 2021-09-12T09:57:54Z

Superseded by #9.

kmeisthax requested changes Sep 2, 2021

torokati44 mentioned this pull request Sep 5, 2021

Optimize the YUV->RGBA conversion with lookup tables #7

Merged

torokati44 added 3 commits September 6, 2021 19:45

yuv: Compute RGBA pixel base index with running additions instead of multiplication

3433995

yuv: Eliminate bounds checking when writing RGBA pixels [unsafe]

5f9000a

yuv: Eliminate bounds checking when sampling chroma [unsafe]

8cfab6e

torokati44 force-pushed the yuv-luts branch from 80bae31 to 8cfab6e Compare September 6, 2021 17:45

torokati44 changed the title ~~Optimize the YUV->RGBA conversion with lookup tables, and eliminating bounds checks~~ Eliminating bounds checks in the YUV->RGBA conversion Sep 6, 2021

torokati44 marked this pull request as draft September 8, 2021 01:13

torokati44 mentioned this pull request Sep 12, 2021

Speed up YUV->RGBA conversion by utilizing iterators, and processing 2x2 group of pixels at once #9

Closed

torokati44 closed this Sep 12, 2021
