Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminating bounds checks in the YUV->RGBA conversion #6

Closed
wants to merge 3 commits into from

Conversation

torokati44
Copy link
Member

@torokati44 torokati44 commented Sep 1, 2021

These changes together yield an overall ∼21% reduction of the time run_frame() takes, on average. (outdated)
Measured on the self-hosted web build, running the first 100 frames of z0r loops 3664, 4449, and 7081; but discarding the first 20 frame times of each loop before averaging - to let the WASM JIT warm up, and the numbers stabilize a little.

The results:
image
The first and last rows are about the current state, the middle rows are some "milestone" commits of this PR. The X axis is milliseconds. Every measurement was done four times.
The benchmarks were done in Chromium 94, on a Ryzen 2700X (using playwright and the temci tool).

The first two commits I'm fairly confident in being "good", the middle one is just "okay", and the last two "should be fine, I think". (outdated)

More details about the changes in the added comments.

Copy link
Member

@kmeisthax kmeisthax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a number of pathological cases that could trigger unsafety - they all have to do with mismatches between the chroma and luma input sizes. We don't have type-level assurances that the three arrays are compatibly sized, so we'll need some additional bounds checks.

I also would like to see some of the internal functions documented for their safety requirements, I've flagged those as well.

yuv/src/bt601.rs Outdated
@@ -131,62 +180,60 @@ pub fn yuv420_to_rgba(
let y_height = y.len() / y_width;
let br_height = chroma_b.len() / br_width;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to hand this function a chroma_r that is too short for it's given geometry. Since this is the entry point for the whole crate, it's unsound as-is. (For the record, this function's use of y and chroma_b appear to be sound; since we derive different bounds for it based on it's length.)

We should either have separate b_height and r_height calculations and variables, or return an error if chroma_r is too short. I'd prefer the former over the latter but I'm not sure what the performance impact is in this case.

@@ -40,31 +38,31 @@ fn sample_chroma_for_luma(
(luma_y as i32 - 1) / 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample_chroma_for_luma needs to be marked as unsafe, as it has several invariants that must be fulfilled in order for code that uses it to remain sound:

  • The chroma_width and chroma_height parameters must be derived from the length of chroma in such a way that their product cannot exceed that length, otherwise clamping will fail to prohibit out-of-bounds reads
  • If clamp is off, then luma_x and luma_y must already be bounds-checked to twice the chroma_width and chroma_height (respectively) in order to prohibit out-of-bounds reads

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These invariants should also be documented in the function's doccomment

rgba[base + 2] = clamp(b);
rgba[base + 3] = 255;
#[inline]
fn convert_and_write_pixel(yuv: (u8, u8, u8), rgba: &mut Vec<u8>, base: usize, luts: &LUTs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_and_write_pixel should also be marked as unsafe, as it has an invariant must be fulfilled for callers to remain sound:

  • base must be less than the length of rgba minus four.

This invariant should also be doccommented.

Furthermore, rgba itself is not being reallocated, so it should be passed as &mut [u8] instead of &mut Vec<u8>.

yuv/src/bt601.rs Outdated
let b_sample =
sample_chroma_for_luma(chroma_b, br_width, br_height, x_pos, y_pos, false) as f32;
sample_chroma_for_luma(chroma_b, br_width, br_height, x_pos, y_pos, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pathologically small chroma_b or chroma_r inputs to this function are unsound, as the unclamped sampling will read outside their bounds. x_pos and y_pos are derived from the width and height of luma, but there is nothing to ensure that chroma_b or chroma_r's geometry is compatible with luma's. As a result, this code can read out of bounds.

yuv/src/bt601.rs Outdated
}
base += 8; // skipping the rightmost pixel, and the leftmost pixel in the next row
}

// doing the sides with clamping
for y_pos in 0..y_height {
for x_pos in [0, y_width - 1].iter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pathologically small inputs (specifically, 1px wide videos) will cause a panic here. This should instead be a saturating_sub, so that in this case, the conversion will merely harmlessly run twice.

As far as I can tell, the use of the unsafe functions below is still sound, since clamping is on. We may still want to test this somehow.

This may impact performance; depending on how LLVM's optimizer is feeling about loop-invariant code motion today.

for x_pos in 0..y_width {
for y_pos in [0, y_height - 1].iter() {
let y_sample = y.get(x_pos + y_pos * y_width).copied().unwrap_or(0) as f32;
for y_pos in [0, y_height - 1].iter() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also panic on pathologically small video heights.

@torokati44
Copy link
Member Author

Before addressing the comments, let me just add a note that the H.263 ITU Recommendation has this paragraph in it:

Custom picture formats can have a custom pixel aspect ratio as described in Table 3, if the custom
pixel aspect ratio use is first negotiated by external means. Custom picture formats can have any
number of lines and any number of pixels per line, provided that the number of lines is divisible by
four and is in the range [4, ... , 1152], and provided that the number of pixels per line is also
divisible by four and is in the range [4, ... , 2048]. For picture formats having a width or height that
is not divisible by 16, the picture is decoded in the same manner as if the width or height had the
next larger size that would be divisible by 16 and then the picture is cropped at the right and the
bottom to the proper width and height for display purposes only.

Adhering to the "decoded in the same manner as if the width or height had the next larger size that would be divisible by 16" part could simplify some loops in the IDCT and gather phases - but is not relevant for this PR.

The last part of that sentence is relevant however: "and then the picture is cropped at the right and the
bottom to the proper width and height for display purposes only".
It's not clear to me, if this cropping is to be done before or after the colorspace conversion?
If after, then many of the comments about both odd and <4 picture dimensions become moot in this code.

@kmeisthax
Copy link
Member

My guess would be that the cropping is supposed to happen after the YUV 420 conversion, because that removes ambiguity about how to handle odd-sized luma pictures.

That being said, while the H.263 specification prohibits the pathological inputs I mentioned, the PR doesn't. It's entirely possible this code winds up getting reused for other video formats that are less restrictive with custom picture sizes. We don't need to correctly decode all invalid picture sizes, but we do need to reject them before any unsafe code runs. Doing this as early as possible in yuv420_to_rgba will have the least performance impact and still ensure this code is sound.

@torokati44 torokati44 changed the title Optimize the YUV->RGBA conversion with lookup tables, and eliminating bounds checks Eliminating bounds checks in the YUV->RGBA conversion Sep 6, 2021
@torokati44 torokati44 marked this pull request as draft September 8, 2021 01:13
@torokati44
Copy link
Member Author

Superseded by #9.

@torokati44 torokati44 closed this Sep 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants