Tile encoding #1126
The motion vectors were stored as a Vec<Vec<MotionVector>>. The innermost Vec contains a flatten matrix (fi.w_in_b x fi.h_in_b) of MotionVectors, and there are REF_FRAMES instances of them (the outermost Vec). Introduce a typed structure to replace the innermost Vec: - this improves readability; - this allows to expose it as a 2D array, thanks to Index and IndexMut traits; - this will allow to split it into (non-overlapping) tiled views, containing only the motion vectors for a bounded region of the plane (see <xiph#1126>).
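A minimal sketch of the idea described in this commit message, assuming a flat row-major storage and hypothetical field names (the `MotionVector` fields, `cols`, `rows` and `new()` are illustrative and not taken from the rav1e sources). Implementing `Index<usize>` so that it returns one row as a slice is what makes `mvs[y][x]` work:

```rust
use std::ops::{Index, IndexMut};

// Hypothetical motion vector type, for illustration only.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct MotionVector {
    row: i16,
    col: i16,
}

// A typed wrapper around a flattened (cols x rows) matrix of motion vectors.
struct FrameMotionVectors {
    mvs: Box<[MotionVector]>,
    cols: usize,
    rows: usize,
}

impl FrameMotionVectors {
    fn new(cols: usize, rows: usize) -> Self {
        Self {
            mvs: vec![MotionVector::default(); cols * rows].into_boxed_slice(),
            cols,
            rows,
        }
    }
}

// Indexing by row yields a slice, so `mvs[y][x]` reads like a 2D array.
impl Index<usize> for FrameMotionVectors {
    type Output = [MotionVector];

    fn index(&self, row: usize) -> &Self::Output {
        assert!(row < self.rows);
        &self.mvs[row * self.cols..(row + 1) * self.cols]
    }
}

impl IndexMut<usize> for FrameMotionVectors {
    fn index_mut(&mut self, row: usize) -> &mut Self::Output {
        assert!(row < self.rows);
        &mut self.mvs[row * self.cols..(row + 1) * self.cols]
    }
}
```

With this shape, the outermost Vec would hold one such structure per reference frame.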
I added But since I added
I think we should replace all |
BlockOffset has a size of 128 bits (the same as a slice), and is trivially copyable, so make it derive Copy. Once it derives Copy, clippy suggests to never pass it by reference: <https://rust-lang.github.io/rust-clippy/master/index.html#trivially_copy_pass_by_ref> So pass it by value everywhere to simplify usage. In particular, this avoids lifetimes bounds where not necessary (e.g. in get_sub_partitions()). See <xiph#1126 (comment)>.
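A small sketch of what this change looks like in practice, assuming `BlockOffset` holds two `usize` fields named `x` and `y` (the field names and the helper `right_neighbor()` are assumptions made for illustration). Once the type derives `Copy`, it can be passed by value and the caller keeps its own copy:

```rust
// Assumed layout: two usize fields, i.e. 128 bits on a 64-bit target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockOffset {
    x: usize,
    y: usize,
}

// Hypothetical helper taking the offset by value: no reference, no lifetime.
fn right_neighbor(bo: BlockOffset) -> BlockOffset {
    BlockOffset { x: bo.x + 1, y: bo.y }
}
```

Because no reference is involved, functions returning values derived from an offset do not need lifetime bounds tied to the argument.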
Switched the parameters of the get_sad() function ps://rust-lang.github.io/rust-clippy/master/index.html#neg_multiply Use a procedural macro instead a local copy of arg_enum Make cargo doc happier and make crav1e not depend on compiler unstable feature. Remove mutability from input plane for CDEF functions Restore CDEF-disabled paths for RDO Add speed setting for CDEF Teach --speed-test=baseline to set SpeedSettings::default() Many of the settings change nothing at the default speed. Use cargo fetch to generate the Cargo.lock needed by kcov Split long running tests and add a feature flag to avoid high dimension Parallel build for dependencies on Travis Also, check out the v1.0.0-errata1 tag of libaom. Build test separately before running kcov This reduces the odds of a Travis timeout. Rewrite the high_bit_depth and chroma_sampling tests Apparently cbindgen has problems parsing them in the former rendition. Add width and height as parsable parameters Use .iter() over the plane data Preliminary to use a different backing storage. Drop macro_use for interpolate_name Use a Box<[T]> as storage for Plane Add PlaneData It acts a aligned memory wrapper. Fixes xiph#1101 Derive Layout on demand in PlaneData Align PlaneData to 32 bytes on Windows Re-enable building assembly files on Windows. This doesn't actually call the assembly functdions yet. Unbreak Context::container_sequence_header() Remove fake-genericity from sad functions he functions sad_ssse3() and sad_sse2() only support u16 and u8 respectively, so they are not generic. Make the caller pass the expected type. 
Rename sad_ssse3() to sad_hbd_ssse3() <xiph#1092 (comment)> Suggested-by: David Michael Barr <b@rr-dav.id.au> Add a copy_from_raw_u8 test Use sccache in CI scripts (xiph#1110) * Extract archives in parallel with download * Fetch sccache binary release * Use sccache for C and C++ dependencies * Limit sccache size to 500M * Use CI generic cache to store compiler cache api: Drop parse() function from Config This function is not needed in rust and it is mostly a convenience for other languages. Instead move this chunk in the appropriate bindings. Fix get_sad() tests and benches The function get_sad() was called with block width and block height parameters swapped. As a consequence, in tests, associate the precomputed SAD values to the transposed block size. Call assembly functions on Windows. Retrieve dimensions from plane_cfg To compute the number of pixels available in the top-right and bottom-left edges, get_intra_edges() received frame_w_in_b (MiCols) and frame_h_in_b (MiRows) as parameters, initialized as follow: MiCols = 2 * ( ( FrameWidth + 7 ) >> 3 ) MiRows = 2 * ( ( FrameHeight + 7 ) >> 3 ) <https://aomediacodec.github.io/av1-spec/#compute-image-size-function> The sizes computed by get_intra_edges() were basically the frame dimensions rounded up to the next multiple of 8, decimated: (MI_SIZE >> plane_cfg.xdec) * frame_w_in_b (MI_SIZE >> plane_cfg.ydec) * frame_h_in_b But in Frame::new(), the luma plane dimensions are also initialized with the frame dimensions rounded up to the next multiple of 8. Therefore, it is equivalent to directly use the plane dimensions. Avoid superfluous memset in forward transforms Avoid superfluous memset in write_coeffs_lv_map Move motion_estimation to a trait And keep the actual code as default trait Move full pixel me in a separate function Move the specific full_pixel_me impl where they belong Move the specific sub_pixel_me impl where they belong Disable prep_8tap assembly. Temporarily fixes xiph#1115. 
Cast before left shift in native prep_8tap Enable prep_8tap assembly Enable the Clippy's manual_memcpy lint (xiph#1122) https://rust-lang.github.io/rust-clippy/master/index.html#manual_memcpy Inline often called and almost-trivial functions (xiph#1124) * Inline constrain and msb for cdef_filter_block This reduces its average time by around 42%. * Inline round_shift for pred_directional and others This reduces its average time by around 10%. * Inline sgrproj_sum_finish to its various callers It is at the lowest level of a hot call graph and almost trivial. * Inline get_mv_rate in motion estimation It is almost trivial and called often. Enable the Clippy's if_same_then_else lint https://rust-lang.github.io/rust-clippy/master/index.html#if_same_then_else Add struct FrameMotionVectors The motion vectors were stored in a Vec<Vec<MotionVector>>. The innermost Vec contains a flatten matrix (fi.w_in_b x fi.h_in_b) of MotionVectors, and there are REF_FRAMES instances of them (the outermost Vec). Introduce a typed structure to replace the innermost Vec: - this improves readability; - this allows to expose it as a 2D array, thanks to Index and IndexMut traits; - this will allow to split it into (non-overlapping) tiled views, containing only the motion vectors for a bounded region of the plane (see <xiph#1126>). Enable the Clippy's len_zero lint (xiph#1128) https://rust-lang.github.io/rust-clippy/master/index.html#len_zero diamond_me: save only selected frame motion vectors Save them by reference frame types instead of picture slot. Do not add several times the zero motion vector to the predictor list. Use diamond search for the half resolution motion estimation estimate_motion_ss2: include it in the MotionEstimation trait Make BlockOffset derive Copy BlockOffset has a size of 128 bits (the same as a slice), and is trivially copyable, so make it derive Copy. 
Once it derives Copy, clippy suggests to never pass it by reference: <https://rust-lang.github.io/rust-clippy/master/index.html#trivially_copy_pass_by_ref> So pass it by value everywhere to simplify usage. In particular, this avoids lifetimes bounds where not necessary (e.g. in get_sub_partitions()). See <xiph#1126 (comment)>. Make SuperBlockOffset derive Copy Like previous commit did for BlockOffset. Make PlaneOffset derive Copy Like previous commits did for BlockOffset and SuperBlockOffset. Set timeout for cargo kcov to 20 minutes. Do not pass both BlockOffset and PlaneOffset In motion estimation, several functions received both the offset expressed in blocks and in pixels for the luma plane. This information is redundant: a block offset is trivially convertible to a luma plane offset. With tiling, we need to manage both absolute offsets (relative to the frame) and offsets relative to the current tile. This will be more simple without duplication.
Here is my first working tiled (2×2) video encoded with rav1e: From this version: https://github.com/rom1v/rav1e/commits/tiling.100
1 tile vs 4 tiles: https://beta.arewecompressedyet.com/?job=%402019-04-05T21%3A29%3A59.430Z_ref_1_tile&job=%402019-04-05T21%3A28%3A52.445Z_4_tiles Of course, there is a cost in quality (SSIM, PSNR…) because tiling loses the possibility to exploit some redundancy across tiles in the same frame. For now, the current version only saves the CDF from tile 0 (it should choose the biggest tile in bytes instead), and always stores tile sizes on 4 bytes. It can (and will) be improved. The encoding time is worse with tiling on AWCY because I think that it uses only 1 core per instance.
Unfortunately (but as expected), the tiling structures are not a zero cost abstraction. They add an overhead in encoding time between 1~3%. Concretely, if we compare the version compiled from As an example, on my laptop, an encoding takes 3mn33,245 on You can compare encoding times on AWCY for 1 tile: https://beta.arewecompressedyet.com/?job=master-70005e353aa8ce21e3ecd257c927f71d4012a117&job=%402019-04-05T21%3A29%3A59.430Z_ref_1_tile Maybe some work could be done to minimize this overhead (for example using more EDIT: now there is no overhead (even a negative overhead with more inlines than on |
I think it is ready to land now.
Thank you very much for the great work!
As I observed, the BD-rate change from this PR is:
At speed 0, low_latency=true (since low_latency=false does not seem to work correctly)
1 tile -> 2x2=4 tiles:
AWCY link
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
2.8320 | 1.7719 | 2.2092 | 2.7641 | 2.8205 | 2.7920 | 2.4722 |
1 tile -> 4x2 (col x row) tiles:
AWCY link
PSNR | PSNR Cb | PSNR Cr | PSNR HVS | SSIM | MS SSIM | CIEDE 2000 |
---|---|---|---|---|---|---|
5.2285 | 3.8110 | 4.2109 | 5.0781 | 5.2115 | 5.1621 | 4.7286 |
Actually, in absolute, I don't know why.
They have been implemented in TileRestorationPlane instead.
Add --tile-cols-log2 and --tile-rows-log2 to configure tiling. This configuration is made available in FrameInvariants.
Compute the tiling information and make it accessible from FrameInvariants.
Encode the tiles from each tile context provided by the TilingInfo tile iterator.
To write the bitstream, a big-endian BitWriter is used. However, some values need to be written in little-endian (le(n) in AV1 specs). A method write_uleb128() was already present. Add a new one to write little-endian values: write_le(bytes, value).
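As a rough illustration of the little-endian write described here, the sketch below appends the `bytes` least significant bytes of a value to a byte buffer, lowest byte first. It is a free function over a `Vec<u8>` and assumes byte alignment; the actual method in the PR lives on the bit writer and its exact signature is not shown in this conversation, so the signature here is an assumption.

```rust
// Append `value` as `bytes` little-endian bytes (least significant byte first).
// Sketch only: assumes the output is byte-aligned and `bytes` is at most 8.
fn write_le(out: &mut Vec<u8>, bytes: usize, value: u64) {
    assert!(bytes <= 8, "a u64 holds at most 8 bytes");
    // The value must fit in the requested number of bytes.
    assert!(bytes == 8 || value >> (8 * bytes) == 0, "value does not fit");
    for i in 0..bytes {
        out.push((value >> (8 * i)) as u8);
    }
}
```

For example, writing 0x01020304 on 4 bytes yields the byte sequence 04 03 02 01.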
Correctly write the bitstream if there are several tiles: <https://aomediacodec.github.io/av1-spec/#tile-info-syntax>
Write the tile group from the vector of individual tile data: <https://aomediacodec.github.io/av1-spec/#general-tile-group-obu-syntax>
Collect the context and CDFs in an intermediate vector, so that it can be iterated in parallel with Rayon.
Use par_iter_mut() from Rayon to call encode_tile() for each tile context in parallel.
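To keep the example free of external crates, the following sketch uses `std::thread::scope` from the standard library instead of Rayon's `par_iter_mut()`; the pattern is the same (each worker gets exclusive access to one tile context), but this is a stand-in and not the code from the PR. `TileContext` and `encode_tile_stub()` are hypothetical placeholders.

```rust
use std::thread;

// Hypothetical per-tile context: everything one tile needs, owned separately.
struct TileContext {
    input: Vec<u8>,
    output: Vec<u8>,
}

// Placeholder for the real per-tile encoding work.
fn encode_tile_stub(ctx: &mut TileContext) {
    ctx.output = ctx.input.iter().map(|v| v.wrapping_add(1)).collect();
}

// Each context is borrowed mutably by exactly one thread.
fn encode_tiles_in_parallel(contexts: &mut [TileContext]) {
    thread::scope(|s| {
        for ctx in contexts.iter_mut() {
            s.spawn(move || encode_tile_stub(ctx));
        }
    });
}
```

Collecting the contexts into a vector first, as the commit message describes, is what makes this kind of iteration possible: the iterator hands out disjoint mutable references.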
Tile RDO trackers results need to be aggregated at frame level.
Use the tile that takes the largest number of bytes for CDF update. It should be better for entropy coding.
The tile size may be encoded using 1, 2, 3 or 4 bytes. For simplicity, the encoder always used 4 bytes. Instead, use the number of bytes required by the biggest tile.
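The two choices above (which tile to use for the CDF update, and how many bytes to use for tile sizes) can both be derived from the list of encoded tile sizes. The sketch below is illustrative only: the function names are hypothetical, and it ignores details such as whether the bitstream stores the size itself or a value derived from it.

```rust
// Index of the tile with the largest encoded size, if any.
fn largest_tile_index(tile_sizes: &[usize]) -> Option<usize> {
    tile_sizes
        .iter()
        .enumerate()
        .max_by_key(|&(_, &size)| size)
        .map(|(index, _)| index)
}

// Minimum number of bytes (at least 1) needed to store `value`.
fn bytes_needed(value: usize) -> usize {
    let bits = usize::BITS - value.leading_zeros();
    (((bits + 7) / 8).max(1)) as usize
}

// Number of bytes needed to store the size of the biggest tile.
fn tile_size_bytes(tile_sizes: &[usize]) -> usize {
    tile_sizes.iter().copied().map(bytes_needed).max().unwrap_or(1)
}
```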
The region may be smaller than the lrf_input plane. In that case, &rec[..width] panic!ed.
The offsets are relative to the tile, so find_valid_row_offs() behavior does not change with tiling.
Make it consistent with find_valid_rows_offs() and with the libaom implementation: <https://aomedia.googlesource.com/aom/+/645dbcba0c4b42a79c28eec4516bd37702121ae3/av1/common/mvref_common.h#89>
We will need the block sizes at frame-level to clamp motion vectors.
This fixes bitstream corruption! Lost hours here: many.
This will allow adding tile encoding tests.
Add a decode_test with a size such that it uses stretched restoration units. See <xiph#631 (comment)>.
The tail call confuses the compiler, preventing inlining.
The method set_block_size() has been declared inline after profiling. Also inline the other setters.
Sure, far better (> 10%), that is why we wanted to have pyramid and/or frame-reordering.
I published a blog post about this feature: https://blog.rom1v.com/2019/04/implementing-tile-encoding-in-rav1e/
(description updated on 16 April 2019)
This PR implements tile encoding (#631).
Context
Encoding a frame first involves frame-wise accesses (initialization, etc.), then tile-wise accesses (to encode tiles in parallel), then frame-wise accesses using the results of tile-encoding (deblocking, cdef, …):
Tiling
As you know, in Rust, it is not sufficient not to read/write the same memory from several threads, it must be impossible to write (safe) code that could do it. More precisely, a mutable reference may not alias any other reference to the same memory.
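As a minimal illustration of this rule using only the standard library, a buffer can be split into disjoint mutable parts that the borrow checker accepts, because the parts provably do not overlap. The sketch below splits a plane stored row by row into horizontal bands with `chunks_mut()`; the function name and parameters are hypothetical. Note that this only works for bands of whole rows: tiles placed side by side are not contiguous in memory, so plain slices cannot express them.

```rust
// Split a row-major buffer into non-overlapping bands of `rows_per_band` rows.
// Each band is an independent mutable slice.
fn split_into_row_bands(
    data: &mut [u8],
    stride: usize,
    rows_per_band: usize,
) -> Vec<&mut [u8]> {
    assert!(stride > 0 && rows_per_band > 0);
    data.chunks_mut(stride * rows_per_band).collect()
}
```

Each returned slice can be handed to a different thread, since no two of them alias the same memory.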
That's the reason why, as a preliminary step, I replaced accesses using the whole plane as a raw slice in addition to the stride information by PlaneSlice (#1035) and PlaneMutSlice (#1043).
But Plane(Mut)Slice still borrows the whole plane slice, so it does not, in itself, solve the problem.
There are several structures to be tiled, which form a tree:
Most of them exist both in const and mutable version (e.g. PlaneRegion and PlaneRegionMut).
Tiling structures
PlaneRegion
This is a view of a bounded region of a Plane. It is similar to PlaneSlice, except that it does not borrow the whole underlying raw slice. That way, it is possible to get several non-overlapping regions simultaneously.
In the end, we should probably merge it with PlaneSlice, but it requires more work because some frame-wise code still uses PlaneSlice in the code base.
It is possible to retrieve a subregion of a region (which may not exceed its parent). In theory, a subregion is defined by a rectangle (for example: x, y, width, height), but in practice, we need more flexibility. For example, we often need to retrieve a region from an offset, using the same bottom-right corner as its parent without providing width and height.
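The two ways of describing a subregion mentioned above (an explicit rectangle, or just an offset that keeps the parent's bottom-right corner) could be modelled roughly as follows. This is a sketch under assumptions: `Rect`, `SubregionBounds`, its variants and `resolve()` are hypothetical names chosen for illustration, not the API of this PR.

```rust
// Absolute rectangle inside a plane, in pixels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Rect {
    x: usize,
    y: usize,
    width: usize,
    height: usize,
}

// Hypothetical description of subregion bounds, relative to a parent region.
#[derive(Debug, Clone, Copy)]
enum SubregionBounds {
    // Fully specified rectangle.
    Explicit { x: usize, y: usize, width: usize, height: usize },
    // Only the top-left corner; extends to the parent's bottom-right corner.
    StartingAt { x: usize, y: usize },
}

impl SubregionBounds {
    // Convert the relative bounds into an absolute rectangle,
    // panicking if the result would exceed the parent.
    fn resolve(self, parent: Rect) -> Rect {
        let (x, y, width, height) = match self {
            SubregionBounds::Explicit { x, y, width, height } => (x, y, width, height),
            SubregionBounds::StartingAt { x, y } => {
                assert!(x <= parent.width && y <= parent.height);
                (x, y, parent.width - x, parent.height - y)
            }
        };
        assert!(x + width <= parent.width && y + height <= parent.height);
        Rect { x: parent.x + x, y: parent.y + y, width, height }
    }
}
```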
For that purpose, I propose a specific Area structure (actually, a Rust enum) to describe subregion bounds. Here are some usage examples:
Retrieving a subregion from a BlockOffset is so common across the code base that I decided to expose it directly:
Like Plane(Mut)Slice, it provides operator[] and iterators over its rows:
The mutable versions of the structure (PlaneRegionMut) and methods are also provided.
Tile
A Tile is a view of 3 colocated plane regions (Tile is to a PlaneRegion as a Frame is to a Plane).
The mutable version (TileMut) is also provided.
TileState
The way the FrameState fields are mapped in TileState depends on how they are accessed tile-wise and frame-wise.
Some fields (like qc) are only used during tile-encoding, so they are only stored in TileState.
Some other fields (like input or segmentation) are not written tile-wise, so they just reference the matching field in FrameState.
Some others (like rec) are written tile-wise, but must be accessible frame-wise once the tile views vanish (e.g. for deblocking).
It contains 2 tiled views: TileRestorationState and a vector of TileMotionVectorsMut (a tiled view of FrameMotionVectors).
This structure is only provided as mutable (TileStateMut). A const version is not necessary, and would require instantiating a const version of all its embedded tiled views.
TileBlocks
TileBlocks is a tiled view of FrameBlocks. It exposes the blocks associated with the tile.
The mutable version (TileBlocksMut) is also provided.
Splitting into tiles
A TilingInfo structure computes all the details about tiling from the frame width and height and the (log2 of the) number of tile columns and rows. The details are accessible for initializing data or writing into the bitstream.
It provides an iterator over tiles (yielding one TileStateMut and one TileBlocksMut for each tile).
Frame offsets vs tile offsets
In encode_tile(), super-block, block and plane offsets are expressed relative to the tile. The tiling views expose their data relative to the tile:
- plane_region[y][x] is pixel (x, y) relative to the plane region,
- tile_blocks[boy][box] contains the Block at (box, boy) relative to the tile,
TileStateMut exposes some references to frame-level data stored in FrameState:
- input is a reference to the whole frame,
- input_hres and input_qres are references to the whole planes.
When accessing this frame-level data, tile offsets are converted to frame offsets, for example by:
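The kind of conversion meant here can be sketched as follows. The helper names are hypothetical, the tile origin is assumed to be known in block units, and one block unit is assumed to cover 4x4 luma pixels; none of these details are confirmed by the text above.

```rust
// Offsets in block units and in pixels; field names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockOffset {
    x: usize,
    y: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PlaneOffset {
    x: isize,
    y: isize,
}

// Assumption: one block unit is 4x4 luma pixels.
const MI_SIZE_LOG2: usize = 2;

// Convert an offset relative to a tile into an offset relative to the frame,
// given the position of the tile's top-left block in the frame.
fn to_frame_block_offset(tile_origin: BlockOffset, tile_bo: BlockOffset) -> BlockOffset {
    BlockOffset { x: tile_origin.x + tile_bo.x, y: tile_origin.y + tile_bo.y }
}

// Convert a block offset into a luma plane offset (in pixels).
fn to_luma_plane_offset(bo: BlockOffset) -> PlaneOffset {
    PlaneOffset {
        x: (bo.x << MI_SIZE_LOG2) as isize,
        y: (bo.y << MI_SIZE_LOG2) as isize,
    }
}
```

For a tile starting at block (16, 0), the tile-relative block (3, 5) corresponds to frame block (19, 5), which is luma pixel (76, 20) in the whole-frame input.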
Current state
It works.
Need more tests and reviews.
Usage
Pass the requested log2 number of tiles, with --tile-cols-log2 and --tile-rows-log2. For example, to request 2x2 tiles:
Currently, the number of tiles is passed in log2 (like in libaom, even if the aomenc options are called --tile-columns and --tile-rows), to avoid any confusion. Maybe we could find a correct user-friendly option later.
Note that the actual number of tiles may be smaller (e.g. if the image size has fewer super-blocks).