Add tiles support #631

lu-zero · 2018-10-02T17:28:19Z

Implement the configuration settings
- Context structure
- Expose CLI options for it
have the Frame provide a tile iterator example
- Have a FrameTileIterator
- Have a FrameTile
- Make the FrameTile produce PlaneSlice
Use the tile elements in the codebase

rom1v · 2018-12-18T13:30:43Z

Currently, Plane(Mut)Slice<'_> borrows the whole Plane, and just stores an offset (x, y). It exposes the whole underlying data (e.g. via as_mut_slice()), the user is responsible to use the stride information to access to different rows.

To support tiling, we must not be able to borrow memory outside the relevant region of the Plane (currently, if base == 0 for example, the method returns a mutable reference to the whole data).

To avoid this problem, I suggest to create a PlaneRegion<'_>, which is basically a "rectangle view" of a Plane. Several non-overlapping PlaneRegion<'_> could be created via an iterator (like in the example).

We could then make Plane(Mut)Slice<'_> (which is basically a cursor/view) use a PlaneRegion<'_> instead of a Plane (contrary to PlaneRegion<'_>, the cursor could be moved, expanded, etc. inside the region).

As a consequence, it will not be possible to provide a slice covering the whole surface (multiple rows) and let the user use the stride information anymore, since this would borrow data possibly belonging to other regions (tiles). Therefore, all code using Plane(Mut)Slice<'_> must be adapted to access the data (and also to take additional region offset into account). Currently, I don't know the amount of work to adapt the existing code. Maybe it's not that much.

To be able to use a cursor (Plane(Mut)Slice<'_>) either on the whole plane or a plane region, we can simply expose Plane::as_region() which would provide a region over the whole plane (including padding).

After that, we can expose Tile<'_> (provided by an interator on a Frame) containing 3 PlaneRegion<'a> (one for each plane).

lu-zero · 2018-12-18T13:47:47Z

src/cdef.rs:                let mut cdef_row = &mut cdef_slice.as_mut_slice()[..2];
src/cdef.rs:                let mut cdef_row = &mut cdef_slice.as_mut_slice()[..padded_px[p][0]-rec_w-2];
src/cdef.rs:                let mut cdef_row = &mut cdef_slice.as_mut_slice()[..rec_w];
src/deblock.rs:        let slice = plane_slice.as_mut_slice();
src/deblock.rs:        let slice = plane_slice.as_mut_slice();
src/encoder.rs:        inverse_transform_add(rcoeffs, &mut rec.mut_slice(po).as_mut_slice(), stride, tx_size, tx_type, bit_depth);
src/encoder.rs:        .as_mut_slice()
src/partition.rs:    let slice = dst.as_mut_slice();
src/partition.rs:    let slice = dst.as_mut_slice();
src/plane.rs:        let s = ps.as_mut_slice_w_width(xorigin + 1);
src/plane.rs:        let s = ps.as_mut_slice_w_width(stride - xorigin - width + 1);
src/plane.rs:      let (s1, s2) = ps.as_mut_slice().split_at_mut(yorigin * stride);
src/plane.rs:      let (s2, s1) = ps.as_mut_slice().split_at_mut(stride);
src/plane.rs:      let mut dst = dst_slice.as_mut_slice();
src/plane.rs:  pub fn as_mut_slice(&'a mut self) -> &'a mut [u16] {
src/plane.rs:  pub fn as_mut_slice_w_width(&'a mut self, width: usize) -> &'a mut [u16] {
src/rdo.rs:            let mut rec_row = rec_slice.as_mut_slice();

Here all the users according to git grep.

rom1v · 2018-12-18T13:49:35Z

as_slice(), as_slice_clamped(), as_slice_w_width():

src/deblock.rs:      let rec = rec_tmp.as_slice();
src/deblock.rs:      let src = src_tmp.as_slice();
src/deblock.rs:      let rec = rec_tmp.as_slice();
src/deblock.rs:      let src = src_tmp.as_slice();
src/encoder.rs:                        .zip(src1.as_slice().chunks(src1_stride))
src/encoder.rs:                        .zip(src2.as_slice().chunks(src2_stride)) {
src/encoder.rs:                fs.rec.planes[p].data.copy_from_slice(rec.frame.planes[p].data.as_slice());
src/encoder.rs:        fs.rec.planes[p].slice(&po).as_slice()[..32],
src/encoder.rs:        fs_.rec.planes[p].slice(&po).as_slice()[..32]
src/me.rs:        let org_ptr = org_slice.as_slice().as_ptr();
src/me.rs:        let ref_ptr = ref_slice.as_slice().as_ptr();
src/partition.rs:        above[..B::W].copy_from_slice(&dst.go_up(1).as_slice()[..B::W]);
src/partition.rs:              let s = ps.as_slice_clamped();
src/partition.rs:              let s = ps.as_slice_clamped();
src/partition.rs:              let s = ps.as_slice_clamped();
src/partition.rs:              let s = ps.as_slice_clamped();
src/plane.rs:  pub fn as_slice(&'a self) -> &'a [u16] {
src/plane.rs:  pub fn as_slice_clamped(&'a self) -> &'a [u16] {
src/plane.rs:  pub fn as_slice_w_width(&'a self, width: usize) -> &'a [u16] {
src/rdo.rs:    let s1 = src1j.as_slice_w_width(w);
src/rdo.rs:    let s2 = src2j.as_slice_w_width(w);
src/rdo.rs:                let mut in_row = in_slice.as_slice();

EDIT: but in fact, Plane(Mut)Slice<'a> is only used "locally" in some functions, so it's not very important to make it work with a region I think.

ycho · 2019-01-02T21:30:36Z

Oh, the sample program (https://play.rust-lang.org/?version=stable&mode=debug&edition=2015)
seems crashing if I try larger (but more realistic) tile size and frame size lke:

fn main() {
let mut plane = Plane::new(PlaneInfo {
stride: 1920 + 256,
w: 1920,
h: 1080,
});

let tile_info = TileInfo { w: 256, h: 256 };

rom1v · 2019-01-03T13:35:38Z

@ycho It still works for me (correct link).

Maybe you should just remove the last println!():

println!("{:#}", plane);

With your settings, it generates 2075762 additional output lines, which might flood your console or the playground.

ycho · 2019-01-03T17:43:12Z

@rom1v, indeed it works with println! commented out.

I tried below and worked!

https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=f98d6ceb1457fc5b3ecb9db0e3a2652a

let mut plane = Plane::new(PlaneInfo {
    stride: 2048,
    w: 2048,
    h: 2048,
});

let tile_info = TileInfo { w: 512, h: 512 };

rom1v · 2019-01-15T14:52:51Z

As I explained in #836 and #848, I want a to split FrameState into TileState<'_> views, one per tile.

This split is recursive, the fields of FrameState need to be split:

the rec Frame will be split into TileMut<'_>, itself splitting its 3 planes (see [WIP] Tiling #821)
the RestorationState field will be split into RestorationTile<'_>, itself splitting its 3 planes of units…

Each TileMut<'_> must be independant of the others, so that they can be accessed concurrently from several threads.

But I just noticed a problem: the restoration units may be "stretched" (is it the right word?) near frame boundaries.

Let's take a concrete example:

frame: 300×300
restoration unit size: 128×128
tile size: 128×128

Splitting the frame results in 9 tiles:

128x128  128x128  44x128
128x128  128x128  44x128
128x 44  128x 44  44x 44

This is consistent with Tile info semantics:

uniform_tile_spacing_flag equal to 1 means that the tiles are uniformly spaced across the frame. (In other words, all tiles are the same size except for the ones at the right and bottom edge which can be smaller.)

However, splitting the restoration units vector results in 4 tiles, because restoration units on borders will be stretched:

rav1e/src/lrf.rs

Lines 471 to 472 in ebea677

    
           let cols = ((fi.width + (y_unit_size >> 1)) / y_unit_size).max(1); 
        
           let rows = ((fi.height + (y_unit_size >> 1)) / y_unit_size).max(1);

cols = rows = max((300 + 64) / 128, 1) = 2

So units are assigned to different parts of the frame:

128x128  172x172
172x172  172x172

Therefore, all 9 tiles cannot be encoded independently, since adjacent tiles on frame borders will need to share their RestorationState.

Any thoughts about this?

lu-zero · 2019-01-15T15:16:25Z

I'd round up the frame so the tiles have exactly the same size.
Then we provide the crop information to match the input picture :)

rom1v · 2019-01-15T15:29:10Z

Stop loop rest units from straddling tile boundaries (Wed Oct 25 17:15:28 2017 +0100)

restoration units are allocated within each tile as if it were its own image

Does it mean that we (may) need more restoration units if we tile than if we don't?

ycho · 2019-01-15T19:46:56Z

According to read_lr() in https://aomediacodec.github.io/av1-spec/#read-loop-restoration-syntax, though its parameters like unitRowStart/unitColStart might be liberal, read_lr() is anyway called for one tile as in https://aomediacodec.github.io/av1-spec/#decode-tile-syntax.

xiphmont · 2019-01-16T07:25:26Z

Therefore, all 9 tiles cannot be encoded independently, since adjacent tiles on frame borders will need to share their RestorationState.

True, but the loop filters, in general, need read access and very little, tightly-contained write access. Once they're pipelined, that will be even more true (they will write to pipelined temp vectors passed between, not global storage).

I would think the loop filter write access could simply be global across tiles and locked with very little downside.

rom1v · 2019-01-16T11:06:37Z

I'd round up the frame so the tiles have exactly the same size.
Then we provide the crop information to match the input picture :)

I'm not sure cropping should be used for more than just rounding to a multiple of 8 or something. I don't think we want to extend the frame to the next tile boundaries (tiles may be big).

I would think the loop filter write access could simply be global across tiles and locked with very little downside.

Indeed, it is written only there (for now):

rav1e/src/context.rs

Lines 3017 to 3019 in de68ab1

    
           let ru = &mut rp.restoration_unit_as_mut(sbo); 
        
           code = !ru.coded; 
        
           ru.coded = true;

So a lock (or an atomic swap) would "work".

But it still seems weird to me that a restoration unit has to be shared between tiles (and only in specific cases, on frame borders), since a tile is intended to be decoded/encoded independently of others.

According to read_lr() in https://aomediacodec.github.io/av1-spec/#read-loop-restoration-syntax, though its parameters like unitRowStart/unitColStart might be liberal, read_lr() is anyway called for one tile as in https://aomediacodec.github.io/av1-spec/#decode-tile-syntax.

<codeview> As I read, read_lr() is called for each SB in decode_tile(), (I know it from that decode_partition() is started being called from there). And, if you read further in read_lr(), unitColStart and unitColEnd are set based on (r,c), which are bound by MiColStart = MiColStarts[ tileCol ] and MiColEnd = MiColStarts[ tileCol + 1 ], so does not exceed the tile.
<codeview> Similarly to Row.

(from IRC)

I think you're right.

In decode_tile(), r takes values between MiRowStart and MiRowEnd, respectively initialized to MiRowStarts[tileRow] and MiRowStarts[tileRow+1] in tile_group_obu().

MiRowStarts[] is initialized in tile_info(), and represents the start row for each tile, expressed in number of mode info (4x4 luma samples). So in my example above, it will contain these values:

MiRowStarts = [0, 32, 64, 75]

Now, let's consider the last row of tiles:

128x128  128x128  44x128
128x128  128x128  44x128
128x 44  128x 44  44x 44   <-- this row

read_lr() will be called with r = 64. For the luma sample:

unitRows = 2;
unitCols = 2;
unitRowStart = (64 * MI_SIZE + 127) / 128 = 2;
unitRowEnd = Min(2, /* a value greater than 2 */) = 2;

Therefore:

for ( unitRow = unitRowStart; unitRow < unitRowEnd; unitRow++ )

does nothing, and read_lr_unit() is never called.

In other words, unless I'm mistaken, it seems that restoration units are not used for the last row and last column in my example.

So there are 9 tiles, but only 4 of them use a restoration unit. That way, a restoration unit is never shared between tiles.

Am I right?

xiphmont · 2019-01-16T11:49:23Z

Now, let's consider the last row of tiles: 128x128 128x128 44x128 128x128 128x128 44x128 128x 44 128x 44 44x 44 <-- this row read_lr() will be called with r = 64. For the luma sample: unitRows = 2; unitCols = 2; unitRowStart = (64 * MI_SIZE + 127) / 128 = 2; unitRowEnd = Min(2, /* a value greater than 2 */) = 2; Therefore: for ( unitRow = unitRowStart; unitRow < unitRowEnd; unitRow++ ) does nothing, and read_lr_unit() is never called.

It does nothing because it's a shared RU. It uses the parameters coded from the earlier RU.

In other words, unless I'm mistaken, it seems that restoration units are not used for the last row and last column in my example. So there are 9 tiles, but only 4 of them use a restoration unit. That way, a restoration unit is never shared between tiles. Am I right?

There are only four restoration units, but all tiles have a restoration unit. The ones on the edges are stretched. Use a last row/col size of eg 68 rather than 48 and you'll see it allocate nine RU instead of 4. Monty

rom1v · 2019-01-16T15:56:55Z

Stop loop rest units from straddling tile boundaries (Wed Oct 25 17:15:28 2017 +0100)

[NORMATIVE] Allow LR units to cross tiles (Wed Mar 28 16:58:57 2018 +0100)

Since fully independent decoding of tiles is no longer supported, instead of dividing each tile into loop restoration units independently, divide the frame as a whole into units.

rom1v · 2019-01-16T15:57:06Z

It does nothing because it's a shared RU.

OK.

It uses the parameters coded from the earlier RU.

It's not clear to me where it uses (or should use) the earlier RU (shared with another tile) in the code. Could you detail how it works, please?

rom1v · 2019-01-16T20:55:10Z

the loop filters, in general, need read access and very little, tightly-contained write access.

What are the expected write access (and from where), apart from the coded flag in RestorationUnit?

rom1v · 2019-01-21T14:17:00Z

So, finally, restoration units may not be split into tiles (because they may be shared between several tiles). There is no need to create non-overlapping views of them, they have interior mutability via atomic or mutex.

For now, I can split a FrameState into TileStateMut<'_> structures (one for each tile).

The next step is to use these structure instead of FrameState for every tile-specific action.

Problem

The main problem is that many pieces of code directly use &[u16] and &mut [u16] (spanning multiple rows) from a PlaneSlice, which is not tile-compatible. Concretely, we cannot directly "tile" the code which access a region of a plane using a &[u16]/&mut [u16] + the stride information, like sse_size4().

I see two strategies to attack this problem:

from current master (without my tiling stuff), first refactor PlaneSlice not to expose multirows raw slices, and adapt all client code that needs to read/write several rows.
from my WiP tiling branch, start to use tiling structures (PlaneRegion, PlaneSubRegion…) and adapt only client code that needs tiling.

The first strategy has advantages:

the changes can be applied separately and incrementally;
accesses to regions of planes will be homogeneous across the app.

However, this will imply to refactor things that are only used at frame level, so some changes would not be absolutely necessary (like possibly sse_size4(), which is called from deblock_filter_optimize() at frame level).

On the other hand, option 2 will result in a big tiling branch to maintain until it's merged, and different parts of the code would access regions of plances in two different ways (PlaneSlice, tile-incompatible, and PlaneRegion, tile-compatible).

I suggest we try to take option 1.

Proposal

We will need to remove as_slice() and its variants from Plane(Mut)Slice.

We will need to access rows separately (and also retrieve a raw ptr for assembly code).

I suggest something like:

diff --git a/src/plane.rs b/src/plane.rs
index 80f5d8f..1fab676 100644
--- a/src/plane.rs
+++ b/src/plane.rs
@@ -293,11 +293,21 @@ impl<'a> ExactSizeIterator for IterWidth<'a> { }
 impl<'a> FusedIterator for IterWidth<'a> { }
 
 impl<'a> PlaneSlice<'a> {
-  pub fn as_slice(&self) -> &'a [u16] {
-    let stride = self.plane.cfg.stride;
-    let base = (self.y + self.plane.cfg.yorigin as isize) as usize * stride
-      + (self.x + self.plane.cfg.xorigin as isize) as usize;
-    &self.plane.data[base..]
+  pub fn row(&self, x_offset: isize, y_offset: isize) -> &[u16] {
+    assert!(self.plane.cfg.yorigin as isize + self.y + y_offset >= 0);
+    assert!(self.plane.cfg.xorigin as isize + self.x + x_offset >= 0);
+    let base_y = (self.plane.cfg.yorigin as isize + self.y + y_offset) as usize;
+    let base_x = (self.plane.cfg.xorigin as isize + self.x + x_offset) as usize;
+    let base = base_y * self.plane.cfg.stride + base_x;
+    let width = self.plane.cfg.stride - base_x;
+    &self.plane.data[base..base + width]
+  }
+
+  pub fn as_ptr(&self) -> *const u16 {
+    let base_y = (self.plane.cfg.yorigin as isize + self.y) as usize;
+    let base_x = (self.plane.cfg.xorigin as isize + self.x) as usize;
+    let base = base_y * self.plane.cfg.stride + base_x;
+    self.plane.data[base..].as_ptr()
   }
 
   pub fn as_slice_clamped(&self) -> &'a [u16] {

Passing x_offset and y_offset to rows() provides a unified way to access slices at a specific position, avoiding go_up(1).go_left(1) and the lifetime problems it poses (see #846).

Usage would be:

diff --git a/src/partition.rs b/src/partition.rs
index eaeba5c..1764e0e 100644
--- a/src/partition.rs
+++ b/src/partition.rs
@@ -866,7 +866,7 @@ pub fn get_intra_edges<'a>(
     // Needs top
     if needs_top {
       if y != 0 {
-        above[..tx_size.width()].copy_from_slice(&dst.go_up(1).as_slice()[..tx_size.width()]);
+        above[..tx_size.width()].copy_from_slice(&dst.row(0, -1)[..tx_size.width()]);
       } else {
         let val = if x != 0 { dst.go_left(1).p(0, 0) } else { base - 1 };
         for v in above[..tx_size.width()].iter_mut() {
diff --git a/src/me.rs b/src/me.rs
index 15ecc10..3ccdae7 100644
--- a/src/me.rs
+++ b/src/me.rs
@@ -82,8 +82,8 @@ mod nasm {
       for c in (0..blk_w).step_by(step_size) {
         let org_slice = plane_org.subslice(c, r);
         let ref_slice = plane_ref.subslice(c, r);
-        let org_ptr = org_slice.as_slice().as_ptr();
-        let ref_ptr = ref_slice.as_slice().as_ptr();
+        let org_ptr = org_slice.as_ptr();
+        let ref_ptr = ref_slice.as_ptr();
         sum += func(org_ptr, org_stride, ref_ptr, ref_stride);
       }
     }

Then we need to see where the compiler fails (hint: at many places), and use these new methods. Sometimes it's straightforward, sometimes it requires to refactor code which uses &[u16] + stride information.

What do you think?

lu-zero · 2019-01-21T14:41:17Z

The plan 1 is good and possibly we could work in parallel since should be easy to update little by little keeping the old methods around till their usage is phased away completely.

See <xiph#631 (comment)>.

The util function convert_slice_2d() operates on a slice using the stride information. It will become incompatible with PlaneSlice which will not expose a multi-rows slice anymore (see <xiph#631 (comment)>). To keep it generic enough, we don't want to use a PlaneSlice wrapper for every call, so make the function use raw pointers (and unsafe).

See <xiph#631 (comment)>.

The util function convert_slice_2d() operates on a slice using the stride information. It will become incompatible with PlaneSlice which will not expose a multi-rows slice anymore (see <xiph#631 (comment)>). To keep it generic enough, we don't want to use a PlaneSlice wrapper for every call, so make the function use raw pointers (and unsafe).

See <xiph#631 (comment)>.