Motion Estimation and Motion Compensation Driven Neural Network for Video Interpolation and Enhancement from Shanghai Jiao Tong University
- exploit motion estimation and motion compensation in a neural network
- propose an adaptive warping layer based on optical flow and compensation filters for synthesizing new pixels. This novel warping layer is fully differentiable such that the gradients can be back-propagated to both the ME and MC networks.
- To account for the occlusions, we estimate occlusion masks to adaptively blend the warped frames. Furthermore, the missing pixels in holes and unreliable pixels of the warped frames are processed by a post-processing CNN.
- simultaneously estimate the flow and compensation kernels with respect to the original reference frames, then combine them through the adaptive warping layer
- Motion Estimation: block-based then search (e.g. spatial/temporal search)
- Motion Compensated Interpolation: utilize motion to get interpolated frame (e.g. image fusion, overlapped patch reconstruction)
- Post-processing: remove artifacts
- Motion Estimation: CNNs (SPyNet (CVPR 2017), FlowNet (ICCV 2015), FlowNet 2.0 (CVPR 2017))
- predict bi-directional flow: start from a pre-trained flow model, then fine-tune it for the task, e.g. TOFlow (IJCV 2019)
- bilinear warping operation to align the input frames based on linear motion models (i.e. to synthesize the new frame), e.g. Deep Voxel Flow (ICCV 2017), FIGAN (2017), Super SloMo (CVPR 2018)
- Bilinear Warping: blend neighboring pixels based on the sub-pixel shifts
- Post-processing
- Kernel Estimation: estimate spatially-adaptive convolutional kernels for each output pixel
- Kernel Convolution e.g. SepConv(ICCV 2017)
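The two learned families in the list above (flow-based bilinear warping and kernel-based synthesis) can be contrasted with a minimal sketch. It assumes a grayscale image stored as a list of rows and border clamping; the helper names `pixel`, `bilinear_sample` and `kernel_sample` are hypothetical and not taken from any of the cited papers.

```python
import math

def pixel(img, y, x):
    """Read img[y][x] with border clamping (an assumption; other paddings work too)."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

def bilinear_sample(img, x, y):
    """Flow-based route: blend the 4 neighbours of a real-valued location (x, y)."""
    x0, y0 = math.floor(x), math.floor(y)
    ax, ay = x - x0, y - y0  # sub-pixel shifts in [0, 1)
    return ((1 - ax) * (1 - ay) * pixel(img, y0, x0)
            + ax * (1 - ay) * pixel(img, y0, x0 + 1)
            + (1 - ax) * ay * pixel(img, y0 + 1, x0)
            + ax * ay * pixel(img, y0 + 1, x0 + 1))

def kernel_sample(img, x, y, kernel):
    """Kernel-based route: weight a (2r+1) x (2r+1) patch around integer (x, y)."""
    r = len(kernel) // 2
    return sum(kernel[dy + r][dx + r] * pixel(img, y + dy, x + dx)
               for dy in range(-r, r + 1) for dx in range(-r, r + 1))

ramp = [[10.0 * x for x in range(4)] for _ in range(3)]  # intensity grows with x
half_pixel = bilinear_sample(ramp, 1.5, 1.0)             # halfway between 10 and 20
shift_right = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]          # copies the right neighbour
copied = kernel_sample(ramp, 1, 1, shift_right)
```

In the flow-based route the weights are fixed by the sub-pixel shift, while in the kernel-based route every weight is free and would be predicted per output pixel by a network.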
Warps images or features based on the given optical flow and local convolutional kernels.
Forward pass
$I(x) : \mathbb{Z}^2 \to \mathbb{R}^3$ denotes the RGB image (a map from 2D pixel coordinates to three real-valued RGB channels).
$f(x) := (u(x), v(x))$ represents the optical flow field; $u(x)$ and $v(x)$ denote the horizontal and vertical components of the 2D vector.
$k^l(x) = [k^l_\mathbf{r}(x)]_{H \times W} \quad (\mathbf{r} \in [-R+1, R]^2)$ indicates the interpolation kernel, where $R$ is the kernel size.
For each output pixel: shift the image by the optical flow, then apply a kernel made of two parts.
$k^l_\mathbf{r}$: the learned interpolation kernel, 16 channels (reshaped to 4x4), produced by the kernel estimation network.
$k^d_\mathbf{r}$: 4x4 coefficients computed from the fractional parts of $f(x) = (u(x), v(x))$.
In effect, the 4x4 kernel $k^l_\mathbf{r}$ is interpolated according to the optical flow.
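A minimal sketch of the forward pass for one output pixel, assuming the layer computes $\hat{I}(x) = \sum_{\mathbf{r}} k^l_\mathbf{r}(x)\, k^d_\mathbf{r}(x)\, I(x + \lfloor f(x) \rfloor + \mathbf{r})$ with $k^d_\mathbf{r}$ chosen by the quadrant of $\mathbf{r}$. This formulation, the grayscale image, the border clamping and the names `bilinear_coeff` and `adaptive_warp_pixel` are assumptions for illustration. Taking $\mathbf{r} \in [-R+1, R]^2$ literally gives $2R$ taps per axis, so the sketch uses `R=2` to obtain the 4x4 window described above.

```python
import math

def bilinear_coeff(theta_u, theta_v, ru, rv):
    """k^d_r: bilinear weight selected by the quadrant of the offset r = (ru, rv)."""
    wu = (1.0 - theta_u) if ru <= 0 else theta_u
    wv = (1.0 - theta_v) if rv <= 0 else theta_v
    return wu * wv

def adaptive_warp_pixel(img, x, y, u, v, learned, R=2):
    """Output value at integer pixel (x, y) for flow (u, v) and a 2R x 2R learned kernel."""
    h, w = len(img), len(img[0])
    fu, fv = math.floor(u), math.floor(v)      # integer part moves the window
    theta_u, theta_v = u - fu, v - fv          # fractional part sets k^d
    out = 0.0
    for j, rv in enumerate(range(-R + 1, R + 1)):
        for i, ru in enumerate(range(-R + 1, R + 1)):
            yy = min(max(y + fv + rv, 0), h - 1)   # clamp at borders (assumption)
            xx = min(max(x + fu + ru, 0), w - 1)
            out += learned[j][i] * bilinear_coeff(theta_u, theta_v, ru, rv) * img[yy][xx]
    return out

ramp = [[10.0 * x for x in range(6)] for _ in range(4)]
# Ones on the inner 2x2 (offsets 0 and 1), zeros elsewhere.
inner_only = [[1.0 if (i in (1, 2) and j in (1, 2)) else 0.0 for i in range(4)]
              for j in range(4)]
value = adaptive_warp_pixel(ramp, x=2, y=1, u=0.25, v=0.5, learned=inner_only)
```

With `inner_only` the sketch reduces to plain bilinear sampling at $x + f(x)$ (here `value` is 22.5, the ramp evaluated at 2.25); a learned kernel would reweight all 16 taps instead.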
Backward pass
Compute the gradients with respect to the optical flow and the interpolation kernels, respectively. I do not understand how to differentiate the $I(x+f(x)+r)$ part… It seems the network does not learn optical flow estimation from scratch but uses a pre-trained model directly; in that case the adaptive warping layer only learns/fine-tunes the optical flow transform used by the kernel layer. According to the hyper-parameter settings, the pre-trained model is only fine-tuned with a low learning rate.
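One possible answer to the differentiation question, offered as a sketch under assumptions rather than the paper's exact derivation: suppose the forward pass samples the image at the integer location $x + \lfloor f(x) \rfloor + \mathbf{r}$ and moves the fractional part of the flow, $\theta(u) = u - \lfloor u \rfloor$, into the bilinear coefficient $k^d_\mathbf{r}$. The image term is then piecewise constant in $f(x)$, so the gradient reaches the flow only through $k^d_\mathbf{r}$, with $\mathbf{r} = (r_u, r_v)$:

$$
\hat{I}(x) = \sum_{\mathbf{r}} k^l_\mathbf{r}(x)\, k^d_\mathbf{r}(x)\, I(x + \lfloor f(x) \rfloor + \mathbf{r})
$$

$$
\frac{\partial \hat{I}(x)}{\partial u(x)} = \sum_{\mathbf{r}} k^l_\mathbf{r}(x)\, \frac{\partial k^d_\mathbf{r}}{\partial u}\, I(x + \lfloor f(x) \rfloor + \mathbf{r}),
\qquad
\frac{\partial k^d_\mathbf{r}}{\partial u} =
\begin{cases}
-(1 - \theta(v)) & r_u \le 0,\ r_v \le 0 \\
-\theta(v) & r_u \le 0,\ r_v > 0 \\
1 - \theta(v) & r_u > 0,\ r_v \le 0 \\
\theta(v) & r_u > 0,\ r_v > 0
\end{cases}
$$

$$
\frac{\partial \hat{I}(x)}{\partial k^l_\mathbf{r}(x)} = k^d_\mathbf{r}(x)\, I(x + \lfloor f(x) \rfloor + \mathbf{r})
$$

The derivative with respect to $v(x)$ follows by symmetry. Under this reading the flow network receives a usable gradient almost everywhere (the exception being integer flow values, where $\lfloor \cdot \rfloor$ jumps), which is compatible with fine-tuning a pre-trained flow model at a low learning rate.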
Let $f_{t_x \to t_y}(x)$ be the motion vector field of coordinate $x$ from frame $I_{t_x}$ to frame $I_{t_y}$.
Given $f_{t-1 \to t+1}(y)$, find $f_{t \to t-1}(x)$ and $f_{t \to t+1}(x)$.
Project the flow to time $t$, filling holes with an outside-in strategy.
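A minimal 1D sketch of such a projection, assuming each source pixel $y$ is carried to $x = \mathrm{round}(y + t \cdot f_{t-1 \to t+1}(y))$, the flows landing on the same $x$ are averaged and scaled by $-t$, and holes are filled from the nearest valid neighbours on each side. The function name `project_flow_1d`, the 1D simplification and the zero fallback are hypothetical choices, not taken from the paper's code.

```python
import math

def project_flow_1d(flow_fwd, t=0.5):
    """Approximate f_{t -> t-1} on the pixel grid of frame t from f_{t-1 -> t+1}."""
    n = len(flow_fwd)
    sums, counts = [0.0] * n, [0] * n
    for y, f in enumerate(flow_fwd):
        x = math.floor(y + t * f + 0.5)   # pixel that y passes through at time t
        if 0 <= x < n:
            sums[x] += f
            counts[x] += 1
    # Average all flows that landed on x, then reverse and scale them.
    proj = [(-t * sums[x] / counts[x]) if counts[x] else None for x in range(n)]

    # Outside-in hole filling: walk outward to the nearest valid value on each side.
    filled = list(proj)
    for x in range(n):
        if proj[x] is None:
            near = []
            for step in (-1, 1):
                i = x + step
                while 0 <= i < n and proj[i] is None:
                    i += step
                if 0 <= i < n:
                    near.append(proj[i])
            filled[x] = sum(near) / len(near) if near else 0.0  # fallback is an assumption
    return filled

uniform = project_flow_1d([2.0, 2.0, 2.0, 2.0, 2.0])
```

Here every pixel moves 2 px to the right between $t-1$ and $t+1$, so position 0 at time $t$ receives no flow and is filled from its right neighbour. Under the same assumptions, $f_{t \to t+1}$ would use the factor $(1 - t)$ instead of $-t$, and a 2D version would search in four directions rather than two.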
All 4 branches take $I_{t-1}, I_{t+1}$ as input.
module | function/output | architecture |
---|---|---|
Motion estimation | estimate optical flow | FlowNetS |
Kernel estimation | 2 × $R^2$ coefficient maps ($R^2 = R \times R$) | U-Net |
Mask estimation | 2-channel feature map | U-Net |
Context extraction | contextual features, warped by the adaptive warping layer and fed to post-processing | ResNet18 (for MEMC-Net*) |
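How the 2-channel mask output could be used to blend the two warped frames is sketched below. It assumes one non-negative weight per input frame and per pixel, normalized so the weights sum to 1; the normalization, the grayscale images and the name `blend_with_masks` are assumptions for illustration and may differ from the paper.

```python
def blend_with_masks(warped_prev, warped_next, mask_prev, mask_next, eps=1e-8):
    """Per-pixel weighted average of two warped frames using occlusion masks."""
    h, w = len(warped_prev), len(warped_prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            m0, m1 = mask_prev[y][x], mask_next[y][x]
            total = m0 + m1 + eps  # eps avoids division by zero where both masks vanish
            out[y][x] = (m0 * warped_prev[y][x] + m1 * warped_next[y][x]) / total
    return out

blended = blend_with_masks(
    warped_prev=[[10.0, 10.0]],
    warped_next=[[20.0, 20.0]],
    mask_prev=[[1.0, 0.0]],   # second pixel is occluded in the previous frame
    mask_next=[[1.0, 1.0]],
)
```

The first pixel is visible in both frames and gets their average; the second is taken only from the next frame. Pixels where both masks are near zero would be left to the post-processing CNN.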