Video-denoising_tensorflow 2.0

1) Inspired by the Transformer network from the paper "Attention Is All You Need" by Vaswani et al. (https://arxiv.org/pdf/1706.03762.pdf) and by the U-Net used for video denoising in FastDVDnet (https://arxiv.org/pdf/1907.01361.pdf).

2) Following an architecture similar to "Attention Is All You Need", the model broadly consists of an ENCODER and a DECODER.

3) The encoder takes five noisy video frames as input, [t-2, t-1, t, t+1, t+2], where t is the noisy frame we aim to denoise.

4) The decoder takes the two previous ground-truth denoised reference frames, [T-2, T-1], and the aim of the whole model is to predict the T-th frame.

5) We also concatenate the corresponding encoder and decoder feature maps so that the model generalises well.

6) The main differences between our proposed architecture and the current state of the art are the use of the T-2 and T-1 ground-truth denoised reference frames and the concatenation of encoder and decoder feature maps.

7) The loss function is the L2 loss (see the input/loss sketch after this list).
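
The following is a minimal sketch (not taken from the repository code) of how the inputs from points 3 and 4 and the L2 loss from point 7 can be expressed in TensorFlow 2. The helper names make_inputs and l2_loss are illustrative assumptions.

```python
import tensorflow as tf

def make_inputs(noisy_video, clean_video, t):
    """Slice out the five noisy frames around t and the two previous
    ground-truth reference frames. Both videos are [time, H, W, 3] tensors."""
    noisy_window = [noisy_video[t + k] for k in (-2, -1, 0, 1, 2)]  # [t-2 .. t+2]
    reference = [clean_video[t - 2], clean_video[t - 1]]            # [T-2, T-1]
    target = clean_video[t]                                         # frame T to predict
    return noisy_window, reference, target

def l2_loss(prediction, target):
    """L2 (mean squared error) loss between the predicted and ground-truth frame."""
    return tf.reduce_mean(tf.square(prediction - target))
```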


MODEL

The input to the encoder is the five frames [t-2, t-1, t, t+1, t+2], where t is the noisy frame we aim to denoise. The input to the decoder is the two previous denoised ground-truth frames [T-2, T-1], and we aim to predict the present denoised frame T.

In the encoder, taking inspiration from other architectures, we first perform SPATIAL denoising by passing each frame through an 8-layer convolutional network whose parameters are shared across the five noisy frames. Let us call the output for t-2 P-2, that for t-1 P-1, and that for t P. The outputs of the five frames are concatenated to form a 15-channel feature map (3 * 5), which is then passed through a U-Net that first downsamples and then upsamples. The U-Net used here differs from the conventional one in that its skip connections use addition instead of concatenation. This U-Net outputs 3 feature maps.

Meanwhile, on the decoder side, the two ground-truth denoised reference frames [T-2, T-1] are passed through a simple 4-layer convolutional network. As in the encoder, the parameters are shared for the two frames, and each output is concatenated with the corresponding encoder output of [t-2, t-1], namely P-2 and P-1 (obtained on the encoder side after passing t-2 and t-1 through the 8-layer convolutional network). The concatenated outputs are then passed through a 10-layer convolutional network whose parameters are again shared. All the decoder outputs are then concatenated together with the P output and passed through a U-Net; the input to this U-Net is 9 channels and the output is 3 channels. This is how the decoder works.

Finally, the decoder output (3 channels) and the encoder output (3 channels) are concatenated with the P output and the [T-2, T-1] frames, and the resulting 15-channel feature map is passed through a 12-layer network to obtain the final denoised frame T. The loss is then computed and the model is trained end to end. We also use different learning rates for different sections of the model so that vanishing gradients are never a problem.
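
Below is a minimal TensorFlow 2 / Keras sketch of the model described above. Layer widths, the patch size, and helper names such as conv_stack and additive_unet are assumptions made for illustration; the repository's actual hyper-parameters and layer definitions may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_stack(num_layers, filters=32, out_channels=3, name=None):
    """A stack of 3x3 convolutions; calling the same instance on several
    inputs shares its parameters across them."""
    block = tf.keras.Sequential(name=name)
    for _ in range(num_layers - 1):
        block.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
    block.add(layers.Conv2D(out_channels, 3, padding="same"))
    return block

def additive_unet(x, filters=32, out_channels=3):
    """U-Net-style block whose skip connections use addition, not concatenation."""
    d1 = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    d2 = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(d1)
    u1 = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(d2)
    u1 = layers.Add()([u1, d1])                       # addition instead of concatenation
    u2 = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(u1)
    return layers.Conv2D(out_channels, 3, padding="same")(u2)

H = W = 128  # assumed training patch size

# Encoder inputs: five noisy frames [t-2, t-1, t, t+1, t+2]
noisy = [layers.Input((H, W, 3), name=f"noisy_{i}") for i in range(5)]
# Decoder inputs: two previously denoised reference frames [T-2, T-1]
refs = [layers.Input((H, W, 3), name=f"ref_{i}") for i in range(2)]

# Spatial denoising: 8-layer CNN, weights shared across the five noisy frames
spatial_cnn = conv_stack(8, name="spatial_cnn")
P = [spatial_cnn(f) for f in noisy]                   # P[0]=P-2, P[1]=P-1, P[2]=P, ...

# Encoder U-Net over the concatenated 5 * 3 = 15-channel map -> 3 channels out
enc_out = additive_unet(layers.Concatenate()(P))

# Decoder: 4-layer CNN, weights shared across the two reference frames
ref_cnn = conv_stack(4, name="ref_cnn")
R = [ref_cnn(r) for r in refs]                        # R[0] from T-2, R[1] from T-1

# Fuse each reference feature with the matching encoder feature (P-2, P-1),
# then a 10-layer CNN with shared weights
fuse_cnn = conv_stack(10, name="fuse_cnn")
D = [fuse_cnn(layers.Concatenate()([R[0], P[0]])),
     fuse_cnn(layers.Concatenate()([R[1], P[1]]))]

# Decoder U-Net over [D-2, D-1, P]: 9 channels in, 3 channels out
dec_out = additive_unet(layers.Concatenate()(D + [P[2]]))

# Final 12-layer network over encoder output, decoder output, P, T-2 and T-1
# (15 channels) to reconstruct the denoised frame T
fused = layers.Concatenate()([enc_out, dec_out, P[2], refs[0], refs[1]])
denoised = conv_stack(12, name="reconstruction")(fused)

model = Model(inputs=noisy + refs, outputs=denoised)
model.compile(optimizer="adam", loss="mse")           # L2 loss
```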
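
The last sentence above mentions different learning rates for different sections of the model. One way to do this in TensorFlow 2 (an assumption, not necessarily the repository's exact code) is to keep one optimizer per section and apply each to that section's variables in a custom training step:

```python
import tensorflow as tf

encoder_opt = tf.keras.optimizers.Adam(1e-4)   # e.g. spatial CNN + encoder U-Net
decoder_opt = tf.keras.optimizers.Adam(5e-4)   # e.g. decoder CNNs + final network

@tf.function
def train_step(model, encoder_vars, decoder_vars, inputs, target):
    with tf.GradientTape() as tape:
        prediction = model(inputs, training=True)
        loss = tf.reduce_mean(tf.square(prediction - target))   # L2 loss
    enc_grads, dec_grads = tape.gradient(loss, [encoder_vars, decoder_vars])
    encoder_opt.apply_gradients(zip(enc_grads, encoder_vars))
    decoder_opt.apply_gradients(zip(dec_grads, decoder_vars))
    return loss
```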
