
Split-Deconvolution - An implementation of the accelerated deconvolution operation

This repo mainly includes two parts. In the first part, we present a Python script that converts deconvolution to standard convolution. We then compare the results calculated using split deconvolution with those calculated using TensorFlow's transpose convolution function. The comparison verifies the correctness of split deconvolution.

In the second part, we provide detailed steps to deploy split deconvolution on the Google Edge TPU and the Intel NCS2. Meanwhile, we compare the runtime of the baseline deconvolution and the split deconvolution, which demonstrates the significant performance speedup of split deconvolution over the baseline implementation. Since native deconvolution (transpose convolution) is not supported on the Google Edge TPU, we implement the baseline deconvolution using the well-known zero padding of input activations, which is also included in the repo.

Paper preprint: https://arxiv.org/abs/1907.01773. Accepted by IEEE Transactions on Computers, 2020.

Prerequisites

  • Pillow
  • matplotlib
  • tensorflow 1.12.0

Split deconvolution verification

To execute the neural networks of this repo with either TensorFlow or split deconvolution, you need to provide:

  • Network configuration (.csv)
  • Model parameters (.npy)

Both are included in the repo (./networks_configuration/ and ./raw_data/).

1. Calculate deconvolution using the transpose convolution function in TensorFlow:

python Setup.py --model DCGAN --mode tf_deconv

2. Convert deconvolution to standard convolution using the proposed split deconvolution:

python Setup.py --model DCGAN --mode split_deconv

3. Compare the results:

python Setup.py --model DCGAN --mode verify
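
For intuition on what the verification checks, the sketch below is independent of Setup.py and uses made-up shapes. It demonstrates the zero-insertion equivalence that the baseline implementation relies on: a stride-s transpose convolution equals a stride-1 standard convolution over the zero-dilated, zero-padded input with the spatially flipped (and channel-swapped) filter.

```python
import numpy as np
import tensorflow as tf

n, k, s = 4, 3, 2        # input size, filter size, stride (made-up)
cin, cout = 8, 16        # channel counts (made-up)

x = np.random.randn(1, n, n, cin).astype(np.float32)
# tf.nn.conv2d_transpose filters are laid out [k, k, out_ch, in_ch]
w = np.random.randn(k, k, cout, cin).astype(np.float32)

# Reference: native transpose convolution with VALID padding
out_size = s * (n - 1) + k
ref = tf.nn.conv2d_transpose(x, w, output_shape=[1, out_size, out_size, cout],
                             strides=[1, s, s, 1], padding='VALID')

# Baseline path: insert (s-1) zeros between input pixels, pad k-1 on each
# border, then run a stride-1 conv2d with the flipped filter rearranged
# into the standard [k, k, in_ch, out_ch] layout
dilated = np.zeros((1, s * (n - 1) + 1, s * (n - 1) + 1, cin), np.float32)
dilated[0, ::s, ::s, :] = x[0]
padded = np.pad(dilated, [(0, 0), (k - 1, k - 1), (k - 1, k - 1), (0, 0)],
                mode='constant')
w_conv = np.transpose(w[::-1, ::-1, :, :], (0, 1, 3, 2))
conv = tf.nn.conv2d(padded, w_conv, strides=[1, 1, 1, 1], padding='VALID')

with tf.Session() as sess:
    a, b = sess.run([ref, conv])
print(np.allclose(a, b, atol=1e-4))  # True: both paths agree
```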

Performance comparison

All the models can be downloaded from the Dropbox link https://www.dropbox.com/sh/qenwhtupqkfsezv/AACRLlnFvzCe2VvkXCC3DChJa?dl=0.
To compare the deconvolution implementations on the Google Edge TPU and NCS2, we implement one deconvolution layer using both the baseline deconvolution and the split deconvolution. The two implementations are stored as two .pb files, as required by the Google Edge TPU, and two .onnx files for NCS2.

The reason that we do not implement the whole neural network is that the converted deconvolution, whether based on zero padding or split deconvolution, needs output reorganization before the computation of the next layer. The output reorganization itself is trivial: it essentially writes the output data from the on-chip buffers to scattered but sequential locations in DRAM, which differs only slightly from a conventional sequential write-back. A DMA module is required by essentially any CNN processor in order to write outputs back to external memory; however, it is not exposed to users, so we have no choice but to perform the reorganization on the host. The overhead of this host-side reorganization can be measured, and it is negligible according to our experiments. The whole dataflow has also been verified on our in-house AI chip, though that chip is not on the market yet. We are working toward an open-sourced FPGA version with a netlist; it will appear soon, and we will announce more experiments later.
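
To make the reorganization concrete, here is a minimal sketch (ours, not code from the repo) of how the s x s sub-outputs produced by split deconvolution interleave into the final deconvolution output, assuming the sub-outputs arrive in row-major order of their phase offsets:

```python
import numpy as np

def reorganize(sub_outputs, s=2):
    """Interleave the s*s split-deconvolution outputs, each of shape
    [1, H, W, C], onto the final [1, s*H, s*W, C] grid."""
    _, h, w, c = sub_outputs[0].shape
    out = np.zeros((1, s * h, s * w, c), dtype=sub_outputs[0].dtype)
    for idx, sub in enumerate(sub_outputs):
        i, j = divmod(idx, s)
        # scattered but regularly strided writes, as described above
        out[:, i::s, j::s, :] = sub
    return out
```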

  1. Deconvolution execution on Google Edge TPU
    The inference times of the baseline deconvolution (Dropbox: /models/example_models/tf_model.pb) and the split deconvolution (Dropbox: /models/example_models/sd_mdoel.pb) on the Edge TPU are 104.88 ms and 49.85 ms, respectively. Note that the output reorganization time is included in the measurement.
    Detailed instructions to deploy the models on the Google Edge TPU can be found in the official documentation: https://coral.withgoogle.com/docs/accelerator/get-started/
    The experimental models used in the paper are stored in Dropbox: /models/experimental_models/TPU_models/*. We found that there are two kinds of .tflite files that can be run on the Edge TPU; the latest performance of each is listed below.

    a) .tflite

    | Benchmarks | NZP (ms) | SD (ms) | Speedup |
    |------------|----------|---------|---------|
    | DCGAN      | 26.76    | 9.32    | 2.87x   |
    | SNGAN      | 24.02    | 6.76    | 3.55x   |
    | FST        | 149.80   | 72.29   | 2.06x   |
    | ArtGAN     | 93.08    | 26.94   | 3.46x   |
    | GP-GAN     | 28.14    | 7.82    | 3.60x   |
    | MDE        | 204.31   | 96.30   | 2.12x   |

    b) _edgetpu.tflite

    | Benchmarks | NZP (ms) | SD (ms)  | Speedup |
    |------------|----------|----------|---------|
    | DCGAN      | 110.54   | 74.09    | 1.50x   |
    | SNGAN      | 85.26    | 57.21    | 1.50x   |
    | FST        | 2133.42  | 1289.39  | 1.65x   |
    | ArtGAN     | 465.55   | 355.89   | 1.31x   |
    | GP-GAN     | 177.24   | 108.60   | 1.63x   |
    | MDE        | 2308.84  | 1673.15  | 1.38x   |

    Similar to NCS2, we compare the computing efficiency of convolution with different input feature map sizes and filter sizes; the trend matches that of NCS2 described in the paper. (A small helper for deriving the MAC counts behind GMACPS follows this list.)
    The filter size is fixed to 3 x 3:

    | Feature map size | # of input channels | # of output channels | Normalized GMACPS |
    |------------------|---------------------|----------------------|-------------------|
    | 8 x 8            | 256                 | 128                  | 1x                |
    | 16 x 16          | 256                 | 128                  | 1.32x             |
    | 32 x 32          | 256                 | 128                  | 1.76x             |
    | 64 x 64          | 256                 | 128                  | 1.88x             |
    | 128 x 128        | 256                 | 128                  | 1.98x             |

    The feature map size is fixed to 128 x 128:

    | Filter size | # of input channels | # of output channels | Normalized GMACPS |
    |-------------|---------------------|----------------------|-------------------|
    | 2 x 2       | 256                 | 128                  | 1x                |
    | 3 x 3       | 256                 | 128                  | 2.14x             |
    | 4 x 4       | 256                 | 128                  | 3.64x             |
    | 5 x 5       | 256                 | 128                  | 5.22x             |
  2. Deconvolution execution on NCS2
    The inference times of the baseline deconvolution (Dropbox: /models/example_models/transpose_conv.onnx) and the split deconvolution (Dropbox: /models/example_models/split_deconv.onnx) on NCS2 are 30.713 ms and 29.384 ms, respectively. Note that the output reorganization time is included in the measurement. Detailed instructions can be found at https://software.intel.com/en-us/articles/get-started-with-neural-compute-stick
    The experimental models used in the paper are stored in Dropbox: /models/experimental_models/NCS2_models/*. The latest performance is listed below.

    | Benchmarks | Transpose_Conv (ms) | SD (ms) | Speedup |
    |------------|---------------------|---------|---------|
    | DCGAN      | 94.25               | 89.52   | 1.05x   |
    | SNGAN      | 95.08               | 82.02   | 1.16x   |
    | FST        | 784.79              | 727.48  | 1.08x   |
    | ArtGAN     | 107.92              | 98.92   | 1.10x   |
    | GP-GAN     | 126.04              | 111.58  | 1.13x   |
    | MDE        | 562.48              | 500.41  | 1.12x   |
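
For reference when reading the Normalized GMACPS columns above (we take GMACPS to mean giga multiply-accumulate operations per second), the MAC count of a convolution layer can be derived as in this small helper, which assumes unit stride and SAME padding so the output spatial size equals the input:

```python
def conv_macs(fmap, cin, cout, k):
    # MACs of a unit-stride SAME convolution over an fmap x fmap input
    return fmap * fmap * cin * cout * k * k

macs = conv_macs(128, 256, 128, 3)  # ~4.83e9 MACs (the 128 x 128, 3 x 3 row)
# A layer's GMACPS is then macs / measured_latency_in_seconds / 1e9
```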


Important Note
The performance of the Edge TPU and NCS2 is measured with the Python time module, which is not precise: the measured running time of a single layer can vary from 10 ms to 50 ms depending on the concurrent activity of the host.
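
A rough version of the measurement loop is sketched below (assuming a TensorFlow release that exposes tf.lite.Interpreter; the model path is a placeholder, and the _edgetpu.tflite variants additionally need the Edge TPU runtime per the Coral docs). Averaging over many invocations smooths out the host-side jitter described above.

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model.tflite')  # placeholder
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
x = np.random.random_sample(tuple(inp['shape'])).astype(inp['dtype'])

runs = 100
start = time.time()
for _ in range(runs):
    interpreter.set_tensor(inp['index'], x)
    interpreter.invoke()
    _ = interpreter.get_tensor(out['index'])
print('average latency: %.2f ms' % ((time.time() - start) / runs * 1e3))
```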

To help users experiment with their own data, we also provide some auxiliary functions in utils/utils.py:
generate_input() generates input for DCGAN.
filter_split() splits and converts the original deconvolution filter (see the sketch below).
insert_zeros() inserts zeros into the input feature maps for the baseline zero-padding deconvolution.
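
For the gist of filter_split() without opening the file, here is an independent sketch of the splitting idea (the real function's signature and layout conventions may differ): a stride-s deconvolution filter in the [k, k, out_ch, in_ch] layout of tf.nn.conv2d_transpose is decomposed into s*s standard-convolution filters, one per output phase. Each sub-filter gathers every s-th tap and is spatially flipped so that tf.nn.conv2d's cross-correlation reproduces the deconvolution arithmetic; the s*s outputs then interleave as in the reorganization sketch above (up to boundary cropping that depends on the padding scheme).

```python
import numpy as np

def split_filter(w, s):
    """Split a [k, k, out_ch, in_ch] deconvolution filter into s*s
    [ceil(k/s), ceil(k/s), in_ch, out_ch] convolution filters."""
    k = w.shape[0]
    pad = (-k) % s                                   # pad k up to a multiple of s
    w = np.pad(w, [(0, pad), (0, pad), (0, 0), (0, 0)], mode='constant')
    subs = []
    for i in range(s):
        for j in range(s):
            sub = w[i::s, j::s, :, :][::-1, ::-1, :, :]   # gather phase, flip
            subs.append(np.transpose(sub, (0, 1, 3, 2)))  # -> conv2d layout
    return subs
```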

Related Papers

  1. Fcn-engine: Accelerating Deconvolutional Layers in Classic CNN Processors (ICCAD'18)
  2. Accelerating Generative Neural Networks on Unmodified Deep Learning Processors - A Software Approach (IEEE TC)

If you find Split Deconvolution useful in your research, please consider citing the papers:

@inproceedings{xu2018fcn,
  title={Fcn-engine: Accelerating deconvolutional layers in classic cnn processors},
  author={Xu, Dawen and Tu, Kaijie and Wang, Ying and Liu, Cheng and He, Bingsheng and Li, Huawei},
  booktitle={Proceedings of the International Conference on Computer-Aided Design},
  pages={22},
  year={2018},
  organization={ACM}
}

@article{xu2019accelerating,
  title={Accelerating Generative Neural Networks on Unmodified Deep Learning Processors -- A Software Approach},
  author={Dawen Xu and Ying Wang and Kaijie Tu and Cheng Liu and Bingsheng He and Lei Zhang},
  journal={IEEE Transactions on Computers},
  year={2020}
}
