
Split-Deconvolution - An implementation of the accelerated deconvolution operation

This repo mainly includes two parts. In the first part, we present a Python script that converts deconvolution to standard convolution. We then compare the results calculated using split deconvolution with those calculated using TensorFlow's transpose convolution function. The comparison verifies the correctness of split deconvolution.

In the second part, we provide detailed steps to deploy split deconvolution on the Google Edge TPU and the Intel NCS2. Meanwhile, we compare the runtime of the baseline deconvolution and the split deconvolution, which demonstrates the significant performance speedup of split deconvolution over the baseline implementation. Since native deconvolution (transpose convolution) is not supported on the Google Edge TPU, we implement the baseline deconvolution using the well-known zero padding of input activations, which is also included in the repo.

Paper preprint: https://arxiv.org/abs/1907.01773. Accepted by IEEE Transactions on Computers, 2020.

Prerequisites

  • Pillow
  • matplotlib
  • tensorflow 1.12.0

Split deconvolution verification

To execute the neural networks of this repo with either TensorFlow or split deconvolution, you need to provide:

  • Network configuration (.csv)
  • Model parameters (.npy)

Both are included in the repo (./networks_configuration/ and ./raw_data/).

1. Calculate deconvolution using the transpose convolution function in TensorFlow:

python Setup.py --model DCGAN --mode tf_deconv

2. Convert deconvolution to standard convolution using the proposed split deconvolution:

python Setup.py --model DCGAN --mode split_deconv

3. Compare the results:

python Setup.py --model DCGAN --mode verify
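
For intuition on what the verification checks, the sketch below is independent of Setup.py and uses made-up shapes. It demonstrates the zero-insertion equivalence that the baseline implementation relies on: a stride-s transpose convolution equals a stride-1 standard convolution over the zero-dilated, zero-padded input with the spatially flipped (and channel-swapped) filter.

```python
import numpy as np
import tensorflow as tf

n, k, s = 4, 3, 2        # input size, filter size, stride (made-up)
cin, cout = 8, 16        # channel counts (made-up)

x = np.random.randn(1, n, n, cin).astype(np.float32)
# tf.nn.conv2d_transpose filters are laid out [k, k, out_ch, in_ch]
w = np.random.randn(k, k, cout, cin).astype(np.float32)

# Reference: native transpose convolution with VALID padding
out_size = s * (n - 1) + k
ref = tf.nn.conv2d_transpose(x, w, output_shape=[1, out_size, out_size, cout],
                             strides=[1, s, s, 1], padding='VALID')

# Baseline path: insert (s-1) zeros between input pixels, pad k-1 on each
# border, then run a stride-1 conv2d with the flipped filter rearranged
# into the standard [k, k, in_ch, out_ch] layout
dilated = np.zeros((1, s * (n - 1) + 1, s * (n - 1) + 1, cin), np.float32)
dilated[0, ::s, ::s, :] = x[0]
padded = np.pad(dilated, [(0, 0), (k - 1, k - 1), (k - 1, k - 1), (0, 0)],
                mode='constant')
w_conv = np.transpose(w[::-1, ::-1, :, :], (0, 1, 3, 2))
conv = tf.nn.conv2d(padded, w_conv, strides=[1, 1, 1, 1], padding='VALID')

with tf.Session() as sess:
    a, b = sess.run([ref, conv])
print(np.allclose(a, b, atol=1e-4))  # True: both paths agree
```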

Performance comparison

All the models can be downloaded from the Dropbox link https://www.dropbox.com/sh/qenwhtupqkfsezv/AACRLlnFvzCe2VvkXCC3DChJa?dl=0.
To compare the deconvolution implementations on the Google Edge TPU and NCS2, we implement one deconvolution layer using both the baseline deconvolution and the split deconvolution. The two implementations are stored as two .pb files, as required by the Google Edge TPU, and two .onnx files for NCS2.

The reason that we do not implement the whole neural network is that the converted deconvolution, whether based on zero padding or split deconvolution, needs output reorganization before the computation of the next layer. The output reorganization itself is trivial: it essentially writes the output data from the on-chip buffers to scattered but sequential locations in DRAM, which differs only slightly from a conventional sequential write-back. A DMA module is required by essentially any CNN processor in order to write outputs back to external memory; however, it is not exposed to users, so we have no choice but to perform the reorganization on the host. The overhead of this host-side reorganization can be measured, and it is negligible according to our experiments. The whole dataflow has also been verified on our in-house AI chip, though that chip is not on the market yet. We are working toward an open-sourced FPGA version with a netlist; it will appear soon, and we will announce more experiments later.
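
To make the reorganization concrete, here is a minimal sketch (ours, not code from the repo) of how the s x s sub-outputs produced by split deconvolution interleave into the final deconvolution output, assuming the sub-outputs arrive in row-major order of their phase offsets:

```python
import numpy as np

def reorganize(sub_outputs, s=2):
    """Interleave the s*s split-deconvolution outputs, each of shape
    [1, H, W, C], onto the final [1, s*H, s*W, C] grid."""
    _, h, w, c = sub_outputs[0].shape
    out = np.zeros((1, s * h, s * w, c), dtype=sub_outputs[0].dtype)
    for idx, sub in enumerate(sub_outputs):
        i, j = divmod(idx, s)
        # scattered but regularly strided writes, as described above
        out[:, i::s, j::s, :] = sub
    return out
```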

  1. Deconvolution execution on Google Edge TPU
    The inference times of the baseline deconvolution (Dropbox: /models/example_models/tf_model.pb) and the split deconvolution (Dropbox: /models/example_models/sd_mdoel.pb) on the Edge TPU are 104.88 ms and 49.85 ms, respectively. Note that the output reorganization time is included in the measurement.
    Detailed instructions to deploy the models on the Google Edge TPU can be found in the official documentation: https://coral.withgoogle.com/docs/accelerator/get-started/
    The experimental models used in the paper are stored in Dropbox: /models/experimental_models/TPU_models/*. We found that there are two kinds of .tflite files that can be run on the Edge TPU; the latest performance of each is listed below.

    a) .tflite

    | Benchmarks | NZP (ms) | SD (ms) | Speedup |
    |------------|----------|---------|---------|
    | DCGAN      | 26.76    | 9.32    | 2.87x   |
    | SNGAN      | 24.02    | 6.76    | 3.55x   |
    | FST        | 149.80   | 72.29   | 2.06x   |
    | ArtGAN     | 93.08    | 26.94   | 3.46x   |
    | GP-GAN     | 28.14    | 7.82    | 3.60x   |
    | MDE        | 204.31   | 96.30   | 2.12x   |

    b) _edgetpu.tflite

    | Benchmarks | NZP (ms) | SD (ms)  | Speedup |
    |------------|----------|----------|---------|
    | DCGAN      | 110.54   | 74.09    | 1.50x   |
    | SNGAN      | 85.26    | 57.21    | 1.50x   |
    | FST        | 2133.42  | 1289.39  | 1.65x   |
    | ArtGAN     | 465.55   | 355.89   | 1.31x   |
    | GP-GAN     | 177.24   | 108.60   | 1.63x   |
    | MDE        | 2308.84  | 1673.15  | 1.38x   |

    Similar to NCS2, we compare the computing efficiency of convolution with different input feature map sizes and filter sizes; the trend matches that of NCS2 described in the paper. (A small helper for deriving the MAC counts behind GMACPS follows this list.)
    The filter size is fixed to 3 x 3:

    | Feature map size | # of input channels | # of output channels | Normalized GMACPS |
    |------------------|---------------------|----------------------|-------------------|
    | 8 x 8            | 256                 | 128                  | 1x                |
    | 16 x 16          | 256                 | 128                  | 1.32x             |
    | 32 x 32          | 256                 | 128                  | 1.76x             |
    | 64 x 64          | 256                 | 128                  | 1.88x             |
    | 128 x 128        | 256                 | 128                  | 1.98x             |

    The feature map size is fixed to 128 x 128:

    | Filter size | # of input channels | # of output channels | Normalized GMACPS |
    |-------------|---------------------|----------------------|-------------------|
    | 2 x 2       | 256                 | 128                  | 1x                |
    | 3 x 3       | 256                 | 128                  | 2.14x             |
    | 4 x 4       | 256                 | 128                  | 3.64x             |
    | 5 x 5       | 256                 | 128                  | 5.22x             |
  2. Deconvolution execution on NCS2
    The inference times of the baseline deconvolution (Dropbox: /models/example_models/transpose_conv.onnx) and the split deconvolution (Dropbox: /models/example_models/split_deconv.onnx) on NCS2 are 30.713 ms and 29.384 ms, respectively. Note that the output reorganization time is included in the measurement. Detailed instructions can be found at https://software.intel.com/en-us/articles/get-started-with-neural-compute-stick
    The experimental models used in the paper are stored in Dropbox: /models/experimental_models/NCS2_models/*. The latest performance is listed below.

    | Benchmarks | Transpose_Conv (ms) | SD (ms) | Speedup |
    |------------|---------------------|---------|---------|
    | DCGAN      | 94.25               | 89.52   | 1.05x   |
    | SNGAN      | 95.08               | 82.02   | 1.16x   |
    | FST        | 784.79              | 727.48  | 1.08x   |
    | ArtGAN     | 107.92              | 98.92   | 1.10x   |
    | GP-GAN     | 126.04              | 111.58  | 1.13x   |
    | MDE        | 562.48              | 500.41  | 1.12x   |
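
For reference when reading the Normalized GMACPS columns above (we take GMACPS to mean giga multiply-accumulate operations per second), the MAC count of a convolution layer can be derived as in this small helper, which assumes unit stride and SAME padding so the output spatial size equals the input:

```python
def conv_macs(fmap, cin, cout, k):
    # MACs of a unit-stride SAME convolution over an fmap x fmap input
    return fmap * fmap * cin * cout * k * k

macs = conv_macs(128, 256, 128, 3)  # ~4.83e9 MACs (the 128 x 128, 3 x 3 row)
# A layer's GMACPS is then macs / measured_latency_in_seconds / 1e9
```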


Important Note
The performance of the Edge TPU and NCS2 is measured with the Python time module, which is not precise: the measured running time of a single layer can vary from 10 ms to 50 ms depending on the concurrent activity of the host.
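
A rough version of the measurement loop is sketched below (assuming a TensorFlow release that exposes tf.lite.Interpreter; the model path is a placeholder, and the _edgetpu.tflite variants additionally need the Edge TPU runtime per the Coral docs). Averaging over many invocations smooths out the host-side jitter described above.

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model.tflite')  # placeholder
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
x = np.random.random_sample(tuple(inp['shape'])).astype(inp['dtype'])

runs = 100
start = time.time()
for _ in range(runs):
    interpreter.set_tensor(inp['index'], x)
    interpreter.invoke()
    _ = interpreter.get_tensor(out['index'])
print('average latency: %.2f ms' % ((time.time() - start) / runs * 1e3))
```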

To help users experiment with their own data, we also provide some auxiliary functions in utils/utils.py:
generate_input() generates input for DCGAN.
filter_split() splits and converts the original deconvolution filter (see the sketch below).
insert_zeros() inserts zeros into the input feature maps for the baseline zero-padding deconvolution.
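
For the gist of filter_split() without opening the file, here is an independent sketch of the splitting idea (the real function's signature and layout conventions may differ): a stride-s deconvolution filter in the [k, k, out_ch, in_ch] layout of tf.nn.conv2d_transpose is decomposed into s*s standard-convolution filters, one per output phase. Each sub-filter gathers every s-th tap and is spatially flipped so that tf.nn.conv2d's cross-correlation reproduces the deconvolution arithmetic; the s*s outputs then interleave as in the reorganization sketch above (up to boundary cropping that depends on the padding scheme).

```python
import numpy as np

def split_filter(w, s):
    """Split a [k, k, out_ch, in_ch] deconvolution filter into s*s
    [ceil(k/s), ceil(k/s), in_ch, out_ch] convolution filters."""
    k = w.shape[0]
    pad = (-k) % s                                   # pad k up to a multiple of s
    w = np.pad(w, [(0, pad), (0, pad), (0, 0), (0, 0)], mode='constant')
    subs = []
    for i in range(s):
        for j in range(s):
            sub = w[i::s, j::s, :, :][::-1, ::-1, :, :]   # gather phase, flip
            subs.append(np.transpose(sub, (0, 1, 3, 2)))  # -> conv2d layout
    return subs
```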

Related Papers

  1. Fcn-engine: Accelerating Deconvolutional Layers in Classic CNN Processors (ICCAD'18)
  2. Accelerating Generative Neural Networks on Unmodified Deep Learning Processors - A Software Approach (IEEE TC)

If you find Split Deconvolution useful in your research, please consider citing the papers:

@inproceedings{xu2018fcn,
  title={Fcn-engine: Accelerating deconvolutional layers in classic cnn processors},
  author={Xu, Dawen and Tu, Kaijie and Wang, Ying and Liu, Cheng and He, Bingsheng and Li, Huawei},
  booktitle={Proceedings of the International Conference on Computer-Aided Design},
  pages={22},
  year={2018},
  organization={ACM}
}

@article{xu2019accelerating,
  title={Accelerating Generative Neural Networks on Unmodified Deep Learning Processors -- A Software Approach},
  author={Dawen Xu and Ying Wang and Kaijie Tu and Cheng Liu and Bingsheng He and Lei Zhang},
  journal={IEEE Transactions on Computers},
  year={2020}
}
