This repository is the official release of the code for the paper "FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture", published at the 13th Asian Conference on Computer Vision (ACCV 2016).


FuseNet is a general deep convolutional neural network (CNN) architecture for training on datasets of RGB-D images. It can be used for semantic segmentation, scene classification and other applications. The implementation is based on the BVLC/caffe framework.



The code is compatible with an early Caffe version from June 2016. It was developed under Ubuntu 16.04 with CUDA 7.5 and cuDNN v5.0. If you use the program under another Ubuntu release, you may need to comment out lines 72-73 in the root CMakeLists.txt file. If you compile on another OS, please use Google as your friend. We mostly test the program on an Nvidia Titan X GPU. Please note that multi-GPU training is supported.

git clone
mkdir build && cd build
cmake ..
make -j10
make runtest -j10

Training and Testing

We provide all the Python scripts and prototxt files needed to reproduce our published results under ./fusenet/. A short guideline is given below. For more detailed instructions, check here.


Our network architecture is based on VGGNet-16layer. However, since we have extra input channel for depth, we provide the compute the

Data preparation

To store the dataset, we save paired RGB-D images into LMDB. We also scale the original depth image to the range [0, 255]. Optionally, the scaled depth values can be cast to unsigned char (grayscale) to save memory. If you do not want to lose precision, store the scaled depth as float. To prepare the LMDB, we provide the following Python scripts for your reference. However, you can also write your own image input layer to load paired RGB-D images.
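As an illustration of the depth scaling described above, here is a minimal sketch in plain Python. It assumes per-image min-max scaling; the scripts in this repository may normalize differently (for example with a dataset-wide depth range). The function name `scale_depth` and the toy depth map are hypothetical and only serve the example.

```python
def scale_depth(depth, as_uint8=True):
    """Linearly map a 2D depth image (list of rows) to the range [0, 255].

    Assumption: min-max scaling over this single image.
    """
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant image
    scaled = [[255.0 * (d - lo) / span for d in row] for row in depth]
    if as_uint8:
        # Rounding to 8-bit integers saves memory but discards precision.
        return [[int(round(v)) for v in row] for row in scaled]
    # Keep floats when precision matters.
    return scaled


depth_m = [[0.5, 1.0], [2.0, 4.5]]  # toy depth map, values in metres
depth_u8 = scale_depth(depth_m)
depth_f32 = scale_depth(depth_m, as_uint8=False)
print(depth_u8)
print(depth_f32)
```

The 8-bit variant stores one byte per pixel, while the float variant keeps the fractional part of the scaled value.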

demo   ./fusenet/scripts/

LMDB shuffling

We support LMDB shuffling and recommend shuffling after each epoch during training. To enable this option, set the shuffle flag to true for the DataLayer in the prototxt. Note that we do not support shuffling with LevelDB.

demo   ./fusenet/segmentation/nyuv2_sf1/train.prototxt

Weighted cross-entropy loss

One common technique to handle class imbalance is to give the loss of each class a different weight, typically a higher value for less frequent classes and a lower value for more frequent classes. For semantic segmentation, we support this loss weighting in the SoftmaxWithLossLayer by allowing the user to specify a weight for each label. One way to set the weights is according to the inverse class frequency (see our paper for details). We provide the weights used in our paper in ./fusenet/data/.
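The weighting idea can be sketched as follows. This is not the code used to produce the weights in ./fusenet/data/; it is a minimal illustration that assumes plain inverse pixel frequency, normalized so that the weights of the classes present average to 1. The function names and the toy label list are hypothetical, and the exact scheme in the paper may differ.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / pixel frequency.

    Assumptions: `labels` is a flat iterable of integer class ids, and
    classes that never occur receive weight 0.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    raw = [total / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    present = sum(1 for w in raw if w > 0)
    mean = sum(raw) / present
    return [w / mean for w in raw]


def weighted_cross_entropy(probs, label, weights):
    """Cross-entropy of one pixel, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


labels = [0] * 6 + [1] * 3 + [2] * 1  # class 2 is the rarest
weights = inverse_frequency_weights(labels, num_classes=3)
print(weights)
print(weighted_cross_entropy([0.2, 0.3, 0.5], label=2, weights=weights))
```

With these toy labels the rare class 2 receives the largest weight, so a mistake on one of its pixels contributes more to the loss than the same mistake on a pixel of class 0.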

Batch normalization

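We use batch normalization after each convolution. This is supported by the Caffe BatchNormLayer. Note that we add a ScaleLayer after each BatchNormLayer, because the BatchNormLayer only normalizes its input and the learnable scale and shift are provided by the ScaleLayer.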
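The split into two layers can be pictured with a small numerical sketch. The code below is plain Python, not Caffe code: `batch_norm` stands in for the normalization step and `scale` for the learnable scale and shift that follow it. The function names, the `eps` value and the sample numbers are assumptions made for illustration.

```python
import math


def batch_norm(xs, eps=1e-5):
    """Normalize one channel's values over a mini-batch to zero mean, unit variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]


def scale(xs, gamma, beta):
    """Apply a per-channel scale (gamma) and shift (beta)."""
    return [gamma * x + beta for x in xs]


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)
output = scale(normalized, gamma=2.0, beta=0.5)
print(normalized)
print(output)
```

After the second step the channel has mean `beta` and standard deviation close to `gamma`, which lets the network undo or adjust the normalization where that helps.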


To evaluate semantic segmentation performance, we provide Python scripts that calculate the global accuracy, the average class accuracy and the average intersection-over-union (IoU) score. The implementation is based on the confusion matrix.

demo   ./fusenet/scripts/
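For readers who want to see how the three scores follow from a confusion matrix, here is a minimal sketch in plain Python. It is not the evaluation script from ./fusenet/scripts/. It assumes that rows index the ground-truth class and columns the predicted class, and that classes absent from the ground truth are skipped when averaging. The function names and toy labels are hypothetical.

```python
def confusion_matrix(gt, pred, num_classes):
    """cm[g][p] counts pixels with ground-truth class g predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def segmentation_scores(cm):
    """Return (global accuracy, average class accuracy, average IoU)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[c][c] for c in range(n)]
    gt_count = [sum(cm[c]) for c in range(n)]  # row sums
    pred_count = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # column sums
    present = [c for c in range(n) if gt_count[c] > 0]
    global_acc = sum(tp) / total
    class_acc = [tp[c] / gt_count[c] for c in present]
    # IoU = true positives / (ground truth + predicted - true positives)
    iou = [tp[c] / (gt_count[c] + pred_count[c] - tp[c]) for c in present]
    return global_acc, sum(class_acc) / len(class_acc), sum(iou) / len(iou)


gt = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 2]
cm = confusion_matrix(gt, pred, num_classes=3)
global_acc, mean_acc, mean_iou = segmentation_scores(cm)
print(cm)
print(global_acc, mean_acc, mean_iou)
```

Accumulating one confusion matrix over the whole test set and computing the scores once at the end avoids averaging per-image scores, which would weight small and large images unequally.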

Released Caffemodel

Semantic Image Segmentation

The items marked with ticks are already available for download; the others will be released soon. Unless otherwise stated, all models are finetuned from the pretrained VGGNet-16Layer model. Stay tuned 🔥

NYUv2 40-class semantic segmentation

For more information about the dataset, check here.

  • FuseNet-SF1:

    This model is trained with the FuseNet Sparse-Fusion1 architecture on 320x240 resolution. To obtain the full 640x480 resolution, you can bilinearly upsample the segmentation or, better, apply CRF refinement.

  • FuseNet-SF5:

    This model is trained with the FuseNet Sparse-Fusion5 architecture on 320x240 resolution. It gives 66.0% global pixelwise accuracy, 43.4% average classwise accuracy and 32.7% average classwise IoU.

SUN-RGBD 37-class semantic segmentation

For more information about the dataset, check here.

  • FuseNet-SF5:

    This model is trained on 224x224 resolution. It gives 76.3% global pixelwise accuracy, 48.30% average classwise accuracy and 37.3% average classwise IoU.

Scene Classification

To be released.


If you use this code or our trained models in your work, please consider citing the following paper.

Caner Hazirbas, Lingni Ma, Csaba Domokos and Daniel Cremers, "FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-based CNN Architecture", in Proceedings of the 13th Asian Conference on Computer Vision, 2016. (pdf)

 author    = "C. Hazirbas and L. Ma and C. Domokos and D. Cremers",
 title     = "FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture",
 booktitle = "Asian Conference on Computer Vision",
 year      = "2016",
 month     = "November",

License and Contact

BVLC/caffe is released under the BSD 2-Clause license. Our modifications to the original code are released under the GNU General Public License Version 3 (GPLv3).

Contact Lingni Ma ✉️ for questions, comments and bug reports.