## General overview 
There are many neural network architectures for semantic image segmentation, but most of them use a convolutional encoder-decoder architecture.

![](http://mi.eng.cam.ac.uk/projects/segnet/images/segnet.png)
*Convolutional encoder-decoder architecture of popular SegNet model*

The encoder in these networks often has a structure similar to an image classification network (e.g. [vgg-16](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/vgg.py)). Layers in the decoder are then usually the inverse of the layers used in the encoder (e.g. for a convolution that makes its input smaller, we use a deconvolution; for max_pool we use some form of "demax_pool").

Applications for semantic segmentation include:

- Autonomous driving

- Industrial inspection

- Classification of terrain visible in satellite imagery

- Medical imaging analysis

<p><strong><a href="https://arxiv.org/abs/1411.4038">Fully Convolutional Networks (FCNs)</a></strong></p>

<table>
  <tbody>
    <tr>
      <td>CVPR 2015</td>
      <td>Fully Convolutional Networks for Semantic Segmentation</td>
      <td><a href="https://arxiv.org/abs/1411.4038">Arxiv</a></td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<blockquote>
  <p>We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.</p>
</blockquote>

<p><br /></p>
<p align="center">
<span>
<img width="580px" src="http://meetshah1995.github.io/images/blog/ss/fcn.png" alt="arch_fcn" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Figure : </b> The FCN end-to-end dense prediction pipeline.</small>
</p>

<p>A few key features of networks of this type are:</p>

<ul>
  <li>Features are merged from different stages of the encoder, which vary in <strong>coarseness of semantic information</strong>.</li>
  <li>The learned low-resolution semantic feature maps are upsampled using <strong>deconvolutions which are initialized with bilinear interpolation filters</strong>.</li>
  <li>FCNs are an excellent example of <strong>knowledge transfer from modern classifier networks</strong> like VGG16 and AlexNet to semantic segmentation.</li>
</ul>
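
<p>The bilinear initialization mentioned above can be sketched in a few lines. This is a minimal, pure-Python illustration rather than the authors' code: the names <code class="highlighter-rouge">bilinear_kernel</code> and <code class="highlighter-rouge">transposed_conv1d</code> are hypothetical helpers, and the 1D transposed convolution is only there to show what the filter does to a signal.</p>

```python
def bilinear_kernel(size):
    """Return a size x size bilinear interpolation filter as nested lists."""
    factor = (size + 1) // 2
    # The filter centre sits on a sample for odd sizes, between samples for even ones.
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    profile = [1 - abs(i - center) / factor for i in range(size)]
    # The 2D filter is the outer product of the 1D triangular profile with itself.
    return [[a * b for b in profile] for a in profile]


def transposed_conv1d(signal, weights, stride):
    """1D transposed convolution (hypothetical helper, no padding or cropping)."""
    out = [0.0] * ((len(signal) - 1) * stride + len(weights))
    for i, value in enumerate(signal):
        for j, weight in enumerate(weights):
            out[i * stride + j] += value * weight
    return out


kernel = bilinear_kernel(4)              # a common choice for 2x upsampling
profile = [0.25, 0.75, 0.75, 0.25]       # the 1D profile behind that kernel
upsampled = transposed_conv1d([1.0, 1.0, 1.0], profile, stride=2)
```

<p>With this initialization the deconvolution starts out as plain interpolation: a constant input stays constant away from the borders, and training can then adjust the weights from that starting point.</p>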

<p><br /></p>
<p align="center">
<span>
<img width="450px" src="http://meetshah1995.github.io/images/blog/ss/fcn_1.png" alt="arch_fcn_" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Figure : </b> Transforming fully connected layers into convolutions enables a classification network to output a class heatmap.</small>
</p>

<p>The fully connected layers (<code class="highlighter-rouge">fc6</code>, <code class="highlighter-rouge">fc7</code>) of classification networks like <code class="highlighter-rouge">VGG16</code> are converted to convolutional layers. As shown in the figure above, the resulting network produces a low-resolution class presence heatmap, which is then upsampled using bilinearly initialized deconvolutions. At each stage of upsampling, the prediction is further refined by fusing (simple addition) features from semantically coarser but higher-resolution feature maps from lower layers of VGG16 (<code class="highlighter-rouge">conv4</code> and <code class="highlighter-rouge">conv3</code>). A more detailed netscope-style visualization of the network can be found <a href="http://ethereon.github.io/netscope/#/preset/fcn-8s-pascal">here</a>.</p>
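
<p>Why a fully connected layer can be read as a convolution is easiest to see on a toy case. The sketch below assumes a single-channel input and a made-up 2x2 "fully connected" layer; <code class="highlighter-rouge">fully_connected</code> and <code class="highlighter-rouge">conv_valid</code> are hypothetical helpers, and the weights are arbitrary illustration values.</p>

```python
def fully_connected(patch, weights, bias):
    """Score for one k x k patch: flatten it, take a dot product, add the bias."""
    flat = [v for row in patch for v in row]
    return sum(v * w for v, w in zip(flat, weights)) + bias


def conv_valid(image, kernel, bias):
    """'Valid' 2D cross-correlation of a single-channel image with one kernel."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k)) + bias
             for c in range(out_w)]
            for r in range(out_h)]


k = 2
fc_weights = [1.0, -1.0, 0.5, 2.0]      # arbitrary toy weights
bias = 0.5
# Reshape the flat fully connected weights into a k x k convolution kernel.
kernel = [fc_weights[i * k:(i + 1) * k] for i in range(k)]

image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
heatmap = conv_valid(image, kernel, bias)        # one score per location
top_left = [row[:k] for row in image[:k]]
single_score = fully_connected(top_left, fc_weights, bias)
```

<p>Here <code class="highlighter-rouge">heatmap[0][0]</code> equals <code class="highlighter-rouge">single_score</code>: the convolution applies the same classifier at every position of a larger input, which is what turns a single class score into a class heatmap.</p>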

<p>In conventional classification CNNs, pooling is used to increase the field of view while reducing the feature map resolution. This works well for classification, where the end goal is only to detect the presence of a particular class and the spatial location of the object is not relevant. Pooling is therefore introduced after each convolution block so that the succeeding block can extract more abstract, class-salient features from the pooled features.</p>

<p><br /></p>
<p align="center">
<span>
<img width="580px" src="http://meetshah1995.github.io/images/blog/ss/fcn_2.png" alt="arch_fcn__" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Figure :</b>  The FCN-32s Architecture</small>
</p>

<p>On the other hand, any downsampling operation (pooling or strided convolution) is detrimental to semantic segmentation because spatial information is lost. Most of the architectures listed below differ mainly in the mechanism their <strong>decoder</strong> uses to <em>recover</em> the information lost while reducing the resolution in the <strong>encoder</strong>. As seen above, FCN-8s fuses features of different coarseness (<code class="highlighter-rouge">conv3</code>, <code class="highlighter-rouge">conv4</code> and <code class="highlighter-rouge">fc7</code>) to refine the segmentation using spatial information from different resolutions at different stages of the encoder.</p>
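
<p>How much resolution the encoder gives up can be worked out with simple arithmetic. The sketch below assumes a VGG16-style encoder in which the convolutions preserve spatial size and each of the five 2x2, stride-2 max-pools halves it; <code class="highlighter-rouge">vgg16_feature_sizes</code> is a hypothetical helper, and the 224-pixel input is only an example.</p>

```python
def vgg16_feature_sizes(input_size, num_pools=5):
    """Spatial size after each pooling stage, assuming size-preserving convs."""
    sizes = {}
    size = input_size
    for stage in range(1, num_pools + 1):
        size //= 2                      # each stride-2 max-pool halves height and width
        sizes[f"pool{stage}"] = size
    return sizes


input_size = 224                        # example input, not prescribed by FCN
sizes = vgg16_feature_sizes(input_size)
# Output stride: how many input pixels one feature-map cell covers per axis.
strides = {name: input_size // size for name, size in sizes.items()}
```

<p>Under these assumptions the deepest features sit at stride 32, the stage-4 features at stride 16 and the stage-3 features at stride 8, which is where the names FCN-32s, FCN-16s and FCN-8s come from.</p>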

<p align="center">
<span>
<img width="250px" src="https://raw.githubusercontent.com/shekkizh/FCN.tensorflow/master/logs/images/conv_1_1_gradient.png" alt="conv_1_1_gradient" />
</span> &nbsp;&nbsp;&nbsp;
<span>
<img width="250px" src="https://raw.githubusercontent.com/shekkizh/FCN.tensorflow/master/logs/images/conv_4_1_gradient.png" alt="conv_4_1_gradient" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<span>
<img width="250px" src="https://raw.githubusercontent.com/shekkizh/FCN.tensorflow/master/logs/images/conv_4_2_gradient.png" alt="conv_4_2_gradient" />
</span> &nbsp;&nbsp;&nbsp;
<span>
<img width="250px" src="https://raw.githubusercontent.com/shekkizh/FCN.tensorflow/master/logs/images/conv_4_3_gradient.png" alt="conv_4_3_gradient" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Figure :</b> Gradients at conv layers when training FCNs <a href="https://github.com/shekkizh/FCN.tensorflow">Source</a></small>
</p>

<p>The first conv layers capture low-level geometric information, and since this is entirely dataset dependent, you can see the gradients adjusting the first-layer weights to adapt the model to the dataset. Deeper conv layers from VGG receive very small gradients, as the higher-level semantic concepts captured there are already good enough for segmentation. This is what amazes me about how well transfer learning works.</p>

<p align="center">
<span>
<img width="250px" src="http://meetshah1995.github.io/images/blog/ss/deconv.gif" alt="deconv" />
</span> &nbsp;&nbsp;&nbsp;
<span>
<img width="250px" src="http://meetshah1995.github.io/images/blog/ss/dilation.gif" alt="dilated" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Left : </b> Deconvolution (Transposed Convolution) and <b>Right : </b> Dilated (Atrous) Convolution <a href="https://github.com/vdumoulin/conv_arithmetic">Source</a></small>
</p>

<p>Another important aspect of a semantic segmentation architecture is the mechanism used for <strong>feature upsampling</strong>: either the low-resolution segmentation maps are upsampled to the input image resolution using learned deconvolutions, or the reduction of resolution in the encoder is partially avoided altogether by using dilated convolutions, at the cost of computation. Networks built on dilated convolutions are very expensive, even on modern GPUs, since their feature maps stay at a higher resolution. This post on <a href="http://distill.pub/2016/deconv-checkerboard/">distill.pub</a> explains deconvolutions in much more detail.</p>
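
<p>The effect of dilation is easy to see in one dimension. The following is a minimal sketch, not a library API: <code class="highlighter-rouge">dilated_conv1d</code> is a hypothetical helper, and the signal and kernel are arbitrary illustration values.</p>

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation with `dilation` steps between kernel taps."""
    span = (len(kernel) - 1) * dilation + 1     # input samples covered by one output
    out_len = len(signal) - span + 1
    return [sum(signal[i + j * dilation] * kernel[j] for j in range(len(kernel)))
            for i in range(out_len)]


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8]
kernel = [1, 0, -1]                              # three weights in both cases

dense = dilated_conv1d(signal, kernel, dilation=1)    # each output sees 3 samples
dilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
```

<p>Both calls use the same three weights and move one sample at a time, so nothing is downsampled, yet the dilated version covers a wider field of view. The price is that every later layer still has to process a high-resolution feature map.</p>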

<p><strong><a href="https://arxiv.org/abs/1511.00561">SegNet</a></strong></p>

<table>
  <tbody>
    <tr>
      <td>2015</td>
      <td>SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation</td>
      <td><a href="https://arxiv.org/abs/1511.00561">Arxiv</a></td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<blockquote>
  <p>The novelty of SegNet lies is in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance.</p>
</blockquote>

<p><br /></p>
<p align="center">
<span>
<img width="580px" src="http://meetshah1995.github.io/images/blog/ss/segnet.png" alt="arch_segnet" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Figure : </b> The SegNet Architecture</small>
</p>

<p>A few key features of networks of this type are:</p>

<ul>
  <li>SegNet uses <strong>unpooling</strong> to upsample feature maps in the decoder, which keeps high-frequency details intact in the segmentation.</li>
  <li>The encoder drops the fully connected layers altogether (rather than convolutionizing them as FCN does) and is therefore a lightweight network with fewer parameters.</li>
</ul>

<p align="center">
<span>
<img width="450px" src="http://meetshah1995.github.io/images/blog/ss/unpooling.png" alt="unpool" />
</span> &nbsp;&nbsp;&nbsp;
<br />
<small><b>Figure : </b> Max Unpooling</small>
</p>

<p>As shown in the image above, the indices at each max-pooling layer in the encoder are stored and later used to upsample the corresponding feature map in the decoder by unpooling it with those stored indices. While this helps keep the high-frequency information intact, it also misses neighbouring information when unpooling from low-resolution feature maps.</p>
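
<p>The index bookkeeping described above can be sketched in pure Python. This is an illustration under simplifying assumptions (a single channel, non-overlapping windows, dimensions divisible by the window size), not SegNet's implementation; <code class="highlighter-rouge">max_pool_with_indices</code> and <code class="highlighter-rouge">max_unpool</code> are hypothetical helpers.</p>

```python
def max_pool_with_indices(fmap, k=2):
    """Non-overlapping k x k max-pool that also records where each maximum was."""
    pooled, indices = [], []
    for r in range(0, len(fmap), k):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), k):
            window = [(fmap[r + i][c + j], (r + i, c + j))
                      for i in range(k) for j in range(k)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, out_h, out_w):
    """Place each pooled value back at its recorded position; the rest stays zero."""
    out = [[0] * out_w for _ in range(out_h)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]
pooled, indices = max_pool_with_indices(fmap)      # encoder side
restored = max_unpool(pooled, indices, 4, 4)       # decoder side
```

<p>The restored map is sparse: each maximum returns to its exact original location, but its neighbours are zero, which is why SegNet follows every unpooling step with trainable convolutions to densify the result.</p>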
