# Dealing with Images (Enhancing and Segmenting)

## Introduction

This project explores Encoder-Decoder models and how they are used to edit and generate full images. It also shows how these models can be adapted to a wider range of applications, such as image denoising or object and instance segmentation. The project introduces new layer types (Unpooling, Transposed Convolutions and Atrous Convolutions) and explains why they are useful for producing high-dimensional outputs. One example application is semantic segmentation for driverless cars, where an Encoder-Decoder helps identify the objects surrounding the vehicle, such as roads, other vehicles, people or trees.

## Breakdown of this Project:
- Introduction to Encoders-Decoders. (Notebook 1)
- Encoders-Decoders trained for pixel-level prediction. (Notebook 1)
- Layers such as Unpooling, Transposed and Atrous Convolutions to output high-dimensional data. (Notebook 2)
- FCN and U-Net Architectures for semantic segmentation. (Notebook 3)
- Instance segmentation (extension of Faster-RCNN with Mask-RCNN) (Notebook 4)

## Requirements:
1) TensorFlow 2.0 (preferably with GPU support) \
2) CV2 (OpenCV) \
3) Cython \
4) Eigen \
5) PyDenseCRF

For "PyDenseCRF" on Windows, see: https://github.com/lucasb-eyer/pydensecrf

It can be installed directly with the following in command prompt or terminal-equivalent: __conda install -c conda-forge pydensecrf__.

If Conda-Forge __does not work__, try the following:
- Go to: https://www.lfd.uci.edu/~gohlke/pythonlibs/#pydensecrf
- Download: pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl
- The "cp37" in the filename refers to Python version 3.7; make sure you download the file that matches your Python version.
- Place the downloaded "pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl" file in your working directory.
- Open Command Prompt and type in: pip install pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl
- Or, if you placed it in a different folder or location: pip install <FILEPATH>\pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl

## Dataset:
    
The dataset can be obtained from: http://www.laurencemoroney.com/rock-paper-scissors-dataset/ or from https://www.tensorflow.org/datasets/catalog/rock_paper_scissors

Rock Paper Scissors is a dataset containing 2,892 images of different types of hands in Rock/Paper/Scissors poses. Each of these images is 300×300 pixels in 24-bit colour.


### Import the required libraries:

In [None]:
%matplotlib inline

import tensorflow as tf
import numpy as np
import math
import timeit
import time
from absl import app, flags, logging
from absl.flags import FLAGS
import cv2
import os
import matplotlib.pyplot as plt

# Run on GPU:
os.environ["CUDA_VISIBLE_DEVICES"]= "0" 

In [None]:
# Set the random seed: for reproducibility.
Seed_nb = 42

# Set to run or not run the code block: for code examples only. (0 = run the code, 1 = don't run the code)
dont_run = 0

### GPU Information:

In [None]:
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
devices = sess.list_devices()
devices

### Use RTX_GPU Tensor Cores for faster compute: FOR TENSORFLOW ONLY

Automatic Mixed Precision Training in TF. Requires NVIDIA DOCKER of TensorFlow.

Sources:
- https://developer.nvidia.com/automatic-mixed-precision
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#framework

When enabled, automatic mixed precision will do two things:

- Insert the appropriate cast operations into your TensorFlow graph to use float16 execution and storage where appropriate (this enables the use of Tensor Cores along with memory storage and bandwidth savings).
- Turn on automatic loss scaling inside the training Optimizer object.

In [None]:
# os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

EXAMPLE CODE: 

In [None]:
# # Graph-based example:
# opt = tf.train.AdamOptimizer()
# opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
# train_op = opt.minimize(loss)

# # Keras-based example:
# opt = tf.keras.optimizers.Adam()
# opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
# model.compile(loss=loss, optimizer=opt)
# model.fit(...)

### Use RTX_GPU Tensor Cores for faster compute: FOR KERAS API

Source:
- https://www.tensorflow.org/guide/keras/mixed_precision
- https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/Policy

In [None]:
from tensorflow.keras.mixed_precision import experimental as mixed_precision

In [None]:
# # Set for MIXED PRECISION:
# policy = mixed_precision.Policy('mixed_float16')
# mixed_precision.set_policy(policy)

# print('Compute dtype: %s' % policy.compute_dtype)
# print('Variable dtype: %s' % policy.variable_dtype)

### To run this notebook without errors, the GPU will have to be set accordingly:

In [None]:
physical_devices = tf.config.list_physical_devices('GPU')
if physical_devices:  # Skip when no GPU is visible, to avoid an IndexError.
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

## 1 - Introduction to Convolutional Encoders-Decoders:

Similar to fully connected networks, Encoder-Decoder models can also be fitted with Convolutional and Pooling layers, albeit with some variants of these, and structured to form __Deep Auto-Encoders (DAEs)__. These kinds of architecture can be used for more complex tasks such as __Super-Resolution__ of images. Before building the network in TensorFlow, some basics need to be covered first:

1) Unpooling. \
2) Transposing. \
3) Dilating.

These can also be grouped together and be called __Reversed Operations__.

## 2 - Reversed Operations: Unpooling, Transposing and Dilating.

In previous notebooks, the convolutional layers were introduced to extract high-level features from the input, while the pooling layers down-sample the data to yield compact, semantically rich features. In a way, the feature extractor of a Convolutional Neural Network acts like an encoder. The part that differs from a CNN is the decoding end, which decodes these lower-dimensional features into full images. To achieve this, reversed operations are utilised: the decoder part of the network uses __transposed convolutions__, __dilated convolutions__ and __unpooling__.

### 2.1 - Transposed Convolutions: a.k.a Fractionally Strided Convolutions

The convolution operation is revisited here for reference and for comparison with its reversed operation. A convolution takes the following hyperparameters:

- Kernel Size, k.
- Input Depth, D.
- Number of Kernels, N.
- Padding, p.
- Stride, s. 

For an input tensor of shape $H \times W \times D$ (Height x Width x Depth), the output tensor has height $H_{o}$ and width $W_{o}$ given by:

$$ H_{o} = \frac{H - k + 2p}{s} + 1 $$ and $$ W_{o} = \frac{W - k + 2p}{s} + 1 $$
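As a quick check of the formula above, the sketch below evaluates it for two settings. The helper name `conv_output_size` is hypothetical, and it assumes floor (integer) division for cases where the stride does not divide evenly.

```python
def conv_output_size(size, k, p, s):
    """Output height or width of a convolution: floor((size - k + 2p) / s) + 1."""
    return (size - k + 2 * p) // s + 1


# A 300 x 300 image (the image size of the dataset used in this project):
same_size = conv_output_size(300, k=3, p=1, s=1)  # 3x3 kernel, padding 1, stride 1
halved = conv_output_size(300, k=3, p=1, s=2)     # same kernel, stride 2

print(same_size, halved)
```

With a 3x3 kernel and a padding of 1, a stride of 1 keeps the spatial size, while a stride of 2 halves it.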

From the Decoder's perspective, the $H_{o} \times W_{o}$ feature map is the input. Therefore, for a given input feature map of shape $H_{o} \times W_{o} \times N$ (where $N$ is the number of kernels), a __reversed operation__ is performed to recover the original spatial shape, defined as follows:

$$ H = (H_{o} - 1) * s + k - 2p$$ and $$ W = (W_{o} - 1) * s + k - 2p$$
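The two shape formulas can be checked against each other with a small round trip. This is only a sketch: both helper names are hypothetical, and floor division is assumed in the forward direction.

```python
def conv_output_size(size, k, p, s):
    """Forward direction: spatial size after a convolution."""
    return (size - k + 2 * p) // s + 1


def transposed_conv_output_size(size_o, k, p, s):
    """Reversed direction: spatial size after a transposed convolution."""
    return (size_o - 1) * s + k - 2 * p


encoded = conv_output_size(5, k=3, p=0, s=2)
decoded = transposed_conv_output_size(encoded, k=3, p=0, s=2)

# Because of the floor division, an input of size 6 is encoded to the same size
# as an input of size 5, so the reversed formula cannot tell them apart.
encoded_from_6 = conv_output_size(6, k=3, p=0, s=2)

print(encoded, decoded, encoded_from_6)
```

The round trip recovers the size 5, but the same encoded size is also produced by an input of size 6. The reversed formula therefore recovers only one of the possible input sizes when the stride is larger than 1.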

##### To demonstrate a simple transposed convolution, an example is shown below:

<img src="Description Images/Transposed Convolution.PNG" width="750">

Image Ref -> http://d2l.ai/chapter_computer-vision/transposed-conv.html

From the above, it can be seen that some elements of the upsampled feature map overlap; these are resolved by simply adding the overlapping elements together. In other words, a transposed convolutional layer mirrors the standard convolution to increase the spatial dimensions of the feature maps, while still applying trainable filters over the input feature tensor. As this operation is quite similar to the standard convolution, it adds little computational overhead. __Overall, a transposed convolution only restores the shape of the original tensor; it does not recover the original input values.__
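The overlap-and-add behaviour can be sketched in NumPy. This is a minimal single-channel version without padding, not the TensorFlow implementation; the function name `transposed_conv2d` and the input and kernel values are chosen only for illustration.

```python
import numpy as np


def transposed_conv2d(x, kernel, stride=1):
    """Single-channel transposed convolution without padding."""
    h, w = x.shape
    k_h, k_w = kernel.shape
    out = np.zeros(((h - 1) * stride + k_h, (w - 1) * stride + k_w), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            # Each input element stamps a scaled copy of the kernel onto the
            # output; wherever stamps overlap, their values are summed.
            top, left = i * stride, j * stride
            out[top:top + k_h, left:left + k_w] += x[i, j] * kernel
    return out


x = np.array([[0.0, 1.0], [2.0, 3.0]])
kernel = np.array([[0.0, 1.0], [2.0, 3.0]])

y_stride1 = transposed_conv2d(x, kernel, stride=1)  # 3 x 3, stamps overlap
y_stride2 = transposed_conv2d(x, kernel, stride=2)  # 4 x 4, no overlap
print(y_stride1)
print(y_stride2)
```

With a stride of 1 the 2x2 stamps overlap and are added; with a stride of 2 they sit side by side. In both cases the output size follows $(H_{o} - 1) * s + k$.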

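These can be called in TensorFlow 2 with "tf.keras.layers.Conv2DTranspose()" or the lower-level "tf.nn.conv2d_transpose()". The TensorFlow 1.x function "tf.layers.conv2d_transpose()" is no longer available in the top-level namespace.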
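As a brief usage sketch, the Keras layer can be applied to a dummy feature map to see how the padding mode affects the output shape. The tensor shape, filter count and kernel size below are arbitrary example values, not settings taken from this project.

```python
import tensorflow as tf

# Dummy feature map in (batch, height, width, channels) layout.
x = tf.zeros((1, 16, 16, 8))

up_same = tf.keras.layers.Conv2DTranspose(
    filters=4, kernel_size=3, strides=2, padding="same")
up_valid = tf.keras.layers.Conv2DTranspose(
    filters=4, kernel_size=3, strides=2, padding="valid")

shape_same = tuple(up_same(x).shape)
shape_valid = tuple(up_valid(x).shape)
print(shape_same, shape_valid)
```

With "same" padding the spatial size is multiplied by the stride, while "valid" padding follows $(H_{o} - 1) * s + k$ with $p = 0$.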

Another good example can be found with the link: https://medium.com/apache-mxnet/transposed-convolutions-explained-with-ms-excel-52d13030c7e8

### 2.2 - Unpooling:

In CNNs, a pooling operation usually follows the convolutions. As the name suggests, unpooling would be expected to be the exact reverse of max-pooling or average-pooling; however, this is not the case. Taking max-pooling as an example, the operation discards every value in a window except the maximum, so performing it in reverse is impossible: the original input tensor cannot be recovered from the pooled output. Unpooling is therefore an operation that only approximates the inversion, restoring the tensor in terms of spatial sampling rather than its original values.

##### Below shows the Unpooling Operation:

<img src="Description Images/Unpooling.PNG" width="550">

Image Ref -> https://abenbihi.github.io/posts/2018-06-tf-unpooling/

From the diagram above, it can be seen that the max-values are placed back at the same positions they had in the original tensor, while the rest of the elements are filled with zeros. The positions of these values are determined by the pooling masks. Similar to the pooling layers, the unpooling operations are fixed and considered untrainable. The diagram below shows the operation with a different tensor, but the same concept applies.

<img src="Description Images/Mask_Unpooling.PNG" width="550">

Image Ref -> https://abenbihi.github.io/posts/2018-06-tf-unpooling/
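The mask-based max-unpooling described above can be sketched in NumPy. This is an illustrative single-channel version that assumes the height and width are divisible by the pooling size; the function names and the example tensor are hypothetical.

```python
import numpy as np


def max_pool_with_mask(x, k=2):
    """k x k max-pooling (stride k) that also returns a mask of the max positions."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k), dtype=x.dtype)
    mask = np.zeros(x.shape, dtype=bool)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[r, c]
            mask[i * k + r, j * k + c] = True
    return pooled, mask


def max_unpool(pooled, mask, k=2):
    """Put each pooled value back at its recorded position; fill the rest with zeros."""
    expanded = np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)
    return np.where(mask, expanded, 0)


x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 5],
              [7, 1, 6, 2],
              [0, 8, 3, 4]])

pooled, mask = max_pool_with_mask(x)
restored = max_unpool(pooled, mask)
print(pooled)
print(restored)
```

The restored tensor has the original shape, but only the four maxima survive; every other value is lost, which is why unpooling is only an approximate inverse.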

The more commonly used unpooling operation, other than max-unpooling, is __Average-unpooling__ (also known as __Upsampling__ or __Resize__). Similar to the average-pooling operation but in reverse, it takes each tensor value and copies it into a k by k region. This is demonstrated in the diagram below.

<img src="Description Images/Upsampling_Resizing.PNG" width="550">

Image Ref -> https://arxiv.org/pdf/2004.04892.pdf

In TensorFlow, it can be called with "tf.keras.layers.UpSampling2D()" or "tf.image.resize()", where the latter needs to be given the parameter "method = tf.image.ResizeMethod.NEAREST_NEIGHBOR".
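The copy-into-a-k-by-k-region behaviour can be reproduced with plain NumPy, which makes the operation easy to inspect. This is a sketch of nearest-neighbour upsampling for a single channel, not the TensorFlow implementation; the function name is hypothetical.

```python
import numpy as np


def upsample_nearest(x, k=2):
    """Copy every value of a 2D array into a k x k region."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)


x = np.array([[1, 2],
              [3, 4]])
up = upsample_nearest(x, k=2)
print(up)
```

Each input value now fills a 2x2 block, so the spatial size doubles without any trainable parameters being involved.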

### 2.3 - Dilated Convolutions: a.k.a Atrous Convolutions.

Dilated convolutions serve the purpose of increasing the receptive field of the convolution while keeping the spatial dimensionality of the data. They take a hyperparameter "d", the dilation rate, which sets the spacing between the kernel elements. Note that the transposed convolution also has its own form of dilation, but it is implemented differently.

##### The operation can be seen in the diagram below:

<img src="Description Images/Dilated_Conv.PNG" width="550">

Image Ref -> https://www.semanticscholar.org/paper/Deep-Dilated-Convolution-on-Multimodality-Time-for-Xi-Hou/afadf82529110fadcbbe82671d35a83f334ca242/figure/1
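One way to picture the dilation is to spread the kernel weights apart and fill the gaps with zeros. The sketch below builds such a kernel in NumPy and computes its effective size, $k + (k - 1)(d - 1)$. The helper names are hypothetical, and real frameworks skip the gaps rather than materialising the zeros.

```python
import numpy as np


def effective_kernel_size(k, d):
    """Extent covered by a kernel of size k with dilation rate d."""
    return k + (k - 1) * (d - 1)


def dilate_kernel(kernel, d):
    """Insert d - 1 zeros between neighbouring kernel elements."""
    k_h, k_w = kernel.shape
    out = np.zeros((effective_kernel_size(k_h, d), effective_kernel_size(k_w, d)),
                   dtype=kernel.dtype)
    out[::d, ::d] = kernel
    return out


kernel = np.arange(1, 10).reshape(3, 3)
dilated = dilate_kernel(kernel, d=2)
print(dilated)
```

A 3x3 kernel with d = 2 covers a 5x5 region while still holding only 9 weights, which is how the receptive field grows without adding parameters.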

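In TensorFlow 2, it can be called with "tf.keras.layers.Conv2D()", which takes the hyperparameter "dilation_rate", or with the lower-level "tf.nn.conv2d()", which takes "dilations". The TensorFlow 1.x function "tf.layers.conv2d()" is no longer available in the top-level namespace.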

## 3 - Fully Convolutional Networks & U-Net:

The Fully Convolutional Networks (FCNs) & U-Net are models that are used for Deep Auto-Encoders. 

### 3.1 - Fully Convolutional Networks:

The FCN models are mostly based on the VGG-16 architecture, where the CNN part of the model extracts the features from the input images. The difference is that the final dense layers are replaced with 1x1 convolutions and extended with upsampling blocks. This essentially transforms the model into an Encoder-Decoder network.

##### Below shows the Overall FCN model with its three variants:

<img src="Description Images/FCN_variants(VGG16).PNG" width="650">

Image Ref -> http://deeplearning.net/tutorial/fcn_2D_segm.html

In more detail, the model consists of 5 convolutional blocks, originating from the VGG-16 architecture, that extract feature maps from the input images; the spatial dimensions are divided by 2 after each block. At the decoding stage, starting at "conv6", the fully connected layers are replaced by convolutional (1x1) blocks. Subsequently, a final transposed convolution layer is added to upsample the data back to the original input shape.

<img src="Description Images/FCN_variants(VGG16)_breakdown.PNG" width="650">

Image Ref -> http://deeplearning.net/tutorial/fcn_2D_segm.html
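The effect of halving the spatial dimensions after each of the 5 blocks can be traced with a few lines of Python. The input size of 224 is only an assumed example (a common VGG-16 input size), and the function name is hypothetical.

```python
def encoder_feature_sizes(input_size, num_blocks=5):
    """Spatial size after each encoder block, halving once per block."""
    sizes = [input_size]
    for _ in range(num_blocks):
        sizes.append(sizes[-1] // 2)
    return sizes


sizes = encoder_feature_sizes(224)
total_stride = 224 // sizes[-1]
print(sizes, total_stride)
```

After five halvings the feature map is 32 times smaller than the input along each spatial axis, which is the factor the final upsampling has to make up.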

From the above, it can also be seen that there are __3 variants of FCNs__: FCN-32s, FCN-16s and FCN-8s. These models mainly differ in the spatial precision of their output. The 3 variants exist because it was found during testing that FCN-32s produces very coarse results (J. Long et al.).

##### Below shows the Output of FCN-32s:

<img src="Description Images/FCN32_Output.PNG" width="150">

Image Ref -> http://deeplearning.net/tutorial/fcn_2D_segm.html

The reason for this is that the stride (= 32) of the final layer limits the scale of the details: the output is contextually rich but lacks spatial definition. The creators then proposed techniques to solve this problem.

##### Below shows the difference in final operations applied to achieve better results, producing these 3 variants:

<img src="Description Images/FCN_variants(VGG16)_breakdown2.PNG" width="650">

Image Ref -> http://deeplearning.net/tutorial/fcn_2D_segm.html

These 3 architectures differ in the stride applied in the last transposed convolution and in the addition of skip connections to obtain the output segmentation maps. In FCN-16s, the last layer of FCN-32s is replaced with a new transposed convolution layer with Stride = 2 (essentially added after "conv 5 + pool 5"), in order to produce feature maps with the same dimensions as the output of the 4th convolutional block (conv 4 + pool 4). This is so that, with a skip connection, the feature maps from both tensors (the transposed conv 5 output and the conv 4 output) can be merged by element-wise addition. The merged tensor is then scaled back to the original input shape by another transposed convolution with Stride = 16. The same is done for FCN-8s, except that the skip connection is taken from convolutional block 3 and the final transposed convolution accordingly uses Stride = 8.
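The shape bookkeeping behind the three variants can be sketched in NumPy. This is not an FCN implementation: nearest-neighbour repetition stands in for the trainable transposed convolutions, random arrays stand in for the per-class score maps, and the 224 x 224 input size and 3 classes are assumed example values. Following J. Long et al., FCN-8s is assumed here to merge the already fused block 4 result with block 3.

```python
import numpy as np


def upsample(x, factor):
    """Nearest-neighbour stand-in for a trainable transposed convolution."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


num_classes = 3
rng = np.random.default_rng(42)

# Hypothetical per-class score maps for a 224 x 224 input.
scores_pool5 = rng.random((7, 7, num_classes))    # overall stride 32
scores_pool4 = rng.random((14, 14, num_classes))  # overall stride 16
scores_pool3 = rng.random((28, 28, num_classes))  # overall stride 8

# FCN-32s: a single upsampling step with stride 32.
fcn32 = upsample(scores_pool5, 32)

# FCN-16s: upsample by 2, merge with block 4 by addition, then upsample by 16.
fused16 = upsample(scores_pool5, 2) + scores_pool4
fcn16 = upsample(fused16, 16)

# FCN-8s: upsample the fused map by 2, merge with block 3, then upsample by 8.
fused8 = upsample(fused16, 2) + scores_pool3
fcn8 = upsample(fused8, 8)

print(fcn32.shape, fcn16.shape, fcn8.shape)
```

All three outputs have the input resolution, but the FCN-16s and FCN-8s maps are built from finer feature maps (14 x 14 and 28 x 28 instead of 7 x 7), which is where the extra spatial detail comes from.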

With these changes to the architecture, the output will also differ in coarseness.

##### Below shows the Output of the FCN-32s, 16s and 8s:

<img src="Description Images/FCN_All_Output.PNG" width="450">

Image Ref -> http://deeplearning.net/tutorial/fcn_2D_segm.html

Source: 
- J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 3431-3440, doi: 10.1109/CVPR.2015.7298965.

### 3.2 - U-Net:

## 4 - Tensorflow Implementation:

In [None]:
<img src="Description Images/.png" width="750">

Image Ref -> 
