# Dealing with Images (Enhancing and Segmenting) [Notebook 3]

## Introduction

This project dives into Encoders-Decoders and how these models are used to edit and generate full images. It also covers how they can be adapted to a wider range of applications, such as image denoising or object and instance segmentation. The project introduces new concepts such as Unpooling, Transposed Convolution and Atrous Convolution layers, and their utility for high-dimensional data. Encoders-Decoders can be used for semantic segmentation in driverless cars, where they help identify the objects surrounding the vehicle, such as roads, other vehicles, people or trees.

## Breakdown of this Project:
- Introduction to Encoders-Decoders. (Notebook 1)
- Encoders-Decoders trained for pixel-level prediction. (Notebook 1)
- Layers such as Unpooling, Transposed and Atrous Convolutions to output high-dimensional data. (Notebook 2)
- FCN and U-Net Architectures for semantic segmentation. (Notebook 3)
- Instance segmentation (extension of Faster-RCNN with Mask-RCNN) (Notebook 4)

## Requirements:
1) TensorFlow 2.0 (GPU preferably) \
2) CV2 (OpenCV) \
3) Cython \
4) Eigen \
5) PyDenseCRF

For "PyDenseCRF" on Windows, see: https://github.com/lucasb-eyer/pydensecrf

It can be installed directly with the following in command prompt or terminal-equivalent: __conda install -c conda-forge pydensecrf__.

If Conda-Forge __does not work__, try: 
- Go to: https://www.lfd.uci.edu/~gohlke/pythonlibs/#pydensecrf
- Download: pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl
- The "cp37" in the filename refers to Python version 3.7; make sure you download the file that matches your Python version.
- Place the downloaded "pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl" file in your working directory.
- Open Command Prompt and type in: pip install pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl
- Or if you placed it in a folder or different location: pip install <FILEPATH>\pydensecrf-1.0rc2-cp37-cp37m-win_amd64.whl

## Dataset:
    
The dataset can be obtained from the link: https://www.cityscapes-dataset.com/dataset-overview/.

Quoted from the website: "The Cityscapes Dataset focuses on semantic understanding of urban street scenes." It consists of 5,000 images with fine-grained semantic labels and 20,000 images with coarser annotations, shot from the viewpoint of a car driving around different cities in Germany.


### Import the required libraries:

In [1]:
%matplotlib inline

import tensorflow as tf
import numpy as np
import math
import timeit
import time
import os
import matplotlib.pyplot as plt

# Run on GPU:
os.environ["CUDA_VISIBLE_DEVICES"]= "0" 
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [2]:
# Set the random seed number for reproducibility.
Seed_nb = 42

# Flag to run or skip the example code blocks (0 = run code, 1 = don't run code).
dont_run = 0

### GPU Information:

In [3]:
# sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
# devices = sess.list_devices()
# devices

### Use RTX_GPU Tensor Cores for faster compute: FOR TENSORFLOW ONLY

Automatic Mixed Precision training in TensorFlow. Requires the NVIDIA Docker container for TensorFlow.

Sources:
- https://developer.nvidia.com/automatic-mixed-precision
- https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#framework

When enabled, automatic mixed precision will do two things:

- Insert the appropriate cast operations into your TensorFlow graph to use float16 execution and storage where appropriate (this enables the use of Tensor Cores along with memory storage and bandwidth savings).
- Turn on automatic loss scaling inside the training Optimizer object.

In [4]:
# os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

EXAMPLE CODE: 

In [5]:
# # Graph-based example:
# opt = tf.train.AdamOptimizer()
# opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
# train_op = opt.minimize(loss)

# # Keras-based example:
# opt = tf.keras.optimizers.Adam()
# opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
# model.compile(loss=loss, optimizer=opt)
# model.fit(...)

### Use RTX_GPU Tensor Cores for faster compute: FOR KERAS API

Source:
- https://www.tensorflow.org/guide/keras/mixed_precision
- https://www.tensorflow.org/api_docs/python/tf/keras/mixed_precision/experimental/Policy

In [6]:
# from tensorflow.keras.mixed_precision import experimental as mixed_precision

In [7]:
# # Set for MIXED PRECISION:
# policy = mixed_precision.Policy('mixed_float16')
# mixed_precision.set_policy(policy)

# print('Compute dtype: %s' % policy.compute_dtype)
# print('Variable dtype: %s' % policy.variable_dtype)

### To run this notebook without errors, the GPU will have to be set accordingly:

In [8]:
# physical_devices = tf.config.list_physical_devices('GPU') 
# physical_devices
# tf.config.experimental.set_memory_growth(physical_devices[0], True) 

## 1 - What is Semantic Segmentation?:

__Semantic Segmentation__ is the task of segmenting images into meaningful parts, and it covers the segmentation of both objects and instances. This task differs from image classification or object detection in that, more fundamentally, it requires dense, pixel-level predictions: a label is assigned to each pixel of the input image.

## 2 - Encoders-Decoders for Object Segmentation:

Segmenting objects in a scene can be described as mapping images from a colour domain to a class domain. Each pixel is assigned one of the target classes, and a label map of the same height and width is returned. Performing this operation with an Encoder-Decoder requires further consideration, as it is not entirely straightforward.

## 2.1 - Decoding as label maps:

If an Encoder-Decoder network were constructed to output label maps directly, where each pixel value is a class index (e.g. 1 for house or 2 for car), the model would produce very poor results. A better approach is to output categorical values instead. Previously, for image classification with "N" categories, the final layer of the network would output "N" logits, one per class; these scores were converted to probabilities with the Softmax function, and the class with the largest probability was picked with the Argmax function. The same mechanism can be applied to Semantic Segmentation, at the pixel level rather than at the level of the whole image.
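The per-pixel Softmax and Argmax mechanism described above can be sketched in NumPy as follows. This is only a minimal illustration: the `softmax` helper and the tiny 2 x 2 logits tensor are made up for the example and are not part of the models built in this project.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical network output: a 2 x 2 image with N = 3 logits per pixel.
logits = np.array([[[2.0, 0.5, 0.1], [0.2, 3.0, 0.3]],
                   [[0.1, 0.2, 4.0], [1.5, 0.3, 0.2]]])

# Softmax is applied independently to every pixel (along the channel axis).
probs = softmax(logits, axis=-1)        # shape (2, 2, 3)

# Argmax along the channel axis picks one class index per pixel.
label_map = np.argmax(probs, axis=-1)   # shape (2, 2)
print(label_map)
```

The only difference from image classification is the axis: the same two operations run once per pixel instead of once per image.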

##### The image below represents the task of Image Segmentation:

<img src="Description Images/Semantic_Segmentation_Overall.PNG" width="750">

The diagram above shows the Encoder-Decoder model taking an input image and outputting the predicted label map. This process can be broken into three parts. Note that the example used here shows a low-resolution prediction map; in practice, the predicted segmentation label map should have the same resolution as the original input.

##### The image below represents part 1 of Image Segmentation:

<img src="Description Images/Semantic_Segmentation_1.PNG" width="750">

Image Ref -> https://www.jeremyjordan.me/semantic-segmentation/

First, the main goal of the model is to take an input, such as an RGB colour image tensor with the shape (Height x Width x 3) or a greyscale tensor with the shape (Height x Width x 1), and to output a segmentation label map in which each pixel has a class label represented as an integer (Height x Width x 1).

##### The image below represents part 2 of Image Segmentation:

<img src="Description Images/Semantic_Segmentation_2.PNG" width="750">

Image Ref -> https://www.jeremyjordan.me/semantic-segmentation/

Second, the image above shows an intermediate stage composed of individual masks, one for each class label. By setting the number of output channels equal to the number of classes, the Encoder-Decoder model produces an output tensor of shape (Height x Width x N), which also means that it can be trained as a classifier. The loss is computed with cross-entropy, which compares the softmax values with the one-hot-encoded ground truth label maps. The (Height x Width x N) predictions can be transformed into per-pixel labels by selecting the highest value along the channel axis. In other words, an output prediction image is formed by collapsing the segmentation map, taking the argmax along the channel axis (the depth-wise pixel vector).
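As a rough sketch of this step, the snippet below one-hot encodes an integer label map and computes the mean per-pixel cross-entropy against a set of softmax outputs. The helper names (`one_hot`, `pixel_cross_entropy`) and the toy values are assumptions made for illustration, not code from the project.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert an integer label map (H, W) into a one-hot tensor (H, W, N)."""
    return np.eye(num_classes, dtype=np.float64)[labels]

def pixel_cross_entropy(probs, labels_one_hot, eps=1e-7):
    """Mean cross-entropy over all pixels; both inputs have shape (H, W, N)."""
    per_pixel = -np.sum(labels_one_hot * np.log(probs + eps), axis=-1)
    return per_pixel.mean()

# Hypothetical ground truth: a 2 x 2 label map with N = 3 classes.
labels = np.array([[0, 1], [2, 0]])
target = one_hot(labels, num_classes=3)       # shape (2, 2, 3)

# Hypothetical softmax output: 0.8 on the true class, 0.1 on the two others.
probs = 0.8 * target + 0.1 * (1.0 - target)

loss = pixel_cross_entropy(probs, target)

# Collapsing the (H, W, N) prediction back into per-pixel labels.
predicted_labels = np.argmax(probs, axis=-1)  # shape (2, 2)
```

Because every pixel here assigns the same probability to its true class, the loss equals the cross-entropy of a single pixel.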

##### The image below represents part 3 of Image Segmentation:

<img src="Description Images/Semantic_Segmentation_3.PNG" width="750">

Image Ref -> https://www.jeremyjordan.me/semantic-segmentation/

Third, overlaying these predictions into a single channel forms the target prediction image, referred to as the __mask__, in which each class is highlighted over the corresponding regions of the image.





## 2.2 - Training the model with segmentation losses and metrics:

Advanced architectures like FCN-8s or U-Net are advantageous for semantic segmentation, as they are performant systems. Like most models, they require a proper loss to converge optimally. Although cross-entropy loss can be used to train models for both coarse and dense classification, some precautions are needed in the dense case.

One such precaution concerns __class imbalance__. This occurs when the number of data samples for one class overwhelmingly exceeds that of another, which tends to produce a model biased towards predicting the majority class. This is of course not an ideal model to have in practice. For image classification, this can be mitigated by removing or adding images so that all classes appear in similar proportions. For pixel-level classification, the problem is different and can't be solved by simply adding or removing images. For example, some classes may appear in every image but span only a few pixels, while other classes cover much larger areas (e.g. roads, cars etc.). In these cases, the dataset itself can't be edited to correct the imbalance.

To fix this, the loss function must be adapted to compensate for the over-represented classes. In practice, this is done by weighting the contribution of each class to the cross-entropy loss: the less a class appears in the training images, the more weight it receives in the loss, so the network is heavily penalised if it begins to ignore the smaller classes. The weight maps are derived from the ground truth label maps, where the weight applied to each pixel is set according to its class and also according to the pixel's position relative to other elements.

Another fix for this problem is to replace the cross-entropy function altogether with a cost function that is not affected by the proportions of the classes. One such function is the __Intersection-Over-Union (IoU)__; another is the __Sorensen-Dice Coefficient (Dice Coefficient)__.
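A minimal sketch of class-weighted cross-entropy is given below. It assumes one simple weighting scheme (weights inversely proportional to class frequency in the label map) and leaves out the position-based component mentioned above; the function names and the toy 4 x 4 example are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """One weight per class, inversely proportional to its pixel count.
    Assumption: classes absent from the label map get a weight of 0."""
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    weights = np.zeros(num_classes)
    present = counts > 0
    weights[present] = counts.sum() / (num_classes * counts[present])
    return weights

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-7):
    """Cross-entropy where each pixel is weighted by the weight of its true class."""
    # Probability the model assigned to the true class of each pixel.
    true_probs = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
    pixel_weights = class_weights[labels]
    return np.sum(pixel_weights * -np.log(true_probs + eps)) / np.sum(pixel_weights)

# Hypothetical 4 x 4 label map: class 0 ("road") dominates, class 1 is rare.
labels = np.zeros((4, 4), dtype=np.int64)
labels[0, :] = 1

# A lazy model that predicts the majority class everywhere.
probs = np.empty((4, 4, 2))
probs[..., 0] = 0.9
probs[..., 1] = 0.1

weights = inverse_frequency_weights(labels, num_classes=2)
unweighted = weighted_cross_entropy(probs, labels, np.ones(2))
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights, unweighted, weighted)
```

With the weights applied, the lazy majority-class prediction is penalised more heavily than under plain cross-entropy, which is the intended effect.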

The __Jaccard Index__ or __Intersection over Union (IoU)__ is a common metric for measuring how well a prediction matches the ground truth. The IoU is defined as:

$$ IoU(A, B) = \frac{\lvert{A \cap B}\rvert}{\lvert{A \cup B}\rvert} = \frac{\lvert{A \cap B}\rvert}{\lvert{A}\rvert + \lvert{B}\rvert - \lvert{A \cap B}\rvert} $$

The __Sorensen-Dice Coefficient (Dice Coefficient)__ is a statistical measure of the similarity between two samples. It ranges from 0 to 1, where 1 represents a perfect overlap between the two images. The Dice Coefficient is defined as:

$$ Dice(A, B) =  \frac{2\lvert{A \cap B}\rvert}{\lvert{A}\rvert + \lvert{B}\rvert} $$

Where for both,
- $\lvert{A}\rvert$ is the cardinality of set A, i.e. the number of elements A contains.
- $\lvert{B}\rvert$ is the cardinality of set B, i.e. the number of elements B contains.
- $\lvert{A \cap B}\rvert$ is the number of elements that A and B have in common (the numerator of the IoU). Here $A \cap B$ is the intersection of the two sets.
- $\lvert{A \cup B}\rvert$ is the total number of elements covered by A and B (the denominator of the IoU). Here $A \cup B$ is the union of the two sets.

Both __Intersection over Union (IoU)__ and __Sorensen-Dice Coefficient (Dice Coefficient)__ share several properties, and each can be computed from the other, as the equations below show:
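For binary masks, both metrics reduce to counting pixels. The sketch below is one possible NumPy version; returning 1.0 when both masks are empty is a convention assumed here, and the two 3 x 3 masks are invented for the example.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two binary masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union > 0 else 1.0

def dice(a, b):
    """Sorensen-Dice coefficient of two binary masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

# Hypothetical masks: 3 pixels each, 2 of them shared.
pred = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]])
truth = np.array([[1, 1, 0],
                  [0, 0, 0],
                  [0, 1, 0]])

print(iou(pred, truth), dice(pred, truth))
```

Here the intersection holds 2 pixels and the union 4, so the IoU is 2/4 while the Dice coefficient is 4/6.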

$$ IoU(A, B) = \frac{Dice(A, B)}{2 - Dice(A, B)} $$ and $$ Dice(A, B) = \frac{2 * IoU(A, B)}{1 + IoU(A, B)}$$
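As a short worked derivation of this relationship, write $I = \lvert{A \cap B}\rvert$ and $S = \lvert{A}\rvert + \lvert{B}\rvert$ (shorthand introduced here only for brevity), and assume $I > 0$. Using $\lvert{A \cup B}\rvert = S - I$:

$$ Dice(A, B) = \frac{2I}{S} \quad \Rightarrow \quad S = \frac{2I}{Dice(A, B)} $$

$$ IoU(A, B) = \frac{I}{S - I} = \frac{I}{\frac{2I}{Dice(A, B)} - I} = \frac{Dice(A, B)}{2 - Dice(A, B)} $$

Solving the last expression for the Dice coefficient gives $IoU(A, B) \cdot (2 - Dice(A, B)) = Dice(A, B)$, hence $Dice(A, B) = \frac{2 \cdot IoU(A, B)}{1 + IoU(A, B)}$.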

For one-class semantic segmentation, the __numerator__ of the Dice Coefficient is twice the number of correctly classified pixels, while the __denominator__ is the total number of pixels that belong to this class in the prediction and ground truth masks combined.

As a metric, the Dice coefficient doesn't depend on the relative number of pixels that a class takes up in the images. For multi-class tasks, the Dice coefficient is computed for each class, comparing each pair of predicted and ground truth masks, and the results are then averaged.

As mentioned earlier, the Dice coefficient ranges from 0 to 1, where 1 represents a perfect overlap between the two images. To use it as a loss function in a network (where the loss is minimised), the score is __reversed__. For semantic segmentation with "N" classes, the __Dice loss__ is defined as:

$$ L_{Dice}(y, y^{true}) = 1 - \frac{1}{N}\sum^{N-1}_{k=0}Dice(y_{k}, y^{true}_{k}) $$ where $$ Dice(a, b) = \frac{\epsilon + 2\sum_{i, j}(a \odot b)_{i, j}}{\epsilon + \sum_{i, j}a_{i, j} + \sum_{i, j}b_{i, j}} $$

For two one-hot encoded tensors "a" and "b", the Dice numerator is approximated by an element-wise multiplication of the two, after which the values of the resulting tensor are summed. The denominator is obtained by summing all the elements of "a" and "b". A small " $\epsilon$ " value is added to the denominator to avoid division by zero, and to the numerator to smooth the result. This is termed the __Soft Dice__ loss.
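The Soft Dice loss defined above could be written in NumPy along the following lines. This is a sketch for a single image of shape (H, W, N); the function name `soft_dice_loss`, the default `eps` and the toy inputs are assumptions for illustration rather than the implementation used later in the notebook.

```python
import numpy as np

def soft_dice_loss(y_pred, y_true, eps=1e-6):
    """Soft Dice loss for tensors of shape (H, W, N).

    y_pred holds per-pixel class probabilities, y_true the one-hot ground truth.
    """
    spatial_axes = (0, 1)
    # Element-wise product summed over the pixels: soft intersection per class.
    numerator = eps + 2.0 * np.sum(y_pred * y_true, axis=spatial_axes)
    denominator = eps + np.sum(y_pred, axis=spatial_axes) + np.sum(y_true, axis=spatial_axes)
    dice_per_class = numerator / denominator
    # Average over the N classes, then reverse so that lower is better.
    return 1.0 - np.mean(dice_per_class)

# Hypothetical 2 x 2 ground truth with N = 2 classes.
labels = np.array([[0, 1], [1, 0]])
y_true = np.eye(2)[labels]

perfect_loss = soft_dice_loss(y_true, y_true)        # prediction equals ground truth
wrong_loss = soft_dice_loss(1.0 - y_true, y_true)    # every pixel mislabelled
print(perfect_loss, wrong_loss)
```

A perfect prediction drives the loss to 0, while a completely wrong one pushes it towards 1.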

## 2.3 - Conditional Random Fields (CRFs) post-processing:

Labelling every pixel for segmentation is a complex task and often yields imperfect label maps, with poor contours and small incorrect areas. This is where __Conditional Random Fields (CRFs)__ come in, improving the results through post-processing.

Generally, CRFs improve pixel-level predictions by accounting for the context present in the original image. When there is no abrupt change of colour between two neighbouring pixels, that is, when the colour gradient between them is small, they are likely to belong to the same class. The CRF method therefore returns refined label maps by combining this spatial and colour-based model with the probability maps from the predictor.
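The colour-and-position intuition is commonly expressed in dense CRFs as a bilateral (appearance) kernel: the affinity between two pixels decays with both their spatial distance and their colour difference. The snippet below is only a conceptual NumPy sketch of such a kernel; it is not the "pydensecrf" API, and the function name and bandwidth values (`theta_pos`, `theta_rgb`) are illustrative assumptions.

```python
import numpy as np

def appearance_affinity(pos_i, pos_j, rgb_i, rgb_j, theta_pos=80.0, theta_rgb=13.0):
    """Gaussian affinity between two pixels based on position and colour.

    Values close to 1 suggest the two pixels should share a label;
    values close to 0 mean the pair barely influences each other.
    """
    pos_dist = np.sum((np.asarray(pos_i, float) - np.asarray(pos_j, float)) ** 2)
    rgb_dist = np.sum((np.asarray(rgb_i, float) - np.asarray(rgb_j, float)) ** 2)
    return np.exp(-pos_dist / (2.0 * theta_pos ** 2) - rgb_dist / (2.0 * theta_rgb ** 2))

# Two neighbouring pixels with nearly identical colours (smooth region).
smooth = appearance_affinity((10, 10), (10, 11), (120, 120, 120), (122, 121, 119))

# Two neighbouring pixels on either side of a sharp colour edge.
edge = appearance_affinity((10, 10), (10, 11), (120, 120, 120), (30, 200, 40))

print(smooth, edge)
```

In a full CRF, affinities like these act as pairwise terms that pull similar-looking neighbours towards the same label, while the network's probability maps provide the per-pixel (unary) terms.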

Here, the "pydensecrf" package will be used.

## 3 - Image Segmentation for Self-Driving Cars:

The background and model exercises presented in the previous sections (and earlier notebooks) will now be applied to the __segmentation of traffic images for self-driving cars__.

## 3.1 - The Task:

Applying semantic segmentation to video images obtained from camera systems allows a vehicle to understand the environment and the elements around it. It allows the vehicle to distinguish pedestrians and bikes, follow traffic signs and lines, and so on. The dataset used here is the "Cityscapes" dataset.

## 4 - Implementation in TensorFlow:
