<a href="https://colab.research.google.com/github/rahiakela/tensorflow-computer-vision-cookbook/blob/main/8-image-segmentation/1_image_segmentation_using_convolutional_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Image segmentation using convolutional network

Image segmentation is one of the biggest areas of study in computer vision. It consists of simplifying the visual contents of an image by grouping together pixels that share one or more defining characteristics, such as location, color, or texture. 

As is the case with many other subareas of computer vision, image segmentation has been greatly boosted by deep neural networks, mainly in industries such as medicine and autonomous driving.

While it's great to classify the contents of an image, more often than not, it's not enough.

What if we want to know exactly where an object is? 

What if we're interested in its shape? 

What if we need its contour? 

These fine-grained needs cannot be met with traditional classification techniques.

We can frame an image segmentation problem in a very similar way to a regular classification project. 

How?

Instead of labeling the image as a whole, we'll label each pixel! This is known as image segmentation.

## Image segmentation using fully convolutional network

If you were to create your first network for image segmentation while knowing that, at its core, segmenting is just pixel-wise classification, what would you do? 

You would probably take a battle-tested architecture and swap the final layers (usually fully connected ones) with convolutions in order to produce an output volume, instead of an output vector.

Well, that's exactly what we'll do to build a **Fully Convolutional Network (FCN)** for image segmentation based on the famous **VGG16** network.

## Setup

In [None]:
!pip install git+https://github.com/tensorflow/docs

In [2]:
import pathlib

import cv2
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_docs as tfdocs
import tensorflow_docs.plots
from tensorflow.keras.layers import *
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import load_img

# Define an alias for tf.data.experimental.AUTOTUNE:
AUTOTUNE = tf.data.experimental.AUTOTUNE

In [None]:
# download VGG model weights
!wget https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5

## Loading and preparing dataset

Regarding the data, we will segment images from the Oxford-IIIT Pet dataset. The
good news is that we'll access it using tensorflow-datasets, so we don't really need to do anything in that respect here. 

Each pixel in this dataset is classified as follows:
- 1: The pixel belongs to a pet (cat or dog).
- 2: The pixel belongs to the contour of a pet.
- 3: The pixel belongs to the surroundings.

We will normalize the images in the dataset to the range [0, 1].
Just for consistency's sake, we'll subtract one from each pixel in the mask so that they go from 0 all the way to 2:

In [4]:
def normalize(input_image, input_mask):
  input_image = tf.cast(input_image, tf.float32) / 255.0
  input_mask -= 1

  return input_image, input_mask

Let's loads both the image and its mask, given a TensorFlow dataset element. We will seize the opportunity to resize the images to `256x256` here. Also, if the train flag is set to True, we can perform a bit of augmentation by randomly mirroring the image and its mask. Lastly, we must normalize the inputs.

In [5]:
@tf.function
def load_image(dataset, train=True):
  input_image = tf.image.resize(dataset["image"], (256, 256))
  input_mask = tf.image.resize(dataset["segmentation_mask"], (256, 256))

  if train and np.random.uniform() > 0.5:
    input_image = tf.image.flip_left_right(input_image)
    input_mask = tf.image.flip_left_right(input_mask)

  input_image, input_mask = normalize(input_image, input_mask)

  return input_image, input_mask

## Create Model Architecture

Let's implement a class, which encapsulates all the logic required to build, train, and evaluate our FCN image segmentation model.

In [6]:
class FCN(object):
  
  def __init__(self, input_shape=(256, 256, 3), output_channels=3):
    self.input_shape = input_shape 
    self.output_channels = output_channels 

    # defining the path to the weights of the VGG16 model
    # self.vgg_weights_path = str(pathlib.Path.home() / "vgg16_weights_tf_dim_" "ordering_tf_kernels.h5")
    self.vgg_weights_path = "vgg16_weights_tf_dim_" "ordering_tf_kernels.h5"

    self.model = self.create_model()

    loss = SparseCategoricalCrossentropy(from_logits=True)
    self.model.compile(optimizer=RMSprop(), loss=loss, metrics=["accuracy"])

  def create_model(self):
    input = Input(shape=self.input_shape)

    # first block of convolutions and max pooling layers
    x = Conv2D(filters=64, kernel_size=(3, 3), activation="relu", padding="same", name="block1_conv1")(input)
    x = Conv2D(filters=64, kernel_size=(3, 3), activation="relu", padding="same", name="block1_conv2")(x)
    x = MaxPooling2D(pool_size=(2, 2), strides=2, name="block1_pool")(x)

    # second block of convolutions and max pooling layers
    x = Conv2D(filters=128, kernel_size=(3, 3), activation="relu", padding="same", name="block2_conv1")(x)
    x = Conv2D(filters=128, kernel_size=(3, 3), activation="relu", padding="same", name="block2_conv2")(x)
    x = MaxPooling2D(pool_size=(2, 2), strides=2, name="block2_pool")(x)

    # third block of convolutions and max pooling layers
    x = Conv2D(filters=256, kernel_size=(3, 3), activation="relu", padding="same", name="block3_conv1")(x)
    x = Conv2D(filters=256, kernel_size=(3, 3), activation="relu", padding="same", name="block3_conv2")(x)
    x = Conv2D(filters=256, kernel_size=(3, 3), activation="relu", padding="same", name="block3_conv3")(x)
    x = MaxPooling2D(pool_size=(2, 2), strides=2, name="block3_pool")(x)

    block3_pool = x

    # fourth block of convolutions and max pooling layers
    x = Conv2D(filters=512, kernel_size=(3, 3), activation="relu", padding="same", name="block4_conv1")(x)
    x = Conv2D(filters=512, kernel_size=(3, 3), activation="relu", padding="same", name="block4_conv2")(x)
    x = Conv2D(filters=512, kernel_size=(3, 3), activation="relu", padding="same", name="block4_conv3")(x)
    block4_pool = MaxPooling2D(pool_size=(2, 2), strides=2, name="block4_pool")(x)

    # The fifth block is a repetition of block four, again with 512 filter-deep convolutions
    x = Conv2D(filters=512, kernel_size=(3, 3), activation="relu", padding="same", name="block5_conv1")(block4_pool)
    x = Conv2D(filters=512, kernel_size=(3, 3), activation="relu", padding="same", name="block5_conv2")(x)
    x = Conv2D(filters=512, kernel_size=(3, 3), activation="relu", padding="same", name="block5_conv3")(x)
    block5_pool = MaxPooling2D(pool_size=(2, 2), strides=2, name="block5_pool")(x)

    model = Model(input, block5_pool)
    model.load_weights(self.vgg_weights_path, by_name=True)

    output = Conv2D(filters=self.output_channels, kernel_size=(7, 7), activation="relu", padding="same", name="conv6")(block5_pool)

    """
    output, in a traditional VGG16 architecture, is comprised of fully connected layers. However, we'll be replacing them 
    with transposed convolutions. Notice we are connecting these layers to the output of the fifth block
    """
    conv6_4 = Conv2DTranspose(filters=self.output_channels, kernel_size=(4, 4), strides=4, use_bias=False)(output)
    # Create a 1x1 convolution and connect it to the output of the fourth block(this is, indeed, a skip connection)
    pool4_n = Conv2D(filters=self.output_channels, kernel_size=(1, 1), activation="relu", padding="same", name="pool4_n")(block4_pool)
    pool4_n_2 = Conv2DTranspose(filters=self.output_channels, kernel_size=(2, 2), strides=2, use_bias=False)(pool4_n)
    # Pass the output of the third block through a 1x1 convolution
    pool3_n = Conv2D(filters=self.output_channels, kernel_size=(1, 1), activation="relu", padding="same", name="pool3_n")(block3_pool)

    # Then, merge these three paths into one and pass them through a final transposed convolution.
    output = Add(name="add")([pool4_n_2, pool3_n, conv6_4])
    output = Conv2DTranspose(filters=self.output_channels, kernel_size=(8, 8), strides=8, use_bias=False)(output)

    output = Softmax()(output)

    return Model(input, output)

  @staticmethod
  def plot_model_history(model_history, metric, ylim=True):
    plt.style.use('seaborn-darkgrid')
    plotter = tfdocs.plots.HistoryPlotter()
    plotter.plot({'Model': model_history}, metric=metric)

    plt.title(f'{metric.upper()}')
    if ylim is None:
        plt.ylim([0, 1])
    else:
        plt.ylim(ylim)

    plt.savefig(f'{metric}.png')
    plt.close()

  def train(self, train_dataset, epochs, steps_per_epoch, validation_dataset, validation_steps):
    hist = self.model.fit(train_dataset, epochs=epochs, steps_per_epoch=steps_per_epoch, validation_steps=validation_steps, validation_data=validation_dataset)

    self.plot_model_history(hist, 'loss', [0., 2.0])
    self.plot_model_history(hist, 'accuracy')

  @staticmethod
  def process_mask(mask):
    """
    It is used to make the segmentation masks compatible with OpenCV. What this function does is create a three-channeled
    version of a grayscale mask and upscale the class values to the [0, 255] range
    """
    mask = (mask.numpy() * 127.5).astype("uint8")
    mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2RGB)

    return mask

  def save_image_and_masks(self, image, ground_truth_mask, prediction_mask, image_id):
    """
    This method creates a mosaic of the original image, the ground truth mask, and the predicted segmentation mask, 
    and then saves it to disk for later revision
    """
    image = (image.numpy() * 255.0).astype('uint8')
    gt_mask = self.process_mask(ground_truth_mask)
    pred_mask = self.process_mask(prediction_mask)

    mosaic = np.hstack([image, gt_mask, pred_mask])
    mosaic = cv2.cvtColor(mosaic, cv2.COLOR_RGB2BGR)

    cv2.imwrite(f'mosaic_{image_id}.jpg', mosaic)

  @staticmethod
  def create_mask(prediction_mask):
    """
    In order to pass the output volume produced by the network to a valid segmentation mask, we must take the index
    with the highest value at each pixel location. This corresponds to the most likely category for that pixel.
    """
    prediction_mask = tf.argmax(prediction_mask, axis=-1)
    prediction_mask = prediction_mask[..., tf.newaxis]

    return prediction_mask[0]

  def save_predictions(self, dataset, sample_size=1):
    """
    This method uses the FCN to predict the mask of a sample of images in the input dataset. It then saves the result to disk.
    """
    for id, (image, mask) in enumerate(dataset.take(sample_size), start=1):
      pred_mask = self.model.predict(image)
      pred_mask = self.create_mask(pred_mask)

      image = image[0]
      ground_truth_mask = mask[0]

      self.save_image_and_masks(image, ground_truth_mask, pred_mask, image_id=id)

  def evaluate(self, test_dataset, sample_size=5):
    result = self.model.evaluate(test_dataset)
    print(f"Accuracy: {result[1] * 100:.2f}%")

    self.save_predictions(test_dataset, sample_size)

## Putting all stuffs together

Now let's download (or load, if cached) Oxford IIIT Pet Dataset, along with its
metadata, using TensorFlow Datasets:

In [None]:
dataset, info = tfds.load("oxford_iiit_pet", with_info=True)

Now we use the metadata to define the corresponding number of steps the network will take over the training and validation datasets.

In [8]:
TRAIN_SIZE = info.splits['train'].num_examples
VALIDATION_SIZE = info.splits['test'].num_examples
BATCH_SIZE = 32
STEPS_PER_EPOCH = TRAIN_SIZE // BATCH_SIZE

VALIDATION_SUBSPLITS = 5
VALIDATION_STEPS = VALIDATION_SIZE // BATCH_SIZE
VALIDATION_STEPS //= VALIDATION_SUBSPLITS

BUFFER_SIZE = 1000

Let's define the training and testing datasets' pipelines:

In [9]:
train_dataset = (dataset["train"].map(load_image, num_parallel_calls=AUTOTUNE)
                                 .cache()
                                 .shuffle(BUFFER_SIZE)
                                 .batch(BATCH_SIZE)
                                 .repeat()
                                 .prefetch(buffer_size=AUTOTUNE))
test_dataset = (dataset["test"].map(lambda x: load_image(x, train=False), num_parallel_calls=AUTOTUNE)
                               .batch(BATCH_SIZE))

Now we will instantiate the FCN and train it for 120 epochs.

In [10]:
fcn = FCN(output_channels=3)
fcn.train(train_dataset, epochs=120, steps_per_epoch=STEPS_PER_EPOCH, validation_steps=VALIDATION_STEPS, validation_dataset=test_dataset)

Epoch 1/120
Epoch 2/120
Epoch 3/120
Epoch 4/120
Epoch 5/120
Epoch 6/120
Epoch 7/120
Epoch 8/120
Epoch 9/120
Epoch 10/120
Epoch 11/120
Epoch 12/120
Epoch 13/120
Epoch 14/120
Epoch 15/120
Epoch 16/120
Epoch 17/120
Epoch 18/120
Epoch 19/120
Epoch 20/120
Epoch 21/120
Epoch 22/120
Epoch 23/120
Epoch 24/120
Epoch 25/120
Epoch 26/120
Epoch 27/120
Epoch 28/120
Epoch 29/120
Epoch 30/120
Epoch 31/120
Epoch 32/120
Epoch 33/120
Epoch 34/120
Epoch 35/120
Epoch 36/120
Epoch 37/120
Epoch 38/120
Epoch 39/120
Epoch 40/120
Epoch 41/120
Epoch 42/120
Epoch 43/120
Epoch 44/120
Epoch 45/120
Epoch 46/120
Epoch 47/120
Epoch 48/120
Epoch 49/120
Epoch 50/120
Epoch 51/120
Epoch 52/120
Epoch 53/120
Epoch 54/120
Epoch 55/120
Epoch 56/120
Epoch 57/120
Epoch 58/120
Epoch 59/120
Epoch 60/120
Epoch 61/120
Epoch 62/120
Epoch 63/120
Epoch 64/120
Epoch 65/120
Epoch 66/120
Epoch 67/120
Epoch 68/120
Epoch 69/120
Epoch 70/120
Epoch 71/120
Epoch 72/120
Epoch 73/120
Epoch 74/120
Epoch 75/120
Epoch 76/120
Epoch 77/120
Epoch 78

Lastly, evaluate the network on the test dataset:

In [12]:
fcn.evaluate(test_dataset)

Accuracy: 33.29%


The training curves display a healthy behavior, meaning that the network did,
indeed, learn. 

However, the true test is to visually assess the results:

## Conclusions

In this recipe, we implemented an FCN for image segmentation. Even though we adapted a well-known architecture, VGG16, to our purposes, in reality, there are many different adaptations of FCNs that extend or modify other seminal architectures, such as ResNet50, DenseNet, and other variants of VGG.

What we need to remember is that FCN is more of a template than a concrete
implementation. Such a template consists of swapping the fully connected layers at the end of these networks, which are often used for traditional image classification, with 1x1 convolutions and upsampling layers (either UpSampling2D() with bilinear interpolation or ConvTranspose2D()). The achieved result is that instead of classifying the whole image with an output vector of probabilities, we produce an output volume that has the same dimensions as the input image, where each pixel contains a probability distribution of the classes it can belong to. Such an output volume of pixel-wise likelihood
is known as a predicted segmentation mask.