# **About this notebook**

This notebook presents a solution to the semantic segmentation task on the "Semantic Drone Dataset", which can be downloaded from [here](https://www.kaggle.com/bulentsiyah/semantic-drone-dataset).

The original dataset is available at this [link](https://www.tugraz.at/index.php?id=22387).

I hope this work is useful for anyone approaching ML and semantic segmentation for the first time (like me! :) ).


# **Semantic Segmentation**

Source: "https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47"

![Semantic Segmentation](https://miro.medium.com/max/875/1*nXlx7s4wQhVgVId8qkkMMA.png)

Semantic image segmentation is a computer vision task whose goal is to label each pixel of an image with the class of what it represents. The output is a high-resolution image (typically the same size as the input image) in which every pixel is assigned to a particular class. In other words, it is image classification at the pixel level.

Examples of applications of this task are:

- Autonomous vehicles, where semantic segmentation provides information about free space on the road and detects lane markings and traffic signs.
- Biomedical image diagnosis, where it helps radiologists improve their analysis and greatly reduces the time required to run diagnostic tests.
- Geo sensing, where it recognizes the type of land cover (e.g., urban, agricultural, water) for each pixel of a satellite image; land cover classification can be regarded as a multi-class semantic segmentation task.
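
To make the idea of pixel-level classification concrete, here is a minimal NumPy sketch. The scores are made-up numbers for a hypothetical 2x3 image with 4 classes; the only point is that taking the arg-max over the class axis gives one label per pixel, with the same height and width as the input.

```python
import numpy as np

# Hypothetical class scores for a tiny 2x3 "image" and 4 classes.
# Shape: (height, width, num_classes).
scores = np.array([
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1], [0.25, 0.25, 0.1, 0.4]],
])

# One class index per pixel: the label map keeps the spatial size of the input.
label_map = np.argmax(scores, axis=-1)

print(label_map.shape)
print(label_map)
```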

# **U-Net**

U-Net is an end-to-end fully convolutional network (FCN): it contains only convolutional layers and no dense layers, which is why it is not tied to one fixed input size. The architecture consists of two paths. The first is the contraction path (also called the encoder), which captures the context in the image; it is just a traditional stack of convolutional and max pooling layers. The second is the symmetric expanding path (also called the decoder), which enables precise localization using transposed convolutions. In the original paper, the U-Net is described as follows:

![](https://miro.medium.com/max/3000/1*OkUrpDD6I0FpugA_bbYBJQ.png)

Note that pooling operations down-sample the image, i.e. they convert a high-resolution image into a low-resolution one (convolutions do the same when they are strided or unpadded).

Max pooling helps the network understand “what” is in the image by increasing the receptive field. However, it tends to lose the information about “where” the objects are.

In semantic segmentation it is not enough to know “what” is present in the image; it is equally important to know “where” it is. Hence we need a way to up-sample the image from low resolution back to high resolution, which helps restore the “where” information. Transposed convolution is a common choice for up-sampling: it learns its parameters through backpropagation to convert a low-resolution image into a high-resolution one. (The model built later in this notebook uses bilinear `UpSampling2D` layers instead.)
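
The loss of “where” information can be seen on a toy array. The sketch below is plain NumPy, not Keras; the helper names `max_pool_2x2` and `upsample_nearest_2x` are made up for this illustration, and the up-sampling is a fixed nearest-neighbour repeat rather than a learned transposed convolution.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_nearest_2x(x):
    """Fixed (non-learned) 2x up-sampling: repeat every value along both axes."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

image = np.array([
    [1, 0, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
])

pooled = max_pool_2x2(image)            # keeps "what" (the values 1, 2, 3)
restored = upsample_nearest_2x(pooled)  # original size, but positions are blurred

print(pooled)
print(restored)
```

After pooling, each 2x2 block is summarised by its maximum, so the exact position of the non-zero pixel inside the block is gone. Up-sampling alone cannot bring it back, which is the motivation for the skip connections.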

The following image shows a U-Net example for an input image of size 128x128x3.

![U-Net architecture](https://miro.medium.com/max/2082/1*yzbjioOqZDYbO6yHMVpXVQ.jpeg)

To better explain the image:

- “2@Conv layers” means that two consecutive convolutional layers are applied.
- c1, c2, ... c9 are the output tensors of the convolutional layers.
- p1, p2, p3 and p4 are the output tensors of the max pooling layers.
- u6, u7, u8 and u9 are the output tensors of the up-sampling (transposed convolutional) layers.
- The left-hand side is the contraction path (encoder), where we apply regular convolutions and max pooling layers.
- In the encoder, the size of the image gradually decreases while the depth gradually increases, going from 128x128x3 to 8x8x256.
- This basically means the network learns the “WHAT” information in the image, but loses the “WHERE” information.
- The right-hand side is the expansion path (decoder), where we apply transposed convolutions along with regular convolutions.
- In the decoder, the size of the image gradually increases and the depth gradually decreases, going from 8x8x256 to 128x128x1.
- Intuitively, the decoder recovers the “WHERE” information (precise localization) by gradually applying up-sampling.
- To localize more precisely, at every step of the decoder we use skip connections, concatenating the output of the transposed convolution layers with the encoder feature maps at the same level (here `+` denotes concatenation along the channel axis):

    u6 = u6 + c4

    u7 = u7 + c3

    u8 = u8 + c2

    u9 = u9 + c1

    After every concatenation we again apply two consecutive regular convolutions so that the model can learn to assemble a more precise output.

- This is what gives the architecture its symmetric U shape, hence the name U-Net.
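
The shape bookkeeping described in the list can be traced with a few lines of plain Python. This is only a sketch: the function name `unet_shapes` is made up, and the filter counts (16, 32, 64, 128, then 256 at the bottom) are an assumption chosen to match the 128x128x3 to 8x8x256 progression mentioned above.

```python
def unet_shapes(size=128, filters=(16, 32, 64, 128), bridge_filters=256):
    """Trace (height, width, channels) through a four-level U-Net.

    Assumes 'same' padding, 2x2 max pooling and 2x up-sampling. The filter
    counts are illustrative assumptions chosen to reach 8x8x256 at the bottom.
    """
    trace = []
    # Encoder: convolutions keep the spatial size, each pooling halves it.
    for f in filters:
        trace.append(("conv", (size, size, f)))
        size //= 2
        trace.append(("pool", (size, size, f)))
    # Bottom of the "U".
    trace.append(("bridge", (size, size, bridge_filters)))
    # Decoder: up-sample, concatenate with the encoder output of the same
    # size (channels add up), then convolve back down to f channels.
    for f in reversed(filters):
        size *= 2
        trace.append(("up", (size, size, f)))
        trace.append(("concat", (size, size, 2 * f)))
        trace.append(("conv", (size, size, f)))
    # Final 1x1 convolution producing a single-channel output.
    trace.append(("output", (size, size, 1)))
    return trace


for name, shape in unet_shapes():
    print(f"{name:>7}: {shape}")
```

Each `concat` row has twice the channels of the `up` row before it, because the encoder feature map of the same spatial size is stacked along the channel axis.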

In [None]:
## Import libraries 

import cv2
import numpy as np
import os
import pandas as pd

from tqdm import tqdm
from glob import glob
from albumentations import RandomCrop, HorizontalFlip, VerticalFlip

from sklearn.model_selection import train_test_split
from PIL import Image

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, MaxPool2D, UpSampling2D, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model  # same Keras namespace as the other imports
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, Callback

In [None]:
## Data Augmentation
## The resolution of 1536x1024 px was chosen to keep the aspect ratio of the original images (6000x4000 px).

def create_dir(path):
    if not os.path.exists(path):
        os.makedirs(path)
        
def augment_data(images, masks, save_path, augment=True):
    H = 1024
    W = 1536
    for x,y in tqdm(zip(images, masks), total=len(images)):
        name = x.split("/")[-1].split(".")
        image_name = name[0]
        image_extn = name[1]

        name = y.split("/")[-1].split(".")
        mask_name = name[0]
        mask_extn = name[1]       
        
        x = cv2.imread(x, cv2.IMREAD_COLOR)
        x = cv2.resize(x, (W, H))
        y = cv2.imread(y, cv2.IMREAD_COLOR)
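        # Nearest-neighbour keeps mask values valid class ids (no blended labels at borders).
        y = cv2.resize(y, (W, H), interpolation=cv2.INTER_NEAREST)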
        
        if augment:
            
            aug = RandomCrop(int(2*H/3), int(2*W/3), always_apply=False, p=1.0)
            augmented = aug(image=x, mask=y)
            x1 = augmented["image"]
            y1 = augmented["mask"]
 
            aug = HorizontalFlip(always_apply=False, p=1.0)
            augmented = aug(image=x, mask=y)
            x2 = augmented["image"]
            y2 = augmented["mask"]
            
            aug = VerticalFlip(always_apply=False, p=1.0)
            augmented = aug(image=x, mask=y)
            x3 = augmented["image"]
            y3 = augmented["mask"] 
            
            save_images = [x, x1, x2, x3]
            save_masks = [y, y1, y2, y3]            
          
        else:
            save_images = [x]
            save_masks = [y]
        
        idx = 0
        for i, m in zip(save_images, save_masks):
            i = cv2.resize(i, (W, H))
            # Masks must be resized without interpolating between class ids.
            m = cv2.resize(m, (W, H), interpolation=cv2.INTER_NEAREST)
            
            tmp_img_name = f"{image_name}_{idx}.{image_extn}"
            tmp_msk_name = f"{mask_name}_{idx}.{mask_extn}" 
            
            image_path = os.path.join(save_path, "images", tmp_img_name)
            mask_path = os.path.join(save_path, "masks", tmp_msk_name)
            
            cv2.imwrite(image_path, i)
            cv2.imwrite(mask_path, m)

            idx+=1


path = "../input/semantic-drone-dataset/dataset/semantic_drone_dataset/"
images = sorted(glob(os.path.join(path, "original_images/*")))
masks = sorted(glob(os.path.join(path, "label_images_semantic/*")))
print(f"Original images:  {len(images)} - Original masks: {len(masks)}")

create_dir("./new_data/images/")
create_dir("./new_data/masks/")

save_path = "./new_data/"

augment_data(images, masks, save_path, augment=True)

images = sorted(glob(os.path.join(save_path, "images/*")))
masks = sorted(glob(os.path.join(save_path, "masks/*")))
print(f"Augmented images:  {len(images)} - Augmented masks: {len(masks)}")
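
The key point of the augmentation step is that every geometric transform is applied to the image and to its mask in exactly the same way, so pixels and labels stay aligned. The sketch below reproduces the three transforms with plain NumPy slicing instead of albumentations; the array sizes and helper names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny image (H, W, 3) and label mask (H, W) holding class ids.
image = rng.integers(0, 256, size=(6, 9, 3), dtype=np.uint8)
mask = rng.integers(0, 23, size=(6, 9), dtype=np.uint8)

def horizontal_flip(img, msk):
    """Mirror image and mask left-to-right."""
    return img[:, ::-1], msk[:, ::-1]

def vertical_flip(img, msk):
    """Mirror image and mask top-to-bottom."""
    return img[::-1], msk[::-1]

def random_crop(img, msk, crop_h, crop_w, rng):
    """Cut the same randomly placed window out of image and mask."""
    h, w = msk.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return img[window], msk[window]

img_c, msk_c = random_crop(image, mask, 4, 6, rng)  # 2/3 of each side, as above
img_h, msk_h = horizontal_flip(image, mask)
img_v, msk_v = vertical_flip(image, mask)

print(img_c.shape, msk_c.shape)
```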

In [None]:
## Create dataframe

image_path =  os.path.join(save_path, "images/")
label_path = os.path.join(save_path, "masks/")

def create_dataframe(path):
    name = []
    for dirname, _, filenames in os.walk(path):
        for filename in filenames:
            name.append(filename.split('.')[0])
    
    return pd.DataFrame({'id': name}, index = np.arange(0, len(name)))

df_images = create_dataframe(image_path)
df_masks = create_dataframe(label_path)
print('Total Images: ', len(df_images))
#print(df_images)

In [None]:
## Split data

X_trainval, X_test = train_test_split(df_images['id'], test_size=0.1, random_state=19)
X_train, X_val = train_test_split(X_trainval, test_size=0.2, random_state=19)

print(f"Train Size : {len(X_train)} images")
print(f"Val Size   :  {len(X_val)} images")
print(f"Test Size  :  {len(X_test)} images")

y_train = X_train #the same values for images (X) and labels (y)
y_test = X_test
y_val = X_val

img_train = [os.path.join(image_path, f"{name}.jpg") for name in X_train]
mask_train = [os.path.join(label_path, f"{name}.png") for name in y_train]
img_val = [os.path.join(image_path, f"{name}.jpg") for name in X_val]
mask_val = [os.path.join(label_path, f"{name}.png") for name in y_val]
img_test = [os.path.join(image_path, f"{name}.jpg") for name in X_test]
mask_test = [os.path.join(label_path, f"{name}.png") for name in y_test]
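
The two chained splits give roughly 72% / 18% / 10% for train, validation and test. The sketch below works out the sizes by hand, assuming that `train_test_split` rounds the held-out fraction up and gives the remainder to the other part. The function name `split_sizes` and the count of 1600 files are hypothetical, used only to show the arithmetic.

```python
import math

def split_sizes(n_samples, test_size=0.1, val_size=0.2):
    """Sizes produced by splitting off a test set, then a validation set."""
    n_test = math.ceil(test_size * n_samples)    # first split: hold out the test set
    n_trainval = n_samples - n_test
    n_val = math.ceil(val_size * n_trainval)     # second split: hold out validation
    n_train = n_trainval - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(1600)  # 1600 is a hypothetical file count
print(n_train, n_val, n_test)
```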

In [None]:
## Define U-Net Model
## To reduce the size of the model, the number of filters in each layer can be lowered.
## To do this, divide the values in filters_x and filters_b by 2.


def conv_block(inputs, filters, pool=True):
    x = Conv2D(filters, 3, padding='same')(inputs)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    
    x = Conv2D(filters, 3, padding='same')(x)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    
    if pool:
        p = MaxPool2D((2,2))(x)
        return x, p
    return x
    
    
def build_unet(shape, num_classes):   
    inputs = Input(shape)
    
    filters_x = [32,64,96,128,128,96,64,32]
    filters_b = [256] 

    # Encoder
    x1, p1 = conv_block(inputs, filters_x[0], pool=True)
    x2, p2 = conv_block(p1, filters_x[1], pool=True)
    x3, p3 = conv_block(p2, filters_x[2], pool=True)
    x4, p4 = conv_block(p3, filters_x[3], pool=True)    
    
    # Bridge
    b1 = conv_block(p4, filters_b[0], pool=False)
    
    # Decoder
    u1 = UpSampling2D((2,2), interpolation='bilinear')(b1)
    c1 = Concatenate()([u1, x4])
    x5 = conv_block(c1, filters_x[4], pool=False)
    
    u2 = UpSampling2D((2,2), interpolation='bilinear')(x5)
    c2 = Concatenate()([u2, x3])
    x6 = conv_block(c2, filters_x[5], pool=False)
    
    u3 = UpSampling2D((2,2), interpolation='bilinear')(x6)
    c3 = Concatenate()([u3, x2])
    x7 = conv_block(c3, filters_x[6], pool=False)
    
    u4 = UpSampling2D((2,2), interpolation='bilinear')(x7)
    c4 = Concatenate()([u4, x1])
    x8 = conv_block(c4, filters_x[7], pool=False)
    
    # Output Layer
    output = Conv2D(num_classes, 1, padding='same', activation='softmax')(x8)

    return Model(inputs, output)

In [None]:
## Define the resolution of the images and the number of classes

H = 768   #to keep the original ratio 
W = 1152 
num_classes = 23

model = build_unet((H, W, 3), num_classes)  # Keras expects (height, width, channels)
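
With four 2x2 pooling stages, the input height and width should be divisible by 2^4 = 16. Otherwise pooling rounds an odd size down, the up-sampled tensor no longer matches the encoder tensor, and the `Concatenate` layers cannot combine them. A minimal sketch of that check follows; `check_unet_input` is a hypothetical helper, not part of the notebook.

```python
def check_unet_input(height, width, num_pools=4):
    """Return the spatial size at the bottom of the U, or raise if sizes do not fit."""
    factor = 2 ** num_pools
    if height % factor or width % factor:
        raise ValueError(
            f"height and width must be multiples of {factor}, got {height}x{width}"
        )
    return height // factor, width // factor

print(check_unet_input(768, 1152))
```

For the resolution chosen here, 768x1152, the bridge works on 48x72 feature maps.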

In [None]:
## Show the summary of the U-Net model and its diagram

model.summary()
plot_model(model,to_file='model.png')

In [None]:
## Dataset Pipeline used for training the model

def read_image(x):
    x = cv2.imread(x, cv2.IMREAD_COLOR)
    x = cv2.resize(x, (W, H))
    x = x/255.0
    x = x.astype(np.float32)
    return x


def read_mask(x):
    x = cv2.imread(x, cv2.IMREAD_GRAYSCALE)
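    # Nearest-neighbour interpolation so that resized masks contain only valid class ids.
    x = cv2.resize(x, (W, H), interpolation=cv2.INTER_NEAREST)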
    x = x.astype(np.int32)
    return x


def tf_dataset(x,y, batch=4):
    dataset = tf.data.Dataset.from_tensor_slices((x,y))
    dataset = dataset.shuffle(buffer_size=500)
    dataset = dataset.map(preprocess)
    dataset = dataset.batch(batch)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(2)
    return dataset
    

def preprocess(x,y):
    def f(x,y):
        x = x.decode()
        y = y.decode()
        image = read_image(x)
        mask = read_mask(y)
        return image, mask
    
    image, mask = tf.numpy_function(f,[x,y],[tf.float32, tf.int32])
    mask = tf.one_hot(mask, num_classes, dtype=tf.int32)
    image.set_shape([H, W, 3])    # In the Images, number of channels = 3. 
    mask.set_shape([H, W, num_classes])    # In the Masks, number of channels = number of classes. 
    return image, mask
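
The one-hot step in `preprocess` turns a mask of class ids with shape (H, W) into a tensor of shape (H, W, num_classes). The NumPy sketch below shows the same transformation on a made-up 2x2 mask, indexing an identity matrix as a stand-in for `tf.one_hot`.

```python
import numpy as np

num_classes = 23

# Hypothetical 2x2 mask of class ids.
mask = np.array([[0, 5],
                 [22, 5]], dtype=np.int32)

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype=np.int32)[mask]

# arg-max over the class axis undoes the encoding.
recovered = one_hot.argmax(axis=-1)

print(one_hot.shape)
```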

In [None]:
## Train the model

# Seeding
np.random.seed(42)
tf.random.set_seed(42)

# Hyperparameters
shape = (H, W, 3)
num_classes = 23  
lr = 1e-4
batch_size = 4 
epochs = 30

# Model
model = build_unet(shape, num_classes)
model.compile(loss="categorical_crossentropy", optimizer=tf.keras.optimizers.Adam(lr), metrics=['accuracy'])

train_dataset = tf_dataset(img_train, mask_train, batch = batch_size)
valid_dataset = tf_dataset(img_val, mask_val, batch = batch_size)

train_steps = len(img_train)//batch_size
valid_steps = len(img_val)//batch_size

callbacks = [
    ModelCheckpoint("model.h5", verbose=1, save_best_only=True),
    ReduceLROnPlateau(monitor='val_loss', patience=3, factor=0.1, verbose=1, min_lr=1e-6),
    EarlyStopping(monitor='val_loss', patience=5, verbose=1)
]

model.fit(train_dataset,
          steps_per_epoch=train_steps,
          validation_data=valid_dataset,
          validation_steps=valid_steps,
          epochs=epochs,
          callbacks=callbacks
         )
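
The `categorical_crossentropy` loss used above compares, for every pixel, the one-hot label with the predicted softmax probabilities. Here is a minimal NumPy sketch of that computation on two made-up pixels with 3 classes. It assumes the per-pixel losses are simply averaged, and it is not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-pixel cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Hypothetical 1x2 "image" with 3 classes: shape (H, W, num_classes).
y_true = np.array([[[1, 0, 0], [0, 1, 0]]], dtype=np.float32)
y_pred = np.array([[[0.8, 0.1, 0.1], [0.25, 0.5, 0.25]]], dtype=np.float32)

per_pixel = categorical_crossentropy(y_true, y_pred)  # -log(0.8) and -log(0.5)
loss = per_pixel.mean()

print(per_pixel, loss)
```

Only the probability assigned to the true class contributes, so a confident correct prediction (0.8) costs less than an uncertain one (0.5).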

In [None]:
## Plot accuracy and loss

train_loss = model.history.history['loss']
val_loss   = model.history.history['val_loss']
train_acc  = model.history.history['accuracy']
val_acc    = model.history.history['val_accuracy']

# Summarize history for accuracy
plt.plot(train_acc)
plt.plot(val_acc)
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')  # second curve is the validation set
plt.show()

# Summarize history for loss
plt.plot(train_loss)
plt.plot(val_loss)
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
## Prediction

create_dir('./results')  #create the folder for the predictions

# Seeding
np.random.seed(42)
tf.random.set_seed(42)

# Model
model = tf.keras.models.load_model("model.h5")

# Saving the masks
for x, y in tqdm(zip(img_test, mask_test), total=len(img_test)):
    name = x.split("/")[-1]
    
    ## Read image
    x = cv2.imread(x, cv2.IMREAD_COLOR)
    x = cv2.resize(x, (W, H))
    x = x/255.0
    x = x.astype(np.float32)

    ## Read mask
    y = cv2.imread(y, cv2.IMREAD_GRAYSCALE)
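    # Nearest-neighbour interpolation keeps the mask values valid class ids.
    y = cv2.resize(y, (W, H), interpolation=cv2.INTER_NEAREST)
    
    y = np.expand_dims(y, axis=-1)  # (H, W, 1)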
    
    y = y * (255/num_classes)
    y = y.astype(np.int32)
    y = np.concatenate([y, y, y], axis=2)
    
    ## Prediction
    p = model.predict(np.expand_dims(x, axis=0))[0]
    p = np.argmax(p, axis=-1)
    
    p = np.expand_dims(p, axis=-1)  
    
    p = p * (255/num_classes)
    p = p.astype(np.uint8)  # 8-bit gray levels (0-243) for saving as an image
    p = np.concatenate([p, p, p], axis=2)
      
    cv2.imwrite(f"./results/{name}", p)
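
The factor `255/num_classes` in the loop above spreads the 23 class ids over the gray range so that the saved predictions are visible as images. The sketch below shows the mapping and how the class ids could be recovered from the gray levels, assuming the values are stored without loss.

```python
import numpy as np

num_classes = 23
scale = 255 / num_classes  # about 11.09 gray levels per class

class_ids = np.arange(num_classes)

# Same mapping as in the prediction loop: scale, then truncate to an integer.
gray = (class_ids * scale).astype(np.uint8)

# Truncation loses less than one gray level, so rounding recovers the class id.
recovered = np.rint(gray / scale).astype(np.int64)

print(gray)
```

Note that the predictions are written under the image's `.jpg` name, so compression can shift gray levels slightly; a lossless format such as PNG would avoid that.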
    

In [None]:
# From the test set, keep only the images that correspond to the original dataset,
# not the ones produced by data augmentation (the originals have the suffix _0 in the name).

image_list = []
mask_list = []

for x,y in tqdm(zip(img_test, mask_test), total=len(img_test)):
    # File names look like "<id>_<idx>.<ext>"; idx == 0 marks the resized original.
    # Parsing the suffix avoids relying on a fixed character position.
    stem = os.path.splitext(os.path.basename(x))[0]
    aug_idx = stem.split("_")[-1]
    
    if aug_idx == '0':
        image_list.append(x)
        mask_list.append(y)

In [None]:
## Plot 5 images to verify the accuracy in the predictions

img_selection = image_list[0:5]
mask_selection = mask_list[0:5]

for img, mask in zip(img_selection, mask_selection):
    name = img.split("/")[-1]
    x = cv2.imread(img, cv2.IMREAD_COLOR)
    x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
    x = cv2.resize(x, (W, H))

    y = cv2.imread(mask, cv2.IMREAD_GRAYSCALE)
    y = cv2.resize(y, (W, H))


    p = cv2.imread(f"./results/{name}", cv2.IMREAD_GRAYSCALE)
    p = cv2.resize(p, (W, H))

    # Plot the three images
    fig, axs = plt.subplots(1, 3, figsize=(20, 20), constrained_layout=True)

    axs[0].imshow(x, interpolation = 'nearest')
    axs[0].set_title('image')
    axs[0].grid(False)

    axs[1].imshow(y, interpolation = 'nearest')
    axs[1].set_title('mask')
    axs[1].grid(False)

    axs[2].imshow(p)
    axs[2].set_title('prediction')
    axs[2].grid(False)

I want to thank "[Idiot Developer](https://www.youtube.com/c/IdiotDeveloper/about)" on YouTube, whose videos helped me write this kernel.