# Deep Learning - Exercise 10

This lecture focuses on using CNNs for object localization tasks.

[Open in Google Colab](https://colab.research.google.com/github/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_10.ipynb)
[Download from GitHub](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/dl_10.ipynb)

##### Remember to set **GPU** runtime in Colab!

In [None]:
!pip install ultralytics

In [None]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import matplotlib.pyplot as plt # plotting
import matplotlib.image as mpimg # images
import numpy as np #numpy
import seaborn as sns
import tensorflow as tf
# import tensorflow.compat.v2 as tf #use tensorflow v2 as a main 
import tensorflow.keras as keras # required for high level applications
from sklearn.model_selection import train_test_split # split for validation sets
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import normalize # normalization of the matrix
import scipy
import pandas as pd

tf.version.VERSION

In [None]:
import requests
from typing import List, Tuple

In [None]:
def show_history(history):
    plt.figure()
    for key in history.history.keys():
        plt.plot(history.epoch, history.history[key], label=key)
    plt.legend()
    plt.tight_layout()

# 📒 What is Object Localization?
* Object localization is the name of the task of **classification with localization**
* Namely, given an image, classify the object that appears in it, and find its location in the image, usually by using a **bounding-box**
* In Object Localization, only a single object can appear in the image. 
    * 💡 If more than one object can appear, the task is called **Object Detection**

![model](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_10_01.png?raw=true)

## 📌 Object Localization can be treated as a regression problem 

### We can represent our output (a bounding-box) as a tuple of size 4, as follows:
* `(x, y, height, width)`
    * `x, y`: the coordinates of the top-left corner of the bounding box
    * `height`: the height of the bounding box
    * `width`: the width of the bounding box
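
To make the tuple concrete, here is a minimal sketch that converts a `(x, y, height, width)` box into its corner coordinates. The helper name `to_corners` and the sample numbers are assumptions made for illustration only.

```python
from typing import Tuple

def to_corners(box: Tuple[float, float, float, float]) -> Tuple[float, float, float, float]:
    """Convert (x, y, height, width) to (x_min, y_min, x_max, y_max)."""
    x, y, height, width = box
    # the width extends the box along x, the height along y
    return (x, y, x + width, y + height)

# A box with its top-left corner at (10, 20), 30 units high and 40 units wide
print(to_corners((10, 20, 30, 40)))  # (10, 20, 50, 50)
```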
    
![model2](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_10_02.png?raw=true)

### 📌 Network architecture in general
* The coordinates of the top-left corner of the bounding box must be inside the image, and so must `x+width` and `y+height`
    * We will scale the image width and height to be 1.0
    * To keep the CNN outputs in the range `[0,1]`, we will use a sigmoid activation in the output layer
        * 💡 This guarantees that `(x,y)` lies inside the image, but not necessarily `x+width` and `y+height`
        * 💡 The network has to learn this property during the training process.
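
The following sketch illustrates the point above with plain NumPy. The logit values are hypothetical and were chosen so that the corner is valid while the box still leaves the image on one side.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw network outputs (logits) in the order x, y, height, width
logits = np.array([2.0, 0.5, -3.0, 1.5])
x, y, height, width = sigmoid(logits)

print(0.0 < x < 1.0 and 0.0 < y < 1.0)  # True: the corner is inside the image
print(x + width <= 1.0)                 # False: the box sticks out horizontally
print(y + height <= 1.0)                # True: the box fits vertically
```

Nothing in the activation itself prevents the second case, which is why this constraint has to come from the training data.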

![model3](https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/images/dl_10_03.png?raw=true)

## 🔎 What about the loss?
* The output of a sigmoid can be treated as probabilistic values, and therefore we can use **Binary Crossentropy** loss
    * 📌 You can see [this](https://www.theaidream.com/post/loss-functions-in-neural-networks) or [this](https://github.com/christianversloot/machine-learning-articles/blob/main/about-loss-and-loss-functions.md) for more information.
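
As a small numerical check, the sketch below evaluates binary crossentropy for a fractional target, which is what our normalized box coordinates are. The function is a plain NumPy re-implementation written for illustration (not the Keras code), and the target value `0.3` is an arbitrary assumption.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Element-wise binary crossentropy; predictions are clipped to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

target = 0.3                               # e.g. a normalized x coordinate
candidates = np.linspace(0.05, 0.95, 19)   # 0.05, 0.10, ..., 0.95
losses = binary_crossentropy(target, candidates)

best = candidates[np.argmin(losses)]
print(round(float(best), 2))  # 0.3
```

The loss is smallest when the prediction equals the target, so it can drive a regression output. Note that for fractional targets the minimum itself is greater than zero, so the reported loss never reaches 0 even for a perfect prediction.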

## ⚡ We will start with a purely synthetic use case for educational purposes before we implement a more complex one 🙂
* 📌 Our task will be the detection of white circles on a pure black background
    * We will assume that the white blobs will be located in square bounding boxes for simplicity
        * 🔎 What will the output layer look like for a task like this one?

In [None]:
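dataset_size = 512
X = np.zeros((dataset_size, 128, 128, 1))
labels = np.zeros((dataset_size, 3))
# fill each image with one white circle
for i in range(dataset_size):
    # circle center; x indexes the first image axis (rows), y the second (columns)
    x = np.random.randint(8, 120)
    y = np.random.randint(8, 120)
    # upper bound on the radius so that the circle stays inside the image
    a = min(128 - max(x, y), min(x, y))
    r = np.random.randint(4, a)
    for x_i in range(128):
        for y_i in range(128):
            if ((x_i - x)**2) + ((y_i - y)**2) < r**2:
                X[i, x_i, y_i, :] = 1
    # label: corner of the square bounding box and its side length, scaled to [0, 1]
    labels[i, 0] = (x - r) / 128.0
    labels[i, 1] = (y - r) / 128.0
    labels[i, 2] = 2 * r / 128.0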
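
The nested pixel loops are easy to read but slow. As an optional alternative, the sketch below draws the same kind of circle with a boolean mask. The helper name `draw_circle` is an assumption and is not part of the original notebook.

```python
import numpy as np

def draw_circle(size: int, cx: int, cy: int, r: int) -> np.ndarray:
    """Return a (size, size, 1) image with a filled circle of radius r.

    cx indexes the first image axis and cy the second, matching the loops above.
    """
    rows, cols = np.ogrid[:size, :size]
    mask = (rows - cx) ** 2 + (cols - cy) ** 2 < r ** 2
    return mask.astype(np.float64)[..., np.newaxis]

img = draw_circle(128, cx=40, cy=70, r=10)
print(img.shape)  # (128, 128, 1)
```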

In [None]:
X.shape

In [None]:
labels.shape

# We can check an example of the input image

In [None]:
plt.imshow(X[0].reshape(128, 128))

In [None]:
from matplotlib.patches import Rectangle
def plot_pred(img,p):
  fig, ax = plt.subplots(1)
  ax.imshow(img.reshape(128, 128))
  rect = Rectangle(xy=(p[1]*128,p[0]*128),width=p[2]*128, height=p[2]*128, linewidth=2,edgecolor='g',facecolor='none')
  ax.add_patch(rect)
  plt.show()

## And also with the ground truth bounding-box plotted

In [None]:
plot_pred(X[0], labels[0])

## More examples of our input data with bounding boxes incoming 🙂
* 💡 We can see that the circles vary in position and size

In [None]:
fig, ax = plt.subplots(8, 8, figsize=(20, 14))
for i in range(8):
    for j in range(8):
        img = X[i*8 + j]
        p = labels[i*8 + j]
        ax[i, j].imshow(img.reshape(128, 128))
        rect = Rectangle(xy=(p[1]*128,p[0]*128),width=p[2]*128, height=p[2]*128, linewidth=2,edgecolor='g',facecolor='none')
        ax[i, j].add_patch(rect)

## 🚀 Let's define our first object localization model!

In [None]:
model = keras.Sequential([
    keras.layers.Conv2D(64, (3,3), activation='relu', padding='same', input_shape=(128,128,1)),
#     keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2), padding='same'),
    keras.layers.Conv2D(64, (3,3), padding='same', activation='relu'),
#     keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2), padding='same'),
    keras.layers.Conv2D(16, (3,3), padding='same', activation='relu'),
#     keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D((2, 2), padding='same'),
    
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(3, activation='sigmoid'),
])

model.summary()

# Accuracy is not meaningful for bounding-box regression, so we track the mean absolute error instead
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['mae'])

## Fit the model on the train data

In [None]:
train_x, test_x, train_y, test_y = train_test_split(X, labels, test_size=0.2, random_state=42)

In [None]:
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='best.weights.h5',  # Keras 3 requires the `.weights.h5` suffix when save_weights_only=True
    save_weights_only=True,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
batch_size = 32
epochs = 100
history = model.fit(train_x, train_y, validation_split=0.2, callbacks=[model_checkpoint_callback], epochs=epochs, batch_size=batch_size)

show_history(history)

In [None]:
model.load_weights("best.weights.h5")

In [None]:
# the second value is the metric that was passed to model.compile()
test_loss, test_metric = model.evaluate(test_x, test_y)

In [None]:
model.evaluate(train_x, train_y)

# ⚡ Now we can take a look at our predictions using the model
* We will see that sometimes the prediction is slightly off but usually not by much
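
To put a number on "slightly off", a common measure is intersection over union (IoU). The sketch below is one possible implementation for our square `(x, y, size)` boxes. The helper name `iou_square` is an assumption and does not appear in the original notebook.

```python
def iou_square(box_a, box_b) -> float:
    """IoU of two square boxes given as (x, y, size) in the same units."""
    ax, ay, a_size = box_a
    bx, by, b_size = box_b
    # overlap along each axis, clamped at zero when the boxes do not intersect
    inter_w = max(0.0, min(ax + a_size, bx + b_size) - max(ax, bx))
    inter_h = max(0.0, min(ay + a_size, by + b_size) - max(ay, by))
    inter = inter_w * inter_h
    union = a_size ** 2 + b_size ** 2 - inter
    return inter / union if union > 0 else 0.0

print(iou_square((0.25, 0.25, 0.5), (0.25, 0.25, 0.5)))  # 1.0
print(iou_square((0.0, 0.0, 0.5), (0.25, 0.0, 0.5)))     # one third
```

Once the predictions are available, a call such as `iou_square(test_y[i], y_pred[i])` would score a single test image, where 1.0 means a perfect match and 0.0 means no overlap.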

In [None]:
y_pred = model.predict(test_x)

In [None]:
nrows = 10
fig, ax = plt.subplots(nrows, nrows, figsize=(20, 14))
for i in range(nrows):
    for j in range(nrows):
        img = test_x[i*nrows + j]
        p = test_y[i*nrows + j]
        predicted = y_pred[i*nrows + j]
        ax[i, j].imshow(img.reshape(128, 128))
        rect = Rectangle(xy=(p[1]*128,p[0]*128),width=p[2]*128, height=p[2]*128, linewidth=2,edgecolor='g',facecolor='none')
        ax[i, j].add_patch(rect)
        rect = Rectangle(xy=(predicted[1]*128,predicted[0]*128),width=predicted[2]*128, height=predicted[2]*128, linewidth=2,edgecolor='r',facecolor='none')
        ax[i, j].add_patch(rect)

# 🚀 Now we know the basics so we can focus on more interesting stuff
* Usually you don't want to train your own model from scratch; instead, you try to leverage a transfer learning approach
* 💡 Object localization is no exception
* Object localization/detection is a very common task, and a wide variety of models focused on it already exists

## 📌 A current state-of-the-art model is [YOLOv8 by Ultralytics](https://github.com/ultralytics/ultralytics)
* It is useful for a wide range of object detection and tracking, instance segmentation, image classification and pose estimation tasks
* YOLOv8 may be used directly in the Command Line Interface (CLI) or in a Python environment using the Python API
* There are 5 pre-trained models available
    * They differ in the number of parameters and therefore in model size
    * Models can be downloaded from the [YOLOv8 GitHub repository](https://github.com/ultralytics/ultralytics)

### 💡 TensorFlow 2 has a high-level API available for these tasks too 
* However, it is a bit more complicated compared to YOLOv8
* It also offers pre-trained models that can be used directly for inference or fine-tuned
* You can read the [blog post](https://blog.tensorflow.org/2020/07/tensorflow-2-meets-object-detection-api.html) about the API or you can take a look at the [GitHub](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2.md)



# ⚡ We will start with a simple zero-shot object detection
* 🔎 What is meant by the *zero-shot* approach?

In [None]:
from ultralytics import YOLO
from PIL import Image, ImageDraw

In [None]:
!wget "https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/blob/main/misc/bus.jpg?raw=true" -O bus.jpg
!wget https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/raw/main/misc/yolov8n.pt -O yolov8n.pt
!wget https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/releases/download/v0.0.01/video_cut.mkv -O video_cut.mkv


## Load the model

In [None]:
model = YOLO("yolov8n.pt", verbose=True)

## 🚀 The model is ready to use
* The model has several parameters:
    * `save`: Enables saving of the annotated images or videos to file. 
    * `save_txt`: Saves the bounding boxes and class labels to a text file
        * 💡 Format is `[class] [x_center] [y_center] [width] [height] [confidence]`
    * `save_conf`: Includes confidence scores in the saved text files (you can filter out low confidence detections later)
    * `imgsz`: Defines the image size for inference
        * 💡 Can be a single integer for square resizing or a (height, width) tuple
    * `project`: Folder name for saving output
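
The saved text files can be read back with a few lines of Python. The sketch below parses one line in the format listed above, assuming the coordinates are normalized to the image size and the confidence column is present (that is, `save_conf=True`). The helper name `parse_label_line` and the sample line are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]

def parse_label_line(line: str, img_w: int, img_h: int) -> Tuple[int, Box, float]:
    """Parse '[class] [x_center] [y_center] [width] [height] [confidence]'.

    Returns (class_id, (x_min, y_min, x_max, y_max) in pixels, confidence).
    """
    parts = line.split()
    cls_id = int(parts[0])
    xc, yc, w, h, conf = (float(v) for v in parts[1:6])
    # convert the normalized center/size representation to pixel corners
    x_min = (xc - w / 2) * img_w
    y_min = (yc - h / 2) * img_h
    x_max = (xc + w / 2) * img_w
    y_max = (yc + h / 2) * img_h
    return cls_id, (x_min, y_min, x_max, y_max), conf

# Hypothetical label line for an 800x600 image
print(parse_label_line('5 0.5 0.5 0.5 0.25 0.9', img_w=800, img_h=600))
# (5, (200.0, 225.0, 600.0, 375.0), 0.9)
```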

In [None]:
res = model('bus.jpg', save_txt=True, save_conf=True, save=True, imgsz=1088, project="yolo")

In [None]:
res

## If you want to work with the detected bounding boxes, you can use the following code

In [None]:
for i, result in enumerate(res):
    curr = result.boxes.xyxy.cpu().numpy()
    if curr.shape[0] > 0:
        print(f"Image {i}")
        print(f"Found {curr.shape[0]} boxes")
        print(f'Classes:  {result.boxes.cls.cpu().numpy()}')
        for j, box in enumerate(curr):
            print(f'Box {j}: {box}')

# 📊 We can display boxes in the image easily

In [None]:
im = Image.open("bus.jpg")
draw = ImageDraw.Draw(im)
cls_to_color = {0: 'red', 5: 'yellow'}
for result in res:
    for i, box in enumerate(result.boxes.xyxy.cpu().numpy()):
        cls = int(result.boxes.cls.cpu().numpy()[i])
        x, y, xx, yy = box
        # fall back to blue for classes without an assigned color (avoids a KeyError)
        draw.rectangle([x, y, xx, yy], outline=cls_to_color.get(cls, 'blue'), width=4)

# Display image in matplotlib
plt.imshow(im)

## YOLO can detect 80 classes out of the box

In [None]:
res[0].names

## 💡 You can limit the classes that are detected with the `classes` parameter
* Let's say that we want to detect only the *bus* object
    * We need to set `classes` to `[5]` as `5` is the ID of *bus*
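
If you only know the class name, the ID can be looked up in the ID-to-name dictionary. The sketch below uses a hypothetical three-entry excerpt of that mapping, and the helper name `class_ids` is an assumption. In the notebook you would pass `res[0].names` instead.

```python
from typing import Dict, List

# Hypothetical excerpt of the id -> name mapping (the full one has 80 entries)
names: Dict[int, str] = {0: 'person', 5: 'bus', 8: 'boat'}

def class_ids(names: Dict[int, str], wanted: List[str]) -> List[int]:
    """Return the sorted IDs of all classes whose name is in `wanted`."""
    return sorted(i for i, name in names.items() if name in wanted)

print(class_ids(names, ['bus']))             # [5]
print(class_ids(names, ['boat', 'person']))  # [0, 8]
```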

In [None]:
res = model('bus.jpg', save_txt=True, save_conf=True, save=True, imgsz=1088, project="yolo", classes=[5])

## Now only the *bus* was detected

In [None]:
im = Image.open("bus.jpg")
draw = ImageDraw.Draw(im)
cls_to_color = {0: 'red', 5: 'yellow'}
for result in res:
    for i, box in enumerate(result.boxes.xyxy.cpu().numpy()):
        cls = int(result.boxes.cls.cpu().numpy()[i])
        x, y, xx, yy = box
        # fall back to blue for classes without an assigned color (avoids a KeyError)
        draw.rectangle([x, y, xx, yy], outline=cls_to_color.get(cls, 'blue'), width=4)

# Display image in matplotlib
plt.imshow(im)

## YOLO can process video files through the same API; let's try it on the downloaded video file
* We want to detect the boats that are in the video sequence
    * 💡 ID of *boat* object is `8`
* 📌 Set `stream=True` so that inference results do not accumulate in RAM and cause an out-of-memory error
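
The idea behind streaming can be illustrated with a plain Python generator. This is only an analogy under the assumption that results are produced lazily, one frame at a time; `fake_inference` is a hypothetical stand-in and not the Ultralytics implementation.

```python
from typing import Iterator, List

processed: List[int] = []  # records which frames were actually processed

def fake_inference(n_frames: int) -> Iterator[dict]:
    """Stand-in for a streaming predictor: yields one result per frame."""
    for frame_id in range(n_frames):
        processed.append(frame_id)
        yield {'frame': frame_id}

results = fake_inference(1000)
print(len(processed))  # 0: creating the generator runs nothing yet

first = next(results)
print(len(processed))  # 1: only the requested frame was processed
```

Because each result is handed over and can be discarded before the next one is computed, memory use stays roughly constant regardless of the video length.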

In [None]:
res = model('video_cut.mkv', stream=True, save_txt=True, save_conf=True, save=True, imgsz=1088, project="yolo", classes=[8])

## ⚡ With `stream=True` the detection is done when we iterate over the `res` object

In [None]:
output = []
for i, result in enumerate(res):
    curr = result.boxes.xyxy.cpu().numpy()
    output.append({'Cls': result.boxes.cls.cpu().numpy(), 'BBoxes': curr})

## 📊 The bounding boxes are stored in the output list

In [None]:
output[8:15]
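
A quick way to summarize such a list is to count the boxes per frame. The sketch below works on a small hypothetical stand-in named `output_demo` that has the same structure as `output`; the box values are made up.

```python
import numpy as np

# Hypothetical stand-in with the same keys as the entries of `output`
output_demo = [
    {'Cls': np.array([8., 8.]),
     'BBoxes': np.array([[10., 20., 50., 60.], [100., 120., 180., 200.]])},
    {'Cls': np.array([]), 'BBoxes': np.zeros((0, 4))},
    {'Cls': np.array([8.]), 'BBoxes': np.array([[12., 22., 52., 62.]])},
]

boxes_per_frame = [len(frame['BBoxes']) for frame in output_demo]
print(boxes_per_frame)  # [2, 0, 1]

frames_with_detection = sum(n > 0 for n in boxes_per_frame)
print(frames_with_detection)  # 2
```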

# 🚀 We can fine-tune the model using our data
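* It requires a dataset in the YOLO format: images plus one `.txt` label file per image (the layout used by the COCO128 sample dataset)
* It also requires a dataset configuration file, which is a modified version of the original YOLOv5 configuration file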

## Let's download the data first

In [None]:
!wget https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/releases/download/v0.0.01/yolo_data.zip -O yolo_data_dir.zip
!unzip yolo_data_dir.zip

## Don't forget to download the config file
* 💡 We need to modify the `path` property in the config file
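
Editing the file by hand works fine, but the change can also be scripted. The sketch below replaces the top-level `path:` entry in a YAML text. The helper name `set_dataset_path`, the sample YAML excerpt and the target directory `/content/yolo_data` are all assumptions, so adjust them to your workspace.

```python
import re

def set_dataset_path(yaml_text: str, new_path: str) -> str:
    """Replace the value of the top-level `path:` key; other lines stay untouched.

    Note: a trailing comment on the `path:` line is dropped as well.
    """
    # a function is used as replacement so that backslashes in new_path are kept literally
    return re.sub(r'(?m)^path:.*$', lambda m: f'path: {new_path}', yaml_text, count=1)

# Hypothetical excerpt of a dataset config
sample = 'path: ../datasets/coco128\ntrain: images/train\nval: images/val\n'
updated = set_dataset_path(sample, '/content/yolo_data')
print(updated.splitlines()[0])  # path: /content/yolo_data
```

To apply it to the downloaded file, read `coco128.yaml` into a string, pass it through the function and write the result back.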

In [None]:
!wget https://github.com/rasvob/VSB-FEI-Deep-Learning-Exercises/raw/main/misc/coco128.yaml -O coco128.yaml

In [None]:
model = YOLO('yolov8n.pt')  # load a pretrained model (recommended for training)
model.train(data='coco128.yaml', epochs=5, imgsz=1920, batch=8, pretrained=True, cache=True, workers=16, seed=13)

## After training, we can use the model for inference; usually we want to export it first
* You can check the export documentation [here](https://docs.ultralytics.com/modes/export/#export-formats)
* Often the ONNX (for CPU) or TensorRT (for GPU) export format is used
* The `half` parameter enables FP16 (half-precision) quantization, reducing model size and potentially speeding up inference on supported hardware
    * 💡 It works on GPU only, so you need to set the `device` parameter
* The `simplify` parameter simplifies the model graph for ONNX exports, potentially improving performance and compatibility
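
The effect of half precision can be seen on a plain NumPy array. This is only an illustration of the FP16 number format and not of the export itself; the `weights32` array is a made-up stand-in for model weights.

```python
import numpy as np

# Stand-in for model weights stored in single precision
weights32 = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
weights16 = weights32.astype(np.float16)

# half precision needs half the memory
print(weights32.nbytes, weights16.nbytes)  # 4000 2000

# the price is a small rounding error per value
max_err = np.max(np.abs(weights32 - weights16.astype(np.float32)))
print(max_err < 1e-3)  # True
```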

In [None]:
model.export(format='onnx', imgsz=1920, half=True, simplify=True)

## 🚀 Finally we can load the model as usual and use it in the inference mode
* 💡 Set the model path according to your workspace!

In [None]:
model = YOLO("yolov8n.onnx", verbose=True, task='detect')

In [None]:
res = model('video_cut.mkv', stream=True, save_txt=True, save_conf=True, save=True, imgsz=1920, project="yolo", classes=[8])

In [None]:
output = []
for i, result in enumerate(res):
    curr = result.boxes.xyxy.cpu().numpy()
    output.append({'Cls': result.boxes.cls.cpu().numpy(), 'BBoxes': curr})

In [None]:
output[8:15]

# ✅  Tasks for the lecture (**4p**)

* There are multiple YOLOv8 models available on [GitHub](https://github.com/ultralytics/ultralytics)
* 📌 Choose another 2 versions of the model besides the `YOLOv8n` one that we used during the lecture
* Try to fine-tune the models and export them to either ONNX or TensorRT
    * 💡 The fine-tuning step is optional - if you don't have enough resources, just skip this step and try the inference directly
    * Compare the inference times of all three models using the provided video file with boats
        * 🔎 How much do they differ?
    * Also check the output video files
        * 🔎 Are there any differences in the detected bounding boxes?
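
For the timing comparison, a small helper like the one below may be handy. It is a generic sketch: `time_call` is a hypothetical name, and the two lambdas are stand-in workloads that you would replace with functions running each model on the video.

```python
import time
from statistics import mean
from typing import Callable, Dict

def time_call(fn: Callable[[], object], repeats: int = 3) -> float:
    """Return the mean wall-clock time in seconds of calling `fn` `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)

# Stand-in workloads; replace each with a function that runs one model on the video
candidates: Dict[str, Callable[[], object]] = {
    'small_job': lambda: sum(range(10_000)),
    'large_job': lambda: sum(range(1_000_000)),
}

timings = {name: time_call(fn) for name, fn in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name}: {seconds:.4f} s')
```

Keep in mind that with `stream=True` the timed function has to iterate over all results, otherwise no inference is executed and the measured time is meaningless.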