# Advanced classification

<img src="https://www.mrtfuelcell.polimi.it/images/logo_poli.jpg" height="200">
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f8/Python_logo_and_wordmark.svg" height="150">

A2A ML Course - day 8 - 18/11/2024

Maciej Sakwa, Micheal Wood, Emanuele Ogliari

## Outline

1. Introduction to image processing
2. Training a CNN for image classification
3. Object detection
4. Image segmentation

## Learning objectives

* Understand the mathematical concepts behind image processing
* Learn to construct a simple Convolutional Neural Network
* Understand the computational burden of training large DL models, and the advantage of Transfer Learning
* Provide an overview of the methodologies used in modern Computer Vision

<img src="https://freesvg.org/img/evil-robot-glitch-remix.png" width="400">


**Please run the cells below with imports and function declarations** 

Feel free to skip the details of the content.

---

Imports

In [None]:
import warnings
import os
import cv2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_score, accuracy_score, recall_score

warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

`tensorflow` imports

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Conv2D, MaxPool2D, Input, Flatten, Resizing
from tensorflow.keras import Sequential as create_model

Plotting

In [None]:
def plot_RGB_layers(image, show_colors=False):
    colors = ['Reds', 'Greens', 'Blues']
    fig, ax = plt.subplots(ncols=3)
    
    for i in [0, 1, 2]:
        if show_colors: 
            ax[i].imshow(image[:, :, i], cmap=colors[i], vmin=0, vmax=255)
        else:
            ax[i].imshow(image[:, :, i], cmap='Greys', vmin=0, vmax=255)
        ax[i].xaxis.set_visible(False)
        ax[i].yaxis.set_visible(False)
        ax[i].set_title(colors[i])
    plt.tight_layout()
    plt.show()

def plot_convolution(image):
    n_filters = image.shape[-1]
    n_rows = -(-n_filters // 4)  # ceiling division: 4 maps per row
    # squeeze=False keeps `ax` two-dimensional even when there is a single row
    fig, ax = plt.subplots(ncols=4, nrows=n_rows, squeeze=False)
    for i in range(n_filters):
        ax[i//4, i%4].imshow(np.reshape(image[:, :, :, i], image.shape[1:-1]))
        ax[i//4, i%4].xaxis.set_visible(False)
        ax[i//4, i%4].yaxis.set_visible(False)

    plt.tight_layout()
    plt.show()

Frames generator

In [None]:
def get_frames_from_vid(source:str, out:str, n_frames=1, increment=1):
    
    # Check out path and create if does not exist
    if not os.path.exists(out):
        os.makedirs(out)

    # Get the frame range and set current to zero
    frames_range = range(0, n_frames*increment, increment)
    current_frame = 0

    # Open the video
    vidcap = cv2.VideoCapture(source)

    # Loop through the video saving the frames we want
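    while True:
        success, frame = vidcap.read()

        # Stop at the end of the video or if a frame cannot be read
        if not success:
            break

        if current_frame in frames_range:
            name = f'test_img_{current_frame}.jpg'
            print(f'Creating... {name}')
            cv2.imwrite(os.path.join(out, name), frame)

        current_frame += 1

        # Stop once the last requested frame has been passed
        if current_frame >= n_frames * increment:
            break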

    # Release all space and windows once done
    vidcap.release()
    cv2.destroyAllWindows()

**Did you run all the cells?**

---

## Introduction to image processing

This lesson is a camouflaged tutorial on the basics of Computer Vision (CV for short). In fact, most CV tasks center on classifying the contents of an image - we want to teach the machine to recognise things that a human recognises easily.

<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2024/01/image-classification-model.jpg" width="600">

So far we dealt with simple datasets:

* **Points**

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/linreg.png?raw=True" width="400">

* **Lines** (sequences)

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/price_plot.png?raw=True" width="400">

There, a single instance is usually a few numbers long, e.g. *(x, y, z)* coordinates or *(I, V, T, SOC)* features of a time series. Now we will move on to images:

### What is an image?

We will be using a satellite image dataset, available [here](https://github.com/phelber/eurosat). Let's load the dataset:

In [None]:
from datasets import load_dataset

ds = load_dataset("nielsr/eurosat-demo", split='train').with_format('tf')
ds = ds.train_test_split(test_size=0.2, shuffle=True)

ds_class_names = {
    0: 'AnnualCrop',
    1: 'Forest',
    2: 'HerbaceousVegetation',
    3: 'Highway',
    4: 'Industrial',
    5: 'Pasture',
    6: 'PermanentCrop',
    7: 'Residential',
    8: 'River',
    9: 'SeaLake'
}

It comes in a convenient format, already preprocessed, and we have just split it into train and test sets.
Each image has a label.

Let's print the first example using matplotlib:

In [None]:
N_IMAGE = 0

image = ds['train'][N_IMAGE]['image'].numpy()
label = ds['train'][N_IMAGE]['label'].numpy()

plt.imshow(image)
plt.show()

Also the corresponding label is imported:

In [None]:
ds_class_names[label]

Mathematically speaking this small image is a massive 3D matrix:

> **NB:** Throughout the lesson we will often use the term *tensor*. A *tensor* is a generalisation of the matrix concept to more dimensions, e.g. 3D, 4D, etc. So a matrix is a 2D tensor.

In [None]:
image # .shape .size

The image is a square of 64x64 pixels.
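
To make the tensor vocabulary concrete, here is a minimal sketch that builds a stand-in array with the same layout as one of our images. The name `fake_image` and its random pixel values are assumptions made for illustration; only the shape matters.

```python
import numpy as np

# A stand-in for one RGB image: 64 x 64 pixels, 3 colour channels, 8-bit values.
# The pixel values are random here; only the shape matters for this illustration.
rng = np.random.default_rng(seed=0)
fake_image = rng.integers(low=0, high=256, size=(64, 64, 3), dtype=np.uint8)

print(fake_image.ndim)    # number of dimensions of the tensor
print(fake_image.shape)   # (height, width, channels)
print(fake_image.size)    # total number of values

# Slicing the last axis gives one 2D matrix per colour channel
red_channel = fake_image[:, :, 0]
print(red_channel.shape)
```

The same `.ndim`, `.shape` and `.size` attributes are available on the real `image` array loaded above.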

It has three layers that correspond to **(R, G, B)** channels:

In [None]:
N_IMAGE = 0

image = ds['train'][N_IMAGE]['image'].numpy()

plot_RGB_layers(image)

**Each image has 12288 data points!**

Previously we were talking about a few descriptive parameters (current, voltage, temperature, etc.), and the number of *features* was usually below 10-20.

By comparison, a numerical representation of a single image has thousands if not millions of *features*. A full HD image at *1920x1080* has a staggering 2073600 (2 million) pixels, i.e. over 6 million data points once the three colour channels are counted!
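
The arithmetic behind these numbers is quick to check. The helper `count_values` below is a hypothetical name introduced only for this sketch; it assumes an image stored as height x width x channels.

```python
def count_values(height, width, channels=3):
    """Number of raw values (features) in an image stored as height x width x channels."""
    return height * width * channels

small = count_values(64, 64)          # the satellite images used in this lesson
full_hd = count_values(1080, 1920)    # a full HD RGB frame

print(f"64x64 RGB image:     {small:,} values")
print(f"1920x1080 RGB image: {full_hd:,} values")
print(f"Full HD pixels only: {count_values(1080, 1920, channels=1):,}")
```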

We cannot analyse them in the same way as we did before. Now it involves two techniques:

* **CONVOLUTION**
* **POOLING**


### Convolution

The convolution is a mathematical operation that takes two functions $f$ and $g$ and creates a new function $(f*g)$ defined as:

$$
    (f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau
$$

It's much easier to understand what's happening if we look at a graphical representation:

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/conv_book.png?raw=True" width="350">
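
For sampled signals the integral becomes a sum. The sketch below is a plain-loop illustration (the function name `convolve_1d` and the two short arrays are made up for the example) and compares the result with NumPy's built-in `np.convolve`.

```python
import numpy as np

def convolve_1d(f, g):
    """Discrete convolution: (f * g)[t] = sum over tau of f[tau] * g[t - tau]."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for t in range(n_out):
        for tau in range(len(f)):
            # g is treated as zero outside its defined range
            if 0 <= t - tau < len(g):
                out[t] += f[tau] * g[t - tau]
    return out

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

manual = convolve_1d(f, g)
reference = np.convolve(f, g)   # NumPy's built-in "full" convolution

print(manual)
print(np.allclose(manual, reference))
```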

Let's introduce some terminology:
* **Filter** or **kernel** - the computational unit that we *slide* along the input
* **Stride** - number of pixels "jumped" at each *slide*
* **Padding** - sides added to the image to equalize the size of the output with the input
* **Feature map** or **map** - the output
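
Kernel size, stride and padding together determine the size of the feature map. A minimal sketch of that relation, assuming a square input and the same padding on every side (the helper name `conv_output_size` is hypothetical):

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Side length of the feature map for an n x n input.

    padding is the number of pixels added on EACH side of the input.
    """
    return (n + 2 * padding - kernel) // stride + 1

no_padding = conv_output_size(64, kernel=2)                      # output shrinks slightly
same_size = conv_output_size(64, kernel=3, padding=1)            # padding keeps the size
strided = conv_output_size(64, kernel=3, stride=2, padding=1)    # stride 2 halves it

print(no_padding, same_size, strided)
```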

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/conv_2d_book.png?raw=True" width="500">

An image is a 2-dimensional data structure, so the **kernel** (or filter) is also 2-dimensional.

Normally, the kernel has a square shape: 2x2, 3x3, or sometimes 5x5, and even 7x7.

The bigger the kernel, the more general the extracted features.

<img src="https://upload.wikimedia.org/wikipedia/commons/1/19/2D_Convolution_Animation.gif" height="400">
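
The sliding operation shown in the animation can be written in a few lines of NumPy. This is only a sketch on a toy single-channel image: `slide_kernel` is a hypothetical helper, there is no padding, and, as deep learning libraries usually do, the kernel is applied without flipping it.

```python
import numpy as np

def slide_kernel(image, kernel, stride=1):
    """Slide a 2D kernel over a 2D image without padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Element-wise product of patch and kernel, summed into one output value
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 single-channel "image"
kernel = np.ones((2, 2)) / 4.0                     # 2x2 averaging kernel

map_stride_1 = slide_kernel(image, kernel)
map_stride_2 = slide_kernel(image, kernel, stride=2)

print(map_stride_1)
print(map_stride_2)
```

With stride 1 the 4x4 input gives a 3x3 map; with stride 2 the windows no longer overlap and the map shrinks to 2x2.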

There is a direct relation to ANNs:

||**ANN**|**CNN**|
|---|---|---|
|**UNITS**|Neurons|Filters|
|**LAYERS**|Hidden layer|Convolutional layer|

We will be stacking layers of filters on top of each other. A single layer in a CNN is composed of a number of filters (from a single one up to hundreds).

Let's see how it works:

In [None]:
N_IMAGE = 10

image = ds['train'][N_IMAGE]['image'].numpy()

plt.imshow(image)
plt.show()

The `convolve` function takes an image and applies a convolution with the number of filters that you select (16 by default):

>**NB:** it uses the `tensorflow` library to perform the operation

In [None]:
def convolve(image, n_filters=16):

    if len(image.shape) != 4:
        image = image.reshape([1, *image.shape])

    convolution = tf.keras.Sequential([
        Input(shape=image.shape[1:]),
        Conv2D(filters=n_filters, kernel_size=2, padding='valid')
    ])

    return convolution(image)

out_img = convolve(image, n_filters=16)

And let's plot the maps using a previously defined function:

In [None]:
plot_convolution(out_img)

In fact, we can stack the convolution operation many times.

And this is exactly how complex features are extracted from the images:

In [None]:
out_out_img = convolve(convolve(convolve(convolve(convolve(convolve(image))))))
plot_convolution(out_out_img)

We are not getting cool feature maps now, because the weights of the filters are random.

In fact, the training procedure of a CNN is about adjusting these weights to get maps that focus on specific parts of the image.
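
To see what a filter with meaningful weights does, we can pick the weights by hand instead of training them. The sketch below uses a made-up 6x6 grayscale image and a hand-picked 3x3 kernel that responds to vertical edges; a trained CNN would learn its own kernels, so this is only an illustration.

```python
import numpy as np

# Toy 6x6 grayscale image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked 3x3 kernel that responds to dark-to-bright transitions from left to right
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

# Slide the kernel over the image (no padding, stride 1)
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * vertical_edge)

print(out)
```

The map is zero over the flat regions and large only where the window straddles the edge, which is the kind of focused response that training tries to obtain.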

### Pooling

During training and inference, each filter has to pass over the input, moving one space at a time.

Let's take a layer like the one defined above: a 64x64x3 input and 8 filters of size 2x2.

In [None]:
64 * 64 * 3 * 8 * 2 * 2 # Operations per layer

It quickly becomes a lot. Our PC memory will not be able to handle a large number of stacked layers.
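
The same estimate can be wrapped in a small function to see how the cost grows when layers are stacked. This is a rough sketch: `conv_multiplications` is a hypothetical helper that assumes stride 1 and an output with the same height and width as the input.

```python
def conv_multiplications(height, width, in_channels, n_filters, kernel_size):
    """Rough count of multiplications for one convolutional layer.

    Assumes stride 1 and an output of the same height and width as the input.
    """
    per_output_value = kernel_size * kernel_size * in_channels
    n_output_values = height * width * n_filters
    return per_output_value * n_output_values

# The layer from the text: 64x64x3 input, 8 filters of size 2x2
first = conv_multiplications(64, 64, in_channels=3, n_filters=8, kernel_size=2)

# Feeding the 8 resulting maps into a second layer of 8 filters
second = conv_multiplications(64, 64, in_channels=8, n_filters=8, kernel_size=2)

print(f"first layer:  {first:,}")
print(f"second layer: {second:,}")
```

The second layer is already more expensive than the first, because its input has 8 channels instead of 3.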

**Pooling** is the necessary reduction operation that downscales the image. As in convolution, a window passes over the image, but now it outputs a single aggregated value for each region it covers.

There are two types of pooling commonly used:

* **Max** pooling
* **Average** pooling

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/max-pooling-a.png?711b14799d07f9306864695e2713ae07" height="300">
<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/average-pooling-a.png?58f9ab6d61248c3ec8d526ef65763d2f" height="300">

Max pooling is more commonly used, as it is better at preserving the features.
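
Both pooling types fit in a few lines of NumPy. The sketch below assumes non-overlapping 2x2 windows on a single 2D map with even height and width; `pool_2x2` and the example values are made up for illustration.

```python
import numpy as np

def pool_2x2(feature_map, mode="max"):
    """Downsample a 2D map with non-overlapping 2x2 windows (height and width must be even)."""
    h, w = feature_map.shape
    # Group the pixels into (h/2, 2, w/2, 2) blocks, then aggregate each 2x2 block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([[1.0, 3.0, 2.0, 0.0],
                        [4.0, 2.0, 1.0, 1.0],
                        [0.0, 0.0, 5.0, 6.0],
                        [1.0, 3.0, 7.0, 8.0]])

max_pooled = pool_2x2(feature_map, mode="max")
avg_pooled = pool_2x2(feature_map, mode="average")

print(max_pooled)
print(avg_pooled)
```

Either way, the 4x4 map becomes a 2x2 map, so every later layer has four times fewer values to process.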

## Training a CNN for image classification

### Putting the blocks together

Now that we know the two most common building blocks of CNNs we can use them to construct the model.

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/fig-1-full.png?raw=True" width="700">

The *rule of thumb* is that we:

* put a few convolution layers with the same number of filters
* put the max pooling layer to downsample
* repeat a few times
* flatten the output
* attach an ANN at the end to predict the output

Let's build a model using these rules:

In [None]:
test_model = create_model([
    Input(shape=(64, 64, 3)),

    Conv2D(filters=16, kernel_size=(2, 2), padding='same'),
    Conv2D(filters=16, kernel_size=(2, 2), padding='same'),
    MaxPool2D(2),

    Conv2D(filters=32, kernel_size=(2, 2), padding='same'),
    Conv2D(filters=32, kernel_size=(2, 2), padding='same'),
    MaxPool2D(2),

    Conv2D(filters=64, kernel_size=(2, 2), padding='same'),
    Conv2D(filters=64, kernel_size=(2, 2), padding='same'),
    MaxPool2D(2),

    Flatten(),

    Dense(64),
    Dense(len(ds_class_names), activation='softmax')
])

test_model.compile(optimizer='Adam', loss='sparse_categorical_crossentropy')
test_model.summary()

It's like putting LEGO blocks together :)
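
It can be instructive to trace by hand how the shapes and the number of trainable parameters evolve through this stack. The sketch below hard-codes the architecture described above (three blocks of two 2x2 convolutions plus pooling, then two dense layers with 64 and 10 units); the helper names are hypothetical and the result should agree with the totals printed by `test_model.summary()`.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit."""
    return (n_inputs + 1) * n_units

side, channels, total = 64, 3, 0

# Three blocks: two 2x2 'same' convolutions followed by 2x2 max pooling
for filters in (16, 32, 64):
    for _ in range(2):
        total += conv_params(2, channels, filters)
        channels = filters
    side //= 2   # pooling halves height and width; it has no trainable parameters

flat = side * side * channels   # size of the flattened vector
total += dense_params(flat, 64)
total += dense_params(64, 10)   # 10 output classes

print(f"Flattened size: {flat}")
print(f"Trainable parameters: {total:,}")
```

Most of the parameters sit in the first dense layer, not in the convolutions, which is one reason why downsampling before flattening matters.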

Let's train our model. As per usual, we split the data into train and test sets, and fit the model using the train set:

In [None]:
ds_train = ds['train'].to_tf_dataset(columns=["image"], label_cols=["label"], batch_size=32)
ds_test = ds['test'].to_tf_dataset(columns=["image"], label_cols=["label"], batch_size=1)

train_history = test_model.fit(ds_train, epochs=1)

It might take a couple of minutes. Let's run an example prediction:

In [None]:
example = next(ds_test.as_numpy_iterator()) # Get a single example from the test dataset

example_image = example[0]  # Get image
example_label = example[1]  # Get label

plt.imshow(example_image[0])   
print(ds_class_names[example_label[0]])

In [None]:
label_probas = test_model.predict(example_image)
label_pred   = ds_class_names[np.argmax(label_probas)]

print(label_probas)
print(label_pred)
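
The probabilities printed above come from the `softmax` activation of the last layer, and the predicted class is simply the one with the highest probability. A minimal sketch with made-up raw scores for four classes (the scores and the shortened class list are illustrative assumptions):

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Made-up raw scores for a 4-class problem (illustrative values only)
class_names = ['Forest', 'Highway', 'River', 'SeaLake']
scores = np.array([2.0, 0.5, 3.0, -1.0])

probas = softmax(scores)
predicted = class_names[int(np.argmax(probas))]

print(probas.round(3))
print(predicted)
```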

Let's predict the entire test dataset now and see the metrics:

In [None]:
label_true   = ds['test']['label']

label_probas = test_model.predict(ds_test)
label_pred   = np.argmax(label_probas, axis=1)

Let's get the confusion matrix:

In [None]:
cm = confusion_matrix(label_true, label_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=ds_class_names.values())
disp.plot(xticks_rotation='vertical')

And the numerical results:

In [None]:
accuracy    = accuracy_score(label_true, label_pred)
precision   = precision_score(label_true, label_pred, average=None)
recall      = recall_score(label_true, label_pred, average=None)

print(f'Global accuracy: {accuracy*100:.02f} %')
print('\t Prec\t Recall')
for i in range(len(ds_class_names)):
    print(f'Cl {i}:\t{precision[i]*100:.01f}%\t {recall[i]*100:.01f}%\t-> {ds_class_names[i]}')
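
To see where these numbers come from, the sketch below recomputes accuracy, precision and recall directly from a confusion matrix. The labels are made up for a 3-class toy problem; the `sklearn` functions used above should give the same values for the same inputs.

```python
import numpy as np

# Made-up labels for a 3-class problem (illustrative values only)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2, 0])
n_classes = 3

# Confusion matrix: rows are true classes, columns are predicted classes
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)   # correct / everything predicted as the class
recall = np.diag(cm) / cm.sum(axis=1)      # correct / everything that truly is the class

print(cm)
print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision.round(2)}")
print(f"recall:    {recall.round(2)}")
```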

### Transfer learning

There are many ways of putting the blocks together. How can we be sure that ours is optimal?

Mathematicians and computer scientists hold competitions to figure out who is the best at stacking blocks. 

<img src="https://m.media-amazon.com/images/I/71LMmS-xmdL._AC_UF894,1000_QL80_.jpg" width="400">

They are usually kind enough to share their findings online, which makes it a bit pointless to stack the blocks on our own, in particular if our task is fairly standard (such as image classification).

These models are trained on fairly generic datasets.

For example, a model learns to tell cats from dogs from horses. But what if we want to classify types of defects on PV modules? Can we use the same model?

The answer is: *kind of*



It turns out that a large part of the training effort is spent on the first layers of the CNN, which extract the most basic features. 
It *also* turns out that these layers end up with very similar weights no matter which task we want the model to perform (dogs or PV modules).

*We can use that* 

In fact, we can reuse the early and middle layers of any model that performs a similar task. 

Here, we load a MobileNet model pretrained on the 'imagenet' dataset and remove the top layers (the classification head) with `include_top=False`:

In [None]:
mobilenet_layers = tf.keras.applications.MobileNetV3Small(weights='imagenet', include_top=False)

We can use the loaded layers and mold them into our model:

In [None]:
model = create_model([
    Input(shape=(64, 64, 3)),

    Resizing(224, 224),

    mobilenet_layers,

    Flatten(),

    Dense(64),
    Dense(len(ds_class_names), activation='softmax'),
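])

# The model must be compiled before calling fit()
model.compile(optimizer='Adam', loss='sparse_categorical_crossentropy')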

The *re*-fitting procedure is often called **fine-tuning**. It's enough if we run it just for a few epochs:

In [None]:
history = model.fit(ds_train, epochs=1)

Let's get the results:

In [None]:
label_true   = ds['test']['label']

label_probas = model.predict(ds_test)
label_pred   = np.argmax(label_probas, axis=1)

And show them:

In [None]:
cm = confusion_matrix(label_true, label_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=ds_class_names.values())
disp.plot(xticks_rotation='vertical')

In [None]:
accuracy    = accuracy_score(label_true, label_pred)
precision   = precision_score(label_true, label_pred, average=None)
recall      = recall_score(label_true, label_pred, average=None)

print(f'Global accuracy: {accuracy*100:.02f} %')
print('\t Prec\t Recall')
for i in range(len(ds_class_names)):
    print(f'Cl {i}:\t{precision[i]*100:.01f}%\t {recall[i]*100:.01f}%\t-> {ds_class_names[i]}')

This procedure is often referred to as **TRANSFER LEARNING**, as we are transferring the knowledge from one model to a new one. It is commonly used when the task that we want to perform is similar to one that has already been solved.
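
One common variant of this idea keeps the reused layers frozen and trains only the newly attached head. The NumPy sketch below illustrates just that principle on made-up data: the "pretrained" part is a stand-in (a fixed random projection with a ReLU, not a real network), and the head is a single logistic unit trained by gradient descent. All names and values are assumptions for illustration, and this is not what the Keras cells above do, since they leave the MobileNet layers trainable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for pretrained layers: a fixed projection followed by ReLU.
# In a real setting these weights would come from a model trained on another task.
W_base = rng.normal(size=(20, 8)) / np.sqrt(20)
W_base_before = W_base.copy()

def extract_features(x):
    """Frozen part of the model: W_base is used but never updated."""
    return np.maximum(x @ W_base, 0.0)

# Toy binary task with made-up data
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
features = extract_features(X)

# New trainable head: a single logistic unit on top of the frozen features
w_head = np.zeros(8)
b_head = 0.0

def head_loss_and_grads(w, b):
    """Cross-entropy loss of the head and its gradients with respect to w and b."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, features.T @ (p - y) / len(y), np.mean(p - y)

initial_loss, _, _ = head_loss_and_grads(w_head, b_head)
for step in range(300):
    _, grad_w, grad_b = head_loss_and_grads(w_head, b_head)
    w_head -= 0.1 * grad_w   # only the head is updated
    b_head -= 0.1 * grad_b

final_loss, _, _ = head_loss_and_grads(w_head, b_head)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print("base weights unchanged:", np.array_equal(W_base, W_base_before))
```

Because only the small head is trained, far fewer weights have to be updated than when training the whole network.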

The most important models that introduce new and cool ways of stacking blocks (and win the competitions) are often named. For example:

* VGG-16 nets - introduced the Conv-Pool stacking that we used before
* Residual nets [*(ResNets)*](https://en.wikipedia.org/wiki/Residual_neural_network) - introduced residual blocks with skip connections
* Inception nets or [*GoogLeNet*](https://en.wikipedia.org/wiki/Inception_(deep_learning_architecture)) - introduced the inception block for wide and deep feature extraction
* Squeeze-and-excitation [*(SE) nets*](https://towardsdatascience.com/squeeze-and-excitation-networks-9ef5e71eacd7) - expand on the ResNet by scaling the feature maps with learned weight parameters

### Use example in engineering

Convolutional Neural Networks are extensively used in research and industry, for image based predictions. 

An example that we worked on (and that Nam is now continuing) is the CNN-based prediction of Global Horizontal Irradiance. In that scenario we used a modified VGG-16 type network for *regression* based on a sequence of images.

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/CNN_pred_example.png?raw=True" width="800">

Apart from images, CNNs are also extensively used for sequence analysis, e.g. time series forecasting or signal classification (such as speech recognition). This works because it is very easy to represent a window of a time series as a 2D matrix that can be *scanned* by convolutions. 
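
A minimal sketch of that idea, using a made-up series and a hand-picked kernel (both are assumptions for illustration): the windows are stacked into a 2D matrix, and the kernel is applied to every window.

```python
import numpy as np

# Made-up time series (illustrative values only)
series = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0])
window = 3

# Stack all windows of 3 consecutive samples into a 2D matrix, one window per row
windows = np.array([series[i:i + window] for i in range(len(series) - window + 1)])

# Hand-picked kernel: difference between the last and the first sample of each window
kernel = np.array([-1.0, 0.0, 1.0])
response = windows @ kernel

print(windows.shape)
print(response)
```

In a trained 1D CNN the kernel weights would be learned rather than chosen by hand.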

## Object detection

Another important task in image processing is that of **Object Detection**. 

Compared to classification, the task is a bit different: our objective is not only to classify the content of the image, but also to locate each object in it:

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/object-detection-clas-en.jpeg?raw=True" width="250">

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/object-detection-det-en.jpeg?raw=True" width="250">

That means that the output of our model is not only the single label for the entire image but a set of **bounding boxes** with accompanying labels.

The task changes from straight classification to *half-regression* (estimation of the location of the box) and *half-classification* (classification of the content of a box).

The staple model to perform this task is called **YOLO** - **Y**ou **O**nly **L**ook **O**nce

### Webcam test

Let's load the model:

In [None]:
from ultralytics import YOLO

# Load model
model = YOLO("yolov5su.pt")

# Model info
model.info()

By typing `model.names` we can see which classes it has learned to detect:

In [None]:
model.names

It's 80 classes that come from another standard dataset called [*COCO*](https://cocodataset.org/#home) (Common Objects in COntext)

This cell contains code to take pictures with your webcam. *Please run it*

In [None]:
if 'google.colab' in str(get_ipython()):
    from IPython.display import display, Javascript
    from google.colab.output import eval_js
    from base64 import b64decode

    def take_photo(filename='/content/photo.jpg', quality=0.8):
        js = Javascript('''
            async function takePhoto(quality) {
            const div = document.createElement('div');
            const capture = document.createElement('button');
            capture.textContent = 'Capture';
            div.appendChild(capture);

            const video = document.createElement('video');
            video.style.display = 'block';
            const stream = await navigator.mediaDevices.getUserMedia({video: true});

            document.body.appendChild(div);
            div.appendChild(video);
            video.srcObject = stream;
            await video.play();

            // Resize the output to fit the video element.
            google.colab.output.setIframeHeight(document.documentElement.scrollHeight, true);

            // Wait for Capture to be clicked.
            await new Promise((resolve) => capture.onclick = resolve);

            const canvas = document.createElement('canvas');
            canvas.width = video.videoWidth;
            canvas.height = video.videoHeight;
            canvas.getContext('2d').drawImage(video, 0, 0);
            stream.getVideoTracks()[0].stop();
            div.remove();
            return canvas.toDataURL('image/jpeg', quality);
            }
            ''')
        display(js)
        data = eval_js('takePhoto({})'.format(quality))
        binary = b64decode(data.split(',')[1])
        with open(filename, 'wb') as f:
            f.write(binary)
        return filename

We can take a picture with our webcam:

In [None]:
if 'google.colab' in str(get_ipython()):
    from IPython.display import Image
    # Take an image and display an error if something goes wrong
    try:
        filename = take_photo()
        print('Saved to {}'.format(filename))
        display(Image(filename))

    except Exception as err:
        print(str(err))

And let's try to see what the model detects:

In [None]:
if 'google.colab' in str(get_ipython()):   
    input_img = "/content/photo.jpg"
    results = model.predict(input_img, save=True)

    results[0].show()

We can also analyse in more detail the model outputs:

In [None]:
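if 'google.colab' in str(get_ipython()):
    # Move the tensors to the CPU and convert them to NumPy arrays for pandas
    classes = results[0].boxes.cls.cpu().numpy()
    conf    = results[0].boxes.conf.cpu().numpy()
    xyxy    = results[0].boxes.xyxy.cpu().numpy()  # (x1, y1, x2, y2), origin in the top-left corner

    # display() is needed because the expression sits inside an if-block
    display(pd.DataFrame({
        'left': xyxy[:, 0], 
        'top': xyxy[:, 1], 
        'right': xyxy[:, 2], 
        'bottom': xyxy[:, 3],
        'conf': conf,
        'class': [model.names[int(cls)] for cls in classes]}))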

From a technical point of view, the **YOLO** model divides the image into small patches and looks for anchor points of the objects in them.
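
The `xyxy` format used above stores each box as two corners, (x1, y1) and (x2, y2), in pixel coordinates with the origin in the top-left corner of the image. A small sketch of how such a box converts to a centre point plus width and height; the helper `xyxy_to_xywh` and the example box are made up for illustration.

```python
def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to (x_center, y_center, width, height)."""
    x1, y1, x2, y2 = box
    width = x2 - x1
    height = y2 - y1
    return (x1 + width / 2, y1 + height / 2, width, height)

# Made-up box in pixel coordinates; the image origin (0, 0) is the top-left corner
box = (100.0, 50.0, 300.0, 250.0)
x_c, y_c, w, h = xyxy_to_xywh(box)

print(f"centre: ({x_c}, {y_c}), width: {w}, height: {h}")
print(f"area: {w * h} px^2")
```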

### Video test

In fact, the model is so fast we can use it in real time *(almost)*.

Let's see how it handles videos. This cell should download a video from our github:

In [None]:
url = f"https://raw.githubusercontent.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/refs/heads/main/data/test_vid_small.mp4"
! wget --no-cache --backups=1 {url}

This cell loads the video into Colab. Give it a look:

In [None]:
from IPython.display import HTML
from base64 import b64encode
import os

# Show video
mp4 = open("test_vid_small.mp4",'rb').read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
HTML("""
<video width=600 controls>
      <source src="%s" type="video/mp4">
</video>
""" % data_url)

And now let's predict using the pre-trained model:

In [None]:
results_vid = model.predict(source="test_vid_small.mp4", save=True, conf=0.5)
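
The `conf=0.5` argument asks the model to keep only detections whose confidence is at least 0.5. Conceptually this is a simple filter, sketched below on made-up detections (the list and the helper `filter_by_confidence` are illustrative assumptions, not part of the `ultralytics` API).

```python
# Made-up detections: (class name, confidence); illustrative values only
detections = [("car", 0.91), ("person", 0.47), ("truck", 0.63), ("bicycle", 0.12)]

def filter_by_confidence(dets, threshold=0.5):
    """Keep only the detections whose confidence reaches the threshold."""
    return [d for d in dets if d[1] >= threshold]

kept = filter_by_confidence(detections, threshold=0.5)
print(kept)
```

A higher threshold gives fewer but more reliable boxes; a lower one keeps more objects at the price of more false detections.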

### Use example in engineering

With proper *fine-tuning* and *re-training* on suitable datasets, we can use YOLO to detect other objects that we need. An example that we worked on includes fine-tuning the YOLO model to detect defects in PV modules, to facilitate the diagnosis of power drops: 

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/YOLO_pred_example.png?raw=True" width="900">

## Segmentation

The last important task in image processing is that of **Segmentation**. 

It is significantly more advanced compared to image classification, because here we try to classify **every pixel of the image**.

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/object-detection-clas-en.jpeg?raw=True" width="250">

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/object-detection-det-en.jpeg?raw=True" width="250">

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/r-cnn-en.jpg?raw=True" width="250">
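
Classifying every pixel means that the output is a label map with the same height and width as the image. A minimal sketch, assuming a model that outputs one score per class for each pixel (the random scores below stand in for such an output):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_classes, height, width = 3, 4, 5

# Made-up per-pixel class scores, shape (height, width, n_classes)
scores = rng.random(size=(height, width, n_classes))

# The segmentation mask assigns to each pixel the class with the highest score
mask = np.argmax(scores, axis=-1)

print(mask.shape)
print(mask)

# Fraction of the image assigned to each class
for c in range(n_classes):
    print(f"class {c}: {np.mean(mask == c):.2f}")
```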

Just like **YOLO** is the staple model for Object Detection, **SAM** (Segment Anything Model) is the go-to solution for Segmentation. It's a robust vision model developed by some smart people at Meta. In comparison to the previous models we used and trained, **SAM** is an example of a *Vision Transformer* or *ViT* - a prompt-driven heavyweight model that is based on encoding and decoding of the input. *ViT* models are far too heavy to be trained "in-house". In fact, in the original research paper, the authors report that training the encoder part alone took them a few *days* using a staggering **256 GPUs**! 

First, let's extract a couple of frames from our video:

In [None]:
get_frames_from_vid(source="test_vid_small.mp4", out="frames")

We can check out the image in our content browser.

Similarly to **YOLO** the **SAM** model is hosted on ultralytics. We can easily import it and run predictions:

In [None]:
from ultralytics import SAM

# Load a model
model = SAM("mobile_sam.pt")

# View model info
model.info()

It's massive compared to **YOLO**!

We can run it with the .predict() command:

In [None]:
results = model.predict(source='frames/test_img_0.jpg', save=True, device='cpu')

This however takes a very long time to run in colab (slightly over 7 minutes for full picture).

Moreover, the output is not very clear, as you can see in the example below:

<img src="https://github.com/woodjmichael/Basi-Fondamentali-del-Machine-Learning/blob/main/images/SAM_result.jpeg?raw=True" width="900">

It's a mess, isn't it? 

The model however was designed for "prompted" segmentation, i.e. we have to specify which point or area of the image we want to extract, e.g. we can focus on one item location, or on detection of objects on the path of our car:

In [None]:
results = model.predict(source='frames/test_img_0.jpg', save=True, device='cpu', points=[[100, 100]], labels=[1])

It's not working perfectly, but that's what fine-tuning is for!

**Thank you for your attention!**