<a href="https://colab.research.google.com/github/xinconggg/Machine-Learning/blob/main/Deep%20Computer%20Vision%20using%20Convolutional%20Neural%20Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Convolutional Layers
Convolutional layers are the core building blocks of Convolutional Neural Networks (CNNs). They apply convolution operations to the input data to extract features such as edges, textures, and patterns.

In a convolutional layer:
- A **filter (kernel)** slides over the input, performing element-wise multiplication and summation to produce a **feature map**.
- **Strides** control how much the filter moves at each step.
- **Padding** can be applied to preserve the spatial dimensions of the input.
- **Activation functions** like ReLU are typically applied to introduce non-linearity.
Convolutional layers help CNNs learn **spatial hierarchies** in data, making them highly effective for tasks like **image classification, object detection, and segmentation**.

### Implementing Convolutional Layers with Keras
First, load and preprocess a couple of sample images, using Scikit-Learn's `load_sample_image` function and Keras's `CenterCrop` and `Rescaling` layers:

In [1]:
from sklearn.datasets import load_sample_images
import tensorflow as tf

images = load_sample_images()["images"]
images = tf.keras.layers.CenterCrop(height=70, width=120)(images)
images = tf.keras.layers.Rescaling(scale=1/225)(images)

Check the shape of the "images" tensor:

In [2]:
images.shape

TensorShape([2, 70, 120, 3])

It is a 4D tensor: the 2 sample images explains the first dimension. Each image is 70x120, since that's the size specified in the `CenterCrop` layer, which explains the second and third dimension. Lastly, each pixel holds one value per color channel, and there are 3 of them - red, green and blue, which explains the last dimension.

Now, let's create a 2D convolutional layer and feed these images to see what comes out. To do so, Keras provides a `Convulution2D` layer, alias `Conv2D`. Create a convolutional layer with 32 filters, each of size 7x7 (``using `kernel_size=7`), and apply this layer to 2 of the images:

In [3]:
conv_layer = tf.keras.layers.Conv2D(filters=32, kernel_size=7)
fmaps = conv_layer(images)

Look at the ouput's shape:

In [4]:
fmaps.shape

TensorShape([2, 64, 114, 32])

The output's shape is similar to the input's shape, but with 2 main differences. **First**, there are 32 channels instead of 3. This is because we set `filters=32`, so we get 32 output feature maps; instead of the intensity of red, blue and green at each location, we now have the intensity of each feature at each location. **Second**, the height and width have both shrunk by 6 pixels. This is due to the fact that the `Conv2D` layer does not use any zero-padding by default, meaning that we lose a few pixels on the sides of the output feature maps. In this case, since the kernel size is 7, we lose 6 pixels horizontally and vertically. (3 pixels on each side)

However, if we set `padding=same`, then the inputs are padded with enough zero on all sides to ensure that the output feature maps end up with the same size as the input:

In [5]:
conv_layer = tf.keras.layers.Conv2D(filters=32, kernel_size=7,
                                    padding="same")
fmaps = conv_layer(images)

In [6]:
fmaps.shape

TensorShape([2, 70, 120, 32])

Look at the layer's weights. Just like a `Dense` layer, a `Conv2D` layer holds all the layer's weights, including the kernels and biases:

In [7]:
kernels, biases = conv_layer.get_weights()
kernels.shape

(7, 7, 3, 32)

In [8]:
biases.shape

(32,)

## Pooling Layers
Pooling layers are used in Convolutional Neural Networks (CNNs) to **reduce the spatial dimensions of feature maps**, which helps to reduce computation, prevent overfitting, and make the network more efficient.

There are two main types of pooling:
- **Max Pooling:** Takes the maximum value from each region of the feature map, focusing on the most prominent features.
- **Average Pooling:** Takes the average value from each region, smoothing the feature map.

Pooling layers:
- **Reduce dimensionality** by downsampling the input.
- **Retain important features** while discarding less important details.
- **Provide translation invariance**, making the network more robust to small shifts in the input.

Pooling layers are typically applied after convolutional layers to **extract dominant features** and make the model more efficient without losing key information.








### Implementing Pooling Layers with Keras
The following code creates a `MaxPooling2D` layer, alias `MaxPool2D`, using a 2x2 kernel. The strides default to the kernel size, so this layer uses a stride of 2 (horizontally and vertically). By default, it uses a "valid" padding (i.e., no padding at all):

In [9]:
max_pool = tf.keras.layers.MaxPool2D(pool_size=2)

To create an *Average Pooling* layer, just use `AveragePooling2D`, alias `AvgPool2D`, instead of `MaxPool2D`. However, most use max pooling layers rather than average pooling layers since they generally perform better.

Note that max pooling and average pooling can be performed along the depth dimension instead of spatial dimensions. This allows the CNN to learn to be invariant to various features. For example, it could learn multiple filters, each detecting a different rotation of the same pattern, and the *depthwise max pooling* layer would ensure taht the output is the same regardless of the rotation.

Keras does not include a *depthwise max pooling* layer, but we can implement a custom layer for that:

In [10]:
class DepthPool(tf.keras.layers.Layer):
    def __init__(self, pool_size=2, **kwargs):
        super().__init__(**kwargs)
        self.pool_size = pool_size

    def call(self, inputs):
        shape = tf.shape(inputs)  # shape[-1] is the number of channels
        groups = shape[-1] // self.pool_size  # number of channel groups
        new_shape = tf.concat([shape[:-1], [groups, self.pool_size]], axis=0)
        return tf.reduce_max(tf.reshape(inputs, new_shape), axis=-1)

This layer reshapes its inputs to split the channels into groups of the desired size (`pool_size`), then it uses `tf.reduce_max`  to compute the max of each group.

One last type of pooling layer that is often seen is the *global average pooling* layer. It works very differently: all it does is compute the mean of each entire feature map. This means that it just outputs a single number per feature map and per instance. Although it can be extremely destructive (most of the information in the feature map would be lost), but it can be useful just before the output layer. To create such a layer, simply use the `GlobalAveragePooling2D` class, alias `GlobalAvgPool2D`:

In [11]:
global_avg_pool = tf.keras.layers.GlobalAvgPool2D()

For example, if we apply this layer to the input images, we get the mean intensity of the red, green and blue for each image:

In [12]:
global_avg_pool(images)

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[0.72917455, 0.6768054 , 0.6601711 ],
       [0.86481947, 0.29479   , 0.12295388]], dtype=float32)>

## CNN Architectures
Typical CNN Architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers, then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network.

The following code implements a basic CNN to tackle the Fashion MNIST:

In [13]:
import numpy as np

# Import the Fashion MNIST dataset from TensorFlow
mnist = tf.keras.datasets.fashion_mnist.load_data()
# Load the dataset into training and testing sets
(X_train_full, y_train_full), (X_test, y_test) = mnist
# Normalize and Scale to range [0-1]
X_train_full = np.expand_dims(X_train_full, axis=-1).astype(np.float32) / 255
X_test = np.expand_dims(X_test.astype(np.float32), axis=-1) / 255
# Split the full training set into a smaller training set and a validation set
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

In [14]:
from functools import partial

tf.random.set_seed(42)

DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, padding="same",
                        activation="relu", kernel_initializer="he_normal")

model = tf.keras.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    tf.keras.layers.MaxPool2D(),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    tf.keras.layers.MaxPool2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=10, activation="softmax")
])

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Explanation of code:
- `functools.partial` function was used to define `DefaultConv2D`, which acts just like `Conv2D` but with different default arguments: a kernel size of 3, "same" padding, the ReLU activation function, and its corresponding He intializer.
-Next, we create the `Sequential` model. Its first layer is a `DefaultConv2D` with 64 filters. It sets `input_shape=[28, 28, 1]`, because the images are 28x28 pixels. Note: when loading the Fashion MNIST dataset, ensure that each image has this shape, else use `np.reshape` or `np.expanddims` to change the dimension.
- We then add a max pooling layer that uses a default pool size of 2, so it divides each spatial dimension by a factor of 2.
- The same structure is then repeated twice: 2 convolutional layers followed by a max pooling layer. Note: for larger images, we could repeat this structure several more times.
- Note that the number of filters doubles as we climb up the CNN woards the output layer (from 64 to 128 then 256). It is a common practice to double the number of filters after each pooling layer.
- Next is the fully connected network, composed of 2 hidden dense layers and a dense output layer. Since it's a classification task with 10 classes, the output layer has 10 units, and it uses the softmax activation function.

### CNN Architectures Summary
**1) LeNet-5:**
- **Designed for:** Handwritten digit recognition
- **Architecture:** 7 layers (Convolution, Pooling, Fully Connected)
- **Usage today:** Rarely used in modern applications

**2) AlexNet:**
- **Designed for:** ImageNet classification
- **Architecture:** 8 layers (5 Convolution + 3 Fully Connected)
- **Usage today:** Foundational architecture, but newer models have surpassed it

**3) VGGNet:**
- **Designed for:** Image classification
- **Architecture:** 16 or 19 layers of small 3x3 filters
- **Usage today:** Still used for *feature extraction* and in *transfer learning*

**4) GooLeNet:**
- **Designed for:** Image classification
- **Architecture:** 22 layers with *Inception Modules* (multiple filter sizes at each layer)
- **Usage today:** Still relevant in modern *Inception-based* models

**5) ResNet:**
- **Designed for:** Image classification
- **Architecture:** 50, 101 or 152 layers with *Residual Connections* (skip connections)
- **Usage today:** Widely used in both research and industry. Foundation for many modern models

**6) Xception:**
- **Designed for:** Image classification
- **Architecture:** 36 layers with *Depthwise Separable Convolutions*
- **Usage today:** Used in real-word applications like *image processing* and *object detection*

**7) SENet:**
- **Designed for:** Image classification
- **Architecture:** Adds *Squeeze and Excitation (SE)* blocks to other architectures
- **Usage today:** Used in many modern networks (ResNeXt, EfficientNet)

## Implementing a ResNet-34 CNN using Keras
First, create a `ResidualUnit` layer:

In [15]:
DefaultConv2D = partial(tf.keras.layers.Conv2D, kernel_size=3, strides=1,
                        padding="same", kernel_initializer="he_normal",
                        use_bias=False)

class ResidualUnit(tf.keras.layers.Layer):
    def __init__(self, filters, strides=1, activation="relu", **kwargs):
        super().__init__(**kwargs)
        self.activation = tf.keras.activations.get(activation)
        self.main_layers = [
            DefaultConv2D(filters, strides=strides),
            tf.keras.layers.BatchNormalization(),
            self.activation,
            DefaultConv2D(filters),
            tf.keras.layers.BatchNormalization()
        ]
        self.skip_layers = []
        if strides > 1:
            self.skip_layers = [
                DefaultConv2D(filters, kernel_size=1, strides=strides),
                tf.keras.layers.BatchNormalization()
            ]

    def call(self, inputs):
        Z = inputs
        for layer in self.main_layers:
            Z = layer(Z)
        skip_Z = inputs
        for layer in self.skip_layers:
            skip_Z = layer(skip_Z)
        return self.activation(Z + skip_Z)

We can now build a ResNet-34 using a `Sequential` model, since it is jsut a long sequence of layers, we can treat each residual unit as a single layer now that we have the `ResidualUnit` class:

In [16]:
model = tf.keras.Sequential([
    DefaultConv2D(64, kernel_size=7, strides=2, input_shape=[224, 224, 3]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding="same"),
])
prev_filters = 64
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    strides = 1 if filters == prev_filters else 2
    model.add(ResidualUnit(filters, strides=strides))
    prev_filters = filters

model.add(tf.keras.layers.GlobalAvgPool2D())
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(10, activation="softmax"))

## Using Pretrained Models from Keras
In general, we don't have to implement standard models like *GooLeNet* or *ResNet* manually since pretrained networks are available in the `tf.keras.applications` package.

For example, we can load the ResNet-50 model, pretrained on ImageNet, using:

In [17]:
model = tf.keras.applications.ResNet50(weights="imagenet")

Now we can create a ResNet-50 model and download weights pretrained on the ImageNet dataset. To use it, first ensure that the images have the right size: ResNet-50 model expects 224x224 pixel images. So, we can use Keras's `Resizing` layer to resize 2 sample images:

In [18]:
images = tf.keras.backend.constant(load_sample_images()["images"])
images_resized = tf.keras.layers.Resizing(height=224, width=224,
                                          crop_to_aspect_ratio=True)(images)

The pretrained models assume that the iamges are preprocessed in a specific way. In some cases they may expect the inputs to be scaled from 0 to 1, or
from –1 to 1, and so on. Each model provides a `preprocess_input` function that can be used to preprocess the images:

In [19]:
inputs = tf.keras.applications.resnet50.preprocess_input(images_resized)

Now we can use the pretrained model to make predictions:

In [20]:
Y_proba = model.predict(inputs)
Y_proba.shape

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 2s/step


(2, 1000)

As usual, the output `Y_proba` is a matrix with 1 row per image and 1 column per class (in this case, there are 1,000 classes). To display the top *K* predictions, use the `decode_predictions` function, which returns an array containing the class identifier, its name, and the corresponding confidence score for each iamge:

In [21]:
top_K = tf.keras.applications.resnet50.decode_predictions(Y_proba, top=3)

for image_index in range(len(images)):
    print(f"Image #{image_index}")
    for class_id, name, y_proba in top_K[image_index]:
        print(f"  {class_id} - {name:12s} {y_proba:.2%}")

Image #0
  n03877845 - palace       54.69%
  n03781244 - monastery    24.71%
  n02825657 - bell_cote    18.55%
Image #1
  n04522168 - vase         32.67%
  n11939491 - daisy        17.82%
  n03530642 - honeycomb    12.04%


## Pretrained Models for Transfer Learning
If we want to build an image classifier but do not have enough data to train it from scratch, it is often a good idea to reuse the lower layers of a pretrained model. For example, let's train a model to classify pictures of flowers, reusing a pretrained Xception model.

First, load the flowers dataset using "TensorFlow Datasets":

In [22]:
import tensorflow_datasets as tfds

dataset, info = tfds.load("tf_flowers", as_supervised=True, with_info=True)
dataset_size = info.splits["train"].num_examples # 3670
class_names = info.features["label"].names # ['dandelion', 'daisy', 'tulips', 'sunflowers', 'roses']
n_classes = info.features["label"].num_classes # 5

Since there is only a "train" dataset, with no test or validation set, we need to split the training set. We can call `tfds.load` again, but this time taking the first 10% of the dataset for testing, next 15% for validation and the remaining 75% for training:

In [23]:
test_set_raw, valid_set_raw, train_set_raw = tfds.load(
    "tf_flowers",
    split=["train[:10%]", "train[10%:25%]", "train[25%:]"],
    as_supervised=True)

All 3 datasets contain individual iamges. We need to batch them, but first we need to ensure that they all have the same size else batching will fail. We can use a `Resizing` layerfor this. We must also call the `tf.keras.applications.xception.preprocess_input` function to preprocess the images appropriately for the Xception mode. Lastly, we'll shuffle the training set and use prefetching:

In [24]:
tf.keras.backend.clear_session()

batch_size = 32
preprocess = tf.keras.Sequential([
    tf.keras.layers.Resizing(height=224, width=224, crop_to_aspect_ratio=True),
    tf.keras.layers.Lambda(tf.keras.applications.xception.preprocess_input)
])
train_set = train_set_raw.map(lambda X, y: (preprocess(X), y))
train_set = train_set.shuffle(1000, seed=42).batch(batch_size).prefetch(1)
valid_set = valid_set_raw.map(lambda X, y: (preprocess(X), y)).batch(batch_size)
test_set = test_set_raw.map(lambda X, y: (preprocess(X), y)).batch(batch_size)

Now each batch contains 32 images, all of them being 224x224 pixels, with pixel values ranging from -1 to 1.

Since the dataset is not very huge, a bit of data augmentation will help. Let's create a data augmentation model that we will embed in the  final model. During training, it will randomly flip the images horizontally, rotate them a little bit, and tweak the constrast:

In [25]:
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip(mode="horizontal", seed=42),
    tf.keras.layers.RandomRotation(factor=0.05, seed=42),
    tf.keras.layers.RandomContrast(factor=0.2, seed=42)
])

Next let’s load an Xception model, pretrained on ImageNet. We exclude the
top of the network by setting `include_top=False`. This excludes the global
average pooling layer and the dense output layer. We then add our own
global average pooling layer (feeding it the output of the base model),
followed by a dense output layer with one unit per class, using the softmax
activation function:

In [26]:
tf.random.set_seed(42)

base_model = tf.keras.applications.xception.Xception(weights="imagenet",
                                                     include_top=False)
avg = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
model = tf.keras.Model(inputs=base_model.input, outputs=output)

It is usually a good idea to freeze the weights of the pretrained layers, at least at the beginning of training:

In [27]:
for layer in base_model.layers:
  layer.trainable = False

Now we can compile the model and start training:

In [28]:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=3)

Epoch 1/3
[1m86/86[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m796s[0m 9s/step - accuracy: 0.7201 - loss: 0.8890 - val_accuracy: 0.8566 - val_loss: 0.5970
Epoch 2/3
[1m86/86[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m817s[0m 9s/step - accuracy: 0.9026 - loss: 0.3612 - val_accuracy: 0.8367 - val_loss: 0.8240
Epoch 3/3
[1m86/86[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m859s[0m 9s/step - accuracy: 0.9306 - loss: 0.2099 - val_accuracy: 0.8276 - val_loss: 0.7655


After a few epochs, the model's validation accuracy will stop improving. This means that the top layers are now pretty well trained, and we can now unfreeze some of the base model's top layers then continue trianing. For example, let's unfreeze layers 56 and above:

In [30]:
for layer in base_model.layers[56:]:
    layer.trainable = True

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])
history = model.fit(train_set, validation_data=valid_set, epochs=10)

Epoch 1/3
[1m86/86[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1607s[0m 18s/step - accuracy: 0.9334 - loss: 0.2102 - val_accuracy: 0.8875 - val_loss: 0.5031
Epoch 2/3
[1m86/86[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1592s[0m 19s/step - accuracy: 0.9816 - loss: 0.0597 - val_accuracy: 0.9002 - val_loss: 0.3643
Epoch 3/3
[1m86/86[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1623s[0m 18s/step - accuracy: 0.9968 - loss: 0.0147 - val_accuracy: 0.8947 - val_loss: 0.3570


## Classification and Localization
Localizing an object in a picture can be expressed as a regression task: to predict a bounding box around the object. A common approach is to predict the horizontal and vertical coordinates of the object's center, as well as its height and width. Meaning that we have 4 numbers to predict. It does not require much change to the model; we just need to add a second dense output layer with 4 units, and it can be trained using MSE loss:

In [31]:
tf.random.set_seed(42)

base_model = tf.keras.applications.xception.Xception(weights="imagenet",
                                                     include_top=False)
avg = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
class_output = tf.keras.layers.Dense(n_classes, activation="softmax")(avg)
loc_output = tf.keras.layers.Dense(4)(avg)
model = tf.keras.Model(inputs=base_model.input,
                       outputs=[class_output, loc_output])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # added this line
model.compile(loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[0.8, 0.2],  # depends on what you care most about
              optimizer=optimizer, metrics=["accuracy", "mse"])