<a href="https://colab.research.google.com/github/isa-ulisboa/greends-pml/blob/main/notebooks/T9_CNNs_for_image_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional NN for image classification

## Convolutions and kernels

A convolution applies a kernel across an image. A kernel is a small matrix, such as the 3×3 matrix in the figure below. The 7×7 grid on the left is the image we're going to apply the kernel to. At each location, the convolution operation multiplies each element of the kernel by the corresponding element of a 3×3 block of the image. The results of these multiplications are then added together. The diagram shows an example of applying a kernel to a single location in the image, the 3×3 block around cell 18.

<img src="https://github.com/fastai/fastbook/blob/master/images/chapter9_conv_basic.png?raw=1" id="basic_conv" caption="Applying a kernel to one location" alt="Applying a kernel to one location" width="600">
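The multiply-and-add step at a single location can be sketched in a few lines. This is a minimal illustration under two assumptions: the image cells are numbered 1 to 49 row by row, and the kernel values are chosen arbitrarily (they are not taken from the figure).

```python
import torch

# Assumption: a 7x7 image whose cells are numbered 1..49, row by row
image = torch.arange(1, 50, dtype=torch.float32).reshape(7, 7)

# Assumption: an arbitrary 3x3 kernel (not the one drawn in the figure)
kernel = torch.tensor([[-1., -1., -1.],
                       [ 0.,  0.,  0.],
                       [ 1.,  1.,  1.]])

# With this numbering, cell 18 sits at row 2, column 3 (0-based indices)
row, col = 2, 3
block = image[row - 1:row + 2, col - 1:col + 2]  # 3x3 block centred on cell 18

# Elementwise product, then sum: one output activation
activation = (block * kernel).sum()
print(block)
print(activation)  # tensor(42.)
```

Sliding the same computation over every valid location produces the full activation map.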

In the paper ["A Guide to Convolution Arithmetic for Deep Learning"](https://arxiv.org/abs/1603.07285) there are many nice diagrams showing how image kernels can be applied. Here's an example from the paper showing (at the bottom) a light blue 4×4 image, with a dark blue 3×3 kernel being applied, creating a 2×2 green output activation map at the top.

<img alt="Result of applying a 3×3 kernel to a 4×4 image" width="782" caption="Result of applying a 3×3 kernel to a 4×4 image (courtesy of Vincent Dumoulin and Francesco Visin)" id="three_ex_four_conv" src="https://github.com/fastai/fastbook/blob/master/images/att_00028.png?raw=1">

What is the shape of the result? If the original image has a height of `h` and a width of `w`, how many 3×3 windows can we find? As you can see from the example, there are `h-2` by `w-2` windows, so the resulting activation map has a height of `h-2` and a width of `w-2`.
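A quick way to check this shape rule is to run an unpadded, stride-1 convolution with `torch.nn.functional.conv2d`. The sizes and random values below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

h, w = 7, 7                           # illustrative image size
image = torch.randn(1, 1, h, w)       # (batch, channels, height, width)
kernel = torch.randn(1, 1, 3, 3)      # (out_channels, in_channels, kh, kw)

out = F.conv2d(image, kernel)         # no padding, stride 1
print(out.shape)  # torch.Size([1, 1, 5, 5]), i.e. (h-2) by (w-2)
```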

## Padding, pooling, stride and activation map

**Padding** consists of adding new cells on the margins of the input, with a given value (usually 0). With appropriate padding, we can ensure that the output **activation map** is the same size as the original image, which can make things a lot simpler when we construct our architectures. The figure below shows how adding padding allows us to apply the kernels in the image corners.

<img src="https://github.com/fastai/fastbook/blob/master/images/chapter9_padconv.svg?raw=1" id="pad_conv" caption="A convolution with padding" alt="A convolution with padding" width="600">

With a 5×5 input, 4×4 kernel, and 2 pixels of padding, we end up with a 6×6 activation map:

<img alt="A 4×4 kernel with 5×5 input and 2 pixels of padding" width="783" caption="A 4×4 kernel with 5×5 input and 2 pixels of padding (courtesy of Vincent Dumoulin and Francesco Visin)" id="four_by_five_conv" src="https://github.com/fastai/fastbook/blob/master/images/att_00029.png?raw=1">

If we apply a kernel of size `ks` by `ks` (with `ks` an odd number), the necessary padding on each side to keep the same shape is `ks//2`. An even number for `ks` would require a different amount of padding on the top/bottom and left/right, but in practice we almost never use an even filter size.
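The `ks//2` rule can be checked with `nn.Conv2d` at stride 1. The 28×28 input and the list of kernel sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)  # illustrative input size

# Odd kernel sizes: padding ks//2 keeps the spatial size unchanged
same_shapes = {}
for ks in (1, 3, 5, 7):
    layer = nn.Conv2d(1, 1, kernel_size=ks, padding=ks // 2)
    same_shapes[ks] = tuple(layer(x).shape[-2:])
print(same_shapes)  # every entry is (28, 28)

# Even kernel size: symmetric padding of ks//2 overshoots by one pixel
even_layer = nn.Conv2d(1, 1, kernel_size=4, padding=2)
even_shape = tuple(even_layer(x).shape[-2:])
print(even_shape)  # (29, 29)
```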

**Stride**. So far, when we have applied the kernel to the grid, we have moved it one pixel over at a time. But we can jump further; for instance, we could move over two pixels after each kernel application, as in the figure below. This is known as a *stride-2* convolution. The most common kernel size in practice is 3×3, and the most common padding is 1. As you'll see, **stride-2** convolutions are useful for decreasing the size of our outputs, and **stride-1** convolutions are useful for adding layers without changing the output size.

<img alt="A 3×3 kernel with 5×5 input, stride-2 convolution, and 1 pixel of padding" width="774" caption="A 3×3 kernel with 5×5 input, stride-2 convolution, and 1 pixel of padding (courtesy of Vincent Dumoulin and Francesco Visin)" id="three_by_five_conv" src="https://github.com/fastai/fastbook/blob/master/images/att_00030.png?raw=1">

In an image of size `h` by `w`, using a 3×3 kernel with a padding of 1 and a stride of 2 will give us a result of size `(h+1)//2` by `(w+1)//2`. The general formula for each dimension is `(n + 2*pad - ks)//stride + 1`, where `n` is the input size along that dimension, `pad` is the padding, `ks` is the size of our kernel, and `stride` is the stride.
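The general formula can be written as a small helper and compared with what `nn.Conv2d` actually produces. The helper name `conv_out_size` is hypothetical, introduced only for this sketch.

```python
import torch
import torch.nn as nn

def conv_out_size(n, ks, stride=1, pad=0):
    """Output size along one dimension: (n + 2*pad - ks)//stride + 1."""
    return (n + 2 * pad - ks) // stride + 1

# Compare with PyTorch for a 3x3 kernel, stride 2, padding 1
layer = nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)
sizes = {}
for n in (28, 14, 7, 5, 4, 2):
    out = layer(torch.randn(1, 1, n, n))
    sizes[n] = (out.shape[-1], conv_out_size(n, ks=3, stride=2, pad=1), (n + 1) // 2)
print(sizes)
# {28: (14, 14, 14), 14: (7, 7, 7), 7: (4, 4, 4), 5: (3, 3, 3), 4: (2, 2, 2), 2: (1, 1, 1)}
```

For each input size, the three numbers agree: the size returned by PyTorch, the general formula, and the `(n+1)//2` shortcut.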

**Pooling** slides a window over the input like a convolution, but applies a fixed, non-trainable operation (such as the maximum or the average of the window), as illustrated in the example below. This can be used to reduce the size of a layer. However, pooling can be replaced by a convolution with a stride larger than 1 (see the paper "Striving for Simplicity: The All Convolutional Net" at https://arxiv.org/abs/1412.6806).

<img src="https://epynn.net/_images/pool-01.svg"  width="600">
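As a small sketch of pooling in PyTorch, the snippet below applies `nn.MaxPool2d` and `nn.AvgPool2d` to a hand-written 4×4 input. The input values are an assumption for illustration and are not taken from the figure.

```python
import torch
import torch.nn as nn

# Illustrative 4x4 input, reshaped to (batch, channels, height, width)
x = torch.tensor([[ 1.,  2.,  5.,  6.],
                  [ 3.,  4.,  7.,  8.],
                  [ 9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

max_pool = nn.MaxPool2d(kernel_size=2)  # stride defaults to the kernel size
avg_pool = nn.AvgPool2d(kernel_size=2)

max_out = max_pool(x).squeeze()
avg_out = avg_pool(x).squeeze()
print(max_out)  # tensor([[ 4.,  8.], [12., 16.]])
print(avg_out)  # tensor([[ 2.5,  6.5], [10.5, 14.5]])

# Pooling layers have no trainable parameters
n_params = sum(p.numel() for p in max_pool.parameters())
print(n_params)  # 0
```

Each 2×2 window is reduced to a single value, so the height and width are halved.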


## Creating a simple CNN from scratch

We create a CNN from scratch with `torch.nn.Conv2d` (https://docs.pytorch.org/docs/stable/generated/torch.nn.Conv2d.html). First, we'll define a function with the basic parameters we'll use in each convolution. Each layer receives `ic` input channels and produces `oc` output channels.

The input of the CNN has shape (N,C,H,W), where N is the batch size, C is the number of input channels, and H and W are the height and width of each image. The input and output image sizes of a single convolutional layer are in general different, depending on the parameters of the convolution (in this case, with a stride of 2, the output size is half of the input size, rounded up).

In [None]:
import torch
import torch.nn as nn

# Custom Flatten layer for nn.Sequential; reshapes the final output to a 2D tensor suitable for classification
class Flatten(nn.Module):
    def forward(self, x):
        return x.view(x.size(0), -1)

# Convolutional block as described: ic=input channels; oc=output channels
def conv(ic, oc, act=True):
    ks=3 # kernel size
    res = nn.Conv2d(ic, oc, stride=2, kernel_size=ks, padding=ks//2)
    if act: #apply ReLU activation function
        res = nn.Sequential(res, nn.ReLU())
    return res

# Build the CNN model
simple_cnn = nn.Sequential(
    conv(1, 4),            # Output: 14x14
    conv(4, 8),            # Output: 7x7
    conv(8, 16),           # Output: 4x4
    conv(16, 32),          # Output: 2x2
    conv(32, 2, act=False),# Output: 1x1
    Flatten(),
)

# Example usage with a dummy input
xb = torch.randn(1, 1, 28, 28)  # batch size 1, 1 channel, 28x28 image
output = simple_cnn(xb)
print(output.shape)  # torch.Size([1, 2])

torch.Size([1, 2])


In the example above, the model's input is a batch with 1 example of a 1-channel image with 28 rows and 28 columns, and the output has two activations, which would map to the two possible levels in our labels.

We look at its structure, and in particular at the number of weights that need to be estimated, with `summary`.


In [None]:
from torchsummary import summary
# For your model, input size is (1, 28, 28) (channels, height, width)
summary(simple_cnn, input_size=(1, 28, 28))


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1            [-1, 4, 14, 14]              40
              ReLU-2            [-1, 4, 14, 14]               0
            Conv2d-3              [-1, 8, 7, 7]             296
              ReLU-4              [-1, 8, 7, 7]               0
            Conv2d-5             [-1, 16, 4, 4]           1,168
              ReLU-6             [-1, 16, 4, 4]               0
            Conv2d-7             [-1, 32, 2, 2]           4,640
              ReLU-8             [-1, 32, 2, 2]               0
            Conv2d-9              [-1, 2, 1, 1]             578
          Flatten-10                    [-1, 2]               0
Total params: 6,722
Trainable params: 6,722
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.02
Params size (MB): 0.03
Estimated Total

#### Number of parameters

The number of parameters depends on the kernel size, the number of input channels and the number of output features. Each convolution applies over the kernel and the input channels. There will be one convolution per output feature. The number of multiplicative weights per convolution is therefore `ic*ks*ks`, where `ic` is the depth of the input and `ks` is the kernel size. If there is an additive weight (bias), that adds one more parameter per convolution. Finally, the number of parameters `ic*ks*ks+1` must be multiplied by the number `oc` of output features.

The summary shows we have 40 parameters for the first convolution, which is `oc*(ic*ks*ks+1)=4*(1*3*3+1)`, where `oc` is the depth of the output. Four of those parameters are for the bias.

Similarly, for the second convolution, the depth of the input is `4` and the kernel is `3x3`, which means that there are `4*3*3` multiplicative weights and `1` additive weight (bias) per output feature. Since the depth of the output `oc` is 8, in total there are `8*(4*3*3+1)=296` parameters.
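The same arithmetic can be checked for every layer of the model. The helper `conv_params` is a hypothetical name used only in this sketch; the channel pairs are those of `simple_cnn`.

```python
import torch.nn as nn

def conv_params(ic, oc, ks=3, bias=True):
    """Parameters of one conv layer: oc * (ic*ks*ks + 1) when a bias is used."""
    return oc * (ic * ks * ks + (1 if bias else 0))

# (input channels, output channels) of the five convolutions in simple_cnn
channels = [(1, 4), (4, 8), (8, 16), (16, 32), (32, 2)]

formula_counts = [conv_params(ic, oc) for ic, oc in channels]

# Cross-check by counting the parameters PyTorch actually allocates
torch_counts = [
    sum(p.numel() for p in nn.Conv2d(ic, oc, kernel_size=3).parameters())
    for ic, oc in channels
]

print(formula_counts)       # [40, 296, 1168, 4640, 578]
print(sum(formula_counts))  # 6722
```

The per-layer counts and the total agree with the `Param #` column of the summary.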



If one looks at the number of weights to be estimated, they tend to grow for deeper layers of the CNN. This can be interpreted as follows:

- Deeper layers have more filters (output channels): As you progress through the network, each convolutional layer often increases the number of filters. Each filter has its own set of learnable parameters (weights and biases), so more filters mean more parameters.

- Early layers capture simple features, deeper layers capture complex features: Early convolutional layers typically detect simple patterns like edges or textures using fewer filters. Deeper layers, with more filters, are designed to capture more complex, abstract features by combining the outputs of previous layers. This increased complexity requires more parameters to represent a broader variety of patterns.

- Spatial dimensions decrease, depth increases: As the spatial size (height and width) of the feature maps decreases, often due to strided convolutions or pooling, the number of channels (depth) typically increases. This shift allows the network to represent more complex information in a more compact spatial form, again requiring more parameters per layer.

- Parameter sharing controls but doesn't eliminate growth: While convolutional layers use parameter sharing to reduce the number of parameters compared to fully connected layers, the total number of parameters still grows as the number of filters increases in deeper layers.

The growth in the number of parameters along the CNN reflects the network's **increasing capacity to model complex and abstract features in the data**. This design allows the network to **learn hierarchical representations, from low-level edges to high-level semantic concepts**, by dedicating more resources (parameters) to the deeper, more complex parts of the model.

#### Receptive Fields

The *receptive field* of an activation is the area of the input image that is involved in the calculation of that activation.

<img alt="Secondary precedents of conv2 layer" width="700" caption="Secondary precedents of Conv2 layer" id="preced2" src="https://github.com/fastai/fastbook/blob/master/images/att_00069.png?raw=1">

In this example, we have just two convolutional layers, each of stride 2, so this is now tracing right back to the input image. We can see that a 7×7 area of cells in the input layer is used to calculate the single green cell in the Conv2 layer. This 7×7 area is the *receptive field* in the input of the green activation in Conv2. We can also see that a second filter kernel is needed now, since we have two layers.

The deeper we are in the network (specifically, the more stride-2 convs we have before a layer), the larger the receptive field for an activation in that layer. A large receptive field means that a large amount of the input image is used to calculate each activation. We now know that in the deeper layers of the network we have semantically rich features, corresponding to larger receptive fields. Therefore, we'd expect that we'd need more weights for each of our features to handle this increasing complexity. This is another way of saying that when we introduce a stride-2 conv in our network, we should also increase the number of channels.
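The growth of the receptive field can be computed layer by layer with a standard recurrence: each layer widens the field by `(ks - 1)` times the spacing (in input pixels) between neighbouring activations of the previous layer, and that spacing is multiplied by the stride. The function name `receptive_fields` is hypothetical, and the sketch ignores the effect of padding at the image borders.

```python
def receptive_fields(layers):
    """Theoretical receptive field size after each (kernel_size, stride) layer."""
    rf, jump = 1, 1          # one input pixel sees itself; neighbours are 1 pixel apart
    sizes = []
    for ks, stride in layers:
        rf += (ks - 1) * jump  # widen by (ks - 1) steps of the current spacing
        jump *= stride         # activations of this layer are further apart in the input
        sizes.append(rf)
    return sizes

# Two 3x3 stride-2 convolutions, as in the figure
two_layers = receptive_fields([(3, 2), (3, 2)])
print(two_layers)  # [3, 7]

# Five 3x3 stride-2 convolutions, as in simple_cnn
five_layers = receptive_fields([(3, 2)] * 5)
print(five_layers)  # [3, 7, 15, 31, 63]
```

The second value reproduces the 7×7 area described above. For the five-layer model, the theoretical field of the last layers exceeds the 28×28 input, which means that each final activation can depend on the whole image.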

#### Batch Normalization

Batch normalization (aka *batchnorm*) works by standardizing the activations of each layer for each mini-batch during training.

It is useful in deep learning, especially in CNNs, for several key reasons:

- **Stabilizes and Accelerates Training:** By normalizing the activations within each layer, batch normalization ensures that the input to each layer has a consistent distribution. This reduces the problem of “internal covariate shift,” where the distribution of layer inputs changes during training, making learning unstable and slow.

- **Enables Higher Learning Rates:** Batch normalization allows for the use of higher learning rates, which speeds up training and can lead to faster convergence.

- **Improves Generalization and Reduces Overfitting:** The normalization process introduces a slight regularization effect, making the model less likely to overfit and improving its ability to generalize to new data.

- **Less Sensitive to Initialization:** Models with batch normalization are less dependent on the initial choice of weights, making them easier and more robust to train.

- **Improves Gradient Flow:** By keeping the scale of activations and gradients more consistent, batch normalization helps prevent vanishing or exploding gradients, allowing for deeper networks to be trained effectively.

In summary, batch normalization makes neural network training faster, more stable, and more robust, while also improving generalization and reducing the need for careful weight initialization.
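To make the standardization step concrete, the sketch below compares the output of `nn.BatchNorm2d` in training mode with a manual per-channel computation. The tensor shape and values are illustrative assumptions; at initialization the layer's learnable scale and shift are 1 and 0, so they do not change the result.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 7, 7) * 3 + 5   # illustrative activations: (N, C, H, W)

bn = nn.BatchNorm2d(4)                # modules start in training mode
y = bn(x)

# Manual version: per-channel mean and (biased) variance over batch and spatial dims
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + bn.eps)

matches = torch.allclose(y, y_manual, atol=1e-4)
print(matches)  # True

# After normalization, each channel has mean close to 0 and std close to 1
print(y.mean(dim=(0, 2, 3)))
print(y.std(dim=(0, 2, 3)))
```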


In [None]:
import torch
import torch.nn as nn

# Custom Flatten layer
class Flatten(nn.Module):
    def forward(self, x):
        return x.view(x.size(0), -1)

# Convolutional block with batch normalization
def conv(ic, oc, act=True):
    ks = 3
    layers = [
        nn.Conv2d(ic, oc, stride=2, kernel_size=ks, padding=ks//2),
        nn.BatchNorm2d(oc)  # insert nn.BatchNorm2d layer after each convolution and before the activation (ReLU).
    ]
    if act:
        layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Build the CNN model with batch normalization
simple_cnn = nn.Sequential(
    conv(1, 4),            # 14x14
    conv(4, 8),            # 7x7
    conv(8, 16),           # 4x4
    conv(16, 32),          # 2x2
    conv(32, 2, act=False),# 1x1
    Flatten(),
)

# Example usage with a dummy input
xb = torch.randn(3, 1, 28, 28) # need batch size > 1 for batch normalization
output = simple_cnn(xb)
print(output.shape)  # Should print: torch.Size([3, 2])


torch.Size([3, 2])


To apply the same batch normalization parameters to both training and validation (or test) data, you need to ensure that the running mean and variance (computed during training) are used for normalization during validation and inference, rather than the statistics of the current batch.

- During training (`model.train()`), batch normalization normalizes activations using the mean and variance computed from the current mini-batch.

- During validation or inference (`model.eval()`), batch normalization uses the running (moving average) mean and variance that were accumulated during training, ensuring consistency between training and validation/inference phases.
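The difference between the two modes can be observed directly. In this sketch (random data and batch sizes are illustrative assumptions), the output for one example in `eval()` mode does not depend on the other examples in the batch, whereas in `train()` mode it does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(1)

# Accumulate running statistics on some illustrative "training" batches
bn.train()
with torch.no_grad():
    for _ in range(50):
        bn(torch.randn(16, 1, 4, 4) * 2 + 10)
print(bn.running_mean, bn.running_var)

x = torch.randn(16, 1, 4, 4) * 2 + 10   # a new batch

# eval(): running statistics are used, so the first example gets the same
# output whether it is processed alone or together with the rest of the batch
bn.eval()
with torch.no_grad():
    eval_alone = bn(x[:1])
    eval_in_batch = bn(x)[:1]
eval_consistent = torch.allclose(eval_alone, eval_in_batch)

# train(): the statistics of the current batch are used, so the output for the
# first example changes with the composition of the batch
bn.train()
with torch.no_grad():
    train_small = bn(x[:4])[:1]
    train_full = bn(x)[:1]
train_consistent = torch.allclose(train_small, train_full)

print(eval_consistent, train_consistent)  # True False
```

This is why `model.eval()` should be called before computing validation or test metrics, and `model.train()` before resuming training.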

## Exercise: apply CNN to classify MNIST images

1. Replace two of the convolutional layers by **pooling** layers in `simple_cnn`. Each pooling layer should reduce the size (H,W) of the input by a factor of 2.
2. **Dropout** is typically applied after the activation function (such as ReLU), and before the next layer. Add dropout (p=0.5) to the definition of `conv`.
3. Adapt the model `simple_cnn` and train it with the **MNIST** training data set.
4. Create a learning curve using the MNIST validation data set that would help to determine the optimum number of epochs.