<a href="https://colab.research.google.com/github/zeeshan-sardar/efficient_ml/blob/main/cnn_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

CNNs are specifically designed for image-related tasks. They focus on local features instead of looking at the image as a whole.
CNNs also need far fewer parameters than fully connected layers, because the same small kernel is reused at every position in the image.
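
To make the parameter saving concrete, here is a rough sketch that counts the parameters of a convolutional layer and of a fully connected layer producing the same number of output values. The sizes (a 3-channel 8x8 input, 16 output maps) and the helper `count_params` are illustrative assumptions, not part of the original notebook.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable values (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

# Convolution: 3 input channels -> 16 feature maps, 3x3 kernels
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Fully connected layer mapping a flattened 3x8x8 image to 16x8x8 outputs
fc = nn.Linear(3 * 8 * 8, 16 * 8 * 8)

conv_params = count_params(conv)  # 16*3*3*3 weights + 16 biases
fc_params = count_params(fc)      # 192*1024 weights + 1024 biases

print("Conv2d parameters:", conv_params)
print("Linear parameters:", fc_params)
```

The convolution's parameter count does not depend on the image size, while the fully connected layer's count grows with both the input and output resolution.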

## Important Aspects of CNNs

### Kernels/Filters as Weights
These kernels are small matrices (usually 1x1, 3x3 or 5x5) which are learned during network training to identify specific patterns in images. Their values are learned through backpropagation and gradient descent.

In the early layers of the network, these filters learn simple patterns like edges, and as we go deeper into the network they can detect more complex features like shapes. A particular kernel produces a strong response when it finds a matching pattern in the input image. Applying these filters to the input image produces feature maps.
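
As a small illustration of a kernel responding to a pattern, the sketch below applies a hand-written vertical edge kernel to a toy image. In a real CNN the kernel values are learned, not hand-written; the Sobel-like kernel and the 6x6 image here are assumptions chosen only to make the response easy to read.

```python
import torch
import torch.nn.functional as F

# Toy 6x6 grayscale image: left half dark (0), right half bright (1)
image = torch.zeros(1, 1, 6, 6)
image[..., 3:] = 1.0

# Hand-written vertical edge kernel, shape (out_channels, in_channels, kH, kW)
kernel = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

# No padding, stride 1: a 6x6 input gives a 4x4 feature map
feature_map = F.conv2d(image, kernel)
print(feature_map[0, 0])
```

The feature map is non-zero only in the columns where the 3x3 window straddles the dark-to-bright boundary, which is where the kernel "finds" its pattern.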

### PyTorch Tensor Format for CNNs
PyTorch expects image data in a specific layout:
- Input tensor: (batch_size, in_channels, img_height, img_width)
- `nn.Conv2d` layer parameters: (in_channels, out_channels, kernel_size, stride, padding)

### Example

In [5]:
import torch
import torch.nn as nn

# Input tensor
input_tensor = torch.randn(1, 3, 8, 8)

# Convolutional layer
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Apply convolution
output_tensor = conv_layer(input_tensor)

print("Input tensor size:", input_tensor.size())
print("Output tensor size:", output_tensor.size())


Input tensor size: torch.Size([1, 3, 8, 8])
Output tensor size: torch.Size([1, 16, 8, 8])


In [6]:
# Check the size of the weights
print("Shape of conv_layer weights:", conv_layer.weight.shape)

# Print the weights of the first kernel
print("Weights of the first kernel: \n", conv_layer.weight[0])

Shape of conv_layer weights: torch.Size([16, 3, 3, 3])
Weights of the first kernel: 
 tensor([[[-0.1691, -0.0810, -0.1446],
         [ 0.0615,  0.0248, -0.0155],
         [-0.0579, -0.0669, -0.0626]],

        [[ 0.1800,  0.0315, -0.1642],
         [-0.1091,  0.1196, -0.1397],
         [-0.0068, -0.0308, -0.0815]],

        [[-0.0874, -0.1874,  0.0457],
         [-0.0329,  0.1774,  0.0874],
         [ 0.1300,  0.0244, -0.1758]]], grad_fn=<SelectBackward0>)


In the example above, `conv_layer.weight` holds the kernels/filters, which are the learnable parameters. There are 16 filters in total, each of shape 3x3x3, because the input has 3 channels and the kernel size is 3x3. There are 16 of them because the number of output channels is 16 in this case.

The output tensor has shape 1x16x8x8; these are the feature maps. Each 3x3x3 filter produces one 8x8 feature map, so the 16 filters give 16 feature maps, one per output channel. The size of each feature map depends on the input size, kernel size, stride, and padding.
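
The dependence on input size, kernel size, stride, and padding can be written as `out = floor((in + 2 * padding - kernel_size) / stride) + 1` per spatial dimension (assuming dilation of 1). The sketch below checks this formula against `nn.Conv2d` for a few settings; the helper name `conv_output_size` and the settings tried are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution, assuming dilation of 1."""
    return (in_size + 2 * padding - kernel_size) // stride + 1

# (kernel_size, stride, padding) settings applied to an 8x8 input
settings = [(3, 1, 1), (3, 1, 0), (3, 2, 1), (5, 1, 2)]

results = []
for k, s, p in settings:
    layer = nn.Conv2d(3, 16, kernel_size=k, stride=s, padding=p)
    out = layer(torch.randn(1, 3, 8, 8))
    predicted = conv_output_size(8, k, s, p)
    results.append((out.shape[-1], predicted))
    print(f"kernel={k} stride={s} padding={p} -> actual {tuple(out.shape)}, predicted {predicted}")
```

With `kernel_size=3, stride=1, padding=1`, as in the example above, the formula gives 8, which is why the 8x8 input keeps its spatial size.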

## Parameter Update using Gradient Descent

In machine learning and optimization, gradient descent is a common algorithm used to minimize a loss function. The loss function measures the difference between the predicted output of a model and the actual target values. The goal of training a model is to find the set of parameters (weights and biases) that minimize this loss function.

### Gradient:

The gradient of a function represents the rate of change of the function with respect to its parameters. In the context of neural networks, the gradient of the loss function with respect to the model parameters indicates how much the loss would change if the parameters were adjusted slightly.
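
As a quick sanity check of this idea, the sketch below computes a gradient with PyTorch autograd and compares it with a finite-difference estimate, which literally measures how much the loss changes when the parameter is nudged slightly. The loss `L(w) = (w - 5)^2` and the starting value `w = 2` are assumptions made up for illustration.

```python
import torch

def loss_fn(w):
    # Hypothetical one-parameter loss, minimized at w = 5
    return (w - 5.0) ** 2

# Gradient from autograd
w = torch.tensor(2.0, requires_grad=True)
loss_fn(w).backward()
autograd_grad = w.grad.item()

# Gradient estimated by nudging w slightly in both directions
eps = 1e-3
numeric_grad = (loss_fn(2.0 + eps) - loss_fn(2.0 - eps)) / (2 * eps)

print("autograd gradient:", autograd_grad)
print("finite-difference gradient:", numeric_grad)
```

Both values agree with the analytic derivative `2 * (w - 5)` evaluated at `w = 2`.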

### Minimizing the Loss Function:

To minimize the loss function, we want to adjust the model parameters in a way that reduces the loss. This adjustment is done by moving the parameters in the direction opposite to the gradient. This is because the gradient points in the direction of the steepest ascent, and moving in the opposite direction will lead to a decrease in the loss.

### Opposite Direction of the Gradient:

When we say "the weights are adjusted in the opposite direction of the gradient," it means that we update the weights by subtracting a fraction of the gradient from the current weights. This fraction is determined by the learning rate, which controls the size of the steps taken during optimization.

### Example:

Suppose we have a simple loss function \(L(w)\) that depends on a single weight parameter \(w\). We compute the gradient of the loss function with respect to \(w\) as \(\frac{dL}{dw}\). If the gradient is positive, it means that increasing \(w\) will increase the loss, and decreasing \(w\) will decrease the loss. Therefore, to minimize the loss, we update \(w\) by subtracting a fraction of the gradient:
\[ w_{\text{new}} = w_{\text{old}} - \text{learning\_rate} \times \frac{dL}{dw} \]
If the gradient is negative, it means that increasing \(w\) will decrease the loss, and decreasing \(w\) will increase the loss. The same update rule still applies: subtracting a negative quantity increases \(w\), which is equivalent to adding a fraction of the gradient's magnitude.
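
The update rule can be sketched in a few lines of plain Python. The loss `L(w) = (w - 3)^2`, the learning rate, and the starting points are assumptions chosen so that one run starts with a positive gradient and the other with a negative one.

```python
def grad(w):
    # dL/dw for the assumed loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

learning_rate = 0.1

def descend(w, steps=50):
    """Repeatedly apply w_new = w_old - learning_rate * dL/dw."""
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Starting right of the minimum: gradient is positive, so w decreases
w_from_right = descend(6.0)
# Starting left of the minimum: gradient is negative, so w increases
w_from_left = descend(0.0)

print("from w=6.0:", w_from_right)
print("from w=0.0:", w_from_left)
```

In both cases the same subtraction moves `w` toward the minimum at 3; only the sign of the gradient differs.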

### Summary:

Adjusting the weights in the opposite direction of the gradient is a fundamental principle in optimization algorithms like gradient descent. By iteratively updating the weights based on the gradient, the model gradually learns to minimize the loss function and improve its performance on the given task.
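
To connect this back to the convolutional layer, here is a rough sketch of the kernels being updated by gradient descent with `torch.optim.SGD`. The random inputs, random targets, MSE loss, and learning rate are placeholder assumptions; a real task would use actual images and labels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.1  # assumed learning rate

conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
optimizer = torch.optim.SGD(conv_layer.parameters(), lr=lr)
loss_fn = nn.MSELoss()

# Placeholder data: random inputs and random target feature maps
inputs = torch.randn(4, 3, 8, 8)
targets = torch.randn(4, 16, 8, 8)

# First step: check that SGD applies w_new = w_old - lr * gradient
optimizer.zero_grad()
loss = loss_fn(conv_layer(inputs), targets)
loss.backward()
w_before = conv_layer.weight.detach().clone()
w_grad = conv_layer.weight.grad.detach().clone()
optimizer.step()
matches_rule = torch.allclose(conv_layer.weight.detach(), w_before - lr * w_grad)

# Further steps: the loss on this fixed batch should keep decreasing
losses = [loss.item()]
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(conv_layer(inputs), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print("Update matches w - lr * grad:", matches_rule)
print("First loss:", losses[0], "Last loss:", losses[-1])
```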

In [None]:
!pip install -q d2l