## Introduction to CNNs: Understanding Convolutions and Pooling
Convolutional Neural Networks (CNNs) are a type of deep learning model primarily used for image processing tasks such as image classification and object detection. Unlike regular fully-connected networks, CNNs are designed to automatically and adaptively learn spatial hierarchies of features from the input data. In other words, a CNN learns features at several levels of abstraction directly from its input (such as an image), without those features being hand-designed.

To illustrate, consider an example of a CNN trained to recognize faces. In the beginning layers of the network, it might learn to detect simple features such as edges and curves. Moving deeper into the network, the subsequent layers might start to recognize combinations of edges and curves that form more complex features like eyes, noses, and mouths. Further down the network, even more complex features like a face can be detected.

So, it's like starting from simple features and gradually building up a 'hierarchy' of increasingly complex features. This is what we refer to as spatial hierarchies. The network does all this without any explicit programming instructing it to learn these features – hence the term "automatically and adaptively".

CNNs are founded on basic building blocks: convolutional and pooling layers.

### Convolutional Layer
A convolutional layer applies a series of different image filters, also known as convolutional kernels, to the input image. Each filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a CNN might have size 5x5x3 (i.e., 5 pixels width and height, and 3 because images have depth 3, the color channels).
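
To make the filter shape concrete, here is a minimal sketch of such a first layer. The choice of 8 filters and a 32x32 input is an assumption made for illustration. Note that PyTorch stores the channel dimension first, so a "5x5x3" filter appears as a slice of shape (3, 5, 5).

```python
import torch
from torch import nn

# Hypothetical first layer: 3 input channels (RGB), 8 filters, 5x5 kernel
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)

# Weights are stored as (out_channels, in_channels, height, width),
# so each of the 8 filters spans all 3 input channels
print(conv.weight.shape)   # torch.Size([8, 3, 5, 5])

# Each filter produces one output channel; without padding,
# a 32x32 input shrinks to 28x28 (32 - 5 + 1 = 28)
image = torch.randn(1, 3, 32, 32)
print(conv(image).shape)   # torch.Size([1, 8, 28, 28])
```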

As the filter slides over the image (or another layer's feature map), it is convolved with that portion of the image: at each position, the layer computes the dot product between the entries of the filter and the patch of the input underneath it.

Imagine you're given a large image, and your task is to find a smaller image within that large image. One way you might approach this is by taking your smaller image and sliding it over every possible position in the large image to see where it fits best.

This is essentially what a convolutional layer does. It takes a set of "small images" (filters) and slides them over the input image, checking how well they match. When a filter matches well with a certain region of the image, the convolutional layer will output a high value, indicating that it has found a feature it recognizes.

Here's a simplified example in PyTorch:

In [2]:
# PyTorch has its own conv2d function in the torch.nn.functional module
import torch
from torch import nn
import torch.nn.functional as F

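# input: a batch of one single-channel 3x3 image filled with ones
# (shape: batch, channels, height, width)
inputs = torch.ones(1, 1, 3, 3)

# filters: a single 2x2 filter of ones
# (shape: out_channels, in_channels, height, width)
filters = torch.ones(1, 1, 2, 2)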

outputs = F.conv2d(inputs, filters, stride=1, padding=0)

print(outputs)


tensor([[[[4., 4.],
          [4., 4.]]]])
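
The sliding dot product can also be written out by hand. The following is a minimal sketch, assuming stride 1, no padding, and a single channel; the helper name `conv2d_naive` is hypothetical and exists only for this illustration. Like `F.conv2d`, it does not flip the kernel (strictly speaking, this is cross-correlation).

```python
import torch
import torch.nn.functional as F

def conv2d_naive(image, kernel):
    """Slide a 2D kernel over a 2D image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Patch of the image currently under the kernel
            patch = image[i:i + kh, j:j + kw]
            # Dot product between the kernel and that patch
            out[i, j] = (patch * kernel).sum()
    return out

image = torch.arange(16.0).reshape(4, 4)
kernel = torch.tensor([[1.0, 0.0], [0.0, -1.0]])

manual = conv2d_naive(image, kernel)
# F.conv2d expects (batch, channels, height, width), so add two leading dims
builtin = F.conv2d(image[None, None], kernel[None, None])[0, 0]

print(manual)                            # 3x3 tensor, every entry is -5
print(torch.allclose(manual, builtin))   # True
```

A 4x4 input and a 2x2 kernel give a 3x3 output, since each spatial dimension shrinks to (input size - kernel size + 1).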


### Pooling Layer
The other building block of CNNs is the pooling layer. Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. 

Imagine you're given a task to describe a large, complex image with a small number of words. You'll probably pick the most important or distinctive features of the image to describe, effectively summarizing or "downsizing" the information.

This is the concept of pooling. The pooling layer takes a complex input (like the output of a convolutional layer), and summarizes it into a smaller, simpler form. This "summary" retains the most important features while discarding less useful details. This not only helps to reduce computational load and control overfitting, but also provides a form of translation invariance - meaning the network will recognize the same feature even if it's slightly shifted in the image. The most common type of pooling is max pooling, which takes the maximum value in each window of the input.

Here's a simple example of a max pooling operation in PyTorch:

In [3]:
# input: a batch of one single-channel 4x4 image
inputs = torch.Tensor([[[[1, 2, 1, 2], [2, 4, 2, 1], [1, 2, 4, 2], [2, 1, 2, 1]]]])

# MaxPool2d
m = nn.MaxPool2d(2, stride=2)
output = m(inputs)

print(output)

tensor([[[[4., 2.],
          [2., 4.]]]])
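
The translation invariance mentioned above is limited to shifts that stay inside one pooling window. Here is a small sketch with made-up activations that shows both cases:

```python
import torch
from torch import nn

pool = nn.MaxPool2d(2, stride=2)

# A single strong activation in the top-left corner
a = torch.zeros(1, 1, 4, 4)
a[0, 0, 0, 0] = 9.0

# The same activation shifted one pixel right, still inside the same 2x2 window
b = torch.zeros(1, 1, 4, 4)
b[0, 0, 0, 1] = 9.0

# Shifted two pixels right, which moves it into the neighbouring window
c = torch.zeros(1, 1, 4, 4)
c[0, 0, 0, 2] = 9.0

print(torch.equal(pool(a), pool(b)))  # True: the small shift is absorbed
print(torch.equal(pool(a), pool(c)))  # False: the larger shift changes the output
```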


### CNN Architecture
A typical CNN architecture is made up of several layers:
* Convolutional Layer: This layer slides a set of small learned filters over its input and produces one feature map per filter.
* Non-Linearity (ReLU): By convention, a non-linear activation function such as ReLU is applied immediately after each convolutional layer.
* Pooling or Sub Sampling: This layer is periodically inserted between successive convolutional layers to shrink the feature maps.
* Classification (Fully Connected Layer): The last stage in a CNN. It flattens the final feature maps into a single vector that can be fed into one or more fully connected layers for classification.

### Building CNNs in PyTorch
#### A simple CNN
First, let's define a simple CNN architecture. We will build a network that has:

* A convolutional layer with 3 input channels, 6 output channels, a kernel size of 3 (a 3x3 filter), and a stride of 1.
* A ReLU activation function.
* A max pooling operation with 2x2 window and stride 2.
* A second convolutional layer with 6 input channels, 16 output channels, a kernel size of 3, and a stride of 1.
* Another ReLU activation function.
* Another max pooling operation.
* A fully connected (linear) layer that maps the input units to 120 output units.
* Another fully connected layer that maps the previous 120 output units to 84 output units.
* A final fully connected layer that maps the 84 output units to 10 output classes.

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 6, 3)  # 3 input channels, 6 output channels, 3x3 kernel
        self.conv2 = nn.Conv2d(6, 16, 3)  # 6 input channels, 16 output channels, 3x3 kernel
        
        # Fully connected layers
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 16 channels x 6x6 feature map (for a 32x32 input)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

Here, we've defined two methods: __init__ and forward. The __init__ method initializes the various layers we need for our network. The forward method defines the forward pass of the network. The backward pass (computing gradients) is automatically defined for us using autograd.
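
The 16 * 6 * 6 input size of the first fully connected layer follows from how each layer changes the feature map size. The sketch below traces the shapes for a 32x32 input, using standalone layers with the same settings as the network above. It uses `torch.flatten(x, 1)`, which should be equivalent to the `view` call in `forward`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(3, 6, 3)
conv2 = nn.Conv2d(6, 16, 3)

shapes = []
x = torch.randn(1, 3, 32, 32)

x = F.relu(conv1(x))      # 3x3 kernel, no padding: 32 - 3 + 1 = 30
shapes.append(tuple(x.shape))
x = F.max_pool2d(x, 2)    # 2x2 pooling: 30 // 2 = 15
shapes.append(tuple(x.shape))
x = F.relu(conv2(x))      # 15 - 3 + 1 = 13
shapes.append(tuple(x.shape))
x = F.max_pool2d(x, 2)    # 13 // 2 = 6 (the last row and column are dropped)
shapes.append(tuple(x.shape))
x = torch.flatten(x, 1)   # 16 * 6 * 6 = 576 features per image
shapes.append(tuple(x.shape))

for s in shapes:
    print(s)
# (1, 6, 30, 30)
# (1, 6, 15, 15)
# (1, 16, 13, 13)
# (1, 16, 6, 6)
# (1, 576)
```

With a different input resolution, the flattened size changes and `fc1` would need a matching `in_features`.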

Let's do a basic test of our model. First, let's initialize our model:

In [6]:
net = Net()
print(net)

Net(
  (conv1): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


Now, let's create a random 3x32x32 input (remember, our network expects 3-channel images):

In [8]:
input = torch.randn(1, 3, 32, 32)
out = net(input)

Printing out gives the 10-dimensional output of the network:

In [9]:
print(out)

tensor([[-0.0600, -0.0380, -0.0109, -0.0721,  0.0605,  0.1440, -0.0294,  0.0416,
         -0.0323,  0.0159]], grad_fn=<AddmmBackward0>)


Let's try zeroing the gradient buffers of all parameters and backpropagating with random gradients:

In [11]:
net.zero_grad()
out.backward(torch.randn(1, 10))

This doesn't do much yet, but shortly we'll see how to backpropagate errors based on a loss function and update the model's weights. This simple implementation isn't intended to achieve meaningful results; it's just to give you a sense of how the model operates.

### Training a CNN, Overfitting and Regularization Techniques
#### Training a CNN
Training a CNN involves several steps:

1. Forward Propagation: In the forward pass, you pass the input data through the network and get the output.
2. Loss Computation: After obtaining the output, you calculate the loss function. The loss function shows how far the network output is from the true output. The aim is to minimize this loss.
3. Backward Propagation: In the backward pass, also known as backpropagation, the gradient of the loss is calculated with respect to the parameters (or weights) of the model.
4. Optimization: Finally, the weights of the network are updated in the opposite direction of the gradients computed during backpropagation, which reduces the loss.

Here is a simple example of how to train a CNN in PyTorch:

In [None]:
import torch.optim as optim

# create your network
net = Net()

# define a loss function
criterion = nn.CrossEntropyLoss()

# define an optimizer
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# for each epoch...
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
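    # for each batch of data...
    # (trainloader is assumed to be a torch.utils.data.DataLoader that yields (inputs, labels) batches)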
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

This will train your CNN on the dataset for 2 epochs. Note that this is a simple example for learning purposes; in real-world cases, you would likely need to train for more epochs and also split your dataset into training and validation sets to monitor and prevent overfitting.
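
As a rough sketch of the validation split mentioned above, the example below holds out part of a dataset and reports the validation loss after each epoch. Everything here is an assumption made for illustration: the data is random noise standing in for a real dataset, the model is a tiny placeholder rather than the Net class, and the 160/40 split is arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Synthetic stand-in for a real dataset: 200 random 3x32x32 "images", 10 classes
images = torch.randn(200, 3, 32, 32)
labels = torch.randint(0, 10, (200,))
dataset = TensorDataset(images, labels)

# Hold out 20% of the data for validation
train_set, val_set = random_split(dataset, [160, 40])
trainloader = DataLoader(train_set, batch_size=32, shuffle=True)
valloader = DataLoader(val_set, batch_size=32)

# Tiny placeholder model: conv (32 -> 30), pool (30 -> 15), linear classifier
model = nn.Sequential(
    nn.Conv2d(3, 6, 3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(6 * 15 * 15, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def evaluate(model, loader):
    """Average loss over a loader, without computing gradients."""
    model.eval()
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            total_loss += criterion(model(inputs), targets).item() * inputs.size(0)
            n += inputs.size(0)
    return total_loss / n

for epoch in range(2):
    model.train()
    for inputs, targets in trainloader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    val_loss = evaluate(model, valloader)
    print(f"epoch {epoch + 1}: validation loss {val_loss:.3f}")
```

On real data, a validation loss that keeps rising while the training loss keeps falling is the usual sign of overfitting.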

#### Overfitting and Regularization Techniques
Overfitting occurs when a neural network model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model.

Here are some common ways to prevent overfitting:

* Data Augmentation: This is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data. Data augmentation techniques such as cropping, padding, and horizontal flipping are commonly used to train large neural networks.
* Dropout: Dropout is a regularization method in which randomly selected units of a layer are zeroed out with some probability during training, so they contribute neither to the forward pass nor to the weight updates for that step. This has the effect of training a large ensemble of neural network models that share weights.
* Early Stopping: This involves stopping training at the point where performance on a held-out validation dataset starts to degrade.
* Weight Decay: Also known as L2 regularization, weight decay involves updating the loss function to penalize the model in proportion to the size of the model weights.
* Batch Normalization: Batch normalization makes training faster and more stable by normalizing the inputs to a layer over each mini-batch and then rescaling and shifting them with learned parameters. The noise from mini-batch statistics also has a mild regularizing effect.

Each of these techniques can help mitigate overfitting, and they are often used together for better performance.
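
Three of these techniques map directly onto PyTorch building blocks. The sketch below shows one way they might be combined; the class name `RegularizedNet`, the layer sizes, and the hyperparameter values are all assumptions chosen for illustration, not recommendations.

```python
import torch
import torch.nn as nn

class RegularizedNet(nn.Module):
    """Small illustrative CNN with batch normalization and dropout."""
    def __init__(self, num_classes=10, p_drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),  # padding=1 keeps 32x32
            nn.BatchNorm2d(16),              # batch normalization
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p_drop),              # dropout before the linear layer
            nn.Linear(16 * 16 * 16, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = RegularizedNet()

# Weight decay is passed to the optimizer rather than added to the loss by hand
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

x = torch.randn(4, 3, 32, 32)

model.train()   # dropout active, batch norm uses mini-batch statistics
out_a, out_b = model(x), model(x)
print(torch.equal(out_a, out_b))   # usually False: dropout masks differ between calls

model.eval()    # dropout disabled, batch norm uses running statistics
with torch.no_grad():
    out_c, out_d = model(x), model(x)
print(torch.equal(out_c, out_d))   # True
```

Because dropout and batch normalization behave differently in training and evaluation, remember to call `model.train()` before training and `model.eval()` before validating or testing.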

### Using Pretrained Networks (Transfer Learning)
Transfer learning is a machine learning method where a model developed for a task is reused as the starting point for a model on a second task. It's popular in deep learning because it allows us to train deep neural networks with comparatively little data. In other words, if a model was trained on a large and general enough dataset, this model will effectively serve as a generic model of the visual world. You can then take advantage of these learned feature maps without having to train a large model from scratch on a large dataset.

#### Using a Pretrained Network in PyTorch
PyTorch provides a number of pretrained networks through the torchvision library. These networks have been trained on the ImageNet dataset which includes over 1 million images from 1000 categories.

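Let's consider an example where we want to use the pretrained ResNet-18 model. This model can be loaded in PyTorch using torchvision.models.resnet18(pretrained=True). Note that torchvision 0.13 and later deprecate the pretrained argument in favor of weights, for example weights=models.ResNet18_Weights.DEFAULT.

Here is a simple example of using a pretrained model in PyTorch: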

In [None]:
import torchvision.models as models

# Load the pretrained model
net = models.resnet18(pretrained=True)

# If you want to fine-tune only the top layer of the model, set as below
for param in net.parameters():
    param.requires_grad = False

# Replace the top layer for finetuning.
net.fc = nn.Linear(net.fc.in_features, 10)  # 10 is an example.

# Forward pass.
outputs = net(images)
# In this example, images would be a tensor containing a batch of images (shape: batch, 3, height, width).
# The images should be normalized with the mean and standard deviation that the pretrained weights were trained with (the ImageNet statistics).

When run for the first time, this script downloads the pretrained ResNet-18 weights. It then sets requires_grad = False on every parameter to freeze it, so that no gradients are computed for those parameters in backward().

Finally, the script replaces the top layer of ResNet which was previously used to classify the images into one of the 1000 categories, with a new layer that classifies images into one of 10 categories (as an example).

Remember to replace 10 with the number of classes in your dataset.

In a subsequent training step, only the weights of the new final layer will be updated, while the other weights stay fixed. Setting requires_grad = False on a parameter means no gradient is computed for it, so the optimizer has nothing to update it with.

This is the essence of transfer learning. You take an existing model trained on a large dataset, replace the final layer(s) with some layers of your own, and then train the model on your own dataset.

### Advanced CNN Architectures: ResNet, DenseNet, etc.
#### ResNet (Residual Network)
ResNet, introduced in the paper "Deep Residual Learning for Image Recognition" by Kaiming He et al., was the winner of ILSVRC 2015. Its main innovation is the "identity shortcut connection" (also known as a "skip connection" or "residual connection"), which skips one or more layers. The model is mainly a stack of Conv2D, BatchNorm, and ReLU layers, with shortcuts that carry each block's input past those layers and add it to the block's output.

The advantage of a residual network is that it can effectively handle the "vanishing gradient problem" which tends to occur when training deep neural networks. By using skip connections, the gradient can be directly backpropagated to shallow layers.
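
A residual block can be sketched in a few lines. The version below is a simplified illustration, not torchvision's implementation: the class name `ResidualBlock` is hypothetical, and the block assumes the input and output have the same number of channels and the same spatial size, so the shortcut is a plain identity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: add the block's input to the output of the conv layers
        return F.relu(out + x)

block = ResidualBlock(8)
x = torch.randn(2, 8, 16, 16)
y = block(x)
print(y.shape)  # torch.Size([2, 8, 16, 16])

# If the residual branch outputs zero, the block reduces to relu(x):
# the input passes straight through the shortcut
nn.init.zeros_(block.bn2.weight)
passthrough = torch.equal(block(x), F.relu(x))
print(passthrough)  # True
```

The addition `out + x` is also what gives gradients a direct path back to earlier layers during backpropagation.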

In PyTorch, you can use a pretrained ResNet model in the following way:

In [None]:
import torchvision.models as models

resnet = models.resnet50(pretrained=True)

#### DenseNet (Densely Connected Network)
DenseNet, introduced in the paper "Densely Connected Convolutional Networks" by Gao Huang et al., takes the shortcut idea behind ResNet further. Within a dense block, each layer is connected to every other layer in a feed-forward fashion.

Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer, DenseNet has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are used as inputs into all subsequent layers.
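
The dense connectivity pattern can be sketched as a block in which every layer receives the concatenation of all earlier feature maps. This is a simplified illustration, not torchvision's implementation: the class name `DenseBlock` and the sizes are assumptions, and details such as bottleneck layers are left out.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Simplified dense block: each layer sees all preceding feature maps."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i receives the block input plus i earlier outputs
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(),
                nn.Conv2d(channels, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Concatenate everything produced so far along the channel dimension
            new_features = layer(torch.cat(features, dim=1))
            features.append(new_features)
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)
x = torch.randn(2, 16, 8, 8)
y = block(x)
print(y.shape)  # torch.Size([2, 64, 8, 8]), since 16 + 4 * 12 = 64

# Layer i has i incoming connections, so L layers have 1 + 2 + ... + L = L(L+1)/2
L = 4
connections = L * (L + 1) // 2
print(connections)  # 10
```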

Here is an example of how to use a pretrained DenseNet model in PyTorch:

In [None]:
import torchvision.models as models

densenet = models.densenet161(pretrained=True)

These models have a large number of layers and parameters, which makes them expensive to train from scratch. That's why they are typically used with transfer learning. They have seen wide use in various tasks beyond just image classification, including object detection and segmentation.