# 1. CNN Theory

### Difference between traditional feature extraction and CNN

**Tabular data**: each row represents one training example, and the features are organized as columns. For example, the iris dataset has 4 features.

Here, however, we want to work with an image dataset.

#### 1) Traditional Feature Extraction

In the **traditional feature extraction** approach, we **manually extract the features from the raw images**.
- Take an `iris image`: the features are the `Petal` and `Sepal` dimensions.
- We extract these measurements by holding a ruler to the flower and writing the numbers down as a feature vector.
- This is the manual feature extraction step, where we or someone else takes the measurements.

<img src="https://drive.google.com/uc?id=1Q5-swMsxabZJlxfAfrBEyRLD45QWqie0" width="700">

#### 2) Modern Deep Learning Architectures (CNN) for Feature Extraction
- With a CNN we don't have to worry about manual feature extraction.
- Modern deep learning architectures such as CNNs perform the feature extraction implicitly, that is, automatically for us.
- This allows us to feed image data to the neural network directly, instead of first converting it to tabular data.

#### Popular Examples of CNNs
1. Optical character recognition (OCR) software.
2. Identifying different birds in your garden: we can implement a bird classifier using a CNN.

### Features of CNNs
- One of the **key features of convolutional neural networks** is the **convolutional layer**.
- **Convolutional layers** are feature extractors.

<img src="https://drive.google.com/uc?id=1npFjl79qS0hUaxTEZJdp1alItcxuhDLT" width="700">

- On the far left is the input image. Typically that's an RGB image; RGB stands for Red, Green, and Blue.
- We can think of an RGB image as a stack of matrices. Each matrix has a height and a width dimension, and here we have 3 stacked matrices: the red matrix, the green matrix, and the blue matrix.
- Each block in the figure refers to a convolutional layer, and there are 5 convolutional layers. The convolutional layers are followed by fully connected layers. In the image we have 3 fully connected layers: 2 hidden layers and 1 output layer.
- We can think of the fully connected layers as an MLP. A typical convolutional network architecture consists of a convolutional backbone and a multilayer perceptron unit.
- To draw an analogy with the previous method (manual feature extraction), see the figure below.


<img src="https://drive.google.com/uc?id=1kPekc_xu1HOh-oPRYbDLor8mm8Y_jT4Q" width="700">

- Previously, in manual feature extraction, we **assumed that someone had already extracted the features in tabular format**.
- Here, the input image goes through several convolutional layers to produce a tabular-like representation, which then feeds into the MLP part.
- So the new element is the convolutional part, which consists of multiple convolutional layers.

### Next: why exactly do we need convolutional layers?


# 2. Image Data and Its Challenges

### Question: can we use an MLP on image data?
**Ans:** Yes.

- Take a grayscale image of 14 X 14 pixels. To use it in an MLP, we can concatenate its rows into a feature vector.
- After concatenation we have a 196-dimensional row vector.
- This row vector represents a single training example; assume the corresponding label is 2.
- We can then do the same for the 2nd training example.
- Here we also have an image of the digit 6 (a 14 X 14 grayscale image) and likewise one of the digit 8.
- Concatenating the rows of each image results in a 196-dimensional feature vector for each of the three images.
- So we can think of this as a tabular dataset consisting of 3 rows and 196 columns.

<img src="https://drive.google.com/uc?id=156pNfiX3xO2un70gtDNFv5bVBtQiWc43" width="700">

- Here each pixel position represents a feature column in the dataset.
- In an image, each pixel value is an integer in the range `0 to 255`.
- 🟥 MLPs don't work very well for more complicated image classification problems.
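
The flattening step described above can be sketched in a few lines. This is only an illustration: the three "images" are random stand-ins (an assumption, not real digits), and the variable names are made up for this example.

```python
import torch

# Three stand-in 14 x 14 grayscale "images" with integer pixel values in [0, 255].
# Random values are used only for illustration; they are not real digit images.
images = torch.randint(0, 256, (3, 14, 14))

# Concatenate the rows of each image into one 196-dimensional row vector.
features = images.reshape(3, -1)

print(features.shape)  # torch.Size([3, 196])
```

Each row of `features` is one training example, and each of the 196 columns corresponds to one pixel position.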

### Drawbacks of MLP

- It relies heavily on certain pixel positions.
- See the image below: an MLP may fail to recognize a digit if it appears in a slightly different position.
- Consider an image of a 1 that is slightly off-center. The MLP looks in the wrong place to classify it: where it looks, there are no pixel values corresponding to the digit 1.
- So in this case it will not predict the digit 1.
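
A small sketch of why position matters for an MLP, under simple assumptions: the "digit" is just a vertical stroke drawn by hand, and it is shifted 3 columns to the right with `torch.roll`. After flattening, the shifted stroke activates a completely different set of input features.

```python
import torch

# A crude stand-in for the digit 1: a vertical stroke in column 6 (assumed shape).
img = torch.zeros(14, 14)
img[2:12, 6] = 1.0

# The same stroke, moved 3 columns to the right.
shifted = torch.roll(img, shifts=3, dims=1)

# Indices of the non-zero features after flattening each image to 196 values.
original_idx = set(img.flatten().nonzero().flatten().tolist())
shifted_idx = set(shifted.flatten().nonzero().flatten().tolist())
overlap = original_idx & shifted_idx

print(len(original_idx), len(shifted_idx), len(overlap))  # 10 10 0
```

The two flattened vectors share no active feature, so weights an MLP learned for the centered stroke see only zeros when the stroke is shifted.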


<img src="https://drive.google.com/uc?id=1vL1LJnn-1iMuAAAJ2VPQF1Ivh9DkfQ4i" width="700">

- Another drawback of MLPs is that features are considered independently.
- Each pixel is treated as an independent feature, with no notion of which pixels are neighbors.

### CNN

- CNNs, on the other hand, take locality into account.
- For a given pixel, a CNN also looks at the surrounding area, so it works with local regions of the image, not just individual pixels.

<img src="https://drive.google.com/uc?id=18xRt5zfoRkVRAKDpt66YTD0Zh8_hfg44" width="700">


# CNN Architecture
- Another name for CNNs: ConvNets.
- In contrast to MLPs, CNNs take locality into account.
- This helps with challenging problems on image data.
- What we want is a certain level of invariance: an image should still be classified correctly even if it is a little off-center, scaled up or down, or slightly rotated. Ideally the network should also learn to ignore parts of the image that are not relevant to the classification task.

<img src="https://drive.google.com/uc?id=1UKQ39Ndg909X0EPFX-ueTE5_mLYqug_A" width="700">

**A typical CNN architecture looks as follows**
- The convolutional layers are at the center, and they are responsible for implicit feature extraction. This is why deep learning is sometimes referred to as representation learning: modern deep neural networks learn their own representations.
- The convolutional layers are followed by the fully connected layers (MLP), which are responsible for the classification task.
<img src="https://drive.google.com/uc?id=1geRU09AhgEXFyBcBvlaeAoWU1KAE_XN2" width="700">

**Different way of drawing CNN architecture**

<img src="https://drive.google.com/uc?id=1-oDuiiUiw0hGpSOiFdN3-lH4kZJpL7L3" width="700">

- It also has convolutional layers, followed by the fully connected layers (MLP) on the right side.
- The highlighted parts are the learnable (trainable) layers of the neural network architecture.
- The convolutional layers and fully connected layers have weight parameters and bias units that we can train.
- The parts in between are pooling layers, and they don't have learnable parameters.
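
To make the point about learnable parameters concrete, here is a minimal sketch that counts them per layer type. The layer sizes are arbitrary choices for illustration, and `count_params` is a hypothetical helper, not part of PyTorch.

```python
import torch.nn as nn

def count_params(layer):
    """Total number of learnable values (weights and biases) in a layer."""
    return sum(p.numel() for p in layer.parameters())

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
fc = nn.Linear(128, 10)

print(count_params(conv))  # 224  (8*3*3*3 weights + 8 biases)
print(count_params(pool))  # 0    (nothing to train)
print(count_params(fc))    # 1290 (128*10 weights + 10 biases)
```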


## PyTorch CNN Architecture

- It consists of two blocks: the convolutional layers and the fully connected layers.
- The convolutional block contains the conv layers and pooling layers.
- The fully connected block contains linear layers, which take a vector as input, so we need to flatten (convert to a vector) the output of the convolutional block.


In [None]:
# A typical CNN architecture in PyTorch
"""
## PyTorch CNN Structure
# 1st block is the convolutional block: consists of conv layers and pooling layers
# 2nd block is the fully connected block, an MLP (classifier part), similar to the MLP from previous methods.
#
"""
import torch
import torch.nn as nn

class MyPytorchCNN(torch.nn.Module):
  def __init__(self, num_of_classes):
    super().__init__()

    ## Convolutional block (feature extractor):
    # contains conv layers and pooling layers.
    # The `...` arguments are placeholders and must be filled in before this runs.
    self.conv_layers = torch.nn.Sequential(
        nn.Conv2d(...),
        nn.MaxPool2d(...),
        nn.Conv2d(...),
        nn.MaxPool2d(...),
    )

    # Fully connected block (classifier), similar to an MLP
    self.fc_layers = nn.Sequential(
        nn.Linear(24 * 16 * 16, 256),
        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, num_of_classes),
    )

  def forward(self, x):
    features = self.conv_layers(x)
    features = torch.flatten(features, start_dim=1) # flatten the conv output (keeping the batch dimension) to feed it to the fully connected block
    logits = self.fc_layers(features) # raw, unnormalized class scores from the fully connected block
    return logits
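
The skeleton above leaves the conv and pooling arguments open. One possible way to fill them in is sketched below. Everything specific here is an assumption made for illustration: the class name `SmallCNN`, a 3 x 64 x 64 input, 12 and 24 output channels, and `padding=1`. These values are chosen only so that the flattened output has `24 * 16 * 16` values, matching the first linear layer.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_classes):
        super().__init__()
        # Assumed input: 3 x 64 x 64. Shapes in comments are (channels x height x width).
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 12, kernel_size=3, padding=1),   # -> 12 x 64 x 64
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 12 x 32 x 32
            nn.Conv2d(12, 24, kernel_size=3, padding=1),  # -> 24 x 32 x 32
            nn.MaxPool2d(kernel_size=2, stride=2),        # -> 24 x 16 x 16
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(24 * 16 * 16, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        features = self.conv_layers(x)
        features = torch.flatten(features, start_dim=1)  # keep the batch dimension
        return self.fc_layers(features)

model = SmallCNN(num_classes=10)
batch = torch.rand(4, 3, 64, 64)  # a batch of 4 random stand-in images
logits = model(batch)
print(logits.shape)  # torch.Size([4, 10])
```

With a different input size or different conv settings, the first argument of the first `nn.Linear` has to change accordingly.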

# Convolutional Layer
- When we apply a convolution to an input image, we create a so-called **feature map**.

<img src="https://drive.google.com/uc?id=1yahrrKKw7lY_tJCSVfS7otCR02N4btvK" width="700">

### Q. What is a feature map, and how does this process work?

<img src="https://drive.google.com/uc?id=1yOny1up064ELEFPVSHLrpXbVPv0oOhZZ" width="500">

- Here we have a grayscale image of the handwritten digit 5.
- We are going to use a **3 X 3 feature detector (kernel, filter)**.
- We apply this feature detector to the input image.
- Position by position, we slide the feature detector over the image to compute the feature map values.
- Once we complete a row, we move down by 1 position and do the same thing.
- We repeat this until we have computed all the values of the feature map.

<img src="https://drive.google.com/uc?id=1jzjfQ8rFJOMKAvGbfrW_64qruiOoiooM" width="500">

**What happens during the convolution operation?**
- The convolution operation is very similar to a dot product. We have weights, input feature values, and a bias unit.

$$ z = b + \sum_jw_jx_j $$

- In this case, a 3 X 3 feature detector (filter or kernel) means we have a 3 X 3 weight matrix. If we flatten this weight matrix and the 3 X 3 patch of input values under it, the operation is the same as a dot product between two vectors, followed by adding the bias unit.
- Notice: as we slide over the image, **the input feature values differ** at each position, but we keep reusing the same weight matrix.
- So the weights (w's) do not change.
- Since we use the same weights at different positions, this is called **weight sharing**: sharing the same set of weights across different positions of the image.
- The rationale behind weight sharing is that a feature detector that works well in one region may also work well in another region.
- It also reduces complexity by reducing the number of parameters compared to an MLP.
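
The sliding dot product described above can be written out by hand and compared against PyTorch. This is a minimal sketch: the 5 x 5 input, the random values, and the bias of 0.5 are arbitrary assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.rand(5, 5)   # stand-in grayscale input
kernel = torch.rand(3, 3)  # one shared 3 x 3 weight matrix
bias = torch.tensor(0.5)

# Slide the kernel over every 3 x 3 patch: z = b + sum_j w_j * x_j
out = torch.zeros(3, 3)
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]
        out[i, j] = bias + torch.dot(patch.flatten(), kernel.flatten())

# PyTorch expects (batch, channels, height, width), so add two leading dimensions.
reference = F.conv2d(image[None, None], kernel[None, None], bias=bias[None])[0, 0]
print(torch.allclose(out, reference, atol=1e-6))  # True
```

The same 9 weights and 1 bias are reused at all 9 positions. A fully connected layer mapping the 25 input pixels to 9 outputs would instead need 25 * 9 + 9 = 234 parameters.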






**1 input** channel, **1 output** channel

In [None]:
"""
Instantiating the convolutional layer with 1 input channel and 1 output channel
Key here is that we are using kernel size of 3
"""
# kernel size refers to the filter/feature detector size
layer = torch.nn.Conv2d(1, 1, kernel_size=3)

In [None]:
layer.weight # 3 X 3 weight matrix / filter; we can choose other kernel sizes

Parameter containing:
tensor([[[[ 0.1011, -0.2301, -0.0815],
          [ 0.2813, -0.2999,  0.1810],
          [ 0.0132, -0.0891,  0.2094]]]], requires_grad=True)

In [None]:
layer.bias # bias unit

Parameter containing:
tensor([-0.1211], requires_grad=True)

So far we have looked at 1 input channel and 1 output channel. Now let's look at convolution with multiple channels.

## Multiple channels

**1 single input** channel and **multiple output** channels

Input: 1@12X12; output: 3@10X10. The `@` notation gives the number of channels, so `3@10X10` means 3 channels of 10 X 10 each.

<img src="https://drive.google.com/uc?id=1_S7eitCFM6cGhUXKq0TotdW1z21HBeLz" width="500">

For the 1st output channel:
$$ z^{(1)} = b^{(1)} + \sum_j w^{(1)}_j x_j $$


For the 2nd output channel:
$$ z^{(2)} = b^{(2)} + \sum_j w^{(2)}_j x_j $$


For the 3rd output channel:
$$ z^{(3)} = b^{(3)} + \sum_j w^{(3)}_j x_j $$

We use the same convolution operation for all 3 output channels.

The only difference is that each output channel uses a different set of weights (indicated by the superscript); in other words, we use a different feature detector for each output channel. The convolution operation itself is exactly the same, and the inputs $x_j$ are shared.
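
The 1@12X12 to 3@10X10 case can be checked directly in PyTorch. This is a small sketch with a random stand-in input; the leading `1` in the input shape is the batch dimension.

```python
import torch

# 1 input channel, 3 output channels, 3 x 3 kernels
layer = torch.nn.Conv2d(in_channels=1, out_channels=3, kernel_size=3)

x = torch.rand(1, 1, 12, 12)  # (batch, channels, height, width)
out = layer(x)

print(layer.weight.shape)  # torch.Size([3, 1, 3, 3]) -> 3 separate 3 x 3 feature detectors
print(layer.bias.shape)    # torch.Size([3])          -> one bias per output channel
print(out.shape)           # torch.Size([1, 3, 10, 10])
```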

Now let's look at working with multiple input channels.

**Multiple input** channels and **multiple output** channels
- Consider a color input image instead of a grayscale one. A color image has 3 channels: red, green, and blue (RGB).
- This means our input image now has 3 channels.
- For simplicity, we only look at 1 output channel.

📌 If our input has 3 channels, the kernel/filter will also have 3 channels.
- So our kernel now consists of 3 matrices.

<img src="https://drive.google.com/uc?id=1hwQQPQ2ENvLwBVEliGaH6hNCyJ4-mWZg" width="500">

- So, how do we use the 3 channels in the kernel?
- 1st, compute 1 feature map value for each of the 3 channels.
- 2nd, sum these three values; the sum is the value in the output channel.
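
The two steps above (convolve each channel separately, then sum) can be verified against a single multi-channel convolution. This is a sketch with random stand-in data; the 6 x 6 input size is an arbitrary assumption, and the bias is omitted to keep the comparison simple.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(1, 3, 6, 6)  # one RGB-like input: (batch, channels, height, width)
w = torch.rand(1, 3, 3, 3)  # one kernel made of 3 matrices (one per input channel)

# One call: convolution over all 3 input channels at once.
full = F.conv2d(x, w)

# Step 1: convolve each input channel with its own kernel matrix.
per_channel = [F.conv2d(x[:, c:c + 1], w[:, c:c + 1]) for c in range(3)]
# Step 2: sum the three per-channel feature maps.
summed = torch.stack(per_channel).sum(dim=0)

print(full.shape)                               # torch.Size([1, 1, 4, 4])
print(torch.allclose(full, summed, atol=1e-5))  # True
```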


<img src="https://drive.google.com/uc?id=1S4h-v_0Sa-qu1rREmaep6SFY5YXmXsu6" width="500">


In [None]:
import torch

In [None]:
"""
When defining the layer we can specify the in channels and out channels.
Here we have defined in channels as 3 and out channels as 5.
The kernel is 2 X 2.
"""
layer = torch.nn.Conv2d(in_channels=3, out_channels=5, kernel_size=2)
layer.weight.shape # 5 filters, each of shape 3 X 2 X 2 (one 2 X 2 matrix per input channel)

torch.Size([5, 3, 2, 2])

## Pooling Layer
- The CNN layers learn representations.
- Through the convolutional part, the **height and width** of the feature maps typically **decrease** while the **number of channels increases**.
- We **use convolutional layers** to **increase the number of channels**,
- and we **use pooling layers** to **decrease the height and width**.

### Variants of Pooling
1. Max pooling
2. Average pooling

## 1. Max Pooling

<img src="https://drive.google.com/uc?id=1XMZlpbVWhqI1G6OQtKSRqUIAHjrnbHtV" width="500">

- It's one of the variants of pooling.
- For the given input image, we use a **2 X 2** kernel.
- It **takes the maximum value inside the kernel window, and we keep sliding the max pooling kernel over the input** to create the output feature map.
- To slide this kernel we use a **stride of 2**, meaning we move the kernel 2 pixels at a time, so the windows do not overlap.
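
A tiny worked example of max pooling with a 2 X 2 kernel and stride 2. The 4 x 4 input values are made up for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 1., 2.],
                  [7., 2., 9., 0.],
                  [1., 8., 3., 4.]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Add batch and channel dimensions for the layer, then strip them again.
out = pool(x[None, None])[0, 0]
print(out)
# tensor([[6., 4.],
#         [8., 9.]])
```

Each output value is the largest entry of one non-overlapping 2 x 2 window, so the 4 x 4 input shrinks to 2 x 2.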

## 2. Average Pooling
- Similar to max pooling, but we take the average of the values inside the kernel window instead of the max value.
- Max pooling picks only one value, whereas average pooling takes the average of all the values.
- Average pooling therefore retains information about all 4 pixel values in a 2 X 2 window.
- Which one works better is another hyperparameter to tune.
- Often, max pooling tends to work better in practice.
- Pooling layers don't have any learnable parameters, so there are no weights or bias units to learn.
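
For comparison, here is average pooling on a small made-up 4 x 4 input, again with a 2 X 2 kernel and stride 2. The sketch also confirms that the pooling layer has nothing to train.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 1., 2.],
                  [7., 2., 9., 0.],
                  [1., 8., 3., 4.]])

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

avg_out = avg_pool(x[None, None])[0, 0]
print(avg_out)
# tensor([[3.7500, 2.2500],
#         [4.5000, 4.0000]])

# Pooling layers have no weights or biases.
n_params = sum(p.numel() for p in avg_pool.parameters())
print(n_params)  # 0
```

Every output value depends on all 4 entries of its window, for example (1 + 3 + 5 + 6) / 4 = 3.75 for the top-left window.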

**Why Pooling layer?**

- It helps with local invariance: it can make the network a little bit invariant to the exact position of a pixel in the image.
- The downside is that some information is lost. To avoid this, some architectures use only convolutional layers and skip the pooling layers.
- We can also use a conv layer to decrease the height and width by increasing its stride.

In [1]:
import torch
import torch.nn as nn

# Conv net with pooling layer
layers_with_pooling = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(8, 16, kernel_size=3),
    nn.MaxPool2d(kernel_size=2, stride=2)
)

In [3]:
example = torch.rand(3, 110, 110)
layers_with_pooling(example).shape # with pooling we achieve a large reduction in height and width

torch.Size([16, 26, 26])

In [4]:
# Conv net without pooling layer.
layers_without_pooling = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    # nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(8, 16, kernel_size=3),
    # nn.MaxPool2d(kernel_size=2, stride=2)
)

In [5]:
example = torch.rand(3, 110, 110)
layers_without_pooling(example).shape # without pooling we don't achieve much reduction in height and width

torch.Size([16, 106, 106])

In [6]:
# Conv net without pooling layers,
# but with stride=2 added to the conv layers
# we get the same reduction in width and height.
layers_without_pooling1 = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2),
    # nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(8, 16, kernel_size=3, stride=2),
    # nn.MaxPool2d(kernel_size=2, stride=2)
)

example = torch.rand(3, 110, 110)
layers_without_pooling1(example).shape # with stride=2 we get the same height and width reduction as with pooling

torch.Size([16, 26, 26])

## Padding: controlling the output size of the feature map

- It allows us to control the output size more precisely.

**Formula for the output width of a feature map**
- The same formula works for the height.
$$ O = \frac{W-K+2P}{S}+1$$

$$ O = \text{output width} \\ W = \text{input width} \\ K = \text{kernel width} \\ P = \text{padding} \\ S = \text{stride} $$

- **Floor** operation: if the formula gives a non-integer result, PyTorch rounds it down to the nearest integer, so the actual output width is $\lfloor O \rfloor$.

- Padding is one of the parameters that influence the output width.
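
The formula, including the floor step, fits in a one-line helper. `conv_output_size` is a hypothetical name used only for this sketch; integer floor division (`//`) does the rounding down.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output width for input width w, kernel width k, padding p, stride s."""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(100, 3))            # 98
print(conv_output_size(100, 5, p=0, s=2))  # 48  (48.5 rounded down)
print(conv_output_size(110, 3, p=0, s=2))  # 54
```

Because adding 1 never changes the fractional part, flooring the division first gives the same result as flooring the whole expression.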

**Example 1**


In [7]:
"""
input width = 100
kernel width = 3
padding = 0
stride = 1

With these parameters, an input of width 100
gives an output feature map that is 98 pixels wide.
"""
op_width = (100-3+0)/1 + 1
op_width # 98 pixels

98.0

In [12]:
# Validate it with the PyTorch Conv layer
layer = torch.nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)

temp_ip = torch.rand(1,100,100)

op = layer(temp_ip)
op.shape # as we can see, the output is 98 X 98 pixels

torch.Size([1, 98, 98])

**Example 2**

In [15]:
"""
Here we change the kernel width from 3 to 5 and increase the stride from 1 to 2.
Plugging these values into the formula gives 48.5,
which is a non-integer result. Let's compare it with PyTorch.
"""

op_width = (100-5+0)/2 + 1
op_width

48.5

In [16]:
# PyTorch gives 48, whereas the formula gives 48.5:
# PyTorch rounds 48.5 down to 48 (floor).
layer = torch.nn.Conv2d(1,1, kernel_size=5, stride=2, padding=0)
layer(temp_ip).shape

torch.Size([1, 48, 48])

## How padding works


<img src="https://drive.google.com/uc?id=1UOGmujUY_KjO1iW-ZBaBbSJzobZifWjA" width="500">

- `padding = 1` means adding a row of 0's to the top and bottom and a column of 0's to the left and right of the input image.


- Assume a 3 X 3 kernel (`kernel_size=3`). Sliding it across the unpadded input gives only 3 positions, so the output is 3 pixels wide.
<img src="https://drive.google.com/uc?id=1yky4oHkHw3xEQSxfNentMlC6kwfhRjv_" width="500">

- Doing the same thing with padding, we now have 5 positions, so the feature map has exactly the same width as the original input.
<img src="https://drive.google.com/uc?id=1_H9XjFSgVq3HWdu9bE7Kr7xoBI6JKtl_" width="500">

> So that's how we can control the output size using the padding parameter.

> Without doing the math, we can pass **padding="same"** to **produce an output feature map of the same size as the input**.
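
A quick check of the three padding settings on a small input. The 5 x 5 input with a 3 x 3 kernel is an assumption chosen to mirror the sliding example above; the values are random.

```python
import torch

x = torch.rand(1, 1, 5, 5)  # (batch, channels, height, width)

no_pad = torch.nn.Conv2d(1, 1, kernel_size=3, padding=0)
pad_one = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
pad_same = torch.nn.Conv2d(1, 1, kernel_size=3, padding="same")

print(no_pad(x).shape)    # torch.Size([1, 1, 3, 3])
print(pad_one(x).shape)   # torch.Size([1, 1, 5, 5])
print(pad_same(x).shape)  # torch.Size([1, 1, 5, 5])
```

Note that in PyTorch `padding="same"` only works with `stride=1`; with a larger stride the layer raises an error, so the padding has to be computed from the formula instead.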
