![](pics/header.png)

# Deep Learning: Convolutional Neural Network (CNN)

Kevin Walchko

---

These notes come from Udacity's Deep Learning Nanodegree.

## Convolutional Neural Network (CNN)

![](pics/pnas.webp)

CNNs can be trained to find features within a region of an image and understand the combination of features to identify an object.

A common simplified picture of what the layers do:

- The first layer finds edges or other primitive high-frequency features
- The second layer finds groups of edges built from those primitive edges
- The third layer finds object-identifying features from those groups
- etc ...

*Note:* It is not this simple, but it is a good starting point for understanding.

| Multi-Layer Perceptron (MLP) | Convolutional Neural Network (CNN) |
|------------------------------|------------------------------------|
| Operates only on **vector** inputs, so images are flattened first | Operates on 2D (matrix) data |
| Vector input, no understanding of the 2D relationship between pixels | Matrix input, can harness the spatial relationship between pixels |
| Only fully connected layers, larger parameter set to train, longer training time | Sparsely connected layers, reduced parameter set to train, faster training time |


- **Highpass Filters:**
    - sharpen an image
    - enhance high-frequency parts of an image, like an edge
    - the elements of a convolutional edge detection kernel must sum to zero because the kernel measures the difference (change) between neighboring pixels, so a region of constant intensity produces no response
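
To make the sum-to-zero property concrete, here is a minimal sketch using a Sobel-style kernel. The kernel choice and the tiny 5x5 test images are assumptions for illustration, not something from the course material. Note that `F.conv2d` computes a cross-correlation (the kernel is not flipped), which is the usual convention in deep learning libraries.

```python
import torch
import torch.nn.functional as F

# Sobel-style kernel: responds to left-to-right intensity changes (vertical edges)
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
kernel_sum = sobel_x.sum().item()  # 0.0, as expected for an edge detector

# conv2d expects inputs shaped (batch, channels, height, width) and
# weights shaped (out_channels, in_channels, height, width)
weight = sobel_x.view(1, 1, 3, 3)

# constant image: no edges anywhere
flat = torch.full((1, 1, 5, 5), 7.0)

# step image: dark on the left (columns 0-2), bright on the right (columns 3-4)
step = torch.zeros(1, 1, 5, 5)
step[..., 3:] = 1.0

flat_response = F.conv2d(flat, weight)  # all zeros
step_response = F.conv2d(step, weight)  # each row is [0, 4, 4]
```

Because the kernel sums to zero, the constant image gives no response at all, while the step image responds only where the window straddles the edge.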
    
## CNN Layer

The CNN layer can act on each channel of an image or just apply different convolutional kernels to the same grayscale image.

<table> 
    <tr>
        <table>
            <tr>
                <td><img src="pics/conv-layer-1.png"></td>
                <td><img src="pics/conv-layer-2.png"></td>
            </tr>
        </table>
    </tr>
    <tr>
        <td><img src="pics/conv-layer-3.png"></td>
    </tr>
</table>

### Convolutional Layer

```python
import numpy as np
import torch
import torch.nn as nn

# create a numpy array of filter values
filter_vals = np.array([[-1, -1, 1, 1], [-1, -1, 1, 1], [-1, -1, 1, 1], [-1, -1, 1, 1]])

# define four filters from the values above
filter_1 = filter_vals
filter_2 = -filter_1
filter_3 = filter_1.T
filter_4 = -filter_3
filters = np.array([filter_1, filter_2, filter_3, filter_4])

k_channels, k_height, k_width = filters.shape
# Conv2d weights must be a float tensor shaped
# (out_channels, in_channels, height, width), so add the in_channels axis
weight = torch.from_numpy(filters).unsqueeze(1).float()

# defines the convolutional layer: 1 grayscale input channel, 4 filters
# torch.nn.Conv2d(in_channels, out_channels, kernel_size, 
#                 stride=1, padding=0, dilation=1, groups=1, bias=True)
# (inside an nn.Module, assign this to self.conv in __init__)
conv = nn.Conv2d(1, k_channels, kernel_size=(k_height, k_width), bias=False)
conv.weight = torch.nn.Parameter(weight)
```

The filters look like:

![](pics/cnn-filters.png)

Convolutional layers typically increase in depth (the number of feature detectors) as they go along the pipeline. With suitable padding, the width and height of the feature map remain the same *until* a max pooling layer is encountered to reduce the feature map size. Without padding, each convolution also trims a few pixels from the borders.

Definitions:

- **Channels:** each color plane **OR** feature map produced by a filter
    - an RGB image has 3 channels in. In a standard `nn.Conv2d`, each filter spans all input channels and produces one output channel, so the number of output channels equals the number of filters
    - applying 2 separate filters to each input channel, for a total of 3 colors x 2 filters = 6 output channels, corresponds to a grouped convolution (`groups=3`)
- **Filter:** convolutional filter that detects an image feature
- **Stride:** how many pixels you shift the filter over before applying it again; the default is usually 1 pixel
- **Padding:** adding extra numbers (column and/or row values, typically set to 0) so the filter has numbers to work with on the edges of the image
    - for a kernel of size 7x7 and a stride of 1, keeping the output array the same size as the input array requires a padding of 3. This is because the center pixel needs 3 more pixels on every side to do the convolution calculation with
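
The definitions above can be checked by looking at tensor shapes. This is a minimal sketch; the 32x32 input size and the choice of 8 filters are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # assumed input: a batch of one 32x32 RGB image

# 7x7 kernel with padding 3 keeps the width and height unchanged
same = nn.Conv2d(3, 8, kernel_size=7, padding=3)
# no padding: the output shrinks by kernel_size - 1 = 6 pixels
valid = nn.Conv2d(3, 8, kernel_size=7)
# stride 2 roughly halves the width and height
strided = nn.Conv2d(3, 8, kernel_size=7, padding=3, stride=2)

same_shape = tuple(same(x).shape)        # (1, 8, 32, 32)
valid_shape = tuple(valid(x).shape)      # (1, 8, 26, 26)
strided_shape = tuple(strided(x).shape)  # (1, 8, 16, 16)

# standard convolution: each of the 2 filters spans all 3 input channels
standard = nn.Conv2d(3, 2, kernel_size=3)
standard_weight_shape = tuple(standard.weight.shape)    # (2, 3, 3, 3)

# grouped convolution: 2 filters per color channel, 3 x 2 = 6 output channels
per_color = nn.Conv2d(3, 6, kernel_size=3, groups=3)
per_color_weight_shape = tuple(per_color.weight.shape)  # (6, 1, 3, 3)
```

The weight shapes show the difference between the two channel interpretations: a standard filter sees every input channel at once, while with `groups=3` each filter sees only one color plane.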

## Pooling Layers

When applying many convolutional filters to an image, you need a way to reduce the dimensionality of the resulting feature maps, or later layers will have <u>too many</u> values to process and parameters to train. A pooling layer collapses each feature map down and reduces its dimensionality.

![](pics/maxpooling.png)

```python
# nn.MaxPool2d(window_size, stride)
#
# stride does not impact depth
#
# stride is the factor the input feature map is decimated by
# 2 -> 1/2 or 200x200 -> 100x100
# 4 -> 1/4 or 200x200 -> 50x50
# etc
self.pool = nn.MaxPool2d(2, 2) # reduces feature map by 1/2
```

- **Max pooling layer:**
    - defined by a window size and a stride
    - slides the window over each feature map from a set of *n* convolution filters independently and returns the max value inside each window, so the depth is unchanged
    - Conv(4filters x W x H) -> Pool(window(2 x 2),stride(2)) -> (4filters x W/2 x H/2)
- Pooling layers throw away information and can lose spatial understanding
- Pooling layers are less popular today and can produce unexpected results
    - A network can still identify an image as a face even when extra eyes have been photoshopped in, because pooling loses the spatial relationship between where the eyes are and how many eyes a face has
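
As a small worked example of the max pooling behavior described above, the sketch below pools a hand-written 4x4 feature map (the values are made up for illustration) and then checks the shape change on a stack of 4 feature maps.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, 2)  # 2x2 window, stride 2

# one 4x4 feature map, shaped (batch, channels, height, width)
fmap = torch.tensor([[1., 2., 5., 6.],
                     [3., 4., 7., 8.],
                     [9., 8., 1., 0.],
                     [7., 6., 3., 2.]]).view(1, 1, 4, 4)

# each non-overlapping 2x2 window is replaced by its largest value
pooled = pool(fmap)  # [[4, 8], [9, 3]]

# pooling acts on every feature map separately, so depth stays at 4
stack = torch.randn(1, 4, 200, 200)
stack_shape = tuple(pool(stack).shape)  # (1, 4, 100, 100)
```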

## Build a CNN Network

Determining parameters and sizes of feature maps in a CNN:

- K: out_channels
- F: kernel_size
- D_in: depth of previous layer; for the first layer this is typically 1 (grayscale) or 3 (RGB)
- **Parameters to train:** K\*F\*F\*D_in + K
- S: the stride of the convolution
- P: the padding
- W_in: the width/height (square) of the previous layer
- **Convolutional layer output width/height:** (W_in−F+2P)/S+1, rounded down when the division is not exact
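
The two formulas above can be written as small helper functions. This is a minimal sketch; the function names `conv_output_size` and `conv_param_count` are hypothetical and not part of PyTorch.

```python
def conv_output_size(w_in, f, p=0, s=1):
    """Width/height of a conv layer output: (W_in - F + 2P) / S + 1, rounded down."""
    return (w_in - f + 2 * p) // s + 1


def conv_param_count(k, f, d_in):
    """Trainable parameters: K*F*F*D_in weights plus K biases."""
    return k * f * f * d_in + k


# 130x130 RGB input, 10 filters of size 3x3, no padding, stride 1
size = conv_output_size(130, 3)        # 128
params = conv_param_count(10, 3, 3)    # 10*3*3*3 + 10 = 280

# a 7x7 kernel with padding 3 preserves the input size
same_size = conv_output_size(32, 7, p=3)  # 32
```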

Example: color image (3x130x130) as input

1. nn.Conv2d(3, 10, 3)
    - image: 3x130x130, depth: 10, kernel: 3, padding: 0, stride: 1
    - shape: (130-3+2\*0)/1+1 = 128
    - depth: 10
1. nn.MaxPool2d(4, 4)
    - output shape: 10x32x32 (128/4 = 32)
1. nn.Conv2d(10, 20, 5, padding=2)
    - featureMap: 10x32x32, depth: 20, kernel: 5, padding: 2, stride: 1
    - shape: (32-5+2\*2)/1+1 = 32
    - depth: 20
1. nn.MaxPool2d(2, 2)
    - output shape: 20x16x16
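
One way to double-check a walkthrough like this is to push a dummy tensor through the same layers and record the shape after each one. This is a sketch; the random input tensor simply stands in for a real 130x130 color image.

```python
import torch
import torch.nn as nn

# the four layers from the example, in order
layers = [
    nn.Conv2d(3, 10, 3),
    nn.MaxPool2d(4, 4),
    nn.Conv2d(10, 20, 5, padding=2),
    nn.MaxPool2d(2, 2),
]

x = torch.randn(1, 3, 130, 130)  # batch of one dummy color image
shapes = []
for layer in layers:
    x = layer(x)
    shapes.append(tuple(x.shape[1:]))  # drop the batch dimension

# shapes == [(10, 128, 128), (10, 32, 32), (20, 32, 32), (20, 16, 16)]

# trainable parameters per layer: K*F*F*D_in + K for conv, 0 for pooling
param_counts = [sum(p.numel() for p in layer.parameters()) for layer in layers]
# param_counts == [280, 0, 5020, 0]
```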

```python
import torch.nn as nn
from torch.nn import functional as F

class Network(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super(Network, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
        self.pool = nn.MaxPool2d(2,2)
        
    def forward(self, x):
        # convolution, then ReLU activation, then 2x2 max pooling
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        
        return x
    
# grayscale image 200x200
# input channels = 1 (grayscale)
# want 16 convolutional filters applied
# kernel_size=3 is an assumed value here (the constructor requires one)
# with no padding the conv output is 16x198x198, and pooling halves it to 16x99x99
n = Network(1, 16, 3)
```

## Classifiers and Transfer Learning

As the feature map moves through successive layers of filtering and pooling, the details of the pixels matter less. Instead, the network is finding features, which are then flattened and fed into a classifier.

In the figure below, the feature detector identifies parts of the car, say tires, windows, etc., and feeds them to the classifier portion, which determines, "this is a car!"

<img src="pics/cnn-classifier.png" width="500px">
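
The split between feature detector and classifier can be sketched in code as follows. The class name `SmallCNN`, the layer sizes, and the 32x32 input are all assumptions chosen to keep the example small; they are not taken from the course.

```python
import torch
import torch.nn as nn


class SmallCNN(nn.Module):  # hypothetical example network
    def __init__(self, num_classes=10):
        super().__init__()
        # feature detector: conv + pooling stages, for 3x32x32 inputs
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),   # -> 16x16x16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # -> 32x8x8
        )
        # classifier: works on the flattened feature maps
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # keep the batch dimension, flatten the rest
        return self.classifier(x)


model = SmallCNN()
scores = model(torch.randn(4, 3, 32, 32))  # batch of 4 dummy images
scores_shape = tuple(scores.shape)          # (4, 10): one score per class
```

Keeping the two parts as separate attributes (`features` and `classifier`) makes it easy to swap or freeze one of them later.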

You have a couple of options, depending on how much data you have and how similar your new application is to the original one, as shown in the pictures below. Basically you can:

- Take a trained network, remove the classifier for one application, and replace it with a custom classifier for another application. You also need to freeze the feature detector weights so training doesn't change them. The neat thing about this method is that you don't have to retrain the feature detector portion of the network.
- Take a trained network and replace the classifier at the end, but retrain the entire network, using the pre-trained weights as a starting point.

<table> 
    <tr>
        <img src="pics/transfer-learning.png">
    </tr>
    <tr>
        <td><img src="pics/d.png"></td>
        <td><img src="pics/c.png"></td>
    </tr>
    <tr>
        <td><img src="pics/a.png"></td>
        <td><img src="pics/b.png"></td>
    </tr>
</table>


### Freezing Parameters

```python
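# vgg is assumed to be a pretrained VGG model whose convolutional
# feature detector is stored in vgg.features (as in torchvision)
for param in vgg.features.parameters():
    param.requires_grad = False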
```
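
Putting the freezing step together with a new classifier might look like the sketch below. To stay self-contained it uses a tiny stand-in model, `TinyVGG`, which is hypothetical and only mimics the `features`/`classifier` layout of a pretrained network; the 5 output classes, the learning rate, and the random data are likewise assumptions.

```python
import torch
import torch.nn as nn


class TinyVGG(nn.Module):  # hypothetical stand-in for a pretrained network
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2))
        self.classifier = nn.Linear(8 * 16 * 16, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))


model = TinyVGG()

# freeze the feature detector so its weights are not updated
for param in model.features.parameters():
    param.requires_grad = False

# replace the classifier with a new one for the new application (5 classes)
model.classifier = nn.Linear(model.classifier.in_features, 5)

# give the optimizer only the parameters that are still trainable
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

# one training step on dummy data
images = torch.randn(4, 3, 32, 32)
labels = torch.tensor([0, 1, 2, 3])
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# frozen parameters never receive gradients
frozen_grads = [p.grad for p in model.features.parameters()]
```

For the second option (retraining the whole network from the pre-trained weights), you would skip the freezing loop and pass `model.parameters()` to the optimizer instead.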