# Increasing depth

We have learned about two types of layers for neural networks. We began with convolutional layers, which detect regional patterns in an image using a series of image filters. We have seen how a ReLU activation function is typically applied to the output of these filters to standardize their output values. Then we learned about maxpooling layers, which appear after convolutional layers to reduce the dimensionality of our input arrays. These new layers, along with fully-connected layers, are often the only layers that we will find in CNNs.

Let's discuss how to arrange these layers to design a complete CNN architecture. We will focus again on CNNs for image classification. In this case, our CNN must accept an image array as an input. Now, if we are going to work with messy real-world images, there is a complication that we haven't yet discussed.

If we go online and collect thousands or millions of images, it is pretty much guaranteed that they will all be different sizes. Similar to MLPs, the CNN we will discuss will require a fixed-size input. So we have to pick an image size and resize all of our images to that same size before doing anything else.

This is considered to be another pre-processing step, alongside normalization and conversion to a tensor datatype. It is very common to resize each image to be a square, with the spatial dimensions equal to a power of 2, or else a number that's divisible by a large power of two. 
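As a rough sketch of this pre-processing step, the snippet below resizes a randomly generated stand-in image to 32x32 and normalizes it. It uses `torch.nn.functional.interpolate` for the resize purely to keep the example self-contained (an assumption; image libraries are often used for this in practice), and the 48x64 source size and the 0.5 normalization constants are illustrative choices, not values from this lesson.

```python
import torch
import torch.nn.functional as F

# Stand-in for a loaded RGB image: (channels, height, width), values in [0, 1).
# The 48x64 size is arbitrary and only for illustration.
raw_image = torch.rand(3, 48, 64)

# interpolate expects a batch dimension, so add one, resize, then remove it.
resized = F.interpolate(
    raw_image.unsqueeze(0), size=(32, 32), mode="bilinear", align_corners=False
).squeeze(0)

# Shift and scale from [0, 1] to roughly [-1, 1]; 0.5 is an illustrative choice.
normalized = (resized - 0.5) / 0.5

print(resized.shape)  # torch.Size([3, 32, 32])
```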

We will be working with a dataset composed of images that have all been resized to 32x32 pixels. Recall that any image is interpreted by the computer as a 3D array. Color images have some height and width in pixels along with red, green, and blue color channels corresponding to a depth of three. Grayscale images, while technically 2D, can also be thought of as having their own width and height and a depth of one.

For both of these cases, with color or grayscale, the input array will always be much taller and wider than it is deep.

Our CNN architecture will be designed with the goal of taking that array and gradually making it much deeper than it is tall or wide. Convolutional layers will be used to make the array deeper as it passes through the network, and maxpooling layers will be used to decrease the X, Y dimensions. As the network gets deeper, it extracts more and more complex patterns and features that help identify the content and the objects in an image, while discarding some spatial information, such as a smooth background, that does not help identify the image.

Let's go over a complete image classification CNN in detail!

Say we want to classify an input image. There are a few ways we could go about this using a deep learning architecture. Consider following the input layer with a sequence of convolutional layers. This stack will discover hierarchies of spatial patterns in the image. The first layer of filters looks at patterns in the input image, the second looks at patterns in the previous convolutional layer, and so on. Each of the convolutional layers requires us to specify a number of hyperparameters.

```
self.conv1 = nn.Conv2d(3, 16, kernel_size, stride = 1, padding = 0)
```

The first and second inputs to define a convolutional layer are simply the depth of the input and the desired depth of the output. For example, the input depth of a color image will be three for the RGB channels, and we might want to produce 16 different filtered images in the convolutional layer above. 

Next we define the size of the filters that define a convolutional layer: **kernel_size**. These are often square and range from two-by-two at the smallest up to seven-by-seven or so for very large images. For this example, let's choose three-by-three filters.

```
self.conv1 = nn.Conv2d(3, 16, 3, stride = 1, padding = 0)
```

The stride is generally set to one, and many frameworks have this as the default value, so we may not need to input this value. As for padding, we may get better results if we set our padding such that a convolutional layer will have the same width and height as its input from the previous layer. In the case of a 3x3 filter, which can almost center itself perfectly on an image but misses the border pixels by one, this padding will be equal to one.


### Padding 

Padding is just adding a border of pixels around an image. In PyTorch, we specify the size of this border. Why do we need padding?

When we create a convolutional layer, we move a square filter around an image, using a center-pixel as an anchor. So, this kernel cannot perfectly overlay the edges/corners of images. The nice feature of padding is that it will allow us to control the spatial size of the output volumes (most commonly as we will see soon we will use it to exactly preserve the spatial size of the input volume so the input and output width and height are the same).

The most common methods of padding are padding an image with all 0-pixels (zero padding) or padding it with the nearest pixel value. [Here](http://cs231n.github.io/convolutional-networks/#conv) we can read more about calculating the amount of padding, given a kernel_size.
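To see the effect of padding on output size, here is a small sketch that runs a random 32x32 RGB array through two 3x3 convolutional layers, one without padding and one with a padding of one. The names `no_pad` and `same_pad` are made up for this illustration.

```python
import torch
import torch.nn as nn

# A random stand-in for one RGB image: (batch, channels, height, width).
x = torch.rand(1, 3, 32, 32)

# Same 3x3 kernel and stride, differing only in padding.
no_pad = nn.Conv2d(3, 16, 3, stride=1, padding=0)
same_pad = nn.Conv2d(3, 16, 3, stride=1, padding=1)

print(no_pad(x).shape)    # torch.Size([1, 16, 30, 30])
print(same_pad(x).shape)  # torch.Size([1, 16, 32, 32])
```

Without padding, the 3x3 kernel loses one pixel on each border, so 32 becomes 30; with a padding of one, the width and height are preserved.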

When deciding the depth or number of filters in a convolutional layer, we will often have the number of filters increase in sequence. So, the first convolutional layer might have 16 filters. The second will see that depth as input and produce a layer with a depth of 32. The third will have a depth of 64, and so on. After each convolutional layer, we will apply a ReLU activation function.

If we follow this process, we have a method for gradually increasing the depth of our array without modifying the height and width. The input, just like all of the layers in this sequence, has a height and width of 32. But the depth increases from an input layer's depth of 3 to 16 to 32 to 64.

<img src="assets/IncreasingDepth.png">

```
self.conv1 = nn.Conv2d(3, 16, 3, padding = 1)
self.conv2 = nn.Conv2d(16, 32, 3, padding = 1)
self.conv3 = nn.Conv2d(32, 64, 3, padding = 1)
```
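As a quick check of this idea, the sketch below passes a random 32x32 RGB array through the three layers above (defined as plain variables rather than `self.` attributes, just for illustration) and records the shape after each ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(3, 16, 3, padding=1)
conv2 = nn.Conv2d(16, 32, 3, padding=1)
conv3 = nn.Conv2d(32, 64, 3, padding=1)

x = torch.rand(1, 3, 32, 32)  # (batch, depth, height, width)
shapes = [tuple(x.shape)]
for conv in (conv1, conv2, conv3):
    x = F.relu(conv(x))       # ReLU after each convolutional layer
    shapes.append(tuple(x.shape))

for shape in shapes:
    print(shape)
# (1, 3, 32, 32)
# (1, 16, 32, 32)
# (1, 32, 32, 32)
# (1, 64, 32, 32)
```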

Recall that, yes, we wanted to increase the depth, but we also wanted to decrease the height and width and discard some spatial information. This is where maxpooling layers will come in. They generally follow every one or two convolutional layers in the sequence.

Below is one such example, with a max pooling layer after each convolutional layer.

<img src="assets/MaxpoolingOnCNN.png">

To define a max pooling layer, we will only need to define the filter size and stride. The most common setting will use filters of size two with a stride of two.

```
self.conv1 = nn.Conv2d(3, 16, 3, padding = 1)
self.conv2 = nn.Conv2d(16, 32, 3, padding = 1)
self.conv3 = nn.Conv2d(32, 64, 3, padding = 1)

# self.maxpooling = nn.MaxPool2d(kernel_size, stride)
self.maxpooling = nn.MaxPool2d(2, 2)
```

This has the effect of making the X,Y dimensions half of what they were from the previous layer. In this way, the combination of convolutional and max pooling layers accomplishes our goal of attaining an array that is quite deep but small in the X and Y dimensions.
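The following sketch illustrates that combination, assuming one max pooling layer after every convolutional layer (as in the figure above) and a random 32x32 RGB input. The layers are plain variables here only to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(3, 16, 3, padding=1)
conv2 = nn.Conv2d(16, 32, 3, padding=1)
conv3 = nn.Conv2d(32, 64, 3, padding=1)
maxpooling = nn.MaxPool2d(2, 2)

x = torch.rand(1, 3, 32, 32)
shapes = [tuple(x.shape)]
for conv in (conv1, conv2, conv3):
    # Convolution deepens the array; pooling halves its height and width.
    x = maxpooling(F.relu(conv(x)))
    shapes.append(tuple(x.shape))

for shape in shapes:
    print(shape)
# (1, 3, 32, 32)
# (1, 16, 16, 16)
# (1, 32, 8, 8)
# (1, 64, 4, 4)
```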

## Quick quiz
- Question 1: How might we define a [Maxpooling layer](https://pytorch.org/docs/stable/nn.html#maxpool2d) such that it down-samples an input by a factor of 4?

<img src="assets/AnswerQuizCNN1.png">

That's right! The best choice would be to use a kernel and stride of 4, so that the maxpooling function sees every input pixel once, but any layer with a stride of 4 will down-sample an input by that factor.

- Question 2: If we want to define a convolutional layer that is the same x-y size as an input array, what **padding** should we have for a `kernel_size` of 7?

<img src="assets/AnswerQuizCNN2.png">

Yes! If we overlay a 7x7 kernel so that its center-pixel is at the right-edge of an image, we will have 3 kernel columns that do not overlay anything! So, that's how big your padding needs to be.
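Both quiz answers can be checked with a short sketch. The 32x32 input and the 16 output channels are arbitrary choices made for this illustration.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 3, 32, 32)

# Question 1: a kernel and stride of 4 shrink each spatial dimension by 4.
pool4 = nn.MaxPool2d(4, 4)
pooled = pool4(x)
print(pooled.shape)  # torch.Size([1, 3, 8, 8])

# Question 2: a 7x7 kernel with a padding of 3 (stride 1) keeps 32x32.
conv7 = nn.Conv2d(3, 16, 7, padding=3)
convolved = conv7(x)
print(convolved.shape)  # torch.Size([1, 16, 32, 32])
```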

# PyTorch Layer Documentation 

## Convolutional Layers

We typically define a convolutional layer in PyTorch using `nn.Conv2d`, with the following parameters specified:
```
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
```

- `in_channels` refers to the depth of an input. For a grayscale image, this depth is 1.
- `out_channels` refers to the desired depth of the output, or the number of filtered images we want to get as output.
- `kernel_size` is the size of our convolutional kernel, most commonly 3 for a 3x3 kernel.
- `stride` and `padding` have default values, but should be set depending on how large we want our output to be in the spatial dimensions x, y

[Read more about Conv2d in the docs](https://pytorch.org/docs/stable/nn.html#conv2d).

## Pooling layers

Maxpooling layers commonly come after convolutional layers to shrink the x-y dimensions of an input. Read more about pooling layers in PyTorch [here](https://pytorch.org/docs/stable/nn.html#maxpool2d).

# Convolutional layer in PyTorch

To create a convolutional layer in PyTorch, we must first import the necessary module:

```
import torch.nn as nn
import torch.nn.functional as F  # provides F.relu, used in the forward function
```

Then, there is a two part process to defining a convolutional layer and defining the feedforward behavior of a model (how an input moves through the layers of a network). First, we must define a Model class and fill in two functions.

### init

We can define a convolutional layer in the `__init__` function of a model by using the following format:

```
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
```

### forward

Then, we refer to that layer in the forward function. Here, we pass in an input image `x` and apply a ReLU function to the output of this layer.

```
x = F.relu(self.conv1(x))
```

### Arguments

We must pass the following arguments:

- `in_channels` The number of inputs (in depth), 3 for an RGB image, for example.
- `out_channels` The number of output channels, i.e. the number of filtered "images" a convolutional layer is made of or the number of unique, convolutional kernels that will be applied to an input.
- `kernel_size` Number specifying both the height and width of the (square) convolutional kernel.

There are some additional, optional arguments that we might like to tune:

- `stride` The stride of the convolution. If you don't specify anything, stride is set to 1.
- `padding` The border of 0's around an input array. If you don't specify anything, padding is set to 0.

It is possible to represent both `kernel_size` and `stride` as either a number or a tuple.

There are many other tunable arguments that we can set to change the behavior of our convolutional layers. To read more about these, we recommend perusing the official [docs](https://pytorch.org/docs/stable/nn.html#conv2d).
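As a small illustration of the tuple form, the sketch below defines a layer with a non-square 3x5 kernel and a different stride along each axis. The specific numbers are arbitrary, and `padding` is also given as a tuple here, which PyTorch accepts as well.

```python
import torch
import torch.nn as nn

# Tuples are read as (height, width).
conv = nn.Conv2d(3, 8, kernel_size=(3, 5), stride=(1, 2), padding=(1, 2))

x = torch.rand(1, 3, 32, 32)
out = conv(x)
print(out.shape)  # torch.Size([1, 8, 32, 16])
```

The height is preserved (stride 1 with a padding of 1 for a kernel height of 3), while the width is halved by the stride of 2.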

# Pooling layers

Pooling layers take in a `kernel_size` and a `stride`. Typically, both are set to the same value, which is the down-sampling factor. For example, the following code will down-sample an input's x-y dimensions by a factor of 2:

```
self.pool = nn.MaxPool2d(2, 2)
```

### forward

Then, we see that pooling layer being applied in the forward function:

```
x = F.relu(self.conv1(x))
x = self.pool(x)
```
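Putting the `__init__` and `forward` pieces together, a minimal model might look like the sketch below. The class name `Net` and the layer sizes (3 input channels, 16 filters, a 32x32 input) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):  # "Net" is a hypothetical name for this sketch
    def __init__(self):
        super().__init__()
        # One convolutional layer followed by one pooling layer.
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = F.relu(self.conv1(x))  # depth 3 -> 16, size unchanged
        x = self.pool(x)           # height and width halved
        return x


model = Net()
out = model(torch.rand(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 16, 16, 16])
```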

### Convolutional Example 1

Say we are constructing a CNN, and our input layer accepts grayscale images that are 200 by 200 pixels (corresponding to a 3D array with height 200, width 200, and depth 1). Then, say we'd like the next layer to be a convolutional layer with 16 filters, each filter having a width and height of 2. When performing the convolution, we'd like the filter to jump two pixels at a time. We also don't want the filter to extend outside of the image boundaries; in other words, we don't want to pad the image with zeros. Then, to construct this convolutional layer, we would use the following line of code:

```
self.conv1 = nn.Conv2d(1, 16, 2, stride=2)
```

### Convolutional Example 2

Say we'd like the next layer in our CNN to be a convolutional layer that takes the layer constructed in Example 1 as input. Say we'd like our new layer to have 32 filters, each with a height and width of 3. When performing the convolution, we'd like the filter to jump 1 pixel at a time. We want this layer to have the same width and height as the input layer, and so we will pad accordingly. Then, to construct this convolutional layer, we would use the following line of code:

```
self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
```
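To confirm how these two example layers fit together, here is a sketch that feeds a random 200x200 grayscale array through both of them (as plain variables, without a model class).

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 16, 2, stride=2)     # Example 1
conv2 = nn.Conv2d(16, 32, 3, padding=1)   # Example 2

x = torch.rand(1, 1, 200, 200)  # one grayscale image
out1 = conv1(x)
out2 = conv2(out1)

print(out1.shape)  # torch.Size([1, 16, 100, 100])
print(out2.shape)  # torch.Size([1, 32, 100, 100])
```

The stride of 2 in the first layer halves the width and height, and the padding of 1 in the second layer keeps them at 100.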

# Sequential models

We can also create a CNN in PyTorch by using a `Sequential` wrapper in the `__init__` function. Sequential allows us to stack different types of layers, specifying activation functions in between:

```
def __init__(self):
    super(ModelName, self).__init__()
    self.features = nn.Sequential(
        nn.Conv2d(1, 16, 2, stride=2),
        nn.MaxPool2d(2, 2),
        nn.ReLU(True),
        
        nn.Conv2d(16, 32, 3, padding=1),
        nn.MaxPool2d(2, 2),
        nn.ReLU(True)
    )
```
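For completeness, here is one way the `Sequential` block above could sit inside a full class with a `forward` function. The class name `SequentialNet` is hypothetical, and the 200x200 grayscale input is borrowed from Convolutional Example 1.

```python
import torch
import torch.nn as nn


class SequentialNet(nn.Module):  # hypothetical name for this sketch
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 2, stride=2),
            nn.MaxPool2d(2, 2),
            nn.ReLU(True),

            nn.Conv2d(16, 32, 3, padding=1),
            nn.MaxPool2d(2, 2),
            nn.ReLU(True),
        )

    def forward(self, x):
        # The whole stack is applied with a single call.
        return self.features(x)


model = SequentialNet()
out = model(torch.rand(1, 1, 200, 200))
print(out.shape)  # torch.Size([1, 32, 25, 25])
```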

### Formula: Number of parameters in a convolutional layer

The number of parameters in a convolutional layer depends on the supplied values of `out_channels`, `kernel_size`, and `in_channels`. Let's define a few variables:

- `K` the number of filters in the convolutional layer
- `F` the height and width of the convolutional filters
- `D_in` the depth of the previous layer

Notice that `K` = `out_channels`, and `F` = `kernel_size`. Likewise, `D_in` = `in_channels`, which is typically 1 or 3 for the first layer (grayscale and RGB).

Since there are `F*F*D_in` weights per filter, and the convolutional layer is composed of `K` filters, the total number of weights in the convolutional layer is `K*F*F*D_in`. Since there is one bias term per filter, the convolutional layer has `K` biases. Thus, the **number of parameters** in the convolutional layer is given by `K*F*F*D_in + K`.
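The formula can be written as a small helper function. The name `conv_param_count` is made up for this sketch, and the three calls reuse layer definitions that appear earlier in this lesson.

```python
def conv_param_count(K, F, D_in):
    """Number of weights plus biases in a convolutional layer."""
    weights = K * F * F * D_in  # F*F*D_in weights for each of the K filters
    biases = K                  # one bias per filter
    return weights + biases


print(conv_param_count(K=16, F=3, D_in=3))   # nn.Conv2d(3, 16, 3)   -> 448
print(conv_param_count(K=16, F=2, D_in=1))   # nn.Conv2d(1, 16, 2)   -> 80
print(conv_param_count(K=32, F=3, D_in=16))  # nn.Conv2d(16, 32, 3)  -> 4640
```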

### Formula: Shape of a Convolutional Layer

The shape of a convolutional layer depends on the supplied values of `kernel_size`, `padding`, and `stride`, and on the size of its input. Let's define a few variables:

- `K` the number of filters in the convolutional layer
- `F` the height and width of the convolutional filters
- `S` the stride of the convolution
- `P` the padding
- `W_in` the width/height (square) of the previous layer

Notice that `K` = `out_channels`, `F` = `kernel_size`, `S` = `stride`, and `P` = `padding`. Likewise, `W_in` is the height and width of the input array.

The **depth** of the convolutional layer will always equal the number of filters `K`.

The spatial dimensions of a convolutional layer can be calculated as: `(W_in-F+2P)/S+1`.
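Here is the same formula as a helper function; the name `conv_output_size` is hypothetical. It uses integer division, which matches how PyTorch rounds down when the stride does not divide evenly. The same expression with `P = 0` also describes a max pooling layer.

```python
def conv_output_size(W_in, F, S=1, P=0):
    """Width/height of the output for a square input of size W_in."""
    return (W_in - F + 2 * P) // S + 1


print(conv_output_size(32, F=3, S=1, P=1))   # 3x3 kernel, padding 1  -> 32
print(conv_output_size(32, F=3, S=1, P=0))   # 3x3 kernel, no padding -> 30
print(conv_output_size(200, F=2, S=2, P=0))  # Convolutional Example 1 -> 100
print(conv_output_size(32, F=2, S=2, P=0))   # nn.MaxPool2d(2, 2)     -> 16
```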

# Flattening

Part of completing a CNN architecture is to flatten the eventual output of a series of convolutional and pooling layers, so that all parameters can be seen (as a vector) by a linear classification layer. At this step, it is imperative that we know exactly how many parameters (output values) are produced by a layer.

For the following quiz questions, we consider an input image that is 130x130 (x, y) and 3 in depth (RGB). Say this image goes through the following layers in order:

```
nn.Conv2d(3, 10, 3)
nn.MaxPool2d(4, 4)
nn.Conv2d(10, 20, 5, padding=2)
nn.MaxPool2d(2, 2)
```

## Quick quiz

- Question 1: After going through all four of these layers in sequence, what is the depth of the final output?

<img src="assets/AnswerQuizCNN3.png">

That's right, the final depth is determined by the last convolutional layer, which has a depth = out_channels = 20.

- Question 2: What is the x-y size of the output of the final maxpooling layer? Be careful to look at how the 130x130 image shrinks as it moves through each convolutional and pooling layer.

<img src="assets/AnswerQuizCNN4.png">

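That's right! The 130x130 image shrinks by one pixel on each side after the first convolutional layer (to 128x128), and is then down-sampled by 4 and then by 2 by the successive maxpooling layers. The second convolutional layer uses a padding of 2, so it preserves the width and height of its input.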

- Question 3: How many parameters, total, will be left after an image passes through all four of the above layers in sequence?

<img src="assets/AnswerQuizCNN5.png">

That's right! It's the x-y size of the final output times the number of final channels/depth = `16*16*20`.
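The quiz answers can be verified with a short sketch that runs a random 130x130 RGB array through the four layers, flattens the result, and passes it to a linear layer. Wrapping the layers in `nn.Sequential` and using 10 output classes are assumptions made only for this illustration.

```python
import torch
import torch.nn as nn

layers = nn.Sequential(
    nn.Conv2d(3, 10, 3),
    nn.MaxPool2d(4, 4),
    nn.Conv2d(10, 20, 5, padding=2),
    nn.MaxPool2d(2, 2),
)

x = torch.rand(1, 3, 130, 130)
features = layers(x)
print(features.shape)  # torch.Size([1, 20, 16, 16])

# Flatten everything except the batch dimension into one vector per image.
flat = features.view(features.size(0), -1)
print(flat.shape)  # torch.Size([1, 5120])

# The linear layer's input size must match the flattened length exactly.
classifier = nn.Linear(16 * 16 * 20, 10)  # 10 classes is an arbitrary choice
scores = classifier(flat)
print(scores.shape)  # torch.Size([1, 10])
```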