# Convolutional Neural Networks (II)

**Reading**: Chapter 6.3-6.6 of *Dive Into Deep Learning*

## Outline

- Padding and Stride
- Multiple channels
- Pooling
- LeNet

# Padding and Stride


## Padding

- Pad a $3 \times 3$ input with zeros, increasing its size to $5 \times 5$.
- The corresponding output then increases to a $4 \times 4$ matrix.

![Two-dimensional cross-correlation with padding.](../img/conv-pad.svg)


 
- Add a total of $p_h$ rows of padding (roughly half on top and half on bottom) and a total of $p_w$ columns of padding (roughly half on the left and half on the right).
- For an $n_h\times n_w$ input and a $k_h\times k_w$ kernel, the output shape is then

$$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1).$$
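As a quick check of this formula against the figure above: the $4 \times 4$ output implies a $2 \times 2$ kernel, and one row or column of zeros on each side gives $p_h = p_w = 2$, so

$$(3-2+2+1)\times(3-2+2+1) = 4\times 4.$$

More generally, choosing $p_h = k_h - 1$ and $p_w = k_w - 1$ makes the output the same size as the input:

$$(n_h-k_h+(k_h-1)+1)\times(n_w-k_w+(k_w-1)+1) = n_h\times n_w.$$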



In [1]:
import torch
from torch import nn


# Helper that applies a convolutional layer to a 2D input. It adds the batch
# and channel dimensions that `nn.Conv2d` expects and removes them again from
# the output, so that only the spatial shape is returned
def comp_conv2d(conv2d, X):
    # Here (1, 1) indicates that the batch size and the number of channels
    # are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Exclude the first two dimensions that do not interest us: examples and
    # channels
    return Y.reshape(Y.shape[2:])


# Note that here 1 row or column is padded on either side, so a total of 2
# rows or columns are added
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

torch.Size([8, 8])

In [2]:
# Here, we use a convolution kernel with a height of 5 and a width of 3. The
# padding numbers on either side of the height and width are 2 and 1,
# respectively
conv2d = nn.Conv2d(1, 1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

torch.Size([8, 8])

## Stride



- *Stride*: the number of rows and columns traversed per slide

![Cross-correlation with strides of 3 and 2 for height and width, respectively.](../img/conv-stride.svg)





In [3]:
conv2d = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape

torch.Size([4, 4])

Next, we will look at a slightly more complicated example.


In [4]:
conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape

torch.Size([2, 2])
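The shapes printed above can be checked by hand. With strides $s_h$ and $s_w$, the standard output-shape formula is $\lfloor(n_h-k_h+p_h+s_h)/s_h\rfloor \times \lfloor(n_w-k_w+p_w+s_w)/s_w\rfloor$, where $p_h$ and $p_w$ are the *total* padding. The sketch below assumes this formula; the helper name `conv_output_shape` is hypothetical, and it takes the padding per side (as `nn.Conv2d` does) and doubles it internally.

```python
def conv_output_shape(n, k, p=(0, 0), s=(1, 1)):
    """Return the (height, width) of a convolution output.

    n: input (height, width); k: kernel (height, width);
    p: padding added to EACH side, as in nn.Conv2d; s: stride.
    """
    return tuple((n_i - k_i + 2 * p_i + s_i) // s_i
                 for n_i, k_i, p_i, s_i in zip(n, k, p, s))


# The four configurations used in the cells above, all on an 8 x 8 input
print(conv_output_shape((8, 8), (3, 3), p=(1, 1)))            # (8, 8)
print(conv_output_shape((8, 8), (5, 3), p=(2, 1)))            # (8, 8)
print(conv_output_shape((8, 8), (3, 3), p=(1, 1), s=(2, 2)))  # (4, 4)
print(conv_output_shape((8, 8), (3, 5), p=(0, 1), s=(3, 4)))  # (2, 2)
```

The integer division rounds down, which is why a stride that does not evenly divide the padded input simply drops the last partial window.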



## Summary of Padding and Stride

* Padding can increase the height and width of the output. This is often used to give the output the same height and width as the input.
* The stride can reduce the resolution of the output, for example reducing the height and width of the output to only $1/n$ of the height and width of the input ($n$ is an integer greater than $1$).
* Padding and stride can be used to adjust the dimensionality of the data effectively.




# Multiple Input and Multiple Output Channels


## Multiple Input Channels

- RGB input image: represented by three matrices
  - has shape $3\times h\times w$.
- We refer to this axis, with a size of 3, as the *channel* dimension.
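- Kernel shape: $c_i\times k_h\times k_w$, where $c_i$ is the number of input channels
- The kernel contains a tensor of shape $k_h\times k_w$ for *every* input channel; the per-channel results are summed into a single output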

![Cross-correlation computation with 2 input channels.](../img/conv-multi-in.svg)

In [5]:
import torch
from d2l import torch as d2l

In [6]:
def corr2d_multi_in(X, K):
    # First, iterate through the 0th dimension (channel dimension) of `X` and
    # `K`. Then, add them together
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

In [7]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)

tensor([[ 56.,  72.],
        [104., 120.]])
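To see where the top-left entry $56$ comes from, cross-correlate the top-left $2\times 2$ window of each input channel with the matching kernel slice and add the two results:

$$\underbrace{(0\cdot 0 + 1\cdot 1 + 3\cdot 2 + 4\cdot 3)}_{\text{channel 0}} + \underbrace{(1\cdot 1 + 2\cdot 2 + 4\cdot 3 + 5\cdot 4)}_{\text{channel 1}} = 19 + 37 = 56.$$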

## Multiple Output Channels
- Kernel shape: $c_o\times c_i\times k_h\times k_w$, where $c_o$ is the number of output channels, e.g., $56\times 28\times 224\times 224$
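A minimal sketch of a layer with several output channels, assuming one $c_i\times k_h\times k_w$ kernel per output channel whose results are stacked along a new leading axis. The helper `corr2d` is a local stand-in for `d2l.corr2d` so that the block runs without `d2l`; the name `corr2d_multi_in_out` and the choice of `K`, `K + 1`, `K + 2` as the three kernels are illustrative assumptions.

```python
import torch


def corr2d(X, K):
    """2D cross-correlation of a single-channel input with a single kernel."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y


def corr2d_multi_in(X, K):
    # Sum the per-channel cross-correlations over the input channels
    return sum(corr2d(x, k) for x, k in zip(X, K))


def corr2d_multi_in_out(X, K):
    # Apply one (c_i, k_h, k_w) kernel per output channel and stack the results
    return torch.stack([corr2d_multi_in(X, k) for k in K], dim=0)


X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
# Illustrative choice: build 3 output channels from K, K + 1, and K + 2
K_out = torch.stack((K, K + 1, K + 2), dim=0)
Y = corr2d_multi_in_out(X, K_out)
print(K_out.shape)  # torch.Size([3, 2, 2, 2])
print(Y.shape)      # torch.Size([3, 2, 2])
```

The first output channel uses `K` unchanged, so `Y[0]` reproduces the single-output result `[[56., 72.], [104., 120.]]` from the previous section.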


## $1\times 1$ Convolutional Layer

- Convolution across channels for each spatial position
  - $k_h = k_w = 1$
- Requires $c_o\times c_i$ weights plus the bias, where $c_i$ and $c_o$ are the numbers of input and output channels, respectively.

![The cross-correlation computation uses the $1\times 1$ convolution kernel with 3 input channels and 2 output channels. The input and output have the same height and width.](../img/conv-1x1.svg)


In [8]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully-connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))
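One way to sanity-check the matrix-multiplication view is to compare it with PyTorch's built-in `torch.nn.functional.conv2d` on a $1\times 1$ kernel. The sketch below repeats the function so that it runs on its own; the random inputs, the seed, and the tolerance are arbitrary choices.

```python
import torch
import torch.nn.functional as F


def corr2d_multi_in_out_1x1(X, K):
    # Same reshape-and-matmul idea as above, repeated so this block runs alone
    c_i, h, w = X.shape
    c_o = K.shape[0]
    Y = torch.matmul(K.reshape((c_o, c_i)), X.reshape((c_i, h * w)))
    return Y.reshape((c_o, h, w))


torch.manual_seed(0)
X = torch.randn(3, 3, 3)     # c_i = 3, h = w = 3
K = torch.randn(2, 3, 1, 1)  # c_o = 2, c_i = 3, 1 x 1 kernel

Y1 = corr2d_multi_in_out_1x1(X, K)
# F.conv2d expects a batch dimension, so add one before and drop it after
Y2 = F.conv2d(X.unsqueeze(0), K).squeeze(0)
print(Y1.shape)                           # torch.Size([2, 3, 3])
print(torch.allclose(Y1, Y2, atol=1e-5))  # True
```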

## Summary of channels

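* Multiple channels can be used to extend the model parameters of the convolutional layer: a layer with $c_i$ input and $c_o$ output channels has a kernel of shape $c_o\times c_i\times k_h\times k_w$.
* The $1\times 1$ convolutional layer is equivalent to a fully-connected layer applied independently at every pixel.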
* The $1\times 1$ convolutional layer is typically used to adjust the number of channels between network layers and to control model complexity.




# Pooling

- Pooling aims to gradually reduce the spatial resolution of the hidden representations.
- As the resolution drops, each hidden unit covers a larger region of the input, moving from lower-level to higher-level features.

## Maximum Pooling 


![Maximum pooling with a pooling window shape of $2\times 2$. The shaded portions are the first output element as well as the input tensor elements used for the output computation: $\max(0, 1, 3, 4)=4$.](../img/pooling.svg)



In [9]:
import torch
from torch import nn
from d2l import torch as d2l

In [10]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

In [11]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

tensor([[4., 5.],
        [7., 8.]])

## Average Pooling


In [12]:
pool2d(X, (2, 2), 'avg')

tensor([[2., 3.],
        [5., 6.]])

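## Padding, Stride, Channels

- Pooling layers take padding and stride arguments just like convolutional layers.
- With multiple input channels, each channel is pooled separately, so the number of output channels equals the number of input channels.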
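A minimal sketch of these options using PyTorch's built-in `nn.MaxPool2d`. The $4\times 4$ input and the variable names are illustrative choices.

```python
import torch
from torch import nn

X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))

# By default the stride equals the pooling window size
print(nn.MaxPool2d(3)(X))  # tensor([[[[10.]]]])

# Padding and stride can also be set explicitly, as for nn.Conv2d
pool = nn.MaxPool2d(3, padding=1, stride=2)
print(pool(X))
# tensor([[[[ 5.,  7.],
#           [13., 15.]]]])

# Each channel is pooled separately, so 2 channels in gives 2 channels out
X2 = torch.cat((X, X + 1), dim=1)
print(pool(X2).shape)  # torch.Size([1, 2, 2, 2])
```

Unlike a convolutional layer, the pooling layer does not sum over input channels, which is why the channel count is unchanged.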

## Summary of Pooling

* Given the input elements in the pooling window,
  - the maximum pooling operation outputs their maximum value;
  - the average pooling operation outputs their average value.
* Pooling alleviates the excessive sensitivity of the convolutional layer to location.
* Maximum pooling combined with a stride larger than 1 can be used to reduce the spatial dimensions.




# Convolutional Neural Networks (LeNet)


- Introduced by Yann LeCun in the 1990s
- For handwritten digit recognition in images

## LeNet

- **LeNet (LeNet-5)** consists of two parts:
  - a convolutional encoder consisting of two convolutional layers; and
  - a dense block consisting of three fully-connected layers.

![Data flow in LeNet. The input is a handwritten digit, the output a probability over 10 possible outcomes.](../img/lenet.svg)

In [13]:
import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10))


![Compressed notation for LeNet-5.](../img/lenet-vert.svg)


In [14]:
X = torch.rand(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape: \t', X.shape)

Conv2d output shape: 	 torch.Size([1, 6, 28, 28])
Sigmoid output shape: 	 torch.Size([1, 6, 28, 28])
AvgPool2d output shape: 	 torch.Size([1, 6, 14, 14])
Conv2d output shape: 	 torch.Size([1, 16, 10, 10])
Sigmoid output shape: 	 torch.Size([1, 16, 10, 10])
AvgPool2d output shape: 	 torch.Size([1, 16, 5, 5])
Flatten output shape: 	 torch.Size([1, 400])
Linear output shape: 	 torch.Size([1, 120])
Sigmoid output shape: 	 torch.Size([1, 120])
Linear output shape: 	 torch.Size([1, 84])
Sigmoid output shape: 	 torch.Size([1, 84])
Linear output shape: 	 torch.Size([1, 10])
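The spatial sizes in this trace, and the `16 * 5 * 5` input size of the first `nn.Linear`, can be reproduced with plain arithmetic. The sketch below assumes the usual output-size formula $\lfloor(n-k+2p+s)/s\rfloor$ with per-side padding $p$, which also applies to the pooling layers; the helper `conv_out` and the dictionary keys are hypothetical names. It also counts the learnable parameters (weights plus biases) of each layer.

```python
def conv_out(n, k, p=0, s=1):
    # p is the padding on each side, as in nn.Conv2d and nn.AvgPool2d
    return (n - k + 2 * p + s) // s


n = 28
n = conv_out(n, k=5, p=2)  # Conv2d(1, 6, kernel_size=5, padding=2) -> 28
n = conv_out(n, k=2, s=2)  # AvgPool2d(kernel_size=2, stride=2)     -> 14
n = conv_out(n, k=5)       # Conv2d(6, 16, kernel_size=5)           -> 10
n = conv_out(n, k=2, s=2)  # AvgPool2d(kernel_size=2, stride=2)     -> 5
flat = 16 * n * n          # Flatten                                -> 400

# Weights + biases for each learnable layer
params = {
    'conv1': 1 * 6 * 5 * 5 + 6,
    'conv2': 6 * 16 * 5 * 5 + 16,
    'fc1': flat * 120 + 120,
    'fc2': 120 * 84 + 84,
    'fc3': 84 * 10 + 10,
}
print(n, flat)               # 5 400
print(sum(params.values()))  # 61706
```

Most of the parameters sit in the first fully-connected layer (`fc1`), not in the convolutional layers.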


## Summary

* A CNN is a network that employs convolutional layers.
* In a CNN, we interleave convolutions, nonlinearities, and (often) pooling operations.
* A typical CNN gradually decreases the spatial resolution while increasing the number of channels.
* LeNet was arguably the first successful deployment of such a network.