$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$

# CS236605: Deep Learning
# Tutorial 4: Convolutional Neural Networks

## Introduction

In this tutorial, we will cover:

- Convolutional layers
- Pooling layers
- Network architecture
- Spatial classification with fully-convolutional nets
- Residual nets

In [1]:
# Setup
%matplotlib inline
import os
import sys
import torch
import torchvision
import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')

## Theory Reminders

### Multilayer Perceptron (MLP)

#### Model
![img](https://qph.fs.quoracdn.net/main-qimg-330e8b2941bc0164211bbdc7d5c693f3)

Composed of multiple **layers**.

Each layer $j$ consists of $n_j$ regular perceptrons ("neurons") which calculate:
$$
\vec{y}_j = \varphi\left( \mat{W}_j \vec{y}_{j-1} + \vec{b}_j \right),~
\mat{W}_j\in\set{R}^{n_{j}\times n_{j-1}},~ \vec{b}_j\in\set{R}^{n_j}.
$$

- Note that both input and output are **vectors**. We can think of the above equation as describing a layer of **multiple perceptrons**.
- We'll henceforth refer to such layers as **fully-connected** or FC layers.
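
As a minimal sketch of the layer equation above, here is a single FC layer in PyTorch, checked against the explicit $\varphi(\mat{W}\vec{y}+\vec{b})$ form. The layer sizes and the choice of `tanh` for $\varphi$ are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_in, n_out = 4, 3            # hypothetical sizes n_{j-1} and n_j
fc = nn.Linear(n_in, n_out)   # holds W_j (n_out x n_in) and b_j (n_out)
phi = torch.tanh              # assumed activation function

y_prev = torch.randn(n_in)    # stand-in for the previous layer's output
y = phi(fc(y_prev))

# The same computation written out explicitly as phi(W y + b)
y_manual = phi(fc.weight @ y_prev + fc.bias)

print(fc.weight.shape, fc.bias.shape)  # torch.Size([3, 4]) torch.Size([3])
print(torch.allclose(y, y_manual, atol=1e-6))
```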


Given an input sample $\vec{x}^i$, the computed function of an $L$-layer MLP is:
$$
\vec{y}_L^i= \varphi \left(
\mat{W}_L \varphi \left( \cdots
\varphi \left( \mat{W}_1 \vec{x}^i + \vec{b}_1 \right)
\cdots \right)
+ \vec{b}_L \right)
$$

- Universal approximation theorem: an MLP with $L>1$ can approximate (almost) any function given enough parameters (Cybenko, 1989).
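
The composition above could be sketched as a stack of FC layers. The helper name `make_mlp` and the layer sizes are hypothetical; following the formula, the activation is applied after every layer, including the last one.

```python
import torch
import torch.nn as nn

def make_mlp(sizes, activation=nn.ReLU):
    """Stack FC layers with sizes [n_0, n_1, ..., n_L], each followed by the activation."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_in, n_out), activation()]
    return nn.Sequential(*layers)

# A 3-layer MLP on flattened 28x28 inputs (sizes chosen only for illustration)
mlp = make_mlp([784, 64, 32, 10])

x = torch.randn(5, 784)   # batch of 5 input samples
y = mlp(x)
print(y.shape)            # torch.Size([5, 10])
```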


#### Applications

##### Regression

<img src="img/regression.png" alt="regression" width="600"/>

- Output: $\hat{\vec{y}^i} = \vec{y}^i_L$
- Quadratic loss: $\sum_i \norm{\vec{y}^i - \hat{\vec{y}^i}}^2$

##### Classification

<img src="img/classification.png" width="900" alt="classification">

- Output: $\hat{\vec{y}^i} = \mathrm{softmax}(\vec{y}^i_L)$ (class probabilities)
- Cross entropy loss: $\sum_i - {\vectr{y}}^i \log(\hat{\vec{y}^i})$
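
To make the classification loss concrete, here is a small sketch that computes the softmax and cross entropy by hand and compares them with PyTorch's built-in `F.cross_entropy`, which takes the raw scores $\vec{y}_L$ and applies the (log-)softmax internally. The scores and labels are made-up values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

scores = torch.randn(4, 3)             # y_L for 4 samples and 3 classes (made up)
labels = torch.tensor([0, 2, 1, 2])    # ground-truth class indices (made up)
y_onehot = F.one_hot(labels, num_classes=3).float()

# Class probabilities and the cross-entropy sum from the formula above
y_hat = torch.softmax(scores, dim=1)
loss_manual = -(y_onehot * torch.log(y_hat)).sum()

# Built-in version, summed over samples to match the formula
loss_torch = F.cross_entropy(scores, labels, reduction='sum')

print(loss_manual.item(), loss_torch.item())
```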

To explore ConvNets we'll now be focusing our attention mainly on the task of classifying images.

#### Limitations of MLPs for image classification

- The number of parameters grows with the number of input pixels (quadratically with the image side length), because every neuron is connected to every input value.
    - 28x28 MNIST image: 784 weights per neuron in the first layer
    - 1000x1000x3 color image: 3M weights **per neuron**
    
    <img src="img/vanilla_dnn_scale.png" width="700" alt="scale">

- Huge number of parameters greatly increases risk of overfitting
    - MLP with 1 hidden layer, 3, 6 and 20 Neurons
    
    <img src="img/overfit_1HL_3-6-20N.jpg" width="600" alt="overfit1">
    
    - MLP with 1, 2 and 4 hidden layers, 3 neurons each
    
    <img src="img/overfit_1-2-4HL_3N.jpg" width="600" alt="overfit1">

- FC layers are highly sensitive to translation, while image features are inherently translation-invariant.
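
The parameter counts quoted above can be reproduced with a few lines of plain Python. The helper `fc_layer_params` is a hypothetical name, and it counts one bias per neuron in addition to the weights.

```python
def fc_layer_params(input_shape, n_neurons):
    """Weights plus biases of an FC layer applied to a flattened input."""
    n_inputs = 1
    for d in input_shape:
        n_inputs *= d
    return n_neurons * (n_inputs + 1)

# Per-neuron cost: 784 weights (+1 bias) for MNIST, 3M weights (+1 bias) for the color image
print(fc_layer_params((28, 28), n_neurons=1))          # 785
print(fc_layer_params((1000, 1000, 3), n_neurons=1))   # 3000001

# A modest first layer of 100 neurons on the color image
print(fc_layer_params((1000, 1000, 3), n_neurons=100)) # 300000100
```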

Despite all these limitations we still want to use deep neural nets because they allow us to **learn hierarchical features** from the data.

## Convolutional Layers

### Structural view

A convolutional layer is similar to an MLP FC layer but with three important distinctions:
1. Each neuron is only **connected to a small region** of the previous layer's output.
1. The neurons are stacked in a **3D** grid (instead of 1D).
1. Neurons that are at the same depth in the grid **share the same weights** (parameters $\mat{W},~\vec{b}$).

<img src="img/cnn_layer.jpeg" width="400"/>

In the above image, the colors of the neurons represent their weights.

Two important things to understand about convolutional layers:
- They operate on and produce **volumes** (3D tensors).

   <img src="img/cnn_layers.jpeg" width="800" />

- Each neuron is spatially local, but operates on the **full depth** dimension of its input layer.

   <img src="img/depthcol.jpeg" width="500" />

### Filter-based view

Since each neuron in a given depth-slice operates on a small region of the input layer, we can think of the combined **output of that depth-slice** as the **convolution between a filter and the input volume**.

<img src="img/cnn_filters.png" width="700" />
<img src="img/filter_resp.png" width="700" />

Since we have multiple depth-slices per convolutional layer, the layer computes multiple convolutions of the same input with different kernels (filters).

Each 2D slice of an input or output volume is known as a **feature map** or a **channel**.

[Visualization of a convolutional filter](http://cs231n.github.io/assets/conv-demo/index.html).
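
To connect the structural and filter-based views, the sketch below computes one output "neuron" by hand (a kernel multiplied with a local patch over the full input depth, plus a bias) and compares it with the matching entry of `F.conv2d`. All sizes and indices are made up. Note that PyTorch's convolution is a cross-correlation, i.e. the kernel is not flipped.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 3, 8, 8)   # input volume: batch of 1, C_in=3, 8x8 (made-up sizes)
w = torch.randn(4, 3, 3, 3)   # K=4 kernels, each spanning the full depth C_in=3
b = torch.randn(4)            # one bias per kernel

out = F.conv2d(x, w, b)       # no padding, stride 1
print(out.shape)              # torch.Size([1, 4, 6, 6])

# A single neuron: kernel k applied at spatial location (i, j)
k, i, j = 2, 1, 4
patch = x[0, :, i:i + 3, j:j + 3]      # local in space, full depth
neuron = (patch * w[k]).sum() + b[k]

print(torch.allclose(out[0, k, i, j], neuron, atol=1e-4))
```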

### Hyperparameters & dimensions

Assume an input volume of shape $(C_{\mathrm{in}}, H_{\mathrm{in}}, W_{\mathrm{in}})$, i.e. channels, height, width.
Define,

1. Number of kernels, $K \geq 1$.
2. Spatial extent (size) of each kernel, $F \geq 1$. 
3. Stride $S\geq 1$: spatial distance between consecutive applications of a kernel.
4. Padding $P\geq 0$: Number of "pixels" to zero-pad around each input feature map.
5. Dilation $D \geq 1$: Spacing between kernel elements when applying to input.

In the following animations, **blue** maps are inputs,
**green** maps are outputs and
the **shaded** area is the kernel with $F=3$.

| $P=0,~S=1,~D=1$ | $P=1,~S=1,~D=1$ | $P=1,~S=2,~D=1$ | $P=0,~S=1,~D=2$ |
|-----------------|-----------------|-----------------| --------------- |
|<img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif" width="250"/>| <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/same_padding_no_strides.gif" width="250"/> | <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/padding_strides.gif" width="250"/> | <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/dilation.gif" width="250"/> |


We can see that the second combination, $F=3,~P=1,~S=1,~D=1$, leads to identical sizes of input and output feature maps.
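
The four combinations in the table can be tried directly with `nn.Conv2d`. This sketch assumes a 7x7 single-channel input, which is not necessarily the input size drawn in the animations.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 7, 7)   # 7x7 single-channel input (size chosen for illustration)

# Same (P, S, D) combinations as the table columns, all with F=3
configs = [
    dict(padding=0, stride=1, dilation=1),
    dict(padding=1, stride=1, dilation=1),
    dict(padding=1, stride=2, dilation=1),
    dict(padding=0, stride=1, dilation=2),
]

shapes = []
for cfg in configs:
    conv = nn.Conv2d(1, 1, kernel_size=3, **cfg)
    shapes.append(tuple(conv(x).shape[-2:]))
    print(cfg, '->', shapes[-1])
# Spatial output sizes: (5, 5), (7, 7), (4, 4), (3, 3)
```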

Then, given a set of hyperparameters,

- Each convolution kernel will be a tensor of shape $(C_{\mathrm{in}}, F, F)$.
- The output volume dimensions will be:

  $$\begin{align}
  H_{\mathrm{out}} &= \left\lfloor \frac{H_{\mathrm{in}} + 2P - D\cdot(F-1) -1}{S} \right\rfloor + 1\\
  W_{\mathrm{out}} &= \left\lfloor \frac{W_{\mathrm{in}} + 2P - D\cdot(F-1) -1}{S} \right\rfloor + 1\\
  C_{\mathrm{out}} &= K\\
  \end{align}$$

- The number of parameters in the convolutional layer will be:

$$
\underbrace{K}_{\mathrm{kernels}} \cdot \left(
\underbrace{C_{\mathrm{in}} \cdot F^2}_{\mathrm{kernel\ size}} + \underbrace{1}_{\mathrm{bias\ term}}
\right)
$$

**Example**: Input image is 1000x1000x3, and the first conv layer has $10$ kernels of size 5x5.
The number of parameters in the first layer will be: $ 10 \cdot 3 \cdot 5^2 + 10 = 760 $.
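
The two formulas above translate directly into plain Python. The helper names `conv_out_size` and `conv_num_params` are hypothetical; the argument names follow the symbols defined in this section.

```python
def conv_out_size(size_in, F, P=0, S=1, D=1):
    """Output height or width of a conv layer, per the formula above."""
    return (size_in + 2 * P - D * (F - 1) - 1) // S + 1

def conv_num_params(C_in, K, F):
    """K kernels of shape (C_in, F, F), each with one bias term."""
    return K * (C_in * F ** 2 + 1)

# The example above: 1000x1000x3 input, 10 kernels of size 5x5
print(conv_num_params(C_in=3, K=10, F=5))   # 760
print(conv_out_size(1000, F=5))             # 996 (no padding, stride 1)

# F=3, P=1, S=1 keeps the spatial size unchanged
print(conv_out_size(32, F=3, P=1))          # 32
```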


### PyTorch `Conv2d` layer example

In [2]:
import torchvision.transforms as tvtf

tf = tvtf.Compose([tvtf.ToTensor()])
ds_cifar10 = torchvision.datasets.CIFAR10(data_dir, download=True, train=True, transform=tf)

Files already downloaded and verified


In [3]:
# Load first CIFAR10 image
x0,y0 = ds_cifar10[0]

# add batch dimension
x0 = x0.unsqueeze(0)

print('x0 shape with batch dim:', x0.shape)

x0 shape with batch dim: torch.Size([1, 3, 32, 32])


In [4]:
# A function to count the number of parameters in an nn.Module.
def num_params(layer):
    return sum([p.numel() for p in layer.parameters()])

In [5]:
import torch.nn as nn

# First conv layer: works on input image volume
conv1 = nn.Conv2d(in_channels=x0.size(1), out_channels=10, padding=1, kernel_size=3, stride=1)

print(f'conv1: {num_params(conv1)} parameters')
print(f'{"Input image shape:":25s}{x0.shape}')
print(f'{"After first conv layer:":25s}{conv1(x0).shape}')

conv1: 280 parameters
Input image shape:       torch.Size([1, 3, 32, 32])
After first conv layer:  torch.Size([1, 10, 32, 32])


In [6]:
# Second conv layer: works on output volume of first layer
conv2 = nn.Conv2d(in_channels=10, out_channels=20, padding=0, kernel_size=6, stride=2)

print(f'conv2: {num_params(conv2)} parameters')
print(f'{"After second conv layer:":25s}{conv2(conv1(x0)).shape}')

conv2: 7220 parameters
After second conv layer: torch.Size([1, 20, 14, 14])


**Note**: observe that the width and height dimensions of the input image were never specified!
More on the significance of that later.

## Pooling layers

Besides using strides, another way to reduce the size of feature maps between convolutional layers
is to add **pooling** layers.

A pooling layer has the following hyperparameters (but **no trainable parameters**):

1. Spatial extent (size) of each pooling kernel, $F \geq 2$. 
1. Stride $S\geq 2$: spatial distance between consecutive applications.
1. Operation (e.g. max, average, $p$-norm)

**Example**: $\max$-pooling with $F=2,~S=2$ performing a factor-2 downsample:

<img src="img/maxpool.png" width="600" />
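
Here is a tiny sketch of the same operation on a hand-written 4x4 feature map (the values are made up and are not taken from the figure). It also confirms that the pooling layer has no trainable parameters.

```python
import torch
import torch.nn as nn

# One 4x4 feature map with batch and channel dimensions added
x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)

print(y.squeeze())   # each entry is the max of one 2x2 block: [[6, 8], [3, 4]]
n_pool_params = sum(p.numel() for p in pool.parameters())
print(n_pool_params) # 0
```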

### Why downsample feature maps after convolutions?

Obviously, to reduce the number of features to process in the next layer.

But more crucially,

- To increase the **receptive field** of each layer, i.e. the region of the original image that it works with.
- We want successive conv layers to be affected by increasingly larger parts of the input image.
- This allows us to learn a hierarchy of visual features.

<img src="img/feature_hierarchy.png" width="500" />
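
To get a feeling for how downsampling grows the receptive field, here is a rough sketch using the standard recurrence for a chain of conv/pool layers: each layer adds $(F-1)$ times the accumulated stride. The helper `receptive_field` is a hypothetical name, and dilation is assumed to be 1.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first.
    Returns the side length, in input pixels, seen by one output element."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump   # each layer widens the field by (k-1) input-space steps
        jump *= s             # strides compound: one step here spans `jump` input pixels
    return r

conv, pool = (3, 1), (2, 2)

# Three 3x3 convolutions, without and with 2x2 pooling in between
print(receptive_field([conv, conv, conv]))              # 7
print(receptive_field([conv, pool, conv, pool, conv]))  # 18
```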

### PyTorch `MaxPool2d` layer example

In [7]:
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(f'{"After second conv layer:":25s}{conv2(conv1(x0)).shape}')
print(f'{"After max-pool:":25s}{pool(conv2(conv1(x0))).shape}')

After second conv layer: torch.Size([1, 20, 14, 14])
After max-pool:          torch.Size([1, 20, 7, 7])


## Network Architecture

The basic way to build the architecture of a deep convolutional neural net is to repeat groups of **conv-relu** layers, sprinkle in some **pooling** in between and top it all off with a nice **FC-softmax** combo.

<img src="img/arch.png" width="700" />

In the above image,

- All the **conv** blocks are actually **conv-relu** (or some other nonlinearity).
- The rightmost architecture is called VGG, and used to be a relevant architecture for ImageNet classification.
- Other types of layers, such as normalization layers, are usually also added.

There are many other things to consider as part of the architecture:
- Size of conv kernels
- Number of consecutive convolutions
- Use of batch norm to speed up training
- Dropout for improved generalization
- Pointwise convolutions instead of FC layers
- Skip connections (we'll see later)
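
As one possible way to combine several of these ingredients, the sketch below builds a small network from repeated conv-batchnorm-relu groups with pooling, followed by dropout and a pointwise convolution in place of FC layers. The helper `conv_block`, the channel counts and the dropout rate are arbitrary assumptions, not a recommended design.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, n_convs=2, kernel_size=3):
    """n_convs x (conv -> batchnorm -> relu), then 2x2 max-pooling.
    Assumes an odd kernel_size so that the padding keeps the spatial size."""
    layers = []
    for i in range(n_convs):
        layers += [
            nn.Conv2d(c_in if i == 0 else c_out, c_out, kernel_size,
                      padding=kernel_size // 2),
            nn.BatchNorm2d(c_out),
            nn.ReLU(),
        ]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

tiny_cnn = nn.Sequential(
    conv_block(3, 16),
    conv_block(16, 32),
    nn.Dropout(p=0.5),
    nn.Conv2d(32, 10, kernel_size=1),   # pointwise conv producing 10 class scores
    nn.AdaptiveAvgPool2d(1),            # average the scores over all locations
    nn.Flatten(),
)
tiny_cnn.eval()   # use running batchnorm statistics and disable dropout

scores = tiny_cnn(torch.randn(2, 3, 32, 32))
print(scores.shape)   # torch.Size([2, 10])
```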

Many different network architectures exist, made famous mainly by repeated improvements on the ImageNet classification challenge since 2012.

<img src="img/net_archs.png" width="1000" />

Notable ImageNet-winning architectures:

- AlexNet, 8 layers (5 convolutional, 3 FC) (2012): Based on LeNet, deeper, with ReLU, trained with GPUs
- Inception/GoogLeNet, 22 layers (2014): Multiple (small) kernel sizes at same depth
- ResNet, 152 (!) layers (2015): Skip connections

### PyTorch network architecture example

Let's implement **LeNet**, arguably the first successful CNN model for MNIST (LeCun, 1998).

<img src="https://cdn-images-1.medium.com/max/1600/1*1TI1aGBZ4dybR6__DI9dzA.png" width="1000" />

In [8]:
class LeNet(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels, 6, 5),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(16*5*5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, 10)
        )
    def forward(self, x):
        features = self.feature_extractor(x)
        features = features.view(features.size(0), -1)
        class_scores = self.classifier(features)
        return class_scores

In [9]:
net = LeNet()
print(net)
print('LeNet(x0)=', net(x0))
print('shape=', net(x0).shape)

LeNet(
  (feature_extractor): Sequential(
    (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Linear(in_features=400, out_features=120, bias=True)
    (1): ReLU()
    (2): Linear(in_features=120, out_features=84, bias=True)
    (3): ReLU()
    (4): Linear(in_features=84, out_features=10, bias=True)
  )
)
LeNet(x0)= tensor([[-0.0121, -0.0345,  0.0950,  0.0754,  0.0334,  0.0507,  0.0379,  0.0738,
          0.1052, -0.0774]], grad_fn=<AddmmBackward>)
shape= torch.Size([1, 10])
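
To see where the `16*5*5` input size of the first `nn.Linear` comes from, this sketch rebuilds the same feature-extractor layers standalone and traces the shape of a 32x32 input after each one.

```python
import torch
import torch.nn as nn

# Same layers as LeNet.feature_extractor above, rebuilt standalone
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.zeros(1, 3, 32, 32)   # CIFAR10-sized input
shapes = []
for layer in feature_extractor:
    x = layer(x)
    shapes.append(tuple(x.shape[1:]))
    print(f'{layer.__class__.__name__:10s} -> {shapes[-1]}')

# The final volume is (16, 5, 5), i.e. 400 features after flattening
```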


### Fully-convolutional Networks

Notice how we never actually specified the input image size when implementing the network.

**Does this mean we can use the network on images of any size**?

**No**, because of the FC layers at the end.

Here, let's try:

In [10]:
large_image = torch.randn(1,3,32*2,32*2)
try:
    net(large_image)
except RuntimeError as e:
    print(e, file=sys.stderr)

size mismatch, m1: [1 x 2704], m2: [400 x 120] at /Users/soumith/mc3build/conda-bld/pytorch_1549597882250/work/aten/src/TH/generic/THTensorMath.cpp:940


However,
- Only the FC layers at the end require actual knowledge of exact image sizes.
- We can replace them with... convolutions, of course.
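
Why is this replacement legitimate? On a `(16,5,5)` feature volume, a 5x5 convolution with 120 kernels computes exactly what `nn.Linear(16*5*5, 120)` computes on the flattened volume. The sketch below checks this by copying the FC weights into a conv layer; it relies on the flattening order of `view`, which matches the `(C, H, W)` layout of the conv kernels.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc = nn.Linear(16 * 5 * 5, 120)
conv = nn.Conv2d(16, 120, kernel_size=5)

# Reinterpret each row of the FC weight matrix as a (16, 5, 5) kernel
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(120, 16, 5, 5))
    conv.bias.copy_(fc.bias)

features = torch.randn(2, 16, 5, 5)     # stand-in for the last feature volume
out_fc = fc(features.view(2, -1))       # shape (2, 120)
out_conv = conv(features)               # shape (2, 120, 1, 1)

print(out_conv.shape)
print(torch.allclose(out_fc, out_conv.view(2, -1), atol=1e-4))
```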


In [11]:
class LeNetFullyConv(LeNet):
    def __init__(self):
        super().__init__()
        # Remember: the last feature map volume has shape (16,5,5)
        # Override the classifier with 5x5 then 1x1 convolutions
        self.classifier = nn.Sequential(
            nn.Conv2d(16, 120, 5), # note: no padding or strides!
            nn.ReLU(),
            nn.Conv2d(120, 84, 1),
            nn.ReLU(),
            nn.Conv2d(84, 10, 1),
        )
    def forward(self, x):
        features = self.feature_extractor(x)
        # note: no need to reshape the features now
        class_scores = self.classifier(features)
        return class_scores

In [12]:
net_fully_conv = LeNetFullyConv()
print(net_fully_conv)

LeNetFullyConv(
  (feature_extractor): Sequential(
    (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Conv2d(16, 120, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
    (2): Conv2d(120, 84, kernel_size=(1, 1), stride=(1, 1))
    (3): ReLU()
    (4): Conv2d(84, 10, kernel_size=(1, 1), stride=(1, 1))
  )
)


Let's forward the original-sized image and the larger image through the network and observe the output shapes:

In [13]:
print('regular image output shape:', net_fully_conv(x0).shape)
print('large   image output shape:', net_fully_conv(large_image).shape)

regular image output shape: torch.Size([1, 10, 1, 1])
large   image output shape: torch.Size([1, 10, 9, 9])


**What's the meaning of the output after conversion to fully convolutional?**

It's now a **spatial classification map**.

<img src="img/fully_conv.png" width="800" />
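
One way to read such a map is to apply softmax over the channel (class) dimension and take the most likely class at every spatial location. This sketch uses random scores as a stand-in for the `(1, 10, 9, 9)` output above.

```python
import torch

torch.manual_seed(0)

# Stand-in for the fully-convolutional output on the large image
scores = torch.randn(1, 10, 9, 9)

probs = torch.softmax(scores, dim=1)   # class probabilities per location
pred_map = probs.argmax(dim=1)         # most likely class per location

print(probs.shape)      # torch.Size([1, 10, 9, 9])
print(pred_map.shape)   # torch.Size([1, 9, 9])
```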


## Residual Networks


- For image-related tasks it seems that **deeper is better**: learn more complex features
    <img src="img/deeper_meme.jpeg"/>

- In practice there are two major problems with adding depth:

1. More difficult convergence: vanishing gradients
1. More difficult optimization: parameter space increases

In theory, adding an additional layer should provide **at least** the same accuracy as before, since it could always just be an identity map.

In practice, not so:

<img src="img/resnet_plain_deep_error.png" width="800"/>

I.e., even if the same solution (or better) exists, SGD optimization can't find it.

ResNets attempt to address these issues by building a network architecture composed of convolutional blocks with added **shortcut-connections**:

<img src="img/resnet_block.png" width="400"/>

Here the weight layers are `3x3` or `1x1` convolutions followed by batch-normalization.
These shortcuts create two key advantages:
- Allow gradients to flow freely backwards
- Each block only learns the "residual mapping", i.e. some delta from the identity map which is easier to optimize.
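
Both advantages can be seen in a deliberately simplified sketch: a hypothetical `Residual` wrapper with a single conv layer as the main path, omitting the batch-norm and final ReLU of the real block. With the main path zeroed out, the block is exactly the identity map, and the gradient still reaches the input through the shortcut.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Simplified residual block: output = fn(x) + x (no batchnorm, no final ReLU)."""
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

torch.manual_seed(0)
main = nn.Conv2d(8, 8, kernel_size=3, padding=1)
block = Residual(main)

# Zero the main path: the "residual" is 0, so the block is the identity map
with torch.no_grad():
    main.weight.zero_()
    main.bias.zero_()

x = torch.randn(1, 8, 6, 6, requires_grad=True)
y = block(x)
y.sum().backward()

print(torch.allclose(y, x))                           # identity output
print(torch.allclose(x.grad, torch.ones_like(x)))     # gradient flows via the shortcut
```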

### PyTorch ResNet example

Let's implement the smallest residual network from the original paper, ResNet18.

<img src="img/resnet_arch_table.png" width="1000" />


In [14]:
import torch.nn.functional as F

class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, first_stride=1):
        super().__init__()
        self.main_path = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=first_stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

        self.shortcut_path = nn.Sequential()
        # Check if spatial or channel dimensions changed along main path
        # If so, we need to adjust the dimensions of the shortcut connection
        if first_stride != 1 or in_channels != out_channels:
            self.shortcut_path = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=first_stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    def forward(self, x):
        out = self.main_path(x)
        out += self.shortcut_path(x)
        out = F.relu(out)
        return out

In [15]:
class ResNet18(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.input_group = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Add 4 groups of 2 blocks, each group adding more output channels
        in_channels_curr = 64
        out_channels_per_group = [64, 128, 256, 512]
        for group_idx, out_channels in enumerate(out_channels_per_group):
            first_stride = 1 if group_idx == 0 else 2
            group = nn.Sequential(
                ResNetBlock(in_channels_curr, out_channels, first_stride),
                ResNetBlock(out_channels, out_channels, 1),
            )
            in_channels_curr = out_channels
            setattr(self, f'group{group_idx}', group)
            
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1)) # kernel is adaptive s.t. 1x1 is output size
        self.fc = nn.Linear(out_channels_per_group[-1], num_classes) 
    def forward(self, x):
        out = self.input_group(x)
        for group_idx in range(4):
            group = getattr(self, f'group{group_idx}')
            out = group(out)
        out = self.avgpool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

In [16]:
block = ResNetBlock(in_channels=3, out_channels=64, first_stride=2)
net = ResNet18(num_classes=10)
net

ResNet18(
  (input_group): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  )
  (group0): Sequential(
    (0): ResNetBlock(
      (main_path): Sequential(
        (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU()
        (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (shortcut_path): Sequential()
    )
    (1): ResNetBlock(
      (main_path): Sequential(
        (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): Bat

In [17]:
test_image = torch.randn(1,3,224,224)

print('First block output shape: ', block(test_image).shape)
print('ResNet output shape: ', net(test_image).shape)

First block output shape:  torch.Size([1, 64, 112, 112])
ResNet output shape:  torch.Size([1, 10])


**Image credits**

Images in this tutorial were taken and/or adapted from:

- Sebastian Raschka, https://sebastianraschka.com/
- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly 2017
- Deep Learning with Python, François Chollet, Manning 2018
- Stanford cs231n course notes by Andrej Karpathy
- https://github.com/vdumoulin/conv_arithmetic
- Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
- Canziani, A., Paszke, A., & Culurciello, E. (2016). An analysis of deep neural network models for practical applications.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.