$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$

# CS236605: Deep Learning
# Tutorial 5: Convolutional Neural Networks

## Introduction

In this tutorial, we will cover:

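- A reminder of the multilayer perceptron (MLP) and its applications
- Limitations of MLPs for image classification
- Convolutional layers: structure, interpretation as filters, hyperparameters and dimensions
- A PyTorch `Conv2d` layer example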

In [9]:
# Setup
%matplotlib inline
import os
import torch
import torchvision
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 20

## Theory Reminders

### Multilayer Perceptron (MLP)

#### Model
![img](https://qph.fs.quoracdn.net/main-qimg-330e8b2941bc0164211bbdc7d5c693f3)

Composed of multiple **layers**.

Each layer $j$ consists of $n_j$ regular perceptrons ("neurons") which calculate:
$$
\vec{y}_j = \varphi\left( \mat{W}_j \vec{y}_{j-1} + \vec{b}_j \right),~
\mat{W}_j\in\set{R}^{n_{j}\times n_{j-1}},~ \vec{b}_j\in\set{R}^{n_j}.
$$

- Note that both input and output are **vectors**. We can think of the above equation as describing a layer of **multiple perceptrons**.
- We'll henceforth refer to such layers as **fully-connected** or FC layers.


Given an input sample $\vec{x}^i$, the computed function of an $L$-layer MLP is:
$$
\vec{y}_L^i= \varphi \left(
\mat{W}_L \varphi \left( \cdots
\varphi \left( \mat{W}_1 \vec{x}^i + \vec{b}_1 \right)
\cdots \right)
+ \vec{b}_L \right)
$$

- Universal approximation theorem: an MLP with $L>1$ can approximate (almost) any function given enough parameters (Cybenko, 1989).
- This expression is fully differentiable w.r.t. parameters using the Chain Rule.
    - In practice, computing the gradients raises many challenges.
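As a minimal sketch of the expression above, we can build a two-layer MLP from `nn.Linear` layers and let autograd apply the chain rule. The layer sizes and the choice of ReLU for $\varphi$ are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer sizes: n_0=4 input features, n_1=8 hidden neurons, n_2=3 outputs
n0, n1, n2 = 4, 8, 3

mlp = nn.Sequential(
    nn.Linear(n0, n1),  # W_1 has shape (n_1, n_0), b_1 has shape (n_1,)
    nn.ReLU(),          # phi (assumed to be ReLU here)
    nn.Linear(n1, n2),  # W_2 has shape (n_2, n_1), b_2 has shape (n_2,)
    nn.ReLU(),          # phi is also applied after the last layer, as in the formula
)

x = torch.randn(1, n0)  # a single input sample, with a batch dimension
y = mlp(x)              # y_L, shape (1, n_2)

# Backpropagate a scalar; autograd applies the chain rule through all layers
y.sum().backward()

w1 = mlp[0].weight
print(w1.shape)       # torch.Size([8, 4])
print(w1.grad.shape)  # torch.Size([8, 4])
```

Note that `nn.Linear` stores its weight as `(out_features, in_features)`, which matches $\mat{W}_j\in\set{R}^{n_{j}\times n_{j-1}}$.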

#### Applications

##### Regression

<img src="https://docs.microsoft.com/en-us/azure/machine-learning/studio/media/algorithm-choice/image2.png" alt="regression" width="600"/>

- Output: $\hat{\vec{y}^i} = \vec{y}^i_L$
- Quadratic loss: $\sum_i \norm{\vec{y}^i - \hat{\vec{y}^i}}^2$

##### Classification
<img src="https://ml.berkeley.edu/blog/assets/tutorials/1/image_3.svg" width="400" alt="classification">

- Output: $\hat{\vec{y}^i} = \mathrm{softmax}(\vec{y}^i_L)$ (class probabilities)
- Cross entropy loss: $\sum_i - {\vectr{y}}^i \log(\hat{\vec{y}^i})$
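The two output/loss pairs above can be sketched as follows. All tensor values here are made up for illustration; `y_L` stands for the raw output of the last layer for a batch of two samples.

```python
import torch
import torch.nn.functional as F

# Hypothetical last-layer outputs for a batch of 2 samples and 3 outputs/classes
y_L = torch.tensor([[2.0, 0.5, -1.0],
                    [0.1, 0.2, 3.0]])

# Regression: the prediction is y_L itself; the quadratic loss sums squared errors
y_reg = torch.tensor([[1.5, 0.0, -1.0],
                      [0.0, 0.0, 2.0]])
quadratic_loss = ((y_reg - y_L) ** 2).sum()

# Classification: softmax turns y_L into class probabilities (each row sums to 1)
y_hat = torch.softmax(y_L, dim=1)
y_onehot = torch.tensor([[1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0]])
ce_loss = -(y_onehot * torch.log(y_hat)).sum()

# The built-in takes raw scores and class indices, and applies softmax internally
ce_builtin = F.cross_entropy(y_L, torch.tensor([0, 2]), reduction='sum')

print(quadratic_loss.item())
print(torch.allclose(ce_loss, ce_builtin))
```

In practice we would usually pass the raw scores to `F.cross_entropy` (or `nn.CrossEntropyLoss`) rather than computing the softmax and logarithm separately, since the fused version is more numerically stable.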

To explore ConvNets we'll now be focusing our attention mainly on the task of classifying images.

#### Limitations of MLPs for image classification

- Due to full connectivity, the number of weights per neuron equals the number of input values, so it grows quadratically with the image's side length.
    - 28x28 MNIST image: 784 weights per neuron in the first layer
    - 1000x1000x3 color image: 3M weights **per neuron**
    
    <img src="img/vanilla_dnn_scale.png" width="400" alt="scale">

- Huge number of parameters greatly increases risk of overfitting
    - MLP with 1 hidden layer, 3, 6 and 20 Neurons
    
    <img src="img/overfit_1HL_3-6-20N.jpg" width="500" alt="overfit1">
    
    - MLP with 1, 2 and 4 hidden layers, 3 neurons each
    
    <img src="img/overfit_1-2-4HL_3N.jpg" width="500" alt="overfit1">

- FC layers are highly sensitive to translation, while image features are inherently translation-invariant

Despite all these limitations we still want to use deep neural nets because they allow us to **learn hierarchical features** from the data.
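To get a feeling for the scaling problem, here is a small back-of-the-envelope calculation of the parameter count of a single FC layer. The helper `fc_layer_params` and the hidden layer width of 100 neurons are hypothetical, chosen only for illustration.

```python
def fc_layer_params(in_features: int, out_features: int) -> int:
    """Number of weights plus biases in one fully-connected layer."""
    return in_features * out_features + out_features


mnist_inputs = 28 * 28          # 784 input values
color_inputs = 1000 * 1000 * 3  # 3,000,000 input values
hidden = 100                    # hypothetical number of neurons in the first layer

print(fc_layer_params(mnist_inputs, hidden))  # 78500
print(fc_layer_params(color_inputs, hidden))  # 300000100
```

Even with a modest first layer, the color image already requires roughly 300M parameters.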

## Convolutional Layers

### Structure

A convolutional layer is similar to an MLP FC layer but with three important distinctions:
1. Each neuron is only **connected to a small region** of the previous layer's output.
1. The neurons are stacked in a **3D** grid (instead of 1D).
1. Neurons that are at the same depth in the grid **share the same weights** (parameters $\mat{W},~\vec{b}$).

![cnn_layer](img/cnn_layer.jpeg)

In the above image, the colors of the neurons represent their weights.

Two important things to understand about convolutional layers:
- They operate on and produce **volumes** (3D tensors).

   <img src="img/cnn_layers.jpeg" width="400" />

- Each neuron is spatially local, but operates on the **full depth** dimension of its input layer.

   <img src="img/depthcol.jpeg" width="300" />

### Interpretation as filters

Since each neuron in a given depth-slice operates on a small region of the input layer, we can think of the combined **output of that depth-slice** as the **convolution between a filter and the input volume**.

<img src="img/cnn_filters.png" width="500" />

Since we have multiple depth-slices per convolutional layer, the layer computes multiple convolutions of the same input with different kernels (filters).

Each 2D slice of an input or output volume is known as a **feature map** or a **channel**.

[Visualization of a convolutional filter](http://cs231n.github.io/assets/conv-demo/index.html).
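The filter interpretation can be checked numerically. The sketch below recomputes a single output neuron by hand as the dot product between one kernel and a small input region spanning the full depth, plus a bias. The input shape and the chosen indices are arbitrary assumptions. Note that PyTorch's `Conv2d` computes a cross-correlation, i.e. the kernel is not flipped.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(1, 3, 8, 8)  # hypothetical input volume: (batch, C_in, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)  # no padding, stride 1

out = conv(x)  # shape (1, 4, 6, 6): one feature map per kernel

# Recompute one output neuron by hand: kernel k at spatial position (i, j)
k, i, j = 2, 1, 4
patch = x[0, :, i:i + 3, j:j + 3]  # small spatial region, full depth: (3, 3, 3)
manual = (conv.weight[k] * patch).sum() + conv.bias[k]

print(out.shape)
print(torch.allclose(out[0, k, i, j], manual, atol=1e-5))
```

Every neuron in depth-slice `k` uses the same `conv.weight[k]` and `conv.bias[k]`; only the `patch` changes with the spatial position.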

### Hyperparameters & dimensions

Assume an input volume of shape $(C_{\mathrm{in}}, H_{\mathrm{in}}, W_{\mathrm{in}})$, i.e. channels, height, width.

Define,

1. Number of kernels, $K \geq 1$.
2. Spatial extent (size) of each kernel, $F \geq 1$. 
3. Stride $S\geq 1$: spatial distance between consecutive applications of a kernel.
4. Padding $P\geq 0$: Number of "pixels" to zero-pad around each input feature map.
5. Dilation $D \geq 1$: Spacing between kernel elements when applying to input.

| $P=0,~S=1,~D=1$ | $P=1,~S=1,~D=1$ | $P=1,~S=2,~D=1$ | $P=0,~S=1,~D=2$ |
|-----------------|-----------------|-----------------| --------------- |
|<img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif" width="200"/>| <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/same_padding_no_strides.gif" width="200"/> | <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/padding_strides.gif" width="200"/> | <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/dilation.gif" width="200"/> |

In the above animations, **blue** maps are inputs,
**cyan** maps are outputs and
the **shaded** area is the kernel with $F=3$.

We can see that the second combination, $F=3,~P=1,~S=1,~D=1$, leads to identical sizes of input and output feature maps.

Then,

- Each convolution kernel will be a tensor of shape $(C_{\mathrm{in}}, F, F)$.
- The output volume dimensions will be:

  $$\begin{align}
  H_{\mathrm{out}} &= \left\lfloor \frac{H_{\mathrm{in}} + 2P - D\cdot(F-1) -1}{S} \right\rfloor + 1\\
  W_{\mathrm{out}} &= \left\lfloor \frac{W_{\mathrm{in}} + 2P - D\cdot(F-1) -1}{S} \right\rfloor + 1\\
  C_{\mathrm{out}} &= K\\
  \end{align}$$

- The number of parameters in the layer will be:

$$
\underbrace{K \cdot C_{\mathrm{in}} \cdot F^2}_{\mathrm{weights}} +
\underbrace{K}_{\mathrm{biases}}
$$
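The output-size formula above is easy to sketch as a small helper. The function name `conv_output_size` and the input size of 32 are assumptions for illustration; the hyperparameter combinations are the four from the animations, with $F=3$.

```python
def conv_output_size(size_in: int, F: int, S: int = 1, P: int = 0, D: int = 1) -> int:
    """Spatial output size (height or width) of a convolutional layer."""
    return (size_in + 2 * P - D * (F - 1) - 1) // S + 1


size_in = 32  # hypothetical input height/width

print(conv_output_size(size_in, F=3, P=0, S=1, D=1))  # 30
print(conv_output_size(size_in, F=3, P=1, S=1, D=1))  # 32 (same as the input)
print(conv_output_size(size_in, F=3, P=1, S=2, D=1))  # 16
print(conv_output_size(size_in, F=3, P=0, S=1, D=2))  # 28
```

The same function applies independently to the height and the width, so non-square inputs simply call it twice.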

**Example**: Input image is 1000x1000x3, and the first conv layer has $10$ kernels of size 5x5.
The number of parameters in the first layer will be: $10 \cdot 3 \cdot 5^2 + 10 = 760$. Note that this does not depend on the width and height of the image.
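We can sanity-check this count against PyTorch by summing the sizes of the layer's parameter tensors. This is only a quick sketch; the variable names are our own.

```python
import torch.nn as nn

# 10 kernels of size 5x5 operating on a 3-channel input
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5)

n_params = sum(p.numel() for p in conv.parameters())

print(conv.weight.shape)  # torch.Size([10, 3, 5, 5])
print(conv.bias.shape)    # torch.Size([10])
print(n_params)           # 760
```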

### PyTorch `Conv2d` layer example

In [58]:
import torchvision.transforms as tvtf

# expanduser('~') also works where the HOME environment variable is not set
data_dir = os.path.join(os.path.expanduser('~'), '.pytorch-datasets')
tf = tvtf.Compose([tvtf.ToTensor()])
ds_cifar10 = torchvision.datasets.CIFAR10(data_dir, download=True, train=True, transform=tf)

Files already downloaded and verified


In [59]:
# Load first CIFAR10 image
x0,y0 = ds_cifar10[0]
# add batch dim
x0 = x0.unsqueeze(0)
print('x0 shape with batch dim:', x0.shape)

x0 shape with batch dim: torch.Size([1, 3, 32, 32])


In [62]:
import torch.nn as nn

# First conv layer: works on input image volume
conv1 = nn.Conv2d(in_channels=x0.size(1), out_channels=10, padding=1, kernel_size=3, stride=1)

# Second conv layer: works on output volume of first layer
conv2 = nn.Conv2d(in_channels=10, out_channels=20, padding=0, kernel_size=6, stride=2)

print(f'{"Input image shape:":25s}{x0.shape}')
print(f'{"After first conv layer:":25s}{conv1(x0).shape}')
print(f'{"After second conv layer:":25s}{conv2(conv1(x0)).shape}')

Input image shape:       torch.Size([1, 3, 32, 32])
After first conv layer:  torch.Size([1, 10, 32, 32])
After second conv layer: torch.Size([1, 20, 14, 14])


**Note**: observe that the width and height dimensions of the input image were never specified!
More on the significance of that later.
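As a quick illustration of this point, the sketch below applies one and the same `Conv2d` layer to inputs of several spatial sizes. The sizes are arbitrary assumptions; only the number of input channels has to match.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3, padding=1)

# Hypothetical input sizes (height, width); the layer itself is unchanged
out_shapes = []
for h, w in [(32, 32), (64, 48), (100, 200)]:
    x = torch.randn(1, 3, h, w)
    out_shapes.append(tuple(conv(x).shape))
    print((h, w), '->', out_shapes[-1])
```

With $F=3,~P=1,~S=1$ the spatial size is preserved in each case, while the number of channels becomes 10.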

**Image credits**

Some images in this tutorial were taken and/or adapted from:
- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly 2017
- Deep Learning with Python, François Chollet, Manning 2018
- Stanford cs231n course site
- https://github.com/vdumoulin/conv_arithmetic