<a href="https://colab.research.google.com/github/yexf308/MachineLearning/blob/main/Module3/Convolutional_Neural_Network_Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$\def\m#1{\mathbf{#1}}$
$\def\mm#1{\boldsymbol{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$
$\def\mr#1{\mathrm{#1}}$
$\newenvironment{rmat}{\left[\begin{array}{rrrrrrrrrrrrr}}{\end{array}\right]}$
$\newcommand\brm{\begin{rmat}}$
$\newcommand\erm{\end{rmat}}$
$\newenvironment{cmat}{\left[\begin{array}{ccccccccc}}{\end{array}\right]}$
$\newcommand\bcm{\begin{cmat}}$
$\newcommand\ecm{\end{cmat}}$

# Intro to Convolutional Neural Network
Let's look at our MNIST example. It has 60,000 training images and 10,000 test images, each labeled with one of 10 digit classes.
Here are the test errors of several classifiers on MNIST.

| Classifier               | Test error |
|--------------------------|------------|
| Linear classifier        | 12%        |
| SVM, Gaussian kernel     | 1.4%       |
| SVM, degree 4 polynomial | 1.1%       |
| Best SVM result          | 0.56%      |
| 2-layer NN               | ~3%        |
| 3-layer NN               | ~2.5%      |
| CNN, LeNet-5 (1998)      | 0.85%      |
| Larger CNN (2011, 2012)  | ~0.3%      |


MNIST digits are done! 

### ImageNet Classification Challenge
The ImageNet dataset has ~1.2M color images of size $256\times256$ for training and 100K for testing, covering ~1000 object classes.

<img src="https://github.com/yexf308/MAT592/blob/main/image/imagenet1.png?raw=true" width="800" />


**Models are Getting Deeper and Larger**.
Training large-scale networks can be parallelized on GPUs, an essential ingredient of the deep learning revolution. 
<img src="https://github.com/yexf308/MAT592/blob/main/image/imagenet2.png?raw=true" width="800" />

**Why?**
Fully connected networks don't work well for computer vision
applications! They have too many parameters, which makes them very hard to train, so they perform poorly. 

**Structure of VGG**

<img src="https://github.com/yexf308/MAT592/blob/main/image/vgg.png?raw=true" width="500" />


If the first layer were fully connected:
- Input: $224\times 224\times 3$

- Output: $224\times 224\times 64$.

- Number of parameters: $(224\cdot 224\cdot 3)\times(224\cdot 224\cdot 64)\approx 483$ billion!
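
The count above can be checked with a few lines of arithmetic. The sketch below also prints, for comparison, the parameter count of a convolutional layer with 64 filters of kernel size $3\times 3$ over 3 input channels; that kernel size is an assumption chosen for illustration, not a number stated above.

```python
# Fully connected layer: every output unit is connected to every input value.
n_in = 224 * 224 * 3        # flattened input size
n_out = 224 * 224 * 64      # flattened output size
dense_weights = n_in * n_out

# Convolutional layer (assumed 3x3 kernels, 64 filters, 3 input channels).
conv_weights = 64 * 3 * 3 * 3
conv_biases = 64            # one bias per filter

print(f"dense weights: {dense_weights:,}")                 # 483,385,147,392
print(f"conv params:   {conv_weights + conv_biases:,}")    # 1,792
```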


Two important layers we will discuss later:
- Convolution: **Local connectivity** and **parameter sharing**. 
- Pooling: reduces the size of the representation (down-sampling).






## Convolutional layers
- It is a **linear** layer!

- It takes an image, or the features output by a hidden layer, as input in its original shape without vectorization, i.e., the input dimension is $224\times 224\times 3$. 

- It can better capture the spatial information in an image. 

- Trainable weights are typically 3-D or 4-D tensors, called **filters**. 

- **Weight sharing** and **local (sparse) connectivity**, in contrast to the full connectivity of dense layers. 

- Much more compact in terms of model size (i.e., fewer parameters).


### Local connectivity

<img src="https://github.com/yexf308/MAT592/blob/main/image/cnn1.png?raw=true" width="500" />

<img src="https://github.com/yexf308/MAT592/blob/main/image/cnn2.png?raw=true" width="500" />

Each hidden unit is connected only to a small spatial sub-region of the input, but within that sub-region it is connected to all channels (R, G, B). 

### Parameter Sharing
We make one reasonable **assumption**: 
if a feature is useful to compute at some spatial position $(x_1, y_1)$, then it
should also be useful to compute at a different position $(x_2, y_2)$. The same filter weights are therefore reused at every position.

### Convolution Operator
The convolution of an image $\m{x}$ with a kernel $\m{k}$ is computed as
\begin{align}
(\m{x}\ast \m{k})_{ij} = \sum_{pq} \m{x}_{i+p, j+q}\m{k}_{p,q}
\end{align}
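
The formula can be transcribed directly into NumPy. The following is a minimal sketch, assuming stride 1 and no padding; the function name `conv2d` and the example input and kernel are illustrative choices. Note that the formula does not flip the kernel, so strictly speaking it is a cross-correlation, which is what deep learning libraries usually call convolution.

```python
import numpy as np

def conv2d(x, k):
    """out[i, j] = sum over (p, q) of x[i + p, j + q] * k[p, q], stride 1, no padding."""
    kh, kw = k.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the window starting at (i, j).
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(9, dtype=float).reshape(3, 3)   # illustrative 3x3 input: 0, 1, ..., 8
k = np.array([[1.0, 0.0],
              [0.0, -1.0]])                   # illustrative 2x2 kernel
y = conv2d(x, k)
print(y)
# [[-4. -4.]
#  [-4. -4.]]
```

Each output entry is $x_{i,j} - x_{i+1,j+1}$, which equals $-4$ everywhere for this particular input.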

<img src="https://github.com/yexf308/MAT592/blob/main/image/conv.png?raw=true" width="400" />



<img src="https://github.com/yexf308/MAT592/blob/main/image/conv1.png?raw=true" width="400" />
<img src="https://github.com/yexf308/MAT592/blob/main/image/conv2.png?raw=true" width="400" />

<img src="https://github.com/yexf308/MAT592/blob/main/image/conv3.png?raw=true" width="400" />
<img src="https://github.com/yexf308/MAT592/blob/main/image/conv4.png?raw=true" width="400" />


## Remarks for 2-D convolutions
2-D convolution works for matrix input:

- One convolution filter of (kernel) size $2\times 2$ (with just 4 weights)
transforms $3\times 3$ input into $2\times 2 $ output. 

- In contrast, a dense layer mapping 9 inputs to 4 outputs would have $4\times 9 = 36$ weights!

- A bias term can be added. 

- There can be multiple filters of (kernel) size $2\times 2$ (say, 4 filters with $4\times 4 = 16$
weights in total); the output is then a $4 \times2 \times 2$ tensor (each filter gives a $2\times 2$
slice of the 3-D tensor).

- Typical choices of kernel size are $3\times 3, 5\times 5$ and $7\times 7$. 
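
To make the comparison with the dense layer concrete, the sketch below writes the $2\times 2$ convolution of a $3\times 3$ input as a $4\times 9$ matrix acting on the flattened input. The input and kernel values are arbitrary choices for illustration. The matrix has 36 entries, but only 16 are nonzero (local connectivity) and they take only 4 distinct values (parameter sharing).

```python
import numpy as np

x = np.arange(9, dtype=float).reshape(3, 3)   # illustrative 3x3 input
k = np.array([[1.0, 2.0],
              [3.0, 4.0]])                    # illustrative 2x2 kernel (4 weights)

# Equivalent dense weight matrix: one row per output pixel (i, j),
# one column per input pixel, with the kernel weights placed in each row.
W = np.zeros((4, 9))
for i in range(2):
    for j in range(2):
        for p in range(2):
            for q in range(2):
                W[2 * i + j, 3 * (i + p) + (j + q)] = k[p, q]

y_dense = (W @ x.ravel()).reshape(2, 2)

# Direct sliding-window computation for comparison.
y_conv = np.array([[np.sum(x[i:i + 2, j:j + 2] * k) for j in range(2)]
                   for i in range(2)])

print(np.count_nonzero(W), "nonzero entries out of", W.size)  # 16 nonzero entries out of 36
print(np.allclose(y_dense, y_conv))                            # True
```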

### Example:
- Input Size: $3\times d\times d$.

- Filter Size: $3\times 3\times 3$ with kernel size $3\times 3$, stride=1. 

- Output Size: $(d-2)\times (d-2)$ for each filter. Convolutions reduce image dimensions! Each $3\times 3$ layer removes 2 pixels per side length, so with many stacked convolutional layers the feature maps get smaller and smaller. This is a problem. 

 

<img src="https://github.com/yexf308/MAT592/blob/main/image/cnn3.png?raw=true" width="600" />

<img src="https://github.com/yexf308/MAT592/blob/main/image/cnn4.png?raw=true" width="600" />

<img src="https://github.com/yexf308/MAT592/blob/main/image/cnn5.png?raw=true" width="600" />



### Zero padding

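Zero padding surrounds the input with zeros so that the filter can slide over the boundary.
This makes the size of the output layer easier to control. For example, with a $3\times 3$ kernel, stride 1, and one pixel of zero padding on each side, the input and output have the same dimension: both are $d\times d$. 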
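
The effect of zero padding on the output size can be checked numerically. This is a minimal sketch using `np.pad` and a hand-written sliding-window helper; the helper name `conv2d`, the size $d = 6$, and the averaging kernel are assumptions made for illustration.

```python
import numpy as np

def conv2d(x, k):
    """Sliding-window sum of products with stride 1 and no padding."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

d = 6
x = np.random.default_rng(0).normal(size=(d, d))
k = np.ones((3, 3)) / 9.0                      # illustrative 3x3 averaging kernel

y_valid = conv2d(x, k)                         # no padding: output shrinks
x_pad = np.pad(x, pad_width=1, mode="constant", constant_values=0.0)
y_same = conv2d(x_pad, k)                      # one pixel of zeros on every side

print(y_valid.shape)   # (4, 4)
print(x_pad.shape)     # (8, 8)
print(y_same.shape)    # (6, 6)
```

For an odd kernel size $d_{ker}$ and stride 1, padding $(d_{ker}-1)/2$ pixels on each side keeps the size unchanged.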



![padding](https://github.com/yexf308/MAT592/blob/main/image/padding.gif?raw=true "padding")


## 3-D inputs to convolutional layer
Without zero padding, convolving $d_{in} \times w_{in}\times h_{in}$ input features with $d_{out}$ filters of size $d_{in}\times d_{ker}\times d_{ker}$
and stride 1 gives output features of size
$d_{out} \times (w_{in} - d_{ker} + 1) \times (h_{in} - d_{ker} + 1)$.
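
The output-size formula can be verified with a loop-based sketch of a multi-channel convolutional layer. The function name `conv_layer` and the concrete sizes below are illustrative assumptions; bias and activation are omitted.

```python
import numpy as np

def conv_layer(x, filters):
    """x: (d_in, w_in, h_in); filters: (d_out, d_in, d_ker, d_ker). Stride 1, no padding."""
    d_out, d_in, d_ker, _ = filters.shape
    assert x.shape[0] == d_in, "each filter must span all input channels"
    w_out = x.shape[1] - d_ker + 1
    h_out = x.shape[2] - d_ker + 1
    out = np.zeros((d_out, w_out, h_out))
    for o in range(d_out):
        for i in range(w_out):
            for j in range(h_out):
                # Sum over all input channels and the d_ker x d_ker window.
                out[o, i, j] = np.sum(x[:, i:i + d_ker, j:j + d_ker] * filters[o])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 32, 28))         # d_in = 3, w_in = 32, h_in = 28
filters = rng.normal(size=(8, 3, 5, 5))  # d_out = 8, d_ker = 5
y = conv_layer(x, filters)
print(y.shape)   # (8, 28, 24)
```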

<img src="https://github.com/yexf308/MAT592/blob/main/image/cnn6.png?raw=true" width="400" />




## Pooling layer
- It's common to insert a pooling layer between successive
convolutional layers. 

- Pooling layers reduce the spatial size of the features, which reduces the amount of computation and the number of parameters in later layers. 

- It operates on each feature slice independently.

### Max Pooling
$2\times 2$ max pooling (with stride 2) is commonly used. By pooling, we gain robustness to the exact spatial location of features. 
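
A minimal NumPy sketch of $2\times 2$ max pooling with stride 2 follows. It assumes a `(channels, height, width)` layout with even height and width; the function name and the example values are illustrative.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2, applied to each channel independently."""
    c, h, w = x.shape
    # Split each spatial axis into (blocks, 2) and take the max inside each 2x2 block.
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.array([[[1, 3, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]], dtype=float)   # one channel, 4x4
pooled = max_pool_2x2(x)
print(pooled)
# [[[6. 8.]
#   [3. 4.]]]
```

Moving the largest value to another position inside the same $2\times 2$ block leaves the output unchanged, which is the robustness to exact location mentioned above.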

<img src="https://github.com/yexf308/MAT592/blob/main/image/pooling1.png?raw=true" width="800" />

<img src="https://github.com/yexf308/MAT592/blob/main/image/pooling2.png?raw=true" width="400" />

## Examples of CNNs

### LeNet
<img src="https://github.com/yexf308/MAT592/blob/main/image/LeNet.png?raw=true" width="600" />

Pad the $28\times 28$ MNIST images with zeros to increase their size to $32\times 32$.
- Input: 32 × 32 images (MNIST)

- Convolution 1: 6 filters of size 5 × 5, stride 1 (followed by a sigmoid
activation layer)
   - Output: 6 maps of size 28 × 28 

- Pooling 1: 2 × 2 max pooling, stride 2
   - Output: 6 maps of size 14 × 14

- Convolution 2: 16 filters of size 5 × 5, stride 1 (followed by a sigmoid
activation layer)
   - Output: 16 maps of size 10 × 10

- Pooling 2: 2 × 2 max pooling, stride 2  
   - Output: 16 maps of size 5 × 5 (400 values in total)

- 3 fully connected layers: 120 neurons (followed by sigmoid activation) ⇒ 84 neurons (followed by sigmoid activation) ⇒ 10 neurons (followed by a softmax output layer)        

In total there are 61,706 trainable parameters, including weights and biases. The network
easily reaches > 99% test accuracy on MNIST. 

Nowadays ReLU activations are used instead of sigmoids for better accuracy.
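
The layer sizes and the total of 61,706 parameters can be reproduced with simple bookkeeping. The sketch below assumes that every filter in Convolution 2 connects to all 6 input maps and that every filter and neuron has one bias; the helper names are illustrative.

```python
def conv_params(n_filters, in_channels, k):
    """Weights plus one bias per filter."""
    return n_filters * (in_channels * k * k + 1)

def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out

size = 32
size = size - 5 + 1      # Convolution 1: 32 -> 28
size //= 2               # Pooling 1:     28 -> 14
size = size - 5 + 1      # Convolution 2: 14 -> 10
size //= 2               # Pooling 2:     10 -> 5
flat = 16 * size * size  # 16 maps of 5 x 5 = 400 values

counts = {
    "conv1": conv_params(6, 1, 5),      # 156
    "conv2": conv_params(16, 6, 5),     # 2,416
    "fc1": dense_params(flat, 120),     # 48,120
    "fc2": dense_params(120, 84),       # 10,164
    "fc3": dense_params(84, 10),        # 850
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")   # total: 61,706
```

Most of the parameters sit in the first fully connected layer, not in the convolutional layers.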

### AlexNet
<img src="https://github.com/yexf308/MAT592/blob/main/image/AlexNet.png?raw=true" width="600" />

AlexNet: 
- 8 learned layers in total (5 convolutional and 3 fully connected), about 60 million
parameters and 650,000 neurons.
- Trained on the ImageNet dataset. 


<img src="https://github.com/yexf308/MAT592/blob/main/image/compare_cnn.png?raw=true" width="600" />

### Learned kernel in AlexNet

- The receptive field of a neuron is the region of the input that can affect the neuron's output. 

- The receptive field of a first-layer neuron is a small neighborhood of the input (its size equals the kernel size) ⇒ it captures very local patterns. 
 
- For neurons in higher layers, the receptive field can be much larger ⇒
they capture more global patterns. 
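
How the receptive field grows with depth can be computed layer by layer with a standard recursion: each layer with kernel size $k$ enlarges the receptive field by $(k-1)$ times the product of the strides of all earlier layers. The sketch below applies this to the convolution and pooling stack of the LeNet example rather than to AlexNet; the function name and the layer list are illustrative assumptions.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, from the input upward.
    Returns the receptive field side length (in input pixels) after each layer."""
    r, jump = 1, 1          # jump = distance in input pixels between adjacent units
    sizes = []
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
        sizes.append(r)
    return sizes

# conv 5x5 (stride 1), pool 2x2 (stride 2), conv 5x5 (stride 1), pool 2x2 (stride 2)
lenet_like = [(5, 1), (2, 2), (5, 1), (2, 2)]
fields = receptive_field(lenet_like)
print(fields)   # [5, 6, 14, 16]
```

A unit after the second pooling layer therefore sees a $16\times 16$ patch of the $32\times 32$ input, while a first-layer unit sees only $5\times 5$.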

<img src="https://github.com/yexf308/MAT592/blob/main/image/AlexNet2.png?raw=true" width="400" />

<img src="https://github.com/yexf308/MAT592/blob/main/image/AlexNet3.png?raw=true" width="600" />


## Training a CNN

- Apply SGD to minimize the in-sample training error.
- Backpropagation can be extended to convolutional layers and pooling
layers to compute the gradient!
- Millions of parameters ⇒ easy to overfit. More to be discussed! 
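
As a small illustration of backpropagation through a convolutional layer, consider a single-channel layer $\m{y} = \m{x}\ast\m{k}$ (stride 1, no padding) and a loss $L$ with upstream gradient $\m{g} = \partial L/\partial \m{y}$. The chain rule gives $\partial L/\partial \m{k}_{p,q} = \sum_{ij} \m{g}_{ij}\,\m{x}_{i+p,j+q}$, which is again a sliding-window operation, now of $\m{x}$ with $\m{g}$. The sketch below checks this against finite differences; the helper name `conv2d`, the sizes, and the linear test loss $L = \sum_{ij} \m{g}_{ij}\m{y}_{ij}$ are assumptions made for illustration.

```python
import numpy as np

def conv2d(x, k):
    """Sliding-window sum of products with stride 1 and no padding."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))
k = rng.normal(size=(3, 3))
g = rng.normal(size=(3, 3))      # upstream gradient dL/dy for L = sum(g * y)

# Backpropagation: dL/dk[p, q] = sum_{i,j} g[i, j] * x[i + p, j + q]
grad_k = conv2d(x, g)

# Central finite differences as an independent check.
eps = 1e-6
grad_num = np.zeros_like(k)
for p in range(3):
    for q in range(3):
        k_plus, k_minus = k.copy(), k.copy()
        k_plus[p, q] += eps
        k_minus[p, q] -= eps
        loss_plus = np.sum(g * conv2d(x, k_plus))
        loss_minus = np.sum(g * conv2d(x, k_minus))
        grad_num[p, q] = (loss_plus - loss_minus) / (2 * eps)

print(np.allclose(grad_k, grad_num, atol=1e-5))   # True
```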