# Lecture 9 CNN Architectures - Private Notes

- AlexNet, VGG, GoogLeNet, ResNet, etc.
- slides : http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture9.pdf
- Videos : https://www.youtube.com/watch?v=DAOcjicFr1Y&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv

# Today: CNN Architectures

## Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet

## Also....
- NiN (Network in Network)
- Wide ResNet
- ResNeXt
- Stochastic Depth

# Review: LeNet-5 [LeCun et al., 1998]

Conv filters were 5x5, applied at stride 1<br>
Subsampling (Pooling) layers were 2x2 applied at stride 2<br>
i.e. architecture is [CONV-POOL-CONV-POOL-FC-FC]<br>
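
The shrinking spatial size through the CONV and POOL layers can be sketched as below. This is only an illustration: the 32x32 input size and zero padding are assumptions taken from the LeNet-5 paper rather than from these notes, and `out_size` is a hypothetical helper name.

```python
def out_size(n, f, stride, pad=0):
    """Spatial output size for an n x n input and an f x f filter."""
    return (n + 2 * pad - f) // stride + 1

size = 32  # assumed input width/height (not stated in the notes)
trace = [size]
for name, f, stride in [("CONV", 5, 1), ("POOL", 2, 2), ("CONV", 5, 1), ("POOL", 2, 2)]:
    size = out_size(size, f, stride)
    trace.append(size)
    print(f"{name}: {size}x{size}")
```

Under these assumptions the feature maps are 5x5 when they reach the FC layers.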

# Case Study: AlexNet [Krizhevsky et al. 2012]
First CNN-based winner of the ImageNet (ILSVRC) classification challenge
## Architecture:
CONV1 - MAX POOL1 - NORM1 <br>
CONV2 - MAX POOL2 - NORM2 <br>
CONV3 - CONV4 - CONV5 - MAX POOL3 <br>
FC6 - FC7 - FC8

### Input: 227x227x3 images 
### First layer (CONV1): 96 11x11 filters applied at stride 4

Q: what is the output volume size? Hint: (227-11)/4+1 = 55 <br>
=> Output volume [55x55x96]<br>
Q: What is the total number of parameters in this layer?<br>
=> Parameters: (11*11*3)*96 = 35K<br>

### Input: 227x227x3 images
### After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2<br>
Q: what is the output volume size? Hint: (55-3)/2+1 = 27<br>
=> Output volume: 27x27x96<br>
Q: what is the number of parameters in this layer?<br>
=> 0!<br>
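
The two hints above use the same formula, (N - F) / stride + 1. A minimal sketch of that arithmetic follows; the helper names `conv_output_size` and `conv_params` are made up for illustration, and the bias option is an addition (the count above ignores biases).

```python
def conv_output_size(n, f, stride, pad=0):
    """Output width/height: (N + 2*pad - F) / stride + 1."""
    return (n + 2 * pad - f) // stride + 1

def conv_params(f, c_in, c_out, bias=False):
    """Weights in a conv layer: each of the c_out filters is f x f x c_in."""
    per_filter = f * f * c_in + (1 if bias else 0)
    return per_filter * c_out

conv1 = conv_output_size(227, 11, 4)      # CONV1: 11x11 filters, stride 4
pool1 = conv_output_size(conv1, 3, 2)     # POOL1: 3x3 window, stride 2
conv1_weights = conv_params(11, 3, 96)    # 96 filters over 3 input channels

print(conv1, pool1, conv1_weights)        # 55 27 34848
```

Pooling has no learned weights, which is why POOL1 contributes zero parameters.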

## Full (simplified) AlexNet architecture:
[227x227x3] INPUT<br>
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0<br>
[27x27x96] MAX POOL1: 3x3 filters at stride 2<br>
[27x27x96] NORM1: Normalization layer<br>
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2<br>
[13x13x256] MAX POOL2: 3x3 filters at stride 2<br>
[13x13x256] NORM2: Normalization layer<br>
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1<br>
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1<br>
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1<br>
[6x6x256] MAX POOL3: 3x3 filters at stride 2<br>
[4096] FC6: 4096 neurons<br>
[4096] FC7: 4096 neurons<br>
[1000] FC8: 1000 neurons (class scores)<br>
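
The shapes in the table can be reproduced by applying the output-size formula layer by layer. The sketch below does that for the CONV and POOL layers; NORM layers are left out because they keep the shape unchanged. The tuple layout and the `out_size` helper are illustrative choices, not part of the lecture.

```python
def out_size(n, f, stride, pad):
    return (n + 2 * pad - f) // stride + 1

# (name, kind, number of filters, filter size, stride, pad)
layers = [
    ("CONV1", "conv", 96, 11, 4, 0),
    ("MAX POOL1", "pool", None, 3, 2, 0),
    ("CONV2", "conv", 256, 5, 1, 2),
    ("MAX POOL2", "pool", None, 3, 2, 0),
    ("CONV3", "conv", 384, 3, 1, 1),
    ("CONV4", "conv", 384, 3, 1, 1),
    ("CONV5", "conv", 256, 3, 1, 1),
    ("MAX POOL3", "pool", None, 3, 2, 0),
]

size, depth = 227, 3          # INPUT: 227x227x3
shapes = {}
for name, kind, k, f, stride, pad in layers:
    size = out_size(size, f, stride, pad)
    if kind == "conv":        # pooling keeps the depth
        depth = k
    shapes[name] = (size, size, depth)
    print(f"[{size}x{size}x{depth}] {name}")

flat = size * size * depth    # values fed into FC6
fc6_weights = flat * 4096     # not counting biases
print(flat, fc6_weights)
```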

## Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10x manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: top-5 error 18.2% -> 15.4%
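
As a rough illustration of how the optimizer settings in this list fit together, here is one SGD-with-momentum update including L2 weight decay, written for plain Python lists of scalar weights. The function name and the toy weights and gradients are hypothetical; the hyperparameter defaults are the ones listed above.

```python
def sgd_momentum_step(w, v, grad, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One update over a list of scalar weights; returns new (weights, velocities)."""
    new_w, new_v = [], []
    for wi, vi, gi in zip(w, v, grad):
        g = gi + weight_decay * wi       # L2 weight decay adds wd * w to the gradient
        vi = momentum * vi - lr * g      # velocity accumulates past gradients
        new_v.append(vi)
        new_w.append(wi + vi)
    return new_w, new_v

w, v = [1.0, -2.0], [0.0, 0.0]           # toy weights, zero initial velocity
w, v = sgd_momentum_step(w, v, grad=[0.5, 0.1])
print(w, v)

lr_after_plateau = 1e-2 / 10             # manual drop when val accuracy plateaus
```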

## Historical note: 
Trained on GTX 580 GPU with only 3 GB of memory.<br>
Network spread across 2 GPUs, half the neurons (feature maps) on each GPU.<br>
CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU. <br>
CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs<br>

<img src='./Lesson pic/9-23.png'>
<img src='./Lesson pic/9-24.png'>
<img src='./Lesson pic/9-25.png'>

# Case Study: VGGNet [Simonyan and Zisserman, 2014]

Small filters, Deeper networks<br>

8 layers (AlexNet) -> 16 - 19 layers (VGG16 / VGG19)<br>
Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2<br>
11.7% top 5 error in ILSVRC’13 (ZFNet) <br>
-> 7.3% top 5 error in ILSVRC’14<br>

<img src='./Lesson pic/9-26.png'>

Q: Why use smaller filters? (3x3 conv)<br>

Stack of three 3x3 conv (stride 1) layers has same effective receptive field <br>
as one 7x7 conv layer

Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?<br>
[7x7]<br>

But deeper, more non-linearities<br>

And fewer parameters: 3 * (3<sup>2</sup>C<sup>2</sup>) vs. 7<sup>2</sup>C<sup>2</sup> for C channels per layer<br>
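
Both claims (same receptive field, fewer parameters) are easy to check numerically. In this sketch `C = 64` is an arbitrary example channel count and the helper names are invented for illustration; biases are ignored.

```python
def stacked_receptive_field(num_layers, f=3):
    """Effective receptive field of num_layers stacked f x f convs at stride 1."""
    rf = 1
    for _ in range(num_layers):
        rf += f - 1           # each stride-1 layer widens the field by f - 1
    return rf

def stack_params(num_layers, f, channels):
    """Weights in num_layers convs, each f x f with `channels` in and out."""
    return num_layers * f * f * channels * channels

C = 64                                   # example channel count (assumption)
three_3x3 = stack_params(3, 3, C)        # 3 * (3^2 * C^2) = 27 C^2
one_7x7 = stack_params(1, 7, C)          # 7^2 * C^2 = 49 C^2

print(stacked_receptive_field(3))        # 7
print(three_3x3, one_7x7)
```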

# VGG16 (not counting biases)
INPUT: [224x224x3] memory: 224*224*3=150K params: 0<br>
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728<br>
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864<br>
POOL2: [112x112x64] memory: 112*112*64=800K params: 0<br>
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728<br>
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456<br>
POOL2: [56x56x128] memory: 56*56*128=400K params: 0<br>
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912<br>
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824<br>
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824<br>
POOL2: [28x28x256] memory: 28*28*256=200K params: 0<br>
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648<br>
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296<br>
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296<br>
POOL2: [14x14x512] memory: 14*14*512=100K params: 0<br>
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296<br>
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296<br>
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296<br>
POOL2: [7x7x512] memory: 7*7*512=25K params: 0<br>
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448<br>
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216<br>
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000<br>

<font color='red'><b>TOTAL memory: 24M * 4 bytes ~= 96MB / image (only forward! ~*2 for bwd)</b></font><br>
<font color='blue'><b>TOTAL params: 138M parameters</b></font>

<b>Most memory is in early CONV </b><br>
ex) CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728<br>
<b>Most params are in late FC</b><br>
ex) FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448<br>
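
The per-layer table can be regenerated from a compact layer list, which makes it easy to see where the parameters and activations sit. This is a sketch under the table's own assumptions (3x3 convs with pad 1, 2x2 pooling that halves the spatial size, no biases); the `cfg` encoding is an illustrative choice. Note that the activation counts listed per layer add up to roughly 15M values, which is lower than the 24M total quoted above; the sketch only reports the sum and does not try to reconcile the two.

```python
# Numbers are conv output channels, "P" is a 2x2 max pool with stride 2.
cfg = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]
fc = [4096, 4096, 1000]

size, depth = 224, 3
acts = [("INPUT", size * size * depth)]   # activation counts per layer
params = []                               # weight counts per layer (no biases)

for item in cfg:
    if item == "P":
        size //= 2
        acts.append(("POOL2", size * size * depth))
        params.append(("POOL2", 0))
    else:
        params.append((f"CONV3-{item}", 3 * 3 * depth * item))
        depth = item
        acts.append((f"CONV3-{item}", size * size * depth))

n_in = size * size * depth                # 7 * 7 * 512
for n_out in fc:
    params.append(("FC", n_in * n_out))
    acts.append(("FC", n_out))
    n_in = n_out

total_params = sum(p for _, p in params)
fc_params = sum(p for name, p in params if name == "FC")
total_acts = sum(a for _, a in acts)
biggest_act = max(acts, key=lambda t: t[1])

print(f"total params: {total_params:,}")
print(f"FC share of params: {fc_params / total_params:.1%}")
print(f"largest activation: {biggest_act}")
print(f"sum of listed activations: {total_acts:,}")
```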

# Details:
- ILSVRC’14 2nd in classification, 1st in localization
- Similar training procedure as Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks

# Case Study: GoogLeNet [Szegedy et al., 2014]

Deeper networks, with computational efficiency

- 22 layers
- Efficient “Inception” module
- No FC layers
- Only 5 million parameters!
  - 12x fewer than AlexNet
- ILSVRC’14 classification winner (6.7% top 5 error)

“Inception module”: design a good local network topology (network within a network) <br>
and then stack these modules on top of each other

<img src='./Lesson pic/9-38.png'>

Apply parallel filter operations on the input from previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)

Concatenate all filter outputs together depth-wise

Q: What is the problem with this? [Hint: Computational complexity]
<img src='./Lesson pic/9-39.png'>

Q1: What is the output size of the 1x1 conv, with 128 filters?<br>
=> 28x28x128<br>

Q2: What are the output sizes of all different filter operations?<br>
=> 28x28x128, 28x28x192, 28x28x96, 28x28x256<br>

Q3: What is the output size after filter concatenation?<br>
=> 28x28x(128+192+96+256) = 28x28x672<br>
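
The answers to Q1 to Q3 follow from two facts: every branch keeps the 28x28 spatial size, and concatenation adds up the depths. A small sketch, assuming a 28x28x256 module input (the depth implied by the pooling branch) and padding chosen so each branch preserves spatial size:

```python
h = w = 28
c_in = 256                    # input depth (assumed from the pooling branch output)

# Output depth of each parallel branch in the naive module.
branches = {
    "1x1 conv": 128,
    "3x3 conv": 192,
    "5x5 conv": 96,
    "3x3 pool": c_in,         # pooling preserves the input depth
}

out_shapes = {name: (h, w, d) for name, d in branches.items()}
concat_depth = sum(branches.values())

for name, shape in out_shapes.items():
    print(name, shape)
print("after concatenation:", (h, w, concat_depth))
```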

<b>Q: What is the problem with this? [Hint: Computational complexity]<br></b>

Conv Ops:<br>
[1x1 conv, 128] 28x28x128x1x1x256<br>
[3x3 conv, 192] 28x28x192x3x3x256<br>
[5x5 conv, 96] 28x28x96x5x5x256<br>
<b>Total: 854M ops<br></b>

Very expensive compute
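
Each line of the op count is output positions x number of filters x size of one filter. A minimal sketch of that sum, with `conv_ops` as a hypothetical helper that counts one multiply per filter weight per output position:

```python
def conv_ops(h, w, c_out, f, c_in):
    """Multiplies for a conv layer: h*w output positions, c_out filters of f x f x c_in."""
    return h * w * c_out * f * f * c_in

naive = {
    "1x1 conv, 128": conv_ops(28, 28, 128, 1, 256),
    "3x3 conv, 192": conv_ops(28, 28, 192, 3, 256),
    "5x5 conv, 96": conv_ops(28, 28, 96, 5, 256),
}
naive_total = sum(naive.values())

for name, ops in naive.items():
    print(f"{name}: {ops:,}")
print(f"total: {naive_total:,}")   # 854,196,224
```

The 5x5 branch dominates even though it has the fewest filters, because each filter touches 25 x 256 input values.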

Pooling layer also preserves feature depth, <br>
which means total depth after concatenation can only grow at every layer!<br>

<font color='blue'><b>Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth<br></b></font>

<img src='./Lesson pic/9-50.png'>
<img src='./Lesson pic/9-52.png'>

<br>
<font color='blue' size=5><b>
(each filter has size 1x1x64, and performs a 64-dimensional dot product)<br>
preserves spatial dimensions, reduces depth!<br>
Projects depth to lower dimension (combination of feature maps)<br>
</b></font><br>
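
A 1x1 convolution is just a dot product over the depth dimension at every pixel, which the following pure-Python sketch makes explicit. The 4x4 spatial size and the choice of 32 filters are illustrative assumptions; only the 64-channel input depth comes from the text above.

```python
import random

def conv1x1(x, filters):
    """x: H x W x C_in nested lists; filters: C_out x C_in. Returns H x W x C_out."""
    return [[[sum(wk * xk for wk, xk in zip(f, pixel)) for f in filters]
             for pixel in row]
            for row in x]

random.seed(0)
H, W, C_IN, C_OUT = 4, 4, 64, 32   # small spatial size keeps the example fast

x = [[[random.random() for _ in range(C_IN)] for _ in range(W)] for _ in range(H)]
filters = [[random.gauss(0, 0.1) for _ in range(C_IN)] for _ in range(C_OUT)]

y = conv1x1(x, filters)
print(len(y), len(y[0]), len(y[0][0]))   # 4 4 32
```

Spatial dimensions are untouched; only the depth changes from 64 to the number of filters.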

<img src='./Lesson pic/9-54.png'>
<img src='./Lesson pic/9-55.png'>

Using same parallel layers as naive example, and adding “1x1 conv, 64 filter” bottlenecks:<br>

<font color='blue'><b>Conv Ops:</b></font><br>
[1x1 conv, 64] 28x28x64x1x1x256<br>
[1x1 conv, 64] 28x28x64x1x1x256<br>
[1x1 conv, 128] 28x28x128x1x1x256<br>
[3x3 conv, 192] 28x28x192x3x3x64<br>
[5x5 conv, 96] 28x28x96x5x5x64<br>
[1x1 conv, 64] 28x28x64x1x1x256<br>
<font color='blue'><b>Total: 358M ops (figure from the slide; the six terms listed above sum to about 271M)</b></font><br>

Compared to 854M ops for naive version<br>
Bottleneck can also reduce depth after pooling layer<br>
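
The bottleneck op count can be tallied the same way as the naive one. The sketch below sums exactly the six terms listed above, which comes to about 271M under this counting; `conv_ops` is a hypothetical helper and the branch labels are added for readability.

```python
def conv_ops(h, w, c_out, f, c_in):
    """Multiplies for a conv layer: h*w output positions, c_out filters of f x f x c_in."""
    return h * w * c_out * f * f * c_in

H = W = 28
bottleneck = [
    ("1x1 conv, 64 (before 3x3)", conv_ops(H, W, 64, 1, 256)),
    ("1x1 conv, 64 (before 5x5)", conv_ops(H, W, 64, 1, 256)),
    ("1x1 conv, 128", conv_ops(H, W, 128, 1, 256)),
    ("3x3 conv, 192", conv_ops(H, W, 192, 3, 64)),   # input depth reduced to 64
    ("5x5 conv, 96", conv_ops(H, W, 96, 5, 64)),     # input depth reduced to 64
    ("1x1 conv, 64 (after pool)", conv_ops(H, W, 64, 1, 256)),
]
bottleneck_total = sum(ops for _, ops in bottleneck)

# Depth after concatenation: 1x1, 3x3, 5x5 branches plus the reduced pool branch.
concat_depth = 128 + 192 + 96 + 64

for name, ops in bottleneck:
    print(f"{name}: {ops:,}")
print(f"total: {bottleneck_total:,}")
print("concatenated depth:", concat_depth)
```

The 3x3 and 5x5 branches now see 64 input channels instead of 256, which is where most of the saving comes from, and the pooling branch no longer forces the output depth to grow.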

Stack Inception modules with dimension reduction on top of each other<br>

<img src='./Lesson pic/9-55.png'>
<img src='./Lesson pic/9-57.png'>
<img src='./Lesson pic/9-58.png'>
<img src='./Lesson pic/9-60.png'>
<img src='./Lesson pic/9-61.png'>

22 total layers with weights (including each parallel layer in an Inception module)

<img src='./Lesson pic/9-64.png'>

# Case Study: ResNet [He et al., 2015]

Very deep networks using residual connections

- 152-layer model for ImageNet
- ILSVRC’15 classification winner (3.57% top 5 error)
- Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

<img src='./Lesson pic/9-65.png'>
<img src='./Lesson pic/9-67.png'>

<font color='blue'><b>
56-layer model performs worse on both training and test error<br>
-> The deeper model performs worse, but it’s not caused by overfitting!<br>
</b></font>

<font color='blue' size=5><b>
Hypothesis: the problem is an optimization problem; deeper models are harder to optimize. <br>

The deeper model should be able to perform at least as well as the shallower model.

A solution by construction is copying the learned layers from the shallower model <br>
and setting additional layers to identity mapping.<br>
</b></font>
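
The "solution by construction" argument can be made concrete with a toy residual block. This sketch assumes the residual form H(x) = F(x) + x from He et al. (shown only in the slide figures here), uses tiny dense layers instead of convolutions, and omits the ReLU that ResNet applies after the addition. With all of F's weights at zero the block is exactly the identity, so appending such blocks to a shallower network cannot change its output.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """H(x) = F(x) + x, with F(x) = w2 * relu(w1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    return [f + a for f, a in zip(fx, x)]

def shallow_net(x):
    """Stand-in for an already trained shallower model (arbitrary fixed weights)."""
    return relu(linear([[0.5, -1.0, 0.0], [1.0, 0.25, -0.5], [0.0, 2.0, 1.0]], x))

x = [0.5, -1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]

shallow_out = shallow_net(x)
deep_out = shallow_out
for _ in range(10):                       # ten extra blocks with F(x) = 0
    deep_out = residual_block(deep_out, zeros, zeros)

print(shallow_out, deep_out)
```

The point of the residual form is that the identity is the default behaviour the optimizer starts near, rather than something it has to learn.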

<img src='./Lesson pic/9-72.png'>
<img src='./Lesson pic/9-74.png'>
<img src='./Lesson pic/9-76.png'>
<img src='./Lesson pic/9-79.png'>

## Training ResNet in practice:
- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al. (weight variance 2/fan_in)
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5 (as given on the slide; He et al. report 1e-4)
- No dropout used
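
Two items in this recipe are easy to sketch: the initialization scale and the learning-rate drops. Below, `he_std` gives the standard deviation for weights with variance 2/fan_in, and `plateau_schedule` divides the learning rate by 10 once validation error stops improving. The `patience` parameter and the validation error values are made up for illustration; the notes do not say how a plateau was detected.

```python
import math

def he_std(fan_in):
    """Std for He initialization: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def plateau_schedule(val_errors, lr=0.1, factor=10.0, patience=2):
    """Learning rate used at each epoch; drops after `patience` epochs without improvement."""
    best, stale, lrs = float("inf"), 0, []
    for err in val_errors:
        lrs.append(lr)
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:     # plateau detected (assumed rule)
                lr /= factor
                stale = 0
    return lrs

std_3x3_64 = he_std(3 * 3 * 64)       # 3x3 conv with 64 input channels
errs = [0.50, 0.40, 0.35, 0.35, 0.36, 0.30, 0.30, 0.31]   # hypothetical validation errors
lrs = plateau_schedule(errs)

print(round(std_3x3_64, 4))
print(lrs)
```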

## Experimental Results
- Able to train very deep networks without degrading (152 layers on ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower training error, as expected
- Swept 1st place in all ILSVRC and COCO 2015 competitions

<img src='./Lesson pic/9-82.png'>
<img src='./Lesson pic/9-85.png'>

<b>VGG: Highest memory, most operations</b><br>
<b>GoogLeNet: most efficient</b><br>
<b>AlexNet: Smaller compute, still memory heavy, lower accuracy</b><br>
<b>ResNet: Moderate efficiency depending on model, highest accuracy</b><br>

<img src='./Lesson pic/9-90.png'>

<img src='./Lesson pic/9-92.png'>
<img src='./Lesson pic/9-93.png'>
<img src='./Lesson pic/9-94.png'>
<img src='./Lesson pic/9-95.png'>
<img src='./Lesson pic/9-96.png'>
<img src='./Lesson pic/9-97.png'>
<img src='./Lesson pic/9-98.png'>
<img src='./Lesson pic/9-99.png'>

# Summary: CNN Architectures

- VGG, GoogLeNet, ResNet all in wide use, available in model zoos
- ResNet is the current best default
- Trend towards extremely deep networks
- Significant research centers around design of layer / skip connections and improving gradient flow
- Even more recent trend towards examining necessity of depth vs width and residual connections

<br>
<img src='./Lesson pic/8-21.png'>