# Week 2 - CNN Architectures

## 1. Notations

### 1.1 Pioneering Work

- Definition of convolution and pooling layers: Work of Fukushima in 1982, Neocognitron.
- Backpropagation applied to handwritten digit recognition: Work of LeCun in 1989
- Publication of LeNet-5 in 1998: the cornerstone of modern CNN structure.

### 1.2 Architectures

1. LeNet-5

- Use Tanh as activation function 

2. AlexNet

- The first to stack conv. layers directly on top of one another.
- Two regularization techniques: Dropout (50% rate) and data augmentation
- Use of a competitive normalization step (local response normalization, LRN) after the ReLU of layers C1 and C3
- 62.5 million parameters (6% from conv. layers and 94% from FC layers)

3. ZFNet

- Similar to AlexNet with a narrower first conv. layer, and more maps in the following conv. layers.
- Reduced the error by 5% compared to AlexNet


4. VGG-19 Net (Visual Geometry Group - Oxford University)

- Introduces a modular design concept: stages and blocks
- According to the authors, stacking several consecutive 3x3 conv. layers adds more non-linearities
  while using fewer parameters than a single conv. layer with a larger kernel.
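
The parameter argument can be checked with a little arithmetic. The sketch below is only an illustration: the channel count of 64 is an assumption, biases are ignored, and `conv_params` is a hypothetical helper, not part of any library. Three stacked 3x3 layers cover the same 7x7 receptive field as one 7x7 layer, yet use fewer weights and apply three activations instead of one.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weight count of one conv layer (biases ignored for simplicity)."""
    return kernel_size * kernel_size * in_channels * out_channels

channels = 64  # illustrative channel count, kept constant through the stack

# Three stacked 3x3 layers (stride 1) see a 7x7 region of the input.
stacked_3x3 = 3 * conv_params(3, channels, channels)
# One 7x7 layer sees the same region in a single step.
single_7x7 = conv_params(7, channels, channels)

print(stacked_3x3)  # 110592
print(single_7x7)   # 200704
```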

5. GoogLeNet / Inception Net

- Great performance due to deeper network using inception modules
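- About 10 times fewer parameters than AlexNet: roughly 6 million (5 million according to the authors)
- Inception modules use 1x1 conv. layers that capture patterns along the depth dimension, serve as bottleneck layers, and, paired with the following 3x3 or 5x5 conv. layer, act like a two-layer neural network swept across the image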
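
A rough sketch of why the bottleneck layers save parameters. The channel sizes (192 in, 16 after the 1x1 reduction, 32 out) are illustrative assumptions, biases are ignored, and `conv_params` is a hypothetical helper.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weight count of one conv layer (biases ignored for simplicity)."""
    return kernel_size * kernel_size * in_channels * out_channels

in_channels, reduced, out_channels = 192, 16, 32  # illustrative sizes

# 5x5 convolution applied directly to all input channels.
direct_5x5 = conv_params(5, in_channels, out_channels)

# 1x1 bottleneck first shrinks the depth, then the 5x5 convolution runs on fewer channels.
with_bottleneck = conv_params(1, in_channels, reduced) + conv_params(5, reduced, out_channels)

print(direct_5x5)       # 153600
print(with_bottleneck)  # 15872
```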

6. InceptionNet V2

- Uses BatchNorm for each conv layer 
- Increases non-linearities

7. ResNet

- Use of residual connections -> Better gradient flow
- Makes it possible to train extremely deep networks
- Demonstrated the degradation problem: past a certain depth, plain networks reach a higher training error
- For deeper networks (50 layers or more), it uses the bottleneck block to suppress both computational complexity and the number of parameters
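
The two ideas above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the widths 256 and 64 are assumed for the example, biases and BatchNorm are ignored, and `conv_params` and `residual` are hypothetical helpers.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weight count of one conv layer (biases and BatchNorm ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

# Bottleneck block: 1x1 reduce, 3x3, 1x1 restore (illustrative widths 256 and 64).
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)

# Plain block at the same width: two 3x3 layers on 256 channels.
plain = 2 * conv_params(3, 256, 256)

print(bottleneck)  # 69632
print(plain)       # 1179648

def residual(x, f):
    """Residual connection: the block learns f(x) and the input is added back."""
    return [xi + fi for xi, fi in zip(x, f(x))]

# With f(x) = 0 the block reduces to the identity mapping.
print(residual([1.0, 2.0], lambda x: [0.0 for _ in x]))  # [1.0, 2.0]
```

Because the skip path adds the input unchanged, the gradient also has a direct route back through the addition, which is the "better gradient flow" noted above.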

8. ResNeXt

9. SENet

- Adds a small neural network, the SE block, with 3 layers: a global average pooling layer, a hidden dense layer with ReLU, and a dense output layer with sigmoid
- SE blocks focus on depth dimension -> recalibrate feature maps
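
The three layers of the SE block can be sketched in plain Python. This is a simplified illustration under stated assumptions: feature maps are nested lists, biases are omitted, and the function name `se_block` and the weight layout are hypothetical choices made for this example.

```python
import math

def se_block(feature_maps, w_hidden, w_out):
    """Recalibrate feature maps channel by channel.

    feature_maps: list of C channels, each an HxW list of lists.
    w_hidden: [C_reduced][C] weights, w_out: [C][C_reduced] weights (biases omitted).
    """
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation, step 1: small hidden dense layer with ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_hidden]
    # Excitation, step 2: dense output layer with sigmoid, one scale per channel.
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w_out]
    # Recalibrate: multiply every value of a channel by that channel's scale.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(feature_maps, scales)]

maps = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # 2 channels of size 2x2
out = se_block(maps, w_hidden=[[0.5, 0.5]], w_out=[[0.0], [10.0]])
print(out[0][0][0])  # 0.5, because the first channel's scale is sigmoid(0)
```

In this toy run the second channel receives a scale close to 1 and is passed through almost unchanged, while the first channel is halved.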

10. Attention Networks

- Groundbreaking paper from a Google team: "Attention is all you need"

11. NASNet


## 2. Quizzes

### 2.1 CNN Architectures 1

**Question 1**

Which of these components were a part of the famous LeNet-5 architecture by LeCun et al.?

- [x] 5 x 5 Convolution
- [ ] Batch Normalization 
- [x] Fully Connected Layer 
- [ ] 3 x 3 Convolution

**Question 2**

Which of the following components is most likely to have had the biggest influence on the success of AlexNet?

- [x] The usage of ReLU as activation function
- [ ] The usage of the max-pooling operation
- [ ] The usage of fully connected layers with large number of neurons
- [ ] The usage of 3 x 3 convolutions in conjunction with a 5 x 5 and 11 x 11 convolution layers

**Question 3**

How does Dropout work?

- [x] It randomly sets some neurons to zero during the forward pass
- [ ] It randomly sets some neurons to zero during the backward pass 
- [ ] It randomly sets some neurons to one during the forward pass 
- [ ] It randomly sets some neurons to one during the backward pass
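
A minimal sketch of the forward pass described by the correct answer. The rescaling of the surviving activations by `1 / keep` ("inverted dropout") is an assumption added here so that the expected activation stays the same; the question itself only covers the zeroing. The function name and parameters are hypothetical.

```python
import random

def dropout_forward(activations, rate=0.5, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        # At inference time nothing is dropped.
        return list(activations)
    keep = 1.0 - rate
    # Survivors are scaled by 1 / keep (inverted dropout, an assumption of this sketch).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

dropped = dropout_forward([1.0] * 1000, rate=0.5, rng=random.Random(0))
print(sorted(set(dropped)))  # [0.0, 2.0]
```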

### 2.2 CNN Architectures 2

**Question 1**

What is an insight that one could draw from investigating the ZFNet architecture?

- [ ] A large number of neurons in the fully connected layer is important to achieve good results
- [ ] It is not necessary to use networks with more than 8 layers
- [x] If you reduce the spatial size of a feature map, e.g., by using pooling, you should increase the dimensionality of the feature map

**Question 2**

What was one of the most important insights we can draw from the VGG architecture?

- [ ] Bottleneck blocks help to squeeze feature information and concentrate only on the most important aspects
- [ ] Convolutional neural networks do not need to be deep to learn hierarchical representations of visual data
- [x] Using only 3x3 convolutional layers allows the usage of more non-linearities. Thus, it is possible to learn more meaningful features.

**Question 3**

Which of these network architectures has the lowest number of parameters?

- [x] GoogLeNet with 22 layers
- [ ] AlexNet with 8 layers
- [ ] VGG-19 with 19 layers

### 2.3 CNN Architectures 3

**Question 1**

What is the core concept of an inception module?

- [ ] To regularize the learning process and enable a deeper neural network to converge 
- [x] To decouple the learning process of channel correlation and spatial correlation 
- [ ] To decouple the correlation of input data and learned weights
- [ ] To improve the feature representation obtained by a single convolutional layer

**Question 2**

Which of these statements describe benefits of using Batch Normalization?

- [x] Improve accuracy
- [ ] Improve memory usage, by using less memory 
- [x] Improve convergence speed 
- [ ] Improve speed of a forward pass

**Question 3**

Which of the following components can be found in Batch Normalization?

- [x] The calculation of the mini-batch variance 
- [x] The application of learnable scale and shift parameters 
- [x] The calculation of the mini-batch mean 
- [x] The normalization of the mini-batch using the calculated mean and variance
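
The four components listed above fit into one short function. This is a training-time sketch for a single feature only; the running statistics used at inference are left out, and the names `gamma`, `beta`, and `eps` are conventional choices assumed for this example.

```python
def batch_norm_forward(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize the values of one feature across a mini-batch."""
    m = len(batch)
    mean = sum(batch) / m                                # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / m        # mini-batch variance
    normalized = [(x - mean) / (var + eps) ** 0.5 for x in batch]  # normalize
    return [gamma * x_hat + beta for x_hat in normalized]          # learnable scale and shift

normed = batch_norm_forward([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm_forward([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

With the default `gamma` and `beta`, the output has a mean of about 0 and a variance of about 1; with `gamma=2.0` and `beta=3.0`, the mean moves to about 3.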

**Question 4**

Batch Normalization is effective due to the reduction of internal covariate shift. 

- [ ] Correct
- [x] Incorrect

### 2.4 CNN Architectures 4

**Question 1**

What are possible weaknesses of the InceptionNet V2 architecture?

- [x] Over-engineering of the network structure
- [ ] Marginal accuracy improvements over the original InceptionNet architecture (less than 2%)
- [ ] Abandonment of decoupled learning of channel correlation and spatial correlation
- [x] Inconsistent kernel shapes

**Question 2**

What is the purpose of a bottleneck block in the ResNet architecture?

- [ ] the bottleneck is used to remove unused features from the feature maps
- [ ] the bottleneck is used to encourage the network to focus only on the most important feature information
- [x] the bottleneck suppresses both computational complexity and number of parameters

**Question 3**

Which of the following factors may lead to the larger popularity of ResNet compared to InceptionNet?

- [x] ResNet's pre-trained models are supported by almost all deep learning frameworks
- [ ] ResNet has been introduced by Google AI 
- [x] ResNet has a concise design
- [x] ResNet has been open source since its introduction

### 2.5 CNN Architectures 5

**Question 1**

What is the core idea of the ResNeXt architecture?

- [ ] The introduction of inception modules into the residual architecture of ResNet
- [ ] Replacement of the initial 7x7 convolution in the stem with a set of 3x3 convolutions
- [x] The replacement of ResNet's three-layer convolutional block with a parallel stack of blocks

**Question 2**

Given a feature map of shape 15x15x32 (WxHxC), we want to apply a group convolution with 2 groups,
a 3x3 kernel, stride 1, and padding 1. What is the size of the output feature map (WxHxC)?

- [ ] 15x15x32
- [ ] 15x15x8
- [ ] 15x15x16
- [x] 15x15x2
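
The spatial part of the answer follows from the usual output-size formula. The channel part depends on the number of filters, which the question does not state; the sketch below assumes one filter per group (2 output channels), which reproduces the marked answer. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def group_conv_shape(width, height, in_channels, out_channels, groups, kernel, stride, padding):
    """Output shape (WxHxC) and weight count of a group convolution (biases ignored)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    out_w = conv_output_size(width, kernel, stride, padding)
    out_h = conv_output_size(height, kernel, stride, padding)
    # Each filter only sees in_channels // groups input channels.
    params = kernel * kernel * (in_channels // groups) * out_channels
    return (out_w, out_h, out_channels), params

# Assumption: one filter per group, so out_channels equals the number of groups.
shape, params = group_conv_shape(15, 15, 32, out_channels=2, groups=2, kernel=3, stride=1, padding=1)
print(shape)   # (15, 15, 2)
print(params)  # 288
```

A regular convolution with the same 2 filters would need 3 * 3 * 32 * 2 = 576 weights, so splitting the input into 2 groups halves the parameter count.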

**Question 3**

What is the goal of the SENet architecture?

- [ ] To improve representational power of the network by introducing shortcut connections
- [ ] To improve accuracy by adding blocks specifically engineered for the task of image classification
- [ ] To improve the convergence speed and accuracy of the model by using thinner feature maps and shuffle operations
- [x] To improve the representational power of the model by introducing a channel-wise attention mechanism


### 2.6 CNN Architectures 6

**Question 1**

What does NAS stand for?

- [ ] Neural Architecture Service
- [x] Neural Architecture Search 
- [ ] Network Automation Service
- [ ] Network Search Strategy

**Question 2**

Which machine learning method is used when training a network using the NAS strategy?

- [ ] Variational Autoencoders
- [ ] Generative Adversarial Networks
- [x] Reinforcement Learning
- [ ] Residual Learning

**Question 3**

What is the core principle behind neural architecture search?

- [ ] Fully random sampling of neural architectures while hoping that they perform well on the task at hand
- [ ] Manual architecture design aided by proposals from the computer for the task at hand
- [x] Guided sampling, training, and validation of neural architectures that are likely to obtain good performance on the task at hand
- [ ] Outsourcing of architecture design, training and validation to crowd workers

### 2.7 Transformer 

**Question 1**

What does attention capture / describe?

- [x] Correlation between tokens
- [ ] parts of images that can safely be ignored
- [ ] text tokens that do not convey much semantic information
- [x] "interesting" receptive fields on the images

**Question 2**

Which form of advanced attention is used in a transformer model?

- [x] self-attention
- [ ] backwards-attention
- [ ] neighbor-attention 
- [ ] visual-attention

**Question 3**

Which function is used to calculate the attention weights?

- [ ] sigmoid
- [ ] tanh
- [x] softmax 
- [ ] cross entropy
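
A small sketch of how softmax turns token similarities into attention weights. It follows the scaled dot-product form and, as a simplifying assumption, uses the token vectors directly as queries, keys, and values (a real transformer first applies learned projections). The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns the outputs and the attention weights."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        all_weights.append(weights)
        # Output is the weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0]]
# Self-attention: queries, keys, and values all come from the same tokens.
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and in this toy input every token attends most strongly to itself.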

