<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/3_resnet_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##ResNet networks

The researchers behind the residual network proposed, as part of the residual block
design pattern, a novel layer connection they called an identity link.

The identity
link introduced the earliest concept of feature reuse. Prior to the identity link,
each convolutional block did feature extraction on the previous convolutional output,
without retaining any knowledge from prior outputs. The identity link can be seen as
a coupling between the current and prior convolutional outputs to reuse feature
information gained from earlier extraction.

Using identity links along with batch normalization provided more stability across
layers, reducing both vanishing and exploding gradients and divergence between
layers, allowing model architectures to go deeper in layers to increase accuracy in
prediction.

ResNet34 introduced a new block pattern and a new layer-connection pattern: the residual
block and the identity connection, respectively. A residual block in ResNet34 consists
of two identical convolutional layers without a pooling layer. Each block has
an identity connection that creates a parallel path between the input of the residual
block and its output.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/2.png?raw=1' width='800'/>

One of the problems with neural networks is that as we add deeper layers (under the
presumption of increasing accuracy), their performance can degrade. It can get
worse, not better. This occurs for several reasons. As we go deeper, we are adding
more parameters (weights). The more parameters there are, the more capacity the network
has to fit individual inputs in the training data. Instead of generalizing, the neural
network will simply learn each training example (rote memorization).

The other issue
is covariate shift: the distribution of the weights will widen (spread further apart) as we
go deeper, making it more difficult for the neural network to converge.
The former case causes a degradation in performance on the test (holdout) data. The
latter causes a degradation on the training data as well, along with vanishing or exploding gradients.

Residual blocks allow neural networks to be built with deeper layers without a
degradation in performance on the test data. A ResNet block could be viewed as a
VGG block with the addition of the identity link.

While the VGG-style part of the block performs
feature detection, the identity link retains the block’s input, so the input to the
next block consists of both the detected features and the original input.

By retaining information from the past (previous input), this block design allows
neural networks to go deeper than the VGG counterpart, with an increase in accuracy.
Mathematically, we could represent the VGG and ResNet as follows.

VGG: $h(x)=f(x, {W})$

ResNet:  $h(x)=f(x, {W}) + x$

The variable x represents the output of a
layer, which is the input to the next layer. At the beginning of the block, we retain a
copy of the previous block/layer output as the variable shortcut. We then pass the
previous block/layer output (x) through two convolutional layers, each time taking
the output from the previous layer as input into the next layer. Finally, the last output
from the block (retained in the variable x) is added (matrix addition) with the original
value of x (shortcut).

```python
shortcut = x  # Remember the input to the block.
x = layers.Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same')(x)
x = layers.ReLU()(x)
x = layers.Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same')(x)
x = layers.ReLU()(x)   # The output of the convolutional sequence
x = layers.add([shortcut, x])  # Matrix addition of the input to the output
```

The ResNet architectures take as input a (224, 224, 3) tensor: an RGB image
(3 channels) of 224 (height) × 224 (width) pixels. The first layer is a basic convolutional
layer, consisting of a convolution using a fairly large filter size of 7 × 7. The output
(feature maps) is then reduced in size by a max pooling layer.
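The size reduction in these entry layers can be checked by hand. The sketch below assumes a stride of 2 for both the 7 × 7 convolution and the max pooling layer, with `padding="same"`, as in the code later in this notebook. Under that assumption the output size is the input size divided by the stride, rounded up. The helper name `same_padding_output_size` is hypothetical and not part of Keras.

```python
import math

def same_padding_output_size(size, stride):
    """Spatial output size of a conv/pool layer that uses padding='same'."""
    return math.ceil(size / stride)

height = 224
after_conv = same_padding_output_size(height, stride=2)      # 7 x 7 convolution
after_pool = same_padding_output_size(after_conv, stride=2)  # 3 x 3 max pooling

print(after_conv, after_pool)  # 112 56
```

Halving the height and the width leaves one quarter of the values, which is the 75% reduction mentioned in the code comments.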

After the initial convolutional layer is a succession of groups of residual blocks.
Each successive group doubles the number of filters (similar to VGG). Unlike VGG,
though, there is no pooling layer between the groups that would reduce the size of
the feature maps.

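When a new group doubles the number of filters, the first residual block in that group
receives an input with X feature maps but produces an output with 2X feature maps.
The identity link would then
attempt to add the input matrix (X) and the output matrix (2X), and we get an
error indicating that we can’t broadcast (for the add operation) matrices of different sizes.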

For ResNet, this is solved by adding a convolutional block between each “doubling”
group of residual blocks.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/3.png?raw=1' width='800'/>
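The shape mismatch described above can be reproduced without a framework. The following is a minimal sketch in which `add_feature_maps` is a hypothetical stand-in for the shape check performed by an element-wise add, and the shapes are illustrative assumptions.

```python
def add_feature_maps(shortcut_shape, output_shape):
    """Mimic the shape check of an element-wise add: shapes must match exactly."""
    if shortcut_shape != output_shape:
        raise ValueError(f"cannot add {shortcut_shape} and {output_shape}")
    return output_shape

# Inside a group, the input and output both have 64 feature maps, so the add works.
same_group = add_feature_maps((56, 56, 64), (56, 56, 64))

# Crossing into a group with doubled filters: 64 versus 128 feature maps.
try:
    add_feature_maps((56, 56, 64), (56, 56, 128))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(same_group, mismatch_detected)  # (56, 56, 64) True
```

Placing a convolutional block between the groups changes the number of feature maps before the next identity link is formed, so both operands of the add have the same shape again.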

The output of the last residual block group is passed to a pooling and flattening layer
(GlobalAveragePooling2D), which is then passed to a single Dense layer of 1000
nodes (number of classes).

Let’s now put the whole network together, using a procedural style. Additionally, we
need to add the entry convolutional layer of ResNet and then the DNN classifier.

##Setup

In [5]:
import tensorflow.keras.layers as layers
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, ReLU, Activation, BatchNormalization
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, GlobalAveragePooling2D, ZeroPadding2D

##ResNet using procedural style

In [None]:
def residual_block(n_filters, x):
  """
  Create a Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  x = layers.add([shortcut, x])
  return x

def conv_block(n_filters, x):
  """
  Create Block of Convolutions without Pooling
  n_filters: number of filters
  x        : input into the block
  """
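  # The first convolution uses a stride of 2, reducing the feature maps by 75%
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(2, 2), padding="same", activation="relu")(x)
  # The second convolution keeps the reduced size
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)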
  return x

In [None]:
# The input tensor  
inputs = Input(shape=(224, 224, 3))

# First convolutional layer, where pooled feature maps will be reduced by 75%
x = Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding="same", activation="relu")(inputs)
x = MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding="same")(x)

# First residual block group of 64 filters
for _ in range(2):
  x = residual_block(64, x)

# Doubles the number of filters and reduces feature maps by 75% (strides=(2, 2)) to fit the next residual group
x = conv_block(128, x)

# Second residual block group of 128 filters
for _ in range(3):
  x = residual_block(128, x)

x = conv_block(256, x)

# Third residual block group of 256 filters
for _ in range(5):
  x = residual_block(256, x)

x = conv_block(512, x)

# Fourth residual block group of 512 filters
for _ in range(2):
  x = residual_block(512, x)

x = GlobalAveragePooling2D()(x)

# Output layer for classification (1000 classes)
outputs = Dense(1000, activation="softmax")(x)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
conv2d (Conv2D)                 (None, 112, 112, 64) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
max_pooling2d (MaxPooling2D)    (None, 56, 56, 64)   0           conv2d[0][0]                     
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 56, 56, 64)   36928       max_pooling2d[0][0]              
______________________________________________________________________________________________

We see that the total number of parameters to learn
is about 21 million. This is in contrast to VGG16, which has 138 million parameters. So
the ResNet architecture has roughly one-sixth as many parameters to train and store. This
reduction comes partly from the construction of the residual blocks and largely from the
much smaller classifier.
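The parameter total can be approximated with simple arithmetic. The sketch below assumes groups of 2, 3, 5, and 2 residual blocks, a two-convolution block between groups, and a bias on every layer. These block counts are an assumption chosen to be consistent with the total quoted above, and `conv_params` and `dense_params` are hypothetical helper names.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units

total = conv_params(7, 3, 64)  # entry 7 x 7 convolution on an RGB image
channels = 64

# (filters, residual blocks) per group; the block counts are an assumption
for filters, n_blocks in [(64, 2), (128, 3), (256, 5), (512, 2)]:
    if filters != channels:
        # Convolutional block between groups: two 3 x 3 convolutions
        total += conv_params(3, channels, filters) + conv_params(3, filters, filters)
        channels = filters
    # Each residual block holds two 3 x 3 convolutions
    total += n_blocks * 2 * conv_params(3, filters, filters)

total += dense_params(512, 1000)  # single Dense output layer

print(f"{total:,}")  # 21,542,376
```

The classifier contributes only 513,000 of these parameters, because global average pooling hands it a vector of just 512 values.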

Notice that the DNN backend is
just a single output Dense layer. In effect, there is no backend. The early residual
block groups act as the CNN frontend doing the feature detection, while the later
residual blocks perform the classification.

In doing so, unlike in VGG, there was no
need for several fully connected dense layers, which would have substantially
increased the number of parameters.

Unlike the previous example of pooling, in which the size of each feature map is
reduced according to the size of the stride, GlobalAveragePooling2D is like a supercharged
version of pooling: each feature map is replaced by a single value, which in
this case is the average of all values in the corresponding feature map. For example, if
the input is 256 feature maps, the output will be a 1D vector of size 256. 
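The averaging can be illustrated in plain Python. This is a minimal sketch rather than the Keras implementation: `global_average_pooling` is a hypothetical helper, and the feature maps are stored as a list of small 2D grids (one grid per map) for readability, whereas Keras keeps the channels in the last axis by default.

```python
def global_average_pooling(feature_maps):
    """Replace each 2D feature map with the mean of all its values."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Three 2 x 2 feature maps (illustrative values)
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]

pooled = global_average_pooling(feature_maps)
print(pooled)  # [2.5, 2.0, 5.0]
```

Three feature maps go in and a 1D vector of three values comes out, regardless of how large each map is.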

After ResNet,
it became the general practice for deep convolutional neural networks to use
GlobalAveragePooling2D at the last pooling stage, which benefited from a substantial
reduction of the number of parameters coming into the classifier, without significant
loss in representational power.

Another advantage is the identity link, which provided the ability to add deeper
layers, without degradation, for higher accuracy.



##ResNet with bottleneck residual block

ResNet50 introduced a variation of the residual block referred to as the bottleneck
residual block. In this version, the group of two 3 × 3 convolutional layers is replaced by
a group of 1 × 1, then 3 × 3, and then 1 × 1 convolutional layers. 

The first 1 × 1 convolution
performs a dimensionality reduction, reducing the computational complexity,
and the last convolution restores the dimensionality, increasing the number of filters
by a factor of 4. 

The middle 3 × 3 convolution is referred to as the bottleneck convolution,
like the neck of a bottle. 

The bottleneck residual block allows
for deeper neural networks, without degradation, and a further reduction in computational
complexity.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/4.png?raw=1' width='800'/>
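A rough parameter count illustrates the saving. The sketch assumes a block whose input and output both have 256 feature maps, with 64 filters inside the bottleneck. These sizes and the helper `conv_params` are illustrative assumptions, not values taken from the text.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch

channels = 256           # feature maps entering and leaving the block (assumed)
reduced = channels // 4  # filters used inside the bottleneck

# Residual block with two 3 x 3 convolutions at full width
basic_block = 2 * conv_params(3, channels, channels)

# Bottleneck residual block: 1 x 1 reduce, 3 x 3, 1 x 1 restore
bottleneck_block_params = (
    conv_params(1, channels, reduced)
    + conv_params(3, reduced, reduced)
    + conv_params(1, reduced, channels)
)

print(basic_block, bottleneck_block_params)  # 1180160 70016
```

Under these assumptions the bottleneck version needs roughly one-seventeenth of the parameters, because the expensive 3 × 3 convolution runs on 64 feature maps instead of 256.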

In [None]:
def bottleneck_block(n_filters, x):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x
  # A 1 × 1 bottleneck convolution for dimensionality reduction
  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
  # A 3 × 3 convolution for feature extraction
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  # A 1 × 1 projection convolution for dimensionality expansion
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
  # Matrix addition of the input to the output; the identity link requires the
  # input to already have n_filters * 4 feature maps
  x = layers.add([shortcut, x])

  return x


Residual blocks introduced the concepts of representational power and representational
equivalence. Representational power is a measure of how powerful a block is as a
feature extractor. Representational equivalence is the idea that a block can be factored
into a lower computational complexity, while maintaining representational power.

The design of the residual bottleneck block was demonstrated to maintain representational
power of the ResNet34 block, with a lower computational complexity.

##Batch normalization

Another problem with adding deeper layers in a neural network is the vanishing gradient
problem. This is partly about the limits of computer hardware. During training (the process
of backward propagation and gradient descent), the gradient passed back through each layer
is multiplied by small numbers, often less than 1. Two numbers less than 1 multiplied
together make an even smaller number, so as these values are propagated back through more
and more layers, they keep shrinking. At some point, the computer hardware can’t represent
the value anymore, hence the vanishing gradient.
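The effect is easy to reproduce with plain floating-point arithmetic. The sketch below is an illustration only: `scaled_gradient` is a hypothetical helper that multiplies a starting value by the same factor once per layer, and the factor of 0.1 is an arbitrary assumption.

```python
def scaled_gradient(factor, n_layers, start=1.0):
    """Multiply a starting gradient by the same factor once per layer."""
    gradient = start
    for _ in range(n_layers):
        gradient *= factor
    return gradient

shallow = scaled_gradient(0.1, 10)   # still representable, about 1e-10
deep = scaled_gradient(0.1, 400)     # underflows to exactly 0.0

print(shallow)
print(deep)  # 0.0
```

Python floats are 64-bit. Training usually runs in 32-bit or 16-bit precision, where the same underflow is reached after far fewer multiplications.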

Batch normalization is a technique applied to the output of a layer (before or after
the activation function). Without going into the statistics aspect, it normalizes the
layer’s outputs across each batch, which limits how far they shift as the weights are
being trained. This has several advantages. It smooths out (across a batch) the amount of
change, reducing the chance of getting a number so small that it can’t be represented by
the hardware. Additionally, by narrowing the amount of shift, it allows a higher learning
rate, so convergence can happen sooner and the overall training time is reduced. Batch
normalization is added to a layer in TF.Keras with the BatchNormalization class.

```python
model = Sequential()
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding='same',
                 input_shape=(128, 128, 3)))
# Adds batchnorm before the activation
model.add(BatchNormalization())
model.add(ReLU())
model.add(Flatten())
model.add(Dense(4096))
model.add(ReLU())
# Adds batchnorm after the activation
model.add(BatchNormalization())
```
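For readers who do want a glimpse of the statistics, the core of the normalization step can be sketched in a few lines. This is a simplified illustration for one feature across one batch. It leaves out the learnable scale and shift and the moving averages that the real layer maintains, and `batch_normalize` is a hypothetical helper name.

```python
import math

def batch_normalize(values, epsilon=1e-3):
    """Normalize one feature across a batch to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon guards against division by zero when the variance is tiny
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

batch = [10.0, 12.0, 14.0, 16.0]  # one feature, four examples (illustrative)
normalized = batch_normalize(batch)

print([round(v, 3) for v in normalized])
```

Whatever the scale of the incoming values, the normalized outputs are centered on zero with a spread close to one, which is what keeps the values flowing into the next layer within a stable range.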



##ResNet50

ResNet50 is a well-known model, which is commonly reused as a stock model, such as
for transfer learning, as shared layers in object detection, and for performance benchmarking.

ResNet50 v1 formalized the concept of a convolutional group. This is a set of convolutional
blocks that share a common configuration, such as the number of filters. In v1,
the neural network is decomposed into groups, and each group doubles the number
of filters from the previous group.

Additionally, the concept of a separate convolution block to double the number of
filters was removed and replaced by a residual block that uses linear projection. Each
group starts with a residual block using linear projection on the identity link to double
the number of filters, while the remaining residual blocks pass the input directly to the
output for the matrix add operation. The first 1 × 1 convolution in
the residual block with linear projection also uses a stride of 2 (feature pooling), which is
known as a strided convolution, reducing the feature map sizes by 75%.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/5.png?raw=1' width='800'/>
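To see why the projection on the identity link makes the add operation possible, we can trace the shapes through both paths. This is a sketch under assumed sizes: a (56, 56, 256) input and 128 filters. `conv_output_shape` is a hypothetical helper that rounds the spatial size up after dividing by the stride.

```python
import math

def conv_output_shape(shape, filters, stride):
    """Output shape of a convolution whose spatial size is ceil(input / stride)."""
    height, width, _ = shape
    return (math.ceil(height / stride), math.ceil(width / stride), filters)

block_input = (56, 56, 256)  # assumed output of the previous group
n_filters = 128

# Main path: 1 x 1 with stride 2, then 3 x 3, then 1 x 1 expanding to 4 * n_filters
x = conv_output_shape(block_input, n_filters, stride=2)
x = conv_output_shape(x, n_filters, stride=1)
x = conv_output_shape(x, 4 * n_filters, stride=1)

# Identity link: 1 x 1 projection with the same stride and 4 * n_filters filters
shortcut = conv_output_shape(block_input, 4 * n_filters, stride=2)

print(x, shortcut)  # (28, 28, 512) (28, 28, 512)
```

Both paths end with the same shape, so the matrix add succeeds even though the block halves the height and width and doubles the number of feature maps.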

In [3]:
def identity_block(x, n_filters):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([shortcut, x])
  x = ReLU()(x)

  return x

In [7]:
def projection_block(x, n_filters, strides=(2, 2)):
  """
  Create Block of Convolutions with feature pooling
  Increase the number of filters by 4X
  x        : input into the block
  n_filters: number of filters
  """
  # 1 × 1 projection convolution on shortcut to match size of output
  shortcut = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=strides)(x)
  shortcut = BatchNormalization()(shortcut)

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=strides)(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([x, shortcut])
  x = ReLU()(x)

  return x

In [None]:
# The input tensor  
inputs = Input(shape=(224, 224, 3))

# First convolutional layer, where pooled feature maps will be reduced by 75%
x = ZeroPadding2D(padding=(3, 3))(inputs)
x = Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding="valid")(x)
x = BatchNormalization()(x)
x = ReLU()(x)
x = ZeroPadding2D(padding=(1, 1))(x)
x = MaxPool2D(pool_size=(3, 3), strides=(2, 2))(x)

# Every convolutional group starts with a projection block. The first one uses a
# stride of 1 because max pooling has already reduced the feature maps.
x = projection_block(x, 64, strides=(1, 1))

# First identity block group of 64 filters
for _ in range(2):
  x = identity_block(x, 64)
x = projection_block(x, 128)

# Second identity block group of 128 filters
for _ in range(3):
  x = identity_block(x, 128)
x = projection_block(x, 256)

# Third identity block group of 256 filters
for _ in range(5):
  x = identity_block(x, 256)
x = projection_block(x, 512)

# Fourth identity block group of 512 filters
for _ in range(2):
  x = identity_block(x, 512)

x = GlobalAveragePooling2D()(x)

# Output layer for classification (1000 classes)
outputs = Dense(1000, activation="softmax")(x)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.summary()

ResNet50 v1.5 introduced a refactoring of the bottleneck design. The feature pooling
(strides = 2) in the residual block with linear projection is
moved from the first 1 × 1 convolution to the 3 × 3 convolution. A 1 × 1 convolution with
a stride of 2 skips three out of every four input positions, so this change lets the block
use all of its input, increasing results on ImageNet by about 0.5% at a small additional
computational cost.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/6.png?raw=1' width='800'/>

In [13]:
def projection_block(x, n_filters, strides=(2, 2)):
  """
  Create Block of Convolutions with feature pooling
  Increase the number of filters by 4X
  x        : input into the block
  n_filters: number of filters
  """
  # 1 × 1 projection convolution on shortcut to match size of output
  shortcut = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=strides)(x)
  shortcut = BatchNormalization()(shortcut)

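  # v1.5: the first 1 × 1 convolution keeps a stride of 1
  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  # v1.5: the feature pooling (stride) moves to the 3 × 3 convolution
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=strides, padding="same")(x)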
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([x, shortcut])
  x = ReLU()(x)

  return x

ResNet50 v2 introduced preactivation batch normalization (BN-RE-Conv), in which the batch normalization and activation functions are placed before (instead of after) the corresponding convolution or dense layer. 

This has now become a common practice, as depicted here for implementation of the residual block with the identity link in v2:

In [14]:
def identity_block(x, n_filters):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x

  # Batchnorm before the convolution
  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)

  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)

  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1))(x)

  # Matrix addition of the input to the output; with preactivation there is
  # no ReLU after the add
  x = layers.add([shortcut, x])

  return x