<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/3_resnet_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##ResNet networks

The researchers who designed the residual block, the core design pattern of the
residual network, proposed a novel layer connection they called an identity link.

The identity
link introduced the earliest concept of feature reuse. Prior to the identity link,
each convolutional block did feature extraction on the previous convolutional output,
without retaining any knowledge from prior outputs. The identity link can be seen as
a coupling between the current and prior convolutional outputs to reuse feature
information gained from earlier extraction.

Using identity links along with batch normalization provided more stability across
layers, reducing vanishing and exploding gradients as well as divergence between
layers. This allowed model architectures to go deeper and improve prediction
accuracy.

ResNet34 introduced a new block layer and a new layer-connection pattern: residual
blocks and identity connections, respectively. A residual block in ResNet34 consists
of two identical convolutional layers without a pooling layer. Each block has
an identity connection that creates a parallel path between the input of the residual
block and its output.

<img src='images/4.png?raw=1' width='800'/>

One of the problems with neural networks is that as we add deeper layers (under the
presumption of increasing accuracy), their performance can degrade. It can get
worse, not better. This occurs for several reasons. As we go deeper, we are adding
more parameters (weights). The more parameters there are, the more places each
training example can fit to the excess parameters. Instead of generalizing, the neural
network will simply learn each training example (rote memorization).

The other issue
is covariate shift: the distribution of the weights will widen (spread further apart) as we
go deeper, making it more difficult for the neural network to converge.
The former degrades performance on the test (holdout) data. The latter degrades
performance on the training data and can also cause vanishing or exploding gradients.

Residual blocks allow neural networks to be built with deeper layers without a
degradation in performance on the test data. A ResNet block could be viewed as a
VGG block with the addition of the identity link.

While the VGG-style part of the block performs
feature detection, the identity link retains the input for the next
block, so the input to the next block consists of both the detected features
and the previous input.

By retaining information from the past (previous input), this block design allows
neural networks to go deeper than the VGG counterpart, with an increase in accuracy.
Mathematically, we could represent the VGG and ResNet as follows.

VGG: $h(x)=f(x, {W})$

ResNet:  $h(x)=f(x, {W}) + x$
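One way to see why the added $x$ helps deeper networks train is to differentiate both forms with respect to the input. This is a standard observation about residual connections, added here as an aside rather than taken from the text above; $I$ denotes the identity.

$$\text{VGG: } \frac{\partial h}{\partial x} = \frac{\partial f(x, {W})}{\partial x} \qquad\qquad \text{ResNet: } \frac{\partial h}{\partial x} = \frac{\partial f(x, {W})}{\partial x} + I$$

Even when $\partial f / \partial x$ is small, the identity term gives the gradient a direct path back to earlier layers.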

In the code below, the variable x represents the output of a
layer, which is the input to the next layer. At the beginning of the block, we retain a
copy of the previous block/layer output as the variable shortcut. We then pass the
previous block/layer output (x) through two convolutional layers, each time taking
the output from the previous layer as input into the next layer. Finally, the last output
from the block (retained in the variable x) is added (matrix addition) to the original
value of x (shortcut).

```python
shortcut = x  # Remember the input to the block.
x = layers.Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same')(x)
x = layers.ReLU()(x)
x = layers.Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding='same')(x)
x = layers.ReLU()(x)   # The output of the convolutional sequence
x = layers.add([shortcut, x])  # Matrix addition of the input to the output
```

The ResNet architectures take as input a (224, 224, 3) tensor: an RGB image
(3 channels) of 224 (height) × 224 (width) pixels. The first layer is a basic convolutional
layer, consisting of a convolution using a fairly large filter size of 7 × 7. The output
(feature maps) is then reduced in size by a max pooling layer.
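The size reduction in these entry layers can be checked with a little arithmetic. The sketch below assumes "same" padding and a stride of 2 for both the 7 × 7 convolution and the max pooling layer (the values used in the code later in this notebook). `same_out` is a hypothetical helper written for this illustration, not a Keras function.

```python
import math

def same_out(size, stride):
    """Output length along one spatial axis when padding='same'."""
    return math.ceil(size / stride)

height = width = 224

# 7 x 7 convolution with stride 2
height, width = same_out(height, 2), same_out(width, 2)
after_conv = (height, width)

# Max pooling with stride 2
height, width = same_out(height, 2), same_out(width, 2)
after_pool = (height, width)

print(after_conv)  # (112, 112)
print(after_pool)  # (56, 56)
```

Each stride-2 step halves the height and the width, which is a 75% reduction in the number of positions per feature map.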

After the initial convolutional layer is a succession of groups of residual blocks.
Each successive group doubles the number of filters (similar to VGG). Unlike VGG,
though, there is no pooling layer between the groups that would reduce the size of
the feature maps.

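If we connected the groups directly to each other, we would have a problem. The
input to the first residual block of a group has the number of feature maps produced
by the previous group (call it X), while the output of that block, with the filters
doubled, has twice as many (2X). The identity link would
attempt to add the input matrix (X) and the output matrix (2X). Yikes, we get an
error, indicating we can’t broadcast (for the add operation) matrices of different sizes.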

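For ResNet, this is solved by adding a convolutional block between each pair of
residual block groups where the number of filters doubles. The convolutional block
doubles the number of filters and uses a stride of 2 to reduce the feature map size by 75%.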
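The mismatch, and the way a strided convolution removes it, can be traced with plain shape tuples. This is only a sketch: `conv_same` and `can_add` are hypothetical helpers that mimic how a "same"-padded convolution changes a (height, width, channels) shape and how an element-wise add requires identical shapes. A single stride-2 convolution stands in for the convolutional block.

```python
import math

def conv_same(shape, n_filters, stride=1):
    """Shape after a 'same'-padded convolution; shape is (height, width, channels)."""
    height, width, _ = shape
    return (math.ceil(height / stride), math.ceil(width / stride), n_filters)

def can_add(a, b):
    """An element-wise add needs both shapes to match exactly."""
    return a == b

group_out = (56, 56, 64)  # output of the 64-filter residual group

# Next group applied directly: two 128-filter convolutions, then the add.
direct = conv_same(conv_same(group_out, 128), 128)
print(direct, can_add(group_out, direct))     # (56, 56, 128) False

# With a bridging convolution (128 filters, stride 2) in front of the group.
bridged = conv_same(group_out, 128, stride=2)
residual = conv_same(conv_same(bridged, 128), 128)
print(bridged, can_add(bridged, residual))    # (28, 28, 128) True
```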

<img src='images/5.png?raw=1' width='800'/>

The output of the last residual block group is passed to a pooling and flattening layer
(GlobalAveragePooling2D), which is then passed to a single Dense layer of 1000
nodes (number of classes).
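What the pooling and flattening step computes can be illustrated without Keras. The sketch below uses a hypothetical `global_average_pool` helper on nested Python lists: it averages each feature map over its height and width, so an (H, W, C) tensor becomes a vector of C values. It shows the arithmetic only and is not the Keras implementation.

```python
def global_average_pool(feature_maps):
    """feature_maps: nested list indexed as [row][col][channel]."""
    height = len(feature_maps)
    width = len(feature_maps[0])
    channels = len(feature_maps[0][0])
    return [
        sum(feature_maps[i][j][k] for i in range(height) for j in range(width))
        / (height * width)
        for k in range(channels)
    ]

# A 2 x 2 spatial grid with 3 channels
fmap = [
    [[1.0, 10.0, 0.0], [2.0, 20.0, 0.0]],
    [[3.0, 30.0, 0.0], [4.0, 40.0, 4.0]],
]
pooled = global_average_pool(fmap)
print(pooled)  # [2.5, 25.0, 1.0]
```

The result has one value per channel, so it can be fed straight into a Dense layer without a separate Flatten step.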

Let’s now put the whole network together, using a procedural style. Additionally, we
need to add the entry convolutional layer of ResNet and then the DNN classifier.

##Setup

```python
import tensorflow.keras.layers as layers
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, ReLU, Activation
from tensorflow.keras.layers import Conv2D, MaxPool2D, Flatten, GlobalAveragePooling2D
```

##ResNet using procedural style

```python
def residual_block(n_filters, x):
  """
  Create a Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x  # Remember the input to the block
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  x = layers.add([shortcut, x])  # Add the input to the output of the convolutions
  return x

def conv_block(n_filters, x):
  """
  Create Block of Convolutions without Pooling
  n_filters: number of filters
  x        : input into the block
  """
  # Only the first convolution is strided, so the block halves the height and width once
  # (75% fewer positions per feature map)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(2, 2), padding="same", activation="relu")(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  return x
```

```python
# The input tensor
inputs = Input(shape=(224, 224, 3))

# First convolutional layer, where pooled feature maps will be reduced by 75%
x = Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding="same", activation="relu")(inputs)
x = MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding="same")(x)

# First residual block group of 64 filters
for _ in range(2):
  x = residual_block(64, x)

# Doubles the size of filters and reduces feature maps by 75% (stride s = 2, 2) to fit the next residual group
x = conv_block(128, x)

# Second residual block group of 128 filters
for _ in range(3):
  x = residual_block(128, x)

# Doubles the size of filters and reduces feature maps to fit the next residual group
x = conv_block(256, x)

# Third residual block group of 256 filters
for _ in range(5):
  x = residual_block(256, x)

# Doubles the size of filters and reduces feature maps to fit the next residual group
x = conv_block(512, x)

# Fourth residual block group of 512 filters
# (two blocks, which brings the total to the roughly 21 million parameters quoted for this model)
for _ in range(2):
  x = residual_block(512, x)

# Pool each feature map down to a single value and flatten
x = GlobalAveragePooling2D()(x)

# Output layer for classification (1000 classes)
outputs = Dense(1000, activation="softmax")(x)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

```
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
conv2d_29 (Conv2D)              (None, 112, 112, 64) 9472        input_2[0][0]                    
__________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)  (None, 56, 56, 64)   0           conv2d_29[0][0]                  
__________________________________________________________________________________________________
conv2d_30 (Conv2D)              (None, 56, 56, 64)   36928       max_pooling2d_1[0][0]            
____________________________________________________________________________________________
```

We see that the total number of parameters to learn
is 21 million. This is in contrast to the VGG16, which has 138 million parameters. So
the ResNet architecture has roughly six times fewer parameters to train. This reduction
is mostly achieved by the construction of the residual blocks.
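The 21 million figure can be reproduced by hand. The tally below is a sketch under stated assumptions: residual groups of 2, 3, 5, and 2 blocks with two 3 × 3 convolutions each, a two-convolution block between groups, a 7 × 7 entry convolution on 3 input channels, and a 1000-node output layer. `conv_params` is a hypothetical helper, not part of Keras.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one Conv2D layer with a square kernel."""
    return kernel * kernel * c_in * c_out + c_out

total = conv_params(7, 3, 64)  # entry 7 x 7 convolution
c_in = 64

# (filters, residual blocks in the group); the group sizes are an assumption
groups = [(64, 2), (128, 3), (256, 5), (512, 2)]
for i, (filters, n_blocks) in enumerate(groups):
    if i > 0:
        # Convolutional block that doubles the filters between groups
        total += conv_params(3, c_in, filters) + conv_params(3, filters, filters)
    # Each residual block holds two 3 x 3 convolutions
    total += n_blocks * 2 * conv_params(3, filters, filters)
    c_in = filters

total += c_in * 1000 + 1000  # Dense(1000) classifier

print(f"{total:,}")  # 21,542,376
```

The first two terms, 9,472 and 36,928, match the `Param #` column for the first two convolutions in the summary above.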

Notice that the DNN backend is
just a single output Dense layer. In effect, there is no backend. The early residual
block groups act as the CNN frontend doing the feature detection, while the later
residual blocks perform the classification.

In doing so, unlike in VGG, there was no
need for several fully connected dense layers, which would have substantially
increased the number of parameters.
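To put a number on that difference, the sketch below counts only the dense-layer parameters of each classifier. It assumes the VGG16 head shown later in this notebook (7 × 7 × 512 feature maps flattened, then layers of 4096, 4096, and 1000 nodes) and a ResNet head that maps 512 globally averaged features to 1000 nodes. `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one Dense layer."""
    return n_in * n_out + n_out

# VGG16 classifier: flattened 7 x 7 x 512 feature maps, then 4096, 4096, 1000 nodes
vgg_head = (dense_params(7 * 7 * 512, 4096)
            + dense_params(4096, 4096)
            + dense_params(4096, 1000))

# ResNet classifier: 512 globally averaged features straight to 1000 nodes
resnet_head = dense_params(512, 1000)

print(f"{vgg_head:,}")     # 123,642,856
print(f"{resnet_head:,}")  # 513,000
```

Under these assumptions, the dense layers account for most of the 138 million parameters in VGG16.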

##VGG using procedural style

For comparison, let’s now code VGG16 using a procedural reuse
style. In this example, we create a procedure (function) conv_block(), which builds
the convolutional blocks and takes as parameters the number of layers in the block (2 or 3), and number of filters (64, 128, 256, or 512).

Note that we keep the first convolutional
layer outside conv_block. The first layer needs the input_shape parameter. We
could have coded this as a flag to conv_block, but since it would occur only one time,
that’s not reuse.

```python
def conv_block(n_layers, n_filters):
  """
  n_layers : number of convolutional layers
  n_filters: number of filters
  """
  # Layers are appended to the module-level Sequential model defined below
  for n in range(n_layers):
    model.add(Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu"))
  # MaxPool2D is the name imported in the Setup cell
  model.add(MaxPool2D(pool_size=(2, 2), strides=(2, 2)))

model = Sequential()

# First convolutional specified separately since it requires the input_shape parameter
model.add(Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu", input_shape=(224, 224, 3)))

# Remainder of first convolutional block
conv_block(1, 64)
# Second through fifth convolutional blocks
conv_block(2, 128)
conv_block(3, 256)
conv_block(3, 512)
conv_block(3, 512)

# DNN backend
model.add(Flatten())
model.add(Dense(4096, activation="relu"))
model.add(Dense(4096, activation="relu"))

# Output layer for classification (1000 classes)
model.add(Dense(1000, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_23 (Conv2D)           (None, 224, 224, 64)      1792      
_________________________________________________________________
conv2d_24 (Conv2D)           (None, 224, 224, 64)      36928     
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 112, 112, 64)      0         
_________________________________________________________________
conv2d_25 (Conv2D)           (None, 112, 112, 128)     73856     
_________________________________________________________________
conv2d_26 (Conv2D)           (None, 112, 112, 128)     147584    
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 56, 56, 128)       0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 56, 56, 256)      