<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/training_fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Training fundamentals

Let’s start with an overview of supervised training. When training a model, you feed
data forward through the model, and compute how incorrect the predicted results
are—the loss. Then the loss is backward-propagated to make updates to the model’s
parameters, which is what the model is learning—the values for the parameters.

When training a model, you start with training data that’s representative of the target
environment where the model will be deployed. That data, in other words, is a
sampling distribution of a population distribution. The training data consists of examples.
Each example has two parts: the features, also referred to as independent variables;
and corresponding labels, also referred to as the dependent variable.

The labels are also known as the ground truths (the “correct answers”). Our goal is
to train a model that generalizes: once deployed and given examples without labels
from the population (examples the model has never seen before), it can accurately
predict the label (the “correct answer”). This is supervised learning, and the
prediction step is known as inference.

During training, we feed batches (also called samples) of the training data to the
model through the input layer (also referred to as the bottom of the model). The training
data is then transformed by the parameters (weights and biases) in the layers of
the model as it moves forward toward the output nodes (also referred to as the top of
the model).

At the output nodes, we measure how far away we are from the “correct”
answers, which, again, is called the loss. We then backward-propagate the loss through
the layers of the model and update the parameters to be closer to getting the correct
answer on the next batch.

We continue to repeat this process until we reach convergence, which could be
described as “this is as accurate as we can get on this training run.”

**Feeding**

Feeding is the process of sampling batches from the training data and forward-feeding
the batches through the model, and then calculating the loss at the output. A batch
can be one or more examples from the training data chosen at random.

The size of the batch is typically constant, which is referred to as the (mini) batch
size. All the training data is split into batches, and typically each example will appear in
only one batch.

All of the training data is fed multiple times to the model. Each time we feed the
entire training data, it is called an epoch.
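
The feeding loop described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: `iterate_minibatches` is a hypothetical helper (not part of TF.Keras), and the toy arrays stand in for real training data.

```python
import numpy as np

def iterate_minibatches(x, y, batch_size, rng):
    """Yield (features, labels) batches that cover the data once, in random order."""
    order = rng.permutation(len(x))              # new random order for this epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]    # the last batch may be smaller
        yield x[idx], y[idx]

# Toy data: 10 examples with 2 features each (illustrative values only)
x = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10)
rng = np.random.default_rng(0)

batch_size = 4
batches_per_epoch = int(np.ceil(len(x) / batch_size))

for epoch in range(2):                           # each full pass over the data is one epoch
    seen = []
    for x_batch, y_batch in iterate_minibatches(x, y, batch_size, rng):
        seen.extend(y_batch.tolist())            # a real loop would forward-feed the batch here
```

Within one epoch every example lands in exactly one batch, which is the property the text describes.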

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/1.png?raw=1' width='800'/>

##Setup

In [10]:
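from tensorflow.keras.datasets import mnist, cifar10
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras import layers
from tensorflow.keras.layers import Flatten, Dense, MaxPooling2D, Conv2D, Dropout
# Layers used by the ResNet50 examples later in the notebook
from tensorflow.keras.layers import (Input, ZeroPadding2D, MaxPool2D,
                                     BatchNormalization, ReLU, GlobalAveragePooling2D)
from tensorflow.keras.utils import to_categorical

import numpy as np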

##Backward propagation

After each batch of training data is forward-fed through the model and the loss is calculated,
the loss is backward-propagated through the model. We go layer by layer
updating the model’s parameters (weights and biases), starting at the top layer
(output) and moving toward the bottom layer (input). How the parameters are
updated is a combination of the loss, the values of the current parameters, and the
updates made to the preceding layer (the one closer to the output).

The general method for doing this is based on gradient descent. The optimizer is an
implementation of gradient descent whose job is to update the parameters to minimize
the loss (that is, to get closer to the correct answer) on subsequent batches.
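
As a minimal sketch of the idea, the following fits a single weight and bias with plain gradient descent on a mean-squared-error loss. It is not the Adam optimizer used elsewhere in this notebook; the toy target (`w = 2`, `b = 1`), the learning rate, and the step count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(64, 1))
y = 2.0 * x + 1.0                      # toy target the model should recover

w, b = 0.0, 0.0                        # parameters start far from the answer
learning_rate = 0.1
losses = []

for step in range(500):
    pred = w * x + b                               # forward feed
    error = pred - y
    losses.append(float(np.mean(error ** 2)))      # loss: mean squared error
    grad_w = float(np.mean(2.0 * error * x))       # gradient of the loss w.r.t. w
    grad_b = float(np.mean(2.0 * error))           # gradient of the loss w.r.t. b
    w -= learning_rate * grad_w                    # step against the gradient
    b -= learning_rate * grad_b
```

Each step moves the parameters a small distance in the direction that lowers the loss, so `losses` shrinks as `w` and `b` approach the target values.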

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/2.png?raw=1' width='800'/>

##Dataset splitting

A dataset is a collection of examples that are large and diverse enough to be representative
of the population being modeled (the sampling distribution). When a dataset
meets this definition and is cleaned (not noisy), and in a format that’s ready for machine
learning training, we refer to it as a curated dataset.

Once you have a curated dataset, the next step is to split it into examples that will
be used for training and those that will be used for testing (also called evaluation or
holdout). We train the model with the portion of the dataset that is the training data. If
we assume the training data is a good sampling distribution (representative of the
population distribution), the accuracy on the training data should reflect the accuracy
of real-world predictions after deployment, on examples from the population not
seen by the model during training.

Historically, the rule of thumb has been 80/20: 80% for training and 20% for testing.

###Training and test sets

What is important is that we can assume our dataset is large enough that, if we
split it into 80% and 20% with the examples chosen at random, both datasets
will be good sampling distributions, representative of the population distribution
on which the model will make predictions (inference) after it’s deployed.
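
A random 80/20 split can be sketched as follows. `split_dataset` is a hypothetical helper written for this illustration, and the integer arrays are placeholder data; built-in datasets such as MNIST arrive presplit, so this step is only needed for your own data.

```python
import numpy as np

def split_dataset(x, y, test_fraction=0.2, seed=0):
    """Randomly split examples into training and test portions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))                  # random choice of examples
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (x[train_idx], y[train_idx]), (x[test_idx], y[test_idx])

# Placeholder dataset of 100 examples
features = np.arange(100).reshape(100, 1)
labels = np.arange(100)
(x_tr, y_tr), (x_te, y_te) = split_dataset(features, labels)
```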

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/3.png?raw=1' width='800'/>



In [2]:
# Built-in dataset is automatically randomly shuffled and presplit into training and testing data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
(60000, 28, 28) (60000,)
(10000, 28, 28) (10000,)


###One-hot encoding

Let’s build a simple DNN to train on our curated dataset. We
start by flattening the 28 × 28 image input into a 1D vector by using the Flatten layer,
which is then followed by two hidden Dense() layers of 512 nodes each, each using
the convention of a relu activation function. Finally, the output layer is a Dense layer
with 10 nodes, one for each digit. Since this is a multiclass classifier, the activation
function for the output layer is a softmax.

Next, we compile the model for the convention for multiclass classifiers by using
`categorical_crossentropy` for the loss and adam for the optimizer:

In [None]:
model = Sequential()
# Flattens the 2D grayscale image into 1D vector for a DNN
model.add(Flatten(input_shape=(28, 28)))
# The actual input layer of the DNN, once the image is flattened
model.add(Dense(512, activation="relu"))
# A hidden layer
model.add(Dense(512, activation="relu"))
# The output layer of the DNN
model.add(Dense(10, activation="softmax"))

# compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

The most basic way to train this model with this dataset is to use the fit() method. We
will pass as parameters the training data (x_train, y_train). We will keep the
remaining keyword parameters set to their defaults:

In [None]:
#model.fit(x_train, y_train)

```
ValueError: Shapes (32, 1) and (32, 10) are incompatible
```

In [None]:
y_train[:5]

array([5, 0, 4, 1, 9], dtype=uint8)

What went wrong? This is an issue with the loss function we chose. It will compare the
difference between each output node and the corresponding expected output. For
example, if the answer is the digit 3, we need a 10-element vector (one element per
digit) with a 1 (100% probability) in the 3 index and 0s (0% probability) in the remaining
indexes. In this case, we need to convert the scalar-value labels into 10-element vectors
with a 1 in the corresponding index. This is known as one-hot encoding.
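
To make the encoding concrete, here is a NumPy-only sketch. `one_hot` is a hypothetical helper for illustration and is not the TF.Keras function; the labels are the first five MNIST training labels printed above.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn scalar class labels into rows with a single 1 at the label's index."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

encoded = one_hot([5, 0, 4, 1, 9], num_classes=10)
```

Each row of `encoded` has ten elements, with a 1 at the index of the digit and 0s elsewhere.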

Let’s fix our example by first importing the `to_categorical()` function from
TF.Keras and then using it to convert the scalar-value labels to one-hot-encoded labels.

Note that we pass the value 10 to `to_categorical()` to indicate the size of the
one-hot-encoded labels (number of classes):

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/4.png?raw=1' width='800'/>

In [None]:
# One-hot-encodes the training and testing labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
print(y_train[0])

[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]


In [None]:
model.fit(x_train, y_train)



<keras.callbacks.History at 0x7f21907bf1d0>

That works, and we got 90% accuracy on the training data—but we can simplify this
step. The compile() method has one-hot encoding built into it. To enable it, we just
change the loss function from `categorical_crossentropy` to
`sparse_categorical_crossentropy`.

In this mode, the loss function will receive the labels as scalar values and
dynamically convert them to one-hot-encoded labels before performing the crossentropy
loss calculation.
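
The equivalence of the two losses can be checked with a small NumPy sketch. These two functions are simplified stand-ins written for illustration; the real TF.Keras losses handle details such as numerical clipping that are omitted here, and the probabilities are made-up values.

```python
import numpy as np

def categorical_crossentropy(one_hot_labels, probs):
    """Cross-entropy per example when labels are one-hot vectors."""
    return -np.sum(one_hot_labels * np.log(probs), axis=1)

def sparse_categorical_crossentropy(int_labels, probs):
    """Cross-entropy per example when labels are scalar class indexes."""
    return -np.log(probs[np.arange(len(int_labels)), int_labels])

# Made-up softmax outputs for two examples and three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
int_labels = np.array([0, 1])
one_hot_labels = np.eye(3)[int_labels]

dense_loss = categorical_crossentropy(one_hot_labels, probs)
sparse_loss = sparse_categorical_crossentropy(int_labels, probs)
```

Both calls give the same per-example loss, because only the probability at the true class index contributes to the sum.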

In [None]:
model = Sequential()
# Flattens the 2D grayscale image into 1D vector for a DNN
model.add(Flatten(input_shape=(28, 28)))
# The actual input layer of the DNN, once the image is flattened
model.add(Dense(512, activation="relu"))
# A hidden layer
model.add(Dense(512, activation="relu"))
# The output layer of the DNN
model.add(Dense(10, activation="softmax"))

# compile the model
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
# Loads MNIST dataset into memory
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
# Trains MNIST model for 10 epochs
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2190612f90>

##Data normalization

Generally, when data feeds forward through the
layers and the outputs of one layer are matrix-multiplied against the parameters of the
next layer, the result is a very small number.

The problem with our preceding example is that the input values are substantially
larger (up to 255), which will produce large numbers initially as they are multiplied
through the layers. This will result in taking longer for the parameters to learn their
optimal values—if they learn them at all.


###Normalization

We can increase the speed at which the parameters learn the optimal values and
increase our chances of convergence (discussed subsequently) by squashing the input
values into a smaller range. One simple way to do this is to squash them proportionally
into a range from 0 to 1. We can do this by dividing each value by 255.

By default, NumPy does floating-point operations as double precision (64 bits). By
default, the parameters in a TF.Keras model are single-precision floating-point (32
bits). 
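
The precision difference is easy to observe directly. In this small sketch the three pixel values are arbitrary examples:

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.uint8)   # raw 8-bit pixel values
scaled = pixels / 255.0                              # division promotes to float64
scaled32 = scaled.astype(np.float32)                 # cast down to single precision
```

Here `scaled` is a 64-bit array, while `scaled32` holds the same values in half the memory.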

For efficiency, as a last step, we convert the result of the broadcasted division to
32 bits by using the NumPy astype() method. If we did not do the conversion, the initial
matrix multiplication from the input to the input layer would take double the number
of machine cycles (64 × 32 instead of 32 × 32):

In [None]:
model = Sequential()
# Flattens the 2D grayscale image into 1D vector for a DNN
model.add(Flatten(input_shape=(28, 28)))
# The actual input layer of the DNN, once the image is flattened
model.add(Dense(512, activation="relu"))
# A hidden layer
model.add(Dense(512, activation="relu"))
# The output layer of the DNN
model.add(Dense(10, activation="softmax"))

# compile the model
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
# Normalizes the pixel data from 0 to 1
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = (x_train / 255.0).astype(np.float32)
x_test = (x_test / 255.0).astype(np.float32)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [None]:
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2209ed2f50>

Let’s compare the output for the normalized input with the output for the prior
non-normalized input. With the prior input, we
reached 97% accuracy after the 10th epoch. With the normalized input, we reach the
same accuracy after just the second epoch, and almost 99.5% accuracy after the tenth.

Thus, we learned faster and more accurately when we normalized the input data.

The evaluate() method operates in inference mode: the test data is forward-fed
through the model to make predictions, but there is no backward propagation. The
model’s parameters are not updated.

In [None]:
model.evaluate(x_test, y_test)



[0.10212759673595428, 0.9799000024795532]

We see that the test accuracy is about 98%, compared to the training
accuracy of 99.5%. This is expected. Some overfitting always occurs during training.
What we are looking for is a very small difference between the training and testing
accuracy, and in this case it’s about 1.5%.

###Standardization

Another technique, called standardization, is
considered to give a better result. However, it requires a pre-analysis (scan) over the
entire input data to find its mean and standard deviation. You then center the data at
the mean of the full distribution of the input data and divide by the standard deviation,
so that each value is expressed in units of standard deviations from the mean.

The following code, which implements standardization when
the input data is in memory as a NumPy multidimensional array, uses the NumPy
methods np.mean() and np.std():

In [None]:
# Calculates the mean value for the pixel data
mean = np.mean(x_train)
# Calculates the standard deviation for the pixel data
std = np.std(x_train)

# Standardization of the pixel data using the mean and standard deviation
x_train = ((x_train - mean) / std).astype(np.float32)
x_test = ((x_test - mean) / std).astype(np.float32)

In [None]:
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2209de2410>

In [None]:
model.evaluate(x_test, y_test)



[0.1867692768573761, 0.9796000123023987]
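
As a sanity check on what standardization does to the data, the following sketch applies the same arithmetic to randomly generated stand-in pixels (not MNIST). It assumes, as in the code above, that the mean and standard deviation come from the training data only and are reused for the test data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in images with pixel values 0..255 (not real data)
train = rng.integers(0, 256, size=(1000, 28, 28)).astype(np.float32)
test = rng.integers(0, 256, size=(200, 28, 28)).astype(np.float32)

mean, std = train.mean(), train.std()      # statistics from the training data only
train_std = (train - mean) / std
test_std = (test - mean) / std             # reuse the training statistics
```

After the transform the training data has a mean near 0 and a standard deviation near 1, and individual values can still fall outside the range of -1 to 1.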

##Validation and overfitting

Typically, to get higher accuracy, we build larger and larger models. One consequence
is that the model can rote-memorize some or all of the examples. The
model learns the examples instead of learning to generalize from the examples to
accurately predict examples it never saw during training. 

In an extreme case, a model
could achieve 100% training accuracy yet have random accuracy on the test data (for 10
classes, that would be 10% accuracy).

###Validation

Let’s say training the model takes several hours. Do you really want to wait until the
end of training and then test on the test data to learn whether the model overfitted?
Of course not. Instead, we set aside a small portion of the training data, which we call
validation data.

We don’t train the model with the validation data. Instead, after each epoch, we
use the validation data to estimate the likely result on the test data. Like the test data, the validation data is forward-fed through the model without updating the model’s
parameters (inference mode), and we measure the loss and accuracy. 
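
Setting aside validation data can be sketched as a simple slice. `holdout_split` is a hypothetical helper for illustration; it mimics the documented behavior of the Keras `validation_split` argument, which takes the held-out fraction from the end of the arrays before any shuffling, though the exact rounding in Keras may differ.

```python
import numpy as np

def holdout_split(x, y, validation_split=0.1):
    """Hold out the last fraction of the examples for validation."""
    n_val = int(len(x) * validation_split)
    split_at = len(x) - n_val
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

# Placeholder data: 50 examples
features = np.arange(50).reshape(50, 1)
labels = np.arange(50)
(x_fit, y_fit), (x_val, y_val) = holdout_split(features, labels)
```

Only `x_fit` and `y_fit` would be used to update parameters; `x_val` and `y_val` are forward-fed after each epoch to measure loss and accuracy.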

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/5.png?raw=1' width='800'/>

If a dataset is very small, and using even less data for training has a negative impact, we
can use cross-validation.

At the beginning of each epoch, the examples for validation are randomly selected and
not used for training for that epoch, and instead used for the validation test. But since
the selection is random, some or all of the examples will appear in the training data
for other epochs. Today’s datasets are large, so you seldom see the need for this
technique.
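
The per-epoch selection described above can be sketched as follows. `epoch_split` is a hypothetical helper that implements the scheme exactly as the text states it (a fresh random holdout every epoch); the example counts and the validation fraction are arbitrary.

```python
import numpy as np

def epoch_split(n_examples, val_fraction, rng):
    """Pick a fresh random validation subset for one epoch."""
    order = rng.permutation(n_examples)
    n_val = int(n_examples * val_fraction)
    return order[n_val:], order[:n_val]      # (training indexes, validation indexes)

rng = np.random.default_rng(0)
val_sets = []
for epoch in range(3):
    train_idx, val_idx = epoch_split(20, 0.2, rng)
    val_sets.append(set(val_idx.tolist()))
    # Within an epoch, no example is used for both training and validation
    disjoint = set(train_idx.tolist()).isdisjoint(val_sets[-1])
```

Because the holdout is redrawn every epoch, an example held out in one epoch can be trained on in another, so the small dataset is still fully used over the course of training.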

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/6.png?raw=1' width='800'/>

Next, we will train a simple CNN to classify images from the CIFAR-10 dataset. This
dataset is a subset of a larger dataset of tiny images, of size 32 × 32 × 3. It consists of 50,000
training and 10,000 test images (60,000 in total) covering 10 classes: airplane, automobile, bird, cat,
deer, dog, frog, horse, ship, and truck.

In our simple CNN, we have one convolutional layer of 32 filters with kernel size
3 × 3, followed by a strided max pooling layer. The output is then flattened and passed
to the final outputting dense layer.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/7.png?raw=1' width='800'/>

In [6]:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
model.add(Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 30, 30, 32)        896       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 15, 15, 32)       0         
 )                                                               
                                                                 
 flatten (Flatten)           (None, 7200)              0         
                                                                 
 dense (Dense)               (None, 10)                72010     
                                                                 
Total params: 72,906
Trainable params: 72,906
Non-trainable params: 0
_________________________________________________________________
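
The shapes and parameter counts in the summary can be reproduced by hand. In this sketch, `conv2d_params` and `dense_params` are hypothetical helper names; the arithmetic assumes a stride of 1 with no padding for the convolution and a 2 × 2 pooling window with a stride of 2, as in the model above.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights per filter plus one bias, times the number of filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return (inputs + 1) * units

conv_out = 32 - 3 + 1                    # 30: 3 x 3 kernel, stride 1, no padding
pool_out = conv_out // 2                 # 15: 2 x 2 pooling with stride 2
flat = pool_out * pool_out * 32          # 7200 values after flattening

conv_p = conv2d_params(3, 3, 3, 32)      # 896
dense_p = dense_params(flat, 10)         # 72010
total = conv_p + dense_p                 # 72906
```

Almost all of the parameters sit in the final dense layer, since every one of the 7,200 flattened values connects to each of the 10 output nodes.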


Here, we’ve added the keyword parameter validation_split=0.1 to the fit()
method to set aside 10% of the training data for validation testing after each epoch.

In [9]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

x_train = (x_train / 255.0).astype(np.float32)
x_test = (x_test / 255.0).astype(np.float32)

# Uses 10% of training data for validation—not trained on
model.fit(x_train, y_train, epochs=15, validation_split=0.1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fc51e282dd0>

You can see that after the
fourth epoch, the training and validation accuracy are essentially the same. But after
the fifth epoch, we start to see them spread apart (65% versus 61%). By the 15th epoch,
the spread is very large (74% versus 63%). Our model clearly started overfitting
around the fifth epoch.
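
Spotting that spread can be automated with a small check. `first_overfit_epoch` and its `max_gap` threshold are hypothetical, and the accuracy lists below are invented numbers for illustration only, not the output of this notebook. In TF.Keras, per-epoch values like these would typically be read from the `history` attribute of the object returned by `fit()`, under the `"accuracy"` and `"val_accuracy"` keys.

```python
def first_overfit_epoch(train_acc, val_acc, max_gap=0.03):
    """Return the first 1-based epoch where training accuracy exceeds
    validation accuracy by more than max_gap, or None if it never does."""
    for epoch, (t, v) in enumerate(zip(train_acc, val_acc), start=1):
        if t - v > max_gap:
            return epoch
    return None

# Invented per-epoch accuracies (illustration only)
train_acc = [0.45, 0.55, 0.60, 0.62, 0.65, 0.68]
val_acc = [0.46, 0.54, 0.59, 0.61, 0.61, 0.62]

onset = first_overfit_epoch(train_acc, val_acc)
```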

Let’s now work on getting the model to not overfit to the examples and instead generalize
from them. To do so, we add some regularization (some noise) during training
so the model cannot rote-memorize the training
examples.

In this code example, we modify our model by adding 50% dropout before
the final dense layer. Because dropout will slow our learning (because of forgetting),
we increase the number of epochs to 20:

In [11]:
model = Sequential()
model.add(Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)))
model.add(MaxPooling2D(2, 2))
model.add(Flatten())
# Adds noise to training to prevent overfitting
model.add(Dropout(0.5))
model.add(Dense(10, activation="softmax"))

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Uses 10% of training data for validation—not trained on
model.fit(x_train, y_train, epochs=20, validation_split=0.1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fc5228ea290>

Thus, the
model is learning to generalize instead of rote-memorizing the training examples.
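
What a dropout layer does to its input can be sketched in NumPy. This `dropout` function is a simplified stand-in, not the TF.Keras layer: it assumes the common "inverted dropout" formulation, in which kept values are scaled by `1 / (1 - rate)` during training and the input passes through unchanged at inference.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Randomly zero a fraction `rate` of the values during training only."""
    if not training:
        return x                                   # inference: no noise added
    keep = rng.random(x.shape) >= rate             # True for values that survive
    return np.where(keep, x / (1.0 - rate), 0.0)   # rescale the survivors

rng = np.random.default_rng(0)
x = np.ones((4, 1000), dtype=np.float32)

train_out = dropout(x, 0.5, training=True, rng=rng)
infer_out = dropout(x, 0.5, training=False, rng=rng)
```

With a rate of 0.5, roughly half of the values are zeroed on each training pass and the rest are doubled, so the expected sum seen by the next layer stays about the same.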

###Loss monitoring

ResNet50 is a well-known model, which is commonly reused as a stock model, such as
for transfer learning, as shared layers in object detection, and for performance benchmarking.

ResNet50 v1 formalized the concept of a convolutional group. This is a set of convolutional
blocks that share a common configuration, such as the number of filters. In v1,
the neural network is decomposed into groups, and each group doubles the number
of filters from the previous group.

Additionally, the concept of a separate convolution block to double the number of
filters was removed and replaced by a residual block that uses linear projection. Each
group starts with a residual block using linear projection on the identity link to double
the number of filters, while the remaining residual blocks pass the input directly to the output for the matrix add operation. Additionally, the first 1 × 1 convolution in
the residual block with linear projection uses a stride of 2 (feature pooling), which is
also known as a strided convolution, reducing the feature map sizes by 75%.
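
The 75% figure follows from halving both the height and the width. In this arithmetic sketch, `conv_output_size` is a hypothetical helper and the 56 × 56 input size is an arbitrary example:

```python
def conv_output_size(size, kernel, stride):
    """Output length along one axis for a convolution with no padding."""
    return (size - kernel) // stride + 1

h = w = 56                                   # example input feature map size
out_h = conv_output_size(h, kernel=1, stride=2)
out_w = conv_output_size(w, kernel=1, stride=2)

# Fraction of feature map positions removed by the strided 1 x 1 convolution
reduction = 1 - (out_h * out_w) / (h * w)
```

Each axis shrinks from 56 to 28, so only a quarter of the positions remain.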

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/5.png?raw=1' width='800'/>

In [None]:
def identity_block(x, n_filters):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([shortcut, x])
  x = ReLU()(x)

  return x

In [None]:
def projection_block(x, n_filters, strides=(2, 2)):
  """
  Create Block of Convolutions with feature pooling
  Increase the number of filters by 4X
  x        : input into the block
  n_filters: number of filters
  """
  # 1 × 1 projection convolution on shortcut to match size of output
  shortcut = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=strides)(x)
  shortcut = BatchNormalization()(shortcut)

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=strides)(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([x, shortcut])
  x = ReLU()(x)

  return x

In [None]:
# The input tensor  
inputs = Input(shape=(224, 224, 3))

# First convolutional layer, where pooled feature maps will be reduced by 75%
x = ZeroPadding2D(padding=(3, 3))(inputs)
x = Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding="valid")(x)
x = BatchNormalization()(x)
x = ReLU()(x)
x = ZeroPadding2D(padding=(1, 1))(x)
x = MaxPool2D(pool_size=(3, 3), strides=(2, 2))(x)

# Every convolutional group starts with a projection block; the first one uses strides of 1 (no feature pooling).
x = projection_block(x, 64, strides=(1, 1))

# First identity block group of 64 filters
for _ in range(2):
  x = identity_block(x, 64)
x = projection_block(x, 128)

# Second identity block group of 128 filters
for _ in range(3):
  x = identity_block(x, 128)
x = projection_block(x, 256)

# Third identity block group of 256 filters
for _ in range(5):
  x = identity_block(x, 256)
x = projection_block(x, 512)

# Fourth identity block group of 512 filters
for _ in range(2):
  x = identity_block(x, 512)

x = GlobalAveragePooling2D()(x)

# Output layer for classification (1000 classes)
outputs = Dense(1000, activation="softmax")(x)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.summary()

ResNet50 v1.5 introduced a refactoring of the bottleneck design that
further reduced computational complexity while maintaining representational
power. The feature pooling (strides = 2) in the residual block with linear projection is
moved from the first 1 × 1 convolution to the 3 × 3 convolution, reducing computational
complexity and increasing results on ImageNet by 0.5%.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/6.png?raw=1' width='800'/>

In [None]:
def projection_block(x, n_filters, strides=(2, 2)):
  """
  Create Block of Convolutions with feature pooling
  Increase the number of filters by 4X
  x        : input into the block
  n_filters: number of filters
  """
  # 1 × 1 projection convolution on shortcut to match size of output
  shortcut = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=strides)(x)
  shortcut = BatchNormalization()(shortcut)

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  # v1.5: feature pooling (strides) is applied in the 3 × 3 convolution
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=strides, padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([x, shortcut])
  x = ReLU()(x)

  return x

ResNet50 v2 introduced preactivation batch normalization (BN-ReLU-Conv), in which the batch normalization and activation functions are placed before (instead of after) the corresponding convolution or dense layer.

This has now become a common practice, as depicted here for implementation of the residual block with the identity link in v2:

In [None]:
def identity_block(x, n_filters):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x

  # Batchnorm before the convolution
  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)

  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)

  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1))(x)

  # Matrix addition of the input to the output; with preactivation, no ReLU follows the add
  x = layers.add([shortcut, x])

  return x