<a href="https://colab.research.google.com/github/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/training_fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Training fundamentals

Let’s start with an overview of supervised training. When training a model, you feed
data forward through the model, and compute how incorrect the predicted results
are—the loss. Then the loss is backward-propagated to make updates to the model’s
parameters, which is what the model is learning—the values for the parameters.

When training a model, you start with training data that’s representative of the target
environment where the model will be deployed. That data, in other words, is a
sampling distribution of a population distribution. The training data consists of examples.
Each example has two parts: the features, also referred to as independent variables;
and corresponding labels, also referred to as the dependent variable.

The labels are also known as the ground truths (the “correct answers”). Our goal is
to train a model that generalizes: once deployed and given examples without labels
from the population (examples the model has never seen before), it can accurately
predict the label (the “correct answer”). This is supervised learning, and the
prediction step is known as inference.

During training, we feed batches (also called samples) of the training data to the
model through the input layer (also referred to as the bottom of the model). The training
data is then transformed by the parameters (weights and biases) in the layers of
the model as it moves forward toward the output nodes (also referred to as the top of
the model).

At the output nodes, we measure how far away we are from the “correct”
answers, which, again, is called the loss. We then backward-propagate the loss through
the layers of the model and update the parameters to be closer to getting the correct
answer on the next batch.

We continue to repeat this process until we reach convergence, which could be
described as “this is as accurate as we can get on this training run.”

**Feeding**

Feeding is the process of sampling batches from the training data and forward-feeding
the batches through the model, and then calculating the loss at the output. A batch
can be one or more examples from the training data chosen at random.

The size of the batch is typically constant, which is referred to as the (mini) batch
size. All the training data is split into batches, and typically each example will appear in
only one batch.

All of the training data is fed multiple times to the model. Each time we feed the
entire training data, it is called an epoch.
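
The feeding process described above can be sketched in plain Python. This is a minimal illustration, not how TF.Keras implements batching internally; the helper name `iterate_minibatches`, the toy data, and the batch size of 4 are assumptions made for the example.

```python
import random

def iterate_minibatches(examples, batch_size, rng):
    """Yield shuffled mini-batches that together cover every example once."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)  # a new random order for each pass
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)    # fixed seed so the run is repeatable
data = list(range(10))    # ten toy "examples"
epochs_seen = []
for epoch in range(2):    # each full pass over the data is one epoch
    batches = list(iterate_minibatches(data, batch_size=4, rng=rng))
    epochs_seen.append(batches)
```

Because 10 is not divisible by 4, the last batch of each epoch holds only two examples. Every example still appears in exactly one batch per epoch.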

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/1.png?raw=1' width='800'/>

##Setup

In [17]:
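from tensorflow.keras import Input, Model, layers
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
# Layers used by the DNN and by the convolutional examples later in the notebook
from tensorflow.keras.layers import (Flatten, Dense, Conv2D, MaxPool2D, ZeroPadding2D,
                                     BatchNormalization, ReLU, GlobalAveragePooling2D)
from tensorflow.keras.utils import to_categorical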

##Backward propagation

After each batch of training data is forward-fed through the model and the loss is calculated,
the loss is backward-propagated through the model. We go layer by layer
updating the model’s parameters (weights and biases), starting at the top layer
(output) and moving toward the bottom layer (input). How the parameters are
updated is a combination of the loss, the values of the current parameters, and the
updates made to the preceding layer (the one above it, which was updated just before).

The general method for doing this is based on gradient descent. The optimizer is an
implementation of gradient descent whose job is to update the parameters to minimize
the loss (that is, to get closer to the correct answer) on subsequent batches.
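
As a rough sketch of what gradient descent does, the following fits a single weight `w` so that `w * x` matches toy data generated from `y = 2x`. The data, the mean squared error loss, and the learning rate of 0.05 are assumptions chosen for illustration; real optimizers such as Adam add further refinements.

```python
# Toy data generated from y = 2x, so the ideal weight is 2.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse_loss(w):
    """Mean squared error of the predictions w * x against the labels."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(w):
    """Derivative of the loss with respect to w: mean(2 * x * (w * x - y))."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.0                 # initial parameter value
learning_rate = 0.05
losses = []
for step in range(50):
    losses.append(mse_loss(w))            # forward pass: measure the loss
    w -= learning_rate * mse_gradient(w)  # step against the gradient
```

Each update moves `w` in the direction that lowers the loss, so the recorded losses shrink and `w` approaches 2.0.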

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/2.png?raw=1' width='800'/>

##Dataset splitting

A dataset is a collection of examples that is large and diverse enough to be representative
of the population being modeled (the sampling distribution). When a dataset
meets this definition and is cleaned (not noisy), and in a format that’s ready for machine
learning training, we refer to it as a curated dataset.

Once you have a curated dataset, the next step is to split it into examples that will
be used for training and those that will be used for testing (also called evaluation or
holdout). We train the model with the portion of the dataset that is the training data. If
we assume the training data is a good sampling distribution (representative of the
population distribution), the accuracy on the training data should reflect the accuracy
the deployed model achieves on real-world examples from the population that it did
not see during training.

Historically,
the rule of thumb has been 80/20: 80% for training and 20% for testing.

###Training and test sets

What is important is that we can assume our dataset is large enough that, if we
split it into 80% and 20% with the examples randomly chosen, both datasets will be
good sampling distributions, representative of the population distribution that
the model will make predictions (inference) on after it’s deployed.
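
A random 80/20 split can be sketched as follows. The helper `split_dataset` is a hypothetical name introduced for this example, and the list of integers stands in for real examples.

```python
import random

def split_dataset(examples, test_fraction, seed):
    """Shuffle a copy of the examples, then cut it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # random choice of examples
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(100))  # stand-in for 100 curated examples
train, test = split_dataset(examples, test_fraction=0.2, seed=42)
```

Shuffling before cutting is what makes both portions random samples of the same dataset; no example ends up in both.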

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/3.png?raw=1' width='800'/>



In [18]:
# Built-in dataset is automatically randomly shuffled and presplit into training and testing data.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(60000, 28, 28) (60000,)
(10000, 28, 28) (10000,)


###One-hot encoding

Let’s build a simple DNN to train on our curated dataset. We
start by flattening the 28 × 28 image input into a 1D vector by using the Flatten layer,
which is then followed by two hidden Dense() layers of 512 nodes each, each using
the convention of a relu activation function. Finally, the output layer is a Dense layer
with 10 nodes, one for each digit. Since this is a multiclass classifier, the activation
function for the output layer is a softmax.
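
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per node. The helper `dense_params` is a name introduced for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Flattened 28 x 28 input, two hidden layers of 512 nodes, 10 output nodes
layer_sizes = [28 * 28, 512, 512, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [401920, 262656, 5130] 669706
```

The Flatten layer contributes no parameters; it only reshapes the input.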

Next, we compile the model for the convention for multiclass classifiers by using
`categorical_crossentropy` for the loss and adam for the optimizer:

In [19]:
model = Sequential()
# Flattens the 2D grayscale image into 1D vector for a DNN
model.add(Flatten(input_shape=(28, 28)))
# The actual input layer of the DNN, once the image is flattened
model.add(Dense(512, activation="relu"))
# A hidden layer
model.add(Dense(512, activation="relu"))
# The output layer of the DNN
model.add(Dense(10, activation="softmax"))

# compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

The most basic way to train this model with this dataset is to use the fit() method. We
will pass as parameters the training data (x_train, y_train). We will keep the
remaining keyword parameters set to their defaults:

In [20]:
#model.fit(x_train, y_train)

```
ValueError: Shapes (32, 1) and (32, 10) are incompatible
```

In [21]:
y_train[:5]

array([5, 0, 4, 1, 9], dtype=uint8)

What went wrong? This is an issue with the loss function we chose. It compares
each output node with the corresponding expected output. For
example, if the answer is the digit 3, we need a 10-element vector (one element per
digit) with a 1 (100% probability) in the 3 index and 0s (0% probability) in the remaining
indexes. In this case, we need to convert the scalar-value labels into 10-element vectors
with a 1 in the corresponding index. This is known as one-hot encoding.
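
To see what the encoding does by hand, here is a minimal sketch in plain Python. The function `one_hot` is a hypothetical helper for illustration, not the TF.Keras implementation.

```python
def one_hot(label, num_classes):
    """Return a vector of zeros with a 1.0 at the index of the label."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

labels = [5, 0, 4, 1, 9]  # the first five MNIST training labels shown above
encoded = [one_hot(label, 10) for label in labels]
print(encoded[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```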

Let’s fix our example by using the `to_categorical()` function from TF.Keras
(imported in the Setup section) to convert the scalar-value labels to one-hot-encoded labels.

Note that we pass the value 10 to `to_categorical()` to indicate the size of the
one-hot-encoded labels (number of classes):

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/4-training-fundamentals/images/4.png?raw=1' width='800'/>

In [22]:
# One-hot-encodes the training and testing labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
print(y_train[0])

[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]


In [23]:
model.fit(x_train, y_train)



<keras.callbacks.History at 0x7f21907bf1d0>

That works, and we got 90% accuracy on the training data, but we can simplify this
step. TF.Keras provides a loss function with one-hot encoding built into it. To use it, we just
change the loss function passed to compile() from `categorical_crossentropy` to
`sparse_categorical_crossentropy`.

In this mode, the loss function will receive the labels as scalar values and
dynamically convert them to one-hot-encoded labels before performing the cross-entropy
loss calculation.
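
The two forms of the loss give the same value for a single example, which the following sketch checks. The function names and the three-class probabilities are assumptions for illustration; this is not the TF.Keras code.

```python
import math

def dense_cross_entropy(one_hot_label, probs):
    """Cross-entropy against a one-hot label vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def sparse_cross_entropy(label, probs):
    """Cross-entropy against a scalar label: only the true class matters."""
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]  # assumed softmax output for one example
dense = dense_cross_entropy([0.0, 1.0, 0.0], probs)
sparse = sparse_cross_entropy(1, probs)
```

Because every term of the one-hot sum is zero except the one for the true class, both calls reduce to the negative log of the probability assigned to that class.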

In [26]:
model = Sequential()
# Flattens the 2D grayscale image into 1D vector for a DNN
model.add(Flatten(input_shape=(28, 28)))
# The actual input layer of the DNN, once the image is flattened
model.add(Dense(512, activation="relu"))
# A hidden layer
model.add(Dense(512, activation="relu"))
# The output layer of the DNN
model.add(Dense(10, activation="softmax"))

# compile the model
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [27]:
# Loads MNIST dataset into memory
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [28]:
# Trains MNIST model for 10 epochs
model.fit(x_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f2190612f90>

##Data normalization

ResNet50 introduced a variation of the residual block referred to as the bottleneck
residual block. In this version, the group of two 3 × 3 convolutional layers is replaced by
a group of 1 × 1, then 3 × 3, and then 1 × 1 convolutional layers. 

The first 1 × 1 convolution
performs a dimensionality reduction, reducing the computational complexity,
and the last convolution restores the dimensionality, increasing the number of filters
by a factor of 4. 

The middle 3 × 3 convolution is referred to as the bottleneck convolution,
like the neck of a bottle. 

The bottleneck residual block allows
for deeper neural networks without degradation, and a further reduction in computational
complexity.
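
To make the reduction concrete, the sketch below counts convolution weights for a pair of 3 × 3 convolutions and for a 1 × 1, 3 × 3, 1 × 1 bottleneck. The channel counts (256 in and out, 64 inside the bottleneck) are assumptions chosen for illustration, and biases are ignored.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (no biases)."""
    return kernel * kernel * in_channels * out_channels

# Two 3 x 3 convolutions operating on 256 channels throughout
plain = 2 * conv_weights(3, 256, 256)

# Bottleneck: reduce to 64 channels, extract features, expand by 4x to 256
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

print(plain, bottleneck)  # 1179648 69632
```

Under these assumptions the bottleneck uses roughly one-seventeenth of the weights while still producing 256 output channels.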

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/4.png?raw=1' width='800'/>

In [24]:
def bottleneck_block(n_filters, x):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x
  # A 1 × 1 bottleneck convolution for dimensionality reduction
  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
  # A 3 × 3 convolution for feature extraction
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  # A 1 × 1 projection convolution for dimensionality expansion
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
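  # Matrix addition of the input to the output; this only works when the
  # input x already has n_filters * 4 channels, matching the last convolution
  x = layers.add([shortcut, x])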

  return x

In [25]:
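# Imports repeated here so this cell runs on its own
from tensorflow.keras import Input, Model, layers
from tensorflow.keras.layers import Conv2D, MaxPool2D, GlobalAveragePooling2D

# Assumed minimal versions of the two helper blocks that this cell calls but never
# defines (hypothetical reconstructions, not necessarily the original definitions)
def residual_block(n_filters, x):
  shortcut = x
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same", activation="relu")(x)
  # Matrix addition of the input to the output
  return layers.add([shortcut, x])

def conv_block(n_filters, x):
  # A single strided 3 × 3 convolution: changes the filter count and halves height and width
  return Conv2D(n_filters, kernel_size=(3, 3), strides=(2, 2), padding="same", activation="relu")(x)

# The input tensor
inputs = Input(shape=(224, 224, 3))

# First convolutional layer, where pooled feature maps will be reduced by 75%
x = Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding="same", activation="relu")(inputs)
x = MaxPool2D(pool_size=(3, 3), strides=(2, 2), padding="same")(x)

# First residual block group of 64 filters
for _ in range(2):
  x = residual_block(64, x)

# Doubles the size of filters and reduces feature maps by 75% (stride s = 2, 2) to fit the next residual group
x = conv_block(128, x)

# Second residual block group of 128 filters
for _ in range(3):
  x = residual_block(128, x)

x = conv_block(256, x)

# Third residual block group of 256 filters
for _ in range(5):
  x = residual_block(256, x)

x = conv_block(512, x)

# Fourth residual block group of 512 filters
x = residual_block(512, x)

x = GlobalAveragePooling2D()(x)

# Output layer for classification (1000 classes)
outputs = Dense(1000, activation="softmax")(x)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()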

Residual blocks introduced the concepts of representational power and representational
equivalence. Representational power is a measure of how powerful a block is as a
feature extractor. Representational equivalence is the idea that a block can be factored
into a lower computational complexity, while maintaining representational power.

The design of the residual bottleneck block was demonstrated to maintain representational
power of the ResNet34 block, with a lower computational complexity.

##Batch normalization

Another problem with adding deeper layers in a neural network is the vanishing gradient
problem. In part, this is about how computer hardware represents numbers. During training
(the process of backward propagation and gradient descent), the gradient passed back through
each layer is multiplied by small numbers, often numbers less than 1. As you know, two numbers
less than 1 multiplied together make an even smaller number. When these tiny
values are propagated through deeper layers, they continuously get smaller. At some
point, the computer hardware can’t represent the value anymore; hence, the vanishing
gradient.
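
The effect can be reproduced with ordinary Python floats. In this sketch, the factor of 0.1 per layer is an assumption chosen to make the shrinkage obvious; real per-layer factors vary.

```python
gradient = 1.0
steps_until_zero = None
for layer in range(1, 1001):
    gradient *= 0.1          # assumed per-layer shrink factor
    if gradient == 0.0:      # the float can no longer represent the value
        steps_until_zero = layer
        break
```

Python floats are 64-bit, so it takes a few hundred multiplications before the value underflows to exactly zero. The 32-bit floats commonly used in training have a much narrower range and underflow far sooner.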

Batch normalization is a technique applied to the output of a layer (before or after
the activation function). Without going into the statistics aspect, it normalizes the
layer’s outputs across each batch, which limits how much the values shift as the weights
are being trained. This has several advantages: it smooths out
(across a batch) the amount of change, thus reducing the chance of getting a number
so small that it can’t be represented by the hardware. Additionally, by narrowing
the amount of shift, a higher learning rate can be used, so convergence can happen
sooner, reducing the overall amount of training time. Batch normalization
is added to a layer in TF.Keras with the BatchNormalization class.

```python
model = Sequential()
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding='same',
                 input_shape=(128, 128, 3)))
# Adds batchnorm before the activation
model.add(BatchNormalization())
model.add(ReLU())
model.add(Flatten())
model.add(Dense(4096))
model.add(ReLU())
# Adds batchnorm after the activation
model.add(BatchNormalization())
```
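
As a rough illustration of the normalization step itself, the sketch below rescales one batch of values to zero mean and unit variance. `batch_normalize` is a hypothetical helper, not the TF.Keras implementation, which additionally learns a scale and a shift and tracks moving statistics for use at inference.

```python
import math

def batch_normalize(values, epsilon=1e-5):
    """Rescale a batch of values to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon guards against division by zero when the variance is tiny
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]

batch = [10.0, 200.0, 3000.0, 40.0]  # assumed layer outputs with a wide spread
normalized = batch_normalize(batch)
```

Whatever the scale of the incoming values, the normalized batch sits in a small, predictable range, which is what keeps the values passed to the next layer from drifting.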



##ResNet50

ResNet50 is a well-known model, which is commonly reused as a stock model, such as
for transfer learning, as shared layers in object detection, and for performance benchmarking.

ResNet50 v1 formalized the concept of a convolutional group. This is a set of convolutional
blocks that share a common configuration, such as the number of filters. In v1,
the neural network is decomposed into groups, and each group doubles the number
of filters from the previous group.

Additionally, the concept of a separate convolution block to double the number of
filters was removed and replaced by a residual block that uses linear projection. Each
group starts with a residual block using linear projection on the identity link to double
the number of filters, while the remaining residual blocks pass the input directly to the output for the matrix add operation. Additionally, the first 1 × 1 convolution in
the residual block with linear projection uses a stride of 2 (feature pooling), which is
also known as a strided convolution, reducing the feature map sizes by 75%.
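
The 75% figure follows from simple arithmetic: a stride of 2 halves both the height and the width, leaving one quarter of the positions. The sketch below checks this for a 56 × 56 feature map, a size assumed for illustration. For a 1 × 1 kernel, "valid" and "same" padding give the same output size, the ceiling of the input size divided by the stride.

```python
import math

def strided_output_size(size, stride):
    """Output height or width of a 1 x 1 convolution with the given stride."""
    return math.ceil(size / stride)

height = width = 56  # assumed input feature map size
out_h = strided_output_size(height, 2)
out_w = strided_output_size(width, 2)
reduction = 1 - (out_h * out_w) / (height * width)
print(out_h, out_w, reduction)  # 28 28 0.75
```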

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/5.png?raw=1' width='800'/>

In [None]:
def identity_block(x, n_filters):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([shortcut, x])
  x = ReLU()(x)

  return x

In [None]:
def projection_block(x, n_filters, strides=(2, 2)):
  """
  Create Block of Convolutions with feature pooling
  Increase the number of filters by 4X
  x        : input into the block
  n_filters: number of filters
  """
  # 1 × 1 projection convolution on shortcut to match size of output
  shortcut = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=strides)(x)
  shortcut = BatchNormalization()(shortcut)

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=strides)(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([x, shortcut])
  x = ReLU()(x)

  return x

In [None]:
# The input tensor  
inputs = Input(shape=(224, 224, 3))

# First convolutional layer, where pooled feature maps will be reduced by 75%
x = ZeroPadding2D(padding=(3, 3))(inputs)
x = Conv2D(64, kernel_size=(7, 7), strides=(2, 2), padding="valid")(x)
x = BatchNormalization()(x)
x = ReLU()(x)
x = ZeroPadding2D(padding=(1, 1))(x)
x = MaxPool2D(pool_size=(3, 3), strides=(2, 2))(x)

# Every convolutional group starts with a projection block; the first one uses
# strides of (1, 1), so no feature pooling happens there.
x = projection_block(x, 64, strides=(1, 1))

# First identity block group of 64 filters
for _ in range(2):
  x = identity_block(x, 64)
x = projection_block(x, 128)

# Second identity block group of 128 filters
for _ in range(3):
  x = identity_block(x, 128)
x = projection_block(x, 256)

# Third identity block group of 256 filters
for _ in range(5):
  x = identity_block(x, 256)
x = projection_block(x, 512)

# Fourth identity block group of 512 filters
for _ in range(2):
  x = identity_block(x, 512)

x = GlobalAveragePooling2D()(x)

# Output layer for classification (1000 classes)
outputs = Dense(1000, activation="softmax")(x)

model = Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.summary()

ResNet50 v1.5 introduced a refactoring of the bottleneck design that
further reduces computational complexity while maintaining representational
power. The feature pooling (strides = 2) in the residual block with linear projection is
moved from the first 1 × 1 convolution to the 3 × 3 convolution, reducing computational
complexity and improving results on ImageNet by 0.5%.

<img src='https://github.com/rahiakela/computer-vision-research-and-practice/blob/main/deep-learning-patterns-and-practices/3-convolutional-and-residual-neural-networks/images/6.png?raw=1' width='800'/>

In [None]:
def projection_block(x, n_filters, strides=(2, 2)):
  """
  Create Block of Convolutions with feature pooling
  Increase the number of filters by 4X
  x        : input into the block
  n_filters: number of filters
  """
  # 1 × 1 projection convolution on shortcut to match size of output
  shortcut = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=strides)(x)
  shortcut = BatchNormalization()(shortcut)

  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)

  # In v1.5 the feature pooling (strided convolution) happens in the 3 × 3 convolution
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=strides, padding="same")(x)
  x = BatchNormalization()(x)
  x = ReLU()(x)
  
  x = Conv2D(4 * n_filters, kernel_size=(1, 1), strides=(1, 1))(x)
  x = BatchNormalization()(x)

  # Matrix addition of the input to the output
  x = layers.add([x, shortcut])
  x = ReLU()(x)

  return x

ResNet50 v2 introduced preactivation batch normalization (BN-RE-Conv), in which the batch normalization and activation functions are placed before (instead of after) the corresponding convolution or dense layer. 

This has now become a common practice, as depicted here for implementation of the residual block with the identity link in v2:

In [None]:
def identity_block(x, n_filters):
  """
  Create a Bottleneck Residual Block of Convolutions
  n_filters: number of filters
  x        : input into the block
  """
  shortcut = x

  # Batchnorm before the convolution
  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters, kernel_size=(1, 1), strides=(1, 1))(x)

  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)

  x = BatchNormalization()(x)
  x = ReLU()(x)
  x = Conv2D(n_filters * 4, kernel_size=(1, 1), strides=(1, 1))(x)

  # Matrix addition of the input to the output; with preactivation there is no
  # ReLU after the add, since the next block begins with batchnorm and ReLU
  x = layers.add([shortcut, x])

  return x