
# Training Deep Neural Networks

[Here](https://github.com/victorviro/Deep_learning_python/blob/master/Introduction_artificial_neural_networks.ipynb) we introduced artificial neural networks and trained our first deep neural networks. But they were shallow nets, with just a few hidden layers. What if we need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images? We may need to train a much deeper DNN, perhaps with 10 layers or many more, each containing hundreds of neurons, linked by hundreds of thousands of connections. Training such a deep network isn’t a walk in the park. Here are some of the problems we could run into:

- We may be faced with the tricky *vanishing gradients* problem or the related *exploding gradients* problem. This is when the gradients grow smaller and smaller, or larger and larger, as they flow backward through the DNN during training. Both of these problems make lower layers very hard to train. See notebook [The vanishing/exploding gradients problem](https://github.com/victorviro/Deep_learning_python/blob/master/Vanishing_Exploding_gradients_problem_DNNs.ipynb).

- We might not have enough training data for such a large network, or it might be too costly to label.

- Training may be extremely slow.

- A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

In this notebook, we will look at transfer learning and unsupervised pretraining, which can help us tackle complex tasks even when we have little labeled data.

## Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, we should always try to find an existing neural network that accomplishes a similar task to the one we are trying to tackle, then reuse the lower layers of this network. This technique is called *transfer learning*. It will not only speed up training considerably, but also require significantly less training data.

Suppose we have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. We now want to train a DNN to classify specific types of vehicles. These tasks are very similar, even partly overlapping, so we should try to reuse parts of the first network (see Figure 11-4).

![Reusing pretrained layers](https://i.ibb.co/nCjDHHb/reuse-pretrained-layers.png)

**Note**: If the input pictures of our new task don’t have the same size as the ones used in the original task, we will usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.
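
As a rough illustration of such a preprocessing step, here is a minimal NumPy sketch that resizes a batch of grayscale images by nearest-neighbor sampling. The function name `resize_nearest` and the 28x28 to 32x32 sizes are assumptions made for this example; in practice we would more likely rely on a library routine with proper interpolation, such as `tf.image.resize`.

```python
import numpy as np

def resize_nearest(images, new_height, new_width):
    """Resize a batch of grayscale images (n, height, width) by nearest-neighbor sampling."""
    _, height, width = images.shape
    # For each output row/column, pick the index of the source row/column it maps to.
    rows = np.arange(new_height) * height // new_height
    cols = np.arange(new_width) * width // new_width
    # Broadcasting the two index arrays selects a (new_height, new_width) grid per image.
    return images[:, rows[:, None], cols[None, :]]

# Hypothetical case: the new task has 28x28 images, the original model expects 32x32.
batch = np.random.rand(5, 28, 28).astype(np.float32)
resized = resize_nearest(batch, 32, 32)
print(resized.shape)  # (5, 32, 32)
```

Nearest-neighbor sampling keeps the sketch short; it only changes the image size and does nothing to make the low-level features of the two tasks more alike.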

The output layer of the original model should usually be replaced because it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. We want to find the right number of layers to reuse.

**Note**: The more similar the tasks are, the more layers we want to reuse (starting with the lower layers). For very similar tasks, we can try keeping all the hidden layers and just replacing the output layer.

We try freezing all the reused layers first (i.e., make their weights non-trainable so that Gradient Descent won’t modify them), then we train our model and see how it performs. Then we try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data we have, the more layers we can unfreeze. It is also useful to reduce the learning rate when we unfreeze reused layers: this will avoid wrecking their fine-tuned weights.

If we still cannot get good performance, and we have little training data, we can try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. We can iterate until we find the right number of layers to reuse. If we have plenty of training data, we may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.

### Transfer learning with Keras

Let’s look at an example. Suppose the [Fashion MNIST dataset](https://www.kaggle.com/zalando-research/fashionmnist) only contained eight classes, for example, all the classes except for sandal and shirt. We are going to build and train a Keras model on that set, aiming for reasonably good performance. Let’s call this model A.

Let's split the Fashion MNIST training set in two:

- `X_train_A`: all images of all items except for sandals and shirts (classes 5 and 6).

- `X_train_B`: a much smaller training set of just the first 200 images of sandals or shirts.

The validation set and the test set are also split this way, but without restricting the number of images.

We will train a model on set A (classification task with 8 classes), and try to reuse it to tackle set B (binary classification, positive=shirt, negative=sandal). We hope to transfer a little bit of knowledge from task A to task B, since classes in set A (sneakers, ankle boots, coats, t-shirts, etc.) are somewhat similar to classes in set B (sandals and shirts). However, since we are using `Dense` layers, only patterns that occur at the same location can be reused (in contrast, convolutional layers will transfer much better, since learned patterns can be detected anywhere on the image).
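
The point about location can be made concrete with a toy NumPy sketch. It is only an illustration under simplified assumptions: a single "dense" unit is modeled as one weight per pixel, a "convolutional" unit as a small filter slid over the image, and the 6x6 images and the helper names `place` and `max_conv_response` are made up for this example.

```python
import numpy as np

pattern = np.ones((2, 2))

def place(pattern, top, left, size=6):
    """Return a size x size image of zeros with the pattern pasted at (top, left)."""
    image = np.zeros((size, size))
    image[top:top + pattern.shape[0], left:left + pattern.shape[1]] = pattern
    return image

image_a = place(pattern, 0, 0)  # pattern in the top-left corner
image_b = place(pattern, 3, 3)  # same pattern, shifted

# A dense unit has one weight per pixel: here it is tuned to the top-left pattern.
dense_weights = place(pattern, 0, 0)
dense_a = float(np.sum(dense_weights * image_a))
dense_b = float(np.sum(dense_weights * image_b))

def max_conv_response(image, kernel):
    """Slide the kernel over every position and return the strongest response."""
    kh, kw = kernel.shape
    h, w = image.shape
    return float(max(np.sum(image[i:i + kh, j:j + kw] * kernel)
                     for i in range(h - kh + 1)
                     for j in range(w - kw + 1)))

conv_a = max_conv_response(image_a, pattern)
conv_b = max_conv_response(image_b, pattern)

print(dense_a, dense_b)  # 4.0 0.0
print(conv_a, conv_b)    # 4.0 4.0
```

The dense unit stops responding as soon as the pattern moves, whereas the sliding filter responds equally at both locations.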

In [1]:
import keras
import numpy as np

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0

from sklearn.model_selection import train_test_split
# Split the data
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.1, shuffle=True)



In [2]:
def split_dataset(X, y):
    y_5_or_6 = (y == 5) | (y == 6) # sandals or shirts
    y_A = y[~y_5_or_6]
    y_A[y_A > 6] -= 2 # class indices 7, 8, 9 should be moved to 5, 6, 7
    y_B = (y[y_5_or_6] == 6).astype(np.float32) # binary classification task: is it a shirt (class 6)?
    return ((X[~y_5_or_6], y_A),
            (X[y_5_or_6], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

print(X_train_A.shape)
print(y_train_A[:30])
print(X_train_B.shape)
print(y_train_B[:30])

(43235, 28, 28)
[6 7 7 0 6 4 7 1 1 7 5 4 3 0 1 6 6 0 7 7 5 2 4 4 0 5 2 1 4 6]
(200, 28, 28)
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1.
 0. 0. 0. 0. 1. 0.]


Let's train model A.

In [3]:
model_A = keras.models.Sequential()
model_A.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_A.add(keras.layers.Dense(n_hidden, activation="selu"))
model_A.add(keras.layers.Dense(8, activation="softmax"))


# `learning_rate` replaces the deprecated `lr` argument name
model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

history = model_A.fit(X_train_A, y_train_A, epochs=20,
                    validation_data=(X_valid_A, y_valid_A))



Let's train a model for task B without reusing any layers of model A.

In [5]:
model_B = keras.models.Sequential()
model_B.add(keras.layers.Flatten(input_shape=[28, 28]))
for n_hidden in (300, 100, 50, 50, 50):
    model_B.add(keras.layers.Dense(n_hidden, activation="selu"))
model_B.add(keras.layers.Dense(1, activation="sigmoid"))

# `learning_rate` replaces the deprecated `lr` argument name
model_B.compile(loss="binary_crossentropy",
                optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                metrics=["accuracy"])

history = model_B.fit(X_train_B, y_train_B, epochs=20,
                      validation_data=(X_valid_B, y_valid_B))



Now we train a model for task B, reusing the layers of model A.

In [6]:
# reuse all layers except for the output layer
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Note that `model_A` and `model_B_on_A` now share some layers. When we train `model_B_on_A`, it will also affect `model_A`. If we want to keep an unmodified copy of model A, we need to clone `model_A` before training changes the shared layers. To do this, we clone model A’s architecture with `clone_model()`, then copy its weights (since `clone_model()` does not clone the weights):

In [7]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
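
Both claims, that reused layers are shared and that `clone_model()` does not copy weights, can be checked on a tiny stand-in model. This is only a sketch: the model `original` and its layer sizes are hypothetical, and overwriting the weights with zeros stands in for the effect of training.

```python
import numpy as np
import keras

# Tiny stand-in for model A (hypothetical sizes, not the notebook's model).
original = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(3, activation="relu"),
    keras.layers.Dense(2, activation="softmax"),
])

# clone_model() copies the architecture but initializes fresh weights.
clone = keras.models.clone_model(original)
differs_before_copy = not np.array_equal(original.get_weights()[0],
                                         clone.get_weights()[0])
clone.set_weights(original.get_weights())
equal_after_copy = np.array_equal(original.get_weights()[0],
                                  clone.get_weights()[0])

# A model built from original.layers reuses the very same layer objects.
shared = keras.Sequential(original.layers[:-1])
same_object = shared.layers[0] is original.layers[0]

# Changing the shared layer (here: zeroing its weights) also changes `original`,
# while the clone keeps the weights it was given.
hidden = shared.layers[0]
hidden.set_weights([np.zeros_like(w) for w in hidden.get_weights()])
original_changed = bool(np.all(original.get_weights()[0] == 0))
clone_changed = bool(np.all(clone.get_weights()[0] == 0))

print(differs_before_copy, equal_after_copy, same_object)
print(original_changed, clone_changed)
```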

Now we could train `model_B_on_A` for task B, but since the new output layer was initialized randomly, it will make large errors (at least during the first few epochs), so there will be large error gradients that may wreck the reused weights. To avoid this, one approach is to freeze the reused layers during the first few epochs, giving the new layer some time to learn reasonable weights. To do this, we set every reused layer’s `trainable` attribute to `False` and compile the model:

In [8]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

# `learning_rate` replaces the deprecated `lr` argument name
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])

**Note**: We must always compile our model after we freeze or unfreeze layers.
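
One way to check what freezing does is to count the weight tensors that training is allowed to update. The sketch below uses a small stand-in network; `base`, `new_model`, and the layer sizes are assumptions for illustration only. An explicit `keras.Input` is added so that the new output layer is built and its weights can be counted.

```python
import keras

# Hypothetical pretrained network: two hidden layers plus an output layer.
base = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(6, activation="relu"),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# Reuse the hidden layers and add a new binary output layer on top.
new_model = keras.Sequential(
    [keras.Input(shape=(8,))]
    + base.layers[:-1]
    + [keras.layers.Dense(1, activation="sigmoid")]
)

# Each Dense layer owns two weight tensors: a kernel and a bias.
n_all = len(new_model.trainable_weights)

for layer in new_model.layers[:-1]:
    layer.trainable = False          # freeze the reused layers
n_frozen = len(new_model.trainable_weights)

for layer in new_model.layers[:-1]:
    layer.trainable = True           # unfreeze them again
n_unfrozen = len(new_model.trainable_weights)

print(n_all, n_frozen, n_unfrozen)   # 6 2 6
```

With the reused layers frozen, only the kernel and bias of the new output layer remain trainable.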

Now we can train the model for a few epochs, then unfreeze the reused layers (which requires compiling the model again) and continue training to fine-tune the reused layers for task B. After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights:

In [9]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                           validation_data=(X_valid_B, y_valid_B))

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

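# Recompile after unfreezing. The learning rate is kept at 1e-3 in this run;
# a smaller value can be tried to better protect the reused weights.
model_B_on_A.compile(loss="binary_crossentropy",
                     optimizer=keras.optimizers.SGD(learning_rate=1e-3),
                     metrics=["accuracy"])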
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                           validation_data=(X_valid_B, y_valid_B))



In the run evaluated below, this model’s test accuracy is about 99.4%, compared with about 97.6% for `model_B`, which means that transfer learning reduced the error rate from roughly 2.4% to 0.6%.

In [None]:
print(model_B.evaluate(X_test_B, y_test_B))
print(model_B_on_A.evaluate(X_test_B, y_test_B))


[0.14917003136873244, 0.9764999747276306]
[0.06191555860638619, 0.9940000176429749]
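
To make the comparison explicit, the error rates can be worked out from the two accuracies printed above. This is plain arithmetic on the rounded values from this particular run; the variable names are chosen for the example.

```python
# Test accuracies copied (rounded) from the evaluation output above.
acc_from_scratch = 0.9765   # model_B
acc_transfer = 0.9940       # model_B_on_A

error_from_scratch = 1 - acc_from_scratch
error_transfer = 1 - acc_transfer
reduction_factor = error_from_scratch / error_transfer

print(f"error without transfer: {error_from_scratch:.2%}")
print(f"error with transfer:    {error_transfer:.2%}")
print(f"reduction factor:       {reduction_factor:.1f}x")
```

For this run the error rate drops from about 2.35% to about 0.60%, roughly a factor of 3.9. Results vary from run to run, since the train/validation split and the weight initialization are random.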


Note that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks. Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general (especially in the lower layers). We will revisit transfer learning for convnets, using the techniques we just discussed.

### Unsupervised Pretraining



Suppose we want to tackle a complex task for which we don’t have much labeled training data, but unfortunately we cannot find a model trained on a similar task. First, we should try to gather more labeled training data, but if we can’t, we may still be able to perform *unsupervised pretraining* (see Figure 11-5). Indeed, it is often cheap to gather unlabeled training examples, but expensive to label them. If we can gather plenty of unlabeled training data, we can try to use it to train an unsupervised model, such as an autoencoder or a generative adversarial network. Then we can reuse the lower layers of the autoencoder or the lower layers of the GAN’s discriminator, add the output layer for our task on top, and fine-tune the final network using supervised learning (i.e., with the labeled training examples).

![Unsupervised pretraining](https://i.ibb.co/NTGdKc1/unsupervides-pretraining.png)
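
A minimal sketch of this workflow with an autoencoder might look as follows. Everything concrete here is an assumption for illustration: the data is random noise standing in for real unlabeled and labeled sets, and the layer sizes, optimizer, and epoch counts are arbitrary.

```python
import numpy as np
import keras

rng = np.random.default_rng(0)
# Synthetic stand-ins: plenty of unlabeled data, only a few labeled instances.
X_unlabeled = rng.random((256, 20)).astype("float32")
X_labeled = rng.random((32, 20)).astype("float32")
y_labeled = (X_labeled.sum(axis=1) > 10).astype("float32").reshape(-1, 1)

encoder = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(5, activation="relu"),
])
decoder = keras.Sequential([
    keras.Input(shape=(5,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(20, activation="sigmoid"),
])

# Step 1: unsupervised pretraining, the autoencoder learns to reconstruct its inputs.
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(loss="mse", optimizer="adam")
autoencoder.fit(X_unlabeled, X_unlabeled, epochs=2, verbose=0)

# Step 2: reuse the encoder (the lower layers) and add an output layer for the task.
classifier = keras.Sequential([
    encoder,
    keras.layers.Dense(1, activation="sigmoid"),
])
classifier.compile(loss="binary_crossentropy", optimizer="adam",
                   metrics=["accuracy"])

# Step 3: supervised fine-tuning on the small labeled set.
classifier.fit(X_labeled, y_labeled, epochs=2, verbose=0)
print(classifier.predict(X_labeled, verbose=0).shape)  # (32, 1)
```

As with transfer learning, the encoder could also be frozen for the first few epochs so that the randomly initialized output layer does not disturb the pretrained weights.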

It is this technique that Geoffrey Hinton and his team used in 2006 and which led to the revival of neural networks and the success of Deep Learning. Until 2010, unsupervised pretraining, typically with [restricted Boltzmann machines](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine) (RBMs), was the norm for deep nets, and only after the vanishing gradients problem was alleviated did it become much more common to train DNNs purely using supervised learning. Unsupervised pretraining (today typically using autoencoders or GANs rather than RBMs) is still a good option when we have a complex task to solve, no similar model we can reuse, and little labeled training data but plenty of unlabeled training data.

Note that in the early days of Deep Learning it was difficult to train deep models, so people would use a technique called greedy layer-wise pretraining (depicted in Figure 11-5). They would first train an unsupervised model with a single layer, typically an RBM, then they would freeze that layer and add another one on top of it, then train the model again (effectively just training the new layer), then freeze the new layer and add another layer on top of it, train the model again, and so on. Nowadays, things are much simpler: people generally train the full unsupervised model in one shot (i.e., in Figure 11-5, just start directly at step three) and use autoencoders or GANs rather than RBMs.

### Pretraining on an Auxiliary Task

If we do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which we can easily obtain or generate labeled training data, then reuse the lower layers of that network for our actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if we want to build a system to recognize faces, we may only have a few pictures of each individual, clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. We could, however, gather a lot of pictures of random people on the web and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow us to train a good face classifier that uses little training data.
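
The training set for such an auxiliary task could be assembled roughly as below. The sketch assumes that each gathered picture comes with some identifier of the person shown (`person_ids` is a made-up example), and the helper `make_pairs` is hypothetical; it only builds index pairs and labels, not the network itself.

```python
import itertools
import random

def make_pairs(person_ids, seed=0):
    """Return (index_a, index_b, label) triples; label is 1 for the same person, else 0."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(len(person_ids)), 2))
    positives = [(a, b, 1) for a, b in all_pairs if person_ids[a] == person_ids[b]]
    negatives = [(a, b, 0) for a, b in all_pairs if person_ids[a] != person_ids[b]]
    # Keep as many negative pairs as positive ones so the two classes are balanced.
    negatives = rng.sample(negatives, min(len(negatives), len(positives)))
    pairs = positives + negatives
    rng.shuffle(pairs)
    return pairs

# Hypothetical identifiers for six pictures.
person_ids = ["ann", "ann", "bob", "bob", "bob", "cho"]
pairs = make_pairs(person_ids)
print(len(pairs))  # 8: four same-person pairs and four different-person pairs
```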

For *natural language processing* (NLP) applications, we can download a corpus of millions of text documents and automatically generate labeled data from it. For example, we could randomly mask out some words and train a model to predict what the missing words are (e.g., it should predict that the missing word in the sentence "What ___ you saying?" is probably "are" or "were"). If we can train a model to reach good performance on this task, then it will already know quite a lot about language, and we can certainly reuse it for our actual task and fine-tune it on our labeled data.
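
Generating such labeled examples from raw text takes only a few lines. The sketch below is a simplified illustration: it splits on whitespace instead of using a real tokenizer, masks a single word per sentence, and the function names and the `___` mask token are choices made for this example.

```python
import random

def mask_word(sentence, position, mask_token="___"):
    """Replace the word at `position` with the mask token; return (input, target)."""
    words = sentence.split()
    target = words[position]
    words[position] = mask_token
    return " ".join(words), target

def make_masked_examples(sentences, seed=0):
    """Mask one randomly chosen word in each sentence."""
    rng = random.Random(seed)
    return [mask_word(s, rng.randrange(len(s.split()))) for s in sentences]

print(mask_word("What are you saying?", 1))  # ('What ___ you saying?', 'are')

corpus = ["What are you saying?", "The model learns from unlabeled text."]
examples = make_masked_examples(corpus)
```

Each example is an (input, target) pair, so the resulting dataset can be fed to an ordinary supervised training loop.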

**Note**: *Self-supervised learning* is when we automatically generate the labels from the data itself, then we train a model on the resulting "labeled" dataset using supervised learning techniques. Since this approach requires no human labeling whatsoever, it is best classified as a form of unsupervised learning.

# References

- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

- [handson-ml2 GitHub repository](https://github.com/ageron/handson-ml2)

- [A Survey on Deep Transfer Learning](https://arxiv.org/abs/1808.01974)

- [A Comprehensive Survey on Transfer Learning](https://arxiv.org/abs/1911.02685)