# Introduction

We have seen how to train CNNs with a specified set of layers and hyperparameters, and we have also seen examples of state-of-the-art CNN architectures such as VGG and ResNet. These are the result of careful experimentation with many architectures and extensive hyperparameter tuning. They have been trained on the very large ImageNet database, which has 1000 object classes.

This training data is so large, and the models have so many parameters, that training models like these often takes weeks, even on multiple GPUs. So, it would be ideal if we could find a way to use what these models have already learned and apply that knowledge to a new task. For example, say we have some images we want to classify, like those in the CIFAR dataset. Instead of constructing a CNN from scratch, can we take the knowledge from a trained CNN like ResNet and use it to help classify this set of images? Yes, we can, and this is best accomplished through a technique called transfer learning.

Transfer learning is all about how to use a pre-trained network and apply it to a task of our own design, transferring what it has learned from one task to another. There are a few ways to implement transfer learning, and our approach will depend on how similar our dataset is to the dataset that the pre-trained network has seen. In other words, how transferable is the knowledge? We will answer this question and go over different strategies for transfer learning, and we will learn how to implement transfer learning in code, using a pre-trained network to help classify different sets of images.


# Useful layers

Let's take the VGG network as our example. How can we take a CNN that has been trained to classify images from a thousand ImageNet categories and apply it to a new problem? What if we wanted to classify images of different flower types, like sunflowers versus daisies?

VGG has learned to distinguish between the thousand different categories that are present in ImageNet, and most of those categories are animals, fruits and vegetables, or everyday objects. As VGG was trained, each of its convolutional layers learned to extract some information about the shapes and colors that distinguish these different objects. In fact, we know that the convolutional filters in a trained CNN are arranged in a kind of hierarchy. The filters in the first layer often detect edges or blocks of color. The second layer might detect circles, stripes, and rectangles. These are still very general features that are useful in analyzing any image in almost any dataset. The filters in the final convolutional layers are much more specific. If there were birds in the training dataset, there are filters that can detect birds. If there were cars or bicycles, there are filters to detect wheels, and so on.

We will see, then, that it is useful to remove the final layers of the network, which are very specific to the training dataset, while keeping the earlier layers. In this way, we can use the convolutional and pooling layers in a pre-trained network like VGG as a feature extractor that identifies shape and color-based features in our set of flower images. Then, after an image has passed through this pre-trained feature extractor, we can add one or two more linear layers at the end, which act as a final classifier. These last layers take in the features from an image, and we can train only these final layers to customize the network for a new task. So, even if the flower image database that we are interested in has no overlap with the ImageNet categories, it is still possible to use the knowledge from a pre-trained CNN.

This is one technique for transfer learning, but our method will depend on both the size of our dataset and the level of similarity it shares with the ImageNet database. For instance, this technique of adding some final trainable classifier layers will work well if our dataset is relatively small and has distinct shape features that are similar to those found in the ImageNet database. If our dataset is very large and different from ImageNet, we may take a different approach.

Let's go over one more approach to transfer learning.

How might we use transfer learning if the dataset we want to work with is very large and somewhat different from the ImageNet database? As an example, one of Udacity's founders, Sebastian Thrun, along with a team at Stanford, recently used transfer learning to develop a CNN to diagnose skin cancer. The CNN classifies lesions as either benign or malignant and achieves performance on par with dermatologists for diagnosing some forms of skin cancer. To construct his model, he used a transfer learning approach with the Inception architecture, pre-trained on the ImageNet database.

As a first step, he removed the final densely connected classification layer and added a new fully-connected layer. This is similar to the approach that we outlined before, adding a new classification layer with an output size that we define. In this case, this layer has an output value for each type of disease class. 

As for all the other layers in the network, their parameters were initialized with pre-trained values. Then, during training, the parameters were further optimized to fit the database of skin lesions. So, instead of using the pre-trained model as a fixed feature extractor, Sebastian and his team used it as a starting point and then trained the entire network, modifying all the weights so that they were tuned to the medical image classification task.

In this case, the model truly benefited from the head start that it was given from pre-training on ImageNet. 

This is another kind of transfer learning. The technique is called fine-tuning because it involves slightly changing, or tuning, all the existing parameters in a pre-trained network. Fine-tuning often works best if the dataset we are interested in is quite large. Transfer learning is an extremely useful technique, but we should apply it differently based on how big our dataset is and how similar it is to the data that a pre-trained model has seen.

As always, we encourage you to experiment with different methods. The rest of this section gives more guidance on how to use transfer learning in a variety of scenarios.

The approach to transfer learning depends on the size of the new data set and on how similar it is to the original training data. There are four main cases:

- New data set is small, new data is similar to original training data.
- New data set is small, new data is different from original training data.
- New data set is large, new data is similar to original training data.
- New data set is large, new data is different from original training data.

A large data set might have one million images. A small data set could have two thousand images. The dividing line between a large data set and a small data set is somewhat subjective. Overfitting is a concern when using transfer learning with a small data set.

Images of dogs and images of wolves would be considered similar; the images would share common characteristics. A data set of flower images would be different from a data set of dog images.

Each of the four transfer learning cases has its own approach. We will look at each case one by one.

The graph below displays what approach is recommended for each of the four main cases.
<img src="assets/GuideForTransferLearning.png">
Four cases for using transfer learning
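
The four cases can also be written as a small lookup, which may help when reading the sections that follow. This is only a sketch: the function name `choose_transfer_strategy` and its two boolean arguments are hypothetical, and deciding whether a data set counts as "large" or "similar" remains a judgment call, as noted above.

```python
def choose_transfer_strategy(dataset_is_large: bool, data_is_similar: bool) -> str:
    """Map the two questions (size, similarity) to one of the four cases.

    Hypothetical helper for illustration; the returned text summarizes
    the recommendation given for each case in this section.
    """
    if not dataset_is_large and data_is_similar:
        return "Case 1: replace the final layer, freeze the rest, train only the new layer"
    if not dataset_is_large and not data_is_similar:
        return "Case 2: keep only the early layers (frozen), add and train a new final layer"
    if dataset_is_large and data_is_similar:
        return "Case 3: replace the final layer, then fine-tune the entire network"
    return "Case 4: fine-tune the entire network, or train from scratch"


# Print the recommendation for every combination of size and similarity.
for large in (False, True):
    for similar in (True, False):
        print(f"large={large}, similar={similar} -> {choose_transfer_strategy(large, similar)}")
```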

### Demonstration Network

To explain how each situation works, we will start with a generic pre-trained convolutional neural network and explain how to adjust the network for each case. Our example network contains three convolutional layers and three fully connected layers:

<img src="assets/pretrainedNeuralNet.png">
Overview of the layers of a pre-trained CNN

Here is a generalized overview of what the convolutional neural network does:

- the first layer will detect edges in the image
- the second layer will detect shapes
- the third convolutional layer detects higher level features

Each transfer learning case will use the pre-trained convolutional neural network in a different way.
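
For readers who prefer code, the demonstration network could look roughly like the PyTorch sketch below. The class name `DemoCNN`, the channel counts, the hidden layer sizes, and the 32x32 input size are all assumptions chosen to keep the example small; only the overall shape (three convolutional layers followed by three fully connected layers) comes from the description above. The weights here are random, so this stands in for a pre-trained network only structurally.

```python
import torch
import torch.nn as nn


class DemoCNN(nn.Module):
    """Three convolutional layers followed by three fully connected layers."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # Convolutional part: edges -> shapes -> higher level features.
        # Channel counts (16, 32, 64) are illustrative assumptions.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully connected part; a 32x32 input is 4x4 after three 2x2 poolings.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))


model = DemoCNN()
scores = model(torch.randn(2, 3, 32, 32))  # a batch of two random "images"
print(scores.shape)  # one score per class for each image
```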

### Case 1: Small Data Set, Similar Data

<img src="assets/Case1.png">

If the new data set is small and similar to the original training data:

- slice off the end of the neural network
- add a new fully connected layer that matches the number of classes in the new data set
- randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
- train the network to update the weights of the new fully connected layer

To avoid overfitting on the small data set, the weights of the original network will be held constant rather than re-training the weights.

Since the data sets are similar, images from each data set will have similar higher level features. Therefore most or all of the pre-trained neural network layers already contain relevant information about the new data set and should be kept.
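
The steps above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not a full training script: the small `nn.Sequential` model is a randomly initialized stand-in for a real pre-trained network, and the five output classes, the learning rate, and the random batch are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pre-trained network (assumption: in practice the weights
# would be loaded from a model trained on a data set such as ImageNet).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 32), nn.ReLU(),
    nn.Linear(32, 1000),
)

# Freeze all of the pre-trained weights.
for param in model.parameters():
    param.requires_grad = False

# Slice off the end of the network: swap the last layer for a new, randomly
# initialized one sized for the new data set (5 classes is an assumption).
num_classes = 5
model[-1] = nn.Linear(model[-1].in_features, num_classes)

# Only the new layer's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(model[-1].parameters(), lr=0.01)

# One training step on a random batch, to show which weights move.
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, num_classes, (4,))
frozen_before = model[0].weight.detach().clone()
head_bias_before = model[-1].bias.detach().clone()

loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print("conv weights unchanged:", torch.equal(model[0].weight, frozen_before))
print("new layer updated:", not torch.equal(model[-1].bias, head_bias_before))
```

Layers created after the freezing loop have `requires_grad=True` by default, which is why the new layer is the only part of the network that trains.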

Here's how to visualize this approach:

<img src="assets/Case1path.png">
Adding and training a fully connected layer at the end of the NN

### Case 2: Small Data Set, Different Data

<img src="assets/Case2.png">
Case 2: small set, different data

If the new data set is small and different from the original training data:

- keep only some of the pre-trained layers near the beginning of the network and slice off the rest
- add to the remaining pre-trained layers a new fully connected layer that matches the number of classes in the new data set
- randomize the weights of the new fully connected layer; freeze all the weights from the pre-trained network
- train the network to update the weights of the new fully connected layer

Because the data set is small, overfitting is still a concern. To combat overfitting, the weights of the original neural network will be held constant, like in the first case.

But the original training set and the new data set do not share higher level features. In this case, the new network will only use the layers containing lower level features.
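
A rough PyTorch sketch of this case is shown below. The convolutional stack is a randomly initialized stand-in for a pre-trained one, and the choice to keep only the first convolutional block, the 32x32 input size, and the five output classes are assumptions for illustration. How many early layers to keep is something to experiment with.

```python
import torch
import torch.nn as nn

# Stand-in for the convolutional part of a pre-trained network (assumption:
# real weights would be loaded from a trained model).
pretrained_features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # low level
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # high level
)

# Keep only the first block (conv, ReLU, pool) and freeze its weights.
early_layers = pretrained_features[:3]
for param in early_layers.parameters():
    param.requires_grad = False

# Measure how many features the truncated network produces for one image,
# so the new fully connected layer can be sized to match.
with torch.no_grad():
    n_features = early_layers(torch.zeros(1, 3, 32, 32)).flatten(1).shape[1]

# Add a new, randomly initialized fully connected layer for the new classes.
num_classes = 5
model = nn.Sequential(early_layers, nn.Flatten(), nn.Linear(n_features, num_classes))

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print("features per image:", n_features)
print("trainable parameters:", trainable)
```

Training then proceeds as in the first case, with only the new layer's parameters passed to the optimizer.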

Here is how to visualize this approach:

<img src="assets/Case2path.png">
Remove all but starting layers of the model, and add and train a linear layer at the end.

### Case 3: Large Data Set, Similar Data

<img src="assets/Case3.png">
Case 3: large data, similar to ImageNet or pre-trained set.

If the new data set is large and similar to the original training data:

- remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
- randomly initialize the weights in the new fully connected layer
- initialize the rest of the weights using the pre-trained weights
- re-train the entire neural network

Overfitting is not as much of a concern when training on a large data set; therefore, you can re-train all of the weights.

Because the original training set and the new data set share higher level features, the entire neural network is used as well.
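
In PyTorch, this fine-tuning recipe might look like the sketch below. As before, the `pretrained` model is a small, randomly initialized stand-in, and the class count, learning rate, and random batch are assumptions. The difference from the small-data cases is that nothing is frozen and every parameter is passed to the optimizer.

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a pre-trained network (assumption: real weights would be loaded).
pretrained = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 32), nn.ReLU(),
    nn.Linear(32, 1000),
)

# Start from the pre-trained weights; the copy keeps the original for comparison.
model = copy.deepcopy(pretrained)

# Replace the last fully connected layer with a randomly initialized one
# that matches the number of classes in the new data set.
num_classes = 5
model[-1] = nn.Linear(model[-1].in_features, num_classes)

# Re-train the entire network: all parameters go to the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, num_classes, (4,))
loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After one step, even the first convolutional layer has moved away
# from its pre-trained values.
conv_changed = not torch.equal(model[0].weight, pretrained[0].weight)
print("conv weights updated:", conv_changed)
```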

Here is how to visualize this approach:

<img src="assets/Case3path.png">
Using the pre-trained weights as a starting point

### Case 4: Large Data Set, Different Data

<img src="assets/Case4.png">
Case 4: large data, different than original dataset

If the new data set is large and different from the original training data:

- remove the last fully connected layer and replace with a layer matching the number of classes in the new data set
- retrain the network from scratch with randomly initialized weights
- alternatively, you could just use the same strategy as the "large and similar" data case

Even though the data set is different from the training data, initializing the weights from the pre-trained network might make training faster. When we do this, the approach is the same as for a large, similar data set.

If using the pre-trained network as a starting point does not produce a successful model, another option is to randomly initialize the convolutional neural network weights and train the network from scratch.
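
If we do fall back to training from scratch, one way to discard the pre-trained values in PyTorch is to re-run each layer's default initializer, as in the sketch below. The helper name `reinitialize` is hypothetical and the small network is a stand-in. With a real model, simply constructing the architecture without loading pre-trained weights has the same effect.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a network that currently holds pre-trained weights (assumption),
# with its last layer already sized for the new data set (5 classes).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 32), nn.ReLU(),
    nn.Linear(32, 5),
)


def reinitialize(network: nn.Module) -> None:
    """Randomly re-initialize every layer that defines reset_parameters()."""
    for layer in network.modules():
        if hasattr(layer, "reset_parameters"):
            layer.reset_parameters()


before = model[0].weight.detach().clone()
reinitialize(model)
weights_changed = not torch.equal(before, model[0].weight)
print("weights re-initialized:", weights_changed)

# From here, training proceeds as usual with all parameters trainable.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```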

Here is how to visualize this approach:

<img src="assets/Case4path.png">
Fine-tune or retrain entire network

We can also check out this [research paper](https://arxiv.org/pdf/1411.1792.pdf) that systematically analyzes the transferability of features learned in pre-trained CNNs. Also we can read the [Nature publication](http://www.nature.com/articles/nature21056.epdf?referrer_access_token=_snzJ5POVSgpHutcNN4lEtRgN0jAjWel9jnR3ZoTv0NXpMHRAJy8Qn10ys2O4tuP9jVts1q2g1KBbk3Pd3AelZ36FalmvJLxw1ypYW0UxU7iShiMp86DmQ5Sh3wOBhXDm9idRXzicpVoBBhnUsXHzVUdYCPiVV0Slqf-Q25Ntb1SX_HAv3aFVSRgPbogozIHYQE3zSkyIghcAppAjrIkw1HtSwMvZ1PXrt6fVYXt-dvwXKEtdCN8qEHg0vbfl4_m&tracking_referrer=edition.cnn.com) detailing Sebastian Thrun's cancer-detecting CNN.