# 4 - Influential Classification Models (and Tools)

## Introduction

This Notebook (6) continues from the previous sections and goes through the process of __Transfer Learning and Applying it__. The previous notebook focused on implementing the Inception and MobileNet models from TensorFlow Hub; here, the intention is to apply transfer learning with Keras, reusing models from Keras Applications that were pre-trained on richer datasets for new tasks. The focus is to fetch the pre-trained weights of models that were trained on the ImageNet dataset and to test different types of transfer learning, such as freezing and fine-tuning the feature_extractor layers.

## Supporting Utilities .py files:

In this Notebook, there will be a requirement to import the code/utilities from the following files (.py files):
- DataPrepCIFAR_utility.py
- customCallbacks_Keras.py
-

## Dataset:

For this part of the project, the CIFAR-100 dataset will be used. It is a collection of 60,000 32x32 images spread over 100 classes. CIFAR-100 was originally collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It is also a subset of the 80 million tiny images dataset. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses.

Source: https://www.cs.toronto.edu/~kriz/cifar.html

Further, the TensorFlow team offers a Python package called "tensorflow_datasets" that provides helper functions to download this dataset as well as other common ones. For the purposes of this project, the CIFAR-100 dataset will be downloaded with this package.

Source: https://www.tensorflow.org/datasets/catalog/cifar100

## Requirements:
- Tensorflow 2.0 (GPU is better)
- Tensorflow-Hub
- Keras (GPU is better)

### Import the required libraries:

In [1]:
%matplotlib inline

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import math
import timeit

In [2]:
import os
# from IPython.display import display, Image
# import matplotlib.pyplot as plt

# %matplotlib inline

# Set up the working directory for the images:
image_folderName = 'Description Images'
image_path = os.path.abspath(image_folderName) + '/'

In [2]:
# Set the random set seed number: for reproducibility.
Seed_nb = 42

# Parameter guarding the example code blocks below, which are wrapped in "if dont_run == 1:": 0 = skip them, 1 = run them.
dont_run = 0

### GPU Information:

In [4]:
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
devices = sess.list_devices()
devices

Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:09:00.0, compute capability: 7.5



[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 8120490652647963641),
 _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 6586313605, 1993847600325718994)]

## 1 - Transfer Learning:

Humans are able to learn things and apply them to new problems. For example, part of our lifetime learning takes place in classrooms, where teachers explain the concepts of different topics. This guidance allows us to learn and develop without having to go through the early stages of trial and error until we find the correct path/answer. This behaviour is central to human intelligence. Essentially, this is what transfer learning is, and it can be very powerful in the development of machine learning and deep learning. Taking guidance from pre-trained models and applying it to a new or similar problem is a way to develop more proficient systems without having to relearn everything from scratch.

Typically, most machine learning systems are designed for a single task; if such a system were applied to a different dataset, it would yield very poor results (like MNIST digits vs. ImageNet pictures and so on). As CNNs are trained to interpret some features of the dataset, it makes sense that a model should be partially reusable on a different but similar dataset, for example when moving from classifying digits to classifying text. The goal of transfer learning is to apply knowledge from one task to another, or to adapt it to other domains.

## 2 - Transferring the Knowledge:

This section looks into how the knowledge gathered by one model can be transferred to another. As with other digital systems, the data/weights of Machine/Deep Learning models can easily be stored and duplicated.

Transfer learning for CNNs relies on conditioned instantiation: the complete or partial architecture and weights of a performant past model are reused to instantiate a new model for a new task. From there, the model is fine tuned for the new task/domain.

The first convolutional layers extract low-level features such as lines, edges or colour gradients, whereas the final convolutional layers respond to more abstract features such as shapes and patterns. The dense layers at the end of the model process these high-level feature maps to make predictions.

#### Typical setup for Transfer Learning models:

There are various strategies for using pretrained CNNs. One of them is to remove the final prediction layers, so that the rest of the network can be used as an efficient __feature extractor__. If the new task is similar to the one the feature extractor was trained for, it can be used to output pertinent features, which are then processed by one or two new dense layers trained to output predictions.

The feature-extracting layers are often __FROZEN__ during the training phase to preserve the quality of the extracted features, meaning that their parameters are not updated during gradient descent. However, in cases where the tasks/domains differ, these layers require __FINE TUNING__: the feature-extracting layers are trained together with the new dense prediction layers on the new task. The next section details other use cases.
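
As a minimal sketch of this typical setup, the block below builds a frozen feature extractor from Keras Applications and places a single new dense layer on top. Several choices are assumptions made for illustration only: MobileNetV2 as the architecture, a 96x96 input size (CIFAR images would need resizing), and `weights=None` so that nothing is downloaded; passing `weights="imagenet"` would load the pretrained parameters instead.

```python
import tensorflow as tf

NUM_CLASSES = 100   # CIFAR-100 has 100 classes.
IMG_SIZE = 96       # Assumed input size, chosen for illustration.

# Feature extractor: the network without its final prediction layers.
# weights=None keeps this sketch offline; "imagenet" would fetch pretrained weights.
feature_extractor = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    include_top=False,
    weights=None,
    pooling="avg")

# Freeze the extractor so that gradient descent does not update it.
feature_extractor.trainable = False

# New prediction layer trained on top of the extracted features.
inputs = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
features = feature_extractor(inputs, training=False)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(features)
model = tf.keras.Model(inputs, outputs)

# Only the kernel and bias of the new dense layer remain trainable.
print(len(model.trainable_weights))
```

Setting `feature_extractor.trainable = True` afterwards would switch this sketch from the frozen case to the fine tuning case.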

## 3 - Use Cases of Transfer learning:

Aside from the feature extractor case above, the following details other use cases. The questions asked here are: which pretrained model should be reused, and should its layers be frozen or fine tuned?

List of cases:  
1) Limited training data for similar tasks. \
2) Abundant training data for similar tasks. \
3) Limited training data for dissimilar tasks. \
4) Abundant training data for dissimilar tasks.

### 3.1 - Limited training data for similar task:

When a particular task does not have enough training samples for the model to learn from, transfer learning offers a solution: a model previously trained on a larger but similar dataset is reused. The final layers of the pre-trained model are first removed and replaced with new final layers, which are then trained on the new target task.

For example, for the purpose of distinguishing bees and wasps, the ImageNet dataset has these two classes but does not contain enough samples of them to train an efficient CNN without overfitting. Therefore, the model can first be trained on the entire ImageNet dataset for the 1,000 classes; its final dense layers are then removed and replaced with output prediction layers for just the 2 classes, bees and wasps.

By fixing the parameters of the feature extractor layers (trained on the larger dataset), the network is able to retain the expressiveness that it developed on the richer dataset.

### 3.2 - Abundant training data for similar tasks:

One of the reasons networks tend to overfit is a lack of data samples: the bigger the dataset, the lower the chance of overfitting it. With a larger training set, it is therefore common to unfreeze the last layers of the feature extractor for fine tuning. This allows the network to extract features that are more relevant to the new task, which translates to better learning during training and better prediction performance. Further, as the model is already close to convergence, it is common practice to use a smaller learning rate in the fine tuning phase.
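
The fine tuning strategy described here can be sketched as follows. This is only an illustration: the small network stands in for a real pretrained model, the layer names are hypothetical, and the learning rate of 1e-5 is an assumed value rather than a recommendation from this notebook.

```python
import tensorflow as tf

# Small stand-in for a pretrained classifier (hypothetical layer names).
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
x = tf.keras.layers.Conv2D(16, 3, activation="relu", name="conv2")(x)
x = tf.keras.layers.Conv2D(32, 3, activation="relu", name="conv3")(x)
x = tf.keras.layers.GlobalAveragePooling2D(name="pool")(x)
outputs = tf.keras.layers.Dense(100, activation="softmax", name="predictions")(x)
model = tf.keras.Model(inputs, outputs)

# Keep the early layers frozen; unfreeze only the last feature-extractor
# layer and the prediction layer.
UNFROZEN = {"conv3", "predictions"}
for layer in model.layers:
    layer.trainable = layer.name in UNFROZEN

# A small learning rate, since the model is assumed to be close to convergence.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Names of the layers that own weights and will be updated during training.
trainable_names = [l.name for l in model.layers if l.trainable and l.weights]
print(trainable_names)
```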

### 3.3 - Limited training data for dissimilar tasks:

This scenario is the least favourable one, yet it does occur from time to time. Careful consideration is required here. For example, it is crucial to reconsider the use of deep models: with few samples, training such a model can lead to overfitting, and a deep pretrained model would carry many features that are irrelevant to the task. A possible solution is to keep only the first layers of the pretrained CNN, as these tend to extract low-level features from the data. This means that more layers than just the final dense layers are removed, leaving a shallow feature extractor. New dense layers for prediction are then added on top of it, and this new model can be fine tuned for the specific task.

### 3.4 - Abundant training data for dissimilar tasks:

Typically, transfer learning is more advantageous when the tasks are similar; a model trained on audio tasks, for example, brings little to visual recognition. For a task that differs but comes with a large enough dataset, it may therefore seem pointless to reuse an existing pretrained model. However, experiments have demonstrated that models initialised with pretrained weights perform much better than models with random initialisation.

## 4 - Transfer Learning with TensorFlow and Keras Examples: Model Surgery.

It is also common to utilise non-standard models/networks, such as state-of-the-art CNNs or custom models built by experts in their field/domain. This section briefly goes through the code implementations for the examples above, before a follow-up section applies transfer learning to an actual dataset. This section also covers model surgery and selective training.

Types of Model Surgery:
1) Removing Layers. \
2) Grafting Layers.

Types of Selective Training:
1) Restoring pretrained parameters. \
2) Freezing layers.

## Model Surgery:

### 4.1 - Removing Layers:

Typically, when using pretrained models, one of the first tasks is to remove the final prediction layers from the model, transforming it into a feature extractor.

In Keras, this can be done with the Sequential API. For "Sequential" models, the list of layers can be accessed with the "model.layers" attribute, and the model has a "pop()" method that removes its last layer. Knowing the number of layers to remove, the method is simply called that many times.

For example in Keras:

In [3]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
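    # Code example below:
    # "pop()" is called on the Sequential model itself: "model.layers" returns
    # a copy of the layer list, so popping from that list leaves the model unchanged.
    for i in range(nb_layers_to_remove):
        model.pop()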
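
A self-contained version of this removal step might look like the sketch below. The toy model and its layer names are hypothetical and only serve to show the effect of `pop()` on a "Sequential" model.

```python
import tensorflow as tf

# Toy Sequential classifier (hypothetical layers, for illustration only).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(name="flatten"),
    tf.keras.layers.Dense(64, activation="relu", name="features"),
    tf.keras.layers.Dense(32, activation="relu", name="hidden"),
    tf.keras.layers.Dense(10, activation="softmax", name="predictions"),
])

# Remove the two final layers, one call to pop() per layer.
nb_layers_to_remove = 2
for _ in range(nb_layers_to_remove):
    model.pop()

# Only the layers to keep remain in the model.
remaining = [layer.name for layer in model.layers]
print(remaining)
```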

Removing layers in pure TensorFlow is not recommended, as editing the operational graph that supports the model can be highly complex. Note that unused graph operations are not executed at runtime, so keeping the old layers in the compiled graph does not affect the compute performance of the new model. Therefore, in TensorFlow, layers are "removed" by pinpointing the last layer/operation of the pretrained model that is to be kept, rather than by deleting anything.

If the corresponding Python object has been lost (there can be a lot to keep track of) but its name is known, for example after looking it up in TensorBoard, the representative tensor can be recovered by looping over the model layers and checking their names.

For example in Tensorflow:

In [4]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
    # Code example below:
    for layer in model.layers:
        if layer.name == name_of_lastLayer_to_keep:
            bottleneck_features = layer.output
            break

As with all things Keras, the process is simpler than in the example above. Knowing the name of the last layer to keep (for instance after checking "model.summary()"), a feature extractor can be instantiated in a couple of lines and is then ready for use.

For example in Keras:

In [None]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
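    # Code example below:
    from tensorflow.keras import Model

    # Output tensor of the last layer to keep:
    bottleneck_features = model.get_layer(last_layer_name).output
    # New model mapping the original input to the bottleneck features:
    feature_extractor = Model(inputs = model.input,
                              outputs = bottleneck_features)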
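
To make the idea concrete, here is a runnable sketch on a toy functional model. The architecture, the layer names and the choice of "avg_pool" as the last layer to keep are all assumptions made for illustration.

```python
import tensorflow as tf

# Toy "pretrained" classifier built with the functional API (hypothetical).
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="avg_pool")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax", name="predictions")(x)
model = tf.keras.Model(inputs, outputs, name="toy_pretrained")

# Name of the last layer to keep, e.g. read from model.summary().
last_layer_name = "avg_pool"

# The feature extractor shares its layers and weights with the original model.
bottleneck_features = model.get_layer(last_layer_name).output
feature_extractor = tf.keras.Model(inputs=model.input,
                                   outputs=bottleneck_features)

extractor_layer_names = [layer.name for layer in feature_extractor.layers]
print(feature_extractor.output_shape)
```

Because the layers are shared rather than copied, any later change to their weights is visible through both models.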

### 4.2 - Grafting Layers:

Grafting is where new prediction layers are added to the pretrained model (on top of the feature extractor). This process is straightforward and can be done easily.

For example with Keras:

In [5]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
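    # Code example below:
    from tensorflow.keras import Model
    from tensorflow.keras.layers import Dense
    # "Dense(...)" stands for the new prediction layer (arguments left elided):
    dense1 = Dense(...)(feature_extractor.output)
    new_model = Model(model.input, dense1)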
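
A runnable variant of the grafting step is sketched below. The toy feature extractor, the layer names and the sizes of the two new dense layers are hypothetical; the 100 output units simply match the CIFAR-100 classes used in this project.

```python
import tensorflow as tf

# Toy feature extractor standing in for a truncated pretrained model.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="avg_pool")(x)
feature_extractor = tf.keras.Model(inputs, x, name="feature_extractor")

# Graft new layers on top of the extractor's output tensor.
hidden = tf.keras.layers.Dense(64, activation="relu",
                               name="new_hidden")(feature_extractor.output)
predictions = tf.keras.layers.Dense(100, activation="softmax",
                                    name="new_predictions")(hidden)
new_model = tf.keras.Model(feature_extractor.input, predictions)

grafted_names = [layer.name for layer in new_model.layers][-2:]
print(grafted_names)
```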

## Selective Training:

The training phase with transfer learning can be more complex, as the pretrained layers must first be restored and the layers to be frozen must then be defined.

### 4.3 - Restoring Pretrained Parameters:

With TensorFlow, there are utility functions to initialise some of the layers with pretrained weights. The following shows the saved parameters of a pretrained estimator being used to warm-start a new model whose layers have the same names.

"WarmStartSettings" is an initialiser that takes an optional argument "vars_to_warm_start" to provide the names of the specific variables (as a list or a regex) to be restored from the checkpoint files (ckpt).

In [None]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
    # Code example below:
    def model_func(features, labels, mode):
        # Define the new model, where it reuses pretrained weights/parameters as a feature extractor,
        # and return a "tf.estimator.EstimatorSpec".
        ...
    
    ckpt_path = '/path/to/pretrained/estimator/model.ckpt'
    ws = tf.estimator.WarmStartSettings(ckpt_path)
    
    estimator = tf.estimator.Estimator(model_func, warm_start_from=ws)
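
"WarmStartSettings" belongs to the Estimator API. As a rough Keras analogue of the same idea (this is a swapped-in technique, not the Estimator mechanism itself), the sketch below copies weights between two models for the layers that share a name. The `build_model` helper and all layer names are hypothetical.

```python
import numpy as np
import tensorflow as tf

def build_model(num_classes, head_name):
    """Hypothetical helper: same feature layers, task-specific head."""
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D(name="avg_pool")(x)
    outputs = tf.keras.layers.Dense(num_classes, name=head_name)(x)
    return tf.keras.Model(inputs, outputs)

pretrained = build_model(1000, "old_head")   # stands in for the trained model
new_model = build_model(100, "new_head")     # model for the new task

# Restore weights only for layers that exist under the same name in both models.
pretrained_names = {layer.name for layer in pretrained.layers}
restored = []
for layer in new_model.layers:
    if layer.name in pretrained_names and layer.weights:
        layer.set_weights(pretrained.get_layer(layer.name).get_weights())
        restored.append(layer.name)

print(restored)
```

The new head keeps its random initialisation, since no layer of that name exists in the pretrained model.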

With Keras, the pretrained model is restored before it is transformed for the new task. Note that restoring the complete model only to remove some of its layers afterwards is not the most efficient approach; the code is, however, concise.

In [6]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
    # Code example below:
    # Assumes pretrained model was saved with the method "model.save()":
    model = tf.keras.models.load_model('path/to/pretrained/model.h5')
    
    # Next, is to "pop" or "add" layers for the creation of the new model.
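
The full round trip can be sketched as below: a toy model is saved with "model.save()", restored with "load_model()", and then truncated into a feature extractor. The model, the file name and the temporary directory are assumptions made so that the example is self-contained.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Toy model standing in for the pretrained network (hypothetical layers).
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="avg_pool")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax", name="predictions")(x)
model = tf.keras.Model(inputs, outputs)

# Save the complete model, then restore it from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pretrained_model.h5")
    model.save(path)
    restored = tf.keras.models.load_model(path)

# The restored model reproduces the predictions of the original one.
batch = np.random.rand(4, 32, 32, 3).astype("float32")
same_predictions = np.allclose(model.predict(batch, verbose=0),
                               restored.predict(batch, verbose=0))

# Model surgery then happens on the restored model.
feature_extractor = tf.keras.Model(restored.input,
                                   restored.get_layer("avg_pool").output)
print(same_predictions, feature_extractor.output_shape)
```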

### 4.4 - Freezing Layers:

The TensorFlow approach is the most versatile way of freezing layers: the "tf.Variable" instances of the layers to freeze are removed from the list of variables passed to the optimiser.

In [7]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
    # Code example below:
    # For this case, there is a need to freeze the model's layers with "conv" in the name,
    # so their variables are filtered out of the list of variables to train:
    vars_to_train = model.trainable_variables
    vars_to_train = [ v for v in vars_to_train if "conv" not in v.name]
    
    # Apply the optimiser to the remaining model's variables
    # ("gradients" must have been computed with respect to "vars_to_train"):
    optimizer.apply_gradients( zip(gradients, vars_to_train) )
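
One complete training step with this kind of freezing could look like the sketch below. The toy model, the random data and the SGD settings are assumptions. Note one deliberate difference from the snippet above: the variables are selected through the name of their layer rather than the variable name, since variable naming can differ between Keras versions.

```python
import numpy as np
import tensorflow as tf

# Toy model (hypothetical layer names).
inputs = tf.keras.Input(shape=(8, 8, 1))
x = tf.keras.layers.Conv2D(4, 3, activation="relu", name="conv1")(inputs)
x = tf.keras.layers.Flatten(name="flatten")(x)
outputs = tf.keras.layers.Dense(3, name="dense_head")(x)
model = tf.keras.Model(inputs, outputs)

# Variables of layers with "conv" in their name are left out, i.e. frozen.
vars_to_train = [v for layer in model.layers if "conv" not in layer.name
                 for v in layer.trainable_weights]

optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Random stand-in data: 16 images and integer labels in {0, 1, 2}.
images = tf.random.normal((16, 8, 8, 1))
labels = tf.constant(np.arange(16) % 3)

conv_before = model.get_layer("conv1").get_weights()[0].copy()
head_before = model.get_layer("dense_head").get_weights()[0].copy()

# Gradients are computed and applied only for the variables to train.
with tf.GradientTape() as tape:
    loss = loss_fn(labels, model(images, training=True))
gradients = tape.gradient(loss, vars_to_train)
optimizer.apply_gradients(zip(gradients, vars_to_train))

conv_after = model.get_layer("conv1").get_weights()[0]
head_after = model.get_layer("dense_head").get_weights()[0]
print(np.array_equal(conv_before, conv_after))
```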

In Keras, layers have a ".trainable" attribute that can be set to "False" to freeze them.

In [8]:
# If statement to check if this code block should be executed.
if dont_run == 1:
    
    # Code example below:
    for layer in feature_extractor.layers:
        layer.trainable = False # this freezes the complete extractor.
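
The effect of this attribute can be checked by counting the weights that remain trainable, as in the sketch below. The toy feature extractor and the new prediction layer are hypothetical; the model is compiled after freezing so that the frozen state is the one used for training.

```python
import tensorflow as tf

# Toy feature extractor with a new prediction layer grafted on top.
inputs = tf.keras.Input(shape=(32, 32, 3))
x = tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv1")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D(name="avg_pool")(x)
feature_extractor = tf.keras.Model(inputs, x, name="feature_extractor")

outputs = tf.keras.layers.Dense(100, activation="softmax",
                                name="predictions")(feature_extractor.output)
new_model = tf.keras.Model(feature_extractor.input, outputs)

# Freeze the complete extractor; its layers are shared with new_model.
for layer in feature_extractor.layers:
    layer.trainable = False

# Compile after freezing so that the change is taken into account.
new_model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

nb_trainable = len(new_model.trainable_weights)      # dense kernel and bias
nb_frozen = len(new_model.non_trainable_weights)     # conv kernel and bias
print(nb_trainable, nb_frozen)
```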

## Conclusions:



## Summary:

Although this is not the end of the project, it does conclude the 1st notebook relevant to the theory of these advanced deep learning models. Please check out __Notebook 7__ for the ____.