In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import keras
from keras.datasets import fashion_mnist, cifar10
from keras.layers import Dense, Flatten, Normalization, Dropout, Conv2D, MaxPooling2D, RandomFlip, RandomRotation, RandomZoom, BatchNormalization, Activation, InputLayer
from keras.models import Sequential
from keras.losses import SparseCategoricalCrossentropy, CategoricalCrossentropy
from keras.callbacks import EarlyStopping
from keras.utils import np_utils
from keras import utils
import os
from keras.preprocessing.image import ImageDataGenerator

import matplotlib as mpl
import matplotlib.pyplot as plt
import datetime

# Transfer Learning

### Feature Extraction and Classification

One of the key concepts behind transfer learning is the separation of feature extraction, which happens in the convolutional layers, from classification, which happens in the fully connected layers.
<ul>
<li> The convolutional layers find features in the image, i.e. the output at the end of the convolutional layers is a set of image features. 
<li> The fully connected layers take those features and classify the image. 
</ul>

The idea behind this is that we allow someone (like Google) to train their fancy network on a bunch of fast computers, using millions and millions of images. These classifiers get very good at extracting features from objects. 

When using these models we take those convolutional layers and slap on our own classifier at the end, so the pretrained convolutional layers extract a bunch of features with their massive amount of training, then we use those features to predict our data!

In [2]:
epochs = 5

acc = keras.metrics.CategoricalAccuracy(name="accuracy")
pre = keras.metrics.Precision(name="precision")
rec = keras.metrics.Recall(name="recall")
metric_list = [acc, pre, rec]



### Download Model

Several pretrained models are available for us to use. VGG16 is one developed for image recognition. The name stands for "Visual Geometry Group", the group of researchers at the University of Oxford who developed it, and the ‘16’ refers to the 16 weight layers in the architecture (13 convolutional and 3 fully connected). The model reached roughly 93% top-5 accuracy on the ImageNet test that we mentioned a couple of weeks ago. 

![VGG16](images/vgg16.png "VGG16" )

#### Slice Convolutional Layers from Classifier

When downloading the model we specify that we don't want the top, which is the classification part. With the top removed, the convolutional layers can also accept images of a different shape, so we specify the input size as well.

In [3]:
from keras.applications.vgg16 import VGG16
from keras.layers import Input
from keras.models import Model
from keras.applications.vgg16 import preprocess_input

### Preprocessing Data

Our VGG16 model comes with a preprocessing function that prepares the data the way the model expects. VGG16 was trained on images with the color channels in BGR order and each channel zero-centered on the ImageNet mean, rather than the RGB values from 0 to 255 that we load, so we should apply the same preparation to get good results. 
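
As a rough illustration of what this preprocessing does, here is a minimal NumPy sketch of the "Caffe-style" transformation that the VGG16 `preprocess_input` applies: reverse the channels from RGB to BGR, then subtract the ImageNet mean of each channel without rescaling. The function name `caffe_style_preprocess` and the single-pixel input are made up for the example; in the notebook itself we simply call the Keras function.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by Caffe-style preprocessing.
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_style_preprocess(rgb_images):
    """Convert RGB images (values 0-255) to zero-centered BGR (hypothetical helper)."""
    images = np.asarray(rgb_images, dtype=np.float32)
    bgr = images[..., ::-1]           # reverse the channel axis: RGB -> BGR
    return bgr - IMAGENET_BGR_MEANS   # subtract each channel's mean, no rescaling

# A "batch" holding one 1x1 image with a single pure-red pixel: (R, G, B) = (255, 0, 0).
red_pixel = np.array([[[[255.0, 0.0, 0.0]]]])
processed = caffe_style_preprocess(red_pixel)

# Channel order is now (B, G, R): blue and green go negative, red stays positive.
print(processed[0, 0, 0])
```

The values the network sees are therefore centered around zero and no longer in the 0 to 255 range, which is why feeding it raw RGB images tends to hurt accuracy.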

In [4]:
import pathlib
import PIL 
from keras.applications.vgg16 import preprocess_input

dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file(origin=dataset_url,
                                   fname='flower_photos',
                                   untar=True)
data_dir = pathlib.Path(data_dir)

#Flowers
batch_size = 32
img_height = 180
img_width = 180

train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

class_names = train_ds.class_names
print(class_names)

def preprocess(images, labels):
  return tf.keras.applications.vgg16.preprocess_input(images), labels

train_ds = train_ds.map(preprocess)
val_ds = val_ds.map(preprocess)


Found 3670 files belonging to 5 classes.
Using 2936 files for training.
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
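
The counts printed above can be reproduced with a little arithmetic. This is a small sketch that assumes the validation subset is 20% of the files rounded down, which matches the numbers reported by the loader; the variable names are only for illustration.

```python
import math

total_files = 3670
validation_split = 0.2
batch_size = 32

# Assumption: the validation subset is the floor of 20% of the files.
val_count = int(total_files * validation_split)
train_count = total_files - val_count

# Batches needed to see every training image once (the last batch is partial).
steps_per_epoch = math.ceil(train_count / batch_size)

print(train_count, val_count, steps_per_epoch)
```

The step count is the number that appears in the Keras progress bar during training (for example `1/92` in one of the later cells).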


#### Add on New Classifier

The last layer of the VGG16 model without its top is a MaxPool layer. When we build our own CNN, that is the last layer before we add the "normal" layers for making predictions, and the same applies here. We need to flatten the data, then use dense layers and an output layer to produce the predictions. 

We end up with the pretrained parts finding features in images, and the custom part classifying images based on those features. If we think back to the concept of a convolutional network, the convolutional layers do the true heavy lifting in allowing us to do things like classify images: they take in the raw images and transform them into a set of features contained in each image. This ability to turn images into predictive features is the key - important parts of images like edges, corners, contrast, etc. are generic, and our borrowed model is excellent at finding these features in images. Our predictions are unique, so we train our own classifier to make predictions for our data, into our classes - all based on the features that the borrowed model found! 
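
To make the "flatten, then dense layers" step concrete, here is a minimal NumPy sketch of the forward pass through a classifier head like the one we add. It uses a random stand-in for the VGG16 output (shape `(batch, 5, 5, 512)` for 180x180 inputs) and random, untrained weights, so only the shapes are meaningful; the `dense` helper is hypothetical and is not the Keras layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units, activation=None):
    """A dense layer with random, untrained weights (illustration only)."""
    w = rng.normal(scale=0.01, size=(x.shape[-1], units))
    b = np.zeros(units)
    out = x @ w + b
    return np.maximum(out, 0.0) if activation == "relu" else out

# Stand-in for the VGG16 feature maps of a batch of 4 images.
features = rng.random((4, 5, 5, 512))

flat = features.reshape(features.shape[0], -1)      # Flatten: (4, 12800)
hidden_1 = dense(flat, 50, activation="relu")       # (4, 50)
hidden_2 = dense(hidden_1, 20, activation="relu")   # (4, 20)
logits = dense(hidden_2, 5)                         # (4, 5): one score per class

print(flat.shape, hidden_1.shape, hidden_2.shape, logits.shape)
```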

### Make Model

We take the model without the top, set the input image size, and then add our own classifier. Loading the model is simple, there are just a few things to specify:
<ul>
<li> weights="imagenet" - tells the model to use the weights from its ImageNet training. This is what brings the "smarts", so we want it. 
<li> include_top=False - tells the model not to bring over the classifier bits that we want to replace. 
<li> input_shape - the model was trained on a specific image size (224x224x3). With the top removed, we can repurpose it for other sizes by changing the input shape. 
</ul>

We also set the VGG model that we download to be not trainable. We don't want to overwrite all of the training that already exists, coming from the original training. What we want to be trained are the final dense parts we added on to classify our specific scenario. All the weights in the convolutional layers are kept the same, as they have been developed through large amounts of training; the weights in the fully connected layers will be trained, resulting in a model that combines the "sight" of the pretrained model with the context of what we are trying to classify. The VGG bits will just show as though they are one layer in our model, and for training purposes that makes sense. We can also see in the "trainable params" listing in the summary, the large number of weights in that VGG section we are borrowing are not trainable - that's the smart part of the model. 
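
The shapes and parameter counts that show up in the model summary can be checked by hand. The sketch below assumes the standard VGG16 layout (five 2x2 max-pools, 512 channels in the last block, "same" padding in the convolutions) and uses hypothetical helper names.

```python
def vgg16_feature_map_side(input_side, num_pools=5):
    """Each of VGG16's five 2x2 max-pools halves the side length, rounding down."""
    side = input_side
    for _ in range(num_pools):
        side //= 2
    return side

def dense_params(inputs, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units

side = vgg16_feature_map_side(180)     # 180 -> 90 -> 45 -> 22 -> 11 -> 5
flat_size = side * side * 512          # 512 channels in the last VGG16 block

head_params = (dense_params(flat_size, 50)
               + dense_params(50, 20)
               + dense_params(20, 5))

print(side, flat_size, head_params)
```

The last number is the count of trainable parameters once the VGG16 base is frozen: only the three dense layers we added are left to train.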

<b>Note:</b> I think the "top" label is a bit misleading, as it isn't really the top, it is the part at the end that shows at the bottom of a summary. 

In [5]:
## Loading VGG16 model
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(180,180,3))
base_model.trainable = False ## Not trainable weights

# Add Dense Stuff
flatten_layer = Flatten()
dense_layer_1 = Dense(50, activation='relu')
dense_layer_2 = Dense(20, activation='relu')
prediction_layer = Dense(5)

model = Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    dense_layer_2,
    prediction_layer
])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 vgg16 (Functional)          (None, 5, 5, 512)         14714688  
                                                                 
 flatten (Flatten)           (None, 12800)             0         
                                                                 
 dense (Dense)               (None, 50)                640050    
                                                                 
 dense_1 (Dense)             (None, 20)                1020      
                                                                 
 dense_2 (Dense)             (None, 5)                 105       
                                                                 
Total params: 15,355,863
Trainable params: 641,175
Non-trainable params: 14,714,688
_________________________________________________________________


#### Compile and Train

Once the new Frankenstein model is built we finish the training process as we normally would. The only difference is that here the weights of the VGG part of the model are not adjusted during the backpropagation steps; only the weights in the layers that we added at the end are. For many, if not most, applications, this approach of adapting a pretrained model will give the best real world results. Unless you happen to live in a data centre, you probably lack both the data and the processing capacity to train any model from scratch to be as good as those that we can download. 
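
The idea of training only the added layers can be sketched without Keras. In this toy NumPy example, a fixed random matrix stands in for the "pretrained" feature extractor and a small logistic-regression head is trained on top of it with plain gradient descent. The data, sizes, and learning rate are all made up; the point is only that the loss goes down while the frozen weights never change.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 200 points, 2 classes decided by the sign of the first input.
x = rng.normal(size=(200, 8))
y = (x[:, 0] > 0).astype(float)

w_base = rng.normal(size=(8, 16))   # stand-in "pretrained" feature extractor, frozen
w_head = np.zeros(16)               # new classifier weights, trainable
b_head = 0.0

def forward(inputs):
    features = np.maximum(inputs @ w_base, 0.0)   # frozen feature extraction
    probs = 1.0 / (1.0 + np.exp(-(features @ w_head + b_head)))
    return features, probs

def loss_fn(probs):
    eps = 1e-9
    return -np.mean(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))

w_base_before = w_base.copy()
initial_loss = loss_fn(forward(x)[1])

for _ in range(300):
    features, probs = forward(x)
    error = probs - y
    # Updates are applied to the head only; w_base is never touched.
    w_head -= 0.01 * features.T @ error / len(x)
    b_head -= 0.01 * error.mean()

final_loss = loss_fn(forward(x)[1])
print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print("base unchanged:", np.array_equal(w_base, w_base_before))
```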

In [6]:
# Model

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
            optimizer="adam", 
            metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")])  # metrics passed as a list
            
log_dir = "logs/fit/VGG" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
callback = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True) 

model.fit(train_ds,
            epochs=epochs,
            verbose=1,
            validation_data=val_ds,
            callbacks=[tensorboard_callback, callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fa1971ad100>

### Fine Tune Models

Lastly, we can adapt the entire model to our data. We'll unfreeze the original model and then train the model again. The key addition here is that we set the learning rate to be extremely low (here it is 2 orders of magnitude smaller than the Adam default of 1e-3), so the model doesn't totally rewrite all of the weights while training; it will only change them a little bit, fine tuning its predictions to the actual data! Here the original convolutional layers are trainable and their weights will be adjusted during training, but we dial the learning rate way down so that our changes only impact the model a little bit. This adapts the model more than training with the VGG layers locked does, but it still relies mainly on the previous training of the VGG model.

The end result is a model that can take advantage of all of the training that the original model received before we downloaded it. That ability of extracting features from images is then reapplied to our data for making predictions based on the features identified in the original model. Finally we take the entire model and just gently train it to be a little more suited to our data. The best of all worlds!
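
To get a feel for why the low learning rate keeps the changes small, here is a NumPy sketch of the first update of the textbook Adam rule, using the default values of the Keras `Adam` optimizer for `beta_1`, `beta_2` and `epsilon`. The gradients are made up, and the function `adam_first_step` is hypothetical; Keras does this internally.

```python
import numpy as np

def adam_first_step(grad, learning_rate, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """Weight change produced by the first Adam update for a given gradient."""
    m = (1 - beta_1) * grad            # first-moment estimate
    v = (1 - beta_2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - beta_1)           # bias correction at step 1
    v_hat = v / (1 - beta_2)
    return -learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

grad = np.array([0.5, -2.0, 30.0])     # made-up gradients of very different sizes

default_step = adam_first_step(grad, learning_rate=1e-3)
fine_tune_step = adam_first_step(grad, learning_rate=1e-5)

print(default_step)
print(fine_tune_step)
```

Because Adam normalizes each gradient by its own magnitude, the size of a step is set almost entirely by the learning rate. With `1e-5`, each weight moves by about `1e-5` on this first step regardless of how large its gradient is, which is 100 times less than with the default.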

In [7]:
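# Save an independent copy of the above model for the next test.
# (Plain assignment would only create a second name for the same model.)
copy_model = keras.models.clone_model(model)
copy_model.set_weights(model.get_weights())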

base_model.trainable = True
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),  # Low learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]  # metrics passed as a list
)

model.fit(train_ds, epochs=epochs, validation_data=val_ds)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 vgg16 (Functional)          (None, 5, 5, 512)         14714688  
                                                                 
 flatten (Flatten)           (None, 12800)             0         
                                                                 
 dense (Dense)               (None, 50)                640050    
                                                                 
 dense_1 (Dense)             (None, 20)                1020      
                                                                 
 dense_2 (Dense)             (None, 5)                 105       
                                                                 
Total params: 15,355,863
Trainable params: 15,355,863
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5

KeyboardInterrupt: 

Yay, that's probably pretty accurate!
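
One detail worth spelling out when checking predictions: the final `Dense(5)` layer has no activation, so the model outputs raw scores (logits) rather than probabilities, which is why the loss is created with `from_logits=True`. Below is a minimal NumPy sketch of what that loss computes and of how logits can be turned into probabilities and a predicted class. The logit values are made up and the helper functions are hypothetical stand-ins for what Keras does internally.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy_from_logits(labels, logits):
    """Mean negative log-probability assigned to the true (integer) labels."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

class_names = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

# Made-up model outputs for two images, and their true integer labels.
logits = np.array([[2.0, 0.5, 0.1, -1.0, 0.3],
                   [0.0, 0.0, 3.0, 0.0, 0.0]])
labels = np.array([0, 2])

probs = softmax(logits)
predicted = [class_names[i] for i in probs.argmax(axis=-1)]
loss = sparse_categorical_crossentropy_from_logits(labels, logits)

print(predicted)
print(probs.round(3), loss)
```

When using the trained Keras model, the equivalent step would be applying a softmax to the output of `model.predict` before reading the values as probabilities.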

### More Specific Retraining

If we are extra ambitious we can also potentially slice the model even deeper, and take smaller portions to mix with our own models. The farther "into" the model you slice, the more of the original training will be removed and the more the model will learn from our training data. If done, this is a balancing act - we want to keep all of the smarts that the model has gotten from the original training, while getting the benefits of adaptation to our data. 

This is something that is hard to just eyeball - splicing parts of models together to create something that is actually superior likely requires a lot of experimentation, a solid understanding of the problem you're addressing, and some domain knowledge. For something like this adaptation of the VGG model, we'd probably start with some idea of what the model was weak at, build an understanding of what types of features it was extracting along the way, and insert our own layers where we think it would be most beneficial. 

In [None]:
## Loading VGG16 model
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(180,180,3))
#base_model.trainable = False ## Not trainable weights
base_model.summary()

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_17 (InputLayer)       [(None, 180, 180, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 180, 180, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 180, 180, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 90, 90, 64)        0         
                                                                 
 block2_conv1 (Conv2D)       (None, 90, 90, 128)       73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 90, 90, 128)       147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 45, 45, 128)       0     

In [None]:
for layer in base_model.layers[:12]:
    layer.trainable = False
base_model.summary()

Model: "vgg16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_17 (InputLayer)       [(None, 180, 180, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 180, 180, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 180, 180, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 90, 90, 64)        0         
                                                                 
 block2_conv1 (Conv2D)       (None, 90, 90, 128)       73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 90, 90, 128)       147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 45, 45, 128)       0     

Now we have larger portions of the model that can be trained. We will be losing some of the pretrained knowledge, replacing it with the training coming from our data. 
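
We can work out how much of the model stays frozen with this slice. The sketch below assumes the standard Keras ordering of the VGG16 layers, with the input layer at index 0, so that `base_model.layers[:12]` runs from the input through `block4_conv1`. The total for VGG16 and the size of our dense head are taken from the summaries in this notebook.

```python
def conv3x3_params(in_channels, out_channels):
    """3x3 kernel weights plus one bias per output channel."""
    return 3 * 3 * in_channels * out_channels + out_channels

# (name, parameter count) for the first 12 layers; pooling and input have none.
first_12_layers = [
    ("input", 0),
    ("block1_conv1", conv3x3_params(3, 64)),
    ("block1_conv2", conv3x3_params(64, 64)),
    ("block1_pool", 0),
    ("block2_conv1", conv3x3_params(64, 128)),
    ("block2_conv2", conv3x3_params(128, 128)),
    ("block2_pool", 0),
    ("block3_conv1", conv3x3_params(128, 256)),
    ("block3_conv2", conv3x3_params(256, 256)),
    ("block3_conv3", conv3x3_params(256, 256)),
    ("block3_pool", 0),
    ("block4_conv1", conv3x3_params(256, 512)),
]

frozen = sum(count for _, count in first_12_layers)

vgg16_total = 14_714_688   # VGG16 without the top, from the model summary
head_params = 641_175      # our three dense layers, from the model summary
trainable = vgg16_total - frozen + head_params

print(frozen, trainable)
```

These two numbers should line up with the non-trainable and trainable parameter counts in the summary of the reassembled model.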

In [None]:
# Add Dense Stuff
flatten_layer = Flatten()
dense_layer_1 = Dense(50, activation='relu')
dense_layer_2 = Dense(20, activation='relu')
prediction_layer = Dense(5)

model = Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    dense_layer_2,
    prediction_layer
])

model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 vgg16 (Functional)          (None, 5, 5, 512)         14714688  
                                                                 
 flatten_11 (Flatten)        (None, 12800)             0         
                                                                 
 dense_29 (Dense)            (None, 50)                640050    
                                                                 
 dense_30 (Dense)            (None, 20)                1020      
                                                                 
 dense_31 (Dense)            (None, 5)                 105       
                                                                 
Total params: 15,355,863
Trainable params: 12,440,215
Non-trainable params: 2,915,648
_________________________________________________________________


In [None]:
# Model

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
            optimizer="adam", 
            metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")])  # metrics passed as a list
            
log_dir = "logs/fit/VGG" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
callback = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True) 

model.fit(train_ds,
            epochs=epochs,
            verbose=1,
            validation_data=val_ds,
            callbacks=[tensorboard_callback, callback])

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f6d36752700>

## Exercise - ResNet50

This is another pretrained network, this one with 50 layers. We can use it in much the same way as VGG16. 

In [None]:
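def preprocess50(images, labels):
  return tf.keras.applications.resnet50.preprocess_input(images), labels

# train_ds and val_ds already went through the VGG16 preprocessing above, so
# reload the raw images before applying the ResNet50 preprocessing.
load_args = dict(validation_split=0.2, seed=123, image_size=(img_height, img_width), batch_size=batch_size)
train_ds = tf.keras.utils.image_dataset_from_directory(data_dir, subset="training", **load_args).map(preprocess50)
val_ds = tf.keras.utils.image_dataset_from_directory(data_dir, subset="validation", **load_args).map(preprocess50)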

In [None]:

from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions

base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(180,180,3))
base_model.trainable = False ## Not trainable weights

# Add Dense Stuff
flatten_layer = Flatten()
dense_layer_1 = Dense(50, activation='relu')
dense_layer_2 = Dense(20, activation='relu')
prediction_layer = Dense(5)

model = Sequential([
    base_model,
    flatten_layer,
    dense_layer_1,
    dense_layer_2,
    prediction_layer
])

model.summary()

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 resnet50 (Functional)       (None, 6, 6, 2048)        23587712  
                                                                 
 flatten_8 (Flatten)         (None, 73728)             0         
                                                                 
 dense_20 (Dense)            (None, 50)                3686450   
                                                                 
 dense_21 (Dense)            (None, 20)                1020      
                                                                 
 dense_22 (Dense)            (None, 5)                 105       
                                                                 
Total params: 27,275,287
Trainable params: 

In [None]:
# Model
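model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
            optimizer="adam", 
            metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")])  # metrics passed as a list

# Log under a ResNet-specific name so these runs are not mixed with the VGG runs in TensorBoard.
log_dir = "logs/fit/ResNet50" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")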
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
callback = EarlyStopping(monitor='loss', patience=3, restore_best_weights=True) 

model.fit(train_ds,
            epochs=epochs,
            verbose=1,
            validation_data=val_ds,
            callbacks=[tensorboard_callback, callback])

Epoch 1/2
 1/92 [..............................] - ETA: 18:14 - loss: 2.6699 - accuracy: 0.3125

KeyboardInterrupt: 

In [None]:
base_model.trainable = True
model.summary()

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),  # Low learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="accuracy")]  # metrics passed as a list
)

model.fit(train_ds, epochs=epochs, validation_data=val_ds)

### Transfer Learning Conclusion

Transfer learning is common, especially when working with things like images. Pretrained models that have seen millions upon millions of images get very good at "understanding" what is in an image, or extracting important features from those images. This basic ability to "see" image data carries over between the different types of image tasks that we may want to do. For image data, natural language, audio, or video, it is likely that one of these large models will be more capable of extracting features from the data than anything we could hope to train from scratch. Since the basics of "seeing a thing" or "reading a sentence" are the same no matter the specific application, the ability of our pretrained models to process the data can be repurposed to our specific ends. 

We can see lots of scenarios in the real world where people are adapting image recognition models trained by Google to do things like recognize objects in their home security system, or language models like the GPT family being adapted to better understand domain specific language. We'll likely see more of this, as the benefits of training on massive amounts of data are hard, if not impossible, to replicate. 