# Transfer Learning with Keras Sequential

In this notebook I'll illustrate the concept and power of transfer learning using the MNIST dataset, available as part of the Keras datasets. I'll train a CNN on the digits 5, 6, 7, 8, 9, then train just the last layers of the network on the digits 0, 1, 2, 3, 4 and see how well the features learned on 5-9 help with classifying 0-4.


In [None]:
import datetime
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

In [None]:
# used to help some of the timing functions
now = datetime.datetime.now

In [3]:
# set some parameters
batch_size = 128
num_classes = 5
epochs = 5

In [4]:
# set some more parameters
img_rows, img_cols = 28, 28
filters = 32
pool_size = 2
kernel_size = 3

In [None]:
# This will handle some variability in how the input data is loaded

if K.image_data_format() == 'channels_first':
    input_shape = (1, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 1)

In [None]:
# To simplify things, write a function that includes all the training steps.
# Inputs: a model, a training set, a test set, and the number of classes.
# The model object itself holds the state about which layers are frozen and which are trainable.
# Each MNIST data set is expected to be a tuple whose first element is the image data and whose
# second element is the labels, i.e. the [0] and [1] elements used in the function below.

def train_model(model, train, test, num_classes):
    x_train = train[0].reshape((train[0].shape[0],) + input_shape)
    x_test = test[0].reshape((test[0].shape[0],) + input_shape)
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    print('x_train shape:', x_train.shape)
    print(x_train.shape[0], 'train samples')
    print(x_test.shape[0], 'test samples')

    # convert class vectors to binary class matrices
    y_train = keras.utils.to_categorical(train[1], num_classes)
    y_test = keras.utils.to_categorical(test[1], num_classes)

    model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',   # Can change this to Adam or RMSProp
                  metrics=['accuracy'])

    t = now()  # This will show the time it takes to train the model
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=1, # This will show you an ongoing update of the training process for each epoch
              validation_data=(x_test, y_test))
    print('Training time: %s' % (now() - t))

    score = model.evaluate(x_test, y_test, verbose=0)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])
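
The preprocessing at the top of `train_model` (add a channel axis, scale pixels to [0, 1], one-hot encode the labels) can be tried in isolation. The following is a minimal NumPy sketch on fake data: the random `images` array and the hard-coded `channels_last` shape are assumptions for illustration, and `np.eye` stands in for `keras.utils.to_categorical`.

```python
import numpy as np

num_classes = 5  # same value as in the notebook

# Four fake 28x28 grayscale images with pixel values 0..255 (a stand-in for MNIST).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
labels = np.array([0, 3, 4, 1])

input_shape = (28, 28, 1)  # assumes the 'channels_last' data format

# Add the channel axis, convert to float32, and scale pixel values to [0, 1].
x = images.reshape((images.shape[0],) + input_shape).astype("float32") / 255

# One-hot encode the labels; np.eye is used here in place of to_categorical.
y = np.eye(num_classes, dtype="float32")[labels]

print(x.shape, x.dtype)
print(y)
```

Each row of `y` has a single 1 in the column given by the label, which is the format `categorical_crossentropy` expects.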

In [None]:
# Load the data, which comes already split into train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Create two datasets: one with digits below 5 and one with 5 and above
x_train_lt5 = x_train[y_train < 5]
y_train_lt5 = y_train[y_train < 5]
x_test_lt5 = x_test[y_test < 5]
y_test_lt5 = y_test[y_test < 5]

x_train_gte5 = x_train[y_train >= 5]
y_train_gte5 = y_train[y_train >= 5] - 5
x_test_gte5 = x_test[y_test >= 5]
y_test_gte5 = y_test[y_test >= 5] - 5
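
The split above relies on NumPy boolean masks, and on subtracting 5 so that the labels of the 5-9 set fall in the range 0-4 that a 5-class one-hot encoding expects. A small sketch on toy arrays (the values are made up for illustration) shows the effect:

```python
import numpy as np

# Toy labels, plus a one-column "image" array so the rows are easy to follow.
y = np.array([7, 2, 5, 0, 9, 4])
x = np.arange(len(y)).reshape(-1, 1)  # row i is the "image" for label y[i]

mask_lt5 = y < 5

# Digits below 5 keep their labels.
x_lt5, y_lt5 = x[mask_lt5], y[mask_lt5]

# Digits 5 and above are shifted down so that 5..9 become 0..4.
x_gte5, y_gte5 = x[~mask_lt5], y[~mask_lt5] - 5

print(y_lt5, x_lt5.ravel())
print(y_gte5, x_gte5.ravel())
```

Applying the same mask to `x` and `y` keeps images and labels aligned row by row.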

In [None]:
# Define the "feature" layers. These are the early layers that we expect will "transfer" to the other digits.
# These layers are frozen during the fine-tuning process.
# They are kept in a plain Python list because a Sequential model, which is a linear stack of layers,
# can be created by passing a list of layer instances to its constructor.

feature_layers = [
    Conv2D(filters, kernel_size,
           padding='valid',
           input_shape=input_shape),
    Activation('relu'),
    Conv2D(filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size=pool_size),
    Dropout(0.25),
    Flatten(),
]


In [None]:
# Define the "classification" layers.  These are the later layers that predict the specific classes from the features
# learned by the feature layers.  This is the part of the model that needs to be re-trained for the next problem. 

classification_layers = [
    Dense(128),
    Activation('relu'),
    Dropout(0.5),
    Dense(num_classes),
    Activation('softmax')
]

In [None]:
# Create the model by combining the two sets of layers as follows
model = Sequential(feature_layers + classification_layers)

In [11]:
# Let's take a look
model.summary()

In [None]:
# Now train the model on the digits 5,6,7,8,9

train_model(model,
            (x_train_gte5, y_train_gte5),
            (x_test_gte5, y_test_gte5), num_classes)

x_train shape: (29404, 28, 28, 1)
29404 train samples
4861 test samples
Epoch 1/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 7s 26ms/step - accuracy: 0.1866 - loss: 1.6205 - val_accuracy: 0.2477 - val_loss: 1.6021
Epoch 2/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 7s 29ms/step - accuracy: 0.2189 - loss: 1.6024 - val_accuracy: 0.3454 - val_loss: 1.5816
Epoch 3/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - accuracy: 0.2660 - loss: 1.5838 - val_accuracy: 0.4767 - val_loss: 1.5604
Epoch 4/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 6s 27ms/step - accuracy: 0.3241 - loss: 1.5643 - val_accuracy: 0.5838 - val_loss: 1.5377
Epoch 5/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 6s 28ms/step - accuracy: 0.3869 - loss: 1.5436 - val_accuracy: 0.6573 - val_loss: 1.5126
Training time: 0:00:33.606948
Test score: 1.5125788450241089
Test accuracy: 0.6572721600532532


### Freezing Layers
Keras allows layers to be "frozen" during the training process. That is, some layers have their weights updated during training while others do not. This is a core part of transfer learning: the ability to train just the last one or several layers.

Note also that a lot of the training time is spent back-propagating the gradients to the first layer. If the gradients only need to be computed back through a small number of layers, each training iteration is much quicker. This is in addition to the savings gained by being able to train on a smaller data set.
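
To make the idea of freezing concrete, here is a toy NumPy sketch of a two-layer network in which only the last layer is updated. It is not how Keras implements freezing internally; the data, layer sizes, learning rate, and names (`W1`, `W2`) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 64 samples, 8 features, binary target (illustrative values only).
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(np.float64).reshape(-1, 1)

# W1 plays the role of the frozen "feature" layer, W2 the trainable "classification" layer.
W1 = rng.normal(scale=0.5, size=(8, 16))
W2 = np.zeros((16, 1))
W1_before = W1.copy()

def forward(inputs):
    h = np.maximum(inputs @ W1, 0.0)       # ReLU features from the frozen layer
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))    # sigmoid output of the trainable layer
    return h, p

def bce(p):
    eps = 1e-9
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

initial_loss = bce(forward(X)[1])

for _ in range(200):
    h, p = forward(X)
    grad_W2 = h.T @ (p - y) / len(X)  # gradient of the loss with respect to W2 only
    W2 -= 0.1 * grad_W2               # update the unfrozen layer
    # No gradient is computed for W1: back-propagation stops at the frozen layer.

final_loss = bce(forward(X)[1])
print(initial_loss, final_loss)
print("W1 unchanged:", np.array_equal(W1, W1_before))
```

The loss still goes down because the last layer adapts to the fixed features, and skipping the gradient for `W1` is where the per-step saving comes from.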

In [13]:
# Freeze only the feature layers
for l in feature_layers:
    l.trainable = False
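
The loop above changes the model even though it only iterates over `feature_layers`. That works because `feature_layers + classification_layers` builds a new list that refers to the same layer objects. The sketch below shows this with a hypothetical `ToyLayer` class standing in for Keras layers, on the assumption that `Sequential` keeps references to the instances it is given.

```python
from dataclasses import dataclass

@dataclass
class ToyLayer:
    """Hypothetical stand-in for a Keras layer; only tracks a name and a flag."""
    name: str
    trainable: bool = True

features = [ToyLayer("conv1"), ToyLayer("conv2")]
classifier = [ToyLayer("dense1"), ToyLayer("dense2")]

# Concatenation creates a new list, but its items are the same objects.
stack = features + classifier

# Freezing through the original list is therefore visible through the stack.
for layer in features:
    layer.trainable = False

frozen = [layer.name for layer in stack if not layer.trainable]
print(frozen)
```

In Keras itself, the documentation advises compiling the model again after changing `trainable`; `train_model` does this because it calls `model.compile` on every run.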

Observe below the differences between the number of *total params*, *trainable params*, and *non-trainable params*.

In [14]:
model.summary()

In [15]:
train_model(model,
            (x_train_lt5, y_train_lt5),
            (x_test_lt5, y_test_lt5), num_classes)

x_train shape: (30596, 28, 28, 1)
30596 train samples
5139 test samples
Epoch 1/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.3053 - loss: 1.5928 - val_accuracy: 0.5437 - val_loss: 1.5586
Epoch 2/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - accuracy: 0.3770 - loss: 1.5608 - val_accuracy: 0.6032 - val_loss: 1.5246
Epoch 3/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 4s 15ms/step - accuracy: 0.4411 - loss: 1.5285 - val_accuracy: 0.6492 - val_loss: 1.4917
Epoch 4/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 3s 14ms/step - accuracy: 0.4864 - loss: 1.4964 - val_accuracy: 0.6888 - val_loss: 1.4593
Epoch 5/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.5350 - loss: 1.4680 - val_accuracy: 0.7225 - val_loss: 1.4278
Training time: 0:00:18.713408
Test score: 1.4278391599655151
Test accuracy: 0.7225140929222107


Note that after a single epoch of fine-tuning, validation accuracy on 0-4 is already 0.54, and by the third epoch (0.65) it is on par with the 0.66 reached on 5-9 after 5 full epochs. This is despite the fact that we are only fine-tuning the classification layers of the network, and the early layers have never seen what the digits 0-4 look like.

Also, note that even though nearly all (590K/600K) of the *parameters* were trainable, the training time per epoch was still roughly halved (about 14 ms/step versus 26-29 ms/step). This is because the unfrozen part of the network is very shallow, which makes backpropagation faster.
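
The 590K/600K figure can be checked by hand from the layer definitions. The sketch below recomputes the parameter counts in plain Python, assuming the 28x28x1 input, 3x3 kernels with `valid` padding, 32 filters, and 2x2 pooling defined earlier; the helper names are made up for this example.

```python
def conv2d_params(kernel, in_channels, out_channels):
    # One kernel x kernel x in_channels filter per output channel, plus a bias each.
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

side = 28
conv1 = conv2d_params(3, 1, 32)
side -= 2                      # 'valid' 3x3 convolution: 28 -> 26
conv2 = conv2d_params(3, 32, 32)
side -= 2                      # 26 -> 24
side //= 2                     # 2x2 max pooling: 24 -> 12
flat = side * side * 32        # Flatten output: 12 * 12 * 32

dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 5)

frozen = conv1 + conv2         # feature layers (Activation, Dropout, Flatten have no weights)
trainable = dense1 + dense2    # classification layers
total = frozen + trainable

print(frozen, trainable, total)
```

Almost all of the weights sit in the first `Dense` layer, so freezing the convolutional layers removes only about 2% of the parameters but most of the depth that gradients must travel through.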


## Reverse the Process
- Let's validate the result by reversing the training process: train on the digits 0-4, then fine-tune only the last layers on the digits 5-9.

In [16]:
# Create layers and define the model as above
feature_layers2 = [
    Conv2D(filters, kernel_size,
           padding='valid',
           input_shape=input_shape),
    Activation('relu'),
    Conv2D(filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size=pool_size),
    Dropout(0.25),
    Flatten(),
]

classification_layers2 = [
    Dense(128),
    Activation('relu'),
    Dropout(0.5),
    Dense(num_classes),
    Activation('softmax')
]
model2 = Sequential(feature_layers2 + classification_layers2)
model2.summary()

In [None]:
# Train our model on the digits 0,1,2,3,4
train_model(model2,
            (x_train_lt5, y_train_lt5),
            (x_test_lt5, y_test_lt5), num_classes)

x_train shape: (30596, 28, 28, 1)
30596 train samples
5139 test samples
Epoch 1/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 7s 26ms/step - accuracy: 0.2247 - loss: 1.5968 - val_accuracy: 0.5094 - val_loss: 1.5558
Epoch 2/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 6s 27ms/step - accuracy: 0.3258 - loss: 1.5563 - val_accuracy: 0.7248 - val_loss: 1.5105
Epoch 3/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 6s 26ms/step - accuracy: 0.4314 - loss: 1.5149 - val_accuracy: 0.8126 - val_loss: 1.4608
Epoch 4/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 7s 27ms/step - accuracy: 0.5285 - loss: 1.4666 - val_accuracy: 0.8543 - val_loss: 1.4040
Epoch 5/5
240/240 ━━━━━━━━━━━━━━━━━━━━ 6s 26ms/step - accuracy: 0.5927 - loss: 1.4162 - val_accuracy: 0.8821 - val_loss: 1.3383
Training time: 0:00:32.956645
Test score: 1.3382974863052368
Test accuracy: 0.882078230381012


In [None]:
# Freeze the layers
for l in feature_layers2:
    l.trainable = False

In [19]:
model2.summary()

In [20]:
train_model(model2,
            (x_train_gte5, y_train_gte5),
            (x_test_gte5, y_test_gte5), num_classes)

x_train shape: (29404, 28, 28, 1)
29404 train samples
4861 test samples
Epoch 1/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 4s 14ms/step - accuracy: 0.2705 - loss: 1.6017 - val_accuracy: 0.3956 - val_loss: 1.5637
Epoch 2/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.3063 - loss: 1.5646 - val_accuracy: 0.4573 - val_loss: 1.5246
Epoch 3/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.3599 - loss: 1.5299 - val_accuracy: 0.5143 - val_loss: 1.4863
Epoch 4/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.4088 - loss: 1.4936 - val_accuracy: 0.5892 - val_loss: 1.4487
Epoch 5/5
230/230 ━━━━━━━━━━━━━━━━━━━━ 3s 13ms/step - accuracy: 0.4677 - loss: 1.4597 - val_accuracy: 0.6521 - val_loss: 1.4116
Training time: 0:00:16.266774
Test score: 1.411635160446167
Test accuracy: 0.6521291732788086


## Conclusion

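1. When training on digits 5-9 and transferring to digits 0-4:
    - After five epochs of fine-tuning only the classification layers, test accuracy on 0-4 reached 0.72, above the 0.66 reached on 5-9 in the initial training, though below the 0.88 reached when the whole network was trained on 0-4
    - Training time dropped from about 34 s to about 19 s even though most parameters remained trainable

2. When reversing the process (training on 0-4 and transferring to 5-9):
    - Fine-tuning reached 0.65 test accuracy on 5-9, close to the 0.66 obtained by training the whole network on 5-9, in about half the time (16 s versus 34 s)
    - The frozen features learned from one set of digits were still useful for classifying the other set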
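
The time savings can be quantified from the `Training time` lines logged above. This short sketch copies those four values (in seconds) and compares each fine-tuning run with the full training run on the same digits; the dictionary keys are labels invented for this example.

```python
# Training times in seconds, copied from the logged output of the four runs.
logged_seconds = {
    "full 5-9": 33.606948,
    "fine-tune 0-4": 18.713408,
    "full 0-4": 32.956645,
    "fine-tune 5-9": 16.266774,
}

# Compare fine-tuning with full training on the same set of digits.
speedup_0_4 = logged_seconds["full 0-4"] / logged_seconds["fine-tune 0-4"]
speedup_5_9 = logged_seconds["full 5-9"] / logged_seconds["fine-tune 5-9"]

print(f"0-4: {speedup_0_4:.2f}x faster, 5-9: {speedup_5_9:.2f}x faster")
```

These are single wall-clock measurements from one machine, so the ratios are indicative rather than precise.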

### Key Concepts Demonstrated
- **Transfer Learning**: We showed how knowledge gained from one task (classifying certain digits) can be applied to another related task (classifying different digits)
- **Feature Extraction**: The early convolutional layers learned general features useful for digit recognition regardless of the specific digits
- **Fine-tuning**: By freezing feature layers and only training classification layers, we achieved efficient adaptation

### Business Use Cases
- **Limited Data Scenarios**: When labeled data is scarce for a specific task, transfer learning enables leveraging data from related tasks
- **Quick Adaptation**: New classification problems can be solved rapidly by reusing pre-trained models
- **Resource Efficiency**: Reduced training time and computational resources make ML applications more cost-effective
- **Edge Deployment**: Smaller, specialized models can be deployed more efficiently to edge devices

### Key Functions and Parameters
- **Layer Freezing** (`l.trainable = False`): Critical for transfer learning, controls which parts of the network are updated
- **Model Architecture**: The separation of feature and classification layers enables effective knowledge transfer
- **Dropout Layers** (0.25 and 0.5): Help prevent overfitting during both initial training and fine-tuning
- **Epochs**: Both the initial training and the fine-tuning ran for just 5 epochs each, which was enough to see the transfer effect
- **Batch Size**: A batch size of 128 was used throughout as a trade-off between training speed and the noise in each gradient estimate

This notebook demonstrates that transfer learning is not just theoretical but provides practical benefits in terms of accuracy, training time, and data efficiency.