# CNN transfer learning on MNIST dataset

Assignment instructions:
1. Create NN and train on MNIST digits 0-4
2. Test the NN on 0-9 digits test set
3. Apply transfer learning by freezing layer/adding new layers and training on digits 5-9
4. Test again on 0-9 digits test set

In this notebook I've implemented two different methods for applying the transfer learning: 
- The first approach uses a sequential CNN in which no layers are added after training on 0-4. The trained weights are loaded into a new model and the convolution layers, which extract the image features, are frozen. The model is then trained on 5-9 with a low learning rate (using the SGD optimizer) to reach a compromise where it can predict both digit subsets, with roughly 77% accuracy overall in the run shown below.
- The second approach uses a branching CNN. The digits 0-4 are first trained on a single-branch model (similar to the model in the first approach), and the weights are then loaded into a new model that branches after the convolution layers. The convolution layers and the 0-4 branch are then frozen and the 5-9 branch is trained, resulting in a model with two output layers. Because each output layer produces its own probabilities for every image, the accuracy of the model is calculated manually by taking the digit with the highest probability across both output layers. The resulting predictions are 90+% accurate.

## Sequential Approach

In [1]:
# import libraries
import keras
from keras.datasets import mnist
from keras.models import Sequential, Model, load_model
from keras.layers import Input, Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.optimizers import Adam, SGD
from keras import regularizers
from keras import backend as K
import numpy as np

Using TensorFlow backend.


### 1. Preparing the data

In [2]:
# set network parameters
batch_size = 128
num_classes = 10
epochs = 10

img_rows = 28
img_cols = 28

In [3]:
# load and split dataset into test/train
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# split subsets into 0-4 and 5-9 sets (using _lt for less than 5 set, and _gt for greater than 4 set)
X_train_lt = X_train[y_train < 5]
y_train_lt = y_train[y_train < 5]
X_test_lt = X_test[y_test < 5]
y_test_lt = y_test[y_test < 5]

X_train_gt = X_train[y_train > 4]
y_train_gt = y_train[y_train > 4]
X_test_gt = X_test[y_test > 4]
y_test_gt = y_test[y_test > 4]

In [4]:
# reshape the data to fit the model
X_train_lt = X_train_lt.reshape(X_train_lt.shape[0], img_rows, img_cols, 1)
X_test_lt = X_test_lt.reshape(X_test_lt.shape[0], img_rows, img_cols, 1)
X_train_lt = X_train_lt.astype('float32')/255
X_test_lt = X_test_lt.astype('float32')/255

X_train_gt = X_train_gt.reshape(X_train_gt.shape[0], img_rows, img_cols, 1)
X_test_gt = X_test_gt.reshape(X_test_gt.shape[0], img_rows, img_cols, 1)
X_train_gt = X_train_gt.astype('float32')/255
X_test_gt = X_test_gt.astype('float32')/255

X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
X_train = X_train.astype('float32')/255
X_test = X_test.astype('float32')/255

In [5]:
# convert class vectors to binary matrices
y_train_lt = keras.utils.to_categorical(y_train_lt, num_classes)
y_test_lt = keras.utils.to_categorical(y_test_lt, num_classes)

y_train_gt = keras.utils.to_categorical(y_train_gt, num_classes)
y_test_gt = keras.utils.to_categorical(y_test_gt, num_classes)

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

### 2. Training on 0-4

In [6]:
# set cnn parameters
filters = 32
kernel_size = (3, 3)
pool_size = (2, 2)

In [7]:
# create initial model
model = Sequential()
model.add(Conv2D(16, kernel_size, activation='relu', kernel_initializer='he_normal', input_shape=(img_rows, img_cols, 1), 
                 padding='same'))
model.add(Conv2D(32, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same'))
model.add(Conv2D(64, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Flatten())
model.add(Dense(128, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax', kernel_initializer='he_normal'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 28, 28, 16)        160       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 32)        4640      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 12544)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               1605760   
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
__________

In [8]:
# compile the model
model.compile(
    loss='categorical_crossentropy', 
    optimizer=Adam(), 
    metrics=['accuracy']
)

# train the model
model.fit(
    X_train_lt, 
    y_train_lt, 
    batch_size=batch_size,
    epochs=epochs, 
    verbose=1,
    validation_split=0.25
)


Train on 22947 samples, validate on 7649 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1447f0bca48>

In [9]:
# save model with weights to file
model.save('model_1.h5')

### 3. Testing on 0-9

In [10]:
# test accuracy on 0-4 test subset
score = model.evaluate(X_test_lt, y_test_lt, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.0040327511951030095
Test accuracy: 0.9986378672893559


In [11]:
# test accuracy on 5-9 test subset
score = model.evaluate(X_test_gt, y_test_gt, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 15.211717275601066
Test accuracy: 0.0


In [12]:
# test accuracy on full 0-9 test set
score = model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 7.396488146209717
Test accuracy: 0.5132


### 4. Training on 5-9

In [13]:
# open model from file
new_model = load_model('model_1.h5')

In [14]:
# freeze convolution layers and verify
for i in range(5):
    new_model.layers[i].trainable = False
for layer in new_model.layers:
    print(layer.name, layer.trainable)

conv2d_1 False
conv2d_2 False
conv2d_3 False
max_pooling2d_1 False
flatten_1 False
dense_1 True
dropout_1 True
dense_2 True


In [15]:
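# set regularization for trainable dense layers
# NOTE: assigning kernel_regularizer on an already-built layer probably has no effect,
# because Keras registers the regularization loss when the layer's weights are created
new_model.layers[-1].kernel_regularizer=regularizers.l2(0.001)
new_model.layers[-3].kernel_regularizer=regularizers.l2(0.001)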

In [16]:
# compile the model with SGD and a low learning rate
new_model.compile(
    loss='categorical_crossentropy', 
    optimizer=SGD(lr=7e-5, momentum=0.5), 
    metrics=['accuracy']
)

# train the model on 5-9
new_model.fit(
    X_train_gt, 
    y_train_gt, 
    batch_size=batch_size,
    epochs=epochs, 
    verbose=1,
    validation_split=0.2
    #validation_data=(X_test, y_test)
)

Train on 23523 samples, validate on 5881 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1440592e648>

### 5. Testing on 0-9

In [17]:
# test accuracy on 0-4 test subset
score = new_model.evaluate(X_test_lt, y_test_lt, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.07988599122968205
Test accuracy: 0.9694493092041253


In [18]:
# test accuracy on 5-9 test subset
score = new_model.evaluate(X_test_gt, y_test_gt, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 2.3018385858796697
Test accuracy: 0.5589384899858436


In [19]:
# test accuracy on full 0-9 test set
score = new_model.evaluate(X_test, y_test, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 1.1599771546840667
Test accuracy: 0.7699


Comparing the tests after the 0-4 training and after the 5-9 training, the model shows a large improvement in overall prediction accuracy. The 0-4 prediction accuracy falls by about 3 percentage points after retraining (99.9% to 96.9%), but the 5-9 prediction accuracy rises from 0% to about 56%, lifting the 0-9 accuracy from 51% to 77%.
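
As a sanity check, the full-set accuracy should equal the sample-weighted average of the two subset accuracies. The sketch below reproduces that arithmetic from the numbers printed by the cells above. The subset sizes (5139 and 4861 test images) are taken from the label shapes printed later in this notebook, and the helper name `weighted_accuracy` is made up for this illustration.

```python
# Test-set sizes for the 0-4 and 5-9 subsets (from the label shapes in this notebook).
N_LT, N_GT = 5139, 4861

def weighted_accuracy(acc_lt, acc_gt, n_lt=N_LT, n_gt=N_GT):
    """Combine per-subset accuracies into one accuracy over all samples."""
    return (acc_lt * n_lt + acc_gt * n_gt) / (n_lt + n_gt)

# Accuracies printed by the evaluate() cells above.
before = weighted_accuracy(0.9986378672893559, 0.0)
after = weighted_accuracy(0.9694493092041253, 0.5589384899858436)

print('0-9 accuracy before transfer: %.4f' % before)
print('0-9 accuracy after transfer:  %.4f' % after)
```

Both values agree with the 0.5132 and 0.7699 reported when evaluating on the full 0-9 test set.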

This sequential transfer-learned model performs reasonably well. However, while prototyping for optimal network parameters it became apparent that the model has a severe tradeoff between the accuracy on the two subsets: it either underfits the new data or forgets the old data as its weights are updated.

Adding new layers after the transfer was considered, but every implementation of it resulted in catastrophic forgetting of the 0-4 data set.

## Branching Approach

### 1. Preparing the data

In [20]:
# set network parameters
batch_size = 128
num_classes = 5
total_classes = 10
epochs = 10

img_rows = 28
img_cols = 28

In [21]:
# load and split dataset into test/train
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# sort into 0-4 and 5-9 sets (using _lt for less than 5 set, and _gt for greater than 4 set)
X_train_lt = X_train[y_train < 5]
y_train_lt = y_train[y_train < 5]
X_test_lt = X_test[y_test < 5]
y_test_lt = y_test[y_test < 5]

X_train_gt = X_train[y_train > 4]
y_train_gt = y_train[y_train > 4] - 5
X_test_gt = X_test[y_test > 4]
y_test_gt = y_test[y_test > 4] - 5

In [22]:
# reshape the data to fit the model
X_train_lt = X_train_lt.reshape(X_train_lt.shape[0], img_rows, img_cols, 1)
X_test_lt = X_test_lt.reshape(X_test_lt.shape[0], img_rows, img_cols, 1)
X_train_lt = X_train_lt.astype('float32')/255
X_test_lt = X_test_lt.astype('float32')/255

X_train_gt = X_train_gt.reshape(X_train_gt.shape[0], img_rows, img_cols, 1)
X_test_gt = X_test_gt.reshape(X_test_gt.shape[0], img_rows, img_cols, 1)
X_train_gt = X_train_gt.astype('float32')/255
X_test_gt = X_test_gt.astype('float32')/255

X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
X_train = X_train.astype('float32')/255
X_test = X_test.astype('float32')/255

In [23]:
# convert class vectors to binary matrices
y_train_lt = keras.utils.to_categorical(y_train_lt, num_classes)
y_test_lt = keras.utils.to_categorical(y_test_lt, num_classes)

y_train_gt = keras.utils.to_categorical(y_train_gt, num_classes)
y_test_gt = keras.utils.to_categorical(y_test_gt, num_classes)

y_train = keras.utils.to_categorical(y_train, total_classes)
y_test = keras.utils.to_categorical(y_test, total_classes)

# split labels into corresponding 0-4 and 5-9 sets (including zeros) for dual outputs
y_train_lt_dual = y_train[:,:5]
y_test_lt_dual = y_test[:,:5]
y_train_gt_dual = y_train[:,5:]
y_test_gt_dual = y_test[:,5:]

In [24]:
# get shape of lt and gt subsets
print('Shape of y_train_lt: ', y_train_lt.shape)
print('Shape of y_train_gt: ', y_train_gt.shape)
print('Shape of y_test_lt: ', y_test_lt.shape)
print('Shape of y_test_gt: ', y_test_gt.shape)

Shape of y_train_lt:  (30596, 5)
Shape of y_train_gt:  (29404, 5)
Shape of y_test_lt:  (5139, 5)
Shape of y_test_gt:  (4861, 5)


In [25]:
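# create zero label arrays for training and evaluation:
# *_gt_zeros are the empty 5-9 branch labels for 0-4 samples,
# *_lt_zeros are the empty 0-4 branch labels for 5-9 samples
y_train_gt_zeros = np.zeros((y_train_lt.shape[0], num_classes))
y_train_lt_zeros = np.zeros((y_train_gt.shape[0], num_classes))
y_test_gt_zeros = np.zeros((y_test_lt.shape[0], num_classes))
y_test_lt_zeros = np.zeros((y_test_gt.shape[0], num_classes))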

### 2. Training on 0-4

In [26]:
# set cnn parameters
filters = 32
kernel_size = (3, 3)
pool_size = (2, 2)

In [27]:
# create initial model
inp = Input(shape=(img_rows, img_cols, 1), name='input')
x = Conv2D(16, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same', name='conv2d_1')(inp)
x = Conv2D(32, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same', name='conv2d_2')(x)
x = Conv2D(64, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same', name='conv2d_3')(x)
x = MaxPooling2D(pool_size=pool_size, name='max_pooling2d_1')(x)
x = Flatten(name='flatten_1')(x)
x = Dense(128, activation='elu', kernel_initializer='he_normal', name='branch_1_dense_1')(x)
x = Dense(128, activation='elu', kernel_initializer='he_normal', name='branch_1_dense_2')(x)
x = Dropout(0.5, name='branch_1_dropout_1')(x)
x = Dense(num_classes, activation='softmax', kernel_initializer='he_normal', name='branch_1_dense_3')(x)

model = Model(inputs=inp, outputs=x)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 28, 28, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 16)        160       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 32)        4640      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 12544)             0         
_________________________________________________________________
branch_1_dense_1 (Dense)     (None, 128)               1605760   
__________

In [28]:
# compile the model
model.compile(
    loss='categorical_crossentropy', 
    optimizer=Adam(), 
    metrics=['accuracy']
)

# train the model
model.fit(
    X_train_lt, 
    y_train_lt, 
    batch_size=batch_size,
    epochs=epochs, 
    verbose=1,
    validation_split=0.25
)

Train on 22947 samples, validate on 7649 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x14405937088>

In [29]:
# save model with weights to file
model.save('model_2.h5')

### 3. Testing on 0-9

In [30]:
# test accuracy on 0-4 test subset
score = model.evaluate(X_test_lt, y_test_lt, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 0.013370477953510325
Test accuracy: 0.9961081922553026


In [31]:
# test accuracy on 5-9 test subset
score = model.evaluate(X_test_gt, y_test_gt, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 7.151638581904333
Test accuracy: 0.2592059246792614


Note: because this model only accepts and outputs half of the digits at a time, we take its overall accuracy to be the average of its accuracy on each subset. Thus 0-9 test accuracy: 0.6277
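
The averaging described in the note can be written out directly. This is a minimal sketch using the two subset accuracies printed by the cells above. The sample-weighted variant is added here only for comparison and assumes the test subset sizes of 5139 and 4861 images reported by the shape printout earlier in this section.

```python
# Subset accuracies printed by the two evaluate() cells above.
acc_lt = 0.9961081922553026  # 0-4 test subset
acc_gt = 0.2592059246792614  # 5-9 test subset (labels shifted down to 0-4)

# Simple (unweighted) mean of the two subset accuracies.
mean_acc = (acc_lt + acc_gt) / 2

# Sample-weighted mean, using the test subset sizes as weights.
n_lt, n_gt = 5139, 4861
weighted_acc = (acc_lt * n_lt + acc_gt * n_gt) / (n_lt + n_gt)

print('Unweighted mean accuracy: %.4f' % mean_acc)
print('Weighted mean accuracy:   %.4f' % weighted_acc)
```

The two means differ by about one percentage point here because the 0-4 subset is slightly larger than the 5-9 subset.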

### 4. Training on 5-9

In [32]:
# recreate initial model
inp = Input(shape=(img_rows, img_cols, 1), name='input')
x = Conv2D(16, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same', name='conv2d_1')(inp)
x = Conv2D(32, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same', name='conv2d_2')(x)
x = Conv2D(64, kernel_size, activation='relu', kernel_initializer='he_normal', padding='same', name='conv2d_3')(x)
x = MaxPooling2D(pool_size=pool_size, name='max_pooling2d_1')(x)
x = Flatten(name='flatten_1')(x)

# output branch 1
y1 = Dense(128, activation='elu', kernel_initializer='he_normal', name='branch_1_dense_1')(x)
y1 = Dense(128, activation='elu', kernel_initializer='he_normal', name='branch_1_dense_2')(y1)
y1 = Dropout(0.5, name='branch_1_dropout_1')(y1)
y1 = Dense(num_classes, activation='softmax', kernel_initializer='he_normal', name='branch_1_dense_3')(y1)

# output branch 2
y2 = Dense(128, activation='elu', kernel_initializer='he_normal', name='branch_2_dense_1')(x)
y2 = Dense(128, activation='elu', kernel_initializer='he_normal', name='branch_2_dense_2')(y2)
y2 = Dropout(0.5, name='branch_2_dropout_1')(y2)
y2 = Dense(num_classes, activation='softmax', kernel_initializer='he_normal', name='branch_2_dense_3')(y2)

new_model = Model(inputs=inp, outputs=[y1, y2])
new_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input (InputLayer)              (None, 28, 28, 1)    0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 28, 28, 16)   160         input[0][0]                      
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 28, 28, 32)   4640        conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 28, 28, 64)   18496       conv2d_2[0][0]                   
__________________________________________________________________________________________________
max_poolin

In [33]:
# load the trained 0-4 model from file
old_model = load_model('model_2.h5')
old_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 28, 28, 1)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 28, 28, 16)        160       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 32)        4640      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 28, 28, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 12544)             0         
_________________________________________________________________
branch_1_dense_1 (Dense)     (None, 128)               1605760   
__________

In [34]:
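# copy weights from the 0-4 model into the same-named layers of the branched model and freeze them
old_layers = {layer.name: layer for layer in old_model.layers}
for new_layer in new_model.layers:
    if new_layer.name in old_layers:
        new_layer.set_weights(old_layers[new_layer.name].get_weights())
        new_layer.trainable = False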

In [35]:
for i,layer in enumerate(new_model.layers):
    print(i,layer.name,layer.trainable)

0 input False
1 conv2d_1 False
2 conv2d_2 False
3 conv2d_3 False
4 max_pooling2d_1 False
5 flatten_1 False
6 branch_1_dense_1 False
7 branch_2_dense_1 True
8 branch_1_dense_2 False
9 branch_2_dense_2 True
10 branch_1_dropout_1 False
11 branch_2_dropout_1 True
12 branch_1_dense_3 False
13 branch_2_dense_3 True


In [36]:
# compile the model
new_model.compile(
    loss='categorical_crossentropy', 
    optimizer=Adam(), 
    metrics=['accuracy']
)

# train the 5-9 branch; branch 1 is frozen and receives all-zero labels, so it adds no loss
new_model.fit(
    X_train_gt, 
    y={'branch_1_dense_3': y_train_lt_zeros, 'branch_2_dense_3': y_train_gt}, 
    batch_size=batch_size,
    epochs=epochs, 
    verbose=1,
    validation_split=0.25
)

Train on 22053 samples, validate on 7351 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x14407f84388>

### 5. Testing on 0-9

In [37]:
# retrieve which metrics are returned by evaluate
new_model.metrics_names

['loss',
 'branch_1_dense_3_loss',
 'branch_2_dense_3_loss',
 'branch_1_dense_3_acc',
 'branch_2_dense_3_acc']

In [38]:
# test branch 1 with zeros in branch 2
score = new_model.evaluate(X_test_lt, y={'branch_1_dense_3': y_test_lt, 'branch_2_dense_3': y_test_gt_zeros}, verbose=1)
print('Total loss:', score[0])
print('Branch 1 accuracy:', score[3])
print('Branch 2 accuracy:', score[4])

Total loss: 0.013370477953510325
Branch 1 accuracy: 0.9961081922553026
Branch 2 accuracy: 0.15547771937746463


In [39]:
# test branch 2 with zeros in branch 1
score = new_model.evaluate(X_test_gt, y={'branch_1_dense_3': y_test_lt_zeros, 'branch_2_dense_3': y_test_gt}, verbose=1)
print('Total loss:', score[0])
print('Branch 1 accuracy:', score[3])
print('Branch 2 accuracy:', score[4])

Total loss: 0.040323474580536936
Branch 1 accuracy: 0.19687307146419053
Branch 2 accuracy: 0.9917712404854968


In [40]:
# test both branches with mixed zeros
score = new_model.evaluate(X_test, y={'branch_1_dense_3': y_test_lt_dual, 'branch_2_dense_3': y_test_gt_dual}, verbose=1)
print('Total loss:', score[0])
print('Branch 1 accuracy:', score[3])
print('Branch 2 accuracy:', score[4])

Total loss: 0.0264723296154757
Branch 1 accuracy: 0.6076
Branch 2 accuracy: 0.562


In [41]:
# call predict so we can compute the correct output based on probabilities
pred = new_model.predict(X_test, verbose=1)
# concat. two probability arrays into one and extract largest probability
pred = np.concatenate((pred[0], pred[1]), axis=1)
pred = np.argmax(pred, axis=1)
# convert prediction to match label array and compute overall accuracy
pred = keras.utils.to_categorical(pred, total_classes)
acc = np.sum(np.all(np.equal(pred, y_test), axis=1))/len(pred)
print('Total overall accuracy: ', acc)

Total overall accuracy:  0.935


Comparing the first-iteration model to the second-iteration branched model shows a large improvement. Even though the first iteration only handles a five-digit set at a time and was never trained on the 5-9 set, it still reaches 26% accuracy on that set, only slightly above the 20% chance level for five classes. The average accuracy of about 63% shows that it is only marginally better than the first iteration of the sequential model (51%) at predicting the full set.

After branching and training the new branch on the new set using the same feature extraction, each branch on its own predicts its respective set with about 99% accuracy, as we might expect.

We note that the branches operate completely independently. As a result, when evaluating on the dual label set (where a label row is all zeros when the digit belongs to the other set), each branch's accuracy falls to roughly 56-61%, because each branch still predicts a digit for images from the other set while the label row it is checked against is empty.
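
The drop can be illustrated with a toy example. As far as I understand Keras' categorical accuracy, it compares the `argmax` of the label row with the `argmax` of the prediction. For an all-zero label row `argmax` returns index 0, so a branch is only counted as correct on the other set's images when it happens to predict its first class. Below is a minimal NumPy sketch with made-up probabilities; the helper `categorical_accuracy` is written here for illustration and is not the Keras function itself.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where argmax(y_true) equals argmax(y_pred)."""
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))

# Made-up branch 1 (digits 0-4) predictions for four test images.
branch_1_pred = np.array([
    [0.90, 0.05, 0.03, 0.01, 0.01],  # image of a 0 -> predicts 0
    [0.05, 0.05, 0.80, 0.05, 0.05],  # image of a 2 -> predicts 2
    [0.70, 0.10, 0.10, 0.05, 0.05],  # image of a 7 -> predicts 0
    [0.10, 0.10, 0.10, 0.60, 0.10],  # image of a 9 -> predicts 3
])

# Dual labels for branch 1: one-hot for 0-4 digits, all zeros for 5-9 digits.
branch_1_true = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

acc = categorical_accuracy(branch_1_true, branch_1_pred)
print('Branch 1 accuracy on the toy dual labels:', acc)
```

The same weighting is consistent with the reported numbers: branch 1 scores 0.9961 on the 5139 images of 0-4 and 0.1969 on the 4861 images of 5-9, and (0.9961 × 5139 + 0.1969 × 4861) / 10000 ≈ 0.6076.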

For a more meaningful measurement we instead call predict on the full test set, take the element with the highest probability across both branches, and compare it to the labels. These results show the final branched model reaching 93.5% accuracy on the full test set.
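
A toy version of that combination step, with made-up probabilities for three images, may make it easier to follow. Concatenating the two five-way outputs gives a ten-column array whose column index is the digit itself, because branch 1 covers 0-4 and branch 2 covers 5-9. This sketch compares integer labels directly rather than one-hot rows, which is a simplification for illustration and should give the same accuracy as the one-hot comparison.

```python
import numpy as np

# Made-up softmax outputs of the two branches for three images.
branch_1 = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],   # confident: digit 1
                     [0.30, 0.20, 0.20, 0.20, 0.10],   # unsure
                     [0.25, 0.25, 0.20, 0.20, 0.10]])  # unsure
branch_2 = np.array([[0.30, 0.20, 0.20, 0.20, 0.10],   # unsure
                     [0.05, 0.05, 0.10, 0.75, 0.05],   # confident: digit 5 + 3 = 8
                     [0.60, 0.10, 0.10, 0.10, 0.10]])  # confident: digit 5 + 0 = 5
labels = np.array([1, 8, 6])  # true digits

# Columns 0-4 come from branch 1, columns 5-9 from branch 2.
combined = np.concatenate((branch_1, branch_2), axis=1)
digits = np.argmax(combined, axis=1)

accuracy = np.mean(digits == labels)
print('Predicted digits:', digits)
print('Accuracy:', accuracy)
```

Note that each branch's softmax sums to 1 on its own, so the two sets of probabilities are not calibrated against each other; the comparison works only to the extent that the branch that has never seen a digit is less confident than the branch that has.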

## Conclusion

Using these model structures we can implement a continuous learning model that uses transfer learning techniques to avoid retraining the feature extraction component of the models.

Our results show that the model with an independent branch for each new training set predicts better in this instance. However, its outputs may conflict with one another, increasingly so as more sets are added with each retrain.

Alternatively, the sequential model avoids this conflict by keeping a single structure across transfers, but it does so at the cost of underfitting new sets or forgetting old sets. It might be possible to avoid forgetting by writing a custom loss function for the specific training set, one that avoids penalizing and adjusting the weights corresponding to the previous training sets.
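
One possible reading of that idea, which was not tested in this notebook, is a cross-entropy whose softmax is restricted to the classes of the current training set, so that the scores of the old classes do not enter the loss at all. The NumPy sketch below only illustrates the arithmetic; the function name `masked_cross_entropy` and its arguments are assumptions made for this example, and a real Keras loss would have to be written with backend tensor operations instead.

```python
import numpy as np

def masked_cross_entropy(logits, y_true, class_mask):
    """Cross-entropy with the softmax restricted to the classes in class_mask.

    logits:     (n_samples, n_classes) raw scores before softmax
    y_true:     (n_samples, n_classes) one-hot labels
    class_mask: (n_classes,) boolean, True for classes in the current task
    """
    # Push masked-out classes to a very negative score so they get ~0 probability.
    masked = np.where(class_mask, logits, -1e9)
    # Numerically stable log-softmax over the remaining classes.
    shifted = masked - masked.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(y_true * log_probs, axis=1))

rng = np.random.RandomState(0)
logits = rng.normal(size=(4, 10))      # made-up scores for four images
labels = np.eye(10)[[5, 7, 9, 6]]      # one-hot labels, all from the 5-9 set
new_task = np.arange(10) >= 5          # only classes 5-9 take part in the loss

loss_a = masked_cross_entropy(logits, labels, new_task)

# Changing the scores of the old classes (0-4) leaves the loss unchanged.
logits_b = logits.copy()
logits_b[:, :5] += 3.0
loss_b = masked_cross_entropy(logits_b, labels, new_task)

print('Loss with original old-class scores:', loss_a)
print('Loss with shifted old-class scores: ', loss_b)
```

Because the old-class scores never reach the loss, no gradient would flow to the output weights of digits 0-4. This would not by itself protect the shared dense layer, which both sets of outputs depend on, so it is at best a partial remedy.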