# Training a better model

In [24]:
%matplotlib inline

import utils
from utils import *

In [25]:
path = '../data/redux/'
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)
    
batch_size = 64

## Are we underfitting?

Our validation accuracy so far has generally been higher than our training accuracy. That leads to two obvious questions:

1. How is this possible?
2. Is this desirable?

The answer to (1) is that this is happening because of *dropout*. Dropout refers to a layer that randomly deletes (i.e. sets to zero) each activation in the previous layer with probability *p* (generally 0.5). This only happens during training, not when calculating the accuracy on the validation set, which is why the validation set can show higher accuracy than the training set.
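To make the train/validation difference concrete, here is a minimal NumPy sketch of a dropout layer. It is not the Keras implementation: the function name `dropout` is hypothetical, and the rescaling of surviving activations by 1/(1-p) (so-called inverted dropout) is an assumption made for illustration.

```python
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Zero each activation with probability p, during training only.

    Assumption: survivors are rescaled by 1/(1-p) so that the expected
    activation is unchanged. Other formulations skip this rescaling.
    """
    if not training or p == 0.0:
        return activations  # at validation time the layer does nothing
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(activations.shape) >= p  # True = keep
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
acts = np.ones((1000, 100))

train_out = dropout(acts, p=0.5, training=True, rng=rng)
eval_out = dropout(acts, p=0.5, training=False)

print("fraction zeroed in training:  ", (train_out == 0).mean())
print("fraction zeroed at validation:", (eval_out == 0).mean())
```

During training roughly half of the activations are deleted, while at validation time the full network is used, which is why validation accuracy can come out ahead.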

The purpose of dropout is to avoid overfitting. By deleting parts of the neural network at random during training, it ensures that no one part of the network can overfit to one part of the training set. The creation of dropout was one of the key developments in deep learning, and has allowed us to create rich models without overfitting. However, it can also result in underfitting if overused, and this is something we should be careful of with our model.

So the answer to (2) is: this is probably not desirable. It is likely that we can get better validation set results with less (or no) dropout, if we're seeing that validation accuracy is higher than training accuracy - a strong sign of underfitting. So let's try removing dropout entirely, and see what happens!

(We had dropout in this model already because the VGG authors found it necessary for the ImageNet competition. But that doesn't mean it's necessary for dogs v cats, so we will do our own analysis of regularization approaches from scratch.)

## Removing dropout

Our high-level approach here will be to start with our fine-tuned cats vs dogs model (with dropout), then fine-tune all the dense layers, after removing dropout from them. The steps we will take are:
- Re-create and load our modified VGG model with a binary dependent variable (i.e. dogs v cats)
- Split the model between the convolutional (*conv*) layers and the dense layers
- Pre-calculate the output of the conv layers, so that we don't have to redundantly re-calculate them on every epoch
- Create a new model with just the dense layers, and dropout p set to zero
- Train this new model using the output of the conv layers as training data.
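The steps above boil down to one recipe: run the expensive frozen part once, cache its output, and train only the small head on the cached features. Here is a toy NumPy sketch of that recipe. Everything in it is a stand-in for illustration (`frozen_features` plays the role of the conv layers, a logistic regression plays the role of the dense layers); it is not the VGG code used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen conv layers: a fixed random projection + ReLU.
W_frozen = rng.normal(size=(20, 8)) / np.sqrt(20)
calls = {"n": 0}  # counts how often the expensive part is run

def frozen_features(x):
    calls["n"] += 1
    return np.maximum(x @ W_frozen, 0.0)

# Toy data: the label depends on the sign of the first input column.
X = rng.normal(size=(500, 20))
y = (X[:, 0] > 0).astype(float)

# Step 1: run the frozen part exactly once and keep the result.
feats = frozen_features(X)

# Step 2: train only the small head on the cached features.
w = np.zeros(feats.shape[1])
b = 0.0
lr = 0.3
losses = []
for epoch in range(30):
    probs = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    loss = -np.mean(y * np.log(probs + 1e-12)
                    + (1 - y) * np.log(1 - probs + 1e-12))
    losses.append(loss)
    grad_w = feats.T @ (probs - y) / len(y)  # gradient of the mean log loss
    grad_b = np.mean(probs - y)
    w -= lr * grad_w
    b -= lr * grad_b

print("passes through the frozen part:", calls["n"])
print("first loss: %.4f, last loss: %.4f" % (losses[0], losses[-1]))
```

However many epochs the head is trained for, the frozen part is evaluated only once, which is where the time saving comes from.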

As before we need to start with a working model, so let's bring in our working VGG 16 model and change it to predict our binary dependent...

In [26]:
model = vgg_ft(2)

In [27]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lambda_3 (Lambda)            (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_27 (ZeroPaddi (None, 3, 226, 226)       0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_28 (ZeroPaddi (None, 64, 226, 226)      0         
_________________________________________________________________
conv2d_28 (Conv2D)           (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 64, 112, 112)      0         
_________________________________________________________________
zero_padding2d_29 (ZeroPaddi (None, 64, 114, 114)      0         
__________

...and load our fine-tuned weights.

In [28]:
model.load_weights(model_path+'finetune2.h5')

We're going to train for a number of iterations without dropout, so it is best to pre-calculate the input to the fully connected part of the network, i.e. the output of the last convolutional layer. We'll start by finding this layer in our model, and creating a new model that contains just the layers up to and including it:

In [29]:
layers = model.layers
last_conv_idx = [idx for idx, layer in enumerate(layers) if type(layer)==Conv2D][-1]

print(last_conv_idx)
print(layers[last_conv_idx])

30
<keras.layers.convolutional.Conv2D object at 0x7ff0c4771be0>


In [30]:
conv_layers = layers[:last_conv_idx+1]
conv_model = Sequential(conv_layers)
conv_model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

fc_layers = layers[last_conv_idx+1:]

In [31]:
conv_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lambda_3 (Lambda)            (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_27 (ZeroPaddi (None, 3, 226, 226)       0         
_________________________________________________________________
conv2d_27 (Conv2D)           (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_28 (ZeroPaddi (None, 64, 226, 226)      0         
_________________________________________________________________
conv2d_28 (Conv2D)           (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 64, 112, 112)      0         
_________________________________________________________________
zero_padding2d_29 (ZeroPaddi (None, 64, 114, 114)      0         
__________

Now we can use the exact same approach to creating features as we used when we created the linear model from the ImageNet predictions in the last lesson - it's only the model that has changed. As you're seeing, there's a fairly small number of "recipes" that can get us a long way!

In [15]:
batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)

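# Integer class ids (0 = cats, 1 = dogs), in the same order as the files
trn_classes = batches.classes
val_classes = val_batches.classes
# One-hot encode the class ids for use with categorical_crossentropy
trn_labels = onehot(trn_classes)
val_labels = onehot(val_classes)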

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.


In [16]:
batches.class_indices

{'cats': 0, 'dogs': 1}
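Given this mapping, the one-hot labels are easy to picture. The `onehot` helper comes from the course `utils` module and its implementation isn't shown here, so the following is only a sketch of what one-hot encoding produces; `onehot_sketch` is a hypothetical stand-in, not the real helper.

```python
import numpy as np

def onehot_sketch(classes, num_classes):
    """Turn integer class ids into one-hot rows (illustrative stand-in)."""
    classes = np.asarray(classes, dtype=int)
    return np.eye(num_classes)[classes]  # row i of the identity = class i

class_indices = {'cats': 0, 'dogs': 1}
labels = onehot_sketch([0, 0, 1], num_classes=len(class_indices))
print(labels)
```

So a cat becomes `[1, 0]` and a dog becomes `[0, 1]`, which is the format `categorical_crossentropy` expects.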

In [33]:
val_features = conv_model.predict_generator(val_batches, steps=int(math.ceil(val_batches.n/val_batches.batch_size)))

In [34]:
trn_features = conv_model.predict_generator(batches, steps=int(math.ceil(batches.n/batches.batch_size)))
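The `steps` argument above is just the number of batches needed to see every image once, rounding up so that the final partial batch isn't dropped. A quick check of the arithmetic for our two sets, assuming Python 3 (where `/` is true division); `steps_for` is a hypothetical helper name:

```python
import math

def steps_for(n_samples, batch_size):
    """Number of batches needed to cover every sample exactly once."""
    return int(math.ceil(n_samples / batch_size))

print(steps_for(23000, 64))  # 360 (359 full batches + 1 partial)
print(steps_for(2000, 64))   # 32  (31 full batches + 1 partial)
```

Under Python 2, `n / batch_size` on integers would round down before `ceil` is applied, so the last partial batch would be missed.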

In [35]:
save_array(model_path + 'train_convlayer_features.bc', trn_features)
save_array(model_path + 'valid_convlayer_features.bc', val_features)

In [36]:
trn_features = load_array(model_path+'train_convlayer_features.bc')
val_features = load_array(model_path+'valid_convlayer_features.bc')

In [37]:
trn_features.shape

(23000, 512, 14, 14)

For our new fully connected model, we'll create it using the exact same architecture as the last layers of VGG 16, so that we can conveniently copy pre-trained weights over from that model. However, we'll set the dropout layers' p values to zero, so as to effectively remove dropout.

In [38]:
# Copy the weights from the pre-trained model.
# NB: Since we're removing dropout, we want to halve the weights
def proc_wgts(layer): return [o/2 for o in layer.get_weights()]
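The halving rule in the comment above rests on an assumption about how dropout scales activations. Here is a minimal NumPy sketch, assuming "classic" dropout that zeroes each activation with p=0.5 and does not rescale the survivors. In that case the next layer sees about half the total input during training, so halving the weights keeps its inputs on the same scale once dropout is gone. All names and data here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5

acts = rng.random((10000, 50))   # non-negative, like ReLU outputs
weights = rng.random((50, 1))    # made-up weights of the next layer

# Classic (non-rescaled) dropout: zero each activation with probability p.
mask = rng.random(acts.shape) >= p
with_dropout = (acts * mask) @ weights

# Dropout removed: untouched weights vs. weights scaled by (1 - p).
no_dropout_raw = acts @ weights
no_dropout_scaled = acts @ (weights * (1 - p))

print("mean input with dropout:          ", with_dropout.mean())
print("mean input without, raw weights:  ", no_dropout_raw.mean())
print("mean input without, halved weights:", no_dropout_scaled.mean())
```

If the dropout implementation already rescales the survivors by 1/(1-p) during training (inverted dropout), the expected inputs match without any change, and halving would shrink them instead. It is worth checking which behavior applies in the library version being used.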

In [39]:
# Such a finely tuned model needs to be updated very slowly!
opt = RMSprop(lr=0.00001, rho=0.7)

In [40]:
def get_fc_model():
    model = Sequential([
        MaxPooling2D(input_shape=conv_model.layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(2, activation='softmax')
    ])
    
    for layer1, layer2 in zip(model.layers, fc_layers): layer1.set_weights(proc_wgts(layer2)) 
    
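    # NB: this compiles with Adam at its default learning rate; the slow
    # RMSprop optimizer `opt` defined above is not used here.
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])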
    
    return model

In [41]:
fc_model = get_fc_model()

And fit the model in the usual way:

In [None]:
fc_model.fit(trn_features, trn_labels, batch_size=batch_size, epochs=8,
             verbose=2, validation_data=(val_features, val_labels))

Train on 23000 samples, validate on 2000 samples
Epoch 1/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 2/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 3/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 4/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 5/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 6/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 7/8
 - 56s - loss: 8.0689 - acc: 0.4994 - val_loss: 7.9462 - val_acc: 0.5070
Epoch 8/8
