This notebook explains how to add batch normalization to VGG.  The code shown here is implemented in [vgg_bn.py](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/vgg16bn.py), and there is a version of ``vgg_ft`` (our fine tuning function) with batch norm called ``vgg_ft_bn`` in [utils.py](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/utils.py).

In [1]:
%matplotlib inline
import utils; 
import importlib
importlib.reload(utils)
from utils import *
from __future__ import print_function, division

Using TensorFlow backend.


# The problem, and the solution

## The problem

The problem that we faced in the lesson 3 is that when we wanted to add batch normalization, we initialized *all* the dense layers of the model to random weights, and then tried to train them with our cats v dogs dataset. But that's a lot of weights to initialize to random - out of 134m params, around 119m are in the dense layers! Take a moment to think about why this is, and convince yourself that dense layers are where most of the weights will be. Also, think about whether this implies that most of the *time* will be spent training these weights. What do you think?

Trying to train 120m params using just 23k images is clearly an unreasonable expectation. The reason we haven't had this problem before is that the dense layers were not random, but were trained to recognize imagenet categories (other than the very last layer, which only has 8194 params).

## The solution

The solution, obviously enough, is to add batch normalization to the VGG model! To do so, we have to be careful - we can't just insert batchnorm layers, since their parameters (*gamma* - which is used to multiply by each activation, and *beta* - which is used to add to each activation) will not be set correctly. Without setting these correctly, the new batchnorm layers will normalize the previous layer's activations, meaning that the next layer will receive totally different activations to what it would have without new batchnorm layer. And that means that all the pre-trained weights are no longer of any use!

So instead, we need to figure out what beta and gamma to choose when we insert the layers. The answer to this turns out to be pretty simple - we need to calculate what the mean and standard deviation of that activations for that layer are when calculated on all of imagenet, and then set beta and gamma to these values. That means that the new batchnorm layer will normalize the data with the mean and standard deviation, and then immediately un-normalize the data using the beta and gamma parameters we provide. So the output of the batchnorm layer will be identical to it's input - which means that all the pre-trained weights will continue to work just as well as before.

The benefit of this is that when we wish to fine-tune our own networks, we will have all the benefits of batch normalization (higher learning rates, more resiliant training, and less need for dropout) plus all the benefits of a pre-trained network.

To calculate the mean and standard deviation of the activations on imagenet, we need to download imagenet. You can download imagenet from http://www.image-net.org/download-images . The file you want is the one titled **Download links to ILSVRC2013 image data**. You'll need to request access from the imagenet admins for this, although it seems to be an automated system - I've always found that access is provided instantly. Once you're logged in and have gone to that page, look for the **CLS-LOC dataset** section. Both training and validation images are available, and you should download both. There's not much reason to download the test images, however.

Note that this will not be the entire imagenet archive, but just the 1000 categories that are used in the annual competition. Since that's what VGG16 was originally trained on, that seems like a good choice - especially since the full dataset is 1.1 terabytes, whereas the 1000 category dataset is 138 gigabytes.

# Adding batchnorm to Imagenet

## Setup

### Sample

As per usual, we create a sample so we can experiment more rapidly.

In [None]:
%pushd data/imagenet
%cd train

In [6]:
%mkdir ../sample
%mkdir ../sample/train
%mkdir ../sample/valid

from shutil import copyfile

g = glob('*')
for d in g: 
    os.mkdir('../sample/train/'+d)
    os.mkdir('../sample/valid/'+d)

In [8]:
g = glob('*/*.JPEG')
shuf = np.random.permutation(g)
for i in range(25000): copyfile(shuf[i], '../sample/train/' + shuf[i])

In [10]:
%cd ../valid

g = glob('*/*.JPEG')
shuf = np.random.permutation(g)
for i in range(5000): copyfile(shuf[i], '../sample/valid/' + shuf[i])

%cd ..

/data/jhoward/imagenet/valid
/data/jhoward/imagenet


In [11]:
%mkdir sample/results

In [None]:
%popd

### Data setup

We set up our paths, data, and labels in the usual way. Note that we don't try to read all of Imagenet into memory! We only load the sample into memory.

In [48]:
sample_path = '/data/yinterian/imagenet/sample/'
# This is the path to my fast SSD - I put datasets there when I can to get the speed benefit
fast_path = '/home/jhoward/ILSVRC2012_img_proc/'
path = '/data/jhoward/imagenet/sample/'
path = '/data/datasets/imagenet/full/'

In [49]:
batch_size=64

In [50]:
import vgg16; importlib.reload(vgg16)
vgg = Vgg16()

In [20]:
# just run this the first time
samp_trn = vgg.get_data(sample_path + 'train')
samp_val = vgg.get_data(sample_path + 'valid')

Found 25000 images belonging to 1000 classes.
Found 5000 images belonging to 1000 classes.


In [6]:
# just run this the first time
save_array(sample_path+'results/trn.dat', samp_trn)
save_array(sample_path+'results/val.dat', samp_val)

In [5]:
samp_trn = load_array(sample_path+'results/trn.dat')
samp_val = load_array(sample_path+'results/val.dat')

In [6]:
(samp_val_classes, samp_trn_classes, samp_val_labels, samp_trn_labels, 
    samp_val_filenames, samp_filenames, samp_test_filenames) = get_classes(sample_path)

Found 25000 images belonging to 1000 classes.
Found 5000 images belonging to 1000 classes.
Found 0 images belonging to 0 classes.


### Model setup

Since we're just working with the dense layers, we should pre-compute the output of the convolutional layers.

In [7]:
vgg = Vgg16()
model = vgg.model

In [8]:
layers = model.layers
last_conv_idx = [index for index,layer in enumerate(layers) 
                     if type(layer) is Conv2D][-1]

In [9]:
last_conv_idx

17

In [25]:
conv_model = Model(inputs=model.input, outputs=model.layers[last_conv_idx].output)

In [12]:
# run the first time
samp_conv_val_feat = conv_model.predict(samp_val, batch_size=batch_size*2)
samp_conv_feat = conv_model.predict(samp_trn, batch_size=batch_size*2)

In [16]:
# run the first time
save_array(sample_path+'results/conv_val_feat.dat', samp_conv_val_feat)
save_array(sample_path+'results/conv_feat.dat', samp_conv_feat)

In [26]:
samp_conv_feat = load_array(sample_path + 'results/conv_feat.dat')
samp_conv_val_feat = load_array(sample_path + 'results/conv_val_feat.dat')

In [13]:
samp_conv_val_feat.shape

(5000, 14, 14, 512)

This is our usual Vgg network just covering the dense layers:

In [14]:
def get_dense_layers(inputs):
    x = MaxPooling2D((2, 2), strides=(2, 2), name='block5_pool')(inputs)
    x = Flatten(name='flatten')(x)
    x = Dense(4096, activation='relu', name='fc1')(x)
    x = Dropout(.5)(x)
    x = Dense(4096, activation='relu', name='fc2')(x)
    x = Dropout(.5)(x)
    x = Dense(1000, activation='softmax', name='predictions')(x)
    model = Model(inputs = inputs, outputs = x, name='vgg16')
    return model

In [15]:
img_input = Input(shape=conv_layers[-1].output_shape[1:])
img_input

<tf.Tensor 'input_3:0' shape=(?, 14, 14, 512) dtype=float32>

In [16]:
dense_model = get_dense_layers(img_input)
dense_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 14, 14, 512)       0         
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
dropout_1 (Dropout)          (None, 4096)              0         
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
dropout_2 (Dropout)          (None, 4096)              0         
__________

In [17]:
for x in ["fc1", "fc2", "predictions"]:
    l1 = model.get_layer(name=x)
    l2 = dense_model.get_layer(name=x)
    l2.set_weights(l1.get_weights())

### Check model

It's a good idea to check that your models are giving reasonable answers, before using them.

In [18]:
dense_model.compile(Adam(), 'categorical_crossentropy', ['accuracy'])

In [19]:
dense_model.evaluate(samp_conv_val_feat, samp_val_labels)



[1.5166243791580201, 0.64380000000000004]

In [20]:
model.compile(Adam(), 'categorical_crossentropy', ['accuracy'])

In [21]:
# should be identical to above
model.evaluate(samp_val, samp_val_labels)



[1.5166243734359741, 0.64380000000000004]

In [25]:
# should be a little better than above, since VGG authors overfit
dense_model.evaluate(samp_conv_feat, samp_trn_labels)



[1.0944163201045991, 0.71687999999999996]

In [26]:
model.evaluate(samp_trn, samp_trn_labels)



[1.0944161347103118, 0.71687999999999996]

## Adding our new layers

### Calculating batchnorm params

Here is how you obtain the output of an intermediate layer

In [22]:
intermediate_layer_model = Model(inputs=dense_model.layers[0].input,
                                 outputs=dense_model.layers[2].output)

Then we can call the function to get our layer activations:

In [23]:
d0_out = intermediate_output = intermediate_layer_model.predict(samp_conv_val_feat)

In [24]:
d0_out.shape

(5000, 25088)

In [25]:
intermediate_layer_model = Model(inputs=dense_model.layers[0].input,
                                 outputs=dense_model.layers[4].output)

In [26]:
d2_out = intermediate_output = intermediate_layer_model.predict(samp_conv_val_feat)

In [27]:
d2_out.shape

(5000, 4096)

Now that we've got our activations, we can calculate the mean and standard deviation for each.

In [28]:
mu0, var0, std0 = d0_out.mean(axis=0), d0_out.var(axis=0), d0_out.std(axis=0) 
mu2, var2, std2 = d2_out.mean(axis=0), d2_out.var(axis=0), d2_out.std(axis=0)

### Creating batchnorm model

Now we're ready to create and insert our layers just after each dense layer.

In [29]:
def insert_bn_layer(model, bn_name, index):
    img_input = Input(batch_shape=model.input_shape)
    x = img_input
    for i, layer in enumerate(model.layers):
        if i == index: x = BatchNormalization(name=bn_name)(x)
        if i > 0: x = layer(x)
    return Model(inputs=img_input, outputs=x)

In [30]:
bn_model = insert_bn_layer(dense_model, "bn2", 5)
bn_model = insert_bn_layer(bn_model, "bn1", 3)
bn_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 14, 14, 512)       0         
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
bn1 (BatchNormalization)     (None, 25088)             100352    
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
dropout_1 (Dropout)          (None, 4096)              0         
_________________________________________________________________
bn2 (BatchNormalization)     (None, 4096)              16384     
__________

After inserting the layers, we can set their weights to the variance and mean we just calculated.

In [31]:
l1 = bn_model.get_layer(name="bn1")
l1.set_weights([std0, mu0, mu0, std0])
l2 = bn_model.get_layer(name="bn2")
l2.set_weights([std2, mu2, mu2, std2])

In [32]:
l1.get_weights()

[array([ 5.9348,  5.6416,  6.8435, ...,  7.1214,  9.0332,  6.2522], dtype=float32),
 array([ 0.9367,  0.6878,  1.0604, ...,  0.9634,  4.188 ,  1.2063], dtype=float32),
 array([ 0.9367,  0.6878,  1.0604, ...,  0.9634,  4.188 ,  1.2063], dtype=float32),
 array([ 5.9348,  5.6416,  6.8435, ...,  7.1214,  9.0332,  6.2522], dtype=float32)]

In [33]:
bn_model.compile(Adam(1e-5), 'categorical_crossentropy', ['accuracy'])

We should find that the new model gives identical results to those provided by the original VGG model.

In [48]:
bn_model.evaluate(samp_conv_val_feat, samp_val_labels)



[4.2745669403076167, 0.628]

In [49]:
bn_model.evaluate(samp_conv_feat, samp_trn_labels)



[3.2328354600811005, 0.68976000000000004]

### Optional - additional fine-tuning

Now that we have a VGG model with batchnorm, we might expect that the optimal weights would be a little different to what they were when originally created without batchnorm. So we fine tune the weights for one epoch.

In [81]:
#feat_bc = bcolz.open(fast_path+'trn_features.dat')

In [82]:
#labels = load_array(fast_path+'trn_labels.dat')

In [83]:
#val_feat_bc = bcolz.open(fast_path+'val_features.dat')

In [29]:
#val_labels = load_array(fast_path+'val_labels.dat')

In [42]:
bn_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         (None, 14, 14, 512)       0         
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
bn1 (BatchNormalization)     (None, 25088)             100352    
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
dropout_1 (Dropout)          (None, 4096)              0         
_________________________________________________________________
bn2 (BatchNormalization)     (None, 4096)              16384     
__________

In [34]:
bn_model.fit(samp_conv_feat, samp_trn_labels, epochs=1, batch_size=batch_size,
             validation_data=(samp_conv_val_feat, samp_val_labels))

Train on 25000 samples, validate on 5000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f7ef74215c0>

The results look quite encouraging! Note that these VGG weights are now specific to how keras handles image scaling - that is, it squashes and stretches images, rather than adding black borders. So this model is best used on images created in that way.

In [35]:
bn_model.save_weights(sample_path+'models/bn_model2-1.h5')

In [36]:
bn_model.load_weights(sample_path+'models/bn_model2-1.h5')

### Create combined model

Our last step is simply to copy our new dense layers on to the end of the convolutional part of the network, and save the new complete set of weights, so we can use them in the future when using VGG. (Of course, we'll also need to update our VGG architecture to add the batchnorm layers).

In [51]:
comb_model = Model(inputs = conv_model.input, outputs=bn_model(conv_model.output))

In [52]:
comb_model.compile(Adam(1e-5), 'categorical_crossentropy', ['accuracy'])

In [53]:
comb_model.evaluate(samp_val, samp_val_labels)



[1.455510517692566, 0.65600000000000003]

In [48]:
comb_model.save_weights(sample_path+'models/inet_224squash_bn_samp.h5')

In [74]:
comb_model.load_weights(sample_path+'models/inet_224squash_bn_samp.h5')

The code shown here is implemented in [vgg_bn.py](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/vgg16bn.py), and there is a version of ``vgg_ft`` (our fine tuning function) with batch norm called ``vgg_ft_bn`` in [utils.py](https://github.com/fastai/courses/blob/master/deeplearning1/nbs/utils.py).

## Finetune on the whole imagenet

In [41]:
path = '/data/datasets/imagenet/full/'
result_path = '/data/yinterian/imagenet/full/'

In [42]:
batch_size=32

In [43]:
vgg = Vgg16()
batches = vgg.get_batches(path+'train', batch_size=batch_size)
val_batches = vgg.get_batches(path+'valid', batch_size=batch_size)

Found 1281167 images belonging to 1000 classes.
Found 50000 images belonging to 1000 classes.


In [45]:
for layer in comb_model.layers[:17]:
    layer.trainable = False

In [46]:
comb_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
__________

In [47]:
comb_model.fit_generator(batches, steps_per_epoch=batches.samples/batches.batch_size, epochs=1,
                         validation_data=val_batches,
                         validation_steps=val_batches.samples/val_batches.batch_size)

Epoch 1/1
 1553/40036 [>.............................] - ETA: 27314s - loss: 1.6753 - acc: 0.5877 

KeyboardInterrupt: 