# Training a better model

This week, we want to improve the model we trained last week by looking at it from an underfitting and overfitting perspective.

In [1]:
from theano.sandbox import cuda

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [2]:
from importlib import reload
import utils; reload(utils)
from utils import *
from __future__ import division, print_function
%matplotlib inline

Using Theano backend.


In [3]:
path = 'data/dogscats/'
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)
    
batch_size=32

## Are we underfitting?

So far, our validation accuracy has generally been higher than our training accuracy.
This leads to 2 questions:
1. How is this possible?
2. Is this desirable?

**Answer 1):**
Because of _dropout_. Dropout refers to a layer that randomly deletes (i.e. sets to 0) each activation in the previous layer with probability _p_ (usually 0.5). This only happens during training, not when calculating the accuracy on the validation set, which is why the validation set can have higher accuracy than the training set.

The purpose of dropout is to avoid overfitting. Why? By deleting parts of the neural network at random during training, it ensures that no one part of the network can overfit to one part of the training set. The creation of dropout was one of the key developments in deep learning, because it allows us to create rich models without overfitting. However, it can also result in underfitting if overused.
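To make the training/validation asymmetry concrete, here is a minimal sketch of dropout on a plain Python list. The `dropout` function and its `training` flag are illustrative names, not the Keras API. Library implementations usually also rescale the surviving activations, which this sketch leaves out.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p, but only while training."""
    if not training:
        # At validation/test time every activation passes through untouched.
        return list(activations)
    return [0.0 if rng.random() < p else a for a in activations]

rng = random.Random(0)            # fixed seed so the run is repeatable
activations = [1.0] * 10000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
valid_out = dropout(activations, p=0.5, training=False, rng=rng)

kept_fraction = sum(1 for a in train_out if a != 0.0) / len(train_out)
print("fraction kept during training:", kept_fraction)
print("validation output unchanged:", valid_out == activations)
```

During training the network only ever sees roughly half of these activations, while validation uses all of them. That is why validation accuracy can come out ahead.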

**Answer 2):**
Not desirable. It is likely that we can get better validation set results with less dropout.

### Removing dropout

We start with our fine-tuned cats vs dogs model (with dropout), remove dropout from the dense layers, and then fine-tune all of those dense layers again.

Action Plan:
* Re-create and load our modified VGG model with a binary dependent variable (two output classes)
* Split the model between the convolutional (_conv_) layers and the dense layers
* Pre-calculate the output of the conv layers, so that we don't have to redundantly re-calculate them on every epoch
* Create a new model with just the dense layers and dropout p set to 0
* Train this new model using the output of the conv layers as training data
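
The reason this plan saves time can be sketched with a toy example. Everything here is a stand-in: `conv_stage` plays the role of the frozen conv layers and `dense_stage` the trainable dense layers, and neither is part of Keras or `utils`. The point is only that the expensive stage runs once per input, however many epochs we train.

```python
conv_calls = {"count": 0}

def conv_stage(x):
    """Stand-in for the frozen conv layers: expensive, same output every time."""
    conv_calls["count"] += 1
    return [x, x * x]

def dense_stage(features, weights):
    """Stand-in for the trainable dense layers: a cheap weighted sum."""
    return sum(f * w for f, w in zip(features, weights))

def mean_squared_error(feature_rows, targets, weights):
    errors = [t - dense_stage(f, weights) for f, t in zip(feature_rows, targets)]
    return sum(e * e for e in errors) / len(errors)

inputs = [0.0, 0.5, 1.0, 1.5]
targets = [2 * x + 3 * x * x for x in inputs]

# Run the expensive stage exactly once per input and keep the results.
cached_features = [conv_stage(x) for x in inputs]

weights = [0.0, 0.0]
initial_loss = mean_squared_error(cached_features, targets, weights)

# Every epoch reuses the cached features; conv_stage is never called again.
for epoch in range(8):
    for features, target in zip(cached_features, targets):
        error = target - dense_stage(features, weights)
        weights = [w + 0.05 * error * f for w, f in zip(weights, features)]

final_loss = mean_squared_error(cached_features, targets, weights)
print("conv_stage calls:", conv_calls["count"])
print("loss before/after:", initial_loss, final_loss)
```

This only works because the conv layers are frozen. If their weights changed during training, the cached features would go stale.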

In [4]:
??vgg_ft

```py
def vgg_ft(out_dim):
    vgg = Vgg16()
    vgg.ft(out_dim)
    model = vgg.model
    return model
```

In [5]:
??Vgg16.ft

```py
def ft(self, num):
    """
    Replace the last layer of the model with a Dense layer of num neurons.
    Will also lock the weights of all layers except the new layer so that we only learn weights for the last layer in subsequent training.
    Args:
        num (int): Number of neurons in the Dense layer
    Returns:
        None
    """
    model = self.model
    model.pop()
    for layer in model.layers: layer.trainable=False
    model.add(Dense(num, activation='softmax'))
    self.compile()
```

In [11]:
model = vgg_ft(2)

...and load our fine-tuned weights from lesson 2.

In [12]:
model.load_weights(model_path + 'finetune3.h5')

Now, let's train for a few iterations without dropout. But first, let's pre-calculate the input to the fully connected layers - i.e. the _Flatten()_ layer.

We do this because convolution layers take a lot of time to compute, while Dense layers do not.

In [13]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lambda_1 (Lambda)                (None, 3, 224, 224)   0           lambda_input_1[0][0]             
____________________________________________________________________________________________________
zeropadding2d_1 (ZeroPadding2D)  (None, 3, 226, 226)   0           lambda_1[0][0]                   
____________________________________________________________________________________________________
convolution2d_1 (Convolution2D)  (None, 64, 224, 224)  1792        zeropadding2d_1[0][0]            
____________________________________________________________________________________________________
zeropadding2d_2 (ZeroPadding2D)  (None, 64, 226, 226)  0           convolution2d_1[0][0]            
___________________________________________________________________________________________

In [14]:
layers = model.layers

In [15]:
# find the last convolution layer
last_conv_idx = [index for index, layer in enumerate(layers) 
                 if type(layer) is Convolution2D][-1]
last_conv_idx

30

In [16]:
layers[last_conv_idx]

<keras.layers.convolutional.Convolution2D at 0x7f652a12a1d0>

In [17]:
conv_layers = layers[:last_conv_idx+1]
conv_model = Sequential(conv_layers)
fc_layers = layers[last_conv_idx+1:]
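
The split can be checked on a toy layer list without loading VGG. The `FakeConv`, `FakePool` and `FakeDense` classes below are hypothetical stand-ins for the Keras layer types, used only to show how the index and the two slices behave.

```python
class FakeConv:   # hypothetical stand-in for Convolution2D
    pass

class FakePool:   # hypothetical stand-in for MaxPooling2D
    pass

class FakeDense:  # hypothetical stand-in for Dense
    pass

toy_layers = [FakeConv(), FakeConv(), FakePool(),
              FakeConv(), FakePool(), FakeDense(), FakeDense()]

# Same idiom as above: collect the index of every conv layer, keep the last one.
toy_last_conv_idx = [index for index, layer in enumerate(toy_layers)
                     if type(layer) is FakeConv][-1]

toy_conv_part = toy_layers[:toy_last_conv_idx + 1]   # up to and including the last conv
toy_dense_part = toy_layers[toy_last_conv_idx + 1:]  # everything after it

print(toy_last_conv_idx)
print([type(layer).__name__ for layer in toy_dense_part])
```

Note that the pooling layer after the last convolution ends up in the second slice. This is why a model built from `fc_layers` has to begin with a pooling layer rather than with `Flatten()`.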

Now we can use exactly the same approach to create features as we used when we created the linear model from the ImageNet predictions in lesson 2.

In [14]:
batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)

val_classes = val_batches.classes
trn_classes = batches.classes
val_labels = onehot(val_classes)
trn_labels = onehot(trn_classes)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
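
For readers who have not looked inside `utils`, here is a rough sketch of what one-hot encoding the class ids means. It assumes `onehot` produces one row per sample with a single 1 in the column of its class. `onehot_sketch` is an illustrative pure-Python version, not the actual helper.

```python
def onehot_sketch(classes, n_classes=None):
    """Turn integer class ids into rows with a single 1.0 per row."""
    if n_classes is None:
        n_classes = max(classes) + 1
    return [[1.0 if c == k else 0.0 for k in range(n_classes)]
            for c in classes]

# Two cats (class 0) followed by one dog (class 1).
example_labels = onehot_sketch([0, 0, 1])
print(example_labels)
```

Each row has one column per class, which is the label format that a 2-unit softmax output trained with `categorical_crossentropy` expects.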


In [None]:
# Let's get the outputs of the conv model and save them
val_features = conv_model.predict_generator(val_batches, val_batches.nb_sample)
trn_features = conv_model.predict_generator(batches, batches.nb_sample)

In [None]:
save_array(model_path + 'train_convlayer_features.bc', trn_features)
save_array(model_path + 'valid_convlayer_features.bc', val_features)

In [15]:
trn_features = load_array(model_path+'train_convlayer_features.bc')
val_features = load_array(model_path+'valid_convlayer_features.bc')

In [16]:
trn_features.shape
# Note that the last conv layer is 512, 14, 14

(23000, 512, 14, 14)

We'll create our new fully connected model using the exact same architecture as the last layers of VGG 16, so that we can conveniently copy pre-trained weights over from that model.

In [5]:
# Halve every weight array of a layer; used below when copying the dropout-trained dense weights
def proc_wgts(layer): return [o/2 for o in layer.get_weights()]

In [6]:
# Such a finely tuned model needs to be updated very slowly!!
opt = RMSprop(lr=0.00001, rho=0.7)

In [7]:
def get_fc_model():
    model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(4096, activation='relu'),
        Dropout(0.),
        Dense(2, activation='softmax')
        ])
    
    for l1, l2 in zip(model.layers, fc_layers): l1.set_weights(proc_wgts(l2))
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model
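
As a sanity check on the sizes involved, the shapes flowing through this model can be worked out by hand. The sketch below assumes the pooling layer uses a 2x2 window with matching stride and that each Dense layer has one bias per output unit. The variable names are illustrative only.

```python
conv_output_shape = (512, 14, 14)   # matches trn_features.shape[1:] above
pool = 2                            # assumption: 2x2 max pooling

channels, height, width = conv_output_shape
pooled_shape = (channels, height // pool, width // pool)
flat_size = pooled_shape[0] * pooled_shape[1] * pooled_shape[2]

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [flat_size, 4096, 4096, 2]
params_per_layer = [dense_params(n_in, n_out)
                    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print("after pooling:", pooled_shape)
print("after flatten:", flat_size)
print("dense parameters:", params_per_layer, "total:", sum(params_per_layer))
```

Almost all of the parameters sit in the first Dense layer, which helps explain why these layers are the ones prone to overfitting once dropout is removed.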

In [18]:
fc_model = get_fc_model()

And fit the model in the usual way:

In [23]:
fc_model.fit(trn_features, trn_labels, nb_epoch=8,
            batch_size=batch_size, validation_data=(val_features, val_labels))

Train on 23000 samples, validate on 2000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f56a22005f8>

In [24]:
fc_model.save_weights(model_path+'no_dropout.h5')

In [19]:
fc_model.load_weights(model_path+'no_dropout.h5')

## Reducing overfitting

Now that we have a model that overfits, let's take a few steps to reduce the overfitting.

### Approaches to reduce overfitting

Before relying on dropout or other regularization approaches to reduce overfitting, try the following techniques first. Regularization, by definition, biases our model towards simplicity, which we only want to do if we know it's necessary.

Action Plan:
1. Add more data (not applicable in a Kaggle competition)
2. Use data augmentation
3. Use architectures that generalize well
4. Add regularization
5. Reduce architecture complexity

We assume that you've already collected as much data as you can, so step (1) isn't relevant.


### Data augmentation

Step 2 - Data augmentation refers to creating additional synthetic data, based on reasonable modifications of your input data. For images, this is likely to involve flipping, rotation, zooming, cropping, panning, and minor color changes.

Which types of augmentation are appropriate depends on your data. For instance, for regular photos, you want to use horizontal flipping, but not vertical flipping. We recommend **always** using at least some light data augmentation, unless you have so much data that your model will never see the same input twice.
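
To show what a "reasonable modification" looks like at the pixel level, here is a small dependency-free sketch on a tiny image stored as nested lists. `horizontal_flip` and `shift_right` are illustrative helpers, not library functions. A real pipeline would apply such transforms randomly to full-size arrays.

```python
def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Pan the image to the right, padding the exposed columns with `fill`."""
    shifted = []
    for row in image:
        n = max(0, min(pixels, len(row)))   # clamp to the row width
        shifted.append([fill] * n + row[:len(row) - n])
    return shifted

image = [[1, 2, 3],
         [4, 5, 6]]

flipped = horizontal_flip(image)
panned = shift_right(image, 1)

# Each variant keeps the label of the original image.
augmented = [image, flipped, panned]
print(flipped)
print(panned)
```

A mirrored photo of a dog is still a plausible photo of a dog, which is why horizontal flipping is a safe choice here. An upside-down one is not something the model would normally encounter.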

Keras comes with very convenient features for automating data augmentation. You simply define which types of augmentation you want and the maximum amount of each.

In [21]:
# dim_ordering='tf' uses tensorflow dimension ordering,
#   which is the same order as matplotlib uses for display.
# Therefore when just using for display purposes, this is more convenient
gen = image.ImageDataGenerator(
    rotation_range=10, width_shift_range=0.1,height_shift_range=0.1,
    shear_range=0.15, zoom_range=0.1, channel_shift_range=10., horizontal_flip=True,
    dim_ordering='tf')

To decide which augmentation methods to use, let's take a look at the generated images and use our intuition.

In [22]:
# Create a 'batch' of a single image
img = np.expand_dims(ndimage.imread('data/dogscats/test/7.jpg'),0)
# Request the generator to create batches from this image
aug_iter = gen.flow(img)

### Batch normalization