In [1]:
%matplotlib inline

from utils import *

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)
Using Theano backend.


In [2]:
wk_dir = os.getcwd()
data_dir = wk_dir + "/../../data/kg/cd-redux/"

In [3]:
model_dir = data_dir + "/models/"
if not os.path.exists(model_dir):
    os.mkdir(model_dir)

In [4]:
import bcolz
def save_array(fname, arr): c=bcolz.carray(arr, rootdir=fname, mode='w'); c.flush()
def load_array(fname): return bcolz.open(fname)[:]

On larger datasets, functions like `get_data` (which we'll meet in a minute) can take a while to run. These two helpers let us save the arrays returned by `get_data` to disk, so we can reload them later instead of recomputing them.
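
The save-then-reload idea can be sketched as a small caching helper. This is only an illustration: `load_or_compute` and `slow_get_data` are hypothetical names, and plain `np.save`/`np.load` stand in for the bcolz functions above.

```python
import os
import tempfile

import numpy as np


def load_or_compute(fname, compute):
    """Return the array cached at fname, or compute it and cache it (hypothetical helper)."""
    if os.path.exists(fname):
        return np.load(fname)
    arr = compute()
    np.save(fname, arr)
    return arr


calls = []


def slow_get_data():
    # Stand-in for an expensive call such as get_data
    calls.append(1)
    return np.arange(12, dtype=np.float32).reshape(3, 4)


cache_path = os.path.join(tempfile.mkdtemp(), "demo_data.npy")
first = load_or_compute(cache_path, slow_get_data)   # computes and saves
second = load_or_compute(cache_path, slow_get_data)  # loads from disk
```

The second call never touches `slow_get_data`, which is the whole point on a dataset that takes minutes to load.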

## Preparing our data

In [5]:
trn_batches = get_batches(data_dir + "/train/", shuffle=False, batch_size=1)
val_batches = get_batches(data_dir + "/valid/", shuffle=False, batch_size=1)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.


In [6]:
%%time
trn_data = get_data(data_dir + "/train/")
val_data = get_data(data_dir + "/valid/")

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.
CPU times: user 4min 16s, sys: 17.1 s, total: 4min 33s
Wall time: 4min 41s


In [7]:
save_array(model_dir + "trn_data.bc", trn_data)
save_array(model_dir + "val_data.bc", val_data)

Okay, what's going on here? We have training and validation *batches*, as well as training and validation *data*. 

Neither `get_batches` nor `get_data` has a docstring, so we can try to look at the source code for each function:

```
def get_data(path, target_size=(224,224)):
    batches = get_batches(
        path, 
        shuffle=False, 
        batch_size=1, 
        class_mode=None, 
        target_size=target_size
    )
    return np.concatenate(
        [batches.next() for i in range(batches.nb_sample)]
    )
    
def get_batches(
        dirname, 
        gen=image.ImageDataGenerator(), 
        shuffle=True, 
        batch_size=4, 
        class_mode='categorical',
        target_size=(224,224)
    ):
    return gen.flow_from_directory(
        dirname, 
        target_size=target_size,
        class_mode=class_mode, 
        shuffle=shuffle, 
        batch_size=batch_size
    )
```

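It looks like the key difference is that `get_batches` returns whatever `image.ImageDataGenerator().flow_from_directory` produces, which I believe is an iterator that hands out images one batch at a time, whereas `get_data` calls `get_batches`, drains that iterator, and concatenates the results into a single numerical array.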

We can test this by trying to look at each data type:

In [8]:
type(trn_batches)

keras.preprocessing.image.DirectoryIterator

In [9]:
type(trn_data)

numpy.ndarray

`trn_batches` is a DirectoryIterator object (whatever that means), and `trn_data` is a NumPy array.

Trying to access the first item in `trn_data` gives us:

In [10]:
trn_data[0]

array([[[ 144.,  152.,  156., ...,   38.,   40.,   38.],
        [ 195.,  186.,  161., ...,   38.,   39.,   39.],
        [ 156.,  159.,  146., ...,   39.,   39.,   40.],
        ..., 
        [ 133.,  140.,  105., ...,   97.,  109.,  107.],
        [ 128.,  130.,   93., ...,  104.,  111.,  117.],
        [ 109.,  116.,  117., ...,  111.,  116.,  124.]],

       [[ 152.,  160.,  167., ...,   37.,   39.,   38.],
        [ 203.,  194.,  172., ...,   37.,   38.,   39.],
        [ 164.,  170.,  159., ...,   38.,   38.,   40.],
        ..., 
        [ 129.,  136.,  101., ...,   43.,   55.,   52.],
        [ 124.,  126.,   89., ...,   49.,   56.,   63.],
        [ 105.,  112.,  113., ...,   56.,   61.,   71.]],

       [[ 171.,  181.,  187., ...,   43.,   45.,   38.],
        [ 224.,  215.,  192., ...,   43.,   44.,   39.],
        [ 185.,  190.,  178., ...,   44.,   44.,   40.],
        ..., 
        [ 154.,  161.,  126., ...,    0.,   11.,   11.],
        [ 149.,  151.,  114., ...,    8., 

If we tried to access `trn_batches` in the same way, we would get an error telling us the DirectoryIterator object doesn't support indexing.
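
To see why, here is a toy sketch using a plain Python generator as a stand-in for the Keras iterator. `batch_iterator` is a hypothetical function, not part of Keras, but it fails in the same general way when indexed.

```python
def batch_iterator(items, batch_size):
    """Yield successive batches; a toy stand-in for a Keras batch iterator."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


data = [10, 20, 30, 40, 50]
batches = batch_iterator(data, batch_size=2)

first_batch = next(batches)  # iterators hand out one batch at a time

try:
    batches[0]  # indexing is not part of the iterator protocol
    indexing_failed = False
except TypeError:
    indexing_failed = True
```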

In [11]:
trn_data[0].shape

(3, 224, 224)

`trn_data[0]` is an image with 3 color channels (red, green, blue), and a resolution of 224 by 224 pixels.

In [12]:
trn_data.shape

(23000, 3, 224, 224)

`trn_data` (and `val_data`!) is just a bunch of these.
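
A small sketch of this layout, using an array of zeros instead of real images. The final transpose is an extra step that is not in the original notebook: it moves the channel axis last, which is the layout plotting libraries such as matplotlib expect.

```python
import numpy as np

# Four fake "images" in the same (channels, height, width) layout as trn_data
fake_data = np.zeros((4, 3, 224, 224), dtype=np.float32)

one_image = fake_data[0]    # a single image: (3, 224, 224)
red_channel = one_image[0]  # a single color channel: (224, 224)

# Reorder the axes to (height, width, channels) for display
channels_last = np.transpose(one_image, (1, 2, 0))
```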

In [13]:
def onehot(x):
    return np.array(
        OneHotEncoder().fit_transform(
            x.reshape(-1, 1)
        ).todense()
    )

In [14]:
trn_classes = trn_batches.classes
trn_labels = onehot(trn_classes)

What's this? 

Well first we're getting our **classes** from `trn_batches`. Classes are assigned to each image depending on the folder they're in. Taking a look inside the training folder in our data directory, we can see that the cats folder appears before the dogs folder, so the images in our cats folder are automatically assigned a class of 0, and the images in our dogs folder get a class of 1.
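
A rough sketch of that assignment, assuming the iterator sorts the subfolder names alphabetically (which matches what we observe here). The `folders` list is a made-up directory listing.

```python
folders = ["dogs", "cats"]  # hypothetical order in which the folders are listed

# Sorting by name gives each folder a stable integer class
class_indices = {name: i for i, name in enumerate(sorted(folders))}
```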

If we look at the first few classes, we can see that they do have values of 0:

In [23]:
trn_classes[:5]

array([0, 0, 0, 0, 0], dtype=int32)

And the last few classes have values of 1:

In [24]:
trn_classes[-5:]

array([1, 1, 1, 1, 1], dtype=int32)

Next, we have to turn our classes into **labels**.

Most data science algorithms work best when categorical data is in a **one-hot encoded format**. We're not going to get into why, or where the name comes from (a quick Google search didn't turn anything up) but one-hot encoding works like this. Say we had three image categories in our dataset:

|image_id|image_category|
|--------|--------------|
|1       |Cat           |
|2       |Dog           |
|3       |Dog           |
|4       |Bird          |

One-hot encoding would turn each of those categories into its own column, and each row would have either a 1 or a 0 in that column depending on its original category value:

|image_id|image_category|category_cat|category_dog|category_bird|
|--------|--------------|------------|------------|-------------|
|1       |Cat           |1           |0           |0            |
|2       |Dog           |0           |1           |0            |
|3       |Dog           |0           |1           |0            |
|4       |Bird          |0           |0           |1            |
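
The table above can be reproduced with a few lines of NumPy. This is a minimal sketch rather than what `OneHotEncoder` does internally, and the column order is chosen by hand to match the table.

```python
import numpy as np

categories = ["Cat", "Dog", "Dog", "Bird"]
columns = ["Cat", "Dog", "Bird"]  # column order chosen to match the table

# Map each category to its column index, then pick that row of the identity matrix
indices = np.array([columns.index(c) for c in categories])
one_hot = np.eye(len(columns))[indices]
```

Each row of `one_hot` contains a single 1, in the column belonging to that row's category.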

In our case, we only have two categories (cat and dog) so our one-hot encoded values look like this:

In [25]:
trn_labels[:3]

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.]])

A 1 in the first column means cat...

In [26]:
trn_labels[-3:]

array([[ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])

... And a 1 in the second column means dog.

And we can do the same for our validation data:

In [27]:
val_classes = val_batches.classes 
val_labels = onehot(val_classes)

For these last few steps, we've been **transforming our batches into labels**. Now we can actually get the VGG default model...

## Training a linear model

In [29]:
vgg = Vgg16()
model = vgg.model

batch_size=128

Problem occurred during compilation with the command line below:
/usr/bin/g++ -shared -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -I/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/core/include -I/home/ubuntu/anaconda2/include/python2.7 -I/home/ubuntu/anaconda2/lib/python2.7/site-packages/theano/gof -fvisibility=hidden -o /home/ubuntu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.12-64/tmpJCnbzl/cdb0b986639740d2acf156f042fe37d2.so /home/ubuntu/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.12-64/tmpJCnbzl/mod.cpp -L/home/ubuntu/anaconda2/lib -lpython2.7
ERROR (theano.gof.cmodule): [Errno 12] Cannot allocate memory


OSError: [Errno 12] Cannot allocate memory

... And have it make predictions on our training and validation data. This is the data we looked at before where each image took the shape (3, 224, 224).

In [None]:
%%time
trn_features = model.predict(trn_data, batch_size=batch_size)
val_features = model.predict(val_data, batch_size=batch_size)

Again, `predict` can take some time to run, so we're going to save the results:

In [None]:
save_array(model_dir + 'trn_features.bc', trn_features)
save_array(model_dir + 'val_features.bc', val_features)

In [None]:
trn_features[0][:5]

What's up with these predictions? Well, VGG was trained on the ImageNet dataset, which has 1,000 image categories. So for each image in our dataset, it returns the probability of that image belonging to each of the 1,000 categories:

In [None]:
trn_features[0].shape

We're not going to look at all 1,000 values here, but if we did we would expect to see values close to 0 for most of the categories, with some higher values for the categories representing different species of cat in the ImageNet dataset.
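
If we did want to peek at just the largest values, something like the sketch below would do it. The probabilities here are random fake data, not real VGG output, so the winning indices mean nothing.

```python
import numpy as np

rng = np.random.RandomState(0)

# Build a fake 1,000-way probability vector (non-negative, sums to 1)
scores = rng.randn(1000)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Indices of the five most likely categories, highest first
top5 = np.argsort(probs)[::-1][:5]
```

With real features, `top5` would hold the positions of the five ImageNet categories VGG considers most likely for that image.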

In [None]:
lm = Sequential(
    [Dense(2, input_shape=(1000,), activation="softmax")]
)
lm.compile(
    optimizer=RMSprop(lr=0.1),
    loss="categorical_crossentropy"
)

We first encountered this code when we built our linear model on Day Twelve. 

`Sequential` is a linear stack of layers in Keras, and `Dense` is a single layer in the stack.

The parameters we passed to `Dense` tell it to accept an input with 1,000 columns (the probabilities for each ImageNet category), and produce an output with 2 columns ([1, 0] for cat or [0, 1] for dog).
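
Under the hood, a Dense layer with a softmax activation is a matrix multiplication, a bias, and a normalization step. Here is a NumPy sketch of that computation with random weights and fake inputs. It mirrors the shapes above but is not the Keras implementation.

```python
import numpy as np


def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.RandomState(0)
features = rng.rand(5, 1000)   # 5 fake rows of 1,000 inputs
W = rng.randn(1000, 2) * 0.01  # one weight per (input, output) pair
b = np.zeros(2)                # one bias per output

outputs = softmax(features.dot(W) + b)
```

`outputs` has one row per image and two columns, and each row sums to 1, just like the cat and dog probabilities our model produces.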

In [None]:
%%time
lm.fit(
    trn_features,
    trn_labels,
    nb_epoch=3,
    batch_size=batch_size,
    validation_data=(
        val_features,
        val_labels
    )
)

All we're doing here is fitting our newly created linear model to the features (probabilities for each ImageNet category) and labels (one-hot encoded [1, 0] for cat or [0, 1] for dog) we created for our training data, and validating it against the features and labels we created for our validation data.

To get a summary of our model we can do:

In [None]:
lm.summary()

Which tells us we have a single Dense layer that produces an output shaped (None, 2), where None is a placeholder for however many images we pass in... Which is exactly what we want.
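
The summary also reports a parameter count for each layer. For a Dense layer that is one weight per input-output pair plus one bias per output, which we can check by hand (`dense_param_count` is a hypothetical helper, not a Keras function):

```python
def dense_param_count(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


lm_params = dense_param_count(1000, 2)
```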

Now that we know the model does what we want, let's make a prediction against our validation data:

In [None]:
preds = lm.predict_classes(val_features, batch_size=batch_size)

And check the accuracy of our predictions against our known validation classes:

In [None]:
cm = confusion_matrix(val_classes, preds)
plot_confusion_matrix(cm, val_batches.class_indices)

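A confusion matrix is a table of counts with one row per actual class and one column per predicted class. The diagonal holds the images we classified correctly, and everything off the diagonal is a mistake, for example cats we predicted to be dogs.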
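
A confusion matrix can also be tallied by hand, which makes it clearer what the plot is showing. This is a minimal sketch on six made-up labels, with rows for the true class and columns for the predicted class. `simple_confusion_matrix` is a hypothetical helper, not the scikit-learn function.

```python
import numpy as np


def simple_confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each true class (row) was predicted as each class (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


y_true = [0, 0, 0, 1, 1, 1]  # 0 = cat, 1 = dog
y_pred = [0, 0, 1, 1, 1, 0]  # one cat mistaken for a dog, one dog for a cat

cm_demo = simple_confusion_matrix(y_true, y_pred, n_classes=2)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm_demo) / float(cm_demo.sum())
```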

## Updating VGG

Now we have a model that does a pretty good job taking VGG's outputs and making a prediction on the cats vs dogs dataset. But we still have to run VGG, take its outputs, and feed them into our new model.

What we're going to do now is modify VGG so it can do all that in a single step.

First, let's look at the layers in the VGG model:

In [None]:
vgg.model.summary()

Wow, VGG has a **ton** of layers! This can look kind of overwhelming, but towards the bottom we should actually see a couple of things that we recognize. 

The very last layer, for example, is a Dense layer with an output shape of (None, 1000). In our linear model, our Dense layer had an output shape of (None, 2). We can now see the VGG layer responsible for producing the 1,000 probabilities we've been using as our inputs.

The other layers - Dropout, MaxPooling, Convolution, etc - are just different types of layers. We'll get to them another time.

If you look at the very top, there's a Lambda layer with an output shape of (None, 3, 224, 224). We've seen this shape before! This is the layer responsible for taking our images (3 color channels, 224 pixels by 224 pixels) and getting this whole process started!

First, we're going to remove the last layer - the one that classifies our images into their final ImageNet categories. 

In [None]:
model.pop()

Why?

Well, as we've previously observed, ImageNet doesn't have a category for cat or dog. ImageNet *does* have 18 species of cat and 189 species of dog (which you can explore [here](http://image-net.org/explore)), which is an unnecessary level of granularity for our purposes. 

VGG makes its classifications by identifying progressively higher-level details in each image, for example:

1. The first layer might detect edges in the image
2. Another layer might detect corners, or parallel lines
3. Another layer might detect circles...
4. ... Or circles within circles, representing an eye or a wheel
5. A later layer might detect the texture of fur...
6. ... Or the relative position of elements that make up a face

If we look at the second-to-last VGG layer, each output has 4,096 values. These do *not* necessarily correspond to image features that would be recognizable to us, so we can't assume, for example, that since the last layer predicts categories like "Dalmatian" or "Welsh Corgi", the second-to-last layer predicts categories like "dog".

What we *are* assuming is that by this point, VGG has learned to identify features like eyes or noses or fur that are useful to us, and we don't want to have to throw all that knowledge away.

So we're going to back up one layer and say, **"Ok VGG, instead of using all that knowledge you have about eyes and noses and fur to predict whether an image contains a Dalmatian or a Welsh Corgi or 998 other things, use that knowledge to predict whether an image contains a dog or a cat."**

In [None]:
for layer in model.layers:
    layer.trainable = False

Setting our model layers' `trainable` property to `False` just means, "Don't change what you already know about eyes and noses and fur and things."
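
Here is a toy illustration of what freezing means during training. The dictionaries below are a hypothetical stand-in for Keras layers, and the loop is a bare-bones gradient step: only layers flagged as trainable have their weights changed.

```python
import numpy as np

# Toy "layers": each has weights and a trainable flag (hypothetical structure)
layers = [
    {"name": "pretrained", "weights": np.ones(3), "trainable": False},
    {"name": "new_dense", "weights": np.ones(3), "trainable": True},
]
gradients = [np.full(3, 0.5), np.full(3, 0.5)]  # made-up gradients
learning_rate = 0.1

for layer, grad in zip(layers, gradients):
    if layer["trainable"]:
        # Only trainable layers take a gradient step
        layer["weights"] = layer["weights"] - learning_rate * grad
```

The frozen layer keeps its original weights, while the trainable one moves from 1.0 to 0.95.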

In [None]:
model.add(Dense(2, activation="softmax"))
model.compile(
    optimizer=RMSprop(lr=0.1), 
    loss="categorical_crossentropy", 
    metrics=["accuracy"]
)

And now we're adding a layer and compiling the model exactly like we did before.

Notice the lack of the `input_shape` parameter this time - because we're adding onto an existing model instead of creating a new one, our new layer just takes the output of the previous layer as its input.

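There's also a new `metrics` parameter. The docstring says this is typical: passing `["accuracy"]` asks Keras to report accuracy alongside the loss while training, so we're not going to worry about it beyond that for now.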

Last time, our next step was to call `fit`, but this time we have to make some changes first:

In [None]:
gen = image.ImageDataGenerator()
trn_batches = gen.flow(trn_data, trn_labels, batch_size=batch_size, shuffle=True)
val_batches = gen.flow(val_data, val_labels, batch_size=batch_size, shuffle=True)

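Remember this guy? `image.ImageDataGenerator`? Earlier, `get_batches` used it to give us a batch iterator (a DirectoryIterator) that can't be indexed. We later used those batches to create our training and validation classes, which we converted to labels with one-hot encoding. This time we're calling `gen.flow` on arrays we already have in memory, so the batches are built from `trn_data` and `trn_labels` rather than read from a directory.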

I find it helpful here to recall what our `lm.fit` looked like:

```
lm.fit(
    trn_features,
    trn_labels,
    nb_epoch=3,
    batch_size=batch_size,
    validation_data=(
        val_features,
        val_labels
    )
)
```

The inputs to our `lm.fit` function were features (1,000 ImageNet category probabilities) and labels (one-hot encoded [0, 1] or [1, 0]). 

The inputs to the `fit_model` function we're about to define will be batches. If you check the definition for `trn_batches` above, you'll see that it contains `trn_data` (the original (3, 224, 224) NumPy arrays) and `trn_labels` - all the information needed to train our model.
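
To make "batches" more concrete, here is a toy generator that pairs inputs with labels and yields them a slice at a time. It is a simplified stand-in for `gen.flow` (no shuffling, no augmentation), and the `flow` function below is hypothetical.

```python
import numpy as np


def flow(x, y, batch_size):
    """Yield (inputs, labels) batches forever; a toy stand-in for gen.flow."""
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            yield x[start:start + batch_size], y[start:start + batch_size]


x = np.arange(10).reshape(5, 2)  # 5 fake samples with 2 values each
y = np.eye(2)[[0, 0, 1, 1, 0]]   # matching one-hot labels

batches = flow(x, y, batch_size=2)
xb, yb = next(batches)
```

Because the generator loops forever, whoever consumes it has to be told how many samples make up one pass over the data.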

In [None]:
def fit_model(model, trn_batches, val_batches, nb_epoch=1):
    model.fit_generator(
        trn_batches,
        samples_per_epoch=trn_batches.N,
        
        nb_epoch=nb_epoch,
        
        validation_data=val_batches,
        nb_val_samples=val_batches.N
    )

def fit_epochs(model, trn_batches, val_batches, nb_epoch, run):
    for i in range(nb_epoch):
        fit_model(model, trn_batches, val_batches, nb_epoch=1)
        model.save_weights(model_dir + "finetune{}{}.h5".format(run, str(i)))

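I'll admit I'm not 100% on why we need a new `fit_model` function, but I think it's because our data now arrives as a batch iterator instead of as arrays, and `fit_generator` is the Keras method that trains from an iterator, pulling one batch at a time. That is also why we have to tell it how many samples make up an epoch.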

We're also creating a `fit_epochs` function so we can save our weights after each epoch.
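
The weight files are named by gluing the run name and the epoch number onto a fixed prefix. The format string below is copied from `fit_epochs`, evaluated for a three-epoch run called "lastlayer":

```python
run = "lastlayer"

# Same format string as in fit_epochs, evaluated for epochs 0, 1 and 2
filenames = ["finetune{}{}.h5".format(run, str(i)) for i in range(3)]
```

This produces `finetunelastlayer0.h5`, `finetunelastlayer1.h5` and `finetunelastlayer2.h5`, so each epoch's weights land in their own file.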

In [None]:
%%time
fit_epochs(model, trn_batches, val_batches, nb_epoch=3, run="lastlayer")

Now we can check the accuracy of our predictions - only this time using `val_data` instead of `val_features`.

And `val_batches.class_indices` doesn't work this time, so we have to manually tell our confusion matrix that cat is 0 and dog is 1. I think this is because we used `gen.flow` instead of `get_batches` to define `val_batches`: the class names came from the folder names, and an iterator built from in-memory arrays never sees those folders.

In [None]:
preds = model.predict_classes(val_data, batch_size=batch_size)
cm = confusion_matrix(val_classes, preds)
plot_confusion_matrix(cm, {"cat": 0, "dog": 1})

## Finetuning ALL the layers