# DOGS VS CATS REDUX: KERNELS EDITION

My model for submission to the Dogs vs Cats Redux Kaggle competition:
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition

The final model is a finetuned version of a pre-trained VGG16 model which made it into the top 40% of the public leaderboard.

## Imports, helper functions, and data directory definition

In [413]:
from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, BatchNormalization, Dropout, MaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint
from keras.optimizers import RMSprop

import numpy as np
from sklearn.metrics import log_loss
from matplotlib import pyplot as plt
from shutil import copyfile
%matplotlib inline

In [2]:
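def plot_img(img):
    # Theano ordering is channels-first: move the channel axis to the end for imshow.
    # TensorFlow ordering is already channels-last, so no axis change is needed.
    if K.image_dim_ordering() == 'th':
        img = np.rollaxis(img,0,3)
    plt.imshow(img.astype(np.uint8))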

In [211]:
def create_gen(directory, batch_size=4, shuffle=True, 
               gen=ImageDataGenerator(), target_size=(224,224), 
               class_mode = None):
    return gen.flow_from_directory(directory,
                                   batch_size = batch_size,
                                   shuffle = shuffle,
                                   target_size = target_size, 
                                   class_mode = class_mode)

In [412]:
import os,sys
%cd ~/nbs
current_dir = os.getcwd()
HOME_DIR = current_dir
DATA_DIR = HOME_DIR+'/data/dogscats/'

test_dir = DATA_DIR+'test/'
saved_model_dir = DATA_DIR+'saved_models/'

# sample of training data
sample_train_dir = DATA_DIR+'sample/train/'
sample_valid_dir = DATA_DIR+'sample/valid/'

# full training data
train_dir = DATA_DIR+'train/'
valid_dir = DATA_DIR+'valid/'

img_shape = (224,224,3)

/home/ubuntu/nbs


## SPLIT DATA

Get data from here:
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/data

I start by splitting the data into testing, training, and validation. I also copy a small random sample from the training set for initial experimentation. These examples are placed in the sample folder.

In [469]:
# Create separate folders for train, valid, and sample/train, sample/valid
%mkdir -p $DATA_DIR/test/test
%mkdir -p $DATA_DIR/train/dogs
%mkdir -p $DATA_DIR/train/cats
%mkdir -p $DATA_DIR/valid/dogs
%mkdir -p $DATA_DIR/valid/cats
%mkdir -p $DATA_DIR/sample/train/dogs
%mkdir -p $DATA_DIR/sample/train/cats
%mkdir -p $DATA_DIR/sample/valid/dogs
%mkdir -p $DATA_DIR/sample/valid/cats
%mkdir -p $saved_model_dir

In [470]:
!tree -d $DATA_DIR

/home/ubuntu/nbs/data/dogscats/
├── sample
│   ├── train
│   │   ├── cats
│   │   └── dogs
│   └── valid
│       ├── cats
│       └── dogs
├── saved_models
├── test
│   └── test
├── train
│   ├── cats
│   └── dogs
└── valid
    ├── cats
    └── dogs

16 directories


In [21]:
# Move training examples to cats/dogs subdirectories
%cd $train_dir
from glob import glob
c = glob('cat.*')
for fname in c: os.rename(fname, DATA_DIR+'/train/cats/'+fname)
d = glob('dog.*')
for fname in d: os.rename(fname, DATA_DIR+'/train/dogs/'+fname)

In [26]:
# Copy some cat and dog examples into the sample folder
%cd $train_dir/cats
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(80): copyfile(shuf[i], DATA_DIR+'/sample/train/cats/'+shuf[i])
for i in range(80,100): copyfile(shuf[i], DATA_DIR+'/sample/valid/cats/'+shuf[i])

In [28]:
%cd $DATA_DIR/train/dogs
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(80): copyfile(shuf[i], DATA_DIR+'/sample/train/dogs/'+shuf[i])
for i in range(80,100): copyfile(shuf[i], DATA_DIR+'/sample/valid/dogs/'+shuf[i])

In [30]:
# Move random subset of training examples to valid directory
%cd $train_dir/cats
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(2500): os.rename(shuf[i],DATA_DIR+'/valid/cats/'+shuf[i])

/home/ubuntu/nbs/data/dogscats/train/cats


In [31]:
%cd $train_dir/dogs
g = glob('*.jpg')
shuf = np.random.permutation(g)
for i in range(2500): os.rename(shuf[i],DATA_DIR+'/valid/dogs/'+shuf[i])

/home/ubuntu/nbs/data/dogscats/train/dogs
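
The split above uses an unseeded `np.random.permutation`, so rerunning the cells produces a different validation set each time. Below is a minimal sketch of a seeded variant using only the standard library. The helper name `move_random_subset` and the `seed` parameter are my own assumptions for illustration, and the demonstration runs on empty placeholder files rather than the real images.

```python
import os
import random
import shutil
import tempfile

def move_random_subset(src_dir, dst_dir, n, seed=0):
    """Move n randomly chosen .jpg files from src_dir to dst_dir, reproducibly."""
    # Sort first so the result depends only on the seed, not on listing order.
    names = sorted(f for f in os.listdir(src_dir) if f.endswith('.jpg'))
    chosen = random.Random(seed).sample(names, n)
    for name in chosen:
        shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return chosen

# Demonstration on empty placeholder files in a temporary directory.
root = tempfile.mkdtemp()
demo_train = os.path.join(root, 'train')
demo_valid = os.path.join(root, 'valid')
os.makedirs(demo_train)
os.makedirs(demo_valid)
for i in range(10):
    open(os.path.join(demo_train, 'cat.%d.jpg' % i), 'w').close()

moved = move_random_subset(demo_train, demo_valid, n=3, seed=42)
n_train_left = len(os.listdir(demo_train))
n_valid = len(os.listdir(demo_valid))
print(n_train_left, n_valid)  # 7 3
shutil.rmtree(root)
```

With a fixed seed, the same files land in the validation folder on every run, which makes validation scores comparable between experiments.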


In [427]:
# Create data generators
batch_size = 64
sample_train_gen = create_gen(sample_train_dir, batch_size=batch_size, shuffle=True, class_mode='binary')
sample_valid_gen = create_gen(sample_valid_dir, batch_size=batch_size, shuffle=True, class_mode='binary')
train_gen = create_gen(train_dir, batch_size=batch_size, shuffle=True, class_mode='binary')
valid_gen = create_gen(valid_dir, batch_size=batch_size, shuffle=True, class_mode='binary')
test_gen = create_gen(test_dir, batch_size=batch_size, shuffle=True, class_mode='binary')

Found 160 images belonging to 2 classes.
Found 40 images belonging to 2 classes.
Found 20000 images belonging to 2 classes.
Found 5000 images belonging to 2 classes.
Found 12500 images belonging to 1 classes.


## BASELINE MODELS

The first models serve as baselines against which to compare later models. I try a linear model, a simple neural network, and a simple CNN.

### Linear Model

This one starts to fit the data but doesn't generalize very well. We can see this from the quick increase in training accuracy while the validation accuracy doesn't look like it's going anywhere.
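
A rough parameter count helps explain the overfitting. The sketch below counts only the weights of the final `Dense(1)` layer on the flattened image and ignores the BatchNormalization parameters, so treat it as an approximation rather than the exact figure Keras would report.

```python
# Count the weights of a single Dense(1) layer applied to a flattened image.
height, width, channels = 224, 224, 3    # img_shape from the notebook
n_inputs = height * width * channels      # length of the flattened input
n_params = n_inputs + 1                   # one weight per input plus one bias
n_sample_train = 160                      # images in the sample training set

print(n_inputs)                    # 150528
print(n_params)                    # 150529
print(n_params // n_sample_train)  # 940 weights per training image
```

With roughly 940 weights per training image, even a linear model can memorize the sample set without learning anything that transfers to the validation images.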

In [431]:
model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=img_shape))
model.add(Flatten())
model.add(Dense(1,activation = 'sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [439]:
model.fit_generator(sample_train_gen, epochs=4, validation_data=sample_valid_gen, verbose=2)

Epoch 1/4
 - 1s - loss: 0.4671 - acc: 0.7711 - val_loss: 0.6582 - val_acc: 0.5750
Epoch 2/4
 - 1s - loss: 0.4156 - acc: 0.8602 - val_loss: 0.6431 - val_acc: 0.5500
Epoch 3/4
 - 1s - loss: 0.3980 - acc: 0.8734 - val_loss: 0.6370 - val_acc: 0.5500
Epoch 4/4
 - 1s - loss: 0.3557 - acc: 0.9062 - val_loss: 0.6436 - val_acc: 0.6000


<keras.callbacks.History at 0x7fb800557c10>

### Simple neural network

Same issue as the model above: it starts to overfit the training data, but it's not generalizing.

In [436]:
model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=(224,224,3)))
model.add(Flatten())
model.add(Dense(100))
model.add(BatchNormalization())
model.add(Dense(1,activation = 'sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

In [438]:
model.fit_generator(sample_train_gen, epochs=4, validation_data=sample_valid_gen)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fb89acf4490>

### Simple CNN

The shallow CNN has the same issue as the last two. I tried making it deeper, but that doesn't look like enough for it to learn anything useful. Next I'll try a pre-trained model.

In [449]:
cnn1 = Sequential()
cnn1.add(BatchNormalization(axis=1, input_shape = img_shape))
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
cnn1.add(Flatten())
cnn1.add(Dense(100))
cnn1.add(BatchNormalization())
cnn1.add(Dense(1,activation = 'sigmoid'))
cnn1.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [450]:
cnn1.fit_generator(sample_train_gen, epochs=4, validation_data=sample_valid_gen)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fb8a27c8f50>

In [451]:
cnn1 = Sequential()
cnn1.add(BatchNormalization(axis=1, input_shape = img_shape))
# Block 1
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Block 2
cnn1.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Block 3
cnn1.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Top
cnn1.add(Flatten())
cnn1.add(Dense(256))
cnn1.add(BatchNormalization())
cnn1.add(Dense(1,activation = 'sigmoid'))
cnn1.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [452]:
cnn1.fit_generator(sample_train_gen, epochs=4, validation_data=sample_valid_gen)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fb894312e50>

## FINETUNED VGG16 MODELS

### Create VGG16 pretrained model

Create the VGG16 model and load it with weights pre-trained on ImageNet.

In [455]:
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
model = VGG16(weights='imagenet', include_top=False)

In [456]:
batch_size=64
nb_epochs=3
nb_sample_train = 160
nb_sample_valid = 40
nb_train = 20000
nb_valid = 5000

bn_feat_sample_train = 'bn_feat_sample_train.npy'
bn_feat_sample_valid = 'bn_feat_sample_valid.npy'
bn_feat_train = 'bn_feat_train.npy'
bn_feat_valid = 'bn_feat_valid.npy'
bn_feat_test = 'bn_feat_test.npy'

### Compute VGG16 bottleneck features

Here's where I compute the bottleneck features: the outputs of the last convolutional block of VGG16, which would normally feed into the dense layers on top (which we didn't include). The VGG16 convolutional layers are very good at pulling useful features out of images. With the top included, the dense layers would take these features and output an ImageNet class. Instead, we keep the features that VGG16 pulls out of our dataset and use them to train a new network that classifies each image as either a cat or a dog.
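
To make the size of these features concrete, here is a small sketch of the shape arithmetic. It assumes the standard VGG16 layout (five blocks, each ending in a 2x2 max-pool with stride 2, 'same'-padded convolutions, and 512 filters in the last block). The function name `vgg16_bottleneck_shape` is mine, not part of Keras.

```python
def vgg16_bottleneck_shape(height, width):
    """Output shape of VGG16's convolutional base for a channels-last input.

    Assumes five 2x2/stride-2 max-pools and 512 filters in the final block.
    """
    n_pools = 5
    for _ in range(n_pools):
        height //= 2   # each pool halves the spatial size (rounding down)
        width //= 2
    return (height, width, 512)

feat_shape = vgg16_bottleneck_shape(224, 224)
n_values = feat_shape[0] * feat_shape[1] * feat_shape[2]
print(feat_shape)  # (7, 7, 512)
print(n_values)    # 25088
```

So each 224x224 image is reduced to 25,088 numbers, and those are what the small models further down are trained on.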

In [None]:
# Create generators

datagen = ImageDataGenerator(rescale=1., featurewise_center=True) #(rescale=1./255)
datagen.mean=np.array([103.939, 116.779, 123.68],dtype=np.float32).reshape(1,1,3) #3,1,1

sample_train_gen = create_gen(sample_train_dir, batch_size=batch_size, shuffle=False, gen=datagen)
sample_valid_gen = create_gen(sample_valid_dir, batch_size=batch_size, shuffle=False, gen=datagen)
train_gen = create_gen(train_dir, batch_size=batch_size, shuffle=False, gen=datagen)
valid_gen = create_gen(valid_dir, batch_size=batch_size, shuffle=False, gen=datagen)
test_gen = create_gen(test_dir, batch_size=batch_size, shuffle=False, gen=datagen)

In [75]:
# create sample train data bottleneck features
bottleneck_features_train = model.predict_generator(sample_train_gen)
# pass the path directly so NumPy opens the file in binary mode
np.save(saved_model_dir + bn_feat_sample_train, bottleneck_features_train)

In [77]:
# create sample validation data bottleneck features
bottleneck_features_valid = model.predict_generator(sample_valid_gen)
np.save(saved_model_dir + bn_feat_sample_valid, bottleneck_features_valid)

In [None]:
# create train data bottleneck features
bottleneck_features_train = model.predict_generator(train_gen)
np.save(saved_model_dir + bn_feat_train, bottleneck_features_train)

In [232]:
# create valid data bottleneck features
bottleneck_features_valid = model.predict_generator(valid_gen)
np.save(saved_model_dir + bn_feat_valid, bottleneck_features_valid)

Found 5000 images belonging to 2 classes.


In [233]:
# create test data bottleneck features
bottleneck_features_test = model.predict_generator(test_gen)
np.save(saved_model_dir + bn_feat_test, bottleneck_features_test)

Found 12500 images belonging to 1 classes.


### Load VGG16 features

Load the saved VGG features. There are two cells here, one for training on the full training set, and one for training only on the small sample data set.

In [234]:
# Full training set of features and labels
test_features = np.load(saved_model_dir + bn_feat_test)
train_features = np.load(saved_model_dir + bn_feat_train)
valid_features = np.load(saved_model_dir + bn_feat_valid)
# The generators used shuffle=False, so features come out as all cats, then all dogs.
# Label 1 therefore means "cat" here, and each class makes up exactly half of each set.
train_labels = np.array([1] * int(nb_train / 2) + [0] * int(nb_train / 2))
valid_labels = np.array([1] * int(nb_valid / 2) + [0] * int(nb_valid / 2))

In [22]:
# # Small sample set of features and labels
# train_features = np.load(saved_model_dir + bn_feat_sample_train)
# valid_features = np.load(saved_model_dir + bn_feat_sample_valid)
# train_labels = np.array([1] * int(nb_sample_train / 2) + [0] * int(nb_sample_train / 2))
# valid_labels = np.array([1] * int(nb_sample_valid / 2) + [0] * int(nb_sample_valid / 2))
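
The hand-built labels only line up with the saved features because of the order in which the generator lists files. As far as I understand `flow_from_directory`, with `shuffle=False` it lists files class by class, with class folders sorted by name. The sketch below reproduces that ordering in plain Python. The helper `labels_in_generator_order` is hypothetical and exists only to illustrate the assumption.

```python
def labels_in_generator_order(class_counts, positive_class):
    """Build 0/1 labels for files listed class by class, classes sorted by name."""
    labels = []
    for name in sorted(class_counts):
        value = 1 if name == positive_class else 0
        labels.extend([value] * class_counts[name])
    return labels

# Class sizes of the full training set after the validation split.
counts = {'dogs': 10000, 'cats': 10000}
labels = labels_in_generator_order(counts, positive_class='cats')
print(len(labels))            # 20000
print(labels[0], labels[-1])  # 1 0
```

Because 'cats' sorts before 'dogs', the first half of the features are cats, which matches the `[1] * half + [0] * half` labels used in this notebook. If the classes were unbalanced, the halves would no longer line up and the counts would have to come from the generator itself.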

### Linear model on VGG16

This model is just a single linear layer with a sigmoid output on top of the flattened VGG16 features, which amounts to logistic regression.

In [457]:
model_fc1 = Sequential()
model_fc1.add(Flatten(input_shape=train_features.shape[1:]))
model_fc1.add(Dense(1, activation='sigmoid'))

model_fc1.compile(optimizer='rmsprop',
              loss='binary_crossentropy', metrics=['accuracy'])

In [458]:
fc1_weights = 'best_weights_fc1.h5'
checkpointer = ModelCheckpoint(filepath= saved_model_dir + fc1_weights, 
                               save_best_only=True)

model_fc1.fit(train_features,train_labels,
              epochs=5,#nb_epochs,
              batch_size=batch_size,
              validation_data=(valid_features,valid_labels),
              callbacks=[checkpointer],
              verbose=1)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb892db4050>

In [465]:
# Log loss on validation set
model_fc1.load_weights(saved_model_dir + fc1_weights)
pred = model_fc1.predict(valid_features)
pred = pred.reshape(pred.shape[0])
predictions = pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.12561574509888887
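
The clipping to [0.05, 0.95] before scoring matters because log loss punishes confident mistakes very heavily. Here is a minimal pure-Python sketch of binary log loss with clipping. It is my own illustration of the formula, not the scikit-learn implementation used above.

```python
import math

def clipped_log_loss(y_true, y_prob, lo=0.05, hi=0.95):
    """Mean binary cross-entropy after clipping probabilities to [lo, hi]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, lo), hi)  # keep p away from 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A single positive example, predicted completely wrong and completely right.
worst_case = clipped_log_loss([1], [0.0])
best_case = clipped_log_loss([1], [1.0])
print(round(worst_case, 4))  # 2.9957
print(round(best_case, 4))   # 0.0513
```

Without clipping, the completely wrong prediction would have an unbounded loss. With clipping, a mistake costs at most about 3.0, at the price of about 0.05 on every correct prediction.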

### Final model on VGG16

In [460]:
model_fc3 = Sequential()
model_fc3.add(MaxPooling2D((2, 2), strides=(2, 2), input_shape=train_features.shape[1:]))
model_fc3.add(Flatten())
model_fc3.add(Dense(4096, activation='relu'))
model_fc3.add(BatchNormalization())
model_fc3.add(Dropout(0.5))
model_fc3.add(Dense(4096, activation='relu'))
model_fc3.add(BatchNormalization())
model_fc3.add(Dropout(0.5))
model_fc3.add(Dense(1, activation='sigmoid'))

model_fc3.compile(optimizer='rmsprop',
              loss='binary_crossentropy', metrics=['accuracy'])

In [461]:
fc3_weights = 'best_weights_fc3.h5'
checkpointer = ModelCheckpoint(filepath= saved_model_dir + fc3_weights, 
                               save_best_only=True)

model_fc3.fit(train_features,train_labels,
              epochs=5,#nb_epochs,
              batch_size=batch_size,
              validation_data=(valid_features,valid_labels),
              callbacks=[checkpointer],
              verbose=1)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb8911c9bd0>

Doesn't look like we're seeing much improvement on the loss function after a few epochs. We'd probably need to tweak some hyperparameters or the architecture to improve much from here.

In [464]:
# Log loss on validation set
model_fc3.load_weights(saved_model_dir + fc3_weights)
pred = model_fc3.predict(valid_features)
pred = pred.reshape(pred.shape[0])
predictions = pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.10920777253806591

Looks like it's an improvement on the linear model above. It's surprising how well the linear model did though. Let's try an ensemble of this model next.

## ENSEMBLE

In [466]:
class Ensemble():
    
    def __init__(self, nb_models = 3):
        self.nb_models = nb_models
    
    def create_model(self):
        model = Sequential()
        model.add(MaxPooling2D((2, 2), strides=(2, 2), input_shape=train_features.shape[1:]))
        model.add(Flatten())
        model.add(Dense(4096, activation='relu'))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        model.add(Dense(4096, activation='relu'))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        model.add(Dense(1, activation='sigmoid'))

        model.compile(optimizer='rmsprop',
                      loss='binary_crossentropy', metrics=['accuracy'])
        return model

    def train(self,epochs=3):
        for i in range(self.nb_models):
            
            weights = 'best_weights_fc3_'+str(i)+'.h5'
            checkpointer = ModelCheckpoint(filepath= saved_model_dir + weights, 
                                           save_best_only=True)
            model = self.create_model()
            model.fit(train_features,train_labels,
                          epochs=epochs,
                          batch_size=batch_size,
                          validation_data=(valid_features,valid_labels),
                          callbacks=[checkpointer],
                          verbose=1)
            
    def predict(self, x):
        preds = []
        model = self.create_model()
        for i in range(self.nb_models):
            model.load_weights(saved_model_dir + 'best_weights_fc3_'+str(i)+'.h5')
            preds.append(model.predict(x))
        ens_pred = np.stack(preds).mean(axis=0)
        return ens_pred

In [390]:
model_ens = Ensemble(3)
model_ens.train(3)

In [396]:
# Log loss on validation set
ens_pred = model_ens.predict(valid_features)
ens_pred = ens_pred.reshape(ens_pred.shape[0])
predictions = ens_pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.097072397928684953

In [467]:
model_ens = Ensemble(5)
model_ens.train(3)

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [468]:
# Log loss on validation set
ens_pred = model_ens.predict(valid_features)
ens_pred = ens_pred.reshape(ens_pred.shape[0])
predictions = ens_pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.0926154362514615

As usual, an ensemble tends to do better, and adding more models seems to continue the trend: validation log loss went from 0.0971 with three models to 0.0926 with five. We could also try an ensemble of different types of models, perhaps using different pre-trained model architectures. I think this should be enough to get into the top 40% of the Kaggle competition though.
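
One reason averaging helps with this metric: log loss is convex in the predicted probability, so the loss of the averaged prediction can never exceed the average of the individual losses. The sketch below checks this on made-up numbers. The labels and probabilities are toy values, not outputs from the models in this notebook.

```python
import math

def log_loss(y_true, y_prob):
    """Mean binary cross-entropy (probabilities must be strictly between 0 and 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def average_predictions(per_model_probs):
    """Average the predictions of several models, example by example."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]

# Toy data: four examples, three hypothetical models.
y_true = [1, 0, 1, 0]
model_probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.7, 0.1, 0.8, 0.3],
    [0.8, 0.3, 0.4, 0.1],
]

ensemble_probs = average_predictions(model_probs)
ensemble_loss = log_loss(y_true, ensemble_probs)
mean_single_loss = sum(log_loss(y_true, p) for p in model_probs) / len(model_probs)
print(ensemble_loss <= mean_single_loss)  # True
```

This only guarantees that the ensemble beats the average member, not the best one. The gain grows when the members make different mistakes, which is why mixing architectures is a natural next step.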

## PREPARE KAGGLE SUBMISSION

In [407]:
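# Make predictions on test set
# The models were trained with label 1 = cat, but the competition expects the
# probability that the image is a dog, so submit 1 - P(cat)
iscat_pred = model_ens.predict(test_features)
predictions = (1 - iscat_pred).reshape(iscat_pred.shape[0])
predictions = predictions.clip(min=0.05, max=0.95)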

In [404]:
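# Get file indices: filenames look like 'test/123.jpg', so drop the
# 5-character 'test/' prefix and the extension to get the numeric id
generator = create_gen(test_dir, batch_size=batch_size, shuffle=False, gen=datagen)
filenames = generator.filenames
idx = np.array([int(f[5:f.rfind('.')]) for f in filenames])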

Found 12500 images belonging to 1 classes.


In [408]:
# Save predictions
subm = np.stack([idx,predictions],axis=1)
submission_file_name = 'submission3.csv'
%cd ~/git/dogs-v-cats-redux
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')

/home/ubuntu/git/dogs-v-cats-redux
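
As an alternative to `np.savetxt`, the submission can be written with the standard `csv` module, which keeps the ids as integers instead of storing them in a float array. This is a sketch under my own assumptions: the helpers `parse_id` and `write_submission` are hypothetical, and the two filenames and probabilities are placeholders.

```python
import csv
import io
import os

def parse_id(filename):
    """Extract the integer id from a path such as 'test/123.jpg'."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem)

def write_submission(stream, filenames, dog_probs):
    """Write 'id,label' rows sorted by id, with labels formatted to 5 decimals."""
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow(['id', 'label'])
    rows = sorted(zip((parse_id(f) for f in filenames), dog_probs))
    for img_id, prob in rows:
        writer.writerow([img_id, '%.5f' % prob])

# Placeholder inputs; in the notebook these would be generator.filenames
# and the clipped ensemble predictions.
buf = io.StringIO()
write_submission(buf, ['test/10.jpg', 'test/2.jpg'], [0.95, 0.05])
print(buf.getvalue())
```

Passing an open file instead of the `StringIO` buffer writes the same content to disk. Sorting by id is optional but makes the file easier to inspect.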
