# Dogs vs. Cats: Redux Edition

My submission for the Dogs vs. Cats Redux Kaggle competition:
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition

The final model is a fine-tuned version of a pre-trained VGG16 model. It made it into the top 40% of the public leaderboard.

## Imports

In [1]:
import os
from shutil import copyfile
from glob import glob
import numpy as np
from sklearn.metrics import log_loss
from matplotlib import pyplot as plt

from keras import backend as K
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, BatchNormalization, Dropout, MaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint
from keras.optimizers import RMSprop

%matplotlib inline

Using TensorFlow backend.


## Constants

In [101]:
HOME_DIR = os.getcwd()
DATA_DIR = HOME_DIR + '/data/dogs-vs-cats'
TEST_DIR = DATA_DIR+'/test'
TRAIN_DIR = DATA_DIR+'/train'
VALID_DIR = DATA_DIR+'/valid'
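MODEL_DIR = DATA_DIR+'/models'
# Path prefix used by later cells when saving features and weights
saved_model_dir = MODEL_DIR + '/'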
# sample of training data
SAMPLE_DIR = DATA_DIR+'/sample'
SAMPLE_TRAIN_DIR = SAMPLE_DIR + '/train'
SAMPLE_VALID_DIR = SAMPLE_DIR + '/valid'
IMG_SHAPE = (224,224,3)

In [91]:
nb_train_sample = 80
nb_valid_sample = 20
nb_valid = 2500

## Download and Extract Data

This creates a folder for the data in the current working directory, downloads the Kaggle dataset using the Kaggle API, and extracts the training and test files into the data directory.

You can also download the data from the Kaggle page:<br>
https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/data

If you download it yourself, make sure to place the train and test images in the folders corresponding to TRAIN_DIR and TEST_DIR defined above.

In [27]:
!mkdir -p $DATA_DIR
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition -p $DATA_DIR

In [None]:
!unzip -q $DATA_DIR/test.zip -d $DATA_DIR
!unzip -q $DATA_DIR/train.zip -d $DATA_DIR

## Split Data

Split the data into training, validation, and test sets. I also copy a small random sample of the training set for initial experimentation; these examples go in the sample folder.

In [54]:
for path in [TRAIN_DIR, VALID_DIR, SAMPLE_TRAIN_DIR, SAMPLE_VALID_DIR]:
    %mkdir -p $path/dogs
    %mkdir -p $path/cats
%mkdir -p $MODEL_DIR

In [89]:
def move_data_to_subdirs(train_dir):
    # Use the train_dir argument rather than the global constant
    sub_dirs = [f.path for f in os.scandir(train_dir) if f.is_dir()]
    for sub_dir in sub_dirs:
        # Category which matches this sub-directory
        category = os.path.basename(sub_dir)[:-1]
        # Paths for files to move to sub-directory
        file_paths = glob(os.path.join(train_dir, category + '.*'))
        for path_original in file_paths:
            filename = os.path.basename(path_original)
            path_new = os.path.join(sub_dir, filename)
            os.rename(path_original, path_new)

In [90]:
move_data_to_subdirs(TRAIN_DIR)

In [92]:
def create_sample(train_dir, sample_dir, nb_train, nb_valid):
    sub_dirs = os.listdir(train_dir)
    for sub_dir in sub_dirs:
        file_paths = glob(os.path.join(train_dir, sub_dir, '*.jpg'))
        # permutation returns a shuffled copy, so keep the result
        file_paths = np.random.permutation(file_paths)
        # Copy some to the sample train folder
        for i in range(nb_train): 
            filename = os.path.basename(file_paths[i])
            destination = os.path.join(sample_dir, 'train', sub_dir, filename)
            copyfile(file_paths[i], destination)
        # Copy some to the sample valid folder
        for i in range(nb_train, nb_train+nb_valid): 
            filename = os.path.basename(file_paths[i])
            destination = os.path.join(sample_dir, 'valid', sub_dir, filename)
            copyfile(file_paths[i], destination)

In [93]:
create_sample(TRAIN_DIR, SAMPLE_DIR, nb_train_sample, nb_valid_sample)

In [94]:
def move_to_valid(train_dir, valid_dir, nb_valid):
    sub_dirs = os.listdir(train_dir)
    for sub_dir in sub_dirs:
        file_paths = glob(os.path.join(train_dir, sub_dir, '*.jpg'))
        # permutation returns a shuffled copy, so keep the result
        file_paths = np.random.permutation(file_paths)
        # Move some to the validation folder
        for i in range(nb_valid):
            filename = os.path.basename(file_paths[i])
            dest = os.path.join(valid_dir, sub_dir, filename)
            os.rename(file_paths[i], dest)

In [95]:
move_to_valid(TRAIN_DIR, VALID_DIR, nb_valid)
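
The split above boils down to shuffling the file list for each class and moving the first few files elsewhere. Here is a minimal stdlib-only sketch of that idea, run against a throwaway temporary directory. The helper name `split_holdout`, the seeded `random.Random`, and the dummy filenames are assumptions for illustration, not part of the notebook:

```python
import os
import random
import tempfile

def split_holdout(src_dir, dst_dir, n_holdout, seed=0):
    """Move n_holdout randomly chosen files from src_dir to dst_dir."""
    files = sorted(os.listdir(src_dir))   # sort first so the seed fully determines the split
    random.Random(seed).shuffle(files)    # shuffles the list in place
    os.makedirs(dst_dir, exist_ok=True)
    moved = files[:n_holdout]
    for name in moved:
        os.rename(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return moved

# Demo on empty dummy files (hypothetical names mimicking the Kaggle layout)
root = tempfile.mkdtemp()
train_cats = os.path.join(root, 'train', 'cats')
valid_cats = os.path.join(root, 'valid', 'cats')
os.makedirs(train_cats)
for i in range(10):
    open(os.path.join(train_cats, 'cat.%d.jpg' % i), 'w').close()

moved = split_holdout(train_cats, valid_cats, n_holdout=3)
n_train_left = len(os.listdir(train_cats))
n_valid = len(os.listdir(valid_cats))
print(n_train_left, n_valid)
```

Seeding the shuffle makes the split reproducible between runs, which the notebook's unseeded `np.random.permutation` does not guarantee.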

In [97]:
def create_gen(directory, batch_size=4, shuffle=True, 
               gen=ImageDataGenerator(), target_size=(224,224), 
               class_mode = None):
    return gen.flow_from_directory(directory,
                                   batch_size = batch_size,
                                   shuffle = shuffle,
                                   target_size = target_size, 
                                   class_mode = class_mode)

In [98]:
# Create data generators
batch_size = 64
sample_train_gen = create_gen(SAMPLE_TRAIN_DIR, batch_size=batch_size, shuffle=True, class_mode='binary')
sample_valid_gen = create_gen(SAMPLE_VALID_DIR, batch_size=batch_size, shuffle=True, class_mode='binary')
train_gen = create_gen(TRAIN_DIR, batch_size=batch_size, shuffle=True, class_mode='binary')
valid_gen = create_gen(VALID_DIR, batch_size=batch_size, shuffle=True, class_mode='binary')
test_gen = create_gen(TEST_DIR, batch_size=batch_size, shuffle=True, class_mode='binary')

Found 160 images belonging to 2 classes.
Found 40 images belonging to 2 classes.
Found 20000 images belonging to 2 classes.
Found 5000 images belonging to 2 classes.
Found 0 images belonging to 0 classes.


In [2]:
def plot_img(img):
    if K.image_dim_ordering() == 'th':
        img = np.rollaxis(img,0,3).astype(np.uint8)
    else:
        img = np.rollaxis(img,0,1).astype(np.uint8)
    plt.imshow(img)

## BASELINE MODELS

The first models tried here are meant to serve as benchmarks to compare later models to. 

### Linear Model

The model's accuracy on the training data trends upward, so we know it's learning something. The validation accuracy, however, fluctuates around 50 percent, no better than guessing. So the model is fitting the training data but isn't generalizing.
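
To put those validation numbers in context: the sample validation set holds only 40 images, so its accuracy is a noisy estimate. The quick sketch below uses the accuracies logged by the training run in this section. The binomial standard-error approximation is my addition, not something the notebook computes:

```python
import math

# Accuracies copied from the 8-epoch training log of the linear model
acc     = [0.4859, 0.5930, 0.5633, 0.5898, 0.5656, 0.6664, 0.6953, 0.6711]
val_acc = [0.5000, 0.4500, 0.4000, 0.4000, 0.5250, 0.5500, 0.6250, 0.5750]

n_valid = 40  # sample validation images (20 per class)

mean_acc = sum(acc) / len(acc)
mean_val_acc = sum(val_acc) / len(val_acc)

# Standard error of the accuracy of a classifier that guesses at random (p = 0.5)
std_err = math.sqrt(0.5 * 0.5 / n_valid)

print('mean train acc: %.3f' % mean_acc)
print('mean valid acc: %.3f' % mean_val_acc)
print('std error at chance: %.3f' % std_err)
```

With a standard error of about 8 percentage points, most of the logged validation accuracies are within one standard error of guessing.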

In [105]:
model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=IMG_SHAPE))
model.add(Flatten())
model.add(Dense(1,activation = 'sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [106]:
model.fit_generator(sample_train_gen, epochs=8, validation_data=sample_valid_gen, verbose=2)

Epoch 1/8
 - 5s - loss: 6.0550 - acc: 0.4859 - val_loss: 7.6444 - val_acc: 0.5000
Epoch 2/8
 - 5s - loss: 5.9200 - acc: 0.5930 - val_loss: 7.8531 - val_acc: 0.4500
Epoch 3/8
 - 5s - loss: 6.4399 - acc: 0.5633 - val_loss: 9.4186 - val_acc: 0.4000
Epoch 4/8
 - 5s - loss: 6.3739 - acc: 0.5898 - val_loss: 9.1734 - val_acc: 0.4000
Epoch 5/8
 - 5s - loss: 6.3072 - acc: 0.5656 - val_loss: 7.0310 - val_acc: 0.5250
Epoch 6/8
 - 5s - loss: 5.0085 - acc: 0.6664 - val_loss: 6.1718 - val_acc: 0.5500
Epoch 7/8
 - 5s - loss: 4.6386 - acc: 0.6953 - val_loss: 5.6401 - val_acc: 0.6250
Epoch 8/8
 - 5s - loss: 4.6860 - acc: 0.6711 - val_loss: 6.3744 - val_acc: 0.5750


<keras.callbacks.History at 0x7fdfe2165748>

### Simple neural network

Same issue as the model above. We can start to fit to the training data, but it's not generalizing.

In [109]:
model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=IMG_SHAPE))
model.add(Flatten())
model.add(Dense(100))
model.add(BatchNormalization())
model.add(Dense(1,activation = 'sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

In [108]:
model.fit_generator(sample_train_gen, epochs=8, validation_data=sample_valid_gen)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7fdfe1de6358>

### Simple CNN

The shallow CNN has the same issue as the last two models. I try making it deeper, but it doesn't look like it's enough for it to learn anything useful. <br>Next I'll try a pre-trained model.

In [110]:
cnn1 = Sequential()
cnn1.add(BatchNormalization(axis=1, input_shape = IMG_SHAPE))
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
cnn1.add(Flatten())
cnn1.add(Dense(100))
cnn1.add(BatchNormalization())
cnn1.add(Dense(1,activation = 'sigmoid'))
cnn1.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
cnn1.fit_generator(sample_train_gen, epochs=4, validation_data=sample_valid_gen)

Epoch 1/4


In [451]:
cnn1 = Sequential()
cnn1.add(BatchNormalization(axis=1, input_shape = IMG_SHAPE))
# Block 1
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Block 2
cnn1.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Block 3
cnn1.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
cnn1.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
cnn1.add(MaxPooling2D((2, 2), strides=(2, 2)))
# Top
cnn1.add(Flatten())
cnn1.add(Dense(256))
cnn1.add(BatchNormalization())
cnn1.add(Dense(1,activation = 'sigmoid'))
cnn1.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [452]:
cnn1.fit_generator(sample_train_gen, epochs=4, validation_data=sample_valid_gen)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7fb894312e50>

## FINETUNED VGG16 MODELS

### Create VGG16 pretrained model

Create the VGG16 model and load it with weights trained on ImageNet.

In [455]:
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
model = VGG16(weights='imagenet', include_top=False)

In [456]:
batch_size=64
nb_epochs=3
nb_sample_train = 160
nb_sample_valid = 40
nb_train = 20000
nb_valid = 5000

bn_feat_sample_train = 'bn_feat_sample_train.npy'
bn_feat_sample_valid = 'bn_feat_sample_valid.npy'
bn_feat_train = 'bn_feat_train.npy'
bn_feat_valid = 'bn_feat_valid.npy'
bn_feat_test = 'bn_feat_test.npy'

### Compute VGG16 bottleneck features

Here's where I compute the bottleneck features. These are the outputs of VGG16's last convolutional block, which would normally feed into the dense layers on top (which we didn't include). The convolutional layers are very good at pulling useful features out of images; with the top included, the dense layers would then use those features to output an ImageNet class. We're going to keep the features VGG16 extracts from our dataset and use them to train a new network that classifies each image as a cat or a dog.
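
For a sense of what gets cached, here is the shape arithmetic. It assumes 224x224 RGB inputs (the `target_size` used by `create_gen`) and float32 outputs. VGG16's convolutional base has five 2x2 max-pooling stages and ends with 512 channels; the helper itself is only an illustration:

```python
def vgg16_bottleneck_shape(height=224, width=224):
    """Output shape of VGG16 with include_top=False for a given input size."""
    n_pool_stages = 5      # each 2x2 max-pool with stride 2 halves height and width
    n_channels = 512       # filters in the last convolutional block
    for _ in range(n_pool_stages):
        height //= 2
        width //= 2
    return (height, width, n_channels)

shape = vgg16_bottleneck_shape()
values_per_image = shape[0] * shape[1] * shape[2]
bytes_per_image = values_per_image * 4            # 4 bytes per float32 value
total_gb = 20000 * bytes_per_image / 1e9          # 20000 training images

print(shape, values_per_image, round(total_gb, 2))
```

So each image is reduced to a 7x7x512 feature map (25088 values), and the saved training features take roughly 2 GB on disk.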

In [None]:
# Create generators

datagen = ImageDataGenerator(rescale=1., featurewise_center=True) #(rescale=1./255)
datagen.mean=np.array([103.939, 116.779, 123.68],dtype=np.float32).reshape(1,1,3) #3,1,1

sample_train_gen = create_gen(SAMPLE_TRAIN_DIR, batch_size=batch_size, shuffle=False, gen=datagen)
sample_valid_gen = create_gen(SAMPLE_VALID_DIR, batch_size=batch_size, shuffle=False, gen=datagen)
train_gen = create_gen(TRAIN_DIR, batch_size=batch_size, shuffle=False, gen=datagen)
valid_gen = create_gen(VALID_DIR, batch_size=batch_size, shuffle=False, gen=datagen)
# flow_from_directory only finds images inside sub-folders, so the test images need one too (e.g. TEST_DIR/test/)
test_gen = create_gen(TEST_DIR, batch_size=batch_size, shuffle=False, gen=datagen)

In [75]:
# create sample train data bottleneck features
bottleneck_features_train = model.predict_generator(sample_train_gen)
# np.save accepts a path directly, which avoids opening the file in text mode
np.save(saved_model_dir + bn_feat_sample_train, bottleneck_features_train)

In [77]:
# create sample validation data bottleneck features
bottleneck_features_valid = model.predict_generator(sample_valid_gen)
np.save(saved_model_dir + bn_feat_sample_valid, bottleneck_features_valid)

In [None]:
# create train data bottleneck features
bottleneck_features_train = model.predict_generator(train_gen)
np.save(saved_model_dir + bn_feat_train, bottleneck_features_train)

In [232]:
# create valid data bottleneck features
bottleneck_features_valid = model.predict_generator(valid_gen)
np.save(saved_model_dir + bn_feat_valid, bottleneck_features_valid)

Found 5000 images belonging to 2 classes.


In [233]:
# create test data bottleneck features
bottleneck_features_test = model.predict_generator(test_gen)
np.save(saved_model_dir + bn_feat_test, bottleneck_features_test)

Found 12500 images belonging to 1 classes.


### Load VGG16 features

Load the saved VGG features. There are two cells here, one for training on the full training set, and one for training only on the small sample data set.

In [234]:
# Full training set of features and labels
test_features = np.load(saved_model_dir + bn_feat_test)
train_features = np.load(saved_model_dir + bn_feat_train)
valid_features = np.load(saved_model_dir + bn_feat_valid)
# The generators used shuffle=False, so features are ordered cats first, then dogs.
# Cats are labeled 1, so the models below predict the probability of "cat".
train_labels = np.array([1] * int(nb_train / 2) + [0] * int(nb_train / 2))
valid_labels = np.array([1] * int(nb_valid / 2) + [0] * int(nb_valid / 2))

In [22]:
# # Small sample set of features and labels
# train_features = np.load(saved_model_dir + bn_feat_sample_train)
# valid_features = np.load(saved_model_dir + bn_feat_sample_valid)
# train_labels = np.array([1] * int(nb_sample_train / 2) + [0] * int(nb_sample_train / 2))
# valid_labels = np.array([1] * int(nb_sample_valid / 2) + [0] * int(nb_sample_valid / 2))
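
The hard-coded labels work only because the generators were created with `shuffle=False`: `flow_from_directory` visits class folders in alphanumeric order, so all `cats` features come before all `dogs` features. The sketch below rebuilds the label vector from that ordering. `make_labels` and its `positive_class` parameter are hypothetical names for illustration:

```python
def make_labels(class_counts, positive_class):
    """Labels for features produced by an unshuffled directory generator.

    class_counts maps class folder name -> number of images in that folder.
    """
    labels = []
    for name in sorted(class_counts):          # same order as the class folders
        value = 1 if name == positive_class else 0
        labels.extend([value] * class_counts[name])
    return labels

# Full training set: 10000 cats followed by 10000 dogs
train_labels_sketch = make_labels({'dogs': 10000, 'cats': 10000}, positive_class='cats')
print(train_labels_sketch[0], train_labels_sketch[-1], sum(train_labels_sketch))
```

With `positive_class='cats'` this reproduces the notebook's `[1] * n + [0] * n`, which means the classifiers trained on these labels output the probability of *cat*.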

### Linear model on VGG16

This model is just a single dense layer with a sigmoid output, applied to the flattened VGG16 features.

In [457]:
model_fc1 = Sequential()
model_fc1.add(Flatten(input_shape=train_features.shape[1:]))
model_fc1.add(Dense(1, activation='sigmoid'))

model_fc1.compile(optimizer='rmsprop',
              loss='binary_crossentropy', metrics=['accuracy'])

In [458]:
fc1_weights = 'best_weights_fc1.h5'
checkpointer = ModelCheckpoint(filepath= saved_model_dir + fc1_weights, 
                               save_best_only=True)

model_fc1.fit(train_features,train_labels,
              epochs=5,#nb_epochs,
              batch_size=batch_size,
              validation_data=(valid_features,valid_labels),
              callbacks=[checkpointer],
              verbose=1)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb892db4050>

In [465]:
# Log loss on validation set
model_fc1.load_weights(saved_model_dir + fc1_weights)
pred = model_fc1.predict(valid_features)
pred = pred.reshape(pred.shape[0])
predictions = pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.12561574509888887
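
Why clip at 0.05 and 0.95? Log loss grows without bound as a confident prediction turns out wrong, so clipping caps the per-image penalty. Below is a plain-Python version of the binary log loss formula (the notebook itself uses `sklearn.metrics.log_loss`); the example probabilities are made up:

```python
import math

def binary_log_loss(y_true, y_prob):
    """Mean negative log-likelihood for binary labels and predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

def clip(probs, lo=0.05, hi=0.95):
    """Limit every probability to the range [lo, hi]."""
    return [min(max(p, lo), hi) for p in probs]

y_true = [1, 1, 0, 0]
y_prob = [0.999, 0.001, 0.002, 0.01]   # made-up values; the second one is confidently wrong

raw_loss = binary_log_loss(y_true, y_prob)
clipped_loss = binary_log_loss(y_true, clip(y_prob))
worst_case = -math.log(0.05)           # largest possible per-image loss after clipping

print(raw_loss, clipped_loss, worst_case)
```

The trade-off is that every correct, confident prediction now costs at least -ln(0.95), about 0.051, so clipping helps only when the model makes some confident mistakes.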

### Final model on VGG16

In [460]:
model_fc3 = Sequential()
model_fc3.add(MaxPooling2D((2, 2), strides=(2, 2), input_shape=train_features.shape[1:]))
model_fc3.add(Flatten())
model_fc3.add(Dense(4096, activation='relu'))
model_fc3.add(BatchNormalization())
model_fc3.add(Dropout(0.5))
model_fc3.add(Dense(4096, activation='relu'))
model_fc3.add(BatchNormalization())
model_fc3.add(Dropout(0.5))
model_fc3.add(Dense(1, activation='sigmoid'))

model_fc3.compile(optimizer='rmsprop',
              loss='binary_crossentropy', metrics=['accuracy'])

In [461]:
fc3_weights = 'best_weights_fc3.h5'
checkpointer = ModelCheckpoint(filepath= saved_model_dir + fc3_weights, 
                               save_best_only=True)

model_fc3.fit(train_features,train_labels,
              epochs=5,#nb_epochs,
              batch_size=batch_size,
              validation_data=(valid_features,valid_labels),
              callbacks=[checkpointer],
              verbose=1)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb8911c9bd0>

It doesn't look like the loss improves much after a few epochs. We'd probably need to tweak some hyperparameters or the architecture to improve much from here.
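
Because `ModelCheckpoint` is created with `save_best_only=True` (and monitors `val_loss` by default), the weights on disk come from the best epoch rather than the last one, which is why they are reloaded before scoring. A plain-Python sketch of that bookkeeping, with made-up validation losses:

```python
def checkpoint_epochs(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint would be written."""
    best = float('inf')
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:        # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

val_losses = [0.31, 0.22, 0.25, 0.21, 0.24]   # made-up values for five epochs
saved = checkpoint_epochs(val_losses)
print(saved)   # epochs that would overwrite the weights file
```

In this made-up run the last epoch is worse than the fourth, so loading the checkpoint recovers the fourth epoch's weights.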

In [464]:
# Log loss on validation set
model_fc3.load_weights(saved_model_dir + fc3_weights)
pred = model_fc3.predict(valid_features)
pred = pred.reshape(pred.shape[0])
predictions = pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.10920777253806591

Looks like it's an improvement on the linear model above. It's surprising how well the linear model did though. Let's try an ensemble of this model next.

## ENSEMBLE

In [466]:
class Ensemble():
    
    def __init__(self, nb_models = 3):
        self.nb_models = nb_models
    
    def create_model(self):
        model = Sequential()
        model.add(MaxPooling2D((2, 2), strides=(2, 2), input_shape=train_features.shape[1:]))
        model.add(Flatten())
        model.add(Dense(4096, activation='relu'))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        model.add(Dense(4096, activation='relu'))
        model.add(BatchNormalization())
        model.add(Dropout(0.5))
        model.add(Dense(1, activation='sigmoid'))

        model.compile(optimizer='rmsprop',
                      loss='binary_crossentropy', metrics=['accuracy'])
        return model

    def train(self,epochs=3):
        for i in range(self.nb_models):
            
            weights = 'best_weights_fc3_'+str(i)+'.h5'
            checkpointer = ModelCheckpoint(filepath= saved_model_dir + weights, 
                                           save_best_only=True)
            model = self.create_model()
            model.fit(train_features,train_labels,
                          epochs=epochs,
                          batch_size=batch_size,
                          validation_data=(valid_features,valid_labels),
                          callbacks=[checkpointer],
                          verbose=1)
            
    def predict(self, x):
        preds = []
        model = self.create_model()
        for i in range(self.nb_models):
            model.load_weights(saved_model_dir + 'best_weights_fc3_'+str(i)+'.h5')
            preds.append(model.predict(x))
        ens_pred = np.stack(preds).mean(axis=0)
        return ens_pred

In [390]:
model_ens = Ensemble(3)
model_ens.train(3)

In [396]:
# Log loss on validation set
ens_pred = model_ens.predict(valid_features)
ens_pred = ens_pred.reshape(ens_pred.shape[0])
predictions = ens_pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.097072397928684953

In [467]:
model_ens = Ensemble(5)
model_ens.train(3)

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [468]:
# Log loss on validation set
ens_pred = model_ens.predict(valid_features)
ens_pred = ens_pred.reshape(ens_pred.shape[0])
predictions = ens_pred.clip(min=0.05, max=0.95)
log_loss(valid_labels,predictions)

0.0926154362514615

As usual, an ensemble tends to do better, and going from three models to five continues the trend. We could also try an ensemble of different types of models, perhaps using different pre-trained architectures. I think this should be enough to get into the top 40% of the Kaggle competition though.
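
One reason averaging helps with this metric: the log function is concave, so by Jensen's inequality the log loss of the averaged prediction is never worse than the average of the individual models' log losses. A toy check in plain Python, with made-up probabilities for three hypothetical models:

```python
import math

def binary_log_loss(y_true, y_prob):
    """Mean negative log-likelihood for binary labels and predicted P(label == 1)."""
    return sum(-math.log(p if y == 1 else 1.0 - p)
               for y, p in zip(y_true, y_prob)) / len(y_true)

y_true = [1, 0, 1, 0]
# Made-up predictions from three models (rows) for four images (columns)
model_preds = [
    [0.90, 0.20, 0.60, 0.40],
    [0.70, 0.10, 0.80, 0.60],
    [0.95, 0.30, 0.40, 0.20],
]

# Average across models for each image, like np.stack(preds).mean(axis=0)
ens_pred = [sum(col) / len(col) for col in zip(*model_preds)]

individual = [binary_log_loss(y_true, p) for p in model_preds]
mean_individual = sum(individual) / len(individual)
ensemble = binary_log_loss(y_true, ens_pred)

print(individual, mean_individual, ensemble)
```

The inequality only guarantees that the ensemble beats the average member, not the best one. The validation scores above suggest that here it also beats the single model.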

## PREPARE KAGGLE SUBMISSION

In [407]:
# Make predictions on test set
iscat_pred = model_ens.predict(test_features)
predictions = (1 - iscat_pred).reshape(iscat_pred.shape[0])
predictions = predictions.clip(min=0.05, max=0.95)

In [404]:
# Get file indices
generator = create_gen(TEST_DIR, batch_size=batch_size, shuffle=False, gen=datagen)
filenames = generator.filenames
# Filenames look like 'test/123.jpg': drop the 5-character folder prefix and the extension
idx = np.array([int(f[5:f.rfind('.')]) for f in filenames])

Found 12500 images belonging to 1 classes.


In [408]:
# Save predictions
subm = np.stack([idx,predictions],axis=1)
submission_file_name = 'submission3.csv'
%cd ~/git/dogs-v-cats-redux
np.savetxt(submission_file_name, subm, fmt='%d,%.5f', header='id,label', comments='')

/home/ubuntu/git/dogs-v-cats-redux
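
The same submission format can be produced with the standard library. This sketch assumes filenames of the form `test/<id>.jpg`, which is what the `f[5:...]` slice above relies on; the filenames and probabilities here are made up, and `image_id` is a hypothetical helper:

```python
import csv
import io
import os

def image_id(filename):
    """'test/123.jpg' -> 123, without depending on the length of the folder name."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem)

filenames = ['test/1.jpg', 'test/10.jpg', 'test/2.jpg']   # made-up examples
dog_probs = [0.95, 0.05, 0.5]                              # made-up P(dog) values

buffer = io.StringIO()          # stands in for open('submission.csv', 'w', newline='')
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['id', 'label'])
for name, prob in zip(filenames, dog_probs):
    writer.writerow([image_id(name), '%.5f' % prob])

csv_text = buffer.getvalue()
print(csv_text)
```

Parsing the id from the basename avoids hard-coding the prefix length, so the code keeps working if the test sub-folder is renamed.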
