![data-x](http://oi64.tinypic.com/o858n4.jpg)


# Train CNN from Scratch CATS vs DOGS 
#### Small training sample convolutional neural net with data augmentation

**Author:** Alexander Fred Ojala

**Sources:** 
* **Data**: https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition (subset, note all images are unique cat and dog photos)
* **Training + explanations**: https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

**Copyright:** Feel free to do whatever you want with this code.

___

# Required data

## Download vgg16_weights.h5:
From here: https://drive.google.com/file/d/0Bz7KyqmuGsilT0J5dmRCM0ROVHc/view?usp=sharing
____

# Document outline (clickable links)
___

### [0: Image data logistics & package imports](#sec0)

#### **RUN THIS SECTION FIRST!**
In part 0 we import the required modules, libraries and packages, define variables used throughout the analysis, and look at the data structure (train, validation, test, and try sets).


### [1: Training a small convnet from scratch: 80% accuracy in 40 lines of code](#sec1)

Here we show how to implement a CNN from scratch and train it using only 2000 images (1000 per class). Training takes about 2.5 hours from scratch; a pre-trained version is included if you want to play with it. We also show how Keras can augment the images so that the CNN never sees exactly the same picture twice.

* **Model filename:** mod_appendix.json
* **Weights filename:** w_appendix.h5

<a id='sec0'></a>
# Part 0
## Image data logistics and package imports


In [1]:
!ls

NOTEBOOK-data-x_breakout7_cnn.ipynb
SOLUTIONS-NOTEBOOK-data-x_breakout7_cnn.ipynb
data
features_test.npy
features_train.npy
features_validation.npy
mod_appendix.json
w_appendix.h5


<center>The files you should have are:</center>

| Files                               |  Files                  |
| ----------------------------------- | ----------------------- |
| NOTEBOOK-data-x_breakout7_cnn.ipynb | features_validation.npy |
| data                                | mod_appendix.json       |
| features_test.npy                   | vgg16_weights.h5        |
| features_train.npy                  | w_appendix.h5           |

In [2]:
# Look at files, note all cat images and dog images are unique
from __future__ import absolute_import, division, print_function
import os
for path, dirs, files in os.walk('./data'):
    print('FOLDER',path)
    for f in files[:4]:
        print(f)

FOLDER ./data
FOLDER ./data/test
FOLDER ./data/test/catvdog
try001.jpg
try002.jpg
try003.jpg
try004.jpg
FOLDER ./data/train
FOLDER ./data/train/cats
cat0001.jpg
cat0002.jpg
cat0003.jpg
cat0004.jpg
FOLDER ./data/train/dogs
dog0001.jpg
dog0002.jpg
dog0003.jpg
dog0004.jpg
FOLDER ./data/validation
FOLDER ./data/validation/cats
cat001001.jpg
cat001002.jpg
cat001003.jpg
cat001004.jpg
FOLDER ./data/validation/dogs
dog001001.jpg
dog001002.jpg
dog001003.jpg
dog001004.jpg


In [3]:
# next(os.walk(path)) returns (path, dirs, files); index 2 is the file list
print('Number of cat training images:', len(next(os.walk('./data/train/cats'))[2]))
print('Number of dog training images:', len(next(os.walk('./data/train/dogs'))[2]))
print('Number of cat validation images:', len(next(os.walk('./data/validation/cats'))[2]))
print('Number of dog validation images:', len(next(os.walk('./data/validation/dogs'))[2]))
print('Number of uncategorized test images:', len(next(os.walk('./data/test/catvdog'))[2]))

# There should be 1000 train cat images, 1000 train dogs, 400 validation cats, 400 validation dogs, 100 uncategorized

Number of cat training images: 1000
Number of dog training images: 1000
Number of cat validation images: 400
Number of dog validation images: 400
Number of uncategorized test images: 100


In [4]:
# Define variables
TRAIN_DIR = './data/train/'
VAL_DIR = './data/validation/'
TEST_DIR = './data/test/' #one mixed category

img_width, img_height = 150, 150

n_train_samples = 2000
n_validation_samples = 800
n_epoch = 30
n_test_samples = 100

In [5]:
# Set theano backend
# see ~/.keras/keras.json and / or https://keras.io/backend/#switching-from-one-backend-to-another
!KERAS_BACKEND=theano python -c "from keras import backend"
from keras import backend as K
K.set_image_dim_ordering('th')

Using Theano backend.


In [6]:
# Import relevant packages
import h5py
import os, cv2, random
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns
%matplotlib inline 

from keras.models import Sequential
from keras.layers import Input, Dropout, Flatten, Convolution2D, MaxPooling2D, Dense, Activation, ZeroPadding2D
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping
from keras.utils import np_utils
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing.image import array_to_img, img_to_array, load_img
from keras.models import model_from_json
from keras.preprocessing import image

from IPython.display import Image, display

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

<a id='sec1'></a>
# 1: Training a small convnet from scratch: 80% accuracy in 40 lines of code

The right tool for an image classification job is a convnet, so let's try to train one on our data as an initial baseline. Since we have only a few examples, our number one concern should be overfitting. Overfitting happens when a model exposed to too few examples learns patterns that do not generalize to new data, i.e. when the model starts using irrelevant features for making predictions. For instance, if you, as a human, only see three images of people who are lumberjacks and three images of people who are sailors, and among them only one lumberjack wears a cap, you might start thinking that wearing a cap is a sign of being a lumberjack as opposed to a sailor. You would then make a pretty lousy lumberjack/sailor classifier.

Data augmentation is one way to fight overfitting, but it isn't enough since our augmented samples are still highly correlated. Your main focus for fighting overfitting should be the entropic capacity of your model, that is, how much information your model is allowed to store. A model that can store a lot of information has the potential to be more accurate by leveraging more features, but it is also more at risk of storing irrelevant features. Meanwhile, a model that can only store a few features will have to focus on the most significant features found in the data, and these are more likely to be truly relevant and to generalize better.

There are different ways to modulate entropic capacity. The main one is the choice of the number of parameters in your model, i.e. the number of layers and the size of each layer. Another way is the use of weight regularization, such as L1 or L2 regularization, which consists of forcing model weights to take smaller values.
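To make the idea of weight regularization concrete, here is a minimal NumPy sketch of the L1 and L2 penalties and of how one gradient step on the L2 penalty alone shrinks the weights. The weight vector, the strength `lam` and the learning rate are made-up illustration values, not settings used anywhere in this notebook.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * np.sum(weights ** 2)

def l2_gradient_step(weights, lam, learning_rate):
    """One gradient-descent step on the L2 penalty only.

    The gradient of lam * sum(w^2) with respect to w is 2 * lam * w,
    so every weight is pulled proportionally towards zero.
    """
    return weights - learning_rate * 2.0 * lam * weights

w = np.array([0.5, -2.0, 3.0])   # hypothetical weights
lam = 0.1                        # hypothetical regularization strength

w_after = l2_gradient_step(w, lam, learning_rate=0.5)
print('L1 penalty:', l1_penalty(w, lam))
print('L2 penalty:', l2_penalty(w, lam))
print('weights after one L2 step:', w_after)
```

In a real training run this penalty is added to the data loss, so the optimizer trades off fitting the data against keeping the weights small.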

In our case we will use a very small convnet with few layers and few filters per layer, alongside data augmentation and dropout. Dropout also helps reduce overfitting by preventing a layer from seeing exactly the same pattern twice, thus acting in a way analogous to data augmentation (you could say that both dropout and data augmentation tend to disrupt random correlations occurring in your data).
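The following is a small NumPy sketch of what dropout does to a layer's activations. It uses the common "inverted dropout" formulation (drop units at random, rescale the survivors); this is an illustration of the idea, not a claim about how Keras implements its `Dropout` layer internally. The `dropout` helper and the toy activations are hypothetical.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero each unit with probability `rate` and rescale the rest.

    Dividing by the keep probability keeps the expected activation
    unchanged, so nothing needs to be rescaled at prediction time.
    """
    keep_prob = 1.0 - rate
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(7)
activations = np.ones((2, 8))    # toy activations for 2 samples, 8 units

# Two passes over the same input give two different patterns
dropped_a = dropout(activations, rate=0.5, rng=rng)
dropped_b = dropout(activations, rate=0.5, rng=rng)
print(dropped_a)
print(dropped_b)
```

Because a fresh random mask is drawn on every pass, the next layer rarely sees the same pattern twice, which is the parallel with data augmentation drawn above.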

The code snippet below is our first model, a simple stack of 3 convolution layers with a ReLU activation and followed by max-pooling layers. This is very similar to the architectures that Yann LeCun advocated in the 1990s for image classification (with the exception of ReLU).
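As a rough check on how small this model is, the sketch below walks through the feature-map sizes and counts the trainable parameters. It assumes the layer sizes used in `first_model` in this notebook (3x3 convolutions with 32, 32 and 64 filters, 2x2 max-pooling, a 64-unit dense layer and a single output unit, on 150x150 RGB input), and it assumes unpadded ("valid") convolutions and pooling that rounds down. The helper functions are hypothetical and written only for this illustration.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - kernel + 1

def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping max-pooling (rounds down)."""
    return size // pool

def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

size, channels, total_params = 150, 3, 0
for filters in (32, 32, 64):
    total_params += conv_params(channels, filters)
    size = pool_output_size(conv_output_size(size))
    channels = filters
    print('after conv + pool with', filters, 'filters:', (channels, size, size))

flat = channels * size * size
total_params += dense_params(flat, 64) + dense_params(64, 1)
print('flattened features:', flat)
print('trainable parameters:', total_params)
```

Under these assumptions almost all of the parameters sit in the first dense layer, so the size of the flattened feature map is what dominates the capacity of this network.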

In order to make the most of our few training examples, we will "augment" them via a number of random transformations, so that our model never sees exactly the same picture twice. This helps prevent overfitting and helps the model generalize better.

In Keras this can be done via the `keras.preprocessing.image.ImageDataGenerator` class. This class allows you to:

* configure random transformations and normalization operations to be done on your image data during training
* instantiate generators of augmented image batches (and their labels) via `.flow(data, labels)` or `.flow_from_directory(directory)`. These generators can then be used with the Keras model methods that accept data generators as inputs: `fit_generator`, `evaluate_generator` and `predict_generator`.
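To illustrate what "a generator of augmented image batches" means, here is a plain-Python sketch that loops over a dataset forever and applies a random horizontal flip to each image. It is a simplified stand-in, not the `ImageDataGenerator` implementation: the function name `augmented_batches`, the tiny random "images" and the single flip transformation are all assumptions made for the example.

```python
import numpy as np

def augmented_batches(images, labels, batch_size, rng):
    """Yield (batch, labels) forever, flipping each image left-right
    with probability 0.5. Images are laid out as (channels, height, width)."""
    n = len(images)
    while True:                               # never stops on its own
        order = rng.permutation(n)            # reshuffle every pass
        for start in range(0, n, batch_size):
            batch_idx = order[start:start + batch_size]
            batch = images[batch_idx].copy()
            flip = rng.uniform(size=len(batch_idx)) < 0.5
            batch[flip] = batch[flip][..., ::-1]   # reverse the width axis
            yield batch, labels[batch_idx]

rng = np.random.RandomState(7)
images = rng.uniform(size=(10, 3, 8, 8)).astype('float32')  # 10 toy images
labels = np.array([0, 1] * 5)

gen = augmented_batches(images, labels, batch_size=4, rng=rng)
for _ in range(4):
    batch, batch_labels = next(gen)
    print(batch.shape, batch_labels.shape)
```

With 10 images and a batch size of 4, the third batch holds only the 2 leftover images and the fourth batch starts a new, reshuffled pass. The caller decides when to stop, which is why training code has to say how many samples make up an epoch.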

In [None]:
# Import image data generator

datagen = ImageDataGenerator(
        rotation_range=40,       # degrees (0-180): range within which to randomly rotate pictures
        width_shift_range=0.2,   # fraction of total width within which to randomly translate pictures
        height_shift_range=0.2,  # same, as a fraction of total height

        # rescale is a value by which we multiply the data before any other processing.
        # Our original images consist of RGB coefficients in the 0-255 range,
        # but such values would be too high for our model to process (given a typical learning rate),
        # so we target values between 0 and 1 instead by scaling with a 1/255. factor.
        rescale=1./255,

        # shear_range is for randomly applying shearing transformations (a shear mapping is a
        # linear map that displaces each point in a fixed direction, by an amount proportional
        # to its signed distance from a line that is parallel to that direction)
        shear_range=0.2,
        zoom_range=0.2,  # for randomly zooming inside pictures

        # horizontal_flip is for randomly flipping half of the images horizontally,
        # relevant when there are no assumptions of horizontal asymmetry (e.g. real-world pictures)
        horizontal_flip=True,

        # fill_mode is the strategy used for filling in newly created pixels,
        # which can appear after a rotation or a width/height shift.
        fill_mode='nearest')
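
The shear mapping described in the comments above can be shown on a handful of points. This is a geometric sketch only: it uses a horizontal shear with a made-up factor `k`, and it makes no claim about how Keras turns `shear_range` into an actual transformation of the image.

```python
import numpy as np

def shear_matrix(k):
    """Horizontal shear: x' = x + k * y, y' = y.

    Points on the x-axis (y = 0) stay put; every other point slides
    sideways by an amount proportional to its signed distance from it.
    """
    return np.array([[1.0, k],
                     [0.0, 1.0]])

# Rows are (x, y) points
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 2.0]])

sheared = points.dot(shear_matrix(0.2).T)
print(sheared)
```

The two points with `y = 0` are unchanged, while the point at `y = 2` moves twice as far as the point at `y = 1`.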

Now let's start generating some pictures using this tool and save them to a temporary directory, so we can get a feel for what our augmentation strategy is doing. (The tutorial this notebook follows disables rescaling for this preview to keep the images displayable; the `datagen` defined above still has `rescale=1./255` set.)

In [None]:
img = load_img(TRAIN_DIR+'cats/cat0001.jpg')  # this is a PIL image
x = img_to_array(img)  # this is a Numpy array with shape (3, height, width) under 'th' dim ordering
x = x.reshape((1,) + x.shape)  # this is a Numpy array with shape (1, 3, height, width)

# the .flow() command below generates batches of randomly transformed images
# and saves the results to the `preview/` directory
i = 0

if not os.path.exists('preview'):
    os.makedirs('preview')

for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='preview', save_prefix='cat', save_format='jpeg'):
    i += 1
    if i > 20:
        break  # otherwise the generator would loop indefinitely

prev_files = next(os.walk('./preview'))[2]
print(prev_files[:4])

for img in prev_files[1:4]:
    print('Image '+img)
    display(Image(filename='preview/'+img))

In [None]:
def first_model():

    model = Sequential()
    model.add(Convolution2D(32, 3, 3, input_shape=(3, img_height, img_width)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Convolution2D(32, 3, 3))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Convolution2D(64, 3, 3))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    
    # On top of it we stick two fully-connected layers.
    # We end the model with a single unit and a sigmoid activation, which is perfect for binary classification.
    # To go with it we will also use the binary_crossentropy loss to train our model.

    model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors
    model.add(Dense(64))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    
    # Let's prepare our data. We will use .flow_from_directory()
    # to generate batches of image data (and their labels) directly from our jpgs in their respective folders.
    
    # Below is the augmentation configuration we will use for training
    train_datagen = ImageDataGenerator(
            rescale=1./255,
            shear_range=0.2,
            zoom_range=0.2,
            horizontal_flip=True)

    # this is the augmentation configuration we will use for testing:
    # only rescaling
    test_datagen = ImageDataGenerator(rescale=1./255)

    # this is a generator that will read pictures found in
    # subfolders of 'data/train', and indefinitely generate
    # batches of augmented image data
    print('Train generator')
    train_generator = train_datagen.flow_from_directory(
            TRAIN_DIR,  # this is the target directory
            target_size=(img_height, img_width),  # all images will be resized to 150x150
            batch_size=32,
            class_mode='binary')  # since we use binary_crossentropy loss, we need binary labels

    # this is a similar generator, for validation data
    print('Validation generator')
    validation_generator = test_datagen.flow_from_directory(
            VAL_DIR,
            target_size=(img_height, img_width),
            batch_size=32,
            class_mode='binary')
    

    
    return model, train_generator, validation_generator
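
The closing layers of `first_model` pair a single sigmoid unit with the `binary_crossentropy` loss. The NumPy sketch below shows what those two pieces compute, using made-up pre-activation values and labels. The clipping constant `eps` is an assumption added here to avoid `log(0)`, and the functions are illustrative rather than the Keras implementations.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))

logits = np.array([-2.0, 0.0, 3.0])   # hypothetical outputs of Dense(1)
y_true = np.array([0.0, 1.0, 1.0])    # 0 = first class, 1 = second class

probs = sigmoid(logits)
print('probabilities:', probs)
print('loss with correct labels:', binary_crossentropy(y_true, probs))
print('loss with flipped labels:', binary_crossentropy(1.0 - y_true, probs))
```

The loss is small when confident predictions agree with the labels and grows quickly when they disagree, which is why it suits a single probability output.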

In [None]:
# Look at class indices from our generators


_, train_gen,val_gen =first_model()
print('')
print(val_gen.class_indices)
print(val_gen.classes)
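
The `class_indices` mapping printed above comes from the subfolder names. As far as the Keras documentation describes it, `flow_from_directory` infers the classes from the subdirectories and assigns label indices in alphanumeric order. The helper below is a hypothetical sketch of that rule, not Keras code.

```python
def class_indices_from_folders(folder_names):
    """Map each class folder to an integer label, in sorted order."""
    return {name: index for index, name in enumerate(sorted(folder_names))}

mapping = class_indices_from_folders(['dogs', 'cats'])
print(mapping)
```

With the `cats` and `dogs` folders used here, label 0 means cat and label 1 means dog, which is the convention the prediction code at the end of this notebook relies on.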

In [None]:
# Define and fit the first model
n_epoch = 50
def fit_first_model():

    mod1, train_generator, validation_generator = first_model()
    mod1.fit_generator(
            train_generator,
            samples_per_epoch=n_train_samples,
            nb_epoch=n_epoch,
            validation_data=validation_generator,
            nb_val_samples=n_validation_samples)

    # save model to disk
    mod1.save_weights('w_appendix.h5')  # always save your weights after training or during training
    model_json = mod1.to_json()
    with open("mod_appendix.json", "w") as json_file:
        json_file.write(model_json)
    print("Saved model to disk")

#fit_first_model()


### DONE ###
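
The sample counts passed to `fit_generator` above determine how many batches the generators are asked for. Here is a small arithmetic sketch, assuming the batch size of 32 set in `first_model`; the helper `batches_to_cover` is hypothetical.

```python
import math

def batches_to_cover(n_samples, batch_size):
    """Number of batches needed to see n_samples once, and the size of the last one."""
    n_batches = int(math.ceil(n_samples / float(batch_size)))
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

batch_size = 32
train_batches, train_last = batches_to_cover(2000, batch_size)
val_batches, val_last = batches_to_cover(800, batch_size)

print('training batches per epoch:', train_batches, '(last batch has', train_last, 'images)')
print('validation batches:', val_batches, '(last batch has', val_last, 'images)')
```

This matters mostly if the code is ported to Keras 2, where, to my knowledge, the corresponding arguments (`steps_per_epoch`, `validation_steps`) count batches rather than samples.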

In [None]:
# FIRST MODEL EXPLORATION

# load model 1 and weights

with open('mod_appendix.json', 'r') as json_file:
    loaded_model_json = json_file.read()
mod1 = model_from_json(loaded_model_json)
# load weights into new model
mod1.load_weights("w_appendix.h5")
print("Loaded model from disk")

mod1.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

# Evaluate the loaded model on the validation set

datagen = ImageDataGenerator(rescale=1./255)

# generator for the validation data (only rescaling, no augmentation)
val_generator = datagen.flow_from_directory(VAL_DIR, target_size=(img_height, img_width),
                                            batch_size=32, class_mode='binary')

# evaluate_generator returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
val_loss, val_acc = mod1.evaluate_generator(val_generator, n_validation_samples)

print('\nModel 1 accuracy on 800 validation images:', round(val_acc * 100, 2), '%')

In [None]:
# Plot picture and print class prediction on cats vs dogs (unsorted)


try_images =  [TEST_DIR+'catvdog/'+img for img in os.listdir(TEST_DIR+'catvdog/')]

def read_image(file_path):
    # For image visualization
    img = cv2.imread(file_path, cv2.IMREAD_COLOR) #cv2.IMREAD_GRAYSCALE
    # cv2.resize expects the target size as (width, height)
    return cv2.resize(img, (img_width, img_height), interpolation=cv2.INTER_CUBIC)

def plot_pic(img):
    # Plot openCV pic; OpenCV loads images as BGR, so convert to RGB for matplotlib
    pic = cv2.cvtColor(read_image(img), cv2.COLOR_BGR2RGB)
    plt.figure(figsize=(5,5))
    plt.imshow(pic)
    plt.show()

def predict(mod,i=0,r=None):
    if r is None:
        r=[i]
        
    for idx in r:
        
        img_path = try_images[idx]
        img = image.load_img(img_path, target_size=(img_height, img_width))
        # rescale to [0, 1] to match the rescale=1./255 preprocessing used during training
        x = image.img_to_array(img) / 255.
        x = np.expand_dims(x, axis=0)
        class_pred = mod.predict_classes(x,verbose=0)
        
        if class_pred == 0:
            class_guess='CAT'
        else:
            class_guess='DOG'
        
        print('\n\nI think this is a ' + class_guess)
        plot_pic(try_images[idx])

predict(mod1,r=range(1,10))
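
For a single sigmoid output, turning a predicted probability into a class label amounts to a threshold. The sketch below assumes a 0.5 cut-off and the 0 = cat, 1 = dog convention used by `predict` above. The helper and the example probabilities are hypothetical and are not outputs of the trained model.

```python
CLASS_NAMES = {0: 'CAT', 1: 'DOG'}

def label_from_probability(prob, threshold=0.5):
    """Return the class name for a sigmoid output in [0, 1]."""
    return CLASS_NAMES[int(prob > threshold)]

for p in (0.03, 0.49, 0.51, 0.98):
    print(p, '->', label_from_probability(p))
```

Probabilities close to 0.5 are the cases where the model is least sure, so they are worth inspecting by eye when browsing the predictions.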