## Dog vs Cat classification, a Convolutional Neural Network approach
This is a simple convolutional network model conceived both as a little practice and as a small baseline for further improvements.

# Index

<a href="#control">• Control Variables </a>

<a href="#hiperparameters">• Hyperparameters </a>

<a href="#loading">• Data loading </a>

<a href="#preprocessing">• Data preprocessing </a>

<a href="#augmentation">• Data augmentation </a>

<a href="#visualization">• Data visualization </a>

<a href="#instantiation">• Model Instantiation </a>

<a href="#callbacks">• Callbacks </a>

<a href="#compilation">• Model compilation </a>

<a href="#training">• Model training </a>

<a href="#evaluation">• Model result evaluation </a>

<a href="#submission">• Result submission </a>

<a href="#references">• References </a>



<a class="anchor" id="control"></a>

# Control variables 

Variables that control some aspects of the model by changing a single value, which is very useful for quickly testing new ideas.

Feel free to toy with them to test out how different choices affect the model.

In [None]:
dataset_subset = False # Train with only a subset of the data for quick tests
data_subset_size = 2000 # Subset size (for each class)
color = True # Keep the color dimension or else load the data in greyscale
data_generation = True # Perform data augmentation
assign_test_labels = False # Sets all test predictions to either 1 or 0
quick_training = False # Reduces the number of epochs to 10%

early_stop_overfitting = True # Stop the training if the model doesn't improve, in order to prevent overfitting
learning_rate_smoothing = True # Reduce the learning rate during fitting if the model isn't improving


<a class="anchor" id="hiperparameters"></a>

# Hyperparameters 
Model hyperparameters.

Feel free to try out different values here as well.

In [None]:
# Regular CNN hyperparameters
batch_size = 16
num_clases = 1
epochs = 100
conv_kernel_size = 3

# CNN fine-tuning hyperparameters
default_dropout_rate = 0.2
regularizaion_weight = 0.001
learning_rate_reduction_factor = 0.5

# Data hyperparameters
img_width = 132
img_height = 132
validation_size = 0.2

if color:
    img_channels = 3
else:
    img_channels = 1
    
if quick_training:
    epochs = max(1, int(epochs * 0.1)) # Keras expects an integer number of epochs
    

<a class="anchor" id="loading"></a>

# Data loading 

In [None]:
import os # data fetching
import random # training set shuffling
import gc # garbage collector to clean memory
import cv2 # image preprocessing


In [None]:
# Dataset directory check
print(os.listdir("../input/"))


In [None]:
train_dir = '../input/train'
test_dir = '../input/test'

if dataset_subset:
    train_dogs = ['../input/train/{}'.format(filename) for filename in os.listdir(train_dir) if 'dog' in filename]
    train_cats = ['../input/train/{}'.format(filename) for filename in os.listdir(train_dir) if 'cat' in filename]
    
    # Only a small portion of both classes is used, in favor of quicker tests
    train_imgs = train_dogs[:data_subset_size] + train_cats[:data_subset_size]
    
    # Memory freeing tasks
    del train_dogs
    del train_cats
    gc.collect()
    
else:
    train_imgs = ['../input/train/{}'.format(filename) for filename in os.listdir(train_dir)]
    

random.shuffle(train_imgs)

test_imgs = ['../input/test/{}'.format(test_img) for test_img in os.listdir(test_dir)]
# The ids are the numbers in the filenames, parsed as integers, without the directory prefix or the extension
test_ids = [int(test_img[14:-4]) for test_img in test_imgs]
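
The slice `[14:-4]` relies on the exact length of the `'../input/test/'` prefix and on a three-letter extension. A minimal sketch of a less brittle alternative follows, assuming file names of the form `<number>.jpg`; the helper name `extract_test_id` is hypothetical and not part of the notebook.

```python
import os

def extract_test_id(img_path):
    """Return the numeric id encoded in a file name such as '../input/test/42.jpg'."""
    filename = os.path.basename(img_path)           # '42.jpg'
    stem, _extension = os.path.splitext(filename)   # ('42', '.jpg')
    return int(stem)

# Example paths are made up for illustration
example_paths = ['../input/test/1.jpg', '../input/test/12500.jpg']
example_ids = [extract_test_id(path) for path in example_paths]
print(example_ids)
```

With this helper, the list comprehension above could be written as `[extract_test_id(test_img) for test_img in test_imgs]`, and it would keep working if the directory or the extension changed.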

<a class="anchor" id="preprocessing"></a>

# Data preprocessing 

In [None]:
import numpy as np # linear algebra
from sklearn.model_selection import train_test_split # train-validation splitter

In [None]:
def preprocess_images(img_path_list):
    """
    Loads and preprocesses all the images whose paths are included in img_path_list.
    Returns
        X: list of resized images
        y: list of labels (1 for dog, 0 for cat; empty when the paths carry no label)
    """
    X = []
    y = []
    
    for img_path in img_path_list:
        if color:
            # IMREAD_COLOR always yields 3 channels, which is what cvtColor expects below
            x = cv2.imread(img_path, cv2.IMREAD_COLOR)
            # cv2 loads images as BGR by default; they are converted to RGB so the
            # dataset is displayed correctly (this has no effect on the training)
            x = cv2.cvtColor(x, cv2.COLOR_BGR2RGB)
        else:
            x = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            
        x = cv2.resize(x, (img_height, img_width))
        X.append(x)
            
        if 'dog' in img_path:
            y.append(1)          
        elif 'cat' in img_path:
            y.append(0)

    return X, y


In [None]:
X, y = preprocess_images(train_imgs)

del train_imgs
gc.collect()

# Validation set splitting
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=validation_size)

del X
del y
gc.collect()

X_train = np.array(X_train)
X_val = np.array(X_val)
y_train = np.array(y_train)
y_val = np.array(y_val)

train_size = X_train.shape[0]
val_size = X_val.shape[0]

print("Train and validation shapes")
print("X_train: " + str(X_train.shape))
print("X_val: " + str(X_val.shape))
print("y_train: " + str(y_train.shape))
print("y_val: " + str(y_val.shape))


In [None]:
X_test, _ = preprocess_images(test_imgs)

X_test = np.array(X_test).astype('float32')

# If data generation is being used, the training images are rescaled to floats between 0 and 1,
# so the test dataset must be rescaled in the same way to match what the model learns to handle.
if data_generation:
    X_test /= 255

# Memory freeing tasks
del test_imgs
gc.collect()
    
print("Test dataset shape: ")
print(X_test.shape)


<a class="anchor" id="augmentation"></a>

# RAM Data augmentation 
Keras' ImageDataGenerator class is used to create an image generator from our dataset that produces modified versions of the pictures already present in it.

This is a very common way to fight overfitting: since the dataset varies continuously, the model cannot rely on a closed set of characteristics to predict the output. Generalization, arise!
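
To make the idea concrete, here is a toy sketch of augmentation in plain Python. It is not what ImageDataGenerator does internally: it only applies a random horizontal flip and a small horizontal shift to a tiny grayscale image stored as a list of rows, and shifted pixels wrap around instead of being reflected. The function name `augment` and its `max_shift` parameter are assumptions made for this example.

```python
import random

def augment(image, rng, max_shift=1):
    """Return a randomly flipped and horizontally shifted copy of a 2-D image (list of rows)."""
    flip = rng.random() < 0.5                   # flip half of the time
    shift = rng.randint(-max_shift, max_shift)  # shift in pixels, may be 0
    augmented = []
    for row in image:
        new_row = row[::-1] if flip else list(row)
        width = len(new_row)
        # Move every pixel `shift` columns to the right; the modulo makes pixels wrap around
        new_row = [new_row[(col - shift) % width] for col in range(width)]
        augmented.append(new_row)
    return augmented

image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
rng = random.Random(0)
variants = [augment(image, rng) for _ in range(3)]
for variant in variants:
    print(variant)
```

Every call yields an image with the same content and label but a different layout, which is why the model sees a slightly different dataset on each epoch.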

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator 
# Documentation: https://keras.io/preprocessing/image/


In [None]:
if data_generation:
    
    # Adds a fourth dimension (channels) when the images have a single channel
    if img_channels == 1:
        X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
        X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], X_val.shape[2], 1)
        
    data_augmentator = ImageDataGenerator(rescale=1./255, rotation_range=0.2, shear_range=0.1, zoom_range=0.2,
                                          width_shift_range=0.1, height_shift_range=0.1, fill_mode='reflect',horizontal_flip=True)
    data_augmentator.fit(X_train)
    data_generator = data_augmentator.flow(X_train, y_train, batch_size=batch_size)
    
    val_augmentator = ImageDataGenerator(rescale=1./255)
    val_generator = val_augmentator.flow(X_val, y_val, batch_size=batch_size)

    # Dimension restitution
    if img_channels == 1:
        X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2])
        X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], X_val.shape[2])
        

<a class="anchor" id="visualization"></a>

# Data visualization 
Matplotlib is used to plot the numpy arrays that contain preprocessed images.

This allows us to verify the state of our input since, for example, we might be loading it incorrectly, or our augmentation might be affecting it in unexpected ways.

In [None]:
# The good, the bad and the ugly
%matplotlib inline 

from matplotlib import pyplot as plt # data visualization


In [None]:
def plot_data(X, y, num_figures):
    """
    Plots the first num_figures images stored in X in a single row,
    each one titled with its corresponding label in y.
    """
    plt.figure(figsize=(30, 20))

    for i in range(num_figures):
        plt.subplot(2, num_figures, i+1)
        if color:
            plt.imshow(X[i])
        else:
            plt.imshow(X[i], cmap='gray')
        if y[i] >= 0.5:
            plt.title("Doge ("+ str(y[i]) + ")", fontsize=30)
        else:
            plt.title("Catto ("+ str(y[i]) + ")", fontsize=30)
            
    plt.tight_layout()
    plt.show()    


In [None]:
# Showcase of the preprocessed training dataset with its labels
for i in range(0, 24, 6): 
    plot_data(X_train[i:], y_train[i:], 6)
    

In [None]:
if data_generation:
    
    for X_train_gen, y_train_gen in data_generator:
        
        if img_channels == 1:
            X_train_gen = X_train_gen.reshape(X_train_gen.shape[0], X_train_gen.shape[1], X_train_gen.shape[2])
        
        print("X_train_gen shape: " + str(X_train_gen.shape))
        print("y_train_gen shape: " + str(y_train_gen.shape))
        
        for i in range(0, batch_size-6, 6):
            plot_data(X_train_gen[i:], y_train_gen[i:], 6)
            
        del X_train_gen
        del y_train_gen
        gc.collect()
        
        break

    

<a class="anchor" id="instantiation"></a>

# Model instantiation 

Keras' Sequential model will be used to build a simple Convolutional Neural Network.

This needs little introduction, although I found out about both Spatial Dropout and Batch Normalization while researching for this kernel.

This section is probably the one that allows for the most experimentation, as the model can change in the most meaningful ways without affecting the rest of the implementation.

You can find articles on both techniques below, in the References section.

In [None]:
# Imports
from tensorflow.keras.models import Sequential # Documentation: https://keras.io/models/sequential/
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense # Documentation: https://keras.io/layers/core/, https://keras.io/layers/convolutional/
from tensorflow.keras.layers import Dropout, SpatialDropout2D, BatchNormalization
from tensorflow.keras.optimizers import Adam, RMSprop # Documentation: https://keras.io/optimizers/
from tensorflow.keras.regularizers import l2 # Documentation: https://keras.io/regularizers/


In [None]:
def add_regularization_layer(model, layer_type, rate=default_dropout_rate):
    """
    Adds a regularization layer of the given type to the model:
    "batch_normalization", "spatial_dropout" or "dropout".
    It can be called several times to stack more than one kind of layer,
    although you probably want to add only one of them.
    
    The 'rate' parameter only affects dropout layers.
    """
    if layer_type == "batch_normalization":
        model.add(BatchNormalization())
    elif layer_type == "spatial_dropout":
        model.add(SpatialDropout2D(rate))
    elif layer_type == "dropout":
        model.add(Dropout(rate))

In [None]:
model = Sequential()

model.add(Conv2D(32, kernel_size=(conv_kernel_size, conv_kernel_size), activation='relu', input_shape=(img_width, img_height, img_channels))) # Strides are, by default, (1,1)
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"spatial_dropout", 0.25)
model.add(MaxPooling2D(pool_size=(2,2))) # Strides are, by default, of the same size of the pool size


model.add(Conv2D(64, kernel_size=(conv_kernel_size, conv_kernel_size), activation='relu'))
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"spatial_dropout", 0.25)
model.add(MaxPooling2D(pool_size=(2,2)))


model.add(Conv2D(128, kernel_size=(conv_kernel_size, conv_kernel_size), activation='relu'))
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"spatial_dropout", 0.25)
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(256, kernel_size=(conv_kernel_size, conv_kernel_size), activation='relu'))
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"spatial_dropout", 0.25)
model.add(Conv2D(256, kernel_size=(conv_kernel_size, conv_kernel_size), activation='relu'))
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"spatial_dropout", 0.25)
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten()) 

model.add(Dense(1024, activation='relu', kernel_regularizer=l2(regularizaion_weight)))
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"dropout", 0.5)
model.add(Dense(1024, activation='relu', kernel_regularizer=l2(regularizaion_weight)))
add_regularization_layer(model,"batch_normalization")
add_regularization_layer(model,"dropout", 0.5)
model.add(Dense(num_clases, activation='sigmoid', kernel_regularizer=l2(regularizaion_weight)))
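
As a sanity check on the stack above, the spatial size of the feature maps can be worked out by hand. The sketch below assumes the Keras defaults used in this model ('valid' padding, convolution stride 1, pooling stride equal to the pool size) and the default 132x132 input; the helper names are made up for this example. Batch normalization and dropout layers do not change the shape, so they are left out.

```python
def conv_output_size(size, kernel_size=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size=2):
    """Spatial size after a 'valid' max pooling whose stride equals the pool size."""
    return size // pool_size

# Order of the shape-changing layers in the model above
layers = ['conv', 'pool', 'conv', 'pool', 'conv', 'pool', 'conv', 'conv', 'pool']

size = 132  # img_width and img_height
sizes = [size]
for layer in layers:
    size = conv_output_size(size) if layer == 'conv' else pool_output_size(size)
    sizes.append(size)

flattened_units = size * size * 256  # 256 filters in the last convolution
print(sizes)            # [132, 130, 65, 63, 31, 29, 14, 12, 10, 5]
print(flattened_units)  # 6400
```

Under these assumptions the Flatten layer feeds 6400 values into the first Dense layer, which should match the shapes reported by `model.summary()`.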


<a class="anchor" id="callbacks"></a>

# Callbacks
As stated in the documentation, "A callback is a set of functions to be applied at given stages of the training procedure".

Callbacks are used in this model to reduce the learning rate as training goes on, and to stop the training when the model stops improving on the validation set, which prevents the resulting model from overfitting.

The idea of using callbacks was found in another submission, whose link can be found below in the <a href="#references">references section</a>.
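
The behaviour of both callbacks can be illustrated with a small simulation in plain Python. This is a simplified sketch, not the Keras implementation: it assumes both mechanisms watch the same validation loss and ignores options such as cooldown or minimum delta. The function name `simulate_callbacks` and the loss values are made up for the example.

```python
def simulate_callbacks(val_losses, initial_lr=0.001, lr_patience=2, lr_factor=0.5,
                       min_lr=0.00001, stop_patience=5):
    """Return (epoch at which training stops or None, learning rate used at each epoch)."""
    best = float('inf')
    lr = initial_lr
    stop_wait = 0   # epochs without improvement, for early stopping
    lr_wait = 0     # epochs without improvement since the last learning rate reduction
    lr_per_epoch = []
    for epoch, loss in enumerate(val_losses):
        lr_per_epoch.append(lr)
        if loss < best:
            best = loss
            stop_wait = 0
            lr_wait = 0
            continue
        stop_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr = max(lr * lr_factor, min_lr)  # never go below min_lr
            lr_wait = 0
        if stop_wait >= stop_patience:
            return epoch, lr_per_epoch
    return None, lr_per_epoch

# Made-up validation losses: they improve for three epochs and then get worse
val_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60, 0.61, 0.50]
stopped_epoch, lr_per_epoch = simulate_callbacks(val_losses)
print(stopped_epoch)
print(lr_per_epoch)
```

In this run the learning rate is halved twice while the loss is stuck, and training stops five epochs after the best one, so the final good value in the list is never reached. That is the trade-off of a short patience.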


In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# Documentation: https://keras.io/callbacks/

In [None]:
# The list stays empty if both control variables are disabled
callbacks = []

# Stops the training if the validation loss (the default monitored quantity) does not improve
if early_stop_overfitting:
    early_stop = EarlyStopping(patience=5)
    callbacks.append(early_stop)

# Reduces the learning rate of the gradient descent if the validation accuracy does not improve.
# Note: depending on the Keras version, this metric is named "val_acc" or "val_accuracy".
if learning_rate_smoothing:
    learning_rate_reduction = ReduceLROnPlateau(monitor="val_acc", patience=2, factor=learning_rate_reduction_factor, min_lr=0.00001, verbose=1)
    callbacks.append(learning_rate_reduction)


<a class="anchor" id="compilation"></a>

# Model compilation 

In [None]:
model.summary()

model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
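
For reference, the `binary_crossentropy` loss averages `-(y * log(p) + (1 - y) * log(1 - p))` over the samples, where `y` is the label (1 for dog, 0 for cat) and `p` is the sigmoid output. The sketch below is a plain Python re-implementation for illustration only, not Keras' own code; the `epsilon` clipping value is an assumption.

```python
import math

def binary_crossentropy(y_true, y_pred, epsilon=1e-7):
    """Mean binary cross-entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        # Clip to avoid log(0) for predictions of exactly 0 or 1
        prob = min(max(prob, epsilon), 1 - epsilon)
        total -= target * math.log(prob) + (1 - target) * math.log(1 - prob)
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident wrong predictions are penalized far more than confident right ones are rewarded, which is why the loss can rise on the validation set even while accuracy stays flat.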


<a class="anchor" id="training"></a>

# Model training 

In [None]:
if data_generation:
    
    history = model.fit_generator(data_generator, epochs=epochs, validation_data=val_generator, 
                                  steps_per_epoch=train_size//batch_size, validation_steps=val_size//batch_size, 
                                  callbacks=callbacks, verbose=2)

else:
    
    # Adds a fourth dimension (channels) when the images have a single channel
    if img_channels == 1:
        X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
        X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], X_val.shape[2], 1)

    # When fitting on arrays, batch_size already determines the number of steps per epoch
    history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val),
                        callbacks=callbacks, verbose=2)

    # Dimension restitution
    if img_channels == 1:
        X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2])
        X_val = X_val.reshape(X_val.shape[0], X_val.shape[1], X_val.shape[2])
    

<a class="anchor" id="evaluation"></a>

# Model result evaluation 
Matplotlib is used again, this time to plot the epoch-by-epoch progression of metrics such as accuracy and loss, which gives insight into the training progress and allows its evaluation.

One of the most important things to check here is whether the model has suffered from overfitting, that is, whether it has adapted too closely to the training set. That makes it worse at generalizing (classifying data not present in the training dataset), which is precisely our objective.

The most obvious symptom is a growing gap between the training and validation metrics, usually because the training scores keep improving over the epochs while the validation scores stall or even get worse.

Methods such as Data Augmentation and Dropout layers have been used in the model to counter it.
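
The symptom described above can also be checked numerically. The following sketch works on a dictionary shaped like `history.history`; the helper names and the numbers are made up for illustration and are not results from this notebook.

```python
def generalization_gap(history):
    """Per-epoch difference between validation loss and training loss."""
    return [val - train for train, val in zip(history['loss'], history['val_loss'])]

def best_epoch(history):
    """Index of the epoch with the lowest validation loss."""
    val_losses = history['val_loss']
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Made-up numbers: training loss keeps falling while validation loss turns around
example_history = {
    'loss':     [0.69, 0.55, 0.45, 0.36, 0.28, 0.21],
    'val_loss': [0.68, 0.57, 0.50, 0.49, 0.52, 0.58],
}
gaps = generalization_gap(example_history)
print([round(gap, 2) for gap in gaps])
print(best_epoch(example_history))
```

A gap that keeps widening after the best epoch is the numeric counterpart of the diverging curves in the plots below.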

In [None]:
# The accuracy key is 'acc' or 'accuracy' depending on the Keras version
acc_key = 'acc' if 'acc' in history.history else 'accuracy'

# summarize history for accuracy
plt.plot(history.history[acc_key])
plt.plot(history.history['val_' + acc_key])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()


In [None]:
# Adds a fourth dimension (channels) when the images have a single channel
if img_channels == 1:
    X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)
    
predictions = model.predict(X_test, verbose=0)

# Dimension restitution
if img_channels == 1:
    X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2])
    

In [None]:
# Showcase of the preprocessed test dataset with its predictions
for i in range(0, 60, 6): 
    plot_data(X_test[i:], predictions[i:], 6)
    

<a class="anchor" id="submission"></a>
# Results submission

In [None]:
import pandas as pd

In [None]:
# Predictions array must be reshaped into a single dimension array in order to create the dataframe
predictions = predictions.reshape(predictions.shape[0])

if assign_test_labels:
    labels = [1 if pred >= 0.5 else 0 for pred in predictions]
else:
    labels = predictions

submission = pd.DataFrame({'id':test_ids , 'label':labels})

# Let's check what we've got
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)

<a class="anchor" id="references"></a>

# References 

### Data augmentation

https://machinelearningmastery.com/image-augmentation-deep-learning-keras/

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html

### Overfitting prevention

#### Overall theoretical approach

https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/

#### Dropout in Keras

https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/

#### Spatial Dropout

https://towardsdatascience.com/review-tompson-cvpr15-spatial-dropout-human-pose-estimation-c7d6a5cecd8c

#### Batch Normalization

https://towardsdatascience.com/intuit-and-implement-batch-normalization-c05480333c5b

https://towardsdatascience.com/dont-use-dropout-in-convolutional-networks-81486c823c16

https://github.com/harrisonjansma/Research-Computer-Vision/blob/master/08-12-18%20Batch%20Norm%20vs%20Dropout/08-12-18%20Batch%20Norm%20vs%20Dropout.ipynb

### References to other submissions

https://www.kaggle.com/uysimty/keras-cnn-dog-or-cat-classification 

(From this other submission I got both the idea of using callbacks to improve model training and more adequate data augmentation parameters. Later, I also took from it the idea of using batch normalization after the first dense layer in the classification part of the model.)

### Very good end-to-end tutorials

https://towardsdatascience.com/image-detection-from-scratch-in-keras-f314872006c9

https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html