The [Asirra](https://www.microsoft.com/en-us/research/project/asirra/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fredmond%2Fprojects%2Fasirra%2F) Dogs vs. Cats dataset is a great dataset for beginners to get started with image classification. The goal of the prediction task is to output 1 for a dog image and 0 for a cat image.

This notebook contains a detailed EDA on the dataset, followed by some fairly simple Keras CNN architectures. Some model improvement techniques are considered to address overfitting, and we'll evaluate how well these models perform.



In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
from plotly import graph_objs as go
import pandas as pd
import os
import seaborn as sns
from tqdm import tqdm_notebook as tqdm

## Table of contents 
1. [Understanding the folder structures](#folders)
2. [Consolidate labels for images](#labels)
3. [EDA](#eda)
    * [Looking at the images](#looking)
    * [Cat vs Dog Frequencies](#freqplot)
    * [Image dimensions](#dimensions)
4. [A simple first CNN with Keras](#firstcnn)
    * [Defining the model architecture](#architecture)
    * [Data preprocessing](#preprocessing)
    * [Model evaluation](#evaluationfirst)
5. [Improving the architecture](#improvedmodel)

## <a class='anchor' id ='folders'> Understanding the folder structures </a>
Let's take a look at the folder structure:

In [None]:
os.listdir("../input/")

The train data is contained within a .zip file. Let's extract it into a working folder under `../kaggle/working`.

In [None]:
import zipfile 
with zipfile.ZipFile("../input/"+"train"+".zip","r") as z:
    z.extractall("../kaggle/working/temp_unzip")
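Before extracting a large archive it can help to peek at its contents. The sketch below builds a tiny in-memory archive (the two entry names are hypothetical stand-ins, not the real file list) and reads the names back with `ZipFile.namelist()`, which does not extract anything to disk.

```python
import io
import zipfile

# Build a small in-memory archive so the example is self-contained.
# The entry names are hypothetical placeholders.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("train/cat.0.jpg", b"")
    z.writestr("train/dog.0.jpg", b"")

# List the archive members without extracting them.
with zipfile.ZipFile(buffer, "r") as z:
    names = z.namelist()

print(names)
```

The same `namelist()` call should work on a `ZipFile` opened from a path such as the train archive above.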

Let's take a look at the first ten items in the folder we've extracted:

In [None]:
print(f"List of first ten image filenames: \n {os.listdir('../kaggle/working/temp_unzip/train')[:10]}")
print(f"Total number of images in training data: {len(os.listdir('../kaggle/working/temp_unzip/train'))}")

## <a class='anchor' id = 'labels'> Consolidate labels for images </a>

In [None]:
filenames = os.listdir('../kaggle/working/temp_unzip/train')
labels = [str(x)[:3] for x in filenames]
train_df = pd.DataFrame({'filename': filenames, 'label': labels})
train_df.head()

Let's encode the categorical labels as 1 (dog) or 0 (cat):

In [None]:
train_df['label'] = train_df['label'].map({'dog': '1', 'cat':'0'})
train_df.head()
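The two steps above (take the filename prefix, then map it to a class) can also be sketched as a single helper. This is a minimal illustration, assuming filenames start with `cat` or `dog` followed by a dot; `label_from_filename` and the example filenames are hypothetical. It returns integers for readability, whereas the notebook keeps the labels as strings for the data generator used later.

```python
def label_from_filename(filename):
    """Return 1 for a dog image and 0 for a cat image, based on the filename prefix."""
    prefix = filename.split(".")[0].lower()
    mapping = {"dog": 1, "cat": 0}
    if prefix not in mapping:
        # Fail loudly instead of silently producing a missing label.
        raise ValueError(f"Unexpected filename prefix: {filename!r}")
    return mapping[prefix]


examples = ["cat.0.jpg", "dog.12499.jpg"]  # hypothetical filenames
encoded = [label_from_filename(name) for name in examples]
print(encoded)
```

Raising on an unknown prefix makes stray files in the folder visible early, rather than leaving missing values in the label column.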

## <a class='anchor' id='eda'> EDA </a>

In this section I'll explore the dataset in some depth. This includes looking at some examples of each image, exploring the distribution of cat vs dog classes, and studying the image dimensions.

### <a class='anchor' id='looking'> Looking at the images </a>

Let's plot 5 instances of each class:

In [None]:
print(f"The data has {train_df['label'].nunique()} unique classes")

for lab in train_df['label'].unique(): 
    #Subset to just that target 
    label_df = train_df[train_df['label']==lab].reset_index()
    cols = 5
    rows = 1
    fig = plt.figure(figsize = (4*cols - 1, 4.5*rows - 1))
    for c in range(cols):
        for r in range(rows):
            ax = fig.add_subplot(rows, cols, c*rows + r + 1)
            img = mpimg.imread('../kaggle/working/temp_unzip/train/'+label_df['filename'][c+r])
            ax.imshow(img)
            ax.set_title(str(label_df['filename'][c+r]))
    fig.suptitle(str(label_df['filename'][c+r][:3].upper()))
    plt.show()
    plt.close()

### <a class='anchor' id='freqplot'> Cat vs Dog frequencies </a>

We see that the classes are balanced, with 12,500 images in each class. 

In [None]:
pd.DataFrame(train_df['label'].value_counts().reset_index())

### <a class='anchor' id='dimensions'> Image dimensions </a>

Let's break down the image dimensions to understand how they are distributed. This will help us at the modelling stage.

In [None]:
dims_dict = {'image': [], 'width': [], 'height': [], 'channels': []}
for i in tqdm(range(len(train_df))):
    dims = mpimg.imread('../kaggle/working/temp_unzip/train/'+train_df['filename'][i]).shape
    dims_dict['image'].append(train_df['filename'][i])
    dims_dict['height'].append(dims[0])
    dims_dict['width'].append(dims[1])
    dims_dict['channels'].append(dims[2])

dims_df = pd.DataFrame(dims_dict)
dims_df.head()

In [None]:
sns.distplot(dims_df['height'])
plt.title('Distribution of image heights');
plt.show()

In [None]:
sns.distplot(dims_df['width'])
plt.title('Distribution of image widths');

Let's split this between dogs and cats to see if there's any significant difference in the distribution of image dimensions.

In [None]:
dims_df['label'] = dims_df['image'].apply(lambda x: x[:3])
dims_df.head(3)

In [None]:
sns.distplot(dims_df[dims_df['label']=='dog']['height'], label='dog')
sns.distplot(dims_df[dims_df['label']=='cat']['height'], label='cat')
plt.title('Distribution of image heights between cats and dogs')
plt.legend();
plt.show()

In [None]:
sns.distplot(dims_df[dims_df['label']=='dog']['width'], label='dog')
sns.distplot(dims_df[dims_df['label']=='cat']['width'], label='cat')
plt.title('Distribution of image widths between cats and dogs')
plt.legend();
plt.show()

The distributions of heights and widths look similar across the two classes.

## <a class='anchor' id='firstcnn'> A simple CNN with Keras </a>

### <a class='anchor' id='architecture'> Model architecture </a>
Let's start by building a simple CNN with Keras: four convolution and max-pooling blocks followed by two dense layers. The architecture is fairly simple and is taken from Francois Chollet's book 'Deep Learning with Python'.

In [None]:
from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Conv2D(32, (3,3), activation = 'relu', input_shape = (200, 200, 3), name="conv_1"))
network.add(layers.MaxPooling2D((2,2), name="maxpool_1"))
network.add(layers.Conv2D(64, (3,3), activation = 'relu', name="conv_2"))
network.add(layers.MaxPooling2D((2,2), name = "maxpool_2"))
network.add(layers.Conv2D(128, (3,3), activation = 'relu', name="conv_3"))
network.add(layers.MaxPooling2D((2,2), name = "maxpool_3"))
network.add(layers.Conv2D(128, (3,3), activation = 'relu', name="conv_4"))
network.add(layers.MaxPooling2D((2,2), name = "maxpool_4"))

network.add(layers.Flatten())
network.add(layers.Dense(512, activation = 'relu', name="dense_1"))
network.add(layers.Dense(1, activation = 'sigmoid', name="dense_2"))
network.summary()

In [None]:
network.compile(optimizer = 'adam',
               loss = 'binary_crossentropy',
               metrics = ['accuracy'])

Below is a breakdown of each layer and its parameters:
1. __`Conv2D`__: Takes an input of size `(200, 200, 3)` and passes it through 32 filters (each of size `(3,3)`). Since we haven't specified any padding, the output will be a 3D tensor of shape `(198, 198, 32)`. The number of parameters in this layer will be 896, i.e. `out_channels * (in_channels * kernel_h * kernel_w + 1)`, i.e. `32 * (3*3*3+1)`, where the 1 is for the bias term.
2. __`MaxPooling2D`__: Takes a window size (here `(2,2)`) and returns a downsampled version of the previous layer's output, keeping the maximum value in each window.
3. __`Flatten`__: Flattens the 3-D output to 1-D.
4. __`Dense`__: Fully connected layers with some activation function (e.g. relu). The last `Dense` layer has a sigmoid activation function to predict the final output.
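The shape and parameter arithmetic above can be checked by hand for the whole network. The sketch below is plain Python, not Keras: it assumes unpadded 3x3 convolutions with stride 1 and 2x2 max pooling with stride 2, as in the model definition, and the helper names are hypothetical.

```python
def conv2d_valid(shape, filters, kernel=3):
    """Output shape and parameter count of an unpadded, stride-1 convolution."""
    h, w, c = shape
    out_shape = (h - kernel + 1, w - kernel + 1, filters)
    params = filters * (c * kernel * kernel + 1)  # +1 for each filter's bias
    return out_shape, params


def maxpool2d(shape, pool=2):
    """Output shape of non-overlapping max pooling (no parameters)."""
    h, w, c = shape
    return (h // pool, w // pool, c)


def dense_params(n_inputs, units):
    """Parameter count of a fully connected layer, including biases."""
    return units * (n_inputs + 1)


shape = (200, 200, 3)
total_params = 0
layer_report = []
for filters in (32, 64, 128, 128):
    shape, params = conv2d_valid(shape, filters)
    layer_report.append(("conv", shape, params))
    total_params += params
    shape = maxpool2d(shape)
    layer_report.append(("maxpool", shape, 0))

flat = shape[0] * shape[1] * shape[2]          # size after Flatten
total_params += dense_params(flat, 512)        # dense_1
total_params += dense_params(512, 1)           # dense_2

for name, out_shape, params in layer_report:
    print(name, out_shape, params)
print("flattened size:", flat, "total parameters:", total_params)
```

The numbers printed here should line up with what `network.summary()` reports; most of the parameters sit in the first dense layer, because the flattened feature map is large.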

### <a class='anchor' id='preprocessing'> Data preprocessing </a> 

We use the Keras `ImageDataGenerator` class to convert the raw images into tensors.

In [None]:
from keras.preprocessing.image import ImageDataGenerator
# Instantiate an ImageDataGenerator with 30% validation data 
datagen = ImageDataGenerator(rescale = 1./255,
                             validation_split=0.3)

# Call the `flow_from_dataframe` method to create a generator for training data. 
# The input dataframe to this method needs to contain the image paths & target labels
train_data_gen = datagen.flow_from_dataframe(dataframe=train_df,
                                             directory='../kaggle/working/temp_unzip/train/',#Target directory
                                             x_col = 'filename',
                                             y_col = 'label',
                                             class_mode = 'binary',#Since we use binary crossentropy
                                             target_size=(200, 200),#All images will be resized to this
                                             color_mode='rgb',
                                             batch_size = 32,
                                             shuffle=True,
                                             seed=42,
                                             subset = 'training',#Just for train data generation
                                             validate_filenames=False)

# Repeat for validation data 
val_data_gen  = datagen.flow_from_dataframe(dataframe=train_df,
                                            directory='../kaggle/working/temp_unzip/train/',
                                            x_col = 'filename',
                                            y_col = 'label',
                                            class_mode = 'binary',
                                            target_size=(200, 200),#All images will be resized to this
                                            color_mode='rgb',
                                            batch_size = 32,
                                            shuffle=True,
                                            seed=42,
                                            subset = 'validation',#Just for validation data generation
                                            validate_filenames=False)

Let's examine the outputs of the data generator. The [Keras documentation](https://keras.io/api/preprocessing/image/#flowfromdataframe-method) for `ImageDataGenerator.flow_from_dataframe` says that it: 

* Takes the dataframe and the path to a directory + generates batches.

* Returns a DataFrameIterator yielding tuples of (x, y) where x is a numpy array containing a batch of augmented/normalized images with shape (batch_size, *target_size, channels) and y is a numpy array of corresponding labels.



In [None]:
for data_array, label_array in train_data_gen:
    print(f"Shape of train data batch data is {data_array.shape}")
    print(f"Shape of train data batch labels is {label_array.shape}")
    break # The generator yields batches endlessly, so we need to break out of the loop manually

We see that the train data generator yields batches of 32 RGB images, each image of shape 200 * 200 * 3.

The labels generated are binary labels of shape (32,), one per image in the batch.
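To see why the loop above needs a manual `break`, here is a simplified stand-in for the generator's behaviour in plain Python. It is only an illustration: `endless_batches` is a hypothetical helper, the filenames are made up, and the real Keras iterator also loads, rescales and reshuffles images.

```python
import itertools


def endless_batches(items, batch_size):
    """Yield consecutive batches from `items` forever, restarting after the last batch."""
    while True:
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]


filenames = [f"img_{i}.jpg" for i in range(70)]  # hypothetical file list
gen = endless_batches(filenames, batch_size=32)

# Take only the first four batches; without a limit the loop would never end.
first_four = list(itertools.islice(gen, 4))
batch_sizes = [len(batch) for batch in first_four]
print(batch_sizes)
```

Note that the last batch of a pass is smaller when the number of items is not a multiple of the batch size, after which the generator starts over.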

### <a class='anchor' id='training'> Training the model </a>

The `fit_generator` method is used to train the CNN. Its arguments include:
* a Python generator that will yield batches of inputs and targets indefinitely
* `steps_per_epoch`: how many batches to draw from the generator before declaring an epoch over
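As a rough guide to choosing `steps_per_epoch`, the sketch below computes how many batches one full pass over the training subset would take. It assumes the training subset holds 17,500 images (70% of the 25,000 labelled images, given `validation_split=0.3`); `steps_for_full_pass` is a hypothetical helper.

```python
import math


def steps_for_full_pass(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 17500   # assumption: 70% of 25,000 images
batch_size = 32

full_pass_steps = steps_for_full_pass(n_train, batch_size)
images_seen_per_epoch = 100 * batch_size  # with steps_per_epoch = 100

print(full_pass_steps, images_seen_per_epoch)
```

Under these assumptions, 100 steps per epoch covers only part of the training subset in each epoch, which trades coverage for faster epochs.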

In [None]:
history = network.fit_generator(train_data_gen,
                                steps_per_epoch = 100,
                                epochs=30,
                                validation_data = val_data_gen,
                                validation_steps=50
                               )

Let's save our model:

In [None]:
network.save('cats_and_dogs_small_model.h5')

### <a class='anchor' id='evaluationfirst'> Model evaluation </a>

Let's plot the loss and the accuracy of the model over training and validation data.

In [None]:
print(history.history.keys())
train_acc = history.history['acc']
val_acc = history.history['val_acc']
train_loss = history.history['loss']
val_loss = history.history['val_loss']
n_epochs = len(train_acc)
fig = plt.figure(figsize = (15,8))
fig.add_subplot(121)
plt.plot(range(n_epochs), train_acc, color = 'orange', label = "Train accuracy")
plt.plot(range(n_epochs), val_acc, color = 'blue', label = "Validation accuracy")
plt.legend();
fig.add_subplot(122)
plt.plot(range(n_epochs), train_loss, color = 'orange', label = "Train loss")
plt.plot(range(n_epochs), val_loss, color = 'blue', label = "Validation loss")
plt.legend();
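The key names in `history.history` depend on the Keras version: older releases record accuracy under `acc` and `val_acc`, while newer ones use `accuracy` and `val_accuracy`. A small lookup helper, sketched below with a made-up history dictionary, keeps the plotting code working in either case; `pick_metric` is a hypothetical name.

```python
def pick_metric(history_dict, candidates):
    """Return the first metric series whose key is present in the history dictionary."""
    for key in candidates:
        if key in history_dict:
            return history_dict[key]
    raise KeyError(f"None of {candidates} found in {sorted(history_dict)}")


# Placeholder values for illustration only, not real training results.
history_dict = {
    "loss": [0.69, 0.61],
    "accuracy": [0.55, 0.66],
    "val_loss": [0.67, 0.63],
    "val_accuracy": [0.57, 0.64],
}

train_acc = pick_metric(history_dict, ("acc", "accuracy"))
val_acc = pick_metric(history_dict, ("val_acc", "val_accuracy"))
print(train_acc, val_acc)
```

In the notebook, `history.history` would be passed in place of the placeholder dictionary.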

We observe that: 
* Training accuracy improves with each epoch
* Validation accuracy, on the other hand, improves more slowly after around the 26th epoch
* Training loss similarly decreases consistently
* Improvement in validation loss tapers off after around the 20th epoch

In order to reduce overfitting and improve the validation accuracy, let's try a few model improvements.
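One way to put numbers on these observations is to find the epoch with the lowest validation loss and to track the gap between training and validation accuracy. The sketch below uses made-up curves purely for illustration; `summarize_curves` is a hypothetical helper, and epochs are counted from 0.

```python
def summarize_curves(train_acc, val_acc, val_loss):
    """Return the epoch with the lowest validation loss and the per-epoch accuracy gap."""
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    gaps = [round(t - v, 4) for t, v in zip(train_acc, val_acc)]
    return best_epoch, gaps


# Placeholder curves for illustration only, not results from this notebook.
train_acc = [0.60, 0.75, 0.90]
val_acc = [0.58, 0.70, 0.72]
val_loss = [0.66, 0.58, 0.63]

best_epoch, gaps = summarize_curves(train_acc, val_acc, val_loss)
print(best_epoch, gaps)
```

A gap that keeps widening after the best validation epoch is the usual sign that the model is fitting the training data more than it is generalising.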

## <a class='anchor' id='improvedmodel'> Improving the architecture </a>

Let's apply the following data augmentation steps to the training data: 
* Rotate images within a range of 45 degrees
* Horizontal and vertical translation of images 
* Zoom into some images randomly 
* Apply shear transformations
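As a rough guide to what such settings mean for a 200 x 200 image, the sketch below converts augmentation ranges into concrete bounds. It assumes the conventions described in the Keras `ImageDataGenerator` documentation: rotations are sampled within plus or minus `rotation_range` degrees, fractional shifts are relative to the image size, and zoom is sampled from `[1 - zoom_range, 1 + zoom_range]`. `augmentation_bounds` is a hypothetical helper and shear is left out.

```python
def augmentation_bounds(image_size, rotation_range, width_shift_range,
                        height_shift_range, zoom_range):
    """Translate augmentation ranges into explicit (low, high) bounds."""
    height, width = image_size
    return {
        "rotation_degrees": (-rotation_range, rotation_range),
        "width_shift_pixels": (-width_shift_range * width, width_shift_range * width),
        "height_shift_pixels": (-height_shift_range * height, height_shift_range * height),
        "zoom_factor": (1 - zoom_range, 1 + zoom_range),
    }


# Same values as the generator defined in the next cell.
bounds = augmentation_bounds((200, 200), rotation_range=45,
                             width_shift_range=0.2, height_shift_range=0.2,
                             zoom_range=0.2)
for name, (low, high) in bounds.items():
    print(f"{name}: {low} to {high}")
```

Each training image gets a fresh random draw from these ranges every time it is requested, so the network rarely sees exactly the same picture twice.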

In [None]:
from keras.preprocessing.image import ImageDataGenerator

# ImageDataGenerator for training data (with augmentation); validation data is only rescaled
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   rotation_range = 45,
                                   width_shift_range = 0.2,
                                   height_shift_range = 0.2,
                                   shear_range = 0.2,
                                   zoom_range = 0.2)
val_datagen = ImageDataGenerator(rescale = 1./255)

# Call the `flow_from_dataframe` method to create a generator for training data. 
# The input dataframe to this method needs to contain the image paths & target labels

# We split the training and validation data because augmentations/transformations should not be applied to validation data 
from sklearn.model_selection import train_test_split
augmented_mod_train_df, augmented_mod_val_df = train_test_split(train_df, test_size=0.2)

Let's take a quick look at class balances in the train and validation data:

In [None]:
print(f"Split between classes in train data: \n {augmented_mod_train_df['label'].value_counts()*100 / augmented_mod_train_df.shape[0]}")
print(f"Split between classes in validation data: \n {augmented_mod_val_df['label'].value_counts()*100 / augmented_mod_val_df.shape[0]}")

The distribution looks very similar to the original distribution in the training data. 
Let's run the data augmentation pipeline and train the network.

In [None]:
train_data_gen = train_datagen.flow_from_dataframe(dataframe=augmented_mod_train_df,
                                             directory='../kaggle/working/temp_unzip/train/',#Target directory
                                             x_col = 'filename',
                                             y_col = 'label',
                                             class_mode = 'binary',#Since we use binary crossentropy
                                             target_size=(200, 200),#All images will be resized to this
                                             color_mode='rgb',
                                             batch_size = 32,
                                             shuffle=True,
                                             seed=42,
                                             validate_filenames=False)

# Repeat for validation data 
val_data_gen  = val_datagen.flow_from_dataframe(dataframe=augmented_mod_val_df,
                                            directory='../kaggle/working/temp_unzip/train/',
                                            x_col = 'filename',
                                            y_col = 'label',
                                            class_mode = 'binary',
                                            target_size=(200, 200),#All images will be resized to this
                                            color_mode='rgb',
                                            batch_size = 32,
                                            shuffle=True,
                                            seed=42,
                                            validate_filenames=False)

Let's add a dropout layer after the flatten step, ahead of the dense classification layers, and then retrain our earlier network.

In [None]:
from keras import models
from keras import layers
from keras import optimizers

network = models.Sequential()
network.add(layers.Conv2D(32, (3,3), activation = 'relu', input_shape = (200, 200, 3), name="conv_1"))
network.add(layers.MaxPooling2D((2,2), name="maxpool_1"))
network.add(layers.Conv2D(64, (3,3), activation = 'relu', name="conv_2"))
network.add(layers.MaxPooling2D((2,2), name = "maxpool_2"))
network.add(layers.Conv2D(128, (3,3), activation = 'relu', name="conv_3"))
network.add(layers.MaxPooling2D((2,2), name = "maxpool_3"))
network.add(layers.Conv2D(128, (3,3), activation = 'relu', name="conv_4"))
network.add(layers.MaxPooling2D((2,2), name = "maxpool_4"))

network.add(layers.Flatten())
network.add(layers.Dropout(0.2))
network.add(layers.Dense(512, activation = 'relu', name="dense_1"))
network.add(layers.Dense(1, activation = 'sigmoid', name="dense_2"))
network.summary()

In [None]:
network.compile(optimizer = optimizers.Adam(lr=1e-4),
               loss = 'binary_crossentropy',
               metrics = ['accuracy'])

history = network.fit_generator(train_data_gen,
                                steps_per_epoch = 50,
                                epochs=50,
                                validation_data = val_data_gen,
                                validation_steps=50
                               )
# network.save('cats_and_dogs_augmented_data.h5')

In [None]:
train_acc = history.history['acc']
val_acc = history.history['val_acc']
train_loss = history.history['loss']
val_loss = history.history['val_loss']
n_epochs = len(train_acc)
fig = plt.figure(figsize = (15,8))
fig.add_subplot(121)
plt.plot(range(n_epochs), train_acc, color = 'orange', label = "Train accuracy")
plt.plot(range(n_epochs), val_acc, color = 'blue', label = "Validation accuracy")
plt.legend();
fig.add_subplot(122)
plt.plot(range(n_epochs), train_loss, color = 'orange', label = "Train loss")
plt.plot(range(n_epochs), val_loss, color = 'blue', label = "Validation loss")
plt.legend();

Let's generate the model's predictions on the test data, which has no labels.

In [None]:
import zipfile 
with zipfile.ZipFile("../input/"+"test1"+".zip","r") as z:
    z.extractall("../kaggle/working/temp_test_unzip")

In [None]:
filenames = os.listdir('../kaggle/working/temp_test_unzip/test1')
test_df = pd.DataFrame({'filename': filenames})
print(test_df.shape)
test_df.head()

Let's create a test data generator in the same way that we created a train & validation data generator

In [None]:
test_data_gen  = val_datagen.flow_from_dataframe(dataframe=test_df,
                                            directory='../kaggle/working/temp_test_unzip/test1/',
                                            x_col = 'filename',
                                            y_col = None,
                                            class_mode = None,
                                            target_size=(200, 200),#All images will be resized to this
                                            color_mode='rgb',
                                            batch_size = 64,
                                            shuffle=False)#Keep file order so predictions line up with test_df

In [None]:
yhat = network.predict_generator(test_data_gen, steps=int(np.ceil(test_df.shape[0]/64)))
print(yhat.shape)

In [None]:
submission_df = test_df.copy()
submission_df['id'] = submission_df['filename'].str.split('.').str[0]
# Threshold the predicted probabilities at 0.5: 1 = dog, 0 = cat
submission_df['label'] = np.where(yhat.ravel() > 0.5, 1, 0)
submission_df[['id', 'label']].to_csv('submission.csv', index=False)
submission_df[['id', 'label']].head()
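The submission logic boils down to two small steps: strip the file extension to get the id, and threshold the sigmoid output at 0.5. A plain-Python sketch of the same idea follows; `to_submission_rows`, the filenames and the probabilities are all hypothetical.

```python
def to_submission_rows(filenames, probabilities, threshold=0.5):
    """Pair each image id with a 0/1 label obtained by thresholding its probability."""
    rows = []
    for name, prob in zip(filenames, probabilities):
        image_id = name.split(".")[0]          # "12.jpg" -> "12"
        label = 1 if prob > threshold else 0   # 1 = dog, 0 = cat
        rows.append((image_id, label))
    return rows


# Made-up filenames and probabilities for illustration.
rows = to_submission_rows(["1.jpg", "2.jpg", "3.jpg"], [0.91, 0.08, 0.5])
print(rows)
```

With a strict `>` comparison, a probability of exactly 0.5 falls on the cat side, matching the `yhat > 0.5` condition used above.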

That's it for a CNN architecture implementation in Keras. In my next notebook, I'll attempt transfer learning on the dataset to improve the accuracy. 

References: 

1. [Deep Learning with Python, Francois Chollet](https://github.com/sri-spirited/fchollet-book-deep-learning-with-python-notebooks)