#### Dogs vs. Cats Redux : Kernels Edition  - Raphaël 

*Hello, I am a French student studying statistics, mathematics and computer science at university. I really enjoyed this project, which helped me better understand convolutional neural networks. The objective is to predict from a photo whether it shows a cat or a dog. **I hope you enjoy it; do not hesitate to comment. :-)***

- *In the first part, I will create the project structure: the train, validation and test directories that hold the cat and dog images (resized to 150x150).*

- *In the second part, I will use tools such as the PIL and Keras packages to transform images, which helps avoid overfitting.*

- *In the third part, I will design a CNN architecture, train it and use it for prediction.*

In [None]:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib.pyplot import plot

import zipfile # Preprocessing

import cv2 # Preprocessing
import imageio # Preprocessing

import PIL # Preprocessing
import keras

import shutil # Preprocessing
import os # Preprocessing

from tqdm import tqdm # Progress bar
# import keras_tqdm # Progress bar 

# Any results you write to the current directory are saved as output.

In [None]:
from keras import backend 
from keras import applications
from keras.preprocessing import image # Preprocessing
from keras.preprocessing.image import ImageDataGenerator #DataAugmentation
from keras.callbacks import * 
from keras.models import Sequential
from keras.layers import Activation, Dropout, Flatten, Dense, BatchNormalization, Conv2D, MaxPooling2D

##### List of pictures and labels :

In [None]:
list_picture = os.listdir("../input/dogs-vs-cats-redux-kernels-edition/train/")

##### *Label :* 
- *Cat : 0*
- *Dog : 1*

In [None]:
df = pd.DataFrame({"file" : list_picture})
df['label'] = df['file'].apply(lambda x : 0 if x.split('.')[0] == 'cat' else 1)

##### Validation dataset :

In [None]:
df['validation'] = df['label'].apply(lambda x : 1 if np.random.randint(0,11) <= 2 else 0)

In [None]:
df.sample(10)

In [None]:
print('Percentage of validation data : {}'.format(len(df[df['validation']==1])/len(df)*100))

*A dedicated "train/test split" function would be a better choice here. I should also reduce the share of images sent to the validation dataset.*
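*To make the remark above concrete: `np.random.randint(0, 11)` draws from the 11 integers 0 to 10, so the test `<= 2` sends roughly 3/11 ≈ 27% of the images to validation, and the exact count changes on every run. The sketch below shows one possible alternative using only the standard library: a seeded, class-balanced split with a fixed fraction. The function name `stratified_split`, the 20% fraction and the seed are illustrative assumptions, not the settings used in this notebook.*

```python
import random
from collections import defaultdict

def stratified_split(files, val_fraction=0.2, seed=42):
    """Return (train, validation) file lists with the same class ratio in each.

    The class is read from the file-name prefix ('cat.0.jpg' -> 'cat'),
    as in the label cell above.
    """
    rng = random.Random(seed)  # local, seeded generator: same split on every run
    by_class = defaultdict(list)
    for name in files:
        by_class[name.split('.')[0]].append(name)

    train, validation = [], []
    for names in by_class.values():
        names = sorted(names)  # fixed order before shuffling, for reproducibility
        rng.shuffle(names)
        n_val = round(len(names) * val_fraction)
        validation.extend(names[:n_val])
        train.extend(names[n_val:])
    return train, validation

# Toy file list standing in for os.listdir(...) on the train folder
toy_files = (['cat.{}.jpg'.format(i) for i in range(50)]
             + ['dog.{}.jpg'.format(i) for i in range(50)])
train_files, val_files = stratified_split(toy_files)
print(len(train_files), len(val_files))
```

*With such a split, the `validation` column could be filled from membership in the validation list instead of a random draw per row.*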

*We are going to create the directory structure below:*
- *The "cat" and "dog" folders contain the corresponding images. This layout works very well for binary classification with Keras generators.*

```
.
├── notebook.ipynb
|
|
├── data
|    ├── train
|    |   ├── cat
|    |   └── dog
|    |
|    └── validation
|        ├── cat
|        └── dog
└── reshape_test
```

*The code below deletes and rebuilds the train and validation directory structure shown above.*

In [None]:
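# Remove any folders left over from a previous run; missing folders are ignored.
# (With a single try/except, one missing folder would skip the remaining deletions.)
for folder in ('data/train/cat/', 'data/train/dog/',
               'data/validation/cat/', 'data/validation/dog/'):
    shutil.rmtree(folder, ignore_errors=True)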

In [None]:
os.makedirs('data/train/cat/')
os.makedirs('data/train/dog/')
os.makedirs('data/validation/cat/')
os.makedirs('data/validation/dog/')

```
for index, row in tqdm(df.iterrows(), total=len(df)):
    
    file_name = row['file']
    img = cv2.imread('../input/dogs-vs-cats-redux-kernels-edition/train/{}'.format(file_name), cv2.IMREAD_COLOR)
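    # Note: cv2.imread returns BGR while imageio.imwrite expects RGB, so the red and
    # blue channels are swapped in the saved files. The test images go through the
    # same swap, so training and test data stay consistent.
    # Resizing with OpenCV is optional here:
    # the generator already resizes images during training
    #img = cv2.resize(img, (150, 150), interpolation=cv2.INTER_LINEAR) # INTER_CUBIC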
    
    # cat if row['label'] == 0 else dog
    if row['label'] == 0 :
        file_name = 'cat/{}'.format(file_name)
    else :
        file_name = 'dog/{}'.format(file_name)
    
    # train if row['validation'] == 0 else validation
    if row['validation'] == 0 : 
        imageio.imwrite('data/train/{}'.format(file_name), img)
    else :
        imageio.imwrite('data/validation/{}'.format(file_name), img)
```

In [None]:
list_picture_test = os.listdir("../input/dogs-vs-cats-redux-kernels-edition/test/")

In [None]:
# Remove the folder left over from a previous run, if any
shutil.rmtree('reshape_test', ignore_errors=True)

In [None]:
os.makedirs('reshape_test')

```
for file_path in tqdm(list_picture_test, total=len(list_picture_test)) :
    
    img = cv2.imread('../input/dogs-vs-cats-redux-kernels-edition/test/{}'.format(file_path), cv2.IMREAD_COLOR)
    img = cv2.resize(img, (150, 150), interpolation=cv2.INTER_LINEAR) # INTER_CUBIC
    imageio.imwrite('reshape_test/{}'.format(file_path), img)
```

*We'll try to use some tools to transform Chucky's image.*

In [None]:
img = keras.preprocessing.image.load_img('../input/dogs-vs-cats-redux-kernels-edition/train/dog.11931.jpg')

In [None]:
img

*Convert the image to a NumPy array (first row of pixels shown):*

In [None]:
np.array(img)[0]

*Shape of the matrix :*

In [None]:
np.array(img).shape

##### Example of preprocessing : 

In [None]:
img_preprocessed = np.array(img.convert('L').rotate(45).transpose(PIL.Image.TRANSPOSE))

In [None]:
matplotlib.pyplot.imshow(img_preprocessed, interpolation='nearest')
matplotlib.pyplot.show()

##### Keras : ImageDataGenerator

*The Keras blog helped me understand how to preprocess pictures to avoid overfitting and how to build a classifier without a lot of data: "Our model would never see twice the exact same picture. This helps prevent overfitting and helps the model generalize better". (Source: [Keras blog](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html))*

In [None]:
datagen = keras.preprocessing.image.ImageDataGenerator(
        rotation_range=40,
        width_shift_range=0.2,
        height_shift_range=0.2,
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True,
        fill_mode='nearest')

In [None]:
x = keras.preprocessing.image.img_to_array(img)  # NumPy array with shape (height, width, 3) in the default channels_last format; the image was loaded without resizing
x = x.reshape((1,) + x.shape)  # add a batch dimension: shape (1, height, width, 3)

In [None]:
os.makedirs('example', exist_ok=True)  # do nothing if the folder already exists

*The `.flow()` call below generates batches of randomly transformed images and saves the results to the "example" directory:*

In [None]:
#i = 0
for batch in datagen.flow(x, save_to_dir='example', save_prefix='preprocessed', save_format='jpg'):
    #Create 20 pictures  : 
    #i += 1
    #if i > 20:
    break  # otherwise the generator would loop indefinitely

In [None]:
keras.preprocessing.image.load_img('example/{}'.format(os.listdir("example")[0]))

*Above we can see how the tools of the keras library have transformed Chucky's image.*

### *Keras convolutional neural network:*

![](https://cdn-images-1.medium.com/max/634/1*-r7EkRUvzkqDyyr2kwdeDg.png)

- *Here we defined the input shape :*

In [None]:
img_width, img_height = 150, 150

if backend.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

In [None]:
input_shape

##### Model  : 
*I do not have an Nvidia GPU on my computer, so I used batch normalization to accelerate convergence and reduce overfitting. Batch normalization does not seem to improve the final quality of my model, but it gives satisfactory results in fewer iterations.
(Sources: [towardsdatascience](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c) and [dlology](https://www.dlology.com/blog/one-simple-trick-to-train-keras-model-faster-with-batch-normalization/); the technique was first introduced in the paper [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf))*

In [None]:
model = Sequential()

model.add(Conv2D(32, (3, 3), input_shape= input_shape, use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Conv2D(32, (3, 3), use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3), use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Conv2D(64, (3, 3), use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(128, (3, 3), use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Conv2D(128, (3, 3), use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten()) # Converts the 3D feature maps (15 x 15 x 128) into a 1D feature vector of 28,800 values

model.add(Dense(128, use_bias=False))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5)) # This dropout rate is aggressive, but it helps reduce overfitting.

model.add(Dense(64, use_bias=False)) 
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(1)) # Binary classification
model.add(BatchNormalization())
model.add(Activation('sigmoid'))# Binary classification

*There is a very interesting article comparing optimizers for this challenge: [shaoanlu](https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/)*

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# optimizer = SGD(momentum=0.9, nesterov=True) could be better here 

*Summary of the model : *

In [None]:
model.summary()
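
*The output shapes listed in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model above: 3x3 convolutions without padding ("valid") and stride 1, and 2x2 max-pooling with stride 2. The helper names `conv_out` and `pool_out` are illustrative only.*

```python
def conv_out(size, kernel=3, stride=1):
    """Output size of a 'valid' (no padding) convolution along one dimension."""
    return (size - kernel) // stride + 1

def pool_out(size, pool=2):
    """Output size of a max-pooling layer whose stride equals its pool size."""
    return size // pool

size = 150  # input images are 150 x 150
for filters in (32, 64, 128):
    size = conv_out(conv_out(size))  # two 3x3 convolutions per block
    size = pool_out(size)            # one 2x2 max-pooling per block
    print('after the block with {} filters: {} x {} x {}'.format(filters, size, size, filters))

flat_features = size * size * 128
print('Flatten output size:', flat_features)  # 15 * 15 * 128 = 28800
```

*The spatial size goes 150 → 73 → 34 → 15, so the first dense layer receives 28,800 inputs.*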

##### This is the augmentation configuration we will use for training
*Our original images consist of RGB coefficients in the range 0-255. To help the model process them, we rescale these values to the range 0-1.*

In [None]:
train_datagen = keras.preprocessing.image.ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

In [None]:
batch_size = 32

*These objects are generators: they read images from the specified directories and produce batches of augmented image data.
We never load the whole data set into memory; instead, we stream the images in batches.*

In [None]:
train_generator = train_datagen.flow_from_directory(
        'data/train',  # target directory
        target_size = (img_width, img_height),
        batch_size = batch_size,
        class_mode = 'binary')  # since we use binary_crossentropy loss, we need binary labels

# this is a similar generator, for validation data
validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode='binary')
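
*The idea of streaming batches can be illustrated without Keras. The sketch below is a plain Python generator, not the Keras implementation: it does no shuffling, loading or augmentation, and the name `endless_batches` and the toy file names are assumptions made for the example. It only shows that batches are built on demand and that the loop never ends by itself, which is why the number of steps has to be given explicitly.*

```python
from itertools import islice

def endless_batches(items, batch_size):
    """Yield lists of at most batch_size items, cycling over the data forever."""
    while True:
        for start in range(0, len(items), batch_size):
            yield items[start:start + batch_size]

# Toy file names standing in for the images found in 'data/train'
toy_files = ['img_{}.jpg'.format(i) for i in range(5)]

# Only the requested batches are ever built; islice stops the endless loop
first_four = list(islice(endless_batches(toy_files, batch_size=2), 4))
print(first_four)
```

*The fourth batch is the first one again: after the last (partial) batch, the generator starts a new pass over the data.*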

***There is a great tutorial for this challenge: [ahmedbesbes.com](https://ahmedbesbes.com/understanding-deep-convolutional-neural-networks-with-a-practical-use-case-in-tensorflow-and-keras.html). It showed me how to analyze the training phase of the model and visualize the results. Thanks to it, I also learned to stop training the convolutional neural network before it overfits, using the "early stopping" callback.***

In [None]:
# This class records the losses so that I can visualize the training results
class LossHistory(Callback):

    def on_train_begin(self, logs=None):
        self.losses = []
        self.val_losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}  # avoid a mutable default argument
        self.losses.append(logs.get('loss'))
        self.val_losses.append(logs.get('val_loss'))

In [None]:
history = LossHistory()

*The callback below stops training when too many epochs pass without any improvement. I will pass it through the `callbacks` parameter when I fit the model.*

In [None]:
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss',
                                               min_delta=0,
                                               patience=15,  # Maximum number of epochs without improvement of val_loss; with only 15 training epochs, early stopping is effectively disabled
                                               verbose=0,
                                               mode='auto')
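
*To see what the `patience` parameter does, here is a simplified pure-Python model of the stopping rule. It is a sketch of the idea, not the Keras source code; the function name `stopping_epoch` and the validation losses are made up for the example.*

```python
def stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Simplified rule: stop once `patience` consecutive epochs have passed
    without val_loss improving on the best value by more than `min_delta`.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the loss stalls after epoch 3, then improves at epoch 7
toy_val_losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.54]
print(stopping_epoch(toy_val_losses, patience=3))   # stops at epoch 6
print(stopping_epoch(toy_val_losses, patience=15))  # never stops: None
```

*With `patience=3`, training would stop at epoch 6 and miss the improvement at epoch 7. With a patience at least as large as the number of epochs, the rule can never trigger.*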

*On Kaggle, a kernel can run for at most an hour, so I can't train the model here; I'll just load the weights. keras_tqdm is a package that provides a nice progress display (JavaScript widget) while you train your neural network.*

In [None]:
model.load_weights("../input/weights-2/model_weights_2.h5")

```
# Not run on Kaggle (see above). Requires `import keras_tqdm`, which is commented out in the imports.
fitted_model = model.fit_generator(
                    train_generator,
                    steps_per_epoch = 400,
                    epochs = 15,
                    validation_data = validation_generator,
                    validation_steps =  800 // batch_size,
                    verbose = 0,
                    callbacks=[keras_tqdm.TQDMNotebookCallback(leave_inner=True, leave_outer=True), early_stopping, history])
model.save_weights('model.h5')
```


![](https://image.noelshack.com/fichiers/2018/17/3/1524679368-fit.png)

The graph below plots the loss on the training and validation data sets over the epochs.

```
losses, val_losses = history.losses, history.val_losses
fig = matplotlib.pyplot.figure(figsize=(15, 5))
matplotlib.pyplot.plot(fitted_model.history['loss'], 'g', label="train losses")
matplotlib.pyplot.plot(fitted_model.history['val_loss'], 'r', label="val losses")
matplotlib.pyplot.grid(True)
matplotlib.pyplot.title('Training loss vs. Validation loss')
matplotlib.pyplot.xlabel('Epochs')
matplotlib.pyplot.ylabel('Loss')
matplotlib.pyplot.legend()
matplotlib.pyplot.show()
```

![](https://image.noelshack.com/fichiers/2018/17/3/1524679364-log-loss.png)



*The graph below plots the accuracy on the training and validation data sets over the epochs.*

```
losses, val_losses = history.losses, history.val_losses
fig = matplotlib.pyplot.figure(figsize=(15, 5))
matplotlib.pyplot.plot(fitted_model.history['acc'], 'g', label="accuracy on train set")
matplotlib.pyplot.plot(fitted_model.history['val_acc'], 'r', label="accuracy on validation set")
matplotlib.pyplot.grid(True)
matplotlib.pyplot.title('Training Accuracy vs. Validation Accuracy')
matplotlib.pyplot.xlabel('Epochs')
matplotlib.pyplot.ylabel('Accuracy')
matplotlib.pyplot.legend()
matplotlib.pyplot.show()
```

![](https://image.noelshack.com/fichiers/2018/17/3/1524679641-accuracy.png)

##### The validation accuracy varies a lot from one epoch to the next. To reduce this variance, I would increase the number of steps per epoch. Changing the optimizer could reduce it too.

##### I could also set a stricter early stopping rule: stop training after 3 epochs without improvement in validation accuracy. I should train the model on Amazon Web Services: [Keras_AWS](https://blog.keras.io/running-jupyter-notebooks-on-gpu-on-aws-a-starter-guide.html).
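
*As a rough guide for choosing the number of steps, the sketch below computes how many batches are needed to see every image once. The image counts are assumed round numbers for illustration (the real counts depend on the random split above), and `steps_to_cover` is a hypothetical helper.*

```python
import math

def steps_to_cover(n_images, batch_size):
    """Number of batches needed to see every image once (the last batch may be partial)."""
    return math.ceil(n_images / batch_size)

batch_size = 32
n_train, n_validation = 18000, 7000  # assumed round numbers, not measured in this notebook

print(steps_to_cover(n_train, batch_size))       # 563 steps for a full pass over the training set
print(steps_to_cover(n_validation, batch_size))  # 219 steps for a full pass over the validation set
print(800 // batch_size)                         # 25 validation steps, as used in the fit call
```

*With 25 validation steps of 32 images, each validation score is computed on only 800 images, which may explain part of the variance from epoch to epoch.*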

*Now it's time to predict on the test dataset:*

In [None]:
list_picture_test = [int(file.split('.')[0]) for file in os.listdir('reshape_test')]

In [None]:
list_picture_test.sort()

In [None]:
list_picture_test = ['{}.jpg'.format(file) for file in list_picture_test]
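
*The conversion to integers before sorting matters: file names sorted as strings would not follow the numeric id order that the submission file relies on. A small self-contained illustration, with made-up file names:*

```python
names = ['10.jpg', '2.jpg', '1.jpg']

as_text = sorted(names)                                       # lexicographic order
as_numbers = sorted(names, key=lambda n: int(n.split('.')[0]))  # numeric order of the ids

print(as_text)     # ['1.jpg', '10.jpg', '2.jpg']
print(as_numbers)  # ['1.jpg', '2.jpg', '10.jpg']
```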

In [None]:
classes = []

*Kaggle evaluates the predicted probability of the class "dog":*

```
for file in tqdm(list_picture_test) : 
    # Preprocessing images to predict :
    img = image.load_img('reshape_test/{}'.format(file), target_size=(img_width, img_height))
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = img/255 # Scaling image 
    classes.append(model.predict_proba(img))
```

```
classes = [x[0][0] for x in classes]
```

```
list_id = list(range(1,12501))
```

```
submission = pd.DataFrame({
            'id': list_id,
            'label':classes
            }, columns=['id','label'])
```

```
submission.head()
```

*Save the submission as a CSV file:*

```
submission.to_csv('submission_.csv', sep=",", index=False)
```

### Public score: 0.26122 (log loss), which is a pretty good result. I expected a worse score given the small number of training epochs.
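
*For reference, the binary log loss can be computed as in the sketch below (lower is better). The labels and probabilities are made up, and the clipping constant `eps` is an assumption made to avoid `log(0)`; it is not necessarily the value Kaggle uses.*

```python
import math

def binary_log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log() never receives 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]                  # 1 = dog, 0 = cat
confident = [0.99, 0.01, 0.99, 0.99]   # the last prediction is confidently wrong
cautious = [0.95, 0.05, 0.95, 0.95]    # same mistake, but less extreme probabilities

confident_loss = binary_log_loss(labels, confident)
cautious_loss = binary_log_loss(labels, cautious)
print(confident_loss, cautious_loss)
```

*The metric punishes confident mistakes heavily: the single wrong prediction at 0.99 costs more than the same mistake at 0.95, even though the other three predictions are slightly better.*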

* *Can you guess whether it's a dog or a cat? This image was the subject of a major Twitter debate.*

![](https://image.noelshack.com/fichiers/2018/17/3/1524684284-atchoum.jpeg)

It's a cat! See the "Atchoumthecat" Instagram account [here](https://www.instagram.com/atchoumthecat/?utm_source=ig_embed&action=profilevisit) and his personal website [here](https://www.atchoumthecat.com/my-story.html).

*Trying my model on random images (1 = Dog, 0 = Cat):*

In [None]:
img = keras.preprocessing.image.load_img('../input/reshape/144.jpg')
img

In [None]:
img = image.img_to_array(img)
img = np.expand_dims(img, axis=0)
img = img/255 # Scaling image
model.predict_classes(img)

In [None]:
img = keras.preprocessing.image.load_img('../input/reshape/145.jpg')
img

In [None]:
img = image.img_to_array(img)
img = np.expand_dims(img, axis=0)
img = img/255 # Scaling image
model.predict_classes(img)

In [None]:
img = keras.preprocessing.image.load_img('../input/reshape/146.jpg')
img

In [None]:
img = image.img_to_array(img)
img = np.expand_dims(img, axis=0)
img = img/255 # Scaling image
model.predict_classes(img)

Feel free to give a thumbs up if you found this notebook interesting.

Raphaël