# Keras Simple CNN Benchmark
This kernel consists of the following elements:

* Convolutional Neural Network.
* An Image Generator.
* Training of the model.
* Analysis of the kernel's performance.
* Submission of the performance results on the test set.


# Kernel versions description

For the purpose of the miniproject, several versions of an existing kernel [1] were created.

Version 3. Original version.
This version was taken from [1] and has a few additions that help to evaluate the code and answer the exercise questions. The model is evaluated on a validation set of 100,000 images drawn from the 340 categories. The mAP (mean average precision over the top three predictions, MAP@3) is 0.763 for this version of the code. Each of the following versions introduces one minor change, following the suggestions in exercise 2.2. To isolate the effect of each change, only one thing was altered at a time.

Version 5. Changing batch size.
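In this version of the code, the batch size was changed from 512 to 32. This makes the mAP score drop from 0.763 to 0.612. Such a drop in performance can be explained by the fact that the batch size determines the number of images used for each weight update. The more images there are in each batch, the less noisy the gradient estimate is, and the easier it is for the model to learn features that apply to the entire dataset rather than to one small batch of images. In addition, if the number of steps per epoch is left unchanged (`STEPS = 500`), a batch size of 32 means the model sees 16 times fewer images per epoch.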
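
The effect of the batch size on gradient noise can be illustrated with a small NumPy sketch. It does not use the kernel's model: the per-example "gradients" are hypothetical random numbers, and `minibatch_grad_std` is a helper introduced here for illustration only. It shows that the spread of the mini-batch average shrinks roughly with the square root of the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example gradients: noisy scalar values around a true mean of 1.0.
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)


def minibatch_grad_std(grads, batch_size, n_batches=2000):
    """Standard deviation of the mini-batch mean gradient over random batches."""
    idx = rng.integers(0, len(grads), size=(n_batches, batch_size))
    return grads[idx].mean(axis=1).std()


std_32 = minibatch_grad_std(per_example_grads, 32)
std_512 = minibatch_grad_std(per_example_grads, 512)
print('batch 32: {:.3f}, batch 512: {:.3f}'.format(std_32, std_512))
```

Since 512 / 32 = 16, the estimate from a batch of 32 is expected to be about four times noisier than the one from a batch of 512.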

Version 6. Using data augmentation.
In this version of the code, data augmentation techniques were applied to the input data. Specifically, random cropping and horizontal flipping were used. Random erasing was implemented but is not used in the current code, because it altered the images too drastically. The augmented data is combined with the original data in order to increase the number of training images. With data augmentation the mAP score decreased to 0.750. This can occur if the augmentations change the images too much. In this case, excluding random cropping but keeping the horizontal flipping might help to increase the mAP score. Unfortunately, due to time constraints, this assumption could not be tested.
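
The augmentation code of version 6 is not part of this notebook, so the following is only a minimal sketch of how random cropping and horizontal flipping could be applied to a batch shaped like the generator output `(n, size, size, 1)`. The function name `augment_batch` and the pad-then-crop variant of random cropping are assumptions, not the original implementation.

```python
import numpy as np


def augment_batch(batch, pad=4, rng=None):
    """Randomly shift (pad + crop) and horizontally flip each image in a batch."""
    rng = np.random.default_rng() if rng is None else rng
    n, h, w, c = batch.shape
    # zero-pad height and width so that a crop of the original size can be shifted
    padded = np.pad(batch, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode='constant')
    out = np.empty_like(batch)
    for i in range(n):
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        img = padded[i, top:top + h, left:left + w, :]
        if rng.random() < 0.5:
            img = img[:, ::-1, :]  # flip along the width axis
        out[i] = img
    return out


x = np.random.default_rng(0).random((8, 64, 64, 1)).astype(np.float32)
x_aug = augment_batch(x, rng=np.random.default_rng(1))
# combine the augmented images with the originals, as described above
x_train = np.concatenate([x, x_aug], axis=0)
print(x_train.shape)
```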

Version 19. Adding batch normalization.
When batch normalization was added after every convolutional layer, the mAP score increased to 0.768. This can be explained by the fact that batch normalization normalizes the outputs of the hidden units of the network, which helps to reduce internal covariate shift and has a regularizing effect on the model [2].
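
As a rough illustration of what the normalization does during training, the sketch below implements the forward pass from [2] in NumPy for a batch of feature vectors. The name `batch_norm_forward` and the toy activations are assumptions made for this example; the Keras layer additionally tracks running statistics for inference and, for convolutional outputs, computes the statistics per channel.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
# toy activations: 512 examples, 4 features, far from zero mean and unit variance
activations = rng.normal(loc=5.0, scale=3.0, size=(512, 4))
normalized = batch_norm_forward(activations, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))
```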

Version 20. Increasing the number of network layers. 
In this version of the code, an extra convolutional layer was added to the network, which decreased the mAP score to 0.707. This might be because the model is too complex for the given data and therefore overfits. In addition, with the extra layer the feature maps output by the convolutional layers shrink to 2 x 2, which might be too small for the given task.
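
The shrinking of the feature maps can be checked with simple arithmetic. The sketch below follows the layer pattern of `custom_single_cnn` as it appears in this notebook (first convolution with `padding='same'`, later 3 x 3 convolutions without padding, each followed by 2 x 2 max pooling). The helper `feature_map_size` is introduced here for illustration, and the number of convolutional blocks used in version 20 is an assumption.

```python
def feature_map_size(input_size, n_conv_layers, kernel=3, pool=2):
    """Spatial size after n conv + max-pool blocks laid out as in custom_single_cnn."""
    size = input_size
    for layer in range(n_conv_layers):
        if layer > 0:
            # only the first convolution uses padding='same'; the others lose kernel - 1 pixels
            size = size - kernel + 1
        size = size // pool  # 2x2 max pooling with stride 2
    return size


sizes = [feature_map_size(64, n) for n in range(1, 5)]
print(sizes)
```

For a 64 x 64 input this gives 32, 15, 6 and 2 pixels per side after one to four blocks, so a 2 x 2 map would correspond to four convolutional blocks.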

Version 21. Adding dropout layers.
Applying dropout to 20% of the nodes decreased the mAP score to 0.752. This might suggest that the model has low capacity, i.e. that it underfits rather than overfits. In that case, adding regularization in the form of dropout is not expected to increase the mAP score.
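
For reference, the following is a minimal NumPy sketch of (inverted) dropout as it is applied at training time: a random 20% of the activations are set to zero and the rest are rescaled so that the expected value stays the same. The function `dropout_forward` is written for this example and is not taken from the kernel.

```python
import numpy as np


def dropout_forward(x, rate, rng, training=True):
    """Zero out a fraction `rate` of the units and rescale the remaining ones."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
units = np.ones(100_000)
dropped = dropout_forward(units, rate=0.2, rng=rng)
print('fraction zeroed: {:.3f}, mean: {:.3f}'.format((dropped == 0).mean(), dropped.mean()))
```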

Version 22. Changing the initial learning rate.
In this version, the initial learning rate of the Adam optimizer was increased from 0.0024 to 0.005. This reduced the mAP score to 0.72, which suggests that the learning rate was too large, making it hard for the model to settle into a local minimum.
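
Why a large step size makes it hard to settle into a minimum can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2; it is only an illustration, since Adam adapts its step sizes and the learning rates used here are not comparable to the ones in the kernel.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w -= lr * grad
    return w


small = gradient_descent(lr=0.1)       # steady convergence towards 0
large = gradient_descent(lr=0.95)      # overshoots and oscillates, converges slowly
too_large = gradient_descent(lr=1.05)  # every step overshoots further: divergence
print(small, large, too_large)
```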

Below, an analysis of the network's results on the validation set is given.

**References:**

[1] Original kernel. URL: https://www.kaggle.com/gaborfodor/black-white-cnn-lb-0-77

[2] S. Ioffe and C. Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: ArXiv e-prints (Feb. 2015). arXiv: 1502.03167 [cs.LG].


In [None]:
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import os
import ast
import datetime as dt
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 10]
plt.rcParams['font.size'] = 14
import seaborn as sns
import cv2
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout, Flatten, Activation
from tensorflow.keras.metrics import categorical_accuracy, top_k_categorical_accuracy, categorical_crossentropy
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from collections import deque
import random
import scipy.misc as misc
start = dt.datetime.now()

In [None]:
DP_DIR = '../input/shuffle-csvs/'
INPUT_DIR = '../input/quickdraw-doodle-recognition/'
BASE_SIZE = 256
NCSVS = 100
NCATS = 340
np.random.seed(seed=1987)
tf.set_random_seed(seed=1987)
def f2cat(filename: str) -> str:
    return filename.split('.')[0]

def list_all_categories():
    files = os.listdir(os.path.join(INPUT_DIR, 'train_simplified'))
    return sorted([f2cat(f) for f in files], key=str.lower)

In [None]:
def apk(actual, predicted, k=3):
    """
    Source: https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
    """
    if len(predicted) > k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

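    # use len() so that a ground-truth class id of 0 (falsy as a 1-element array) is not treated as empty
    if len(actual) == 0:
        return 0.0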

    return score / min(len(actual), k)

def mapk(actual, predicted, k=3):
    """
    Source: https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
    """
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])

def preds2catids_val(valid_predictions, gt):
    val_pred = np.argsort(-valid_predictions, axis=1)[:, :3]
    val_pred_gt = np.zeros((val_pred.shape[0],val_pred.shape[1]+1), dtype=int)
    val_pred_gt[:,1:] = val_pred
    val_pred_gt[:,0] = gt[:, 0]
    return pd.DataFrame(val_pred_gt, columns=['ground truth','a', 'b', 'c'])

def preds2catids(predictions):
    return pd.DataFrame(np.argsort(-predictions, axis=1)[:, :3], columns=['a', 'b', 'c'])
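
Since every drawing has exactly one ground-truth class, the MAP@3 metric reduces to the mean of 1/rank of the correct class among the three guesses (or 0 if it is missing). The sketch below is a simplified re-implementation written for illustration, not the `apk`/`mapk` code from the linked repository; the labels are made up.

```python
import numpy as np


def average_precision_at_3(true_label, top3):
    """With a single true label, AP@3 is 1 / rank of the first hit, or 0 if missed."""
    for rank, pred in enumerate(top3[:3], start=1):
        if pred == true_label:
            return 1.0 / rank
    return 0.0


truth = [7, 7, 7, 7]
top3_preds = [[7, 1, 2], [1, 7, 2], [1, 2, 7], [1, 2, 3]]
scores = [average_precision_at_3(t, p) for t, p in zip(truth, top3_preds)]
map3_example = np.mean(scores)
print(scores, map3_example)
```

A correct first guess scores 1, a correct second guess 1/2, a correct third guess 1/3 and a miss 0.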

## Simple ConvNet

In [None]:
def custom_single_cnn(size, conv_layers=(8, 16, 32, 64), dense_layers=(512, 256), conv_dropout=0.2,
                      dense_dropout=0.2):
    model = Sequential()
    model.add(
        Conv2D(conv_layers[0], kernel_size=(3, 3), padding='same', activation='relu', input_shape=(size, size, 1)))
#     model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    if conv_dropout:
        model.add(Dropout(conv_dropout))

    for conv_layer_size in conv_layers[1:]:
        model.add(Conv2D(conv_layer_size, kernel_size=(3, 3), activation='relu'))
#         model.add(BatchNormalization())
        model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
        if conv_dropout:
            model.add(Dropout(conv_dropout))

    model.add(Flatten())

    for dense_layer_size in dense_layers:
        # the ReLU is already applied inside Dense, so a separate Activation('relu') layer is redundant
        model.add(Dense(dense_layer_size, activation='relu'))
        if dense_dropout:
            model.add(Dropout(dense_dropout))

    model.add(Dense(NCATS, activation='softmax'))
    return model

def top_3_accuracy(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=3)

In [None]:
STEPS = 500
size = 64
batchsize = 512

In [None]:
model = custom_single_cnn(size=size,
                          conv_layers=[128, 64],
                          dense_layers=[1024],
                          conv_dropout=False,
                          dense_dropout=0.25)
model.compile(optimizer=Adam(lr=0.0024), loss='categorical_crossentropy',
              metrics=[categorical_crossentropy, categorical_accuracy, top_3_accuracy])
print(model.summary())

## Training with Image Generator

In [None]:
def draw_cv2(raw_strokes, size=256, lw=6):
    img = np.zeros((BASE_SIZE, BASE_SIZE), np.uint8)
    for stroke in raw_strokes:
        for i in range(len(stroke[0]) - 1):
            _ = cv2.line(img, (stroke[0][i], stroke[1][i]), (stroke[0][i + 1], stroke[1][i + 1]), 255, lw)
    if size != BASE_SIZE:
        return cv2.resize(img, (size, size))
    else:
        return img


def image_generator(size, batchsize, ks, lw=6):
    while True:
        for k in np.random.permutation(ks):
            filename = os.path.join(DP_DIR, 'train_k{}.csv.gz'.format(k))
            for df in pd.read_csv(filename, chunksize=batchsize):
                df['drawing'] = df['drawing'].apply(ast.literal_eval)
                x = np.zeros((len(df), size, size))
                for i, raw_strokes in enumerate(df.drawing.values):
                    x[i] = draw_cv2(raw_strokes, size=size, lw=lw)
                x = x / 255.
                x = x.reshape((len(df), size, size, 1)).astype(np.float32)
                y = keras.utils.to_categorical(df.y, num_classes=NCATS)
                yield x, y

def df_to_image_array(df, size, lw=6):
    df['drawing'] = df['drawing'].apply(ast.literal_eval)
    x = np.zeros((len(df), size, size))
    for i, raw_strokes in enumerate(df.drawing.values):
        x[i] = draw_cv2(raw_strokes, size=size, lw=lw)
    x = x / 255.
    x = x.reshape((len(df), size, size, 1)).astype(np.float32)
    return x

In [None]:
valid_df = pd.read_csv(os.path.join(DP_DIR, 'train_k{}.csv.gz'.format(NCSVS - 1)), nrows=10**5)
x_valid = df_to_image_array(valid_df, size)
y_valid = keras.utils.to_categorical(valid_df.y, num_classes=NCATS)

cats = list_all_categories()
id2cat = {k: cat.replace(' ', '_') for k, cat in enumerate(cats)}
y_valid_in_words = []
for i in valid_df[['y']].values:
    y_valid_in_words.append(id2cat[i[0]])
#print(y_valid_in_words)
print(x_valid.shape, y_valid.shape)
print('Validation array memory {:.2f} GB'.format(x_valid.nbytes / 1024.**3 ))

In [None]:
train_datagen = image_generator(size=size, batchsize=batchsize, ks=range(NCSVS - 1))

In [None]:
callbacks = [
    EarlyStopping(monitor='val_categorical_accuracy', patience=7, min_delta=0.001, mode='max'),
    ReduceLROnPlateau(monitor='val_categorical_accuracy', factor=0.5, patience=5, min_delta=0.005,
                      mode='max', cooldown=3)
]
hist = model.fit_generator(
    train_datagen, steps_per_epoch=STEPS, epochs=100, verbose=1,
    validation_data=(x_valid, y_valid),
    callbacks = callbacks
)

In [None]:
hist_df = pd.DataFrame(hist.history)
fig, axs = plt.subplots(nrows=2, sharex=True, figsize=(16, 10))
axs[0].plot(hist_df.val_categorical_accuracy, lw=5, label='Validation Accuracy')
axs[0].plot(hist_df.categorical_accuracy, lw=5, label='Training Accuracy')
axs[0].set_ylabel('Accuracy')
axs[0].set_xlabel('Epoch')
axs[0].grid()
axs[0].legend(loc=0)
axs[1].plot(hist_df.val_categorical_crossentropy, lw=5, label='Validation MLogLoss')
axs[1].plot(hist_df.categorical_crossentropy, lw=5, label='Training MLogLoss')
axs[1].set_ylabel('MLogLoss')
axs[1].set_xlabel('Epoch')
axs[1].grid()
axs[1].legend(loc=0)
fig.savefig('hist.png', dpi=300)
plt.show();

In [None]:
valid_predictions = model.predict(x_valid, batch_size=128, verbose=1)
top3_val = preds2catids_val(valid_predictions, valid_df[['y']].values)
newcol = valid_df[['countrycode']].values[:, 0]
top3_cats_fin = top3_val.assign(country = newcol)

# comparison of classification accuracy based on country
tp_country = 0
first = True
five_correct_ex = 0
five_wrong_ex = 0
fig, axs = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True, figsize=(15, 15))
for i, row in top3_cats_fin.iterrows():
    if row['ground truth'] == row['a'] or row['ground truth'] == row['b'] or row['ground truth'] == row['c']:
        tp_country = 1
        # show 5 examples with correct prediction
        if five_correct_ex < 5:
            ax = axs[0, five_correct_ex % 5]
            ax.imshow(x_valid[i, :, :, 0], cmap=plt.cm.gray)
            name = id2cat[row['ground truth']]
            ax.set_xlabel(name)
            ax.set_ylabel('Correct predict.')
            five_correct_ex += 1
    else:
        # show 5 examples with wrong prediction
        if five_wrong_ex < 5:
            ax = axs[1, five_wrong_ex % 5]
            ax.imshow(x_valid[i, :, :, 0], cmap=plt.cm.gray)
            name_gt = id2cat[row['ground truth']]
            name_a = id2cat[row['a']]
            name_b = id2cat[row['b']]
            name_c = id2cat[row['c']]
            name = 'GT: ' + name_gt + '\n' + name_a + ' ' + name_b + ' ' + name_c
            ax.set_xlabel(name)
            ax.set_ylabel('Wrong predict.')
            five_wrong_ex += 1
        
    if first:
        d = {'Accuracy': [tp_country], '# of imgs': [1]}
        result_country = pd.DataFrame(data=d, index=[row['country']])
        first = False
    else:
        if row['country'] in result_country.index:
            result_country.at[row['country'], 'Accuracy'] += tp_country
            result_country.at[row['country'], '# of imgs'] += 1
        else:
            d = {'Accuracy': [tp_country], '# of imgs': [1]}
            result_con = pd.DataFrame(data=d, index=[row['country']])
            result_country = result_country.append(result_con)
    tp_country = 0
plt.tight_layout()
plt.show();
plt.clf()
# result_country = result_country[result_country['# of imgs'] < 30000]
TP_p_con = result_country['Accuracy']/result_country['# of imgs']
result_country['Accuracy'] = TP_p_con
result_country.nsmallest(result_country.shape[0], 'Accuracy')
plt.plot(result_country['Accuracy'], result_country['# of imgs'], 'r+')
plt.xlabel('Accuracy per country')
plt.ylabel('Number of images per country')
plt.title('How many images per country vs accuracy per country. Data gathered from validation set.')

# zoomed-in version of the plot: keep only countries with fewer than 6000 validation images.
# 'Accuracy' already holds the per-country fraction here, so it must not be divided again.
result_country = result_country[result_country['# of imgs'] < 6000].copy()
result_country.nsmallest(result_country.shape[0], 'Accuracy')
plt.figure()
plt.plot(result_country['Accuracy'], result_country['# of imgs'], 'r+')
plt.xlabel('Accuracy per country')
plt.ylabel('Number of images per country')
plt.title('How many images per country vs accuracy per country. A zoomed in version of the plot.')
plt.show();

        
        
# comparison of classification accuracy based on class
first = True
for i in range(len(id2cat)):
    temp = top3_val.loc[top3_val['ground truth'] == i]
    temp_2 = temp.loc[(temp['a'] == i) | (temp['b'] == i) | (temp['c'] == i)]
    tp = temp_2.shape[0]/temp.shape[0]
    d = {'Class': [id2cat[i]], 'Accuracy': [tp], '# of imgs': [temp.shape[0]]}
    if first:
        result_class = pd.DataFrame(data=d)
        first = False
    else:
        result_cl = pd.DataFrame(data=d)
        result_class = result_class.append(result_cl)
result_class.nsmallest(len(id2cat), 'Accuracy')

map3 = mapk(valid_df[['y']].values, preds2catids(valid_predictions).values)
print('Map3: {:.3f}'.format(map3))

## Exercise 3. Breaking down the results of the classifier.
Since the notebook can only be updated by committing the code, and since the network results may vary slightly even when the same model is retrained, the results from version 24 are analysed here. In this version, the input image size was doubled in both dimensions and an additional layer was added to the network.

Table 1 shows the classification accuracy for each country, where a prediction counts as correct if the ground truth is among the top three guesses. The plot below shows the accuracy plotted against the number of images for every country, except for a single outlier that had more than 30,000 images and would otherwise have made the plot hard to read (its accuracy was close to the mean). Each red cross represents a country. When plotted, the data seems to form a Gaussian distribution. Even if some countries achieved an accuracy above the mean, this could be due to chance given their low sample size.
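
The per-country accuracy, and how much it is expected to fluctuate by chance for small sample sizes, can also be computed with a `groupby`. The sketch below uses a made-up toy table instead of `top3_cats_fin`; the column names `hit` and `expected_std` are assumptions for this example. The expected spread is the binomial standard deviation that would be seen if every country had the same true accuracy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the validation results: one row per drawing,
# 'hit' is 1 if the ground truth was among the top three predictions.
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'GB', 'GB', 'IS'],
    'hit':     [1,    1,    0,    1,    1,    0,    1],
})

per_country = toy.groupby('country')['hit'].agg(accuracy='mean', n_imgs='count')

# spread expected by chance alone if every country shared the overall accuracy
overall_acc = toy['hit'].mean()
per_country['expected_std'] = np.sqrt(overall_acc * (1 - overall_acc) / per_country['n_imgs'])
print(per_country.sort_values('accuracy'))
```

Countries with only a handful of images have a large expected spread, so an accuracy far from the mean is not surprising for them.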

In Table 2, accuracy is calculated for every category and the results are sorted from the lowest accuracy to the highest. As can be observed, categories such as cooler and garden hose have the lowest accuracy, while categories such as ladder and rainbow have the highest. This result might indicate that participants lack a clear mental picture of some objects. For example, if asked to draw a ladder, most people would probably draw two parallel lines with some perpendicular lines in between. Drawing a cooler is not as straightforward, because different people might not have the same mental picture of the object, so there might be more variety in the drawings. In addition, some people might find it difficult to draw complex objects, such as a garden hose, with a relatively simple drawing tool and average drawing skills.

Furthermore, five correctly classified images are shown next to five images for which classification failed. Below, possible reasons are given for why the model failed to classify each of the five latter images:

1. The image belongs to the peanut category and the model suggested that it might be a tornado, tennis racquet or a blackberry. The reason for such a mistake could be the shape of the drawing, which is not that clear and can be mistaken for a lot of other objects. The image also contains a lot of noise.

2. A participant was asked to draw a garden and the model classified it as rollerskates, a computer or a bulldozer. The reason behind the classification mistake might be that “garden” is a very broad term and different people may interpret it differently. Therefore, if there is no clear definition of the word, the drawings for that word might vary a lot.

3. The ground truth for the drawing is “fireplace” and it was misclassified as being a lantern, passport or dishwasher. This might have happened because the drawing seems to be rotated and occupies only part of the screen, thus resembling a small object.

4. The image class is candle and it was classified as being a hedgehog, campfire or bush. The reason might be that the drawing is very noisy, making it difficult to classify correctly.

5. The participant was asked to draw a television. However, the drawing was classified as a map, fireplace or sandwich. This might have happened because of the pattern in the middle of the drawing, which is perhaps more likely to be seen in other objects and may therefore have caused confusion.


## Create Submission

In [None]:
test = pd.read_csv(os.path.join(INPUT_DIR, 'test_simplified.csv'))
test.head()
x_test = df_to_image_array(test, size)
print(test.shape, x_test.shape)
print('Test array memory {:.2f} GB'.format(x_test.nbytes / 1024.**3 ))

In [None]:
test_predictions = model.predict(x_test, batch_size=128, verbose=1)

In [None]:
top3 = preds2catids(test_predictions)
top3.head()
top3.shape

In [None]:
cats = list_all_categories()
id2cat = {k: cat.replace(' ', '_') for k, cat in enumerate(cats)}
top3cats = top3.replace(id2cat)
top3cats.head()
top3cats.shape

In [None]:
test['word'] = top3cats['a'] + ' ' + top3cats['b'] + ' ' + top3cats['c']
submission = test[['key_id', 'word']]
submission.to_csv('bw_cnn_submission_{}.csv'.format(int(map3 * 10**4)), index=False)
submission.head()
submission.shape

In [None]:
end = dt.datetime.now()
print('Latest run {}.\nTotal time {}s'.format(end, (end - start).seconds))