# Mini-Project RVGIS by Oliver G. H.
**1. Explore the dataset, e.g. using the kernel provided at https://www.kaggle.com/gaborfodor/how-to-draw-an-owl-baseline-lb-0-002**
**a.** The dataset contains 112163 images, each available in both a full and a simplified format (simplified meaning that not every recorded point of a stroke was kept).
**b.** There are 340 classes.
**c.** Found under In[9]
**d.** Classifiers tend to struggle when the data is heavily skewed. With two classes of 850 and 900 images the issue is small, but with 100 classes where 50% of the data comes from a single class it becomes a real problem: the network sees the features of that one class far more often than the others and becomes much more likely to guess that class.
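
To make the point about skew concrete, here is a minimal sketch using made-up label lists that mirror the two situations described above. The helper name `inverse_frequency_weights` and the toy data are assumptions for illustration, not part of the original kernel.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Mild imbalance: two classes with 850 and 900 samples (toy data).
mild = ["a"] * 850 + ["b"] * 900
weights_mild = inverse_frequency_weights(mild)

# Heavy imbalance: 100 classes, class 0 holds half of all samples (toy data).
skewed = [0] * 9900 + [c for c in range(1, 100) for _ in range(100)]
weights_skewed = inverse_frequency_weights(skewed)

# Accuracy of a model that always guesses the most common class.
majority_baseline = Counter(skewed).most_common(1)[0][1] / len(skewed)

print(weights_mild)
print(weights_skewed[0], weights_skewed[1], majority_baseline)
```

In the skewed case a classifier that always answers with class 0 is already right half the time, which is why the weights differ by a factor of roughly 100. A dictionary like this could in principle be passed as `class_weight` when fitting a Keras model, although that was not tried in this notebook.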

## Setup
Import the necessary libraries and a few helper functions.

In [None]:
%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import os
import ast
import datetime as dt
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 10]
plt.rcParams['font.size'] = 14
import seaborn as sns
import cv2
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Input, Reshape, ZeroPadding2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, BatchNormalization
from tensorflow.keras.metrics import categorical_accuracy, top_k_categorical_accuracy, categorical_crossentropy
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.applications.mobilenet import preprocess_input
start = dt.datetime.now()

In [None]:
DP_DIR = '../input/shuffle-csvs/'
INPUT_DIR = '../input/quickdraw-doodle-recognition/'

BASE_SIZE = 256
NCSVS = 100
NCATS = 340
np.random.seed(seed=1987)
tf.set_random_seed(seed=1987)

def f2cat(filename: str) -> str:
    return filename.split('.')[0]

def list_all_categories():
    files = os.listdir(os.path.join(INPUT_DIR, 'train_simplified'))
    return sorted([f2cat(f) for f in files], key=str.lower)

In [None]:
def apk(actual, predicted, k=3):
    """
    Source: https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
    """
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    # len() also works for NumPy rows; `not actual` would treat the single label 0 as empty
    if len(actual) == 0:
        return 0.0
    return score / min(len(actual), k)

def mapk(actual, predicted, k=3):
    """
    Source: https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py
    """
    return np.mean([apk(a, p, k) for a, p in zip(actual, predicted)])

def preds2catids(predictions):
    return pd.DataFrame(np.argsort(-predictions, axis=1)[:, :3], columns=['a', 'b', 'c'])

def top_3_accuracy(y_true, y_pred):
    return top_k_categorical_accuracy(y_true, y_pred, k=3)
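
As a worked example of the MAP@3 metric computed by the helpers above: when every drawing has exactly one true label, the average precision reduces to 1/rank of the correct guess, or 0 if it is not among the top three. The sketch below is a simplified stand-in for that special case only; the names `apk_single` and `mapk_single` are made up for illustration.

```python
def apk_single(true_label, ranked_guesses, k=3):
    """Average precision at k when each sample has exactly one true label."""
    for rank, guess in enumerate(ranked_guesses[:k], start=1):
        if guess == true_label:
            return 1.0 / rank
    return 0.0

def mapk_single(true_labels, ranked_guesses_list, k=3):
    """Mean of the per-sample scores."""
    scores = [apk_single(t, g, k) for t, g in zip(true_labels, ranked_guesses_list)]
    return sum(scores) / len(scores)

truth = [5, 5, 5, 5]
guesses = [
    [5, 1, 2],  # correct at rank 1 -> 1.0
    [1, 5, 2],  # correct at rank 2 -> 0.5
    [1, 2, 5],  # correct at rank 3 -> 1/3
    [1, 2, 3],  # not in the top 3  -> 0.0
]
score = mapk_single(truth, guesses)
print(round(score, 4))  # (1 + 0.5 + 1/3 + 0) / 4
```

This also shows why MAP@3 always lies between categorical accuracy and top-3 accuracy: a hit at rank 2 or 3 counts, but only partially.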

## MobileNet

MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build light weight deep neural networks.

[MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/pdf/1704.04861.pdf)
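
To illustrate why depthwise separable convolutions are lightweight, the following sketch compares weight counts for a single layer. Biases are ignored, and the layer sizes (3x3 kernel, 256 input channels, 512 output channels) are arbitrary example values rather than a specific MobileNet layer.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise

standard = standard_conv_params(3, 256, 512)
separable = depthwise_separable_params(3, 256, 512)
ratio = separable / standard

print(standard, separable, round(ratio, 3))
```

The ratio equals 1/c_out + 1/k², so for 3x3 kernels the separable version needs roughly a ninth of the weights.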

In [None]:
STEPS = 800
EPOCHS = 16
size = 64
batchsize = 680
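
The settings above fix how much data the network sees and how long an epoch may take. The arithmetic can be sketched as follows, assuming the 21600-second session limit mentioned in the experiment log; the constant and function names are chosen here for illustration.

```python
STEPS = 800
BATCHSIZE = 680
EPOCHS = 16
SESSION_LIMIT_S = 21600  # session limit quoted in the experiment log

images_per_epoch = STEPS * BATCHSIZE
images_total = images_per_epoch * EPOCHS

# Longest an epoch may take if all 16 epochs must fit in one session.
max_seconds_per_epoch = SESSION_LIMIT_S / EPOCHS

def epochs_within_budget(seconds_per_epoch, budget=SESSION_LIMIT_S):
    """Number of whole epochs that fit into the session."""
    return int(budget // seconds_per_epoch)

print(images_per_epoch, images_total, max_seconds_per_epoch)
print(epochs_within_budget(3000))
```

At 3000 seconds per epoch only 7 epochs fit into a session, which is why that configuration was abandoned in the experiments.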

# 2. Our Own Model
Here is where I create my own model, discarding the MobileNet model and trying something new.
1. I started with a simple model consisting of two convolutional layers with 64 filters each and 3x3 kernels, a max pooling layer, and 0.25 dropout. After a single epoch of 800 steps, the model reached a top-3 accuracy of 0.6269 and a categorical accuracy of 0.4450, which is quite impressive for such a simple network.
2. However, two stacked 3x3 kernels cover only a small area of the images we feed the network (a 5x5 receptive field). Therefore, I decided to increase the network depth so that it can pick up on larger-scale features. I tested a model with twice the depth, in which the two new convolutional layers had 128 and 256 filters, respectively. This nearly doubled the training time, but it also improved the results significantly, to a categorical accuracy of 0.5661 and a top-3 accuracy of 0.7421.
3. Since additional depth clearly helped, I changed the last two layers to 128 filters each and added another block of two convolutional layers with 256 filters, max pooling, and dropout. I also added a final dropout of 0.5 just before the dense layer to force the network to keep training its weaker nodes. This improved the network, but not as much as hoped or expected. The new model achieved a categorical accuracy of 0.6205 and a top-3 accuracy of 0.7919.
4. I tested another model with an additional convolutional layer, but the accuracy dropped to less than 0.01, which suggests that the filters grew too encompassing and could no longer distinguish between classes. Instead, I decided to test the impact of strides on training time and accuracy. After adding a stride of 2 to the second convolutional layer of every block, the accuracy dropped slightly, to a categorical accuracy of 0.5415 and a top-3 accuracy of 0.7410. The training time dropped to just under half, which is very significant.
5. Now that the training time was acceptable, I tested the model with double the number of filters in every convolutional layer. This improved the accuracy to 0.5935 categorical and 0.7846 top-3. However, it tripled the training time to 3000s for a single epoch. Since the session only allows 21600 seconds, this was far too much.
6. Instead, I reduced the sizes again, so that the convolutional layers followed this pattern: 32->64->Max+Drop->128->128->Max+Drop->256->512->Max+Drop. The resulting loss in accuracy was too small to outweigh the time improvement. In total, running the script took 1158s and yielded a categorical accuracy of 0.5623 and a top-3 accuracy of 0.7569. Even at the end, the training accuracy was still increasing steadily, and there was a rather large gap between training and validation accuracy: the last training iteration had a top-3 accuracy of 0.6367, more than 0.1 below the validation accuracy, which indicates plenty of room for improvement.
7. With this new information, I decided to go back to 16 epochs, as in the original notebook, to compare this model with the Greyscale MobileNet on equal footing. One issue that can arise here is that the training time is simply still too long. If so, there are options to bring it down, but I do not expect that to be necessary. Instead, I believe the model may have room to grow, as it currently has roughly 1/3 of the parameters of the MobileNet. The resulting network reached a categorical accuracy of 0.7052 and a top-3 accuracy of 0.8523.
8. Since the model trained so quickly (roughly 17000s), I decided to expand it a bit. The first layer was changed to 64 filters, and the stride was moved to the first convolutional layer of each block. In the last block, I removed strides altogether and added an additional max pooling layer instead. On the strided layers, 'same' padding was replaced with 'valid' padding so that the image size shrinks accordingly; for this to work, I added a zero padding layer before them. This brought the training time down to roughly 15000s and increased the categorical accuracy to 0.7205 and the top-3 accuracy to 0.8730.
9. I noticed that after the final max pooling layer, the resulting feature map had size (2,2). This means that the final set of parameters was significantly inflated, and by reducing it I could increase the size of each convolutional layer. To do this, I simply reduced the kernel size of the last two convolutional layers to (2,2) and doubled each layer size, resulting in a final layer size of 1024. This gave a model with just over 5 million parameters. It scored a top-3 accuracy of 0.8755, still short of 0.9.
10. The model trained very quickly, at roughly 400ms per step, but since doubling the size did not improve it, I figured the major bottleneck was still the depth. However, the output size is currently (2,2), so I need to keep this size roughly the same. I removed the stride from the convolutional layer in the second block, copied the block, and doubled the layer sizes. Then I doubled the size of the first layer in the last block to 1024. The first three blocks now each consist of a zero padding layer, two convolutional layers (of size 128, 256, and 512, one size per block), and (2,2) max pooling. The second convolutional layer of each block uses 'same' padding, so each block outputs a power-of-two size (16, 8, 4) after max pooling. The final block uses only 'valid' padding, so the output size is still (2,2). However, this model did not improve after an entire epoch because of the learning rate. After plenty of tests, I concluded that almost no information was left after the two final layers. I removed the final layer, and after increasing the learning rate back to 0.005, the model started improving again.
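
The reasoning in step 2, that two 3x3 kernels cover only a small part of the image, can be checked with the standard receptive-field recurrence. This is a minimal sketch; the deeper stack below is a hypothetical layout for illustration, since the exact layer order of each experiment is not listed.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # strides and pooling spread later layers further apart
    return rf

conv3 = (3, 1)  # 3x3 convolution, stride 1
pool2 = (2, 2)  # 2x2 max pooling, stride 2

shallow = receptive_field([conv3, conv3])
deeper = receptive_field([conv3, conv3, pool2, conv3, conv3])  # hypothetical stack

print(shallow, deeper)
```

Two stacked 3x3 convolutions see a 5x5 patch, while adding a pooling layer and two more convolutions widens that to 14x14, which is consistent with the accuracy gain from added depth on 64x64 inputs.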

At this point Kaggle stopped working for me on either of my PCs. Commits would never finish and instead raised an interrupt error, and running the code would either fail to train the network or the network would suddenly fall back to "random chance" accuracy. Therefore, I am handing in the assignment as it is. The best accuracy I achieved remains that of the network described in step 9.

<small>Here's to hoping the text changes will commit and I won't have to rewrite</small>

In [None]:
model = Sequential()

model.add(ZeroPadding2D(input_shape=(size,size,1)))
model.add(Conv2D(128, (3,3), strides = 2, padding='valid', activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(128, (3,3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(ZeroPadding2D())
model.add(Conv2D(256, (3,3), padding='valid', activation='relu'))
model.add(Conv2D(256, (3,3), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(ZeroPadding2D())
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(Conv2D(512, (3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(1024, (2,2), activation='relu'))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(NCATS, activation='softmax'))

model.compile(optimizer=Adam(lr=0.005), loss='categorical_crossentropy', metrics=[categorical_crossentropy, categorical_accuracy, top_3_accuracy])

# summary() prints the table itself and returns None, so no print() is needed
model.summary()
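
As a sanity check on the (2,2) output size discussed in the experiment log, the spatial size can be traced through the layers of the model defined above by hand. The sketch below assumes the usual output-size rules ('valid': floor((n - k) / s) + 1, 'same': ceil(n / s)) and the default of one pixel of padding per side for `ZeroPadding2D`; the layer table is transcribed manually from the cell above.

```python
import math

def conv_out(n, kernel, stride=1, padding="valid"):
    """Output size along one axis for a convolution or pooling layer."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel) // stride + 1

# (name, kernel, stride, padding); "pad" adds one pixel on each side.
LAYERS = [
    ("pad", None, None, None),
    ("conv", 3, 2, "valid"),
    ("conv", 3, 1, "same"),
    ("pool", 2, 2, "valid"),
    ("pad", None, None, None),
    ("conv", 3, 1, "valid"),
    ("conv", 3, 1, "same"),
    ("pool", 2, 2, "valid"),
    ("pad", None, None, None),
    ("conv", 3, 1, "valid"),
    ("conv", 3, 1, "valid"),
    ("pool", 2, 2, "valid"),
    ("conv", 2, 1, "valid"),
]

def trace_sizes(n):
    """Return the spatial size after every layer for an n x n input."""
    sizes = []
    for name, kernel, stride, padding in LAYERS:
        n = n + 2 if name == "pad" else conv_out(n, kernel, stride, padding)
        sizes.append(n)
    return sizes

sizes = trace_sizes(64)
flat_features = sizes[-1] ** 2 * 1024      # Flatten after the 1024-filter layer
dense_params = flat_features * 340 + 340   # weights plus biases of the softmax layer

print(sizes)
print(flat_features, dense_params)
```

Under these assumptions a 64x64 input ends at 2x2, so the flattened vector has 4096 features and the dense layer accounts for about 1.4 million parameters.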

## Training with Image Generator

In [None]:
def draw_cv2(raw_strokes, size=256, lw=6, time_color=True):
    img = np.zeros((BASE_SIZE, BASE_SIZE), np.uint8)
    for t, stroke in enumerate(raw_strokes):
        for i in range(len(stroke[0]) - 1):
            color = 255 - min(t, 10) * 13 if time_color else 255
            _ = cv2.line(img, (stroke[0][i], stroke[1][i]),
                         (stroke[0][i + 1], stroke[1][i + 1]), color, lw)
    if size != BASE_SIZE:
        return cv2.resize(img, (size, size))
    else:
        return img

def image_generator_xd(size, batchsize, ks, lw=6, time_color=True):
    while True:
        for k in np.random.permutation(ks):
            filename = os.path.join(DP_DIR, 'train_k{}.csv.gz'.format(k))
            for df in pd.read_csv(filename, chunksize=batchsize):
                df['drawing'] = df['drawing'].apply(ast.literal_eval)
                x = np.zeros((len(df), size, size, 1))
                for i, raw_strokes in enumerate(df.drawing.values):
                    x[i, :, :, 0] = draw_cv2(raw_strokes, size=size, lw=lw,
                                             time_color=time_color)
                x = preprocess_input(x).astype(np.float32)
                y = keras.utils.to_categorical(df.y, num_classes=NCATS)
                yield x, y

def df_to_image_array_xd(df, size, lw=6, time_color=True):
    df['drawing'] = df['drawing'].apply(ast.literal_eval)
    x = np.zeros((len(df), size, size, 1))
    for i, raw_strokes in enumerate(df.drawing.values):
        x[i, :, :, 0] = draw_cv2(raw_strokes, size=size, lw=lw, time_color=time_color)
    x = preprocess_input(x).astype(np.float32)
    return x
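
The grey value that `draw_cv2` assigns to each stroke, and the scaling applied afterwards, can be written out explicitly. This is a small sketch: `stroke_gray` copies the colour expression from `draw_cv2`, while `scale_to_unit_range` assumes that the MobileNet `preprocess_input` computes x / 127.5 - 1. Both function names are made up here.

```python
def stroke_gray(t, time_color=True):
    """Grey level of the t-th stroke: later strokes are darker, capped at stroke 10."""
    return 255 - min(t, 10) * 13 if time_color else 255

def scale_to_unit_range(pixel):
    """Map a pixel value in [0, 255] to [-1, 1] (assumed MobileNet-style scaling)."""
    return pixel / 127.5 - 1.0

grays = [stroke_gray(t) for t in range(12)]
scaled = [round(scale_to_unit_range(g), 3) for g in grays]

print(grays)
print(scaled)
print(scale_to_unit_range(0), scale_to_unit_range(255))
```

The first stroke is drawn at full intensity (255), each later stroke is 13 levels darker, and from the eleventh stroke on the value stays at 125, so the stroke order is encoded in the pixel intensity while the empty background maps to -1.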

In [None]:
valid_df = pd.read_csv(os.path.join(DP_DIR, 'train_k{}.csv.gz'.format(NCSVS - 1)), nrows=34000)
x_valid = df_to_image_array_xd(valid_df, size)
y_valid = keras.utils.to_categorical(valid_df.y, num_classes=NCATS)
print(x_valid.shape, y_valid.shape)
print('Validation array memory {:.2f} GB'.format(x_valid.nbytes / 1024.**3 ))

In [None]:
train_datagen = image_generator_xd(size=size, batchsize=batchsize, ks=range(NCSVS - 1))

In [None]:
x, y = next(train_datagen)
n = 8
fig, axs = plt.subplots(nrows=n, ncols=n, sharex=True, sharey=True, figsize=(12, 12))
for i in range(n**2):
    ax = axs[i // n, i % n]
    # Undo the [-1, 1] scaling and invert it, so strokes appear dark on a light background
    ax.imshow((-x[i, :, :, 0] + 1)/2, cmap=plt.cm.gray)
    ax.axis('off')
plt.tight_layout()
fig.savefig('gs.png', dpi=300)
plt.show();

In [None]:
callbacks = [
    ReduceLROnPlateau(monitor='val_categorical_accuracy', factor=0.5, patience=5,
                      min_delta=0.005, mode='max', cooldown=3, verbose=1)
]
hists = []
hist = model.fit_generator(
    train_datagen, steps_per_epoch=STEPS, epochs=EPOCHS, verbose=1,
    validation_data=(x_valid, y_valid),
    callbacks = callbacks
)
hists.append(hist)
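
To get a feeling for how the `ReduceLROnPlateau` settings above behave, the rule can be approximated in plain Python. This is a simplified model of the callback's logic for `mode='max'`, written for illustration only; it is not the Keras implementation, and the accuracy history is made up.

```python
def simulate_reduce_on_plateau(history, lr, factor=0.5, patience=5,
                               min_delta=0.005, cooldown=3):
    """Return the learning rate in effect after each epoch (simplified model)."""
    best = float("-inf")
    wait = 0
    cooldown_left = 0
    lrs = []
    for value in history:
        if cooldown_left > 0:
            cooldown_left -= 1
            wait = 0
        if value > best + min_delta:   # counts as an improvement
            best = value
            wait = 0
        elif cooldown_left == 0:       # stagnating and not cooling down
            wait += 1
            if wait >= patience:
                lr *= factor
                cooldown_left = cooldown
                wait = 0
        lrs.append(lr)
    return lrs

# Made-up validation accuracies: one improvement, then a plateau.
history = [0.50, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60]
lrs = simulate_reduce_on_plateau(history, lr=0.005)
print(lrs)
```

With `patience=5` and `min_delta=0.005`, the validation accuracy has to stagnate for five consecutive epochs before the learning rate is halved, so within a 16-epoch run only a few reductions are possible.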

In [None]:
hist = model.fit_generator(
    train_datagen, steps_per_epoch=STEPS, epochs=EPOCHS, verbose=1,
    validation_data=(x_valid, y_valid),
    callbacks = callbacks
)
hists.append(hist)

In [None]:
hist = model.fit_generator(
    train_datagen, steps_per_epoch=STEPS, epochs=EPOCHS, verbose=1,
    validation_data=(x_valid, y_valid),
    callbacks = callbacks
)
hists.append(hist)

In [None]:
hist_df = pd.concat([pd.DataFrame(hist.history) for hist in hists], sort=True)
hist_df.index = np.arange(1, len(hist_df)+1)
fig, axs = plt.subplots(nrows=2, sharex=True, figsize=(16, 10))
axs[0].plot(hist_df.val_categorical_accuracy, lw=5, label='Validation Accuracy')
axs[0].plot(hist_df.categorical_accuracy, lw=5, label='Training Accuracy')
axs[0].set_ylabel('Accuracy')
axs[0].set_xlabel('Epoch')
axs[0].grid()
axs[0].legend(loc=0)
axs[1].plot(hist_df.val_categorical_crossentropy, lw=5, label='Validation MLogLoss')
axs[1].plot(hist_df.categorical_crossentropy, lw=5, label='Training MLogLoss')
axs[1].set_ylabel('MLogLoss')
axs[1].set_xlabel('Epoch')
axs[1].grid()
axs[1].legend(loc=0)
fig.savefig('hist.png', dpi=300)
plt.show();

In [None]:
valid_predictions = model.predict(x_valid, batch_size=128, verbose=1)
map3 = mapk(valid_df[['y']].values, preds2catids(valid_predictions).values)
print('Map3: {:.3f}'.format(map3))

# 3. Break down the results of your classifier
After nearly 50 epochs, every network (before the major issues started) was still learning at a reasonable rate, meaning there was room to improve for a long time. Also, training accuracy was consistently lower than validation accuracy by nearly 0.1. Since dropout is only active during training, this may indicate that the measures I took to avoid overfitting were a bit too aggressive, but since Kaggle stopped working properly for me I can't validate this.
One of the major issues I faced was that I did not want to discard too much information "arbitrarily", so I tried to avoid taking large strides through the data. That is impossible without increasing the number of parameters, which leads to a very long training time. A long training time means fewer iterations and thus less room to experiment and improve.
Additionally, many of the networks that followed would either refuse to improve because the learning rate was too low, or suddenly improve massively in one aspect and not in any other when the learning rate was increased.


There were plenty of steps I had noted down that I wanted to test. The final dropout seemed a bit too aggressive after so much dropout and batch normalization, and I wanted to test what removing it entirely would do for the training accuracy. Additionally, I would like to test the same model without the "decay" in color values, to see whether a more classical CNN approach, where the image is fed in as is, would work better or at least as well.

I also wanted to do some more in-depth analysis of the results, such as testing which categories were easier and harder to distinguish, and whether larger or smaller data had any influence on the network's ability to guess correctly. However, since Kaggle no longer trains my network correctly (even when using older, tested and proven models), this has become impossible.
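
The per-category analysis described above could look roughly like the following sketch. It was not run on the real predictions; the label lists are toy data and the helper name `per_class_accuracy` is an assumption. In the notebook, the inputs would be the true category ids and the top-1 predicted ids for the validation set.

```python
from collections import defaultdict

def per_class_accuracy(true_ids, predicted_ids):
    """Return {class_id: fraction of samples of that class predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(true_ids, predicted_ids):
        totals[t] += 1
        hits[t] += int(t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Toy data standing in for validation labels and top-1 predictions.
true_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
pred_ids = [0, 0, 1, 1, 1, 0, 2, 0, 1]

acc = per_class_accuracy(true_ids, pred_ids)
hardest_first = sorted(acc, key=acc.get)

print(acc)
print(hardest_first)
```

Sorting the categories by this score would show which classes are hardest, and comparing the score with the number of samples per class would show whether the amount of data matters.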

## Create Submission

In [None]:
test = pd.read_csv(os.path.join(INPUT_DIR, 'test_simplified.csv'))
test.head()
x_test = df_to_image_array_xd(test, size)
print(test.shape, x_test.shape)
print('Test array memory {:.2f} GB'.format(x_test.nbytes / 1024.**3 ))

In [None]:
test_predictions = model.predict(x_test, batch_size=128, verbose=1)

top3 = preds2catids(test_predictions)
top3.head()
top3.shape

cats = list_all_categories()
id2cat = {k: cat.replace(' ', '_') for k, cat in enumerate(cats)}
top3cats = top3.replace(id2cat)
top3cats.head()
top3cats.shape

In [None]:
test['word'] = top3cats['a'] + ' ' + top3cats['b'] + ' ' + top3cats['c']
submission = test[['key_id', 'word']]
submission.to_csv('gs_mn_submission_{}.csv'.format(int(map3 * 10**4)), index=False)
submission.head()
submission.shape
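
The formatting step above can be illustrated without pandas. Spaces inside category names are replaced by underscores because the space character separates the three guesses in the `word` column. The category list below is a small hypothetical subset, and `format_submission_words` is a made-up helper name.

```python
def format_submission_words(top3_ids, categories):
    """Turn rows of three category ids into space-separated, underscore-joined names."""
    id2cat = {i: cat.replace(" ", "_") for i, cat in enumerate(categories)}
    return [" ".join(id2cat[i] for i in row) for row in top3_ids]

# Hypothetical subset, sorted case-insensitively like list_all_categories().
categories = sorted(["cat", "The Eiffel Tower", "alarm clock", "bee"], key=str.lower)

top3_ids = [[3, 0, 1], [2, 1, 0]]
words = format_submission_words(top3_ids, categories)
print(words)
```

Each entry therefore always splits into exactly three guesses, even for multi-word categories.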

In [None]:
end = dt.datetime.now()
print('Latest run {}.\nTotal time {}s'.format(end, (end - start).seconds))