# Deep learning

Deep learning refers to neural networks with many layers and a large number of parameters. Training them can be slow, so be prepared to wait if you have a low-end PC.


**First things first, download the [HASYv2](https://zenodo.org/record/259444#.WcZjIZ8xDCI) dataset into the same folder as this notebook, and extract it.**

In [None]:
# Python2 compatibility
from __future__ import print_function

import numpy as np
import pandas as pd

In [1]:
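import os

# Note: scipy.misc.imread is not available in newer SciPy releases;
# imageio or Pillow can be used to read the images there instead.
from scipy.misc import imread

# Assumption: the dataset was extracted into the same folder as this notebook
folder = "."

def read_data(folder):
    # One row per image, including the "path" and "symbol_id" columns
    y = pd.read_csv(os.path.join(folder, 'hasy-data-labels.csv'))
    samples = []
    for filename in y["path"]:
        # Read each image as a 32x32 grayscale array
        img = imread(os.path.join(folder, filename), mode='L')
        samples.append(img)
    X = np.stack(samples)
    return X, y

X, y = read_data(folder)
print(X.shape, y.shape)
# X.shape should be (168233, 32, 32); y should have 168233 rows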

In [None]:
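from sklearn.utils import shuffle

def split_data(X, y):
    # Shuffle images and labels together, then split 80% / 10% / 10%
    X, y = shuffle(X, y)
    labels = y["symbol_id"].values
    train_end = int(X.shape[0] * 0.8)
    val_end = int(X.shape[0] * 0.9)
    X_train, y_train = X[:train_end], labels[:train_end]
    X_val, y_val = X[train_end:val_end], labels[train_end:val_end]
    X_test, y_test = X[val_end:], labels[val_end:]
    return X_train, y_train, X_val, y_val, X_test, y_test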
    
X_train, y_train, X_val, y_val, X_test, y_test = split_data(X, y)

print(X_train.shape, y_train.shape) # Should yield approx (134586, 32, 32) (134586,)

Since there are 369 different classes and over 150,000 images in total, let's reduce the complexity of our task by keeping only the first 100 classes. Also note that the `symbol_id` field does not start from 0, but instead contains arbitrary numbers.

**Transform the labels so that the numbering for the class goes from 0 to 99, and discard the rest of the classes and corresponding images.**
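One possible approach, sketched with NumPy only: rank the distinct `symbol_id` values in sorted order so that the smallest id becomes class 0, then keep only the samples whose new label is below the cutoff. The helper names `relabel` and `keep_first_classes` and the `all_ids` argument are assumptions made for illustration and are not part of the exercise template. Reading "first" as "smallest id" is also an assumption.

```python
import numpy as np

def relabel(y, all_ids):
    """Map each id in y to its rank among the sorted unique values of all_ids."""
    classes = np.unique(all_ids)        # sorted unique ids
    return np.searchsorted(classes, y)  # position of each id within classes

def keep_first_classes(arr, y_arr, n_classes=100):
    """Keep the samples (and labels) whose label lies in [0, n_classes - 1]."""
    mask = (y_arr >= 0) & (y_arr < n_classes)
    return arr[mask], y_arr[mask]

# Toy data: five samples with non-contiguous ids; keep the first two classes
ids = np.array([31, 90, 31, 520, 90])
images = np.arange(5 * 2 * 2).reshape(5, 2, 2)
labels = relabel(ids, all_ids=ids)
images_kept, labels_kept = keep_first_classes(images, labels, n_classes=2)
print(labels, labels_kept, images_kept.shape)
```

Building the mapping from the full set of ids, rather than from each split separately, keeps the class numbering consistent across the training, validation, and test sets.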

In [None]:
# TODO
def transform_labels(y):
    pass

y_train, y_val, y_test = map(transform_labels, [y_train, y_val, y_test])

print(y_train.shape, y_val.shape, y_test.shape) # Should be approx (134586,) (16823,) (16824,)

# Should return the elements of arr, together with their labels, for which the label in y_arr is in the range [0, 99]
# TODO
def filter_out(arr, y_arr):
    pass

X_train, y_train = filter_out(X_train, y_train)
X_val, y_val = filter_out(X_val, y_val)
X_test, y_test = filter_out(X_test, y_test)

print(y_train.shape, X_train.shape) 

In [None]:
# Convert labels to one-hot encoding here
# TODO 
from keras.utils import to_categorical

print(y_train.shape) # Should be approx (34062, 100)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Flatten

# This function should return a keras Sequential model that has the appropriate layers
# TODO
def create_linear_model():
    pass

model = create_linear_model()
model.summary()

In [None]:
# Feel free to try out other optimizers. Categorical crossentropy loss compares the
# predicted probability distribution over the classes with the one-hot label.
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

model.fit(X_train, y_train, epochs=3, batch_size=64)

The simple linear model probably didn't do too well. Let's create a CNN (Convolutional Neural Network) next. We've provided the code for the network, so start by running these cells as they are. Afterwards, experiment by adding and removing layers and tuning the hyperparameters to get better results on the validation data.
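For reference, a linear model in this setting flattens each image and applies a single dense softmax layer. The NumPy sketch below shows that forward pass on random data. The function name, the random weights, and the choice of 100 classes are illustrative assumptions, and this is not the Keras code the exercise asks for.

```python
import numpy as np

def linear_softmax_forward(images, W, b):
    """Flatten the images, apply one affine map, and normalize with softmax."""
    flat = images.reshape(images.shape[0], -1)            # (n, 32 * 32)
    logits = flat.dot(W) + b                              # (n, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
images = rng.rand(4, 32, 32)           # four fake grayscale images
W = rng.randn(32 * 32, 100) * 0.01     # assumed: 100 output classes
b = np.zeros(100)
probs = linear_softmax_forward(images, W, b)
print(probs.shape)
```

With these shapes the model has 32 * 32 * 100 + 100 = 102,500 parameters, and it treats every pixel independently of its neighbours, which is one reason a convolutional network tends to do better on images.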

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, BatchNormalization, Dropout
from keras.backend import clear_session

def create_convolutional_model():
    model = Sequential()
    
    model.add(Conv2D(128, (3, 3), input_shape=(32, 32, 1))) # A convolutional layer with 128 filters of size 3x3
    model.add(MaxPooling2D((4,4))) # Max pooling downsamples the feature maps, which shrinks the later layers
    model.add(Dropout(0.4)) # Randomly zeroing activations during training helps against overfitting
    model.add(Conv2D(128, (2, 2), activation="relu"))
    model.add(BatchNormalization()) # Activations might get really big, so normalize them
    model.add(Flatten())
    model.add(Dense(256, activation="relu"))
    model.add(Dense(128, activation="relu"))
    model.add(BatchNormalization())
    model.add(Dense(y_train.shape[1], activation="softmax"))
    
    return model

clear_session()

model = create_convolutional_model()
model.summary() # Get a summary of all the layers

As you can see, our model has a lot of parameters. Optimizing this might take a while, depending on your PC.
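The parameter counts in the summary can be reproduced by hand. The sketch below does the arithmetic for the layers defined above, assuming the Keras defaults ("valid" padding, pooling stride equal to the pool size) and assuming `y_train.shape[1]` is 100. The helper function names are made up for illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One kernel per filter plus one bias per filter
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # Weight matrix plus bias vector
    return in_units * out_units + out_units

def batchnorm_params(channels):
    # gamma, beta, moving mean, moving variance (the last two are not trained)
    return 4 * channels

n_classes = 100  # assumption: y_train.shape[1] == 100

conv1 = conv2d_params(3, 3, 1, 128)    # 32x32x1 -> 30x30x128
side = (32 - 3 + 1) // 4               # 4x4 max pooling -> 7x7x128
conv2 = conv2d_params(2, 2, 128, 128)  # 7x7x128 -> 6x6x128
side = side - 2 + 1
bn1 = batchnorm_params(128)
flat = side * side * 128               # values after Flatten
dense1 = dense_params(flat, 256)
dense2 = dense_params(256, 128)
bn2 = batchnorm_params(128)
out = dense_params(128, n_classes)

total = conv1 + conv2 + bn1 + dense1 + dense2 + bn2 + out
print(flat, dense1, total)
```

Under these assumptions, most of the parameters sit in the first dense layer after `Flatten`, so shrinking the feature maps or that layer is the quickest way to reduce the model size.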

In [None]:
# Feel free to try out other optimizers. Categorical crossentropy loss compares the
# predicted probability distribution over the classes with the one-hot label.
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

# Extra axis for "gray" channel
model.fit(X_train[:, :, :, np.newaxis], y_train, epochs=5, batch_size=64, validation_data=(X_val[:, :, :, np.newaxis], y_val))

Finally, let's see how well our model did on the held-out test data, which is what ultimately matters. `model.evaluate` returns the loss followed by the compiled metrics, so the second number is the test accuracy; you should be able to reach approximately 80% with our setup. Try to improve this, but be careful not to overfit on the test data: always use the validation set to tune your model.
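Accuracy here means the fraction of test images whose highest-probability class matches the label. A minimal NumPy sketch of that computation on made-up predictions follows; the function name and the data are illustrative only.

```python
import numpy as np

def accuracy(y_true_onehot, y_prob):
    """Fraction of samples whose most probable class matches the label."""
    true_classes = y_true_onehot.argmax(axis=1)
    predicted_classes = y_prob.argmax(axis=1)
    return float(np.mean(true_classes == predicted_classes))

# Four samples, three classes; the third prediction is wrong
y_true = np.eye(3)[[0, 2, 1, 1]]
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
])
acc = accuracy(y_true, y_prob)
print(acc)
```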

In [None]:
model.evaluate(X_test[:, :, :, np.newaxis], y_test)

In [None]:
#TODO make the model better