# Convolutional neural networks

3 May 2018: [Multidisciplinary camp of the Polish Children's Fund (Krajowy Fundusz na rzecz Dzieci)](http://fundusz.org/2018/04/wkrotce-oboz-w-serocku/) by [Piotr Migdał](http://p.migdal.pl/)


For further materials, see:

* [Learning Deep Learning with Keras](http://p.migdal.pl/2017/04/30/teaching-deep-learning.html)
* [Data science intro for math/phys background](http://p.migdal.pl/2016/03/15/data-science-intro-for-math-phys-background.html)
* [Starting deep learning hands-on: image classification on CIFAR-10](https://blog.deepsense.ai/deep-learning-hands-on-image-classification/)

## Letter recognition

> Indeed, I once even proposed that the toughest challenge facing AI workers is to answer the question: “What are the letters ‘A’ and ‘I’?” - [Douglas R. Hofstadter](https://web.stanford.edu/group/SHR/4-2/text/hofstadter.html) (1995)


## notMNIST


Data source: [notMNIST](http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) (you need to download the `notMNIST_small.mat` file):

![](http://yaroslavvb.com/upload/notMNIST/nmn.png)

> some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts.

> Approaching 0.5% error rate on notMNIST_small would be very impressive. If you run your algorithm on this dataset, please let me know your results.


## So, why not MNIST?

Many introductions to image classification with deep learning start with MNIST, a standard dataset of handwritten digits. This is unfortunate. Not only does it not produce a “Wow!” effect or show where deep learning shines, but it can also be solved with shallow machine learning techniques. In this case, plain k-Nearest Neighbors produces more than 97% accuracy (or even 99.5% with some data preprocessing!). Moreover, MNIST is not a typical image dataset – and mastering it is unlikely to teach you transferable skills that would be useful for other classification problems.

> Many good ideas will not work well on MNIST (e.g. batch norm). Inversely many bad ideas may work on MNIST and no[t] transfer to real [computer vision]. - [François Chollet’s tweet](https://twitter.com/fchollet/status/852594987527045120)

## Setup

### Local

* Python 3.6 with Anaconda
* Keras 2.1.4
* TensorFlow (for Keras backend)

Additional packages to install: [keras-sequential-ascii](https://github.com/stared/keras-sequential-ascii) and [livelossplot](https://github.com/stared/livelossplot).

We will use Keras 2.1.4. With [2.1.6 there might be some problems](https://github.com/keras-team/keras/issues/9621). Use `pip install -U keras==2.1.4` if needed.

### Neptune

To run this notebook on [Neptune - Machine Learning Lab](https://neptune.ml/), create an account there. Then create a new notebook:

* medium CPU is enough
* Python 3
* Keras 2.1.4
* make sure to upload this file!

![](img/neptune_notebook.png)

In [None]:
!pip install livelossplot
!pip install keras-sequential-ascii

In [None]:
# Downloading data (112 MB). If needed, I have it on my pendrive.
!wget http://yaroslavvb.com/upload/notMNIST/notMNIST_small.mat

## Loading packages

In [None]:
# plots
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# data preprocessing
from scipy import io
import numpy as np
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

# keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Conv2D, MaxPool2D, Dropout, BatchNormalization, GlobalMaxPool2D

# keras vis
from livelossplot import PlotLossesKeras
from keras_sequential_ascii import keras2ascii

## Data preprocessing

In [None]:
data = io.loadmat("notMNIST_small.mat")

# transform data
X = data['images']
y = data['labels']
resolution = 28
classes = 10

X = np.transpose(X, (2, 0, 1))

y = y.astype('int32')
X = X.astype('float32') / 255.

# shape: (sample, x, y, channel)
X = X.reshape((-1, resolution, resolution, 1))

# one-hot encoding, e.g. 3 -> [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]
Y = np_utils.to_categorical(y, classes)
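
The one-hot step can also be written in plain NumPy, which makes it clear what the resulting array looks like. This is only a minimal sketch, not the Keras implementation; `one_hot` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows (hypothetical helper)."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype='float32')[labels]

example = one_hot([3, 0, 9], 10)
print(example[0])  # the single 1 sits at index 3
```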

In [None]:
# looking at data; some fonts are strange
i = 42
plt.imshow(X[i,:,:,0])
plt.title("ABCDEFGHIJ"[y[i]]);

In [None]:
# random letters
rows = 6
fig, axs = plt.subplots(rows, classes, figsize=(classes, rows))
for letter_id in range(classes):
    letters = X[y == letter_id]
    for i in range(rows):
        ax = axs[i, letter_id]
        ax.imshow(letters[np.random.randint(len(letters)),:,:,0],
                  cmap='Greys', interpolation='none')
        ax.axis('off')

In [None]:
# splitting data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

In [None]:
Y_train.shape

In [None]:
Y_train[:5]

# Models

## Logistic regression

A simple, shallow method. For two classes it uses the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function); for several classes, as here, it uses its generalization, softmax.

It works as follows:

* flattens input to a single vector
* multiplies by a matrix
* applies softmax
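
The three steps above can be sketched in NumPy. This is an illustration with random, untrained weights and a random stand-in batch, not what Keras runs internally; note that a `Dense` layer also adds a bias vector, which the sketch includes as `b`.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the max keeps exp() from overflowing."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
images = rng.rand(5, 28, 28, 1)                  # stand-in for a batch of 5 images
W = rng.normal(scale=0.01, size=(28 * 28, 10))   # untrained weights (assumption)
b = np.zeros(10)                                 # bias

flat = images.reshape(len(images), -1)   # 1. flatten: shape (5, 784)
logits = flat @ W + b                    # 2. multiply by a matrix, add bias
probs = softmax(logits)                  # 3. softmax: each row sums to 1

print(probs.shape)
```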

In [None]:
model = Sequential()
model.add(Flatten(input_shape=(resolution, resolution, 1)))
model.add(Dense(classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=0)

In [None]:
def show_predictions(model, X=X_test, Y=Y_test, rows=8):
    # example predictions for `rows` randomly chosen samples
    predictions = model.predict(X)
    letters = list("ABCDEFGHIJ")

    fig, axs = plt.subplots(rows, 2, figsize=(8, 1.5 * rows))
    for i in range(rows):
        ax = axs[i, 0]
        idx = np.random.randint(len(X))
        ax.imshow(X[idx, :, :, 0], cmap='Greys', interpolation='none')
        ax.axis('off')

        # true label (first color) and predicted probabilities (second color)
        pd.Series(Y[idx], index=letters).plot(kind='bar', ax=axs[i, 1], ylim=[0, 1], color=plt.cm.Set1(0))
        pd.Series(predictions[idx], index=letters).plot(kind='bar', ax=axs[i, 1], ylim=[0, 1], color=plt.cm.Set1(1))

    plt.tight_layout()

In [None]:
show_predictions(model)

## Multilayer perceptron (MLP)

An old-school network - only dense layers, sigmoid (or tanh) activation function.
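
To get a feeling for how much a hidden layer adds, one can count trainable parameters by hand. The sketch below assumes the sizes used in this notebook (28x28 inputs, a 128-unit hidden layer, 10 classes); `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one dense layer (hypothetical helper)."""
    return n_in * n_out + n_out

inputs = 28 * 28
logistic = dense_params(inputs, 10)                       # logistic regression
mlp = dense_params(inputs, 128) + dense_params(128, 10)   # one hidden layer
print(logistic, mlp)  # 7850 101770
```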

In [None]:
model = Sequential()

model.add(Flatten(input_shape=(resolution, resolution, 1)))
model.add(Dense(128, activation='tanh'))
model.add(Dense(classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=20,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=1)

## Convolution

See [Image Kernels - Visually Explained](http://setosa.io/ev/image-kernels/).
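
What a single kernel does can be sketched with a naive NumPy loop. This is an illustration only: the Sobel-like kernel and the toy image are assumptions, and `conv2d_valid` is a hypothetical helper. Like deep learning frameworks, it does not flip the kernel (strictly speaking, it computes a cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (hypothetical helper)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image with a vertical edge: dark on the left, bright on the right.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# Sobel-like kernel that responds to brightness increasing from left to right.
kernel = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])

response = conv2d_valid(image, kernel)
print(response)  # zero in the flat region, 4 where the window covers the edge
```

A `Conv2D` layer learns the kernel values instead of having them fixed by hand, and applies many kernels in parallel.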

Note that since the MLP we use the `adam` optimizer instead of `sgd`; see:

* [Why Momentum Really Works](https://distill.pub/2017/momentum/)
* [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/)
* [SGD > Adam?? Which One Is The Best Optimizer: Dogs-VS-Cats Toy Experiment](https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/)
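
The difference between the two update rules can be sketched on a toy problem. Everything below is an assumption made for illustration: the quadratic loss, the learning rates and the step count are arbitrary, and the code is a hand-written version of the standard update formulas, not the Keras optimizers.

```python
import numpy as np

# Toy loss (an assumption for illustration): an elongated quadratic bowl.
curvature = np.array([100.0, 1.0])

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def run_sgd(w, lr=0.01, steps=200):
    """Plain gradient descent: one global learning rate."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def run_adam(w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    """Adam: per-coordinate steps scaled by running gradient statistics."""
    m = np.zeros_like(w)  # running mean of gradients
    v = np.zeros_like(w)  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

start = np.array([1.0, 1.0])
start_loss = loss(start)
sgd_loss = loss(run_sgd(start))
adam_loss = loss(run_adam(start))
print(start_loss, sgd_loss, adam_loss)
```

Which optimizer ends up lower depends on the problem and the learning rates, so this toy run should not be read as a benchmark.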

In [None]:
model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu', padding='same',
                 input_shape=(resolution, resolution, 1)))
model.add(Flatten())
model.add(Dense(classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=0)

## Convolution + MaxPool

More on typical blocks in [Convolutional Neural Networks (CNNs / ConvNets)](http://cs231n.github.io/convolutional-networks/) by Andrej Karpathy.
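
Max pooling itself is simple enough to sketch in NumPy. The example assumes a 2x2 window with stride 2 and a single-channel feature map; `max_pool_2x2` is a hypothetical helper, not the Keras layer.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Group pixels into (h2, 2, w2, 2) blocks and take the max of each block.
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

fm = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(fm)
print(pooled)
# [[ 5  7]
#  [13 15]]
```

Each output pixel keeps only the strongest response in its block, which halves the resolution and makes the result less sensitive to small shifts.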

In [None]:
model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu',
                 input_shape=(resolution, resolution, 1)))
model.add(MaxPool2D())

model.add(Flatten())
model.add(Dense(classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=0)

### Typical ConvNet architecture

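It uses hierarchical features: early layers detect simple patterns such as edges, and deeper layers combine them into more complex shapes. See: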

* [How convolutional neural networks see the world](https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html) - Keras blog
* [How neural networks build up their understanding of images](https://distill.pub/2017/feature-visualization/) - distill.pub
* [The Building Blocks of Interpretability](https://distill.pub/2018/building-blocks/) - distill.pub

In [None]:
model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu',
                 input_shape=(resolution, resolution, 1)))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Flatten())
model.add(Dense(classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=1)

### More dense layers, dropout

Usually a ConvNet ends with 2-3 dense layers. To prevent overfitting we use **dropout**, which randomly switches off a fraction of units during training.


* Hinton et al, [Improving neural networks by preventing co-adaptation of feature detectors](https://arxiv.org/abs/1207.0580), 2012

![](img/dropout.png)
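
The mechanism can be sketched in a few lines of NumPy. This is a sketch of the so-called inverted dropout, where the surviving units are rescaled during training so that nothing needs to change at prediction time; `dropout` is a hypothetical helper, not the Keras layer.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training:
        return x  # at prediction time all units are kept
    keep = 1.0 - rate
    mask = rng.rand(*x.shape) < keep
    return x * mask / keep

rng = np.random.RandomState(0)
activations = np.ones((1000, 64))
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped.mean())  # close to 1: the expected activation is preserved
```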

In [None]:
model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu',
                 input_shape=(resolution, resolution, 1)))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=1)

In [None]:
show_predictions(model)

## Batch normalization

Often we can speed up training by using batch normalization. It is especially useful for deep neural networks.

* [Understanding the backward pass through Batch Normalization Layer](http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html)
* [On The Perils of Batch Norm](https://www.alexirpan.com/2017/04/26/perils-batch-norm.html)
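
The training-time forward pass can be sketched as follows. This is a simplified illustration for a `(batch, features)` array; for convolutional feature maps the statistics are computed per channel, and at prediction time running averages are used instead of batch statistics. `batch_norm_train` is a hypothetical helper.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # gamma and beta are learned parameters

rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))  # badly scaled activations
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))  # roughly 0 and 1 per feature
```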

In [None]:
model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu',
                 input_shape=(resolution, resolution, 1)))
model.add(BatchNormalization())
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)


## Fully convolutional neural networks

Sometimes we want a network that is translation-invariant and can accept images of any size. Replacing `Flatten` with global pooling removes the dependence on the spatial size of the feature maps.
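
Global max pooling is what makes the output size independent of the image size, as the NumPy sketch below illustrates. `global_max_pool` is a hypothetical helper and the feature maps are random stand-ins. In Keras, actually feeding images of different sizes would presumably also require an input shape with unspecified height and width, such as `(None, None, 1)`.

```python
import numpy as np

def global_max_pool(feature_maps):
    """(height, width, channels) -> (channels,): keep the max of each channel."""
    return feature_maps.max(axis=(0, 1))

rng = np.random.RandomState(0)
small = global_max_pool(rng.rand(3, 3, 128))    # feature maps from a small image
large = global_max_pool(rng.rand(20, 20, 128))  # feature maps from a large image
print(small.shape, large.shape)  # (128,) (128,)
```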

In [None]:
model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu',
                 input_shape=(resolution, resolution, 1)))
model.add(BatchNormalization())
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPool2D())

model.add(Conv2D(128, (3, 3), activation='relu'))
model.add(GlobalMaxPool2D())

model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(classes, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

keras2ascii(model)

In [None]:
model.fit(X_train, Y_train,
          epochs=10,
          batch_size=128,
          validation_data=(X_test, Y_test),
          callbacks=[PlotLossesKeras()],
          verbose=0)