<h1>MNIST, the "Hello World!" of Deep Learning</h1>

MNIST is a dataset of tiny 28 x 28 grayscale images, each showing a handwritten digit between 0 and 9. The task is to classify each image by the digit it contains and, along the way, to gain some experience working with Keras/TensorFlow.



**Some examples from the MNIST dataset:**

![picture](https://drive.google.com/uc?export=view&id=1CEKDn-NpvAzabnm4prVLwk5DTLpTtM1R)

In general, handwritten digit recognition is a real-world application. The MNIST dataset is also not particularly small: it contains 60,000 images in the training set and 10,000 in the test set. Each image has a spatial dimension of 28 x 28, totaling 28² = 784 features per image, which is a rather high dimensionality. So why is MNIST considered a “Hello World” example? As discussed during the lecture, one reason is that it is surprisingly easy to obtain a decent accuracy, around 90%, even with a weak or poorly designed machine learning model. A practical problem setting, a seemingly challenging task, and high accuracy with little work: a perfect combination to get started with computer vision using deep learning.



Image classification is typically based on convolutional neural networks (CNNs), also called ConvNets. ConvNets are so effective for MNIST that even if we randomly flip the labels for most of the images in the dataset, a ConvNet can still achieve high accuracy.


> Even with 100 noisy labels for every clean label the ConvNet still attains a performance of 91%. See *Deep Learning is Robust to Massive Label Noise*, Rolnick et al.



To get some exposure to deep learning using ConvNets, this exercise will walk you through the basic steps of building two "toy" models for classifying handwritten digits, with accuracies surpassing 95%. The first model will be a basic fully-connected neural network, while the second model will be a deeper network that introduces the concepts of convolution and pooling. To generate, train, and evaluate your models you will use the Keras Python API with TensorFlow as the backend.

## Importing prerequisite Python modules


In [None]:
import numpy as np                  
import matplotlib.pyplot as plt      
import random                       

from keras.datasets import mnist     # Keras includes the MNIST dataset
from keras.models import Sequential  # Keras model API to be used

from keras.layers import Dense, Dropout, Activation       # Types of layers to be used in our models
from keras.utils import np_utils                          # NumPy related tools

## Loading the MNIST dataset

The MNIST dataset is conveniently bundled within Keras, and we can easily analyze some of its features in Python.

In [None]:
# The MNIST data is split between 60,000 28 x 28 pixel training images
# and 10,000 28 x 28 pixel test images
(X_train, y_train), (X_test, y_test) = mnist.load_data()

print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

Using matplotlib, we can plot some sample images from the training set directly into this Jupyter Notebook.

In [None]:
plt.rcParams['figure.figsize'] = (9,9) # Make the figures a bit bigger

for i in range(9):
    plt.subplot(3,3,i+1)
    num = random.randrange(len(X_train))  # randint's upper bound is inclusive and could index out of range
    plt.imshow(X_train[num], cmap='gray', interpolation='none')
    plt.title("Class {}".format(y_train[num]))
    
plt.tight_layout()

## Image representation

Let's examine a single digit a bit closer, and print out the array representing the last digit plotted above.

In [None]:
# Just a little function to print a matrix in a pretty way
def matprint(mat, fmt="g"):
    col_maxes = [max([len(("{:"+fmt+"}").format(x)) for x in col]) for col in mat.T]
    for x in mat:
        for i, y in enumerate(x):
            print(("{:"+str(col_maxes[i])+fmt+"}").format(y), end="  ")
        print("")
    
matprint(X_train[num])

Each pixel is an 8-bit integer between 0 and 255, where 0 is black and 255 is white. Because every pixel holds a single intensity value, the image has a single channel and is called monochrome (grayscale).

*Fun-fact: Your computer screen has three channels for each pixel: red, green, blue. Each of these channels also likely takes an 8-bit integer. 3 channels -- 24 bits total -- 16,777,216 possible colors!*

## Flattening the input data

A fully-connected network requires a 784-length feature vector as input rather than a 28 x 28 x 1 tensor.

Each image therefore needs to be reshaped (flattened) into a vector. We'll also normalize the inputs to the range [0, 1] rather than [0, 255]. Normalizing inputs is generally recommended so that all input features are on a comparable scale, which also holds for other network architectures with additional input dimensions.

Example:

![picture](https://drive.google.com/uc?export=view&id=1gNBQpPfh6y1p_yqBXe30okN-lBKp7C7E)


In [None]:
# Reshape 60,000 28 x 28 matrices into 60,000 784-length vectors
X_train = X_train.reshape(60000, 784)
# Reshape 10,000 28 x 28 matrices into 10,000 784-length vectors
X_test = X_test.reshape(10000, 784)   

# Change the training and test image pixel values to be represented as 
# 32-bit floating point numbers
X_train = X_train.astype('float32')   
X_test = X_test.astype('float32')

# Normalize each pixel value to be between 0 and 1
X_train /= 255                        
X_test /= 255

print("Training matrix shape", X_train.shape)
print("Testing matrix shape", X_test.shape)

## Converting labels to one-hot format

We then convert our classes (the unique digits) to one-hot format: each label is represented by a vector with one value per class, which is 1 at the index of the correct class and 0 everywhere else.

**Examples:**
```
0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
2 -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
etc.
```

The final output of our network can then be interpreted as the probability of the input belonging to each of the classes. For example, if the final output is

```
[0, 0.94, 0, 0, 0, 0, 0.06, 0, 0, 0]
```
then the input image is classified as 1 with a probability of 94%.
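
To make the encoding and the decoding step concrete, here is a minimal NumPy sketch. The helper name `one_hot` is hypothetical and only for illustration; the notebook itself relies on Keras' `to_categorical` in the next cell.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([0, 1, 2], num_classes=10)
print(encoded[1])   # a 1 at index 1, zeros elsewhere

# Decoding a predicted distribution: the class is the index of the largest value
prediction = np.array([0, 0.94, 0, 0, 0, 0, 0.06, 0, 0, 0])
predicted_class = int(np.argmax(prediction))
print(predicted_class, prediction[predicted_class])
```

Taking the `argmax` of a prediction is the same operation used further down when the model outputs are inspected.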

In [None]:
nb_classes = 10 # number of unique digits

Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

# Building a 3-layer fully-connected network (FCN)

We will now use Keras' sequential API to build the following fully-connected network:
![picture](https://drive.google.com/uc?export=view&id=19nX9IVQ1RSp0sj6srPE4IluMvZIETQzg)

## The first hidden layer
Note: You can read more on Keras sequential models here: https://keras.io/guides/sequential_model/.

The first hidden layer is a set of 512 nodes (artificial neurons).
Since the model does not know the dimensions of the input images, we have to specify them for the very first layer. The activation function has to be added explicitly as its own layer. We also add dropout with a probability of 20%, which means that each output of the layer is randomly set to zero with that probability during training. This is one form of regularization and helps prevent the network from overfitting to the training data. We will talk more about regularization later, so don't worry.
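
As a rough illustration of what a dropout layer does during training, the following NumPy sketch implements so-called inverted dropout: a random 20% of the values are zeroed and the remaining ones are scaled up so that the expected activation stays the same. The function name `dropout` and the all-ones input are assumptions made for this example; it is not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of the values, rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a value is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 512), dtype="float32")    # made-up layer outputs
dropped = dropout(activations, rate=0.2, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0))   # close to 0.2
print("mean activation:", dropped.mean())          # close to 1.0
```

At test time dropout is switched off, so the full layer output is used.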

In [None]:
# The Sequential model is a linear stack of layers and operations
model = Sequential()

#(784,) is not a typo -- that represents a 784 length vector
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.2))

## Adding the second hidden layer

The second hidden layer is identical to our first layer. You can implement this one yourself.

In [None]:
# TODO: Add another FC layer with 512 nodes, followed by a ReLU activation 
#       and a 20% dropout probability. In principle, it is just as above 
#       without the need to specify the input shape to this second layer.
pass

## The output layer

The final layer should contain as many nodes as there are possible classes, which is 10 in our case. We will also apply a special activation function to the final activations, the "softmax" activation, which turns them into a probability distribution over K different possible outcomes.
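
For intuition, softmax can be sketched in a few lines of NumPy. The input scores below are made up for illustration, and the helper `softmax` is a stand-alone example rather than the function Keras calls internally.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores to a probability distribution."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up raw scores for the 10 digit classes
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -0.5, 1.5, 0.2])
probs = softmax(logits)

print(probs.sum())              # 1 up to floating-point rounding
print(int(np.argmax(probs)))    # index of the most likely class
```

All outputs are positive and sum to one, which is what allows them to be read as class probabilities.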

In [None]:
# TODO: Add another FC layer with 10 nodes, followed by a softmax activation.
pass

In [None]:
# Summarize the built model
model.summary()

## Compiling the model

Keras is built on top of TensorFlow, which allows you to define a *computation graph* in Python that then compiles and runs efficiently on the CPU or GPU without the overhead of the Python interpreter. When compiling a model, Keras asks you to specify your **loss function** and your **optimizer**.

The loss function we'll use here is called *categorical cross-entropy*, which is well-suited to comparing two probability distributions. Our predictions are probability distributions across the ten different digits (e.g. "we're 80% confident this image is a 3, 10% sure it's an 8, 5% sure it's a 2, and so on"), and the image label (in one-hot format!) is a probability distribution with 100% for the correct category and 0% for everything else. The cross-entropy is a measure of how different your predicted distribution is from the target distribution. See [Cross Entropy](https://en.wikipedia.org/wiki/Cross_entropy) for details.
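
For a one-hot target, the cross-entropy reduces to the negative logarithm of the probability assigned to the correct class. The sketch below computes it with NumPy for two made-up predictions; the function is a simplified stand-in for the Keras loss, written only for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

y_true = np.zeros(10)
y_true[3] = 1.0                          # the label is the digit 3

confident = np.full(10, 0.2 / 9)         # 80% on the correct class
confident[3] = 0.8
uncertain = np.full(10, 0.1)             # uniform guess

loss_confident = categorical_crossentropy(y_true, confident)
loss_uncertain = categorical_crossentropy(y_true, uncertain)
print(loss_confident)   # -ln(0.8), about 0.223
print(loss_uncertain)   # -ln(0.1), about 2.303
```

The more probability the model places on the correct digit, the smaller the loss.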

The optimizer determines how the weights are updated through **gradient descent**. The **learning rate** determines how strongly the weights are adjusted during each training iteration (i.e. the size of the step the weights take down the gradient).
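
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent (not Adam, which adapts its step sizes) on the made-up loss f(w) = w², an assumption chosen only because its gradient 2w is easy to follow.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize the toy loss f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w
        w = w - learning_rate * gradient   # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With a small learning rate the weight approaches the minimum at 0 only slowly, a moderate one gets much closer in the same number of steps, and a learning rate that is too large overshoots and moves further away with every step.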

In [None]:
# The Adam optimizer is a very common choice and usually a good starting point
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

## Train the model
This is the fun part! 

The batch size determines how much data is used per step to compute the loss function and its gradients via backpropagation. Large batch sizes allow for faster training; however, there are other factors beyond training speed to consider.

A batch size that is too large smooths out the local structure of the loss function, which can cause the optimizer to get stuck in a local optimum.
A batch size that is too small results in a very noisy loss value, and the optimizer may struggle to settle on a good optimum.

So, it may take some trial and error to find a good batch size.
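
One direct consequence of the batch size is the number of weight updates per epoch. A small sketch of that arithmetic for the 60,000 training images follows; the helper `steps_per_epoch` is a hypothetical name used only here.

```python
import math

num_train = 60000   # size of the MNIST training set

def steps_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

for batch_size in (32, 128, 10000):
    print(batch_size, steps_per_epoch(num_train, batch_size))
# 32 1875
# 128 469
# 10000 6
```

A larger batch means fewer but more expensive updates per epoch, which is part of the trade-off described above.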

In [None]:
model.fit(X_train, Y_train,
          batch_size=128, epochs=5,
          verbose=1)

The two numbers, in order, represent the value of the loss function of the network on the training set, and the overall accuracy of the network on the training data. But how does it do on data it did not train on?

## Evaluate the model on the test data

This is the exciting (and sometimes also nerve-wracking) part!

In [None]:
score = model.evaluate(X_test, Y_test)
print('Test score:', score[0])
print('Test accuracy:', score[1])

### Inspecting the output

It's always a good idea to inspect the output and make sure everything looks sane. Here, we'll look at some examples for which the model predicts the correct class/digit, and at some examples for which the model predicts the incorrect class/digit.

In [None]:
# The predict function outputs the probability per class for each 
# input example using our trained classifier.
predicted_classes = np.argmax(model.predict(X_test), axis=1)

# Check which items we got right / wrong
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]

In [None]:
plt.figure()
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[correct], y_test[correct]))
    
plt.tight_layout()
    
plt.figure()
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[incorrect], y_test[incorrect]))
    
plt.tight_layout()

Some of these misclassifications kind of make sense, right?

# Try experimenting with the batch size!

1.   How does increasing the batch size to 10,000 affect the training time and test accuracy?
2.   How about a batch size of 32?


## Building a Convolutional Neural Network (CNN)

In [None]:
# Import some additional tools
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import (Conv2D, MaxPooling2D, ZeroPadding2D,
                          GlobalAveragePooling2D, Flatten, BatchNormalization)

In [None]:
# Reload the MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

In [None]:
# Again, apply some formatting but we do not need to flatten the images this time. 
# Remember that convolutions directly receive images and preserve their spatial structure.

# Add an additional dimension to represent the depth (which is 1 for grayscale images)
X_train = X_train.reshape(60000, 28, 28, 1) 
X_test = X_test.reshape(10000, 28, 28, 1)

X_train = X_train.astype('float32')        
X_test = X_test.astype('float32')

X_train /= 255
X_test /= 255

print("Training matrix shape", X_train.shape)
print("Testing matrix shape", X_test.shape)
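
To see why keeping the 2D layout matters, here is a naive NumPy sketch of what a single 3x3 filter computes. It uses no padding and, like deep learning frameworks, does not flip the kernel (strictly speaking a cross-correlation). The function `conv2d_valid`, the synthetic image, and the hand-picked kernel are all assumptions for illustration; a `Conv2D` layer learns its filter values during training.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w), dtype="float32")
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

image = np.zeros((28, 28), dtype="float32")
image[:, 14:] = 1.0                                    # left half dark, right half bright
kernel = np.array([[-1, 0, 1]] * 3, dtype="float32")   # responds to vertical edges

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (26, 26)
```

The resulting feature map is nonzero only around the column where the brightness changes, so the filter acts as a detector for a local spatial pattern. That information would be much harder to exploit after flattening.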

In [None]:
# One-hot format classes as before
nb_classes = 10

Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)

In [None]:
model = Sequential()                                 # Linear stacking of layers

# Convolution Layer 1
# The first conv layer will use 32 different 3x3 filters. And again, since 
# the model cannot know the size of the input images, we specify them explicitly 
# just for this first layer. We also use batch normalization which normalizes 
# each feature map before applying the activation function. We will talk more 
# about batch normalization.
model.add(Conv2D(32, (3, 3), input_shape=(28,28,1)))
model.add(BatchNormalization(axis=-1))            
model.add(Activation('relu'))

# Convolution Layer 2
# TODO: Add a second conv layer with 32 3x3 filters. Also add a batch norm 
# layer and a ReLU activation just as for the first layer.
pass                       

# Pooling Layer
# Next, we insert a max pooling layer using a 2x2 filter.
model.add(MaxPooling2D(pool_size=(2,2)))

# Convolution Layer 3
# TODO: Add a third conv layer with 64 3x3 filters. Also add a batch norm 
# layer and a ReLU activation just as for the first two layers.
pass  

# Convolution Layer 4
# TODO: Add a fourth conv layer with 64 3x3 filters. Also add a batch norm 
# layer and a ReLU activation followed by a 2x2 max pooling layer.
pass

# The final two layers will be fully-connected layers, so we need to flatten 
# the 3-dim tensor output by the 4th conv layer into a 1024-element vector.
model.add(Flatten())

# Fully Connected Layer 5
# TODO: Add a FC layer with 512 nodes. Also add a batch norm layer (just use 
# BatchNormalization() without specifying the axis), a ReLU activation, and a 
# 20% dropout layer (as used in the FC network above).
pass                         

# Fully Connected Layer 6
# TODO: Add the output FC layer with 10 nodes which needs to be followed by a 
# softmax activation as discussed above.
pass                                      
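
The comments in the cell above mention batch normalization without showing what it computes. As a rough sketch of the training-time behaviour: each feature map (channel) is shifted and scaled to zero mean and unit variance over the batch, then rescaled by learnable parameters. The NumPy version below fixes those parameters to `gamma=1` and `beta=0` and uses random numbers as a stand-in for a conv layer output; both are assumptions for illustration. At inference time, Keras uses moving averages of the statistics instead of the batch statistics.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize each channel (last axis) over the batch and spatial axes."""
    axes = tuple(range(x.ndim - 1))            # every axis except the channel axis
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
# Made-up conv output: batch of 16, 26 x 26 feature maps, 32 channels
feature_maps = rng.normal(loc=5.0, scale=3.0, size=(16, 26, 26, 32))
normalized = batch_norm(feature_maps)

channel_means = normalized.mean(axis=(0, 1, 2))
channel_stds = normalized.std(axis=(0, 1, 2))
print(channel_means[:3])   # close to 0
print(channel_stds[:3])    # close to 1
```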

In [None]:
model.summary()
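
The output shapes reported by the summary can also be traced by hand. The sketch below assumes that all conv layers use the Keras default of no padding with stride 1, as the first layer does, and that both pooling layers use 2x2 windows; the helper names are hypothetical.

```python
def conv_output(size, kernel=3):
    """Output width of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_output(size, pool=2):
    """Output width of non-overlapping max pooling."""
    return size // pool

size = 28
size = conv_output(size)   # conv 1: 26
size = conv_output(size)   # conv 2: 24
size = pool_output(size)   # pool:   12
size = conv_output(size)   # conv 3: 10
size = conv_output(size)   # conv 4: 8
size = pool_output(size)   # pool:   4

flattened = size * size * 64   # 64 filters in the last conv layer
print(size, flattened)         # 4 1024
```

This is where the 1024-element vector after the `Flatten` layer comes from.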

In [None]:
# We will use the same loss and optimizer as for the FC network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
# Data augmentation helps reduce overfitting by slightly changing the input data.
# Keras has a great API to perform automatic image augmentation.

train_gen = ImageDataGenerator(
    rotation_range=8, width_shift_range=0.08, shear_range=0.3, 
    height_shift_range=0.08, zoom_range=0.08)

test_gen = ImageDataGenerator()

In [None]:
# This also allows us to feed our augmented data batches more efficiently. 
# Besides the loss function considerations discussed above, the generator 
# creates the augmented images one batch at a time, so the augmented copies 
# never have to be held in memory all at once. This becomes particularly 
# important when training larger and deeper networks on huge amounts of data, 
# where a generator can also LOAD the data in batches. Here, the original 
# arrays are already in memory and only the batches are produced on the fly.

train_generator = train_gen.flow(X_train, Y_train, batch_size=128)
test_generator = test_gen.flow(X_test, Y_test, batch_size=128)

In [None]:
# We can now train our model by feeding it data using our batch loader.
# Steps per epoch is usually the total size of the training set divided by 
# the batch size. Note that model.fit accepts generators directly in 
# TensorFlow 2.x, where fit_generator is deprecated.
model.fit(
    train_generator, steps_per_epoch=60000//128, epochs=5, verbose=1,
    validation_data=test_generator, validation_steps=10000//128)

In [None]:
score = model.evaluate(X_test, Y_test)
print('Test score:', score[0])
print('Test accuracy:', score[1])

As previously mentioned, recognizing the handwritten digits contained in MNIST is not that challenging anymore, and we obtain fairly similar accuracy using either an FC network or a CONV network.