# Digit Classification with Neural Networks 

Interest in neural networks, and in particular those with architechures that support deep learning, has been surging in recent years.  

In this notebook we will be revisiting the problem of digit classification on the MNIST data.  In doing so, we will introduce a new Python library, Keras, for working with neural networks.  Keras is a popular choice for neural networks as the same code can be run on either CPUs or GPUs.  GPUs greatly speed up the training and prediction, and is readily available. Amazon even offers GPU machines on EC2.  

In part 1, we'll introduce Keras, and refresh ourselves on the MNIST dataset.  In part 2, we'll create a multi-layer neural network with a simple architechure, and train it using backpropagation.  Part 3 will introduce the convolutional architechure, which can be said to be doing 'deep learning' (also called feature learning or representation learning).

# Part 1: Basics

In [43]:
%matplotlib inline

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
import time

from keras import optimizers
from keras.models import Sequential 
from keras.layers import Dense, Activation, Dropout
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Flatten
from keras.utils import np_utils
from keras.datasets import mnist
from keras import backend as K

np.random.seed(0)
print ("OK")

OK


Now for MNIST data...

In [44]:
numExamples = 2000
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = x_train[:numExamples], y_train[:numExamples]
x_test, y_test = x_test[:numExamples], y_test[:numExamples]
x_train = x_train.reshape(numExamples, 784)
x_test = x_test.reshape(numExamples, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_train.shape)

x_train shape: (2000, 28, 28, 1)
x_test shape: (2000, 28, 28, 1)


Looking ahead to working with neural networks, let's prepare one additional variation of the label data.  Let's make these labels, rather than each being an integer value from 0-9, be a set of 10 binary values, one for each class.  This is sometimes called a 1-of-n encoding, and it makes working with Neural Networks easier, as there will be one output node for each class.

In [45]:
num_classes =10
y_train = np_utils.to_categorical(y_train, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)
num_classes = y_test.shape[1]
train_labels_b = y_train
test_labels_b = y_test
numClasses = num_classes
print("Shape after one-hot encoding: ", y_train.shape)

Shape after one-hot encoding:  (2000, 10)


Lets start with a KNN model to establish a baseline accuracy.

Exercise: You've seen a number of different classification algorithms (e.g. naive bayes, decision trees, random forests, logistic regression) at this point.  How does KNN scalability and accuracy with respect to the size of the training dataset compare to those other algorithms?  

In [56]:
neighbors = 1
knn = KNeighborsClassifier(neighbors)
mini_train_data, mini_train_labels = X[:numExamples], Y[:numExamples] 
start_time = time.time()
knn.fit(mini_train_data, mini_train_labels)
print ('Train time = %.2f' %(time.time() - start_time))
start_time = time.time()
accuracy = knn.score(test_data, test_labels)
print ('Accuracy = %.4f' %(accuracy))
print ('Prediction time = %.2f' %(time.time() - start_time))

Train time = 0.07
Accuracy = 0.9110
Prediction time = 6.78


Alright, now that we have a simple baseline, let's start working in Keras.  Before we jump to multi-layer neural networks though, let's train a logistic regression model to make certain we're using Keras correctly. 

Recall from Josh's regression lecture the four key components: (1) parameters, (2) model, (3) cost function, and (4) objective. 

Two notes relevant at this point:

First, logistic regression can be thought of as a neural network with no hidden layers. The output values are just the dot product of the inputs and the edge weights.

Second, we have 10 classes. We can either train separate one vs all classifiers using sigmoid activation, which would be a hassle, or we can use the softmax activation, which is essentially a multi-class version of sigmoid. We'll use Theano's built-in implementation of softmax.

The objective is minimize the cost, and to do that we'll use batch gradient descent.

Exercise: What are the differences between batch, stochastic, and mini-batch gradient descent?  What are the implications of each for working on large datasets?

Exercise: Do you recall from Josh's lecture what the gradient is for beta in logistic regression?

In [63]:
## Model
model = Sequential() 
model.add(Dense(10, input_dim=784, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=numExamples,verbose=0, epochs=200) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 1.3280851802825928
Test accuracy: 0.6995


Exercise:  What do you expect to happen if we convert batch gradient descent to stochastic gradient descent?  Why?

Let's try it...

In [65]:
## Model
model = Sequential() 
model.add(Dense(10, input_dim=784, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=1,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.505543339252472
Test accuracy: 0.846


# PART 2: Multi-layer Neural Networks

---------

Let's take our implementation of logistic regression (which recall is in fact a single layer neural network), and add a hidden layer, making it a two layer neural network.  Because we have a hidden layer, we will now train the model using backpropagation.

Exercise: How do you expect this model to compare to KNN and logistic regression in terms of train time and accuracy?  Why?

Let's try it out...

In [67]:
## Model
model = Sequential() 
model.add(Dense(units=50, input_dim=784, activation='sigmoid')) 
model.add(Dense(units=10, input_dim=50, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=100,verbose=0, epochs=500) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.5390021901130676
Test accuracy: 0.8415


--------

As interest in networks with more layers and more complicated architechures has increased, a couple of tricks have emerged and become standard practice.  Let's look at two of those--rectifier activation and dropout noise.

Exercise:  We saw an improvement from adding a hidden layer.  What do you expect to happen if a second hidden layer was added?  

Let's try it...

In [72]:
## Model
model = Sequential() 
model.add(Dense(units=50, input_dim=784, activation='sigmoid'))
model.add(Dense(units=100, input_dim=50, activation='sigmoid'))
model.add(Dense(units=10, input_dim=100, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=100,verbose=0, epochs=20) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 2.2829424476623537
Test accuracy: 0.174


#### Activation Revisted

Let's look at a recent idea around activation closely associated with deep learning.  In 2010, in a paper published at NIPS (https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf), Yoshua Bengio showed that rectifier activation works better empirically than sigmoid activation when used in the hidden layers.  

The rectifier activation is simple: f(x)=max(0,x).  Intuitively, the difference is that as a sigmoid activated node approaches 1 it stops learning even if error continues to be propagated to it, whereas the rectifier activated node continue to learn (at least in the positive direction).  Rectifiers also speed up training.

Although the paper was published in 2010, the technique didn't gain widespread adoption until 2012 when members of Hinton's group spread the word, including with this Kaggle entry: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/

Let's change the activation in our 2 layer network to rectifier and see what happens...

In [74]:
## Model
model = Sequential() 
model.add(Dense(units=50, input_dim=784, activation='sigmoid')) 
model.add(Dense(units=10, input_dim=50, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=10,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.5362599482536315
Test accuracy: 0.84


#### Noise

Previously when working with the MNIST data we saw a benefit in generalization from adding noise to the training data.  Let's try that again here, however this time with a trick for adding noise called 'Dropouts'.  The idea with dropouts is that instead of (or in addition to) adding noise to our inputs, we add noise by having each node return 0 with a certain probability during training.  This trick both improves generalization in large networks and speeds up training.

Hinton introduced the idea in 2012 and gave an explanation of why it's similar to bagging (http://arxiv.org/pdf/1207.0580v1.pdf)

Let's give it a try...

In [75]:
## Model
model = Sequential() 
model.add(Dense(units=50, input_dim=784, activation='relu')) 
model.add(Dropout(0.5))
model.add(Dense(units=10, input_dim=50, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=10,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.43543306159973144
Test accuracy: 0.8585


# PART 3: Convolution Neural Networks

In [47]:
## Model
model = Sequential() 
model.add(Conv2D(32, kernel_size=(3, 3),activation='relu',input_shape=(28, 28, 1)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(units=50, input_dim=128, activation='relu')) 
model.add(Dense(units=10, input_dim=50, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=25, epochs=10, verbose=1, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Train on 2000 samples, validate on 2000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.40772309064865114
Test accuracy: 0.872
