# Part 1: Basics

In [1]:
%matplotlib inline

import numpy as np
import time

from keras import optimizers
from keras.models import Sequential 
from keras.layers import Dense, Activation, Dropout
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers import Flatten
from keras.utils import np_utils
from keras.datasets import mnist
from keras import backend as K

np.random.seed(0)
print("OK")

OK


Now for MNIST data...

In [2]:
num_examples = 2000
num_classes = 10
img_rows = 28
img_cols = 28
num_features = 784

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, y_train = x_train[:num_examples], y_train[:num_examples]
x_test, y_test = x_test[:num_examples], y_test[:num_examples]
x_train = x_train.reshape(num_examples, num_features)
x_test = x_test.reshape(num_examples, num_features)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
y_train = np_utils.to_categorical(y_train, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)
train_labels_b = y_train
test_labels_b = y_test

print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
x_train shape: (2000, 784)
x_test shape: (2000, 784)
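
To make the label preparation above concrete, here is a minimal NumPy sketch of one-hot encoding, the transformation that `np_utils.to_categorical` is used for in the cell above. The helper name `one_hot` is hypothetical and is not part of Keras; it only illustrates the idea.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: one row per label, with a single 1 in the label's column."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Fancy indexing sets encoded[i, labels[i]] = 1 for every row i.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example = one_hot([3, 0, 9], num_classes=10)
print(example.shape)            # (3, 10)
print(example.argmax(axis=1))   # [3 0 9]
```

Taking `argmax` along each row recovers the original digit, which is how predicted probabilities are turned back into class labels.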


In [4]:
## Model (feedforward model)
model = Sequential() 

# Add a dense output layer where every node is connected to every input, and use softmax
# (turns the 10 outputs into probabilities; the predicted class is the one with the biggest number)
model.add(Dense(10, input_shape=(num_features,), activation='softmax')) 

## Cost function & Objective (and solver)

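# Use the SGD optimizer; with batch_size equal to the whole training set (2000),
# each epoch is a single batch gradient descent step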
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=2000,verbose=0, epochs=52) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test accuracy:', score[1])

Test accuracy: 0.5759999752044678
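
As a rough illustration of what the softmax output layer and the `categorical_crossentropy` loss compute, here is a small NumPy sketch. The function names and the `eps` constant are assumptions made for this example; this is not the Keras implementation.

```python
import numpy as np

def softmax(z):
    """Turn each row of raw scores into probabilities that sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class (eps avoids log(0))."""
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=-1)))

logits = np.array([[2.0, 1.0, 0.1]])     # raw scores for one example, three classes
y_true = np.array([[1.0, 0.0, 0.0]])     # one-hot label: class 0
probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs, loss)
```

The loss shrinks toward 0 as the probability assigned to the correct class approaches 1.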


Exercise:  What do you expect to happen if we convert batch gradient descent to stochastic gradient descent?  Why?

Let's try it...

In [7]:
## Model
model = Sequential() 
model.add(Dense(10, input_dim=num_features, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=1,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test accuracy:', score[1])

Test accuracy: 0.8289999961853027
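
One way to see why the two runs differ so much is to count weight updates. The sketch below assumes one update per mini-batch, which is how mini-batch training is normally described; `num_updates` is a hypothetical helper for this example.

```python
import math

def num_updates(num_examples, batch_size, epochs):
    """Number of gradient steps: one per mini-batch, for every epoch."""
    return math.ceil(num_examples / batch_size) * epochs

batch_updates = num_updates(2000, batch_size=2000, epochs=52)   # batch gradient descent run
sgd_updates = num_updates(2000, batch_size=1, epochs=50)        # stochastic run
print(batch_updates, sgd_updates)   # 52 100000
```

With the same learning rate, the stochastic run takes roughly 2000 times as many (noisier) steps, at the cost of a longer wall-clock time per epoch.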


# Part 2: Multi-layer Neural Networks

---------

Let's take our implementation of logistic regression (which, recall, is in fact a single-layer neural network) and add a hidden layer, making it a two-layer neural network.  Because we have a hidden layer, we will now train the model using backpropagation.
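
The forward pass of such a two-layer network can be sketched in NumPy as follows. The layer sizes match the cell below (784 inputs, 20 hidden units, 10 outputs); the random weights, their scale, and the input batch are placeholders assumed for illustration, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_hidden, num_classes = 784, 20, 10

# Placeholder parameters (assumed small random weights, zero biases).
W1 = rng.normal(scale=0.05, size=(num_features, num_hidden))
b1 = np.zeros(num_hidden)
W2 = rng.normal(scale=0.05, size=(num_hidden, num_classes))
b2 = np.zeros(num_classes)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Hidden layer with sigmoid activation, then a softmax output layer."""
    hidden = sigmoid(x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

x = rng.random((5, num_features))        # five fake "images" with values in [0, 1)
probs = forward(x)
num_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, num_params)           # (5, 10) 15910
```

Backpropagation computes the gradient of the loss with respect to all 15,910 of these parameters by applying the chain rule backwards through the two layers.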

Exercise: How do you expect this model to compare to KNN and logistic regression in terms of train time and accuracy?  Why?

Let's try it out...

In [6]:
## Model (feedforward) with 2 layers
model = Sequential() 

# One hidden layer
model.add(Dense(input_dim=num_features, units=20, activation='sigmoid')) 

# One output layer
model.add(Dense(input_dim=20, units=10, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=10,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.48643702268600464
Test accuracy: 0.8479999899864197


--------

As interest in networks with more layers and more complicated architectures has increased, a couple of tricks have emerged and become standard practice.  Let's look at two of those: rectifier activation and dropout noise.

Exercise:  We saw an improvement from adding a hidden layer.  What do you expect to happen if a second hidden layer were added?

Let's try it...

In [9]:
## Model (3 layers)
model = Sequential() 

#1st hidden layer
model.add(Dense(units=20, input_dim=num_features, activation='sigmoid'))

#2nd hidden layer
model.add(Dense(units=20, input_dim=20, activation='sigmoid'))

# output layer
model.add(Dense(units=10, input_dim=20, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=10,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.6939327120780945
Test accuracy: 0.7925000190734863


#### Activation Revisited

Let's look at a recent idea around activation closely associated with deep learning.  In a 2010 NIPS workshop paper (https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf), Xavier Glorot, Antoine Bordes, and Yoshua Bengio showed that rectifier activation works better empirically than sigmoid activation when used in the hidden layers.

The rectifier activation is simple: f(x)=max(0,x).  Intuitively, the difference is that a sigmoid-activated node saturates: as its output approaches 0 or 1, its gradient shrinks toward zero, so it stops learning even if error continues to be propagated to it.  A rectifier-activated node continues to learn (at least in the positive direction, where its gradient is 1).  Rectifiers also speed up training.
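
The saturation argument can be checked numerically. This is a minimal sketch comparing the derivatives of the two activations at a few hand-picked inputs; the helper names are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), at most 0.25 and near 0 when saturated."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of the rectifier: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))   # tiny at both extremes
print(relu_grad(z))      # [0. 0. 1. 1.]
```

For a large positive input the sigmoid gradient is close to zero while the rectifier gradient stays at 1, so the error signal passes through undiminished.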

Although the paper was published in 2010, the technique didn't gain widespread adoption until 2012 when members of Hinton's group spread the word, including with this Kaggle entry: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/

Let's change the activation in our two-layer network to rectifier and see what happens (note that the hidden layer below also grows from 20 to 30 units)...

In [10]:
## Model
model = Sequential() 

# Hidden layer with rectifier (relu) activation in place of sigmoid
model.add(Dense(units=30, input_dim=num_features, activation='relu')) 
model.add(Dense(units=10, input_dim=30, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.02)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=10,verbose=0, epochs=50) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.5663594603538513
Test accuracy: 0.8539999723434448


#### Noise

Previously, when working with the MNIST data, we saw a benefit in generalization from adding noise to the training data.  Let's try that again here, this time with a trick for adding noise called 'dropout'.  The idea with dropout is that instead of (or in addition to) adding noise to our inputs, we add noise by having each node return 0 with a certain probability during training.  This trick improves generalization in large networks, though it usually needs more epochs to converge rather than fewer.

Hinton et al. introduced the idea in 2012 and gave an explanation of why it's similar to bagging (http://arxiv.org/pdf/1207.0580v1.pdf).
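
The masking step can be sketched in a few lines of NumPy. This example assumes the "inverted" variant, in which the surviving activations are scaled up by 1/(1 - rate) during training so that nothing needs to change at test time; the function `dropout` here is a hypothetical helper, not the Keras layer.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; rescale the rest (inverted dropout)."""
    if not training or rate == 0.0:
        return activations               # at test time the layer is the identity
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where the node is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 30))             # stand-in for the output of a 30-unit hidden layer
dropped = dropout(hidden, rate=0.5, rng=rng)
print(np.unique(dropped))                # [0. 2.]
print(dropped.mean())                    # close to 1.0 on average
```

Because a different random mask is drawn for each batch, every update effectively trains a different thinned network, which is the source of the comparison with bagging.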

Let's give it a try...

In [11]:
## Model
model = Sequential() 
model.add(Dense(units=30, input_dim=num_features, activation='relu')) 
model.add(Dropout(0.5))
model.add(Dense(units=10, input_dim=30, activation='softmax')) 

## Cost function & Objective (and solver)
sgd = optimizers.SGD(lr=0.01)
model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, train_labels_b, shuffle=False, batch_size=10,verbose=0, epochs=100) 
score = model.evaluate(x_test, test_labels_b, verbose=0) 
print('Test score:', score[0]) 
print('Test accuracy:', score[1])

Test score: 0.47055062651634216
Test accuracy: 0.8539999723434448
