# MNIST: Handwritten Digit Recognition, Part 2
In this module, we will talk about how we can further improve performance using various techniques.

## Batch Normalization
Do you remember that we normalized the input images so that they have zero mean? Training converges faster when the inputs are normalized (zero mean and unit variance) and decorrelated. However, parameter updates during training change the distribution of the inputs to each layer, which is called *internal covariate shift*. Ioffe and Szegedy proposed [batch normalization](https://arxiv.org/abs/1502.03167), which normalizes the inputs to the intermediate layers over each mini-batch, to reduce this issue and make network training converge faster.
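To make the normalization step concrete, here is a minimal pure-Python sketch of the batch normalization transform for a single feature over one mini-batch: subtract the batch mean, divide by the batch standard deviation, then apply a scale and a shift. The function name `batch_norm_1d` and the default values of `gamma`, `beta`, and `eps` are assumptions made for illustration and are not part of this notebook. In a real layer `gamma` and `beta` are learned, and the Keras `BatchNormalization` layer also keeps moving averages of the mean and variance for use at inference time.

```python
import math

def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift.

    gamma, beta and eps are illustrative defaults (assumptions), not values
    taken from the notebook.
    """
    n = len(values)
    mean = sum(values) / n
    # biased (population) variance over the mini-batch
    var = sum((v - mean) ** 2 for v in values) / n
    # eps keeps the division stable when the variance is close to zero
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# a toy mini-batch of activations for one feature
batch = [0.0, 2.0, 4.0, 6.0]
normalized = batch_norm_1d(batch)
print(normalized)
```

After the transform, the values in the batch have approximately zero mean and unit variance, whatever scale the layer's inputs had drifted to.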

In [53]:
# Implement Batch Normalization
import numpy
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation
from keras.layers.convolutional import Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.utils import to_categorical
from keras import backend as K

# backend
K.set_image_dim_ordering( 'tf' )

# fix random seed for reproducibility
seed = 123
numpy.random.seed(seed)

# load data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# reshape to be [samples][height][width][channels]
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype( 'float32' )
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype( 'float32' )

# normalize inputs from 0-255 to 0-1
X_train = X_train / 255
X_test = X_test / 255

# one hot encode outputs
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
num_classes = y_test.shape[1]


#### Create the BN_model with batch normalization before each activation

* Include batch normalization in the input and hidden layers

In [54]:
def BN_model(): 
    model = Sequential()
    model.add(Conv2D(32, (3, 3), input_shape=(28,28,1)))
    model.add(BatchNormalization(axis=-1))
    model.add(Activation('relu'))
    
    model.add(Conv2D(32,(3, 3)))
    model.add(BatchNormalization(axis=-1))
    model.add(Activation('relu'))
    model.add(Conv2D(64, (3, 3)))
    model.add(BatchNormalization(axis=-1))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size = (2,2)))
    
    model.add(Conv2D(64,(3, 3)))
    model.add(BatchNormalization(axis=-1))
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(BatchNormalization(axis=-1))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size = (2,2)))
    
    model.add(Flatten())
    model.add(Dense(16, activation = 'relu'))
    model.add(Dense(num_classes, activation = 'softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    
    return model

In [55]:
# build the model
model = BN_model()

# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=200, verbose=2)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN Error with proper batch implementation: %.2f%%" % (100-scores[1]*100))

Train on 60000 samples, validate on 10000 samples
Epoch 1/20
 - 500s - loss: 0.2078 - acc: 0.9441 - val_loss: 0.1336 - val_acc: 0.9578
Epoch 2/20
 - 584s - loss: 0.0442 - acc: 0.9867 - val_loss: 0.0494 - val_acc: 0.9844
Epoch 3/20
 - 499s - loss: 0.0287 - acc: 0.9918 - val_loss: 0.0456 - val_acc: 0.9849
Epoch 4/20
 - 473s - loss: 0.0212 - acc: 0.9936 - val_loss: 0.0350 - val_acc: 0.9896
Epoch 5/20
 - 483s - loss: 0.0170 - acc: 0.9949 - val_loss: 0.0311 - val_acc: 0.9907
Epoch 6/20
 - 565s - loss: 0.0147 - acc: 0.9955 - val_loss: 0.0319 - val_acc: 0.9904
Epoch 7/20
 - 569s - loss: 0.0107 - acc: 0.9966 - val_loss: 0.0276 - val_acc: 0.9924
Epoch 8/20
 - 534s - loss: 0.0096 - acc: 0.9970 - val_loss: 0.0294 - val_acc: 0.9904
Epoch 9/20
 - 463s - loss: 0.0079 - acc: 0.9976 - val_loss: 0.0335 - val_acc: 0.9892
Epoch 10/20
 - 460s - loss: 0.0082 - acc: 0.9974 - val_loss: 0.0274 - val_acc: 0.9911
Epoch 11/20
 - 458s - loss: 0.0072 - acc: 0.9977 - val_loss: 0.0536 - val_acc: 0.9859
Epoch 12/20
 

From the above, can you get test error below 0.5%?

On the model above I was able to get the training loss down to roughly 2 basis points, but the error on the test data was 1.38%, suggesting some overfitting occurred. Below, with the incorrect implementation, I was able to get the model to perform with only 0.92% error, which is still not as good as we need, but better than the model with correctly implemented batch normalization. I think a solution here could be a higher dropout rate with the model above.

Where should you position the batch norm layer to implement the batch norm correctly?

From everything I've read and researched, you place the batch normalization after the convolution but before the activation step of the CNN.

**Claim:** Some people argue that they can get as good or better results by incorrectly implementing batchnorm such that it comes after the activation layer. Test if this is true. What test error do you get?

I end up getting a CNN error of 0.92% when I implement batch normalization after the activation function. That is actually an improvement over the proper implementation; however, I believe the proper implementation was overfit and could be tuned to do better.

#### Implement Batch Normalization - after the activation 

In [56]:
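def unit(model, n_filter=16, init = False, dropout=True):
    # One VGG-style unit: two 3x3 convs with ReLU, batch norm AFTER the activation,
    # then 2x2 max pooling and optional dropout.
    # Note: n_filter is accepted but not used; the filter counts below are hard-coded.
    if init:
        # the first unit also declares the input shape
        model.add(Conv2D(16,3, input_shape = (28,28,1), activation = 'relu', padding = 'same'))
        model.add(Conv2D(12,3, activation = 'relu', padding = 'same'))
        model.add(BatchNormalization())
    else:
        model.add(Conv2D(12,3, activation = 'relu', padding = 'same'))
        model.add(Conv2D(12,3, activation = 'relu', padding = 'same'))
        model.add(BatchNormalization())
    
    # halve the spatial resolution
    model.add(MaxPooling2D(pool_size = (2,2)))
    
    if dropout:
        model.add(Dropout(0.2))
    return model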

In [57]:
def BNr_model(n, dropout=True):
### YOUR TURN
    # Create a model with 4 convolutional layers (2 repeating VGG style units) and 2 dense layers before the output
    # Use Batch Normalization for every conv and dense layers
    model = Sequential()
    model = unit(model, init=True, dropout = dropout)
    if n > 1:
        for i in range(1,n):
            filters = min(16*2**i,512)
            model = unit(model, n_filter = filters)
    model.add(Flatten())
    model.add(Dense(64, activation = 'relu'))
    model.add(Dense(num_classes, activation = 'softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
    return model

In [58]:
# build the model
model = BNr_model(4)

# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20, batch_size=200, verbose=2)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN Error with improper batch implementation: %.2f%%" % (100-scores[1]*100))



Train on 60000 samples, validate on 10000 samples
Epoch 1/20
 - 129s - loss: 1.0869 - acc: 0.6324 - val_loss: 0.1890 - val_acc: 0.9469
Epoch 2/20
 - 127s - loss: 0.2925 - acc: 0.9144 - val_loss: 0.0960 - val_acc: 0.9700
Epoch 3/20
 - 127s - loss: 0.1948 - acc: 0.9451 - val_loss: 0.0763 - val_acc: 0.9780
Epoch 4/20
 - 127s - loss: 0.1557 - acc: 0.9566 - val_loss: 0.0648 - val_acc: 0.9820
Epoch 5/20
 - 127s - loss: 0.1333 - acc: 0.9631 - val_loss: 0.0627 - val_acc: 0.9805
Epoch 6/20
 - 127s - loss: 0.1203 - acc: 0.9667 - val_loss: 0.0532 - val_acc: 0.9841
Epoch 7/20
 - 127s - loss: 0.1076 - acc: 0.9697 - val_loss: 0.0571 - val_acc: 0.9826
Epoch 8/20
 - 127s - loss: 0.1002 - acc: 0.9715 - val_loss: 0.0465 - val_acc: 0.9857
Epoch 9/20
 - 127s - loss: 0.0915 - acc: 0.9742 - val_loss: 0.0380 - val_acc: 0.9874
Epoch 10/20
 - 127s - loss: 0.0868 - acc: 0.9750 - val_loss: 0.0387 - val_acc: 0.9885
Epoch 11/20
 - 143s - loss: 0.0813 - acc: 0.9771 - val_loss: 0.0449 - val_acc: 0.9859
Epoch 12/20
 

### Recording loss and metric
By default (in Keras 2), `model.fit` returns a `History` object whose `history` attribute is a dictionary of per-epoch values (the same record is available through the history callback). The dictionary has keys for the loss and for each metric (if you specified metrics in `model.compile`), for both training and validation. In our case the keys are 'val_loss', 'val_acc', 'loss', and 'acc'. A good use of this log is to monitor overfitting: when the model overfits, the validation loss starts to rise at some point while the training loss keeps going down. Let's get rid of the batch norm layers and run the model with a higher learning rate (lr=0.01) and more epochs (50) to see if it overfits (Answer: Yes it does, quite terribly).
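As a rough illustration of reading such a log, the sketch below finds the epoch with the lowest validation loss and counts how many epochs have passed since then. The helper names `best_epoch` and `epochs_since_best` are hypothetical, and the dictionary is typed in by hand from the first 11 epochs of the `BN_model` training output shown earlier; in the notebook you would pass `log.history`, where `log` is the value returned by `model.fit`.

```python
def best_epoch(history):
    """Return (1-based epoch, val_loss) where the validation loss is lowest."""
    val_loss = history["val_loss"]
    idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    return idx + 1, val_loss[idx]

def epochs_since_best(history):
    """Number of epochs trained after the best validation loss was reached."""
    epoch, _ = best_epoch(history)
    return len(history["val_loss"]) - epoch

# Values copied by hand from the first 11 epochs of the BN_model log
history = {
    "loss": [0.2078, 0.0442, 0.0287, 0.0212, 0.0170, 0.0147,
             0.0107, 0.0096, 0.0079, 0.0082, 0.0072],
    "val_loss": [0.1336, 0.0494, 0.0456, 0.0350, 0.0311, 0.0319,
                 0.0276, 0.0294, 0.0335, 0.0274, 0.0536],
}

epoch, loss = best_epoch(history)
print("lowest val_loss %.4f at epoch %d" % (loss, epoch))
print("epochs since best:", epochs_since_best(history))
```

A training loss that keeps falling while the count of epochs since the best validation loss keeps growing is the overfitting pattern described above.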

In [None]:
import time
from keras.optimizers import Adam

def model_overfit():
### YOUR TURN
    # 1) Create a model with the same architecture above (4 convs and 2 denses before output) and hyperparameters, 
    # but without any batch normalization and dropouts.
    # 2) To make this overfit surely, let's change the learning rate of our Adam optimizer. Set the learning rate to 0.01.
    # 3) After running the training, plot the train and validation accuracy using the model output history.
    
    return model

# build the model
model = model_overfit()

# Fit the model
t0=time.time()
log = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=200, verbose=2)
t1=time.time()
print(t1-t0," seconds")

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("CNN Error: %.2f%%" % (100-scores[1]*100))

#### Tune Learning rate
Without inserting batchnorm or dropout again, decrease the learning rate and run for 50 epochs, then plot the train and validation accuracy. What is the highest learning rate at which the model doesn't overfit? What is the resulting validation accuracy?

In [None]:
#Your code here

#### Add Dropout
Now, add dropout layers and run with the same hyperparameters (learning rate, epochs) you found above. Time `model.fit()` using `time.time`.
1) Does adding dropout increase the training time?
2) For the same number of epochs, is your final validation accuracy better? If it is not better and you're sure the model is not overfitting yet, try increasing either your learning rate or the number of epochs, OR change your dropout rate(s). Record your optimum values.

In [None]:
#Your code here

#### Add Batch Normalization
Now, get rid of dropouts and add batch normalization layers. Choose a learning rate between 0.001 and 0.01. Find the largest learning rate that still does not overfit but gives the highest accuracy.
Time `model.fit()` using `time.time`.
Plot 'acc' and 'val_acc'.
Compare the learning rate with those from Exercises 1 and 2. What do you find?

In [None]:
#Your code here

### Quiz.

#### 1. 
What are the advantages of a CNN over a fully connected ANN for image classification?

#### 2. 
Consider a CNN composed of 3 convolutional layers, each with 3x3 kernels, a stride of 2, and 'same' padding. The first layer outputs a feature map with 100 channels, the second outputs a feature map with 200 channels, and the last outputs one with 400 channels. The input is color (RGB) images of 200x300 pixels. What is the total number of parameters for this CNN model?
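One way to check an answer to this kind of question is to count the parameters layer by layer. The sketch below assumes each filter has one bias term (the Keras `Conv2D` default); the helper name `conv2d_params` is made up for illustration. Stride, padding, and image size change the size of the feature maps but not the number of parameters, so they do not appear in the formula.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, use_bias=True):
    """Trainable parameters of one 2D convolution layer."""
    # each output channel has one kernel_h x kernel_w filter per input channel
    weights = kernel_h * kernel_w * in_channels * out_channels
    # assumption: one bias per output channel
    biases = out_channels if use_bias else 0
    return weights + biases

# RGB input (3 channels) followed by layers with 100, 200 and 400 output channels
channels = [3, 100, 200, 400]
per_layer = [conv2d_params(3, 3, c_in, c_out)
             for c_in, c_out in zip(channels, channels[1:])]
total = sum(per_layer)
print(per_layer, total)
```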

#### 3.
If your GPU runs out of memory while you train a CNN model, what can you do to resolve the issue? List at least 3 ways to