# Problem 4: Batch Normalization, Dropout, MNIST

## 4.1: Co-adaptation and Covariate Shift

Co-adaptation: In a dense neural network, each neuron should ideally learn to detect a different feature, either alone or as part of a small group. When every neuron captures something different about the feature space, the network performs better. During training, however, it can happen that a certain set of neurons is activated together every time, because a particular feature (or a slight variation of it) is present in all the input vectors. The neurons then behave like a cluster: they work as a group to predict the same feature and come to rely on one another. This is called co-adaptation.
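Dropout (used in section 4.4) is the usual counter to co-adaptation: because any unit may be switched off at random during training, no unit can rely on a fixed set of partners. The sketch below illustrates the idea on a plain list of activations. The function name `dropout` and the "inverted" rescaling by `1 / keep` are assumptions for illustration, not the notebook's or Keras's internal code.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`,
    and scale the survivors by 1 / (1 - rate) so the expected value is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the run is repeatable
hidden = [1.0] * 10000            # hypothetical activations of one layer
dropped = dropout(hidden, rate=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.2f}, mean after dropout: {mean_after:.2f}")
```

Roughly half of the units are zeroed on each pass, while the mean activation stays close to its original value, which is why no rescaling is needed at test time.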

Covariate shift: The change in the distribution of network activations between layers during training is called internal covariate shift. As the input propagates through the layers, the distribution of the activations changes with the weights and activation functions of the preceding layers, and it keeps changing as those weights are updated. As a result, the inputs to deeper layers can drift into the saturated regions of the non-linearity, where gradients are small. This in turn increases the time the network takes to converge. Reducing internal covariate shift, in essence, speeds up convergence.
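Batch normalization addresses this by re-centering and re-scaling each feature over the current mini-batch before the activation function. A minimal sketch of the training-time forward pass for a single feature follows. The function name `batch_norm` and the toy batch are assumptions for illustration, and `eps=1e-3` is chosen to match the Keras default.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then apply the learned scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [10.0, 12.0, 14.0, 16.0]   # hypothetical pre-activations, far from zero
normalized = batch_norm(batch)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(f"mean={norm_mean:.4f}, variance={norm_var:.4f}")
```

Whatever the scale of the incoming values, the normalized feature has mean close to 0 and variance close to 1, which keeps the inputs of a `tanh` unit away from its flat, saturated regions.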

### The following sections contain the code for each of the sub-problems. The final section consolidates all the metrics in a table of test-set loss and accuracy for the different models.

In [1]:
from keras.optimizers import RMSprop,Adagrad,Adadelta,Adam,Nadam
from keras.models import Sequential
from keras.layers import Conv2D,AveragePooling2D,Dense,Dropout,Activation,BatchNormalization
import keras.layers as layers
from keras import regularizers

from keras.datasets import mnist
from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
# Data prep
# Load dataset as train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Set numeric type to float32 from uint8
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")

# Transform labels to one-hot encoding
y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)

# Reshape the dataset into a 4D array: (samples, height, width, channels)
x_train = x_train.reshape(x_train.shape[0], 28,28,1)
x_test = x_test.reshape(x_test.shape[0], 28,28,1)

metrics = []

## 4.2 With BatchNorm for hidden layers, and standard normalization for input layer

In [4]:
model = Sequential()

# C1 Convolutional Layer
model.add(Conv2D(6, kernel_size=(5,5), strides=(1,1), use_bias=False, input_shape=(28,28,1), padding="same"))
model.add(BatchNormalization())
model.add(Activation('tanh'))

# S2 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid'))

# C3 Convolutional Layer
model.add(Conv2D(16, kernel_size=(5, 5), strides=(1, 1), use_bias=False, padding='valid'))
model.add(BatchNormalization())
model.add(Activation('tanh'))

# S4 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))

# C5 Fully Connected Convolutional Layer
model.add(Conv2D(120, kernel_size=(5, 5), strides=(1, 1), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('tanh'))

#Flatten the CNN output so that we can connect it with fully connected layers
model.add(layers.Flatten())

# FC6 Fully Connected Layer
model.add(Dense(84))
model.add(BatchNormalization())
model.add(Activation('tanh'))

#Output Layer with softmax activation
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy']) 
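
As a sanity check on the architecture above, the spatial size after each layer can be traced by hand. The sketch below assumes the usual output-size rules for `"same"` and `"valid"` padding. The helper `out_size` is a hypothetical name, not a Keras function.

```python
def out_size(size, kernel, stride, padding):
    """Output width/height of a conv or pooling layer for a square input."""
    if padding == "same":
        return -(-size // stride)              # ceil(size / stride)
    return (size - kernel) // stride + 1       # "valid" padding

size = 28                                      # MNIST images are 28x28
trace = []
for name, kernel, stride, padding in [
    ("C1", 5, 1, "same"),
    ("S2", 2, 1, "valid"),
    ("C3", 5, 1, "valid"),
    ("S4", 2, 2, "valid"),
    ("C5", 5, 1, "valid"),
]:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, size))

flattened = size * size * 120                  # C5 has 120 filters
print(trace)      # [('C1', 28), ('S2', 27), ('C3', 23), ('S4', 11), ('C5', 7)]
print(flattened)  # 5880
```

Because S2 uses `strides=(1, 1)`, C5 produces a 7×7×120 volume, so the Flatten layer feeds 5880 values into FC6. With `strides=(2, 2)` in S2 the same arithmetic would give 28 → 14 → 10 → 5 → 1, and C5 would reduce to a single position.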

In [5]:
# Standard normalization of the input: scale pixels from [0, 255] to [0, 1]
x_train /= 255
x_test /= 255

history = model.fit(x_train,y_train,epochs=10,batch_size=256)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10



### Batch Normalization layer Parameters:

In [6]:
# get_weights() on a BatchNormalization layer returns [gamma, beta, moving_mean, moving_variance]
print("Learned batch parameters: \nLayer 1 (After first convolutional layer) : ",model.layers[1].get_weights(),
      "\nLayer 2 (After second convolutional layer): ",model.layers[5].get_weights(),
      "\nLayer 3 (After third convolutional layer): ",model.layers[9].get_weights(),
      "\nLayer 4 (After first dense layer): ",model.layers[13].get_weights())

Learned batch parameters: 
Layer 1 (After first convolutional layer) :  [array([1.0227486 , 0.9933009 , 0.95239097, 1.0617564 , 1.0096271 ,
       1.0478406 ], dtype=float32), array([-0.0511883 , -0.00876471, -0.00128828, -0.08093079, -0.05750775,
       -0.04072163], dtype=float32), array([-0.09701837, -0.04794088, -0.0585382 , -0.16585663, -0.07888251,
       -0.16325763], dtype=float32), array([0.08274772, 0.0272827 , 0.01749997, 0.09768781, 0.03728754,
       0.08225607], dtype=float32)] 
Layer 2 (After second convolutional layer):  [array([1.0038083 , 1.0049825 , 1.0063536 , 0.9980975 , 0.99966276,
       1.0016065 , 1.0001917 , 1.0058019 , 1.0002332 , 1.0133479 ,
       0.999777  , 0.9991526 , 0.9976403 , 1.0245603 , 1.0088576 ,
       0.99244666], dtype=float32), array([-0.01195861,  0.01223815, -0.0022188 ,  0.00485479,  0.01419998,
       -0.02188285,  0.00901312,  0.00899166,  0.00261207, -0.01462903,
        0.01454773, -0.01018127,  0.00911689,  0.03315625, -0.01661647,
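
Each layer in the output above prints four arrays, which Keras returns in the order `[gamma, beta, moving_mean, moving_variance]`. At test time the layer uses the moving statistics instead of batch statistics. The sketch below applies that formula to the first channel of Layer 1, using the values printed above. The function name `bn_inference` is an assumption for illustration, and `eps=1e-3` is chosen to match the Keras default.

```python
import math

def bn_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Batch normalization at inference time for a single value of one channel."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta

# First channel of Layer 1, copied from the printed weights
gamma, beta = 1.0227486, -0.0511883
moving_mean, moving_var = -0.09701837, 0.08274772

y_at_mean = bn_inference(moving_mean, gamma, beta, moving_mean, moving_var)
y_at_zero = bn_inference(0.0, gamma, beta, moving_mean, moving_var)
print(f"output at the moving mean: {y_at_mean:.4f}")
print(f"output at x = 0:           {y_at_zero:.4f}")
```

An input equal to the moving mean maps exactly to `beta`, and `gamma` close to 1 means the learned scale has barely moved from its initial value after 10 epochs.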
   

In [7]:
# test set metrics:
metrics.append(model.evaluate(x_test,y_test))



## 4.3 With Batch Norm for all layers

In [8]:
model = Sequential()

# Batch Normalization on the input
model.add(BatchNormalization(input_shape=(28,28,1)))

# C1 Convolutional Layer
model.add(Conv2D(6, kernel_size=(5,5), strides=(1,1), use_bias=False, padding="same"))
model.add(BatchNormalization())
model.add(Activation('tanh'))

# S2 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid'))

# C3 Convolutional Layer
model.add(Conv2D(16, kernel_size=(5, 5), strides=(1, 1), use_bias=False, padding='valid'))
model.add(BatchNormalization())
model.add(Activation('tanh'))

# S4 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))

# C5 Fully Connected Convolutional Layer
model.add(Conv2D(120, kernel_size=(5, 5), strides=(1, 1), padding='valid'))
model.add(BatchNormalization())
model.add(Activation('tanh'))

#Flatten the CNN output so that we can connect it with fully connected layers
model.add(layers.Flatten())

# FC6 Fully Connected Layer
model.add(Dense(84))
model.add(BatchNormalization())
model.add(Activation('tanh'))

#Output Layer with softmax activation
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy']) 

In [9]:
model.fit(x_train,y_train,epochs=10,batch_size=256)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10



In [10]:
metrics.append(model.evaluate(x_test,y_test))



## 4.4 Using dropout instead of Batch Norm

In [11]:
model = Sequential()

# Dropout on the input
model.add(Dropout(0.2,input_shape=(28,28,1)))

# C1 Convolutional Layer
model.add(Conv2D(6, kernel_size=(5,5), strides=(1,1), use_bias=False, padding="same"))
model.add(Dropout(0.5))
model.add(Activation('tanh'))

# S2 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid'))

# C3 Convolutional Layer
model.add(Conv2D(16, kernel_size=(5, 5), strides=(1, 1), use_bias=False, padding='valid'))
model.add(Dropout(0.5))
model.add(Activation('tanh'))

# S4 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))

# C5 Fully Connected Convolutional Layer
model.add(Conv2D(120, kernel_size=(5, 5), strides=(1, 1), padding='valid'))
model.add(Dropout(0.5))
model.add(Activation('tanh'))

#Flatten the CNN output so that we can connect it with fully connected layers
model.add(layers.Flatten())

# FC6 Fully Connected Layer
model.add(Dense(84))
model.add(Dropout(0.5))
model.add(Activation('tanh'))

#Output Layer with softmax activation
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy']) 

In [12]:
model.fit(x_train,y_train,epochs=10,batch_size=256)
metrics.append(model.evaluate(x_test,y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## 4.5 Batch Norm + Dropout

In [13]:
model = Sequential()

# Dropout and Batch Normalization on the input
model.add(Dropout(0.2,input_shape=(28,28,1)))
model.add(BatchNormalization())

# C1 Convolutional Layer
model.add(Conv2D(6, kernel_size=(5,5), strides=(1,1), use_bias=False, padding="same"))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Activation('tanh'))

# S2 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid'))

# C3 Convolutional Layer
model.add(Conv2D(16, kernel_size=(5, 5), strides=(1, 1), use_bias=False, padding='valid'))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Activation('tanh'))

# S4 Pooling Layer
model.add(AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))

# C5 Fully Connected Convolutional Layer
model.add(Conv2D(120, kernel_size=(5, 5), strides=(1, 1), padding='valid'))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Activation('tanh'))

#Flatten the CNN output so that we can connect it with fully connected layers
model.add(layers.Flatten())

# FC6 Fully Connected Layer
model.add(Dense(84))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Activation('tanh'))

#Output Layer with softmax activation
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy']) 

In [16]:
model.fit(x_train,y_train,epochs=10,batch_size=256)
metrics.append(model.evaluate(x_test,y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [19]:
print(metrics)

[[0.04980983104780316, 0.987500011920929], [0.050857153448462485, 0.9872000217437744], [0.2754036287669092, 0.9246000051498413], [0.24856007757689805, 0.9294000267982483], [0.18753284983639606, 0.9469000101089478]]


## Comparison between Loss and Accuracy

| Model | Loss on Test Set | Accuracy on Test Set |
|------|------|------------|
| 4.2: Batch Normalization for hidden layers + Standard Normalization for input layer | 0.04980983104780316 | 0.987500011920929 |
| 4.3: Batch Normalization for all layers | 0.050857153448462485 | 0.9872000217437744 |
| 4.4: Dropout only | 0.24856007757689805 | 0.9294000267982483 |
| 4.5: Dropout + Batch Normalization on all layers | 0.18753284983639606 | 0.9469000101089478 |

From the above, after 10 epochs with the default SGD learning rate and a batch size of 256:

1) Standard normalization of the input combined with batch normalization on the hidden layers gives the best results, although its margin over batch normalization on all layers is very small (accuracy 0.9875 vs 0.9872).

2) Dropout alone performs considerably worse than batch normalization in this setup (accuracy 0.9294).

3) Using both dropout and batch normalization performs better than dropout alone (accuracy 0.9469 vs 0.9294), but is clearly worse than using batch normalization alone.
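
To put these accuracies in more concrete terms, the sketch below converts them into the number of misclassified test images, assuming the standard MNIST test set of 10,000 images. The dictionary `results` simply restates the four rows of the table. Its name and the short row labels are chosen for illustration.

```python
# (loss, accuracy) per model, copied from the comparison table
results = {
    "4.2 BN hidden + normalized input": (0.04980983104780316, 0.987500011920929),
    "4.3 BN all layers": (0.050857153448462485, 0.9872000217437744),
    "4.4 Dropout only": (0.24856007757689805, 0.9294000267982483),
    "4.5 Dropout + BN": (0.18753284983639606, 0.9469000101089478),
}
test_size = 10000  # assumed size of the MNIST test set

# Number of misclassified test images per model
errors = {name: round((1.0 - acc) * test_size) for name, (_loss, acc) in results.items()}
best = min(results, key=lambda name: results[name][0])  # model with the lowest test loss

for name in sorted(results, key=lambda n: errors[n]):
    loss, acc = results[name]
    print(f"{name:<34} loss={loss:.4f}  acc={acc:.4f}  errors={errors[name]}")
```

Seen this way, the two batch-normalized models differ by only three test images (125 vs 128 errors), while the dropout-only model misclassifies 706.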
