# Batch Normalization
It is used to prevent Gradient Vanishing/Exploding problem. Gradient Vanishing/Exploding problem occur when their are large number of hidden layers and weights in first fews layers are either not updated with significant amount (Gradient Vanishing) OR weights in first few layers updates with very large amount i.e. weights are not converging (Gradient Exploding).

### Batch Normalization Working
At every hidden layers, before implementing activation function we should perform some operation to scale (centering and normalizing) the inputs. We compute mean and standard deviation of input vectors and then perform scaling operation. After scaling we implement the activation function. This is done for every hidden layers as well as in the final output layer.

In [47]:
import tensorflow as tf
import numpy as np


# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
    
# To plot pretty figures
%matplotlib inline
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.style.use('ggplot')

##### Importing MNIST datasets from TensorFlow

In [69]:
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

### Implementing Neural Networks with Batch Normalization

In [50]:
# Initializing the parameters
n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [72]:
# Creating a function which will fetch the data into batches for given batch size
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = int(np.ceil(len(X)/batch_size))
    for i in range(n_batches):
        yield X[rnd_idx[i*batch_size : i*batch_size+batch_size]], y[rnd_idx[i*batch_size : i*batch_size+batch_size]]

In [76]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name = "y")
training = tf.placeholder_with_default(False, shape=(), name="training")

#### Deep Neural Net Architechture (with Batch Normalization)

In [77]:
with tf.name_scope("dnn"):
    he_init = tf.variance_scaling_initializer()
    hidden1 = tf.layers.dense(X, n_hidden1, kernel_initializer=he_init, name="hidden1")
    bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
    bn1_act = tf.nn.elu(bn1)

    hidden2 = tf.layers.dense(bn1_act, n_hidden2, kernel_initializer=he_init, name="hidden2")
    bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
    bn2_act = tf.nn.elu(bn2)

    logits_before_bn = tf.layers.dense(bn2_act, n_outputs, kernel_initializer=he_init, name="output")
    logits = tf.layers.batch_normalization(logits_before_bn, training=training, momentum=0.9)

In [78]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

In [79]:
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)

In [80]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [81]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [82]:
n_epochs = 40
batch_size = 50

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y:y_batch})
        if epoch % 5 == 0:
            acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
            print(epoch, "Batch accuracy:", acc_batch, "Validation accuracy:", acc_valid)
    save_path = saver.save(sess, "./mnist_model_dnn_BN.ckpt")

0 Batch accuracy: 0.88 Validation accuracy: 0.9
5 Batch accuracy: 0.94 Validation accuracy: 0.936
10 Batch accuracy: 0.92 Validation accuracy: 0.956
15 Batch accuracy: 0.92 Validation accuracy: 0.9648
20 Batch accuracy: 1.0 Validation accuracy: 0.97
25 Batch accuracy: 1.0 Validation accuracy: 0.9726
30 Batch accuracy: 1.0 Validation accuracy: 0.9746
35 Batch accuracy: 1.0 Validation accuracy: 0.976


The results are not better than usual DNN (without Batch Normalization) because their are not too many hidden layers. The effect of BN is seen on large hidden layers where vanish gradient problem occur.