# Batch Normalization from scratch

[Batch Normalization](https://arxiv.org/abs/1502.03167) is another way to avoid overfitting, gradient explosion or elimination.

In [1]:
from __future__ import print_function
import mxnet as mx
import numpy as np
from mxnet import nd, autograd
import numpy as np
mx.random.seed(1)
ctx = mx.gpu()

## The MNIST dataset

Let's prepare the data (as always!)

In [2]:
batch_size = 64
num_inputs = 784
num_outputs = 10
def transform(data, label):
    return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.float32)
train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),
                                      batch_size, shuffle=True)
test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),
                                     batch_size, shuffle=False)

## Batch Normalization layer

The layer, unlike Dropout, is usually used **before** the activation layer 
(according to the authors' original paper), instead of after activation layer.

The basic idea is doing the normalization then applying a linear scale and shift to the mini-batch:

For input mini-batch $B = \{x_{1, ..., m}\}$, we want to learn the parameter $\gamma$ and $\beta$.
The output of the layer is $\{y_i = BN_{\gamma, \beta}(x_i)\}$, where:

$$\mu_B \leftarrow \frac{1}{m}\sum_{i = 1}^{m}x_i$$
$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m}(x_i - \mu_B)^2$$
$$\hat{x_i} \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$$y_i \leftarrow \gamma \hat{x_i} + \beta \equiv \mbox{BN}_{\gamma,\beta}(x_i)$$

* formula taken from Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International Conference on Machine Learning. 2015.

For the spirit of "from scratch", we implement the layer by ourselves,
by referencing the formulas from the original paper.

In [3]:
def pure_batch_norm(X, gamma = 1, beta = 0, eps = 1e-5):
    # mini-batch mean
    mu = nd.mean(X, axis=0)
    
    # mini-batch variance
    variance = nd.mean((X - mu) ** 2, axis=0)
    
    # normalize
    X_hat = (X - mu) * 1.0 / nd.sqrt(variance + eps)
    
    # scale and shift
    out = gamma * X_hat + beta
    
    # return
    return out

Let's do some sanity check. We expect each **column** of the input matrix is normalized.

In [4]:
A = nd.array([1,7,5,4,6,10], ctx=ctx).reshape((3,2))
A


[[  1.   7.]
 [  5.   4.]
 [  6.  10.]]
<NDArray 3x2 @gpu(0)>

In [5]:
pure_batch_norm(A,
    gamma = nd.array([1,1], ctx=ctx), 
    beta=nd.array([0,0], ctx=ctx))


[[-1.38872862  0.        ]
 [ 0.46290955 -1.22474384]
 [ 0.9258191   1.22474384]]
<NDArray 3x2 @gpu(0)>

It seems that we implement it correctly.

Note: Batch Normalization's **backward** pass is a little bit tricky. But we are not covering it here, because the `autograd` from `mxnet` automatically figures it out!

However, in the testing process, we want to use the mean and variance of the **complete dataset**, instead of those of **mini batches**. In the implementation, we use moving statistics as a trade off, because we don't want to or don't have the ability to compute the statistics of the complete dataset (in the second loop).

We need to improve the BN function a little bit.

Then here comes another concern: we need to maintain the moving statistics **along with multiple runs of the BN**. It's an engineering issue rather than a deep/machine learning issue. On the one hand, the moving statistics are similar to `gamma` and `beta`; on the other hand, they are **not** updated by the gradient backwards. In this quick-and-dirty implementation, we make good use of a python feature: the statistics are stored as an dictionary attribute of the function, with `scope_name`s of each different layers.

What's the attribute of a function in Python? Look at this example to get a feel:

In [6]:
def func_with_state():
    try:
        func_with_state.something += 1
    except:
        func_with_state.something = 0
    print(func_with_state.something)

for _ in range(5):
    func_with_state()

0
1
2
3
4


Now we are ready to define our `batch_norm()`:

In [7]:
def batch_norm(X,
               gamma = 1,
               beta = 0,
               momentum = 0.9,
               eps = 1e-5,
               scope_name = '',
               is_training = True,
               debug = False):
    """compute the batch norm """
    #########################
    # the usual batch norm transformation
    #########################
    
    # mini-batch mean
    mean = nd.mean(X, axis=0)
    
    # mini-batch variance
    variance = nd.mean((X - mean) ** 2, axis=0)
    
    # normalize
    if is_training:
        # while training, we normalize the data using its mean and variance
        X_hat = (X - mean) / nd.sqrt(variance + eps)
    else:
        # while testing, we normalize the data using the pre-computed mean and variance
        X_hat = (X - batch_norm.moving_mean[scope_name]) / nd.sqrt(batch_norm.moving_var[scope_name] + eps)
    
    # scale and shift
    out = gamma * X_hat + beta
      
    #########################
    # to keep the moving statistics
    #########################
    
    # init the attributes
    try: # to access them
        batch_norm.moving_mean
        batch_norm.moving_var
    except: # error, create them
        batch_norm.moving_mean = {}
        batch_norm.moving_var = {}
    
    # store the moving statistics by their scope_names, inplace    
    if scope_name not in batch_norm.moving_mean:
        batch_norm.moving_mean[scope_name] = mean
    else:
        batch_norm.moving_mean[scope_name] = batch_norm.moving_mean[scope_name] * momentum + mean * (1.0 - momentum)
    if scope_name not in batch_norm.moving_var:
        batch_norm.moving_var[scope_name] = variance
    else:
        batch_norm.moving_var[scope_name] = batch_norm.moving_var[scope_name] * momentum + variance * (1.0 - momentum)
        
    #########################
    # debug info
    #########################
    if debug:
        print('== info start ==')
        print('scope_name = {}'.format(scope_name))
        print('mean = {}'.format(mean))
        print('var = {}'.format(variance))
        print('moving_mean = {}'.format(moving_mean[scope_name]))
        print('moving_var = {}'.format(moving_var[scope_name]))
        print('output = {}'.format(out))
        print('== info end ==')
 
    #########################
    # return
    #########################
    return out

## Parameters and gradients

In [8]:
W1 = nd.random_normal(shape=(num_inputs, 256), ctx=ctx) *.01
b1 = nd.random_normal(shape=256, ctx=ctx) * .01

gamma1 = nd.random_normal(loc = 1, scale = .01, shape=256, ctx=ctx)
beta1 = nd.random_normal(shape=256, ctx=ctx) * .01

W2 = nd.random_normal(shape=(256,128), ctx=ctx) *.01
b2 = nd.random_normal(shape=128, ctx=ctx) * .01

gamma2 = nd.random_normal(loc = 1, scale = .01, shape=128, ctx=ctx)
beta2 = nd.random_normal(shape=128, ctx=ctx) * .01

W3 = nd.random_normal(shape=(128, num_outputs), ctx=ctx) *.01
b3 = nd.random_normal(shape=num_outputs, ctx=ctx) *.01

params = [W1, b1, gamma1, beta1, W2, b2, gamma2, beta2, W3, b3]

In [9]:
for param in params:
    param.attach_grad()

## Activation fucntions

In [10]:
def relu(X):
    return nd.maximum(X, 0)

## Softmax output

In [11]:
def softmax(y_linear):
    exp = nd.exp(y_linear-nd.max(y_linear))
    partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1,1))
    return exp / partition

## The *softmax* cross-entropy loss function

In [12]:
def softmax_cross_entropy(yhat_linear, y):
    return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)

## Define the model

We insert the BN layer right after each linear layer.

In [13]:
def net(X, is_training=True, debug=False):
    #######################
    #  Compute the first hidden layer 
    #######################    
    h1_linear = nd.dot(X, W1) + b1
    h1_normed = batch_norm(h1_linear, gamma1, beta1, scope_name='bn1', is_training=is_training, debug=debug)
    h1 = relu(h1_normed)
    
    #######################
    #  Compute the second hidden layer
    #######################
    h2_linear = nd.dot(h1, W2) + b2
    h2_normed = batch_norm(h2_linear, gamma2, beta2, scope_name='bn2', is_training=is_training, debug=debug)
    h2 = relu(h2_normed)
    
    #######################
    #  Compute the output layer.
    #  We will omit the softmax function here 
    #  because it will be applied 
    #  in the softmax_cross_entropy loss
    #######################
    yhat_linear = nd.dot(h2, W3) + b3
    return yhat_linear

## Optimizer

In [14]:
def SGD(params, lr):    
    for param in params:
        param[:] = param - lr * param.grad

## Evaluation metric

In [15]:
def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ctx).reshape((-1,784))
        label = label.as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 10)
        output = net(data, is_training=False)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()

## Execute the training loop

In [16]:
epochs = 10
moving_loss = 0.
learning_rate = .001

for e in range(epochs):
    for i, (data, label) in enumerate(train_data):
        data = data.as_in_context(ctx).reshape((-1,784))
        label = label.as_in_context(ctx)
        label_one_hot = nd.one_hot(label, 10)
        with autograd.record():
            # we are in training process,
            # so we normalize the data using batch mean and variance
            output = net(data, is_training=True)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, learning_rate)
        
        ##########################
        #  Keep a moving average of the losses
        ##########################
        if i == 0:
            moving_loss = nd.mean(loss).asscalar()
        else:
            moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
            
    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_accuracy, test_accuracy)) 

Epoch 0. Loss: 0.131050964008, Train_acc 0.977533, Test_acc 0.9708
Epoch 1. Loss: 0.0891214484755, Train_acc 0.9857, Test_acc 0.9769
Epoch 2. Loss: 0.0733051626004, Train_acc 0.990283, Test_acc 0.9776
Epoch 3. Loss: 0.0481838382682, Train_acc 0.993233, Test_acc 0.9789
Epoch 4. Loss: 0.0439620076941, Train_acc 0.994983, Test_acc 0.9808
Epoch 5. Loss: 0.035822861736, Train_acc 0.9964, Test_acc 0.9796
Epoch 6. Loss: 0.0315828234747, Train_acc 0.997017, Test_acc 0.9824
Epoch 7. Loss: 0.0262876220982, Train_acc 0.997367, Test_acc 0.9808
Epoch 8. Loss: 0.0210992084311, Train_acc 0.997333, Test_acc 0.9793
Epoch 9. Loss: 0.0173964529426, Train_acc 0.99875, Test_acc 0.9831


## Conclusion

Compared with the pure mlp and dropout results, we achieve over 97% accuracy on this task **even just after the first epoch**, with just two hidden layers containing 256 and 128 hidden nodes.

For whinges or inquiries, [open an issue on  GitHub.](https://github.com/zackchase/mxnet-the-straight-dope)