
# Batch Normalization

## Imports and settings

In [54]:
%matplotlib inline
import matplotlib.pyplot as plot
from IPython import display

import sys
import numpy as np
import numpy.random as nr

import keras
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.initializers import Constant
from keras.optimizers import SGD

from keras import backend as K

print('Keras ', keras.__version__)

np.set_printoptions(precision=2, suppress=True)
nr.seed(23456)


Keras  2.0.4


## The Algorithm

For a description of the algorithm, see the original paper:
[Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167.pdf) 

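### Equations

For a mini-batch of $m$ samples $\boldsymbol{x}_i$, with learned scale $\gamma$ and shift $\beta$, momentum $\lambda$, and moving averages $\boldsymbol{m}$ (mean) and $\boldsymbol{v}$ (variance):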
\begin{align*}
\boldsymbol{\mu} &= \frac{1}{m} \sum_{i = 1}^{m} \boldsymbol{x}_i &&
\boldsymbol{\sigma}^{2} = \frac{1}{m} \sum_{i = 1}^{m} (\boldsymbol{x}_i - \boldsymbol{\mu})^{2}
\\[3mm]
\hat{\boldsymbol{x}}_i &= \frac{\boldsymbol{x}_i - \boldsymbol{\mu}}{\sqrt{\boldsymbol{\sigma}^{2} + \epsilon}} &&
\boldsymbol{y}_i = \gamma \hat{\boldsymbol{x}}_i + \beta
\\[3mm]
\boldsymbol{m}_{(t)} &= (1 - \lambda) \boldsymbol{\mu} + \lambda \boldsymbol{m}_{(t-1)} &&
\boldsymbol{v}_{(t)} = (1 - \lambda) \boldsymbol{\sigma}^{2} + \lambda \boldsymbol{v}_{(t-1)}
\end{align*}
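
The equations above can be sketched in NumPy as follows. This is a minimal illustration, not the Keras implementation: the function name `batch_norm_train` and its argument names are assumptions introduced here, with `lam` standing for the momentum $\lambda$ and the input data generated at random.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, m_prev, v_prev, lam=0.9, eps=1e-4):
    """Training-mode batch normalization for a 2-D batch (samples, features).

    Hypothetical helper that mirrors the equations above.
    """
    mu = x.mean(axis=0)                      # batch mean, one value per feature
    var = x.var(axis=0)                      # biased batch variance (divides by m)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    y = gamma * x_hat + beta                 # scale and shift
    m_new = (1 - lam) * mu + lam * m_prev    # moving mean update
    v_new = (1 - lam) * var + lam * v_prev   # moving variance update
    return y, m_new, v_new

rng = np.random.RandomState(0)
x = rng.rand(10, 5) * 20                     # made-up mini-batch
y, m_new, v_new = batch_norm_train(
    x, gamma=np.ones(5), beta=np.zeros(5),
    m_prev=np.zeros(5), v_prev=np.ones(5), lam=0.5)

print(y.mean(0), y.std(0))                   # close to 0 and 1 per feature
```

With `gamma = 1` and `beta = 0`, each output column has approximately zero mean and unit standard deviation; the moving statistics only move part of the way toward the batch statistics, as set by `lam`.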

### Computation graph
Figure taken from this page: [Understanding the backward pass through Batch Normalization Layer](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html)

<table align='left'>
<tr><td> <img src="https://drive.google.com/uc?id=0By1KMDFVxsI2ZGN6eWhCeTJSMDg"> </td></tr>
</table>

## Keras: parameters

### 2-D input

Typically, when the layer input has two dimensions (samples and features), normalization is performed along the feature dimension, *axis=1*. That is, the statistics (mean and variance) are computed for each column of the data matrix, in each *mini-batch*.
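
As a quick illustration of what "per column" means, the NumPy sketch below computes the statistics of a 10x5 mini-batch. The array `batch` is made up for this example. Each of the 5 features gets its own mean and variance, so a layer with 5 features holds four vectors of length 5 (gamma, beta, moving mean, moving variance), 20 parameters in total.

```python
import numpy as np

# 10 samples, 5 features (made-up data)
batch = np.arange(50, dtype=float).reshape(10, 5)

mean = batch.mean(axis=0)   # reduce over the sample axis: one mean per column
var = batch.var(axis=0)     # one variance per column

print(mean.shape, var.shape)    # (5,) (5,)

# gamma, beta, moving mean, moving variance: four vectors of length 5
n_params = 4 * batch.shape[1]
print(n_params)                 # 20
```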

In [2]:
model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=(5,), momentum=0.9, epsilon=0.0001, 
                             gamma_initializer=Constant(10), 
                             beta_initializer=Constant(11), 
                             moving_mean_initializer=Constant(12), 
                             moving_variance_initializer=Constant(13)))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_1 (Batch (None, 5)                 20        
Total params: 20
Trainable params: 10
Non-trainable params: 10
_________________________________________________________________


In [3]:
for layer in model.layers:
    print('\nConfiguration:')
    print('--------------')
    for k, v in layer.get_config().items():
        print('  {:30s}: {}'.format(k, v))
    print('\nParameters:')
    print('-----------')
    for p in layer.weights:
        if p in layer.trainable_weights:
            print('  Trainable:', p.name)
        else:
            print('            ', p.name)
print('\nmodel.get_weights():')
print('--------------------')
for w in model.get_weights():
    print('  ', w, w.shape)


Configuration:
--------------
  name                          : batch_normalization_1
  trainable                     : True
  batch_input_shape             : (None, 5)
  dtype                         : float32
  axis                          : 1
  momentum                      : 0.9
  epsilon                       : 0.0001
  center                        : True
  scale                         : True
  beta_initializer              : {'class_name': 'Constant', 'config': {'value': 11}}
  gamma_initializer             : {'class_name': 'Constant', 'config': {'value': 10}}
  moving_mean_initializer       : {'class_name': 'Constant', 'config': {'value': 12}}
  moving_variance_initializer   : {'class_name': 'Constant', 'config': {'value': 13}}
  beta_regularizer              : None
  gamma_regularizer             : None
  beta_constraint               : None
  gamma_constraint              : None

Parameters:
-----------
  Trainable: batch_normalization_1/gamma:0
  Trainable: batch_normaliz

### 4-D input

Typically, when the layer input has four dimensions (samples, filters, height, and width), normalization is performed along the second dimension, *axis=1*, which corresponds to the filters (channels). That is, the statistics (mean and variance) are computed for each *feature map* of the tensor, pooling over the samples, height, and width. This preserves translation invariance, an important property of convolutions.
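
The per-channel reduction can be sketched in NumPy as below. This is an illustration on made-up data, not the Keras code path: the statistics are taken over every axis except the channel axis, giving one mean and one variance per channel. With 3 channels, that means four vectors of length 3, or 12 parameters.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(8, 3, 10, 10)           # (samples, channels, height, width), made-up data

# Reduce over every axis except the channel axis (axis=1).
mean = x.mean(axis=(0, 2, 3))        # shape (3,)
var = x.var(axis=(0, 2, 3))          # shape (3,)

# Reshape to (1, 3, 1, 1) so the statistics broadcast over the other axes.
eps = 1e-3
x_hat = (x - mean.reshape(1, -1, 1, 1)) / np.sqrt(var.reshape(1, -1, 1, 1) + eps)

print(mean.shape, var.shape)         # (3,) (3,)
print(x_hat.shape)                   # (8, 3, 10, 10)
```

Because every spatial position of a feature map shares the same mean and variance, the same normalization is applied wherever a pattern appears in the image.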

In [4]:
model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=(3, 10, 10), 
                             gamma_initializer=Constant(0), 
                             beta_initializer=Constant(1), 
                             moving_mean_initializer=Constant(2), 
                             moving_variance_initializer=Constant(3)))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
batch_normalization_2 (Batch (None, 3, 10, 10)         12        
Total params: 12
Trainable params: 6
Non-trainable params: 6
_________________________________________________________________


In [5]:
for layer in model.layers:
    print('\nConfiguration:')
    print('--------------')
    for k, v in layer.get_config().items():
        print('  {:30s}: {}'.format(k, v))
    print('\nParameters:')
    print('-----------')
    for p in layer.weights:
        if p in layer.trainable_weights:
            print('  Trainable:', p.name)
        else:
            print('            ', p.name)
print('\nmodel.get_weights():')
print('--------------------')
for w in model.get_weights():
    print('  ', w, w.shape)


Configuration:
--------------
  name                          : batch_normalization_2
  trainable                     : True
  batch_input_shape             : (None, 3, 10, 10)
  dtype                         : float32
  axis                          : 1
  momentum                      : 0.99
  epsilon                       : 0.001
  center                        : True
  scale                         : True
  beta_initializer              : {'class_name': 'Constant', 'config': {'value': 1}}
  gamma_initializer             : {'class_name': 'Constant', 'config': {'value': 0}}
  moving_mean_initializer       : {'class_name': 'Constant', 'config': {'value': 2}}
  moving_variance_initializer   : {'class_name': 'Constant', 'config': {'value': 3}}
  beta_regularizer              : None
  gamma_regularizer             : None
  beta_constraint               : None
  gamma_constraint              : None

Parameters:
-----------
  Trainable: batch_normalization_2/gamma:0
  Trainable: batch_norm

## Keras: Execution in the training phase

In [6]:
K.set_learning_phase(1)

model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=(5,), momentum=0.5, epsilon=0.0001))

model.set_weights([[ 5., 1., 1., 1., 1.],     # gamma
                   [ 10., 0., 0., 0., 0.],     # beta
                   [ 0., 0., 0., 0., 0.],     # moving_mean
                   [ 1., 1., 1., 1., 1.]])    # moving_variance

x = nr.random(size=(10, 5)) * 20
print(x, x.mean(0), x.std(0))
print()

y = model.predict(x, batch_size=10)
print(y, y.mean(0), y.std(0))


[[  6.44   6.55  18.55   6.23   3.24]
 [  7.28  10.58  15.78  17.51  12.73]
 [ 19.79  16.32   0.83   4.06   3.05]
 [ 15.06   9.1   19.09  18.75   2.71]
 [  4.72   6.3   13.45  12.41  12.44]
 [ 16.76  12.62  11.22  19.74  13.66]
 [  2.03  18.8   13.03   4.55   8.64]
 [  7.8   15.12   2.85   0.2   16.76]
 [  2.18   4.98   9.52  17.68   3.55]
 [ 15.72   0.7   12.17   6.17  15.4 ]] [  9.78  10.11  11.65  10.73   9.22] [ 6.14  5.37  5.7   6.91  5.35]

[[  7.28  -0.66   1.21  -0.65  -1.12]
 [  7.97   0.09   0.73   0.98   0.66]
 [ 18.16   1.16  -1.9   -0.97  -1.15]
 [ 14.3   -0.19   1.31   1.16  -1.22]
 [  5.88  -0.71   0.32   0.24   0.6 ]
 [ 15.68   0.47  -0.08   1.3    0.83]
 [  3.69   1.62   0.24  -0.89  -0.11]
 [  8.39   0.93  -1.54  -1.52   1.41]
 [  3.81  -0.95  -0.37   1.01  -1.06]
 [ 14.84  -1.75   0.09  -0.66   1.16]] [ 10.   0.   0.  -0.  -0.] [ 5.  1.  1.  1.  1.]


In [7]:
gamma, beta, mv_mean, mv_var = model.get_weights()

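# Normalize with the batch statistics (epsilon is omitted; its effect is negligible here)
x2 = x - x.mean(0)
x2 /= x2.std(0)

# Scale by gamma and shift by beta
x3 = x2 * gamma + beta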

print(x3, x3.mean(0), x3.std(0))

[[  7.28  -0.66   1.21  -0.65  -1.12]
 [  7.97   0.09   0.73   0.98   0.66]
 [ 18.16   1.16  -1.9   -0.97  -1.15]
 [ 14.3   -0.19   1.31   1.16  -1.22]
 [  5.88  -0.71   0.32   0.24   0.6 ]
 [ 15.68   0.47  -0.08   1.3    0.83]
 [  3.69   1.62   0.24  -0.89  -0.11]
 [  8.39   0.93  -1.54  -1.52   1.41]
 [  3.81  -0.95  -0.37   1.01  -1.06]
 [ 14.84  -1.75   0.09  -0.66   1.16]] [ 10.  -0.   0.  -0.   0.] [ 5.  1.  1.  1.  1.]


## Keras: Execution in the test phase

In [14]:
K.set_learning_phase(0)

model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=(5,), momentum=0.999, epsilon=0.0001))

model.set_weights([[ 1., 1., 1., 1., 1.],     # gamma
                   [ 0., 0., 0., 0., 0.],     # beta
                   [ 0., 0., 0., 0., 0.],     # moving_mean
                   [ 1., 1., 1., 1., 1.]])    # moving_variance

x = nr.random(size=(10, 5)) * 20
print(x, x.mean(0), x.std(0))
print()

y = model.predict(x, batch_size=10)
print(y, y.mean(0), y.std(0))


[[  6.44   6.55  18.55   6.23   3.24]
 [  7.28  10.58  15.78  17.51  12.73]
 [ 19.79  16.32   0.83   4.06   3.05]
 [ 15.06   9.1   19.09  18.75   2.71]
 [  4.72   6.3   13.45  12.41  12.44]
 [ 16.76  12.62  11.22  19.74  13.66]
 [  2.03  18.8   13.03   4.55   8.64]
 [  7.8   15.12   2.85   0.2   16.76]
 [  2.18   4.98   9.52  17.68   3.55]
 [ 15.72   0.7   12.17   6.17  15.4 ]] [  9.78  10.11  11.65  10.73   9.22] [ 6.14  5.37  5.7   6.91  5.35]

[[  6.44   6.55  18.55   6.23   3.24]
 [  7.28  10.58  15.78  17.51  12.73]
 [ 19.79  16.32   0.83   4.06   3.05]
 [ 15.06   9.1   19.09  18.75   2.71]
 [  4.72   6.3   13.45  12.41  12.44]
 [ 16.76  12.62  11.22  19.74  13.66]
 [  2.03  18.8   13.03   4.55   8.63]
 [  7.8   15.11   2.85   0.2   16.76]
 [  2.18   4.98   9.52  17.68   3.55]
 [ 15.72   0.7   12.17   6.17  15.4 ]] [  9.78  10.11  11.65  10.73   9.22] [ 6.14  5.37  5.7   6.91  5.35]


In [15]:
gamma, beta, mv_mean, mv_var = model.get_weights()

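# Normalize with the stored moving statistics (epsilon is omitted in this manual check)
x2 = x - mv_mean
x2 /= np.sqrt(mv_var)

# Scale by gamma and shift by beta
x3 = x2 * gamma + beta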

print(x3, x3.mean(0), x3.std(0))

[[  6.44   6.55  18.55   6.23   3.24]
 [  7.28  10.58  15.78  17.51  12.73]
 [ 19.79  16.32   0.83   4.06   3.05]
 [ 15.06   9.1   19.09  18.75   2.71]
 [  4.72   6.3   13.45  12.41  12.44]
 [ 16.76  12.62  11.22  19.74  13.66]
 [  2.03  18.8   13.03   4.55   8.64]
 [  7.8   15.12   2.85   0.2   16.76]
 [  2.18   4.98   9.52  17.68   3.55]
 [ 15.72   0.7   12.17   6.17  15.4 ]] [  9.78  10.11  11.65  10.73   9.22] [ 6.14  5.37  5.7   6.91  5.35]
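
For reference, the test-phase computation including epsilon can be sketched as below. This is a NumPy illustration on made-up inputs, and the helper `batch_norm_inference` is hypothetical, not a Keras function. With a moving mean of 0, a moving variance of 1, and `epsilon=0.0001`, every input is divided by `sqrt(1.0001)`, so the output is very slightly smaller than the input. This likely accounts for the last-digit differences between the Keras output and the manual computation above.

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-4):
    """Test-mode batch normalization: uses the stored moving statistics."""
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

x = np.array([[8.0, 16.0], [2.0, 4.0]])    # made-up inputs
y = batch_norm_inference(x, gamma=np.ones(2), beta=np.zeros(2),
                         moving_mean=np.zeros(2), moving_var=np.ones(2))

print(np.abs(y - x).max())   # small but non-zero: the effect of epsilon
```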


## Keras: Verifying the moving-average computation

In [59]:
nr.seed(234588)
K.set_learning_phase(1)
mom = 0.2

model = Sequential()
model.add(BatchNormalization(axis=1, input_shape=(5,), momentum=mom, epsilon=0.0001))
model.add(Dense(1, activation='sigmoid'))

print([x.shape for x in model.get_weights()], end='\n\n')
model.set_weights([[ 1., 1., 1., 1., 1.],     # gamma
                   [ 0., 0., 0., 0., 0.],     # beta
                   [ 0., 0., 0., 0., 0.],     # moving_mean
                   [ 1., 1., 1., 1., 1.]])    # moving_variance

gamma, beta, mv_mean, mv_var, w, b = model.get_weights()

x = nr.random(size=(10, 5)) * 20
y = nr.random(size=(10, 1))
print(x, x.mean(0), x.std(0))
print()

# Expected moving averages after one training step on this batch
mv_mean_new = (1 - mom) * x.mean(0) + mom * mv_mean
mv_var_new  = (1 - mom) * x.var(0) + mom * mv_var

sgd = SGD(0.5)
model.compile(sgd, loss='mse')
h = model.fit(x, y, batch_size=10, epochs=1, verbose=0)

new_weights = model.get_weights()

print('gamma: ', new_weights[0])
print('beta:  ', new_weights[1])
print()
print('mv_mean:', new_weights[2])
print('        ', mv_mean_new)
print('mv_var: ', new_weights[3])
print('        ', mv_var_new)


[(5,), (5,), (5,), (5,), (5, 1), (1,)]

[[  1.74   4.4    7.11  15.93   7.74]
 [ 15.79  16.52   3.59   8.45   0.57]
 [  6.19   1.2   19.06   9.33   6.74]
 [ 14.86   8.33  19.11  12.7   10.43]
 [  6.52  11.71   0.5   11.32  17.42]
 [  2.81   6.7   15.46  13.86  18.72]
 [ 19.9    5.29  15.4    6.41   5.52]
 [ 15.81   4.68   0.07   0.84  11.68]
 [ 13.75  17.97  17.19   6.23  12.38]
 [  1.05  18.93   1.34   6.03   2.17]] [ 9.84  9.57  9.88  9.11  9.34] [ 6.55  6.    7.67  4.25  5.67]

gamma:  [ 1.    0.98  0.98  1.    1.  ]
beta:   [-0.03 -0.02  0.03  0.   -0.02]

mv_mean: [ 7.87  7.66  7.91  7.29  7.47]
         [ 7.87  7.66  7.91  7.29  7.47]
mv_var:  [ 34.54  28.98  47.32  14.66  25.91]
         [ 34.54  28.98  47.32  14.66  25.91]
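
To see how the momentum controls the speed of the moving averages, the update can be iterated by hand. The sketch below is a plain-Python illustration under the assumption that every mini-batch has the same mean `mu` (a made-up value); the helper name `moving_mean_after` is hypothetical. Under that assumption the recursion has the closed form $m_{(t)} = \mu + \lambda^{t} (m_{(0)} - \mu)$, so a small momentum such as 0.2 converges within a few steps, while a momentum of 0.99 moves slowly.

```python
def moving_mean_after(steps, mu, m0, lam):
    """Iterate m_t = (1 - lam) * mu + lam * m_{t-1} for a constant batch mean mu."""
    m = m0
    for _ in range(steps):
        m = (1 - lam) * mu + lam * m
    return m

mu, m0 = 10.0, 0.0   # made-up batch mean and initial moving mean
for lam in (0.2, 0.99):
    iterated = moving_mean_after(5, mu, m0, lam)
    closed_form = mu + lam ** 5 * (m0 - mu)
    print(lam, iterated, closed_form)   # the two values agree up to rounding
```

After a single step with momentum 0.2 and an initial moving mean of 0, the moving mean is 0.8 times the batch mean, which matches the `mv_mean` values printed above.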
