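I have long wanted to understand the details of batch normalization properly, especially the derivation of its backward pass, and this time I finally had the time. I read several articles (listed below) and found that their backward formulas follow the strict chain-rule derivation in the paper, which is very good. Still, I wanted to work through the derivation once in my own way.  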


[Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/pdf/1502.03167)  
[Why does Batch Normalization work so well in deep learning?](https://www.zhihu.com/question/38102762)  
[Derivation of gradient backpropagation for Batch Normalization](https://blog.csdn.net/yuechuen/article/details/71502503)  
[What does the gradient flowing through batch normalization looks like ?](http://cthorey.github.io/backpropagation/)  
[Understanding the backward pass through Batch Normalization Layer](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html)


![batch_normalization_fp](../../image/batch_normalization_fp.jpg) | ![batch_normalization_bp](../../image/batch_normalization_bp.jpg)
:---:|:---:
Forward propagation | Backward propagation
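
As a compact reference, the standard batch normalization equations for one feature over a mini-batch of size $m$ can be written as below. The notation is my own assumption ($g_i = \partial L / \partial y_i$ is the upstream gradient, $\hat{x}_i$ the normalized input) and may differ from the figures above.

$$
\mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

$$
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} g_i \hat{x}_i, \qquad
\frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} g_i
$$

$$
\frac{\partial L}{\partial x_i} = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}}
\left( g_i - \frac{1}{m}\sum_{j=1}^{m} g_j - \hat{x}_i \cdot \frac{1}{m}\sum_{j=1}^{m} g_j \hat{x}_j \right)
$$

The last expression follows from summing the three paths by which $x_i$ influences the loss: directly through $\hat{x}_i$, through $\mu$, and through $\sigma^2$.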



### Sample code

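```python
import numpy as np


def batchnorm_backward(dout, cache):
    # cache holds the values saved during the forward pass
    X, X_norm, mu, var, gamma, beta = cache

    N, D = X.shape

    X_mu = X - mu
    std_inv = 1. / np.sqrt(var + 1e-8)

    # Gradient w.r.t. the normalized input; gamma enters here
    dX_norm = dout * gamma
    # Gradients w.r.t. the batch variance and the batch mean
    dvar = np.sum(dX_norm * X_mu, axis=0) * -.5 * std_inv**3
    dmu = np.sum(dX_norm * -std_inv, axis=0) + dvar * np.mean(-2. * X_mu, axis=0)

    # Sum the three paths by which X affects the output: direct, via var, via mu
    dX = (dX_norm * std_inv) + (dvar * 2 * X_mu / N) + (dmu / N)
    # Gradients w.r.t. the learnable scale and shift
    dgamma = np.sum(dout * X_norm, axis=0)
    dbeta = np.sum(dout, axis=0)

    return dX, dgamma, dbeta
```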

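At first glance the code above seems to be missing gamma in the computation of dX, but gamma already enters through `dX_norm = dout * gamma`, so the result is correct. Below is the code I wrote.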

```python
import numpy as np


# `Layer` is the base layer class, assumed to be defined elsewhere (not shown here).
class BatchNormalization(Layer):
    def __init__(self, epsilon=1e-8):
        self.epsilon = epsilon
        self.gamma = None
        self.beta = None
        self.t = 0

        self.initialized = False

    def initial_params(self, var, mu):
        # Initialize from the first batch so that gamma * x_hat + beta
        # reproduces the input (scale by the standard deviation, shift by the mean).
        self.gamma = np.sqrt(var + self.epsilon)
        self.beta = mu
        self.initialized = True

    def compute(self, input):
        # Per-feature batch statistics and the normalized input
        mu = np.mean(input, axis=0)
        var = np.mean(np.power(input - mu, 2), axis=0)
        x = (input - mu) / np.sqrt(var + self.epsilon)
        return mu, var, x

    def forward(self, input):
        mu, var, x = self.compute(input)

        if not self.initialized:
            # First call: set gamma and beta from this batch and pass the input through
            self.initial_params(var, mu)
            return input
        else:
            return self.gamma * x + self.beta

    def backward(self, input, grad_output):
        mu, var, x = self.compute(input)

        if not self.initialized:
            self.initial_params(var, mu)

        grad_gamma = np.sum(grad_output * x, axis=0)
        grad_beta = np.sum(grad_output, axis=0)

        # Closed-form gradient w.r.t. the input
        grad_input = self.gamma / np.sqrt(var + self.epsilon) * (grad_output - np.mean(grad_output, axis=0)) - \
            self.gamma * (input - mu) * np.mean(grad_output * (input - mu), axis=0) / np.power(var + self.epsilon, 1.5)

        # Plain gradient step with an implicit learning rate of 1
        self.gamma = self.gamma - grad_gamma
        self.beta = self.beta - grad_beta

        return grad_input
```
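
One way to check a hand-derived gradient like the one above is a numerical gradient check. The sketch below is a minimal standalone version. The function names, the random test data and the scalar loss $L = \sum y \cdot g$ are all my own assumptions for illustration. It compares the closed-form expression used in `backward`, the step-by-step chain-rule version, and a central-difference estimate.

```python
import numpy as np


def bn_forward(x, gamma, beta, eps=1e-8):
    """Training-mode batch norm over axis 0 (the batch axis)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)  # biased variance: mean of squared deviations
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta


def bn_backward_closed_form(x, gamma, grad_out, eps=1e-8):
    """Input gradient using the single closed-form expression."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_mu = x - mu
    term1 = gamma / np.sqrt(var + eps) * (grad_out - grad_out.mean(axis=0))
    term2 = gamma * x_mu * np.mean(grad_out * x_mu, axis=0) / np.power(var + eps, 1.5)
    return term1 - term2


def bn_backward_chain_rule(x, gamma, grad_out, eps=1e-8):
    """Input gradient built step by step through var and mu."""
    n = x.shape[0]
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_mu = x - mu
    std_inv = 1.0 / np.sqrt(var + eps)
    dx_hat = grad_out * gamma
    dvar = np.sum(dx_hat * x_mu, axis=0) * -0.5 * std_inv**3
    dmu = np.sum(dx_hat * -std_inv, axis=0) + dvar * np.mean(-2.0 * x_mu, axis=0)
    return dx_hat * std_inv + dvar * 2 * x_mu / n + dmu / n


def numerical_grad_x(x, gamma, beta, grad_out, h=1e-5):
    """Central-difference gradient of L = sum(bn_forward(x) * grad_out) w.r.t. x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += h
        x_minus[idx] -= h
        loss_plus = np.sum(bn_forward(x_plus, gamma, beta) * grad_out)
        loss_minus = np.sum(bn_forward(x_minus, gamma, beta) * grad_out)
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad


# Hypothetical test data: batch of 6 samples with 3 features
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
gamma = rng.normal(size=3)
beta = rng.normal(size=3)
grad_out = rng.normal(size=(6, 3))

numeric = numerical_grad_x(x, gamma, beta, grad_out)
diff_closed = np.max(np.abs(bn_backward_closed_form(x, gamma, grad_out) - numeric))
diff_chain = np.max(np.abs(bn_backward_chain_rule(x, gamma, grad_out) - numeric))
print("closed form vs numeric:", diff_closed)
print("chain rule  vs numeric:", diff_chain)
```

If both differences are tiny, the two analytic versions agree with each other and with the numerical estimate, which is what one would expect since they are algebraically the same gradient.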

### When to use batch normalization

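When a network converges very slowly during training, or cannot be trained at all because of problems such as exploding gradients, BN is worth trying. BN can also be added in ordinary settings to speed up training and improve model accuracy.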
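
As a rough illustration of the scale problem, the sketch below pushes random data through a stack of linear + ReLU layers with deliberately badly scaled weights, once without and once with normalization. It only looks at forward-pass activation scale, not at actual training. The layer sizes, the weight scale and the helper names are assumptions chosen for the demo, and BN is used here with gamma = 1 and beta = 0.

```python
import numpy as np


def batchnorm(x, eps=1e-8):
    """Normalize each feature over the batch axis (gamma=1, beta=0)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mu) / np.sqrt(var + eps)


def final_preactivation_std(x, weights, use_bn):
    """Run x through linear -> (optional BN) -> ReLU layers.

    Returns the standard deviation of the last layer's pre-activation.
    """
    h = x
    for w in weights:
        z = h @ w
        if use_bn:
            z = batchnorm(z)
        h = np.maximum(z, 0.0)
    return z.std()


rng = np.random.default_rng(0)
batch, width, depth = 128, 64, 10
x = rng.normal(size=(batch, width))
# Unit-variance weights are far too large for width 64, so activations grow layer by layer
weights = [rng.normal(size=(width, width)) for _ in range(depth)]

std_plain = final_preactivation_std(x, weights, use_bn=False)
std_bn = final_preactivation_std(x, weights, use_bn=True)
print(f"without BN: {std_plain:.3e}, with BN: {std_bn:.3f}")
```

Without normalization the pre-activations grow by a large factor at every layer, while with BN each layer's pre-activation is rescaled to roughly unit standard deviation regardless of the weight scale.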