# Homework 5: Batch Normalization

Welcome to the course **AI and Deep Learning**!

Batch Normalization (BN) is a widely used technique in deep learning for improving the training of neural networks. It aims to address the problem of **covariate shift**. During the training of deep neural networks, the distribution of each layer's inputs changes as the parameters of the previous layers change. This can slow down training and make it difficult to choose appropriate learning rates and initializations. Batch normalization stabilizes training by **normalizing values before activation** for each neuron, and it also introduces **two more parameters to allow for heterogeneity**, making the network more robust and easier to train.

Learning Goal: In this homework, we first review how the forward propagation of batch normalization works, using one sample and a mini-batch respectively. We then summarize the purpose of BN. Finally, we illustrate the backward propagation of batch normalization. You can add the BN step to your own network; give it a try!

# Table of contents

<style>
ol li {
  list-style-type: decimal-leading-zero;
  padding-left: 20px;
}
</style>

<ol>
  <li>Forward Propagation
    <ol style="list-style-type:lower-alpha;">
      <li>traditional forward propagation without BN scenario</li>
      <li>how BN works (one sample)
        <ol class="roman">
          <li>calculate mini-batch mean</li>
          <li>calculate mini-batch variance</li>
          <li>normalize the input (sample i)</li>
          <li>scale and shift (sample i)</li>
        </ol>
      </li>
      <li>how BN works (mini-batch)</li>
      <li>manually implementing BN (mini-batch)</li>
    </ol>
  </li>
  <li>the purpose of BN
    <ol style="list-style-type:lower-alpha;">
      <li>problem</li>
      <li>solution</li>
    </ol>
  </li>
  <li>Backward Propagation
    <ol style="list-style-type:lower-alpha;">
      <li>BP illustration</li>
      <li>manually implementing Backward Propagation</li>
      <li>a test to verify the BP code</li>
    </ol>
  </li>
  <li>a concise implementation of BN</li>
</ol>

# 1-Forward Propagation

<span style="font-size: 20px;"> **1.1  traditional forward propagation without BN scenario**
</span>

You should already be familiar with the basics of forward propagation and its overall process, so the following will look familiar:

$$
\begin{cases}
Z^{[L]} = A^{[L-1]} W^{[L]} + b^{[L]}, \\
A^{[L]} = \delta^{[L]}(Z^{[L]}).
\end{cases}
$$


The forward propagation of the L-th layer can be divided into two steps: a **linear transformation** and an **activation function transformation**, where $\delta^{[L]}$ denotes the activation function.

Here $Z^{[L]} \in \mathbb{R}^{m \times d}$, where $m$ is the size of the mini-batch and $d$ is the feature dimension.
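
The two steps can be sketched in NumPy as follows. This is only an illustration: the layer sizes, the random inputs, and the choice of ReLU as the activation are assumptions that the text above does not fix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch size m, previous-layer width d_prev, current width d
m, d_prev, d = 3, 5, 4
A_prev = rng.standard_normal((m, d_prev))   # A^{[L-1]}, shape (m, d_prev)
W = rng.standard_normal((d_prev, d))        # W^{[L]},   shape (d_prev, d)
b = np.zeros(d)                             # b^{[L]},   shape (d,)

# Step 1: linear transformation
Z = A_prev @ W + b                          # shape (m, d)

# Step 2: activation function (ReLU is chosen only for illustration)
A = np.maximum(Z, 0.0)                      # shape (m, d)

print(Z.shape, A.shape)
```

Each row of `Z` is the pre-activation vector of one sample, which is the quantity BN will normalize in the next section.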

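<span style="font-size: 20px;"> **1.2  how BN works (one sample)**
</span>

Batch Normalization is applied after the linear transformation and before the activation function. Below, $\mathbf{z}_i \in \mathbb{R}^d$ denotes the $i$-th row of $Z^{[L]}$, i.e. the pre-activation vector of sample $i$.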

**1.2.1 calculate mini-batch mean**



$$
\mu_B = \frac{1}{m} \sum_{i=1}^{m} \mathbf{z}_i
$$



- $\mu_B \in \mathbb{R}^d$: Mean vector for each feature dimension.


**1.2.2 calculate mini-batch variance**

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (\mathbf{z}_i - \mu_B)^2$$


- $\sigma_B^2 \in \mathbb{R}^d$: Variance vector for each feature dimension (the square is taken element-wise).

**1.2.3 normalize the input  (sample i)**

$$
\hat{\mathbf{z}}_i = \frac{\mathbf{z}_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$

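where:
- $\hat{\mathbf{z}}_i$: Normalized feature vector $\in \mathbb{R}^d$ (sample $i$)
- $\mathbf{z}_i$: Original input vector (sample $i$)
- $\mu_B$: mini-batch mean vector
- $\sigma_B^2$: mini-batch variance vector
- $\epsilon$: Smoothing term (typically $10^{-5}$) that avoids division by zero

The division and the square root are applied element-wise.

**Note that subtracting $\mu_B$ in the numerator cancels the bias $b^{[L]}$: the bias adds the same constant to every $\mathbf{z}_i$ and therefore also to $\mu_B$, so it has no effect on $\hat{\mathbf{z}}_i$.**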
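
The observation that the bias has no effect after normalization can be checked numerically. The sketch below uses made-up data and a hypothetical helper `normalize`; it applies steps 1.2.1 to 1.2.3 with and without a bias and compares the results.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3                          # hypothetical batch size and feature dimension
Z = rng.standard_normal((m, d))      # rows are z_i, computed without a bias
b = np.array([10.0, -5.0, 0.5])      # an arbitrary bias added to every sample
eps = 1e-5

def normalize(Z, eps):
    """Hypothetical helper: steps 1.2.1 to 1.2.3 applied to every row of Z."""
    mu = Z.mean(axis=0)                       # mini-batch mean, shape (d,)
    var = ((Z - mu) ** 2).mean(axis=0)        # mini-batch variance, shape (d,)
    return (Z - mu) / np.sqrt(var + eps)      # normalized rows, shape (m, d)

Z_hat_no_bias = normalize(Z, eps)
Z_hat_with_bias = normalize(Z + b, eps)

# The bias shifts mu_B by the same amount, so it cancels in z_i - mu_B
print(np.allclose(Z_hat_no_bias, Z_hat_with_bias))  # prints True
```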

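**1.2.4 scale and shift (sample i)**

BN first standardizes each feature (neuron) to zero mean and unit variance. While this helps stabilize training, it also forces every feature to have the same mean and variance, which may reduce the heterogeneity among neurons. To restore this flexibility, BN applies a learnable scale $\gamma$ and shift $\beta$ to each feature: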

$$
\mathbf{y}_i = \gamma \odot \hat{\mathbf{z}}_i + \beta
$$

- $\gamma \in \mathbb{R}^d$: Scaling parameter (initialized as 1)
- $\beta \in \mathbb{R}^d$: Shifting parameter (initialized as 0)  
- $\odot$: Element-wise multiplication
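
To see what $\gamma$ and $\beta$ do, the following sketch (with made-up data and arbitrarily chosen parameter values) normalizes a mini-batch and then applies the scale and shift. After this step, each feature has a mean close to its $\beta$ and a standard deviation close to its $\gamma$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 2                                 # hypothetical sizes
Z = rng.standard_normal((m, d)) * 3.0 + 7.0    # raw features with mean ~7, std ~3
eps = 1e-5

# Normalize each feature with the mini-batch statistics
mu = Z.mean(axis=0)
var = Z.var(axis=0)
Z_hat = (Z - mu) / np.sqrt(var + eps)

# Arbitrary example values; in training these are learned
gamma = np.array([2.0, 0.5])
beta = np.array([1.0, -3.0])
Y = gamma * Z_hat + beta

print(np.round(Y.mean(axis=0), 4))   # close to beta
print(np.round(Y.std(axis=0), 4))    # close to gamma
```

With the initialization $\gamma = 1$, $\beta = 0$, the layer starts out as plain standardization, and training is free to move each feature away from it.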

<span style="font-size: 20px;"> **1.3  how BN works (mini-batch)**
</span>

After seeing how to normalize a single sample, we consider **a mini-batch of samples** and represent the entire process using **matrix** notation.

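- $\mathbf{Z}^{[L]} = \mathbf{A}^{[L-1]} \mathbf{W}^{[L]}$ (the bias $b^{[L]}$ is dropped because the normalization cancels it)

- $
\mu_B = \frac{1}{m} (\mathbf{Z}^{[L]})^{\top} \mathbf{1}
$

- $
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{z}_i - \mu_B)^2 = \frac{1}{m} \text{diag} \left( (\mathbf{Z}^{[L]})^\top \mathbf{Z}^{[L]} \right) - \mu_B^2
$

- $
\hat{\mathbf{Z}}^{[L]} = (\mathbf{Z}^{[L]} - \mathbf{1} \mu_B^\top) \oslash \left( \mathbf{1} \left( \sqrt{\sigma_B^2 + \epsilon} \right)^\top \right)
$

- $
\mathbf{Y} = \hat{\mathbf{Z}}^{[L]} \odot (\mathbf{1}\gamma^\top) + \mathbf{1}\beta^\top
$

Here $\mathbf{1} \in \mathbb{R}^m$ is the all-ones column vector, $\oslash$ denotes element-wise division, and squares and square roots of vectors are taken element-wise.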
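
As a sanity check, the matrix expressions can be translated almost literally into NumPy and compared with the column-wise `mean` and `var`. This is a sketch with made-up data; the explicit all-ones vector is kept only to mirror the notation, since NumPy broadcasting would normally replace it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3                                    # hypothetical sizes
Z = rng.standard_normal((m, d))                # Z^{[L]}, shape (m, d)
ones = np.ones((m, 1))                         # the all-ones column vector
eps = 1e-5

mu = (Z.T @ ones) / m                              # mean, shape (d, 1)
var = np.diag(Z.T @ Z).reshape(d, 1) / m - mu**2   # variance, shape (d, 1)

# Normalize: element-wise division by the broadcast standard deviation
Z_hat = (Z - ones @ mu.T) / (ones @ np.sqrt(var + eps).T)

# Scale and shift with example parameter values
gamma = np.full((d, 1), 1.5)
beta = np.full((d, 1), 0.2)
Y = Z_hat * (ones @ gamma.T) + ones @ beta.T       # shape (m, d)

# The matrix formulas agree with the column-wise statistics
print(np.allclose(mu.ravel(), Z.mean(axis=0)))     # prints True
print(np.allclose(var.ravel(), Z.var(axis=0)))     # prints True
```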

<span style="font-size: 20px;"> **1.4  manually implementing BN (mini-batch)**
</span>

In [3]:
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    """
    Batch Normalization Forward Propagation

    Parameters:
    X: mini-batch samples, shape (m, d)
    gamma: scaling parameter, shape (d,)
    beta: shifting parameter, shape (d,)
    eps: smoothing term

    Return:
    Y: output, shape (m, d)
    cache: values saved for backward propagation
    """
    ###1.compute mean vector###
    ###2.compute variance vector###
    ###3.normalization###
    ###4.scale and shift###

    ###YOUR CODE BEGINS HERE###
    # 1. compute mean vector
    mu = np.mean(X, axis=0)          # (d,)

    # 2. compute variance vector (np.var divides by m by default)
    var = np.var(X, axis=0)          # (d,)

    # 3. normalization
    X_hat = (X - mu) / np.sqrt(var + eps)  # (m, d)

    # 4. scale and shift
    Y = gamma * X_hat + beta         # (m, d)

    # cache for backward propagation
    cache = (X, X_hat, mu, var, gamma, eps)
    ###YOUR CODE ENDS HERE###

    return Y, cache

    
    


You can use your own simulated data to verify whether your code is correct!

In [4]:
###YOUR CODE BEGINS HERE###

# simulated data (batch_size=3, feature_dim=4)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])

# initialization
gamma = np.ones(X.shape[1])
beta = np.zeros(X.shape[1])

# forward propagation
Y, cache = batch_norm_forward(X, gamma, beta)

print("Input X:\n", X)
print("Normalized output Y:\n", np.round(Y, 4))

###YOUR CODE ENDS HERE###

Input X:
 [[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]]
Normalized output Y:
 [[-1.2247 -1.2247 -1.2247 -1.2247]
 [ 0.      0.      0.      0.    ]
 [ 1.2247  1.2247  1.2247  1.2247]]


# 2- the purpose of BN

<span style="font-size: 20px;"> **2.1  problem**
</span>

- Deep networks suffer from internal covariate shift, where the distribution of layer inputs changes during training as weights update, forcing later layers to constantly adapt.

- Poor weight initialization can lead to vanishing/exploding gradients.

- Overfitting due to small batch sizes or noisy data.

<span style="font-size: 20px;"> **2.2  solution**
</span>

- BN keeps the input distribution of each layer more consistent during training, allowing higher learning rates and faster convergence.

- By standardizing activations, BN makes the network less sensitive to initial weight scales, enabling more robust training from random starts.

- The noise introduced by mini-batch statistics (mean/variance) adds slight stochasticity, acting like a mild regularizer and reducing the need for techniques like Dropout.
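
The second point, reduced sensitivity to the initial weight scale, can be illustrated with a small experiment. The sketch below uses made-up data and a hypothetical helper `bn` with $\gamma = 1$ and $\beta = 0$. The same weights are scaled by 0.1 and by 10, which changes the pre-activations by a factor of 100 but leaves the normalized values essentially unchanged.

```python
import numpy as np

def bn(Z, eps=1e-5):
    """Hypothetical helper: normalize each column with mini-batch statistics."""
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
m, d_in, d_out = 64, 10, 4                   # hypothetical sizes
A_prev = rng.standard_normal((m, d_in))
W = rng.standard_normal((d_in, d_out))

# The same weights at two very different initial scales
Z_small = A_prev @ (0.1 * W)
Z_large = A_prev @ (10.0 * W)

# The raw pre-activations differ by a factor of 100 ...
print(Z_small.std(), Z_large.std())

# ... but after BN they agree up to the small effect of eps
print(np.abs(bn(Z_small) - bn(Z_large)).max())
```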

# 3- Backward Propagation

<span style="font-size: 20px;"> **3.1  BP illustration**
</span>

$$
\hat{\mathbf{z}}_i = \frac{\mathbf{z}_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$

$$
\mathbf{y}_i = \gamma \odot \hat{\mathbf{z}}_i + \beta
$$

Assume we are given the upstream gradient

$$
\frac{\partial L}{\partial y_i}
$$

Then the gradients of the two learnable parameters are:

$$
\frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \odot \hat{z}_i, \quad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}
$$

Applying the chain rule backward through the scale-and-shift and normalization steps gives:

$$
\frac{\partial L}{\partial \hat{z}_i} = \frac{\partial L}{\partial y_i} \odot \gamma, \quad \frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{z}_i} \odot (z_i - \mu_B) \cdot \left( -\frac{1}{2} \right) (\sigma_B^2 + \epsilon)^{-3/2}
$$

$$
\frac{\partial L}{\partial \mu_B} = \left( \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{z}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{-2 \sum_{i=1}^{m} (z_i - \mu_B)}{m}
$$

The second term of $\frac{\partial L}{\partial \mu_B}$ vanishes because $\sum_{i=1}^{m} (z_i - \mu_B) = 0$; it is kept here to show the full chain rule.

Finally, $z_i$ affects the loss through $\hat{z}_i$, $\sigma_B^2$, and $\mu_B$, so the gradient with respect to the input is:

$$
\frac{\partial L}{\partial z_i} = \left( \frac{\partial L}{\partial \hat{z}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \right) + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(z_i - \mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}
$$
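
Substituting the expressions for $\frac{\partial L}{\partial \sigma_B^2}$ and $\frac{\partial L}{\partial \mu_B}$ into the gradient with respect to $z_i$ yields a single compact expression. This compact form is an algebraic simplification that is not part of the derivation above, so treat it as a sketch. The code below, with made-up data, checks that it agrees with the staged computation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 3                                   # hypothetical sizes
Z = rng.standard_normal((m, d))
gamma = rng.standard_normal(d)
dY = rng.standard_normal((m, d))              # stand-in for the upstream gradient
eps = 1e-5

mu = Z.mean(axis=0)
var = Z.var(axis=0)
inv_std = 1.0 / np.sqrt(var + eps)
Z_hat = (Z - mu) * inv_std

# Staged form: one line per equation of the derivation
dZ_hat = dY * gamma
dvar = np.sum(dZ_hat * (Z - mu), axis=0) * (-0.5) * inv_std**3
dmu = -inv_std * dZ_hat.sum(axis=0) + dvar * (-2.0 / m) * (Z - mu).sum(axis=0)
dZ_staged = dZ_hat * inv_std + dvar * 2.0 * (Z - mu) / m + dmu / m

# Compact form obtained by substituting dvar and dmu into dZ
dZ_compact = (inv_std / m) * (
    m * dZ_hat - dZ_hat.sum(axis=0) - Z_hat * np.sum(dZ_hat * Z_hat, axis=0)
)

print(np.allclose(dZ_staged, dZ_compact))     # prints True
```

The compact form also makes it visible that each column of the input gradient sums to zero over the mini-batch, which reflects the fact that adding a constant to every sample does not change the output.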

<span style="font-size: 20px;"> **3.2  manually implementing Backward Propagation**
</span>

In [5]:
def batch_norm_backward(dY, cache):
    """
    Batch Normalization Backward Propagation

    Parameters:
    dY: known upstream gradient, shape (m, d)
    cache: tuple cached during forward propagation (Z, Z_hat, mu, var, gamma, eps)

    Return:
    dZ:  shape (m, d)
    dgamma: shape (d,)
    dbeta:  shape (d,)
    """
    Z, Z_hat, mu, var, gamma, eps = cache
    m, d = Z.shape
    ###1.compute the gradients for gamma and beta###
    ###2.compute dZ_hat (∂L/∂Z_hat)###
    ###3.compute dvar (∂L/∂var)###
    ###4.compute dmu (∂L/∂mu)###
    ###5.compute dZ###

    ###YOUR CODE BEGINS HERE###

    # 1. compute the gradients for gamma and beta
    dgamma = np.sum(dY * Z_hat, axis=0)  # (d,)
    dbeta = np.sum(dY, axis=0)           # (d,)

    # 2. compute dZ_hat (∂L/∂Z_hat)
    dZ_hat = dY * gamma                  # (m, d)

    # 3. compute dvar (∂L/∂var)
    dvar = np.sum(dZ_hat * (Z - mu) * (-0.5) * (var + eps)**(-1.5), axis=0)  # (d,)

    # 4. compute dmu (∂L/∂mu); the second part is zero in exact arithmetic
    dmu_part1 = np.sum(dZ_hat * (-1) / np.sqrt(var + eps), axis=0)
    dmu_part2 = dvar * np.mean(-2 * (Z - mu), axis=0)
    dmu = dmu_part1 + dmu_part2          # (d,)

    # 5. compute dZ as the sum of the three paths through Z_hat, var, and mu
    dZ_part1 = dZ_hat / np.sqrt(var + eps)
    dZ_part2 = dvar * 2 * (Z - mu) / m
    dZ_part3 = dmu / m
    dZ = dZ_part1 + dZ_part2 + dZ_part3  # (m, d)

    ###YOUR CODE ENDS HERE###

    return dZ, dgamma, dbeta


<span style="font-size: 20px;"> **3.3  a test to verify the BP code**
</span>

Below is a gradient check that compares the analytical gradients from your code with numerical gradients.

In [7]:
###PLEASE DON'T CHANGE THE CODE BELOW###

import numpy as np
def gradient_check_batchnorm():
    np.random.seed(1)
    m, d = 3, 4  
    X = np.random.randn(m, d)
    gamma = np.random.randn(d)
    beta = np.random.randn(d)
    eps = 1e-5
    epsilon = 1e-7  

    # Forward pass to populate cache
    # Here we use the BN forward function you just wrote
    Y, cache = batch_norm_forward(X, gamma, beta, eps)
    dY = np.random.randn(*Y.shape)  

    # Analytical gradients (from your backward pass)
    dZ_analytical, dgamma_analytical, dbeta_analytical = batch_norm_backward(dY, cache)

    # Numerical gradient approximation
    def compute_numerical_gradient(param, func, idx=None):
        param_plus = param.copy()
        param_minus = param.copy()
        if idx is not None:
            param_plus[idx] += epsilon
            param_minus[idx] -= epsilon
        else:
            param_plus += epsilon
            param_minus -= epsilon
        loss_plus = np.sum(dY * func(param_plus))
        loss_minus = np.sum(dY * func(param_minus))
        return (loss_plus - loss_minus) / (2 * epsilon)

    # Check dgamma (∂L/∂γ)
    dgamma_numerical = np.zeros_like(gamma)
    for j in range(d):
        def func_gamma(gamma_perturbed):
            Y_perturbed, _ = batch_norm_forward(X, gamma_perturbed, beta, eps)
            return Y_perturbed
        dgamma_numerical[j] = compute_numerical_gradient(gamma, func_gamma, j)

    # Check dbeta (∂L/∂β)
    dbeta_numerical = np.zeros_like(beta)
    for j in range(d):
        def func_beta(beta_perturbed):
            Y_perturbed, _ = batch_norm_forward(X, gamma, beta_perturbed, eps)
            return Y_perturbed
        dbeta_numerical[j] = compute_numerical_gradient(beta, func_beta, j)

    # Check dX (∂L/∂X)
    dX_numerical = np.zeros_like(X)
    for i in range(m):
        for j in range(d):
            def func_X(X_perturbed):
                Y_perturbed, _ = batch_norm_forward(X_perturbed, gamma, beta, eps)
                return Y_perturbed
            dX_numerical[i, j] = compute_numerical_gradient(X, func_X, (i, j))

    # Relative error calculation
    def relative_error(grad_analytical, grad_numerical):
        numerator = np.abs(grad_analytical - grad_numerical).sum()
        denominator = np.abs(grad_analytical) + np.abs(grad_numerical)
        return numerator / np.maximum(1e-8, denominator.sum())

    error_dX = relative_error(dZ_analytical, dX_numerical)
    error_dgamma = relative_error(dgamma_analytical, dgamma_numerical)
    error_dbeta = relative_error(dbeta_analytical, dbeta_numerical)

    print("Gradient Check Results:")
    print(f"dX error:     {error_dX:.2e} (should be < 1e-7)")
    print(f"dgamma error: {error_dgamma:.2e} (should be < 1e-7)")
    print(f"dbeta error:  {error_dbeta:.2e} (should be < 1e-7)")

    return error_dX, error_dgamma, error_dbeta

# Run the check
gradient_check_batchnorm()

Gradient Check Results:
dX error:     2.39e-09 (should be < 1e-7)
dgamma error: 9.87e-10 (should be < 1e-7)
dbeta error:  1.73e-09 (should be < 1e-7)


(np.float64(2.393039918468964e-09),
 np.float64(9.866759895117114e-10),
 np.float64(1.7341300572275556e-09))
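
The check above relies on central differences. As a minimal illustration of the same idea on a function whose gradient is known exactly, the sketch below uses a made-up function `f` and a hypothetical helper `numerical_grad`, and computes the same kind of relative error.

```python
import numpy as np

def f(x):
    """Toy loss with a known gradient: f(x) = sum(x^3)."""
    return np.sum(x ** 3)

def grad_f(x):
    """Analytical gradient of f."""
    return 3 * x ** 2

def numerical_grad(func, x, h=1e-5):
    """Hypothetical helper: central-difference gradient, one entry at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += h
        x_minus[idx] -= h
        grad[idx] = (func(x_plus) - func(x_minus)) / (2 * h)
    return grad

x = np.array([0.5, -1.0, 2.0])
num = numerical_grad(f, x)
ana = grad_f(x)

# Same relative error measure as in the gradient check above
rel_err = np.abs(ana - num).sum() / (np.abs(ana) + np.abs(num)).sum()
print(rel_err)
```

A relative error far below the `1e-7` threshold indicates that the analytical and numerical gradients agree; a large value usually points to a mistake in the backward pass.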

# 4- a concise implementation of BN

In PyTorch, you can implement batch normalization using `torch.nn.BatchNorm1d`, `torch.nn.BatchNorm2d`, or `torch.nn.BatchNorm3d`, depending on the dimensionality of your data.

Below is an example of how to implement batch normalization in a **fully connected neural network** using PyTorch. This example uses torch.nn.BatchNorm1d, which is suitable for fully connected layers.

In [8]:



import torch
import torch.nn as nn
import torch.nn.functional as F

# Define a fully connected neural network with batch normalization
class FullyConnectedNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(FullyConnectedNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.bn1 = nn.BatchNorm1d(hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.bn2 = nn.BatchNorm1d(hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.fc3(x)
        return x

    


You can also choose the input, hidden-layer, and output sizes yourself when running the code!

In [9]:
input_size = 784  # For example, 28x28 images flattened
hidden_size = 500
output_size = 10   # For example, 10 classes

# Create an instance of the network
net = FullyConnectedNet(input_size, hidden_size, output_size)

# Create a random input tensor (e.g., a batch of 4 images)
# Named x rather than input to avoid shadowing the Python built-in
x = torch.randn(4, input_size)

output = net(x)
print(output)

tensor([[ 0.4800, -0.3732,  0.2061, -0.4101,  0.4611, -0.0061,  0.1786, -0.3494,
         -0.1817,  0.5988],
        [ 0.2694, -0.5896, -0.1320, -0.1999, -0.0296,  0.0819, -0.4401,  0.4767,
          0.3122,  0.8471],
        [ 0.0426,  0.0931, -0.6851, -0.0217,  0.6040, -0.3760,  0.2737,  0.1184,
          0.1128,  0.1202],
        [-0.2570, -0.0455,  0.0874, -0.4685,  0.5739,  0.5488,  0.4299,  0.8774,
         -0.4747, -0.5152]], grad_fn=<AddmmBackward0>)
