# **Batch Normalization**

## **Internal Covariate Shift**

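**Internal covariate shift** (ICS) refers to the instability of the output distributions of deeper layers while a neural network is being trained.   
During training, the parameters of earlier layers keep changing, so the outputs of shallow layers stay relatively stable while the output distributions of deeper layers keep shifting.    
Batch normalization (BN) was long believed to work by reducing ICS, but a later paper disputed this.   
That paper argues that what BN really does is make the loss function smoother, which brings several benefits:   
- faster convergence
- less risk of getting stuck in a poor local optimum
- mitigation of vanishing gradients, plus a regularizing effect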

## **Batch Normalization Layers**

Batch normalization is applied slightly differently to fully connected layers and to convolutional layers.

### **Batch Normalization for Fully Connected Layers**

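In a fully connected layer, batch normalization is usually placed **between the affine transformation and the activation function**.    
Let the input be $u$, the weight and bias be $W$ and $b$, and the activation function be $\phi$. The affine transformation gives    

$\large{x = Wu + b}$   

and the layer output, with batch normalization applied before the activation, is

$\large{\phi{(BN(x))}}$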

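Suppose the data is a mini-batch of $m$ samples: $\mathcal{B} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)} \}$   
Each sample satisfies $\boldsymbol{x} \in \mathbb {R}^d$, and the batch normalization output is also $d$-dimensional.

Writing $x^i$ for the $i$-th sample, the final output $y^i$ is obtained in the following steps.    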

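$\large {\hat {x}^i = \frac {x^i - \mu}{\sqrt {\sigma^2 + \epsilon}} }$

$\large {\mu = \frac {1}{m} \sum_{i=1}^{m}{x^i}}$

$\large {\sigma^2 = \frac {1}{m} \sum_{i=1}^{m}(x^i - \mu)^2}$

Here the square and the square root are taken elementwise, and $\epsilon > 0$ is a small constant that keeps the denominator strictly positive.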

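Batch normalization then introduces two learnable parameters, the scale parameter $\gamma$ and the shift parameter $\beta$, which are multiplied with and added to the normalized value $\hat{x}^i$ elementwise:

$\large {y^i = \gamma \odot \hat{x}^i + \beta}$

The scale parameter $\large \gamma$ and the shift parameter $\large \beta$ keep open the option of not normalizing the data at all: setting $\large \gamma = \sqrt{\sigma^2 + \epsilon}$ and $\large \beta = \mu$ gives back $y^i = x^i$.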
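
The steps above can be traced by hand on a tiny mini-batch. The following is a minimal sketch in plain Python (no PyTorch, so that each arithmetic step stays visible). The toy data, the value of `eps`, and the helper name `batch_norm_1d` are illustrative assumptions, not part of the notebook. The last lines also check that a scale of $\sqrt{\sigma^2 + \epsilon}$ and a shift of $\mu$ undo the normalization.

```python
import math

def batch_norm_1d(batch, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of d-dimensional samples feature by feature.

    batch: list of m samples, each a list of d floats (toy data).
    gamma, beta: lists of d floats (scale and shift).
    Returns (outputs, mean, var).
    """
    m, d = len(batch), len(batch[0])
    # Per-feature mean over the m samples
    mean = [sum(x[j] for x in batch) / m for j in range(d)]
    # Per-feature biased variance over the m samples
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / m for j in range(d)]
    outputs = []
    for x in batch:
        # Standardize, then scale and shift
        x_hat = [(x[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(d)]
        outputs.append([gamma[j] * x_hat[j] + beta[j] for j in range(d)])
    return outputs, mean, var

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]  # m = 4, d = 2
eps = 1e-5

# gamma = 1, beta = 0: every feature ends up with mean 0 and variance close to 1
y, mean, var = batch_norm_1d(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0], eps=eps)

# gamma = sqrt(var + eps), beta = mean: the layer reproduces its input
gamma_id = [math.sqrt(v + eps) for v in var]
y_identity, _, _ = batch_norm_1d(batch, gamma=gamma_id, beta=mean, eps=eps)
```

Although the two features differ in scale by a factor of ten, their normalized columns in `y` come out essentially the same, which is the point of normalizing each feature separately.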

### **Batch Normalization for Convolutional Layers**

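In a convolutional layer, batch normalization is usually applied after the convolution and before the activation function. When the convolution outputs multiple channels, each channel is batch-normalized separately, and **each channel has its own scale and shift parameters.**    
For a single channel, with a batch of $m$ samples and an output of size $p \times q$, all elements of that channel are normalized together: the mean and variance are computed over the $m \times p \times q$ elements of the channel.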
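
To make the "$m \times p \times q$ elements per channel" statement concrete, here is a small sketch in plain Python that pools every element of one channel across all samples and spatial positions before taking the mean and variance. The nested-list layout `(m, c, p, q)`, the toy values, and the helper name `channel_stats` are assumptions made for illustration.

```python
def channel_stats(batch):
    """Per-channel mean and variance for a batch laid out as (m, c, p, q) nested lists."""
    num_channels = len(batch[0])
    means, variances, counts = [], [], []
    for ch in range(num_channels):
        # Gather every element of this channel across all samples and positions
        values = [v for sample in batch for row in sample[ch] for v in row]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        means.append(mu)
        variances.append(var)
        counts.append(len(values))
    return means, variances, counts

# Toy batch: m = 2 samples, c = 2 channels, p x q = 2 x 2 (values are made up)
batch = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[20.0, 20.0], [20.0, 20.0]]],
]
means, variances, counts = channel_stats(batch)
print(counts)     # number of elements pooled per channel: m * p * q
print(means)      # one mean per channel
print(variances)  # one variance per channel
```

Each channel yields exactly one mean and one variance, which is why the scale and shift parameters of a convolutional batch normalization layer have one entry per channel.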

### **Batch Normalization at Prediction Time**

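At prediction time the output should not depend on the mean and variance of whatever mini-batch happens to be fed in. Batch normalization therefore usually uses a [moving average](https://baike.baidu.com/item/%E7%A7%BB%E5%8A%A8%E5%B9%B3%E5%9D%87%E5%80%BC/10533531?fr=aladdin) to estimate the sample mean and variance of the whole training set, and applies those estimates when predicting. As a result, a model that uses batch normalization behaves differently during training and during testing.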
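
The moving-average estimate can be sketched in a few lines of plain Python. The update rule here keeps a fraction `momentum` of the old estimate and mixes in `1 - momentum` of the new batch statistic. The batch means, the zero initial value, and the helper name `update_moving_average` are assumptions chosen for illustration.

```python
def update_moving_average(moving, batch_value, momentum=0.9):
    """One exponential-moving-average step: keep `momentum` of the old estimate."""
    return momentum * moving + (1.0 - momentum) * batch_value

# Made-up per-batch means that fluctuate around 5.0
batch_means = [4.8, 5.3, 5.1, 4.9, 5.2, 4.7, 5.0, 5.1] * 10

moving_mean = 0.0  # start from zero (an assumption for this sketch)
for step, batch_mean in enumerate(batch_means, start=1):
    moving_mean = update_moving_average(moving_mean, batch_mean)
    if step in (1, 10, 80):
        # Early values are biased toward the zero start; later ones settle near 5
        print(step, round(moving_mean, 3))
```

Note that PyTorch's built-in `nn.BatchNorm1d` and `nn.BatchNorm2d` define `momentum` the other way round: there it is the weight given to the new batch statistic, with a default of 0.1.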

## **Implementing Batch Normalization**

In [1]:
import torch
from torch import nn, optim
import torch.nn.functional as F

In [2]:
import sys
sys.path.append(r'..\utils') 
import d2lzh as d2l
device = torch.device('cuda')

In [3]:
def batch_norm(is_training, X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Check whether we are in training mode or prediction mode
    if not is_training:
        # In prediction mode, use the moving-average mean and variance that were passed in
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: mean and variance of each feature, taken over the batch dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
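            # 2D convolutional layer: mean and variance of each channel (dim=1), taken over the batch
            # and spatial dimensions. Keep X's number of dimensions so that broadcasting works later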
            mean = X.mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
            var = ((X - mean) ** 2).mean(dim=0, keepdim=True).mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
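        # In training mode, standardize with the mean and variance of the current batch
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages; detach so they do not keep the autograd graph alive
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean.detach()
        moving_var = momentum * moving_var + (1.0 - momentum) * var.detach()
    Y = gamma * X_hat + beta  # scale and shift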
    return Y, moving_mean, moving_var

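Next we define a batch normalization layer that trains the scale and shift parameters and also maintains the moving averages of the mean and variance.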

In [4]:
class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super(BatchNorm, self).__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
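        # Scale and shift parameters, which take part in gradient computation and updates; initialized to 1 and 0 respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Variables that take no part in gradient computation or updates; both initialized to 0 in CPU memory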
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.zeros(shape)

    def forward(self, X):
        # If X lives on a different device (e.g. the GPU), copy moving_mean and moving_var to that device
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated moving_mean and moving_var. A Module's training attribute defaults to True and is set to False by .eval()
        Y, self.moving_mean, self.moving_var = batch_norm(self.training, 
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y

### **LeNet with Batch Normalization**

In [5]:
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            BatchNorm(6, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            BatchNorm(16, num_dims=4),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            BatchNorm(120, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            BatchNorm(84, num_dims=2),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

In [6]:
net = net.cuda()
X = torch.rand(1, 1, 28, 28).cuda()
for name, blk in net.named_children(): 
    X = blk(X)
    print(name, 'output shape: ', X.shape)

0 output shape:  torch.Size([1, 6, 24, 24])
1 output shape:  torch.Size([1, 6, 24, 24])
2 output shape:  torch.Size([1, 6, 24, 24])
3 output shape:  torch.Size([1, 6, 12, 12])
4 output shape:  torch.Size([1, 16, 8, 8])
5 output shape:  torch.Size([1, 16, 8, 8])
6 output shape:  torch.Size([1, 16, 8, 8])
7 output shape:  torch.Size([1, 16, 4, 4])
8 output shape:  torch.Size([1, 256])
9 output shape:  torch.Size([1, 120])
10 output shape:  torch.Size([1, 120])
11 output shape:  torch.Size([1, 120])
12 output shape:  torch.Size([1, 84])
13 output shape:  torch.Size([1, 84])
14 output shape:  torch.Size([1, 84])
15 output shape:  torch.Size([1, 10])


In [7]:
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

lr, num_epochs = 0.1, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()

In [8]:
for epoch in range(num_epochs):
    train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
    for X, y in train_iter:
        X = X.to(device)
        y = y.to(device)
        y_hat = net(X)
        l = loss(y_hat, y)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        # .item() yields plain Python numbers, so the running sums do not hold on to the autograd graph
        train_l_sum += l.item()
        train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
        n += y.shape[0]
    # Prints the loss summed over all batches (not averaged) and the training accuracy
    print(f'{train_l_sum:.4f}  {train_acc_sum/n:.3f}')

158.4415  0.755
91.9730  0.857
80.1246  0.874
72.1466  0.886
67.2501  0.894


### **Using PyTorch's Built-in BatchNorm**

In [9]:
net = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            nn.BatchNorm2d(6),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.BatchNorm2d(16),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),
            d2l.FlattenLayer(),
            nn.Linear(16*4*4, 120),
            nn.BatchNorm1d(120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.BatchNorm1d(84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

In [10]:
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

In [11]:
lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()

In [12]:
net = net.to(device)
for epoch in range(num_epochs):
    train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
    for X, y in train_iter:
        X = X.to(device)
        y = y.to(device)
        y_hat = net(X)
        l = loss(y_hat, y)
        optimizer.zero_grad()
        l.backward()
        optimizer.step()
        # .item() yields plain Python numbers, so the running sums do not hold on to the autograd graph
        train_l_sum += l.item()
        train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
        n += y.shape[0]
    # Prints the loss summed over all batches (not averaged) and the training accuracy
    print(f'{train_l_sum:.4f}  {train_acc_sum/n:.3f}')

319.5829  0.764
139.4169  0.858
95.6146  0.876
82.4442  0.884
75.9913  0.890
