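## Batch Normalization

&emsp;&emsp;During backpropagation the loss sits at the output end of the network, so the last few layers train relatively quickly, while the layers close to the data train more slowly. Whenever the weights near the data change, the weights near the output have to be retrained to fit the new features, and this slows convergence.

&emsp;&emsp;**Can we avoid changing the top layers while the bottom layers are still learning?**

&emsp;&emsp;Batch normalization takes the `mini-batch` as it passes through the network and fixes its mean and variance at various points in the different layers.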

$$
\mu_{B}=\frac{1}{|B|} \sum_{i \in B} x_{i} \text { and } \sigma_{B}^{2}=\frac{1}{|B|} \sum_{i \in B}\left(x_{i}-\mu_{B}\right)^{2}+\epsilon
$$

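&emsp;&emsp;Summing over all samples in the batch and dividing by the batch size gives the mean. The variance is the squared deviation from the mean, averaged over the batch, plus a small constant $\epsilon$ that keeps it from becoming 0.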
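
&emsp;&emsp;As a quick check of the two formulas, here is a minimal sketch in plain Python. The batch values and the variable names `batch`, `mu` and `var` are made up for illustration and do not come from the notebook.

```python
# A tiny mini-batch of one scalar feature (illustrative values only).
batch = [1.0, 2.0, 3.0, 4.0]
eps = 1e-5  # small constant that keeps the variance away from 0

# Mean: sum over the batch divided by the batch size.
mu = sum(batch) / len(batch)

# Variance: mean squared deviation from mu, plus eps.
var = sum((x - mu) ** 2 for x in batch) / len(batch) + eps

print(mu)   # 2.5
print(var)  # 1.25 plus eps
```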

&emsp;&emsp;Batch normalization then transforms every input sample $x_i$ of the batch as follows, with $x_{i+1}$ denoting the output:

$$
x_{i+1}=\gamma \frac{x_{i}-\mu_{B}}{\sigma_{B}}+\beta
$$

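&emsp;&emsp;Here $\gamma$ and $\beta$ are learnable parameters. They let the network settle on a mean and variance that suit it better than a fixed 0 and 1, but their values are kept from changing too drastically.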
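
&emsp;&emsp;The effect of $\gamma$ and $\beta$ can be seen in a small sketch in plain Python. The helper name `batch_norm_1d` and the numbers are hypothetical. After the transform, the batch has mean $\beta$ and a standard deviation close to $\gamma$.

```python
import math

def batch_norm_1d(batch, gamma, beta, eps=1e-5):
    """Normalize a list of scalars with its own statistics, then scale and shift."""
    mu = sum(batch) / len(batch)
    var = sum((x - mu) ** 2 for x in batch) / len(batch) + eps
    sigma = math.sqrt(var)
    return [gamma * (x - mu) / sigma + beta for x in batch]

out = batch_norm_1d([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)

# Statistics of the transformed batch.
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((y - out_mean) ** 2 for y in out) / len(out))
print(out_mean, out_std)  # roughly 0.5 and 2.0
```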

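&emsp;&emsp;Batch normalization can be applied to:

1. the output of a fully connected or convolutional layer, before the activation function; batch normalization is a linear transformation, whereas the activation function is a nonlinear one.
2. the input of a fully connected or convolutional layer.

&emsp;&emsp;For a fully connected layer, batch normalization acts on the feature dimension. This resembles normalizing the input data, except that it is done for the inputs or outputs of every convolutional and fully connected layer rather than for the data alone. The learned parameters $\gamma$ and $\beta$ then rescale and shift the result.

&emsp;&emsp;For a convolutional layer, it acts on the channel dimension. With 100 channels, say, each pixel carries a 100-dimensional vector that can be regarded as the features of that pixel, so every pixel of every image in the batch counts as one sample and the statistics are computed per channel.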
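
&emsp;&emsp;To make the per-channel statistics concrete, the sketch below computes one mean per channel for a made-up batch stored as nested lists in `[N][C][H][W]` order. The helper `channel_mean` is hypothetical and only mirrors the averaging over batch, height and width.

```python
# A hypothetical batch of 2 images, 2 channels, 2x2 pixels, laid out as [N][C][H][W].
X = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[20.0, 20.0], [20.0, 20.0]]],
]

def channel_mean(X, c):
    """Average channel c over the batch, height and width dimensions."""
    values = [v for image in X for row in image[c] for v in row]
    return sum(values) / len(values)

num_channels = len(X[0])
means = [channel_mean(X, c) for c in range(num_channels)]
print(means)  # one mean per channel: [4.5, 15.0]
```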

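&emsp;&emsp;The original batch normalization paper set out to reduce internal covariate shift.

&emsp;&emsp;Later papers pointed out that it may simply control model complexity by injecting noise into every mini-batch, because the mean and variance of each mini-batch fluctuate randomly. Seen from this angle, there is little need to combine it with dropout.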
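
&emsp;&emsp;The random fluctuation of the batch statistics is easy to observe. The sketch below is a toy illustration in plain Python, with a made-up dataset of 1000 standard normal values and a batch size of 32. It draws a few mini-batches and prints their means.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# A made-up "dataset" of 1000 scalar activations drawn from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(1000)]

def batch_mean(batch):
    """Mean of one mini-batch."""
    return sum(batch) / len(batch)

# Draw a few random mini-batches of size 32 and record the mean of each one.
means = [batch_mean(random.sample(data, 32)) for _ in range(5)]
print(means)  # five slightly different values scattered around the dataset mean
```

&emsp;&emsp;Each sample is therefore normalized with a slightly different mean depending on which mini-batch it lands in, which is the noise referred to above.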

&emsp;&emsp;Batch normalization speeds up convergence, but it generally does not change the accuracy of the model.

## Implementation

In [1]:
import torch
import torch.nn as nn

### Core batch_norm formula

In [2]:
def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    """
    moving_mean: running estimate of the mean over the whole dataset.
    moving_var: running estimate of the variance over the whole dataset.
    eps: small constant that avoids division by zero.
    momentum: weight of the old value when updating moving_mean and moving_var.
    """
    if not torch.is_grad_enabled():  # Inference: normalize with the moving mean and variance.
        # There may be only a single image, so no batch statistics are computed here.
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        # The input comes either from a fully connected layer (len(X.shape) == 2)
        # or from a 2D convolution (len(X.shape) == 4).
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # Per-channel statistics: average over the batch, height and width dimensions.
            # Both results are 4D tensors of shape 1 x C x 1 x 1.
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        X_hat = (X - mean) / torch.sqrt(var + eps)

        # Update the moving statistics. This is done only during training, never during inference.
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta
    return Y, moving_mean.data, moving_var.data
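
&emsp;&emsp;The moving-average update at the end of `batch_norm` can be studied in isolation. The sketch below uses plain Python and a hypothetical helper `update_moving`. It assumes, purely for illustration, that every mini-batch has mean 2.0, and shows how the moving mean drifts from its initial value 0 towards that number.

```python
def update_moving(moving, batch_stat, momentum=0.9):
    """Exponential moving average: momentum is the weight of the old value."""
    return momentum * moving + (1.0 - momentum) * batch_stat

moving_mean = 0.0  # the moving mean starts at zero
for step in range(50):
    # Assumption for this sketch: every mini-batch has mean 2.0.
    moving_mean = update_moving(moving_mean, 2.0)

print(moving_mean)  # approaches 2.0; after 50 steps the remaining gap is 2 * 0.9**50
```

&emsp;&emsp;PyTorch's built-in batch normalization layers use the opposite convention: their `momentum` argument (0.1 by default) is the weight of the new batch statistic, so 0.9 here corresponds to 0.1 there.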

### batch_norm layer

In [3]:
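class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        """
        num_features: number of features (outputs of a fully connected layer or channels of a convolution).
        num_dims: number of input dimensions; for simplicity either 2 or 4.
        """
        super().__init__()
        # Work out the shape first so that the variables below are easy to create.
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)

        # Learnable scale and shift, initialized to 1 and 0.
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Moving statistics are plain tensors, not parameters, so the optimizer never touches them.
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # Plain tensors do not follow net.to(device), so move them to the input's device by hand.
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        Y, self.moving_mean, self.moving_var = batch_norm(X, self.gamma, self.beta, self.moving_mean,
                                                         self.moving_var, eps=1e-5, momentum=0.9)
        return Y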
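
&emsp;&emsp;The reason for keeping moving statistics shows up as soon as the layer sees a single sample. The sketch below uses plain Python, with made-up numbers for the input and for the accumulated `moving_mean` and `moving_var`. It contrasts the two ways of normalizing one value.

```python
import math

eps = 1e-5
x = [3.0]  # a "batch" holding a single sample

# Batch statistics of a single sample: the mean is the sample itself and the variance is 0.
mu = sum(x) / len(x)
var = sum((v - mu) ** 2 for v in x) / len(x)
print((x[0] - mu) / math.sqrt(var + eps))  # 0.0, whatever the input was

# Moving statistics accumulated during training (made-up values) keep the information.
moving_mean, moving_var = 2.0, 4.0
x_hat = (x[0] - moving_mean) / math.sqrt(moving_var + eps)
print(x_hat)  # close to 0.5
```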

### Applying batch_norm to LeNet

In [6]:
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), BatchNorm(6, num_dims=4), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), BatchNorm(16, num_dims=4), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(16*4*4, 120), BatchNorm(120, num_dims=2), nn.Sigmoid(),
    nn.Linear(120, 84), BatchNorm(84, num_dims=2), nn.Sigmoid(),
    nn.Linear(84, 10))

In [7]:
X = torch.randn(size=(1, 1, 28, 28), dtype=torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, "Out Put Shape: ", X.shape)

Conv2d Out Put Shape:  torch.Size([1, 6, 24, 24])
BatchNorm Out Put Shape:  torch.Size([1, 6, 24, 24])
Sigmoid Out Put Shape:  torch.Size([1, 6, 24, 24])
AvgPool2d Out Put Shape:  torch.Size([1, 6, 12, 12])
Conv2d Out Put Shape:  torch.Size([1, 16, 8, 8])
BatchNorm Out Put Shape:  torch.Size([1, 16, 8, 8])
Sigmoid Out Put Shape:  torch.Size([1, 16, 8, 8])
AvgPool2d Out Put Shape:  torch.Size([1, 16, 4, 4])
Flatten Out Put Shape:  torch.Size([1, 256])
Linear Out Put Shape:  torch.Size([1, 120])
BatchNorm Out Put Shape:  torch.Size([1, 120])
Sigmoid Out Put Shape:  torch.Size([1, 120])
Linear Out Put Shape:  torch.Size([1, 84])
BatchNorm Out Put Shape:  torch.Size([1, 84])
Sigmoid Out Put Shape:  torch.Size([1, 84])
Linear Out Put Shape:  torch.Size([1, 10])
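
&emsp;&emsp;The spatial sizes in this listing follow from the usual output-size formula for convolution and pooling windows. The sketch below reproduces them in plain Python. The helper name `conv_out` is hypothetical, and padding is assumed to be 0 as in the layers above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                    # input images are 28x28
size = conv_out(size, kernel=5)              # Conv2d(1, 6, kernel_size=5)   -> 24
size = conv_out(size, kernel=2, stride=2)    # AvgPool2d(2, stride=2)        -> 12
size = conv_out(size, kernel=5)              # Conv2d(6, 16, kernel_size=5)  -> 8
size = conv_out(size, kernel=2, stride=2)    # AvgPool2d(2, stride=2)        -> 4

flat = 16 * size * size  # 16 channels, each size x size, after Flatten
print(size, flat)        # 4 256
```

&emsp;&emsp;This is where the `16*4*4` input size of the first `nn.Linear` layer comes from.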


In [8]:
import torchvision
from torchvision import transforms
from torch.utils import data

def get_dataloader_workers():
    """Use 4 processes to read the data."""
    return 4

def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [transforms.ToTensor()]
    
    if resize:
        trans.insert(0, transforms.Resize(resize))
        
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="../data", train=False, transform=trans, download=True)
    
    return (data.DataLoader(mnist_train, batch_size, shuffle=True, num_workers=get_dataloader_workers()),
            data.DataLoader(mnist_test, batch_size, shuffle=False, num_workers=get_dataloader_workers()))

In [9]:
train_iter, test_iter = load_data_fashion_mnist(batch_size=256)



In [10]:
class Accumulator:
    """For accumulating sums over `n` variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

In [11]:
argmax = lambda x, *args, **kwargs: x.argmax(*args, **kwargs)
reduce_sum = lambda x, *args, **kwargs: x.sum(*args, **kwargs)
astype = lambda x, *args, **kwargs: x.type(*args, **kwargs)

def accuracy(y_hat, y):
    """Compute the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = argmax(y_hat, axis=1)
    cmp = astype(y_hat, y.dtype) == y
    return float(reduce_sum(astype(cmp, y.dtype)))
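
&emsp;&emsp;Note that `accuracy` returns the number of correct predictions, not a ratio; the division happens later. A plain-Python equivalent, with a hypothetical helper `count_correct` and made-up scores, may make the logic easier to follow.

```python
def count_correct(scores, labels):
    """Count rows whose highest-scoring class matches the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=lambda k: row[k])  # argmax over classes
        correct += int(predicted == label)
    return correct

# Made-up scores for 3 samples and 3 classes.
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
labels = [1, 0, 1]
print(count_correct(scores, labels))  # 2
```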

In [12]:
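def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Compute the accuracy of the model on a dataset, on the GPU if one is used."""
    if isinstance(net, torch.nn.Module):
        net.eval()
        if not device:  # If no device is given, use the one the network parameters live on.
            device = next(iter(net.parameters())).device

    metric = Accumulator(2)  # number of correct predictions, number of predictions

    # The custom BatchNorm layer uses its moving statistics only when gradients are disabled.
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                X = [x.to(device) for x in X]  # Move the data to the device.
            else:
                X = X.to(device)

            y = y.to(device)

            metric.add(accuracy(net(X), y), y.numel())

    # Return after the loop so that every batch is counted, not only the first one.
    return metric[0] / metric[1]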

In [13]:
def train(net, train_iter, test_iter, num_epochs, lr, device):
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    
    net.apply(init_weights)  # Run init_weights on every submodule of net.
    print("training on {}".format(device))
    
    net.to(device)
    
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    
    num_batches = len(train_iter)
    
    for epoch in range(num_epochs):
        metric = Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            X, y = X.to(device), y.to(device)
            
            y_hat = net(X)
            
            optimizer.zero_grad()
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            
            metric.add(l * X.shape[0], accuracy(y_hat, y), X.shape[0])
            
            train_loss = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                print("epoch {}, train_loss {}, train_acc {}".format(epoch + (i + 1) / num_batches, 
                                                                  train_loss, train_acc))
    test_acc = evaluate_accuracy_gpu(net, test_iter)
    print(f'loss {train_loss:.3f}, train acc {train_acc:.3f}, 'f'test acc {test_acc:.3f}')
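
&emsp;&emsp;The logging condition inside `train` reports progress five times per epoch. The sketch below replays that condition in plain Python, assuming the 60000 training images of Fashion-MNIST and a batch size of 256, to show at which batches a line is printed.

```python
import math

num_samples, batch_size = 60000, 256  # Fashion-MNIST training set and the batch size used here
num_batches = math.ceil(num_samples / batch_size)

# Same condition as in train(): report five times per epoch and at the last batch.
report_at = [i + 1 for i in range(num_batches)
             if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1]

print(num_batches)                            # 235
print(report_at)                              # [47, 94, 141, 188, 235]
print([n / num_batches for n in report_at])   # [0.2, 0.4, 0.6, 0.8, 1.0]
```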

In [14]:
def try_gpu(i=0):
    """Return gpu(i) if exists, otherwise return cpu()."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

In [15]:
lr, num_epochs = 0.9, 10
train(net, train_iter, test_iter, num_epochs, lr, try_gpu())

training on cpu
epoch 0.2, train_loss 1.1157996844738087, train_acc 0.6095412234042553
epoch 0.4, train_loss 0.919659712213151, train_acc 0.6742021276595744
epoch 0.6, train_loss 0.8318415930930604, train_acc 0.7018783244680851
epoch 0.8, train_loss 0.7667046658536221, train_acc 0.7247755984042553
epoch 1.0, train_loss 0.7192401525497436, train_acc 0.7416
epoch 1.2, train_loss 0.5064855338411128, train_acc 0.8115857712765957
epoch 1.4, train_loss 0.48988876729569536, train_acc 0.820063164893617
epoch 1.6, train_loss 0.4825343162032729, train_acc 0.8231105939716312
epoch 1.8, train_loss 0.47760752176350735, train_acc 0.8243018617021277
epoch 2.0, train_loss 0.4683734359105428, train_acc 0.8284666666666667
epoch 2.2, train_loss 0.4082371751044659, train_acc 0.8485704787234043
epoch 2.4, train_loss 0.40830606127038915, train_acc 0.8496509308510638
epoch 2.6, train_loss 0.41287886969586635, train_acc 0.8478224734042553
epoch 2.8, train_loss 0.4081670050608351, train_acc 0.8491730385638298
e

### Scale parameter gamma and shift parameter beta

In [16]:
net[1].gamma.reshape((-1,)), net[1].beta.reshape((-1,))

(tensor([0.8035, 2.9319, 2.7292, 4.4260, 2.3077, 2.6277],
        grad_fn=<ViewBackward>),
 tensor([-0.7501, -1.6460, -2.9647,  2.4097,  1.4707, -2.0397],
        grad_fn=<ViewBackward>))
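
&emsp;&emsp;`net[1]` normalizes 6 channels, so `gamma` and `beta` hold 6 values each. The same bookkeeping can be done for the whole network; the short sketch below counts the values in plain Python from the feature sizes of the four batch normalization layers.

```python
# Number of channels or features normalized by each batch normalization layer in the network.
bn_features = [6, 16, 120, 84]

# Each normalized feature has one learnable scale (gamma) and one learnable shift (beta).
learnable = sum(2 * c for c in bn_features)

# Each feature also keeps a moving mean and a moving variance, which gradient descent does not train.
moving_stats = sum(2 * c for c in bn_features)

print(learnable, moving_stats)  # 452 452
```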

## Concise Implementation

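&emsp;&emsp;The concise implementation simply uses the built-in layers: `nn.BatchNorm2d` after the convolutional layers and `nn.BatchNorm1d` after the fully connected layers.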

In [17]:
net = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.BatchNorm2d(6), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2), nn.Flatten(),
    nn.Linear(256, 120), nn.BatchNorm1d(120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.BatchNorm1d(84), nn.Sigmoid(),
    nn.Linear(84, 10))

In [18]:
train(net, train_iter, test_iter, num_epochs, lr, try_gpu())

training on cpu
epoch 0.2, train_loss 1.0844372206545891, train_acc 0.6171875
epoch 0.4, train_loss 0.8916318369038562, train_acc 0.6798952792553191
epoch 0.6, train_loss 0.8001158011297808, train_acc 0.711436170212766
epoch 0.8, train_loss 0.7445767458449019, train_acc 0.7301155252659575
epoch 1.0, train_loss 0.703881033706665, train_acc 0.7449
epoch 1.2, train_loss 0.5291786510893639, train_acc 0.8035239361702128
epoch 1.4, train_loss 0.5035403221845627, train_acc 0.8138713430851063
epoch 1.6, train_loss 0.49055850907420434, train_acc 0.8199523492907801
epoch 1.8, train_loss 0.478026808100812, train_acc 0.8243641954787234
epoch 2.0, train_loss 0.46913920974731443, train_acc 0.8276666666666667
epoch 2.2, train_loss 0.42546047492230193, train_acc 0.8426695478723404
epoch 2.4, train_loss 0.41432177861954306, train_acc 0.8472822473404256
epoch 2.6, train_loss 0.40768800538482397, train_acc 0.8503435283687943
epoch 2.8, train_loss 0.40304638699014134, train_acc 0.8524559507978723
epoch 3.0