# Max Pooling Layer

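In the previous chapter, we introduced the convolutional layer to capture local features of an image. One important property of the convolutional layer is "weight sharing": it is the key technique behind "translation invariance", and it also greatly reduces the number of parameters in the convolutional layer itself.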

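In our network model, the input image of the convolutional layer has shape (1, 28, 28), the kernel size is 3, and we use 16 kernels, so there are 1 × 3 × 3 × 16 = 144 weights in total. The output of the convolutional layer has shape (16, 26, 26).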

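However, once we connect the convolutional layer to the subsequent linear layer, the number of weights the linear layer needs rises sharply, to 16 × 26 × 26 × 64 = 692224. This is because the convolutional layer barely reduces the size of its output image (here only from 28 × 28 to 26 × 26); on the contrary, because the number of output channels (the number of kernels) increases, the output data can end up larger than the input.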
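
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `conv_output_shape` is made up for illustration, and it assumes a stride-1 convolution without padding, as in the text.

```python
# Shapes and sizes are taken from the text; the helper name is illustrative.
def conv_output_shape(in_shape, kernels, kernel_size):
    """Output shape of a stride-1, unpadded convolution on a (C, H, W) input."""
    _, height, width = in_shape
    return kernels, height - kernel_size + 1, width - kernel_size + 1


in_shape = (1, 28, 28)
conv_weights = in_shape[0] * 3 * 3 * 16           # channels x 3 x 3 x kernels
out_shape = conv_output_shape(in_shape, kernels=16, kernel_size=3)
flattened = out_shape[0] * out_shape[1] * out_shape[2]
linear_weights = flattened * 64                   # 64 units in the linear layer

print(conv_weights)    # 144
print(out_shape)       # (16, 26, 26)
print(flattened)       # 10816
print(linear_weights)  # 692224
```

The linear layer ends up with several thousand times more weights than the convolutional layer that feeds it.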

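Most of the time, we do not need a very large image to see clearly what it contains. So for large images we usually shrink them, which makes them easier to store and faster to display without making the content harder to recognize.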

The same holds for a network model. We can borrow techniques from image compression to slim the model down.

-----------------

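The **max pooling layer** (MaxPool Layer) is the most commonly used approach. Its logic is very simple: like the convolutional layer, it has a sliding window, the **pooling kernel** (Kernel). But it has no weights and performs no multiplications; instead it computes a simple statistic: it keeps only the largest value inside the pooling window and discards the rest.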

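For example, a 2 × 2 pooling kernel, moved in steps of 2 so that the windows do not overlap, compresses the area of the image to 1/4 of the original.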

```{figure} images/pool.png
:align: center
:width: 360px
**Figure: illustration of the max pooling layer**
```
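
The 2 × 2 case can be sketched on a small array. The 4 × 4 values below are invented for illustration, and the reshape trick assumes the image height and width are divisible by the pooling size.

```python
import numpy as np

# A made-up 4 x 4 single-channel "image".
image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 5],
                  [0, 1, 7, 2],
                  [6, 2, 3, 4]])

# Split into non-overlapping 2 x 2 blocks.
# Axes after reshape: (block row, row in block, block column, column in block).
blocks = image.reshape(2, 2, 2, 2)

# Keep only the largest value of each block.
pooled = blocks.max(axis=(1, 3))

print(pooled)
# [[4 5]
#  [6 7]]
print(pooled.size / image.size)  # 0.25
```

Each output value stands for four input values, which is where the factor of 1/4 comes from.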

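Since every output value of the convolutional layer already encodes a local feature of the image, max pooling keeps only the strongest local feature inside each pooling window. For the same reason, another property of the max pooling layer is that it strengthens "translation invariance": wherever the strongest local feature appears inside the pooling window, the layer passes the same value on to the layers that follow.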
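
A small experiment illustrates this effect and its limit. The helper `max_pool` and the feature maps are hypothetical, written only for this sketch; it assumes non-overlapping windows and a map size divisible by the pooling size.

```python
import numpy as np


def max_pool(feature_map, pool_size=2):
    """Non-overlapping max pooling on a single 2-D feature map."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // pool_size, pool_size, w // pool_size, pool_size)
    return blocks.max(axis=(1, 3))


# One strong activation (9) placed at three different positions.
a = np.zeros((4, 4))
a[0, 0] = 9
b = np.zeros((4, 4))
b[1, 1] = 9   # shifted, but still inside the same 2 x 2 window as in `a`
c = np.zeros((4, 4))
c[1, 2] = 9   # shifted into the neighbouring window

same_window = np.array_equal(max_pool(a), max_pool(b))
other_window = np.array_equal(max_pool(a), max_pool(c))

print(same_window)   # True
print(other_window)  # False
```

The invariance only covers shifts that stay inside one pooling window; once the feature crosses a window boundary, the pooled output changes.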

We can summarize the properties of the max pooling layer as follows:

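* It compresses the data, reducing the number of weights needed by the layers that follow.
* It helps prevent overfitting by keeping only the most important local features and dropping minor ones.
* It improves generalization by strengthening "translation invariance", a bit like image stabilization in a camera.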

In [1]:
from abc import abstractmethod, ABC
import numpy as np

np.random.seed(99)

## Infrastructure

### Tensor

In [2]:
class Tensor:

    def __init__(self, data):
        self.data = np.array(data)
        self.grad = np.zeros_like(self.data)
        self.gradient_fn = lambda: None
        self.parents = set()

    def backward(self):
        if self.gradient_fn:
            self.gradient_fn()

        for p in self.parents:
            p.backward()

    @property
    def size(self):
        return np.prod(self.data.shape[1:])

    def __repr__(self):
        return f'Tensor({self.data})'

### Base Dataset

In [3]:
class Dataset(ABC):

    def __init__(self, batch_size=1):
        self.batch_size = batch_size

        self.test_labels = self.test_features = None
        self.train_labels = self.train_features = None

        self.load()
        self.train()

    @abstractmethod
    def load(self):
        pass

    def train(self):
        self.features = self.train_features
        self.labels = self.train_labels

    def eval(self):
        self.features = self.test_features
        self.labels = self.test_labels

    def shape(self):
        return Tensor(self.features).size, Tensor(self.labels).size

    def items(self):
        return Tensor(self.features), Tensor(self.labels)

    def __len__(self):
        return len(self.features) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        end = start + self.batch_size

        feature = Tensor(self.features[start: end])
        label = Tensor(self.labels[start: end])
        return feature, label

    def estimate(self, predictions):
        pass

### Base Layer

In [4]:
class Layer(ABC):

    def __init__(self):
        self.training = True

    def __call__(self, x: Tensor):
        return self.forward(x)

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    @abstractmethod
    def forward(self, x: Tensor):
        pass

    @property
    def parameters(self):
        return []

    def __repr__(self):
        return ''

### Base Loss Function

In [5]:
class Loss(ABC):

    def __call__(self, p: Tensor, y: Tensor):
        return self.loss(p, y)

    @abstractmethod
    def loss(self, p: Tensor, y: Tensor):
        pass

### Base Optimizer

In [6]:
class Optimizer(ABC):

    def __init__(self, parameters, lr):
        self.parameters = parameters
        self.lr = lr

    def reset(self):
        for p in self.parameters:
            p.grad = np.zeros_like(p.data)

    @abstractmethod
    def step(self):
        pass

### Base Model

In [7]:
class Model(ABC):

    def __init__(self, layer, loss_fn, optimizer):
        self.layer = layer
        self.loss_fn = loss_fn
        self.optimizer = optimizer

    @abstractmethod
    def train(self, dataset, epochs):
        pass

    @abstractmethod
    def test(self, dataset):
        pass

## Data

### MNIST Dataset

In [8]:
class MNISTDataset(Dataset):

    def __init__(self, filename, batch_size=1):
        self.filename = filename
        super().__init__(batch_size)

    def load(self):
        with np.load(self.filename, allow_pickle=True) as f:
            self.train_features, self.train_labels = self.normalize(f['x_train'], f['y_train'])
            self.test_features, self.test_labels = self.normalize(f['x_test'], f['y_test'])

    @staticmethod
    def normalize(x, y):
        inputs = x / 255
        inputs = np.expand_dims(inputs, axis=1)
        targets = np.zeros((len(y), 10))
        targets[range(len(y)), y] = 1
        return inputs, targets

    def estimate(self, predictions):
        count = (predictions.data.argmax(axis=1) == self.labels.argmax(axis=1)).sum()
        total = len(self.labels)
        return count / total

## Model

### Linear Layer

In [9]:
class Linear(Layer):

    def __init__(self, in_size, out_size):
        super().__init__()
        self.weight = Tensor(np.random.rand(out_size, in_size) / in_size)
        self.bias = Tensor(np.random.rand(out_size))

    def forward(self, x: Tensor):
        p = Tensor(x.data @ self.weight.data.T + self.bias.data)

        def gradient_fn():
            self.weight.grad += p.grad.T @ x.data
            self.bias.grad += np.sum(p.grad, axis=0)
            x.grad += p.grad @ self.weight.data

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    @property
    def parameters(self):
        return [self.weight, self.bias]

    def __repr__(self):
        return f'Linear[weight{self.weight.data.shape}; bias{self.bias.data.shape}]'

### Sequential Layer

In [10]:
class Sequential(Layer):

    def __init__(self, layers):
        super().__init__()
        self.layers = layers

    def train(self):
        for l in self.layers:
            l.train()

    def eval(self):
        for l in self.layers:
            l.eval()

    def forward(self, x: Tensor):
        for l in self.layers:
            x = l(x)
        return x

    @property
    def parameters(self):
        return [p for l in self.layers for p in l.parameters]

    def __repr__(self):
        return '\n'.join(str(l) for l in self.layers if str(l))

### Convolutional Layer

In [11]:
class Convolution(Layer):

    def __init__(self, in_channels, out_channels, pool_size):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.pool_size = pool_size
        in_size = in_channels * pool_size * pool_size
        self.weight = Tensor(np.random.rand(out_channels, in_channels, pool_size, pool_size) * np.sqrt(2 / in_size))
        self.bias = Tensor(np.zeros(out_channels))

    def forward(self, x: Tensor):
        batch, in_ch, in_h, in_w = x.data.shape
        out_h = in_h - self.pool_size + 1
        out_w = in_w - self.pool_size + 1
        x_padded = x.data

        patches = []
        for i in range(out_h):
            for j in range(out_w):
                patch = x_padded[:, :, i:i + self.pool_size, j:j + self.pool_size]
                patches.append(patch)
        patches = np.array(patches).transpose(1, 0, 2, 3, 4)
        patches_reshaped = patches.reshape(batch, out_h * out_w, -1)
        weight_reshaped = self.weight.data.reshape(self.out_channels, -1)
        output = patches_reshaped @ weight_reshaped.T + self.bias.data
        output = output.reshape(batch, out_h, out_w, self.out_channels)
        output = output.transpose(0, 3, 1, 2)
        p = Tensor(output)

        def gradient_fn():
            grad_output = p.grad.transpose(0, 2, 3, 1).reshape(batch, out_h * out_w, self.out_channels)
            weight_grad = grad_output.reshape(-1, self.out_channels).T @ patches_reshaped.reshape(-1, patches_reshaped.shape[-1])

            self.weight.grad += weight_grad.reshape(self.weight.data.shape) / batch
            self.bias.grad += np.sum(grad_output, axis=(0, 1)) / batch

            input_grad = grad_output @ weight_reshaped
            input_grad = input_grad.reshape(batch, out_h, out_w, in_ch, self.pool_size, self.pool_size)

            # Scatter each patch gradient back onto the input positions it was cut from.
            grad_input = np.zeros_like(x_padded)
            for i in range(out_h):
                for j in range(out_w):
                    grad_input[:, :, i:i + self.pool_size, j:j + self.pool_size] += input_grad[:, i, j, :, :, :]
            x.grad += grad_input / batch

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    @property
    def parameters(self):
        return [self.weight, self.bias]

    def __repr__(self):
        return f'Convolution[weight{self.weight.data.shape}; bias{self.bias.data.shape}; kernels={self.out_channels}, kernel size={self.pool_size}]'

### Max Pooling Layer

The **max pooling layer** (MaxPool Layer) keeps only the largest value inside each pooling window.

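**Forward pass**: using NumPy's vectorized operations, we can compute the maximum of each pooling window for the whole batch and all channels at once, while also recording where each maximum occurs.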

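**Gradient computation**: the max pooling layer has no weights and does not transform the data. So it does not transform the gradient passed back from the following layer either; it simply sends the gradient back "the way it came", following the layout of the pooling windows.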

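In a max pooling layer, the gradient flows only to the pixel that "contributed the maximum" during the forward pass; the gradient at all other positions is 0. This is where the positions of the maxima recorded during the forward pass are used. If several pixels in a window tie for the maximum, the implementation in this section marks all of them, so each of them receives the gradient.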

**Parents** (parents): the input.

**Parameters** (parameters): none.
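
The gradient routing can be sketched on a single 2-D feature map before looking at the batched class. The functions `max_pool_forward` and `max_pool_backward` and all numbers below are hypothetical, written only for this illustration; they assume non-overlapping windows and a map size divisible by the pooling size, and they use a reshape-based shortcut rather than the loop used by the `MaxPool` class.

```python
import numpy as np


def max_pool_forward(x, pool_size=2):
    """Return the pooled map and a boolean mask marking where the maxima sit."""
    h, w = x.shape
    blocks = x.reshape(h // pool_size, pool_size, w // pool_size, pool_size)
    pooled = blocks.max(axis=(1, 3))
    # Stretch the pooled map back to the input size, then compare with the input.
    upsampled = np.repeat(np.repeat(pooled, pool_size, axis=0), pool_size, axis=1)
    mask = (x == upsampled)
    return pooled, mask


def max_pool_backward(grad_output, mask, pool_size=2):
    """Send each output gradient back to the position(s) that held the maximum."""
    upsampled = np.repeat(np.repeat(grad_output, pool_size, axis=0), pool_size, axis=1)
    return upsampled * mask


x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [6., 2., 3., 4.]])
pooled, mask = max_pool_forward(x)

# A made-up gradient arriving from the next layer, one value per window.
grad_output = np.array([[10., 20.],
                        [30., 40.]])
grad_input = max_pool_backward(grad_output, mask)

print(grad_input)
# [[ 0.  0.  0.  0.]
#  [10.  0.  0. 20.]
#  [ 0.  0. 40.  0.]
#  [30.  0.  0.  0.]]
```

Every window passes its gradient to exactly one position here because the example has no ties.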


In [12]:
class MaxPool(Layer):

    def __init__(self, pool_size):
        super().__init__()
        self.pool_size = pool_size

    def forward(self, x: Tensor):
        batch, channel, in_h, in_w = x.data.shape
        out_h = in_h // self.pool_size
        out_w = in_w // self.pool_size

        output = np.zeros((batch, channel, out_h, out_w))
        mask = np.zeros_like(x.data, dtype=bool)
        for i in range(out_h):
            for j in range(out_w):
                h_start = i * self.pool_size
                w_start = j * self.pool_size
                region = x.data[:, :, h_start:h_start + self.pool_size, w_start:w_start + self.pool_size]
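                # Largest value per sample and channel; keepdims keeps shape (batch, channel, 1, 1).
                max_val = np.max(region, axis=(2, 3), keepdims=True)
                # Index the two window axes instead of calling squeeze(), which would
                # also drop a batch or channel axis of length 1.
                output[:, :, i, j] = max_val[:, :, 0, 0]
                # Mark every position equal to the window maximum (ties are all marked).
                region_mask = (region == max_val)
                mask[:, :, h_start:h_start + self.pool_size, w_start:w_start + self.pool_size] |= region_mask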
        p = Tensor(output)

        def gradient_fn():
            grad_input = np.zeros_like(x.data)
            for i in range(out_h):
                for j in range(out_w):
                    h_start = i * self.pool_size
                    w_start = j * self.pool_size
                    region_mask = mask[:, :, h_start:h_start + self.pool_size, w_start:w_start + self.pool_size]
                    grad_region = p.grad[:, :, i:i + 1, j:j + 1]
                    grad_input[:, :, h_start:h_start + self.pool_size, w_start:w_start + self.pool_size] += region_mask * grad_region
            x.grad += grad_input

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'MaxPool[pool size={self.pool_size}]'

### Flatten Layer

In [13]:
class Flatten(Layer):

    def forward(self, x: Tensor):
        p = Tensor(np.array(x.data.reshape(x.data.shape[0], -1)))

        def gradient_fn():
            x.grad += p.grad.reshape(x.data.shape)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Flatten[]'

### Dropout Layer

In [14]:
class Dropout(Layer):

    def __init__(self, dropout_rate=0.2):
        super().__init__()
        self.dropout_rate = dropout_rate

    def forward(self, x: Tensor):
        if not self.training:
            return x

        mask = np.random.random(x.data.shape) > self.dropout_rate
        p = Tensor(x.data * mask)

        def gradient_fn():
            x.grad += p.grad * mask

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Dropout[rate={self.dropout_rate}]'

### ReLU Activation Function

In [15]:
class ReLU(Layer):

    def forward(self, x: Tensor):
        p = Tensor(np.maximum(0, x.data))

        def gradient_fn():
            x.grad += p.grad * (p.data > 0)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'ReLU[]'

### Tanh Activation Function

In [16]:
class Tanh(Layer):

    def forward(self, x: Tensor):
        p = Tensor(np.tanh(x.data))

        def gradient_fn():
            x.grad += p.grad * (1 - p.data ** 2)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Tanh[]'

### Sigmoid Activation Function

In [17]:
class Sigmoid(Layer):

    def __init__(self, clip_range=(-100, 100)):
        super().__init__()
        self.clip_range = clip_range

    def forward(self, x: Tensor):
        z = np.clip(x.data, self.clip_range[0], self.clip_range[1])
        p = Tensor(1 / (1 + np.exp(-z)))

        def gradient_fn():
            x.grad += p.grad * p.data * (1 - p.data)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Sigmoid[]'

### Softmax Activation Function

In [18]:
class Softmax(Layer):

    def __init__(self, axis=-1):
        super().__init__()
        self.axis = axis

    def forward(self, x: Tensor):
        exp = np.exp(x.data - np.max(x.data, axis=self.axis, keepdims=True))
        p = Tensor(exp / np.sum(exp, axis=self.axis, keepdims=True))

        def gradient_fn():
            grad = np.sum(p.data * p.grad, axis=self.axis, keepdims=True)
            x.grad += p.data * (p.grad - grad)

        p.gradient_fn = gradient_fn
        p.parents = {x}
        return p

    def __repr__(self):
        return f'Softmax[]'

### Loss Function (Mean Squared Error)

In [19]:
class MSELoss(Loss):

    def loss(self, p: Tensor, y: Tensor):
        mse = Tensor(np.mean(np.square(y.data - p.data)))

        def gradient_fn():
            # np.mean averages over every element, so divide by the element count.
            p.grad += -2 * (y.data - p.data) / y.data.size

        mse.gradient_fn = gradient_fn
        mse.parents = {p}
        return mse

### Loss Function (Cross-Entropy)

In [20]:
class CELoss(Loss):

    def loss(self, p: Tensor, y: Tensor):
        exp = np.exp(p.data - np.max(p.data, axis=-1, keepdims=True))
        softmax = exp / np.sum(exp, axis=-1, keepdims=True)

        log = np.log(np.clip(softmax, 1e-10, 1))
        ce = Tensor(0 - np.sum(y.data * log) / len(y.data))

        def gradient_fn():
            p.grad += (softmax - y.data) / len(y.data)

        ce.gradient_fn = gradient_fn
        ce.parents = {p}
        return ce

### Loss Function (Binary Cross-Entropy)

In [21]:
class BCELoss(Loss):

    def loss(self, p: Tensor, y: Tensor):
        clipped = np.clip(p.data, 1e-7, 1 - 1e-7)
        bce = Tensor(-np.mean(y.data * np.log(clipped) + (1 - y.data) * np.log(1 - clipped)))

        def gradient_fn():
            # np.mean averages over every element, so divide by the element count.
            p.grad += (clipped - y.data) / (clipped * (1 - clipped)) / y.data.size

        bce.gradient_fn = gradient_fn
        bce.parents = {p}
        return bce

### Optimizer (Stochastic Gradient Descent)

In [22]:
class SGDOptimizer(Optimizer):

    def step(self):
        for p in self.parameters:
            p.data -= p.grad * self.lr

### Neural Network Model

In [23]:
class NNModel(Model):

    def train(self, dataset, epochs):
        self.layer.train()
        dataset.train()

        for epoch in range(epochs):
            for i in range(len(dataset)):
                features, labels = dataset[i]

                predictions = self.layer(features)
                loss = self.loss_fn(predictions, labels)
                self.optimizer.reset()
                loss.backward()
                self.optimizer.step()

    def test(self, dataset):
        self.layer.eval()
        dataset.eval()

        features, labels = dataset.items()
        predictions = self.layer(features)
        loss = self.loss_fn(predictions, labels)
        return predictions, loss

## Settings

### Learning Rate

In [24]:
LEARNING_RATE = 0.01

### Batch Size

In [25]:
BATCH_SIZE = 2

### Kernel Size

In [26]:
KERNEL_SIZE = 3

### Pooling Window Size

In [27]:
POOL_SIZE = 2

### Epochs

In [28]:
EPOCHS = 10

## Training

### Iteration

Let's add a max pooling layer with a pooling window size of 2 after the convolutional layer and see how training goes.

In [29]:
dataset = MNISTDataset('tinymnist.npz', BATCH_SIZE)
feature, label = dataset[0]
_, channels, rows, columns = feature.data.shape
conv_rows = (rows - KERNEL_SIZE + 1) // POOL_SIZE
conv_columns = (columns - KERNEL_SIZE + 1) // POOL_SIZE
layer = Sequential([Convolution(channels, 16, KERNEL_SIZE),
                    MaxPool(POOL_SIZE),
                    Flatten(),
                    Dropout(),
                    Linear(conv_rows * conv_columns * 16, 64),
                    ReLU(),
                    Linear(64, dataset.shape()[1])])
loss_fn = CELoss()
optimizer = SGDOptimizer(layer.parameters, lr=LEARNING_RATE)

model = NNModel(layer, loss_fn, optimizer)
model.train(dataset, EPOCHS)
print(layer)

Convolution[weight(16, 1, 3, 3); bias(16,); kernels=16, kernel size=3]
MaxPool[pool size=2]
Flatten[]
Dropout[rate=0.2]
Linear[weight(64, 2704); bias(64,)]
ReLU[]
Linear[weight(10, 64); bias(10,)]


## Validation

### Testing

In [30]:
predictions, loss = model.test(dataset)
accuracy = dataset.estimate(predictions)
print(f'accuracy: {accuracy:.2%}')

accuracy: 92.80%


We can observe a few things from the test results:

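* The number of weights in the first linear layer drops sharply (its input shrinks from 16 × 26 × 26 = 10816 values to 16 × 13 × 13 = 2704), so training is faster.
* Even with 75% of the data discarded, accuracy stays high and drops only slightly. In fact, MNIST images are already very small, and compressing them further may have started to damage local features, which makes recognition somewhat harder.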
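
The saving can be quantified with the shapes printed during training. This is a sketch; the helper name `linear_weights` is made up for illustration.

```python
# Weight count of the first linear layer, with and without 2 x 2 max pooling.
def linear_weights(channels, height, width, out_size):
    """Number of weights in a linear layer fed by a flattened (C, H, W) input."""
    return channels * height * width * out_size


without_pool = linear_weights(16, 26, 26, 64)
with_pool = linear_weights(16, 26 // 2, 26 // 2, 64)

print(without_pool)                  # 692224
print(with_pool)                     # 173056
print(1 - with_pool / without_pool)  # 0.75
```

The 2704 inputs per sample match the `Linear[weight(64, 2704); ...]` line in the training output.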

## Closing Remarks

In this part, we continued to extend our neural network training framework:

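* for **classification problems**, we introduced more activation functions and loss functions;
* for **multi-dimensional data**, we introduced the flatten layer and the dropout layer;
* for **image data**, we introduced the convolutional layer and the max pooling layer.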

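The significance of the **convolutional neural network** (Convolutional Neural Network, CNN) is that it breaks the limitation of the **fully connected neural network** and recognizes **local features** wherever they appear in the image (translation invariance).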

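In the next part, we will turn to **sequential data**, such as language. We will keep extending our neural network training framework to handle language and text, so that our network model can begin to "speak".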