# Computational Graphs

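A computational graph is a tool for implementing the backpropagation (BP) algorithm in neural networks. It rests on the chain rule for differentiating composite functions and simplifies the computation of error backpropagation.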

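## Local Computation Nodes

A computational graph distributes the inference and learning work of a neural network across local computation nodes. The results computed at these nodes are then combined to carry out inference and learning for the whole network.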

- **Addition node**

The addition node implements the following piece of the forward (inference) computation:

$$
x + y = z
$$(node-add)

During error backpropagation, the upstream gradient $\frac{\partial L}{\partial z}$ is multiplied by the following local derivatives,

$$
\begin{split}
\frac{\partial z}{\partial x}&=1\\
\frac{\partial z}{\partial y}&=1\\
\end{split}
$$(node-add-local-comp)

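and the products are passed back along the corresponding branches. That is, the $x$ and $y$ branches each receive $\frac{\partial L}{\partial z}\times 1$. Equation {eq}`node-add-local-comp` gives the local gradient for each branch. The computational graph of the addition node is shown below: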

:::{figure-md}
:name: fig-comp-graph-node-plus
![Addition node](../img/node-plus.svg){width=200px}

Addition node
:::
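
The forward and backward rules of the addition node can be sketched as a small class. `AddNode` is a hypothetical name introduced only for illustration; it is a minimal sketch of the rule above, not part of the layers defined later.

```python
import numpy as np

class AddNode:
    """Hypothetical helper: z = x + y, upstream gradient passed through unchanged."""

    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        # dz/dx = 1 and dz/dy = 1, so each branch receives dL/dz * 1
        return dz * 1.0, dz * 1.0

add_node = AddNode()
z = add_node.forward(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
dx, dy = add_node.backward(np.array([0.5, -1.0]))
print("z:", z)
print("dx:", dx, "dy:", dy)
```

Both branches receive the same array of values because the local derivatives are both 1.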

- **Multiplication node**

As with the addition node, it implements the following local computation:

$$
x\cdot y=z
$$(node-mult)

During backpropagation, the following values are passed back along the corresponding branches. The $x$ branch receives

$$
\frac{\partial L}{\partial z}\frac{\partial z}{\partial x}=\frac{\partial L}{\partial z}\cdot y
$$(node-mult-back-x)

and the $y$ branch receives

$$
\frac{\partial L}{\partial z}\frac{\partial z}{\partial y}=\frac{\partial L}{\partial z}\cdot x
$$(node-mult-back-y)

The computational graph of the multiplication node is shown below:

:::{figure-md} fig-comp-graph-node-times
![Multiplication node](../img/node-times.svg){width=200px}

Multiplication node
:::

See {ref}`fig-comp-graph-node-times` for the multiplication node.
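
The multiplication rule swaps the inputs on the way back: the $x$ branch is scaled by $y$ and the $y$ branch by $x$. The class below is a minimal sketch under that rule; `MulNode` is a hypothetical name used only for this example.

```python
import numpy as np

class MulNode:
    """Hypothetical helper: z = x * y (element-wise)."""

    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        # Both inputs are needed again in the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        dx = dz * self.y  # dL/dz * dz/dx, with dz/dx = y
        dy = dz * self.x  # dL/dz * dz/dy, with dz/dy = x
        return dx, dy

mul_node = MulNode()
z = mul_node.forward(np.array([2.0, 3.0]), np.array([5.0, -1.0]))
dx, dy = mul_node.backward(np.array([1.0, 2.0]))
print("z:", z)
print("dx:", dx, "dy:", dy)
```

Unlike the addition node, this node has to store its inputs during the forward pass.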

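- **Branch node**

A branch node copies the same value into each outgoing branch, so it is also called a copy node. In backpropagation, the gradients arriving from upstream along the branches are summed (the reverse of the addition node's logic).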

:::{figure-md} fig-comp-graph-node-repeat
![Branch node](../img/node-repeat.svg){width=200px}

Branch node example
:::

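When the branching extends to $N$ copies, the node is called a **repeat node**. Its backpropagation mirrors that of the branch node: the result is the sum of the upstream gradients.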



In [15]:
import numpy as np
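x = np.random.randn(1, 3)  # example input
y = np.repeat(x, 2, axis=0)  # forward pass: copy x into 2 rows

dy = np.random.randn(2, 3)  # example upstream gradient
dx = np.sum(dy, axis=0, keepdims=True)  # backward pass: sum the gradients of the copies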
print("Input x:\n", x)
print("Output y:\n", y) 
print("Gradient dy:\n", dy)
print("Gradient dx:\n", dx)

Input x:
 [[0.72206891 0.35389271 1.72719994]]
Output y:
 [[0.72206891 0.35389271 1.72719994]
 [0.72206891 0.35389271 1.72719994]]
Gradient dy:
 [[-0.67843465  0.65692653 -1.63123009]
 [-0.97966934 -0.71100144 -0.25300468]]
Gradient dx:
 [[-1.65810399 -0.05407491 -1.88423478]]


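- **Sum node**

The sum node is the reverse of the repeat node: in the forward pass its output is the sum of its input branches, and in backpropagation each branch receives a copy of the upstream gradient.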

In [16]:
x = np.random.randn(2, 3)  # example input
y = np.sum(x, axis=0, keepdims=True)  # forward pass: sum over the rows
dy = np.random.randn(1, 3)  # example upstream gradient
dx = np.repeat(dy, 2, axis=0)  # backward pass: copy the gradient to each row
print("Input x:\n", x)  
print("Output y:\n", y)
print("Gradient dy:\n", dy)
print("Gradient dx:\n", dx)

Input x:
 [[ 0.92489881  0.86625853 -0.54744188]
 [-0.0619725  -0.62391338 -0.17605427]]
Output y:
 [[ 0.86292631  0.24234514 -0.72349616]]
Gradient dy:
 [[-0.96189216 -0.72897876 -0.0355408 ]]
Gradient dx:
 [[-0.96189216 -0.72897876 -0.0355408 ]
 [-0.96189216 -0.72897876 -0.0355408 ]]


- **MatMul node**

The matrix-product node (vectors are assumed to be row vectors) computes

$$
\pmb{y}_{1\times H}=\pmb{x}_{1\times D}\pmb{W}_{D\times H}
$$(node-matmul)

The difficulty with this node lies in backpropagation, that is, in computing the following partial derivatives (**numerator layout**):

$$
\begin{split}
\left[\frac{\partial L}{\partial \pmb{x}}\right]_{1\times D}&=\frac{\partial L}{\partial \pmb{y}}\frac{\partial \pmb{y}}{\partial \pmb{x}}   =\left[\frac{\partial L}{\partial \pmb{y}}\right]_{1\times H}\left[\pmb{W}^\top\right]_{H\times D}\\
\left[\frac{\partial L}{\partial \pmb{W}}\right]_{D\times H}&=\frac{\partial L}{\partial \pmb{y}}\frac{\partial\pmb{y}}{\partial \pmb{W}}   =\left[\pmb{x}^\top\right]_{D\times 1} \left[\frac{\partial L}{\partial\pmb{y}}\right]_{1\times H}\\
\end{split}
$$(node-matmul-back)

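The first equality in {eq}`node-matmul-back` is straightforward and is omitted. The second is derived as follows. By the definition of matrix multiplication, $y_j=\sum_i x_i W_{ij}$, so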

$$
\frac{\partial y_j}{\partial W_{ik}}=\left\{\begin{array}{ll}x_i,&j=k\\ 0,& j\neq k \end{array} \right.
$$

Only the $k=j$ term of the chain-rule sum survives, which gives

$$
\frac{\partial L}{\partial W_{ij}}=\sum_k \frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial W_{ij}}=\frac{\partial L}{\partial y_j}\frac{\partial y_j}{\partial W_{ij}}=\frac{\partial L}{\partial y_j}\cdot x_i
$$

Collecting these entries into a matrix yields the gradient

$$
\frac{\partial L}{\partial \pmb{W}}=\left[ \frac{\partial L}{\partial y_j}x_i  \right]_{ij}=\left[\pmb{x}^\top\right]_{D\times 1} \left[\frac{\partial L}{\partial\pmb{y}}\right]_{1\times H}
$$

Note: when $\pmb{x}$ is a mini-batch of samples (one sample per row), the form of {eq}`node-matmul-back` is unchanged.

In [17]:
class MatMul:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]  # gradient buffer, same shape as W
        self.x = None
    
    def forward(self, x):
        W, = self.params
        out = np.dot(x, W)  # matrix product
        self.x = x  # keep the input for the backward pass
        return out
    
    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)  # gradient w.r.t. the input
        dW = np.dot(self.x.T, dout)  # gradient w.r.t. the weights
        self.grads[0][...] = dW  # write into the existing gradient buffer
        return dx
    
# Test the MatMul class
W = np.random.randn(3, 4)  # example weights
x = np.random.randn(2, 3)  # example input
matmul = MatMul(W)
y = matmul.forward(x)  # forward pass
dy = np.random.randn(2, 4)  # example upstream gradient
dx = matmul.backward(dy)  # backward pass
print("Input x:\n", x)
print("Weights W:\n", W)
print("Output y:\n", y)
print("Gradient dy:\n", dy)
print("Gradient dx:\n", dx)
print("Gradient dW:\n", matmul.grads[0])  # gradient w.r.t. the weights
    
        

Input x:
 [[-0.93687019  0.57828754 -0.76800081]
 [-0.00399834  0.1000363  -1.1926997 ]]
Weights W:
 [[-0.93569731  0.72740693  2.95627284 -0.19767411]
 [ 0.78800105  0.3317538   0.27264821 -0.77995306]
 [-1.71889265  0.94648079 -0.74407137  0.08367462]]
Output y:
 [[ 2.65242905 -1.2165348  -2.04052743 -0.33010432]
 [ 2.1326927  -1.09858836  0.90290823 -0.17703194]]
Gradient dy:
 [[-0.21584035  0.67554835 -0.10841518 -2.37497773]
 [ 0.36791339  0.42988391  0.54364351 -0.63800321]]
Gradient dx:
 [[ 0.84232656  1.87684524  0.8923432 ]
 [ 1.70172023  1.07836775 -0.68342101]]
Gradient dW:
 [[ 0.20074334 -0.63461993  0.09939727  2.2275968 ]
 [-0.08801309  0.43366518 -0.00831106 -1.4372435 ]
 [-0.27304463 -1.03154408 -0.56514051  2.58493105]]
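
One way to gain confidence in the two backward formulas is to compare them with central finite differences. The sketch below assumes a scalar loss $L=\sum_{ij}(\text{dy})_{ij}\,y_{ij}$, chosen so that $\partial L/\partial \pmb{y}$ equals `dy` exactly; `numerical_grad` is a hypothetical helper written for this check, not a NumPy function.

```python
import numpy as np

def numerical_grad(f, A, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to array A.

    Hypothetical helper: perturbs A in place one entry at a time and restores it.
    """
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        f_plus = f()
        A[idx] = old - eps
        f_minus = f()
        A[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
W = rng.standard_normal((3, 4))
dy = rng.standard_normal((2, 4))  # plays the role of dL/dy

def loss():
    # L = sum(dy * (x @ W)), so dL/dy is exactly dy
    return np.sum(dy * (x @ W))

dW_analytic = x.T @ dy   # second formula: x^T (dL/dy)
dx_analytic = dy @ W.T   # first formula: (dL/dy) W^T
dW_numeric = numerical_grad(loss, W)
dx_numeric = numerical_grad(loss, x)

print("max |dW error|:", np.max(np.abs(dW_analytic - dW_numeric)))
print("max |dx error|:", np.max(np.abs(dx_analytic - dx_numeric)))
```

The same check works for a mini-batch, since `x` here already holds two samples as rows.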


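## Layers Built from Local Computations

The nodes of a computational graph can be used to implement common neural-network layers. A layer is generally a combination of several nodes.

### **Sigmoid layer**

The sigmoid layer consists mainly of the sigmoid function: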

$$
\begin{split}
y&=\frac{1}{1+\exp(-x)}\\
\frac{\partial y}{\partial x}&=y(1-y)
\end{split}
$$(sigmoid-layer)
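
The derivative can be checked in a few steps, taking $y=\frac{1}{1+\exp(-x)}$ (the form computed by the `Sigmoid` class in this section) and applying the chain rule:

$$
\begin{split}
\frac{\partial y}{\partial x}&=\frac{\exp(-x)}{\left(1+\exp(-x)\right)^2}\\
&=\frac{1}{1+\exp(-x)}\cdot\frac{\exp(-x)}{1+\exp(-x)}\\
&=y\,(1-y)
\end{split}
$$

The last step uses $1-y=\frac{\exp(-x)}{1+\exp(-x)}$, which is why the backward pass only needs the stored output $y$.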



In [18]:
class Sigmoid:
    def __init__(self):
        self.params = []    
        self.grads = []
        self.out = None
    
    def forward(self, x):
        out = 1 / (1 + np.exp(-x))  # sigmoid function
        self.out = out  # keep the output for the backward pass
        return out
    
    def backward(self, dout):
        dx = dout * self.out * (1-self.out)  # chain rule with dy/dx = y(1 - y)
        return dx

# Test the Sigmoid class
x = np.random.randn(2, 3)  # example input
sigmoid = Sigmoid()
y = sigmoid.forward(x)  # forward pass
dy = np.random.randn(2, 3)  # example upstream gradient
dx = sigmoid.backward(dy)  # backward pass
print("Input x:\n", x)
print("Output y:\n", y)
print("Gradient dy:\n", dy)
print("Gradient dx:\n", dx)  # gradient w.r.t. the input


Input x:
 [[ 1.43282435 -0.87674304 -0.63521218]
 [-0.97507988 -1.85462779  0.61775703]]
Output y:
 [[0.807341   0.29385315 0.34632963]
 [0.27386913 0.13533046 0.64970825]]
Gradient dy:
 [[ 1.29730034 -0.64475725  0.06031784]
 [-0.47752643 -0.1826718   1.30658829]]
Gradient dx:
 [[ 0.20178405 -0.13378937  0.01365508]
 [-0.09496321 -0.02137555  0.29736308]]


### **ReLU layer**

The ReLU (Rectified Linear Unit) function is defined as

$$
\text{ReLU}(x) =
\begin{cases}
x, & \text{if } x > 0; \\
0, & \text{if } x \leq 0.
\end{cases}
$$(relu-def)

Its derivative follows directly (the value at $x=0$ is set to 0 by convention):

$$
\frac{d}{dx}\text{ReLU}(x)=
\begin{cases}
1,&\text{if } x>0;\\
0,&\text{if } x\le 0.
\end{cases}
$$(relu-der-def)

In [19]:
class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        dout[self.mask] = 0  # note: zeroes the caller's dout array in place
        dx = dout

        return dx

# Test the Relu class
x = np.random.randn(2, 3)  # example input
relu = Relu()
y = relu.forward(x)  # forward pass
dy = np.random.randn(2, 3)  # example upstream gradient
dx = relu.backward(dy)  # backward pass (dy itself is zeroed where x <= 0)
print("Input x:\n", x)
print("Output y:\n", y)
print("Gradient dy:\n", dy)
print("Gradient dx:\n", dx)  # gradient w.r.t. the input


Input x:
 [[ 1.00959427 -0.15736295  0.26379541]
 [-0.03848728 -1.39110164 -0.4967722 ]]
Output y:
 [[1.00959427 0.         0.26379541]
 [0.         0.         0.        ]]
Gradient dy:
 [[ 0.31260195  0.         -0.36205015]
 [ 0.          0.          0.        ]]
Gradient dx:
 [[ 0.31260195  0.         -0.36205015]
 [ 0.          0.          0.        ]]
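
In the output above, `Gradient dy` already contains zeros because `Relu.backward` writes into `dout` itself. If that side effect is unwanted, one possible variant builds new arrays with `np.where`. This is a minimal sketch, and `ReluNoSideEffect` is a hypothetical name, not a replacement the rest of this section depends on.

```python
import numpy as np

class ReluNoSideEffect:
    """Hypothetical ReLU variant that leaves its arguments untouched."""

    def __init__(self):
        self.params, self.grads = [], []
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)
        # np.where returns a new array, so x is not modified
        return np.where(self.mask, 0.0, x)

    def backward(self, dout):
        # Gradient is blocked where the input was <= 0, passed through elsewhere
        return np.where(self.mask, 0.0, dout)

x = np.array([[1.0, -2.0, 0.5], [-0.1, 0.0, 3.0]])
dy = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
dy_before = dy.copy()

relu = ReluNoSideEffect()
y = relu.forward(x)
dx = relu.backward(dy)
print("Output y:\n", y)
print("Gradient dx:\n", dx)
print("dy unchanged:", np.array_equal(dy, dy_before))
```

The trade-off is one extra array allocation per backward call in exchange for leaving the upstream gradient intact.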


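### **Affine layer**

The affine layer implements a linear computation. It is built from a matrix-multiplication node and a repeat node (the bias is repeated across the samples of a batch).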

$$
\pmb{z}=\pmb{x}\pmb{W}+\pmb{b}
$$(affine-node)

:::{figure-md} fig-affine
![Affine node](../img/node-affine.svg){width=600px}

Affine node
:::

See {ref}`fig-affine` for the computation. The layer is implemented as follows:

In [20]:
class Affine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None   # needed in the backward pass
    
    def forward(self, x):
        W, b = self.params
        out = np.dot(x, W) + b
        self.x = x
        return out
    
    def backward(self, dout):
        W, b = self.params
        dx = np.dot(dout, W.T)
        dw = np.dot(self.x.T, dout)
        db = np.sum(dout, axis=0)

        self.grads[0][...]=dw
        self.grads[1][...]=db
        return dx 

# Test the Affine layer
W = np.random.randn(3, 4)  # example weights
b = np.random.randn(4)  # example bias
x = np.random.randn(2, 3)  # example input
affine = Affine(W, b)
y = affine.forward(x)  # forward pass
dy = np.random.randn(2, 4)  # example upstream gradient
dx = affine.backward(dy)  # backward pass
print("Input x:\n", x)
print("Weights W:\n", W)
print("Bias b:\n", b)
print("Output y:\n", y)
print("Gradient dy:\n", dy)
print("Gradient dx:\n", dx)  # gradient w.r.t. the input
print("Gradient dW:\n", affine.grads[0])  # gradient w.r.t. the weights
print("Gradient db:\n", affine.grads[1])  # gradient w.r.t. the bias

    

Input x:
 [[-1.02898513  1.13416245  0.85302068]
 [ 0.24062443  0.96097782 -0.85473967]]
Weights W:
 [[ 0.08850005  0.3438311   0.64733261 -0.11949433]
 [-0.70266012  0.48296786  0.57096931  0.48416589]
 [-2.09054459 -1.11171277 -0.28543242  0.17613605]]
Bias b:
 [ 2.1261547   2.39785713 -1.43789332  0.54487446]
Output y:
 [[-0.54511903  1.64351007 -1.69989675  1.36720282]
 [ 3.25908058  3.89493769 -0.48947002  0.83084341]]
Gradient dy:
 [[-0.40635683  0.94765271 -0.52173287 -0.29886113]
 [-2.03194024 -0.23013739  0.04897851 -1.28197821]]
Gradient dx:
 [[-0.01215261  0.30062472 -0.10773129]
 [-0.07406069  0.72388952  4.26392572]]
Gradient dW:
 [[-7.07993375e-02 -1.03049723e+00  5.48640789e-01 -9.51621894e-04]
 [-2.41352416e+00  8.53635197e-01 -5.44662573e-01 -1.57090970e+00]
 [ 1.39014915e+00  1.00507491e+00 -4.86912796e-01  8.40822907e-01]]
Gradient db:
 [-2.43829707  0.71751532 -0.47275436 -1.58083934]


### **Softmax-with-loss layer**

This layer composes the softmax function with the cross-entropy loss. The softmax function is

$$
\sigma(\pmb{x})=\frac{\exp(\pmb{x})}{\sum_i \exp(x_i)}
$$(softmax-fun-def)

The cross-entropy loss (for a single sample with a one-hot label) is

$$
loss(y,t)=-\sum_i t_i\log y_i
$$(corss-entropy-def)

The derivative of {eq}`softmax-fun-def` (numerator layout) is

$$
\frac{\partial \sigma(\pmb{x}) }{\partial \pmb{x}}= \begin{bmatrix} \frac{\partial \sigma_1}{\partial x_1} & \frac{\partial \sigma_1}{\partial x_2} &\cdots & \frac{\partial \sigma_1}{\partial x_n} \\ \frac{\partial \sigma_2}{\partial x_1} & \frac{\partial \sigma_2}{\partial x_2} &\cdots & \frac{\partial \sigma_2}{\partial x_n} \\ 
\vdots & \vdots &\ddots &\vdots \\
\frac{\partial \sigma_n}{\partial x_1} & \frac{\partial \sigma_n}{\partial x_2} &\cdots & \frac{\partial \sigma_n}{\partial x_n} \\\end{bmatrix}
$$(derivative-softmax)

where $\sigma_i = \frac{e^{x_i}}{\sum_{k}e^{x_k}}$. Let $S=\sum_{k}e^{x_k}$, and let $\delta_{ik}=1$ if $i=k$ and $\delta_{ik}=0$ otherwise. Each entry $\frac{\partial \sigma_i}{\partial x_k}$ is then obtained by differentiating $\sigma_i=e^{x_i}/S$:

$$
\begin{split}
\frac{\partial \sigma_i}{\partial x_k} &= e^{x_i}\cdot \delta_{ik}\cdot\frac1S - \frac{e^{x_i}}{S^2}\cdot\frac{\partial S}{\partial x_k}\\
&=\sigma_i(\delta_{ik}-\sigma_k)\\
\end{split}
$$

Because of $\delta_{ik}$, the diagonal entries of the matrix ($i=k$) are $\sigma_i-\sigma_i^2$, and all other entries are $-\sigma_i\sigma_k$. Equation {eq}`derivative-softmax` can therefore be written as

$$
\frac{\partial \sigma(\pmb{x}) }{\partial \pmb{x}} =\text{diag}(\pmb{\sigma})-\pmb{\sigma}\pmb{\sigma}^\top
$$(derivative-softmax-detail)
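
The closed form $\text{diag}(\pmb{\sigma})-\pmb{\sigma}\pmb{\sigma}^\top$ can be compared with a finite-difference Jacobian. The sketch below is an illustrative check under assumed inputs; `softmax_1d` is a hypothetical helper defined only for this example.

```python
import numpy as np

def softmax_1d(x):
    """Softmax of a 1-D array (hypothetical helper; does not modify x)."""
    z = x - np.max(x)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

x = np.array([0.2, -1.0, 0.5, 1.5])
s = softmax_1d(x)

# Closed form: entry (i, k) is sigma_i * (delta_ik - sigma_k)
J_closed = np.diag(s) - np.outer(s, s)

# Central differences: column k holds d sigma / d x_k
eps = 1e-6
J_numeric = np.zeros((x.size, x.size))
for k in range(x.size):
    step = np.zeros_like(x)
    step[k] = eps
    J_numeric[:, k] = (softmax_1d(x + step) - softmax_1d(x - step)) / (2 * eps)

print("max |difference|:", np.max(np.abs(J_closed - J_numeric)))
print("row sums:", J_closed.sum(axis=1))
```

Each row of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.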

With cross-entropy as the loss function, $\frac{\partial L}{\partial \pmb{x}}$ (numerator layout) is derived as follows:

$$
\begin{split}
\frac{\partial L}{\partial \pmb{x}} &= \frac{\partial L}{\partial \sigma(\pmb{x})}\frac{\partial \sigma(\pmb{x})}{\partial \pmb{x}}\\
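&=\left[\frac{-t_1}{\sigma(\pmb{x})_1}, \frac{-t_2}{\sigma(\pmb{x})_2}, \ldots, \frac{-t_n}{\sigma(\pmb{x})_n} \right]\left(\text{diag}(\pmb{\sigma})-\pmb{\sigma}\pmb{\sigma}^\top\right)\\
&=-[t_1,t_2,\ldots,t_n]+\left(\sum_i t_i\right)[\sigma_1,\sigma_2,\ldots,\sigma_n]\\
&=[\sigma_1-t_1,\sigma_2-t_2,\ldots,\sigma_n-t_n]\qquad\left(\text{since }\sum_i t_i=1\right)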
\end{split}
$$(derivative-softmax-with-loss)



In [21]:
def softmax(x):
    # note: `x -= ...` shifts the caller's array in place
    if x.ndim == 2:  # 2-D input: one sample per row
        x -= np.max(x, axis=1, keepdims=True)  # subtract each row's maximum for numerical stability
        y = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)  # softmax
    else:  # 1-D input
        x -= np.max(x)  # subtract the maximum
        y = np.exp(x) / np.sum(np.exp(x))  # softmax
    return y

def cross_entropy_loss(y, t):
    if y.ndim == 1:  # a single sample
        y = y.reshape(1, -1)  # reshape to 2-D
        t = t.reshape(1, -1)  # reshape to 2-D
    if t.size == y.size:
        t = t.argmax(axis=1)  # one-hot labels: convert to class indices
    batch_size = y.shape[0]  # batch size
    loss = -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size  # mean cross-entropy
    return loss

class SoftmaxWithLoss:
    def __init__(self):
        self.params = []
        self.grads = []
        self.y = None  # softmax output
        self.t = None  # labels
    
    def forward(self, x, t):
        self.t = t  # keep the labels
        self.y = softmax(x)  # softmax output

        # If the labels are one-hot vectors, convert them to class indices
        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)
        loss = cross_entropy_loss(self.y, self.t)
        return self.y  # note: returns the softmax output; the scalar `loss` above is not returned
    
    def backward(self, dout=1):
        batch_size = self.y.shape[0]  # batch size
        dx = self.y.copy()  # start from the softmax output
        dx[np.arange(batch_size), self.t] -= 1  # subtract 1 at the correct class: y - t
        dx *= dout
        dx /= batch_size  # average over the batch
        return dx
    
# Test the SoftmaxWithLoss class
x = np.random.randn(2, 3)  # example input
t = np.array([0, 1])  # example labels
softmax_loss = SoftmaxWithLoss()
y = softmax_loss.forward(x, t)  # forward pass (softmax shifts x in place)
dy = softmax_loss.backward()  # backward pass
print("Input x:\n", x)
print("Labels t:\n", t)
print("Softmax Output y:\n", y)
print("Gradient dy:\n", dy)  # gradient of the loss w.r.t. x
# Cross-entropy loss computed from the softmax output
loss = cross_entropy_loss(y, t)
print("Cross Entropy Loss:\n", loss)
# Print the gradient once more
print("Gradient dx:\n", dy)


Input x:
 [[-0.01879012 -0.54829851  0.        ]
 [-0.84352563  0.         -0.07835929]]
Labels t:
 [0 1]
Softmax Output y:
 [[0.38345585 0.225815   0.39072915]
 [0.18268511 0.42466031 0.39265458]]
Gradient dy:
 [[-0.30827208  0.1129075   0.19536457]
 [ 0.09134255 -0.28766984  0.19632729]]
Cross Entropy Loss:
 0.9074979969849499
Gradient dx:
 [[-0.30827208  0.1129075   0.19536457]
 [ 0.09134255 -0.28766984  0.19632729]]
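
The result $\pmb{\sigma}-\pmb{t}$, averaged over the batch, can also be checked numerically. The sketch below uses side-effect-free helpers that return the scalar mean loss; `softmax_rows` and `mean_cross_entropy` are hypothetical names for this example and assume integer class labels.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax that leaves x untouched (hypothetical helper)."""
    z = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def mean_cross_entropy(x, t):
    """Scalar mean cross-entropy of softmax(x) against integer labels t."""
    y = softmax_rows(x)
    n = x.shape[0]
    return -np.mean(np.log(y[np.arange(n), t]))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
t = np.array([0, 1])
x_before = x.copy()

# Analytic gradient: (softmax(x) - onehot(t)) / batch_size
dx_analytic = softmax_rows(x)
dx_analytic[np.arange(x.shape[0]), t] -= 1
dx_analytic /= x.shape[0]

# Central-difference gradient of the scalar loss
eps = 1e-6
dx_numeric = np.zeros_like(x)
for idx in np.ndindex(x.shape):
    step = np.zeros_like(x)
    step[idx] = eps
    dx_numeric[idx] = (mean_cross_entropy(x + step, t)
                       - mean_cross_entropy(x - step, t)) / (2 * eps)

print("loss:", mean_cross_entropy(x, t))
print("max |difference|:", np.max(np.abs(dx_analytic - dx_numeric)))
```

Each row of the gradient sums to zero, matching the recorded `Gradient dx` output above.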


## A Complete Network

The modules designed above are now used to build an example two-layer feedforward network.

In [25]:
# Two-layer feedforward network
from collections import OrderedDict


class SimpleNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Initialize the network parameters
        self.params = {}
        self.params['W1'] = np.random.randn(input_size, hidden_size) * weight_init_std
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = np.random.randn(hidden_size, output_size) * weight_init_std 
        self.params['b2'] = np.zeros(output_size)

        # All layers, in forward order
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])
        self.lastLayer = SoftmaxWithLoss()
    
    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x
    
    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)
    
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1:
            t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
    
    def gradient(self, x, t):
        # Forward pass
        self.loss(x, t)

        # Backward pass
        dout = 1
        dout = self.lastLayer.backward(dout)
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout) 
        # Collect the gradients
        grads = {}
        grads['W1'] = self.layers['Affine1'].grads[0]
        grads['b1'] = self.layers['Affine1'].grads[1]   
        grads['W2'] = self.layers['Affine2'].grads[0]
        grads['b2'] = self.layers['Affine2'].grads[1]
        return grads

# Test the SimpleNet class
input_size = 4  # input layer size
hidden_size = 5  # hidden layer size
output_size = 3  # output layer size
net = SimpleNet(input_size, hidden_size, output_size)  # build the network
x = np.random.randn(2, input_size)  # example input
t = np.array([0, 1])  # example labels
loss = net.loss(x, t)  # value returned by SoftmaxWithLoss.forward
accuracy = net.accuracy(x, t)  # accuracy
grads = net.gradient(x, t)  # gradients
print("Input x:\n", x)
print("Labels t:\n", t)
print("Loss:\n", loss)  # as written, this prints the softmax output, not a scalar
print("Accuracy:\n", accuracy)
print("Gradient W1:\n", grads['W1'])  # first-layer weight gradient
print("Gradient b1:\n", grads['b1'])  # first-layer bias gradient
print("Gradient W2:\n", grads['W2'])  # second-layer weight gradient
print("Gradient b2:\n", grads['b2'])  # second-layer bias gradient



        

Input x:
 [[ 0.54576423  0.70829806 -1.39423678 -0.74337296]
 [-2.10609531  0.58867088  0.19073156 -0.86751311]]
Labels t:
 [0 1]
Loss:
 [[0.33326675 0.3334218  0.33331145]
 [0.33335621 0.33329482 0.33334898]]
Accuracy:
 0.0
Gradient W1:
 [[-0.00462424  0.0005273  -0.00812862  0.0011283   0.        ]
 [ 0.00129251  0.00068433  0.00407598  0.00146432  0.        ]
 [ 0.00041878 -0.00134706 -0.00208202 -0.00288242  0.        ]
 [-0.00190475 -0.00071822 -0.00537711 -0.00153684  0.        ]]
Gradient b1:
 [0.00219565 0.00096616 0.00649818 0.00206738 0.        ]
Gradient W2:
 [[ 2.62461166e-03 -5.24916640e-03  2.62455474e-03]
 [-1.52762858e-02  7.63940701e-03  7.63687876e-03]
 [-2.78548305e-03  7.54273022e-05  2.71005575e-03]
 [-5.25436777e-03  2.62761869e-03  2.62674908e-03]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00]]
Gradient b2:
 [-0.16668852 -0.16664169  0.33333021]


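### BP training of the feedforward network

The BP algorithm is now used to train the custom network above, searching for parameters that minimize the loss.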

In [33]:
from dataset.mnist import load_mnist

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = SimpleNet(input_size=784, hidden_size=50, output_size=10)

iters_num = 10000  # number of iterations
train_size = x_train.shape[0]  # training set size
batch_size = 100  # mini-batch size
learning_rate = 0.1  # learning rate
train_loss_list = []  # recorded training losses
train_acc_list = []  # recorded training accuracies
test_acc_list = []  # recorded test accuracies

iter_per_epoch = max(train_size / batch_size, 1)  # iterations per epoch
for i in range(iters_num):
    batch_mask = np.random.choice(train_size, batch_size)  # sample a random mini-batch
    x_batch = x_train[batch_mask]  # mini-batch inputs
    t_batch = t_train[batch_mask]  # mini-batch labels

    # Compute the gradients
    grads = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grads[key]

    # Record the training loss (network.loss returns whatever SoftmaxWithLoss.forward returns)
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Evaluate accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)  # training accuracy
        test_acc = network.accuracy(x_test, t_test)  # test accuracy
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print(f"Iteration {i}, Train Accuracy: {train_acc:.4f}, Test Accuracy: {test_acc:.4f}")

Iteration 0, Train Accuracy: 0.1106, Test Accuracy: 0.1078
Iteration 600, Train Accuracy: 0.9051, Test Accuracy: 0.9064
Iteration 1200, Train Accuracy: 0.9223, Test Accuracy: 0.9245
Iteration 1800, Train Accuracy: 0.9362, Test Accuracy: 0.9351
Iteration 2400, Train Accuracy: 0.9445, Test Accuracy: 0.9417
Iteration 3000, Train Accuracy: 0.9524, Test Accuracy: 0.9497
Iteration 3600, Train Accuracy: 0.9584, Test Accuracy: 0.9550
Iteration 4200, Train Accuracy: 0.9625, Test Accuracy: 0.9562
Iteration 4800, Train Accuracy: 0.9661, Test Accuracy: 0.9620
Iteration 5400, Train Accuracy: 0.9683, Test Accuracy: 0.9636
Iteration 6000, Train Accuracy: 0.9708, Test Accuracy: 0.9640
Iteration 6600, Train Accuracy: 0.9725, Test Accuracy: 0.9659
Iteration 7200, Train Accuracy: 0.9735, Test Accuracy: 0.9659
Iteration 7800, Train Accuracy: 0.9760, Test Accuracy: 0.9672
Iteration 8400, Train Accuracy: 0.9770, Test Accuracy: 0.9683
Iteration 9000, Train Accuracy: 0.9781, Test Accuracy: 0.9676
Iteration 96

## Optimizers


In [34]:
class SGD:
    '''
    Stochastic Gradient Descent (SGD)
    '''
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate
    
    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]  # update the parameters in place


class Momentum:
    '''
    Momentum SGD
    '''
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None
        
    def update(self, params, grads):
        if self.v is None:
            self.v = []
            for param in params:
                self.v.append(np.zeros_like(param))

        for i in range(len(params)):
            self.v[i] = self.momentum * self.v[i] - self.lr * grads[i]
            params[i] += self.v[i]

class AdaGrad:
    '''
    AdaGrad
    '''
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
        
    def update(self, params, grads):
        if self.h is None:
            self.h = []
            for param in params:
                self.h.append(np.zeros_like(param))

        for i in range(len(params)):
            self.h[i] += grads[i] * grads[i]
            params[i] -= self.lr * grads[i] / (np.sqrt(self.h[i]) + 1e-7)


class RMSprop:
    '''
    RMSprop
    '''
    def __init__(self, lr=0.01, decay_rate = 0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None
        
    def update(self, params, grads):
        if self.h is None:
            self.h = []
            for param in params:
                self.h.append(np.zeros_like(param))

        for i in range(len(params)):
            self.h[i] *= self.decay_rate
            self.h[i] += (1 - self.decay_rate) * grads[i] * grads[i]
            params[i] -= self.lr * grads[i] / (np.sqrt(self.h[i]) + 1e-7)


class Adam:
    '''
    Adam (http://arxiv.org/abs/1412.6980v8)
    '''
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
        
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = [], []
            for param in params:
                self.m.append(np.zeros_like(param))
                self.v.append(np.zeros_like(param))
        
        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for i in range(len(params)):
            self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
            self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])
            
            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)
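
All of the optimizers above share one interface: `update(params, grads)` receives two parallel lists of arrays and modifies the parameter arrays in place. The sketch below illustrates that contract on a toy objective $f(\pmb{w})=\tfrac12\lVert\pmb{w}\rVert^2$, whose gradient is $\pmb{w}$. `SGDSketch`, `MomentumSketch`, and `run` are hypothetical names that restate the update rules so the example is self-contained; the learning rate of 0.1 is an assumed value.

```python
import numpy as np

class SGDSketch:
    """Restates the SGD update rule for a self-contained example."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]

class MomentumSketch:
    """Restates the Momentum update rule for a self-contained example."""
    def __init__(self, lr=0.1, momentum=0.9):
        self.lr, self.momentum, self.v = lr, momentum, None

    def update(self, params, grads):
        if self.v is None:
            self.v = [np.zeros_like(p) for p in params]
        for i in range(len(params)):
            self.v[i] = self.momentum * self.v[i] - self.lr * grads[i]
            params[i] += self.v[i]

def run(optimizer, steps=200):
    """Minimize 0.5 * ||w||^2 and return (original array, params[0])."""
    w = np.array([3.0, -2.0])
    params = [w]
    for _ in range(steps):
        grads = [params[0].copy()]  # gradient of 0.5 * ||w||^2 is w
        optimizer.update(params, grads)
    return w, params[0]

w_sgd, p_sgd = run(SGDSketch())
w_mom, p_mom = run(MomentumSketch())
print("SGD      final norm:", np.linalg.norm(w_sgd))
print("Momentum final norm:", np.linalg.norm(w_mom))
print("updated in place:", p_sgd is w_sgd and p_mom is w_mom)
```

Because the updates are in place, a layer that holds a reference to the same array sees the new values without any extra copying.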


## Training


In [35]:
from matplotlib import pyplot as plt
import time

def remove_duplicates(params, grads):
    params, grads = params[:], grads[:]  # copy list

    while True:
        find_flg = False
        L = len(params)

        for i in range(0, L - 1):
            for j in range(i + 1, L):
                # parameters that share the same weight array
                if params[i] is params[j]:
                    grads[i] += grads[j]  # accumulate the gradients
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                # weights shared as a transposed matrix (weight tying)
                elif params[i].ndim == 2 and params[j].ndim == 2 and \
                     params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
                    grads[i] += grads[j].T
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                if find_flg: break
            if find_flg: break
        if not find_flg: break
    return params, grads

def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate


class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.loss_list = []  # recorded loss values
        self.eval_interval = None
        self.current_epoch = 0
    
    def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
        data_size = len(x)
        max_iters = data_size // batch_size
        self.eval_interval= eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0.0
        loss_count = 0
        start_time = time.time()
        for epoch in range(max_epoch):
            idx = np.random.permutation(data_size)
            x = x[idx]
            t = t[idx]
            for iters in range(max_iters):
                batch_x = x[iters * batch_size:(iters + 1) * batch_size]
                batch_t = t[iters * batch_size:(iters + 1) * batch_size]

                loss = model.forward(batch_x, batch_t)  # forward pass
                model.backward()
                params, grads = remove_duplicates(model.params, model.grads)  # collect parameters and gradients, merging shared weights
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1
                # periodic evaluation: report the average loss
                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    avg_loss = total_loss / loss_count
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | loss %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, avg_loss))
                    self.loss_list.append(float(avg_loss))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        x = np.arange(len(self.loss_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.loss_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('loss')
        plt.show()
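
The effect of `clip_grads` is easiest to see on a small hand-made gradient list. The sketch below restates the function compactly so it runs on its own, and the gradient values are assumed for illustration: their global L2 norm is $\sqrt{3^2+4^2+12^2}=13$.

```python
import numpy as np

def clip_grads(grads, max_norm):
    """Rescale all gradients in place so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for g in grads:
            g *= rate  # in-place scaling keeps the same array objects

def global_norm(grads):
    return np.sqrt(sum(np.sum(g ** 2) for g in grads))

# Large gradients: global norm is 13, so they are scaled down
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clip_grads(grads, max_norm=1.0)
clipped_norm = global_norm(grads)
print("norm after clipping:", clipped_norm)

# Small gradients: norm is already below max_norm, so nothing changes
small = [np.array([0.1, 0.2])]
small_before = [g.copy() for g in small]
clip_grads(small, max_norm=1.0)
print("small gradients unchanged:", np.array_equal(small[0], small_before[0]))
```

All arrays are scaled by the same factor, so the direction of the overall gradient is preserved and only its length changes.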
