## Making training more stable

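- ### Goal: keep gradient values within a reasonable range
    - e.g. [1e-6, 1e3]
- ### Turn multiplication into addition
    - ResNet, LSTM
- ### Normalization
    - Gradient normalization, gradient clipping
- ### Sensible weight initialization and activation functions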
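
Gradient clipping is one way to keep gradient magnitudes bounded. The following is a minimal sketch, assuming the gradients are available as a flat list of floats. The helper name `clip_by_global_norm` and the threshold value are illustrative and not taken from any particular library.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        # Already within the allowed range: return an unchanged copy.
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# The vector (3, 4) has norm 5, so it is scaled down to norm 1.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# A small gradient is left as it is.
untouched = clip_by_global_norm([0.1, 0.2], max_norm=1.0)
```

Clipping changes only the length of the gradient vector, not its direction, which is why it is usually preferred over clipping each component separately.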

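## Keeping the variance of every layer constant
- ### Treat each layer's outputs and gradients as random variables
- ### Keep their means and variances the same across layers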

- Forward:
$$
\mathbb E[h_i^t] = 0 \\
\text{Var}[h_i^t] = a \quad \text{(the variance } a \text{ is a constant)}
$$

- Backward:
$$
\mathbb E \left[ \frac{\partial \ell}{\partial h_i^t} \right] = 0 \\
\text{Var} \left[ \frac{\partial \ell}{\partial h_i^t} \right] = b \quad \forall i, t
$$

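## Weight initialization
- ### Randomly initialize parameters within a reasonable range
- ### Numerical instability is more likely at the start of training
    - Far from the optimum, the loss surface can be very complicated
    - Near the optimum, the surface tends to be flatter
- ### Initializing with $\mathcal N(0, 0.01)$ may be fine for small networks, but there is no guarantee it works for deep networks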
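
To illustrate why a fixed small initialization can fail in a deep network, here is a small simulation sketch. It assumes a purely linear network with square layers and reads the 0.01 in $\mathcal N(0, 0.01)$ as the standard deviation; the function name `forward_rms` and the width and depth are arbitrary choices for the demonstration.

```python
import math
import random

def forward_rms(width, depth, std, seed=0):
    """Push a random vector through `depth` linear layers and
    record the root-mean-square of each layer's output."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    rms = []
    for _ in range(depth):
        # Fresh weight matrix with i.i.d. N(0, std^2) entries.
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        h = [sum(w * x for w, x in zip(row, h)) for row in W]
        rms.append(math.sqrt(sum(x * x for x in h) / width))
    return rms

rms = forward_rms(width=32, depth=10, std=0.01)
for layer, value in enumerate(rms, start=1):
    print(f"layer {layer:2d}: rms = {value:.3e}")
```

Each layer multiplies the typical magnitude by roughly $\sqrt{32} \times 0.01 \approx 0.057$, so after ten layers the signal has shrunk by many orders of magnitude.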

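## Example: MLP
- ### Assumptions
    - ($t$ is the layer index) The weights $w_{i, j}^t$ are i.i.d. (independent and identically distributed: mutually independent random variables that share the same probability distribution), with mean $\mathbb E[w_{i, j}^t] = 0$ and variance $Var[w_{i, j}^t] = \gamma_t$
    - $h_i^{t-1}$ is independent of $w_{i, j}^t$
- ### Assume there is no activation function: $\pmb h^t = \pmb W^t \pmb h^{t-1}$, where $\pmb W^t \in \mathbb R^{n_t \times n_{t-1}}$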
$$
\mathbb E[h_i^t] = \mathbb E \left[ \sum_j w_{i, j}^t h_j^{t -1} \right] = \sum_j \mathbb E[w_{i, j}^t] \mathbb E[h_j^{t-1}] = 0
$$
Note: for independent random variables, $P(AB) = P(A)P(B)$, and therefore $\mathbb E[XY] = \mathbb E[X]\mathbb E[Y]$.

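## Forward variance
Definition of variance: $Var(X) = \mathbb E(X^2) - \mathbb E(X)^2$
$$
Var[h_i^t] = \mathbb E[(h_i^t)^2] - \mathbb E[h_i^t]^2 = \mathbb E \left[ \left(\sum_j w_{i, j}^t h_{j}^{t-1} \right)^2\right] \\
= \mathbb E \left[ \sum_j(w_{i,j}^t)^2(h_j^{t-1})^2  + \sum_{j \neq k} w_{i,j}^t w_{i,k}^t h_j^{t-1}h_k^{t-1}\right]
$$
Here the second term is 0: the weights have zero mean and are independent of each other and of $\pmb h^{t-1}$, so for $j \neq k$ the expectation of each cross term factorizes and contains $\mathbb E[w_{i,j}^t] = 0$. The expression therefore continues as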
$$
= \sum_j \mathbb E \left[ (w_{i, j}^t)^2 \right] \mathbb E \left[(h_j^{t-1})^2\right] 
= \sum _j Var[w_{i,j}^t] Var[h_j^{t-1}] \\
= n_{t-1} \gamma_t Var[h_j^{t-1}]
$$
It follows that the variance is preserved from layer to layer only if $n_{t-1} \gamma_t = 1$
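
The relation $Var[h_i^t] = n_{t-1} \gamma_t Var[h_j^{t-1}]$ can be checked numerically. This is a Monte Carlo sketch under the same assumptions as the derivation (zero-mean Gaussian weights and inputs, all independent); the function name and the specific numbers are illustrative only.

```python
import random
import statistics

def empirical_forward_variance(n_in, gamma, var_prev, trials=20000, seed=0):
    """Estimate Var[h_i^t] for one output unit of a linear layer."""
    rng = random.Random(seed)
    w_std = gamma ** 0.5      # weights have variance gamma_t
    h_std = var_prev ** 0.5   # inputs have variance Var[h^{t-1}]
    samples = []
    for _ in range(trials):
        h_prev = [rng.gauss(0.0, h_std) for _ in range(n_in)]
        w_row = [rng.gauss(0.0, w_std) for _ in range(n_in)]
        samples.append(sum(w * h for w, h in zip(w_row, h_prev)))
    return statistics.pvariance(samples)

n_in, gamma, var_prev = 20, 0.1, 1.5
measured = empirical_forward_variance(n_in, gamma, var_prev)
predicted = n_in * gamma * var_prev  # n_{t-1} * gamma_t * Var[h^{t-1}]
print(f"measured = {measured:.3f}, predicted = {predicted:.3f}")
```

Here $n_{t-1}\gamma_t = 2$, so the variance doubles in this layer; choosing $\gamma_t = 1/n_{t-1}$ instead would keep it unchanged.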

## Backward mean and variance
- ### Analogous to the forward case
$$
\frac{\partial \ell}{\partial \pmb h^{t-1}} = \frac{\partial \ell}{\partial \pmb h^t}\pmb W^t 
\Rightarrow \left(\frac{\partial \ell}{\partial \pmb h^{t-1}} \right)^T = (\pmb W^t)^T \left( \frac{\partial \ell}{\partial \pmb h^t} \right)^T \\
\mathbb E \left[ \frac{\partial \ell }{\partial h_i^{t-1}}\right] = 0 \\
Var \left[\frac{\partial \ell}{\partial h_i^{t-1}} \right] = n_t\gamma_t Var\left[\frac{\partial \ell}{\partial h_j^t} \right] 
\Rightarrow n_t \gamma_t = 1
$$



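## Xavier initialization
- ### It is hard to satisfy both $n_{t-1}\gamma_t = 1$ and $n_t\gamma_t = 1$ at the same time (unless $n_{t-1} = n_t$)
- Xavier initialization compromises with $\gamma_t(n_{t-1} + n_t) /2 = 1  \quad \rightarrow \quad \gamma_t = 2/(n_{t-1} + n_t)$
    - Normal distribution $\mathcal N(0, \sqrt{2/(n_{t-1} + n_t)})$, where the second argument is the standard deviation
    - Uniform distribution $\mathcal U(-\sqrt{6/(n_{t-1} + n_t)}, \sqrt{6/(n_{t-1} + n_t)})$
        - The variance of $\mathcal U[-a, a]$ is $a^2 / 3$
- ### Adapts to changes in weight shape, especially $n_t$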
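
A minimal sketch of both Xavier variants, written with the Python standard library only. Both target the variance $\gamma_t = 2/(n_{t-1} + n_t)$: the normal version uses standard deviation $\sqrt{\gamma_t}$, and the uniform version uses the bound $a = \sqrt{6/(n_{t-1} + n_t)}$ so that $a^2/3 = \gamma_t$. The function names and the nested-list weight layout are assumptions made for illustration.

```python
import math
import random

def xavier_normal(n_in, n_out, rng):
    """Weights of shape (n_out, n_in) drawn from N(0, 2 / (n_in + n_out))."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def xavier_uniform(n_in, n_out, rng):
    """Weights of shape (n_out, n_in) drawn from U(-a, a), a = sqrt(6 / (n_in + n_out))."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]

def variance(matrix):
    """Population variance over all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)

rng = random.Random(0)
n_in, n_out = 200, 100
target = 2.0 / (n_in + n_out)  # gamma_t
var_normal = variance(xavier_normal(n_in, n_out, rng))
var_uniform = variance(xavier_uniform(n_in, n_out, rng))
print(f"target={target:.5f} normal={var_normal:.5f} uniform={var_uniform:.5f}")
```

Both empirical variances should land close to the target, which shows that the two distributions are interchangeable as far as the variance argument is concerned.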

## Assuming a linear activation function
- ### Assume $\sigma(x) = \alpha x + \beta$
$$
\pmb h^{'} = \pmb W^t \pmb h^{t-1} \quad \text{and} \quad \pmb h^t = \sigma(\pmb h^{'}) \\
\mathbb E[h_i^t] = \mathbb E[\alpha h_i^{'} + \beta] = \beta  \qquad \Rightarrow \beta = 0 \\
\begin{eqnarray}
Var[h_i^t] &=& \mathbb E[(h_i^t)^2] - \mathbb E[h_i^t]^2 \\
&=& \mathbb E [(\alpha h_i^{'} + \beta)^2] - \beta^2 \\
&=& \mathbb E[\alpha ^2(h_i^{'})^2 + 2 \alpha \beta h_i^{'} + \beta^2] - \beta^2 \\
&=& \alpha^2 Var[h_i^{'}] \qquad \Rightarrow \alpha = 1
\end{eqnarray}
$$
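
The two results above, mean $\beta$ and variance $\alpha^2 Var[h_i^{'}]$, can be checked with a quick simulation. This sketch assumes a zero-mean Gaussian pre-activation; the function name and the values of $\alpha$ and $\beta$ are chosen only for illustration.

```python
import random
import statistics

def affine_activation_stats(alpha, beta, var_pre, trials=50000, seed=0):
    """Sample h' ~ N(0, var_pre) and return the mean and variance of alpha * h' + beta."""
    rng = random.Random(seed)
    std = var_pre ** 0.5
    out = [alpha * rng.gauss(0.0, std) + beta for _ in range(trials)]
    return statistics.fmean(out), statistics.pvariance(out)

mean, var = affine_activation_stats(alpha=2.0, beta=0.5, var_pre=1.0)
print(f"mean = {mean:.3f} (expected beta = 0.5)")
print(f"var  = {var:.3f} (expected alpha^2 * Var[h'] = 4.0)")
```

With $\alpha = 2$ and $\beta = 0.5$ the output is neither zero-mean nor variance-preserving, which is why the derivation requires $\beta = 0$ and $\alpha = 1$.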

## Backward
- ### Assume $\sigma(x) = \alpha x + \beta$
$$
\frac{\partial \ell}{\partial \pmb h^{'}} = \alpha \frac{\partial \ell}{\partial \pmb h^t} \quad \text{and} \quad \frac{\partial \ell}{\partial \pmb h^{t-1}} = \frac{\partial \ell}{\partial \pmb h^{'}} \pmb W^t \\
\mathbb E \left[ \frac{\partial \ell}{\partial h_i^{t-1}}\right] = 0 \\
Var \left[ \frac{\partial \ell}{\partial h_i^{t-1}}\right] = n_t \gamma_t Var \left[ \frac{\partial \ell}{\partial h_j^{'}}\right] = \alpha^2 n_t \gamma_t Var \left[ \frac{\partial \ell}{\partial h_j^t}\right] \qquad \qquad \Longrightarrow \alpha = 1
$$

## Checking common activation functions
- ### Use Taylor expansions
$$
\begin{eqnarray}
\text{sigmoid}(x) &=& \frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + O(x^5) \\
\text{tanh}(x) &=&0 + x - \frac{x^3}{3} + O(x^5) \\
\text{relu}(x) &=&0 + x \quad \text{for } x \ge 0
\end{eqnarray}
$$

- ### Scaled sigmoid:
$$
4 \times \text{sigmoid}(x) - 2
$$
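
A quick numerical check of how closely each activation matches the identity $f(x) = x$ near zero. This sketch assumes the rescaled form $4 \times \text{sigmoid}(x) - 2$, which is what the Taylor expansion of the sigmoid suggests (it removes the constant $1/2$ and turns the slope $1/4$ into $1$). The helper names and sample points are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scaled_sigmoid(x):
    """Sigmoid shifted and scaled so that it behaves like x near the origin."""
    return 4.0 * sigmoid(x) - 2.0

xs = [-0.1, -0.01, 0.0, 0.01, 0.1]
# Largest deviation from the identity over the sample points.
max_err_sigmoid = max(abs(sigmoid(x) - x) for x in xs)
max_err_tanh = max(abs(math.tanh(x) - x) for x in xs)
max_err_scaled = max(abs(scaled_sigmoid(x) - x) for x in xs)
print(f"sigmoid: {max_err_sigmoid:.2e}")
print(f"tanh: {max_err_tanh:.2e}")
print(f"scaled sigmoid: {max_err_scaled:.2e}")
```

The plain sigmoid is far from the identity near zero, while tanh and the scaled sigmoid deviate only by a cubic term.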

## Multi-language code block test

```java []
System.out.println("hello, world");
```
```python []
print("hello, world, im python")
```