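## Sequence Models

### The Modeling Process

&emsp;&emsp;Suppose we observe a data point $x_{t}$ at time $t$. Over $T$ time steps we obtain $T$ random variables that are not independent of one another: $\left(x_{1}, \ldots, x_{T}\right) \sim p(\mathbf{x})$. Modeling the sequence amounts to estimating this joint distribution $p(\mathbf{x})$.

&emsp;&emsp;Recall the product rule for conditional probability: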

$$
p(a, b)=p(a) p(b \mid a)=p(b) p(a \mid b)
$$

&emsp;&emsp;Applying this rule repeatedly, we can factorize $p(\mathbf{x})$:

$$
p(\mathbf{x})=p\left(x_{1}\right) \cdot p\left(x_{2} \mid x_{1}\right) \cdot p\left(x_{3} \mid x_{1}, x_{2}\right) \cdot \ldots p\left(x_{T} \mid x_{1}, \ldots x_{T-1}\right)
$$

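&emsp;&emsp;This is intuitive: to predict what happens at time $t$, we need to know what happened before time $t$.

&emsp;&emsp;We can equally well factorize in the reverse direction: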

$$
p(\mathbf{x})=p\left(x_{T}\right) \cdot p\left(x_{T-1} \mid x_{T}\right) \cdot p\left(x_{T-2} \mid x_{T-1}, x_{T}\right) \cdot \ldots p\left(x_{1} \mid x_{2}, \ldots x_{T}\right)
$$

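&emsp;&emsp;The reverse order is meaningful in some situations, for example when we already know the future and want to infer what the past looked like. It is not always physically realizable, though.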
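
&emsp;&emsp;Both factorizations can be checked numerically. The sketch below is only an illustration: the joint table `joint` over three binary variables is randomly generated, and the helper names `forward_product` and `reverse_product` are made up here. Both products should recover the same joint probability.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical joint distribution p(x1, x2, x3) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

def forward_product(x1, x2, x3):
    """p(x1) * p(x2 | x1) * p(x3 | x1, x2)."""
    p1 = joint[x1].sum()
    p12 = joint[x1, x2].sum()
    return p1 * (p12 / p1) * (joint[x1, x2, x3] / p12)

def reverse_product(x1, x2, x3):
    """p(x3) * p(x2 | x3) * p(x1 | x2, x3)."""
    p3 = joint[:, :, x3].sum()
    p23 = joint[:, x2, x3].sum()
    return p3 * (p23 / p3) * (joint[x1, x2, x3] / p23)

# Largest disagreement between the two orders over all 8 outcomes.
max_gap = max(
    abs(forward_product(a, b, c) - reverse_product(a, b, c))
    for a in range(2) for b in range(2) for c in range(2)
)
print(max_gap)
```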

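### The Core of Sequence Models

&emsp;&emsp;The core task of a sequence model is to compute the probability of the event at time $t$ given the data observed before time $t$: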

$$
p\left(x_{t} \mid x_{1}, \ldots x_{t-1}\right)=p\left(x_{t} \mid f\left(x_{1}, \ldots x_{t-1}\right)\right)
$$

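&emsp;&emsp;The function $f$ can be understood as a model trained on the past $t-1$ observations. Predicting a sequence from its own previously seen values in this way is called an **autoregressive model**.

&emsp;&emsp;Two questions remain: how to compute $f\left(x_{1}, \ldots x_{t-1}\right)$, and, given $f$, how to compute $p$.

### Approach 1: The Markov Assumption

&emsp;&emsp;The first approach is the Markov assumption: the current value depends only on the $\tau$ most recent data points. When predicting a new value, we look only at the past $\tau$ observations, so the model becomes: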

$$
p\left(x_{t} \mid x_{1}, \ldots x_{t-1}\right)=p\left(x_{t} \mid x_{t-\tau}, \ldots x_{t-1}\right)=p\left(x_{t} \mid f\left(x_{t-\tau}, \ldots x_{t-1}\right)\right)
$$

&emsp;&emsp;Because the input now has a fixed length $\tau$, $f$ can be a simple MLP, and the task reduces to an ordinary regression problem.
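
&emsp;&emsp;To make the regression setup concrete, the sketch below turns a sequence into (features, label) pairs where each feature row holds the previous $\tau$ values. The helper `make_windows` and the sine series are assumptions chosen for illustration; any regressor, such as an MLP, could then be fit on the result.

```python
import numpy as np

def make_windows(series, tau):
    """Return (features, labels) with features[i] = series[i:i+tau] and labels[i] = series[i+tau]."""
    series = np.asarray(series, dtype=np.float64)
    n = len(series) - tau
    # Column i holds the value observed i steps into each window.
    features = np.stack([series[i:i + n] for i in range(tau)], axis=1)
    labels = series[tau:]
    return features, labels

series = np.sin(0.1 * np.arange(20))  # toy sequence
features, labels = make_windows(series, tau=4)
print(features.shape, labels.shape)  # (16, 4) (16,)
```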

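### Approach 2: Latent Variable Models

&emsp;&emsp;The other approach introduces a latent variable $h_{t}$ that summarizes the past: $h_{t}=f\left(x_{1}, \ldots x_{t-1}\right)$. A latent variable differs slightly from a hidden variable in the statistical sense: the concept is somewhat broader and can be regarded as a mild generalization of the hidden variable.

&emsp;&emsp;We then have $x_{t} \sim p\left(x_{t} \mid h_{t}\right)$.

&emsp;&emsp;This gives two models:

1. Model 1: compute the new latent variable from the previous latent variable and the input $x$.
2. Model 2: given the new latent variable and the input $x$ from the previous time step, compute the new $x$.

&emsp;&emsp;The problem is thereby split into two models, each depending on only one or two variables, which makes each of them easier to handle.

&emsp;&emsp;**A latent variable model uses a latent variable to summarize the history.** A hidden variable usually refers to something that really exists but has not been observed. A latent variable includes that case, but it may also be something that does not exist in reality and therefore can never be observed.

1. How does a latent variable model differ from a hidden Markov model? A latent variable model may adopt the hidden Markov assumption.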
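
&emsp;&emsp;The two-model split can be sketched as follows. Everything here is an assumption made for illustration: the weight names, the `tanh` update, the softmax output, and the random (untrained) parameters. The point is only the data flow: one function updates the latent state, the other predicts the next $x$ from the latent state and the previous input.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 3, 5
# Hypothetical, untrained parameters; in practice both models are learned.
W_hh = rng.normal(scale=0.1, size=(latent_dim, latent_dim))
W_xh = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_ho = rng.normal(scale=0.1, size=(latent_dim, input_dim))
W_xo = rng.normal(scale=0.1, size=(input_dim, input_dim))

def update_latent(h_prev, x):
    """Model 1: new latent state from the previous latent state and the input."""
    return np.tanh(h_prev @ W_hh + x @ W_xh)

def predict_next(h, x_prev):
    """Model 2: distribution over the next x given the latent state and the previous input."""
    logits = h @ W_ho + x_prev @ W_xo
    e = np.exp(logits - logits.max())
    return e / e.sum()

h = np.zeros(latent_dim)
observations = np.eye(input_dim)[[0, 2, 1]]  # three one-hot observations
for x in observations:
    h = update_latent(h, x)
p_next = predict_next(h, observations[-1])
print(p_next)
```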



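## Language Models

&emsp;&emsp;Given a text sequence $x_{1}, \cdots, x_{T}$, the goal of a language model is to estimate the joint probability $p(x_{1},\cdots, x_{T})$. Applications include:

1. Pretraining models such as `BERT` and `GPT-3`.

2. Generating text: given the first few words, repeatedly sample $x_{t} \sim p(x_{t} \mid x_{1}, \ldots, x_{t-1})$ to produce the continuation.

3. Judging which of several sequences is more likely. For example, a speech model may output several candidates; because the same meaning can be phrased in many ways, it is useful to judge which phrasing has the higher probability.

### Modeling with Counts

&emsp;&emsp;Suppose the sequence length is `2` and we want to estimate $p(x_{1}, x_{2})$. Let $n$ be the total number of words in the corpus, and let $n(x_{1})$ and $n(x_{1}, x_{2})$ be the number of times the single word and the consecutive word pair occur: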

$$
p(x_{1}, x_{2}) = p(x_{1}) p(x_{2} \mid x_{1}) = \frac{n(x_{1})}{n} \frac{n(x_{1}, x_{2})}{n(x_{1})}
$$
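
&emsp;&emsp;As a small illustration of the count-based estimate $p(x_{1}, x_{2}) = \frac{n(x_{1})}{n} \cdot \frac{n(x_{1}, x_{2})}{n(x_{1})}$, the sketch below counts words and consecutive pairs in a toy corpus. The corpus and the helper name `p_pair` are assumptions made for this example.

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()  # toy corpus
n = len(tokens)
unigram = Counter(tokens)                  # n(x1)
bigram = Counter(zip(tokens, tokens[1:]))  # n(x1, x2)

def p_pair(x1, x2):
    """Estimate p(x1, x2) = n(x1)/n * n(x1, x2)/n(x1) from counts."""
    if unigram[x1] == 0:
        return 0.0
    return unigram[x1] / n * bigram[(x1, x2)] / unigram[x1]

print(p_pair("the", "cat"))
print(p_pair("cat", "the"))  # this pair never occurs, so the estimate is 0
```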

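#### N-grams

&emsp;&emsp;When the sequence is long and the corpus is not large enough, it is quite likely that $n(x_{1}, \cdots, x_{T}) \leq 1$. As soon as a count of 0 appears, the whole product of probabilities becomes `0`. The Markov assumption alleviates this problem:

#### Unigram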

$$
p(x_{1}, x_{2}, x_{3}, x_{4}) = p(x_{1}) p(x_{2}) p(x_{3}) p(x_{4}) = \frac{n(x_{1})}{n} \frac{n(x_{2})}{n} \frac{n(x_{3})}{n} \frac{n(x_{4})}{n}
$$

#### Bigram

$$
p(x_{1}, x_{2}, x_{3}, x_{4}) = p(x_{1}) p(x_{2} \mid x_{1}) p(x_{3} \mid x_{2}) p(x_{4} \mid x_{3}) = \frac{n(x_{1})}{n} \frac{n(x_{1}, x_{2})}{n(x_{1})} \frac{n(x_{2}, x_{3})}{n(x_{2})} \frac{n(x_{3}, x_{4})}{n(x_{3})}
$$
$$
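
&emsp;&emsp;The unigram and bigram estimates can be compared on a toy corpus. In this sketch the corpus and the function names `unigram_prob` and `bigram_prob` are illustrative assumptions. The bigram model conditions each word on its predecessor, with $p(x_{t} \mid x_{t-1}) = n(x_{t-1}, x_{t}) / n(x_{t-1})$.

```python
from collections import Counter

tokens = "a b a b a c a b".split()  # toy corpus
n = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))

def unigram_prob(seq):
    """Product of n(x_t) / n: every word is treated as independent."""
    p = 1.0
    for x in seq:
        p *= uni[x] / n
    return p

def bigram_prob(seq):
    """n(x_1)/n times the product of n(x_{t-1}, x_t) / n(x_{t-1})."""
    p = uni[seq[0]] / n
    for prev, cur in zip(seq, seq[1:]):
        if uni[prev] == 0:
            return 0.0
        p *= bi[(prev, cur)] / uni[prev]
    return p

seq = ["a", "b", "a", "c"]
print(unigram_prob(seq), bigram_prob(seq))
```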

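## Gated Recurrent Unit (GRU)

&emsp;&emsp;In an `RNN`, all hidden information is packed into a single hidden state. When there are many time steps, too much accumulates in it and earlier information becomes hard to extract. Moreover, not every observation in an input sequence is equally important.

1. **Update gate** $Z_{t}$: a mechanism for paying attention to important observations.

2. **Reset gate** $R_{t}$: a mechanism for forgetting.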

$$
Z_{t} = \sigma(X_{t} W_{xz} + H_{t-1} W_{hz} + b_{z}) \\
R_{t} = \sigma(X_{t} W_{xr} + H_{t-1} W_{hr} + b_{r})
$$

3. **Candidate hidden state**:

&emsp;&emsp;Next, the reset gate is used to compute the candidate hidden state:

$$
\tilde{\boldsymbol{H}}_{t}=\tanh \left(\boldsymbol{X}_{t} \boldsymbol{W}_{x h}+\left(\boldsymbol{R}_{t} \odot \boldsymbol{H}_{t-1}\right) \boldsymbol{W}_{h h}+\boldsymbol{b}_{h}\right)
$$

&emsp;&emsp;Compared with a plain `RNN`: if we ignore $R_{t}$ (that is, treat it as all ones), this is exactly the earlier `RNN` update.

4. **Hidden state**:

$$
\boldsymbol{H}_{t}=\mathbf{Z}_{t} \odot \boldsymbol{H}_{t-1}+\left(1-\boldsymbol{Z}_{t}\right) \odot \tilde{\boldsymbol{H}}_{t}
$$
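
&emsp;&emsp;The four GRU equations translate almost line by line into NumPy. The sketch below assumes a parameter dictionary `p` with randomly initialized, untrained weights; the function name `gru_step` and the dimensions are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X, H_prev, p):
    """One GRU time step following the equations for Z, R, the candidate state and H."""
    Z = sigmoid(X @ p["W_xz"] + H_prev @ p["W_hz"] + p["b_z"])  # update gate
    R = sigmoid(X @ p["W_xr"] + H_prev @ p["W_hr"] + p["b_r"])  # reset gate
    H_tilde = np.tanh(X @ p["W_xh"] + (R * H_prev) @ p["W_hh"] + p["b_h"])
    return Z * H_prev + (1.0 - Z) * H_tilde

rng = np.random.default_rng(0)
batch, d_in, d_h = 2, 4, 3
p = {}
for g in "zrh":  # hypothetical untrained parameters for the two gates and the candidate
    p[f"W_x{g}"] = rng.normal(scale=0.1, size=(d_in, d_h))
    p[f"W_h{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
    p[f"b_{g}"] = np.zeros(d_h)

X = rng.normal(size=(batch, d_in))
H_prev = rng.normal(size=(batch, d_h))
H = gru_step(X, H_prev, p)
print(H.shape)  # (2, 3)
```

&emsp;&emsp;When $Z_{t}$ is close to 1 the old state is carried over almost unchanged, and when it is close to 0 the state is replaced by the candidate.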

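## Long Short-Term Memory (LSTM)

1. **Forget gate** $F_{t}$: shrinks values toward 0.

2. **Input gate** $I_{t}$: decides whether to ignore the input data.

3. **Output gate** $O_{t}$: decides whether to use the hidden state.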

$$
\begin{aligned}
\boldsymbol{I}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x i}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h i}+\boldsymbol{b}_{i}\right) \\
\boldsymbol{F}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x f}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h f}+\boldsymbol{b}_{f}\right) \\
\boldsymbol{O}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x o}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h o}+\boldsymbol{b}_{o}\right)
\end{aligned}
$$

4. **Candidate memory cell**:

$$
\tilde{\boldsymbol{C}}_{t}=\tanh \left(\boldsymbol{X}_{t} \boldsymbol{W}_{x c}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h c}+\boldsymbol{b}_{c}\right)
$$

5. **Memory cell**:

$$
\boldsymbol{C}_{t}=\boldsymbol{F}_{t} \odot \boldsymbol{C}_{t-1}+\boldsymbol{I}_{t} \odot \tilde{\boldsymbol{C}}_{t}
$$

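&emsp;&emsp;`LSTM` differs from `RNN` and `GRU` in that it carries two states, $C$ and $H$. A `GRU` treats the current candidate and the accumulated past information as a trade-off, so it ties the two gates together: $Z_{t}$ plays the role of $F_{t}$ and $1 - Z_{t}$ replaces $I_{t}$, which is simple and blunt.

6. **Hidden state**: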

$$
\boldsymbol{H}_{t}=\boldsymbol{O}_{t} \odot \tanh \left(\boldsymbol{C}_{t}\right)
$$

&emsp;&emsp;The $\tanh$ keeps the values between `-1` and `+1`.
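
&emsp;&emsp;As with the GRU, the LSTM equations map directly onto NumPy. This is a minimal sketch under assumed names: the parameter dictionary `p`, the function `lstm_step`, and the random untrained weights are all illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X, H_prev, C_prev, p):
    """One LSTM time step: three gates, candidate cell, memory cell, hidden state."""
    I = sigmoid(X @ p["W_xi"] + H_prev @ p["W_hi"] + p["b_i"])  # input gate
    F = sigmoid(X @ p["W_xf"] + H_prev @ p["W_hf"] + p["b_f"])  # forget gate
    O = sigmoid(X @ p["W_xo"] + H_prev @ p["W_ho"] + p["b_o"])  # output gate
    C_tilde = np.tanh(X @ p["W_xc"] + H_prev @ p["W_hc"] + p["b_c"])
    C = F * C_prev + I * C_tilde
    H = O * np.tanh(C)
    return H, C

rng = np.random.default_rng(0)
batch, d_in, d_h = 2, 4, 3
p = {}
for g in "ifoc":  # hypothetical untrained parameters for the gates and the candidate cell
    p[f"W_x{g}"] = rng.normal(scale=0.1, size=(d_in, d_h))
    p[f"W_h{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
    p[f"b_{g}"] = np.zeros(d_h)

X = rng.normal(size=(batch, d_in))
H0 = np.zeros((batch, d_h))
C0 = np.zeros((batch, d_h))
H, C = lstm_step(X, H0, C0, p)
print(H.shape, C.shape)  # (2, 3) (2, 3)
```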

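## Evaluation Metric

- Perplexity

&emsp;&emsp;The quality of a language model can be measured by the average cross-entropy:

$$
\pi = \frac{1}{n} \sum_{t=1}^{n} -\log p(x_{t} \mid x_{t-1}, \cdots)
$$

&emsp;&emsp;Here $p$ is the probability predicted by the language model and $x_{t}$ is the actual word. For historical reasons, `NLP` reports the perplexity $\exp(\pi)$ instead, so 1 means a perfect model and infinity is the worst case.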
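
&emsp;&emsp;A short sketch of the computation, assuming we already have the probability the model assigned to each true token (the helper name `perplexity` and the sample inputs are made up for illustration):

```python
import numpy as np

def perplexity(true_token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    probs = np.asarray(true_token_probs, dtype=np.float64)
    avg_nll = -np.mean(np.log(probs))
    return np.exp(avg_nll)

print(perplexity([1.0, 1.0, 1.0]))       # a perfect model gives 1.0
print(perplexity(np.full(10, 1 / 75)))   # uniform guessing over 75 tokens gives about 75
```

&emsp;&emsp;Perplexity can thus be read as the effective number of equally likely choices the model is hesitating between at each step.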

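## Gradient Clipping

&emsp;&emsp;Computing the gradients over $T$ time steps in each iteration produces a chain of matrix multiplications of length $O(T)$ during backpropagation, which causes numerical instability. Gradient clipping is an effective guard against exploding gradients: if the norm of the gradient exceeds $\theta$, the gradient is projected back to norm $\theta$.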

$$
\mathbf{g} \leftarrow \min \left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}
$$

&emsp;&emsp;With $T$ time steps, the unrolled network behaves like a $T$-layer `MLP`, which is why it is prone to exploding gradients.
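
&emsp;&emsp;The clipping rule $\mathbf{g} \leftarrow \min \left(1, \theta / \|\mathbf{g}\|\right) \mathbf{g}$ can be sketched in NumPy as below. The helper `clip_gradients` is an assumption for illustration; it treats all parameter gradients as one long vector, and the small constant in the denominator only guards against division by zero.

```python
import numpy as np

def clip_gradients(grads, theta):
    """Scale a list of gradient arrays so that their joint L2 norm is at most theta."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, theta / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]  # joint norm is 13
clipped, norm_before = clip_gradients(grads, theta=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

&emsp;&emsp;Only the length of the gradient changes; its direction is preserved, and gradients that are already shorter than $\theta$ pass through untouched.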

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import time
sns.set()

In [2]:
def get_vocab(file, lower = False):
    with open(file, 'r') as fopen:
        data = fopen.read()  # read the entire file into memory
    if lower:
        data = data.lower()
    
    vocab = list(set(data))
    return data, vocab

def embed_to_control(data, vocab):
    onehot = np.zeros((len(data), len(vocab)), dtype = np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot

In [3]:
text, text_vocab = get_vocab('consumer.h', lower = False)
one_hot = embed_to_control(text, text_vocab)
print('len text: ', len(text))
print('len text_vocab: ', len(text_vocab))
print('one_hot shape: ', one_hot.shape)

len text:  15294
len text_vocab:  75
one_hot shape:  (15294, 75)


&emsp;&emsp;The `tanh` activation function is:

$$
\tanh x=\frac{\sinh x}{\cosh x}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$$

&emsp;&emsp;Its derivative is:

$$
(\tanh x)^{\prime}=\operatorname{sech}^{2} x=1-\tanh ^{2} x
$$

In [4]:
def tanh(x, grad=False):
    if grad:
        output = np.tanh(x)
        return (1.0 - np.square(output))
    else:
        return np.tanh(x)

$$
h_{t} = \tanh(W_{hh}h_{t-1} + W_{hx}x_{t} + b_{h})
$$

$$
o_{t} = \phi(W_{ho}h_{t} + b_{o})
$$

In [5]:
def forward_rnn_recurrent(x, prev_state, W_hx, W_hh, W_ho):
    # Input contribution (bias terms are omitted in this implementation).
    mul_hx = np.dot(x, W_hx.T)

    # Hidden-state part.
    mul_hh = np.dot(prev_state, W_hh.T)
    add_previous_now = mul_hx + mul_hh
    current_state = tanh(add_previous_now)

    # Output part (logits; softmax is applied later).
    mul_o = np.dot(current_state, W_ho.T)
    return (mul_hx, mul_hh, add_previous_now, current_state, mul_o)

In [6]:
def softmax(x):
    """
    x: 2-D array of logits. np.max(x) is the maximum over the whole array;
    subtracting it keeps np.exp from overflowing.
    """
    exp_scores = np.exp(x - np.max(x))
    return exp_scores / (np.sum(exp_scores, axis=1, keepdims=True) + 1e-8)

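&emsp;&emsp;The multi-class cross-entropy loss is:

$$
L=\frac{1}{N} \sum_{i} L_{i}=-\frac{1}{N} \sum_{i} \sum_{c=1}^{M} y_{i c} \log \left(p_{i c}\right)
$$

&emsp;&emsp;Here $M$ is the number of classes; $y_{ic}$ is an indicator (0 or 1) that equals 1 if the true class of sample $i$ is $c$ and 0 otherwise; and $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$.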

In [7]:
def cross_entropy(Y_hat, Y, epsilon=1e-12):
    Y_hat = np.clip(Y_hat, epsilon, 1. - epsilon)
    N = Y_hat.shape[0]
    return -np.sum(np.sum(Y * np.log(Y_hat + 1e-9))) / N
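
&emsp;&emsp;A useful reference point: a model that predicts the uniform distribution over $M$ classes has a cross-entropy of exactly $\log M$. The sketch below checks this for $M = 75$, the vocabulary size printed above, which gives about 4.32. The function here is a simplified stand-in for the `cross_entropy` cell, and the three sample labels are arbitrary.

```python
import numpy as np

def cross_entropy(Y_hat, Y, epsilon=1e-12):
    """Mean cross-entropy between predicted probabilities Y_hat and one-hot targets Y."""
    Y_hat = np.clip(Y_hat, epsilon, 1.0 - epsilon)
    return -np.sum(Y * np.log(Y_hat)) / Y_hat.shape[0]

M = 75
Y = np.eye(M)[[0, 10, 74]]          # three one-hot targets (arbitrary labels)
uniform = np.full((3, M), 1.0 / M)  # a model that knows nothing
loss_uniform = cross_entropy(uniform, Y)
print(loss_uniform, np.log(M))
```

&emsp;&emsp;A training loss well above this value suggests the model is confidently wrong rather than merely untrained.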

&emsp;&emsp;Backpropagation for the `RNN`:


In [8]:
def backward_multiply_gate(w, x, dz):
    """
    w shape = (75, 128)
    x shape = (64, 128)
    dz shape = (64, 75)
    """
    dw = np.dot(dz.T, x) # shape = (75, 128)
    dx = np.dot(w.T, dz.T) # shape = (128, 64)
    return dw, dx

def backward_add_gate(x1, x2, dz):
    dx1 = dz * np.ones_like(x1)
    dx2 = dz * np.ones_like(x2)
    return dx1, dx2

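def backward_rnn_recurrent(x, prev_state, W_hx, W_hh, W_ho, d_mu_o, saved_graph):
    # Gradients for a single time step. The gradient w.r.t. prev_state is returned,
    # but the training loop below does not pass it on to earlier time steps.
    mul_hx, mul_hh, add_previous_now, current_state, mul_o = saved_graph

    # Output layer: mul_o = current_state @ W_ho.T
    dW_ho, d_CurrentState = backward_multiply_gate(W_ho, current_state, d_mu_o)

    # Through tanh; d_CurrentState has shape (hidden_dim, batch), hence the transpose.
    dadd_previous_now = tanh(add_previous_now, True) * d_CurrentState.T

    # The addition passes the gradient unchanged to both branches.
    dmul_hh, dmul_hx = backward_add_gate(mul_hh, mul_hx, dadd_previous_now)
    dW_hh, dprev_state = backward_multiply_gate(W_hh, prev_state, dmul_hh)
    dW_hx, dx = backward_multiply_gate(W_hx, x, dmul_hx)

    return (dprev_state, dW_hx, dW_hh, dW_ho)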
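
&emsp;&emsp;Hand-written backward passes are easy to get wrong, so a finite-difference check is worth the few extra lines. The sketch below repeats `backward_multiply_gate` so that it runs on its own and compares its `dw` with a numerical gradient. The scalar test loss `sum(z * dz)` and the small shapes are assumptions chosen for the check.

```python
import numpy as np

def backward_multiply_gate(w, x, dz):
    """Backward pass of z = x @ w.T; dx is returned transposed, shape (features, batch)."""
    dw = np.dot(dz.T, x)
    dx = np.dot(w.T, dz.T)
    return dw, dx

rng = np.random.default_rng(0)
w = rng.normal(size=(5, 4))
x = rng.normal(size=(3, 4))
dz = rng.normal(size=(3, 5))  # pretend upstream gradient

def loss(w_, x_):
    # Scalar whose gradient w.r.t. z = x @ w.T is exactly dz.
    return np.sum(np.dot(x_, w_.T) * dz)

dw, dx = backward_multiply_gate(w, x, dz)

# Central differences, one weight entry at a time.
eps = 1e-6
num_dw = np.zeros_like(w)
for idx in np.ndindex(*w.shape):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[idx] += eps
    w_minus[idx] -= eps
    num_dw[idx] = (loss(w_plus, x) - loss(w_minus, x)) / (2 * eps)

max_err = np.max(np.abs(num_dw - dw))
print(max_err)
```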

In [9]:
epoch = 1000
learning_rate = 0.0001
batch_size = 64
sequence_length = int(12)
dimension = one_hot.shape[1]
print('dimension is :', dimension)
possible_batch_id = range(len(text) - sequence_length - 1)
hidden_dim = 128

W_hx = np.random.randn(hidden_dim, dimension) / np.sqrt(hidden_dim)
W_hh = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_ho = np.random.randn(dimension, hidden_dim) / np.sqrt(hidden_dim)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension))
    batch_y = np.zeros((batch_size, sequence_length, dimension))
    
    batch_id = random.sample(possible_batch_id, batch_size)  # randomly sample the start index of each sequence
    prev_s = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    layers = []
    out_logits = np.zeros((batch_size, sequence_length, dimension))
    for n in range(sequence_length):
        layers.append(forward_rnn_recurrent(batch_x[:, n, :], prev_s, W_hx, W_hh, W_ho))
        prev_s = layers[-1][3]
        out_logits[:, n, :] = layers[-1][-1]
    
    probs = softmax(out_logits.reshape((-1, dimension)))
    y = np.argmax(batch_y.reshape((-1, dimension)), axis=1)
    accuracy = np.mean(np.argmax(probs, axis=1) == y)
    
    loss = cross_entropy(probs, batch_y.reshape((-1, dimension)))
    
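    # Backward pass and parameter update.

    # 1. Gradient of the loss w.r.t. the logits:
    delta = probs  # softmax output (same array as probs, modified in place below)
    delta[range(y.shape[0]), y] -= 1  # softmax minus one-hot target: subtract 1 at the true label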
    delta = delta.reshape((batch_size, sequence_length, dimension))
    
    dW_hx = np.zeros(W_hx.shape)
    dW_hh = np.zeros(W_hh.shape)
    dW_ho = np.zeros(W_ho.shape)
    prev_state = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        d_mul_o = delta[:, n, :] # shape = (batch_size, dimension)
        
        dprev_s, dW_hx_t, dW_hh_t, dW_ho_t = backward_rnn_recurrent(batch_x[:,n,:], prev_state, 
                                                         W_hx, W_hh, W_ho, d_mul_o, layers[n])
        prev_state = layers[n][3]
        dW_hx += dW_hx_t
        dW_hh += dW_hh_t
        dW_ho += dW_ho_t
    
    # Update the parameters.
    W_hx -= learning_rate * dW_hx 
    W_hh -= learning_rate * dW_hh
    W_ho -= learning_rate * dW_ho
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss, accuracy))

dimension is : 75
epoch 50, loss 4.184205109308582, accuracy 0.11848958333333333
epoch 100, loss 3.885575498679348, accuracy 0.16536458333333334
epoch 150, loss 3.677142900928832, accuracy 0.171875
epoch 200, loss 3.453516756363543, accuracy 0.21354166666666666
epoch 250, loss 3.372584007720515, accuracy 0.23046875
epoch 300, loss 3.0169563063422404, accuracy 0.3411458333333333
epoch 350, loss 3.1033829138311897, accuracy 0.2903645833333333
epoch 400, loss 3.058355521375249, accuracy 0.3111979166666667
epoch 450, loss 2.966433976269963, accuracy 0.3098958333333333
epoch 500, loss 2.703835234343643, accuracy 0.3997395833333333
epoch 550, loss 2.6678355568941465, accuracy 0.3671875
epoch 600, loss 2.662716969275844, accuracy 0.390625
epoch 650, loss 2.6745754178191317, accuracy 0.35546875
epoch 700, loss 2.617491185412247, accuracy 0.3776041666666667
epoch 750, loss 2.5143628759044883, accuracy 0.39453125
epoch 800, loss 2.3443398005588794, accuracy 0.4375
epoch 850, loss 2.2565569735503

### PyTorch Implementation

In [10]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(RNN, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.rnn = nn.RNN(input_dim, hidden_dim, num_layers, batch_first=True, nonlinearity='tanh')
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # With None as the initial state, nn.RNN starts from an all-zero hidden state.
        out, hn = self.rnn(x, None)

        # Apply the output layer at every time step: (batch, seq_len, output_dim).
        out = self.fc(out)
        return out
rnn = RNN(input_dim = dimension, hidden_dim = hidden_dim, num_layers=1, output_dim = dimension)
print(rnn)

RNN(
  (rnn): RNN(75, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=75, bias=True)
)


In [11]:
# Defining softmax and the cross-entropy loss separately can be numerically unstable,
# so PyTorch provides a numerically stable function that combines both.
criterion = nn.CrossEntropyLoss()

learning_rate = 0.5
optimizer = torch.optim.SGD(rnn.parameters(), lr=learning_rate)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_y = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_id = random.sample(possible_batch_id, batch_size)  # randomly sample the start index of each sequence
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    # Convert the NumPy arrays to torch tensors and feed them to the network.
    output = rnn(torch.from_numpy(batch_x))  # torch.Size([64, 12, 75])
    label = torch.argmax(torch.from_numpy(batch_y).view(-1, dimension), dim=1)  # shape = (768,)
    
    accuracy = np.mean(torch.argmax(output.view(-1, dimension), axis=1).numpy() == label.numpy())
    
    optimizer.zero_grad()
    loss = criterion(output.view(-1, dimension), label)
    loss.backward()
    optimizer.step()
    
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss.item(), accuracy))
    

epoch 50, loss 3.419111967086792, accuracy 0.17317708333333334
epoch 100, loss 3.0476462841033936, accuracy 0.2591145833333333
epoch 150, loss 2.631009817123413, accuracy 0.34375
epoch 200, loss 2.3307220935821533, accuracy 0.4309895833333333
epoch 250, loss 1.7352737188339233, accuracy 0.5677083333333334
epoch 300, loss 1.6444469690322876, accuracy 0.5963541666666666
epoch 350, loss 1.6622191667556763, accuracy 0.5989583333333334
epoch 400, loss 1.369931697845459, accuracy 0.6471354166666666
epoch 450, loss 1.159907579421997, accuracy 0.7122395833333334
epoch 500, loss 1.1306272745132446, accuracy 0.7174479166666666
epoch 550, loss 1.1083282232284546, accuracy 0.7200520833333334
epoch 600, loss 1.1236608028411865, accuracy 0.72265625
epoch 650, loss 0.8815138936042786, accuracy 0.7747395833333334
epoch 700, loss 0.9854378700256348, accuracy 0.734375
epoch 750, loss 1.0347537994384766, accuracy 0.71875
epoch 800, loss 0.9369350075721741, accuracy 0.75390625
epoch 850, loss 0.9494147300

## Vanilla LSTM Code

### NumPy Implementation

&emsp;&emsp;First, following the formulas, create the weights for the three gates:

$$
\begin{aligned}
\boldsymbol{I}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x i}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h i}+\boldsymbol{b}_{i}\right) \\
\boldsymbol{F}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x f}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h f}+\boldsymbol{b}_{f}\right) \\
\boldsymbol{O}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x o}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h o}+\boldsymbol{b}_{o}\right)
\end{aligned}
$$

In [12]:
W_hi = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_hf = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_ho = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)

&emsp;&emsp;Next we need the **candidate memory cell**:

$$
\tilde{\boldsymbol{C}}_{t}=\tanh \left(\boldsymbol{X}_{t} \boldsymbol{W}_{x c}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h c}+\boldsymbol{b}_{c}\right)
$$

&emsp;&emsp;**Memory cell**:

$$
\boldsymbol{C}_{t}=\boldsymbol{F}_{t} \odot \boldsymbol{C}_{t-1}+\boldsymbol{I}_{t} \odot \tilde{\boldsymbol{C}}_{t}
$$

&emsp;&emsp;`LSTM` differs from `RNN` and `GRU` in that it carries two states, $C$ and $H$.

In [13]:
W_hc = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)

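&emsp;&emsp;**Hidden state**:

$$
\boldsymbol{H}_{t}=\boldsymbol{O}_{t} \odot \tanh \left(\boldsymbol{C}_{t}\right)
$$

&emsp;&emsp;The $\tanh$ keeps the values between `-1` and `+1`.

&emsp;&emsp;Next we create a weight matrix for the input. Strictly there should be four of them, $W_{xi}, W_{xf}, W_{xo}, W_{xc}$; for convenience a single shared matrix $U$ is used here.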

In [14]:
U = np.random.randn(hidden_dim, dimension)

&emsp;&emsp;Finally there is the output-layer weight, denoted $V$:

In [15]:
V = np.random.randn(dimension, hidden_dim) / np.sqrt(hidden_dim)

In [16]:
def sigmoid(x, grad=False):
    if grad:
        return sigmoid(x) * (1 - sigmoid(x))
    else:
        return 1 / (1 + np.exp(-x))

def forward_lstm_recurrent(x, c_state, h_state, U, W_hf, W_hi, W_hc, W_ho, V):

    # Input contribution, shared by all gates and the candidate cell.
    mul_u = np.dot(x, U.T)

    # Forget gate.
    mul_Wf = np.dot(h_state, W_hf.T)
    add_Wf = mul_u + mul_Wf
    f = sigmoid(add_Wf)

    # Input gate.
    mul_Wi = np.dot(h_state, W_hi.T)
    add_Wi = mul_u + mul_Wi
    i = sigmoid(add_Wi)

    # Candidate memory cell.
    mul_Wc = np.dot(h_state, W_hc.T)
    add_Wc = mul_u + mul_Wc
    c_hat = tanh(add_Wc)

    # Memory cell: f decides how much of the previous c_state to keep,
    # i decides how much of the candidate cell to take in.
    C = c_state * f + i * c_hat

    # Output gate.
    mul_Wo = np.dot(h_state, W_ho.T)
    add_Wo = mul_u + mul_Wo
    o = sigmoid(add_Wo)

    # Hidden state.
    h = o * tanh(C)

    mul_v = np.dot(h, V.T)
    return (mul_u, mul_Wf, add_Wf, mul_Wi, add_Wi, mul_Wc, add_Wc, C, mul_Wo, add_Wo, h, mul_v, i, o, c_hat)

In [17]:
def backward_recurrent(x, c_state, h_state, U, Wf, Wi, Wc, Wo, V, d_mul_v, saved_graph):
    mul_u, mul_Wf, add_Wf, mul_Wi, add_Wi, mul_Wc, add_Wc, C, mul_Wo, add_Wo, h, mul_v, i, o, c_hat = saved_graph
    dV, dh = backward_multiply_gate(V, h, d_mul_v)
    dC = tanh(C, True) * o * dh.T
    do = tanh(C) * dh.T
    dadd_Wo = sigmoid(add_Wo, True) * do
    dmul_u1, dmul_Wo = backward_add_gate(mul_u, mul_Wo, dadd_Wo)
    dWo, dprev_state = backward_multiply_gate(Wo, h_state, dmul_Wo)
    dc_hat = dC * i
    dadd_Wc = tanh(add_Wc, True) * dc_hat
    dmul_u2, dmul_Wc = backward_add_gate(mul_u, mul_Wc, dadd_Wc)
    dWc, dprev_state = backward_multiply_gate(Wc, h_state, dmul_Wc)
    di = dC * c_hat
    dadd_Wi = sigmoid(add_Wi, True) * di
    dmul_u3, dmul_Wi = backward_add_gate(mul_u, mul_Wi, dadd_Wi)
    dWi, dprev_state = backward_multiply_gate(Wi, h_state, dmul_Wi)
    df = dC * c_state
    dadd_Wf = sigmoid(add_Wf, True) * df
    dmul_u4, dmul_Wf = backward_add_gate(mul_u, mul_Wf, dadd_Wf)
    dWf, dprev_state = backward_multiply_gate(Wf, h_state, dmul_Wf)
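    # mul_u feeds all four branches, so U receives the sum of their gradients.
    dU, dx = backward_multiply_gate(U, x, dmul_u1 + dmul_u2 + dmul_u3 + dmul_u4)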
    return (dU, dWf, dWi, dWc, dWo, dV)

In [18]:
for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension))
    batch_y = np.zeros((batch_size, sequence_length, dimension))
    batch_id = random.sample(possible_batch_id, batch_size)
    
    prev_c = np.zeros((batch_size, hidden_dim))
    prev_h = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
        
    layers = []
    out_logits = np.zeros((batch_size, sequence_length, dimension))
    
    for n in range(sequence_length):
        layers.append(forward_lstm_recurrent(batch_x[:, n, :], prev_c, prev_h, U, W_hf, W_hi, W_hc, W_ho, V))
        
        prev_c = layers[-1][7]
        prev_h = layers[-1][10]
        
        out_logits[:, n, :] = layers[-1][-4]
        
    probs = softmax(out_logits.reshape((-1, dimension)))
    y = np.argmax(batch_y.reshape((-1, dimension)), axis=1)
    accuracy = np.mean(np.argmax(probs, axis=1) == y)
    loss = cross_entropy(probs, batch_y.reshape((-1, dimension)))
    
    delta = probs
    delta[range(y.shape[0]), y] -= 1
    delta = delta.reshape((batch_size, sequence_length, dimension))
    
    dU = np.zeros(U.shape)
    dV = np.zeros(V.shape)
    dW_hf = np.zeros(W_hf.shape)
    dW_hi = np.zeros(W_hi.shape)
    dW_hc = np.zeros(W_hc.shape)
    dW_ho = np.zeros(W_ho.shape)
    
    prev_c = np.zeros((batch_size, hidden_dim))
    prev_h = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        d_mul_v = delta[:, n, :]
        dU_t, dWf_t, dWi_t, dWc_t, dWo_t, dV_t = backward_recurrent(batch_x[:,n,:], prev_c, prev_h, U, W_hf, W_hi, 
                                                                    W_hc, W_ho, V, d_mul_v, layers[n])
        prev_c = layers[n][7]
        prev_h = layers[n][10]
        dU += dU_t
        dV += dV_t
        dW_hf += dWf_t
        dW_hi += dWi_t
        dW_hc += dWc_t
        dW_ho += dWo_t
    U -= learning_rate * dU
    V -= learning_rate * dV
    W_hf -= learning_rate * dW_hf
    W_hi -= learning_rate * dW_hi
    W_hc -= learning_rate * dW_hc
    W_ho -= learning_rate * dW_ho
    if (i+1) % 50 == 0:
        print('epoch {}, loss {}, accuracy {}'.format(i+1, loss, accuracy))



epoch 50, loss 19.208023533434275, accuracy 0.09244791666666667
epoch 100, loss 20.480660717090018, accuracy 0.045572916666666664
epoch 150, loss 20.467480289638072, accuracy 0.05859375
epoch 200, loss 20.614337940220782, accuracy 0.044270833333333336
epoch 250, loss 20.351349475323413, accuracy 0.06770833333333333
epoch 300, loss 20.495700816200294, accuracy 0.0703125
epoch 350, loss 20.695284218999248, accuracy 0.036458333333333336
epoch 400, loss 20.522661789722466, accuracy 0.036458333333333336
epoch 450, loss 20.683593274968842, accuracy 0.06380208333333333
epoch 500, loss 20.507914893910957, accuracy 0.0625
epoch 550, loss 20.380189429453395, accuracy 0.06901041666666667
epoch 600, loss 20.59945398751996, accuracy 0.041666666666666664
epoch 650, loss 20.668302103213716, accuracy 0.044270833333333336
epoch 700, loss 20.69528421900098, accuracy 0.015625
epoch 750, loss 20.58735574856191, accuracy 0.018229166666666668
epoch 800, loss 20.722266336613327, accuracy 0.009114583333333334

&emsp;&emsp;The loss here sits near $20.72 \approx -\log(10^{-9})$, the floor created by the `1e-9` added inside `cross_entropy`: the network assigns essentially zero probability to the correct character at almost every position, so this run diverged instead of learning. One likely cause is the step size. At this point in the notebook `learning_rate` is still the `0.5` set for PyTorch SGD in In [11], which is very large for gradients summed over all $64 \times 12 = 768$ predictions. Resetting it to a small value before the loop, as In [20] does for the GRU, is the first thing to check.

### PyTorch Implementation

In [19]:
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.LSTM = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # With None as the initial state, nn.LSTM starts h0 and c0 at zeros.
        out, (hn, cn) = self.LSTM(x, None)  # out holds the hidden state at every time step

        # To predict from the last time step only, use: out = self.fc(out[:, -1, :])
        out = self.fc(out)

        return out
lstm = LSTM(input_dim = dimension, hidden_dim = hidden_dim, num_layers=1, output_dim = dimension)
print(lstm)


# Defining softmax and the cross-entropy loss separately can be numerically unstable,
# so PyTorch provides a numerically stable function that combines both.
criterion = nn.CrossEntropyLoss()

learning_rate = 0.01
optimizer = torch.optim.Adam(lstm.parameters(), lr=learning_rate)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_y = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_id = random.sample(possible_batch_id, batch_size)  # randomly sample the start index of each sequence
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    # Convert the NumPy arrays to torch tensors and feed them to the network.
    output = lstm(torch.from_numpy(batch_x))  # torch.Size([64, 12, 75])
    label = torch.argmax(torch.from_numpy(batch_y).view(-1, dimension), dim=1)  # shape = (768,)
    
    accuracy = np.mean(torch.argmax(output.view(-1, dimension), axis=1).numpy() == label.numpy())
    
    optimizer.zero_grad()
    loss = criterion(output.view(-1, dimension), label)
    loss.backward()
    optimizer.step()
    
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss.item(), accuracy)) 

LSTM(
  (LSTM): LSTM(75, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=75, bias=True)
)
epoch 50, loss 1.8513680696487427, accuracy 0.50390625
epoch 100, loss 0.9603528380393982, accuracy 0.7486979166666666
epoch 150, loss 0.6873628497123718, accuracy 0.7864583333333334
epoch 200, loss 0.5569496154785156, accuracy 0.8307291666666666
epoch 250, loss 0.7212016582489014, accuracy 0.7890625
epoch 300, loss 0.5719035267829895, accuracy 0.80859375
epoch 350, loss 0.5459523797035217, accuracy 0.8177083333333334
epoch 400, loss 0.447481244802475, accuracy 0.8658854166666666
epoch 450, loss 0.4484831392765045, accuracy 0.8502604166666666
epoch 500, loss 0.4738970994949341, accuracy 0.8385416666666666
epoch 550, loss 0.46853509545326233, accuracy 0.8411458333333334
epoch 600, loss 0.47863996028900146, accuracy 0.8372395833333334
epoch 650, loss 0.4469071924686432, accuracy 0.8528645833333334
epoch 700, loss 0.4506141245365143, accuracy 0.8385416666666666
epoch 750, loss 0.4

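## Vanilla GRU Code

### NumPy Implementation

1. **Update gate** $Z_{t}$: a mechanism for paying attention to important observations.

2. **Reset gate** $R_{t}$: a mechanism for forgetting.

$$
Z_{t} = \sigma(X_{t} W_{xz} + H_{t-1} W_{hz} + b_{z}) \\
R_{t} = \sigma(X_{t} W_{xr} + H_{t-1} W_{hr} + b_{r})
$$

3. **Candidate hidden state**:

&emsp;&emsp;Next, the reset gate is used to compute the candidate hidden state: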

$$
\tilde{\boldsymbol{H}}_{t}=\tanh \left(\boldsymbol{X}_{t} \boldsymbol{W}_{x h}+\left(\boldsymbol{R}_{t} \odot \boldsymbol{H}_{t-1}\right) \boldsymbol{W}_{h h}+\boldsymbol{b}_{h}\right)
$$

&emsp;&emsp;Compared with a plain `RNN`: if we ignore $R_{t}$ (that is, treat it as all ones), this is exactly the earlier `RNN` update.

4. **Hidden state**:

$$
\boldsymbol{H}_{t}=\mathbf{Z}_{t} \odot \boldsymbol{H}_{t-1}+\left(1-\boldsymbol{Z}_{t}\right) \odot \tilde{\boldsymbol{H}}_{t}
$$

In [20]:
learning_rate = 0.001
U = np.random.randn(hidden_dim, dimension) / np.sqrt(hidden_dim)
Wz = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
Wr = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
Wh = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
V = np.random.randn(dimension, hidden_dim) / np.sqrt(dimension)

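def forward_gru_recurrent(x, h_state, U, Wz, Wr, Wh, V):
    # Input contribution, shared by both gates and the candidate state.
    mul_u = np.dot(x, U.T)

    # Update gate.
    mul_Wz = np.dot(h_state, Wz.T)
    add_Wz = mul_u + mul_Wz
    z = sigmoid(add_Wz)

    # Reset gate.
    mul_Wr = np.dot(h_state, Wr.T)
    add_Wr = mul_u + mul_Wr
    r = sigmoid(add_Wr)

    # Candidate hidden state.
    mul_Wh = np.dot(h_state * r, Wh.T)
    add_Wh = mul_u + mul_Wh
    h_hat = tanh(add_Wh)

    # Hidden state. Here z weights the candidate, so the roles of z and 1 - z are
    # swapped relative to the formula above; the two conventions are equivalent.
    h = (1 - z) * h_state + z * h_hat
    mul_v = np.dot(h, V.T)
    return (mul_u, mul_Wz, add_Wz, z, mul_Wr, add_Wr, r, mul_Wh, add_Wh, h_hat, h, mul_v)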

def backward_multiply_gate(w, x, dz):
    dW = np.dot(dz.T, x)
    dx = np.dot(w.T, dz.T)
    return dW, dx

def backward_gru_recurrent(x, h_state, U, Wz, Wr, Wh, V, d_mul_v, saved_graph):
    mul_u, mul_Wz, add_Wz, z, mul_Wr, add_Wr, r, mul_Wh, add_Wh, h_hat, h, mul_v = saved_graph
    dV, dh = backward_multiply_gate(V, h, d_mul_v)
    dh_hat = z * dh.T
    dadd_Wh = tanh(add_Wh, True) * dh_hat
    dmul_u1, dmul_Wh = backward_add_gate(mul_u, mul_Wh, dadd_Wh)
    dWh, dprev_state = backward_multiply_gate(Wh, h_state * r, dmul_Wh)
    dr = dprev_state * h_state.T
    dadd_Wr = sigmoid(add_Wr, True) * dr.T
    dmul_u2, dmul_Wr = backward_add_gate(mul_u, mul_Wr, dadd_Wr)
    dWr, dprev_state = backward_multiply_gate(Wr, h_state, dmul_Wr)
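    dz = (h_hat - h_state) * dh.T  # chain rule: dh/dz = h_hat - h_state, times the upstream gradient
    dadd_Wz = sigmoid(add_Wz, True) * dz
    dmul_u3, dmul_Wz = backward_add_gate(mul_u, mul_Wz, dadd_Wz)
    dWz, dprev_state = backward_multiply_gate(Wz, h_state, dmul_Wz)
    # mul_u feeds all three branches, so U receives the sum of their gradients.
    dU, dx = backward_multiply_gate(U, x, dmul_u1 + dmul_u2 + dmul_u3)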
    return (dU, dWz, dWr, dWh, dV)

In [21]:
for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension))
    batch_y = np.zeros((batch_size, sequence_length, dimension))
    batch_id = random.sample(possible_batch_id, batch_size)
    
    prev_h = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        batch_x[:, n, :] = one_hot[id1, :]
        batch_y[:, n, :] = one_hot[id2, :]
    
    layers = []
    out_logits = np.zeros((batch_size, sequence_length, dimension))
    
    for n in range(sequence_length):
        layers.append(forward_gru_recurrent(batch_x[:,n,:], prev_h, U, Wz, Wr, Wh, V))
        prev_h = layers[-1][-2]
        out_logits[:, n, :] = layers[-1][-1]
        
    probs = softmax(out_logits.reshape((-1, dimension)))
    y = np.argmax(batch_y.reshape((-1, dimension)), axis=1)
    accuracy = np.mean(np.argmax(probs, axis=1) == y)
    
    loss = cross_entropy(probs, batch_y.reshape((-1, dimension)))
    
    delta = probs
    delta[range(y.shape[0]), y] -= 1
    delta = delta.reshape((batch_size, sequence_length, dimension))
    dU = np.zeros(U.shape)
    dV = np.zeros(V.shape)
    dWz = np.zeros(Wz.shape)
    dWr = np.zeros(Wr.shape)
    dWh = np.zeros(Wh.shape)
    
    prev_h = np.zeros((batch_size, hidden_dim))
    for n in range(sequence_length):
        d_mul_v = delta[:, n, :]
        dU_t, dWz_t, dWr_t, dWh_t, dV_t = backward_gru_recurrent(batch_x[:,n,:], prev_h, 
                                                                    U, Wz, Wr, Wh, V, d_mul_v, layers[n])
        prev_h = layers[n][-2]
        dU += dU_t
        dV += dV_t
        dWz += dWz_t
        dWr += dWr_t
        dWh += dWh_t
    U -= learning_rate * dU
    V -= learning_rate * dV
    Wz -= learning_rate * dWz
    Wr -= learning_rate * dWr
    Wh -= learning_rate * dWh
    if (i+1) % 50 == 0:
        print('epoch {}, loss {}, accuracy {}'.format(i+1, loss, accuracy))

epoch 50, loss 4.220576144917128, accuracy 0.19401041666666666
epoch 100, loss 4.163034346392908, accuracy 0.15885416666666666
epoch 150, loss 3.9401398859344394, accuracy 0.13671875
epoch 200, loss 4.029454087610817, accuracy 0.13671875
epoch 250, loss 4.1403706628203585, accuracy 0.13541666666666666
epoch 300, loss 4.068431923794667, accuracy 0.13151041666666666
epoch 350, loss 3.9290011396953695, accuracy 0.109375
epoch 400, loss 3.7876088091245994, accuracy 0.11197916666666667
epoch 450, loss 3.6983184190809077, accuracy 0.1015625
epoch 500, loss 3.6330332433132972, accuracy 0.12109375
epoch 550, loss 3.7198871597792142, accuracy 0.08984375
epoch 600, loss 3.6618738580789203, accuracy 0.07682291666666667
epoch 650, loss 3.6028788817038997, accuracy 0.09375
epoch 700, loss 3.6629171562385268, accuracy 0.11588541666666667
epoch 750, loss 3.580059198173091, accuracy 0.09765625
epoch 800, loss 3.5696795531308645, accuracy 0.10416666666666667
epoch 850, loss 3.486513414060022, accuracy 

### PyTorch Implementation

In [22]:
import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(GRU, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.GRU = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # With None as the initial state, nn.GRU starts from an all-zero hidden state.
        out, hn = self.GRU(x, None)

        # Apply the output layer at every time step: (batch, seq_len, output_dim).
        out = self.fc(out)
        return out
    
gru = GRU(input_dim = dimension, hidden_dim = hidden_dim, num_layers=1, output_dim = dimension)
print(gru)


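# Defining the softmax operation and the cross-entropy loss separately can be numerically unstable.
# PyTorch therefore provides a single, numerically stable function that combines softmax and cross-entropy.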
criterion = nn.CrossEntropyLoss()

learning_rate = 0.01
optimizer = torch.optim.Adam(gru.parameters(), lr=learning_rate)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_y = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_id = random.sample(possible_batch_id, batch_size)  # Randomly sample the start index of each sequence in the batch.
    # prev_s = np.zeros((batch_size, hidden_dim))
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    # Convert the NumPy arrays to torch tensors and feed them to the network.
    output = gru(torch.from_numpy(batch_x))  # torch.Size([64, 12, 75])
    label = torch.argmax(torch.from_numpy(batch_y).view(-1, dimension), dim=1)  # shape = 768 (64 * 12)
    
    accuracy = np.mean(torch.argmax(output.view(-1, dimension), dim=1).numpy() == label.numpy())
    
    optimizer.zero_grad()
    loss = criterion(output.view(-1, dimension), label)
    loss.backward()
    optimizer.step()
    
    if (i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i + 1, loss.item(), accuracy))

GRU(
  (GRU): GRU(75, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=75, bias=True)
)
epoch 50, loss 1.5683003664016724, accuracy 0.5911458333333334
epoch 100, loss 0.7187840938568115, accuracy 0.80859375
epoch 150, loss 0.5786863565444946, accuracy 0.8098958333333334
epoch 200, loss 0.7100645899772644, accuracy 0.7903645833333334
epoch 250, loss 0.5751112699508667, accuracy 0.8177083333333334
epoch 300, loss 0.4777928292751312, accuracy 0.8489583333333334
epoch 350, loss 0.5105846524238586, accuracy 0.8411458333333334
epoch 400, loss 0.47605299949645996, accuracy 0.8346354166666666
epoch 450, loss 0.4262023866176605, accuracy 0.85546875
epoch 500, loss 0.42007994651794434, accuracy 0.8502604166666666
epoch 550, loss 0.37748900055885315, accuracy 0.875
epoch 600, loss 0.4321237802505493, accuracy 0.85546875
epoch 650, loss 0.40571892261505127, accuracy 0.859375
epoch 700, loss 0.4379006326198578, accuracy 0.8658854166666666
epoch 750, loss 0.40357568860054016, accu
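
The comment in the training code above, about why softmax and cross-entropy should not be computed separately, can be illustrated with a small NumPy sketch. The helper names `naive_cross_entropy` and `stable_cross_entropy` are hypothetical and exist only for this illustration. The stable version uses the common log-sum-exp trick; it is meant to show the idea and is not a description of how PyTorch implements `nn.CrossEntropyLoss` internally.

```python
import numpy as np

def naive_cross_entropy(logits, label):
    """Softmax first, then log: exp() overflows for large logits."""
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(probs[label])

def stable_cross_entropy(logits, label):
    """Fused log-softmax using the log-sum-exp trick (illustrative helper)."""
    shifted = logits - np.max(logits)  # largest entry becomes 0, so exp() cannot overflow
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

small = np.array([2.0, 1.0, 0.1])
large = np.array([1000.0, 0.0, -1000.0])

# Silence the overflow warnings that the naive version triggers on purpose.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive_small = naive_cross_entropy(small, 0)
    naive_large = naive_cross_entropy(large, 0)

stable_small = stable_cross_entropy(small, 0)
stable_large = stable_cross_entropy(large, 0)

print("small logits:", naive_small, stable_small)
print("large logits:", naive_large, stable_large)
```

For moderate logits the two versions agree. For very large logits the naive version produces `nan`, while the fused version still returns a finite loss.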

## Summary

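1. Vanishing/exploding gradients in an `RNN` do not mean the same thing as in an ordinary `MLP` or a deep `CNN`. In an `MLP/CNN`, different layers have different parameters, each with its own gradient. In an `RNN`, the same weights are shared across all time steps, so the final gradient is the sum of the per-time-step gradients: $g = \sum_{t} g_{t}$.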

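2. As item 1 implies, **the total gradient in an RNN does not vanish**. Even if the gradient gets weaker the further it propagates, only the long-range gradient vanishes; the short-range gradient does not, so the sum of all gradients does not vanish either. What vanishing gradients really means for an `RNN` is: **the gradient is dominated by short-range terms, which makes it hard for the model to learn long-range dependencies**.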

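3. **Gradients in an `LSTM` propagate along many paths.** The path $c_{t-1} \rightarrow c_{t}=f_{t} \odot c_{t-1}+i_{t} \odot \hat{c_{t}}$ involves only element-wise multiplication and addition, so its gradient flow is the most stable. On the other paths (for example $c_{t-1} \rightarrow h_{t-1} \rightarrow i_{t} \rightarrow c_{t}$), the gradient flow is similar to that of a plain `RNN`: the same weight matrices are still multiplied together repeatedly.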

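4. When the `LSTM` was first proposed it had no forget gate, which is equivalent to $f_{t} = 1$. On the short path that connects $c_{t-1} \rightarrow c_{t}$ directly, $\frac{dl}{dc_{t}}$ is then passed to $\frac{dl}{dc_{t-1}}$ without loss, so the gradient on this path flows unobstructed and does not vanish, much like the residual connection in `ResNet`.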

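5. On the **other paths**, however, the gradient flow of an `LSTM` is not very different from that of a plain `RNN`, and it can still explode or vanish. Because the total long-range gradient = the sum of the long-range gradients over all paths, it is enough for one long-range path (the highway described above) to keep its gradient: even if the gradient vanishes on every other long-range path, the total long-range gradient does not vanish (normal gradient + vanished gradient = normal gradient). The `LSTM` therefore rescues the **overall long-range gradient** by fixing the gradient problem on **a single path**.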

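6. Likewise, because the total long-range gradient = the sum of the long-range gradients over all paths, the gradient on the highway may be stable while the gradient on other paths explodes. In that case the total long-range gradient = normal gradient + exploding gradient = exploding gradient, **so an LSTM can still suffer from exploding gradients**. However, the other paths in an LSTM are very rugged: compared with a plain RNN they pass through many more activation functions (whose derivatives are at most 1), so **exploding gradients occur far less often in an LSTM**. In practice, exploding gradients are usually handled with gradient clipping.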

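7. For the `LSTM` with a forget gate that is commonly used today, the analysis in item `6` still holds, while item `5` splits into two cases. In the first, the forget gate is close to `1` (for example, the `forget bias` is often initialized to a fairly large positive number so that the forget gate saturates), and the long-range gradient does not vanish. In the second, the forget gate is close to `0`, but here the model is blocking the gradient flow on purpose; this is a `feature`, not a `bug` (for example, given a sentiment-analysis sample "A, but B", the model may set the forget gate to 0 after reading "but" and discard A, which is reasonable). Of course, $f$ often lies somewhere between 0 and 1, and in that case one can only say that the LSTM mitigates (rather than solves) the vanishing gradient problem.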
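
Items 1 and 2 can be checked numerically on a toy model. The sketch below assumes a scalar RNN $h_{t}=\tanh(w h_{t-1}+x_{t})$ with loss $l=h_{T}$; this model, the function name `final_state_and_contributions`, and the chosen values of `w` and `xs` are all illustrative assumptions, not part of the implementation above. It splits $\frac{dl}{dw}$ into one contribution per time step and compares their sum with a finite-difference estimate.

```python
import math

def final_state_and_contributions(w, xs, h0=0.0):
    """Scalar RNN h_t = tanh(w * h_{t-1} + x_t) with loss l = h_T.

    Returns h_T and a list g, where g[k-1] is the part of dl/dw that
    enters through the use of the shared weight w at time step k.
    """
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    T = len(xs)
    contributions = []
    for k in range(1, T + 1):
        # Direct effect of w on h_k, holding h_{k-1} fixed.
        direct = (1.0 - hs[k] ** 2) * hs[k - 1]
        # Product of dh_j/dh_{j-1} that carries this effect from h_k to h_T.
        carry = 1.0
        for j in range(k + 1, T + 1):
            carry *= (1.0 - hs[j] ** 2) * w
        contributions.append(carry * direct)
    return hs[-1], contributions

w = 0.5
xs = [1.0] * 20
h_T, contributions = final_state_and_contributions(w, xs)
total_grad = sum(contributions)

# Central finite difference as an independent check of dl/dw.
eps = 1e-6
numeric_grad = (final_state_and_contributions(w + eps, xs)[0]
                - final_state_and_contributions(w - eps, xs)[0]) / (2 * eps)

print("total gradient:        ", total_grad)
print("finite difference:     ", numeric_grad)
print("contribution of step T:", contributions[-1])
print("contribution of step 2:", contributions[1])
```

The contribution of step 1 is exactly zero here only because `h0 = 0`, so step 2 is used as the long-range example. Its contribution is many orders of magnitude smaller than that of the last step, yet the total gradient is far from zero: the gradient has not vanished, it is simply dominated by the nearby time steps.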

- [Written Memories: Understanding, Deriving and Extending the LSTM](https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html)

- [Why LSTMs Stop Your Gradients From Vanishing: A View from the Backwards Pass](https://weberna.github.io/blog/2017/11/15/LSTM-Vanishing-Gradients.html)
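
The role of the forget gate on the direct cell-state path (items 4 and 7) can be sketched in a few lines. The simplification assumed here is that the gates are treated as constants, so only the path $c_{t-1} \rightarrow c_{t}$ is considered and the gradient along it is just the product of the forget-gate values; the other paths through $h_{t-1}$ are ignored. The function name `cell_path_gradient` and the gate values are hypothetical.

```python
def cell_path_gradient(forget_gates):
    """Gradient dc_T/dc_0 along the direct path c_t = f_t * c_{t-1} + i_t * c~_t,
    treating every gate value as a constant (scalar case)."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

steps = 50
no_forget_gate = cell_path_gradient([1.0] * steps)   # original LSTM, f_t = 1
nearly_open = cell_path_gradient([0.99] * steps)     # forget gate saturated near 1
half_open = cell_path_gradient([0.5] * steps)        # f in the middle of [0, 1]
one_reset = cell_path_gradient([1.0] * 25 + [0.0] + [1.0] * 24)  # a single f_t = 0

print("f = 1 everywhere:   ", no_forget_gate)
print("f = 0.99 everywhere:", nearly_open)
print("f = 0.5 everywhere: ", half_open)
print("one step with f = 0:", one_reset)
```

With $f_{t}=1$ the gradient passes through unchanged, and with a gate near 1 most of it survives 50 steps. Intermediate gate values shrink it geometrically, and a single gate at 0 cuts the path completely, which matches the "mitigates rather than solves" conclusion in item 7.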

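## References

- [How does an LSTM avoid vanishing and exploding gradients?](https://www.zhihu.com/question/34878706/answer/665429718)
- [An in-depth article on vanishing/exploding gradients in RNNs](https://kexue.fm/archives/7888)
- [huseinzol05/Machine-Learning-Numpy](https://github.com/huseinzol05/Machine-Learning-Numpy)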