## torch.nn.GRU


在`PyTorch`中，`GRU`的`API`地址为：[torch.nn.GRU(*args, **kwargs)](https://pytorch.org/docs/stable/generated/torch.nn.GRU.html?highlight=gru#torch.nn.GRU)

$$
\begin{aligned}
&r_{t}=\sigma\left(W_{i r} x_{t}+b_{i r}+W_{h r} h_{(t-1)}+b_{h r}\right) \\
&n_{t}=\tanh \left(W_{i n} x_{t}+b_{i n}+r_{t} *\left(W_{h n} h_{(t-1)}+b_{h n}\right)\right) \\
&z_{t}=\sigma\left(W_{i z} x_{t}+b_{i z}+W_{h z} h_{(t-1)}+b_{h z}\right) \\
&h_{t}=\left(1-z_{t}\right) * n_{t}+z_{t} * h_{(t-1)}
\end{aligned}
$$

`GRU`大概是在`14`年或者`15`年提出来的，`LSTM`在`2000`年左右提出来的。在`GRU`中没有$c_{t}$这个东西的。在`GRU`中有两个门，一个叫做`reset`，一个叫做更新门$z_{t}$。

由于没有$c_{t}$，所以在做初始状态的时候，我们只需要提供$h_{0}$作为初始状态。

相比于`LSTM`，没有$c_{t}$，也就没有输入门对于全局信息的控制，取而代之的是一个重制门对于上一时刻信息的提取，对于当前时刻的信息则并不用一个门控制，而是直接给到$n_{t}$，也就是:

$$
n_{t}=\tanh \left(W_{i n} x_{t}+b_{i n}+r_{t} *\left(W_{h n} h_{(t-1)}+b_{h n}\right)\right)
$$

在相同输入输出模型下，gru的模型参数大致是lstm的四分之三倍。可以验证一下:

In [1]:
import torch
import torch.nn as nn
lstm_layer = nn.LSTM(3, 5)
gru_layer = nn.GRU(3, 5)

In [2]:
sum(p.numel() for p in lstm_layer.parameters())

200

In [3]:
sum(p.numel() for p in gru_layer.parameters())

150

In [4]:
bs, T, i_size, h_size = 2, 3, 4, 5
input_data = torch.randn(bs, T, i_size)  # 输入序列

h0 = torch.randn(bs, h_size)  # 初始值，不需要训练

In [5]:
gru_layer = nn.GRU(i_size, h_size, batch_first=True)
output, h_final = gru_layer(input_data, h0.unsqueeze(0))
print(output)

tensor([[[-0.1355,  0.4056,  0.3317,  0.4589, -0.4083],
         [-0.1811,  0.7739,  0.2401,  0.1952, -0.1132],
         [-0.1386,  0.8449,  0.1788, -0.0816, -0.0403]],

        [[-0.0475, -0.0651,  0.4481,  0.2246,  0.4372],
         [ 0.1057, -0.1324,  0.3786,  0.1656,  0.5676],
         [ 0.0160,  0.4196,  0.3794, -0.0798,  0.3670]]],
       grad_fn=<TransposeBackward1>)


In [6]:
for k, v in gru_layer.named_parameters():
    print(k, v.shape)

weight_ih_l0 torch.Size([15, 4])
weight_hh_l0 torch.Size([15, 5])
bias_ih_l0 torch.Size([15])
bias_hh_l0 torch.Size([15])


### 实现GRU前向传播

In [7]:
def gru_forward(input_data, init_state, w_ih, w_hh, b_ih, b_hh):
    prev_h = init_state
    bs, T, i_size = input_data.shape
    h_size = w_ih.shape[0] // 3
    
    # 对权重扩维，复制成batch size倍。
    batch_w_ih = w_ih.unsqueeze(0).tile(bs, 1, 1)
    batch_w_hh = w_hh.unsqueeze(0).tile(bs, 1, 1)
    
    output = torch.zeros(bs, T, h_size)  # GRU网络的输出状态序列
    
    for t in range(T):
        x = input_data[:, t, :]  # t时刻GRU cell的输入特征向量， [bs, i_size]
        
        w_times_x = torch.bmm(batch_w_ih, x.unsqueeze(-1)).squeeze(-1)  # [bs, 3*h_size, 1]
        
        w_times_h_prev = torch.bmm(batch_w_hh, prev_h.unsqueeze(-1)).squeeze(-1) 
        
        # 重置门
        r_t = torch.sigmoid(w_times_x[:, :h_size] + w_times_h_prev[:, :h_size] + b_ih[:h_size]+ b_hh[:h_size])
        # 更新门
        z_t = torch.sigmoid(w_times_x[:, h_size:2*h_size] + w_times_h_prev[:, h_size:2*h_size] \
                            + b_ih[h_size:2*h_size]+ b_hh[h_size:2*h_size])
        
        # 候选状态
        n_t = torch.tanh(w_times_x[:, 2*h_size:3*h_size] + b_ih[2*h_size:3*h_size] + \
                         r_t * (w_times_h_prev[:, 2*h_size:3*h_size] + b_hh[2*h_size:3*h_size]))
        
        prev_h = (1 - z_t) * n_t + z_t * prev_h  # 增量更新隐含状态
        
        output[:, t, :] = prev_h
        
    return output, prev_h

In [8]:
output, h_final = gru_forward(input_data, h0, gru_layer.weight_ih_l0, gru_layer.weight_hh_l0, \
                             gru_layer.bias_ih_l0, gru_layer.bias_hh_l0)
print(output)

tensor([[[-0.1355,  0.4056,  0.3317,  0.4589, -0.4083],
         [-0.1811,  0.7739,  0.2401,  0.1952, -0.1132],
         [-0.1386,  0.8449,  0.1788, -0.0816, -0.0403]],

        [[-0.0475, -0.0651,  0.4481,  0.2246,  0.4372],
         [ 0.1057, -0.1324,  0.3786,  0.1656,  0.5676],
         [ 0.0160,  0.4196,  0.3794, -0.0798,  0.3670]]], grad_fn=<CopySlices>)


## Vanilla-GRU-代码

### Numpy实现

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import time
sns.set()

def get_vocab(file, lower = False):
    with open(file, 'r') as fopen:
        data = fopen.read() # 将文件中的所有数据读取进来。
    if lower:
        data = data.lower()
    
    vocab = list(set(data))
    return data, vocab

def embed_to_control(data, vocab):
    onehot = np.zeros((len(data), len(vocab)), dtype = np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot


text, text_vocab = get_vocab('../consumer.h', lower = False)
one_hot = embed_to_control(text, text_vocab)
print('len text: ', len(text))
print('len text_vocab: ', len(text_vocab))
print('one_hot shape: ', one_hot.shape)

epoch = 1000
learning_rate = 0.0001
batch_size = 64
sequence_length = int(12)
dimension = one_hot.shape[1]
print('dimension is :', dimension)
possible_batch_id = range(len(text) - sequence_length - 1)
hidden_dim = 128

len text:  15294
len text_vocab:  75
one_hot shape:  (15294, 75)
dimension is : 75


&emsp;&emsp;`tanh`激活函数为:

$$
\tanh x=\frac{\sinh x}{\cosh x}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$$

&emsp;&emsp;导数为:

$$
(\tanh x)^{\prime}=\operatorname{sech}^{2} x=1-\tanh ^{2} x
$$

In [10]:
def tanh(x, grad=False):
    if grad:
        output = np.tanh(x)
        return (1.0 - np.square(output))
    else:
        return np.tanh(x)

In [11]:
def softmax(x):
    """
    x: np.max(x)取的是二维数组x中的最大值。
    """
    exp_scores = np.exp(x - np.max(x))
    return exp_scores / (np.sum(exp_scores, axis=1, keepdims=True) + 1e-8)

&emsp;&emsp;多分类的交叉熵损失如下:

$$
L=\frac{1}{N} \sum_{i} L_{i}=-\frac{1}{N} \sum_{i} \sum_{c=1}^{M} y_{i c} \log \left(p_{i c}\right)
$$

&emsp;&emsp;其中$M$表示类别的数量，$y_{ic}$是符号函数(0或者1), 如果样本$i$的真实类别等于$c$取1，否者取0。$p_{ic}$表示观测样本$i$属于类别$c$的概率。

In [12]:
def cross_entropy(Y_hat, Y, epsilon=1e-12):
    Y_hat = np.clip(Y_hat, epsilon, 1. - epsilon)
    N = Y_hat.shape[0]
    return -np.sum(np.sum(Y * np.log(Y_hat + 1e-9))) / N

In [13]:
def sigmoid(x, grad=False):
    if grad:
        return sigmoid(x) * (1 - sigmoid(x))
    else:
        return 1 / (1 + np.exp(-x))

In [14]:
def backward_multiply_gate(w, x, dz):
    """
    w shape = (75, 128)
    x shape = (64, 128)
    dz shape = (64, 75)
    """
    dw = np.dot(dz.T, x) # shape = (75, 128)
    dx = np.dot(w.T, dz.T) # shape = (128, 64)
    return dw, dx


def backward_add_gate(x1, x2, dz):
    dx1 = dz * np.ones_like(x1)
    dx2 = dz * np.ones_like(x2)
    return dx1, dx2



1. **更新门**：能关注的机制（更新门）。$Z_{t}$

2. **重置门**：能遗忘的机制（重置门）。$R_{t}$

$$
Z_{t} = \sigma(X_{t} W_{xz} + H_{t-1} W_{tz} + b_{z}) \\
R_{t} = \sigma(X_{t} W_{xr} + H_{t-1} W_{hr} + b_{r})
$$

3. **候选隐藏状态**:

&emsp;&emsp;之后基于这两个门的输出，我们来计算候选隐藏状态:

$$
\tilde{\boldsymbol{H}}_{t}=\tanh \left(\boldsymbol{X}_{t} \boldsymbol{W}_{x h}+\left(\boldsymbol{R}_{t} \odot \boldsymbol{H}_{t-1}\right) \boldsymbol{W}_{h h}+\boldsymbol{b}_{h}\right)
$$

&emsp;&emsp;如果将其与`RNN`做对比的话，我们可以发现，如果不看$R_{t}$的话，他就和之前的`RNN`是一样的。

4. **隐状态**:

$$
\boldsymbol{H}_{t}=\mathbf{Z}_{t} \odot \boldsymbol{H}_{t-1}+\left(1-\boldsymbol{Z}_{t}\right) \odot \tilde{\boldsymbol{H}}_{t}
$$

In [15]:
learning_rate = 0.001
U = np.random.randn(hidden_dim, dimension) / np.sqrt(hidden_dim)
Wz = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
Wr = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
Wh = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
V = np.random.randn(dimension, hidden_dim) / np.sqrt(dimension)

def forward_gru_recurrent(x, h_state, U, Wz, Wr, Wh, V):
    mul_u = np.dot(x, U.T)
    
    # 更新门。
    mul_Wz = np.dot(h_state, Wz.T)
    add_Wz = mul_u + mul_Wz
    z = sigmoid(add_Wz)
    
    # 重置门。
    mul_Wr = np.dot(h_state, Wr.T)
    add_Wr = mul_u + mul_Wr
    r = sigmoid(add_Wr)
    
    # 计算候选隐藏状态。
    mul_Wh = np.dot(h_state * r, Wh.T)
    add_Wh = mul_u + mul_Wh
    h_hat = tanh(add_Wh)
    
    # 隐藏状态。
    h = (1 - z) * h_state + z * h_hat
    mul_v = np.dot(h, V.T)
    return (mul_u, mul_Wz, add_Wz, z, mul_Wr, add_Wr, r, mul_Wh, add_Wh, h_hat, h, mul_v)

def backward_multiply_gate(w, x, dz):
    dW = np.dot(dz.T, x)
    dx = np.dot(w.T, dz.T)
    return dW, dx

def backward_gru_recurrent(x, h_state, U, Wz, Wr, Wh, V, d_mul_v, saved_graph):
    mul_u, mul_Wz, add_Wz, z, mul_Wr, add_Wr, r, mul_Wh, add_Wh, h_hat, h, mul_v = saved_graph
    dV, dh = backward_multiply_gate(V, h, d_mul_v)
    dh_hat = z * dh.T
    dadd_Wh = tanh(add_Wh, True) * dh_hat
    dmul_u1, dmul_Wh = backward_add_gate(mul_u, mul_Wh, dadd_Wh)
    dWh, dprev_state = backward_multiply_gate(Wh, h_state * r, dmul_Wh)
    dr = dprev_state * h_state.T
    dadd_Wr = sigmoid(add_Wr, True) * dr.T
    dmul_u2, dmul_Wr = backward_add_gate(mul_u, mul_Wr, dadd_Wr)
    dWr, dprev_state = backward_multiply_gate(Wr, h_state, dmul_Wr)
    dz = -h_state + h_hat
    dadd_Wz = sigmoid(add_Wz, True) * dz
    dmul_u3, dmul_Wz = backward_add_gate(mul_u, mul_Wz, dadd_Wz)
    dWz, dprev_state = backward_multiply_gate(Wz, h_state, dmul_Wz)
    dU, dx = backward_multiply_gate(U, x, dmul_u3)
    return (dU, dWz, dWr, dWh, dV)

In [16]:
for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension))
    batch_y = np.zeros((batch_size, sequence_length, dimension))
    batch_id = random.sample(possible_batch_id, batch_size)
    
    prev_h = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        batch_x[:, n, :] = one_hot[id1, :]
        batch_y[:, n, :] = one_hot[id2, :]
    
    layers = []
    out_logits = np.zeros((batch_size, sequence_length, dimension))
    
    for n in range(sequence_length):
        layers.append(forward_gru_recurrent(batch_x[:,n,:], prev_h, U, Wz, Wr, Wh, V))
        prev_h = layers[-1][-2]
        out_logits[:, n, :] = layers[-1][-1]
        
    probs = softmax(out_logits.reshape((-1, dimension)))
    y = np.argmax(batch_y.reshape((-1, dimension)), axis=1)
    accuracy = np.mean(np.argmax(probs, axis=1) == y)
    
    loss = cross_entropy(probs, batch_y.reshape((-1, dimension)))
    
    delta = probs
    delta[range(y.shape[0]), y] -= 1
    delta = delta.reshape((batch_size, sequence_length, dimension))
    dU = np.zeros(U.shape)
    dV = np.zeros(V.shape)
    dWz = np.zeros(Wz.shape)
    dWr = np.zeros(Wr.shape)
    dWh = np.zeros(Wh.shape)
    
    prev_h = np.zeros((batch_size, hidden_dim))
    for n in range(sequence_length):
        d_mul_v = delta[:, n, :]
        dU_t, dWz_t, dWr_t, dWh_t, dV_t = backward_gru_recurrent(batch_x[:,n,:], prev_h, 
                                                                    U, Wz, Wr, Wh, V, d_mul_v, layers[n])
        prev_h = layers[n][-2]
        dU += dU_t
        dV += dV_t
        dWz += dWz_t
        dWr += dWr_t
        dWh += dWh_t
    U -= learning_rate * dU
    V -= learning_rate * dV
    Wz -= learning_rate * dWz
    Wr -= learning_rate * dWr
    Wh -= learning_rate * dWh
    if (i+1) % 50 == 0:
        print('epoch {}, loss {}, accuracy {}'.format(i+1, loss, accuracy))

epoch 50, loss 4.205812871238337, accuracy 0.17708333333333334
epoch 100, loss 4.125757887102406, accuracy 0.14973958333333334
epoch 150, loss 3.9376015608682837, accuracy 0.13020833333333334
epoch 200, loss 3.964367698691824, accuracy 0.22526041666666666
epoch 250, loss 4.11912351339582, accuracy 0.125
epoch 300, loss 4.0097548409817465, accuracy 0.09635416666666667
epoch 350, loss 3.876725700625436, accuracy 0.10677083333333333
epoch 400, loss 3.766448631615121, accuracy 0.109375
epoch 450, loss 3.673177671133151, accuracy 0.10546875
epoch 500, loss 3.6843513633393843, accuracy 0.10546875
epoch 550, loss 3.6535054605590456, accuracy 0.0859375
epoch 600, loss 3.5876677972762896, accuracy 0.09895833333333333
epoch 650, loss 3.525700681275738, accuracy 0.11979166666666667
epoch 700, loss 3.6460132689903233, accuracy 0.1015625
epoch 750, loss 3.6322743117080214, accuracy 0.08723958333333333
epoch 800, loss 3.58966615628067, accuracy 0.10416666666666667
epoch 850, loss 3.6297152951641727,

### Pytorch实现

In [17]:
import torch
import torch.nn as nn
from torch.autograd import Variable

class GRU(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        """
        """
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        super(GRU, self).__init__()
        self.GRU = nn.GRU(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        h0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_dim))
        out, hn = self.GRU(x, None)
        
        out = self.fc(out)
        return out
    
gru = GRU(input_dim = dimension, hidden_dim = hidden_dim, num_layers=1, output_dim = dimension)
print(gru)


# 分开定义softmax运算和交叉熵损失函数会造成数值不稳定。
# 因此PyTorch提供了一个具有良好数值稳定性且包括softmax运算和交叉熵计算的函数。
criterion = nn.CrossEntropyLoss()

learning_rate = 0.01
optimizer = torch.optim.Adam(gru.parameters(), lr=learning_rate)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_y = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_id = random.sample(possible_batch_id, batch_size)  # 随机采样，选择batch_id。
    # prev_s = np.zeros((batch_size, hidden_dim))
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    # 从Numpy转成torch之后送入神经网络中去。
    output = gru(torch.from_numpy(batch_x))  # torch.Size([64, 12, 75])
    label = torch.argmax(torch.from_numpy(batch_y).view(-1, dimension), dim=1) # shape = 786
    
    accuracy = np.mean(torch.argmax(output.view(-1, dimension), axis=1).numpy() == label.numpy())
    
    optimizer.zero_grad()
    loss = criterion(output.view(-1, dimension), label)
    loss.backward()
    optimizer.step()
    
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss.item(), accuracy)) 

GRU(
  (GRU): GRU(75, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=75, bias=True)
)
epoch 50, loss 1.600624680519104, accuracy 0.5651041666666666
epoch 100, loss 0.8518295884132385, accuracy 0.7682291666666666
epoch 150, loss 0.6507782340049744, accuracy 0.80078125
epoch 200, loss 0.5483970642089844, accuracy 0.8411458333333334
epoch 250, loss 0.5464432835578918, accuracy 0.8111979166666666
epoch 300, loss 0.4642275869846344, accuracy 0.83984375
epoch 350, loss 0.43444767594337463, accuracy 0.8528645833333334
epoch 400, loss 0.4160071611404419, accuracy 0.8489583333333334
epoch 450, loss 0.5063161253929138, accuracy 0.8268229166666666
epoch 500, loss 0.4659353494644165, accuracy 0.8450520833333334
epoch 550, loss 0.4283309876918793, accuracy 0.859375
epoch 600, loss 0.4190591275691986, accuracy 0.8567708333333334
epoch 650, loss 0.4155247211456299, accuracy 0.8671875
epoch 700, loss 0.4059120714664459, accuracy 0.8580729166666666
epoch 750, loss 0.462495476007461