- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

## torch.nn.LSTM

In `PyTorch`, the `LSTM` API is documented at: [torch.nn.LSTM(*args, **kwargs)](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html?highlight=nn%20lstm#torch.nn.LSTM)

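In the classic `LSTM` diagram, the black line running along the top is the cell state. It carries historical information forward and is updated at every time step according to: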

$$
\begin{aligned}
i_{t} &=\sigma\left(W_{i i} x_{t}+b_{i i}+W_{h i} h_{t-1}+b_{h i}\right) \\
f_{t} &=\sigma\left(W_{i f} x_{t}+b_{i f}+W_{h f} h_{t-1}+b_{h f}\right) \\
g_{t} &=\tanh \left(W_{i g} x_{t}+b_{i g}+W_{h g} h_{t-1}+b_{h g}\right) \\
c_{t} &=f_{t} \odot c_{t-1}+i_{t} \odot g_{t} \\
o_{t} &=\sigma\left(W_{i o} x_{t}+b_{i o}+W_{h o} h_{t-1}+b_{h o}\right) \\
h_{t} &=o_{t} \odot \tanh \left(c_{t}\right)
\end{aligned}
$$

$i_{t}$ is the input gate, $f_{t}$ the forget gate, and $o_{t}$ the output gate. $c_{t}$ is the cell state and $g_{t}$ is the candidate cell state.

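An `LSTM` only has to do two things: update the current cell state, and use the updated cell state to decide the current output.

1. To update the cell state we need a forget gate that selectively forgets parts of the previous cell state, which is the $f_{t} \odot c_{t-1}$ term, and an input gate that filters the information arriving at the current time step, which is the $i_{t} \odot g_{t}$ term.

2. The output gate then filters the current cell state: $o_{t} \odot \tanh \left(c_{t}\right)$.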

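Compared with a plain `RNN`, an `LSTM` has one more value to initialize: the initial cell state $c_{0}$, in addition to $h_{0}$. If these initial values are learned instead of fixed, the approach is sometimes described as a form of `meta-learning`.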
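As an illustration of a learned initial state, here is a minimal sketch. The wrapper class `LSTMWithLearnedInit` is hypothetical (it is not part of PyTorch); it simply registers $h_0$ and $c_0$ as `nn.Parameter`s so that the optimizer updates them together with the LSTM weights.

```python
import torch
import torch.nn as nn

class LSTMWithLearnedInit(nn.Module):
    """Hypothetical wrapper: an nn.LSTM whose h_0 and c_0 are trainable."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # One learned vector each, shared by every sequence in the batch.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):
        bs = x.size(0)
        # Broadcast the learned vectors to [num_layers=1, bs, hidden_size].
        h0 = self.h0.expand(-1, bs, -1).contiguous()
        c0 = self.c0.expand(-1, bs, -1).contiguous()
        return self.lstm(x, (h0, c0))

model = LSTMWithLearnedInit(input_size=4, hidden_size=5)
output, (h_n, c_n) = model(torch.randn(2, 3, 4))
output.sum().backward()  # gradients now reach h0 and c0 as well
param_names = [name for name, _ in model.named_parameters()]
print(param_names)
```

Because `h0` and `c0` appear in `named_parameters()`, any optimizer built from `model.parameters()` will train them.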

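Its constructor parameters are:

1. input_size: the number of features in the input $x$.
2. hidden_size: the size of the hidden state, i.e. the size of $h_{t}$ and $c_{t}$.
3. num_layers: the number of stacked recurrent layers.
4. bias: if set to False, the layer has no bias terms.
5. batch_first: if set to True, the batch dimension comes first in the input and output tensors.
6. dropout: if non-zero, `dropout` is applied to the output of every layer except the last; the default is `0`.
7. `bidirectional`: whether the LSTM is bidirectional. If it is, the output feature size is `2 * hidden_size`.
8. proj_size: enables the LSTM variant known as LSTMP, which projects the hidden state to a smaller size in order to reduce the parameter count and computation.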
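To make the effect of these parameters concrete, the following sketch builds a two-layer bidirectional `nn.LSTM` and prints the resulting shapes. The sizes are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

bs, T, i_size, h_size = 2, 3, 4, 5
x = torch.randn(bs, T, i_size)

lstm = nn.LSTM(i_size, h_size, num_layers=2, batch_first=True, bidirectional=True)
output, (h_n, c_n) = lstm(x)  # omitted initial states default to zeros

print(output.shape)  # torch.Size([2, 3, 10]): [bs, T, 2 * hidden_size]
print(h_n.shape)     # torch.Size([4, 2, 5]):  [2 * num_layers, bs, hidden_size]
print(c_n.shape)     # torch.Size([4, 2, 5])
```

Note that `batch_first` only affects the input and `output`; `h_n` and `c_n` keep the layer/direction dimension first.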

### Implementing LSTM

In [1]:
import torch
import torch.nn as nn

bs, T, i_size, h_size = 2, 3, 4, 5
input_data = torch.randn(bs, T, i_size)

In [2]:
c_0 = torch.randn(bs, h_size)  # initial state; not trained
h_0 = torch.randn(bs, h_size)

lstm_layer = nn.LSTM(i_size, h_size, batch_first=True)

output, (h_n, c_n) = lstm_layer(input_data, (h_0.unsqueeze(0), c_0.unsqueeze(0)))
print(output)

tensor([[[-0.1963, -0.1278, -0.0995,  0.0151, -0.3178],
         [-0.1245, -0.4401, -0.1217, -0.0785,  0.0063],
         [-0.2461, -0.1975, -0.1093, -0.0019, -0.0014]],

        [[-0.3640, -0.1631,  0.1700,  0.0082, -0.3252],
         [-0.2442, -0.1396,  0.1449, -0.0412, -0.1769],
         [-0.0059, -0.0577, -0.0728, -0.0343, -0.1941]]],
       grad_fn=<TransposeBackward0>)


In [3]:
for k, v in lstm_layer.named_parameters():
    print(k, v.shape)

weight_ih_l0 torch.Size([20, 4])
weight_hh_l0 torch.Size([20, 5])
bias_ih_l0 torch.Size([20])
bias_hh_l0 torch.Size([20])


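The parameters `weight_ih` and `weight_hh` are each four `w` matrices concatenated along the first dimension, in the gate order input (i), forget (f), cell (g), output (o).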
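A short sketch of recovering the four per-gate matrices from the concatenated parameters with `chunk`. The variable names (`w_ii`, `w_hf`, and so on) are chosen here to mirror the symbols in the equations.

```python
import torch
import torch.nn as nn

i_size, h_size = 4, 5
lstm_layer = nn.LSTM(i_size, h_size, batch_first=True)

# Rows are stacked in gate order: input (i), forget (f), cell (g), output (o).
w_ii, w_if, w_ig, w_io = lstm_layer.weight_ih_l0.chunk(4, dim=0)
w_hi, w_hf, w_hg, w_ho = lstm_layer.weight_hh_l0.chunk(4, dim=0)

print(w_ii.shape)  # torch.Size([5, 4])
print(w_hi.shape)  # torch.Size([5, 5])
```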

### Understanding the LSTM forward pass

`w_ih` has shape `[4*h_size, i_size]` and `w_hh` has shape `[4*h_size, h_size]`.

In [19]:
def lstm_forward(input_data, init_states, w_ih, w_hh, b_ih, b_hh):
    h0, c0 = init_states
    bs, T, i_size = input_data.shape
    h_size = w_ih.shape[0] // 4  
    
    prev_h = h0
    prev_c = c0
    batch_w_ih = w_ih.unsqueeze(0).tile(bs, 1, 1)  # [bs, 4*h_size, i_size]
    batch_w_hh = w_hh.unsqueeze(0).tile(bs, 1, 1)  # [bs, 4*h_size, h_size]
    
    
    output_size = h_size
    output = torch.zeros(bs, T, output_size)  # output sequence
    
    for t in range(T):
        x = input_data[:, t, :]  # input vector at the current time step, [bs, i_size]
        
        w_times_x = torch.bmm(batch_w_ih, x.unsqueeze(-1)).squeeze(-1)  # [bs, 4 * h_size]
        w_times_h_prev = torch.bmm(batch_w_hh, prev_h.unsqueeze(-1)).squeeze(-1)  # [bs, 4 * h_size]
        
        # Compute the input gate (i), forget gate (f), cell gate (g) and output gate (o)
        i_t = torch.sigmoid(w_times_x[:, :h_size]+w_times_h_prev[:, :h_size]+b_ih[:h_size]+b_hh[:h_size])
        f_t = torch.sigmoid(w_times_x[:, h_size:2*h_size]+w_times_h_prev[:, h_size:2*h_size]+ \
                            b_ih[h_size:2*h_size]+b_hh[h_size:2*h_size])
        g_t = torch.tanh(w_times_x[:, 2*h_size:3*h_size]+w_times_h_prev[:, 2*h_size:3*h_size]+ \
                            b_ih[2*h_size:3*h_size]+b_hh[2*h_size:3*h_size])
        
        o_t = torch.sigmoid(w_times_x[:, 3*h_size:4*h_size]+w_times_h_prev[:, 3*h_size:4*h_size]+ \
                            b_ih[3*h_size:4*h_size]+b_hh[3*h_size:4*h_size])
        
        prev_c = f_t * prev_c + i_t * g_t
        prev_h = o_t * torch.tanh(prev_c)
        
        output[:, t, :] = prev_h
        
    return output, (prev_h, prev_c)

In [20]:
output, (h_n, c_n) = lstm_forward(input_data, (h_0, c_0), lstm_layer.weight_ih_l0, lstm_layer.weight_hh_l0, \
             lstm_layer.bias_ih_l0, lstm_layer.bias_hh_l0)
print(output)

tensor([[[-0.1963, -0.1278, -0.0995,  0.0151, -0.3178],
         [-0.1245, -0.4401, -0.1217, -0.0785,  0.0063],
         [-0.2461, -0.1975, -0.1093, -0.0019, -0.0014]],

        [[-0.3640, -0.1631,  0.1700,  0.0082, -0.3252],
         [-0.2442, -0.1396,  0.1449, -0.0412, -0.1769],
         [-0.0059, -0.0577, -0.0728, -0.0343, -0.1941]]], grad_fn=<CopySlices>)
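Instead of comparing printed tensors by eye, the agreement can be checked programmatically. The sketch below is a compact variant of the forward pass, not the function above: it is an assumption-laden rewrite (the name `lstm_forward_matmul` is made up here) that replaces the tiled `bmm` calls with a plain `x @ w.T` and splits the gates with `chunk`, then compares the result against `nn.LSTM` using `torch.allclose`.

```python
import torch
import torch.nn as nn

def lstm_forward_matmul(x, h0, c0, w_ih, w_hh, b_ih, b_hh):
    """Single-layer, batch-first LSTM forward pass using plain matmul."""
    h, c = h0, c0
    outputs = []
    for t in range(x.shape[1]):
        # All four gate pre-activations at once: [bs, 4 * h_size]
        gates = x[:, t, :] @ w_ih.T + b_ih + h @ w_hh.T + b_hh
        i, f, g, o = gates.chunk(4, dim=1)  # gate order: i, f, g, o
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        outputs.append(h)
    return torch.stack(outputs, dim=1), (h, c)

torch.manual_seed(0)
bs, T, i_size, h_size = 2, 3, 4, 5
x = torch.randn(bs, T, i_size)
h0 = torch.randn(bs, h_size)
c0 = torch.randn(bs, h_size)
layer = nn.LSTM(i_size, h_size, batch_first=True)

with torch.no_grad():
    ref_out, (ref_h, ref_c) = layer(x, (h0.unsqueeze(0), c0.unsqueeze(0)))
    out, (h_n, c_n) = lstm_forward_matmul(
        x, h0, c0,
        layer.weight_ih_l0, layer.weight_hh_l0,
        layer.bias_ih_l0, layer.bias_hh_l0,
    )

matches = torch.allclose(out, ref_out, atol=1e-5)
print(matches)
```

Since the weight matrices are the same for every sample, broadcasting a single matmul over the batch avoids materializing `bs` copies of each weight.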


### Implementing LSTMP

In [23]:
proj_size = 3  # h is projected down to proj_size; c is not compressed

c_0 = torch.randn(bs, h_size)  # initial state; not trained
h_0 = torch.randn(bs, proj_size)

lstm_layer = nn.LSTM(i_size, h_size, batch_first=True, proj_size=proj_size)

output, (h_n, c_n) = lstm_layer(input_data, (h_0.unsqueeze(0), c_0.unsqueeze(0)))
print(output)

tensor([[[ 0.2407,  0.3258,  0.0930],
         [ 0.1654,  0.2884,  0.1637],
         [ 0.1339,  0.0925,  0.0966]],

        [[ 0.1401,  0.3316, -0.0414],
         [ 0.1399,  0.1083,  0.0132],
         [ 0.0920,  0.0590,  0.0889]]], grad_fn=<TransposeBackward0>)


In [24]:
for k, v in lstm_layer.named_parameters():
    print(k, v.shape)

weight_ih_l0 torch.Size([20, 4])
weight_hh_l0 torch.Size([20, 3])
bias_ih_l0 torch.Size([20])
bias_hh_l0 torch.Size([20])
weight_hr_l0 torch.Size([3, 5])


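Compared with the plain `LSTM`, there is one extra parameter, `weight_hr_l0`, which projects the `hidden state` down to `proj_size`. As a consequence, `weight_hh_l0` shrinks from `[20, 5]` to `[20, 3]`.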
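The parameter saving can be counted directly. The sketch below uses a small helper, `count_parameters` (a name introduced here for illustration), and assumes a PyTorch version that supports `proj_size`.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

i_size, h_size, proj_size = 4, 5, 3
plain = nn.LSTM(i_size, h_size, batch_first=True)
projected = nn.LSTM(i_size, h_size, batch_first=True, proj_size=proj_size)

print(count_parameters(plain))      # 220 = 20*4 + 20*5 + 20 + 20
print(count_parameters(projected))  # 195 = 20*4 + 20*3 + 20 + 20 + 3*5
```

At these toy sizes the saving is small; it grows with `hidden_size`, because `weight_hh` scales with `4 * hidden_size * proj_size` instead of `4 * hidden_size * hidden_size`.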

### LSTMP forward pass

In [27]:
def lstm_forward(input_data, init_states, w_ih, w_hh, b_ih, b_hh, w_hr=None):
    h0, c0 = init_states
    bs, T, i_size = input_data.shape
    h_size = w_ih.shape[0] // 4  
    
    prev_h = h0
    prev_c = c0
    batch_w_ih = w_ih.unsqueeze(0).tile(bs, 1, 1)  # [bs, 4*h_size, i_size]
    batch_w_hh = w_hh.unsqueeze(0).tile(bs, 1, 1)  # [bs, 4*h_size, h_size]
    
    if w_hr is not None:
        p_size = w_hr.shape[0]
        output_size = p_size
        batch_w_hr = w_hr.unsqueeze(0).tile(bs, 1, 1)  # [bs, p_size, h_size]
    else:
        output_size = h_size
        
        
    output = torch.zeros(bs, T, output_size)  # output sequence
    
    for t in range(T):
        x = input_data[:, t, :]  # input vector at the current time step, [bs, i_size]
        
        w_times_x = torch.bmm(batch_w_ih, x.unsqueeze(-1)).squeeze(-1)  # [bs, 4 * h_size]
        w_times_h_prev = torch.bmm(batch_w_hh, prev_h.unsqueeze(-1)).squeeze(-1)  # [bs, 4 * h_size]
        
        # Compute the input gate (i), forget gate (f), cell gate (g) and output gate (o)
        i_t = torch.sigmoid(w_times_x[:, :h_size]+w_times_h_prev[:, :h_size]+b_ih[:h_size]+b_hh[:h_size])
        f_t = torch.sigmoid(w_times_x[:, h_size:2*h_size]+w_times_h_prev[:, h_size:2*h_size]+ \
                            b_ih[h_size:2*h_size]+b_hh[h_size:2*h_size])
        g_t = torch.tanh(w_times_x[:, 2*h_size:3*h_size]+w_times_h_prev[:, 2*h_size:3*h_size]+ \
                            b_ih[2*h_size:3*h_size]+b_hh[2*h_size:3*h_size])
        
        o_t = torch.sigmoid(w_times_x[:, 3*h_size:4*h_size]+w_times_h_prev[:, 3*h_size:4*h_size]+ \
                            b_ih[3*h_size:4*h_size]+b_hh[3*h_size:4*h_size])
        
        prev_c = f_t * prev_c + i_t * g_t
        prev_h = o_t * torch.tanh(prev_c) # [bs, h_size]
        
        if w_hr is not None:  # apply the projection
            prev_h = torch.bmm(batch_w_hr, prev_h.unsqueeze(-1))# [bs, p_size, 1]
            prev_h = prev_h.squeeze(-1)  # [bs, p_size]
        
        output[:, t, :] = prev_h
        
    return output, (prev_h, prev_c)

In [28]:
output, (h_n, c_n) = lstm_forward(input_data, (h_0, c_0), lstm_layer.weight_ih_l0, lstm_layer.weight_hh_l0, \
             lstm_layer.bias_ih_l0, lstm_layer.bias_hh_l0, lstm_layer.weight_hr_l0)
print(output)

tensor([[[ 0.2407,  0.3258,  0.0930],
         [ 0.1654,  0.2884,  0.1637],
         [ 0.1339,  0.0925,  0.0966]],

        [[ 0.1401,  0.3316, -0.0414],
         [ 0.1399,  0.1083,  0.0132],
         [ 0.0920,  0.0590,  0.0889]]], grad_fn=<CopySlices>)


## Vanilla LSTM code

### NumPy implementation

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import time
sns.set()

def get_vocab(file, lower = False):
    with open(file, 'r') as fopen:
        data = fopen.read() # read the entire file into memory
    if lower:
        data = data.lower()
    
    vocab = list(set(data))
    return data, vocab

def embed_to_control(data, vocab):
    onehot = np.zeros((len(data), len(vocab)), dtype = np.float32)
    for i in range(len(data)):
        onehot[i, vocab.index(data[i])] = 1.0
    return onehot


text, text_vocab = get_vocab('../consumer.h', lower = False)
one_hot = embed_to_control(text, text_vocab)
print('len text: ', len(text))
print('len text_vocab: ', len(text_vocab))
print('one_hot shape: ', one_hot.shape)

epoch = 1000
learning_rate = 0.0001
batch_size = 64
sequence_length = int(12)
dimension = one_hot.shape[1]
print('dimension is :', dimension)
possible_batch_id = range(len(text) - sequence_length - 1)
hidden_dim = 128

len text:  15294
len text_vocab:  75
one_hot shape:  (15294, 75)
dimension is : 75


&emsp;&emsp;The `tanh` activation function is:

$$
\tanh x=\frac{\sinh x}{\cosh x}=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
$$

&emsp;&emsp;Its derivative is:

$$
(\tanh x)^{\prime}=\operatorname{sech}^{2} x=1-\tanh ^{2} x
$$

In [7]:
def tanh(x, grad=False):
    if grad:
        output = np.tanh(x)
        return (1.0 - np.square(output))
    else:
        return np.tanh(x)
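The derivative identity can be sanity-checked numerically. This is a minimal sketch that compares the analytic form $1-\tanh^{2} x$ with a central finite difference; the step size `eps` and the sample points are arbitrary choices.

```python
import numpy as np

def tanh(x, grad=False):
    if grad:
        return 1.0 - np.square(np.tanh(x))
    return np.tanh(x)

x = np.linspace(-3.0, 3.0, 13)
eps = 1e-6

# Central difference approximation of d/dx tanh(x)
numeric = (tanh(x + eps) - tanh(x - eps)) / (2 * eps)
analytic = tanh(x, grad=True)

max_error = np.max(np.abs(numeric - analytic))
print(max_error)
```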

In [8]:
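def softmax(x):
    """
    x: 2-D array of logits, one row per sample. Subtracting each row's max
    before exponentiating avoids overflow without changing the result.
    """
    exp_scores = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_scores / (np.sum(exp_scores, axis=1, keepdims=True) + 1e-8)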
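To see why the maximum is subtracted before exponentiating, consider logits around 1000: `np.exp(1000.0)` overflows to `inf`, while the shifted version stays finite. The sketch below uses a row-wise shift (the function name `softmax_rowwise` is introduced here), which keeps every row's largest exponent at `exp(0) = 1`.

```python
import numpy as np

def softmax_rowwise(x):
    # Shift each row so that its largest entry becomes 0 before exponentiating.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 1.0, 2.0]])
probs = softmax_rowwise(logits)
row_sums = probs.sum(axis=1)
print(probs)
print(row_sums)
```

The two rows differ only by a constant offset, so they yield the same probabilities, and each row sums to 1.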

&emsp;&emsp;The multi-class cross-entropy loss is:

$$
L=\frac{1}{N} \sum_{i} L_{i}=-\frac{1}{N} \sum_{i} \sum_{c=1}^{M} y_{i c} \log \left(p_{i c}\right)
$$

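&emsp;&emsp;Here $M$ is the number of classes, and $y_{ic}$ is an indicator (0 or 1) that equals 1 if the true class of sample $i$ is $c$ and 0 otherwise. $p_{ic}$ is the predicted probability that sample $i$ belongs to class $c$.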

In [9]:
def cross_entropy(Y_hat, Y, epsilon=1e-12):
    Y_hat = np.clip(Y_hat, epsilon, 1. - epsilon)
    N = Y_hat.shape[0]
    return -np.sum(np.sum(Y * np.log(Y_hat + 1e-9))) / N
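A small worked example of the formula: with one-hot targets, only the log-probability of the true class contributes for each sample. The probabilities below are made-up numbers for illustration, and the function is a simplified restatement of the loss rather than the exact cell above.

```python
import numpy as np

def cross_entropy(Y_hat, Y, epsilon=1e-12):
    # Clip to avoid log(0), then average the per-sample losses.
    Y_hat = np.clip(Y_hat, epsilon, 1.0 - epsilon)
    return -np.sum(Y * np.log(Y_hat)) / Y_hat.shape[0]

Y_hat = np.array([[0.7, 0.2, 0.1],    # predicted probabilities, sample 1
                  [0.1, 0.8, 0.1]])   # predicted probabilities, sample 2
Y = np.array([[1.0, 0.0, 0.0],        # true class 0
              [0.0, 1.0, 0.0]])       # true class 1

loss = cross_entropy(Y_hat, Y)
expected = -(np.log(0.7) + np.log(0.8)) / 2
print(loss, expected)
```

Both values come out at roughly 0.29.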

In [10]:
def backward_multiply_gate(w, x, dz):
    """
    w shape = (75, 128)
    x shape = (64, 128)
    dz shape = (64, 75)
    """
    dw = np.dot(dz.T, x) # shape = (75, 128)
    dx = np.dot(w.T, dz.T) # shape = (128, 64)
    return dw, dx


def backward_add_gate(x1, x2, dz):
    dx1 = dz * np.ones_like(x1)
    dx2 = dz * np.ones_like(x2)
    return dx1, dx2
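The multiply-gate gradients can be verified with a numerical gradient check. The sketch below uses small made-up shapes and a linear scalar loss, `sum(z * dz)` with `z = x @ w.T`, whose gradient with respect to `z` is exactly `dz`.

```python
import numpy as np

def backward_multiply_gate(w, x, dz):
    dw = np.dot(dz.T, x)    # [out, in]
    dx = np.dot(w.T, dz.T)  # [in, batch], i.e. transposed relative to x
    return dw, dx

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 4))   # [out, in]
x = rng.standard_normal((2, 4))   # [batch, in]
dz = rng.standard_normal((2, 3))  # upstream gradient, [batch, out]

def scalar_loss(w_):
    # Linear in z = x @ w_.T, so dL/dz equals dz.
    return np.sum(np.dot(x, w_.T) * dz)

dw, dx = backward_multiply_gate(w, x, dz)

# Central finite differences, one weight entry at a time.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for r in range(w.shape[0]):
    for c in range(w.shape[1]):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[r, c] += eps
        w_minus[r, c] -= eps
        numeric_dw[r, c] = (scalar_loss(w_plus) - scalar_loss(w_minus)) / (2 * eps)

max_error = np.max(np.abs(dw - numeric_dw))
print(max_error)
```

Note that `dx` is returned with shape `[in, batch]`, which is why the backward pass later transposes it with `dh.T`.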

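&emsp;&emsp;First, following the equations below, create the recurrent weights of the three gates (the input weights are created later, and this implementation omits the bias terms):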

$$
\begin{aligned}
\boldsymbol{I}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x i}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h i}+\boldsymbol{b}_{i}\right) \\
\boldsymbol{F}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x f}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h f}+\boldsymbol{b}_{f}\right) \\
\boldsymbol{O}_{t} &=\sigma\left(\boldsymbol{X}_{t} \boldsymbol{W}_{x o}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h o}+\boldsymbol{b}_{o}\right)
\end{aligned}
$$

In [11]:
W_hi = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_hf = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)
W_ho = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)

&emsp;&emsp;Next we need the **candidate memory cell**:

$$
\tilde{\boldsymbol{C}}_{t}=\tanh \left(\boldsymbol{X}_{t} \boldsymbol{W}_{x c}+\boldsymbol{H}_{t-1} \boldsymbol{W}_{h c}+\boldsymbol{b}_{c}\right)
$$

&emsp;&emsp;**Memory cell**:

$$
\boldsymbol{C}_{t}=\boldsymbol{F}_{t} \odot \boldsymbol{C}_{t-1}+\boldsymbol{I}_{t} \odot \tilde{\boldsymbol{C}}_{t}
$$

&emsp;&emsp;Unlike `RNN` and `GRU`, an `LSTM` carries two states: $C$ and $H$.

In [12]:
W_hc = np.random.randn(hidden_dim, hidden_dim) / np.sqrt(hidden_dim)

&emsp;&emsp;**Hidden state**:

$$
\boldsymbol{H}_{t}=\boldsymbol{O}_{t} \odot \tanh \left(\boldsymbol{C}_{t}\right)
$$

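&emsp;&emsp;The $\tanh$ keeps the values between `-1` and `+1`.

&emsp;&emsp;Next we create the weight matrix that processes the input. Strictly there should be four of them, $W_{xi}, W_{xf}, W_{xo}, W_{xc}$; for convenience they are merged here into a single shared matrix $U$.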

In [13]:
U = np.random.randn(hidden_dim, dimension)

&emsp;&emsp;Finally there is the output-layer weight, which we denote $V$:

In [14]:
V = np.random.randn(dimension, hidden_dim) / np.sqrt(hidden_dim)

In [15]:
def sigmoid(x, grad=False):
    if grad:
        return sigmoid(x) * (1 - sigmoid(x))
    else:
        return 1 / (1 + np.exp(-x))

def forward_lstm_recurrent(x, c_state, h_state, U, W_hf, W_hi, W_hc, W_ho, V):
    
    # Input contribution, shared by all four gates.
    mul_u = np.dot(x, U.T)
    
    # Forget gate
    mul_Wf = np.dot(h_state, W_hf.T)
    add_Wf = mul_u + mul_Wf
    f = sigmoid(add_Wf)
    
    # Input gate
    mul_Wi = np.dot(h_state, W_hi.T)
    add_Wi = mul_u + mul_Wi
    i = sigmoid(add_Wi)
    
    # Candidate memory cell
    mul_Wc = np.dot(h_state, W_hc.T)
    add_Wc = mul_u + mul_Wc
    c_hat = tanh(add_Wc)
    
    # Memory cell: the forget gate decides how much of the previous c_state to keep,
    # and the input gate decides how much of the candidate to admit.
    C = c_state * f + i * c_hat
    
    # Output gate
    mul_Wo = np.dot(h_state, W_ho.T)
    add_Wo = mul_u + mul_Wo
    o = sigmoid(add_Wo)
    
    # Hidden state
    h = o * tanh(C)
    
    mul_v = np.dot(h, V.T)
    return (mul_u, mul_Wf, add_Wf, mul_Wi, add_Wi, mul_Wc, add_Wc, C, mul_Wo, add_Wo, h, mul_v, i, o, c_hat)

In [16]:
def backward_recurrent(x, c_state, h_state, U, Wf, Wi, Wc, Wo, V, d_mul_v, saved_graph):
    mul_u, mul_Wf, add_Wf, mul_Wi, add_Wi, mul_Wc, add_Wc, C, mul_Wo, add_Wo, h, mul_v, i, o, c_hat = saved_graph
    dV, dh = backward_multiply_gate(V, h, d_mul_v)
    dC = tanh(C, True) * o * dh.T
    do = tanh(C) * dh.T
    dadd_Wo = sigmoid(add_Wo, True) * do
    dmul_u1, dmul_Wo = backward_add_gate(mul_u, mul_Wo, dadd_Wo)
    dWo, dprev_state = backward_multiply_gate(Wo, h_state, dmul_Wo)
    dc_hat = dC * i
    dadd_Wc = tanh(add_Wc, True) * dc_hat
    dmul_u2, dmul_Wc = backward_add_gate(mul_u, mul_Wc, dadd_Wc)
    dWc, dprev_state = backward_multiply_gate(Wc, h_state, dmul_Wc)
    di = dC * c_hat
    dadd_Wi = sigmoid(add_Wi, True) * di
    dmul_u3, dmul_Wi = backward_add_gate(mul_u, mul_Wi, dadd_Wi)
    dWi, dprev_state = backward_multiply_gate(Wi, h_state, dmul_Wi)
    df = dC * c_state
    dadd_Wf = sigmoid(add_Wf, True) * df
    dmul_u4, dmul_Wf = backward_add_gate(mul_u, mul_Wf, dadd_Wf)
    dWf, dprev_state = backward_multiply_gate(Wf, h_state, dmul_Wf)
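    # mul_u feeds all four gates, so its gradient is the sum over the four branches.
    dU, dx = backward_multiply_gate(U, x, dmul_u1 + dmul_u2 + dmul_u3 + dmul_u4)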
    return (dU, dWf, dWi, dWc, dWo, dV)

In [17]:
for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension))
    batch_y = np.zeros((batch_size, sequence_length, dimension))
    batch_id = random.sample(possible_batch_id, batch_size)
    
    prev_c = np.zeros((batch_size, hidden_dim))
    prev_h = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
        
    layers = []
    out_logits = np.zeros((batch_size, sequence_length, dimension))
    
    for n in range(sequence_length):
        layers.append(forward_lstm_recurrent(batch_x[:, n, :], prev_c, prev_h, U, W_hf, W_hi, W_hc, W_ho, V))
        
        prev_c = layers[-1][7]
        prev_h = layers[-1][10]
        
        out_logits[:, n, :] = layers[-1][-4]
        
    probs = softmax(out_logits.reshape((-1, dimension)))
    y = np.argmax(batch_y.reshape((-1, dimension)), axis=1)
    accuracy = np.mean(np.argmax(probs, axis=1) == y)
    loss = cross_entropy(probs, batch_y.reshape((-1, dimension)))
    
    delta = probs
    delta[range(y.shape[0]), y] -= 1
    delta = delta.reshape((batch_size, sequence_length, dimension))
    
    dU = np.zeros(U.shape)
    dV = np.zeros(V.shape)
    dW_hf = np.zeros(W_hf.shape)
    dW_hi = np.zeros(W_hi.shape)
    dW_hc = np.zeros(W_hc.shape)
    dW_ho = np.zeros(W_ho.shape)
    
    prev_c = np.zeros((batch_size, hidden_dim))
    prev_h = np.zeros((batch_size, hidden_dim))
    
    for n in range(sequence_length):
        d_mul_v = delta[:, n, :]
        dU_t, dWf_t, dWi_t, dWc_t, dWo_t, dV_t = backward_recurrent(batch_x[:,n,:], prev_c, prev_h, U, W_hf, W_hi, 
                                                                    W_hc, W_ho, V, d_mul_v, layers[n])
        prev_c = layers[n][7]
        prev_h = layers[n][10]
        dU += dU_t
        dV += dV_t
        dW_hf += dWf_t
        dW_hi += dWi_t
        dW_hc += dWc_t
        dW_ho += dWo_t
    U -= learning_rate * dU
    V -= learning_rate * dV
    W_hf -= learning_rate * dW_hf
    W_hi -= learning_rate * dW_hi
    W_hc -= learning_rate * dW_hc
    W_ho -= learning_rate * dW_ho
    if (i+1) % 50 == 0:
        print('epoch {}, loss {}, accuracy {}'.format(i+1, loss, accuracy))

epoch 50, loss 3.921067552998206, accuracy 0.10546875
epoch 100, loss 3.575667178766933, accuracy 0.10026041666666667
epoch 150, loss 3.3109259011061094, accuracy 0.18359375
epoch 200, loss 3.191043920397957, accuracy 0.22916666666666666
epoch 250, loss 2.980070036653099, accuracy 0.2994791666666667
epoch 300, loss 2.9500002072683067, accuracy 0.30078125
epoch 350, loss 2.7624340653346326, accuracy 0.3294270833333333
epoch 400, loss 2.778283871115836, accuracy 0.3268229166666667
epoch 450, loss 2.788082161722753, accuracy 0.3111979166666667
epoch 500, loss 2.563601537836983, accuracy 0.3841145833333333
epoch 550, loss 2.584979086319973, accuracy 0.3385416666666667
epoch 600, loss 2.520896597699117, accuracy 0.3828125
epoch 650, loss 2.707388597092463, accuracy 0.3151041666666667
epoch 700, loss 2.3654156105836175, accuracy 0.4075520833333333
epoch 750, loss 2.4318962914566575, accuracy 0.41015625
epoch 800, loss 2.311684977494303, accuracy 0.41015625
epoch 850, loss 2.3953958969078815,
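The batch construction in the training loop pairs every character with its successor: `id1` indexes the inputs and `id2 = id1 + 1` indexes the targets. The toy sketch below shows the same idea on a short string; the text and the fixed start positions are made up for illustration, whereas the loop samples the start positions at random.

```python
text = "hello world"
sequence_length = 4
batch_id = [0, 6]  # hypothetical start positions

inputs, targets = [], []
for start in batch_id:
    # Input window and the same window shifted one character to the right.
    inputs.append(text[start:start + sequence_length])
    targets.append(text[start + 1:start + sequence_length + 1])

print(inputs)   # ['hell', 'worl']
print(targets)  # ['ello', 'orld']
```

At every position the model is therefore trained to predict the next character of the sequence.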

### PyTorch implementation

In [18]:
import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.LSTM = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, x):
        # Passing None lets nn.LSTM default h_0 and c_0 to zeros.
        out, (hn, cn) = self.LSTM(x, None)  # outputs for every time step
        
        # out = self.fc(out[:, -1, :])  # use only the output of the last time step
        out = self.fc(out)
        
        return out
lstm = LSTM(input_dim = dimension, hidden_dim = hidden_dim, num_layers=1, output_dim = dimension)
print(lstm)


# Computing softmax and cross-entropy as separate steps can be numerically unstable,
# so PyTorch provides a single, numerically stable function that combines the two.
criterion = nn.CrossEntropyLoss()

learning_rate = 0.01
optimizer = torch.optim.Adam(lstm.parameters(), lr=learning_rate)

for i in range(epoch):
    batch_x = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_y = np.zeros((batch_size, sequence_length, dimension), dtype=np.float32)
    batch_id = random.sample(possible_batch_id, batch_size)  # randomly sample the start index of each sequence
    # prev_s = np.zeros((batch_size, hidden_dim))
    for n in range(sequence_length):
        id1 = [k + n for k in batch_id]
        id2 = [k + n + 1 for k in batch_id]
        
        batch_x[:, n, :] = one_hot[id1]
        batch_y[:, n, :] = one_hot[id2]
    
    # Convert the NumPy batch to torch tensors and feed it to the network.
    output = lstm(torch.from_numpy(batch_x))  # torch.Size([64, 12, 75])
    label = torch.argmax(torch.from_numpy(batch_y).view(-1, dimension), dim=1) # shape = 768 (64 * 12)
    
    accuracy = np.mean(torch.argmax(output.view(-1, dimension), axis=1).numpy() == label.numpy())
    
    optimizer.zero_grad()
    loss = criterion(output.view(-1, dimension), label)
    loss.backward()
    optimizer.step()
    
    if(i + 1) % 50 == 0:
        print("epoch {}, loss {}, accuracy {}".format(i+1, loss.item(), accuracy)) 

LSTM(
  (LSTM): LSTM(75, 128, batch_first=True)
  (fc): Linear(in_features=128, out_features=75, bias=True)
)
epoch 50, loss 2.1111416816711426, accuracy 0.4856770833333333
epoch 100, loss 1.1410794258117676, accuracy 0.7096354166666666
epoch 150, loss 0.8636826872825623, accuracy 0.76953125
epoch 200, loss 0.6843307018280029, accuracy 0.8033854166666666
epoch 250, loss 0.6464931964874268, accuracy 0.82421875
epoch 300, loss 0.5510137677192688, accuracy 0.8346354166666666
epoch 350, loss 0.5469765067100525, accuracy 0.8463541666666666
epoch 400, loss 0.47095003724098206, accuracy 0.8346354166666666
epoch 450, loss 0.44830718636512756, accuracy 0.8489583333333334
epoch 500, loss 0.50926673412323, accuracy 0.8372395833333334
epoch 550, loss 0.4441101849079132, accuracy 0.8541666666666666
epoch 600, loss 0.45717760920524597, accuracy 0.8424479166666666
epoch 650, loss 0.4530371427536011, accuracy 0.8359375
epoch 700, loss 0.45306655764579773, accuracy 0.8450520833333334
epoch 750, loss 0.
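The comment in the training script notes that `nn.CrossEntropyLoss` fuses softmax and cross-entropy. The sketch below illustrates this on random logits (made-up data, with 75 classes to match `dimension`): the fused loss equals `log_softmax` followed by `nll_loss`, and it expects raw logits together with class indices instead of one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 75)          # raw scores, e.g. output.view(-1, dimension)
labels = torch.randint(0, 75, (6,))  # class indices, not one-hot vectors

loss_fused = nn.CrossEntropyLoss()(logits, labels)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(loss_fused.item(), loss_manual.item())
```

This is also why the model's `forward` returns the output of `self.fc` directly, without applying a softmax of its own.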