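### Language models

A language model assigns probabilities to language: the probability of each word in a sentence, of each word in a document, and so on.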
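
As a worked form of this idea, the probability of a sequence of $T$ words can be factored with the chain rule. The symbols $w_t$ and $T$ are notation introduced here for illustration and do not appear in the original text:

$$P(w_1, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}).$$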

### RNN

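#### Neural networks without hidden states

Consider a multilayer perceptron with a single hidden layer. We are given a mini-batch of samples $\boldsymbol{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs (the number of features, or the dimension of the feature vector). Let the hidden-layer activation function be $\phi$; the hidden-layer output $\boldsymbol{H} \in \mathbb{R}^{n \times h}$ is then computed as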

$$\boldsymbol{H} = \phi(\boldsymbol{X} \boldsymbol{W}_{xh} + \boldsymbol{b}_h),$$

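where $\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$ is the hidden-layer weight, $\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}$ is the hidden-layer bias, and $h$ is the number of hidden units. The two terms being added have different shapes, so the addition relies on broadcasting. Taking the hidden variable $\boldsymbol{H}$ as the input to the output layer, and letting $q$ be the number of outputs (for example, the number of classes in a classification problem), the output of the output layer is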

$$\boldsymbol{O} = \boldsymbol{H} \boldsymbol{W}_{hq} + \boldsymbol{b}_q,$$

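where $\boldsymbol{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$ is the output-layer weight, and $\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$ is the output-layer bias. For a classification problem, we can use $\text{softmax}(\boldsymbol{O})$ to compute the probability distribution over the output classes.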
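
The two equations above can be sketched in NumPy as follows. This is a minimal illustration rather than code from the original notebook: the sizes `n`, `d`, `h`, `q` are hypothetical, the weights are random, and tanh is assumed for $\phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, q = 4, 5, 3, 2  # hypothetical batch size, inputs, hidden units, outputs

X = rng.normal(size=(n, d))
W_xh = rng.normal(size=(d, h))
b_h = np.zeros((1, h))
W_hq = rng.normal(size=(h, q))
b_q = np.zeros((1, q))

# Hidden layer: b_h has shape (1, h) and is broadcast over the n rows
H = np.tanh(X @ W_xh + b_h)
# Output layer
O = H @ W_hq + b_q

# Row-wise softmax; subtracting the row maximum keeps exp() numerically stable
exp_O = np.exp(O - O.max(axis=1, keepdims=True))
probs = exp_O / exp_O.sum(axis=1, keepdims=True)
```

Each row of `probs` is a distribution over the `q` classes for one sample.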

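#### Recurrent neural networks with hidden states

Now consider input data with temporal dependence. Suppose $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$ is the mini-batch input at time step $t$ of the sequence, and $\boldsymbol{H}_t  \in \mathbb{R}^{n \times h}$ is the hidden variable at that time step. Unlike the multilayer perceptron, here we keep the hidden variable $\boldsymbol{H}_{t-1}$ from the previous time step and introduce a new weight parameter $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$, which describes how the previous time step's hidden variable is used at the current time step. Specifically, the hidden variable at time step $t$ is determined jointly by the current input and the previous hidden variable: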

$$\boldsymbol{H}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}  + \boldsymbol{b}_h).$$

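Compared with the multilayer perceptron, we have added the term $\boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}$. From the relationship between the hidden variables $\boldsymbol{H}_t$ and $\boldsymbol{H}_{t-1}$ of adjacent time steps, we can see that the hidden variable captures the history of the sequence up to the current time step, much like the state or memory of the network at that time step. For this reason the hidden variable is also called the hidden state. Because the definition of the hidden state at the current time step uses the hidden state of the previous time step, the computation above is recurrent. A network that uses such recurrent computation is a recurrent neural network (RNN).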

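There are many ways to construct a recurrent neural network. RNNs with the hidden state defined above are a very common kind. Unless stated otherwise, the RNNs in this chapter are all based on this recurrent computation of the hidden state. At time step $t$, the output of the output layer is computed as in the multilayer perceptron: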

$$\boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q.$$

The parameters of the RNN are the hidden-layer weights $\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$ and bias $\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}$, together with the output-layer weight $\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$ and bias $\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$. Notably, the RNN uses these same parameters at every time step, so the number of model parameters does not grow with the number of time steps.

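The figure below shows the computation of an RNN over three adjacent time steps. At time step $t$, computing the hidden state can be viewed as concatenating the input $\boldsymbol{X}_t$ with the previous hidden state $\boldsymbol{H}_{t-1}$ and feeding the result into a fully connected layer with activation function $\phi$. The output of that layer is the current hidden state $\boldsymbol{H}_t$; its weight is the concatenation of $\boldsymbol{W}_{xh}$ and $\boldsymbol{W}_{hh}$, and its bias is $\boldsymbol{b}_h$. The hidden state $\boldsymbol{H}_t$ takes part in computing the next hidden state $\boldsymbol{H}_{t+1}$ and is also fed into the fully connected output layer of the current time step.

![A recurrent neural network with hidden states](./img/rnn.svg)

In fact, computing $\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}$ in the hidden state is equivalent to multiplying the matrix obtained by concatenating $\boldsymbol{X}_t$ and $\boldsymbol{H}_{t-1}$ along the columns by the matrix obtained by concatenating $\boldsymbol{W}_{xh}$ and $\boldsymbol{W}_{hh}$ along the rows.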
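
The equivalence can be checked numerically. The sketch below uses small hypothetical sizes and random matrices; only the shapes follow the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 3, 4, 5  # hypothetical batch size, inputs, hidden units

X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_xh = rng.normal(size=(d, h))
W_hh = rng.normal(size=(h, h))

# Two separate products, as in the hidden-state equation
separate = X_t @ W_xh + H_prev @ W_hh

# One product of the concatenated matrices
XH = np.concatenate([X_t, H_prev], axis=1)  # shape (n, d + h)
W = np.concatenate([W_xh, W_hh], axis=0)    # shape (d + h, h)
combined = XH @ W
```

Both `separate` and `combined` have shape `(n, h)` and agree up to floating-point rounding.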

### Dataset

##### Lyrics from all of Jay Chou's albums

In [1]:
import random
import zipfile

with zipfile.ZipFile('data/jaychou_lyrics.txt.zip') as zin:
    with zin.open('jaychou_lyrics.txt') as f:
        corpus_chars = f.read().decode('utf-8')
corpus_chars[:40]

'想要有直升机\n想要和你飞到宇宙去\n想要和你融化在一起\n融化在宇宙里\n我每天每天每'

In [2]:
len(corpus_chars)

63282

In [10]:
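# The dataset has more than 60,000 characters. For easier printing we replace newlines with spaces, then use only the first 10,000 characters to train the model.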

corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
corpus_chars = corpus_chars[0:10000]
len(corpus_chars),corpus_chars[:40]

(10000, '想要有直升机 想要和你飞到宇宙去 想要和你融化在一起 融化在宇宙里 我每天每天每')

In [11]:
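# We map each character to a consecutive integer starting from 0, called its index, to simplify later processing.
# To obtain the indices, we collect all distinct characters in the dataset and map them one by one to indices to build the vocabulary.
# Then we print vocab_size, the number of distinct characters in the vocabulary.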
idx_2_char = list(set(corpus_chars))
char_2_idx = dict([(char, i) for i ,char in enumerate(idx_2_char)])
vocab_size = len(char_2_idx)
vocab_size

1027

In [15]:
# Next, convert every character in the training set to its index and print the indices of the first 20 characters.
corpus_index = [char_2_idx[char] for char in corpus_chars]
corpus_index[:20]

[949,
 130,
 63,
 277,
 327,
 479,
 10,
 949,
 130,
 745,
 768,
 981,
 647,
 93,
 576,
 577,
 10,
 949,
 130,
 745]

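### A character-level RNN model

#### Sampling sequential data

During training we need to read random mini-batches of samples and labels. Unlike the experimental data in earlier chapters, a sample of sequential data usually consists of consecutive characters. Suppose the number of time steps is 5 and the sample sequence is the 5 characters “想”, “要”, “有”, “直”, “升”. The label sequence is the next character in the training set for each of these, namely “要”, “有”, “直”, “升”, “机”. There are two ways to sample sequential data: random sampling and adjacent (consecutive) sampling.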
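
The sample and label sequences in this example differ only by a shift of one position, which a short sketch makes concrete (the variable names are illustrative):

```python
text = "想要有直升机"
num_steps = 5

# The sample is num_steps consecutive characters
sample = list(text[:num_steps])
# The label is the same window shifted right by one character
label = list(text[1:num_steps + 1])
```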

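The code below samples a random mini-batch from the data each time. The batch size batch_size is the number of samples per mini-batch, and num_steps is the number of time steps in each sample. In random sampling, each sample is a segment cut from an arbitrary position in the original sequence, and two consecutive random mini-batches are not necessarily adjacent in the original sequence. We therefore cannot use the final hidden state of one mini-batch to initialize the hidden state of the next; during training, the hidden state must be re-initialized before every random mini-batch.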

In [295]:
import random
import numpy as np
def data_iter_random(corpus_indices, batch_size, num_steps):
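    # Subtract 1 because each label is the input shifted right by one position
    
    # num_steps is the number of time steps, i.e. the number of characters in one sample
    # num_examples is the number of non-overlapping samples the corpus is cut into
    num_examples = (len(corpus_indices) - 1) // num_steps
    
    # epoch_size is the number of mini-batches (iterations) per epoch,
    # each holding batch_size samples
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)  # shuffle so that samples are drawn in random order

    # Return the sequence of length num_steps starting at pos
    def _data(pos):
        return corpus_indices[pos: pos + num_steps]

    for i in range(epoch_size):
        # Read batch_size random samples each time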
        i = i * batch_size
        batch_indices = example_indices[i: i + batch_size]
        X = [_data(j * num_steps) for j in batch_indices]
        Y = [_data(j * num_steps + 1) for j in batch_indices]
        yield np.array(X), np.array(Y)

In [296]:
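# Feed in an artificial sequence of consecutive integers from 0 to 29, with batch size 2 and 6 time steps.
# Print the input X and label Y of each mini-batch read by random sampling.
# As shown, two consecutive random mini-batches are not necessarily adjacent in the original sequence.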

In [297]:
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X:  [[12 13 14 15 16 17]
 [ 6  7  8  9 10 11]] 
Y: [[13 14 15 16 17 18]
 [ 7  8  9 10 11 12]] 

X:  [[18 19 20 21 22 23]
 [ 0  1  2  3  4  5]] 
Y: [[19 20 21 22 23 24]
 [ 1  2  3  4  5  6]] 



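#### Adjacent sampling

Besides random sampling of the original sequence, we can also make two consecutive mini-batches adjacent in the original sequence. We can then use the final hidden state of one mini-batch to initialize the hidden state of the next, so that the output of the next mini-batch also depends on the input of the current one, and so on. This affects the RNN implementation in two ways. First, during training we only need to initialize the hidden state at the start of each epoch. Second, when several adjacent mini-batches are chained together by passing the hidden state along, the gradient computation for the model parameters depends on the whole chain of mini-batch sequences, so within one epoch the cost of computing gradients keeps growing as iterations accumulate. To make the gradient depend only on the mini-batch read in the current iteration, we can detach the hidden state from the computation graph before reading each mini-batch.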

In [303]:
import numpy as np

def data_iter_consecutive(corpus_indices, batch_size, num_steps):
    corpus_indices = np.array(corpus_indices)
    data_len = len(corpus_indices)
    batch_len = data_len // batch_size
    # Split the corpus evenly into batch_size rows
    indices = corpus_indices[0: batch_size*batch_len].reshape((
        batch_size, batch_len))
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        i = i * num_steps
        # Take all batch_size rows; along the columns, take a window of length num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y

In [304]:
for X, Y in data_iter_consecutive(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')

X:  [[ 0  1  2  3  4  5]
 [15 16 17 18 19 20]] 
Y: [[ 1  2  3  4  5  6]
 [16 17 18 19 20 21]] 

X:  [[ 6  7  8  9 10 11]
 [21 22 23 24 25 26]] 
Y: [[ 7  8  9 10 11 12]
 [22 23 24 25 26 27]] 



### Implementing an RNN from scratch

In [55]:
import math
import random
import zipfile
import numpy as np


# Load the lyrics dataset

def load_data_jay_lyrics():
    with zipfile.ZipFile('data/jaychou_lyrics.txt.zip') as zin:
        with zin.open('jaychou_lyrics.txt') as f:
            corpus_chars = f.read().decode('utf-8')
        # Replace \n and \r with spaces
        corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    # Keep only the first 10000 characters
    corpus_chars = corpus_chars[0:10000]
    
    # Vocabulary
    idx_2_char = list(set(corpus_chars))
    char_2_idx = dict([(char, i) for i ,char in enumerate(idx_2_char)])
    # Vocabulary size
    vocab_size = len(char_2_idx)
    # The corpus as a list of indices
    corpus_index = [char_2_idx[char] for char in corpus_chars]
    return corpus_index,char_2_idx,idx_2_char,vocab_size

(corpus_indices, char_to_idx, idx_to_char, vocab_size) = load_data_jay_lyrics()

In [126]:
# One-hot vectors
# vocab_size=100
X = np.array([0, 2])

def one_hot(X, size):
    # Accept either a single index or a 1-D sequence of indices
    X = np.atleast_1d(X)
    # One row per index, with a single 1 in the column given by that index
    out = np.zeros((len(X), size))
    out[np.arange(len(X)), X] = 1
    return out

one_hot(X, vocab_size)

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

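Each sampled mini-batch has shape (batch size, number of time steps). The function below converts such a mini-batch into several matrices of shape (batch size, vocabulary size) that can be fed into the network; the number of matrices equals the number of time steps. In other words, the input at time step $t$ is $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$, where $n$ is the batch size and $d$ is the number of inputs, i.e. the length of the one-hot vector (the vocabulary size).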

In [169]:
def to_onehot(X, size):
    return [one_hot(x, size) for x in X.T] # each element has shape (batch size n, vocabulary size d)

# batch size = 2, number of time steps = 5
X = np.arange(10).reshape((2, 5))
inputs = to_onehot(X, vocab_size)
len(inputs), inputs[0].shape

(5, (2, 1027))

#### Initializing the model parameters

The number of hidden units, num_hiddens, is a hyperparameter.

In [170]:
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size

def get_params():
    def _one(shape):
        return np.random.normal(scale=0.01, size=shape)

    # Hidden-layer parameters
    # H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h)
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = np.zeros(num_hiddens)
    
    # Output-layer parameters
    # O_t = H_t W_hq + b_q
    W_hq = _one((num_hiddens, num_outputs))
    b_q = np.zeros(num_outputs)
    
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    return params

In [66]:
# get_params()

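#### Defining the model

We implement the model from the RNN equations. First, the init_rnn_state function returns the initial hidden state: a tuple containing one NumPy array of zeros with shape (batch size, number of hidden units). A tuple is used so that hidden states made up of several arrays are easier to handle.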

In [171]:
# Holds the hidden-layer state from time step t-1
def init_rnn_state(batch_size, num_hiddens):
    return (np.zeros(shape=(batch_size, num_hiddens)), )

In [172]:
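# The rnn function defines how the hidden state and output are computed at each time step. The activation function here is tanh.
# When the inputs are uniformly distributed over the reals, the mean of the tanh values is 0.

def rnn(inputs, state, params):
    # inputs and outputs are both num_steps matrices of shape (batch_size, vocab_size)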
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        H = np.tanh(np.dot(X, W_xh) + np.dot(H, W_hh) + b_h)
        Y = np.dot(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

In [186]:
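# A quick test to inspect the number of outputs (the number of time steps), the shape of the first time step's output, and the shape of the hidden state.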
state = init_rnn_state(X.shape[0], num_hiddens)
inputs = to_onehot(X, vocab_size)
params = get_params()
outputs, state_new = rnn(inputs, state, params)
len(outputs), outputs[0].shape, len(state_new),state_new[0].shape

(5, (2, 1027), 1, (2, 256))

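#### Defining the prediction function

The function below predicts the next num_chars characters given a prefix (a string of several characters). It is somewhat involved: the recurrent unit rnn is passed in as a function argument so that the function can be reused when other recurrent networks are introduced in later sections.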

In [196]:
def predict_rnn(prefix, num_chars, rnn, params, init_rnn_state,
                num_hiddens, vocab_size, idx_to_char, char_to_idx):
    state = init_rnn_state(1, num_hiddens)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
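        # Use the output of the previous time step as the input of the current time step
        X = to_onehot(np.array([output[-1]]), vocab_size)
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The next input is either the next character of the prefix or the current best prediction
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            # np.asscalar was removed from NumPy; .item() extracts the Python scalar
            output.append(int(Y[0].argmax(axis=1).item()))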
    return ''.join([idx_to_char[i] for i in output])

In [195]:
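# Test the predict_rnn function.
# We generate 10 characters of lyrics (not counting the prefix) from the prefix "分开".
# Because the model parameters are random, the prediction is random too.
predict_rnn('分开', 10, rnn, params, init_rnn_state, num_hiddens, vocab_size, idx_to_char, char_to_idx)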

'分开天期相十碎痛入轮银长'

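#### Gradient clipping

Vanishing or exploding gradients occur easily in recurrent neural networks. To deal with exploding gradients we can clip the gradient. Suppose we concatenate the elements of all model parameter gradients into one vector $\boldsymbol{g}$ and set the clipping threshold to $\theta$. The clipped gradient

$\min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right)\boldsymbol{g}$

has an L2 norm of at most $\theta$.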


In [139]:
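import math

def grad_clipping(params, theta):
    # Assumes each element of params exposes its gradient as a NumPy array in `.grad`
    norm = 0.0
    for param in params:
        norm += float((param.grad ** 2).sum())
    # L2 norm of all gradients taken together
    norm = math.sqrt(norm)
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm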
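
As a self-contained illustration of the clipping formula, the sketch below works on a plain list of NumPy gradient arrays instead of objects with a `.grad` attribute. The function name `clip_gradients` and the example values are hypothetical and are not part of the original notebook.

```python
import numpy as np

def clip_gradients(grads, theta):
    """Scale a list of gradient arrays in place so their joint L2 norm is at most theta.

    Returns the norm measured before clipping.
    """
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > theta:
        for g in grads:
            g *= theta / norm
    return norm

# Two gradient arrays whose joint norm is sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
norm_before = clip_gradients(grads, theta=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
```

After clipping, the direction of the joint gradient is unchanged and only its length is reduced to `theta`.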

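#### Perplexity

We usually evaluate a language model by its perplexity. Recall the definition of the cross-entropy loss from the softmax regression section. Perplexity is the exponential of the cross-entropy loss. In particular,

    In the best case, the model always predicts probability 1 for the label class, and the perplexity is 1.
    In the worst case, the model always predicts probability 0 for the label class, and the perplexity is positive infinity.
    In the baseline case, the model always predicts the same probability for every class (a uniform distribution), and the perplexity equals the number of classes.

Clearly, any useful model must have a perplexity smaller than the number of classes. In this example, the perplexity must be below the vocabulary size vocab_size.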
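
The best and baseline cases can be reproduced with a few lines of NumPy. This is a sketch: the helper `perplexity` is a hypothetical name, and it takes the probabilities that a model assigned to the true labels.

```python
import numpy as np

def perplexity(label_probs):
    """exp of the mean negative log-probability assigned to the true labels."""
    label_probs = np.asarray(label_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(label_probs))))

vocab = 1027  # vocabulary size reported earlier in this notebook

# Best case: the true label always gets probability 1
best = perplexity([1.0, 1.0, 1.0])
# Baseline: every class gets probability 1 / vocab
uniform = perplexity([1.0 / vocab] * 3)
```

Here `best` is 1 and `uniform` equals the vocabulary size up to rounding.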

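#### Defining the training function

Compared with the training functions in earlier chapters, the one here differs in the following ways:

    It evaluates the model with perplexity.
    It clips the gradient before updating the model parameters.
    Different sampling methods for sequential data lead to different hidden-state initialization.

In addition, with the other recurrent networks of later sections in mind, the function is written to be more general and is therefore somewhat longer.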

In [None]:
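%%time
import time
# The training loop below relies on MXNet's autograd (autograd.record, nd.concat,
# d2l.sgd, asscalar), so params and the mini-batches must be MXNet NDArrays with
# gradients attached; the NumPy helpers defined above would have to be ported first.
from mxnet import autograd, nd
from mxnet.gluon import loss as gloss
import d2lzh as d2l

def train_and_predict_rnn(rnn, get_params, init_rnn_state, num_hiddens,
                          vocab_size, corpus_indices, idx_to_char,
                          char_to_idx, is_random_iter, num_epochs, num_steps,
                          lr, clipping_theta, batch_size, pred_period,
                          pred_len, prefixes):
    if is_random_iter:
        # Random sampling
        data_iter_fn = data_iter_random
    else:
        # Adjacent (consecutive) sampling
        data_iter_fn = data_iter_consecutive
    params = get_params()
    loss = gloss.SoftmaxCrossEntropyLoss()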

    for epoch in range(num_epochs):
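        # With adjacent sampling, initialize the hidden state at the start of each epoch
        # Note: num_epochs counts passes over the whole corpus during training;
        # it is not the epoch_size computed inside the sampling functions
        if not is_random_iter:  
            state = init_rnn_state(batch_size, num_hiddens)
        l_sum, n, start = 0.0, 0, time.time()
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps)
        for X, Y in data_iter:
            # With random sampling, re-initialize the hidden state before each mini-batch,
            # because consecutive mini-batches are not adjacent in the corpus and the
            # previous hidden state cannot be reused
            if is_random_iter:  
                state = init_rnn_state(batch_size, num_hiddens)
            else:  # Otherwise detach the hidden state from the computation graph
                for s in state:
                    s.detach()
            with autograd.record():
                inputs = to_onehot(X, vocab_size)
                # outputs is num_steps matrices of shape (batch_size, vocab_size)
                (outputs, state) = rnn(inputs, state, params)
                # After concatenation the shape is (num_steps * batch_size, vocab_size)
                outputs = nd.concat(*outputs, dim=0)
                # Y has shape (batch_size, num_steps); transpose it and flatten to a vector of
                # length batch_size * num_steps so that it lines up with the rows of outputs
                y = Y.T.reshape((-1,))
                # Average classification error using the cross-entropy loss
                l = loss(outputs, y).mean()
            l.backward()
            grad_clipping(params, clipping_theta)  # Clip the gradient
            d2l.sgd(params, lr, 1)  # The loss is already averaged, so the gradient needs no further averaging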
            l_sum += l.asscalar() * y.size
            n += y.size

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f, time %.2f sec' % (
                epoch + 1, math.exp(l_sum / n), time.time() - start))
            for prefix in prefixes:
                print(' -', predict_rnn(
                    prefix, pred_len, rnn, params, init_rnn_state,
                    num_hiddens, vocab_size, idx_to_char, char_to_idx))

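#### The RNN loss function

The RNN uses the cross-entropy loss. The activation function of the RNN output is softmax, and the activation function of the hidden layer is tanh.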
<img src="img/2019-08-14_114820.png">

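#### RNN gradients: BPTT

In an RNN, the weight matrices are passed from one time step to the next. During backpropagation, computing the gradient of the loss at time $t$ requires summing the gradient contributions from time 0 through time $t$, and the earlier the time step, the more likely its gradient is to vanish or explode. BPTT also differs from backpropagation in a DNN in an important way: all of U, W, V, b, c are shared across every position in the sequence, so backpropagation updates the same parameters.

The loss is cross-entropy. Because an RNN has a loss at every position in the sequence, the total loss $L$ is: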

$\begin{aligned}E_t(y_t, \hat{y}_t) &= - y_{t} \log \hat{y}_{t} \\ E(y, \hat{y}) &=\sum\limits_{t} E_t(y_t,\hat{y}_t) \\ & = -\sum\limits_{t} y_{t} \log \hat{y}_{t} \end{aligned}$

$L = \sum\limits_{t=1}^{\tau}L^{(t)}$


where $y_t$ is the true value and $\hat{y}_t$ is the prediction. Unrolled over time, the error can be drawn as:

<img src="img/rnn-bptt1-e1461597894645.png">

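$x_0,x_1,x_2,x_3,x_4$ can be read as a sequence of words.

There is also the notion of batch_size; below we assume batch_size = 1.

    W: [num_hidden, num_hidden], the hidden-state transition matrix (t_0 -> t_1).
    U: [num_hidden, input_dim], the input-to-hidden matrix.
    V: [output_dim, num_hidden], the hidden-to-output matrix (output_dim equals input_dim here, the vocabulary size).
    
The parameters are W, U, V, b, c, where b is the bias of the input-to-hidden layer and c is the bias of the hidden-to-output layer.

We start with c and V because they are simple: they are not involved in the recurrence.

$\frac{\partial L}{\partial c} = \sum\limits_{t=1}^{\tau}\frac{\partial L^{(t)}}{\partial c}  = \sum\limits_{t=1}^{\tau}\left(\hat{y}^{(t)} - y^{(t)}\right)$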

$\frac{\partial L}{\partial V} =\sum\limits_{t=1}^{\tau}\frac{\partial L^{(t)}}{\partial V}  = \sum\limits_{t=1}^{\tau}(\hat{y}^{(t)} - y^{(t)}) (h^{(t)})^T$

$h^{(t)} = \sigma(z^{(t)}) = \sigma(Ux^{(t)} + Wh^{(t-1)} +b )$

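where σ is the activation function of the RNN, usually tanh, and b is the bias of the linear term.

The model output $o^{(t)}$ at sequence index $t$ has a simpler expression:
$o^{(t)} = Vh^{(t)} +c$

Finally, the predicted output at sequence index $t$ is:
$\hat{y}^{(t)} = \sigma(o^{(t)})$

Because an RNN is usually used as a classification model, the activation function in this last expression is typically softmax.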
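
Since $o^{(t)} = Vh^{(t)} + c$, the gradient with respect to $c$ at one position equals the gradient with respect to $o^{(t)}$, which for softmax with cross-entropy is $\hat{y}^{(t)} - y^{(t)}$. The sketch below checks this numerically for a single position; the vector size, the label index, and the helper names are assumptions made for illustration.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())  # subtract the maximum for numerical stability
    return e / e.sum()

def loss(o, y):
    # Cross-entropy for a one-hot label given by its index y
    return -np.log(softmax(o)[y])

rng = np.random.default_rng(0)
o = rng.normal(size=6)  # hypothetical output scores at one position
y = 2                   # hypothetical index of the true class

# Analytic gradient: y_hat - y, where y is one-hot
analytic = softmax(o)
analytic[y] -= 1.0

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(o)
for i in range(o.size):
    step = np.zeros_like(o)
    step[i] = eps
    numeric[i] = (loss(o + step, y) - loss(o - step, y)) / (2 * eps)
```

The two gradients agree to within the finite-difference error, which supports the formulas for $\partial L / \partial c$ and $\partial L / \partial V$ above.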



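The gradients of W, U, and b are more involved. From the RNN model we can see that, during backpropagation, the gradient at sequence position t is determined jointly by the gradient from the output at the current position and the gradient from sequence position t+1. The gradient of W at a given position t therefore has to be computed by backpropagating step by step. We define the gradient of the hidden state at sequence index t as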

<img src="img/2019-08-14_182857.png">
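
The image is not reproduced as text here. As a hedged sketch, assuming tanh for the hidden activation and softmax with cross-entropy at the output, the standard derivation from the equations above gives the following recursion, where $\delta^{(t)}$ is notation introduced for the hidden-state gradient:

$$\delta^{(t)} = \frac{\partial L}{\partial h^{(t)}}, \qquad \delta^{(\tau)} = V^T\left(\hat{y}^{(\tau)} - y^{(\tau)}\right),$$

$$\delta^{(t)} = V^T\left(\hat{y}^{(t)} - y^{(t)}\right) + W^T \operatorname{diag}\left(1 - (h^{(t+1)})^2\right)\delta^{(t+1)}, \qquad t < \tau.$$

With $\delta^{(t)}$ available, the remaining gradients follow by summing over time steps:

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{\tau} \operatorname{diag}\left(1 - (h^{(t)})^2\right)\delta^{(t)}(h^{(t-1)})^T, \quad \frac{\partial L}{\partial b} = \sum_{t=1}^{\tau} \operatorname{diag}\left(1 - (h^{(t)})^2\right)\delta^{(t)}, \quad \frac{\partial L}{\partial U} = \sum_{t=1}^{\tau} \operatorname{diag}\left(1 - (h^{(t)})^2\right)\delta^{(t)}(x^{(t)})^T.$$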

### Implementing an RNN with NumPy

In [251]:
import csv
import itertools
import operator
import numpy as np
import nltk
import sys
from datetime import datetime
from utils import *
from collections import Counter

import matplotlib.pyplot as plt
# %matplotlib inline

# Download NLTK model data (you need to do this once)
# nltk.download("book")


##### Preparing the data

In [256]:
vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print ("Reading CSV file...")
with open('data/reddit-comments-2015-08.csv', 'r', encoding="utf-8") as f:
    reader = csv.reader(f, skipinitialspace=True)
#     reader.next()
    # Split full comments into sentences
    sentences=[ x[0].strip() for x in reader]
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print("Parsed %d sentences." % (len(sentences)))
    
# Tokenize the sentences into words
tokenized_sentences = [sent.split() for sent in sentences]

# Count the word frequencies
word_freq =  Counter()
for s in sentences:
    words = s.split()
    if len(words) == 0:continue
    for word in words :
        word_freq[word] += 1
print("Found %d unique words tokens." % len(word_freq.items()))

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
print( "index_to_word top 10", index_to_word[0:10])
# print( "迈向:", word_to_index["迈向"])
print( "Using vocabulary size %d." % vocabulary_size)
print( "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print( "\nExample sentence: '%s'" % sentences[0])
print( "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])

Reading CSV file...
Parsed 15001 sentences.
Found 117581 unique words tokens.
index_to_word top 10 ['the', 'to', 'a', 'and', 'I', 'of', 'is', 'you', 'in', 'that']
Using vocabulary size 8000.
The least frequent word in our vocabulary is 'drastically' and appeared 11 times.

Example sentence: 'SENTENCE_START body SENTENCE_END'

Example sentence after Pre-processing: '['SENTENCE_START', 'body', 'SENTENCE_END']'


In [257]:
# Create the training data
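# Sentences have different lengths, so store them as an object array of lists
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences], dtype=object)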

In [258]:
# Print a training data example
x_example, y_example = X_train[17], y_train[17]
print( "x:\n%s\n%s" % (" ".join([index_to_word[x] for x in x_example]), x_example))
print( "\ny:\n%s\n%s" % (" ".join([index_to_word[x] for x in y_example]), y_example))

x:
SENTENCE_START I think UNKNOWN_TOKEN is one of the most interesting early sound UNKNOWN_TOKEN UNKNOWN_TOKEN UNKNOWN_TOKEN really made incredible use of sound, especially for a director who had UNKNOWN_TOKEN done silent films at the time. Much of the film is completely silent which adds to the UNKNOWN_TOKEN for me and the anxiety that the film UNKNOWN_TOKEN There's never really a wasted sound UNKNOWN_TOKEN every line of dialogue is extremely important. Even Peter UNKNOWN_TOKEN UNKNOWN_TOKEN is UNKNOWN_TOKEN to the UNKNOWN_TOKEN
[10, 4, 56, 7999, 6, 53, 5, 0, 107, 653, 567, 597, 7999, 7999, 7999, 65, 166, 3871, 117, 5, 6048, 457, 13, 2, 5719, 71, 67, 7999, 291, 5417, 4329, 31, 0, 287, 4167, 5, 0, 1692, 6, 347, 5417, 89, 1596, 1, 0, 7999, 13, 59, 3, 0, 4015, 9, 0, 1692, 7999, 454, 130, 65, 2, 4168, 597, 7999, 153, 509, 5, 7334, 6, 780, 2932, 426, 7850, 7999, 7999, 6, 7999, 1, 0, 7999]

y:
I think UNKNOWN_TOKEN is one of the most interesting early sound UNKNOWN_TOKEN UNKNOWN_TOKEN UNKNO

#### Building the RNN

In [263]:
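def softmax(x):
    # Subtract the maximum before exponentiating to avoid overflow; the result is unchanged
    e = np.exp(x - np.max(x, axis=0))
    return e / np.sum(e, axis=0)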

In [289]:
class RNNNumpy:
    
    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

    def forward_propagation(self, x):
        # The total number of time steps
        T = len(x)
        # During forward propagation we save all hidden states in s because we need them later.
        # We add one additional element for the initial hidden, which we set to 0
        s = np.zeros((T + 1, self.hidden_dim))
        s[-1] = np.zeros(self.hidden_dim)
        # The outputs at each time step. Again, we save them for later.
        o = np.zeros((T, self.word_dim))
        # For each time step...
        for t in np.arange(T):
            # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
            s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
            o[t] = softmax(self.V.dot(s[t]))
        return [o, s]

# RNNNumpy.forward_propagation = forward_propagation

    def predict(self, x):
        # Perform forward propagation and return index of the highest score
        o, s = self.forward_propagation(x)
        return np.argmax(o, axis=1)

    # RNNNumpy.predict = predict
    def calculate_total_loss(self, x, y):
        L = 0
        # For each sentence...
        for i in np.arange(len(y)):
            o, s = self.forward_propagation(x[i])
            # We only care about our prediction of the "correct" words   

            # y_n is a one-hot vector,
            # so L only needs the predicted probability at the position of the 1.
            correct_word_predictions = o[np.arange(len(y[i])), y[i]]
            # Add to the loss based on how off we were
            L += -1 * np.sum(np.log(correct_word_predictions))
        return L

    def calculate_loss(self, x, y):
        # Divide the total loss by the number of training examples
        # Total number of words; the built-in sum handles a generator, np.sum does not
        N = sum(len(y_i) for y_i in y)
        return self.calculate_total_loss(x,y)/N

    def bptt(self, x, y):
        T = len(y)
        # Perform forward propagation
        o, s = self.forward_propagation(x)
        # We accumulate the gradients in these variables
        dLdU = np.zeros(self.U.shape)
        dLdV = np.zeros(self.V.shape)
        dLdW = np.zeros(self.W.shape)
        delta_o = o
        # Pick the y_hat entry of every true label in one step and subtract 1 (y_hat - y)
        delta_o[np.arange(len(y)), y] -= 1.
        # For each output backwards...
        for t in np.arange(T)[::-1]:
            dLdV += np.outer(delta_o[t], s[t].T)
            # Initial delta calculation
            delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
            # Backpropagation through time (for at most self.bptt_truncate steps)
            for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
                # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
                dLdW += np.outer(delta_t, s[bptt_step-1])              
                dLdU[:,x[bptt_step]] += delta_t
                # Update delta for next step
                delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
        return [dLdU, dLdV, dLdW]
    
    # Gradient check: compare the BPTT gradients against numerical estimates
    def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
        # Calculate the gradients using backpropagation. We want to check if these are correct.
        bptt_gradients = self.bptt(x, y)
        # List of all parameters we want to check.
        model_parameters = ['U', 'V', 'W']
        # Gradient check for each parameter
        for pidx, pname in enumerate(model_parameters):
            # Get the actual parameter value from the model, e.g. model.W
            parameter = operator.attrgetter(pname)(self)
            print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))
            # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
            it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
            while not it.finished:
                ix = it.multi_index
                # Save the original value so we can reset it later
                original_value = parameter[ix]
                # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
                parameter[ix] = original_value + h
                gradplus = self.calculate_total_loss([x],[y])
                parameter[ix] = original_value - h
                gradminus = self.calculate_total_loss([x],[y])
                estimated_gradient = (gradplus - gradminus)/(2*h)
                # Reset parameter to original value
                parameter[ix] = original_value
                # The gradient for this parameter calculated using backpropagation
                backprop_gradient = bptt_gradients[pidx][ix]
                # Calculate the relative error: (|x - y|/(|x| + |y|))
                relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
                # If the error is too large, fail the gradient check
                if relative_error > error_threshold:
                    print("Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                    print("+h Loss: %f" % gradplus)
                    print("-h Loss: %f" % gradminus)
                    print("Estimated_gradient: %f" % estimated_gradient)
                    print("Backpropagation gradient: %f" % backprop_gradient)
                    print("Relative Error: %f" % relative_error)
                    return 
                it.iternext()
            print("Gradient check for parameter %s passed." % (pname))

    # Performs one step of SGD.
    def sgd_step(self, x, y, learning_rate):
        # Calculate the gradients
        dLdU, dLdV, dLdW = self.bptt(x, y)
        # Change parameters according to gradients and learning rate
        self.U -= learning_rate * dLdU
        self.V -= learning_rate * dLdV
        self.W -= learning_rate * dLdW

In [292]:
import numpy as np
 
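print("Some notes on multiplying one-dimensional arrays")
A=np.array([1,2,3])
B=np.array([2,3,4])
c=[1,2,3]
print("*:",A*B)# element-wise product of the two arrays
print("np.dot():",np.dot(A,B))# for rank-1 arrays, dot multiplies element-wise and sums (inner product)
print("np.multiply():",np.multiply(A,B))# element-wise product of the two arrays
print("np.outer():",np.outer(A,B))# each element of A times all of B gives one row of the result

Some notes on multiplying one-dimensional arrays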
*: [ 2  6 12]
np.dot(): 20
np.multiply(): [ 2  6 12]
np.outer(): [[ 2  3  4]
 [ 4  6  8]
 [ 6  9 12]]


In [290]:
np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print(len(X_train[10]))
print( o.shape)
print( o)

43
(43, 8000)
[[0.00012515 0.00012471 0.00012528 ... 0.00012576 0.00012413 0.0001239 ]
 [0.00012522 0.0001245  0.00012627 ... 0.00012484 0.00012424 0.00012467]
 [0.0001247  0.00012517 0.00012424 ... 0.00012478 0.00012466 0.00012511]
 ...
 [0.00012421 0.00012493 0.00012506 ... 0.00012499 0.00012491 0.00012363]
 [0.00012615 0.00012495 0.00012454 ... 0.00012503 0.00012509 0.00012388]
 [0.00012466 0.000125   0.00012489 ... 0.00012541 0.00012528 0.00012475]]


In [291]:
predictions = model.predict(X_train[10])
print( predictions.shape)
# for pred in predictions:
#     print(index_to_word[pred])

(43,)


In [286]:
# Limit to 1000 examples to save time
# For the whole corpus, assume the vocabulary has C words and a random model gives each word probability 1/C,
# so the loss is L = -1/N * N * log(1/C) = log(C):
print( "Expected Loss for random predictions: %f" % np.log(vocabulary_size))
print( "Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))

Expected Loss for random predictions: 8.987197
Actual loss: 8.987130
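
The expected value above can be reproduced without the model. The sketch below averages the negative log-probability of a uniform prediction over a hypothetical number of words (`num_words` is an assumption; any positive count gives the same average).

```python
import math

vocabulary_size = 8000
uniform_prob = 1.0 / vocabulary_size
num_words = 1000  # hypothetical number of predicted words

# Sum of -log(1/C) over all words, then divide by the number of words
total = -sum(math.log(uniform_prob) for _ in range(num_words))
expected_loss = total / num_words
```

The average equals `math.log(vocabulary_size)`, which matches the "Expected Loss for random predictions" line in the output.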


In [None]:
# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

In [None]:
# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5  
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
    # Return the recorded losses so the caller can plot them
    return losses

#### Done; start training:

In [None]:
np.random.seed(10)
model = RNNNumpy(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

In [None]:
np.random.seed(10)
# Train on a small subset of the data to see what happens
model = RNNNumpy(vocabulary_size)
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=10, evaluate_loss_after=1)

#### Generating text

Generation is simply the model being applied: we just call the prediction function repeatedly:

In [220]:
def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        # forward_propagation returns [o, s]; o[-1] is the distribution over the next word
        o, _ = model.forward_propagation(new_sentence)
        next_word_probs = o[-1]
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs)
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str
 
num_sentences = 10
senten_min_length = 7
 
for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print(" ".join(sent))

### Implementing an RNN with Keras

In [300]:
from keras.layers import Dense, Activation, SimpleRNN
from keras.models import Sequential

import numpy as np
import math
import random
import zipfile


# Load the lyrics dataset
def load_data_jay_lyrics():
    with zipfile.ZipFile('data/jaychou_lyrics.txt.zip') as zin:
        with zin.open('jaychou_lyrics.txt') as f:
            corpus_chars = f.read().decode('utf-8')
        # Replace \n and \r with spaces
        corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    # Keep only the first 10000 characters
    corpus_chars = corpus_chars[0:10000]

    # Vocabulary: index -> character and character -> index
    idx_2_char = list(set(corpus_chars))
    char_2_idx = dict([(char, i) for i, char in enumerate(idx_2_char)])
    # Vocabulary size
    vocab_size = len(char_2_idx)
    # The corpus as a list of vocabulary indices
    corpus_index = [char_2_idx[char] for char in corpus_chars]
    return corpus_index, char_2_idx, idx_2_char, vocab_size

(corpus_indices, char_to_idx, idx_to_char, vocab_size) = load_data_jay_lyrics()

# Define the model
# A recurrent neural network layer with a single hidden layer of 256 hidden units, with its weights initialized.
num_hiddens = 256
num_epochs, batch_size, lr, clipping_theta = 250, 32, 1e2, 1e-2
num_steps = 35

model = Sequential()
# model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(num_hiddens, return_sequences=False,
                    input_shape=(num_steps, vocab_size), unroll=True))
model.add(Dense(vocab_size))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

In [321]:
%%time
import time

# Training and prediction
NUM_EPOCHS_PER_ITERATION=1
pred_period=50
prefixes = ['分开', '不分开']
is_random_iter = False
num_chars = 50
if is_random_iter:
    data_iter_fn = data_iter_random
else:
    data_iter_fn = data_iter_consecutive

for epoch in range(num_epochs):
    l_sum, n, start = 0.0, 0, time.time()
    data_iter = data_iter_fn(corpus_indices, batch_size, num_steps)
    for X, Y in data_iter:
        X = to_onehot(X, vocab_size)
        Y = to_onehot(Y, vocab_size)
        model.fit(X, Y, batch_size=batch_size, epochs=NUM_EPOCHS_PER_ITERATION)
    if (epoch + 1) % pred_period == 0:
#         print('epoch %d, perplexity %f, time %.2f sec' % (
#             epoch + 1, math.exp(l_sum / n), time.time() - start))
        for prefix in prefixes:
            output = [char_to_idx[prefix[0]]]
            for t in range(num_chars + len(prefix) - 1):
                X = to_onehot(np.array([output[-1]]), vocab_size)
                pred = model.predict(X, verbose=0)[0]
                if t < len(prefix) - 1:
                    output.append(char_to_idx[prefix[t + 1]])
                else:
                    output.append(np.argmax(pred))

            print(' -', ''.join([idx_to_char[i] for i in output]))

Wall time: 35.3 s


### First, an implementation from another blog

In [378]:
import numpy as np
from keras.layers import Dense, Activation, SimpleRNN, LSTM, GRU
from keras.models import Sequential


# Read lines from an example source file.
with open("data/alice_in_wonderland.txt", 'r', encoding='utf-8') as _in:
    lines = []
    for line in _in:
        line = line.strip().lower()
        if len(line) == 0:
            continue
        lines.append(line)
text = " ".join(lines)
chars = set([c for c in text])
nb_chars = len(chars)  # vocab_size
len(lines),nb_chars,len(text)

(2791, 59, 161803)

In [379]:
# Create a character index and reverse mapping to go between a numerical
# ID and a specific character. The numerical ID will correspond to a column
# number when using a one-hot encoded representation of character inputs.
char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}

In [340]:
# For convenience, choose a fixed sequence length of 10 characters.
SEQLEN, STEP = 10, 1  # num_steps
input_chars, label_chars = [], []

# Convert the data into a series of different SEQLEN-length subsequences.
for i in range(0, len(text) - SEQLEN, STEP):
    input_chars.append(text[i: i + SEQLEN])
    label_chars.append(text[i + SEQLEN])
input_chars[:10],label_chars[:10],len(input_chars)

(['project gu',
  'roject gut',
  'oject gute',
  'ject guten',
  'ect gutenb',
  'ct gutenbe',
  't gutenber',
  ' gutenberg',
  'gutenberg’',
  'utenberg’s'],
 ['t', 'e', 'n', 'b', 'e', 'r', 'g', '’', 's', ' '],
 161793)

In [344]:
# Compute the one-hot encoding of the input sequences X and the next
# character (the label) y
X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool)
y = np.zeros((len(input_chars), nb_chars), dtype=bool)
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
    y[i, char2index[label_chars[i]]] = 1
X.shape,y.shape

((161793, 10, 59), (161793, 59))

In [None]:
# Set up a bunch of metaparameters for the network and training regime.
BATCH_SIZE, HIDDEN_SIZE = 128, 128
NUM_ITERATIONS = 25
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 100

In [345]:
# Create a super simple recurrent neural network. There is one recurrent
# layer that produces an embedding of size HIDDEN_SIZE from the one-hot
# encoded input layer. This is followed by a Dense fully-connected layer
# across the set of possible next characters, which is converted to a
# probability score via a standard softmax activation with a multi-class
# cross-entropy loss function linking the prediction to the one-hot
# encoding character label.
model = Sequential()
model.add(
#     GRU(  # Note you can vary this with LSTM or SimpleRNN to try alternatives.
#         HIDDEN_SIZE,
#         return_sequences=False,
#         input_shape=(SEQLEN, nb_chars),
#         unroll=True
#     )
    SimpleRNN(HIDDEN_SIZE, 
              return_sequences=False,
              input_shape=(SEQLEN, nb_chars),
              unroll=True)
)
model.add(Dense(nb_chars))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

In [351]:
%%time
# Execute a series of training and demonstration iterations.
for iteration in range(NUM_ITERATIONS):

    # For each iteration, run the model fitting procedure for a number of epochs.
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)

    # Select a random example input sequence.
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]

    # For a number of prediction steps using the current version of the trained
    # model, construct a one-hot encoding of the test input and append a prediction.
    print("Generating from seed: %s" % (test_chars))
#     print(test_chars, end="")
    for i in range(NUM_PREDS_PER_EPOCH):

        # Here is the one-hot encoding.
        X_test = np.zeros((1, SEQLEN, nb_chars))
        for j, ch in enumerate(test_chars):
            X_test[0, j, char2index[ch]] = 1

        # Make a prediction with the current model.
        pred = model.predict(X_test, verbose=0)[0]
        y_pred = index2char[np.argmax(pred)]

        # Print the prediction appended to the test example.
        print(y_pred, end="")

        # Increment the test example to contain the prediction as if it
        # were the correct next letter.
        test_chars = test_chars[1:] + y_pred
    break
print("\n")

Iteration #: 0
Epoch 1/1
Generating from seed: k half the
 round the caterus the was so done the rabbit the mare the rabbit the mare the rabbit the mare the r

Wall time: 11.3 s


    The sky wa -> s
    he sky was ->  
    e sky was  -> f
     sky was f -> a
    sky was fa -> l

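In the algorithm above, each input sample is a sequence of SEQLEN characters, and the label is the single character that follows that sequence.

This differs from the example in Dive into Deep Learning (动手学深度学习): there, every character of a sample is an input, and its label is the character that follows it.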
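
To make the difference concrete, here is a minimal sketch that builds both kinds of training pairs from a toy string. The function names `many_to_one_pairs` and `many_to_many_pairs` and the toy text are illustrative only; they do not come from either source.

```python
def many_to_one_pairs(text, seqlen, step=1):
    """Each sample is a window of `seqlen` characters; the label is the single next character."""
    inputs, labels = [], []
    for i in range(0, len(text) - seqlen, step):
        inputs.append(text[i:i + seqlen])
        labels.append(text[i + seqlen])
    return inputs, labels


def many_to_many_pairs(text, seqlen, step=1):
    """Each sample is a window; the label is the same window shifted right by one character."""
    inputs, labels = [], []
    for i in range(0, len(text) - seqlen, step):
        inputs.append(text[i:i + seqlen])
        labels.append(text[i + 1:i + 1 + seqlen])
    return inputs, labels


toy = "the sky was falling"
x1, y1 = many_to_one_pairs(toy, seqlen=10)
x2, y2 = many_to_many_pairs(toy, seqlen=10)
print(repr(x1[0]), "->", repr(y1[0]))  # one label character per window
print(repr(x2[0]), "->", repr(y2[0]))  # one label character per time step
```

In the first scheme the network only needs its last output, while the second needs an output at every time step, which is why the labels have the same length as the inputs.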

### Adapting it by following the same pattern

In [380]:
import numpy as np
from keras.layers import Dense, Activation, SimpleRNN, LSTM, GRU
from keras.models import Sequential
import random
import zipfile

with zipfile.ZipFile('data/jaychou_lyrics.txt.zip') as zin:
    with zin.open('jaychou_lyrics.txt') as f:
        text = f.read().decode('utf-8')

chars = list(set(text))
nb_chars = len(chars)
nb_chars,len(text)

(2583, 63282)

In [381]:
char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}

In [391]:
SEQLEN, STEP = 10, 10
# Set STEP = SEQLEN so that consecutive inputs do not overlap

input_chars, label_chars = [], []

for i in range(0, len(text) - 1, STEP):
    input_chars.append(text[i: i + SEQLEN])
    label_chars.append(text[i+1: i+1+ SEQLEN])
input_chars[:10],label_chars[:10],len(input_chars)

(['想要有直升机\n想要和',
  '你飞到宇宙去\n想要和',
  '你融化在一起\n融化在',
  '宇宙里\n我每天每天每',
  '天在想想想想著你\n这',
  '样的甜蜜\n让我开始乡',
  '相信命运\n感谢地心引',
  '力\n让我碰到你\n漂亮',
  '的让我面红的可爱女人',
  '\n温柔的让我心疼的可'],
 ['要有直升机\n想要和你',
  '飞到宇宙去\n想要和你',
  '融化在一起\n融化在宇',
  '宙里\n我每天每天每天',
  '在想想想想著你\n这样',
  '的甜蜜\n让我开始乡相',
  '信命运\n感谢地心引力',
  '\n让我碰到你\n漂亮的',
  '让我面红的可爱女人\n',
  '温柔的让我心疼的可爱'],
 6329)

In [392]:
X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool)
y = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool)
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
for i, label_char in enumerate(label_chars):
    for j, ch in enumerate(label_char):
        y[i, j, char2index[ch]] = 1
X.shape,y.shape

((6329, 10, 2583), (6329, 10, 2583))

In [393]:
BATCH_SIZE, HIDDEN_SIZE = 32, 256
NUM_ITERATIONS = 25
NUM_EPOCHS_PER_ITERATION = 1
NUM_PREDS_PER_EPOCH = 50

In [398]:
model = Sequential()
model.add(
#     GRU(  # Note you can vary this with LSTM or SimpleRNN to try alternatives.
#         HIDDEN_SIZE,
#         return_sequences=False,
#         input_shape=(SEQLEN, nb_chars),
#         unroll=True
#     )
    SimpleRNN(HIDDEN_SIZE, 
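              # return_sequences: if True, return the whole output sequence; otherwise return only
              # the last output of the sequence. Use True when stacking several RNN layers
              # (i.e. several hidden layers), which can give better results.
              # True:  input is [samples, time_steps, input_dim], output is [samples, time_steps, output_dim]
              # False: input is [samples, time_steps, input_dim], output is [samples, output_dim]
              #        There is no time_steps axis, so a whole window maps to a single output (see the example above).
              return_sequences=True,
              input_shape=(SEQLEN, nb_chars),  # input_shape=(number of time steps, feature dimension)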
              unroll=True)
)
model.add(Dense(nb_chars))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
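
The shape comments on `return_sequences` can be checked without Keras. The following is a minimal NumPy sketch of the recurrence $\boldsymbol{H}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h)$ with $\phi = \tanh$. The function name `simple_rnn_forward`, the `_demo` variables, and the toy sizes are assumptions made for illustration; this is not how Keras implements the layer internally.

```python
import numpy as np


def simple_rnn_forward(X, W_xh, W_hh, b_h, return_sequences=False):
    """Apply H_t = tanh(X_t W_xh + H_{t-1} W_hh + b_h) over all time steps.

    X has shape (samples, time_steps, input_dim).
    """
    n, time_steps, _ = X.shape
    h = W_hh.shape[0]
    H_t = np.zeros((n, h))  # initial hidden state, assumed to be all zeros
    states = []
    for t in range(time_steps):
        H_t = np.tanh(X[:, t, :] @ W_xh + H_t @ W_hh + b_h)
        states.append(H_t)
    if return_sequences:
        return np.stack(states, axis=1)  # (samples, time_steps, h)
    return H_t                           # (samples, h)


rng = np.random.default_rng(0)
n_demo, steps_demo, d_demo, h_demo = 4, 10, 7, 5
X_demo = rng.normal(size=(n_demo, steps_demo, d_demo))
W_xh = rng.normal(scale=0.1, size=(d_demo, h_demo))
W_hh = rng.normal(scale=0.1, size=(h_demo, h_demo))
b_h = np.zeros((1, h_demo))

all_states = simple_rnn_forward(X_demo, W_xh, W_hh, b_h, return_sequences=True)
last_state = simple_rnn_forward(X_demo, W_xh, W_hh, b_h, return_sequences=False)
print(all_states.shape)  # (4, 10, 5)
print(last_state.shape)  # (4, 5)
```

With `return_sequences=True` the following `Dense` layer sees one hidden state per time step, which is what the per-character labels in this section require.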

In [410]:
%%time
# Execute a series of training and demonstration iterations.
for iteration in range(NUM_ITERATIONS):

    # For each iteration, run the model fitting procedure for a number of epochs.
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)

    # Select a random example input sequence.
#     test_chars = "分开"
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]
    
    # For a number of prediction steps using the current version of the trained
    # model, construct a one-hot encoding of the test input and append a prediction.
    print("Generating from seed: %s" % (test_chars))
    for i in range(NUM_PREDS_PER_EPOCH):

        # Here is the one-hot encoding.
        X_test = np.zeros((1, SEQLEN, nb_chars))
        for j, ch in enumerate(test_chars):
            X_test[0, j, char2index[ch]] = 1

        # Make a prediction with the current model.
        pred = model.predict(X_test, verbose=0)[0]
#         print(pred.shape)
        y_pred = index2char[np.argmax(pred[-1,:])]
        
        # Print the prediction appended to the test example.
        print(y_pred, end="")

        # Increment the test example to contain the prediction as if it
        # were the correct next letter.
        # Slide the window by exactly one character so it keeps its length.
        test_chars = test_chars[1:] + y_pred
#     print(test_chars)
#     output = [char2index[test_chars[0]]]
#     for i in range(NUM_PREDS_PER_EPOCH + len(test_chars) - 1):
#         # Here is the one-hot encoding.
#         X_test = np.zeros((1, 1, nb_chars))
#         ch = output[-1]
#         X_test[0, 0, ch] = 1

#         # Make a prediction with the current model.
#         pred = model.predict(X_test, verbose=0)[0]
#         y_pred = index2char[np.argmax(pred)]
#         if i < len(test_chars)-1:
#             output.append(char2index[test_chars[t + 1]])
#         else:
#             output.append(np.argmax(pred))
#     print(' -'.join([index2char[i] for i in output]))
print("\n")

Iteration #: 0
Epoch 1/1
Generating from seed: 一起 努力 的感觉

我要的 













































Iteration #: 1
Epoch 1/1
Generating from seed: 疤
血染盔甲我挥泪杀

一是的













































Iteration #: 2
Epoch 1/1
Generating from seed: 蛮好看
黑板是吸收知
Iteration #: 3
Epoch 1/1
Generating from seed: 留我孤单　
在湖面成
了  再













































Iteration #: 4
Epoch 1/1
Generating from seed: 慢慢来 这首歌我自己
再以
当













































Iteration #: 5
Epoch 1/1
Generating from seed: 脸上
麦田已倒向战车
经着不
Iteration #: 6
Epoch 1/1

KeyboardInterrupt: 

### Implementing from scratch (a version found online)

In [411]:
vocabulary_size = 8000
unknown_token = 'UNKNOWN_TOKEN'
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

In [None]:
import csv
import itertools
import nltk
import numpy as np
import operator
import datetime
import sys

nltk.download()
print("download ok.")

In [None]:
https://github.com/satojkovic/SimpleRNN/blob/master/rnn_tutorial_own.ipynb

### On what is stored while training an RNN

3.14.4. Training deep learning models

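When training a deep learning model, forward propagation and backpropagation depend on each other. Below we again use the example model from this section to describe the dependency in each direction.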

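On the one hand, forward propagation may depend on the current values of the model parameters, and these parameters are updated by the optimization algorithm after backpropagation has computed the gradients. For example, computing the regularization term $s = (\lambda/2)\left(\|\boldsymbol{W}^{(1)}\|_F^2 + \|\boldsymbol{W}^{(2)}\|_F^2\right)$ depends on the current values of the model parameters $\boldsymbol{W}^{(1)}$ and $\boldsymbol{W}^{(2)}$, and those current values come from the optimizer's most recent update based on the gradients from backpropagation.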

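On the other hand, the gradient computation in backpropagation may depend on the current values of the variables, and these values are obtained by forward propagation. For example, computing the parameter gradient $\partial J/\partial \boldsymbol{W}^{(2)} = (\partial J/\partial \boldsymbol{o})\boldsymbol{h}^\top + \lambda \boldsymbol{W}^{(2)}$ requires the current value of the hidden-layer variable $\boldsymbol{h}$. This value is computed and stored during forward propagation from the input layer to the output layer.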

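Therefore, once the model parameters are initialized, we alternate between forward propagation and backpropagation, and update the model parameters using the gradients computed by backpropagation. Because backpropagation reuses the intermediate variables computed during forward propagation to avoid recomputing them, their memory cannot be released as soon as forward propagation finishes. This is one important reason why training uses more memory than prediction. Note also that the number of these intermediate variables is roughly linear in the number of network layers, and the size of each variable is linear in the batch size and the number of inputs; this is the main reason deeper networks trained with larger batches run out of memory more easily.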
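
As a small worked check of this dependency, the sketch below runs one forward pass, keeps $\boldsymbol{h}$ and $\boldsymbol{o}$, and then reuses them to form $\partial J/\partial \boldsymbol{W}^{(2)} = (\partial J/\partial \boldsymbol{o})\boldsymbol{h}^\top + \lambda \boldsymbol{W}^{(2)}$, comparing the result with a finite-difference estimate. The squared-error loss, the $\tanh$ activation, the absence of bias terms, and all sizes are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, q, lam = 3, 4, 2, 0.1
x = rng.normal(size=(d, 1))
y = rng.normal(size=(q, 1))
W1 = rng.normal(size=(hidden, d))
W2 = rng.normal(size=(q, hidden))


def objective(W1, W2):
    """Forward pass: return J together with the intermediate values h and o."""
    z = W1 @ x
    h = np.tanh(z)
    o = W2 @ h
    loss = 0.5 * np.sum((o - y) ** 2)                   # assumed squared-error loss
    s = lam / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # Frobenius-norm regularizer
    return loss + s, h, o


# Forward pass: h and o are kept because the backward pass needs them.
J, h, o = objective(W1, W2)

# Backward pass for W2, reusing the stored h and o.
dJ_do = o - y
dJ_dW2 = dJ_do @ h.T + lam * W2

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(q):
    for j in range(hidden):
        W2_plus, W2_minus = W2.copy(), W2.copy()
        W2_plus[i, j] += eps
        W2_minus[i, j] -= eps
        numeric[i, j] = (objective(W1, W2_plus)[0] - objective(W1, W2_minus)[0]) / (2 * eps)

print(np.max(np.abs(numeric - dJ_dW2)))  # should be close to zero
```

If `h` had been discarded after the forward pass, it would have to be recomputed before `dJ_dW2` could be formed, which is the trade-off between memory and computation described above.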

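3.14.5. Summary

- Forward propagation computes and stores the intermediate variables of the neural network in order, from the input layer to the output layer.
- Backpropagation computes and stores the gradients of the intermediate variables and parameters in order, from the output layer to the input layer.
- When training a deep learning model, forward propagation and backpropagation depend on each other.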

