# Modeling Sequential Data Using Recurrent Neural Networks

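- Introducing sequential data
- RNNs for modeling sequences
- Long short-term memory (LSTM)
- Truncated backpropagation through time (TBPTT)
- Implementing a multilayer RNN for sequence modeling in TensorFlow
- Project 1 - Sentiment analysis of the IMDb movie review dataset with an RNN
- Project 2 - RNN character-level language modeling with LSTM units, using text from Shakespeare's *Hamlet*
- Using gradient clipping to avoid exploding gradients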

## Introducing sequential data

### Modeling sequential data - order matters

### Representing sequences

![1](1.png)

How RNNs differ from CNNs and MLPs:

An RNN can remember past information and process new data accordingly.

### The different categories of sequence modeling

![2](2.png)

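- __Many-to-one__: The input is a sequence, but the output is a fixed-size vector. Example: sentiment analysis, where the input is text and the output is a class label.
- __One-to-many__: The input is in a standard (non-sequence) format, but the output is a sequence. Example: image captioning, where the input is an image and the output is an English phrase.
- __Many-to-many__: Both the input and the output are sequences. This category splits further by whether input and output are synchronized. A synchronized example is video classification, where every frame is labeled. An unsynchronized example is translating one language into another.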

## RNNs for modeling sequences

### Understanding the structure and flow of an RNN

![3](3.png)

![4](4.png)

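### Computing activations in an RNN

- $W_{xh}$: the weight matrix between the input layer and the hidden layer
- $W_{hh}$: the weight matrix associated with the recurrent edge
- $W_{hy}$: the weight matrix between the hidden layer and the output layer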

![5](5.png)

Net input:<br>
$z_h^{(t)} = W_{xh}x^{(t)} + W_{hh}h^{(t-1)} + b_h$

The activation of the hidden layer is
\begin{equation}
\boldsymbol{h}^{(t)}=\phi_{h}\left(z_{h}^{(t)}\right)=\phi_{h}\left(\boldsymbol{W}_{x h} \boldsymbol{x}^{(t)}+\boldsymbol{W}_{h h} \boldsymbol{h}^{(t-1)}+\boldsymbol{b}_{h}\right)
\end{equation}

\begin{equation}
\boldsymbol{h}^{(t)}=\phi_{h}\left(\left[\boldsymbol{W}_{x h} ; \boldsymbol{W}_{h h}\right]\left[\begin{array}{c}{\boldsymbol{x}^{(t)}} \\ {\boldsymbol{h}^{(t-1)}}\end{array}\right]+\boldsymbol{b}_{h}\right)
\end{equation}

\begin{equation}
\boldsymbol{y}^{(t)}=\phi_{y}\left(\boldsymbol{W}_{h y} \boldsymbol{h}^{(t)}+\boldsymbol{b}_{y}\right)
\end{equation}

![6](6.png)
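
The equations above can be sketched as a short NumPy forward pass. This is only an illustration under stated assumptions: the layer sizes, the random weights, the choice of `tanh` for $\phi_h$, and the identity for $\phi_y$ are all picked for the example and do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the notes)
n_input, n_hidden, n_output = 3, 4, 2
seq_len = 5

# Weights and biases, named after the symbols in the text
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_input))
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_hy = rng.normal(scale=0.1, size=(n_output, n_hidden))
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_output)

x_seq = rng.normal(size=(seq_len, n_input))  # toy input sequence
h = np.zeros(n_hidden)                       # initial hidden state h^(0)

outputs = []
for x_t in x_seq:
    z_h = W_xh @ x_t + W_hh @ h + b_h  # net input z_h^(t)
    h = np.tanh(z_h)                   # h^(t), with phi_h = tanh (assumed)
    y_t = W_hy @ h + b_y               # y^(t), with phi_y = identity (assumed)
    outputs.append(y_t)
outputs = np.stack(outputs)
print(outputs.shape)

# The concatenated form [W_xh ; W_hh] applied to the stacked [x ; h]
# gives the same hidden state as the two separate products.
W_h = np.hstack([W_xh, W_hh])
h0 = np.zeros(n_hidden)
h1_separate = np.tanh(W_xh @ x_seq[0] + W_hh @ h0 + b_h)
h1_stacked = np.tanh(W_h @ np.concatenate([x_seq[0], h0]) + b_h)
print(np.allclose(h1_separate, h1_stacked))
```

The same weights are reused at every time step; only the hidden state `h` changes as the loop moves through the sequence.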

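### The challenge of learning long-range interactions

This is the so-called vanishing or exploding gradient problem: backpropagation through time multiplies the gradient by the recurrent weights once per time step, so over long sequences the gradient can shrink toward zero or grow without bound.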

![7](7.png)

Two solutions:
- Truncated backpropagation through time (TBPTT)
- Long short-term memory (LSTM)
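
A rough way to see the effect is to reduce the recurrent weight matrix to a single scalar and ignore the derivative of the activation function. Both are simplifying assumptions made for this sketch; the helper name `repeated_factor` is hypothetical. Under those assumptions, a gradient term that flows back through `steps` time steps is scaled by roughly `|w_hh| ** steps`.

```python
def repeated_factor(w_hh, steps):
    """Size of a gradient term after `steps` multiplications by a scalar weight."""
    return abs(w_hh) ** steps


# Below 1 the factor shrinks, at 1 it stays put, above 1 it grows.
for w in (0.9, 1.0, 1.1):
    print(w, [repeated_factor(w, k) for k in (1, 10, 100)])
```

Even weights close to 1 drift far from 1 after a hundred steps, which is why long sequences are the hard case.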

### LSTM units

![8](8.png)

$\odot$ refers to the element-wise product (element-wise multiplication) 
and $\oplus$ means element-wise summation (element-wise addition)

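- The forget gate ($f_t$) lets the memory cell reset the cell state so that it does not grow indefinitely.
$$
f_t=\sigma\left(W_{x f} x^{(t)}+W_{h f} h^{(t-1)}+b_{f}\right)
$$
- The input gate ($i_t$) and the input node ($g_t$) are used to update the cell state.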
$$
i_{t}=\sigma\left(W_{x i} x^{(t)}+W_{h i} h^{(t-1)}+b_{i}\right)
$$
$$
g_{t}=\tanh \left(W_{x g} x^{(t)}+W_{h g} h^{(t-1)}+b_{g}\right)
$$
$$
C^{(t)}=\left(C^{(t-1)} \odot f_{t}\right) \oplus\left(i_{t} \odot g_{t}\right)
$$
- The output gate ($o_t$) decides how the values of the hidden units are updated.
$$
o_{t}=\sigma\left(W_{x o} x^{(t)}+W_{h o} h^{(t-1)}+b_{o}\right)
$$
$$
h^{(t)}=o_t \odot \tanh(C^{(t)})
$$
$$
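
The gate equations above can be read as one function that advances the cell by a single time step. The following NumPy sketch follows them term by term; the function name `lstm_step`, the `params` dictionary layout, and the toy sizes are assumptions made for illustration, not part of any library.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step. params maps 'f', 'i', 'g', 'o' to (W_x, W_h, b)."""
    def pre_activation(name):
        W_x, W_h, b = params[name]
        return W_x @ x_t + W_h @ h_prev + b

    f_t = sigmoid(pre_activation('f'))   # forget gate
    i_t = sigmoid(pre_activation('i'))   # input gate
    g_t = np.tanh(pre_activation('g'))   # input node
    C_t = C_prev * f_t + i_t * g_t       # new cell state (element-wise ops)
    o_t = sigmoid(pre_activation('o'))   # output gate
    h_t = o_t * np.tanh(C_t)             # new hidden state
    return h_t, C_t


# Toy setup (sizes and random weights are assumed)
rng = np.random.default_rng(1)
n_input, n_hidden = 3, 4
params = {
    name: (rng.normal(scale=0.1, size=(n_hidden, n_input)),
           rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
           np.zeros(n_hidden))
    for name in ('f', 'i', 'g', 'o')
}

h = np.zeros(n_hidden)
C = np.zeros(n_hidden)
for x_t in rng.normal(size=(6, n_input)):
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)
```

Because $o_t$ lies in $(0, 1)$ and $\tanh$ lies in $(-1, 1)$, every entry of the hidden state stays strictly between -1 and 1, while the cell state `C` is not bounded in the same way.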

## Implementing a multilayer RNN for sequence modeling in TensorFlow

Two common tasks:
- Sentiment analysis
- Language modeling

## Project 1 - Sentiment analysis of the IMDb movie review dataset with an RNN

### Preparing the data

In [2]:
import pyprind
import pandas as pd
from string import punctuation
import re
import numpy as np

df = pd.read_csv('movie_data.csv', encoding='utf-8')

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [6]:
# Preprocessing the data:
# separate words from punctuation and
# count each word's occurrences
from collections import Counter

counts = Counter()
pbar = pyprind.ProgBar(len(df['review']), title='Counting words occurences')

for i, review in enumerate(df['review']):
    text = ''.join([c if c not in punctuation else ' ' +
                    c + ' ' for c in review]).lower()
    df.loc[i, 'review'] = text
    pbar.update()
    counts.update(text.split())

# create a mapping:
# map each unique word to an integer, starting at 1 (most frequent word first)
word_counts = sorted(counts, key=counts.get, reverse=True)
print(word_counts[:5])
word_to_int = {word: ii for ii, word in enumerate(word_counts, 1)}

mapped_reviews = []
pbar = pyprind.ProgBar(len(df['review']), title='Map reviews to int')

for review in df['review']:
    mapped_reviews.append([word_to_int[word] for word in review.split()])
    pbar.update()

Counting words occurences
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:02:17
Map reviews to int


['the', '.', ',', 'and', 'a']


0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:02
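
To make the mapping step concrete, here is the same procedure on two made-up reviews. It is a sketch only: the toy sentences are invented for the example, and the progress bar is left out.

```python
from collections import Counter
from string import punctuation

# Toy reviews (invented for illustration)
reviews = ["Great movie, great cast!", "Bad movie."]

counts = Counter()
cleaned = []
for review in reviews:
    # Surround punctuation with spaces so each mark becomes its own token
    text = ''.join(c if c not in punctuation else ' ' + c + ' '
                   for c in review).lower()
    cleaned.append(text)
    counts.update(text.split())

# Most frequent tokens get the smallest integers; numbering starts at 1
word_counts = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: ii for ii, word in enumerate(word_counts, 1)}

mapped = [[word_to_int[w] for w in text.split()] for text in cleaned]
print(word_to_int)
print(mapped)
```

Starting the numbering at 1 keeps 0 free, which matters in the next step where 0 is used as the padding value.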


> To generate input data that matches the RNN architecture, all sequences must have the same length.

![9](9.png)

In [12]:
# sequence_length is a hyperparameter that can be tuned

# Define same-length sequences
# if sequence length < 200: left_pad with zero
# if sequence length > 200: use the last 200 elements
sequence_length = 200
sequences = np.zeros((len(mapped_reviews), sequence_length), dtype=int)

for i,row in enumerate(mapped_reviews):
    review_arr = np.array(row)
    sequences[i,-len(row):] = review_arr[-sequence_length:]
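
The effect of the slicing is easier to see on a tiny case. The sketch below uses the same assignment with a target length of 5 instead of 200; the two toy sequences are invented for the example.

```python
import numpy as np

sequence_length = 5  # small value for illustration; the notes use 200
toy_reviews = [[7, 8, 9], [1, 2, 3, 4, 5, 6, 7]]

padded = np.zeros((len(toy_reviews), sequence_length), dtype=int)
for i, row in enumerate(toy_reviews):
    review_arr = np.array(row)
    # Short rows fill only the right-hand end; long rows keep their last elements
    padded[i, -len(row):] = review_arr[-sequence_length:]

print(padded)
# [[0 0 7 8 9]
#  [3 4 5 6 7]]
```

The short sequence is left-padded with zeros and the long one is cut down to its last five elements. Note that an empty sequence would make the assignment fail, so this relies on every review containing at least one token.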

In [14]:
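# Split by position: first 25,000 rows for training, the rest for testing.
# df.loc[:25000] is label-based and inclusive, so it would return 25,001 labels.
X_train = sequences[:25000, :]
y_train = df['sentiment'].values[:25000]
X_test = sequences[25000:, :]
y_test = df['sentiment'].values[25000:]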

In [15]:
# mini batch
np.random.seed(123)

# define a function to generate mini-batches


def create_batch_generator(x, y=None, batch_size=64):
    n_batches = len(x)//batch_size
    x = x[:n_batches*batch_size]
    if y is not None:
        y = y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        if y is not None:
            yield x[ii:ii+batch_size], y[ii:ii+batch_size]
        else:
            yield x[ii:ii+batch_size]
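
A quick check on toy arrays shows what the generator yields. The function is repeated here only so that the snippet runs on its own; the toy arrays and the batch size of 4 are assumptions for the example.

```python
import numpy as np


def create_batch_generator(x, y=None, batch_size=64):
    # Same logic as above: drop the incomplete last batch, then slice
    n_batches = len(x) // batch_size
    x = x[:n_batches * batch_size]
    if y is not None:
        y = y[:n_batches * batch_size]
    for ii in range(0, len(x), batch_size):
        if y is not None:
            yield x[ii:ii + batch_size], y[ii:ii + batch_size]
        else:
            yield x[ii:ii + batch_size]


x_toy = np.arange(10 * 3).reshape(10, 3)  # 10 samples, 3 features
y_toy = np.arange(10)

batch_shapes = [(xb.shape, yb.shape)
                for xb, yb in create_batch_generator(x_toy, y_toy, batch_size=4)]
print(batch_shapes)
```

With 10 samples and a batch size of 4, only two full batches come out; the last two samples are dropped so that every batch has the same shape.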

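### Embedding

Advantages of an embedding over one-hot encoding:
- It reduces the dimensionality of the feature space, which lessens the effect of the curse of dimensionality.
- The embedding layer is trainable, so the network learns to extract the salient features there.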

![10](10.png)
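
In NumPy terms, an embedding is a matrix with one row per word index, and looking up a word means selecting its row. The sketch below, with made-up sizes and random values standing in for trained weights, shows that this row lookup gives the same result as multiplying a one-hot vector by the matrix, without ever building the one-hot vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed); real vocabularies are far larger
n_words, embedding_size = 6, 3
embedding = rng.uniform(-1, 1, size=(n_words, embedding_size))

word_ids = np.array([2, 0, 5])      # an encoded sequence of three words

# Embedding lookup: pick one row per word index
looked_up = embedding[word_ids]

# One-hot route: build the sparse vectors, then multiply by the matrix
one_hot = np.eye(n_words)[word_ids]
via_matmul = one_hot @ embedding

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

Each word is represented by `embedding_size` real values instead of `n_words` mostly-zero entries, which is where the reduction in dimensionality comes from.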