# 0. Outline

1. RNN

2. BP for RNN

3. LSTM

4. Applications

# 1. RNN

### 1. Recall: Feedforward Neural Network

<img src='attachment/105. Feedforward Neural Network.jpg' style='zoom:40%'/>

$$
\begin{align}
z &= \sum_{i=1}^{n}w_{i}x_{i}+b \\
a &= g(z)
\end{align}
$$
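
As a minimal sketch of the two equations above, the snippet below computes a single neuron's activation. The logistic sigmoid is only one possible choice for $g$, and the inputs and weights are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation, used here as one possible choice of g."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, g=sigmoid):
    """Compute a = g(z) with z = sum_i w_i * x_i + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return g(z)

# Hypothetical inputs and weights, chosen only for illustration.
a = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Here $z = 0.5 \cdot 1 - 0.25 \cdot 2 + 0 = 0$, so the sigmoid returns $0.5$.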

### 2. Recall: Language Modelling

A language model estimates the probability of a sequence of words:

- For the sequence $\color{MediumOrchid}{w_{1}, w_{2}, \cdots, w_{n}}$, using the chain rule, we have:

$$
\color{MediumOrchid}{P\left(w_{1}, \ldots, w_{n}\right)=P\left(w_{n} \mid w_{1}, \ldots, w_{n-1}\right) P\left(w_{n-1} \mid w_{1}, \ldots, w_{n-2}\right) \ldots P\left(w_{2} \mid w_{1}\right) P\left(w_{1}\right)}
$$

- N-Gram Approximation: $\color{MediumOrchid}{P\left(w_{1}, \ldots, w_{n}\right) \approx \prod_{i=1}^{n} P\left(w_{i} \mid w_{i-N+1}, \ldots, w_{i-1}\right)}$

- Applications:
    - Machine Translation: $\color{MediumOrchid}{P(\text { the cat is small })>P(\text { small is the cat })}$
    
    - Grammar Checking: $\color{MediumOrchid}{P(\text { He graduated from SJTU. })>P(\text { He graduated on SJTU. })}$
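
The bigram case ($N = 2$) of the N-gram approximation above can be sketched as follows. The toy corpus, the `<s>` start marker, and the add-one smoothing are all assumptions made for illustration; they are not part of the notes.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams, vocab_size):
    """P(w_1..w_n) ~= prod_i P(w_i | w_{i-1}), with add-one smoothing."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

# Hypothetical toy corpus.
corpus = ["the cat is small", "the dog is small", "the cat is here"]
unigrams, bigrams = train_bigram(corpus)
vocab_size = len(unigrams)

p_good = sentence_prob("the cat is small", unigrams, bigrams, vocab_size)
p_bad = sentence_prob("small is the cat", unigrams, bigrams, vocab_size)
```

On this corpus `p_good` exceeds `p_bad`, mirroring the machine translation example: the word order seen in training receives the higher probability.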
    
    
Context is important in language modelling, but:

- N-gram models use only a limited context;

- They fail to handle long-range (> N) dependencies, which are common in natural language.

RNNs can address these issues:

- <font color=red>An RNN takes the previous hidden state as an additional input;</font>

- <font color=red>It can, in principle, model dependencies of arbitrary length.</font>

### 3. Why RNN?

**Universal Approximation Theorem**

A multilayer feedforward network with sufficiently many hidden neurons can approximate any given continuous function to arbitrary accuracy.

### 4. Simple RNN Architecture

<img src='attachment/105. Simple RNN Architecture.png' style='zoom:40%'/>

$$
\begin{align}
h^{(t)} &= \sigma\left(Wx^{(t)} + Uh^{(t-1)} \right) \\
y^{(t)} &= \operatorname{softmax}\left(Vh^{(t)} \right)
\end{align}
$$
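
A minimal sketch of this forward pass in plain Python is given below. The matrix sizes, weight values, and one-hot inputs are hypothetical and chosen only to make the example runnable; `matvec` and `rnn_step` are illustrative helper names.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, h_prev, W, U, V):
    """h = sigma(W x + U h_prev), y = softmax(V h)."""
    pre = [a + b for a, b in zip(matvec(W, x), matvec(U, h_prev))]
    h = [sigmoid(p) for p in pre]
    y = softmax(matvec(V, h))
    return h, y

# Hypothetical sizes: 3-word vocabulary (one-hot inputs), 2 hidden units.
W = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]      # hidden x input
U = [[0.5, -0.3], [0.2, 0.1]]                 # hidden x hidden
V = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.5]]    # output x hidden

h = [0.0, 0.0]  # initial hidden state h^(0)
outputs = []
for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    h, y = rnn_step(x, h, W, U, V)
    outputs.append(y)
```

Each entry of `outputs` is a probability distribution over the vocabulary, and the same `W`, `U`, `V` are reused at every time step.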

Evaluation metrics for language models:

1. Perplexity (PPL)

2. Word Error Rate (WER)
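
Perplexity is commonly defined as the exponentiated average negative log-probability the model assigns to each token. The sketch below assumes that definition; the per-token probabilities are hypothetical values standing in for a model's outputs.

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum_i log p_i) for per-token probabilities p_i."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

# A model that is uniform over 4 choices versus a more confident one.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])
ppl_confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

A model that assigns probability $1/4$ to every token has perplexity 4, so lower perplexity indicates a better fit to the evaluated text.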

### 5. Unfolding RNNs

A full RNN uses all previous time steps:

<img src='attachment/105. Full RNN.png' style='zoom:40%'/>

#### 5.1 Backprop of $V$

Let the loss function be:

$$L=\sum_{t=1}^{T} \ell\left(y^{(t)}, \hat{y}^{(t)}\right)$$

Let $\color{MediumOrchid}{z = Vh}$; then the partial derivative of the loss with respect to $\color{MediumOrchid}{V}$ is:

$$
\color{MediumOrchid}{\frac{\partial \ell^{(t)}}{\partial V}=\frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z^{(t)}} \cdot \frac{\partial z^{(t)}}{\partial V}}
$$
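
The notes leave $\ell$ unspecified. Assuming a softmax output with cross-entropy loss, the first two factors of the chain rule combine into "predicted probabilities minus one-hot target", and the last factor contributes $h^{\top}$. The sketch below, with hypothetical values, checks that closed form against central finite differences.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(V, h, target):
    """Cross-entropy of softmax(V h) against a target class index."""
    z = [sum(v * x for v, x in zip(row, h)) for row in V]
    return -math.log(softmax(z)[target])

def grad_V(V, h, target):
    """dl/dV = (softmax(z) - onehot(target)) h^T, i.e. dl/dz times dz/dV."""
    z = [sum(v * x for v, x in zip(row, h)) for row in V]
    probs = softmax(z)
    dz = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [[d * x for x in h] for d in dz]

# Hypothetical values for one time step.
V = [[0.3, -0.1], [0.0, 0.2], [-0.4, 0.5]]
h = [0.6, 0.4]
target = 1

analytic = grad_V(V, h, target)

# Central finite differences as an independent check.
eps = 1e-6
numeric = [[0.0] * len(h) for _ in V]
for i in range(len(V)):
    for j in range(len(h)):
        V[i][j] += eps
        up = loss(V, h, target)
        V[i][j] -= 2 * eps
        down = loss(V, h, target)
        V[i][j] += eps  # restore the original weight
        numeric[i][j] = (up - down) / (2 * eps)

max_diff = max(abs(a - n)
               for row_a, row_n in zip(analytic, numeric)
               for a, n in zip(row_a, row_n))
```

Because $V$ only affects the output at the current step, this gradient involves no recursion through time.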

#### 5.2 Backprop through Time (BPTT)

$$
\begin{align}
\frac{\partial \ell^{(t)}}{\partial W} &= \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial W}\\
&+ \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial W} \\
&+ \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial h^{(t-1)}} \cdot \frac{\partial h^{(t-1)}}{\partial h^{(t-2)}} \cdot \frac{\partial h^{(t-2)}}{\partial W} \\
&+  \cdots
\end{align}
$$

In summary:

$$
\frac{\partial \ell^{(t)}}{\partial W} = \sum_{k=0}^{t} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial h^{(t)}} \left(\prod_{j=k+1}^{t}\frac{ \partial h^{(j)}}{\partial h^{(j-1)}} \right) \cdot \frac{\partial h^{(k)}}{\partial W}
$$
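
To make the sum concrete, the sketch below applies it to a scalar RNN $h_t = \sigma(w x_t + u h_{t-1})$ with $h_0 = 0$. As simplifying assumptions, the output layer is dropped and a squared-error loss is placed directly on the final hidden state; all numeric values are hypothetical. The result is compared with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, u, xs):
    """Return [h_0, h_1, ..., h_T] with h_t = sigma(w * x_t + u * h_{t-1})."""
    hs = [0.0]
    for x in xs:
        hs.append(sigmoid(w * x + u * hs[-1]))
    return hs

def loss(w, u, xs, target):
    """Squared error on the final hidden state (an illustrative choice)."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, target):
    """Sum over k of dl/dh_T * prod_{j=k+1..T} dh_j/dh_{j-1} * dh_k/dw."""
    hs = forward(w, u, xs)
    T = len(xs)
    dl_dh = hs[T] - target   # dl/dh_T
    grad = 0.0
    carry = 1.0              # running product of dh_j/dh_{j-1}
    for k in range(T, 0, -1):
        local = hs[k] * (1.0 - hs[k])               # sigma'(z_k)
        grad += dl_dh * carry * local * xs[k - 1]   # immediate dh_k/dw
        carry *= local * u                          # multiply in dh_k/dh_{k-1}
    return grad

# Hypothetical parameters and inputs.
w, u = 0.7, -0.5
xs = [1.0, -0.5, 2.0, 0.3]
target = 0.2

analytic = bptt_grad_w(w, u, xs, target)
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
```

The loop walks backward from $k = T$ to $k = 1$, so the Jacobian product is extended by one factor per step instead of being recomputed for every $k$.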

#### 5.3 Real-Time Recurrent Learning (RTRL)

$$
\frac{\partial \ell^{(t)}}{\partial U} = \frac{\partial \ell^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial U}
$$

Let

$$
\begin{align}
h^{(t)} &= f\left(z^{(t)}\right) \\
&= f\left(Uh^{(t-1)} + Wx^{(t)}\right)
\end{align}
$$

Then

$$
\begin{align}
\frac{\partial h^{(t)}}{\partial U} &= \frac{\partial h^{(t)}}{\partial z^{(t)}} \left(\frac{\partial z^{(t)}}{\partial U} + U\frac{\partial h^{(t-1)}}{\partial U} \right) \\
&= f^{\prime} \left( z^{(t)} \right) \odot \left(h^{(t-1)} + U\frac{\partial h^{(t-1)}}{\partial U} \right)
\end{align}
$$
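
The recursion above lets the gradient be carried forward in time alongside the hidden state. Below is a scalar sketch under stated assumptions: $f$ is the logistic sigmoid, the loss is a squared error on each hidden state, and all numeric values are hypothetical. The variable `p` plays the role of $\partial h^{(t)} / \partial U$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def total_loss(w, u, xs, targets):
    """Sum over t of 0.5 * (h_t - target_t)^2 (an illustrative loss)."""
    h, total = 0.0, 0.0
    for x, y in zip(xs, targets):
        h = sigmoid(u * h + w * x)
        total += 0.5 * (h - y) ** 2
    return total

def rtrl_grad_u(w, u, xs, targets):
    """Accumulate dL/du online, carrying p = dh/du forward in time."""
    h, p, grad = 0.0, 0.0, 0.0
    for x, y in zip(xs, targets):
        h_new = sigmoid(u * h + w * x)
        # p_t = f'(z_t) * (h_{t-1} + u * p_{t-1})
        p = h_new * (1.0 - h_new) * (h + u * p)
        h = h_new
        grad += (h - y) * p   # dl_t/dh_t * dh_t/du
    return grad

# Hypothetical parameters, inputs, and targets.
w, u = 0.7, -0.5
xs = [1.0, -0.5, 2.0, 0.3]
targets = [0.2, 0.8, 0.5, 0.1]

analytic = rtrl_grad_u(w, u, xs, targets)
eps = 1e-6
numeric = (total_loss(w, u + eps, xs, targets)
           - total_loss(w, u - eps, xs, targets)) / (2 * eps)
```

Unlike BPTT, no past hidden states need to be stored: the gradient is available at every step as soon as that step's loss is observed.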



### 6. Vanishing Gradients

$$
\frac{\partial \ell^{(t)}}{\partial W} = \sum_{k=0}^{t} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial h^{(t)}} \left( \color{MediumOrchid}{\prod_{j=k+1}^{t}\frac{ \partial h^{(j)}}{\partial h^{(j-1)}}} \right) \cdot \frac{\partial h^{(k)}}{\partial W}
$$

<img src='105. Vanishing Gradients.jpg' style='zoom:50%'/>
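
The highlighted product is what causes the problem. In a scalar sigmoid RNN each factor equals $\sigma'(z_j)\,u$, and since $\sigma' \le 1/4$, every factor has magnitude at most $|u|/4$. The sketch below, with hypothetical weights and a constant input sequence, tracks how the product shrinks as the distance $t - k$ grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def jacobian_products(w, u, xs):
    """Return |prod_{j=k+1..T} dh_j/dh_{j-1}| for k = T-1, T-2, ..., 0."""
    hs = [0.0]
    for x in xs:
        hs.append(sigmoid(w * x + u * hs[-1]))
    products, running = [], 1.0
    for j in range(len(xs), 0, -1):
        running *= hs[j] * (1.0 - hs[j]) * u   # dh_j/dh_{j-1} = sigma'(z_j) * u
        products.append(abs(running))
    return products

# Hypothetical setting: 20 time steps, recurrent weight u = 1.
products = jacobian_products(w=0.5, u=1.0, xs=[1.0] * 20)
```

With $u = 1$ the product over 20 steps is bounded by $0.25^{20} \approx 9 \times 10^{-13}$, so the earliest inputs contribute almost nothing to the gradient.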

# 3. LSTM

### 3.1 Original LSTM

<img src='attachment/105. LSTM.png' style='zoom:70%'/>

#### Forget Gate

$$
f_{t} = \sigma\left( W_{f} \cdot \left[h_{t-1}, x_{t}\right] + b_{f} \right)
$$

#### Input Gate

$$
\begin{align}
i_{t} &= \sigma \left(W_{i} \cdot \left[h_{t-1}, x_{t}\right] + b_{i}\right) \\
\tilde{C}_{t} &= \tanh \left(W_{C} \cdot \left[h_{t-1}, x_{t}\right] + b_{C}\right) \\
C_{t} &= f_{t} * C_{t-1} + i_{t} * \tilde{C}_{t}
\end{align}
$$

#### Output Gate

$$
\begin{align}
o_{t} &= \sigma \left(W_{o} \cdot \left[h_{t-1}, x_{t}\right] + b_{o}\right) \\
h_{t} &= o_{t} * \tanh \left(C_{t}\right)
\end{align}
$$
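
The three gates can be combined into a single step function. The following is a minimal sketch in plain Python, where `*` in the equations is taken to be the element-wise product and $[h_{t-1}, x_t]$ is list concatenation. The sizes and weight values are hypothetical, and `affine` and `lstm_step` are illustrative helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step following the forget, input, and output gate equations."""
    hx = h_prev + x  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(params["Wf"], params["bf"], hx)]
    i = [sigmoid(v) for v in affine(params["Wi"], params["bi"], hx)]
    c_tilde = [math.tanh(v) for v in affine(params["Wc"], params["bc"], hx)]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    o = [sigmoid(v) for v in affine(params["Wo"], params["bo"], hx)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical sizes: 2 hidden units, 1 input feature, so each W is 2 x 3.
params = {
    "Wf": [[0.1, 0.2, -0.1], [0.0, -0.3, 0.2]], "bf": [1.0, 1.0],
    "Wi": [[0.3, -0.2, 0.5], [0.1, 0.1, -0.4]], "bi": [0.0, 0.0],
    "Wc": [[-0.2, 0.4, 0.3], [0.5, 0.0, 0.1]], "bc": [0.0, 0.0],
    "Wo": [[0.2, 0.1, 0.0], [-0.1, 0.3, 0.2]], "bo": [0.0, 0.0],
}

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x, h, c, params)
```

The cell state `c` is updated additively (`f * c_prev + i * c_tilde`), which is the path that lets information persist across many steps.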

### 3.2 Variant 1: Adding Peephole Connections

<img src='105. LSTM with peephold connection.jpg' style='zoom:50%'/> 

$$
\begin{aligned} 
f_{t} &=\sigma\left(W_{f} \cdot\left[\color{red}{{C}_{t-1}}, h_{t-1}, x_{t}\right]+b_{f}\right) \\ 
i_{t} &=\sigma\left(W_{i} \cdot\left[\color{red}{{C}_{t-1}}, h_{t-1}, x_{t}\right]+b_{i}\right) \\ 
o_{t} &=\sigma\left(W_{o} \cdot\left[\color{red}{{C}_{t}}, h_{t-1}, x_{t}\right]+b_{o}\right) 
\end{aligned}
$$
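
A scalar sketch of one peephole step follows. Only the three gate equations come from this subsection; the candidate $\tilde{C}_t$, the cell update, and the hidden-state update are assumed to carry over unchanged from the original LSTM. All parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(w, v):
    return sum(a * b for a, b in zip(w, v))

def peephole_lstm_step(x, h_prev, c_prev, p):
    """Scalar LSTM step whose gates also read the cell state."""
    f = sigmoid(dot(p["wf"], [c_prev, h_prev, x]) + p["bf"])  # peeks at C_{t-1}
    i = sigmoid(dot(p["wi"], [c_prev, h_prev, x]) + p["bi"])  # peeks at C_{t-1}
    # Candidate and cell update assumed unchanged from the original LSTM.
    c_tilde = math.tanh(dot(p["wc"], [h_prev, x]) + p["bc"])
    c = f * c_prev + i * c_tilde
    o = sigmoid(dot(p["wo"], [c, h_prev, x]) + p["bo"])       # peeks at the new C_t
    h = o * math.tanh(c)
    return h, c

# Hypothetical scalar parameters, chosen only for illustration.
p = {
    "wf": [0.5, 0.1, -0.2], "bf": 1.0,
    "wi": [-0.3, 0.2, 0.4], "bi": 0.0,
    "wc": [0.3, 0.6], "bc": 0.0,
    "wo": [0.2, -0.1, 0.3], "bo": 0.0,
}

h, c = 0.0, 0.0
for x in (1.0, 0.5, -1.0):
    h, c = peephole_lstm_step(x, h, c, p)
```

Note the ordering: the output gate is computed after the cell update because it reads $C_t$ rather than $C_{t-1}$.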

### 3.3 Variant 2: GRU

<img src='105. GRU.jpg' style='zoom:50%'/> 

$$
\begin{align}
z_{t} &= \sigma\left(W_{z} \cdot\left[h_{t-1}, x_{t}\right]\right) \\
r_{t} &= \sigma\left(W_{r} \cdot\left[h_{t-1}, x_{t}\right]\right) \\
\tilde{h}_{t} &= \tanh \left(W \cdot\left[r_{t} * h_{t-1}, x_{t}\right]\right) \\
h_{t} &= \left(1-z_{t}\right) * h_{t-1}+z_{t} * \tilde{h}_{t}
\end{align}
$$
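
The four GRU equations translate directly into a step function. This is a minimal sketch in plain Python without bias terms, matching the equations above; the sizes and weight values are hypothetical, and `matvec` and `gru_step` are illustrative helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x, h_prev, Wz, Wr, W):
    """One GRU step: update gate z, reset gate r, candidate, interpolation."""
    hx = h_prev + x  # concatenation [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(Wz, hx)]
    r = [sigmoid(v) for v in matvec(Wr, hx)]
    gated = [rk * hk for rk, hk in zip(r, h_prev)] + x  # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(v) for v in matvec(W, gated)]
    return [(1.0 - zk) * hk + zk * gk for zk, hk, gk in zip(z, h_prev, h_tilde)]

# Hypothetical sizes: 2 hidden units, 1 input feature, so each matrix is 2 x 3.
Wz = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
Wr = [[0.1, 0.1, 0.5], [-0.3, 0.2, 0.0]]
W = [[0.4, -0.2, 0.3], [0.1, 0.5, -0.6]]

h = [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h = gru_step(x, h, Wz, Wr, W)
```

Compared with the LSTM, the GRU has no separate cell state: the update gate `z` interpolates directly between the previous hidden state and the candidate.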

# 4. Applications

### 4.1 RNNs for Classification

<img src='attachment/105. RNNs for Classification.jpg' style='zoom:30%'/> 

### 4.2 POS Tagging

<img src='attachment/105. POS Tagging.jpg' style='zoom:30%'/> 

### 4.3 NER

<img src='attachment/105. NER.jpg' style='zoom:30%'/> 

### 4.4 Machine Translation: Encoder-Decoder Architecture

<img src='attachment/105. Machine Translation-Encoder_Decoder Architecture.png' style='zoom:30%'/> 