### Recurrent Neural Network (RNN)

If convolutional networks are the deep networks for images, recurrent networks are the networks for speech and language. For example, both LSTM and GRU networks, which are based on the recurrent network, are popular in natural language processing (NLP). Recurrent networks are heavily applied in Google Home and Amazon Alexa. To illustrate the core ideas, we look into the recurrent neural network (RNN) before explaining LSTM and GRU.

In a fully connected network, we model $ h $ as:

$$
h = f(X_i)
$$

where $ X_i $ is the input.

For time sequence data, we also maintain a hidden state representing the features seen in the previous time steps. Hence, to make a word prediction at time step $ t $ in speech recognition, we take both the input $ X_t $ and the hidden state from the previous time step $ h_{t-1} $ to compute $ h_t $:

$$
h_t = f(x_t, h_{t-1})
$$
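
To make the recurrence concrete, here is a minimal sketch that carries a state through a short sequence. The update `f` below is a made-up toy function (an assumption for illustration, not a trained network); a real RNN replaces it with a learned transformation.

```python
def f(x_t, h_prev):
    # Toy state update (an assumption for illustration): keep half of the
    # old state and add the new input.
    return 0.5 * h_prev + x_t

def run_sequence(xs, h0=0.0):
    """Apply h_t = f(x_t, h_{t-1}) over a sequence and return every h_t."""
    h = h0
    states = []
    for x_t in xs:
        h = f(x_t, h)       # the new state depends on the input and the old state
        states.append(h)
    return states

states = run_sequence([1.0, 2.0, 3.0])
print(states)   # [1.0, 2.5, 4.25]
```

Each state depends on the whole history so far, even though every step only sees the current input and the previous state.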

<div class="imgcap">
<img src="images/rnn_b.png" style="border:none;width:60%;">
</div>

We can unroll time step $ t $, which takes the hidden state $ h_{t-1} $ and the input $ X_t $ to compute $ h_t $.

<div class="imgcap">
<img src="images/rnn_b3.png" style="border:none;width:35%;">
</div>

To give another perspective, we unroll an RNN from time step $ t-1 $ to $ t+1 $:
<div class="imgcap">
<img src="images/rnn_b2.png" style="border:none;width:60%;">
</div>

In an RNN, $ h $ serves two purposes: it is the hidden state summarizing the previous sequence data, and it is used to make a prediction. In the following example, we multiply $ h_t $ by a matrix $ W $ to make a prediction $ Y $, in this case the word that a user is pronouncing.

<div class="imgcap">
<img src="images/cap14.png" style="border:none;width:30%;">
</div>

> An RNN makes a prediction based on the hidden state from the previous time step and the current input: $ h_t = f(x_t, h_{t-1})$

#### Create image caption using RNN
Let's study a real example to understand RNNs in detail. We want our system to automatically generate a caption from an image. For example, we input a school bus image into an RNN and the RNN produces a caption like "A yellow school bus idles near a park."

<div class="imgcap">
<img src="images/cap.png" style="border:none;">
</div>

During RNN training, we:
1. Use a CNN to capture the features of an image.
2. Multiply the features by a trainable matrix to generate $ h_0 $.
3. Feed $ h_0 $ to the RNN.
4. Use a word embedding lookup table to convert a word to a word vector $ X_1 $ (a word embedding, as in word2vec).
5. Feed the word vector and $ h_0 $ to the RNN: $ h_1 = f(X_1, h_0) $.
6. Use a trainable matrix to map $ h $ to scores which predict the next word in our caption.
7. Move to the next time step with $ h_1 $ and the word "A" as input.

<div class="imgcap">
<img src="images/cap12.png" style="border:none;;">
</div>

#### Capture image features
We pass the image into a CNN and use one of the activation layers in the fully connected (FC) part of the network to initialize the RNN. For example, in the picture below, we pick the input of the second FC layer to compute the initial state of the RNN, $ h_0 $.
<div class="imgcap">
<img src="images/cnn.png" style="border:none;;">
</div>

We multiply the CNN image features by a trainable matrix to compute $ h_0 $, the hidden state fed into time step 1.
<div class="imgcap">
<img src="images/cap2.png" style="border:none;">
</div>

With $ h_0 $, we compute $ h_1 = f(X_1, h_0) $ for time step 1.

<div class="imgcap">
<img src="images/cap8.png" style="border:none;width:80%;">
</div>

> We use a CNN to extract image features and multiply them by a trainable matrix to obtain the initial hidden state $h_0$.


#### Code for computing h0

Define the dimensions of the CNN image features (N, 512) and of h (N, 512):
```python
input_dim   = 512   # CNN features dimension: 512  
hidden_dim  = 512   # Hidden state dimension: 512
```

Define a matrix to project the CNN image features to $$ h_0 $$.

```python
import numpy as np
# Initialize the CNN -> hidden state projection parameters
# W_proj: (input_dim, hidden_dim)
W_proj  = np.random.randn(input_dim, hidden_dim)
W_proj /= np.sqrt(input_dim)
b_proj  = np.zeros(hidden_dim)
```

Compute $$ h_0 $$ by multiplying the image features by $$ W_{proj} $$.
```python
# features: (N, input_dim) image features from the CNN
# h0: (N, hidden_dim)
h0 = features.dot(W_proj) + b_proj
```

> We initialize $$h_0 = W_{proj} \cdot x_{cnn} + b$$.


#### Map words to RNN
Our training data contains both the images and the captions (the true labels). It also has a dictionary which maps each vocabulary word to an integer index. Caption words in the dataset are stored as word indexes using this dictionary. For example, the caption "A yellow school bus idles near a park." can be represented as "1 5 3401 3461 78 5634 87 5 111 2", where 1 represents the "start" of a caption, 5 represents 'a', 3401 represents 'yellow', ... and 2 represents the "end" of a caption.
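
The encoding step can be sketched as a dictionary lookup per word. The tiny vocabulary below and the token names `<START>` and `<END>` are assumptions made for illustration; the indexes are the ones from the example above.

```python
# Hypothetical toy vocabulary; the indexes follow the example in the text.
word_to_idx = {
    "<START>": 1, "<END>": 2, "a": 5, "yellow": 3401, "school": 3461,
    "bus": 78, "idles": 5634, "near": 87, "park": 111,
}

def encode_caption(caption, word_to_idx):
    """Lowercase the caption, drop the final period, and map words to indexes."""
    words = caption.lower().rstrip(".").split()
    indexes = [word_to_idx[w] for w in words]
    # Wrap the caption with the start and end markers.
    return [word_to_idx["<START>"]] + indexes + [word_to_idx["<END>"]]

encoded = encode_caption("A yellow school bus idles near a park.", word_to_idx)
print(encoded)   # [1, 5, 3401, 3461, 78, 5634, 87, 5, 111, 2]
```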

The RNN does not use the word index directly, because the index carries no information about the semantic relationship between words. Instead, we map each word to a higher dimensional space in which we can encode semantic relationships between words. For example, if we encode the word "father" as (0.2, 0.3, 0.1, ...), we should expect the word "mother" to be nearby, say (0.3, 0.3, 0.1, ...). The vector offset between "Paris" and "France" should be similar to the one between "Seoul" and "Korea". The encoding method _word2vec_ provides a mechanism to convert a word to such a vector. We use a word embedding lookup table $$ W_{embed} $$ to convert a word index to a vector of length `wordvec_dim`. This embedding table is trained together with the caption creation network (end-to-end training) instead of being trained independently.

> We train the word embedding table together with the caption creation network. 


The RNN takes this vector $$ X_t $$ and $$ h_{t-1} $$ to compute $$ h_t $$.

<div class="imgcap">
<img src="images/cap9.png" style="border:none;width:45%;">
</div>
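
The lookup from word index to word vector is plain row indexing into the embedding table. A minimal sketch, where the toy sizes, the random initial values and the example captions are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size  = 10     # toy vocabulary size (assumption)
wordvec_dim = 4      # length of each word vector (assumption)

# W_embed: (vocab_size, wordvec_dim); one trainable row per vocabulary word
W_embed = rng.standard_normal((vocab_size, wordvec_dim)) / 100

# captions: (N, T) word indexes for N captions of T words each
captions = np.array([[1, 5, 3, 2],
                     [1, 7, 7, 2]])

# Integer-array indexing picks one row of W_embed per word index.
# x: (N, T, wordvec_dim)
x = W_embed[captions]
print(x.shape)   # (2, 4, 4)
```

Because the lookup is differentiable with respect to the selected rows, the table can be updated by the same gradient descent that trains the rest of the network.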

> _word2vec_ encodes words into a higher dimensional space in which semantic relationships can be manipulated as vectors.

When we create the training data, we encode words to their corresponding word indexes using a vocabulary dictionary. The encoded data is then saved. During training, we read the saved dataset and use the embedding lookup table $$ W_{embed} $$ to convert each word index to a word vector.
<div class="imgcap">
<img src="images/encode.png" style="border:none;width:70%;">
</div>

#### RNN
<div class="imgcap">
<img src="images/score.png" style="border:none;width:40%;">
</div>

We pass the word vector $$ X_t $$ into the RNN cell to compute the next hidden state $$h_{t}$$:

$$
\begin{align}
state & = W_x x_t + W_h h_{t-1} + b \\
h_{t} &= \tanh(state) \\
\end{align}
$$
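
A single RNN step can be sketched in NumPy as follows. We assume a batch-first, row-vector layout, so the products are written as `x.dot(Wx)` rather than as a matrix times a column vector; the shapes and toy sizes are assumptions, and the body is only one plausible way to fill in the `rnn_step_forward` cell named later in this article.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """One vanilla RNN time step.

    x: (N, D) inputs, prev_h: (N, H) previous hidden state,
    Wx: (D, H), Wh: (H, H), b: (H,). Returns next_h: (N, H).
    """
    state = x.dot(Wx) + prev_h.dot(Wh) + b
    next_h = np.tanh(state)
    return next_h

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4                      # toy batch, input and hidden sizes
x      = rng.standard_normal((N, D))
prev_h = np.zeros((N, H))
Wx     = rng.standard_normal((D, H))
Wh     = rng.standard_normal((H, H))
b      = np.zeros(H)

next_h = rnn_step_forward(x, prev_h, Wx, Wh, b)
print(next_h.shape)   # (2, 4)
```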

The output of the RNN, $$ h_1 $$, is then multiplied by $$ W_{vocab} $$ to generate scores for each word in the vocabulary. For example, if we have 10004 words in the vocabulary, it generates 10004 scores predicting the likelihood of each word being the next word in the caption. With the true caption in the training dataset and the scores computed, we calculate the softmax loss of the RNN. We apply gradient descent to optimize the trainable parameters.
<div class="imgcap">
<img src="images/score_1.png" style="border:none;">
</div>

We compute $$ h_t $$ 
by feeding the RNN cell with $$ X_t $$ and $$ h_{t-1} $$.
We then map $$ h_t $$ to scores which are used to compute the softmax cost.

#### rnn_forward

`rnn_forward` is the RNN layer that computes $$h_1, h_2, \cdots, h_T $$ for all $$ T $$ time steps:

```python
h, cache_rnn = rnn_forward(x, h0, Wx, Wh, b)
```

<div class="imgcap">
<img src="images/cap13.png" style="border:none;width:50%;">
</div>

`rnn_forward` unrolls the RNN by T time steps and computes $$ h_t $$ by calling the RNN cell `rnn_step_forward`. At each step, it takes $$ h_{t-1} $$ from the previous step and uses the true captions provided by the training set to look up $$ X_t $$. Note that during training we use the true label as input, not the highest-scoring word from the previous time step.

In each RNN time step, we compute:

$$
\begin{align}
state & = W_x x_t + W_h h_{t-1} + b \\
h_{t} &= \tanh(state) \\
\end{align}
$$
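
Putting the loop and the cell together, a sketch of `rnn_forward` might look like this. The signature matches the call shown above; the batch-first (N, T, D) input layout and the contents of the cache are assumptions, so treat this as one possible implementation rather than the definitive one.

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    # One time step: h_t = tanh(x_t Wx + h_{t-1} Wh + b), row-vector layout
    return np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)

def rnn_forward(x, h0, Wx, Wh, b):
    """x: (N, T, D) word vectors, h0: (N, H). Returns h: (N, T, H) and a cache."""
    N, T, D = x.shape
    H = h0.shape[1]
    h = np.zeros((N, T, H))
    prev_h = h0
    for t in range(T):
        # Each step consumes the input at time t and the previous hidden state
        prev_h = rnn_step_forward(x[:, t, :], prev_h, Wx, Wh, b)
        h[:, t, :] = prev_h
    cache = (x, h0, Wx, Wh, b, h)   # values a backward pass would need (assumption)
    return h, cache

rng = np.random.default_rng(0)
N, T, D, H = 2, 3, 4, 5
x  = rng.standard_normal((N, T, D))
h0 = rng.standard_normal((N, H))
Wx = rng.standard_normal((D, H))
Wh = rng.standard_normal((H, H))
b  = np.zeros(H)

h, cache_rnn = rnn_forward(x, h0, Wx, Wh, b)
print(h.shape)   # (2, 3, 5)
```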

#### Scores

After finding $$ h_t $$, we compute the scores by:

$$
score = W_{vocab} * h_t
$$
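
In code, the scores for every time step can be computed with one matrix product. This is a sketch with toy sizes; the bias `b_vocab` is an assumption (the equation above omits it), and the row-vector layout means the product is written as `h.dot(W_vocab)`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, H, V = 2, 3, 5, 7                  # batch, time steps, hidden size, vocabulary size
h       = rng.standard_normal((N, T, H)) # hidden states from the RNN layer
W_vocab = rng.standard_normal((H, V))
b_vocab = np.zeros(V)                    # assumed bias term

# np.dot contracts the last axis of h with the first axis of W_vocab.
# scores: (N, T, V), one score per vocabulary word at every time step
scores = h.dot(W_vocab) + b_vocab

# The highest-scoring word index at each time step: (N, T)
predicted = scores.argmax(axis=2)
print(scores.shape, predicted.shape)   # (2, 3, 7) (2, 3)
```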

#### Softmax cost

For each word in the vocabulary (1004 words), we predict its probability of being the next caption word using softmax. Subtracting the maximum score $$ m = \max_c z_c $$ from every score does not change the result but improves numerical stability.

$$
softmax(z)_i = \frac{e^{z_i}}{\sum_c e^{z_c}} =  \frac{e^{z_i - m}}{\sum_c e^{z_c  - m}}
$$
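
The effect of subtracting the maximum can be checked directly. The sketch below uses made-up scores: shifting all scores by 1000 leaves the probabilities unchanged, while a naive `np.exp(1003.0)` would overflow.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    m = z.max(axis=-1, keepdims=True)    # maximum score
    e = np.exp(z - m)                    # largest exponent is exp(0) = 1
    return e / e.sum(axis=-1, keepdims=True)

small = np.array([1.0, 2.0, 3.0])
large = small + 1000.0                   # same differences, much larger magnitude

print(softmax(small))
print(softmax(large))                    # same probabilities as for `small`
```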

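We then compute the softmax loss (negative log likelihood) and, for a single training example, its gradient with respect to the scores, where $$ p_i $$ is the softmax probability of word $$ i $$ and $$ y $$ is the index of the true word:

$$
\begin{align}
J(w) &= -  \sum_{i=1}^{N}  \log p(\hat{y}^i = y^i \vert x^i, w ) \\
\nabla_{z_i} J &= \begin{cases}
                        p_i - 1 \quad & i = y \\
                        p_i & \text{otherwise}
                    \end{cases}
\end{align}
$$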
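
The loss and its gradient can be sketched together. The function name `softmax_loss`, the (N, V) layout and the example scores are assumptions; the loss is summed over the examples, as in the formula above.

```python
import numpy as np

def softmax_loss(scores, y):
    """scores: (N, V), y: (N,) true word indexes.

    Returns the summed negative log likelihood and its gradient
    with respect to the scores, shape (N, V).
    """
    N = scores.shape[0]
    shifted = scores - scores.max(axis=1, keepdims=True)   # for stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(N), y].sum()
    dscores = np.exp(log_probs)           # p for every word
    dscores[np.arange(N), y] -= 1.0       # p - 1 at the true word
    return loss, dscores

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
y = np.array([0, 1])
loss, dscores = softmax_loss(scores, y)
print(loss)
print(dscores)
```

Each row of the gradient sums to zero: probability mass pushed toward the true word is taken from the others.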

#### Time step 0

Here, we recap how we calculate $$ h_0 $$ from the image features and use the true caption word "start" to compute $$ h_1 $$ with the RNN. Then we compute the scores and the softmax loss.

<div class="imgcap">
<img src="images/cap11.png" style="border:none;width:70%;">
</div>

#### Making prediction

To generate captions automatically, we use the CNN to generate image features and map them to $$ h_0 $$ with $$ W_{proj} $$.
<div class="imgcap">
<img src="images/cap4.png" style="border:none;width:80%;">
</div>

At time step 1, we feed the RNN with the input "start" to get the word vector $$ X_1 $$. The RNN computes the value $$ h_1 $$, which is then multiplied by $$ W_{vocab} $$ to generate scores for each word in the vocabulary. We select the word with the highest score as the first word in the caption (say, "A"). Unlike in training, we use this predicted word as the input for the next time step. With $$ h_1 $$ and the highest-scoring word "A" from time step 1, we go through the RNN step again and make the second prediction "bus" at time step 2.

<div class="imgcap">
<img src="images/cap7.png" style="border:none;width:70%;">
</div>

We compute the score and set the input for the next time step to be the word with the highest score.
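
This greedy decoding loop can be sketched for a single image as follows. The function name `sample_greedy`, the bias `b_vocab`, the `max_len` cutoff and the random toy weights are all assumptions; with untrained weights the output is of course meaningless, the point is only the data flow.

```python
import numpy as np

def sample_greedy(h0, W_embed, Wx, Wh, b, W_vocab, b_vocab,
                  start_idx, end_idx, max_len=10):
    """Greedy decoding for one image; returns a list of word indexes."""
    h = h0                                       # (H,) initial state from the image
    word = start_idx                             # first input is the "start" token
    caption = []
    for _ in range(max_len):
        x = W_embed[word]                        # (D,) word vector lookup
        h = np.tanh(x.dot(Wx) + h.dot(Wh) + b)   # (H,) RNN step
        scores = h.dot(W_vocab) + b_vocab        # (V,) one score per word
        word = int(np.argmax(scores))            # highest-scoring word
        if word == end_idx:                      # stop at the "end" token
            break
        caption.append(word)                     # it becomes the next input
    return caption

rng = np.random.default_rng(0)
V, D, H = 8, 4, 5                  # toy vocabulary, word vector and hidden sizes
W_embed = rng.standard_normal((V, D))
Wx      = rng.standard_normal((D, H))
Wh      = rng.standard_normal((H, H))
b       = np.zeros(H)
W_vocab = rng.standard_normal((H, V))
b_vocab = np.zeros(V)
h0      = rng.standard_normal(H)   # stands in for the projected CNN features

caption = sample_greedy(h0, W_embed, Wx, Wh, b, W_vocab, b_vocab,
                        start_idx=1, end_idx=2)
print(caption)
```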

Finally, here is the complete flow in detail:
<div class="imgcap">
<img src="images/cap5.png" style="border:none;;">
</div>

### Long Short Term Memory network (LSTM)

$$ h_t $$ in an RNN serves two purposes:
* making an output prediction, and
* acting as the hidden state that represents the data sequence processed so far.

LSTM splits these two roles into two separate variables: $$ h_t $$, which is used for the output, and the cell state $$ C $$, which carries the memory of the sequence processed so far.

<div class="imgcap">
<img src="images/lstm.png" style="border:none;width:50%;">
</div>

Here are the LSTM equations:

There are 3 gates controlling what information will pass through:

$$
\begin{split}
gate_{forget} &= \sigma (W_{fx} X_t + W_{fh} h_{t-1} + b_f) \\
gate_{input} &= \sigma (W_{ix} X_t + W_{ih} h_{t-1} + b_i) \\
gate_{out} &= \sigma (W_{ox} X_t + W_{oh} h_{t-1} + b_o) \\
\end{split}
$$

Three equations update the cell state and the hidden state:

$$
\begin{split}
\tilde{C} & = \tanh (W_{cx} X_t + W_{ch} h_{t-1} + b_c)  \\
C_t & = gate_{forget} \cdot C_{t-1} + gate_{input} \cdot \tilde{C} \\
h_t & = gate_{out} \cdot \tanh (C_t) \\
\end{split}
$$

#### Gates

There are three gates in an LSTM. All gates are functions of $$x_t$$ and $$h_{t-1}$$:

$$
gate = \sigma (W_{x} X_t + W_{h} h_{t-1} + b) \\
$$

* $$gate_{forget}$$ controls what part of the previous cell state will be kept.
* $$gate_{input}$$ controls what part of the newly computed information will be added to the cell state $$C$$.
* $$gate_{out} $$ controls what part of the cell state will be exposed as the hidden state.

#### Updating C

The key part in LSTM is to update the cell state $$C$$.

<div class="imgcap">
<img src="images/lstm2.png" style="border:none;width:20%;">
</div>

To do that, we need to compute a new proposal $$\tilde{C}$$ based only on $$X_t$$ and $$h_{t-1}$$:

$$
\tilde{C} = \tanh (W_{cx} X_t + W_{ch} h_{t-1} + b_c)  \\
$$

The new state $$C_t$$ is formed by forgetting part of the previous cell state while adding part of the new proposal $$\tilde{C}$$ to the cell state:

$$
\begin{split}
C_t & = gate_{forget} \cdot C_{t-1} + gate_{input} \cdot \tilde{C} \\
\end{split}
$$

#### Update h
<div class="imgcap">
<img src="images/lstm1.png" style="border:none;width:20%;">
</div>

To update $$ h_{t} $$, we use the output gate to control what part of the cell state is exposed as $$h_t$$:

$$
 h_t = gate_{out} \cdot \tanh (C_t)
$$
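
The LSTM equations above can be sketched as one NumPy step. Stacking the four weight matrices side by side into a single `Wx` of shape (D, 4H) and `Wh` of shape (H, 4H) is an assumption made for compactness, as are the function name `lstm_step_forward` and the row-vector layout; the products between gates and states are element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """x: (N, D), prev_h and prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)."""
    H = prev_h.shape[1]
    a = x.dot(Wx) + prev_h.dot(Wh) + b            # (N, 4H) pre-activations
    gate_forget = sigmoid(a[:, 0:H])
    gate_input  = sigmoid(a[:, H:2 * H])
    gate_out    = sigmoid(a[:, 2 * H:3 * H])
    c_tilde     = np.tanh(a[:, 3 * H:4 * H])      # new proposal
    # Keep part of the old cell state and add part of the proposal
    next_c = gate_forget * prev_c + gate_input * c_tilde
    # Expose part of the cell state as the hidden state
    next_h = gate_out * np.tanh(next_c)
    return next_h, next_c

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4
x      = rng.standard_normal((N, D))
prev_h = rng.standard_normal((N, H))
prev_c = rng.standard_normal((N, H))
Wx     = rng.standard_normal((D, 4 * H))
Wh     = rng.standard_normal((H, 4 * H))
b      = np.zeros(4 * H)

next_h, next_c = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)
print(next_h.shape, next_c.shape)   # (2, 4) (2, 4)
```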

### Gated Recurrent Units (GRU)

Compared with LSTM, GRU does not maintain a cell state $$ C $$ and uses two gates instead of three.

$$
\begin{split}
gate_r &= \sigma (W_{rx} X_t + W_{rh} h_{t-1} + b_r) \\
gate_{update} &= \sigma (W_{ux} X_t + W_{uh} h_{t-1} + b_u) 
\end{split}
$$

The new hidden state is computed as:

$$
h_t = (1 - gate_{update}) \cdot h_{t-1} +  gate_{update} \cdot \tilde{h_{t}}
$$

As seen, we use the complement of $$gate_{update}$$ instead of creating a new gate to control what we want to keep from $$h_{t-1}$$.

The new proposed $$\tilde{h_{t}}$$ is calculated as:

$$
\tilde{h_{t}} = \tanh (W_{hx} X_t + W_{hh} \cdot (gate_r \cdot h_{t-1}) + b_h)
$$

We use $$gate_r$$ to control what part of $$h_{t-1}$$ is used to compute the new proposal.
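
A GRU step can be sketched in the same style. The function name `gru_step_forward`, the parameter dictionary and its key names (including one bias vector per equation, named `b_r`, `b_u` and `b_h` here) are illustrative assumptions; products between gates and states are element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_forward(x, prev_h, p):
    """x: (N, D), prev_h: (N, H); p is a dict of weights (key names are illustrative)."""
    gate_r      = sigmoid(x.dot(p["W_rx"]) + prev_h.dot(p["W_rh"]) + p["b_r"])
    gate_update = sigmoid(x.dot(p["W_ux"]) + prev_h.dot(p["W_uh"]) + p["b_u"])
    # The reset gate scales h_{t-1} before it enters the proposal
    h_tilde = np.tanh(x.dot(p["W_hx"]) + (gate_r * prev_h).dot(p["W_hh"]) + p["b_h"])
    # Blend the old state and the proposal using the update gate and its complement
    return (1.0 - gate_update) * prev_h + gate_update * h_tilde

rng = np.random.default_rng(0)
N, D, H = 2, 3, 4
p = {}
for name in ("r", "u", "h"):
    p["W_%sx" % name] = rng.standard_normal((D, H))
    p["W_%sh" % name] = rng.standard_normal((H, H))
    p["b_%s" % name]  = np.zeros(H)

x      = rng.standard_normal((N, D))
prev_h = rng.standard_normal((N, H))

next_h = gru_step_forward(x, prev_h, p)
print(next_h.shape)   # (2, 4)
```

When the update gate is close to 0, the step simply copies the previous hidden state forward, which is how a GRU can preserve information over many time steps.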