### Recurrent Neural Network (RNN)

If convolutional networks are the deep networks for images, recurrent networks are the networks for speech and language. For example, LSTM and GRU networks, both based on the recurrent network, are popular in natural language processing (NLP). Recurrent networks are heavily applied in Google Home and Amazon Alexa. To illustrate the core ideas, we look into the recurrent neural network (RNN) before explaining LSTM and GRU.

In deep learning, we model $ h $ in a fully connected network as:

$$
h = f(X_i)
$$

where $ X_i $ is the input.

For time-sequence data, we also maintain a hidden state that represents the features seen in the previous time steps. Hence, to make a word prediction at time step $ t $ in speech recognition, we take both the input $ X_t $ and the hidden state from the previous time step $ h_{t-1} $ to compute $ h_t $:

$$
h_t = f(X_t, h_{t-1})
$$
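
To make the recurrence concrete, here is a minimal NumPy sketch of a single update. The text leaves $ f $ unspecified, so the `tanh` non-linearity, the weight names `W_xh`, `W_hh`, `b_h`, and the sizes are assumptions chosen for illustration (this is the common "vanilla" RNN form, not necessarily the one used in the figures).

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: h_t = tanh(x_t W_xh + h_prev W_hh + b_h).

    tanh is an assumed choice for f; the article only states h_t = f(X_t, h_{t-1}).
    """
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                      # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

x_t = rng.normal(size=input_dim)                  # current input X_t
h_prev = np.zeros(hidden_dim)                     # previous hidden state h_{t-1}
h_t = rnn_step(x_t, h_prev, W_xh, W_hh, b_h)
print(h_t.shape)  # (3,)
```

The new state has the same size as the previous one, so it can be fed straight back in at the next time step.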

<div class="imgcap">
<img src="images/rnn_b.png" style="border:none;width:60%;">
</div>

We can unroll the time step $ t $, which takes the hidden state $ h_{t-1} $ and the input $ X_t $ to compute $ h_t $.

<div class="imgcap">
<img src="images/rnn_b3.png" style="border:none;width:35%;">
</div>

To give another perspective, we unroll an RNN from time step $ t-1 $ to $ t+1 $:
<div class="imgcap">
<img src="images/rnn_b2.png" style="border:none;width:60%;">
</div>
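
Unrolling amounts to applying the same update in a loop, one iteration per time step. The sketch below assumes a `tanh` update and hypothetical weight names (`W_xh`, `W_hh`, `b_h`); the key point is only that the weights are shared across all steps while $ h $ is carried forward.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Unroll the recurrence over a sequence and return every hidden state."""
    h = h0
    hs = []
    for x_t in xs:                                # xs has shape (T, input_dim)
        # The same weights are reused at every time step.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        hs.append(h)
    return np.stack(hs)                           # shape (T, hidden_dim)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 3, 4, 5                # e.g. steps t-1, t, t+1
xs = rng.normal(size=(T, input_dim))
h0 = np.zeros(hidden_dim)
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

hs = rnn_forward(xs, h0, W_xh, W_hh, b_h)
print(hs.shape)  # (3, 5)
```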

In an RNN, $ h $ serves two purposes: it carries the hidden state from the previous sequence data, and it is used to make a prediction. In the following example, we multiply $ h_t $ with a matrix $ W $ to make a prediction for $ Y $. Through this matrix multiplication, $ h_t $ predicts the word that a user is pronouncing.

<div class="imgcap">
<img src="images/cap14.png" style="border:none;width:30%;">
</div>
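
As a rough sketch of this prediction step, the code below multiplies $ h_t $ by $ W $ to get one score per vocabulary word. The toy vocabulary, the sizes, and the softmax used to turn scores into probabilities are assumptions for illustration; the text only specifies the matrix multiplication.

```python
import numpy as np

def predict(h_t, W):
    """Map a hidden state to one score per word, then to probabilities."""
    scores = h_t @ W                              # shape (vocab_size,)
    exp = np.exp(scores - scores.max())           # subtract max for stability
    return scores, exp / exp.sum()                # softmax is an assumed choice

rng = np.random.default_rng(0)
vocab = ["a", "yellow", "school", "bus"]          # toy vocabulary (assumption)
hidden_dim = 3
W = rng.normal(size=(hidden_dim, len(vocab)))     # trainable in a real model
h_t = rng.normal(size=hidden_dim)

scores, probs = predict(h_t, W)
predicted_word = vocab[int(np.argmax(scores))]    # highest-scoring word
print(predicted_word, probs.round(3))
```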

> An RNN makes a prediction based on the hidden state from the previous time step and the current input: $ h_t = f(X_t, h_{t-1}) $

#### Create image captions using an RNN
Let's walk through a real example to study the RNN in detail. We want our system to automatically generate a caption by simply reading an image. For example, we input a school bus image into an RNN and the RNN produces a caption like "A yellow school bus idles near a park."

<div class="imgcap">
<img src="images/cap.png" style="border:none;">
</div>

During RNN training, we:
1. Use a CNN to capture the features of an image.
2. Multiply the features with a trainable matrix to generate $ h_0 $.
3. Feed $ h_0 $ to the RNN.
4. Use a word embedding lookup table to convert a word to a word vector $ X_1 $ (a word embedding, as in word2vec).
5. Feed the word vector and $ h_0 $ to the RNN: $ h_1 = f(X_1, h_0) $.
6. Use a trainable matrix to map $ h_1 $ to scores that predict the next word in our caption.
7. Move to the next time step with $ h_1 $ and the word "A" as input.

<div class="imgcap">
<img src="images/cap12.png" style="border:none;">
</div>
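
The training steps above can be sketched end to end in NumPy. This is a toy illustration under several assumptions not stated in the text: random values stand in for the CNN features, `tanh` is used for $ f $, a `<start>` token supplies the first input, the cross-entropy loss is one plausible training objective, and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "a", "yellow", "school", "bus", "<end>"]   # toy vocabulary
word_to_idx = {w: i for i, w in enumerate(vocab)}
feat_dim, embed_dim, hidden_dim, V = 8, 4, 5, len(vocab)

# Trainable parameters (randomly initialised; no learning happens here).
W_proj = rng.normal(scale=0.1, size=(feat_dim, hidden_dim))    # features -> h_0
W_embed = rng.normal(scale=0.1, size=(V, embed_dim))           # embedding table
W_xh = rng.normal(scale=0.1, size=(embed_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(hidden_dim, V))             # h -> word scores

features = rng.normal(size=feat_dim)      # step 1: stand-in for CNN features
h = features @ W_proj                     # steps 2-3: initial state h_0

caption = ["<start>", "a", "yellow", "school", "bus", "<end>"]
inputs, targets = caption[:-1], caption[1:]   # predict the next word each step

loss = 0.0
for word_in, word_out in zip(inputs, targets):
    x = W_embed[word_to_idx[word_in]]     # step 4: word -> word vector
    h = np.tanh(x @ W_xh + h @ W_hh)      # step 5: h_t = f(X_t, h_{t-1})
    scores = h @ W_hy                     # step 6: scores for the next word
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    loss -= log_probs[word_to_idx[word_out]]   # cross-entropy (assumed loss)
    # step 7: the loop continues with the new h and the next ground-truth word

avg_loss = loss / len(targets)
print(avg_loss)
```

With untrained weights the average loss should sit near $ \log V $, the value for a uniform guess over the vocabulary; training would adjust the matrices to push it down.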

#### Capture image features
We pass the image into a CNN and use one of the activation layers in the fully connected (FC) part of the network to initialize the RNN. For example, in the picture below, we pick the input of the second FC layer to compute the initial state of the RNN, $ h_0 $.
<div class="imgcap">
<img src="images/cnn.png" style="border:none;">
</div>

We multiply the CNN image features with a trainable matrix to compute $ h_0 $ for the first time step.
<div class="imgcap">
<img src="images/cap2.png" style="border:none;">
</div>
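
In terms of shapes, this projection is a single matrix multiplication applied to a batch of images. The feature size (4096), hidden size (512), and bias term below are assumptions for illustration, and random values stand in for real CNN activations.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, feat_dim, hidden_dim = 2, 4096, 512        # assumed sizes

features = rng.normal(size=(batch, feat_dim))     # stand-in for FC activations
W_proj = rng.normal(scale=0.01, size=(feat_dim, hidden_dim))   # trainable
b_proj = np.zeros(hidden_dim)                     # bias is an assumption

h0 = features @ W_proj + b_proj                   # one initial state per image
print(h0.shape)  # (2, 512)
```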

With $ h_0 $, we compute $ h_1 = f(X_1, h_0) $ for time step 1.

<div class="imgcap">
<img src="images/cap8.png" style="border:none;width:80%;">
</div>

> We use a CNN to extract image features, then multiply them with a trainable matrix to compute the initial hidden state $ h_0 $.
