## Vanilla Recurrent Unit

* An RNN works on data that are sequences

> A sequence involves a factor of time: its elements appear or occur in chronological order.

* and exploits

> the states of the elements in a sequence. The state is defined as the output of the hidden-layer neurons, which serves as an additional input to the network at the next time step.

* Suppose a sequence is of length T (T observations/elements). Then, this sequence would be: 

$$ x(1), x(2), ..., x(T) $$

* If each element in this sequence has D features, this sequence is a $T \times D $ matrix
* With N such sequences, the whole data set has the shape $ N \times T \times D $
* Sequences may have arbitrary length (e.g., sound, music or sentences)
    * We can store elements of each sequence in a list 
    * The nth sequence would be a list E(n) containing elements of the nth sequence and each element has D features
    * The nth sequence is of shape $ len(E(n)) \times D $
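
The two storage layouts above can be sketched with NumPy. This is a minimal illustration, not a prescribed data format; the sizes `N`, `T`, `D` and the list `lengths` are made-up values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D = 4, 5, 3  # hypothetical: 4 sequences, 5 steps, 3 features

# Fixed-length case: every sequence has length T, so one N x T x D array works.
fixed = rng.normal(size=(N, T, D))

# Variable-length case: a list of N arrays, the nth of shape (len(E(n)), D).
lengths = [2, 5, 3, 4]  # hypothetical sequence lengths
variable = [rng.normal(size=(length, D)) for length in lengths]

print(fixed.shape)
print([seq.shape for seq in variable])
```

The list form cannot be stacked into a single array without padding, which is why variable-length sequences are kept in a list here.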


**Regular feedforward unit:**
    
<img src="images/forward.png" alt="forward" style="width:50%;height:50%"/>
    
**Recurrent unit:**
    
* The recurrent unit has a feedback connection from hidden layer to itself. The feedback loop implies there is a delay of one time unit. 

<img src="images/simple_recurrent_unit.png" alt="forward" style="width:50%;height:50%"/>


* Recurrent Unit Internals

Let’s take a closer look at what is going on inside a recurrent unit.
<img src="images/recurrent_unit_internal.png" alt="forward" style="width:90%;height:100%"/>

* The input word vector $x_t$ is first multiplied by the weight matrix: $W_x$
* Then bias values are added to produce our first intermediate result: $x_t W_x + b$ 
* Meanwhile, the state vector from the previous time step $h_{t-1}$ is multiplied with another weight matrix $ W_h $ to produce our second intermediate result: $h_{t-1} W_h$ 
* These two are then added together, and passed through an activation function such as ReLU, sigmoid or tanh to produce the state for the current time step: $h_t$
* This state vector $h_t$ is passed on as input to the next fully-connected layer, that applies another weight matrix, bias and activation to produce the output: $y_t$

> The key thing to note here is that the RNN’s state $h_t$ is used to produce the output $y_t$ as well as looped back to produce the next state.

* In Math

$$ h(t) = f(W^T_h h(t-1) + W^T_x x(t) + b_h) $$

$$ y(t) = softmax(W^T_o h(t) + b_o) $$

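where $f()$ is a non-linear activation function such as sigmoid, tanh or ReLU. With column vectors, $W^T_x x(t)$ is the same product as the row-vector form $x_t W_x$ used in the step-by-step description.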
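
The two formulas can be sketched as a single time step in NumPy. This is only an illustration under stated assumptions: it uses the row-vector convention ($x_t W_x$), picks tanh for $f$, and the sizes `D`, `M`, `K` and the random weights are hypothetical.

```python
import numpy as np

def softmax(a):
    # Subtract the row maximum for numerical stability.
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recurrent_step(x_t, h_prev, W_x, W_h, b_h, W_o, b_o):
    # New state from the current input and the previous state (f = tanh here).
    h_t = np.tanh(x_t @ W_x + h_prev @ W_h + b_h)
    # Output computed from the new state.
    y_t = softmax(h_t @ W_o + b_o)
    return h_t, y_t

rng = np.random.default_rng(0)
D, M, K = 4, 3, 2  # hypothetical: input features, hidden units, output classes
W_x = rng.normal(scale=0.1, size=(D, M))
W_h = rng.normal(scale=0.1, size=(M, M))
W_o = rng.normal(scale=0.1, size=(M, K))
b_h = np.zeros((1, M))
b_o = np.zeros((1, K))

x_t = rng.normal(size=(1, D))
h_prev = np.zeros((1, M))
h_t, y_t = recurrent_step(x_t, h_prev, W_x, W_h, b_h, W_o, b_o)
print(h_t.shape, y_t.shape)
```

Note that `h_t` is both returned for the next step and used to compute `y_t`, which is the looping behaviour described above.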

* How big is $ W_h $ ?
    * If the hidden layer contains $M$ neurons, each neuron is connected back to every neuron in the hidden layer:
        * the first neuron connects back to all $M$ neurons, the second neuron connects back to all $M$ neurons, ..., the $M$th neuron connects back to all $M$ neurons. Therefore, $ W_h $ is $ M \times M $

> Let's say you've decided to use a word vector ($x_t$) of length 200, and a state vector ($h_t$) of length 300. Treating these as single-row matrices, we can write the sizes as 1x200 and 1x300 respectively. Then, the input weight matrix ($W_x$) has shape 200x300, the recurrent weight matrix ($W_h$) has shape 300x300 and bias vector $b$ has shape 1x300.
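
The shapes in this example can be checked quickly. The sketch below uses zero-filled arrays only, since the point is the shape arithmetic and the parameter count, not the values.

```python
import numpy as np

D, M = 200, 300  # word vector length and state vector length from the text

x_t = np.zeros((1, D))     # 1 x 200 input row vector
h_prev = np.zeros((1, M))  # 1 x 300 previous state
W_x = np.zeros((D, M))     # 200 x 300 input weights
W_h = np.zeros((M, M))     # 300 x 300 recurrent weights
b = np.zeros((1, M))       # 1 x 300 bias

# Both products are 1 x 300, so they can be added before the activation.
pre_activation = x_t @ W_x + h_prev @ W_h + b
n_params = W_x.size + W_h.size + b.size

print(pre_activation.shape)
print(n_params)
```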

* Note that this is not Markov: $ h(t) $ is defined by $ h(t-1) $, which in turn is defined by $ h(t-2) $, all the way down to $ h(0) $. In terms of the inputs, $ h(t) $ therefore depends on the whole history $ x(1), ..., x(t) $, whereas under the Markov property each element depends only on the immediately preceding one.


$$ h(t) = f(W^T_h h(t-1) + W^T_x x(t) + b_h) $$
 
$$ h(t - 1) = f(W^T_h h(t-2) + W^T_x x(t-1) + b_h) $$
   
$$  h(0) = ... $$

* $ h(0) $ is a hyperparameter. We can initialize it to 0
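
The recursion can be sketched as a loop that starts from $h(0) = 0$. This is a minimal illustration assuming tanh for $f$, the row-vector convention, and hypothetical sizes and random weights. It also perturbs one input to show that later states depend on earlier inputs while earlier states do not change.

```python
import numpy as np

def run_rnn(X, W_x, W_h, b_h, h0=None):
    # X has shape (T, D); returns all hidden states, shape (T, M).
    M = W_h.shape[0]
    h = np.zeros(M) if h0 is None else h0  # h(0) initialized to 0
    states = []
    for x_t in X:
        h = np.tanh(x_t @ W_x + h @ W_h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, D, M = 6, 4, 3  # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(D, M))
W_h = rng.normal(scale=0.5, size=(M, M))
b_h = np.zeros(M)
X = rng.normal(size=(T, D))

H = run_rnn(X, W_x, W_h, b_h)

# Change only the third element of the sequence and run again.
X_changed = X.copy()
X_changed[2] += 1.0
H_changed = run_rnn(X_changed, W_x, W_h, b_h)

# States before the change are identical; the difference then propagates forward.
print(np.abs(H - H_changed).max(axis=1))
```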

* We can add more than one recurrent unit to a neural network

<img src="images/two_recurrent_units.png" alt="forward" style="width:50%;height:50%"/>

**Intuition by unfolding recurrent unit**

* We visualize a recurrent neural network by unfolding it in time to a feedforward neural network, as shown below.
* We have a sequence of elements, each of which is represented by a hidden layer in the feedforward neural network. At each hidden layer, we want to make a decision about what has happened so far in this sequence. 
* If your sequence is reasonably stationary, you can use the same classifier at each hidden layer in time.
* Because this is a sequence, you also want to take into account the past - everything that happened before each hidden layer.  

> One natural thing to do is to use the states of the previous hidden layer, recursively, as a summary of what happened before. We use the same model $h$ to model the states of each hidden layer. As shown below, each hidden layer takes into account the previous hidden layer's states multiplied by the weight matrix between the two hidden layers.

<img src="images/unfold_rnn_1.png" alt="forward" style="width:65%;height:65%"/>

* If each hidden layer took in the previous hidden layer's states multiplied by a distinct weight matrix $ W_1, W_2, W_3, ... $, we would need a very deep neural network with a separate set of weights per time step to remember far into the past, since a sequence may involve hundreds or even thousands of steps.

> A more compact way is to use <b style="color:red">shared weights</b> $ W_h $ between each pair of adjacent layers in the sequence. Similarly, we use <b style="color:red">shared weights</b> $ W_x $ for all the inputs $ x(1), x(2),..., x(t),...$ into the hidden layers and <b style="color:red">shared weights</b> $ W_o $ for the outputs of all layers. We end up with a network with a relatively simple repeating pattern.

<img src="images/unfold_rnn_2.png" alt="forward" style="width:65%;height:65%"/>


<img src="images/rnn_3.png" alt="forward" style="width:40%;height:40%"/>
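
To see why sharing weights matters, the arithmetic below compares the number of parameters with and without sharing. The sizes reuse the 200/300 example from earlier; the number of output classes `K` and the sequence length `T` are hypothetical values picked for illustration.

```python
D, M = 200, 300  # input and state sizes from the earlier example
K = 10           # hypothetical number of output classes
T = 1000         # hypothetical sequence length

# Parameters used at one time step: W_x, W_h, b_h, W_o, b_o.
per_step = D * M + M * M + M + M * K + K

shared_total = per_step        # shared weights: independent of T
unshared_total = T * per_step  # a distinct set of weights for every step

print(shared_total)
print(unshared_total)
```

With shared weights the parameter count stays constant no matter how long the sequence is.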

**Take Away**
* The simple recurrent unit is the foundation for the other variants of recurrent neural networks

## Backpropagation Through Time (BPTT)

## Rated Recurrent Unit

* The rated recurrent unit weights 2 things:
    * $ h(t-1) $, the previous value of hidden state
    * $ f(x(t), h(t-1)) $, output that we would have gotten from a regular recurrent unit 

* We use a rate/update gate z to do the weighting. $ z(t) $ has the same shape as $ h(t) $ (a vector for a single sequence, a matrix for a batch)
* The rated recurrent unit architecture:

<img src="images/rrnn.png" alt="forward" style="width:65%;height:65%"/>

* The mathematical formula for calculating the current hidden state is:

$$ h(t) = (1-z(t)) \circ h(t-1) + z(t) \circ f(x(t), h(t-1)) $$

where $ \circ $ indicates element-by-element multiplication

* We define <b style="color:red">$ \hat{h}(t) = f(x(t), h(t-1)) $</b>, which can be considered the candidate hidden state that goes through the rating gate z. Then we rewrite the mathematical formula for calculating the current hidden state:

<b style="color:red">$$ h(t) = (1-z(t)) \circ h(t-1) + z(t) \circ \hat{h}(t) $$</b>

> Intuitively, the rating gate z is choosing between the old state value $ h(t-1) $  and the new state value $ \hat{h} $

**Calculating candidate hidden value $\hat{h}$**

$$ \hat{h}(t) = f(x(t), h(t-1)) = f(x(t)W_{xh} + h(t-1)W_{hh} + b_h)$$

**Calculating rating gate z**

* There are many options for how z can be calculated: as a weight parameter, as a function of x, etc.
* One common way is calculating z as a function of $ x(t) $ and $ h(t-1) $ 

$$ z(t) = \sigma(x(t)W_{xz} + h(t-1)W_{hz} + b_z)$$
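
The rated recurrent unit can be sketched as one step in NumPy. This is an illustration under assumptions: tanh is chosen for $f$, parameters live in a plain dict `p` keyed by the subscripts used above, and sizes and random weights are hypothetical. The second call shows the gate's effect: a strongly negative $b_z$ pushes $z$ toward 0, so the old state is kept.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rated_step(x_t, h_prev, p):
    # Candidate state: what a regular recurrent unit would have produced.
    h_hat = np.tanh(x_t @ p["W_xh"] + h_prev @ p["W_hh"] + p["b_h"])
    # Rate gate, computed from x(t) and h(t-1).
    z = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])
    # Element-wise blend of the old state and the candidate.
    h_t = (1.0 - z) * h_prev + z * h_hat
    return h_t, z, h_hat

rng = np.random.default_rng(0)
D, M = 4, 3  # hypothetical sizes
p = {
    "W_xh": rng.normal(scale=0.1, size=(D, M)),
    "W_hh": rng.normal(scale=0.1, size=(M, M)),
    "b_h": np.zeros(M),
    "W_xz": rng.normal(scale=0.1, size=(D, M)),
    "W_hz": rng.normal(scale=0.1, size=(M, M)),
    "b_z": np.zeros(M),
}
x_t = rng.normal(size=D)
h_prev = rng.normal(size=M)
h_t, z, h_hat = rated_step(x_t, h_prev, p)

# Strongly negative gate bias: z is close to 0, so h(t) stays close to h(t-1).
p_keep = dict(p, b_z=np.full(M, -50.0))
h_keep, z_keep, _ = rated_step(x_t, h_prev, p_keep)
print(np.allclose(h_keep, h_prev))
```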

## Gated Recurrent Unit

* The Gated Recurrent Unit (GRU) adds one more gate, called the reset gate, to the Rated Recurrent Unit.

<img src="images/grn.png" alt="forward" style="width:65%;height:65%"/>

* We calculate the current hidden state $ h(t) $ the same way as we calculate it for Rated Recurrent Unit:

$$ h(t) = (1-z(t)) \circ h(t-1) + z(t) \circ \hat{h}(t) $$

* $ \hat{h} $ is calculated from the element-wise product of the reset gate r and the previous hidden state $ h(t-1) $. The gate r controls how much of the previous hidden state we consider when we create the candidate hidden state. In other words, it has the ability to reset the hidden state:

<b style="color:red">$$ \hat{h}(t) = f(x(t), r(t) \circ h(t-1)) $$</b> 

**Calculating candidate hidden value $\hat{h}$**

$$ \hat{h}(t) = f(x(t), r(t) \circ h(t-1)) = f(x(t)W_{xh} + (r(t) \circ h(t-1))W_{hh} + b_h)$$

**Calculating rating gate z**

$$ z(t) = \sigma(x(t)W_{xz} + h(t-1)W_{hz} + b_z)$$

**Calculating reset gate r**

$$ r(t) = \sigma(x(t)W_{xr} + h(t-1)W_{hr} + b_r)$$
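
The three GRU formulas can be put together as one step. This is a sketch under the same assumptions as before: tanh for $f$, a hypothetical parameter dict `p` keyed by the subscripts in the formulas, and made-up sizes. The last lines illustrate the "reset" behaviour: with a strongly negative $b_r$, the candidate no longer depends on the previous state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    z = sigmoid(x_t @ p["W_xz"] + h_prev @ p["W_hz"] + p["b_z"])  # rating gate
    r = sigmoid(x_t @ p["W_xr"] + h_prev @ p["W_hr"] + p["b_r"])  # reset gate
    # The reset gate scales h(t-1) before it enters the candidate.
    h_hat = np.tanh(x_t @ p["W_xh"] + (r * h_prev) @ p["W_hh"] + p["b_h"])
    h_t = (1.0 - z) * h_prev + z * h_hat
    return h_t, h_hat

rng = np.random.default_rng(0)
D, M = 4, 3  # hypothetical sizes
p = {}
for name in ("h", "z", "r"):
    p[f"W_x{name}"] = rng.normal(scale=0.1, size=(D, M))
    p[f"W_h{name}"] = rng.normal(scale=0.1, size=(M, M))
    p[f"b_{name}"] = np.zeros(M)

x_t = rng.normal(size=D)
h_a = rng.normal(size=M)
h_b = rng.normal(size=M)
h_t, h_hat = gru_step(x_t, h_a, p)

# With r close to 0, two different previous states give the same candidate.
p_reset = dict(p, b_r=np.full(M, -50.0))
_, h_hat_a = gru_step(x_t, h_a, p_reset)
_, h_hat_b = gru_step(x_t, h_b, p_reset)
print(np.allclose(h_hat_a, h_hat_b))
```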

### Take Away
* Compared with Rated Recurrent Units
    * GRU adds a reset gate r
    * GRU calculates $ \hat{h} $ considering the r
    * All other components are the same

## Long Short-Term Memory (LSTM)

* The concepts of LSTM are quite similar to those of GRU
* LSTM does not easily succumb to the vanishing gradient problem, which typically lets it perform better than a vanilla recurrent unit on long sequences
* Compared with GRU, LSTM is more complex: it has more components
* Basically, an LSTM has:
    * Three gates: input gate $ i(t) $, output gate $ o(t) $ and forget gate $ f(t) $
    * A memory cell $ c(t) $ 
    * A candidate memory cell $ \hat{c}(t) $, which takes the place of $ \hat{h}(t) $ 
* The LSTM architecture:
   
<img src="images/lstm.png" alt="forward" style="width:65%;height:65%"/>

> The input gate controls how much of the candidate memory cell value goes into the current memory cell; 

> The forget gate controls how much of the previous memory cell value goes into the current memory cell; 

> The output gate controls how much of the current memory cell value goes into the hidden state

* The mathematical formula for calculating current hidden state is:

$$ h(t) = o(t) \circ g(c(t)) $$

* The mathematical formula for calculating the memory cell is:

$$ c(t) = f(t) \circ c(t-1) + i(t) \circ \hat{c}(t) $$

* where the candidate memory cell $ \hat{c}(t) $ is calculated by:

$$ \hat{c}(t) = g(x(t)W_{xc} + h(t-1)W_{hc} + b_c)$$

and $ g $ is a non-linear activation function, typically tanh

**Calculating input gate $i$**

$$ i(t) = \sigma(x(t)W_{xi} + h(t-1)W_{hi} + c(t-1)W_{ci} + b_i)$$

**Calculating forget gate $f$**

$$ f(t) = \sigma(x(t)W_{xf} + h(t-1)W_{hf} + c(t-1)W_{cf} + b_f)$$

**Calculating output gate $o$**

$$ o(t) = \sigma(x(t)W_{xo} + h(t-1)W_{ho} + c(t)W_{co} + b_o)$$

* Notice that, unlike $i(t)$ and $f(t)$, the output gate $o(t)$ depends on $c(t)$, not $c(t-1)$
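
The LSTM formulas above, including the order in which they must be evaluated, can be sketched as one step. This is an illustration under assumptions: $g$ is taken to be tanh, the cell-to-gate weights $W_{ci}, W_{cf}, W_{co}$ are full $M \times M$ matrices as the formulas are written, and the helper `init_params`, the dict `p` and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(D, M, rng):
    # Hypothetical helper: small random weights keyed by the formula subscripts.
    p = {}
    for name in "ifoc":
        p[f"W_x{name}"] = rng.normal(scale=0.1, size=(D, M))
        p[f"W_h{name}"] = rng.normal(scale=0.1, size=(M, M))
        p[f"b_{name}"] = np.zeros(M)
    for name in "ifo":
        p[f"W_c{name}"] = rng.normal(scale=0.1, size=(M, M))
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    # Input and forget gates look at the previous memory cell c(t-1).
    i = sigmoid(x_t @ p["W_xi"] + h_prev @ p["W_hi"] + c_prev @ p["W_ci"] + p["b_i"])
    f = sigmoid(x_t @ p["W_xf"] + h_prev @ p["W_hf"] + c_prev @ p["W_cf"] + p["b_f"])
    c_hat = np.tanh(x_t @ p["W_xc"] + h_prev @ p["W_hc"] + p["b_c"])
    c_t = f * c_prev + i * c_hat
    # The output gate looks at the current memory cell c(t), so it comes last.
    o = sigmoid(x_t @ p["W_xo"] + h_prev @ p["W_ho"] + c_t @ p["W_co"] + p["b_o"])
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, M = 4, 3  # hypothetical sizes
p = init_params(D, M, rng)
x_t = rng.normal(size=D)
h_prev = np.zeros(M)
c_prev = np.zeros(M)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, p)
print(h_t.shape, c_t.shape)
```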

## Take Away

**Modularization**
* The rated recurrent unit, the gated recurrent unit and the LSTM are different ways to architect the recurrent unit. Since how a recurrent unit works is an internal detail, we can treat it as a black box and modularize it, plugging in different recurrent units as needed.
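
One way to picture this modularity is a sequence loop that only knows the signature `step(x_t, h_prev) -> h_t`. The sketch below is an assumption about how one might organize such code, not a reference design; the factory names `make_vanilla` and `make_rated`, the sizes and the weights are all hypothetical. An LSTM would additionally need to carry its memory cell along with the state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_vanilla(W_x, W_h, b_h):
    def step(x_t, h_prev):
        return np.tanh(x_t @ W_x + h_prev @ W_h + b_h)
    return step

def make_rated(W_xh, W_hh, b_h, W_xz, W_hz, b_z):
    def step(x_t, h_prev):
        h_hat = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
        z = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)
        return (1.0 - z) * h_prev + z * h_hat
    return step

def run_sequence(step, X, h0):
    # The loop treats the unit as a black box: any step function fits.
    h = h0
    for x_t in X:
        h = step(x_t, h)
    return h

rng = np.random.default_rng(0)
T, D, M = 5, 4, 3  # hypothetical sizes

def w(*shape):
    return rng.normal(scale=0.1, size=shape)

X = rng.normal(size=(T, D))
h0 = np.zeros(M)

vanilla = make_vanilla(w(D, M), w(M, M), np.zeros(M))
rated = make_rated(w(D, M), w(M, M), np.zeros(M), w(D, M), w(M, M), np.zeros(M))

h_vanilla = run_sequence(vanilla, X, h0)
h_rated = run_sequence(rated, X, h0)
print(h_vanilla.shape, h_rated.shape)
```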

**Simple, but not Simpler**
* You may be tempted to use more complex models such as GRUs and LSTMs
* But if a simpler model solves the problem completely, we should stop there
* Otherwise we only add training time and increase the possibility of overfitting
* [Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Chung et al., 2014](https://arxiv.org/pdf/1412.3555.pdf)
* Interesting points:
    * Deep learning is not that common in Kaggle
    * Many contestants do feature engineering + simpler models
    * Deep learning is interesting because it touches many different but related fields, including computational neuroscience, machine learning and AI
    * For basic tasks, a very simple and fast model may be the best choice
