# GRU

**Gated Recurrent Unit (GRU)** is a type of [Recurrent Neural Network (RNN)](https://fedmug.github.io/kbtu-ml-book/rnn/vanilla_rnn.html) architecture that was introduced in 2014 by [Kyunghyun Cho et al.](https://arxiv.org/pdf/1412.3555.pdf) Like [Long Short-Term Memory (LSTM)](https://fedmug.github.io/kbtu-ml-book/rnn/lstm.html), it can capture long-term dependencies in sequential data, but it has fewer parameters and is typically faster to train.

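To make the parameter saving concrete, here is a rough count. It is only a sketch under stated assumptions: a single layer with $d$ inputs and $h$ hidden units, one weight block $(W, U, b)$ per gate or candidate as in the equations later in this section, and a standard LSTM with four such blocks. The helper name `gated_rnn_param_count` and the sizes are hypothetical.

```python
def gated_rnn_param_count(d: int, h: int, n_blocks: int) -> int:
    """Count parameters of one recurrent layer built from n_blocks blocks (W, U, b)."""
    per_block = d * h + h * h + h  # W is d x h, U is h x h, b is 1 x h
    return n_blocks * per_block


d, h = 128, 256  # hypothetical layer sizes

gru_params = gated_rnn_param_count(d, h, n_blocks=3)   # reset, update, candidate
lstm_params = gated_rnn_param_count(d, h, n_blocks=4)  # input, forget, output, candidate

print(gru_params)                # 295680
print(lstm_params)               # 394240
print(gru_params / lstm_params)  # 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size. Library implementations may keep additional bias vectors, so reported counts can differ slightly.
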
## Architecture

<center><img src="GRU_arch.gif" alt="GRU architecture" width="500"/></center>

### Reset Gate and Update Gate

A GRU has two gates: the reset gate and the update gate. The output of each gate is given by a fully connected layer with a [sigmoid activation <img src="sigma.svg" width="25"/> function](https://fedmug.github.io/kbtu-ml-book/mlp/activations.html), so every entry of a gate lies in $(0, 1)$.

<center><img src="Reset gate and Update gate.svg" alt="Reset gate and Update gate" width="500"/></center>

> Suppose that the <b>input</b> is a minibatch $X_t \in \mathbb{R}^{n \times d}$ (where $n$ is the <b>number of examples</b>, $d$ is the <b>number of inputs</b> and $t$ is the <b>time step</b>).
>
>The <b>hidden state</b> of the previous time step is $H_{t-1} \in \mathbb{R}^{n \times h}$ (where $h$ is the <b>number of hidden units</b>).
>
>Then the <b>reset gate</b> is $R_t \in \mathbb{R}^{n \times h}$ and the <b>update gate</b> is $Z_t \in \mathbb{R}^{n \times h}$.

They are computed as follows:

\begin{align}
    R_t &= \sigma(X_t W_{r} + H_{t-1} U_{r} + b_r) \\
    Z_t &= \sigma(X_t W_{z} + H_{t-1} U_{z} + b_z)
\end{align}

where $W_{r}, W_{z} \in \mathbb{R}^{d \times h}$ and $U_{r}, U_{z} \in \mathbb{R}^{h \times h}$ are weight parameters and $b_r, b_z \in \mathbb{R}^{1 \times h}$ are bias parameters. Each bias row is broadcast across the $n$ examples when it is added.

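The two gate equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the sizes `n`, `d`, `h`, the random initialisation and the `sigmoid` helper are assumptions made for the example, while the variable names mirror the symbols in the equations.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
n, d, h = 4, 3, 5  # hypothetical batch size, input size, hidden size

X_t = rng.normal(size=(n, d))     # current input minibatch
H_prev = rng.normal(size=(n, h))  # hidden state from the previous time step

# Randomly initialised parameters, used here only to check the shapes.
W_r = rng.normal(scale=0.1, size=(d, h))
W_z = rng.normal(scale=0.1, size=(d, h))
U_r = rng.normal(scale=0.1, size=(h, h))
U_z = rng.normal(scale=0.1, size=(h, h))
b_r = np.zeros((1, h))
b_z = np.zeros((1, h))

# The (1, h) biases are broadcast over the n rows of the batch.
R_t = sigmoid(X_t @ W_r + H_prev @ U_r + b_r)
Z_t = sigmoid(X_t @ W_z + H_prev @ U_z + b_z)

print(R_t.shape, Z_t.shape)  # (4, 5) (4, 5)
```

Both gates have the same shape as the hidden state, which is what allows them to be applied to it elementwise.
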

### Candidate Hidden State

Next, we combine the <b>reset gate</b> $R_t$ with the standard updating mechanism, resulting in a <b>candidate hidden state</b> $\tilde{H}_t$ at time step $t$. Here we use a [tanh activation <img src="tanh.svg" width="25"/> function](https://fedmug.github.io/kbtu-ml-book/mlp/activations.html):

\begin{align}
\tilde{H}_t &= \tanh(X_t W_{h} + (R_t \odot H_{t-1}) U_{h} + b_h)
\end{align}

where $W_{h} \in \mathbb{R}^{d \times h}$ and $U_{h} \in \mathbb{R}^{h \times h}$ are weight parameters, $b_h \in \mathbb{R}^{1 \times h}$ is the bias, and $\odot$ is the [Hadamard (elementwise) product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) operator.

<center><img src="Hidden state.svg" alt="Hidden state" width="500"/></center>

> If the entries of the reset gate $R_t$ are close to 1, the model acts like a regular [Recurrent Neural Network (RNN)](https://fedmug.github.io/kbtu-ml-book/rnn/vanilla_rnn.html). Conversely, when the entries of $R_t$ are close to 0, the candidate hidden state is the output of a [Multi-Layer Perceptron (MLP)](https://fedmug.github.io/kbtu-ml-book/mlp/mlp.html) applied to the current input $X_t$ alone, so the previous hidden state is effectively reset.

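The two limiting cases of the reset gate can be checked numerically. The sketch below assumes NumPy, hypothetical sizes and random parameters, and a hypothetical helper `candidate_state` that implements the candidate equation above. It forces $R_t$ to all ones and all zeros by hand instead of computing it from the gate equation.

```python
import numpy as np


def candidate_state(X_t, H_prev, R_t, W_h, U_h, b_h):
    """Candidate hidden state: tanh(X_t W_h + (R_t * H_prev) U_h + b_h)."""
    return np.tanh(X_t @ W_h + (R_t * H_prev) @ U_h + b_h)


rng = np.random.default_rng(0)
n, d, h = 4, 3, 5  # hypothetical batch size, input size, hidden size

X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_h = rng.normal(scale=0.1, size=(d, h))
U_h = rng.normal(scale=0.1, size=(h, h))
b_h = np.zeros((1, h))

# Reference computations for the two limiting cases.
rnn_step = np.tanh(X_t @ W_h + H_prev @ U_h + b_h)  # vanilla RNN update
mlp_layer = np.tanh(X_t @ W_h + b_h)                # layer that sees only X_t

open_gate = candidate_state(X_t, H_prev, np.ones((n, h)), W_h, U_h, b_h)
closed_gate = candidate_state(X_t, H_prev, np.zeros((n, h)), W_h, U_h, b_h)

same_as_rnn = np.allclose(open_gate, rnn_step)
same_as_mlp = np.allclose(closed_gate, mlp_layer)
print(same_as_rnn, same_as_mlp)  # True True
```

In a trained network the gate entries lie strictly between 0 and 1, so the candidate state interpolates between these two behaviours separately for each hidden unit.
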
### Hidden State

Finally, we need to incorporate the effect of the update gate $Z_t$. This determines how much the new hidden state $H_t$ matches the old state $H_{t-1}$ compared to how much it resembles the new candidate state $\tilde{H}_t$. The update gate $Z_t$ can be used for this purpose by taking elementwise convex combinations of $H_{t-1}$ and $\tilde{H}_t$. This leads to the final update equation for the GRU.

\begin{align}
H_t &= (1-Z_t) \odot H_{t-1} + Z_t \odot  \tilde{H}_t 
\end{align}

<center><img src="Candidate hidden state.svg" alt=" Candidate hidden state" width="500"/></center>

>When $Z_t$ is close to 0, we keep the old state $H_{t-1}$, ignoring the information from the current input $X_t$ and effectively skipping the current time step in the dependency chain.
>
>On the other hand, when $Z_t$ is close to 1, the new hidden state $H_t$ approaches the candidate hidden state $\tilde{H}_t$.
>
>Some references swap the roles of $Z_t$ and $1 - Z_t$ in the update equation, which reverses this reading.

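Putting the pieces together, one GRU time step and its unrolling over a short sequence might look like the sketch below. It follows the update equation above, in which $Z_t$ weights the candidate state and $1 - Z_t$ weights the old state. The helpers `init_params` and `gru_step`, the sizes, and the random initialisation are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d, h, rng):
    """Hypothetical helper: random W, U and zero b for the r, z and h blocks."""
    params = {}
    for name in ("r", "z", "h"):
        params["W_" + name] = rng.normal(scale=0.1, size=(d, h))
        params["U_" + name] = rng.normal(scale=0.1, size=(h, h))
        params["b_" + name] = np.zeros((1, h))
    return params


def gru_step(X_t, H_prev, p):
    """One GRU step: gates, candidate state, then the convex combination."""
    R_t = sigmoid(X_t @ p["W_r"] + H_prev @ p["U_r"] + p["b_r"])
    Z_t = sigmoid(X_t @ p["W_z"] + H_prev @ p["U_z"] + p["b_z"])
    H_tilde = np.tanh(X_t @ p["W_h"] + (R_t * H_prev) @ p["U_h"] + p["b_h"])
    return (1.0 - Z_t) * H_prev + Z_t * H_tilde


rng = np.random.default_rng(0)
n, d, h, T = 4, 3, 5, 6  # hypothetical batch, input, hidden and sequence sizes
params = init_params(d, h, rng)
X = rng.normal(size=(T, n, d))  # one minibatch per time step

# Unroll over the sequence, starting from a zero hidden state.
H = np.zeros((n, h))
for t in range(T):
    H = gru_step(X[t], H, params)
print(H.shape)  # (4, 5)

# A very negative update-gate bias drives Z_t towards 0, so the step
# returns (almost exactly) the old state it was given.
frozen = dict(params, b_z=np.full((1, h), -50.0))
H_start = rng.normal(size=(n, h))
H_frozen = gru_step(X[0], H_start, frozen)
print(np.allclose(H_frozen, H_start))  # True
```
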
> GRUs possess two distinctive characteristics:
>
> - Reset gates help capture short-term dependencies within sequences.
>
> - Update gates help capture long-term dependencies within sequences.