# GRU

1. Introduction

2. Foundations of GRUs
- Detailed explanation of RNNs and their limitations.
- Introduction to the architecture and mechanisms of GRUs.
- Comparison of GRUs with other recurrent units like LSTM (Long Short-Term Memory).
- Mathematical formulation.

3. Architectural Design and Variants
- Backpropagation through time.
- In-depth analysis of the GRU architecture, including the reset gate and update gate.
- Exploration of different variants and modifications of GRUs proposed in the literature.
- Discussion on parameter tuning and hyperparameter settings.

4. Applications of GRUs
- Review of real-world applications of GRUs, including natural language processing, speech recognition, and time series forecasting.
- Case studies highlighting successful implementations of GRU-based models.

5. Advancements and Research Trends
- Overview of recent advancements and breakthroughs in GRU research.
- Discussion on open research challenges and emerging trends in sequence modeling.

6. Conclusion
- Summary of key findings and contributions.
- Recommendations for future research directions and potential improvements in GRU architectures.

7. References
- Citations of relevant research papers and resources.

# 1. Introduction

### RNN
Modeling and predicting sequential data requires a different approach from standard regression or classification. Luckily, a particular type of neural network, the Recurrent Neural Network (RNN), is specifically designed for that purpose.

`toggle`

A recurrent neural network (RNN) is a type of artificial neural network (ANN) designed to address the limitations of traditional neural networks in processing sequential data. Traditional feedforward architectures treat each input independently, so they cannot capture the dependencies between successive inputs. An RNN keeps an internal memory of past inputs, which is useful for predicting target values.

`toggle`

<table>
  <tr>
    <td><img src="./images/FNN.png" alt="Image 1"></td>
    <td><img src="./images/RNN.png" alt="Image 2"></td>
  </tr>
  <tr>
    <td>Feedforward Neural Network</td>
    <td>Recurrent Neural Network</td>
  </tr>
</table>


__Important characteristics:__
1. An RNN reuses the same weights at every time step within each layer of the network.
2. An RNN takes into account not only the current input but also the previous inputs, which allows it to remember what happened earlier in the sequence.
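
The two characteristics above can be illustrated with a minimal NumPy sketch of a vanilla RNN forward pass. The function name `rnn_forward`, the `tanh` activation, and the parameter names `W_x`, `W_h`, `b` are assumptions made for this illustration and are not taken from any particular library.

```python
import numpy as np

def rnn_forward(X, W_x, W_h, b, h0):
    """Run a vanilla RNN over a sequence X of shape (T, d).

    Returns the hidden states for all time steps, shape (T, h).
    """
    h = h0
    states = []
    for x_t in X:
        # The same W_x, W_h and b are reused at every time step,
        # and the previous hidden state h carries the memory forward.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d, h = 5, 3, 4                      # sequence length, input size, hidden size
X = rng.normal(size=(T, d))
W_x = rng.normal(scale=0.5, size=(d, h))
W_h = rng.normal(scale=0.5, size=(h, h))
b = np.zeros(h)

H = rnn_forward(X, W_x, W_h, b, np.zeros(h))
print(H.shape)
```

Because each hidden state depends on the previous one, perturbing the first input changes the hidden state at the last time step as well.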

Simple RNN models usually run into two major issues. Both are related to the gradient, that is, the vector of partial derivatives of the loss function with respect to the model parameters, which is propagated backward through the time steps during training.

- The vanishing gradient problem occurs when the gradient becomes so small that parameter updates become negligible; eventually the model stops learning.

- The exploding gradient problem occurs when the gradient becomes too large, which makes the model unstable. In this case, large error gradients accumulate and the model weights become too large. This issue can cause longer training times and poor model performance.
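
Both problems can be seen in a toy setting. The sketch below assumes a scalar linear recurrence $h_t = w \cdot h_{t-1}$, which is a deliberate simplification and not a full RNN: backpropagating through $T$ steps multiplies the gradient by $w$ once per step, so its magnitude behaves like $|w|^T$. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(w, steps):
    """Magnitude of d h_T / d h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return abs(grad)

for w in (0.5, 1.0, 1.5):
    print(f"w = {w}: gradient scale after 50 steps = {gradient_scale(w, 50):.3e}")
```

With $w = 0.5$ the gradient shrinks toward zero (vanishing), with $w = 1.5$ it grows by many orders of magnitude (exploding), and only $w = 1$ keeps it stable.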



### GRU (Gated recurrent unit)
Introduced by Cho (`add wiki ink`) et al. in 2014 (`add original paper link #1`), the GRU aims to mitigate the __vanishing gradient problem__ that affects standard recurrent neural networks. A GRU can also be viewed as a variation on the LSTM: both rely on gating mechanisms and, in some cases, produce comparably good results.<br>

The main idea behind GRU is to use gating mechanisms to selectively update the hidden state of the network at each time step. The gating mechanisms are used to control the flow of information in and out of the network. The GRU has two gating mechanisms, called the reset gate and the update gate.

The reset gate determines how much of the previous hidden state is used when forming the new candidate state, while the update gate determines how much of that candidate replaces the previous hidden state. The output of the GRU is calculated based on the updated hidden state.

`4 quiz questions`

# 2. Foundation



### LSTM VS GRU 
`3 quiz questions`

# 3. Architecture
<center><img src="GRU_arch.gif" alt="GRU architecture" width="500"/></center>

### Reset Gate and Update Gate

A GRU has two gates: the reset gate and the update gate. The outputs of the gates are given by two fully connected layers with a [sigmoid activation <img src="sigma.svg" width="25"/> function](https://fedmug.github.io/kbtu-ml-book/mlp/activations.html).
<center><img src="Reset gate and Update gate.svg" alt="Reset gate and Update gate" width="500"/></center>

Suppose that the <b>input</b> is a minibatch $X_t \in \mathbb{R}^{n \times d}$ (where $n$ is the <b>number of examples</b>, $d$ is the <b>number of inputs</b> and $t$ is the <b>time step</b>).

The <b>hidden state</b> of the previous time step is $H_{t-1} \in \mathbb{R}^{n \times h}$ (where $h$ is the <b>number of hidden units</b>).

Then the <b>reset gate</b> is $R_t \in \mathbb{R}^{n \times h}$ and the <b>update gate</b> is $Z_t \in \mathbb{R}^{n \times h}$.

They are computed as follows:

\begin{align}
    R_t &= \sigma(X_t W_{r} + H_{t-1} U_{r} + b_r) \\
    Z_t &= \sigma(X_t W_{z} + H_{t-1} U_{z} + b_z)
\end{align}

where $W_{r}, W_{z} \in \mathbb{R}^{d \times h}$ and $U_{r},U_{z} \in \mathbb{R}^{h \times h}$ are weight parameters and $b_r, b_z \in \mathbb{R}^{1 \times h}$ are bias parameters. The sigmoid is applied elementwise, so every entry of $R_t$ and $Z_t$ lies in $(0, 1)$.
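
As a minimal sketch, the two gate equations above translate almost line by line into NumPy. The function names `sigmoid` and `gru_gates` and the random initialization are assumptions for illustration only; the variable names follow the symbols in the equations.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def gru_gates(X_t, H_prev, W_r, U_r, b_r, W_z, U_z, b_z):
    """Compute the reset gate R_t and the update gate Z_t, both of shape (n, h)."""
    R_t = sigmoid(X_t @ W_r + H_prev @ U_r + b_r)
    Z_t = sigmoid(X_t @ W_z + H_prev @ U_z + b_z)
    return R_t, Z_t

rng = np.random.default_rng(0)
n, d, h = 2, 3, 4                       # examples, inputs, hidden units
X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_r, W_z = rng.normal(size=(d, h)), rng.normal(size=(d, h))
U_r, U_z = rng.normal(size=(h, h)), rng.normal(size=(h, h))
b_r, b_z = np.zeros((1, h)), np.zeros((1, h))   # broadcast over the n rows

R_t, Z_t = gru_gates(X_t, H_prev, W_r, U_r, b_r, W_z, U_z, b_z)
print(R_t.shape, Z_t.shape)
```

The biases of shape $1 \times h$ are broadcast across the $n$ examples of the minibatch, which is why both gates end up with shape $n \times h$.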

### Candidate Hidden State

Next, we combine the <b>reset gate</b> $R_t$ with the standard updating mechanism, resulting in a <b>candidate hidden state</b> $\tilde{H}_t$ at time step $t$. Here we use a [tanh activation <img src="tanh.svg" width="25"/> function](https://fedmug.github.io/kbtu-ml-book/mlp/activations.html):

\begin{align}
\tilde{H}_t &= \tanh(X_t W_{h} + (R_t \odot H_{t-1}) U_{h} + b_h)
\end{align}

where $W_{h} \in \mathbb{R}^{d \times h}$ and $U_{h} \in \mathbb{R}^{h \times h}$ are weight parameters, $b_h \in \mathbb{R}^{1 \times h}$ is the bias, and $\odot$ is the [Hadamard (elementwise) product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) operator.

<center><img src="Candidate hidden state.svg" alt="Candidate hidden state" width="500"/></center>

If the values in the reset gate $R_t$ are close to 1, the model acts like a regular [Recurrent Neural Network (RNN)](https://fedmug.github.io/kbtu-ml-book/rnn/vanilla_rnn.html). Conversely, when the values in $R_t$ are close to 0, the previous hidden state is ignored and the candidate hidden state is the output of a [Multi-Layer Perceptron (MLP)](https://fedmug.github.io/kbtu-ml-book/mlp/mlp.html) applied to the current input $X_t$.
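
The two limiting cases of the reset gate can be checked numerically. The sketch below assumes a hypothetical helper `candidate_state` that implements the candidate equation above, and forces $R_t$ to all ones or all zeros by hand instead of computing it from a sigmoid layer.

```python
import numpy as np

def candidate_state(X_t, H_prev, R_t, W_h, U_h, b_h):
    """Candidate hidden state: tanh(X_t W_h + (R_t * H_prev) U_h + b_h)."""
    return np.tanh(X_t @ W_h + (R_t * H_prev) @ U_h + b_h)

rng = np.random.default_rng(0)
n, d, h = 2, 3, 4
X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_h = rng.normal(size=(d, h))
U_h = rng.normal(size=(h, h))
b_h = np.zeros((1, h))

# R_t = 1 everywhere: identical to a vanilla RNN step.
rnn_like = candidate_state(X_t, H_prev, np.ones((n, h)), W_h, U_h, b_h)
vanilla_rnn = np.tanh(X_t @ W_h + H_prev @ U_h + b_h)

# R_t = 0 everywhere: the previous hidden state has no influence.
mlp_like = candidate_state(X_t, H_prev, np.zeros((n, h)), W_h, U_h, b_h)
single_layer = np.tanh(X_t @ W_h + b_h)

print(np.allclose(rnn_like, vanilla_rnn), np.allclose(mlp_like, single_layer))
```

In practice the reset gate takes values strictly between 0 and 1, so the candidate state interpolates between these two behaviors separately for each hidden unit.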

### Hidden State

Finally, we need to incorporate the effect of the update gate $Z_t$. This determines how much the new hidden state $H_t$ matches the old state $H_{t-1}$ compared to how much it resembles the new candidate state $\tilde{H}_t$. The update gate $Z_t$ can be used for this purpose by taking elementwise convex combinations of $H_{t-1}$ and $\tilde{H}_t$. This leads to the final update equation for the GRU.

\begin{align}
H_t &= (1-Z_t) \odot H_{t-1} + Z_t \odot  \tilde{H}_t 
\end{align}

<center><img src="hidden state.svg" alt="hidden state" width="500"/></center>

When $Z_t$ is close to 0, we keep the old state $H_{t-1}$, ignoring the information from the current input $X_t$ and effectively skipping the current time step in the dependency chain.

On the other hand, when $Z_t$ is close to 1, the new hidden state $H_t$ approaches the candidate hidden state $\tilde{H}_t$.
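
Putting the pieces together, a single GRU time step can be sketched in NumPy as follows. This is an illustrative sketch that follows the update equation $H_t = (1-Z_t) \odot H_{t-1} + Z_t \odot \tilde{H}_t$ exactly as written above; the names `init_params` and `gru_step` and the parameter dictionary are assumptions for this example. Some references and libraries swap the roles of $Z_t$ and $1 - Z_t$, so the gate convention should be checked before comparing results.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d, h, rng):
    """Hypothetical helper: small random weights and zero biases for the three layers."""
    p = {}
    for g in ("r", "z", "h"):
        p["W_" + g] = rng.normal(scale=0.1, size=(d, h))
        p["U_" + g] = rng.normal(scale=0.1, size=(h, h))
        p["b_" + g] = np.zeros((1, h))
    return p

def gru_step(X_t, H_prev, p):
    """One GRU time step; returns the new hidden state and the candidate state."""
    R_t = sigmoid(X_t @ p["W_r"] + H_prev @ p["U_r"] + p["b_r"])              # reset gate
    Z_t = sigmoid(X_t @ p["W_z"] + H_prev @ p["U_z"] + p["b_z"])              # update gate
    H_tilde = np.tanh(X_t @ p["W_h"] + (R_t * H_prev) @ p["U_h"] + p["b_h"])  # candidate
    H_t = (1.0 - Z_t) * H_prev + Z_t * H_tilde                                # convex combination
    return H_t, H_tilde

rng = np.random.default_rng(0)
n, d, h, T = 2, 3, 4, 6
params = init_params(d, h, rng)

H = np.zeros((n, h))
for X_t in rng.normal(size=(T, n, d)):   # iterate over T time steps
    H, _ = gru_step(X_t, H, params)
print(H.shape)
```

Because $H_t$ is an elementwise convex combination, every entry of the new hidden state lies between the corresponding entries of $H_{t-1}$ and $\tilde{H}_t$.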

> GRUs possess two distinctive characteristics:
>
> - Reset gates are employed to capture short-term dependencies within sequences.
> 
> - Update gates are utilized to capture long-term dependencies within sequences.


# References
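1. Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. https://arxiv.org/pdf/1412.3555v1.pdf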