```{contents}
```

## Variants

RNNs process **sequential data** by maintaining a **hidden state** (memory) that carries information across time steps.
RNN architectures are commonly grouped by the **sequence relationship** between input and output: one-to-one, one-to-many, many-to-one, and many-to-many.
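
The sequence relationships named above differ mainly in which inputs are fed in and which outputs are kept. The sketch below illustrates this with one shared recurrent step; the sizes, the weight names, and the zero-input unrolling used for `one_to_many` are illustrative assumptions (real captioning models typically feed the previous output back in instead).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_in, n_out = 4, 3, 2   # hypothetical sizes
W_x = rng.normal(scale=0.5, size=(hidden, n_in))
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_y = rng.normal(scale=0.5, size=(n_out, hidden))

def step(x, h):
    """One recurrent update followed by a linear readout."""
    h = np.tanh(W_x @ x + W_h @ h)
    return h, W_y @ h

def many_to_many(xs):
    """Emit one output per input step."""
    h, ys = np.zeros(hidden), []
    for x in xs:
        h, y = step(x, h)
        ys.append(y)
    return np.stack(ys)

def many_to_one(xs):
    """Read the whole sequence, keep only the final output."""
    return many_to_many(xs)[-1]

def one_to_many(x, steps):
    """Feed a single input once, then keep unrolling on zero inputs."""
    h, ys = np.zeros(hidden), []
    for t in range(steps):
        inp = x if t == 0 else np.zeros(n_in)
        h, y = step(inp, h)
        ys.append(y)
    return np.stack(ys)

xs = rng.normal(size=(5, n_in))
print(many_to_many(xs).shape)        # (5, 2)
print(many_to_one(xs).shape)         # (2,)
print(one_to_many(xs[0], 3).shape)   # (3, 2)
```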

---

### Major Types of RNN Architectures

| **Type**                        | **Input–Output Relationship**    | **Example Use Case**                    |
| ------------------------------- | -------------------------------- | --------------------------------------- |
| **1️⃣ One-to-One (Vanilla NN)** | Single input → single output     | Image classification                    |
| **2️⃣ One-to-Many**             | Single input → sequence output   | Image captioning                        |
| **3️⃣ Many-to-One**             | Sequence input → single output   | Sentiment analysis                      |
| **4️⃣ Many-to-Many**            | Sequence input → sequence output | Machine translation, speech recognition |

---

### Structural Types of RNNs

#### **A. Simple (Vanilla) RNN**

**Definition:**
The basic recurrent network where each output depends on the current input and the previous hidden state.

**Equations:**

$$
h_t = f(W_x x_t + W_h h_{t-1} + b_h)
$$

$$
y_t = g(W_y h_t + b_y)
$$

**Use:** Short-term sequential patterns.
**Limitation:** Suffers from *vanishing/exploding gradients* for long sequences.
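
A minimal NumPy sketch of the two equations above, assuming `f = tanh` and `g = identity` (the text leaves both unspecified). It also tracks the Jacobian of the hidden state with respect to the initial state, to illustrate how gradients can vanish over many steps. The sizes and the rescaling of `W_h` to a spectral norm of 0.5 are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 8, 2, 30   # hypothetical sizes

W_x = rng.normal(scale=0.3, size=(n_hid, n_in))
W_h = rng.normal(size=(n_hid, n_hid))
W_h *= 0.5 / np.linalg.norm(W_h, 2)   # rescale so the largest singular value is 0.5
W_y = rng.normal(scale=0.3, size=(n_out, n_hid))
b_h, b_y = np.zeros(n_hid), np.zeros(n_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)   # f = tanh (assumed)
    y_t = W_y @ h_t + b_y                           # g = identity (assumed)
    return h_t, y_t

xs = rng.normal(size=(T, n_in))
h = np.zeros(n_hid)
jac = np.eye(n_hid)   # accumulates d h_t / d h_0
norms = []
for x_t in xs:
    h, y = rnn_step(x_t, h)
    # For a tanh unit, d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_h
    jac = (np.diag(1.0 - h**2) @ W_h) @ jac
    norms.append(np.linalg.norm(jac, 2))

print(f"step 1: {norms[0]:.2e}, step {T}: {norms[-1]:.2e}")
```

Because each per-step Jacobian here has norm at most 0.5, the product shrinks at least geometrically with sequence length. With a recurrent matrix whose norm exceeds 1, the same product can grow instead, which is the exploding case.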

---

#### **B. Long Short-Term Memory (LSTM)**

**Definition:**
An improved RNN designed to handle **long-term dependencies** using **gates** to control information flow.

**Key Gates:**

* **Forget gate:** Decides what to discard.
* **Input gate:** Decides what new info to store.
* **Output gate:** Decides what to output.

**Equations (simplified):**

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate)} \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t && \text{(cell state)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t * \tanh(C_t)
\end{aligned}
$$

**Use:** Long text, speech, and time-series tasks.
**Example:** Chatbots, stock prediction.
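
The gate equations translate almost line for line into NumPy. The following is a sketch rather than a reference implementation: the parameter dictionary `p`, its key names, and the sizes are assumptions made for illustration, and `*` is element-wise multiplication as in the equations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step; p maps names to weights shaped (n_hid, n_hid + n_in)."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    C_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate
    C_t = f_t * C_prev + i_t * C_tilde           # cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5   # hypothetical sizes
p = {}
for name in "fico":  # forget, input, candidate, output
    p[f"W_{name}"] = rng.normal(scale=0.1, size=(n_hid, n_hid + n_in))
    p[f"b_{name}"] = np.zeros(n_hid)

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):
    h, C = lstm_step(x_t, h, C, p)
print(h.shape, C.shape)   # (5,) (5,)
```

When the forget gate saturates near 1 and the input gate near 0, `C_t` is copied forward almost unchanged, which is how the cell state can carry information across many steps.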

---


#### **C. Gated Recurrent Unit (GRU)**

**Definition:**
A simplified version of LSTM with fewer gates: it combines the forget and input gates into a single **update gate** and merges the cell state into the hidden state.

**Equations:**

$$
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t]) && \text{(update gate)} \\
r_t &= \sigma(W_r [h_{t-1}, x_t]) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh(W_h [r_t * h_{t-1}, x_t]) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
$$

**Advantages:**

* Fewer parameters than an LSTM of the same hidden size, so training is usually faster.
* Often comparable accuracy in practice.

**Use:** Any sequential data where efficiency is key.
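
A sketch of one GRU step that follows the equations above, including their omission of bias terms. The sizes and random weights are placeholders chosen for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; each weight matrix is shaped (n_hid, n_hid + n_in)."""
    hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx)                  # update gate
    r_t = sigmoid(W_r @ hx)                  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5   # hypothetical sizes
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(3))

h = np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)   # (5,)
```

The update gate interpolates between keeping the old state (`z_t` near 0) and replacing it with the candidate (`z_t` near 1), so a single gate does the work of the LSTM's separate forget and input gates.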

---

#### **D. Bidirectional RNN (BiRNN)**

**Definition:**
Processes the sequence **in both directions** (forward and backward) to capture **past and future context** simultaneously.

**Mechanism:**

$$
h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]
$$

**Use:**

* Speech recognition
* Text classification
* Named Entity Recognition

**Limitation:**
Unsuitable for streaming (real-time) prediction, because the backward pass cannot start until the full sequence is available.
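
The concatenation can be sketched with two independent tanh RNNs, one of which reads the sequence reversed. The helper names, sizes, and weights below are illustrative assumptions; any recurrent cell could take the place of the simple one used here.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Return the hidden state at every step of a tanh RNN."""
    h, hs = np.zeros(W_h.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        hs.append(h)
    return np.stack(hs)

def birnn(xs, fwd, bwd):
    h_fwd = run_rnn(xs, *fwd)               # reads x_1 ... x_T
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]   # reads x_T ... x_1, then flipped back to time order
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
n_in, n_hid, T = 3, 4, 5   # hypothetical sizes

def init_params():
    return (rng.normal(scale=0.5, size=(n_hid, n_in)),
            rng.normal(scale=0.5, size=(n_hid, n_hid)))

fwd, bwd = init_params(), init_params()
xs = rng.normal(size=(T, n_in))
out = birnn(xs, fwd, bwd)
print(out.shape)   # (5, 8)

# Perturb only the last input and compare the representation of the first step.
xs_mod = xs.copy()
xs_mod[-1] += 1.0
out_mod = birnn(xs_mod, fwd, bwd)
print(np.abs(out[0, :n_hid] - out_mod[0, :n_hid]).max())   # forward half: 0.0
print(np.abs(out[0, n_hid:] - out_mod[0, n_hid:]).max())   # backward half: changes
```

The forward half at the first step cannot depend on later inputs, while the backward half does. That dependence on the end of the sequence is also why the whole input must be available before any output is final.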

---

#### **E. Deep (Stacked) RNN**

**Definition:**
Multiple RNN layers stacked vertically, allowing hierarchical feature extraction.

**Mechanism:**
The hidden state from one layer becomes the input to the next:

$$
h_t^{(l)} = f(W^{(l)}_x h_t^{(l-1)} + W^{(l)}_h h_{t-1}^{(l)} + b^{(l)})
$$

**Use:** Complex language or sequence modeling (deep context understanding).
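
A sketch of the stacking rule, assuming `f = tanh` and treating the raw input as layer 0. The layer sizes and the `(W_x, W_h, b)` tuple layout are hypothetical choices made for this example.

```python
import numpy as np

def stacked_rnn(xs, layers):
    """layers is a list of (W_x, W_h, b) tuples, bottom layer first."""
    hs = [np.zeros(W_h.shape[0]) for _, W_h, _ in layers]
    top = []
    for x_t in xs:
        inp = x_t                      # h_t^{(0)} is the raw input
        for k, (W_x, W_h, b) in enumerate(layers):
            hs[k] = np.tanh(W_x @ inp + W_h @ hs[k] + b)
            inp = hs[k]                # feeds layer k + 1 at the same time step
        top.append(inp)
    return np.stack(top)               # hidden states of the top layer

rng = np.random.default_rng(0)
sizes = [3, 6, 4]   # input size, then hidden size of each layer (hypothetical)
layers = [(rng.normal(scale=0.3, size=(n_out, n_prev)),
           rng.normal(scale=0.3, size=(n_out, n_out)),
           np.zeros(n_out))
          for n_prev, n_out in zip(sizes[:-1], sizes[1:])]

top = stacked_rnn(rng.normal(size=(7, sizes[0])), layers)
print(top.shape)   # (7, 4)
```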

---

#### **F. Echo State Networks (ESN)**

**Definition:**
A special RNN where only the **output weights** are trained; the input and recurrent weights are fixed random values forming a **reservoir**.
Because training reduces to fitting a linear readout, it is cheap for certain time-series problems.

**Use:** Signal processing, dynamical system modeling.
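
A toy sketch of the idea: a fixed random reservoir is driven by a sine wave, and only a linear readout is fitted. The reservoir size, the spectral-radius rescaling to 0.9, the washout length, and the ridge penalty are assumed hyperparameters picked for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, ridge = 100, 50, 1e-6   # hypothetical hyperparameters

# Fixed random reservoir, rescaled to spectral radius 0.9 (a common heuristic).
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def reservoir_states(u):
    """Drive the reservoir with a scalar input sequence; no weights are updated."""
    x, states = np.zeros(n_res), []
    for u_t in u:
        x = np.tanh(W_in * u_t + W @ x)
        states.append(x)
    return np.stack(states)

# Toy task: predict the next sample of a sine wave.
t = np.arange(600) * 0.1
u, target = np.sin(t[:-1]), np.sin(t[1:])
X = reservoir_states(u)[washout:]   # discard the initial transient
Y = target[washout:]

# Only the readout is trained, here by ridge regression in closed form.
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
mse = np.mean((X @ W_out - Y) ** 2)
print(f"training MSE: {mse:.2e}")
```

No gradients flow through the recurrence, so fitting costs a single linear solve.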

---

#### **G. Attention-based RNN (Seq2Seq + Attention)**

**Definition:**
Enhances many-to-many RNNs by learning **which parts of the input sequence to focus on** for each output step.

**Use:**
Machine translation, text summarization, question answering.

**Note:**
The concept of attention led directly to **Transformers**, which have largely replaced RNNs in modern NLP.
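
A sketch of a single attention step over encoder hidden states. Dot-product scoring is assumed here for brevity; it is one of several scoring functions in use (additive scoring is another), and the toy one-hot-like states are made up so the result is easy to read.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its similarity to the decoder state."""
    scores = encoder_states @ decoder_state   # one score per input step
    weights = softmax(scores)                 # non-negative, sums to 1
    context = weights @ encoder_states        # weighted average of encoder states
    return context, weights

# Toy data: four input steps with mutually orthogonal hidden states.
encoder_states = 5.0 * np.eye(4)
decoder_state = encoder_states[2]             # a query that matches input step 2

context, weights = attend(decoder_state, encoder_states)
print(weights.argmax())    # 2
print(context.shape)       # (4,)
```

In a full Seq2Seq model this step is repeated for every output position, and `context` is combined with the decoder state before the next prediction.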

---

**Summary Table**

| Type              | Handles Long-Term Context | Parallelizable | Speed    | Key Use Case                   |
| ----------------- | ------------------------- | -------------- | -------- | ------------------------------ |
| Simple RNN        | ❌                         | ❌              | Fast     | Short text, time series        |
| LSTM              | ✅                         | ❌              | Moderate | Language, speech               |
| GRU               | ✅                         | ❌              | Faster   | Text, stock data               |
| Bidirectional RNN | Depends on cell type      | ❌              | Slow     | Sentiment, NER                 |
| Deep RNN          | Depends on cell type      | ❌              | Slow     | Hierarchical sequence features |
| Attention RNN     | ✅✅                        | ⚠️             | Moderate | Translation, summarization     |

---

**In Short**

RNN types evolve to address **memory, speed, and dependency length** issues:

* **Simple RNN:** Basic memory.
* **LSTM:** Adds gates for long-term memory.
* **GRU:** Simplified LSTM, faster.
* **BiRNN:** Looks both ways in the sequence.
* **Deep/Stacked RNN:** Adds multiple layers.
* **Attention RNN:** Learns to focus selectively.
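
Part of the "simplified" and "faster" distinction comes down to parameter count. As a rough sketch, assume each gate or candidate is one weight set mapping `[h_{t-1}, x_t]` to the hidden size, plus one bias vector (the bias convention and the sizes below are assumptions; libraries differ in how they count biases):

```python
def recurrent_params(n_in, n_hid, weight_sets):
    """Parameters for `weight_sets` maps from [h_{t-1}, x_t] to n_hid units, each with a bias."""
    return weight_sets * (n_hid * (n_hid + n_in) + n_hid)

n_in, n_hid = 128, 256   # hypothetical sizes
counts = {
    "Vanilla RNN": recurrent_params(n_in, n_hid, 1),   # one hidden update
    "GRU": recurrent_params(n_in, n_hid, 3),           # update, reset, candidate
    "LSTM": recurrent_params(n_in, n_hid, 4),          # forget, input, output, candidate
}
counts["BiLSTM"] = 2 * counts["LSTM"]                  # one LSTM per direction

for name, n in counts.items():
    print(f"{name:12s} {n:>9,d}")
# Vanilla RNN     98,560
# GRU            295,680
# LSTM           394,240
# BiLSTM         788,480
```

Parameter count is only a proxy for speed; actual training time also depends on sequence length, hardware, and implementation.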


| **Feature**              | **Vanilla RNN**                              | **LSTM (Long Short-Term Memory)**                    | **GRU (Gated Recurrent Unit)**                     | **BiRNN / BiLSTM / BiGRU**                       | **Attention-based RNN**                            | **Transformer**                             |
| ------------------------ | -------------------------------------------- | ---------------------------------------------------- | -------------------------------------------------- | ------------------------------------------------ | -------------------------------------------------- | ------------------------------------------- |
| **Core Idea**            | Single hidden state passes info through time | Adds **cell state** + **gates** for long-term memory | Combines forget & input gates into **update gate** | Processes sequence **both forward and backward** | Adds **attention** to focus on relevant past steps | Removes recurrence, uses **self-attention** |
| **Memory Type**          | Short-term only                              | Long + short-term via cell                           | Long + short-term (simplified)                     | Both directions of sequence                      | Long-term via attention weights                    | Global context via attention                |
| **Number of Gates**      | None                                         | 3 (input, forget, output)                            | 2 (update, reset)                                  | Same as chosen cell type                         | Same as chosen cell type                           | None (uses attention heads)                 |
| **Gradient Stability**   | Poor (vanishing/exploding gradients)         | Stable                                               | Stable (comparable to LSTM)                        | Same as chosen cell type                         | Stable (improved by attention)                     | Stable (no recurrence)                      |
| **Computation Speed**    | Fast (few parameters)                        | Slowest (4 weight sets per step)                     | Faster than LSTM                                   | Slower (two passes)                              | Slower (extra attention computations)              | Fast (parallelizable)                       |
| **Model Complexity**     | Simple                                       | High                                                 | Moderate                                           | High (double direction)                          | High                                               | Very High                                   |
| **Training Parallelism** | Low (sequential)                             | Low                                                  | Low                                                | Low                                              | Low                                                | High (full parallel)                        |
| **Best For**             | Short sequences                              | Long sequences                                       | Mid-to-long sequences                              | Context-rich tasks (e.g. translation, speech)    | Context-sensitive sequential data                  | NLP, vision, time series (modern standard)  |
| **Memory Control**       | None                                         | Explicit (via gates)                                 | Simplified (fewer gates)                           | Depends on direction and gating                  | Gating + attention weights                         | Attention weights only                      |
| **Parameter Count**      | Lowest                                       | Highest                                              | Lower than LSTM                                    | Double due to both directions                    | Depends on architecture                            | Very high (multi-heads)                     |
| **Interpretability**     | Low                                          | Moderate                                             | Moderate                                           | Moderate                                         | Better (visualizable attention)                    | Better (attention maps can be visualized)   |
| **Applications**         | Basic sequence prediction                    | Text, speech, translation                            | Similar to LSTM but faster                         | NLP, speech recognition                          | Sequence tasks needing focus                       | NLP, CV, time-series forecasting            |



```{dropdown} Click here for Sections
```{tableofcontents}