
## Intuition

A **Recurrent Neural Network** is designed to learn **sequences** — data where the **order matters**.

Think of reading a sentence:

* Each new word’s meaning depends on the words before it.
* You don’t forget the whole sentence when you read the next word — you **carry context forward**.

That’s exactly what RNNs do:

* They take **one input at a time**.
* They **store context** in a hidden state.
* They use that hidden state to influence the next prediction.

So, an RNN acts like a **neural network with memory**: its hidden state carries a running summary of what it has seen so far.

---

### Theoretical View

#### Sequence Modeling Problem

We are given a sequence of inputs:
$$
x_1, x_2, x_3, ..., x_T
$$
and we want to predict either:

* An output at each time step: $y_1, y_2, ..., y_T$
* Or one final output after all steps (e.g., sentiment score).

We need a function that can model dependencies between inputs **across time**.

Ordinary feedforward networks (ANNs) map each input to an output independently of any earlier input:
$$
y = f(x)
$$
RNNs extend this to:
$$
y_t = f(x_t, x_{t-1}, \dots, x_1)
$$
using a **hidden state** to summarize the past.

---

### Mathematical Formulation

At each time step $t$:

#### (a) Hidden State Update

$$
h_t = f(W_x x_t + W_h h_{t-1} + b_h)
$$

#### (b) Output

$$
y_t = g(W_y h_t + b_y)
$$

Where:

| Symbol       | Meaning                               |
| ------------ | ------------------------------------- |
| $x_t$      | Input at time $t$                     |
| $h_t$      | Hidden state (memory) at time $t$     |
| $y_t$      | Output at time $t$                    |
| $W_x$      | Input → hidden weights                |
| $W_h$      | Hidden → hidden weights (recurrent)   |
| $W_y$      | Hidden → output weights               |
| $f$        | Activation (tanh / ReLU)              |
| $g$        | Output activation (sigmoid / softmax) |
| $b_h, b_y$ | Bias terms                            |
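
The two equations above can be sketched as a single function. This is a minimal NumPy illustration, not a reference implementation: it assumes $f = \tanh$ and $g = \text{sigmoid}$, and the sizes (2 inputs, 3 hidden units, 1 output) and the name `rnn_step` are chosen here purely for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b_h, W_y, b_y):
    """One RNN time step: returns (h_t, y_t)."""
    # Hidden state update: h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)
    # Output: y_t = sigmoid(W_y h_t + b_y)
    y_t = 1.0 / (1.0 + np.exp(-(W_y @ h_t + b_y)))
    return h_t, y_t

# Hypothetical sizes: 2 input features, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(3, 2))   # input -> hidden
W_h = rng.normal(scale=0.5, size=(3, 3))   # hidden -> hidden (recurrent)
W_y = rng.normal(scale=0.5, size=(1, 3))   # hidden -> output
b_h = np.zeros(3)
b_y = np.zeros(1)

h_t, y_t = rnn_step(np.array([1.0, -1.0]), np.zeros(3), W_x, W_h, b_h, W_y, b_y)
print(h_t.shape, y_t.shape)  # (3,) (1,)
```

Calling `rnn_step` repeatedly, feeding each returned `h_t` back in as `h_prev`, is all the recurrence amounts to.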

---

### Why Recurrence Works

The equation
$$
h_t = f(W_x x_t + W_h h_{t-1} + b_h)
$$
creates a **recurrence relation**:

* Each hidden state $h_t$ depends on the previous one $h_{t-1}$.
* This “connects time steps together.”

When you unfold the recurrence:
$$
h_t = f(W_x x_t + W_h f(W_x x_{t-1} + W_h h_{t-2} + b_h) + b_h)
$$
You can see that $h_t$ becomes a **function of all past inputs** $x_t, x_{t-1}, ..., x_1$.

Thus, RNNs can remember the history of inputs.
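
To make the unfolding concrete, here is a small sketch with a scalar RNN (one input, one hidden unit). The weights `w_x = 0.8`, `w_h = 0.5` and the input values are arbitrary illustrative choices, and $h_0 = 0$ is assumed. It checks that the loop and the hand-unrolled nested expression agree, and that changing only the first input changes the final state.

```python
import math

def final_hidden(xs, w_x=0.8, w_h=0.5, b_h=0.0):
    """Run h_t = tanh(w_x*x_t + w_h*h_{t-1} + b_h) from h_0 = 0; return h_T."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h + b_h)
    return h

xs = [1.0, 0.2, -0.3]
h_a = final_hidden(xs)

# The same computation written as one nested (unrolled) expression.
unrolled = math.tanh(0.8 * xs[2] + 0.5 * math.tanh(0.8 * xs[1] + 0.5 * math.tanh(0.8 * xs[0])))

# Only the first input differs, yet the final hidden state moves.
h_b = final_hidden([-1.0, 0.2, -0.3])
print(h_a, unrolled, h_b)
```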

---

### Geometric / Functional Intuition

* $W_x x_t$ → captures **current input information**.
* $W_h h_{t-1}$ → brings **past memory** forward.
* Activation $f(\cdot)$ → squashes the combined signal into a bounded range, mixing past and present nonlinearly.
* This recurrence lets the network **learn patterns over time**, like “not good” → negative, “very good” → positive.

---

## 6. Forward Pass (Step-by-Step)

At $t = 1$ (assuming the initial state $h_0 = 0$, so the recurrent term drops out):

$$
h_1 = f(W_x x_1 + b_h)
$$

At $t = 2$:
$$
h_2 = f(W_x x_2 + W_h h_1 + b_h)
$$

At $t = 3$:
$$
h_3 = f(W_x x_3 + W_h h_2 + b_h)
$$


Final output (for classification, say sentiment):
$$
\hat{y} = \sigma(W_y h_T + b_y)
$$
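
The step-by-step pass above can be written as a loop. The following is a rough sketch under stated assumptions: $f = \tanh$, $h_0 = 0$, random untrained weights, and a hypothetical sequence of $T = 3$ inputs with 4 features each. The function name `rnn_forward` is illustrative only.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b_h, W_y, b_y):
    """Run the recurrence over xs (shape T x input_dim).

    Returns all hidden states (T x hidden_dim) and the final prediction y_hat.
    """
    h = np.zeros(W_h.shape[0])            # assumed initial state h_0 = 0
    hs = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b_h)
        hs.append(h)
    logit = W_y @ h + b_y                 # uses only the last state h_T
    y_hat = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
    return np.stack(hs), y_hat

rng = np.random.default_rng(1)
T, input_dim, hidden_dim = 3, 4, 3        # hypothetical sizes
xs = rng.normal(size=(T, input_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_y = rng.normal(scale=0.5, size=(1, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(1)

hs, y_hat = rnn_forward(xs, W_x, W_h, b_h, W_y, b_y)
print(hs.shape, y_hat.shape)  # (3, 3) (1,)
```

Note that the same `W_x`, `W_h`, and `b_h` are reused at every step; only `h` changes as the loop advances.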

---

## 7. Backward Pass — Learning

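To train the RNN, we minimize a loss summed over time steps:
$$
L = \sum_t \ell(y_t, \hat{y}_t)
$$

and compute gradients by **Backpropagation Through Time (BPTT)**, which unrolls the RNN across all time steps and applies the chain rule backward through each one. Because the same weights are shared across steps, the gradient for each weight is the sum of its contributions from every time step.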
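
As a rough illustration of BPTT, the sketch below works on a scalar RNN so every quantity is a plain number. It makes several simplifying assumptions that are not part of the text above: the hidden state itself is treated as the prediction, the per-step loss is squared error, $h_0 = 0$, and the data and weights are made up. The hand-derived gradient for the recurrent weight is compared against a finite-difference estimate.

```python
import math

def forward(xs, w_x, w_h, b):
    """Return [h_0, h_1, ..., h_T] for a scalar tanh RNN with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1] + b))
    return hs

def loss(xs, ys, w_x, w_h, b):
    """Sum over time of the squared error between h_t and the target y_t."""
    hs = forward(xs, w_x, w_h, b)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt(xs, ys, w_x, w_h, b):
    """Gradients of the loss w.r.t. (w_x, w_h, b), walking backward in time."""
    hs = forward(xs, w_x, w_h, b)
    g_wx = g_wh = g_b = 0.0
    dh_next = 0.0                          # gradient arriving from step t+1
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - ys[t - 1]) + dh_next # local loss term + future steps
        da = dh * (1.0 - hs[t] ** 2)       # through tanh: f'(a) = 1 - tanh(a)^2
        g_wx += da * xs[t - 1]             # shared weights: contributions add up
        g_wh += da * hs[t - 1]
        g_b += da
        dh_next = da * w_h                 # pass gradient back to h_{t-1}
    return g_wx, g_wh, g_b

xs = [0.5, -1.0, 0.3, 0.8]                 # made-up inputs
ys = [0.2, -0.4, 0.1, 0.6]                 # made-up targets
w_x, w_h, b = 0.7, 0.9, 0.1

g_wx, g_wh, g_b = bptt(xs, ys, w_x, w_h, b)
eps = 1e-6
numeric = (loss(xs, ys, w_x, w_h + eps, b) - loss(xs, ys, w_x, w_h - eps, b)) / (2 * eps)
print(g_wh, numeric)                       # the two values should agree closely
```

The line `dh_next = da * w_h` is where the repeated multiplication by the recurrent weight enters the backward pass.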

---

## 8. Theoretical Challenges

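Because $h_t$ depends on many previous hidden states, gradients flow through long chains of matrix multiplications during backpropagation. Writing $a_t = W_x x_t + W_h h_{t-1} + b_h$ for the pre-activation, the sensitivity of a late state $h_T$ to an early state $h_k$ is a product of per-step Jacobians (ordered from $t = T$ down to $t = k+1$):

$$
\frac{\partial h_T}{\partial h_k} = \prod_{t=k+1}^{T} \operatorname{diag}\big(f'(a_t)\big)\, W_h
$$

and $\partial L / \partial W_h$ is a sum of terms that contain such products.

Roughly, if the largest singular value of $W_h$ (scaled by the bound on $f'$) is:

* \< 1 → gradients **vanish** (shrink toward zero exponentially with the distance $T - k$)
* \> 1 → gradients can **explode** (grow exponentially with the distance)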

This explains the **vanishing/exploding gradient problem**, which limits how far back RNNs can “remember.”
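
A quick numerical sketch of this effect, under a deliberate simplification: the activation derivative $f'$ is ignored (treated as 1), and $W_h$ is taken to be a scaled random orthogonal matrix so that its singular values all equal the scale. With that assumption, the norm of the 50-step product is just the scale raised to the 50th power.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix

def product_norm(scale, steps):
    """Spectral norm of (scale * Q)^steps, a stand-in for the Jacobian product."""
    W_h = scale * Q
    return np.linalg.norm(np.linalg.matrix_power(W_h, steps), 2)

for scale in (0.5, 1.0, 1.5):
    # Below 1 the norm collapses toward 0; above 1 it blows up.
    print(scale, product_norm(scale, 50))
```

For tanh the derivative is at most 1, so including it would only shrink the product further, which is why vanishing is the more common failure in practice.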

---

## 9. Long-Term Memory Fixes

To overcome that:

* **LSTM (Long Short-Term Memory)** introduces gates (forget, input, output) to control memory flow.
* **GRU (Gated Recurrent Unit)** simplifies LSTM with fewer gates.

Both stabilize gradient flow and allow learning of longer dependencies.
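
For illustration, one common formulation of a single LSTM step is sketched below. The stacked parameter layout (`W`, `U`, `b` holding the forget, input, output, and candidate blocks in that order) is a convention chosen here; libraries differ in ordering and naming. The sizes are hypothetical and the weights are random and untrained.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; returns the new hidden state h_t and cell state c_t."""
    n = h_prev.shape[0]
    a = W @ x_t + U @ h_prev + b           # shape (4n,): all four blocks at once
    f = sigmoid(a[0:n])                    # forget gate: how much of c_prev to keep
    i = sigmoid(a[n:2 * n])                # input gate: how much new content to write
    o = sigmoid(a[2 * n:3 * n])            # output gate: how much of the cell to expose
    g = np.tanh(a[3 * n:4 * n])            # candidate cell content
    c_t = f * c_prev + i * g               # additive update of the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, n = 5, 3                        # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * n, input_dim))
U = rng.normal(scale=0.1, size=(4 * n, n))
b = np.zeros(4 * n)

h_t, c_t = lstm_step(rng.normal(size=input_dim), np.zeros(n), np.zeros(n), W, U, b)
print(h_t.shape, c_t.shape)  # (3,) (3,)
```

The additive update `c_t = f * c_prev + i * g` is the key difference from the plain RNN: when `f` is close to 1, information and gradients can pass through the cell state without being repeatedly squashed.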

---

## 10. Practical Example (Numbers)

Suppose:

* $x_t \in \mathbb{R}^5$
* $h_t \in \mathbb{R}^3$

Then:

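* $W_x$: 3×5 (maps the 5-dimensional input to the 3-dimensional hidden state)
* $W_h$: 3×3
* $W_y$: 1×3 (maps the hidden state to a single output)
* $b_h \in \mathbb{R}^3$, $b_y \in \mathbb{R}$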

At each step:
$$
z_t = W_x x_t + W_h h_{t-1} + b_h
$$
$$
h_t = \tanh(z_t)
$$
$$
\hat{y}_t = \sigma(W_y h_t + b_y)
$$

You iterate this through the sequence. The RNN learns weights $W_x, W_h, W_y$ that best map the sequential inputs to outputs.
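
A short shape check for this example, as a sketch with random untrained weights. For the products $W_x x_t$ and $W_y h_t$ to be defined with column vectors, $W_x$ needs 3 rows and 5 columns and $W_y$ needs 1 row and 3 columns; the code below uses that layout and also counts the learnable parameters.

```python
import numpy as np

input_dim, hidden_dim, output_dim = 5, 3, 1
rng = np.random.default_rng(0)

W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))    # 3 x 5
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))   # 3 x 3
W_y = rng.normal(scale=0.1, size=(output_dim, hidden_dim))   # 1 x 3
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

x_t = rng.normal(size=input_dim)
h_prev = np.zeros(hidden_dim)

z_t = W_x @ x_t + W_h @ h_prev + b_h
h_t = np.tanh(z_t)
y_hat_t = 1.0 / (1.0 + np.exp(-(W_y @ h_t + b_y)))

# 15 + 9 + 3 + 3 + 1 learnable numbers in total.
n_params = sum(p.size for p in (W_x, W_h, W_y, b_h, b_y))
print(z_t.shape, h_t.shape, y_hat_t.shape, n_params)  # (3,) (3,) (1,) 31
```

The parameter count does not depend on the sequence length, since the same weights are reused at every step.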

---

## 11. Summary Table

| Concept                  | Role                                         |
| ------------------------ | -------------------------------------------- |
| Hidden state $h_t$       | Memory that carries information through time |
| Recurrent weight $W_h$   | Connects previous state to current           |
| Activation function      | Non-linear mixing of current and past info   |
| BPTT                     | Training method for temporal sequences       |
| Problem                  | Vanishing/exploding gradients                |
| Solution                 | LSTM / GRU / Attention                       |

---

**In short:**

* Theoretically, RNNs extend neural nets to functions over sequences:
  $h_t = f(W_x x_t + W_h h_{t-1} + b_h)$.
* Mathematically, they model **dynamical systems** with recurrent state updates.
* Intuitively, they act as a **short-term memory** that learns temporal patterns.

