# Long Short-Term Memory

As discussed earlier, LSTM aims to solve the structural gradient problem of RNN models so that the model can retain information over long sequences.

To understand why this is a structural issue, consider that the model receives a loss and updates its parameters accordingly. 

In a vanilla RNN, all information—what to remember, forget, or output—is controlled by a single hidden state and a single set of parameters. This means the model lacks the flexibility to independently decide which information to keep, discard, or expose at each time step. As a result, it struggles to preserve relevant context over long sequences.

LSTM addresses this by introducing separate gates, each with its own parameters. These gates allow the model to learn, through backpropagation, **how much to remember (forget gate), what new information to add (input gate), and what to output (output gate)**. This structural change gives the model the expressiveness needed to manage context and memory more effectively.

**LSTM aims to:**
- Keep important information
- Forget irrelevant information
- Control what and when to remember

## LSTM Architecture
<div align="center">

| <div align="center"><img src="../images/chap10/LSTMkey.png" width="725"/></div> |
|---|
| <div align="center"><img src="../images/chap10/LSTMarch.png" width="725"/></div> |
</div>



| **Component**         | **Symbol**      | **Description**                      |
|---------------------- |-----------------|--------------------------------------|
| Input                 | $x_t$           | Input at time $t$                    |
| Previous Context      | $C_{t-1}$       | Context from time $t-1$              |
| Previous Hidden State | $h_{t-1}$       | Internal state at time $t-1$         |
| Non-linear Function   | $\sigma$        | Sigmoid activation                   |
| Non-linear Function   | $\tanh$         | Hyperbolic tangent activation        |
| Output Context        | $C_t$           | Context from time $t$                |
| Output Hidden State   | $h_t$           | Internal state at time $t$           |

### Three internal Gates

We now zoom in on the LSTM structure and focus on the three gates that give it these properties.

**Forget Gate**

|**Gate Function** | **Visual** | **Parameters** | **Explanation** | 
|------------------|------------------|------------------|------------------|
|$$f_t = \sigma(W_f[h_{t-1}, x_t]^T + b_f)$$ <br> $$f_t \odot C_{t-1}$$ | <div align="center"><img src="../images/chap10/fGate.png" width="325"/></div> | $\{W_{fh}, W_{fx}, b_f\}$ | The forget gate outputs values in $(0, 1)$ (after the sigmoid). <br> Entries of $C_{t-1}$ multiplied by values close to 1 are remembered, and entries multiplied by values close to 0 are forgotten. | 

**Input Gate**

|**Gate Function** | **Visual** | **Parameters** | **Explanation** | 
|------------------|------------------|------------------|------------------|
| $$\begin{align*}i_t &= \sigma(W_i[h_{t-1}, x_t]^T + b_i) \\ \tilde{C_t} &= \tanh(W_c[h_{t-1}, x_t]^T + b_c) \\ & i_t \odot \tilde{C_t}\end{align*}$$ | <div align="center"><img src="../images/chap10/iGate.png" width="1200"/></div> | $\{W_{ih}, W_{ix}, b_i, W_{ch}, W_{cx}, b_c\}$ | $i_t$ lets the model choose how much of the new information to store: the sigmoid gives each entry a weight in $(0, 1)$. <br> $\tilde{C_t}$ is the candidate content, i.e. which aspects could be stored: the tanh keeps each entry in $(-1, 1)$, so a candidate can push a context entry up or down. The product $i_t \odot \tilde{C_t}$ is what is actually written to the context. |

We use the forget gate and the input gate to produce our new context.

$$C_t = (f_t \odot C_{t-1}) + (i_{t} \odot \tilde{C_{t}})$$

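This has two properties:
1. The additive update gives contextual information a longer shelf life under backpropagation through time (BPTT): the gradient reaches $C_{t-1}$ through an elementwise multiplication by $f_t$ rather than through repeated weight-matrix multiplications and activation derivatives.
2. The context has an intuitive reading: it is the amount of past memory that is kept plus the amount of new information that is stored.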
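
The first property can be illustrated with a small NumPy sketch. The gate values below are hand-picked constants, which is an assumption made only for illustration (in a trained LSTM they are computed from $h_{t-1}$ and $x_t$), and `update_cell` is a hypothetical helper name.

```python
import numpy as np

def update_cell(c_prev, f, i, c_tilde):
    """One context update: C_t = f_t * C_{t-1} + i_t * C~_t (elementwise)."""
    return f * c_prev + i * c_tilde

# Hand-picked values (assumed for illustration, not learned).
c0 = np.array([1.0, -0.5])         # initial context C_0
f = np.array([0.99, 0.99])         # forget gate close to 1: keep the past
i = np.array([0.0, 0.0])           # input gate closed: write nothing new
c_tilde = np.array([0.3, 0.3])     # candidate content (ignored because i = 0)

c = c0.copy()
for _ in range(50):
    c = update_cell(c, f, i, c_tilde)

# Fraction of the original context that survives 50 steps: 0.99**50 per entry.
retained = c / c0
```

The same product of forget-gate values scales the gradient that flows from $C_t$ back to $C_0$ along this path, so a forget gate near 1 keeps both the stored value and its gradient from shrinking quickly.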

**Output Gate**

|**Gate Function** | **Visual** | **Parameters** | **Explanation** | 
|------------------|------------------|------------------|------------------|
| $$\begin{align*}o_t &= \sigma(W_o[h_{t-1}, x_t]^T + b_o) \\ h_t &= o_t \odot \tanh(C_t)\end{align*}$$ | <div align="center"><img src="../images/chap10/oGate.png" width="800"/></div> | $\{W_{oh}, W_{ox}, b_o\}$ | This gate controls how much of $C_t$ is exposed in the new hidden state $h_t$, enabling context-dependent outputs at each step. |
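
Putting the three gates together, a single LSTM step can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function name `lstm_step`, the `params` dictionary layout, the sizes, and the random initialization are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    params is a dict (hypothetical layout) with W_f, W_i, W_o, W_c of shape
    (d_h, d_h + d_x) and b_f, b_i, b_o, b_c of shape (d_h,).
    """
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])         # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])         # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])         # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate content
    c = f * c_prev + i * c_tilde                           # new context C_t
    h = o * np.tanh(c)                                     # new hidden state h_t
    return h, c

# Small random example; sizes and seed are arbitrary.
rng = np.random.default_rng(0)
d_x, d_h = 3, 4
params = {}
for g in "fioc":
    params[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
    params[f"b_{g}"] = np.zeros(d_h)

h, c = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), params)
```

Because $h_t = o_t \odot \tanh(C_t)$, every entry of `h` lies strictly between -1 and 1, while `c` itself is not bounded in the same way.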

| **Pros** | **Cons** |
|----------|----------|
| 1. Mitigates the core RNN problem of vanishing gradients (exploding gradients can still occur and are usually handled with gradient clipping) <br> 2. Explicit, interpretable memory <br>  3. Learns when to forget (via gates) <br> 4. Works well on sequential data <br>  5. More stable training compared to vanilla RNN | 1. Sequential = slow: time steps must be processed one by one, which is hard to parallelize <br> 2. Still struggles with very long sequences: memory isn't infinite and long-range dependencies still degrade <br>  3. Complex and parameter heavy (four weight matrices and biases per layer) <br> 4. In many sequence tasks it has been superseded by Transformers, for which it was a stepping stone |


## End to End Example

Suppose our structure is a many-to-many architecture in which the task is language translation.

For simplicity we set $T=2$, which is enough to see how the forward and backward passes work.

**Setup**

- **Inputs:** $x_1, x_2 \in \mathbb{R}^{d_x}$
- **Targets:** $y_1, y_2 \in \mathbb{R}^{d_{out}}$
- **Hidden Size:** $d_h$
- **Initial States:** $h_0, C_0 \in \mathbb{R}^{d_h}$
- **LSTM Weight Parameters:** $\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_o, \mathbf{W}_c \in \mathbb{R}^{d_h \times (d_h + d_x)}$
- **LSTM Bias Parameters:** $b_f, b_i, b_o, b_c \in \mathbb{R}^{d_h}$
- **Output Layer:** $\mathbf{W}_{out} \in \mathbb{R}^{d_{out} \times d_h}, b_{out} \in \mathbb{R}^{d_{out}}$
- **Loss:** $L = l(\hat{y_1}, y_1) + l(\hat{y_2}, y_2)$, where $l$ is the cross-entropy loss applied at each step to the softmax of the output $\hat{y}_t$

### Forward Pass

**At $t=1$**

1. **Concatenate:** $z_1 = [h_0; x_1] \in \mathbb{R}^{d_h + d_x}$
2. **Gates:**
   $$
   \begin{aligned}
   f_1 &= \sigma(W_f z_1 + b_f) \\
   i_1 &= \sigma(W_i z_1 + b_i) \\
   o_1 &= \sigma(W_o z_1 + b_o) \\
   \tilde{C}_1 &= \tanh(W_c z_1 + b_c)
   \end{aligned}
   $$
3. **Cell state:** $C_1 = f_1 \odot C_0 + i_1 \odot \tilde{C}_1$
4. **Hidden state:** $h_1 = o_1 \odot \tanh(C_1)$
5. **Output:** $\hat{y}_1 = W_{out}h_1 + b_{out}$

**At $t=2$**

1. **Concatenate:** $z_2 = [h_1; x_2]$
2. **Gates:**
   $$
   \begin{aligned}
   f_2 &= \sigma(W_f z_2 + b_f) \\
   i_2 &= \sigma(W_i z_2 + b_i) \\
   o_2 &= \sigma(W_o z_2 + b_o) \\
   \tilde{C}_2 &= \tanh(W_c z_2 + b_c)
   \end{aligned}
   $$
3. **Cell state:** $C_2 = f_2 \odot C_1 + i_2 \odot \tilde{C}_2$
4. **Hidden state:** $h_2 = o_2 \odot \tanh(C_2)$
5. **Output:** $\hat{y}_2 = W_{out}h_2 + b_{out}$
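
The two forward steps above can be sketched as a short loop in NumPy. This is an illustrative sketch under stated assumptions: the targets are class indices (one-hot $y_t$), the outputs $\hat{y}_t$ are treated as logits for a softmax cross-entropy, and the function names, dictionary layout, sizes, and random values are all hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one step (assumes y_t is one-hot)."""
    shifted = logits - logits.max()                  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_index]

def forward(xs, targets, h0, c0, W, b, W_out, b_out):
    """Run the LSTM over xs; return the summed loss and the final states.

    W and b are dicts keyed by "f", "i", "o", "c" (a hypothetical layout).
    """
    h, c, loss = h0, c0, 0.0
    for x_t, y_t in zip(xs, targets):
        z = np.concatenate([h, x_t])                 # 1. concatenate
        f = sigmoid(W["f"] @ z + b["f"])             # 2. gates
        i = sigmoid(W["i"] @ z + b["i"])
        o = sigmoid(W["o"] @ z + b["o"])
        c_tilde = np.tanh(W["c"] @ z + b["c"])
        c = f * c + i * c_tilde                      # 3. cell state
        h = o * np.tanh(c)                           # 4. hidden state
        y_hat = W_out @ h + b_out                    # 5. output (logits)
        loss += cross_entropy(y_hat, y_t)
    return loss, h, c

rng = np.random.default_rng(0)
d_x, d_h, d_out, T = 3, 4, 5, 2                      # arbitrary small sizes
W = {g: rng.normal(scale=0.1, size=(d_h, d_h + d_x)) for g in "fioc"}
b = {g: np.zeros(d_h) for g in "fioc"}
W_out = rng.normal(scale=0.1, size=(d_out, d_h))
b_out = np.zeros(d_out)
xs = rng.normal(size=(T, d_x))
targets = [1, 3]                                     # arbitrary class indices

loss, h_T, c_T = forward(xs, targets, np.zeros(d_h), np.zeros(d_h),
                         W, b, W_out, b_out)
```

With small random weights the logits are close to zero, so the loss should sit near $2 \ln d_{out}$, the value for a uniform prediction at both steps.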

---


### Backward Pass (Symbolic, Step-by-Step)

**Step 1: Output Layer Gradients**
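For $t=2, 1$ (reverse order):
* $\frac{\partial L}{\partial \hat{y}_t}$ from the loss function (e.g., cross-entropy)
* $\frac{\partial L}{\partial W_{out}} += \frac{\partial L}{\partial \hat{y}_t} \cdot h_t^T$ and $\frac{\partial L}{\partial b_{out}} += \frac{\partial L}{\partial \hat{y}_t}$
* $\frac{\partial L}{\partial h_t} = W_{out}^T \frac{\partial L}{\partial \hat{y}_t}$, plus, for $t < T$, the $h_t$ part of $\frac{\partial L}{\partial z_{t+1}}$ from Step 5, because $h_t$ also feeds the gates at step $t+1$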

**Step 2: LSTM Gradients at $t=2$**
Here $\frac{\partial L}{\partial o_t}$, $\frac{\partial L}{\partial f_t}$, $\frac{\partial L}{\partial i_t}$ and $\frac{\partial L}{\partial \tilde{C}_t}$ denote gradients with respect to each gate's pre-activation $W_g z_t + b_g$, which is why the sigmoid derivative $g_t \odot (1 - g_t)$ or the tanh derivative $(1 - \tilde{C}_t^2)$ already appears in them.
* $h_2 = o_2 \odot \tanh(C_2)$
* $\frac{\partial L}{\partial o_2} = \frac{\partial L}{\partial h_2} \odot \tanh(C_2) \odot o_2 \odot (1 - o_2)$
* $\frac{\partial L}{\partial C_2} = \textcolor{red}{\underbrace{\frac{\partial L}{\partial h_2} \odot o_2 \odot (1 - \tanh^2(C_2))}_{\text{Local Gradient}}} + \textcolor{blue}{\underbrace{dC_2^{\text{from future}}}_{\text{Flow from } t+1}} \quad (\text{for } t=T=2, \ dC_2^{\text{from future}} = 0)$
* $\frac{\partial L}{\partial f_2} = \frac{\partial L}{\partial C_2} \odot C_1 \odot f_2 \odot (1 - f_2)$
* $\frac{\partial L}{\partial i_2} = \frac{\partial L}{\partial C_2} \odot \tilde{C}_2 \odot i_2 \odot (1 - i_2)$
* $\frac{\partial L}{\partial \tilde{C}_2} = \frac{\partial L}{\partial C_2} \odot i_2 \odot (1 - \tilde{C}_2^2)$
* **Propagate to previous cell:** $dC_1^{\text{from future}} = \frac{\partial L}{\partial C_2} \odot f_2$

**Step 3: LSTM Gradients at $t=1$**
* $h_1 = o_1 \odot \tanh(C_1)$
* $\frac{\partial L}{\partial o_1} = \frac{\partial L}{\partial h_1} \odot \tanh(C_1) \odot o_1 \odot (1 - o_1)$
* $\frac{\partial L}{\partial C_1} = \textcolor{red}{\underbrace{\frac{\partial L}{\partial h_1} \odot o_1 \odot (1 - \tanh^2(C_1))}_{\text{Local Gradient}}} + \textcolor{blue}{\underbrace{dC_1^{\text{from future}}}_{\text{Flow from } t+1}}$
* $\frac{\partial L}{\partial f_1} = \frac{\partial L}{\partial C_1} \odot C_0 \odot f_1 \odot (1 - f_1)$
* $\frac{\partial L}{\partial i_1} = \frac{\partial L}{\partial C_1} \odot \tilde{C}_1 \odot i_1 \odot (1 - i_1)$
* $\frac{\partial L}{\partial \tilde{C}_1} = \frac{\partial L}{\partial C_1} \odot i_1 \odot (1 - \tilde{C}_1^2)$
* **Propagate to previous cell:** $dC_0 = \frac{\partial L}{\partial C_1} \odot f_1$

**Step 4: Parameter Gradients**
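For each gate $g \in \{f, i, o, c\}$ (with $c$ standing for the candidate $\tilde{C}$) and $t=1, 2$, accumulate the following, where $\frac{\partial L}{\partial g_t}$ is the pre-activation gradient computed in Steps 2 and 3: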
* $\frac{\partial L}{\partial W_g} += \frac{\partial L}{\partial g_t} \cdot z_t^T$
* $\frac{\partial L}{\partial b_g} += \frac{\partial L}{\partial g_t}$

**Step 5: Input/Hidden Gradients**
* $\frac{\partial L}{\partial z_t} = W_f^T \frac{\partial L}{\partial f_t} + W_i^T \frac{\partial L}{\partial i_t} + W_o^T \frac{\partial L}{\partial o_t} + W_c^T \frac{\partial L}{\partial \tilde{C}_t}$
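* Split $\frac{\partial L}{\partial z_t}$ into $\frac{\partial L}{\partial h_{t-1}}$ and $\frac{\partial L}{\partial x_t}$; the $\frac{\partial L}{\partial h_{t-1}}$ part is added to the output-layer gradient of $h_{t-1}$ before step $t-1$ is processed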
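
The five steps can be checked numerically. The sketch below implements them in NumPy and compares the analytic weight gradients with a finite-difference estimate along a random direction. It rests on several assumptions that are not in the text above: softmax cross-entropy with class-index targets, zero initial states, a hypothetical dictionary layout keyed by gate name, and arbitrary sizes and random values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def forward(xs, ys, W, b, W_out, b_out, d_h):
    """Forward pass that caches what the backward pass needs."""
    h, c = np.zeros(d_h), np.zeros(d_h)
    cache, loss = [], 0.0
    for x, y in zip(xs, ys):
        z = np.concatenate([h, x])
        f = sigmoid(W["f"] @ z + b["f"])
        i = sigmoid(W["i"] @ z + b["i"])
        o = sigmoid(W["o"] @ z + b["o"])
        cand = np.tanh(W["c"] @ z + b["c"])        # candidate C~_t
        c_prev, c = c, f * c + i * cand
        h = o * np.tanh(c)
        p = softmax(W_out @ h + b_out)
        loss -= np.log(p[y])
        cache.append((z, f, i, o, cand, c_prev, c, p, y))
    return loss, cache

def backward(cache, W, W_out, d_h):
    """Steps 1 to 5, run over the time steps in reverse order."""
    dW = {k: np.zeros_like(v) for k, v in W.items()}
    db = {k: np.zeros(d_h) for k in W}
    dh_next = np.zeros(d_h)      # dL/dh_t arriving from step t+1
    dc_next = np.zeros(d_h)      # dC_t "from future"
    for z, f, i, o, cand, c_prev, c, p, y in reversed(cache):
        dy = p.copy()
        dy[y] -= 1.0                               # Step 1: dL/dy_hat_t
        dh = W_out.T @ dy + dh_next                # output layer + recurrence
        tanh_c = np.tanh(c)
        dc = dh * o * (1 - tanh_c ** 2) + dc_next  # local + flow from t+1
        # Steps 2-3: gradients w.r.t. the gate pre-activations
        d = {"o": dh * tanh_c * o * (1 - o),
             "f": dc * c_prev * f * (1 - f),
             "i": dc * cand * i * (1 - i),
             "c": dc * i * (1 - cand ** 2)}
        dz = np.zeros_like(z)
        for k in d:
            dW[k] += np.outer(d[k], z)             # Step 4
            db[k] += d[k]
            dz += W[k].T @ d[k]                    # Step 5
        dh_next = dz[:d_h]                         # h part of dL/dz_t
        dc_next = dc * f                           # propagate to previous cell
    return dW, db

rng = np.random.default_rng(0)
d_x, d_h, d_out = 3, 4, 5
W = {k: rng.normal(scale=0.5, size=(d_h, d_h + d_x)) for k in "fioc"}
b = {k: rng.normal(scale=0.1, size=d_h) for k in "fioc"}
W_out, b_out = rng.normal(scale=0.5, size=(d_out, d_h)), np.zeros(d_out)
xs, ys = rng.normal(size=(2, d_x)), [1, 3]

loss, cache = forward(xs, ys, W, b, W_out, b_out, d_h)
dW, db = backward(cache, W, W_out, d_h)

# Directional finite-difference check over all four weight matrices.
D = {k: rng.normal(size=W[k].shape) for k in W}

def loss_at(step):
    shifted = {k: W[k] + step * D[k] for k in W}
    return forward(xs, ys, shifted, b, W_out, b_out, d_h)[0]

eps = 1e-6
numeric = (loss_at(eps) - loss_at(-eps)) / (2 * eps)
analytic = sum(np.sum(dW[k] * D[k]) for k in W)
```

If the recurrent contribution `dh_next` or the `dc_next` term is dropped, `numeric` and `analytic` no longer agree, which is a quick way to see that both paths matter even for $T=2$.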

---

##### Note

When we computed the gradient with respect to the context, it contained an additive term.<br> To understand why, look at where $C_1$ is used in the forward pass.

$C_1$ appears in two places:

* $h_1 = o_1 \odot \tanh(C_1)$ (direct path)
* $C_2 = f_2 \odot C_1 + i_2 \odot \tilde{C_2}$ (indirect path)

So the total derivative is the sum over both paths:

$$\frac{\partial L}{\partial C_1} = \textcolor{red}{\underbrace{\frac{\partial L}{\partial h_1} \cdot  \frac{\partial h_1}{\partial C_1}}_{\text{direct}}} + \textcolor{blue}{ \underbrace{\frac{\partial L}{\partial C_2} \cdot\frac{\partial C_2}{\partial C_1}}_{\text{indirect}}}$$

* $\frac{\partial h_1}{\partial C_1} = o_1 \odot (1-\tanh^2(C_1))$
* $\frac{\partial C_2}{\partial C_1} = f_2$ (elementwise multiplication)

Altogether,

$$\frac{\partial L}{\partial C_1} = \frac{\partial L}{\partial h_1} \odot o_1 \odot (1 - \tanh^2(C_1)) + \frac{\partial L}{\partial C_2} \odot f_2$$
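
The direct-plus-indirect formula can be verified with finite differences. In the sketch below the gate values are held fixed and the loss is a toy function that is linear in $h_1$ and $h_2$. Both are assumptions made only to isolate the two paths through $C_1$, and `w1` and `w2` are hypothetical stand-ins for $\frac{\partial L}{\partial h_1}$ and $\frac{\partial L}{\partial h_2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Gate values held fixed (assumed, to isolate the two paths through C_1).
o1, f2, i2, o2 = (rng.uniform(0.1, 0.9, size=n) for _ in range(4))
c_tilde2 = rng.uniform(-1, 1, size=n)
w1, w2 = rng.normal(size=n), rng.normal(size=n)

def loss(c1):
    h1 = o1 * np.tanh(c1)                  # direct path
    c2 = f2 * c1 + i2 * c_tilde2           # indirect path
    h2 = o2 * np.tanh(c2)
    return w1 @ h1 + w2 @ h2               # toy loss, linear in h_1 and h_2

c1 = rng.normal(size=n)
c2 = f2 * c1 + i2 * c_tilde2

dL_dc2 = w2 * o2 * (1 - np.tanh(c2) ** 2)
direct = w1 * o1 * (1 - np.tanh(c1) ** 2)  # dL/dh_1 * dh_1/dC_1
indirect = dL_dc2 * f2                     # dL/dC_2 * dC_2/dC_1
analytic = direct + indirect

# Central finite differences, one coordinate of C_1 at a time.
eps = 1e-6
numeric = np.array([
    (loss(c1 + eps * e) - loss(c1 - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```

Neither `direct` nor `indirect` alone matches `numeric` in general; only their sum does.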