# Long Short Term Memory

As discussed earlier, Long Short-Term Memory (LSTM) aims to solve the structural gradient problem of RNN models so that they can retain information over long sequences.

To understand why this is a structural issue, consider that the model receives a loss and updates its parameters accordingly. 

In a vanilla RNN, all information—what to remember, forget, or output—is controlled by a single hidden state and a single set of parameters. This means the model lacks the flexibility to independently decide which information to keep, discard, or expose at each time step. As a result, it struggles to preserve relevant context over long sequences.

LSTM addresses this by introducing separate gates, each with its own parameters. These gates allow the model to learn, through backpropagation, **how much to remember (forget gate), what new information to add (input gate), and what to output (output gate)**. This structural change gives the model the expressiveness needed to manage context and memory more effectively.

**LSTM aims to:**
- Keep important information
- Forget irrelevant information
- Control what and when to remember

## LSTM Architecture
<div align="center">

| <div align="center"><img src="../images/chap10/LSTMkey.png" width="725"/></div> |
|---|
| <div align="center"><img src="../images/chap10/LSTMarch.png" width="725"/></div> |
</div>



**Components**

For every time step $t$ we have what is known as an **LSTM Cell**, comprised of:
- Inputs:
  - $x_t \ \text{– Input at time t}$
  - $C_{t-1} \ \text{– Context from time t – 1}$
  - $h_{t-1} \ \text{– Hidden state at time t – 1}$
- Non-linear functions:
  - $\text{Sigmoid}$
  - $\text{Hyperbolic Tangent}$
- Outputs:
  - $C_{t} \  \text{– Context at time t}$
  - $h_{t} \ \text{– Hidden state at time t}$

### Three Internal Gates

We now zoom in on the LSTM structure and focus on the three gates that give the model these properties.

**Forget Gate**

|**Gate Function** | **Visual** | **Parameters** | **Explanation** | 
|------------------|------------------|------------------|------------------|
|$$f_t = \sigma(W_f[h_{t-1}, x_t]^T + b_f)$$ <br> $$f_t \odot C_{t-1}$$ | <div align="center"><img src="../images/chap10/fGate.png" width="325"/></div> | $\{W_{fh}, W_{fx}, b_f\}$ | The forget gate outputs values in $(0, 1)$ (after the sigmoid). <br> Entries of $C_{t-1}$ whose gate value is close to 1 are remembered; entries whose gate value is close to 0 are forgotten. | 

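As a rough illustration of the forget gate, here is a minimal NumPy sketch. The function names, the toy sizes, and the random weights are assumptions made for this example; `W_f` stands for the two blocks $W_{fh}$ and $W_{fx}$ stacked side by side so that it can multiply the concatenated vector $[h_{t-1}, x_t]$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes each entry into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(W_f, b_f, h_prev, x_t, C_prev):
    """Return the gate values f_t and the retained context f_t * C_prev."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ concat + b_f)        # one value in (0, 1) per context entry
    return f_t, f_t * C_prev                 # element-wise scaling of the old context

# Hypothetical toy sizes: hidden size 3, input size 2.
rng = np.random.default_rng(0)
W_f = rng.normal(size=(3, 5))                # shape (hidden, hidden + input)
b_f = np.zeros(3)
h_prev, x_t = np.zeros(3), np.ones(2)
C_prev = np.array([1.0, -2.0, 0.5])

f_t, kept = forget_gate(W_f, b_f, h_prev, x_t, C_prev)
print(f_t)    # gate values, each strictly between 0 and 1
print(kept)   # old context, shrunk entry by entry
```

Because every gate value lies strictly between 0 and 1, each entry of the retained context has a smaller magnitude than the corresponding entry of `C_prev`; the gate can only scale memory down, never amplify it.
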
**Input Gate**

|**Gate Function** | **Visual** | **Parameters** | **Explanation** | 
|------------------|------------------|------------------|------------------|
| $$\begin{align*}i_t &= \sigma(W_i[h_{t-1}, x_t]^T + b_i) \\ \tilde{C_t} &= \tanh(W_c[h_{t-1}, x_t]^T + b_c) \\ & i_t \odot \tilde{C_t}\end{align*}$$ | <div align="center"><img src="../images/chap10/iGate.png" width="1200"/></div> | $\{W_{ih}, W_{ix}, b_i, W_{ch}, W_{cx}, b_c\}$ | $\tilde{C_t}$ is the candidate context: it lets the model choose what to store. The tanh squashes each candidate value into $(-1, 1)$, so a candidate can either raise or lower an entry of the context. <br> $i_t$ lets the model choose how much of each candidate value to store: the sigmoid produces values in $(0, 1)$, and the product $i_t \odot \tilde{C_t}$ diminishes candidate entries whose gate value is close to 0. |

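The input gate and the candidate context can be sketched the same way. This is an illustrative NumPy example rather than a reference implementation; the function names, toy sizes, and random weights are assumptions, and `W_i` and `W_c` each stack their hidden-state and input blocks side by side.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes each entry into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(W_i, b_i, W_c, b_c, h_prev, x_t):
    """Return i_t, the candidate C_tilde, and the gated write i_t * C_tilde."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)        # how much to write, in (0, 1)
    C_tilde = np.tanh(W_c @ concat + b_c)    # what to write, in (-1, 1)
    return i_t, C_tilde, i_t * C_tilde

# Hypothetical toy sizes: hidden size 3, input size 2.
rng = np.random.default_rng(0)
W_i, W_c = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
b_i, b_c = np.zeros(3), np.zeros(3)
h_prev, x_t = np.zeros(3), np.ones(2)

i_t, C_tilde, write = input_gate(W_i, b_i, W_c, b_c, h_prev, x_t)
print(C_tilde)  # candidate values, positive or negative
print(write)    # candidate values after gating
```

The candidate keeps its sign after gating; the gate only decides how strongly each candidate value is written.
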
We use the forget gate and the input gate to produce our new context.

$$C_t = (f_t \odot C_{t-1}) + (i_{t} \odot \tilde{C_{t}})$$

This update has two useful properties:
1. Because the update is additive, contextual information survives backpropagation through time (BPTT) for longer.
2. Intuitively, the new context is a blend of how much past memory is kept and how much new information is stored.
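
To see why the additive form helps, a simplified derivation is sketched below. It treats the gate values as constants, that is, it ignores the indirect dependence of $f_t$, $i_t$ and $\tilde{C_t}$ on $C_{t-1}$ through $h_{t-1}$. This is a simplifying assumption made for illustration.

$$\begin{align*}\frac{\partial C_t}{\partial C_{t-1}} &\approx \operatorname{diag}(f_t) \\ \frac{\partial C_t}{\partial C_k} &\approx \prod_{j=k+1}^{t} \operatorname{diag}(f_j), \quad k < t\end{align*}$$

Under this approximation the gradient along the context path is scaled only by the forget gate values. When the model learns to keep $f_j$ close to 1, the gradient passes through many time steps almost unchanged, whereas in a vanilla RNN every step multiplies the gradient by the recurrent weight matrix and the derivative of the activation.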

**Output Gate**

|**Gate Function** | **Visual** | **Parameters** | **Explanation** | 
|------------------|------------------|------------------|------------------|
| $$\begin{align*}o_t &= \sigma(W_o[h_{t-1}, x_t]^T + b_o) \\ h_t &= o_t \odot \tanh(C_t)\end{align*}$$ | <div align="center"><img src="../images/chap10/oGate.png" width="800"/></div> | $\{W_{oh}, W_{ox}, b_o\}$ | This gate controls how much of $C_t$ is exposed in the new hidden state $h_t$, enabling context-dependent outputs at each step. |

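Putting the three gates together, one full LSTM step can be sketched in NumPy as follows. This is a minimal forward-pass illustration under stated assumptions: the `params` dictionary layout, the helper `init_params`, the small random initialisation, and the toy sizes are all hypothetical choices for this example, and no training is shown.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes each entry into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, C_prev, params):
    """One LSTM step. params maps 'f', 'i', 'c', 'o' to (W, b),
    where W has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_o, b_o = params["o"]
    f_t = sigmoid(W_f @ concat + b_f)            # forget gate
    i_t = sigmoid(W_i @ concat + b_i)            # input gate
    C_tilde = np.tanh(W_c @ concat + b_c)        # candidate context
    o_t = sigmoid(W_o @ concat + b_o)            # output gate
    C_t = f_t * C_prev + i_t * C_tilde           # additive context update
    h_t = o_t * np.tanh(C_t)                     # exposed hidden state
    return h_t, C_t

def init_params(hidden, n_in, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    return {g: (rng.normal(scale=0.1, size=(hidden, hidden + n_in)),
                np.zeros(hidden)) for g in "fico"}

# Run the cell over a toy sequence of T steps.
hidden, n_in, T = 4, 3, 6
params = init_params(hidden, n_in)
xs = np.random.default_rng(1).normal(size=(T, n_in))
h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in xs:
    h, C = lstm_cell(x_t, h, C, params)          # state carried step to step
print(h.shape, C.shape)
```

The loop makes the sequential nature of the model visible: step $t$ cannot start until $h_{t-1}$ and $C_{t-1}$ are available.
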
| **Pros** | **Cons** |
|----------|----------|
| 1. Mitigates the core RNN gradient problem (vanishing gradients in particular) <br> 2. Explicit, interpretable memory <br> 3. Learns when to forget (via gates) <br> 4. Works well on sequential data <br> 5. More stable training compared to vanilla RNN | 1. Sequential = slow: must process time steps one by one, which makes it hard to parallelize <br> 2. Still struggles with very long sequences: memory isn't infinite and long-range dependencies still degrade <br> 3. Complex and parameter heavy <br> 4. Superseded by Transformers in many applications (LSTM was a stepping stone) |

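To make "parameter heavy" concrete, the sketch below counts parameters using the parameter sets listed in the gate tables above (one weight block per gate acting on $[h_{t-1}, x_t]$, plus one bias vector). The function names and the example sizes are assumptions for illustration, and libraries may store parameters differently, so their reported counts can differ.

```python
def rnn_param_count(n_in, hidden):
    """Vanilla RNN cell: one weight block on [h, x] plus one bias vector."""
    return hidden * (hidden + n_in) + hidden

def lstm_param_count(n_in, hidden):
    """LSTM cell: the same block four times (forget, input, candidate, output)."""
    return 4 * rnn_param_count(n_in, hidden)

# Hypothetical sizes: 128-dimensional input, 256-dimensional hidden state.
n_in, hidden = 128, 256
print(rnn_param_count(n_in, hidden))    # 98560
print(lstm_param_count(n_in, hidden))   # 394240
```

For the same input and hidden sizes, the LSTM cell holds four times as many parameters as the vanilla RNN cell.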