# Intuition

We first discuss the shortcomings of RNNs: they suffer from exploding or vanishing gradients. They also tend to lose long-term memory, because the same weights keep being updated and overwrite what was learned earlier.

We first discussed conventional backpropagation through time (BPTT), where we computed the gradient of some unit j with respect to some unit i at an earlier time t - q. Each step back contributes the derivative of the activation function, evaluated at the unit's net input (the weighted sum of its inputs), times the connecting weight.
We tried a simplified example, where we had one weight and one unit j whose output loops back through that weight and is fed in again along with a new input.
We noticed two problems: the backpropagated error either explodes or vanishes.

We present LSTM as a remedy.

Its downside: it is far less efficient, because it lacks parallelization across time steps.

# Introduction

An RNN feeds its hidden representation back through the same weights at each new time step.

Across many time steps the same weight is multiplied in repeatedly, so if |w_ij| > 1 the signal explodes, and if |w_ij| < 1 it vanishes.
Goal: bridge long time lags.

> ... error signals flowing backward in time tend to (1) blow up or (2) vanish

1. Blow up: this leads to oscillating weights, because a weight may be needed for one input but not for another.
   The same weights contribute to the same outputs in consecutive time steps, producing ever larger values for those weights and ever smaller values for weights that don't contribute.
2. Vanish: this happens when the time lag is significant, meaning it takes many time steps for an input to have a realized effect at a future time step.

# Constant Error Backpropagation

## 3.1 Exponentially decaying error

Conventional BPTT

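Recall that the error at output unit k is $d_k(t) - y_k(t)$, the difference between the target and the actual output, where k is the index of an output unit.

Backpropagating this error q steps multiplies one factor per step. If every factor along the path satisfies

$\left| f_{l_m}^{\prime}\left(net_{l_m}(t-m)\right) w_{l_m l_{m-1}} \right| > 1.0$

the product grows exponentially with q and the error blows up. If every factor is below 1.0, the product shrinks exponentially with q and the error vanishes.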
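The effect of this per-step factor can be checked numerically. The sketch below is a simplification, not the general multi-unit case: it assumes a single self-connected unit with a logistic sigmoid activation and a constant net input, so the same factor is applied at every step. The helper names `sigmoid_prime` and `error_scale` are made up for illustration.

```python
import math

def sigmoid_prime(net):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at net = 0."""
    y = 1.0 / (1.0 + math.exp(-net))
    return y * (1.0 - y)

def error_scale(weight, net, steps):
    """Factor by which an error is scaled after `steps` backward steps
    through one self-connected unit (assumption: constant net input)."""
    factor = sigmoid_prime(net) * weight
    return abs(factor) ** steps

# At net = 0 the derivative is 0.25, so the per-step factor is weight / 4.
for w in (2.0, 4.0, 6.0):
    print(w, [error_scale(w, 0.0, q) for q in (1, 10, 50)])
```

With a per-step factor below 1 the scale collapses toward zero, above 1 it grows without bound, and only a factor of exactly 1 leaves the error unchanged.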

Short term misfit:

- Weights that were adjusted to fit the inputs already processed might not generalize well to new inputs of the sequence.
- If new inputs are different, the weights must adjust again, potentially in the opposite direction from their previous update.
- This is bad in the short term: if a new input activates the same weights in a different way, it hits weights that have grown too strong, and the correction swings back equally hard.
- This short-term optimization is too focused on minimizing error on recent inputs.

Long term misfit:

- Some weights constantly receive small gradients, indicating they are not needed for the current flow of inputs.
- When new inputs arrive for which these weights are actually needed, the weights are already too small to learn, essentially rendering them useless.
- This fast decay ruins long-term memory.

## 3.2 Constant Error Flow: Naive Approach

We want a way to backpropagate errors without the gradients vanishing or exploding.
For a single unit j with one connection to itself, j's local error backflow is $v_j(t) = f_j'(net_j(t))\, w_{jj}\, v_j(t+1)$.
To enforce constant error flow through j, we need
$f_j'(net_j(t))\, w_{jj} = 1.0$
That is, the product of the activation function's derivative and the self-weight should be 1, so the error is passed back unscaled at every time step.
One way to satisfy this is the identity activation $f_j(x) = x$ with $w_{jj} = 1.0$.

## 3.2.2: The Constant Error Carousel

$f_j(net_j(t)) = net_j(t) / w_{jj}$
This follows from integrating the constant error flow condition: $f_j$ must be linear in $net_j$, so as $net_j$ increases, $f_j$ increases in proportion.

This choice does not alter the scale of the error signal, preventing it from vanishing or exploding.

The activation of unit j also remains constant over time: $y_j(t+1) = f_j(net_j(t+1)) = f_j(w_{jj}\, y_j(t)) = y_j(t)$.
Whatever the output is at time t, it remains unchanged at time t+1, creating a loop (the constant error carousel, CEC) in which the error can circulate without growing or shrinking.
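A small simulation can illustrate the carousel. This is a minimal sketch that assumes unit j has the identity activation and no connections other than its self-connection; the function name `run_cec` and its arguments are hypothetical.

```python
def run_cec(y0, error_T, steps, w_jj=1.0):
    """Unit j with f_j(x) = x and self-weight w_jj, no other inputs.
    Returns the activation after `steps` forward steps and the error
    after `steps` backward steps."""
    y = y0
    for _ in range(steps):
        net = w_jj * y          # net_j(t+1) = w_jj * y_j(t)
        y = net                 # f_j(net) = net
    v = error_T
    for _ in range(steps):
        v = 1.0 * w_jj * v      # v_j(t) = f_j'(net_j(t)) * w_jj * v_j(t+1), with f_j' = 1
    return y, v

print(run_cec(0.7, 0.3, 1000))            # self-weight 1.0: nothing changes
print(run_cec(0.7, 0.3, 100, w_jj=0.9))   # self-weight below 1.0: both decay
```

With `w_jj = 1.0` both the stored activation and the circulating error come back unchanged, no matter how many steps are taken; any other self-weight reintroduces decay or growth.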

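However, unit j is connected to other units besides itself.
Its net input sums over all incoming connections, $net_j(t) = \sum_i w_{ji}\, y^i(t-1)$, where $w_{ji}$ is the weight from unit i to unit j.
Its backpropagated error sums over all outgoing connections, $v_j(t) = f_j'(net_j(t)) \sum_i w_{ij}\, v_i(t+1)$, where $v_i(t+1)$ is the error of unit i at the later time step.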

Two related problems arise.

Input weight conflict:

> Assume that the total error can be reduced by switching on unit j in response to a certain input and keeping it active for a long time.
> We want to store some input in unit j.
> $w_{ji}$ will often receive conflicting weight update signals during this time.
> The weight from i to j is pulled in two directions: make it high so the input is stored by switching j on, or make it low so that later, irrelevant inputs from i cannot switch j off.
> This is an inherent conflict.

Output weight conflict:

> Assume j is switched on and currently stores some previous input.
> The same weight $w_{kj}$ has to be used both for retrieving j's content when k needs it AND for preventing j from disturbing k at other times.

> As the time lag increases, stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, more and more already correct outputs also require protection against perturbation
> This is the difficulty of recurrent neural nets: units need to hold information from earlier inputs, but the same weights are also updated for new inputs coming in, causing conflict. As the time lag increases, a new input that depends on some very old context may no longer find it, because the intervening updates have worn it away. The network essentially "forgets".

# The Concept of Long Short-Term Memory

## 4.1 Memory Cells and Gate Units.

> A multiplicative input gate unit is introduced to protect the memory contents stored in j from perturbation by irrelevant inputs
> multiplicative output gate unit is introduced to protect other units from perturbation by currently irrelevant memory contents stored in j

![Screenshot 2024-05-14 at 2.26.28 PM.png](../../images/Screenshot_2024-05-14_at_2.26.28_PM.png)

A memory cell is a complex unit built around a central linear unit with a fixed self-connection (the CEC).

$in_j$ is the input gate, with $net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1)$ and $y^{in_j}(t) = f_{in_j}(net_{in_j}(t))$. Its weights $w_{in_j u}$ are learned so that irrelevant inputs leave the cell untouched.

- This gate decides whether a new input can influence the value stored in the cell. If it is open, the input enters the cell. If it is closed, it does not.

$out_j$ is the output gate, computed the same way from its own weights: $y^{out_j}(t) = f_{out_j}(net_{out_j}(t))$. It controls access to the memory contents currently stored in the cell.

- If it is open, the cell's memory exits and influences other units. If it is closed, the cell's memory cannot exit.

$\begin{gathered}s_{c_j}(0)=0 \\ s_{c_j}(t)=s_{c_j}(t-1)+y^{\mathrm{in}_j}(t) g\left(\operatorname{net}_{c_j}(t)\right)\end{gathered}$

For a memory cell $c_j$, the internal state follows a recurrence: it equals its own value at the previous time step (the long-term part) plus the squashed net input $g(net_{c_j}(t))$ scaled by the input gate activation $y^{in_j}(t)$ (the short-term part).

The squashed cell input is the value to add to long-term memory, and the input gate activation tells us how much of that value is relevant to this specific memory cell.
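Unrolling the recurrence from $s_{c_j}(0)=0$ may make the constant error flow easier to see. This is a worked restatement of the definition, with the second line holding only under the assumption that the gate activation and net input at time $t$ are treated as independent of $s_{c_j}(t-1)$ (the truncation used during learning):

$\begin{gathered}s_{c_j}(t)=\sum_{\tau=1}^{t} y^{\mathrm{in}_j}(\tau)\, g\left(\operatorname{net}_{c_j}(\tau)\right) \\ \frac{\partial s_{c_j}(t)}{\partial s_{c_j}(t-1)}=1\end{gathered}$

The state is a gated running sum of everything written into the cell, and an error reaching $s_{c_j}(t)$ is passed back to $s_{c_j}(t-1)$ with a factor of exactly 1.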

The output can be calculated as

$y^{c_j}(t)=y^{\mathrm{out}_j}(t)\, h\left(s_{c_j}(t)\right)$

The output is the squashed internal state $h(s_{c_j}(t))$ (the long-term memory) scaled by the output gate activation, which decides how much of that memory is relevant to the rest of the network at this time step.

The cell's net input comes from the previous-step outputs of the units that feed it:

$\operatorname{net}_{c_j}(t)=\sum_u w_{c_j u} y^u(t-1)$
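The forward equations above can be put together in a few lines. This is a minimal sketch of one memory cell, not a full network: the particular squashing functions `g` and `h`, the use of a logistic sigmoid for both gates, and all helper names are assumptions made for illustration rather than details taken from these notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def g(x):
    # Assumed input squashing function with range (-2, 2).
    return 4.0 * sigmoid(x) - 2.0

def h(x):
    # Assumed output squashing function with range (-1, 1).
    return 2.0 * sigmoid(x) - 1.0

def dot(weights, values):
    return sum(w * v for w, v in zip(weights, values))

def cell_step(s_prev, y_prev, w_c, w_in, w_out):
    """One forward step of a single memory cell c_j.
    s_prev: internal state s_cj(t-1)
    y_prev: outputs y^u(t-1) of the units feeding the cell
    w_c, w_in, w_out: weights into the cell, input gate, output gate
    """
    net_c = dot(w_c, y_prev)              # net_cj(t)
    y_in = sigmoid(dot(w_in, y_prev))     # input gate activation
    y_out = sigmoid(dot(w_out, y_prev))   # output gate activation
    s = s_prev + y_in * g(net_c)          # CEC update
    y_c = y_out * h(s)                    # gated cell output
    return s, y_c

# Closed input gate (large negative weight): the stored state is protected.
print(cell_step(0.5, [1.0], w_c=[3.0], w_in=[-50.0], w_out=[50.0]))
# Open input gate: the new input is written into the state.
print(cell_step(0.5, [1.0], w_c=[3.0], w_in=[50.0], w_out=[50.0]))
```

Driving the gate weights to extreme values shows the two protective roles: a closed input gate leaves `s` at its old value, and a closed output gate would push `y_c` toward zero regardless of what is stored.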

## Why Gate Units?

Error signals can get trapped in the long-term pathway (the CEC).
The gates determine how strongly each time step affects an individual cell. They also determine how much of the gradient is backpropagated through time (for example, if a memory stored long ago influenced the current time step, the error can be carried back to that earlier step).

The big strategy is to learn which errors to TRAP in the CEC, and how much to scale them.
If an error is relevant to a cell, the output gate needs to scale it so that it enters the CEC pathway strongly.

> Essentially the multiplicative gate units open and close access to constant error flow through CEC

Backpropagation flows through the gates only when needed.
The output gate decides which errors enter the cell: when it is zero or mostly closed, gradients are not backpropagated through the output pathway, and whatever error is already trapped in the CEC stays as it is.
The input gate controls the incoming information from the current input and the previous outputs of the network. In the backward direction it decides when a trapped error is released to the cell's input connections.
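The trapping of errors can be sketched as a backward pass over one cell. This is an illustration under a stated assumption: gradients are truncated so that the only path back in time is the CEC self-connection with weight 1.0. The function name `trapped_error` and its arguments are hypothetical.

```python
def trapped_error(output_errors, out_gates, h_primes):
    """Accumulate the error held in the CEC, walking backward in time.
    output_errors[t]: error arriving at the cell output y^{c_j}(t)
    out_gates[t]:     output gate activation y^{out_j}(t)
    h_primes[t]:      derivative h'(s_{c_j}(t))
    Returns the CEC error at each time step."""
    e_s = 0.0
    history = []
    steps = zip(reversed(output_errors), reversed(out_gates), reversed(h_primes))
    for e, y_out, hp in steps:
        # The self-connection (weight 1.0) carries the old error back unchanged;
        # the output gate scales whatever new error enters at this step.
        e_s = e_s + y_out * hp * e
        history.append(e_s)
    history.reverse()
    return history

# An error that enters at the last step is carried back unchanged.
print(trapped_error([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0], [0.5] * 4))
# With the output gate closed at that step, nothing enters the CEC.
print(trapped_error([0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], [0.5] * 4))
```

The first call shows constant error flow across all four steps; the second shows the output gate refusing to let the error in at all.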

## 4 Memory Cell Blocks

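A memory cell block groups several memory cells that share the same input gate and the same output gate. The block still has only two gates, but with several CEC pathways it can store more complex memories than a single cell.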

## 5 Learning

> To ensure nondecaying error backpropagation through internal states of memory cells, as with truncated BPTT (e.g., Williams & Peng, 1990), errors arriving at memory cell net inputs (for cell $c_j$, this includes $net_{c_j}$, $net_{in_j}$, $net_{out_j}$) do not get propagated back further in time

Errors that arrive at a memory cell's net inputs are not backpropagated any further through the network. If a gradient would vanish or explode, this truncation prevents that from happening along the short-term (ordinary recurrent) pathway; only inside the cell's internal state does the error keep flowing back in time.

To be backpropagated, gradients must flow through the memory cells, and the input or output gates can block that flow. Not ALL errors need to be corrected: the input and output gates control which errors get corrected. For example, an open input gate lets the memory cell pass on the error for the most recent input at that time step, making use of short-term memory and of how future new information in this cell will be used.

So allowing error to flow through a particular gate makes use of that gate's purpose.

- Input gate: more error means short term relevance of future new information in this cell
- Output gate: more error means short term relevance of previous information in this cell
