# Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a class of models that deal with **Sequential Data** such as: 
- Text
- Speech
- Music
- Audio
- Video
- Stock Prices
- Gaming

### Intuition

In many real-world problems, the meaning of the current observation depends on the previous observations. 
- To answer a question, one must first understand the question. 
- Videos and films have plots that build on earlier events in the video.

A fully connected (FC) NN treats its inputs independently and in an unordered manner. We already saw that this isn't helpful for images, where we needed the context of each part of the image in order to classify and localize. 

**Deep NN**

- Applies a combination of linear and non-linear transformations.
- Each internal layer is an intermediate representation.
- Each intermediate representation of the data is a step closer to solving our task.

<div align="center">
<img src="../images/chap10/FCvis.png" width="210"/>
</div>


**Recurrent NN** 

- We also use the intermediate representation.
- In this model class, we pass this intermediate representation onwards to **another** intermediate representation which also takes in a new input.


<div align="center">
<img src="../images/chap10/RNNInuition.png" width="410"/>
</div>

### Common RNN Structures

|**Name**| **One to One** | **One to Many** | **Many to One** | **Many to Many** | **Many to Many**|
|-----|--------------|-----------------|----------------|-----------------|----------------|
|**Visual**| <img src="../images/chap10/O2O.png" width="110"/> | <img src="../images/chap10/O2M.png" width="700"/> | <img src="../images/chap10/M2O.png" width="165"/> | <img src="../images/chap10/M2M1.png" width="1400"/> | <img src="../images/chap10/M2M2.png" width="170"/>|
|**Explanation**| Standard vanilla network | Converts an image into a sequence of words | Feeds a sequence of inputs through successive intermediate representations and outputs a single value | Feeds a sequence of inputs through successive representations; each layer passes its result to the next representation, and from some point on the network outputs a sequence | Feeds a sequence of inputs through successive representations; each representation layer must output a value and also pass its information on to the next representation layer (which also receives an input) | 
|**Task**| **Image Classification** | **Image Captioning** | **Sentiment Classification** | **Translation**  | **Video Frame Classification** | 
| **In $\to$ Out** | Image $\to$ Class | Image $\to$ Seq. Words | Seq. Words $\to$ Sentiment | Seq. Words $\to$ Seq. Words | Video Frame $\to$ Action |


## RNN Details

This class of networks should be viewed as NNs that receive inputs at **several time steps**. In the CNN, our input arrived at $t=0$ and produced an output "directly". 
Therefore, in general, all three layers (input, hidden, and output) are indexed by a time step.

**Definitions**

- $x_t \in \mathbb{R}^{d_{in}} := \text{Input at time t}$
- $W_x \in \mathbb{R}^{(d_h \times d_{in})} := \text{Input to Hidden weight matrix}$
- $W_h \in \mathbb{R}^{(d_h \times d_h)} := \text{Hidden to Hidden weight matrix}$
- $W_y \in \mathbb{R}^{(d_{out} \times d_h)} := \text{Hidden to Output weight matrix}$
- $b \in \mathbb{R}^{d_h} := \text{Bias hidden vector}$
- $b_y \in \mathbb{R}^{d_{out}} := \text{Bias output vector}$
- $z_t = W_h h_{t-1} + W_x x_t + b \in \mathbb{R}^{d_h} := \text{The "score" from input at time t and internal representation passed from t-1}$
- $h_{t} \in \mathbb{R}^{d_h} := \text{internal state vector time t}$
  - $h_t = \tanh(z_t) = f_W(x_t, h_{t-1})$
  - This is our non-linear function applied **element-wise**, so the dimension remains the same
- $y_t = \tanh(W_yh_t + b_y) \in \mathbb{R}^{d_{out}}$


**IMPORTANT**

We assume that the hidden states have the same dimension throughout time.
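The definitions above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the sizes `d_in`, `d_h`, `d_out`, the random initialisation, and the helper name `rnn_step` are assumptions made for the example. $W_y$ is stored with shape `(d_out, d_h)` so that the product $W_y h_t$ is defined.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 2  # illustrative sizes (assumed, not from the text)

W_x = rng.normal(size=(d_h, d_in))   # input  -> hidden
W_h = rng.normal(size=(d_h, d_h))    # hidden -> hidden
W_y = rng.normal(size=(d_out, d_h))  # hidden -> output
b = np.zeros(d_h)                    # hidden bias
b_y = np.zeros(d_out)                # output bias

def rnn_step(x_t, h_prev):
    """One time step: returns the new hidden state h_t and the output y_t."""
    z_t = W_h @ h_prev + W_x @ x_t + b  # "score" at time t
    h_t = np.tanh(z_t)                  # element-wise, shape stays (d_h,)
    y_t = np.tanh(W_y @ h_t + b_y)      # shape (d_out,)
    return h_t, y_t

x_t = rng.normal(size=d_in)
h_prev = np.zeros(d_h)
h_t, y_t = rnn_step(x_t, h_prev)
print(h_t.shape, y_t.shape)  # (3,) (2,)
```

Because the same `rnn_step` is reused at every time step, the hidden state must keep the shape `(d_h,)` from one step to the next.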

---

**Algebraic Modification**

The concatenation of vectors and matrices provides a clean way to represent multiple linear operations at once:

$\tilde{x}_t = \left[x_t \ h_{t-1}\right]^T \in \mathbb{R}^{d_{in} + d_{h}}$

$W = [W_x \ W_h] \in \mathbb{R}^{d_h \times (d_{in} + d_h)}$

Nothing changes mathematically; the notation just becomes tidier. 

$W \tilde{x}_t = [W_x \ W_h] \cdot \left[x_t \ h_{t-1}\right]^T \in \mathbb{R}^{d_h}$

So now, when applying our non-linear activation function, we have: 

$$h_t = \tanh(W \tilde{x}_t + b) = f_W(x_t, h_{t-1}) \in \mathbb{R}^{d_h}$$

**IMPORTANT**

$W$ is applied to **ALL** inputs and internal representations (unless stated otherwise); it is independent of time.
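A quick numerical check of the concatenation trick, as a sketch with assumed sizes and random values: stacking $W_x$ and $W_h$ side by side and stacking $x_t$ on top of $h_{t-1}$ gives the same score as the two separate products.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 4, 3  # illustrative sizes (assumed)

W_x = rng.normal(size=(d_h, d_in))
W_h = rng.normal(size=(d_h, d_h))
b = rng.normal(size=d_h)
x_t = rng.normal(size=d_in)
h_prev = rng.normal(size=d_h)

W = np.hstack([W_x, W_h])                # shape (d_h, d_in + d_h)
x_tilde = np.concatenate([x_t, h_prev])  # shape (d_in + d_h,)

z_separate = W_x @ x_t + W_h @ h_prev + b
z_stacked = W @ x_tilde + b

print(W.shape, x_tilde.shape)              # (3, 7) (7,)
print(np.allclose(z_separate, z_stacked))  # True
```

The order matters: $x_t$ must come first in $\tilde{x}_t$ because $W_x$ occupies the first $d_{in}$ columns of $W$.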

---

**Weight Representation**

|Weight Notation | One-Line Explanation | How it behaves | Example | 
|----------------|----------------------|----------------|---------|
| $$W_h$$ | How **past** information **influences** the **present** | If its values were binary $\{0,1\}$, it would represent what to keep and what to forget <br> If its values lie in $[0,1]$, it represents what to amplify and what to suppress | "I did **not** enjoy the movie" <br> Learns that negation persists, thereby affecting the word "enjoy" |
| $$W_x$$ | How to **interpret** the **current input** | Which parts of the input matter <br> How strong their influence currently is | If $x_t$ is a word embedding, $W_x$ learns: <br> Which embedding dimensions are important <br> How strongly a word should influence memory |
| $$W_y$$ | How to **interpret** the internal state to **make** a **decision** | This doesn't affect the memory <br> Can be viewed like the "classic" weights in a DNN | Hidden state encodes: $\{subject, negation, Action\}$ <br> $W_y$ learns which hidden features map to the output label. |

---

**Hyperbolic Function Reminder**

Tanh (Hyperbolic Tangent)

**Definition**

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1$$

**Derivative:**
$$\tanh'(x) = 1 - \tanh^2(x)$$

**Range:** $(-1, 1)$

**Shape:** S-shaped curve (zero-centered)

<div align="center">
<img src="../images/chap10/tanh.png" width="510"/>
</div>

| **✅ Advantages** | **❌ Disadvantages** |
|------------------|---------------------|
| **Zero-centered**: Output in $(-1, 1)$ → better gradient flow than sigmoid | **Vanishing gradients**: For $\|x\| > 3$, gradient $\approx 0$ → problem persists in very deep networks |
| **Stronger gradients**: $\tanh'(x)_{\text{max}} = 1$ (4× stronger than sigmoid!) | **Computationally expensive**: Still requires exponential calculations |
| **Symmetric**: Easier optimization, less bias in weight updates | **Saturation**: Can still saturate at extremes ($\pm 1$) |
| **Smooth and differentiable**: Good gradient properties everywhere | **Not ideal for deep networks**: ReLU typically performs better |
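The claims in this reminder can be checked numerically. The sketch below verifies the two forms of the definition, compares the analytic derivative with a central finite difference, and evaluates the gradient at $x = 0$ and $x = 3$. The helper names and the sample grid are assumptions for illustration; the sigmoid is included only for the comparison made in the table.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-5.0, 5.0, 11)

# Second form of the definition: 2 / (1 + e^{-2x}) - 1 = 2 * sigmoid(2x) - 1
identity_ok = np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Central finite difference against the analytic derivative 1 - tanh^2(x)
eps = 1e-6
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2.0 * eps)
derivative_ok = np.allclose(numeric, tanh_grad(x), atol=1e-6)

print(identity_ok, derivative_ok)          # True True
print(tanh_grad(0.0) / sigmoid_grad(0.0))  # 4.0: peak gradient ratio
print(tanh_grad(3.0))                      # slightly below 0.01
```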

---

## Unrolling Through Time

We shall now go through the training flow of the structures we introduced above 

### Forward + Backward: Many to One 

<div align="center">
<img src="../images/chap10/Many2one.png" width="510"/>

$$\begin{align} h_1 &= \tanh(W(x_1, h_0)^T + b) \\ h_2 &= \tanh(W(x_2, h_1)^T + b) \\  h_3 &= \tanh(W(x_3, h_2)^T + b) \\ \cdots &= \cdots \\ h_T &= \tanh(W(x_T, h_{T-1})^T + b) \\ y_T &= \tanh(W_yh_T + b_y) \end{align}$$

$$\mathbf{L} = l(y_T, y_{gt})$$
</div>
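The many-to-one forward pass can be sketched as a loop over time followed by a single output computation. The function name `rnn_many_to_one`, the sizes, and the squared-error choice for $l$ are assumptions made for this example; the text leaves $l$ unspecified.

```python
import numpy as np

def rnn_many_to_one(xs, h0, W_x, W_h, b, W_y, b_y):
    """Run the recurrence over all inputs, then produce one output y_T."""
    h = h0
    for x_t in xs:  # the same weights are reused at every time step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    y_T = np.tanh(W_y @ h + b_y)  # only the last hidden state is read out
    return y_T, h

rng = np.random.default_rng(2)
T, d_in, d_h, d_out = 5, 4, 3, 1  # illustrative sizes (assumed)

xs = rng.normal(size=(T, d_in))   # one input vector per time step
W_x = rng.normal(size=(d_h, d_in))
W_h = rng.normal(size=(d_h, d_h))
W_y = rng.normal(size=(d_out, d_h))
b, b_y = np.zeros(d_h), np.zeros(d_out)

y_T, h_T = rnn_many_to_one(xs, np.zeros(d_h), W_x, W_h, b, W_y, b_y)

y_gt = np.array([1.0])                   # made-up ground-truth label
loss = 0.5 * np.sum((y_T - y_gt) ** 2)   # assumed loss l: squared error
print(y_T.shape, h_T.shape)  # (1,) (3,)
```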




<div align="center">
<img src="../images/chap10/ManyToOneBPP.png" width="510"/>
</div>

We can directly compute the gradient of the loss w.r.t the output: $$\frac{\partial \mathbf{L}}{\partial y_T} = \frac{\partial l(\tanh(W_yh_T + b_y), y_{gt})}{\partial y_T}$$

Now we apply the chain rule repeatedly, treating gradients as column vectors and writing $z_t = W_h h_{t-1} + W_x x_t + b$:

$$\begin{align} 
\frac{\partial \mathbf{L}}{\partial h_T} &= \frac{\partial \mathbf{L}}{\partial y_T} \frac{\partial y_T}{\partial h_T} \\ 
\frac{\partial \mathbf{L}}{\partial h_{T-1}} &= \frac{\partial \mathbf{L}}{\partial h_T} \frac{\partial h_T}{\partial h_{T-1}} \\ 
&= \frac{\partial \mathbf{L}}{\partial h_T} \frac{\partial \tanh(z_T)}{\partial h_{T-1}} \\
&= \frac{\partial \mathbf{L}}{\partial h_T} \frac{\partial h_T}{\partial z_T} \frac{\partial z_T}{\partial h_{T-1}} \\
&= W_h^T \left( \frac{\partial \mathbf{L}}{\partial h_T} \odot \left(1 - \tanh^2(z_T)\right) \right) \\
\frac{\partial \mathbf{L}}{\partial h_{T-2}} &= \frac{\partial \mathbf{L}}{\partial h_{T-1}} \frac{\partial h_{T-1}}{\partial h_{T-2}} \\
\dots &= \dots \\ 
\frac{\partial \mathbf{L}}{\partial h_{0}} &= \frac{\partial \mathbf{L}}{\partial h_{1}} \frac{\partial h_{1}}{\partial h_{0}}
\end{align}$$


Lines 3-5 are expanded almost completely to show what the partial derivative develops into, since this pattern repeats at every time step. <br>
Note that the derivative of the activation function is applied **element-wise** to the internal state. <br>
Furthermore, we only computed the **main** backward flow; we mustn't forget the partial derivatives w.r.t. all the parameters, specifically: $\{W_x, W_y, W_h, b, b_y\}$
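The main backward flow through the hidden states can be sketched as a loop that walks from $h_T$ back to $h_0$, and checked against a finite-difference estimate. This is an illustration under assumptions: squared error stands in for $l$, gradients are column vectors (so $W_h^T$ multiplies from the left), and all sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d_in, d_h, d_out = 4, 3, 5, 2  # illustrative sizes (assumed)
W_x = rng.normal(scale=0.5, size=(d_h, d_in))
W_h = rng.normal(scale=0.5, size=(d_h, d_h))
W_y = rng.normal(scale=0.5, size=(d_out, d_h))
b, b_y = np.zeros(d_h), np.zeros(d_out)
xs = rng.normal(size=(T, d_in))
y_gt = rng.normal(size=d_out)

def forward(h0):
    """Return the loss, all hidden states [h_0, ..., h_T], and y_T."""
    hs = [h0]
    for x_t in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x_t + b))
    y_T = np.tanh(W_y @ hs[-1] + b_y)
    loss = 0.5 * np.sum((y_T - y_gt) ** 2)  # assumed loss: squared error
    return loss, hs, y_T

h0 = np.zeros(d_h)
loss, hs, y_T = forward(h0)

# dL/dh_T: loss gradient through the output tanh, then through W_y
dh = W_y.T @ ((y_T - y_gt) * (1.0 - y_T ** 2))

# Walk backwards: each pass turns dL/dh_t into dL/dh_{t-1}
for t in range(T, 0, -1):
    dh = W_h.T @ (dh * (1.0 - hs[t] ** 2))  # hs[t] = tanh(z_t)

# Finite-difference estimate of dL/dh_0 for comparison
eps = 1e-6
numeric = np.zeros(d_h)
for i in range(d_h):
    e = np.zeros(d_h)
    e[i] = eps
    numeric[i] = (forward(h0 + e)[0] - forward(h0 - e)[0]) / (2.0 * eps)

print(np.allclose(dh, numeric, atol=1e-6))
```

Each pass through the loop multiplies by $1 - \tanh^2(z_t)$ and by $W_h^T$, the same repeated factors that appear in the expanded chain rule.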




As usual in the back-propagation algorithm, we wish to determine how the loss changes w.r.t. each parameter, specifically:
$\{\frac{\partial L}{\partial W_x}, \frac{\partial L}{\partial W_y}, \frac{\partial L}{\partial W_h}, \frac{\partial L}{\partial b}, \frac{\partial L}{\partial b_y}\}$

Consider the partial derivative of the loss w.r.t. $W_x$ at time $T$ (the bar with its subscript marks the time step at which $W_x$ is used): 

$$\left. \frac{\partial L}{\partial W_x} \right|_{t=T} = \frac{\partial L}{\partial y_T} \cdot \frac{\partial y_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial z_T} \cdot \frac{\partial z_T}{\partial W_x}$$

Since $W_x$ is shared across time, we need to account for the change in the loss w.r.t. $W_x$ at every time step $t$ at which it was used: 

$$
\frac{\partial L}{\partial W_x} = \sum_{i=1}^T\left. \frac{\partial L}{\partial W_x}\right|_{t=i}
$$

The following table summarises the partial derivatives of the vanilla network:

<div align="center">

|**Derivative** | **Formula** | 
|---------------|-------------|
| $$\frac{\partial L}{\partial W_x}$$ | $$\sum_{i=1}^T\left. \frac{\partial L}{\partial W_x}\right\|_{t=i}$$ |
| $$\frac{\partial L}{\partial W_h}$$ | $$\sum_{i=1}^T\left. \frac{\partial L}{\partial W_h}\right\|_{t=i}$$ |
| $$\frac{\partial L}{\partial b}$$ | $$\sum_{i=1}^T\left. \frac{\partial L}{\partial b}\right\|_{t=i}$$ |
| $$\frac{\partial L}{\partial W_y}$$ | $$\frac{\partial L}{\partial y_T} \cdot \frac{\partial y_T}{\partial W_y}$$ |
| $$\frac{\partial L}{\partial b_y}$$ | $$\frac{\partial L}{\partial y_T} \cdot \frac{\partial y_T}{\partial b_y}$$ |
</div>
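The summed gradients can be sketched as an accumulation inside the backward loop: every time step adds its own contribution to the shared parameters, while the output parameters get a single contribution from time $T$. The code below is a minimal illustration under assumptions (squared error for $l$, made-up sizes, hypothetical helper names) and compares each analytic gradient with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_in, d_h, d_out = 4, 3, 5, 2  # illustrative sizes (assumed)
p = {
    "W_x": rng.normal(scale=0.5, size=(d_h, d_in)),
    "W_h": rng.normal(scale=0.5, size=(d_h, d_h)),
    "W_y": rng.normal(scale=0.5, size=(d_out, d_h)),
    "b": np.zeros(d_h),
    "b_y": np.zeros(d_out),
}
xs = rng.normal(size=(T, d_in))
y_gt = rng.normal(size=d_out)

def forward(p):
    hs = [np.zeros(d_h)]  # hs[t] holds h_t, with h_0 = 0
    for x_t in xs:
        hs.append(np.tanh(p["W_h"] @ hs[-1] + p["W_x"] @ x_t + p["b"]))
    y_T = np.tanh(p["W_y"] @ hs[-1] + p["b_y"])
    return 0.5 * np.sum((y_T - y_gt) ** 2), hs, y_T  # assumed squared error

def backward(p, hs, y_T):
    grads = {k: np.zeros_like(v) for k, v in p.items()}
    dy_pre = (y_T - y_gt) * (1.0 - y_T ** 2)  # gradient at the output score
    grads["W_y"] = np.outer(dy_pre, hs[-1])   # used once, at time T
    grads["b_y"] = dy_pre
    dh = p["W_y"].T @ dy_pre                  # dL/dh_T
    for t in range(T, 0, -1):
        dz = dh * (1.0 - hs[t] ** 2)          # dL/dz_t
        grads["W_x"] += np.outer(dz, xs[t - 1])  # contribution at time t
        grads["W_h"] += np.outer(dz, hs[t - 1])
        grads["b"] += dz
        dh = p["W_h"].T @ dz                  # dL/dh_{t-1}
    return grads

def numeric_grad(name, eps=1e-6):
    """Central finite differences, one parameter entry at a time."""
    g = np.zeros_like(p[name])
    for idx in np.ndindex(*p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + eps
        plus = forward(p)[0]
        p[name][idx] = old - eps
        minus = forward(p)[0]
        p[name][idx] = old  # restore the original value
        g[idx] = (plus - minus) / (2.0 * eps)
    return g

loss, hs, y_T = forward(p)
grads = backward(p, hs, y_T)
checks = {k: bool(np.allclose(grads[k], numeric_grad(k), atol=1e-6)) for k in p}
print(checks)
```

The `+=` lines are the sums over $t$ from the table; $W_y$ and $b_y$ are assigned once because in the many-to-one structure they are only used at time $T$.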