**CS596 - Machine Learning**
<br>
Date: **2 November 2020**


Title: **Lecture 9**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

Sources:
Bibliography: 
<br>[1] **Chapter 13**. Christopher M. Bishop, *Pattern Recognition and Machine Learning*, Springer, 2006.
<br>[2] https://medium.com/towards-artificial-intelligence/recurrent-neural-networks-for-dummies-8d2c4c725fbe
<br>[3] https://medium.com/@annikabrundyn1/the-beginners-guide-to-recurrent-neural-networks-and-text-generation-44a70c34067f
<br>[4] https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9
<br>[5] https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
<br>[6] https://www.youtube.com/watch?v=QciIcRxJvsM

<h1 align="center">Recurrent Neural Networks (RNN)</h1>

<img src="images/L9_Siri.jpg" width="600" alt="Example" />

- You asked Siri about the **weather today**, and it brilliantly resolved your queries.


- But **how did it happen**? How did it **convert your speech to text** and feed it to the search engine?


- This is the magic of **Recurrent Neural Networks** (**RNN**).

<h3 align="center">What is an RNN?</h3>

- RNNs **generalise** feedforward networks (FFNs) to be able to model **sequential data**.


- **FFNs** take an **input** (e.g. an image) and immediately **produce an output** (e.g. probabilities of different classes). 


- **RNNs**, on the other hand, **consider the data sequentially**, and **can remember** what they have seen earlier in the sequence to help interpret elements from later in the **sequence**.


- For example, imagine we want to **label words** as the **part-of-speech categories** that they belong to.

  I.e. for the **input sentence** `I would like the duck` and `He had to duck`, our model **should predict** that **duck** is a `Noun` in the **first sentence** and a `Verb` in the **second**. 
  
  To do this successfully, the **model needs to be aware of the surrounding context**. 
  
  However, if we **feed a FFN model** only **one word at a time**, how could it **know the difference**? 
  
  If we want to feed it all the words at once, **how do we deal** with the fact that **sentences are of different lengths**?

<h3 align="center">Where are RNNs Used?</h3>

- **Sequence data** comes in **many** other **forms**:

  - **Audio** (a natural sequence of audio frames, e.g. spectrogram slices);
  - **Stock market prices** (numerical time series);
  - **Genomes**;
  - **Videos** (sequence of images)
  
  
 - **RNNs** can operate over sequences of vectors in both the input and the output.
 
 
 - The **many forms** of **sequence prediction problems** are probably best **described by** the types of **inputs** and **outputs** supported:
 
   1. **One-to-one**: Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. **image classification**).
  
   2. **One-to-many**: Sequence output (e.g. **image captioning** takes an image and outputs a sentence of words).

   3. **Many-to-one**: Sequence input (e.g. **sentiment analysis**, where a given sentence is classified as expressing positive or negative sentiment, or **next-character prediction**, where the model predicts the next character given some text).

   4. **Many-to-many**: Sequence input and sequence output (e.g. **Machine Translation**: an RNN reads a sentence in English and then outputs a sentence in French).

   5. **Many-to-many**: Synced sequence input and output (e.g. **video classification** where we wish to label each frame of the video).
   
   <img src="images/L9_RNN_IO.jpeg" width="900" alt="Example" />


<h3 align="center">Sequential Data</h3>

- **Sequential Data** is any kind of data where the **order matters**.


- There are **two types** of **sequential distributions**:
  - **stationary sequential distributions**, when the data evolves in time, but the distribution from which it is generated remains the same.
  - **nonstationary sequential distributions**, when the generative distribution itself is evolving with time.


- We shall focus on the stationary case only.
  
  <img src="images/L9_Sequential_Data.png" width="400" alt="Example" />




<h3 align="center">Markov Models</h3>

- The **easiest way** to treat **sequential data** would be simply to ignore the sequential aspects and **treat the observations** as **i.i.d.**.

  However, this would **fail to exploit the sequential patterns in the data**, such as correlations between observations that are close in the sequence.


- For example, suppose we observe a **binary variable** denoting whether on a particular day it **rained or not**.

  Given a time series of recent observations of this variable, we **wish to predict whether it will rain on the next day**.
  
  If we **treat the data as i.i.d.**, then the **only information** we can glean from the data is the **relative frequency of rainy days**.
  
  However, we know in practice that the weather often exhibits trends that may last for several days.
  
  Observing whether or not it rains today is therefore of significant help in predicting if it will rain tomorrow.
  
  
- One way to **express such effects** in a probabilistic model is to consider a **Markov model**.


- Let's use the product rule to express the joint distribution for a sequence of observations in the form:

  $$p(\mathbf{x}_1, ..., \mathbf{x}_N ) = \prod_{n=1}^{N} p(\mathbf{x}_n | \mathbf{x}_1, ..., \mathbf{x}_{n-1}).$$
  

- If we now assume that **each of the conditional distributions** on the right-hand side
is **independent of all previous observations except the most recent**, we obtain the **first-order Markov chain**:

  $$p(\mathbf{x}_1, ..., \mathbf{x}_N ) = p(\mathbf{x}_1) \prod_{n=2}^{N} p(\mathbf{x}_n | \mathbf{x}_{n-1}).$$

  <img src="images/L9_MC1.png" width="600" alt="Example" />

  If the observations are **discrete variables having $K$ states**, then the conditional distribution $p(\mathbf{x}_n| \mathbf{x}_{n-1})$ in a **first-order Markov chain** will be specified by a set of $K - 1$ parameters for each of the $K$ states of $\mathbf{x}_{n-1}$, giving a **total of $K(K - 1)$ parameters**.
  

- If we allow the **predictions** to **depend also on the previous-but-one value**, we obtain a **second-order Markov chain**:

  $$p(\mathbf{x}_1, ..., \mathbf{x}_N ) = p(\mathbf{x}_1) p(\mathbf{x}_2| \mathbf{x}_1) \prod_{n=3}^{N} p(\mathbf{x}_n | \mathbf{x}_{n-1}, \mathbf{x}_{n-2}).$$

  <img src="images/L9_MC2.png" width="600" alt="Example" />


- We can similarly **consider** extensions to an **$M^{th}$ order Markov chain** in which the **conditional distribution** for a particular variable **depends on the previous $M$ variables**.

  If the observations are **discrete variables having $K$ states**, then the conditional distribution $p(\mathbf{x}_n| \mathbf{x}_{n-M}, ..., \mathbf{x}_{n-1})$ in an **$M^{th}$-order Markov chain** will be specified by a **total of** $K^{M}(K - 1)$ **parameters**: $K - 1$ for each of the $K^{M}$ joint states of the previous $M$ variables, which reduces to $K(K - 1)$ for $M = 1$.
 


- **Note** that we have **paid a price** for this increased flexibility because the **number of parameters** in the model **grows exponentially** with $M$.
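
To make the rain example concrete, here is a minimal sketch that estimates the transition probabilities $p(\mathbf{x}_n | \mathbf{x}_{n-1})$ of a first-order Markov chain by counting. The function name `estimate_transitions` and the toy rain record are assumptions made for illustration only.

```python
from collections import Counter

def estimate_transitions(seq, K):
    """Maximum-likelihood estimate of p(x_n | x_{n-1}) for a first-order Markov chain."""
    counts = Counter(zip(seq[:-1], seq[1:]))   # count (previous, current) pairs
    table = []
    for prev in range(K):
        row_total = sum(counts[(prev, cur)] for cur in range(K))
        # Fall back to a uniform row if a state was never visited.
        row = [counts[(prev, cur)] / row_total if row_total else 1.0 / K
               for cur in range(K)]
        table.append(row)
    return table

# Toy record (made-up data): 1 = rain, 0 = dry.
rain = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
P = estimate_transitions(rain, K=2)

print("i.i.d. estimate, p(rain)      =", sum(rain) / len(rain))
print("p(rain tomorrow | dry today)  =", P[0][1])
print("p(rain tomorrow | rain today) =", P[1][1])
```

On this toy record the i.i.d. model gives a single rain frequency, while the Markov chain gives a noticeably higher rain probability after a rainy day than after a dry one, which is the kind of trend the i.i.d. assumption cannot capture.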

<h3 align="center">State Space Model</h3>

- Suppose we **wish to build a model** for sequences that is **not limited by the Markov assumption** to **any order** and yet that **can be specified** using a **limited number of free parameters**.


- We can achieve this by **introducing additional latent variables** to permit a rich class of models to be constructed out of simple components.


- For each **observation** $\mathbf{x}_n$, we introduce a corresponding **latent variable** $\mathbf{z}_n$.


- Assume that it is the **latent variables** $\mathbf{z}_n$ that **form a Markov chain**, giving rise to the graphical structure known as a **State Space Model**.


- The **joint distribution** for this **model** is given by:

  $$p(\mathbf{x}_1, ..., \mathbf{x}_N, \mathbf{z}_1, ..., \mathbf{z}_N ) = p(\mathbf{z}_1) \left [ \prod_{n=2}^{N} p(\mathbf{z}_n| \mathbf{z}_{n-1}) \right ]  \prod_{n=1}^{N} p(\mathbf{x}_n | \mathbf{z}_n).$$
  
  <img src="images/L9_SSM.png" width="600" alt="Example" />
  
  
- There are **two important models** for sequential data that are described by this graph:
  - If the **latent variables are discrete**, then we obtain the **Hidden Markov Model (HMM)**;
  - If **both the latent** and the **observed variables are Gaussian**, then we obtain the **Linear Dynamical System**. 
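
The factorization above can be read as a sampling recipe: draw $\mathbf{z}_1$, then each $\mathbf{z}_n$ from its predecessor, and each $\mathbf{x}_n$ from its own latent variable. The sketch below does this for a small HMM and evaluates the log of the joint distribution term by term. It assumes discrete observations as well as discrete latent states, and the names `pi`, `A`, `B` and their values are made up for illustration.

```python
import numpy as np

def sample_hmm(pi, A, B, N, rng):
    """Ancestral sampling from p(z_1) * prod p(z_n | z_{n-1}) * prod p(x_n | z_n)."""
    z = np.empty(N, dtype=int)
    x = np.empty(N, dtype=int)
    z[0] = rng.choice(len(pi), p=pi)                   # z_1 ~ p(z_1)
    x[0] = rng.choice(B.shape[1], p=B[z[0]])           # x_1 ~ p(x_1 | z_1)
    for n in range(1, N):
        z[n] = rng.choice(A.shape[1], p=A[z[n - 1]])   # z_n ~ p(z_n | z_{n-1})
        x[n] = rng.choice(B.shape[1], p=B[z[n]])       # x_n ~ p(x_n | z_n)
    return z, x

def log_joint(z, x, pi, A, B):
    """log p(x_1..x_N, z_1..z_N), one term per factor of the joint distribution."""
    lp = np.log(pi[z[0]])
    lp += sum(np.log(A[z[n - 1], z[n]]) for n in range(1, len(z)))
    lp += sum(np.log(B[z[n], x[n]]) for n in range(len(z)))
    return lp

pi = np.array([0.6, 0.4])                  # p(z_1)
A = np.array([[0.8, 0.2], [0.3, 0.7]])     # A[i, j] = p(z_n = j | z_{n-1} = i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[i, k] = p(x_n = k | z_n = i)

rng = np.random.default_rng(0)
z, x = sample_hmm(pi, A, B, N=10, rng=rng)
print("latent  :", z)
print("observed:", x)
print("log joint:", log_joint(z, x, pi, A, B))
```

The number of free parameters here does not depend on the sequence length, yet the observations are not limited to a Markov dependence of any fixed order.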

<h3 align="center">Sequential Memory</h3>

- OK, so **RNNs** are **good at processing sequence data** for predictions. **But how?**


- Well, they do that by having a concept commonly called **sequential memory**. 


- To get a good intuition for what sequential memory means, try to **say the alphabet** in your head.

  <img src="images/L9_ABC.png" width="600" alt="Example" />

  That was pretty easy, right? If you were **taught this specific sequence**, it should come quickly to you.

  Now try **saying the alphabet backward**:
  
  <img src="images/L9_CBA.png" width="600" alt="Example" />

  Much harder, isn't it? Unless you’ve practiced this specific sequence before, you’ll likely have a hard time.

  Here’s a fun one: **start at the letter F**.

  <img src="images/L9_F.png" width="600" alt="Example" />

  At first, you’ll struggle with the first few letters, but then after your brain picks up the pattern, the rest will come naturally.
  
  
- **Sequential memory** is a **mechanism** that makes it easier for your brain to **recognize sequence patterns**.

<h3 align="center">Recurrent Neural Networks</h3>

- Alright, **RNNs** have this abstract concept of sequential memory, but **how does** an **RNN replicate this concept**?


- Well, let’s look at a traditional neural network also known as a **Feed-Forward Neural Network** (FFN).

  <img src="images/L9_FFN.png" width="150" alt="Example" />
  
- How do we get an **FFN** to be able to **use previous information** to affect later steps? 

  What if we **add a loop** in the neural network that can **pass prior information forward**?

  And that’s essentially what a **recurrent neural network does**. 
  
  <img src="images/L9_RNN.png" width="130" alt="Example" />


- An **RNN** has a **looping mechanism** that acts as a **highway** to allow information to **flow from one step to the next**.

  This information is the **hidden state**, which is a **representation of previous inputs**. 
  
  <img src="images/L9_HRNN.gif" width="300" alt="Example" />


<h3 align="center">What time is it?</h3>

- Let’s say we want to **build a chatbot**.


- The **chatbot** should **classify intentions** from the **user's input text**.


- First, we are going to **encode the sequence of text using an RNN**. 


- Then, we are going to **feed the RNN output** into a **feed-forward neural network** which will **classify the intents**.


- Let's assume that the **user types** in: **what time is it?**.


- **To start**, we **break up the sentence into individual words**:

  <img src="images/L9_What.gif" width="600" alt="Example" />
  
- The **first step** is to feed **What** into the **RNN**. 


- The RNN encodes **What** and produces an **output**.
  
  <img src="images/L9_What2.gif" width="600" alt="Example" />
  
- For the **next step**, we feed the word **time** and the **hidden state** from the **previous step**. 


- The **RNN** now **has information** on both the word **What** and **time**.
    
  <img src="images/L9_What3.gif" width="600" alt="Example" />
  
- We **repeat this process, until the final step**. 


- You can see **by the final step** the **RNN has encoded information** from **all the words** in **previous steps**.
      
  <img src="images/L9_What4.gif" width="600" alt="Example" />
  
- Since the **final output** was created from the rest of the sequence, we should be able to **take the final output** and **pass it** to the **feed-forward layer** to classify an intent.
        
  <img src="images/L9_What5.gif" width="600" alt="Example" />
  
  
- For those **who like looking at code**, here is some **Python pseudocode** showcasing the control flow.

  <img src="images/L9_What6.png" width="600" alt="Example" />
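
The control flow described in this section can also be written as a small runnable sketch. This is only an illustration under stated assumptions: the vocabulary, the layer sizes, the parameter names (`E`, `W_xh`, `W_hh`, `W_out`) and the three intent classes are hypothetical, and the weights are random rather than trained, so the predicted probabilities carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"what": 0, "time": 1, "is": 2, "it": 3, "?": 4}
embed_dim, hidden_dim, num_intents = 8, 4, 3

# Randomly initialised (untrained) parameters; all names are illustrative.
E = rng.normal(scale=0.1, size=(len(vocab), embed_dim))        # word embeddings
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))     # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))    # hidden -> hidden (the loop)
b_h = np.zeros(hidden_dim)
W_out = rng.normal(scale=0.1, size=(num_intents, hidden_dim))  # feed-forward classifier
b_out = np.zeros(num_intents)

def rnn_step(x, h_prev):
    """One recurrent step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Feed the words one at a time, carrying the hidden state forward.
hidden_state = np.zeros(hidden_dim)
for word in ["what", "time", "is", "it", "?"]:
    hidden_state = rnn_step(E[vocab[word]], hidden_state)

# The final hidden state summarises the whole sentence and goes to the classifier.
intent_probs = softmax(W_out @ hidden_state + b_out)
print(intent_probs)
```

Note that the same `rnn_step` and the same weights are reused at every position, which is why the loop works for sentences of any length.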


<h3 align="center">Vanishing Gradients Problem</h3>

- **How long can the sequence be**?

  Theoretically, sequences **can be infinitely long**, but in practice we **run into a problem**. 


- Let's consider a simple **example of RNN** with **no hidden units** but with a **recurrence on some scalar** $x^{(0)}$.


  <img src="images/L9_Sequence.png" width="900" alt="Example" />


- After $n$ time steps its value would be $x^{(n)}$, so we can consider a discrete dynamical system:

  $$x^{(n)} = W^n x^{(0)},$$
 
  where $W \in \mathbb{R}$ and $x^{(i)} \in \mathbb{R}$ for $i = 0, ..., n$.
 
 
- Now, if $W$ is **slightly greater than one**, then $W^{n}x^{(0)}$ would **explode** as $n$ grows, and if $W$ is **slightly less than one**, then $W^{n}x^{(0)}$ would **vanish**:

  $$W^{n}x^{(0)} \rightarrow
  \left\{\begin{matrix}
  \infty, & \text { if } W > 1,\\
  0, &\text { if } W < 1.\\
  \end{matrix}\right.$$
  

- Because the **forward propagated** values **explode** or **vanish**, the **same will happen** to their **gradients** in **backpropagation**:

  $$\frac{\partial \left( W^{n}x^{(0)} \right)}{\partial W} = n W^{n-1} x^{(0)} \rightarrow
  \left\{\begin{matrix}
  \infty, & \text { if } W > 1,\\
  0, &\text { if } W < 1.\\
  \end{matrix}\right.$$
  
- This can be **generalized to matrices** as well!
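
A quick numerical check of the scalar case is sketched below. The helper names `forward` and `grad_wrt_W` are hypothetical, and the gradient uses the closed form $n W^{n-1} x^{(0)}$ obtained by differentiating $W^n x^{(0)}$ with respect to $W$.

```python
def forward(W, x0, n):
    """x^(n) = W^n * x^(0) for a scalar recurrence."""
    return W**n * x0

def grad_wrt_W(W, x0, n):
    """d/dW (W^n * x^(0)) = n * W^(n-1) * x^(0)."""
    return n * W**(n - 1) * x0

for W in (1.1, 0.9):
    for n in (10, 100, 1000):
        value = forward(W, 1.0, n)
        grad = grad_wrt_W(W, 1.0, n)
        print(f"W={W}, n={n:4d}: value={value:.3e}, gradient={grad:.3e}")
```

Even with $W$ only ten percent away from one, both the value and its gradient differ by dozens of orders of magnitude between the two cases once $n$ reaches a thousand steps.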


<h3 align="center">Vanishing Gradients Problem for RNN</h3>

- The problem of **vanishing** and **exploding gradients** is **much worse** in an **RNN** than it is in a traditional **DNN**.


- This is because a **DNN** has **different weight matrices** between layers.


- Thus, if the **weights** between the **first two layers** are **greater than one** ($>1$), then the **next layer** can have **weights** that are **less than one**, and so their effects can **cancel each other out**.

  <img src="images/L9_Layers.png" width="900" alt="Example" />


- But in the case of an **RNN**, the **same weight parameters** occur between different recurrent units.

- So it's **more of a problem** because the effects **cannot cancel out**.

  <img src="images/L9_Layers2.png" width="900" alt="Example" />
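
The contrast between the two cases can be illustrated with a deliberately simplified linear sketch (no activation functions, which is an assumption made to keep the arithmetic transparent). The matrices `W_big` and `W_small` are hypothetical: each is a rotation scaled by a gain above or below one.

```python
import numpy as np

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

Q = rotation(0.3)            # orthogonal matrix: preserves vector length
W_big = 1.5 * Q              # scales length by 1.5
W_small = (1 / 1.5) * Q      # scales length by 1 / 1.5

x0 = np.array([1.0, 0.0])
steps = 20

# "DNN" case: a different matrix per layer, alternating large and small gains.
x_dnn = x0.copy()
for layer in range(steps):
    W = W_big if layer % 2 == 0 else W_small
    x_dnn = W @ x_dnn

# "RNN" case: the same matrix is applied at every time step.
x_rnn = x0.copy()
for t in range(steps):
    x_rnn = W_big @ x_rnn

print("different weights per layer:", np.linalg.norm(x_dnn))   # stays close to 1
print("same weights at every step :", np.linalg.norm(x_rnn))   # grows like 1.5**steps
```

With different matrices the large and small gains offset each other, whereas the shared matrix compounds the same gain at every step.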


<h3 align="center">Ways of dealing with Vanishing Gradients Problem for RNN</h3>


- There are several ways of dealing with the problem of **vanishing** and **exploding gradients**:

  1. **Introduce Skip Connections**.
  
     We can add additional edges called **skip connections** that connect each state to the state $d$ time steps ahead of it.
     
     So the current state is influenced by the previous state and by a state that occurred $d$ time steps ago.
     
     Gradients will now explode or vanish as a function of $\frac{\tau}{d}$ instead of just a function of $\tau$, where $\tau$ is the number of time steps.  
     
     <img src="images/L9_Layers3.png" width="900" alt="Example" />

  2. **Replace Length 1 Connections**.
  
     We can actively remove connections of length one and replace them with longer connections.
     
     This forces the network to learn along this modified path. 

     <img src="images/L9_Layers4.png" width="900" alt="Example" />
     
  3. **Leaky Recurrent Units**.
  
     Let's consider the **vanilla RNN** but this time attach a constant $\alpha$ to every edge joining adjacent hidden units.
     
     Thus $\alpha$ can regulate the amount of information the network remembers over time.
     
     If $\alpha \approx 1$, more memory is retained. If $\alpha \approx 0$, the memory of previous states is forgotten.
    
     <img src="images/L9_Layers5.png" width="900" alt="Example" />

  
  4. **Gated Recurrent Networks**.
  
     A **modification** of the **leaky hidden units** is the **Gated Recurrent Network**.
     
     Instead of **manually assigning** a constant **value** $\alpha$ to determine what to retain, we **introduce a set of parameters** one for every time step.
     
     So we **leave it up to the network** to decide **what to remember** and **what to forget** by introducing new parameters that act as gates. 
     
     <img src="images/L9_Layers6.png" width="900" alt="Example" />

  5. **LSTM**.
  
     One of the most commonly used gated RNN architectures is **Long Short Term Memory** (**LSTM**).
     
     Take a **vanilla RNN**, **replace all hidden units** with something called an **LSTM Cell**, and **add another connection** from every cell called the **cell state**.
     
     <img src="images/L9_Layers7.png" width="900" alt="Example" />
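
The leaky recurrent unit from item 3 can be sketched as a running average. The update rule used below, $h_t = \alpha h_{t-1} + (1 - \alpha) x_t$, is one common formulation and is an assumption here, since the notes only say that $\alpha$ sits on the edge between adjacent hidden units. `leaky_unit` is a hypothetical helper name.

```python
def leaky_unit(inputs, alpha, h0=0.0):
    """Running average h_t = alpha * h_{t-1} + (1 - alpha) * x_t (assumed form)."""
    h, history = h0, []
    for x in inputs:
        h = alpha * h + (1 - alpha) * x
        history.append(h)
    return history

impulse = [1.0] + [0.0] * 9              # one input at the first step, then silence

slow = leaky_unit(impulse, alpha=0.9)    # alpha near 1: the impulse fades slowly
fast = leaky_unit(impulse, alpha=0.1)    # alpha near 0: the impulse is forgotten almost at once

# Fraction of the initial response that is still present nine steps later.
print("alpha = 0.9:", slow[-1] / slow[0])
print("alpha = 0.1:", fast[-1] / fast[0])
```

After the impulse, the state simply decays by a factor of $\alpha$ per step, so the retained fraction after nine steps is $\alpha^9$.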
     

<h3 align="center">LSTM</h3>

- An **LSTM** has a **control flow similar** to that of an **RNN**. 


- It **processes data** passing on information as it **propagates forward**. 


- The **differences** are the operations within the **LSTM’s cells** and its **various gates**.


- These **operations** are **used to allow** the **LSTM** to **keep** or **forget information**.


- The **cell state** acts as a **transport highway** that **transfers relevant information** all the way down the sequence chain.
    

- The **cell state**, in theory, can **carry relevant information throughout the processing of the sequence**.
  

- So even information from the earlier time steps can make its way to later time steps, reducing the effects of short-term memory. 
  

- As the cell state goes on its journey, **information** gets **added** to or **removed** from the cell state **via gates**. 


- The **gates** are different neural networks that **decide which information is allowed on the cell state**. 


- The **gates** can learn what **information** is **relevant to keep** or **forget** during training.

  <img src="images/L9_LSTM.png" width="800" alt="Example" />

<h3 align="center">Gate Activation Function</h3>

- **Gates** may contain a **sigmoid activation function** that squishes values between $0$ and $1$.


- That is **helpful** to **update** or **forget data** because any number multiplied by $0$ is $0$, causing values to disappear or be **forgotten**.


- Any number multiplied by $1$ keeps the same value, so that value stays the same or is **kept**. 


  <img src="images/L9_Sigmoid.gif" width="800" alt="Example" />


- **Gates** may also contain a **tanh activation function** that squishes values to always be between $-1$ and $1$.


- The **tanh activation function** is **used** to help **regulate** the **values** flowing through the network.

  <img src="images/L9_Tanh.gif" width="800" alt="Example" />
  
  
 
- Suppose the **value** is **multiplied by 3** every time it **passes** through the **Hidden layer**. 

  As we can see below, the **values can explode** and become astronomical, **causing other values to seem insignificant**:

  <img src="images/L9_Tanh2.gif" width="800" alt="Example" />


- A **tanh function ensures** that the **values stay between** $-1$ and $1$, thus **regulating** the output of the **neural network**. 

  <img src="images/L9_Tanh3.gif" width="800" alt="Example" />
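
The effect of the two activation functions can be checked with a few lines of plain Python. This is a small sketch of the multiply-by-3 example above; the variable names `raw` and `squashed` are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

raw, squashed = 1.0, 1.0
for step in range(10):
    raw = 3.0 * raw                        # unbounded: triples at every step
    squashed = math.tanh(3.0 * squashed)   # bounded: stays inside (-1, 1)

print("without tanh:", raw)
print("with tanh   :", squashed)

# Sigmoid maps large negative inputs towards 0 (forget) and large positive ones towards 1 (keep).
print(sigmoid(-10.0), sigmoid(0.0), sigmoid(10.0))
```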


<h3 align="center">Forget Gate Layer</h3>

- First, we have the **Forget Gate Layer**. 


- This gate **decides what information** should be **thrown** away or **kept**. 


- **Information** from the **previous hidden state** ($h_{t-1}$) and **information** from the **current input** ($x_t$) are **passed** through the **sigmoid function**:

  $$f_t = \sigma(W_f \cdot [h_{t-1},  x_t] + b_f).$$


- **Values** come out **between** $0$ and $1$. 


- Values **closer to** $0$ mean **forget**, and values **closer to** $1$ mean **keep**.


  <img src="images/L9_Forget_Gate.gif" width="800" alt="Example" />
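
The forget gate formula can be sketched directly in NumPy. The dimensions and the random weights below are assumptions chosen for illustration; `forget_gate` is a hypothetical helper name, and $[h_{t-1}, x_t]$ is implemented as vector concatenation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigma(W_f . [h_{t-1}, x_t] + b_f)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    return sigmoid(W_f @ concat + b_f)

rng = np.random.default_rng(0)
hidden_dim, input_dim = 3, 2
W_f = rng.normal(size=(hidden_dim, hidden_dim + input_dim))
b_f = np.zeros(hidden_dim)

f_t = forget_gate(np.zeros(hidden_dim), np.ones(input_dim), W_f, b_f)
print(f_t)   # one value between 0 and 1 per cell-state entry
```

The gate produces one number per entry of the cell state, so each entry can be kept or forgotten independently.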



  

<h3 align="center">Input Gate Layer</h3>

- To update the cell state, we have the **Input Gate Layer**. 


- First, we **pass** the **previous hidden state** ($h_{t-1}$) and **current input** ($x_t$) into a **sigmoid function**. 


- That **decides** which **values** will be **updated** by transforming the values to be between $0$ and $1$.


- $0$ means **not important**, and $1$ means **important**. 


- We also **pass** the **previous hidden state**  ($h_{t-1}$) and **current input** ($x_t$) into the **tanh activation function** to squish values between $-1$ and $1$.


- Then we **multiply** the **tanh output** with the **sigmoid output**, which **decides which information** is **important to keep from** the **tanh output**:


  $$\begin{matrix}
  i_t = & \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),\\ 
  \tilde{C_t} = & \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
  \end{matrix}
   $$


  <img src="images/L9_Input_Gate.gif" width="800" alt="Example" />
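
A minimal sketch of the two quantities computed in this layer, $i_t$ and $\tilde{C_t}$, is given below. The helper name `input_gate`, the dimensions and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def input_gate(h_prev, x_t, W_i, b_i, W_C, b_C):
    """Return i_t (which entries to update) and the candidate values C~_t."""
    concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ concat + b_i)           # entries between 0 and 1
    C_tilde = np.tanh(W_C @ concat + b_C)       # entries between -1 and 1
    return i_t, C_tilde

rng = np.random.default_rng(1)
hidden_dim, input_dim = 3, 2
shape = (hidden_dim, hidden_dim + input_dim)

i_t, C_tilde = input_gate(
    h_prev=rng.normal(size=hidden_dim),
    x_t=rng.normal(size=input_dim),
    W_i=rng.normal(size=shape), b_i=np.zeros(hidden_dim),
    W_C=rng.normal(size=shape), b_C=np.zeros(hidden_dim),
)
print(i_t * C_tilde)   # the part of the candidate that will be written to the cell state
```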



  

<h3 align="center">Cell State</h3>

- Now we should have **enough information** to **calculate** the **Cell State**. 


- First, the **cell state** gets **pointwise multiplied** by the **forget vector**. 


- Then we **take the output** from the **input gate** and do a **pointwise addition** which **updates** the **cell state** to **new values**:

  $$C_t =  f_t \times C_{t-1} + i_t \times \tilde{C_t}.$$


  <img src="images/L9_Cell_Gate.gif" width="800" alt="Example" />
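
A small worked example of the cell state update, with hand-picked gate values (all numbers below are made up for illustration, and `update_cell_state` is a hypothetical helper name):

```python
import numpy as np

def update_cell_state(C_prev, f_t, i_t, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t, with all products taken elementwise."""
    return f_t * C_prev + i_t * C_tilde

C_prev  = np.array([2.0, -3.0, 4.0])
f_t     = np.array([1.0,  0.0, 0.5])   # keep, forget, keep half
i_t     = np.array([0.0,  1.0, 0.5])   # write nothing, write fully, write half
C_tilde = np.array([0.9, -0.8, 0.2])   # candidate values

C_t = update_cell_state(C_prev, f_t, i_t, C_tilde)
print(C_t)   # approximately [2.0, -0.8, 2.1]
```

The first entry is carried over untouched, the second is erased and replaced by the candidate, and the third is a blend of old and new.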



  

<h3 align="center">Output Gate Layer</h3>

- Last, we have the **Output Gate Layer**.


- The **output gate decides** what the **next hidden state should be**. Remember that the **hidden state contains information** on **previous inputs**. 


- The **hidden state** is also **used for predictions**. 


- First, we **pass** the **previous hidden state** ($h_{t-1}$) and the **current input** ($x_t$) into a **sigmoid function**. 


- Then we **pass** the **newly modified cell state** ($C_t$) to the **tanh function**. 


- Finally, we **multiply** the **tanh output** with the **sigmoid output** to **decide** what information the **hidden state** should carry. 

  $$\begin{matrix}
  o_t = & \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),\\ 
  h_t = & o_t \times \tanh(C_t).
  \end{matrix}
   $$

- The **output** is the **new hidden state** ($h_t$). 


- The **new cell state** ($C_t$) and the **new hidden state** ($h_t$) are then **carried over to the next time step**.

  <img src="images/L9_Output_Gate.gif" width="800" alt="Example" />  
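
Putting the four layers together, one full LSTM time step can be sketched as follows. This is an illustrative NumPy version of the equations in these notes, not the implementation of any particular library; the parameter dictionary, the dimensions and the random weights are assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step; p maps names to weights (W_*) and biases (b_*)."""
    concat = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ concat + p["b_f"])        # forget gate
    i_t = sigmoid(p["W_i"] @ concat + p["b_i"])        # input gate
    C_tilde = np.tanh(p["W_C"] @ concat + p["b_C"])    # candidate cell values
    C_t = f_t * C_prev + i_t * C_tilde                 # new cell state
    o_t = sigmoid(p["W_o"] @ concat + p["b_o"])        # output gate
    h_t = o_t * np.tanh(C_t)                           # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
params = {}
for gate in "fiCo":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

# Run a short random sequence, carrying h and C from step to step.
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, C = lstm_step(x_t, h, C, params)

print("hidden state:", h)
print("cell state  :", C)
```

Because $h_t$ is a sigmoid output times a tanh output, every entry of the hidden state stays strictly between $-1$ and $1$, while the cell state is not bounded in the same way.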

<h3 align="center">GRU</h3>

- The **GRU** is the **newer generation** of **RNN** and is **pretty similar** to an **LSTM**. 

- **GRUs** get rid of the **cell state** and use the **hidden state** to **transfer information**. 


- A GRU also **has only two gates**, a **Reset Gate** and an **Update Gate**.


- **Reset Gate**:
  
  The **reset gate** is used to **decide how much past information to forget**.
  
  $$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
  
- **Update Gate**:

  The **update gate** acts **similarly** to the **forget** and **input gates** of an **LSTM**. 
  
  It **decides what information** to **throw away** and **what new information** to **add**.
  
  $$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$


- So the **output** is the **new hidden state** ($h_t$):

  $$h_t = (1-z_t) \times h_{t-1} + z_t \times \tilde{h_t},$$
  
  where

  $$\tilde{h_t} = \tanh(W \cdot [r_t \times h_{t-1}, x_t]).$$



- **GRUs** have **fewer tensor operations**; therefore, they are a **little speedier** to train than **LSTMs**.


- **There isn’t a clear winner which one is better!** Researchers and engineers usually try both to determine which one works better for their use case.

  <img src="images/L9_GRU.png" width="600" alt="Example" />
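
The GRU equations translate into a short NumPy sketch. It follows the bias-free formulas in these notes; the helper name `gru_step`, the dimensions and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU time step following the equations above (no bias terms)."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                                  # reset gate
    z_t = sigmoid(W_z @ concat)                                  # update gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # new hidden state

rng = np.random.default_rng(0)
hidden_dim, input_dim = 4, 3
shape = (hidden_dim, hidden_dim + input_dim)
W_r, W_z, W = (rng.normal(scale=0.5, size=shape) for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, W_r, W_z, W)
print(h)
```

Compared with an LSTM step, there are three weight matrices instead of four and a single state vector to carry forward.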

<h3 align="center">What Next?</h3>

- **LSTM**s were a **big step** in what we can accomplish with RNNs.


- It’s natural to wonder: **is there another big step**? 

  A common opinion among researchers is: **Yes! There is a next step and it’s attention!** 
 
 
- The **idea** is to **let every step** of an RNN **pick information** to look at from some **larger collection** of information. 

  For example, if you are using an RNN to **create a caption describing an image**, it might **pick a part of the image** to look at for every **word it outputs**. 
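
As a rough sketch of this idea, the snippet below lets a query vector (standing in for the current RNN state) pick from a small collection of item vectors (standing in for image regions). Dot-product scoring followed by a softmax is one common choice and is an assumption here; the function name `attend` and all numbers are made up for illustration.

```python
import numpy as np

def attend(query, items):
    """Soft attention: weight each item by its similarity to the query."""
    scores = items @ query                     # one dot-product score per item
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax: weights sum to 1
    context = weights @ items                  # weighted average of the items
    return context, weights

regions = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.7, 0.7]])               # three hypothetical image regions
state = np.array([4.0, 0.0])                   # hypothetical current RNN state

context, weights = attend(state, regions)
print("attention weights:", weights)
print("most attended region:", weights.argmax())
```

At each output step a different state would produce different weights, so the model can look at a different part of the input for every word it emits.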


<h1 align="center">End of Lecture</h1>