# SEQUENCE MODELS

<h1 style="background:#DDEEEE;padding: 15px;">Recurrent Networks</h1>

- Variants: LSTM, GRU, bi-directional RNN, etc.
  
  
- **Sequence data**:
  - input `x` -> output `y`
  - not always the same length
  - not always both sequence data
  - training examples of varying lengths
  - a standard network does not share features learned across different positions in the text
    - ex: 'Hermione' in 1st position is a sign that it can be a person's name
    - a standard network cannot reuse this info when 'Hermione' appears in a different position
  
<img src="images/rnns-sequence-data-examples.png" width="550">

#### Vocabulary

#### One-hot encoding
- Vector with a single 1 at the word's vocabulary index and zeros everywhere else
- If the vocab size is 10,000, each word in a sentence is a vector of length 10,000
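
The encoding described above can be sketched in NumPy as follows. This is a minimal illustration: the toy vocabulary, the `one_hot` helper and the fallback to `<unk>` are assumptions made for the example, not part of the course material.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would hold around 10,000 words.
vocab = ["a", "apple", "hermione", "pear", "salad", "<unk>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(word_to_index))
    # Assumption: words missing from the vocabulary map to the <unk> token.
    idx = word_to_index.get(word, word_to_index["<unk>"])
    vec[idx] = 1.0
    return vec

x = one_hot("pear", word_to_index)  # length = vocabulary size
```

With a 10,000-word vocabulary the same function would return a vector of length 10,000, which is why one-hot vectors are usually sparse and large.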

### *Notation*

<img src="images/sequence-models_rnns/rnn-01-notation-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-01-notation-2.png" width="950">

### *Basic architecture*

<img src="images/sequence-models_rnns/rnn-02-architecture-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-02-architecture-2.png" width="950">


### *Forward propagation formulas*

<img src="images/sequence-models_rnns/rnn-03-forward-propagation-formulas-1.png" width="950">


### *Forward propagation architecture*

<img src="images/sequence-models_rnns/rnn-04-forward-propagation-architecture-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-04-forward-propagation-architecture-2.png" width="950">

<img src="images/sequence-models_rnns/rnn-04-forward-propagation-architecture-3.png" width="950">

<img src="images/sequence-models_rnns/rnn-04-forward-propagation-architecture-4.png" width="950">

### *Simplified forward propagation formula*

<img src="images/sequence-models_rnns/rnn-05-forward-propagation-formulas-simplified.png" width="950">


### *Forward propagation -> Loss -> Backpropagation*

<img src="images/sequence-models_rnns/rnn-06-forward-propagation-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-06-forward-propagation-2.png" width="950">

<img src="images/sequence-models_rnns/rnn-06-forward-propagation-3.png" width="950">

<img src="images/sequence-models_rnns/rnn-06-forward-propagation-4.png" width="950">

<img src="images/sequence-models_rnns/rnn-06-forward-propagation-5.png" width="950">

<img src="images/sequence-models_rnns/rnn-07-loss-function-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-07-loss-function-2.png" width="950">

<img src="images/sequence-models_rnns/rnn-08-backpropagation-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-08-backpropagation-2.png" width="950">


## *Types of RNNs*


<table align=center style="border:1px solid black">
    <tr>
        <th style="text-align:left; width: 150px;">Many to many</th>
        <td style="text-align:left; width: 280px;">
            • input = sequence<br/>• output = sequence
            <br/>• type 1: same length<br/>• type 2: different length
        </td>
        <td style="text-align:left; width: 300px;">
            .
        </td>
    </tr>
    <tr>
        <th style="text-align:left; width: 150px;">.</th>
        <td style="text-align:left; width: 280px;">
            • type 2: different length
            <br/>• encoder: reads source sentence
            <br/>• decoder: outputs target sentence
        </td>
        <td style="text-align:left; width: 300px;">
            machine translation, x = source, y = target
        </td>
    </tr>
    <tr>
        <th style="text-align:left; width: 150px;">Many to one</th>
        <td style="text-align:left; width: 280px;">
            • input = sequence<br/>• output = y_hat only at last time step
        </td>
        <td style="text-align:left; width: 300px;">
            sentiment classification, x = text, y = star | no star
        </td>
    </tr>
    <tr>
        <th style="text-align:left; width: 150px;">One to one</th>
        <td style="text-align:left; width: 280px;">
            • input = one value<br/>• output = one value<br/>• same as standard nn, not interesting
        </td>
        <td style="text-align:left; width: 300px;">
             .
        </td>
    </tr>
    <tr>
        <th style="text-align:left; width: 150px;">One to many</th>
        <td style="text-align:left; width: 280px;">
            • input = empty<br/>• output = sequence
            <br/>• reading input only at first time step
            <br/>• x input of next time steps is synthesized output from previous time step
        </td>
        <td style="text-align:left; width: 300px;">
            music generation, x = none, y = sequence of notes
        </td>
    </tr>
</table>

<br/>
<img src="images/rnns-types.jpg" width="750">

## *Language Modelling with RNNs*

- **Role**: given a sentence, tells you the probability of that particular sentence (e.g. of one candidate transcription)
- Fundamental component for Speech Recognition and Machine Translation (MT)
  - MT wants to output only sentences that are likely
    
    
- **Example**:
  - Speech Recognition: P(The apple and pair salad) VS P(The apple and pear salad)
  - Given a text, what is the probability that the next sentence would be a given sentence?
   
   
- **Input**: a sequence of words
- **Output**: probability of the input sequence of words
  
  
- **Training set**: large corpus of English text
- **Step 1**: tokenization -> vocabulary = index of one-hot vectors
  - `<eos>`: end-of-sentence token, very important, helps the system learn what a sentence is
  - `<unk>`: replaces words that are not in the vocab built from the training set
- **Step 2**: build an RNN to model the chances of each sequence
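
A rough sketch of the two steps, assuming a tiny hand-made vocabulary and a stand-in for the RNN's softmax outputs (the names `tokenize`, `sentence_log_prob` and the uniform probabilities are hypothetical). The sentence probability is the product of the per-step probabilities of each observed token, computed here as a sum of logs:

```python
import numpy as np

# Hypothetical vocabulary; a real one is built from the training corpus.
vocab = ["<eos>", "<unk>", "the", "apple", "and", "pear", "salad"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def tokenize(sentence, word_to_index):
    """Map words to vocabulary indices, appending <eos> and using <unk> for unknown words."""
    words = sentence.lower().split() + ["<eos>"]
    return [word_to_index.get(w, word_to_index["<unk>"]) for w in words]

def sentence_log_prob(token_ids, step_probs):
    """Sum of log P(token at step t | previous tokens).

    step_probs[t] stands for the softmax output of the RNN at time step t.
    """
    return float(sum(np.log(step_probs[t][tok]) for t, tok in enumerate(token_ids)))

ids = tokenize("The apple and pair salad", word_to_index)  # 'pair' becomes <unk>

# Placeholder for a trained model: uniform probabilities at every step.
uniform = np.full((len(ids), len(vocab)), 1.0 / len(vocab))
log_p = sentence_log_prob(ids, uniform)
```

A trained model would replace `uniform` with its own per-step predictions, so that likely sentences get a higher log-probability than unlikely ones.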

<img src="images/sequence-models_rnns/rnn-09-language-modelling-1.png" width="950">

<img src="images/sequence-models_rnns/rnn-09-language-modelling-2.png" width="950">

<img src="images/sequence-models_rnns/rnn-09-language-modelling-3.png" width="950">

<img src="images/sequence-models_rnns/rnn-09-language-modelling-4.png" width="950">

<img src="images/sequence-models_rnns/rnn-09-language-modelling-5.png" width="950">

<img src="images/sequence-models_rnns/rnn-09-language-modelling-6.png" width="950">

<img src="images/sequence-models_rnns/rnn-09-language-modelling-7.png" width="950">

## *Vanishing Gradients with RNNs*

- Gradients from the last layer have a very hard time propagating back to the first layers   
  -> Not very good at capturing very long-range dependencies
- Each step is mostly influenced by the local time steps, the ones that are very close
- Very difficult for an output to be influenced by very early inputs,    
  because it is difficult for the network to backpropagate the error from the end to very early time steps   
= One of the weaknesses of basic RNNs   
= Backpropagating through many time steps can cause vanishing or exploding gradients
   
   
### **Exploding gradient -> Gradient clipping**
  - shows up as `NaN` values, easy to spot, easier to solve than vanishing gradients
  - robust solution: gradient clipping    
    = clipping or rescaling the gradient vector so that its values don't exceed a chosen max value
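
One simple way to do this is element-wise clipping with `np.clip`. The sketch below assumes gradients are stored in a dictionary of NumPy arrays; the function name, the dictionary keys and the threshold are illustrative choices. Rescaling by the global norm is another common variant, not shown here.

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient entry to the range [-max_value, max_value]."""
    return {name: np.clip(grad, -max_value, max_value)
            for name, grad in gradients.items()}

# Hypothetical gradients with a few exploding entries.
gradients = {"dWaa": np.array([[-20.0, 0.5], [3.0, 100.0]])}
clipped = clip_gradients(gradients, max_value=10.0)
```

Entries already inside the range are left unchanged, so clipping only affects the exploding values.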


## *Gated Recurrent Unit (GRU)*

  - GRU is more recent than LSTM
  - Simpler, so easier to scale to very big networks
  - Neither is universally better: each works better on different problems
  - **reset gate**: decides how much past info should be forgotten
  
  
https://www.pluralsight.com/guides/lstm-versus-gru-units-in-rnn
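
A minimal NumPy sketch of one GRU time step, assuming the common formulation with an update gate and a reset gate. The parameter names (`Wu`, `Wr`, `Wc`, ...) and the shapes are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(xt, c_prev, params):
    """One GRU step. xt: (n_x, m), c_prev: (n_c, m)."""
    concat = np.concatenate([c_prev, xt], axis=0)            # (n_c + n_x, m)
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])  # update gate
    gamma_r = sigmoid(params["Wr"] @ concat + params["br"])  # reset gate
    # The reset gate scales how much of the past state feeds the candidate.
    concat_reset = np.concatenate([gamma_r * c_prev, xt], axis=0)
    c_tilde = np.tanh(params["Wc"] @ concat_reset + params["bc"])
    # The update gate mixes the candidate with the previous state.
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

rng = np.random.default_rng(0)
n_x, n_c, m = 3, 5, 4
params = {name: rng.standard_normal((n_c, n_c + n_x)) * 0.1
          for name in ("Wu", "Wr", "Wc")}
params.update({name: np.zeros((n_c, 1)) for name in ("bu", "br", "bc")})
c_next = gru_cell_forward(rng.standard_normal((n_x, m)), np.zeros((n_c, m)), params)
```

When the update gate is close to 0, the previous state is copied almost unchanged, which is what lets information survive over many time steps.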
   
   
<img src="images/sequence-models_rnns/rnn-10-gru-1.png" width="950">
<img src="images/sequence-models_rnns/rnn-10-gru-2.png" width="950">
<img src="images/sequence-models_rnns/rnn-10-gru-3.png" width="950">
<img src="images/sequence-models_rnns/rnn-10-gru-4.png" width="950">
<img src="images/sequence-models_rnns/rnn-10-gru-5.png" width="950">
<img src="images/sequence-models_rnns/rnn-10-gru-6.png" width="950">

## *Long Short-Term Memory (LSTM)*

- A long short-term memory (LSTM) network is a type of RNN model that mitigates the vanishing gradient problem by adding gates, including a 'forget' gate, that control what the memory cell keeps

<img src="images/sequence-models_rnns/rnn-11-lstm-1.png" width="950">

- **Forget** and **Update** gates make LSTM and even GRU good at memorizing **very long range** dependencies for many time steps
- Variant: **peephole connection** = sometimes we add `C<t-1>` to the gates computation (inside the brackets)    
  = the gate values then also depend on the previous memory cell value
  - gates = N-dim vectors for ex (N = nb units), `C<t-1>` also N-dim vector
  - 5th element of `C<t-1>` only affects the 5th element of the gates vectors
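
As a sketch of how the gates interact, here is one LSTM time step in NumPy, without the peephole variant. The parameter names and shapes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(xt, a_prev, c_prev, p):
    """One LSTM step. xt: (n_x, m), a_prev and c_prev: (n_a, m)."""
    concat = np.concatenate([a_prev, xt], axis=0)   # (n_a + n_x, m)
    gamma_f = sigmoid(p["Wf"] @ concat + p["bf"])   # forget gate
    gamma_u = sigmoid(p["Wu"] @ concat + p["bu"])   # update gate
    gamma_o = sigmoid(p["Wo"] @ concat + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ concat + p["bc"])   # candidate memory
    # Element-wise: entry k of each gate only affects entry k of the memory cell.
    c_next = gamma_f * c_prev + gamma_u * c_tilde
    a_next = gamma_o * np.tanh(c_next)
    return a_next, c_next

rng = np.random.default_rng(1)
n_x, n_a, m = 3, 5, 4
p = {name: rng.standard_normal((n_a, n_a + n_x)) * 0.1
     for name in ("Wf", "Wu", "Wo", "Wc")}
p.update({name: np.zeros((n_a, 1)) for name in ("bf", "bu", "bo", "bc")})
a_next, c_next = lstm_cell_forward(
    rng.standard_normal((n_x, m)), np.zeros((n_a, m)), np.zeros((n_a, m)), p)
```

A peephole version would additionally feed `c_prev` into the three gate computations.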

<img src="images/sequence-models_rnns/rnn-11-lstm-2.png" width="950">
<img src="images/sequence-models_rnns/rnn-11-lstm-3.png" width="950">
<img src="images/sequence-models_rnns/rnn-11-lstm-4.png" width="950">


## *Bi-directional RNNs*

- **Disadvantage**: you need the entire sequence of data before you can make a prediction
  - ex: speech recognition, you need to wait until the speaker finishes the sentence before you can start making predictions

<img src="images/sequence-models_rnns/rnn-12-bi-rnn-1.png" width="950">
<img src="images/sequence-models_rnns/rnn-12-bi-rnn-2.png" width="950">


## *Deep RNNs*

<img src="images/sequence-models_rnns/rnn-13-deep-rnn-1.png" width="950">
<img src="images/sequence-models_rnns/rnn-13-deep-rnn-2.png" width="950">
<img src="images/sequence-models_rnns/rnn-13-deep-rnn-3.png" width="950">
<img src="images/sequence-models_rnns/rnn-13-deep-rnn-4.png" width="950">
<img src="images/sequence-models_rnns/rnn-13-deep-rnn-5.png" width="950">


## *C5 W1 - Sequence models - Assignment 1: Building an RNN step by step*

$x^{2}$ superscript
$x_{2}$ subscript

<table align=left style="border:1px solid black">
    <tr>
        <td style="text-align:left; width: 100px;">$-^{[l]}$</td>
        <td style="text-align:left; width: 150px;">layer</td>
        <td style="text-align:center; width: 100px;" rowspan=4>
            $a^{(2)[3]<4>}_{5}$ 
        </td>
        <td style="text-align:left; width: 200px;" rowspan=4>
            denotes the activation of the:
            <br/>- 2nd training example
            <br/>- 3rd layer
            <br/>- 4th time step
            <br/>- 5th entry in the vector
        </td>
    </tr>
    <tr>
        <td style="text-align:left;">$-^{(i)}$</td>
        <td style="text-align:left;">example</td>
    </tr>
    <tr>
        <td style="text-align:left;">$-^{<t>}$</td>
        <td style="text-align:left;">time-step</td>
    </tr>
    <tr>
        <td style="text-align:left;">$-_{i}$</td>
        <td style="text-align:left;">vector entry</td>
    </tr>
</table>
  
  
<br/>
  
  
<table align=left style="border:1px solid black">
    <tr>
        <th style="text-align:left; width: 100px;">Notation</th>
        <th style="text-align:left; width: 350px;">Content</th>
        <th style="text-align:left; width: 400px;">Shape</th>
        <th style="text-align:left; width: 100px;">Variable name</th>
    </tr>
    <!---><--->
    <tbody>
    <tr style="background-color: #DFF2FF;">
        <th style="text-align:left;" colspan=4>Input $x$</th>
    </tr>
    <tr>
        <td style="text-align:left;">$x$</td>
        <td style="text-align:left;">
            Input fed into the RNN
            <br/>
            = all training examples or one mini-batch
        </td>
        <td style="text-align:left;">
            3-dimensional tensor
            <br/>($n_{x}$, $m$, $T_{x}$)
            <br/>(nb_units, mini_batch_size, nb_time_steps)
        </td>
        <td style="text-align:left;">     </td>
    </tr>
    <tr>
        <td style="text-align:left;">$x^{<t>}$</td>
        <td style="text-align:left;">Input used at time-step $t$</td>
        <td style="text-align:left;">
            2-dimensional tensor
            <br/>($n_{x}$, $m$)
            <br/>(nb_units, mini_batch_size)
        </td>
        <td style="text-align:left;"><code>xt</code></td>
    </tr>
    <tr>
        <td style="text-align:left;">$m$</td>
        <td style="text-align:left;">
            number of training examples
            <br/>or mini-batch size</td>
        <td style="text-align:left;">
            Tensor shape = (vocab-size, mini_batch_size, $T_{x}$)
            <br/>ex: (5000, 20, 10)
            <br/>vectorization: we stack 20 columns (20 $x^{(i)}$ examples)
        </td>
        <td style="text-align:left;">     </td>
    </tr>
    <tr>
        <td style="text-align:left;">$T_{x}$</td>
        <td style="text-align:left;">
            Number of time-steps in a single training example $x^{(i)}$
        </td>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">     </td>
    </tr>
    <tr>
        <td style="text-align:left;">$x^{(i)<t>}$</td>
        <td style="text-align:left;">
            $t^{th}$ time-step of $i^{th}$ input example
        </td>
        <td style="text-align:left;">
            (5000,)<br/>vocab size = 5000
            <br/>one-dimensional input vector
        </td>
        <td style="text-align:left;">     </td>
    </tr>
    <tr>
        <td style="text-align:left;">$n_{x}$</td>
        <td style="text-align:left;">
            Number of units in a single time-step of a single training example
        </td>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">     </td>
    </tr>
    </tbody>
    <!---><--->
    <tbody>
    <tr style="background-color: #DFF2FF;">
        <th style="text-align:left;" colspan=4>Hidden state $a$</th>
    </tr>
    <tr>
        <td style="text-align:left;">$a^{<t>}$</td>
        <td style="text-align:left;">
            A "hidden state": 
            <br/>the activation passed to the RNN from one time step to another
        </td>
        <td style="text-align:left;">
            ($n_{a}$, $m$)
            <br/>for a mini-batch of $m$ training examples
        </td>
        <td style="text-align:left;">
            <code>a_prev</code> or <code>a_next</code>
        </td>
    </tr>
    <tr>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">
            ($n_{a}$, $m$, $T_{x}$)
            <br/>when the hidden states of all time steps are stored
            <br/><br/>
            -> We loop through the time steps with index $t$, and work with the $a^{<t>}$ 2D slice of this 3D tensor
            <br/><br/>
            length $n_{a}$
            <br/>for a single training example
        </td>
        <td style="text-align:left;">     </td>
    </tr>
    </tbody>
    <!---><--->
    <tbody>
    <tr style="background-color: #DFF2FF;">
        <th style="text-align:left;" colspan=4>Prediction $\hat{y}$</th>
    </tr>
    <tr>
        <td style="text-align:left;">$\hat{y}$</td>
        <td style="text-align:left;">A prediction</td>
        <td style="text-align:left;">
            ($n_{y}$, $m$, $T_{y}$)
            <br/><br/>
            • $n_{y}$: nb of units in the vector representing the prediction
            <br/>• $m$: number of examples in a mini-batch
            <br/>• $T_{y}$: number of time steps in the prediction
        </td>
        <td style="text-align:left;"><code>y_pred</code></td>
    </tr>
    <tr>
        <td style="text-align:left;">$\hat{y}^{\langle t \rangle}$</td>
        <td style="text-align:left;">Prediction for a single time step $t$</td>
        <td style="text-align:left;">($n_{y}$, $m$)</td>
        <td style="text-align:left;"><code>yt_pred</code></td>
    </tr>
    <tr>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">     </td>
        <td style="text-align:left;">     </td>
    </tr>
    </tbody>
</table>

<br/>  
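
The shapes in the table can be checked with a small forward pass. The sketch below follows the table's variable names (`xt`, `a_prev`, `a_next`, `yt_pred`, `y_pred`); the parameter names (`Waa`, `Wax`, `Wya`, `ba`, `by`), the tanh/softmax choice and the toy dimensions are assumptions for the example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, p):
    """One time step. xt: (n_x, m), a_prev: (n_a, m)."""
    a_next = np.tanh(p["Waa"] @ a_prev + p["Wax"] @ xt + p["ba"])  # (n_a, m)
    yt_pred = softmax(p["Wya"] @ a_next + p["by"])                  # (n_y, m)
    return a_next, yt_pred

def rnn_forward(x, a0, p):
    """Loop over time steps. x: (n_x, m, T_x), a0: (n_a, m)."""
    n_x, m, T_x = x.shape
    n_y, n_a = p["Wya"].shape
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    a_next = a0
    for t in range(T_x):
        # x[:, :, t] is the 2D slice of the 3D input for time step t.
        a_next, yt_pred = rnn_cell_forward(x[:, :, t], a_next, p)
        a[:, :, t] = a_next
        y_pred[:, :, t] = yt_pred
    return a, y_pred

rng = np.random.default_rng(0)
n_x, n_a, n_y, m, T_x = 3, 5, 2, 4, 6
p = {"Waa": rng.standard_normal((n_a, n_a)), "Wax": rng.standard_normal((n_a, n_x)),
     "Wya": rng.standard_normal((n_y, n_a)),
     "ba": np.zeros((n_a, 1)), "by": np.zeros((n_y, 1))}
a, y_pred = rnn_forward(rng.standard_normal((n_x, m, T_x)), np.zeros((n_a, m)), p)
```

Here `a` has shape ($n_{a}$, $m$, $T_{x}$) and `y_pred` has shape ($n_{y}$, $m$, $T_{x}$), since this sketch produces one prediction per input time step.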


#### NumPy slicing
https://www.pythoninformer.com/python-libraries/numpy/index-and-slice/
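
A short example of the slices used most often with the ($n_{x}$, $m$, $T_{x}$) tensors above; the toy dimensions and variable names are arbitrary.

```python
import numpy as np

n_x, m, T_x = 3, 4, 5
x = np.arange(n_x * m * T_x).reshape(n_x, m, T_x)

xt = x[:, :, 2]    # all units, all examples, time step 2 -> shape (n_x, m)
x_i = x[:, 1, :]   # all units, example 1, all time steps -> shape (n_x, T_x)
x_it = x[:, 1, 2]  # all units, example 1, time step 2    -> shape (n_x,)
```

Indexing with an integer drops that axis, while `:` keeps it.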

<h1 style="background:#DDEEEE;padding: 15px;">Word embeddings</h1>

## *Word embeddings*

- Featurized representation:
<img src="images/sequence-models_word-embeddings/embeddings-01.png" width="700">
  
  
- Words similarity
<img src="images/sequence-models_word-embeddings/embeddings-02.png" width="700">  
  
  
- Visualizing embeddings with t-SNE algorithm
<img src="images/sequence-models_word-embeddings/embeddings-03.png" width="700">
  
  
- Transfer learning
<img src="images/sequence-models_word-embeddings/embeddings-04.png" width="700">
   
   
- Using embeddings: analogy reasoning
  - $e_{man}$ - $e_{woman}$ ≈ [-2, 0, 0, 0] -> because the main difference between them is the Gender feature
  - $e_{king}$ - $e_{queen}$ ≈ [-2, 0, 0, 0] -> same reason
  - First pointed out by Mikolov et al. 2013: "Linguistic regularities in continuous space word representations"
<img src="images/sequence-models_word-embeddings/embeddings-05.png" width="700">

  
  - The vector difference between 'man' and 'woman' is very similar to the vector difference between 'king' and 'queen'
<img src="images/sequence-models_word-embeddings/embeddings-06.png" width="700">
  
  
  - Cosine similarity: most commonly used (cosine of the angle between the 2 vectors)
    - normalizes by the lengths of the 2 vectors
  - Squared distance: measures dissimilarity => similarity = -squared_dist
<img src="images/sequence-models_word-embeddings/embeddings-07.png" width="700">
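
A small sketch of analogy reasoning with cosine similarity. The 4-feature embeddings below are made-up values for illustration only, and `complete_analogy` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings with features (gender, royal, age, food).
embeddings = {
    "man":   np.array([-1.00,  0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00,  0.02, 0.02, 0.01]),
    "king":  np.array([-0.95,  0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97,  0.95, 0.69, 0.01]),
    "apple": np.array([ 0.00, -0.01, 0.03, 0.95]),
}

def complete_analogy(a, b, c, embeddings):
    """Find the word w such that 'a is to b as c is to w'."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

best = complete_analogy("man", "woman", "king", embeddings)
```

The three input words are excluded from the candidates, otherwise one of them could be returned as the closest match.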


- Embedding matrix (vs one-hot encodings):
  - multiplying the embedding matrix $E$ by the one-hot vector $O_{orange}$, for example, ends up selecting the column of the matrix corresponding to the word 'orange':   
  $E$ * $O_{j}$ = $e_{j}$  -> embedding for word $j$
<img src="images/sequence-models_word-embeddings/embeddings-08.png" width="700">
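
The column-selection property can be verified numerically. The matrix sizes and the random values below are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, vocab_size = 4, 6
E = rng.standard_normal((n_features, vocab_size))  # one column per word

j = 3                        # index of the word in the vocabulary
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                 # one-hot vector for word j

e_j = E @ o_j                # matrix-vector product
column = E[:, j]             # direct lookup gives the same embedding
```

In practice the direct lookup `E[:, j]` is preferred, because the product multiplies mostly by zeros.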

- Learning an embedding:
  - initialize the embedding matrix $E$ randomly,
  - then use gradient descent to learn all the parameters of this matrix
  - and $E$ * $O_{j}$ gives you the embedding for a given word $j$

## *Learning word embeddings*

#### Word2Vec


#### Negative Sampling


#### GloVe

## *Applications of word embeddings*

-

<h1 style="background:#DDEEEE;padding: 15px;">Attention mechanisms</h1>

### Principal Component Analysis

- Dimensionality reduction technique that projects the data onto k dimensions while maximizing the variance of the data, as follows:
- Step 1: Normalize the data to have a mean of 0 and a standard deviation of 1

https://stanford.edu/~shervine/l/fr/teaching/cs-221/pense-bete-modeles-reflex
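
A minimal NumPy sketch of PCA. Only the normalization step is listed in the notes above; the remaining steps (covariance matrix, eigendecomposition, projection onto the top k eigenvectors) follow the standard formulation of the algorithm. The function name and the data layout (one example per row) are assumptions, and the sketch assumes no feature has zero standard deviation.

```python
import numpy as np

def pca(X, k):
    """Project X of shape (m examples, n features) onto its top k principal components."""
    # Step 1: normalize to mean 0 and standard deviation 1 per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the normalized data, shape (n, n).
    cov = X_std.T @ X_std / X_std.shape[0]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # Projection onto the k directions of largest variance.
    return X_std @ top_k

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]  # make two features strongly correlated
Z = pca(X, k=2)
```

The columns of `Z` are ordered by decreasing variance, so the first column captures the most variance.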