### Why Seq. Modelling

Examples:
- Speech recognition
- Music generation
- Sentiment classification
- DNA sequence analysis
- Machine translation
- Video activity recognition
- Named-entity recognition

The input, the output, or both may be sequences.

### Notation

Consider a sentence $x$ (a sequence of words).  
Task: find where the people's names are in the sentence.  
This is `named-entity recognition`, used for example by search engines (indexing the words for fast search).  

Label each word as part of a name or not: the labels are 1 or 0 and together form the $y$-vector.

The elements of the input are indexed as $x^{<t>}$, where $t$ is the temporal index and $T_x$ is the length of the input sequence.  
The outputs are $y^{<t>}$, with sequence length $T_y$.  

The lengths can differ between training examples. 
For the $i$-th example they are denoted $T_x^{(i)}$ and $T_y^{(i)}$.

Words are represented using a dictionary (vocabulary). 

A dictionary is an indexed list of words, usually a large one. Each word is then represented by a vector with the length of the whole vocabulary, containing 1 at the index of this word and 0 everywhere else. These are `one-hot` vectors. 

The goal is to learn the mapping from these one-hot vectors to the target output $Y$ via `supervised learning`. 

If a word is not in the vocabulary, replace it with a special `UNK` (unknown word) token.
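
The one-hot representation with an `UNK` fallback can be sketched as follows. This is a minimal illustration only: the toy vocabulary, the lower-casing, and the helper name `one_hot` are assumptions, not part of the notes.

```python
import numpy as np

# Hypothetical toy vocabulary; real ones hold tens of thousands of words.
vocab = ["<UNK>", "a", "and", "harry", "potter", "zulu"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of vocabulary length with a single 1 at the word's index."""
    # Words missing from the vocabulary fall back to the <UNK> index.
    index = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vector = np.zeros(len(vocab))
    vector[index] = 1.0
    return vector

sentence = "Harry Potter and Hermione".split()
x = np.stack([one_hot(w) for w in sentence])  # shape (T_x, vocabulary size)
print(x.shape)  # (4, 6)
```

Here "Hermione" is not in the toy vocabulary, so its row is the one-hot vector of `<UNK>`.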


### Recurrent Neural Network Model

One approach is to use a standard NN: take the set of one-hot vectors, build FC layers and predict the output.  
Problems:
- Inputs and outputs can have different lengths for different sentences
- Features learned at one position in the text are not shared across other positions (e.g., for repeating words); this kind of sharing is what a CNN provides for images
- Input layers are extremely large, with very large weight matrices

Proper approach:

Consider the input vector $x^{<1>}$, used to predict $\hat{y}^{<1>}$.  
In an RNN, for the second input word $x^{<2>}$, information from the previous step (which saw $x^{<1>}$) is also used to predict $\hat{y}^{<2>}$. So in the RNN the timesteps are "chained" by passing the activation from each step to the next.  
An initial activation $a^{<0>}$ is also added, which is initialized as zeros.  
At every step, $x^{<t>} \rightarrow \hat{y}^{<t>}$.  
The RNN scans the data from left to right. The parameters are shared across timesteps. 
So $w_{ax}$, $w_{aa}$ and $w_{ya}$, which control the input from $x$, the input from the previous timestep, and the output respectively, are __shared__.  

When making a prediction for $\hat{y}^{<3>}$, the information from $x^{<1>}$, $x^{<2>}$, $x^{<3>}$ is used.  

_Limitation_: In this RNN only the information from earlier in the sequence can be used to make a prediction.  

_Solution_: bidirectional RNN (BRNN)

#### Example of the RNN calculation

$$
a^{<0>}\rightarrow
\overbrace{\underbrace{
\begin{bmatrix}
\cdot \\
\cdot \\
\cdot \\
\end{bmatrix}}_{x^{<1>}}}^{\hat{y}^{<1>}}
\overbrace{\rightarrow}^{a^{<1>}}
\overbrace{\underbrace{
\begin{bmatrix}
\cdot \\
\cdot \\
\cdot \\
\end{bmatrix}}_{x^{<2>}}}^{\hat{y}^{<2>}}
\overbrace{\rightarrow}^{a^{<2>}}
\overbrace{\underbrace{
\begin{bmatrix}
\cdot \\
\cdot \\
\cdot \\
\end{bmatrix}}_{x^{<3>}}}^{\hat{y}^{<3>}}
\overbrace{\rightarrow}^{a^{<3>}}
\cdots
\overbrace{\rightarrow}^{a^{<T_x-1>}}
\overbrace{\underbrace{
\begin{bmatrix}
\cdot \\
\cdot \\
\cdot \\
\end{bmatrix}}_{x^{<T_x>}}}^{\hat{y}^{<T_y>}}
$$

__Forward prop__:  
$$
a^{<1>} = g(w_{aa} a^{<0>} + w_{ax} x^{<1>} + b_a)
$$
and the prediction is then 
$$
\hat{y}^{<1>} = g(w_{ya} a^{<1>} + b_y)
$$

where the two activation functions $g$ can be different!

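_Notation_: in $w_{ax}$ the second index ($x$) says the matrix is multiplied by an $x$-like quantity, and the first index ($a$) says it is used to compute an $a$-like quantity.  

Activations: $\tanh{(z)}$ or, rarely, ReLU.  

With $\tanh$ there are other ways to prevent vanishing gradients.  
The choice of the final (output) activation function is determined by the task (e.g., sigmoid for binary labels, soft-max for several classes).  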

At time $t$

$$
a^{<t>} = g(w_{aa} a^{<t-1>} + w_{ax} x^{<t>} + b_a) \\
\hat{y}^{<t>} = g(w_{ya} a^{<t>} + b_y)
$$

Let us re-express it as

$$
a^{<t>} = g(w_a[a^{<t-1>},x^{<t>}] + b_a) \\ 
\hat{y}^{<t>} = g(w_{ya} a^{<t>} + b_y)
$$

where the matrix $w_a$ is constructed by _horizontally stacking_ [$w_{aa} | w_{ax}$]. 
I.e., if $a$ is 100-dimensional and $x$ is 10000-dimensional, then  
$w_{aa}$ is [100, 100] and $w_{ax}$ is [100, 10000],  
so that stacking them along the _common axis_ gives a [100, 10100] matrix. 

This $[a^{<t-1>},x^{<t>}]$ implies a __stacking__ operation 
$$
[a^{<t-1>},x^{<t>}] = 
\begin{bmatrix}
a^{<t-1>} \\
x^{<t>}
\end{bmatrix}
$$
where the "height" of the final vector is the sum of the heights of the two vectors. 

Then the original quantity is recovered by the multiplication 
$$
[w_{aa} | w_{ax}] \times 
\begin{bmatrix}
a^{<t-1>} \\
x^{<t>}
\end{bmatrix}
=w_{aa} a^{<t-1>} + w_{ax} x^{<t>}
$$

This allows us to compress the two weight matrices into one parameter matrix $w_a$. 
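
The equivalence between the two separate products and the single stacked product can be checked numerically. A small sketch, assuming the dimensions from the example above (100 for $a$, 10000 for $x$) and random values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 100, 10000

w_aa = rng.normal(size=(n_a, n_a))
w_ax = rng.normal(size=(n_a, n_x))
a_prev = rng.normal(size=n_a)   # a^{<t-1>}
x_t = rng.normal(size=n_x)      # x^{<t>}

# Horizontal stacking of the weights: [w_aa | w_ax]
w_a = np.hstack([w_aa, w_ax])
# Vertical stacking of the vectors: [a^{<t-1>}, x^{<t>}]
stacked = np.concatenate([a_prev, x_t])

separate = w_aa @ a_prev + w_ax @ x_t
combined = w_a @ stacked

print(w_a.shape, stacked.shape)
print(np.allclose(separate, combined))
```

Both forms give the same vector up to floating-point rounding, so the single matrix $w_a$ only changes the bookkeeping, not the model.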

Finally, renaming $w_{ya}\rightarrow w_{y}$, the forward prop reduces to 

$$
a^{<t>} = g(w_a[a^{<t-1>},x^{<t>}] + b_a) \\
\hat{y}^{<t>} = g(w_{y} a^{<t>} + b_y)
$$
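
A minimal NumPy sketch of this forward pass over a whole sequence. It assumes $\tanh$ for the hidden activation and soft-max for the output; the sizes and the random (untrained) weights are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable soft-max over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, w_a, w_y, b_a, b_y):
    """Run a basic RNN over xs (shape [T_x, n_x]); return all a^{<t>} and y_hat^{<t>}."""
    a = np.zeros(w_a.shape[0])               # a^{<0>} = 0
    activations, predictions = [], []
    for x_t in xs:                           # scan left to right
        stacked = np.concatenate([a, x_t])   # [a^{<t-1>}, x^{<t>}]
        a = np.tanh(w_a @ stacked + b_a)     # same w_a, b_a at every timestep
        predictions.append(softmax(w_y @ a + b_y))
        activations.append(a)
    return np.array(activations), np.array(predictions)

# Made-up sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
n_a, n_x, n_y, T_x = 5, 8, 3, 4
w_a = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
w_y = rng.normal(scale=0.1, size=(n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

xs = rng.normal(size=(T_x, n_x))
activations, predictions = rnn_forward(xs, w_a, w_y, b_a, b_y)
print(activations.shape, predictions.shape)  # (4, 5) (4, 3)
```

The loop reuses the same parameters at every timestep, which is the parameter sharing described earlier.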


### Backward Prop. 

Calculations run in the opposite direction to forward prop.  

Define a loss function for a single prediction (one element of the sequence) that measures whether it is right or wrong. Consider the standard _logistic regression_ loss (aka cross-entropy loss): 
$$
\mathcal{L}^{<t>}(\hat{y}^{<t>},y^{<t>}) = -y^{<t>} \log(\hat{y}^{<t>}) - (1-y^{<t>}) \log(1-\hat{y}^{<t>})
$$

The loss for the entire sequence is the sum of the per-timestep losses (here $T_x = T_y$): 
$$
\mathcal{L} = \sum_{t=1}^{T_x=T_y} \mathcal{L}^{<t>}(\hat{y}^{<t>},y^{<t>}) 
$$
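
As a sketch, the per-timestep binary cross-entropy and its sum over the sequence can be computed like this. The labels and predictions are made-up numbers, and the small `eps` clipping is an added assumption to avoid $\log(0)$.

```python
import numpy as np

def step_loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy for each timestep."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard against log(0)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Total loss: sum of the per-timestep losses."""
    return float(np.sum(step_loss(np.asarray(y_hats), np.asarray(ys))))

# Hypothetical labels (1 = part of a name) and predicted probabilities
y_true = [1, 0, 0, 1]
y_pred = [0.9, 0.2, 0.1, 0.6]

total = sequence_loss(y_pred, y_true)
print(total)
```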

Backprop: the calculation runs in the opposite direction, taking gradients and updating the parameters.  

As the direction is backwards in $t$, it is called _backpropagation through time_.  

(P.s. See the drawing from "backpropagation through time" video). 


### Different types of RNNs

Previously we considered the so-called __many-to-many__ architecture (input and output both have many units, with $T_x = T_y$). 

If, on the other hand, the input is a text and the output is a number or a single value (e.g., sentiment classification), 
the output is only produced at the end (at the $T_x$-th unit), after the entire sentence has been read. This is the __many-to-one__ architecture. 

There is also __one-to-one__ (a standard FC NN).

There is also __one-to-many__ (an example is music generation). Here there is only one input and the same overall structure, but the output of each timestep is fed as the input to the next. 

There is also __many-to-many__ where $T_x \neq T_y$ (e.g., machine translation). Here some units get actual input data, while others get as input the output of the previous unit. Such a network has __two__ distinct parts: an _encoder_ and a _decoder_. 
The _encoder_ takes the input data $x^{<1..T_x>}$, and the _decoder_ generates the output $\hat{y}^{<1..T_y>}$. 

### Language modelling and sequence generation

Speech recognition: selecting between similar-sounding words based on the context. 

A language model estimates the probability of a particular sentence, i.e. of a sequence of words.  

Building an RNN language model requires 
- a large _corpus_ of English text  
(a _corpus_ is a large body/set of text). 
- a _tokenization step_: tokenizing the text (building a dictionary as discussed) and mapping each of the words to a _one-hot_ vector. 
    - An extra token, EOS (end-of-sentence), can be appended to the end of each sentence in the training set.  
    Punctuation can also be a token.  
    If a word does not exist in the dictionary, then replace the unknown word with $<UNK>$ (unknown word, _unique token_). 
- Build an RNN to model the chance of different sequences:

#### RNN architecture:
At time $1$ compute the activation $a^{<1>}$ as a function of $x^{<1>}=\vec{0}$ (recall that $a^{<0>}=\vec{0}$ by convention). From $a^{<1>}$ a __soft-max__ prediction $\hat{y}^{<1>}$ is made, trying to figure out the probability of the first word, i.e., what is the chance of the first word being any of the words in the dictionary (each gets a probability).  

At the next step the same happens, but the _correct word_ from the previous step is also fed in as the input: $x^{<2>} = y^{<1>}$. 

At the next step the same happens as at the previous step, but now the prediction is conditioned on __the first $2$__ words, with $x^{<3>} = y^{<2>}$. 

At the last timestep, $x^{<n>} = y^{<n-1>}$, and hopefully the model assigns a large probability to the EOS token. 

Define the loss function at a certain time $t$
$$
\mathcal{L}^{<t>}(\hat{y}^{<t>},y^{<t>}) = -\sum_{i} y_i^{<t>}\log(\hat{y}_i^{<t>}) 
$$
this is the __soft-max__ loss function. 

Given a _sufficiently large training set_, the NN can, given an initial set of words, _predict_ the chance of the _next word_. 
For example, for a sentence of three words the __soft-max__ outputs give: 

$$
P(y^{<1>},y^{<2>},y^{<3>}) = \\
P(y^{<1>})\times
P(y^{<2>}|y^{<1>}) \times
P(y^{<3>}|y^{<2>},y^{<1>})
$$
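
A small sketch of how the sentence probability follows from the per-step soft-max outputs. The probability table is made up, and the helper name `sentence_log_prob` is hypothetical; summing logs rather than multiplying probabilities is a common choice to avoid underflow for long sentences.

```python
import numpy as np

def sentence_log_prob(step_probs, word_indices):
    """Sum of log P(y^{<t>} | y^{<1>}, ..., y^{<t-1>}) over the sentence."""
    return float(sum(np.log(p[i]) for p, i in zip(step_probs, word_indices)))

# Hypothetical soft-max outputs over a 4-word vocabulary, one row per timestep.
# Row t is already conditioned on the true words before t.
step_probs = np.array([
    [0.5, 0.2, 0.2, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
word_indices = [0, 1, 3]   # indices of the true words y^{<1>}, y^{<2>}, y^{<3>}

log_p = sentence_log_prob(step_probs, word_indices)
prob = np.exp(log_p)       # 0.5 * 0.6 * 0.7
print(prob)
```

With one-hot targets, the negative of `log_p` is also the sum of the per-timestep soft-max losses for this sentence.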


### Sampling sequences from the model

Sampling is done as follows:
- Start with $a^{<0>}=\vec{0}$ and $x^{<1>}=\vec{0}$ and ask: what is the first word to generate? 
- Randomly sample according to the soft-max distribution
    - The output of the soft-max is an array of probabilities, one for each word in the dictionary. 
    - Use `np.random.choice()` to sample according to this distribution. This gives $\hat{y}^{<1>}$
- In the second timestep, use $\hat{y}^{<1>}$ as the input $x^{<2>}$ to predict $\hat{y}^{<2>}$, again via __soft-max__. 
- Continue until EOS is generated, or just sample a fixed number of words. 
- Reject any sample in which UNK is generated. 
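
The sampling loop above can be sketched as follows. Several things here are assumptions for illustration: the toy vocabulary, the random weights standing in for a trained model, and the reading of "reject UNK" as resampling at that timestep.

```python
import numpy as np

np.random.seed(0)

vocab = ["<EOS>", "<UNK>", "the", "cat", "sat", "down"]
EOS, UNK = 0, 1
n_v, n_a = len(vocab), 8

# Random weights stand in for a trained language model (illustration only)
w_a = 0.5 * np.random.randn(n_a, n_a + n_v)
w_y = np.random.randn(n_v, n_a)
b_a, b_y = np.zeros(n_a), np.zeros(n_v)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_sentence(max_words=10):
    a = np.zeros(n_a)   # a^{<0>} = 0
    x = np.zeros(n_v)   # x^{<1>} = 0
    words = []
    while len(words) < max_words:
        a = np.tanh(w_a @ np.concatenate([a, x]) + b_a)
        probs = softmax(w_y @ a + b_y)
        idx = np.random.choice(n_v, p=probs)   # sample from the soft-max
        while idx == UNK:                      # reject UNK and resample
            idx = np.random.choice(n_v, p=probs)
        if idx == EOS:                         # stop at end of sentence
            break
        words.append(vocab[idx])
        x = np.zeros(n_v)                      # sampled word becomes next input
        x[idx] = 1.0
    return words

print(sample_sentence())
```

Since the weights are untrained, the output is a random word sequence; only the control flow is meaningful.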

It is also possible to build a _character-based dictionary_. This gives a character-level language model. 
It has pros and cons.  
Main disadvantage: the sequences are much longer, so the model is not as good at capturing long-term dependencies. 
It is used in more specialized applications. 

### Vanishing Gradient Problem

Basic RNNs are not very good at capturing long-term dependencies.

In a classical deep NN, gradients have a hard time propagating back through all the layers, so they barely affect the weights in the earlier layers (those far from the output). 

RNNs have a similar problem.  
For RNNs working with long sentences it can be especially bad.  

An output is mostly affected by inputs close to it in the sequence: the error does not propagate all the way back through the network. 

Another problem is __exploding gradients__.   
__Gradient clipping__ is a possible solution to that.
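
One common variant of gradient clipping rescales all gradients together when their global norm exceeds a threshold. A minimal sketch, assuming clipping by global L2 norm (elementwise value clipping is another option); the function name and the threshold are hypothetical:

```python
import numpy as np

def clip_by_norm(gradients, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in gradients]

# Made-up gradients with global norm sqrt(9 + 16 + 144) = 13
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped = clip_by_norm(grads, max_norm=5.0)

clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(clipped_norm)
```

Rescaling keeps the direction of the update and only limits its size; gradients already below the threshold are left unchanged.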


### Gated Recurrent Unit

A modification to the RNN hidden layer.  

The standard RNN unit has the form 

$$
a^{<t>} = g(w_a[a^{<t-1>},x^{<t>}] + b_a) \\ 
\hat{y}^{<t>} = g(w_{ya} a^{<t>} + b_y)
$$

where $g(...)$ is the activation function. 

The _unit of RNN_ takes as input:
- $a^{<t-1>}$ from the previous timestep
- $x^{<t>}$ from the input data 

and it outputs: 
- $a^{<t>}$, the current activations (passed to the next timestep)
- $\hat{y}^{<t>}$ for the given input via, e.g., _soft-max_

Consider $c$, a __memory cell__, that remembers a certain piece of information that can be used by later units. 

At time $t$, the memory cell has the value $c^{<t>}$. The output of the GRU is $a^{<t>}=c^{<t>}$.  
_The memory cell value equals the activation value_ (they are equal in the GRU but not in general, e.g., not in the LSTM).

At every timestep, we consider overwriting the memory cell with
$$
\tilde{c}^{<t>} = \tanh(w_c[c^{<t-1>},x^{<t>}] + b_c)
$$
which is a __candidate__ for replacing $c^{<t>}$.  

Then we introduce the __gate__ $\Gamma_u$ (update gate), $\Gamma_u\in[0,1]$. In practice it is computed by applying the sigmoid function 
$$
\Gamma_u = \sigma(w_u[c^{<t-1>},x^{<t>}] + b_u)
$$
(so for most inputs it is very close to either 0 or 1).  

> The _gate_ controls when to update the $c^{<t>}$ with $\tilde{c}^{<t>}$. 

The actual equation for the update is
$$
c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
$$

where $*$ is the elementwise multiplication. 

This lets the cell _hold on to the information_ (while $\Gamma_u\approx 0$) until an update is needed (when $\Gamma_u\approx 1$).  
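
A tiny numeric illustration of the elementwise update, with made-up values: where the gate is 0 the old memory is kept, where it is 1 the candidate replaces it, and in between the two are mixed.

```python
import numpy as np

c_prev = np.array([0.8, -0.3, 0.5])       # c^{<t-1>}
c_candidate = np.array([-0.1, 0.9, 0.2])  # candidate value
gate = np.array([0.0, 1.0, 0.5])          # update gate, one value per cell element

# Elementwise update: gate * candidate + (1 - gate) * previous value
c_new = gate * c_candidate + (1 - gate) * c_prev
print(c_new)  # [0.8  0.9  0.35]
```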

So, in summary:

The _unit of GRU_ takes as input:
- $a^{<t-1>}=c^{<t-1>}$ from the previous timestep
- $x^{<t>}$ from the input data 

and it computes: 
- $\tilde{c}^{<t>}$, the current candidate (via $\tanh$ activation)
- $\Gamma_u$, the current update gate (via sigmoid activation) 

The final output $c^{<t>}=a^{<t>}$ is the combination of these two with $c^{<t-1>}$.  

_The GRU is good at keeping the cell value without updates until an update is needed_   

Note that here $c^{<t>}$, $\tilde{c}^{<t>}$ and $\Gamma_u$ are of the same dimension!  

In the __full GRU unit__, there is also a _relevance_ gate $\Gamma_r$, which controls how relevant $c^{<t-1>}$ is for computing the candidate:

$$
\tilde{c}^{<t>} = \tanh(w_c[\Gamma_r * c^{<t-1>},x^{<t>}] + b_c) \\
\Gamma_u = \sigma(w_u[c^{<t-1>},x^{<t>}] + b_u) \\
\Gamma_r = \sigma(w_r[c^{<t-1>},x^{<t>}] + b_r) \\ 
c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}
$$
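
The four equations of the full GRU translate almost line by line into NumPy. This is a sketch under assumptions: each gate and the candidate get their own bias, the sizes are made up, and the random weights stand in for trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, w_c, w_u, w_r, b_c, b_u, b_r):
    """One full-GRU timestep; returns c^{<t>} (which is also a^{<t>})."""
    concat = np.concatenate([c_prev, x_t])              # [c^{<t-1>}, x^{<t>}]
    gamma_u = sigmoid(w_u @ concat + b_u)               # update gate
    gamma_r = sigmoid(w_r @ concat + b_r)               # relevance gate
    gated = np.concatenate([gamma_r * c_prev, x_t])     # [Gamma_r * c^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(w_c @ gated + b_c)                # candidate
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev   # elementwise mix

# Made-up sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
n_c, n_x, T_x = 4, 3, 5
w_c, w_u, w_r = (rng.normal(size=(n_c, n_c + n_x)) for _ in range(3))
b_c, b_u, b_r = (np.zeros(n_c) for _ in range(3))

c = np.zeros(n_c)                                       # c^{<0>} = 0
for x_t in rng.normal(size=(T_x, n_x)):
    c = gru_step(c, x_t, w_c, w_u, w_r, b_c, b_u, b_r)
print(c.shape)  # (4,)
```

Because $c^{<t>}$ is an elementwise mix of a $\tanh$ output and the previous cell value, it stays within $(-1, 1)$ when starting from zeros.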

### LSTM 
