# VARIOUS SEQUENCE TO SEQUENCE ARCHITECTURES

### Basic models

- machine translation: the input is a sequence $x^{<1>}, \dots, x^{<T_x>}$ and the output is a sequence $y^{<1>}, \dots, y^{<T_y>}$.  

Start by building an `encoder network`, an RNN (GRU or LSTM) that takes the input one word at a time. After that, build a `decoder network` that outputs the sequence $y^{<1>}, \dots, y^{<T_y>}$.

This model works given enough pairs of sentences in the two languages. This is the `encoder-decoder` network.  

A similar architecture works for image captioning.  
The network passes the image through a CNN (a pre-trained net) that learns a set of features of the image. It returns a dense layer with a lot of units. This acts as the _encoder network_. Its output can then be fed into an _RNN_, the _decoder network_, to generate a caption. This works reasonably well.  

These are the basic seq-to-seq models. However, sampling from such a model gives a randomly chosen output, not _the most likely output_.  

### Picking the most likely sequence

Machine translation as a conditional language model.  

A language model returns a probability of a given output.  

- Conditional language model. In a classical language model the initial activation $a^{<0>}$ is set to zero. Here the output of the _encoder NN_ is used instead to _condition_ the language model. Thus, machine translation becomes a pair of _encoder and decoder_ NNs: the _decoder_ NN has an output similar to the _language model_, but the _encoder_ is used to condition it. 

The model gives the probabilities $P(y|x)$ of possible translation sentences $y$ given the input $x$. It is important to _avoid sampling randomly_ from this distribution of outputs.  
Instead, find the output sentence $y$ that maximises the conditional probability $P(y|x)$. This is done via `beam search`.  
`Greedy search`, where words are picked one by one based on their individual probabilities, _is not generally used_. Instead, the _joint probability_ of _all words in the sentence_ is used. 

Also, an _approximate search_ is used because the space of possible sentences is too large to search exhaustively.  
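The difference between picking words one by one and maximising the joint probability can be seen on a toy example. The sketch below uses made-up probabilities for a two-word output (the words, the numbers, and the function names are illustrative assumptions, not outputs of a real model).

```python
# Hypothetical toy distributions: P(y1 | x) and P(y2 | x, y1).
p_first = {"going": 0.5, "visiting": 0.4, "other": 0.1}
p_second = {
    "going": {"to": 0.4, "home": 0.3, "away": 0.3},
    "visiting": {"Africa": 0.9, "home": 0.1},
    "other": {"thing": 1.0},
}


def greedy(p_first, p_second):
    """Pick the most likely first word, then the most likely second word."""
    w1 = max(p_first, key=p_first.get)
    w2 = max(p_second[w1], key=p_second[w1].get)
    return (w1, w2), p_first[w1] * p_second[w1][w2]


def exhaustive(p_first, p_second):
    """Maximise the joint probability P(y1 | x) * P(y2 | x, y1) over all pairs."""
    pairs = [(w1, w2) for w1 in p_first for w2 in p_second[w1]]
    best = max(pairs, key=lambda p: p_first[p[0]] * p_second[p[0]][p[1]])
    return best, p_first[best[0]] * p_second[best[0]][best[1]]


greedy_pair, greedy_prob = greedy(p_first, p_second)
best_pair, best_prob = exhaustive(p_first, p_second)
print(greedy_pair, greedy_prob)
print(best_pair, best_prob)
```

Here the greedy choice commits to the most likely first word and ends up with a less probable sentence than the joint maximum. Exhaustive enumeration is only feasible for such a tiny example, which is why an approximate search is needed in practice.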

### Beam search

Choosing the best output from LLMs.  

Beam search first picks the most likely candidates for the first word.  
It evaluates the probability $P(y^{<1>}|x)$ of each possible first word given the input and keeps _several_ of the most likely ones rather than a single one. The number kept is given by the beam width $B$.  

Next, for each of the _selected_ first words, consider the probabilities of the next word (remember that the previous word is fed into the next step of the _decoder_, so the distribution over the next word differs for each previous choice).  

Now we collect the $B$ most likely pairs of first and second words. The selection is done using 
$$
P(y^{<1>},y^{<2>}|x) = P(y^{<1>}|x)P(y^{<2>}|x,y^{<1>})
$$
The same is done for the third word, and so on. This selects the _most probable sentence_ rather than the most probable word at each step, in contrast to _greedy search_ (note that with $B=1$ beam search reduces to greedy search).

Beam search aims to maximize 

$$
\arg\max_{y} \,\prod_{t=1}^{T_y} P(y^{<t>}|x,y^{<1>},\dots,y^{<t-1>})
$$
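The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: `step_log_probs` is a hypothetical stand-in for the decoder RNN (it returns $\log P(y^{<t>}|x, y^{<1>},\dots,y^{<t-1>})$ for every candidate token), and `toy_model` with its probabilities is made up for the example.

```python
import math


def beam_search(step_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the `beam_width` best partial sentences at every step.

    step_log_probs(prefix) -> {token: log P(token | x, prefix)} is assumed
    to wrap the decoder; it is a placeholder, not a real library call.
    """
    beams = [((), 0.0)]  # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the B most likely extensions across all beams.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Made-up conditional distributions for illustration only."""
    table = {
        (): {"going": 0.5, "visiting": 0.4, "<eos>": 0.1},
        ("going",): {"to": 0.4, "home": 0.3, "<eos>": 0.3},
        ("visiting",): {"Africa": 0.9, "<eos>": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_result = beam_search(toy_model, beam_width=1, max_len=5)
beam_result = beam_search(toy_model, beam_width=2, max_len=5)
print(greedy_result[0], math.exp(greedy_result[1]))
print(beam_result[0], math.exp(beam_result[1]))
```

With `beam_width=1` the sketch behaves like greedy search; with `beam_width=2` it keeps the second-best first word alive long enough to find a more probable sentence.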

### Refinements to beam search 
- Length normalization

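A more numerically stable algorithm is obtained by maximizing $\sum\log(P(...))$ instead of the product of probabilities, since a product of many small numbers can underflow.  

This objective, however, prefers short sentences: every probability is below 1, so each additional word lowers the score.  
A common approach is to normalize by the sentence length: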
$$
\arg\max_{y} \frac{1}{T_y^{\alpha}} \,\sum_{t=1}^{T_y} \log P(y^{<t>}|x,y^{<1>},\dots,y^{<t-1>})
$$

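where the first term, $1/T_y^{\alpha}$, reduces the bias of the objective toward short sentences. With $\alpha=1$ the score is fully normalized by length, and with $\alpha=0$ there is no normalization. 

This is a _heuristic approach_. 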
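The effect of the normalization can be checked numerically. The sketch below scores two hypothetical candidates, one short and one long, from made-up per-word probabilities; the function name and the default $\alpha=0.7$ are assumptions for the example.

```python
import math


def normalized_score(token_probs, alpha=0.7):
    """Length-normalized log-likelihood: (1 / T_y**alpha) * sum(log P)."""
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (len(token_probs) ** alpha)


# Made-up per-word probabilities for two candidate translations.
short_sentence = [0.5, 0.5]            # T_y = 2
long_sentence = [0.7, 0.7, 0.7, 0.7]   # T_y = 4

for alpha in (0.0, 0.7, 1.0):
    s = normalized_score(short_sentence, alpha)
    l = normalized_score(long_sentence, alpha)
    winner = "short" if s > l else "long"
    print(f"alpha={alpha}: short={s:.3f}, long={l:.3f} -> {winner}")
```

Without normalization ($\alpha=0$) the short candidate wins although each of its words is less likely; with normalization the longer candidate is preferred.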

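If $B$ is large, the search is slower and needs more memory, but it is more accurate.  
If $B$ is small, the search is faster but less accurate.

In practice, $B$ is typically between $10$ and $100$.

Unlike exact search algorithms such as BFS or DFS, beam search is _not an exact search_: it is not guaranteed to find the exact maximum.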




### Error analysis in Beam Search

A model has two components: the RNN and the beam search algorithm.  
An error in translation can be attributed to either the RNN or the beam search algorithm.  

Increasing the beam width may not improve performance, just as increasing the training dataset may not. Error analysis shows which of the two is worth the effort.  

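The RNN computes $P(y|x)$.  
Let $y^*$ be the human translation and $\hat{y}$ the output chosen by beam search, and compare $P(y^*|x)$ with $P(\hat{y}|x)$.  
If $P(y^*|x) > P(\hat{y}|x)$, beam search chose $\hat{y}$ although $y^*$ has a higher probability. Then _beam search is at fault_, as it failed to find the output with the highest probability.  
If, on the other hand, $P(y^*|x) \le P(\hat{y}|x)$, then the RNN is the problem: it assigns the better translation a probability that is no higher. 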

Go through the dev set and find the mistakes that the algorithm made. 

For each mistake, compare the probabilities of $y^*$ and $\hat{y}$ and decide which is at fault, beam search or the RNN.  
The error analysis then gives the fraction of errors attributed to each component.  

Beam search is worth improving only if it is responsible for most of the errors. 
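The bookkeeping for this analysis is simple enough to sketch. The function below is a hypothetical helper: it assumes that, for each dev-set mistake, the probabilities $P(y^*|x)$ and $P(\hat{y}|x)$ have already been computed by the RNN, and the numbers in the example are made up.

```python
def attribute_errors(examples):
    """Return the fraction of mistakes blamed on beam search vs. the RNN.

    examples: list of (p_star, p_hat) pairs, i.e. P(y*|x) and P(y_hat|x)
    for each dev-set mistake.
    """
    counts = {"beam_search": 0, "rnn": 0}
    for p_star, p_hat in examples:
        if p_star > p_hat:
            counts["beam_search"] += 1  # search missed a more likely output
        else:
            counts["rnn"] += 1          # model does not prefer y*
    total = len(examples)
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


# Made-up probabilities for four dev-set mistakes.
dev_mistakes = [(2e-10, 1e-10), (1e-10, 3e-10), (1e-10, 2e-10), (4e-10, 5e-10)]
fractions = attribute_errors(dev_mistakes)
print(fractions)
```

In this made-up case most mistakes are attributed to the RNN, so increasing $B$ would be unlikely to help much.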


### Bleu Score (Bilingual Evaluation Understudy)

- Single number evaluation metric
- System to evaluate generated text
- _Not_ for speech recognition

Evaluating the machine translation when there are many valid answers.

The Bleu score assesses how close the machine output is to human reference translations.  

__Precision__: the fraction of words in the machine output that appear in a reference translation. Not useful on its own.  
__Modified Precision__: each word gets credit only up to the maximum number of times it appears in a reference. So a word can get a credit of $2/7$: the (clipped) number of appearances divided by the number of words in the output sentence.  

`Bigrams`: pairs of words appearing next to each other. 

An algorithm can compare these pairs in the machine output with those in the references.

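$$
p_1 = \frac{\sum_{\text{unigram}\in\hat{y}}\text{count}_{\text{clip}}(\text{unigram})}{\sum_{\text{unigram}\in\hat{y}}\text{count}(\text{unigram})}
$$

$$
p_n = \frac{\sum_{n\text{-gram}\in\hat{y}}\text{count}_{\text{clip}}(n\text{-gram})}{\sum_{n\text{-gram}\in\hat{y}}\text{count}(n\text{-gram})}
$$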

Bleu $p_n$ is the Bleu score on $n$-grams only.  

$$
\text{Bleu} = BP\cdot\exp\left( \frac{1}{4} \sum_{n=1}^4 \log p_n \right)
$$
BP is the brevity penalty, which penalizes outputs that are too short.  
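The pieces of the score can be put together in a short sketch. It is a simplified illustration rather than a reference implementation: the function names are chosen for this example, the brevity penalty uses the standard form ($1$ if the output is longer than the closest reference length $r$, otherwise $\exp(1 - r/c)$ for output length $c$), no smoothing is applied, and the sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram count divided by the total n-gram count of the output."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()  # max count of each n-gram in any single reference
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    c = len(candidate)
    if c == 0:
        return 0.0
    # Reference length closest to the output length (shorter one on ties).
    r = min((len(ref) for ref in references), key=lambda n: (abs(n - c), n))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; no smoothing in this sketch
    mean_log = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(mean_log)


references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
repeated = "the the the the the the the".split()
p1 = modified_precision(repeated, references, 1)
print(p1)
print(bleu(repeated, references))
print(bleu(references[0], references))
```

The repeated-word output gets a modified unigram precision of $2/7$ and a Bleu score of zero because none of its bigrams appear in a reference, while an output identical to a reference scores 1.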


### Attention model intuition

In the usual encoder-decoder NN, the entire sentence is read, stored and only then translated.  
A human, however, reads and translates the text part by part.  

In an encoder-decoder network, problems therefore begin when sentences are too long: they are hard to translate.  

Attention model performs part-by-part translation/analysis of the text.  

Developed in 2014 and is now used in many applications.  

Consider a _bidirectional RNN_ that reads the input sentence.  
The translation is generated by another RNN with an initial hidden state $s^{<0>}$. The question is, in order to generate the first output, _what part of the sentence to consider_.  
The model uses `attention weights` that assess how much _attention_ is paid to a given part of the input sentence. 

For the next output step, with hidden state $s^{<2>}$, there is a _new set of weights_, and the previous output is also an input.  

Similar for the next step.  

The __context__ determines what part of the sentence to consider.  
So the inputs now are: activations, attention weights, and the previous output. 



### Attention Model

Consider an input sentence and a _bidirectional RNN_.  
In each hidden state (for each timestep) there are activations (features) $a^{\rightarrow <t'>}$ and $a^{\leftarrow <t'>}$, 
where the first timestep takes $a^{\rightarrow <0>} = \vec{0}$.  
Let $a^{<t'>} = (a^{\rightarrow <t'>},a^{\leftarrow <t'>})$ be the _feature vector_ for timestep $t'$ in the original sequence.  

Then consider a _forward-only RNN_ for translation that starts from $s^{<0>}$ and, at each step $t$, 
takes a context $c^{<t>}$ as input, has a hidden state $s^{<t>}$, and outputs $y^{<t>}$. 

The context $c^{<t>}$ depends on the attention parameters $\alpha^{<t,t'>}$, where $t$ indexes the current state $s$ and $t'$ indexes the input position (over all states of the previous, bidirectional NN).

The `context` is then the _weighted sum_ of the features, weighted by the attention weights.  
Normalization:  
$$
\sum_{t'}\alpha^{<1,t'>} = 1
$$

And the `context vector` for the first output reads  
$$
c^{<1>} = \sum_{t'}\alpha^{<1,t'>}a^{<t'>}
$$

where $\alpha^{<1,t'>}$ is the amount of _attention_ that $y^{<1>}$ should pay to $a^{<t'>}$.  

At the next timestep, the output is generated similarly, but with the output of the previous step as an additional input.  
This is the _one-directional_ RNN part. 

So, the $s$ network is similar to a classical RNN. 


#### Calculation of the attention $\alpha^{<t,t'>}$  

Recall that $\alpha^{<t,t'>}$ is the amount of _attention_ that $y^{<t>}$ should pay to $a^{<t'>}$.  

$$
\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t''=1}^{T_x}\exp(e^{<t,t''>})}
$$

where $e^{<t,t'>}$ are unnormalized scores; the softmax makes the weights $\alpha^{<t,t'>}$ sum to one over $t'$.  
The scores $e$ are computed by passing $s^{<t-1>}$ and $a^{<t'>}$ into a small _one-hidden-layer_ NN that outputs $e^{<t,t'>}$ (it approximates a function that we do not know).  

The softmax of the scores $e^{<t,t'>}$ then gives the weights $\alpha^{<t,t'>}$.  
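The softmax and the weighted sum for one output step $t$ can be sketched as follows. The scores $e^{<t,t'>}$ are taken as given numbers here (in the model they come from the small NN), and the function names and example values are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn scores e^{<t,t'>} into weights alpha^{<t,t'>} that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, features):
    """Weighted sum c^{<t>} = sum_{t'} alpha^{<t,t'>} a^{<t'>}.

    scores:   one score per input position t' (for a fixed output step t)
    features: one feature vector a^{<t'>} per input position
    """
    alphas = softmax(scores)
    dim = len(features[0])
    context = [sum(alpha * feat[d] for alpha, feat in zip(alphas, features))
               for d in range(dim)]
    return alphas, context


# Made-up example: three input positions with 2-dimensional features.
scores = [1.0, 2.0, 3.0]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = context_vector(scores, features)
print(alphas)
print(context)
```

The position with the highest score receives the largest weight, so its features dominate the context passed to the decoder at this step.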

_Disadvantages_:  
The cost of the algorithm is __quadratic__.  
The __total number of attention parameters__ $\alpha^{<t,t'>}$ is $T_x\times T_y$.  

In machine translation this is generally acceptable.  

_Other applications_:  
- Image captioning  

Visualization of the $\alpha$ can help to find where the attention is high.  








### Speech Recognition  

Given an audio clip $x$, create a transcript $y$.  
The audio clip is the air pressure versus time.  

A spectrogram is a common way to examine the audio.  

_Old approach_: create units of sound, `phonemes`, and represent the audio in terms of them.  
_New approach_: deep learning.  

Typical datasets contain $\sim300$ hours, or up to $100{,}000$ hours for large industry systems.  

Approaches: 
- attention + LSTM model.  
- Connectionist temporal classification (CTC)  

The idea is:  
Consider a bi-directional LSTM with an equal number of inputs and outputs, $x^{<1>},\dots,x^{<N>}$ and $y^{<1>},\dots,y^{<N>}$.  

Usually the number of input time steps is _very large_, because the audio is sampled at a high frequency, so it is much larger than the number of characters in the transcript. CTC allows the network to generate a sequence of the form  
$\texttt{ttt\_h\_eee\_\_\_ \_\_\_qqq\_\_}$.  

This is considered a correct output, whose first part spells the word $\texttt{the}$.  
The basic rule is to _collapse the repeated characters_ that are __not separated__ by the _blank_ character ($\texttt{\_}$).  
This makes it possible to separate words and obtain a _shorter output_.  
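The collapsing rule can be sketched as a small function. It covers only the decoding rule described above (merge repeats, then drop blanks), not the CTC training loss; the function name is chosen for this example.

```python
def ctc_collapse(output, blank="_"):
    """Collapse repeated characters not separated by a blank, then drop blanks."""
    collapsed = []
    previous = None
    for ch in output:
        if ch != previous:  # keep a character only when it changes
            collapsed.append(ch)
        previous = ch
    return "".join(ch for ch in collapsed if ch != blank)


print(ctc_collapse("ttt_h_eee___ ___qqq__"))  # prints "the q"
print(ctc_collapse("hh_ee_ll_ll_oo"))         # prints "hello"
```

The second call shows why the blank is needed: a blank between two runs of `l` keeps them as separate letters, so double letters survive the collapse.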

### Trigger word detection  

This can be accomplished even with a small dataset (in contrast to speech recognition, which needs a large one).  



# Finished exercise 1,2