# Additional Terms


- word embeddings
- BLEU score
- GLUE benchmark
- MultiNLI accuracy
- SQuAD 
- ablation analysis

- how is context db used
- how is model learning to respond? 

# Autoencoder

An autoencoder is a special case of the encoder-decoder architecture in which the input also serves as the target output. In this case the encoder creates an intermediary representation of the data, and the decoder learns how to translate that representation back into the original input space.

The classical use cases of the autoencoder are dimensionality reduction, compression, and noise reduction. The key expectation of a performant autoencoder is that it is able to represent the input in a more compact format.

More recently, the intermediary representation of data has been thought of as a way to identify or extract features. Assuming that the compressed data holds a "more pure" representation of the useful information, we can reverse engineer from it which original features are relevant and which are irrelevant, i.e. the original features that were reduced away can be removed from the feature set.
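
To make the compression idea concrete, here is a minimal sketch of a linear autoencoder in plain Python. Everything in it is an illustrative assumption rather than a reference implementation: the toy data (2-D points that lie on a line), the single-weight-vector encoder and decoder, the learning rate, and the number of steps. The point is only that a 1-D code is enough to reconstruct 2-D inputs when the data has one underlying degree of freedom.

```python
# Toy linear autoencoder: compress 2-D points to a 1-D code and reconstruct them.
# All values (data, learning rate, step count) are illustrative assumptions.

# Points on the line x2 = 2 * x1, so one number is enough to describe each point.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w = [0.1, 0.1]  # encoder weights: 2-D input -> 1-D code
v = [0.1, 0.1]  # decoder weights: 1-D code -> 2-D reconstruction


def encode(x):
    """Map a 2-D point to its 1-D intermediary representation."""
    return w[0] * x[0] + w[1] * x[1]


def decode(z):
    """Map a 1-D code back to the 2-D input space."""
    return (v[0] * z, v[1] * z)


def reconstruction_loss():
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x))
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total / len(data)


initial_loss = reconstruction_loss()

learning_rate = 0.05
for _ in range(1000):
    grad_w = [0.0, 0.0]
    grad_v = [0.0, 0.0]
    for x in data:
        z = encode(x)
        x_hat = decode(z)
        err = (x_hat[0] - x[0], x_hat[1] - x[1])
        # d(loss)/dv_k = 2 * err_k * z and d(loss)/dw_k = 2 * (err . v) * x_k
        err_dot_v = err[0] * v[0] + err[1] * v[1]
        for k in range(2):
            grad_v[k] += 2.0 * err[k] * z / len(data)
            grad_w[k] += 2.0 * err_dot_v * x[k] / len(data)
    # Plain gradient descent step on both the encoder and the decoder.
    for k in range(2):
        w[k] -= learning_rate * grad_w[k]
        v[k] -= learning_rate * grad_v[k]

final_loss = reconstruction_loss()
print(f"loss before training: {initial_loss:.4f}, after: {final_loss:.6f}")
```

The target of the training loop is the input itself, which is what makes this an autoencoder rather than a general encoder-decoder.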

# Sequence Generation & Sequence To Sequence

The term sequence generation often occurs in the context of LLMs. This umbrella term applies to tasks related to the generation of text (i.e. sequences of tokens) such as translation, summarization, and prompt-response.

<center>
    <img src='./images/sequence_to_sequence_model_comparison.png' style='width:75%'>
    <a href='https://www.analyticsvidhya.com/blog/2020/08/a-simple-introduction-to-sequence-to-sequence-models/'>Source</a>
</center>



With these tasks, the modeler is trying to map sequences in one domain to sequences in another domain. As such, the term sequence to sequence is often used interchangeably with sequence generation.

**Note**: Sequence to Sequence is often abbreviated as seq2seq, as noted in the [encoder-decoder notebook](Encoder-Decoders.ipynb).

# Alignment

I did some googling and found a few answers ([Cross Validated answer](https://stats.stackexchange.com/questions/272012/what-does-alignment-between-input-and-output-mean-for-recurrent-neural-network), [AI Stack Exchange answer](https://ai.stackexchange.com/questions/26184/what-is-the-purpose-of-alignment-in-the-self-attention-mechanism-of-transforme)) which helped me understand what alignment means and what is being aligned.

In short, I found that alignment is an abstraction for a problem which arises in the context of sequence to sequence mapping. Alignment describes the mapping of tokens in the input sequence to tokens in the output sequence.

To add some intuition behind the choice of the word alignment, let's consider the translation problem space, where we need to map one language to another. For example, consider the task of translating English to French. The phrase "It's raining" translates to "Il pleut". The input has the same number of words as the output, and they are in the same order. Thus the alignment of the tokens between sequences, i.e. the mapping of input tokens to output tokens, is trivial.

```
Il pleut
(1)  (2)
 |    |
 |    |
(1)  (2)
It's raining
```

But what about "It is raining"? In this circumstance three English words correspond to two French words; there is no one-to-one mapping between the words.

```
Il pleut
(1)  (2)
 |     \
 |      \
(1) (2) (3)
It  is  raining
```

Additionally, the order of words may not be the same between the two sequences. For example, consider this translation from Portuguese to English:

```
Uma maçã grande e vermelha
(1)   (2)  (3)  (4)   (5)
 |      \ /   _______/
 |       X   /
 |      / \ /
 |     /   X
 |    /   / \
(1) (3) (5)  (2)
 A  big red apple
```

We can see that not only do the input and output sequences differ in length, but the order of the corresponding tokens is also not the same.
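
One simple way to write such a mapping down is as a list of index pairs. The sketch below does this for the Portuguese example; the pairs are hand-written from the diagram (not produced by a model), and the variable names are my own.

```python
# Alignment for the Portuguese -> English example, written as index pairs.
source = ["Uma", "maçã", "grande", "e", "vermelha"]
target = ["A", "big", "red", "apple"]

# (source_index, target_index) pairs, 0-based; hand-written for illustration.
alignment = [(0, 0), (2, 1), (4, 2), (1, 3)]

# Resolve the index pairs into (source word, target word) pairs.
aligned_words = [(source[s], target[t]) for s, t in alignment]

# Source words with no counterpart in the target ("e" means "and").
aligned_source_indices = {s for s, _ in alignment}
unaligned_source = [
    word for i, word in enumerate(source) if i not in aligned_source_indices
]

# The alignment is monotonic only if the source indices are already in
# increasing order when the pairs are read in target order.
source_order = [s for s, _ in sorted(alignment, key=lambda pair: pair[1])]
is_monotonic = source_order == sorted(source_order)

print(aligned_words)
print("unaligned:", unaligned_source, "| monotonic:", is_monotonic)
```

Both complications from the diagram show up directly in the data: one source word is left unaligned, and the pairs are not in monotonic order.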

As we continue to explore the nuances of machine translation, we start to understand that sequence mappings can be more complex than just these two cases. And as we think more generically about other seq2seq tasks, such as text summarization or prompt/response, we open the door for even more complexity with respect to the alignment task.

So to summarize, alignment is both the process and the end result, i.e. the representation or the mapping itself. Depending on the problem space and model architecture, there will be different implementations for achieving or producing alignment. In the various notebooks we will explore this in more detail.

A more detailed look and comparison of alignment techniques and architectures can be found [here](https://arxiv.org/abs/2112.07806).


# Annotation


In September 2014, Bahdanau et al. [published](https://arxiv.org/abs/1409.0473) *Neural Machine Translation by Jointly Learning to Align and Translate*. In this paper they propose an enhancement to prior encoder-decoder implementations: a mechanism, now commonly called attention, that lets the decoder focus on different parts of the input for each output word.

As we will see, the paper defines annotations as a component of the attention mechanism used for the alignment process. Ultimately, the prediction of the next target word $y_i$ at time $i$ is based on a conditional probability over the set of target words given the input. This conditional probability is defined to depend on the previous word $y_{i-1}$, the decoder hidden state $s_i$, and the context vector $c_i$.

The context vector is where we see the connection with the concept of annotation. The context vector $c_i$ is defined as the weighted sum of the annotations $h_j$. Each annotation is the concatenation of the two corresponding hidden states in the forward and backward layers of the encoder.
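
Written out in the paper's notation as I read it, the two definitions are as follows. Here $g$ stands for the nonlinear output function, $\mathbf{x}$ for the input sentence of length $T_x$, and $\alpha_{ij}$ for the weight given to annotation $h_j$ when predicting $y_i$.

$$ p(y_i \mid y_1, \ldots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i) $$

$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$

Note that the index $i$ runs over output positions while $j$ runs over input positions, so each output word gets its own context vector built from all of the input annotations.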

Note: The encoder is defined as a bidirectional RNN whose gated hidden units they describe as similar to those of an LSTM. This RNN has two hidden layers of equal size, each of which corresponds to one of the directions of the bidirectional RNN. One layer contains the forward hidden states and one contains the backward hidden states.

As such, the hidden states in the two hidden layers can be paired based on their upstream input. Each pair is concatenated, and this concatenation is referred to as the annotation $h_j$, which corresponds to input $x_j$. The full set of annotations is $(h_1,...,h_{T_x})$.

The mathematical notation was a bit confusing, but I believe the annotation is a single vector formed by stacking the forward and backward hidden states, given the equation

$$ h_j = \big[ \overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top \big]^\top $$
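
To make the shape concrete, here is a minimal sketch of that concatenation with made-up numbers. The hidden states below are hypothetical values (not the output of a real RNN), with $n = 2$ units per direction and both lists already indexed by input position $j$.

```python
# Hypothetical hidden states for a 3-word input, with n = 2 units per direction.
forward_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]    # read left to right
backward_states = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]   # read right to left

# h_j is the forward state for x_j followed by the backward state for x_j,
# so every annotation has length 2n.
annotations = [fwd + bwd for fwd, bwd in zip(forward_states, backward_states)]

for j, h_j in enumerate(annotations, start=1):
    print(f"h_{j} = {h_j}")
```

So there is one annotation per input word, and each is a flat vector twice the size of a single direction's hidden state.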

It's a bit confusing, but in the diagram below we can see the encoder's hidden states (annotations) being connected to the hidden states of the decoder through the context vector and ultimately the predicted target word.

<center><img src='./images/annotation_diagram.png' style="width:25%"></center>

>  the annotation $h_j$ contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$.



In the diagram above, the context vector is represented as $\bigoplus$. Mathematically it is expressed as the weighted sum of the annotations. Since each annotation has two parts (one for each direction) and a weighted sum does not change the size, the context vector has the same size as a single annotation.

The weight of each annotation is obtained by normalizing (with a softmax) the scores produced by a feedforward neural network which is jointly trained with the other components of the model. This network is referred to as the alignment model

>  which scores how well the inputs around position $j$ and the output at position $i$ match. The score is based on the RNN hidden state $s_{i-1}$ (just before emitting $y_i$, Eq. (4)) and the $j$-th annotation $h_j$ of the input sentence.

**Note**: The term alignment shows up elsewhere in the context of NLP. Traditionally the term refers to how tokens in the input sequence are matched or "aligned" with tokens in the output sequence. Abstractly I think that intuition applies here; the alignment model calculates the weights which directly impact the probabilities attached to the output tokens. Thus the alignment between the input and output sequence is controlled by the alignment model.
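
Putting the pieces together, the sketch below walks through one decoder step: score every annotation against the previous decoder state, normalize the scores into weights, and take the weighted sum to get the context vector. The scoring function follows the single-layer form $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, which is how I read the paper's appendix. All sizes, inputs, and weights here are assumptions: the weight matrices are random, untrained placeholders, so the resulting weights carry no linguistic meaning.

```python
import math
import random

random.seed(0)  # fixed seed so the placeholder weights are reproducible


def random_matrix(rows, cols):
    """Untrained placeholder weights, for illustration only."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def alignment_scores(prev_state, annotations, W_a, U_a, v_a):
    """e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j) for every annotation h_j."""
    projected_state = matvec(W_a, prev_state)
    scores = []
    for h_j in annotations:
        projected_h = matvec(U_a, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(v * h for v, h in zip(v_a, hidden)))
    return scores


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, annotations):
    """c_i = sum_j alpha_ij * h_j (same length as one annotation)."""
    size = len(annotations[0])
    return [sum(w * h[k] for w, h in zip(weights, annotations)) for k in range(size)]


# Hypothetical sizes: 3 input words, annotations of length 4, decoder state of length 3.
annotations = [[0.1, 0.2, 0.9, 0.8], [0.3, 0.4, 0.7, 0.6], [0.5, 0.6, 0.5, 0.4]]
prev_decoder_state = [0.2, -0.1, 0.4]           # s_{i-1}
W_a = random_matrix(5, 3)                       # projects the decoder state
U_a = random_matrix(5, 4)                       # projects each annotation
v_a = [random.uniform(-0.5, 0.5) for _ in range(5)]

scores = alignment_scores(prev_decoder_state, annotations, W_a, U_a, v_a)
weights = softmax(scores)                       # alpha_i1 ... alpha_iTx
context = context_vector(weights, annotations)  # c_i

print("weights:", [round(w, 3) for w in weights])
print("context:", [round(c, 3) for c in context])
```

In the real model these weights are recomputed for every output position $i$, which is what lets each output word "look at" a different part of the input.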