The attention mechanism was originally popularized by *Neural Machine Translation by Jointly Learning to Align and Translate* (2014), which is the guiding reference for this post. The paper employs an encoder-decoder architecture for English-to-French translation.

https://towardsdatascience.com/attention-from-alignment-practically-explained-548ef6588aa4

Here, alignment is the problem in machine translation that identifies which parts of the input sequence are relevant to each word in the output, whereas translation is the process of using the relevant information to select the appropriate output. 

https://medium.com/data-science-community-srm/understanding-encoders-decoders-with-attention-based-mechanism-c1eb7164c581



Human friendly explanation:

https://txt.cohere.com/what-is-attention-in-language-models

# Motivation

A problem with classical word embeddings is that they provide no mechanism for representing words with multiple meanings. As a result, predictions for translation and other NLP tasks are suboptimal, because a single embedding has to stand in for every sense of a word.

To solve this problem, researchers started considering context (i.e. the neighboring words in a sentence) as the mechanism through which one could derive the corresponding meaning of a word. That is, certain words in a sentence or paragraph influence or imply the meaning of a given word.

With this a priori expectation, words would be expected to have multiple meanings, as there would inevitably be multiple instances of them appearing in different observed contexts. Language models therefore need a mechanism that takes context into account.

As we will see, attention is a mechanism (an implementation) for getting a neural network to understand the context of a word.

> Encouraged by recent advances in caption generation and inspired by recent success in employing attention in machine translation (Bahdanau et al., 2014) and object recognition (Ba et al., 2014; Mnih et al., 2014), we investigate models that can attend to salient part of an image while generating its caption.
>
> (Xu et al., 2015)

# History

## Kalchbrenner et al. (2014) - Variable Input Length Using CNNs
In April of 2014, a paper titled *"A Convolutional Neural Network for Modelling Sentences"* was [published](https://arxiv.org/abs/1404.2188).

In this paper, Kalchbrenner et al. put forward a novel convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN), which handles input sentences of varying length. Additionally, the model is capable of capturing short- and long-range relations.

The ability to handle input sequences with variable sizes is facilitated by sequentially processing one token at a time, or by specially designed convolution kernels.

> This approach can lead to significant problems when the input is truly of varying size with varying information content, such as in Section 10.7 in the transformation of text (Sutskever et al., 2014). In particular, for long sequences it becomes quite difficult to keep track of everything that has already been generated or even viewed by the network. Even explicit tracking heuristics such as proposed by Yang et al. (2016) only offer limited benefit.
>
> https://d2l.ai/chapter_attention-mechanisms-and-transformers/queries-keys-values.html

Additionally, this sequential processing leads to a performance hit compared to modern approaches, which are able to compute in parallel.

## Mnih et al. (2014)

In June 2014, *Recurrent Models of Visual Attention* was published.

## Bahdanau et al. (2015) - Global Soft Attention In NMT

In September 2014, Bahdanau et al. [published](https://arxiv.org/abs/1409.0473) *Neural Machine Translation by Jointly Learning to Align and Translate*.

In their paper, Bahdanau et al. describe their model architecture as an extension of the encoder-decoder architecture.

> The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

The authors go on to describe the differences between hard and soft alignment.

> Consider the source phrase [the man] which was translated into [l’ homme]. Any hard
alignment will map [the] to [l’] and [man] to [homme]. This is not helpful for translation, as one
must consider the word following [the] to determine whether it should be translated into [le], [la],
[les] or [l’]. Our soft-alignment solves this issue naturally by letting the model look at both [the] and
[man], and in this example, we see that the model was able to correctly translate [the] into [l’].

They comment that a natural consequence of soft attention is that differences between input and output sequence lengths are handled by the attention mechanism itself, which opens the door to variable-length sequences.

Additionally, the authors argue that their soft attention mechanism outperforms the prior hard alignment approaches:

<center><img src="images/monotonic_soft_attention.png" style="width:75%"></center>


> We can see from the alignments in Fig. 3 that the alignment of words between English and French
is largely monotonic. We see strong weights along the diagonal of each matrix. However, we also
observe a number of non-trivial, non-monotonic alignments. Adjectives and nouns are typically
ordered differently between French and English, and we see an example in Fig. 3 (a). From this
figure, we see that the model correctly translates a phrase [European Economic Area] into [zone
économique européen]. The RNNsearch was able to correctly align [zone] with [Area], jumping
over the two words ([European] and [Economic]), and then looked one word back at a time to
complete the whole phrase [zone économique européenne].
>
> The strength of the soft-alignment, opposed to a hard-alignment, is evident, for instance, from
Fig. 3 (d). Consider the source phrase [the man] which was translated into [l’ homme]. Any hard
alignment will map [the] to [l’] and [man] to [homme]. This is not helpful for translation, as one
must consider the word following [the] to determine whether it should be translated into [le], [la],
[les] or [l’]. Our soft-alignment solves this issue naturally by letting the model look at both [the] and
[man], and in this example, we see that the model was able to correctly translate [the] into [l’]. We
observe similar behaviors in all the presented cases in Fig. 3. An additional benefit of the soft alignment is that it naturally deals with source and target phrases of different lengths, without requiring a
counter-intuitive way of mapping some words to or from nowhere ([NULL]) (see, e.g., Chapters 4
and 5 of Koehn, 2010).

Note: The attention mechanism proposed in this paper is later described as a type of global attention by Luong et al. (2015). Luong et al. note that, although Bahdanau's mechanism is more complicated than the global attention they propose, Bahdanau et al. were the first to propose global attention in the field of NMT.

>To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT.

## (Luong, et al., 2015)
> Luong, et al., 2015 proposed the “global” and “local” attention. The global attention is similar to the soft attention, while the local one is an interesting blend between hard and soft, an improvement over the hard attention to make it differentiable: the model first predicts a single aligned position for the current target word and a window centered around the source position is then used to compute a context vector.
>
> [source](https://lilianweng.github.io/posts/2018-06-24-attention/)

## Xu et al. (2015) - Applying Soft/Hard Attention To Image Processing

In February 2015, Xu et al. [published](https://arxiv.org/abs/1502.03044) *Show, Attend and Tell: Neural Image Caption Generation with Visual Attention*, which introduces an attention-based model that automatically learns to describe the content of images and is inspired by recent work in machine translation and object detection.

In the paper they describe two attention mechanisms which they apply to caption generation.

> We introduce two attention-based image caption generators under a common framework (Sec. 3.1): 1) a “soft” deterministic attention mechanism trainable by standard back-propagation methods and 2) a “hard” stochastic attention mechanism trainable by maximizing an approximate variational lower bound or equivalently by REINFORCE (Williams, 1992).

## Luong et al. (2015) - Simplified Approach To Local and Global Attention
In August 2015, Luong et al. [published](https://arxiv.org/abs/1508.04025) *Effective Approaches to Attention-based Neural Machine Translation*, which introduces two attention architectures for neural machine translation (NMT): global and local attention.

> This paper examines two simple and effective classes of attentional mechanism: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. We demonstrate the effectiveness of both approaches over the WMT translation tasks between English and German in both directions. With local attention, we achieve a significant gain of 5.0 BLEU points over non-attentional systems which already incorporate known techniques such as dropout. Our ensemble model using different attention architectures has established a new state-of-the-art result in the WMT'15 English to German translation task with 25.9 BLEU points, an improvement of 1.0 BLEU points over the existing best system backed by NMT and an n-gram reranker.
>
> In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a
time. The former approach resembles the model of (Bahdanau et al., 2015) but is simpler architecturally. The latter can be viewed as an interesting blend between the hard and soft attention models proposed in (Xu et al., 2015): it is computationally less expensive than the global model or the soft attention; at the same time, unlike the hard attention, the local attention is differentiable almost everywhere, making it easier to implement and train.

# How Attention Works (Conceptually)

With attention, the basic idea is that some words matter more than others when determining the meaning of a word within a particular context. Put differently, when paying attention to a particular word in a given context, some words are more relevant to that word's meaning than others.

In an overly simplified way, we can think of attention as a weighting mechanism that attaches a weight of importance to the words in a particular context relative to the word we are paying attention to. The words that are more important or relevant have a higher weight than those that are less relevant. Thus, each word maps to an "attention" vector which holds the attention weights for all the other words in the given context.

https://txt.cohere.com/what-is-attention-in-language-models/

https://stats.stackexchange.com/questions/599085/training-transformers-self-attention-weights-vs-embedding-layer

https://txt.cohere.com/what-is-attention-in-language-models/#:~:text=Attention%20is%20a%20very%20clever,embeddings%20into%20contextualized%20word%20embeddings).

Conceptually, the attention mechanism explains the relevance or interdependence between a given input token and the other tokens in the input sequence. The intention is that this understanding of relevance between tokens provides the foundational context required to understand the true meaning of a word.

https://blog.floydhub.com/attention-mechanism/

Though there are several implementations of attention, the consistent approach is that the relevance of the tokens in the input sequence relative to the token in question is expressed as a numeric score between 0 and 1. If the related token has a high attention score, the token is very relevant to the task at hand. If the related token has a low attention score, the token is less relevant to the task at hand. Another important mathematical property is that the attention scores sum to one. Thus the attention score quantifies the amount of "attention" we should pay to a particular token while performing a given task (e.g. translation or question answering).

**Note**: When the scores are produced by a softmax, a token will never have an attention score of exactly zero. This is one of the main disadvantages of the current class of attention approaches. Because every attention score is non-zero, every token is involved in the calculation for the task at hand.
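As a rough illustration of these properties, the sketch below turns a set of raw relevance scores into attention weights with a softmax. The example sentence and the raw scores are made up for illustration and do not come from a trained model.

```python
import numpy as np

def softmax(scores):
    """Convert raw relevance scores into positive weights that sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical raw relevance of each context word with respect to the word "bank"
tokens = ["the", "river", "bank", "was", "muddy"]
raw_scores = np.array([0.1, 2.0, 1.0, 0.1, 1.5])

weights = softmax(raw_scores)
for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.3f}")
```

Every weight is strictly positive and the weights sum to one, so even the least relevant words ("the", "was") still contribute a little to the result.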

In their [paper](https://arxiv.org/pdf/1409.0473.pdf), Bahdanau et al. (2014) show how the attention scores allow the model to dynamically determine the soft-alignment of the input and output sequences. Specifically, they show several instances where the attention model is able to change the order of words from their exact literal translation to a grammatically correct translation.

<center><img src="./images/attention_swapping_word_order.png" style="width:50%"></center>

> Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot
correspond to the words in the source sentence (English) and the generated translation (French),
respectively. Each pixel shows the weight $\alpha_{ij}$ of the annotation of the j-th source word for the i-th
target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three
randomly selected samples among the sentences without any unknown words and of length between
10 and 20 words from the test set.
 


## Types Of Attention

### Implicit vs Explicit

I believe these terms, in the context of data science, are analogs of their neuro-biological counterparts dealing with actual physical human attention mechanisms (for example [this article](https://www.frontiersin.org/articles/10.3389/fpsyg.2015.01861/full)). From what I understand, one way of thinking about attention is to classify the attention mechanism as either implicit or explicit.

Explicit attention mechanisms are voluntarily focused on goal-relevant stimuli from the environment. For example, if we are looking at pictures of vehicles and trying to classify them as car or truck, we will be focusing on the groups of pixels in the image most related to the classification problem; more likely, we will be focusing on the gradient of the loss function. Thinking about translation, on the other hand, explicit attention would focus on the words most relevant to the proper translation.

Implicit attention mechanisms are involuntarily focused on stimuli whose inherent properties inadvertently manipulate the attention mechanism. In the human analogy, the images may invoke an emotional response that causes us to focus on features of the stimuli which are irrelevant to the stated goal of classification. In the literal case of deep neural networks, it's possible that the physical structure of the network impacts the attention mechanism without a direct causal relationship being defined a priori with respect to the stated goal. Thinking about translation, it may be that a particular passage or set of words comes into focus because of some coincidental relationship with the model's current state or architecture.

Putting it simply, with explicit attention, the physical system is designed to adjust the attention mechanism based on the model performance during training. With implicit attention, the attention mechanism is not explicitly altered as part of the training process; instead, any changes to the attention mechanism are an unintentional byproduct.

An article worth reading [link](https://www.linkedin.com/pulse/attention-mechanism-part-1-english-version-hay-hoffman/)

### Hard Vs. Soft

Generally speaking, at the core of every attention mechanism is a function which assigns attention scores. These functions, and thus the attention mechanisms, are classified as being either soft or hard. Soft attention functions are characterized by their continuous, smooth, differentiable nature. Hard attention functions are discrete, non-smooth, and non-differentiable.

https://theaisummer.com/attention/#types-of-attention-hard-vs-soft

Being differentiable is important with respect to neural network architecture because backpropagation cannot pass gradients through a non-differentiable operation.

https://stats.stackexchange.com/questions/386535/why-cant-we-use-back-propagation-in-hard-attention-but-we-can-use-it-in-relu#
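The difference can be sketched in a few lines. In this toy example (the value vectors and weights are invented), soft attention returns a weighted average of all value vectors, while hard attention samples a single index and returns only that vector. The averaging step is a smooth function of the weights; the sampling step is not, which is why hard attention is trained with techniques such as REINFORCE rather than plain backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# One (made-up) value vector per input token
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
# Attention weights over the three tokens (they sum to one)
weights = np.array([0.7, 0.2, 0.1])

# Soft attention: a weighted average of all value vectors.
soft_context = weights @ values

# Hard attention: sample one index according to the weights
# and return only that token's value vector.
chosen = rng.choice(len(values), p=weights)
hard_context = values[chosen]

print("soft:", soft_context)
print("hard:", hard_context, "(token index", chosen, ")")
```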

### Local vs. Global

Global attention describes attention mechanisms that consider all input stimuli (e.g. pixels or words), while local attention considers only a selected subset of them. This selection could come about as a result of hard/soft characteristics or implicit/explicit design.

> Luong, et al., 2015 proposed the “global” and “local” attention. The global attention is similar to the soft attention, while the local one is an interesting blend between hard and soft, an improvement over the hard attention to make it differentiable: the model first predicts a single aligned position for the current target word and a window centered around the source position is then used to compute a context vector.
>
> [source](https://lilianweng.github.io/posts/2018-06-24-attention/)
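A simplified sketch of the windowing idea described in the quote: given alignment scores for every source position, only the positions inside a window around a chosen center receive any weight. The function name, the integer `center`, and the `half_width` parameter are assumptions for illustration. In Luong et al. the aligned position is predicted by the model and the window weights are further shaped, which is omitted here.

```python
import numpy as np

def local_attention_weights(scores, center, half_width):
    """Softmax over the scores inside [center - half_width, center + half_width].

    Positions outside the window receive a weight of exactly zero.
    """
    positions = np.arange(len(scores))
    in_window = np.abs(positions - center) <= half_width
    # Mask out-of-window positions with -inf so that exp() maps them to 0
    masked = np.where(in_window, scores, -np.inf)
    exp_scores = np.exp(masked - masked[in_window].max())
    return exp_scores / exp_scores.sum()

# Made-up alignment scores for a six-token source sentence
scores = np.array([0.5, 1.0, 2.0, 0.3, 1.2, 0.1])
local_weights = local_attention_weights(scores, center=2, half_width=1)
print(local_weights)
```

Global attention corresponds to a window that covers the whole sequence, in which case every position gets a non-zero weight.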

# How Attention Works (Mathematically)

The attention mechanism is designed to work with embeddings not raw inputs. As such we must first calculate the word embeddings for the given input. 

<center><img src="./images/transformer_word_embeddings_example2.png" style="width:75%"></center>

<div style="text-align:right"><a href="https://jalammar.github.io/illustrated-transformer">[img source]</a></div>

Below we can see an example of a matrix which contains the embedding vectors for each word in the given input sequence "Thinking Machines":

**Note**: The dimensions of this matrix are (token count x embedding length).

    
<center><img src="./images/transformer_word_embeddings_example.png" style="width:30%"></center>

    
<div style="text-align:right"><a href="https://jalammar.github.io/illustrated-transformer">[img source]</a></div>

The next step is to calculate three intermediary matrices $Q$, $K$, and $V$, which are referred to as the query, key, and value matrices respectively.

These matrices are produced by applying linear transformations ($W_K$,$W_Q$, and $W_V$) to the input embeddings $X$:

$$ X W_Q = Q $$
$$ X W_K = K $$
$$ X W_V = V $$

As discussed [here](https://jalammar.github.io/illustrated-transformer/) and [here](https://stackoverflow.com/questions/68266490/dimension-of-query-and-key-tensor-in-multiheadattention), the dimensionality of the linear transformation matrices, and thus of the resulting query, key, and value matrices, is flexible (i.e. the number of columns can change). Some architectures choose specific dimensions so that the calculations take on certain characteristics or properties (like ease of use or speed).

That being said, conceptually, I think it makes sense to think of them as having the same dimensionality as $X$ (token count x embedding length). Thus every embedding $x_i$ has a corresponding query $q_i$, key $k_i$, and value $v_i$.
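To make the shapes concrete, here is a minimal NumPy sketch of the three projections. The sizes (`d_model = 4`, `d_k = 3`) and the random matrices are placeholders chosen for illustration; in a real model $W_Q$, $W_K$, and $W_V$ are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 2 tokens ("Thinking", "Machines"),
# embedding length 4, projection width 3
n_tokens, d_model, d_k = 2, 4, 3

X = rng.normal(size=(n_tokens, d_model))  # stand-in for the word embeddings
W_Q = rng.normal(size=(d_model, d_k))     # learned in practice, random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q  # one query row per token
K = X @ W_K  # one key row per token
V = X @ W_V  # one value row per token

print(Q.shape, K.shape, V.shape)  # (2, 3) (2, 3) (2, 3)
```

Setting `d_k` equal to `d_model` gives the "same dimensionality as $X$" picture described above; the row count always stays equal to the token count.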

<center><img src="./images/transformer_query_key_and_value_example2.png" style="width:60%"></center>

We can then plug these matrices into the definition of attention:

$$ \text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V $$

Visually, we can see the dimensionality:

<center><img src="./images/transformer_attention_example.png" style="width:50%"></center>

Focusing on the numerator we have:

<center><img src="./images/transformer_attention_example2.png" style="width:50%"></center>


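Multiplying the query matrix by the transpose of the key matrix gives a square (token count x token count) matrix.

Each element of this matrix sits at the intersection of one query and one key: the entry in row $i$ and column $j$ is the dot product of query $q_i$ and key $k_j$, which measures how strongly token $i$ matches token $j$.

The attention scores are obtained by dividing these dot products by $\sqrt{d_k}$ and applying the softmax function to each row.

Which then gives us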

<center><img src="./images/transformer_attention_example3.png" style="width:50%"></center>

<center><img src="./images/transformer_attention_example4.png" style="width:20%"></center>


**Note**: The softmax function ensures that the values in each row of the matrix sum to one. This is an important mathematical property, because it allows us to treat each row of the attention matrix as a probability distribution over the tokens in the sequence.

In addition to the paper, this [article](https://medium.com/analytics-vidhya/understanding-attention-in-transformers-models-57bada0cce3e) was helpful in understanding. I borrowed a few diagrams as well.
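The steps above can be put together in a short NumPy sketch of scaled dot-product attention. The inputs are random placeholders and the helper names (`softmax_rows`, `scaled_dot_product_attention`) are chosen for this example, so treat it as an illustration of the formula rather than a reference implementation.

```python
import numpy as np

def softmax_rows(x):
    """Apply a numerically stable softmax to each row of x."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the weight matrix."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (token count x token count)
    weights = softmax_rows(scores)    # each row sums to one
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3))  # placeholder queries for 2 tokens
K = rng.normal(size=(2, 3))  # placeholder keys
V = rng.normal(size=(2, 3))  # placeholder values

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)              # 2 x 2 attention matrix
print(weights.sum(axis=1))  # row sums
```

Each output row is a weighted average of the value rows, with the weights taken from the matching row of the attention matrix.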

# Queries, Keys, and Values

Neural networks can be designed in a way that allows them to accept either fixed or variable input sequence lengths. However, the ability to implement the latter was not available right away.

## (Kalchbrenner et al., 2014)
In April of 2014, a paper titled *"A Convolutional Neural Network for Modelling Sentences"* was [published](https://arxiv.org/abs/1404.2188).


Variable size is addressed by sequentially processing one token at a time, or by specially designed convolution kernels.

This approach can lead to significant problems when the input is truly of varying size with varying information content, such as in Section 10.7 in the transformation of text (Sutskever et al., 2014). In particular, for long sequences it becomes quite difficult to keep track of everything that has already been generated or even viewed by the network. Even explicit tracking heuristics such as proposed by Yang et al. (2016) only offer limited benefit.

https://d2l.ai/chapter_attention-mechanisms-and-transformers/queries-keys-values.html

# History

# Problem: Fixed length Buffer

> An encoder neural network reads and encodes a source sentence into a fixed-length vector. A decoder then outputs a translation from the encoded vector. The whole encoder–decoder system, which consists of the encoder and the decoder for a language pair, is jointly trained to maximize the probability of a correct translation given a source sentence. A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.
>
> (Bahdanau et al., 2014)

## Cho et al. (2014b)
Cho et al. (2014b) showed that indeed the performance of
a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.

## (Bahdanau et al., 2014)
In September 2014, a paper titled *"Neural machine translation by jointly learning to align and translate"* was [published](https://arxiv.org/abs/1409.0473).

Bahdanau et al. acknowledge the issues of the fixed-length vector:

> The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. 

Additionally they propose a novel enhancement to the encoder-decoder architecture which is built around an attention mechanism.

> In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
>
> By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout ..., which can be selectively retrieved by the decoder accordingly.

They emphasize that this mechanism bypasses the bottleneck of a fixed-length vector:

> The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of (context) vectors and chooses a subset of these vectors adaptively (based on the alignment scores) while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.

At the center of their proposed architecture are annotations. For each token in the input sequence, the encoder (a bi-directional RNN) computes an annotation. Each annotation is the concatenation of the corresponding forward and backward hidden states of the encoder's RNNs.

> In this way, the annotation $h_j$ contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation $h_j$ will be focused on the words around $x_j$ . This sequence of annotations is used by the decoder and the alignment model later to compute the context vector.

The context vector is computed as the weighted sum of the annotations (represented by $\bigoplus$). In this case, the weights $\alpha$ are calculated as the softmax of the alignment scores of the annotations.

**Note**: Use of the softmax function continues forward into newer attention mechanisms.
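The annotation, alignment, and context vector steps can be sketched as follows. This follows the additive alignment model from the paper, $e_{j} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, but all sizes and parameter values below are random placeholders, and the encoder hidden states are simulated rather than produced by an actual bi-directional RNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 5 source tokens, small hidden dimensions
T_x, enc_dim, dec_dim, align_dim = 5, 4, 3, 6

# Simulated forward and backward encoder hidden states, one row per source token
h_forward = rng.normal(size=(T_x, enc_dim))
h_backward = rng.normal(size=(T_x, enc_dim))

# Annotation h_j: concatenation of the forward and backward states
annotations = np.concatenate([h_forward, h_backward], axis=1)  # (T_x, 2 * enc_dim)

# Previous decoder hidden state s_{i-1} (placeholder)
s_prev = rng.normal(size=dec_dim)

# Alignment model parameters (learned jointly in the paper, random here)
W_a = rng.normal(size=(align_dim, dec_dim))
U_a = rng.normal(size=(align_dim, 2 * enc_dim))
v_a = rng.normal(size=align_dim)

# Alignment score for every source position j
energies = np.tanh(W_a @ s_prev + annotations @ U_a.T) @ v_a   # (T_x,)

# Softmax over source positions gives the weights alpha
alpha = np.exp(energies - energies.max())
alpha /= alpha.sum()

# Context vector: weighted sum of the annotations
context = alpha @ annotations   # (2 * enc_dim,)

print(alpha)
print(context.shape)
```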


The context vector is then used as an input to the decoder's hidden state $s_t$ to predict the target value $y_t$.

> The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

<center><img src='./images/attention_architecture_example.png' style='width:25%'></center>

The alignment model is trained jointly with the rest of the network by maximizing the probability of the correct translation given the source sentence. The alignment score itself quantifies how well the inputs around a source position match the output at a target position.

The authors show how the attention scores allow the model to dynamically determine the soft-alignment of the input and output sequences. Specifically, they show several instances where the attention model is able to change the order of words from their exact literal translation to a grammatically correct translation.

<center><img src="./images/attention_swapping_word_order.png" style="width:50%"></center>

> Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot
correspond to the words in the source sentence (English) and the generated translation (French),
respectively. Each pixel shows the weight $\alpha_{ij}$ of the annotation of the j-th source word for the i-th
target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three
randomly selected samples among the sentences without any unknown words and of length between
10 and 20 words from the test set.
 


## Problem: RNNs consider the entire input sequence (reference window)
https://theaisummer.com/attention/#the-limitations-of-rnns

Although they used different terminology from Vaswani et al. (2017), the basic premise of the attention mechanism is conceptually the same.

Conceptually, we can consider a trained or calibrated attention mechanism as a database. As a user, we can query the database to see if any information exists based on the search criteria. In the context of attention, the query is asking the database to return records or values which match a particular key. In the context of machine translation or question/answer, our query is constructed to perform a lookup of the most likely next word (value) given the current token in question (key).
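The database analogy can be sketched as a "soft lookup". Instead of returning the single record whose key matches exactly, the query is compared against every key and the result is a weighted blend of all values. The keys, values, and query below are invented for illustration, and `soft_lookup` is a hypothetical helper name.

```python
import numpy as np

# A tiny "database": one key vector and one scalar value per record (made up)
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
values = np.array([10.0, 20.0, 40.0])

def soft_lookup(query, keys, values):
    """Blend all values, weighting each by how well its key matches the query."""
    scores = keys @ query                  # dot-product similarity per record
    w = np.exp(scores - scores.max())      # softmax, shifted for stability
    w /= w.sum()
    return w @ values, w

query = np.array([4.0, -4.0])              # points mostly toward the first key
result, match_weights = soft_lookup(query, keys, values)

print(match_weights)
print(result)
```

The first record dominates because its key is the best match for the query, but the other records still contribute slightly, which is the behavior that distinguishes attention from an exact-match lookup.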