# Attention mechanism

Attention is, to some extent, motivated by how we pay visual attention to different regions of an image or correlate words in one sentence. Take the picture of a Shiba Inu as an example.

<p><center><img src='_images/L270195_1.png'></center></p>

A Shiba Inu in a men’s outfit. The credit of the original photo goes to Instagram @mensweardog.

Human visual attention allows us to focus on a certain region with “high resolution” (e.g., look at the pointy ear in the yellow box) while perceiving the surrounding image in “low resolution” (e.g., now how about the snowy background and the outfit?), and then adjust the focal point or do the inference accordingly. Given a small patch of an image, pixels in the rest provide clues about what should be displayed there. We expect to see a pointy ear in the yellow box because we have seen a dog’s nose, another pointy ear on the right, and Shiba’s mystery eyes (stuff in the red boxes). However, the sweater and blanket at the bottom would not be as helpful as those doggy features.

Similarly, we can explain the relationship between words in one sentence or close context. When we see “eating”, we expect to encounter a food word very soon. The color term describes the food, but it is probably not as strongly related to “eating” directly.

<p><center><img src='_images/L270195_2.png'></center></p>

One word “attends” to other words in the same sentence differently.

In a nutshell, attention in deep learning can be broadly interpreted as a vector of importance weights: in order to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “attends to” as you may have read in many papers) other elements and take the sum of their values weighted by the attention vector as the approximation of the target.

The optic nerve of a primate’s visual system receives massive sensory input, far exceeding what the brain can fully process. Fortunately, not all stimuli are created equal. Focalization and concentration of consciousness have enabled primates to direct attention to objects of interest, such as prey and predators, in the complex visual environment. The ability to pay attention to only a small fraction of the information has evolutionary significance, allowing human beings to live and succeed.

Scientists have been studying attention in the cognitive neuroscience field since the 19th century. Notably, the Nadaraya-Watson kernel regression in 1964 is a simple demonstration of machine learning with attention mechanisms.

Attention is a scarce resource: at the moment you are reading this text and ignoring the rest. Thus, similar to money, your attention is being paid with an opportunity cost. Attention is the keystone in the arch of life and holds the key to any work’s exceptionalism.

Since economics studies the allocation of scarce resources, we are in the era of the attention economy, where human attention is treated as a limited, valuable, and scarce commodity that can be exchanged. Numerous business models have been developed to capitalize on it. On music or video streaming services, we either pay attention to their ads or pay money to hide them. For growth in the world of online games, we either pay attention to participate in battles, which attract new gamers, or pay money to instantly become powerful. Nothing comes for free.

All in all, information in our environment is not scarce; attention is. When inspecting a visual scene, our optic nerve receives information on the order of $10^8$ bits per second, far exceeding what our brain can fully process. Fortunately, our ancestors had learned from experience (also known as data) that not all sensory inputs are created equal. Throughout human history, the capability of directing attention to only a fraction of information of interest has enabled our brain to allocate resources more smartly to survive, to grow, and to socialize, such as detecting predators, prey, and mates.

To explain how our attention is deployed in the visual world, a two-component framework has emerged and been pervasive. This idea dates back to William James in the 1890s, who is considered the “father of American psychology” **[[James, 2007]](https://d2l.ai/chapter_references/zreferences.html#james-2007)**. In this framework, subjects selectively direct the spotlight of attention using both the *nonvolitional cue* and *volitional cue*.

The nonvolitional cue is based on the saliency and conspicuity of objects in the environment. Imagine there are five objects in front of you: a newspaper, a research paper, a cup of coffee, a notebook, and a book. While all the paper products are printed in black and white, the coffee cup is red. In other words, this coffee is intrinsically salient and conspicuous in this visual environment, automatically and involuntarily drawing attention. So you bring the fovea (the center of the macula where visual acuity is highest) onto the coffee.

<p><center><img src='_images/L270195_3.png'></center></p>

After drinking coffee, you become caffeinated and want to read a book. So you turn your head, refocus your eyes, and look at the book. Different from the above case where the coffee biases you towards selecting based on saliency, in this task-dependent case you select the book under cognitive and volitional control. Using the volitional cue based on variable selection criteria, this form of attention is more deliberate. It is also more powerful with the subject’s voluntary effort.

<p><center><img src='_images/L270195_4.png'></center></p>

## Queries, Keys, and Values

Inspired by the nonvolitional and volitional attention cues that explain the attentional deployment, in the following we will describe a framework for designing attention mechanisms by incorporating these two attention cues.

To begin with, consider the simpler case where only nonvolitional cues are available. To bias selection over sensory inputs, we can simply use a parameterized fully-connected layer or even non-parameterized max or average pooling.

Therefore, what sets attention mechanisms apart from those fully-connected layers or pooling layers is the inclusion of the volitional cues. In the context of attention mechanisms, we refer to volitional cues as *queries*. Given any query, attention mechanisms bias selection over sensory inputs (e.g., intermediate feature representations) via *attention pooling*. These sensory inputs are called *values* in the context of attention mechanisms. More generally, every value is paired with a *key*, which can be thought of as the nonvolitional cue of that sensory input. We can design attention pooling so that the given query (volitional cue) interacts with the keys (nonvolitional cues), and this interaction guides the bias selection over values (sensory inputs).

<p><center><img src='_images/L270195_5.png'></center></p>

Attention mechanisms bias selection over values (sensory inputs) via attention pooling, which incorporates queries (volitional cues) and keys (nonvolitional cues).

To recapitulate, the interactions between queries (volitional cues) and keys (nonvolitional cues) result in attention pooling. The attention pooling selectively aggregates values (sensory inputs) to produce the output.

Note that there are many alternatives for the design of attention mechanisms. For instance, we can design a non-differentiable attention model that can be trained using reinforcement learning methods [Mnih et al., 2014]. Given the dominance of the framework in the above figure, models under this framework will be the center of our attention next.

## Attention Pooling

The Nadaraya-Watson kernel regression model proposed in 1964 is a simple yet complete example for demonstrating machine learning with attention mechanisms. 

To keep things simple, let us consider the following regression problem: given a dataset of input-output pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, how do we learn $f$ to predict the output $\hat{y} = f(x)$ for any new input $x$? As a running example, suppose the outputs are generated by the nonlinear function $y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon$, where $\epsilon$ is a noise term.

### Average Pooling

We begin with perhaps the world’s “dumbest” estimator for this regression problem: using average pooling to average over all the training outputs:

$$f(x) = \frac{1}{n}\sum_{i=1}^n y_i$$
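The regression setup and the average-pooling baseline can be sketched in a few lines of plain Python. This is only an illustration: the input range $[0, 5)$, the noise standard deviation of 0.5, and the helper names `target`, `make_dataset`, and `average_pooling` are assumptions made here, not details given in the text.

```python
import math
import random

def target(x):
    """Noise-free part of the function in the text: 2*sin(x) + x**0.8."""
    return 2 * math.sin(x) + x ** 0.8

def make_dataset(n, seed=0, noise_std=0.5):
    """Sample n pairs (x_i, y_i); the range [0, 5) and noise level are assumed."""
    rng = random.Random(seed)
    xs = sorted(rng.uniform(0, 5) for _ in range(n))
    ys = [target(x) + rng.gauss(0, noise_std) for x in xs]
    return xs, ys

def average_pooling(x, ys):
    """Ignore the query x entirely and return the mean of the training outputs."""
    return sum(ys) / len(ys)

xs, ys = make_dataset(50)
# The prediction is the same constant for every query.
preds = [average_pooling(x, ys) for x in (0.5, 2.5, 4.5)]
```

Because the query never enters the computation, the estimator predicts one constant everywhere.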

### Nonparametric Attention Pooling

Obviously, average pooling ignores the inputs $x_i$. A better idea was proposed by Nadaraya [Nadaraya, 1964] and Watson [Watson, 1964]: weigh the outputs $y_i$ according to their input locations:

$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i$$

A key $x_i$ that is closer to the given query $x$ will get more attention via a larger attention weight assigned to the key’s corresponding value $y_i$. To gain intuition about attention pooling, consider a Gaussian kernel defined as:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{u^2}{2})$$

Plugging the Gaussian kernel into the above equation gives:

$$\begin{split}\begin{aligned} f(x) &= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i\end{aligned}\end{split}$$

Notably, Nadaraya-Watson kernel regression is a nonparametric model; thus the above equation is an example of nonparametric attention pooling.
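The Gaussian-kernel form above translates almost directly into code. The following is a minimal sketch using plain Python lists; the function names and the toy keys and values are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nadaraya_watson(x, keys, values):
    """Nonparametric attention pooling with a Gaussian kernel.

    The attention weight of key x_i is softmax(-(x - x_i)^2 / 2).
    """
    weights = softmax([-0.5 * (x - k) ** 2 for k in keys])
    prediction = sum(w * v for w, v in zip(weights, values))
    return prediction, weights

keys = [0.0, 1.0, 2.0, 3.0]        # training inputs x_i
values = [0.0, 10.0, 20.0, 30.0]   # training outputs y_i
pred, weights = nadaraya_watson(1.1, keys, values)
```

For the query `1.1`, the key at `1.0` receives the largest weight, so the prediction is pulled towards its value.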

### Parametric Attention Pooling

Nonparametric Nadaraya-Watson kernel regression enjoys the *consistency* benefit: given enough data this model converges to the optimal solution. Nonetheless, we can easily integrate learnable parameters into attention pooling.

As an example, slightly different from the above equation, in the following the distance between the query $x$ and the key $x_i$ is multiplied by a learnable parameter $w$:

$$\begin{split}\begin{aligned}f(x) &= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.\end{aligned}\end{split}$$
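To see what the parameter $w$ does, the sketch below computes the attention weights for a few fixed values of $w$. No training is performed here; in practice $w$ would be learned, for example by gradient descent on a squared loss. The function name `parametric_weights` and the toy keys are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def parametric_weights(x, keys, w):
    """Attention weights softmax(-((x - x_i) * w)^2 / 2) for a scalar parameter w."""
    return softmax([-0.5 * ((x - k) * w) ** 2 for k in keys])

keys = [0.0, 1.0, 2.0, 3.0]
uniform = parametric_weights(1.2, keys, w=0.0)  # w = 0 reduces to average pooling
broad = parametric_weights(1.2, keys, w=0.5)    # small |w|: weights spread out
sharp = parametric_weights(1.2, keys, w=3.0)    # large |w|: nearest key dominates
```

With $w = 1$ the nonparametric model is recovered, so the parametric version can be read as a learned kernel width.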

## Attention Scoring Functions

<p><center><img src='_images/L270195_6.png'></center></p>

Mathematically, suppose that we have a query $\mathbf{q} \in \mathbb{R}^q$ and $m$ key-value pairs $(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$, where any $\mathbf{k}_i \in \mathbb{R}^k$ and any $\mathbf{v}_i \in \mathbb{R}^v$. The attention pooling $f$ is instantiated as a weighted sum of the values:

$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v$$

where the attention weight (scalar) for the query $\mathbf{q}$ and key $\mathbf{k}_i$ is computed by the softmax operation of an attention scoring function $a$ that maps two vectors to a scalar:

$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}$$

As we can see, different choices of the attention scoring function $a$ lead to different behaviors of attention pooling.
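One way to make this concrete is to write attention pooling with the scoring function passed in as an argument. The sketch below is illustrative only: `attention_pooling`, `neg_sq_dist`, and `constant_score` are hypothetical names, and the toy keys and values are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pooling(query, keys, values, score_fn):
    """Return sum_i alpha(q, k_i) * v_i and the weights alpha(q, k_i)."""
    weights = softmax([score_fn(query, k) for k in keys])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

def neg_sq_dist(q, k):
    """Gaussian-kernel style score: larger when q and k are close."""
    return -0.5 * sum((qi - ki) ** 2 for qi, ki in zip(q, k))

def constant_score(q, k):
    """A score that ignores q and k, which reduces to average pooling."""
    return 0.0

keys = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out_near, w_near = attention_pooling([1.0, 1.0], keys, values, neg_sq_dist)
out_avg, w_avg = attention_pooling([1.0, 1.0], keys, values, constant_score)
```

Swapping `score_fn` changes the weights while the weighted-sum structure stays the same.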

### Additive Attention

In general, when queries and keys are vectors of different lengths, we can use additive attention as the scoring function. Given a query $q∈\mathbb{R}^q$ and a key $k∈\mathbb{R}^k$, the additive attention scoring function is:

$$a(\mathbf q, \mathbf k) = \mathbf w_v^\top \text{tanh}(\mathbf W_q\mathbf q + \mathbf W_k \mathbf k) \in \mathbb{R}$$

where the learnable parameters are $W_q∈\mathbb{R}^{h×q}$, $W_k∈\mathbb{R}^{h×k}$, and $w_v∈\mathbb{R}^{h}$.
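As a rough sketch, the additive score can be computed for one query-key pair as follows. The parameter values are fixed by hand purely for illustration (in a real model they would be learned), and the helper names `matvec` and `additive_score` are assumptions.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(q, k, W_q, W_k, w_v):
    """a(q, k) = w_v^T tanh(W_q q + W_k k), a single scalar."""
    projected = zip(matvec(W_q, q), matvec(W_k, k))
    hidden = [math.tanh(a + b) for a, b in projected]
    return sum(wi * hi for wi, hi in zip(w_v, hidden))

# Hypothetical parameters: hidden size h = 2, query length 3, key length 2.
W_q = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1]]
W_k = [[0.5, -0.5], [0.2, 0.2]]
w_v = [1.0, -1.0]

q = [1.0, 0.0, -1.0]
k = [1.0, 1.0]
score = additive_score(q, k, W_q, W_k, w_v)
```

The query and key have different lengths here; both are projected to the shared hidden size $h$ before being added.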

### Scaled Dot-Product Attention

A more computationally efficient design for the scoring function is simply the dot product. However, the dot product operation requires that both the query and the key have the same vector length, say $d$. Assume that all the elements of the query and the key are independent random variables with zero mean and unit variance. Then the dot product of the two vectors has zero mean and a variance of $d$. To ensure that the variance of the dot product remains one regardless of vector length, the *scaled dot-product attention* scoring function $a(\mathbf q, \mathbf k) = \mathbf{q}^\top \mathbf{k} /\sqrt{d}$ divides the dot product by $\sqrt{d}$.
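The variance claim can be checked with a short derivation. Writing $q_i$ and $k_i$ for the elements of the query and key (notation introduced here), and using only the independence, zero-mean, and unit-variance assumptions stated above:

$$\begin{aligned} \mathbb{E}[\mathbf q^\top \mathbf k] &= \sum_{i=1}^d \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\ \mathrm{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \left(\mathbb{E}[q_i]\,\mathbb{E}[k_i]\right)^2 = 1 \cdot 1 - 0 = 1, \\ \mathrm{Var}(\mathbf q^\top \mathbf k) &= \sum_{i=1}^d \mathrm{Var}(q_i k_i) = d, \\ \mathrm{Var}\left(\frac{\mathbf q^\top \mathbf k}{\sqrt{d}}\right) &= \frac{d}{d} = 1. \end{aligned}$$

The third line uses the fact that the products $q_i k_i$ are independent across $i$, so their variances add.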

In practice, we often think in minibatches for efficiency, such as computing attention for $n$ queries and $m$ key-value pairs, where queries and keys are of length $d$ and values are of length $v$. The scaled dot-product attention of queries $Q∈\mathbb{R}^{n×d}$, keys $K∈\mathbb{R}^{m×d}$, and values $V∈\mathbb{R}^{m×v}$ is:

$$\mathrm{softmax}\left(\frac{\mathbf Q \mathbf K^\top }{\sqrt{d}}\right) \mathbf V \in \mathbb{R}^{n\times v}$$
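The minibatch formula can be sketched with nested lists standing in for matrices. This is a minimal, unoptimized illustration; a real implementation would typically use a tensor library, and the toy matrices below are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V row by row.

    Q is n x d, K is m x d, V is m x v; the output is n x v.
    """
    d = len(Q[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs, all_weights

Q = [[1.0, 0.0], [0.0, 1.0]]                               # n = 2, d = 2
K = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]                 # m = 3
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]    # v = 3
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each query aligns with one key, so each output row is close to the matching row of `V`.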

### Summary of Popular Attention Mechanisms

Below is a summary table of several popular attention mechanisms and corresponding alignment score functions:

<p><center><img src='_images/L270195_7.png'></center></p>

Here is a summary of broader categories of attention mechanisms:

<p><center><img src='_images/L270195_8.png'></center></p>

## Bahdanau Attention

Inspired by the idea of learning to align, Bahdanau et al. proposed a differentiable attention model without the severe unidirectional alignment limitation [[Bahdanau et al., 2014](https://d2l.ai/chapter_references/zreferences.html#bahdanau-cho-bengio-2014)]. When predicting a token, if not all the input tokens are relevant, the model aligns (or attends) only to parts of the input sequence that are relevant to the current prediction. This is achieved by treating the context variable as an output of attention pooling.

<p><center><img src='_images/L270195_9.png'></center></p>

Layers in an RNN encoder-decoder model with Bahdanau attention.
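In the notation of the attention pooling above, the context variable can be written as follows. The symbols are assumptions introduced here, following the usual presentation of this model: $\mathbf{h}_t$ is the encoder hidden state at input time step $t$ (serving as both key and value), $\mathbf{s}_{t'-1}$ is the decoder hidden state at the previous decoding step (serving as the query), $T$ is the input length, and $\alpha$ is the attention weight from an additive scoring function:

$$\mathbf{c}_{t'} = \sum_{t=1}^{T} \alpha(\mathbf{s}_{t'-1}, \mathbf{h}_t)\, \mathbf{h}_t$$

A new context variable is therefore computed at every decoding step, rather than reusing one fixed summary of the input.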

## Multi-Head Attention

In practice, given the same set of queries, keys, and values we may want our model to combine knowledge from different behaviors of the same attention mechanism, such as capturing dependencies of various ranges (e.g., shorter-range vs. longer-range) within a sequence. Thus, it may be beneficial to allow our attention mechanism to jointly use different representation subspaces of queries, keys, and values.

To this end, instead of performing a single attention pooling, queries, keys, and values can be transformed with $h$ independently learned linear projections. Then these $h$ projected queries, keys, and values are fed into attention pooling in parallel. In the end, $h$ attention pooling outputs are concatenated and transformed with another learned linear projection to produce the final output. This design is called *multi-head attention*, where each of the $h$ attention pooling outputs is a *head* **[[Vaswani et al., 2017]](https://d2l.ai/chapter_references/zreferences.html#vaswani-shazeer-parmar-ea-2017)**. Using fully-connected layers to perform the learnable linear transformations, the figure below describes multi-head attention.

<p><center><img src='_images/L270195_10.png'></center></p>

Multi-head attention, where multiple heads are concatenated then linearly transformed.

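More formally, given a query $q∈\mathbb{R}^{d_q}$, a key $k∈\mathbb{R}^{d_k}$, and a value $v∈\mathbb{R}^{d_v}$, each attention head $h_i$ ($i=1,…,h$) is computed as:

$$\mathbf{h}_i = f(\mathbf W_i^{(q)}\mathbf q, \mathbf W_i^{(k)}\mathbf k,\mathbf W_i^{(v)}\mathbf v) \in \mathbb R^{p_v}$$

where $f$ is attention pooling, such as additive attention or scaled dot-product attention, and the learnable parameters are $\mathbf W_i^{(q)}\in\mathbb R^{p_q\times d_q}$, $\mathbf W_i^{(k)}\in\mathbb R^{p_k\times d_k}$, and $\mathbf W_i^{(v)}\in\mathbb R^{p_v\times d_v}$.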
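A small sketch of this computation for a single query is given below, using scaled dot-product attention as the pooling function $f$. The two heads and all weight matrices are hand-picked, hypothetical values chosen so that the result is easy to follow; `W_o` stands for the final output projection.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(q, keys, values):
    """Scaled dot-product attention pooling for one query."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def multi_head_attention(q, keys, values, heads, W_o):
    """Project q, keys, values per head, pool, concatenate, then apply W_o."""
    concat = []
    for W_q, W_k, W_v in heads:
        pq = matvec(W_q, q)
        pk = [matvec(W_k, k) for k in keys]
        pv = [matvec(W_v, v) for v in values]
        concat.extend(attention(pq, pk, pv))
    return matvec(W_o, concat)

# Hypothetical setup: head 1 looks at coordinate 0, head 2 at coordinate 1.
heads = [
    ([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]]),
    ([[0.0, 1.0]], [[0.0, 1.0]], [[0.0, 1.0]]),
]
W_o = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection, for readability
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out = multi_head_attention([1.0, 0.0], keys, values, heads, W_o)
```

The first head favors the first key, while the second head sees equal scores and averages its two projected values, so the heads behave differently on the same inputs.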

## Self-Attention

Self-attention, in its basic scaled form, takes an input matrix X and produces attention scores between the items in X. In the figure below, X is a 3x4 matrix, where 3 is the number of tokens and 4 is the embedding size. Q denotes the queries, K the keys, and V the values. X is multiplied by three weight matrices, shown as theta, phi, and g, to produce Q, K, and V. Multiplying the queries (Q) by the keys (K) yields the attention score matrix. This can be viewed as a database lookup in which queries are matched against keys to measure numerically how strongly the items are related. Multiplying the attention scores by the V matrix produces the final output of this attention mechanism. It is called self-attention because of its single shared input X: Q, K, and V are all computed from X.

<p><center><img src='_images/L270195_11.png'></center></p>
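The steps described above can be sketched for a 3x4 input as follows. Several details are assumptions made for illustration: that theta, phi, and g produce Q, K, and V in that order, that the scores are scaled by the square root of the projection size and normalized with a row-wise softmax, and that the weight matrices are identity matrices.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    result = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

def self_attention(X, W_theta, W_phi, W_g):
    """Q, K, V are all computed from the same input X."""
    Q, K, V = matmul(X, W_theta), matmul(X, W_phi), matmul(X, W_g)
    d = len(Q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    A = softmax_rows(scores)          # 3 x 3 attention score matrix
    return matmul(A, V), A            # output is 3 x 4

X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]            # 3 tokens, embedding size 4
I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out, A = self_attention(X, I4, I4, I4)
```

The attention matrix `A` has one row per token, and each row says how much that token draws from every token in the same sequence, including itself.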

Imagine that we feed a sequence of tokens into attention pooling so that the same set of tokens act as queries, keys, and values. Specifically, each query attends to all the key-value pairs and generates one attention output. Since the queries, keys, and values come from the same place, this performs self-attention [Lin et al., 2017b][Vaswani et al., 2017], which is also called intra-attention [Cheng et al., 2016][Parikh et al., 2016][Paulus et al., 2017].

<p><center><img src='_images/L270195_12.png'></center></p>

Comparing CNN (padding tokens are omitted), RNN, and self-attention architectures.

Instead of looking for an input-output sequence association/alignment, we are now looking for scores between the elements of the sequence, as depicted below:

<p><center><img src='_images/L270195_13.png'></center></p>

## Transformer

Unlike Bahdanau attention for sequence-to-sequence learning, the transformer adds positional encodings to the input (source) and output (target) sequence embeddings before feeding them into the encoder and the decoder, which stack modules based on self-attention.

<p><center><img src='_images/L270195_14.png'></center></p>

The transformer architecture.

At a high level, the transformer encoder is a stack of multiple identical layers, where each layer has two sublayers (either is denoted as $sublayer$). The first is a multi-head self-attention pooling and the second is a position-wise feed-forward network. Specifically, in the encoder self-attention, queries, keys, and values are all from the outputs of the previous encoder layer. Inspired by the ResNet design, a residual connection is employed around both sublayers. In the transformer, for any input $x∈\mathbb{R}^d$ at any position of the sequence, we require that $sublayer(x)∈\mathbb{R}^d$ so that the residual connection $x+sublayer(x)∈\mathbb{R}^d$ is feasible. This addition from the residual connection is immediately followed by layer normalization **[[Ba et al., 2016]](https://d2l.ai/chapter_references/zreferences.html#ba-kiros-hinton-2016)**. As a result, the transformer encoder outputs a $d$-dimensional vector representation for each position of the input sequence.
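The "residual connection followed by layer normalization" step can be sketched for a single position as below. This simplified version omits the learnable gain and bias that layer normalization usually carries, and `toy_ffn` is a made-up stand-in for a real sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one d-dimensional vector to zero mean and (nearly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def add_norm(x, sublayer):
    """Compute layer_norm(x + sublayer(x)); the sublayer must preserve the size d."""
    y = sublayer(x)
    if len(y) != len(x):
        raise ValueError("sublayer must return a vector of the same length as x")
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

def toy_ffn(x):
    """Hypothetical position-wise sublayer: an elementwise affine map plus ReLU."""
    return [max(0.0, 2.0 * xi - 1.0) for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
out = add_norm(x, toy_ffn)
```

The length check mirrors the requirement in the text that the sublayer output stays in $\mathbb{R}^d$, since otherwise the addition is undefined.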

The transformer decoder is also a stack of multiple identical layers with residual connections and layer normalizations. Besides the two sublayers described in the encoder, the decoder inserts a third sublayer, known as the encoder-decoder attention, between these two. In the encoder-decoder attention, queries are from the outputs of the previous decoder layer, and the keys and values are from the transformer encoder outputs. In the decoder self-attention, queries, keys, and values are all from the outputs of the previous decoder layer. However, each position in the decoder is only allowed to attend to positions in the decoder up to that position. This *masked* attention preserves the auto-regressive property, ensuring that the prediction only depends on those output tokens that have been generated.
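The effect of masking on the attention weights can be sketched as follows. The sketch assumes that "up to that position" includes the position itself, and it applies the softmax only over the visible prefix of each row. Implementations more commonly set the masked scores to a very large negative number before the softmax, which gives the same weights. The score matrix is invented for illustration.

```python
import math

def causal_attention_weights(scores):
    """Row i attends only to columns 0..i; later columns get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights

# Raw scores for a sequence of 3 decoder positions (rows are queries).
scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [0.0, 0.0, 0.0]]
W = causal_attention_weights(scores)
```

Large scores on future positions, such as the `9.0` in the second row, have no influence on the weights, which is what keeps decoding auto-regressive.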

## References

1. [https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)