
Relative positional encoding


Standard self-attention:

Let $x$ be of shape $(B, T, D_x)$. $B$ is the batch dim, $T$ some spatial (time) dim, $D_x$ the input feature dim. Let $Q,K,V$ be of shape $(D_x,D_y)$, and $b_q,b_k,b_v$ be of shape $(D_y)$. Let $D_y$ be divisible by number of heads $H$. $D_y$ is the output feature dim.

We drop any transposes or reshapes in the following formulas and hope they become clear from the context, or from the specified output shape.

$$\begin{align} e &= (Q x + b_q) (K x + b_k) = x Q K x + b_k Q x + b_q K x + b_q b_k, &\text{ shape } (B,T,T,H) \\ w &= \operatorname{softmax}(e), &\text{ shape } (B,T,T,H) \\ y &= w (V x + b_v), &\text{ shape } (B,T,D_y) \end{align}$$
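For reference, a minimal NumPy sketch of these equations (not the RETURNN implementation; the usual $1/\sqrt{D_y/H}$ scaling is omitted here just as in the formulas above, and all names are illustrative):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Q, K, V, b_q, b_k, b_v, H):
    """x: (B, T, Dx); Q, K, V: (Dx, Dy); b_q, b_k, b_v: (Dy,); H heads."""
    B, T, Dx = x.shape
    Dy = Q.shape[1]
    d = Dy // H  # per-head feature dim
    # Project and split heads: (B, T, H, d).
    q = (x @ Q + b_q).reshape(B, T, H, d)
    k = (x @ K + b_k).reshape(B, T, H, d)
    v = (x @ V + b_v).reshape(B, T, H, d)
    e = np.einsum("bihd,bjhd->bijh", q, k)   # energies, (B, T, T, H)
    w = softmax(e, axis=2)                   # softmax over the key frames
    y = np.einsum("bijh,bjhd->bihd", w, v)   # (B, T, H, d)
    return y.reshape(B, T, Dy)               # (B, T, Dy)
```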


RETURNN SelfAttentionLayer supports key_shift where you can pass the output of RelativePositionalEncodingLayer. This was proposed in Shaw et al, Self-Attention with Relative Position Representations, 2018 and implemented in Tensor2Tensor (see _relative_attention_inner), and RETURNN follows that implementation more or less. The default is a trainable encoding matrix with clipping limit 16, i.e. the encoding matrix shape is $(2 \cdot \operatorname{clipping} + 1, D_y/H)$. It is shared across all heads in our implementation. This is added to the keys, i.e. to $K x + b_k$, using the entry that corresponds to the relative position of each pair of frames in $T$. In the implementation, the encoding matrix is then transformed into a tensor $r$ of shape $(T,T,D_y/H)$.
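A hedged sketch of how such a tensor $r$ can be built from the clipped encoding matrix (this mirrors the idea, not the exact RETURNN code; the function name is made up):

```python
import numpy as np

def relative_encoding(encoding_matrix, T, clipping=16):
    """encoding_matrix: (2*clipping + 1, d) trainable, shared over heads; returns r: (T, T, d)."""
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]      # (T, T), values t_k - t_q
    offsets = np.clip(offsets, -clipping, clipping) + clipping   # map into [0, 2*clipping]
    return encoding_matrix[offsets]                              # (T, T, d)
```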

Ignoring the biases:

$$\begin{align} e &= (Q x) (K x + r) = x Q K x + r Q x, &\text{ shape } (B,T,T,H) \end{align}$$
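A sketch of this energy computation, continuing with the per-head tensors from the first sketch (illustrative names; cf. _relative_attention_inner):

```python
import numpy as np

def energies_with_key_shift(q, k, r):
    """q, k: (B, T, H, d), already projected with heads split; r: (T, T, d)."""
    e_content = np.einsum("bihd,bjhd->bijh", q, k)   # the x Q K x term
    e_position = np.einsum("bihd,ijd->bijh", q, r)   # the r Q x term
    return e_content + e_position                    # (B, T, T, H)
```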

In RETURNN-common, nn.RelPosSelfAttention and nn.relative_positional_encoding follow the paper Dai et al, Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, 2019 and the original implementation.

The derivation starts by looking at how the absolute positional encoding $U$ gets into the self-attention, and then adapts the resulting equation slightly, to end up with:

$$\begin{align} e &= (Q x + b_q) (K x + b_k) + (Q x + b_{v'}) (R r + b_u) \\ &= x Q K x + b_k Q x + b_q K x + b_q b_k + x Q R r + b_u Q x + b_{v'} R r + b_{v'} b_u, \\ &\doteq x Q K x + (b_k + b_u) Q x + b_q K x + x Q R r + b_{v'} R r, & \text{ shape } (B,T,T,H) \end{align}$$

Any constant shifts ( $b_q b_k$, $b_{v'} b_u$ ) can be removed, as they do not matter for the softmax.

As you can see, $b_u$ is actually redundant. We currently keep it in our implementation anyway, as we follow ESPnet here (#132, ESPnet code). $R, b_u, b_{v'}$ correspond to linear_pos, pos_bias_u, pos_bias_v in our code.

Note that $(Q x + b_{v'}) (R r + b_u)$ can be computed more efficiently by not representing $r$ as shape $(T,T,D_y/H)$ but instead as shape $(2 \cdot T - 1,D_y/H)$. See the code for details.
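To illustrate that point (this is not the actual code, which uses an equivalent padding/reshape shift; the names and the index convention are assumptions): with $r$ stored as a table of shape $(2 \cdot T - 1, D_y/H)$ indexed by relative position, the positional term needs only one matmul against that table followed by a gather.

```python
import numpy as np

def rel_position_term(q_rel, r_table):
    """q_rel: (B, T, H, d), the query side of the positional term (Q x + b_v', heads split).
    r_table: (2*T - 1, d), e.g. R r, where entry p stands for relative position p - (T - 1)
    (the ordering convention is an assumption here; implementations differ)."""
    B, T, H, d = q_rel.shape
    # One matmul against the whole table: (B, H, T, 2*T - 1).
    scores = np.einsum("bihd,pd->bhip", q_rel, r_table)
    # index[i, j] = (j - i) + (T - 1) selects the entry for relative position j - i.
    index = np.arange(T)[None, :] - np.arange(T)[:, None] + (T - 1)   # (T, T)
    gathered = scores[:, :, np.arange(T)[:, None], index]             # (B, H, T, T)
    return gathered.transpose(0, 2, 3, 1)                             # (B, T, T, H)
```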

Compared to Shaw et al, 2018, we have the additional transformation $R$, and the additional biases. Also, $r$ is fixed here and not trainable, although you could generalize this and also make it trainable.

T5 method

T5 method, from Raffel et al, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2020: a trainable scalar bias per head, indexed by the bucketed relative distance, is added directly to the attention energies.
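A simplified sketch of this bias (assumption: plain clipping instead of T5's logarithmic bucketing of larger distances; names are illustrative, not the T5 or RETURNN API):

```python
import numpy as np

def t5_style_bias(bias_table, T, clipping):
    """bias_table: (2*clipping + 1, H) trainable scalars, one per head and distance bucket."""
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]     # (T, T), values t_k - t_q
    buckets = np.clip(offsets, -clipping, clipping) + clipping  # (T, T), in [0, 2*clipping]
    return bias_table[buckets]                                  # (T, T, H), add to the energies
```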

ALiBi

ALiBi, from Press et al, Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021: a fixed, non-trainable bias proportional to the negative distance between query and key positions is added to the attention energies, with a different slope per head.
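A sketch of the causal-case bias (slopes as the geometric sequence from the paper, assuming $H$ is a power of two; names are illustrative):

```python
import numpy as np

def alibi_bias(T, H):
    """Per-head linear bias for causal attention; nothing here is trainable."""
    slopes = 2.0 ** (-8.0 * np.arange(1, H + 1) / H)      # (H,), geometric sequence
    dist = np.arange(T)[:, None] - np.arange(T)[None, :]  # (T, T), query pos minus key pos
    bias = -np.maximum(dist, 0)[:, :, None] * slopes      # (T, T, H), zero on the diagonal
    return bias  # add to the energies before the (causally masked) softmax
```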

RoPE

RoPE, from Su et al, RoFormer: Enhanced Transformer with Rotary Position Embedding, 2021: instead of adding a bias to the energies, the queries and keys are rotated pairwise by position-dependent angles, so that their dot product depends only on the relative position.
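A sketch of the rotation (adjacent feature dims are paired as in the paper; some implementations pair the dims differently; names are illustrative):

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """x: (..., T, d) with d even, e.g. queries or keys per head; returns rotated x."""
    T, d = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)          # (d/2,), per-pair frequencies
    angles = np.arange(T)[:, None] * inv_freq[None, :]    # (T, d/2), position times frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # adjacent dims form a 2D pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                  # rotate each pair by its angle
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
# Apply to q and k (per head) before computing the energies; v stays unchanged.
```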