# RoPE (Rotary Positional Embeddings)

 

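Before introducing RoPE, let us start with a question: what problem does RoPE solve?

In RoPE, the starting point is to "implement relative position encoding by means of absolute position encoding". This is both theoretically elegant and practically useful. For example, it is the main reason RoPE can be extended to linear **Attention**.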

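To achieve this, suppose absolute position information is added to $q$ and $k$ through the following operations:
 $$\bar{q}_m = f_q(x_m,m)$$
 $$\bar{k}_n = f_k(x_n,n)$$

These two functions attach absolute position information for positions `m` and `n`, producing $\bar{q}_m$ and $\bar{k}_n$. We then require that the inner product of $\bar{q}_m$ and $\bar{k}_n$ depends on the positions only through the relative position $m-n$, that is:

$$ \langle f_q(x_m,m), f_k(x_n,n) \rangle = g(x_m,x_n,m-n) \qquad (3) $$

> Note: we are free to choose the functions $f_q$, $f_k$ and $g$. The only requirement is that $g$ can be written as a function of $(x_m,x_n,m-n)$, in other words that the inner product of $\bar{q}_m$ and $\bar{k}_n$ depends on position only through $m-n$.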

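How do we solve for the function $f$? The walkthrough here follows Su Jianlin's blog post and the original paper:
- [Transformer升级之路：2、博采众长的旋转式位置编码](https://spaces.ac.cn/archives/8265/comment-page-1) (Su Jianlin's blog post on rotary position embedding)
- Original paper: [RoFormer](https://arxiv.org/pdf/2104.09864.pdf)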



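Here we state the paper's result directly:
$$f_q(x_m,m)=(W_qx_m)e^{im \theta} $$
$$f_k(x_n,n)=(W_kx_n)e^{in \theta} $$

Substituting these two formulas into $g(x_m,x_n,m-n)$ gives:
$$g(x_m,x_n,m-n) = \langle f_q(x_m,m), f_k(x_n,n) \rangle = \mathrm{Re}\left[(W_qx_m)(W_kx_n)^*e^{i(m-n)\theta}\right]$$
> Here $\mathrm{Re}$ denotes the real part of a complex number, and $(W_kx_n)^*$ denotes the complex conjugate of $(W_kx_n)$.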

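This formula shows that $g(x_m,x_n,m-n)$ depends on the positions only through $m-n$, which is exactly what we wanted: relative position information has been introduced.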
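The claim can be checked numerically for the 2-dimensional case. The following is a minimal sketch, not code from the paper: the helper name `score` and the values chosen for `q`, `k` and `theta` are illustrative assumptions, with `q` and `k` standing in for $W_qx_m$ and $W_kx_n$ written as complex numbers.

```python
import cmath

def score(q: complex, k: complex, m: int, n: int, theta: float) -> float:
    """Real part of f_q(., m) times the conjugate of f_k(., n)."""
    fq = q * cmath.exp(1j * m * theta)  # (W_q x_m) e^{i m theta}
    fk = k * cmath.exp(1j * n * theta)  # (W_k x_n) e^{i n theta}
    return (fq * fk.conjugate()).real

# Arbitrary illustrative values (assumptions, not taken from the paper).
q = complex(0.3, -1.2)
k = complex(0.7, 0.5)
theta = 0.1

a = score(q, k, m=5, n=2, theta=theta)
b = score(q, k, m=105, n=102, theta=theta)  # same offset m - n = 3
c = score(q, k, m=5, n=3, theta=theta)      # different offset m - n = 2

# Closed form Re[q k^* e^{i (m - n) theta}] with m - n = 3.
direct = (q * k.conjugate() * cmath.exp(1j * 3 * theta)).real
```

Under these assumptions `a` and `b` agree with `direct` up to floating-point error, while `c` differs because the offset changed.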

Next, we rewrite this formula using Euler's formula $e^{ix} = \cos(x) + i\sin(x)$:
$$f_q(x_m,m)=(W_qx_m)e^{im \theta}=(W_qx_m)[\cos(m \theta)+i\sin(m \theta)]$$

> Note: in this formula $W_q$ is a matrix and $x_m$ is a vector.

Now assume $x_m$ is 2-dimensional, i.e. $x_m = [x_m^{(1)}, x_m^{(2)}]$. The formula above can then be rewritten as:
$$\begin{aligned} f_q(x_m,m) &= (W_qx_m)[\cos(m \theta)+i\sin(m \theta)] \\ &= \begin{pmatrix} W_q^{11} & W_q^{12} \\ W_q^{21} & W_q^{22} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} [\cos(m \theta)+i\sin(m \theta)] \\ &= (q_m^{(1)},q_m^{(2)}) [\cos(m \theta)+i\sin(m \theta)] \qquad (7) \end{aligned}$$

Writing the vector $(q_m^{(1)},q_m^{(2)})$ as the complex number $q_m^{(1)}+iq_m^{(2)}$ and substituting into formula (7) gives:
$$\begin{aligned} f_q(x_m,m) &= (q_m^{(1)},q_m^{(2)})[\cos(m \theta)+i\sin(m \theta)] \\ &= (q_m^{(1)}+iq_m^{(2)})(\cos(m \theta)+i\sin(m \theta)) \\ &= [q_m^{(1)}\cos(m \theta)-q_m^{(2)}\sin(m \theta)] + i[q_m^{(1)}\sin(m \theta)+q_m^{(2)}\cos(m \theta)] \qquad (8) \end{aligned}$$

Converting formula (8) back to vector form, with the real part as the first component and the imaginary part as the second:
$$\begin{aligned} f_q(x_m,m) &= [q_m^{(1)}\cos(m \theta)-q_m^{(2)}\sin(m \theta)] + i[q_m^{(1)}\sin(m \theta)+q_m^{(2)}\cos(m \theta)] \\ &= \begin{pmatrix} q_m^{(1)}\cos(m \theta)-q_m^{(2)}\sin(m \theta) \\ q_m^{(1)}\sin(m \theta)+q_m^{(2)}\cos(m \theta) \end{pmatrix} \\ &= \begin{pmatrix} \cos(m\theta) & -\sin(m \theta) \\ \sin(m \theta) & \cos(m \theta) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \end{pmatrix} \qquad (9) \end{aligned}$$
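To make the equivalence between the complex form and the rotation-matrix form concrete, here is a small sketch. The function names `rotate_complex` and `rotate_matrix` and the input values are hypothetical and chosen only for illustration.

```python
import math

def rotate_complex(q1: float, q2: float, m: int, theta: float):
    """Multiply q1 + i*q2 by e^{i m theta} and return (real, imag)."""
    z = complex(q1, q2) * complex(math.cos(m * theta), math.sin(m * theta))
    return z.real, z.imag

def rotate_matrix(q1: float, q2: float, m: int, theta: float):
    """Apply the 2x2 rotation matrix for angle m*theta to (q1, q2)."""
    c, s = math.cos(m * theta), math.sin(m * theta)
    return c * q1 - s * q2, s * q1 + c * q2

out_c = rotate_complex(1.0, 2.0, m=3, theta=0.5)
out_m = rotate_matrix(1.0, 2.0, m=3, theta=0.5)
```

Both routes give the same pair of numbers, and because the operation is a pure rotation the length of the vector is unchanged.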

This is the final result given in the paper. Similarly, for the key:
$$f_k(x_n,n) = \begin{pmatrix} \cos(n\theta) & -\sin(n \theta) \\ \sin(n \theta) & \cos(n \theta) \end{pmatrix} \begin{pmatrix} k_n^{(1)} \\ k_n^{(2)} \end{pmatrix} \qquad (10)$$

For an even dimension $d>2$, the components are grouped into $d/2$ pairs and each pair is rotated with its own angle $\theta_i$ (the paper uses $\theta_i = 10000^{-2(i-1)/d}$ for $i = 1, \dots, d/2$):
$$f_q(x_m,m) = R_{\theta,m}^d q_m = \begin{pmatrix} \cos(m\theta_1) & -\sin(m \theta_1) & 0 & 0 & \cdots & 0 & 0 \\ \sin(m \theta_1) & \cos(m \theta_1) & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos(m\theta_2) & -\sin(m \theta_2) & \cdots & 0 & 0 \\ 0 & 0 & \sin(m \theta_2) & \cos(m \theta_2) & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos(m\theta_{d/2}) & -\sin(m \theta_{d/2}) \\ 0 & 0 & 0 & 0 & \cdots & \sin(m \theta_{d/2}) & \cos(m \theta_{d/2}) \end{pmatrix} \begin{pmatrix} q_m^{(1)} \\ q_m^{(2)} \\ q_m^{(3)} \\ q_m^{(4)} \\ \vdots \\ q_m^{(d-1)} \\ q_m^{(d)} \end{pmatrix} \qquad (11)$$
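The block-diagonal rotation can be sketched in plain Python to confirm that the relative-position property carries over to $d$ dimensions. This is an illustrative sketch only: the helper names `rope_angles`, `rope_rotate` and `dot` are hypothetical, the vectors are arbitrary, and the angles follow the $10000^{-2(i-1)/d}$ schedule with a zero-based index.

```python
import math

def rope_angles(d: int, base: float = 10000.0):
    """Angles theta_1..theta_{d/2}; index i here is zero-based."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def rope_rotate(x, pos: int, base: float = 10000.0):
    """Apply the block-diagonal rotation for position `pos`, pair by pair."""
    d = len(x)
    assert d % 2 == 0, "RoPE pairs up components, so d must be even"
    out = []
    for i, theta in enumerate(rope_angles(d, base)):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out += [c * x1 - s * x2, s * x1 + c * x2]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Arbitrary illustrative query and key vectors with d = 4.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.1]

s1 = dot(rope_rotate(q, 7), rope_rotate(k, 4))        # offset 3
s2 = dot(rope_rotate(q, 1007), rope_rotate(k, 1004))  # offset 3, shifted
s3 = dot(rope_rotate(q, 7), rope_rotate(k, 5))        # offset 2
```

Here `s1` and `s2` match up to floating-point error because both pairs of positions have the same offset, while `s3` differs.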

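Formula (11) is everything RoPE does when encoding position. Note that $R_{\theta,m}^d$ is a sparse matrix, so applying it to $q$ and $k$ through a full matrix multiplication would be inefficient. Instead, **RoPE can be implemented as follows**, where $\otimes$ denotes element-wise multiplication: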

$$R_{\theta,m}^d x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_{d} \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_1) \\ \cos(m\theta_1) \\ \cos(m\theta_2) \\ \cos(m\theta_2) \\ \vdots \\ \cos(m\theta_{d/2}) \\ \cos(m\theta_{d/2}) \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_{d} \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_1) \\ \sin(m\theta_1) \\ \sin(m\theta_2) \\ \sin(m\theta_2) \\ \vdots \\ \sin(m\theta_{d/2}) \\ \sin(m\theta_{d/2}) \end{pmatrix} \qquad (12)$$
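As a sanity check, the element-wise form can be compared against the explicit 2x2 blocks. The sketch below uses hypothetical helper names (`rope_elementwise`, `rope_blockwise`) and an arbitrary input vector, and it assumes the $10000^{-2(i-1)/d}$ angle schedule.

```python
import math

def rope_elementwise(x, pos: int, base: float = 10000.0):
    """Apply RoPE at position `pos` using only element-wise products."""
    d = len(x)
    assert d % 2 == 0, "RoPE pairs up components, so d must be even"
    thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]
    # Each angle is repeated twice so the cos/sin vectors line up with x.
    cos = [math.cos(pos * t) for t in thetas for _ in range(2)]
    sin = [math.sin(pos * t) for t in thetas for _ in range(2)]
    # Companion vector (-x2, x1, -x4, x3, ...).
    x_swapped = []
    for i in range(0, d, 2):
        x_swapped += [-x[i + 1], x[i]]
    return [a * c + b * s for a, c, b, s in zip(x, cos, x_swapped, sin)]

def rope_blockwise(x, pos: int, base: float = 10000.0):
    """Reference: multiply each pair by its 2x2 rotation block."""
    d = len(x)
    out = []
    for i in range(d // 2):
        t = pos * base ** (-2.0 * i / d)
        c, s = math.cos(t), math.sin(t)
        out += [c * x[2 * i] - s * x[2 * i + 1], s * x[2 * i] + c * x[2 * i + 1]]
    return out

x = [0.5, -1.0, 2.0, 0.25, 1.0, 3.0]  # arbitrary illustrative input, d = 6
fast = rope_elementwise(x, pos=11)
ref = rope_blockwise(x, pos=11)
max_diff = max(abs(a - b) for a, b in zip(fast, ref))
```

The element-wise version needs only $O(d)$ multiplications per vector, whereas multiplying by the full $d \times d$ matrix would cost $O(d^2)$ even though most entries are zero.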

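The paper also provides an intuitive figure illustrating how RoPE works, shown below: the components of each vector are grouped into pairs, and position information is embedded by treating each pair as a complex coordinate and rotating it.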

![alt text](./_img/RoPE.png)


# Code implementation in LLaMA

```python
import torch

def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
    """Precompute the frequency tensor for complex exponentials (cis) with given dimensions.

    This function calculates a frequency tensor with complex exponentials using the given dimension 'dim'
    and the end index 'end'. The 'theta' parameter scales the frequencies.
    The returned tensor contains complex values in complex64 data type.

    Args:
        dim (int): Dimension of the frequency tensor.
        end (int): End index for precomputing frequencies.
        theta (float, optional): Scaling factor for frequency computation. Defaults to 10000.0.

    Returns:
        torch.Tensor: Precomputed frequency tensor with complex exponentials.

    """
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)
    freqs = torch.outer(t, freqs).float()
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Reshape frequency tensor for broadcasting it with another tensor.

    This function reshapes the frequency tensor to have the same shape as the target tensor 'x'
    for the purpose of broadcasting the frequency tensor during element-wise operations.

    The input freqs_cis tensor is assumed to be of shape (max_seqlen, dim),
    and the first seqlen elements will be sliced, but dim must match x.

    Args:
        freqs_cis (torch.Tensor): Frequency tensor to be reshaped.
        x (torch.Tensor): Target tensor for broadcasting compatibility.

    Returns:
        torch.Tensor: Reshaped frequency tensor.

    """
    ndim = x.ndim
    assert 0 <= 1 < ndim
    seqlen = x.shape[1]
    freqs_cis = freqs_cis[0:seqlen]
    assert freqs_cis.shape == (seqlen, x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply rotary embeddings to input tensors using the given frequency tensor.

    This function applies rotary embeddings to the given query 'xq' and key 'xk' tensors using the provided
    frequency tensor 'freqs_cis'. The input tensors are reshaped as complex numbers, and the frequency tensor
    is reshaped for broadcasting compatibility. The resulting tensors contain rotary embeddings and are
    returned as real tensors.

    Args:
        xq (torch.Tensor): Query tensor to apply rotary embeddings.
        xk (torch.Tensor): Key tensor to apply rotary embeddings.
        freqs_cis (torch.Tensor): Precomputed frequency tensor for complex exponentials.

    Returns:
        Tuple[torch.Tensor, torch.Tensor]: Tuple of modified query tensor and key tensor with rotary embeddings.

    """
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)
```
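To see what the complex-number route computes without needing PyTorch, the following torch-free sketch mirrors the same steps for a single sequence of shape `(seqlen, dim)`. It is only an approximation of the idea, not the LLaMA code: the names `precompute_freqs_cis_py` and `apply_rotary_py` are hypothetical, batching and attention heads are omitted, and `cmath.rect` plays the role that `torch.polar` plays above.

```python
import cmath

def precompute_freqs_cis_py(dim: int, end: int, theta: float = 10000.0):
    """Return an end x (dim/2) table of unit complex numbers e^{i * t * freq_j}."""
    freqs = [1.0 / (theta ** (j / dim)) for j in range(0, dim, 2)]
    return [[cmath.rect(1.0, t * f) for f in freqs] for t in range(end)]

def apply_rotary_py(x, freqs_cis):
    """Rotate each row of x (one row per position) pair by pair."""
    out = []
    for t, row in enumerate(x):
        # View consecutive components (x0, x1), (x2, x3), ... as complex numbers.
        pairs = [complex(row[2 * j], row[2 * j + 1]) for j in range(len(row) // 2)]
        rotated = [z * w for z, w in zip(pairs, freqs_cis[t])]
        # Flatten back to interleaved real and imaginary parts.
        out.append([v for z in rotated for v in (z.real, z.imag)])
    return out

# Three positions, each holding the same vector, so only the position differs.
x = [[1.0, 0.0, 1.0, 0.0] for _ in range(3)]
freqs_cis = precompute_freqs_cis_py(dim=4, end=3)
y = apply_rotary_py(x, freqs_cis)
```

With `dim=4` the two frequencies are `1.0` and `0.01`, so the row at position 0 is unchanged and the row at position 2 has its first pair rotated by 2 radians and its second pair by 0.02 radians.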

#### References:
1. [LLaMA explained (LLaMA解读)](https://www.zhihu.com/question/599713780/answer/3163616360)