### Attention formula

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

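1. `torch.matmul` and the `@` operator do the same thing.
2. We divide by $\sqrt{d_k}$ to keep the scores in a range where softmax still has usable gradients. If the entries of $Q$ and $K$ have roughly unit variance, each dot product in $QK^T$ has variance of about $d_k$. Scores of very large magnitude push softmax into saturation, where its gradient is close to 0 (vanishing gradients). Dividing by $\sqrt{d_k}$ brings the variance back to about 1, so the scores stay on roughly the same scale as the inputs.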
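
A quick numerical check of both points, as a minimal sketch. It assumes the entries of Q and K are independent with unit variance (drawn with `torch.randn`); `d_k = 256` and the sample count are arbitrary choices for illustration.

```python
import math
import torch

torch.manual_seed(0)
d_k = 256
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

# One dot product per row: the variance grows with d_k ...
raw = (q * k).sum(dim=-1)
# ... and dividing by sqrt(d_k) brings it back to about 1.
scaled = raw / math.sqrt(d_k)

raw_var = raw.var().item()
scaled_var = scaled.var().item()
print(raw_var, scaled_var)

# torch.matmul and @ are the same operation, including batched inputs.
a = torch.randn(2, 3, 4)
b = torch.randn(2, 4, 5)
same = torch.equal(torch.matmul(a, b), a @ b)
print(same)
```

The unscaled variance should land near `d_k`, the scaled one near 1.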

In [2]:
### Simplified version
import math
import torch
import torch.nn as nn

class selfAttention(nn.Module):

    def __init__(self,hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

        ## Initialize the three linear projection layers for Q, K and V
        self.query = nn.Linear(hidden_dim,hidden_dim)
        self.key = nn.Linear(hidden_dim,hidden_dim)
        self.value = nn.Linear(hidden_dim,hidden_dim)


    def forward(self, X):
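        # X shape: (batch_size, seq_len, hidden_dim)
        # Project X into Q, K and V
        Q = self.query(X)
        K = self.key(X)
        V = self.value(X)
        # Q, K, V shape: (batch_size, seq_len, hidden_dim)

        # attention_value: (batch_size, seq_len, seq_len)
        ## This is the numerator QK^T.
        # K.transpose(-1, -2) has shape (batch_size, hidden_dim, seq_len)
        attention_value = torch.matmul(
            Q,K.transpose(-1,-2)  # permute would also work
        )  # only the last two dimensions are swapped

        # Scale, then take softmax over the last dim (the keys),
        # so the weights of each query position sum to 1.
        # (batch_size, seq_len, seq_len)
        attention_weight = torch.softmax(attention_value / math.sqrt(self.hidden_dim),dim=-1)
        # Why divide by the square root: it keeps the scores from growing with hidden_dim.
        # Large scores saturate softmax, and its gradient then vanishes.

        # print(attention_weight)

        # (batch_size, seq_len, hidden_dim)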
        output = torch.matmul(attention_weight,V)
        return output

## test code
X = torch.rand(3,2,4)
#print(X)

net = selfAttention(4)
net(X)

tensor([[[-0.2493,  0.3050,  0.2408, -0.3359],
         [-0.2715,  0.3370,  0.2590, -0.2732]],

        [[-0.2894,  0.3821,  0.3016, -0.2133],
         [-0.2953,  0.3865,  0.3124, -0.2028]],

        [[-0.3192,  0.4417,  0.1703, -0.3778],
         [-0.3195,  0.4415,  0.1687, -0.3757]]], grad_fn=<UnsafeViewBackward0>)
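
On which dimension softmax should normalize: with scores of shape `(batch_size, seq_len, seq_len)`, `dim=-1` runs over the key positions, so every query position ends up with a distribution that sums to 1. A small self-contained sketch (random `Q` and `K` stand in for the projected tensors):

```python
import math
import torch

torch.manual_seed(0)
Q = torch.rand(3, 2, 4)   # (batch_size, seq_len, hidden_dim)
K = torch.rand(3, 2, 4)

scores = Q @ K.transpose(-1, -2) / math.sqrt(Q.size(-1))  # (3, 2, 2)
weights = torch.softmax(scores, dim=-1)

# Summing over the last dim gives 1 for every (batch, query) pair.
row_sums = weights.sum(dim=-1)
print(row_sums)
```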

In [3]:
import torch
torch.mm?

Docstring:
mm(input, mat2, *, out=None) -> Tensor

Performs a matrix multiplication of the matrices :attr:`input` and :attr:`mat2`.

If :attr:`input` is a :math:`(n \times m)` tensor, :attr:`mat2` is a
:math:`(m \times p)` tensor, :attr:`out` will be a :math:`(n \times p)` tensor.

.. note:: This function does not :ref:`broadcast <broadcasting-semantics>`.
          For broadcasting matrix products, see :func:`torch.matmul`.

Supports strided and sparse 2-D tensors as inputs, autograd with
respect to strided inputs.

This operator supports :ref:`TensorFloat32<tf32_on_ampere>`.

On certain ROCm devices, when using float16 inputs this module will use :ref:`different precision<fp16_on_mi200>` for backward.

Args:
    input (Tensor): the first matrix to be matrix multiplied
    mat2 (Tensor): the second matrix to be matrix multiplied

Keyword args:
    out (Tensor, optional): the output tensor.

Example::

    >>> mat1 = torch.randn(2, 3)
    >>> mat2 = torch.randn(3, 3)
    >>>
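
The note in the docstring about broadcasting is the reason the attention code uses `torch.matmul` / `@` rather than `torch.mm`. A short sketch of the difference, with arbitrary shapes chosen for illustration:

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 3)
b = torch.randn(3, 4)

# For plain 2-D matrices, mm and @ agree.
same_2d = torch.allclose(torch.mm(a, b), a @ b)

# mm accepts only 2-D inputs; a batched (3-D) tensor raises an error.
batched = torch.randn(5, 2, 3)
try:
    torch.mm(batched, b)
    mm_failed = False
except RuntimeError:
    mm_failed = True

# matmul broadcasts b over the batch dimension: (5, 2, 3) @ (3, 4) -> (5, 2, 4)
out = torch.matmul(batched, b)
print(same_2d, mm_failed, out.shape)
```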

In [4]:
nn.Linear??

Init signature:
nn.Linear(
    in_features: int,
    out_features: int,
    bias: bool = True,
    device=None,
    dtype=None,
) -> None
Source:
class Linear(Module):
    r"""Applies a linear transformation to the incoming data: :math:`y = xA^T + b`

    This module supports :ref:`TensorFloat32<tf32_on_ampere>`.

    On certain ROCm devices, when using float16 inputs this module will us

### Optimization


In [5]:
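## Attention V2: optimization
# For a small network, the Q, K and V projections can be fused into a single linear layer, which is more efficient.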

class SelfAttentionV2(nn.Module):
    def __init__(self,hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

        self.proj = nn.Linear(hidden_dim,hidden_dim*3)

    def forward(self,X):
        # X: (batch_size, seq, hidden_dim)

        QKV = self.proj(X) # (batch_size, seq, hidden_dim*3)
        Q,K,V = torch.split(QKV,self.hidden_dim,dim=-1)  # split into three chunks of size hidden_dim each
        attention_weight = torch.softmax(Q@ K.transpose(-1,-2)/math.sqrt(self.hidden_dim),dim=-1)
        print(attention_weight)
        output = attention_weight@V
        return output

X = torch.randn(2,3,4)
net2 = SelfAttentionV2(4)
net2(X)

tensor([[[0.3031, 0.3631, 0.3338],
         [0.2562, 0.3624, 0.3814],
         [0.2310, 0.3939, 0.3751]],

        [[0.2875, 0.3755, 0.3370],
         [0.0538, 0.8179, 0.1282],
         [0.2403, 0.4491, 0.3106]]], grad_fn=<SoftmaxBackward0>)


tensor([[[-0.4611,  0.3685,  0.0833, -0.3642],
         [-0.4932,  0.3379,  0.1168, -0.3600],
         [-0.5041,  0.3190,  0.1316, -0.3631]],

        [[-0.9265,  0.0881,  0.2654, -0.7706],
         [-1.8330, -0.5163,  0.6153, -1.0257],
         [-1.0865, -0.0123,  0.3202, -0.8213]]], grad_fn=<UnsafeViewBackward0>)
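
The fused projection computes the same thing as three separate `nn.Linear` layers whose weights are stacked along the output dimension. A minimal sketch to check this, relying on the fact that `nn.Linear` stores its weight as `(out_features, in_features)`; the variable names are illustrative only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim = 4
proj = nn.Linear(hidden_dim, hidden_dim * 3)
X = torch.randn(2, 3, hidden_dim)

# Fused path: one matmul, then split the last dimension into Q, K, V.
Q, K, V = torch.split(proj(X), hidden_dim, dim=-1)

# Reference path: slice the fused weight and bias into three blocks of rows.
Wq, Wk, Wv = torch.split(proj.weight, hidden_dim, dim=0)
bq, bk, bv = torch.split(proj.bias, hidden_dim, dim=0)
Q_ref = X @ Wq.T + bq
K_ref = X @ Wk.T + bk
V_ref = X @ Wv.T + bv

print(torch.allclose(Q, Q_ref, atol=1e-6))
```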

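### Adding more details on top of this

Self-attention has a few more details:
* Dropout is sometimes applied.
* An attention_mask is usually applied, because samples are padded to a common length.
* In MultiHeadAttention, besides the Q, K and V matrices there is also an output projection matrix.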

In [6]:
# dropout
# attention_mask
# output projection
class SelfAttentionV3(nn.Module):
    def __init__(self,hidden_dim,dropOutRate=0.1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.proj = nn.Linear(hidden_dim,hidden_dim*3)
        # Where does dropout go? On the attention weights, after softmax and before multiplying by V.
        # The attention mask hides padded tokens.
        # The output projection is optional.
        self.attentionDropout = nn.Dropout(dropOutRate)
        # Sentences have different lengths, so the computation needs a mask.

        self.output_proj = nn.Linear(hidden_dim,hidden_dim)

    def forward(self,X,mask=None):
        # X: (batch_size, seq, hidden_dim)
        QKV = self.proj(X)
        Q,K,V = torch.split(QKV,self.hidden_dim,dim=-1)
        
        # (batch_size,seq,seq)
        attention_weight = Q@K.transpose(-1,-2)/math.sqrt(self.hidden_dim)
        if mask is not None:
            # Give masked positions a very negative score,
            # so their weight is practically 0 after softmax.
            # masked_fill writes -1e9 wherever the condition (mask == 0) is True.
            attention_weight = attention_weight.masked_fill(mask==0,-1e9)
        #print(attention_weight)  # inspect the masked scores

        attention_weight = torch.softmax(attention_weight,dim=-1)
        print(attention_weight)  # inspect the weights after masking and softmax


        # BERT does it this way:
        # apply dropout first, then multiply by V
        attention_weight = self.attentionDropout(attention_weight)

        attention_result = attention_weight@ V
        output = self.output_proj(attention_result)
        return output
    
X = torch.randn(3,4,2)
b = torch.tensor(
    [
        [1,1,1,0],  # first sentence: 3 real tokens, 1 padded position
        [1,1,0,0],  # second sentence: 2 padded positions
        [1,0,0,0],  # third sentence: 3 padded positions
    ]
)
#b.shape
mask = b.unsqueeze(dim=1).repeat(1,4,1) # unsqueeze inserts a new dimension at index 1: (3,4) -> (3,1,4); repeat then gives (3,4,4)


net = SelfAttentionV3(2)
net(X,mask)

tensor([[[0.4129, 0.2558, 0.3313, 0.0000],
         [0.4294, 0.2701, 0.3005, 0.0000],
         [0.3445, 0.3168, 0.3387, 0.0000],
         [0.3925, 0.2593, 0.3482, 0.0000]],

        [[0.5243, 0.4757, 0.0000, 0.0000],
         [0.5098, 0.4902, 0.0000, 0.0000],
         [0.4758, 0.5242, 0.0000, 0.0000],
         [0.4935, 0.5065, 0.0000, 0.0000]],

        [[1.0000, 0.0000, 0.0000, 0.0000],
         [1.0000, 0.0000, 0.0000, 0.0000],
         [1.0000, 0.0000, 0.0000, 0.0000],
         [1.0000, 0.0000, 0.0000, 0.0000]]], grad_fn=<SoftmaxBackward0>)


tensor([[[ 0.2368,  0.1474],
         [ 0.2398,  0.1432],
         [ 0.2582,  0.1145],
         [ 0.3062,  0.0471]],

        [[-0.0366,  0.5418],
         [-0.0368,  0.5421],
         [-0.0371,  0.5426],
         [-0.0369,  0.5423]],

        [[ 0.0981,  0.3499],
         [ 0.0981,  0.3499],
         [ 0.0981,  0.3499],
         [ 0.0981,  0.3499]]], grad_fn=<ViewBackward0>)
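
The explicit `repeat` is not strictly needed: `masked_fill` only requires the mask to be broadcastable to the score tensor, so a `(3, 1, 4)` mask is enough. A small sketch comparing both forms, with random scores standing in for `Q @ K^T`:

```python
import torch

torch.manual_seed(0)
scores = torch.randn(3, 4, 4)          # (batch_size, seq, seq)
b = torch.tensor([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])

mask_repeated = b.unsqueeze(1).repeat(1, 4, 1)   # (3, 4, 4), materialized copy
mask_broadcast = b.unsqueeze(1)                  # (3, 1, 4), broadcast over the query dim

w_repeated = torch.softmax(scores.masked_fill(mask_repeated == 0, -1e9), dim=-1)
w_broadcast = torch.softmax(scores.masked_fill(mask_broadcast == 0, -1e9), dim=-1)

print(torch.equal(w_repeated, w_broadcast))
print(w_broadcast[2])   # only the first key position gets any weight
```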

In [7]:
nn.Dropout??

Init signature: nn.Dropout(p: float = 0.5, inplace: bool = False) -> None
Source:
class Dropout(_DropoutNd):
    r"""During training, randomly zeroes some of the elements of the input
    tensor with probability :attr:`p` using samples from a Bernoulli
    distribution. Each channel will be zeroed out independently on every forward
    call.

    This has proven to be an effective technique for regularization and
    preventing the co-adaptation of neurons as described in the paper
    `Improving neural networks by preventing co-adaptation of feature
    detectors`_ .

In [8]:
torch.Tensor.masked_fill_??

Docstring:
masked_fill_(mask, value)

Fills elements of :attr:`self` tensor with :attr:`value` where :attr:`mask` is
True. The shape of :attr:`mask` must be
:ref:`broadcastable <broadcasting-semantics>` with the shape of the underlying
tensor.

Args:
    mask (BoolTensor): the boolean mask
    value (float): the value to fill in with
Type:      method_descriptor

In [9]:
### Fourth version of self-attention
class SelfAttentionV4(nn.Module):
    def __init__(self,dim,dropOut_rate= 0.1):
        super().__init__()
        self.dim = dim

        self.query = nn.Linear(dim,dim) # bias=False is not passed, so the default bias=True applies
        self.key = nn.Linear(dim,dim)
        self.value = nn.Linear(dim,dim)

        self.attention_dropout = nn.Dropout(dropOut_rate)

    def forward(self,X,mask=None):
        # X : (batch_size ,seq, dim)
        Q = self.query(X)
        K = self.key(X)
        V = self.value(X)
        # (batch_size,seq,dim) @ (batch_size,dim,seq) => (batch_size,seq,seq)
        # attention_weight
        attention_weight = Q@K.transpose(-1,-2)/math.sqrt(self.dim)
        # apply the mask
        if mask is not None:
            attention_weight = attention_weight.masked_fill(mask==0,-1e9) # alternative: float("-inf")
        # softmax
        attention_weight = torch.softmax(attention_weight,dim=-1)

        # dropout
        attention_weight = self.attention_dropout(attention_weight)
        # (batch_size,seq,seq) @ (batch_size,seq,dim) => (batch_size,seq,dim)
        output = attention_weight@V
        # output is the final attention result
        return output
    
X = torch.rand(3,4,2)
mask = torch.Tensor([
    [1,1,1,0],
    [1,1,0,0],
    [1,0,0,0]
])
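# The mask starts out as 3x4; unsqueeze(dim=1) inserts a new dimension
# to give a 3x1x4 tensor,
# and repeat(1,4,1) then copies it 4 times along that dimension.
mask = mask.unsqueeze(dim=1).repeat(1,4,1)
# The result is a (3,4,4) mask.
# attention_weight has shape (batch_size,seq,seq), i.e. (3,4,4), so the mask can be applied elementwise.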
net = SelfAttentionV4(2)
net(X,mask)


tensor([[[0.3652, 1.0789],
         [0.3655, 1.0783],
         [0.3647, 1.0796],
         [0.3651, 1.0790]],

        [[0.4184, 0.9227],
         [0.4183, 0.9227],
         [0.0000, 0.0000],
         [0.4184, 0.9227]],

        [[0.3013, 1.2024],
         [0.3013, 1.2024],
         [0.3013, 1.2024],
         [0.3013, 1.2024]]], grad_fn=<UnsafeViewBackward0>)
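
The `[0.0000, 0.0000]` row in the output above is consistent with dropout being active: a freshly built module is in training mode, so `nn.Dropout` zeroes some attention weights and rescales the survivors by `1/(1-p)`. A minimal sketch of the train/eval difference on stand-alone weights (the tensor sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.1)
weights = torch.softmax(torch.randn(4, 6), dim=-1)   # strictly positive rows summing to 1

dropout.train()                 # the default mode after construction
train_out = dropout(weights)
kept = train_out != 0           # positions that survived dropout

dropout.eval()                  # dropout becomes the identity
eval_out = dropout(weights)

print(train_out)
print(torch.equal(eval_out, weights))
```

Calling `net.eval()` on the whole attention module before inference has the same effect on its dropout layer.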

In [10]:
torch.unsqueeze??

Docstring:
unsqueeze(input, dim) -> Tensor

Returns a new tensor with a dimension of size one inserted at the
specified position.

The returned tensor shares the same underlying data with this tensor.

A :attr:`dim` value within the range ``[-input.dim() - 1, input.dim() + 1)``
can be used. Negative :attr:`dim` will correspond to :meth:`unsqueeze`
applied at :attr:`dim` = ``dim + input.dim() + 1``.

Args:
    input (Tensor): the input tensor.
    dim (int): the index at which to insert the singleton dimension

Example::

    >>> x = torch.tensor([1, 2, 3, 4])
    >>> torch.unsqueeze(x, 0)
    tensor([[ 1,  2,  3,  4]])
    >>> torch.unsqueeze(x, 1)
    tensor([[ 1],
            [ 2],
            [ 3],
            [ 4]])
Type:      builtin_function_or_method

![ ](./images/multi_head_attention.jpg)


In [19]:
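## Final step: writing multi-head attention
# Everything above implements a single self-attention head; in practice the multi-head version is used.
class MultiHeadAttention(nn.Module):
    # Project to Q, K, V, split them into heads, run attention on each head, concat the heads, then apply a final Linear.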
    def __init__(self,hidden_dim,heads):
        super().__init__()
        self.dim = hidden_dim
        self.heads = heads

        self.query = nn.Linear(hidden_dim,hidden_dim)
        self.key = nn.Linear(hidden_dim,hidden_dim)
        self.value = nn.Linear(hidden_dim,hidden_dim)

        self.head_dim = hidden_dim//heads
        self.attention = SelfAttentionV4(self.head_dim)

        # A way to remember it: each head attends to a different slice of the hidden dimension, and the per-head results are concatenated.

        self.Concat_proj = nn.Linear(hidden_dim,hidden_dim)
    
    def forward(self,X,mask=None):
        batch_size,seq_len,_ =X.shape
        # X(batch_size,seq_len,hidden_dim)
        Q = self.query(X)
        K = self.key(X)
        V = self.value(X)
        # The first step (the projections) is done.
        # The shape now has to become (batch_size,num_head,seq,head_dim)
        Q = Q.view(batch_size,seq_len,self.heads,self.head_dim).permute(0,-2,1,-1)
        K = K.view(batch_size,seq_len,self.heads,self.head_dim).permute(0,-2,1,-1)
        V = V.view(batch_size,seq_len,self.heads,self.head_dim).permute(0,-2,1,-1)

        # scaled-dot product attention
        attention_weight = Q@K.transpose(-1,-2)/math.sqrt(self.head_dim)

        if mask is not None:
            attention_weight = attention_weight.masked_fill(mask==0,float("-inf"))

        # softmax 
        attention_weight = torch.softmax(attention_weight,dim=-1)
        single_attention = attention_weight @ V

        ## Move the data back to
        # (batch_size,seq_len,num_heads,head_dim)
        single_attention = single_attention.transpose(1,2).contiguous()

        # Then concat the heads, i.e. reshape to
        # (batch_size,seq,hidden_dim)
        output = single_attention.view(batch_size,seq_len,-1)
        output = self.Concat_proj(output)
        return output

X = torch.rand(3,2,128)
mask = torch.Tensor([
    [0,1],
    [1,0],
    [0,0]
])
# Note that this is a mask for multi-head attention.
# Its shape should be (batch_size,num_head,seq,seq),
# here (3,8,2,2):
# (3,2)->(3,1,2)->(3,1,1,2), then repeat to (3,8,2,2)
mask = mask.unsqueeze(dim=1).unsqueeze(dim=2).repeat(1,8,2,1)
print(mask.shape)

net = MultiHeadAttention(128,8)
## 8 heads with hidden_dim 128
## 128/8 = 16 dimensions per head
net(X,mask).shape

torch.Size([3, 8, 2, 2])


torch.Size([3, 2, 128])
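
One caveat with the test mask above: its third row is `[0,0]`, so every key is masked for that sample. With `float("-inf")` as the fill value, softmax over a fully masked row yields NaN, whereas a large finite value such as `-1e9` yields a uniform distribution. A small sketch of the two behaviours on a single row:

```python
import torch

scores = torch.tensor([[0.3, -1.2]])   # one query, two keys
mask = torch.tensor([[0, 0]])          # every key is masked

with_inf = torch.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
with_big = torch.softmax(scores.masked_fill(mask == 0, -1e9), dim=-1)

print(with_inf)   # NaN in every position
print(with_big)   # uniform weights
```

Only the shape is printed in the cell above, which is why the NaN values do not show up there.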

In [None]:
class multiheadAttentionV2(nn.Module):
    def __init__(self,hidden_dim,nums_head):
        super().__init__()
        self.nums_head = nums_head
        self.attention = SelfAttentionV4(hidden_dim // nums_head)  # per-head dimension, as in MultiHeadAttention above

In [13]:
torch.Tensor.view??

Docstring:
view(*shape) -> Tensor

Returns a new tensor with the same data as the :attr:`self` tensor but of a
different :attr:`shape`.

The returned tensor shares the same data and must have the same number
of elements, but may have a different size. For a tensor to be viewed, the new
view size must be compatible with its original size and stride, i.e., each new
view dimension must either be a subspace of an original dimension, or only span
across original dimensions :math:`d, d+1, \dots, d+k` that satisfy the following
contiguity-like condition that :math:`\forall i = d, \dots, d+k-1`,

.. math::

  \text{stride}[i] = \text{stride}[i+1] \times \text{size}[i+1]

Otherwise, it will not be possible to view :attr:`self` tensor as :attr:`shape`
without copying it (e.g., via :meth:`contiguous`). When it is unclear whether a
:meth:`view` can be performed, it is advisable to use :meth:`reshape`, which
returns a view if the shapes are compatible, and copies (equivalent to calling
:

$$\text{stride}[i] = \text{stride}[i+1] \times \text{size}[i+1]$$
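
This stride condition is why `MultiHeadAttention.forward` calls `.contiguous()` after `transpose(1,2)` and before `view`. A minimal sketch with a small tensor (shapes chosen only for illustration):

```python
import torch

x = torch.arange(24).view(2, 3, 4)
t = x.transpose(1, 2)            # shape (2, 4, 3), same storage, permuted strides

is_contig = t.is_contiguous()    # False after the transpose

# Merging the last two dims violates the stride condition, so view raises.
try:
    t.view(2, 12)
    view_failed = False
except RuntimeError:
    view_failed = True

flat_copy = t.contiguous().view(2, 12)   # copy into contiguous memory, then view
flat_reshape = t.reshape(2, 12)          # reshape copies only when it has to

print(is_contig, view_failed, torch.equal(flat_copy, flat_reshape))
```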

In [14]:
torch.Tensor.contiguous??

Docstring:
contiguous(memory_format=torch.contiguous_format) -> Tensor

Returns a contiguous in memory tensor containing the same data as :attr:`self` tensor. If
:attr:`self` tensor is already in the specified memory format, this function returns the
:attr:`self` tensor.

Args:
    memory_format (:class:`torch.memory_format`, optional): the desired memory format of
        returned Tensor. Default: ``torch.contiguous_format``.
Type:      method_descriptor