## 11.3. Attention Scoring Functions

As it turns out, distance functions are slightly more expensive to compute than inner products.

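Dropping the squared norms of Q and K from the distance-based score leaves the dot-product attention mechanism. Expanding $-\frac{1}{2}\|q - k\|^2 = q^\top k - \frac{1}{2}\|q\|^2 - \frac{1}{2}\|k\|^2$ shows why this is reasonable: the $\|q\|^2$ term is the same for every key and cancels in the softmax, and the $\|k\|^2$ term is roughly constant when the keys have similar norms (for example, after normalization). Scoring with the dot product alone therefore avoids the cost of computing the squared terms.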
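
As a quick sanity check of the idea that the norms can be dropped, the sketch below compares softmax weights computed from negative squared distances with weights computed from plain dot products. It assumes every key has been rescaled to unit norm (an assumption made here for illustration only); under that assumption the two scores differ by a constant per query, which softmax ignores. The tensor sizes and the seed are arbitrary.

```python
import torch

torch.manual_seed(0)
q = torch.randn(1, 8)                   # one query with d = 8 (arbitrary size)
k = torch.randn(5, 8)                   # five keys
k = k / k.norm(dim=1, keepdim=True)     # assumption: all keys have unit norm

# Distance-based score: -1/2 * ||q - k_i||^2
dist_scores = -0.5 * ((q - k) ** 2).sum(dim=1)
# Dot-product score: q . k_i
dot_scores = (q * k).sum(dim=1)

w_dist = torch.softmax(dist_scores, dim=0)
w_dot = torch.softmax(dot_scores, dim=0)

# The two weight vectors should agree up to floating-point error.
print(torch.allclose(w_dist, w_dot, atol=1e-5))
```

If the keys have very different norms, the two weightings no longer coincide, which is one reason normalization layers matter in practice.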

#### 11.3.2.1. Masked Softmax Operation

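We introduce a masking operation because sequences in NLP contain different numbers of tokens, yet they must be padded to a common length so that a minibatch can be processed in parallel. The padding tokens carry no meaning, so they should receive zero attention weight. We achieve this by overwriting the attention scores (the QK inner products) at the padded positions with a very large negative number, for example -1e6 (that is, $-10^6$, not $10^{-6}$), before applying softmax; its exponential is 0, so those positions drop out. Filling values through a mask keeps the whole computation a single vectorized operation, which runs faster on a GPU than per-element if-else branching.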
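
The choice of fill value is easy to get wrong, so here is a minimal sketch contrasting a large negative fill (-1e6) with a tiny positive one (1e-6). The scores and the `valid` mask are made-up values for illustration; `masked_fill` is used here as a compact stand-in for the indexing-based masking in the full function.

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.5, 3.0])        # hypothetical scores for 4 keys
valid = torch.tensor([True, True, False, False])   # only the first 2 keys are real tokens

# Large negative fill: exp(-1e6) underflows to 0, so padded keys get zero weight.
good = torch.softmax(scores.masked_fill(~valid, -1e6), dim=0)
# Tiny positive fill: exp(1e-6) is about 1, so padded keys still receive weight.
bad = torch.softmax(scores.masked_fill(~valid, 1e-6), dim=0)

print(good)
print(bad)
```

With the large negative fill, the last two weights are exactly zero and the first two sum to one; with the tiny positive fill, the padded positions still take a noticeable share of the attention.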

In [5]:
import math
import torch
from torch import nn
from d2l import torch as d2l

def masked_softmax(X, valid_lens):  #@save
    """Perform softmax operation by masking elements on the last axis."""
    # X: 3D tensor, valid_lens: 1D or 2D tensor
    def _sequence_mask(X, valid_len, value=0):
        maxlen = X.size(1)
        mask = torch.arange((maxlen), dtype=torch.float32,
                            device=X.device)[None, :] < valid_len[:, None]
        X[~mask] = value
        return X

    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = _sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

print(masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3])))
masked_softmax(torch.rand(2, 2, 4), torch.tensor([[1, 3], [2, 4]]))

tensor([[[0.5468, 0.4532, 0.0000, 0.0000],
         [0.6060, 0.3940, 0.0000, 0.0000]],

        [[0.4260, 0.1756, 0.3984, 0.0000],
         [0.4944, 0.1954, 0.3102, 0.0000]]])


tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.3492, 0.2424, 0.4084, 0.0000]],

        [[0.4861, 0.5139, 0.0000, 0.0000],
         [0.3200, 0.2312, 0.2007, 0.2480]]])

#### 11.3.2.2. Batch Matrix Multiplication

Batch matrix multiplication comes in handy when we have minibatches of queries, keys, and values. Note that $d$ below is the feature dimension shared by the queries and the keys; this is the $d$ that people usually mean when they talk about attention.

- Shape of queries: (batch_size, no. of queries, d)
- Shape of keys: (batch_size, no. of key-value pairs, d)
- Shape of values: (batch_size, no. of key-value pairs, value dimension)

In [7]:
Q = torch.ones((2, 3, 4))
K = torch.ones((2, 4, 6))
print(torch.bmm(Q, K).shape)  # expected: (2, 3, 6)

torch.Size([2, 3, 6])
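
To make explicit what `torch.bmm` computes, the sketch below checks that it matches an ordinary matrix product applied to each example in the batch separately. The random inputs and the seed are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
Q = torch.randn(2, 3, 4)   # (batch_size, a, b)
K = torch.randn(2, 4, 6)   # (batch_size, b, c)

# One call multiplies every pair (Q[i], K[i]) in the batch.
batched = torch.bmm(Q, K)

# Equivalent, but slower: loop over the batch and stack the results.
looped = torch.stack([Q[i] @ K[i] for i in range(Q.shape[0])])

print(batched.shape)
print(torch.allclose(batched, looped, atol=1e-5))
```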


### 11.3.3. Scaled Dot-Product Attention

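The scores are divided by $\sqrt{d}$, where $d$ is the query/key feature dimension noted above. The reason is short: if the elements of a query and a key are independent with zero mean and unit variance, their dot product has zero mean and variance $d$, so dividing by $\sqrt{d}$ keeps the variance of the scores at 1 regardless of the vector length.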
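
A common justification for the $1/\sqrt{d}$ factor is that it keeps the spread of the scores independent of the vector length. The sketch below checks this empirically, assuming query and key entries are drawn independently from a standard normal distribution; the sample count, `d = 64`, and the seed are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
d = 64
n = 100_000                       # number of sampled (query, key) pairs

q = torch.randn(n, d)             # assumption: i.i.d. standard normal entries
k = torch.randn(n, d)

raw = (q * k).sum(dim=1)          # unscaled dot products
scaled = raw / math.sqrt(d)       # scaled as in scaled dot-product attention

# The raw variance should be close to d, the scaled variance close to 1.
print(raw.var().item(), scaled.var().item())
```

Without the scaling, larger $d$ produces larger scores, which pushes softmax toward near one-hot outputs with very small gradients.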

In [8]:
class DotProductAttention(nn.Module):  #@save
    """Scaled dot product attention."""
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    # Shape of queries: (batch_size, no. of queries, d)
    # Shape of keys: (batch_size, no. of key-value pairs, d)
    # Shape of values: (batch_size, no. of key-value pairs, value dimension)
    # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of keys with keys.transpose(1, 2)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
    

queries = torch.normal(0, 1, (2, 1, 2))
keys = torch.normal(0, 1, (2, 10, 2))
values = torch.normal(0, 1, (2, 10, 4))
valid_lens = torch.tensor([2, 6])

attention = DotProductAttention(dropout=0.5)
attention.eval()
print(attention(queries, keys, values, valid_lens).shape)  # expected: (2, 1, 4)

torch.Size([2, 1, 4])
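
Beyond checking the output shape, it can be useful to confirm that padded positions really have no influence on the result. The sketch below uses a standalone helper, `attend` (a hypothetical name, not part of d2l), that mirrors the scaled dot-product computation with a length mask built via `masked_fill`. It then overwrites the values at padded positions and checks that the output is unchanged.

```python
import math
import torch

def attend(queries, keys, values, valid_lens):
    """Hypothetical helper: scaled dot-product attention with a length mask."""
    d = queries.shape[-1]
    scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
    # mask[b, 0, j] is True where key j lies within the valid length of example b
    positions = torch.arange(keys.shape[1], device=keys.device)
    mask = positions[None, None, :] < valid_lens[:, None, None]
    weights = torch.softmax(scores.masked_fill(~mask, -1e6), dim=-1)
    return torch.bmm(weights, values), weights

torch.manual_seed(0)
queries = torch.normal(0, 1, (2, 1, 2))
keys = torch.normal(0, 1, (2, 10, 2))
values = torch.normal(0, 1, (2, 10, 4))
valid_lens = torch.tensor([2, 6])

out, weights = attend(queries, keys, values, valid_lens)

# Overwrite the values at padded positions; the output should not change.
values_perturbed = values.clone()
values_perturbed[0, 2:] = 100.0
values_perturbed[1, 6:] = 100.0
out_perturbed, _ = attend(queries, keys, values_perturbed, valid_lens)

# Total weight on padded keys for each example, then the equality check.
print(weights[0, 0, 2:].sum().item(), weights[1, 0, 6:].sum().item())
print(torch.allclose(out, out_perturbed))
```

The helper omits dropout for brevity, so it corresponds to the class above in evaluation mode.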


### 11.3.4. Additive Attention
