https://www.bilibili.com/video/BV1nV411a74n?t=70.1

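### Attention mechanism

- **Compute attention scores, then take a weighted sum**
- **Input and output have the same dimensions, so it is easy to stack layers and make the model deep**

First, find which parts deserve more attention: this is the attention itself.

Then take a weighted sum of the original content using that attention, which yields the content that really deserves attention.

For example, when deciding whether a sentence expresses a negative or a positive attitude:

- The weather today is nice
- The weather today is terrible
- The weather today
- nice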

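Attention was first used in machine translation with seq2seq models, which map a variable-length sequence to another variable-length sequence. Such a model has two parts: an encoder (which builds an understanding of the source text) and a decoder.

The encoder reads the whole input in one pass, while the decoder outputs one token at a time.

- t0: decoder(understanding of the source, start token) -> I
- t1: decoder(understanding of the source (updated), I) -> love
- t2: decoder(understanding of the source (updated), love) -> you
- t3: decoder(understanding of the source (updated), you) -> end token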

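Find the attention ([understanding of 我, understanding of 爱, understanding of 你]: Key, start token: Query) -> attention vector = [0.98, 0.01, 0.01]

Attention * understanding of each token = sum([0.98, 0.01, 0.01] * [understanding of 我, understanding of 爱, understanding of 你]: Value) = 0.98 * 我 + 0.01 * 爱 + 0.01 * 你 = the understanding after attending with the start token

Understanding after attending with the start token = attention([understanding of 我, understanding of 爱, understanding of 你], start token)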
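
The query/key/value computation above can be sketched in plain Python. This is a minimal illustration, not the code from the video: the 2-d vectors are made-up numbers chosen so that the weights come out close to [0.98, 0.01, 0.01], and the helper names `softmax`, `dot`, and `attend` are assumptions of this sketch.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Return (weights, context); context is the weighted sum of the values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical "understanding" vectors for 我 / 爱 / 你 (made-up numbers).
keys = [[2.3, 0.0], [0.0, 1.0], [0.0, -1.0]]
values = keys              # in this sketch the same vectors serve as keys and values
query = [2.0, 0.0]         # stand-in for the start token

weights, context = attend(query, keys, values)
print([round(w, 2) for w in weights])
print(context)
```

Because the query points in the same direction as the first key, almost all of the weight lands on 我, and the context vector is dominated by its value.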

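- t0: decoder(understanding after attending with the start token, start token) -> I
- t1: decoder(understanding after attending with I, I) -> love
- t2: decoder(understanding after attending with love, love) -> you
- t3: decoder(understanding after attending with you, you) -> end token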
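
The step-by-step decoding above can be mimicked with a toy loop. Everything here is hypothetical and hand-built for illustration: the one-hot encoder states, the `query_for` table, the extra `<end>` source position, and the lookup-based "decoder" are assumptions of this sketch, not a trained model. It only shows that the attention is recomputed at every step from the previous output token.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    return softmax([sum(q * k for q, k in zip(query, key)) for key in keys])

# Hypothetical toy setup: one one-hot encoder state per source position.
source_tokens = ["我", "爱", "你", "<end>"]
encoder_states = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Hypothetical query vector derived from the decoder's previous output token.
query_for = {
    "<start>": [5.0, 0.0, 0.0, 0.0],
    "I": [0.0, 5.0, 0.0, 0.0],
    "love": [0.0, 0.0, 5.0, 0.0],
    "you": [0.0, 0.0, 0.0, 5.0],
}
translate = {"我": "I", "爱": "love", "你": "you", "<end>": "<end>"}

def decode(max_steps=10):
    outputs, prev = [], "<start>"
    for _ in range(max_steps):
        # The attention is recomputed each step, using the previous token as the query.
        weights = attention_weights(query_for[prev], encoder_states)
        # Toy decoder step: emit the translation of the most-attended source token.
        focus = source_tokens[weights.index(max(weights))]
        prev = translate[focus]
        outputs.append(prev)
        if prev == "<end>":
            break
    return outputs

decoded = decode()
print(decoded)
```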

### Self attention

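There is no decoding step.

|                    | understanding of 我 | understanding of 爱 | understanding of 你 |
| ------------------ | ------------------- | ------------------- | ------------------- |
| understanding of 我 | 1.0                 | 0.2                 | 0.1                 |
| understanding of 爱 | 0.2                 | 1.0                 | 0.2                 |
| understanding of 你 | 0.1                 | 0.0                 | 1.0                 |

Input: understanding of 我, understanding of 爱, understanding of 你
Output: new understanding of 我, new understanding of 爱, new understanding of 你. Each output blends in information from the other input vectors, so it reflects the surrounding context better.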
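
As a rough illustration of how such a score table becomes new token vectors, the sketch below treats the table entries as raw pairwise scores, normalizes each row with softmax, and mixes the value vectors with the resulting weights. Reading the table as pre-softmax scores is an assumption of this sketch, and the one-hot value vectors are hypothetical; they are chosen so that each output row shows the mixing weights directly.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Raw pairwise scores from the table (rows: query token, columns: key token).
scores = [
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.2],
    [0.1, 0.0, 1.0],
]

# Hypothetical value vectors for 我 / 爱 / 你 (one-hot for readability).
values = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Each row of weights sums to 1.
weights = [softmax(row) for row in scores]

# New vector for each token = weighted sum of all value vectors.
outputs = [
    [sum(w * v[d] for w, v in zip(row, values)) for d in range(3)]
    for row in weights
]

for row in outputs:
    print([round(v, 3) for v in row])
```

Each output keeps most of its own token's information while still picking up a share of the other two.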

In [1]:
import tensorflow as tf

In [3]:
batch_size = 4
sequence_length = 10
vector_size = 32

x = tf.random.uniform((batch_size, sequence_length, vector_size))

In [4]:
x.shape

TensorShape([4, 10, 32])

In [6]:
# Linear projections that turn x into q, k, v
key = tf.keras.layers.Dense(vector_size)(x)
value = tf.keras.layers.Dense(vector_size)(x)
query = tf.keras.layers.Dense(vector_size)(x)
key.shape

TensorShape([4, 10, 32])

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{n}}\right)V $$
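
One common motivation for the $\sqrt{n}$ divisor is that the dot product of two random $n$-dimensional vectors has a variance that grows with $n$, which would push softmax toward very peaked outputs. The sketch below checks this numerically. It assumes independent unit-variance Gaussian entries, which is an idealization and not something the notes state.

```python
import random
import statistics

random.seed(0)

def random_dot(n):
    """Dot product of two random n-dimensional vectors with unit-variance entries."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(n))

n = 32
raw = [random_dot(n) for _ in range(5000)]
scaled = [s / n ** 0.5 for s in raw]

raw_var = statistics.pvariance(raw)        # expected to be close to n
scaled_var = statistics.pvariance(scaled)  # expected to be close to 1
print(raw_var, scaled_var)
```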

In [9]:
mat_mul = tf.matmul(query, key, transpose_b=True)
mat_mul.shape

TensorShape([4, 10, 10])

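Why is the attention output 10 × 10? Attention is computed across the 10 tokens of each sentence, so the output holds one score for every pair of those 10 tokens. Each score is a dot product between a query and a key, which acts as a similarity measure; unlike cosine similarity, it is not normalized by the vector lengths.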
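
A small plain-Python sketch of the pairwise score matrix, using three made-up 2-d token vectors (hypothetical values, with the same vectors standing in for both queries and keys). It also shows how a dot-product score differs from cosine similarity when the vectors have different lengths.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical sentence of 3 tokens, each a 2-d vector.
tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

# One score per (query token, key token) pair: a 3 x 3 matrix.
scores = [[dot(q, k) for k in tokens] for q in tokens]
for row in scores:
    print(row)

# Tokens 0 and 1 point the same way: cosine similarity is 1, the dot product is 2.
print(dot(tokens[0], tokens[1]), cosine(tokens[0], tokens[1]))
```

With 10 tokens instead of 3, the same construction gives the 10 × 10 matrix per sentence.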

In [10]:
n = vector_size
atten = mat_mul / tf.sqrt(tf.cast(n, tf.float32))
atten = tf.nn.softmax(atten)
# Multiply by v
y = tf.matmul(atten, value)
y.shape

TensorShape([4, 10, 32])

Wrap it in a function


In [11]:
def self_attention(x):
  key = tf.keras.layers.Dense(vector_size)(x)
  value = tf.keras.layers.Dense(vector_size)(x)
  query = tf.keras.layers.Dense(vector_size)(x)
  mat_mul = tf.matmul(query, key, transpose_b=True)
  n = vector_size
  atten = mat_mul / tf.sqrt(tf.cast(n, tf.float32))
  atten = tf.nn.softmax(atten)
  # Multiply by v
  y = tf.matmul(atten, value)
  return y

self_attention(x).shape

TensorShape([4, 10, 32])

### Multi-head

`vector_size` must be divisible by the number of heads.
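
A minimal guard for this constraint might look like the following. The helper name `head_size` is hypothetical and is not part of the notebook.

```python
def head_size(vector_size, heads):
    """Return the per-head dimension, or raise if the split is not even."""
    if vector_size % heads != 0:
        raise ValueError(
            f"vector_size ({vector_size}) must be divisible by heads ({heads})"
        )
    return vector_size // heads

print(head_size(32, 4))
```

With `vector_size = 32`, head counts such as 1, 2, 4, 8, 16, or 32 work, while 5 would raise.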

In [15]:
batch_size = 4
sequence_length = 10
vector_size = 32
heads = 4

x = tf.random.uniform((batch_size, sequence_length, vector_size))
x.shape

TensorShape([4, 10, 32])

In [16]:
# Multi-head input
x_mh = tf.reshape(x, [batch_size, sequence_length, heads, vector_size // heads])
x_mh.shape

TensorShape([4, 10, 4, 8])

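Each head can be treated as a separate new sentence, so there are now 4 * 4 = 16 sentences (batch_size * heads).

Move the head dimension to the second axis.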

In [18]:
x_mh = tf.transpose(x_mh, (0, 2, 1, 3))
x_mh.shape

TensorShape([4, 4, 10, 8])
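
The reshape and transpose can be pictured with nested Python lists. This is only a sketch with made-up integer data and a hypothetical helper `split_heads`; it mirrors the index layout `[batch, head, token, head_dim]` and does not use TensorFlow.

```python
batch_size, sequence_length, vector_size, heads = 4, 10, 32, 4
head_dim = vector_size // heads

# Hypothetical input: x[b][t] is a list of 32 numbers that encode their own position.
x = [
    [[b * 10000 + t * 100 + d for d in range(vector_size)]
     for t in range(sequence_length)]
    for b in range(batch_size)
]

def split_heads(x):
    """Return per_head[b][h][t]: the h-th slice of head_dim numbers of token t in sentence b."""
    return [
        [[token[h * head_dim:(h + 1) * head_dim] for token in sentence]
         for h in range(heads)]
        for sentence in x
    ]

per_head = split_heads(x)

# Every (sentence, head) pair is now its own shorter-vector "sentence".
num_sentences = sum(len(sentence_heads) for sentence_heads in per_head)
print(num_sentences, len(per_head[0][0]), len(per_head[0][0][0]))
```

Each of the 16 per-head sentences still has 10 tokens, but each token vector now holds only 8 of the original 32 numbers.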

In [19]:
key = tf.keras.layers.Dense(vector_size // heads)(x_mh)
value = tf.keras.layers.Dense(vector_size // heads)(x_mh)
query = tf.keras.layers.Dense(vector_size // heads)(x_mh)
key.shape, value.shape, query.shape

(TensorShape([4, 4, 10, 8]),
 TensorShape([4, 4, 10, 8]),
 TensorShape([4, 4, 10, 8]))

In [20]:
mat_mul = tf.matmul(query, key, transpose_b=True)
n = vector_size // heads  # scale by the per-head dimension
atten = mat_mul / tf.sqrt(tf.cast(n, tf.float32))
atten = tf.nn.softmax(atten)
# Multiply by v
y = tf.matmul(atten, value)
y.shape

TensorShape([4, 4, 10, 8])

The result needs to be converted back to the original input shape

In [21]:
y = tf.transpose(y, (0, 2, 1, 3))
y.shape

TensorShape([4, 10, 4, 8])

Then merge the last two dimensions

In [22]:
y = tf.reshape(y, (batch_size, sequence_length, vector_size))
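
The transpose before the final reshape matters: flattening the per-head layout directly would mix slices from different tokens. The sketch below shows this with nested lists for a single sentence. The helpers `split_heads`, `merge_heads`, and `naive_merge` are hypothetical names, and the integer data is made up.

```python
heads, head_dim = 4, 8

def split_heads(sentence):
    """tokens x 32 -> heads x tokens x 8"""
    return [
        [token[h * head_dim:(h + 1) * head_dim] for token in sentence]
        for h in range(heads)
    ]

def merge_heads(per_head):
    """heads x tokens x 8 -> tokens x 32, concatenating the heads for each token."""
    seq_len = len(per_head[0])
    return [
        [v for h in range(len(per_head)) for v in per_head[h][t]]
        for t in range(seq_len)
    ]

def naive_merge(per_head):
    """Flatten in storage order and cut into rows of 32, skipping the transpose."""
    flat = [v for head in per_head for token in head for v in token]
    width = heads * head_dim
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Hypothetical sentence: 10 tokens, each a list of 32 numbers.
sentence = [[t * 100 + d for d in range(32)] for t in range(10)]

restored = merge_heads(split_heads(sentence))
scrambled = naive_merge(split_heads(sentence))
print(restored == sentence, scrambled == sentence)
```

Only the version that puts the token axis back in front of the head axis recovers the original sentence.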

Wrap it in a function

In [23]:
def multi_head_self_attention(x):
  x_mh = tf.reshape(x, [batch_size, sequence_length, heads, vector_size // heads])
  x_mh = tf.transpose(x_mh, (0, 2, 1, 3))
  key = tf.keras.layers.Dense(vector_size // heads)(x_mh)
  value = tf.keras.layers.Dense(vector_size // heads)(x_mh)
  query = tf.keras.layers.Dense(vector_size // heads)(x_mh)
  mat_mul = tf.matmul(query, key, transpose_b=True)
  n = vector_size // heads  # scale by the per-head dimension
  atten = mat_mul / tf.sqrt(tf.cast(n, tf.float32))
  atten = tf.nn.softmax(atten)
  # Multiply by v
  y = tf.matmul(atten, value)
  y = tf.transpose(y, (0, 2, 1, 3))
  y = tf.reshape(y, (batch_size, sequence_length, vector_size))
  return y

multi_head_self_attention(x).shape

TensorShape([4, 10, 32])

Wrap it in a tf model

In [25]:
class MultiHeadSelfAttention(tf.keras.Model):
  def __init__(self, vector_size, heads=1):
    super(MultiHeadSelfAttention, self).__init__()
    self.vector_size = vector_size
    self.heads = heads
    self.key = tf.keras.layers.Dense(vector_size // heads)
    self.value = tf.keras.layers.Dense(vector_size // heads)
    self.query = tf.keras.layers.Dense(vector_size // heads)

  def call(self, x):
    batch_size = x.shape[0]
    sequence_length = x.shape[1]
    x_mh = tf.reshape(x, [batch_size, sequence_length, self.heads, self.vector_size // self.heads])
    x_mh = tf.transpose(x_mh, (0, 2, 1, 3))
    # Reuse the layers created in __init__ so the weights persist across calls
    key = self.key(x_mh)
    value = self.value(x_mh)
    query = self.query(x_mh)
    mat_mul = tf.matmul(query, key, transpose_b=True)
    n = self.vector_size // self.heads  # scale by the per-head dimension
    atten = mat_mul / tf.sqrt(tf.cast(n, tf.float32))
    atten = tf.nn.softmax(atten)
    # Multiply by v
    y = tf.matmul(atten, value)
    y = tf.transpose(y, (0, 2, 1, 3))
    y = tf.reshape(y, (batch_size, sequence_length, self.vector_size))
    return y

In [26]:
batch_size = 4
sequence_length = 10
vector_size = 32
heads = 4

x = tf.random.uniform((batch_size, sequence_length, vector_size))
x.shape

TensorShape([4, 10, 32])

In [27]:
attention_model = MultiHeadSelfAttention(vector_size, heads)
attention_model(x).shape

TensorShape([4, 10, 32])