The attention mechanism acts as the bridge between the encoder and the decoder. In essence, it is a matrix made up of context weight vectors.

![1611572112865-59576f75-2985-4b67-b455-81ded0893b31.png](attachment:2252fd6e-184b-4371-9b32-eb3fe1d62a75.png)

In Tacotron 2, the attention computation happens at every decoder time step and consists of the following stages:

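1. The target hidden state (green box in the figure above) is "compared" with every source state (blue boxes in the figure above) to produce the attention weights, also called alignments: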

$$\alpha_{ts}=\frac{\exp(\mathrm{score}(h_t,\bar{h}_s))}{\sum_{s'=1}^{S}\exp(\mathrm{score}(h_t,\bar{h}_{s'}))}$$

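where $h_t$ is the target hidden state and $\bar{h}_s$ is a source state. The score function is often called the "energy" and is therefore written $e$. Different score functions give different types of attention mechanism. Tacotron 2 uses Location Sensitive Attention: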
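The normalization above is a softmax over the scores of all source states. A minimal NumPy sketch with made-up energies (`softmax` and `scores` are illustrative names, not part of the Tacotron 2 code):

```python
import numpy as np

def softmax(scores):
    """Turn raw scores (energies) into attention weights that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical energies of one decoder step against S = 4 source states.
scores = np.array([2.0, 1.0, 0.1, -1.0])
alpha = softmax(scores)
print(alpha, alpha.sum())
```

The source state with the highest energy receives the largest weight, and the weights form a probability distribution over the source positions.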

$$e_{ij}=\mathrm{score}(s_i,c\alpha_{i-1},h_j)=v_a^T\tanh(Ws_i+Vh_j+Uf_{i,j}+b)$$

where $s_i$ is the current decoder output (decoder hidden state), $h_j$ is the $j$-th encoder output (encoder hidden state), and the location feature $f_{i,j}$ is obtained by convolving the cumulative attention weights $c\alpha_{i-1}$:

$$f_i=F*c\alpha_{i-1},\quad c\alpha_{i-1}=\sum_{j=1}^{i-1}\alpha_j$$
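As a rough illustration of the location features, the sketch below sums the alignments of earlier decoder steps and convolves the result with a small filter bank. The sizes, the alignment values and the random filters `F` are all assumptions made for the example. Note also that `np.convolve` flips the kernel, whereas a `Conv1D` layer computes a cross-correlation; with learned filters the distinction does not matter.

```python
import numpy as np

rng = np.random.default_rng(0)
max_time, n_filters, kernel_size = 6, 2, 3   # hypothetical sizes

# Alignments from two earlier decoder steps (each row sums to 1).
past_alignments = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0, 0.0, 0.0],
])
cumulative = past_alignments.sum(axis=0)     # cumulative weights, shape [max_time]

# F: one 1-D kernel per filter, random here because this is only a sketch.
F = rng.normal(size=(n_filters, kernel_size))

# 'same'-padded convolution of the cumulative weights with every filter.
f = np.stack([np.convolve(cumulative, k, mode="same") for k in F], axis=-1)
print(f.shape)  # (6, 2): one location feature vector per encoder position j
```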

The figure below shows why additive accumulation is used rather than multiplicative accumulation:

![1611574398473-2f0c94b3-0eac-401f-b693-eafd2cd4a17d.png](attachment:ffd1f363-cf31-4154-ae6d-5a0ef1f27da3.png)

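Accumulating the attention weights tells the attention network which positions it has already attended to, so the model keeps moving forward through the sequence and avoids unexpectedly repeating speech.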

The figure below shows how the attention mechanism is used:

![1611574490868-502ff888-576a-445d-9883-ce3f24767790.png](attachment:cbfcb1a9-78b4-42b2-897e-9ea13f65730a.png)

In [None]:
# TensorFlow 1.x imports (tf.contrib was removed in TensorFlow 2).
import tensorflow as tf
from tensorflow.contrib.seq2seq.python.ops.attention_wrapper import BahdanauAttention
from tensorflow.python.ops import array_ops, variable_scope


class LocationSensitiveAttention(BahdanauAttention):
    """Implements Bahdanau-style (cumulative) scoring function.
    Usually referred to as "hybrid" attention (content-based + location-based)
    Extends the additive attention described in:
    "D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation
    by jointly learning to align and translate,” in Proceedings
    of ICLR, 2015."
    to use previous alignments as additional location features.
    This attention is described in:
    J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
    “Attention-based models for speech recognition,” in Advances in
    Neural Information Processing Systems, 2015, pp. 577–585.
    """

    def __init__(self,num_units,memory,hparams,is_training,mask_encoder=True,memory_sequence_length=None,
                 smoothing=False,cumulate_weights=True,name='LocationSensitiveAttention'):
        """Construct the Attention mechanism.
        Args:
            num_units: The depth of the query mechanism.
            memory: The memory to query; usually the output of an RNN encoder.  This
                tensor should be shaped `[batch_size, max_time, ...]`.
            hparams: Hyperparameters; this class reads `attention_filters`,
                `attention_kernel`, `synthesis_constraint`, `attention_win_size`
                and `synthesis_constraint_type` from it.
            is_training: Boolean. The synthesis constraint is only applied when False.
            mask_encoder (optional): Boolean, whether to mask encoder paddings.
            memory_sequence_length (optional): Sequence lengths for the batch entries
                in memory.  If provided, the memory tensor rows are masked with zeros
                for values past the respective sequence lengths. Only relevant if mask_encoder = True.
            smoothing (optional): Boolean. Determines which normalization function to use.
                Default normalization function (probability_fn) is softmax. If smoothing is
                enabled, we replace softmax with:
                        a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j}))
                Introduced in:
                    J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio,
                    “Attention-based models for speech recognition,” in Advances in
                    Neural Information Processing Systems, 2015, pp. 577–585.
                This is mainly used if the model wants to attend to multiple input parts
                at the same decoding step. We probably won't be using it since multiple sound
                frames may depend on the same character/phone, probably not the other way around.
                Note:
                    We still keep it implemented in case we want to test it. They used it in the
                    paper in the context of speech recognition, where one phoneme may depend on
                    multiple subsequent sound frames.
            cumulate_weights (optional): Boolean, whether the returned state accumulates
                the alignments over decoder steps (default) or only holds the latest ones.
            name: Name to use when creating ops.
        """
        # Create the normalization function.
        # Passing None makes the parent class fall back to softmax.
        normalization_function = _smoothing_normalization if smoothing else None
        memory_length = memory_sequence_length if mask_encoder else None
        super(LocationSensitiveAttention, self).__init__(num_units=num_units,memory=memory,
              memory_sequence_length=memory_length,probability_fn=normalization_function,
              name=name)

        self.location_convolution = tf.layers.Conv1D(filters=hparams.attention_filters,
            kernel_size=hparams.attention_kernel, padding='same', use_bias=True,
            bias_initializer=tf.zeros_initializer(), name='location_features_convolution')
        self.location_layer = tf.layers.Dense(units=num_units, use_bias=False,
            dtype=tf.float32, name='location_features_layer')
        self._cumulate = cumulate_weights
        self.synthesis_constraint = hparams.synthesis_constraint and not is_training
        self.attention_win_size = tf.convert_to_tensor(hparams.attention_win_size, dtype=tf.int32)
        self.constraint_type = hparams.synthesis_constraint_type

    def __call__(self, query, state, prev_max_attentions):
        """Score the query based on the keys and values.
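        Args:
            query: Tensor of dtype matching `self.values` and shape
                `[batch_size, query_depth]`.
            state (previous alignments): Tensor of dtype matching `self.values` and shape
                `[batch_size, alignments_size]`
                (`alignments_size` is memory's `max_time`).
            prev_max_attentions: int32 Tensor of shape `[batch_size]`, the argmax of the
                previous step's alignments. Only used when the synthesis constraint is on.
        Returns:
            alignments: Tensor of dtype matching `self.values` and shape
                `[batch_size, alignments_size]` (`alignments_size` is memory's
                `max_time`).
            next_state: Tensor of the same shape as `alignments`; the cumulative
                alignments if `cumulate_weights` is set, otherwise `alignments`.
            max_attentions: int32 Tensor of shape `[batch_size]`, the argmax of `alignments`.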
        """
        previous_alignments = state
        with variable_scope.variable_scope(None, "Location_Sensitive_Attention", [query]):

            # processed_query shape [batch_size, query_depth] -> [batch_size, attention_dim]
            processed_query = self.query_layer(query) if self.query_layer else query
            # -> [batch_size, 1, attention_dim]
            processed_query = tf.expand_dims(processed_query, 1)

            # processed_location_features shape [batch_size, max_time, attention dimension]
            # [batch_size, max_time] -> [batch_size, max_time, 1]
            expanded_alignments = tf.expand_dims(previous_alignments, axis=2)
            # location features [batch_size, max_time, filters]
            f = self.location_convolution(expanded_alignments)
            # Projected location features [batch_size, max_time, attention_dim]
            processed_location_features = self.location_layer(f)

            # energy shape [batch_size, max_time]
            energy = _location_sensitive_score(processed_query, processed_location_features, self.keys)

        if self.synthesis_constraint:
            Tx = tf.shape(energy)[-1]
            # prev_max_attentions = tf.squeeze(prev_max_attentions, [-1])
            if self.constraint_type == 'monotonic':
                key_masks = tf.sequence_mask(prev_max_attentions, Tx)
                reverse_masks = tf.sequence_mask(Tx - self.attention_win_size - prev_max_attentions, Tx)[:, ::-1]
            else:
                assert self.constraint_type == 'window'
                key_masks = tf.sequence_mask(prev_max_attentions - (self.attention_win_size // 2 + (self.attention_win_size % 2 != 0)), Tx)
                reverse_masks = tf.sequence_mask(Tx - (self.attention_win_size // 2) - prev_max_attentions, Tx)[:, ::-1]
            
            masks = tf.logical_or(key_masks, reverse_masks)
            paddings = tf.ones_like(energy) * (-2 ** 32 + 1)  # [batch_size, max_time], large negative value
            energy = tf.where(tf.equal(masks, False), energy, paddings)

        # alignments shape = energy shape = [batch_size, max_time]
        alignments = self._probability_fn(energy, previous_alignments)
        max_attentions = tf.argmax(alignments, -1, output_type=tf.int32)  # [batch_size]

        # Cumulate alignments
        if self._cumulate:
            next_state = alignments + previous_alignments
        else:
            next_state = alignments

        return alignments, next_state, max_attentions
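
The synthesis constraint in `__call__` masks every energy outside a window around the previous attention peak. The NumPy sketch below is one reading of the two `tf.sequence_mask` calls, not code from the model. The helper name `allowed_positions` and the example values are hypothetical.

```python
import numpy as np

def allowed_positions(prev_max, win_size, max_time, constraint_type):
    """Boolean vector: True where the energy is kept, False where it is masked."""
    j = np.arange(max_time)
    if constraint_type == "monotonic":
        # Attention may stay at the previous peak or move forward by < win_size.
        low, high = prev_max, prev_max + win_size
    else:  # "window"
        # Attention may also move slightly backwards.
        low = prev_max - (win_size // 2 + (win_size % 2 != 0))
        high = prev_max + win_size // 2
    return (j >= low) & (j < high)

print(np.flatnonzero(allowed_positions(4, 3, 10, "monotonic")))  # [4 5 6]
print(np.flatnonzero(allowed_positions(4, 3, 10, "window")))     # [2 3 4]
```

Positions outside the window receive a very large negative energy, so the normalization function assigns them a weight that is effectively zero.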

In [None]:
def _smoothing_normalization(e):
    """Applies a smoothing normalization function instead of softmax

                        Smoothing normalization function
                a_{i, j} = sigmoid(e_{i, j}) / sum_j(sigmoid(e_{i, j}))

    Args:
        e: matrix [batch_size, max_time(memory_time)]: expected to be energy (score)
            values of an attention mechanism
    Returns:
        matrix [batch_size, max_time]: [0, 1] normalized alignments with possible
            attendance to multiple memory time steps.
    """
    return tf.nn.sigmoid(e) / tf.reduce_sum(tf.nn.sigmoid(e), axis=-1, keepdims=True)
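
To see how the smoothing normalization differs from softmax, the NumPy sketch below applies both to the same made-up energies. The function names and the values in `e` are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smoothing_normalization(e):
    """sigmoid(e) normalized so that each row sums to 1."""
    s = sigmoid(e)
    return s / s.sum(axis=-1, keepdims=True)

def softmax(e):
    z = np.exp(e - e.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

e = np.array([[4.0, 3.0, -2.0, -5.0]])   # hypothetical energies, [batch_size, max_time]
print(softmax(e).round(3))
print(smoothing_normalization(e).round(3))
```

Because the sigmoid saturates for large energies, the two high-energy positions end up with similar weights under smoothing, while softmax concentrates most of the weight on the single largest one.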

In [None]:
def _location_sensitive_score(W_query, W_fil, W_keys):
    """
              hybrid attention (content-based + location-based)
                               f = F * α_{i-1}
          energy = dot(v_a, tanh(W_keys(h_enc) + W_query(h_dec) + W_fil(f) + b_a))

    Args:
        W_query: Tensor, shape '[batch_size, 1, attention_dim]' to compare to location features.
        W_fil: processed previous alignments into location features, shape '[batch_size, max_time, attention_dim]'
        W_keys: Tensor, shape '[batch_size, max_time, attention_dim]', typically the encoder outputs.
    Returns:
        A '[batch_size, max_time]' attention score (energy) #e_ij
    """
    # Get the number of hidden units from the trailing dimension of keys
    dtype = W_query.dtype
    num_units = W_keys.shape[-1].value or array_ops.shape(W_keys)[-1]

    v_a = tf.get_variable(
        'attention_variable_projection', shape=[num_units], dtype=dtype,
        initializer=tf.contrib.layers.xavier_initializer())
    b_a = tf.get_variable(
        'attention_bias', shape=[num_units], dtype=dtype,
        initializer=tf.zeros_initializer())

    return tf.reduce_sum(v_a * tf.tanh(W_keys + W_query + W_fil + b_a), [2])

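$e_{ij}=\mathrm{score}(s_i,c\alpha_{i-1},h_j)=v_a^T\tanh(Ws_i+Vh_j+Uf_{i,j}+b)$

`W_keys`: the $Vh_j$ term, corresponding to the encoder hidden states

`W_query`: the $Ws_i$ term, corresponding to the decoder hidden state

`W_fil`: the $Uf_{i,j}$ term, with $f_i=F*c\alpha_{i-1}$ and $c\alpha_{i-1}=\sum_{j=1}^{i-1}\alpha_j$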
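The score function relies on broadcasting: the query has a singleton time axis, so it is added to every encoder position. A NumPy sketch of the same expression, with random tensors standing in for the projected inputs (all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_time, attention_dim = 2, 5, 4   # hypothetical sizes

W_query = rng.normal(size=(batch_size, 1, attention_dim))        # W s_i
W_keys = rng.normal(size=(batch_size, max_time, attention_dim))  # V h_j
W_fil = rng.normal(size=(batch_size, max_time, attention_dim))   # U f_{i,j}
v_a = rng.normal(size=(attention_dim,))
b_a = np.zeros(attention_dim)

# W_query broadcasts along max_time, so every encoder position j is scored
# against the same decoder state s_i.
energy = np.sum(v_a * np.tanh(W_keys + W_query + W_fil + b_a), axis=2)
print(energy.shape)  # (2, 5)
```

Multiplying by `v_a` and summing over the last axis is the dot product $v_a^T\tanh(\cdot)$ evaluated at every position at once.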

### Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:   

`The animal didn't cross the street because it was too tired`    

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.    

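As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help produce a better encoding for this word. If you are familiar with RNNs, think of how maintaining a hidden state allows an RNN to combine its representation of the previous words/vectors it has processed with the current one it is processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.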

![Screen Shot 2021-07-29 at 7.04.32 PM.png](attachment:ec2ef824-b5c0-42e3-878b-30d8c4e58f7c.png)

### Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented, using matrices.

The **first step** in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
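A small NumPy sketch of this first step, using the dimensions quoted above. The weight matrices are random stand-ins for trained parameters, and the names `x1`, `W_Q`, `W_K`, `W_V` are chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64                    # embedding size and query/key/value size

x1 = rng.normal(size=d_model)             # embedding of one word
W_Q = rng.normal(size=(d_model, d_k))     # random stand-ins for trained weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q1, k1, v1 = x1 @ W_Q, x1 @ W_K, x1 @ W_V
print(q1.shape, k1.shape, v1.shape)       # (64,) (64,) (64,)
```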

![Screen Shot 2021-07-29 at 7.11.36 PM.png](attachment:43e2ffdd-7e15-47b6-a71a-514308e70304.png) 

What are the “query”, “key”, and “value” vectors?   

They’re abstractions that are useful for calculating and thinking about attention.

The **second step** in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”.
We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector of “Thinking” with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
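For example, with tiny hand-picked vectors (4 dimensions instead of 64, values chosen only for illustration):

```python
import numpy as np

q1 = np.array([1.0, 0.0, 1.0, 0.0])   # query of the word at position #1
k1 = np.array([1.0, 1.0, 1.0, 0.0])   # key of the word at position #1
k2 = np.array([0.0, 1.0, 0.0, 1.0])   # key of the word at position #2

score_11 = q1 @ k1                    # how much word 1 attends to itself
score_12 = q1 @ k2                    # how much word 1 attends to word 2
print(score_11, score_12)             # 2.0 0.0
```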

![Screen Shot 2021-07-29 at 7.20.58 PM.png](attachment:53f01cc1-1176-45f2-a83e-b8a4015edb2b.png)

The **third and fourth steps** are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64; this leads to more stable gradients, and although other values are possible, this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.
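A worked example of these two steps, assuming raw scores of 112 and 96 for the two words (illustrative values):

```python
import numpy as np

d_k = 64
scores = np.array([112.0, 96.0])          # hypothetical q1·k1 and q1·k2

scaled = scores / np.sqrt(d_k)            # step three: [14., 12.]
weights = np.exp(scaled - scaled.max())   # step four: softmax (shifted for stability)
weights /= weights.sum()
print(weights.round(2))                   # [0.88 0.12]
```

Without the scaling, the softmax of 112 and 96 would put essentially all of the weight on the first word.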

![Screen Shot 2021-07-29 at 7.22.59 PM.png](attachment:49af34eb-1a31-4001-aea9-aba796d0e297.png)

This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it's useful to attend to another word that is relevant to the current word.

The **fifth step** is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The **sixth step** is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
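Steps five and six amount to a weighted sum of the value vectors. A sketch with assumed softmax weights and 3-dimensional value vectors (both made up for the example):

```python
import numpy as np

weights = np.array([0.88, 0.12])            # softmax scores for word 1
values = np.array([[1.0, 2.0, 3.0],         # v1
                   [10.0, 20.0, 30.0]])     # v2

weighted = weights[:, None] * values        # step five: scale each value vector
z1 = weighted.sum(axis=0)                   # step six: sum them up
print(z1)
```

The same result is obtained with a single matrix product, `weights @ values`, which is what the matrix form of self-attention uses.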

![Screen Shot 2021-07-29 at 7.29.09 PM.png](attachment:a1483fa7-1ea4-4bf8-b0ed-fa6af0cdfcf2.png)

That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

### Matrix Calculation of Self-Attention

The **first step** is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained ($W^Q$, $W^K$, $W^V$).  

![Screen Shot 2021-07-29 at 7.34.46 PM.png](attachment:6de407b0-7a67-4498-8000-b1990cd1c8f2.png)

**Finally**, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

![Screen Shot 2021-07-29 at 7.36.48 PM.png](attachment:f8855963-04f0-4685-8448-4d1c5a3eb331.png)
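The condensed formula is $Z=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$. A minimal single-head NumPy sketch, with random matrices standing in for the trained weights (the function name `self_attention` and the sizes are assumptions for the example):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for one head, no masking."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # step one
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # steps two and three
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # step four: row-wise softmax
    return weights @ V, weights                       # steps five and six

rng = np.random.default_rng(0)
n_words, d_model, d_k = 2, 512, 64
X = rng.normal(size=(n_words, d_model))               # one embedding per row
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * d_model ** -0.5 for _ in range(3))

Z, weights = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, weights.sum(axis=-1))
```

Each row of `Z` is the self-attention output for one word, and each row of `weights` sums to 1.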

