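## Motivation
- In machine translation, each generated word may relate to a different word in the source sentence.
- A plain seq2seq model passes only the encoder's last hidden state to the decoder, so the decoder cannot focus on the relevant source word at each step.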

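## Prediction
- The encoder RNN (left side of the figure below) produces an output for every source word; these outputs serve as the keys and values for attention.
- The decoder RNN (right side of the figure below) supplies its output for the previous word as the query for attention.
- The attention output (the context) is concatenated with the word embedding of the next word and fed into the decoder RNN to produce the new output.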
![](./imgs/66-1.png)
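
The three steps above can be sketched for a single decoding step in plain PyTorch. This is only an illustration under stated assumptions: the tensor sizes are arbitrary, the inputs are random, and the names `W_q`, `W_k`, and `w_v` are hypothetical stand-ins for the learnable parameters of an additive attention layer, not the `d2l` implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_steps, num_hiddens, embed_size = 2, 5, 8, 4

# Encoder outputs for every source token: used as both keys and values.
enc_outputs = torch.randn(batch_size, num_steps, num_hiddens)
# Decoder output from the previous step: used as the query.
query = torch.randn(batch_size, 1, num_hiddens)
# Word embedding of the next token fed to the decoder.
next_embed = torch.randn(batch_size, 1, embed_size)

# Additive scoring (assumed form): score(q, k) = w_v^T tanh(W_q q + W_k k)
W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)
W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
w_v = nn.Linear(num_hiddens, 1, bias=False)

# (batch, 1, h) + (batch, steps, h) broadcasts to (batch, steps, h).
features = torch.tanh(W_q(query) + W_k(enc_outputs))
scores = w_v(features).squeeze(-1)        # (batch, steps)
weights = torch.softmax(scores, dim=-1)   # one weight per source token

# Context: weighted sum of the encoder outputs, shape (batch, 1, h).
context = torch.bmm(weights.unsqueeze(1), enc_outputs)
# Concatenate context and embedding to form the decoder RNN input.
rnn_input = torch.cat((context, next_embed), dim=-1)
print(context.shape, rnn_input.shape)
```

Because the query changes at every decoding step, the weights and the context are recomputed for each generated word.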

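## Summary
- Seq2seq passes information from the encoder to the decoder through hidden states.
- The attention mechanism uses the decoder RNN's output to match the relevant encoder RNN outputs, which passes information more effectively.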

In [None]:
## Bahdanau attention

In [1]:
import torch
from torch import nn
import d2l

In [None]:
class AttentionDecoder(d2l.Decoder):
    """Base interface for decoders with an attention mechanism."""
    def __init__(self, **kwargs):
        super(AttentionDecoder, self).__init__(**kwargs)
        
    @property
    def attention_weights(self):
        raise NotImplementedError

In [None]:
# Add attention to the decoder
class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqAttentionDecoder, self).__init__(**kwargs)
        # Additive attention is used here.
        # Although the key, query, and value sizes all match (num_hiddens),
        # additive attention provides more learnable parameters.
        self.attention = d2l.AdditiveAttention(
            num_hiddens, num_hiddens, num_hiddens, dropout)
        # The next three layers are unchanged
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(
            embed_size + num_hiddens, num_hiddens, num_layers,
            dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)
    
    # enc_valid_lens is an additional argument here
    def init_state(self, enc_outputs, enc_valid_lens, *args):
        # `outputs` has shape (`num_steps`, `batch_size`, `num_hiddens`).
        # `hidden_state` has shape (`num_layers`, `batch_size`,
        # `num_hiddens`)
        outputs, hidden_state = enc_outputs
        # permute(1, 0, 2) gives (`batch_size`, `num_steps`, `num_hiddens`)
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # `enc_outputs` has shape (`batch_size`, `num_steps`, `num_hiddens`).
        # `hidden_state` has shape (`num_layers`, `batch_size`,
        # `num_hiddens`)
        enc_outputs, hidden_state, enc_valid_lens = state
        # After embedding and permuting, `X` has shape
        # (`num_steps`, `batch_size`, `embed_size`)
        X = self.embedding(X).permute(1, 0, 2)
        outputs, self._attention_weights = [], []
        # The context changes at every step, so loop over the time steps
        for x in X:
            # `query` has shape (`batch_size`, 1, `num_hiddens`);
            # unsqueezing at dim=1 adds the num_queries axis.
            # `hidden_state[-1]` is the last RNN layer's hidden state from
            # the previous time step.
            query = torch.unsqueeze(hidden_state[-1], dim=1)
            # `context` has shape (`batch_size`, 1, `num_hiddens`)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate along the feature dimension
            x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
            # Reshape `x` to (1, `batch_size`, `embed_size` + `num_hiddens`)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After the fully connected layer, `outputs` has shape
        # (`num_steps`, `batch_size`, `vocab_size`)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.permute(1, 0, 2), [enc_outputs, hidden_state,
                                          enc_valid_lens]
    
    @property
    def attention_weights(self):
        return self._attention_weights
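
The `enc_valid_lens` value carried through `init_state` and `forward` above lets the attention layer ignore padding tokens in the source sentence. The following is a minimal sketch of that idea, assuming the attention layer masks the scores beyond each sequence's valid length before applying softmax. The helper name `masked_softmax_sketch` is hypothetical and is not part of `d2l`.

```python
import torch

def masked_softmax_sketch(scores, valid_lens):
    """Softmax over the last axis, ignoring key positions >= valid_lens.

    scores: (batch_size, num_queries, num_keys)
    valid_lens: (batch_size,) tensor of valid key counts, or None
    """
    if valid_lens is None:
        return torch.softmax(scores, dim=-1)
    num_keys = scores.shape[-1]
    # mask[b, k] is True where key k is a real (non-padding) token.
    positions = torch.arange(num_keys, device=scores.device)
    mask = positions[None, :] < valid_lens[:, None]
    # Broadcast over the query axis; padded keys get a very negative score,
    # so their weight is effectively zero after softmax.
    masked = scores.masked_fill(~mask[:, None, :], -1e6)
    return torch.softmax(masked, dim=-1)

# Two source sentences padded to length 4, with 2 and 3 real tokens.
scores = torch.zeros(2, 1, 4)
valid_lens = torch.tensor([2, 3])
weights = masked_softmax_sketch(scores, valid_lens)
print(weights)
```

With equal scores, the weight is spread evenly over the real tokens of each sentence and the padded positions receive none.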