<a href="https://colab.research.google.com/github/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/transformer_anatomy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Transformer Anatomy

The Transformer is based on the encoder-decoder
architecture that is widely used for tasks like machine translation,
where a sequence of words is translated from one language to another.

This architecture consists of two components:

- Encoder
  - Converts an input sequence of tokens into a sequence of embedding
vectors, often called the hidden state or context.
- Decoder
  - Uses the encoder’s hidden state to iteratively generate an output
sequence of tokens, one token at a time.

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/0.png?raw=1' width='600'/>

The Transformer architecture was originally designed for sequence-to-sequence
tasks like machine translation, but both the encoder and decoder
submodules were soon adapted as stand-alone models. 

Although there are
hundreds of different transformer models, most of them belong to one of
three types:

- Encoder-only
  - These models convert an input sequence of text into a rich numerical
representation that is well suited for tasks like text classification or
named entity recognition. BERT and its variants like RoBERTa and
DistilBERT belong to this class of architectures.
- Decoder-only
  - Given a prompt of text like “Thanks for lunch, I had a …”, these models
will auto-complete the sequence by iteratively predicting the most
probable next word. The family of GPT models belongs to this class.
- Encoder-decoder
  - Used for modeling complex mappings from one sequence of text to
another. Suitable for machine translation and summarization. The
Transformer, BART and T5 models belong to this class.


##Setup

In [None]:
%%shell

pip -q install transformers
pip -q install bertviz

In [None]:
from transformers import AutoTokenizer
from transformers import AutoConfig
from transformers import AutoModel

import torch
import torch.nn as nn
import torch.nn.functional as F

from math import sqrt

from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show
from bertviz import head_view

##Transformer Encoder

The Transformer’s encoder consists of many encoder layers stacked next to each other.

Each encoder
layer receives a sequence of embeddings and feeds them through the
following sub-layers:

- A multi-head self-attention layer
- A feed-forward layer

The output embeddings of each encoder layer have the same size as the inputs, and the main role of the encoder stack is to
“update” the input embeddings to produce representations that encode some
contextual information in the sequence.

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/1.png?raw=1' width='600'/>

Each of these sub-layers also has a skip connection and layer normalization,
which are standard tricks to train deep neural networks effectively.

###Self-Attention

As we know, self-attention is a mechanism that
allows neural networks to assign a different amount of weight or “attention”
to each element in a sequence.

For text sequences, the elements are token embeddings, where each token is
mapped to a vector of some fixed dimension. For example, in BERT each
token is represented as a 768-dimensional vector.

The “self” part of self-attention
refers to the fact that these weights are computed for all hidden
states in the same set, e.g. all the hidden states of the encoder.

By contrast,
the attention mechanism associated with recurrent models involves
computing the relevance of each encoder hidden state to the decoder hidden
state at a given decoding timestep.

The main idea behind self-attention is that instead of using a fixed
embedding for each token, we can use the whole sequence to compute a
weighted average of each embedding.

$$
y_i = \sum_{j=1}^n w_{ji}x_j
$$

The coefficients $w_{ji}$ are called attention weights and are normalized so that $\sum_j w_{ji}= 1$.
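
As a rough illustration of the formula above, the following plain-Python sketch computes a single output vector $y_i$ from three toy embeddings. The embeddings and raw scores are made-up numbers, and `softmax` is a small helper defined here only to produce weights that sum to one; none of this is part of the notebook's own code.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings x_j (made-up values, one 2-d vector per token)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Made-up raw relevance scores of every token j for one output position i
raw_scores = [2.0, 0.5, 1.0]
w = softmax(raw_scores)  # attention weights w_{ji} for this fixed i

# y_i = sum_j w_{ji} x_j, computed one coordinate at a time
y_i = [sum(w_j * x_j[d] for w_j, x_j in zip(w, x)) for d in range(2)]

print(w, y_i)
```

Because the weights are positive and sum to one, every coordinate of `y_i` stays between the smallest and largest value of that coordinate in the inputs.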

Let's consider what comes to mind when you see the word “flies”. You
might think of an annoying insect, but if you were given more context like
“time flies like an arrow” then you would realize that “flies” refers to the
verb instead. 

Similarly, we can create a representation for “flies” that
incorporates this context by combining all the token embeddings in
different proportions, perhaps by assigning a larger weight $w_{ji}$ to the token embeddings for “time” and “arrow”.

Embeddings that are generated in this
way are called contextualized embeddings and predate the invention of
transformers with language models like ELMo.



###Scaled Dot-Product Attention

There are several ways to implement a self-attention layer, but the most
common one is scaled dot-product attention.

There are four main steps needed to implement this mechanism:
- Create query, key, and value vectors
  - Each token embedding is projected into three vectors called query, key,
and value.
- Compute attention scores
  - Determine how much the query and key vectors relate to each other
using a similarity function (here the dot product, computed for all pairs at once with a matrix multiplication).
- Compute attention weights
  - Dot-products can in general produce arbitrarily large numbers which
can destabilize the training process. To handle this, the attention scores
are first multiplied by a scaling factor and then normalized with a
softmax to ensure that the weights for each query sum to one. The resulting $n × n$
matrix now contains all the attention weights $w_{ji}$.
- Update the token embeddings
  - Once the attention weights are computed, we multiply them by the
value vectors to obtain an updated representation for each embedding, $y_i=\sum_{j} w_{ji}v_j$.

We can visualize how the attention weights are calculated with a nifty
library called [BertViz](https://github.com/jessevig/bertviz).

To visualize the attention weights, we can use the `neuron_view` module
which traces the computation of the weights to show how the query and key
vectors are combined to produce the final weight.

In [None]:
model_ckpt = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

In [None]:
text = "time flies like an arrow"
show(model, "bert", tokenizer, text, display_mode="light", layer=0, head=8)

**A simple analogy for Query, Key and Value**

Let's imagine that
you’re at the supermarket buying all the ingredients necessary for your
dinner. From the dish’s recipe, each of the required ingredients can be
thought of as a query, and as you scan through the shelves you look at
the labels (keys) and check whether each one matches an ingredient on your list
(similarity function/dot-product). If you have a match then you take the item (value) from the shelf.

In this example, we only get one grocery item for every label that
matches the ingredient. Self-attention is a more abstract and “smooth”
version of this: every label in the supermarket matches the ingredient to
the extent to which each key matches the query.
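
To make the analogy a little more concrete, here is a small plain-Python sketch that contrasts a "hard" lookup (take the single best-matching item) with the "smooth" lookup that attention performs. The keys, values and query are made-up numbers chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Shelf: each item has a label vector (key) and a content vector (value)
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[10.0], [20.0], [30.0]]
query = [1.0, 0.2]  # the ingredient we are looking for

scores = [dot(query, k) for k in keys]

# Hard lookup: take only the value whose key matches best
best = max(range(len(keys)), key=lambda j: scores[j])
hard_result = values[best]

# Soft lookup: blend all values, weighted by how well each key matches
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]
soft_result = [sum(w * v[0] for w, v in zip(weights, values))]

print(hard_result, soft_result)
```

The hard lookup returns exactly one item, while the soft lookup returns a mixture in which better-matching items contribute more.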

Let’s take a look at this process in more detail by implementing the diagram
of operations to compute scaled dot-product attention.

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/2.png?raw=1' width='600'/>

The first thing we need to do is tokenize the text, so let’s use our tokenizer
to extract the input IDs:

In [None]:
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

As we can see, each token in the sentence has been mapped to a
unique ID in the tokenizer’s vocabulary. To keep things simple, we’ve also
excluded the `[CLS]` and `[SEP]` tokens.

Next we need to create a dense embedding layer that acts as a lookup table for each input ID:



In [None]:
# load the config.json file associated with the bert-base-uncased checkpoint
config = AutoConfig.from_pretrained(model_ckpt)

In [None]:
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
token_emb

Embedding(30522, 768)

In Transformers,
every checkpoint is assigned a configuration file that specifies various hyperparameters like `vocab_size` and `hidden_size`. In our
example, these show that each input ID will be mapped to one of the 30,522 embedding vectors stored in `nn.Embedding`, each with a size of 768.

Now that we have our lookup table, we can generate the embeddings by feeding
in the input IDs:

In [None]:
inputs_embeds = token_emb(inputs.input_ids)
inputs_embeds.size()

torch.Size([1, 5, 768])

This has given us a tensor of size `(batch_size, seq_len, hidden_dim)`.

So, now let's create the query, key,
and value vectors and calculate the attention scores using the dot-product as
the similarity function:

In [None]:
query = key = value = inputs_embeds

In [None]:
query.size(), key.size(), value.size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

In [None]:
key.transpose(1, 2).size()

torch.Size([1, 768, 5])

In [None]:
query = key = value = inputs_embeds
dim_k = key.size(-1)  # 768

# key.transpose(1, 2) swaps only the seq_len and hidden_dim axes
# [1, 5, 768] @ [1, 768, 5] / sqrt(768) -> [1, 5, 5]
# [5, 768] @ [768, 5] / sqrt(768) -> [5, 5] if we ignore the batch dimension
scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
scores.size()

torch.Size([1, 5, 5])

This has created a 5 × 5 matrix of attention scores.

In scaled dot-product attention,
the dot-products are scaled by the square root of the size of the embedding vectors so that we
don’t get too many large numbers during training that can saturate the softmax and cause problems
with backpropagation.
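
The effect of the scaling factor can be checked numerically. The sketch below is a rough experiment, assuming that the components of the query and key vectors are independent with zero mean and unit variance (an idealization, not something guaranteed for real embeddings). Under that assumption the raw dot product has a variance close to the vector size, and dividing by its square root brings the variance back to roughly one.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the experiment is repeatable

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

dim_k = 64        # toy vector size
n_samples = 2000  # number of random query/key pairs

raw, scaled = [], []
for _ in range(n_samples):
    q = [random.gauss(0.0, 1.0) for _ in range(dim_k)]
    k = [random.gauss(0.0, 1.0) for _ in range(dim_k)]
    score = dot(q, k)
    raw.append(score)
    scaled.append(score / sqrt(dim_k))

raw_var = variance(raw)
scaled_var = variance(scaled)
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```

Keeping the scores at this scale keeps the softmax away from the regime where one weight is nearly one and all the others are nearly zero.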

> The torch.bmm function performs a batch matrix-matrix product that simplifies the computation of the attention scores where the query and key vectors have size `(batch_size, seq_len, hidden_dim)`. If we ignored the batch dimension we
could calculate the dot product between each query and key vector by simply
transposing the key tensor to have shape `(hidden_dim, seq_len)` and then using
the matrix product to collect all the dot-products in a `(seq_len, seq_len)` matrix.

Next we normalize them
by applying a softmax over the last dimension, so that the weights in each row (one row per query) sum to one:

In [None]:
weights = F.softmax(scores, dim=-1)

In [None]:
weights.shape

torch.Size([1, 5, 5])

The final step is to multiply the attention weights by the values:

In [None]:
# [1, 5, 5] * [1, 5, 768] = [1, 5, 768]
# [5, 5] * [5, 768] = [5, 768]
attn_outputs = torch.bmm(weights, value)
attn_outputs.shape

torch.Size([1, 5, 768])

And that’s it - we’ve gone through all the steps to implement a simplified
form of self-attention!

Notice that the whole process is just two matrix
multiplications and a softmax, so next time you think of “self-attention”
you can mentally remember that all we’re doing is just a fancy form of
averaging.

Let’s wrap these steps in a function called `scaled_dot_product_attention`:

In [None]:
def scaled_dot_product_attention(query, key, value):
  dim_k = key.size(-1)
  scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
  weights = F.softmax(scores, dim=-1)
  return torch.bmm(weights, value)

In [None]:
attn_outputs = scaled_dot_product_attention(query, key, value)
attn_outputs.shape

torch.Size([1, 5, 768])

Our attention mechanism with equal query and key vectors will assign a
very large score to identical words in the context, and in particular to the
current word itself: the dot product of a vector with itself is its squared length, which is the largest score it can get against any vector of the same length.

But in
practice the meaning of a word will be better informed by complementary
words in the context than by identical words, e.g. the meaning of “flies” is
better defined by incorporating information from “time” and “arrow” than
by another mention of “flies”. How can we promote this behavior?

Let’s allow the model to create a different set of vectors for the query, key
and value of a token by using three different linear projections to project
our initial token vector into three different spaces.

###Multi-Headed Attention

In practice,
the self-attention layer applies three independent linear transformations to
each embedding to generate the query, key, and value vectors. These
transformations project the embeddings and each projection carries its own
set of learnable parameters, which allows the self-attention layer to focus on
different semantic aspects of the sequence.

It also turns out to be beneficial to have multiple sets of linear projections, each one representing a so-called attention head.

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/7.png?raw=1' width='600'/>

Let’s implement this layer by first coding up a single attention head:

In [None]:
class AttentionHead(nn.Module):

  def __init__(self, embed_dim, head_dim):
    super().__init__()
    self.q = nn.Linear(embed_dim, head_dim)
    self.k = nn.Linear(embed_dim, head_dim)
    self.v = nn.Linear(embed_dim, head_dim)

  def forward(self, hidden_state):
    attention_outputs = scaled_dot_product_attention(self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
    return attention_outputs

Here we’ve initialized three independent linear layers that apply matrix
multiplication to the embedding vectors to produce tensors of size
`(batch_size, seq_len, head_dim)` where `head_dim` is the
dimension we are projecting into. 

Although `head_dim` does not have to be
smaller than the embedding dimension `embed_dim` of the tokens, in
practice it is chosen so that `embed_dim` is a multiple of `head_dim` (that is, `head_dim = embed_dim / num_heads`), which keeps the
computation across each head constant.

For example, BERT has 12 attention heads, so the dimension of each head is `768/12 = 64`.

Now that we have a single attention head, we can concatenate the outputs of
each one to implement the full multi-headed attention layer:


In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, config):
    super().__init__()
    embed_dim = config.hidden_size
    num_heads = config.num_attention_heads
    head_dim = embed_dim // num_heads

    self.heads = nn.ModuleList([AttentionHead(embed_dim, head_dim) for _ in range(num_heads)])
    self.output_linear = nn.Linear(embed_dim, embed_dim)

  def forward(self, hidden_state):
    x = torch.cat([head(hidden_state) for head in self.heads], dim=-1)
    x = self.output_linear(x)
    return x

Notice that the concatenated output from the attention heads is also fed
through a final linear layer to produce an output tensor of size
`(batch_size, seq_len, hidden_dim)` that is suitable for the
feed forward network downstream. 

As a sanity check, let’s see if the multi-headed attention layer produces an output with the same shape as our inputs:

In [None]:
multihead_attention = MultiHeadAttention(config)
attention_output = multihead_attention(inputs_embeds)
attention_output.size()

torch.Size([1, 5, 768])
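
As a back-of-the-envelope check on the layer above, the following sketch counts its learnable parameters using plain arithmetic. It assumes the `bert-base-uncased` sizes quoted earlier (hidden size 768, 12 heads) and that every linear layer has a bias, as `nn.Linear` does by default; `linear_params` is a helper introduced only for this illustration.

```python
embed_dim = 768  # hidden size of bert-base-uncased
num_heads = 12
head_dim = embed_dim // num_heads

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Each head has its own query, key and value projection
per_head = 3 * linear_params(embed_dim, head_dim)
all_heads = num_heads * per_head

# Concatenating the heads gives num_heads * head_dim = embed_dim features,
# which are then mixed by the output projection
output_proj = linear_params(embed_dim, embed_dim)

total = all_heads + output_proj
print(head_dim, per_head, all_heads, output_proj, total)
```

Splitting the projections across 12 heads of size 64 costs the same number of parameters as three full 768-to-768 projections, so adding heads this way does not make the layer larger.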

Now, let’s use BertViz again to visualise the attention for two different uses of the word “flies”.

In [None]:
model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)

sentence_a = "time flies like an arrow"
sentence_b = "fruit flies like a banana"

viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
attention = model(**viz_inputs).attentions

sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

head_view(attention, tokens, sentence_b_start, heads=[8])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


###Feed Forward Layer

The feed forward sub-layer in the encoder and decoder is just a simple two-layer
fully-connected neural network, but with a twist: instead of processing
the whole sequence of embeddings as a single vector, it processes each
embedding independently. For this reason, this layer is often referred to as a
position-wise feed forward layer. These position-wise feed forward layers
are sometimes also referred to as a one-dimensional convolution with
kernel size of one, typically by people with a computer vision background.

A rule of thumb
from the literature is to pick the hidden size of the first layer to be four
times the size of the embeddings, and a GELU activation function is most
commonly used. This is where most of the capacity and memorization is
hypothesized to happen, and it is the part that is most often scaled when scaling
up the models.
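
The "position-wise" behaviour can be illustrated without any framework. The sketch below is a toy version with made-up weights and a hidden size of 2, not the layer used in this notebook: it applies the same two-layer network with a GELU activation to every position separately. The `gelu` helper uses the exact error-function form of the activation.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weight, bias):
    """Compute W @ vec + b, with weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

hidden_size = 2
intermediate_size = 4 * hidden_size  # the "four times" rule of thumb

# Made-up deterministic weights, only so that the example runs
w1 = [[0.1 * (i + j) for j in range(hidden_size)] for i in range(intermediate_size)]
b1 = [0.0] * intermediate_size
w2 = [[0.05 * (i - j) for j in range(intermediate_size)] for i in range(hidden_size)]
b2 = [0.0] * hidden_size

def feed_forward(vec):
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)

sequence = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]  # three toy token embeddings

# The same network is applied to every position on its own
outputs = [feed_forward(token) for token in sequence]
print(outputs)
```

Since no position ever sees another one, reordering the sequence simply reordering the outputs is the expected result, which is why mixing information across tokens is left entirely to the attention layer.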

In [None]:
class FeedForwardLayer(nn.Module):
  def __init__(self, config):
    super().__init__()
    self.linear_1 = nn.Linear(config.hidden_size, config.intermediate_size)
    self.linear_2 = nn.Linear(config.intermediate_size, config.hidden_size)
    self.gelu = nn.GELU()
    self.dropout = nn.Dropout(config.hidden_dropout_prob)

  def forward(self, x):
    x = self.linear_1(x)
    x = self.gelu(x)
    x = self.linear_2(x)
    x = self.dropout(x)
    return x

Let’s test this by passing the attention outputs:

In [None]:
feed_forward = FeedForwardLayer(config)
ff_outputs = feed_forward(attention_output)
ff_outputs.size()

torch.Size([1, 5, 768])

We now have all the ingredients to create a fully-fledged transformer
encoder layer! The only decision left to make is where to place the skip
connections and layer normalization.

###Putting It All Together

When it comes to placing the layer normalization in the encoder or decoder
layers of a transformer, there are two main choices adopted in the literature:

- Post layer normalization
  - This is the arrangement used in the original Transformer paper; it places layer normalization in between the skip connections. This arrangement is
tricky to train from scratch as the gradients can diverge. For this reason,
you will often see a concept known as learning rate warm-up, where the
learning rate is gradually increased from a small value to some
maximum during training.
- Pre layer normalization
  - Places layer normalization within the span of the skip connections. This tends to be much more stable during training and does not usually require learning rate warm-up.
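
The difference between the two arrangements comes down to the order of three operations. The sketch below is a minimal plain-Python illustration: `layer_norm` has no learnable scale or shift, and `sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def sublayer(vec):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v + 1.0 for v in vec]

def post_ln_block(x):
    # Post layer normalization: add the skip connection, then normalize the sum
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    # Pre layer normalization: normalize only the input of the residual branch
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 9.0]
post_out = post_ln_block(x)
pre_out = pre_ln_block(x)
print(post_out, pre_out)
```

In the pre layer normalization variant the input `x` reaches the output untouched through the skip connection, whereas in the post variant everything, including the skip path, is renormalized at every layer.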

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/8.png?raw=1' width='600'/>

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/9.png?raw=1' width='600'/>

We’ll use the pre-layernorm arrangement so we can simply stick together
our building blocks as follows:



In [None]:
class TransformerEncoderLayer(nn.Module):
  def __init__(self, config):
    super().__init__()
    
    self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
    self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
    self.multi_head_attention = MultiHeadAttention(config)
    self.feed_forward_layer = FeedForwardLayer(config)

  def forward(self, x):
    # Apply layer normalization and then copy input into query, key, value
    hidden_state = self.layer_norm_1(x)
    # Apply attention with a skip connection
    x = x + self.multi_head_attention(hidden_state)
    # Apply feed-forward layer with a skip connection
    x = x + self.feed_forward_layer(self.layer_norm_2(x))
    return x

Let’s now test this with our input embeddings:

In [None]:
encoder_layer = TransformerEncoderLayer(config)
inputs_embeds.shape, encoder_layer(inputs_embeds).size()

(torch.Size([1, 5, 768]), torch.Size([1, 5, 768]))

It works! We’ve now implemented our very first transformer encoder layer
from scratch!

In principle, we could now pass the input embeddings through
the encoder layer. However, there is a caveat with the way we set up the
encoder layers: **they are totally invariant to the position of the tokens**. Since the multi-head attention layer is effectively a fancy weighted sum, there is no way to encode the positional information in the sequence.

Luckily there is an easy trick to incorporate positional information with
positional encodings.
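
This lack of positional awareness can be verified on a toy example. The sketch below uses a plain-Python version of self-attention with query = key = value (a simplification of the layers built above, with made-up embeddings) and checks that shuffling the input tokens only shuffles the outputs, leaving each token's vector unchanged.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def self_attention(seq):
    """Scaled dot-product self-attention with query = key = value = seq."""
    dim_k = len(seq[0])
    outputs = []
    for q in seq:
        scores = [dot(q, k) / math.sqrt(dim_k) for k in seq]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, seq)) for d in range(dim_k)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # made-up embeddings
order = [2, 0, 1]                              # an arbitrary reordering
shuffled = [tokens[i] for i in order]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Position i of the shuffled output should equal position order[i] of the original
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, pos in enumerate(order)
    for a, b in zip(out_shuffled[i], out[pos])
)
print(same)
```

In other words, the layer treats the input as a set of vectors rather than a sequence, which is the gap that positional information has to fill.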

###Positional Embeddings

Positional embeddings are based on a simple, yet very effective idea:
augment the token embeddings with a position-dependent pattern of values
arranged in a vector. If the pattern is characteristic for each position, the
attention heads and feed-forward layers in each stack can learn to
incorporate positional information in their transformations.

Let’s create a custom `Embeddings` module that combines a token
embedding layer that projects the `input_ids` to a dense hidden state
together with the positional embedding that does the same for
`position_ids`. 

The resulting embedding is simply the sum of both
embeddings:

In [None]:
class Embeddings(nn.Module):

  def __init__(self, config):
    super().__init__()

    self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
    self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
    self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)

  def forward(self, input_ids):
    # Create position IDs for input sequence
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device).unsqueeze(0)

    # create token and position embeddings
    token_embeddings = self.token_embeddings(input_ids)
    position_embeddings = self.position_embeddings(position_ids)

    # Combine token and position embeddings
    embeddings = token_embeddings + position_embeddings
    embeddings = self.layer_norm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings

In [None]:
embedding_layer = Embeddings(config)
embedding_layer(inputs.input_ids).size()

torch.Size([1, 5, 768])

We see that the embedding layer now creates a single, dense embedding for
each token. 

While learnable position embeddings are easy to implement and
widely used, there are several alternatives:

- **Absolute positional representations**: The original Transformer model uses static patterns to encode the position of the
tokens. The pattern consists of modulated sine and cosine signals and works especially well in the low data regime.
- **Relative positional representations**: Although absolute positions are important, one can argue that the surrounding tokens matter most when computing the embedding of a token. Relative positional representations follow that intuition
and encode the relative positions between tokens. Models such as
DeBERTa use such representations.
- **Rotary position embeddings**: By combining the idea of absolute and relative positional
representations, rotary position embeddings achieve excellent results on
many tasks. A recent example of rotary position embeddings in action is
GPT-Neo.
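
For the first alternative, the static pattern can be written down in a few lines. The sketch below follows the sinusoidal formulation from the original Transformer paper, where even indices hold $\sin(pos / 10000^{2i/d})$ and odd indices the matching cosine. The function name `sinusoidal_positions` and the toy sizes are choices made for this illustration, and the code assumes an even embedding dimension.

```python
import math

def sinusoidal_positions(seq_len, dim):
    """Build a (seq_len x dim) table of sinusoidal position encodings.

    Assumes dim is even: each sine/cosine pair shares one frequency.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=5, dim=8)
for row in pe:
    print([round(v, 3) for v in row])
```

A table like this would be added to the token embeddings in the same way as the learned `position_embeddings`, with the difference that it contains no trainable parameters.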

Let’s put it all together now by building the full transformer encoder by
combining the embeddings with the encoder layers:

In [None]:
class TransformerEncoder(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.embeddings = Embeddings(config)
    self.layers = nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])

  def forward(self, x):
    x = self.embeddings(x)
    for layer in self.layers:
      x = layer(x)
    return x

Let’s check the output shapes of the encoder:

In [None]:
inputs.input_ids

tensor([[ 2051, 10029,  2066,  2019,  8612]])

In [None]:
encoder = TransformerEncoder(config)
encoder(inputs.input_ids).size()

torch.Size([1, 5, 768])

We can see that we get a hidden state for each token in the batch. This
output format makes the architecture very flexible, and we can easily adapt
it for various applications such as predicting missing tokens in masked
language modeling or predicting the start and end position of an answer in
question answering.

Let’s see how we can build a classifier with the
encoder, like the one we get from `AutoModelForSequenceClassification` in the `transformers` library.

###Bodies and Heads

So now that we have a full transformer encoder model we would like to
build a classifier with it. The model is usually divided into a task-independent
body and a task-specific head.

**What we’ve built so far is the
body and we now need to attach a classification head to that body. Since we
have a hidden state for each token but only need to make one prediction,
there are several options for how to approach this.**

Traditionally, the first token
in such models is used for the prediction and we can attach a dropout and linear layer to make a classification prediction.
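
To illustrate what "one prediction from many hidden states" means, the sketch below reduces a toy sequence of hidden states to a single vector in two ways. Taking the first token is the option used in this notebook; averaging over all tokens is shown only as one commonly used alternative and is an assumption of this example, not something the code below relies on. All numbers are made up.

```python
# Made-up hidden states for one sequence: 3 tokens, hidden size 4
hidden_states = [
    [0.1, 0.2, 0.3, 0.4],  # first token
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
]

# Option used in this notebook: keep only the first token's hidden state
first_token = hidden_states[0]

# Alternative shown for comparison: average each feature over all tokens
seq_len = len(hidden_states)
mean_pooled = [sum(column) / seq_len for column in zip(*hidden_states)]

print(first_token, mean_pooled)
```

Both options produce one vector of the hidden size per sequence, so the dropout and linear layers of the classification head would look the same in either case.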

The following class extends
the existing encoder for sequence classification:



In [None]:
class TransformerForSequenceClassification(nn.Module):
  def __init__(self, config):
    super().__init__()

    self.encoder = TransformerEncoder(config)
    self.dropout = nn.Dropout(config.hidden_dropout_prob)
    self.classifier = nn.Linear(config.hidden_size, config.num_labels)

  def forward(self, x):
    x = self.encoder(x)[:, 0, :]
    x = self.dropout(x)
    x = self.classifier(x)
    return x

Before initializing the model, we need to define how many classes we would
like to predict:

In [None]:
config.num_labels = 3
encoder_classifier = TransformerForSequenceClassification(config)
encoder_classifier(inputs.input_ids).size()

torch.Size([1, 3])

That is exactly what we have been looking for. For each example in the
batch we get the un-normalized logits for each class in the output. 

This corresponds to the BERT model that we used, via `AutoModelForSequenceClassification`, to detect emotions in tweets.

##Transformer Decoder

As illustrated, the main difference between the decoder and
encoder is that the decoder has two attention sublayers:

* Masked multi-head attention
  - Ensures that the tokens we generate at each timestep are only based on
the past outputs and the current token being predicted. Without this, the
decoder could cheat during training by simply copying the target
translations, so masking the inputs ensures the task is not trivial.
* Encoder-decoder attention
  - Performs multi-head attention over the output key and value vectors of
the encoder stack, with the intermediate representation of the decoder
acting as the queries. This way the encoder-decoder attention layer
learns how to relate tokens from two different sequences such as two
different languages.
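
The second sublayer differs from self-attention only in where the queries, keys and values come from. The plain-Python sketch below shows this with made-up states for four source tokens and two target tokens. It leaves out the learned linear projections and the multiple heads that a real layer would apply, so it is an illustration of the data flow rather than a full implementation.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention for plain lists of vectors."""
    dim_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(dim_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_v)])
    return outputs

# Made-up states: 4 source tokens from the encoder, 2 target tokens in the decoder
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
decoder_states = [[0.2, 0.9], [1.0, 0.1]]

# Queries come from the decoder, keys and values from the encoder
cross_out = attention(decoder_states, encoder_states, encoder_states)
print(cross_out)
```

The output has one vector per decoder position, and each of those vectors is a weighted mixture of encoder states, which is how information from the source sequence reaches the decoder.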

<img src='https://github.com/rahiakela/transformers-research-and-practice/blob/main/natural-language-processing-with-transformers/03-transformer-anatomy/images/10.png?raw=1' width='600'/>

Let’s take a look at the modifications we need to include masking in self-attention. The trick with masked self-attention is to
introduce a mask matrix with ones on and below the diagonal and zeros above:

In [None]:
seq_len = inputs.input_ids.size(-1)
mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, seq_len, seq_len)
mask[0]

In [None]:
seq_len = inputs.input_ids.size(-1)
mask1 = torch.triu(torch.ones(seq_len, seq_len)).view(1, seq_len, seq_len)
mask1[0]

Here we’ve used PyTorch’s `tril` function to create the lower triangular
matrix; `triu`, by contrast, gives the upper triangular matrix.

Once we have this mask matrix, we can prevent each attention
head from peeking at future tokens by using
`torch.Tensor.masked_fill` to replace the scores at all positions where the mask is zero with negative
infinity:

In [None]:
scores.masked_fill(mask == 0, float("-inf"))

By setting the upper values to negative infinity, we guarantee that the
attention weights for those positions are all zero once we take the softmax over the scores
because $e^{-\infty} = 0$. 
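
The same effect can be reproduced by hand. The sketch below applies a lower-triangular mask to a made-up 4 × 4 score matrix in plain Python and then takes a softmax over each row; `masked_softmax` is a helper written for this illustration.

```python
import math

NEG_INF = float("-inf")

def masked_softmax(row, mask_row):
    """Softmax over one row of scores, ignoring positions where the mask is 0."""
    masked = [s if m == 1 else NEG_INF for s, m in zip(row, mask_row)]
    top = max(masked)                           # finite, since the diagonal is never masked
    exps = [math.exp(s - top) for s in masked]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

seq_len = 4
scores = [[0.5, 1.0, -0.3, 2.0] for _ in range(seq_len)]  # made-up scores

# Lower-triangular mask: position i may look at positions 0..i only
mask = [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

weights = [masked_softmax(r, m) for r, m in zip(scores, mask)]
for row in weights:
    print([round(w, 3) for w in row])
```

Every row still sums to one, but all of the weight is now distributed over the current and earlier positions, so the first token can only attend to itself.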

We can easily include this masking behavior with a small
change to our scaled dot-product attention function.

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
  dim_k = key.size(-1)
  scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(dim_k)
  if mask is not None:
    scores = scores.masked_fill(mask == 0, float("-inf"))
  weights = F.softmax(scores, dim=-1)
  return torch.bmm(weights, value)

From here it is a simple matter to build up the decoder layer.