##### Copyright 2022 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Neural machine translation with a Transformer and Keras

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/text/tutorials/transformer">
    <img src="https://www.tensorflow.org/images/tf_logo_32px.png" />
    View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/transformer_keras.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
    Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/text/blob/master/docs/tutorials/transformer_keras.ipynb">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />
    View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/text/docs/tutorials/transformer_keras.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

This tutorial demonstrates how to create and train a [sequence-to-sequence](https://developers.google.com/machine-learning/glossary#sequence-to-sequence-task){:.external} [Transformer](https://developers.google.com/machine-learning/glossary#Transformer){:.external} model for a machine translation task with TensorFlow and built-in Keras APIs using a Portuguese-English dataset. This model with an [encoder](https://developers.google.com/machine-learning/glossary#encoder){:.external}-[decoder](https://developers.google.com/machine-learning/glossary#decoder){:.external} architecture is largely based on the original Transformer with [self-attention](https://developers.google.com/machine-learning/glossary#self-attention){:.external} layers proposed in ["Attention is all you need"](https://arxiv.org/abs/1706.03762){:.external} by Vaswani et al. (2017).

Transformers are deep neural networks primarily based on various types of [attention](https://developers.google.com/machine-learning/glossary#attention){:.external} mechanisms, which allow them to attend to different positions of the input sequence to compute a representation of that sequence, and [feed-forward networks](https://www.tensorflow.org/guide/keras/sequential_model).

In this tutorial you will:

- Load the data with [TensorFlow Datasets](https://tensorflow.org/datasets).
- Define tokenization functions.
- Prepare `tf.data` pipelines.
- Implement positional embedding to help learn word ordering.
- Implement the encoder-decoder Transformer:
  - Create a point-wise feedforward network with Keras [Sequential](https://www.tensorflow.org/guide/keras/sequential_model) API and `tf.keras.layers.Dense` layers.
  - Implement encoder and decoder layers by subclassing `tf.keras.layers.Layer`.
  - Define the encoder and decoder blocks, which are made up of `tf.keras.layers.MultiHeadAttention` for self-attention layers, as well as `tf.keras.layers.LayerNormalization` and `tf.keras.layers.Dense`.
  - Put the encoder and decoder blocks together to create the Transformer model.
- Train the Transformer.
- Generate translations.


Transformers excel at modeling sequential data, such as natural language, as shown by results you can find in research papers mentioned in the next section. Attention-based models have also been applied in other tasks, such as [image recognition](https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html){:.external}, [music transcription](https://magenta.tensorflow.org/transcription-with-transformers){:.external}, [code generation](https://www.deepmind.com/blog/competitive-programming-with-alphacode){:.external}, [reinforcement learning](https://ai.googleblog.com/2022/07/training-generalist-agents-with-multi.html){:.external}, and [protein structure prediction](https://www.nature.com/articles/s41586-021-03819-2){:.external}, to name a few.

There is a wide variety of Transformer-based models (as shown, for example, in ["Efficient Transformers: a survey"](https://arxiv.org/abs/2009.06732){:.external} (Tay et al., 2022)), many of which improve upon the 2017 version of the original Transformer, with encoder-decoder, encoder-only and decoder-only architectures.

Most of the Transformer components in this tutorial use the built-in APIs like `tf.keras.layers.MultiHeadAttention`. You should have knowledge of [text generation with recurrent neural networks (RNNs)](https://www.tensorflow.org/text/tutorials/text_generation) and [RNNs with attention](https://www.tensorflow.org/text/tutorials/nmt_with_attention). The latter can be regarded as a predecessor to the original Transformer.

## A brief overview of the model

- The ["Attention is all you need"](https://arxiv.org/abs/1706.03762){:.external} paper's authors demonstrated that a model made of [self-attention](https://developers.google.com/machine-learning/glossary#self-attention){:.external} layers and feed-forward networks can achieve high translation quality, outperforming [recurrent](https://www.tensorflow.org/text/tutorials/text_classification_rnn) and [convolutional](https://www.tensorflow.org/tutorials/images/cnn) neural networks.
- Similar to [RNNs with attention](https://www.tensorflow.org/text/tutorials/nmt_with_attention), this kind of model _transforms_ sequences of input embeddings into sequences of output embeddings. Embeddings are covered later in this tutorial.
- Self-attention performs sequence processing by replacing an element with a weighted average of the rest of that sequence.
- The original Transformer used [encoder](https://developers.google.com/machine-learning/glossary#encoder){:.external} and [decoder](https://developers.google.com/machine-learning/glossary#decoder){:.external} blocks, made up of primarily [multi-head attention](https://developers.google.com/machine-learning/glossary#multi-head-self-attention){:.external} layers. These layers are self-attention layers with N heads. This is covered in more detail later in this tutorial.
- Positional [embedding](https://developers.google.com/machine-learning/glossary#embeddings){:.external} is added to make sure the model can recognize the word order. This helps avoid a [bag-of-words](https://developers.google.com/machine-learning/glossary#bag-of-words){:.external} representation.
- Read the [Google AI blog post](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html){:.external} for more details.

<img src="standard_transformer_architecture" alt="standard-transformer-architecture">

(Figure 1: The standard Transformer architecture from ["Attention is all you need"](https://arxiv.org/abs/1706.03762){:.external} (Vaswani et al., 2017). The image is from Google Research's ["Efficient Transformers: a survey"](https://arxiv.org/abs/2009.06732){:.external} (Tay et al., 2022))


## Why Transformers are significant

- Unlike [RNNs](https://www.tensorflow.org/text/tutorials/text_generation) such as LSTMs, Transformers can be more computationally efficient and parallelizable on specialized hardware, like GPUs and TPUs. One of the main reasons is that Transformers replaced recurrence with attention, so computations can happen simultaneously. Layer outputs can be calculated in parallel, instead of in series as in an RNN.
- In addition, unlike LSTMs, they can capture long-range dependencies between distant positions in the input or output sequences. Thus, longer connections can be learned.
- Transformers make no assumptions about the temporal/spatial relationships across the data. This is ideal for processing a set of objects (for example, [StarCraft units](https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/#block-8){:.external}).
- The original Transformer influenced such language models as [T5](https://arxiv.org/abs/1910.10683){:.external}, [MUM](https://blog.google/products/search/introducing-mum/){:.external}, and [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html){:.external}. Other Transformer-based models that have since been released include [Reformer](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html){:.external}, [LaMDA](https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html){:.external}, and [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html){:.external}.

The downsides of this architecture are:

- For a time-series, the output for a time-step is calculated from the *entire history* instead of only the inputs and current hidden-state. This _may_ be less efficient.   
- If the input *does* have a  temporal/spatial relationship, like text, some positional encoding must be added or the model will effectively see a bag of words. 

After training the model in this notebook, you will be able to input a Portuguese sentence and return the English translation. Below is an example of a "heatmap" of attention scores, which you'll learn about later in this tutorial.

<img src="https://www.tensorflow.org/images/tutorials/transformer/attention_map_portuguese.png" width="800" alt="Attention heatmap">

Figure 2: The attention heatmap you can generate at the end of this tutorial.

## Setup

Begin by installing [TensorFlow Datasets](https://tensorflow.org/datasets) for loading the dataset and [TensorFlow Text](https://www.tensorflow.org/text) for text preprocessing:

In [None]:
!pip install tensorflow_datasets
!pip install -U tensorflow-text

You also need [TensorFlow Probability](https://www.tensorflow.org/probability) to create a look-ahead mask, which is explained later in this tutorial:

In [None]:
!pip install tensorflow-probability

Import the necessary modules:

In [None]:
import logging
import time

import numpy as np
import matplotlib.pyplot as plt

import tensorflow_datasets as tfds
import tensorflow as tf

import tensorflow_text

import tensorflow_probability as tfp

## Download the dataset

Use TensorFlow Datasets to load the [Portuguese-English translation dataset](https://www.tensorflow.org/datasets/catalog/ted_hrlr_translate#ted_hrlr_translatept_to_en) from the TED Talks Open Translation Project. This dataset contains approximately 52,000 training, 1,200 validation and 1,800 test examples.

In [None]:
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en',
                               with_info=True,
                               as_supervised=True)

train_examples, val_examples = examples['train'], examples['validation']

The `tf.data.Dataset` object returned by TensorFlow Datasets yields pairs of text examples:

In [None]:
for pt_examples, en_examples in train_examples.batch(3).take(1):
  print('> Examples in Portuguese:')
  for pt in pt_examples.numpy():
    print(pt.decode('utf-8'))
  print()

  print('> Examples in English:')
  for en in en_examples.numpy():
    print(en.decode('utf-8'))

## Implement tokenization

Now that you have loaded the dataset, you need to tokenize the text, so that each element is represented as a token or token ID (a numeric representation).

Tokenization is the process of breaking up a sequence, such as a text, into [tokens](https://developers.google.com/machine-learning/glossary#token){:.external} or token IDs, for each element in that sequence. Commonly, these tokens are words, characters, numbers, subwords, and/or punctuation.

Tokenization can be done in various ways. For example, for a text sequence of `how are you`, you can apply:

- Word-level tokenization, such as `how`, `are`, `you`.
- Character-level tokenization, such as `h`, `o`, `w`, `a`, and so on. This would result in a much longer sequence length compared with the previous method.
- Subword tokenization, which can take care of common/recurring word parts, such as `ing` and `tion`, as well as common words like `are` and `you`.
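
The three granularities above can be compared with a small, self-contained sketch. It is only an illustration: `wordpiece_tokenize` and `toy_vocab` are hypothetical names invented here, the vocabulary is made up, and the greedy longest-match rule is a simplified stand-in for what a real subword tokenizer does.

```python
# Toy comparison of word-level, character-level and subword tokenization.
# The vocabulary is hypothetical and far smaller than a real one.

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # Mark a word-internal piece.
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # No piece fits: fall back to an unknown token.
        pieces.append(match)
        start = end
    return pieces

text = "how are you"
word_tokens = text.split()
char_tokens = [ch for ch in text if ch != " "]

toy_vocab = {"how", "are", "you", "walk", "##ing", "token", "##ization"}
subword_tokens = [piece
                  for word in "you are walking".split()
                  for piece in wordpiece_tokenize(word, toy_vocab)]

print(word_tokens)       # ['how', 'are', 'you']
print(len(char_tokens))  # 9
print(subword_tokens)    # ['you', 'are', 'walk', '##ing']
```

Note how the character-level sequence is three times longer than the word-level one, while the subword split keeps common words whole and only breaks up `walking`.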

### Subword tokenizer

This tutorial uses a popular [subword tokenizer](https://www.tensorflow.org/text/guide/subwords_tokenizer) implementation, which builds subword tokenizers (`text.BertTokenizer`) optimized for the dataset and exports them in a TensorFlow `saved_model` format.

Download, extract, and import the `saved_model`:

In [None]:
model_name = 'ted_hrlr_translate_pt_en_converter'
tf.keras.utils.get_file(
    f'{model_name}.zip',
    f'https://storage.googleapis.com/download.tensorflow.org/models/{model_name}.zip',
    cache_dir='.', cache_subdir='', extract=True
)

In [None]:
tokenizers = tf.saved_model.load(model_name)

The `tf.saved_model` contains two text tokenizers, one for English and one for Portuguese. Both have the same methods:

In [None]:
[item for item in dir(tokenizers.en) if not item.startswith('_')]

The `tokenize` method converts a batch of strings to a padded-batch of token IDs. This method splits punctuation, lowercases and unicode-normalizes the input before tokenizing. That standardization is not visible here because the input data is already standardized.

In [None]:
print('> This is a batch of strings:')
for en in en_examples.numpy():
  print(en.decode('utf-8'))

In [None]:
encoded = tokenizers.en.tokenize(en_examples)

print('> This is a padded-batch of token IDs:')
for row in encoded.to_list():
  print(row)

The `detokenize` method attempts to convert these token IDs back to human-readable text: 

In [None]:
round_trip = tokenizers.en.detokenize(encoded)

print('> This is human-readable text:')
for line in round_trip.numpy():
  print(line.decode('utf-8'))

The lower-level `lookup` method converts from token IDs to token text:

In [None]:
print('> This is token text:')
tokens = tokenizers.en.lookup(encoded)
tokens

The output demonstrates the "subword" aspect of the subword tokenization.

For example, the word `'searchability'` is decomposed into `'search'` and `'##ability'`, and the word `'serendipity'` into `'s'`, `'##ere'`, `'##nd'`, `'##ip'` and `'##ity'`.

The distribution of tokens per example in the dataset is as follows:

In [None]:
lengths = []

for pt_examples, en_examples in train_examples.batch(1024):
  pt_tokens = tokenizers.pt.tokenize(pt_examples)
  lengths.append(pt_tokens.row_lengths())

  en_tokens = tokenizers.en.tokenize(en_examples)
  lengths.append(en_tokens.row_lengths())
  print('.', end='', flush=True)

In [None]:
all_lengths = np.concatenate(lengths)

plt.hist(all_lengths, np.linspace(0, 500, 101))
plt.ylim(plt.ylim())
max_length = max(all_lengths)
plt.plot([max_length, max_length], plt.ylim())
plt.title(f'Maximum tokens per example: {max_length}');

### Create tokenizer functions

This section shows how to define custom functions for transforming/tokenizing the text in the dataset into tokens. You will need these for building an input pipeline suitable for training.

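The following function drops inputs whose padded length reaches the maximum number of tokens (`MAX_TOKENS`). Because it reads the second axis of already tokenized, padded tensors, it filters whole batches rather than single examples when applied after batching. Without limiting the size of sequences, training performance may be negatively affected.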

In [None]:
MAX_TOKENS = 128

def filter_max_tokens(pt, en):
  num_tokens = tf.maximum(tf.shape(pt)[1],tf.shape(en)[1])
  return num_tokens < MAX_TOKENS

Next, define a function that tokenizes the batches of raw text:

In [None]:
def tokenize_pairs(pt, en):
    pt = tokenizers.pt.tokenize(pt)
    # Convert from ragged to dense, padding with zeros.
    pt = pt.to_tensor()

    en = tokenizers.en.tokenize(en)
    # Convert from ragged to dense, padding with zeros.
    en = en.to_tensor()
    return pt, en
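
To make the ragged-to-dense step concrete, here is a minimal NumPy sketch of zero-padding, plus the batch-level length check. The helper `pad_to_dense` and the token IDs are made up for illustration; they are not part of the tokenizer's API.

```python
import numpy as np

def pad_to_dense(rows, pad_id=0):
    """Right-pad variable-length rows with pad_id to a rectangular array."""
    max_len = max(len(row) for row in rows)
    dense = np.full((len(rows), max_len), pad_id, dtype=np.int64)
    for i, row in enumerate(rows):
        dense[i, :len(row)] = row
    return dense

# Made-up token IDs for a batch of three sentences of different lengths.
ragged = [[2, 71, 5, 3], [2, 9, 3], [2, 14, 25, 8, 40, 3]]
dense = pad_to_dense(ragged)

print(dense.shape)  # (3, 6)
print(dense[1])     # [2 9 3 0 0 0]

# The length check looks at the padded width of the whole batch,
# so one long sentence determines whether the batch is kept.
MAX_TOKENS = 128
keep_batch = dense.shape[1] < MAX_TOKENS
print(keep_batch)   # True
```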

## Set up a data pipeline with `tf.data`

In this step, you set up a `tf.data` pipeline. Make sure to use buffered prefetching, so you can yield data from disk without having I/O become blocking. These are two important methods you should use when loading data:

- `Dataset.cache` keeps the data in memory after it's loaded off disk during the first epoch. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache.
- `Dataset.prefetch` overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the *Prefetching* section of the [Better performance with the `tf.data` API](https://www.tensorflow.org/guide/data_performance.ipynb) guide.

The `tf.data` input pipeline that processes, shuffles and batches the data looks as follows:

In [None]:
BUFFER_SIZE = 20000
BATCH_SIZE = 64

In [None]:
def make_batches(ds):
  return (
      ds
      .cache()
      .shuffle(BUFFER_SIZE)
      .batch(BATCH_SIZE)
      .map(tokenize_pairs, num_parallel_calls=tf.data.AUTOTUNE)
      .filter(filter_max_tokens)
      .prefetch(buffer_size=tf.data.AUTOTUNE))

# Training and validation set batches.
train_batches = make_batches(train_examples)
val_batches = make_batches(val_examples)

## Embedding

Having turned the sequences of text into sequences of tokens, you will create input and output embeddings. Embeddings represent tokens in a d-dimensional space where tokens with similar meaning will be closer to each other.

Converting tokens into embeddings is done with the built-in `tf.keras.layers.Embedding` layer—this is shown in the encoder/decoder sections of this tutorial.

Next, the input embeddings need to be summed with the positional encoding, which is covered in the next section.

## Positional encoding

In the original Transformer research [paper](https://arxiv.org/abs/1706.03762){:.external}, _positional encoding (or embedding)_ was added to the embeddings to give the model some information about the relative position of the tokens in the sentence. Now the model can _learn_ to recognize the word order.

This section shows how to implement positional encoding.

Attention layers see their input as a set of vectors, with no sequential order. As discussed earlier in the tutorial, the model doesn't contain any recurrent or convolutional layers.

If the word order is not learned, you may end up with a [bag of words](https://developers.google.com/machine-learning/glossary#bag-of-words){:.external}, where, for instance, `how are you`, `how you are`, `you how are`, and so on, are represented identically.

The embeddings on their own do not encode the relative position of tokens in a sentence. Therefore, after adding the positional encoding, tokens are closer to each other based on the similarity of their meaning and their position in the sentence, in the d-dimensional space.

The formula for calculating the positional encoding (implemented in Python below) is as follows:

$$\Large{PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})} $$
$$\Large{PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})} $$

The following functions closely mirror the positional encoding method described in the original Transformer paper.

In [None]:
def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

In [None]:
def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # Apply the sine function to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # Apply the cosine function to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)

Test the positional encoding function:

In [None]:
# Set the number of positions (n) and the encoding depth (d, that is, d_model).
n, d = 2048, 512

pos_encoding = positional_encoding(position=n, d_model=d)

# Check the shape.
print(pos_encoding.shape)

pos_encoding = pos_encoding[0]

# Juggle the dimensions for the plot.
pos_encoding = tf.reshape(pos_encoding, (n, d//2, 2))
pos_encoding = tf.transpose(pos_encoding, (2, 1, 0))
pos_encoding = tf.reshape(pos_encoding, (d, n))

# Plot the dimensions.
plt.pcolormesh(pos_encoding, cmap='RdBu')
plt.ylabel('Depth')
plt.xlabel('Position')
plt.colorbar()
plt.show()
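
As a small worked check of the formula, the following NumPy-only sketch evaluates the encoding for a tiny `d_model` of 4, where the values can be verified by hand. The function name `positional_encoding_np` is an assumption made for this example; it is a compact variant, not the tutorial's own function.

```python
import numpy as np

def positional_encoding_np(num_positions, d_model):
    """Sinusoidal encoding: sine on even depth indices, cosine on odd ones."""
    pos = np.arange(num_positions)[:, np.newaxis]  # (num_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]          # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

pe = positional_encoding_np(num_positions=3, d_model=4)

# Position 0 has angle 0 everywhere: sin(0) = 0 and cos(0) = 1.
print(pe[0])  # [0. 1. 0. 1.]

# Position 1: the first pair uses divisor 10000^(0/4) = 1,
# the second pair uses divisor 10000^(2/4) = 100.
expected_row_1 = [np.sin(1.0), np.cos(1.0), np.sin(0.01), np.cos(0.01)]
print(np.allclose(pe[1], expected_row_1))  # True
```

Deeper dimensions use larger divisors, so they oscillate more slowly with position, which is the banded pattern visible in the plot.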

## Masking

In this step, create a function that masks all the pad tokens in the batch of sequences. This ensures that the model does not treat padding as the input. The mask indicates where pad value `0` is present: it outputs a `1` at those locations, and a `0` otherwise.

In [None]:
def create_padding_mask(seq):
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

  # Add extra dimensions to add the padding to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

In [None]:
x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)
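
The two extra axes exist so the mask broadcasts over attention heads and query positions. The NumPy sketch below reproduces the mask for the same input and broadcasts it against a hypothetical logits array. The logits shape, and adding a large negative number to masked positions, are illustrative assumptions about one common way to use such a mask.

```python
import numpy as np

x = np.array([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])

# 1.0 where the token is padding (ID 0), 0.0 elsewhere.
pad_mask = (x == 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]
print(pad_mask.shape)     # (3, 1, 1, 5)
print(pad_mask[0, 0, 0])  # [0. 0. 1. 1. 0.]

# Hypothetical attention logits: (batch, heads, query_len, key_len).
logits = np.zeros((3, 2, 5, 5), dtype=np.float32)

# Broadcasting applies the same key-side mask to every head and query row.
masked_logits = logits + pad_mask * -1e9
print(masked_logits.shape)  # (3, 2, 5, 5)

# Inverting gives the opposite convention: 1 = attend, 0 = ignore.
keep_mask = 1.0 - pad_mask
print(keep_mask[0, 0, 0])   # [1. 1. 0. 0. 1.]
```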

The look-ahead mask is used to mask the future tokens in a sequence. In other words, the mask indicates which entries should not be used. This mask prevents the model from peeking at the expected output.

This means that to predict the third token, only the first and second token are used. Similarly, to predict the fourth token, only the first, second and the third tokens will be used and so on.

The `tf.keras.layers.MultiHeadAttention` layer considers the inverted mask, where `1` is a token to be attended to, and `0` should be ignored:

In [None]:
def create_look_ahead_mask(size):
    n = int(size*(size+1)/2)
    mask = tfp.math.fill_triangular(tf.ones((n,), dtype=tf.float32), upper=False)
    return mask

Test the look-ahead mask function:

In [None]:
temp_input = tf.random.uniform((1, 3))

test_look_ahead_mask = create_look_ahead_mask(temp_input.shape[1])
test_look_ahead_mask
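
For intuition, the same lower-triangular pattern can be built with plain NumPy, and its effect on attention weights can be checked directly. This is a sketch under stated assumptions: `look_ahead_mask_np` and `masked_softmax` are hypothetical helpers written for this example, using the convention that `1` means "attend" and `0` means "ignore".

```python
import numpy as np

def look_ahead_mask_np(size):
    """Lower-triangular ones: position t may attend to positions <= t."""
    return np.tril(np.ones((size, size), dtype=np.float32))

def masked_softmax(logits, mask):
    """Softmax over the last axis, giving zero weight where mask == 0."""
    logits = np.where(mask == 1, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = look_ahead_mask_np(3)
print(mask.tolist())  # [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

# With equal logits, each row spreads its weight evenly over the
# positions it is allowed to see and gives none to future positions.
weights = masked_softmax(np.zeros((3, 3)), mask)
print(weights[0])  # [1. 0. 0.]
print(weights[1])  # [0.5 0.5 0. ]
```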

## Create a point-wise feedforward network

A point-wise feedforward network consists of two fully-connected/linear layers (`tf.keras.layers.Dense`) with a ReLU activation in-between:

In [None]:
def point_wise_feed_forward_network(d_model, # Input/output dimensionality.
                                    dff # Inner-layer dimensionality.
                                    ):

  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])

In [None]:
sample_ffn = point_wise_feed_forward_network(512, 2048)

# Print the shape.
print(sample_ffn(tf.random.uniform((64, 50, 512))).shape)
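
"Point-wise" means the same two layers are applied to every position independently. The NumPy sketch below illustrates this with random weights standing in for trained ones. All names and sizes are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff, seq_len = 8, 32, 5

# Random weights stand in for the two trained Dense layers.
w1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
w2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(x @ w1 + b1, 0.0)  # Dense(dff) followed by ReLU.
    return hidden @ w2 + b2                # Dense(d_model).

x = rng.normal(size=(2, seq_len, d_model))

# Apply the network to the whole sequence at once...
full = ffn(x)
# ...and to each position separately, then stack the results.
per_position = np.stack([ffn(x[:, t, :]) for t in range(seq_len)], axis=1)

print(full.shape)                       # (2, 5, 8)
print(np.allclose(full, per_position))  # True
```

Because no position looks at any other, mixing information across the sequence is left entirely to the attention layers.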

## Build the Transformer

The Transformer model in this tutorial follows the same general pattern as a standard [sequence-to-sequence](https://www.tensorflow.org/text/tutorials/nmt_with_attention) model.

- The model is made of an encoder block and a decoder block. Each encoder/decoder block consists of N encoder/decoder layers, containing multi-head attention and point-wise feedforward networks.
  - You will create these Transformer building blocks and layers in this section.
- The input embeddings, summed with positional encoding, are passed through the encoder block (N encoder layers) that generates an output for each token in the sequence.
  - You will create embeddings with `tf.keras.layers.Embedding` and positional encoding with the `positional_encoding()` function inside the encoder/decoder blocks later in this section.
- The decoder (N decoder layers) attends to the encoder's output and its own input (self-attention) to predict the next word.

<img src="https://www.tensorflow.org/images/tutorials/transformer/transformer.png" width="600" alt="transformer">

Before creating the encoder/decoder blocks, start with defining the encoder and decoder layers. 

### Define the encoder layer

An encoder block consists of N encoder layers.

Each encoder layer consists of sublayers:

- A multi-head attention layer (with padding mask), implemented with `tf.keras.layers.MultiHeadAttention`.
- A point-wise feedforward network with `tf.keras.layers.Dense`.
- Each of these sublayers has a residual connection around it, followed by layer normalization (`tf.keras.layers.LayerNormalization`). Residual connections help in avoiding the vanishing gradient problem in deep networks.

The output of each sublayer is `LayerNorm(x + Sublayer(x))`. The normalization is done on the `d_model` (last) axis (the dimensionality of the input/output). There are N encoder layers in a Transformer.
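
The `LayerNorm(x + Sublayer(x))` pattern can be sketched in NumPy as follows. This simplified version omits the learned scale and offset that `tf.keras.layers.LayerNormalization` applies by default, and the `sublayer` used here is a placeholder function rather than a real attention or feedforward layer.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize over the last (d_model) axis; no learned scale/offset."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def residual_block(x, sublayer):
    """Add the sublayer output back to its input, then normalize."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 8))  # (batch_size, seq_len, d_model)

# Placeholder sublayer, standing in for attention or the feedforward network.
out = residual_block(x, lambda t: 0.5 * t)

print(out.shape)                                       # (2, 4, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(out.var(axis=-1), 1.0, atol=1e-4))   # True
```

Each token's vector is normalized on its own, so the statistics do not depend on the batch size or the sequence length.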

Note: Each multi-head attention block gets three inputs; Q (query), K (key), V (value). These are put through linear (`tf.keras.layers.Dense`) layers before the multi-head attention function (`tf.keras.layers.MultiHeadAttention`). Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information from different representation subspaces at different positions. The equation used to calculate the self-attention weights is as follows: $$\Large{Attention(Q, K, V) = softmax_k\left(\frac{QK^T}{\sqrt{d_k}}\right) V} $$

<img src="https://www.tensorflow.org/images/tutorials/transformer/multi_head_attention.png" width="500" alt="multi-head attention">
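
The attention equation can be written out in a few lines of NumPy. This is a minimal single-head sketch for self-attention, where Q, K and V are all the same input. It leaves out the learned linear projections, the split into heads and any masking, and the function names are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # Subtract the max for stability.
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)  # (..., q_len, k_len)
    weights = softmax(logits, axis=-1)                  # Normalize over keys.
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 8))  # (batch_size, seq_len, depth)

# Self-attention: query, key and value all come from the same sequence.
output, weights = scaled_dot_product_attention(x, x, x)

print(output.shape)   # (1, 4, 8)
print(weights.shape)  # (1, 4, 4)
print(np.allclose(weights.sum(axis=-1), 1.0))  # True
```

Each output row is a weighted average of the value vectors, with the weights in each row summing to 1.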

Define the encoder layer by subclassing `tf.keras.layers.Layer`:

In [None]:
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self,*,
               d_model, # Input/output dimensionality.
               num_heads, # Number of attention heads.
               dff, # Inner-layer dimensionality.
               rate=0.1 # Dropout rate.
               ):
    super(EncoderLayer, self).__init__()

    # Keras multi-head attention.
    self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
    # Point-wise feed-forward network.
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    # Layer normalization.
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    # Dropout.
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):
    
    attn_output = self.mha(
        query=x, value=x, key=x, attention_mask=mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

    return out2

Test the encoder layer:

In [None]:
sample_encoder_layer = EncoderLayer(d_model=512, num_heads=8, dff=2048)

sample_encoder_layer_output = sample_encoder_layer(
    tf.random.uniform((64, 43, 512)), training=False, mask=None)

# Print the shape.
print(sample_encoder_layer_output.shape)  # (batch_size, input_seq_len, d_model)

### Define the decoder layer

A decoder block also consists of N decoder layers. Each decoder layer consists of sublayers:

- A masked multi-head attention layer (with a look-ahead mask and a padding mask), implemented with `tf.keras.layers.MultiHeadAttention`.
- A multi-head attention layer (with a padding mask) (also implemented with `tf.keras.layers.MultiHeadAttention`). V (value) and K (key) receive the encoder output as inputs. Q (query) receives the output from the masked multi-head attention sublayer.
- A point-wise feedforward network with `tf.keras.layers.Dense`.
- Each of these sublayers has a residual connection around it, followed by layer normalization (`tf.keras.layers.LayerNormalization`).

The output of each sublayer is `LayerNorm(x + Sublayer(x))`. The normalization is done on the `d_model` (last) axis.

Note: As query (Q) receives the output from the decoder's first attention block, and key (K) receives the encoder output, the attention weights represent the importance given to the decoder's input based on the encoder's output. In other words, the decoder predicts the next token by looking at the encoder output and self-attending to its own output.
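
The shapes involved in this second (cross-attention) sublayer can be illustrated with a small NumPy sketch. It is single-head and unmasked, uses random arrays in place of real layer outputs, and all names and sizes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch_size, input_seq_len, target_seq_len, d_model = 2, 7, 5, 8

# Stand-ins for the encoder output and the first decoder sublayer's output.
enc_output = rng.normal(size=(batch_size, input_seq_len, d_model))
dec_hidden = rng.normal(size=(batch_size, target_seq_len, d_model))

# Query comes from the decoder; key and value come from the encoder.
logits = dec_hidden @ np.swapaxes(enc_output, -1, -2) / np.sqrt(d_model)
cross_weights = softmax(logits, axis=-1)
cross_output = cross_weights @ enc_output

print(cross_weights.shape)  # (2, 5, 7)
print(cross_output.shape)   # (2, 5, 8)
```

Each target position gets one row of weights over the input positions, which is the kind of matrix an attention heatmap visualizes.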

Define the decoder layer by subclassing `tf.keras.layers.Layer`:

In [None]:
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self,
               *,
               d_model, # Input/output dimensionality.
               num_heads, # Number of attention heads.
               dff, # Inner-layer dimensionality.
               rate=0.1 # Dropout rate.
               ):
    super(DecoderLayer, self).__init__()

    # Keras multi-head attention.
    self.mha1 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
    self.mha2 = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)

    # Point-wise feed-forward network.
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    # Layer normalization.
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    # Dropout.
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # Encoder output shape is (batch_size, input_seq_len, d_model)

    attn1, attention_weights1 = self.mha1(
        query=x, value=x, key=x, attention_mask=look_ahead_mask,
        return_attention_scores=True)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

    attn2, attention_weights2 = self.mha2(
        query=out1, key=enc_output, value=enc_output, attention_mask=padding_mask, return_attention_scores=True)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attention_weights1, attention_weights2

Test the decoder layer:

In [None]:
sample_decoder_layer = DecoderLayer(d_model=512, num_heads=8, dff=2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)),
    sample_encoder_layer_output,
    False, None, None)

# Print the shape.
print(sample_decoder_layer_output.shape)  # (batch_size, target_seq_len, d_model)

Having defined the encoder and decoder layers, you can now create the Transformer encoder and decoder blocks, and then build the Transformer model.

### Create the encoder block

The Transformer encoder block consists of:

- Input embeddings (with `tf.keras.layers.Embedding`)
- Positional encoding (with `positional_encoding()`)
- N encoder layers (with `EncoderLayer()`)

As mentioned before, the input (Portuguese) is turned into embeddings, which are summed with the positional encoding. The output of this summation is the input to the encoder layers. The output of the encoder is the input to the decoder.
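
The "embed, scale, add positions" step can be sketched in NumPy as follows. The embedding table holds random numbers in place of learned weights, the token IDs are made up, and `positional_encoding_np` is a compact helper written for this example rather than the tutorial's own function.

```python
import numpy as np

def positional_encoding_np(num_positions, d_model):
    pos = np.arange(num_positions)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

rng = np.random.default_rng(0)
vocab_size, d_model, max_tokens = 50, 8, 16

# Random table standing in for learned embedding weights.
embedding_table = rng.normal(size=(vocab_size, d_model))
pos_encoding = positional_encoding_np(max_tokens, d_model)

token_ids = np.array([[5, 12, 7, 0], [3, 9, 0, 0]])  # Made-up, zero-padded IDs.
seq_len = token_ids.shape[1]

x = embedding_table[token_ids]                 # Lookup: (2, 4, 8).
x = x * np.sqrt(d_model)                       # Scale the embeddings.
x = x + pos_encoding[np.newaxis, :seq_len, :]  # Add position information.

print(x.shape)  # (2, 4, 8)

# The same token ID at two positions now has two different vectors.
print(np.allclose(x[1, 2], x[1, 3]))  # False
```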

Define the encoder block by extending `tf.keras.layers.Layer`:


In [None]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self,
               *,
               num_layers, # Number of encoder layers.
               d_model, # Input/output dimensionality.
               num_heads, # Number of attention heads.
               dff, # Inner-layer dimensionality.
               input_vocab_size,
               rate=0.1 # Dropout rate.
               ):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(MAX_TOKENS, self.d_model)

    self.enc_layers = [
        EncoderLayer(d_model=d_model, num_heads=num_heads, dff=dff, rate=rate)
        for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # Sum up embeddings and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

Test the encoder block:

In [None]:
# Instantiate the encoder.
sample_encoder = Encoder(num_layers=2,
                         d_model=512,
                         num_heads=8,
                         dff=2048,
                         input_vocab_size=8500)

# Set the test input.
temp_input = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)

sample_encoder_output = sample_encoder(temp_input,
                                       training=False,
                                       mask=None)

# Print the shape.
print(sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)

### Create the decoder block

The Transformer decoder block consists of:

- Output embeddings (with `tf.keras.layers.Embedding`)
- Positional encoding (with `positional_encoding()`)
- N decoder layers (with `DecoderLayer`)

The target is turned into embeddings, which are summed with the positional encoding. The output of this summation is the input to the decoder layers. The output of the decoder block is the input to the final linear layer, where the prediction is made.

Define the decoder block by extending `tf.keras.layers.Layer`:

In [None]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self,
               *,
               num_layers, # Number of decoder layers.
               d_model, # Input/output dimensionality.
               num_heads, # Number of attention heads.
               dff, # Inner-layer dimensionality.
               target_vocab_size,
               rate=0.1 # Dropout rate.
               ):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(MAX_TOKENS, d_model)

    self.dec_layers = [
        DecoderLayer(d_model=d_model, num_heads=num_heads, dff=dff, rate=rate)
        for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):

    seq_len = tf.shape(x)[1]
    attention_weights = {}

    # Sum up embeddings and position encoding.
    x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x, block1, block2  = self.dec_layers[i](x, enc_output, training,
                                             look_ahead_mask, padding_mask)

      attention_weights[f'decoder_layer{i+1}_block1'] = block1
      attention_weights[f'decoder_layer{i+1}_block2'] = block2

    # The shape of x is (batch_size, target_seq_len, d_model).
    return x, attention_weights

Test the decoder block:

In [None]:
# Instantiate the decoder.
sample_decoder = Decoder(num_layers=2,
                         d_model=512,
                         num_heads=8,
                         dff=2048,
                         target_vocab_size=8000)

# Set the test input.
temp_input = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)

output, attn = sample_decoder(temp_input,
                              enc_output=sample_encoder_output,
                              training=False,
                              look_ahead_mask=None,
                              padding_mask=None)

# Print the shapes.
output.shape, attn['decoder_layer2_block2'].shape
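
To make the structure of the returned attention dictionary easier to see, here is a small plain-Python sketch of its keys and the shapes you would expect for the test above. It assumes that the Keras `MultiHeadAttention` layer returns attention scores shaped `(batch_size, num_heads, query_len, key_len)`; `attention_weight_shapes` is a hypothetical helper, not part of the tutorial code.

```python
def attention_weight_shapes(num_layers, batch, num_heads, tar_len, inp_len):
    """Expected shapes of the entries in the decoder's attention dictionary."""
    shapes = {}
    for i in range(num_layers):
        # block1: masked self-attention over the target sequence.
        shapes[f'decoder_layer{i+1}_block1'] = (batch, num_heads, tar_len, tar_len)
        # block2: attention from target positions to encoder output positions.
        shapes[f'decoder_layer{i+1}_block2'] = (batch, num_heads, tar_len, inp_len)
    return shapes

# Values taken from the sample encoder and decoder tests above.
shapes = attention_weight_shapes(num_layers=2, batch=64, num_heads=8,
                                 tar_len=26, inp_len=62)
for name, shape in shapes.items():
    print(name, shape)
```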

Having created the Transformer encoder and decoder blocks, it's time to build the Transformer model and train it.

## Set up the Transformer architecture

You now have the `Encoder` and `Decoder` blocks. To complete the Transformer model, you need to put the blocks together and add a final linear (`Dense`) layer. The output of the decoder is the input to the final linear layer.

Create the `Transformer` by extending `tf.keras.Model`:


In [None]:
class Transformer(tf.keras.Model):
  def __init__(self,
               *,
               num_layers, # Number of encoder and decoder layers.
               d_model, # Input/output dimensionality.
               num_heads, # Number of attention heads.
               dff, # Inner-layer dimensionality.
               input_vocab_size,
               target_vocab_size,
               rate=0.1 # Dropout rate.
               ):
    super().__init__()
    # Encoder block.
    self.encoder = Encoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           input_vocab_size=input_vocab_size, rate=rate)

    # Decoder block.
    self.decoder = Decoder(num_layers=num_layers, d_model=d_model,
                           num_heads=num_heads, dff=dff,
                           target_vocab_size=target_vocab_size, rate=rate)

    # Final linear layer.
    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs, training):
    # Keras models prefer if you pass all your inputs in the first argument.
    # Portuguese is used as the input (`inp`) language.
    # English is the target (`tar`) language.
    inp, tar = inputs

    padding_mask, look_ahead_mask = self.create_masks(inp, tar)

    # Encoder output.
    enc_output = self.encoder(inp, training, padding_mask)  # (batch_size, inp_seq_len, d_model)

    # The decoder output shape is (batch_size, tar_seq_len, d_model).
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    # Return the final output and attention weights.
    return final_output, attention_weights

  def create_masks(self, inp, tar):
    # Encoder padding mask, which is also used in the 2nd attention block
    # in the decoder.
    padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder
    # to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)

    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return padding_mask, look_ahead_mask

## Training

It's time to prepare the model and start training it.

### Set hyperparameters

To keep this example small and relatively fast, the values for the stack of identical encoder/decoder layers (`num_layers`), the dimensionality of the input/output (`d_model`), and the dimensionality of the inner-layer (`dff`) have been reduced.

The base model described in the original Transformer paper used: `num_layers=6`, `d_model=512`, `dff=2048`.

The number of self-attention heads remains the same (`num_heads=8`).


In [None]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1

### Define the optimizer with a custom learning rate scheduler

Use the Adam optimizer with a custom learning rate scheduler according to the formula in the original Transformer [paper](https://arxiv.org/abs/1706.03762){:.external}:

$$\Large{lrate = d_{model}^{-0.5} * \min(step{\_}num^{-0.5}, step{\_}num \cdot warmup{\_}steps^{-1.5})}$$

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = tf.cast(d_model, tf.float32)

    self.warmup_steps = warmup_steps

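  def __call__(self, step):
    # The optimizer may pass an integer step; cast it before taking the rsqrt.
    step = tf.cast(step, dtype=tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)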

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

Instantiate the optimizer:

In [None]:
learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

Test the custom learning rate scheduler:

In [None]:
temp_learning_rate_schedule = CustomSchedule(d_model)

plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
plt.ylabel('Learning Rate')
plt.xlabel('Train Step')
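
As a complement to the plot, the formula can also be evaluated by hand. The following plain-Python sketch (the function name `transformer_lr` is hypothetical) uses the same defaults as this tutorial, `d_model=128` and `warmup_steps=4000`. At `step == warmup_steps` both arguments of `min` are equal, so the learning rate peaks there at `(d_model * warmup_steps) ** -0.5`, which is roughly `0.0014` for these values.

```python
def transformer_lr(step, d_model=128, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

    `step` must be at least 1, because step ** -0.5 is undefined at 0.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Linear warm-up until step 4000, then decay proportional to 1/sqrt(step).
for step in (1, 1000, 4000, 16000, 40000):
    print(f'step {step:>6}: {transformer_lr(step):.6f}')
```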

### Define the loss function and metrics

Since the target sequences are padded, it is important to apply a padding mask when calculating the loss. Use the cross-entropy loss function (`tf.keras.losses.SparseCategoricalCrossentropy`) with `reduction='none'`, so that the per-token losses can be masked before they are averaged:

In [None]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

In [None]:
def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


def accuracy_function(real, pred):
  accuracies = tf.equal(real, tf.argmax(pred, axis=2))

  mask = tf.math.logical_not(tf.math.equal(real, 0))
  accuracies = tf.math.logical_and(mask, accuracies)

  accuracies = tf.cast(accuracies, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)
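
To see what the masking does, here is a minimal plain-Python sketch of the same idea for a single sequence. It assumes, like `loss_function` above, that the padding token has id `0`. The helpers `token_cross_entropy` and `masked_loss_and_accuracy` are hypothetical and the logits are made up; the tutorial's functions perform the equivalent computation over a whole batch of tensors.

```python
import math

PAD_ID = 0  # Assumption: padding uses token id 0, as in loss_function above.

def token_cross_entropy(logits, label):
    """Cross-entropy of one token, computed from unnormalized logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def masked_loss_and_accuracy(real, pred):
    """Average loss and accuracy over the non-padding positions only."""
    kept = [(label, logits) for label, logits in zip(real, pred)
            if label != PAD_ID]
    loss = sum(token_cross_entropy(logits, label) for label, logits in kept)
    hits = sum(1 for label, logits in kept
               if logits.index(max(logits)) == label)
    return loss / len(kept), hits / len(kept)

real = [2, 1, 0, 0]  # Two real tokens followed by two padding positions.
pred = [
    [0.1, 0.2, 3.0],  # argmax is 2: correct
    [0.5, 0.1, 2.0],  # argmax is 2: wrong, the label is 1
    [9.0, 0.0, 0.0],  # padding position, ignored
    [0.0, 9.0, 0.0],  # padding position, ignored
]
loss, accuracy = masked_loss_and_accuracy(real, pred)
print(f'loss={loss:.4f} accuracy={accuracy:.2f}')
```

Changing the logits at the two padding positions leaves both numbers unchanged, which is the point of the mask.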

Set up the metrics:

In [None]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

Instantiate the `Transformer` model:

In [None]:
transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    rate=dropout_rate)

### Checkpointing

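Create the checkpoint path and the checkpoint manager. The manager is used to save a checkpoint every `n` epochs (every 5 epochs in the training loop below) and keeps only the most recent ones (`max_to_keep=5`).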

In [None]:
checkpoint_path = './checkpoints/train'

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# If a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print('Latest checkpoint restored!!')

With the Portuguese-English dataset, Portuguese is used as the input (`inp`) language and English is the target (`tar`) language.

The target is divided into target input (`tar_inp`) and real target (`tar_real`).

- `tar_inp` is passed as an input to the decoder.
- `tar_real` is that same input shifted by `1`: at each location in `tar_inp`, `tar_real` contains the next token that should be predicted.

For example, `sentence = 'SOS A lion in the jungle is sleeping EOS'` becomes:

* `tar_inp =  'SOS A lion in the jungle is sleeping'`
* `tar_real = 'A lion in the jungle is sleeping EOS'`
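
The shift can be reproduced with ordinary Python list slicing. This sketch treats each word as one token for readability; the real pipeline works on integer token ids, and `train_step` applies the same slices along the sequence axis of a batch.

```python
sentence = ['SOS', 'A', 'lion', 'in', 'the', 'jungle', 'is', 'sleeping', 'EOS']

tar_inp = sentence[:-1]   # Everything except the last token.
tar_real = sentence[1:]   # Everything except the first token.

# At position t the decoder has seen tar_inp[:t + 1] and should predict
# tar_real[t].
for t, (seen, expected) in enumerate(zip(tar_inp, tar_real)):
    print(f'{t}: input {seen!r:>10} -> predict {expected!r}')
```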

A Transformer is an auto-regressive model: it makes predictions one part at a time, and uses its output so far to decide what to do next.

During training this example uses teacher-forcing (like in the [text generation with RNNs](https://www.tensorflow.org/text/tutorials/text_generation) tutorial). Teacher forcing is passing the true output to the next time step regardless of what the model predicts at the current time step.

As the model predicts each token, the self-attention mechanism allows it to look at the previous tokens in the input sequence to better predict the next token. As mentioned before, to prevent the model from peeking at the expected output the model uses a look-ahead mask.
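
The effect of a look-ahead mask on the attention weights can be sketched in plain Python. The convention here (`True` means "may attend") is an assumption chosen for readability and may differ from the `create_look_ahead_mask` function defined earlier; `causal_mask` and `masked_softmax` are hypothetical helpers, and the scores are made up.

```python
import math

def causal_mask(size):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(size)] for q in range(size)]

def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions; masked positions get weight 0."""
    allowed = [s for s, keep in zip(scores, mask_row) if keep]
    m = max(allowed)  # Subtract the max for numerical stability.
    exps = [math.exp(s - m) if keep else 0.0
            for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
scores = [1.0, 2.0, 3.0, 4.0]  # Made-up attention scores for one query row.
for q, row in enumerate(mask):
    weights = masked_softmax(scores, row)
    print(q, [round(w, 3) for w in weights])
```

Each row still sums to 1, but position `q` puts no weight on any position after it, so the prediction for a token cannot depend on tokens that come later in the target.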


### Train the Transformer

Define the training step:

In [None]:
train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.


#@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]

  with tf.GradientTape() as tape:
    predictions, _ = transformer([inp, tar_inp],
                                 training = True)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

  train_loss(loss)
  train_accuracy(accuracy_function(tar_real, predictions))

You can now train the Transformer.

Note: This example model is trained for a few epochs (20) to keep training time reasonable for this tutorial.

In [None]:
EPOCHS = 20

In [None]:
for epoch in range(EPOCHS):
  start = time.time()

  train_loss.reset_state()
  train_accuracy.reset_state()

  # inp -> portuguese, tar -> english
  for (batch, (inp, tar)) in enumerate(train_batches):
    train_step(inp, tar)

    if batch % 50 == 0:
      print(f'Epoch {epoch + 1} Batch {batch} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  if (epoch + 1) % 5 == 0:
    ckpt_save_path = ckpt_manager.save()
    print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}')

  print(f'Epoch {epoch + 1} Loss {train_loss.result():.4f} Accuracy {train_accuracy.result():.4f}')

  print(f'Time taken for 1 epoch: {time.time() - start:.2f} secs\n')
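
To check which epochs produce a checkpoint with these settings, you can replay the save condition in plain Python. The `deque` below is only a stand-in for the retention behavior of `tf.train.CheckpointManager`, under the assumption that it keeps the `max_to_keep` most recent checkpoints; the constant names are hypothetical.

```python
from collections import deque

EPOCHS = 20
SAVE_EVERY = 5    # Matches `(epoch + 1) % 5 == 0` in the loop above.
MAX_TO_KEEP = 5   # Matches `max_to_keep=5` in the CheckpointManager.

kept = deque(maxlen=MAX_TO_KEEP)  # The oldest entry is dropped once full.
for epoch in range(EPOCHS):
    if (epoch + 1) % SAVE_EVERY == 0:
        kept.append(epoch + 1)

print(list(kept))  # [5, 10, 15, 20]
```

With 20 epochs only four checkpoints are written, so none of them is discarded.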

## Run inference

You can now test the model by performing a translation. The following steps are used for inference:

* Encode the input sentence using the Portuguese tokenizer (`tokenizers.pt`). This is the encoder input.
* The decoder input is initialized to the `[START]` token.
* Calculate the padding masks and the look ahead masks.
* The `decoder` then outputs the predictions by looking at the `encoder output` and its own output (self-attention).
* Concatenate the predicted token to the decoder input and pass it to the decoder.
* In this approach, the decoder predicts the next token based on the previous tokens it predicted.
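
The loop described by these steps is greedy decoding. A minimal sketch in plain Python follows, where `NEXT_TOKEN_SCORES` is a hypothetical stand-in for the trained model: it scores the next token from the last token only, whereas the real Transformer conditions on the whole decoder input and on the encoder output.

```python
START, END = '[START]', '[END]'

# Hypothetical toy "model": scores for the next token given the last token.
NEXT_TOKEN_SCORES = {
    '[START]': {'this': 0.9, 'is': 0.1},
    'this': {'is': 0.8, 'book': 0.2},
    'is': {'a': 0.6, 'book': 0.4},
    'a': {'book': 0.7, '[END]': 0.3},
    'book': {'[END]': 0.9, 'a': 0.1},
}

def greedy_decode(max_length=10):
    output = [START]  # The decoder input starts with the start token.
    for _ in range(max_length):
        scores = NEXT_TOKEN_SCORES[output[-1]]
        next_token = max(scores, key=scores.get)  # Pick the best-scoring token.
        output.append(next_token)  # Feed the prediction back as input.
        if next_token == END:
            break
    return output

print(greedy_decode())
# ['[START]', 'this', 'is', 'a', 'book', '[END]']
```

The loop stops either when the end token is produced or when `max_length` tokens have been generated, whichever comes first.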

Note: The model is optimized for _efficient training_ and makes a next-token prediction for each token in the output simultaneously. This is redundant during inference, and only the last prediction is used.  This model can be made more efficient for inference if you only calculate the last prediction when running in inference mode (`training=False`).

Define the `Translator` class by subclassing `tf.Module`:

In [None]:
class Translator(tf.Module):
  def __init__(self, tokenizers, transformer):
    self.tokenizers = tokenizers
    self.transformer = transformer

  def __call__(self, sentence, max_length=MAX_TOKENS):
    # The input sentence is Portuguese, hence adding the START and END tokens.
    assert isinstance(sentence, tf.Tensor)
    if len(sentence.shape) == 0:
      sentence = sentence[tf.newaxis]

    sentence = self.tokenizers.pt.tokenize(sentence).to_tensor()

    encoder_input = sentence

    # As the output language is English, initialize the output with the
    # english START token.
    start_end = self.tokenizers.en.tokenize([''])[0]
    start = start_end[0][tf.newaxis]
    end = start_end[1][tf.newaxis]

    # `tf.TensorArray` is required here (instead of a Python list), so that the
    # dynamic-loop can be traced by `tf.function`.
    output_array = tf.TensorArray(dtype=tf.int64, size=0, dynamic_size=True)
    output_array = output_array.write(0, start)

    for i in tf.range(max_length):
      output = tf.transpose(output_array.stack())
      predictions, _ = self.transformer([encoder_input, output], training=False)

      # Select the last token from the `seq_len` dimension
      predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

      predicted_id = tf.argmax(predictions, axis=-1)

      # Concatenate the `predicted_id` to the output which is given to the
      # decoder as its input.
      output_array = output_array.write(i+1, predicted_id[0])

      if predicted_id == end:
        break

    output = tf.transpose(output_array.stack())
    # output.shape (1, tokens)
    text = self.tokenizers.en.detokenize(output)[0]  # shape: ()

    tokens = self.tokenizers.en.lookup(output)[0]

    # `tf.function` prevents us from using the attention_weights that were
    # calculated on the last iteration of the loop.
    # Therefore, recalculate them outside the loop.
    _, attention_weights = self.transformer([encoder_input, output[:,:-1]], training=False)

    return text, tokens, attention_weights

Note: This function uses an unrolled loop, not a dynamic loop. It generates `MAX_TOKENS` on every call. Refer to [NMT with attention](nmt_with_attention.ipynb) for an example implementation with a dynamic loop, which can be much more efficient.

Create an instance of this `Translator` class, and try it out a few times:

In [None]:
translator = Translator(tokenizers, transformer)

In [None]:
def print_translation(sentence, tokens, ground_truth):
  print(f'{"Input":15s}: {sentence}')
  print(f'{"Prediction":15s}: {tokens.numpy().decode("utf-8")}')
  print(f'{"Ground truth":15s}: {ground_truth}')

Example 1:

In [None]:
sentence = 'este é um problema que temos que resolver.'
ground_truth = 'this is a problem we have to solve .'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Example 2:

In [None]:
sentence = 'os meus vizinhos ouviram sobre esta ideia.'
ground_truth = 'and my neighboring homes heard about this idea .'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Example 3:

In [None]:
sentence = 'vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.'
ground_truth = "so i'll just share with you some stories very quickly of some magical things that have happened."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

## Create attention plots

The `Translator` class you created in the previous section returns a dictionary of attention maps you can use to visualize the internal working of the model. For example:

In [None]:
sentence = 'este é o primeiro livro que eu fiz.'
ground_truth = "this is the first book i've ever done."

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

Create a function that plots the attention when a token was generated: 

In [None]:
def plot_attention_head(in_tokens, translated_tokens, attention):
  # The plot is of the attention when a token was generated.
  # The model didn't generate `[START]` in the output. Skip it.
  translated_tokens = translated_tokens[1:]

  ax = plt.gca()
  ax.matshow(attention)
  ax.set_xticks(range(len(in_tokens)))
  ax.set_yticks(range(len(translated_tokens)))

  labels = [label.decode('utf-8') for label in in_tokens.numpy()]
  ax.set_xticklabels(
      labels, rotation=90)

  labels = [label.decode('utf-8') for label in translated_tokens.numpy()]
  ax.set_yticklabels(labels)

In [None]:
head = 0
# shape: (batch=1, num_heads, seq_len_q, seq_len_k)
attention_heads = tf.squeeze(
  attention_weights['decoder_layer4_block2'], 0)
attention = attention_heads[head]
attention.shape

These are the input (Portuguese) tokens:

In [None]:
in_tokens = tf.convert_to_tensor([sentence])
in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
in_tokens = tokenizers.pt.lookup(in_tokens)[0]
in_tokens

And these are the output (English translation) tokens:

In [None]:
translated_tokens

In [None]:
plot_attention_head(in_tokens, translated_tokens, attention)

In [None]:
def plot_attention_weights(sentence, translated_tokens, attention_heads):
  in_tokens = tf.convert_to_tensor([sentence])
  in_tokens = tokenizers.pt.tokenize(in_tokens).to_tensor()
  in_tokens = tokenizers.pt.lookup(in_tokens)[0]

  fig = plt.figure(figsize=(16, 8))

  for h, head in enumerate(attention_heads):
    ax = fig.add_subplot(2, 4, h+1)

    plot_attention_head(in_tokens, translated_tokens, head)

    ax.set_xlabel(f'Head {h+1}')

  plt.tight_layout()
  plt.show()

In [None]:
plot_attention_weights(sentence,
                       translated_tokens,
                       attention_weights['decoder_layer4_block2'][0])

The model does okay on unfamiliar words. Neither `'triceratops'` nor `'encyclopedia'` is in the input dataset, and the model almost learns to transliterate them even without a shared vocabulary. For example:

In [None]:
sentence = 'Eu li sobre triceratops na enciclopédia.'
ground_truth = 'I read about triceratops in the encyclopedia.'

translated_text, translated_tokens, attention_weights = translator(
    tf.constant(sentence))
print_translation(sentence, translated_text, ground_truth)

plot_attention_weights(sentence, translated_tokens,
                       attention_weights['decoder_layer4_block2'][0])

## Export the model

You have tested the model and the inference is working. Next, export it as a `SavedModel` using `tf.saved_model`.

To do that, wrap it in yet another `tf.Module` subclass, this time with a `tf.function` on the `__call__` method:

In [None]:
class ExportTranslator(tf.Module):
  def __init__(self, translator):
    self.translator = translator

  @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.string)])
  def __call__(self, sentence):
    (result,
     tokens,
     attention_weights) = self.translator(sentence, max_length=MAX_TOKENS)

    return result

In the above `tf.function` only the output sentence is returned. Thanks to the [non-strict execution](https://tensorflow.org/guide/intro_to_graphs) in `tf.function` any unnecessary values are never computed.

In [None]:
translator = ExportTranslator(translator)

Since the model is decoding the predictions using `tf.argmax` the predictions are deterministic. The original model and one reloaded from its `SavedModel` should give identical predictions:

In [None]:
translator('este é o primeiro livro que eu fiz.').numpy()

In [None]:
tf.saved_model.save(translator, export_dir='translator')

In [None]:
reloaded = tf.saved_model.load('translator')

In [None]:
reloaded('este é o primeiro livro que eu fiz.').numpy()

## Summary

In this tutorial you learned about:

* Transformers and their significance in machine learning
* Attention and self-attention mechanisms, and multi-head attention
* Positional encoding with embeddings
* The encoder-decoder architecture of the original Transformer
* The importance of masking
* How to put it all together

This implementation tried to stay close to the implementation of the 
[original Transformer](https://arxiv.org/abs/1706.03762){:.external}. If you want to practice, there are many things you could try with it. For example: 

* Use a different dataset to train the Transformer.
* Create the "base" or "big" Transformer configurations from the original paper by changing the hyperparameters.
* Use the layers defined here to create an implementation of [BERT](https://arxiv.org/abs/1810.04805){:.external}.
* Implement beam search to get better predictions.

There are many research papers you can study, including:
* ["Efficient Transformers: a survey"](https://arxiv.org/abs/2009.06732){:.external} (Tay et al., 2022)
* ["Formal algorithms for Transformers"](https://arxiv.org/abs/2207.09238){:.external} (Phuong and Hutter, 2022)
* [T5 ("Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer")](https://arxiv.org/abs/1910.10683){:.external} (Raffel et al., 2019)

You can also learn more about other models in Google blog posts, such as:
* [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html){:.external}
* [LaMDA](https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html){:.external}
* [MUM](https://blog.google/products/search/introducing-mum/){:.external}
* [Reformer](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html){:.external}
* [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html){:.external}