<a href="https://colab.research.google.com/github/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/04_machine_translation_sequence_to_sequence_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Machine translation: A sequence-to-sequence learning

In this notebook, you’ll deepen your expertise by learning about
sequence-to-sequence models.

A sequence-to-sequence model takes a sequence as input (often a sentence or
paragraph) and translates it into a different sequence. This is the task at the heart of many of the most successful applications of NLP:
- **Machine translation**—Convert a paragraph in a source language to its equivalent in a target language.
- **Text summarization**—Convert a long document to a shorter version that retains the most important information.
- **Question answering**—Convert an input question into its answer.
- **Chatbots**—Convert a dialogue prompt into a reply to this prompt, or convert the history of a conversation into the next reply in the conversation.
- **Text generation**—Convert a text prompt into a paragraph that completes the prompt.

The general template behind sequence-to-sequence models is described in figure.

<img src='https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/images/3.png?raw=1' width='600'/>

During training:-
- An `encoder` model turns the source sequence into an intermediate representation.
- A `decoder` is trained to predict the next token i in the target sequence by looking at both previous tokens `(0 to i - 1)` and the encoded source sequence.

**During inference, we don’t have access to the target sequence**—we’re trying to predict it from scratch. We’ll have to generate it one token at a time:

- We obtain the encoded source sequence from the encoder.
- The decoder starts by looking at the encoded source sequence as well as an initial “seed” token (such as the string `[start]`), and uses them to predict the first real token in the sequence.
- The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string
`[end]`).

Everything you’ve learned so far can be repurposed to build this new kind of model.

Let’s dive in.


##Setup

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import random
import string
import re

import numpy as np

We’ll be working with an English-to-Spanish translation dataset available at
www.manythings.org/anki/. 

Let’s download it:

In [2]:
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip

--2022-02-03 04:36:25--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.133.128, 74.125.140.128, 108.177.15.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.133.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’


2022-02-03 04:36:26 (174 MB/s) - ‘spa-eng.zip’ saved [2638744/2638744]



##Data preparation

The text file contains one example per line: an English sentence, followed by a tab character, followed by the corresponding Spanish sentence. 

Let’s parse this file.

In [3]:
text_file = "spa-eng/spa.txt"
with open(text_file) as f:
  lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
  # Each line contains an English phrase and its Spanish translation, tab-separated.
  english, spanish = line.split("\t")
  # We prepend "[start]" and append "[end]" to the Spanish sentence, to match the template
  spanish = "[start]" + spanish + "[end]"
  text_pairs.append((english, spanish))

Our `text_pairs` look like this:

In [4]:
print(random.choice(text_pairs))

("In my opinion, it's better to change the policy.", '[start]A mi entender es mejor cambiar de procedimiento.[end]')


In [None]:
print(random.choice(text_pairs))

('Mary was arrested for shoplifting.', '[start]Mary fue arrestada por ratera.[end]')


Let’s shuffle them and split them into the usual training, validation, and test sets:

In [5]:
random.shuffle(text_pairs)

num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples

train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples: num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Next, let’s prepare two separate TextVectorization layers: one for English and one for Spanish. 

We’re going to need to customize the way strings are preprocessed:

In [6]:
# Prepare a custom string standardization function for the Spanish TextVectorization layer: it preserves [ and ] 
# but strips ¿ (as well as all other characters from strings.punctuation).
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
  lowercase = tf.strings.lower(input_string)
  return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# To keep things simple, we’ll only look at the top 15,000 words in each language, and we’ll restrict sentences to 20 words.
vocab_size = 15000
sequence_length = 20

source_vectorization = layers.TextVectorization(max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length)
# Generate Spanish sentences that have one extra token, since we’ll need to offset the sentence by one step during training.
target_vectorization = layers.TextVectorization(max_tokens=vocab_size, output_mode="int", 
                                                output_sequence_length=sequence_length + 1,
                                                standardize=custom_standardization)

train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
# Learn the vocabulary of each language
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)

Finally, we can turn our data into a tf.data pipeline. 

We want it to return a tuple `(inputs, target)` where `inputs` is a dict with two keys,`encoder_inputs` (the English sentence) and `decoder_inputs` (the Spanish sentence), and `target` is the Spanish sentence offset by one step ahead.

In [7]:
batch_size = 64

def format_dataset(eng, spa):
  eng = source_vectorization(eng)
  spa = target_vectorization(spa)
  return (
      {
      "english": eng,
      "spanish": spa[:, :-1],  # The input Spanish sentence doesn’t include the last token to keep inputs and targets at the same length
      },
      spa[:, 1:]   # The target Spanish sentence is one step ahead. Both are still the same length (20 words)
  )

In [8]:
def make_dataset(pairs):
  eng_texts, spa_texts = zip(*pairs)
  eng_texts = list(eng_texts)
  spa_texts = list(spa_texts)
  dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
  dataset = dataset.batch(batch_size)
  dataset = dataset.map(format_dataset, num_parallel_calls=4)
  # Use in-memory caching to speed up preprocessing
  return dataset.shuffle(2048).prefetch(16).cache()

In [9]:
train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

Here’s what our dataset outputs look like:

In [12]:
for inputs, targets in train_ds.take(1):
  print(f"inputs['english'].shape: {inputs['english'].shape}")
  print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
  print(f"targets.shape: {targets.shape}")

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)


The data is now ready—time to build some models. We’ll start with a recurrent
sequence-to-sequence model before moving on to a Transformer.

##Sequence-to-sequence learning with RNNs

The simplest, naive way to use RNNs to turn a sequence into another sequence is to keep the output of the RNN at each time step. 

In Keras, it would look like this:

```python
inputs = keras.Input(shape=(sequence_length,), dtype="int64")
x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
x = layers.LSTM(32, return_sequences=True)(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```

However, there are two major issues with this approach:

* The target sequence must always be the same length as the source sequence.
* Due to the step-by-step nature of RNNs, the model will only be looking at
tokens `0…N` in the source sequence in order to predict token N in the target
sequence. This constraint makes this setup unsuitable for most tasks, and
particularly translation.

If you’re a human translator, you’d start by reading the entire source sentence before
starting to translate it. This is especially important if you’re dealing with languages
that have wildly different word ordering, like English and Japanese. And that’s exactly
what standard sequence-to-sequence models do.

In a proper sequence-to-sequence setup, you would first use an
RNN (the encoder) to turn the entire source sequence into a single vector (or set of
vectors). 

This could be the last output of the RNN, or alternatively, its final internal
state vectors. 

<img src='https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/images/4.png?raw=1' width='600'/>

Then you would use this vector (or vectors) as the `initial state` of another RNN (the decoder), which would look at elements `0…N` in the target sequence, and
try to predict step `N+1` in the target sequence.

Let’s implement this in Keras with GRU-based encoders and decoders. The choice
of GRU rather than LSTM makes things a bit simpler, since GRU only has a single
state vector, whereas LSTM has multiple. 

Let’s start with the encoder.

In [None]:
embed_dim = 256
latent_dim = 1024

In [None]:
# The English source sentence goes here.
source = keras.Input(shape=(None, ), dtype="int64", name="english")
# Don’t forget masking: it’s critical in this setup
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

Next, let’s add the decoder—a simple GRU layer that takes as its initial state the encoded source sentence. 

On top of it, we add a Dense layer that produces for each
output step a probability distribution over the Spanish vocabulary.

In [None]:
# The Spanish target sentence goes here
past_target = keras.Input(shape=(None, ), dtype="int64", name="spanish")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
# The encoded source sentence serves as the initial state of the decoder GRU
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
# Predicts the next token
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)

# End-to-end model: maps the source sentence and the target sentence to the target sentence one step in the future
seq2seq_rnn = keras.Model(inputs=[source, past_target], outputs=target_next_step)

During training, the decoder takes as input the entire target sequence, but thanks to
the step-by-step nature of RNNs, it only looks at tokens `0…N` in the input to predict token N in the output (which corresponds to the next token in the sequence, since
the output is intended to be offset by one step). 

This means we only use information
from the past to predict the future, as we should; otherwise we’d be cheating, and our
model would not work at inference time.

Let’s start training.

In [None]:
seq2seq_rnn.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fb4cd5975d0>

We picked accuracy as a crude way to monitor validation-set performance during
training. We get to 64% accuracy: on average, the model predicts the next word in the
Spanish sentence correctly 64% of the time. However, in practice, next-token accuracy
isn’t a great metric for machine translation models.

If you work on a real-world machine translation system, you will likely use `BLEU scores` to evaluate your models—a metric that looks at entire generated sequences
and that seems to correlate well with human perception of translation quality.

At last, let’s use our model for inference.

We’ll pick a few sentences in the test set
and check how our model translates them. We’ll start from the seed token, `[start]`,
and feed it into the decoder model, together with the encoded English source sentence.

We’ll retrieve a next-token prediction, and we’ll re-inject it into the decoder
repeatedly, sampling one new target token at each iteration, until we get to `[end]`
or reach the maximum sentence length.

In [None]:
# Prepare a dict to convert token index predictions to string tokens
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
  tokenized_input_sentence = source_vectorization([input_sentence])
  # seed token
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = target_vectorization([decoded_sentence])
    # sample the next token
    next_token_predictions = seq2seq_rnn.predict([tokenized_input_sentence, tokenized_target_sentence])
    sampled_token_index = np.argmax(next_token_predictions[0, i, :])
    # Convert the next token prediction to a string and append it to the generated sentence.
    sampled_token = spa_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token

    # Exit condition: either hit max length or sample a stop character
    if sampled_token == "[end]":
      break
  return decoded_sentence

In [None]:
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
  input_sentence = random.choice(test_eng_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence(input_sentence))

-
How big you are!
[start]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
-
Why don't you leave, Tom?
[start] no te por tom[end]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
-
Tom wants to stay here.
[start] tom quiere aquí[end]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
-
We love you.
[start]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
-
There were a lot of boats on the lake.
[start] muchos en el mary del error[end]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
-
Let's play tennis in the afternoon.
[start] al tenis por la tarde[end]  [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK]
-
This is the worst of all.
[start] es el mejor de la para nada[end]  [UNK] [UNK] [UNK] [UNK] [UNK] [UN

Note that this inference setup, while very simple, is rather inefficient, since we reprocess
the entire source sentence and the entire generated target sentence every time
we sample a new word.

In a practical application, you’d factor the encoder and the
decoder as two separate models, and your decoder would only run a single step at
each token-sampling iteration, reusing its previous internal state.

There are many ways this toy model could be improved: 

* We could use a deep stack of
recurrent layers for both the encoder and the decoder (note that for the decoder, this makes state management a bit more involved). 
* We could use an LSTM instead of a GRU. And so on. 

Beyond such tweaks, however, the RNN approach to sequence-to-sequence
learning has a few fundamental limitations:

* The source sequence representation has to be held entirely in the encoder state
vector(s), which puts significant limitations on the size and complexity of the
sentences you can translate. It’s a bit as if a human were translating a sentence
entirely from memory, without looking twice at the source sentence while producing
the translation.

* RNNs have trouble dealing with very long sequences, since they tend to progressively
forget about the past—by the time you’ve reached the 100th token in
either sequence, little information remains about the start of the sequence.That means RNN-based models can’t hold onto long-term context, which can
be essential for translating long documents.

These limitations are what has led the machine learning community to embrace the
Transformer architecture for sequence-to-sequence problems.

##Sequence-to-sequence learning with Transformer

Sequence-to-sequence learning is the task where Transformer really shines. Neural
attention enables Transformer models to successfully process sequences that are con
siderably
longer and more complex than those RNNs can handle.

Look at the decoder
internals: you’ll recognize that it looks very similar to the Transformer encoder, except
that an extra attention block is inserted between the self-attention block applied to
the target sequence and the dense layers of the exit block.

<img src='https://github.com/rahiakela/deep-learning-research-and-practice/blob/main/deep-learning-with-python-by-francois-chollet/11-deep-learning-for-text/images/5.png?raw=1' width='600'/>


Let’s implement it. Like for the TransformerEncoder, we’ll use a Layer subclass.

###Positional embedding

In [15]:
class PositionalEmbedding(layers.Layer):

  # A downside of position embeddings is that the sequence length needs to be known in advance
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)

    # Prepare an Embedding layer for the token indices.
    self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
    # And another one for the token positions
    self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)

    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim

  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    # add both embedding vectors together
    return embedded_tokens + embedded_positions

  def compute_mask(self, inputs, mask=None):
    """
    Like the Embedding layer, this layer should be able to generate a mask so we can ignore padding 0s in the inputs. 
    The compute_mask method will called automatically by the framework, and the mask will get propagated to the next layer.
    """
    return tf.math.not_equal(inputs, 0)

  def get_config(self):
    config = super().get_config()
    config.update({
        "output_dim": self.output_dim,
        "sequence_length": self.sequence_length,
        "input_dim": self.input_dim
    })
    return config

###Transformer encoder

In [16]:
class TransformerEncoder(layers.Layer):
  
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim    # Size of the input token vectors
    self.dense_dim = dense_dim    # Size of the inner dense layer
    self.num_heads = num_heads  # Number of attention heads

    self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
    self.dense_projection = keras.Sequential([
         layers.Dense(dense_dim, activation="relu"),
         layers.Dense(embed_dim)                                     
    ])

    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()

  def call(self, inputs, mask=None):
    # The mask that will be generated by the Embedding layer will be 2D, but the attention layer expects to be 3D or 4D, so we expand its rank.
    if mask is not None:
      mask = mask[:, tf.newaxis, :]
    attention_output = self.attention(inputs, inputs, attention_mask=mask)
    projection_input = self.layernorm_1(inputs + attention_output)
    projection_output = self.dense_projection(projection_input)
    return self.layernorm_2(projection_input + projection_output)

  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "dense_dim": self.dense_dim
    })
    return config

###Transformer decoder

In [23]:
class TransformerDecoder(layers.Layer):

  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
      super().__init__(**kwargs)

      self.embed_dim = embed_dim
      self.dense_dim = dense_dim
      self.num_heads = num_heads

      self.multi_head_attention_layer_1 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
      self.multi_head_attention_layer_2 = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
      
      self.dense_projection = keras.Sequential([
           layers.Dense(dense_dim, activation="relu"),
           layers.Dense(embed_dim, )                                     
      ])

      self.layernorm_1 = layers.LayerNormalization()
      self.layernorm_2 = layers.LayerNormalization()
      self.layernorm_3 = layers.LayerNormalization()

      # This attribute ensures that the layer will propagate its input mask to its outputs
      self.supports_masking = True

  def get_config(self):
      config = super().get_config()
      config.update({
          "embed_dim": self.embed_dim,
          "num_heads": self.num_heads,
          "dense_dim": self.dense_dim
      })
      return config

  def get_causal_attention_mask(self, inputs):
    """
    Causal padding is absolutely critical to successfully training a sequence-to-sequence Transformer.
    we’ll mask the upper half of the pairwise attention matrix to prevent the model from paying any attention 
    to information from the future only information from tokens 0...N in the target sequence should be used 
    when generating target token N+1.
    """
    input_shape = tf.shape(inputs)
    batch_size, sequence_length = input_shape[0], input_shape[1]
    i = tf.range(sequence_length)[:, tf.newaxis]
    j = tf.range(sequence_length)
    # Generate matrix of shape (sequence_length, sequence_length) with 1s in one half and 0s in the other
    mask = tf.cast(i >= j, dtype="int32")
    # Replicate it along the batch axis to get a matrix of shape (batch_size, sequence_length, sequence_length)
    mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
    mult = tf.concat([tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], axis=0)

    return tf.tile(mask, mult)

  def call(self, inputs, encoder_outputs, mask=None):
    # Retrieve the causal mask
    causal_mask = self.get_causal_attention_mask(inputs)
    # Prepare the input mask (that describes padding locations in the target sequence)
    if mask is not None:
      padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
      # Merge the two masks together
      padding_mask = tf.minimum(padding_mask, causal_mask)
    # Pass the causal mask to the first attention layer, which performs self-attention over the target sequence
    attention_output_1 = self.multi_head_attention_layer_1(query=inputs, value=inputs, key=inputs, attention_mask=causal_mask)
    # Pass the combined mask to the second attention layer, which relates the source sequence to the target sequence
    attention_output_2 = self.multi_head_attention_layer_2(query=attention_output_1, value=encoder_outputs, key=encoder_outputs, attention_mask=padding_mask)
    attention_output_2 = self.layernorm_2(attention_output_1 + attention_output_2)
    projection_output = self.dense_projection(attention_output_2)

    return self.layernorm_3(attention_output_2 + projection_output)

The end-to-end Transformer is the model we’ll be training. It maps the source
sequence and the target sequence to the target sequence one step in the future. It
straightforwardly combines the pieces we’ve built so far: 

- `PositionalEmbedding` layers
- `TransformerEncoder` layers 
- `TransformerDecoder` layers

Note that both the `TransformerEncoder`
and the `TransformerDecoder` are shape-invariant, so you could be
stacking many of them to create a more powerful encoder or decoder.

In [14]:
embed_dim = 256
dense_dim = 2048
num_heads = 8

In [25]:
encoder_inputs = keras.Input(shape=(None, ), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
# Encode the source sentence
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = keras.Input(shape=(None, ), dtype="int64", name="spanish")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
# Encode the target sentence and combine it with the encoded source sentence
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
# Predict a word for each output position
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

We’re now ready to train our model—we get to 67% accuracy, a good deal above the GRU-based model.

In [26]:
transformer.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
transformer.fit(train_ds, epochs=30, validation_data=val_ds)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7ff9704eb950>

##Translation evaluation

Finally, let’s try using our model to translate never-seen-before English sentences from
the test set. 

The setup is identical to what we used for the sequence-to-sequence RNN
model.

In [27]:
# Prepare a dict to convert token index predictions to string tokens
spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
  tokenized_input_sentence = source_vectorization([input_sentence])
  # seed token
  decoded_sentence = "[start]"
  for i in range(max_decoded_sentence_length):
    tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]
    # sample the next token
    predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
    sampled_token_index = np.argmax(predictions[0, i, :])
    # Convert the next token prediction to a string and append it to the generated sentence.
    sampled_token = spa_index_lookup[sampled_token_index]
    decoded_sentence += " " + sampled_token

    # Exit condition: either hit max length or sample a stop character
    if sampled_token == "[end]":
      break
  return decoded_sentence

In [28]:
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
  input_sentence = random.choice(test_eng_texts)
  print("-")
  print(input_sentence)
  print(decode_sequence(input_sentence))

-
We enjoyed the party very much.
[start] la fiesta muy te [UNK]               
-
The neck of the bottle was broken.
[start] el [UNK] de la situación fue [UNK]  [UNK]           la
-
Tom appreciated Mary's kindness.
[start] la que mary [UNK] la [UNK] de mary[end]            
-
Tom is banned from entering this building.
[start] de [UNK] de este [UNK]  en de esta en de [UNK] en       en
-
The other day, I bought a camera.
[start] una vez un cámara de cámara[end]              
-
She is very free with her money.
[start] es muy viejo con tu dinero[end]              
-
Would you like another apple?
[start] una [UNK]                  
-
What are you going to say?
[start] lo que dice eso[end]                
-
We can pay 200 dollars at most.
[start] palabras para los [UNK] de la gente más [UNK]  que del a el      de
-
The more you have, the more you want.
[start] más que menos tienen dinero que quieras[end]             
-
I am as sad and lonely as can be.
[start] tan rápido y puede estar tan rá

Subjectively, the Transformer seems to perform significantly better than the GRUbased
translation model. 

It’s still a toy model, but it’s a better toy model.