# Exercise project 5 - Transformer networks

I wanted to try a Transformer model for machine translation. This notebook builds a sequence-to-sequence Transformer that translates English sentences into Spanish. The approach is based on the Keras example on sequence-to-sequence text translation.

The model is a simple encoder-decoder Transformer. The dataset contains pairs of English sentences and their Spanish translations, split into training, validation, and test sets (70/15/15).

Both languages use a WordPiece vocabulary of 15,000 tokens and a maximum sequence length of 40 tokens. The reserved tokens [PAD], [UNK], [START], and [END] handle padding to a fixed length, out-of-vocabulary pieces, and sequence boundaries.
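
To make the role of the special tokens concrete, here is a minimal pure-Python sketch of packing a token list to a fixed length. The helper `pack` is hypothetical and only approximates what a layer such as `StartEndPacker` does; I assume the markers get two reserved slots so that the end marker survives truncation.

```python
PAD, START, END = "[PAD]", "[START]", "[END]"


def pack(tokens, sequence_length, add_markers=True):
    """Wrap tokens in [START]/[END], then truncate or pad to a fixed length."""
    if add_markers:
        # Reserve two slots for the markers so [END] is never cut off.
        tokens = [START] + tokens[: sequence_length - 2] + [END]
    else:
        tokens = tokens[:sequence_length]
    # Fill the remaining positions with padding.
    return tokens + [PAD] * (sequence_length - len(tokens))


short = pack(["Hola", "."], sequence_length=6)
long = pack(["a", "b", "c", "d", "e", "f"], sequence_length=6)
print(short)
print(long)
```

Short sentences end up padded and long ones truncated, so every row in a batch has the same length.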


https://www.kaggle.com/code/abrahamanderson/artificial-neural-networks-for-regression/notebook


In [None]:
!pip install -q keras-hub

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tf-keras 2.17.0 requires tensorflow<2.18,>=2.17, but you have tensorflow 2.18.0 which is incompatible.

In [None]:
!pip install -q rouge-score
!pip install -q keras
!pip install -q tensorflow

  Preparing metadata (setup.py) ... done
  Building wheel for rouge-score (setup.py) ... done


In [None]:
!pip install -q tensorflow-text

In [None]:
!pip install -q keras-tqdm

In [None]:
import keras_hub
import pathlib
import random
import keras
from keras import ops
import matplotlib.pyplot as plt
from keras_tqdm import TQDMCallback
import tensorflow as tf
import tensorflow_text as text
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

In [None]:
TextVectorization = keras.layers.TextVectorization
tf_data = tf.data.Dataset.from_tensor_slices

In [None]:
BATCH_SIZE = 64
EPOCHS = 10  # This should be at least 10 for convergence
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
SPA_VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

In [None]:
text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file).parent / "spa-eng" / "spa.txt"

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
2638744/2638744 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [None]:
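# Read the tab-separated English/Spanish pairs (UTF-8, since the Spanish text has accents).
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    # Note: these lowercase markers are plain text, not the reserved "[START]"/"[END]"
    # tokens. The tokenizer splits them into "[", "start", "]" pieces, and
    # `StartEndPacker` adds the real boundary tokens later in `preprocess_batch`.
    spa = "[start] " + spa + " [end]"
    text_pairs.append((eng, spa))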

In [None]:
for _ in range(5):
    print(random.choice(text_pairs))

("We can't ignore Tom's past.", '[start] No podemos ignorar el pasado de Tom. [end]')
('You must do as you are told.', '[start] Debes hacer lo que te dicen. [end]')
('Tom is honest, so I like him.', '[start] Tom es honesto, por eso me gusta. [end]')
('How about you?', '[start] ¿Qué hay de ti? [end]')
('She deliberately ignored me on the street.', '[start] Ella deliberadamente me ignoró por la calle. [end]')


In [None]:
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")

118964 total pairs
83276 training pairs
17844 validation pairs
17844 test pairs


In [None]:
def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf.data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab

In [None]:
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

spa_samples = [text_pair[1] for text_pair in train_pairs]
spa_vocab = train_word_piece(spa_samples, SPA_VOCAB_SIZE, reserved_tokens)

In [None]:
print("English Tokens: ", eng_vocab[100:110])
print("Spanish Tokens: ", spa_vocab[100:110])

English Tokens:  ['that', 'me', 'have', 'The', 'for', 'it', 'You', 'Mary', 'my', 'do']
Spanish Tokens:  ['é', 'ê', 'í', 'ñ', 'ó', 'ú', 'ü', 'č', '—', '€']


In [None]:
eng_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)
spa_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=spa_vocab, lowercase=False
)

In [None]:
eng_input_ex = text_pairs[0][0]
eng_tokens_ex = eng_tokenizer.tokenize(eng_input_ex)
print("English sentence: ", eng_input_ex)
print("Tokens: ", eng_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    eng_tokenizer.detokenize(eng_tokens_ex),
)

print()

spa_input_ex = text_pairs[0][1]
spa_tokens_ex = spa_tokenizer.tokenize(spa_input_ex)
print("Spanish sentence: ", spa_input_ex)
print("Tokens: ", spa_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    spa_tokenizer.detokenize(spa_tokens_ex),
)

English sentence:  The dentist gave me an appointment for seven o'clock.
Tokens:  tf.Tensor([ 103 2358  333  101  156 1708  104  935   67    8  569   12], shape=(12,), dtype=int32)
Recovered text after detokenizing:  The dentist gave me an appointment for seven o ' clock .

Spanish sentence:  [start] El dentista me citó a las siete. [end]
Tokens:  tf.Tensor(
[  56  111   57  133 3064  128   60 1387   58  142  926   15   56  110
   57], shape=(15,), dtype=int32)
Recovered text after detokenizing:  [ start ] El dentista me citó a las siete . [ end ]
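
The detokenized Spanish output shows `[ start ]` and `[ end ]` as separate pieces, because the literal markers in the text are split like ordinary punctuation and words. The sketch below illustrates the general WordPiece idea (punctuation split, then greedy longest-match-first lookup) in pure Python. The toy vocabulary and both helper names are hypothetical, and the real `WordPieceTokenizer` has more pre-processing options than this.

```python
import re


def basic_split(text):
    """Split on whitespace and peel punctuation off into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_piece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the one trained in this notebook.
toy_vocab = {"[", "]", "start", "El", "dent", "##ista", "."}
tokens = [
    piece
    for word in basic_split("[start] El dentista.")
    for piece in word_piece(word, toy_vocab)
]
print(tokens)
```

With this toy vocabulary, `[start]` becomes three pieces and `dentista` is split into a stem and a continuation piece.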


In [None]:
def preprocess_batch(eng, spa):
    batch_size = ops.shape(spa)[0]

    eng = eng_tokenizer(eng)
    spa = spa_tokenizer(spa)

    # Pad `eng` to `MAX_SEQUENCE_LENGTH`.
    eng_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)

    # Add special tokens (`"[START]"` and `"[END]"`) to `spa` and pad it as well.
    spa_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=spa_tokenizer.token_to_id("[START]"),
        end_value=spa_tokenizer.token_to_id("[END]"),
        pad_value=spa_tokenizer.token_to_id("[PAD]"),
    )
    spa = spa_start_end_packer(spa)

    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf.data.AUTOTUNE)
    # Cache the tokenized batches first so each epoch reshuffles them cheaply.
    return dataset.cache().shuffle(2048).prefetch(16)


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [None]:
for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")

inputs["encoder_inputs"].shape: (64, 40)
inputs["decoder_inputs"].shape: (64, 40)
targets.shape: (64, 40)
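
Both the decoder inputs and the targets have length 40 because the Spanish sequence is packed to `MAX_SEQUENCE_LENGTH + 1` tokens and then sliced twice, one step apart. A minimal sketch of that shift on a single sequence (the helper name is hypothetical):

```python
def shift_for_teacher_forcing(packed):
    """Split one packed target sequence into (decoder_input, target)."""
    decoder_input = packed[:-1]  # everything except the last position
    target = packed[1:]          # the same sequence, one step ahead
    return decoder_input, target


packed = ["[START]", "Hola", "mundo", "[END]", "[PAD]"]
decoder_input, target = shift_for_teacher_forcing(packed)
for seen, expected in zip(decoder_input, target):
    print(f"{seen:>8} -> {expected}")
```

At each position the decoder sees the tokens up to that point and is trained to predict the next one.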


In [None]:
# Encoder
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoder_outputs = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)


# Decoder
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=SPA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

x = keras_hub.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(SPA_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer",
)

In [None]:
transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
# The default progress bar is enough here, so no extra ProgbarLogger callback is needed.
transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

Epoch 1/10
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 129s 74ms/step - accuracy: 0.7202 - loss: 1.9083 - val_accuracy: 0.8206 - val_loss: 1.2062
Epoch 2/10
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 59s 45ms/step - accuracy: 0.8069 - loss: 1.2325 - val_accuracy: 0.8349 - val_loss: 1.0748
Epoch 3/10
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 56s 43ms/step - accuracy: 0.8230 - loss: 1.1292 - val_accuracy: 0.8400 - val_loss: 1.0134
Epoch 4/10
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 55s 42ms/step - accuracy: 0.8326 - loss: 1.0515 - val_accuracy: 0.8495 - val_loss: 0.9438
Epoch 5/10
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 54s 42ms/step - accuracy: 0.8421 - loss: 0.9836 - val_accuracy: 0.8479 - val_loss: 0.9252
Epoch 6/10
1302/1302 ━━━━━━━━━━━━━━━━━━━━ 55s 42ms/step - accuracy: 0.8460 - loss: 0.9444 - val_accuracy: 0.8363 - val_loss: 0.9714
Epo

<keras.src.callbacks.history.History at 0x7abb8ff58910>
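
The accuracy in the log is computed per token over all 40 target positions. As far as I can tell nothing in the model masks padding, so padded positions probably count as easy correct predictions. The sketch below compares plain accuracy with an accuracy that skips padding. It works on plain Python lists, both function names are hypothetical, and `pad_id=0` is an assumption based on `[PAD]` being the first reserved token.

```python
def plain_accuracy(targets, predictions):
    """Fraction of positions where the prediction equals the target."""
    pairs = [(t, p) for t_row, p_row in zip(targets, predictions)
             for t, p in zip(t_row, p_row)]
    return sum(t == p for t, p in pairs) / len(pairs)


def masked_accuracy(targets, predictions, pad_id=0):
    """Same as plain accuracy, but positions whose target is padding are skipped."""
    pairs = [(t, p) for t_row, p_row in zip(targets, predictions)
             for t, p in zip(t_row, p_row) if t != pad_id]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0


# One toy sequence: three real tokens followed by five padding positions.
targets = [[5, 7, 3, 0, 0, 0, 0, 0]]
predictions = [[5, 9, 9, 0, 0, 0, 0, 0]]
print("plain: ", plain_accuracy(targets, predictions))
print("masked:", masked_accuracy(targets, predictions))
```

In this toy case only one of three real tokens is right, yet plain accuracy is 0.75 because the padding is always predicted correctly.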

In [None]:
def decode_sequences(input_sentences):
    batch_size = 1

    # Tokenize the encoder input.
    encoder_input_tokens = eng_tokenizer(input_sentences)
    encoder_input_tokens = tf.convert_to_tensor(encoder_input_tokens)

    # Pad with the English [PAD] id if the sentence is shorter than MAX_SEQUENCE_LENGTH.
    if len(encoder_input_tokens[0]) < MAX_SEQUENCE_LENGTH:
        pads = tf.fill(
            [1, MAX_SEQUENCE_LENGTH - len(encoder_input_tokens[0])],
            eng_tokenizer.token_to_id("[PAD]"),
        )
        encoder_input_tokens = tf.concat([encoder_input_tokens, pads], axis=1)

    # Return the next-token scores for position `index`, given the tokens so far.
    # The model ends in a softmax, so these are probabilities rather than raw
    # logits; greedy (argmax) selection picks the same token either way.
    def next_token(prompt, cache, index):
        probs = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        hidden_states = None  # Not used in this implementation
        return probs, hidden_states, cache

    # Build a prompt of MAX_SEQUENCE_LENGTH tokens: one start token, then padding.
    length = MAX_SEQUENCE_LENGTH
    start = tf.fill([batch_size, 1], spa_tokenizer.token_to_id("[START]"))
    pad = tf.fill([batch_size, length - 1], spa_tokenizer.token_to_id("[PAD]"))
    prompt = tf.concat((start, pad), axis=-1)

    # Use GreedySampler to generate tokens
    generated_tokens = keras_hub.samplers.GreedySampler()(
        next_token,
        prompt,
        stop_token_ids=[spa_tokenizer.token_to_id("[END]")],
        index=1,  # Start sampling after the start token.
    )

    generated_sentences = spa_tokenizer.detokenize(generated_tokens)
    return generated_sentences
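
For reference, greedy decoding itself is a short loop: score the next token, take the highest-scoring one, and stop at the end token or the length limit. The sketch below is a simplified pure-Python version with a hypothetical toy scoring function in place of the Transformer. It is not how `GreedySampler` is implemented internally, which works on a fixed-length prompt tensor.

```python
def greedy_decode(next_token_scores, start_id, end_id, max_length):
    """Repeatedly append the highest-scoring next token until end_id or max_length."""
    tokens = [start_id]
    while len(tokens) < max_length:
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        tokens.append(best)
        if best == end_id:
            break
    return tokens


def toy_scores(prefix):
    """Hypothetical stand-in for the model: always prefers (last token + 1), capped at 4."""
    scores = [0.0] * 5
    scores[min(prefix[-1] + 1, 4)] = 1.0
    return scores


print(greedy_decode(toy_scores, start_id=1, end_id=4, max_length=10))
```

Because only the single best token is kept at each step, an early mistake cannot be undone later, which is one known weakness of greedy decoding compared with beam search.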

In [None]:
rouge_1 = keras_hub.metrics.RougeN(order=1)
rouge_2 = keras_hub.metrics.RougeN(order=2)

for test_pair in test_pairs[:30]:
    input_sentence = test_pair[0]
    reference_sentence = test_pair[1]

    # Decode the sequence
    translated_sentence = decode_sequences([input_sentence])[0]  # Extract the first sentence from the list

    # Remove tokens
    translated_sentence = (
        translated_sentence.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    # Update ROUGE scores
    rouge_1(reference_sentence, translated_sentence)
    rouge_2(reference_sentence, translated_sentence)

# Print final ROUGE scores
print("ROUGE-1 Score: ", rouge_1.result())
print("ROUGE-2 Score: ", rouge_2.result())


ROUGE-1 Score:  {'precision': <tf.Tensor: shape=(), dtype=float32, numpy=0.8640741109848022>, 'recall': <tf.Tensor: shape=(), dtype=float32, numpy=0.13376900553703308>, 'f1_score': <tf.Tensor: shape=(), dtype=float32, numpy=0.22478343546390533>}
ROUGE-2 Score:  {'precision': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, 'recall': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>, 'f1_score': <tf.Tensor: shape=(), dtype=float32, numpy=0.0>}
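
To read these numbers, it helps to see how ROUGE-N is computed: it counts the n-grams shared between reference and candidate, then divides by the candidate's n-gram count (precision) or the reference's (recall). The sketch below is a simplified version that assumes lowercase whitespace tokenization. The function names are hypothetical, and the `RougeN` metric used above tokenizes differently, so exact values will not match.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall and F1 with clipped n-gram counts."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped count of shared n-grams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}


reference = "el dentista me citó a las siete"
print(rouge_n(reference, "siete", n=1))
print(rouge_n(reference, "siete", n=2))
```

A very short candidate gives high ROUGE-1 precision, low recall, and a ROUGE-2 of zero. That is one way a pattern like the one above can arise, but it is an illustration and not a diagnosis of what this model actually generated.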


## Personal Reflection / Analysis

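The token-level results look reasonable for such a simple architecture and short training time. In the visible part of the training log, validation accuracy reached about 0.85 and validation loss was lowest in epoch 5 (0.9252) before rising again in epoch 6. That accuracy is probably optimistic, because the metric is computed over all 40 positions and most of them are padding.

The ROUGE scores on the 30 evaluated test sentences are low. ROUGE-1 has high precision (0.86) but low recall (0.13), and ROUGE-2 is 0.0, so the generated output shares individual words with the references but no two-word sequences. The model therefore does not yet match human translations, even though Transformers are well suited to translation tasks. One thing that may lower the scores is that the reference sentences still contain the literal "[start]" and "[end]" markers. Removing those and evaluating on more than 30 sentences would be the first things I would try next.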