# How to bring your Transformer programming assignment to life

This example trains a Transformer using your programming assignment, [Transformers Architecture with TensorFlow](https://www.coursera.org/learn/nlp-sequence-models/programming/roP5y/transformers-architecture-with-tensorflow), from Week 4 of the Sequence Models course in the Deep Learning Specialization.

In [1]:
import os
import time
import pathlib
import numpy as np  # used by encode() and translate() below
import tensorflow as tf
import tensorflow_datasets as tfds

from transformer import *

Copy what you wrote in the programming assignment into `transformer.py` so that this example can import it. To see what you should copy, check [transformer.py](https://github.com/yhyu/berts/blob/master/transformer/transformer.py), which is the assignment file without the answers.
```python
from transformer import *
```
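The mask functions differ slightly from the assignment, however. Because we use the TensorFlow [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention) layer, the masks follow its convention: 1 marks a position that may be attended to and 0 marks a position to ignore. The padding mask and look-ahead mask are therefore defined as follows.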

In [2]:
def create_padding_mask(seq):
    mask = tf.cast(tf.math.not_equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, :]

def create_look_ahead_mask(size):
    return tf.linalg.band_part(tf.ones((size, size)), -1, 0)

In [3]:
def create_masks(inp, tar):
    # input mask for encoder
    enc_padding_mask = create_padding_mask(inp)
    
    # attention mask for decoder attends to encoder
    enc_dec_attend_mask = create_padding_mask(inp)
    
    # input mask for decoder (look_ahead_mask + padding_mask)
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_padding_mask = create_padding_mask(tar)
    dec_input_mask = tf.minimum(dec_padding_mask, look_ahead_mask)
    
    return enc_padding_mask, dec_input_mask, enc_dec_attend_mask
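
To see what the decoder input mask contains, here is a minimal NumPy sketch that mirrors the functions above on a toy batch. It is a stand-in for illustration rather than the TensorFlow code itself; the names `np_padding_mask` and `np_look_ahead_mask` and the token ids are hypothetical, and id 0 is assumed to be padding as in `create_padding_mask`.

```python
import numpy as np

def np_padding_mask(seq):
    # 1.0 where there is a real token, 0.0 where the id is 0 (padding)
    mask = (seq != 0).astype(np.float32)
    return mask[:, np.newaxis, :]          # shape (batch, 1, seq_len)

def np_look_ahead_mask(size):
    # lower-triangular matrix: position i may attend to positions <= i
    return np.tril(np.ones((size, size), dtype=np.float32))

tar = np.array([[2, 15, 27, 3, 0]])        # one target sentence, last id is padding
# element-wise minimum keeps a 1 only where both masks allow attention
dec_mask = np.minimum(np_padding_mask(tar), np_look_ahead_mask(tar.shape[1]))
print(dec_mask[0])
```

Each row is one query position. The last column stays 0 in every row because the final position is padding, and everything above the diagonal is 0 because a position must not look at later tokens.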

For convenience, I changed the argument order of the Transformer `call` function so that `training` is the last argument, with a default value of `None` (see [transformer.py](https://github.com/yhyu/berts/blob/cb3a6bf8b79be639f5f3108199d6a5f3f31a1bb2/transformer/transformer.py#L360)). It looks like this:

```python
class Transformer(tf.keras.Model):
    # ...
    def call(self, inp, tar, enc_padding_mask, look_ahead_mask, dec_padding_mask, training=None):
```

Okay, let's define a function that builds the NMT model.

In [4]:
def build_model(num_layers, num_heads, embedding_dim, fully_connected_dim,
                inp_vocab_size, tar_vocab_size, max_pos_encoding_inp, max_pos_encoding_tar):
    # embedding layer takes word index rather than one-hot vec
    src_input = tf.keras.Input(shape=(None,))            # source language input
    tar_input = tf.keras.Input(shape=(None,))            # target language input
    src_input_mask = tf.keras.Input(shape=(1, None))     # source language input mask
    src_tar_mask = tf.keras.Input(shape=(1, None))       # target attends to source language mask
    tar_input_mask = tf.keras.Input(shape=(None, None))  # target language input mask
    
    outputs, _ = Transformer(
        num_layers,
        embedding_dim,
        num_heads,
        fully_connected_dim,
        inp_vocab_size,
        tar_vocab_size,
        max_pos_encoding_inp,
        max_pos_encoding_tar
    )(src_input, tar_input, src_input_mask, tar_input_mask, src_tar_mask)
    return tf.keras.Model(inputs=[src_input, tar_input, src_input_mask, src_tar_mask, tar_input_mask], outputs=outputs)

## Prepare data set

The rest is tedious, but it is a crucial part of an ML engineer's daily life.

TensorFlow Datasets offers several data sets for NMT, and you can pick any of them. This example uses the [wmt15_translate/fr-en](https://www.tensorflow.org/datasets/catalog/wmt15_translate#wmt15_translatefr-en) data set.

Prepare the local file locations.

In [5]:
output_dir = "nmt_en-fr"
en_vocab_file = os.path.join(output_dir, "en_vocab")
fr_vocab_file = os.path.join(output_dir, "fr_vocab")
download_dir = "tensorflow-datasets/downloads"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

Take a look at which corpora the [wmt15_translate/fr-en](https://www.tensorflow.org/datasets/catalog/wmt15_translate#wmt15_translatefr-en) data set contains.

In [6]:
tfds.builder("wmt15_translate/fr-en").subsets

{Split('train'): ['europarl_v7',
  'commoncrawl',
  'multiun',
  'newscommentary_v10',
  'gigafren'],
 Split('validation'): ['newsdiscussdev2015', 'newstest2014'],
 Split('test'): ['newsdiscusstest2015']}

Download the training and validation data. I chose a small corpus, `newscommentary_v10`; you can pick a larger one, such as `commoncrawl`, to train a more powerful Transformer.

In [7]:
config = tfds.translate.wmt.WmtConfig(
    version="0.0.3",
    language_pair=("fr", "en"),
    subsets={
        tfds.Split.TRAIN: ["newscommentary_v10"],
        tfds.Split.VALIDATION: ["newstest2014"],
    },
)
builder = tfds.builder("wmt_translate", config=config)
builder.download_and_prepare(download_dir=download_dir)
ds_train, ds_val = builder.as_dataset(split=['train[:]', 'validation[:]'], as_supervised=True)



### Tokenize
Most of the following code snippets are copied from the TensorFlow [Subword tokenizers](https://www.tensorflow.org/tutorials/tensorflow_text/subwords_tokenizer) tutorial. At the time of writing they need `tf-nightly` and `tensorflow_text_nightly`. Hopefully these features will move to the official build soon.

In [8]:
import tensorflow_text as text
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

bert_tokenizer_params=dict(lower_case=True)
reserved_tokens=["[PAD]", "[UNK]", "[START]", "[END]"]

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size = 2**13,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

In [9]:
def gen_vocab(dataset, dsIdx, output_file, batch_size, prefetch):
    if os.path.isfile(output_file + '.txt'):
        vocab = pathlib.Path(output_file + '.txt').read_text(encoding='utf-8').splitlines()
    else:
        vocab = bert_vocab.bert_vocab_from_dataset(
            dataset.batch(batch_size).prefetch(prefetch),
            **bert_vocab_args
        )
        
        with open(output_file + '.txt', 'w', encoding='utf8') as f:
            for token in vocab:
                print(token, file=f)

    print('vocabulary size of %s: %d' % (dsIdx, len(vocab)))
    return vocab

In [10]:
en_train = ds_train.map(lambda fr, en: en)
fr_train = ds_train.map(lambda fr, en: fr)

Please forgive my laziness: I load all the training data in one shot by using a batch size larger than the corpus. It's not a good practice, but ...

In [11]:
MAX_LENGTH = 50
BATCH_SIZE = 320000 # only take 320000
BUFFER_SIZE = 15000
vocab_en = gen_vocab(en_train, 0, en_vocab_file, BATCH_SIZE, BUFFER_SIZE)
vocab_fr = gen_vocab(fr_train, 1, fr_vocab_file, BATCH_SIZE, BUFFER_SIZE)

vocabulary size of 0: 7794
vocabulary size of 1: 7978


In [12]:
fr_tokenizer = text.BertTokenizer(fr_vocab_file + '.txt', **bert_tokenizer_params)
en_tokenizer = text.BertTokenizer(en_vocab_file + '.txt', **bert_tokenizer_params)

In [13]:
PAD = tf.argmax(tf.constant(reserved_tokens) == "[PAD]").numpy()
BOS = tf.argmax(tf.constant(reserved_tokens) == "[START]").numpy()
EOS = tf.argmax(tf.constant(reserved_tokens) == "[END]").numpy()
print(PAD, BOS, EOS)

0 2 3
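
The ids printed above are just the positions of the reserved tokens in the `reserved_tokens` list. The code relies on the generated vocabulary starting with those tokens in the same order, which is treated as an assumption in this small pure-Python sketch (the `token_to_id` name is hypothetical).

```python
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

# assumption: the vocabulary file begins with the reserved tokens, in this order,
# so a token's list position is also its id
token_to_id = {tok: i for i, tok in enumerate(reserved_tokens)}
PAD, BOS, EOS = token_to_id["[PAD]"], token_to_id["[START]"], token_to_id["[END]"]
print(PAD, BOS, EOS)  # 0 2 3
```

Having `[PAD]` at id 0 is what lets `create_padding_mask` and the loss treat 0 as padding.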


Preprocess the training data: tokenize, drop sentence pairs in which either side is longer than MAX_LENGTH tokens, and pad the shorter sentences.

In [14]:
def encode(fr, en):
    en_indices = [BOS] + list(np.squeeze(en_tokenizer.tokenize(en.numpy()).merge_dims(-2, -1).to_list(), axis=0)) + [EOS]
    fr_indices = [BOS] + list(np.squeeze(fr_tokenizer.tokenize(fr.numpy()).merge_dims(-2, -1).to_list(), axis=0)) + [EOS]
    return fr_indices, en_indices

def tf_encode(fr, en):
    return tf.py_function(encode, [fr, en], [tf.int64, tf.int64])


def filter_max_length(fr, en, max_length=MAX_LENGTH):
    return tf.logical_and(tf.size(fr) <= max_length,
                          tf.size(en) <= max_length)

train_dataset = (ds_train
                 .map(tf_encode)
                 .filter(filter_max_length)
                 .cache()
                 .shuffle(BUFFER_SIZE)
                 .padded_batch(BATCH_SIZE,
                               padded_shapes=([-1], [-1]))
                 .prefetch(tf.data.experimental.AUTOTUNE))

val_dataset = (ds_val
               .map(tf_encode)
               .filter(filter_max_length)
               .padded_batch(BATCH_SIZE, 
                             padded_shapes=([-1], [-1])))
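
The steps in this pipeline can be sketched on a toy example without TensorFlow. The token ids, the small `MAX_LENGTH`, and the helper names `wrap` and `pad_batch` are assumptions made for illustration, and the sketch handles one language side only, whereas `filter_max_length` checks both.

```python
import numpy as np

PAD, BOS, EOS = 0, 2, 3
MAX_LENGTH = 6  # toy value; the notebook uses 50

def wrap(ids):
    # add the start and end markers around the token ids
    return [BOS] + list(ids) + [EOS]

def pad_batch(seqs):
    # right-pad every sequence with PAD up to the longest one in the batch
    width = max(len(s) for s in seqs)
    return np.array([s + [PAD] * (width - len(s)) for s in seqs])

raw = [[11, 12], [21, 22, 23, 24, 25], [31, 32, 33]]   # hypothetical token ids
wrapped = [wrap(s) for s in raw]
kept = [s for s in wrapped if len(s) <= MAX_LENGTH]     # too-long sentences are dropped, not truncated
batch = pad_batch(kept)
print(batch)
```

The second sentence has 7 ids after wrapping, so it is filtered out; the remaining two are padded to the same width.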

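Fetch the data. Because `BATCH_SIZE` is larger than the number of examples, each loop below runs once and yields the whole data set as a single batch.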

In [15]:
ds_train_np = tfds.as_numpy(train_dataset)
ds_val_np = tfds.as_numpy(val_dataset)

for fr_train, en_train in ds_train_np:
    print(en_train.shape, fr_train.shape)
    
for fr_val, en_val in ds_val_np:
    print(en_val.shape, fr_val.shape)

# one-step shift between decoder input and output
fr_train_input = fr_train[:, :-1]
fr_train_output = fr_train[:, 1:]
fr_val_input = fr_val[:, :-1]
fr_val_output = fr_val[:, 1:]
print(fr_train_input.shape, fr_train_output.shape, fr_val_input.shape, fr_val_output.shape)

(150391, 50) (150391, 50)
(2219, 50) (2219, 50)
(150391, 49) (150391, 49) (2219, 49) (2219, 49)
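
The one-step shift is what lets the decoder learn to predict the next token from the tokens before it. A minimal sketch with a hypothetical padded French sentence:

```python
import numpy as np

PAD, BOS, EOS = 0, 2, 3
fr = np.array([[BOS, 40, 41, EOS, PAD]])   # hypothetical padded target sentence

dec_input = fr[:, :-1]    # what the decoder is fed
dec_target = fr[:, 1:]    # what it should predict at each position
for seen, expected in zip(dec_input[0], dec_target[0]):
    print(seen, "->", expected)
```

Both arrays are one token shorter than the original, which matches the change from 50 to 49 columns in the shapes printed above.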


Create the attention masks for the training and validation data.

In [16]:
enc_input_mask, dec_input_mask, enc_dec_mask = create_masks(en_train, fr_train_input)
print(enc_input_mask.shape, dec_input_mask.shape, enc_dec_mask.shape)

(150391, 1, 50) (150391, 49, 49) (150391, 1, 50)


In [17]:
val_enc_input_mask, val_dec_input_mask, val_enc_dec_mask = create_masks(en_val, fr_val_input)
print(val_enc_input_mask.shape, val_dec_input_mask.shape, val_enc_dec_mask.shape)

(2219, 1, 50) (2219, 49, 49) (2219, 1, 50)


It's time to train our Transformer.

In [18]:
num_layers = 4
num_heads = 4

embedding_dim = 512
input_vocab_size = len(vocab_en)
target_vocab_size = len(vocab_fr)
max_positional_encoding_input = MAX_LENGTH
max_positional_encoding_target = MAX_LENGTH

fully_connected_dim = 512

def my_loss(y_true, y_pred):
    loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)
    mask = tf.cast(tf.math.logical_not(tf.math.equal(y_true, 0)), loss.dtype)
    loss *= mask
    return tf.reduce_mean(loss)

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        
    def __call__(self, step):
        # the optimizer may pass an integer step, but rsqrt needs a float
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(embedding_dim)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, 
                                     epsilon=1e-9)

model = build_model(num_layers, num_heads, embedding_dim, fully_connected_dim,
                    input_vocab_size, target_vocab_size,
                    max_positional_encoding_input, max_positional_encoding_target)
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 1, None)]    0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            [(None, None, None)] 0                                            
______________________________________________________________________________________________
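
The `CustomSchedule` class above raises the learning rate linearly during the warm-up steps and then lets it decay with the inverse square root of the step. A plain-Python sketch of the same formula (the function name `schedule` is hypothetical) makes the shape easy to inspect:

```python
def schedule(step, d_model=512, warmup_steps=4000):
    # same formula as CustomSchedule, written with plain Python floats
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, schedule(step))
```

The value peaks at `step == warmup_steps`, and quadrupling the step after that point halves the learning rate.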

In [19]:
model.compile(loss=my_loss,
              optimizer=optimizer)
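
Note that `my_loss` zeroes the loss at padded positions but then averages over all positions, so heavily padded batches report a smaller value. A common alternative, shown here as an assumption and not as what this notebook does, divides by the number of real tokens instead. The NumPy sketch below compares the two on hypothetical per-token losses (`masked_means` is a made-up helper name).

```python
import numpy as np

def masked_means(per_token_loss, y_true, pad_id=0):
    mask = (y_true != pad_id).astype(per_token_loss.dtype)
    masked = per_token_loss * mask
    mean_all = masked.mean()                 # what my_loss computes
    mean_real = masked.sum() / mask.sum()    # average over non-padding tokens only
    return mean_all, mean_real

y_true = np.array([[5, 9, 3, 0, 0, 0]])              # three real tokens, three pads
loss = np.array([[2.0, 1.0, 3.0, 7.0, 7.0, 7.0]])    # hypothetical per-token losses
mean_all, mean_real = masked_means(loss, y_true)
print(mean_all, mean_real)  # 1.0 2.0
```

Either choice trains the model; the second one only makes the reported loss independent of how much padding a batch contains.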

In [20]:
history = model.fit([en_train, fr_train_input, enc_input_mask, enc_dec_mask, dec_input_mask],
                    fr_train_output,
                    validation_data=([en_val,fr_val_input,val_enc_input_mask,val_enc_dec_mask,val_dec_input_mask],
                                     fr_val_output),
                    callbacks=[tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=False)],
                    batch_size=32, epochs=100) # batch_size must be <= BATCH_SIZE, and BATCH_SIZE % batch_size = 0

Epoch 1/100


  '"`sparse_categorical_crossentropy` received `from_logits=True`, but '


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100


Finally, we can implement a translation function. The one below is deliberately simple: at each step it greedily picks the most probable next token. You can enhance it with techniques you have learned, e.g. `beam search`.

In [21]:
def translate(en_sentence):
    en_sentence = [BOS] + list(np.squeeze(en_tokenizer.tokenize([en_sentence]).merge_dims(-2, -1).to_list(), axis=0)) + [EOS]
    encoder_input = tf.expand_dims(en_sentence, 0) # only 1 batch
    
    # decoder starts with <BOS>
    decoder_input = [BOS]
    output = tf.expand_dims(decoder_input, 0) # only 1 batch
    
    # predict fr output one-by-one
    for i in tf.range(MAX_LENGTH):
        enc_input_mask, dec_input_mask, enc_dec_mask = create_masks(encoder_input, output)
        
        # output shape (batch_size, seq_len, vocab_size)
        predictions = model.predict([encoder_input,
                                     output,
                                     enc_input_mask,
                                     enc_dec_mask,
                                     dec_input_mask])
        
        # only get last word
        predictions = predictions[: , -1:, :]
        
        # pick the most probable token (or sample one if you like)
        pred_idx = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)
        
        # reach <EOS>
        if tf.equal(pred_idx, EOS):
            break
        
        output = tf.concat([output, pred_idx], axis=-1)
        
    # TODO: beam search
    
    output = tf.squeeze(output, axis=0)
    output_words = fr_tokenizer.detokenize([[idx for idx in output if idx > EOS]]) # EOS is the last word in reserved_tokens
    return tf.strings.reduce_join(output_words, separator=' ', axis=-1).numpy()

Let's try it.

In [22]:
en_sentence = "Jane visits africa in september."
fr_sentence = translate(en_sentence)
print(fr_sentence)

[b'jane visite en afrique en septembre .']
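
The `beam search` enhancement mentioned earlier could be sketched as follows. This is a minimal pure-Python illustration, not a drop-in replacement for `translate`: `toy_next_log_probs` is a hypothetical stand-in for a model call that returns next-token log-probabilities, the token ids are made up, and no length normalization is applied.

```python
import math

BOS, EOS = 2, 3

def toy_next_log_probs(prefix):
    # hypothetical stand-in for the model: log-probabilities of the next token
    table = {
        (BOS,): {10: 0.6, 11: 0.4},
        (BOS, 10): {EOS: 0.5, 12: 0.5},
        (BOS, 11): {EOS: 0.9, 12: 0.1},
    }
    probs = table.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, beam_width=2, max_length=10):
    beams = [([BOS], 0.0)]            # (token ids, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        # extend every live hypothesis by every possible next token
        candidates = []
        for seq, score in beams:
            for tok, logp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + logp))
        # keep only the best beam_width extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

best_seq, best_score = beam_search(toy_next_log_probs)
print(best_seq, math.exp(best_score))
```

In this toy table, greedy decoding would take token 10 first (probability 0.6) and finish with a sequence probability of 0.3, while a beam of width 2 also keeps token 11 alive and finds the sequence `[2, 11, 3]` with probability of about 0.36.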
