# Lab 7: Transformers - Part 2

In this lab, you will use some of the components you wrote in the last lab to construct an entire transformer. Then you will train it on movie dialogues so that, given a line from a movie, it can respond with a line of its own.

In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
%pip install tensorflow-datasets==4.9.2
import tensorflow_datasets as tfds
%pip install pydot
import pydot
%conda install -c conda-forge pygraphviz


import os
import re
import numpy as np
import matplotlib.pyplot as plt

# Section 0: Overview of an Encoder-Decoder Transformer Architecture

Recall from the last lab that the Attention mechanism is the key component of a Transformer. However, using Attention to learn meaningful representations of sequences, and then to predict a sequence, requires adding Dense layers around the Attention layers; these do the bulk of the learning for the model. The original Transformer paper, [Attention Is All You Need](https://arxiv.org/abs/1706.03762), implemented an Encoder-Decoder Transformer that splits the model into an encoder, which encodes the input sequence into a meaningful, higher-dimensional representation, and a decoder, which autoregressively (one token at a time) generates an output sequence. OpenAI's GPT models, which are used in ChatGPT, do not use an encoder and are decoder-only, so the input is fed directly into the decoder. In this lab, we will implement an Encoder-Decoder Transformer. An example of an encoder-decoder transformer for translation is shown below.

![](./images/output_shift.png)

After the input sequence is encoded by the encoder to the tensor $z$ that has shape ```(batch_size, max_length, d_model)```, $z$ conditions the decoder by serving as the keys and values for some attention layers in the decoder. Recall that the decoder takes in a sequence and outputs the same sequence shifted to the left by 1, so that the last token in the output is the "next" token predicted by the model. The components you implemented in the last lab are given to you here.




In [None]:
def create_padding_mask(x):
  mask = tf.math.equal(x, 0)
  mask = tf.cast(mask, tf.float32)
  mask = mask[:, tf.newaxis, tf.newaxis, :]
  return mask

def create_look_ahead_mask(x):
  max_length = tf.shape(x)[1]
  look_ahead_mask = 1 - tf.linalg.band_part(tf.ones((max_length, max_length)), -1, 0)
  padding_mask = create_padding_mask(x)
  look_ahead_mask = tf.maximum(look_ahead_mask, padding_mask)
  return look_ahead_mask

def scaled_dot_product_attention(query, key, value, mask):
  matmul_qk = tf.matmul(query, key, transpose_b=True)
  d_query = tf.cast(tf.shape(key)[-1], tf.float32)
  logits = matmul_qk / tf.math.sqrt(d_query)

  if mask is not None:
    logits += (mask * -1e9)

  attention_weights = tf.nn.softmax(logits, axis=-1)
  output = tf.matmul(attention_weights, value)
  return output

class MultiHeadAttention(tf.keras.layers.Layer):

  def __init__(self, d_model, num_heads, name="multi_head_attention"):
    super(MultiHeadAttention, self).__init__(name=name)
    self.num_heads = num_heads
    self.d_model = d_model
    assert d_model % self.num_heads == 0

    self.query_dense = tf.keras.layers.Dense(units=d_model)
    self.key_dense = tf.keras.layers.Dense(units=d_model)
    self.value_dense = tf.keras.layers.Dense(units=d_model)

    self.dense = tf.keras.layers.Dense(units=d_model)

  def split_heads(self, inputs, batch_size):
    inputs = tf.reshape(inputs, shape=(batch_size, -1, self.num_heads, self.d_model//self.num_heads))
    inputs = tf.transpose(inputs, perm=[0, 2, 1, 3])

    return inputs

  def call(self, inputs):
    query, key, value, mask = inputs['query'], inputs['key'], inputs[
        'value'], inputs['mask']
    batch_size = tf.shape(query)[0]

    query = self.query_dense(query)
    key = self.key_dense(key)
    value = self.value_dense(value)

    query = self.split_heads(query, batch_size)
    key = self.split_heads(key, batch_size)
    value = self.split_heads(value, batch_size)

    scaled_attention = scaled_dot_product_attention(query, key, value, mask)
    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))

    outputs = self.dense(concat_attention)
    return outputs
  
class PositionalEncoding(tf.keras.layers.Layer):

  def __init__(self, max_position, d_model):
      super(PositionalEncoding, self).__init__()
      self.pos_encoding = self.positional_encoding(max_position, d_model)

  def get_angles(self, positions, inds, d_model):
      angles = 1 / tf.pow(10000, (2 * (inds // 2)) / tf.cast(d_model, tf.float32))
      return positions * angles

  def positional_encoding(self, max_position, d_model):
      positions = tf.range(max_position, dtype=tf.float32)[:, tf.newaxis]
      feature_inds = tf.range(d_model, dtype=tf.float32)[tf.newaxis, :]

      angle_rads = self.get_angles(
          positions=positions,
          inds=feature_inds,
          d_model=d_model)
      
      sines = tf.math.sin(angle_rads[:, 0::2])
      cosines = tf.math.cos(angle_rads[:, 1::2])

      pos_encoding = tf.concat([sines, cosines], axis=-1)
      pos_encoding = pos_encoding[tf.newaxis, ...]
      return tf.cast(pos_encoding, tf.float32)

  def call(self, inputs):
      return inputs + self.pos_encoding[:, :tf.shape(inputs)[1], :]

# Section 1: Cornell Movie-Dialogs Corpus

We will use the [Cornell Movie-Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), which contains over 200,000 conversational exchanges between more than 10,000 pairs of movie characters from over 600 movies.

### Section 1.1: Loading the Dataset
After loading roughly a quarter of the dataset, we clean the sentences by removing capitalization and any character that is not a letter or punctuation, and split the conversations into questions (preceding sentence) and answers (following sentence). 
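
To see what this cleaning does to a single line, here is a small standalone sketch. The sentence is made up, `clean` is a hypothetical name, and the whitespace step uses `\s+` as a simplification of the pattern in the lab's own `preprocess_sentence`, so treat it as an illustration rather than the exact function.

```python
import re

def clean(sentence):
    """Lowercase, space out punctuation, and drop other non-letter characters."""
    sentence = sentence.lower().strip()
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)  # "boy." -> "boy . "
    sentence = re.sub(r"\s+", " ", sentence)            # collapse repeated spaces
    sentence = re.sub(r"[^a-z?.!,]+", " ", sentence)    # keep letters and ? . ! ,
    return sentence.strip()

cleaned = clean("He's a boy.  Isn't he?")
print(cleaned)
```

Note that apostrophes are dropped, so contractions such as "he's" are split into two pieces.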

In [None]:
path_to_zip = tf.keras.utils.get_file(
    'cornell_movie_dialogs.zip',
    origin=
    'http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip',
    extract=True)

path_to_dataset = os.path.join(
    os.path.dirname(path_to_zip), "cornell movie-dialogs corpus")

path_to_movie_lines = os.path.join(path_to_dataset, 'movie_lines.txt')
path_to_movie_conversations = os.path.join(path_to_dataset,
                                           'movie_conversations.txt')

# Maximum number of samples to preprocess
MAX_SAMPLES = 50000

def preprocess_sentence(sentence):
  sentence = sentence.lower().strip()
  # creating a space between a word and the punctuation following it
  # eg: "he is a boy." => "he is a boy ."
  sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
  sentence = re.sub(r'[" "]+', " ", sentence)
  # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")
  sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
  sentence = sentence.strip()
  # the start and end tokens are added later, during tokenization
  return sentence


def load_conversations():
  # dictionary of line id to text
  id2line = {}
  with open(path_to_movie_lines, errors='ignore') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    id2line[parts[0]] = parts[4]

  inputs, outputs = [], []
  with open(path_to_movie_conversations, 'r') as file:
    lines = file.readlines()
  for line in lines:
    parts = line.replace('\n', '').split(' +++$+++ ')
    # get conversation in a list of line ID
    conversation = [line[1:-1] for line in parts[3][1:-1].split(', ')]
    for i in range(len(conversation) - 1):
      inputs.append(preprocess_sentence(id2line[conversation[i]]))
      outputs.append(preprocess_sentence(id2line[conversation[i + 1]]))
      if len(inputs) >= MAX_SAMPLES:
        return inputs, outputs
  return inputs, outputs


questions, answers = load_conversations()

In [None]:
for i in range(0, 1000, 200):
    print('Sample question: {}'.format(questions[i]))
    print('Sample answer: {}'.format(answers[i]))
    print('Sample question: {}'.format(questions[i+1]))
    print('Sample answer: {}'.format(answers[i+1]))

### Section 1.2: Tokenize the sentences

Recall from the previous lab that we need to convert sentences to integers so they can be handled by the model. There are many ways to do this. The simplest way is to split a sentence by spaces so "Tokenization is essential for text analysis." becomes ["Tokenization", "is", "essential", "for", "text", "analysis", "."] and each new word seen in the dataset is assigned a new integer. A more common and effective approach is subword tokenization, for example with Byte-Pair Encoding. This algorithm starts from a vocabulary of individual letters and iteratively merges the adjacent pair of symbols that occurs most frequently, building up subwords (parts of words). For example, after subword tokenization, "Tokenization is essential for text analysis." becomes ["Tok", "en", "ization", "is", "es", "sen", "tial", "for", "text", "an", "al", "ysis", "."]. Notice that more frequently used subwords like "is", "for", and "text" get their own token. Subword tokenization can greatly reduce the vocabulary size we need to represent the entire dataset.
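
The merge loop can be sketched in a few lines of plain Python. This is a toy illustration on a made-up four-word corpus, not what the tfds encoder does internally. The helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)  # ties go to the pair seen first

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up corpus: each word as a tuple of characters, with its count.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)
print(merges)
```

After three merges, the frequent ending "est" has become a single symbol, while rarer letter sequences are still split into characters.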

Here, we use ```tfds.deprecated.text.SubwordTextEncoder``` to do the subword tokenization. We also need to define special start and end tokens. The start token is used during inference, when the decoder needs some token to begin the autoregressive output. The end token tells us when to stop calling the transformer. We also need to pad our sentences so that they are all the same length. Recall that the masks we made in the last lab make the Transformer ignore these pad tokens.

In [None]:
tf.keras.utils.set_random_seed(1234)
# Create the tokenizer using tfds.deprecated.text.SubwordTextEncoder.build_from_corpus
# Concatenate the questions and answers for the corpus_generator (which is just an array)
# Set the target_vocab_size to 2^13
tokenizer = None

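# Define start and end token to indicate the start and end of a sentence
# Make START_TOKEN the one-element list [tokenizer.vocab_size]
# and END_TOKEN the one-element list [tokenizer.vocab_size + 1]
# (lists, so they can be concatenated with the output of tokenizer.encode)
START_TOKEN, END_TOKEN = None, None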

# Vocabulary size plus start and end token
VOCAB_SIZE = None

In [None]:
# Maximum sentence length
MAX_LENGTH = 40

# Tokenize, filter and pad sentences
def tokenize_and_filter(inputs, outputs):
  tokenized_inputs, tokenized_outputs = [], []

  for (sentence1, sentence2) in zip(inputs, outputs):
    # tokenize sentence by using tokenizer.encode
    # ensure to add the start_token to the beginning of the sentence and the
    # end_token to the end
    sentence1 = None
    sentence2 = None
    # Append the sentences to the tokenized inputs and outputs if their
    # lengths are less than or equal to max_length
    pass

  # pad the tokenized inputs and outputs using tf.keras.preprocessing.sequence.pad_sequences
  # make sure to use 'post' padding so that 0s are added to the end of the sequence
  tokenized_inputs = None
  tokenized_outputs = None

  return tokenized_inputs, tokenized_outputs


questions, answers = tokenize_and_filter(questions, answers)

In [None]:
# Check that tokenization was done correctly 
question_tokens_solution = [8331, 38, 18, 115, 32, 3065, 19, 981, 8195, 2957, 8107, 2381, 3600, 2384, 13, 7541, 944, 6632, 8107, 46, 466, 85, 5560, 7950, 227, 3091, 3944, 131, 1460, 752, 77, 41, 6, 2117, 8175, 2, 237, 1, 8332, 0]
answer_tokens_solution = [8331, 72, 3, 4, 180, 18, 56, 365, 40, 1086, 1692, 4214, 825, 3, 53, 15, 8, 1116, 40, 29, 1, 8332, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(questions[0])
print(answers[0])
assert((questions[0] == question_tokens_solution).all())
assert((answers[0] == answer_tokens_solution).all())

### Section 1.3: Make TensorFlow dataset

For our dataset, the inputs are both the encoder inputs and the decoder inputs, and the outputs are the sequence that the decoder should produce. Remember that the whole point of the look-ahead mask from the last lab was to let us train the model to predict the input sequence shifted to the left by 1, with the mask ensuring that each output token depends only on attention over earlier tokens in the sequence. So our decoder outputs will be our decoder inputs shifted to the left by 1.
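
The shift is easiest to see on a single toy sequence. The token ids below are made up (101 for start, 102 for end, 0 for padding), and plain Python lists stand in for tensors.

```python
# Made-up token ids: 101 = start, 102 = end, 0 = padding.
answer = [101, 7, 8, 9, 102, 0, 0, 0]

decoder_input = answer[:-1]   # everything except the last position
decoder_target = answer[1:]   # everything except the first position

# At each position the decoder sees `seen` and must predict `target`.
for step, (seen, target) in enumerate(zip(decoder_input, decoder_target)):
    print(f"position {step}: input {seen:>3} -> target {target:>3}")
```

Both sequences are one token shorter than the original, which is why the decoder works with sequences of length `MAX_LENGTH - 1`.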

In [None]:
BATCH_SIZE = 64
BUFFER_SIZE = 20000


dataset = tf.data.Dataset.from_tensor_slices((
    {
        # The encoder inputs which are the questions
        'inputs': None,
        # The decoder inputs which are the first 39 tokens of answers
        'dec_inputs': None
    },
    {
        # the decoder outputs which are the last 39 tokens of answers
        'outputs': None
    },
))

dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE, seed=1234)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# Check that dataset is correct
first_batch = next(iter(dataset))
encoder_inputs = first_batch[0]['inputs'] 
decoder_inputs = first_batch[0]['dec_inputs']
decoder_outputs = first_batch[1]['outputs']
first_encoder_inputs_solution = tf.constant([8331, 36, 224, 44, 90, 5, 85, 957, 4661, 8107, 88, 273, 7, 8332, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
first_decoder_inputs_solution = tf.constant([8331, 222, 142, 51, 69, 2, 78, 1, 8332, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
first_decoder_outputs_solution = tf.constant([222, 142, 51, 69, 2, 78, 1, 8332, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
assert((encoder_inputs[0] == first_encoder_inputs_solution).numpy().all())
assert((decoder_inputs[0] == first_decoder_inputs_solution).numpy().all())
assert((decoder_outputs[0] == first_decoder_outputs_solution).numpy().all())

# Section 2: Encoder

The goal of the encoder is to represent the input sequence as a sequence of embeddings that capture important information that the decoder will use to make a prediction. 

### Section 2.1: Encoder Layer

The encoder consists of blocks of layers that are repeated multiple times. We will call each of these blocks an encoder layer. Below is a figure that summarizes what is contained in the encoder layer:

![](./images/encoder.png)

Implement ```encoder_layer``` using the comments. For the layers, you can define and apply a layer in one line. For example, ```out = tf.keras.layers.Dense(units=64, activation="relu")(x)```

When you are done, the computational graph of ```encoder_layer``` should look like the following: 

<img src="./images/encoder_layer_graph.png" width="1200"/>
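
The "add and normalize" step that follows each sublayer in the figure can be sketched numerically for a single token. This is a simplified illustration in plain Python with made-up numbers. It omits the learnable scale and offset that ```tf.keras.layers.LayerNormalization``` also applies, and the function names are hypothetical.

```python
import math

def layer_norm(vector, epsilon=1e-6):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + epsilon) for v in vector]

def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization."""
    summed = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    return layer_norm(summed)

x = [1.0, 2.0, 3.0, 4.0]           # made-up token embedding (d_model = 4)
attn_out = [0.5, -0.5, 0.5, -0.5]  # made-up attention output for that token
normalized = add_and_norm(x, attn_out)
print([round(v, 3) for v in normalized])
```

The normalization runs over the feature (d_model) dimension of each token independently, so it does not depend on the batch size or the sequence length.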

### Section 2.2: Encoder

The encoder itself consists of multiple encoder layers, called one after another.

Implement ```encoder``` using the comments. The graph should look like this: 

<img src="./images/encoder_graph.png" width="1200"/>


In [None]:
def encoder_layer(units, d_model, num_heads, dropout, name="encoder_layer"):
  # the input of the encoder layer takes shape (batch_size, max_length, d_model)
  # remember that input shapes ignore the batch dimension
  # we don't need to specify max_length in the input shape because the network
  # parameters don't depend on the length of the input anywhere
  # thus, the sequence length dimension is None
  # use tf.keras.Input and specify the name argument as "inputs"
  inputs = None

  # recall that the shape of padding_mask is (batch_size, 1, 1, max_length)
  # since it needs to be broadcast over the num_heads and query-position
  # dimensions of the attention logits
  # but again the sequence length dimension is None
  # use tf.keras.Input and specify the name argument as "padding_mask"
  padding_mask = None

  # MultiHeadAttention with d_model, num_heads, and a name argument of "attention"
  # give it the corresponding inputs, reference the class to see how the inputs are received
  attention = None
  
  # add a dropout layer at rate "dropout" and call it on attention
  # we use dropout on the attention output to ensure that some attention
  # weights don't dominate the others, similar to why we used dropout in neural
  # networks
  # use tf.keras.layers.Dropout
  attention = None

  # adding the inputs back to the attention output is a residual connection
  # residual connections help mitigate the vanishing gradient problem
  # the vanishing gradient problem is when gradients that are backpropagated
  # get very small (close to 0) in very deep networks
  # you can think of residual connections as helping the model to not forget things
  attention = tf.keras.layers.Lambda(lambda x: tf.math.add(x[0], x[1]), name='add_inputs')([attention, inputs])

  # Layer normalization normalizes across the feature (d_model) dimension
  # this ensures that there are no issues as the feature distribution changes
  # over the course of training which stabilizes and accelerates learning
  # use tf.keras.layers.LayerNormalization with epsilon = 1e-6
  # call it on attention
  attention = None

  # apply a dense layer to attention with "units" units and relu activation  
  outputs = None
  
  # apply a dense layer to outputs with "d_model" units (and no activation)
  outputs = None

  # apply dropout to outputs with rate "dropout"
  outputs = None

  # another residual connection
  outputs = tf.keras.layers.Lambda(lambda x: tf.math.add(x[0], x[1]), name='add_attention')([attention, outputs])

  # apply layer normalization with epsilon 1e-6 to outputs
  # this normalizes the result of the second residual connection
  outputs = None

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

In [None]:
sample_encoder_layer = encoder_layer(
    units=512,
    d_model=128,
    num_heads=4,
    dropout=0.3,
    name="sample_encoder_layer")

tf.keras.utils.plot_model(
    sample_encoder_layer, to_file="./encoder_layer.png", show_shapes=True)

In [None]:
def encoder(vocab_size, num_layers, units, d_model, num_heads, dropout, name="encoder"):
  # The input to the encoder is an array of tokens (integers) with shape
  # (batch_size, max_length). The batch dimension is ignored in the input shape
  # and the sequence dimension is variable, so the input shape is just (None,)
  # use tf.keras.Input and give it the name argument "inputs"
  inputs = None

  # The padding mask is the same shape as in the encoder_layer
  # give it the name argument padding_mask
  padding_mask = None

  # Recall that we convert the integer tokens to higher dimensional embedding
  # vectors that better represent the words
  # The embedding per token is learned.
  # Use tf.keras.layers.Embedding with input_dim = vocab_size and
  # output_dim = d_model. Apply it to the inputs
  embeddings = None

  # multiply embeddings by sqrt(d_model) so that the scale is consistent
  # with attention layers later on
  # you will need to cast d_model as a tf.float32 before taking the sqrt
  embeddings *= None

  # apply the PositionalEncoding that you defined earlier to embeddings
  # use vocab_size for max_position and d_model for d_model
  embeddings = None

  # apply dropout with rate "dropout" to embeddings
  # this regularizes the embeddings, so that the embedding transformation
  # doesn't rely too heavily on certain elements
  outputs = None

  # call encoder_layer num_layers times give it units, d_model, num_heads, dropout
  # and each encoder_layer should take a name argument of "encoder_layer_i" for
  # the iteration number i. The inputs to the encoder_layer should be outputs
  # and remember to give it the padding mask as well.
  for i in range(num_layers):
    outputs = None

  return tf.keras.Model(
      inputs=[inputs, padding_mask], outputs=outputs, name=name)

In [None]:
sample_encoder = encoder(
    vocab_size=8192,
    num_layers=3,
    units=512,
    d_model=128,
    num_heads=4,
    dropout=0.3,
    name="sample_encoder")

tf.keras.utils.plot_model(
   sample_encoder, to_file='encoder.png', show_shapes=True)

# Section 3: Decoder

The goal of the decoder is to output the input sequence shifted to the left by 1, ending with one new, predicted token. Here's a figure of it:

<img src="./images/decoder.png" width=500 />

### Section 3.1: Decoder Layer

The decoder should also take keys and values from the encoder output and use them in a Cross-Attention layer. Cross-Attention just means that the keys and values come from a different tensor than the queries. Remember that during training, we do not want the $i$th output of the decoder to attend to any tokens later in the sequence: those tokens are present in the decoder input during training, but they will not be available at inference time. We enforce this with the look-ahead mask that you defined earlier.
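
As a reminder of what the look-ahead mask looks like, here is a plain-Python sketch for a sequence of length 4. It follows the same convention as ```create_look_ahead_mask``` (1 means "masked out") but ignores padding. The function below is a hypothetical stand-in, not the lab's implementation.

```python
def look_ahead_mask(length):
    """1 marks a position the query must NOT attend to, 0 marks an allowed one."""
    return [[1 if key > query else 0 for key in range(length)]
            for query in range(length)]

mask = look_ahead_mask(4)
for row in mask:
    print(row)
```

Row $i$ corresponds to the $i$th query: it may attend to itself and to earlier positions, while every later position is masked.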

Implement ```decoder_layer```. The model graph should look like:

<img src="./images/decoder_layer_graph.png" width=1000 />

### Section 3.2: Decoder

The decoder is a stack of decoder layers. Again we convert tokens to embeddings and add positional embeddings.

Implement ```decoder```. The model graph should look like:

<img src="./images/decoder_graph.png" width=1000 />



In [None]:
def decoder_layer(units, d_model, num_heads, dropout, name="decoder_layer"):
  
  # same as the inputs to the encoder layer
  inputs = None

  # the encoder outputs are also the shape (batch_size, max_length, d_model)
  # we want to take them in as input so use tf.keras.Input with the same shape
  # as inputs, and give the name argument "encoder_outputs"
  enc_outputs = None

  # the look_ahead_mask is shape (batch_size, 1, max_length, max_length) and
  # it will be broadcasted over the num_heads dimension. batch_size is ignored,
  # and since max_length is variable it is None in the input shape.
  # give it the name "look_ahead_mask"
  look_ahead_mask = None
  
  # define the padding mask as we did in encoder_layer
  padding_mask = None

  # perform Self-MultiHeadAttention with name attention_1 and the look_ahead_mask
  attention1 = None
  
  # residual connection between attention1 and inputs
  attention1 = tf.keras.layers.Lambda(lambda x: tf.math.add(x[0], x[1]), name='add_inputs')([attention1, inputs])

  # layer normalization with epsilon 1e-6 applied to attention1
  attention1 = None

  # Cross attention using enc_outputs for the keys and values
  # still use MultiHeadAttention, but adjust the inputs
  # use padding_mask here because that was what was used for the encoder
  # give it name attention_2
  attention2 = None
  
  # apply dropout with rate dropout to attention2
  attention2 = None
  
  # residual connection between attention2 and attention1
  attention2 = tf.keras.layers.Lambda(lambda x: tf.math.add(x[0], x[1]), name='add_attention1')([attention2, attention1])

  # layer normalization with epsilon 1e-6 applied to attention2
  attention2 = None

  # Dense layer of "units" units and relu activation applied to attention2
  outputs = None
  
  # Dense layer of "d_model" units applied to outputs (no activation)
  outputs = None

  # Dropout layer of rate "dropout" applied to outputs
  outputs = None
  
  # residual connection between outputs and attention2
  outputs = tf.keras.layers.Lambda(lambda x: tf.math.add(x[0], x[1]), name='add_attention2')([outputs, attention2])

  # layer norm with epsilon 1e-6 applied to outputs
  outputs = None

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

In [None]:
sample_decoder_layer = decoder_layer(
    units=512,
    d_model=128,
    num_heads=4,
    dropout=0.3,
    name="sample_decoder_layer")

tf.keras.utils.plot_model(
    sample_decoder_layer, to_file='decoder_layer.png', show_shapes=True)

In [None]:
def decoder(vocab_size, num_layers, units, d_model, num_heads, dropout, name='decoder'):
  
  # same input shape as encoder since input is a list of tokens
  inputs = None

  # same input shape for enc_outputs in decoder_layer
  enc_outputs = None

  # same input shape as look_ahead_mask in decoder_layer
  look_ahead_mask = None
  
  # same input shape as padding_mask in decoder_layer
  padding_mask = None

  # same embeddings architecture from encoder
  embeddings = None
  embeddings *= None
  embeddings = None

  # apply dropout of rate "dropout" to embeddings
  outputs = None

  # call decoder_layer num_layers times with name "decoder_layer_{i}"
  # decoder_layer takes 4 inputs, make sure you pass in all of them
  for i in range(num_layers):
    outputs = None

  return tf.keras.Model(
      inputs=[inputs, enc_outputs, look_ahead_mask, padding_mask],
      outputs=outputs,
      name=name)

In [None]:
sample_decoder = decoder(
    vocab_size=8192,
    num_layers=3,
    units=512,
    d_model=128,
    num_heads=4,
    dropout=0.3,
    name="sample_decoder")

tf.keras.utils.plot_model(
    sample_decoder, to_file='decoder.png', show_shapes=True)

# Section 4: Transformer

Now to put everything together in the transformer. 

First we give the encoder the "question". Then we give the decoder the decoder inputs (during inference, this is the answer generated so far) and the encoder outputs. The output of the decoder is a sequence of embeddings with shape ```(batch_size, max_length, d_model)```, but we need scores over the vocabulary to convert the output to words, so we apply a dense layer that transforms the output to shape ```(batch_size, max_length, vocab_size)```.
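
To make the last step concrete, the sketch below turns a made-up set of logits for one sequence (3 positions, 5-token vocabulary) into token ids by taking the highest-scoring entry at each position. The numbers and the `argmax` helper are illustrative assumptions.

```python
# Made-up logits for one sequence of 3 positions over a 5-token vocabulary.
logits = [
    [0.1, 2.3, -1.0, 0.4, 0.0],
    [1.5, 0.2, 0.3, 3.1, -0.7],
    [-0.2, 0.0, 4.0, 0.1, 0.9],
]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

predicted_ids = [argmax(position) for position in logits]
next_token = predicted_ids[-1]  # during inference only the last position is new
print(predicted_ids, next_token)
```

During inference, only the prediction at the last position is appended to the answer so far. The earlier positions simply repeat tokens that were already generated.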

Implement ```transformer```. The model graph should look like:

<img src="./images/transformer_graph.png" width=1000 />



In [None]:
def transformer(vocab_size, num_layers, units, d_model, num_heads, dropout, name="transformer"):
  
  # inputs are a sequence of tokens
  inputs = None

  # decoder inputs which are also a sequence of tokens
  # give it the name "dec_inputs"
  dec_inputs = None

  # the padding mask used for the encoder
  # use tf.keras.layers.Lambda that calls a function given some inputs
  # use create_padding_mask for the function, and give it the output_shape
  # that is the input_shape for it in the encoder
  # apply the lambda layer to inputs and give it the name "enc_padding_mask"
  enc_padding_mask = None
  
  # define a Lambda layer with create_look_ahead_mask and the appropriate
  # output_shape and apply it to dec_inputs with name look_ahead_mask
  look_ahead_mask = None
  
  # same as the enc_padding_mask but with name "dec_padding_mask", still applied
  # to inputs
  dec_padding_mask = None
  
  # call the encoder with all the necessary arguments 
  enc_outputs = None

  # call the decoder with all the necessary arguments 
  dec_outputs = None
 
  # apply a dense layer to transform the embeddings to logits that will be
  # used to classify which word in the vocabulary is most likely for this token
  # it has "vocab_size" units and name "outputs", and is applied to dec_outputs
  # no activation
  outputs = None

  return tf.keras.Model(inputs=[inputs, dec_inputs], outputs=outputs, name=name)

In [None]:
sample_transformer = transformer(
    vocab_size=8192,
    num_layers=4,
    units=512,
    d_model=128,
    num_heads=4,
    dropout=0.3,
    name="sample_transformer")

tf.keras.utils.plot_model(
    sample_transformer, to_file='transformer.png', show_shapes=True)

# Section 5: Training the model

Although we just need to use ```model.fit``` to train the transformer given our dataset, we need to specify the loss function and a custom learning rate as used in the paper. 

### Section 5.1: Loss function

The output of our transformer has shape ```(batch_size, max_length, vocab_size)```, which is a batch of sequences of vectors that tell us which token in the vocabulary is most likely to be at each position in the sequence. Thus, think about this as a classification task. As we've done before, we use categorical cross-entropy loss for classification, which compares the true label (conceptually a one-hot encoding) to the distribution over the vocabulary given by the model. Implement ```loss_function```.
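
The idea of a cross-entropy loss that ignores padding can be sketched in plain Python for a single short sequence. The logits and labels are made up, label 0 is assumed to be the pad token, and the function names are hypothetical.

```python
import math

def sparse_cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def masked_loss(sequence_logits, labels, pad_id=0):
    """Per-position losses with padded positions zeroed out, then averaged."""
    losses = [
        sparse_cross_entropy(logits, label) if label != pad_id else 0.0
        for logits, label in zip(sequence_logits, labels)
    ]
    return sum(losses) / len(losses)

sequence_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
labels = [1, 2, 0]  # the last position is padding
loss = masked_loss(sequence_logits, labels)
print(round(loss, 4))
```

The average here is taken over all positions, including the padded ones that contribute zero. Dividing by the number of non-pad positions instead is another reasonable design choice.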

### Section 5.2: Custom Learning Rate Schedule

There's no reason why the learning rate needs to stay the same over the entire course of training. In the "Attention is All You Need" paper, they used a custom learning rate schedule. It follows the formula: 
$$\text{learning{\_}rate}(t, \text{warmup{\_}steps}) = d_{model}^{-0.5} \min(t^{-0.5}, t(\text{warmup{\_}steps}^{-1.5}))$$

Essentially, there is a warmup period during which the learning rate starts low and increases linearly. This prevents the model from losing stability while it learns basic patterns early on. Then, after the warmup, the learning rate gradually decreases in proportion to the inverse square root of the step number, so that later updates are smaller. Implement ```CustomSchedule``` and verify that the schedule graph matches:

<img src="./images/schedule.png" width=400 />
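
The schedule formula can be evaluated directly in plain Python to check where the peak falls. This sketch uses ordinary floats rather than TensorFlow ops and assumes `step >= 1`. The default values mirror the ones used elsewhere in this lab.

```python
def learning_rate(step, d_model=128, warmup_steps=4000):
    """Warmup-then-decay schedule; assumes step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 64000):
    print(f"step {step:>6}: {learning_rate(step):.6f}")
```

The two arguments of the `min` are equal at `step == warmup_steps`, which is where the learning rate peaks.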

### Section 5.3: Train the model

The accuracy should increase steadily. My final accuracy after 20 epochs was 17%. It took me roughly one and a half hours to train for 20 epochs. If you have more time, you should train for longer.

In [None]:
def loss_function(y_true, y_pred):
  # y_true should have shape (batch_size, MAX_LENGTH-1) because each
  # sample is a sequence of tokens.
  # reshape it to make that shape explicit, (batch_size, MAX_LENGTH-1), but
  # infer batch_size using the -1 trick for "filling in" the shape.
  y_true = None

  # use tf.keras.losses.SparseCategoricalCrossentropy
  # the reason we are using the sparse version is that y_true holds integer
  # token ids rather than one-hot vectors of length vocab_size.
  # specify from_logits to be true since we haven't taken a softmax
  # set reduction to 'none' and apply it to y_true and y_pred
  loss = None

  # padding mask for the loss because we don't want to consider the
  # loss of the pad tokens 
  mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
  loss = tf.multiply(loss, mask)

  return tf.reduce_mean(loss)

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):

  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()
    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)
    self.warmup_steps = warmup_steps

  def __call__(self, step):
    # compute the first argument of the min function
    arg1 = None

    # compute the second argument of the min function
    arg2 = None

    # return the learning rate for this step using the formula
    return None

In [None]:
sample_learning_rate = CustomSchedule(d_model=128)

plt.plot(sample_learning_rate(tf.range(200000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")

In [None]:
tf.keras.backend.clear_session()

# Hyper-parameters
NUM_LAYERS = 2
D_MODEL = 256
NUM_HEADS = 8
UNITS = 512
DROPOUT = 0.1

model = transformer(
    vocab_size=VOCAB_SIZE,
    num_layers=NUM_LAYERS,
    units=UNITS,
    d_model=D_MODEL,
    num_heads=NUM_HEADS,
    dropout=DROPOUT)

learning_rate = CustomSchedule(D_MODEL)

optimizer = tf.keras.optimizers.Adam(
    learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

def accuracy(y_true, y_pred):
  # ensure labels have shape (batch_size, MAX_LENGTH - 1)
  y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH - 1))
  return tf.keras.metrics.sparse_categorical_accuracy(y_true, y_pred)

model.compile(optimizer=optimizer, loss=loss_function, metrics=[accuracy])

In [None]:
EPOCHS = 20

model.fit(dataset, epochs=EPOCHS)

In [None]:
model.save_weights('./model_weights.h5')

# Section 6: Inference

Try out the model!


In [None]:
loaded_model = transformer(
    vocab_size=VOCAB_SIZE,
    num_layers=NUM_LAYERS,
    units=UNITS,
    d_model=D_MODEL,
    num_heads=NUM_HEADS,
    dropout=DROPOUT)
loaded_model.load_weights('./model_weights.h5')

In [None]:
def evaluate(sentence, model):
  sentence = preprocess_sentence(sentence)

  sentence = tf.expand_dims(
      START_TOKEN + tokenizer.encode(sentence) + END_TOKEN, axis=0)

  output = tf.expand_dims(START_TOKEN, 0)

  for i in range(MAX_LENGTH):
    predictions = model(inputs=[sentence, output], training=False)

    # select the last word from the seq_len dimension
    predictions = predictions[:, -1:, :]
    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # stop generating if the predicted_id is equal to the end token
    if tf.equal(predicted_id, END_TOKEN[0]):
      break

    # concatenate the predicted_id to the output, which is given to the decoder
    # as its input on the next iteration.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0)


def predict(sentence, model):
  prediction = evaluate(sentence, model)

  predicted_sentence = tokenizer.decode(
      [i for i in prediction if i < tokenizer.vocab_size])

  print('Input: {}'.format(sentence))
  print('Output: {}'.format(predicted_sentence))

  return predicted_sentence

In [None]:
output = predict('What\'s your favorite color?', loaded_model)


In [None]:
sentence = 'The name\'s bond, James Bond'
for _ in range(3):
  sentence = predict(sentence, loaded_model)
  print('')